Scolopaci

SACC proposal

Gibson & Baker (in press). Multiple gene sequences resolve phylogenetic relationships in the shorebird suborder Scolopaci (Aves: Charadriiformes). Mol Phylogenet Evol. [abstract]
Banks 2012. Classification and Nomenclature of the Sandpipers (Aves: Arenariinae). Zootaxa 3513: 86–88. [preview]
AOU-SACC Proposal #555 (Remsen, Oct 2012): Reclassification of the Scolopacidae.
The above does not deal with the problems of generic limits in the Calididrinae, e.g., sister relationship of Aphriza to C. canutus; see Banks (2012). Dick Banks is submitting a proposal to NACC on this, and I suggest we do not meddle until dust settled.
 
(I have significant problems with the data in Gibson & Baker 2012.)
Trying to crawl my way through this... ;) (Sorry: quite long!) Gibson (2010) and Gibson & Baker (2012) offered only a purely Bayesian analysis, which is not something I trust a lot, so I was interested to see how the results would look if analysed with Maximum Likelihood.

The analysis was based on 5 genes: 12S-rRNA (12s, mitochondrial), cytochrome oxidase subunit-1 (cox1/coi/barcode, mitochondrial), cytochrome b (cytb, mitochondrial), NADH dehydrogenase subunit-2 (nd2, mitochondrial), and Recombination Activating Gene 1 (rag1, nuclear, exon). Reconstructing exactly their dataset is not really possible, however, because (1) their cox1 sequences were taken from a "concurrent DNA barcoding study in [thei]r lab" (presumably Rebecca Elbourne's MSc thesis), and (2) for the other genes, they clearly took many of their sequences from earlier datasets; in both cases, they did not detail which sequences they used. Sequences specifically associated to this work appear in GenBank under accession numbers JQ962980-JQ963056: although these were presumably all used in the analysis, they undoubtedly only represent a small subset of the analysed matrix (77 sequences in total; the full data matrix has 86 taxa * 5 genes = 430 cells).
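
To make the arithmetic above easy to re-check, here is a minimal Python sketch that expands the published accession range and compares its size with the full taxon-by-gene matrix. The helper name `expand_accession_range` is hypothetical (not from any library), and the sketch assumes accessions are a letter prefix followed by digits.

```python
import re

def expand_accession_range(first: str, last: str) -> list[str]:
    """Expand a range such as JQ962980-JQ963056 into individual accessions."""
    m1 = re.fullmatch(r"([A-Z]+)(\d+)", first)
    m2 = re.fullmatch(r"([A-Z]+)(\d+)", last)
    if not m1 or not m2 or m1.group(1) != m2.group(1):
        raise ValueError("accessions must share the same letter prefix")
    prefix, width = m1.group(1), len(m1.group(2))
    start, end = int(m1.group(2)), int(m2.group(2))
    # Keep the original zero-padded width of the numeric part.
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]

accessions = expand_accession_range("JQ962980", "JQ963056")
n_taxa, n_genes = 86, 5          # figures quoted in the post
cells = n_taxa * n_genes

print(len(accessions))                     # 77
print(cells)                               # 430
print(round(len(accessions) / cells, 3))   # 0.179
```

So the sequences deposited with the paper could fill, at most, roughly 18% of the matrix.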

Cox1 sequences, for all the taxa in the tree except Hydrophasianus chirurgus and Microparra capensis (both also lacking in Rebecca Elbourne's thesis), can easily be retrieved from BOLD. Except in a few rare cases, several sequences are available for each taxon, and their congruence can readily be checked. These data look entirely problem-free to me. As to the other 4 genes and included taxa, a list of what can be retrieved from GenBank appears in the attached .txt file (mostly interesting to build your own idea about possible gaps in the sampling).

I presumed that, where sequences are not available either in GenBank or in BOLD, the gene must have been coded as missing data for the given taxon in the analysis (i.e., that no part of the data was withheld from the public). On the other hand, it is certain that a part of what can be accessed in GenBank, in particular sequences published by entirely different research groups, was not used in the analysis, even though it existed.

================​

I first started with some exploratory analysis of the non-cox1 data, building trees for each gene (in some cases for specific parts of the genes), trying to compare sequences derived from the same taxon, looking at alignments, etc. The data, unfortunately (and as I already wrote above), do not appear completely problem-free. What follows is a list of issues involving sequences that were possibly/presumably included in the analysis.
  • - Phalaropus tricolor, cytb: AY894240 (voucher: RCA87-192, 870 bp, Pereira & Baker 2005)
    - Gallinago gallinago, cytb: FJ603652-54 (voucher: multiple, 702 bp, Baker et al. 2009)
    - Gallinago gallinago, cytb: FJ787309-10 (voucher: 5984/5312, 943 bp, Hering & Päckert 2010)
    - Gallinago gallinago, cytb: AF194445-6 (voucher: /, 277 bp, Austin unpubl.)
    - Gallinago delicata, cytb: FJ603651 (voucher: 1B-2986, 702 bp, Baker et al. 2009)

    These are all near-identical; in trees, they appear embedded in Gallinago.
    Conclusion: AY894240 (Pereira & Baker 2005) is presumably misidentified, and is G. gallinago/delicata.
  • - Limnodromus scolopaceus, 12s: EF373090 (voucher: MKP-1523, 552 bp, Baker et al. 2007)
    - Limnodromus scolopaceus, 12s: AF285806 (voucher: /, 462 bp, Spellman & Winker 2001)
    - Phalaropus tricolor, 12s: DQ674581 (voucher: /, 1042 bp, Fain & Houde 2007)

    EF373090 is near-identical to DQ674581, but highly divergent from AF285806; in trees, EF373090 and DQ674581 cluster with Phalaropus tricolor AY894155, then with tringines, not with scolopacines.
    Conclusion: EF373090 (Baker et al. 2007) is presumably misidentified, and is Phalaropus tricolor.
  • - Limnodromus scolopaceus, cytb: EF373140 (voucher: MKP-1523, 925 bp, Baker et al. 2007)
    - Limnodromus scolopaceus, cytb: AF285819 (voucher: /, 1019 bp, Spellman & Winker 2001)

    EF373140 is highly divergent from AF285819; in trees, it appears as the sister group of (Phalaropus fulicarius JQ963055 + Phalaropus lobatus AY894239), and associated to tringines, not to scolopacines; for AF285819, see below.
    Conclusion: EF373140 (Baker et al. 2007) is problematic/misidentified; considering that Baker et al.'s 12s "Limnodromus scolopaceus" appears to be a misidentified Phalaropus tricolor, it seems reasonable to me to assume that the same applies to their cytb.
  • - Gallinago gallinago, 12s: EF373082 (voucher: MKP-1590, 554 bp, Baker et al. 2007)
    - Gallinago gallinago, 12s: DQ674576 (voucher: /, 1044 bp, Fain & Houde 2007)
    - Gallinago gallinago, 12s: FJ603664-66 (voucher: multiple, 675 bp, Baker et al. 2009)
    - Limnodromus scolopaceus, 12s: AF285806 (voucher: /, 462 bp, Spellman & Winker 2001)

    EF373082 is near-identical to AF285806, but highly divergent from DQ674576 and FJ603664-66 (which are all near-identical to one another); in trees, EF373082 and AF285806 appear as the sister group of (Limnodromus griseus JQ962988 + Limnodromus sp. DQ674578).
    Conclusion: EF373082 (Baker et al. 2007) is presumably misidentified, and is Limnodromus scolopaceus.
  • - Gallinago gallinago, cytb: EF373132 (voucher: MKP-1590, 943 bp, Baker et al. 2007)
    - Gallinago gallinago, cytb: FJ603652-54 (voucher: multiple, 702 bp, Baker et al. 2009)
    - Gallinago gallinago, cytb: FJ787309-10 (voucher: 5984/5312, 943 bp, Hering & Päckert 2010)
    - Gallinago gallinago, cytb: AF194445-6 (voucher: /, 277 bp, Austin unpubl.)
    - Gallinago delicata, cytb: FJ603651 (voucher: 1B-2986, 702 bp, Baker et al. 2009)
    - Limnodromus scolopaceus, cytb: AF285819 (voucher: /, 1019 bp, Spellman & Winker 2001)

    EF373132 is highly divergent from FJ603652-54, FJ787309-10, AF194445-6, and FJ603651 (which are all near-identical to one another); in trees, it clusters first with AF285819, then with Limnodromus griseus JQ963049. However, EF373132 and AF285819 are not near-identical; they start the same, remain so over the first 345 bp of EF373132, then start diverging more and more until the end of the sequence (suggesting a sequencing problem in one of them); the overall distance between them is about 5%; in trees, this divergence is reconstructed as fully autapomorphic for AF285819; looking at the distances between the two sequences and their nearest BLAST matches also shows that AF285819 is globally more distant from all closely-related shorebird sequences than EF373132.
    Conclusion: This one is a tough call. EF373132 is clearly not G. gallinago, and is at least mostly L. scolopaceus; as Baker et al.'s 12s "G. gallinago" appears to be, in its entirety, a misidentified L. scolopaceus, it seems reasonable to consider that the same could apply to their cytb. But this doesn't solve everything: besides this consideration, either EF373132 or AF285819 still appears to be incorrect due to a sequencing problem; the apparent autapomorphic character of the divergence of AF285819 leads me to think that a sequencing problem occurred there. In other words: I think the most likely explanation for what I see is that (1) EF373132 (Baker et al. 2007) is misidentified, and is Limnodromus scolopaceus (but I see no strong suggestion that it suffers from other problems); and (2) AF285819 (Spellman & Winker 2001), although correctly identified, suffers from a sequencing problem.
  • - Gallinago delicata, cytb: JQ963043 (voucher: JGS-1783, 852 bp, Gibson & Baker 2012)
    - Gallinago gallinago, cytb: FJ603652-54 (voucher: multiple, 702 bp, Baker et al. 2009)
    - Gallinago gallinago, cytb: FJ787309-10 (voucher: 5984/5312, 943 bp, Hering & Päckert 2010)
    - Gallinago gallinago, cytb: AF194445-6 (voucher: /, 277 bp, Austin unpubl.)
    - Gallinago delicata, cytb: FJ603651 (voucher: 1B-2986, 702 bp, Baker et al. 2009)

    JQ963043 is made of two sequenced fragments (353 and 411 bp respectively), separated by an 88 bp non-sequenced gap (a "poly-N" in the sequence). This sequence is congruent with the other sequences of G. gallinago/delicata listed above, except for the 213 last bp of the first sequenced fragment: this part differs by 15 substitutions (and has an additional 13 unidentified positions, which suggests a problem as well).
    Conclusion: there was apparently a problem with the sequencing of this part of JQ963043 (Gibson & Baker 2012).
  • - Numenius minutus, cytb: EF373145 (voucher: S-072-78498, 963 bp, Baker et al. 2007)
    - Numenius arquata, cytb: AF417929 (voucher: /, 1143 bp, Chen et al. 2003)

    These two sequences are near-identical; in trees, they appear embedded in the "large curlew" group (clading with N. madagascariensis AF417925, as the sister group of N. americanus JQ963052) which is the position of N. arquata (both based on traditional morphology/biogeography, and on all other available mt genes).
    Conclusion: EF373145 is presumably misidentified, and is Numenius arquata.
  • - Numenius tahitiensis, 12s: JQ962997 (voucher: BTCU-113, 550 bp, Gibson & Baker 2012)
    - Limosa limosa islandica, 12s: JQ962990 (voucher: MKP-1596, 555 bp, Gibson & Baker 2012)

    These two sequences are near-identical; in trees, they appear embedded in Limosa.
    Conclusion: JQ962997 (Gibson & Baker 2012) is presumably misidentified, and is a western L. limosa, probably L. l. islandica.
  • - Phalaropus lobatus, rag1: AY894222 (voucher: JMP-2057, 885 bp, Pereira & Baker 2005)
    - Microparra capensis, rag1: EF373194 (voucher: MKP-1479, 2611 bp, Baker et al. 2007)
    - Tringa stagnatilis, rag1: AY894219 (voucher: MKP-1353, 885 bp, Pereira & Baker 2005)

    AY894222 and EF373194 are near-identical (where EF373194 overlaps with the others = first 854 bp of EF373194, last 854 bp of the other two); in trees based on this fragment, these three sequences cluster together with high support, outside both Scolopacidae and Jacanidae. BLAST searches based on these sequences produce odd results, with Haematopus ater AY228794 appearing systematically among the closest matches.
    Conclusion: where they overlap, these three sequences are presumably wrong, but I can't identify the problem precisely. Note that the rest of EF373194, as far as can be judged, behaves "normally" (i.e., Microparra clusters with Irediparra and Actophilornis in Jacanidae) and may well be correct.
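
Several of the checks in the list above (for example "identical over the first 345 bp, then diverging more and more") come down to comparing two aligned sequences window by window. A minimal Python sketch of that kind of check follows. It is not the procedure actually used for this post; the function name, window size and toy sequences are all assumptions made for illustration.

```python
def windowed_p_distance(seq_a: str, seq_b: str, window: int = 100, step: int = 50):
    """Return (start, p_distance) per window for two aligned sequences.

    Positions where either sequence has a gap ('-') or an unidentified base
    ('N') are skipped; a window with no comparable positions yields None.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    a, b = seq_a.upper(), seq_b.upper()
    results = []
    for start in range(0, max(len(a) - window, 0) + 1, step):
        compared = diffs = 0
        for x, y in zip(a[start:start + window], b[start:start + window]):
            if x in "-N" or y in "-N":
                continue
            compared += 1
            diffs += x != y
        results.append((start, diffs / compared if compared else None))
    return results

# Toy data (not real sequences): identical except for a short stretch.
ref = "ACGT" * 10
alt = ref[:20] + "TTTTT" + ref[25:]
print(windowed_p_distance(ref, alt, window=10, step=10))
# [(0, 0.0), (10, 0.0), (20, 0.4), (30, 0.0)]
```

A distance that is near zero in the first windows and climbs steadily towards the end of the read is the pattern that suggests a sequencing problem rather than a simple misidentification.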
Of course it can be difficult to be sure that an apparently wrong sequence in GenBank was actually used in a published analysis. This is particularly true for simple misidentifications, as sequences may just as well have been swapped at the time of deposition, and the analysis might be fully correct. Here, however, some aspects of the published tree also suggest that the dataset was not "fully clean". The relationships within Numenius in this tree, in particular, are clearly wrong as far as I'm concerned, but not unexpected given the problems described above (tahitiensis "most divergent", attracted towards godwits due to its 12s Limosa sequence; minutus probably closer to "large curlews" than it really is, due to its arquata cytb). The long branches leading to Gallinago gallinago and G. delicata, as well as the "less-than-1" PP associated with this node, might also be linked to the inclusion of wrong sequences (these taxa are not supposed to differ at all genetically). Similarly, the long branch leading to Tringa stagnatilis, and the weakness of the tree reconstruction (low PPs) around it, might easily be due to the rag1 sequence AY894219.

================​

Taking the above into consideration, I reconstructed a dataset, using cox1 sequences from BOLD, and sequences of the other 4 genes, as available in GenBank. I omitted the sequences that looked problematic to me, or used them as being what I believed they are, as explained above. I divided this dataset into 9 partitions as in Gibson & Baker 2012: one for the 12s, one for 1st and 2nd positions of each of the 4 coding genes, one for 3rd positions of each of the 4 coding genes; I selected models for these partitions based on the AICc criterion in TreeFinder; then I reconstructed a "best" ML tree, and ran a 100-replicate bootstrap analysis. The unbootstrapped tree and the consensus tree from the bootstrap analysis are attached.
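
For anyone wanting to reproduce the partitioning, the 9-partition scheme (1 for 12s, plus 4 for 1st+2nd positions, plus 4 for 3rd positions) and the AICc score can be sketched in Python as below. This is only an illustration: the column spans are invented, coding genes are assumed to start in frame, and the choice of sample size `n` for AICc (often the number of alignment columns) is an assumption; none of this reflects TreeFinder's internals.

```python
def codon_partitions(gene_spans: dict[str, tuple[int, int]], noncoding: set[str]):
    """Map partition name -> list of 0-based alignment columns.

    gene_spans gives half-open (start, end) column ranges per gene; coding
    genes are assumed to begin at codon position 1.
    """
    partitions = {}
    for gene, (start, end) in gene_spans.items():
        cols = range(start, end)
        if gene in noncoding:
            partitions[gene] = list(cols)
        else:
            partitions[f"{gene}_pos12"] = [c for c in cols if (c - start) % 3 != 2]
            partitions[f"{gene}_pos3"] = [c for c in cols if (c - start) % 3 == 2]
    return partitions

def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Corrected AIC: -2 lnL + 2k + 2k(k+1)/(n-k-1), with k free parameters."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    return -2.0 * log_likelihood + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)

# Hypothetical column spans, for illustration only.
spans = {"12s": (0, 600), "cox1": (600, 1200), "cytb": (1200, 2100),
         "nd2": (2100, 3000), "rag1": (3000, 3900)}
parts = codon_partitions(spans, noncoding={"12s"})
print(len(parts))                  # 9
print(len(parts["cox1_pos12"]))    # 400
print(len(parts["cox1_pos3"]))     # 200
```

For each partition, the candidate model with the lowest AICc would be retained.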

The results are rather similar to Gibson & Baker's, with the main exception of the curlews (I omitted the two problematic sequences, which resulted in a tree consistent with single-gene analyses: minutus most divergent, tahitiensis sister to whimbrels). The support for internal nodes, in ML, is rather weak. The Calidris radiation, in particular, also appears basically unresolved (which is rather unsurprising, given that many of the taxa actually stand in the matrix based on a cox1 sequence only).

(My take on high PP/low BS is that this indicates that a "best" tree is clearly identified given the dataset, but that this tree should probably be expected to be highly sensitive to the addition of new data.)
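
For readers less familiar with what the bootstrap percentages measure: each replicate is a pseudo-alignment built by resampling the columns of the original alignment with replacement, and a node's support is the share of replicate trees that recover it. A minimal Python sketch of the resampling step only follows (tree building is left out; the function name and toy alignment are assumptions for illustration).

```python
import random

def bootstrap_alignment(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Resample alignment columns with replacement (one pseudo-replicate)."""
    lengths = {len(s) for s in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_cols = lengths.pop()
    # The same sampled columns are applied to every taxon, so sites stay intact.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

# Toy alignment (not real data).
toy = {"taxonA": "ACGTACGT", "taxonB": "ACGTTCGT", "taxonC": "ACCTTCGA"}
rng = random.Random(42)  # fixed seed so the replicates are reproducible
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
print(len(replicates))   # 100
```

Because every replicate reweights the sites at random, a node that rests on only a few informative sites will be lost in many replicates, which is one way a well-defined "best" tree can still show low bootstrap values.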

OK, I'll stop with this for now (but I'd welcome thoughts on the above).

Cheers, L -
 

Attachments

  • data-GenBank-accession-numbers.txt
    26 KB · Views: 302
  • Scolopaci.mtDNA+rag1.taxa_as_G&B.no-support.pdf
    9.4 KB · Views: 387
  • Scolopaci.mtDNA+rag1.taxa_as_G&B.consensus.pdf
    9.8 KB · Views: 319
Excellent work, Laurent. It seems that quite a few of these issues could have been caught if the authors of the various papers had BLASTed all of their sequences before using them in their analyses. At least the results are not affected too badly.
 
Seconded! Excellent work. It is interesting to speculate what the p-values of some of the published trees might be if the probability of human error in the whole process were factored in.
 
Proposal (665) to SACC:

Note: This proposal was originally submitted to NACC, which voted to accept (7 to 3) the proposal and implemented it into NACC classification (Chesser et al. 2013 Supplement in Auk)

Revise the classification of sandpipers and turnstones (Arenariinae)
 
The support for internal nodes, in ML, is rather weak. The Calidris radiation, in particular, also appears basically unresolved (which is rather unsurprising, given that many of the taxa actually stand in the matrix based on a cox1 sequence only).
I recently noted that the data set in GenBank was completed a posteriori by Allan Baker. These include:
  • a set of 12S (KF041181-207) sequences deposited on 22 May 2013;
  • a set of ND2 (KC969089-113), RAG1 (KC969114-150) and cytb (KC969151-174) sequences deposited on 28 Aug 2013; and
  • two sets of cox1 (KF009511-49; KF147196-205) sequences deposited on 6 May and 26 May 2014.
In GenBank, these sequences are linked to Gibson & Baker 2012, even though they were deposited well after the paper was published, and are of course not referenced in it (the paper just states: "Sequences were deposited in GenBank (accession numbers JQ962980 - JQ963056)"). But they seem consistent with the results of the original analysis, and provide, among other things, much better support for the relationships within Calidris s.l.

(So I feel much more comfortable with recent generic treatments of this group, now. |=)|)
 

Attachments

  • Scolopacidae.12s-cox1-cytb-nd2-rag1.pdf
    8.7 KB · Views: 146

Laurent, this is really interesting. Thanks.
 
Calidris and Tringa

Zuhao Huang & Feiyun Tu. DNA barcoding and phylogeny of Calidris and Tringa (Aves: Scolopacidae). Mitochondrial DNA Part A: DNA Mapping, Sequencing, and Analysis.

[Abstract]
 
