Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2020-05-12

AUTHORS

Martin Steinegger, Steven L. Salzberg

ABSTRACT

Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to “complete” model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator

PAGES

115

Identifiers

URI

http://scigraph.springernature.com/pub.10.1186/s13059-020-02023-1

DOI

http://dx.doi.org/10.1186/s13059-020-02023-1

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1127546487

PUBMED

https://www.ncbi.nlm.nih.gov/pubmed/32398145



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0806", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information Systems", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Animals", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "DNA Contamination", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Databases, Nucleic Acid", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Genome", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Humans", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Mice", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "Institute of Molecular Biology and Genetics, Seoul National University, 08826, Seoul, South Korea", 
          "id": "http://www.grid.ac/institutes/grid.31501.36", 
          "name": [
            "School of Biological Sciences, Seoul National University, 08826, Seoul, South Korea", 
            "Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, 21218, Baltimore, Maryland, USA", 
            "Institute of Molecular Biology and Genetics, Seoul National University, 08826, Seoul, South Korea"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Steinegger", 
        "givenName": "Martin", 
        "id": "sg:person.01014200153.21", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01014200153.21"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Departments of Computer Science and Biostatistics, Johns Hopkins University, 21218, Baltimore, Maryland, USA", 
          "id": "http://www.grid.ac/institutes/grid.21107.35", 
          "name": [
            "Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, 21218, Baltimore, Maryland, USA", 
            "Department of Biomedical Engineering, Johns Hopkins University, 21218, Baltimore, Maryland, USA", 
            "Departments of Computer Science and Biostatistics, Johns Hopkins University, 21218, Baltimore, Maryland, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Salzberg", 
        "givenName": "Steven L.", 
        "id": "sg:person.01223441713.02", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01223441713.02"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "sg:pub.10.1038/ng.3852", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1085098857", 
          "https://doi.org/10.1038/ng.3852"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1186/s13059-017-1214-2", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1085212334", 
          "https://doi.org/10.1186/s13059-017-1214-2"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1186/s13059-016-0997-x", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1050687712", 
          "https://doi.org/10.1186/s13059-016-0997-x"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/nbt.3988", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1092237583", 
          "https://doi.org/10.1038/nbt.3988"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/s41598-018-22416-4", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1101298975", 
          "https://doi.org/10.1038/s41598-018-22416-4"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1186/1944-3277-10-18", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1004510227", 
          "https://doi.org/10.1186/1944-3277-10-18"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/s41467-018-04964-5", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1105107305", 
          "https://doi.org/10.1038/s41467-018-04964-5"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1186/s13059-018-1568-0", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1109919786", 
          "https://doi.org/10.1186/s13059-018-1568-0"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/nmeth.1923", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1006541515", 
          "https://doi.org/10.1038/nmeth.1923"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/s41467-019-11306-6", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1119805245", 
          "https://doi.org/10.1038/s41467-019-11306-6"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1186/1471-2105-10-421", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1050579230", 
          "https://doi.org/10.1186/1471-2105-10-421"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2020-05-12", 
    "datePublishedReg": "2020-05-12", 
    "description": "Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to \u201ccomplete\u201d model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator", 
    "genre": "article", 
    "id": "sg:pub.10.1186/s13059-020-02023-1", 
    "isAccessibleForFree": true, 
    "isFundedItemOf": [
      {
        "id": "sg:grant.8383234", 
        "type": "MonetaryGrant"
      }, 
      {
        "id": "sg:grant.7874043", 
        "type": "MonetaryGrant"
      }, 
      {
        "id": "sg:grant.2529453", 
        "type": "MonetaryGrant"
      }
    ], 
    "isPartOf": [
      {
        "id": "sg:journal.1023439", 
        "issn": [
          "1474-760X", 
          "1465-6906"
        ], 
        "name": "Genome Biology", 
        "publisher": "Springer Nature", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "1", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "21"
      }
    ], 
    "keywords": [
      "source code", 
      "input size", 
      "model organism genomes", 
      "public databases", 
      "reference database", 
      "database", 
      "efficient method", 
      "computer", 
      "organism's genome", 
      "code", 
      "RefSeq", 
      "reference sequence", 
      "method", 
      "sequence", 
      "quality", 
      "nr database", 
      "identifies", 
      "analysis", 
      "entry", 
      "sequence comparison", 
      "draft", 
      "comparison", 
      "size", 
      "whole range", 
      "GenBank", 
      "range", 
      "genomic analysis", 
      "TB", 
      "genome", 
      "days", 
      "contamination"
    ], 
    "name": "Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank", 
    "pagination": "115", 
    "productId": [
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1127546487"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1186/s13059-020-02023-1"
        ]
      }, 
      {
        "name": "pubmed_id", 
        "type": "PropertyValue", 
        "value": [
          "32398145"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1186/s13059-020-02023-1", 
      "https://app.dimensions.ai/details/publication/pub.1127546487"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2022-11-24T21:05", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-springernature-scigraph/baseset/20221124/entities/gbq_results/article/article_842.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "https://doi.org/10.1186/s13059-020-02023-1"
  }
]
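
The record above is a JSON array holding a single article object, so it can be read with any ordinary JSON parser. The following is a minimal sketch of pulling out the title, identifiers, and author names. The function name `summarize_record` and the trimmed `SAMPLE` string are illustrative assumptions, not part of SciGraph; `SAMPLE` copies only a few fields from the record shown above.

```python
import json

# Trimmed copy of the record above (illustrative; most fields omitted).
SAMPLE = '''
[
  {
    "name": "Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank",
    "author": [
      {"familyName": "Steinegger", "givenName": "Martin", "type": "Person"},
      {"familyName": "Salzberg", "givenName": "Steven L.", "type": "Person"}
    ],
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1127546487"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1186/s13059-020-02023-1"]},
      {"name": "pubmed_id", "type": "PropertyValue", "value": ["32398145"]}
    ]
  }
]
'''


def summarize_record(jsonld_text):
    """Return title, identifiers, and author names from a SciGraph-style JSON-LD array."""
    records = json.loads(jsonld_text)
    article = records[0]  # the array holds one article object

    # Each productId entry looks like {"name": ..., "value": [...]};
    # keep the first value of each.
    ids = {p["name"]: p["value"][0] for p in article.get("productId", [])}

    authors = [
        f'{a["givenName"]} {a["familyName"]}' for a in article.get("author", [])
    ]
    return {
        "title": article["name"],
        "doi": ids.get("doi"),
        "pubmed_id": ids.get("pubmed_id"),
        "authors": authors,
    }


if __name__ == "__main__":
    print(summarize_record(SAMPLE))
```

This treats the record as plain JSON and ignores the `@context`; a full JSON-LD processor would be needed to expand the keys into complete IRIs.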
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1186/s13059-020-02023-1'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1186/s13059-020-02023-1'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1186/s13059-020-02023-1'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1186/s13059-020-02023-1'
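
The four curl commands differ only in the `Accept` header, which is ordinary HTTP content negotiation. A minimal Python sketch of the same requests, using only the standard library, might look like the following. The names `ACCEPT_TYPES`, `build_request`, and `fetch` are hypothetical helpers introduced here, and whether the endpoint still answers is not confirmed by this page.

```python
import urllib.request

RECORD_URL = "https://scigraph.springernature.com/pub.10.1186/s13059-020-02023-1"

# Media types taken from the curl commands above; the short keys are our own labels.
ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt, url=RECORD_URL):
    """Build a GET request that asks the server for the given RDF serialization."""
    return urllib.request.Request(url, headers={"Accept": ACCEPT_TYPES[fmt]})


def fetch(fmt, url=RECORD_URL, timeout=30):
    """Send the request and return the response body as text (needs network access)."""
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Assumes the SciGraph endpoint is reachable from this machine.
    print(fetch("turtle")[:500])
```

Separating `build_request` from `fetch` lets the header logic be checked without touching the network.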


 

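This table displays all metadata directly associated with this object as RDF triples. Each row is numbered; a row that omits the subject or the predicate inherits it from the nearest row above.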

180 TRIPLES      21 PREDICATES      73 URIs      54 LITERALS      13 BLANK NODES

Row Subject Predicate Object
1 sg:pub.10.1186/s13059-020-02023-1 schema:about N16dfe2d269ba42958137cd550e381141
2 N203b826d0d6042b1abd3bad7887dbcdd
3 N32580f88947b4724bb6b1934aa52acde
4 N55d4779aef144acfb480a127245a231a
5 N7efdf506528348c2b3b7598fc3e1096d
6 Neec381b185664daba08f6d56d854877a
7 anzsrc-for:08
8 anzsrc-for:0806
9 schema:author N84d9e7eecb984b76ab3413c17972dc5f
10 schema:citation sg:pub.10.1038/nbt.3988
11 sg:pub.10.1038/ng.3852
12 sg:pub.10.1038/nmeth.1923
13 sg:pub.10.1038/s41467-018-04964-5
14 sg:pub.10.1038/s41467-019-11306-6
15 sg:pub.10.1038/s41598-018-22416-4
16 sg:pub.10.1186/1471-2105-10-421
17 sg:pub.10.1186/1944-3277-10-18
18 sg:pub.10.1186/s13059-016-0997-x
19 sg:pub.10.1186/s13059-017-1214-2
20 sg:pub.10.1186/s13059-018-1568-0
21 schema:datePublished 2020-05-12
22 schema:datePublishedReg 2020-05-12
23 schema:description Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to “complete” model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator
24 schema:genre article
25 schema:isAccessibleForFree true
26 schema:isPartOf N742f249e37cf4b91a202a177faf95b00
27 Na8e47b8d591b408f9dd99ef59009a5aa
28 sg:journal.1023439
29 schema:keywords GenBank
30 RefSeq
31 TB
32 analysis
33 code
34 comparison
35 computer
36 contamination
37 database
38 days
39 draft
40 efficient method
41 entry
42 genome
43 genomic analysis
44 identifies
45 input size
46 method
47 model organism genomes
48 nr database
49 organism's genome
50 public databases
51 quality
52 range
53 reference database
54 reference sequence
55 sequence
56 sequence comparison
57 size
58 source code
59 whole range
60 schema:name Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank
61 schema:pagination 115
62 schema:productId N6d0a592c626b44da9ac8a60fefa4a6fe
63 N6dbafa8a4ac04b078f176bb4c93a0c41
64 N8b39059bc061402c80e08fcaf9cc9e85
65 schema:sameAs https://app.dimensions.ai/details/publication/pub.1127546487
66 https://doi.org/10.1186/s13059-020-02023-1
67 schema:sdDatePublished 2022-11-24T21:05
68 schema:sdLicense https://scigraph.springernature.com/explorer/license/
69 schema:sdPublisher Naf01bf1ae3b44799ba23039a0af5af15
70 schema:url https://doi.org/10.1186/s13059-020-02023-1
71 sgo:license sg:explorer/license/
72 sgo:sdDataset articles
73 rdf:type schema:ScholarlyArticle
74 N16dfe2d269ba42958137cd550e381141 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
75 schema:name DNA Contamination
76 rdf:type schema:DefinedTerm
77 N203b826d0d6042b1abd3bad7887dbcdd schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
78 schema:name Genome
79 rdf:type schema:DefinedTerm
80 N32580f88947b4724bb6b1934aa52acde schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
81 schema:name Animals
82 rdf:type schema:DefinedTerm
83 N55d4779aef144acfb480a127245a231a schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
84 schema:name Mice
85 rdf:type schema:DefinedTerm
86 N6d0a592c626b44da9ac8a60fefa4a6fe schema:name doi
87 schema:value 10.1186/s13059-020-02023-1
88 rdf:type schema:PropertyValue
89 N6dbafa8a4ac04b078f176bb4c93a0c41 schema:name dimensions_id
90 schema:value pub.1127546487
91 rdf:type schema:PropertyValue
92 N742f249e37cf4b91a202a177faf95b00 schema:issueNumber 1
93 rdf:type schema:PublicationIssue
94 N7efdf506528348c2b3b7598fc3e1096d schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
95 schema:name Databases, Nucleic Acid
96 rdf:type schema:DefinedTerm
97 N84d9e7eecb984b76ab3413c17972dc5f rdf:first sg:person.01014200153.21
98 rdf:rest Nd9a0d3f8da3c4a12abdddfefcac27ea9
99 N8b39059bc061402c80e08fcaf9cc9e85 schema:name pubmed_id
100 schema:value 32398145
101 rdf:type schema:PropertyValue
102 Na8e47b8d591b408f9dd99ef59009a5aa schema:volumeNumber 21
103 rdf:type schema:PublicationVolume
104 Naf01bf1ae3b44799ba23039a0af5af15 schema:name Springer Nature - SN SciGraph project
105 rdf:type schema:Organization
106 Nd9a0d3f8da3c4a12abdddfefcac27ea9 rdf:first sg:person.01223441713.02
107 rdf:rest rdf:nil
108 Neec381b185664daba08f6d56d854877a schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
109 schema:name Humans
110 rdf:type schema:DefinedTerm
111 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
112 schema:name Information and Computing Sciences
113 rdf:type schema:DefinedTerm
114 anzsrc-for:0806 schema:inDefinedTermSet anzsrc-for:
115 schema:name Information Systems
116 rdf:type schema:DefinedTerm
117 sg:grant.2529453 http://pending.schema.org/fundedItem sg:pub.10.1186/s13059-020-02023-1
118 rdf:type schema:MonetaryGrant
119 sg:grant.7874043 http://pending.schema.org/fundedItem sg:pub.10.1186/s13059-020-02023-1
120 rdf:type schema:MonetaryGrant
121 sg:grant.8383234 http://pending.schema.org/fundedItem sg:pub.10.1186/s13059-020-02023-1
122 rdf:type schema:MonetaryGrant
123 sg:journal.1023439 schema:issn 1465-6906
124 1474-760X
125 schema:name Genome Biology
126 schema:publisher Springer Nature
127 rdf:type schema:Periodical
128 sg:person.01014200153.21 schema:affiliation grid-institutes:grid.31501.36
129 schema:familyName Steinegger
130 schema:givenName Martin
131 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01014200153.21
132 rdf:type schema:Person
133 sg:person.01223441713.02 schema:affiliation grid-institutes:grid.21107.35
134 schema:familyName Salzberg
135 schema:givenName Steven L.
136 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01223441713.02
137 rdf:type schema:Person
138 sg:pub.10.1038/nbt.3988 schema:sameAs https://app.dimensions.ai/details/publication/pub.1092237583
139 https://doi.org/10.1038/nbt.3988
140 rdf:type schema:CreativeWork
141 sg:pub.10.1038/ng.3852 schema:sameAs https://app.dimensions.ai/details/publication/pub.1085098857
142 https://doi.org/10.1038/ng.3852
143 rdf:type schema:CreativeWork
144 sg:pub.10.1038/nmeth.1923 schema:sameAs https://app.dimensions.ai/details/publication/pub.1006541515
145 https://doi.org/10.1038/nmeth.1923
146 rdf:type schema:CreativeWork
147 sg:pub.10.1038/s41467-018-04964-5 schema:sameAs https://app.dimensions.ai/details/publication/pub.1105107305
148 https://doi.org/10.1038/s41467-018-04964-5
149 rdf:type schema:CreativeWork
150 sg:pub.10.1038/s41467-019-11306-6 schema:sameAs https://app.dimensions.ai/details/publication/pub.1119805245
151 https://doi.org/10.1038/s41467-019-11306-6
152 rdf:type schema:CreativeWork
153 sg:pub.10.1038/s41598-018-22416-4 schema:sameAs https://app.dimensions.ai/details/publication/pub.1101298975
154 https://doi.org/10.1038/s41598-018-22416-4
155 rdf:type schema:CreativeWork
156 sg:pub.10.1186/1471-2105-10-421 schema:sameAs https://app.dimensions.ai/details/publication/pub.1050579230
157 https://doi.org/10.1186/1471-2105-10-421
158 rdf:type schema:CreativeWork
159 sg:pub.10.1186/1944-3277-10-18 schema:sameAs https://app.dimensions.ai/details/publication/pub.1004510227
160 https://doi.org/10.1186/1944-3277-10-18
161 rdf:type schema:CreativeWork
162 sg:pub.10.1186/s13059-016-0997-x schema:sameAs https://app.dimensions.ai/details/publication/pub.1050687712
163 https://doi.org/10.1186/s13059-016-0997-x
164 rdf:type schema:CreativeWork
165 sg:pub.10.1186/s13059-017-1214-2 schema:sameAs https://app.dimensions.ai/details/publication/pub.1085212334
166 https://doi.org/10.1186/s13059-017-1214-2
167 rdf:type schema:CreativeWork
168 sg:pub.10.1186/s13059-018-1568-0 schema:sameAs https://app.dimensions.ai/details/publication/pub.1109919786
169 https://doi.org/10.1186/s13059-018-1568-0
170 rdf:type schema:CreativeWork
171 grid-institutes:grid.21107.35 schema:alternateName Departments of Computer Science and Biostatistics, Johns Hopkins University, 21218, Baltimore, Maryland, USA
172 schema:name Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, 21218, Baltimore, Maryland, USA
173 Department of Biomedical Engineering, Johns Hopkins University, 21218, Baltimore, Maryland, USA
174 Departments of Computer Science and Biostatistics, Johns Hopkins University, 21218, Baltimore, Maryland, USA
175 rdf:type schema:Organization
176 grid-institutes:grid.31501.36 schema:alternateName Institute of Molecular Biology and Genetics, Seoul National University, 08826, Seoul, South Korea
177 schema:name Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, 21218, Baltimore, Maryland, USA
178 Institute of Molecular Biology and Genetics, Seoul National University, 08826, Seoul, South Korea
179 School of Biological Sciences, Seoul National University, 08826, Seoul, South Korea
180 rdf:type schema:Organization
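
In the table above, the `schema:author` value is a blank node that heads an RDF list: each list node carries an `rdf:first` item and an `rdf:rest` pointer, and the chain ends at `rdf:nil` (rows 97, 98, 106, and 107). A minimal sketch of recovering the ordered author IDs from those rows follows. The triples are represented as plain Python tuples and `walk_rdf_list` is a hypothetical helper, not a SciGraph or RDF library function.

```python
# The four list triples copied from rows 97, 98, 106, and 107 of the table above.
TRIPLES = [
    ("N84d9e7eecb984b76ab3413c17972dc5f", "rdf:first", "sg:person.01014200153.21"),
    ("N84d9e7eecb984b76ab3413c17972dc5f", "rdf:rest", "Nd9a0d3f8da3c4a12abdddfefcac27ea9"),
    ("Nd9a0d3f8da3c4a12abdddfefcac27ea9", "rdf:first", "sg:person.01223441713.02"),
    ("Nd9a0d3f8da3c4a12abdddfefcac27ea9", "rdf:rest", "rdf:nil"),
]


def walk_rdf_list(head, triples):
    """Follow rdf:first / rdf:rest links from `head` and return the items in order."""
    first = {s: o for s, p, o in triples if p == "rdf:first"}
    rest = {s: o for s, p, o in triples if p == "rdf:rest"}

    items = []
    node = head
    while node != "rdf:nil":  # rdf:nil marks the end of the list
        items.append(first[node])
        node = rest[node]
    return items


if __name__ == "__main__":
    # The head node is the object of schema:author in row 9.
    print(walk_rdf_list("N84d9e7eecb984b76ab3413c17972dc5f", TRIPLES))
```

The linked-list encoding is what preserves author order, since a plain set of RDF triples has no inherent ordering.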
 



