Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2020-05-12

AUTHORS

Martin Steinegger, Steven L. Salzberg

ABSTRACT

Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to “complete” model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator
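
The throughput implied by the abstract can be worked out with a few lines of arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes decimal terabytes (10^12 bytes), treats the single reported data point (3.3 TB, 12 days, 32 cores) as exact, and takes the stated linear scaling at face value. The helper name days_for is hypothetical and not part of Conterminator.

```python
# Figures quoted in the abstract.
TOTAL_TB = 3.3
DAYS = 12
CORES = 32

SECONDS_PER_DAY = 86_400


def days_for(size_tb):
    """Estimate runtime in days, assuming strictly linear scaling with input size."""
    return size_tb * DAYS / TOTAL_TB


# Assumption: 1 TB = 1e12 bytes (decimal), not 2**40 bytes.
bytes_per_second = TOTAL_TB * 1e12 / (DAYS * SECONDS_PER_DAY)

print(f"{bytes_per_second / 1e6:.2f} MB/s across all cores")
print(f"{bytes_per_second / 1e6 / CORES:.3f} MB/s per core")
print(f"{days_for(1.0):.2f} days per TB on the same machine")
```

The per-core figure is an average over the whole run and says nothing about how individual stages of the method behave.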

PAGES

115

Identifiers

URI

http://scigraph.springernature.com/pub.10.1186/s13059-020-02023-1

DOI

http://dx.doi.org/10.1186/s13059-020-02023-1

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1127546487

PUBMED

https://www.ncbi.nlm.nih.gov/pubmed/32398145



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0806", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information Systems", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Animals", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "DNA Contamination", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Databases, Nucleic Acid", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Genome", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Humans", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Mice", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "Institute of Molecular Biology and Genetics, Seoul National University, 08826, Seoul, South Korea", 
          "id": "http://www.grid.ac/institutes/grid.31501.36", 
          "name": [
            "School of Biological Sciences, Seoul National University, 08826, Seoul, South Korea", 
            "Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, 21218, Baltimore, Maryland, USA", 
            "Institute of Molecular Biology and Genetics, Seoul National University, 08826, Seoul, South Korea"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Steinegger", 
        "givenName": "Martin", 
        "id": "sg:person.01014200153.21", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01014200153.21"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Departments of Computer Science and Biostatistics, Johns Hopkins University, 21218, Baltimore, Maryland, USA", 
          "id": "http://www.grid.ac/institutes/grid.21107.35", 
          "name": [
            "Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, 21218, Baltimore, Maryland, USA", 
            "Department of Biomedical Engineering, Johns Hopkins University, 21218, Baltimore, Maryland, USA", 
            "Departments of Computer Science and Biostatistics, Johns Hopkins University, 21218, Baltimore, Maryland, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Salzberg", 
        "givenName": "Steven L.", 
        "id": "sg:person.01223441713.02", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01223441713.02"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "sg:pub.10.1186/1944-3277-10-18", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1004510227", 
          "https://doi.org/10.1186/1944-3277-10-18"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1186/s13059-018-1568-0", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1109919786", 
          "https://doi.org/10.1186/s13059-018-1568-0"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/nmeth.1923", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1006541515", 
          "https://doi.org/10.1038/nmeth.1923"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1186/s13059-016-0997-x", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1050687712", 
          "https://doi.org/10.1186/s13059-016-0997-x"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1186/s13059-017-1214-2", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1085212334", 
          "https://doi.org/10.1186/s13059-017-1214-2"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/s41467-019-11306-6", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1119805245", 
          "https://doi.org/10.1038/s41467-019-11306-6"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1186/1471-2105-10-421", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1050579230", 
          "https://doi.org/10.1186/1471-2105-10-421"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/ng.3852", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1085098857", 
          "https://doi.org/10.1038/ng.3852"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/s41467-018-04964-5", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1105107305", 
          "https://doi.org/10.1038/s41467-018-04964-5"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/nbt.3988", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1092237583", 
          "https://doi.org/10.1038/nbt.3988"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/s41598-018-22416-4", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1101298975", 
          "https://doi.org/10.1038/s41598-018-22416-4"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2020-05-12", 
    "datePublishedReg": "2020-05-12", 
    "description": "Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to \u201ccomplete\u201d model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator", 
    "genre": "article", 
    "id": "sg:pub.10.1186/s13059-020-02023-1", 
    "isAccessibleForFree": true, 
    "isFundedItemOf": [
      {
        "id": "sg:grant.7874043", 
        "type": "MonetaryGrant"
      }, 
      {
        "id": "sg:grant.8383234", 
        "type": "MonetaryGrant"
      }, 
      {
        "id": "sg:grant.2529453", 
        "type": "MonetaryGrant"
      }
    ], 
    "isPartOf": [
      {
        "id": "sg:journal.1023439", 
        "issn": [
          "1474-760X", 
          "1465-6906"
        ], 
        "name": "Genome Biology", 
        "publisher": "Springer Nature", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "1", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "21"
      }
    ], 
    "keywords": [
      "source code", 
      "input size", 
      "model organism genomes", 
      "public databases", 
      "reference database", 
      "database", 
      "efficient method", 
      "computer", 
      "organism's genome", 
      "code", 
      "RefSeq", 
      "reference sequence", 
      "method", 
      "sequence", 
      "quality", 
      "nr database", 
      "identifies", 
      "analysis", 
      "entry", 
      "sequence comparison", 
      "draft", 
      "comparison", 
      "size", 
      "whole range", 
      "GenBank", 
      "range", 
      "genomic analysis", 
      "TB", 
      "genome", 
      "days", 
      "contamination"
    ], 
    "name": "Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank", 
    "pagination": "115", 
    "productId": [
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1127546487"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1186/s13059-020-02023-1"
        ]
      }, 
      {
        "name": "pubmed_id", 
        "type": "PropertyValue", 
        "value": [
          "32398145"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1186/s13059-020-02023-1", 
      "https://app.dimensions.ai/details/publication/pub.1127546487"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2022-09-02T16:05", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-springernature-scigraph/baseset/20220902/entities/gbq_results/article/article_850.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "https://doi.org/10.1186/s13059-020-02023-1"
  }
]
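
A record shaped like the one above can be read with any JSON parser, since JSON-LD is plain JSON. The following is a minimal sketch, assuming the record is a one-element array and that "productId" holds name/value pairs whose values are single-item lists, as in this record. The function name summarize and the trimmed RECORD_JSON sample are illustrative only.

```python
import json

# Trimmed copy of the record above; only the fields read below are kept.
RECORD_JSON = """
[
  {
    "author": [
      {"familyName": "Steinegger", "givenName": "Martin", "type": "Person"},
      {"familyName": "Salzberg", "givenName": "Steven L.", "type": "Person"}
    ],
    "datePublished": "2020-05-12",
    "name": "Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank",
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1127546487"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1186/s13059-020-02023-1"]},
      {"name": "pubmed_id", "type": "PropertyValue", "value": ["32398145"]}
    ],
    "type": "ScholarlyArticle"
  }
]
"""


def summarize(record_json):
    """Return title, date, authors, and identifiers from a SciGraph-style record."""
    article = json.loads(record_json)[0]  # the record is a one-element array
    authors = [f"{a['givenName']} {a['familyName']}" for a in article["author"]]
    # productId is a list of name/value pairs; each value is itself a list.
    ids = {p["name"]: p["value"][0] for p in article["productId"]}
    return {
        "title": article["name"],
        "date": article["datePublished"],
        "authors": authors,
        "ids": ids,
    }


summary = summarize(RECORD_JSON)
print(summary["authors"], summary["ids"]["doi"])
```

This reads the record as ordinary JSON and ignores the "@context"; a full JSON-LD processor would be needed to expand terms into IRIs.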
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1186/s13059-020-02023-1'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1186/s13059-020-02023-1'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1186/s13059-020-02023-1'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1186/s13059-020-02023-1'
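
The four curl commands differ only in the Accept header, so the same content negotiation can be sketched in Python with the standard library. This is a hedged illustration, not an official client: the names MEDIA_TYPES, build_request, and fetch are hypothetical, and the sketch assumes the endpoint is still reachable and still honors these media types.

```python
import urllib.request

RECORD_URL = "https://scigraph.springernature.com/pub.10.1186/s13059-020-02023-1"

# Media types taken from the curl commands above.
MEDIA_TYPES = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(fmt, url=RECORD_URL):
    """Build a GET request that asks the server for the given serialization."""
    return urllib.request.Request(url, headers={"Accept": MEDIA_TYPES[fmt]})


def fetch(fmt, url=RECORD_URL, timeout=30):
    """Download the record in the requested format (requires network access)."""
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example (needs network access): print(fetch("turtle")[:500])
```

Separating build_request from fetch keeps the header logic testable without touching the network.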


 

This table displays all metadata directly associated with this object as RDF triples. Rows that omit the subject or predicate repeat the one from the row above.

180 TRIPLES      21 PREDICATES      73 URIs      54 LITERALS      13 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1186/s13059-020-02023-1 schema:about N0b5d080edab64de5a92e70387689ed03
2 N369ff911e616443d8973293befeb7adb
3 N518ee18fa587419cad5588ad737da45a
4 N540e4a1651034e3584b3726f6b20081e
5 N7dee6077d56944c4971968851403aa69
6 Nf80998c3fb464e36931f2879f3f8b03a
7 anzsrc-for:08
8 anzsrc-for:0806
9 schema:author Nceaf3b655e454b7db3f235ae669cd7bc
10 schema:citation sg:pub.10.1038/nbt.3988
11 sg:pub.10.1038/ng.3852
12 sg:pub.10.1038/nmeth.1923
13 sg:pub.10.1038/s41467-018-04964-5
14 sg:pub.10.1038/s41467-019-11306-6
15 sg:pub.10.1038/s41598-018-22416-4
16 sg:pub.10.1186/1471-2105-10-421
17 sg:pub.10.1186/1944-3277-10-18
18 sg:pub.10.1186/s13059-016-0997-x
19 sg:pub.10.1186/s13059-017-1214-2
20 sg:pub.10.1186/s13059-018-1568-0
21 schema:datePublished 2020-05-12
22 schema:datePublishedReg 2020-05-12
23 schema:description Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to “complete” model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator
24 schema:genre article
25 schema:isAccessibleForFree true
26 schema:isPartOf N37beb2783a0f418f91a6fe28fd64c165
27 Nf66325c5300d48ab9cdee46cf243e563
28 sg:journal.1023439
29 schema:keywords GenBank
30 RefSeq
31 TB
32 analysis
33 code
34 comparison
35 computer
36 contamination
37 database
38 days
39 draft
40 efficient method
41 entry
42 genome
43 genomic analysis
44 identifies
45 input size
46 method
47 model organism genomes
48 nr database
49 organism's genome
50 public databases
51 quality
52 range
53 reference database
54 reference sequence
55 sequence
56 sequence comparison
57 size
58 source code
59 whole range
60 schema:name Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank
61 schema:pagination 115
62 schema:productId N2ceeee4e682d4dd5aa099a8ea470bd61
63 N81ca55f15ef14f029316672de5e216d3
64 N998e3e9c9f4b4cdeab68e62834970cba
65 schema:sameAs https://app.dimensions.ai/details/publication/pub.1127546487
66 https://doi.org/10.1186/s13059-020-02023-1
67 schema:sdDatePublished 2022-09-02T16:05
68 schema:sdLicense https://scigraph.springernature.com/explorer/license/
69 schema:sdPublisher N077970de0c0d471da1cec01516df603f
70 schema:url https://doi.org/10.1186/s13059-020-02023-1
71 sgo:license sg:explorer/license/
72 sgo:sdDataset articles
73 rdf:type schema:ScholarlyArticle
74 N077970de0c0d471da1cec01516df603f schema:name Springer Nature - SN SciGraph project
75 rdf:type schema:Organization
76 N0b5d080edab64de5a92e70387689ed03 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
77 schema:name Mice
78 rdf:type schema:DefinedTerm
79 N2ceeee4e682d4dd5aa099a8ea470bd61 schema:name dimensions_id
80 schema:value pub.1127546487
81 rdf:type schema:PropertyValue
82 N369ff911e616443d8973293befeb7adb schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
83 schema:name Genome
84 rdf:type schema:DefinedTerm
85 N37beb2783a0f418f91a6fe28fd64c165 schema:issueNumber 1
86 rdf:type schema:PublicationIssue
87 N518ee18fa587419cad5588ad737da45a schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
88 schema:name Humans
89 rdf:type schema:DefinedTerm
90 N53579732b58148c7bb41cf62f9a8ac10 rdf:first sg:person.01223441713.02
91 rdf:rest rdf:nil
92 N540e4a1651034e3584b3726f6b20081e schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
93 schema:name Databases, Nucleic Acid
94 rdf:type schema:DefinedTerm
95 N7dee6077d56944c4971968851403aa69 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
96 schema:name DNA Contamination
97 rdf:type schema:DefinedTerm
98 N81ca55f15ef14f029316672de5e216d3 schema:name pubmed_id
99 schema:value 32398145
100 rdf:type schema:PropertyValue
101 N998e3e9c9f4b4cdeab68e62834970cba schema:name doi
102 schema:value 10.1186/s13059-020-02023-1
103 rdf:type schema:PropertyValue
104 Nceaf3b655e454b7db3f235ae669cd7bc rdf:first sg:person.01014200153.21
105 rdf:rest N53579732b58148c7bb41cf62f9a8ac10
106 Nf66325c5300d48ab9cdee46cf243e563 schema:volumeNumber 21
107 rdf:type schema:PublicationVolume
108 Nf80998c3fb464e36931f2879f3f8b03a schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
109 schema:name Animals
110 rdf:type schema:DefinedTerm
111 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
112 schema:name Information and Computing Sciences
113 rdf:type schema:DefinedTerm
114 anzsrc-for:0806 schema:inDefinedTermSet anzsrc-for:
115 schema:name Information Systems
116 rdf:type schema:DefinedTerm
117 sg:grant.2529453 http://pending.schema.org/fundedItem sg:pub.10.1186/s13059-020-02023-1
118 rdf:type schema:MonetaryGrant
119 sg:grant.7874043 http://pending.schema.org/fundedItem sg:pub.10.1186/s13059-020-02023-1
120 rdf:type schema:MonetaryGrant
121 sg:grant.8383234 http://pending.schema.org/fundedItem sg:pub.10.1186/s13059-020-02023-1
122 rdf:type schema:MonetaryGrant
123 sg:journal.1023439 schema:issn 1465-6906
124 1474-760X
125 schema:name Genome Biology
126 schema:publisher Springer Nature
127 rdf:type schema:Periodical
128 sg:person.01014200153.21 schema:affiliation grid-institutes:grid.31501.36
129 schema:familyName Steinegger
130 schema:givenName Martin
131 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01014200153.21
132 rdf:type schema:Person
133 sg:person.01223441713.02 schema:affiliation grid-institutes:grid.21107.35
134 schema:familyName Salzberg
135 schema:givenName Steven L.
136 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01223441713.02
137 rdf:type schema:Person
138 sg:pub.10.1038/nbt.3988 schema:sameAs https://app.dimensions.ai/details/publication/pub.1092237583
139 https://doi.org/10.1038/nbt.3988
140 rdf:type schema:CreativeWork
141 sg:pub.10.1038/ng.3852 schema:sameAs https://app.dimensions.ai/details/publication/pub.1085098857
142 https://doi.org/10.1038/ng.3852
143 rdf:type schema:CreativeWork
144 sg:pub.10.1038/nmeth.1923 schema:sameAs https://app.dimensions.ai/details/publication/pub.1006541515
145 https://doi.org/10.1038/nmeth.1923
146 rdf:type schema:CreativeWork
147 sg:pub.10.1038/s41467-018-04964-5 schema:sameAs https://app.dimensions.ai/details/publication/pub.1105107305
148 https://doi.org/10.1038/s41467-018-04964-5
149 rdf:type schema:CreativeWork
150 sg:pub.10.1038/s41467-019-11306-6 schema:sameAs https://app.dimensions.ai/details/publication/pub.1119805245
151 https://doi.org/10.1038/s41467-019-11306-6
152 rdf:type schema:CreativeWork
153 sg:pub.10.1038/s41598-018-22416-4 schema:sameAs https://app.dimensions.ai/details/publication/pub.1101298975
154 https://doi.org/10.1038/s41598-018-22416-4
155 rdf:type schema:CreativeWork
156 sg:pub.10.1186/1471-2105-10-421 schema:sameAs https://app.dimensions.ai/details/publication/pub.1050579230
157 https://doi.org/10.1186/1471-2105-10-421
158 rdf:type schema:CreativeWork
159 sg:pub.10.1186/1944-3277-10-18 schema:sameAs https://app.dimensions.ai/details/publication/pub.1004510227
160 https://doi.org/10.1186/1944-3277-10-18
161 rdf:type schema:CreativeWork
162 sg:pub.10.1186/s13059-016-0997-x schema:sameAs https://app.dimensions.ai/details/publication/pub.1050687712
163 https://doi.org/10.1186/s13059-016-0997-x
164 rdf:type schema:CreativeWork
165 sg:pub.10.1186/s13059-017-1214-2 schema:sameAs https://app.dimensions.ai/details/publication/pub.1085212334
166 https://doi.org/10.1186/s13059-017-1214-2
167 rdf:type schema:CreativeWork
168 sg:pub.10.1186/s13059-018-1568-0 schema:sameAs https://app.dimensions.ai/details/publication/pub.1109919786
169 https://doi.org/10.1186/s13059-018-1568-0
170 rdf:type schema:CreativeWork
171 grid-institutes:grid.21107.35 schema:alternateName Departments of Computer Science and Biostatistics, Johns Hopkins University, 21218, Baltimore, Maryland, USA
172 schema:name Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, 21218, Baltimore, Maryland, USA
173 Department of Biomedical Engineering, Johns Hopkins University, 21218, Baltimore, Maryland, USA
174 Departments of Computer Science and Biostatistics, Johns Hopkins University, 21218, Baltimore, Maryland, USA
175 rdf:type schema:Organization
176 grid-institutes:grid.31501.36 schema:alternateName Institute of Molecular Biology and Genetics, Seoul National University, 08826, Seoul, South Korea
177 schema:name Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, 21218, Baltimore, Maryland, USA
178 Institute of Molecular Biology and Genetics, Seoul National University, 08826, Seoul, South Korea
179 School of Biological Sciences, Seoul National University, 08826, Seoul, South Korea
180 rdf:type schema:Organization
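
Summary counts like those shown above the table (triples, predicates) can be recomputed from the N-Triples serialization, which puts one complete triple on each line. The sketch below is a simplified tally, not a conforming N-Triples parser: the regular expression only separates subject, predicate, and object, and does not validate literals or escapes. The sample expands the schema: prefix to http://schema.org/, which is an assumption about this dataset's context, and the names SAMPLE, TRIPLE, and tally are illustrative.

```python
import re
from collections import Counter

SCHEMA = "http://schema.org/"  # assumed expansion of the schema: prefix
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PUB = "http://scigraph.springernature.com/pub.10.1186/s13059-020-02023-1"

# A handful of rows from the table, rewritten as N-Triples lines.
SAMPLE = f"""\
<{PUB}> <{SCHEMA}genre> "article" .
<{PUB}> <{SCHEMA}pagination> "115" .
<{PUB}> <{SCHEMA}keywords> "GenBank" .
<{PUB}> <{SCHEMA}keywords> "RefSeq" .
<{PUB}> <{RDF_TYPE}> <{SCHEMA}ScholarlyArticle> .
"""

# Subject: IRI or blank node; predicate: IRI; object: everything before the final " ."
TRIPLE = re.compile(r'^(<[^>]*>|_:\S+)\s+<([^>]*)>\s+(.+?)\s*\.\s*$')


def tally(ntriples):
    """Return (triple count, per-predicate counts, distinct subjects)."""
    predicates = Counter()
    subjects = set()
    for line in ntriples.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match is None:
            raise ValueError(f"not a simple triple: {line!r}")
        subjects.add(match.group(1))
        predicates[match.group(2)] += 1
    return sum(predicates.values()), predicates, subjects


total, predicates, subjects = tally(SAMPLE)
print(total, len(predicates), len(subjects))
```

For real data, a dedicated RDF library would be a safer choice than a regular expression.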
 



