A Quantitative and Qualitative Characterization of k-mer Based Alignment-Free Phylogeny Construction


Ontology type: schema:Chapter     


Chapter Info

DATE

2019-02-14

AUTHORS

Filippo Utro, Daniel E. Platt, Laxmi Parida

ABSTRACT

The rapidly growing volume of genomic data, including that of pathogens, both invites exploration of possible phylogenetic relationships among unclassified organisms and challenges standard techniques that require multiple sequence alignment. Further, the ability to probe variations in selection pressure, e.g. among viral outbreaks, is important for characterizing the life of a virus in its biological reservoir. In this paper, we derived the probability distribution of k-mer alignment lengths between random sequences for a given optimized score, to quantify the probability that a given alignment was no better than chance, and applied it to Human Papilloma Virus (HPV), primate mtDNA, and Ebola. Even for highly variable HPV types, the number of k-mers required to significantly distinguish an alignment of related genomes from random sequences was reduced from 64 for 1-mers to 6 for 3-mers and 4 for 4-mers, indicating that k-mers provide sufficient specificity to characterize differences between sequences by their k-mer frequencies, so that distances based on k-mer frequencies can proxy for evolutionary distance. We constructed phylogenies from mtDNA coding sequences and from Ebola samples. Primate mtDNA coding-region k-mer UPGMA phylogenies reproduced most of the expected primate phylogeny. The Mantel test, applied to RAxML and Bayesian phylogenetic distances between Ebola samples versus 3-mer frequency distances, was highly significant ($\le 1\times 10^{-5}$). We characterized differences in selection pressure between coding and non-coding regions, and between early cell cycle and late genes in Ebola. Coding versus non-coding regions showed evidence of purifying selection, while the early versus late cell cycle proteins showed differences, with late cycle proteins resembling an influenza-like immunological response, noting that the g-proteins are among the late genes.
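To make the idea of a k-mer frequency distance concrete, here is a minimal Python sketch. It is not the chapter's implementation: the function names, the restriction to the ACGT alphabet, and the choice of Euclidean distance between frequency vectors are all assumptions made for illustration. The abstract only states that distances based on k-mer frequencies were used.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumption: plain DNA alphabet; other symbols are ignored


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k k-mers, in lexicographic order."""
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Windows containing symbols outside ACGT (e.g. N) are left out of the total.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def kmer_distance(a, b, k=3):
    """Euclidean distance between two k-mer frequency vectors (illustrative choice)."""
    fa = kmer_frequencies(a, k)
    fb = kmer_frequencies(b, k)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))
```

A pairwise matrix of such distances could then be passed to a clustering method such as UPGMA, which the abstract names; that step is not shown here.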

PAGES

19-31

Book

TITLE

Computational Intelligence Methods for Bioinformatics and Biostatistics

ISBN

978-3-030-14159-2
978-3-030-14160-8

Identifiers

URI

http://scigraph.springernature.com/pub.10.1007/978-3-030-14160-8_3

DOI

http://dx.doi.org/10.1007/978-3-030-14160-8_3

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1112112244



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/01", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Mathematical Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0104", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Statistics", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "Computational Biology Center, IBM T. J. Watson Research, 10598, Yorktown Heights, NY, USA", 
          "id": "http://www.grid.ac/institutes/None", 
          "name": [
            "Computational Biology Center, IBM T. J. Watson Research, 10598, Yorktown Heights, NY, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Utro", 
        "givenName": "Filippo", 
        "id": "sg:person.01176571007.38", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01176571007.38"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Computational Biology Center, IBM T. J. Watson Research, 10598, Yorktown Heights, NY, USA", 
          "id": "http://www.grid.ac/institutes/None", 
          "name": [
            "Computational Biology Center, IBM T. J. Watson Research, 10598, Yorktown Heights, NY, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Platt", 
        "givenName": "Daniel E.", 
        "id": "sg:person.01332106363.98", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01332106363.98"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Computational Biology Center, IBM T. J. Watson Research, 10598, Yorktown Heights, NY, USA", 
          "id": "http://www.grid.ac/institutes/None", 
          "name": [
            "Computational Biology Center, IBM T. J. Watson Research, 10598, Yorktown Heights, NY, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Parida", 
        "givenName": "Laxmi", 
        "id": "sg:person.01336557015.68", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01336557015.68"
        ], 
        "type": "Person"
      }
    ], 
    "datePublished": "2019-02-14", 
    "datePublishedReg": "2019-02-14", 
    "description": "The rapidly growing volume of genomic data, including pathogens, both invites exploration of possible phylogenetic relationships among unclassified organisms, and challenges standard techniques that require multiple sequence alignment. Further, the ability to probe variations in selection pressure e.g. among viral outbreaks, is an important characterization of the life of a virus in its biological reservoir.In this paper, we derived the probability distribution of k-mer alignment lengths between random sequences for a given optimized score to quantify the probability that a given alignment was not better than chance, and applied it to Human Papiloma Virus (HPV), primate mtDNA, and Ebola. Even for highly variable HPV types, the number of k-mers required to significantly distinguish an alignment of related genomes from random sequences was reduced from 64 for 1-mers to 6 for 3-mers and 4 for 4-mers, indicating k-mers provide sufficient specificity to be able to characterize differences in sequences by their k-mer frequencies, allowing distances based on the k-mer frequencies to proxy for evolutionary distance. We computed mtDNA coding sequence and Ebola phylogeny construction. Primate mtDNA coding region k-mer UPGMA phylogenies reproduced most of the expected primate phylogeny. The Mantel test, applied to RAxML and Bayesian phylogenetic distances between Ebola samples versus 3-mer frequency distances, was highly significant (\\documentclass[12pt]{minimal}\n\t\t\t\t\\usepackage{amsmath}\n\t\t\t\t\\usepackage{wasysym}\n\t\t\t\t\\usepackage{amsfonts}\n\t\t\t\t\\usepackage{amssymb}\n\t\t\t\t\\usepackage{amsbsy}\n\t\t\t\t\\usepackage{mathrsfs}\n\t\t\t\t\\usepackage{upgreek}\n\t\t\t\t\\setlength{\\oddsidemargin}{-69pt}\n\t\t\t\t\\begin{document}$$\\le 1\\times 10^{-5}$$\\end{document}). We characterized differences in selection pressure between coding and non-coding regions, and of selection in early cell cycle vs. late genes in Ebola. Coding versus non-coding regions showed evidence of purifying selection, while the early vs. late cell cycle proteins showed differences with late cycle proteins resembling influenza like immunological response, noting the g-proteins are among the late genes.", 
    "editor": [
      {
        "familyName": "Bartoletti", 
        "givenName": "Massimo", 
        "type": "Person"
      }, 
      {
        "familyName": "Barla", 
        "givenName": "Annalisa", 
        "type": "Person"
      }, 
      {
        "familyName": "Bracciali", 
        "givenName": "Andrea", 
        "type": "Person"
      }, 
      {
        "familyName": "Klau", 
        "givenName": "Gunnar W.", 
        "type": "Person"
      }, 
      {
        "familyName": "Peterson", 
        "givenName": "Leif", 
        "type": "Person"
      }, 
      {
        "familyName": "Policriti", 
        "givenName": "Alberto", 
        "type": "Person"
      }, 
      {
        "familyName": "Tagliaferri", 
        "givenName": "Roberto", 
        "type": "Person"
      }
    ], 
    "genre": "chapter", 
    "id": "sg:pub.10.1007/978-3-030-14160-8_3", 
    "isAccessibleForFree": false, 
    "isPartOf": {
      "isbn": [
        "978-3-030-14159-2", 
        "978-3-030-14160-8"
      ], 
      "name": "Computational Intelligence Methods for Bioinformatics and Biostatistics", 
      "type": "Book"
    }, 
    "keywords": [
      "random sequence", 
      "probability distribution", 
      "important characterization", 
      "primate mtDNA", 
      "mtDNA coding sequences", 
      "mer frequency", 
      "unclassified organisms", 
      "k-mers", 
      "k-mer frequencies", 
      "phylogeny construction", 
      "standard techniques", 
      "frequency distance", 
      "distance", 
      "probability", 
      "qualitative characterization", 
      "human papiloma virus", 
      "construction", 
      "multiple sequence alignment", 
      "RAxML", 
      "distribution", 
      "alignment length", 
      "frequency", 
      "evolutionary distance", 
      "genomic data", 
      "sequence", 
      "selection", 
      "technique", 
      "number", 
      "characterization", 
      "region", 
      "alignment", 
      "reservoir", 
      "related genomes", 
      "length", 
      "sequence alignment", 
      "variation", 
      "e.", 
      "data", 
      "types", 
      "early cell cycles", 
      "exploration", 
      "non-coding regions", 
      "volume", 
      "late genes", 
      "cycle proteins", 
      "samples", 
      "pressure", 
      "possible phylogenetic relationships", 
      "differences", 
      "cell cycle proteins", 
      "test", 
      "phylogenetic relationships", 
      "ability", 
      "phylogenetic distance", 
      "cycle", 
      "primate phylogeny", 
      "Mantel test", 
      "relationship", 
      "selection pressure", 
      "coding sequence", 
      "biological reservoir", 
      "cell cycle", 
      "chance", 
      "Ebola", 
      "proxy", 
      "phylogeny", 
      "protein", 
      "response", 
      "genes", 
      "genome", 
      "mtDNA", 
      "organisms", 
      "virus", 
      "pathogens", 
      "sufficient specificity", 
      "viral outbreaks", 
      "outbreak", 
      "evidence", 
      "specificity", 
      "immunological response", 
      "influenza", 
      "life", 
      "scores", 
      "HPV types", 
      "paper"
    ], 
    "name": "A Quantitative and Qualitative Characterization of k-mer Based Alignment-Free Phylogeny Construction", 
    "pagination": "19-31", 
    "productId": [
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1112112244"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1007/978-3-030-14160-8_3"
        ]
      }
    ], 
    "publisher": {
      "name": "Springer Nature", 
      "type": "Organisation"
    }, 
    "sameAs": [
      "https://doi.org/10.1007/978-3-030-14160-8_3", 
      "https://app.dimensions.ai/details/publication/pub.1112112244"
    ], 
    "sdDataset": "chapters", 
    "sdDatePublished": "2022-09-02T16:17", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-springernature-scigraph/baseset/20220902/entities/gbq_results/chapter/chapter_60.jsonl", 
    "type": "Chapter", 
    "url": "https://doi.org/10.1007/978-3-030-14160-8_3"
  }
]
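As a rough illustration of how a record like the one above might be consumed, the following Python sketch reads a trimmed excerpt of the JSON-LD and pulls out the title, authors, DOI, and page range. The `EXCERPT` string and the `summarize` helper are hypothetical names introduced here; only the field names (`name`, `author`, `productId`, `pagination`) come from the record itself.

```python
import json

# Trimmed excerpt of the record shown above (only the fields used below).
EXCERPT = r'''
[
  {
    "name": "A Quantitative and Qualitative Characterization of k-mer Based Alignment-Free Phylogeny Construction",
    "pagination": "19-31",
    "author": [
      {"givenName": "Filippo", "familyName": "Utro"},
      {"givenName": "Daniel E.", "familyName": "Platt"},
      {"givenName": "Laxmi", "familyName": "Parida"}
    ],
    "productId": [
      {"name": "dimensions_id", "value": ["pub.1112112244"]},
      {"name": "doi", "value": ["10.1007/978-3-030-14160-8_3"]}
    ]
  }
]
'''


def summarize(record_text):
    """Return title, author names, DOI and pages from a SciGraph-style JSON-LD array."""
    chapter = json.loads(record_text)[0]  # the record is a one-element array
    authors = [f"{a['givenName']} {a['familyName']}" for a in chapter.get("author", [])]
    # productId is a list of name/value pairs; take the first value of the "doi" entry.
    doi = next(
        (pid["value"][0] for pid in chapter.get("productId", []) if pid["name"] == "doi"),
        None,
    )
    return {
        "title": chapter["name"],
        "authors": authors,
        "doi": doi,
        "pages": chapter.get("pagination"),
    }
```

Plain `json` is enough here because JSON-LD is valid JSON; a dedicated JSON-LD processor would only be needed to expand terms against the `@context`.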
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-3-030-14160-8_3'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-3-030-14160-8_3'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-3-030-14160-8_3'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-3-030-14160-8_3'
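The four curl commands above differ only in the Accept header, so the same content negotiation can be sketched in Python with the standard library. The names `ACCEPT`, `build_request`, and `fetch` are hypothetical helpers introduced for this example; the URL and media types are copied from the commands above. Whether the endpoint still answers is not something this sketch can guarantee.

```python
from urllib.request import Request, urlopen

RECORD_URL = "https://scigraph.springernature.com/pub.10.1007/978-3-030-14160-8_3"

# Media types taken from the curl examples above.
ACCEPT = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt, url=RECORD_URL):
    """Build a GET request that asks the server for one RDF serialization."""
    return Request(url, headers={"Accept": ACCEPT[fmt]})


def fetch(fmt, url=RECORD_URL, timeout=30):
    """Perform the request and return the response body as text (needs network access)."""
    with urlopen(build_request(fmt, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `fetch("turtle")` corresponds to the `text/turtle` curl command; an unknown format key raises `KeyError` before any request is sent.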


 

This table displays all metadata directly associated with this object as RDF triples.

188 TRIPLES      22 PREDICATES      109 URIs      102 LITERALS      7 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1007/978-3-030-14160-8_3 schema:about anzsrc-for:01
2 anzsrc-for:0104
3 schema:author Ne150330dc1194eca863f9dff514c0fd3
4 schema:datePublished 2019-02-14
5 schema:datePublishedReg 2019-02-14
6 schema:description The rapidly growing volume of genomic data, including pathogens, both invites exploration of possible phylogenetic relationships among unclassified organisms, and challenges standard techniques that require multiple sequence alignment. Further, the ability to probe variations in selection pressure e.g. among viral outbreaks, is an important characterization of the life of a virus in its biological reservoir. In this paper, we derived the probability distribution of k-mer alignment lengths between random sequences for a given optimized score to quantify the probability that a given alignment was not better than chance, and applied it to Human Papiloma Virus (HPV), primate mtDNA, and Ebola. Even for highly variable HPV types, the number of k-mers required to significantly distinguish an alignment of related genomes from random sequences was reduced from 64 for 1-mers to 6 for 3-mers and 4 for 4-mers, indicating k-mers provide sufficient specificity to be able to characterize differences in sequences by their k-mer frequencies, allowing distances based on the k-mer frequencies to proxy for evolutionary distance. We computed mtDNA coding sequence and Ebola phylogeny construction. Primate mtDNA coding region k-mer UPGMA phylogenies reproduced most of the expected primate phylogeny. The Mantel test, applied to RAxML and Bayesian phylogenetic distances between Ebola samples versus 3-mer frequency distances, was highly significant ($\le 1\times 10^{-5}$). We characterized differences in selection pressure between coding and non-coding regions, and of selection in early cell cycle vs. late genes in Ebola. Coding versus non-coding regions showed evidence of purifying selection, while the early vs. late cell cycle proteins showed differences with late cycle proteins resembling influenza like immunological response, noting the g-proteins are among the late genes.
7 schema:editor N17739d857ce44f6abceba692eedb34eb
8 schema:genre chapter
9 schema:isAccessibleForFree false
10 schema:isPartOf N0ca8ece77e8c4027856498337e91a993
11 schema:keywords Ebola
12 HPV types
13 Mantel test
14 RAxML
15 ability
16 alignment
17 alignment length
18 biological reservoir
19 cell cycle
20 cell cycle proteins
21 chance
22 characterization
23 coding sequence
24 construction
25 cycle
26 cycle proteins
27 data
28 differences
29 distance
30 distribution
31 e.
32 early cell cycles
33 evidence
34 evolutionary distance
35 exploration
36 frequency
37 frequency distance
38 genes
39 genome
40 genomic data
41 human papiloma virus
42 immunological response
43 important characterization
44 influenza
45 k-mer frequencies
46 k-mers
47 late genes
48 length
49 life
50 mer frequency
51 mtDNA
52 mtDNA coding sequences
53 multiple sequence alignment
54 non-coding regions
55 number
56 organisms
57 outbreak
58 paper
59 pathogens
60 phylogenetic distance
61 phylogenetic relationships
62 phylogeny
63 phylogeny construction
64 possible phylogenetic relationships
65 pressure
66 primate mtDNA
67 primate phylogeny
68 probability
69 probability distribution
70 protein
71 proxy
72 qualitative characterization
73 random sequence
74 region
75 related genomes
76 relationship
77 reservoir
78 response
79 samples
80 scores
81 selection
82 selection pressure
83 sequence
84 sequence alignment
85 specificity
86 standard techniques
87 sufficient specificity
88 technique
89 test
90 types
91 unclassified organisms
92 variation
93 viral outbreaks
94 virus
95 volume
96 schema:name A Quantitative and Qualitative Characterization of k-mer Based Alignment-Free Phylogeny Construction
97 schema:pagination 19-31
98 schema:productId N9bab25640c854db6a51c264c0f9272f1
99 Nd63dc005fd0243ccb85e1982e99ee14b
100 schema:publisher N6d271c7d32aa4148b8764b8eb9ee6d6f
101 schema:sameAs https://app.dimensions.ai/details/publication/pub.1112112244
102 https://doi.org/10.1007/978-3-030-14160-8_3
103 schema:sdDatePublished 2022-09-02T16:17
104 schema:sdLicense https://scigraph.springernature.com/explorer/license/
105 schema:sdPublisher N50e19b93e4f64520924b7764e92192b7
106 schema:url https://doi.org/10.1007/978-3-030-14160-8_3
107 sgo:license sg:explorer/license/
108 sgo:sdDataset chapters
109 rdf:type schema:Chapter
110 N0ca8ece77e8c4027856498337e91a993 schema:isbn 978-3-030-14159-2
111 978-3-030-14160-8
112 schema:name Computational Intelligence Methods for Bioinformatics and Biostatistics
113 rdf:type schema:Book
114 N17739d857ce44f6abceba692eedb34eb rdf:first N52e3d7d114b74849b26b1201c9725367
115 rdf:rest N902cb1f35ab44ffbb04c55460aa905f1
116 N19f6434be0314c649c1aeaf3f8212202 rdf:first sg:person.01332106363.98
117 rdf:rest N71a3bd1afecb40ffbcdec94e7f5f09fd
118 N352768984aa9427aaedbc4e597fb80cf schema:familyName Policriti
119 schema:givenName Alberto
120 rdf:type schema:Person
121 N3c157aea36f24fdcbbfa43097d06ff37 schema:familyName Tagliaferri
122 schema:givenName Roberto
123 rdf:type schema:Person
124 N3ce2e0412fc24d36adf8f14b373479f6 rdf:first N4e33083de79a4f35bc7226740b32adbb
125 rdf:rest N72954dd2a9a542beba0344c6889952db
126 N4e33083de79a4f35bc7226740b32adbb schema:familyName Bracciali
127 schema:givenName Andrea
128 rdf:type schema:Person
129 N50e19b93e4f64520924b7764e92192b7 schema:name Springer Nature - SN SciGraph project
130 rdf:type schema:Organization
131 N52e3d7d114b74849b26b1201c9725367 schema:familyName Bartoletti
132 schema:givenName Massimo
133 rdf:type schema:Person
134 N5a598da4417445efbd15bda9f5cb6028 schema:familyName Barla
135 schema:givenName Annalisa
136 rdf:type schema:Person
137 N619f809e3c5042e49edfc05aca83812a rdf:first N9aa0a028fd3342f1b5a333b42019a8ca
138 rdf:rest Na08a5c45da5f4f1f8a506ffc75eff3f8
139 N6d271c7d32aa4148b8764b8eb9ee6d6f schema:name Springer Nature
140 rdf:type schema:Organisation
141 N71a3bd1afecb40ffbcdec94e7f5f09fd rdf:first sg:person.01336557015.68
142 rdf:rest rdf:nil
143 N72954dd2a9a542beba0344c6889952db rdf:first Nd3a1326ed1d240989f62f3c346b9ed28
144 rdf:rest N619f809e3c5042e49edfc05aca83812a
145 N8af2c90e37aa4e31aa93d48b4046f3f8 rdf:first N3c157aea36f24fdcbbfa43097d06ff37
146 rdf:rest rdf:nil
147 N902cb1f35ab44ffbb04c55460aa905f1 rdf:first N5a598da4417445efbd15bda9f5cb6028
148 rdf:rest N3ce2e0412fc24d36adf8f14b373479f6
149 N9aa0a028fd3342f1b5a333b42019a8ca schema:familyName Peterson
150 schema:givenName Leif
151 rdf:type schema:Person
152 N9bab25640c854db6a51c264c0f9272f1 schema:name dimensions_id
153 schema:value pub.1112112244
154 rdf:type schema:PropertyValue
155 Na08a5c45da5f4f1f8a506ffc75eff3f8 rdf:first N352768984aa9427aaedbc4e597fb80cf
156 rdf:rest N8af2c90e37aa4e31aa93d48b4046f3f8
157 Nd3a1326ed1d240989f62f3c346b9ed28 schema:familyName Klau
158 schema:givenName Gunnar W.
159 rdf:type schema:Person
160 Nd63dc005fd0243ccb85e1982e99ee14b schema:name doi
161 schema:value 10.1007/978-3-030-14160-8_3
162 rdf:type schema:PropertyValue
163 Ne150330dc1194eca863f9dff514c0fd3 rdf:first sg:person.01176571007.38
164 rdf:rest N19f6434be0314c649c1aeaf3f8212202
165 anzsrc-for:01 schema:inDefinedTermSet anzsrc-for:
166 schema:name Mathematical Sciences
167 rdf:type schema:DefinedTerm
168 anzsrc-for:0104 schema:inDefinedTermSet anzsrc-for:
169 schema:name Statistics
170 rdf:type schema:DefinedTerm
171 sg:person.01176571007.38 schema:affiliation grid-institutes:None
172 schema:familyName Utro
173 schema:givenName Filippo
174 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01176571007.38
175 rdf:type schema:Person
176 sg:person.01332106363.98 schema:affiliation grid-institutes:None
177 schema:familyName Platt
178 schema:givenName Daniel E.
179 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01332106363.98
180 rdf:type schema:Person
181 sg:person.01336557015.68 schema:affiliation grid-institutes:None
182 schema:familyName Parida
183 schema:givenName Laxmi
184 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01336557015.68
185 rdf:type schema:Person
186 grid-institutes:None schema:alternateName Computational Biology Center, IBM T. J. Watson Research, 10598, Yorktown Heights, NY, USA
187 schema:name Computational Biology Center, IBM T. J. Watson Research, 10598, Yorktown Heights, NY, USA
188 rdf:type schema:Organization
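In the table above, the ordered author list is encoded as an RDF collection: each blank node carries an rdf:first (the item) and an rdf:rest (the next node), ending at rdf:nil. The following Python sketch walks that chain for the schema:author list. The dictionary layout and the `walk_rdf_list` helper are assumptions made for illustration; the node identifiers are copied from the rows above.

```python
RDF_NIL = "rdf:nil"

# node -> (rdf:first, rdf:rest), copied from the author-list rows in the table above.
LIST_NODES = {
    "Ne150330dc1194eca863f9dff514c0fd3": (
        "sg:person.01176571007.38",
        "N19f6434be0314c649c1aeaf3f8212202",
    ),
    "N19f6434be0314c649c1aeaf3f8212202": (
        "sg:person.01332106363.98",
        "N71a3bd1afecb40ffbcdec94e7f5f09fd",
    ),
    "N71a3bd1afecb40ffbcdec94e7f5f09fd": (
        "sg:person.01336557015.68",
        RDF_NIL,
    ),
}


def walk_rdf_list(head, nodes):
    """Follow rdf:first / rdf:rest links from head until rdf:nil, keeping order."""
    items = []
    seen = set()
    node = head
    while node != RDF_NIL:
        if node in seen:  # guard against a malformed, cyclic list
            raise ValueError(f"cycle detected at {node}")
        seen.add(node)
        first, rest = nodes[node]
        items.append(first)
        node = rest
    return items
```

Starting from the object of schema:author (row 3), the walk yields the three person identifiers in the same order as the author list at the top of the record. The schema:editor list (row 7) is chained the same way.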
 



