Resolving Multicopy Duplications de novo Using Polyploid Phasing


Ontology type: schema:Chapter      Open Access: True


Chapter Info

DATE

2017-04-12

AUTHORS

Mark J. Chaisson , Sudipto Mukherjee , Sreeram Kannan , Evan E. Eichler

ABSTRACT

While the rise of single-molecule sequencing systems has enabled an unprecedented rise in the ability to assemble complex regions of the genome, long segmental duplications in the genome still remain a challenging frontier in assembly. Segmental duplications are at the same time both gene rich and prone to large structural rearrangements, making the resolution of their sequences important in medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog-specific variants. In this paper, we study the problem of resolving the variations in multicopy, long segmental duplications by developing and utilizing algorithms for polyploid phasing. We develop two algorithms: the first one is targeted at maximizing the likelihood of observing the reads given the underlying haplotypes using discrete matrix completion. The second algorithm is based on correlation clustering and exploits an assumption, which is often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., what fraction of the reads are clustered correctly. In both the performance metrics, we find that our algorithms dominate existing algorithms on more than 93% of the datasets. While the discrete matrix completion performs better on likelihood score, the correlation-clustering algorithm performs better on reconstruction accuracy due to the stronger regularization inherent in the algorithm. We also show that our correlation-clustering algorithm can reconstruct on average 7.0 haplotypes in 10-copy duplication datasets whereas existing algorithms reconstruct less than one copy on average.
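
The abstract defines reconstruction accuracy as the fraction of reads that are clustered correctly. The sketch below is one plausible way to compute such a score, not the chapter's own procedure: it assumes cluster labels are arbitrary, so predicted clusters are first matched to true paralog labels by a brute-force search over label assignments (only practical for a small number of copies). The function name and the toy labels are hypothetical.

```python
from itertools import permutations


def reconstruction_accuracy(true_labels, predicted_labels):
    """Fraction of reads clustered correctly under the best label matching.

    Assumption: each predicted cluster is mapped to at most one true
    paralog label, and the mapping that maximizes agreement is used.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        return 0.0

    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(predicted_labels))
    # Pad with None so surplus predicted clusters can stay unmatched.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))

    best = 0
    for assignment in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, assignment))
        correct = sum(
            mapping[pred] == true
            for true, pred in zip(true_labels, predicted_labels)
        )
        best = max(best, correct)
    return best / len(true_labels)


# Toy example: six reads from three paralogs, one read misplaced.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = ["a", "a", "b", "b", "b", "c"]
toy_accuracy = reconstruction_accuracy(toy_true, toy_pred)
```

For many copies, the exhaustive search would normally be replaced by an assignment solver; it is kept here only to keep the sketch dependency-free.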

PAGES

117-133

Identifiers

URI

http://scigraph.springernature.com/pub.10.1007/978-3-319-56970-3_8

DOI

http://dx.doi.org/10.1007/978-3-319-56970-3_8

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1086882567

PUBMED

https://www.ncbi.nlm.nih.gov/pubmed/28808695



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Artificial Intelligence and Image Processing", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "Department of Genome Sciences, University of Washington, 98195, Seattle, Washington, USA", 
          "id": "http://www.grid.ac/institutes/grid.34477.33", 
          "name": [
            "Department of Genome Sciences, University of Washington, 98195, Seattle, Washington, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Chaisson", 
        "givenName": "Mark J.", 
        "id": "sg:person.012610254333.24", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012610254333.24"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Department of Electrical Engineering, University of Washington, 98195, Seattle, Washington, USA", 
          "id": "http://www.grid.ac/institutes/grid.34477.33", 
          "name": [
            "Department of Electrical Engineering, University of Washington, 98195, Seattle, Washington, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Mukherjee", 
        "givenName": "Sudipto", 
        "id": "sg:person.013662602567.42", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013662602567.42"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Department of Electrical Engineering, University of Washington, 98195, Seattle, Washington, USA", 
          "id": "http://www.grid.ac/institutes/grid.34477.33", 
          "name": [
            "Department of Electrical Engineering, University of Washington, 98195, Seattle, Washington, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Kannan", 
        "givenName": "Sreeram", 
        "id": "sg:person.015735307063.16", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015735307063.16"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Howard Hughes Medical Institute, University of Washington, 98195, Seattle, Washington, USA", 
          "id": "http://www.grid.ac/institutes/grid.34477.33", 
          "name": [
            "Department of Genome Sciences, University of Washington, 98195, Seattle, Washington, USA", 
            "Howard Hughes Medical Institute, University of Washington, 98195, Seattle, Washington, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Eichler", 
        "givenName": "Evan E.", 
        "id": "sg:person.0705101106.89", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0705101106.89"
        ], 
        "type": "Person"
      }
    ], 
    "datePublished": "2017-04-12", 
    "datePublishedReg": "2017-04-12", 
    "description": "While the rise of single-molecule sequencing systems has enabled an unprecedented rise in the ability to assemble complex regions of the genome, long segmental duplications in the genome still remain a challenging frontier in assembly. Segmental duplications are at the same time both gene rich and prone to large structural rearrangements, making the resolution of their sequences important in medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog-specific variants. In this paper, we study the problem of resolving the variations in multicopy, long segmental duplications by developing and utilizing algorithms for polyploid phasing. We develop two algorithms: the first one is targeted at maximizing the likelihood of observing the reads given the underlying haplotypes using discrete matrix completion. The second algorithm is based on correlation clustering and exploits an assumption, which is often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., what fraction of the reads are clustered correctly. In both the performance metrics, we find that our algorithms dominate existing algorithms on more than 93% of the datasets. While the discrete matrix completion performs better on likelihood score, the correlation-clustering algorithm performs better on reconstruction accuracy due to the stronger regularization inherent in the algorithm. We also show that our correlation-clustering algorithm can reconstruct on average 7.0 haplotypes in 10-copy duplication datasets whereas existing algorithms reconstruct less than one copy on average.", 
    "editor": [
      {
        "familyName": "Sahinalp", 
        "givenName": "S. Cenk", 
        "type": "Person"
      }
    ], 
    "genre": "chapter", 
    "id": "sg:pub.10.1007/978-3-319-56970-3_8", 
    "inLanguage": "en", 
    "isAccessibleForFree": true, 
    "isPartOf": {
      "isbn": [
        "978-3-319-56969-7", 
        "978-3-319-56970-3"
      ], 
      "name": "Research in Computational Molecular Biology", 
      "type": "Book"
    }, 
    "keywords": [
      "long segmental duplications", 
      "reconstruction accuracy", 
      "matrix completion", 
      "segmental duplications", 
      "likelihood scores", 
      "second algorithm", 
      "de novo assembly", 
      "algorithm", 
      "performance metrics", 
      "strong regularization", 
      "large structural rearrangements", 
      "datasets", 
      "superior performance", 
      "evolutionary studies", 
      "novo assembly", 
      "simulation methodology", 
      "detailed simulation methodology", 
      "duplication", 
      "sequencing system", 
      "genome", 
      "structural rearrangements", 
      "accuracy", 
      "sequence", 
      "same time", 
      "reads", 
      "haplotypes", 
      "metrics", 
      "paralogs", 
      "assembly", 
      "complex region", 
      "multicopy", 
      "regularization", 
      "genes", 
      "variants", 
      "performance", 
      "copies", 
      "novo", 
      "system", 
      "methodology", 
      "rearrangement", 
      "phasing", 
      "region", 
      "variation", 
      "number", 
      "frontier", 
      "time", 
      "ability", 
      "array", 
      "assumption", 
      "resolution", 
      "completion", 
      "sizable number", 
      "unprecedented rise", 
      "rise", 
      "fraction", 
      "study", 
      "likelihood", 
      "correlation", 
      "scores", 
      "problem", 
      "paper", 
      "single-molecule sequencing systems", 
      "mammalian de novo assemblies", 
      "paralog-specific variants", 
      "polyploid phasing", 
      "discrete matrix completion", 
      "correlation-clustering algorithm", 
      "duplication datasets", 
      "Resolving Multicopy Duplications", 
      "Multicopy Duplications"
    ], 
    "name": "Resolving Multicopy Duplications de novo Using Polyploid Phasing", 
    "pagination": "117-133", 
    "productId": [
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1086882567"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1007/978-3-319-56970-3_8"
        ]
      }, 
      {
        "name": "pubmed_id", 
        "type": "PropertyValue", 
        "value": [
          "28808695"
        ]
      }
    ], 
    "publisher": {
      "name": "Springer Nature", 
      "type": "Organisation"
    }, 
    "sameAs": [
      "https://doi.org/10.1007/978-3-319-56970-3_8", 
      "https://app.dimensions.ai/details/publication/pub.1086882567"
    ], 
    "sdDataset": "chapters", 
    "sdDatePublished": "2021-12-01T19:56", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-springernature-scigraph/baseset/20211201/entities/gbq_results/chapter/chapter_121.jsonl", 
    "type": "Chapter", 
    "url": "https://doi.org/10.1007/978-3-319-56970-3_8"
  }
]
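
Because JSON-LD is valid JSON, a record like the one above can be read with a standard JSON parser. The following is a minimal sketch that pulls out the title, author names, and identifiers; `RECORD_EXCERPT` is a hand-trimmed copy of a few fields from the record, and the `summarize` helper is a hypothetical name introduced for illustration.

```python
import json

# Trimmed excerpt of the record above; only the fields used below are kept.
RECORD_EXCERPT = """
[
  {
    "author": [
      {"familyName": "Chaisson", "givenName": "Mark J.", "type": "Person"},
      {"familyName": "Mukherjee", "givenName": "Sudipto", "type": "Person"},
      {"familyName": "Kannan", "givenName": "Sreeram", "type": "Person"},
      {"familyName": "Eichler", "givenName": "Evan E.", "type": "Person"}
    ],
    "name": "Resolving Multicopy Duplications de novo Using Polyploid Phasing",
    "pagination": "117-133",
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1086882567"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1007/978-3-319-56970-3_8"]},
      {"name": "pubmed_id", "type": "PropertyValue", "value": ["28808695"]}
    ]
  }
]
"""


def summarize(record_json):
    """Return title, author names, and identifiers from a SciGraph-style record."""
    record = json.loads(record_json)[0]  # the document is a one-element array
    authors = [
        f"{person['givenName']} {person['familyName']}"
        for person in record.get("author", [])
    ]
    # Each productId entry holds its value in a single-element list.
    identifiers = {
        entry["name"]: entry["value"][0] for entry in record.get("productId", [])
    }
    return {"title": record["name"], "authors": authors, "identifiers": identifiers}


summary = summarize(RECORD_EXCERPT)
```

This treats the record as plain JSON and ignores the `@context`; a JSON-LD processor would be needed to expand the keys into full IRIs.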
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-3-319-56970-3_8'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-3-319-56970-3_8'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-3-319-56970-3_8'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-3-319-56970-3_8'
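
The four curl commands differ only in the `Accept` header, so the same content negotiation can be sketched in Python with the standard library. This is an illustrative equivalent, assuming the endpoint shown above is still served; `build_request`, `fetch`, and the format keys in `ACCEPT_TYPES` are hypothetical names, while the URL and media types are taken from the commands above.

```python
import urllib.request

BASE_URL = "https://scigraph.springernature.com/"

# Media types copied from the curl examples; the short keys are our own labels.
ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(record_id, fmt="json-ld"):
    """Build a GET request that asks for the record in the chosen RDF format."""
    return urllib.request.Request(
        BASE_URL + record_id,
        headers={"Accept": ACCEPT_TYPES[fmt]},
    )


def fetch(record_id, fmt="json-ld", timeout=30):
    """Send the request and return the response body as text (live network call)."""
    with urllib.request.urlopen(build_request(record_id, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the request does not touch the network; only fetch() does.
turtle_request = build_request("pub.10.1007/978-3-319-56970-3_8", "turtle")
```

Calling `fetch("pub.10.1007/978-3-319-56970-3_8", "json-ld")` would then correspond to the first curl command.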


 

This table displays all metadata directly associated with this object as RDF triples.

159 TRIPLES      23 PREDICATES      96 URIs      89 LITERALS      8 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1007/978-3-319-56970-3_8 schema:about anzsrc-for:08
2 anzsrc-for:0801
3 schema:author N77c7071b948d4367923854123f206b60
4 schema:datePublished 2017-04-12
5 schema:datePublishedReg 2017-04-12
6 schema:description While the rise of single-molecule sequencing systems has enabled an unprecedented rise in the ability to assemble complex regions of the genome, long segmental duplications in the genome still remain a challenging frontier in assembly. Segmental duplications are at the same time both gene rich and prone to large structural rearrangements, making the resolution of their sequences important in medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog-specific variants. In this paper, we study the problem of resolving the variations in multicopy, long segmental duplications by developing and utilizing algorithms for polyploid phasing. We develop two algorithms: the first one is targeted at maximizing the likelihood of observing the reads given the underlying haplotypes using discrete matrix completion. The second algorithm is based on correlation clustering and exploits an assumption, which is often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., what fraction of the reads are clustered correctly. In both the performance metrics, we find that our algorithms dominate existing algorithms on more than 93% of the datasets. While the discrete matrix completion performs better on likelihood score, the correlation-clustering algorithm performs better on reconstruction accuracy due to the stronger regularization inherent in the algorithm. We also show that our correlation-clustering algorithm can reconstruct on average 7.0 haplotypes in 10-copy duplication datasets whereas existing algorithms reconstruct less than one copy on average.
7 schema:editor N7da062b526db44849acd723cb6fdd27a
8 schema:genre chapter
9 schema:inLanguage en
10 schema:isAccessibleForFree true
11 schema:isPartOf N39acf03e1bd34241ac47a6d61d91ba11
12 schema:keywords Multicopy Duplications
13 Resolving Multicopy Duplications
14 ability
15 accuracy
16 algorithm
17 array
18 assembly
19 assumption
20 completion
21 complex region
22 copies
23 correlation
24 correlation-clustering algorithm
25 datasets
26 de novo assembly
27 detailed simulation methodology
28 discrete matrix completion
29 duplication
30 duplication datasets
31 evolutionary studies
32 fraction
33 frontier
34 genes
35 genome
36 haplotypes
37 large structural rearrangements
38 likelihood
39 likelihood scores
40 long segmental duplications
41 mammalian de novo assemblies
42 matrix completion
43 methodology
44 metrics
45 multicopy
46 novo
47 novo assembly
48 number
49 paper
50 paralog-specific variants
51 paralogs
52 performance
53 performance metrics
54 phasing
55 polyploid phasing
56 problem
57 reads
58 rearrangement
59 reconstruction accuracy
60 region
61 regularization
62 resolution
63 rise
64 same time
65 scores
66 second algorithm
67 segmental duplications
68 sequence
69 sequencing system
70 simulation methodology
71 single-molecule sequencing systems
72 sizable number
73 strong regularization
74 structural rearrangements
75 study
76 superior performance
77 system
78 time
79 unprecedented rise
80 variants
81 variation
82 schema:name Resolving Multicopy Duplications de novo Using Polyploid Phasing
83 schema:pagination 117-133
84 schema:productId Na5f05640d2a943059774e64860ba92b6
85 Nae2ce59aba774d73a8929a13c8158197
86 Nb6628ed6764e407e808decc2a2f1000d
87 schema:publisher N065c20fba4d34413a2230ef87b8d15d3
88 schema:sameAs https://app.dimensions.ai/details/publication/pub.1086882567
89 https://doi.org/10.1007/978-3-319-56970-3_8
90 schema:sdDatePublished 2021-12-01T19:56
91 schema:sdLicense https://scigraph.springernature.com/explorer/license/
92 schema:sdPublisher N4e6d84b636144e7ea87ac87b249431e0
93 schema:url https://doi.org/10.1007/978-3-319-56970-3_8
94 sgo:license sg:explorer/license/
95 sgo:sdDataset chapters
96 rdf:type schema:Chapter
97 N065c20fba4d34413a2230ef87b8d15d3 schema:name Springer Nature
98 rdf:type schema:Organisation
99 N39acf03e1bd34241ac47a6d61d91ba11 schema:isbn 978-3-319-56969-7
100 978-3-319-56970-3
101 schema:name Research in Computational Molecular Biology
102 rdf:type schema:Book
103 N4e6d84b636144e7ea87ac87b249431e0 schema:name Springer Nature - SN SciGraph project
104 rdf:type schema:Organization
105 N5c4ef2a3d834455f88820465966abed9 rdf:first sg:person.0705101106.89
106 rdf:rest rdf:nil
107 N5e64a7f65b784609a713d2467b104951 rdf:first sg:person.015735307063.16
108 rdf:rest N5c4ef2a3d834455f88820465966abed9
109 N77c7071b948d4367923854123f206b60 rdf:first sg:person.012610254333.24
110 rdf:rest Na7c6074132ca4f0e87f44ee3ef5d56ff
111 N7834886be307433188eca8deef87976f schema:familyName Sahinalp
112 schema:givenName S. Cenk
113 rdf:type schema:Person
114 N7da062b526db44849acd723cb6fdd27a rdf:first N7834886be307433188eca8deef87976f
115 rdf:rest rdf:nil
116 Na5f05640d2a943059774e64860ba92b6 schema:name pubmed_id
117 schema:value 28808695
118 rdf:type schema:PropertyValue
119 Na7c6074132ca4f0e87f44ee3ef5d56ff rdf:first sg:person.013662602567.42
120 rdf:rest N5e64a7f65b784609a713d2467b104951
121 Nae2ce59aba774d73a8929a13c8158197 schema:name dimensions_id
122 schema:value pub.1086882567
123 rdf:type schema:PropertyValue
124 Nb6628ed6764e407e808decc2a2f1000d schema:name doi
125 schema:value 10.1007/978-3-319-56970-3_8
126 rdf:type schema:PropertyValue
127 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
128 schema:name Information and Computing Sciences
129 rdf:type schema:DefinedTerm
130 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
131 schema:name Artificial Intelligence and Image Processing
132 rdf:type schema:DefinedTerm
133 sg:person.012610254333.24 schema:affiliation grid-institutes:grid.34477.33
134 schema:familyName Chaisson
135 schema:givenName Mark J.
136 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012610254333.24
137 rdf:type schema:Person
138 sg:person.013662602567.42 schema:affiliation grid-institutes:grid.34477.33
139 schema:familyName Mukherjee
140 schema:givenName Sudipto
141 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013662602567.42
142 rdf:type schema:Person
143 sg:person.015735307063.16 schema:affiliation grid-institutes:grid.34477.33
144 schema:familyName Kannan
145 schema:givenName Sreeram
146 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015735307063.16
147 rdf:type schema:Person
148 sg:person.0705101106.89 schema:affiliation grid-institutes:grid.34477.33
149 schema:familyName Eichler
150 schema:givenName Evan E.
151 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0705101106.89
152 rdf:type schema:Person
153 grid-institutes:grid.34477.33 schema:alternateName Department of Electrical Engineering, University of Washington, 98195, Seattle, Washington, USA
154 Department of Genome Sciences, University of Washington, 98195, Seattle, Washington, USA
155 Howard Hughes Medical Institute, University of Washington, 98195, Seattle, Washington, USA
156 schema:name Department of Electrical Engineering, University of Washington, 98195, Seattle, Washington, USA
157 Department of Genome Sciences, University of Washington, 98195, Seattle, Washington, USA
158 Howard Hughes Medical Institute, University of Washington, 98195, Seattle, Washington, USA
159 rdf:type schema:Organization
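
In the table above, the ordered author list is stored as an RDF collection: `schema:author` points at a blank node, and each blank node carries an `rdf:first` (the item) and an `rdf:rest` (the next node, or `rdf:nil` at the end). A minimal sketch of recovering the author order from those rows follows; the `LIST_NODES` dictionary is transcribed by hand from rows 105 to 110 and 119 to 120, and `walk_rdf_list` is a hypothetical helper name.

```python
RDF_NIL = "rdf:nil"

# blank node -> (rdf:first, rdf:rest), transcribed from the triple table.
LIST_NODES = {
    "N5c4ef2a3d834455f88820465966abed9": ("sg:person.0705101106.89", RDF_NIL),
    "N5e64a7f65b784609a713d2467b104951": (
        "sg:person.015735307063.16",
        "N5c4ef2a3d834455f88820465966abed9",
    ),
    "N77c7071b948d4367923854123f206b60": (
        "sg:person.012610254333.24",
        "Na7c6074132ca4f0e87f44ee3ef5d56ff",
    ),
    "Na7c6074132ca4f0e87f44ee3ef5d56ff": (
        "sg:person.013662602567.42",
        "N5e64a7f65b784609a713d2467b104951",
    ),
}


def walk_rdf_list(nodes, head):
    """Follow rdf:rest links from head, collecting each rdf:first in order."""
    items = []
    seen = set()
    node = head
    while node != RDF_NIL:
        if node in seen:
            raise ValueError(f"cycle detected at {node}")
        seen.add(node)
        first, rest = nodes[node]
        items.append(first)
        node = rest
    return items


# The head node is the object of schema:author in row 3 of the table.
author_order = walk_rdf_list(LIST_NODES, "N77c7071b948d4367923854123f206b60")
```

The resulting order matches the author order given in the chapter info (Chaisson, Mukherjee, Kannan, Eichler).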
 



