A Fast, Provably Accurate Approximation Algorithm for Sparse Principal Component Analysis Reveals Human Genetic Variation Across the World


Ontology type: schema:Chapter
Open Access: True


Chapter Info

DATE

2022-04-29

AUTHORS

Agniva Chowdhury , Aritra Bose , Samson Zhou , David P. Woodruff , Petros Drineas

ABSTRACT

Principal component analysis (PCA) is a widely used dimensionality reduction technique in machine learning and multivariate statistics. To improve the interpretability of PCA, various approaches to obtain sparse principal direction loadings have been proposed, which are termed Sparse Principal Component Analysis (SPCA). In this paper, we present ThreSPCA, a provably accurate algorithm based on thresholding the Singular Value Decomposition for the SPCA problem, without imposing any restrictive assumptions on the input covariance matrix. Our thresholding algorithm is conceptually simple; much faster than current state-of-the-art; and performs well in practice. When applied to genotype data from the 1000 Genomes Project, ThreSPCA is faster than previous benchmarks, at least as accurate, and leads to a set of interpretable biomarkers, revealing genetic diversity across the world.

PAGES

86-106

Identifiers

URI

http://scigraph.springernature.com/pub.10.1007/978-3-031-04749-7_6

DOI

http://dx.doi.org/10.1007/978-3-031-04749-7_6

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1147802318



JSON-LD is the canonical representation for SciGraph data.

[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/06", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Biological Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0604", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Genetics", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA", 
          "id": "http://www.grid.ac/institutes/grid.135519.a", 
          "name": [
            "Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Chowdhury", 
        "givenName": "Agniva", 
        "id": "sg:person.016112654201.64", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.016112654201.64"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Computational Genomics, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA", 
          "id": "http://www.grid.ac/institutes/grid.481554.9", 
          "name": [
            "Computational Genomics, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Bose", 
        "givenName": "Aritra", 
        "id": "sg:person.015510615227.17", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015510615227.17"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA", 
          "id": "http://www.grid.ac/institutes/grid.147455.6", 
          "name": [
            "School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Zhou", 
        "givenName": "Samson", 
        "id": "sg:person.011661310235.28", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011661310235.28"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA", 
          "id": "http://www.grid.ac/institutes/grid.147455.6", 
          "name": [
            "School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Woodruff", 
        "givenName": "David P.", 
        "id": "sg:person.012727410605.86", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012727410605.86"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Department of Computer Science, Purdue University, West Lafayette, IN, USA", 
          "id": "http://www.grid.ac/institutes/grid.169077.e", 
          "name": [
            "Department of Computer Science, Purdue University, West Lafayette, IN, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Drineas", 
        "givenName": "Petros", 
        "id": "sg:person.011256317073.58", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011256317073.58"
        ], 
        "type": "Person"
      }
    ], 
    "datePublished": "2022-04-29", 
    "datePublishedReg": "2022-04-29", 
    "description": "Principal component analysis (PCA) is a widely used dimensionality reduction technique in machine learning and multivariate statistics. To improve the interpretability of PCA, various approaches to obtain sparse principal direction loadings have been proposed, which are termed Sparse Principal Component Analysis (SPCA). In this paper, we present ThreSPCA, a provably accurate algorithm based on thresholding the Singular Value Decomposition for the SPCA problem, without imposing any restrictive assumptions on the input covariance matrix. Our thresholding algorithm is conceptually simple; much faster than current state-of-the-art; and performs well in practice. When applied to genotype data from the 1000 Genomes Project, ThreSPCA is faster than previous benchmarks, at least as accurate, and leads to a set of interpretable biomarkers, revealing genetic diversity across the world.", 
    "editor": [
      {
        "familyName": "Pe'er", 
        "givenName": "Itsik", 
        "type": "Person"
      }
    ], 
    "genre": "chapter", 
    "id": "sg:pub.10.1007/978-3-031-04749-7_6", 
    "isAccessibleForFree": true, 
    "isPartOf": {
      "isbn": [
        "978-3-031-04748-0", 
        "978-3-031-04749-7"
      ], 
      "name": "Research in Computational Molecular Biology", 
      "type": "Book"
    }, 
    "keywords": [
      "sparse principal component analysis", 
      "input covariance matrix", 
      "singular value decomposition", 
      "accurate approximation algorithm", 
      "covariance matrix", 
      "approximation algorithm", 
      "value decomposition", 
      "restrictive assumptions", 
      "accurate algorithm", 
      "dimensionality reduction techniques", 
      "reduction techniques", 
      "previous benchmarks", 
      "algorithm", 
      "thresholding algorithm", 
      "interpretable biomarkers", 
      "multivariate statistics", 
      "principal component analysis", 
      "machine learning", 
      "component analysis", 
      "statistics", 
      "problem", 
      "assumption", 
      "matrix", 
      "interpretability", 
      "set", 
      "decomposition", 
      "current state", 
      "genotype data", 
      "benchmarks", 
      "approach", 
      "analysis", 
      "technique", 
      "state", 
      "variation", 
      "direction loading", 
      "data", 
      "learning", 
      "art", 
      "loading", 
      "project", 
      "world", 
      "Genome Project", 
      "practice", 
      "diversity", 
      "human genetic variation", 
      "genetic diversity", 
      "genetic variation", 
      "biomarkers", 
      "paper"
    ], 
    "name": "A Fast, Provably Accurate Approximation Algorithm for Sparse Principal Component Analysis Reveals Human Genetic Variation Across the World", 
    "pagination": "86-106", 
    "productId": [
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1147802318"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1007/978-3-031-04749-7_6"
        ]
      }
    ], 
    "publisher": {
      "name": "Springer Nature", 
      "type": "Organisation"
    }, 
    "sameAs": [
      "https://doi.org/10.1007/978-3-031-04749-7_6", 
      "https://app.dimensions.ai/details/publication/pub.1147802318"
    ], 
    "sdDataset": "chapters", 
    "sdDatePublished": "2022-11-24T21:13", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-springernature-scigraph/baseset/20221124/entities/gbq_results/chapter/chapter_191.jsonl", 
    "type": "Chapter", 
    "url": "https://doi.org/10.1007/978-3-031-04749-7_6"
  }
]
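
As a minimal sketch of how a record shaped like the one above could be read, the following Python snippet pulls out the author names, DOI, and page range. The embedded string is a trimmed copy of the record (only the fields used are kept), and the function name summarize is hypothetical, chosen for illustration only.

```python
import json

# Trimmed copy of the JSON-LD record above; only the fields read below are kept.
RECORD_JSON = """
[
  {
    "author": [
      {"familyName": "Chowdhury", "givenName": "Agniva"},
      {"familyName": "Bose", "givenName": "Aritra"},
      {"familyName": "Zhou", "givenName": "Samson"},
      {"familyName": "Woodruff", "givenName": "David P."},
      {"familyName": "Drineas", "givenName": "Petros"}
    ],
    "pagination": "86-106",
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1147802318"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1007/978-3-031-04749-7_6"]}
    ]
  }
]
"""


def summarize(record_json: str) -> dict:
    """Return authors, DOI, and pages from a SciGraph-style JSON-LD array (hypothetical helper)."""
    # The payload is a JSON array holding a single record object.
    record = json.loads(record_json)[0]
    authors = [f"{a['givenName']} {a['familyName']}" for a in record["author"]]
    # Each productId entry carries a name and a one-element list of values.
    ids = {p["name"]: p["value"][0] for p in record["productId"]}
    return {"authors": authors, "doi": ids.get("doi"), "pages": record["pagination"]}


summary = summarize(RECORD_JSON)
print(summary["authors"])
print(summary["doi"])
```

Because JSON-LD is plain JSON, the standard json module is enough here; a dedicated JSON-LD processor would only be needed to expand the "@context" into full URIs.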

HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-3-031-04749-7_6'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-3-031-04749-7_6'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-3-031-04749-7_6'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-3-031-04749-7_6'
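
The four curl commands above differ only in the Accept header, so the same content negotiation can be sketched in Python with the standard library. This is an illustration under assumptions: the helper names build_request and fetch are hypothetical, the response is assumed to be UTF-8 text, and whether the endpoint still answers is not confirmed by this page.

```python
import urllib.request

BASE_URL = "https://scigraph.springernature.com/"

# Media types taken from the curl examples above.
ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(record_id: str, fmt: str = "json-ld") -> urllib.request.Request:
    """Build a GET request that asks for the record in the given RDF serialization."""
    return urllib.request.Request(
        BASE_URL + record_id,
        headers={"Accept": ACCEPT_TYPES[fmt]},
    )


def fetch(record_id: str, fmt: str = "json-ld", timeout: float = 30.0) -> str:
    """Send the request and return the body as text (assumes a UTF-8 response)."""
    with urllib.request.urlopen(build_request(record_id, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the request does not touch the network; only fetch() does.
req = build_request("pub.10.1007/978-3-031-04749-7_6", "turtle")
print(req.full_url)
print(req.get_header("Accept"))
```

Calling fetch("pub.10.1007/978-3-031-04749-7_6", "nt") would be the counterpart of the N-Triples curl command; it is left uncalled here so the sketch runs offline.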


 

This table displays all metadata directly associated with this object as RDF triples.

145 TRIPLES      22 PREDICATES      73 URIs      66 LITERALS      7 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1007/978-3-031-04749-7_6 schema:about anzsrc-for:06
2 anzsrc-for:0604
3 schema:author N36b5455856ac4b5aa2b5e3d88fcc73de
4 schema:datePublished 2022-04-29
5 schema:datePublishedReg 2022-04-29
6 schema:description Principal component analysis (PCA) is a widely used dimensionality reduction technique in machine learning and multivariate statistics. To improve the interpretability of PCA, various approaches to obtain sparse principal direction loadings have been proposed, which are termed Sparse Principal Component Analysis (SPCA). In this paper, we present ThreSPCA, a provably accurate algorithm based on thresholding the Singular Value Decomposition for the SPCA problem, without imposing any restrictive assumptions on the input covariance matrix. Our thresholding algorithm is conceptually simple; much faster than current state-of-the-art; and performs well in practice. When applied to genotype data from the 1000 Genomes Project, ThreSPCA is faster than previous benchmarks, at least as accurate, and leads to a set of interpretable biomarkers, revealing genetic diversity across the world.
7 schema:editor N13ff2f9fdd144e1dbc407060412fe784
8 schema:genre chapter
9 schema:isAccessibleForFree true
10 schema:isPartOf N683493e9e734458187a06ab2605db6e0
11 schema:keywords Genome Project
12 accurate algorithm
13 accurate approximation algorithm
14 algorithm
15 analysis
16 approach
17 approximation algorithm
18 art
19 assumption
20 benchmarks
21 biomarkers
22 component analysis
23 covariance matrix
24 current state
25 data
26 decomposition
27 dimensionality reduction techniques
28 direction loading
29 diversity
30 genetic diversity
31 genetic variation
32 genotype data
33 human genetic variation
34 input covariance matrix
35 interpretability
36 interpretable biomarkers
37 learning
38 loading
39 machine learning
40 matrix
41 multivariate statistics
42 paper
43 practice
44 previous benchmarks
45 principal component analysis
46 problem
47 project
48 reduction techniques
49 restrictive assumptions
50 set
51 singular value decomposition
52 sparse principal component analysis
53 state
54 statistics
55 technique
56 thresholding algorithm
57 value decomposition
58 variation
59 world
60 schema:name A Fast, Provably Accurate Approximation Algorithm for Sparse Principal Component Analysis Reveals Human Genetic Variation Across the World
61 schema:pagination 86-106
62 schema:productId N1af079b646cd41fda4404255b91e3918
63 Nc574d555afa54c59affd8f3acc985c16
64 schema:publisher N3b9a5543242b44ae97969a486a23afbc
65 schema:sameAs https://app.dimensions.ai/details/publication/pub.1147802318
66 https://doi.org/10.1007/978-3-031-04749-7_6
67 schema:sdDatePublished 2022-11-24T21:13
68 schema:sdLicense https://scigraph.springernature.com/explorer/license/
69 schema:sdPublisher N848d897a045c42efa5c54a8da181eae0
70 schema:url https://doi.org/10.1007/978-3-031-04749-7_6
71 sgo:license sg:explorer/license/
72 sgo:sdDataset chapters
73 rdf:type schema:Chapter
74 N0369ae533ebc4f579c360b7f8f41b8ef rdf:first sg:person.015510615227.17
75 rdf:rest Nca7817e5145143e0aacbd0e9566f753c
76 N13ff2f9fdd144e1dbc407060412fe784 rdf:first N1d783456264e498da4a9c9de3ce3a888
77 rdf:rest rdf:nil
78 N1af079b646cd41fda4404255b91e3918 schema:name doi
79 schema:value 10.1007/978-3-031-04749-7_6
80 rdf:type schema:PropertyValue
81 N1d783456264e498da4a9c9de3ce3a888 schema:familyName Pe'er
82 schema:givenName Itsik
83 rdf:type schema:Person
84 N36b5455856ac4b5aa2b5e3d88fcc73de rdf:first sg:person.016112654201.64
85 rdf:rest N0369ae533ebc4f579c360b7f8f41b8ef
86 N3b9a5543242b44ae97969a486a23afbc schema:name Springer Nature
87 rdf:type schema:Organisation
88 N50bde4bf0fe54e45b01a1e5f58cb6d15 rdf:first sg:person.012727410605.86
89 rdf:rest N60298ddba3a54bf08f7dd26783fa199e
90 N60298ddba3a54bf08f7dd26783fa199e rdf:first sg:person.011256317073.58
91 rdf:rest rdf:nil
92 N683493e9e734458187a06ab2605db6e0 schema:isbn 978-3-031-04748-0
93 978-3-031-04749-7
94 schema:name Research in Computational Molecular Biology
95 rdf:type schema:Book
96 N848d897a045c42efa5c54a8da181eae0 schema:name Springer Nature - SN SciGraph project
97 rdf:type schema:Organization
98 Nc574d555afa54c59affd8f3acc985c16 schema:name dimensions_id
99 schema:value pub.1147802318
100 rdf:type schema:PropertyValue
101 Nca7817e5145143e0aacbd0e9566f753c rdf:first sg:person.011661310235.28
102 rdf:rest N50bde4bf0fe54e45b01a1e5f58cb6d15
103 anzsrc-for:06 schema:inDefinedTermSet anzsrc-for:
104 schema:name Biological Sciences
105 rdf:type schema:DefinedTerm
106 anzsrc-for:0604 schema:inDefinedTermSet anzsrc-for:
107 schema:name Genetics
108 rdf:type schema:DefinedTerm
109 sg:person.011256317073.58 schema:affiliation grid-institutes:grid.169077.e
110 schema:familyName Drineas
111 schema:givenName Petros
112 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011256317073.58
113 rdf:type schema:Person
114 sg:person.011661310235.28 schema:affiliation grid-institutes:grid.147455.6
115 schema:familyName Zhou
116 schema:givenName Samson
117 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011661310235.28
118 rdf:type schema:Person
119 sg:person.012727410605.86 schema:affiliation grid-institutes:grid.147455.6
120 schema:familyName Woodruff
121 schema:givenName David P.
122 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012727410605.86
123 rdf:type schema:Person
124 sg:person.015510615227.17 schema:affiliation grid-institutes:grid.481554.9
125 schema:familyName Bose
126 schema:givenName Aritra
127 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015510615227.17
128 rdf:type schema:Person
129 sg:person.016112654201.64 schema:affiliation grid-institutes:grid.135519.a
130 schema:familyName Chowdhury
131 schema:givenName Agniva
132 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.016112654201.64
133 rdf:type schema:Person
134 grid-institutes:grid.135519.a schema:alternateName Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
135 schema:name Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
136 rdf:type schema:Organization
137 grid-institutes:grid.147455.6 schema:alternateName School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
138 schema:name School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
139 rdf:type schema:Organization
140 grid-institutes:grid.169077.e schema:alternateName Department of Computer Science, Purdue University, West Lafayette, IN, USA
141 schema:name Department of Computer Science, Purdue University, West Lafayette, IN, USA
142 rdf:type schema:Organization
143 grid-institutes:grid.481554.9 schema:alternateName Computational Genomics, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
144 schema:name Computational Genomics, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
145 rdf:type schema:Organization
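
In the table above, the author order is not stored as an array but as an RDF list: schema:author points to a blank node, and each blank node carries an rdf:first (the item) and an rdf:rest (the next node, or rdf:nil at the end). The following Python sketch walks that chain using the blank-node identifiers copied from the table. The dictionary layout and the function name walk_rdf_list are illustrative assumptions, not part of SciGraph.

```python
RDF_NIL = "rdf:nil"

# Blank node -> (rdf:first, rdf:rest), copied from the triples table above.
AUTHOR_LIST_NODES = {
    "N36b5455856ac4b5aa2b5e3d88fcc73de": ("sg:person.016112654201.64", "N0369ae533ebc4f579c360b7f8f41b8ef"),
    "N0369ae533ebc4f579c360b7f8f41b8ef": ("sg:person.015510615227.17", "Nca7817e5145143e0aacbd0e9566f753c"),
    "Nca7817e5145143e0aacbd0e9566f753c": ("sg:person.011661310235.28", "N50bde4bf0fe54e45b01a1e5f58cb6d15"),
    "N50bde4bf0fe54e45b01a1e5f58cb6d15": ("sg:person.012727410605.86", "N60298ddba3a54bf08f7dd26783fa199e"),
    "N60298ddba3a54bf08f7dd26783fa199e": ("sg:person.011256317073.58", RDF_NIL),
}

# schema:familyName of each person, also from the table.
FAMILY_NAMES = {
    "sg:person.016112654201.64": "Chowdhury",
    "sg:person.015510615227.17": "Bose",
    "sg:person.011661310235.28": "Zhou",
    "sg:person.012727410605.86": "Woodruff",
    "sg:person.011256317073.58": "Drineas",
}


def walk_rdf_list(nodes: dict, head: str) -> list:
    """Follow rdf:first / rdf:rest links from head until rdf:nil and return the items in order."""
    items = []
    seen = set()
    node = head
    while node != RDF_NIL:
        if node in seen:
            # A well-formed RDF list never revisits a node.
            raise ValueError(f"cycle detected at {node}")
        seen.add(node)
        first, rest = nodes[node]
        items.append(first)
        node = rest
    return items


# The head node is the object of the schema:author triple.
ordered = walk_rdf_list(AUTHOR_LIST_NODES, "N36b5455856ac4b5aa2b5e3d88fcc73de")
print([FAMILY_NAMES[p] for p in ordered])
```

The recovered order matches the author array in the JSON-LD representation, which is the reason the list encoding exists: a plain set of triples has no inherent ordering.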