Does encoding matter? A novel view on the quantitative genetic trait prediction problem


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2016-07-19

AUTHORS

Dan He, Laxmi Parida

ABSTRACT

Background: Given a set of biallelic molecular markers, such as SNPs, with genotype values encoded numerically on a collection of plant, animal or human samples, the goal of genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Genetic trait prediction is usually formulated as a linear regression model, which requires quantitative encodings of the genotypes. There is a large body of work on prediction algorithms, but none of the existing work has investigated the effect of the encoding on the genetic trait prediction problem.

Methods: In this work, we view the genetic trait prediction problem from a novel angle: as a multiple regression on categorical data, which requires encoding the categorical data as numerical data. We further propose two novel encoding methods and show that they generate numerical features with higher predictive power.

Results and Discussion: Our experiments show that our methods are superior to the other encoding methods for both the single-marker model and the epistasis model. We show that the quantitative genetic trait prediction problem depends heavily on the encoding of genotypes under both models.

Conclusions: We conducted a detailed analysis of the performance of the hybrid encodings. To our knowledge, this is the first work that discusses the effects of encodings on the genetic trait prediction problem.
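To make the idea of "encoding categorical genotypes as numbers" concrete, the sketch below contrasts two widely used encodings for a single biallelic marker: an additive allele count and a one-hot (indicator) encoding. The genotype labels `AA`, `Aa`, `aa` and the function names are assumptions chosen for illustration. This record does not describe the two encodings proposed in the paper, so neither function should be read as the authors' method.

```python
# Illustrative only: two common numeric encodings of a biallelic genotype.
# Neither is claimed to be one of the paper's proposed encodings.

GENOTYPES = ("AA", "Aa", "aa")  # hypothetical labels for one biallelic marker


def additive_encoding(genotype: str) -> int:
    """Return the number of copies of allele 'a' (0, 1 or 2)."""
    if genotype not in GENOTYPES:
        raise ValueError(f"unknown genotype: {genotype!r}")
    return genotype.count("a")


def one_hot_encoding(genotype: str) -> list:
    """Return one 0/1 indicator per genotype class, in GENOTYPES order."""
    if genotype not in GENOTYPES:
        raise ValueError(f"unknown genotype: {genotype!r}")
    return [int(genotype == g) for g in GENOTYPES]


samples = ["AA", "Aa", "aa", "Aa"]
additive = [additive_encoding(g) for g in samples]
one_hot = [one_hot_encoding(g) for g in samples]

print(additive)
print(one_hot)
```

The additive encoding forces the heterozygote to sit exactly halfway between the two homozygotes in a linear model, while the indicator encoding lets each genotype class have its own effect at the cost of more features per marker.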

PAGES

272

Identifiers

URI

http://scigraph.springernature.com/pub.10.1186/s12859-016-1127-1

DOI

http://dx.doi.org/10.1186/s12859-016-1127-1

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1001829525

PUBMED

https://www.ncbi.nlm.nih.gov/pubmed/27454886



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/06", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Biological Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0604", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Genetics", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Algorithms", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Animals", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Eukaryota", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Genetic Markers", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Genotype", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Humans", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Models, Genetic", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Phenotype", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Plants", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Polymorphism, Single Nucleotide", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Quantitative Trait Loci", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "IBM T.J. Watson Research, Yorktown Heights, NY, USA", 
          "id": "http://www.grid.ac/institutes/grid.481554.9", 
          "name": [
            "IBM T.J. Watson Research, Yorktown Heights, NY, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "He", 
        "givenName": "Dan", 
        "id": "sg:person.0607503176.22", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0607503176.22"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "IBM T.J. Watson Research, Yorktown Heights, NY, USA", 
          "id": "http://www.grid.ac/institutes/grid.481554.9", 
          "name": [
            "IBM T.J. Watson Research, Yorktown Heights, NY, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Parida", 
        "givenName": "Laxmi", 
        "id": "sg:person.01336557015.68", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01336557015.68"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "sg:pub.10.1007/bf00994018", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1025150743", 
          "https://doi.org/10.1007/bf00994018"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1023/b:stco.0000035301.49549.88", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1000991887", 
          "https://doi.org/10.1023/b:stco.0000035301.49549.88"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/ncomms1467", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1011822475", 
          "https://doi.org/10.1038/ncomms1467"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2016-07-19", 
    "datePublishedReg": "2016-07-19", 
    "description": "BackgroundGiven a set of biallelic molecular markers, such as SNPs, with genotype values encoded numerically on a collection of plant, animal or human samples, the goal of genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Genetic trait prediction is usually represented as linear regression models which require quantitative encodings for the genotypes. There are lots of work on the prediction algorithms, but none of the existing work investigated the effects of the encodings on the genetic trait prediction problem.MethodsIn this work, we view the genetic trait prediction problem from a novel angle: a multiple regression on categorical data problem, which requires encoding the categorical data into numerical data. We further proposed two novel encoding methods and we show that they are able to generate numerical features with higher predictive power.Results and DiscussionOur experiments show that our methods are superior to the other encoding methods for both single marker model and epistasis model. We showed that the quantitative genetic trait prediction problem heavily depends on the encoding of genotypes, for both single marker model and epistasis model.ConclusionsWe conducted a detailed analysis on the performance of the hybrid encodings. To our knowledge, this is the first work that discusses the effects of encodings for genetic trait prediction problem.", 
    "genre": "article", 
    "id": "sg:pub.10.1186/s12859-016-1127-1", 
    "isAccessibleForFree": true, 
    "isPartOf": [
      {
        "id": "sg:journal.1023786", 
        "issn": [
          "1471-2105"
        ], 
        "name": "BMC Bioinformatics", 
        "publisher": "Springer Nature", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "Suppl 9", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "17"
      }
    ], 
    "keywords": [
      "genetic trait prediction problem", 
      "genetic trait prediction", 
      "single-marker models", 
      "trait prediction", 
      "collection of plants", 
      "biallelic molecular markers", 
      "categorical data problems", 
      "marker effects", 
      "genotype values", 
      "trait values", 
      "molecular markers", 
      "quantitative trait values", 
      "effects of encodings", 
      "epistasis models", 
      "genotypes", 
      "plants", 
      "high predictive power", 
      "SNPs", 
      "multiple regression", 
      "animals", 
      "linear regression models", 
      "matter", 
      "effect", 
      "human samples", 
      "markers", 
      "regression models", 
      "experiments", 
      "values", 
      "collection", 
      "prediction problem", 
      "prediction", 
      "ConclusionsWe", 
      "data", 
      "knowledge", 
      "marker model", 
      "goal", 
      "samples", 
      "predictive power", 
      "model", 
      "method", 
      "work", 
      "problem", 
      "data problem", 
      "numerical data", 
      "results", 
      "regression", 
      "numerical features", 
      "first work", 
      "analysis", 
      "performance", 
      "categorical data", 
      "MethodsIn", 
      "set", 
      "detailed analysis", 
      "hybrid encoding", 
      "view", 
      "novel view", 
      "quantitative encodings", 
      "novel angle", 
      "encoding", 
      "angle", 
      "features", 
      "power", 
      "BackgroundGiven"
    ], 
    "name": "Does encoding matter? A novel view on the quantitative genetic trait prediction problem", 
    "pagination": "272", 
    "productId": [
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1001829525"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1186/s12859-016-1127-1"
        ]
      }, 
      {
        "name": "pubmed_id", 
        "type": "PropertyValue", 
        "value": [
          "27454886"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1186/s12859-016-1127-1", 
      "https://app.dimensions.ai/details/publication/pub.1001829525"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2022-09-02T15:59", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-springernature-scigraph/baseset/20220902/entities/gbq_results/article/article_701.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "https://doi.org/10.1186/s12859-016-1127-1"
  }
]
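Because the record is ordinary JSON, it can be read with a standard JSON parser. The following is a minimal sketch, assuming an abridged copy of the record above (only a handful of fields are kept); the helper name `summarize` is hypothetical and not part of any SciGraph tooling.

```python
import json

# Abridged copy of the JSON-LD record above; most fields are omitted.
RECORD_JSON = """
[
  {
    "author": [
      {"familyName": "He", "givenName": "Dan", "type": "Person"},
      {"familyName": "Parida", "givenName": "Laxmi", "type": "Person"}
    ],
    "datePublished": "2016-07-19",
    "name": "Does encoding matter? A novel view on the quantitative genetic trait prediction problem",
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1001829525"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1186/s12859-016-1127-1"]},
      {"name": "pubmed_id", "type": "PropertyValue", "value": ["27454886"]}
    ],
    "type": "ScholarlyArticle"
  }
]
"""


def summarize(record: dict) -> dict:
    """Collect title, date, author names and identifiers from one record."""
    # Each productId entry holds its identifier in a one-element list.
    identifiers = {p["name"]: p["value"][0] for p in record.get("productId", [])}
    authors = [
        f'{a["givenName"]} {a["familyName"]}' for a in record.get("author", [])
    ]
    return {
        "title": record.get("name"),
        "date": record.get("datePublished"),
        "authors": authors,
        "doi": identifiers.get("doi"),
        "pubmed_id": identifiers.get("pubmed_id"),
    }


# The top-level JSON value is a list containing a single record.
records = json.loads(RECORD_JSON)
summary = summarize(records[0])
print(summary)
```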
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1186/s12859-016-1127-1'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1186/s12859-016-1127-1'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1186/s12859-016-1127-1'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1186/s12859-016-1127-1'
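The four curl commands differ only in their `Accept` header, so the same content negotiation can be sketched in Python with the standard library. This is an illustration under assumptions: the short format keys (`json-ld`, `nt`, `turtle`, `xml`) and the function names `build_request` and `fetch` are made up here, and whether the endpoint still answers as the curl commands describe is not verified.

```python
from urllib.request import Request, urlopen

RECORD_URL = "https://scigraph.springernature.com/pub.10.1186/s12859-016-1127-1"

# Media types taken from the curl commands above; the keys are illustrative.
ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt: str, url: str = RECORD_URL) -> Request:
    """Build a GET request asking for the record in the given RDF format."""
    if fmt not in ACCEPT_TYPES:
        raise ValueError(f"unsupported format: {fmt!r}")
    return Request(url, headers={"Accept": ACCEPT_TYPES[fmt]})


def fetch(fmt: str, url: str = RECORD_URL, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text (needs network)."""
    with urlopen(build_request(fmt, url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building a request does not touch the network; only fetch() does.
request = build_request("turtle")
print(request.full_url)
print(request.get_header("Accept"))
```

Calling `fetch("json-ld")` would correspond to the first curl command; it is left uncalled here so the sketch runs offline.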


 

This table displays all metadata directly associated with this object as RDF triples.

187 TRIPLES      21 PREDICATES      103 URIs      92 LITERALS      18 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1186/s12859-016-1127-1 schema:about N101af4c21daf4a3ea65a9a9a85036117
2 N4e48747481164aa79ca5e1622a95cb9f
3 Na4f1f563041744818729463e0bcecad1
4 Na8f5cb7618cb45b8be13637326e6de99
5 Nbb71e15d8add40b996d091e7dde272d9
6 Nbba934ddd086487a9824efcb5ba4e9dd
7 Nc0e4702b0cca4a179482504215c2c391
8 Ncca567e1d8cd4da9a06b53d7d2019925
9 Nd0277d0148d84aeda60450f24d7960fa
10 Nf21f4eb6543a4b74868b39dd0c05b560
11 Nf52cb1be45d744afbfbcd2d11a181f80
12 anzsrc-for:06
13 anzsrc-for:0604
14 schema:author N83982cd5a6934907a8c0ee4b001a3efc
15 schema:citation sg:pub.10.1007/bf00994018
16 sg:pub.10.1023/b:stco.0000035301.49549.88
17 sg:pub.10.1038/ncomms1467
18 schema:datePublished 2016-07-19
19 schema:datePublishedReg 2016-07-19
20 schema:description BackgroundGiven a set of biallelic molecular markers, such as SNPs, with genotype values encoded numerically on a collection of plant, animal or human samples, the goal of genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Genetic trait prediction is usually represented as linear regression models which require quantitative encodings for the genotypes. There are lots of work on the prediction algorithms, but none of the existing work investigated the effects of the encodings on the genetic trait prediction problem.MethodsIn this work, we view the genetic trait prediction problem from a novel angle: a multiple regression on categorical data problem, which requires encoding the categorical data into numerical data. We further proposed two novel encoding methods and we show that they are able to generate numerical features with higher predictive power.Results and DiscussionOur experiments show that our methods are superior to the other encoding methods for both single marker model and epistasis model. We showed that the quantitative genetic trait prediction problem heavily depends on the encoding of genotypes, for both single marker model and epistasis model.ConclusionsWe conducted a detailed analysis on the performance of the hybrid encodings. To our knowledge, this is the first work that discusses the effects of encodings for genetic trait prediction problem.
21 schema:genre article
22 schema:isAccessibleForFree true
23 schema:isPartOf N009e5b36a21c418bab1dc487ba065b9f
24 N709cfb55d58f47788ab250b933e4661b
25 sg:journal.1023786
26 schema:keywords BackgroundGiven
27 ConclusionsWe
28 MethodsIn
29 SNPs
30 analysis
31 angle
32 animals
33 biallelic molecular markers
34 categorical data
35 categorical data problems
36 collection
37 collection of plants
38 data
39 data problem
40 detailed analysis
41 effect
42 effects of encodings
43 encoding
44 epistasis models
45 experiments
46 features
47 first work
48 genetic trait prediction
49 genetic trait prediction problem
50 genotype values
51 genotypes
52 goal
53 high predictive power
54 human samples
55 hybrid encoding
56 knowledge
57 linear regression models
58 marker effects
59 marker model
60 markers
61 matter
62 method
63 model
64 molecular markers
65 multiple regression
66 novel angle
67 novel view
68 numerical data
69 numerical features
70 performance
71 plants
72 power
73 prediction
74 prediction problem
75 predictive power
76 problem
77 quantitative encodings
78 quantitative trait values
79 regression
80 regression models
81 results
82 samples
83 set
84 single-marker models
85 trait prediction
86 trait values
87 values
88 view
89 work
90 schema:name Does encoding matter? A novel view on the quantitative genetic trait prediction problem
91 schema:pagination 272
92 schema:productId N64e31b13b1fd48c6b96477fa7e4fd483
93 Nc200f9c757cb49b981b4e2e203e3f6f1
94 Nf1db93ee012d45cfa9b3ab9a2d57908d
95 schema:sameAs https://app.dimensions.ai/details/publication/pub.1001829525
96 https://doi.org/10.1186/s12859-016-1127-1
97 schema:sdDatePublished 2022-09-02T15:59
98 schema:sdLicense https://scigraph.springernature.com/explorer/license/
99 schema:sdPublisher N9326b6821ff0411b9f1c450bc3cc6188
100 schema:url https://doi.org/10.1186/s12859-016-1127-1
101 sgo:license sg:explorer/license/
102 sgo:sdDataset articles
103 rdf:type schema:ScholarlyArticle
104 N009e5b36a21c418bab1dc487ba065b9f schema:issueNumber Suppl 9
105 rdf:type schema:PublicationIssue
106 N101af4c21daf4a3ea65a9a9a85036117 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
107 schema:name Humans
108 rdf:type schema:DefinedTerm
109 N2bc82908c73a4724ab86d83ac2a5f283 rdf:first sg:person.01336557015.68
110 rdf:rest rdf:nil
111 N4e48747481164aa79ca5e1622a95cb9f schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
112 schema:name Algorithms
113 rdf:type schema:DefinedTerm
114 N64e31b13b1fd48c6b96477fa7e4fd483 schema:name pubmed_id
115 schema:value 27454886
116 rdf:type schema:PropertyValue
117 N709cfb55d58f47788ab250b933e4661b schema:volumeNumber 17
118 rdf:type schema:PublicationVolume
119 N83982cd5a6934907a8c0ee4b001a3efc rdf:first sg:person.0607503176.22
120 rdf:rest N2bc82908c73a4724ab86d83ac2a5f283
121 N9326b6821ff0411b9f1c450bc3cc6188 schema:name Springer Nature - SN SciGraph project
122 rdf:type schema:Organization
123 Na4f1f563041744818729463e0bcecad1 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
124 schema:name Genetic Markers
125 rdf:type schema:DefinedTerm
126 Na8f5cb7618cb45b8be13637326e6de99 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
127 schema:name Quantitative Trait Loci
128 rdf:type schema:DefinedTerm
129 Nbb71e15d8add40b996d091e7dde272d9 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
130 schema:name Plants
131 rdf:type schema:DefinedTerm
132 Nbba934ddd086487a9824efcb5ba4e9dd schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
133 schema:name Eukaryota
134 rdf:type schema:DefinedTerm
135 Nc0e4702b0cca4a179482504215c2c391 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
136 schema:name Animals
137 rdf:type schema:DefinedTerm
138 Nc200f9c757cb49b981b4e2e203e3f6f1 schema:name dimensions_id
139 schema:value pub.1001829525
140 rdf:type schema:PropertyValue
141 Ncca567e1d8cd4da9a06b53d7d2019925 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
142 schema:name Genotype
143 rdf:type schema:DefinedTerm
144 Nd0277d0148d84aeda60450f24d7960fa schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
145 schema:name Models, Genetic
146 rdf:type schema:DefinedTerm
147 Nf1db93ee012d45cfa9b3ab9a2d57908d schema:name doi
148 schema:value 10.1186/s12859-016-1127-1
149 rdf:type schema:PropertyValue
150 Nf21f4eb6543a4b74868b39dd0c05b560 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
151 schema:name Polymorphism, Single Nucleotide
152 rdf:type schema:DefinedTerm
153 Nf52cb1be45d744afbfbcd2d11a181f80 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
154 schema:name Phenotype
155 rdf:type schema:DefinedTerm
156 anzsrc-for:06 schema:inDefinedTermSet anzsrc-for:
157 schema:name Biological Sciences
158 rdf:type schema:DefinedTerm
159 anzsrc-for:0604 schema:inDefinedTermSet anzsrc-for:
160 schema:name Genetics
161 rdf:type schema:DefinedTerm
162 sg:journal.1023786 schema:issn 1471-2105
163 schema:name BMC Bioinformatics
164 schema:publisher Springer Nature
165 rdf:type schema:Periodical
166 sg:person.01336557015.68 schema:affiliation grid-institutes:grid.481554.9
167 schema:familyName Parida
168 schema:givenName Laxmi
169 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01336557015.68
170 rdf:type schema:Person
171 sg:person.0607503176.22 schema:affiliation grid-institutes:grid.481554.9
172 schema:familyName He
173 schema:givenName Dan
174 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0607503176.22
175 rdf:type schema:Person
176 sg:pub.10.1007/bf00994018 schema:sameAs https://app.dimensions.ai/details/publication/pub.1025150743
177 https://doi.org/10.1007/bf00994018
178 rdf:type schema:CreativeWork
179 sg:pub.10.1023/b:stco.0000035301.49549.88 schema:sameAs https://app.dimensions.ai/details/publication/pub.1000991887
180 https://doi.org/10.1023/b:stco.0000035301.49549.88
181 rdf:type schema:CreativeWork
182 sg:pub.10.1038/ncomms1467 schema:sameAs https://app.dimensions.ai/details/publication/pub.1011822475
183 https://doi.org/10.1038/ncomms1467
184 rdf:type schema:CreativeWork
185 grid-institutes:grid.481554.9 schema:alternateName IBM T.J. Watson Research, Yorktown Heights, NY, USA
186 schema:name IBM T.J. Watson Research, Yorktown Heights, NY, USA
187 rdf:type schema:Organization
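Rows 14, 109-110 and 119-120 of the table encode the author order as an RDF list: `schema:author` points at a blank node, and each node carries an `rdf:first` item and an `rdf:rest` pointer that ends in `rdf:nil`. The sketch below walks that chain using the five triples copied from those rows; the function name `walk_rdf_list` is hypothetical, and the tuple representation is a simplification rather than a real RDF library.

```python
RDF_NIL = "rdf:nil"

# (subject, predicate, object) triples copied from the table rows above.
TRIPLES = [
    ("sg:pub.10.1186/s12859-016-1127-1", "schema:author",
     "N83982cd5a6934907a8c0ee4b001a3efc"),
    ("N83982cd5a6934907a8c0ee4b001a3efc", "rdf:first", "sg:person.0607503176.22"),
    ("N83982cd5a6934907a8c0ee4b001a3efc", "rdf:rest",
     "N2bc82908c73a4724ab86d83ac2a5f283"),
    ("N2bc82908c73a4724ab86d83ac2a5f283", "rdf:first", "sg:person.01336557015.68"),
    ("N2bc82908c73a4724ab86d83ac2a5f283", "rdf:rest", RDF_NIL),
]


def walk_rdf_list(head: str, triples: list) -> list:
    """Follow rdf:first / rdf:rest from `head` and return the items in order."""
    lookup = {(s, p): o for s, p, o in triples}
    items = []
    seen = set()
    node = head
    while node != RDF_NIL:
        if node in seen:
            raise ValueError("cycle detected in RDF list")
        seen.add(node)
        items.append(lookup[(node, "rdf:first")])
        node = lookup[(node, "rdf:rest")]
    return items


# Find the list head attached to the article, then walk it.
head = next(o for s, p, o in TRIPLES if p == "schema:author")
author_ids = walk_rdf_list(head, TRIPLES)
print(author_ids)
```

The resulting order matches the `author` array in the JSON-LD record: Dan He first, then Laxmi Parida.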
 



