Does encoding matter? A novel view on the quantitative genetic trait prediction problem


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2016-07-19

AUTHORS

Dan He, Laxmi Parida

ABSTRACT

Background: Given a set of biallelic molecular markers, such as SNPs, with genotype values encoded numerically on a collection of plant, animal or human samples, the goal of genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Genetic trait prediction is usually formulated as a linear regression model, which requires quantitative encodings for the genotypes. There is a large body of work on prediction algorithms, but none of the existing work has investigated the effects of the encodings on the genetic trait prediction problem.

Methods: In this work, we view the genetic trait prediction problem from a novel angle: as a multiple regression problem on categorical data, which requires encoding the categorical data into numerical data. We further propose two novel encoding methods and show that they are able to generate numerical features with higher predictive power.

Results and Discussion: Our experiments show that our methods are superior to the other encoding methods for both the single-marker model and the epistasis model. We show that the quantitative genetic trait prediction problem depends heavily on the encoding of genotypes under both models.

Conclusions: We conducted a detailed analysis of the performance of the hybrid encodings. To our knowledge, this is the first work that discusses the effects of encodings on the genetic trait prediction problem.
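To make the abstract's framing concrete, here is a minimal sketch of how the same categorical genotypes can be turned into different numerical features. It shows two textbook encodings (additive allele counts and one-hot indicators), not the two novel encodings the paper proposes, which this record does not describe. The genotype labels `AA`, `Aa`, `aa` and all function names are assumptions made for illustration.

```python
# Sketch only: two textbook encodings of biallelic genotypes.
# These are NOT the paper's novel encodings; labels and names are hypothetical.
GENOTYPES = ("AA", "Aa", "aa")  # the three genotype classes at a biallelic marker


def additive_encoding(genotype):
    """Count copies of the allele 'a': AA -> 0, Aa -> 1, aa -> 2."""
    return genotype.count("a")


def one_hot_encoding(genotype):
    """Treat the genotype as an unordered category: one indicator per class."""
    return [1 if genotype == g else 0 for g in GENOTYPES]


def encode_sample(markers, encoder):
    """Encode every marker of one sample and flatten into a single feature row."""
    row = []
    for genotype in markers:
        value = encoder(genotype)
        row.extend(value if isinstance(value, list) else [value])
    return row


sample = ["AA", "Aa", "aa"]  # one sample genotyped at three markers
additive_row = encode_sample(sample, additive_encoding)
one_hot_row = encode_sample(sample, one_hot_encoding)
```

The additive row gives a regression model one feature per marker and forces the heterozygote effect to lie midway between the two homozygotes. The one-hot row gives three features per marker and lets each genotype class have its own effect. A choice of this kind is what the abstract means by the encoding affecting predictive power.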

PAGES

272

Identifiers

URI

http://scigraph.springernature.com/pub.10.1186/s12859-016-1127-1

DOI

http://dx.doi.org/10.1186/s12859-016-1127-1

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1001829525

PUBMED

https://www.ncbi.nlm.nih.gov/pubmed/27454886



JSON-LD is the canonical representation for SciGraph data.


[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/06", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Biological Sciences", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0604", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Genetics", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Algorithms", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Animals", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Eukaryota", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Genetic Markers", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Genotype", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Humans", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Models, Genetic", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Phenotype", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Plants", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Polymorphism, Single Nucleotide", 
        "type": "DefinedTerm"
      }, 
      {
        "inDefinedTermSet": "https://www.nlm.nih.gov/mesh/", 
        "name": "Quantitative Trait Loci", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "IBM T.J. Watson Research, Yorktown Heights, NY, USA", 
          "id": "http://www.grid.ac/institutes/grid.481554.9", 
          "name": [
            "IBM T.J. Watson Research, Yorktown Heights, NY, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "He", 
        "givenName": "Dan", 
        "id": "sg:person.0607503176.22", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0607503176.22"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "IBM T.J. Watson Research, Yorktown Heights, NY, USA", 
          "id": "http://www.grid.ac/institutes/grid.481554.9", 
          "name": [
            "IBM T.J. Watson Research, Yorktown Heights, NY, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Parida", 
        "givenName": "Laxmi", 
        "id": "sg:person.01336557015.68", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01336557015.68"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "sg:pub.10.1007/bf00994018", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1025150743", 
          "https://doi.org/10.1007/bf00994018"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1023/b:stco.0000035301.49549.88", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1000991887", 
          "https://doi.org/10.1023/b:stco.0000035301.49549.88"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1038/ncomms1467", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1011822475", 
          "https://doi.org/10.1038/ncomms1467"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2016-07-19", 
    "datePublishedReg": "2016-07-19", 
    "description": "BackgroundGiven a set of biallelic molecular markers, such as SNPs, with genotype values encoded numerically on a collection of plant, animal or human samples, the goal of genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Genetic trait prediction is usually represented as linear regression models which require quantitative encodings for the genotypes. There are lots of work on the prediction algorithms, but none of the existing work investigated the effects of the encodings on the genetic trait prediction problem.MethodsIn this work, we view the genetic trait prediction problem from a novel angle: a multiple regression on categorical data problem, which requires encoding the categorical data into numerical data. We further proposed two novel encoding methods and we show that they are able to generate numerical features with higher predictive power.Results and DiscussionOur experiments show that our methods are superior to the other encoding methods for both single marker model and epistasis model. We showed that the quantitative genetic trait prediction problem heavily depends on the encoding of genotypes, for both single marker model and epistasis model.ConclusionsWe conducted a detailed analysis on the performance of the hybrid encodings. To our knowledge, this is the first work that discusses the effects of encodings for genetic trait prediction problem.", 
    "genre": "article", 
    "id": "sg:pub.10.1186/s12859-016-1127-1", 
    "isAccessibleForFree": true, 
    "isPartOf": [
      {
        "id": "sg:journal.1023786", 
        "issn": [
          "1471-2105"
        ], 
        "name": "BMC Bioinformatics", 
        "publisher": "Springer Nature", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "Suppl 9", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "17"
      }
    ], 
    "keywords": [
      "genetic trait prediction problem", 
      "genetic trait prediction", 
      "single-marker models", 
      "trait prediction", 
      "collection of plants", 
      "biallelic molecular markers", 
      "categorical data problems", 
      "marker effects", 
      "genotype values", 
      "trait values", 
      "molecular markers", 
      "quantitative trait values", 
      "effects of encodings", 
      "epistasis models", 
      "genotypes", 
      "plants", 
      "high predictive power", 
      "SNPs", 
      "multiple regression", 
      "animals", 
      "linear regression models", 
      "matter", 
      "effect", 
      "human samples", 
      "markers", 
      "regression models", 
      "experiments", 
      "values", 
      "collection", 
      "prediction problem", 
      "prediction", 
      "ConclusionsWe", 
      "data", 
      "knowledge", 
      "marker model", 
      "goal", 
      "samples", 
      "predictive power", 
      "model", 
      "method", 
      "work", 
      "problem", 
      "data problem", 
      "numerical data", 
      "results", 
      "regression", 
      "numerical features", 
      "first work", 
      "analysis", 
      "performance", 
      "categorical data", 
      "MethodsIn", 
      "set", 
      "detailed analysis", 
      "hybrid encoding", 
      "view", 
      "novel view", 
      "quantitative encodings", 
      "novel angle", 
      "encoding", 
      "angle", 
      "features", 
      "power", 
      "BackgroundGiven"
    ], 
    "name": "Does encoding matter? A novel view on the quantitative genetic trait prediction problem", 
    "pagination": "272", 
    "productId": [
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1001829525"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1186/s12859-016-1127-1"
        ]
      }, 
      {
        "name": "pubmed_id", 
        "type": "PropertyValue", 
        "value": [
          "27454886"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1186/s12859-016-1127-1", 
      "https://app.dimensions.ai/details/publication/pub.1001829525"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2022-12-01T06:34", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-springernature-scigraph/baseset/20221201/entities/gbq_results/article/article_693.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "https://doi.org/10.1186/s12859-016-1127-1"
  }
]
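As a rough illustration of how a record like the one above can be consumed, the sketch below parses a trimmed copy of the JSON-LD with Python's standard `json` module and pulls out the title, author names and identifiers. The helper name `summarize` and the trimmed `RECORD_JSON` string are assumptions for the example. Only fields that appear in the record above are used.

```python
import json

# Trimmed copy of the record above; only the fields read below are kept.
RECORD_JSON = """
[
  {
    "author": [
      {"familyName": "He", "givenName": "Dan", "type": "Person"},
      {"familyName": "Parida", "givenName": "Laxmi", "type": "Person"}
    ],
    "name": "Does encoding matter? A novel view on the quantitative genetic trait prediction problem",
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1001829525"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1186/s12859-016-1127-1"]},
      {"name": "pubmed_id", "type": "PropertyValue", "value": ["27454886"]}
    ]
  }
]
"""


def summarize(record):
    """Collect the title, author names and identifiers from one record (hypothetical helper)."""
    authors = [f"{a['givenName']} {a['familyName']}" for a in record.get("author", [])]
    # Each productId entry holds a name and a one-element list of values.
    identifiers = {p["name"]: p["value"][0] for p in record.get("productId", [])}
    return {"title": record.get("name"), "authors": authors, "identifiers": identifiers}


# The top-level JSON value is a list containing a single record.
summary = summarize(json.loads(RECORD_JSON)[0])
```

Because JSON-LD is plain JSON, no RDF library is needed for simple lookups like this. A full RDF toolkit only becomes necessary when the `@context` has to be expanded or the graph merged with other data.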
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1186/s12859-016-1127-1'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1186/s12859-016-1127-1'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1186/s12859-016-1127-1'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1186/s12859-016-1127-1'
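The four curl commands differ only in the `Accept` header, so the same content negotiation can be sketched in Python with the standard library. The function names `build_request` and `fetch`, the short format keys and the timeout value are assumptions for illustration. The URL and media types are those from the commands above, and the sketch assumes the endpoint still responds as described.

```python
from urllib.request import Request, urlopen

RECORD_URL = "https://scigraph.springernature.com/pub.10.1186/s12859-016-1127-1"

# Media types taken from the curl commands above, keyed by a short format name.
ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt, url=RECORD_URL):
    """Create a GET request whose Accept header asks for the chosen RDF format."""
    return Request(url, headers={"Accept": ACCEPT_TYPES[fmt]})


def fetch(fmt, url=RECORD_URL, timeout=30):
    """Send the request and return the response body as text (needs network access)."""
    with urlopen(build_request(fmt, url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call, left commented out because it contacts the server:
# print(fetch("turtle")[:500])
```

Keeping request construction separate from the network call makes the header logic easy to check offline.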


 

This table displays all metadata directly associated to this object as RDF triples.

187 TRIPLES      21 PREDICATES      103 URIs      92 LITERALS      18 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1186/s12859-016-1127-1 schema:about N11df542451e14012814fd396a64ac8e6
2 N122b480587fe4daa9e234bac7cc008e4
3 N38ddbccdfd8545d4a3da0cdd324468ec
4 N4c6480683a1f421084819badb2bc34ac
5 N6c2641974c46490c82d21ee2e39491ee
6 N6e9696fd5c66415ab9ed5b3a9e810e11
7 N87b25196a2b24191ada34d47f80ac310
8 N94fa73b801b141269ca5f0549e7752b0
9 Na41a7342a2764d34b7e27412841ff79d
10 Nb5dd9a5e6bc542a1ada74e203d1db280
11 Nba180af59ae047c3bff48600143f6b6b
12 anzsrc-for:06
13 anzsrc-for:0604
14 schema:author N6041aa44b6794f2aa1fa20c528cfee4c
15 schema:citation sg:pub.10.1007/bf00994018
16 sg:pub.10.1023/b:stco.0000035301.49549.88
17 sg:pub.10.1038/ncomms1467
18 schema:datePublished 2016-07-19
19 schema:datePublishedReg 2016-07-19
20 schema:description BackgroundGiven a set of biallelic molecular markers, such as SNPs, with genotype values encoded numerically on a collection of plant, animal or human samples, the goal of genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Genetic trait prediction is usually represented as linear regression models which require quantitative encodings for the genotypes. There are lots of work on the prediction algorithms, but none of the existing work investigated the effects of the encodings on the genetic trait prediction problem.MethodsIn this work, we view the genetic trait prediction problem from a novel angle: a multiple regression on categorical data problem, which requires encoding the categorical data into numerical data. We further proposed two novel encoding methods and we show that they are able to generate numerical features with higher predictive power.Results and DiscussionOur experiments show that our methods are superior to the other encoding methods for both single marker model and epistasis model. We showed that the quantitative genetic trait prediction problem heavily depends on the encoding of genotypes, for both single marker model and epistasis model.ConclusionsWe conducted a detailed analysis on the performance of the hybrid encodings. To our knowledge, this is the first work that discusses the effects of encodings for genetic trait prediction problem.
21 schema:genre article
22 schema:isAccessibleForFree true
23 schema:isPartOf Nbbd82f876052449a8586dbf63fed98cc
24 Nea3f1217ec35499085039e4a887bdf52
25 sg:journal.1023786
26 schema:keywords BackgroundGiven
27 ConclusionsWe
28 MethodsIn
29 SNPs
30 analysis
31 angle
32 animals
33 biallelic molecular markers
34 categorical data
35 categorical data problems
36 collection
37 collection of plants
38 data
39 data problem
40 detailed analysis
41 effect
42 effects of encodings
43 encoding
44 epistasis models
45 experiments
46 features
47 first work
48 genetic trait prediction
49 genetic trait prediction problem
50 genotype values
51 genotypes
52 goal
53 high predictive power
54 human samples
55 hybrid encoding
56 knowledge
57 linear regression models
58 marker effects
59 marker model
60 markers
61 matter
62 method
63 model
64 molecular markers
65 multiple regression
66 novel angle
67 novel view
68 numerical data
69 numerical features
70 performance
71 plants
72 power
73 prediction
74 prediction problem
75 predictive power
76 problem
77 quantitative encodings
78 quantitative trait values
79 regression
80 regression models
81 results
82 samples
83 set
84 single-marker models
85 trait prediction
86 trait values
87 values
88 view
89 work
90 schema:name Does encoding matter? A novel view on the quantitative genetic trait prediction problem
91 schema:pagination 272
92 schema:productId N18c52f132d2347ec8f2a02d310d4a37c
93 N5977083d0e0f45beaf64e956e7929948
94 Ned3805f4c7554424b0cd073097edba32
95 schema:sameAs https://app.dimensions.ai/details/publication/pub.1001829525
96 https://doi.org/10.1186/s12859-016-1127-1
97 schema:sdDatePublished 2022-12-01T06:34
98 schema:sdLicense https://scigraph.springernature.com/explorer/license/
99 schema:sdPublisher N9bdb76dd5852454ca3b674e1ccfc5e50
100 schema:url https://doi.org/10.1186/s12859-016-1127-1
101 sgo:license sg:explorer/license/
102 sgo:sdDataset articles
103 rdf:type schema:ScholarlyArticle
104 N11df542451e14012814fd396a64ac8e6 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
105 schema:name Humans
106 rdf:type schema:DefinedTerm
107 N122b480587fe4daa9e234bac7cc008e4 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
108 schema:name Polymorphism, Single Nucleotide
109 rdf:type schema:DefinedTerm
110 N18c52f132d2347ec8f2a02d310d4a37c schema:name pubmed_id
111 schema:value 27454886
112 rdf:type schema:PropertyValue
113 N38ddbccdfd8545d4a3da0cdd324468ec schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
114 schema:name Animals
115 rdf:type schema:DefinedTerm
116 N4c6480683a1f421084819badb2bc34ac schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
117 schema:name Genotype
118 rdf:type schema:DefinedTerm
119 N5977083d0e0f45beaf64e956e7929948 schema:name doi
120 schema:value 10.1186/s12859-016-1127-1
121 rdf:type schema:PropertyValue
122 N6041aa44b6794f2aa1fa20c528cfee4c rdf:first sg:person.0607503176.22
123 rdf:rest N686eb5343eff4ed7a7b2a941ee13518a
124 N686eb5343eff4ed7a7b2a941ee13518a rdf:first sg:person.01336557015.68
125 rdf:rest rdf:nil
126 N6c2641974c46490c82d21ee2e39491ee schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
127 schema:name Quantitative Trait Loci
128 rdf:type schema:DefinedTerm
129 N6e9696fd5c66415ab9ed5b3a9e810e11 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
130 schema:name Eukaryota
131 rdf:type schema:DefinedTerm
132 N87b25196a2b24191ada34d47f80ac310 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
133 schema:name Genetic Markers
134 rdf:type schema:DefinedTerm
135 N94fa73b801b141269ca5f0549e7752b0 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
136 schema:name Models, Genetic
137 rdf:type schema:DefinedTerm
138 N9bdb76dd5852454ca3b674e1ccfc5e50 schema:name Springer Nature - SN SciGraph project
139 rdf:type schema:Organization
140 Na41a7342a2764d34b7e27412841ff79d schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
141 schema:name Plants
142 rdf:type schema:DefinedTerm
143 Nb5dd9a5e6bc542a1ada74e203d1db280 schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
144 schema:name Phenotype
145 rdf:type schema:DefinedTerm
146 Nba180af59ae047c3bff48600143f6b6b schema:inDefinedTermSet https://www.nlm.nih.gov/mesh/
147 schema:name Algorithms
148 rdf:type schema:DefinedTerm
149 Nbbd82f876052449a8586dbf63fed98cc schema:issueNumber Suppl 9
150 rdf:type schema:PublicationIssue
151 Nea3f1217ec35499085039e4a887bdf52 schema:volumeNumber 17
152 rdf:type schema:PublicationVolume
153 Ned3805f4c7554424b0cd073097edba32 schema:name dimensions_id
154 schema:value pub.1001829525
155 rdf:type schema:PropertyValue
156 anzsrc-for:06 schema:inDefinedTermSet anzsrc-for:
157 schema:name Biological Sciences
158 rdf:type schema:DefinedTerm
159 anzsrc-for:0604 schema:inDefinedTermSet anzsrc-for:
160 schema:name Genetics
161 rdf:type schema:DefinedTerm
162 sg:journal.1023786 schema:issn 1471-2105
163 schema:name BMC Bioinformatics
164 schema:publisher Springer Nature
165 rdf:type schema:Periodical
166 sg:person.01336557015.68 schema:affiliation grid-institutes:grid.481554.9
167 schema:familyName Parida
168 schema:givenName Laxmi
169 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01336557015.68
170 rdf:type schema:Person
171 sg:person.0607503176.22 schema:affiliation grid-institutes:grid.481554.9
172 schema:familyName He
173 schema:givenName Dan
174 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0607503176.22
175 rdf:type schema:Person
176 sg:pub.10.1007/bf00994018 schema:sameAs https://app.dimensions.ai/details/publication/pub.1025150743
177 https://doi.org/10.1007/bf00994018
178 rdf:type schema:CreativeWork
179 sg:pub.10.1023/b:stco.0000035301.49549.88 schema:sameAs https://app.dimensions.ai/details/publication/pub.1000991887
180 https://doi.org/10.1023/b:stco.0000035301.49549.88
181 rdf:type schema:CreativeWork
182 sg:pub.10.1038/ncomms1467 schema:sameAs https://app.dimensions.ai/details/publication/pub.1011822475
183 https://doi.org/10.1038/ncomms1467
184 rdf:type schema:CreativeWork
185 grid-institutes:grid.481554.9 schema:alternateName IBM T.J. Watson Research, Yorktown Heights, NY, USA
186 schema:name IBM T.J. Watson Research, Yorktown Heights, NY, USA
187 rdf:type schema:Organization
 



