Accelerating EM for Large Databases View Full Text


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2001-12

AUTHORS

Bo Thiesson, Christopher Meek, David Heckerman

ABSTRACT

The EM algorithm is a popular method for parameter estimation in a variety of problems involving missing data. However, the EM algorithm often requires significant computational resources and has been dismissed as impractical for large databases. We present two approaches that significantly reduce the computational cost of applying the EM algorithm to databases with a large number of cases, including databases with large dimensionality. Both approaches are based on partial E-steps for which we can use the results of Neal and Hinton (In Jordan, M. (Ed.), Learning in Graphical Models, pp. 355–371. The Netherlands: Kluwer Academic Publishers) to obtain the standard convergence guarantees of EM. The first approach is a version of the incremental EM algorithm, described in Neal and Hinton (1998), which cycles through data cases in blocks. The number of cases in each block dramatically effects the efficiency of the algorithm. We provide amethod for selecting a near optimal block size. The second approach, which we call lazy EM, will, at scheduled iterations, evaluate the significance of each data case and then proceed for several iterations actively using only the significant cases. We demonstrate that both methods can significantly reduce computational costs through their application to high-dimensional real-world and synthetic mixture modeling problems for large databases. More... »

PAGES

279-299

Identifiers

URI

http://scigraph.springernature.com/pub.10.1023/a:1017986506241

DOI

http://dx.doi.org/10.1023/a:1017986506241

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1017911237


Indexing Status Check whether this publication has been indexed by Scopus and Web Of Science using the SN Indexing Status Tool
Incoming Citations Browse incoming citations for this publication using opencitations.net

JSON-LD is the canonical representation for SciGraph data.

TIP: You can open this SciGraph record using an external JSON-LD service: JSON-LD Playground Google SDTT

[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Artificial Intelligence and Image Processing", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "Microsoft (United States)", 
          "id": "https://www.grid.ac/institutes/grid.419815.0", 
          "name": [
            "Microsoft Research, One Microsoft Way, 98052, Redmond, WA, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Thiesson", 
        "givenName": "Bo", 
        "id": "sg:person.013761575755.26", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013761575755.26"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Microsoft (United States)", 
          "id": "https://www.grid.ac/institutes/grid.419815.0", 
          "name": [
            "Microsoft Research, One Microsoft Way, 98052, Redmond, WA, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Meek", 
        "givenName": "Christopher", 
        "id": "sg:person.01352023432.48", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01352023432.48"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Microsoft (United States)", 
          "id": "https://www.grid.ac/institutes/grid.419815.0", 
          "name": [
            "Microsoft Research, One Microsoft Way, 98052, Redmond, WA, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Heckerman", 
        "givenName": "David", 
        "id": "sg:person.01134362461.98", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01134362461.98"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "sg:pub.10.1007/978-94-011-5014-9_12", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1009494930", 
          "https://doi.org/10.1007/978-94-011-5014-9_12"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1162/089976600300015853", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1015553486"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/347090.347123", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1024820511"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1023/a:1007469629108", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1044802083", 
          "https://doi.org/10.1023/a:1007469629108"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/233269.233324", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1053685676"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1093/biomet/80.2.267", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1059420382"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1109/icassp.1995.479281", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1093482120"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2001-12", 
    "datePublishedReg": "2001-12-01", 
    "description": "The EM algorithm is a popular method for parameter estimation in a variety of problems involving missing data. However, the EM algorithm often requires significant computational resources and has been dismissed as impractical for large databases. We present two approaches that significantly reduce the computational cost of applying the EM algorithm to databases with a large number of cases, including databases with large dimensionality. Both approaches are based on partial E-steps for which we can use the results of Neal and Hinton (In Jordan, M. (Ed.), Learning in Graphical Models, pp. 355\u2013371. The Netherlands: Kluwer Academic Publishers) to obtain the standard convergence guarantees of EM. The first approach is a version of the incremental EM algorithm, described in Neal and Hinton (1998), which cycles through data cases in blocks. The number of cases in each block dramatically effects the efficiency of the algorithm. We provide amethod for selecting a near optimal block size. The second approach, which we call lazy EM, will, at scheduled iterations, evaluate the significance of each data case and then proceed for several iterations actively using only the significant cases. We demonstrate that both methods can significantly reduce computational costs through their application to high-dimensional real-world and synthetic mixture modeling problems for large databases.", 
    "genre": "research_article", 
    "id": "sg:pub.10.1023/a:1017986506241", 
    "inLanguage": [
      "en"
    ], 
    "isAccessibleForFree": true, 
    "isPartOf": [
      {
        "id": "sg:journal.1125588", 
        "issn": [
          "0885-6125", 
          "1573-0565"
        ], 
        "name": "Machine Learning", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "3", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "45"
      }
    ], 
    "name": "Accelerating EM for Large Databases", 
    "pagination": "279-299", 
    "productId": [
      {
        "name": "readcube_id", 
        "type": "PropertyValue", 
        "value": [
          "5556fc13dfb48f3086620b936fd0d88d39848a040f2fa61f00b4c875787d04d5"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1023/a:1017986506241"
        ]
      }, 
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1017911237"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1023/a:1017986506241", 
      "https://app.dimensions.ai/details/publication/pub.1017911237"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2019-04-11T01:58", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8700_00000504.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "http://link.springer.com/10.1023/A:1017986506241"
  }
]
 

Download the RDF metadata as:  json-ld nt turtle xml License info

HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1023/a:1017986506241'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1023/a:1017986506241'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1023/a:1017986506241'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1023/a:1017986506241'


 

This table displays all metadata directly associated to this object as RDF triples.

98 TRIPLES      21 PREDICATES      34 URIs      19 LITERALS      7 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1023/a:1017986506241 schema:about anzsrc-for:08
2 anzsrc-for:0801
3 schema:author N39c39dd93a4b4aafbcbf2df3b20aee00
4 schema:citation sg:pub.10.1007/978-94-011-5014-9_12
5 sg:pub.10.1023/a:1007469629108
6 https://doi.org/10.1093/biomet/80.2.267
7 https://doi.org/10.1109/icassp.1995.479281
8 https://doi.org/10.1145/233269.233324
9 https://doi.org/10.1145/347090.347123
10 https://doi.org/10.1162/089976600300015853
11 schema:datePublished 2001-12
12 schema:datePublishedReg 2001-12-01
13 schema:description The EM algorithm is a popular method for parameter estimation in a variety of problems involving missing data. However, the EM algorithm often requires significant computational resources and has been dismissed as impractical for large databases. We present two approaches that significantly reduce the computational cost of applying the EM algorithm to databases with a large number of cases, including databases with large dimensionality. Both approaches are based on partial E-steps for which we can use the results of Neal and Hinton (In Jordan, M. (Ed.), Learning in Graphical Models, pp. 355–371. The Netherlands: Kluwer Academic Publishers) to obtain the standard convergence guarantees of EM. The first approach is a version of the incremental EM algorithm, described in Neal and Hinton (1998), which cycles through data cases in blocks. The number of cases in each block dramatically effects the efficiency of the algorithm. We provide amethod for selecting a near optimal block size. The second approach, which we call lazy EM, will, at scheduled iterations, evaluate the significance of each data case and then proceed for several iterations actively using only the significant cases. We demonstrate that both methods can significantly reduce computational costs through their application to high-dimensional real-world and synthetic mixture modeling problems for large databases.
14 schema:genre research_article
15 schema:inLanguage en
16 schema:isAccessibleForFree true
17 schema:isPartOf N9c9d5a4ab919424fa50c6568e2517e77
18 Ncab595ab8570410fb8b372fce7be14ff
19 sg:journal.1125588
20 schema:name Accelerating EM for Large Databases
21 schema:pagination 279-299
22 schema:productId N4871d5dcce3149c6a6e425bcde28019a
23 N5e1addf61a6b49808efbb7319bef8ffa
24 N66f6eba5cd0c46e0a7cd32193c3015f1
25 schema:sameAs https://app.dimensions.ai/details/publication/pub.1017911237
26 https://doi.org/10.1023/a:1017986506241
27 schema:sdDatePublished 2019-04-11T01:58
28 schema:sdLicense https://scigraph.springernature.com/explorer/license/
29 schema:sdPublisher N02c1b3b53ec14d219ff51fbd24e72649
30 schema:url http://link.springer.com/10.1023/A:1017986506241
31 sgo:license sg:explorer/license/
32 sgo:sdDataset articles
33 rdf:type schema:ScholarlyArticle
34 N02c1b3b53ec14d219ff51fbd24e72649 schema:name Springer Nature - SN SciGraph project
35 rdf:type schema:Organization
36 N39c39dd93a4b4aafbcbf2df3b20aee00 rdf:first sg:person.013761575755.26
37 rdf:rest Nd8b14fe6631346cba68d8ae705c7e898
38 N4871d5dcce3149c6a6e425bcde28019a schema:name doi
39 schema:value 10.1023/a:1017986506241
40 rdf:type schema:PropertyValue
41 N5e1addf61a6b49808efbb7319bef8ffa schema:name dimensions_id
42 schema:value pub.1017911237
43 rdf:type schema:PropertyValue
44 N66f6eba5cd0c46e0a7cd32193c3015f1 schema:name readcube_id
45 schema:value 5556fc13dfb48f3086620b936fd0d88d39848a040f2fa61f00b4c875787d04d5
46 rdf:type schema:PropertyValue
47 N9c9d5a4ab919424fa50c6568e2517e77 schema:issueNumber 3
48 rdf:type schema:PublicationIssue
49 Ncab595ab8570410fb8b372fce7be14ff schema:volumeNumber 45
50 rdf:type schema:PublicationVolume
51 Nd8b14fe6631346cba68d8ae705c7e898 rdf:first sg:person.01352023432.48
52 rdf:rest Nff722e4dc8c54cf68614fea7c0b8e4aa
53 Nff722e4dc8c54cf68614fea7c0b8e4aa rdf:first sg:person.01134362461.98
54 rdf:rest rdf:nil
55 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
56 schema:name Information and Computing Sciences
57 rdf:type schema:DefinedTerm
58 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
59 schema:name Artificial Intelligence and Image Processing
60 rdf:type schema:DefinedTerm
61 sg:journal.1125588 schema:issn 0885-6125
62 1573-0565
63 schema:name Machine Learning
64 rdf:type schema:Periodical
65 sg:person.01134362461.98 schema:affiliation https://www.grid.ac/institutes/grid.419815.0
66 schema:familyName Heckerman
67 schema:givenName David
68 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01134362461.98
69 rdf:type schema:Person
70 sg:person.01352023432.48 schema:affiliation https://www.grid.ac/institutes/grid.419815.0
71 schema:familyName Meek
72 schema:givenName Christopher
73 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01352023432.48
74 rdf:type schema:Person
75 sg:person.013761575755.26 schema:affiliation https://www.grid.ac/institutes/grid.419815.0
76 schema:familyName Thiesson
77 schema:givenName Bo
78 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013761575755.26
79 rdf:type schema:Person
80 sg:pub.10.1007/978-94-011-5014-9_12 schema:sameAs https://app.dimensions.ai/details/publication/pub.1009494930
81 https://doi.org/10.1007/978-94-011-5014-9_12
82 rdf:type schema:CreativeWork
83 sg:pub.10.1023/a:1007469629108 schema:sameAs https://app.dimensions.ai/details/publication/pub.1044802083
84 https://doi.org/10.1023/a:1007469629108
85 rdf:type schema:CreativeWork
86 https://doi.org/10.1093/biomet/80.2.267 schema:sameAs https://app.dimensions.ai/details/publication/pub.1059420382
87 rdf:type schema:CreativeWork
88 https://doi.org/10.1109/icassp.1995.479281 schema:sameAs https://app.dimensions.ai/details/publication/pub.1093482120
89 rdf:type schema:CreativeWork
90 https://doi.org/10.1145/233269.233324 schema:sameAs https://app.dimensions.ai/details/publication/pub.1053685676
91 rdf:type schema:CreativeWork
92 https://doi.org/10.1145/347090.347123 schema:sameAs https://app.dimensions.ai/details/publication/pub.1024820511
93 rdf:type schema:CreativeWork
94 https://doi.org/10.1162/089976600300015853 schema:sameAs https://app.dimensions.ai/details/publication/pub.1015553486
95 rdf:type schema:CreativeWork
96 https://www.grid.ac/institutes/grid.419815.0 schema:alternateName Microsoft (United States)
97 schema:name Microsoft Research, One Microsoft Way, 98052, Redmond, WA, USA
98 rdf:type schema:Organization
 




Preview window. Press ESC to close (or click here)


...