Accelerating EM for Large Databases View Full Text


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2001-12

AUTHORS

Bo Thiesson, Christopher Meek, David Heckerman

ABSTRACT

The EM algorithm is a popular method for parameter estimation in a variety of problems involving missing data. However, the EM algorithm often requires significant computational resources and has been dismissed as impractical for large databases. We present two approaches that significantly reduce the computational cost of applying the EM algorithm to databases with a large number of cases, including databases with large dimensionality. Both approaches are based on partial E-steps for which we can use the results of Neal and Hinton (In Jordan, M. (Ed.), Learning in Graphical Models, pp. 355–371. The Netherlands: Kluwer Academic Publishers) to obtain the standard convergence guarantees of EM. The first approach is a version of the incremental EM algorithm, described in Neal and Hinton (1998), which cycles through data cases in blocks. The number of cases in each block dramatically effects the efficiency of the algorithm. We provide amethod for selecting a near optimal block size. The second approach, which we call lazy EM, will, at scheduled iterations, evaluate the significance of each data case and then proceed for several iterations actively using only the significant cases. We demonstrate that both methods can significantly reduce computational costs through their application to high-dimensional real-world and synthetic mixture modeling problems for large databases. More... »

PAGES

279-299

Identifiers

URI

http://scigraph.springernature.com/pub.10.1023/a:1017986506241

DOI

http://dx.doi.org/10.1023/a:1017986506241

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1017911237


Indexing Status Check whether this publication has been indexed by Scopus and Web Of Science using the SN Indexing Status Tool
Incoming Citations Browse incoming citations for this publication using opencitations.net

JSON-LD is the canonical representation for SciGraph data.

TIP: You can open this SciGraph record using an external JSON-LD service: JSON-LD Playground Google SDTT

[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Artificial Intelligence and Image Processing", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Information and Computing Sciences", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "Microsoft (United States)", 
          "id": "https://www.grid.ac/institutes/grid.419815.0", 
          "name": [
            "Microsoft Research, One Microsoft Way, 98052, Redmond, WA, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Thiesson", 
        "givenName": "Bo", 
        "id": "sg:person.013761575755.26", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013761575755.26"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Microsoft (United States)", 
          "id": "https://www.grid.ac/institutes/grid.419815.0", 
          "name": [
            "Microsoft Research, One Microsoft Way, 98052, Redmond, WA, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Meek", 
        "givenName": "Christopher", 
        "id": "sg:person.01352023432.48", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01352023432.48"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "Microsoft (United States)", 
          "id": "https://www.grid.ac/institutes/grid.419815.0", 
          "name": [
            "Microsoft Research, One Microsoft Way, 98052, Redmond, WA, USA"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Heckerman", 
        "givenName": "David", 
        "id": "sg:person.01134362461.98", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01134362461.98"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "sg:pub.10.1007/978-94-011-5014-9_12", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1009494930", 
          "https://doi.org/10.1007/978-94-011-5014-9_12"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1162/089976600300015853", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1015553486"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/347090.347123", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1024820511"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1023/a:1007469629108", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1044802083", 
          "https://doi.org/10.1023/a:1007469629108"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1145/233269.233324", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1053685676"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1093/biomet/80.2.267", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1059420382"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1109/icassp.1995.479281", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1093482120"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2001-12", 
    "datePublishedReg": "2001-12-01", 
    "description": "The EM algorithm is a popular method for parameter estimation in a variety of problems involving missing data. However, the EM algorithm often requires significant computational resources and has been dismissed as impractical for large databases. We present two approaches that significantly reduce the computational cost of applying the EM algorithm to databases with a large number of cases, including databases with large dimensionality. Both approaches are based on partial E-steps for which we can use the results of Neal and Hinton (In Jordan, M. (Ed.), Learning in Graphical Models, pp. 355\u2013371. The Netherlands: Kluwer Academic Publishers) to obtain the standard convergence guarantees of EM. The first approach is a version of the incremental EM algorithm, described in Neal and Hinton (1998), which cycles through data cases in blocks. The number of cases in each block dramatically effects the efficiency of the algorithm. We provide amethod for selecting a near optimal block size. The second approach, which we call lazy EM, will, at scheduled iterations, evaluate the significance of each data case and then proceed for several iterations actively using only the significant cases. We demonstrate that both methods can significantly reduce computational costs through their application to high-dimensional real-world and synthetic mixture modeling problems for large databases.", 
    "genre": "research_article", 
    "id": "sg:pub.10.1023/a:1017986506241", 
    "inLanguage": [
      "en"
    ], 
    "isAccessibleForFree": true, 
    "isPartOf": [
      {
        "id": "sg:journal.1125588", 
        "issn": [
          "0885-6125", 
          "1573-0565"
        ], 
        "name": "Machine Learning", 
        "type": "Periodical"
      }, 
      {
        "issueNumber": "3", 
        "type": "PublicationIssue"
      }, 
      {
        "type": "PublicationVolume", 
        "volumeNumber": "45"
      }
    ], 
    "name": "Accelerating EM for Large Databases", 
    "pagination": "279-299", 
    "productId": [
      {
        "name": "readcube_id", 
        "type": "PropertyValue", 
        "value": [
          "5556fc13dfb48f3086620b936fd0d88d39848a040f2fa61f00b4c875787d04d5"
        ]
      }, 
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1023/a:1017986506241"
        ]
      }, 
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1017911237"
        ]
      }
    ], 
    "sameAs": [
      "https://doi.org/10.1023/a:1017986506241", 
      "https://app.dimensions.ai/details/publication/pub.1017911237"
    ], 
    "sdDataset": "articles", 
    "sdDatePublished": "2019-04-11T01:58", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8700_00000504.jsonl", 
    "type": "ScholarlyArticle", 
    "url": "http://link.springer.com/10.1023/A:1017986506241"
  }
]
 

Download the RDF metadata as:  json-ld nt turtle xml License info

HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1023/a:1017986506241'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1023/a:1017986506241'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1023/a:1017986506241'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1023/a:1017986506241'


 

This table displays all metadata directly associated to this object as RDF triples.

98 TRIPLES      21 PREDICATES      34 URIs      19 LITERALS      7 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1023/a:1017986506241 schema:about anzsrc-for:08
2 anzsrc-for:0801
3 schema:author Ne38960f4425c4c7da4b8c6706622190b
4 schema:citation sg:pub.10.1007/978-94-011-5014-9_12
5 sg:pub.10.1023/a:1007469629108
6 https://doi.org/10.1093/biomet/80.2.267
7 https://doi.org/10.1109/icassp.1995.479281
8 https://doi.org/10.1145/233269.233324
9 https://doi.org/10.1145/347090.347123
10 https://doi.org/10.1162/089976600300015853
11 schema:datePublished 2001-12
12 schema:datePublishedReg 2001-12-01
13 schema:description The EM algorithm is a popular method for parameter estimation in a variety of problems involving missing data. However, the EM algorithm often requires significant computational resources and has been dismissed as impractical for large databases. We present two approaches that significantly reduce the computational cost of applying the EM algorithm to databases with a large number of cases, including databases with large dimensionality. Both approaches are based on partial E-steps for which we can use the results of Neal and Hinton (In Jordan, M. (Ed.), Learning in Graphical Models, pp. 355–371. The Netherlands: Kluwer Academic Publishers) to obtain the standard convergence guarantees of EM. The first approach is a version of the incremental EM algorithm, described in Neal and Hinton (1998), which cycles through data cases in blocks. The number of cases in each block dramatically effects the efficiency of the algorithm. We provide amethod for selecting a near optimal block size. The second approach, which we call lazy EM, will, at scheduled iterations, evaluate the significance of each data case and then proceed for several iterations actively using only the significant cases. We demonstrate that both methods can significantly reduce computational costs through their application to high-dimensional real-world and synthetic mixture modeling problems for large databases.
14 schema:genre research_article
15 schema:inLanguage en
16 schema:isAccessibleForFree true
17 schema:isPartOf N45f38b3d43c24a9e9511a498a3030bf8
18 N4f39d41d8caf4e44b011890ad829f796
19 sg:journal.1125588
20 schema:name Accelerating EM for Large Databases
21 schema:pagination 279-299
22 schema:productId N0b41e4626e394d4b8f2a1628a8f3de9d
23 N36f765aad3674315b888ee0dfeb6ee2f
24 Nc4f5cba1c6ad432ebcc9e66b4dc0ee00
25 schema:sameAs https://app.dimensions.ai/details/publication/pub.1017911237
26 https://doi.org/10.1023/a:1017986506241
27 schema:sdDatePublished 2019-04-11T01:58
28 schema:sdLicense https://scigraph.springernature.com/explorer/license/
29 schema:sdPublisher N61d0b79fb3284e0c87a534f2136fe22e
30 schema:url http://link.springer.com/10.1023/A:1017986506241
31 sgo:license sg:explorer/license/
32 sgo:sdDataset articles
33 rdf:type schema:ScholarlyArticle
34 N0b41e4626e394d4b8f2a1628a8f3de9d schema:name dimensions_id
35 schema:value pub.1017911237
36 rdf:type schema:PropertyValue
37 N18e29ec4556c482eac8849e30113e567 rdf:first sg:person.01352023432.48
38 rdf:rest N248115b5f5524e94b18d42347cdd195c
39 N248115b5f5524e94b18d42347cdd195c rdf:first sg:person.01134362461.98
40 rdf:rest rdf:nil
41 N36f765aad3674315b888ee0dfeb6ee2f schema:name readcube_id
42 schema:value 5556fc13dfb48f3086620b936fd0d88d39848a040f2fa61f00b4c875787d04d5
43 rdf:type schema:PropertyValue
44 N45f38b3d43c24a9e9511a498a3030bf8 schema:issueNumber 3
45 rdf:type schema:PublicationIssue
46 N4f39d41d8caf4e44b011890ad829f796 schema:volumeNumber 45
47 rdf:type schema:PublicationVolume
48 N61d0b79fb3284e0c87a534f2136fe22e schema:name Springer Nature - SN SciGraph project
49 rdf:type schema:Organization
50 Nc4f5cba1c6ad432ebcc9e66b4dc0ee00 schema:name doi
51 schema:value 10.1023/a:1017986506241
52 rdf:type schema:PropertyValue
53 Ne38960f4425c4c7da4b8c6706622190b rdf:first sg:person.013761575755.26
54 rdf:rest N18e29ec4556c482eac8849e30113e567
55 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
56 schema:name Information and Computing Sciences
57 rdf:type schema:DefinedTerm
58 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
59 schema:name Artificial Intelligence and Image Processing
60 rdf:type schema:DefinedTerm
61 sg:journal.1125588 schema:issn 0885-6125
62 1573-0565
63 schema:name Machine Learning
64 rdf:type schema:Periodical
65 sg:person.01134362461.98 schema:affiliation https://www.grid.ac/institutes/grid.419815.0
66 schema:familyName Heckerman
67 schema:givenName David
68 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01134362461.98
69 rdf:type schema:Person
70 sg:person.01352023432.48 schema:affiliation https://www.grid.ac/institutes/grid.419815.0
71 schema:familyName Meek
72 schema:givenName Christopher
73 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.01352023432.48
74 rdf:type schema:Person
75 sg:person.013761575755.26 schema:affiliation https://www.grid.ac/institutes/grid.419815.0
76 schema:familyName Thiesson
77 schema:givenName Bo
78 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013761575755.26
79 rdf:type schema:Person
80 sg:pub.10.1007/978-94-011-5014-9_12 schema:sameAs https://app.dimensions.ai/details/publication/pub.1009494930
81 https://doi.org/10.1007/978-94-011-5014-9_12
82 rdf:type schema:CreativeWork
83 sg:pub.10.1023/a:1007469629108 schema:sameAs https://app.dimensions.ai/details/publication/pub.1044802083
84 https://doi.org/10.1023/a:1007469629108
85 rdf:type schema:CreativeWork
86 https://doi.org/10.1093/biomet/80.2.267 schema:sameAs https://app.dimensions.ai/details/publication/pub.1059420382
87 rdf:type schema:CreativeWork
88 https://doi.org/10.1109/icassp.1995.479281 schema:sameAs https://app.dimensions.ai/details/publication/pub.1093482120
89 rdf:type schema:CreativeWork
90 https://doi.org/10.1145/233269.233324 schema:sameAs https://app.dimensions.ai/details/publication/pub.1053685676
91 rdf:type schema:CreativeWork
92 https://doi.org/10.1145/347090.347123 schema:sameAs https://app.dimensions.ai/details/publication/pub.1024820511
93 rdf:type schema:CreativeWork
94 https://doi.org/10.1162/089976600300015853 schema:sameAs https://app.dimensions.ai/details/publication/pub.1015553486
95 rdf:type schema:CreativeWork
96 https://www.grid.ac/institutes/grid.419815.0 schema:alternateName Microsoft (United States)
97 schema:name Microsoft Research, One Microsoft Way, 98052, Redmond, WA, USA
98 rdf:type schema:Organization
 




Preview window. Press ESC to close (or click here)


...