Bursty and Hierarchical Structure in Streams


Ontology type: schema:ScholarlyArticle     


Article Info

DATE

2003-10

AUTHORS

Jon Kleinberg

ABSTRACT

A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise—that the appearance of a topic in a document stream is signaled by a “burst of activity,” with certain features rising sharply in frequency as the topic emerges. The goal of the present work is to develop a formal approach for modeling such “bursts,” in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.
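As a rough illustration of "bursts as state transitions", the sketch below uses only two states (baseline and burst) rather than the infinite-state automaton the abstract describes. The exponential model for gaps between arrivals, the rate multiplier `s`, the upward transition penalty `gamma * ln(n)`, and the function name `detect_burst_states` are all assumptions made for this example; none of them is specified in this record.

```python
import math


def detect_burst_states(gaps, s=2.0, gamma=1.0):
    """Label each inter-arrival gap with 0 (baseline) or 1 (burst).

    Assumed model: gaps are exponentially distributed, with the burst
    state emitting at s times the baseline rate. Moving up to the burst
    state costs gamma * ln(n); moving back down is free.
    """
    if not gaps:
        return []
    n = len(gaps)
    base_rate = n / sum(gaps)
    rates = (base_rate, base_rate * s)
    up_cost = gamma * math.log(n)

    def emission_cost(state, gap):
        # Negative log of the exponential density rate * exp(-rate * gap).
        rate = rates[state]
        return rate * gap - math.log(rate)

    # Viterbi-style dynamic program; the stream starts in the baseline state.
    cost = [0.0, math.inf]
    back = []
    for gap in gaps:
        new_cost, choice = [], []
        for j in (0, 1):
            candidates = [cost[i] + (up_cost if j > i else 0.0) for i in (0, 1)]
            best = min((0, 1), key=lambda i: candidates[i])
            choice.append(best)
            new_cost.append(candidates[best] + emission_cost(j, gap))
        back.append(choice)
        cost = new_cost

    # Trace the cheapest state sequence backwards.
    state = min((0, 1), key=lambda i: cost[i])
    states = []
    for choice in reversed(back):
        states.append(state)
        state = choice[state]
    states.reverse()
    return states


# Ten short gaps surrounded by long ones form a single burst.
labels = detect_burst_states([10] * 5 + [1] * 10 + [10] * 5)
```

Raising `gamma` makes the sketch more reluctant to enter the burst state, so short runs of closely spaced arrivals are ignored.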

PAGES

373-397

References to SciGraph publications

  • 1998-07. The Hierarchical Hidden Markov Model: Analysis and Applications in MACHINE LEARNING
  • 1999-02. Statistical Models for Text Segmentation in MACHINE LEARNING
  • 2002-06-25. Finding Frequent Items in Data Streams in AUTOMATA, LANGUAGES AND PROGRAMMING

Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1023/a:1024940629314

    DOI

    http://dx.doi.org/10.1023/a:1024940629314

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1042400043



    JSON-LD is the canonical representation for SciGraph data.


    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Artificial Intelligence and Image Processing", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0806", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information Systems", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "Department of Computer Science, Cornell University, 14853, Ithaca, NY, USA", 
              "id": "http://www.grid.ac/institutes/grid.5386.8", 
              "name": [
                "Department of Computer Science, Cornell University, 14853, Ithaca, NY, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Kleinberg", 
            "givenName": "Jon", 
            "id": "sg:person.011522233557.04", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011522233557.04"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "sg:pub.10.1007/3-540-45465-9_59", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1002330524", 
              "https://doi.org/10.1007/3-540-45465-9_59"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1023/a:1007469218079", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1030131329", 
              "https://doi.org/10.1023/a:1007469218079"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1023/a:1007506220214", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1051234706", 
              "https://doi.org/10.1023/a:1007506220214"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2003-10", 
        "datePublishedReg": "2003-10-01", 
        "description": "A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise\u2014that the appearance of a topic in a document stream is signaled by a \u201cburst of activity,\u201d with certain features rising sharply in frequency as the topic emerges.The goal of the present work is to develop a formal approach for modeling such \u201cbursts,\u201d in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.", 
        "genre": "article", 
        "id": "sg:pub.10.1023/a:1024940629314", 
        "inLanguage": "en", 
        "isAccessibleForFree": false, 
        "isPartOf": [
          {
            "id": "sg:journal.1041853", 
            "issn": [
              "1384-5810", 
              "1573-756X"
            ], 
            "name": "Data Mining and Knowledge Discovery", 
            "publisher": "Springer Nature", 
            "type": "Periodical"
          }, 
          {
            "issueNumber": "4", 
            "type": "PublicationIssue"
          }, 
          {
            "type": "PublicationVolume", 
            "volumeNumber": "7"
          }
        ], 
        "keywords": [
          "document streams", 
          "text data mining", 
          "bursty network traffic", 
          "infinite-state automata", 
          "hierarchical structure", 
          "data mining", 
          "network traffic", 
          "underlying content", 
          "set of bursts", 
          "formal approach", 
          "paper archives", 
          "meaningful structures", 
          "particular research field", 
          "such streams", 
          "fundamental problem", 
          "news articles", 
          "overall stream", 
          "mining works", 
          "research field", 
          "state transitions", 
          "streams", 
          "natural meaning", 
          "mining", 
          "algorithm", 
          "mail", 
          "traffic", 
          "topic", 
          "bursty", 
          "organizational framework", 
          "automata", 
          "framework", 
          "certain features", 
          "representation", 
          "work", 
          "set", 
          "period of time", 
          "archives", 
          "features", 
          "goal", 
          "example", 
          "time", 
          "way", 
          "model", 
          "experiments", 
          "premise", 
          "bursts of activity", 
          "terms", 
          "structure", 
          "natural examples", 
          "content", 
          "field", 
          "area", 
          "meaning", 
          "article", 
          "literature", 
          "theory", 
          "analogy", 
          "bursts", 
          "scale", 
          "appearance", 
          "present work", 
          "time scales", 
          "phenomenon", 
          "rise", 
          "longer time scales", 
          "frequency", 
          "activity", 
          "transition", 
          "intensity", 
          "period", 
          "similar phenomenon", 
          "approach", 
          "problem", 
          "text mining work", 
          "intuitive premise", 
          "research paper archives"
        ], 
        "name": "Bursty and Hierarchical Structure in Streams", 
        "pagination": "373-397", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1042400043"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1023/a:1024940629314"
            ]
          }
        ], 
        "sameAs": [
          "https://doi.org/10.1023/a:1024940629314", 
          "https://app.dimensions.ai/details/publication/pub.1042400043"
        ], 
        "sdDataset": "articles", 
        "sdDatePublished": "2022-01-01T18:12", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-springernature-scigraph/baseset/20220101/entities/gbq_results/article/article_364.jsonl", 
        "type": "ScholarlyArticle", 
        "url": "https://doi.org/10.1023/a:1024940629314"
      }
    ]
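Because the record above is plain JSON, the standard library is enough to read it. The following is a minimal sketch that pulls a citation-style summary out of a trimmed copy of the record; the helper name `summarize` and the choice of output keys are assumptions for illustration, while the field names (`name`, `author`, `productId`, `isPartOf`, `pagination`) are taken from the record itself.

```python
import json

# Trimmed copy of the record shown above (only the fields used below).
RECORD_JSON = """
[
  {
    "author": [
      {"familyName": "Kleinberg", "givenName": "Jon", "type": "Person"}
    ],
    "isPartOf": [
      {"name": "Data Mining and Knowledge Discovery", "type": "Periodical"},
      {"issueNumber": "4", "type": "PublicationIssue"},
      {"type": "PublicationVolume", "volumeNumber": "7"}
    ],
    "name": "Bursty and Hierarchical Structure in Streams",
    "pagination": "373-397",
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1042400043"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1023/a:1024940629314"]}
    ],
    "type": "ScholarlyArticle"
  }
]
"""


def summarize(record):
    """Collect the bibliographic fields of one SciGraph article record."""
    # productId is a list of {name, value[]} pairs; index it by name.
    ids = {p["name"]: p["value"][0] for p in record.get("productId", [])}
    # isPartOf mixes journal, issue and volume; index it by type.
    parts = {p["type"]: p for p in record.get("isPartOf", [])}
    return {
        "title": record["name"],
        "authors": [
            f'{a["givenName"]} {a["familyName"]}' for a in record.get("author", [])
        ],
        "doi": ids.get("doi"),
        "journal": parts.get("Periodical", {}).get("name"),
        "volume": parts.get("PublicationVolume", {}).get("volumeNumber"),
        "issue": parts.get("PublicationIssue", {}).get("issueNumber"),
        "pages": record.get("pagination"),
    }


# The document is a one-element JSON array, so take the first entry.
summary = summarize(json.loads(RECORD_JSON)[0])
```

Indexing `isPartOf` and `productId` by their `type` and `name` values avoids relying on the order of entries, which the record does not appear to guarantee.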
     


    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1023/a:1024940629314'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1023/a:1024940629314'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1023/a:1024940629314'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1023/a:1024940629314'
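The four curl commands above differ only in the `Accept` header, so the same content negotiation can be sketched with Python's standard library. The format keys in `ACCEPT` and the helper names `build_request` and `fetch` are assumptions for this example, and whether the endpoint still responds is not something this record confirms.

```python
import urllib.request

RECORD_URL = "https://scigraph.springernature.com/pub.10.1023/a:1024940629314"

# Media types copied from the curl commands above; the short keys are
# labels chosen for this sketch.
ACCEPT = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt, url=RECORD_URL):
    """Prepare a GET request asking for the given RDF serialization."""
    return urllib.request.Request(url, headers={"Accept": ACCEPT[fmt]})


def fetch(fmt, url=RECORD_URL, timeout=30):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch("turtle")` would correspond to the Turtle curl command; `build_request` is kept separate so the headers can be inspected without making a network call.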


     

    This table displays all metadata directly associated to this object as RDF triples.

    150 TRIPLES      22 PREDICATES      105 URIs      93 LITERALS      6 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1023/a:1024940629314 schema:about anzsrc-for:08
    2 anzsrc-for:0801
    3 anzsrc-for:0806
    4 schema:author N871aebd136164398a0386aec2382f8ce
    5 schema:citation sg:pub.10.1007/3-540-45465-9_59
    6 sg:pub.10.1023/a:1007469218079
    7 sg:pub.10.1023/a:1007506220214
    8 schema:datePublished 2003-10
    9 schema:datePublishedReg 2003-10-01
    10 schema:description A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise—that the appearance of a topic in a document stream is signaled by a “burst of activity,” with certain features rising sharply in frequency as the topic emerges.The goal of the present work is to develop a formal approach for modeling such “bursts,” in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.
    11 schema:genre article
    12 schema:inLanguage en
    13 schema:isAccessibleForFree false
    14 schema:isPartOf N05f0888abc8f4fd0add7a3789fb7beb2
    15 N93b4c04b476d470db3da03ce5486a9c7
    16 sg:journal.1041853
    17 schema:keywords activity
    18 algorithm
    19 analogy
    20 appearance
    21 approach
    22 archives
    23 area
    24 article
    25 automata
    26 bursts
    27 bursts of activity
    28 bursty
    29 bursty network traffic
    30 certain features
    31 content
    32 data mining
    33 document streams
    34 example
    35 experiments
    36 features
    37 field
    38 formal approach
    39 framework
    40 frequency
    41 fundamental problem
    42 goal
    43 hierarchical structure
    44 infinite-state automata
    45 intensity
    46 intuitive premise
    47 literature
    48 longer time scales
    49 mail
    50 meaning
    51 meaningful structures
    52 mining
    53 mining works
    54 model
    55 natural examples
    56 natural meaning
    57 network traffic
    58 news articles
    59 organizational framework
    60 overall stream
    61 paper archives
    62 particular research field
    63 period
    64 period of time
    65 phenomenon
    66 premise
    67 present work
    68 problem
    69 representation
    70 research field
    71 research paper archives
    72 rise
    73 scale
    74 set
    75 set of bursts
    76 similar phenomenon
    77 state transitions
    78 streams
    79 structure
    80 such streams
    81 terms
    82 text data mining
    83 text mining work
    84 theory
    85 time
    86 time scales
    87 topic
    88 traffic
    89 transition
    90 underlying content
    91 way
    92 work
    93 schema:name Bursty and Hierarchical Structure in Streams
    94 schema:pagination 373-397
    95 schema:productId N2c7251f33a0345ff9c322420f3f63066
    96 N53cc62dbdd834a849055232b687ed1ac
    97 schema:sameAs https://app.dimensions.ai/details/publication/pub.1042400043
    98 https://doi.org/10.1023/a:1024940629314
    99 schema:sdDatePublished 2022-01-01T18:12
    100 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    101 schema:sdPublisher Nca7d263f1a5140978ee3d6d6fe78060c
    102 schema:url https://doi.org/10.1023/a:1024940629314
    103 sgo:license sg:explorer/license/
    104 sgo:sdDataset articles
    105 rdf:type schema:ScholarlyArticle
    106 N05f0888abc8f4fd0add7a3789fb7beb2 schema:volumeNumber 7
    107 rdf:type schema:PublicationVolume
    108 N2c7251f33a0345ff9c322420f3f63066 schema:name dimensions_id
    109 schema:value pub.1042400043
    110 rdf:type schema:PropertyValue
    111 N53cc62dbdd834a849055232b687ed1ac schema:name doi
    112 schema:value 10.1023/a:1024940629314
    113 rdf:type schema:PropertyValue
    114 N871aebd136164398a0386aec2382f8ce rdf:first sg:person.011522233557.04
    115 rdf:rest rdf:nil
    116 N93b4c04b476d470db3da03ce5486a9c7 schema:issueNumber 4
    117 rdf:type schema:PublicationIssue
    118 Nca7d263f1a5140978ee3d6d6fe78060c schema:name Springer Nature - SN SciGraph project
    119 rdf:type schema:Organization
    120 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    121 schema:name Information and Computing Sciences
    122 rdf:type schema:DefinedTerm
    123 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
    124 schema:name Artificial Intelligence and Image Processing
    125 rdf:type schema:DefinedTerm
    126 anzsrc-for:0806 schema:inDefinedTermSet anzsrc-for:
    127 schema:name Information Systems
    128 rdf:type schema:DefinedTerm
    129 sg:journal.1041853 schema:issn 1384-5810
    130 1573-756X
    131 schema:name Data Mining and Knowledge Discovery
    132 schema:publisher Springer Nature
    133 rdf:type schema:Periodical
    134 sg:person.011522233557.04 schema:affiliation grid-institutes:grid.5386.8
    135 schema:familyName Kleinberg
    136 schema:givenName Jon
    137 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011522233557.04
    138 rdf:type schema:Person
    139 sg:pub.10.1007/3-540-45465-9_59 schema:sameAs https://app.dimensions.ai/details/publication/pub.1002330524
    140 https://doi.org/10.1007/3-540-45465-9_59
    141 rdf:type schema:CreativeWork
    142 sg:pub.10.1023/a:1007469218079 schema:sameAs https://app.dimensions.ai/details/publication/pub.1030131329
    143 https://doi.org/10.1023/a:1007469218079
    144 rdf:type schema:CreativeWork
    145 sg:pub.10.1023/a:1007506220214 schema:sameAs https://app.dimensions.ai/details/publication/pub.1051234706
    146 https://doi.org/10.1023/a:1007506220214
    147 rdf:type schema:CreativeWork
    148 grid-institutes:grid.5386.8 schema:alternateName Department of Computer Science, Cornell University, 14853, Ithaca, NY, USA
    149 schema:name Department of Computer Science, Cornell University, 14853, Ithaca, NY, USA
    150 rdf:type schema:Organization
     



