Bursty and Hierarchical Structure in Streams


Ontology type: schema:ScholarlyArticle     


Article Info

DATE

2003-10

AUTHORS

Jon Kleinberg

ABSTRACT

A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise: that the appearance of a topic in a document stream is signaled by a “burst of activity,” with certain features rising sharply in frequency as the topic emerges. The goal of the present work is to develop a formal approach for modeling such “bursts,” in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.
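To make the idea of bursts as state transitions concrete, here is a minimal sketch that restricts the automaton to two states (baseline and burst) instead of the infinite-state model the abstract describes. The function name, the parameters `s` and `gamma`, the exponential model for gaps between arrivals, and the logarithmic cost of entering the burst state are all assumptions of this sketch, not details stated in the abstract.

```python
import math


def two_state_bursts(gaps, s=2.0, gamma=1.0):
    """Label each inter-arrival gap 0 (baseline) or 1 (burst).

    Hypothetical two-state simplification: the cheapest state sequence
    is found by dynamic programming over the gaps.
    """
    if not gaps:
        return []
    n = len(gaps)
    base_rate = n / sum(gaps)            # state 0: average arrival rate
    rates = (base_rate, s * base_rate)   # state 1: arrivals s times as frequent
    up_cost = gamma * math.log(n)        # assumed penalty for entering state 1

    def emit_cost(state, gap):
        # Negative log-density of an exponentially distributed gap.
        return rates[state] * gap - math.log(rates[state])

    cost = [0.0, math.inf]               # the sequence starts in the baseline state
    back = []
    for gap in gaps:
        step_cost, step_back = [], []
        for j in (0, 1):
            # Moving up to the burst state is penalized; moving down is free.
            options = [cost[i] + (up_cost if j > i else 0.0) for i in (0, 1)]
            best = 0 if options[0] <= options[1] else 1
            step_back.append(best)
            step_cost.append(options[best] + emit_cost(j, gap))
        back.append(step_back)
        cost = step_cost

    # Walk the back-pointers from the cheapest final state.
    state = 0 if cost[0] <= cost[1] else 1
    labels = []
    for step_back in reversed(back):
        labels.append(state)
        state = step_back[state]
    labels.reverse()
    return labels


if __name__ == "__main__":
    gaps = [10] * 5 + [1] * 10 + [10] * 5
    print(two_state_bursts(gaps))
```

With the example gaps, the ten short gaps in the middle are labeled 1 and the long gaps on either side are labeled 0. Raising `gamma` makes the sketch more reluctant to enter the burst state; raising `s` demands a sharper increase in rate before a run of gaps counts as a burst.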

PAGES

373-397

References to SciGraph publications

  • 1998-07. The Hierarchical Hidden Markov Model: Analysis and Applications in MACHINE LEARNING
  • 1999-02. Statistical Models for Text Segmentation in MACHINE LEARNING
  • 2002-06-25. Finding Frequent Items in Data Streams in AUTOMATA, LANGUAGES AND PROGRAMMING
Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1023/a:1024940629314

    DOI

    http://dx.doi.org/10.1023/a:1024940629314

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1042400043



    JSON-LD is the canonical representation for SciGraph data.


    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Artificial Intelligence and Image Processing", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0806", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information Systems", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "Department of Computer Science, Cornell University, 14853, Ithaca, NY, USA", 
              "id": "http://www.grid.ac/institutes/grid.5386.8", 
              "name": [
                "Department of Computer Science, Cornell University, 14853, Ithaca, NY, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Kleinberg", 
            "givenName": "Jon", 
            "id": "sg:person.011522233557.04", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011522233557.04"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "sg:pub.10.1023/a:1007506220214", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1051234706", 
              "https://doi.org/10.1023/a:1007506220214"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1023/a:1007469218079", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1030131329", 
              "https://doi.org/10.1023/a:1007469218079"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-45465-9_59", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1002330524", 
              "https://doi.org/10.1007/3-540-45465-9_59"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2003-10", 
        "datePublishedReg": "2003-10-01", 
        "description": "A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise\u2014that the appearance of a topic in a document stream is signaled by a \u201cburst of activity,\u201d with certain features rising sharply in frequency as the topic emerges.The goal of the present work is to develop a formal approach for modeling such \u201cbursts,\u201d in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.", 
        "genre": "article", 
        "id": "sg:pub.10.1023/a:1024940629314", 
        "inLanguage": "en", 
        "isAccessibleForFree": false, 
        "isPartOf": [
          {
            "id": "sg:journal.1041853", 
            "issn": [
              "1384-5810", 
              "1573-756X"
            ], 
            "name": "Data Mining and Knowledge Discovery", 
            "publisher": "Springer Nature", 
            "type": "Periodical"
          }, 
          {
            "issueNumber": "4", 
            "type": "PublicationIssue"
          }, 
          {
            "type": "PublicationVolume", 
            "volumeNumber": "7"
          }
        ], 
        "keywords": [
          "document streams", 
          "text data mining", 
          "text mining work", 
          "bursty network traffic", 
          "hierarchical structure", 
          "data mining", 
          "network traffic", 
          "underlying content", 
          "set of bursts", 
          "formal approach", 
          "infinite-state automata", 
          "paper archives", 
          "particular research field", 
          "meaningful structures", 
          "such streams", 
          "fundamental problem", 
          "news articles", 
          "research field", 
          "mining works", 
          "overall stream", 
          "streams", 
          "state transitions", 
          "mail", 
          "mining", 
          "algorithm", 
          "traffic", 
          "organizational framework", 
          "topic", 
          "automata", 
          "natural meaning", 
          "bursty", 
          "certain features", 
          "framework", 
          "representation", 
          "period of time", 
          "set", 
          "work", 
          "archives", 
          "features", 
          "goal", 
          "time", 
          "way", 
          "example", 
          "model", 
          "premise", 
          "experiments", 
          "structure", 
          "terms", 
          "natural examples", 
          "bursts of activity", 
          "field", 
          "content", 
          "article", 
          "area", 
          "meaning", 
          "literature", 
          "theory", 
          "analogy", 
          "scale", 
          "appearance", 
          "bursts", 
          "present work", 
          "time scales", 
          "rise", 
          "phenomenon", 
          "long time scales", 
          "activity", 
          "frequency", 
          "transition", 
          "intensity", 
          "period", 
          "similar phenomenon", 
          "approach", 
          "problem"
        ], 
        "name": "Bursty and Hierarchical Structure in Streams", 
        "pagination": "373-397", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1042400043"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1023/a:1024940629314"
            ]
          }
        ], 
        "sameAs": [
          "https://doi.org/10.1023/a:1024940629314", 
          "https://app.dimensions.ai/details/publication/pub.1042400043"
        ], 
        "sdDataset": "articles", 
        "sdDatePublished": "2022-05-10T09:54", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-springernature-scigraph/baseset/20220509/entities/gbq_results/article/article_363.jsonl", 
        "type": "ScholarlyArticle", 
        "url": "https://doi.org/10.1023/a:1024940629314"
      }
    ]
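As a rough illustration of reading this record, the following sketch parses an abbreviated copy of the JSON-LD above and pulls out a citation summary. The function name `summarize` and the shape of the returned dictionary are assumptions made for this example; the field names and values are copied from the record.

```python
import json

# Abbreviated copy of the record above; only the fields used below are kept.
RECORD = json.loads("""
[
  {
    "name": "Bursty and Hierarchical Structure in Streams",
    "datePublished": "2003-10",
    "pagination": "373-397",
    "isPartOf": [
      {"id": "sg:journal.1041853",
       "name": "Data Mining and Knowledge Discovery",
       "type": "Periodical"},
      {"issueNumber": "4", "type": "PublicationIssue"},
      {"type": "PublicationVolume", "volumeNumber": "7"}
    ],
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue",
       "value": ["pub.1042400043"]},
      {"name": "doi", "type": "PropertyValue",
       "value": ["10.1023/a:1024940629314"]}
    ]
  }
]
""")


def summarize(record_list):
    """Return a flat citation summary for the first article in the list."""
    article = record_list[0]
    # Index the isPartOf entries by their type (journal, issue, volume).
    parts = {part["type"]: part for part in article.get("isPartOf", [])}
    # Each productId entry holds its identifier as a one-element list.
    ids = {pid["name"]: pid["value"][0] for pid in article.get("productId", [])}
    return {
        "title": article["name"],
        "journal": parts["Periodical"]["name"],
        "volume": parts["PublicationVolume"]["volumeNumber"],
        "issue": parts["PublicationIssue"]["issueNumber"],
        "pages": article["pagination"],
        "doi": ids["doi"],
    }


if __name__ == "__main__":
    print(summarize(RECORD))
```

Because JSON-LD is plain JSON, the standard `json` module is enough here; a dedicated JSON-LD processor would only be needed to expand the `@context` into full IRIs.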
     


    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1023/a:1024940629314'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1023/a:1024940629314'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1023/a:1024940629314'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1023/a:1024940629314'
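The four curl commands above differ only in their Accept header. A minimal Python sketch of the same content negotiation follows; the names `ACCEPT`, `build_request`, and `fetch` are hypothetical, and whether the endpoint still answers these requests is an assumption that this sketch does not verify.

```python
import urllib.request

# URL and media types are taken from the curl commands above.
RECORD_URL = "https://scigraph.springernature.com/pub.10.1023/a:1024940629314"

ACCEPT = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(fmt, url=RECORD_URL):
    """Build a GET request that asks for one of the four serializations."""
    return urllib.request.Request(url, headers={"Accept": ACCEPT[fmt]})


def fetch(fmt, url=RECORD_URL, timeout=30):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Performs a network call; it may fail if the service is unavailable.
    print(fetch("turtle")[:500])
```

Keeping request construction separate from the network call makes the header logic easy to check without contacting the server.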


     

    This table displays all metadata directly associated to this object as RDF triples.

    148 TRIPLES      22 PREDICATES      103 URIs      91 LITERALS      6 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1023/a:1024940629314 schema:about anzsrc-for:08
    2 anzsrc-for:0801
    3 anzsrc-for:0806
    4 schema:author N99d4aae70520401598cbf5dadf0357e9
    5 schema:citation sg:pub.10.1007/3-540-45465-9_59
    6 sg:pub.10.1023/a:1007469218079
    7 sg:pub.10.1023/a:1007506220214
    8 schema:datePublished 2003-10
    9 schema:datePublishedReg 2003-10-01
    10 schema:description A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise—that the appearance of a topic in a document stream is signaled by a “burst of activity,” with certain features rising sharply in frequency as the topic emerges.The goal of the present work is to develop a formal approach for modeling such “bursts,” in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.
    11 schema:genre article
    12 schema:inLanguage en
    13 schema:isAccessibleForFree false
    14 schema:isPartOf N236b283b34b9499ebb21578950e38f4a
    15 N9c1d5a4e99cf4e4d8b12ab18ccc404d0
    16 sg:journal.1041853
    17 schema:keywords activity
    18 algorithm
    19 analogy
    20 appearance
    21 approach
    22 archives
    23 area
    24 article
    25 automata
    26 bursts
    27 bursts of activity
    28 bursty
    29 bursty network traffic
    30 certain features
    31 content
    32 data mining
    33 document streams
    34 example
    35 experiments
    36 features
    37 field
    38 formal approach
    39 framework
    40 frequency
    41 fundamental problem
    42 goal
    43 hierarchical structure
    44 infinite-state automata
    45 intensity
    46 literature
    47 long time scales
    48 mail
    49 meaning
    50 meaningful structures
    51 mining
    52 mining works
    53 model
    54 natural examples
    55 natural meaning
    56 network traffic
    57 news articles
    58 organizational framework
    59 overall stream
    60 paper archives
    61 particular research field
    62 period
    63 period of time
    64 phenomenon
    65 premise
    66 present work
    67 problem
    68 representation
    69 research field
    70 rise
    71 scale
    72 set
    73 set of bursts
    74 similar phenomenon
    75 state transitions
    76 streams
    77 structure
    78 such streams
    79 terms
    80 text data mining
    81 text mining work
    82 theory
    83 time
    84 time scales
    85 topic
    86 traffic
    87 transition
    88 underlying content
    89 way
    90 work
    91 schema:name Bursty and Hierarchical Structure in Streams
    92 schema:pagination 373-397
    93 schema:productId N03d06a441409401aa6552a2c832cbdc5
    94 N7673a7f244ab44648d69fa44c9e57332
    95 schema:sameAs https://app.dimensions.ai/details/publication/pub.1042400043
    96 https://doi.org/10.1023/a:1024940629314
    97 schema:sdDatePublished 2022-05-10T09:54
    98 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    99 schema:sdPublisher N59ec8993a2b64d1da5318faa32541846
    100 schema:url https://doi.org/10.1023/a:1024940629314
    101 sgo:license sg:explorer/license/
    102 sgo:sdDataset articles
    103 rdf:type schema:ScholarlyArticle
    104 N03d06a441409401aa6552a2c832cbdc5 schema:name dimensions_id
    105 schema:value pub.1042400043
    106 rdf:type schema:PropertyValue
    107 N236b283b34b9499ebb21578950e38f4a schema:issueNumber 4
    108 rdf:type schema:PublicationIssue
    109 N59ec8993a2b64d1da5318faa32541846 schema:name Springer Nature - SN SciGraph project
    110 rdf:type schema:Organization
    111 N7673a7f244ab44648d69fa44c9e57332 schema:name doi
    112 schema:value 10.1023/a:1024940629314
    113 rdf:type schema:PropertyValue
    114 N99d4aae70520401598cbf5dadf0357e9 rdf:first sg:person.011522233557.04
    115 rdf:rest rdf:nil
    116 N9c1d5a4e99cf4e4d8b12ab18ccc404d0 schema:volumeNumber 7
    117 rdf:type schema:PublicationVolume
    118 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    119 schema:name Information and Computing Sciences
    120 rdf:type schema:DefinedTerm
    121 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
    122 schema:name Artificial Intelligence and Image Processing
    123 rdf:type schema:DefinedTerm
    124 anzsrc-for:0806 schema:inDefinedTermSet anzsrc-for:
    125 schema:name Information Systems
    126 rdf:type schema:DefinedTerm
    127 sg:journal.1041853 schema:issn 1384-5810
    128 1573-756X
    129 schema:name Data Mining and Knowledge Discovery
    130 schema:publisher Springer Nature
    131 rdf:type schema:Periodical
    132 sg:person.011522233557.04 schema:affiliation grid-institutes:grid.5386.8
    133 schema:familyName Kleinberg
    134 schema:givenName Jon
    135 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011522233557.04
    136 rdf:type schema:Person
    137 sg:pub.10.1007/3-540-45465-9_59 schema:sameAs https://app.dimensions.ai/details/publication/pub.1002330524
    138 https://doi.org/10.1007/3-540-45465-9_59
    139 rdf:type schema:CreativeWork
    140 sg:pub.10.1023/a:1007469218079 schema:sameAs https://app.dimensions.ai/details/publication/pub.1030131329
    141 https://doi.org/10.1023/a:1007469218079
    142 rdf:type schema:CreativeWork
    143 sg:pub.10.1023/a:1007506220214 schema:sameAs https://app.dimensions.ai/details/publication/pub.1051234706
    144 https://doi.org/10.1023/a:1007506220214
    145 rdf:type schema:CreativeWork
    146 grid-institutes:grid.5386.8 schema:alternateName Department of Computer Science, Cornell University, 14853, Ithaca, NY, USA
    147 schema:name Department of Computer Science, Cornell University, 14853, Ithaca, NY, USA
    148 rdf:type schema:Organization
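In the table above, a row that omits the subject or the predicate inherits it from the nearest earlier row that states one. The sketch below expands such rows into full triples, using the first four rows as sample input; the function name `fill_down` and the use of `None` for an omitted cell are assumptions of this example.

```python
def fill_down(rows):
    """Expand (subject, predicate, object) rows whose omitted cells are None."""
    triples = []
    subject = predicate = None
    for row_subject, row_predicate, obj in rows:
        if row_subject is not None:
            subject = row_subject        # a new subject starts a new group
        if row_predicate is not None:
            predicate = row_predicate    # a new predicate applies until replaced
        if subject is None or predicate is None:
            raise ValueError("first row must state both subject and predicate")
        triples.append((subject, predicate, obj))
    return triples


# Rows 1-4 of the table, with omitted cells written as None.
ROWS = [
    ("sg:pub.10.1023/a:1024940629314", "schema:about", "anzsrc-for:08"),
    (None, None, "anzsrc-for:0801"),
    (None, None, "anzsrc-for:0806"),
    (None, "schema:author", "N99d4aae70520401598cbf5dadf0357e9"),
]

if __name__ == "__main__":
    for triple in fill_down(ROWS):
        print(" ".join(triple))
```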
     



