The Enron Corpus: A New Dataset for Email Classification Research


Ontology type: schema:Chapter      Open Access: True


Chapter Info

DATE

2004

AUTHORS

Bryan Klimt , Yiming Yang

ABSTRACT

Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a state-of-the-art classifier (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights.

PAGES

217-226

References to SciGraph publications

  • 2000. A Comparative Study of Classification Based Personal E-mail Filtering in KNOWLEDGE DISCOVERY AND DATA MINING. CURRENT ISSUES AND NEW APPLICATIONS
Book

    TITLE

    Machine Learning: ECML 2004

    ISBN

    978-3-540-23105-9
    978-3-540-30115-8

    Author Affiliations

    Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/978-3-540-30115-8_22

    DOI

    http://dx.doi.org/10.1007/978-3-540-30115-8_22

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1044538060



    JSON-LD is the canonical representation for SciGraph data.


    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Artificial Intelligence and Image Processing", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "Carnegie Mellon University", 
              "id": "https://www.grid.ac/institutes/grid.147455.6", 
              "name": [
                "Language Technologies Institute, Carnegie Mellon University, 15213-8213, Pittsburgh, PA, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Klimt", 
            "givenName": "Bryan", 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "Carnegie Mellon University", 
              "id": "https://www.grid.ac/institutes/grid.147455.6", 
              "name": [
                "Language Technologies Institute, Carnegie Mellon University, 15213-8213, Pittsburgh, PA, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Yang", 
            "givenName": "Yiming", 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "https://doi.org/10.1145/301136.301209", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1015443772"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-45571-x_48", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1025259101", 
              "https://doi.org/10.1007/3-540-45571-x_48"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1111/0824-7935.00127", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1028564231"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/383952.383975", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1030665885"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/s0306-4573(96)00063-5", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1032831757"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2004", 
        "datePublishedReg": "2004-01-01", 
        "description": "Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a state-of-the-art classifier (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights.", 
        "editor": [
          {
            "familyName": "Boulicaut", 
            "givenName": "Jean-Fran\u00e7ois", 
            "type": "Person"
          }, 
          {
            "familyName": "Esposito", 
            "givenName": "Floriana", 
            "type": "Person"
          }, 
          {
            "familyName": "Giannotti", 
            "givenName": "Fosca", 
            "type": "Person"
          }, 
          {
            "familyName": "Pedreschi", 
            "givenName": "Dino", 
            "type": "Person"
          }
        ], 
        "genre": "chapter", 
        "id": "sg:pub.10.1007/978-3-540-30115-8_22", 
        "inLanguage": [
          "en"
        ], 
        "isAccessibleForFree": true, 
        "isPartOf": {
          "isbn": [
            "978-3-540-23105-9", 
            "978-3-540-30115-8"
          ], 
          "name": "Machine Learning: ECML 2004", 
          "type": "Book"
        }, 
        "name": "The Enron Corpus: A New Dataset for Email Classification Research", 
        "pagination": "217-226", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1044538060"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/978-3-540-30115-8_22"
            ]
          }, 
          {
            "name": "readcube_id", 
            "type": "PropertyValue", 
            "value": [
              "d42691d58ab78d7b639651235b91641fe88c6485d196e99d8ee967ea38302881"
            ]
          }
        ], 
        "publisher": {
          "location": "Berlin, Heidelberg", 
          "name": "Springer Berlin Heidelberg", 
          "type": "Organisation"
        }, 
        "sameAs": [
          "https://doi.org/10.1007/978-3-540-30115-8_22", 
          "https://app.dimensions.ai/details/publication/pub.1044538060"
        ], 
        "sdDataset": "chapters", 
        "sdDatePublished": "2019-04-16T08:25", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000363_0000000363/records_70053_00000002.jsonl", 
        "type": "Chapter", 
        "url": "https://link.springer.com/10.1007%2F978-3-540-30115-8_22"
      }
    ]
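Because JSON-LD is plain JSON, the record above can be read with any JSON parser. The following is a minimal sketch, assuming the record is available as a string; the text below is an abridged copy holding only a few fields from the record, and the helper name `summarize` is hypothetical, not part of any SciGraph tooling.

```python
import json

# Abridged copy of the record above; only the fields used below are kept.
RECORD = """
[
  {
    "author": [
      {"familyName": "Klimt", "givenName": "Bryan", "type": "Person"},
      {"familyName": "Yang", "givenName": "Yiming", "type": "Person"}
    ],
    "name": "The Enron Corpus: A New Dataset for Email Classification Research",
    "pagination": "217-226",
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1044538060"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1007/978-3-540-30115-8_22"]}
    ],
    "type": "Chapter"
  }
]
"""


def summarize(record_text):
    """Hypothetical helper: pull a few fields out of a JSON-LD record string."""
    chapter = json.loads(record_text)[0]  # the top level is a one-element list
    authors = [f"{a['givenName']} {a['familyName']}" for a in chapter["author"]]
    # productId is a list of name/value pairs; flatten it into a dict.
    ids = {p["name"]: p["value"][0] for p in chapter["productId"]}
    return {
        "title": chapter["name"],
        "authors": authors,
        "doi": ids.get("doi"),
        "pages": chapter["pagination"],
    }


summary = summarize(RECORD)
print(summary["authors"], summary["doi"])
```

The sketch treats the record as ordinary JSON and ignores the `@context`; a JSON-LD aware library would be needed to expand property names into full IRIs.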
     


    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-30115-8_22'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-30115-8_22'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-30115-8_22'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-30115-8_22'
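The four curl commands above differ only in the `Accept` header, so the same content negotiation can be sketched in Python with the standard library. This is an illustration under assumptions: the names `FORMATS`, `build_request`, and `fetch` are hypothetical, the URL and media types are copied from the commands above, and whether the endpoint still answers is not something this page confirms, so `fetch` is defined but not called here.

```python
import urllib.request

# URL copied from the curl commands above.
RECORD_URL = "https://scigraph.springernature.com/pub.10.1007/978-3-540-30115-8_22"

# Media types copied from the curl commands above; the short keys are our own labels.
FORMATS = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(fmt, url=RECORD_URL):
    """Build (but do not send) a GET request asking for one serialization."""
    return urllib.request.Request(url, headers={"Accept": FORMATS[fmt]})


def fetch(fmt, url=RECORD_URL, timeout=30):
    """Send the request and return the response body as text (needs network access)."""
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


req = build_request("turtle")
print(req.full_url, req.get_header("Accept"))
```

Keeping request construction separate from the network call makes the header logic easy to check offline; calling `fetch("json-ld")` would be the counterpart of the first curl command.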


     

    This table displays all metadata directly associated to this object as RDF triples.

    101 TRIPLES      23 PREDICATES      32 URIs      20 LITERALS      8 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1007/978-3-540-30115-8_22 schema:about anzsrc-for:08
    2 anzsrc-for:0801
    3 schema:author Nc9ccbb2bc9564b0fb9bf49726d28785d
    4 schema:citation sg:pub.10.1007/3-540-45571-x_48
    5 https://doi.org/10.1016/s0306-4573(96)00063-5
    6 https://doi.org/10.1111/0824-7935.00127
    7 https://doi.org/10.1145/301136.301209
    8 https://doi.org/10.1145/383952.383975
    9 schema:datePublished 2004
    10 schema:datePublishedReg 2004-01-01
    11 schema:description Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a state-of-the-art classifier (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights.
    12 schema:editor N155675b3fbe1456f8880bf92af0a5725
    13 schema:genre chapter
    14 schema:inLanguage en
    15 schema:isAccessibleForFree true
    16 schema:isPartOf N7360b1d0756d497088f63f3ae26f41db
    17 schema:name The Enron Corpus: A New Dataset for Email Classification Research
    18 schema:pagination 217-226
    19 schema:productId N5d9f2d5fe85e46f895a3a7bfc7c9b2a2
    20 N9f5fb7c600c54d3bb065e5526152ef16
    21 Nd397dce3c997406f9956a238310d94c3
    22 schema:publisher N5171b9eb4cf94e1295ebac73475473a3
    23 schema:sameAs https://app.dimensions.ai/details/publication/pub.1044538060
    24 https://doi.org/10.1007/978-3-540-30115-8_22
    25 schema:sdDatePublished 2019-04-16T08:25
    26 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    27 schema:sdPublisher N19f5a421a5b1448dadb884c004c3e2c5
    28 schema:url https://link.springer.com/10.1007%2F978-3-540-30115-8_22
    29 sgo:license sg:explorer/license/
    30 sgo:sdDataset chapters
    31 rdf:type schema:Chapter
    32 N1272ff2d46a24e2782a9286046c92935 schema:affiliation https://www.grid.ac/institutes/grid.147455.6
    33 schema:familyName Klimt
    34 schema:givenName Bryan
    35 rdf:type schema:Person
    36 N155675b3fbe1456f8880bf92af0a5725 rdf:first Nb9404ae44d94417f987d43cf303a9c53
    37 rdf:rest N8e7db560867a46d98eab659e1c7a8ec5
    38 N19f5a421a5b1448dadb884c004c3e2c5 schema:name Springer Nature - SN SciGraph project
    39 rdf:type schema:Organization
    40 N5171b9eb4cf94e1295ebac73475473a3 schema:location Berlin, Heidelberg
    41 schema:name Springer Berlin Heidelberg
    42 rdf:type schema:Organisation
    43 N585c7af3c8164ec59caf02331cbf921f rdf:first Nb64e9b5e06c9419aa651f542fe7de2f0
    44 rdf:rest rdf:nil
    45 N5d9f2d5fe85e46f895a3a7bfc7c9b2a2 schema:name doi
    46 schema:value 10.1007/978-3-540-30115-8_22
    47 rdf:type schema:PropertyValue
    48 N7360b1d0756d497088f63f3ae26f41db schema:isbn 978-3-540-23105-9
    49 978-3-540-30115-8
    50 schema:name Machine Learning: ECML 2004
    51 rdf:type schema:Book
    52 N771100e50c7d4122999ab4bb753a8584 schema:familyName Pedreschi
    53 schema:givenName Dino
    54 rdf:type schema:Person
    55 N8104894bc3ea4397b06c92f33b821c4d schema:familyName Esposito
    56 schema:givenName Floriana
    57 rdf:type schema:Person
    58 N8e7db560867a46d98eab659e1c7a8ec5 rdf:first N8104894bc3ea4397b06c92f33b821c4d
    59 rdf:rest Nbfe70f26a6c14431a30f8915f2d7f83e
    60 N9f5fb7c600c54d3bb065e5526152ef16 schema:name readcube_id
    61 schema:value d42691d58ab78d7b639651235b91641fe88c6485d196e99d8ee967ea38302881
    62 rdf:type schema:PropertyValue
    63 Nb64e9b5e06c9419aa651f542fe7de2f0 schema:affiliation https://www.grid.ac/institutes/grid.147455.6
    64 schema:familyName Yang
    65 schema:givenName Yiming
    66 rdf:type schema:Person
    67 Nb9404ae44d94417f987d43cf303a9c53 schema:familyName Boulicaut
    68 schema:givenName Jean-François
    69 rdf:type schema:Person
    70 Nbfe70f26a6c14431a30f8915f2d7f83e rdf:first Nc4952c41a46849c88be54074d8202c02
    71 rdf:rest Nc8277afa72474ffdbbd0752759603a80
    72 Nc4952c41a46849c88be54074d8202c02 schema:familyName Giannotti
    73 schema:givenName Fosca
    74 rdf:type schema:Person
    75 Nc8277afa72474ffdbbd0752759603a80 rdf:first N771100e50c7d4122999ab4bb753a8584
    76 rdf:rest rdf:nil
    77 Nc9ccbb2bc9564b0fb9bf49726d28785d rdf:first N1272ff2d46a24e2782a9286046c92935
    78 rdf:rest N585c7af3c8164ec59caf02331cbf921f
    79 Nd397dce3c997406f9956a238310d94c3 schema:name dimensions_id
    80 schema:value pub.1044538060
    81 rdf:type schema:PropertyValue
    82 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    83 schema:name Information and Computing Sciences
    84 rdf:type schema:DefinedTerm
    85 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
    86 schema:name Artificial Intelligence and Image Processing
    87 rdf:type schema:DefinedTerm
    88 sg:pub.10.1007/3-540-45571-x_48 schema:sameAs https://app.dimensions.ai/details/publication/pub.1025259101
    89 https://doi.org/10.1007/3-540-45571-x_48
    90 rdf:type schema:CreativeWork
    91 https://doi.org/10.1016/s0306-4573(96)00063-5 schema:sameAs https://app.dimensions.ai/details/publication/pub.1032831757
    92 rdf:type schema:CreativeWork
    93 https://doi.org/10.1111/0824-7935.00127 schema:sameAs https://app.dimensions.ai/details/publication/pub.1028564231
    94 rdf:type schema:CreativeWork
    95 https://doi.org/10.1145/301136.301209 schema:sameAs https://app.dimensions.ai/details/publication/pub.1015443772
    96 rdf:type schema:CreativeWork
    97 https://doi.org/10.1145/383952.383975 schema:sameAs https://app.dimensions.ai/details/publication/pub.1030665885
    98 rdf:type schema:CreativeWork
    99 https://www.grid.ac/institutes/grid.147455.6 schema:alternateName Carnegie Mellon University
    100 schema:name Language Technologies Institute, Carnegie Mellon University, 15213-8213, Pittsburgh, PA, USA
    101 rdf:type schema:Organization
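In the table above, the `schema:author` value is a blank node that heads an RDF list: each list node carries an `rdf:first` (the item) and an `rdf:rest` (the next node), ending at `rdf:nil`. A minimal sketch of walking that chain to recover author order, assuming the relevant rows have been copied by hand into tuples; the function name `walk_list` is hypothetical.

```python
RDF_NIL = "rdf:nil"

# (subject, predicate, object) rows copied from the table above.
TRIPLES = [
    ("Nc9ccbb2bc9564b0fb9bf49726d28785d", "rdf:first", "N1272ff2d46a24e2782a9286046c92935"),
    ("Nc9ccbb2bc9564b0fb9bf49726d28785d", "rdf:rest", "N585c7af3c8164ec59caf02331cbf921f"),
    ("N585c7af3c8164ec59caf02331cbf921f", "rdf:first", "Nb64e9b5e06c9419aa651f542fe7de2f0"),
    ("N585c7af3c8164ec59caf02331cbf921f", "rdf:rest", "rdf:nil"),
    ("N1272ff2d46a24e2782a9286046c92935", "schema:familyName", "Klimt"),
    ("Nb64e9b5e06c9419aa651f542fe7de2f0", "schema:familyName", "Yang"),
]


def walk_list(head, triples):
    """Follow rdf:first / rdf:rest links from head until rdf:nil is reached."""
    first = {s: o for s, p, o in triples if p == "rdf:first"}
    rest = {s: o for s, p, o in triples if p == "rdf:rest"}
    items = []
    node = head
    while node != RDF_NIL:
        items.append(first[node])  # the item stored at this list node
        node = rest[node]          # move on to the next list node
    return items


family_name = {s: o for s, p, o in TRIPLES if p == "schema:familyName"}
author_nodes = walk_list("Nc9ccbb2bc9564b0fb9bf49726d28785d", TRIPLES)
print([family_name[n] for n in author_nodes])
```

The same walk applies to the `schema:editor` list, whose head is `N155675b3fbe1456f8880bf92af0a5725` in the table.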
     



