The Enron Corpus: A New Dataset for Email Classification Research


Ontology type: schema:Chapter      Open Access: True


Chapter Info

DATE

2004

AUTHORS

Bryan Klimt , Yiming Yang

ABSTRACT

Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a state-of-the-art classifier (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights.

PAGES

217-226

References to SciGraph publications

  • 2000. A Comparative Study of Classification Based Personal E-mail Filtering in KNOWLEDGE DISCOVERY AND DATA MINING. CURRENT ISSUES AND NEW APPLICATIONS
Book

    TITLE

    Machine Learning: ECML 2004

    ISBN

    978-3-540-23105-9
    978-3-540-30115-8

    Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/978-3-540-30115-8_22

    DOI

    http://dx.doi.org/10.1007/978-3-540-30115-8_22

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1044538060


    JSON-LD is the canonical representation for SciGraph data.

    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Artificial Intelligence and Image Processing", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "Carnegie Mellon University", 
              "id": "https://www.grid.ac/institutes/grid.147455.6", 
              "name": [
                "Language Technologies Institute, Carnegie Mellon University, 15213-8213, Pittsburgh, PA, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Klimt", 
            "givenName": "Bryan", 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "Carnegie Mellon University", 
              "id": "https://www.grid.ac/institutes/grid.147455.6", 
              "name": [
                "Language Technologies Institute, Carnegie Mellon University, 15213-8213, Pittsburgh, PA, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Yang", 
            "givenName": "Yiming", 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "https://doi.org/10.1145/301136.301209", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1015443772"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-45571-x_48", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1025259101", 
              "https://doi.org/10.1007/3-540-45571-x_48"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1111/0824-7935.00127", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1028564231"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/383952.383975", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1030665885"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/s0306-4573(96)00063-5", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1032831757"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2004", 
        "datePublishedReg": "2004-01-01", 
        "description": "Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a state-of-the-art classifier (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights.", 
        "editor": [
          {
            "familyName": "Boulicaut", 
            "givenName": "Jean-Fran\u00e7ois", 
            "type": "Person"
          }, 
          {
            "familyName": "Esposito", 
            "givenName": "Floriana", 
            "type": "Person"
          }, 
          {
            "familyName": "Giannotti", 
            "givenName": "Fosca", 
            "type": "Person"
          }, 
          {
            "familyName": "Pedreschi", 
            "givenName": "Dino", 
            "type": "Person"
          }
        ], 
        "genre": "chapter", 
        "id": "sg:pub.10.1007/978-3-540-30115-8_22", 
        "inLanguage": [
          "en"
        ], 
        "isAccessibleForFree": true, 
        "isPartOf": {
          "isbn": [
            "978-3-540-23105-9", 
            "978-3-540-30115-8"
          ], 
          "name": "Machine Learning: ECML 2004", 
          "type": "Book"
        }, 
        "name": "The Enron Corpus: A New Dataset for Email Classification Research", 
        "pagination": "217-226", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1044538060"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/978-3-540-30115-8_22"
            ]
          }, 
          {
            "name": "readcube_id", 
            "type": "PropertyValue", 
            "value": [
              "d42691d58ab78d7b639651235b91641fe88c6485d196e99d8ee967ea38302881"
            ]
          }
        ], 
        "publisher": {
          "location": "Berlin, Heidelberg", 
          "name": "Springer Berlin Heidelberg", 
          "type": "Organisation"
        }, 
        "sameAs": [
          "https://doi.org/10.1007/978-3-540-30115-8_22", 
          "https://app.dimensions.ai/details/publication/pub.1044538060"
        ], 
        "sdDataset": "chapters", 
        "sdDatePublished": "2019-04-16T08:25", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000363_0000000363/records_70053_00000002.jsonl", 
        "type": "Chapter", 
        "url": "https://link.springer.com/10.1007%2F978-3-540-30115-8_22"
      }
    ]
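The record above is ordinary JSON, so it can be read with a standard JSON parser. The following is a minimal sketch, assuming the record is a one-element list shaped like the one shown here; `RECORD_JSON` is an abbreviated copy of that record and `summarize` is a hypothetical helper name, not part of any SciGraph tooling.

```python
import json

# Abbreviated copy of the record shown above; only a few fields are kept.
RECORD_JSON = """
[
  {
    "author": [
      {"familyName": "Klimt", "givenName": "Bryan", "type": "Person"},
      {"familyName": "Yang", "givenName": "Yiming", "type": "Person"}
    ],
    "datePublished": "2004",
    "isPartOf": {"name": "Machine Learning: ECML 2004", "type": "Book"},
    "name": "The Enron Corpus: A New Dataset for Email Classification Research",
    "pagination": "217-226",
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue",
       "value": ["pub.1044538060"]},
      {"name": "doi", "type": "PropertyValue",
       "value": ["10.1007/978-3-540-30115-8_22"]}
    ]
  }
]
"""


def summarize(record_json: str) -> dict:
    """Pull a few bibliographic fields out of a SciGraph-style JSON-LD record."""
    record = json.loads(record_json)[0]  # the dump is a list holding one object
    # Each productId entry has a name and a list of values; keep the first value.
    ids = {p["name"]: p["value"][0] for p in record.get("productId", [])}
    authors = [
        f'{a["givenName"]} {a["familyName"]}' for a in record.get("author", [])
    ]
    return {
        "title": record["name"],
        "authors": authors,
        "year": record["datePublished"],
        "book": record["isPartOf"]["name"],
        "pages": record["pagination"],
        "doi": ids.get("doi"),
    }


if __name__ == "__main__":
    for key, value in summarize(RECORD_JSON).items():
        print(f"{key}: {value}")
```

This treats the record as plain JSON and does not expand the `@context`; a full JSON-LD processor would be needed to resolve terms to IRIs.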
     

    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-30115-8_22'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-30115-8_22'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-30115-8_22'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-3-540-30115-8_22'
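The four curl commands differ only in the `Accept` header, so the same content negotiation can be sketched in Python with the standard library. This is an illustration under assumptions: it presumes the endpoint shown above is still being served, and the names `ACCEPT_TYPES`, `build_request`, and `fetch`, as well as the short format keys, are hypothetical choices made for this sketch.

```python
import urllib.request

BASE_URL = "https://scigraph.springernature.com/"

# Media types taken from the curl examples above; the short keys are
# labels chosen for this sketch only.
ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(record_id: str, fmt: str = "json-ld") -> urllib.request.Request:
    """Build a GET request that asks for one RDF serialization of a record."""
    if fmt not in ACCEPT_TYPES:
        raise ValueError(f"unknown format: {fmt!r}")
    return urllib.request.Request(
        BASE_URL + record_id,
        headers={"Accept": ACCEPT_TYPES[fmt]},
    )


def fetch(record_id: str, fmt: str = "json-ld", timeout: float = 30.0) -> str:
    """Send the request and return the response body as text (needs network)."""
    with urllib.request.urlopen(build_request(record_id, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Equivalent to the Turtle curl command above, if the service is reachable.
    print(fetch("pub.10.1007/978-3-540-30115-8_22", "turtle"))
```

Building the request separately from sending it keeps the header logic checkable without a network connection.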


     

    This table displays all metadata directly associated to this object as RDF triples.

    101 TRIPLES      23 PREDICATES      32 URIs      20 LITERALS      8 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1007/978-3-540-30115-8_22 schema:about anzsrc-for:08
    2 anzsrc-for:0801
    3 schema:author Nbd6c3ee0b7e74c1a8e0d18659eca1514
    4 schema:citation sg:pub.10.1007/3-540-45571-x_48
    5 https://doi.org/10.1016/s0306-4573(96)00063-5
    6 https://doi.org/10.1111/0824-7935.00127
    7 https://doi.org/10.1145/301136.301209
    8 https://doi.org/10.1145/383952.383975
    9 schema:datePublished 2004
    10 schema:datePublishedReg 2004-01-01
    11 schema:description Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze its suitability with respect to email folder prediction, and provide the baseline results of a state-of-the-art classifier (Support Vector Machines) under various conditions, including the cases of using individual sections (From, To, Subject and body) alone as the input to the classifier, and using all the sections in combination with regression weights.
    12 schema:editor N8980ebfec3ed498a8ddaa1181bf1b772
    13 schema:genre chapter
    14 schema:inLanguage en
    15 schema:isAccessibleForFree true
    16 schema:isPartOf N7bfeebfc0e3847edb90b728d3cab86d7
    17 schema:name The Enron Corpus: A New Dataset for Email Classification Research
    18 schema:pagination 217-226
    19 schema:productId N3b5a1c00060246798e5013cbe28ef5b4
    20 N8718e866d9ca4e7ca00e7391a3d1cdac
    21 Nd50698bc211948a18a3d23521d67bece
    22 schema:publisher Nf8708b0fe7c1443bb7c8091b15a99202
    23 schema:sameAs https://app.dimensions.ai/details/publication/pub.1044538060
    24 https://doi.org/10.1007/978-3-540-30115-8_22
    25 schema:sdDatePublished 2019-04-16T08:25
    26 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    27 schema:sdPublisher N41ae85d9c47c4438871da9550c3b3122
    28 schema:url https://link.springer.com/10.1007%2F978-3-540-30115-8_22
    29 sgo:license sg:explorer/license/
    30 sgo:sdDataset chapters
    31 rdf:type schema:Chapter
    32 N3b5a1c00060246798e5013cbe28ef5b4 schema:name readcube_id
    33 schema:value d42691d58ab78d7b639651235b91641fe88c6485d196e99d8ee967ea38302881
    34 rdf:type schema:PropertyValue
    35 N41ae85d9c47c4438871da9550c3b3122 schema:name Springer Nature - SN SciGraph project
    36 rdf:type schema:Organization
    37 N48735b90422a4310be6be96512fc6f58 schema:familyName Giannotti
    38 schema:givenName Fosca
    39 rdf:type schema:Person
    40 N5d361935972d436db53c0a0410c57c27 schema:familyName Boulicaut
    41 schema:givenName Jean-François
    42 rdf:type schema:Person
    43 N63821a21f27f4b8797809be72bfb02f3 rdf:first N48735b90422a4310be6be96512fc6f58
    44 rdf:rest Nb6445d574e214768b5c151148b9b4fc4
    45 N7bfeebfc0e3847edb90b728d3cab86d7 schema:isbn 978-3-540-23105-9
    46 978-3-540-30115-8
    47 schema:name Machine Learning: ECML 2004
    48 rdf:type schema:Book
    49 N7f5b6bd291d040a689915cef33fbeeb7 rdf:first N870489c8cef640b3867fe4c5f6794f85
    50 rdf:rest N63821a21f27f4b8797809be72bfb02f3
    51 N870489c8cef640b3867fe4c5f6794f85 schema:familyName Esposito
    52 schema:givenName Floriana
    53 rdf:type schema:Person
    54 N8718e866d9ca4e7ca00e7391a3d1cdac schema:name dimensions_id
    55 schema:value pub.1044538060
    56 rdf:type schema:PropertyValue
    57 N8980ebfec3ed498a8ddaa1181bf1b772 rdf:first N5d361935972d436db53c0a0410c57c27
    58 rdf:rest N7f5b6bd291d040a689915cef33fbeeb7
    59 N8b8fdeb2495d4acd9ccb1a9f58bdcca0 schema:familyName Pedreschi
    60 schema:givenName Dino
    61 rdf:type schema:Person
    62 Nb6445d574e214768b5c151148b9b4fc4 rdf:first N8b8fdeb2495d4acd9ccb1a9f58bdcca0
    63 rdf:rest rdf:nil
    64 Nbd6c3ee0b7e74c1a8e0d18659eca1514 rdf:first Nd827aa9353eb472daaf694c934b3edd4
    65 rdf:rest Nd5ce0f616d694387b8ba5d5a7f28e84c
    66 Nd50698bc211948a18a3d23521d67bece schema:name doi
    67 schema:value 10.1007/978-3-540-30115-8_22
    68 rdf:type schema:PropertyValue
    69 Nd5ce0f616d694387b8ba5d5a7f28e84c rdf:first Ne1240a260c4e4273833c8d6ed698eed8
    70 rdf:rest rdf:nil
    71 Nd827aa9353eb472daaf694c934b3edd4 schema:affiliation https://www.grid.ac/institutes/grid.147455.6
    72 schema:familyName Klimt
    73 schema:givenName Bryan
    74 rdf:type schema:Person
    75 Ne1240a260c4e4273833c8d6ed698eed8 schema:affiliation https://www.grid.ac/institutes/grid.147455.6
    76 schema:familyName Yang
    77 schema:givenName Yiming
    78 rdf:type schema:Person
    79 Nf8708b0fe7c1443bb7c8091b15a99202 schema:location Berlin, Heidelberg
    80 schema:name Springer Berlin Heidelberg
    81 rdf:type schema:Organisation
    82 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    83 schema:name Information and Computing Sciences
    84 rdf:type schema:DefinedTerm
    85 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
    86 schema:name Artificial Intelligence and Image Processing
    87 rdf:type schema:DefinedTerm
    88 sg:pub.10.1007/3-540-45571-x_48 schema:sameAs https://app.dimensions.ai/details/publication/pub.1025259101
    89 https://doi.org/10.1007/3-540-45571-x_48
    90 rdf:type schema:CreativeWork
    91 https://doi.org/10.1016/s0306-4573(96)00063-5 schema:sameAs https://app.dimensions.ai/details/publication/pub.1032831757
    92 rdf:type schema:CreativeWork
    93 https://doi.org/10.1111/0824-7935.00127 schema:sameAs https://app.dimensions.ai/details/publication/pub.1028564231
    94 rdf:type schema:CreativeWork
    95 https://doi.org/10.1145/301136.301209 schema:sameAs https://app.dimensions.ai/details/publication/pub.1015443772
    96 rdf:type schema:CreativeWork
    97 https://doi.org/10.1145/383952.383975 schema:sameAs https://app.dimensions.ai/details/publication/pub.1030665885
    98 rdf:type schema:CreativeWork
    99 https://www.grid.ac/institutes/grid.147455.6 schema:alternateName Carnegie Mellon University
    100 schema:name Language Technologies Institute, Carnegie Mellon University, 15213-8213, Pittsburgh, PA, USA
    101 rdf:type schema:Organization
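In the table, ordered values such as the editors are stored as RDF lists: each blank node carries an `rdf:first` (the item) and an `rdf:rest` (the next cell), ending at `rdf:nil`. The sketch below walks the `schema:editor` chain using triples copied from the rows above; the function name `walk_rdf_list` and the tuple layout are assumptions made for illustration.

```python
RDF_NIL = "rdf:nil"

# Blank-node identifiers copied from the table above (schema:editor list).
CELL_1 = "N8980ebfec3ed498a8ddaa1181bf1b772"  # object of schema:editor
CELL_2 = "N7f5b6bd291d040a689915cef33fbeeb7"
CELL_3 = "N63821a21f27f4b8797809be72bfb02f3"
CELL_4 = "Nb6445d574e214768b5c151148b9b4fc4"
BOULICAUT = "N5d361935972d436db53c0a0410c57c27"
ESPOSITO = "N870489c8cef640b3867fe4c5f6794f85"
GIANNOTTI = "N48735b90422a4310be6be96512fc6f58"
PEDRESCHI = "N8b8fdeb2495d4acd9ccb1a9f58bdcca0"

# (subject, predicate, object) triples, in the table's own order.
TRIPLES = [
    (GIANNOTTI, "schema:familyName", "Giannotti"),
    (BOULICAUT, "schema:familyName", "Boulicaut"),
    (CELL_3, "rdf:first", GIANNOTTI),
    (CELL_3, "rdf:rest", CELL_4),
    (CELL_2, "rdf:first", ESPOSITO),
    (CELL_2, "rdf:rest", CELL_3),
    (ESPOSITO, "schema:familyName", "Esposito"),
    (CELL_1, "rdf:first", BOULICAUT),
    (CELL_1, "rdf:rest", CELL_2),
    (PEDRESCHI, "schema:familyName", "Pedreschi"),
    (CELL_4, "rdf:first", PEDRESCHI),
    (CELL_4, "rdf:rest", RDF_NIL),
]


def walk_rdf_list(triples, head):
    """Return the items of an RDF list in order, starting from its head cell."""
    lookup = {(s, p): o for s, p, o in triples}
    items, seen, node = [], set(), head
    while node != RDF_NIL:
        if node in seen:  # guard against a malformed, cyclic list
            raise ValueError("cycle in RDF list")
        seen.add(node)
        items.append(lookup[(node, "rdf:first")])
        node = lookup[(node, "rdf:rest")]
    return items


if __name__ == "__main__":
    names = {s: o for s, p, o in TRIPLES if p == "schema:familyName"}
    print([names[n] for n in walk_rdf_list(TRIPLES, CELL_1)])
```

The order recovered this way matches the `editor` array in the JSON-LD record, even though the table itself is sorted by node identifier.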
     



