Keyphrases Extraction from Scientific Documents: Improving Machine Learning Approaches with Natural Language Processing


Ontology type: schema:Chapter     


Chapter Info

DATE

2010

AUTHORS

Mikalai Krapivin, Aliaksandr Autayeu, Maurizio Marchese, Enrico Blanzieri, Nicola Segata

ABSTRACT

In this paper we use Natural Language Processing techniques to improve different machine learning approaches (Support Vector Machines (SVM), Local SVM, Random Forests) to the problem of automatic keyphrases extraction from scientific papers. For the evaluation we propose a large and high-quality dataset: 2000 ACM papers from the Computer Science domain. We evaluate by comparison with expert-assigned keyphrases. Evaluation shows promising results that outperform state-of-the-art Bayesian learning system KEA improving the average F-Measure from 22% (KEA) to 30% (Random Forest) on the same dataset without the use of controlled vocabularies. Finally, we report a detailed analysis of the effect of the individual NLP features and data set size on the overall quality of extracted keyphrases.
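The abstract reports an F-Measure obtained by comparing extracted keyphrases with expert-assigned ones. As a rough illustration only, the sketch below computes precision, recall, and the balanced F-measure (F1) for one document. The exact-match rule after lowercasing, the function name `f_measure`, and the sample phrases are all assumptions made for this example; this record does not describe the chapter's actual matching procedure.

```python
def f_measure(extracted, expert):
    """Return (precision, recall, F1) for two collections of keyphrases.

    Assumption: two phrases match only if they are identical after
    lowercasing and trimming whitespace. The chapter may match differently.
    """
    extracted_set = {p.strip().lower() for p in extracted}
    expert_set = {p.strip().lower() for p in expert}
    matched = len(extracted_set & expert_set)

    precision = matched / len(extracted_set) if extracted_set else 0.0
    recall = matched / len(expert_set) if expert_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical phrases, invented for illustration; not taken from the dataset.
extracted = ["support vector machines", "random forest", "digital library", "dataset"]
expert = ["Support Vector Machines", "keyphrase extraction", "Random Forest"]

precision, recall, f1 = f_measure(extracted, expert)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

An average F-Measure such as the 22% and 30% figures quoted above would then be the mean of the per-document F1 values, assuming macro-averaging over documents.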

PAGES

102-111

References to SciGraph publications

  • 1995-09. Support-vector networks in MACHINE LEARNING
  • 1996-08. Bagging predictors in MACHINE LEARNING
  • 2009. Fast Local Support Vector Machines for Large Datasets in MACHINE LEARNING AND DATA MINING IN PATTERN RECOGNITION
Book

    TITLE

    The Role of Digital Libraries in a Time of Global Change

    ISBN

    978-3-642-13653-5
    978-3-642-13654-2

    Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/978-3-642-13654-2_12

    DOI

    http://dx.doi.org/10.1007/978-3-642-13654-2_12

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1021073515



    JSON-LD is the canonical representation for SciGraph data.

    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Artificial Intelligence and Image Processing", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "University of Trento", 
              "id": "https://www.grid.ac/institutes/grid.11696.39", 
              "name": [
                "DISI, University of Trento, Italy"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Krapivin", 
            "givenName": "Mikalai", 
            "id": "sg:person.013355166725.41", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013355166725.41"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "University of Trento", 
              "id": "https://www.grid.ac/institutes/grid.11696.39", 
              "name": [
                "DISI, University of Trento, Italy"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Autayeu", 
            "givenName": "Aliaksandr", 
            "id": "sg:person.014750127725.00", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014750127725.00"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "University of Trento", 
              "id": "https://www.grid.ac/institutes/grid.11696.39", 
              "name": [
                "DISI, University of Trento, Italy"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Marchese", 
            "givenName": "Maurizio", 
            "id": "sg:person.015022132271.13", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015022132271.13"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "University of Trento", 
              "id": "https://www.grid.ac/institutes/grid.11696.39", 
              "name": [
                "DISI, University of Trento, Italy"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Blanzieri", 
            "givenName": "Enrico", 
            "id": "sg:person.013033541655.32", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013033541655.32"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "University of Trento", 
              "id": "https://www.grid.ac/institutes/grid.11696.39", 
              "name": [
                "DISI, University of Trento, Italy"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Segata", 
            "givenName": "Nicola", 
            "id": "sg:person.0736227144.03", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0736227144.03"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "sg:pub.10.1007/bf00058655", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1002929950", 
              "https://doi.org/10.1007/bf00058655"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-642-03070-3_22", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1010416050", 
              "https://doi.org/10.1007/978-3-642-03070-3_22"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/j.ipm.2007.01.015", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1013274639"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/1961189.1961199", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1013637525"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.3115/1119355.1119383", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1017259208"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1017/s1351324906004505", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1021611628"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/bf00994018", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1025150743", 
              "https://doi.org/10.1007/bf00994018"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/313238.313437", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1049303403"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1145/1141753.1141819", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1052237831"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/tgrs.2008.916090", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1061610744"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/jcdl.2003.1204842", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1093514231"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/wi.2005.87", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1095077013"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2010", 
        "datePublishedReg": "2010-01-01", 
        "description": "In this paper we use Natural Language Processing techniques to improve different machine learning approaches (Support Vector Machines (SVM), Local SVM, Random Forests) to the problem of automatic keyphrases extraction from scientific papers. For the evaluation we propose a large and high-quality dataset: 2000 ACM papers from the Computer Science domain. We evaluate by comparison with expert-assigned keyphrases. Evaluation shows promising results that outperform state-of-the-art Bayesian learning system KEA improving the average F-Measure from 22% (KEA) to 30% (Random Forest) on the same dataset without the use of controlled vocabularies. Finally, we report a detailed analysis of the effect of the individual NLP features and data set size on the overall quality of extracted keyphrases.", 
        "editor": [
          {
            "familyName": "Chowdhury", 
            "givenName": "Gobinda", 
            "type": "Person"
          }, 
          {
            "familyName": "Koo", 
            "givenName": "Chris", 
            "type": "Person"
          }, 
          {
            "familyName": "Hunter", 
            "givenName": "Jane", 
            "type": "Person"
          }
        ], 
        "genre": "chapter", 
        "id": "sg:pub.10.1007/978-3-642-13654-2_12", 
        "inLanguage": [
          "en"
        ], 
        "isAccessibleForFree": false, 
        "isPartOf": {
          "isbn": [
            "978-3-642-13653-5", 
            "978-3-642-13654-2"
          ], 
          "name": "The Role of Digital Libraries in a Time of Global Change", 
          "type": "Book"
        }, 
        "name": "Keyphrases Extraction from Scientific Documents: Improving Machine Learning Approaches with Natural Language Processing", 
        "pagination": "102-111", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1021073515"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/978-3-642-13654-2_12"
            ]
          }, 
          {
            "name": "readcube_id", 
            "type": "PropertyValue", 
            "value": [
              "6467d0b35eed09856272377c1f6bb58918b24278455fdb3677445558063bf233"
            ]
          }
        ], 
        "publisher": {
          "location": "Berlin, Heidelberg", 
          "name": "Springer Berlin Heidelberg", 
          "type": "Organisation"
        }, 
        "sameAs": [
          "https://doi.org/10.1007/978-3-642-13654-2_12", 
          "https://app.dimensions.ai/details/publication/pub.1021073515"
        ], 
        "sdDataset": "chapters", 
        "sdDatePublished": "2019-04-16T08:03", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000359_0000000359/records_29204_00000001.jsonl", 
        "type": "Chapter", 
        "url": "https://link.springer.com/10.1007%2F978-3-642-13654-2_12"
      }
    ]
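Because JSON-LD is also plain JSON, a record like the one above can be read with any JSON parser. The following is a minimal sketch, assuming only Python's standard `json` module; the helper name `summarize` and the trimmed excerpt are introduced here for illustration, with field names and values copied from the record.

```python
import json

# Trimmed excerpt of the record above; only the fields read below are kept.
record_text = """
[
  {
    "author": [
      {"familyName": "Krapivin", "givenName": "Mikalai", "type": "Person"},
      {"familyName": "Autayeu", "givenName": "Aliaksandr", "type": "Person"},
      {"familyName": "Marchese", "givenName": "Maurizio", "type": "Person"},
      {"familyName": "Blanzieri", "givenName": "Enrico", "type": "Person"},
      {"familyName": "Segata", "givenName": "Nicola", "type": "Person"}
    ],
    "pagination": "102-111",
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1021073515"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1007/978-3-642-13654-2_12"]}
    ],
    "type": "Chapter"
  }
]
"""


def summarize(record):
    """Collect author names, DOI, and page range from one record object."""
    authors = [f"{a['givenName']} {a['familyName']}" for a in record["author"]]
    # productId is a list of name/value pairs; index it by name.
    identifiers = {p["name"]: p["value"][0] for p in record["productId"]}
    return {
        "authors": authors,
        "doi": identifiers.get("doi"),
        "pages": record["pagination"],
    }


# The top-level JSON value is a list holding a single record object.
record = json.loads(record_text)[0]
summary = summarize(record)
print(summary["authors"])
print(summary["doi"], summary["pages"])
```

This treats the record as ordinary JSON and ignores the `@context`; a full JSON-LD processor would be needed to expand terms into complete IRIs.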
     

    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-3-642-13654-2_12'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-3-642-13654-2_12'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-3-642-13654-2_12'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-3-642-13654-2_12'
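The four curl commands differ only in the `Accept` header, so the same content negotiation can be sketched in Python with the standard library. The names `ACCEPT_TYPES`, `build_request`, and `fetch` are hypothetical helpers introduced for this example, and whether the endpoint still answers such requests is not something this record confirms.

```python
import urllib.request

# URL and media types copied from the curl commands above.
RECORD_URL = "https://scigraph.springernature.com/pub.10.1007/978-3-642-13654-2_12"

ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "n-triples": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}


def build_request(fmt, url=RECORD_URL):
    """Build (but do not send) a GET request asking for one RDF serialization."""
    return urllib.request.Request(url, headers={"Accept": ACCEPT_TYPES[fmt]})


def fetch(fmt, url=RECORD_URL, timeout=30):
    """Send the request and return the response body as text.

    Assumption: the server replies in UTF-8. Not called in this sketch,
    because it needs network access and a live endpoint.
    """
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_request("turtle")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Calling `fetch("json-ld")` would perform the equivalent of the first curl command.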


     

    This table displays all metadata directly associated with this object as RDF triples. Where a row omits the subject or the predicate, it repeats the one from the preceding row.

    142 TRIPLES      23 PREDICATES      39 URIs      20 LITERALS      8 BLANK NODES

    Row Subject Predicate Object
    1 sg:pub.10.1007/978-3-642-13654-2_12 schema:about anzsrc-for:08
    2 anzsrc-for:0801
    3 schema:author Nbe2319be6f064a7c9ce63fb82932b17c
    4 schema:citation sg:pub.10.1007/978-3-642-03070-3_22
    5 sg:pub.10.1007/bf00058655
    6 sg:pub.10.1007/bf00994018
    7 https://doi.org/10.1016/j.ipm.2007.01.015
    8 https://doi.org/10.1017/s1351324906004505
    9 https://doi.org/10.1109/jcdl.2003.1204842
    10 https://doi.org/10.1109/tgrs.2008.916090
    11 https://doi.org/10.1109/wi.2005.87
    12 https://doi.org/10.1145/1141753.1141819
    13 https://doi.org/10.1145/1961189.1961199
    14 https://doi.org/10.1145/313238.313437
    15 https://doi.org/10.3115/1119355.1119383
    16 schema:datePublished 2010
    17 schema:datePublishedReg 2010-01-01
    18 schema:description In this paper we use Natural Language Processing techniques to improve different machine learning approaches (Support Vector Machines (SVM), Local SVM, Random Forests) to the problem of automatic keyphrases extraction from scientific papers. For the evaluation we propose a large and high-quality dataset: 2000 ACM papers from the Computer Science domain. We evaluate by comparison with expert-assigned keyphrases. Evaluation shows promising results that outperform state-of-the-art Bayesian learning system KEA improving the average F-Measure from 22% (KEA) to 30% (Random Forest) on the same dataset without the use of controlled vocabularies. Finally, we report a detailed analysis of the effect of the individual NLP features and data set size on the overall quality of extracted keyphrases.
    19 schema:editor N3c89e13629a348bf8716b44ba2dbf9fd
    20 schema:genre chapter
    21 schema:inLanguage en
    22 schema:isAccessibleForFree false
    23 schema:isPartOf N55a853add6764916946811a7931cc7e9
    24 schema:name Keyphrases Extraction from Scientific Documents: Improving Machine Learning Approaches with Natural Language Processing
    25 schema:pagination 102-111
    26 schema:productId N4a9d61dcda8a4d4f9d0a124449a857df
    27 Nbae947b178e14bf9a09f7a78120a249d
    28 Nd7b858cbb64247ada77797ae0403e3e9
    29 schema:publisher N3e1c1f819d98450d938214c201af9f7f
    30 schema:sameAs https://app.dimensions.ai/details/publication/pub.1021073515
    31 https://doi.org/10.1007/978-3-642-13654-2_12
    32 schema:sdDatePublished 2019-04-16T08:03
    33 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    34 schema:sdPublisher N7165eef9b54a4a6a8dd0782624ef250a
    35 schema:url https://link.springer.com/10.1007%2F978-3-642-13654-2_12
    36 sgo:license sg:explorer/license/
    37 sgo:sdDataset chapters
    38 rdf:type schema:Chapter
    39 N0740a4b5d7714dfca5017094fbd97455 rdf:first sg:person.014750127725.00
    40 rdf:rest N4806a220522f4f00939b4f8acb81cf8a
    41 N2aeb017fe4e34f90a2360aaf5c585979 rdf:first Ndddb5232b254459d9b2777d442ac7640
    42 rdf:rest Nbad24d1de29246f2acc4fbbe18835cd1
    43 N3c89e13629a348bf8716b44ba2dbf9fd rdf:first Nec7a2444078742efa85d4540e96241ec
    44 rdf:rest N2aeb017fe4e34f90a2360aaf5c585979
    45 N3e1c1f819d98450d938214c201af9f7f schema:location Berlin, Heidelberg
    46 schema:name Springer Berlin Heidelberg
    47 rdf:type schema:Organisation
    48 N4806a220522f4f00939b4f8acb81cf8a rdf:first sg:person.015022132271.13
    49 rdf:rest N8ab24c0fd85543e98cea761cb9ae7159
    50 N4a410284b2eb4d288cd026cb60b66b69 rdf:first sg:person.0736227144.03
    51 rdf:rest rdf:nil
    52 N4a9d61dcda8a4d4f9d0a124449a857df schema:name doi
    53 schema:value 10.1007/978-3-642-13654-2_12
    54 rdf:type schema:PropertyValue
    55 N55a853add6764916946811a7931cc7e9 schema:isbn 978-3-642-13653-5
    56 978-3-642-13654-2
    57 schema:name The Role of Digital Libraries in a Time of Global Change
    58 rdf:type schema:Book
    59 N7165eef9b54a4a6a8dd0782624ef250a schema:name Springer Nature - SN SciGraph project
    60 rdf:type schema:Organization
    61 N8ab24c0fd85543e98cea761cb9ae7159 rdf:first sg:person.013033541655.32
    62 rdf:rest N4a410284b2eb4d288cd026cb60b66b69
    63 Nbad24d1de29246f2acc4fbbe18835cd1 rdf:first Nfd447bdb4f4e4f45a7a5fe0c0d1bc0e9
    64 rdf:rest rdf:nil
    65 Nbae947b178e14bf9a09f7a78120a249d schema:name readcube_id
    66 schema:value 6467d0b35eed09856272377c1f6bb58918b24278455fdb3677445558063bf233
    67 rdf:type schema:PropertyValue
    68 Nbe2319be6f064a7c9ce63fb82932b17c rdf:first sg:person.013355166725.41
    69 rdf:rest N0740a4b5d7714dfca5017094fbd97455
    70 Nd7b858cbb64247ada77797ae0403e3e9 schema:name dimensions_id
    71 schema:value pub.1021073515
    72 rdf:type schema:PropertyValue
    73 Ndddb5232b254459d9b2777d442ac7640 schema:familyName Koo
    74 schema:givenName Chris
    75 rdf:type schema:Person
    76 Nec7a2444078742efa85d4540e96241ec schema:familyName Chowdhury
    77 schema:givenName Gobinda
    78 rdf:type schema:Person
    79 Nfd447bdb4f4e4f45a7a5fe0c0d1bc0e9 schema:familyName Hunter
    80 schema:givenName Jane
    81 rdf:type schema:Person
    82 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    83 schema:name Information and Computing Sciences
    84 rdf:type schema:DefinedTerm
    85 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
    86 schema:name Artificial Intelligence and Image Processing
    87 rdf:type schema:DefinedTerm
    88 sg:person.013033541655.32 schema:affiliation https://www.grid.ac/institutes/grid.11696.39
    89 schema:familyName Blanzieri
    90 schema:givenName Enrico
    91 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013033541655.32
    92 rdf:type schema:Person
    93 sg:person.013355166725.41 schema:affiliation https://www.grid.ac/institutes/grid.11696.39
    94 schema:familyName Krapivin
    95 schema:givenName Mikalai
    96 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013355166725.41
    97 rdf:type schema:Person
    98 sg:person.014750127725.00 schema:affiliation https://www.grid.ac/institutes/grid.11696.39
    99 schema:familyName Autayeu
    100 schema:givenName Aliaksandr
    101 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014750127725.00
    102 rdf:type schema:Person
    103 sg:person.015022132271.13 schema:affiliation https://www.grid.ac/institutes/grid.11696.39
    104 schema:familyName Marchese
    105 schema:givenName Maurizio
    106 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.015022132271.13
    107 rdf:type schema:Person
    108 sg:person.0736227144.03 schema:affiliation https://www.grid.ac/institutes/grid.11696.39
    109 schema:familyName Segata
    110 schema:givenName Nicola
    111 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0736227144.03
    112 rdf:type schema:Person
    113 sg:pub.10.1007/978-3-642-03070-3_22 schema:sameAs https://app.dimensions.ai/details/publication/pub.1010416050
    114 https://doi.org/10.1007/978-3-642-03070-3_22
    115 rdf:type schema:CreativeWork
    116 sg:pub.10.1007/bf00058655 schema:sameAs https://app.dimensions.ai/details/publication/pub.1002929950
    117 https://doi.org/10.1007/bf00058655
    118 rdf:type schema:CreativeWork
    119 sg:pub.10.1007/bf00994018 schema:sameAs https://app.dimensions.ai/details/publication/pub.1025150743
    120 https://doi.org/10.1007/bf00994018
    121 rdf:type schema:CreativeWork
    122 https://doi.org/10.1016/j.ipm.2007.01.015 schema:sameAs https://app.dimensions.ai/details/publication/pub.1013274639
    123 rdf:type schema:CreativeWork
    124 https://doi.org/10.1017/s1351324906004505 schema:sameAs https://app.dimensions.ai/details/publication/pub.1021611628
    125 rdf:type schema:CreativeWork
    126 https://doi.org/10.1109/jcdl.2003.1204842 schema:sameAs https://app.dimensions.ai/details/publication/pub.1093514231
    127 rdf:type schema:CreativeWork
    128 https://doi.org/10.1109/tgrs.2008.916090 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061610744
    129 rdf:type schema:CreativeWork
    130 https://doi.org/10.1109/wi.2005.87 schema:sameAs https://app.dimensions.ai/details/publication/pub.1095077013
    131 rdf:type schema:CreativeWork
    132 https://doi.org/10.1145/1141753.1141819 schema:sameAs https://app.dimensions.ai/details/publication/pub.1052237831
    133 rdf:type schema:CreativeWork
    134 https://doi.org/10.1145/1961189.1961199 schema:sameAs https://app.dimensions.ai/details/publication/pub.1013637525
    135 rdf:type schema:CreativeWork
    136 https://doi.org/10.1145/313238.313437 schema:sameAs https://app.dimensions.ai/details/publication/pub.1049303403
    137 rdf:type schema:CreativeWork
    138 https://doi.org/10.3115/1119355.1119383 schema:sameAs https://app.dimensions.ai/details/publication/pub.1017259208
    139 rdf:type schema:CreativeWork
    140 https://www.grid.ac/institutes/grid.11696.39 schema:alternateName University of Trento
    141 schema:name DISI, University of Trento, Italy
    142 rdf:type schema:Organization
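In the table above, the ordered author list is stored as an RDF collection: `schema:author` points to a blank node, and each node carries an `rdf:first` item and an `rdf:rest` link until `rdf:nil`. The sketch below walks that chain to recover the author order. The dictionary layout and the function name `walk_rdf_list` are assumptions for illustration; the node identifiers and values are copied from the table rows.

```python
RDF_NIL = "rdf:nil"

# node id -> (rdf:first, rdf:rest), copied from the author-list rows of the table.
LIST_NODES = {
    "Nbe2319be6f064a7c9ce63fb82932b17c": ("sg:person.013355166725.41", "N0740a4b5d7714dfca5017094fbd97455"),
    "N0740a4b5d7714dfca5017094fbd97455": ("sg:person.014750127725.00", "N4806a220522f4f00939b4f8acb81cf8a"),
    "N4806a220522f4f00939b4f8acb81cf8a": ("sg:person.015022132271.13", "N8ab24c0fd85543e98cea761cb9ae7159"),
    "N8ab24c0fd85543e98cea761cb9ae7159": ("sg:person.013033541655.32", "N4a410284b2eb4d288cd026cb60b66b69"),
    "N4a410284b2eb4d288cd026cb60b66b69": ("sg:person.0736227144.03", RDF_NIL),
}


def walk_rdf_list(head, nodes):
    """Follow rdf:rest links from `head`, collecting each rdf:first value."""
    items = []
    seen = set()
    node = head
    while node != RDF_NIL:
        if node in seen:
            raise ValueError(f"cycle detected at {node}")
        seen.add(node)
        first, rest = nodes[node]
        items.append(first)
        node = rest
    return items


# The head node is the object of the schema:author triple (row 3).
authors_in_order = walk_rdf_list("Nbe2319be6f064a7c9ce63fb82932b17c", LIST_NODES)
for person in authors_in_order:
    print(person)
```

The same traversal applies to the `schema:editor` list, which starts at node `N3c89e13629a348bf8716b44ba2dbf9fd`.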
     



