Methods for Collection and Evaluation of Comparable Documents


Ontology type: schema:Chapter     


Chapter Info

DATE

2013

AUTHORS

Monica Lestari Paramita , David Guthrie , Evangelos Kanoulas , Rob Gaizauskas , Paul Clough , Mark Sanderson

ABSTRACT

Considerable attention is being paid to methods for gathering and evaluating comparable corpora, not only to improve Statistical Machine Translation (SMT) but for other applications as well, e.g. the extraction of paraphrases. The potential value of such corpora requires efficient and effective methods for gathering and evaluating them. Most of these methods have been tested in retrieving document pairs for well resourced languages, however there is a lack of work in areas of less popular (under resourced) languages, or domains. This chapter describes the work in developing methods for automatically gathering comparable corpora from the Web, specifically for under resourced languages. Different online sources are investigated and an evaluation method is developed to assess the quality of the retrieved documents.

PAGES

93-112

Book

TITLE

Building and Using Comparable Corpora

ISBN

978-3-642-20127-1
978-3-642-20128-8

Identifiers

URI

http://scigraph.springernature.com/pub.10.1007/978-3-642-20128-8_5

DOI

http://dx.doi.org/10.1007/978-3-642-20128-8_5

DIMENSIONS

https://app.dimensions.ai/details/publication/pub.1040358735



JSON-LD is the canonical representation for SciGraph data.

[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
    "about": [
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/2004", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Linguistics", 
        "type": "DefinedTerm"
      }, 
      {
        "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/20", 
        "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
        "name": "Language, Communication and Culture", 
        "type": "DefinedTerm"
      }
    ], 
    "author": [
      {
        "affiliation": {
          "alternateName": "University of Sheffield", 
          "id": "https://www.grid.ac/institutes/grid.11835.3e", 
          "name": [
            "University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, UK"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Paramita", 
        "givenName": "Monica Lestari", 
        "id": "sg:person.012371336361.33", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012371336361.33"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "University of Sheffield", 
          "id": "https://www.grid.ac/institutes/grid.11835.3e", 
          "name": [
            "University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, UK"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Guthrie", 
        "givenName": "David", 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "University of Sheffield", 
          "id": "https://www.grid.ac/institutes/grid.11835.3e", 
          "name": [
            "University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, UK"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Kanoulas", 
        "givenName": "Evangelos", 
        "id": "sg:person.016661346557.44", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.016661346557.44"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "University of Sheffield", 
          "id": "https://www.grid.ac/institutes/grid.11835.3e", 
          "name": [
            "University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, UK"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Gaizauskas", 
        "givenName": "Rob", 
        "id": "sg:person.011056325453.22", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011056325453.22"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "University of Sheffield", 
          "id": "https://www.grid.ac/institutes/grid.11835.3e", 
          "name": [
            "University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, UK"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Clough", 
        "givenName": "Paul", 
        "id": "sg:person.016305763421.13", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.016305763421.13"
        ], 
        "type": "Person"
      }, 
      {
        "affiliation": {
          "alternateName": "University of Sheffield", 
          "id": "https://www.grid.ac/institutes/grid.11835.3e", 
          "name": [
            "University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, UK"
          ], 
          "type": "Organization"
        }, 
        "familyName": "Sanderson", 
        "givenName": "Mark", 
        "id": "sg:person.011123173064.51", 
        "sameAs": [
          "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011123173064.51"
        ], 
        "type": "Person"
      }
    ], 
    "citation": [
      {
        "id": "https://doi.org/10.1145/361219.361220", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1004270480"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/11735106_37", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1004983812", 
          "https://doi.org/10.1007/11735106_37"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/s10791-008-9058-8", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1022743558", 
          "https://doi.org/10.1007/s10791-008-9058-8"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.3115/992628.992709", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1045494205"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "sg:pub.10.1007/s10115-003-0121-x", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1046768341", 
          "https://doi.org/10.1007/s10115-003-0121-x"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1162/089120105775299168", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1046794238"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.2498/cit.2005.04.01", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1070861268"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.1109/iccea.2010.203", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1095242006"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.3115/1626431.1626466", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1099204529"
        ], 
        "type": "CreativeWork"
      }, 
      {
        "id": "https://doi.org/10.3115/1034678.1034757", 
        "sameAs": [
          "https://app.dimensions.ai/details/publication/pub.1099239517"
        ], 
        "type": "CreativeWork"
      }
    ], 
    "datePublished": "2013", 
    "datePublishedReg": "2013-01-01", 
    "description": "Considerable attention is being paid to methods for gathering and evaluating comparable corpora, not only to improve Statistical Machine Translation (SMT) but for other applications as well, e.g. the extraction of paraphrases. The potential value of such corpora requires efficient and effective methods for gathering and evaluating them. Most of these methods have been tested in retrieving document pairs for well resourced languages, however there is a lack of work in areas of less popular (under resourced) languages, or domains. This chapter describes the work in developing methods for automatically gathering comparable corpora from the Web, specifically for under resourced languages. Different online sources are investigated and an evaluation method is developed to assess the quality of the retrieved documents.", 
    "editor": [
      {
        "familyName": "Sharoff", 
        "givenName": "Serge", 
        "type": "Person"
      }, 
      {
        "familyName": "Rapp", 
        "givenName": "Reinhard", 
        "type": "Person"
      }, 
      {
        "familyName": "Zweigenbaum", 
        "givenName": "Pierre", 
        "type": "Person"
      }, 
      {
        "familyName": "Fung", 
        "givenName": "Pascale", 
        "type": "Person"
      }
    ], 
    "genre": "chapter", 
    "id": "sg:pub.10.1007/978-3-642-20128-8_5", 
    "inLanguage": [
      "en"
    ], 
    "isAccessibleForFree": false, 
    "isFundedItemOf": [
      {
        "id": "sg:grant.3786579", 
        "type": "MonetaryGrant"
      }
    ], 
    "isPartOf": {
      "isbn": [
        "978-3-642-20127-1", 
        "978-3-642-20128-8"
      ], 
      "name": "Building and Using Comparable Corpora", 
      "type": "Book"
    }, 
    "name": "Methods for Collection and Evaluation of Comparable Documents", 
    "pagination": "93-112", 
    "productId": [
      {
        "name": "doi", 
        "type": "PropertyValue", 
        "value": [
          "10.1007/978-3-642-20128-8_5"
        ]
      }, 
      {
        "name": "readcube_id", 
        "type": "PropertyValue", 
        "value": [
          "c04048b0ad060dc22a94f59867ff83836eda8cfc14155a31816997f49c82f77a"
        ]
      }, 
      {
        "name": "dimensions_id", 
        "type": "PropertyValue", 
        "value": [
          "pub.1040358735"
        ]
      }
    ], 
    "publisher": {
      "location": "Berlin, Heidelberg", 
      "name": "Springer Berlin Heidelberg", 
      "type": "Organisation"
    }, 
    "sameAs": [
      "https://doi.org/10.1007/978-3-642-20128-8_5", 
      "https://app.dimensions.ai/details/publication/pub.1040358735"
    ], 
    "sdDataset": "chapters", 
    "sdDatePublished": "2019-04-15T11:36", 
    "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
    "sdPublisher": {
      "name": "Springer Nature - SN SciGraph project", 
      "type": "Organization"
    }, 
    "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8660_00000268.jsonl", 
    "type": "Chapter", 
    "url": "http://link.springer.com/10.1007/978-3-642-20128-8_5"
  }
]
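
A JSON-LD record of this shape can be read with any ordinary JSON parser. The sketch below is illustrative only: it parses an abbreviated excerpt of the record (two authors, and one citation id repeated on purpose to exercise the de-duplication), and the helper names `author_names` and `unique_citation_ids` are hypothetical, not part of any SciGraph tooling.

```python
import json

# Abbreviated excerpt of the record above; field names match the record,
# but most entries are omitted and one citation id is repeated on purpose.
RECORD_TEXT = """
[
  {
    "name": "Methods for Collection and Evaluation of Comparable Documents",
    "author": [
      {"familyName": "Paramita", "givenName": "Monica Lestari", "type": "Person"},
      {"familyName": "Guthrie", "givenName": "David", "type": "Person"}
    ],
    "citation": [
      {"id": "https://doi.org/10.1145/361219.361220", "type": "CreativeWork"},
      {"id": "sg:pub.10.1007/11735106_37", "type": "CreativeWork"},
      {"id": "sg:pub.10.1007/11735106_37", "type": "CreativeWork"}
    ]
  }
]
"""


def author_names(record: dict) -> list[str]:
    """Return 'given family' names in the order the record lists them."""
    return [f"{a['givenName']} {a['familyName']}" for a in record.get("author", [])]


def unique_citation_ids(record: dict) -> list[str]:
    """Return citation ids with repeats dropped, keeping first-seen order."""
    return list(dict.fromkeys(c["id"] for c in record.get("citation", [])))


# The top level of the document is a JSON array holding a single record.
record = json.loads(RECORD_TEXT)[0]
print(author_names(record))
print(unique_citation_ids(record))
```

Using `dict.fromkeys` keeps the first occurrence of each id and preserves order, which a plain `set` would not guarantee.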
 


HOW TO GET THIS DATA PROGRAMMATICALLY:

JSON-LD is a popular format for linked data which is fully compatible with JSON.

curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-3-642-20128-8_5'

N-Triples is a line-based linked data format ideal for batch operations.

curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-3-642-20128-8_5'

Turtle is a human-readable linked data format.

curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-3-642-20128-8_5'

RDF/XML is a standard XML format for linked data.

curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-3-642-20128-8_5'
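
The four curl commands differ only in the `Accept` header, so the same content negotiation can be sketched in Python with the standard library. This is a minimal sketch, assuming the endpoint shown above is still reachable; the names `ACCEPT_TYPES`, `build_request`, and `fetch` are hypothetical helpers introduced here for illustration.

```python
import urllib.request

# Media types copied from the curl examples above; the short keys are
# labels chosen for this sketch.
ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}

BASE_URL = "https://scigraph.springernature.com/"


def build_request(record_id: str, fmt: str = "json-ld") -> urllib.request.Request:
    """Build a GET request that asks for `record_id` in the given RDF format."""
    if fmt not in ACCEPT_TYPES:
        raise ValueError(f"unknown format: {fmt!r}")
    return urllib.request.Request(
        BASE_URL + record_id,
        headers={"Accept": ACCEPT_TYPES[fmt]},
    )


def fetch(record_id: str, fmt: str = "json-ld", timeout: float = 10.0) -> str:
    """Fetch the record as text. This performs a network call and will
    raise if the service is unavailable."""
    with urllib.request.urlopen(build_request(record_id, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `fetch("pub.10.1007/978-3-642-20128-8_5", "turtle")` would issue the same request as the Turtle curl command.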


 

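This table displays all metadata directly associated with this object as RDF triples. Rows are numbered; a row that omits the subject or predicate continues the subject or predicate of the row above.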

149 TRIPLES      23 PREDICATES      37 URIs      20 LITERALS      8 BLANK NODES

Subject Predicate Object
1 sg:pub.10.1007/978-3-642-20128-8_5 schema:about anzsrc-for:20
2 anzsrc-for:2004
3 schema:author N75c071b208834b8cae5a8a13ebe810f2
4 schema:citation sg:pub.10.1007/11735106_37
5 sg:pub.10.1007/s10115-003-0121-x
6 sg:pub.10.1007/s10791-008-9058-8
7 https://doi.org/10.1109/iccea.2010.203
8 https://doi.org/10.1145/361219.361220
9 https://doi.org/10.1162/089120105775299168
10 https://doi.org/10.2498/cit.2005.04.01
11 https://doi.org/10.3115/1034678.1034757
12 https://doi.org/10.3115/1626431.1626466
13 https://doi.org/10.3115/992628.992709
14 schema:datePublished 2013
15 schema:datePublishedReg 2013-01-01
16 schema:description Considerable attention is being paid to methods for gathering and evaluating comparable corpora, not only to improve Statistical Machine Translation (SMT) but for other applications as well, e.g. the extraction of paraphrases. The potential value of such corpora requires efficient and effective methods for gathering and evaluating them. Most of these methods have been tested in retrieving document pairs for well resourced languages, however there is a lack of work in areas of less popular (under resourced) languages, or domains. This chapter describes the work in developing methods for automatically gathering comparable corpora from the Web, specifically for under resourced languages. Different online sources are investigated and an evaluation method is developed to assess the quality of the retrieved documents.
17 schema:editor N4fc22150dba441a0a1cb007772982cb8
18 schema:genre chapter
19 schema:inLanguage en
20 schema:isAccessibleForFree false
21 schema:isPartOf N150369e491ba4f9ca09cc990fa981e93
22 schema:name Methods for Collection and Evaluation of Comparable Documents
23 schema:pagination 93-112
24 schema:productId N133f992018274e309c44e7c1bc14584d
25 N7c20717e6af648079ab923390298300b
26 Nc019e27526f047058a08ce838dc2c57a
27 schema:publisher N8f3f8ed4db9e4dfeb1e8254e79a6bb3c
28 schema:sameAs https://app.dimensions.ai/details/publication/pub.1040358735
29 https://doi.org/10.1007/978-3-642-20128-8_5
30 schema:sdDatePublished 2019-04-15T11:36
31 schema:sdLicense https://scigraph.springernature.com/explorer/license/
32 schema:sdPublisher N2e57614da5804912906198db58468816
33 schema:url http://link.springer.com/10.1007/978-3-642-20128-8_5
34 sgo:license sg:explorer/license/
35 sgo:sdDataset chapters
36 rdf:type schema:Chapter
37 N0e4b1b230d3b42a9bf9f4c070d560155 rdf:first sg:person.016305763421.13
38 rdf:rest N6e61f6b2193e43f7bc55c78591e84851
39 N133f992018274e309c44e7c1bc14584d schema:name doi
40 schema:value 10.1007/978-3-642-20128-8_5
41 rdf:type schema:PropertyValue
42 N150369e491ba4f9ca09cc990fa981e93 schema:isbn 978-3-642-20127-1
43 978-3-642-20128-8
44 schema:name Building and Using Comparable Corpora
45 rdf:type schema:Book
46 N1b4497e6c8b142558fc2ee627ad2f2ec rdf:first N6685ae9dc83341e3844ee297011352e9
47 rdf:rest N80276d2033884f87982cb5950303e1cd
48 N1d153a38860b47c9ae71de24abb7f183 rdf:first N3af9887d1c49478596534b60f9d442be
49 rdf:rest N58b833f41f78429ea015368f7f253592
50 N237373b342514c8eb8010518dd99962b schema:familyName Sharoff
51 schema:givenName Serge
52 rdf:type schema:Person
53 N2e57614da5804912906198db58468816 schema:name Springer Nature - SN SciGraph project
54 rdf:type schema:Organization
55 N3af9887d1c49478596534b60f9d442be schema:affiliation https://www.grid.ac/institutes/grid.11835.3e
56 schema:familyName Guthrie
57 schema:givenName David
58 rdf:type schema:Person
59 N4fc22150dba441a0a1cb007772982cb8 rdf:first N237373b342514c8eb8010518dd99962b
60 rdf:rest N7e514d0a02f54bc792fc5f33be9fb44b
61 N58b833f41f78429ea015368f7f253592 rdf:first sg:person.016661346557.44
62 rdf:rest Nf34821dc302b4e93b766c7f312c3e539
63 N5e645f93d52d4e5cbfd71088df0ee1ee schema:familyName Rapp
64 schema:givenName Reinhard
65 rdf:type schema:Person
66 N6685ae9dc83341e3844ee297011352e9 schema:familyName Zweigenbaum
67 schema:givenName Pierre
68 rdf:type schema:Person
69 N6e61f6b2193e43f7bc55c78591e84851 rdf:first sg:person.011123173064.51
70 rdf:rest rdf:nil
71 N75c071b208834b8cae5a8a13ebe810f2 rdf:first sg:person.012371336361.33
72 rdf:rest N1d153a38860b47c9ae71de24abb7f183
73 N7c20717e6af648079ab923390298300b schema:name readcube_id
74 schema:value c04048b0ad060dc22a94f59867ff83836eda8cfc14155a31816997f49c82f77a
75 rdf:type schema:PropertyValue
76 N7cef1dd6ab5947daad9a8feab5f64cd9 schema:familyName Fung
77 schema:givenName Pascale
78 rdf:type schema:Person
79 N7e514d0a02f54bc792fc5f33be9fb44b rdf:first N5e645f93d52d4e5cbfd71088df0ee1ee
80 rdf:rest N1b4497e6c8b142558fc2ee627ad2f2ec
81 N80276d2033884f87982cb5950303e1cd rdf:first N7cef1dd6ab5947daad9a8feab5f64cd9
82 rdf:rest rdf:nil
83 N8f3f8ed4db9e4dfeb1e8254e79a6bb3c schema:location Berlin, Heidelberg
84 schema:name Springer Berlin Heidelberg
85 rdf:type schema:Organisation
86 Nc019e27526f047058a08ce838dc2c57a schema:name dimensions_id
87 schema:value pub.1040358735
88 rdf:type schema:PropertyValue
89 Nf34821dc302b4e93b766c7f312c3e539 rdf:first sg:person.011056325453.22
90 rdf:rest N0e4b1b230d3b42a9bf9f4c070d560155
91 anzsrc-for:20 schema:inDefinedTermSet anzsrc-for:
92 schema:name Language, Communication and Culture
93 rdf:type schema:DefinedTerm
94 anzsrc-for:2004 schema:inDefinedTermSet anzsrc-for:
95 schema:name Linguistics
96 rdf:type schema:DefinedTerm
97 sg:grant.3786579 http://pending.schema.org/fundedItem sg:pub.10.1007/978-3-642-20128-8_5
98 rdf:type schema:MonetaryGrant
99 sg:person.011056325453.22 schema:affiliation https://www.grid.ac/institutes/grid.11835.3e
100 schema:familyName Gaizauskas
101 schema:givenName Rob
102 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011056325453.22
103 rdf:type schema:Person
104 sg:person.011123173064.51 schema:affiliation https://www.grid.ac/institutes/grid.11835.3e
105 schema:familyName Sanderson
106 schema:givenName Mark
107 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011123173064.51
108 rdf:type schema:Person
109 sg:person.012371336361.33 schema:affiliation https://www.grid.ac/institutes/grid.11835.3e
110 schema:familyName Paramita
111 schema:givenName Monica Lestari
112 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012371336361.33
113 rdf:type schema:Person
114 sg:person.016305763421.13 schema:affiliation https://www.grid.ac/institutes/grid.11835.3e
115 schema:familyName Clough
116 schema:givenName Paul
117 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.016305763421.13
118 rdf:type schema:Person
119 sg:person.016661346557.44 schema:affiliation https://www.grid.ac/institutes/grid.11835.3e
120 schema:familyName Kanoulas
121 schema:givenName Evangelos
122 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.016661346557.44
123 rdf:type schema:Person
124 sg:pub.10.1007/11735106_37 schema:sameAs https://app.dimensions.ai/details/publication/pub.1004983812
125 https://doi.org/10.1007/11735106_37
126 rdf:type schema:CreativeWork
127 sg:pub.10.1007/s10115-003-0121-x schema:sameAs https://app.dimensions.ai/details/publication/pub.1046768341
128 https://doi.org/10.1007/s10115-003-0121-x
129 rdf:type schema:CreativeWork
130 sg:pub.10.1007/s10791-008-9058-8 schema:sameAs https://app.dimensions.ai/details/publication/pub.1022743558
131 https://doi.org/10.1007/s10791-008-9058-8
132 rdf:type schema:CreativeWork
133 https://doi.org/10.1109/iccea.2010.203 schema:sameAs https://app.dimensions.ai/details/publication/pub.1095242006
134 rdf:type schema:CreativeWork
135 https://doi.org/10.1145/361219.361220 schema:sameAs https://app.dimensions.ai/details/publication/pub.1004270480
136 rdf:type schema:CreativeWork
137 https://doi.org/10.1162/089120105775299168 schema:sameAs https://app.dimensions.ai/details/publication/pub.1046794238
138 rdf:type schema:CreativeWork
139 https://doi.org/10.2498/cit.2005.04.01 schema:sameAs https://app.dimensions.ai/details/publication/pub.1070861268
140 rdf:type schema:CreativeWork
141 https://doi.org/10.3115/1034678.1034757 schema:sameAs https://app.dimensions.ai/details/publication/pub.1099239517
142 rdf:type schema:CreativeWork
143 https://doi.org/10.3115/1626431.1626466 schema:sameAs https://app.dimensions.ai/details/publication/pub.1099204529
144 rdf:type schema:CreativeWork
145 https://doi.org/10.3115/992628.992709 schema:sameAs https://app.dimensions.ai/details/publication/pub.1045494205
146 rdf:type schema:CreativeWork
147 https://www.grid.ac/institutes/grid.11835.3e schema:alternateName University of Sheffield
148 schema:name University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, UK
149 rdf:type schema:Organization
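
In the table, the `schema:author` value is the head of an RDF collection: each blank node carries an `rdf:first` item and an `rdf:rest` pointer, ending at `rdf:nil`. The sketch below walks that chain to recover the author order. The node ids and family names are copied from the rows above; the dictionary layout and the function `walk_rdf_list` are assumptions made for this illustration, not a SciGraph API.

```python
RDF_NIL = "rdf:nil"

# list node -> (rdf:first, rdf:rest), copied from the table rows above.
AUTHOR_CELLS = {
    "N75c071b208834b8cae5a8a13ebe810f2": ("sg:person.012371336361.33", "N1d153a38860b47c9ae71de24abb7f183"),
    "N1d153a38860b47c9ae71de24abb7f183": ("N3af9887d1c49478596534b60f9d442be", "N58b833f41f78429ea015368f7f253592"),
    "N58b833f41f78429ea015368f7f253592": ("sg:person.016661346557.44", "Nf34821dc302b4e93b766c7f312c3e539"),
    "Nf34821dc302b4e93b766c7f312c3e539": ("sg:person.011056325453.22", "N0e4b1b230d3b42a9bf9f4c070d560155"),
    "N0e4b1b230d3b42a9bf9f4c070d560155": ("sg:person.016305763421.13", "N6e61f6b2193e43f7bc55c78591e84851"),
    "N6e61f6b2193e43f7bc55c78591e84851": ("sg:person.011123173064.51", RDF_NIL),
}

# schema:familyName for each list item, also copied from the table.
FAMILY_NAME = {
    "sg:person.012371336361.33": "Paramita",
    "N3af9887d1c49478596534b60f9d442be": "Guthrie",
    "sg:person.016661346557.44": "Kanoulas",
    "sg:person.011056325453.22": "Gaizauskas",
    "sg:person.016305763421.13": "Clough",
    "sg:person.011123173064.51": "Sanderson",
}


def walk_rdf_list(head: str, cells: dict) -> list:
    """Follow rdf:rest pointers from `head`, collecting each rdf:first item."""
    items, seen, node = [], set(), head
    while node != RDF_NIL:
        if node in seen:  # guard against a malformed, cyclic list
            raise ValueError(f"cycle detected at {node}")
        seen.add(node)
        first, rest = cells[node]
        items.append(first)
        node = rest
    return items


authors = walk_rdf_list("N75c071b208834b8cae5a8a13ebe810f2", AUTHOR_CELLS)
print([FAMILY_NAME[a] for a in authors])
```

The recovered order matches the AUTHORS field at the top of the record, which is why the list structure is used rather than a plain set of `schema:author` triples.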
 



