Expanding N-grams for Code-Switch Language Models


Ontology type: schema:Chapter     


Chapter Info

DATE

2019

AUTHORS

Injy Hamed, Mohamed Elmahdy, Slim Abdennadher

ABSTRACT

It has become common, especially among urban youth, for people to use more than one language in their everyday conversations - a phenomenon referred to by linguists as “code-switching”. With the rise in globalization and the spread of code-switching among multilingual societies, a great demand has been placed on Natural Language Processing (NLP) applications to be able to handle such mixed data. In this paper, we present our efforts in language modeling for code-switch Arabic-English. In order to train a language model (LM), huge amounts of text data are required in the respective language. However, the main challenge faced in language modeling for code-switch languages is the lack of available data. In this paper, we propose an approach to artificially generate code-switch Arabic-English n-grams and thus improve the language model. This was done by expanding the relatively small available corpus and its corresponding n-grams using translation-based approaches. The final LM achieved relative improvements in both perplexity and out-of-vocabulary (OOV) rates of 1.97% and 16.36% respectively.
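
The two figures quoted in the abstract are relative improvements. As a minimal sketch, assuming the usual definitions (OOV rate as the share of test tokens missing from the LM vocabulary, and relative improvement as the reduction divided by the baseline value), such numbers could be computed as follows. The function names and the toy tokens are invented for illustration and are not taken from the chapter, whose exact evaluation setup is not shown in this record.

```python
from typing import Iterable, Set


def oov_rate(test_tokens: Iterable[str], vocabulary: Set[str]) -> float:
    """Fraction of test tokens that do not appear in the vocabulary."""
    tokens = list(test_tokens)
    if not tokens:
        raise ValueError("test set is empty")
    unseen = sum(1 for tok in tokens if tok not in vocabulary)
    return unseen / len(tokens)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative reduction of a lower-is-better metric (perplexity, OOV rate)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Toy data, invented for illustration only (not from the chapter).
vocab_before = {"ana", "the", "meeting"}
vocab_after = vocab_before | {"deadline"}  # vocabulary after a hypothetical expansion
test = ["ana", "3andy", "deadline", "the", "meeting", "bokra"]

before = oov_rate(test, vocab_before)  # 3 of 6 tokens are unseen
after = oov_rate(test, vocab_after)    # 2 of 6 tokens are unseen
print(f"relative OOV improvement: {relative_improvement(before, after):.2%}")
```

The same `relative_improvement` helper applies to perplexity, since both metrics are lower-is-better.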

PAGES

221-229

References to SciGraph publications

  • 2007. Baby-Steps Towards Building a Spanglish Language Model in COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING
Book

    TITLE

    Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2018

    ISBN

    978-3-319-99009-5
    978-3-319-99010-1

    Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/978-3-319-99010-1_20

    DOI

    http://dx.doi.org/10.1007/978-3-319-99010-1_20

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1106409157



    JSON-LD is the canonical representation for SciGraph data.

    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/2004", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Linguistics", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/20", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Language, Communication and Culture", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "German University in Cairo", 
              "id": "https://www.grid.ac/institutes/grid.187323.c", 
              "name": [
                "The German University in Cairo"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Hamed", 
            "givenName": "Injy", 
            "id": "sg:person.016542541344.73", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.016542541344.73"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "German University in Cairo", 
              "id": "https://www.grid.ac/institutes/grid.187323.c", 
              "name": [
                "The German University in Cairo"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Elmahdy", 
            "givenName": "Mohamed", 
            "id": "sg:person.013322724557.53", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013322724557.53"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "German University in Cairo", 
              "id": "https://www.grid.ac/institutes/grid.187323.c", 
              "name": [
                "The German University in Cairo"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Abdennadher", 
            "givenName": "Slim", 
            "id": "sg:person.010445445574.13", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.010445445574.13"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "https://doi.org/10.1016/j.pragma.2004.10.010", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1002477324"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1111/1467-971x.00181", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1008993237"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/j.procs.2016.04.039", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1017048649"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/s0167-6393(00)00095-9", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1038142378"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1177/0739986304272358", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1043943858"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1177/0739986304272358", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1043943858"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/j.procs.2016.04.044", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1044573307"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-540-70939-8_7", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1052256105", 
              "https://doi.org/10.1007/978-3-540-70939-8_7"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-540-70939-8_7", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1052256105", 
              "https://doi.org/10.1007/978-3-540-70939-8_7"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/msp.2008.918417", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1061423090"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.5923/j.ajsp.20120205.02", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1073508165"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1016/j.procs.2017.10.111", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1092616592"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/icassp.2014.6854536", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1093350674"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/fskd.2009.434", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1093463963"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/iscslp.2010.5684900", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1093814750"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.1109/icassp.2012.6289015", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1095336192"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.3115/v1/w14-3904", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1099117564"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "https://doi.org/10.3115/v1/w14-3904", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1099117564"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2019", 
        "datePublishedReg": "2019-01-01", 
        "description": "It has become common, especially among urban youth, for people to use more than one language in their everyday conversations - a phenomenon referred to by linguists as \u201ccode-switching\u201d. With the rise in globalization and the widespread of code-switching among multilingual societies, a great demand has been placed on Natural Language Processing (NLP) applications to be able to handle such mixed data. In this paper, we present our efforts in language modeling for code-switch Arabic-English. In order to train a language model (LM), huge amounts of text data is required in the respective language. However, the main challenge faced in language modeling for code-switch languages, is the lack of available data. In this paper, we propose an approach to artificially generate code-switch Arabic-English n-grams and thus improve the language model. This was done by expanding the relatively-small available corpus and its corresponding n-grams using translation-based approaches. The final LM achieved relative improvements in both perplexity and OOV rates of 1.97% and 16.36% respectively.", 
        "editor": [
          {
            "familyName": "Hassanien", 
            "givenName": "Aboul Ella", 
            "type": "Person"
          }, 
          {
            "familyName": "Tolba", 
            "givenName": "Mohamed F.", 
            "type": "Person"
          }, 
          {
            "familyName": "Shaalan", 
            "givenName": "Khaled", 
            "type": "Person"
          }, 
          {
            "familyName": "Azar", 
            "givenName": "Ahmad Taher", 
            "type": "Person"
          }
        ], 
        "genre": "chapter", 
        "id": "sg:pub.10.1007/978-3-319-99010-1_20", 
        "inLanguage": [
          "en"
        ], 
        "isAccessibleForFree": false, 
        "isPartOf": {
          "isbn": [
            "978-3-319-99009-5", 
            "978-3-319-99010-1"
          ], 
          "name": "Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2018", 
          "type": "Book"
        }, 
        "name": "Expanding N-grams for Code-Switch Language Models", 
        "pagination": "221-229", 
        "productId": [
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/978-3-319-99010-1_20"
            ]
          }, 
          {
            "name": "readcube_id", 
            "type": "PropertyValue", 
            "value": [
              "3262ecd6e5e62b28b9adb4e2180fc3b8a1ae883139e275330b703f62323d5557"
            ]
          }, 
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1106409157"
            ]
          }
        ], 
        "publisher": {
          "location": "Cham", 
          "name": "Springer International Publishing", 
          "type": "Organisation"
        }, 
        "sameAs": [
          "https://doi.org/10.1007/978-3-319-99010-1_20", 
          "https://app.dimensions.ai/details/publication/pub.1106409157"
        ], 
        "sdDataset": "chapters", 
        "sdDatePublished": "2019-04-15T22:23", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-uberresearch-data-dimensions-target-20181106-alternative/cleanup/v134/2549eaecd7973599484d7c17b260dba0a4ecb94b/merge/v9/a6c9fde33151104705d4d7ff012ea9563521a3ce/jats-lookup/v90/0000000001_0000000264/records_8693_00000485.jsonl", 
        "type": "Chapter", 
        "url": "http://link.springer.com/10.1007/978-3-319-99010-1_20"
      }
    ]
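
    As a rough sketch of how a record like the one above might be consumed, the snippet below loads a trimmed copy of the JSON-LD (only the fields it uses, with values copied from the record) and pulls out the title, author names, DOI and cited ids. The `summarize` helper and the shape of its return value are assumptions made for illustration. Because the citation array lists a few ids more than once, the sketch drops repeats while keeping the original order.

```python
import json

# Trimmed copy of the record above; only the fields used below are kept.
RECORD_JSON = """
[
  {
    "id": "sg:pub.10.1007/978-3-319-99010-1_20",
    "name": "Expanding N-grams for Code-Switch Language Models",
    "author": [
      {"familyName": "Hamed", "givenName": "Injy"},
      {"familyName": "Elmahdy", "givenName": "Mohamed"},
      {"familyName": "Abdennadher", "givenName": "Slim"}
    ],
    "citation": [
      {"id": "https://doi.org/10.1177/0739986304272358"},
      {"id": "https://doi.org/10.1177/0739986304272358"},
      {"id": "sg:pub.10.1007/978-3-540-70939-8_7"}
    ],
    "productId": [
      {"name": "doi", "value": ["10.1007/978-3-319-99010-1_20"]},
      {"name": "dimensions_id", "value": ["pub.1106409157"]}
    ]
  }
]
"""


def summarize(record: dict) -> dict:
    """Extract a few commonly needed fields from one SciGraph-style record."""
    authors = [f"{a['givenName']} {a['familyName']}" for a in record.get("author", [])]
    # productId is a list of name/value pairs; each value is itself a list.
    ids = {p["name"]: p["value"][0] for p in record.get("productId", [])}
    # dict.fromkeys keeps first-seen order while dropping repeated citation ids.
    citations = list(dict.fromkeys(c["id"] for c in record.get("citation", [])))
    return {
        "title": record["name"],
        "authors": authors,
        "doi": ids.get("doi"),
        "citations": citations,
    }


summary = summarize(json.loads(RECORD_JSON)[0])
print(summary)
```

    The top-level value is a JSON array, so the record itself is its first element.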
     


    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/978-3-319-99010-1_20'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/978-3-319-99010-1_20'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/978-3-319-99010-1_20'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/978-3-319-99010-1_20'
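
    The four curl commands differ only in their Accept header, which is HTTP content negotiation. A minimal Python sketch of the same idea, using only the standard library, might look like this. The short format labels, `build_request` and `fetch` are illustrative names rather than part of any SciGraph client, and the sketch assumes the endpoint still answers as the curl examples describe.

```python
import urllib.request

BASE_URL = "https://scigraph.springernature.com/"

# Media types taken from the curl commands above; the short keys are our own labels.
ACCEPT = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(record_id: str, fmt: str = "json-ld") -> urllib.request.Request:
    """Build a GET request whose Accept header selects the RDF serialization."""
    if fmt not in ACCEPT:
        raise ValueError(f"unknown format {fmt!r}; expected one of {sorted(ACCEPT)}")
    return urllib.request.Request(BASE_URL + record_id, headers={"Accept": ACCEPT[fmt]})


def fetch(record_id: str, fmt: str = "json-ld", timeout: float = 10.0) -> str:
    """Send the request and return the response body as text (needs network access)."""
    with urllib.request.urlopen(build_request(record_id, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the request does not touch the network, so it is safe to inspect.
req = build_request("pub.10.1007/978-3-319-99010-1_20", "turtle")
print(req.full_url, req.get_header("Accept"))
```

    Calling `fetch("pub.10.1007/978-3-319-99010-1_20")` would then correspond to the first curl command.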


     

    This table displays all metadata directly associated with this object as RDF triples.

    140 TRIPLES      23 PREDICATES      42 URIs      20 LITERALS      8 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1007/978-3-319-99010-1_20 schema:about anzsrc-for:20
    2 anzsrc-for:2004
    3 schema:author N98cf1da853d94ab790960e632c3a9c19
    4 schema:citation sg:pub.10.1007/978-3-540-70939-8_7
    5 https://doi.org/10.1016/j.pragma.2004.10.010
    6 https://doi.org/10.1016/j.procs.2016.04.039
    7 https://doi.org/10.1016/j.procs.2016.04.044
    8 https://doi.org/10.1016/j.procs.2017.10.111
    9 https://doi.org/10.1016/s0167-6393(00)00095-9
    10 https://doi.org/10.1109/fskd.2009.434
    11 https://doi.org/10.1109/icassp.2012.6289015
    12 https://doi.org/10.1109/icassp.2014.6854536
    13 https://doi.org/10.1109/iscslp.2010.5684900
    14 https://doi.org/10.1109/msp.2008.918417
    15 https://doi.org/10.1111/1467-971x.00181
    16 https://doi.org/10.1177/0739986304272358
    17 https://doi.org/10.3115/v1/w14-3904
    18 https://doi.org/10.5923/j.ajsp.20120205.02
    19 schema:datePublished 2019
    20 schema:datePublishedReg 2019-01-01
    21 schema:description It has become common, especially among urban youth, for people to use more than one language in their everyday conversations - a phenomenon referred to by linguists as “code-switching”. With the rise in globalization and the widespread of code-switching among multilingual societies, a great demand has been placed on Natural Language Processing (NLP) applications to be able to handle such mixed data. In this paper, we present our efforts in language modeling for code-switch Arabic-English. In order to train a language model (LM), huge amounts of text data is required in the respective language. However, the main challenge faced in language modeling for code-switch languages, is the lack of available data. In this paper, we propose an approach to artificially generate code-switch Arabic-English n-grams and thus improve the language model. This was done by expanding the relatively-small available corpus and its corresponding n-grams using translation-based approaches. The final LM achieved relative improvements in both perplexity and OOV rates of 1.97% and 16.36% respectively.
    22 schema:editor N13b7a9e7305149e49f6022223f36c474
    23 schema:genre chapter
    24 schema:inLanguage en
    25 schema:isAccessibleForFree false
    26 schema:isPartOf N1aa7257ebd504ef1a8b9c869b84c2156
    27 schema:name Expanding N-grams for Code-Switch Language Models
    28 schema:pagination 221-229
    29 schema:productId N8f1b01286b2f4f86ad22fd1a2e3dd873
    30 Ne17fe465f0c44d7580bd29369b3fa767
    31 Nff1975d0675b40c09b7a847bd395947e
    32 schema:publisher Nad11d5f64d514926bb4903ddd64ed89f
    33 schema:sameAs https://app.dimensions.ai/details/publication/pub.1106409157
    34 https://doi.org/10.1007/978-3-319-99010-1_20
    35 schema:sdDatePublished 2019-04-15T22:23
    36 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    37 schema:sdPublisher N634fe05d95044fa7bec80d0a01f810c0
    38 schema:url http://link.springer.com/10.1007/978-3-319-99010-1_20
    39 sgo:license sg:explorer/license/
    40 sgo:sdDataset chapters
    41 rdf:type schema:Chapter
    42 N088c069b51734ba795f335fbdb8a33a9 rdf:first N69f5bddabecb44288cbad7b3e24f5e38
    43 rdf:rest Nbc349e9387e14b6eb8846fc06c77f375
    44 N13b7a9e7305149e49f6022223f36c474 rdf:first N80de7526edfb4386b2249d2ce6db701b
    45 rdf:rest Nf9ccff2facbc40a1ac6cc26d106b105d
    46 N16142747ebcd48c4a25c2d398d711c97 schema:familyName Azar
    47 schema:givenName Ahmad Taher
    48 rdf:type schema:Person
    49 N1aa7257ebd504ef1a8b9c869b84c2156 schema:isbn 978-3-319-99009-5
    50 978-3-319-99010-1
    51 schema:name Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2018
    52 rdf:type schema:Book
    53 N3b120f974cf74aa6b35f7021dd3facad rdf:first sg:person.013322724557.53
    54 rdf:rest N880be9c10e084f14b0bd7bed03d4295b
    55 N634fe05d95044fa7bec80d0a01f810c0 schema:name Springer Nature - SN SciGraph project
    56 rdf:type schema:Organization
    57 N69f5bddabecb44288cbad7b3e24f5e38 schema:familyName Shaalan
    58 schema:givenName Khaled
    59 rdf:type schema:Person
    60 N80de7526edfb4386b2249d2ce6db701b schema:familyName Hassanien
    61 schema:givenName Aboul Ella
    62 rdf:type schema:Person
    63 N82292a278dc1490e8b304eaa2b6445b9 schema:familyName Tolba
    64 schema:givenName Mohamed F.
    65 rdf:type schema:Person
    66 N880be9c10e084f14b0bd7bed03d4295b rdf:first sg:person.010445445574.13
    67 rdf:rest rdf:nil
    68 N8f1b01286b2f4f86ad22fd1a2e3dd873 schema:name doi
    69 schema:value 10.1007/978-3-319-99010-1_20
    70 rdf:type schema:PropertyValue
    71 N98cf1da853d94ab790960e632c3a9c19 rdf:first sg:person.016542541344.73
    72 rdf:rest N3b120f974cf74aa6b35f7021dd3facad
    73 Nad11d5f64d514926bb4903ddd64ed89f schema:location Cham
    74 schema:name Springer International Publishing
    75 rdf:type schema:Organisation
    76 Nbc349e9387e14b6eb8846fc06c77f375 rdf:first N16142747ebcd48c4a25c2d398d711c97
    77 rdf:rest rdf:nil
    78 Ne17fe465f0c44d7580bd29369b3fa767 schema:name readcube_id
    79 schema:value 3262ecd6e5e62b28b9adb4e2180fc3b8a1ae883139e275330b703f62323d5557
    80 rdf:type schema:PropertyValue
    81 Nf9ccff2facbc40a1ac6cc26d106b105d rdf:first N82292a278dc1490e8b304eaa2b6445b9
    82 rdf:rest N088c069b51734ba795f335fbdb8a33a9
    83 Nff1975d0675b40c09b7a847bd395947e schema:name dimensions_id
    84 schema:value pub.1106409157
    85 rdf:type schema:PropertyValue
    86 anzsrc-for:20 schema:inDefinedTermSet anzsrc-for:
    87 schema:name Language, Communication and Culture
    88 rdf:type schema:DefinedTerm
    89 anzsrc-for:2004 schema:inDefinedTermSet anzsrc-for:
    90 schema:name Linguistics
    91 rdf:type schema:DefinedTerm
    92 sg:person.010445445574.13 schema:affiliation https://www.grid.ac/institutes/grid.187323.c
    93 schema:familyName Abdennadher
    94 schema:givenName Slim
    95 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.010445445574.13
    96 rdf:type schema:Person
    97 sg:person.013322724557.53 schema:affiliation https://www.grid.ac/institutes/grid.187323.c
    98 schema:familyName Elmahdy
    99 schema:givenName Mohamed
    100 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013322724557.53
    101 rdf:type schema:Person
    102 sg:person.016542541344.73 schema:affiliation https://www.grid.ac/institutes/grid.187323.c
    103 schema:familyName Hamed
    104 schema:givenName Injy
    105 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.016542541344.73
    106 rdf:type schema:Person
    107 sg:pub.10.1007/978-3-540-70939-8_7 schema:sameAs https://app.dimensions.ai/details/publication/pub.1052256105
    108 https://doi.org/10.1007/978-3-540-70939-8_7
    109 rdf:type schema:CreativeWork
    110 https://doi.org/10.1016/j.pragma.2004.10.010 schema:sameAs https://app.dimensions.ai/details/publication/pub.1002477324
    111 rdf:type schema:CreativeWork
    112 https://doi.org/10.1016/j.procs.2016.04.039 schema:sameAs https://app.dimensions.ai/details/publication/pub.1017048649
    113 rdf:type schema:CreativeWork
    114 https://doi.org/10.1016/j.procs.2016.04.044 schema:sameAs https://app.dimensions.ai/details/publication/pub.1044573307
    115 rdf:type schema:CreativeWork
    116 https://doi.org/10.1016/j.procs.2017.10.111 schema:sameAs https://app.dimensions.ai/details/publication/pub.1092616592
    117 rdf:type schema:CreativeWork
    118 https://doi.org/10.1016/s0167-6393(00)00095-9 schema:sameAs https://app.dimensions.ai/details/publication/pub.1038142378
    119 rdf:type schema:CreativeWork
    120 https://doi.org/10.1109/fskd.2009.434 schema:sameAs https://app.dimensions.ai/details/publication/pub.1093463963
    121 rdf:type schema:CreativeWork
    122 https://doi.org/10.1109/icassp.2012.6289015 schema:sameAs https://app.dimensions.ai/details/publication/pub.1095336192
    123 rdf:type schema:CreativeWork
    124 https://doi.org/10.1109/icassp.2014.6854536 schema:sameAs https://app.dimensions.ai/details/publication/pub.1093350674
    125 rdf:type schema:CreativeWork
    126 https://doi.org/10.1109/iscslp.2010.5684900 schema:sameAs https://app.dimensions.ai/details/publication/pub.1093814750
    127 rdf:type schema:CreativeWork
    128 https://doi.org/10.1109/msp.2008.918417 schema:sameAs https://app.dimensions.ai/details/publication/pub.1061423090
    129 rdf:type schema:CreativeWork
    130 https://doi.org/10.1111/1467-971x.00181 schema:sameAs https://app.dimensions.ai/details/publication/pub.1008993237
    131 rdf:type schema:CreativeWork
    132 https://doi.org/10.1177/0739986304272358 schema:sameAs https://app.dimensions.ai/details/publication/pub.1043943858
    133 rdf:type schema:CreativeWork
    134 https://doi.org/10.3115/v1/w14-3904 schema:sameAs https://app.dimensions.ai/details/publication/pub.1099117564
    135 rdf:type schema:CreativeWork
    136 https://doi.org/10.5923/j.ajsp.20120205.02 schema:sameAs https://app.dimensions.ai/details/publication/pub.1073508165
    137 rdf:type schema:CreativeWork
    138 https://www.grid.ac/institutes/grid.187323.c schema:alternateName German University in Cairo
    139 schema:name The German University in Cairo
    140 rdf:type schema:Organization
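
    In the table above, author order is not stored as a plain array: schema:author points at a blank node, and the order is encoded as an RDF list chained through rdf:first and rdf:rest until rdf:nil. The following sketch shows one way such a list could be walked. The triples are copied from rows 3, 53-54, 66-67 and 71-72 of the table; the tuple representation and the `walk_rdf_list` helper are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]
RDF_NIL = "rdf:nil"

# (subject, predicate, object) triples copied from the table above.
TRIPLES: List[Triple] = [
    ("sg:pub.10.1007/978-3-319-99010-1_20", "schema:author", "N98cf1da853d94ab790960e632c3a9c19"),
    ("N3b120f974cf74aa6b35f7021dd3facad", "rdf:first", "sg:person.013322724557.53"),
    ("N3b120f974cf74aa6b35f7021dd3facad", "rdf:rest", "N880be9c10e084f14b0bd7bed03d4295b"),
    ("N880be9c10e084f14b0bd7bed03d4295b", "rdf:first", "sg:person.010445445574.13"),
    ("N880be9c10e084f14b0bd7bed03d4295b", "rdf:rest", RDF_NIL),
    ("N98cf1da853d94ab790960e632c3a9c19", "rdf:first", "sg:person.016542541344.73"),
    ("N98cf1da853d94ab790960e632c3a9c19", "rdf:rest", "N3b120f974cf74aa6b35f7021dd3facad"),
]


def walk_rdf_list(head: str, triples: Iterable[Triple]) -> List[str]:
    """Follow rdf:first / rdf:rest links from head until rdf:nil, returning items in order."""
    triples = list(triples)
    first = {s: o for s, p, o in triples if p == "rdf:first"}
    rest = {s: o for s, p, o in triples if p == "rdf:rest"}
    items: List[str] = []
    seen = set()
    node = head
    while node != RDF_NIL:
        # Guard against cycles and dangling nodes in malformed data.
        if node in seen or node not in first or node not in rest:
            raise ValueError(f"malformed RDF list at {node}")
        seen.add(node)
        items.append(first[node])
        node = rest[node]
    return items


head = next(o for s, p, o in TRIPLES if p == "schema:author")
authors = walk_rdf_list(head, TRIPLES)
print(authors)
```

    The same walk applies to the schema:editor list, whose head is the blank node in row 22.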
     



