Data distribution debugging in machine learning pipelines


Ontology type: schema:ScholarlyArticle     


Article Info

DATE

2022-01-31

AUTHORS

Stefan Grafberger, Paul Groth, Julia Stoyanovich, Sebastian Schelter

ABSTRACT

Machine learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often brittle with respect to their input data, which leads to concerns about their correctness, reliability, and fairness. In this paper, we describe mlinspect, a library that helps diagnose and mitigate technical bias that may arise during preprocessing steps in an ML pipeline. We refer to these problems collectively as data distribution bugs. The key idea is to extract a directed acyclic graph representation of the dataflow from a preprocessing pipeline and to use this representation to automatically instrument the code with predefined inspections. These inspections are based on a lightweight annotation propagation approach to propagate metadata such as lineage information from operator to operator. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation. We discuss the design and implementation of the mlinspect library and give a comprehensive end-to-end example that illustrates its functionality.
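
The annotation propagation idea in the abstract can be illustrated with a small, self-contained sketch. This is not mlinspect's API: the names `AnnotatedRow`, `source`, `selection`, and `histogram`, and the toy data, are assumptions made up for illustration. Each row carries a lineage id from operator to operator, and a simple inspection compares the distribution of a column before and after a filter, which is one way a data distribution bug could show up.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AnnotatedRow:
    """A data row plus the annotation (here a lineage id) that travels with it."""
    values: Dict[str, object]
    lineage_id: int


def source(rows: List[Dict[str, object]]) -> List[AnnotatedRow]:
    """Data source operator: attach a fresh lineage id to every input row."""
    return [AnnotatedRow(values=row, lineage_id=i) for i, row in enumerate(rows)]


def selection(rows: List[AnnotatedRow],
              predicate: Callable[[Dict[str, object]], bool]) -> List[AnnotatedRow]:
    """Filter operator: surviving rows keep their annotation unchanged."""
    return [row for row in rows if predicate(row.values)]


def histogram(rows: List[AnnotatedRow], column: str) -> Dict[object, float]:
    """Inspection: relative frequency of each value in `column`."""
    counts = Counter(row.values[column] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Hypothetical toy data, not taken from the paper.
raw = [
    {"age_group": "young", "income": 20},
    {"age_group": "young", "income": 55},
    {"age_group": "old", "income": 60},
    {"age_group": "old", "income": 70},
]

before = source(raw)
after = selection(before, lambda values: values["income"] >= 50)

print(histogram(before, "age_group"))      # group shares before the filter
print(histogram(after, "age_group"))       # group shares after the filter
print([row.lineage_id for row in after])   # input rows that survived the filter
```

In this toy run the filter removes one of the two "young" rows, so the share of that group drops after the operator, and the lineage ids identify which input rows remain.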

PAGES

1-24

References to SciGraph publications

  • 2016-03-15. The FAIR Guiding Principles for scientific data management and stewardship in SCIENTIFIC DATA
  • 2015-03-21. noWorkflow: Capturing and Analyzing Provenance of Scripts in PROVENANCE AND ANNOTATION OF DATA AND PROCESSES
  • 2010. StarFlow: A Script-Centric Data Analysis Environment in PROVENANCE AND ANNOTATION OF DATA AND PROCESSES
  • 2017-10-16. A survey on provenance: What for? What form? What from? in THE VLDB JOURNAL

Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w

    DOI

    http://dx.doi.org/10.1007/s00778-021-00726-w

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1145112127



    JSON-LD is the canonical representation for SciGraph data.


    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Artificial Intelligence and Image Processing", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "University of Amsterdam, Amsterdam, Netherlands", 
              "id": "http://www.grid.ac/institutes/grid.7177.6", 
              "name": [
                "University of Amsterdam, Amsterdam, Netherlands"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Grafberger", 
            "givenName": "Stefan", 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "University of Amsterdam, Amsterdam, Netherlands", 
              "id": "http://www.grid.ac/institutes/grid.7177.6", 
              "name": [
                "University of Amsterdam, Amsterdam, Netherlands"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Groth", 
            "givenName": "Paul", 
            "id": "sg:person.012677400323.64", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012677400323.64"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "New York University, New York, USA", 
              "id": "http://www.grid.ac/institutes/grid.137628.9", 
              "name": [
                "New York University, New York, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Stoyanovich", 
            "givenName": "Julia", 
            "id": "sg:person.0615021500.86", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0615021500.86"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "University of Amsterdam, Amsterdam, Netherlands", 
              "id": "http://www.grid.ac/institutes/grid.7177.6", 
              "name": [
                "University of Amsterdam, Amsterdam, Netherlands"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Schelter", 
            "givenName": "Sebastian", 
            "id": "sg:person.014235450664.80", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014235450664.80"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "sg:pub.10.1007/s00778-017-0486-1", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1092243991", 
              "https://doi.org/10.1007/s00778-017-0486-1"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1038/sdata.2016.18", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1005603549", 
              "https://doi.org/10.1038/sdata.2016.18"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-319-16462-5_6", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1015501966", 
              "https://doi.org/10.1007/978-3-319-16462-5_6"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-642-17819-1_27", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1021418233", 
              "https://doi.org/10.1007/978-3-642-17819-1_27"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2022-01-31", 
        "datePublishedReg": "2022-01-31", 
        "description": "Machine learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often brittle with respect to their input data, which leads to concerns about their correctness, reliability, and fairness. In this paper, we describe mlinspect, a library that helps diagnose and mitigate technical bias that may arise during preprocessing steps in an ML pipeline. We refer to these problems collectively as data distribution bugs. The key idea is to extract a directed acyclic graph representation of the dataflow from a preprocessing pipeline and to use this representation to automatically instrument the code with predefined inspections. These inspections are based on a lightweight annotation propagation approach to propagate metadata such as lineage information from operator to operator. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation. We discuss the design and implementation of the mlinspect library and give a comprehensive end-to-end example that illustrates its functionality.", 
        "genre": "article", 
        "id": "sg:pub.10.1007/s00778-021-00726-w", 
        "inLanguage": "en", 
        "isAccessibleForFree": false, 
        "isFundedItemOf": [
          {
            "id": "sg:grant.7923555", 
            "type": "MonetaryGrant"
          }, 
          {
            "id": "sg:grant.8567278", 
            "type": "MonetaryGrant"
          }, 
          {
            "id": "sg:grant.8566338", 
            "type": "MonetaryGrant"
          }
        ], 
        "isPartOf": [
          {
            "id": "sg:journal.1044889", 
            "issn": [
              "1066-8888", 
              "0949-877X"
            ], 
            "name": "The VLDB Journal", 
            "publisher": "Springer Nature", 
            "type": "Periodical"
          }
        ], 
        "keywords": [
          "machine learning", 
          "data science libraries", 
          "acyclic graph representation", 
          "declarative abstraction", 
          "code instrumentation", 
          "ML pipeline", 
          "ML applications", 
          "data distribution", 
          "graph representation", 
          "key idea", 
          "input data", 
          "comprehensive end", 
          "lineage information", 
          "end examples", 
          "propagation approach", 
          "sciences libraries", 
          "pipeline", 
          "library", 
          "dataflow", 
          "metadata", 
          "representation", 
          "operators", 
          "machine", 
          "inspection", 
          "correctness", 
          "abstraction", 
          "bugs", 
          "learning", 
          "fairness", 
          "widespread use", 
          "code", 
          "implementation", 
          "functionality", 
          "information", 
          "impactful decisions", 
          "technical bias", 
          "applications", 
          "reliability", 
          "decisions", 
          "idea", 
          "design", 
          "example", 
          "work", 
          "operates", 
          "step", 
          "data", 
          "makers", 
          "scientists", 
          "attention", 
          "end", 
          "use", 
          "concern", 
          "instrumentation", 
          "respect", 
          "policy makers", 
          "medium", 
          "distribution", 
          "bias", 
          "contrast", 
          "risk", 
          "paper", 
          "problem", 
          "approach"
        ], 
        "name": "Data distribution debugging in machine learning pipelines", 
        "pagination": "1-24", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1145112127"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/s00778-021-00726-w"
            ]
          }
        ], 
        "sameAs": [
          "https://doi.org/10.1007/s00778-021-00726-w", 
          "https://app.dimensions.ai/details/publication/pub.1145112127"
        ], 
        "sdDataset": "articles", 
        "sdDatePublished": "2022-06-01T22:26", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-springernature-scigraph/baseset/20220601/entities/gbq_results/article/article_940.jsonl", 
        "type": "ScholarlyArticle", 
        "url": "https://doi.org/10.1007/s00778-021-00726-w"
      }
    ]
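
Because JSON-LD is plain JSON, a record shaped like the one above can be read with a standard JSON parser. The sketch below embeds an abbreviated copy of the record (only a few fields and two of the four authors) so that it runs offline; the helper names `author_names` and `product_id` are illustrative assumptions, not part of any SciGraph tooling.

```python
import json
from typing import Any, Dict, List, Optional

# Abbreviated copy of the record above; most fields are omitted for brevity.
RECORD_JSON = """
[
  {
    "name": "Data distribution debugging in machine learning pipelines",
    "author": [
      {"familyName": "Grafberger", "givenName": "Stefan", "type": "Person"},
      {"familyName": "Groth", "givenName": "Paul", "type": "Person"}
    ],
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1145112127"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1007/s00778-021-00726-w"]}
    ]
  }
]
"""


def author_names(record: Dict[str, Any]) -> List[str]:
    """Return 'Given Family' strings in the order the authors are listed."""
    return [f"{a['givenName']} {a['familyName']}" for a in record.get("author", [])]


def product_id(record: Dict[str, Any], name: str) -> Optional[str]:
    """Return the first value of the productId entry called `name`, if present."""
    for entry in record.get("productId", []):
        if entry.get("name") == name and entry.get("value"):
            return entry["value"][0]
    return None


# The top level of the document is a list containing one record.
record = json.loads(RECORD_JSON)[0]

print(record["name"])
print(author_names(record))
print(product_id(record, "doi"))
```

A missing identifier yields `None` rather than an exception, which is a design choice of this sketch, not something the record format prescribes.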
     


    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w'
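
The four curl commands above differ only in the `Accept` header, so the same content negotiation can be sketched in Python with the standard library. The URL and media types are copied from the commands above; the function names, the short format keys, and the timeout are assumptions, and the sketch does not assume the endpoint is still reachable (the module itself only builds a request and sends nothing).

```python
import urllib.request

RECORD_URL = "https://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w"

# Media types taken from the curl examples above; the keys are illustrative.
ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(fmt: str, url: str = RECORD_URL) -> urllib.request.Request:
    """Build a GET request that asks the server for the given RDF serialization."""
    return urllib.request.Request(url, headers={"Accept": ACCEPT_TYPES[fmt]})


def fetch(fmt: str, url: str = RECORD_URL, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text (needs network access)."""
    with urllib.request.urlopen(build_request(fmt, url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building a request does not touch the network; call fetch("turtle") to download.
request = build_request("turtle")
print(request.full_url)
print(request.get_header("Accept"))
```

Keeping request construction separate from the network call makes the header logic easy to check without depending on the remote service.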


     

    This table displays all metadata directly associated to this object as RDF triples.

    160 TRIPLES      22 PREDICATES      90 URIs      78 LITERALS      4 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1007/s00778-021-00726-w schema:about anzsrc-for:08
    2 anzsrc-for:0801
    3 schema:author N1d5753529e284142959d85c1dba22bc2
    4 schema:citation sg:pub.10.1007/978-3-319-16462-5_6
    5 sg:pub.10.1007/978-3-642-17819-1_27
    6 sg:pub.10.1007/s00778-017-0486-1
    7 sg:pub.10.1038/sdata.2016.18
    8 schema:datePublished 2022-01-31
    9 schema:datePublishedReg 2022-01-31
    10 schema:description Machine learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often brittle with respect to their input data, which leads to concerns about their correctness, reliability, and fairness. In this paper, we describe mlinspect, a library that helps diagnose and mitigate technical bias that may arise during preprocessing steps in an ML pipeline. We refer to these problems collectively as data distribution bugs. The key idea is to extract a directed acyclic graph representation of the dataflow from a preprocessing pipeline and to use this representation to automatically instrument the code with predefined inspections. These inspections are based on a lightweight annotation propagation approach to propagate metadata such as lineage information from operator to operator. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation. We discuss the design and implementation of the mlinspect library and give a comprehensive end-to-end example that illustrates its functionality.
    11 schema:genre article
    12 schema:inLanguage en
    13 schema:isAccessibleForFree false
    14 schema:isPartOf sg:journal.1044889
    15 schema:keywords ML applications
    16 ML pipeline
    17 abstraction
    18 acyclic graph representation
    19 applications
    20 approach
    21 attention
    22 bias
    23 bugs
    24 code
    25 code instrumentation
    26 comprehensive end
    27 concern
    28 contrast
    29 correctness
    30 data
    31 data distribution
    32 data science libraries
    33 dataflow
    34 decisions
    35 declarative abstraction
    36 design
    37 distribution
    38 end
    39 end examples
    40 example
    41 fairness
    42 functionality
    43 graph representation
    44 idea
    45 impactful decisions
    46 implementation
    47 information
    48 input data
    49 inspection
    50 instrumentation
    51 key idea
    52 learning
    53 library
    54 lineage information
    55 machine
    56 machine learning
    57 makers
    58 medium
    59 metadata
    60 operates
    61 operators
    62 paper
    63 pipeline
    64 policy makers
    65 problem
    66 propagation approach
    67 reliability
    68 representation
    69 respect
    70 risk
    71 sciences libraries
    72 scientists
    73 step
    74 technical bias
    75 use
    76 widespread use
    77 work
    78 schema:name Data distribution debugging in machine learning pipelines
    79 schema:pagination 1-24
    80 schema:productId N03a7c2ea8cc2433b9548f24d6c01b1c5
    81 N353f1090b0e34c8eac11b403551298c2
    82 schema:sameAs https://app.dimensions.ai/details/publication/pub.1145112127
    83 https://doi.org/10.1007/s00778-021-00726-w
    84 schema:sdDatePublished 2022-06-01T22:26
    85 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    86 schema:sdPublisher N9d120057eef0445f8a5804533ebdbd26
    87 schema:url https://doi.org/10.1007/s00778-021-00726-w
    88 sgo:license sg:explorer/license/
    89 sgo:sdDataset articles
    90 rdf:type schema:ScholarlyArticle
    91 N03a7c2ea8cc2433b9548f24d6c01b1c5 schema:name doi
    92 schema:value 10.1007/s00778-021-00726-w
    93 rdf:type schema:PropertyValue
    94 N1d5753529e284142959d85c1dba22bc2 rdf:first Nef4f158411624cc786936acfaa036ff5
    95 rdf:rest N21024948ccf345148bc9250d92f63345
    96 N21024948ccf345148bc9250d92f63345 rdf:first sg:person.012677400323.64
    97 rdf:rest Ncbcb281017074158b08ddad4217d7252
    98 N353f1090b0e34c8eac11b403551298c2 schema:name dimensions_id
    99 schema:value pub.1145112127
    100 rdf:type schema:PropertyValue
    101 N59da31d944b8470f8689962661da9fc9 rdf:first sg:person.014235450664.80
    102 rdf:rest rdf:nil
    103 N9d120057eef0445f8a5804533ebdbd26 schema:name Springer Nature - SN SciGraph project
    104 rdf:type schema:Organization
    105 Ncbcb281017074158b08ddad4217d7252 rdf:first sg:person.0615021500.86
    106 rdf:rest N59da31d944b8470f8689962661da9fc9
    107 Nef4f158411624cc786936acfaa036ff5 schema:affiliation grid-institutes:grid.7177.6
    108 schema:familyName Grafberger
    109 schema:givenName Stefan
    110 rdf:type schema:Person
    111 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    112 schema:name Information and Computing Sciences
    113 rdf:type schema:DefinedTerm
    114 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
    115 schema:name Artificial Intelligence and Image Processing
    116 rdf:type schema:DefinedTerm
    117 sg:grant.7923555 http://pending.schema.org/fundedItem sg:pub.10.1007/s00778-021-00726-w
    118 rdf:type schema:MonetaryGrant
    119 sg:grant.8566338 http://pending.schema.org/fundedItem sg:pub.10.1007/s00778-021-00726-w
    120 rdf:type schema:MonetaryGrant
    121 sg:grant.8567278 http://pending.schema.org/fundedItem sg:pub.10.1007/s00778-021-00726-w
    122 rdf:type schema:MonetaryGrant
    123 sg:journal.1044889 schema:issn 0949-877X
    124 1066-8888
    125 schema:name The VLDB Journal
    126 schema:publisher Springer Nature
    127 rdf:type schema:Periodical
    128 sg:person.012677400323.64 schema:affiliation grid-institutes:grid.7177.6
    129 schema:familyName Groth
    130 schema:givenName Paul
    131 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012677400323.64
    132 rdf:type schema:Person
    133 sg:person.014235450664.80 schema:affiliation grid-institutes:grid.7177.6
    134 schema:familyName Schelter
    135 schema:givenName Sebastian
    136 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014235450664.80
    137 rdf:type schema:Person
    138 sg:person.0615021500.86 schema:affiliation grid-institutes:grid.137628.9
    139 schema:familyName Stoyanovich
    140 schema:givenName Julia
    141 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0615021500.86
    142 rdf:type schema:Person
    143 sg:pub.10.1007/978-3-319-16462-5_6 schema:sameAs https://app.dimensions.ai/details/publication/pub.1015501966
    144 https://doi.org/10.1007/978-3-319-16462-5_6
    145 rdf:type schema:CreativeWork
    146 sg:pub.10.1007/978-3-642-17819-1_27 schema:sameAs https://app.dimensions.ai/details/publication/pub.1021418233
    147 https://doi.org/10.1007/978-3-642-17819-1_27
    148 rdf:type schema:CreativeWork
    149 sg:pub.10.1007/s00778-017-0486-1 schema:sameAs https://app.dimensions.ai/details/publication/pub.1092243991
    150 https://doi.org/10.1007/s00778-017-0486-1
    151 rdf:type schema:CreativeWork
    152 sg:pub.10.1038/sdata.2016.18 schema:sameAs https://app.dimensions.ai/details/publication/pub.1005603549
    153 https://doi.org/10.1038/sdata.2016.18
    154 rdf:type schema:CreativeWork
    155 grid-institutes:grid.137628.9 schema:alternateName New York University, New York, USA
    156 schema:name New York University, New York, USA
    157 rdf:type schema:Organization
    158 grid-institutes:grid.7177.6 schema:alternateName University of Amsterdam, Amsterdam, Netherlands
    159 schema:name University of Amsterdam, Amsterdam, Netherlands
    160 rdf:type schema:Organization
     



