Data distribution debugging in machine learning pipelines


Ontology type: schema:ScholarlyArticle     


Article Info

DATE

2022-01-31

AUTHORS

Stefan Grafberger, Paul Groth, Julia Stoyanovich, Sebastian Schelter

ABSTRACT

Machine learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often brittle with respect to their input data, which leads to concerns about their correctness, reliability, and fairness. In this paper, we describe mlinspect, a library that helps diagnose and mitigate technical bias that may arise during preprocessing steps in an ML pipeline. We refer to these problems collectively as data distribution bugs. The key idea is to extract a directed acyclic graph representation of the dataflow from a preprocessing pipeline and to use this representation to automatically instrument the code with predefined inspections. These inspections are based on a lightweight annotation propagation approach to propagate metadata such as lineage information from operator to operator. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation. We discuss the design and implementation of the mlinspect library and give a comprehensive end-to-end example that illustrates its functionality.
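The annotation propagation idea in the abstract can be illustrated with a toy sketch. This is not mlinspect's API: the operator functions (`source`, `select`, `project`), the inspection (`group_ratios`), and the sample records are all hypothetical and only show how a lineage annotation carried from operator to operator lets an inspection detect a shift in group proportions, even after the group column has been projected away.

```python
from collections import Counter

def source(name, records):
    """Source operator: annotate every record with a lineage id (source name, row index)."""
    return [(rec, frozenset({(name, i)})) for i, rec in enumerate(records)]

def select(rows, predicate):
    """Filter operator: surviving records keep their annotation unchanged."""
    return [(rec, lin) for rec, lin in rows if predicate(rec)]

def project(rows, fn):
    """Map operator: the record is transformed, the annotation is passed through."""
    return [(fn(rec), lin) for rec, lin in rows]

def group_ratios(rows, lookup, column):
    """Inspection: distribution of `column`, recovered from the original
    records via lineage ids rather than from the current records."""
    counts = Counter(lookup[lid][column] for _, lin in rows for lid in lin)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

# Hypothetical input data: two groups of equal size.
raw = [
    {"group": "a", "income": 50},
    {"group": "a", "income": 40},
    {"group": "a", "income": 30},
    {"group": "b", "income": 35},
    {"group": "b", "income": 10},
    {"group": "b", "income": 15},
]

rows = source("people", raw)
# Map each lineage id back to its original record.
lookup = {("people", i): rec for i, rec in enumerate(raw)}

before = group_ratios(rows, lookup, "group")
filtered = select(rows, lambda r: r["income"] >= 30)
projected = project(filtered, lambda r: {"income": r["income"]})  # drops "group"
after = group_ratios(projected, lookup, "group")

print("before filter:", before)
print("after filter and projection:", after)
```

In this sketch the filter shifts the group ratio from an even split to three quarters versus one quarter, which is the kind of change the abstract calls a data distribution bug.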

PAGES

1-24

References to SciGraph publications

  • 2016-03-15. The FAIR Guiding Principles for scientific data management and stewardship in SCIENTIFIC DATA
  • 2015-03-21. noWorkflow: Capturing and Analyzing Provenance of Scripts in PROVENANCE AND ANNOTATION OF DATA AND PROCESSES
  • 2010-11-30. StarFlow: A Script-Centric Data Analysis Environment in PROVENANCE AND ANNOTATION OF DATA AND PROCESSES
  • 2017-10-16. A survey on provenance: What for? What form? What from? in THE VLDB JOURNAL
Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w

    DOI

    http://dx.doi.org/10.1007/s00778-021-00726-w

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1145112127



    JSON-LD is the canonical representation for SciGraph data.


    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Artificial Intelligence and Image Processing", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "University of Amsterdam, Amsterdam, Netherlands", 
              "id": "http://www.grid.ac/institutes/grid.7177.6", 
              "name": [
                "University of Amsterdam, Amsterdam, Netherlands"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Grafberger", 
            "givenName": "Stefan", 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "University of Amsterdam, Amsterdam, Netherlands", 
              "id": "http://www.grid.ac/institutes/grid.7177.6", 
              "name": [
                "University of Amsterdam, Amsterdam, Netherlands"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Groth", 
            "givenName": "Paul", 
            "id": "sg:person.012677400323.64", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012677400323.64"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "New York University, New York, USA", 
              "id": "http://www.grid.ac/institutes/grid.137628.9", 
              "name": [
                "New York University, New York, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Stoyanovich", 
            "givenName": "Julia", 
            "id": "sg:person.0615021500.86", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0615021500.86"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "University of Amsterdam, Amsterdam, Netherlands", 
              "id": "http://www.grid.ac/institutes/grid.7177.6", 
              "name": [
                "University of Amsterdam, Amsterdam, Netherlands"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Schelter", 
            "givenName": "Sebastian", 
            "id": "sg:person.014235450664.80", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014235450664.80"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "sg:pub.10.1007/978-3-642-17819-1_27", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1021418233", 
              "https://doi.org/10.1007/978-3-642-17819-1_27"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/s00778-017-0486-1", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1092243991", 
              "https://doi.org/10.1007/s00778-017-0486-1"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1038/sdata.2016.18", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1005603549", 
              "https://doi.org/10.1038/sdata.2016.18"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-319-16462-5_6", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1015501966", 
              "https://doi.org/10.1007/978-3-319-16462-5_6"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2022-01-31", 
        "datePublishedReg": "2022-01-31", 
        "description": "Machine learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often brittle with respect to their input data, which leads to concerns about their correctness, reliability, and fairness. In this paper, we describe mlinspect, a library that helps diagnose and mitigate technical bias that may arise during preprocessing steps in an ML pipeline. We refer to these problems collectively as data distribution bugs. The key idea is to extract a directed acyclic graph representation of the dataflow from a preprocessing pipeline and to use this representation to automatically instrument the code with predefined inspections. These inspections are based on a lightweight annotation propagation approach to propagate metadata such as lineage information from operator to operator. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation. We discuss the design and implementation of the mlinspect library and give a comprehensive end-to-end example that illustrates its functionality.", 
        "genre": "article", 
        "id": "sg:pub.10.1007/s00778-021-00726-w", 
        "isAccessibleForFree": false, 
        "isFundedItemOf": [
          {
            "id": "sg:grant.8567278", 
            "type": "MonetaryGrant"
          }, 
          {
            "id": "sg:grant.7923555", 
            "type": "MonetaryGrant"
          }, 
          {
            "id": "sg:grant.8566338", 
            "type": "MonetaryGrant"
          }
        ], 
        "isPartOf": [
          {
            "id": "sg:journal.1044889", 
            "issn": [
              "1066-8888", 
              "0949-877X"
            ], 
            "name": "The VLDB Journal", 
            "publisher": "Springer Nature", 
            "type": "Periodical"
          }
        ], 
        "keywords": [
          "machine learning", 
          "data science libraries", 
          "acyclic graph representation", 
          "declarative abstraction", 
          "code instrumentation", 
          "ML pipeline", 
          "ML applications", 
          "data distribution", 
          "graph representation", 
          "key idea", 
          "input data", 
          "comprehensive end", 
          "lineage information", 
          "end examples", 
          "propagation approach", 
          "sciences libraries", 
          "pipeline", 
          "library", 
          "dataflow", 
          "metadata", 
          "representation", 
          "operators", 
          "machine", 
          "inspection", 
          "correctness", 
          "abstraction", 
          "bugs", 
          "learning", 
          "fairness", 
          "widespread use", 
          "code", 
          "implementation", 
          "functionality", 
          "information", 
          "impactful decisions", 
          "technical bias", 
          "applications", 
          "reliability", 
          "decisions", 
          "idea", 
          "design", 
          "example", 
          "work", 
          "operates", 
          "step", 
          "data", 
          "makers", 
          "scientists", 
          "attention", 
          "end", 
          "use", 
          "concern", 
          "instrumentation", 
          "respect", 
          "policy makers", 
          "medium", 
          "distribution", 
          "bias", 
          "contrast", 
          "risk", 
          "paper", 
          "problem", 
          "approach"
        ], 
        "name": "Data distribution debugging in machine learning pipelines", 
        "pagination": "1-24", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1145112127"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/s00778-021-00726-w"
            ]
          }
        ], 
        "sameAs": [
          "https://doi.org/10.1007/s00778-021-00726-w", 
          "https://app.dimensions.ai/details/publication/pub.1145112127"
        ], 
        "sdDataset": "articles", 
        "sdDatePublished": "2022-08-04T17:11", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-springernature-scigraph/baseset/20220804/entities/gbq_results/article/article_926.jsonl", 
        "type": "ScholarlyArticle", 
        "url": "https://doi.org/10.1007/s00778-021-00726-w"
      }
    ]
     


    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w'
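The same content negotiation can be sketched in Python using only the standard library. The URL pattern and the four media types are taken from the curl commands above; the helper names (`build_request`, `fetch`, `author_names`) are illustrative, and the sketch assumes the endpoint is still reachable and that the JSON-LD response has the same shape as the record shown earlier (a list containing one object with an `author` array).

```python
import json
import urllib.request

BASE_URL = "https://scigraph.springernature.com/pub."

# Media types copied from the curl examples above.
MEDIA_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}

def build_request(doi, fmt="json-ld"):
    """Build a GET request whose Accept header selects the RDF serialization."""
    return urllib.request.Request(
        BASE_URL + doi,
        headers={"Accept": MEDIA_TYPES[fmt]},
    )

def fetch(doi, fmt="json-ld", timeout=30):
    """Perform the request and return the response body as text."""
    with urllib.request.urlopen(build_request(doi, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")

def author_names(record_text):
    """Extract 'given family' author names from a JSON-LD record
    (assumes the list-with-one-object shape shown on this page)."""
    records = json.loads(record_text)
    return [f'{a["givenName"]} {a["familyName"]}' for a in records[0]["author"]]

if __name__ == "__main__":
    # Network access happens only when the script is run directly.
    body = fetch("10.1007/s00778-021-00726-w", "json-ld")
    print(author_names(body))
```

Keeping `build_request` separate from `fetch` makes it possible to check the URL and header without touching the network.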


     

    This table displays all metadata directly associated to this object as RDF triples.

    159 TRIPLES      21 PREDICATES      89 URIs      77 LITERALS      4 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1007/s00778-021-00726-w schema:about anzsrc-for:08
    2 anzsrc-for:0801
    3 schema:author N5579400e06b64633954ae493acee2a14
    4 schema:citation sg:pub.10.1007/978-3-319-16462-5_6
    5 sg:pub.10.1007/978-3-642-17819-1_27
    6 sg:pub.10.1007/s00778-017-0486-1
    7 sg:pub.10.1038/sdata.2016.18
    8 schema:datePublished 2022-01-31
    9 schema:datePublishedReg 2022-01-31
    10 schema:description Machine learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often brittle with respect to their input data, which leads to concerns about their correctness, reliability, and fairness. In this paper, we describe mlinspect, a library that helps diagnose and mitigate technical bias that may arise during preprocessing steps in an ML pipeline. We refer to these problems collectively as data distribution bugs. The key idea is to extract a directed acyclic graph representation of the dataflow from a preprocessing pipeline and to use this representation to automatically instrument the code with predefined inspections. These inspections are based on a lightweight annotation propagation approach to propagate metadata such as lineage information from operator to operator. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation. We discuss the design and implementation of the mlinspect library and give a comprehensive end-to-end example that illustrates its functionality.
    11 schema:genre article
    12 schema:isAccessibleForFree false
    13 schema:isPartOf sg:journal.1044889
    14 schema:keywords ML applications
    15 ML pipeline
    16 abstraction
    17 acyclic graph representation
    18 applications
    19 approach
    20 attention
    21 bias
    22 bugs
    23 code
    24 code instrumentation
    25 comprehensive end
    26 concern
    27 contrast
    28 correctness
    29 data
    30 data distribution
    31 data science libraries
    32 dataflow
    33 decisions
    34 declarative abstraction
    35 design
    36 distribution
    37 end
    38 end examples
    39 example
    40 fairness
    41 functionality
    42 graph representation
    43 idea
    44 impactful decisions
    45 implementation
    46 information
    47 input data
    48 inspection
    49 instrumentation
    50 key idea
    51 learning
    52 library
    53 lineage information
    54 machine
    55 machine learning
    56 makers
    57 medium
    58 metadata
    59 operates
    60 operators
    61 paper
    62 pipeline
    63 policy makers
    64 problem
    65 propagation approach
    66 reliability
    67 representation
    68 respect
    69 risk
    70 sciences libraries
    71 scientists
    72 step
    73 technical bias
    74 use
    75 widespread use
    76 work
    77 schema:name Data distribution debugging in machine learning pipelines
    78 schema:pagination 1-24
    79 schema:productId Nb3910906bdbf4f4480f7605e21b4b4bd
    80 Nbffcac38ad234e7492ee552f15d0af1a
    81 schema:sameAs https://app.dimensions.ai/details/publication/pub.1145112127
    82 https://doi.org/10.1007/s00778-021-00726-w
    83 schema:sdDatePublished 2022-08-04T17:11
    84 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    85 schema:sdPublisher N85a6fa63d8cd4637872c4a2acc6c1882
    86 schema:url https://doi.org/10.1007/s00778-021-00726-w
    87 sgo:license sg:explorer/license/
    88 sgo:sdDataset articles
    89 rdf:type schema:ScholarlyArticle
    90 N0332b8b63e7d476f82d66d32c7ee0dc9 schema:affiliation grid-institutes:grid.7177.6
    91 schema:familyName Grafberger
    92 schema:givenName Stefan
    93 rdf:type schema:Person
    94 N5579400e06b64633954ae493acee2a14 rdf:first N0332b8b63e7d476f82d66d32c7ee0dc9
    95 rdf:rest N5e8b60814f884d9b81964f17268ffbb2
    96 N5e8b60814f884d9b81964f17268ffbb2 rdf:first sg:person.012677400323.64
    97 rdf:rest Ncbecf8a8591c4ef4a32eb90ec94ceb7f
    98 N7246cf163fcc4df5b8321620d48b7afa rdf:first sg:person.014235450664.80
    99 rdf:rest rdf:nil
    100 N85a6fa63d8cd4637872c4a2acc6c1882 schema:name Springer Nature - SN SciGraph project
    101 rdf:type schema:Organization
    102 Nb3910906bdbf4f4480f7605e21b4b4bd schema:name doi
    103 schema:value 10.1007/s00778-021-00726-w
    104 rdf:type schema:PropertyValue
    105 Nbffcac38ad234e7492ee552f15d0af1a schema:name dimensions_id
    106 schema:value pub.1145112127
    107 rdf:type schema:PropertyValue
    108 Ncbecf8a8591c4ef4a32eb90ec94ceb7f rdf:first sg:person.0615021500.86
    109 rdf:rest N7246cf163fcc4df5b8321620d48b7afa
    110 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    111 schema:name Information and Computing Sciences
    112 rdf:type schema:DefinedTerm
    113 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
    114 schema:name Artificial Intelligence and Image Processing
    115 rdf:type schema:DefinedTerm
    116 sg:grant.7923555 http://pending.schema.org/fundedItem sg:pub.10.1007/s00778-021-00726-w
    117 rdf:type schema:MonetaryGrant
    118 sg:grant.8566338 http://pending.schema.org/fundedItem sg:pub.10.1007/s00778-021-00726-w
    119 rdf:type schema:MonetaryGrant
    120 sg:grant.8567278 http://pending.schema.org/fundedItem sg:pub.10.1007/s00778-021-00726-w
    121 rdf:type schema:MonetaryGrant
    122 sg:journal.1044889 schema:issn 0949-877X
    123 1066-8888
    124 schema:name The VLDB Journal
    125 schema:publisher Springer Nature
    126 rdf:type schema:Periodical
    127 sg:person.012677400323.64 schema:affiliation grid-institutes:grid.7177.6
    128 schema:familyName Groth
    129 schema:givenName Paul
    130 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012677400323.64
    131 rdf:type schema:Person
    132 sg:person.014235450664.80 schema:affiliation grid-institutes:grid.7177.6
    133 schema:familyName Schelter
    134 schema:givenName Sebastian
    135 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014235450664.80
    136 rdf:type schema:Person
    137 sg:person.0615021500.86 schema:affiliation grid-institutes:grid.137628.9
    138 schema:familyName Stoyanovich
    139 schema:givenName Julia
    140 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0615021500.86
    141 rdf:type schema:Person
    142 sg:pub.10.1007/978-3-319-16462-5_6 schema:sameAs https://app.dimensions.ai/details/publication/pub.1015501966
    143 https://doi.org/10.1007/978-3-319-16462-5_6
    144 rdf:type schema:CreativeWork
    145 sg:pub.10.1007/978-3-642-17819-1_27 schema:sameAs https://app.dimensions.ai/details/publication/pub.1021418233
    146 https://doi.org/10.1007/978-3-642-17819-1_27
    147 rdf:type schema:CreativeWork
    148 sg:pub.10.1007/s00778-017-0486-1 schema:sameAs https://app.dimensions.ai/details/publication/pub.1092243991
    149 https://doi.org/10.1007/s00778-017-0486-1
    150 rdf:type schema:CreativeWork
    151 sg:pub.10.1038/sdata.2016.18 schema:sameAs https://app.dimensions.ai/details/publication/pub.1005603549
    152 https://doi.org/10.1038/sdata.2016.18
    153 rdf:type schema:CreativeWork
    154 grid-institutes:grid.137628.9 schema:alternateName New York University, New York, USA
    155 schema:name New York University, New York, USA
    156 rdf:type schema:Organization
    157 grid-institutes:grid.7177.6 schema:alternateName University of Amsterdam, Amsterdam, Netherlands
    158 schema:name University of Amsterdam, Amsterdam, Netherlands
    159 rdf:type schema:Organization
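The author order is stored in the table as an RDF list: `schema:author` points at a blank node, and each blank node carries an `rdf:first` (the item) and an `rdf:rest` (the next node, or `rdf:nil` at the end). A minimal sketch of walking that chain is shown below. The triples are copied from the rows above, on the assumption that a row without a subject shares the subject of the row before it; the helper `rdf_list` and the tuple representation are illustrative, not part of any SciGraph tooling.

```python
RDF_NIL = "rdf:nil"

# (subject, predicate, object) triples copied from the table above.
triples = [
    ("N5579400e06b64633954ae493acee2a14", "rdf:first", "N0332b8b63e7d476f82d66d32c7ee0dc9"),
    ("N5579400e06b64633954ae493acee2a14", "rdf:rest", "N5e8b60814f884d9b81964f17268ffbb2"),
    ("N5e8b60814f884d9b81964f17268ffbb2", "rdf:first", "sg:person.012677400323.64"),
    ("N5e8b60814f884d9b81964f17268ffbb2", "rdf:rest", "Ncbecf8a8591c4ef4a32eb90ec94ceb7f"),
    ("Ncbecf8a8591c4ef4a32eb90ec94ceb7f", "rdf:first", "sg:person.0615021500.86"),
    ("Ncbecf8a8591c4ef4a32eb90ec94ceb7f", "rdf:rest", "N7246cf163fcc4df5b8321620d48b7afa"),
    ("N7246cf163fcc4df5b8321620d48b7afa", "rdf:first", "sg:person.014235450664.80"),
    ("N7246cf163fcc4df5b8321620d48b7afa", "rdf:rest", RDF_NIL),
    ("N0332b8b63e7d476f82d66d32c7ee0dc9", "schema:familyName", "Grafberger"),
    ("sg:person.012677400323.64", "schema:familyName", "Groth"),
    ("sg:person.0615021500.86", "schema:familyName", "Stoyanovich"),
    ("sg:person.014235450664.80", "schema:familyName", "Schelter"),
]

def rdf_list(head, triples):
    """Follow rdf:first / rdf:rest from `head` until rdf:nil and return the items in order."""
    first = {s: o for s, p, o in triples if p == "rdf:first"}
    rest = {s: o for s, p, o in triples if p == "rdf:rest"}
    items = []
    node = head
    while node != RDF_NIL:
        items.append(first[node])
        node = rest[node]
    return items

# The head node is the object of the schema:author triple in the table.
author_nodes = rdf_list("N5579400e06b64633954ae493acee2a14", triples)
family_name = {s: o for s, p, o in triples if p == "schema:familyName"}
ordered_names = [family_name[node] for node in author_nodes]
print(ordered_names)
```

The recovered order matches the author list given at the top of the record.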
     



