Ontology type: schema:ScholarlyArticle
2022-01-31
AUTHORS: Stefan Grafberger, Paul Groth, Julia Stoyanovich, Sebastian Schelter
ABSTRACT: Machine learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often brittle with respect to their input data, which leads to concerns about their correctness, reliability, and fairness. In this paper, we describe mlinspect, a library that helps diagnose and mitigate technical bias that may arise during preprocessing steps in an ML pipeline. We refer to these problems collectively as data distribution bugs. The key idea is to extract a directed acyclic graph representation of the dataflow from a preprocessing pipeline and to use this representation to automatically instrument the code with predefined inspections. These inspections are based on a lightweight annotation propagation approach to propagate metadata such as lineage information from operator to operator. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation. We discuss the design and implementation of the mlinspect library and give a comprehensive end-to-end example that illustrates its functionality.
PAGES: 1-24
http://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w
DOI: http://dx.doi.org/10.1007/s00778-021-00726-w
DIMENSIONS: https://app.dimensions.ai/details/publication/pub.1145112127
JSON-LD is the canonical representation for SciGraph data.
[
{
"@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json",
"about": [
{
"id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08",
"inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/",
"name": "Information and Computing Sciences",
"type": "DefinedTerm"
},
{
"id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801",
"inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/",
"name": "Artificial Intelligence and Image Processing",
"type": "DefinedTerm"
}
],
"author": [
{
"affiliation": {
"alternateName": "University of Amsterdam, Amsterdam, Netherlands",
"id": "http://www.grid.ac/institutes/grid.7177.6",
"name": [
"University of Amsterdam, Amsterdam, Netherlands"
],
"type": "Organization"
},
"familyName": "Grafberger",
"givenName": "Stefan",
"type": "Person"
},
{
"affiliation": {
"alternateName": "University of Amsterdam, Amsterdam, Netherlands",
"id": "http://www.grid.ac/institutes/grid.7177.6",
"name": [
"University of Amsterdam, Amsterdam, Netherlands"
],
"type": "Organization"
},
"familyName": "Groth",
"givenName": "Paul",
"id": "sg:person.012677400323.64",
"sameAs": [
"https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012677400323.64"
],
"type": "Person"
},
{
"affiliation": {
"alternateName": "New York University, New York, USA",
"id": "http://www.grid.ac/institutes/grid.137628.9",
"name": [
"New York University, New York, USA"
],
"type": "Organization"
},
"familyName": "Stoyanovich",
"givenName": "Julia",
"id": "sg:person.0615021500.86",
"sameAs": [
"https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0615021500.86"
],
"type": "Person"
},
{
"affiliation": {
"alternateName": "University of Amsterdam, Amsterdam, Netherlands",
"id": "http://www.grid.ac/institutes/grid.7177.6",
"name": [
"University of Amsterdam, Amsterdam, Netherlands"
],
"type": "Organization"
},
"familyName": "Schelter",
"givenName": "Sebastian",
"id": "sg:person.014235450664.80",
"sameAs": [
"https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014235450664.80"
],
"type": "Person"
}
],
"citation": [
{
"id": "sg:pub.10.1007/s00778-017-0486-1",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1092243991",
"https://doi.org/10.1007/s00778-017-0486-1"
],
"type": "CreativeWork"
},
{
"id": "sg:pub.10.1038/sdata.2016.18",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1005603549",
"https://doi.org/10.1038/sdata.2016.18"
],
"type": "CreativeWork"
},
{
"id": "sg:pub.10.1007/978-3-319-16462-5_6",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1015501966",
"https://doi.org/10.1007/978-3-319-16462-5_6"
],
"type": "CreativeWork"
},
{
"id": "sg:pub.10.1007/978-3-642-17819-1_27",
"sameAs": [
"https://app.dimensions.ai/details/publication/pub.1021418233",
"https://doi.org/10.1007/978-3-642-17819-1_27"
],
"type": "CreativeWork"
}
],
"datePublished": "2022-01-31",
"datePublishedReg": "2022-01-31",
"description": "Machine learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often brittle with respect to their input data, which leads to concerns about their correctness, reliability, and fairness. In this paper, we describe mlinspect, a library that helps diagnose and mitigate technical bias that may arise during preprocessing steps in an ML pipeline. We refer to these problems collectively as data distribution bugs. The key idea is to extract a directed acyclic graph representation of the dataflow from a preprocessing pipeline and to use this representation to automatically instrument the code with predefined inspections. These inspections are based on a lightweight annotation propagation approach to propagate metadata such as lineage information from operator to operator. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation. We discuss the design and implementation of the mlinspect library and give a comprehensive end-to-end example that illustrates its functionality.",
"genre": "article",
"id": "sg:pub.10.1007/s00778-021-00726-w",
"inLanguage": "en",
"isAccessibleForFree": false,
"isFundedItemOf": [
{
"id": "sg:grant.7923555",
"type": "MonetaryGrant"
},
{
"id": "sg:grant.8567278",
"type": "MonetaryGrant"
},
{
"id": "sg:grant.8566338",
"type": "MonetaryGrant"
}
],
"isPartOf": [
{
"id": "sg:journal.1044889",
"issn": [
"1066-8888",
"0949-877X"
],
"name": "The VLDB Journal",
"publisher": "Springer Nature",
"type": "Periodical"
}
],
"keywords": [
"machine learning",
"data science libraries",
"acyclic graph representation",
"declarative abstraction",
"code instrumentation",
"ML pipeline",
"ML applications",
"data distribution",
"graph representation",
"key idea",
"input data",
"comprehensive end",
"lineage information",
"end examples",
"propagation approach",
"sciences libraries",
"pipeline",
"library",
"dataflow",
"metadata",
"representation",
"operators",
"machine",
"inspection",
"correctness",
"abstraction",
"bugs",
"learning",
"fairness",
"widespread use",
"code",
"implementation",
"functionality",
"information",
"impactful decisions",
"technical bias",
"applications",
"reliability",
"decisions",
"idea",
"design",
"example",
"work",
"operates",
"step",
"data",
"makers",
"scientists",
"attention",
"end",
"use",
"concern",
"instrumentation",
"respect",
"policy makers",
"medium",
"distribution",
"bias",
"contrast",
"risk",
"paper",
"problem",
"approach"
],
"name": "Data distribution debugging in machine learning pipelines",
"pagination": "1-24",
"productId": [
{
"name": "dimensions_id",
"type": "PropertyValue",
"value": [
"pub.1145112127"
]
},
{
"name": "doi",
"type": "PropertyValue",
"value": [
"10.1007/s00778-021-00726-w"
]
}
],
"sameAs": [
"https://doi.org/10.1007/s00778-021-00726-w",
"https://app.dimensions.ai/details/publication/pub.1145112127"
],
"sdDataset": "articles",
"sdDatePublished": "2022-06-01T22:26",
"sdLicense": "https://scigraph.springernature.com/explorer/license/",
"sdPublisher": {
"name": "Springer Nature - SN SciGraph project",
"type": "Organization"
},
"sdSource": "s3://com-springernature-scigraph/baseset/20220601/entities/gbq_results/article/article_940.jsonl",
"type": "ScholarlyArticle",
"url": "https://doi.org/10.1007/s00778-021-00726-w"
}
]
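The record is a JSON array holding a single article object, so its fields can be read with any JSON parser. The following is a minimal sketch in Python, assuming a trimmed copy of the record is available as a string; the `summarize` helper and the `RECORD_JSON` constant are hypothetical names introduced for illustration and are not part of SciGraph.

```python
import json

# Trimmed copy of the record above; only the fields used below are kept.
RECORD_JSON = """
[
  {
    "author": [
      {"familyName": "Grafberger", "givenName": "Stefan", "type": "Person"},
      {"familyName": "Groth", "givenName": "Paul", "type": "Person"},
      {"familyName": "Stoyanovich", "givenName": "Julia", "type": "Person"},
      {"familyName": "Schelter", "givenName": "Sebastian", "type": "Person"}
    ],
    "name": "Data distribution debugging in machine learning pipelines",
    "productId": [
      {"name": "dimensions_id", "type": "PropertyValue", "value": ["pub.1145112127"]},
      {"name": "doi", "type": "PropertyValue", "value": ["10.1007/s00778-021-00726-w"]}
    ],
    "type": "ScholarlyArticle"
  }
]
"""


def summarize(record_json):
    """Return title, author names, and DOI from a SciGraph-style JSON-LD array."""
    article = json.loads(record_json)[0]  # the array holds one article object
    authors = [f"{a['givenName']} {a['familyName']}" for a in article["author"]]
    # productId is a list of name/value pairs; pick the entry named "doi".
    doi = next(p["value"][0] for p in article["productId"] if p["name"] == "doi")
    return {"title": article["name"], "authors": authors, "doi": doi}


summary = summarize(RECORD_JSON)
print(summary["title"])
print(", ".join(summary["authors"]))
print(summary["doi"])
```

Plain `json` parsing is enough here because the sketch only reads literal fields; resolving the `@context` into full RDF terms would need a JSON-LD processor, which is outside the scope of this sketch.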
JSON-LD is a popular format for linked data which is fully compatible with JSON.
curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w'
N-Triples is a line-based linked data format ideal for batch operations.
curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w'
Turtle is a human-readable linked data format.
curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w'
RDF/XML is a standard XML format for linked data.
curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w'
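The four curl commands differ only in the `Accept` header, which is ordinary HTTP content negotiation. A minimal Python sketch of the same requests follows, assuming the endpoint still responds as the curl commands describe; `build_request`, `fetch`, and the short format keys are hypothetical names chosen for this example.

```python
import urllib.request

RECORD_URL = "https://scigraph.springernature.com/pub.10.1007/s00778-021-00726-w"

# Media types taken from the curl commands above; the keys are illustrative.
ACCEPT_TYPES = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "xml": "application/rdf+xml",
}


def build_request(url, fmt):
    """Build a GET request that asks for the given RDF serialization."""
    return urllib.request.Request(url, headers={"Accept": ACCEPT_TYPES[fmt]})


def fetch(url, fmt, timeout=30):
    """Send the request and return the response body as text (needs network)."""
    with urllib.request.urlopen(build_request(url, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Show the header each format would send, without touching the network.
for fmt in ACCEPT_TYPES:
    req = build_request(RECORD_URL, fmt)
    print(fmt, "->", req.get_header("Accept"))
```

Calling `fetch(RECORD_URL, "turtle")` would perform the actual download; it is kept separate from `build_request` so that the header logic can be checked offline.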
This table displays all metadata directly associated with this object as RDF triples.
160 TRIPLES
22 PREDICATES
90 URIs
78 LITERALS
4 BLANK NODES