A schema aware ETL workflow generator


Ontology type: schema:ScholarlyArticle     


Article Info

DATE

2012-05-03

AUTHORS

Naiqiao Du, Xiaojun Ye, Jianmin Wang

ABSTRACT

Extract, Transform and Load (ETL) processes organized as workflows play an important role in data warehousing. Because ETL workflows are usually complex, various ETL facilities have been developed to address their control-flow process modeling and execution control. To evaluate the quality of ETL facilities, synthetic ETL workflow test cases covering both control-flow and data-flow aspects are needed to check ETL facility functionality at construction time and to validate the correctness and performance of ETL facilities at run time. Although approaches for generating synthetic workflows and data set test cases exist in the literature, little work considers both aspects at the same time, specifically for ETL workflow generators. To address this issue, this paper proposes a schema aware ETL workflow generator with which users can characterize their ETL workflows by various parameters and obtain ETL workflow test cases comprising a control flow of ETL activities, compliant schemas and associated recordsets. The generator works in three steps. First, given the types and ratios of individual activities and a specification of their connection characteristics, the generator produces ETL activities and forms an ETL skeleton that determines how the generated activities cooperate with each other. Second, given a specification of schema transformation characteristics, e.g. ranges for the numbers of attributes, the generator resolves attribute dependencies and refines input/output schemas with compliant attributes and their data types. In the last step, recordsets are generated following cardinality specifications. In the experiments, ETL workflows in specific patterns are produced to show the ability of the generator, and thousands of ETL workflow test cases are generated in seconds to verify its usability.
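
The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Activity` class, the function names, the parameter shapes (`type_ratio`, `attr_range`, `cardinality`), the activity kinds and the rule that a downstream activity keeps a random subset of its upstream attributes are all assumptions made only to show the flow from skeleton to schemas to recordsets.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Activity:
    name: str
    kind: str                                    # label only, e.g. "filter"
    inputs: list = field(default_factory=list)   # names of upstream activities
    schema: list = field(default_factory=list)   # output schema: (attribute, type) pairs

def build_skeleton(n, type_ratio, rng):
    """Step 1 (assumed form): draw activity kinds by ratio and connect them."""
    kinds = rng.choices(list(type_ratio), weights=list(type_ratio.values()), k=n)
    activities = []
    for i, kind in enumerate(kinds):
        act = Activity(name=f"a{i}", kind=kind)
        if activities:
            # Edges only point to earlier activities, so the skeleton is acyclic.
            act.inputs.append(rng.choice(activities).name)
        activities.append(act)
    return activities

def assign_schemas(activities, attr_range, rng):
    """Step 2 (assumed form): give sources fresh attributes, derive the rest."""
    by_name = {a.name: a for a in activities}
    types = ["int", "string", "date"]
    for act in activities:  # creation order guarantees upstream is resolved first
        if not act.inputs:
            k = rng.randint(*attr_range)  # attr_range lower bound must be >= 1
            act.schema = [(f"{act.name}_c{j}", rng.choice(types)) for j in range(k)]
        else:
            upstream = by_name[act.inputs[0]].schema
            k = rng.randint(1, len(upstream))
            act.schema = rng.sample(upstream, k)  # attribute dependency on upstream
    return activities

def generate_records(activity, cardinality, rng):
    """Step 3 (assumed form): emit `cardinality` rows matching the schema."""
    make = {
        "int": lambda: rng.randint(0, 999),
        "string": lambda: f"s{rng.randint(0, 999)}",
        "date": lambda: f"2012-05-{rng.randint(1, 28):02d}",
    }
    return [{attr: make[typ]() for attr, typ in activity.schema}
            for _ in range(cardinality)]

rng = random.Random(0)
workflow = assign_schemas(
    build_skeleton(5, {"filter": 2, "projection": 1}, rng), (2, 4), rng)
rows = generate_records(workflow[-1], 3, rng)
```

Seeding the random generator makes a generated test case reproducible, which is usually what one wants from synthetic test data.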

PAGES

453-471

References to SciGraph publications

  • 2008-05-09. An approach for incorporating quality-based cost–benefit analysis in data warehouse design in INFORMATION SYSTEMS FRONTIERS
  • 2010-10-30. A mixed transaction processing and operational reporting benchmark in INFORMATION SYSTEMS FRONTIERS
  • 2009. Benchmarking ETL Workflows in PERFORMANCE EVALUATION AND BENCHMARKING
  • 2003-05-27. A Top-Down Petri Net-Based Approach for Dynamic Workflow Modeling in BUSINESS PROCESS MANAGEMENT
  • 2009. Cost-Based Vectorization of Instance-Based Integration Processes in ADVANCES IN DATABASES AND INFORMATION SYSTEMS

Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/s10796-012-9352-2

    DOI

    http://dx.doi.org/10.1007/s10796-012-9352-2

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1000288683



    JSON-LD is the canonical representation for SciGraph data.


    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0803", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Computer Software", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "School of Software, Tsinghua University, Beijing, China", 
              "id": "http://www.grid.ac/institutes/grid.12527.33", 
              "name": [
                "Department of Computer Science and Technology, Tsinghua University, Beijing, China", 
                "School of Software, Tsinghua University, Beijing, China"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Du", 
            "givenName": "Naiqiao", 
            "id": "sg:person.011441724007.03", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011441724007.03"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "Tsinghua National Laboratory for Information Science and Technology (TNList), Beijing, China", 
              "id": "http://www.grid.ac/institutes/grid.12527.33", 
              "name": [
                "School of Software, Tsinghua University, Beijing, China", 
                "Key Laboratory for Information System Security, Ministry of Education, Beijing, China", 
                "Tsinghua National Laboratory for Information Science and Technology (TNList), Beijing, China"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Ye", 
            "givenName": "Xiaojun", 
            "id": "sg:person.013540135713.71", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013540135713.71"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "Tsinghua National Laboratory for Information Science and Technology (TNList), Beijing, China", 
              "id": "http://www.grid.ac/institutes/grid.12527.33", 
              "name": [
                "School of Software, Tsinghua University, Beijing, China", 
                "Key Laboratory for Information System Security, Ministry of Education, Beijing, China", 
                "Tsinghua National Laboratory for Information Science and Technology (TNList), Beijing, China"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Wang", 
            "givenName": "Jianmin", 
            "id": "sg:person.012303351315.43", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012303351315.43"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "sg:pub.10.1007/s10796-010-9283-8", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1014339706", 
              "https://doi.org/10.1007/s10796-010-9283-8"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/s10796-008-9077-4", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1016064927", 
              "https://doi.org/10.1007/s10796-008-9077-4"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/3-540-44895-0_23", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1037745899", 
              "https://doi.org/10.1007/3-540-44895-0_23"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-642-10424-4_15", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1030915466", 
              "https://doi.org/10.1007/978-3-642-10424-4_15"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-642-03973-7_19", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1038061018", 
              "https://doi.org/10.1007/978-3-642-03973-7_19"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2012-05-03", 
        "datePublishedReg": "2012-05-03", 
        "description": "Extract, Transform and Load (ETL) processes organized as workflows play an important role in data warehousing. As ETL workflows are usually complex, various ETL facilities have been developed to address their control-flow process modeling and execution control. To evaluate the quality of ETL facilities, Synthetic ETL workflow test cases, consisting of control-flow and data-flow aspects are needed to check ETL facility functionalities at construction time and to validate the correctness and performance of ETL facilities at run time. Although there are some synthetic workflow and data set test case generation approaches existed in literatures, little work is done to consider both aspects at the same time specifically for ETL workflow generators. To address this issue, this paper proposes a schema aware ETL workflow generator with which users can characterize their ETL workflows by various parameters and get ETL workflow test cases with control-flow of ETL activities, complied schemas and associated recordsets. Our generator consists of three steps. First, with type and ratio of individual activities and their connection characteristic parameter specification, the generator will produce ETL activities and form ETL skeleton which determine how generated activities are cooperated with each other. Second, with schema transformation characteristic parameter specification, e.g. ranges of numbers of attributes, the generator will resolve attribute dependencies and refine input/output schemas with complied attributes and their data types. In the last step, recordsets are generated following cardinality specifications. ETL workflows in specific patterns are produced in the experiment in order to show the ability of our generator. Also experiments to generate thousands of ETL workflow test cases in seconds have been done to verify the usability of the generator.", 
        "genre": "article", 
        "id": "sg:pub.10.1007/s10796-012-9352-2", 
        "inLanguage": "en", 
        "isAccessibleForFree": false, 
        "isPartOf": [
          {
            "id": "sg:journal.1136609", 
            "issn": [
              "1387-3326", 
              "1572-9419"
            ], 
            "name": "Information Systems Frontiers", 
            "publisher": "Springer Nature", 
            "type": "Periodical"
          }, 
          {
            "issueNumber": "3", 
            "type": "PublicationIssue"
          }, 
          {
            "type": "PublicationVolume", 
            "volumeNumber": "16"
          }
        ], 
        "keywords": [
          "test cases", 
          "construction time", 
          "generator", 
          "load process", 
          "process modeling", 
          "facilities", 
          "execution control", 
          "modeling", 
          "parameter specification", 
          "specification", 
          "experiments", 
          "generation approach", 
          "performance", 
          "same time", 
          "transform", 
          "step", 
          "last step", 
          "parameters", 
          "range of numbers", 
          "process", 
          "time", 
          "functionality", 
          "run time", 
          "ratio", 
          "work", 
          "range", 
          "order", 
          "little work", 
          "types", 
          "dependency", 
          "facility functionality", 
          "seconds", 
          "workflow", 
          "correctness", 
          "important role", 
          "quality", 
          "cases", 
          "approach", 
          "workflow generator", 
          "control", 
          "aspects", 
          "attributes", 
          "issues", 
          "ETL activities", 
          "usability", 
          "ability", 
          "number", 
          "skeleton", 
          "thousands", 
          "warehousing", 
          "data types", 
          "literature", 
          "users", 
          "patterns", 
          "test case generation approach", 
          "activity", 
          "role", 
          "data warehousing", 
          "attribute dependencies", 
          "specific patterns", 
          "schema", 
          "individual activities", 
          "ETL workflows", 
          "paper", 
          "data-flow aspect", 
          "synthetic workflows", 
          "output schema", 
          "ETL facilities", 
          "control-flow process modeling", 
          "Synthetic ETL workflow test cases", 
          "ETL workflow test cases", 
          "workflow test cases", 
          "ETL facility functionalities", 
          "data set test case generation approaches", 
          "set test case generation approaches", 
          "case generation approaches", 
          "ETL workflow generator", 
          "schema aware ETL workflow generator", 
          "aware ETL workflow generator", 
          "recordsets", 
          "connection characteristic parameter specification", 
          "characteristic parameter specification", 
          "form ETL skeleton", 
          "ETL skeleton", 
          "schema transformation characteristic parameter specification", 
          "transformation characteristic parameter specification", 
          "input/output schemas", 
          "cardinality specifications"
        ], 
        "name": "A schema aware ETL workflow generator", 
        "pagination": "453-471", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1000288683"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/s10796-012-9352-2"
            ]
          }
        ], 
        "sameAs": [
          "https://doi.org/10.1007/s10796-012-9352-2", 
          "https://app.dimensions.ai/details/publication/pub.1000288683"
        ], 
        "sdDataset": "articles", 
        "sdDatePublished": "2022-01-01T18:27", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-springernature-scigraph/baseset/20220101/entities/gbq_results/article/article_580.jsonl", 
        "type": "ScholarlyArticle", 
        "url": "https://doi.org/10.1007/s10796-012-9352-2"
      }
    ]
     


    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/s10796-012-9352-2'
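
Because JSON-LD is valid JSON, a response like the record above can be read with an ordinary JSON parser. The sketch below is a hedged illustration using only Python's standard `json` module on a shortened copy of the record. It treats the data as plain JSON and performs no JSON-LD context expansion, and the `summarize` helper is a hypothetical name, not part of any SciGraph tooling.

```python
import json

# Shortened copy of the JSON-LD record shown above (fields omitted for brevity).
RECORD = """
[
  {
    "name": "A schema aware ETL workflow generator",
    "pagination": "453-471",
    "author": [
      {"familyName": "Du", "givenName": "Naiqiao", "type": "Person"},
      {"familyName": "Ye", "givenName": "Xiaojun", "type": "Person"},
      {"familyName": "Wang", "givenName": "Jianmin", "type": "Person"}
    ],
    "isPartOf": [
      {"id": "sg:journal.1136609", "name": "Information Systems Frontiers",
       "type": "Periodical"},
      {"issueNumber": "3", "type": "PublicationIssue"},
      {"type": "PublicationVolume", "volumeNumber": "16"}
    ]
  }
]
"""

def summarize(record_json):
    """Pull citation-style fields out of a SciGraph-like JSON-LD array."""
    article = json.loads(record_json)[0]
    # Index the isPartOf entries by their "type" so each can be looked up directly.
    parts = {p["type"]: p for p in article.get("isPartOf", [])}
    return {
        "title": article["name"],
        "authors": [f'{a["givenName"]} {a["familyName"]}'
                    for a in article.get("author", [])],
        "journal": parts.get("Periodical", {}).get("name"),
        "volume": parts.get("PublicationVolume", {}).get("volumeNumber"),
        "issue": parts.get("PublicationIssue", {}).get("issueNumber"),
        "pages": article.get("pagination"),
    }

summary = summarize(RECORD)
```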

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/s10796-012-9352-2'
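
The line-based layout is what makes N-Triples convenient for batch work: each line is one complete triple, so a file can be streamed and counted without loading a graph. The sketch below is a simplified splitter, not a spec-complete N-Triples parser, and the sample lines are assumptions about how this record might serialize (in particular the expanded `http://schema.org/` predicate IRIs), not output captured from the service.

```python
from collections import Counter

# Hypothetical sample lines; the exact IRIs returned by the service may differ.
SUBJECT = "<http://scigraph.springernature.com/pub.10.1007/s10796-012-9352-2>"
SAMPLE = [
    SUBJECT + ' <http://schema.org/pagination> "453-471" .',
    SUBJECT + ' <http://schema.org/keywords> "ETL workflows" .',
    SUBJECT + ' <http://schema.org/keywords> "test cases" .',
    "# comment lines and blank lines are skipped",
    "",
]

def split_triple(line):
    """Split one N-Triples line into (subject, predicate, object) strings.

    Subject and predicate never contain whitespace, so splitting on the
    first two whitespace runs is enough; the object keeps its own spaces.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    subject, predicate, rest = line.split(None, 2)
    obj = rest.rstrip()
    if obj.endswith("."):          # drop the statement terminator
        obj = obj[:-1].rstrip()
    return subject, predicate, obj

def count_predicates(lines):
    """Stream over lines and count how often each predicate occurs."""
    counts = Counter()
    for line in lines:
        triple = split_triple(line)
        if triple is not None:
            counts[triple[1]] += 1
    return counts

predicate_counts = count_predicates(SAMPLE)
```

For anything beyond counting or filtering, a full RDF library is the safer choice, since this splitter does not validate IRIs, escapes or datatype suffixes.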

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/s10796-012-9352-2'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/s10796-012-9352-2'


     

    This table displays all metadata directly associated with this object as RDF triples.

    184 TRIPLES      22 PREDICATES      118 URIs      105 LITERALS      6 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1007/s10796-012-9352-2 schema:about anzsrc-for:08
    2 anzsrc-for:0803
    3 schema:author Nff03cfeca60442b3b95ab4120849e769
    4 schema:citation sg:pub.10.1007/3-540-44895-0_23
    5 sg:pub.10.1007/978-3-642-03973-7_19
    6 sg:pub.10.1007/978-3-642-10424-4_15
    7 sg:pub.10.1007/s10796-008-9077-4
    8 sg:pub.10.1007/s10796-010-9283-8
    9 schema:datePublished 2012-05-03
    10 schema:datePublishedReg 2012-05-03
    11 schema:description Extract, Transform and Load (ETL) processes organized as workflows play an important role in data warehousing. As ETL workflows are usually complex, various ETL facilities have been developed to address their control-flow process modeling and execution control. To evaluate the quality of ETL facilities, Synthetic ETL workflow test cases, consisting of control-flow and data-flow aspects are needed to check ETL facility functionalities at construction time and to validate the correctness and performance of ETL facilities at run time. Although there are some synthetic workflow and data set test case generation approaches existed in literatures, little work is done to consider both aspects at the same time specifically for ETL workflow generators. To address this issue, this paper proposes a schema aware ETL workflow generator with which users can characterize their ETL workflows by various parameters and get ETL workflow test cases with control-flow of ETL activities, complied schemas and associated recordsets. Our generator consists of three steps. First, with type and ratio of individual activities and their connection characteristic parameter specification, the generator will produce ETL activities and form ETL skeleton which determine how generated activities are cooperated with each other. Second, with schema transformation characteristic parameter specification, e.g. ranges of numbers of attributes, the generator will resolve attribute dependencies and refine input/output schemas with complied attributes and their data types. In the last step, recordsets are generated following cardinality specifications. ETL workflows in specific patterns are produced in the experiment in order to show the ability of our generator. Also experiments to generate thousands of ETL workflow test cases in seconds have been done to verify the usability of the generator.
    12 schema:genre article
    13 schema:inLanguage en
    14 schema:isAccessibleForFree false
    15 schema:isPartOf N0a5f5008d4174fd39a4b75d25d1cc3b2
    16 N61b5957da4ae436a8eb99321131fd144
    17 sg:journal.1136609
    18 schema:keywords ETL activities
    19 ETL facilities
    20 ETL facility functionalities
    21 ETL skeleton
    22 ETL workflow generator
    23 ETL workflow test cases
    24 ETL workflows
    25 Synthetic ETL workflow test cases
    26 ability
    27 activity
    28 approach
    29 aspects
    30 attribute dependencies
    31 attributes
    32 aware ETL workflow generator
    33 cardinality specifications
    34 case generation approaches
    35 cases
    36 characteristic parameter specification
    37 connection characteristic parameter specification
    38 construction time
    39 control
    40 control-flow process modeling
    41 correctness
    42 data set test case generation approaches
    43 data types
    44 data warehousing
    45 data-flow aspect
    46 dependency
    47 execution control
    48 experiments
    49 facilities
    50 facility functionality
    51 form ETL skeleton
    52 functionality
    53 generation approach
    54 generator
    55 important role
    56 individual activities
    57 input/output schemas
    58 issues
    59 last step
    60 literature
    61 little work
    62 load process
    63 modeling
    64 number
    65 order
    66 output schema
    67 paper
    68 parameter specification
    69 parameters
    70 patterns
    71 performance
    72 process
    73 process modeling
    74 quality
    75 range
    76 range of numbers
    77 ratio
    78 recordsets
    79 role
    80 run time
    81 same time
    82 schema
    83 schema aware ETL workflow generator
    84 schema transformation characteristic parameter specification
    85 seconds
    86 set test case generation approaches
    87 skeleton
    88 specific patterns
    89 specification
    90 step
    91 synthetic workflows
    92 test case generation approach
    93 test cases
    94 thousands
    95 time
    96 transform
    97 transformation characteristic parameter specification
    98 types
    99 usability
    100 users
    101 warehousing
    102 work
    103 workflow
    104 workflow generator
    105 workflow test cases
    106 schema:name A schema aware ETL workflow generator
    107 schema:pagination 453-471
    108 schema:productId N33dc8dfe9c0a4934836b6066bd69d939
    109 Nf741acdf0a07496b8df98d353f699504
    110 schema:sameAs https://app.dimensions.ai/details/publication/pub.1000288683
    111 https://doi.org/10.1007/s10796-012-9352-2
    112 schema:sdDatePublished 2022-01-01T18:27
    113 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    114 schema:sdPublisher N52f8919b55224337b317755c2edecf1b
    115 schema:url https://doi.org/10.1007/s10796-012-9352-2
    116 sgo:license sg:explorer/license/
    117 sgo:sdDataset articles
    118 rdf:type schema:ScholarlyArticle
    119 N0a5f5008d4174fd39a4b75d25d1cc3b2 schema:volumeNumber 16
    120 rdf:type schema:PublicationVolume
    121 N33dc8dfe9c0a4934836b6066bd69d939 schema:name doi
    122 schema:value 10.1007/s10796-012-9352-2
    123 rdf:type schema:PropertyValue
    124 N52f8919b55224337b317755c2edecf1b schema:name Springer Nature - SN SciGraph project
    125 rdf:type schema:Organization
    126 N61b5957da4ae436a8eb99321131fd144 schema:issueNumber 3
    127 rdf:type schema:PublicationIssue
    128 N9ba828045c654e8b9757a1f360427509 rdf:first sg:person.012303351315.43
    129 rdf:rest rdf:nil
    130 Nc63fbc7da7c04834a8b764ac91bc368d rdf:first sg:person.013540135713.71
    131 rdf:rest N9ba828045c654e8b9757a1f360427509
    132 Nf741acdf0a07496b8df98d353f699504 schema:name dimensions_id
    133 schema:value pub.1000288683
    134 rdf:type schema:PropertyValue
    135 Nff03cfeca60442b3b95ab4120849e769 rdf:first sg:person.011441724007.03
    136 rdf:rest Nc63fbc7da7c04834a8b764ac91bc368d
    137 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    138 schema:name Information and Computing Sciences
    139 rdf:type schema:DefinedTerm
    140 anzsrc-for:0803 schema:inDefinedTermSet anzsrc-for:
    141 schema:name Computer Software
    142 rdf:type schema:DefinedTerm
    143 sg:journal.1136609 schema:issn 1387-3326
    144 1572-9419
    145 schema:name Information Systems Frontiers
    146 schema:publisher Springer Nature
    147 rdf:type schema:Periodical
    148 sg:person.011441724007.03 schema:affiliation grid-institutes:grid.12527.33
    149 schema:familyName Du
    150 schema:givenName Naiqiao
    151 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011441724007.03
    152 rdf:type schema:Person
    153 sg:person.012303351315.43 schema:affiliation grid-institutes:grid.12527.33
    154 schema:familyName Wang
    155 schema:givenName Jianmin
    156 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.012303351315.43
    157 rdf:type schema:Person
    158 sg:person.013540135713.71 schema:affiliation grid-institutes:grid.12527.33
    159 schema:familyName Ye
    160 schema:givenName Xiaojun
    161 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013540135713.71
    162 rdf:type schema:Person
    163 sg:pub.10.1007/3-540-44895-0_23 schema:sameAs https://app.dimensions.ai/details/publication/pub.1037745899
    164 https://doi.org/10.1007/3-540-44895-0_23
    165 rdf:type schema:CreativeWork
    166 sg:pub.10.1007/978-3-642-03973-7_19 schema:sameAs https://app.dimensions.ai/details/publication/pub.1038061018
    167 https://doi.org/10.1007/978-3-642-03973-7_19
    168 rdf:type schema:CreativeWork
    169 sg:pub.10.1007/978-3-642-10424-4_15 schema:sameAs https://app.dimensions.ai/details/publication/pub.1030915466
    170 https://doi.org/10.1007/978-3-642-10424-4_15
    171 rdf:type schema:CreativeWork
    172 sg:pub.10.1007/s10796-008-9077-4 schema:sameAs https://app.dimensions.ai/details/publication/pub.1016064927
    173 https://doi.org/10.1007/s10796-008-9077-4
    174 rdf:type schema:CreativeWork
    175 sg:pub.10.1007/s10796-010-9283-8 schema:sameAs https://app.dimensions.ai/details/publication/pub.1014339706
    176 https://doi.org/10.1007/s10796-010-9283-8
    177 rdf:type schema:CreativeWork
    178 grid-institutes:grid.12527.33 schema:alternateName School of Software, Tsinghua University, Beijing, China
    179 Tsinghua National Laboratory for Information Science and Technology (TNList), Beijing, China
    180 schema:name Department of Computer Science and Technology, Tsinghua University, Beijing, China
    181 Key Laboratory for Information System Security, Ministry of Education, Beijing, China
    182 School of Software, Tsinghua University, Beijing, China
    183 Tsinghua National Laboratory for Information Science and Technology (TNList), Beijing, China
    184 rdf:type schema:Organization
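
Rows 128 to 136 of the table encode the author order as an RDF list: `schema:author` points at a blank node, and each blank node carries an `rdf:first` (the item) and an `rdf:rest` (the next node, or `rdf:nil` at the end). A minimal sketch of walking that chain is given below. The node identifiers and values are copied from the table, while the dictionary layout and the `walk_rdf_list` helper are assumptions made for illustration.

```python
RDF_NIL = "rdf:nil"

# Blank node -> (rdf:first, rdf:rest), copied from rows 128-136 of the table.
NODES = {
    "Nff03cfeca60442b3b95ab4120849e769":
        ("sg:person.011441724007.03", "Nc63fbc7da7c04834a8b764ac91bc368d"),
    "Nc63fbc7da7c04834a8b764ac91bc368d":
        ("sg:person.013540135713.71", "N9ba828045c654e8b9757a1f360427509"),
    "N9ba828045c654e8b9757a1f360427509":
        ("sg:person.012303351315.43", RDF_NIL),
}

# schema:familyName values from rows 149, 154 and 159.
FAMILY_NAMES = {
    "sg:person.011441724007.03": "Du",
    "sg:person.013540135713.71": "Ye",
    "sg:person.012303351315.43": "Wang",
}

def walk_rdf_list(head, nodes):
    """Follow rdf:rest links from `head` and collect each rdf:first in order."""
    items = []
    seen = set()
    while head != RDF_NIL:
        if head in seen:                      # guard against a malformed cycle
            raise ValueError(f"cycle in RDF list at {head}")
        seen.add(head)
        first, rest = nodes[head]
        items.append(first)
        head = rest
    return items

# Row 3: schema:author points at this blank node.
authors = walk_rdf_list("Nff03cfeca60442b3b95ab4120849e769", NODES)
author_names = [FAMILY_NAMES[a] for a in authors]
```

The recovered order matches the author order given at the top of the record.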
     



