Ask Your Neurons: A Deep Learning Approach to Visual Question Answering


Ontology type: schema:ScholarlyArticle      Open Access: True


Article Info

DATE

2017-08-29

AUTHORS

Mateusz Malinowski, Marcus Rohrbach, Mario Fritz

ABSTRACT

We propose a Deep Learning approach to the visual question answering task, where machines answer questions about real-world images. By combining the latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation of this problem. In contrast to previous efforts, we face a multi-modal problem where the language output (answer) is conditioned on visual and natural language inputs (image and question). We evaluate our approaches on DAQUAR as well as the VQA dataset, where we also report various baselines, including an analysis of how much information is contained in the language part alone. To study human consensus, we propose two novel metrics and collect additional answers, which extend the original DAQUAR dataset to DAQUAR-Consensus. Finally, we evaluate a rich set of design choices for how to encode, combine, and decode information in our proposed Deep Learning formulation.

PAGES

110-135

References to SciGraph publications

  • 2016-09-17. Grounding of Textual Phrases in Images by Reconstruction in COMPUTER VISION – ECCV 2016
  • 2015-04-11. ImageNet Large Scale Visual Recognition Challenge in INTERNATIONAL JOURNAL OF COMPUTER VISION
  • 2016-09-17. Segmentation from Natural Language Expressions in COMPUTER VISION – ECCV 2016
  • 2016-09-16. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering in COMPUTER VISION – ECCV 2016
  • 2012. Indoor Segmentation and Support Inference from RGBD Images in COMPUTER VISION – ECCV 2012
  • 2016-09-17. Modeling Context in Referring Expressions in COMPUTER VISION – ECCV 2016
  • 2017-02-06. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations in INTERNATIONAL JOURNAL OF COMPUTER VISION
  • 2014. Microsoft COCO: Common Objects in Context in COMPUTER VISION – ECCV 2014
Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/s11263-017-1038-2

    DOI

    http://dx.doi.org/10.1007/s11263-017-1038-2

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1091379710


    Indexing Status: check whether this publication has been indexed by Scopus and Web of Science using the SN Indexing Status Tool.
    Incoming Citations: browse incoming citations for this publication using opencitations.net.

    JSON-LD is the canonical representation for SciGraph data.

    TIP: You can open this SciGraph record using an external JSON-LD service such as the JSON-LD Playground or the Google Structured Data Testing Tool.

    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Artificial Intelligence and Image Processing", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbr\u00fccken, Germany", 
              "id": "http://www.grid.ac/institutes/grid.419528.3", 
              "name": [
                "Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbr\u00fccken, Germany"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Malinowski", 
            "givenName": "Mateusz", 
            "id": "sg:person.07716544521.15", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.07716544521.15"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "UC Berkeley EECS, Berkeley, CA, USA", 
              "id": "http://www.grid.ac/institutes/grid.47840.3f", 
              "name": [
                "UC Berkeley EECS, Berkeley, CA, USA"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Rohrbach", 
            "givenName": "Marcus", 
            "id": "sg:person.014537716115.47", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014537716115.47"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbr\u00fccken, Germany", 
              "id": "http://www.grid.ac/institutes/grid.419528.3", 
              "name": [
                "Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbr\u00fccken, Germany"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Fritz", 
            "givenName": "Mario", 
            "id": "sg:person.013361072755.17", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013361072755.17"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "sg:pub.10.1007/978-3-319-10602-1_48", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1045321436", 
              "https://doi.org/10.1007/978-3-319-10602-1_48"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/s11263-016-0981-7", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1083549046", 
              "https://doi.org/10.1007/s11263-016-0981-7"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-319-46448-0_49", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1039098005", 
              "https://doi.org/10.1007/978-3-319-46448-0_49"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-319-46448-0_7", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1000181029", 
              "https://doi.org/10.1007/978-3-319-46448-0_7"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/s11263-015-0816-y", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1009767488", 
              "https://doi.org/10.1007/s11263-015-0816-y"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-319-46478-7_28", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1019481205", 
              "https://doi.org/10.1007/978-3-319-46478-7_28"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-642-33715-4_54", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1053469442", 
              "https://doi.org/10.1007/978-3-642-33715-4_54"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-319-46475-6_5", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1042471266", 
              "https://doi.org/10.1007/978-3-319-46475-6_5"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2017-08-29", 
        "datePublishedReg": "2017-08-29", 
        "description": "We propose a Deep Learning approach to the visual question answering task, where machines answer to questions about real-world images. By combining latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation to this problem. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language inputs (image and question). We evaluate our approaches on the DAQUAR as well as the VQA dataset where we also report various baselines, including an analysis how much information is contained in the language part only. To study human consensus, we propose two novel metrics and collect additional answers which extend the original DAQUAR dataset to DAQUAR-Consensus. Finally, we evaluate a rich set of design choices how to encode, combine and decode information in our proposed Deep Learning formulation.", 
        "genre": "article", 
        "id": "sg:pub.10.1007/s11263-017-1038-2", 
        "isAccessibleForFree": true, 
        "isPartOf": [
          {
            "id": "sg:journal.1032807", 
            "issn": [
              "0920-5691", 
              "1573-1405"
            ], 
            "name": "International Journal of Computer Vision", 
            "publisher": "Springer Nature", 
            "type": "Periodical"
          }, 
          {
            "issueNumber": "1-3", 
            "type": "PublicationIssue"
          }, 
          {
            "type": "PublicationVolume", 
            "volumeNumber": "125"
          }
        ], 
        "keywords": [
          "deep learning approach", 
          "learning approach", 
          "natural language input", 
          "deep learning formulation", 
          "natural language processing", 
          "real-world images", 
          "Visual Question Answering", 
          "multi-modal problems", 
          "visual question", 
          "VQA datasets", 
          "image representation", 
          "Question Answering", 
          "language processing", 
          "end formulation", 
          "learning formulation", 
          "language part", 
          "rich set", 
          "design choices", 
          "novel metric", 
          "human consensus", 
          "dataset", 
          "language input", 
          "previous efforts", 
          "additional answers", 
          "Answering", 
          "information", 
          "machine", 
          "latest advances", 
          "language output", 
          "task", 
          "images", 
          "metrics", 
          "processing", 
          "representation", 
          "set", 
          "input", 
          "output", 
          "answers", 
          "advances", 
          "efforts", 
          "end", 
          "formulation", 
          "questions", 
          "part", 
          "choice", 
          "consensus", 
          "analysis", 
          "neurons", 
          "baseline", 
          "contrast", 
          "approach", 
          "problem"
        ], 
        "name": "Ask Your Neurons: A Deep Learning Approach to Visual Question Answering", 
        "pagination": "110-135", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1091379710"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/s11263-017-1038-2"
            ]
          }
        ], 
        "sameAs": [
          "https://doi.org/10.1007/s11263-017-1038-2", 
          "https://app.dimensions.ai/details/publication/pub.1091379710"
        ], 
        "sdDataset": "articles", 
        "sdDatePublished": "2022-12-01T06:36", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-springernature-scigraph/baseset/20221201/entities/gbq_results/article/article_740.jsonl", 
        "type": "ScholarlyArticle", 
        "url": "https://doi.org/10.1007/s11263-017-1038-2"
      }
    ]
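As an illustration, the JSON-LD record above can be consumed with Python's standard `json` module alone, since JSON-LD is plain JSON. The sketch below parses a trimmed copy of the record (only fields that appear above) and extracts the title, author names, and DOI URL:

```python
import json

# A trimmed copy of the SciGraph JSON-LD record shown above.
record_json = '''
[
  {
    "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json",
    "author": [
      {"familyName": "Malinowski", "givenName": "Mateusz", "type": "Person"},
      {"familyName": "Rohrbach", "givenName": "Marcus", "type": "Person"},
      {"familyName": "Fritz", "givenName": "Mario", "type": "Person"}
    ],
    "name": "Ask Your Neurons: A Deep Learning Approach to Visual Question Answering",
    "sameAs": ["https://doi.org/10.1007/s11263-017-1038-2"],
    "type": "ScholarlyArticle"
  }
]
'''

record = json.loads(record_json)[0]  # the graph is a one-element list
title = record["name"]
authors = [f'{a["givenName"]} {a["familyName"]}' for a in record["author"]]
doi_url = record["sameAs"][0]

print(title)
print(", ".join(authors))  # → Mateusz Malinowski, Marcus Rohrbach, Mario Fritz
print(doi_url)
```

For anything beyond simple field extraction (e.g. expanding the `@context`), a dedicated JSON-LD processor is more appropriate than hand-walking the dictionary.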
     

    Download the RDF metadata as: JSON-LD, N-Triples, Turtle, or RDF/XML.

    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular linked data format that is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/s11263-017-1038-2'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/s11263-017-1038-2'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/s11263-017-1038-2'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/s11263-017-1038-2'
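The same content negotiation can be scripted without curl. The sketch below uses Python's standard `urllib.request` to build a request with the appropriate `Accept` header for each of the four serializations listed above (no request is sent until `urlopen` is called; `make_request` and `FORMATS` are illustrative names, not part of the SciGraph API):

```python
import urllib.request

BASE = "https://scigraph.springernature.com/pub.10.1007/s11263-017-1038-2"

# MIME types for the four serializations listed above.
FORMATS = {
    "json-ld": "application/ld+json",
    "nt": "application/n-triples",
    "turtle": "text/turtle",
    "rdf-xml": "application/rdf+xml",
}

def make_request(fmt: str) -> urllib.request.Request:
    """Build a content-negotiated request for one RDF serialization."""
    return urllib.request.Request(BASE, headers={"Accept": FORMATS[fmt]})

# To actually fetch, e.g.: urllib.request.urlopen(make_request("turtle")).read()
req = make_request("json-ld")
print(req.get_header("Accept"))  # → application/ld+json
```

Batch retrieval of many records would follow the same pattern with the N-Triples MIME type, which is the line-based format recommended above for bulk operations.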


     

    This table displays all metadata directly associated with this object as RDF triples.

    158 TRIPLES      21 PREDICATES      84 URIs      68 LITERALS      6 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1007/s11263-017-1038-2 schema:about anzsrc-for:08
    2 anzsrc-for:0801
    3 schema:author N52bed6fae1b648ff81c34d431ba87004
    4 schema:citation sg:pub.10.1007/978-3-319-10602-1_48
    5 sg:pub.10.1007/978-3-319-46448-0_49
    6 sg:pub.10.1007/978-3-319-46448-0_7
    7 sg:pub.10.1007/978-3-319-46475-6_5
    8 sg:pub.10.1007/978-3-319-46478-7_28
    9 sg:pub.10.1007/978-3-642-33715-4_54
    10 sg:pub.10.1007/s11263-015-0816-y
    11 sg:pub.10.1007/s11263-016-0981-7
    12 schema:datePublished 2017-08-29
    13 schema:datePublishedReg 2017-08-29
    14 schema:description We propose a Deep Learning approach to the visual question answering task, where machines answer to questions about real-world images. By combining latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation to this problem. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language inputs (image and question). We evaluate our approaches on the DAQUAR as well as the VQA dataset where we also report various baselines, including an analysis how much information is contained in the language part only. To study human consensus, we propose two novel metrics and collect additional answers which extend the original DAQUAR dataset to DAQUAR-Consensus. Finally, we evaluate a rich set of design choices how to encode, combine and decode information in our proposed Deep Learning formulation.
    15 schema:genre article
    16 schema:isAccessibleForFree true
    17 schema:isPartOf N81fb99780c204a388d1ea5c476ed0416
    18 N86127db23c90435c93934a2f63e1c22c
    19 sg:journal.1032807
    20 schema:keywords Answering
    21 Question Answering
    22 VQA datasets
    23 Visual Question Answering
    24 additional answers
    25 advances
    26 analysis
    27 answers
    28 approach
    29 baseline
    30 choice
    31 consensus
    32 contrast
    33 dataset
    34 deep learning approach
    35 deep learning formulation
    36 design choices
    37 efforts
    38 end
    39 end formulation
    40 formulation
    41 human consensus
    42 image representation
    43 images
    44 information
    45 input
    46 language input
    47 language output
    48 language part
    49 language processing
    50 latest advances
    51 learning approach
    52 learning formulation
    53 machine
    54 metrics
    55 multi-modal problems
    56 natural language input
    57 natural language processing
    58 neurons
    59 novel metric
    60 output
    61 part
    62 previous efforts
    63 problem
    64 processing
    65 questions
    66 real-world images
    67 representation
    68 rich set
    69 set
    70 task
    71 visual question
    72 schema:name Ask Your Neurons: A Deep Learning Approach to Visual Question Answering
    73 schema:pagination 110-135
    74 schema:productId N2a85df878bc048bbb4bd6e758b6fc3a2
    75 Nc72d175a7c5c48208ee351aea254bed8
    76 schema:sameAs https://app.dimensions.ai/details/publication/pub.1091379710
    77 https://doi.org/10.1007/s11263-017-1038-2
    78 schema:sdDatePublished 2022-12-01T06:36
    79 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    80 schema:sdPublisher N12bd530bb6664bfa976b176d1184b9eb
    81 schema:url https://doi.org/10.1007/s11263-017-1038-2
    82 sgo:license sg:explorer/license/
    83 sgo:sdDataset articles
    84 rdf:type schema:ScholarlyArticle
    85 N12bd530bb6664bfa976b176d1184b9eb schema:name Springer Nature - SN SciGraph project
    86 rdf:type schema:Organization
    87 N2a85df878bc048bbb4bd6e758b6fc3a2 schema:name dimensions_id
    88 schema:value pub.1091379710
    89 rdf:type schema:PropertyValue
    90 N52bed6fae1b648ff81c34d431ba87004 rdf:first sg:person.07716544521.15
    91 rdf:rest N548165449d1b47249f3019346d2a5e28
    92 N548165449d1b47249f3019346d2a5e28 rdf:first sg:person.014537716115.47
    93 rdf:rest Ndedf6e2118734ea1b6feb45c823439dd
    94 N81fb99780c204a388d1ea5c476ed0416 schema:volumeNumber 125
    95 rdf:type schema:PublicationVolume
    96 N86127db23c90435c93934a2f63e1c22c schema:issueNumber 1-3
    97 rdf:type schema:PublicationIssue
    98 Nc72d175a7c5c48208ee351aea254bed8 schema:name doi
    99 schema:value 10.1007/s11263-017-1038-2
    100 rdf:type schema:PropertyValue
    101 Ndedf6e2118734ea1b6feb45c823439dd rdf:first sg:person.013361072755.17
    102 rdf:rest rdf:nil
    103 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    104 schema:name Information and Computing Sciences
    105 rdf:type schema:DefinedTerm
    106 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
    107 schema:name Artificial Intelligence and Image Processing
    108 rdf:type schema:DefinedTerm
    109 sg:journal.1032807 schema:issn 0920-5691
    110 1573-1405
    111 schema:name International Journal of Computer Vision
    112 schema:publisher Springer Nature
    113 rdf:type schema:Periodical
    114 sg:person.013361072755.17 schema:affiliation grid-institutes:grid.419528.3
    115 schema:familyName Fritz
    116 schema:givenName Mario
    117 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013361072755.17
    118 rdf:type schema:Person
    119 sg:person.014537716115.47 schema:affiliation grid-institutes:grid.47840.3f
    120 schema:familyName Rohrbach
    121 schema:givenName Marcus
    122 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014537716115.47
    123 rdf:type schema:Person
    124 sg:person.07716544521.15 schema:affiliation grid-institutes:grid.419528.3
    125 schema:familyName Malinowski
    126 schema:givenName Mateusz
    127 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.07716544521.15
    128 rdf:type schema:Person
    129 sg:pub.10.1007/978-3-319-10602-1_48 schema:sameAs https://app.dimensions.ai/details/publication/pub.1045321436
    130 https://doi.org/10.1007/978-3-319-10602-1_48
    131 rdf:type schema:CreativeWork
    132 sg:pub.10.1007/978-3-319-46448-0_49 schema:sameAs https://app.dimensions.ai/details/publication/pub.1039098005
    133 https://doi.org/10.1007/978-3-319-46448-0_49
    134 rdf:type schema:CreativeWork
    135 sg:pub.10.1007/978-3-319-46448-0_7 schema:sameAs https://app.dimensions.ai/details/publication/pub.1000181029
    136 https://doi.org/10.1007/978-3-319-46448-0_7
    137 rdf:type schema:CreativeWork
    138 sg:pub.10.1007/978-3-319-46475-6_5 schema:sameAs https://app.dimensions.ai/details/publication/pub.1042471266
    139 https://doi.org/10.1007/978-3-319-46475-6_5
    140 rdf:type schema:CreativeWork
    141 sg:pub.10.1007/978-3-319-46478-7_28 schema:sameAs https://app.dimensions.ai/details/publication/pub.1019481205
    142 https://doi.org/10.1007/978-3-319-46478-7_28
    143 rdf:type schema:CreativeWork
    144 sg:pub.10.1007/978-3-642-33715-4_54 schema:sameAs https://app.dimensions.ai/details/publication/pub.1053469442
    145 https://doi.org/10.1007/978-3-642-33715-4_54
    146 rdf:type schema:CreativeWork
    147 sg:pub.10.1007/s11263-015-0816-y schema:sameAs https://app.dimensions.ai/details/publication/pub.1009767488
    148 https://doi.org/10.1007/s11263-015-0816-y
    149 rdf:type schema:CreativeWork
    150 sg:pub.10.1007/s11263-016-0981-7 schema:sameAs https://app.dimensions.ai/details/publication/pub.1083549046
    151 https://doi.org/10.1007/s11263-016-0981-7
    152 rdf:type schema:CreativeWork
    153 grid-institutes:grid.419528.3 schema:alternateName Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
    154 schema:name Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
    155 rdf:type schema:Organization
    156 grid-institutes:grid.47840.3f schema:alternateName UC Berkeley EECS, Berkeley, CA, USA
    157 schema:name UC Berkeley EECS, Berkeley, CA, USA
    158 rdf:type schema:Organization
     



