Learning to Prompt for Vision-Language Models


Ontology type: schema:ScholarlyArticle
Open Access: True


Article Info

DATE

2022-07-31

AUTHORS

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu

ABSTRACT

Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from the traditional representation learning that is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming—one needs to spend a significant amount of time on words tuning since a slight change in wording could have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition. Concretely, CoOp models a prompt’s context words with learnable vectors while the entire pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts with a decent margin and is able to gain significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
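The mechanism described in the abstract can be sketched in a few lines. The snippet below is an illustrative sketch, not the authors' released implementation: it assumes a 512-dimensional token embedding space, uses random placeholder class-name embeddings in place of CLIP's frozen token embeddings, and only shows how learnable context vectors (unified or class-specific) are prepended to class-name embeddings to form prompts while all pre-trained parameters stay fixed.

import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    def __init__(self, classnames, embed_dim=512, n_ctx=16, class_specific=False):
        super().__init__()
        n_cls = len(classnames)
        # Unified context shares one set of context vectors across all classes;
        # class-specific context learns an independent set per class.
        shape = (n_cls, n_ctx, embed_dim) if class_specific else (n_ctx, embed_dim)
        self.ctx = nn.Parameter(torch.randn(*shape) * 0.02)  # the only trainable parameters
        # Placeholder class-name embeddings; in CoOp these would come from the
        # frozen text encoder's token embedding layer and are never updated.
        self.register_buffer("name_embs", torch.randn(n_cls, 1, embed_dim))
        self.n_cls = n_cls
        self.class_specific = class_specific

    def forward(self):
        ctx = self.ctx if self.class_specific else self.ctx.unsqueeze(0).expand(self.n_cls, -1, -1)
        # Prompt layout per class: [V_1, ..., V_M, CLASS]; the result would be passed
        # through the frozen text encoder to synthesize classification weights.
        return torch.cat([ctx, self.name_embs], dim=1)

prompts = PromptLearner(["goldfish", "airliner", "pizza"])()
print(prompts.shape)  # torch.Size([3, 17, 512])

In the method itself, only the context vectors (self.ctx above) receive gradients from the few-shot classification loss; the image and text encoders remain frozen throughout.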

PAGES

2337-2348

References to SciGraph publications

  • 2014. Food-101 – Mining Discriminative Components with Random Forests. In Computer Vision – ECCV 2014.
  • 2020. Rethinking Few-Shot Image Classification: A Good Embedding Is All You Need? In Computer Vision – ECCV 2020.
  • 2016. Learning Visual Features from Large Weakly Supervised Data. In Computer Vision – ECCV 2016.

Identifiers

    URI

    http://scigraph.springernature.com/pub.10.1007/s11263-022-01653-1

    DOI

    http://dx.doi.org/10.1007/s11263-022-01653-1

    DIMENSIONS

    https://app.dimensions.ai/details/publication/pub.1149883714



    JSON-LD is the canonical representation for SciGraph data.

    TIP: You can open this SciGraph record in an external JSON-LD service such as the JSON-LD Playground or the Google Structured Data Testing Tool (SDTT).

    [
      {
        "@context": "https://springernature.github.io/scigraph/jsonld/sgcontext.json", 
        "about": [
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/08", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Information and Computing Sciences", 
            "type": "DefinedTerm"
          }, 
          {
            "id": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/0801", 
            "inDefinedTermSet": "http://purl.org/au-research/vocabulary/anzsrc-for/2008/", 
            "name": "Artificial Intelligence and Image Processing", 
            "type": "DefinedTerm"
          }
        ], 
        "author": [
          {
            "affiliation": {
              "alternateName": "S-Lab, Nanyang Technological University, Singapore, Singapore", 
              "id": "http://www.grid.ac/institutes/grid.59025.3b", 
              "name": [
                "S-Lab, Nanyang Technological University, Singapore, Singapore"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Zhou", 
            "givenName": "Kaiyang", 
            "id": "sg:person.011762512201.54", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011762512201.54"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "S-Lab, Nanyang Technological University, Singapore, Singapore", 
              "id": "http://www.grid.ac/institutes/grid.59025.3b", 
              "name": [
                "S-Lab, Nanyang Technological University, Singapore, Singapore"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Yang", 
            "givenName": "Jingkang", 
            "id": "sg:person.013114723224.35", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013114723224.35"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "S-Lab, Nanyang Technological University, Singapore, Singapore", 
              "id": "http://www.grid.ac/institutes/grid.59025.3b", 
              "name": [
                "S-Lab, Nanyang Technological University, Singapore, Singapore"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Loy", 
            "givenName": "Chen Change", 
            "id": "sg:person.0576204646.86", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0576204646.86"
            ], 
            "type": "Person"
          }, 
          {
            "affiliation": {
              "alternateName": "S-Lab, Nanyang Technological University, Singapore, Singapore", 
              "id": "http://www.grid.ac/institutes/grid.59025.3b", 
              "name": [
                "S-Lab, Nanyang Technological University, Singapore, Singapore"
              ], 
              "type": "Organization"
            }, 
            "familyName": "Liu", 
            "givenName": "Ziwei", 
            "id": "sg:person.014067543377.43", 
            "sameAs": [
              "https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014067543377.43"
            ], 
            "type": "Person"
          }
        ], 
        "citation": [
          {
            "id": "sg:pub.10.1007/978-3-030-58568-6_16", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1132566862", 
              "https://doi.org/10.1007/978-3-030-58568-6_16"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-319-46478-7_5", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1009969072", 
              "https://doi.org/10.1007/978-3-319-46478-7_5"
            ], 
            "type": "CreativeWork"
          }, 
          {
            "id": "sg:pub.10.1007/978-3-319-10599-4_29", 
            "sameAs": [
              "https://app.dimensions.ai/details/publication/pub.1039161303", 
              "https://doi.org/10.1007/978-3-319-10599-4_29"
            ], 
            "type": "CreativeWork"
          }
        ], 
        "datePublished": "2022-07-31", 
        "datePublishedReg": "2022-07-31", 
        "description": "Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from the traditional representation learning that is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming\u2014one needs to spend a significant amount of time on words tuning since a slight change in wording could have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition. Concretely, CoOp models a prompt\u2019s context words with learnable vectors while the entire pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts with a decent margin and is able to gain significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.", 
        "genre": "article", 
        "id": "sg:pub.10.1007/s11263-022-01653-1", 
        "isAccessibleForFree": true, 
        "isPartOf": [
          {
            "id": "sg:journal.1032807", 
            "issn": [
              "0920-5691", 
              "1573-1405"
            ], 
            "name": "International Journal of Computer Vision", 
            "publisher": "Springer Nature", 
            "type": "Periodical"
          }, 
          {
            "issueNumber": "9", 
            "type": "PublicationIssue"
          }, 
          {
            "type": "PublicationVolume", 
            "volumeNumber": "130"
          }
        ], 
        "keywords": [
          "vision-language models", 
          "natural language processing", 
          "downstream tasks", 
          "different image recognition tasks", 
          "image recognition tasks", 
          "zero-shot model", 
          "pre-trained parameters", 
          "learning-based approach", 
          "zero-shot transfer", 
          "common feature space", 
          "class of interest", 
          "context words", 
          "recognition task", 
          "image recognition", 
          "domain expertise", 
          "Extensive experiments", 
          "language processing", 
          "natural language", 
          "generalization performance", 
          "align images", 
          "feature space", 
          "context optimization", 
          "classification weights", 
          "learning research", 
          "task", 
          "words", 
          "prompts", 
          "traditional representations", 
          "such models", 
          "major challenge", 
          "huge impact", 
          "representation", 
          "context", 
          "dataset", 
          "engineering", 
          "shot", 
          "performance", 
          "language", 
          "significant improvement", 
          "processing", 
          "images", 
          "clips", 
          "implementation", 
          "model", 
          "recognition", 
          "labels", 
          "more shots", 
          "simple approach", 
          "optimization", 
          "text", 
          "wording", 
          "research", 
          "recent advances", 
          "challenges", 
          "expertise", 
          "wide range", 
          "significant amount", 
          "coops", 
          "great potential", 
          "average gain", 
          "space", 
          "vector", 
          "work", 
          "practice", 
          "advances", 
          "approach", 
          "experiments", 
          "class", 
          "impact", 
          "improvement", 
          "interest", 
          "time", 
          "amount", 
          "gain", 
          "parameters", 
          "changes", 
          "slight changes", 
          "potential", 
          "transfer", 
          "weight", 
          "range", 
          "margin"
        ], 
        "name": "Learning to Prompt for Vision-Language Models", 
        "pagination": "2337-2348", 
        "productId": [
          {
            "name": "dimensions_id", 
            "type": "PropertyValue", 
            "value": [
              "pub.1149883714"
            ]
          }, 
          {
            "name": "doi", 
            "type": "PropertyValue", 
            "value": [
              "10.1007/s11263-022-01653-1"
            ]
          }
        ], 
        "sameAs": [
          "https://doi.org/10.1007/s11263-022-01653-1", 
          "https://app.dimensions.ai/details/publication/pub.1149883714"
        ], 
        "sdDataset": "articles", 
        "sdDatePublished": "2022-11-24T21:09", 
        "sdLicense": "https://scigraph.springernature.com/explorer/license/", 
        "sdPublisher": {
          "name": "Springer Nature - SN SciGraph project", 
          "type": "Organization"
        }, 
        "sdSource": "s3://com-springernature-scigraph/baseset/20221124/entities/gbq_results/article/article_943.jsonl", 
        "type": "ScholarlyArticle", 
        "url": "https://doi.org/10.1007/s11263-022-01653-1"
      }
    ]
     

    Download the RDF metadata as: JSON-LD, N-Triples, Turtle, or RDF/XML (see license info).

    HOW TO GET THIS DATA PROGRAMMATICALLY:

    JSON-LD is a popular format for linked data which is fully compatible with JSON.

    curl -H 'Accept: application/ld+json' 'https://scigraph.springernature.com/pub.10.1007/s11263-022-01653-1'

    N-Triples is a line-based linked data format ideal for batch operations.

    curl -H 'Accept: application/n-triples' 'https://scigraph.springernature.com/pub.10.1007/s11263-022-01653-1'

    Turtle is a human-readable linked data format.

    curl -H 'Accept: text/turtle' 'https://scigraph.springernature.com/pub.10.1007/s11263-022-01653-1'

    RDF/XML is a standard XML format for linked data.

    curl -H 'Accept: application/rdf+xml' 'https://scigraph.springernature.com/pub.10.1007/s11263-022-01653-1'
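
    The same content negotiation works from a script. The snippet below is a minimal Python sketch of the curl calls above, assuming the requests package is installed and that the endpoint returns the one-element JSON-LD array shown earlier:

    import requests

    URL = "https://scigraph.springernature.com/pub.10.1007/s11263-022-01653-1"

    # Pick the serialization via the Accept header, exactly as in the curl examples.
    response = requests.get(URL, headers={"Accept": "application/ld+json"}, timeout=30)
    response.raise_for_status()

    record = response.json()[0]  # the JSON-LD payload shown above is a one-element array
    print(record["name"])           # Learning to Prompt for Vision-Language Models
    print(record["datePublished"])  # 2022-07-31
    print(len(record["citation"]))  # number of SciGraph-linked references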


     

    This table displays all metadata directly associated with this object as RDF triples.

    172 TRIPLES      21 PREDICATES      109 URIs      98 LITERALS      6 BLANK NODES

    Subject Predicate Object
    1 sg:pub.10.1007/s11263-022-01653-1 schema:about anzsrc-for:08
    2 anzsrc-for:0801
    3 schema:author N76b138444f4b44f099a6762c70632318
    4 schema:citation sg:pub.10.1007/978-3-030-58568-6_16
    5 sg:pub.10.1007/978-3-319-10599-4_29
    6 sg:pub.10.1007/978-3-319-46478-7_5
    7 schema:datePublished 2022-07-31
    8 schema:datePublishedReg 2022-07-31
    9 schema:description Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from the traditional representation learning that is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming—one needs to spend a significant amount of time on words tuning since a slight change in wording could have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition. Concretely, CoOp models a prompt’s context words with learnable vectors while the entire pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts with a decent margin and is able to gain significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
    10 schema:genre article
    11 schema:isAccessibleForFree true
    12 schema:isPartOf N22f29689f77248a4b10ff479117d6624
    13 Ne2676588a73343d6bca3a1369406ab17
    14 sg:journal.1032807
    15 schema:keywords Extensive experiments
    16 advances
    17 align images
    18 amount
    19 approach
    20 average gain
    21 challenges
    22 changes
    23 class
    24 class of interest
    25 classification weights
    26 clips
    27 common feature space
    28 context
    29 context optimization
    30 context words
    31 coops
    32 dataset
    33 different image recognition tasks
    34 domain expertise
    35 downstream tasks
    36 engineering
    37 experiments
    38 expertise
    39 feature space
    40 gain
    41 generalization performance
    42 great potential
    43 huge impact
    44 image recognition
    45 image recognition tasks
    46 images
    47 impact
    48 implementation
    49 improvement
    50 interest
    51 labels
    52 language
    53 language processing
    54 learning research
    55 learning-based approach
    56 major challenge
    57 margin
    58 model
    59 more shots
    60 natural language
    61 natural language processing
    62 optimization
    63 parameters
    64 performance
    65 potential
    66 practice
    67 pre-trained parameters
    68 processing
    69 prompts
    70 range
    71 recent advances
    72 recognition
    73 recognition task
    74 representation
    75 research
    76 shot
    77 significant amount
    78 significant improvement
    79 simple approach
    80 slight changes
    81 space
    82 such models
    83 task
    84 text
    85 time
    86 traditional representations
    87 transfer
    88 vector
    89 vision-language models
    90 weight
    91 wide range
    92 wording
    93 words
    94 work
    95 zero-shot model
    96 zero-shot transfer
    97 schema:name Learning to Prompt for Vision-Language Models
    98 schema:pagination 2337-2348
    99 schema:productId N344b78acf6b24752a4361951148b5240
    100 N742e1eeb0fdf4fcfac3bfc6db61cd4ec
    101 schema:sameAs https://app.dimensions.ai/details/publication/pub.1149883714
    102 https://doi.org/10.1007/s11263-022-01653-1
    103 schema:sdDatePublished 2022-11-24T21:09
    104 schema:sdLicense https://scigraph.springernature.com/explorer/license/
    105 schema:sdPublisher Nfbf6956d62e140b8a6c4f15a7d4853f2
    106 schema:url https://doi.org/10.1007/s11263-022-01653-1
    107 sgo:license sg:explorer/license/
    108 sgo:sdDataset articles
    109 rdf:type schema:ScholarlyArticle
    110 N22f29689f77248a4b10ff479117d6624 schema:volumeNumber 130
    111 rdf:type schema:PublicationVolume
    112 N344b78acf6b24752a4361951148b5240 schema:name doi
    113 schema:value 10.1007/s11263-022-01653-1
    114 rdf:type schema:PropertyValue
    115 N362925fcff8a40a5a12025066025dd94 rdf:first sg:person.0576204646.86
    116 rdf:rest N39e65afdb40c470595111d18b8f10203
    117 N39e65afdb40c470595111d18b8f10203 rdf:first sg:person.014067543377.43
    118 rdf:rest rdf:nil
    119 N742e1eeb0fdf4fcfac3bfc6db61cd4ec schema:name dimensions_id
    120 schema:value pub.1149883714
    121 rdf:type schema:PropertyValue
    122 N76b138444f4b44f099a6762c70632318 rdf:first sg:person.011762512201.54
    123 rdf:rest Nf393ff7560584d6e9bbb753fbb6da614
    124 Ne2676588a73343d6bca3a1369406ab17 schema:issueNumber 9
    125 rdf:type schema:PublicationIssue
    126 Nf393ff7560584d6e9bbb753fbb6da614 rdf:first sg:person.013114723224.35
    127 rdf:rest N362925fcff8a40a5a12025066025dd94
    128 Nfbf6956d62e140b8a6c4f15a7d4853f2 schema:name Springer Nature - SN SciGraph project
    129 rdf:type schema:Organization
    130 anzsrc-for:08 schema:inDefinedTermSet anzsrc-for:
    131 schema:name Information and Computing Sciences
    132 rdf:type schema:DefinedTerm
    133 anzsrc-for:0801 schema:inDefinedTermSet anzsrc-for:
    134 schema:name Artificial Intelligence and Image Processing
    135 rdf:type schema:DefinedTerm
    136 sg:journal.1032807 schema:issn 0920-5691
    137 1573-1405
    138 schema:name International Journal of Computer Vision
    139 schema:publisher Springer Nature
    140 rdf:type schema:Periodical
    141 sg:person.011762512201.54 schema:affiliation grid-institutes:grid.59025.3b
    142 schema:familyName Zhou
    143 schema:givenName Kaiyang
    144 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.011762512201.54
    145 rdf:type schema:Person
    146 sg:person.013114723224.35 schema:affiliation grid-institutes:grid.59025.3b
    147 schema:familyName Yang
    148 schema:givenName Jingkang
    149 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.013114723224.35
    150 rdf:type schema:Person
    151 sg:person.014067543377.43 schema:affiliation grid-institutes:grid.59025.3b
    152 schema:familyName Liu
    153 schema:givenName Ziwei
    154 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.014067543377.43
    155 rdf:type schema:Person
    156 sg:person.0576204646.86 schema:affiliation grid-institutes:grid.59025.3b
    157 schema:familyName Loy
    158 schema:givenName Chen Change
    159 schema:sameAs https://app.dimensions.ai/discover/publication?and_facet_researcher=ur.0576204646.86
    160 rdf:type schema:Person
    161 sg:pub.10.1007/978-3-030-58568-6_16 schema:sameAs https://app.dimensions.ai/details/publication/pub.1132566862
    162 https://doi.org/10.1007/978-3-030-58568-6_16
    163 rdf:type schema:CreativeWork
    164 sg:pub.10.1007/978-3-319-10599-4_29 schema:sameAs https://app.dimensions.ai/details/publication/pub.1039161303
    165 https://doi.org/10.1007/978-3-319-10599-4_29
    166 rdf:type schema:CreativeWork
    167 sg:pub.10.1007/978-3-319-46478-7_5 schema:sameAs https://app.dimensions.ai/details/publication/pub.1009969072
    168 https://doi.org/10.1007/978-3-319-46478-7_5
    169 rdf:type schema:CreativeWork
    170 grid-institutes:grid.59025.3b schema:alternateName S-Lab, Nanyang Technological University, Singapore, Singapore
    171 schema:name S-Lab, Nanyang Technological University, Singapore, Singapore
    172 rdf:type schema:Organization
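
    The triple and predicate counts in the summary line above can be reproduced from the N-Triples serialization. This is a sketch, assuming the requests and rdflib packages are installed:

    import requests
    import rdflib

    URL = "https://scigraph.springernature.com/pub.10.1007/s11263-022-01653-1"

    # Fetch the N-Triples serialization (see the curl examples above) and parse it.
    nt = requests.get(URL, headers={"Accept": "application/n-triples"}, timeout=30).text
    graph = rdflib.Graph()
    graph.parse(data=nt, format="nt")

    print(len(graph), "triples")
    print(len({p for _, p, _ in graph}), "distinct predicates")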
     



