Patching open-vocabulary models by interpolating weights
Paper | Code | Data
Open-vocabulary models like CLIP achieve high accuracy across many image classification tasks. However, there are still settings where their zero-shot performance is far from optimal. We study model patching, where the goal is to improve accuracy on specific tasks without degrading accuracy on tasks where performance is already adequate.
PAINT interactive demo
Explore predictions for both a supported task (CIFAR-10) and a patching task (MNIST) by setting an interpolation factor α, which mixes the weights as θpatch = (1 − α)·θ0 + α·θ1. When α = 0, we recover θ0, the original open-vocabulary model (here a CLIP ViT-B/32). When α = 1, we recover θ1, the model fully fine-tuned on MNIST. Notice that for α around 0.25, the patched model loses a negligible amount of accuracy on CIFAR-10 while gaining around 50 percentage points on MNIST relative to θ0 (a minimal code sketch of the interpolation follows the demo).
[Interactive demo: a slider sets α; the page displays AccCIFAR-10(θpatch) and AccMNIST(θpatch), along with example inputs and θpatch predictions for each task.]
Note: for purposes of this demo, predictions are cached.
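For readers who want to reproduce the interpolation itself, here is a minimal sketch in PyTorch. The checkpoint file names and the interpolate_state_dicts helper are illustrative assumptions, not the paper's released code; it assumes two checkpoints of the same architecture, one zero-shot (θ0) and one fine-tuned (θ1).

```python
import torch

def interpolate_state_dicts(theta_0, theta_1, alpha):
    """Return (1 - alpha) * theta_0 + alpha * theta_1, key by key.

    Both state dicts must come from the same architecture, so keys
    and tensor shapes match exactly.
    """
    return {
        key: (1 - alpha) * theta_0[key] + alpha * theta_1[key]
        for key in theta_0
    }

# Hypothetical checkpoint files (not the paper's released artifacts):
# "zeroshot.pt" holds theta_0, the original CLIP ViT-B/32 weights, and
# "finetuned.pt" holds theta_1, the same model fully fine-tuned on MNIST.
theta_0 = torch.load("zeroshot.pt", map_location="cpu")
theta_1 = torch.load("finetuned.pt", map_location="cpu")

# alpha = 0 recovers theta_0; alpha = 1 recovers theta_1. The demo above
# suggests alpha near 0.25 patches MNIST with little CIFAR-10 regression.
theta_patch = interpolate_state_dicts(theta_0, theta_1, alpha=0.25)

# Load the patched weights into a model of the same architecture:
# model.load_state_dict(theta_patch)
```

Because the interpolation acts element-wise on the weights, it adds no inference-time cost: the patched model has exactly the same architecture and size as θ0.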
Team
Gabriel Ilharco, Mitchell Wortsman, Samir Yitzhak Gadre, Shuran Song, Hannaneh Hajishirzi, Simon Kornblith, Ali Farhadi, Ludwig Schmidt
Bibtex
@article{ilharco2022patching,
title={Patching open-vocabulary models by interpolating weights},
author={Ilharco, Gabriel and Wortsman, Mitchell and Gadre, Samir Yitzhak and Song, Shuran and Hajishirzi, Hannaneh and Kornblith, Simon and Farhadi, Ali and Schmidt, Ludwig},
journal={arXiv},
year={2022}
}
Acknowledgements
We thank Akari Asai, Alex Fang, Huy Ha, Ari Holtzman, Pieter-Jan Kindermans, Marco Tulio Ribeiro, Ofir Press, Sarah Pratt, Sewon Min, Thao Nguyen and Tim Dettmers for helpful discussions and feedback, and Hyak at UW for computing support.
Contact
If you have any questions, please contact Gabriel, Mitchell, or Samir.