Bookmark: Zero-shot image-to-text generation with BLIP-2

lqdev 👽 02/15/2023

This guide introduces BLIP-2 from Salesforce Research, a suite of state-of-the-art visual-language models now available in 🤗 Transformers. It shows how to use BLIP-2 for image captioning, prompted image captioning, visual question-answering, and chat-based prompting.
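As a rough illustration of those four uses, here is a minimal sketch with 🤗 Transformers. The checkpoint name `Salesforce/blip2-opt-2.7b`, the helper names `build_prompt` and `generate_text`, and the token limit are assumptions made for this example, not details from the bookmark. The "Question: ... Answer:" layout is one common prompt convention for BLIP-2 and may need adjusting for other checkpoints.

```python
from typing import List, Optional, Tuple

# Assumed checkpoint for illustration; other BLIP-2 checkpoints exist.
MODEL_ID = "Salesforce/blip2-opt-2.7b"


def build_prompt(question: str, history: Optional[List[Tuple[str, str]]] = None) -> str:
    """Format a question (plus optional earlier Q/A turns) as a text prompt."""
    turns = [f"Question: {q} Answer: {a}." for q, a in (history or [])]
    turns.append(f"Question: {question} Answer:")
    return " ".join(turns)


def generate_text(image_path: str, prompt: Optional[str] = None, max_new_tokens: int = 20) -> str:
    """Caption an image (prompt=None) or continue a text prompt about it."""
    # Heavy imports stay local so build_prompt works without them installed.
    import torch
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    processor = Blip2Processor.from_pretrained(MODEL_ID)
    model = Blip2ForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=dtype).to(device)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```

With a local image file (hypothetical path `photo.jpg`), `generate_text("photo.jpg")` would produce a plain caption, `generate_text("photo.jpg", "this is a picture of")` a prompted caption, and `generate_text("photo.jpg", build_prompt("which city is this?"))` a visual question-answering response. Passing earlier turns as `history` gives the chat-style variant. The model is reloaded on every call here to keep the sketch short; real code would load it once.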

BLIP-2 bridges the modality gap between vision and language models by adding a lightweight Querying Transformer (Q-Former) between an off-the-shelf frozen pre-trained image encoder and a frozen large language model. Q-Former is the only trainable part of BLIP-2; both the image encoder and language model remain frozen.
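The frozen/trainable split can be sketched with a toy PyTorch module. Everything here is a stand-in chosen for illustration: the class name `ToyBlip2`, the layer types, and the sizes are assumptions, and the real Q-Former is a full transformer rather than a single attention layer. The learned query tokens are counted as part of the Q-Former in this sketch.

```python
import torch
from torch import nn


class ToyBlip2(nn.Module):
    """Toy stand-in for the BLIP-2 layout; layers and sizes are illustrative only."""

    def __init__(self, img_dim=32, q_dim=16, llm_dim=24, num_queries=4):
        super().__init__()
        # Stand-in for the frozen pre-trained image encoder.
        self.image_encoder = nn.Linear(img_dim, q_dim)
        # Learned queries plus one cross-attention layer stand in for the Q-Former.
        self.query_tokens = nn.Parameter(torch.zeros(1, num_queries, q_dim))
        self.qformer = nn.MultiheadAttention(q_dim, num_heads=2, batch_first=True)
        # Stand-in for the frozen large language model.
        self.language_model = nn.Linear(q_dim, llm_dim)

        # Freeze both ends so only the bridge receives gradients.
        for frozen in (self.image_encoder, self.language_model):
            for param in frozen.parameters():
                param.requires_grad = False

    def forward(self, patches):
        feats = self.image_encoder(patches)                 # (batch, patches, q_dim)
        queries = self.query_tokens.expand(feats.size(0), -1, -1)
        bridged, _ = self.qformer(queries, feats, feats)    # queries attend to image features
        return self.language_model(bridged)                 # (batch, num_queries, llm_dim)


model = ToyBlip2()
out = model(torch.randn(2, 9, 32))
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

Listing `trainable` shows only the query tokens and the attention layer, which mirrors the idea that an optimizer would update the bridge while the image encoder and language model keep their pre-trained weights.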

Permalink: /feed/blip-2-zero-shot-image-to-text-generation/

Tags: #untagged
