Introduction
LENS (Large Language Models ENhanced to See) is an approach that combines the reasoning capabilities of LLMs with computer vision. It uses a set of vision models to extract general-purpose information about an image and then passes that information, along with the user's query, to an LLM to produce the answer.
In this post, let’s take a look at how it works.
How It Works
LENS consists of three vision modules and one language module. The vision modules extract information from the image independently of the user query; this information, together with the user's query, is then given to the LLM to generate the output.
Here’s what each module is responsible for:
Tag Module
This module generates the most appropriate tag for a given image using a pre-trained CLIP model, scoring each candidate tag against the image with the prompt “The photo is of _______”.
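To make this concrete, here is a minimal sketch of CLIP-based zero-shot tagging with the Hugging Face transformers library. The checkpoint name, image path, and tiny tag list are illustrative assumptions; the actual LENS tag vocabulary is far larger (see “Visual Vocabularies” below).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Tiny illustrative tag vocabulary; LENS assembles a much larger one.
tag_vocabulary = ["dog", "cat", "bicycle", "pizza"]
prompts = [f"The photo is of {tag}" for tag in tag_vocabulary]

image = Image.open("example.jpg")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores

# The highest-scoring prompt gives the predicted tag.
best = logits.softmax(dim=-1).argmax(dim=-1).item()
print("Predicted tag:", tag_vocabulary[best])
```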
Attribute Module
The attribute module uses a pre-trained CLIP model to identify attributes of the objects in the image, which adds descriptive detail beyond the tags.
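The scoring works the same way as in the tag module. Continuing the snippet above, here is a rough sketch that scores a small, made-up attribute list against the image; in LENS the attribute vocabulary is generated with GPT-3, as described under “Visual Vocabularies” below.

```python
# Same CLIP scoring as above, but against attribute phrases.
attribute_vocabulary = [
    "a small furry animal",
    "a large metallic object",
    "a round red fruit",
]  # illustrative only

attr_inputs = processor(text=attribute_vocabulary, images=image,
                        return_tensors="pt", padding=True)
with torch.no_grad():
    attr_scores = model(**attr_inputs).logits_per_image.softmax(dim=-1)[0]

# Keep the top-scoring attributes as part of the image description.
top_indices = attr_scores.topk(2).indices.tolist()
print("Top attributes:", [attribute_vocabulary[i] for i in top_indices])
```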
Intensive Captioner Module
It uses the BLIP model to generate N captions that capture diverse aspects of the image.
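As a rough sketch of how N diverse captions can be sampled, the snippet below generates several captions from a public BLIP captioning checkpoint. The checkpoint name, sampling settings, and value of N are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

image = Image.open("example.jpg")  # hypothetical input image
inputs = blip_processor(images=image, return_tensors="pt")

with torch.no_grad():
    caption_ids = blip_model.generate(
        **inputs,
        do_sample=True,          # stochastic sampling -> diverse captions
        top_k=50,
        max_new_tokens=30,
        num_return_sequences=5,  # N captions
    )

captions = blip_processor.batch_decode(caption_ids, skip_special_tokens=True)
for caption in captions:
    print(caption)
```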
Visual Vocabularies
These act as the bridge between the image and the text. The tag vocabulary is collected from a variety of vision datasets, such as object detection, semantic segmentation, and Visual Genome datasets. The attribute vocabulary is built by prompting GPT-3 for descriptions of various object categories. The vision modules then use these vocabularies to generate a description of the image that is independent of the user query.
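Below is a sketch of how the vocabularies might be assembled. The label lists are tiny illustrative stand-ins for the real datasets, and query_gpt3 is a hypothetical placeholder for an actual LLM call.

```python
# Illustrative label lists standing in for the real source datasets.
coco_labels = ["person", "bicycle", "car"]            # object detection
ade20k_labels = ["wall", "building", "sky"]           # semantic segmentation
visual_genome_labels = ["tree", "window", "shirt"]    # Visual Genome

# Tag vocabulary: union of labels from the source datasets.
tag_vocabulary = sorted(set(coco_labels + ade20k_labels + visual_genome_labels))

# Attribute vocabulary: ask a text LLM for the typical attributes of each category.
ATTRIBUTE_PROMPT = "List the visual attributes (color, shape, material) of a {category}."

def query_gpt3(prompt: str) -> list[str]:
    """Hypothetical stand-in for an LLM call; returns a canned answer here."""
    return ["red", "round", "metallic"]

attribute_vocabulary = {
    category: query_gpt3(ATTRIBUTE_PROMPT.format(category=category))
    for category in tag_vocabulary
}
```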
Reasoning Module
This is a frozen, off-the-shelf LLM that takes as input the outputs of the three vision modules as well as the user query, and generates the final answer.
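Here is a rough sketch of that final step, assuming a Flan-T5 checkpoint as the frozen LLM and using made-up module outputs; the exact prompt format used in the paper differs.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
llm = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# Outputs of the three vision modules (illustrative values).
tags = ["dog", "frisbee"]
attributes = ["a small furry animal", "a round plastic object"]
captions = ["a dog jumping to catch a frisbee in a park"]
question = "What is the dog doing?"

# Assemble everything into a single text prompt for the frozen LLM.
prompt = (
    f"Tags: {', '.join(tags)}\n"
    f"Attributes: {', '.join(attributes)}\n"
    f"Captions: {' | '.join(captions)}\n"
    f"Question: {question}\n"
    "Short answer:"
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = llm.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```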
Here are some of the examples given in the paper:
Conclusion
LENS is an example of adapting LLMs to vision tasks. Approaches like this make LLMs more general-purpose, letting them work with images and, eventually, with videos and other forms of visual input as well.
That’s it for this issue. I hope you found this article interesting. Until next time!
📖Resources
Let’s connect :)