Introduction
LENS (Large Language Models ENhanced to See) is an approach that combines the reasoning capabilities of LLMs with computer vision. It uses a set of vision models to extract general-purpose information about an image and then passes that information, along with the user's query, to an LLM to produce the answer.
In this post, let’s take a look at how it works.
How It Works
LENS consists of three vision modules and one language module. The vision modules extract information from the image independently of the user query; this information, together with the user's query, is then given to the LLM to generate the output.
Here’s what each module is responsible for:
Tag Module
This module generates the most appropriate tag for a given image using a pre-trained CLIP model, scoring each candidate tag against the image with the prompt “The photo is of _______”.
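To make this concrete, here is a minimal sketch of CLIP-based zero-shot tagging with the Hugging Face transformers library. The checkpoint name, image path, and tiny tag list are illustrative assumptions; the actual LENS tag vocabulary is far larger (see “Visual Vocabularies” below).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Tiny illustrative tag vocabulary; LENS assembles a much larger one.
tag_vocabulary = ["dog", "cat", "bicycle", "pizza"]
prompts = [f"The photo is of {tag}" for tag in tag_vocabulary]

image = Image.open("example.jpg")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores

# The highest-scoring prompt gives the predicted tag.
best = logits.softmax(dim=-1).argmax(dim=-1).item()
print("Predicted tag:", tag_vocabulary[best])
```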
Attribute Module
The attribute module uses a pre-trained CLIP model to identify attributes of the objects in the image, which adds descriptive detail beyond the tags.
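The scoring works the same way as in the tag module. Continuing the snippet above, here is a rough sketch that scores a small, made-up attribute list against the image; in LENS the attribute vocabulary is generated with GPT-3, as described under “Visual Vocabularies” below.

```python
# Same CLIP scoring as above, but against attribute phrases.
attribute_vocabulary = [
    "a small furry animal",
    "a large metallic object",
    "a round red fruit",
]  # illustrative only

attr_inputs = processor(text=attribute_vocabulary, images=image,
                        return_tensors="pt", padding=True)
with torch.no_grad():
    attr_scores = model(**attr_inputs).logits_per_image.softmax(dim=-1)[0]

# Keep the top-scoring attributes as part of the image description.
top_indices = attr_scores.topk(2).indices.tolist()
print("Top attributes:", [attribute_vocabulary[i] for i in top_indices])
```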
Intensive Captioner Module
It uses the BLIP model to generate N captions that capture diverse aspects of the image.
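As a rough sketch of how N diverse captions can be sampled, the snippet below generates several captions from a public BLIP captioning checkpoint. The checkpoint name, sampling settings, and value of N are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

image = Image.open("example.jpg")  # hypothetical input image
inputs = blip_processor(images=image, return_tensors="pt")

with torch.no_grad():
    caption_ids = blip_model.generate(
        **inputs,
        do_sample=True,          # stochastic sampling -> diverse captions
        top_k=50,
        max_new_tokens=30,
        num_return_sequences=5,  # N captions
    )

captions = blip_processor.batch_decode(caption_ids, skip_special_tokens=True)
for caption in captions:
    print(caption)
```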
Visual Vocabularies
These act as the bridge between the image and the text. The tag vocabulary is collected from a variety of vision datasets, such as object detection, semantic segmentation, and Visual Genome datasets. The attribute vocabulary is built by prompting GPT-3 for descriptions of various object categories. The vision modules then use these vocabularies to generate a description of the image that is independent of the user query.
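Below is a sketch of how the vocabularies might be assembled. The label lists are tiny illustrative stand-ins for the real datasets, and query_gpt3 is a hypothetical placeholder for an actual LLM call.

```python
# Illustrative label lists standing in for the real source datasets.
coco_labels = ["person", "bicycle", "car"]            # object detection
ade20k_labels = ["wall", "building", "sky"]           # semantic segmentation
visual_genome_labels = ["tree", "window", "shirt"]    # Visual Genome

# Tag vocabulary: union of labels from the source datasets.
tag_vocabulary = sorted(set(coco_labels + ade20k_labels + visual_genome_labels))

# Attribute vocabulary: ask a text LLM for the typical attributes of each category.
ATTRIBUTE_PROMPT = "List the visual attributes (color, shape, material) of a {category}."

def query_gpt3(prompt: str) -> list[str]:
    """Hypothetical stand-in for an LLM call; returns a canned answer here."""
    return ["red", "round", "metallic"]

attribute_vocabulary = {
    category: query_gpt3(ATTRIBUTE_PROMPT.format(category=category))
    for category in tag_vocabulary
}
```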
Reasoning Module
This is a frozen, off-the-shelf LLM that takes as input the outputs of the three vision modules as well as the user query, and generates the final answer.
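Here is a rough sketch of that final step, assuming a Flan-T5 checkpoint as the frozen LLM and using made-up module outputs; the exact prompt format used in the paper differs.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
llm = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

# Outputs of the three vision modules (illustrative values).
tags = ["dog", "frisbee"]
attributes = ["a small furry animal", "a round plastic object"]
captions = ["a dog jumping to catch a frisbee in a park"]
question = "What is the dog doing?"

# Assemble everything into a single text prompt for the frozen LLM.
prompt = (
    f"Tags: {', '.join(tags)}\n"
    f"Attributes: {', '.join(attributes)}\n"
    f"Captions: {' | '.join(captions)}\n"
    f"Question: {question}\n"
    "Short answer:"
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = llm.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```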
Here are some of the examples given in the paper:
Conclusion
LENS is an example of adapting LLMs to vision tasks. Approaches like this make LLMs more general-purpose, letting them work with images and, eventually, with videos and other forms of visual input as well.
That’s it for this issue. I hope you found this article interesting. Until next time!
📖Resources
Let’s connect :)