Introduction
CM3Leon (pronounced chameleon) is a retrieval-augmented, token-based, decoder-only multi-modal language that can not only generate images but also infill images as well as generate and infill text. It combines image and text generation using a combination of models to achieve performance comparable to diffusion models.
Let’s take a look at how it works!
Design
The pre-training stage has two steps
Image Tokenization
The image tokenizer tokenized a 256x256 image into 1024 tokens from a vocabulary of 8192 tokens. Here, vocabulary means the set of all possible tokens. For the text, the process uses a pre-trained transformer with a vocabulary of 56320. In addition to image and text, is also a special token <break> which is used to differentiate between them.
Retrieval Augmentation
The retriever takes a query q, and a candidate document m, and returns a relevance score r(q, m). Using a CLIP-based architecture, the text and image parts are separated and tokenized. Then the average of the two vector representations is used as the vector representation of the document.
Training
While training, 3 parameters are used for selecting the candidate documents:
The document selected should be relevant to the input sequence x.
The document should contain images and text instead of just one of the two.
While selecting top K documents, no two documents should be too similar.
Objective function
The objective function of CM3Leon is the next token loss -log p(x) where x is the input sequence.
Here’s an example of how all of it works together
Conclusion
CM3Leon is in line with the last few posts which combine LLMs and vision transformers to get good results in not only image generation but also image description. This will be the last post where I just break down the paper on a general level, from the next post, I plan to explore classic research papers including all the math as well as code if possible. I am excited about what’s next! Let me know if you have any suggestions on which topics I should cover.
That’s it for this issue. I hope you found this article interesting. Until next time!
📖Resources
Let’s connect :)