This post covers multimodal systems in general, including LMMs. It consists of three parts.
Part 1 covers the context for multimodality: why multimodality matters, the different data modalities, and the types of multimodal tasks.
Part 2 discusses the fundamentals of a multimodal system, using two examples: CLIP, which laid the foundation for many later multimodal systems, and Flamingo, whose impressive performance gave rise to LMMs.
Part 3 discusses active research areas for LMMs, including generating multimodal outputs and adapters for more efficient multimodal training, and covers newer multimodal systems such as BLIP-2, LLaVA, LLaMA-Adapter V2, and LAVIN.