GK Question

technology hard fill_blank

The technique that enables LLMs to reason about visual inputs by combining vision and language models is called ________.

Answer: Multimodal Learning

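Multimodal models process more than one kind of input, such as text, images, and audio, jointly. Vision-language examples include CLIP, which aligns images and text in a shared embedding space, and LLaVA, which connects a vision encoder to an LLM. This enables visual question answering, image captioning, and cross-modal retrieval, capabilities that are central to many newer AI applications.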
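
As a rough illustration of cross-modal retrieval, the sketch below assumes images and text have already been mapped into one shared embedding space, and picks the image closest to a text query by cosine similarity. The three-dimensional vectors, the file names, and the `retrieve` helper are made up for illustration; a real system would obtain its embeddings from a trained model such as CLIP.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, candidates):
    """Return the candidate name whose embedding is most similar to the query."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))


# Hypothetical image embeddings (hand-picked toy values, not model output).
image_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.1, 0.9, 0.1],
    "beach.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the text query "a photo of a dog".
text_embedding = [0.8, 0.2, 0.1]

best = retrieve(text_embedding, image_embeddings)
print(best)
```

Because text and images share one vector space in this setup, the same comparison works in either direction: a text query can rank images, and an image query can rank captions.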

Topic Advanced AI/ML
Exam Relevance UPSC, Banking, SSC