technology hard Fill in the Blank

The technique that enables LLMs to reason about visual inputs by combining vision and language models is called ________.

SWAYAM / NPTEL
e-Sign / Aadhaar e-KYC
dimensions
Multimodal Learning

Answer: Multimodal Learning

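Multimodal models such as CLIP and LLaVA learn from text and images jointly, and other multimodal systems extend this to audio. CLIP maps images and text into a shared embedding space, while LLaVA connects a vision encoder to a language model so the LLM can reason about images. This enables visual question answering, image captioning, and cross-modal retrieval.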
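
To make cross-modal retrieval concrete, here is a minimal sketch in Python. It assumes a CLIP-style setup where images and text have already been encoded into the same vector space. The file names, the three-dimensional vectors, and the helper names `cosine_similarity` and `retrieve_image` are hypothetical and hand-written for illustration; they are not outputs or APIs of any real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical image embeddings (a real vision encoder would produce these).
image_embeddings = {
    "dog_photo.jpg": [0.9, 0.1, 0.0],
    "cat_photo.jpg": [0.1, 0.9, 0.0],
    "car_photo.jpg": [0.0, 0.1, 0.9],
}

def retrieve_image(text_embedding, images):
    """Return the image whose embedding is most similar to the text embedding."""
    return max(images, key=lambda name: cosine_similarity(text_embedding, images[name]))

# Hypothetical text embedding standing in for the caption "a photo of a dog".
query_embedding = [0.8, 0.2, 0.1]

print(retrieve_image(query_embedding, image_embeddings))
```

Because both modalities live in one space, matching a caption to an image reduces to a nearest-neighbour search over similarity scores.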

Topic: Advanced AI/ML

Exam Relevance: UPSC, Banking, SSC
