technology hard Fill in the Blank

The technique that enables LLMs to reason about visual inputs by combining vision and language models is called ________ Learning.

  1. Edge
  2. Federated
  3. Multimodal
  4. Reinforcement

Answer: Multimodal

Multimodal models (CLIP, LLaVA) process text, images, audio jointly, enabling visual question answering, image captioning, and cross-modal retrieval.

Topic Advanced AI/ML
Exam Relevance UPSC, Banking, SSC