
Visual Instruction Tuning (LLaVA)

by University of Wisconsin–Madison / Microsoft Research · open-source · Last verified 2026-03-17

Introduced LLaVA (Large Language and Vision Assistant), a multimodal model trained via visual instruction tuning on GPT-4-generated multimodal instruction-following data. LLaVA demonstrates strong multimodal chat abilities, yielding an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset, and pioneered open-source visual instruction tuning. When fine-tuned on Science QA, LLaVA combined with GPT-4 achieves a state-of-the-art 92.53% accuracy.
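The instruction-tuning data pairs each image with a multi-turn conversation. Below is a minimal sketch of one training record in the LLaVA conversation style; the field names follow the publicly released llava_instruct JSON files, while the image path and conversation text are illustrative:

```python
# One record from a LLaVA-style visual instruction-tuning dataset.
# Field names follow the released llava_instruct JSON files; the
# image path and conversation text here are illustrative.
record = {
    "id": "000000123456",
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {
            "from": "human",
            # "<image>" marks where the vision encoder's tokens are inserted.
            "value": "<image>\nWhat is unusual about this scene?",
        },
        {
            "from": "gpt",
            "value": "A man is ironing clothes on a board attached to "
                     "the roof of a moving taxi.",
        },
    ],
}
```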

https://arxiv.org/abs/2304.08485
Overall grade: B+ (Good)
Adoption: A+ · Quality: A+ · Freshness: A · Citations: A · Engagement: F

Specifications

License: Apache 2.0
Pricing: open-source
Capabilities: visual-question-answering, image-description, multimodal-chat, instruction-following
Integrations: huggingface, ollama (see the loading sketch after this list)
Use Cases: visual-qa, image-analysis, multimodal-assistants
API Available: No
Tags: llava, multimodal, instruction-tuning, vision-language, open-source
Added: 2026-03-17
Completeness: 100%
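Since the weights are published on the Hugging Face Hub, a quick way to try the model is via the transformers library. A minimal sketch, assuming the community llava-hf/llava-1.5-7b-hf checkpoint (this page lists "huggingface" as an integration without naming a specific repo) and a sample COCO image URL:

```python
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed community checkpoint; not named on this page.
model_id = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Sample image for illustration.
image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
).raw)

# LLaVA-1.5 prompt template: the <image> token is replaced by visual tokens.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

For local use without Python, the ollama integration listed above serves a LLaVA build via `ollama run llava`.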

Index Score

Overall: 76
Adoption: 90 · Quality: 90 · Freshness: 82 · Citations: 88 · Engagement: 0
