Visual Instruction Tuning (LLaVA)
by University of Wisconsin–Madison / Microsoft Research · open-source · Last verified 2026-03-17
Introduced LLaVA (Large Language and Vision Assistant), a multimodal model that connects a vision encoder to a large language model and is trained via visual instruction tuning on GPT-4-generated multimodal instruction-following data (a sample record is sketched below). LLaVA demonstrates strong multimodal chat abilities, yielding an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following benchmark; when fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. The work pioneered open-source visual instruction tuning.
https://arxiv.org/abs/2304.08485

Overall grade: B+ (Good) · Adoption: A+ · Quality: A+ · Freshness: A · Citations: A · Engagement: F
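For a concrete sense of the training data, here is a minimal sketch of one visual instruction-tuning record in the JSON-style layout used by the LLaVA repository. The id, image path, and answer text are illustrative assumptions, not values from the released dataset:

```python
# One LLaVA-style visual instruction-tuning record (illustrative values).
# The "<image>" placeholder marks where the vision encoder's features are
# spliced into the language model's prompt during training.
record = {
    "id": "000000012345",                        # hypothetical sample id
    "image": "coco/train2017/000000012345.jpg",  # hypothetical image path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this image?"},
        {"from": "gpt", "value": "A man is ironing clothes on a board attached "
                                 "to the roof of a moving taxi."},
    ],
}
```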
Specifications
- License: Apache 2.0
- Pricing: open-source
- Capabilities: visual-question-answering, image-description, multimodal-chat, instruction-following
- Integrations: huggingface, ollama (see the inference sketch after this list)
- Use Cases: visual-qa, image-analysis, multimodal-assistants
- API Available: No
- Tags: llava, multimodal, instruction-tuning, vision-language, open-source
- Added: 2026-03-17
- Completeness: 100%
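Since LLaVA checkpoints are distributed on the Hugging Face Hub, a quick way to try the model is through transformers. A minimal inference sketch, assuming the community `llava-hf/llava-1.5-7b-hf` checkpoint and a local image path (both are assumptions, not part of this listing):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name on the Hub
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # hypothetical local image
# LLaVA-1.5 chat format: the <image> token marks where vision features go.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

For the Ollama integration, running `ollama run llava` and supplying an image path in the prompt serves a similar purpose for local experimentation.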
Index Score: 76
- Adoption: 90
- Quality: 90
- Freshness: 82
- Citations: 88
- Engagement: 0