
Visual Instruction Tuning (LLaVA)

by University of Wisconsin–Madison / Microsoft Research · open-source · Last verified 2026-03-17

Introduced LLaVA (Large Language and Vision Assistant), a multimodal model trained via visual instruction tuning on GPT-4-generated multimodal instruction-following data. LLaVA demonstrates strong multimodal chat abilities, yielding an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset, and pioneered open-source visual instruction tuning. When fine-tuned on Science QA, LLaVA combined with GPT-4 achieves a state-of-the-art 92.53% accuracy.
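The instruction-tuning data pairs each image with a multi-turn conversation. Below is a minimal sketch of one training record in the LLaVA conversation style; the field names follow the publicly released llava_instruct JSON files, while the image path and conversation text are illustrative:

```python
# One record from a LLaVA-style visual instruction-tuning dataset.
# Field names follow the released llava_instruct JSON files; the
# image path and conversation text here are illustrative.
record = {
    "id": "000000123456",
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {
            "from": "human",
            # "<image>" marks where the vision encoder's tokens are inserted.
            "value": "<image>\nWhat is unusual about this scene?",
        },
        {
            "from": "gpt",
            "value": "A man is ironing clothes on a board attached to "
                     "the roof of a moving taxi.",
        },
    ],
}
```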

https://arxiv.org/abs/2304.08485
Overall grade: B+ (Good)
Adoption: A+ · Quality: A+ · Freshness: A · Citations: A · Engagement: F

Specifications

License: Apache 2.0
Pricing: open-source
Capabilities: visual-question-answering, image-description, multimodal-chat, instruction-following
Integrations: huggingface, ollama (see the loading sketch after this list)
Use Cases: visual-qa, image-analysis, multimodal-assistants
API Available: No
Tags: llava, multimodal, instruction-tuning, vision-language, open-source
Added: 2026-03-17
Completeness: 100%
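Since the weights are published on the Hugging Face Hub, a quick way to try the model is via the transformers library. A minimal sketch, assuming the community llava-hf/llava-1.5-7b-hf checkpoint (this page lists "huggingface" as an integration without naming a specific repo) and a sample COCO image URL:

```python
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed community checkpoint; not named on this page.
model_id = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Sample image for illustration.
image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
).raw)

# LLaVA-1.5 prompt template: the <image> token is replaced by visual tokens.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

For local use without Python, the ollama integration listed above serves a LLaVA build via `ollama run llava`.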

Index Score

Overall: 76
Adoption: 90 · Quality: 90 · Freshness: 82 · Citations: 88 · Engagement: 0
