Model · multimodal · v1.0

Video-LLaVA

by PKU-YuanLab · free · Last verified 2026-03-17

Video-LLaVA is an open video-language model that extends the LLaVA architecture with temporal video understanding capabilities, enabling detailed question answering and reasoning over video content. It achieves strong performance on video QA benchmarks by aligning visual features from both images and videos into a shared representation space.

https://huggingface.co/LanguageBind/Video-LLaVA-7B-hf
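Before a video reaches the model, it must be reduced to a fixed number of frames: the released 7B checkpoint's video tower consumes 8 uniformly sampled frames per clip. A minimal sketch of that subsampling step (the helper name and fallback behavior here are illustrative, not the model's exact implementation):

```python
import numpy as np

def sample_frame_indices(num_frames: int, clip_len: int = 8) -> np.ndarray:
    """Pick `clip_len` frame indices spread uniformly across a video.

    Video-LLaVA encodes a fixed-size clip (8 frames in the released
    7B checkpoint), so longer videos are subsampled before encoding.
    """
    if num_frames < clip_len:
        # Illustrative fallback: repeat frames when the video is
        # shorter than the clip length.
        return np.resize(np.arange(num_frames), clip_len)
    # Evenly spaced indices from the first frame to the last.
    return np.linspace(0, num_frames - 1, clip_len).astype(int)

# Example: a 100-frame video reduced to 8 representative frames.
print(sample_frame_indices(100))
```

The selected frames would then be decoded and passed, together with a text prompt, to the model's processor and generation API in `transformers` (the `VideoLlavaProcessor` / `VideoLlavaForConditionalGeneration` classes).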
Overall grade: C (Below Average)
Adoption: C · Quality: B+ · Freshness: B+ · Citations: C+ · Engagement: F

Specifications

License
Apache 2.0
Pricing
free
Capabilities
video-understanding, visual-question-answering, temporal-reasoning, image-understanding
Integrations
Hugging Face, Transformers
Use Cases
video-qa, video-analysis, temporal-reasoning, multimodal-research
API Available
No
Parameters
7B
Context Window
4K
Modalities
text, image, video
Training Cutoff
2023
Tags
video-understanding, vision-language, open-source, temporal-reasoning
Added
2026-03-17
Completeness
100%

Index Score

47
Adoption
45
Quality
76
Freshness
76
Citations
55
Engagement
0
