Model · multimodal · v1.0

Video-LLaVA

by PKU-YuanLab · free · Last verified 2026-03-17

Video-LLaVA is an open video-language model that extends the LLaVA architecture with temporal video understanding capabilities, enabling detailed question answering and reasoning over video content. It achieves strong performance on video QA benchmarks by aligning visual features from both images and videos into a shared representation space.

https://huggingface.co/LanguageBind/Video-LLaVA-7B-hf
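Before a video reaches the model, it must be reduced to a fixed number of frames: the released 7B checkpoint's video tower consumes 8 uniformly sampled frames per clip. A minimal sketch of that subsampling step (the helper name and fallback behavior here are illustrative, not the model's exact implementation):

```python
import numpy as np

def sample_frame_indices(num_frames: int, clip_len: int = 8) -> np.ndarray:
    """Pick `clip_len` frame indices spread uniformly across a video.

    Video-LLaVA encodes a fixed-size clip (8 frames in the released
    7B checkpoint), so longer videos are subsampled before encoding.
    """
    if num_frames < clip_len:
        # Illustrative fallback: repeat frames when the video is
        # shorter than the clip length.
        return np.resize(np.arange(num_frames), clip_len)
    # Evenly spaced indices from the first frame to the last.
    return np.linspace(0, num_frames - 1, clip_len).astype(int)

# Example: a 100-frame video reduced to 8 representative frames.
print(sample_frame_indices(100))
```

The selected frames would then be decoded and passed, together with a text prompt, to the model's processor and generation API in `transformers` (the `VideoLlavaProcessor` / `VideoLlavaForConditionalGeneration` classes).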
Overall grade: C (Below Average)
Adoption: C · Quality: B+ · Freshness: B+ · Citations: C+ · Engagement: F

Specifications

License
Apache 2.0
Pricing
free
Capabilities
video-understanding, visual-question-answering, temporal-reasoning, image-understanding
Integrations
Hugging Face, Transformers
Use Cases
video-qa, video-analysis, temporal-reasoning, multimodal-research
API Available
No
Parameters
7B
Context Window
4K
Modalities
text, image, video
Training Cutoff
2023
Tags
video-understanding, vision-language, open-source, temporal-reasoning
Added
2026-03-17
Completeness
100%

Index Score

47
Adoption
45
Quality
76
Freshness
76
Citations
55
Engagement
0
