CogVLM: Visual Expert for Pretrained Language Models
by Tsinghua University / Zhipu AI · free · Last verified 2026-03-17
CogVLM is a vision-language model that adds visual understanding to a pretrained large language model (LLM). It introduces a trainable visual expert module into each layer of a frozen LLM, enabling deep fusion of image and text features. This approach achieves state-of-the-art results on numerous vision-language benchmarks without altering the original language model's parameters.
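The per-layer routing described above can be sketched as follows. This is a minimal illustration of the idea, not CogVLM's actual implementation: the function names, weight shapes, and the use of a single linear projection per modality are assumptions for clarity.

```python
import numpy as np

def visual_expert_layer(hidden, is_image, w_text, w_img):
    """Route each token through modality-specific weights (sketch of the
    visual-expert idea): the frozen LLM projection (w_text) processes text
    tokens, while a trainable visual-expert projection (w_img) processes
    image tokens. Shapes: hidden (seq, d); w_text, w_img (d, d);
    is_image (seq,) boolean mask. All names here are illustrative."""
    out = np.empty_like(hidden)
    out[~is_image] = hidden[~is_image] @ w_text  # text path: frozen weights
    out[is_image] = hidden[is_image] @ w_img     # image path: trainable expert
    return out

# Toy example: 4 tokens (2 image, 2 text), hidden size 3.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 3))
is_image = np.array([True, True, False, False])
w_text = np.eye(3)        # stand-in for a frozen LLM projection
w_img = 2.0 * np.eye(3)   # stand-in for the trainable visual expert
out = visual_expert_layer(hidden, is_image, w_text, w_img)
```

Because only `w_img` would be trained, the text path is byte-for-byte the original LLM computation, which is why the base model's language behavior is preserved.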
https://arxiv.org/abs/2311.03079 ↗
Overall grade: B (Above Average)
- Adoption: B+
- Quality: A
- Freshness: A
- Citations: B
- Engagement: F
Specifications
- License
- Apache 2.0
- Pricing
- free
- Capabilities
- Visual Question Answering (VQA), Image Captioning, Visual Grounding, Complex Visual Reasoning, OCR-Free Text Understanding, Multi-turn Visual Dialogue, Object Detection via Text Queries, Detailed Image Description
- Integrations
- Use Cases
- API Available
- No
- Tags
- cogvlm, multimodal, visual-expert, deep-fusion, vision-language, large-language-model, computer-vision, visual-question-answering, llm-adaptation, state-of-the-art, open-source
- Added
- 2026-03-17
- Completeness
- 0.9%
Index Score: 63.4
- Adoption: 72
- Quality: 88
- Freshness: 83
- Citations: 68
- Engagement: 0