Skill · Computer Vision · v1.0

Visual Grounding

by AaaS · open-source · Last verified 2026-03-17

Trains agents to localize the specific image regions described by natural-language referring expressions, linking language to spatial visual understanding. Covers grounding models (Grounding DINO, Grounded SAM), evaluation metrics (Recall@k, written R@k, and mean average precision, mAP), and integration into tool-use agents for UI automation and document analysis.
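As a rough illustration of the R@k metric named above, the following is a minimal sketch, not the skill's own evaluation code. It assumes boxes are given as (x_min, y_min, x_max, y_max) tuples, that each expression has one ground-truth box, and that a prediction counts as a hit when its IoU with the ground truth is at least 0.5 (a common convention, assumed here). The names `iou` and `recall_at_k` are hypothetical helpers introduced only for this example.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    predictions: Sequence[Sequence[Box]],
    targets: Sequence[Box],
    k: int = 1,
    iou_threshold: float = 0.5,  # assumed threshold, not taken from the source
) -> float:
    """Fraction of expressions whose ground-truth box is matched
    by at least one of the top-k ranked predicted boxes."""
    if not targets:
        return 0.0
    hits = 0
    for ranked_boxes, target in zip(predictions, targets):
        if any(iou(box, target) >= iou_threshold for box in ranked_boxes[:k]):
            hits += 1
    return hits / len(targets)


# Hypothetical data: two referring expressions, predictions ranked by score.
example_predictions = [
    [(0, 0, 10, 10)],                    # first expression: top-1 is correct
    [(50, 50, 60, 60), (0, 0, 5, 5)],    # second expression: correct box is ranked second
]
example_targets = [(0, 0, 10, 10), (0, 0, 5, 5)]

r_at_1 = recall_at_k(example_predictions, example_targets, k=1)
r_at_2 = recall_at_k(example_predictions, example_targets, k=2)
```

In this toy data the second expression is only matched once the second-ranked box is considered, so R@2 is higher than R@1. In practice the ranked boxes would come from a grounding model's scored outputs.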

https://aaas.blog/skill/visual-grounding

Index Score: D (Poor)

Adoption: C+ · Quality: A · Freshness: A · Citations: F · Engagement: F

Specifications

License: MIT
Pricing: open-source
Capabilities: referring-expression-comprehension, region-proposal, open-vocabulary-detection, phrase-grounding, spatial-reasoning
Integrations: grounding-dino, grounded-sam, huggingface, roboflow
Use Cases: ui-automation, robot-manipulation, document-region-extraction, visual-search
API Available: No
Difficulty: advanced
Prerequisites: object-detection, visual-question-answering
Supported Agents: computer-use
Tags: grounding, referring-expression, region, vision-language
Added: 2026-03-17
Completeness: 80%

