
Computer Vision

Image recognition, image generation, and video analysis

36 entities in this channel

Dataset · Computer Vision

ImageNet-1K

by ImageNet / Stanford Vision Lab

The canonical large-scale visual recognition benchmark containing 1.28 million training images across 1,000 object categories. ImageNet-1K underpins the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and has driven the majority of deep learning breakthroughs in computer vision since 2012.

image-classification · object-recognition · benchmark
Score: 83.3 (A)
Dataset · Computer Vision

COCO 2017

by Microsoft

Microsoft COCO (Common Objects in Context) 2017 provides 118K training images with 860K object instances annotated with bounding boxes, segmentation masks, keypoints, and captions across 80 object categories. It remains the primary benchmark for object detection and instance segmentation research.

object-detection · segmentation · keypoints
Score: 82.5 (A)
Paper · Computer Vision

Learning Transferable Visual Models From Natural Language Supervision (CLIP)

by OpenAI

Introduced CLIP (Contrastive Language-Image Pre-training), a model trained on 400 million image-text pairs using contrastive learning that achieves remarkable zero-shot transfer to diverse vision tasks. CLIP became foundational for vision-language alignment and generative AI pipelines.

clip · contrastive-learning · zero-shot
Score: 82.2 (A)
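CLIP's zero-shot transfer reduces to a nearest-neighbour lookup in the shared embedding space: embed the image, embed one text prompt per candidate label, and pick the most similar pair. A minimal sketch with hand-made toy vectors standing in for the real encoders' 512-d outputs:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def zero_shot_classify(image_emb, label_embs):
    # Pick the label whose text embedding is closest to the image embedding.
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    return max(scores, key=scores.get)

# Toy 3-d embeddings standing in for CLIP encoder outputs.
labels = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
}
image = [0.8, 0.2, 0.1]  # a "dog-like" image embedding
print(zero_shot_classify(image, labels))  # → a photo of a dog
```

The prompt templates ("a photo of a …") matter in practice; the paper ensembles many of them per class.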
Paper · Computer Vision

High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)

by CompVis / Stability AI

Introduced Latent Diffusion Models (LDMs), which perform the diffusion process in a compressed latent space rather than pixel space, dramatically reducing computational cost while maintaining image quality. This work underpins Stable Diffusion, the most widely used open-source image generation model.

stable-diffusion · latent-diffusion · text-to-image
Score: 82 (A)
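The cost saving is visible in simple element counts: Stable Diffusion's autoencoder compresses a 512×512 RGB image into a 64×64×4 latent, so each denoising step touches roughly 48× fewer values (element counts only, not a full FLOP comparison):

```python
# Elements the denoising network must process per step.
pixel_elems = 512 * 512 * 3   # pixel-space diffusion on the full RGB image
latent_elems = 64 * 64 * 4    # 8x-downsampled, 4-channel latent used by LDMs
ratio = pixel_elems / latent_elems
print(pixel_elems, latent_elems, ratio)  # 786432 16384 48.0
```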
Paper · Computer Vision

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

by Google Brain

Introduced the Vision Transformer (ViT), demonstrating that a pure transformer applied directly to sequences of image patches achieves state-of-the-art performance on image classification when pretrained on large datasets. The paper challenged the dominance of convolutional neural networks in computer vision.

vision-transformer · image-classification · attention
Score: 81.9 (A)
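The "16×16 words" of the title are just non-overlapping patches fed to the transformer as tokens; the shape arithmetic for the standard ViT-Base/16 input is:

```python
def patchify_shape(image_size=224, patch_size=16, channels=3):
    # Number of non-overlapping patches and the flattened length of each,
    # as in the ViT-Base/16 configuration the paper's title refers to.
    n_side = image_size // patch_size
    num_patches = n_side * n_side
    patch_dim = patch_size * patch_size * channels
    return num_patches, patch_dim

print(patchify_shape())  # (196, 768) — 14x14 patch tokens from a 224x224 image
```

Each 768-d flattened patch is then linearly projected to the model width before the transformer layers.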
Benchmark · Computer Vision

ImageNet

by Deng et al. / Stanford / Princeton

ImageNet (ILSVRC) is the foundational large-scale visual recognition benchmark with 1.2 million training images across 1,000 object categories. Top-1 and Top-5 accuracy on the validation set have been the standard measure of progress in image classification for over a decade.

image-classification · vision · top-1-accuracy
Score: 81.2 (A)
Benchmark · Computer Vision

COCO Detection

by Lin et al. / Microsoft

COCO Detection is the standard benchmark for object detection and instance segmentation, featuring 330,000 images with over 1.5 million annotated instances across 80 object categories. Mean Average Precision (mAP) at various IoU thresholds is the primary metric.

object-detection · instance-segmentation · vision
Score: 80.2 (A)
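The IoU quantity thresholded when computing COCO mAP can be written directly, with boxes as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    # Intersection-over-Union of two axis-aligned boxes; COCO mAP averages
    # precision over IoU thresholds from 0.5 to 0.95.
    ix1 = max(box_a[0], box_b[0]); iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2]); iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.1429
```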
Paper · Computer Vision

Segment Anything

by Meta AI

Introduced the Segment Anything Model (SAM) and the SA-1B dataset of 1 billion masks on 11 million images. SAM is a promptable segmentation foundation model that generalizes to new image distributions and tasks without additional training, enabling a new paradigm of interactive segmentation.

segmentation · foundation-model · promptable
Score: 79.2 (B+)
Model · Computer Vision

Midjourney V6

by Midjourney

Midjourney V6 represents a major leap in photorealism, prompt adherence, and artistic coherence, setting a new industry benchmark for AI image generation quality. It introduced native text rendering within images and dramatically improved its understanding of complex, multi-subject prompts.

image-generation · text-to-image · creative-ai
Score: 77.2 (B+)
Dataset · Computer Vision

SA-1B (Segment Anything)

by Meta AI

SA-1B is Meta AI's massive segmentation dataset released alongside the Segment Anything Model (SAM), containing over 1 billion high-quality segmentation masks across 11 million diverse, high-resolution images. It is the largest segmentation dataset ever created and enables training of generalist vision models with strong zero-shot transfer capabilities.

segmentation · SAM · foundation-model
Score: 77.2 (B+)
Paper · Computer Vision

Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)

by OpenAI

Presented DALL-E 2 (unCLIP), a hierarchical text-conditional image generation system using CLIP image embeddings as a prior and a diffusion decoder. The system achieves state-of-the-art photorealism and text-image alignment, substantially advancing the field of text-to-image synthesis.

dall-e-2 · text-to-image · diffusion
Score: 77.1 (B+)
Dataset · Computer Vision

Open Images V7

by Google

Google's Open Images V7 is one of the largest existing datasets with object-level annotations, containing approximately 9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives across 600+ object classes.

object-detection · segmentation · visual-relationships
Score: 76.1 (B+)
Benchmark · Computer Vision

ADE20K Segmentation

by Zhou et al. / MIT CSAIL

ADE20K is the benchmark for semantic scene parsing, containing 25,000 images densely annotated with 150 semantic categories. Mean Intersection over Union (mIoU) is the standard metric, and it drives progress in perception systems for autonomous driving, robotics, and scene understanding.

semantic-segmentation · scene-parsing · vision
Score: 76 (B+)
Paper · Computer Vision

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

by Stability AI

Presented SDXL, a significantly improved latent diffusion model architecture featuring a 3.5B parameter UNet backbone with a secondary refiner model, conditioning on image size and crop parameters, and a curated high-aesthetic dataset. SDXL substantially improves visual quality and prompt adherence over prior Stable Diffusion versions.

sdxl · stable-diffusion · text-to-image
Score: 74.5 (B+)
Model · Computer Vision

Stable Diffusion XL

by Stability AI

Stability AI's high-resolution image generation model producing photorealistic and artistic images at 1024x1024 resolution. Features a two-stage architecture with a base model and refiner for enhanced detail and compositional quality.

image-generation · diffusion · open-source
Score: 74.4 (B+)
Dataset · Computer Vision

LAION-5B

by LAION

The largest openly available image-text pair dataset, containing 5.85 billion CLIP-filtered image-text pairs across English, multilingual, and aesthetic subsets. LAION-5B was the primary training corpus for Stable Diffusion, DALL-E 2 replications, and numerous open vision-language models, enabling the open-source community to train competitive text-to-image generation models.

multimodal · image-text · large-scale
Score: 74.2 (B+)
Dataset · Computer Vision

ADE20K Dataset

by MIT CSAIL

ADE20K is a densely annotated semantic segmentation dataset containing over 27,000 images with pixel-level annotations for 150 semantic categories covering both indoor and outdoor scenes. It is the primary benchmark for scene parsing and semantic segmentation tasks in the computer vision community.

semantic-segmentation · scene-parsing · scene-understanding
Score: 74.2 (B+)
Model · Computer Vision

DALL-E 3

by OpenAI

OpenAI's most advanced image generation model with native ChatGPT integration. Features dramatically improved prompt following, text rendering, and safety mitigations compared to DALL-E 2, generating high-fidelity images from natural language descriptions.

image-generation · text-to-image · creative
Score: 72.2 (B+)
Benchmark · Computer Vision

Flickr30k

by Young et al. / University of Illinois

Flickr30k is a benchmark for image-text retrieval and visual grounding, comprising 31,783 Flickr images each paired with five human-written captions. Models are evaluated on bidirectional image-to-text and text-to-image retrieval recall at ranks 1, 5, and 10.

image-captioning · visual-grounding · retrieval
Score: 70.9 (B+)
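Recall@k as used here is simply whether the gold item appears in the top-k retrieved results, averaged over queries. A sketch with hypothetical query results:

```python
def recall_at_k(ranked_ids, gold_id, k):
    # 1 if the correct caption/image appears in the top-k retrieved results.
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def mean_recall_at_k(results, k):
    # results: list of (ranked candidate ids, gold id), one pair per query.
    return sum(recall_at_k(r, g, k) for r, g in results) / len(results)

queries = [
    (["img7", "img2", "img9"], "img2"),  # gold retrieved at rank 2
    (["img4", "img1", "img8"], "img8"),  # gold retrieved at rank 3
]
print(mean_recall_at_k(queries, 1))  # 0.0
print(mean_recall_at_k(queries, 5))  # 1.0
```

Flickr30k reports this in both directions (image→text and text→image) at k = 1, 5, and 10.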
Benchmark · Computer Vision

VQA v2

by Georgia Tech / Virginia Tech

Visual Question Answering benchmark requiring models to answer open-ended questions about images. Version 2 balances the dataset to reduce language biases, ensuring models must genuinely understand image content rather than relying on question-type priors.

benchmark · evaluation · multimodal
Score: 70.3 (B+)
Model · Computer Vision

FLUX 1.1 Pro

by Black Forest Labs

FLUX 1.1 Pro from Black Forest Labs is a next-generation text-to-image model built by the original creators of Stable Diffusion, offering superior prompt comprehension, anatomical accuracy, and photorealistic detail. It delivers exceptional speed and quality; the FLUX.1 family spans the API-only Pro tier and the open-weights Dev and Schnell variants for different use cases.

image-generation · text-to-image · open-source
Score: 70.1 (B+)
Skill · Computer Vision

Object Detection

by AaaS

A core computer vision skill that enables agents to identify and locate objects within an image or video stream. By predicting bounding boxes and class labels for each object, this skill forms the foundation for environmental understanding. It is crucial for applications requiring spatial awareness, from autonomous navigation to automated inspection.

computer-vision · object-detection · bounding-box
Score: 68.3 (B)
Model · Computer Vision

Sora

by OpenAI

Sora is a text-to-video diffusion transformer model by OpenAI that generates high-fidelity, minute-long videos from textual prompts. It demonstrates an advanced understanding of language and the physical world, enabling complex scenes with multiple characters, specific motions, and coherent narratives.

video-generation · text-to-video · openai
Score: 68 (B)
Script · Computer Vision

Object Detection Setup

by Ultralytics

Bootstraps a production-ready object detection workflow using YOLOv8 or RT-DETR, including webcam/video stream ingestion, NMS post-processing, and annotation overlay rendering. Outputs annotated frames and a structured JSON detections log suitable for downstream analytics.

object-detection · yolo · bounding-boxes
Score: 67.9 (B)
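The NMS post-processing step this script mentions can be sketched class-agnostically in plain Python (the (box, score) detection format is an illustrative assumption; real YOLOv8/RT-DETR outputs are tensors):

```python
def iou(a, b):
    # IoU of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def nms(detections, iou_thresh=0.5):
    # detections: list of (box, score). Greedily keep the highest-scoring box,
    # then drop any remaining box that overlaps it above iou_thresh.
    keep = []
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    while remaining:
        best = remaining.pop(0)
        keep.append(best)
        remaining = [d for d in remaining if iou(best[0], d[0]) < iou_thresh]
    return keep

dets = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((20, 20, 30, 30), 0.7)]
print(len(nms(dets)))  # 2 — the two heavily overlapping boxes collapse to one
```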
Model · Computer Vision

Stable Diffusion 3

by Stability AI

Stable Diffusion 3 is a powerful text-to-image model using a Multimodal Diffusion Transformer (MMDiT) architecture. It excels at generating images with unprecedented text quality, adhering closely to complex prompts, and achieving high photorealism and compositional accuracy compared to its predecessors.

image-generation · diffusion · text-to-image
Score: 67.55 (B)
Benchmark · Computer Vision

MMMU

by CUHK / Waterloo

MMMU is a challenging multimodal benchmark designed to evaluate large models on expert-level tasks. It contains over 11,500 college-level problems spanning six core disciplines, requiring models to integrate deep subject knowledge with visual perception to answer multiple-choice questions with detailed reasoning.

benchmark · evaluation · multimodal
Score: 66.9 (B)
Skill · Computer Vision

Visual Question Answering

by AaaS

Enables agents to answer free-form natural language questions about images by grounding language in visual features. Covers prompt construction for vision-language models, chain-of-thought visual reasoning, and failure modes such as hallucination and spatial confusion.

vqa · vision-language · multimodal
Score: 66.8 (B)
Skill · Computer Vision

OCR Pipeline

by AaaS

Builds end-to-end pipelines for extracting structured text from images, scanned documents, and PDFs using OCR engines combined with layout analysis. Teaches preprocessing, engine selection (Tesseract, PaddleOCR, Google Document AI), post-correction, and handoff to language models for structured extraction.

ocr · document-parsing · text-extraction
Score: 65.1 (B)
Skill · Computer Vision

Image Generation Prompting

by AaaS

Master structured prompting for text-to-image diffusion models like Stable Diffusion and Midjourney. Learn to control style, composition, and quality using techniques such as negative prompting, LoRA weights, and iterative refinement. This skill enables the programmatic generation of consistent, on-brand imagery at scale.

image-generation · diffusion · prompt-craft
Score: 64.6 (B)
Skill · Computer Vision

Image Segmentation

by AaaS

Covers semantic, instance, and panoptic segmentation techniques that enable agents to produce pixel-level masks for scene understanding. Includes practical guidance on using SAM 2, Mask R-CNN, and integrating segmentation outputs into multimodal pipelines.

vision · segmentation · SAM
Score: 62.7 (B)
Script · Computer Vision

Image Classification Pipeline

by Community

End-to-end image classification pipeline that handles dataset loading, preprocessing, model inference, and result export using PyTorch and torchvision. Supports batch inference against any Hugging Face ViT or ResNet checkpoint with configurable confidence thresholds.

image-classification · vision · pytorch
Score: 62.7 (B)
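The confidence-threshold logic such a pipeline applies after inference is straightforward: softmax the logits, rank, and drop low-probability classes. A sketch on toy logits (the label names are hypothetical):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_k(logits, labels, k=3, threshold=0.1):
    # Pair each label with its probability, keep the k best above threshold.
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda p: p[1], reverse=True)
    return [(lbl, round(p, 3)) for lbl, p in ranked[:k] if p >= threshold]

labels = ["cat", "dog", "fox"]
print(top_k([2.0, 1.0, -1.0], labels))  # [('cat', 0.705), ('dog', 0.259)]
```

The real script would obtain the logits from a Hugging Face ViT or ResNet checkpoint; only the post-processing is shown here.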
Script · Computer Vision

OCR Pipeline Script

by Community

This script provides a sophisticated OCR pipeline that intelligently routes documents to the most suitable engine—Tesseract, PaddleOCR, or a cloud API—based on image quality analysis. It processes various document types and outputs structured JSON containing text sorted by reading order, complete with bounding box coordinates and confidence scores for each word or line.

ocr · text-extraction · document-ai
Score: 62.1 (B)
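Two pieces of the described pipeline, quality-based engine routing and reading-order sorting, can be sketched with toy heuristics (the thresholds, engine names, and (text, x, y) word format are illustrative assumptions, not the script's actual interface):

```python
def route_engine(sharpness, has_tables):
    # Hypothetical routing rule: degraded scans go to a cloud API,
    # table-heavy documents to PaddleOCR, clean text to Tesseract.
    if sharpness < 0.3:
        return "cloud-api"
    return "paddleocr" if has_tables else "tesseract"

def reading_order(words):
    # words: list of (text, x, y). Sort top-to-bottom then left-to-right,
    # bucketing y into coarse rows so baseline jitter doesn't reorder words.
    return [w[0] for w in sorted(words, key=lambda w: (w[2] // 20, w[1]))]

words = [("world", 80, 12), ("hello", 10, 10), ("line2", 10, 55)]
print(route_engine(0.8, False))        # tesseract
print(" ".join(reading_order(words)))  # hello world line2
```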
Script · Computer Vision

Image Segmentation Script

by Meta AI

Runs Segment Anything Model (SAM 2) or Mask2Former on image batches, producing per-pixel segmentation masks with class labels and confidence scores. Includes utilities for mask overlay visualization and RLE-encoded mask export compatible with COCO annotation format.

segmentation · sam · mask
Score: 62 (B)
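The RLE mask export the script mentions alternates run lengths starting with the count of leading zeros, as in COCO's uncompressed format. A 1-D sketch (real COCO RLE runs column-major over the full 2-D mask):

```python
def rle_encode(mask):
    # Run-length encode a flat binary mask: counts alternate between runs
    # of 0s and 1s, starting with the number of leading zeros (possibly 0).
    counts, prev, run = [], 0, 0
    for bit in mask:
        if bit == prev:
            run += 1
        else:
            counts.append(run)
            prev, run = bit, 1
    counts.append(run)
    return counts

mask = [0, 0, 1, 1, 1, 0, 1]
print(rle_encode(mask))  # [2, 3, 1, 1]
```

A mask beginning with a 1 gets a leading 0 count, so decoders always know the first run is background.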
Script · Computer Vision

Visual Search Engine

by Community

This script provides a complete framework for building a multimodal visual search engine. It uses CLIP to generate image and text embeddings, which are indexed in a vector database like Qdrant or Weaviate for efficient similarity search. The system supports both text-to-image and image-to-image queries and includes a FastAPI server for API access.

visual-search · image-embeddings · similarity-search
Score: 59.4 (C+)
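Stripped of the vector database and the FastAPI server, the core query path is cosine ranking over stored embeddings. A toy in-memory sketch (the image ids and 3-d vectors are made up; CLIP would supply real embeddings):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def search(query_emb, index, top_k=2):
    # index: image id -> embedding. Rank by cosine similarity; a vector
    # database does the same thing at scale with approximate-NN indexes.
    ranked = sorted(index, key=lambda k: cosine(query_emb, index[k]), reverse=True)
    return ranked[:top_k]

index = {
    "beach.jpg":  [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg":   [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]  # embedding of a text query such as "sunny beach"
print(search(query, index))  # ['beach.jpg', 'forest.jpg']
```

Because CLIP places text and image embeddings in one space, the same function serves both text-to-image and image-to-image queries.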
Skill · Computer Vision

Video Understanding

by AaaS

Covers temporal reasoning over video streams, including frame sampling strategies, action recognition, scene change detection, and dense video captioning. Teaches agents to leverage video-native models (Gemini 1.5 Pro, Video-LLaVA) and build efficient pipelines that avoid processing every frame.

video · temporal · action-recognition
Score: 57.3 (C+)
Script · Computer Vision

Face Recognition Setup

by Community

Configures a face recognition system using InsightFace or DeepFace, supporting gallery enrollment, real-time identification against a FAISS vector store, and liveness detection. Designed with privacy-first defaults and includes GDPR-compliant consent logging.

face-recognition · biometrics · deepface
Score: 53.5 (C+)