Computer Vision
Image recognition, generation, video analysis
36 entities in this channel
ImageNet-1K
by ImageNet / Stanford Vision Lab
The canonical large-scale visual recognition benchmark containing 1.28 million training images across 1,000 object categories. ImageNet-1K underpins the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and has driven the majority of deep learning breakthroughs in computer vision since 2012.
COCO 2017
by Microsoft
Microsoft COCO (Common Objects in Context) 2017 provides 118K training images with 860K object instances annotated with bounding boxes, segmentation masks, keypoints, and captions across 80 object categories. It remains the primary benchmark for object detection and instance segmentation research.
Learning Transferable Visual Models From Natural Language Supervision (CLIP)
by OpenAI
Introduced CLIP (Contrastive Language-Image Pre-training), a model trained on 400 million image-text pairs using contrastive learning that achieves remarkable zero-shot transfer to diverse vision tasks. CLIP became foundational for vision-language alignment and generative AI pipelines.
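Zero-shot classification with CLIP reduces to comparing an image embedding against text embeddings of candidate labels. A minimal sketch using the Hugging Face transformers checkpoint of CLIP; the image path and label set are hypothetical:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical input image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarities scaled by CLIP's learned temperature.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```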
High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)
by CompVis / Stability AI
Introduced Latent Diffusion Models (LDMs), which perform the diffusion process in a compressed latent space rather than pixel space, dramatically reducing computational cost while maintaining image quality. This work underpins Stable Diffusion, the most widely used open-source image generation model.
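The cost saving comes from where the diffusion runs. A small sketch with the diffusers AutoencoderKL (assuming the stabilityai/sd-vae-ft-mse checkpoint and a dummy input) shows the compression: a 512x512 RGB image becomes a 64x64x4 latent, roughly 48x fewer values for the denoiser to process:

```python
import torch
from diffusers import AutoencoderKL

# The VAE used by Stable Diffusion downsamples 8x per spatial dimension.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# A dummy 512x512 RGB image scaled to [-1, 1], as the VAE expects.
pixels = torch.rand(1, 3, 512, 512) * 2 - 1

with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample()

print(pixels.shape, "->", latents.shape)  # (1, 3, 512, 512) -> (1, 4, 64, 64)
```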
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
by Google Brain
Introduced the Vision Transformer (ViT), demonstrating that a pure transformer applied directly to sequences of image patches achieves state-of-the-art performance on image classification when pretrained on large datasets. The paper challenged the dominance of convolutional neural networks in computer vision.
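The "16x16 words" idea is just a reshape plus a linear projection. A self-contained sketch of the patch embedding step in plain PyTorch, using ViT-Base defaults (224x224 input, 16x16 patches, 768-dim tokens):

```python
import torch
import torch.nn as nn

B, C, H, W, P, D = 1, 3, 224, 224, 16, 768  # ViT-Base defaults

x = torch.randn(B, C, H, W)

# Cut the image into non-overlapping 16x16 patches ...
patches = x.unfold(2, P, P).unfold(3, P, P)  # (B, C, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)  # (B, 196, 768)

# ... and linearly project each flattened patch into a token embedding.
embed = nn.Linear(C * P * P, D)
tokens = embed(patches)  # (B, 196, 768): a sequence the transformer can consume
print(tokens.shape)
```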
ImageNet
by Deng et al. / Stanford / Princeton
ImageNet (ILSVRC) is the foundational large-scale visual recognition benchmark with 1.28 million training images across 1,000 object categories. Top-1 and Top-5 accuracy on the validation set have been the standard measure of progress in image classification for over a decade.
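Top-k accuracy is straightforward to compute from raw logits; a minimal PyTorch sketch with dummy tensors standing in for real model outputs:

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 5) -> float:
    """Fraction of samples whose true label is among the k highest logits."""
    topk = logits.topk(k, dim=1).indices              # (N, k)
    hits = (topk == labels.unsqueeze(1)).any(dim=1)   # (N,)
    return hits.float().mean().item()

logits = torch.randn(8, 1000)          # dummy predictions over 1,000 classes
labels = torch.randint(0, 1000, (8,))
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```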
COCO Detection
by Lin et al. / Microsoft
COCO Detection is the standard benchmark for object detection and instance segmentation, featuring 330,000 images with over 1.5 million annotated instances across 80 object categories. Mean Average Precision (mAP) at various IoU thresholds is the primary metric.
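mAP counts a detection as correct only when its IoU with a ground-truth box exceeds a threshold (COCO averages over thresholds from 0.5 to 0.95). The underlying IoU computation, sketched with illustrative boxes:

```python
def box_iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# 25 overlap / 175 union ~= 0.143: a miss at the common 0.5 threshold.
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))
```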
Segment Anything
by Meta AI
Introduced the Segment Anything Model (SAM) and the SA-1B dataset of 1 billion masks on 11 million images. SAM is a promptable segmentation foundation model that generalizes to new image distributions and tasks without additional training, enabling a new paradigm of interactive segmentation.
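Promptable segmentation in practice: the official segment_anything package embeds the image once, then answers point or box prompts interactively. A sketch assuming the ViT-H checkpoint has been downloaded from the SAM repository and a hypothetical image and click location:

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # one-time image embedding

# Prompt with a single foreground point; SAM returns candidate masks.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),      # 1 = foreground, 0 = background
    multimask_output=True,
)
print(masks.shape, scores)           # (3, H, W) boolean masks with confidences
```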
Midjourney V6
by Midjourney
Midjourney V6 represents a major leap in photorealism, prompt adherence, and artistic coherence, setting a new industry benchmark for AI image generation quality. It introduced native text rendering within images and dramatically improved its understanding of complex, multi-subject prompts.
SA-1B (Segment Anything)
by Meta AI
SA-1B is Meta AI's massive segmentation dataset released alongside the Segment Anything Model (SAM), containing over 1 billion high-quality segmentation masks across 11 million diverse, high-resolution images. It is the largest publicly released segmentation dataset to date and enables training of generalist vision models with strong zero-shot transfer capabilities.
Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)
by OpenAI
Presented DALL-E 2 (unCLIP), a hierarchical text-conditional image generation system using CLIP image embeddings as a prior and a diffusion decoder. The system achieves state-of-the-art photorealism and text-image alignment, substantially advancing the field of text-to-image synthesis.
Open Images V7
by Google
Google's Open Images V7 is one of the largest existing datasets with object-level annotations, containing approximately 9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives across 600+ object classes.
ADE20K Segmentation
by Zhou et al. / MIT CSAIL
ADE20K is the benchmark for semantic scene parsing, containing 25,000 images densely annotated with 150 semantic categories. Mean Intersection over Union (mIoU) is the standard metric, and it drives progress in perception systems for autonomous driving, robotics, and scene understanding.
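mIoU averages per-class IoU over all classes present in either prediction or ground truth. A minimal NumPy sketch with random label maps standing in for real model output:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 150) -> float:
    """mIoU from per-pixel predictions and ground-truth labels of the same shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                     # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 150, (512, 512))
gt = np.random.randint(0, 150, (512, 512))
print(mean_iou(pred, gt))
```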
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
by Stability AI
Presented SDXL, a significantly improved latent diffusion model architecture featuring a 3.5B parameter UNet backbone with a secondary refiner model, conditioning on image size and crop parameters, and a curated high-aesthetic dataset. SDXL substantially improves visual quality and prompt adherence over prior Stable Diffusion versions.
Stable Diffusion XL
by Stability AI
Stability AI's high-resolution image generation model producing photorealistic and artistic images at 1024x1024 resolution. Features a two-stage architecture with a base model and refiner for enhanced detail and compositional quality.
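The two-stage handoff follows the ensemble-of-experts pattern documented for diffusers: the base model handles early denoising and passes latents to the refiner for the final high-frequency detail. A sketch with a hypothetical prompt and output filename:

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a lighthouse on a rocky coast at dusk, dramatic clouds"

# Base runs the first 80% of denoising steps and hands off raw latents ...
latents = base(prompt=prompt, num_inference_steps=40,
               denoising_end=0.8, output_type="latent").images
# ... and the refiner finishes the remaining 20% for fine detail.
image = refiner(prompt=prompt, num_inference_steps=40,
                denoising_start=0.8, image=latents).images[0]
image.save("lighthouse.png")
```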
LAION-5B
by LAION
The largest openly available image-text pair dataset, containing 5.85 billion CLIP-filtered image-text pairs across English, multilingual, and aesthetic subsets. LAION-5B was the primary training corpus for Stable Diffusion, DALL-E 2 replications, and numerous open vision-language models, enabling the open-source community to train competitive text-to-image generation models.
ADE20K Dataset
by MIT CSAIL
ADE20K is a densely annotated semantic segmentation dataset containing over 27,000 images with pixel-level annotations for 150 semantic categories covering both indoor and outdoor scenes. It is the primary benchmark for scene parsing and semantic segmentation tasks in the computer vision community.
DALL-E 3
by OpenAI
OpenAI's most advanced image generation model with native ChatGPT integration. Features dramatically improved prompt following, text rendering, and safety mitigations compared to DALL-E 2, generating high-fidelity images from natural language descriptions.
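Programmatic access goes through the OpenAI Images API; a minimal sketch with the official Python SDK (v1+) and a hypothetical prompt. Note that DALL-E 3 rewrites prompts server-side, and the revision is returned alongside the image:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",
    prompt="an isometric illustration of a solar-powered weather station",
    size="1024x1024",
    quality="standard",
    n=1,
)
print(response.data[0].url)             # hosted URL of the generated image
print(response.data[0].revised_prompt)  # the internally rewritten prompt
```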
Flickr30k
by Young et al. / University of Illinois
Flickr30k is a benchmark for image-text retrieval and visual grounding, comprising 31,783 Flickr images each paired with five human-written captions. Models are evaluated on bidirectional image-to-text and text-to-image retrieval recall at ranks 1, 5, and 10.
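Recall@K checks whether the true match appears among a query's K nearest neighbors. A sketch with a random similarity matrix, simplified to one caption per image (Flickr30k actually pairs five captions with each image):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """sim[i, j] = similarity of image i and caption j; the match is the diagonal."""
    ranks = (-sim).argsort(axis=1)       # captions sorted best-first per image
    hits = [i in ranks[i, :k] for i in range(sim.shape[0])]
    return float(np.mean(hits))

sim = np.random.randn(100, 100)          # stand-in for model similarity scores
for k in (1, 5, 10):
    print(f"R@{k} = {recall_at_k(sim, k):.3f}")
```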
VQA v2
by Georgia Tech / VT
Visual Question Answering benchmark requiring models to answer open-ended questions about images. Version 2 balances the dataset to reduce language biases, ensuring models must genuinely understand image content rather than relying on question-type priors.
FLUX 1.1 Pro
by Black Forest Labs
FLUX 1.1 Pro from Black Forest Labs is a next-generation text-to-image model built by the original creators of Stable Diffusion, offering superior prompt comprehension, anatomical accuracy, and photorealistic detail. It delivers exceptional speed and quality across the FLUX family's three variants: the API-served Pro tier and the open-weight Dev (non-commercial license) and Schnell (Apache 2.0) releases.
Object Detection
by AaaS
A core computer vision skill that enables agents to identify and locate objects within an image or video stream. By predicting bounding boxes and class labels for each object, this skill forms the foundation for environmental understanding. It is crucial for applications requiring spatial awareness, from autonomous navigation to automated inspection.
Sora
by OpenAI
Sora is a text-to-video diffusion transformer model by OpenAI that generates high-fidelity, minute-long videos from textual prompts. It demonstrates an advanced understanding of language and the physical world, enabling complex scenes with multiple characters, specific motions, and coherent narratives.
Object Detection Setup
by Ultralytics
Bootstraps a production-ready object detection workflow using YOLOv8 or RT-DETR, including webcam/video stream ingestion, NMS post-processing, and annotation overlay rendering. Outputs annotated frames and a structured JSON detections log suitable for downstream analytics.
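The ultralytics package bundles inference, NMS, and thresholding behind one call; a minimal sketch with the YOLOv8 nano checkpoint (auto-downloaded) and a hypothetical input image:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
results = model("street.jpg", conf=0.25)  # NMS and confidence filter built in

detections = []
for r in results:
    for box in r.boxes:
        detections.append({
            "label": model.names[int(box.cls)],
            "confidence": float(box.conf),
            "xyxy": [float(v) for v in box.xyxy[0]],
        })
print(detections)  # structured records ready for a JSON detections log
```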
Stable Diffusion 3
by Stability AI
Stable Diffusion 3 is a powerful text-to-image model using a Multimodal Diffusion Transformer (MMDiT) architecture. It excels at generating images with unprecedented text quality, adhering closely to complex prompts, and achieving high photorealism and compositional accuracy compared to its predecessors.
MMMU
by CUHK / Waterloo
MMMU is a challenging multimodal benchmark designed to evaluate large models on expert-level tasks. It contains over 11,500 college-level problems spanning six core disciplines, requiring models to integrate deep subject knowledge with visual perception to answer multiple-choice questions with detailed reasoning.
Visual Question Answering
by AaaS
Enables agents to answer free-form natural language questions about images by grounding language in visual features. Covers prompt construction for vision-language models, chain-of-thought visual reasoning, and failure modes such as hallucination and spatial confusion.
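A minimal grounding sketch using the Salesforce BLIP VQA checkpoint from transformers; the image path and question are hypothetical:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("kitchen.jpg").convert("RGB")
question = "How many mugs are on the counter?"

# The processor fuses visual features and the tokenized question into one input.
inputs = processor(image, question, return_tensors="pt")
answer_ids = model.generate(**inputs)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
```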
OCR Pipeline
by AaaS
Builds end-to-end pipelines for extracting structured text from images, scanned documents, and PDFs using OCR engines combined with layout analysis. Teaches preprocessing, engine selection (Tesseract, PaddleOCR, Google Document AI), post-correction, and handoff to language models for structured extraction.
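With Tesseract via pytesseract, word-level boxes and confidences come from a single call; a sketch with a hypothetical scan and an illustrative confidence cutoff of 60:

```python
import pytesseract
from PIL import Image

image = Image.open("invoice.png")
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

# Keep confident words with their bounding boxes for structured extraction.
words = [
    {"text": data["text"][i], "conf": float(data["conf"][i]),
     "box": (data["left"][i], data["top"][i], data["width"][i], data["height"][i])}
    for i in range(len(data["text"]))
    if data["text"][i].strip() and float(data["conf"][i]) > 60
]
print(words[:5])
```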
Image Generation Prompting
by AaaS
Master structured prompting for text-to-image diffusion models like Stable Diffusion and Midjourney. Learn to control style, composition, and quality using techniques such as negative prompting, LoRA weights, and iterative refinement. This skill enables the programmatic generation of consistent, on-brand imagery at scale.
Image Segmentation
by AaaS
Covers semantic, instance, and panoptic segmentation techniques that enable agents to produce pixel-level masks for scene understanding. Includes practical guidance on using SAM 2, Mask R-CNN, and integrating segmentation outputs into multimodal pipelines.
Image Classification Pipeline
by Community
End-to-end image classification pipeline that handles dataset loading, preprocessing, model inference, and result export using PyTorch and torchvision. Supports batch inference against any Hugging Face ViT or ResNet checkpoint with configurable confidence thresholds.
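The core of such a pipeline fits in a few lines with transformers Auto classes; a sketch assuming the google/vit-base-patch16-224 checkpoint, hypothetical image paths, and an illustrative 0.5 confidence threshold:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

ckpt = "google/vit-base-patch16-224"   # any ViT or ResNet checkpoint works
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModelForImageClassification.from_pretrained(ckpt)

paths = ["a.jpg", "b.jpg"]
batch = [Image.open(p).convert("RGB") for p in paths]
inputs = processor(images=batch, return_tensors="pt")

with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)

conf, idx = probs.max(dim=-1)
for path, c, i in zip(paths, conf, idx):
    if c >= 0.5:                        # configurable confidence threshold
        print(path, model.config.id2label[int(i)], float(c))
```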
OCR Pipeline Script
by Community
This script provides a sophisticated OCR pipeline that intelligently routes documents to the most suitable engine—Tesseract, PaddleOCR, or a cloud API—based on image quality analysis. It processes various document types and outputs structured JSON containing text sorted by reading order, complete with bounding box coordinates and confidence scores for each word or line.
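One common proxy for image quality is the variance of the Laplacian, which drops on blurry scans. A routing sketch in that spirit, with OpenCV and purely illustrative thresholds (the engine names echo the entry above; real cutoffs need tuning per corpus):

```python
import cv2

def blur_score(path: str) -> float:
    """Variance of the Laplacian: low values indicate a blurry image."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def pick_engine(path: str) -> str:
    score = blur_score(path)
    if score > 500:          # sharp, clean scan: fast local engine suffices
        return "tesseract"
    elif score > 100:        # moderate quality: stronger local model
        return "paddleocr"
    return "cloud_api"       # degraded input: escalate to a cloud service

print(pick_engine("scan.png"))
```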
Image Segmentation Script
by Meta AI
Runs Segment Anything Model (SAM 2) or Mask2Former on image batches, producing per-pixel segmentation masks with class labels and confidence scores. Includes utilities for mask overlay visualization and RLE-encoded mask export compatible with COCO annotation format.
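The COCO-compatible RLE export mentioned above uses pycocotools; a sketch with a dummy binary mask:

```python
import numpy as np
from pycocotools import mask as mask_utils

binary_mask = np.zeros((480, 640), dtype=np.uint8)
binary_mask[100:200, 150:300] = 1              # dummy object mask

# COCO RLE encoding requires Fortran (column-major) memory layout.
rle = mask_utils.encode(np.asfortranarray(binary_mask))
area = float(mask_utils.area(rle))
rle["counts"] = rle["counts"].decode("ascii")  # bytes -> str for JSON export

print({"segmentation": rle, "area": area})
```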
Visual Search Engine
by Community
This script provides a complete framework for building a multimodal visual search engine. It uses CLIP to generate image and text embeddings, which are indexed in a vector database like Qdrant or Weaviate for efficient similarity search. The system supports both text-to-image and image-to-image queries and includes a FastAPI server for API access.
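The retrieval core, sketched with a plain NumPy matrix standing in for Qdrant or Weaviate; the gallery paths are hypothetical, and normalized embeddings make the dot product a cosine similarity:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = ["cat.jpg", "beach.jpg", "skyline.jpg"]   # hypothetical image gallery
with torch.no_grad():
    feats = model.get_image_features(
        **processor(images=[Image.open(p) for p in paths], return_tensors="pt"))
index = (feats / feats.norm(dim=-1, keepdim=True)).numpy()

def search(query: str, k: int = 3):
    with torch.no_grad():
        q = model.get_text_features(**processor(text=[query], return_tensors="pt"))
    q = (q / q.norm(dim=-1, keepdim=True)).numpy()
    scores = index @ q.T                          # cosine similarity per image
    return sorted(zip(paths, scores[:, 0]), key=lambda s: -s[1])[:k]

print(search("a city at night"))
```

Image-to-image queries follow the same pattern with get_image_features on the query side.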
Video Understanding
by AaaS
Covers temporal reasoning over video streams, including frame sampling strategies, action recognition, scene change detection, and dense video captioning. Teaches agents to leverage video-native models (Gemini 1.5 Pro, Video-LLaVA) and build efficient pipelines that avoid processing every frame.
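The simplest sampling strategy is uniform: decode only n evenly spaced frames so the whole clip is covered at fixed cost. A sketch with OpenCV and a hypothetical video file:

```python
import cv2
import numpy as np

def sample_frames(path: str, n: int = 8):
    """Uniformly sample n frames without decoding the full video."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, total - 1, n, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

frames = sample_frames("demo.mp4")   # pass these to a vision-language model
print(len(frames), frames[0].shape)
```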
Face Recognition Setup
by Community
Configures a face recognition system using InsightFace or DeepFace, supporting gallery enrollment, real-time identification against a FAISS vector store, and liveness detection. Designed with privacy-first defaults and includes GDPR-compliant consent logging.
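A minimal enrollment-and-identification sketch with InsightFace embeddings and a FAISS inner-product index (cosine similarity on the normalized vectors); the 0.35 match threshold is illustrative, and liveness checks and consent logging are omitted:

```python
import cv2
import faiss
import numpy as np
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")        # detection + 512-d ArcFace embeddings
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 -> GPU, -1 -> CPU

index = faiss.IndexFlatIP(512)              # inner product == cosine on unit vectors
names = []

def enroll(person: str, image_path: str):
    faces = app.get(cv2.imread(image_path))
    index.add(np.array([faces[0].normed_embedding], dtype=np.float32))
    names.append(person)

def identify(image_path: str, threshold: float = 0.35):
    faces = app.get(cv2.imread(image_path))
    query = np.array([faces[0].normed_embedding], dtype=np.float32)
    scores, ids = index.search(query, 1)
    return names[ids[0][0]] if scores[0][0] >= threshold else "unknown"
```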