Computer Vision
Image recognition, generation, video analysis
36 entities in this channel
ImageNet-1K
by ImageNet / Stanford Vision Lab
The canonical large-scale visual recognition benchmark containing 1.28 million training images across 1,000 object categories. ImageNet-1K underpins the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and has driven the majority of deep learning breakthroughs in computer vision since 2012.
COCO 2017
by Microsoft
Microsoft COCO (Common Objects in Context) 2017 provides 118K training images with 860K object instances annotated with bounding boxes, segmentation masks, keypoints, and captions across 80 object categories. It remains the primary benchmark for object detection and instance segmentation research.
Learning Transferable Visual Models From Natural Language Supervision (CLIP)
by OpenAI
Introduced CLIP (Contrastive Language-Image Pre-training), a model trained on 400 million image-text pairs using contrastive learning that achieves remarkable zero-shot transfer to diverse vision tasks. CLIP became foundational for vision-language alignment and generative AI pipelines.
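Zero-shot classification with CLIP reduces to comparing an image embedding against text embeddings of candidate labels. A minimal sketch using the Hugging Face transformers checkpoint of CLIP; the image path and label set are hypothetical:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical input image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarities scaled by CLIP's learned temperature.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```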
High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)
by CompVis / Stability AI
Introduced Latent Diffusion Models (LDMs), which perform the diffusion process in a compressed latent space rather than pixel space, dramatically reducing computational cost while maintaining image quality. This work underpins Stable Diffusion, the most widely used open-source image generation model.
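The cost saving comes from where the diffusion runs. A small sketch with the diffusers AutoencoderKL (assuming the stabilityai/sd-vae-ft-mse checkpoint and a dummy input) shows the compression: a 512x512 RGB image becomes a 64x64x4 latent, roughly 48x fewer values for the denoiser to process:

```python
import torch
from diffusers import AutoencoderKL

# The VAE used by Stable Diffusion downsamples 8x per spatial dimension.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# A dummy 512x512 RGB image scaled to [-1, 1], as the VAE expects.
pixels = torch.rand(1, 3, 512, 512) * 2 - 1

with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample()

print(pixels.shape, "->", latents.shape)  # (1, 3, 512, 512) -> (1, 4, 64, 64)
```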
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
by Google Brain
Introduced the Vision Transformer (ViT), demonstrating that a pure transformer applied directly to sequences of image patches achieves state-of-the-art performance on image classification when pretrained on large datasets. The paper challenged the dominance of convolutional neural networks in computer vision.
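The "16x16 words" idea is just a reshape plus a linear projection. A self-contained sketch of the patch embedding step in plain PyTorch, using ViT-Base defaults (224x224 input, 16x16 patches, 768-dim tokens):

```python
import torch
import torch.nn as nn

B, C, H, W, P, D = 1, 3, 224, 224, 16, 768  # ViT-Base defaults

x = torch.randn(B, C, H, W)

# Cut the image into non-overlapping 16x16 patches ...
patches = x.unfold(2, P, P).unfold(3, P, P)  # (B, C, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)  # (B, 196, 768)

# ... and linearly project each flattened patch into a token embedding.
embed = nn.Linear(C * P * P, D)
tokens = embed(patches)  # (B, 196, 768): a sequence the transformer can consume
print(tokens.shape)
```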
ImageNet
by Deng et al. / Stanford / Princeton
ImageNet (ILSVRC) is the foundational large-scale visual recognition benchmark with 1.28 million training images across 1,000 object categories. Top-1 and Top-5 accuracy on the validation set have been the standard measure of progress in image classification for over a decade.
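Top-k accuracy is straightforward to compute from raw logits; a minimal PyTorch sketch with dummy tensors standing in for real model outputs:

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 5) -> float:
    """Fraction of samples whose true label is among the k highest logits."""
    topk = logits.topk(k, dim=1).indices              # (N, k)
    hits = (topk == labels.unsqueeze(1)).any(dim=1)   # (N,)
    return hits.float().mean().item()

logits = torch.randn(8, 1000)          # dummy predictions over 1,000 classes
labels = torch.randint(0, 1000, (8,))
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=5))
```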
COCO Detection
by Lin et al. / Microsoft
COCO Detection is the standard benchmark for object detection and instance segmentation, featuring 330,000 images with over 1.5 million annotated instances across 80 object categories. Mean Average Precision (mAP) at various IoU thresholds is the primary metric.
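mAP counts a detection as correct only when its IoU with a ground-truth box exceeds a threshold (COCO averages over thresholds from 0.5 to 0.95). The underlying IoU computation, sketched with illustrative boxes:

```python
def box_iou(a, b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# 25 overlap / 175 union ~= 0.143: a miss at the common 0.5 threshold.
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))
```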
Segment Anything
by Meta AI
Introduced the Segment Anything Model (SAM) and the SA-1B dataset of 1 billion masks on 11 million images. SAM is a promptable segmentation foundation model that generalizes to new image distributions and tasks without additional training, enabling a new paradigm of interactive segmentation.
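Promptable segmentation in practice: the official segment_anything package embeds the image once, then answers point or box prompts interactively. A sketch assuming the ViT-H checkpoint has been downloaded from the SAM repository and a hypothetical image and click location:

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # one-time image embedding

# Prompt with a single foreground point; SAM returns candidate masks.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),      # 1 = foreground, 0 = background
    multimask_output=True,
)
print(masks.shape, scores)           # (3, H, W) boolean masks with confidences
```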
Midjourney V6
by Midjourney
Midjourney V6 represents a major leap in photorealism, prompt adherence, and artistic coherence, setting a new industry benchmark for AI image generation quality. It introduced native text rendering within images and dramatically improved its understanding of complex, multi-subject prompts.
SA-1B (Segment Anything)
by Meta AI
SA-1B is Meta AI's massive segmentation dataset released alongside the Segment Anything Model (SAM), containing over 1 billion high-quality segmentation masks across 11 million diverse, high-resolution images. It is the largest publicly released segmentation dataset to date and enables training of generalist vision models with strong zero-shot transfer capabilities.
Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)
by OpenAI
Presented DALL-E 2 (unCLIP), a hierarchical text-conditional image generation system using CLIP image embeddings as a prior and a diffusion decoder. The system achieves state-of-the-art photorealism and text-image alignment, substantially advancing the field of text-to-image synthesis.
Open Images V7
by Google
Google's Open Images V7 is one of the largest existing datasets with object-level annotations, containing approximately 9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives across 600+ object classes.
ADE20K Segmentation
by Zhou et al. / MIT CSAIL
ADE20K is the benchmark for semantic scene parsing, containing 25,000 images densely annotated with 150 semantic categories. Mean Intersection over Union (mIoU) is the standard metric, and it drives progress in perception systems for autonomous driving, robotics, and scene understanding.
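mIoU averages per-class IoU over all classes present in either prediction or ground truth. A minimal NumPy sketch with random label maps standing in for real model output:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 150) -> float:
    """mIoU from per-pixel predictions and ground-truth labels of the same shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                     # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 150, (512, 512))
gt = np.random.randint(0, 150, (512, 512))
print(mean_iou(pred, gt))
```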
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
by Stability AI
Presented SDXL, a significantly improved latent diffusion model architecture featuring a 3.5B parameter UNet backbone with a secondary refiner model, conditioning on image size and crop parameters, and a curated high-aesthetic dataset. SDXL substantially improves visual quality and prompt adherence over prior Stable Diffusion versions.
Stable Diffusion XL
by Stability AI
Stability AI's high-resolution image generation model producing photorealistic and artistic images at 1024x1024 resolution. Features a two-stage architecture with a base model and refiner for enhanced detail and compositional quality.
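The two-stage handoff follows the ensemble-of-experts pattern documented for diffusers: the base model handles early denoising and passes latents to the refiner for the final high-frequency detail. A sketch with a hypothetical prompt and output filename:

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a lighthouse on a rocky coast at dusk, dramatic clouds"

# Base runs the first 80% of denoising steps and hands off raw latents ...
latents = base(prompt=prompt, num_inference_steps=40,
               denoising_end=0.8, output_type="latent").images
# ... and the refiner finishes the remaining 20% for fine detail.
image = refiner(prompt=prompt, num_inference_steps=40,
                denoising_start=0.8, image=latents).images[0]
image.save("lighthouse.png")
```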
LAION-5B
by LAION
The largest openly available image-text pair dataset, containing 5.85 billion CLIP-filtered image-text pairs across English, multilingual, and aesthetic subsets. LAION-5B was the primary training corpus for Stable Diffusion, DALL-E 2 replications, and numerous open vision-language models, enabling the open-source community to train competitive text-to-image generation models.
ADE20K Dataset
by MIT CSAIL
ADE20K is a densely annotated semantic segmentation dataset containing over 27,000 images with pixel-level annotations for 150 semantic categories covering both indoor and outdoor scenes. It is the primary benchmark for scene parsing and semantic segmentation tasks in the computer vision community.
DALL-E 3
by OpenAI
OpenAI's most advanced image generation model with native ChatGPT integration. Features dramatically improved prompt following, text rendering, and safety mitigations compared to DALL-E 2, generating high-fidelity images from natural language descriptions.
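Programmatic access goes through the OpenAI Images API; a minimal sketch with the official Python SDK (v1+) and a hypothetical prompt. Note that DALL-E 3 rewrites prompts server-side, and the revision is returned alongside the image:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",
    prompt="an isometric illustration of a solar-powered weather station",
    size="1024x1024",
    quality="standard",
    n=1,
)
print(response.data[0].url)             # hosted URL of the generated image
print(response.data[0].revised_prompt)  # the internally rewritten prompt
```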
Flickr30k
by Young et al. / University of Illinois
Flickr30k is a benchmark for image-text retrieval and visual grounding, comprising 31,783 Flickr images each paired with five human-written captions. Models are evaluated on bidirectional image-to-text and text-to-image retrieval recall at ranks 1, 5, and 10.
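Recall@K checks whether the true match appears among a query's K nearest neighbors. A sketch with a random similarity matrix, simplified to one caption per image (Flickr30k actually pairs five captions with each image):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """sim[i, j] = similarity of image i and caption j; the match is the diagonal."""
    ranks = (-sim).argsort(axis=1)       # captions sorted best-first per image
    hits = [i in ranks[i, :k] for i in range(sim.shape[0])]
    return float(np.mean(hits))

sim = np.random.randn(100, 100)          # stand-in for model similarity scores
for k in (1, 5, 10):
    print(f"R@{k} = {recall_at_k(sim, k):.3f}")
```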
VQA v2
by Georgia Tech / VT
Visual Question Answering benchmark requiring models to answer open-ended questions about images. Version 2 balances the dataset to reduce language biases, ensuring models must genuinely understand image content rather than relying on question-type priors.
FLUX 1.1 Pro
by Black Forest Labs
FLUX 1.1 Pro from Black Forest Labs is a next-generation text-to-image model built by the original creators of Stable Diffusion, offering superior prompt comprehension, anatomical accuracy, and photorealistic detail. It delivers exceptional speed and quality across the FLUX family's three variants: the API-served Pro tier and the open-weight Dev (non-commercial license) and Schnell (Apache 2.0) releases.
Object Detection
by AaaS
A core computer vision skill that enables agents to identify and locate objects within an image or video stream. By predicting bounding boxes and class labels for each object, this skill forms the foundation for environmental understanding. It is crucial for applications requiring spatial awareness, from autonomous navigation to automated inspection.
Sora
by OpenAI
Sora is a text-to-video diffusion transformer model by OpenAI that generates high-fidelity, minute-long videos from textual prompts. It demonstrates an advanced understanding of language and the physical world, enabling complex scenes with multiple characters, specific motions, and coherent narratives.
Object Detection Setup
by Ultralytics
Bootstraps a production-ready object detection workflow using YOLOv8 or RT-DETR, including webcam/video stream ingestion, NMS post-processing, and annotation overlay rendering. Outputs annotated frames and a structured JSON detections log suitable for downstream analytics.
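The ultralytics package bundles inference, NMS, and thresholding behind one call; a minimal sketch with the YOLOv8 nano checkpoint (auto-downloaded) and a hypothetical input image:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
results = model("street.jpg", conf=0.25)  # NMS and confidence filter built in

detections = []
for r in results:
    for box in r.boxes:
        detections.append({
            "label": model.names[int(box.cls)],
            "confidence": float(box.conf),
            "xyxy": [float(v) for v in box.xyxy[0]],
        })
print(detections)  # structured records ready for a JSON detections log
```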
Stable Diffusion 3
by Stability AI
Stable Diffusion 3 is a powerful text-to-image model using a Multimodal Diffusion Transformer (MMDiT) architecture. It excels at generating images with unprecedented text quality, adhering closely to complex prompts, and achieving high photorealism and compositional accuracy compared to its predecessors.
MMMU
by CUHK / Waterloo
MMMU is a challenging multimodal benchmark designed to evaluate large models on expert-level tasks. It contains over 11,500 college-level problems spanning six core disciplines, requiring models to integrate deep subject knowledge with visual perception to answer multiple-choice questions with detailed reasoning.
Visual Question Answering
by AaaS
Enables agents to answer free-form natural language questions about images by grounding language in visual features. Covers prompt construction for vision-language models, chain-of-thought visual reasoning, and failure modes such as hallucination and spatial confusion.
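A minimal grounding sketch using the Salesforce BLIP VQA checkpoint from transformers; the image path and question are hypothetical:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("kitchen.jpg").convert("RGB")
question = "How many mugs are on the counter?"

# The processor fuses visual features and the tokenized question into one input.
inputs = processor(image, question, return_tensors="pt")
answer_ids = model.generate(**inputs)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
```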
OCR Pipeline
by AaaS
Builds end-to-end pipelines for extracting structured text from images, scanned documents, and PDFs using OCR engines combined with layout analysis. Teaches preprocessing, engine selection (Tesseract, PaddleOCR, Google Document AI), post-correction, and handoff to language models for structured extraction.
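With Tesseract via pytesseract, word-level boxes and confidences come from a single call; a sketch with a hypothetical scan and an illustrative confidence cutoff of 60:

```python
import pytesseract
from PIL import Image

image = Image.open("invoice.png")
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

# Keep confident words with their bounding boxes for structured extraction.
words = [
    {"text": data["text"][i], "conf": float(data["conf"][i]),
     "box": (data["left"][i], data["top"][i], data["width"][i], data["height"][i])}
    for i in range(len(data["text"]))
    if data["text"][i].strip() and float(data["conf"][i]) > 60
]
print(words[:5])
```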
Image Generation Prompting
by AaaS
Master structured prompting for text-to-image diffusion models like Stable Diffusion and Midjourney. Learn to control style, composition, and quality using techniques such as negative prompting, LoRA weights, and iterative refinement. This skill enables the programmatic generation of consistent, on-brand imagery at scale.
Image Segmentation
by AaaS
Covers semantic, instance, and panoptic segmentation techniques that enable agents to produce pixel-level masks for scene understanding. Includes practical guidance on using SAM 2, Mask R-CNN, and integrating segmentation outputs into multimodal pipelines.
Image Classification Pipeline
by Community
End-to-end image classification pipeline that handles dataset loading, preprocessing, model inference, and result export using PyTorch and torchvision. Supports batch inference against any Hugging Face ViT or ResNet checkpoint with configurable confidence thresholds.
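The core of such a pipeline fits in a few lines with transformers Auto classes; a sketch assuming the google/vit-base-patch16-224 checkpoint, hypothetical image paths, and an illustrative 0.5 confidence threshold:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

ckpt = "google/vit-base-patch16-224"   # any ViT or ResNet checkpoint works
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModelForImageClassification.from_pretrained(ckpt)

paths = ["a.jpg", "b.jpg"]
batch = [Image.open(p).convert("RGB") for p in paths]
inputs = processor(images=batch, return_tensors="pt")

with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)

conf, idx = probs.max(dim=-1)
for path, c, i in zip(paths, conf, idx):
    if c >= 0.5:                        # configurable confidence threshold
        print(path, model.config.id2label[int(i)], float(c))
```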
OCR Pipeline Script
by Community
This script provides a sophisticated OCR pipeline that intelligently routes documents to the most suitable engine—Tesseract, PaddleOCR, or a cloud API—based on image quality analysis. It processes various document types and outputs structured JSON containing text sorted by reading order, complete with bounding box coordinates and confidence scores for each word or line.
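One common proxy for image quality is the variance of the Laplacian, which drops on blurry scans. A routing sketch in that spirit, with OpenCV and purely illustrative thresholds (the engine names echo the entry above; real cutoffs need tuning per corpus):

```python
import cv2

def blur_score(path: str) -> float:
    """Variance of the Laplacian: low values indicate a blurry image."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def pick_engine(path: str) -> str:
    score = blur_score(path)
    if score > 500:          # sharp, clean scan: fast local engine suffices
        return "tesseract"
    elif score > 100:        # moderate quality: stronger local model
        return "paddleocr"
    return "cloud_api"       # degraded input: escalate to a cloud service

print(pick_engine("scan.png"))
```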
Image Segmentation Script
by Meta AI
Runs Segment Anything Model (SAM 2) or Mask2Former on image batches, producing per-pixel segmentation masks with class labels and confidence scores. Includes utilities for mask overlay visualization and RLE-encoded mask export compatible with COCO annotation format.
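The COCO-compatible RLE export mentioned above uses pycocotools; a sketch with a dummy binary mask:

```python
import numpy as np
from pycocotools import mask as mask_utils

binary_mask = np.zeros((480, 640), dtype=np.uint8)
binary_mask[100:200, 150:300] = 1              # dummy object mask

# COCO RLE encoding requires Fortran (column-major) memory layout.
rle = mask_utils.encode(np.asfortranarray(binary_mask))
area = float(mask_utils.area(rle))
rle["counts"] = rle["counts"].decode("ascii")  # bytes -> str for JSON export

print({"segmentation": rle, "area": area})
```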
Visual Search Engine
by Community
This script provides a complete framework for building a multimodal visual search engine. It uses CLIP to generate image and text embeddings, which are indexed in a vector database like Qdrant or Weaviate for efficient similarity search. The system supports both text-to-image and image-to-image queries and includes a FastAPI server for API access.
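The retrieval core, sketched with a plain NumPy matrix standing in for Qdrant or Weaviate; the gallery paths are hypothetical, and normalized embeddings make the dot product a cosine similarity:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = ["cat.jpg", "beach.jpg", "skyline.jpg"]   # hypothetical image gallery
with torch.no_grad():
    feats = model.get_image_features(
        **processor(images=[Image.open(p) for p in paths], return_tensors="pt"))
index = (feats / feats.norm(dim=-1, keepdim=True)).numpy()

def search(query: str, k: int = 3):
    with torch.no_grad():
        q = model.get_text_features(**processor(text=[query], return_tensors="pt"))
    q = (q / q.norm(dim=-1, keepdim=True)).numpy()
    scores = index @ q.T                          # cosine similarity per image
    return sorted(zip(paths, scores[:, 0]), key=lambda s: -s[1])[:k]

print(search("a city at night"))
```

Image-to-image queries follow the same pattern with get_image_features on the query side.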
Video Understanding
by AaaS
Covers temporal reasoning over video streams, including frame sampling strategies, action recognition, scene change detection, and dense video captioning. Teaches agents to leverage video-native models (Gemini 1.5 Pro, Video-LLaVA) and build efficient pipelines that avoid processing every frame.
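The simplest sampling strategy is uniform: decode only n evenly spaced frames so the whole clip is covered at fixed cost. A sketch with OpenCV and a hypothetical video file:

```python
import cv2
import numpy as np

def sample_frames(path: str, n: int = 8):
    """Uniformly sample n frames without decoding the full video."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, total - 1, n, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

frames = sample_frames("demo.mp4")   # pass these to a vision-language model
print(len(frames), frames[0].shape)
```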
Face Recognition Setup
by Community
Configures a face recognition system using InsightFace or DeepFace, supporting gallery enrollment, real-time identification against a FAISS vector store, and liveness detection. Designed with privacy-first defaults and includes GDPR-compliant consent logging.
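A minimal enrollment-and-identification sketch with InsightFace embeddings and a FAISS inner-product index (cosine similarity on the normalized vectors); the 0.35 match threshold is illustrative, and liveness checks and consent logging are omitted:

```python
import cv2
import faiss
import numpy as np
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")        # detection + 512-d ArcFace embeddings
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 -> GPU, -1 -> CPU

index = faiss.IndexFlatIP(512)              # inner product == cosine on unit vectors
names = []

def enroll(person: str, image_path: str):
    faces = app.get(cv2.imread(image_path))
    index.add(np.array([faces[0].normed_embedding], dtype=np.float32))
    names.append(person)

def identify(image_path: str, threshold: float = 0.35):
    faces = app.get(cv2.imread(image_path))
    query = np.array([faces[0].normed_embedding], dtype=np.float32)
    scores, ids = index.search(query, 1)
    return names[ids[0][0]] if scores[0][0] >= threshold else "unknown"
```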