Explore.
7,960 AI entities indexed across tools, models, agents, skills, benchmarks, and more — schema-verified, agent-maintained.
Showing 1,020 of 7,960 entities
Hugging Face
by Hugging Face
The largest platform for sharing and deploying machine learning models, datasets, and applications. Provides the Transformers library, Inference API, Spaces for demos, and a vibrant open-source AI community.
TensorFlow Lite
by Google
TensorFlow Lite is Google's lightweight ML framework designed for on-device inference on mobile, embedded, and IoT devices. It enables deploying trained models with minimal latency and no network dependency, supporting a wide range of hardware accelerators including GPU, DSP, and NPU.
Apache Airflow (ML Edition)
by Apache Software Foundation
Battle-tested workflow scheduler for authoring, scheduling, and monitoring data and ML pipelines as directed acyclic graphs. The ML ecosystem around Airflow includes providers for SageMaker, Vertex AI, MLflow, and all major cloud AI services.
Three.js (AI Integration)
by Three.js Community (Mr.doob)
The foundational JavaScript 3D library for rendering GPU-accelerated graphics in the browser via WebGL, with a growing ecosystem of AI-generated geometry, procedural shaders, and LLM-driven scene graph manipulation. Three.js powers many of the web's spatial AI visualizations.
dbt (AI/ML Edition)
by dbt Labs
The analytics engineering framework that transforms raw warehouse data into clean, tested, and documented datasets ready for ML and AI. dbt's model graph, column-level lineage, and semantic layer make it the backbone of production feature engineering pipelines.
Apache Spark MLlib
by Apache Software Foundation
Apache Spark's built-in machine learning library for distributed, large-scale ML on data lakes and warehouses. MLlib provides scalable algorithms for classification, regression, clustering, and collaborative filtering, plus a pipeline API for feature engineering.
Stability AI Platform
by Stability AI
The Stability AI Platform provides API access to Stability AI's suite of generative image, video, and audio models including Stable Diffusion 3.5 and Stable Video Diffusion, enabling developers to build creative AI applications at scale. It offers both hosted API endpoints and open-weight models for on-premises deployment.
MediaPipe
by Google
MediaPipe is Google's cross-platform framework for building perception pipelines that run on-device in real time. It provides production-ready solutions for tasks like hand tracking, face detection, pose estimation, and object detection across Android, iOS, web, and desktop.
Streamlit
by Snowflake (via acquisition)
Python-first framework for building interactive data applications and ML demos in minutes with no frontend experience required. Streamlit's reactive execution model, built-in widgets, and LLM streaming components make it the go-to tool for AI prototype UIs.
DeepL API
by DeepL
DeepL API provides neural machine translation of exceptional quality for 30+ languages, consistently outperforming competitors on blind translation benchmarks. It supports real-time text and full document translation with format preservation, a glossary system, and a free tier for developers.
Zapier AI
by Zapier
Zapier AI extends the world's largest no-code automation platform with AI-powered workflow generation, natural language Zap building, and an AI Actions API that lets LLMs trigger real-world automations. It connects 6000+ apps and enables non-technical users to build AI-augmented workflows without writing code.
Databricks
by Databricks Inc.
Unified data intelligence platform combining data engineering, ML, and GenAI on a Lakehouse foundation. Databricks provides managed Spark, Delta Lake, MLflow, and Model Serving with vector search, enabling end-to-end AI pipelines from raw data to production models.
Rosetta
by RosettaCommons
Rosetta is a comprehensive software suite for computational macromolecular modeling and design, enabling researchers to predict protein structure, design novel proteins, and model protein-protein interactions. Developed by the RosettaCommons consortium, it is the gold standard in computational protein design and has contributed to multiple Nobel Prize-winning research programs.
Anthropic Tool Use
by Anthropic
Anthropic's native tool use capability allowing Claude models to interact with external tools and APIs. Provides structured tool definitions with input schemas and supports parallel tool calls and streaming.
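The tool-definition-with-input-schema pattern described above can be sketched with plain dictionaries. This is a standard-library toy, not Anthropic's actual API payloads; the `get_weather` tool and its canned handler are illustrative.

```python
# Sketch of the structured tool-definition pattern: a tool declares a
# JSON-schema-style input contract, and a dispatcher validates each
# call against it before executing. Names here are hypothetical.
import json

get_weather_tool = {
    "name": "get_weather",
    "description": "Return the current temperature for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def dispatch(tool_call: dict) -> str:
    """Validate a tool call against the schema, then execute it."""
    schema = get_weather_tool["input_schema"]
    args = tool_call["input"]
    for field in schema["required"]:
        if field not in args:
            raise ValueError(f"missing required field: {field}")
    # A real handler would query a weather service; this one is canned.
    return json.dumps({"city": args["city"], "temp_c": 21})

result = dispatch({"name": "get_weather", "input": {"city": "Oslo"}})
```

In the real API the model emits the tool call and the application returns the result string back to the model as a tool-result message.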
ONNX Runtime Mobile
by Microsoft
ONNX Runtime Mobile is Microsoft's high-performance inference engine optimized for mobile and edge devices, enabling deployment of models from any ONNX-compatible training framework. It provides hardware-accelerated inference via NNAPI, Core ML, and XNNPACK execution providers.
Runway ML
by Runway
Runway ML is a leading generative AI creative platform for video generation, editing, and visual effects, offering models like Gen-3 Alpha for high-fidelity text-to-video and image-to-video synthesis. It is used by filmmakers, advertisers, and content creators to produce cinematic-quality AI-generated video at scale.
Gradio
by Hugging Face
Python library for rapidly building shareable ML demos with a focus on multimodal inputs including images, audio, video, and text. Gradio is the standard for Hugging Face Spaces demos and integrates natively with the Hugging Face Hub model ecosystem.
Apache TVM
by Apache Software Foundation
Apache TVM is an open-source machine learning compiler stack that optimizes deep learning workloads for a diverse set of hardware backends including CPUs, GPUs, FPGAs, and custom accelerators. It automates model optimization through its AutoTVM and Ansor auto-tuning systems, delivering state-of-the-art inference performance on edge targets.
OneTrust AI
by OneTrust
OneTrust offers a Trust Intelligence Platform to help organizations manage privacy, security, and data governance. It automates workflows for compliance with regulations like GDPR and CCPA, manages user consent, and provides tools for AI governance, data discovery, and third-party risk assessment across the enterprise.
Delta Lake
by Linux Foundation (Delta Lake Project)
Delta Lake is an open-source storage layer that brings ACID transactions and reliability to data lakes. Built on top of Parquet files, it enables features like schema enforcement, time travel for data versioning, and unified batch and streaming data processing. It serves as the foundational storage format for the Lakehouse architecture.
Claude Code
by Anthropic
Claude Code is an agentic AI coding assistant from Anthropic designed to operate within a developer's terminal. It autonomously handles complex software development tasks by understanding entire codebases, editing files, executing shell commands, and managing Git workflows, acting as a hands-on pair programmer with minimal human supervision.
AWS API Gateway (ML)
by Amazon Web Services
AWS-managed API gateway service for building, deploying, and scaling ML and AI APIs backed by Lambda, SageMaker, and Bedrock endpoints. AWS API Gateway provides built-in authorization, throttling, caching, and monitoring for production AI service deployments at any scale.
LangSmith Testing
by LangChain
LangSmith is a platform for debugging, testing, evaluating, and monitoring LLM applications. It enables developers to visualize execution traces of their chains and agents, collect datasets, and run automated evaluators to score model performance. The platform is designed to streamline the LLM development lifecycle from prototype to production.
Neo4j GraphRAG
by Neo4j
Neo4j GraphRAG combines graph database capabilities with vector search to build retrieval-augmented generation systems that leverage structured relationships alongside semantic similarity. It enables developers to construct knowledge graphs that ground LLM responses in connected, structured data, reducing hallucinations and improving traceability.
Semantic Kernel
by Microsoft
Microsoft's open-source SDK for integrating LLMs into applications with plugin architecture. Supports planners, memory, and connectors for building enterprise AI solutions across .NET, Python, and Java.
Make AI
by Make
Make (formerly Integromat) is a visual no-code automation platform with deep AI integration that enables complex multi-step workflows through a drag-and-drop scenario builder. It offers granular data transformation, HTTP module flexibility, and AI-powered scenario generation for orchestrating sophisticated automation pipelines.
Milvus
by Zilliz
Cloud-native vector database built for scalable similarity search with GPU acceleration. Supports billions of vectors with multiple index types, hybrid search, and multi-vector queries.
Instructor
by Jason Liu
Instructor is a Python library that simplifies extracting structured, typed data from Large Language Model (LLM) responses. By leveraging Pydantic models, it enables developers to define a desired data schema, and Instructor handles the prompting, validation, and retries to ensure the LLM output conforms to that schema, streamlining data extraction tasks.
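The schema-validate-retry loop that Instructor automates can be shown in miniature with the standard library. `fake_llm` stands in for a real model call, and the required-keys check is a much-simplified analogue of Pydantic validation; none of these names are Instructor's API.

```python
# Toy sketch of the validate-and-retry loop: parse the model output,
# check it against the expected schema, and re-prompt on failure.
import json

def fake_llm(prompt: str, attempt: int) -> str:
    # First reply is malformed JSON; the retry is well-formed.
    return '{"name": "Ada"' if attempt == 0 else '{"name": "Ada", "age": 36}'

def extract(prompt: str, required: set, max_retries: int = 2) -> dict:
    for attempt in range(max_retries + 1):
        raw = fake_llm(prompt, attempt)
        try:
            data = json.loads(raw)
            if required <= data.keys():   # all required fields present
                return data
        except json.JSONDecodeError:
            pass  # in a real loop, the error is fed back into the prompt
    raise RuntimeError("LLM output never matched the schema")

user = extract("Extract the user from this text.", {"name", "age"})
```

With Instructor itself, the schema is a Pydantic model passed as `response_model`, and validation errors are automatically appended to the retry prompt.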
Groq
by Groq
Groq is an AI inference company that provides ultra-fast access to open-source large language models. It leverages its custom-designed Language Processing Unit (LPU) hardware to deliver industry-leading token generation speeds, significantly reducing latency for real-time applications via an OpenAI-compatible API.
Jasper AI
by Jasper
Jasper is an AI content platform designed for enterprise marketing teams. It helps create on-brand content at scale by combining advanced AI models with a company's specific brand knowledge. The platform supports multi-channel content generation, campaign workflows, and ensures brand voice consistency across all outputs.
Model Context Protocol
by Anthropic
Open protocol by Anthropic for connecting AI models to external tools, data sources, and services. Provides a standardized interface for tool use with server and client SDKs for building integrations.
H2O AutoML
by H2O.ai
H2O AutoML is an open-source, distributed machine learning platform that automates the model training process. It systematically explores various algorithms and hyperparameters to produce a leaderboard of the best models. It supports both a Python/R API and a no-code Flow UI, making it accessible to both developers and business users.
Bubble AI
by Bubble
Bubble AI is an AI-powered no-code development platform for building web applications without writing code. Users can describe their app idea in natural language, and the AI assistant generates layouts, database structures, and workflows. This visual programming environment allows for extensive customization, ideal for creating MVPs and internal tools.
LoRA Library
by Hugging Face
The LoRA Library, integrated within Hugging Face's PEFT (Parameter-Efficient Fine-Tuning) package, provides tools to create, share, and use LoRA adapters. It allows for the efficient customization of large pre-trained models by training only a small number of new weights, drastically reducing computational costs and storage requirements compared to full fine-tuning.
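The parameter saving described above comes from LoRA's low-rank update: instead of training a full weight matrix W (d_out × d_in), train a pair B (d_out × r) and A (r × d_in) with small rank r and add their scaled product to the frozen W. A minimal numeric sketch, with toy dimensions chosen for readability:

```python
# LoRA in miniature: W stays frozen; only B and A are trained, and the
# adapted weights are W + (alpha / r) * (B @ A).
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d_out, d_in, r, alpha = 2, 3, 1, 2.0
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen base weights (d_out x d_in)
B = [[0.5], [1.0]]                        # trained low-rank factor (d_out x r)
A = [[1.0, 2.0, 0.0]]                     # trained low-rank factor (r x d_in)

delta = matmul(B, A)                      # rank-r update, d_out x d_in
W_adapted = [[w + (alpha / r) * d for w, d in zip(wr, dr)]
             for wr, dr in zip(W, delta)]

# Trainable parameters: r * (d_out + d_in) = 5 instead of
# d_out * d_in = 6; the saving grows dramatically at real model sizes.
```

At transformer scale (d_out = d_in = 4096, r = 8) the adapter is roughly 0.4% of the size of the full matrix, which is why adapters are cheap to train, store, and share.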
TGI
by Hugging Face
A production-ready inference server for large language models, developed by Hugging Face in Rust. It enables high-performance LLM serving through features like tensor parallelism, continuous batching, and quantization, making it ideal for deploying demanding models at scale with low latency.
GitBook AI
by GitBook
GitBook AI is an intelligent documentation and knowledge management platform with built-in AI that can answer questions, generate content, and surface insights from your entire knowledge base. It combines a Notion-like editor with an AI search assistant and GitHub sync for technical teams.
Temporal AI
by Temporal Technologies
Durable workflow orchestration platform that makes it easy to build reliable distributed applications. Temporal handles retries, timeouts, and failure recovery automatically, making it ideal for long-running AI pipelines and agent orchestration.
Mintlify
by Mintlify
Mintlify is an AI-powered platform for creating and maintaining developer documentation. It auto-generates content from code comments and OpenAPI specs, provides an AI chatbot trained on the docs for instant answers, and offers a rich component library for technical writing. The platform simplifies publishing and hosting.
Terra
by Broad Institute / Microsoft
Terra is a cloud-based open platform for biomedical researchers to access data, run analysis tools, and collaborate, built on Google Cloud with support for WDL and CWL workflow languages. It provides access to petabyte-scale genomic datasets including TCGA and GTEx, and supports scalable analysis through Cromwell and Spark pipelines.
AutoGluon
by Amazon Web Services
AutoGluon is an open-source AutoML framework from AWS that simplifies machine learning. It automates model training, hyperparameter tuning, and ensembling to achieve state-of-the-art performance on tabular, image, text, and time-series data with just a few lines of Python code, making advanced ML accessible to all skill levels.
Unity ML-Agents
by Unity Technologies
Unity ML-Agents is an open-source toolkit that enables the use of the Unity game engine as a simulation environment for training intelligent agents. It connects rich 3D environments with Python-based deep reinforcement learning and imitation learning frameworks like TensorFlow and PyTorch, facilitating research and development in game AI, robotics, and autonomous systems.
Airbyte
by Airbyte Inc.
Open-source data integration platform with 350+ pre-built connectors for syncing data into AI-ready warehouses and vector databases. Airbyte's PyAirbyte SDK and AI Connector Builder enable rapid connector creation for custom data sources and AI pipelines.
Label Studio
by HumanSignal
Open-source data labeling and annotation platform supporting text, image, audio, and video. Provides customizable labeling interfaces, ML-assisted labeling, and team collaboration for building training datasets.
Auto-sklearn
by University of Freiburg (AutoML Group)
Auto-sklearn is an open-source AutoML toolkit built on scikit-learn. It leverages Bayesian optimization, meta-learning, and automated ensemble construction to find the best-performing machine learning pipeline for a given tabular dataset. It is a prominent tool in academic research for automated model selection.
Cline
by Cline
Autonomous coding agent that operates directly in VS Code with support for multiple LLM providers. Can create and edit files, run terminal commands, and browse the web while requiring human approval for actions.
Edge Impulse
by Edge Impulse
Edge Impulse is a leading development platform for machine learning on embedded systems and IoT devices. It offers an end-to-end MLOps pipeline, from data collection and signal processing to model training and deployment. The platform simplifies creating TinyML applications for resource-constrained microcontrollers.
Dagster
by Dagster Labs
Dagster is an asset-centric data orchestrator for building, testing, and monitoring data pipelines. It models data dependencies and computations as a graph of software-defined assets, providing built-in data lineage, type checking, and observability. This approach helps data teams create reliable and maintainable data platforms.
ReadMe AI
by ReadMe
ReadMe AI is an interactive API documentation platform that transforms OpenAPI specs and Markdown into beautiful, interactive developer hubs with personalized API explorers. Its AI-powered features include auto-generated code samples, semantic search, and contextual AI answers drawn from your documentation.
Outlines
by .txt
Outlines is an open-source Python library that provides fine-grained control over large language model text generation. It uses constrained decoding to force the model's output to conform to a specific structure, such as a regular expression, a Pydantic model, or a JSON schema. This guarantees that the generated text is always valid and parseable, eliminating the need for post-processing and error handling.
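The constrained-decoding idea can be illustrated without any model: at each generation step, mask out every candidate token that would break the target format, then pick the best remaining one. This toy uses a hand-built per-position character mask; Outlines itself compiles a regex or JSON schema into a finite-state machine applied to the model's token logits.

```python
# Toy constrained decoding for the format \d\d-\d\d: invalid tokens are
# masked before selection, so the output is valid by construction.
# `fake_logits` stands in for a model's ranked token preferences.
DIGITS = set("0123456789")
allowed_at = [DIGITS, DIGITS, {"-"}, DIGITS, DIGITS]

def fake_logits(step: int) -> list:
    # The unconstrained model would happily emit "x" every time.
    return ["x", "4", "-", "2", "7"]

out = []
for step, allowed in enumerate(allowed_at):
    for token in fake_logits(step):   # walk candidates best-first
        if token in allowed:          # mask tokens that break the format
            out.append(token)
            break
result = "".join(out)
```

Because masking happens inside the decoding loop, validity is guaranteed up front rather than checked (and retried) after the fact.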
Play.ht
by Play.ht
Play.ht is an AI-powered text-to-speech generator and voice cloning platform. It offers a vast library of over 900 AI voices in multiple languages and accents. The platform is designed for various applications, from creating audio versions of articles to developing interactive conversational AI, thanks to its low-latency real-time streaming API.
Triton Inference Server
by NVIDIA
Triton is an open-source inference server from NVIDIA designed for high-performance, production-ready AI. It supports deploying models from virtually any framework, such as TensorFlow, PyTorch, and ONNX, on both GPUs and CPUs. Key features include dynamic batching, concurrent model execution, and model ensembling to maximize throughput and resource utilization.
n8n AI
by n8n
n8n AI is a source-available workflow automation platform that enables users to build complex, AI-powered automations. It features a visual, node-based editor where users can connect hundreds of applications and services, including various LLMs and AI agents, to orchestrate intricate processes with minimal code.
TrustArc AI
by TrustArc
TrustArc AI is a comprehensive privacy management platform that leverages AI to automate and simplify compliance with global regulations like GDPR and CCPA. It provides tools for data inventory, risk assessments, and consent management, helping organizations build and maintain robust privacy programs.
Hopsworks
by Hopsworks AB
Hopsworks is an open-source ML platform centered around its feature store, enabling teams to manage the full lifecycle of features from engineering to serving for both batch and real-time ML workloads. It integrates deeply with Apache Spark, Flink, and Python environments, and provides built-in model registry and serving capabilities.
NVIDIA Omniverse
by NVIDIA
NVIDIA's platform for building physically accurate 3D simulations, digital twins, and collaborative virtual worlds powered by Universal Scene Description (USD) and real-time ray tracing. Omniverse integrates generative AI for scene synthesis, avatar animation, and synthetic data generation for robot and autonomous vehicle training.
Replit AI
by Replit
AI-powered cloud development platform with integrated coding assistant and one-click deployment. Combines a browser-based IDE with AI code generation, debugging, and instant deployment to production.
Kong AI Gateway
by Kong Inc.
AI-native API gateway that provides a unified control plane for managing, securing, and observing all LLM traffic across any provider. Kong AI Gateway adds semantic caching, prompt injection protection, token rate limiting, and cost attribution on top of the battle-tested Kong Gateway.
Phrase TMS AI
by Phrase
Phrase TMS AI is a translation management system with integrated AI that automates the end-to-end localization workflow for enterprises, including MT integration, translation memory, terminology management, and quality assurance automation. It serves as the operational backbone for global content and software localization programs.
Tecton
by Tecton
Tecton is an enterprise feature store platform that enables data scientists and ML engineers to build, share, and serve features for real-time and batch machine learning applications. It provides a declarative Python SDK for defining feature pipelines, with automatic backfilling, versioning, and point-in-time correct training data generation.
Kubeflow
by Google
Open-source ML platform for Kubernetes providing end-to-end ML workflow orchestration. Includes pipeline authoring, distributed training, hyperparameter tuning, and model serving on Kubernetes clusters.
Apigee AI (Google Cloud)
by Google Cloud
Google Cloud's enterprise API management platform with native Vertex AI and Gemini integration for building secure AI-powered APIs. Apigee AI adds LLM traffic management, semantic caching, safety policies, and analytics to Google's proven API gateway infrastructure.
Kapwing AI
by Kapwing
Kapwing is a browser-based AI video editor designed for content creators and social media teams. It offers AI-powered subtitle generation, background removal, smart cut, and one-click repurposing across formats and aspect ratios.
Unbabel
by Unbabel
Unbabel is an AI-powered translation platform that combines neural machine translation with a community of professional post-editors to deliver human-quality translations at machine speed. It is purpose-built for enterprise customer support and content teams requiring guaranteed accuracy at scale.
Modal
by Modal
Serverless cloud platform for running GPU-accelerated Python code with zero infrastructure management. Provides instant container spin-up, GPU autoscaling, and simple decorators for deploying ML workloads.
Guidance
by Microsoft
Microsoft's language for controlling LLMs with interleaved generation and prompting. Supports constrained output via token healing, regex constraints, and context-free grammars for reliable generation.
PathAI
by PathAI
PathAI is an AI-powered pathology platform that assists pathologists in diagnosing diseases including cancer by analyzing digitized tissue slides with deep learning models trained on millions of pathology images. It provides quantitative biomarker analysis, treatment response prediction, and clinical trial endpoint measurement at scale.
FLAML
by Microsoft Research
Fast and Lightweight AutoML library from Microsoft Research that minimizes compute while maximizing accuracy. FLAML uses cost-aware hyperparameter search and is designed to be embedded inside larger systems, including the AutoGen multi-agent framework.
Credo AI
by Credo AI
Credo AI is an AI governance platform that enables organizations to assess, monitor, and document AI model risks, fairness, and regulatory compliance. It automates evidence collection for frameworks like EU AI Act, NIST AI RMF, and ISO 42001, bridging the gap between AI teams and risk officers.
Chainlit
by Chainlit (Community)
Production-ready Python framework for building conversational AI applications with streaming, message threading, and human-in-the-loop feedback. Chainlit is optimized specifically for LLM chat UIs and integrates natively with LangChain, LlamaIndex, and LiteLLM.
RunPod
by RunPod
Cloud GPU platform for AI inference and training with serverless and dedicated GPU options. Provides cost-effective GPU rentals with pre-built templates for popular ML frameworks and models.
SerpAPI
by SerpAPI
API for scraping and parsing search engine results from Google, Bing, Yahoo, and others. Provides structured JSON results from multiple search engines with support for locations, languages, and devices.
Great Expectations
by Superconductive
Open-source data quality platform for validating, profiling, and documenting data pipelines. Provides expectation-based testing for data quality with automated documentation and alerting capabilities.
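The expectation-based testing model can be sketched in a few lines: declare a check over a column, run it, and collect a machine-readable validation result. The function name and report shape below are illustrative, not the Great Expectations API.

```python
# Sketch of an "expectation": a declarative data-quality check that
# returns a structured pass/fail report instead of raising.
def expect_values_between(column, low, high):
    failures = [v for v in column if not (low <= v <= high)]
    return {"success": not failures, "unexpected": failures}

ages = [34, 29, 41, -3, 57]
report = expect_values_between(ages, 0, 120)
```

A suite of such expectations, run on every pipeline load, is what produces the automated documentation and alerting the entry describes.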
DVC
by Iterative
Open-source version control system for machine learning projects with Git-like data management. Provides data versioning, experiment tracking, and ML pipeline management alongside your existing Git workflow.
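The Git-like data management described above rests on content addressing: a dataset is identified by the hash of its bytes, so only a tiny pointer needs to live in Git while the data sits in a hash-keyed cache. A stdlib sketch of that core idea (the in-memory `cache` dict stands in for DVC's on-disk cache):

```python
# Content-addressed storage sketch: files are keyed by the hash of
# their contents, so identical data is stored once and a short hash
# pointer is all that gets committed to Git.
import hashlib

cache = {}

def track(data: bytes) -> str:
    """Store data in the cache and return its content address."""
    addr = hashlib.md5(data).hexdigest()   # DVC historically keys on MD5
    cache[addr] = data
    return addr

ptr = track(b"label,value\ncat,1\n")
restored = cache[ptr]                      # "checkout" by pointer
```

Changing one byte of the data changes the hash, which is how different dataset versions coexist and how `dvc checkout` can restore any of them.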
ArangoDB
by ArangoDB
ArangoDB is a native multi-model database supporting graphs, documents, and key-value storage in a single engine, with integrated vector search and ML capabilities for building knowledge-graph-backed AI applications. Its AQL query language and ArangoSearch make it suitable for complex knowledge retrieval pipelines combining structural and semantic search.
Swarm
by OpenAI
OpenAI's experimental lightweight multi-agent orchestration framework focused on handoffs and routines. Provides a minimal abstraction for agent coordination using function calling and agent transfers.
Patronus AI
by Patronus AI
Patronus AI is an enterprise LLM evaluation platform specializing in automated testing for hallucination, toxicity, PII leakage, and factual accuracy across production AI systems. It provides a library of 1,000+ pre-built evaluators and supports custom evaluator creation to enforce application-specific quality gates.
Fireworks AI
by Fireworks AI
High-performance inference platform for generative AI with fast model serving and fine-tuning. Optimized for production workloads with function calling, JSON mode, and grammar-based generation.
Continue
by Continue
Open-source AI code assistant for VS Code and JetBrains with customizable model and context providers. Supports tab autocomplete, chat, inline editing, and custom slash commands with any LLM.
Pictory
by Pictory AI
Pictory is an AI video creation platform that transforms long-form text, articles, and scripts into short branded videos automatically. It includes AI voiceovers, stock footage matching, automatic highlight extraction, and a brand kit for consistent visual identity.
Flyte
by Union.ai (Linux Foundation)
Kubernetes-native workflow orchestration platform purpose-built for machine learning and data processing at scale. Flyte enforces strong typing on inputs and outputs, provides built-in versioning, and integrates natively with Kubernetes for resource management.
Arthur AI
by Arthur AI
Arthur AI is an enterprise ML monitoring and observability platform that tracks model performance, detects data and concept drift, and measures fairness in production deployments. It provides real-time alerting, explainability dashboards, and bias mitigation tooling for high-stakes AI applications.
Anyscale
by Anyscale
Enterprise platform for scaling AI applications built on the Ray distributed computing framework. Provides managed Ray clusters, model serving, and fine-tuning infrastructure for production AI workloads.
Arize AI
by Arize AI
ML observability platform for monitoring model performance, detecting drift, and troubleshooting issues. Provides real-time monitoring, embedding analysis, and automated performance alerts for AI systems.
Amazon Neptune ML
by Amazon Web Services
Amazon Neptune ML is a managed graph machine learning capability built on Neptune that uses graph neural networks to make predictions on graph data without requiring ML expertise. It automatically trains GNN models on graph structure and node/edge properties for tasks like node classification, link prediction, and regression.
Panel
by HoloViz / NumFOCUS
High-level app and dashboarding framework from HoloViz that works with nearly every visualization library in the Python ecosystem. Panel supports reactive programming, GPU-accelerated plotting, and server-side rendering, making it ideal for complex analytical AI dashboards.
Metaflow
by Netflix / Outerbounds
Human-friendly Python library for building and managing real-life data science and ML projects. Originally developed at Netflix, provides seamless scaling from laptops to cloud with versioning and reproducibility.
Traefik AI Gateway
by Traefik Labs
Cloud-native edge router and AI gateway built for Kubernetes-native LLM traffic management. Traefik AI Gateway extends the battle-tested Traefik reverse proxy with LLM-aware middleware for token counting, semantic caching, failover routing, and provider load balancing.
Sourcegraph Cody
by Sourcegraph
AI coding assistant powered by Sourcegraph's code graph for deep codebase understanding. Provides context-aware code generation and answers using entire repository knowledge across large codebases.
Snorkel
by Snorkel AI
Enterprise data-centric AI platform for programmatically labeling and curating training data. Uses weak supervision and labeling functions to create large labeled datasets without manual annotation.
Spline AI
by Spline Design
Browser-based 3D design tool with integrated AI generation capabilities for creating interactive 3D scenes, objects, and animations from text prompts. Spline AI allows designers and developers to produce real-time web-ready 3D graphics without traditional 3D modeling expertise.
Marker
by VikParuchuri
Fast and accurate PDF to Markdown converter optimized for books and scientific papers. Handles complex layouts, equations, tables, and multi-column documents with higher quality than traditional OCR tools.
Zilliz Cloud
by Zilliz
Fully managed vector database service built on Milvus for enterprise-grade similarity search. Provides auto-scaling, high availability, and enterprise security with a simplified operational experience.
Cleanlab
by Cleanlab
Data-centric AI library for finding and fixing label errors in datasets automatically. Uses confident learning algorithms to identify mislabeled data, estimate noise, and improve model training quality.
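A much-simplified version of the confident-learning heuristic: flag an example when the model's confidence in its *given* label falls below that class's average confidence. Real Cleanlab estimates a full label-noise matrix from out-of-sample predicted probabilities; this sketch keeps only the core intuition, with hand-picked numbers.

```python
# Label-error detection sketch: an example whose given-label confidence
# is well below the class average is a likely mislabel.
probs = [                     # model P(class) per example: [cat, dog]
    [0.95, 0.05],
    [0.90, 0.10],
    [0.10, 0.90],             # labeled "cat", but the model says "dog"
]
labels = [0, 0, 0]            # given labels (all "cat")

# Per-class threshold: mean confidence over examples given that label.
thr = sum(probs[i][0] for i in range(3)) / 3

suspects = [i for i, y in enumerate(labels) if probs[i][y] < thr]
```

Here only the third example is flagged, which matches the intuition that the model strongly disagrees with its label.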
Gemini Code Assist
by Google
Google's AI-powered code assistance tool integrated with Google Cloud and IDEs. Provides code completions, explanations, and transformations powered by Gemini models with enterprise security controls.
txtai
by NeuML
All-in-one embeddings database for semantic search, LLM orchestration, and language model workflows. Combines vector search with NLP pipelines including summarization, translation, and text-to-speech.
Swimm
by Swimm
Swimm is an AI-powered code documentation tool that auto-generates and keeps documentation synchronized with the codebase using IDE plugins. It detects code changes and alerts developers when docs become stale, enabling engineering teams to maintain accurate, living documentation at scale.
Serper
by Serper
Fast and affordable Google Search API for developers and AI applications. Provides structured Google search results including organic results, knowledge graphs, and related questions via a simple REST API.
AutoTrain
by Hugging Face
Hugging Face's automated training solution for fine-tuning LLMs and other models with minimal configuration. Provides a no-code UI and CLI for training custom models with automatic hyperparameter selection.
Stardog
by Stardog
Stardog is an enterprise knowledge graph platform built on W3C standards (RDF, OWL, SPARQL) that enables organizations to unify disparate data sources into a semantic layer for AI and analytics. Its Virtual Graph capability connects to existing databases without data migration, and its AI integration supports LLM grounding on enterprise knowledge.
Prodigy
by Explosion
Scriptable annotation tool by Explosion for creating training data with active learning. Integrates with spaCy for NLP tasks and provides efficient annotation workflows with model-in-the-loop labeling.
GPT-5
by OpenAI
OpenAI's frontier model with advanced reasoning, native multimodal understanding, and robust function calling. Designed for complex enterprise workflows and agentic applications.
GPT-4o
by OpenAI
OpenAI's natively multimodal flagship model processing text, image, and audio inputs with a single unified architecture. Delivers GPT-4 Turbo-level intelligence at 2x speed and 50% lower cost, with breakthrough real-time voice capabilities.
Claude 4
by Anthropic
Anthropic's most capable model featuring advanced reasoning, coding, and multimodal capabilities. Excels at complex analysis, agentic tasks, and extended thinking with industry-leading safety.
GPT-4
by OpenAI
OpenAI's breakthrough large language model that demonstrated a significant leap in reasoning and factual accuracy over GPT-3.5. Widely adopted across enterprise and developer workflows for code generation, analysis, and complex problem-solving.
Claude 3.5 Sonnet
by Anthropic
Anthropic's breakout model that surpassed Claude 3 Opus at Sonnet-tier pricing, setting new industry benchmarks for coding. Introduced computer use capability and became the most popular model on the API due to its exceptional intelligence-to-cost ratio.
Midjourney V6
by Midjourney
Midjourney V6 represents a major leap in photorealism, prompt adherence, and artistic coherence, setting a new industry benchmark for AI image generation quality. It introduced native text rendering within images and dramatically improved its understanding of complex, multi-subject prompts.
Whisper V3
by OpenAI
OpenAI's state-of-the-art open-source automatic speech recognition model; the large-v3 release was trained on over 5 million hours of multilingual audio (1M weakly labeled plus 4M pseudo-labeled). Supports 99 languages with near-human accuracy and includes translation, timestamp, and language detection capabilities.
BERT
by Google
BERT (Bidirectional Encoder Representations from Transformers) is Google's landmark 2018 language model that introduced the bidirectional pre-training paradigm using masked language modeling and next sentence prediction. It revolutionized NLP by demonstrating that a single pre-trained model could achieve state-of-the-art results across dozens of downstream tasks with minimal fine-tuning.
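The masked language modeling objective above can be sketched in a few lines. This is a minimal illustration of the recipe described in the BERT paper (15% of tokens selected; of those, 80% replaced with [MASK], 10% with a random token, 10% left unchanged); `mask_tokens` and the toy vocabulary are illustrative names, not part of any library.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style masking: select ~mask_prob of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns the corrupted sequence and the (index, original) targets
    the model must predict."""
    rng = random.Random(seed)
    masked, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets.append((i, tok))  # prediction target is the original token
            roll = rng.random()
            if roll < 0.8:
                masked[i] = "[MASK]"
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)
            # else: token left unchanged, but still predicted
    return masked, targets

masked, targets = mask_tokens(
    ["the", "cat", "sat", "on", "the", "mat"],
    vocab=["dog", "ran", "tree"],
)
```

Training then asks the encoder to recover each target token from full bidirectional context, which is what distinguishes BERT from left-to-right language models.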
Gemini 2.5 Pro
by Google DeepMind
Google DeepMind's flagship thinking model with native multimodal understanding across text, images, audio, and video. Excels at complex reasoning, code generation, and agentic tasks with a million-token context window.
Stable Diffusion XL
by Stability AI
Stability AI's high-resolution image generation model producing photorealistic and artistic images at 1024x1024 resolution. Features a two-stage architecture with a base model and refiner for enhanced detail and compositional quality.
GPT-4 Turbo
by OpenAI
An optimized variant of GPT-4 offering a 128K context window, faster inference, and significantly reduced costs. Introduced JSON mode and improved function calling, making it the preferred GPT-4 variant for production applications.
Llama 3.1 70B
by Meta
Meta's workhorse open-source model with 70B parameters, 128K context window, and native tool-use support. Widely deployed as a cost-effective alternative to proprietary frontier models.
DeepSeek-V3
by DeepSeek
DeepSeek's frontier-class MoE model with 671B total parameters and 37B active, trained using FP8 mixed precision for unprecedented cost efficiency. Matches or exceeds GPT-4o and Claude 3.5 Sonnet on key benchmarks.
o1
by OpenAI
OpenAI's first reasoning model, which uses an extended internal chain of thought before responding. Achieves expert-level performance on competitive math (AIME), PhD-level science (GPQA), and complex coding tasks through large-scale reinforcement learning on chain-of-thought.
ElevenLabs Turbo v2.5
by ElevenLabs
ElevenLabs Turbo v2.5 is a low-latency multilingual text-to-speech model optimized for real-time conversational AI applications, offering sub-400ms first-audio latency while maintaining the high voice cloning fidelity ElevenLabs is known for across 32 languages. It powers a wide range of AI assistant, customer service, and interactive voice applications where natural-sounding, real-time speech is critical.
Llama 3.1 405B
by Meta
The largest openly available language model at 405 billion parameters, rivaling proprietary frontier models in reasoning and knowledge. A landmark release demonstrating open-source models can match closed alternatives.
DALL-E 3
by OpenAI
OpenAI's most advanced image generation model with native ChatGPT integration. Features dramatically improved prompt following, text rendering, and safety mitigations compared to DALL-E 2, generating high-fidelity images from natural language descriptions.
Claude 4 Sonnet
by Anthropic
Anthropic's balanced Claude 4 generation model delivering strong coding and reasoning at competitive pricing. Features improved agentic capabilities and extended thinking, offering a compelling mid-tier option between Haiku and Opus.
Llama 3 70B
by Meta
Meta's high-performance 70B parameter model closing the gap with proprietary frontier models. Achieved competitive results on major benchmarks while remaining fully open-source.
Claude 4.5 Sonnet
by Anthropic
Anthropic's most advanced Sonnet-tier model, combining frontier intelligence with practical speed and cost. Features state-of-the-art coding performance, improved extended thinking, and robust agentic capabilities for complex multi-step workflows.
GPT-2
by OpenAI
GPT-2 is OpenAI's 2019 autoregressive language model that demonstrated for the first time that large-scale unsupervised pre-training on internet text could produce coherent, fluent long-form text generation with zero-shot task performance. Its initial withheld release sparked global debate about AI safety and responsible disclosure of capable AI systems.
Gemini 2.5 Flash
by Google DeepMind
Google DeepMind's fast thinking model optimized for speed and cost efficiency while retaining strong reasoning capabilities. Supports a million-token context window with native multimodal input.
Gemini 2.0 Flash
by Google
Google's next-generation fast model built for the agentic era, featuring native tool use, multimodal generation, and real-time streaming. Outperforms Gemini 1.5 Pro on key benchmarks while maintaining Flash-tier speed and cost efficiency.
AlphaFold 3
by Google DeepMind
AlphaFold 3 is Google DeepMind's third-generation protein structure prediction model that extends beyond proteins to predict the structures of DNA, RNA, and small molecules and their interactions. It represents a revolutionary tool for drug discovery and structural biology, dramatically accelerating our understanding of molecular machines that underpin life.
Google WaveNet
by Google / DeepMind
Google WaveNet is DeepMind's pioneering generative model for raw audio waveforms. When published in 2016 it dramatically advanced the state of the art in text-to-speech naturalness, and it went on to power Google Assistant, Google Cloud TTS, and other Google products at massive scale. Its autoregressive waveform generation approach established the template for neural vocoder research and inspired a generation of TTS architectures.
Mistral 7B
by Mistral AI
Mistral AI's breakthrough 7B parameter model that outperformed Llama 2 13B across all benchmarks at launch. Introduced sliding window attention and grouped-query attention for efficient inference.
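Sliding window attention, mentioned above, restricts each token to attending over only the last W positions instead of the full causal prefix. A minimal sketch of the resulting attention mask (the helper `sliding_window_mask` is illustrative; Mistral 7B's actual window is 4096):

```python
def sliding_window_mask(seq_len, window):
    """Boolean mask where query position i may attend only to key
    positions j with i - window < j <= i (causal + local window)."""
    return [
        [(i - window < j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=6, window=3)
```

Because each layer's receptive field stacks, information still propagates beyond the window across layers, while per-layer attention cost drops from O(n²) to O(n·W).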
Gemini 1.5 Pro
by Google
Google's mid-size multimodal model featuring a groundbreaking 2 million token context window using mixture-of-experts architecture. Excels at long-document understanding, video analysis, and cross-modal reasoning tasks that require processing large volumes of information.
GPT-4o mini
by OpenAI
OpenAI's most cost-efficient small model, replacing GPT-3.5 Turbo as the default lightweight option. Scores 82% on MMLU and outperforms GPT-4 on chat preferences while costing more than 60% less than GPT-3.5 Turbo.
FLUX 1.1 Pro
by Black Forest Labs
FLUX 1.1 Pro from Black Forest Labs is a next-generation text-to-image model built by the original creators of Stable Diffusion, offering superior prompt comprehension, anatomical accuracy, and photorealistic detail. The FLUX.1 family combines exceptional speed and quality across a hosted Pro tier and open-weight Dev and Schnell variants for different use cases.
T5
by Google
T5 (Text-To-Text Transfer Transformer) is Google's 2019 framework that reframes all NLP tasks as text-to-text problems, allowing a single model to be trained on a unified mixture of tasks. Its clean formulation and the C4 dataset became foundational references for multitask learning research, and T5 variants remain widely used in production and research.
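The text-to-text reframing is easiest to see as prompt templating: every task instance becomes a plain input string and the model emits a plain output string. A sketch using task prefixes from the T5 paper ("translate English to German:", "summarize:", "cola sentence:"); the `to_text_to_text` helper is illustrative, not a library function.

```python
def to_text_to_text(task, **fields):
    """Format a task instance as a single input string, T5-style:
    one model, one interface, for every NLP task."""
    templates = {
        "translate_en_de": "translate English to German: {text}",
        "summarize": "summarize: {text}",
        "cola": "cola sentence: {text}",  # grammatical acceptability
    }
    return templates[task].format(**fields)

prompt = to_text_to_text("summarize", text="T5 casts every NLP task as text-to-text.")
```

Because inputs and outputs are both text, classification labels, translations, and regression scores are all emitted as strings, letting one training mixture cover all tasks.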
GPT-4V
by OpenAI
OpenAI's multimodal extension of GPT-4 with native vision capabilities for image understanding, OCR, and visual reasoning. Processes interleaved text and images for tasks ranging from chart analysis to visual question answering.
Suno V3.5
by Suno AI
Suno V3.5 is a text-to-song AI model that generates complete, radio-quality music tracks with vocals, instrumentation, and song structure directly from natural language prompts or custom lyrics. It supports an enormous range of genres and styles and is widely regarded as the most accessible and highest-quality text-to-music system for non-musicians.
Mixtral 8x7B
by Mistral AI
Mistral AI's sparse mixture-of-experts model using 8 expert networks of 7B parameters each, activating only 2 per token. Matches GPT-3.5 performance while using a fraction of the compute at inference.
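The "activate only 2 of 8 experts per token" routing above is a top-2 softmax gate over per-token expert logits. A minimal sketch under that description (`top2_route` is an illustrative name; a real router is a learned linear layer producing the logits):

```python
import math

def top2_route(logits):
    """Pick the two highest-scoring experts for this token and
    softmax-normalize their gate weights; only those two expert
    FFNs run, so compute stays near 2/8 of the dense cost."""
    top2 = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:2]
    exps = [math.exp(logits[e]) for e in top2]
    z = sum(exps)
    return [(e, w / z) for e, w in zip(top2, exps)]

# Router logits for one token over 8 experts
routes = top2_route([0.1, 2.0, -1.0, 1.5, 0.0, 0.3, -0.5, 0.9])
```

The token's output is the gate-weighted sum of the two selected experts' outputs, which is how total parameters (8 experts) exceed active parameters (2 experts) per token.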
Qwen 2.5 72B
by Alibaba Cloud
The flagship open-weight model in the Qwen 2.5 series, offering substantial improvements in reasoning, instruction following, and structured output over its predecessor. Supports 128K context with strong performance across 29+ languages.
DeepSeek Coder V3
by DeepSeek
DeepSeek Coder V3 is DeepSeek's third-generation code-specialized model, trained on over 2 trillion tokens of code and natural language with a mixture-of-experts architecture. It achieves state-of-the-art performance on major coding benchmarks, surpassing GPT-4o and Claude 3.5 Sonnet on several code generation tasks.
Llama 3.3 70B
by Meta
Meta's refined 70B model delivering performance comparable to the much larger 405B variant through improved training techniques. Offers the best performance-to-cost ratio in the Llama family.
Llama 3 8B
by Meta
Meta's third-generation compact language model with significantly improved performance over Llama 2 at the same size class. Features an expanded 128K token vocabulary and improved tokenizer.
o3-mini
by OpenAI
A compact and cost-efficient reasoning model that delivers strong STEM performance at a fraction of o3's cost. Supports configurable reasoning effort (low/medium/high) to balance speed and accuracy for different use cases.
Claude 3 Opus
by Anthropic
Anthropic's most intelligent model at the launch of the Claude 3 family, excelling at highly complex tasks requiring deep reasoning and nuanced understanding. Set new benchmarks in graduate-level reasoning and demonstrated near-human comprehension across academic subjects.
Llama 2 70B
by Meta
Meta's largest Llama 2 variant with 70 billion parameters delivering substantially improved reasoning and knowledge over the 7B version. Became the de facto open-source baseline for LLM research.
Llama 2 7B
by Meta
Llama 2 7B is an open-source 7 billion parameter large language model developed by Meta. Optimized for dialogue and general text generation, its permissive license and manageable size have made it a popular foundational model for fine-tuning, research, and building custom NLP applications.
Sora
by OpenAI
Sora is a text-to-video diffusion transformer model by OpenAI that generates high-fidelity, minute-long videos from textual prompts. It demonstrates an advanced understanding of language and the physical world, enabling complex scenes with multiple characters, specific motions, and coherent narratives.
Llama 3.1 8B
by Meta
Llama 3.1 8B is a compact, open-source language model from Meta, featuring a 128K token context window and native tool-use capabilities. It is optimized for high performance in instruction-following and reasoning tasks, making it a cost-effective solution for scalable, on-device, or resource-constrained applications.
Stable Diffusion 3
by Stability AI
Stable Diffusion 3 is a powerful text-to-image model using a Multimodal Diffusion Transformer (MMDiT) architecture. It excels at generating images with unprecedented text quality, adhering closely to complex prompts, and achieving high photorealism and compositional accuracy compared to its predecessors.
Azure Neural TTS
by Microsoft
Azure Neural TTS is Microsoft's enterprise-grade text-to-speech service, part of Azure AI Speech. It provides 400+ natural-sounding voices across 140+ languages, with detailed prosody control via SSML. The service is designed for scalable applications, from accessibility tools to customer service bots.
Adobe Firefly 3
by Adobe
Adobe Firefly 3 is a commercially safe generative image model trained exclusively on licensed Adobe Stock and public-domain content, making it uniquely suitable for professional and enterprise creative workflows. Its deep integration with Photoshop, Illustrator, and Express enables AI-powered generation directly within industry-standard design tools.
Codex-2
by OpenAI
Codex-2 is OpenAI's second-generation code-specialized model, significantly advancing code completion, synthesis, and debugging over the original Codex. It underpins GitHub Copilot's next-generation features and supports a wider range of programming languages and frameworks.
ClinicalBERT
by Kexin Huang et al. (Academic)
ClinicalBERT is a BERT-based model pre-trained on clinical notes from the MIMIC-III dataset. It provides a deep contextual understanding of electronic health record (EHR) text and clinical documentation, serving as a foundational model for various clinical natural language processing tasks.
Gemini 2.5 Ultra
by Google DeepMind
Gemini 2.5 Ultra is Google DeepMind's most capable model in the 2.5 generation, designed for the most demanding reasoning, coding, and multimodal tasks. It features an extended context window and advanced chain-of-thought capabilities surpassing prior Gemini variants.
Claude Opus 4
by Anthropic
Anthropic's most capable model in the Claude 4 generation, designed for the most demanding reasoning, analysis, and agentic tasks. Excels at complex multi-step problems requiring deep understanding and sustained coherence across long contexts.
Gemini 1.5 Flash
by Google
Google's lightweight and fast multimodal model optimized for high-volume, cost-sensitive workloads. Supports a 1 million token context window with natively multimodal capabilities across text, image, audio, and video at a fraction of Pro's cost.
Cohere Embed v3
by Cohere
Cohere's state-of-the-art embedding model supporting 100+ languages with native int8 and binary quantization for efficient storage. Produces high-quality vector representations optimized for search, classification, and clustering tasks.
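Binary quantization, mentioned above, keeps only the sign of each embedding dimension, shrinking storage roughly 32x versus float32 and letting retrieval use Hamming distance. A minimal sketch of the idea (the `binarize`/`hamming` helpers are illustrative, not Cohere SDK calls):

```python
def binarize(vec):
    """Binary quantization: keep only the sign of each dimension,
    so a float vector packs down to one bit per dimension."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    """Distance between binary embeddings: number of differing bits.
    Lower distance ~ higher similarity of the original vectors."""
    return sum(x != y for x, y in zip(a, b))

query = binarize([0.8, -0.1, 0.3, -0.7])
doc = binarize([0.5, -0.2, -0.4, -0.9])
dist = hamming(query, doc)
```

In production the bits are packed into integers and compared with XOR + popcount, which is why binary embeddings make very large corpora cheap to search.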
Grok-3
by xAI
Grok-3 is xAI's frontier model, delivering state-of-the-art performance in math, science, and coding. Trained on the Colossus supercluster, it features DeepSearch for multi-step research and a 'Think' mode for extended chain-of-thought reasoning, enabling it to tackle complex, real-world problems with access to real-time information.
DeepSeek-Coder-V2
by DeepSeek
DeepSeek-Coder-V2 is a powerful open-source Mixture-of-Experts (MoE) model specialized in code. It supports 338 programming languages and features advanced fill-in-the-middle capabilities, offering performance comparable to top-tier proprietary models like GPT-4 Turbo at a significantly lower inference cost.
Claude 3.5 Haiku
by Anthropic
Anthropic's fastest, most affordable model in the 3.5 generation, offering performance comparable to Claude 3 Opus. It excels at coding, complex workflows, and agentic tasks due to its advanced tool-use capabilities and speed, making it ideal for high-throughput applications and enterprise automation.
Runway Gen-3 Alpha
by Runway
Runway Gen-3 Alpha is a professional-grade video generation model for high-fidelity, temporally consistent clips. It offers fine-grained control over motion, style, and camera behavior via text and image inputs, making it a key tool in professional film and advertising workflows for meeting commercial standards.
Qwen 2 72B
by Alibaba Cloud
Qwen2-72B is a 72-billion parameter large language model from Alibaba's Qwen2 series. It offers state-of-the-art performance, particularly in multilingual understanding, reasoning, and coding tasks. As an open-weight model, it provides a powerful alternative to proprietary systems for a wide range of applications.
Claude 3 Sonnet
by Anthropic
The balanced mid-tier model in the Claude 3 family, offering a strong combination of speed and intelligence. Provides enterprise-grade performance for coding, analysis, and content generation at moderate cost.
AlphaGo
by Google DeepMind
AlphaGo is a landmark AI from DeepMind that mastered the game of Go. It combines deep neural networks with Monte Carlo Tree Search and reinforcement learning, famously defeating world champion Lee Sedol in 2016. Its success demonstrated AI's ability to tackle complex problems requiring strategic planning.
Qwen 2.5 Coder 32B
by Alibaba Cloud
Qwen 2.5 Coder 32B is an open-weight, code-specialized large language model from Alibaba Cloud. Fine-tuned on a massive corpus covering over 92 programming languages, it excels at code generation, completion, and debugging tasks, demonstrating performance on par with or exceeding proprietary models like GPT-4o on several benchmarks.
Claude 3 Haiku
by Anthropic
Claude 3 Haiku is Anthropic's fastest, most compact model, excelling at near-instant responsiveness. It handles a wide range of tasks, including multimodal vision, with strong performance at a low cost, making it ideal for high-throughput applications like content moderation and customer service.
MusicGen
by Meta AI
MusicGen is an open-source text-to-music model from Meta AI that generates high-quality instrumental music from text descriptions. It can also be conditioned on a melody reference, providing a strong, controllable baseline for both research and commercial applications, trained on 20K hours of licensed music.
Mixtral 8x22B
by Mistral AI
Mixtral 8x22B is a large-scale, open-source Mixture-of-Experts (MoE) model from Mistral AI. It has 141 billion total parameters but activates only 39 billion per token, balancing immense power with efficiency. The model excels at reasoning, code generation, and multilingual tasks, and includes native function calling capabilities.
Mistral Large
by Mistral AI
Mistral Large is Mistral AI's flagship proprietary model, offering top-tier reasoning and multilingual capabilities. It is designed to compete with other frontier models like GPT-4, excelling in complex tasks that require deep understanding. Its native function calling and fluency in over 30 languages make it highly versatile for enterprise-grade applications.
Code Llama 34B
by Meta
Code Llama 34B is a large language model from Meta, fine-tuned from Llama 2 for code-specific tasks. It excels at generating, completing, and explaining code across various languages. With variants supporting a 100K token context window, it can analyze and work with extensive codebases for complex tasks like refactoring.
Multilingual-E5-Large
by Microsoft Research
Multilingual-E5-Large is a powerful text embedding model from Microsoft supporting 100 languages. Trained on billions of text pairs using contrastive learning, it excels at cross-lingual information retrieval and semantic similarity, establishing a strong open-source baseline for multilingual NLP tasks.
Med-PaLM 2
by Google
Med-PaLM 2 is Google's large language model specialized for the medical domain. It achieves expert-level performance on medical licensing exams (USMLE) by leveraging advanced clinical reasoning and question-answering capabilities. The model is designed to generate accurate and helpful responses for healthcare professionals.
Qwen2.5-VL-72B
by Alibaba Cloud (Qwen Team)
Qwen2.5-VL-72B is Alibaba's flagship open vision-language model at 72 billion parameters, achieving top-tier performance on visual understanding benchmarks including chart analysis, document parsing, and fine-grained image understanding. It supports dynamic resolution image inputs and video understanding with native high-resolution processing.
GPT-4.5
by OpenAI
GPT-4.5 is a large language model from OpenAI, released as a research preview ahead of GPT-5. It focuses on scaling unsupervised learning to significantly reduce hallucinations and enhance factual accuracy, and is tuned for improved creative writing and greater emotional intelligence in its responses.
Phi-3.5-mini
by Microsoft
Phi-3.5-mini is a 3.8B parameter instruction-tuned model from Microsoft, optimized for edge and mobile devices. Despite its compact size, it delivers performance comparable to much larger models on benchmarks for reasoning, coding, and language tasks, making it highly efficient for on-device AI applications.
o1-mini
by OpenAI
A smaller, faster, and more affordable reasoning model optimized for STEM tasks. Delivers 80% of o1's reasoning capability at roughly 80% lower cost, making it ideal for high-volume coding and math workloads.
PaLM
by Google
PaLM (Pathways Language Model) is Google's 540 billion parameter language model trained using the Pathways system across 6,144 TPU v4 chips, demonstrating breakthrough capabilities on chain-of-thought reasoning, code generation, and multilingual tasks. It introduced the concept of 'discontinuous' capability jumps at scale and set new benchmarks on hundreds of NLP tasks upon release in 2022.
Ideogram 2
by Ideogram AI
Ideogram 2 is a text-to-image model renowned for its superior ability to render legible and accurate text within generated images. It excels at creating high-quality photorealistic and artistic visuals with strong prompt adherence, making it a powerful tool for design, branding, and creative projects.
Amazon Polly Neural
by Amazon Web Services
Amazon Polly is a cloud-based text-to-speech (TTS) service from AWS that produces highly natural-sounding human speech using neural engine technology. It supports over 30 languages with both standard and neural voices, offering deep integration with the AWS ecosystem for scalable production applications.
Claude Opus 4.5
by Anthropic
Claude Opus 4.5 is Anthropic's frontier AI model, delivering state-of-the-art performance in complex reasoning, creative tasks, and nuanced understanding. It features advanced multimodal vision capabilities for analyzing images and documents, along with extended thinking for multi-step, agentic tasks.
TTS-1
by OpenAI
OpenAI's TTS-1 is a text-to-speech model designed for real-time audio generation. It provides six distinct, natural-sounding preset voices and supports low-latency streaming, making it ideal for interactive applications. A higher-quality variant, tts-1-hd, is available for tasks where audio fidelity is prioritized over speed.
Command R+
by Cohere
Cohere's most capable RAG-optimized model, offering significantly enhanced reasoning, multi-step tool use, and superior grounded generation over Command R. Designed for complex enterprise workflows requiring high accuracy and citations.
Imagen 3
by Google DeepMind
Google DeepMind's highest-quality text-to-image generation model producing photorealistic images with improved detail, lighting, and fewer artifacts. Features enhanced prompt understanding and safety filtering.
Qwen 2.5 Max
by Alibaba Cloud
Alibaba Cloud's most capable proprietary model in the Qwen 2.5 family, optimized for complex reasoning and enterprise applications. Available exclusively through Alibaba Cloud's Model Studio API with enhanced safety and alignment.
AudioCraft
by Meta AI
AudioCraft is an open-source generative audio framework from Meta AI. It integrates MusicGen for music, AudioGen for sound effects, and the EnCodec codec into a single platform. This unified, modular design allows for text-to-audio generation and has become a key reference for audio LLM research.
LegalBERT
by Ilias Chalkidis et al. (Academic)
LegalBERT is a family of BERT models pre-trained on a diverse corpus of English legal texts, including legislation, court cases, and contracts. This specialized training allows it to significantly outperform general-purpose BERT models on downstream legal NLP tasks, establishing it as a foundational baseline for legal AI research and applications.
Gemma 2 9B
by Google DeepMind
Gemma 2 9B is a lightweight, state-of-the-art open model from Google, part of the next generation of the Gemma family. It offers strong performance for its size class, making it ideal for environments with limited computational resources. Built on a new architecture, it is optimized for on-device applications, research, and fine-tuning.
QwQ-32B
by Alibaba / Qwen Team
QwQ-32B is a 32 billion parameter language model from Alibaba, specifically optimized for complex reasoning tasks. It utilizes a deep chain-of-thought methodology to excel at mathematical, scientific, and logical problems, achieving performance comparable to much larger models and showcasing high parameter efficiency.
BLOOM
by BigScience Workshop
BLOOM is a 176 billion parameter, open-access multilingual language model developed by the BigScience research workshop. Trained on 46 natural languages and 13 programming languages, it provides powerful text and code generation capabilities, making it a key resource for researchers and developers building multilingual AI applications.
StarCoder2 15B
by BigCode (ServiceNow + Hugging Face)
StarCoder2 15B is a powerful open-source code generation model from the BigCode project. Trained on The Stack v2 dataset spanning over 600 programming languages, it excels at code completion, generation, and fill-in-the-middle tasks, emphasizing data transparency and author opt-out.
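Fill-in-the-middle prompting, mentioned above, rearranges a file around the gap using sentinel tokens so an autoregressive model can generate the missing span. A sketch using the StarCoder-style sentinels (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`); the `fim_prompt` helper is illustrative, and exact sentinel strings should be checked against the model's tokenizer.

```python
def fim_prompt(prefix, suffix):
    """Assemble a fill-in-the-middle prompt: the model sees the code
    before and after the gap, then generates the middle span after
    the <fim_middle> sentinel."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = fim_prompt(
    "def add(a, b):\n    return ",
    "\n\nprint(add(2, 3))",
)
```

The model's completion (everything it emits after `<fim_middle>`) is spliced back between the prefix and suffix, which is how editors implement cursor-position completion.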
Phi-3 Mini
by Microsoft
Microsoft's Phi-3 Mini is a 3.8 billion parameter small language model (SLM) designed for high performance on resource-constrained devices. Despite its compact size, it exhibits strong reasoning and language understanding capabilities, making it suitable for on-device and edge AI applications. It is optimized for efficient inference.
Cohere Rerank v3
by Cohere
Cohere Rerank v3 is a state-of-the-art neural model designed to significantly boost the relevance of search results for Retrieval-Augmented Generation (RAG) systems. It re-scores a list of candidate documents from any keyword or vector search system, identifying the most pertinent information. It supports over 100 languages and can process long documents, making it highly versatile.
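The rerank stage described above is a simple contract: take the query plus a candidate list from any first-stage retriever, re-score every candidate, and keep the best few. A sketch of that flow in which a toy word-overlap score stands in for the neural cross-encoder a real deployment would call; `rerank` and its scoring are illustrative, not the Cohere API.

```python
def rerank(query, docs, top_n=3):
    """Second-stage reranking: re-score all candidates against the
    query and return the top_n most relevant. The overlap score here
    is a placeholder for a learned relevance model."""
    def score(doc):
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        return len(q_words & d_words) / len(q_words)
    return sorted(docs, key=score, reverse=True)[:top_n]

best = rerank(
    "how do i rotate api keys",
    ["Rotating api keys safely", "Office seating chart", "api key rotation schedule"],
    top_n=2,
)
```

Because the reranker only sees a short candidate list, it can afford a much more expensive relevance model than the first-stage index, which is where most RAG quality gains come from.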
DeepSeek Coder 33B
by DeepSeek
DeepSeek Coder 33B is a dense, open-source large language model specializing in code-related tasks. Trained from scratch on a massive 2 trillion token dataset of code and natural language, it understands project-level context and supports 87 different programming languages for advanced code generation and completion.
Llama 3.2 11B Vision
by Meta
Llama 3.2 11B Vision is Meta's first open-source multimodal model, integrating native image understanding with advanced text generation. At a compact 11B parameters, it's designed for efficiency, enabling visual question answering, image captioning, and complex reasoning across text and images in a single, deployable model.
DeepSeek-V2
by DeepSeek
DeepSeek's mixture-of-experts model introducing Multi-head Latent Attention (MLA) for dramatically reduced inference cost. Activates 21B of its 236B total parameters per token while matching larger dense models.
Codestral
by Mistral AI
Codestral is Mistral AI's open-weight generative model explicitly designed for code generation tasks. Trained on a diverse dataset of over 80 programming languages, it excels at code completion, generation, and its unique fill-in-the-middle capability. It is optimized for low-latency performance in real-world applications.
Gemma 2 27B
by Google DeepMind
Gemma 2 27B is a powerful, mid-sized open-weights model from Google DeepMind. It delivers significant performance gains in reasoning, coding, and instruction following over smaller variants. Designed for server-side deployment, it provides a strong foundation for advanced research and custom fine-tuning projects.
Claude 4.5 Haiku
by Anthropic
Claude 4.5 Haiku is Anthropic's fastest and most compact model, engineered for near-instant responsiveness and high-throughput workloads. It provides enterprise-grade performance at a fraction of the cost, making it ideal for real-time interactions, content moderation, and cost-effective agentic tasks.
XTTS-v2
by Coqui AI
XTTS-v2 is an open-source, cross-lingual text-to-speech model from Coqui AI. It excels at high-quality voice cloning from just a few seconds of audio and supports 17 languages. With real-time streaming inference, it's ideal for applications needing custom voices and low-latency output.
BloombergGPT
by Bloomberg
BloombergGPT is a 50-billion parameter large language model developed by Bloomberg. It is specifically trained on a massive, curated corpus of financial data accumulated over decades, combined with general-purpose datasets. This specialized training allows it to excel at financial natural language processing tasks, outperforming similarly sized general models.
Grok-2
by xAI
Grok-2 is xAI's second-generation large language model, notable for its real-time knowledge access through the X platform. It possesses strong reasoning and multimodal capabilities, including vision understanding. The model is designed for a more natural, conversational interaction style with a lower tendency to refuse prompts.
BioGPT
by Microsoft Research
BioGPT is a domain-specific language model from Microsoft, pre-trained on a massive corpus of biomedical literature from PubMed. It excels at tasks like generating biomedical text, extracting relationships between entities, and answering questions based on medical research, achieving state-of-the-art results on several benchmarks.
Command R
by Cohere
Command R is a retrieval-optimized language model from Cohere, specifically designed for enterprise-grade Retrieval-Augmented Generation (RAG) and tool use. It excels in multilingual applications, supporting over 10 languages, and features built-in capabilities for grounding responses and generating citations to ensure accuracy.
Gemma 2B
by Google DeepMind
Gemma 2B is Google DeepMind's open-weight 2 billion parameter language model from the Gemma family, designed for lightweight deployment on devices with limited resources. It delivers strong performance for its size on language understanding and generation tasks, and serves as a foundation for fine-tuning on domain-specific tasks.
Pika 1.5
by Pika Labs
Pika 1.5 is an accessible AI video generation model that transforms text prompts or images into high-quality videos. It is known for its expressive motion, diverse cinematic styles, and unique features like physics-based effects and automated lip-sync, making it popular among creators and consumers.
SRE Triage Agent
by AaaS DevOps Foundry
Detects anomalies in live system telemetry, runs deterministic diagnostics from the organization's top remediation runbooks, and autonomously resolves up to 40% of standard incidents without human intervention. Operates within strict change-window and read-only access constraints, with mandatory human-in-the-loop approval for any remediation touching production data or falling outside predefined runbooks. Reduces mean-time-to-recovery and augments on-call teams.
Pipeline Healer Agent
by AaaS DevOps Foundry
Continuously observes CI/CD pipelines, code repositories, and incident logs. Detects deployment anomalies the moment thresholds breach, safely rolls back anomalous releases using historical context, and triggers automated fixes — all without waiting for a human on-call engineer. Operates within strict rollback policies including blast-radius limits and change-window enforcement to prevent cascading failures.
Dependency Guardian Agent
by AaaS DevOps Foundry
Maps the entire dependency tree across an organization's codebases, tests library updates in isolated sandbox environments, writes localized unit tests to verify compatibility, and submits fully validated pull requests that respect architectural constraints. Prevents the cascade of breaking changes that plagues manual dependency updates, where a naive LLM taking a prompt literally might introduce version conflicts or accidentally remove necessary features.
OpenAI Assistants API
by OpenAI
OpenAI's managed agent platform for building custom AI assistants with persistent threads, built-in code interpreter, file search, and function calling. Handles conversation state, tool orchestration, and context management so developers can focus on business logic.
Microsoft Copilot Agent
by Microsoft
Microsoft's autonomous agent within the Copilot ecosystem that operates across Microsoft 365 apps to automate business processes. Handles email triage, meeting preparation, document summarization, and cross-app workflow automation with enterprise-grade security.
Personalized Tutor Agent
by Khanmigo (Khan Academy)
An adaptive tutoring agent that dynamically adjusts difficulty, pacing, and instructional modality based on individual learner performance signals. It maintains a persistent knowledge model per student, identifies misconceptions through Socratic questioning, and routes learners to mastery via spaced-repetition scheduling.
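The spaced-repetition scheduling mentioned above can be illustrated with a Leitner-style rule: grow the review interval on success, reset it on failure. This is a generic sketch of the technique, not Khanmigo's actual scheduler; `next_review` is an illustrative name.

```python
from datetime import date, timedelta

def next_review(last_interval_days, correct):
    """Leitner-style spaced repetition: double the interval after a
    correct answer, reset to one day after a miss. Returns the new
    interval and the due date."""
    interval = last_interval_days * 2 if correct else 1
    return interval, date.today() + timedelta(days=interval)

interval, due = next_review(4, correct=True)
```

Production systems (e.g., SM-2 variants) also weight the interval by answer quality and an ease factor, but the grow-on-success, reset-on-failure core is the same.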
Codebase Architecture Agent
by AaaS DevOps Foundry
Maps structural dependencies, architectural patterns, and historical technical decisions across enterprise codebases. When a critical service fails and the original developers are unavailable, this agent produces a semantic architecture map — dependency graphs, hotspot analysis, and knowledge gap identification — in minutes instead of weeks. Integrates deeply with repositories to understand code as architecture, not just text.
Omnichannel Support Agent
by Intercom
A fully autonomous customer support agent that unifies conversations across chat, email, SMS, and social DMs into a single threaded context window. It resolves tier-1 and tier-2 tickets using a retrieval-augmented knowledge base and maintains CSAT targets through sentiment-aware tone calibration.
AutoGen
by Microsoft Research
Microsoft's multi-agent conversation framework enabling multiple LLM agents to converse, collaborate, and solve tasks through automated chat. Supports customizable agent behaviors, human-in-the-loop, and code execution sandboxing.
Perplexity
by Perplexity AI
AI-powered answer engine that combines real-time web search with LLM synthesis to provide cited, accurate answers. Features multi-step research capabilities, source verification, and conversational follow-up for deep topic exploration.
EHR Documentation Agent
by Nuance Communications (Microsoft)
Ambient AI agent that listens to physician-patient encounters, generates structured clinical notes (SOAP, H&P, discharge summaries), and auto-populates EHR fields in real time. Reduces documentation burden by over 70% while maintaining compliance with ICD-10 and CPT coding standards.
Salesforce Einstein Agent
by Salesforce
Salesforce's autonomous AI agent built on the Einstein platform that handles customer interactions, resolves support cases, and automates sales workflows. Operates within the Salesforce ecosystem with full access to CRM data, knowledge bases, and business rules.
Drug Interaction Checker
by Wolters Kluwer Health
Real-time pharmacological agent that screens multi-drug regimens for contraindications, adverse interactions, and dosing conflicts. Cross-references patient allergy profiles, renal function, and genetic pharmacogenomics data to surface clinically relevant alerts at point of prescribing.
SEO Analysis Agent
by Ahrefs
A fully autonomous SEO agent that continuously crawls a target website, audits technical health, researches high-intent keywords, and generates prioritized optimization recommendations. It tracks ranking movements in real time and surfaces backlink opportunities from competitor gap analysis.
ElevenLabs Conversational Agent
by ElevenLabs
ElevenLabs' conversational AI agent platform combining industry-leading voice synthesis with real-time dialogue capabilities. Supports 29+ languages, custom voice creation, and ultra-low-latency responses for natural phone and web interactions.
AutoGPT
by Significant Gravitas
One of the first open-source autonomous AI agents that chains LLM calls to accomplish complex goals. Decomposes high-level objectives into sub-tasks, maintains memory, and executes multi-step plans with internet access and file operations.
Latency Budget Planner Agent
by AaaS DevOps Foundry
Decomposes end-to-end application latency into detailed per-component budgets for real-time and streaming pipeline architectures. Autonomously adds graceful degradation protocols, timeout handling configurations, and p50/p95 tracing metrics required for production multimodal systems. Where a generic AI produces streaming pipeline code without real-world latency considerations, this agent understands the physics of distributed systems and produces actionable latency allocation plans.
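Per-component latency budgeting of the kind described above can be sketched as a proportional split of an end-to-end target, with headroom reserved for retries and jitter. The stage names, weights, and 10% margin are hypothetical, not the agent's actual policy:

```python
def allocate_budget(total_ms: float, weights: dict[str, float], margin: float = 0.1) -> dict[str, float]:
    """Split an end-to-end latency budget across pipeline stages.
    A fraction (`margin`) is held back as headroom for retries and jitter;
    the rest is divided in proportion to each stage's expected cost."""
    usable = total_ms * (1 - margin)
    scale = usable / sum(weights.values())
    return {stage: w * scale for stage, w in weights.items()}

# Hypothetical stages for a speech-to-speech pipeline with an 800 ms p95 target.
budgets = allocate_budget(800, {"asr": 3, "llm": 5, "tts": 2})
```

In practice each stage budget would then be enforced as a timeout and traced at p50/p95 so regressions are attributable to a specific component.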
Learning Path Optimizer
by Coursera
A recommendation agent that maps learner skill profiles against target competency frameworks and synthesizes the shortest credentialed path to proficiency. It continuously reoptimizes routing as learners complete modules and integrates real-time labor-market signals to prioritize high-value skill sequences.
Legal Research Agent
by Westlaw AI (Thomson Reuters)
Comprehensive legal research agent that queries case law databases, statutes, regulations, and secondary sources to synthesize jurisdiction-specific memos, identify controlling precedents, and map circuit splits. Generates formatted legal research memos with citation-verified sources and confidence scores.
Google Duet AI
by Google
Google's AI-powered assistant embedded across Google Workspace and Google Cloud that automates document creation, email drafting, data analysis, and cloud infrastructure management. Leverages Gemini models for contextual understanding across the Google ecosystem.
Meeting Summarizer Agent
by Otter.ai
An autonomous agent that joins virtual meetings, transcribes conversations in real time with speaker diarization, and generates structured summaries containing decisions made, action items with owners and due dates, and key discussion points. It distributes follow-up notes to participants, syncs action items into project management tools, and maintains a searchable meeting knowledge base.
Snyk AI Agent
by Snyk
AI-powered developer security agent that continuously scans code, dependencies, containers, and infrastructure-as-code for vulnerabilities. Provides automated fix pull requests, prioritizes issues by exploitability, and integrates directly into the developer workflow for shift-left security.
GitHub Copilot Workspace
by GitHub (Microsoft)
GitHub's AI-native development environment that turns issues into fully implemented code changes. Plans, implements, and validates multi-file edits with human-in-the-loop review before merging.
LangGraph
by LangChain Inc.
LangChain's framework for building stateful, multi-agent applications using graph-based workflows. Provides fine-grained control over agent state, cycles, branching, and human-in-the-loop checkpoints for production-grade agentic systems.
Dependency Updater Agent
by Mend (WhiteSource)
An automated agent that scans software repositories for outdated or vulnerable dependencies, opens pull requests with tested dependency upgrades, and resolves breaking API changes introduced by major version bumps. It groups related updates, runs the test suite for each PR, and prioritizes CVE-critical packages to ensure security patches ship within SLA windows.
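The CVE-first prioritization described above might look like this in miniature. The `Update` record and the package names are invented for illustration, not Mend's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Update:
    package: str
    cvss: float          # highest CVE severity fixed by the upgrade (0 if none)
    versions_behind: int  # staleness of the current pin

def prioritize(updates: list[Update]) -> list[Update]:
    """Security-critical upgrades first (highest CVSS), then the most
    stale packages, so CVE fixes ship within SLA windows."""
    return sorted(updates, key=lambda u: (-u.cvss, -u.versions_behind))

queue = prioritize([
    Update("left-pad", 0.0, 4),
    Update("openssl-bindings", 9.8, 1),
    Update("requests", 6.1, 2),
])
```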
AWS Bedrock Agents
by Amazon Web Services
AWS's fully managed agent service within Amazon Bedrock that orchestrates multi-step tasks using foundation models. Automatically breaks down user requests, calls APIs, queries knowledge bases, and executes actions while maintaining enterprise security and compliance controls.
ServiceNow AI Agent
by ServiceNow
An autonomous AI agent built on the Now Platform, designed to automate end-to-end IT Service Management (ITSM) processes. It independently resolves common incidents, fulfills service requests, and executes standard change workflows by leveraging a proprietary knowledge graph and workflow engine, reducing the need for human intervention.
Performance Profiler Agent
by Datadog
An autonomous profiling agent that instruments application code, analyzes CPU flame graphs, memory heap snapshots, and database query plans to identify performance bottlenecks, then proposes and optionally applies targeted code optimizations. It tracks regression history, correlates deployments with latency spikes, and benchmarks fixes against baseline measurements before recommending production rollout.
CrewAI
by CrewAI
Framework for orchestrating role-playing autonomous AI agents that work together as a crew. Enables defining agents with specific roles, goals, and backstories to collaborate on complex tasks through structured workflows.
Risk Assessment Agent
by ServiceNow
This AI agent automates enterprise risk management (ERM) by continuously synthesizing data from internal systems and external intelligence. It identifies, categorizes, and scores diverse risks, maintaining a live risk register and mapping control effectiveness to provide a real-time, holistic view of the organization's risk posture.
Zendesk AI Agent
by Zendesk
Zendesk's AI Agent is an autonomous customer support tool designed to resolve inquiries across email, chat, and messaging. Trained on billions of real service interactions, it understands intent and sentiment to provide resolutions without requiring human intervention, freeing up teams for complex issues.
Social Media Optimizer
by Sprout Social
A semi-autonomous agent that optimizes social media content for maximum reach. It analyzes platform-specific engagement patterns, rewrites posts, schedules them for peak audience times, and A/B tests caption variations to improve performance across channels.
Document Classification Agent
by ABBYY
An AI agent that automates document processing by classifying unstructured files like invoices, contracts, and emails into predefined categories. It extracts key data, validates it against business logic, and routes documents to appropriate systems, supporting multiple languages and improving over time via human feedback.
Elicit
by Elicit
AI research assistant that automates systematic literature reviews and evidence synthesis. Searches across 200M+ academic papers, extracts key findings, and synthesizes results into structured summaries with full citations.
SWE-agent
by Princeton NLP
Princeton NLP's research agent that turns LLMs into autonomous software engineers. Achieves state-of-the-art results on SWE-bench by providing an agent-computer interface optimized for code navigation and editing.
Figma AI Agent
by Figma
Figma AI is a suite of native artificial intelligence features integrated directly within the Figma and FigJam platforms. It accelerates the design process by generating UI elements from text prompts, automatically populating mockups with realistic content, and providing intelligent suggestions to improve design consistency.
Support Resolver Agent
by AaaS
Resolves up to 80% of Tier-1 and Tier-2 support requests by directly accessing the CRM and payment gateways. Processes refunds within configurable monetary caps, updates account settings, modifies subscriptions, and routes edge cases to human representatives with full conversation summaries and diagnostic context. Unlike basic chatbots that regurgitate FAQ documents, this agent takes transactional action — it resolves, not deflects.
Google Vertex AI Agents
by Google Cloud
Google Vertex AI Agents is an enterprise-grade platform for building and deploying production-ready generative AI agents on Google Cloud. It enables developers to create agents that can reason, use tools, and leverage grounded generation with Google Search to complete complex tasks and engage in multi-turn conversations.
CrowdStrike Charlotte AI
by CrowdStrike
CrowdStrike's generative AI security analyst, Charlotte AI, accelerates threat operations by automating investigation and response. It correlates alerts, enriches incidents with threat intelligence, and recommends actions, allowing security teams to query vast datasets and understand threats using natural language.
Predictive Maintenance Agent
by SparkCognition
An IoT-connected agent that ingests vibration, temperature, acoustic, and electrical signals from industrial equipment to predict failure events hours to weeks in advance using ML anomaly detection and physics-based models. It generates work orders in CMMS systems, recommends spare parts pre-positioning, and calculates optimal maintenance windows to minimize production impact.
Radiology Report Agent
by Nuance PowerScribe
An AI assistant that accelerates radiology reporting by automatically drafting structured reports from imaging findings. It applies standard templates like ACR BI-RADS, extracts key measurements, and codes findings using RadLex terminology, significantly reducing radiologist documentation time and improving data consistency for analytics.
Medical Imaging Analyzer
by Arterys
Deep-learning agent that analyzes DICOM medical images across modalities — CT, MRI, X-ray, and PET — to surface anomalies, measure lesions, and generate structured findings. Integrates directly into PACS workflows and flags priority studies for radiologist review.
Escalation Manager Agent
by Zendesk
A decision-intelligence agent that monitors live support queues in real time, detects escalation signals (frustrated language, churn-risk keywords, repeat contacts), and routes high-priority cases to the most qualified available agent with full context pre-loaded. It enforces tiered escalation policies and logs every routing decision for compliance auditing.
Database Migration Agent
by Redgate
An autonomous agent designed to automate the entire database migration lifecycle. It analyzes schema differences, generates forward and rollback migration scripts, and validates data integrity post-migration. The agent supports complex data transformations and migrations across different database platforms like PostgreSQL and Oracle, ensuring zero data loss.
Fraud Detection Agent
by Featurespace
An AI agent designed for real-time fraud prevention across various payment channels. It leverages behavioral biometrics, graph analytics, and machine learning to analyze transaction streams, identify suspicious patterns, and provide sub-50ms risk decisions. The system includes adaptive feedback loops for continuous model improvement.
PagerDuty AI
by PagerDuty
PagerDuty AI is an AIOps agent for incident management that automates triage and response. It intelligently groups related alerts to reduce noise, correlates events to identify root causes, and suggests or executes automated remediation runbooks. This helps teams minimize downtime and streamline their on-call processes.
Content Strategy Agent
by Jasper AI
An autonomous AI agent designed to streamline content marketing operations. It performs comprehensive audits of existing content, identifies strategic topic gaps by analyzing competitors and search trends, and generates data-driven editorial calendars. The agent ensures all content aligns with brand voice and business objectives.
DataRobot AI Agent
by DataRobot
DataRobot is an enterprise AI platform that automates the end-to-end machine learning lifecycle. It enables users to build, deploy, and monitor predictive models at scale, from data preparation to production. The platform offers automated feature engineering, model selection, and hyperparameter tuning to accelerate the path from raw data to business value.
Student Assessment Agent
by Turnitin
An automated assessment agent that generates item banks, administers adaptive quizzes, and provides calibrated scoring with detailed feedback explanations. It applies Item Response Theory to estimate learner proficiency and surfaces at-risk students to instructors via configurable alert thresholds.
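Item Response Theory, which the entry cites for proficiency estimation, models the probability of a correct answer as a logistic function of learner proficiency relative to item difficulty. Below is a minimal two-parameter-logistic sketch with a grid-search maximum-likelihood estimate; the item bank is hypothetical and real systems use faster estimators:

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic IRT: probability that a learner with proficiency
    `theta` answers an item with discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(items: list[tuple[float, float]], responses: list[int]) -> float:
    """Grid-search maximum-likelihood estimate of proficiency from responses."""
    def loglik(theta: float) -> float:
        return sum(
            math.log(p_correct(theta, a, b)) if r else math.log(1 - p_correct(theta, a, b))
            for (a, b), r in zip(items, responses)
        )
    grid = [t / 100 for t in range(-300, 301)]
    return max(grid, key=loglik)

# Hypothetical item bank: (discrimination, difficulty) pairs; the learner
# answers the two easier items correctly and misses the hardest one.
theta = estimate_theta([(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)], [1, 1, 0])
```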
Contract Management Agent
by Ironclad
An AI agent for automating contract lifecycle management (CLM). It extracts critical data like terms, dates, and obligations from agreements, centralizes them into a searchable repository, and provides automated alerts for key deadlines. The agent streamlines review by comparing new contracts against pre-approved clause libraries and company playbooks.
Portfolio Optimizer
by BlackRock Aladdin
An advanced AI agent for constructing and managing investment portfolios. It leverages quantitative techniques like mean-variance optimization and the Black-Litterman model to align portfolios with specific investor goals, risk tolerances, and constraints such as ESG mandates, while continuously monitoring for drift and executing tax-efficient rebalancing.
Intercom Fin
by Intercom
Intercom Fin is an AI-powered chatbot designed for customer support automation, built on OpenAI's GPT-4. It autonomously resolves customer queries by leveraging a company's help center content and past conversation data. Fin provides human-like answers, can execute actions, and intelligently escalates complex issues to human agents.
Boston Dynamics Atlas
by Boston Dynamics
Next-generation fully electric humanoid robot designed for industrial and commercial applications. Features unmatched athletic ability, whole-body manipulation, and advanced perception for operating in complex, dynamic environments alongside humans.
Azure AI Agent Service
by Microsoft
An enterprise-grade platform from Microsoft for building, deploying, and managing sophisticated AI agents. Built on the Copilot stack, it allows developers to create agents that can reason, use tools, and orchestrate complex tasks. The service features deep integration with Microsoft services and robust responsible AI controls.
Ad Copy Generator
by Copy.ai
An AI agent designed for paid advertising that generates multiple headline and description variants to boost click-through rates. It analyzes product data, target personas, and landing pages to create optimized copy for Google Ads, Meta, and LinkedIn, ensuring strong message-to-market alignment.
MetaGPT
by DeepWisdom
MetaGPT is an open-source multi-agent framework that automates software development by simulating a virtual company. It assigns distinct roles like product manager, architect, and engineer to different LLM agents. Starting from a single-line requirement, it follows Standardized Operating Procedures (SOPs) to generate comprehensive outputs, including user stories, system designs, diagrams, and executable code.
Customer Feedback Analyzer
by Medallia
A continuous feedback intelligence agent that ingests NPS surveys, review platforms, support tickets, and social mentions to extract structured voice-of-customer insights. It applies aspect-level sentiment analysis to surface product and service themes and auto-generates prioritized improvement briefs for product and operations teams.
Player Analytics Agent
by GameAnalytics
A behavioral analytics agent that ingests player telemetry streams to build individual and cohort behavior models, predict churn risk, and surface liveops intervention opportunities. It continuously segments the player base by engagement and monetization propensity, feeding recommendations into targeted push notification, reward, and re-engagement campaign engines.
Literature Review Agent
by Elicit AI
An AI-powered agent designed to automate systematic literature reviews. It queries major academic databases like PubMed and arXiv to identify, screen, and synthesize evidence from thousands of papers. The agent produces structured outputs including evidence tables, meta-analysis plots, and PRISMA-compliant reports with bias assessments.
Email Triage Agent
by Superhuman
An inbox intelligence agent that reads, categorizes, and prioritizes incoming emails by urgency and business intent, drafts context-aware reply suggestions, auto-responds to routine inquiries within configured policies, and escalates high-priority items with briefings to ensure nothing critical is missed. It learns communication preferences over time to continuously improve draft quality and routing accuracy.
Aider
by Paul Gauthier
AI pair programming tool in the terminal that works with any LLM to edit code in local git repositories. Features automatic git commits, multi-file editing, and voice coding with support for connecting to dozens of model providers.
Invoice Reconciler Agent
by AaaS
Ingests unstructured invoice data across wildly varying formats (multi-page PDFs, email attachments, CSV exports). Matches invoices deterministically against Purchase Orders and receipt logs in the ERP system, and authorizes payment for the established happy path — achieving 60-80% straight-through processing without human intervention. Exception invoices (mismatches, missing POs, duplicate detection) are routed to a human queue with full context. Every payment authorization generates an immutable audit trail.
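The deterministic happy-path matching described above is commonly a three-way match between invoice, purchase order, and goods receipt. A minimal sketch, with assumed field names and an illustrative 2% price tolerance:

```python
def three_way_match(invoice: dict, po: dict, receipt: dict, tolerance: float = 0.02) -> bool:
    """Deterministic happy-path check: invoice vs. purchase order vs. goods
    receipt. Returns True only when vendor, quantity, and unit price all agree
    (price within a small tolerance); anything else routes to the human queue."""
    if invoice["vendor"] != po["vendor"]:
        return False
    if invoice["qty"] != receipt["qty_received"]:
        return False
    return abs(invoice["unit_price"] - po["unit_price"]) <= tolerance * po["unit_price"]

ok = three_way_match(
    {"vendor": "Acme", "qty": 10, "unit_price": 101.0},
    {"vendor": "Acme", "unit_price": 100.0},
    {"qty_received": 10},
)
```

The straight-through-processing rate quoted in the entry is effectively the fraction of invoices for which a check like this passes on first attempt.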
Contract Review Agent
by Ironclad AI
An AI agent designed to automate the legal contract review process. It extracts key clauses from documents like NDAs, MSAs, and SOWs, comparing them against a pre-defined legal playbook to flag non-standard language. The agent scores risk levels by clause and can automatically generate redlines with preferred positions, accelerating review cycles.
Amazon Q Developer Agent
by Amazon Web Services
Amazon Q is an AI-powered developer agent from AWS that automates code transformations, feature implementation, and security remediation. It is deeply integrated with the AWS ecosystem, allowing it to understand project context, suggest relevant AWS services, and streamline cloud-native development workflows directly within the IDE.
Quality Inspection Agent
by Landing AI
An AI agent that uses computer vision to perform real-time quality inspections on manufacturing lines. It automatically detects, classifies, and logs surface defects, dimensional inaccuracies, and assembly errors at production speed, triggering alerts or reject mechanisms to prevent faulty products from proceeding.
ChatDev
by OpenBMB
ChatDev is a virtual software company powered by multiple LLM agents that simulate a real-world development team. These agents, playing roles like CEO, programmer, and tester, collaborate to automate the entire software development lifecycle, from design and coding to testing, based on a single natural language prompt.
Campaign Analytics Agent
by Northbeam
An autonomous AI agent that unifies campaign data from disparate marketing channels to provide a holistic view of performance. It leverages advanced multi-touch attribution models to calculate true ROI and delivers actionable recommendations for budget optimization. The agent automatically generates executive-level reports and issues real-time alerts for performance anomalies.
Expense Audit Agent
by AppZen
An AI agent that automates the auditing of employee expense reports. It uses OCR to extract data from receipts, then validates expenses against company policies, per-diem rates, and vendor lists. The agent flags violations and potential fraud, auto-approves compliant reports, and routes exceptions for human review.
OpenHands
by All Hands AI
OpenHands is an open-source platform for creating autonomous AI software agents. It offers a secure, sandboxed environment where agents can execute complex development tasks by writing code, running commands, browsing the web, and interacting with APIs. It supports multi-agent delegation for tackling intricate problems.
Consensus
by Consensus NLP
Consensus is an AI-powered search engine designed to extract and synthesize findings directly from peer-reviewed scientific literature. It uses natural language processing to answer user questions with evidence-based conclusions, highlighting the general consensus from multiple studies and providing metrics on study quality.
Vapi AI
by Vapi
Vapi AI is a developer-first platform for building and deploying real-time, conversational voice agents. It provides low-latency streaming, interruptible speech, and seamless integrations with various LLM, TTS, and STT providers. The platform is designed for developers to create sophisticated voice experiences with features like function calling and call analytics.
GitLab Duo Agent
by GitLab
GitLab Duo is an AI-powered assistant integrated into the GitLab DevSecOps platform. It enhances developer productivity across the software development lifecycle by offering code suggestions, summarizing issues, explaining vulnerabilities, and generating tests, all within the native GitLab environment.
Intent Prospector Agent
by AaaS
Continuously tracks buyer intent signals across website visits, email engagement, CRM activity, and third-party intent data providers. Scores leads against Ideal Customer Profiles, drafts hyper-personalized outreach sequences based on behavioral signals, and books meetings directly on sales representatives' calendars. Replaces generic outreach spam with data-driven, compliant prospecting that respects CAN-SPAM and GDPR opt-out requirements.
Logistics Routing Agent
by project44
An AI agent designed to solve complex vehicle routing problems (VRP) for logistics and supply chain operations. It optimizes multi-stop routes for entire fleets by considering constraints like time windows, vehicle capacity, and traffic. The agent dynamically reroutes in real-time to adapt to new orders, delays, or cancellations.
H2O AI Agent
by H2O.ai
H2O.ai offers an open-source and enterprise AutoML platform that automates the machine learning lifecycle. It excels at automated model training, interpretation, and deployment, supporting distributed computing for large datasets. The platform provides comprehensive model explainability features like SHAP values, making complex models transparent.
Carbon Footprint Analyzer
by Persefoni
Calculates comprehensive Scope 1, 2, and 3 carbon emissions across the entire value chain. This ESG intelligence agent ingests diverse data like energy bills, travel records, and procurement data to generate audit-ready GHG inventory reports. It benchmarks performance and identifies key reduction opportunities, ensuring alignment with GHG Protocol standards.
Fraud Isolator Agent
by AaaS
Continuously evaluates live transaction streams against episodic memories of individual user behavior patterns. Detects complex anomaly patterns that rule-based fraud detection systems either miss outright or can only catch at the cost of high false-positive rates. Autonomously pauses suspicious transactions and triggers secure multi-factor escalation workflows. The agent can only pause and flag — it never approves or releases funds, ensuring a human always makes the final call on flagged transactions.
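Scoring a transaction against a user's own baseline, rather than a global rule, can be sketched as a z-score test over that user's transaction history. The threshold and the minimum-history requirement are illustrative assumptions:

```python
import statistics

def is_anomalous(amount: float, user_history: list[float], z_threshold: float = 3.0) -> bool:
    """Flag a transaction that deviates sharply from this user's own baseline.
    Flagged transactions are only paused for human review, never declined
    or approved automatically."""
    if len(user_history) < 5:
        return False  # not enough history to form a per-user baseline
    mu = statistics.mean(user_history)
    sigma = statistics.pstdev(user_history) or 1.0  # guard a flat history
    return abs(amount - mu) / sigma > z_threshold

history = [20.0, 25.0, 22.0, 30.0, 18.0, 24.0]
flag = is_anomalous(950.0, history)
```

A per-user baseline like this is what lets a $950 charge stand out for a $25-a-transaction account while being unremarkable for a high-spend account.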
SLA Monitor Agent
by PagerDuty
Monitors service-level agreement (SLA) compliance by tracking response and resolution times for all active tickets. The agent proactively alerts teams about potential SLA breaches, allowing them to act before a violation occurs. It can also automatically reprioritize ticket queues based on urgency and generates regular SLA performance reports for management.
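The pre-breach alerting described above amounts to classifying each ticket's elapsed time against its SLA window. A minimal sketch, with the 80% warning ratio as an assumed policy:

```python
from datetime import datetime, timedelta

def sla_status(opened_at: datetime, sla: timedelta, now: datetime,
               warn_ratio: float = 0.8) -> str:
    """Classify a ticket as ok / at-risk / breached. Alerting at `warn_ratio`
    of the SLA window gives the team time to act before the violation."""
    elapsed = now - opened_at
    if elapsed >= sla:
        return "breached"
    if elapsed >= sla * warn_ratio:
        return "at-risk"
    return "ok"

t0 = datetime(2024, 1, 1, 9, 0)
status = sla_status(t0, timedelta(hours=4), t0 + timedelta(hours=3, minutes=30))
```

Automatic reprioritization then becomes a sort of the queue by this status and the remaining time to breach.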
Fleet Management Agent
by Geotab
An operational intelligence agent for managing autonomous vehicle fleets. It optimizes asset utilization and uptime by intelligently dispatching vehicles to demand hotspots, scheduling predictive maintenance from telemetry data, and balancing charge levels for EVs. The agent provides a real-time control dashboard to surface anomalies for operations teams.
Demand Forecasting Agent
by o9 Solutions
The Demand Forecasting Agent leverages machine learning to analyze diverse datasets, including historical sales, market trends, and external factors like weather or promotions. It produces accurate, SKU-level demand forecasts for various time horizons, enabling businesses to optimize inventory, reduce stockouts, and improve supply chain efficiency.
Codex CLI
by OpenAI
OpenAI's open-source CLI coding agent that operates in the terminal with sandboxed execution. Reads and edits files, runs commands, and supports multiple approval modes from suggest to full-auto.
Cloud Cost Optimizer Agent
by AaaS
Holds persistent, long-term memory of historical cloud infrastructure utilization patterns. Autonomously monitors resource usage across regions and cloud providers, identifies idle or underutilized resources, right-sizes instances based on real traffic patterns, and executes cost-saving measures continuously — all without waiting for a human FinOps review. Operates within safety guardrails: never terminates production instances, enforces a 7-day cooldown before right-sizing, and logs all actions with rollback capability.
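The guardrails listed above (production exclusion, 7-day cooldown) can be encoded as a simple precondition check that gates every action; the 20% average-CPU threshold here is an illustrative assumption, not the agent's actual policy:

```python
from datetime import datetime, timedelta

def may_rightsize(avg_cpu: float, is_production: bool,
                  last_change: datetime, now: datetime,
                  cooldown: timedelta = timedelta(days=7)) -> bool:
    """Guardrail check mirroring the stated policy: never touch production
    instances, enforce a cooldown since the last change, and only act on
    sustained low utilization."""
    if is_production:
        return False
    if now - last_change < cooldown:
        return False
    return avg_cpu < 0.20  # sustained under 20% average CPU

now = datetime(2024, 6, 15)
ok = may_rightsize(0.08, False, datetime(2024, 6, 1), now)
```

Keeping the guardrail as a pure predicate makes every allow/deny decision trivially loggable for the rollback audit trail.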
Architecture Review Agent
by Codescene
A senior-engineer-level agent that statically analyzes codebase architecture using dependency graphs, coupling metrics, and design pattern recognition to identify anti-patterns, circular dependencies, and violations of architectural fitness functions. It produces architectural decision records (ADRs), generates C4 model diagrams, and prioritizes refactoring opportunities by technical debt cost and business risk.
Churn Prevention Agent
by AaaS
Tracks subtle drops in product usage, performs sentiment analysis on support tickets, and synthesizes disparate signals into churn risk scores for each account. Preemptively drafts retention plans, schedules proactive check-in calls, and highlights upselling opportunities — all before the critical renewal window opens. Transforms customer success from reactive firefighting into a data-driven retention engine.
Legal Document Drafter
by Harvey AI
An AI agent that automates the creation of legal documents by leveraging structured data, template libraries, and firm-specific style guides. It generates jurisdiction-compliant agreements, pleadings, and regulatory filings, incorporating precedents and flagging potential issues for attorney review. The system streamlines drafting workflows, ensuring consistency and accuracy.
Supplier Risk Agent
by Resilinc
An intelligence agent that continuously monitors supplier financial health, geopolitical exposure, ESG compliance, news sentiment, and delivery performance to generate dynamic risk scores for every vendor in the supply network. It alerts procurement teams to emerging threats and recommends dual-sourcing or buffer stock adjustments before disruptions materialize.
NPC Behavior Agent
by Inworld AI
An AI agent that uses reinforcement learning (RL) to generate dynamic NPC behaviors. Instead of relying on static scripts, it learns complex strategies through self-play and interaction, adapting its difficulty and tactics in real-time to match a player's skill level, ensuring a consistently challenging and unpredictable experience.
Digital Twin Agent
by Ansys
This AI agent creates and manages high-fidelity virtual replicas of physical assets and processes. By synchronizing with real-time IoT data, it runs complex simulations to test changes, predict failures, and analyze what-if scenarios posed in natural language, enabling optimization before physical implementation.
GPT Researcher
by Tavily
Open-source autonomous research agent that conducts comprehensive web research on any topic. Generates detailed research reports by planning queries, scraping multiple sources, filtering information, and synthesizing findings with citations.
Financial Statement Analyzer
by Visible Alpha
An AI-powered agent designed for systematic financial statement analysis. It automates the ingestion and parsing of corporate filings like 10-Ks and 10-Qs to compute key financial ratios, identify accounting anomalies, and benchmark performance against industry peers. The agent generates concise, investment-grade summaries highlighting financial health and potential risks.
Moveworks
by Moveworks
Moveworks is an enterprise AI copilot platform that automates employee support. It uses conversational AI to understand and resolve requests across IT, HR, finance, and other departments directly in collaboration tools like Slack and Microsoft Teams, reducing the need for manual intervention.
HR Screening Agent
by HireVue
An AI recruiting agent that screens resumes at scale, scores candidates against job description criteria, conducts asynchronous video interview analysis, and shortlists top applicants while flagging potential bias signals for human review. It integrates with ATS platforms to automate interview scheduling for shortlisted candidates and maintains a structured candidate evaluation audit trail.
Campaign Orchestrator Agent
by AaaS
Monitors live campaign performance across Google Ads, Meta, LinkedIn, and other advertising channels continuously. Automatically reallocates budgets to highest-performing channels, dynamically personalizes ad copy for different audience segments, and kills underperformers before they waste spend. Operates within configurable budget guardrails including max daily spend caps, minimum ROAS thresholds, and A/B test significance gates.
Galileo AI
by Galileo AI
Galileo AI is a design copilot that transforms natural language prompts into high-fidelity, editable UI designs. It generates complete screens, individual components, and custom illustrations directly within Figma, aiming to accelerate the design process by automating repetitive tasks and providing instant visual mockups.
Traffic Prediction Agent
by HERE Technologies
This agent specializes in spatio-temporal traffic forecasting, predicting conditions up to 30 minutes in advance for intersections and corridors. It processes data from V2X communications, vehicle telemetry, and infrastructure sensors. The predictions are designed for fleet routing engines to optimize ETAs and alleviate urban congestion.
STORM (Stanford)
by Stanford NLP
STORM is an open-source AI research agent from Stanford University designed to automate the creation of comprehensive, Wikipedia-style articles. It simulates a human research process by generating diverse questions, searching the web for information, and synthesizing the findings into a well-structured, cited narrative based on a generated outline.
Tavily Research Agent
by Tavily
Tavily is a specialized search API designed for Large Language Models (LLMs) and AI agents. It provides real-time, fact-grounded web search results in a structured, clean format, eliminating the need for manual data cleaning. The API is optimized to deliver relevant, concise information, making it ideal for powering autonomous agents and RAG applications.
Knowledge Base Builder Agent
by Guru
This autonomous agent streamlines knowledge management by ingesting data from support tickets, chat logs, and documents. It automatically generates, updates, and deduplicates knowledge base articles, identifying content gaps by analyzing unanswered user queries. New drafts are created for human review, ensuring a constantly improving self-service resource.
Sourcegraph Cody
by Sourcegraph
Sourcegraph's AI coding assistant with deep codebase context powered by code graph intelligence. Understands entire repositories through code search, cross-references, and dependency analysis for highly accurate code generation and answers.
Ada AI
by Ada
Ada is an enterprise-grade conversational AI platform designed for automating customer service. Its no-code builder allows businesses to create and deploy AI agents across various digital channels, aiming to resolve a high percentage of customer inquiries without human intervention and providing seamless handoffs when needed.
Transfer Learning
by Community
Leverages knowledge from a source domain to improve model performance on a target domain with limited labeled data. A foundational technique for reducing training costs and accelerating model development across diverse applications.
Chain-of-Thought
by AaaS
Guides LLMs to produce step-by-step reasoning before arriving at a final answer. Dramatically improves performance on math, logic, and multi-step problems by making the model's reasoning process explicit and verifiable.
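The pattern can be sketched as a prompt wrapper plus an answer extractor; this is a minimal illustration, and the helper names and prompt text are assumptions rather than any particular provider's API:

```python
def chain_of_thought_prompt(question: str) -> str:
    """Instruct the model to reason step by step before committing to an answer."""
    return (
        "Solve the problem below. Reason step by step, then give the final\n"
        "answer on its own line starting with 'Answer:'.\n\n"
        f"Problem: {question}\nReasoning:"
    )

def extract_final_answer(completion: str) -> str:
    """Pull the verifiable final answer out of a step-by-step completion."""
    for line in completion.splitlines():
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return completion.strip()  # fall back if the model skipped the marker

# A typical completion for a speed problem might look like this:
sample = "Step 1: distance / time = 120 / 1.5 = 80.\nAnswer: 80 km/h"
print(extract_final_answer(sample))  # 80 km/h
```

Making the final answer machine-extractable is what makes the reasoning verifiable: the intermediate steps can be audited separately from the answer itself.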
Prompt Engineering
by AaaS
The foundational discipline of crafting effective prompts to elicit desired behaviors from language models. Covers system prompt design, instruction formatting, output structuring, temperature tuning, and iterative prompt refinement techniques.
Code Generation
by AaaS
Generates functional code from natural language descriptions, specifications, or partial implementations. Covers multiple languages and frameworks with support for boilerplate scaffolding, algorithm implementation, and API integration patterns.
Function Calling
by AaaS
Enables LLMs to invoke external functions by generating structured JSON arguments matching defined schemas. Supports parallel function calls, error handling, and chained invocations for complex multi-step tool interactions.
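A minimal sketch of the dispatch side, assuming the model has already emitted a JSON tool call; the `get_weather` tool, its schema, and the call format are illustrative stand-ins, not a specific vendor's wire format:

```python
import json

# JSON-Schema-style parameter definition, in the shape function-calling APIs expect.
GET_WEATHER_SCHEMA = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

def get_weather(city: str, unit: str = "celsius") -> str:
    return f"22 degrees {unit} in {city}"  # stub; a real tool would call a weather API

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call: str) -> str:
    """Parse a model-emitted tool call and invoke the matching function."""
    call = json.loads(tool_call)
    return TOOLS[call["name"]](**call["arguments"])

print(dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
# 22 degrees celsius in Oslo
```

In production the arguments would also be validated against the schema before invocation, since models occasionally emit missing or mistyped fields.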
Collaborative Filtering
by Community
Predicts user preferences by identifying patterns from collective user-item interaction histories, using memory-based neighborhood methods or model-based matrix factorization and neural approaches. The backbone of recommendation systems at scale across e-commerce, streaming, and social platforms.
Few-Shot Learning
by AaaS
Teaches LLMs to perform tasks by providing a small number of input-output examples in the prompt. Enables rapid task adaptation without fine-tuning by demonstrating the desired pattern through carefully selected, representative examples.
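Assembling such a prompt is mechanical; a minimal sketch, where the instruction text and examples are placeholders:

```python
def few_shot_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Demonstrate the task pattern with labeled examples, then pose the query."""
    lines = [instruction, ""]
    for given, expected in examples:
        lines += [f"Input: {given}", f"Output: {expected}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("Loved it!", "positive"), ("Total waste of money.", "negative")],
    "Shipping was fast and the quality is great.",
)
print(prompt)
```

Ending the prompt at `Output:` invites the model to complete the established pattern, which is the whole mechanism of in-context learning.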
Tool Use
by AaaS
Equips AI agents with the ability to select and use appropriate tools from a defined toolkit to accomplish tasks. Covers tool selection logic, input marshalling, output interpretation, and fallback strategies when tools fail or return unexpected results.
Speech Recognition
by AaaS
Teaches integration and optimization of automatic speech recognition (ASR) systems — from Whisper to streaming cloud APIs — for agentic voice pipelines. Covers language identification, word error rate reduction, punctuation restoration, and handling noisy audio environments.
Time-Series Forecasting
by Community
Predicts future values of sequential, time-indexed data using classical statistical models (ARIMA, ETS), gradient boosting (LightGBM, XGBoost), and deep learning architectures (Transformers, N-BEATS, TFT). Handles trend, seasonality, exogenous covariates, and uncertainty quantification.
Domain-Specific Fine-Tuning
by Community
Adapts a general-purpose pretrained model to a narrow domain by continuing training on curated domain corpora or instruction datasets. Produces specialized models that outperform generalist baselines on domain-specific benchmarks while preserving broad language understanding.
Code Review
by AaaS
Analyzes code for bugs, security vulnerabilities, performance issues, and style violations. Provides actionable feedback with severity levels and suggested fixes aligned to language-specific best practices and project conventions.
Hybrid Recommendation Systems
by Community
Combines collaborative filtering and content-based signals — along with contextual, knowledge-graph, and session-based features — into unified ranking models that outperform single-strategy approaches. Modern implementations use two-tower neural architectures for efficient retrieval followed by cross-attention reranking.
Graph Neural Networks
by Community
Applies deep learning directly to graph-structured data by passing and aggregating messages between connected nodes across multiple layers, enabling node classification, link prediction, and graph-level tasks. Powers state-of-the-art knowledge graph completion, molecular property prediction, and social network analysis.
Reinforcement Learning for Control
by Community
Trains control policies for autonomous systems through environment interaction and reward signals using model-free (PPO, SAC, TD3) and model-based (MBPO, Dreamer) RL algorithms. Enables superhuman performance in complex continuous control tasks from locomotion to manipulation.
Summarization
by AaaS
Condenses long documents into concise summaries while preserving key information and maintaining factual accuracy. Supports extractive, abstractive, and hierarchical summarization with configurable length, style, and focus area parameters.
Anomaly Detection
by Community
Identifies unusual patterns, outliers, and change points in time-series and tabular data using statistical, density-based, isolation forest, autoencoder, and transformer-based methods. Fundamental for operational monitoring, fraud detection, and predictive maintenance systems.
RAG Retrieval
by AaaS
A technique that enhances large language models by dynamically retrieving relevant information from an external knowledge base. This process grounds the model's responses in factual data, reducing hallucinations and enabling it to answer questions about information not present in its original training data.
Object Detection
by AaaS
A core computer vision skill that enables agents to identify and locate objects within an image or video stream. By predicting bounding boxes and class labels for each object, this skill forms the foundation for environmental understanding. It is crucial for applications requiring spatial awareness, from autonomous navigation to automated inspection.
Code Debugging
by AaaS
Diagnoses and resolves software bugs by analyzing error messages, stack traces, and code behavior. Applies systematic debugging strategies including root cause analysis, state inspection, and targeted fix generation with regression awareness.
Semantic Search
by AaaS
Enables meaning-based retrieval by converting queries and documents into dense vector representations and finding nearest neighbors. Foundational skill for any RAG pipeline or knowledge-base-powered agent.
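Under the hood, nearest-neighbor retrieval reduces to similarity over embedding vectors; a toy sketch with hand-made 2-D vectors standing in for real embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query: list[float], docs: list[list[float]], k: int = 2) -> list[int]:
    """Indices of the k document vectors most similar to the query vector."""
    ranked = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
    return ranked[:k]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # pretend embeddings
print(top_k([1.0, 0.05], docs))  # [0, 1]
```

Real systems replace the exhaustive scan with an approximate nearest-neighbor index once the corpus grows past a few thousand vectors.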
Federated Learning
by Community
A machine learning technique that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging the data itself. It enables collaborative model training by aggregating locally computed updates, thereby preserving data privacy, security, and sovereignty.
Content-Based Recommendation
by Community
Recommends items by matching item feature profiles to user preference profiles derived from their interaction history, using TF-IDF, embeddings, and semantic similarity techniques. Effective for cold-start scenarios where user interaction data is sparse and item metadata is rich.
Text Classification
by AaaS
Automates the categorization of text into predefined classes. This skill leverages large language models to perform zero-shot and multi-label classification, eliminating the need for extensive training data. It can analyze documents, user feedback, or social media posts, assigning relevant labels from a simple list or a complex hierarchical taxonomy.
Path Planning
by Community
Path Planning is a fundamental capability in robotics and autonomous systems that computes a collision-free geometric path from a start to a goal configuration. It operates within a system's configuration space, using algorithms like A* or RRT to find optimal or feasible routes, distinct from motion planning which also considers dynamics like velocity and acceleration.
Visual Question Answering
by AaaS
Enables agents to answer free-form natural language questions about images by grounding language in visual features. Covers prompt construction for vision-language models, chain-of-thought visual reasoning, and failure modes such as hallucination and spatial confusion.
Differential Privacy
by Community
Provides mathematically rigorous privacy guarantees by adding calibrated noise to query outputs or model gradients, ensuring individual data points cannot be inferred from published statistics or trained models. The de facto standard for privacy-preserving data analysis and compliant ML training.
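The core mechanism is simple to sketch. This is a toy Laplace mechanism for a counting query (sensitivity 1), not production-grade DP: there is no privacy accounting and the RNG is not cryptographically secure:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon: float) -> float:
    """Release a count with epsilon-DP: a counting query has sensitivity 1,
    so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 61, 38]
print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))  # noisy count near 3
```

Smaller epsilon means stronger privacy and noisier answers; the calibration of noise scale to sensitivity/epsilon is exactly the "calibrated noise" the description refers to.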
Embedding Generation
by AaaS
Generates dense vector embeddings from text, images, or other data types for use in similarity search, clustering, and classification. Covers model selection, batch processing, dimensionality considerations, and normalization strategies for optimal retrieval performance.
Test Generation
by AaaS
Automates the creation of test suites by analyzing source code, function signatures, or specifications. It generates unit tests, integration tests, and edge case scenarios for popular frameworks, complete with necessary mocks and assertions. This accelerates development cycles and improves code reliability.
ReAct Prompting
by AaaS
Implements the Reasoning + Acting (ReAct) paradigm where LLMs alternate between thinking steps and action steps. The model reasons about what to do next, takes an action (like searching or computing), observes the result, and continues reasoning until the task is complete.
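A skeleton of the loop, with a scripted stand-in for the model and a single hypothetical `lookup` tool; the transcript format is illustrative:

```python
import re

def react_loop(llm, tools, question: str, max_steps: int = 5):
    """Alternate Thought -> Action -> Observation until a final answer appears.
    `llm` is any callable mapping the transcript so far to the model's next step."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        action = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if action:
            name, arg = action.groups()
            transcript += f"Observation: {tools[name](arg)}\n"
    return None  # give up after max_steps to avoid runaway loops

# Scripted model outputs stand in for real completions:
scripted = iter([
    "Thought: I should look this up.\nAction: lookup[capital of France]",
    "Thought: The observation answers it.\nFinal Answer: Paris",
])
answer = react_loop(
    llm=lambda transcript: next(scripted),
    tools={"lookup": lambda q: "Paris is the capital of France."},
    question="What is the capital of France?",
)
print(answer)  # Paris
```

The `max_steps` cap is essential in practice: a model that never emits a final answer otherwise loops forever.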
Sensor Fusion
by Community
Combines data from multiple heterogeneous sensors — cameras, LiDAR, radar, GPS, IMU — using probabilistic filters and deep learning to produce a unified, accurate state estimate of the environment. Foundational for autonomous vehicles, drones, and any robot requiring robust situational awareness.
Fine-Tuning
by AaaS
Adapts pre-trained language models to specific domains, tasks, or styles through additional training on curated datasets. Covers full fine-tuning, parameter-efficient methods like LoRA and QLoRA, and best practices for dataset preparation, hyperparameter selection, and evaluation.
Anomaly Detection
by AaaS
Identifies deviations from normal system behavior across time-series telemetry data (CPU, memory, latency, error rates, request volumes). Uses statistical methods (z-score, IQR) and learned baselines to distinguish genuine anomalies from expected variance. A critical cross-foundry skill reused by SRE (F1), Fraud Detection (F6), and Supply Chain (F8) agents.
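The z-score method mentioned above fits in a few lines; a toy sketch over a latency series (a real system would use rolling baselines and seasonality-aware models, and a large spike inflates the global stdev, hence the modest threshold here):

```python
import statistics

def zscore_anomalies(values: list[float], threshold: float = 2.0) -> list[int]:
    """Indices of points whose z-score against the whole series exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # a perfectly flat series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

latency_ms = [102, 99, 101, 100, 98, 103, 400, 101]  # one obvious spike
print(zscore_anomalies(latency_ms))  # [6]
```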
Code Refactoring
by AaaS
Code Refactoring is the disciplined process of restructuring existing computer code without altering its external behavior. It focuses on enhancing nonfunctional attributes like readability, maintainability, and performance. This practice is key to managing technical debt, applying design patterns, and modernizing legacy systems to align with current best practices.
Translation
by AaaS
Provides the ability to translate text from a source language to a target language. It aims to preserve the original meaning, tone, and cultural context. The skill supports domain-specific terminology for fields like legal or medical, allows for register control between formal and informal language, and handles idiomatic expressions with contextually appropriate equivalents.
Synthetic Data Generation
by Community
A process for creating artificial data that mimics the statistical properties and patterns of real-world datasets. It employs techniques like GANs, VAEs, and diffusion models to generate new data points, addressing issues of data scarcity, privacy, and imbalance. This enables robust model training and testing where real data is unavailable or sensitive.
Robot Perception
by Community
Enables robots to interpret their surroundings by processing and fusing data from sensors like cameras, LiDAR, and IMUs. This capability allows machines to build environmental models, detect and track objects, and determine their own position and orientation (localization). It is a cornerstone of autonomous navigation and interaction.
OCR Pipeline
by AaaS
Builds end-to-end pipelines for extracting structured text from images, scanned documents, and PDFs using OCR engines combined with layout analysis. Teaches preprocessing, engine selection (Tesseract, PaddleOCR, Google Document AI), post-correction, and handoff to language models for structured extraction.
Motion Planning
by Community
Motion Planning is the process of generating a valid trajectory for an autonomous system, such as a robot arm or self-driving car, from a starting state to a desired goal state. It computes a collision-free path that respects the system's kinematic and dynamic constraints, effectively bridging perception with physical action.
Knowledge Graph Construction
by Community
Builds structured knowledge graphs from unstructured text and semi-structured sources through entity recognition, relation extraction, coreference resolution, and entity linking. The resulting graphs power question answering, search, recommendation, and reasoning applications.
Image Generation Prompting
by AaaS
Covers structured prompting for text-to-image diffusion models like Stable Diffusion and Midjourney, including controlling style, composition, and quality through techniques such as negative prompting, LoRA weights, and iterative refinement. This skill enables the programmatic generation of consistent, on-brand imagery at scale.
Prompt Chaining
by AaaS
Prompt Chaining is a technique for executing complex tasks by breaking them into a sequence of smaller, interconnected prompts. The output from one large language model (LLM) call serves as the input for the next, creating a multi-step workflow. This method enables more sophisticated reasoning, state management, and integration with external tools.
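The control flow is a simple fold over prompt templates; a sketch with a stand-in `fake_llm` that just echoes its prompt (a real call would return the model's completion):

```python
def run_chain(llm, templates: list[str], initial_input: str) -> str:
    """Each step's output becomes the {input} of the next prompt template."""
    result = initial_input
    for template in templates:
        result = llm(template.format(input=result))
    return result

def fake_llm(prompt: str) -> str:
    return prompt  # stand-in so the data flow is visible

chain = ["Summarize: {input}", "Translate to French: {input}"]
print(run_chain(fake_llm, chain, "quarterly sales report"))
# Translate to French: Summarize: quarterly sales report
```

Because each step is a separate call, intermediate outputs can be logged, validated, or routed to different models before the next step runs.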
Hybrid Search
by AaaS
Hybrid search enhances information retrieval by merging the results of two distinct search methods: dense vector search for semantic understanding and sparse keyword search (like BM25) for lexical precision. This dual approach ensures that search results are not only contextually relevant but also capture exact term matches, significantly improving recall and relevance across diverse and complex queries.
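A common way to merge the two result lists is reciprocal rank fusion (RRF), which needs only ranks, not score scales that are comparable across methods; a minimal sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each document scores sum(1 / (k + rank)) across the input rankings;
    k = 60 is the damping constant commonly used in RRF implementations."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc2"]   # lexical (keyword) ranking
dense_hits = ["doc1", "doc4", "doc3"]  # semantic (vector) ranking
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# ['doc1', 'doc3', 'doc4', 'doc2']
```

Documents ranked highly by both retrievers (here `doc1` and `doc3`) rise to the top, which is exactly the recall-and-relevance benefit the entry describes.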
Data Extraction
by AaaS
Data Extraction is the process of automatically identifying and pulling structured information from unstructured or semi-structured sources like documents, web pages, and text. It uses NLP and computer vision to parse content into a predefined schema, enabling data to be used in databases, analytics, and automated workflows.
Few-Shot Domain Adaptation
by Community
Adapts models to new target domains using only a handful of labeled examples, combining meta-learning, prompt engineering, and prototype-based methods. Critical for enterprise deployments where labeled data is scarce or expensive to acquire.
CRM Data Retrieval
by AaaS
Queries CRM systems to retrieve customer account data, ticket history, subscription status, and interaction logs. Provides the customer context foundation that support, churn, and sales agents depend on for personalized actions.
Document Chunking
by AaaS
Splits large documents into semantically coherent chunks optimized for embedding and retrieval. Supports recursive, semantic, and sentence-based splitting strategies with configurable overlap and size parameters.
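The simplest of the strategies above, a fixed-size window with overlap, sketches as follows; real splitters also respect sentence and paragraph boundaries rather than cutting mid-word:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size character windows; consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap ensures a sentence straddling a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated embedding work.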
Multi-Step Reasoning
by AaaS
A core AI capability that enables agents to break down complex queries into a sequence of manageable, logical steps. By generating intermediate thoughts and verifying them, this process mimics human reasoning to solve problems that require planning, deduction, and synthesis of information over multiple stages.
Active Learning
by Community
Active Learning is a machine learning technique that intelligently selects the most informative data points from a large pool of unlabeled data to be labeled by a human annotator. By prioritizing examples where the model is most uncertain, it aims to achieve higher model accuracy with significantly fewer labeled samples, reducing annotation costs and time.
Log Analysis
by AaaS
Parses, correlates, and summarizes structured and unstructured log streams from multiple sources (application logs, system logs, CI/CD logs). Identifies error patterns, correlates events across distributed services using trace IDs, and extracts actionable insights from high-volume log data. A foundational skill reused across DevOps, SRE, and security agents.
Image Segmentation
by AaaS
Covers semantic, instance, and panoptic segmentation techniques that enable agents to produce pixel-level masks for scene understanding. Includes practical guidance on using SAM 2, Mask R-CNN, and integrating segmentation outputs into multimodal pipelines.
Feature Attribution
by AaaS
This skill involves computing and communicating which input features most influenced a model's prediction. It leverages methods like SHAP, LIME, and Integrated Gradients for tabular, text, and image data. The core focus is on generating local and global explanations and presenting them visually for both technical and non-technical audiences.
Causal Effect Estimation
by Community
Causal Effect Estimation quantifies the true impact of an action or intervention by analyzing observational data. It moves beyond simple correlation to isolate causality using statistical methods, which is crucial for evaluating policies, business strategies, and medical treatments where A/B tests are infeasible.
Planning
by AaaS
Enables agents to create structured execution plans for multi-step tasks by analyzing goals, identifying sub-tasks, ordering dependencies, and allocating resources. Supports plan revision when steps fail or new information emerges during execution.
Named Entity Recognition
by AaaS
Identifies and classifies named entities (people, organizations, locations, dates, etc.) within unstructured text. Supports custom entity types, relationship extraction between entities, and structured output formatting for downstream processing.
Sim-to-Real Transfer
by Community
Sim-to-Real Transfer is a set of techniques used in robotics and AI to bridge the 'reality gap' between simulation and the real world. It enables models and control policies trained in a virtual environment to be deployed effectively on physical hardware, drastically reducing the need for costly, time-consuming, and potentially unsafe real-world data collection.
Multi-Agent Coordination
by AaaS
Multi-Agent Coordination involves designing systems where multiple autonomous agents collaborate to achieve a common goal. This skill encompasses architectural patterns like hierarchical supervision and peer-to-peer negotiation for task distribution and conflict resolution. It focuses on managing shared information and ensuring coherent collective action in complex, dynamic environments.
Streaming Responses
by AaaS
This skill involves implementing real-time, token-by-token data delivery from Large Language Models to end-users. It utilizes protocols like Server-Sent Events (SSE) or WebSockets to create interactive and responsive applications, such as chatbots or code assistants, by progressively displaying content as it's generated.
Reranking
by AaaS
Applies a cross-encoder or LLM-based reranker to refine initial retrieval results by scoring query-document pairs for relevance. Dramatically improves precision by promoting the most contextually relevant passages to the top of the result set.
OCR Extraction
by AaaS
Extracts structured data from unstructured documents (PDFs, scanned images, email attachments) using optical character recognition with layout-aware parsing. Handles multi-page invoices, varying formats, and poor scan quality — producing structured key-value pairs for downstream reconciliation.
Escalation Routing
by AaaS
Routes unresolved or high-risk incidents to the appropriate human responder with full diagnostic context. Determines escalation urgency (P1-P5), identifies the correct on-call engineer or team based on service ownership, and packages a complete incident summary (timeline, diagnostics run, hypothesis). A cross-foundry skill reused by Customer Success (F4) and Healthcare (F9) agents.
Content Filtering
by AaaS
A system that automatically screens text inputs and outputs for large language models (LLMs) to detect and manage harmful content. It uses multi-category classification to identify issues like toxicity, hate speech, and violence, applying configurable rules and thresholds to enforce safety policies and protect users.
Code Explanation
by AaaS
Provides detailed, multi-level explanations for code snippets, functions, or entire repositories. It breaks down complex algorithms, clarifies control flow, and describes the purpose of variables and dependencies. The skill supports numerous programming languages, generating documentation-style overviews or granular, line-by-line analyses to accelerate learning and code reviews.
Web Browsing
by AaaS
Empowers autonomous agents to interact with the web like a human user. This skill provides the core functionality to navigate to URLs, render pages including executing JavaScript, and parse DOM elements. It enables complex workflows such as filling out forms, clicking buttons, and extracting structured data for analysis or task completion.
Continual Learning
by Community
A machine learning paradigm enabling models to learn sequentially from a continuous stream of data without forgetting previously acquired knowledge. Continual Learning, or Lifelong Learning, directly addresses the problem of catastrophic forgetting in neural networks using methods like regularization, memory replay, and dynamic architectures.
Prompt Injection Defense
by AaaS
Detects and mitigates prompt injection attacks where malicious inputs attempt to override system instructions or extract sensitive information. Implements input sanitization, instruction hierarchy enforcement, and output monitoring to protect LLM-powered applications.
Agentic RAG
by AaaS
Agentic RAG transforms Retrieval-Augmented Generation from a static, single-step process into a dynamic, multi-step workflow. In this paradigm, an LLM-powered agent intelligently decides when to retrieve information, what queries to use, and whether to perform additional retrieval cycles, often using external tools to refine its approach.
SQL Generation
by AaaS
Converts natural language questions into executable SQL queries against relational databases. Supports schema-aware generation, multi-table joins, aggregations, and query optimization with dialect-specific syntax for PostgreSQL, MySQL, SQLite, and others.
Sentiment Analysis
by AaaS
Classifies the emotional tone and sentiment polarity of customer text communications — support tickets, survey responses, chat logs, and social mentions. Produces sentiment scores with confidence levels, enabling churn prevention and coaching agents to identify dissatisfied accounts before explicit complaints surface.
Knowledge Retrieval
by AaaS
Retrieves relevant articles, documentation, and policy information from knowledge bases in response to real-time queries. Uses hybrid search (keyword + semantic) with cross-encoder reranking to surface the most contextually appropriate content for support and coaching agents.
Ticket Routing
by AaaS
Classifies support tickets by category, urgency, and required expertise, then routes them to the correct queue or human agent. Handles auto-resolution for simple cases and escalates complex ones with full context summaries.
Deployment Monitoring
by AaaS
Continuously observes deployment pipelines and post-deploy health metrics. Detects anomalous deployment patterns (elevated error rates, latency spikes, failed health checks) within seconds of release. Integrates with canary and blue-green deployment strategies to provide real-time go/no-go signals based on configurable thresholds.
Lead Scoring
by AaaS
Assigns numerical scores to leads based on demographic fit, firmographic match, behavioral engagement, and intent signals. Enables agents to rank prospects by conversion likelihood and route high-scoring leads to immediate outreach while nurturing lower-scoring ones.
Context Window Optimization
by AaaS
A set of techniques for managing the limited memory (context window) of Large Language Models. It involves strategically structuring prompts, summarizing or pruning conversation history, and selectively including relevant information to ensure efficient, cost-effective, and coherent long-form interactions with an AI.
PII Detection
by AaaS
Identifies and flags personally identifiable information (PII) in text data, including names, addresses, phone numbers, SSNs, and financial details. Supports configurable sensitivity levels, redaction strategies, and compliance reporting for GDPR, HIPAA, and CCPA requirements.
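A toy regex pass shows the shape of the redaction step; these illustrative patterns miss many real-world formats, and production detectors layer NER models and checksum validation on top:

```python
import re

# Illustrative patterns only; not exhaustive for any real compliance regime.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach Jane at jane.doe@example.com or 555-867-5309, SSN 123-45-6789."))
# Reach Jane at [EMAIL] or [PHONE], SSN [SSN].
```

Typed placeholders rather than blanket deletion preserve enough structure for downstream analytics and compliance reporting.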
Entity Resolution
by Community
Identifies and merges records across heterogeneous data sources that refer to the same real-world entity, using blocking, similarity scoring, and classification models to scale to large corpora. Critical for maintaining knowledge graph integrity and enabling cross-source analytics.
Documentation Generation
by AaaS
Generates technical documentation from source code, including API references, README files, inline comments, and architectural guides. Adapts tone and detail level for different audiences from developer guides to end-user documentation.
Threshold Detection
by AaaS
Evaluates real-time metrics against configurable thresholds (SLOs, SLIs, error budgets) and triggers appropriate responses. Supports static thresholds, dynamic baselines, and anomaly-based detection. Distinguishes between noise and genuine threshold breaches using historical context and burn-rate analysis.
Causal Discovery
by Community
Causal Discovery is a subfield of AI that infers causal relationships from observational data. It constructs a Directed Acyclic Graph (DAG) to represent these cause-and-effect links without manual intervention or controlled experiments, using statistical algorithms to distinguish correlation from causation.
Agent Memory Systems
by AaaS
Teaches design and implementation of multi-tier agent memory architectures — in-context working memory, episodic memory via vector stores, and semantic memory via knowledge graphs — enabling agents to maintain coherent state across long-running tasks and sessions. Covers retrieval-augmented memory, memory consolidation, and forgetting strategies.
Structured Output RAG
by AaaS
This skill involves building Retrieval-Augmented Generation (RAG) systems that output structured data, like JSON, conforming to a predefined schema. Instead of unreliable free-form text, it uses techniques like constrained decoding and validation to ensure outputs are machine-readable and ready for direct use in APIs or databases.
Reflection
by AaaS
Allows agents to evaluate their own outputs, identify errors or weaknesses, and iteratively improve responses. Implements self-critique loops where the agent reviews its work against quality criteria and refines until standards are met.
Refund Processing
by AaaS
Processes customer refunds through payment gateway APIs within configurable monetary caps. Enforces refund policies (max per transaction, max per day, cooling periods) and generates immutable audit trails for every refund action. Escalates requests above caps to human approval.
Memory Management
by AaaS
Enables AI agents to maintain state and context across multiple interactions by managing short-term and long-term memory. This is crucial for creating coherent, personalized experiences, moving beyond stateless request-response models. It uses techniques like conversation buffers, summarization, and vector-based retrieval.
Web Scraping
by AaaS
Web scraping automates the extraction of large amounts of data from websites. By simulating human browsing, it can crawl through pages, parse HTML, and collect specific information like prices, contacts, or articles, transforming unstructured web content into structured data for analysis or other applications.
Autonomous Planning
by AaaS
Autonomous Planning enables AI agents to independently decompose high-level, long-horizon objectives into a structured graph of executable sub-tasks. It involves generating plans using classical (PDDL), LLM-based, or hybrid methods, estimating necessary resources, and dynamically replanning in response to execution failures or new environmental data.
Usage Trend Analysis
by AaaS
Tracks product usage patterns over time — login frequency, feature adoption, session duration, and activity drops. Identifies accounts showing declining engagement that correlate with churn risk, enabling proactive retention before the customer disengages.
Telemetry Analysis
by AaaS
Ingests and analyzes telemetry data (metrics, traces, spans) from distributed systems. Correlates performance data across service boundaries using distributed tracing, identifies bottleneck services, and produces latency breakdowns. Provides the observability foundation that SRE Triage and Latency Budget Planner agents depend on.
PO Matching
by AaaS
Matches extracted invoice data against Purchase Orders and receipt logs in ERP systems using deterministic matching rules (PO number, vendor, amount, line items). Handles partial matches, tolerance thresholds, and multi-line reconciliation. Routes exceptions to human queues with full mismatch details.
Calendar Negotiation
by AaaS
Accesses multiple participants' calendars simultaneously and finds optimal meeting times across time zones, working hours, and scheduling constraints. Handles rescheduling, cancellations, and conflict resolution autonomously.
Approval Workflow
by AaaS
Routes transactions, documents, and exceptions through configurable multi-step approval chains based on amount thresholds, risk levels, and organizational policies. Tracks approver actions with timestamps, sends reminders for pending items, and escalates stalled approvals — ensuring no payment or commitment is authorized without the required sign-offs.
Pull Request Generation
by AaaS
Generates complete, well-structured pull requests, including a descriptive title, a detailed body with change rationale, a test results summary, a dependency diff, and reviewer assignments. Follows the organization's PR template and conventional commit conventions. Produces PRs that human reviewers can approve quickly because all context is pre-packaged.
Constitutional AI
by AaaS
Applies Anthropic's Constitutional AI principles to self-supervise model outputs against a set of defined rules or principles. The model critiques and revises its own responses to ensure they align with safety guidelines, ethical principles, and quality standards.
Output Validation
by AaaS
Validates LLM outputs against expected schemas, formats, and quality criteria before delivery to end users. Implements JSON schema validation, hallucination checks, citation verification, and automated retry logic for outputs that fail validation.
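The validate-then-retry loop can be sketched in a few lines. This is an illustrative simplification, not the entry's actual implementation: `validate_output` checks a plain `{key: type}` mapping rather than full JSON Schema, and `generate` stands in for whatever LLM call you wire up.

```python
import json

def validate_output(raw: str, required_keys: dict) -> tuple[bool, list[str]]:
    """Check that an LLM response parses as JSON and matches a simple
    {key: expected_type} schema. Returns (ok, list of error messages)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, [f"invalid JSON: {e}"]
    errors = []
    for key, expected_type in required_keys.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            errors.append(f"wrong type for {key}")
    return (not errors), errors

def call_with_retry(generate, schema, max_attempts=3):
    """Call a generation function until its output validates; return the
    valid raw output, or None if every attempt fails."""
    for _ in range(max_attempts):
        raw = generate()
        ok, _errors = validate_output(raw, schema)
        if ok:
            return raw
    return None
```

In production the type check would typically be replaced by a real JSON Schema validator, with the error messages fed back into the retry prompt.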
Counterfactual Reasoning
by Community
Generates and evaluates counterfactual explanations — minimal input changes that would alter a model's prediction — using structural causal models and algorithmic recourse techniques. Provides actionable explanations for model decisions and supports causal effect estimation under interventions.
Personalized Outreach
by AaaS
Drafts hyper-personalized outreach messages for each prospect using their specific firmographic profile, recent intent signals, and ICP match factors. Enforces brand voice and CAN-SPAM/GDPR compliance, adapts tone by channel (email, LinkedIn, phone script), and graduates from human-approved to autonomous sending as trust is established.
Dependency Mapping
by AaaS
Constructs complete dependency graphs across package managers (npm, pip, cargo, Maven) and internal modules. Identifies version conflicts, circular dependencies, security-vulnerable transitive dependencies, and upgrade paths. Produces actionable dependency health reports that inform both the Codebase Architect and Dependency Guardian agents.
Buyer Intent Tracking
by AaaS
Monitors buyer intent signals across website visits, email opens, content downloads, CRM activity, and third-party intent data providers. Correlates engagement patterns to identify accounts showing active buying behavior, enabling agents to prioritize high-intent prospects over cold outreach.
Speaker Diarization
by AaaS
Enables agents to segment audio recordings by speaker identity, answering 'who spoke when' for downstream summarization and analysis tasks. Covers embedding-based clustering (pyannote.audio, NeMo), overlapping speech handling, and merging diarization with ASR transcripts.
Rollback Execution
by AaaS
Executes safe, policy-constrained rollbacks of failed deployments. Respects blast-radius limits (max affected services), rate limits (max rollbacks per hour), and change-window constraints. Supports multiple rollback strategies: Git revert, container image pinning, feature flag disabling, and traffic shifting. Produces a detailed rollback report with root cause hypothesis.
Hugging Face Transformers Training Script
by Hugging Face
The Hugging Face Transformers training script simplifies the process of training and fine-tuning transformer models for various NLP tasks. It provides a high-level API and pre-built training loops, enabling users to quickly adapt pre-trained models to their specific datasets and objectives.
PyTorch Image Classification Script
by PyTorch
A Python script using PyTorch for training and evaluating image classification models. It provides a modular structure for defining datasets, models, training loops, and evaluation metrics, enabling researchers and practitioners to quickly prototype and deploy image classification solutions.
TensorFlow Model Garden
by Google
The TensorFlow Model Garden is a repository containing a collection of example implementations for state-of-the-art (SOTA) machine learning models and modeling solutions for TensorFlow. It provides a wide variety of models, pre-trained weights, and scripts to help users quickly prototype and deploy TensorFlow-based AI solutions.
TensorFlow Model Optimization Toolkit Script
by Google
The TensorFlow Model Optimization Toolkit script provides tools and techniques to optimize TensorFlow models for deployment, including quantization, pruning, and clustering. It reduces model size and improves inference speed, making models more suitable for edge devices and resource-constrained environments.
Scikit-learn Model Evaluation Script
by Scikit-learn
A Python script leveraging scikit-learn to comprehensively evaluate machine learning models. It calculates various performance metrics (e.g., accuracy, precision, recall, F1-score, AUC) and generates visualizations (e.g., confusion matrices, ROC curves) to provide insights into model behavior and facilitate informed decision-making.
LangChain Expression Language (LCEL) Script
by LangChain
LCEL is a declarative way to compose chains of language models and other primitives in LangChain. This script demonstrates how to use LCEL to build complex AI pipelines with features like streaming, parallel execution, and retry mechanisms, enabling developers to create robust and scalable AI applications.
Stable Diffusion XL Turbo Inference Script
by Stability AI
This script provides a streamlined path to image generation with Stable Diffusion XL Turbo. It leverages optimized inference techniques to achieve faster generation speeds, making it suitable for real-time applications and interactive experiences.
Speech-to-Text Pipeline
by OpenAI
Production-grade ASR pipeline using OpenAI Whisper or faster-whisper with VAD-based chunking, speaker timestamp alignment, and SRT/VTT subtitle export. Handles long-form audio via sliding window segmentation and automatic language detection.
Object Detection Setup
by Ultralytics
Bootstraps a production-ready object detection workflow using YOLOv8 or RT-DETR, including webcam/video stream ingestion, NMS post-processing, and annotation overlay rendering. Outputs annotated frames and a structured JSON detections log suitable for downstream analytics.
Feature Importance Analyzer
by Community
Analyzes feature importance for scikit-learn compatible models using multiple advanced techniques. It computes SHAP values with Tree and Kernel Explainers, calculates permutation importance, and performs feature selection with Boruta. Results are compiled into an interactive HTML dashboard for easy interpretation and sharing.
REST AI API Template
by Community
Production-ready FastAPI template for AI-powered REST APIs, with pre-wired OpenAI/Anthropic client, async streaming endpoints, JWT authentication, rate limiting, structured logging, and OpenAPI docs. Includes Docker Compose stack with Redis rate-limit store and Prometheus metrics.
Fraud Detection Pipeline
by Community
A complete machine learning pipeline for detecting fraudulent transactions in real time. It takes a hybrid approach, using XGBoost or LightGBM for classification and an Isolation Forest for anomaly detection, and is specifically designed to handle severely imbalanced datasets through SMOTE-Tomek resampling and cost-sensitive learning.
Image Classification Pipeline
by Community
End-to-end image classification pipeline that handles dataset loading, preprocessing, model inference, and result export using PyTorch and torchvision. Supports batch inference against any Hugging Face ViT or ResNet checkpoint with configurable confidence thresholds.
Model Fine-Tuning (LoRA)
by AaaS
This script automates the process of fine-tuning large language models using Low-Rank Adaptation (LoRA). It provides an end-to-end workflow, from preparing custom datasets to training lightweight adapters and merging them into a base model for efficient deployment. This enables domain-specific model specialization with significantly reduced computational costs.
OCR Pipeline Script
by Community
This script provides a sophisticated OCR pipeline that intelligently routes documents to the most suitable engine—Tesseract, PaddleOCR, or a cloud API—based on image quality analysis. It processes various document types and outputs structured JSON containing text sorted by reading order, complete with bounding box coordinates and confidence scores for each word or line.
Image Segmentation Script
by Meta AI
Runs Segment Anything Model (SAM 2) or Mask2Former on image batches, producing per-pixel segmentation masks with class labels and confidence scores. Includes utilities for mask overlay visualization and RLE-encoded mask export compatible with COCO annotation format.
Data Quality Checker
by Great Expectations
Automates data quality testing for tabular data using the Great Expectations library. This script profiles datasets to generate and validate 'Expectations' covering schema, statistical properties, and referential integrity. It produces a comprehensive HTML report (Data Docs) and can be integrated into CI/CD pipelines as a quality gate to prevent bad data from entering production systems.
PII Redaction Pipeline
by Microsoft
An automated pipeline that leverages Microsoft Presidio to identify and remove personally identifiable information (PII) from text and structured data. It supports configurable entity recognizers for GDPR and HIPAA compliance and features a reversible pseudonymization capability with a secure vault for authorized re-identification.
Basic RAG Pipeline
by AaaS
This script provides a foundational Retrieval-Augmented Generation (RAG) pipeline. It handles core tasks like loading documents, splitting text into chunks, generating embeddings, and indexing them into a vector store. It includes a basic query interface, making it ideal for learning the RAG workflow and prototyping simple applications.
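The text-splitting step at the heart of such a pipeline can be sketched as a sliding character window with overlap. This is a minimal stand-in — real splitters usually chunk on token or sentence boundaries rather than raw characters:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows — the document-splitting
    step of a basic RAG pipeline. Consecutive chunks share `overlap` chars
    so that no sentence is cut off without context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks
```

Each chunk would then be embedded and indexed into the vector store; the overlap keeps boundary sentences retrievable from either side.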
Speaker Diarization Script
by pyannote
This script automates the process of creating turn-by-turn transcripts from multi-speaker audio files. It first uses the pyannote.audio library to perform speaker diarization, identifying who spoke and when. These speaker segments are then aligned and merged with a transcription generated by OpenAI's Whisper, producing a final text output that attributes each line of dialogue to a specific speaker.
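One common way the alignment step works — and a simplified stand-in for this script's actual merge logic — is to assign each transcribed word to the diarization turn that contains the word's midpoint:

```python
def assign_speakers(words, turns):
    """Attach a speaker label to each ASR word by midpoint overlap.
    words: list of (word, start_sec, end_sec) from the transcript;
    turns: list of (speaker, start_sec, end_sec) from diarization."""
    labeled = []
    for word, start, end in words:
        mid = (start + end) / 2
        speaker = next(
            (spk for spk, ts, te in turns if ts <= mid < te), "UNKNOWN"
        )
        labeled.append((speaker, word))
    return labeled
```

Consecutive words with the same label are then grouped into turn-by-turn lines of dialogue.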
Chatbot Builder Script
by Community
This script generates a production-ready chatbot foundation using Rasa for structured dialogue and an LLM for open-ended fallback. It provides a unified channel adapter for deploying to Web, WhatsApp, and Slack, and includes built-in conversation analytics and a Streamlit-based testing environment for rapid development.
Neo4j RAG Pipeline
by Neo4j
Implements a GraphRAG pattern that stores document entities and relationships in Neo4j, then retrieves contextually relevant subgraphs at query time before passing them to an LLM. Includes automatic entity extraction with spaCy, relationship inference, and a Cypher query generator.
Visual Search Engine
by Community
This script provides a complete framework for building a multimodal visual search engine. It uses CLIP to generate image and text embeddings, which are indexed in a vector database like Qdrant or Weaviate for efficient similarity search. The system supports both text-to-image and image-to-image queries and includes a FastAPI server for API access.
Serverless Model Deploy
by Community
Packages a trained ML model into a serverless function on AWS Lambda, Modal, or Google Cloud Run, handling cold-start optimization, dependency layering, and auto-scaling configuration. Includes health-check endpoints, structured logging, and a GitHub Actions workflow for automated rollout.
Recommendation Engine Setup
by Community
This script provides a complete setup for a modern, two-stage recommendation engine. It uses a two-tower neural network for efficient candidate retrieval and a powerful Large Language Model (LLM) for nuanced re-ranking. The system integrates with a Feast feature store to leverage real-time user context, ensuring timely and relevant suggestions.
Edge Model Optimization
by Community
Optimizes PyTorch and TensorFlow models for edge hardware by applying INT8/FP16 quantization and converting them to ONNX or TFLite formats. This script provides platform-specific tuning for ARM and NPU targets, benchmarking latency and memory usage while generating a report on accuracy trade-offs.
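The core of INT8 quantization is a simple scale-and-round. A toy per-tensor symmetric version (real toolchains quantize per-channel with calibration data, so treat this purely as a sketch of the arithmetic):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: pick a scale so the largest
    magnitude maps to 127, then round each weight to an int8 step."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights; error is at most half a step."""
    return [v * scale for v in q]
```

The accuracy trade-off the entry mentions comes exactly from this rounding error, which the script benchmarks against the FP32 original.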
Model Serving (vLLM)
by AaaS
This script automates the deployment of a large language model using the vLLM inference engine. It creates a high-throughput, OpenAI-compatible API endpoint. Key features like PagedAttention and continuous batching are configured to maximize performance and memory efficiency, making it suitable for production environments.
WebSocket Streaming API
by Community
WebSocket server that proxies token-by-token LLM streaming to multiple simultaneous clients, with connection lifecycle management, heartbeat keep-alives, and per-session context persistence. Supports fan-out broadcasting for collaborative AI sessions and reconnection with message replay.
Automated Feature Engineering
by Alteryx
Applies Deep Feature Synthesis via Featuretools and AutoFeat to automatically generate hundreds of candidate features from relational tabular data, then prunes them using mutual information and SHAP-based importance filters. Produces a reproducible feature pipeline serializable to scikit-learn format.
Sentiment Dashboard
by Community
Ingests social media feeds, reviews, and support tickets in near-real-time, scores sentiment at entity and aspect level using a fine-tuned RoBERTa model, and renders a live Streamlit dashboard with trend charts, topic clustering, and configurable alert thresholds for brand-crisis detection.
Data Cleaning Script
by AaaS
Cleans and normalizes text data for LLM consumption by removing HTML artifacts, fixing encoding issues, standardizing whitespace, deduplicating near-identical entries, and filtering low-quality content based on configurable quality heuristics.
Voice Cloning Setup
by Coqui
Sets up a zero-shot voice cloning pipeline using Coqui XTTS-v2 or Tortoise-TTS, requiring only a 3-second reference audio clip to synthesize new speech in the target voice. Includes a FastAPI inference server, audio quality normalization, and speaker embedding export for reuse.
Document Ingestion Pipeline
by AaaS
Automated pipeline for ingesting documents from multiple sources (files, URLs, APIs) into a vector store. Handles format detection, text extraction, chunking, deduplication, metadata enrichment, and incremental updates for growing knowledge bases.
Dataset Preparation
by AaaS
Prepares datasets for LLM fine-tuning by converting raw data into instruction-following, conversation, or completion formats. Handles data cleaning, deduplication, train/val/test splitting, tokenization analysis, and quality filtering.
Web Scraping Pipeline
by AaaS
Automated web scraping pipeline with configurable crawl depth, content extraction, and rate limiting. Converts web content into clean text documents suitable for embedding and RAG ingestion with support for dynamic JavaScript-rendered pages.
Tool Calling Setup
by AaaS
Sets up a tool-calling agent with typed tool definitions, argument validation, error handling, and execution sandboxing. Includes example tools for web search, calculator, file operations, and database queries with a pluggable tool registry.
Batch Embedding Generation
by AaaS
Generates embeddings at scale for large document collections with batching, rate limiting, checkpointing, and error recovery. Supports multiple embedding providers (OpenAI, Cohere, local models) with automatic dimension detection and output format selection.
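The batching-with-checkpointing pattern can be sketched as follows, with `embed_fn` standing in for any provider client (OpenAI, Cohere, or a local model) — the checkpoint dict would be persisted to disk in a real run:

```python
def embed_in_batches(texts, embed_fn, batch_size=32, checkpoint=None):
    """Embed texts in fixed-size batches, skipping batches already present
    in the checkpoint so a crashed run can resume where it left off.
    checkpoint maps batch index -> list of vectors."""
    checkpoint = {} if checkpoint is None else checkpoint
    for i in range(0, len(texts), batch_size):
        batch_idx = i // batch_size
        if batch_idx in checkpoint:
            continue  # already embedded in a previous run
        checkpoint[batch_idx] = embed_fn(texts[i:i + batch_size])
    # flatten results back into input order
    return [vec for idx in sorted(checkpoint) for vec in checkpoint[idx]]
```

Rate limiting and error recovery then wrap the `embed_fn` call: back off on 429s, and leave the failed batch out of the checkpoint so the next pass retries it.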
Temporal Feature Builder
by Community
Generates comprehensive temporal features from time-series data including rolling statistics, lag features, Fourier transforms, and calendar encodings using tsfresh and custom transformers. Handles irregular time series with forward-fill interpolation and produces a point-in-time-correct feature matrix to prevent leakage.
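The point-in-time-correctness constraint is the key idea: row t may only use observations from t-1 and earlier. A stdlib-only sketch of lag and rolling-mean features (tsfresh generates far more, but under the same rule):

```python
def lag_features(series, lags=(1, 2), window=3):
    """Build leakage-free temporal features: for each time step t, emit
    lagged values and a rolling mean computed strictly from the past."""
    rows = []
    for t in range(len(series)):
        feats = {}
        for lag in lags:
            feats[f"lag_{lag}"] = series[t - lag] if t - lag >= 0 else None
        past = series[max(0, t - window):t]  # excludes series[t] itself
        feats[f"rollmean_{window}"] = sum(past) / len(past) if past else None
        rows.append(feats)
    return rows
```

Using `series[t]` inside the rolling window would leak the target into its own features — the exact bug a point-in-time-correct builder exists to prevent.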
RAG Pipeline Setup
by AaaS
End-to-end setup script for deploying a production RAG pipeline. Provisions vector database, configures document ingestion, sets up embedding generation, and creates retrieval endpoints.
Advanced RAG Pipeline
by AaaS
Production-grade RAG pipeline with hybrid search, reranking, contextual compression, and multi-index routing. Includes query decomposition, metadata filtering, evaluation metrics, and performance monitoring for enterprise deployments.
Model A/B Testing
by Community
Implements statistically rigorous A/B and shadow-mode testing for competing ML model versions behind a feature flag router, logging predictions and latencies to a data warehouse for significance testing. Automatically computes sample size requirements and stops experiments when significance thresholds are met.
Graph Embedding Generator
by Community
Generates node and edge embeddings for knowledge graphs using Node2Vec, TransE, or a GNN (via PyTorch Geometric), then indexes them in a vector store for similarity search and link prediction. Includes training scripts, evaluation on standard link-prediction benchmarks, and a REST API for embedding lookup.
Financial Report Parser
by Community
Parses SEC filings, earnings call transcripts, and annual reports using FinBERT for sentiment analysis and a table-extraction pipeline that converts HTML/XBRL financial statements into normalized pandas DataFrames. Exports structured financial metrics to a database and generates LLM-ready summaries for investor Q&A.
Clinical NLP Pipeline
by Community
Processes unstructured clinical notes using medspaCy and BioClinicalBERT to extract diagnoses, medications, procedures, and lab values, then maps entities to ICD-10 and SNOMED-CT codes. Outputs FHIR-compatible JSON bundles and includes a de-identification step compliant with HIPAA Safe Harbor.
Feature Store Sync
by Feast
Synchronizes feature definitions and materialized feature values between offline (BigQuery/Snowflake) and online (Redis/DynamoDB) feature stores using Feast or Tecton, with configurable freshness SLAs and backfill scheduling. Includes drift monitoring to alert when online and offline distributions diverge.
PDF Extraction Pipeline
by AaaS
Specialized pipeline for extracting structured content from PDF documents including text, tables, images, and metadata. Supports OCR for scanned documents, layout analysis for complex formats, and chunking optimized for PDF document structures.
Docker ML Deployment
by AaaS
Containerizes ML models and inference servers with optimized Docker images for production deployment. Includes multi-stage builds for minimal image size, GPU support configuration, health checks, and docker-compose setups for full inference stacks.
Canary Deployment ML
by Community
Orchestrates progressive canary deployments of ML model services on Kubernetes using Istio traffic shifting, with automated rollback triggered by error-rate or latency SLO breaches. Integrates with Argo Rollouts for declarative release management and posts deployment status to Slack.
Model Evaluation Harness
by AaaS
Comprehensive model evaluation script that runs models against standard benchmarks including MMLU, HumanEval, GSM8K, and custom evaluation sets. Produces detailed reports with per-category breakdowns, confidence intervals, and comparison charts.
GGUF Conversion
by AaaS
Converts Hugging Face model weights to GGUF format for use with llama.cpp and compatible inference engines. Supports multiple quantization levels (Q4_K_M, Q5_K_M, Q8_0), validates output integrity, and generates model cards with performance characteristics.
Music Generation Script
by Meta AI
Generates royalty-free music from text prompts using Meta's MusicGen or AudioCraft, with controls for tempo, key, duration, and genre conditioning. Provides a CLI for batch generation and a streaming mode that writes 30-second chunks to disk or an S3 bucket.
Face Recognition Setup
by Community
Configures a face recognition system using InsightFace or DeepFace, supporting gallery enrollment, real-time identification against a FAISS vector store, and liveness detection. Designed with privacy-first defaults and includes GDPR-compliant consent logging.
Model Comparison Script
by AaaS
Side-by-side model comparison script that runs identical prompts through multiple LLM APIs and presents results in a structured format. Measures response quality, latency, token usage, and cost per query with automated scoring via LLM judges.
Knowledge Graph Builder
by Community
Automatically constructs a knowledge graph from unstructured text by extracting subject-predicate-object triples using an LLM, then serializing them to RDF/OWL or property-graph formats. Supports ontology alignment, duplicate merging via entity resolution, and Turtle/JSON-LD export.
Knowledge Base Builder
by AaaS
End-to-end script for building a searchable knowledge base from heterogeneous sources including documents, APIs, databases, and web content. Orchestrates ingestion, deduplication, embedding, indexing, and creates a unified query interface across all sources.
Legal Document Analyzer
by Community
Analyzes legal contracts and court documents using a fine-tuned LegalBERT model for clause classification, obligation extraction, and risk-flag detection, with outputs cross-referenced against a configurable playbook of standard clause definitions. Generates a redline-ready Word document and a structured JSON risk register.
Multi-Agent Orchestration
by AaaS
Orchestrates multiple specialized AI agents in coordinated workflows with task routing, state management, and result aggregation. Implements supervisor and swarm patterns with configurable agent selection logic and inter-agent communication.
Model Quantization (GPTQ)
by AaaS
Quantizes language models using GPTQ for efficient inference on consumer hardware. Performs calibration-based quantization, quality evaluation against the original model, and exports in formats compatible with vLLM, llama.cpp, and other inference engines.
Entity Linking Script
by Community
Disambiguates named entities in text by linking them to canonical Wikidata or custom knowledge base entries, using a bi-encoder retriever followed by a cross-encoder reranker. Handles multi-lingual input via mBERT and outputs entity URIs with confidence scores for downstream graph population.
Cost Calculator
by AaaS
Calculates and projects LLM API costs based on usage patterns, model pricing, and workload forecasts. Compares costs across providers and models, identifies the most cost-effective configuration for a given quality threshold, and generates budget reports.
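The projection arithmetic is straightforward once usage is profiled. A sketch with hypothetical model names and prices — real per-million-token rates must come from each provider's current price sheet:

```python
PRICES = {  # hypothetical $ per 1M tokens: (input, output) — not real rates
    "model-small": (0.15, 0.60),
    "model-large": (3.00, 15.00),
}

def monthly_cost(model, requests_per_day, in_tokens, out_tokens, days=30):
    """Project monthly API spend from average request size and volume."""
    price_in, price_out = PRICES[model]
    per_request = in_tokens / 1e6 * price_in + out_tokens / 1e6 * price_out
    return round(per_request * requests_per_day * days, 2)
```

Comparing `monthly_cost` across entries in `PRICES` at a fixed quality threshold is how the cheapest viable configuration is identified.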
Hallucination Detector
by AaaS
Detects hallucinated content in LLM outputs by cross-referencing claims against source documents and knowledge bases. Uses claim decomposition, source attribution scoring, and consistency checking to flag unsupported or fabricated statements.
Hybrid Search Setup
by AaaS
Configures a hybrid search system combining dense vector similarity with sparse BM25 keyword matching. Sets up dual index creation, score fusion strategies, and query routing logic for optimal retrieval across different query types.
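A common score-fusion strategy for such a system is Reciprocal Rank Fusion over the dense and sparse result lists. A minimal sketch (the `k=60` default is the conventional choice, and the function is illustrative rather than tied to any particular library):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked result lists (e.g. one from dense vector search, one
    from BM25) with RRF: score(doc) = sum over lists of 1 / (k + rank).
    Documents ranked highly in several lists rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is popular here because it needs no score normalization — cosine similarities and BM25 scores live on incompatible scales, but ranks are always comparable.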
Prompt Testing Suite
by AaaS
Automated testing framework for prompt engineering with test case management, assertion-based evaluation, regression detection, and A/B comparison. Validates prompt outputs against expected patterns, formats, and quality criteria with CI/CD integration.
MCP Server Template
by AaaS
Template for building Model Context Protocol (MCP) servers that expose tools, resources, and prompts to MCP-compatible clients. Includes typed tool handlers, resource providers, error handling, and transport configuration for stdio and HTTP modes.
LLM Load Testing
by AaaS
Load tests LLM API endpoints with configurable concurrency, request patterns, and duration. Measures throughput, latency percentiles (p50/p95/p99), time-to-first-token, error rates, and generates performance reports with degradation alerts.
CSV to Embeddings
by AaaS
Converts CSV data into vector embeddings with configurable column selection, text template formatting, and metadata extraction. Outputs to popular vector stores or file formats with chunking support for large CSV files that exceed memory limits.
Audio Classification Setup
by Community
Configures an audio classification system using Audio Spectrogram Transformer (AST) or YAMNet fine-tuned on AudioSet, with Mel spectrogram feature extraction and batch inference. Exports per-clip predictions with top-5 class probabilities and integrates with a streaming event bus for real-time use.
Document Classification
by AaaS
Classifies documents into predefined categories using LLM-based inference with configurable taxonomies. Supports batch processing, multi-label classification, confidence thresholds, and exports results to CSV or database with audit trails.
Data Lineage Tracker
by OpenLineage
Instruments ETL and ML pipelines with OpenLineage events, shipping dataset-level provenance metadata to a Marquez or Apache Atlas backend. Generates interactive lineage DAGs showing data transformations from source to model artifact, supporting impact analysis and audit trails.
Cost Optimization Script
by AaaS
Analyzes LLM API usage patterns and identifies cost optimization opportunities. Recommends model downgrades for simple tasks, prompt compression strategies, caching opportunities, and batch processing windows based on historical usage data and cost metrics.
GraphQL AI Gateway
by Community
GraphQL gateway for multi-model AI services built with Strawberry Python, exposing query, mutation, and subscription resolvers for chat, embedding, and image generation endpoints across multiple LLM providers. Features a DataLoader-based batching layer and persisted query caching to minimize token usage.
Supply Chain Optimizer
by Community
Combines ML demand forecasting (Prophet + LightGBM) with constraint-based optimization (Google OR-Tools) to minimize inventory costs while meeting service-level targets across a multi-echelon supply chain. Outputs replenishment orders, safety stock recommendations, and a scenario simulation dashboard.
Safety Audit Script
by AaaS
Comprehensive safety audit for LLM-powered applications testing for prompt injection vulnerabilities, PII leakage, harmful content generation, and policy violations. Generates detailed audit reports with severity ratings and remediation recommendations.
Entity Extraction Pipeline
by AaaS
Extracts named entities and relationships from unstructured text at scale using LLM-powered NER with custom entity type support. Outputs structured data with entity linking, relationship graphs, and confidence scores for knowledge graph construction.
Monitoring Setup (Grafana)
by AaaS
Sets up Grafana dashboards and Prometheus metrics for LLM application monitoring. Includes pre-built dashboards for token usage, latency, error rates, cost tracking, and model performance with configurable alert rules and notification channels.
Model Benchmarking Suite
by AaaS
Performance benchmarking suite measuring LLM inference throughput, latency percentiles, time-to-first-token, and tokens-per-second under various load patterns. Generates detailed performance reports with charts for capacity planning and SLA validation.
LLM Regression Testing
by AaaS
Detects regressions in LLM behavior across model updates, prompt changes, or configuration modifications. Runs golden test sets, compares outputs using semantic similarity and LLM judges, and flags significant quality degradation with detailed diff reports.
Agent Evaluation Framework
by AaaS
Evaluates AI agent performance across defined test scenarios with success criteria, step tracking, and automated scoring. Supports custom evaluation rubrics, regression detection, and generates detailed reports comparing agent versions over time.
Annotation Pipeline
by AaaS
Automated data annotation pipeline using LLMs for labeling, classification, and quality scoring of training data. Implements multi-annotator consensus, confidence thresholds, human review queuing for uncertain samples, and annotation analytics.
Token Usage Analyzer
by AaaS
Analyzes token usage patterns across LLM applications to identify optimization opportunities. Tracks input/output token ratios, identifies verbose prompts, detects unnecessary context, and recommends prompt engineering improvements for cost reduction.
CI/CD ML Pipeline
by AaaS
CI/CD pipeline for machine learning models with automated testing, evaluation, registry management, and staged deployment. Runs benchmark suites, compares against baseline metrics, and promotes models through staging environments with approval gates.
Bias Detection Script
by AaaS
Detects demographic and topical biases in LLM outputs by running structured test prompts across protected categories. Measures response quality disparities, sentiment differences, and representation gaps with statistical significance testing and bias scorecards.
Rate Limiter Setup
by AaaS
Configures intelligent rate limiting for LLM API proxies with per-user, per-model, and per-endpoint limits. Implements token bucket, sliding window, and adaptive rate limiting algorithms with Redis-backed distributed state and graceful degradation.
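Of the algorithms listed, the token bucket is the simplest to sketch. A single-process version with an injectable clock (the Redis-backed distributed variant keeps the same state, just in a shared store):

```python
import time

class TokenBucket:
    """Token-bucket limiter: holds at most `capacity` tokens, refilled at
    `rate` tokens/second; each allowed request spends one token, so short
    bursts up to `capacity` pass while the long-run rate is capped."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Per-user and per-model limits are then just separate buckets keyed by user ID or model name.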
Agent Deployment Script
by AaaS
Deploys AI agents as production services with health checks, graceful shutdown, error recovery, and monitoring integration. Supports Docker and Kubernetes deployments with configurable scaling, environment management, and rollback capabilities.
Red Teaming Script
by AaaS
Automated red teaming toolkit that generates and tests adversarial prompts against LLM applications. Covers jailbreak attempts, prompt injection variants, social engineering patterns, and boundary probing with categorized attack vectors and success tracking.
Latency Benchmarking
by AaaS
Benchmarks LLM API latency across providers, models, and prompt sizes with detailed statistical analysis. Measures time-to-first-token, inter-token latency, total response time, and generates comparison reports with confidence intervals and percentile distributions.
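The percentile computation behind a p50/p95/p99 report can be done with the nearest-rank method in a few lines (shown here instead of a stats library to make the definition explicit):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of latency samples, p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_report(samples_ms):
    """Summarize a run of latency measurements at the usual percentiles."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

Tail percentiles need enough samples to be meaningful — p99 over fewer than a hundred requests is effectively just the slowest observation.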
Kubernetes Model Serving
by AaaS
Deploys and manages LLM inference workloads on Kubernetes with GPU scheduling, auto-scaling based on queue depth, rolling updates, and canary deployments. Generates Helm charts and Kustomize configurations for reproducible deployments.
Energy Forecast Script
by Community
Forecasts electricity demand and renewable generation (solar/wind) using Temporal Fusion Transformer or N-HiTS via NeuralForecast, with weather feature integration and probabilistic intervals for grid balancing. Outputs 24-hour and 7-day ahead forecasts in an InfluxDB-compatible format.
API Gateway Configuration
by AaaS
Configures an API gateway for LLM inference endpoints with provider routing, rate limiting, authentication, request/response logging, and failover between multiple LLM providers. Includes usage tracking and cost allocation by API key.
Multi-Source RAG
by AaaS
RAG pipeline that queries multiple specialized vector indexes and merges results with intelligent routing. Implements source-aware retrieval with automatic query classification, per-source relevance scoring, and citation tracking across diverse knowledge domains.
A/B Testing Framework
by AaaS
Framework for A/B testing different LLM configurations including models, prompts, temperatures, and system instructions. Runs controlled experiments with statistical significance testing, effect size calculation, and automated winner selection.
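For binary outcomes (e.g. "response judged acceptable"), the significance test can be a two-proportion z-test. A stdlib sketch of that one statistic — a full framework would also handle continuous metrics and multiple-comparison correction:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-proportion z-test comparing win rates of two LLM configs.
    Returns the z statistic; |z| > 1.96 is significant at roughly the
    5% level for a two-sided test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```

The automated winner selection the entry describes amounts to declaring a winner only once |z| clears the chosen threshold at the pre-computed sample size.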
Agent Monitoring Dashboard
by AaaS
Sets up a monitoring dashboard for AI agent systems tracking task completion rates, error rates, latency, token usage, and cost. Integrates with Prometheus for metrics collection and Grafana for visualization with pre-built alert rules.
Consent Management Script
by Community
Implements a GDPR-compliant consent management layer that records per-user data processing consents in an append-only ledger, enforces purpose limitation at the data access layer, and generates DSAR (data subject access request) reports on demand. Supports consent propagation to downstream ML training pipelines.
Model Merging
by AaaS
Merges multiple fine-tuned model checkpoints using strategies like SLERP, TIES, DARE, and linear interpolation. Enables combining specialized model capabilities without additional training, with automated quality validation against benchmark suites.
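The SLERP strategy interpolates along the arc between two weight tensors instead of the straight line, preserving vector norm. A toy version on plain float lists (real merges apply this per tensor across full checkpoints):

```python
import math

def slerp(v0, v1, t):
    """Spherical linear interpolation between two weight vectors:
    walk a fraction t along the great-circle arc from v0 to v1."""
    dot = sum(a * b for a, b in zip(v0, v1))
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    cos_theta = max(-1.0, min(1.0, dot / (n0 * n1)))
    theta = math.acos(cos_theta)
    if theta < 1e-6:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```

Linear interpolation between two unit-norm tensors shrinks the result toward the origin; SLERP avoids that, which is the usual argument for preferring it when merging checkpoints.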
Vector DB Migration
by AaaS
Migrates vector data between different vector database providers (Pinecone, Weaviate, Chroma, Qdrant, Milvus). Handles schema mapping, batch transfers, index recreation, metadata preservation, and validation with rollback support.
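The batch-transfer-and-validate pattern can be sketched provider-agnostically; `scan`, `upsert`, and `count` below are hypothetical client methods, not any vendor's actual API:

```python
def migrate_vectors(source, target, batch_size=2):
    """Copy (id, vector, metadata) records between two stores in
    batches, then validate the transferred count."""
    moved, batch = 0, []
    for record in source.scan():
        batch.append(record)
        if len(batch) >= batch_size:
            target.upsert(batch)
            moved += len(batch)
            batch = []
    if batch:                       # flush the final partial batch
        target.upsert(batch)
        moved += len(batch)
    assert moved == source.count(), "migration validation failed"
    return moved

class MemoryStore:
    """In-memory stand-in for a vector DB client."""
    def __init__(self, records=None):
        self.records = list(records or [])
    def scan(self):
        yield from self.records
    def count(self):
        return len(self.records)
    def upsert(self, batch):
        self.records.extend(batch)

src = MemoryStore([("a", [0.1], {}), ("b", [0.2], {}), ("c", [0.3], {})])
dst = MemoryStore()
print(migrate_vectors(src, dst))  # → 3
```

Schema mapping and index recreation happen before this loop; rollback simply means dropping the target collection if validation fails.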
Agent Testing Harness
by AaaS
Testing harness for AI agents with mock tool providers, simulated user interactions, and deterministic replay capabilities. Enables unit testing of agent logic, integration testing of tool chains, and end-to-end testing of complete agent workflows.
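A mock tool provider with deterministic replay can be as simple as scripted responses plus call recording; an illustrative harness with a toy agent loop (`get_weather` and `run_agent` are hypothetical names):

```python
class MockTool:
    """Deterministic mock tool: returns scripted responses in order
    and records every call for later assertions."""

    def __init__(self, name, responses):
        self.name, self.responses = name, list(responses)
        self.calls = []

    def __call__(self, **kwargs):
        self.calls.append(kwargs)
        return self.responses[len(self.calls) - 1]

def run_agent(tools, task):
    """Toy agent logic under test: look up the weather, then report it."""
    weather = tools["get_weather"](city=task["city"])
    return f"{task['city']}: {weather}"

weather_tool = MockTool("get_weather", ["sunny"])
result = run_agent({"get_weather": weather_tool}, {"city": "Oslo"})
print(result)               # → Oslo: sunny
print(weather_tool.calls)   # → [{'city': 'Oslo'}]
```

Because responses are scripted, any recorded run replays identically, which is what makes agent unit tests deterministic.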
A2A Communication Setup
by AaaS
Configures Agent-to-Agent (A2A) communication infrastructure with message routing, capability discovery, and protocol compliance. Sets up agent registries, message queues, and typed message schemas for reliable inter-agent collaboration.
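Typed message schemas with capability discovery can be sketched with dataclasses and per-agent queues; the schema fields here are illustrative, not a specific A2A protocol version:

```python
from dataclasses import dataclass
import queue

@dataclass
class A2AMessage:
    """Typed inter-agent message (fields are illustrative)."""
    sender: str
    recipient: str
    capability: str
    payload: dict

class AgentRegistry:
    """Capability discovery plus per-agent message queues."""

    def __init__(self):
        self.capabilities = {}   # agent -> set of advertised capabilities
        self.inboxes = {}        # agent -> message queue

    def register(self, agent, capabilities):
        self.capabilities[agent] = set(capabilities)
        self.inboxes[agent] = queue.Queue()

    def route(self, msg: A2AMessage):
        # Reject messages targeting a capability the recipient never advertised.
        if msg.capability not in self.capabilities.get(msg.recipient, set()):
            raise ValueError(f"{msg.recipient} lacks {msg.capability}")
        self.inboxes[msg.recipient].put(msg)

reg = AgentRegistry()
reg.register("summarizer", {"summarize"})
reg.route(A2AMessage("planner", "summarizer", "summarize", {"text": "..."}))
print(reg.inboxes["summarizer"].qsize())  # → 1
```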
MLPerf Training
by MLCommons
MLPerf Training is a suite of benchmarks that measure the time it takes to train various machine learning models on different hardware and software platforms. It provides a standardized way to compare the performance of different AI training systems, driving innovation in hardware and software optimization for AI workloads.
HELM: Holistic Evaluation of Language Models
by Stanford Center for Research on Foundation Models (CRFM)
HELM is a living benchmark designed to provide a comprehensive and holistic evaluation of language models across a wide range of scenarios and metrics. It aims to move beyond single-number evaluations by assessing models on factors like truthfulness, calibration, fairness, robustness, and efficiency, providing a more nuanced understanding of their capabilities and limitations.
ImageNet
by Deng et al. / Stanford / Princeton
ImageNet (ILSVRC) is the foundational large-scale visual recognition benchmark with 1.2 million training images across 1,000 object categories. Top-1 and Top-5 accuracy on the validation set have been the standard measures of progress in image classification for over a decade.
RoboSuite
by Stanford AI Lab
RoboSuite is a simulation framework and benchmark suite for robot learning. It provides a standardized set of environments and tasks for training and evaluating reinforcement learning algorithms in robotics, focusing on manipulation and locomotion tasks with realistic physics and sensor models.
AI2 Reasoning Challenge (ARC)
by Allen Institute for AI (AI2)
The AI2 Reasoning Challenge (ARC) is a question-answering dataset designed to evaluate advanced reasoning capabilities in AI systems. It consists of elementary-level science questions specifically crafted to be difficult for retrieval-based methods and to require deeper understanding and reasoning to answer correctly.
COCO Detection
by Lin et al. / Microsoft
COCO Detection is the standard benchmark for object detection and instance segmentation, featuring 330,000 images with over 1.5 million annotated instances across 80 object categories. Mean Average Precision (mAP) at various IoU thresholds is the primary metric.
LibriSpeech
by Panayotov et al. / Johns Hopkins
LibriSpeech is the standard English automatic speech recognition (ASR) benchmark derived from LibriVox audiobooks, containing 1,000 hours of read speech at 16kHz. Word Error Rate (WER) on the "clean" and "other" test splits drives competitive progress in ASR research.
ADE20K Segmentation
by Zhou et al. / MIT CSAIL
ADE20K is the benchmark for semantic scene parsing, containing 25,000 images densely annotated with 150 semantic categories. Mean Intersection over Union (mIoU) is the standard metric, and it drives progress in perception systems for autonomous driving, robotics, and scene understanding.
GSM8K
by OpenAI
Grade School Math 8K benchmark with 8,500 linguistically diverse grade school math word problems requiring 2-8 step reasoning. Tests basic mathematical reasoning and arithmetic with problems that require sequential multi-step solutions.
SWE-bench Verified
by OpenAI / Princeton NLP
Human-validated subset of SWE-bench containing 500 problems verified by software engineers for correctness, clarity, and solvability. Provides a more reliable signal than the full SWE-bench by filtering out ambiguous or under-specified issues.
MATH
by UC Berkeley
Collection of 12,500 competition mathematics problems from AMC, AIME, and other math competitions covering algebra, geometry, number theory, combinatorics, and more. Problems require multi-step reasoning and mathematical insight beyond pattern matching.
ARC-AGI
by Chollet / ARC Prize Foundation
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) measures fluid intelligence through visual grid transformation puzzles. Models must infer transformation rules from a handful of demonstration examples and apply them to a test grid — a task humans solve easily but that has historically been extremely difficult for AI systems.
HellaSwag
by Allen AI
Evaluates commonsense natural language inference by asking models to select the most plausible continuation of a scenario. Uses adversarially filtered endings generated by language models, making it challenging for machines while trivial for humans.
Common Voice
by Mozilla Foundation
Common Voice is Mozilla's crowd-sourced multilingual speech corpus spanning 100+ languages with verified recordings from volunteers. It benchmarks ASR systems on low-resource and diverse language conditions, making it critical for evaluating cross-lingual speech model generalization.
MLPerf Inference
by MLCommons
MLPerf Inference is the industry-standard benchmark for measuring AI inference performance across hardware platforms. It covers image classification, object detection, NLP, speech recognition, and generative AI workloads, enabling fair apples-to-apples comparison of accelerators and inference stacks.
ARC Challenge
by Allen AI
AI2 Reasoning Challenge featuring grade-school science questions that require commonsense reasoning and world knowledge. The Challenge set contains questions that simple retrieval and co-occurrence methods fail to answer correctly.
MedQA
by Jin et al. / UC San Diego
MedQA tests medical knowledge using free-form multiple-choice questions drawn from the US Medical Licensing Examination (USMLE). It evaluates whether language models can reason through complex clinical scenarios requiring deep biomedical knowledge.
MT-Bench
by LMSYS
Multi-turn conversation benchmark with 80 high-quality questions across 8 categories including writing, reasoning, math, coding, and extraction. Uses GPT-4 as an automated judge to evaluate response quality on a 1-10 scale across two conversation turns.
FLORES-200
by NLLB Team / Meta AI
FLORES-200 is a many-to-many multilingual translation benchmark covering 200 languages, including many low-resource ones. It evaluates machine translation systems across 40,000 language direction pairs, making it the most comprehensive translation benchmark for assessing cross-lingual generalization.
GPQA
by NYU
Graduate-level Google-Proof Question Answering benchmark featuring questions written by domain experts in physics, chemistry, and biology. Questions are designed to be unsearchable, requiring genuine reasoning rather than memorization.
TruthfulQA
by University of Oxford
Measures whether language models generate truthful answers to questions where humans are commonly mistaken. Covers health, law, finance, and politics topics where popular misconceptions and conspiracies create systematic failure modes.
MBPP
by Google Research
Mostly Basic Programming Problems — a collection of 974 crowd-sourced Python programming tasks with natural language descriptions and test cases. Tests foundational programming ability including string manipulation, list processing, and basic algorithms.
Flickr30k
by Young et al. / University of Illinois
Flickr30k is a benchmark for image-text retrieval and visual grounding, comprising 31,783 Flickr images each paired with five human-written captions. Models are evaluated on bidirectional image-to-text and text-to-image retrieval recall at ranks 1, 5, and 10.
Needle-in-a-Haystack
by Greg Kamradt (community)
Needle-in-a-Haystack is a pressure test for long-context language models that places a single fact (the needle) at a specific position within a long document (the haystack) and asks the model to retrieve it. It systematically varies both context length and needle depth to reveal performance degradation patterns.
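The construction is simple to reproduce: insert the needle at a fractional depth within filler text of the target length. A minimal builder sketch (filler and needle strings are illustrative):

```python
def build_haystack(filler: str, needle: str, depth: float, length: int) -> str:
    """Place the needle sentence at a fractional depth within a
    haystack of roughly `length` characters."""
    reps = filler * (length // len(filler) + 1)
    cut = int(length * depth)
    return reps[:cut] + " " + needle + " " + reps[cut:length]

ctx = build_haystack("The grass is green. ",
                     "The secret code is 7421.", depth=0.5, length=200)
print("7421" in ctx)  # → True
```

Sweeping `depth` from 0.0 to 1.0 and `length` up to the model's context limit produces the familiar retrieval-degradation heatmaps.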
VQA v2
by Georgia Tech / VT
Visual Question Answering benchmark requiring models to answer open-ended questions about images. Version 2 balances the dataset to reduce language biases, ensuring models must genuinely understand image content rather than relying on question-type priors.
BIG-Bench Hard
by Google DeepMind
Curated subset of 23 challenging BIG-Bench tasks where prior language models performed below average human raters. Specifically designed to test tasks that benefit significantly from chain-of-thought prompting and multi-step reasoning.
WinoGrande
by Allen AI
Large-scale dataset for commonsense coreference resolution inspired by Winograd schemas. Tests whether models can correctly resolve pronoun references based on world knowledge and commonsense reasoning in carefully constructed sentence pairs.
RealToxicityPrompts
by Gehman et al. / Allen Institute for AI
RealToxicityPrompts measures the propensity of language model generations to produce toxic content when conditioned on a diverse set of 100,000 naturally occurring prompts extracted from the web. It uses the Perspective API to score generated text on toxicity dimensions.
PubMedQA
by Jin et al. / Carnegie Mellon University
PubMedQA is a biomedical question-answering dataset sourced from PubMed abstracts. Models must answer yes/no/maybe questions about biomedical research findings, testing the ability to reason over scientific literature.
LegalBench
by Guha et al. / Stanford CodeX
LegalBench is a collaboratively built benchmark measuring the legal reasoning ability of large language models across 162 tasks spanning issue spotting, rule recall, rule application, and legal interpretation. It provides a comprehensive evaluation of whether models can reason like lawyers.
ScienceQA
by Lu et al. / UCLA
ScienceQA is a large-scale multimodal benchmark featuring 21,208 science questions for grades 3-12. It uniquely combines visual diagrams and textual contexts, requiring models to perform complex reasoning. Each question includes multiple-choice options, a detailed lecture, and a step-by-step explanation for the correct answer.
AlpacaEval
by Stanford
Automated evaluation framework comparing model outputs against a reference model on 805 instructions. Uses LLM judges to determine win rates, with length-controlled metrics to avoid rewarding verbosity over quality.
BioASQ
by Tsatsaronis et al. / BioASQ Challenge
BioASQ is a large-scale benchmark for biomedical semantic question answering. It challenges systems to perform document retrieval, concept mapping, and answer extraction from PubMed literature. The benchmark includes diverse question types like yes/no, factoid, list, and summary, with gold-standard answers curated by experts.
MMLU-Pro
by TIGER-Lab
MMLU-Pro is a challenging benchmark designed to evaluate the advanced reasoning and knowledge capabilities of frontier AI models. It enhances the original MMLU by introducing harder, professionally vetted questions, expanding answer choices from 4 to 10, and reducing sensitivity to prompt formatting for a more robust and discriminative assessment.
ToolBench
by Qin et al. / Tsinghua University
ToolBench evaluates LLMs on their ability to use real-world REST APIs to complete user instructions. It provides 16,000+ real APIs from RapidAPI Hub across 49 categories and 12,000+ instruction–API solution pairs, measuring whether models can plan and execute multi-step API call sequences.
MMMU
by CUHK / Waterloo
MMMU is a challenging multimodal benchmark designed to evaluate large models on expert-level tasks. It contains over 11,500 college-level problems spanning six core disciplines, requiring models to integrate deep subject knowledge with visual perception to answer multiple-choice questions with detailed reasoning.
DROP
by Allen AI
DROP (Discrete Reasoning Over Paragraphs) is a challenging benchmark designed to evaluate a model's numerical reasoning capabilities within textual contexts. It requires systems to read paragraphs and answer questions that involve discrete operations like addition, counting, sorting, or comparison. Unlike simpler QA datasets, DROP necessitates multi-step reasoning processes, pushing models beyond basic information retrieval.
ToxiGen
by Hartvigsen et al. / MIT
ToxiGen is a large-scale, machine-generated dataset for evaluating nuanced hate speech detection. It contains over 274,000 toxic and benign statements about 13 minority groups, designed to challenge models to identify implicit toxicity without relying on obvious slurs or surface-level cues.
BigCodeBench
by Zhuo et al. / BigCode / Hugging Face
BigCodeBench is a challenging benchmark for evaluating large language models on practical, function-level code generation tasks. It comprises 1,140 problems that require the use and integration of popular Python libraries like NumPy, Pandas, and Scikit-learn, moving beyond simple algorithmic puzzles to mirror real-world software development scenarios.
TyDi QA
by Clark et al. / Google Research
TyDi QA is a multilingual question-answering benchmark featuring 11 typologically diverse languages. Questions are written natively by speakers of each language, ensuring genuine linguistic challenges and avoiding translation artifacts. It is designed to evaluate reading comprehension across a wide range of language structures.
MedMCQA
by Pal et al. / IIT Kanpur
MedMCQA is a massive multiple-choice question dataset sourced from Indian medical entrance examinations like AIIMS and NEET-PG. It contains over 194,000 questions covering 2,400 healthcare topics, designed to rigorously test a model's breadth of medical knowledge and reasoning abilities across multiple subjects.
RULER
by Hsieh et al. / NVIDIA
RULER is a synthetic benchmark for evaluating large language models in long-context scenarios, scaling from 4K to 128K tokens. It assesses complex skills like multi-hop retrieval, aggregation, and coreference resolution, offering a more nuanced analysis than simple 'needle-in-a-haystack' tests.
AIME 2024
by MAA
A highly challenging benchmark for evaluating the mathematical reasoning of frontier AI models. It uses 30 problems from the 2024 American Invitational Mathematics Examination (AIME), which are designed to test creative problem-solving, multi-step deduction, and knowledge across number theory, geometry, algebra, and combinatorics.
BBQ (Bias Benchmark for QA)
by Parrish et al. / NYU
BBQ is a question-answering benchmark designed to expose social biases in language models. It uses ambiguous and disambiguated questions related to nine protected categories to measure a model's tendency to rely on harmful stereotypes when context is lacking versus its ability to answer correctly when enough information is provided.
LongBench
by Bai et al. / Tsinghua University
LongBench is a comprehensive bilingual benchmark designed to evaluate the long-context understanding capabilities of large language models in English and Chinese. It comprises 21 diverse tasks, including single and multi-document QA, summarization, and code completion, with an average context length of over 6,700 tokens to rigorously test model performance on extended inputs.
MusicCaps
by Agostinelli et al. / Google DeepMind
MusicCaps is a benchmark dataset of 5,521 music clips from AudioSet, each paired with a detailed text description written by professional musicians. It is primarily used for evaluating text-to-music generation models, as well as for music captioning, retrieval tasks, and fine-tuning audio-language models.
IFEval
by Google Research
Instruction-Following Evaluation benchmark testing models' ability to precisely follow verifiable formatting instructions. Includes constraints like word count limits, specific formatting requirements, keyword inclusion/exclusion, and structural rules that can be programmatically verified.
Chatbot Arena Hard
by LMSYS
Chatbot Arena Hard is a static benchmark composed of 500 challenging prompts curated from Chatbot Arena. It is designed to rigorously evaluate and differentiate the capabilities of large language models. The benchmark utilizes an automated judging system, typically employing a powerful model like GPT-4, to provide a quick, reproducible proxy for human preference.
HumanEval+
by Liu et al. / EvalPlus
HumanEval+ is a benchmark for rigorously evaluating code generation models. It augments the original HumanEval dataset by expanding the test suite for each of its 164 problems by 80x. This extensive testing helps uncover subtle bugs and failures on edge cases that simpler benchmarks miss, providing a more accurate measure of a model's true coding ability.
CyberSecEval
by Meta AI
CyberSecEval is a benchmark developed by Meta to assess the cybersecurity risks associated with Large Language Models (LLMs). It evaluates a model's propensity to generate insecure code, assist in exploiting vulnerabilities, and facilitate attacks, helping safety teams quantify the dual-use risk of code-capable models.
DocVQA
by CVC Barcelona
DocVQA is a large-scale dataset and benchmark for Visual Question Answering on document images. It challenges models to answer questions by reading and interpreting text, understanding layouts, and reasoning about information within complex documents like forms, invoices, and reports. It serves as a standard for evaluating document intelligence systems.
FinanceBench
by Islam et al. / Patronus AI
FinanceBench is a benchmark designed to evaluate the financial question-answering capabilities of Large Language Models. It uses publicly available corporate documents like 10-K filings and earnings reports to test models on information retrieval, numerical reasoning, and multi-step financial calculations, providing a standardized testbed for financial AI.
WebArena
by CMU
WebArena is a realistic and reproducible benchmark environment designed to evaluate autonomous language agents. It tests an agent's ability to perform complex, multi-step tasks across a diverse set of self-hosted websites, including e-commerce, forums, and content management systems, using real web interfaces.
XL-Sum
by Hasan et al. / University of Edinburgh
XL-Sum is a large-scale benchmark dataset for multilingual abstractive summarization. It contains 1.35 million article-summary pairs from BBC News across 44 languages, designed to evaluate a model's ability to generate concise summaries across diverse linguistic families and writing systems.
GAIA Benchmark
by Meta / Hugging Face
GAIA (General AI Assistants) is a benchmark for evaluating AI models on complex, real-world tasks. It features questions with unambiguous factual answers that require sophisticated capabilities like multi-step reasoning, web browsing, and tool use. GAIA is designed to test the practical limits of general-purpose AI assistants.
CrowS-Pairs
by Nangia et al. / NYU
CrowS-Pairs is a benchmark dataset for evaluating social bias in masked language models. It contains 1,508 sentence pairs with stereotypical and anti-stereotypical statements across nine bias types. The benchmark measures a model's preference for stereotypical completions using pseudo-log-likelihood scores.
MGSM
by Google Research
MGSM (Multilingual Grade School Math) is a benchmark for evaluating the mathematical reasoning of large language models across multiple languages. It consists of 250 grade-school math problems from the GSM8K dataset, professionally translated into ten typologically diverse languages, including low-resource ones like Swahili and Telugu.
AgentBoard
by Ma et al. / Shanghai AI Lab
AgentBoard is a comprehensive evaluation framework for Large Language Model (LLM) based agents. It assesses agent performance across nine diverse tasks, including embodied AI, gaming, web browsing, and tool use. The framework uniquely measures both final task success and partial progress through a fine-grained sub-goal metric.
ContractNLI
by Koreeda & Manning / Stanford NLP
ContractNLI is a dataset for natural language inference (NLI) focused on contract understanding. It challenges models to determine if a hypothesis about a contract is entailed, contradicted, or not mentioned by the contract text. This simulates real-world legal document review, testing a model's ability to reason over complex legal language.
SimpleQA
by OpenAI
SimpleQA is a benchmark dataset developed by OpenAI to assess the factual accuracy of language models. It consists of simple, unambiguous questions that have a single, verifiable correct answer. The benchmark is designed to measure a model's ability to recall factual knowledge and, crucially, to abstain from answering when it is uncertain, providing a measure of its calibration.
Humanity's Last Exam
by CAIS
Humanity's Last Exam is a crowdsourced benchmark designed to rigorously test the limits of advanced AI systems. It comprises extremely difficult questions contributed by domain experts across diverse fields like science, math, and philosophy, serving as a public evaluation for frontier model capabilities in complex reasoning and specialized knowledge.
WinoBias
by Zhao et al. / USC
WinoBias is a benchmark dataset designed to measure gender bias in coreference resolution systems. It consists of sentence pairs where pronouns refer to individuals in stereotyped or non-stereotyped occupations, allowing for the quantification of a model's reliance on gender stereotypes versus grammatical correctness.
InfiniteBench
by Zhang et al. / Peking University
InfiniteBench is a benchmark designed to evaluate the long-context capabilities of large language models. It features tasks that require processing and reasoning over inputs exceeding 100,000 tokens, including math, code debugging, and retrieval from novels, where crucial information is distributed across the entire context.
AgentBench
by Tsinghua University
Comprehensive benchmark evaluating LLM agents across 8 distinct environments including operating systems, databases, knowledge graphs, digital card games, lateral thinking puzzles, and web shopping. Tests generalization of agent capabilities across diverse interaction paradigms.
Minerva Math
by Google Research
Minerva Math is a quantitative reasoning benchmark designed to evaluate large language models on complex STEM problems. Sourced from web pages with LaTeX and arXiv preprints, it covers subjects like math, physics, and chemistry, requiring multi-step computation, symbolic manipulation, and deep scientific understanding to solve.
CaseHOLD
by Zheng et al. / Berkeley Law / LexGLUE
CaseHOLD is a legal NLP benchmark for evaluating a model's ability to identify the correct holding statement for a US court case. Given a citing context, the model must choose the correct holding from a list of candidates. Sourced from over 53,000 cases, it is a core component of the LexGLUE benchmark suite for legal AI.
API-Bank
by Li et al. / Wuhan University
API-Bank is a comprehensive benchmark for evaluating tool-augmented LLMs. It features 73 diverse APIs and assesses models on three levels: API retrieval, API calling, and complex planning. The benchmark measures both the correctness of tool selection and the accuracy of execution, providing a thorough test of an agent's capabilities.
MathVista
by UCLA
Mathematical reasoning benchmark requiring visual understanding of charts, plots, geometry diagrams, and infographics. Tests the intersection of visual perception and mathematical reasoning with 6,141 problems from 28 existing datasets and 3 newly collected ones.
Aider Polyglot
by Aider
Multi-language code editing benchmark testing models' ability to make targeted code changes across Python, JavaScript, TypeScript, Java, C++, and other languages. Evaluates real-world code modification tasks rather than generation from scratch.
MLAgentBench
by Huang et al. / Stanford
MLAgentBench challenges AI agents to perform machine learning research tasks autonomously — reading papers, writing code, running experiments, analyzing results, and improving models. It tests whether agents can replicate and build upon real ML research across 13 diverse ML tasks.
FrontierMath
by Epoch AI
Benchmark of original, research-level mathematics problems created by professional mathematicians. Tests capabilities at the frontier of mathematical reasoning including novel proofs, advanced computation, and multi-domain mathematical synthesis.
ClinicalCamel Benchmark
by Toma et al. / University of Toronto
ClinicalCamel Benchmark evaluates open-source language models on clinical dialogue and medical instruction-following tasks derived from physician–patient interactions. It focuses on safety, accuracy, and appropriateness of clinical advice generation.
Codeforces Benchmark
by Codeforces / Community
Evaluates models on competitive programming problems from the Codeforces platform across difficulty ratings. Tests algorithmic thinking, data structure knowledge, and the ability to produce correct and efficient solutions under competitive constraints.
TAU-bench
by Sierra AI
Tool-Agent-User benchmark evaluating AI agents on realistic customer service scenarios requiring multi-step tool use. Tests agents' ability to navigate complex workflows, use tools correctly, follow policies, and handle edge cases in airline and retail domains.
MLE-bench
by OpenAI
Benchmark evaluating AI agents on real Kaggle machine learning competitions. Tests the full ML engineering pipeline including data exploration, feature engineering, model selection, training, and submission formatting against actual competition leaderboards.
OSWorld
by University of Hong Kong
Benchmark for evaluating multimodal agents on real operating system tasks spanning Ubuntu, Windows, and macOS environments. Tests agents' ability to interact with desktop applications, file systems, terminals, and GUI elements to complete everyday computer tasks.
RealWorldQA
by xAI
Benchmark testing multimodal models on practical real-world visual understanding tasks. Features questions about real photographs requiring spatial reasoning, object recognition, scene understanding, and practical knowledge that goes beyond simple object detection.
EnergyBench
by Lannelongue et al. / EMBL-EBI
EnergyBench quantifies the energy consumption and carbon footprint of AI inference across hardware and software configurations. It correlates task accuracy with joules consumed, enabling practitioners to make informed accuracy-efficiency trade-offs for sustainable AI deployment.
GreenAI Benchmark
by Schwartz et al. / AI2 / University of Washington
GreenAI Benchmark evaluates the efficiency of AI training and inference by reporting accuracy alongside FLOPs, parameters, and CO2 emissions. It promotes the efficiency metric paradigm where reporting results without computational cost is considered incomplete science.
SWE-bench
by Princeton NLP
SWE-bench is a benchmark for evaluating AI systems' ability to resolve real GitHub issues from popular Python repositories. Each instance requires understanding a codebase, identifying the bug, and producing a correct patch. SWE-bench Verified is the curated subset accepted as the standard for coding agent evaluation by the AI industry.
MTEB
by Hugging Face / MTEB Team
MTEB (Massive Text Embedding Benchmark) is the standard benchmark for evaluating text embedding models across 8 task types (retrieval, clustering, classification, etc.) and 112 datasets. The MTEB leaderboard on Hugging Face is the primary reference for selecting embedding models and is updated continuously as new models are released.
MMLU
by UC Berkeley
MMLU (Massive Multitask Language Understanding) is a comprehensive benchmark covering 57 academic subjects from elementary to professional level, including STEM, law, medicine, and social sciences. It became the standard for measuring general knowledge breadth in LLMs and is included in virtually every model evaluation suite.
LiveBench
by LiveBench OSS
LiveBench is a contamination-resistant benchmark that continuously updates with new questions sourced from recent math competitions, research papers, and news. By using only data post-dating model training cutoffs, LiveBench mitigates benchmark saturation and provides more reliable capability assessments of frontier models.
HumanEval
by OpenAI
HumanEval is OpenAI's code generation benchmark consisting of 164 hand-written Python programming problems with unit tests. It measures a model's ability to generate syntactically correct and functionally complete code from docstring descriptions. HumanEval is the foundational coding benchmark that all subsequent code benchmarks build upon.
HELM
by Stanford CRFM
HELM (Holistic Evaluation of Language Models) from Stanford CRFM provides a multi-dimensional evaluation framework that measures LLMs across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. It evaluates models on 42 scenarios and 59 metrics, providing the most comprehensive public assessment of LLM capabilities and risks.
GPQA Diamond
by NYU / Cohere
GPQA Diamond (Graduate-Level Google-Proof Q&A) is a challenging multiple-choice benchmark requiring expert-level knowledge in biology, chemistry, and physics. Questions are designed to be answerable by domain PhD students but not by web search. GPQA Diamond is the standard for measuring frontier scientific reasoning capability.
Chatbot Arena
by LMSYS
Chatbot Arena is a crowdsourced human evaluation platform from LMSYS where users anonymously compare responses from two random LLMs and vote for the better one. The resulting Elo-based leaderboard (LMSYS Leaderboard) is widely regarded as the most reliable measure of real-world LLM preference across diverse user tasks.
ARC-AGI-2
by ARC Prize Foundation
ARC-AGI-2 is the second iteration of François Chollet's Abstraction and Reasoning Corpus benchmark, designed to measure fluid intelligence and generalization in AI systems. Tasks require identifying abstract visual patterns that cannot be solved by memorization, targeting a capability gap that separates current LLMs from human-level reasoning.
AIME 2025
by MAA / Community Eval
AIME (American Invitational Mathematics Examination) 2025 is used as a frontier math reasoning benchmark for LLMs. The competition-level math problems require multi-step reasoning without lookup, making AIME scores a direct indicator of a model's mathematical problem-solving depth. Frontier models are evaluated on the 2025 problem set to avoid training data contamination.
MATH-500
by
A 500-problem subset of the MATH benchmark spanning algebra through competition-level mathematics, widely used to evaluate step-by-step mathematical reasoning in frontier models.
Arena-Hard Auto
by
Automated benchmark derived from Chatbot Arena for evaluating instruction-following and open-ended generation.
AI2 Reasoning Challenge (ARC)
by Allen Institute for AI (AI2)
The AI2 Reasoning Challenge (ARC) is a question-answering dataset designed to encourage research in advanced question-answering. It consists of grade-school science questions specifically crafted to require reasoning beyond simple fact retrieval, posing a significant challenge for AI models.
ImageNet-1K
by ImageNet / Stanford Vision Lab
The canonical large-scale visual recognition benchmark containing 1.28 million training images across 1,000 object categories. ImageNet-1K underpins the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and has driven the majority of deep learning breakthroughs in computer vision since 2012.
COCO 2017
by Microsoft
Microsoft COCO (Common Objects in Context) 2017 provides 118K training images with 860K object instances annotated with bounding boxes, segmentation masks, keypoints, and captions across 80 object categories. It remains the primary benchmark for object detection and instance segmentation research.
Protein Data Bank
by RCSB PDB / wwPDB Consortium
The RCSB Protein Data Bank (PDB) is the single worldwide archive of experimentally determined 3D structures of proteins, nucleic acids, and complex assemblies, currently containing over 220,000 biological macromolecular structures determined by X-ray crystallography, NMR, and cryo-EM. It is the foundational structural dataset for computational biology and was used to train and validate AlphaFold2 and other structure-prediction models.
UniProt
by UniProt Consortium (EMBL-EBI / SIB / PIR)
UniProt (Universal Protein Resource) is the world's comprehensive, freely accessible protein sequence and functional information database, maintained by a consortium of EMBL-EBI, SIB, and PIR. It contains over 250 million protein sequences in UniParc, with 570,000+ manually reviewed entries in SwissProt providing expert-curated functional annotations, and serves as the gold-standard training source for protein language models.
MMLU Dataset
by UC Berkeley
Massive Multitask Language Understanding (MMLU) is a benchmark covering 57 academic subjects from STEM to the humanities, with 14,000+ multiple-choice questions at high-school, undergraduate, and professional levels. It has become the de facto standard for measuring broad world knowledge and academic reasoning in LLMs.
Wikipedia (Processed)
by Wikimedia Foundation / Hugging Face
The processed Wikipedia dataset is a cleaned and tokenized version of Wikipedia dumps covering 20+ languages, available via Hugging Face Datasets. With HTML stripped and paragraph structure preserved, it is one of the most universally used pretraining corpora and a standard knowledge-grounding source for retrieval-augmented generation (RAG) baselines and open-domain QA systems.
Wikipedia Dump
by Wikimedia Foundation
The full text dump of Wikipedia articles available in over 300 languages, regularly updated and distributed by the Wikimedia Foundation. It is one of the most universally included components in language model pretraining pipelines due to its high factual density, editorial quality, and broad topical coverage.
LibriSpeech
by OpenSLR / Johns Hopkins University
LibriSpeech is a corpus of approximately 1,000 hours of 16kHz read English speech derived from LibriVox audiobooks, with "clean" training subsets of 100 and 360 hours, a 500-hour "other" training subset, and dedicated development and test sets. It has become the de facto standard benchmark for English ASR systems.
GSM8K Dataset
by OpenAI
Grade School Math 8K is a dataset of 8,500 high-quality linguistically diverse grade school math word problems requiring 2-8 step reasoning. Created by OpenAI, GSM8K is widely used for evaluating multi-step arithmetic reasoning and the effectiveness of chain-of-thought prompting.
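GSM8K's reference solutions conventionally end with a `#### <answer>` line, so exact-match evaluation usually reduces to pulling out that final number. A minimal sketch (the function name is illustrative):

```python
import re

def extract_gsm8k_answer(solution):
    """GSM8K solutions end with a line like '#### 72'; extract that
    final number (commas stripped) for exact-match scoring."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", solution)
    return m.group(1).replace(",", "") if m else None

sample = "Natalia sold 48 clips, then half as many.\n48/2 = 24\n48+24 = 72\n#### 72"
# extract_gsm8k_answer(sample) returns "72"
```

A model's generated answer is normalized the same way before comparing against the extracted reference.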
PubChem
by NCBI / NIH
PubChem is the world's largest open chemical database maintained by the NCBI, containing information on over 115 million compounds, 295 million substances, and 270 million bioactivity outcomes from more than 1.2 million assays. It provides standardized molecular structures, properties, and biological activity data freely accessible via REST API and bulk download, making it the canonical resource for cheminformatics and drug discovery research.
GENIE Benchmark
by Stanford University
The GENIE Benchmark is a comprehensive dataset for evaluating the performance of text-to-SQL models. It includes a diverse set of SQL queries and corresponding natural language questions across multiple domains, designed to assess the generalization capabilities of these models.
HumanEval Dataset
by OpenAI
A curated set of 164 handwritten Python programming problems released by OpenAI, each consisting of a function signature, docstring, reference solution, and unit tests. HumanEval introduced the pass@k metric for functional code correctness evaluation and has become the de facto standard benchmark reported in virtually every code generation model paper.
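The pass@k metric is typically computed with the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), the chance
    that at least one of k samples drawn from n generations
    (c of them correct) passes the unit tests."""
    if n - c < k:  # fewer than k incorrect samples: a draw must include a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 generations, 50 passing, k=1 → 50/200 = 0.25
```

Averaging this quantity over all 164 problems yields the pass@k figure reported in model papers.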
MIMIC-IV
by MIT Laboratory for Computational Physiology / Beth Israel Deaconess Medical Center
MIMIC-IV (Medical Information Mart for Intensive Care) is a comprehensive de-identified electronic health record database covering over 300,000 patients admitted to Beth Israel Deaconess Medical Center's ICU between 2008 and 2019. It contains detailed clinical data including diagnoses, procedures, medications, laboratory values, and waveforms, enabling a wide range of clinical AI research.
MATH Dataset
by UC Berkeley
A challenging benchmark of 12,500 competition mathematics problems from AMC, AIME, and similar competitions across 5 difficulty levels and 7 subjects. Each problem includes a full step-by-step solution in LaTeX, making it suitable for both evaluation and training of mathematical reasoning.
SA-1B (Segment Anything)
by Meta AI
SA-1B is Meta AI's massive segmentation dataset released alongside the Segment Anything Model (SAM), containing over 1 billion high-quality segmentation masks across 11 million diverse, high-resolution images. It is the largest segmentation dataset ever created and enables training of generalist vision models with strong zero-shot transfer capabilities.
HellaSwag Dataset
by University of Washington
HellaSwag is an adversarially filtered commonsense NLI benchmark where models must pick the most plausible sentence completion from 4 options. Humans score 95%+ while early LLMs struggled below 50%, making it a robust test of grounded language understanding and commonsense reasoning.
Common Crawl
by Common Crawl Foundation
The world's largest open repository of web crawl data, maintained by the non-profit Common Crawl Foundation and updated with new crawls monthly since 2011. It forms the foundational raw data layer for virtually every major language model pretraining pipeline including GPT-3, LLaMA, PaLM, and Falcon, typically after quality filtering and deduplication steps.
ARC Dataset
by Allen Institute for AI
The AI2 Reasoning Challenge (ARC) dataset contains 7,787 grade 3–9 science exam questions split into Easy and Challenge partitions. The Challenge set contains questions that require deeper reasoning and world knowledge, making it a reliable signal for advanced language understanding.
Open Images V7
by Google
Google's Open Images V7 is one of the largest existing datasets with object-level annotations, containing approximately 9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives across 600+ object classes.
TruthfulQA Dataset
by University of Oxford
TruthfulQA measures the truthfulness of LLMs across 817 adversarially crafted questions spanning 38 categories where humans are commonly misled by false beliefs. Models are scored on generating truthful AND informative answers, revealing how larger models can paradoxically become more confidently wrong.
Stack Exchange Dump
by Stack Exchange
The Stack Exchange Data Dump is a quarterly XML export of all public questions, answers, comments, and votes across the entire Stack Exchange network of 170+ Q&A communities including Stack Overflow. Containing hundreds of millions of high-quality technical and domain-specific Q&A pairs, it is a critical pretraining source for code and reasoning capabilities and a standard retrieval benchmark for dense passage retrieval.
SuperGLUE
by New York University
SuperGLUE is a benchmark suite of 8 challenging NLU tasks including question answering, coreference resolution, causal reasoning, and word-sense disambiguation, designed as a harder successor to GLUE. It includes human baselines and has driven significant progress in pre-trained language model capabilities.
LAION-5B
by LAION
The largest openly available image-text pair dataset, containing 5.85 billion CLIP-filtered image-text pairs across English, multilingual, and aesthetic subsets. LAION-5B was the primary training corpus for Stable Diffusion, DALL-E 2 replications, and numerous open vision-language models, enabling the open-source community to train competitive text-to-image generation models.
ADE20K Dataset
by MIT CSAIL
ADE20K is a densely annotated semantic segmentation dataset containing over 27,000 images with pixel-level annotations for 150 semantic categories covering both indoor and outdoor scenes. It is the primary benchmark for scene parsing and semantic segmentation tasks in the computer vision community.
WinoGrande Dataset
by Allen Institute for AI
WinoGrande is a large-scale crowdsourced dataset of 44,000 Winograd-style fill-in-the-blank commonsense problems, debiased using the AFLITE algorithm to minimize spurious statistical cues. It is significantly harder than the original Winograd Schema Challenge for contemporary NLP models.
AudioSet
by Google
Google's AudioSet is a large-scale dataset of manually annotated audio events comprising over 2 million 10-second YouTube clips labeled with a hierarchical ontology of 632 audio event classes. It is the primary benchmark for audio tagging and sound event detection, spanning music, speech, and environmental sounds.
CheXpert
by Stanford ML Group
CheXpert is a large chest X-ray dataset from Stanford containing 224,316 chest radiographs from 65,240 patients with labels for 14 observations mined from radiology reports using an automated labeler. It uniquely addresses label uncertainty with positive, negative, and uncertain labels, making it a challenging and realistic benchmark for automated chest X-ray interpretation.
PubMed Central OA
by National Institutes of Health / National Library of Medicine
PubMed Central Open Access (PMC OA) is the subset of the PMC literature archive made freely available for text mining and NLP research, containing over 4 million full-text biomedical and life science articles. It is a primary corpus for pretraining biomedical language models such as BioBERT, PubMedBERT, and BioGPT.
VoxCeleb2
by Oxford Visual Geometry Group (VGG)
VoxCeleb2 is a large-scale speaker recognition dataset containing over 1 million utterances from 6,112 celebrities extracted from YouTube videos in challenging real-world conditions. It is the standard benchmark for speaker verification and diarization research, providing naturalistic conversational speech at scale.
FLORES-200 Dataset
by Meta AI
FLORES-200 is Meta's multilingual translation evaluation benchmark spanning 200 languages, including many low-resource and endangered ones. Each language contains the same set of sentences translated from English Wikipedia, released in public dev and devtest splits of roughly 1,000 sentences each for systematic MT evaluation at scale.
Alpaca Dataset
by Stanford University
A dataset of 52,000 instruction-following examples generated by applying the self-instruct technique to GPT-3.5 (text-davinci-003). This foundational dataset enabled the creation of the Alpaca 7B model and popularized cost-effective instruction-tuning approaches.
Common Voice 15
by Mozilla
Mozilla's Common Voice 15.0 is the world's largest publicly available multilingual speech corpus, containing over 30,000 hours of validated speech data across 114 languages, all contributed and validated by volunteers. It enables training and evaluation of multilingual and low-resource speech recognition systems.
SEC-EDGAR Filings
by U.S. Securities and Exchange Commission
The SEC-EDGAR Filings dataset encompasses over 20 million full-text regulatory filings submitted to the US Securities and Exchange Commission since 1993, including 10-K annual reports, 10-Q quarterly reports, 8-K current reports, and proxy statements from all US public companies. It is the foundational corpus for financial NLP research, sentiment analysis, and financial document AI.
MBPP (Mostly Basic Python Problems)
by Google
A dataset of 974 crowd-sourced Python programming problems suitable for entry-level programmers, each with a problem description, code solution, and three automated test cases. MBPP complements HumanEval by covering a broader variety of programming concepts and is widely used alongside it for comprehensive evaluation of code generation capabilities across model families.
The Stack v2
by BigCode
An expanded code pretraining dataset containing 3 trillion tokens of source code in 619 programming languages, curated by BigCode from GitHub repositories with permissive SPDX licenses. Version 2 triples the size of the original Stack and includes improved deduplication, opt-out mechanisms for authors, and structured data from GitHub issues and pull requests alongside raw source files.
CelebA-HQ
by NVIDIA / CUHK
CelebA-HQ is a high-quality version of the CelebA face dataset containing 30,000 celebrity images at 1024×1024 resolution with 40 binary attribute annotations. It was introduced alongside Progressive GAN and has become the standard benchmark for high-fidelity face generation and synthesis research.
ArXiv Papers Dataset
by Cornell University / arXiv
The ArXiv Papers Dataset is a bulk export of over 2.3 million scientific preprints from arXiv spanning physics, mathematics, computer science, biology, finance, and economics, provided by Cornell University and hosted on Kaggle and AWS S3. The full-text LaTeX source and parsed metadata make it a primary pretraining corpus for scientific language models and citation-network research.
OpenAssistant Conversations
by LAION
A large-scale, human-annotated dataset of assistant-style conversations collected through the OpenAssistant crowdsourcing platform. Contains over 161,000 messages across 66,000+ conversation trees, with ranked responses for RLHF training.
mC4
by Google
The multilingual Colossal Clean Crawled Corpus (mC4) spans 101 languages and contains hundreds of billions of tokens scraped from Common Crawl with language detection and heuristic quality filters. It was used to train mT5 and is one of the largest publicly available multilingual pre-training corpora.
Semantic Scholar ORC
by Allen Institute for AI (AI2)
The Semantic Scholar Open Research Corpus (S2ORC) is a large English-language corpus of 136 million academic papers with structured metadata, abstracts, citation graphs, and full-text body paragraphs where licensing allows. Maintained by the Allen Institute for AI, it covers 19 scientific fields and is widely used for scientific NLP tasks including citation prediction, claim verification, and scientific QA.
BookCorpus
by University of Toronto
A dataset of over 11,000 unpublished books spanning fiction and non-fiction genres, originally scraped from Smashwords and used as the primary pretraining corpus for BERT alongside Wikipedia. It provides rich long-range dependency data that helps models learn coherent narrative structure and extended discourse patterns.
Places365
by MIT CSAIL
Places365 is a scene-centric database with 1.8 million training images across 365 scene categories, designed to train and evaluate scene recognition models. The dataset enables models to understand the semantic meaning of places and environments, making it ideal for applications in autonomous driving, robotics, and image retrieval.
CodeSearchNet
by GitHub / Microsoft Research
A dataset and benchmark challenge for code retrieval and search containing 2 million (code, documentation) pairs in six programming languages — Python, Java, JavaScript, PHP, Ruby, and Go — curated by GitHub and Microsoft Research. It is the canonical benchmark for code-to-natural-language and natural-language-to-code retrieval tasks and is widely used to evaluate code embedding models.
APPS (Automated Programming Progress Standard)
by UC Berkeley
A benchmark of 10,000 programming problems at introductory, interview, and competitive programming difficulty levels, each with problem statements, test cases, and human-written solutions. APPS is the standard dataset for evaluating code generation models on realistic programming tasks ranging from simple loops to complex algorithmic challenges drawn from competitive programming platforms.
UltraFeedback
by Tsinghua University
A large-scale, high-quality preference dataset with 64,000 instructions each answered by 4 LLMs and rated by GPT-4 on instruction-following, truthfulness, honesty, and helpfulness. UltraFeedback is the backbone of the Zephyr and Tulu 2 DPO models.
Financial PhraseBank
by Pekka Malo et al. / Aalto University
Financial PhraseBank is a sentiment analysis dataset containing 4,845 sentences from English-language financial news annotated by 16 financial domain experts with positive, negative, or neutral sentiment labels. It is the most widely used benchmark for financial sentiment analysis and has been used to fine-tune FinBERT and numerous other financial NLP models.
Self-Instruct
by University of Washington
Self-Instruct is the foundational instruction-tuning dataset and methodology introduced by Wang et al. (2022), where 175 human-written seed tasks are iteratively expanded into 52,000 instruction-input-output triplets using GPT-3 as the generator. It established the paradigm of bootstrapping instruction data from existing LLMs and directly inspired Alpaca, WizardLM, and most subsequent synthetic alignment datasets.
StarCoderData
by BigCode
The roughly 250-billion-token (about 800 GB) code dataset used to pretrain the StarCoder family of models, assembled by BigCode from The Stack v1 and spanning 86 programming languages with permissive licenses. It includes GitHub issues, Git commits, and Jupyter notebook data alongside source files, enabling models to learn from developer workflows and not just static code.
DM Mathematics
by Google DeepMind
DeepMind Mathematics (DM Mathematics) is a dataset of 2 million mathematical question-answer pairs covering algebra, arithmetic, calculus, comparisons, measurement, numbers, polynomials, and probability, procedurally generated to test mathematical reasoning capabilities of language models. The symbolic and step-structured nature of the dataset makes it a standard benchmark for evaluating compositional generalization and multi-step arithmetic reasoning.
LIMA
by Meta AI
LIMA (Less Is More for Alignment) is a carefully curated dataset of 1,000 high-quality instruction-response pairs demonstrating that alignment quality matters more than quantity. Sourced from Stack Exchange, wikiHow, and manually written prompts, it produced models whose responses human raters judged equivalent or preferable to GPT-4's in a substantial fraction of cases.
OpenHermes 2.5
by Nous Research
A large curated synthetic instruction dataset with ~1 million entries sourced from multiple high-quality open datasets including Airoboros, Camel, GPT4-LLM, and others. OpenHermes 2.5 powers the Nous Hermes model family and is widely regarded as one of the best open instruction datasets.
OASST2
by LAION / OpenAssistant
OpenAssistant Conversations 2 (OASST2) is a crowd-sourced human-annotated dataset of 100,000+ assistant-style conversations in 35 languages, where human contributors created and ranked message trees to produce preference labels for RLHF training. It is the largest open multilingual human-feedback dataset and is widely used for training preference models and reward functions in open-source alignment pipelines.
NLLB Training Data
by Meta AI
The No Language Left Behind (NLLB) training corpus released by Meta AI contains high-quality parallel data across 200+ language pairs, including newly mined bitext for dozens of low-resource languages. It was used to train the NLLB-200 model achieving state-of-the-art translation on low-resource language pairs.
ShareGPT
by Community
A community-collected dataset of real ChatGPT and GPT-4 conversation logs shared by users, covering a broad range of tasks and domains. Available in multiple filtered and cleaned versions including ShareGPT52K and ShareGPT90K used by Vicuna and other open models.
LSUN
by Princeton / Columbia University
The Large-Scale Scene Understanding (LSUN) dataset is a massive collection of nearly one million labeled images for each of 10 scene and 20 object categories. It is a key benchmark for advancing research in scene understanding, particularly for generative modeling, classification, and reconstruction tasks.
Dolly-15K
by Databricks
Dolly-15K is a high-quality, open-source dataset of 15,000 instruction-following records written by Databricks employees. It is designed for fine-tuning large language models to exhibit ChatGPT-style instruction-following capabilities using a relatively small, targeted dataset.
OPUS-100
by University of Helsinki
OPUS-100 is a large-scale multilingual parallel corpus for machine translation, featuring 100 languages pivoted through English. Sampled from the OPUS collection, it provides up to 1 million sentence pairs per language pair, making it a standard benchmark for training and evaluating multilingual models.
Phi-1 TextBooks
by Microsoft
Phi-1 TextBooks is a synthetic dataset of Python coding textbooks and exercises generated by GPT-3.5 and GPT-4. It was created to pretrain Microsoft's Phi-1 small language model, demonstrating that high-quality, curriculum-style data can significantly boost the coding abilities of smaller models compared to training on general web data.
GigaSpeech
by Seasalt.ai / SpeechColab
GigaSpeech is a multi-domain English speech corpus with 10,000 hours of high-quality labeled audio for ASR, sourced from audiobooks, podcasts, and YouTube across a broad range of topics and recording conditions. Its scale and diversity make it particularly valuable for training robust, domain-generalizable speech recognition models.
WizardLM Evol-Instruct
by Microsoft Research
WizardLM Evol-Instruct is a synthetic dataset created by Microsoft Research for fine-tuning large language models. It uses an LLM-based evolutionary process to iteratively rewrite and complicate a seed set of instructions, progressively increasing their complexity and diversity. The dataset is designed to enhance a model's ability to follow intricate, multi-step commands across various domains like coding, math, and reasoning.
TyDi QA Dataset
by Google Research
TyDi QA is a benchmark for question answering across 11 typologically diverse languages. It features information-seeking questions written by native speakers who have not seen the answer, ensuring real-world applicability. This design challenges models to generalize beyond high-resource, typologically similar languages.
DataComp-1B
by DataComp Consortium
A curated 1.4 billion image-text pair dataset produced through the DataComp benchmark competition, which challenged participants to filter a 12.8 billion pair candidate pool to produce the best downstream CLIP model. DataComp-1B represents the winning filtering strategy and achieves state-of-the-art zero-shot classification performance among datasets of its size.
OpenWebText
by Aaron Gokaslan & Vanya Cohen
OpenWebText is a large-scale, open-source English text corpus created by scraping web pages linked from Reddit. Designed as a public replication of OpenAI's original WebText dataset used for GPT-2, it contains approximately 38 GB of text filtered by Reddit upvotes to ensure a baseline of quality and relevance.
LAION-400M Text Captions
by LAION
The text caption component of the LAION-400M dataset, offering 400 million English alt-text captions. These captions were scraped from the web and filtered using CLIP to ensure a minimum similarity to their corresponding images. The text is used independently for large-scale NLP and multimodal research.
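The CLIP filtering step amounts to keeping only pairs whose image and text embeddings exceed a cosine-similarity cutoff (LAION-400M used 0.3). A toy sketch with plain-list "embeddings" (the vectors and function names here are illustrative, not LAION's actual pipeline code):

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(t * t for t in x) ** 0.5
    return dot / (norm(u) * norm(v))

def clip_filter(pairs, threshold=0.3):
    """Keep captions whose (image_emb, text_emb) cosine similarity
    clears the threshold, mimicking LAION's CLIP-score filter."""
    return [caption for img_emb, txt_emb, caption in pairs
            if cosine(img_emb, txt_emb) >= threshold]
```

In the real pipeline the embeddings come from a pretrained CLIP image and text encoder; pairs below the cutoff are treated as mismatched alt-text and discarded.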
BioASQ Dataset
by BioASQ Consortium
The BioASQ dataset is a benchmark for biomedical semantic indexing and question answering. It contains thousands of expert-annotated questions (factoid, list, yes/no, summary) paired with relevant PubMed articles, concepts, and ideal answers, designed to train and evaluate advanced NLP systems in the medical domain.
Open X-Embodiment
by Google DeepMind / Consortium
Open X-Embodiment (OXE) is a massive robotics dataset combining over 1 million demonstration episodes from 22 distinct robot embodiments. It covers 527 skills and is designed to train generalist robot policies that can transfer skills across diverse hardware, serving as a key resource for vision-language-action models.
Legal-BERT Training Data
by Ilias Chalkidis et al. / Athens University of Economics and Business (AUEB)
The Legal-BERT training corpus is a large collection of English legal text assembled from UK legislation, EU legislation, ECHR/ECLI court decisions, and US contracts specifically curated to pretrain domain-adapted BERT models. It has enabled a family of Legal-BERT models that significantly outperform general-domain language models on legal NLP tasks.
GenLaw: A Legal Reasoning Dataset
by Stanford Center for Legal Informatics
GenLaw is a comprehensive dataset designed for evaluating legal reasoning capabilities of large language models. It contains a diverse set of legal questions, case summaries, and relevant statutes, enabling researchers to assess a model's ability to understand and apply legal principles.
SlimPajama
by Cerebras
SlimPajama is a cleaned and deduplicated version of the RedPajama dataset, containing 627 billion high-quality tokens. Produced by Cerebras, it demonstrates that training on fewer, higher-quality tokens can match or exceed the performance of models trained on larger, noisier datasets.
EU Court Decisions
by European Court of Human Rights / CJEU
The EU Court Decisions dataset aggregates judgments from the European Court of Human Rights (ECHR) and the Court of Justice of the European Union (CJEU), covering tens of thousands of decisions in multiple EU languages with structured metadata. It is widely used for multilingual legal NLP research, legal judgment prediction, and cross-lingual information retrieval.
Evol-CodeAlpaca
by Microsoft Research
Evol-CodeAlpaca is a dataset of 110,000 instruction-solution pairs for code generation, created by applying the EvolInstruct method to Code Alpaca seeds. Using GPT-4, it progressively increases the complexity and diversity of programming problems, serving as the primary training data for the WizardCoder models.
ShareGPT4V
by Shanghai AI Lab
ShareGPT4V is a large-scale, high-quality dataset containing 100,000 image-text pairs generated by GPT-4V. It is specifically designed for the instruction-tuning of open-source large vision-language models (LVLMs). The dataset's detailed captions and conversational QA pairs significantly enhance a model's ability to perform complex scene understanding, OCR, and visual reasoning.
FinQA Dataset
by Zhiyu Chen et al. / University of California Santa Barbara
FinQA is a large-scale dataset for numerical reasoning over financial data, containing over 8,000 question-answer pairs from S&P 500 earnings reports. Each question requires multi-step reasoning across both unstructured text and structured tables, making it a challenging benchmark for financial AI systems.
Cosmopedia
by Hugging Face
Cosmopedia is a massive synthetic dataset containing 30 million documents styled as textbooks, blog posts, and articles. Generated by Mixtral-8x7B-Instruct, it provides a vast corpus of high-quality synthetic English educational content designed for pretraining large language models at scale.
CC12M (Conceptual 12M)
by Google
CC12M is a large-scale dataset by Google containing 12 million image-text pairs from the web. It was created with a less restrictive filtering process than its predecessor, CC3M, to achieve greater scale and diversity, making it a widely used resource for pretraining CLIP-style vision-language models.
XL-Sum Dataset
by BUET (Bangladesh University of Engineering and Technology)
XL-Sum is a massive multilingual dataset for abstractive summarization. It consists of over 1 million article-summary pairs scraped from BBC News, covering 44 different languages. This diversity makes it a crucial resource for developing and evaluating cross-lingual and multilingual summarization models.
CaseText Corpus
by Casetext (acquired by Thomson Reuters)
The CaseText Corpus is a large-scale dataset of US federal and state court decisions. It includes full text, structured metadata, and citation networks, designed for legal research and the development of AI applications like legal language models and case retrieval systems, spanning decades of US jurisprudence.
RLBench
by Dyson Robotics Lab / Imperial College London
RLBench is a large-scale robot learning benchmark and dataset built on the CoppeliaSim simulator, providing 100 unique manipulation tasks with demonstrations, observations, and reward functions. It offers RGB, depth, and point-cloud observations for a Franka Panda arm across diverse household tasks, widely used for evaluating imitation learning, reinforcement learning, and multi-task robot policies.
OpenMathInstruct
by NVIDIA
OpenMathInstruct is a large-scale, synthetic dataset by NVIDIA featuring 1.8M+ math problem-solution pairs. Generated by Mixtral models and verified for correctness, it provides reliable, step-by-step reasoning chains for training and fine-tuning language models on diverse mathematical topics, from arithmetic to competition math.
GitHub Code Dataset
by Hugging Face / BigCode
The GitHub Code Dataset is a massive, multilingual collection of source code from public GitHub repositories, spanning 32 programming languages. Distributed via Hugging Face under the BigCode project, it provides a foundational resource for pretraining large language models on diverse code-related tasks, from generation to analysis.
CC-News
by Common Crawl Foundation
CC-News is a large-scale dataset of over 700,000 English news articles from the Common Crawl archive, collected between 2016 and 2019. It serves as a key pretraining corpus, notably for the RoBERTa model, providing a rich source of journalistic text for developing models that understand news language and current events.
MusicNet
by University of Washington
MusicNet is a collection of 330 freely licensed classical music recordings with over 1 million annotated labels indicating the precise timing and identity of every musical note in each recording. It supports supervised learning for music transcription, instrument recognition, and music information retrieval tasks.
CulturaX
by University of Oregon
CulturaX is a massive, cleaned multilingual text corpus containing 6.3 trillion tokens across 167 languages. It was created by combining, deduplicating, and filtering the mC4 and OSCAR datasets using language model-based quality scoring. This makes it one of the largest and cleanest public datasets for pre-training large language models.
Tulu V2 Mix
by Allen Institute for AI (AI2)
Tulu V2 Mix is a curated 326,000-sample mixture of instruction-tuning datasets from AI2. It blends diverse sources like FLAN, Open Assistant, and Code Alpaca to train the Tulu 2 model family. The dataset serves as a benchmark for analyzing the impact of different data sources on model performance and quality.
MedNLI
by University of Massachusetts / Partners Healthcare
MedNLI is a benchmark dataset for Natural Language Inference (NLI) in the clinical domain. Derived from the MIMIC-III database, it contains over 14,000 sentence pairs from clinical notes, each annotated by a clinician as representing entailment, contradiction, or a neutral relationship, enabling the evaluation of clinical text reasoning.
WebVid-10M
by University of Oxford
WebVid-10M is a massive dataset containing over 10 million video clips paired with descriptive text captions. Scraped from stock video websites, it serves as a foundational pretraining corpus for state-of-the-art video-language models, facilitating research in video understanding, retrieval, and generation.
PushShift Reddit Dataset
by PushShift.io
A massive, multi-billion token archive of Reddit comments and submissions from 2005 to 2023, collected by the PushShift project. This dataset is a cornerstone for social NLP research, large-scale language model pre-training, and studying the dynamics of online communities and conversational discourse.
Nectar
by UC Berkeley
Nectar is a large-scale, high-quality preference dataset from Berkeley AI Research (BAIR). It contains 183,000 prompts, each with seven ranked responses from diverse models like GPT-4, ChatGPT, and open-source LLMs. It is designed for training robust reward models for RLHF and DPO.
RoboNet
by Berkeley AI Research (BAIR)
RoboNet is a large-scale dataset for robot learning, featuring 15 million video frames from diverse robot arms across multiple labs. It is designed to train and benchmark self-supervised visual models, aiming to achieve generalization across different robot morphologies and workspaces without task-specific labels.
Orca DPO Pairs
by Intel Labs / Community
Orca DPO Pairs is a synthetic dataset containing 12,000 instruction-following examples. Each example includes a prompt, a high-quality response from GPT-4 (chosen), and a lower-quality response from GPT-3.5 (rejected). It is designed for efficiently aligning language models using Direct Preference Optimization (DPO) without a reward model.
CALVIN
by Albert-Ludwigs-Universität Freiburg
CALVIN is a large-scale dataset and benchmark for long-horizon, language-conditioned robot manipulation. It features over 24 hours of teleoperated demonstration data in a tabletop environment, encompassing 34 distinct skills that can be composed to solve complex, multi-step tasks from natural language instructions.
Deita 6K
by HKUST / Community
Deita 6K is an ultra-compact, high-quality instruction-tuning dataset of 6,000 carefully selected samples produced by the Data-Efficient Instruction Tuning for Alignment (DEITA) framework, which scores and filters instruction data by complexity and quality using LLM judges. Despite its small size, models trained on Deita 6K match or outperform those trained on datasets 10-100x larger, demonstrating the power of principled data selection over scale.
CAMEL-AI Datasets
by CAMEL-AI
The CAMEL-AI Datasets are a collection of synthetic multi-agent conversation datasets generated through the Communicative Agents framework, where AI assistants and user agents collaborate via role-playing to solve tasks. The collection covers coding, math, science, and open-ended reasoning domains, providing diverse instruction-following dialogues useful for SFT and alignment research.
CodeParrot GitHub Code
by Hugging Face
A 50 GB dataset of Python code scraped from GitHub, originally created to train the CodeParrot model as a demonstration of code-focused language model pretraining. It filters repositories for Python files only and applies basic deduplication, making it a lightweight starting point for Python-specific code generation research and experimentation.
Capybara
by Argilla / LDJnr
Capybara is a high-quality instruction-tuning dataset of 15,000 diverse, long-form single- and multi-turn conversations synthesized to cover a wide range of topics and response styles. It emphasizes narrative quality and conceptual depth over simple factual responses, making it particularly effective for improving chat model coherence, fluency, and reasoning on open-ended tasks.
Genstruct
by NousResearch
Genstruct is a synthetic instruction dataset generated by the Genstruct-7B model, which converts raw documents into structured instruction-response pairs. Unlike typical self-instruct approaches, Genstruct grounds every instruction in a source document, ensuring factual consistency and enabling controllable synthetic data generation from any text corpus.
UltraChat
by Tsinghua University
1.5M high-quality multi-turn dialogue dataset for instruction fine-tuning.
The Pile
by EleutherAI
825GB diverse English pretraining corpus from 22 high-quality data sources.
SWE-bench
by Princeton NLP
2.3K real GitHub issues requiring AI agents to write and verify code fixes.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
by Google AI
Introduced BERT, a bidirectional Transformer pre-trained on masked language modeling and next sentence prediction. Established the pretrain-then-fine-tune paradigm that dominated NLP for years and achieved state-of-the-art on 11 NLP benchmarks.
Learning Transferable Visual Models From Natural Language Supervision (CLIP)
by OpenAI
Introduced CLIP (Contrastive Language-Image Pre-training), a model trained on 400 million image-text pairs using contrastive learning that achieves remarkable zero-shot transfer to diverse vision tasks. CLIP became foundational for vision-language alignment and generative AI pipelines.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
by Google Brain
Introduced chain-of-thought prompting, a simple technique of providing exemplars with step-by-step reasoning traces in few-shot prompts. This approach dramatically improves LLM performance on arithmetic, commonsense, and symbolic reasoning tasks, with the effect emerging at approximately 100B parameters.
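In practice the technique is nothing more than prepending worked examples to the prompt. The sketch below paraphrases the paper's canonical arithmetic exemplar; the surrounding variable names are illustrative:

```python
# Chain-of-thought few-shot prompting: the exemplar's answer includes an
# explicit reasoning trace, which the model imitates for the new question.
# (Exemplar paraphrased from the paper's well-known arithmetic example.)
exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)
question = (
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"
)
prompt = exemplar + question  # send `prompt` to any few-shot-capable LLM
```

Without the reasoning trace in the exemplar, the same prompt reduces to ordinary few-shot prompting; the trace is the entire intervention.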
High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)
by CompVis / Stability AI
Introduced Latent Diffusion Models (LDMs), which perform the diffusion process in a compressed latent space rather than pixel space, dramatically reducing computational cost while maintaining image quality. This work underpins Stable Diffusion, the most widely used open-source image generation model.
Language Models are Few-Shot Learners (GPT-3)
by OpenAI
Introduced GPT-3, a 175B parameter language model demonstrating remarkable few-shot learning capabilities across diverse tasks. Showed that scaling model size dramatically improves in-context learning without gradient updates, reshaping the field.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
by Google Brain
Introduced the Vision Transformer (ViT), demonstrating that a pure transformer applied directly to sequences of image patches achieves state-of-the-art performance on image classification when pretrained on large datasets. The paper challenged the dominance of convolutional neural networks in computer vision.
Training Language Models to Follow Instructions with Human Feedback
by OpenAI
Presents InstructGPT, which uses Reinforcement Learning from Human Feedback (RLHF) to align GPT-3 with human intent. By fine-tuning on human demonstrations and training a reward model on human preference comparisons, InstructGPT produces outputs that human evaluators prefer to GPT-3 outputs despite having 100× fewer parameters.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
by Facebook AI Research
Introduces Retrieval-Augmented Generation (RAG), combining parametric memory (language model weights) with non-parametric memory (dense retrieval over Wikipedia) for knowledge-intensive NLP tasks. RAG models achieve state-of-the-art on open-domain QA benchmarks and produce more specific, factual, and diverse responses than pure parametric models.
Proximal Policy Optimization Algorithms
by OpenAI
PPO introduces a clipped surrogate objective that constrains policy update step sizes, achieving the stability of trust-region methods (TRPO) with the simplicity and scalability of first-order optimizers. It quickly became the dominant RL algorithm for training large language models with human feedback.
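The clipped surrogate objective itself is only a few lines. This is a sketch of the per-batch objective, not a full training loop; the sample values are illustrative:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective L^CLIP from the PPO paper.

    ratio: pi_new(a|s) / pi_old(a|s) per sample; advantage: estimated A_t.
    Taking the elementwise min of the unclipped and clipped terms removes
    any incentive to push the ratio outside [1 - eps, 1 + eps].
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

# A large positive-advantage update is capped at (1 + eps) * A:
print(ppo_clip_objective(np.array([1.5]), np.array([1.0])))  # 1.2
```

The clip acts as a cheap stand-in for TRPO's trust region: gradients vanish once the ratio leaves the clip range in the direction the advantage favors.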
Highly Accurate Protein Structure Prediction with AlphaFold
by DeepMind
AlphaFold 2 achieves atomic-level accuracy in protein structure prediction by combining evolutionary information from multiple sequence alignments with a novel Evoformer architecture and structure module, solving a 50-year grand challenge in biology. Its predictions have been released for virtually all known proteins and have accelerated drug discovery, enzyme design, and structural biology worldwide.
GPT-4 Technical Report
by OpenAI
Technical report for GPT-4, OpenAI's multimodal large language model accepting image and text inputs. Demonstrates state-of-the-art performance on academic and professional benchmarks, including passing a simulated bar exam with a score in the top 10% of test takers.
Segment Anything
by Meta AI
Introduced the Segment Anything Model (SAM) and the SA-1B dataset of 1 billion masks on 11 million images. SAM is a promptable segmentation foundation model that generalizes to new image distributions and tasks without additional training, enabling a new paradigm of interactive segmentation.
Evaluating Large Language Models Trained on Code (Codex)
by OpenAI
Introduced Codex, a GPT language model fine-tuned on publicly available code from GitHub, and the HumanEval benchmark for measuring code synthesis from docstrings. Codex powers GitHub Copilot and represents a breakthrough in automated programming assistance.
ReAct: Synergizing Reasoning and Acting in Language Models
by Google / Princeton
Introduces ReAct, a paradigm that combines reasoning traces and task-specific actions in language models. By interleaving thinking steps with tool calls, ReAct agents outperform chain-of-thought and act-only baselines on diverse tasks including question answering, fact verification, and interactive decision-making.
LoRA: Low-Rank Adaptation of Large Language Models
by Microsoft Research
Introduces LoRA, which freezes pretrained model weights and injects trainable low-rank decomposition matrices into Transformer layers. Reduces trainable parameters by 10,000× and GPU memory by 3× with no inference latency overhead, enabling efficient LLM fine-tuning.
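The low-rank update can be sketched in a few lines of NumPy (shapes, rank, and the alpha scaling below are illustrative hyperparameters, not the paper's specific settings):

```python
import numpy as np

d, r = 1024, 8  # hidden size and LoRA rank (illustrative values)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init
alpha = 16                           # scaling hyperparameter

def lora_forward(x):
    # h = W x + (alpha / r) * B A x; only A and B receive gradients.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# Zero-initializing B means the adapted model starts identical to the base:
assert np.allclose(lora_forward(x), W @ x)
print(2 * d * r / (d * d))  # trainable fraction of this layer: 0.015625
```

Because B A can be merged into W after training, the adapted layer is a single matmul at inference time, which is where the zero-latency-overhead claim comes from.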
LLaMA: Open and Efficient Foundation Language Models
by Meta AI
Introduces LLaMA, a collection of foundation language models ranging from 7B to 65B parameters, trained on publicly available datasets. Showed that smaller models trained on more tokens can match or exceed larger models, democratizing LLM research.
Deep Reinforcement Learning from Human Preferences
by OpenAI
This foundational RLHF paper shows that human preference comparisons between agent behaviors can train a reward model that guides deep RL agents in complex tasks like Atari games and MuJoCo locomotion, without hand-crafted reward functions. The approach reduces human labeling effort by ~3 orders of magnitude compared to direct reward specification.
Gemini: A Family of Highly Capable Multimodal Models
by Google DeepMind
Introduced the Gemini family of multimodal models (Ultra, Pro, Nano) natively trained to process and combine text, images, audio, and video. Gemini Ultra is the first model to surpass human expert performance on MMLU and achieves state-of-the-art across 30 of 32 benchmarks evaluated.
Efficient Memory Management for Large Language Model Serving with PagedAttention
by UC Berkeley
Introduced PagedAttention and the vLLM serving system, which manages the KV cache in non-contiguous physical memory blocks inspired by OS paging, enabling near-zero memory waste and efficient sharing of KV cache across requests. vLLM improves serving throughput by 2-4x at the same latency compared to state-of-the-art systems such as FasterTransformer and Orca.
Generative Agents: Interactive Simulacra of Human Behavior
by Stanford University / Google
Introduces generative agents—computational software agents that simulate believable human behavior—by combining a large language model with memory streams, reflection synthesis, and planning mechanisms. Twenty-five agents populate a virtual town, exhibiting emergent social behaviors including relationship formation, information propagation, and event coordination.
Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)
by OpenAI
Presented DALL-E 2 (unCLIP), a hierarchical text-conditional image generation system using CLIP image embeddings as a prior and a diffusion decoder. The system achieves state-of-the-art photorealism and text-image alignment, substantially advancing the field of text-to-image synthesis.
Self-Consistency Improves Chain of Thought Reasoning in Language Models
by Google Brain
Introduced self-consistency, a decoding strategy that samples diverse reasoning paths from a language model and returns the most consistent answer by marginalizing out the reasoning paths. Self-consistency is a simple, training-free technique that substantially improves chain-of-thought prompting across arithmetic and commonsense reasoning tasks.
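The aggregation step is just a majority vote over final answers parsed from the sampled reasoning paths. A minimal sketch, with hypothetical extracted answers standing in for real model samples:

```python
from collections import Counter

def self_consistent_answer(final_answers):
    """Majority vote over final answers extracted from sampled CoT paths.

    In the paper's setup, each answer comes from one temperature-sampled
    reasoning path; marginalizing out the paths means only the final
    answer string is voted on.
    """
    return Counter(final_answers).most_common(1)[0][0]

# e.g. five sampled reasoning paths whose extracted answers disagree:
paths = ["18", "18", "26", "18", "9"]
print(self_consistent_answer(paths))  # 18
```

The technique is training-free: the only cost is sampling several decodes instead of one.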
Scaling Laws for Neural Language Models
by OpenAI
Empirically establishes power-law scaling relationships between language model performance and model size, dataset size, and compute budget. Provides the foundational framework for predicting LLM capabilities as a function of scale, guiding research for years.
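A power law is a straight line in log-log space, which is how such exponents are fit in practice. The sketch below recovers the exponent from synthetic losses; the constants are merely in the ballpark of the paper's model-size fit and are used here only to generate data:

```python
import numpy as np

# Synthetic losses following L(N) = (Nc / N)^alpha. alpha and Nc are
# illustrative constants, not an endorsement of specific fitted values.
alpha_true, Nc = 0.076, 8.8e13
N = np.logspace(6, 11, 20)          # model sizes in parameters
L = (Nc / N) ** alpha_true

# log L = alpha * (log Nc - log N): a linear fit on (log N, log L)
# recovers the exponent as the negated slope.
slope, _ = np.polyfit(np.log(N), np.log(L), 1)
print(round(-slope, 3))  # 0.076
```

The same log-log regression applies to the dataset-size and compute axes, each with its own exponent.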
Visual Instruction Tuning (LLaVA)
by University of Wisconsin–Madison / Microsoft Research
Introduced LLaVA (Large Language and Vision Assistant), a multimodal model trained via visual instruction tuning using GPT-4-generated multimodal instruction-following data. LLaVA demonstrates impressive multimodal chat abilities, achieving an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following benchmark, and pioneered open-source visual instruction tuning.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
by Google Brain
Introduced Switch Transformers, a simplified mixture-of-experts (MoE) architecture that routes each token to exactly one expert (top-1 routing), enabling trillion-parameter models with sub-linear compute scaling. Switch Transformers achieve 7x pretraining speedup over a dense T5 model while maintaining model quality.
Model Cards for Model Reporting
by Google
Model Cards introduces a structured framework for documenting machine learning models across intended uses, performance disaggregated by demographic groups, and ethical considerations, enabling informed model selection and deployment decisions. The paper has become an industry standard, with model card adoption by Google, Hugging Face, and most major AI providers.
Language Models are Unsupervised Multitask Learners (GPT-2)
by OpenAI
Introduced GPT-2, demonstrating that large language models trained on diverse web text can perform zero-shot transfer across many NLP tasks without task-specific fine-tuning. Showed emergent capabilities at scale and sparked debate on responsible AI release.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
by Stanford University
Introduces FlashAttention, an IO-aware exact attention algorithm that restructures attention computation to minimize memory reads/writes between HBM and SRAM. Achieves 2-4× speedup over standard attention and enables training on much longer sequences.
Training Compute-Optimal Large Language Models (Chinchilla)
by DeepMind
Challenges the Kaplan et al. scaling laws by showing that model size and training tokens should scale equally. Trains Chinchilla (70B) on 4× more data than Gopher, matching or beating models 4× its size, redefining compute-optimal training strategies.
Datasheets for Datasets
by Microsoft Research / Multiple Institutions
Drawing an analogy to electronics component datasheets, this paper proposes that every ML dataset should be accompanied by a standardized document covering its motivation, composition, collection process, preprocessing, uses, distribution, and maintenance. Datasheets for Datasets has become the foundational standard for dataset transparency and is widely required by major AI venues.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
by Princeton University / Google DeepMind
Introduced Tree of Thoughts (ToT), a framework that generalizes chain-of-thought prompting to a tree search over intermediate reasoning steps. ToT enables LLMs to explore multiple reasoning paths, evaluate choices, and backtrack, achieving dramatic improvements on tasks requiring lookahead and planning.
Fast Inference from Transformers via Speculative Decoding
by Google Research
Introduced speculative decoding, a lossless inference acceleration technique that uses a smaller, faster draft model to propose multiple tokens, then verifies them in parallel with the target model in a single forward pass. This achieves 2-3x speedup without any degradation in output quality or distribution.
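The control flow can be illustrated with a deterministic greedy toy. The real method verifies draft tokens with rejection sampling over probability distributions so the output distribution is preserved; the sketch below is the simpler greedy special case, with toy next-token functions standing in for real models:

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One round of greedy speculative decoding (toy sketch).

    draft_next/target_next map a token sequence to its next token.
    The draft proposes k tokens; the target verifies them (conceptually
    in one forward pass) and keeps the longest matching prefix, plus one
    token of its own, so output always equals pure target decoding.
    """
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    accepted = list(prefix)
    for tok in proposal[len(prefix):]:
        if target_next(accepted) == tok:
            accepted.append(tok)                     # draft guessed right
        else:
            accepted.append(target_next(accepted))   # target overrides
            break
    else:
        accepted.append(target_next(accepted))       # bonus token
    return accepted

# Toy next-token functions standing in for real models:
target = lambda seq: len(seq) % 7
print(speculative_step(target, target, [1, 2]))  # [1, 2, 2, 3, 4, 5, 6]
```

When the draft agrees for all k positions, one verification pass yields k+1 tokens; when it diverges early, the step degrades gracefully to one target token, which is why quality is unchanged while average latency drops.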
Constitutional AI: Harmlessness from AI Feedback
by Anthropic
Introduces Constitutional AI (CAI), a method for training harmless AI assistants using a set of written principles (a 'constitution') to guide both supervised learning and reinforcement learning from AI feedback (RLAIF). CAI enables Anthropic to reduce reliance on human harm labels while maintaining helpfulness and making AI reasoning about harmlessness explicit.
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
by Stability AI
Presented SDXL, a significantly improved latent diffusion model architecture featuring a 3.5B parameter UNet backbone with a secondary refiner model, conditioning on image size and crop parameters, and a curated high-aesthetic dataset. SDXL substantially improves visual quality and prompt adherence over prior Stable Diffusion versions.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
by DeepSeek
DeepSeek-R1 demonstrates that pure reinforcement learning with rule-based rewards—without supervised fine-tuning on chain-of-thought data—can incentivize emergent reasoning capabilities in LLMs including self-verification, reflection, and long chain-of-thought. The model achieves performance comparable to OpenAI-o1 on reasoning benchmarks while being fully open-sourced, triggering a significant industry response.
QLoRA: Efficient Finetuning of Quantized LLMs
by University of Washington
Introduces QLoRA, which combines 4-bit quantization with LoRA adapters to fine-tune a 65B LLM on a single 48GB GPU while preserving full 16-bit fine-tuning performance. Introduces NF4 data type and double quantization for extreme memory reduction.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
by Institute of Science and Technology Austria (IST Austria)
Presented GPTQ, a one-shot weight quantization method based on approximate second-order information that can quantize GPT models with 175B parameters to 4-bit or 3-bit precision in approximately four GPU-hours with negligible accuracy loss. GPTQ made large model inference practical on consumer hardware.
Code Llama: Open Foundation Models for Code
by Meta AI
Introduced Code Llama, a family of large language models for code built on Llama 2 through code-specific pretraining and fine-tuning. Code Llama achieves state-of-the-art performance among open models on HumanEval and MBPP, with variants for Python, instruction following, and long context (100K tokens).
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
by LMSYS / UC Berkeley
Introduces Chatbot Arena, a platform for crowdsourced human evaluation of LLMs via pairwise comparisons using an Elo rating system. The arena has collected over 240K human votes across 50+ models, revealing human preference rankings that often diverge from standard benchmark leaderboards and providing a complementary evaluation signal.
Toolformer: Language Models Can Teach Themselves to Use Tools
by Meta AI
Presents Toolformer, a model that learns to use external tools (APIs) in a self-supervised manner without requiring human annotations. The model decides which APIs to call, how to call them, and how to incorporate results, achieving strong performance across diverse tasks while maintaining generative language modeling ability.
Mistral 7B
by Mistral AI
Introduces Mistral 7B, a 7B parameter language model outperforming LLaMA 2 13B on all evaluated benchmarks and surpassing LLaMA 1 34B in reasoning, mathematics, and code generation. Uses grouped-query attention and sliding window attention for efficient inference.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
by MIT / MIT-IBM Watson AI Lab
Introduced AWQ (Activation-aware Weight Quantization), a hardware-friendly low-bit weight quantization approach that protects a small fraction (1%) of salient weights based on activation magnitudes, achieving better performance than GPTQ at 4-bit while being faster and more broadly applicable across model architectures.
PaLM: Scaling Language Modeling with Pathways
by Google Research
Introduces PaLM (Pathways Language Model), a 540B parameter model trained on 780B tokens using the Pathways system. Achieved breakthrough performance on reasoning tasks and demonstrated discontinuous performance improvements that define emergent abilities.
DINOv2: Learning Robust Visual Features without Supervision
by Meta AI
Presented DINOv2, a self-supervised vision foundation model trained on a curated dataset of 142 million images by combining image-level self-distillation with patch-level masked image modeling objectives. DINOv2 features serve as universal visual representations, excelling on depth estimation, segmentation, and classification without fine-tuning.
Holistic Evaluation of Language Models
by Stanford CRFM
Presents HELM, a holistic evaluation framework for language models across 42 scenarios and 7 metrics: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. HELM reveals that no single model dominates across all dimensions and exposes significant gaps between narrow and comprehensive model assessment.
GPT-4V(ision) System Card
by OpenAI
The system card for GPT-4 with vision (GPT-4V), detailing the model's visual understanding capabilities, safety evaluations, limitations, and mitigation strategies. GPT-4V represents a major advancement in large multimodal models, enabling complex visual reasoning from natural language prompts.
The Claude 3 Model Family: Opus, Sonnet, Haiku
by Anthropic
Presents the Claude 3 family of models (Opus, Sonnet, Haiku), demonstrating state-of-the-art performance on reasoning, vision, and multilingual tasks. Highlights Anthropic's safety techniques including Constitutional AI and RLHF-based alignment.
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)
by Google Brain
Introduced Imagen, a text-to-image diffusion model that leverages large pretrained language models (T5-XXL) for text understanding combined with cascaded diffusion models for image synthesis. Imagen demonstrated that scaling text encoders is more impactful than scaling diffusion models, establishing DrawBench as a new evaluation benchmark.
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
by Princeton University / Together AI
Extends FlashAttention with improved work partitioning across GPU thread blocks and warps, achieving 2× speedup over FlashAttention and ~9× speedup over standard attention. Enables efficient training of models with context lengths up to 256K tokens.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
by University of Washington / Black in AI
This influential FAccT paper argues that ever-larger language models carry significant risks—including environmental costs, biased training data, and the illusion of meaning—that are often overlooked in the race for benchmark performance. It calls for pausing scaling to focus on documentation, auditing, and community-centered research practices.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
by Salesforce Research
Presented BLIP-2, which bridges the modality gap between frozen image encoders and frozen LLMs using a lightweight Querying Transformer (Q-Former) trained in two stages. BLIP-2 achieves state-of-the-art VQA performance with significantly fewer trainable parameters than prior methods.
Flamingo: a Visual Language Model for Few-Shot Learning
by DeepMind
Introduced Flamingo, a family of visual language models that bridge powerful pretrained vision and language models, enabling few-shot learning on a diverse range of multimodal tasks by training on arbitrarily interleaved sequences of images, video, and text. Flamingo set new few-shot state-of-the-art on 16 benchmarks.
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
by Tsinghua / Peking University / DeepWisdom
Presents MetaGPT, a multi-agent framework that encodes human workflows as Standardized Operating Procedures (SOPs) for LLM agents acting as specialized software roles. By assigning product manager, architect, engineer, and QA roles, MetaGPT produces complete, executable codebases from natural language requirements with higher quality than prior approaches.
Let's Verify Step by Step
by OpenAI
Demonstrated that process-based reward models (PRMs), which provide feedback on each reasoning step, substantially outperform outcome-based reward models (ORMs) for training LLMs to solve mathematical reasoning problems. The paper also introduced PRM800K, a dataset of 800K step-level human feedback labels on MATH solutions.
RoFormer: Enhanced Transformer with Rotary Position Embedding
by Zhuiyi Technology
Introduces Rotary Position Embedding (RoPE), encoding absolute position information with a rotation matrix and naturally incorporating relative position in self-attention. Adopted by LLaMA, PaLM 2, and most modern LLMs for its length generalization properties.
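The defining property — attention scores depending only on relative offsets — can be checked directly. A minimal NumPy sketch of the rotation (the pairing convention below is one common layout; implementations differ in how dimensions are paired):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to a vector at position `pos`.

    Each dimension pair (x_i, x_{i + d/2}) is rotated by the angle
    pos * theta_i, with per-pair frequency theta_i = base^(-2i/d).
    """
    half = x.shape[-1] // 2
    theta = base ** (-np.arange(half) / half)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
# The q.k score depends only on the relative offset m - n:
s1 = rope(q, 3) @ rope(k, 1)    # offset 2
s2 = rope(q, 10) @ rope(k, 8)   # offset 2
print(np.isclose(s1, s2))  # True
```

Rotating both query and key by position-dependent angles makes their inner product a function of cos((m-n)θ) and sin((m-n)θ) alone, which is the length-generalization property the entry refers to.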
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
by Princeton University
Introduced SWE-bench, a benchmark of 2,294 real GitHub issues from 12 popular Python repositories requiring models to resolve issues by writing code patches. SWE-bench reveals that even the best LLMs resolve only about 2% of issues with standard retrieval techniques, motivating research into code agents.
Voyager: An Open-Ended Embodied Agent with Large Language Models
by NVIDIA / Caltech / UT Austin
Presents Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager uses an automatic curriculum, an ever-growing skill library of executable code, and an iterative prompting mechanism to overcome failures.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
by Stanford University
Introduces DPO, a stable and efficient alternative to RLHF that directly optimizes a language model on human preference data without an explicit reward model or RL. Achieves comparable or superior alignment results with significantly simpler implementation.
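The DPO objective for a single preference pair is a logistic loss on an implicit reward margin. A NumPy sketch (the log-probability values below are hypothetical; a real implementation sums token log-probs from the policy and a frozen reference model):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are the policy's sequence log-probs; ref_* are the frozen
    reference model's. The implicit reward is beta * (logp - ref_logp),
    and the loss is -log sigmoid of the chosen-vs-rejected reward margin.
    """
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# With no preference margin the loss is log(2); it falls as the policy
# raises the chosen response's likelihood relative to the reference:
print(round(dpo_loss(-10.0, -10.0, -10.0, -10.0), 3))  # 0.693
```

Because the reward is implicit in the policy itself, no separate reward model and no RL rollout loop are needed, which is the simplification the entry describes.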
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
by Microsoft Research
Presents GraphRAG, which uses LLM-generated knowledge graphs and community detection to enable query-focused summarization over entire text corpora. Unlike standard RAG which answers local questions from text chunks, GraphRAG enables global sensemaking queries by reasoning over interconnected entity communities at multiple granularities.
Decision Transformer: Reinforcement Learning via Sequence Modeling
by UC Berkeley / Google Brain
Decision Transformer recasts offline reinforcement learning as a conditional sequence modeling problem, predicting actions given return-to-go, states, and past actions using a causal Transformer. This eliminates the need for temporal difference learning and bootstrapping while achieving competitive performance on Atari and MuJoCo benchmarks.
REALM: Retrieval-Augmented Language Model Pre-Training
by Google Research
Proposes REALM, which augments language model pre-training with a learned textual knowledge retriever, enabling the model to retrieve and attend over documents from a large corpus during both pre-training and fine-tuning. REALM achieves state-of-the-art on open-domain QA benchmarks while providing interpretable knowledge retrieval.
StarCoder: May the Source Be With You!
by BigCode / Hugging Face / ServiceNow
Presented StarCoder, a 15.5B parameter open-source code LLM trained on 1 trillion tokens from The Stack (permissively licensed source code) with fill-in-the-middle capability, fast multi-token prediction inference, and a commitment to responsible AI through a model card and attribution feature.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
by Google Brain
Introduces the Sparsely-Gated Mixture-of-Experts (MoE) layer, enabling 1000× capacity increase with only marginal computational cost increase. A learned gating network selects a sparse subset of expert sub-networks per input, enabling unprecedented model scale.
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
by Google / Everyday Robots
SayCan combines the semantic reasoning capabilities of large language models with learned value functions that encode physical feasibility, allowing robots to plan long-horizon tasks expressed in natural language. The approach grounds high-level language instructions in real-world robot affordances without task-specific fine-tuning.
DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter
by Hugging Face
Introduces DistilBERT, a knowledge-distilled version of BERT that retains 97% of BERT's language understanding while being 40% smaller and 60% faster. Demonstrates the effectiveness of task-agnostic knowledge distillation for pretrained language models.
Conservative Q-Learning for Offline Reinforcement Learning
by UC Berkeley
CQL (Conservative Q-Learning) addresses distribution shift in offline RL by augmenting the standard Bellman objective with a term that penalizes Q-values for out-of-distribution actions, producing a lower bound on the true value function. This conservative approach prevents over-optimistic value estimation and achieves strong performance across locomotion, navigation, and robotic manipulation datasets.
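The conservative term for a single state can be sketched in the discrete-action case (the Q-values below are hypothetical; the full algorithm adds this penalty, weighted, to the standard Bellman loss):

```python
import numpy as np

def cql_penalty(q_values, data_action):
    """CQL regularizer for one state (discrete-action sketch).

    Pushes down a soft maximum over all actions' Q-values while pushing
    up the Q-value of the action actually present in the offline data:
        logsumexp_a Q(s, a) - Q(s, a_data) >= 0.
    """
    soft_max = np.log(np.sum(np.exp(q_values)))  # logsumexp over actions
    return soft_max - q_values[data_action]

q = np.array([1.0, 4.0, 2.0])   # hypothetical Q(s, .) for 3 actions
print(cql_penalty(q, 1) >= 0.0)  # True: logsumexp >= max >= Q(s, a_data)
```

Since logsumexp upper-bounds the maximum Q-value, the penalty is largest exactly when out-of-distribution actions look spuriously attractive, which is how it counteracts over-optimistic value estimates.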
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
by University of Washington / IBM AI Research / Allen AI
Introduces Self-RAG, a framework that trains a single LM to adaptively retrieve passages on demand, generate text, and critique its own outputs using special reflection tokens. Unlike standard RAG, Self-RAG decides when to retrieve and reflects on retrieved passages and generation quality, outperforming ChatGPT and standard RAG on diverse downstream tasks.
Emergent Abilities of Large Language Models
by Google Research / Stanford / DeepMind / UNC
Defines and documents emergent abilities in LLMs — capabilities that appear sharply at certain model scales rather than improving gradually. Surveys over 100 tasks where models exhibit phase-transition-like capability gains, sparking debate on whether emergence is real or a measurement artifact.
Improving Language Models by Retrieving from Trillions of Tokens
by DeepMind
Presents RETRO (Retrieval-Enhanced Transformers), a model that retrieves from a 2-trillion-token database at inference time via chunked cross-attention. RETRO achieves performance comparable to GPT-3 with 25× fewer parameters by leveraging retrieved passages, demonstrating that retrieval augmentation is a compute-efficient alternative to scaling.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
by Princeton NLP / Princeton Language and Intelligence
Introduces SWE-agent, which defines Agent-Computer Interfaces (ACIs) to enable LLMs to autonomously solve real GitHub issues by browsing codebases, editing files, and running tests. On the SWE-bench benchmark, SWE-agent with GPT-4 Turbo resolves 12.5% of issues, significantly outperforming prior methods.
Red Teaming Language Models with Language Models
by DeepMind
Proposes using language models to automatically generate test cases that elicit harmful behaviors from target language models—a scalable alternative to manual red teaming. The approach discovers diverse attack prompts across harm categories and reveals that larger models are harder to red-team but produce more harmful outputs when successfully attacked.
Qwen2.5 Technical Report
by Alibaba Cloud / Qwen Team
Qwen2.5 is a comprehensive family of open-source LLMs (0.5B to 72B parameters) trained on 18 trillion tokens including significantly expanded coding and mathematics data, achieving state-of-the-art open-source performance on coding (HumanEval), mathematics (MATH), and multilingual benchmarks. The series includes specialized Qwen2.5-Coder and Qwen2.5-Math variants.
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
by LMSYS / UC Berkeley / CMU / UCSD
Presents Vicuna-13B, an open-source chatbot created by fine-tuning LLaMA on ShareGPT conversation data, achieving approximately 90% of ChatGPT and Bard quality as judged by GPT-4. The paper introduces GPT-4 as an automated judge for chatbot evaluation, establishing a widely adopted evaluation paradigm for conversational AI.
AgentBench: Evaluating LLMs as Agents
by Tsinghua University
Introduces AgentBench, the first systematic benchmark for evaluating LLMs as autonomous agents across eight distinct environments spanning operating systems, databases, knowledge graphs, digital games, and web browsing. The benchmark reveals a large performance gap between commercial and open-source models on real-world agent tasks.
Competition-Level Code Generation with AlphaCode
by DeepMind
AlphaCode is a large-scale language model from DeepMind designed for competitive programming. It was pre-trained on public GitHub code and fine-tuned on a curated dataset of programming contest problems. The system generates a vast number of potential solutions and then filters them using test cases to find a correct one.
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
by OpenAI
This paper explores weak-to-strong generalization, a method for training a powerful AI model using supervision from a weaker one. It serves as an analogy for aligning superintelligent AI with human values. The research shows that strong models can learn beyond their weak supervisors and introduces techniques like auxiliary confidence loss to improve performance.
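The auxiliary confidence loss can be sketched as a mixture of two cross-entropies, one against the weak supervisor's labels and one against the strong model's own hardened predictions. This is an illustrative toy, not the paper's code; the function name and `alpha` weighting are assumptions.

```python
import numpy as np

def aux_confidence_loss(probs, weak_labels, alpha=0.5):
    """Mix cross-entropy against the weak supervisor's labels with
    cross-entropy against the strong model's own argmax predictions,
    letting the strong model disagree confidently with its weak teacher."""
    eps = 1e-12
    n = len(weak_labels)
    ce_weak = -np.log(probs[np.arange(n), weak_labels] + eps).mean()
    hard = probs.argmax(1)  # the model's own confident guess
    ce_self = -np.log(probs[np.arange(n), hard] + eps).mean()
    return (1 - alpha) * ce_weak + alpha * ce_self

# A model that is confident in class 0 while the weak label says class 1:
p = np.array([[0.9, 0.1]])
loss_weak_only = aux_confidence_loss(p, [1], alpha=0.0)  # penalizes the disagreement
loss_self_only = aux_confidence_loss(p, [1], alpha=1.0)  # rewards the confident guess
```

Raising `alpha` shifts weight from imitating the weak labels toward reinforcing the strong model's own confident predictions, which is the mechanism the paper credits for generalizing beyond the supervisor.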
Scalable agent alignment via reward modeling: a research direction
by DeepMind
This research paper proposes a method for aligning advanced AI systems by using recursive reward modeling. The approach leverages AI assistants to help human evaluators assess complex AI actions, enabling scalable oversight and positioning this technique alongside debate and amplification as key AI safety strategies.
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
by Alibaba Cloud / DAMO Academy
Qwen-VL is a large-scale vision-language model series from Alibaba, trained on a curated multilingual multimodal dataset. It supports high-resolution image understanding, visual grounding with bounding boxes, and multilingual text reading, achieving state-of-the-art results on multiple visual benchmarks.
CAMEL: Communicative Agents for Mind Exploration of Large Language Model Society
by KAUST
CAMEL introduces a novel framework for studying multi-agent cooperation by having AI agents role-play to solve tasks. It utilizes a technique called 'inception prompting' to ensure agents adhere to their assigned personas, enabling the exploration of complex communicative behaviors and societal dynamics within large language models with minimal human guidance.
STaR: Bootstrapping Reasoning With Reasoning
by Stanford University / Google Brain
STaR (Self-Taught Reasoner) is a research paper introducing an iterative bootstrapping method for language models. The model learns to improve its reasoning abilities by generating rationales for problems, filtering out the incorrect ones, and then fine-tuning itself on the successfully reasoned examples. This allows smaller models to achieve reasoning performance comparable to much larger ones.
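One bootstrapping round can be sketched in a few lines; the toy model and helper names here are illustrative, not from the paper's code.

```python
def star_round(model, problems, answers):
    """One STaR round: generate a rationale and answer per problem,
    keep only pairs whose answer is correct, and return them as the
    fine-tuning set for the next iteration."""
    finetune_set = []
    for p in problems:
        rationale, guess = model(p)
        if guess == answers[p]:  # rationalize-and-filter step
            finetune_set.append((p, rationale))
    return finetune_set

# Toy model: adds two numbers, but is wrong on one case.
def toy_model(p):
    a, b = p
    guess = a + b if p != (2, 2) else 5
    return (f"{a} plus {b}", guess)

problems = [(1, 1), (2, 2), (3, 4)]
answers = {(1, 1): 2, (2, 2): 4, (3, 4): 7}
data = star_round(toy_model, problems, answers)
# keeps only the two correctly answered problems with their rationales
```

In the full method this loop repeats: fine-tuning on the surviving rationales improves the model, which then rationalizes more problems correctly on the next pass.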
The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI
by Meta AI
Llama 4 introduces a family of natively multimodal mixture-of-experts models—Scout (17B/16 experts), Maverick (17B/128 experts), and Behemoth (288B/16 experts)—pretrained jointly on text, image, and video data. Maverick achieves top scores on vision-language benchmarks while Scout offers 10M-token context at a fraction of the compute of comparable models.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
by Google Research
Introduces Grouped-Query Attention (GQA), an efficient attention mechanism that generalizes Multi-Head and Multi-Query Attention. GQA groups query heads to share key and value heads, drastically reducing the KV cache size and memory bandwidth, which accelerates inference speed while maintaining near Multi-Head quality.
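The grouping trick can be sketched in NumPy: query heads outnumber KV heads, and each group of queries attends using a shared KV head. A minimal toy, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gqa_attention(q, k, v):
    """q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d).
    Each group of n_q_heads // n_kv_heads query heads shares one KV head,
    shrinking the KV cache by that factor versus multi-head attention."""
    n_q, T, d = q.shape
    n_kv = k.shape[0]
    assert n_q % n_kv == 0
    rep = n_q // n_kv
    k = np.repeat(k, rep, axis=0)  # broadcast shared KV heads to query heads
    v = np.repeat(v, rep, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 4, 16))  # 2 shared KV heads -> 4x smaller KV cache
v = rng.standard_normal((2, 4, 16))
out = gqa_attention(q, k, v)         # shape (8, 4, 16)
```

Setting the KV-head count to 1 recovers Multi-Query Attention, and setting it equal to the query-head count recovers standard Multi-Head Attention.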
Artificial Intelligence Ethics Guidelines: A Global Inventory
by ETH Zurich / Multiple Institutions
This paper presents a systematic review of 84 prominent AI ethics guidelines from around the world. It identifies a global convergence on five ethical principles (transparency, justice and fairness, non-maleficence, responsibility, and privacy) but reveals significant divergence in how these principles are interpreted and operationalized across different sectors and regions.
Zoom In: An Introduction to Circuits
by Distill / OpenAI
This essay by Chris Olah and colleagues at Distill introduces the circuits framework for mechanistic interpretability, arguing that neural network weights encode interpretable algorithms composed of features and circuits. It presents case studies of curve detectors and multimodal neurons as evidence that individual units and motifs in neural networks are meaningfully interpretable.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
by Anthropic
Demonstrates that LLMs can be trained to behave safely during normal operation but exhibit unsafe behaviors when triggered by specific conditions—acting as 'sleeper agents'—and that standard safety training techniques including RLHF, supervised fine-tuning, and adversarial training fail to reliably remove these backdoors, sometimes even hiding them deeper.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
by Google Research
Introduces Switch Transformers, simplifying MoE routing to select a single expert per token (top-1), enabling stable trillion-parameter T5-scale models with 7× pre-training speedup. Demonstrates that parameter count and compute can be decoupled through sparsity.
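The top-1 routing rule is simple enough to sketch directly; this toy router (names and shapes are assumptions, not the paper's code) sends each token to exactly one expert and scales the output by the router probability, which is what keeps the router differentiable in the real model.

```python
import numpy as np

def switch_route(x, w_router, experts):
    """Switch-style top-1 routing: each token is dispatched to a single
    expert chosen by argmax over router probabilities."""
    logits = x @ w_router                         # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    choice = probs.argmax(-1)                     # top-1 expert per token
    out = np.empty_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        out[mask] = probs[mask, e:e + 1] * expert(x[mask])
    return out, choice

x = np.array([[1.0, 1.0], [-1.0, -1.0]])
w_router = np.array([[1.0, -1.0], [1.0, -1.0]])
experts = [lambda h: 2.0 * h, lambda h: -h]
out, choice = switch_route(x, w_router, experts)  # token 0 -> expert 0, token 1 -> expert 1
```

Because only one expert runs per token, total parameters grow with the expert count while per-token compute stays constant, which is the decoupling the paper demonstrates.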
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
by DeepSeek
This paper introduces Group Relative Policy Optimization (GRPO), a memory-efficient reinforcement learning algorithm. GRPO enables scalable RLHF-style training by replacing the critic model with group-sampled reward baselines, a technique used to enhance the mathematical reasoning of models like DeepSeekMath.
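The baseline substitution at the heart of GRPO fits in one function: normalize each sampled completion's reward against its own group's statistics instead of a critic's value estimate. A minimal sketch, not DeepSeek's implementation.

```python
import numpy as np

def group_relative_advantages(rewards):
    """For each prompt, sample a group of completions and use the group
    mean and std of rewards as the baseline (no learned critic)."""
    r = np.asarray(rewards, dtype=float)          # (groups, samples_per_group)
    mean = r.mean(axis=1, keepdims=True)
    std = r.std(axis=1, keepdims=True) + 1e-8     # avoid division by zero
    return (r - mean) / std

adv = group_relative_advantages([[1.0, 0.0, 2.0],
                                 [0.5, 0.5, 0.5]])
# each row is centered at ~0; a constant-reward group yields ~0 advantage
```

Completions scoring above their group mean get positive advantages and are reinforced; a group where every sample scores the same provides no gradient signal, which is why multiple samples per prompt are needed.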
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
by Google DeepMind
RT-2 is a Vision-Language-Action (VLA) model that translates visual and language inputs directly into robotic actions. By co-fine-tuning large models on both web-scale and robotics data, it transfers knowledge from the internet to physical control, enabling robots to reason about and execute tasks involving novel objects and scenarios without explicit robotic training.
Representation Engineering: A Top-Down Approach to AI Transparency
by Center for AI Safety / UC Berkeley
Representation Engineering (RepE) is a top-down AI transparency technique for interpreting and controlling Large Language Models. It uses linear probes on activation differences from contrastive prompts to identify and manipulate high-level concepts like truthfulness and emotion without needing to retrain or fine-tune the model.
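The core reading-vector idea can be illustrated with a difference-of-means probe over contrastive activations; the synthetic data and function name below are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np

def reading_vector(pos_acts, neg_acts):
    """Estimate a concept direction as the normalized difference of mean
    activations between contrastive prompt sets; new activations are then
    scored by projection onto this direction."""
    v = pos_acts.mean(0) - neg_acts.mean(0)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
concept = np.array([1.0, 0.0, 0.0])
pos = rng.standard_normal((50, 3)) + 3 * concept   # e.g. "honest" prompts
neg = rng.standard_normal((50, 3)) - 3 * concept   # e.g. "dishonest" prompts
v = reading_vector(pos, neg)
score_pos = (pos @ v).mean()
score_neg = (neg @ v).mean()   # projection separates the two sets
```

In RepE the same direction can also be added to or subtracted from activations at inference time to steer behavior, with no retraining.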
Atlas: Few-shot Learning with Retrieval Augmented Language Models
by Meta AI / University College London
Atlas is a retrieval-augmented language model designed for few-shot learning. It uniquely pre-trains its retriever and language model components jointly, enabling it to effectively leverage external knowledge documents. This approach allows Atlas to achieve state-of-the-art few-shot performance on knowledge-intensive NLP benchmarks like MMLU, outperforming much larger models.
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
by Anthropic
This research paper from Anthropic introduces a method using sparse autoencoders to decompose the internal activations of a transformer model. It successfully extracts thousands of interpretable, monosemantic features, demonstrating that the superposition of concepts within neurons can be untangled.
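The forward pass of such a sparse autoencoder is compact: an overcomplete ReLU encoder followed by a linear decoder. The shapes and negative encoder bias below are illustrative choices, not Anthropic's trained values.

```python
import numpy as np

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Project activations into an overcomplete feature basis with a ReLU
    (most features stay at exactly zero), then reconstruct the activation
    as a sparse sum of decoder directions."""
    f = np.maximum(0.0, x @ w_enc + b_enc)   # sparse feature activations
    x_hat = f @ w_dec + b_dec                # reconstruction
    return f, x_hat

rng = np.random.default_rng(0)
d_model, d_feat = 4, 16                      # overcomplete: d_feat >> d_model
w_enc = rng.standard_normal((d_model, d_feat))
b_enc = -1.0 * np.ones(d_feat)               # negative bias encourages sparsity
w_dec = rng.standard_normal((d_feat, d_model))
b_dec = np.zeros(d_model)
x = rng.standard_normal((10, d_model))
f, x_hat = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
```

Training minimizes reconstruction error plus an L1 penalty on `f`; the paper's interpretability claims come from inspecting which inputs activate each learned feature.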
Claude Opus 4 Technical Report
by Anthropic
The Claude Opus 4 technical report details Anthropic's flagship model, highlighting its extended thinking, advanced coding, and agentic capabilities. It showcases top-tier performance on benchmarks like SWE-bench and GPQA, along with significant improvements in safety through Constitutional AI and RLHF.
Gemini 2.5 Pro Technical Report
by Google DeepMind
Gemini 2.5 Pro introduces thinking mode—an integrated chain-of-thought reasoning layer—combined with a 1M-token context window and natively multimodal capabilities spanning text, image, audio, and video. The model achieves leading positions on multiple reasoning and coding benchmarks including Codeforces, AIME, and MMMU.
In-context Learning and Induction Heads
by Anthropic
This paper establishes a causal link between specific transformer circuits, termed "induction heads," and the phenomenon of in-context learning. It demonstrates that these circuits, which complete repeated sequences by copying the token that followed an earlier occurrence ([A][B] ... [A] → [B]), emerge abruptly during training and are a key mechanistic driver of few-shot learning abilities in LLMs.
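The behavior an induction head implements reduces to a lookup over the context; this toy function (an illustration of the pattern, not a model internal) makes the [A][B] ... [A] → [B] rule concrete.

```python
def induction_predict(tokens):
    """Induction-head behavior as a lookup: if the current token appeared
    earlier in the context, predict the token that followed its most
    recent previous occurrence."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence: the head has nothing to copy

induction_predict(list("abxa"))  # returns "b": 'a' was followed by 'b' before
```

In a real transformer this is realized by a previous-token head composing with an attention head that attends to the matched position and copies its successor.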
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
by Carnegie Mellon University / Together AI
Mamba is a novel sequence modeling architecture based on structured state space models (SSMs). It introduces a selection mechanism that allows the model to selectively propagate or forget information based on the input, overcoming a key limitation of previous SSMs. This enables Mamba to achieve Transformer-level performance with linear time complexity and significantly faster inference.
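The selection mechanism can be caricatured with a scalar-state recurrence whose decay and input gate depend on the current input; this is a heavily simplified toy (real Mamba uses discretized matrix-valued SSM parameters and a parallel scan), with all names assumed for illustration.

```python
import numpy as np

def selective_scan(x, w_a, w_b):
    """Selective-SSM sketch, one scalar state per channel: the decay a_t
    and input gate b_t are functions of the current input x_t, so the
    recurrence can choose to retain or overwrite its state. Cost is O(T)."""
    T, d = x.shape
    h = np.zeros(d)
    ys = np.empty_like(x)
    for t in range(T):
        a_t = 1.0 / (1.0 + np.exp(-(x[t] * w_a)))  # input-dependent decay in (0, 1)
        b_t = x[t] * w_b                            # input-dependent input gate
        h = a_t * h + b_t * x[t]
        ys[t] = h
    return ys

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 3))
ys = selective_scan(x, w_a=np.ones(3), w_b=np.ones(3))  # shape (6, 3)
```

Because the state is fixed-size, inference needs no growing KV cache, which is the source of Mamba's linear-time, constant-memory generation.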
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
by LMSYS / UC Berkeley
Introduces LMSYS-Chat-1M, a large-scale dataset of one million real-world conversations with 25 state-of-the-art LLMs collected from the Chatbot Arena platform. Analysis reveals diverse usage patterns, safety violations, and human preference signals, making it a valuable resource for safety evaluation, capability assessment, and alignment research.
Towards Expert-Level Medical Question Answering with Large Language Models
by Google Research
This paper introduces Med-PaLM 2, a large language model fine-tuned on medical data. It achieves expert-level performance on medical licensing exam questions, with long-form answers that physicians preferred to physician-written answers on most evaluation axes, and proposes a framework for evaluating the safety and alignment of medical AI systems.
CogVLM: Visual Expert for Pretrained Language Models
by Tsinghua University / Zhipu AI
CogVLM is a vision-language model that enhances pretrained language models (LLMs) with visual understanding. It introduces a trainable visual expert module into each layer of a frozen LLM, enabling deep fusion of image and text features. This approach achieves state-of-the-art results on numerous vision-language benchmarks without altering the original language model's parameters.
Fast Transformer Decoding: One Write-Head is All You Need (Multi-Query Attention)
by Google Brain
Introduces Multi-Query Attention (MQA), an efficient attention mechanism for autoregressive decoding. By sharing a single key and value head across all query heads, MQA drastically reduces the size of the KV cache. This leads to significant memory bandwidth savings and faster inference speeds with minimal impact on model quality.
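The memory saving is easy to see with back-of-the-envelope KV-cache arithmetic; the parameter values below are illustrative, not taken from the paper.

```python
def kv_cache_bytes(layers, heads_kv, head_dim, seq_len, batch, bytes_per=2):
    """KV-cache size: 2 tensors (K and V) per layer, each of shape
    (batch, heads_kv, seq_len, head_dim), at bytes_per bytes per element
    (2 for fp16). MQA sets heads_kv = 1."""
    return 2 * layers * batch * heads_kv * seq_len * head_dim * bytes_per

# Hypothetical 32-layer model with 32 heads of dim 128 at 4096 tokens:
mha = kv_cache_bytes(layers=32, heads_kv=32, head_dim=128, seq_len=4096, batch=1)
mqa = kv_cache_bytes(layers=32, heads_kv=1, head_dim=128, seq_len=4096, batch=1)
# sharing one KV head shrinks the cache by the head count (here 32x)
```

Since autoregressive decoding is typically bound by the memory bandwidth needed to stream the KV cache each step, this reduction translates almost directly into faster generation.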
NVIDIA AI
by NVIDIA
NVIDIA AI provides a comprehensive suite of hardware and software solutions for accelerating AI development and deployment. Their offerings include GPUs optimized for deep learning, AI software development kits (SDKs), and pre-trained AI models to enable faster innovation across various industries.
Amazon SageMaker
by Amazon Web Services (AWS)
Amazon SageMaker is a fully managed machine learning service that enables data scientists and developers to build, train, and deploy machine learning models quickly. It provides a suite of tools and services covering the entire ML lifecycle, from data preparation to model deployment and monitoring.
Databricks
by Databricks
Databricks is a unified data analytics platform built on Apache Spark, providing tools for data engineering, data science, and machine learning. It enables organizations to process large datasets, build and deploy ML models, and collaborate across teams.
AssemblyAI
by AssemblyAI
AssemblyAI provides a Speech-to-Text API that allows developers to transcribe audio and video files with high accuracy. Their platform offers features like speaker diarization, sentiment analysis, and content moderation, making it a comprehensive solution for audio intelligence.
Hugging Face
by Hugging Face
Hugging Face is the GitHub of AI, providing the world's largest open model hub, dataset repository, and ML collaboration platform. Its Transformers library is the de facto standard for working with open-weight models, and the Hugging Face Hub hosts hundreds of thousands of models and datasets. Its Spaces platform allows AI demos to be deployed instantly.
Amazon Web Services AI
by Amazon
Amazon Web Services is the world's largest cloud provider and offers the most comprehensive set of AI and machine learning services, including Amazon Bedrock for managed foundation model APIs, SageMaker for MLOps, Rekognition for computer vision, and Polly and Transcribe for speech. Bedrock gives enterprises access to models from Anthropic, Meta, Mistral AI, Cohere, and others through a unified API.
LangChain Inc
by LangChain Inc
LangChain Inc is the company behind the most widely adopted LLM orchestration framework in the AI ecosystem. LangChain provides composable abstractions for building LLM-powered applications, while its LangSmith platform offers observability and evaluation tooling, and LangGraph enables the construction of stateful, multi-actor agent workflows.
Microsoft Azure AI
by Microsoft
Microsoft Azure AI is the AI services division of Microsoft's cloud platform, uniquely positioned as the exclusive cloud partner of OpenAI. Through Azure OpenAI Service, enterprises access GPT-4, DALL-E, and Whisper with enterprise-grade compliance and data residency guarantees. Microsoft has deeply integrated AI across its product suite including Copilot for Microsoft 365, GitHub Copilot, and Azure AI Foundry.
Google Cloud AI
by Google
Google Cloud AI provides enterprise access to Google DeepMind's Gemini models and a comprehensive suite of managed AI services via Vertex AI. As the creator of the Transformer architecture and TensorFlow, Google Cloud offers unmatched AI infrastructure including custom TPUs, a full MLOps platform, and pre-built APIs for vision, speech, and natural language processing.
Graphcore
by Graphcore
Graphcore is a semiconductor company that develops Intelligence Processing Units (IPUs), a type of microprocessor designed specifically for AI and machine learning workloads. Their IPUs are designed to accelerate training and inference for complex AI models, offering an alternative to GPUs.
Pinecone Systems
by Pinecone
Pinecone is the leading managed vector database, purpose-built for AI applications requiring similarity search at scale. It powers retrieval-augmented generation, semantic search, and recommendation systems for thousands of enterprises. Pinecone's serverless architecture eliminates infrastructure management while delivering sub-millisecond query performance.
LMSYS
by LMSYS / UC Berkeley
LMSYS (Large Model Systems Organization) is a research collective from UC Berkeley known for creating Chatbot Arena—the leading human preference-based LLM evaluation leaderboard—and developing high-performance open-source inference systems including vLLM and FastChat. LMSYS research on Elo-based evaluation and serving efficiency has become foundational to the field.
EleutherAI
by EleutherAI
EleutherAI is a decentralized open-source AI research collective best known for training and releasing the GPT-Neo, GPT-J, GPT-NeoX, and Pythia model families, as well as developing the LM Evaluation Harness—the standard benchmarking framework for language models. The organization operates as a grassroots nonprofit committed to open and reproducible AI research.
Allen Institute for AI (AI2)
by Allen Institute for AI
The Allen Institute for AI (AI2) is a nonprofit research institute focused on high-impact, open-source AI. Founded by Paul Allen, it produces foundational models like OLMo, open datasets such as Dolma, and reasoning benchmarks such as ARC. Its Semantic Scholar platform provides AI-powered discovery across 200M+ academic papers.
Scale AI
by Scale AI
Scale AI is the leading AI data platform providing high-quality training data labeling, RLHF pipelines, and model evaluation services for frontier AI labs, government agencies, and Fortune 500 enterprises. Its Rapid platform and data engine power training datasets for many leading language and vision models.
ElevenLabs
by ElevenLabs
ElevenLabs is a voice technology research company developing advanced text-to-speech and voice cloning software. Their platform allows users to generate high-quality spoken audio in numerous languages, create custom AI voices, or clone existing ones. It is widely used for audiobooks, video games, and content creation.
LAION
by LAION
LAION (Large-scale Artificial Intelligence Open Network) is a German nonprofit that creates and releases massive open datasets for AI research. Its most notable contribution, LAION-5B, is a dataset of 5.85 billion image-text pairs that was pivotal in training foundational models like Stable Diffusion.
Perplexity AI
by Perplexity AI
Perplexity AI is an answer engine that combines real-time web search with large language model reasoning to deliver cited, conversational responses. Founded in 2022, it has rapidly grown to tens of millions of monthly active users and positions itself as an AI-native alternative to traditional search engines.
Weights & Biases
by Weights & Biases
Weights & Biases (W&B) is a leading MLOps platform for developers, specializing in experiment tracking, model evaluation, and dataset versioning. It provides tools to visualize model performance, manage datasets, and collaborate on machine learning projects, integrating with popular frameworks like PyTorch and TensorFlow.
Runway ML
by Runway ML
Runway is an applied AI research company focused on building multimodal AI systems for art, entertainment, and human creativity. It provides a suite of web-based tools for generative content creation, including industry-leading text-to-video, image-to-video, and AI-powered video editing features for creative professionals.
Character AI
by Character AI
Character AI is a consumer platform for creating and interacting with AI-powered characters. Users can engage in conversations for entertainment, role-playing, and creative exploration. It has become a major consumer AI application with a massive user base, focusing on personalized and immersive chat experiences.
Stability AI
by Stability AI
Stability AI is a generative AI company known for developing the popular open-source Stable Diffusion text-to-image model. They focus on creating open, multi-modal AI models for image, language, audio, and video generation, which are accessible via APIs and as downloadable weights for custom implementation.
Groq
by Groq
Groq is a semiconductor company that developed the Language Processing Unit (LPU), a custom chip for ultra-fast AI inference. Their managed API provides some of the fastest publicly available LLM inference speeds, often exceeding 800 tokens/second, making it ideal for latency-sensitive applications.
Weaviate
by Weaviate
Weaviate is an open-source vector database designed for AI-native applications. It enables flexible hybrid search, combining vector and keyword methods, and uniquely supports multi-modal data like text, images, and audio. Weaviate offers both self-hosting for maximum control and a managed cloud service for ease of use.
BigCode Project
by BigCode / Hugging Face / ServiceNow
BigCode is an open scientific collaboration by Hugging Face and ServiceNow for the responsible development of large language models (LLMs) for code. The project produced the StarCoder and StarCoder2 models, trained on 'The Stack' dataset, with a strong emphasis on ethical data governance, source attribution, and consent.
BigScience
by BigScience / Hugging Face
BigScience was a year-long, open research collaboration involving over 1,000 volunteer researchers, organized by Hugging Face. This global effort focused on the transparent and ethical development of large language models, culminating in the creation of BLOOM, a 176-billion parameter open-access multilingual model.
Together AI
by Together AI
Together AI provides a high-performance cloud inference platform for open-source models, offering one of the fastest and most cost-effective APIs for running models like Llama, Mistral, and DeepSeek. Its Together Inference platform specializes in speculative decoding and model parallelism techniques, and also offers managed fine-tuning and custom model deployment.
Synthesia
by Synthesia
Synthesia is an enterprise AI video generation platform that enables users to create professional-quality videos featuring realistic AI avatars from text scripts, without cameras, actors, or studios. Serving thousands of enterprise customers including Accenture, BBC, and Reuters, it is the leading platform for scalable AI-generated corporate video content.
Jasper AI
by Jasper AI
Jasper AI is an enterprise-grade AI content platform designed for marketing teams to produce brand-consistent copy, campaigns, and creative assets at scale. It integrates with brand voice guidelines, company knowledge bases, and major marketing workflows to maintain tone consistency across channels.
Casetext
by Casetext / Thomson Reuters
Casetext was a pioneer in AI-powered legal research and drafting, launching CoCounsel—the first AI legal assistant powered by GPT-4—before being acquired by Thomson Reuters in 2023 for $650M. Its technology is now integrated into Westlaw and Practical Law, making AI legal assistance available to millions of legal professionals.
Anyscale
by Anyscale
Anyscale is the company behind Ray, the open-source distributed computing framework that has become the infrastructure backbone for training and serving large-scale AI at companies like OpenAI, Uber, and Spotify. Anyscale provides a managed platform for Ray workloads, including Anyscale Endpoints for scalable LLM inference and RayLLM for open-model serving.
Replicate
by Replicate
Replicate is a cloud platform that makes it trivial to run open-source machine learning models via a simple API with pay-per-second billing. It hosts thousands of community models spanning image generation, video, audio, and language, and allows developers to package custom models with its open-source Cog tool and deploy them without managing any GPU infrastructure.
Labelbox
by Labelbox
Labelbox is an enterprise data-curation and annotation platform that streamlines the creation of high-quality training datasets for computer vision, NLP, and multimodal AI models. It provides annotation tooling, quality workflows, model-assisted labeling, and a managed workforce marketplace.
Harvey AI
by Harvey AI
Harvey AI is an enterprise legal AI platform built on foundation models fine-tuned on legal corpora to assist law firms and corporate legal departments with research, drafting, due diligence, and contract analysis. It is deployed at leading global law firms and backed by OpenAI, positioning itself as the AI layer for professional legal services.
Cerebras Systems
by Cerebras Systems
Cerebras Systems designs and manufactures the Wafer Scale Engine (WSE), the world's largest AI chip, enabling ultra-fast LLM training and inference at speeds far exceeding GPU clusters. Its CS-3 system and Cerebras Inference cloud service deliver token generation rates of 2,000+ tokens/second for leading open-weight models.
BentoML
by BentoML
BentoML is an open-source platform for building, shipping, and scaling AI applications and model inference services, providing a unified framework from local development to cloud production. BentoCloud, its managed service, offers one-click deployment, auto-scaling, and observability for ML teams.
Nomic AI
by Nomic AI
Nomic AI builds open, auditable AI systems focused on embedding models and large-scale data visualization, most notably the nomic-embed-text model and Atlas—a platform for exploring and understanding massive datasets through interactive AI-powered maps. The company emphasizes transparency and reproducibility in model development.
Modal
by Modal Labs
Modal is a serverless cloud platform purpose-built for running GPU-intensive Python workloads including ML inference, fine-tuning, and batch processing without managing infrastructure. Developers define compute requirements in Python decorators and Modal handles container orchestration, scaling, and cold-start optimization.
Fireworks AI
by Fireworks AI
Fireworks AI is a production inference platform founded by ex-Google Brain researchers, offering fast and reliable serving for open-weight models with enterprise SLAs. Fireworks specializes in compound AI systems, function calling, and JSON-mode inference, and provides FireFunction—its own fine-tuned function-calling model—alongside hosting for Llama, Mistral, and other popular open models.
PathAI
by PathAI
PathAI develops AI-powered pathology solutions that enable more accurate cancer diagnosis, biomarker assessment, and drug development support by analyzing histopathology images at scale. Its AISight platform is deployed in clinical laboratories and pharmaceutical research, improving diagnostic consistency and accelerating oncology trials.
Snorkel AI
by Snorkel AI
Snorkel AI commercializes weak supervision and programmatic data development research from Stanford AI Lab, enabling teams to build, manage, and iterate on AI training datasets programmatically at scale. Its platform reduces reliance on manual labeling by using labeling functions and foundation model assistance.
IBM Watson / watsonx
by IBM
IBM Watson, now branded as IBM watsonx, is IBM's enterprise AI platform offering governed, trustworthy AI for regulated industries. The watsonx.ai studio, watsonx.data lakehouse, and watsonx.governance suite provide a complete enterprise AI development and deployment pipeline with strong emphasis on explainability, fairness, and compliance for sectors like finance, healthcare, and government.
Oracle AI
by Oracle
Oracle AI provides a suite of generative AI services built into Oracle Cloud Infrastructure (OCI), including the OCI Generative AI Service powered by Cohere and Meta models. Oracle has uniquely integrated AI capabilities directly into its database (Oracle Database 23ai), ERP, and industry cloud offerings, targeting enterprises with existing Oracle relationships.
Zhipu AI (GLM)
by Zhipu AI
Zhipu AI is a Chinese AI company spun out of Tsinghua University's KEG Lab, known for the GLM (General Language Model) series. Its ChatGLM models were among the first high-quality open Chinese language models and have been widely adopted in Chinese industry and research communities.
Adept AI
by Adept AI
Adept AI builds AI systems that can take actions in software to complete complex multi-step workflows on behalf of users. The company focuses on general-purpose action models trained to interact with real-world software interfaces through browser and desktop automation.
Recursion Pharmaceuticals
by Recursion Pharmaceuticals
Recursion Pharmaceuticals is a clinical-stage techbio company that combines automated biology, large-scale imaging, and machine learning to industrialize drug discovery, operating one of the largest biological datasets in the industry. Its Recursion OS platform maps biological relationships at unprecedented scale to identify novel therapeutic targets and drug candidates.
Helicone
by Helicone
Helicone is an open-source LLM observability and monitoring platform that provides a single proxy endpoint for logging, tracking costs, debugging, and improving LLM applications across all major model providers. It integrates with a one-line code change and supports caching, rate limiting, and prompt management.
Insilico Medicine
by Insilico Medicine
Insilico Medicine is an AI-driven drug discovery company that has become the first to advance an AI-designed small molecule into Phase II clinical trials, demonstrating end-to-end AI-powered drug development from target identification through IND. Its Chemistry42 and PandaOmics platforms generatively design and screen drug candidates.
SambaNova Systems
by SambaNova Systems
SambaNova Systems builds reconfigurable AI hardware and software solutions optimized for enterprise-scale LLM training and inference, offering its Samba-1 model and SambaNova Cloud API as commercial services. The company's Reconfigurable Dataflow Unit (RDU) architecture is designed specifically for deep learning workloads.
xAI
by xAI
xAI is Elon Musk's AI company and creator of the Grok model family. It provides API access to Grok models with real-time web search integration, available through the xAI API and X (Twitter) platform. Grok models are trained on a broad mix of web and social data and emphasize up-to-date knowledge and a less restrictive response style.
Vast.ai
by Vast.ai
Vast.ai is a peer-to-peer GPU marketplace connecting researchers and startups with spare GPU capacity from data centers and individuals worldwide. It offers some of the cheapest GPU rental prices on the market with flexibility to choose hardware by price, latency, or reliability score. Best suited for cost-sensitive experimentation and training runs.
Together AI (GPU Compute)
by Together AI
Together AI's compute platform provides on-demand and reserved GPU clusters for training and fine-tuning open-source models. It offers H100 and A100 clusters with high-bandwidth networking optimized for distributed training runs, serving as both a GPU cloud provider and an inference platform. Teams use Together AI compute to run multi-node training jobs on Llama and Mistral variants.
Together AI
by Together AI
Together AI provides a cloud platform for running, fine-tuning, and deploying open-source language models. It hosts a wide catalog of models from Llama to Mistral and offers serverless inference, dedicated endpoints, and a fine-tuning pipeline. Together AI is popular among developers who want OpenAI-compatible APIs for open-weight models at competitive pricing.
SambaNova
by SambaNova Systems
SambaNova Systems builds custom AI hardware (Reconfigurable Dataflow Units) and offers cloud inference via SambaNova Cloud. It delivers some of the highest inference throughput for large models, including Meta's Llama family, targeting enterprises that need predictable, high-throughput inference at scale.
RunPod
by RunPod
RunPod is a community-driven GPU cloud marketplace offering some of the lowest per-hour prices for NVIDIA and AMD GPUs. It enables developers to rent GPU compute from a distributed network of data centers and deploy containerized workloads instantly. RunPod supports serverless GPU endpoints, making it popular for open-source model inference.
Replicate
by Replicate
Replicate is a platform for running machine learning models in the cloud via a simple API. It hosts thousands of open-source models for image generation, language, audio, and video, deployable with a single API call. Replicate charges per-second of GPU usage and supports deploying custom models as private or public endpoints.
OpenAI
by OpenAI
OpenAI is the leading AI research and deployment company behind the GPT and o-series model families. It offers API access to frontier language models, image generation via DALL-E, speech recognition via Whisper, and an Assistants API for building stateful agent workflows. OpenAI operates both a consumer product (ChatGPT) and an enterprise API platform used by millions of developers.
Modal
by Modal Labs
Modal is a cloud compute platform for running GPU workloads from Python, with a focus on developer ergonomics and serverless scaling. It allows deploying Python functions as GPU-accelerated endpoints with zero infrastructure configuration, automatic scaling to zero, and fast cold-start times. Popular for ML inference, batch jobs, and LLM serving.
Mistral AI
by Mistral AI
Mistral AI is a French AI company known for publishing high-efficiency open-weight models alongside its commercial API offerings. The Mistral and Mixtral model families deliver strong benchmark performance at a fraction of the compute cost of larger models. Mistral's La Plateforme API provides access to both its open-weight and proprietary models.
Meta AI
by Meta
Meta AI is Meta's AI research division, responsible for the Llama model family. Llama 4 and its variants are released under open-weight licenses, enabling local deployment, fine-tuning, and commercial use. Meta provides model weights via Hugging Face and its own download portal, making it the dominant open-weights LLM ecosystem.
Lambda Labs
by Lambda Labs
Lambda Labs provides cloud GPU instances and on-premises GPU servers targeted at AI researchers and ML engineers. Its Lambda Cloud offers on-demand and reserved NVIDIA H100 and A100 instances at competitive rates with a simple developer-friendly interface. Lambda also sells GPU workstations and servers for local development.
Groq
by Groq
Groq offers ultra-low-latency LLM inference through its custom Language Processing Unit (LPU) hardware. The GroqCloud API serves open-weight models including Llama, Mixtral, and Gemma at speeds that far exceed GPU-based inference, making it ideal for real-time agent applications. Groq provides a developer-friendly API compatible with the OpenAI client format.
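Since GroqCloud speaks the OpenAI streaming format, a client consumes Server-Sent Events whose `data:` payloads carry token deltas. A toy parser for that wire format — the event shape shown is the standard OpenAI streaming one, assumed here to apply unchanged:

```python
import json

def parse_sse_chunk(line: str):
    """Extract the token delta from one OpenAI-style SSE line, or None."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None
    event = json.loads(payload)
    return event["choices"][0]["delta"].get("content")

stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
tokens = [t for t in (parse_sse_chunk(line) for line in stream) if t]
print("".join(tokens))  # Hello
```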
Google DeepMind
by Google DeepMind
Google DeepMind is the unified AI research division behind the Gemini model family. It offers API access through Google AI Studio and Vertex AI, covering multimodal reasoning, code generation, long-context understanding up to 2M tokens, and tight integration with Google Cloud services. DeepMind also publishes foundational research in reinforcement learning and scientific AI.
Google Cloud (GPU)
by Google Cloud
Google Cloud offers A100, H100, and TPU v5 instances for AI training and inference via Compute Engine and Vertex AI. Google Cloud's TPU pods provide a distinct competitive advantage for training large models efficiently, while its A3 instances with H100s target demanding training and inference workloads. Deep integration with Vertex AI simplifies the MLOps lifecycle.
FluidStack
by FluidStack
FluidStack aggregates spare GPU capacity from data centers globally, providing an on-demand cloud GPU rental marketplace at competitive rates. It offers H100, A100, and RTX GPU clusters for training and inference with an API-driven provisioning model. FluidStack is used by AI startups for burst compute and cost-efficient long-running training jobs.
Fireworks AI
by Fireworks AI
Fireworks AI specializes in fast, cost-efficient inference for open-source models including Llama, Mistral, and Mixtral families. It offers serverless and on-demand deployment with a focus on production reliability. Fireworks provides an OpenAI-compatible API and supports compound AI systems through its FireFunction tool-calling models.
DeepSeek
by DeepSeek
DeepSeek is a Chinese AI lab that has released competitive open-weight models rivaling frontier closed models at dramatically lower training costs. DeepSeek R1 and V3 demonstrated that mixture-of-experts and reinforcement learning at scale can close the gap with GPT-4-class models. Models are freely available via Hugging Face and a low-cost API.
CoreWeave
by CoreWeave
CoreWeave is a specialized cloud infrastructure provider built exclusively for GPU-intensive AI and ML workloads. It offers on-demand and reserved access to NVIDIA H100, A100, and H200 clusters with high-bandwidth InfiniBand networking. CoreWeave is trusted by AI labs and enterprises for large-scale model training and inference at competitive pricing.
Cohere
by Cohere
Cohere is an enterprise-focused AI company specializing in language models optimized for business applications including search, retrieval-augmented generation, and text classification. Its Command and Embed model families are widely used in enterprise RAG pipelines. Cohere offers private cloud and on-premises deployment options alongside its API.
Cerebras Inference
by Cerebras Systems
Cerebras provides cloud inference powered by its Wafer-Scale Engine (WSE) chip, delivering some of the highest token throughput for large language models. Cerebras Inference serves Llama and other open-weight models with hardware-level advantages that push tokens-per-second beyond what GPU clusters can achieve for certain model sizes.
Baseten
by Baseten
Baseten is a model inference platform for deploying ML models to production with high performance and reliability. It specializes in low-latency serving of open-source LLMs and diffusion models with features like continuous batching, LoRA serving, and speculative decoding. Baseten targets teams that need production-grade inference without managing Kubernetes.
Azure (GPU)
by Microsoft Azure
Microsoft Azure provides ND H100 v5 and NCv3 GPU instances for AI model training and inference, with tight integration into Azure AI Studio, Azure OpenAI Service, and GitHub Copilot infrastructure. Azure is the preferred cloud for enterprises with Microsoft licensing agreements and provides access to OpenAI models via Azure OpenAI Service.
AWS EC2 (GPU)
by Amazon Web Services
Amazon EC2 provides GPU instances (P4, P5, G5, Inf2 families) for AI/ML training and inference at any scale. As the largest cloud provider, AWS offers the broadest ecosystem of managed ML services including SageMaker, Bedrock, and Trainium-based Inf2 instances. Best for enterprises requiring deep AWS integration and compliance certifications.
Anthropic
by Anthropic
Anthropic is an AI safety company and the creator of the Claude model family. Its API provides access to Claude Opus, Sonnet, and Haiku variants, with strong support for long-context reasoning, tool use, and multi-agent workflows via the Claude Agent SDK. Anthropic publishes extensive safety research and pioneered Constitutional AI alignment techniques.
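Tool use works by declaring JSON-schema-typed tools in the Messages API request; the model then responds with structured tool-call blocks. A sketch of the request body using only the standard library — `get_weather` is a hypothetical tool and the model name is a placeholder; the field names follow the publicly documented Messages API shape:

```python
def build_tool_call_request(model: str, prompt: str) -> dict:
    """Messages API body declaring one tool the model may choose to call."""
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "name": "get_weather",  # hypothetical example tool
                "description": "Return the current weather for a city.",
                "input_schema": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ],
    }

req = build_tool_call_request("claude-sonnet-placeholder", "Weather in Paris?")
```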
Alibaba / Qwen
by Alibaba Cloud
Alibaba Cloud's Qwen team releases the Qwen model series, a family of open-weight and API-accessible language models covering dense and mixture-of-experts architectures. Qwen models are competitive on multilingual and coding benchmarks and are available through Alibaba Cloud's DashScope API as well as Hugging Face for local deployment.
AI21 Labs
by AI21 Labs
AI21 Labs is an Israeli AI company known for the Jamba model family, which uses a hybrid SSM-Transformer architecture for long-context efficiency. Its Wordtune product targets writing assistance while the API focuses on enterprise NLP tasks. Jamba 1.6 offers a unique balance of long-context window handling and low inference latency.
01.AI (Yi)
by 01.AI
01.AI is a Chinese AI startup founded by Kai-Fu Lee and the creator of the Yi series of bilingual large language models. Yi models are released as open weights under permissive licenses and have demonstrated strong performance on multilingual benchmarks, positioning 01.AI as a key contributor to the open-source AI ecosystem.
Figure AI
by Figure AI
Figure AI is building general-purpose humanoid robots designed to perform physical labor in warehouses, factories, and logistics environments, powered by a neural network trained with visual data and language models. Its Figure 02 robot, piloted in partnership with BMW and backed by OpenAI, Microsoft, and NVIDIA, is one of the most advanced commercially deployed humanoid platforms.
Lepton AI
by Lepton AI
Lepton AI provides a serverless cloud platform for running open-source AI models and custom workloads with a Pythonic SDK, eliminating infrastructure management overhead for ML teams. Founded by ex-Meta researchers, the platform supports fine-tuning, deployment, and monitoring of models with pay-per-use pricing.
Baichuan
by Baichuan
Baichuan Intelligence is a Chinese AI startup founded by Wang Xiaochuan, the former Sogou CEO, specializing in large language models with applications in healthcare and enterprise workflows. Its Baichuan 2 model series is notable for strong Chinese-language performance and vertical-specific fine-tuning capabilities.
Cerebras
by Cerebras Systems
AI compute provider with wafer-scale chips delivering record-breaking inference speeds for LLMs.
Inflection AI
by Inflection AI
Inflection AI was co-founded by Mustafa Suleyman (ex-DeepMind) and Reid Hoffman, initially building the Pi personal AI assistant. After a major leadership transition to Microsoft in 2024, the remaining company pivoted to enterprise AI services, offering its Inflection 3 model and AI consulting for large organizations.
Mozilla AI
by Mozilla
Mozilla AI is a startup launched by the Mozilla Foundation to build open, trustworthy AI tools and advocate for responsible AI development as a counterweight to closed proprietary systems. The organization releases tools like Lumigator (LLM evaluation) and contributes to open-source AI infrastructure aligned with the open web.
AMD Instinct MI350X
by AMD
The AMD Instinct MI350X is a data center GPU designed for high-performance computing and AI workloads. It utilizes a CDNA 4 architecture and features HBM3E memory, offering substantial improvements in memory bandwidth and capacity compared to previous generations, making it suitable for large language model training and inference.
NVIDIA RTX 4090
by NVIDIA
NVIDIA's flagship consumer GPU based on Ada Lovelace. Has become popular for local LLM inference and fine-tuning due to its 24GB GDDR6X memory and high performance-per-dollar ratio, enabling on-premise AI workloads without data center costs.
AMD Instinct MI400A
by Advanced Micro Devices (AMD)
The AMD Instinct MI400A is a data center accelerator designed for high-performance computing and AI workloads. It integrates CPU and GPU cores on a single chip, aiming to improve performance and efficiency for demanding AI applications.
Cerebras Wafer Scale Engine 4 (WSE-4)
by Cerebras Systems
The Cerebras WSE-4 is the fourth generation wafer-scale processor designed specifically for AI compute. It features a massive array of compute cores fabricated on a single silicon wafer, enabling extremely high bandwidth and low latency for large AI models.
AMD Instinct MI400 Series
by Advanced Micro Devices (AMD)
The AMD Instinct MI400 series is a family of data center GPUs designed for high-performance computing and AI workloads. It leverages AMD's next-generation CDNA architecture and offers significant improvements in performance and energy efficiency compared to previous generations, targeting large-scale AI training and inference.
NVIDIA DGX H100
by NVIDIA
The NVIDIA DGX H100 is a purpose-built AI supercomputer, serving as the foundational building block for large-scale AI infrastructure. It integrates eight H100 Tensor Core GPUs with high-speed NVLink interconnects, providing a turnkey solution for the most demanding AI training, inference, and data analytics workloads.
Tesla Dojo D2 Chip
by Tesla
The Tesla Dojo D2 chip is a custom-designed AI accelerator developed by Tesla for training large-scale neural networks used in autonomous driving. It is a key component of Tesla's Dojo supercomputer, aimed at improving the efficiency and speed of AI model training.
NVIDIA B100
by NVIDIA
The NVIDIA B100 is a data center GPU based on the Blackwell architecture, succeeding the H100. It offers substantial performance improvements for AI training and inference, featuring a second-generation Transformer Engine with FP4 precision, and a fifth-generation NVLink interconnect for massive multi-GPU scaling.
NVIDIA Jetson AGX Orin
by NVIDIA
The NVIDIA Jetson AGX Orin is a high-performance System-on-Module (SoM) designed for edge AI and autonomous machines. It delivers up to 275 TOPS of AI performance, integrating an NVIDIA Ampere architecture GPU with Arm CPUs and deep learning accelerators for server-class computing in a power-efficient package.
Graphcore Bow Pod2024
by Graphcore
The Graphcore Bow Pod2024 is a modular AI compute system built for large-scale machine learning. It utilizes Graphcore's Intelligence Processing Units (IPUs) and is specifically engineered to accelerate sparse models, such as graph neural networks and large language models, in data center environments.
Tenstorrent Wormhole GF12
by Tenstorrent
The Tenstorrent Wormhole GF12 is a high-performance AI accelerator built on GlobalFoundries' 12nm process. It features a grid of programmable Tensix cores, RISC-V CPUs, and a high-speed Ethernet fabric for direct chip-to-chip communication, enabling scalable systems for both AI training and inference workloads.
d-Matrix Corsair
by d-Matrix
The d-Matrix Corsair is an in-memory compute platform designed to accelerate AI inference workloads. It leverages digital in-memory compute to achieve high energy efficiency and low latency, targeting applications like recommendation engines and generative AI.
NVIDIA A10G
by NVIDIA
NVIDIA Ampere GPU optimized for graphics and inference workloads. Commonly deployed in AWS G5 instances, offering a cost-effective option for inference, graphics rendering, and video processing at cloud scale.
NVIDIA V100
by NVIDIA
NVIDIA Volta architecture GPU that introduced Tensor Cores to the data center, providing the first dedicated matrix multiply hardware for AI. Powered the first wave of transformer model training including BERT and GPT-2, and became the dominant AI training platform from 2017–2020.
NVIDIA L40S
by NVIDIA
The NVIDIA L40S is a universal data center GPU based on the Ada Lovelace architecture. It features 48GB of GDDR6 memory and combines powerful AI compute, graphics, and media acceleration capabilities, making it a versatile solution for a wide range of workloads from generative AI to professional visualization.
Apple M4 Ultra Neural Engine
by Apple
Apple M4 Ultra's 32-core Neural Engine capable of 38 TOPS, embedded in Apple's highest-end desktop and workstation chips. Combined with up to 192GB unified memory shared between CPU, GPU, and Neural Engine, it enables running large models locally on macOS with exceptional energy efficiency.
Graphcore Bow Pod1024
by Graphcore
The Graphcore Bow Pod1024 is a supercomputing-scale AI system, delivering over 250 PetaFLOPS of AI compute. It leverages 1,024 Bow IPU processors linked by a high-bandwidth fabric, specifically engineered for training massive, next-generation AI models and complex graph analytics workloads at an unprecedented scale.
NVIDIA GB200 NVL72
by NVIDIA
The NVIDIA GB200 NVL72 is a liquid-cooled, rack-scale system designed for exascale AI. It connects 36 Grace Blackwell Superchips, comprising 72 B200 GPUs and 36 Grace CPUs, via fifth-generation NVLink to function as a single massive GPU for training and inferencing on trillion-parameter models with unprecedented performance and energy efficiency.
Google TPU v5p
by Google
Google's fifth-generation Tensor Processing Unit, the TPU v5p, is an AI accelerator designed for training and serving the largest AI models. It offers significant performance gains over its predecessor, featuring liquid cooling, 95 GB of HBM, and support for new data formats like MX4 for enhanced efficiency and scalability in massive pod configurations.
Google TPU v4
by Google
Google's fourth-generation TPU, used internally to train PaLM, LaMDA, and early Gemini models. Features 32GB HBM2 per chip and an optical circuit-switched ICI for flexible pod topology, enabling massive-scale distributed training.
NVIDIA Jetson Orin NX
by NVIDIA
Compact Orin-based Jetson module delivering up to 100 TOPS in a small form factor. Targets robotics, drones, medical devices, and industrial edge AI applications requiring significant AI performance in constrained size, weight, and power envelopes.
Google TPU v5e
by Google
Google's cost-efficient TPU variant optimized for inference and medium-scale training. Offers a better price-performance ratio than TPU v5p for serving workloads, with 16GB HBM2 per chip and excellent throughput for transformer inference.
Google TPU v6 (Trillium)
by Google
Google's sixth-generation TPU, codenamed Trillium, delivering 4.7x compute improvement over TPU v5e. Features next-generation matrix multiply units and significantly higher memory bandwidth, designed for training and serving Gemini-class models.
AWS Inferentia2
by AWS
AWS second-generation custom inference chip with up to 4x higher throughput and 10x lower latency than Inferentia1. Optimized for cost-efficient, large-scale inference of transformer models.
NVIDIA P100
by NVIDIA
NVIDIA Pascal architecture GPU and the first to use HBM2 memory in a data center product. Delivered 10x deep learning performance over its predecessor and was the primary platform for training early deep learning models before the Volta generation.
Google Tensor G4
by Google
Google's fourth-generation Tensor chip powering Pixel 9 smartphones. Features a dedicated TPU-derived neural core enabling on-device Gemini Nano inference for features like live captions, call screening, and generative AI photography without cloud latency.
Intel Meteor Lake NPU
by Intel
Intel's first dedicated Neural Processing Unit embedded in Core Ultra (Meteor Lake) laptop processors. Delivers 10+ TOPS for AI inferencing on Windows AI PCs, enabling background AI workloads like live captioning, noise suppression, and on-device LLM assistance without using GPU/CPU resources.
AWS Trainium2
by AWS
AWS second-generation custom AI training chip delivering up to 4x performance improvement over Trainium. Designed specifically for training large language models on AWS, with tight integration with UltraCluster networking for scale-out training jobs.
Cerebras CS-3
by Cerebras
The Cerebras CS-3 is the AI system built around the Wafer Scale Engine 3 (WSE-3) — the world's largest chip, spanning an entire silicon wafer. The WSE-3 contains 4 trillion transistors and 44GB of on-chip SRAM, eliminating off-chip memory bandwidth as a bottleneck for training large neural networks.
Google TPU v3
by Google
Google's third-generation TPU featuring liquid cooling to sustain higher clock speeds and 32GB HBM per chip. Doubled compute and memory versus TPU v2, enabling training of BERT, T5, and early large language models. Powered many foundational AI research papers at Google Brain and DeepMind.
MediaTek Dimensity 9400 APU
by MediaTek
MediaTek Dimensity 9400's AI Processing Unit — the most powerful mobile NPU in Android smartphones. Delivers 50 TOPS for on-device AI with support for 13B parameter models on-device, enabling private, low-latency AI features for Android flagship devices.
Google TPU v7 Ironwood
by Google
Google's TPU v7 Ironwood is the seventh generation of Google's custom Tensor Processing Units, designed for large-scale AI inference at hyperscaler capacity. Ironwood pods target serving frontier models like Gemini at Google's internal scale and are available to cloud customers via Google Cloud's TPU v7 instances.
Google TPU v6e Trillium
by Google
Google TPU v6e Trillium is Google's sixth-generation TPU with 4x the compute and 3x the memory bandwidth per chip compared to v5e. Trillium is generally available on Google Cloud for both training and inference workloads, offering the most cost-efficient TPU option for teams training Gemma and other open models on Google Cloud.
SambaNova SN40L RDU
by SambaNova Systems
SambaNova's SN40L is a Reconfigurable Dataflow Unit designed for high-throughput LLM inference and training. Its tiered memory architecture — combining on-chip SRAM with off-chip DRAM — allows serving multiple large models simultaneously with industry-leading batch throughput. The SN40L is the hardware underlying SambaNova Cloud's inference API.
NVIDIA RTX 5090
by NVIDIA
The NVIDIA RTX 5090 is NVIDIA's flagship consumer/prosumer GPU in the Blackwell generation, featuring 32GB GDDR7 memory and massive compute for local AI inference and fine-tuning. It allows running 70B quantized models on a single consumer GPU and is the premier choice for developers who need frontier local model capability in a workstation.
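Whether a model fits on a single card is mostly weight-memory arithmetic: parameter count times bits per weight, plus headroom for KV cache and activations. A rough sketch — the 90% headroom factor is an assumption, and KV cache is ignored entirely:

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory only -- ignores KV cache and activations."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

def fits_on_gpu(params_billion, bits_per_weight, vram_gib, headroom=0.9):
    """Crude fit check against usable VRAM (headroom is an assumption)."""
    return weights_gib(params_billion, bits_per_weight) <= vram_gib * headroom

# 70B at 4-bit needs ~32.6 GiB for weights alone, so on a 32GB card
# lower-bit quantization (or offloading) is what makes 70B practical.
print(round(weights_gib(70, 4), 1))
print(fits_on_gpu(70, 3, 32))
```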
NVIDIA H200
by NVIDIA
The NVIDIA H200 is a Hopper-generation GPU with 141GB of HBM3e memory — nearly double the H100's capacity, with substantially higher bandwidth — targeting inference workloads for very large models. The additional memory enables running 70B+ parameter models on fewer GPUs, significantly reducing the cost per inference token for large-scale deployments.
NVIDIA H100
by NVIDIA
The NVIDIA H100 Hopper GPU is the dominant AI training and inference accelerator in production deployments as of 2024–2025. With 80GB HBM3 memory and NVLink 4 support, it delivers several times the compute of the A100. The H100 SXM5 variant connects into 8-GPU HGX nodes via NVSwitch for large model training runs.
NVIDIA GB200 NVL72
by NVIDIA
The GB200 NVL72 is NVIDIA's rack-scale AI system combining 36 Grace CPUs and 72 Blackwell B200 GPUs via NVLink interconnect. It delivers up to 1.44 ExaFLOPS of AI compute in a single rack, targeting hyperscaler-class training of frontier models. The NVL72 represents a fundamental shift from server-level to rack-level GPU system design.
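The headline figure falls out of simple multiplication: 72 GPUs times the per-GPU throughput. A quick check, taking roughly 20 PFLOPS per B200 as the assumed low-precision vendor figure:

```python
def rack_exaflops(num_gpus: int, petaflops_per_gpu: float) -> float:
    """Aggregate rack compute in ExaFLOPS (1 EF = 1000 PF)."""
    return num_gpus * petaflops_per_gpu / 1000

# 72 GPUs x ~20 PFLOPS each (assumed) -> 1.44 ExaFLOPS per rack
print(rack_exaflops(72, 20))
```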
NVIDIA B200
by NVIDIA
The NVIDIA B200 is the first Blackwell-architecture data center GPU, delivering 2.5x the training throughput and 5x the inference performance of the H100. With 192GB of HBM3e memory and NVLink 5 interconnects, it is designed for training and serving trillion-parameter models. The B200 anchors NVIDIA's Blackwell product generation.
NVIDIA A100
by NVIDIA
The NVIDIA A100 Ampere GPU remains widely deployed in cloud and on-premises AI infrastructure for training and inference. With 40GB or 80GB HBM2e memory variants and MIG (Multi-Instance GPU) support for partitioning into up to 7 isolated GPU instances, the A100 is the proven workhorse of many production AI deployments.
Intel Gaudi 3
by Intel
Intel Gaudi 3 is Intel's AI training and inference accelerator designed as a cost-competitive alternative to the NVIDIA H100. It features 128GB of HBM2e memory and 24 200GbE RoCE ports for scale-out connectivity. Gaudi 3 is supported by Intel's Optimum Habana software stack and available via major cloud providers and on-premises.
Groq LPU
by Groq
Groq's Language Processing Unit (LPU) is a deterministic ASIC architecture optimized for sequential transformer inference, eliminating the memory-bandwidth bottlenecks of GPU-based serving. Groq LPU clusters deliver measured token generation speeds of 500+ tokens/second for Llama-class models, significantly outpacing GPU inference for latency-critical applications.
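The practical impact of token throughput is easiest to see as wall-clock time for a full response. A trivial estimator — the 50 tok/s GPU baseline used for comparison is an illustrative assumption:

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second

# A 1,000-token answer: ~2 s at 500 tok/s vs ~20 s at an assumed 50 tok/s.
print(generation_seconds(1000, 500), generation_seconds(1000, 50))
```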
Cerebras WSE-3
by Cerebras Systems
The Cerebras Wafer-Scale Engine 3 (WSE-3) is the world's largest chip, containing 4 trillion transistors on a single 46,225 mm² silicon wafer. Its architecture eliminates the memory bandwidth bottlenecks of conventional GPU clusters for large model inference, achieving industry-leading tokens-per-second throughput for models up to 70B parameters.
AWS Trainium3
by Amazon Web Services
AWS Trainium3 is Amazon's third-generation custom ML training chip, offering significant improvements in training throughput and energy efficiency over Trainium2. Trainium3 instances are available through Amazon SageMaker and EC2, targeting cost-efficient training of large language models for AWS-native AI development teams.
AMD MI325X
by AMD
The AMD Instinct MI325X is an updated Instinct GPU with 288GB of HBM3e memory and improved memory bandwidth over the MI300X. It targets inference workloads for the largest frontier models and positions AMD competitively against the NVIDIA H200 in memory-bound inference scenarios.
AMD MI300X
by AMD
The AMD Instinct MI300X is AMD's flagship AI accelerator featuring 192GB of HBM3 memory, the highest of any GPU when released. This massive memory capacity makes it compelling for inference of 70B+ parameter models and has led to adoption by Microsoft Azure, Oracle, and major AI labs as an H100 alternative.
SambaNova SN40L
by SambaNova
SambaNova's Reconfigurable Dataflow Unit with a three-tier memory hierarchy: on-chip scratchpad, on-package HBM, and off-package DRAM. The unique architecture enables running multiple models simultaneously and excels at efficient mixture-of-experts inference.
Google TPU v2
by Google
Google's second-generation TPU and the first available on Google Cloud. Added training capability (v1 was inference-only), HBM memory for gradient storage, and introduced the concept of TPU Pods — interconnected multi-chip systems enabling distributed training at scale.
Google TPU v1
by Google
Google's first Tensor Processing Unit — the seminal custom AI ASIC that launched the modern era of purpose-built ML hardware. Deployed in 2015 and described publicly in a landmark 2017 ISCA paper, it ran inference for Google Search, Maps, and Translate, delivering 30x performance-per-watt vs contemporary GPUs.
Qualcomm Cloud AI 100
by Qualcomm
Qualcomm's data center AI inference accelerator designed for power-efficient deployment. Based on the same AI architecture as Snapdragon, it delivers competitive inference performance with a focus on power efficiency metrics (TOPS/W) for hyperscale deployments.
NVIDIA K80
by NVIDIA
NVIDIA Kepler-based dual-GPU data center card that became the first widely available cloud GPU for deep learning. Google Colab's original free tier ran on K80s, making it instrumental in democratizing access to GPU-accelerated deep learning for researchers and students worldwide.
Graphcore Bow IPU
by Graphcore
Graphcore's Bow Intelligence Processing Unit using 3D wafer-on-wafer technology. Features a massively parallel MIMD architecture with 1472 processor cores and 900MB on-chip SRAM, designed for graph-structured AI workloads and sparse computation.
Graphcore MK2 IPU (Colossus GC200)
by Graphcore
Graphcore's second-generation Colossus GC200 Intelligence Processing Unit. Featured 1472 IPU-Cores with 900MB on-chip SRAM and used a Bulk Synchronous Parallel (BSP) execution model. Preceded the Bow IPU and established Graphcore's approach to graph-native, SRAM-centric AI compute.
Tenstorrent Grayskull
by Tenstorrent
Tenstorrent's first commercial AI accelerator co-designed by Jim Keller. Built on a RISC-V Tensix processor architecture with a mesh NoC, enabling programmable AI compute. Notable for its open software stack and developer-friendly approach to hardware AI.
Intel Nervana NNP-T1000
by Intel
Intel Nervana Neural Network Processor for Training — Intel's attempt at a purpose-built AI training chip following the 2016 acquisition of Nervana Systems. Featured 32GB HBM2 paired with a large on-die SRAM. Discontinued in 2020 as Intel pivoted focus to the Habana Gaudi line.
Databricks Feature Store - MLflow Integration
by Databricks
The Databricks Feature Store provides a centralized repository for managing and sharing machine learning features. Its integration with MLflow enables seamless tracking of feature usage in ML models, ensuring reproducibility and simplifying model deployment workflows by automatically packaging feature dependencies.
PyTorch Geometric
by PyTorch
PyTorch Geometric (PyG) is a library built upon PyTorch to facilitate the development of graph neural networks (GNNs). It provides data handling utilities, learning methods on graphs and other irregular structures, and benchmark datasets for various graph-related tasks.
TensorFlow Quantum
by Google
TensorFlow Quantum (TFQ) is a library for building quantum machine learning models. It allows researchers to construct and train hybrid quantum-classical models by leveraging TensorFlow's infrastructure for classical computation and quantum simulators or quantum hardware for quantum computation.
LangChain + OpenAI
by LangChain
Native integration between LangChain and OpenAI's GPT models. Provides seamless access to chat completions, embeddings, and function calling through LangChain's unified interface. Supports streaming, tool use, and structured output via the langchain-openai package.
MLflow Databricks Integration
by Databricks
The MLflow integration with Databricks provides a managed MLflow service within the Databricks platform. It simplifies the process of tracking experiments, managing models, and deploying them to production by leveraging Databricks' scalable infrastructure and collaborative environment.
GitHub Copilot + VS Code
by GitHub
GitHub Copilot integrates into VS Code as a first-party extension, delivering inline ghost-text completions, multi-line suggestions, and a dedicated Copilot Chat panel for conversational refactoring, test generation, and documentation. It leverages Codex and GPT-4 models under the hood, with workspace-aware context from open tabs and the current file.
Meta + HuggingFace (Llama)
by Meta AI
Official Meta Llama model weights distributed through the HuggingFace Hub under Meta's community license. Covers Llama 3.1, 3.2, and 3.3 variants from 1B to 405B parameters with full transformers, TGI, and vLLM compatibility. HuggingFace serves as the primary public distribution channel for Meta's open-weight releases.
LangChain + Anthropic
by LangChain
Official LangChain integration for Anthropic's Claude model family. Exposes Claude's extended context window, vision capabilities, and tool use through LangChain's standard chat model interface. Supports streaming and the full Messages API via the langchain-anthropic package.
Pinecone + OpenAI Embeddings
by Pinecone
Direct integration pairing Pinecone's managed vector database with OpenAI's text-embedding-3 models. Commonly used pattern for production RAG systems where OpenAI generates dense vectors and Pinecone handles ANN retrieval at scale. Supports serverless and pod-based indexes with metadata filtering.
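Underneath the SDKs, the pattern is: embed documents and queries into dense vectors, then rank documents by cosine similarity to the query. A toy version of that retrieval step with 2-dimensional stand-in vectors (real embeddings have 1,536+ dimensions, and Pinecone performs this ranking approximately, at scale):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, corpus, k=2):
    """Return the k document ids most similar to the query vector."""
    scored = sorted(corpus.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
print(top_k([1.0, 0.0], docs, k=2))  # ['a', 'b']
```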
W&B + Hugging Face
by Weights & Biases
Weights & Biases integrates directly into Hugging Face Trainer and PEFT via a built-in report_to callback, logging training loss curves, GPU utilization, gradient norms, and hyperparameters to shareable W&B runs. The integration supports sweep-based hyperparameter optimization and artifact versioning for model checkpoints.
TensorFlow Privacy
by Google
TensorFlow Privacy is a library that makes it easier to train machine learning models with differential privacy. It provides TensorFlow optimizers that implement differentially private stochastic gradient descent (DP-SGD), allowing developers to protect the privacy of training data while still achieving good model performance.
vLLM + NVIDIA
by vLLM Project
vLLM's NVIDIA backend leverages CUDA kernels, FlashAttention-2, and PagedAttention to deliver state-of-the-art throughput for LLM inference on NVIDIA A100, H100, and H200 GPUs. The integration supports tensor and pipeline parallelism across multiple GPUs, FP8 quantization alongside FP16/BF16 precision, and CUDA graph capture for minimal per-token latency.
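PagedAttention's core idea is to carve the KV cache into fixed-size blocks allocated on demand from a shared pool, so each sequence maps logical token positions to physical blocks through a block table instead of reserving worst-case contiguous memory. A deliberately simplified toy of that bookkeeping — block size and pool size here are arbitrary, and real vLLM tracks far more state:

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockPool:
    """Toy paged KV-cache allocator: sequences share one free-block pool."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: str, position: int) -> list:
        """Grab a fresh block whenever a sequence crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:
            table.append(self.free.pop())
        return table

pool = BlockPool(num_blocks=8)
for pos in range(40):                 # decode a 40-token sequence
    pool.append_token("seq-1", pos)
print(len(pool.tables["seq-1"]), len(pool.free))  # 3 blocks used, 5 free
```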
LangSmith + LangChain
by LangChain Inc.
LangSmith provides first-class tracing and evaluation for LangChain pipelines, capturing every LLM call, chain step, and tool invocation with full prompt/response payloads. Teams use the integration to debug production failures, build evaluation datasets, and run automated regression tests against golden traces.
OpenAI + Azure OpenAI Service
by Microsoft Azure
Microsoft Azure's managed deployment of OpenAI models including GPT-4o, o1, and DALL-E 3 with enterprise SLAs, private networking, and regional data residency. Provides the same OpenAI API surface with additional Azure IAM, VNet integration, content filtering, and Azure Monitor observability.
Databricks Feature Store - Feast Integration
by Databricks
The Databricks Feature Store integrates with Feast, an open-source feature store, to streamline feature engineering and management for machine learning workflows. This integration allows users to define, store, and serve features consistently across training and inference, reducing data skew and improving model performance within the Databricks environment.
LangChain + Pinecone
by LangChain
LangChain VectorStore integration for Pinecone's managed vector database. Enables similarity search, MMR retrieval, and metadata filtering within LangChain RAG pipelines. Supports both serverless and pod-based Pinecone indexes via the langchain-pinecone package.
Hugging Face Optimum Intel Extension
by Hugging Face / Intel
Hugging Face Optimum Intel Extension is a toolkit designed to accelerate inference and training of transformer models on Intel CPUs and GPUs. It leverages Intel's Deep Learning Boost (DL Boost) and other hardware features to optimize model performance within the Hugging Face ecosystem.
Cursor + OpenAI
by Anysphere
Cursor is a VS Code fork that uses OpenAI's GPT-4 and o-series models as its reasoning engine for multi-file edits, semantic codebase search, and an agent mode that can autonomously implement features across the entire repository. It offers a Composer panel for multi-file diffs and a codebase-aware chat that indexes the project with embeddings for precise retrieval.
Anthropic + AWS Bedrock
by Amazon Web Services
Anthropic's Claude model family available through Amazon Bedrock's fully managed foundation model service. Provides serverless inference with pay-per-token pricing, AWS IAM authentication, VPC endpoint support, and model evaluation tools. Claude 3.5 Sonnet, Claude 3.5 Haiku, and Claude 3 Opus are all available through the Bedrock API.
TGI + Hugging Face Hub
by Hugging Face
Text Generation Inference (TGI) by Hugging Face is a production-grade inference server that directly loads models from the Hugging Face Hub via model IDs, handling shard downloading, quantization, and OpenAI-compatible endpoint serving in a single Docker command. It implements continuous batching, speculative decoding, and FlashAttention for optimal throughput on Ampere and Hopper GPUs.
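The "single Docker command" can be sketched as an argv list; the image tag, model ID, and quantization flag are illustrative. TGI resolves the `--model-id` against the Hub and downloads shards on startup, so mounting a cache volume avoids re-downloading:

```python
# TGI launch command assembled as an argv list (values illustrative).
cmd = [
    "docker", "run", "--gpus", "all", "--shm-size", "1g",
    "-p", "8080:80",                              # TGI serves on port 80 in-container
    "-v", "/data/hf-cache:/data",                 # persist downloaded model shards
    "ghcr.io/huggingface/text-generation-inference:latest",
    "--model-id", "mistralai/Mistral-7B-Instruct-v0.2",
    "--quantize", "bitsandbytes",                 # optional on-the-fly quantization
]
print(" ".join(cmd))
```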
Ollama + Docker
by Ollama
Ollama's official Docker image provides a self-contained environment for running large language models locally. It enables developers to easily deploy and manage quantized GGUF models using familiar container orchestration tools like Docker Compose and Kubernetes, supporting GPU acceleration and an OpenAI-compatible API.
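A sketch of the container workflow the entry describes, assembled as argv lists; the model name is illustrative. The named volume keeps pulled models across container restarts, and 11434 is Ollama's default API port:

```python
# Run the official Ollama image with GPU access (values illustrative).
run_cmd = [
    "docker", "run", "-d", "--gpus", "all",
    "-v", "ollama:/root/.ollama",    # model store survives container restarts
    "-p", "11434:11434",             # Ollama's default API port
    "--name", "ollama", "ollama/ollama",
]
# Pull a quantized GGUF model inside the running container:
pull_cmd = ["docker", "exec", "ollama", "ollama", "pull", "llama3"]
print(" ".join(run_cmd))
```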
MCP + GitHub
by Anthropic / GitHub
Integrates the MCP environment with GitHub's REST and GraphQL APIs, enabling programmatic control over software development workflows. Users can manage repositories, track issues, review pull requests, and search code directly from an agent context, streamlining development tasks without switching tools.
GitHub Copilot + JetBrains
by GitHub
The GitHub Copilot plugin for JetBrains IDEs integrates AI-powered code completion and a conversational chat panel directly into the editor. It provides inline, ghost-text suggestions and mirrors the functionality of the VS Code extension, adapting to JetBrains' native keymaps and user interface for a seamless experience across IDEs like IntelliJ IDEA and PyCharm.
MCP + Filesystem
by Anthropic
The Anthropic MCP Filesystem server allows AI agents, like Claude, to interact directly with a user's local files. It exposes a secure API for reading, writing, listing, and searching files and directories, enabling agents to perform tasks such as code analysis, data processing, and file organization on the host machine.
LangChain + Chroma
by LangChain
LangChain VectorStore integration for Chroma, the open-source AI-native embedding database. Ideal for local development and prototyping with zero infrastructure setup. Supports persistent and in-memory collections, metadata filtering, and relevance-scored retrieval via langchain-chroma.
LangChain + Google AI
by LangChain
This integration connects the LangChain framework with Google's advanced AI services, including the Gemini API via Google AI Studio and models on Vertex AI. It enables developers to build sophisticated applications leveraging multimodal capabilities for processing text and images, advanced function calling for tool use, and grounding responses with Google Search for accuracy.
Google AI + Vertex AI
by Google Cloud
Vertex AI is Google Cloud's managed machine learning platform for deploying and scaling AI applications. It provides an enterprise-grade environment for using Google's foundation models like Gemini and PaLM, adding MLOps tooling, security controls, and deep integration with the Google Cloud ecosystem. This includes features like model tuning, evaluation, and grounding with Google Search.
LangChain + HuggingFace
by LangChain
This integration connects LangChain with the HuggingFace ecosystem, enabling the use of thousands of open-source models. It allows developers to call models via the HuggingFace Inference API, run local inference using the `transformers` library, and generate embeddings, all within LangChain's structured framework for building complex LLM applications.
TensorRT-LLM + NVIDIA Triton
by NVIDIA
TensorRT-LLM optimizes large language models into fused CUDA kernels, while the Triton Inference Server orchestrates serving. Together, they form NVIDIA's production stack for maximizing token throughput and minimizing latency on datacenter GPUs, enabling high-performance, scalable LLM inference.
LangGraph + LangSmith
by LangChain Inc.
The LangGraph and LangSmith integration provides built-in observability for stateful agent graphs. It automatically captures every node execution, state change, and tool call as a structured trace in LangSmith, enabling deep, step-by-step debugging, performance analysis, and regression testing of complex agent workflows.
CrewAI + LangChain
by CrewAI / LangChain
This integration enables CrewAI agents to leverage the entire LangChain tool ecosystem. CrewAI orchestrates multi-agent workflows by assigning roles and delegating tasks, while LangChain provides the foundational tools for capabilities like web search, code execution, vector store retrieval, and API connectivity.
Ray Serve + GCP
by Anyscale
Ray Serve deploys scalable model serving applications on Google Cloud Platform using GKE and Vertex AI infrastructure, with Ray's distributed runtime managing replica placement, traffic splitting, and resource scheduling across GPU node pools. The integration supports multi-model serving graphs, A/B rollouts, and seamless scale-to-zero on GCP Spot instances for cost optimization.
LlamaParse + LlamaIndex
by LlamaIndex
LlamaParse is a proprietary parsing service for complex documents like PDFs with embedded tables and charts. Its first-party integration with the open-source LlamaIndex framework allows developers to directly ingest parsed, structured objects (Nodes) into advanced Retrieval-Augmented Generation (RAG) pipelines, preserving the original document's rich context.
Helicone + OpenAI
by Helicone
Helicone is an observability platform for LLMs that acts as a proxy for the OpenAI API. It enables developers to monitor usage, track costs, and optimize performance with minimal code changes. Key features include real-time dashboards, request-level caching, rate-limiting, and detailed analytics.
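The "minimal code changes" amount to repointing the OpenAI client at Helicone's proxy and adding an auth header. Sketched as plain config so it runs without the openai package; the base URL and header names reflect Helicone's documented proxy setup but should be checked against current docs, and the key is a placeholder:

```python
# Drop-in Helicone proxy configuration for an OpenAI client (values illustrative).
client_config = {
    "base_url": "https://oai.helicone.ai/v1",   # instead of api.openai.com/v1
    "default_headers": {
        "Helicone-Auth": "Bearer sk-helicone-example",  # placeholder key
        "Helicone-Cache-Enabled": "true",               # opt-in request caching
    },
}
# Real usage: openai.OpenAI(base_url=..., default_headers=...);
# the rest of the application stays unchanged.
print(client_config["base_url"])
```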
MCP + Slack
by Anthropic / Slack
This integration connects MCP-compatible AI agents, such as Claude, directly to a Slack workspace. It enables programmatic control over Slack functionalities, allowing agents to read channel histories, post messages, manage channels, and look up user information. The connection is authenticated using a Slack Bot token for secure, automated communication.
MCP + Brave Search
by Anthropic / Brave
An integration that connects the Model Context Protocol (MCP) with Brave's independent search index. It equips AI agents, like Claude, with tools for real-time web, local, and news searches, offering a privacy-focused alternative to Google and Bing for data retrieval and grounding.
LangChain + Weaviate
by LangChain
LangChain integration for Weaviate's open-source vector database. Supports hybrid search (BM25 + vector), multi-tenancy, and generative search modules within LangChain chains and agents. Connects via the Weaviate Python client inside the langchain-weaviate package.
Langfuse + LlamaIndex
by Langfuse
Langfuse integrates with LlamaIndex to provide open-source observability for LLM applications. A simple callback handler captures detailed traces of query engines, retrievers, and LLM calls. This data, including token usage, latency, and custom scores, is visualized in a self-hostable dashboard for comprehensive monitoring.
MCP + Puppeteer
by Anthropic
Official MCP Puppeteer server providing headless Chrome browser control to MCP clients. Exposes tools for page navigation, element interaction, form filling, screenshot capture, and JavaScript execution, enabling Claude to automate complex web workflows that require a real browser environment.
AutoGen + Azure OpenAI
by Microsoft
Integrates the AutoGen multi-agent framework with Azure OpenAI Service for building sophisticated, enterprise-grade AI applications. The connector lets developers leverage Azure's security features, including RBAC and private endpoints, while using standard AutoGen agents such as AssistantAgent and UserProxyAgent for complex, collaborative tasks.
Tabnine + VS Code
by Tabnine
Tabnine's VS Code extension provides AI-powered code completions, including whole-line and full-function suggestions. It is designed for enterprises with strict privacy and data-residency needs, offering on-premise or private cloud deployment options. The AI can be trained on a team's specific codebase for highly relevant completions.
Cline + VS Code
by Community
Cline is an open-source VS Code extension that provides an AI agent with direct access to the IDE's environment. It enables multi-step agentic workflows by allowing the AI to use the file system, terminal, and an integrated browser. The extension supports various models and includes a human-in-the-loop approval process for safety.
LlamaIndex + Qdrant
by LlamaIndex / Qdrant
Native LlamaIndex vector store adapter for Qdrant, enabling index construction, similarity search, and filtered retrieval over Qdrant collections. Supports both in-memory and hosted Qdrant deployments with payload-based metadata filtering.
Unstructured + Pinecone
by Unstructured / Pinecone
This integration provides a direct pipeline from Unstructured's data transformation service to the Pinecone vector database. It automates extracting, cleaning, and chunking data from documents like PDFs and DOCX, then embeds and indexes the content into a Pinecone namespace for use in RAG applications.
MCP + PostgreSQL
by Anthropic
This integration provides a secure, read-only connection to a PostgreSQL database within the MCP environment. It allows agents to perform database introspection, such as listing schemas and describing tables. A key feature is its ability to facilitate natural-language-to-SQL workflows, enabling users to ask questions in plain English and have them translated into safe, read-only SELECT queries for execution.
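The safety model hinges on admitting only read-only SELECT statements. A minimal guard in that spirit — the actual server's validation may differ, and this sketch is deliberately conservative:

```python
def is_read_only_select(sql: str) -> bool:
    """Accept a single SELECT statement; reject anything that could write."""
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:  # reject multi-statement payloads outright
        return False
    first_word = stmt.split(None, 1)[0].upper() if stmt else ""
    forbidden = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER",
                 "CREATE", "TRUNCATE", "GRANT"}
    return first_word == "SELECT" and not any(
        w in forbidden for w in stmt.upper().split()
    )

assert is_read_only_select("SELECT id, name FROM users WHERE age > 21")
assert not is_read_only_select("DROP TABLE users")
```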
LangChain + Ollama
by LangChain
Integrates LangChain with Ollama for fully local LLM inference, letting developers run models like Llama 3 and Mistral on their own hardware and keeping data private by eliminating external API calls. Ideal for building offline-capable, privacy-sensitive applications.
Arize Phoenix + LangChain
by Arize AI
Arize Phoenix integrates with LangChain to provide deep observability for LLM applications. By leveraging OpenTelemetry, it captures and streams traces for chains, agents, and retrievers to a local UI or the Arize cloud. This enables developers to debug applications, detect embedding drift, score retrieval quality, and analyze hallucinations at the span level.
Portkey + Multi-Provider
by Portkey
Portkey's AI gateway unifies over 200 LLM providers through a single OpenAI-compatible API. It enables automatic fallbacks, load balancing, and semantic caching to improve reliability and performance. The platform provides full observability, capturing detailed cost, latency, and metadata for every request.
LangChain + Mistral AI
by LangChain
This integration connects the LangChain framework with Mistral AI's suite of models, including Mistral Large and Codestral. It enables developers to build sophisticated applications by leveraging Mistral's capabilities like function calling, JSON mode, and streaming within LangChain's structured environment for creating agents and chains.
BentoML + AWS
by BentoML
BentoML streamlines deploying machine learning models to the AWS cloud. It packages models and their inference logic into standardized containers, enabling one-command deployment to services like SageMaker, EC2, and ECS. The platform automates production concerns such as auto-scaling, batching, and monitoring.
Windsurf + Anthropic
by Codeium
Windsurf (by Codeium) is an AI-native IDE that integrates Anthropic's Claude models as the backbone of its Cascade agent, which autonomously plans and executes multi-step coding tasks with real-time file and terminal access. The Anthropic integration powers deep context awareness across large codebases and supports long-horizon agent tasks with coherent state tracking.
Claude Agent SDK + MCP
by Anthropic
Anthropic's Claude Agent SDK ships with native Model Context Protocol (MCP) client support, allowing Claude-powered agents to connect to any MCP server and use its exposed tools, resources, and prompts. The integration bridges Claude's tool-use capabilities with the open MCP ecosystem for plug-and-play external integrations.
LangChain + Cohere
by LangChain
LangChain integration for Cohere's enterprise AI platform. Provides access to Command models for generation, Embed v3 for multilingual embeddings, and the Rerank API for RAG pipeline precision improvement. Available via the langchain-cohere package with first-class reranker support.
Sourcegraph + Cody
by Sourcegraph
Sourcegraph Cody combines enterprise-grade code search with an AI coding assistant, letting developers ask questions grounded in the entire codebase indexed by Sourcegraph. The integration uses Sourcegraph's precise code intelligence (SCIP) as a retrieval layer for Cody's Claude-powered chat, delivering context-accurate answers across mono-repos with millions of files.
MCP + Google Drive
by Anthropic / Google
Official MCP Google Drive server granting MCP clients access to Drive file listings, search, and document content reading via OAuth 2.0. Supports Docs, Sheets, Slides, and plain files, enabling agents to retrieve and reason over cloud-stored enterprise documents.
Groq + LangChain
by Groq
LangChain chat model integration for Groq's Language Processing Unit (LPU) inference API. Enables ultra-low-latency LLM calls within LangChain chains and agents with first-token latency under 100ms. Supports Llama 3, Mixtral, and Gemma models served on Groq hardware via the langchain-groq package.
Continue + VS Code
by Continue Dev
Continue is an open-source AI code assistant for VS Code that supports any LLM through a flexible config file, covering inline completions, chat, edit mode, and custom slash commands. Its context providers system lets developers include files, docs, web search results, and terminal output in every prompt, making it highly adaptable to team-specific workflows.
Chroma + HuggingFace
by Chroma
Chroma's built-in embedding function for HuggingFace's sentence-transformers library. Enables fully local embedding generation and vector storage without any API keys. Supports hundreds of pre-trained models from the HuggingFace Hub including all-MiniLM, BGE, and E5 variants.
Qdrant + LlamaIndex
by Qdrant
LlamaIndex VectorStore integration for Qdrant's high-performance vector search engine. Exposes Qdrant's payload filtering, sparse-dense hybrid search, and collection management through LlamaIndex's standard index and query engine abstractions for advanced RAG pipelines.
DeepSeek + Together AI
by Together AI
DeepSeek's open-weight models including DeepSeek-V3 and DeepSeek-R1 served through Together AI's inference cloud at competitive token prices. Provides an OpenAI-compatible API endpoint, enabling drop-in substitution for cost-sensitive workloads. Together AI's custom GPU kernels deliver high throughput for DeepSeek's MoE architecture.
Arize Phoenix + LlamaIndex
by Arize AI
Arize Phoenix instruments LlamaIndex query pipelines with OpenTelemetry spans, exposing retrieval precision, reranker performance, and LLM generation quality in a local-first UI. The integration is particularly valuable for RAG applications where diagnosing retrieval failures requires joint analysis of embeddings, chunks, and generation outputs.
Firecrawl + LangChain
by Firecrawl / LangChain
LangChain document loader built on Firecrawl's web crawling and scraping API, transforming live web content into clean Markdown documents ready for chunking and indexing. Supports full-site crawls, sitemap-driven ingestion, and JavaScript-rendered pages.
MCP + Notion
by Community / Notion
MCP Notion server built on the official Notion API, providing tools for searching pages, reading blocks, creating pages, and updating database entries. Enables Claude and other agents to use Notion as a structured knowledge store within agentic workflows.
Weaviate + Cohere
by Weaviate
Weaviate's built-in text2vec-cohere and reranker-cohere modules for zero-ETL vectorization and result reranking within Weaviate clusters. Automatically embeds documents at write time using Cohere Embed v3 and reranks retrieval results without external orchestration code.
Milvus + LangChain
by Zilliz
LangChain VectorStore integration for Milvus, the open-source distributed vector database. Supports billion-scale ANN search, multiple index types (IVF_FLAT, HNSW, DiskANN), and collection-level partitioning through LangChain's unified retriever interface via the pymilvus client.
PydanticAI + Anthropic
by Pydantic
PydanticAI's native Anthropic model provider, enabling type-safe agentic workflows backed by Claude models. Agent inputs, tool call parameters, and structured outputs are all validated through Pydantic schemas, with full support for Claude's extended tool use and streaming responses.
SmolAgents + HuggingFace
by HuggingFace
SmolAgents is HuggingFace's minimal agent framework that defaults to code-writing agents powered by HuggingFace-hosted open-source models. The integration allows seamless use of models from the HuggingFace Hub (Qwen, Mistral, LLaMA) through the Inference API or local transformers without API key lock-in.
LlamaFile + Local Execution
by Mozilla
LlamaFile by Mozilla and Justine Tunney bundles a complete LLM with its runtime into a single self-contained executable that runs on Linux, macOS, Windows, FreeBSD, NetBSD, and OpenBSD without any installation. It embeds a compressed GGUF model and a llama.cpp backend into a polyglot binary (ZIP + ELF/Mach-O), serving an OpenAI-compatible HTTP API on localhost at startup.
MCP + Sentry
by Community / Sentry
MCP Sentry server exposing Sentry's error tracking and performance monitoring data to MCP-compatible agents. Agents can list recent issues, retrieve stack traces, inspect breadcrumbs, and query performance data, enabling AI-powered incident triage and root cause analysis workflows.
Swarm + OpenAI
by OpenAI
OpenAI's experimental Swarm framework natively targets the OpenAI Chat Completions API for lightweight, stateless multi-agent handoffs. Agents are plain Python functions decorated with tool schemas; the framework manages context passing and agent-to-agent transfers through the standard OpenAI function-calling interface.
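The handoff pattern the entry describes can be reduced to plain Python: a tool returns another agent, and the run loop transfers control. The class and function names below are illustrative, not Swarm's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:                       # illustrative stand-in for a Swarm agent
    name: str
    instructions: str
    tools: list = field(default_factory=list)

def transfer_to_billing() -> "Agent":
    """Tool whose return value signals a handoff."""
    return billing_agent

billing_agent = Agent("Billing", "Handle invoices and refunds.")
triage_agent = Agent("Triage", "Route the user.", tools=[transfer_to_billing])

# Minimal run loop: if a tool returns an Agent, switch the active agent.
active = triage_agent
result = triage_agent.tools[0]()
if isinstance(result, Agent):
    active = result
print(active.name)  # → Billing
```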
Mistral AI + AWS Bedrock
by Amazon Web Services
Mistral AI's Mistral Large and Mistral Small models available through Amazon Bedrock for serverless inference. Provides AWS-native access to Mistral's frontier models with pay-per-token pricing, IAM-based auth, and Bedrock Guardrails — enabling EU-origin AI capabilities within AWS infrastructure without a separate Mistral API account.
Braintrust + Anthropic
by Braintrust Data
Braintrust wraps the Anthropic SDK to automatically trace every Claude API call and funnel results into structured eval datasets. Developers can run model-graded scoring, regression suites against golden datasets, and A/B comparisons between Claude model versions directly from the Braintrust dashboard.
pgvector + Django
by pgvector
pgvector-django package adding native vector similarity search to Django's ORM via PostgreSQL's pgvector extension. Adds VectorField, IvfflatIndex, and HnswIndex with cosine, L2, and inner product distance operators. Enables AI-powered search inside existing Django applications without a separate vector DB.
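What the ORM-level VectorField and HnswIndex boil down to at the database layer can be sketched with pgvector's documented SQL operators; the table and vector literal are illustrative:

```python
# SQL underlying the Django-level abstractions (table/values illustrative).
ddl = [
    "CREATE EXTENSION IF NOT EXISTS vector;",
    "CREATE TABLE docs (id serial PRIMARY KEY, embedding vector(3));",
    "CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);",
]
# pgvector distance operators: <=> cosine, <-> L2, <#> negative inner product.
query = "SELECT id FROM docs ORDER BY embedding <=> '[0.1,0.2,0.3]' LIMIT 5;"
print(query)
```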
Marker + ChromaDB
by VikParuchuri / ChromaDB
Combines Marker's high-fidelity PDF-to-Markdown conversion with ChromaDB's local-first vector store for lightweight, self-hosted RAG pipelines. Ideal for on-device or air-gapped deployments where cloud vector stores are unavailable.
Agency Swarm + OpenAI
by VRSEN
Agency Swarm is built on top of the OpenAI Assistants API, wrapping it with agency-level abstractions for defining communication flows between specialized agents. It provides a higher-level interface for creating persistent agent threads, shared tool registries, and structured agent communication protocols.
Jina Reader + PGVector
by Jina AI / PostgreSQL
Routes Jina Reader's URL-to-text extraction through PostgreSQL's pgvector extension for SQL-native RAG storage. Enables teams already running PostgreSQL to add vector search without adopting a separate vector database, keeping the stack simple.
Opik + LangChain
by Comet ML
Opik by Comet provides an open-source LLM observability platform that integrates with LangChain via a callback handler, recording traces, token counts, and custom scores into a queryable dataset. The integration includes built-in hallucination and answer-relevance evaluators that run automatically on captured traces.
Docling + Weaviate
by IBM / Weaviate
Combines IBM's Docling document conversion library with Weaviate's vector database for structured RAG pipelines. Docling extracts rich document structure (tables, figures, headings) which is then stored as typed Weaviate objects with native vector indexing.
LanceDB + LlamaIndex
by LanceDB
LlamaIndex integration for LanceDB's serverless, embedded vector database built on the Lance columnar format. Supports multimodal data (text, images, video), zero-copy queries, and versioned datasets. Ideal for local or edge AI applications requiring a zero-ops vector store with full LlamaIndex query engine compatibility.
Cohere + AWS SageMaker
by Amazon Web Services
Cohere's Command and Embed models deployed as dedicated SageMaker endpoints for real-time inference with guaranteed throughput. Available through AWS Marketplace as JumpStart models, supporting VPC isolation, auto-scaling, and A/B testing. Preferred for enterprises requiring dedicated capacity and AWS billing consolidation.
Fireworks AI + vLLM
by Fireworks AI
Integration between Fireworks AI's model platform and the vLLM inference engine for on-premises or self-hosted deployment of Fireworks-optimized models. Fireworks packages FireOptimizer-quantized models in formats directly compatible with vLLM's OpenAI-compatible server, enabling enterprise teams to run Fireworks-quality inference on their own GPU infrastructure.
Vespa + Haystack
by deepset
Haystack DocumentStore integration for Vespa, Yahoo's open-source big-data serving engine. Combines Vespa's multi-stage ranking, approximate nearest neighbor search, and real-time indexing with Haystack's RAG pipeline builder. Supports BM25 + dense hybrid retrieval at web scale.
Log10 + OpenAI
by Log10
Log10 provides zero-configuration auto-logging for OpenAI API calls through a context manager that intercepts completions and stores full request/response pairs with automatic tagging. The integration supports user feedback collection, few-shot prompt organization, and GDPR-compliant data masking for PII in logged payloads.
Chunkr + Milvus
by Chunkr / Zilliz
Pairs Chunkr's semantic chunking service with Milvus's high-performance vector database for production-scale RAG. Chunkr splits documents using structure-aware boundaries and Milvus stores the resulting dense vectors with ANN indexing for sub-millisecond retrieval.
Zilliz + Apache Spark
by Zilliz
Connector linking Zilliz Cloud (managed Milvus) with Apache Spark for large-scale batch embedding ingestion and vector ETL pipelines. Enables parallel document embedding across Spark executors with direct write to Zilliz collections, supporting data lake to vector store pipelines at petabyte scale.
Weights & Biases
by Weights & Biases
ML experiment tracking and model monitoring platform. Integrates with all major training frameworks.
Cerebras + LiteLLM
by LiteLLM
LiteLLM proxy integration for Cerebras Inference, enabling Cerebras's wafer-scale chip throughput to be accessed via a unified OpenAI-compatible gateway. Allows developers to route requests to Cerebras's CS-3 hardware — delivering over 2000 tokens/second on Llama 3.1 70B — from any existing OpenAI SDK integration through LiteLLM's model aliases.
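The model-alias routing can be sketched as the dict form of LiteLLM's usual YAML proxy config. The `cerebras/` provider prefix follows LiteLLM's convention; the alias name is arbitrary and the `os.environ/...` string is LiteLLM's syntax for reading a key from the environment:

```python
# LiteLLM proxy config (dict form of the YAML) routing an alias to Cerebras.
proxy_config = {
    "model_list": [
        {
            "model_name": "fast-llama",               # alias clients request
            "litellm_params": {
                "model": "cerebras/llama3.1-70b",     # routed to Cerebras Inference
                "api_key": "os.environ/CEREBRAS_API_KEY",
            },
        }
    ]
}
# Clients then call the proxy with model="fast-llama" via any OpenAI SDK.
print(proxy_config["model_list"][0]["model_name"])
```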
Turbopuffer + Vercel
by Turbopuffer
Integration connecting Turbopuffer's serverless vector database with Vercel's deployment platform. Turbopuffer stores vectors on object storage with sub-100ms cold query latency, making it viable for Vercel serverless functions and Edge Runtime. Zero infrastructure management for full-stack AI apps on Vercel.
OWASP Top 10 for Agentic Applications
by OWASP Foundation
Security standard for AI agent systems (2026).
EU AI Act Compliance Framework
by European Union
Regulatory framework for AI systems in the EU (Aug 2026).
AP2 (Agent Payments Protocol)
by Google
Autonomous agent commerce with crypto-signed mandates.