Explore.
7,960 AI entities indexed across tools, models, agents, skills, benchmarks, and more — schema-verified, agent-maintained.
Showing 1,020 of 7,960 entities
Hugging Face
by Hugging Face
The largest platform for sharing and deploying machine learning models, datasets, and applications. Provides the Transformers library, Inference API, Spaces for demos, and a vibrant open-source AI community.
TensorFlow Lite
by Google
TensorFlow Lite is Google's lightweight ML framework designed for on-device inference on mobile, embedded, and IoT devices. It enables deploying trained models with minimal latency and no network dependency, supporting a wide range of hardware accelerators including GPU, DSP, and NPU.
Apache Airflow (ML Edition)
by Apache Software Foundation
Battle-tested workflow scheduler for authoring, scheduling, and monitoring data and ML pipelines as directed acyclic graphs. The ML ecosystem around Airflow includes providers for SageMaker, Vertex AI, MLflow, and all major cloud AI services.
Three.js (AI Integration)
by Three.js Community (Mr.doob)
The foundational JavaScript 3D library for rendering GPU-accelerated graphics in the browser via WebGL, with a growing ecosystem of AI-generated geometry, procedural shaders, and LLM-driven scene graph manipulation. Three.js powers many of the web's spatial AI visualizations.
dbt (AI/ML Edition)
by dbt Labs
The analytics engineering framework that transforms raw warehouse data into clean, tested, and documented datasets ready for ML and AI. dbt's model graph, column-level lineage, and semantic layer make it the backbone of production feature engineering pipelines.
Apache Spark MLlib
by Apache Software Foundation
Apache Spark's built-in machine learning library for distributed, large-scale ML on data lakes and warehouses. MLlib provides scalable algorithms for classification, regression, clustering, and collaborative filtering, plus a pipeline API for feature engineering.
Stability AI Platform
by Stability AI
The Stability AI Platform provides API access to Stability AI's suite of generative image, video, and audio models including Stable Diffusion 3.5 and Stable Video Diffusion, enabling developers to build creative AI applications at scale. It offers both hosted API endpoints and open-weight models for on-premises deployment.
MediaPipe
by Google
MediaPipe is Google's cross-platform framework for building perception pipelines that run on-device in real time. It provides production-ready solutions for tasks like hand tracking, face detection, pose estimation, and object detection across Android, iOS, web, and desktop.
Streamlit
by Snowflake (via acquisition)
Python-first framework for building interactive data applications and ML demos in minutes with no frontend experience required. Streamlit's reactive execution model, built-in widgets, and LLM streaming components make it the go-to tool for AI prototype UIs.
DeepL API
by DeepL
DeepL API provides neural machine translation of exceptional quality for 30+ languages, consistently outperforming competitors on blind translation benchmarks. It supports real-time text and full document translation with format preservation, a glossary system, and a free tier for developers.
Zapier AI
by Zapier
Zapier AI extends the world's largest no-code automation platform with AI-powered workflow generation, natural language Zap building, and an AI Actions API that lets LLMs trigger real-world automations. It connects 6000+ apps and enables non-technical users to build AI-augmented workflows without writing code.
Databricks
by Databricks Inc.
Unified data intelligence platform combining data engineering, ML, and GenAI on a Lakehouse foundation. Databricks provides managed Spark, Delta Lake, MLflow, and Model Serving with vector search, enabling end-to-end AI pipelines from raw data to production models.
Rosetta
by RosettaCommons
Rosetta is a comprehensive software suite for computational macromolecular modeling and design, enabling researchers to predict protein structure, design novel proteins, and model protein-protein interactions. Developed by the RosettaCommons consortium, it is the gold standard in computational protein design and has contributed to multiple Nobel Prize-winning research programs.
Anthropic Tool Use
by Anthropic
Anthropic's native tool use capability allowing Claude models to interact with external tools and APIs. Provides structured tool definitions with input schemas and supports parallel tool calls and streaming.
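The tool-definition-with-input-schema pattern described above can be sketched with plain dictionaries. This is a standard-library toy, not Anthropic's actual API payloads; the `get_weather` tool and its canned handler are illustrative.

```python
# Sketch of the structured tool-definition pattern: a tool declares a
# JSON-schema-style input contract, and a dispatcher validates each
# call against it before executing. Names here are hypothetical.
import json

get_weather_tool = {
    "name": "get_weather",
    "description": "Return the current temperature for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def dispatch(tool_call: dict) -> str:
    """Validate a tool call against the schema, then execute it."""
    schema = get_weather_tool["input_schema"]
    args = tool_call["input"]
    for field in schema["required"]:
        if field not in args:
            raise ValueError(f"missing required field: {field}")
    # A real handler would query a weather service; this one is canned.
    return json.dumps({"city": args["city"], "temp_c": 21})

result = dispatch({"name": "get_weather", "input": {"city": "Oslo"}})
```

In the real API the model emits the tool call and the application returns the result string back to the model as a tool-result message.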
ONNX Runtime Mobile
by Microsoft
ONNX Runtime Mobile is Microsoft's high-performance inference engine optimized for mobile and edge devices, enabling deployment of models from any ONNX-compatible training framework. It provides hardware-accelerated inference via NNAPI, Core ML, and XNNPACK execution providers.
Runway ML
by Runway
Runway ML is a leading generative AI creative platform for video generation, editing, and visual effects, offering models like Gen-3 Alpha for high-fidelity text-to-video and image-to-video synthesis. It is used by filmmakers, advertisers, and content creators to produce cinematic-quality AI-generated video at scale.
Gradio
by Hugging Face
Python library for rapidly building shareable ML demos with a focus on multimodal inputs including images, audio, video, and text. Gradio is the standard for Hugging Face Spaces demos and integrates natively with the Hugging Face Hub model ecosystem.
Apache TVM
by Apache Software Foundation
Apache TVM is an open-source machine learning compiler stack that optimizes deep learning workloads for a diverse set of hardware backends including CPUs, GPUs, FPGAs, and custom accelerators. It automates model optimization through its AutoTVM and Ansor auto-tuning systems, delivering state-of-the-art inference performance on edge targets.
OneTrust AI
by OneTrust
OneTrust offers a Trust Intelligence Platform to help organizations manage privacy, security, and data governance. It automates workflows for compliance with regulations like GDPR and CCPA, manages user consent, and provides tools for AI governance, data discovery, and third-party risk assessment across the enterprise.
Delta Lake
by Linux Foundation (Delta Lake Project)
Delta Lake is an open-source storage layer that brings ACID transactions and reliability to data lakes. Built on top of Parquet files, it enables features like schema enforcement, time travel for data versioning, and unified batch and streaming data processing. It serves as the foundational storage format for the Lakehouse architecture.
Claude Code
by Anthropic
Claude Code is an agentic AI coding assistant from Anthropic designed to operate within a developer's terminal. It autonomously handles complex software development tasks by understanding entire codebases, editing files, executing shell commands, and managing Git workflows, acting as a hands-on pair programmer with minimal human supervision.
AWS API Gateway (ML)
by Amazon Web Services
AWS-managed API gateway service for building, deploying, and scaling ML and AI APIs backed by Lambda, SageMaker, and Bedrock endpoints. AWS API Gateway provides built-in authorization, throttling, caching, and monitoring for production AI service deployments at any scale.
LangSmith Testing
by LangChain
LangSmith is a platform for debugging, testing, evaluating, and monitoring LLM applications. It enables developers to visualize execution traces of their chains and agents, collect datasets, and run automated evaluators to score model performance. The platform is designed to streamline the LLM development lifecycle from prototype to production.
Neo4j GraphRAG
by Neo4j
Neo4j GraphRAG combines graph database capabilities with vector search to build retrieval-augmented generation systems that leverage structured relationships alongside semantic similarity. It enables developers to construct knowledge graphs that ground LLM responses in connected, structured data, reducing hallucinations and improving traceability.
Semantic Kernel
by Microsoft
Microsoft's open-source SDK for integrating LLMs into applications with plugin architecture. Supports planners, memory, and connectors for building enterprise AI solutions across .NET, Python, and Java.
Make AI
by Make
Make (formerly Integromat) is a visual no-code automation platform with deep AI integration that enables complex multi-step workflows through a drag-and-drop scenario builder. It offers granular data transformation, HTTP module flexibility, and AI-powered scenario generation for orchestrating sophisticated automation pipelines.
Milvus
by Zilliz
Cloud-native vector database built for scalable similarity search with GPU acceleration. Supports billions of vectors with multiple index types, hybrid search, and multi-vector queries.
Instructor
by Jason Liu
Instructor is a Python library that simplifies extracting structured, typed data from Large Language Model (LLM) responses. By leveraging Pydantic models, it enables developers to define a desired data schema, and Instructor handles the prompting, validation, and retries to ensure the LLM output conforms to that schema, streamlining data extraction tasks.
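The schema-validate-retry loop that Instructor automates can be shown in miniature with the standard library. `fake_llm` stands in for a real model call, and the required-keys check is a much-simplified analogue of Pydantic validation; none of these names are Instructor's API.

```python
# Toy sketch of the validate-and-retry loop: parse the model output,
# check it against the expected schema, and re-prompt on failure.
import json

def fake_llm(prompt: str, attempt: int) -> str:
    # First reply is malformed JSON; the retry is well-formed.
    return '{"name": "Ada"' if attempt == 0 else '{"name": "Ada", "age": 36}'

def extract(prompt: str, required: set, max_retries: int = 2) -> dict:
    for attempt in range(max_retries + 1):
        raw = fake_llm(prompt, attempt)
        try:
            data = json.loads(raw)
            if required <= data.keys():   # all required fields present
                return data
        except json.JSONDecodeError:
            pass  # in a real loop, the error is fed back into the prompt
    raise RuntimeError("LLM output never matched the schema")

user = extract("Extract the user from this text.", {"name", "age"})
```

With Instructor itself, the schema is a Pydantic model passed as `response_model`, and validation errors are automatically appended to the retry prompt.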
Groq
by Groq
Groq is an AI inference company that provides ultra-fast access to open-source large language models. It leverages its custom-designed Language Processing Unit (LPU) hardware to deliver industry-leading token generation speeds, significantly reducing latency for real-time applications via an OpenAI-compatible API.
Jasper AI
by Jasper
Jasper is an AI content platform designed for enterprise marketing teams. It helps create on-brand content at scale by combining advanced AI models with a company's specific brand knowledge. The platform supports multi-channel content generation, campaign workflows, and ensures brand voice consistency across all outputs.
Model Context Protocol
by Anthropic
Open protocol by Anthropic for connecting AI models to external tools, data sources, and services. Provides a standardized interface for tool use with server and client SDKs for building integrations.
H2O AutoML
by H2O.ai
H2O AutoML is an open-source, distributed machine learning platform that automates the model training process. It systematically explores various algorithms and hyperparameters to produce a leaderboard of the best models. It supports both a Python/R API and a no-code Flow UI, making it accessible to both developers and business users.
Bubble AI
by Bubble
Bubble AI is an AI-powered no-code development platform for building web applications without writing code. Users can describe their app idea in natural language, and the AI assistant generates layouts, database structures, and workflows. This visual programming environment allows for extensive customization, ideal for creating MVPs and internal tools.
LoRA Library
by Hugging Face
The LoRA Library, integrated within Hugging Face's PEFT (Parameter-Efficient Fine-Tuning) package, provides tools to create, share, and use LoRA adapters. It allows for the efficient customization of large pre-trained models by training only a small number of new weights, drastically reducing computational costs and storage requirements compared to full fine-tuning.
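The parameter saving described above comes from LoRA's low-rank update: instead of training a full weight matrix W (d_out × d_in), train a pair B (d_out × r) and A (r × d_in) with small rank r and add their scaled product to the frozen W. A minimal numeric sketch, with toy dimensions chosen for readability:

```python
# LoRA in miniature: W stays frozen; only B and A are trained, and the
# adapted weights are W + (alpha / r) * (B @ A).
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d_out, d_in, r, alpha = 2, 3, 1, 2.0
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen base weights (d_out x d_in)
B = [[0.5], [1.0]]                        # trained low-rank factor (d_out x r)
A = [[1.0, 2.0, 0.0]]                     # trained low-rank factor (r x d_in)

delta = matmul(B, A)                      # rank-r update, d_out x d_in
W_adapted = [[w + (alpha / r) * d for w, d in zip(wr, dr)]
             for wr, dr in zip(W, delta)]

# Trainable parameters: r * (d_out + d_in) = 5 instead of
# d_out * d_in = 6; the saving grows dramatically at real model sizes.
```

At transformer scale (d_out = d_in = 4096, r = 8) the adapter is roughly 0.4% of the size of the full matrix, which is why adapters are cheap to train, store, and share.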
TGI
by Hugging Face
A production-ready inference server for large language models, developed by Hugging Face in Rust. It enables high-performance LLM serving through features like tensor parallelism, continuous batching, and quantization, making it ideal for deploying demanding models at scale with low latency.
GitBook AI
by GitBook
GitBook AI is an intelligent documentation and knowledge management platform with built-in AI that can answer questions, generate content, and surface insights from your entire knowledge base. It combines a Notion-like editor with an AI search assistant and GitHub sync for technical teams.
Temporal AI
by Temporal Technologies
Durable workflow orchestration platform that makes it easy to build reliable distributed applications. Temporal handles retries, timeouts, and failure recovery automatically, making it ideal for long-running AI pipelines and agent orchestration.
Mintlify
by Mintlify
Mintlify is an AI-powered platform for creating and maintaining developer documentation. It auto-generates content from code comments and OpenAPI specs, provides an AI chatbot trained on the docs for instant answers, and offers a rich component library for technical writing. The platform simplifies publishing and hosting.
Terra
by Broad Institute / Microsoft
Terra is a cloud-based open platform for biomedical researchers to access data, run analysis tools, and collaborate, built on Google Cloud with support for WDL and CWL workflow languages. It provides access to petabyte-scale genomic datasets including TCGA and GTEx, and supports scalable analysis through Cromwell and Spark pipelines.
AutoGluon
by Amazon Web Services
AutoGluon is an open-source AutoML framework from AWS that simplifies machine learning. It automates model training, hyperparameter tuning, and ensembling to achieve state-of-the-art performance on tabular, image, text, and time-series data with just a few lines of Python code, making advanced ML accessible to all skill levels.
Unity ML-Agents
by Unity Technologies
Unity ML-Agents is an open-source toolkit that enables the use of the Unity game engine as a simulation environment for training intelligent agents. It connects rich 3D environments with Python-based deep reinforcement learning and imitation learning frameworks like TensorFlow and PyTorch, facilitating research and development in game AI, robotics, and autonomous systems.
Airbyte
by Airbyte Inc.
Open-source data integration platform with 350+ pre-built connectors for syncing data into AI-ready warehouses and vector databases. Airbyte's PyAirbyte SDK and AI Connector Builder enable rapid connector creation for custom data sources and AI pipelines.
Label Studio
by HumanSignal
Open-source data labeling and annotation platform supporting text, image, audio, and video. Provides customizable labeling interfaces, ML-assisted labeling, and team collaboration for building training datasets.
Auto-sklearn
by University of Freiburg (AutoML Group)
Auto-sklearn is an open-source AutoML toolkit built on scikit-learn. It leverages Bayesian optimization, meta-learning, and automated ensemble construction to find the best-performing machine learning pipeline for a given tabular dataset. It is a prominent tool in academic research for automated model selection.
Cline
by Cline
Autonomous coding agent that operates directly in VS Code with support for multiple LLM providers. Can create and edit files, run terminal commands, and browse the web while requiring human approval for actions.
Edge Impulse
by Edge Impulse
Edge Impulse is a leading development platform for machine learning on embedded systems and IoT devices. It offers an end-to-end MLOps pipeline, from data collection and signal processing to model training and deployment. The platform simplifies creating TinyML applications for resource-constrained microcontrollers.
Dagster
by Dagster Labs
Dagster is an asset-centric data orchestrator for building, testing, and monitoring data pipelines. It models data dependencies and computations as a graph of software-defined assets, providing built-in data lineage, type checking, and observability. This approach helps data teams create reliable and maintainable data platforms.
ReadMe AI
by ReadMe
ReadMe AI is an interactive API documentation platform that transforms OpenAPI specs and Markdown into beautiful, interactive developer hubs with personalized API explorers. Its AI-powered features include auto-generated code samples, semantic search, and contextual AI answers drawn from your documentation.
Outlines
by .txt
Outlines is an open-source Python library that provides fine-grained control over large language model text generation. It uses constrained decoding to force the model's output to conform to a specific structure, such as a regular expression, a Pydantic model, or a JSON schema. This guarantees that the generated text is always valid and parseable, eliminating the need for post-processing and error handling.
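The constrained-decoding idea can be illustrated without any model: at each generation step, mask out every candidate token that would break the target format, then pick the best remaining one. This toy uses a hand-built per-position character mask; Outlines itself compiles a regex or JSON schema into a finite-state machine applied to the model's token logits.

```python
# Toy constrained decoding for the format \d\d-\d\d: invalid tokens are
# masked before selection, so the output is valid by construction.
# `fake_logits` stands in for a model's ranked token preferences.
DIGITS = set("0123456789")
allowed_at = [DIGITS, DIGITS, {"-"}, DIGITS, DIGITS]

def fake_logits(step: int) -> list:
    # The unconstrained model would happily emit "x" every time.
    return ["x", "4", "-", "2", "7"]

out = []
for step, allowed in enumerate(allowed_at):
    for token in fake_logits(step):   # walk candidates best-first
        if token in allowed:          # mask tokens that break the format
            out.append(token)
            break
result = "".join(out)
```

Because masking happens inside the decoding loop, validity is guaranteed up front rather than checked (and retried) after the fact.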
Play.ht
by Play.ht
Play.ht is an AI-powered text-to-speech generator and voice cloning platform. It offers a vast library of over 900 AI voices in multiple languages and accents. The platform is designed for various applications, from creating audio versions of articles to developing interactive conversational AI, thanks to its low-latency real-time streaming API.
Triton Inference Server
by NVIDIA
Triton is an open-source inference server from NVIDIA designed for high-performance, production-ready AI. It supports deploying models from virtually any framework, such as TensorFlow, PyTorch, and ONNX, on both GPUs and CPUs. Key features include dynamic batching, concurrent model execution, and model ensembling to maximize throughput and resource utilization.
n8n AI
by n8n
n8n AI is a source-available workflow automation platform that enables users to build complex, AI-powered automations. It features a visual, node-based editor where users can connect hundreds of applications and services, including various LLMs and AI agents, to orchestrate intricate processes with minimal code.
TrustArc AI
by TrustArc
TrustArc AI is a comprehensive privacy management platform that leverages AI to automate and simplify compliance with global regulations like GDPR and CCPA. It provides tools for data inventory, risk assessments, and consent management, helping organizations build and maintain robust privacy programs.
Hopsworks
by Hopsworks AB
Hopsworks is an open-source ML platform centered around its feature store, enabling teams to manage the full lifecycle of features from engineering to serving for both batch and real-time ML workloads. It integrates deeply with Apache Spark, Flink, and Python environments, and provides built-in model registry and serving capabilities.
NVIDIA Omniverse
by NVIDIA
NVIDIA's platform for building physically accurate 3D simulations, digital twins, and collaborative virtual worlds powered by Universal Scene Description (USD) and real-time ray tracing. Omniverse integrates generative AI for scene synthesis, avatar animation, and synthetic data generation for robot and autonomous vehicle training.
Replit AI
by Replit
AI-powered cloud development platform with integrated coding assistant and one-click deployment. Combines a browser-based IDE with AI code generation, debugging, and instant deployment to production.
Kong AI Gateway
by Kong Inc.
AI-native API gateway that provides a unified control plane for managing, securing, and observing all LLM traffic across any provider. Kong AI Gateway adds semantic caching, prompt injection protection, token rate limiting, and cost attribution on top of the battle-tested Kong Gateway.
Phrase TMS AI
by Phrase
Phrase TMS AI is a translation management system with integrated AI that automates the end-to-end localization workflow for enterprises, including MT integration, translation memory, terminology management, and quality assurance automation. It serves as the operational backbone for global content and software localization programs.
Tecton
by Tecton
Tecton is an enterprise feature store platform that enables data scientists and ML engineers to build, share, and serve features for real-time and batch machine learning applications. It provides a declarative Python SDK for defining feature pipelines, with automatic backfilling, versioning, and point-in-time correct training data generation.
Kubeflow
by Google
Open-source ML platform for Kubernetes providing end-to-end ML workflow orchestration. Includes pipeline authoring, distributed training, hyperparameter tuning, and model serving on Kubernetes clusters.
Apigee AI (Google Cloud)
by Google Cloud
Google Cloud's enterprise API management platform with native Vertex AI and Gemini integration for building secure AI-powered APIs. Apigee AI adds LLM traffic management, semantic caching, safety policies, and analytics to Google's proven API gateway infrastructure.
Kapwing AI
by Kapwing
Kapwing is a browser-based AI video editor designed for content creators and social media teams. It offers AI-powered subtitle generation, background removal, smart cut, and one-click repurposing across formats and aspect ratios.
Unbabel
by Unbabel
Unbabel is an AI-powered translation platform that combines neural machine translation with a community of professional post-editors to deliver human-quality translations at machine speed. It is purpose-built for enterprise customer support and content teams requiring guaranteed accuracy at scale.
Modal
by Modal
Serverless cloud platform for running GPU-accelerated Python code with zero infrastructure management. Provides instant container spin-up, GPU autoscaling, and simple decorators for deploying ML workloads.
Guidance
by Microsoft
Microsoft's language for controlling LLMs with interleaved generation and prompting. Supports constrained output via token healing, regex constraints, and context-free grammars for reliable generation.
PathAI
by PathAI
PathAI is an AI-powered pathology platform that assists pathologists in diagnosing diseases including cancer by analyzing digitized tissue slides with deep learning models trained on millions of pathology images. It provides quantitative biomarker analysis, treatment response prediction, and clinical trial endpoint measurement at scale.
FLAML
by Microsoft Research
Fast and Lightweight AutoML library from Microsoft Research that minimizes compute while maximizing accuracy. FLAML uses cost-aware hyperparameter search and is designed to be embedded inside larger systems, including the AutoGen multi-agent framework.
Credo AI
by Credo AI
Credo AI is an AI governance platform that enables organizations to assess, monitor, and document AI model risks, fairness, and regulatory compliance. It automates evidence collection for frameworks like EU AI Act, NIST AI RMF, and ISO 42001, bridging the gap between AI teams and risk officers.
Chainlit
by Chainlit (Community)
Production-ready Python framework for building conversational AI applications with streaming, message threading, and human-in-the-loop feedback. Chainlit is optimized specifically for LLM chat UIs and integrates natively with LangChain, LlamaIndex, and LiteLLM.
RunPod
by RunPod
Cloud GPU platform for AI inference and training with serverless and dedicated GPU options. Provides cost-effective GPU rentals with pre-built templates for popular ML frameworks and models.
SerpAPI
by SerpAPI
API for scraping and parsing search engine results from Google, Bing, Yahoo, and others. Provides structured JSON results from multiple search engines with support for locations, languages, and devices.
Great Expectations
by Superconductive
Open-source data quality platform for validating, profiling, and documenting data pipelines. Provides expectation-based testing for data quality with automated documentation and alerting capabilities.
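The expectation-based testing model can be sketched in a few lines: declare a check over a column, run it, and collect a machine-readable validation result. The function name and report shape below are illustrative, not the Great Expectations API.

```python
# Sketch of an "expectation": a declarative data-quality check that
# returns a structured pass/fail report instead of raising.
def expect_values_between(column, low, high):
    failures = [v for v in column if not (low <= v <= high)]
    return {"success": not failures, "unexpected": failures}

ages = [34, 29, 41, -3, 57]
report = expect_values_between(ages, 0, 120)
```

A suite of such expectations, run on every pipeline load, is what produces the automated documentation and alerting the entry describes.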
DVC
by Iterative
Open-source version control system for machine learning projects with Git-like data management. Provides data versioning, experiment tracking, and ML pipeline management alongside your existing Git workflow.
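The Git-like data management described above rests on content addressing: a dataset is identified by the hash of its bytes, so only a tiny pointer needs to live in Git while the data sits in a hash-keyed cache. A stdlib sketch of that core idea (the in-memory `cache` dict stands in for DVC's on-disk cache):

```python
# Content-addressed storage sketch: files are keyed by the hash of
# their contents, so identical data is stored once and a short hash
# pointer is all that gets committed to Git.
import hashlib

cache = {}

def track(data: bytes) -> str:
    """Store data in the cache and return its content address."""
    addr = hashlib.md5(data).hexdigest()   # DVC historically keys on MD5
    cache[addr] = data
    return addr

ptr = track(b"label,value\ncat,1\n")
restored = cache[ptr]                      # "checkout" by pointer
```

Changing one byte of the data changes the hash, which is how different dataset versions coexist and how `dvc checkout` can restore any of them.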
ArangoDB
by ArangoDB
ArangoDB is a native multi-model database supporting graphs, documents, and key-value storage in a single engine, with integrated vector search and ML capabilities for building knowledge-graph-backed AI applications. Its AQL query language and ArangoSearch make it suitable for complex knowledge retrieval pipelines combining structural and semantic search.
Swarm
by OpenAI
OpenAI's experimental lightweight multi-agent orchestration framework focused on handoffs and routines. Provides a minimal abstraction for agent coordination using function calling and agent transfers.
Patronus AI
by Patronus AI
Patronus AI is an enterprise LLM evaluation platform specializing in automated testing for hallucination, toxicity, PII leakage, and factual accuracy across production AI systems. It provides a library of 1,000+ pre-built evaluators and supports custom evaluator creation to enforce application-specific quality gates.
Fireworks AI
by Fireworks AI
High-performance inference platform for generative AI with fast model serving and fine-tuning. Optimized for production workloads with function calling, JSON mode, and grammar-based generation.
Continue
by Continue
Open-source AI code assistant for VS Code and JetBrains with customizable model and context providers. Supports tab autocomplete, chat, inline editing, and custom slash commands with any LLM.
Pictory
by Pictory AI
Pictory is an AI video creation platform that transforms long-form text, articles, and scripts into short branded videos automatically. It includes AI voiceovers, stock footage matching, automatic highlight extraction, and a brand kit for consistent visual identity.
Flyte
by Union.ai (Linux Foundation)
Kubernetes-native workflow orchestration platform purpose-built for machine learning and data processing at scale. Flyte enforces strong typing on inputs and outputs, provides built-in versioning, and integrates natively with Kubernetes for resource management.
Arthur AI
by Arthur AI
Arthur AI is an enterprise ML monitoring and observability platform that tracks model performance, detects data and concept drift, and measures fairness in production deployments. It provides real-time alerting, explainability dashboards, and bias mitigation tooling for high-stakes AI applications.
Anyscale
by Anyscale
Enterprise platform for scaling AI applications built on the Ray distributed computing framework. Provides managed Ray clusters, model serving, and fine-tuning infrastructure for production AI workloads.
Arize AI
by Arize AI
ML observability platform for monitoring model performance, detecting drift, and troubleshooting issues. Provides real-time monitoring, embedding analysis, and automated performance alerts for AI systems.
Amazon Neptune ML
by Amazon Web Services
Amazon Neptune ML is a managed graph machine learning capability built on Neptune that uses graph neural networks to make predictions on graph data without requiring ML expertise. It automatically trains GNN models on graph structure and node/edge properties for tasks like node classification, link prediction, and regression.
Panel
by HoloViz / NumFOCUS
High-level app and dashboarding framework from HoloViz that works with nearly every visualization library in the Python ecosystem. Panel supports reactive programming, GPU-accelerated plotting, and server-side rendering, making it ideal for complex analytical AI dashboards.
Metaflow
by Netflix / Outerbounds
Human-friendly Python library for building and managing real-life data science and ML projects. Originally developed at Netflix, provides seamless scaling from laptops to cloud with versioning and reproducibility.
Traefik AI Gateway
by Traefik Labs
Cloud-native edge router and AI gateway built for Kubernetes-native LLM traffic management. Traefik AI Gateway extends the battle-tested Traefik reverse proxy with LLM-aware middleware for token counting, semantic caching, failover routing, and provider load balancing.
Sourcegraph Cody
by Sourcegraph
AI coding assistant powered by Sourcegraph's code graph for deep codebase understanding. Provides context-aware code generation and answers using entire repository knowledge across large codebases.
Snorkel
by Snorkel AI
Enterprise data-centric AI platform for programmatically labeling and curating training data. Uses weak supervision and labeling functions to create large labeled datasets without manual annotation.
Spline AI
by Spline Design
Browser-based 3D design tool with integrated AI generation capabilities for creating interactive 3D scenes, objects, and animations from text prompts. Spline AI allows designers and developers to produce real-time web-ready 3D graphics without traditional 3D modeling expertise.
Marker
by VikParuchuri
Fast and accurate PDF to Markdown converter optimized for books and scientific papers. Handles complex layouts, equations, tables, and multi-column documents with higher quality than traditional OCR tools.
Zilliz Cloud
by Zilliz
Fully managed vector database service built on Milvus for enterprise-grade similarity search. Provides auto-scaling, high availability, and enterprise security with a simplified operational experience.
Cleanlab
by Cleanlab
Data-centric AI library for finding and fixing label errors in datasets automatically. Uses confident learning algorithms to identify mislabeled data, estimate noise, and improve model training quality.
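A much-simplified version of the confident-learning heuristic: flag an example when the model's confidence in its *given* label falls below that class's average confidence. Real Cleanlab estimates a full label-noise matrix from out-of-sample predicted probabilities; this sketch keeps only the core intuition, with hand-picked numbers.

```python
# Label-error detection sketch: an example whose given-label confidence
# is well below the class average is a likely mislabel.
probs = [                     # model P(class) per example: [cat, dog]
    [0.95, 0.05],
    [0.90, 0.10],
    [0.10, 0.90],             # labeled "cat", but the model says "dog"
]
labels = [0, 0, 0]            # given labels (all "cat")

# Per-class threshold: mean confidence over examples given that label.
thr = sum(probs[i][0] for i in range(3)) / 3

suspects = [i for i, y in enumerate(labels) if probs[i][y] < thr]
```

Here only the third example is flagged, which matches the intuition that the model strongly disagrees with its label.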
Gemini Code Assist
by Google
Google's AI-powered code assistance tool integrated with Google Cloud and IDEs. Provides code completions, explanations, and transformations powered by Gemini models with enterprise security controls.
txtai
by NeuML
All-in-one embeddings database for semantic search, LLM orchestration, and language model workflows. Combines vector search with NLP pipelines including summarization, translation, and text-to-speech.
Swimm
by Swimm
Swimm is an AI-powered code documentation tool that auto-generates and keeps documentation synchronized with the codebase using IDE plugins. It detects code changes and alerts developers when docs become stale, enabling engineering teams to maintain accurate, living documentation at scale.
Serper
by Serper
Fast and affordable Google Search API for developers and AI applications. Provides structured Google search results including organic results, knowledge graphs, and related questions via a simple REST API.
AutoTrain
by Hugging Face
Hugging Face's automated training solution for fine-tuning LLMs and other models with minimal configuration. Provides a no-code UI and CLI for training custom models with automatic hyperparameter selection.
Stardog
by Stardog
Stardog is an enterprise knowledge graph platform built on W3C standards (RDF, OWL, SPARQL) that enables organizations to unify disparate data sources into a semantic layer for AI and analytics. Its Virtual Graph capability connects to existing databases without data migration, and its AI integration supports LLM grounding on enterprise knowledge.
Prodigy
by Explosion
Scriptable annotation tool by Explosion for creating training data with active learning. Integrates with spaCy for NLP tasks and provides efficient annotation workflows with model-in-the-loop labeling.
GPT-5
by OpenAI
OpenAI's frontier model with advanced reasoning, native multimodal understanding, and robust function calling. Designed for complex enterprise workflows and agentic applications.
GPT-4o
by OpenAI
OpenAI's natively multimodal flagship model processing text, image, and audio inputs with a single unified architecture. Delivers GPT-4 Turbo-level intelligence at 2x speed and 50% lower cost, with breakthrough real-time voice capabilities.
Claude 4
by Anthropic
Anthropic's most capable model featuring advanced reasoning, coding, and multimodal capabilities. Excels at complex analysis, agentic tasks, and extended thinking with industry-leading safety.
GPT-4
by OpenAI
OpenAI's breakthrough large language model that demonstrated a significant leap in reasoning and factual accuracy over GPT-3.5. Widely adopted across enterprise and developer workflows for code generation, analysis, and complex problem-solving.
Claude 3.5 Sonnet
by Anthropic
Anthropic's breakout model that surpassed Claude 3 Opus at Sonnet-tier pricing, setting new industry benchmarks for coding. Introduced computer use capability and became the most popular model on the API due to its exceptional intelligence-to-cost ratio.
Midjourney V6
by Midjourney
Midjourney V6 represents a major leap in photorealism, prompt adherence, and artistic coherence, setting a new industry benchmark for AI image generation quality. It introduced native text rendering within images and dramatically improved its understanding of complex, multi-subject prompts.
Whisper V3
by OpenAI
OpenAI's state-of-the-art open-source automatic speech recognition model; the large-v3 release was trained on over 5 million hours of multilingual audio (1M weakly labeled plus 4M pseudo-labeled). Supports 99 languages with near-human accuracy and includes translation, timestamp, and language detection capabilities.
BERT
by Google
BERT (Bidirectional Encoder Representations from Transformers) is Google's landmark 2018 language model that introduced the bidirectional pre-training paradigm using masked language modeling and next sentence prediction. It revolutionized NLP by demonstrating that a single pre-trained model could achieve state-of-the-art results across dozens of downstream tasks with minimal fine-tuning.
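The masked language modeling objective above can be sketched in a few lines. This is a minimal illustration of the recipe described in the BERT paper (15% of tokens selected; of those, 80% replaced with [MASK], 10% with a random token, 10% left unchanged); `mask_tokens` and the toy vocabulary are illustrative names, not part of any library.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style masking: select ~mask_prob of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns the corrupted sequence and the (index, original) targets
    the model must predict."""
    rng = random.Random(seed)
    masked, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets.append((i, tok))  # prediction target is the original token
            roll = rng.random()
            if roll < 0.8:
                masked[i] = "[MASK]"
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)
            # else: token left unchanged, but still predicted
    return masked, targets

masked, targets = mask_tokens(
    ["the", "cat", "sat", "on", "the", "mat"],
    vocab=["dog", "ran", "tree"],
)
```

Training then asks the encoder to recover each target token from full bidirectional context, which is what distinguishes BERT from left-to-right language models.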
Gemini 2.5 Pro
by Google DeepMind
Google DeepMind's flagship thinking model with native multimodal understanding across text, images, audio, and video. Excels at complex reasoning, code generation, and agentic tasks with a million-token context window.
Stable Diffusion XL
by Stability AI
Stability AI's high-resolution image generation model producing photorealistic and artistic images at 1024x1024 resolution. Features a two-stage architecture with a base model and refiner for enhanced detail and compositional quality.
GPT-4 Turbo
by OpenAI
An optimized variant of GPT-4 offering a 128K context window, faster inference, and significantly reduced costs. Introduced JSON mode and improved function calling, making it the preferred GPT-4 variant for production applications.
Llama 3.1 70B
by Meta
Meta's workhorse open-source model with 70B parameters, 128K context window, and native tool-use support. Widely deployed as a cost-effective alternative to proprietary frontier models.
DeepSeek-V3
by DeepSeek
DeepSeek's frontier-class MoE model with 671B total parameters and 37B active, trained using FP8 mixed precision for unprecedented cost efficiency. Matches or exceeds GPT-4o and Claude 3.5 Sonnet on key benchmarks.
o1
by OpenAI
OpenAI's first reasoning model, which uses an extended internal chain of thought before responding. Achieves expert-level performance on competitive math (AIME), PhD-level science (GPQA), and complex coding tasks through large-scale reinforcement learning on chain-of-thought.
ElevenLabs Turbo v2.5
by ElevenLabs
ElevenLabs Turbo v2.5 is a low-latency multilingual text-to-speech model optimized for real-time conversational AI applications, offering sub-400ms first-audio latency while maintaining the high voice cloning fidelity ElevenLabs is known for across 32 languages. It powers a wide range of AI assistant, customer service, and interactive voice applications where natural-sounding, real-time speech is critical.
Llama 3.1 405B
by Meta
The largest openly available language model at 405 billion parameters, rivaling proprietary frontier models in reasoning and knowledge. A landmark release demonstrating open-source models can match closed alternatives.
DALL-E 3
by OpenAI
OpenAI's most advanced image generation model with native ChatGPT integration. Features dramatically improved prompt following, text rendering, and safety mitigations compared to DALL-E 2, generating high-fidelity images from natural language descriptions.
Claude 4 Sonnet
by Anthropic
Anthropic's balanced Claude 4 generation model delivering strong coding and reasoning at competitive pricing. Features improved agentic capabilities and extended thinking, offering a compelling mid-tier option between Haiku and Opus.
Llama 3 70B
by Meta
Meta's high-performance 70B parameter model closing the gap with proprietary frontier models. Achieved competitive results on major benchmarks while remaining fully open-source.
Claude 4.5 Sonnet
by Anthropic
Anthropic's most advanced Sonnet-tier model, combining frontier intelligence with practical speed and cost. Features state-of-the-art coding performance, improved extended thinking, and robust agentic capabilities for complex multi-step workflows.
GPT-2
by OpenAI
GPT-2 is OpenAI's 2019 autoregressive language model that demonstrated for the first time that large-scale unsupervised pre-training on internet text could produce coherent, fluent long-form text generation with zero-shot task performance. Its initial withheld release sparked global debate about AI safety and responsible disclosure of capable AI systems.
Gemini 2.5 Flash
by Google DeepMind
Google DeepMind's fast thinking model optimized for speed and cost efficiency while retaining strong reasoning capabilities. Supports a million-token context window with native multimodal input.
Gemini 2.0 Flash
by Google
Google's next-generation fast model built for the agentic era, featuring native tool use, multimodal generation, and real-time streaming. Outperforms Gemini 1.5 Pro on key benchmarks while maintaining Flash-tier speed and cost efficiency.
AlphaFold 3
by Google DeepMind
AlphaFold 3 is Google DeepMind's third-generation protein structure prediction model that extends beyond proteins to predict the structures of DNA, RNA, and small molecules and their interactions. It represents a revolutionary tool for drug discovery and structural biology, dramatically accelerating our understanding of molecular machines that underpin life.
Google WaveNet
by Google / DeepMind
Google WaveNet is DeepMind's pioneering generative model for raw audio waveforms. When published in 2016 it dramatically advanced the state of the art in text-to-speech naturalness, and it went on to power Google Assistant, Google Cloud TTS, and other Google products at massive scale. Its autoregressive waveform generation approach established the template for neural vocoder research and inspired a generation of TTS architectures.
Mistral 7B
by Mistral AI
Mistral AI's breakthrough 7B parameter model that outperformed Llama 2 13B across all benchmarks at launch. Introduced sliding window attention and grouped-query attention for efficient inference.
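Sliding window attention, mentioned above, restricts each token to attending over only the last W positions instead of the full causal prefix. A minimal sketch of the resulting attention mask (the helper `sliding_window_mask` is illustrative; Mistral 7B's actual window is 4096):

```python
def sliding_window_mask(seq_len, window):
    """Boolean mask where query position i may attend only to key
    positions j with i - window < j <= i (causal + local window)."""
    return [
        [(i - window < j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=6, window=3)
```

Because each layer's receptive field stacks, information still propagates beyond the window across layers, while per-layer attention cost drops from O(n²) to O(n·W).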
Gemini 1.5 Pro
by Google
Google's mid-size multimodal model featuring a groundbreaking 2 million token context window using mixture-of-experts architecture. Excels at long-document understanding, video analysis, and cross-modal reasoning tasks that require processing large volumes of information.
GPT-4o mini
by OpenAI
OpenAI's most cost-efficient small model, replacing GPT-3.5 Turbo as the default lightweight option. Scores 82% on MMLU and outperforms GPT-4 on chat preferences while costing more than 60% less than GPT-3.5 Turbo.
FLUX 1.1 Pro
by Black Forest Labs
FLUX 1.1 Pro from Black Forest Labs is a next-generation text-to-image model built by the original creators of Stable Diffusion, offering superior prompt comprehension, anatomical accuracy, and photorealistic detail. The FLUX.1 family combines exceptional speed and quality across a hosted Pro tier and open-weight Dev and Schnell variants for different use cases.
T5
by Google
T5 (Text-To-Text Transfer Transformer) is Google's 2019 framework that reframes all NLP tasks as text-to-text problems, allowing a single model to be trained on a unified mixture of tasks. Its clean formulation and the C4 dataset became foundational references for multitask learning research, and T5 variants remain widely used in production and research.
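The text-to-text reframing is easiest to see as prompt templating: every task instance becomes a plain input string and the model emits a plain output string. A sketch using task prefixes from the T5 paper ("translate English to German:", "summarize:", "cola sentence:"); the `to_text_to_text` helper is illustrative, not a library function.

```python
def to_text_to_text(task, **fields):
    """Format a task instance as a single input string, T5-style:
    one model, one interface, for every NLP task."""
    templates = {
        "translate_en_de": "translate English to German: {text}",
        "summarize": "summarize: {text}",
        "cola": "cola sentence: {text}",  # grammatical acceptability
    }
    return templates[task].format(**fields)

prompt = to_text_to_text("summarize", text="T5 casts every NLP task as text-to-text.")
```

Because inputs and outputs are both text, classification labels, translations, and regression scores are all emitted as strings, letting one training mixture cover all tasks.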
GPT-4V
by OpenAI
OpenAI's multimodal extension of GPT-4 with native vision capabilities for image understanding, OCR, and visual reasoning. Processes interleaved text and images for tasks ranging from chart analysis to visual question answering.
Suno V3.5
by Suno AI
Suno V3.5 is a text-to-song AI model that generates complete, radio-quality music tracks with vocals, instrumentation, and song structure directly from natural language prompts or custom lyrics. It supports an enormous range of genres and styles and is widely regarded as the most accessible and highest-quality text-to-music system for non-musicians.
Mixtral 8x7B
by Mistral AI
Mistral AI's sparse mixture-of-experts model using 8 expert networks of 7B parameters each, activating only 2 per token. Matches GPT-3.5 performance while using a fraction of the compute at inference.
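The "activate only 2 of 8 experts per token" routing above is a top-2 softmax gate over per-token expert logits. A minimal sketch under that description (`top2_route` is an illustrative name; a real router is a learned linear layer producing the logits):

```python
import math

def top2_route(logits):
    """Pick the two highest-scoring experts for this token and
    softmax-normalize their gate weights; only those two expert
    FFNs run, so compute stays near 2/8 of the dense cost."""
    top2 = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:2]
    exps = [math.exp(logits[e]) for e in top2]
    z = sum(exps)
    return [(e, w / z) for e, w in zip(top2, exps)]

# Router logits for one token over 8 experts
routes = top2_route([0.1, 2.0, -1.0, 1.5, 0.0, 0.3, -0.5, 0.9])
```

The token's output is the gate-weighted sum of the two selected experts' outputs, which is how total parameters (8 experts) exceed active parameters (2 experts) per token.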
Qwen 2.5 72B
by Alibaba Cloud
The flagship open-weight model in the Qwen 2.5 series, offering substantial improvements in reasoning, instruction following, and structured output over its predecessor. Supports 128K context with strong performance across 29+ languages.
DeepSeek Coder V3
by DeepSeek
DeepSeek Coder V3 is DeepSeek's third-generation code-specialized model, trained on over 2 trillion tokens of code and natural language with a mixture-of-experts architecture. It achieves state-of-the-art performance on major coding benchmarks, surpassing GPT-4o and Claude 3.5 Sonnet on several code generation tasks.
Llama 3.3 70B
by Meta
Meta's refined 70B model delivering performance comparable to the much larger 405B variant through improved training techniques. Offers the best performance-to-cost ratio in the Llama family.
Llama 3 8B
by Meta
Meta's third-generation compact language model with significantly improved performance over Llama 2 at the same size class. Features an expanded 128K token vocabulary and improved tokenizer.
o3-mini
by OpenAI
A compact and cost-efficient reasoning model that delivers strong STEM performance at a fraction of o3's cost. Supports configurable reasoning effort (low/medium/high) to balance speed and accuracy for different use cases.
Claude 3 Opus
by Anthropic
Anthropic's most intelligent model at the launch of the Claude 3 family, excelling at highly complex tasks requiring deep reasoning and nuanced understanding. Set new benchmarks in graduate-level reasoning and demonstrated near-human comprehension across academic subjects.
Llama 2 70B
by Meta
Meta's largest Llama 2 variant with 70 billion parameters delivering substantially improved reasoning and knowledge over the 7B version. Became the de facto open-source baseline for LLM research.
Llama 2 7B
by Meta
Llama 2 7B is an open-source 7 billion parameter large language model developed by Meta. Optimized for dialogue and general text generation, its permissive license and manageable size have made it a popular foundational model for fine-tuning, research, and building custom NLP applications.
Sora
by OpenAI
Sora is a text-to-video diffusion transformer model by OpenAI that generates high-fidelity, minute-long videos from textual prompts. It demonstrates an advanced understanding of language and the physical world, enabling complex scenes with multiple characters, specific motions, and coherent narratives.
Llama 3.1 8B
by Meta
Llama 3.1 8B is a compact, open-source language model from Meta, featuring a 128K token context window and native tool-use capabilities. It is optimized for high performance in instruction-following and reasoning tasks, making it a cost-effective solution for scalable, on-device, or resource-constrained applications.
Stable Diffusion 3
by Stability AI
Stable Diffusion 3 is a powerful text-to-image model using a Multimodal Diffusion Transformer (MMDiT) architecture. It excels at generating images with unprecedented text quality, adhering closely to complex prompts, and achieving high photorealism and compositional accuracy compared to its predecessors.
Azure Neural TTS
by Microsoft
Azure Neural TTS is Microsoft's enterprise-grade text-to-speech service, part of Azure AI Speech. It provides 400+ natural-sounding voices across 140+ languages, with detailed prosody control via SSML. The service is designed for scalable applications, from accessibility tools to customer service bots.
Adobe Firefly 3
by Adobe
Adobe Firefly 3 is a commercially safe generative image model trained exclusively on licensed Adobe Stock and public-domain content, making it uniquely suitable for professional and enterprise creative workflows. Its deep integration with Photoshop, Illustrator, and Express enables AI-powered generation directly within industry-standard design tools.
Codex-2
by OpenAI
Codex-2 is OpenAI's second-generation code-specialized model, significantly advancing code completion, synthesis, and debugging over the original Codex. It underpins GitHub Copilot's next-generation features and supports a wider range of programming languages and frameworks.
ClinicalBERT
by Kexin Huang et al. (Academic)
ClinicalBERT is a BERT-based model pre-trained on clinical notes from the MIMIC-III dataset. It provides a deep contextual understanding of electronic health record (EHR) text and clinical documentation, serving as a foundational model for various clinical natural language processing tasks.
Gemini 2.5 Ultra
by Google DeepMind
Gemini 2.5 Ultra is Google DeepMind's most capable model in the 2.5 generation, designed for the most demanding reasoning, coding, and multimodal tasks. It features an extended context window and advanced chain-of-thought capabilities surpassing prior Gemini variants.
Claude Opus 4
by Anthropic
Anthropic's most capable model in the Claude 4 generation, designed for the most demanding reasoning, analysis, and agentic tasks. Excels at complex multi-step problems requiring deep understanding and sustained coherence across long contexts.
Gemini 1.5 Flash
by Google
Google's lightweight and fast multimodal model optimized for high-volume, cost-sensitive workloads. Supports a 1 million token context window with natively multimodal capabilities across text, image, audio, and video at a fraction of Pro's cost.
Cohere Embed v3
by Cohere
Cohere's state-of-the-art embedding model supporting 100+ languages with native int8 and binary quantization for efficient storage. Produces high-quality vector representations optimized for search, classification, and clustering tasks.
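Binary quantization, mentioned above, keeps only the sign of each embedding dimension, shrinking storage roughly 32x versus float32 and letting retrieval use Hamming distance. A minimal sketch of the idea (the `binarize`/`hamming` helpers are illustrative, not Cohere SDK calls):

```python
def binarize(vec):
    """Binary quantization: keep only the sign of each dimension,
    so a float vector packs down to one bit per dimension."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    """Distance between binary embeddings: number of differing bits.
    Lower distance ~ higher similarity of the original vectors."""
    return sum(x != y for x, y in zip(a, b))

query = binarize([0.8, -0.1, 0.3, -0.7])
doc = binarize([0.5, -0.2, -0.4, -0.9])
dist = hamming(query, doc)
```

In production the bits are packed into integers and compared with XOR + popcount, which is why binary embeddings make very large corpora cheap to search.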
Grok-3
by xAI
Grok-3 is xAI's frontier model, delivering state-of-the-art performance in math, science, and coding. Trained on the Colossus supercluster, it features DeepSearch for multi-step research and a 'Think' mode for extended chain-of-thought reasoning, enabling it to tackle complex, real-world problems with access to real-time information.
DeepSeek-Coder-V2
by DeepSeek
DeepSeek-Coder-V2 is a powerful open-source Mixture-of-Experts (MoE) model specialized in code. It supports 338 programming languages and features advanced fill-in-the-middle capabilities, offering performance comparable to top-tier proprietary models like GPT-4 Turbo at a significantly lower inference cost.
Claude 3.5 Haiku
by Anthropic
Anthropic's fastest, most affordable model in the 3.5 generation, offering performance comparable to Claude 3 Opus. It excels at coding, complex workflows, and agentic tasks due to its advanced tool-use capabilities and speed, making it ideal for high-throughput applications and enterprise automation.
Runway Gen-3 Alpha
by Runway
Runway Gen-3 Alpha is a professional-grade video generation model for high-fidelity, temporally consistent clips. It offers fine-grained control over motion, style, and camera behavior via text and image inputs, making it a key tool in professional film and advertising workflows for meeting commercial standards.
Qwen 2 72B
by Alibaba Cloud
Qwen2-72B is a 72-billion parameter large language model from Alibaba's Qwen2 series. It offers state-of-the-art performance, particularly in multilingual understanding, reasoning, and coding tasks. As an open-weight model, it provides a powerful alternative to proprietary systems for a wide range of applications.
Claude 3 Sonnet
by Anthropic
The balanced mid-tier model in the Claude 3 family, offering a strong combination of speed and intelligence. Provides enterprise-grade performance for coding, analysis, and content generation at moderate cost.
AlphaGo
by Google DeepMind
AlphaGo is a landmark AI from DeepMind that mastered the game of Go. It combines deep neural networks with Monte Carlo Tree Search and reinforcement learning, famously defeating world champion Lee Sedol in 2016. Its success demonstrated AI's ability to tackle complex problems requiring strategic planning.
Qwen 2.5 Coder 32B
by Alibaba Cloud
Qwen 2.5 Coder 32B is an open-weight, code-specialized large language model from Alibaba Cloud. Fine-tuned on a massive corpus covering over 92 programming languages, it excels at code generation, completion, and debugging tasks, demonstrating performance on par with or exceeding proprietary models like GPT-4o on several benchmarks.
Claude 3 Haiku
by Anthropic
Claude 3 Haiku is Anthropic's fastest, most compact model, excelling at near-instant responsiveness. It handles a wide range of tasks, including multimodal vision, with strong performance at a low cost, making it ideal for high-throughput applications like content moderation and customer service.
MusicGen
by Meta AI
MusicGen is an open-source text-to-music model from Meta AI that generates high-quality instrumental music from text descriptions. It can also be conditioned on a melody reference, providing a strong, controllable baseline for both research and commercial applications, trained on 20K hours of licensed music.
Mixtral 8x22B
by Mistral AI
Mixtral 8x22B is a large-scale, open-source Mixture-of-Experts (MoE) model from Mistral AI. It has 141 billion total parameters but activates only 39 billion per token, balancing immense power with efficiency. The model excels at reasoning, code generation, and multilingual tasks, and includes native function calling capabilities.
Mistral Large
by Mistral AI
Mistral Large is Mistral AI's flagship proprietary model, offering top-tier reasoning and multilingual capabilities. It is designed to compete with other frontier models like GPT-4, excelling in complex tasks that require deep understanding. Its native function calling and fluency in over 30 languages make it highly versatile for enterprise-grade applications.
Code Llama 34B
by Meta
Code Llama 34B is a large language model from Meta, fine-tuned from Llama 2 for code-specific tasks. It excels at generating, completing, and explaining code across various languages. With variants supporting a 100K token context window, it can analyze and work with extensive codebases for complex tasks like refactoring.
Multilingual-E5-Large
by Microsoft Research
Multilingual-E5-Large is a powerful text embedding model from Microsoft supporting 100 languages. Trained on billions of text pairs using contrastive learning, it excels at cross-lingual information retrieval and semantic similarity, establishing a strong open-source baseline for multilingual NLP tasks.
Med-PaLM 2
by Google
Med-PaLM 2 is Google's large language model specialized for the medical domain. It achieves expert-level performance on medical licensing exams (USMLE) by leveraging advanced clinical reasoning and question-answering capabilities. The model is designed to generate accurate and helpful responses for healthcare professionals.
Qwen2.5-VL-72B
by Alibaba Cloud (Qwen Team)
Qwen2.5-VL-72B is Alibaba's flagship open vision-language model at 72 billion parameters, achieving top-tier performance on visual understanding benchmarks including chart analysis, document parsing, and fine-grained image understanding. It supports dynamic resolution image inputs and video understanding with native high-resolution processing.
GPT-4.5
by OpenAI
GPT-4.5 is a large language model from OpenAI, released as a research preview ahead of GPT-5. It focuses on scaling unsupervised learning to significantly reduce hallucinations and enhance factual accuracy, and is tuned for improved creative writing and greater emotional intelligence in its responses.
Phi-3.5-mini
by Microsoft
Phi-3.5-mini is a 3.8B parameter instruction-tuned model from Microsoft, optimized for edge and mobile devices. Despite its compact size, it delivers performance comparable to much larger models on benchmarks for reasoning, coding, and language tasks, making it highly efficient for on-device AI applications.
o1-mini
by OpenAI
A smaller, faster, and more affordable reasoning model optimized for STEM tasks. Delivers 80% of o1's reasoning capability at roughly 80% lower cost, making it ideal for high-volume coding and math workloads.
PaLM
by Google
PaLM (Pathways Language Model) is Google's 540 billion parameter language model trained using the Pathways system across 6,144 TPU v4 chips, demonstrating breakthrough capabilities on chain-of-thought reasoning, code generation, and multilingual tasks. It introduced the concept of 'discontinuous' capability jumps at scale and set new benchmarks on hundreds of NLP tasks upon release in 2022.
Ideogram 2
by Ideogram AI
Ideogram 2 is a text-to-image model renowned for its superior ability to render legible and accurate text within generated images. It excels at creating high-quality photorealistic and artistic visuals with strong prompt adherence, making it a powerful tool for design, branding, and creative projects.
Amazon Polly Neural
by Amazon Web Services
Amazon Polly is a cloud-based text-to-speech (TTS) service from AWS that produces highly natural-sounding human speech using neural engine technology. It supports over 30 languages with both standard and neural voices, offering deep integration with the AWS ecosystem for scalable production applications.
Claude Opus 4.5
by Anthropic
Claude Opus 4.5 is Anthropic's frontier AI model, delivering state-of-the-art performance in complex reasoning, creative tasks, and nuanced understanding. It features advanced multimodal vision capabilities for analyzing images and documents, along with extended thinking for multi-step, agentic tasks.
TTS-1
by OpenAI
OpenAI's TTS-1 is a text-to-speech model designed for real-time audio generation. It provides six distinct, natural-sounding preset voices and supports low-latency streaming, making it ideal for interactive applications. A higher-quality variant, tts-1-hd, is available for tasks where audio fidelity is prioritized over speed.
Command R+
by Cohere
Cohere's most capable RAG-optimized model, offering significantly enhanced reasoning, multi-step tool use, and superior grounded generation over Command R. Designed for complex enterprise workflows requiring high accuracy and citations.
Imagen 3
by Google DeepMind
Google DeepMind's highest-quality text-to-image generation model producing photorealistic images with improved detail, lighting, and fewer artifacts. Features enhanced prompt understanding and safety filtering.
Qwen 2.5 Max
by Alibaba Cloud
Alibaba Cloud's most capable proprietary model in the Qwen 2.5 family, optimized for complex reasoning and enterprise applications. Available exclusively through Alibaba Cloud's Model Studio API with enhanced safety and alignment.
AudioCraft
by Meta AI
AudioCraft is an open-source generative audio framework from Meta AI. It integrates MusicGen for music, AudioGen for sound effects, and the EnCodec codec into a single platform. This unified, modular design allows for text-to-audio generation and has become a key reference for audio LLM research.
LegalBERT
by Ilias Chalkidis et al. (Academic)
LegalBERT is a family of BERT models pre-trained on a diverse corpus of English legal texts, including legislation, court cases, and contracts. This specialized training allows it to significantly outperform general-purpose BERT models on downstream legal NLP tasks, establishing it as a foundational baseline for legal AI research and applications.
Gemma 2 9B
by Google DeepMind
Gemma 2 9B is a lightweight, state-of-the-art open model from Google, part of the next generation of the Gemma family. It offers strong performance for its size class, making it ideal for environments with limited computational resources. Built on a new architecture, it is optimized for on-device applications, research, and fine-tuning.
QwQ-32B
by Alibaba / Qwen Team
QwQ-32B is a 32 billion parameter language model from Alibaba, specifically optimized for complex reasoning tasks. It utilizes a deep chain-of-thought methodology to excel at mathematical, scientific, and logical problems, achieving performance comparable to much larger models and showcasing high parameter efficiency.
BLOOM
by BigScience Workshop
BLOOM is a 176 billion parameter, open-access multilingual language model developed by the BigScience research workshop. Trained on 46 natural languages and 13 programming languages, it provides powerful text and code generation capabilities, making it a key resource for researchers and developers building multilingual AI applications.
StarCoder2 15B
by BigCode (ServiceNow + Hugging Face)
StarCoder2 15B is a powerful open-source code generation model from the BigCode project. Trained on The Stack v2 dataset spanning over 600 programming languages, it excels at code completion, generation, and fill-in-the-middle tasks, emphasizing data transparency and author opt-out.
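Fill-in-the-middle prompting, mentioned above, rearranges a file around the gap using sentinel tokens so an autoregressive model can generate the missing span. A sketch using the StarCoder-style sentinels (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`); the `fim_prompt` helper is illustrative, and exact sentinel strings should be checked against the model's tokenizer.

```python
def fim_prompt(prefix, suffix):
    """Assemble a fill-in-the-middle prompt: the model sees the code
    before and after the gap, then generates the middle span after
    the <fim_middle> sentinel."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = fim_prompt(
    "def add(a, b):\n    return ",
    "\n\nprint(add(2, 3))",
)
```

The model's completion (everything it emits after `<fim_middle>`) is spliced back between the prefix and suffix, which is how editors implement cursor-position completion.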
Phi-3 Mini
by Microsoft
Microsoft's Phi-3 Mini is a 3.8 billion parameter small language model (SLM) designed for high performance on resource-constrained devices. Despite its compact size, it exhibits strong reasoning and language understanding capabilities, making it suitable for on-device and edge AI applications. It is optimized for efficient inference.
Cohere Rerank v3
by Cohere
Cohere Rerank v3 is a state-of-the-art neural model designed to significantly boost the relevance of search results for Retrieval-Augmented Generation (RAG) systems. It re-scores a list of candidate documents from any keyword or vector search system, identifying the most pertinent information. It supports over 100 languages and can process long documents, making it highly versatile.
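The rerank stage described above is a simple contract: take the query plus a candidate list from any first-stage retriever, re-score every candidate, and keep the best few. A sketch of that flow in which a toy word-overlap score stands in for the neural cross-encoder a real deployment would call; `rerank` and its scoring are illustrative, not the Cohere API.

```python
def rerank(query, docs, top_n=3):
    """Second-stage reranking: re-score all candidates against the
    query and return the top_n most relevant. The overlap score here
    is a placeholder for a learned relevance model."""
    def score(doc):
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        return len(q_words & d_words) / len(q_words)
    return sorted(docs, key=score, reverse=True)[:top_n]

best = rerank(
    "how do i rotate api keys",
    ["Rotating api keys safely", "Office seating chart", "api key rotation schedule"],
    top_n=2,
)
```

Because the reranker only sees a short candidate list, it can afford a much more expensive relevance model than the first-stage index, which is where most RAG quality gains come from.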
DeepSeek Coder 33B
by DeepSeek
DeepSeek Coder 33B is a dense, open-source large language model specializing in code-related tasks. Trained from scratch on a massive 2 trillion token dataset of code and natural language, it understands project-level context and supports 87 different programming languages for advanced code generation and completion.
Llama 3.2 11B Vision
by Meta
Llama 3.2 11B Vision is Meta's first open-source multimodal model, integrating native image understanding with advanced text generation. At a compact 11B parameters, it's designed for efficiency, enabling visual question answering, image captioning, and complex reasoning across text and images in a single, deployable model.
DeepSeek-V2
by DeepSeek
DeepSeek's mixture-of-experts model introducing Multi-head Latent Attention (MLA) for dramatically reduced inference cost. Activates 21B of its 236B total parameters per token while matching larger dense models.
Codestral
by Mistral AI
Codestral is Mistral AI's open-weight generative model explicitly designed for code generation tasks. Trained on a diverse dataset of over 80 programming languages, it excels at code completion, generation, and its unique fill-in-the-middle capability. It is optimized for low-latency performance in real-world applications.
Gemma 2 27B
by Google DeepMind
Gemma 2 27B is a powerful, mid-sized open-weights model from Google DeepMind. It delivers significant performance gains in reasoning, coding, and instruction following over smaller variants. Designed for server-side deployment, it provides a strong foundation for advanced research and custom fine-tuning projects.
Claude 4.5 Haiku
by Anthropic
Claude 4.5 Haiku is Anthropic's fastest and most compact model, engineered for near-instant responsiveness and high-throughput workloads. It provides enterprise-grade performance at a fraction of the cost, making it ideal for real-time interactions, content moderation, and cost-effective agentic tasks.
XTTS-v2
by Coqui AI
XTTS-v2 is an open-source, cross-lingual text-to-speech model from Coqui AI. It excels at high-quality voice cloning from just a few seconds of audio and supports 17 languages. With real-time streaming inference, it's ideal for applications needing custom voices and low-latency output.
BloombergGPT
by Bloomberg
BloombergGPT is a 50-billion parameter large language model developed by Bloomberg. It is specifically trained on a massive, curated corpus of financial data accumulated over decades, combined with general-purpose datasets. This specialized training allows it to excel at financial natural language processing tasks, outperforming similarly sized general models.
Grok-2
by xAI
Grok-2 is xAI's second-generation large language model, notable for its real-time knowledge access through the X platform. It possesses strong reasoning and multimodal capabilities, including vision understanding. The model is designed for a more natural, conversational interaction style with a lower tendency to refuse prompts.
BioGPT
by Microsoft Research
BioGPT is a domain-specific language model from Microsoft, pre-trained on a massive corpus of biomedical literature from PubMed. It excels at tasks like generating biomedical text, extracting relationships between entities, and answering questions based on medical research, achieving state-of-the-art results on several benchmarks.
Command R
by Cohere
Command R is a retrieval-optimized language model from Cohere, specifically designed for enterprise-grade Retrieval-Augmented Generation (RAG) and tool use. It excels in multilingual applications, supporting over 10 languages, and features built-in capabilities for grounding responses and generating citations to ensure accuracy.
Gemma 2B
by Google DeepMind
Gemma 2B is Google DeepMind's open-weight 2 billion parameter language model from the Gemma family, designed for lightweight deployment on devices with limited resources. It delivers strong performance for its size on language understanding and generation tasks, and serves as a foundation for fine-tuning on domain-specific tasks.
Pika 1.5
by Pika Labs
Pika 1.5 is an accessible AI video generation model that transforms text prompts or images into high-quality videos. It is known for its expressive motion, diverse cinematic styles, and unique features like physics-based effects and automated lip-sync, making it popular among creators and consumers.
SRE Triage Agent
by AaaS DevOps Foundry
Detects anomalies in live system telemetry, runs deterministic diagnostics from the organization's top remediation runbooks, and autonomously resolves up to 40% of standard incidents without human intervention. Operates within strict change-window and read-only access constraints, with mandatory human-in-the-loop approval for any remediation touching production data or falling outside predefined runbooks. Reduces mean-time-to-recovery and augments on-call teams.
Pipeline Healer Agent
by AaaS DevOps Foundry
Continuously observes CI/CD pipelines, code repositories, and incident logs. Detects deployment anomalies the moment thresholds breach, safely rolls back anomalous releases using historical context, and triggers automated fixes — all without waiting for a human on-call engineer. Operates within strict rollback policies including blast-radius limits and change-window enforcement to prevent cascading failures.
Dependency Guardian Agent
by AaaS DevOps Foundry
Maps the entire dependency tree across an organization's codebases, tests library updates in isolated sandbox environments, writes localized unit tests to verify compatibility, and submits fully validated pull requests that respect architectural constraints. Prevents the cascade of breaking changes that plagues manual dependency updates, where a naive LLM taking a prompt literally might introduce version conflicts or accidentally remove necessary features.
OpenAI Assistants API
by OpenAI
OpenAI's managed agent platform for building custom AI assistants with persistent threads, built-in code interpreter, file search, and function calling. Handles conversation state, tool orchestration, and context management so developers can focus on business logic.
Microsoft Copilot Agent
by Microsoft
Microsoft's autonomous agent within the Copilot ecosystem that operates across Microsoft 365 apps to automate business processes. Handles email triage, meeting preparation, document summarization, and cross-app workflow automation with enterprise-grade security.
Personalized Tutor Agent
by Khanmigo (Khan Academy)
An adaptive tutoring agent that dynamically adjusts difficulty, pacing, and instructional modality based on individual learner performance signals. It maintains a persistent knowledge model per student, identifies misconceptions through Socratic questioning, and routes learners to mastery via spaced-repetition scheduling.
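The spaced-repetition scheduling mentioned above can be illustrated with a Leitner-style rule: grow the review interval on success, reset it on failure. This is a generic sketch of the technique, not Khanmigo's actual scheduler; `next_review` is an illustrative name.

```python
from datetime import date, timedelta

def next_review(last_interval_days, correct):
    """Leitner-style spaced repetition: double the interval after a
    correct answer, reset to one day after a miss. Returns the new
    interval and the due date."""
    interval = last_interval_days * 2 if correct else 1
    return interval, date.today() + timedelta(days=interval)

interval, due = next_review(4, correct=True)
```

Production systems (e.g., SM-2 variants) also weight the interval by answer quality and an ease factor, but the grow-on-success, reset-on-failure core is the same.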
Codebase Architecture Agent
by AaaS DevOps Foundry
Maps structural dependencies, architectural patterns, and historical technical decisions across enterprise codebases. When a critical service fails and the original developers are unavailable, this agent produces a semantic architecture map — dependency graphs, hotspot analysis, and knowledge gap identification — in minutes instead of weeks. Integrates deeply with repositories to understand code as architecture, not just text.
Omnichannel Support Agent
by Intercom
A fully autonomous customer support agent that unifies conversations across chat, email, SMS, and social DMs into a single threaded context window. It resolves tier-1 and tier-2 tickets using a retrieval-augmented knowledge base and maintains CSAT targets through sentiment-aware tone calibration.
AutoGen
by Microsoft Research
Microsoft's multi-agent conversation framework enabling multiple LLM agents to converse, collaborate, and solve tasks through automated chat. Supports customizable agent behaviors, human-in-the-loop, and code execution sandboxing.
Perplexity
by Perplexity AI
AI-powered answer engine that combines real-time web search with LLM synthesis to provide cited, accurate answers. Features multi-step research capabilities, source verification, and conversational follow-up for deep topic exploration.
EHR Documentation Agent
by Nuance Communications (Microsoft)
Ambient AI agent that listens to physician-patient encounters, generates structured clinical notes (SOAP, H&P, discharge summaries), and auto-populates EHR fields in real time. Reduces documentation burden by over 70% while maintaining compliance with ICD-10 and CPT coding standards.
Salesforce Einstein Agent
by Salesforce
Salesforce's autonomous AI agent built on the Einstein platform that handles customer interactions, resolves support cases, and automates sales workflows. Operates within the Salesforce ecosystem with full access to CRM data, knowledge bases, and business rules.
Drug Interaction Checker
by Wolters Kluwer Health
Real-time pharmacological agent that screens multi-drug regimens for contraindications, adverse interactions, and dosing conflicts. Cross-references patient allergy profiles, renal function, and genetic pharmacogenomics data to surface clinically relevant alerts at point of prescribing.
SEO Analysis Agent
by Ahrefs
A fully autonomous SEO agent that continuously crawls a target website, audits technical health, researches high-intent keywords, and generates prioritized optimization recommendations. It tracks ranking movements in real time and surfaces backlink opportunities from competitor gap analysis.
ElevenLabs Conversational Agent
by ElevenLabs
ElevenLabs' conversational AI agent platform combining industry-leading voice synthesis with real-time dialogue capabilities. Supports 29+ languages, custom voice creation, and ultra-low-latency responses for natural phone and web interactions.
AutoGPT
by Significant Gravitas
One of the first open-source autonomous AI agents that chains LLM calls to accomplish complex goals. Decomposes high-level objectives into sub-tasks, maintains memory, and executes multi-step plans with internet access and file operations.
Latency Budget Planner Agent
by AaaS DevOps Foundry
Decomposes end-to-end application latency into detailed per-component budgets for real-time and streaming pipeline architectures. Autonomously adds graceful degradation protocols, timeout handling configurations, and p50/p95 tracing metrics required for production multimodal systems. Where a generic AI produces streaming pipeline code without real-world latency considerations, this agent understands the physics of distributed systems and produces actionable latency allocation plans.
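Per-component latency budgeting of the kind described above can be sketched as a proportional split of an end-to-end target, with headroom reserved for retries and jitter. The stage names, weights, and 10% margin are hypothetical, not the agent's actual policy:

```python
def allocate_budget(total_ms: float, weights: dict[str, float], margin: float = 0.1) -> dict[str, float]:
    """Split an end-to-end latency budget across pipeline stages.
    A fraction (`margin`) is held back as headroom for retries and jitter;
    the rest is divided in proportion to each stage's expected cost."""
    usable = total_ms * (1 - margin)
    scale = usable / sum(weights.values())
    return {stage: w * scale for stage, w in weights.items()}

# Hypothetical stages for a speech-to-speech pipeline with an 800 ms p95 target.
budgets = allocate_budget(800, {"asr": 3, "llm": 5, "tts": 2})
```

In practice each stage budget would then be enforced as a timeout and traced at p50/p95 so regressions are attributable to a specific component.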
Learning Path Optimizer
by Coursera
A recommendation agent that maps learner skill profiles against target competency frameworks and synthesizes the shortest credentialed path to proficiency. It continuously reoptimizes routing as learners complete modules and integrates real-time labor-market signals to prioritize high-value skill sequences.
Legal Research Agent
by Westlaw AI (Thomson Reuters)
Comprehensive legal research agent that queries case law databases, statutes, regulations, and secondary sources to synthesize jurisdiction-specific memos, identify controlling precedents, and map circuit splits. Generates formatted legal research memos with citation-verified sources and confidence scores.
Google Duet AI
by Google
Google's AI-powered assistant embedded across Google Workspace and Google Cloud that automates document creation, email drafting, data analysis, and cloud infrastructure management. Leverages Gemini models for contextual understanding across the Google ecosystem.
Meeting Summarizer Agent
by Otter.ai
An autonomous agent that joins virtual meetings, transcribes conversations in real time with speaker diarization, and generates structured summaries containing decisions made, action items with owners and due dates, and key discussion points. It distributes follow-up notes to participants, syncs action items into project management tools, and maintains a searchable meeting knowledge base.
Snyk AI Agent
by Snyk
AI-powered developer security agent that continuously scans code, dependencies, containers, and infrastructure-as-code for vulnerabilities. Provides automated fix pull requests, prioritizes issues by exploitability, and integrates directly into the developer workflow for shift-left security.
GitHub Copilot Workspace
by GitHub (Microsoft)
GitHub's AI-native development environment that turns issues into fully implemented code changes. Plans, implements, and validates multi-file edits with human-in-the-loop review before merging.
LangGraph
by LangChain Inc.
LangChain's framework for building stateful, multi-agent applications using graph-based workflows. Provides fine-grained control over agent state, cycles, branching, and human-in-the-loop checkpoints for production-grade agentic systems.
Dependency Updater Agent
by Mend (WhiteSource)
An automated agent that scans software repositories for outdated or vulnerable dependencies, opens pull requests with tested dependency upgrades, and resolves breaking API changes introduced by major version bumps. It groups related updates, runs the test suite for each PR, and prioritizes CVE-critical packages to ensure security patches ship within SLA windows.
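The CVE-first prioritization described above might look like this in miniature. The `Update` record and the package names are invented for illustration, not Mend's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Update:
    package: str
    cvss: float          # highest CVE severity fixed by the upgrade (0 if none)
    versions_behind: int  # staleness of the current pin

def prioritize(updates: list[Update]) -> list[Update]:
    """Security-critical upgrades first (highest CVSS), then the most
    stale packages, so CVE fixes ship within SLA windows."""
    return sorted(updates, key=lambda u: (-u.cvss, -u.versions_behind))

queue = prioritize([
    Update("left-pad", 0.0, 4),
    Update("openssl-bindings", 9.8, 1),
    Update("requests", 6.1, 2),
])
```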
AWS Bedrock Agents
by Amazon Web Services
AWS's fully managed agent service within Amazon Bedrock that orchestrates multi-step tasks using foundation models. Automatically breaks down user requests, calls APIs, queries knowledge bases, and executes actions while maintaining enterprise security and compliance controls.
ServiceNow AI Agent
by ServiceNow
An autonomous AI agent built on the Now Platform, designed to automate end-to-end IT Service Management (ITSM) processes. It independently resolves common incidents, fulfills service requests, and executes standard change workflows by leveraging a proprietary knowledge graph and workflow engine, reducing the need for human intervention.
Performance Profiler Agent
by Datadog
An autonomous profiling agent that instruments application code, analyzes CPU flame graphs, memory heap snapshots, and database query plans to identify performance bottlenecks, then proposes and optionally applies targeted code optimizations. It tracks regression history, correlates deployments with latency spikes, and benchmarks fixes against baseline measurements before recommending production rollout.
CrewAI
by CrewAI
Framework for orchestrating role-playing autonomous AI agents that work together as a crew. Enables defining agents with specific roles, goals, and backstories to collaborate on complex tasks through structured workflows.
Risk Assessment Agent
by ServiceNow
This AI agent automates enterprise risk management (ERM) by continuously synthesizing data from internal systems and external intelligence. It identifies, categorizes, and scores diverse risks, maintaining a live risk register and mapping control effectiveness to provide a real-time, holistic view of the organization's risk posture.
Zendesk AI Agent
by Zendesk
Zendesk's AI Agent is an autonomous customer support tool designed to resolve inquiries across email, chat, and messaging. Trained on billions of real service interactions, it understands intent and sentiment to provide resolutions without requiring human intervention, freeing up teams for complex issues.
Social Media Optimizer
by Sprout Social
A semi-autonomous agent that optimizes social media content for maximum reach. It analyzes platform-specific engagement patterns, rewrites posts, schedules them for peak audience times, and A/B tests caption variations to improve performance across channels.
Document Classification Agent
by ABBYY
An AI agent that automates document processing by classifying unstructured files like invoices, contracts, and emails into predefined categories. It extracts key data, validates it against business logic, and routes documents to appropriate systems, supporting multiple languages and improving over time via human feedback.
Elicit
by Elicit
AI research assistant that automates systematic literature reviews and evidence synthesis. Searches across 200M+ academic papers, extracts key findings, and synthesizes results into structured summaries with full citations.
SWE-agent
by Princeton NLP
Princeton NLP's research agent that turns LLMs into autonomous software engineers. Achieves state-of-the-art results on SWE-bench by providing an agent-computer interface optimized for code navigation and editing.
Figma AI Agent
by Figma
Figma AI is a suite of native artificial intelligence features integrated directly within the Figma and FigJam platforms. It accelerates the design process by generating UI elements from text prompts, automatically populating mockups with realistic content, and providing intelligent suggestions to improve design consistency.
Support Resolver Agent
by AaaS
Resolves up to 80% of Tier-1 and Tier-2 support requests by directly accessing the CRM and payment gateways. Processes refunds within configurable monetary caps, updates account settings, modifies subscriptions, and routes edge cases to human representatives with full conversation summaries and diagnostic context. Unlike basic chatbots that regurgitate FAQ documents, this agent takes transactional action — it resolves, not deflects.
Google Vertex AI Agents
by Google Cloud
Google Vertex AI Agents is an enterprise-grade platform for building and deploying production-ready generative AI agents on Google Cloud. It enables developers to create agents that can reason, use tools, and leverage grounded generation with Google Search to complete complex tasks and engage in multi-turn conversations.
CrowdStrike Charlotte AI
by CrowdStrike
CrowdStrike's generative AI security analyst, Charlotte AI, accelerates threat operations by automating investigation and response. It correlates alerts, enriches incidents with threat intelligence, and recommends actions, allowing security teams to query vast datasets and understand threats using natural language.
Predictive Maintenance Agent
by SparkCognition
An IoT-connected agent that ingests vibration, temperature, acoustic, and electrical signals from industrial equipment to predict failure events hours to weeks in advance using ML anomaly detection and physics-based models. It generates work orders in CMMS systems, recommends spare parts pre-positioning, and calculates optimal maintenance windows to minimize production impact.
Radiology Report Agent
by Nuance PowerScribe
An AI assistant that accelerates radiology reporting by automatically drafting structured reports from imaging findings. It applies standard templates like ACR BI-RADS, extracts key measurements, and codes findings using RadLex terminology, significantly reducing radiologist documentation time and improving data consistency for analytics.
Medical Imaging Analyzer
by Arterys
Deep-learning agent that analyzes DICOM medical images across modalities — CT, MRI, X-ray, and PET — to surface anomalies, measure lesions, and generate structured findings. Integrates directly into PACS workflows and flags priority studies for radiologist review.
Escalation Manager Agent
by Zendesk
A decision-intelligence agent that monitors live support queues in real time, detects escalation signals (frustrated language, churn-risk keywords, repeat contacts), and routes high-priority cases to the most qualified available agent with full context pre-loaded. It enforces tiered escalation policies and logs every routing decision for compliance auditing.
Database Migration Agent
by Redgate
An autonomous agent designed to automate the entire database migration lifecycle. It analyzes schema differences, generates forward and rollback migration scripts, and validates data integrity post-migration. The agent supports complex data transformations and migrations across different database platforms like PostgreSQL and Oracle, ensuring zero data loss.
Fraud Detection Agent
by Featurespace
An AI agent designed for real-time fraud prevention across various payment channels. It leverages behavioral biometrics, graph analytics, and machine learning to analyze transaction streams, identify suspicious patterns, and provide sub-50ms risk decisions. The system includes adaptive feedback loops for continuous model improvement.
PagerDuty AI
by PagerDuty
PagerDuty AI is an AIOps agent for incident management that automates triage and response. It intelligently groups related alerts to reduce noise, correlates events to identify root causes, and suggests or executes automated remediation runbooks. This helps teams minimize downtime and streamline their on-call processes.
Content Strategy Agent
by Jasper AI
An autonomous AI agent designed to streamline content marketing operations. It performs comprehensive audits of existing content, identifies strategic topic gaps by analyzing competitors and search trends, and generates data-driven editorial calendars. The agent ensures all content aligns with brand voice and business objectives.
DataRobot AI Agent
by DataRobot
DataRobot is an enterprise AI platform that automates the end-to-end machine learning lifecycle. It enables users to build, deploy, and monitor predictive models at scale, from data preparation to production. The platform offers automated feature engineering, model selection, and hyperparameter tuning to accelerate the path from raw data to business value.
Student Assessment Agent
by Turnitin
An automated assessment agent that generates item banks, administers adaptive quizzes, and provides calibrated scoring with detailed feedback explanations. It applies Item Response Theory to estimate learner proficiency and surfaces at-risk students to instructors via configurable alert thresholds.
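Item Response Theory, which the entry cites for proficiency estimation, models the probability of a correct answer as a logistic function of learner proficiency relative to item difficulty. Below is a minimal two-parameter-logistic sketch with a grid-search maximum-likelihood estimate; the item bank is hypothetical and real systems use faster estimators:

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic IRT: probability that a learner with proficiency
    `theta` answers an item with discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(items: list[tuple[float, float]], responses: list[int]) -> float:
    """Grid-search maximum-likelihood estimate of proficiency from responses."""
    def loglik(theta: float) -> float:
        return sum(
            math.log(p_correct(theta, a, b)) if r else math.log(1 - p_correct(theta, a, b))
            for (a, b), r in zip(items, responses)
        )
    grid = [t / 100 for t in range(-300, 301)]
    return max(grid, key=loglik)

# Hypothetical item bank: (discrimination, difficulty) pairs; the learner
# answers the two easier items correctly and misses the hardest one.
theta = estimate_theta([(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)], [1, 1, 0])
```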
Contract Management Agent
by Ironclad
An AI agent for automating contract lifecycle management (CLM). It extracts critical data like terms, dates, and obligations from agreements, centralizes them into a searchable repository, and provides automated alerts for key deadlines. The agent streamlines review by comparing new contracts against pre-approved clause libraries and company playbooks.
Portfolio Optimizer
by BlackRock Aladdin
An advanced AI agent for constructing and managing investment portfolios. It leverages quantitative techniques like mean-variance optimization and the Black-Litterman model to align portfolios with specific investor goals, risk tolerances, and constraints such as ESG mandates, while continuously monitoring for drift and executing tax-efficient rebalancing.
Intercom Fin
by Intercom
Intercom Fin is an AI-powered chatbot designed for customer support automation, built on OpenAI's GPT-4. It autonomously resolves customer queries by leveraging a company's help center content and past conversation data. Fin provides human-like answers, can execute actions, and intelligently escalates complex issues to human agents.
Boston Dynamics Atlas
by Boston Dynamics
Next-generation fully electric humanoid robot designed for industrial and commercial applications. Features unmatched athletic ability, whole-body manipulation, and advanced perception for operating in complex, dynamic environments alongside humans.
Azure AI Agent Service
by Microsoft
An enterprise-grade platform from Microsoft for building, deploying, and managing sophisticated AI agents. Built on the Copilot stack, it allows developers to create agents that can reason, use tools, and orchestrate complex tasks. The service features deep integration with Microsoft services and robust responsible AI controls.
Ad Copy Generator
by Copy.ai
An AI agent designed for paid advertising that generates multiple headline and description variants to boost click-through rates. It analyzes product data, target personas, and landing pages to create optimized copy for Google Ads, Meta, and LinkedIn, ensuring strong message-to-market alignment.
MetaGPT
by DeepWisdom
MetaGPT is an open-source multi-agent framework that automates software development by simulating a virtual company. It assigns distinct roles like product manager, architect, and engineer to different LLM agents. Starting from a single-line requirement, it follows Standardized Operating Procedures (SOPs) to generate comprehensive outputs, including user stories, system designs, diagrams, and executable code.
Customer Feedback Analyzer
by Medallia
A continuous feedback intelligence agent that ingests NPS surveys, review platforms, support tickets, and social mentions to extract structured voice-of-customer insights. It applies aspect-level sentiment analysis to surface product and service themes and auto-generates prioritized improvement briefs for product and operations teams.
Player Analytics Agent
by GameAnalytics
A behavioral analytics agent that ingests player telemetry streams to build individual and cohort behavior models, predict churn risk, and surface liveops intervention opportunities. It continuously segments the player base by engagement and monetization propensity, feeding recommendations into targeted push notification, reward, and re-engagement campaign engines.
Literature Review Agent
by Elicit AI
An AI-powered agent designed to automate systematic literature reviews. It queries major academic databases like PubMed and arXiv to identify, screen, and synthesize evidence from thousands of papers. The agent produces structured outputs including evidence tables, meta-analysis plots, and PRISMA-compliant reports with bias assessments.
Email Triage Agent
by Superhuman
An inbox intelligence agent that reads, categorizes, and prioritizes incoming emails by urgency and business intent, drafts context-aware reply suggestions, auto-responds to routine inquiries within configured policies, and escalates high-priority items with briefings to ensure nothing critical is missed. It learns communication preferences over time to continuously improve draft quality and routing accuracy.
Aider
by Paul Gauthier
AI pair programming tool in the terminal that works with any LLM to edit code in local git repositories. Features automatic git commits, multi-file editing, and voice coding with support for connecting to dozens of model providers.
Invoice Reconciler Agent
by AaaS
Ingests unstructured invoice data across wildly varying formats (multi-page PDFs, email attachments, CSV exports). Matches invoices deterministically against Purchase Orders and receipt logs in the ERP system, and authorizes payment for the established happy path — achieving 60-80% straight-through processing without human intervention. Exception invoices (mismatches, missing POs, duplicate detection) are routed to a human queue with full context. Every payment authorization generates an immutable audit trail.
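The deterministic happy-path matching described above is commonly a three-way match between invoice, purchase order, and goods receipt. A minimal sketch, with assumed field names and an illustrative 2% price tolerance:

```python
def three_way_match(invoice: dict, po: dict, receipt: dict, tolerance: float = 0.02) -> bool:
    """Deterministic happy-path check: invoice vs. purchase order vs. goods
    receipt. Returns True only when vendor, quantity, and unit price all agree
    (price within a small tolerance); anything else routes to the human queue."""
    if invoice["vendor"] != po["vendor"]:
        return False
    if invoice["qty"] != receipt["qty_received"]:
        return False
    return abs(invoice["unit_price"] - po["unit_price"]) <= tolerance * po["unit_price"]

ok = three_way_match(
    {"vendor": "Acme", "qty": 10, "unit_price": 101.0},
    {"vendor": "Acme", "unit_price": 100.0},
    {"qty_received": 10},
)
```

The straight-through-processing rate quoted in the entry is effectively the fraction of invoices for which a check like this passes on first attempt.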
Contract Review Agent
by Ironclad AI
An AI agent designed to automate the legal contract review process. It extracts key clauses from documents like NDAs, MSAs, and SOWs, comparing them against a pre-defined legal playbook to flag non-standard language. The agent scores risk levels by clause and can automatically generate redlines with preferred positions, accelerating review cycles.
Amazon Q Developer Agent
by Amazon Web Services
Amazon Q is an AI-powered developer agent from AWS that automates code transformations, feature implementation, and security remediation. It is deeply integrated with the AWS ecosystem, allowing it to understand project context, suggest relevant AWS services, and streamline cloud-native development workflows directly within the IDE.
Quality Inspection Agent
by Landing AI
An AI agent that uses computer vision to perform real-time quality inspections on manufacturing lines. It automatically detects, classifies, and logs surface defects, dimensional inaccuracies, and assembly errors at production speed, triggering alerts or reject mechanisms to prevent faulty products from proceeding.
ChatDev
by OpenBMB
ChatDev is a virtual software company powered by multiple LLM agents that simulate a real-world development team. These agents, playing roles like CEO, programmer, and tester, collaborate to automate the entire software development lifecycle, from design and coding to testing, based on a single natural language prompt.
Campaign Analytics Agent
by Northbeam
An autonomous AI agent that unifies campaign data from disparate marketing channels to provide a holistic view of performance. It leverages advanced multi-touch attribution models to calculate true ROI and delivers actionable recommendations for budget optimization. The agent automatically generates executive-level reports and issues real-time alerts for performance anomalies.
Expense Audit Agent
by AppZen
An AI agent that automates the auditing of employee expense reports. It uses OCR to extract data from receipts, then validates expenses against company policies, per-diem rates, and vendor lists. The agent flags violations and potential fraud, auto-approves compliant reports, and routes exceptions for human review.
OpenHands
by All Hands AI
OpenHands is an open-source platform for creating autonomous AI software agents. It offers a secure, sandboxed environment where agents can execute complex development tasks by writing code, running commands, browsing the web, and interacting with APIs. It supports multi-agent delegation for tackling intricate problems.
Consensus
by Consensus NLP
Consensus is an AI-powered search engine designed to extract and synthesize findings directly from peer-reviewed scientific literature. It uses natural language processing to answer user questions with evidence-based conclusions, highlighting the general consensus from multiple studies and providing metrics on study quality.
Vapi AI
by Vapi
Vapi AI is a developer-first platform for building and deploying real-time, conversational voice agents. It provides low-latency streaming, interruptible speech, and seamless integrations with various LLM, TTS, and STT providers. The platform is designed for developers to create sophisticated voice experiences with features like function calling and call analytics.
GitLab Duo Agent
by GitLab
GitLab Duo is an AI-powered assistant integrated into the GitLab DevSecOps platform. It enhances developer productivity across the software development lifecycle by offering code suggestions, summarizing issues, explaining vulnerabilities, and generating tests, all within the native GitLab environment.
Intent Prospector Agent
by AaaS
Continuously tracks buyer intent signals across website visits, email engagement, CRM activity, and third-party intent data providers. Scores leads against Ideal Customer Profiles, drafts hyper-personalized outreach sequences based on behavioral signals, and books meetings directly on sales representatives' calendars. Replaces generic outreach spam with data-driven, compliant prospecting that respects CAN-SPAM and GDPR opt-out requirements.
Logistics Routing Agent
by project44
An AI agent designed to solve complex vehicle routing problems (VRP) for logistics and supply chain operations. It optimizes multi-stop routes for entire fleets by considering constraints like time windows, vehicle capacity, and traffic. The agent dynamically reroutes in real-time to adapt to new orders, delays, or cancellations.
H2O AI Agent
by H2O.ai
H2O.ai offers an open-source and enterprise AutoML platform that automates the machine learning lifecycle. It excels at automated model training, interpretation, and deployment, supporting distributed computing for large datasets. The platform provides comprehensive model explainability features like SHAP values, making complex models transparent.
Carbon Footprint Analyzer
by Persefoni
Calculates comprehensive Scope 1, 2, and 3 carbon emissions across the entire value chain. This ESG intelligence agent ingests diverse data like energy bills, travel records, and procurement data to generate audit-ready GHG inventory reports. It benchmarks performance and identifies key reduction opportunities, ensuring alignment with GHG Protocol standards.
Fraud Isolator Agent
by AaaS
Continuously evaluates live transaction streams against episodic memories of individual user behavior patterns. Detects complex anomaly patterns that rule-based fraud detection systems either miss outright or can only catch at the cost of high false-positive rates. Autonomously pauses suspicious transactions and triggers secure multi-factor escalation workflows. The agent can only pause and flag — it never approves or releases funds, ensuring a human always makes the final call on flagged transactions.
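Scoring a transaction against a user's own baseline, rather than a global rule, can be sketched as a z-score test over that user's transaction history. The threshold and the minimum-history requirement are illustrative assumptions:

```python
import statistics

def is_anomalous(amount: float, user_history: list[float], z_threshold: float = 3.0) -> bool:
    """Flag a transaction that deviates sharply from this user's own baseline.
    Flagged transactions are only paused for human review, never declined
    or approved automatically."""
    if len(user_history) < 5:
        return False  # not enough history to form a per-user baseline
    mu = statistics.mean(user_history)
    sigma = statistics.pstdev(user_history) or 1.0  # guard a flat history
    return abs(amount - mu) / sigma > z_threshold

history = [20.0, 25.0, 22.0, 30.0, 18.0, 24.0]
flag = is_anomalous(950.0, history)
```

A per-user baseline like this is what lets a $950 charge stand out for a $25-a-transaction account while being unremarkable for a high-spend account.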
SLA Monitor Agent
by PagerDuty
Monitors service-level agreement (SLA) compliance by tracking response and resolution times for all active tickets. The agent proactively alerts teams about potential SLA breaches, allowing them to act before a violation occurs. It can also automatically reprioritize ticket queues based on urgency and generates regular SLA performance reports for management.
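The pre-breach alerting described above amounts to classifying each ticket's elapsed time against its SLA window. A minimal sketch, with the 80% warning ratio as an assumed policy:

```python
from datetime import datetime, timedelta

def sla_status(opened_at: datetime, sla: timedelta, now: datetime,
               warn_ratio: float = 0.8) -> str:
    """Classify a ticket as ok / at-risk / breached. Alerting at `warn_ratio`
    of the SLA window gives the team time to act before the violation."""
    elapsed = now - opened_at
    if elapsed >= sla:
        return "breached"
    if elapsed >= sla * warn_ratio:
        return "at-risk"
    return "ok"

t0 = datetime(2024, 1, 1, 9, 0)
status = sla_status(t0, timedelta(hours=4), t0 + timedelta(hours=3, minutes=30))
```

Automatic reprioritization then becomes a sort of the queue by this status and the remaining time to breach.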
Fleet Management Agent
by Geotab
An operational intelligence agent for managing autonomous vehicle fleets. It optimizes asset utilization and uptime by intelligently dispatching vehicles to demand hotspots, scheduling predictive maintenance from telemetry data, and balancing charge levels for EVs. The agent provides a real-time control dashboard to surface anomalies for operations teams.
Demand Forecasting Agent
by o9 Solutions
The Demand Forecasting Agent leverages machine learning to analyze diverse datasets, including historical sales, market trends, and external factors like weather or promotions. It produces accurate, SKU-level demand forecasts for various time horizons, enabling businesses to optimize inventory, reduce stockouts, and improve supply chain efficiency.
Codex CLI
by OpenAI
OpenAI's open-source CLI coding agent that operates in the terminal with sandboxed execution. Reads and edits files, runs commands, and supports multiple approval modes from suggest to full-auto.
Cloud Cost Optimizer Agent
by AaaS
Holds persistent, long-term memory of historical cloud infrastructure utilization patterns. Autonomously monitors resource usage across regions and cloud providers, identifies idle or underutilized resources, right-sizes instances based on real traffic patterns, and executes cost-saving measures continuously — all without waiting for a human FinOps review. Operates within safety guardrails: never terminates production instances, enforces a 7-day cooldown before right-sizing, and logs all actions with rollback capability.
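The guardrails listed above (production exclusion, 7-day cooldown) can be encoded as a simple precondition check that gates every action; the 20% average-CPU threshold here is an illustrative assumption, not the agent's actual policy:

```python
from datetime import datetime, timedelta

def may_rightsize(avg_cpu: float, is_production: bool,
                  last_change: datetime, now: datetime,
                  cooldown: timedelta = timedelta(days=7)) -> bool:
    """Guardrail check mirroring the stated policy: never touch production
    instances, enforce a cooldown since the last change, and only act on
    sustained low utilization."""
    if is_production:
        return False
    if now - last_change < cooldown:
        return False
    return avg_cpu < 0.20  # sustained under 20% average CPU

now = datetime(2024, 6, 15)
ok = may_rightsize(0.08, False, datetime(2024, 6, 1), now)
```

Keeping the guardrail as a pure predicate makes every allow/deny decision trivially loggable for the rollback audit trail.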
Architecture Review Agent
by Codescene
A senior-engineer-level agent that statically analyzes codebase architecture using dependency graphs, coupling metrics, and design pattern recognition to identify anti-patterns, circular dependencies, and violations of architectural fitness functions. It produces architectural decision records (ADRs), generates C4 model diagrams, and prioritizes refactoring opportunities by technical debt cost and business risk.
Churn Prevention Agent
by AaaS
Tracks subtle drops in product usage, performs sentiment analysis on support tickets, and synthesizes disparate signals into churn risk scores for each account. Preemptively drafts retention plans, schedules proactive check-in calls, and highlights upselling opportunities — all before the critical renewal window opens. Transforms customer success from reactive firefighting into a data-driven retention engine.
Legal Document Drafter
by Harvey AI
An AI agent that automates the creation of legal documents by leveraging structured data, template libraries, and firm-specific style guides. It generates jurisdiction-compliant agreements, pleadings, and regulatory filings, incorporating precedents and flagging potential issues for attorney review. The system streamlines drafting workflows, ensuring consistency and accuracy.
Supplier Risk Agent
by Resilinc
An intelligence agent that continuously monitors supplier financial health, geopolitical exposure, ESG compliance, news sentiment, and delivery performance to generate dynamic risk scores for every vendor in the supply network. It alerts procurement teams to emerging threats and recommends dual-sourcing or buffer stock adjustments before disruptions materialize.
NPC Behavior Agent
by Inworld AI
An AI agent that uses reinforcement learning (RL) to generate dynamic NPC behaviors. Instead of relying on static scripts, it learns complex strategies through self-play and interaction, adapting its difficulty and tactics in real-time to match a player's skill level, ensuring a consistently challenging and unpredictable experience.
Digital Twin Agent
by Ansys
This AI agent creates and manages high-fidelity virtual replicas of physical assets and processes. By synchronizing with real-time IoT data, it runs complex simulations to test changes, predict failures, and analyze what-if scenarios posed in natural language, enabling optimization before physical implementation.
GPT Researcher
by Tavily
Open-source autonomous research agent that conducts comprehensive web research on any topic. Generates detailed research reports by planning queries, scraping multiple sources, filtering information, and synthesizing findings with citations.
Financial Statement Analyzer
by Visible Alpha
An AI-powered agent designed for systematic financial statement analysis. It automates the ingestion and parsing of corporate filings like 10-Ks and 10-Qs to compute key financial ratios, identify accounting anomalies, and benchmark performance against industry peers. The agent generates concise, investment-grade summaries highlighting financial health and potential risks.
Moveworks
by Moveworks
Moveworks is an enterprise AI copilot platform that automates employee support. It uses conversational AI to understand and resolve requests across IT, HR, finance, and other departments directly in collaboration tools like Slack and Microsoft Teams, reducing the need for manual intervention.
HR Screening Agent
by HireVue
An AI recruiting agent that screens resumes at scale, scores candidates against job description criteria, conducts asynchronous video interview analysis, and shortlists top applicants while flagging potential bias signals for human review. It integrates with ATS platforms to automate interview scheduling for shortlisted candidates and maintains a structured candidate evaluation audit trail.
Campaign Orchestrator Agent
by AaaS
Monitors live campaign performance across Google Ads, Meta, LinkedIn, and other advertising channels continuously. Automatically reallocates budgets to highest-performing channels, dynamically personalizes ad copy for different audience segments, and kills underperformers before they waste spend. Operates within configurable budget guardrails including max daily spend caps, minimum ROAS thresholds, and A/B test significance gates.
Galileo AI
by Galileo AI
Galileo AI is a design copilot that transforms natural language prompts into high-fidelity, editable UI designs. It generates complete screens, individual components, and custom illustrations directly within Figma, aiming to accelerate the design process by automating repetitive tasks and providing instant visual mockups.
Traffic Prediction Agent
by HERE Technologies
This agent specializes in spatio-temporal traffic forecasting, predicting conditions up to 30 minutes in advance for intersections and corridors. It processes data from V2X communications, vehicle telemetry, and infrastructure sensors. The predictions are designed for fleet routing engines to optimize ETAs and alleviate urban congestion.
STORM (Stanford)
by Stanford NLP
STORM is an open-source AI research agent from Stanford University designed to automate the creation of comprehensive, Wikipedia-style articles. It simulates a human research process by generating diverse questions, searching the web for information, and synthesizing the findings into a well-structured, cited narrative based on a generated outline.
Tavily Research Agent
by Tavily
Tavily is a specialized search API designed for Large Language Models (LLMs) and AI agents. It provides real-time, fact-grounded web search results in a structured, clean format, eliminating the need for manual data cleaning. The API is optimized to deliver relevant, concise information, making it ideal for powering autonomous agents and RAG applications.
Knowledge Base Builder Agent
by Guru
This autonomous agent streamlines knowledge management by ingesting data from support tickets, chat logs, and documents. It automatically generates, updates, and deduplicates knowledge base articles, identifying content gaps by analyzing unanswered user queries. New drafts are created for human review, ensuring a constantly improving self-service resource.
Sourcegraph Cody
by Sourcegraph
Sourcegraph's AI coding assistant with deep codebase context powered by code graph intelligence. Understands entire repositories through code search, cross-references, and dependency analysis for highly accurate code generation and answers.
Ada AI
by Ada
Ada is an enterprise-grade conversational AI platform designed for automating customer service. Its no-code builder allows businesses to create and deploy AI agents across various digital channels, aiming to resolve a high percentage of customer inquiries without human intervention and providing seamless handoffs when needed.
Transfer Learning
by Community
Leverages knowledge from a source domain to improve model performance on a target domain with limited labeled data. A foundational technique for reducing training costs and accelerating model development across diverse applications.
Chain-of-Thought
by AaaS
Guides LLMs to produce step-by-step reasoning before arriving at a final answer. Dramatically improves performance on math, logic, and multi-step problems by making the model's reasoning process explicit and verifiable.
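The pattern can be sketched as a prompt wrapper plus an answer extractor; this is a minimal illustration, and the helper names and prompt text are assumptions rather than any particular provider's API:

```python
def chain_of_thought_prompt(question: str) -> str:
    """Instruct the model to reason step by step before committing to an answer."""
    return (
        "Solve the problem below. Reason step by step, then give the final\n"
        "answer on its own line starting with 'Answer:'.\n\n"
        f"Problem: {question}\nReasoning:"
    )

def extract_final_answer(completion: str) -> str:
    """Pull the verifiable final answer out of a step-by-step completion."""
    for line in completion.splitlines():
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return completion.strip()  # fall back if the model skipped the marker

# A typical completion for a speed problem might look like this:
sample = "Step 1: distance / time = 120 / 1.5 = 80.\nAnswer: 80 km/h"
print(extract_final_answer(sample))  # 80 km/h
```

Making the final answer machine-extractable is what makes the reasoning verifiable: the intermediate steps can be audited separately from the answer itself.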
Prompt Engineering
by AaaS
The foundational discipline of crafting effective prompts to elicit desired behaviors from language models. Covers system prompt design, instruction formatting, output structuring, temperature tuning, and iterative prompt refinement techniques.
Code Generation
by AaaS
Generates functional code from natural language descriptions, specifications, or partial implementations. Covers multiple languages and frameworks with support for boilerplate scaffolding, algorithm implementation, and API integration patterns.
Function Calling
by AaaS
Enables LLMs to invoke external functions by generating structured JSON arguments matching defined schemas. Supports parallel function calls, error handling, and chained invocations for complex multi-step tool interactions.
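A minimal sketch of the dispatch side, assuming the model has already emitted a JSON tool call; the `get_weather` tool, its schema, and the call format are illustrative stand-ins, not a specific vendor's wire format:

```python
import json

# JSON-Schema-style parameter definition, in the shape function-calling APIs expect.
GET_WEATHER_SCHEMA = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

def get_weather(city: str, unit: str = "celsius") -> str:
    return f"22 degrees {unit} in {city}"  # stub; a real tool would call a weather API

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call: str) -> str:
    """Parse a model-emitted tool call and invoke the matching function."""
    call = json.loads(tool_call)
    return TOOLS[call["name"]](**call["arguments"])

print(dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
# 22 degrees celsius in Oslo
```

In production the arguments would also be validated against the schema before invocation, since models occasionally emit missing or mistyped fields.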
Collaborative Filtering
by Community
Predicts user preferences by identifying patterns from collective user-item interaction histories, using memory-based neighborhood methods or model-based matrix factorization and neural approaches. The backbone of recommendation systems at scale across e-commerce, streaming, and social platforms.
Few-Shot Learning
by AaaS
Teaches LLMs to perform tasks by providing a small number of input-output examples in the prompt. Enables rapid task adaptation without fine-tuning by demonstrating the desired pattern through carefully selected, representative examples.
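Assembling such a prompt is mechanical; a minimal sketch, where the instruction text and examples are placeholders:

```python
def few_shot_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Demonstrate the task pattern with labeled examples, then pose the query."""
    lines = [instruction, ""]
    for given, expected in examples:
        lines += [f"Input: {given}", f"Output: {expected}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("Loved it!", "positive"), ("Total waste of money.", "negative")],
    "Shipping was fast and the quality is great.",
)
print(prompt)
```

Ending the prompt at `Output:` invites the model to complete the established pattern, which is the whole mechanism of in-context learning.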
Tool Use
by AaaS
Equips AI agents with the ability to select and use appropriate tools from a defined toolkit to accomplish tasks. Covers tool selection logic, input marshalling, output interpretation, and fallback strategies when tools fail or return unexpected results.
Speech Recognition
by AaaS
Teaches integration and optimization of automatic speech recognition (ASR) systems — from Whisper to streaming cloud APIs — for agentic voice pipelines. Covers language identification, word error rate reduction, punctuation restoration, and handling noisy audio environments.
Time-Series Forecasting
by Community
Predicts future values of sequential, time-indexed data using classical statistical models (ARIMA, ETS), gradient boosting (LightGBM, XGBoost), and deep learning architectures (Transformers, N-BEATS, TFT). Handles trend, seasonality, exogenous covariates, and uncertainty quantification.
Domain-Specific Fine-Tuning
by Community
Adapts a general-purpose pretrained model to a narrow domain by continuing training on curated domain corpora or instruction datasets. Produces specialized models that outperform generalist baselines on domain-specific benchmarks while preserving broad language understanding.
Code Review
by AaaS
Analyzes code for bugs, security vulnerabilities, performance issues, and style violations. Provides actionable feedback with severity levels and suggested fixes aligned to language-specific best practices and project conventions.
Hybrid Recommendation Systems
by Community
Combines collaborative filtering and content-based signals — along with contextual, knowledge-graph, and session-based features — into unified ranking models that outperform single-strategy approaches. Modern implementations use two-tower neural architectures for efficient retrieval followed by cross-attention reranking.
Graph Neural Networks
by Community
Applies deep learning directly to graph-structured data by passing and aggregating messages between connected nodes across multiple layers, enabling node classification, link prediction, and graph-level tasks. Powers state-of-the-art knowledge graph completion, molecular property prediction, and social network analysis.
Reinforcement Learning for Control
by Community
Trains control policies for autonomous systems through environment interaction and reward signals using model-free (PPO, SAC, TD3) and model-based (MBPO, Dreamer) RL algorithms. Enables superhuman performance in complex continuous control tasks from locomotion to manipulation.
Summarization
by AaaS
Condenses long documents into concise summaries while preserving key information and maintaining factual accuracy. Supports extractive, abstractive, and hierarchical summarization with configurable length, style, and focus area parameters.
Anomaly Detection
by Community
Identifies unusual patterns, outliers, and change points in time-series and tabular data using statistical, density-based, isolation forest, autoencoder, and transformer-based methods. Fundamental for operational monitoring, fraud detection, and predictive maintenance systems.
RAG Retrieval
by AaaS
A technique that enhances large language models by dynamically retrieving relevant information from an external knowledge base. This process grounds the model's responses in factual data, reducing hallucinations and enabling it to answer questions about information not present in its original training data.
Object Detection
by AaaS
A core computer vision skill that enables agents to identify and locate objects within an image or video stream. By predicting bounding boxes and class labels for each object, this skill forms the foundation for environmental understanding. It is crucial for applications requiring spatial awareness, from autonomous navigation to automated inspection.
Code Debugging
by AaaS
Diagnoses and resolves software bugs by analyzing error messages, stack traces, and code behavior. Applies systematic debugging strategies including root cause analysis, state inspection, and targeted fix generation with regression awareness.
Semantic Search
by AaaS
Enables meaning-based retrieval by converting queries and documents into dense vector representations and finding nearest neighbors. Foundational skill for any RAG pipeline or knowledge-base-powered agent.
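Under the hood, nearest-neighbor retrieval reduces to similarity over embedding vectors; a toy sketch with hand-made 2-D vectors standing in for real embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query: list[float], docs: list[list[float]], k: int = 2) -> list[int]:
    """Indices of the k document vectors most similar to the query vector."""
    ranked = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
    return ranked[:k]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # pretend embeddings
print(top_k([1.0, 0.05], docs))  # [0, 1]
```

Real systems replace the exhaustive scan with an approximate nearest-neighbor index once the corpus grows past a few thousand vectors.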
Federated Learning
by Community
A machine learning technique that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging the data itself. It enables collaborative model training by aggregating locally computed updates, thereby preserving data privacy, security, and sovereignty.
Content-Based Recommendation
by Community
Recommends items by matching item feature profiles to user preference profiles derived from their interaction history, using TF-IDF, embeddings, and semantic similarity techniques. Effective for cold-start scenarios where user interaction data is sparse and item metadata is rich.
Text Classification
by AaaS
Automates the categorization of text into predefined classes. This skill leverages large language models to perform zero-shot and multi-label classification, eliminating the need for extensive training data. It can analyze documents, user feedback, or social media posts, assigning relevant labels from a simple list or a complex hierarchical taxonomy.
Path Planning
by Community
Path Planning is a fundamental capability in robotics and autonomous systems that computes a collision-free geometric path from a start to a goal configuration. It operates within a system's configuration space, using algorithms like A* or RRT to find optimal or feasible routes, distinct from motion planning which also considers dynamics like velocity and acceleration.
Visual Question Answering
by AaaS
Enables agents to answer free-form natural language questions about images by grounding language in visual features. Covers prompt construction for vision-language models, chain-of-thought visual reasoning, and failure modes such as hallucination and spatial confusion.
Differential Privacy
by Community
Provides mathematically rigorous privacy guarantees by adding calibrated noise to query outputs or model gradients, ensuring individual data points cannot be inferred from published statistics or trained models. The de facto standard for privacy-preserving data analysis and compliant ML training.
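The core mechanism is simple to sketch. This is a toy Laplace mechanism for a counting query (sensitivity 1), not production-grade DP: there is no privacy accounting and the RNG is not cryptographically secure:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon: float) -> float:
    """Release a count with epsilon-DP: a counting query has sensitivity 1,
    so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 61, 38]
print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))  # noisy count near 3
```

Smaller epsilon means stronger privacy and noisier answers; the calibration of noise scale to sensitivity/epsilon is exactly the "calibrated noise" the description refers to.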
Embedding Generation
by AaaS
Generates dense vector embeddings from text, images, or other data types for use in similarity search, clustering, and classification. Covers model selection, batch processing, dimensionality considerations, and normalization strategies for optimal retrieval performance.
Test Generation
by AaaS
Automates the creation of test suites by analyzing source code, function signatures, or specifications. It generates unit tests, integration tests, and edge case scenarios for popular frameworks, complete with necessary mocks and assertions. This accelerates development cycles and improves code reliability.
ReAct Prompting
by AaaS
Implements the Reasoning + Acting (ReAct) paradigm where LLMs alternate between thinking steps and action steps. The model reasons about what to do next, takes an action (like searching or computing), observes the result, and continues reasoning until the task is complete.
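A skeleton of the loop, with a scripted stand-in for the model and a single hypothetical `lookup` tool; the transcript format is illustrative:

```python
import re

def react_loop(llm, tools, question: str, max_steps: int = 5):
    """Alternate Thought -> Action -> Observation until a final answer appears.
    `llm` is any callable mapping the transcript so far to the model's next step."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        action = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if action:
            name, arg = action.groups()
            transcript += f"Observation: {tools[name](arg)}\n"
    return None  # give up after max_steps to avoid runaway loops

# Scripted model outputs stand in for real completions:
scripted = iter([
    "Thought: I should look this up.\nAction: lookup[capital of France]",
    "Thought: The observation answers it.\nFinal Answer: Paris",
])
answer = react_loop(
    llm=lambda transcript: next(scripted),
    tools={"lookup": lambda q: "Paris is the capital of France."},
    question="What is the capital of France?",
)
print(answer)  # Paris
```

The `max_steps` cap is essential in practice: a model that never emits a final answer otherwise loops forever.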
Sensor Fusion
by Community
Combines data from multiple heterogeneous sensors — cameras, LiDAR, radar, GPS, IMU — using probabilistic filters and deep learning to produce a unified, accurate state estimate of the environment. Foundational for autonomous vehicles, drones, and any robot requiring robust situational awareness.
Fine-Tuning
by AaaS
Adapts pre-trained language models to specific domains, tasks, or styles through additional training on curated datasets. Covers full fine-tuning, parameter-efficient methods like LoRA and QLoRA, and best practices for dataset preparation, hyperparameter selection, and evaluation.
Anomaly Detection
by AaaS
Identifies deviations from normal system behavior across time-series telemetry data (CPU, memory, latency, error rates, request volumes). Uses statistical methods (z-score, IQR) and learned baselines to distinguish genuine anomalies from expected variance. A critical cross-foundry skill reused by SRE (F1), Fraud Detection (F6), and Supply Chain (F8) agents.
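The z-score method mentioned above fits in a few lines; a toy sketch over a latency series (a real system would use rolling baselines and seasonality-aware models, and a large spike inflates the global stdev, hence the modest threshold here):

```python
import statistics

def zscore_anomalies(values: list[float], threshold: float = 2.0) -> list[int]:
    """Indices of points whose z-score against the whole series exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # a perfectly flat series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

latency_ms = [102, 99, 101, 100, 98, 103, 400, 101]  # one obvious spike
print(zscore_anomalies(latency_ms))  # [6]
```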
Code Refactoring
by AaaS
Code Refactoring is the disciplined process of restructuring existing computer code without altering its external behavior. It focuses on enhancing nonfunctional attributes like readability, maintainability, and performance. This practice is key to managing technical debt, applying design patterns, and modernizing legacy systems to align with current best practices.
Translation
by AaaS
Provides the ability to translate text from a source language to a target language. It aims to preserve the original meaning, tone, and cultural context. The skill supports domain-specific terminology for fields like legal or medical, allows for register control between formal and informal language, and handles idiomatic expressions with contextually appropriate equivalents.
Synthetic Data Generation
by Community
A process for creating artificial data that mimics the statistical properties and patterns of real-world datasets. It employs techniques like GANs, VAEs, and diffusion models to generate new data points, addressing issues of data scarcity, privacy, and imbalance. This enables robust model training and testing where real data is unavailable or sensitive.
Robot Perception
by Community
Enables robots to interpret their surroundings by processing and fusing data from sensors like cameras, LiDAR, and IMUs. This capability allows machines to build environmental models, detect and track objects, and determine their own position and orientation (localization). It is a cornerstone of autonomous navigation and interaction.
OCR Pipeline
by AaaS
Builds end-to-end pipelines for extracting structured text from images, scanned documents, and PDFs using OCR engines combined with layout analysis. Teaches preprocessing, engine selection (Tesseract, PaddleOCR, Google Document AI), post-correction, and handoff to language models for structured extraction.
Motion Planning
by Community
Motion Planning is the process of generating a valid trajectory for an autonomous system, such as a robot arm or self-driving car, from a starting state to a desired goal state. It computes a collision-free path that respects the system's kinematic and dynamic constraints, effectively bridging perception with physical action.
Knowledge Graph Construction
by Community
Builds structured knowledge graphs from unstructured text and semi-structured sources through entity recognition, relation extraction, coreference resolution, and entity linking. The resulting graphs power question answering, search, recommendation, and reasoning applications.
Image Generation Prompting
by AaaS
Covers structured prompting for text-to-image diffusion models like Stable Diffusion and Midjourney, including controlling style, composition, and quality through techniques such as negative prompting, LoRA weights, and iterative refinement. This skill enables the programmatic generation of consistent, on-brand imagery at scale.
Prompt Chaining
by AaaS
Prompt Chaining is a technique for executing complex tasks by breaking them into a sequence of smaller, interconnected prompts. The output from one large language model (LLM) call serves as the input for the next, creating a multi-step workflow. This method enables more sophisticated reasoning, state management, and integration with external tools.
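The control flow is a simple fold over prompt templates; a sketch with a stand-in `fake_llm` that just echoes its prompt (a real call would return the model's completion):

```python
def run_chain(llm, templates: list[str], initial_input: str) -> str:
    """Each step's output becomes the {input} of the next prompt template."""
    result = initial_input
    for template in templates:
        result = llm(template.format(input=result))
    return result

def fake_llm(prompt: str) -> str:
    return prompt  # stand-in so the data flow is visible

chain = ["Summarize: {input}", "Translate to French: {input}"]
print(run_chain(fake_llm, chain, "quarterly sales report"))
# Translate to French: Summarize: quarterly sales report
```

Because each step is a separate call, intermediate outputs can be logged, validated, or routed to different models before the next step runs.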
Hybrid Search
by AaaS
Hybrid search enhances information retrieval by merging the results of two distinct search methods: dense vector search for semantic understanding and sparse keyword search (like BM25) for lexical precision. This dual approach ensures that search results are not only contextually relevant but also capture exact term matches, significantly improving recall and relevance across diverse and complex queries.
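A common way to merge the two result lists is reciprocal rank fusion (RRF), which needs only ranks, not score scales that are comparable across methods; a minimal sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each document scores sum(1 / (k + rank)) across the input rankings;
    k = 60 is the damping constant commonly used in RRF implementations."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc2"]   # lexical (keyword) ranking
dense_hits = ["doc1", "doc4", "doc3"]  # semantic (vector) ranking
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# ['doc1', 'doc3', 'doc4', 'doc2']
```

Documents ranked highly by both retrievers (here `doc1` and `doc3`) rise to the top, which is exactly the recall-and-relevance benefit the entry describes.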
Data Extraction
by AaaS
Data Extraction is the process of automatically identifying and pulling structured information from unstructured or semi-structured sources like documents, web pages, and text. It uses NLP and computer vision to parse content into a predefined schema, enabling data to be used in databases, analytics, and automated workflows.
Few-Shot Domain Adaptation
by Community
Adapts models to new target domains using only a handful of labeled examples, combining meta-learning, prompt engineering, and prototype-based methods. Critical for enterprise deployments where labeled data is scarce or expensive to acquire.
CRM Data Retrieval
by AaaS
Queries CRM systems to retrieve customer account data, ticket history, subscription status, and interaction logs. Provides the customer context foundation that support, churn, and sales agents depend on for personalized actions.
Document Chunking
by AaaS
Splits large documents into semantically coherent chunks optimized for embedding and retrieval. Supports recursive, semantic, and sentence-based splitting strategies with configurable overlap and size parameters.
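The simplest of the strategies above, a fixed-size window with overlap, sketches as follows; real splitters also respect sentence and paragraph boundaries rather than cutting mid-word:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size character windows; consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap ensures a sentence straddling a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated embedding work.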
Multi-Step Reasoning
by AaaS
A core AI capability that enables agents to break down complex queries into a sequence of manageable, logical steps. By generating intermediate thoughts and verifying them, this process mimics human reasoning to solve problems that require planning, deduction, and synthesis of information over multiple stages.
Active Learning
by Community
Active Learning is a machine learning technique that intelligently selects the most informative data points from a large pool of unlabeled data to be labeled by a human annotator. By prioritizing examples where the model is most uncertain, it aims to achieve higher model accuracy with significantly fewer labeled samples, reducing annotation costs and time.
Log Analysis
by AaaS
Parses, correlates, and summarizes structured and unstructured log streams from multiple sources (application logs, system logs, CI/CD logs). Identifies error patterns, correlates events across distributed services using trace IDs, and extracts actionable insights from high-volume log data. A foundational skill reused across DevOps, SRE, and security agents.
Image Segmentation
by AaaS
Covers semantic, instance, and panoptic segmentation techniques that enable agents to produce pixel-level masks for scene understanding. Includes practical guidance on using SAM 2, Mask R-CNN, and integrating segmentation outputs into multimodal pipelines.
Feature Attribution
by AaaS
This skill involves computing and communicating which input features most influenced a model's prediction. It leverages methods like SHAP, LIME, and Integrated Gradients for tabular, text, and image data. The core focus is on generating local and global explanations and presenting them visually for both technical and non-technical audiences.
Causal Effect Estimation
by Community
Causal Effect Estimation quantifies the true impact of an action or intervention by analyzing observational data. It moves beyond simple correlation to isolate causality using statistical methods, which is crucial for evaluating policies, business strategies, and medical treatments where A/B tests are infeasible.
Planning
by AaaS
Enables agents to create structured execution plans for multi-step tasks by analyzing goals, identifying sub-tasks, ordering dependencies, and allocating resources. Supports plan revision when steps fail or new information emerges during execution.
Named Entity Recognition
by AaaS
Identifies and classifies named entities (people, organizations, locations, dates, etc.) within unstructured text. Supports custom entity types, relationship extraction between entities, and structured output formatting for downstream processing.
Sim-to-Real Transfer
by Community
Sim-to-Real Transfer is a set of techniques used in robotics and AI to bridge the 'reality gap' between simulation and the real world. It enables models and control policies trained in a virtual environment to be deployed effectively on physical hardware, drastically reducing the need for costly, time-consuming, and potentially unsafe real-world data collection.
Multi-Agent Coordination
by AaaS
Multi-Agent Coordination involves designing systems where multiple autonomous agents collaborate to achieve a common goal. This skill encompasses architectural patterns like hierarchical supervision and peer-to-peer negotiation for task distribution and conflict resolution. It focuses on managing shared information and ensuring coherent collective action in complex, dynamic environments.
Streaming Responses
by AaaS
This skill involves implementing real-time, token-by-token data delivery from Large Language Models to end-users. It utilizes protocols like Server-Sent Events (SSE) or WebSockets to create interactive and responsive applications, such as chatbots or code assistants, by progressively displaying content as it's generated.
Reranking
by AaaS
Applies a cross-encoder or LLM-based reranker to refine initial retrieval results by scoring query-document pairs for relevance. Dramatically improves precision by promoting the most contextually relevant passages to the top of the result set.
OCR Extraction
by AaaS
Extracts structured data from unstructured documents (PDFs, scanned images, email attachments) using optical character recognition with layout-aware parsing. Handles multi-page invoices, varying formats, and poor scan quality — producing structured key-value pairs for downstream reconciliation.
Escalation Routing
by AaaS
Routes unresolved or high-risk incidents to the appropriate human responder with full diagnostic context. Determines escalation urgency (P1-P5), identifies the correct on-call engineer or team based on service ownership, and packages a complete incident summary (timeline, diagnostics run, hypothesis). A cross-foundry skill reused by Customer Success (F4) and Healthcare (F9) agents.
Content Filtering
by AaaS
A system that automatically screens text inputs and outputs for large language models (LLMs) to detect and manage harmful content. It uses multi-category classification to identify issues like toxicity, hate speech, and violence, applying configurable rules and thresholds to enforce safety policies and protect users.
Code Explanation
by AaaS
Provides detailed, multi-level explanations for code snippets, functions, or entire repositories. It breaks down complex algorithms, clarifies control flow, and describes the purpose of variables and dependencies. The skill supports numerous programming languages, generating documentation-style overviews or granular, line-by-line analyses to accelerate learning and code reviews.
Web Browsing
by AaaS
Empowers autonomous agents to interact with the web like a human user. This skill provides the core functionality to navigate to URLs, render pages including executing JavaScript, and parse DOM elements. It enables complex workflows such as filling out forms, clicking buttons, and extracting structured data for analysis or task completion.
Continual Learning
by Community
A machine learning paradigm enabling models to learn sequentially from a continuous stream of data without forgetting previously acquired knowledge. Continual Learning, or Lifelong Learning, directly addresses the problem of catastrophic forgetting in neural networks using methods like regularization, memory replay, and dynamic architectures.
Prompt Injection Defense
by AaaS
Detects and mitigates prompt injection attacks where malicious inputs attempt to override system instructions or extract sensitive information. Implements input sanitization, instruction hierarchy enforcement, and output monitoring to protect LLM-powered applications.
Agentic RAG
by AaaS
Agentic RAG transforms Retrieval-Augmented Generation from a static, single-step process into a dynamic, multi-step workflow. In this paradigm, an LLM-powered agent intelligently decides when to retrieve information, what queries to use, and whether to perform additional retrieval cycles, often using external tools to refine its approach.
SQL Generation
by AaaS
Converts natural language questions into executable SQL queries against relational databases. Supports schema-aware generation, multi-table joins, aggregations, and query optimization with dialect-specific syntax for PostgreSQL, MySQL, SQLite, and others.
Sentiment Analysis
by AaaS
Classifies the emotional tone and sentiment polarity of customer text communications — support tickets, survey responses, chat logs, and social mentions. Produces sentiment scores with confidence levels, enabling churn prevention and coaching agents to identify dissatisfied accounts before explicit complaints surface.
Knowledge Retrieval
by AaaS
Retrieves relevant articles, documentation, and policy information from knowledge bases in response to real-time queries. Uses hybrid search (keyword + semantic) with cross-encoder reranking to surface the most contextually appropriate content for support and coaching agents.
Ticket Routing
by AaaS
Classifies support tickets by category, urgency, and required expertise, then routes them to the correct queue or human agent. Handles auto-resolution for simple cases and escalates complex ones with full context summaries.
Deployment Monitoring
by AaaS
Continuously observes deployment pipelines and post-deploy health metrics. Detects anomalous deployment patterns (elevated error rates, latency spikes, failed health checks) within seconds of release. Integrates with canary and blue-green deployment strategies to provide real-time go/no-go signals based on configurable thresholds.
Lead Scoring
by AaaS
Assigns numerical scores to leads based on demographic fit, firmographic match, behavioral engagement, and intent signals. Enables agents to rank prospects by conversion likelihood and route high-scoring leads to immediate outreach while nurturing lower-scoring ones.
Context Window Optimization
by AaaS
A set of techniques for managing the limited memory (context window) of Large Language Models. It involves strategically structuring prompts, summarizing or pruning conversation history, and selectively including relevant information to ensure efficient, cost-effective, and coherent long-form interactions with an AI.
PII Detection
by AaaS
Identifies and flags personally identifiable information (PII) in text data, including names, addresses, phone numbers, SSNs, and financial details. Supports configurable sensitivity levels, redaction strategies, and compliance reporting for GDPR, HIPAA, and CCPA requirements.
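A toy regex pass shows the shape of the redaction step; these illustrative patterns miss many real-world formats, and production detectors layer NER models and checksum validation on top:

```python
import re

# Illustrative patterns only; not exhaustive for any real compliance regime.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach Jane at jane.doe@example.com or 555-867-5309, SSN 123-45-6789."))
# Reach Jane at [EMAIL] or [PHONE], SSN [SSN].
```

Typed placeholders rather than blanket deletion preserve enough structure for downstream analytics and compliance reporting.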
Entity Resolution
by Community
Identifies and merges records across heterogeneous data sources that refer to the same real-world entity, using blocking, similarity scoring, and classification models to scale to large corpora. Critical for maintaining knowledge graph integrity and enabling cross-source analytics.
Documentation Generation
by AaaS
Generates technical documentation from source code, including API references, README files, inline comments, and architectural guides. Adapts tone and detail level for different audiences from developer guides to end-user documentation.
Threshold Detection
by AaaS
Evaluates real-time metrics against configurable thresholds (SLOs, SLIs, error budgets) and triggers appropriate responses. Supports static thresholds, dynamic baselines, and anomaly-based detection. Distinguishes between noise and genuine threshold breaches using historical context and burn-rate analysis.
Causal Discovery
by Community
Causal Discovery is a subfield of AI that infers causal relationships from observational data. It constructs a Directed Acyclic Graph (DAG) to represent these cause-and-effect links without manual intervention or controlled experiments, using statistical algorithms to distinguish correlation from causation.
Agent Memory Systems
by AaaS
Teaches design and implementation of multi-tier agent memory architectures — in-context working memory, episodic memory via vector stores, and semantic memory via knowledge graphs — enabling agents to maintain coherent state across long-running tasks and sessions. Covers retrieval-augmented memory, memory consolidation, and forgetting strategies.
Structured Output RAG
by AaaS
This skill involves building Retrieval-Augmented Generation (RAG) systems that output structured data, like JSON, conforming to a predefined schema. Instead of unreliable free-form text, it uses techniques like constrained decoding and validation to ensure outputs are machine-readable and ready for direct use in APIs or databases.
Reflection
by AaaS
Allows agents to evaluate their own outputs, identify errors or weaknesses, and iteratively improve responses. Implements self-critique loops where the agent reviews its work against quality criteria and refines until standards are met.
Refund Processing
by AaaS
Processes customer refunds through payment gateway APIs within configurable monetary caps. Enforces refund policies (max per transaction, max per day, cooling periods) and generates immutable audit trails for every refund action. Escalates requests above caps to human approval.
Memory Management
by AaaS
Enables AI agents to maintain state and context across multiple interactions by managing short-term and long-term memory. This is crucial for creating coherent, personalized experiences, moving beyond stateless request-response models. It uses techniques like conversation buffers, summarization, and vector-based retrieval.
Web Scraping
by AaaS
Web scraping automates the extraction of large amounts of data from websites. By simulating human browsing, it can crawl through pages, parse HTML, and collect specific information like prices, contacts, or articles, transforming unstructured web content into structured data for analysis or other applications.
Autonomous Planning
by AaaS
Autonomous Planning enables AI agents to independently decompose high-level, long-horizon objectives into a structured graph of executable sub-tasks. It involves generating plans using classical (PDDL), LLM-based, or hybrid methods, estimating necessary resources, and dynamically replanning in response to execution failures or new environmental data.
Usage Trend Analysis
by AaaS
Tracks product usage patterns over time — login frequency, feature adoption, session duration, and activity drops. Identifies accounts showing declining engagement that correlate with churn risk, enabling proactive retention before the customer disengages.
Telemetry Analysis
by AaaS
Ingests and analyzes telemetry data (metrics, traces, spans) from distributed systems. Correlates performance data across service boundaries using distributed tracing, identifies bottleneck services, and produces latency breakdowns. Provides the observability foundation that SRE Triage and Latency Budget Planner agents depend on.
PO Matching
by AaaS
Matches extracted invoice data against Purchase Orders and receipt logs in ERP systems using deterministic matching rules (PO number, vendor, amount, line items). Handles partial matches, tolerance thresholds, and multi-line reconciliation. Routes exceptions to human queues with full mismatch details.
Calendar Negotiation
by AaaS
Accesses multiple participants' calendars simultaneously and finds optimal meeting times across time zones, working hours, and scheduling constraints. Handles rescheduling, cancellations, and conflict resolution autonomously.
Approval Workflow
by AaaS
Routes transactions, documents, and exceptions through configurable multi-step approval chains based on amount thresholds, risk levels, and organizational policies. Tracks approver actions with timestamps, sends reminders for pending items, and escalates stalled approvals — ensuring no payment or commitment is authorized without the required sign-offs.
Pull Request Generation
by AaaS
Generates complete, well-structured pull requests, including a descriptive title, a detailed body with change rationale, a test results summary, a dependency diff, and reviewer assignments. Follows the organization's PR template and conventional commit conventions. Produces PRs that human reviewers can approve quickly because all context is pre-packaged.
Constitutional AI
by AaaS
Applies Anthropic's Constitutional AI principles to self-supervise model outputs against a set of defined rules or principles. The model critiques and revises its own responses to ensure they align with safety guidelines, ethical principles, and quality standards.
Output Validation
by AaaS
Validates LLM outputs against expected schemas, formats, and quality criteria before delivery to end users. Implements JSON schema validation, hallucination checks, citation verification, and automated retry logic for outputs that fail validation.
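The validate-then-retry loop can be sketched in a few lines. This is an illustrative simplification, not the entry's actual implementation: `validate_output` checks a plain `{key: type}` mapping rather than full JSON Schema, and `generate` stands in for whatever LLM call you wire up.

```python
import json

def validate_output(raw: str, required_keys: dict) -> tuple[bool, list[str]]:
    """Check that an LLM response parses as JSON and matches a simple
    {key: expected_type} schema. Returns (ok, list of error messages)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, [f"invalid JSON: {e}"]
    errors = []
    for key, expected_type in required_keys.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            errors.append(f"wrong type for {key}")
    return (not errors), errors

def call_with_retry(generate, schema, max_attempts=3):
    """Call a generation function until its output validates; return the
    valid raw output, or None if every attempt fails."""
    for _ in range(max_attempts):
        raw = generate()
        ok, _errors = validate_output(raw, schema)
        if ok:
            return raw
    return None
```

In production the type check would typically be replaced by a real JSON Schema validator, with the error messages fed back into the retry prompt.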
Counterfactual Reasoning
by Community
Generates and evaluates counterfactual explanations — minimal input changes that would alter a model's prediction — using structural causal models and algorithmic recourse techniques. Provides actionable explanations for model decisions and supports causal effect estimation under interventions.
Personalized Outreach
by AaaS
Drafts hyper-personalized outreach messages for each prospect using their specific firmographic profile, recent intent signals, and ICP match factors. Enforces brand voice and CAN-SPAM/GDPR compliance, adapts tone by channel (email, LinkedIn, phone script), and graduates from human-approved to autonomous sending as trust is established.
Dependency Mapping
by AaaS
Constructs complete dependency graphs across package managers (npm, pip, cargo, Maven) and internal modules. Identifies version conflicts, circular dependencies, security-vulnerable transitive dependencies, and upgrade paths. Produces actionable dependency health reports that inform both the Codebase Architect and Dependency Guardian agents.
Buyer Intent Tracking
by AaaS
Monitors buyer intent signals across website visits, email opens, content downloads, CRM activity, and third-party intent data providers. Correlates engagement patterns to identify accounts showing active buying behavior, enabling agents to prioritize high-intent prospects over cold outreach.
Speaker Diarization
by AaaS
Enables agents to segment audio recordings by speaker identity, answering 'who spoke when' for downstream summarization and analysis tasks. Covers embedding-based clustering (pyannote.audio, NeMo), overlapping speech handling, and merging diarization with ASR transcripts.
Rollback Execution
by AaaS
Executes safe, policy-constrained rollbacks of failed deployments. Respects blast-radius limits (max affected services), rate limits (max rollbacks per hour), and change-window constraints. Supports multiple rollback strategies: Git revert, container image pinning, feature flag disabling, and traffic shifting. Produces a detailed rollback report with root cause hypothesis.
Hugging Face Transformers Training Script
by Hugging Face
The Hugging Face Transformers training script simplifies the process of training and fine-tuning transformer models for various NLP tasks. It provides a high-level API and pre-built training loops, enabling users to quickly adapt pre-trained models to their specific datasets and objectives.
PyTorch Image Classification Script
by PyTorch
A Python script using PyTorch for training and evaluating image classification models. It provides a modular structure for defining datasets, models, training loops, and evaluation metrics, enabling researchers and practitioners to quickly prototype and deploy image classification solutions.
TensorFlow Model Garden
by Google
The TensorFlow Model Garden is a repository containing a collection of example implementations for state-of-the-art (SOTA) machine learning models and modeling solutions for TensorFlow. It provides a wide variety of models, pre-trained weights, and scripts to help users quickly prototype and deploy TensorFlow-based AI solutions.
TensorFlow Model Optimization Toolkit Script
by Google
The TensorFlow Model Optimization Toolkit script provides tools and techniques to optimize TensorFlow models for deployment, including quantization, pruning, and clustering. It reduces model size and improves inference speed, making models more suitable for edge devices and resource-constrained environments.
Scikit-learn Model Evaluation Script
by Scikit-learn
A Python script leveraging scikit-learn to comprehensively evaluate machine learning models. It calculates various performance metrics (e.g., accuracy, precision, recall, F1-score, AUC) and generates visualizations (e.g., confusion matrices, ROC curves) to provide insights into model behavior and facilitate informed decision-making.
LangChain Expression Language (LCEL) Script
by LangChain
LCEL is a declarative way to compose chains of language models and other primitives in LangChain. This script demonstrates how to use LCEL to build complex AI pipelines with features like streaming, parallel execution, and retry mechanisms, enabling developers to create robust and scalable AI applications.
Stable Diffusion XL Turbo Inference Script
by Stability AI
This script provides a streamlined path to image generation with Stable Diffusion XL Turbo. It leverages optimized inference techniques to achieve faster generation speeds, making it suitable for real-time applications and interactive experiences.
Speech-to-Text Pipeline
by OpenAI
Production-grade ASR pipeline using OpenAI Whisper or faster-whisper with VAD-based chunking, speaker timestamp alignment, and SRT/VTT subtitle export. Handles long-form audio via sliding window segmentation and automatic language detection.
Object Detection Setup
by Ultralytics
Bootstraps a production-ready object detection workflow using YOLOv8 or RT-DETR, including webcam/video stream ingestion, NMS post-processing, and annotation overlay rendering. Outputs annotated frames and a structured JSON detections log suitable for downstream analytics.
Feature Importance Analyzer
by Community
Analyzes feature importance for scikit-learn compatible models using multiple advanced techniques. It computes SHAP values with Tree and Kernel Explainers, calculates permutation importance, and performs feature selection with Boruta. Results are compiled into an interactive HTML dashboard for easy interpretation and sharing.
REST AI API Template
by Community
Production-ready FastAPI template for AI-powered REST APIs, with pre-wired OpenAI/Anthropic client, async streaming endpoints, JWT authentication, rate limiting, structured logging, and OpenAPI docs. Includes Docker Compose stack with Redis rate-limit store and Prometheus metrics.
Fraud Detection Pipeline
by Community
A complete machine learning pipeline for detecting fraudulent transactions in real time. It takes a hybrid approach, using XGBoost or LightGBM for classification and an Isolation Forest for anomaly detection, and is specifically designed to handle severely imbalanced datasets through SMOTE-Tomek resampling and cost-sensitive learning.
Image Classification Pipeline
by Community
End-to-end image classification pipeline that handles dataset loading, preprocessing, model inference, and result export using PyTorch and torchvision. Supports batch inference against any Hugging Face ViT or ResNet checkpoint with configurable confidence thresholds.
Model Fine-Tuning (LoRA)
by AaaS
This script automates the process of fine-tuning large language models using Low-Rank Adaptation (LoRA). It provides an end-to-end workflow, from preparing custom datasets to training lightweight adapters and merging them into a base model for efficient deployment. This enables domain-specific model specialization with significantly reduced computational costs.
OCR Pipeline Script
by Community
This script provides a sophisticated OCR pipeline that intelligently routes documents to the most suitable engine—Tesseract, PaddleOCR, or a cloud API—based on image quality analysis. It processes various document types and outputs structured JSON containing text sorted by reading order, complete with bounding box coordinates and confidence scores for each word or line.
Image Segmentation Script
by Meta AI
Runs Segment Anything Model (SAM 2) or Mask2Former on image batches, producing per-pixel segmentation masks with class labels and confidence scores. Includes utilities for mask overlay visualization and RLE-encoded mask export compatible with COCO annotation format.
Data Quality Checker
by Great Expectations
Automates data quality testing for tabular data using the Great Expectations library. This script profiles datasets to generate and validate 'Expectations' covering schema, statistical properties, and referential integrity. It produces a comprehensive HTML report (Data Docs) and can be integrated into CI/CD pipelines as a quality gate to prevent bad data from entering production systems.
PII Redaction Pipeline
by Microsoft
An automated pipeline that leverages Microsoft Presidio to identify and remove personally identifiable information (PII) from text and structured data. It supports configurable entity recognizers for GDPR and HIPAA compliance and features a reversible pseudonymization capability with a secure vault for authorized re-identification.
Basic RAG Pipeline
by AaaS
This script provides a foundational Retrieval-Augmented Generation (RAG) pipeline. It handles core tasks like loading documents, splitting text into chunks, generating embeddings, and indexing them into a vector store. It includes a basic query interface, making it ideal for learning the RAG workflow and prototyping simple applications.
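The text-splitting step at the heart of such a pipeline can be sketched as a sliding character window with overlap. This is a minimal stand-in — real splitters usually chunk on token or sentence boundaries rather than raw characters:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows — the document-splitting
    step of a basic RAG pipeline. Consecutive chunks share `overlap` chars
    so that no sentence is cut off without context."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + chunk_size])
    return chunks
```

Each chunk would then be embedded and indexed into the vector store; the overlap keeps boundary sentences retrievable from either side.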
Speaker Diarization Script
by pyannote
This script automates the process of creating turn-by-turn transcripts from multi-speaker audio files. It first uses the pyannote.audio library to perform speaker diarization, identifying who spoke and when. These speaker segments are then aligned and merged with a transcription generated by OpenAI's Whisper, producing a final text output that attributes each line of dialogue to a specific speaker.
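One common way the alignment step works — and a simplified stand-in for this script's actual merge logic — is to assign each transcribed word to the diarization turn that contains the word's midpoint:

```python
def assign_speakers(words, turns):
    """Attach a speaker label to each ASR word by midpoint overlap.
    words: list of (word, start_sec, end_sec) from the transcript;
    turns: list of (speaker, start_sec, end_sec) from diarization."""
    labeled = []
    for word, start, end in words:
        mid = (start + end) / 2
        speaker = next(
            (spk for spk, ts, te in turns if ts <= mid < te), "UNKNOWN"
        )
        labeled.append((speaker, word))
    return labeled
```

Consecutive words with the same label are then grouped into turn-by-turn lines of dialogue.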
Chatbot Builder Script
by Community
This script generates a production-ready chatbot foundation using Rasa for structured dialogue and an LLM for open-ended fallback. It provides a unified channel adapter for deploying to Web, WhatsApp, and Slack, and includes built-in conversation analytics and a Streamlit-based testing environment for rapid development.
Neo4j RAG Pipeline
by Neo4j
Implements a GraphRAG pattern that stores document entities and relationships in Neo4j, then retrieves contextually relevant subgraphs at query time before passing them to an LLM. Includes automatic entity extraction with spaCy, relationship inference, and a Cypher query generator.
Visual Search Engine
by Community
This script provides a complete framework for building a multimodal visual search engine. It uses CLIP to generate image and text embeddings, which are indexed in a vector database like Qdrant or Weaviate for efficient similarity search. The system supports both text-to-image and image-to-image queries and includes a FastAPI server for API access.
Serverless Model Deploy
by Community
Packages a trained ML model into a serverless function on AWS Lambda, Modal, or Google Cloud Run, handling cold-start optimization, dependency layering, and auto-scaling configuration. Includes health-check endpoints, structured logging, and a GitHub Actions workflow for automated rollout.
Recommendation Engine Setup
by Community
This script provides a complete setup for a modern, two-stage recommendation engine. It uses a two-tower neural network for efficient candidate retrieval and a powerful Large Language Model (LLM) for nuanced re-ranking. The system integrates with a Feast feature store to leverage real-time user context, ensuring timely and relevant suggestions.
Edge Model Optimization
by Community
Optimizes PyTorch and TensorFlow models for edge hardware by applying INT8/FP16 quantization and converting them to ONNX or TFLite formats. This script provides platform-specific tuning for ARM and NPU targets, benchmarking latency and memory usage while generating a report on accuracy trade-offs.
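The core of INT8 quantization is a simple scale-and-round. A toy per-tensor symmetric version (real toolchains quantize per-channel with calibration data, so treat this purely as a sketch of the arithmetic):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: pick a scale so the largest
    magnitude maps to 127, then round each weight to an int8 step."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights; error is at most half a step."""
    return [v * scale for v in q]
```

The accuracy trade-off the entry mentions comes exactly from this rounding error, which the script benchmarks against the FP32 original.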
Model Serving (vLLM)
by AaaS
This script automates the deployment of a large language model using the vLLM inference engine. It creates a high-throughput, OpenAI-compatible API endpoint. Key features like PagedAttention and continuous batching are configured to maximize performance and memory efficiency, making it suitable for production environments.
WebSocket Streaming API
by Community
WebSocket server that proxies token-by-token LLM streaming to multiple simultaneous clients, with connection lifecycle management, heartbeat keep-alives, and per-session context persistence. Supports fan-out broadcasting for collaborative AI sessions and reconnection with message replay.
Automated Feature Engineering
by Alteryx
Applies Deep Feature Synthesis via Featuretools and AutoFeat to automatically generate hundreds of candidate features from relational tabular data, then prunes them using mutual information and SHAP-based importance filters. Produces a reproducible feature pipeline serializable to scikit-learn format.
Sentiment Dashboard
by Community
Ingests social media feeds, reviews, and support tickets in near-real-time, scores sentiment at entity and aspect level using a fine-tuned RoBERTa model, and renders a live Streamlit dashboard with trend charts, topic clustering, and configurable alert thresholds for brand-crisis detection.
Data Cleaning Script
by AaaS
Cleans and normalizes text data for LLM consumption by removing HTML artifacts, fixing encoding issues, standardizing whitespace, deduplicating near-identical entries, and filtering low-quality content based on configurable quality heuristics.
Voice Cloning Setup
by Coqui
Sets up a zero-shot voice cloning pipeline using Coqui XTTS-v2 or Tortoise-TTS, requiring only a 3-second reference audio clip to synthesize new speech in the target voice. Includes a FastAPI inference server, audio quality normalization, and speaker embedding export for reuse.
Document Ingestion Pipeline
by AaaS
Automated pipeline for ingesting documents from multiple sources (files, URLs, APIs) into a vector store. Handles format detection, text extraction, chunking, deduplication, metadata enrichment, and incremental updates for growing knowledge bases.
Dataset Preparation
by AaaS
Prepares datasets for LLM fine-tuning by converting raw data into instruction-following, conversation, or completion formats. Handles data cleaning, deduplication, train/val/test splitting, tokenization analysis, and quality filtering.
Web Scraping Pipeline
by AaaS
Automated web scraping pipeline with configurable crawl depth, content extraction, and rate limiting. Converts web content into clean text documents suitable for embedding and RAG ingestion with support for dynamic JavaScript-rendered pages.
Tool Calling Setup
by AaaS
Sets up a tool-calling agent with typed tool definitions, argument validation, error handling, and execution sandboxing. Includes example tools for web search, calculator, file operations, and database queries with a pluggable tool registry.
Batch Embedding Generation
by AaaS
Generates embeddings at scale for large document collections with batching, rate limiting, checkpointing, and error recovery. Supports multiple embedding providers (OpenAI, Cohere, local models) with automatic dimension detection and output format selection.
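The batching-with-checkpointing pattern can be sketched as follows, with `embed_fn` standing in for any provider client (OpenAI, Cohere, or a local model) — the checkpoint dict would be persisted to disk in a real run:

```python
def embed_in_batches(texts, embed_fn, batch_size=32, checkpoint=None):
    """Embed texts in fixed-size batches, skipping batches already present
    in the checkpoint so a crashed run can resume where it left off.
    checkpoint maps batch index -> list of vectors."""
    checkpoint = {} if checkpoint is None else checkpoint
    for i in range(0, len(texts), batch_size):
        batch_idx = i // batch_size
        if batch_idx in checkpoint:
            continue  # already embedded in a previous run
        checkpoint[batch_idx] = embed_fn(texts[i:i + batch_size])
    # flatten results back into input order
    return [vec for idx in sorted(checkpoint) for vec in checkpoint[idx]]
```

Rate limiting and error recovery then wrap the `embed_fn` call: back off on 429s, and leave the failed batch out of the checkpoint so the next pass retries it.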
Temporal Feature Builder
by Community
Generates comprehensive temporal features from time-series data including rolling statistics, lag features, Fourier transforms, and calendar encodings using tsfresh and custom transformers. Handles irregular time series with forward-fill interpolation and produces a point-in-time-correct feature matrix to prevent leakage.
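The point-in-time-correctness constraint is the key idea: row t may only use observations from t-1 and earlier. A stdlib-only sketch of lag and rolling-mean features (tsfresh generates far more, but under the same rule):

```python
def lag_features(series, lags=(1, 2), window=3):
    """Build leakage-free temporal features: for each time step t, emit
    lagged values and a rolling mean computed strictly from the past."""
    rows = []
    for t in range(len(series)):
        feats = {}
        for lag in lags:
            feats[f"lag_{lag}"] = series[t - lag] if t - lag >= 0 else None
        past = series[max(0, t - window):t]  # excludes series[t] itself
        feats[f"rollmean_{window}"] = sum(past) / len(past) if past else None
        rows.append(feats)
    return rows
```

Using `series[t]` inside the rolling window would leak the target into its own features — the exact bug a point-in-time-correct builder exists to prevent.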
RAG Pipeline Setup
by AaaS
End-to-end setup script for deploying a production RAG pipeline. Provisions vector database, configures document ingestion, sets up embedding generation, and creates retrieval endpoints.
Advanced RAG Pipeline
by AaaS
Production-grade RAG pipeline with hybrid search, reranking, contextual compression, and multi-index routing. Includes query decomposition, metadata filtering, evaluation metrics, and performance monitoring for enterprise deployments.
Model A/B Testing
by Community
Implements statistically rigorous A/B and shadow-mode testing for competing ML model versions behind a feature flag router, logging predictions and latencies to a data warehouse for significance testing. Automatically computes sample size requirements and stops experiments when significance thresholds are met.
Graph Embedding Generator
by Community
Generates node and edge embeddings for knowledge graphs using Node2Vec, TransE, or a GNN (via PyTorch Geometric), then indexes them in a vector store for similarity search and link prediction. Includes training scripts, evaluation on standard link-prediction benchmarks, and a REST API for embedding lookup.
Financial Report Parser
by Community
Parses SEC filings, earnings call transcripts, and annual reports using FinBERT for sentiment analysis and a table-extraction pipeline that converts HTML/XBRL financial statements into normalized pandas DataFrames. Exports structured financial metrics to a database and generates LLM-ready summaries for investor Q&A.
Clinical NLP Pipeline
by Community
Processes unstructured clinical notes using medspaCy and BioClinicalBERT to extract diagnoses, medications, procedures, and lab values, then maps entities to ICD-10 and SNOMED-CT codes. Outputs FHIR-compatible JSON bundles and includes a de-identification step compliant with HIPAA Safe Harbor.
Feature Store Sync
by Feast
Synchronizes feature definitions and materialized feature values between offline (BigQuery/Snowflake) and online (Redis/DynamoDB) feature stores using Feast or Tecton, with configurable freshness SLAs and backfill scheduling. Includes drift monitoring to alert when online and offline distributions diverge.
PDF Extraction Pipeline
by AaaS
Specialized pipeline for extracting structured content from PDF documents including text, tables, images, and metadata. Supports OCR for scanned documents, layout analysis for complex formats, and chunking optimized for PDF document structures.
Docker ML Deployment
by AaaS
Containerizes ML models and inference servers with optimized Docker images for production deployment. Includes multi-stage builds for minimal image size, GPU support configuration, health checks, and docker-compose setups for full inference stacks.
Canary Deployment ML
by Community
Orchestrates progressive canary deployments of ML model services on Kubernetes using Istio traffic shifting, with automated rollback triggered by error-rate or latency SLO breaches. Integrates with Argo Rollouts for declarative release management and posts deployment status to Slack.
Model Evaluation Harness
by AaaS
Comprehensive model evaluation script that runs models against standard benchmarks including MMLU, HumanEval, GSM8K, and custom evaluation sets. Produces detailed reports with per-category breakdowns, confidence intervals, and comparison charts.
GGUF Conversion
by AaaS
Converts Hugging Face model weights to GGUF format for use with llama.cpp and compatible inference engines. Supports multiple quantization levels (Q4_K_M, Q5_K_M, Q8_0), validates output integrity, and generates model cards with performance characteristics.
Music Generation Script
by Meta AI
Generates royalty-free music from text prompts using Meta's MusicGen or AudioCraft, with controls for tempo, key, duration, and genre conditioning. Provides a CLI for batch generation and a streaming mode that writes 30-second chunks to disk or an S3 bucket.
Face Recognition Setup
by Community
Configures a face recognition system using InsightFace or DeepFace, supporting gallery enrollment, real-time identification against a FAISS vector store, and liveness detection. Designed with privacy-first defaults and includes GDPR-compliant consent logging.
Model Comparison Script
by AaaS
Side-by-side model comparison script that runs identical prompts through multiple LLM APIs and presents results in a structured format. Measures response quality, latency, token usage, and cost per query with automated scoring via LLM judges.
Knowledge Graph Builder
by Community
Automatically constructs a knowledge graph from unstructured text by extracting subject-predicate-object triples using an LLM, then serializing them to RDF/OWL or property-graph formats. Supports ontology alignment, duplicate merging via entity resolution, and Turtle/JSON-LD export.
Knowledge Base Builder
by AaaS
End-to-end script for building a searchable knowledge base from heterogeneous sources including documents, APIs, databases, and web content. Orchestrates ingestion, deduplication, embedding, indexing, and creates a unified query interface across all sources.
Legal Document Analyzer
by Community
Analyzes legal contracts and court documents using a fine-tuned LegalBERT model for clause classification, obligation extraction, and risk-flag detection, with outputs cross-referenced against a configurable playbook of standard clause definitions. Generates a redline-ready Word document and a structured JSON risk register.
Multi-Agent Orchestration
by AaaS
Orchestrates multiple specialized AI agents in coordinated workflows with task routing, state management, and result aggregation. Implements supervisor and swarm patterns with configurable agent selection logic and inter-agent communication.
Model Quantization (GPTQ)
by AaaS
Quantizes language models using GPTQ for efficient inference on consumer hardware. Performs calibration-based quantization, quality evaluation against the original model, and exports in formats compatible with vLLM, llama.cpp, and other inference engines.
Entity Linking Script
by Community
Disambiguates named entities in text by linking them to canonical Wikidata or custom knowledge base entries, using a bi-encoder retriever followed by a cross-encoder reranker. Handles multi-lingual input via mBERT and outputs entity URIs with confidence scores for downstream graph population.
Cost Calculator
by AaaS
Calculates and projects LLM API costs based on usage patterns, model pricing, and workload forecasts. Compares costs across providers and models, identifies the most cost-effective configuration for a given quality threshold, and generates budget reports.
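The projection arithmetic is straightforward once usage is profiled. A sketch with hypothetical model names and prices — real per-million-token rates must come from each provider's current price sheet:

```python
PRICES = {  # hypothetical $ per 1M tokens: (input, output) — not real rates
    "model-small": (0.15, 0.60),
    "model-large": (3.00, 15.00),
}

def monthly_cost(model, requests_per_day, in_tokens, out_tokens, days=30):
    """Project monthly API spend from average request size and volume."""
    price_in, price_out = PRICES[model]
    per_request = in_tokens / 1e6 * price_in + out_tokens / 1e6 * price_out
    return round(per_request * requests_per_day * days, 2)
```

Comparing `monthly_cost` across entries in `PRICES` at a fixed quality threshold is how the cheapest viable configuration is identified.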
Hallucination Detector
by AaaS
Detects hallucinated content in LLM outputs by cross-referencing claims against source documents and knowledge bases. Uses claim decomposition, source attribution scoring, and consistency checking to flag unsupported or fabricated statements.
Hybrid Search Setup
by AaaS
Configures a hybrid search system combining dense vector similarity with sparse BM25 keyword matching. Sets up dual index creation, score fusion strategies, and query routing logic for optimal retrieval across different query types.
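A common score-fusion strategy for such a system is Reciprocal Rank Fusion over the dense and sparse result lists. A minimal sketch (the `k=60` default is the conventional choice, and the function is illustrative rather than tied to any particular library):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked result lists (e.g. one from dense vector search, one
    from BM25) with RRF: score(doc) = sum over lists of 1 / (k + rank).
    Documents ranked highly in several lists rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is popular here because it needs no score normalization — cosine similarities and BM25 scores live on incompatible scales, but ranks are always comparable.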
Prompt Testing Suite
by AaaS
Automated testing framework for prompt engineering with test case management, assertion-based evaluation, regression detection, and A/B comparison. Validates prompt outputs against expected patterns, formats, and quality criteria with CI/CD integration.
MCP Server Template
by AaaS
Template for building Model Context Protocol (MCP) servers that expose tools, resources, and prompts to MCP-compatible clients. Includes typed tool handlers, resource providers, error handling, and transport configuration for stdio and HTTP modes.
LLM Load Testing
by AaaS
Load tests LLM API endpoints with configurable concurrency, request patterns, and duration. Measures throughput, latency percentiles (p50/p95/p99), time-to-first-token, error rates, and generates performance reports with degradation alerts.
CSV to Embeddings
by AaaS
Converts CSV data into vector embeddings with configurable column selection, text template formatting, and metadata extraction. Outputs to popular vector stores or file formats with chunking support for large CSV files that exceed memory limits.
Audio Classification Setup
by Community
Configures an audio classification system using Audio Spectrogram Transformer (AST) or YAMNet fine-tuned on AudioSet, with Mel spectrogram feature extraction and batch inference. Exports per-clip predictions with top-5 class probabilities and integrates with a streaming event bus for real-time use.
Document Classification
by AaaS
Classifies documents into predefined categories using LLM-based inference with configurable taxonomies. Supports batch processing, multi-label classification, confidence thresholds, and exports results to CSV or database with audit trails.
Data Lineage Tracker
by OpenLineage
Instruments ETL and ML pipelines with OpenLineage events, shipping dataset-level provenance metadata to a Marquez or Apache Atlas backend. Generates interactive lineage DAGs showing data transformations from source to model artifact, supporting impact analysis and audit trails.
Cost Optimization Script
by AaaS
Analyzes LLM API usage patterns and identifies cost optimization opportunities. Recommends model downgrades for simple tasks, prompt compression strategies, caching opportunities, and batch processing windows based on historical usage data and cost metrics.
GraphQL AI Gateway
by Community
GraphQL gateway for multi-model AI services built with Strawberry Python, exposing query, mutation, and subscription resolvers for chat, embedding, and image generation endpoints across multiple LLM providers. Features a DataLoader-based batching layer and persisted query caching to minimize token usage.
Supply Chain Optimizer
by Community
Combines ML demand forecasting (Prophet + LightGBM) with constraint-based optimization (Google OR-Tools) to minimize inventory costs while meeting service-level targets across a multi-echelon supply chain. Outputs replenishment orders, safety stock recommendations, and a scenario simulation dashboard.
Safety Audit Script
by AaaS
Comprehensive safety audit for LLM-powered applications testing for prompt injection vulnerabilities, PII leakage, harmful content generation, and policy violations. Generates detailed audit reports with severity ratings and remediation recommendations.
Entity Extraction Pipeline
by AaaS
Extracts named entities and relationships from unstructured text at scale using LLM-powered NER with custom entity type support. Outputs structured data with entity linking, relationship graphs, and confidence scores for knowledge graph construction.
Monitoring Setup (Grafana)
by AaaS
Sets up Grafana dashboards and Prometheus metrics for LLM application monitoring. Includes pre-built dashboards for token usage, latency, error rates, cost tracking, and model performance with configurable alert rules and notification channels.
Model Benchmarking Suite
by AaaS
Performance benchmarking suite measuring LLM inference throughput, latency percentiles, time-to-first-token, and tokens-per-second under various load patterns. Generates detailed performance reports with charts for capacity planning and SLA validation.
LLM Regression Testing
by AaaS
Detects regressions in LLM behavior across model updates, prompt changes, or configuration modifications. Runs golden test sets, compares outputs using semantic similarity and LLM judges, and flags significant quality degradation with detailed diff reports.
Agent Evaluation Framework
by AaaS
Evaluates AI agent performance across defined test scenarios with success criteria, step tracking, and automated scoring. Supports custom evaluation rubrics, regression detection, and generates detailed reports comparing agent versions over time.
Annotation Pipeline
by AaaS
Automated data annotation pipeline using LLMs for labeling, classification, and quality scoring of training data. Implements multi-annotator consensus, confidence thresholds, human review queuing for uncertain samples, and annotation analytics.
Token Usage Analyzer
by AaaS
Analyzes token usage patterns across LLM applications to identify optimization opportunities. Tracks input/output token ratios, identifies verbose prompts, detects unnecessary context, and recommends prompt engineering improvements for cost reduction.
CI/CD ML Pipeline
by AaaS
CI/CD pipeline for machine learning models with automated testing, evaluation, registry management, and staged deployment. Runs benchmark suites, compares against baseline metrics, and promotes models through staging environments with approval gates.
Bias Detection Script
by AaaS
Detects demographic and topical biases in LLM outputs by running structured test prompts across protected categories. Measures response quality disparities, sentiment differences, and representation gaps with statistical significance testing and bias scorecards.
Rate Limiter Setup
by AaaS
Configures intelligent rate limiting for LLM API proxies with per-user, per-model, and per-endpoint limits. Implements token bucket, sliding window, and adaptive rate limiting algorithms with Redis-backed distributed state and graceful degradation.
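Of the algorithms listed, the token bucket is the simplest to sketch. A single-process version with an injectable clock (the Redis-backed distributed variant keeps the same state, just in a shared store):

```python
import time

class TokenBucket:
    """Token-bucket limiter: holds at most `capacity` tokens, refilled at
    `rate` tokens/second; each allowed request spends one token, so short
    bursts up to `capacity` pass while the long-run rate is capped."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Per-user and per-model limits are then just separate buckets keyed by user ID or model name.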
Agent Deployment Script
by AaaS
Deploys AI agents as production services with health checks, graceful shutdown, error recovery, and monitoring integration. Supports Docker and Kubernetes deployments with configurable scaling, environment management, and rollback capabilities.
Red Teaming Script
by AaaS
Automated red teaming toolkit that generates and tests adversarial prompts against LLM applications. Covers jailbreak attempts, prompt injection variants, social engineering patterns, and boundary probing with categorized attack vectors and success tracking.
Latency Benchmarking
by AaaS
Benchmarks LLM API latency across providers, models, and prompt sizes with detailed statistical analysis. Measures time-to-first-token, inter-token latency, total response time, and generates comparison reports with confidence intervals and percentile distributions.
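The percentile computation behind a p50/p95/p99 report can be done with the nearest-rank method in a few lines (shown here instead of a stats library to make the definition explicit):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of latency samples, p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_report(samples_ms):
    """Summarize a run of latency measurements at the usual percentiles."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

Tail percentiles need enough samples to be meaningful — p99 over fewer than a hundred requests is effectively just the slowest observation.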
Kubernetes Model Serving
by AaaS
Deploys and manages LLM inference workloads on Kubernetes with GPU scheduling, auto-scaling based on queue depth, rolling updates, and canary deployments. Generates Helm charts and Kustomize configurations for reproducible deployments.
Energy Forecast Script
by Community
Forecasts electricity demand and renewable generation (solar/wind) using Temporal Fusion Transformer or N-HiTS via NeuralForecast, with weather feature integration and probabilistic intervals for grid balancing. Outputs 24-hour and 7-day ahead forecasts in an InfluxDB-compatible format.
API Gateway Configuration
by AaaS
Configures an API gateway for LLM inference endpoints with provider routing, rate limiting, authentication, request/response logging, and failover between multiple LLM providers. Includes usage tracking and cost allocation by API key.
Multi-Source RAG
by AaaS
RAG pipeline that queries multiple specialized vector indexes and merges results with intelligent routing. Implements source-aware retrieval with automatic query classification, per-source relevance scoring, and citation tracking across diverse knowledge domains.
A/B Testing Framework
by AaaS
Framework for A/B testing different LLM configurations including models, prompts, temperatures, and system instructions. Runs controlled experiments with statistical significance testing, effect size calculation, and automated winner selection.
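For binary outcomes (e.g. "response judged acceptable"), the significance test can be a two-proportion z-test. A stdlib sketch of that one statistic — a full framework would also handle continuous metrics and multiple-comparison correction:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-proportion z-test comparing win rates of two LLM configs.
    Returns the z statistic; |z| > 1.96 is significant at roughly the
    5% level for a two-sided test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```

The automated winner selection the entry describes amounts to declaring a winner only once |z| clears the chosen threshold at the pre-computed sample size.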
Agent Monitoring Dashboard
by AaaS
Sets up a monitoring dashboard for AI agent systems tracking task completion rates, error rates, latency, token usage, and cost. Integrates with Prometheus for metrics collection and Grafana for visualization with pre-built alert rules.
Consent Management Script
by Community
Implements a GDPR-compliant consent management layer that records per-user data processing consents in an append-only ledger, enforces purpose limitation at the data access layer, and generates DSAR (data subject access request) reports on demand. Supports consent propagation to downstream ML training pipelines.
Model Merging
by AaaS
Merges multiple fine-tuned model checkpoints using strategies like SLERP, TIES, DARE, and linear interpolation. Enables combining specialized model capabilities without additional training, with automated quality validation against benchmark suites.
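The SLERP strategy interpolates along the arc between two weight tensors instead of the straight line, preserving vector norm. A toy version on plain float lists (real merges apply this per tensor across full checkpoints):

```python
import math

def slerp(v0, v1, t):
    """Spherical linear interpolation between two weight vectors:
    walk a fraction t along the great-circle arc from v0 to v1."""
    dot = sum(a * b for a, b in zip(v0, v1))
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    cos_theta = max(-1.0, min(1.0, dot / (n0 * n1)))
    theta = math.acos(cos_theta)
    if theta < 1e-6:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```

Linear interpolation between two unit-norm tensors shrinks the result toward the origin; SLERP avoids that, which is the usual argument for preferring it when merging checkpoints.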
Vector DB Migration
by AaaS
Migrates vector data between different vector database providers (Pinecone, Weaviate, Chroma, Qdrant, Milvus). Handles schema mapping, batch transfers, index recreation, metadata preservation, and validation with rollback support.
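The batch-transfer-and-validate pattern can be sketched provider-agnostically; `scan`, `upsert`, and `count` below are hypothetical client methods, not any vendor's actual API:

```python
def migrate_vectors(source, target, batch_size=2):
    """Copy (id, vector, metadata) records between two stores in
    batches, then validate the transferred count."""
    moved, batch = 0, []
    for record in source.scan():
        batch.append(record)
        if len(batch) >= batch_size:
            target.upsert(batch)
            moved += len(batch)
            batch = []
    if batch:                       # flush the final partial batch
        target.upsert(batch)
        moved += len(batch)
    assert moved == source.count(), "migration validation failed"
    return moved

class MemoryStore:
    """In-memory stand-in for a vector DB client."""
    def __init__(self, records=None):
        self.records = list(records or [])
    def scan(self):
        yield from self.records
    def count(self):
        return len(self.records)
    def upsert(self, batch):
        self.records.extend(batch)

src = MemoryStore([("a", [0.1], {}), ("b", [0.2], {}), ("c", [0.3], {})])
dst = MemoryStore()
print(migrate_vectors(src, dst))  # → 3
```

Schema mapping and index recreation happen before this loop; rollback simply means dropping the target collection if validation fails.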
Agent Testing Harness
by AaaS
Testing harness for AI agents with mock tool providers, simulated user interactions, and deterministic replay capabilities. Enables unit testing of agent logic, integration testing of tool chains, and end-to-end testing of complete agent workflows.
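A mock tool provider with deterministic replay can be as simple as scripted responses plus call recording; an illustrative harness with a toy agent loop (`get_weather` and `run_agent` are hypothetical names):

```python
class MockTool:
    """Deterministic mock tool: returns scripted responses in order
    and records every call for later assertions."""

    def __init__(self, name, responses):
        self.name, self.responses = name, list(responses)
        self.calls = []

    def __call__(self, **kwargs):
        self.calls.append(kwargs)
        return self.responses[len(self.calls) - 1]

def run_agent(tools, task):
    """Toy agent logic under test: look up the weather, then report it."""
    weather = tools["get_weather"](city=task["city"])
    return f"{task['city']}: {weather}"

weather_tool = MockTool("get_weather", ["sunny"])
result = run_agent({"get_weather": weather_tool}, {"city": "Oslo"})
print(result)               # → Oslo: sunny
print(weather_tool.calls)   # → [{'city': 'Oslo'}]
```

Because responses are scripted, any recorded run replays identically, which is what makes agent unit tests deterministic.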
A2A Communication Setup
by AaaS
Configures Agent-to-Agent (A2A) communication infrastructure with message routing, capability discovery, and protocol compliance. Sets up agent registries, message queues, and typed message schemas for reliable inter-agent collaboration.
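Typed message schemas with capability discovery can be sketched with dataclasses and per-agent queues; the schema fields here are illustrative, not a specific A2A protocol version:

```python
from dataclasses import dataclass
import queue

@dataclass
class A2AMessage:
    """Typed inter-agent message (fields are illustrative)."""
    sender: str
    recipient: str
    capability: str
    payload: dict

class AgentRegistry:
    """Capability discovery plus per-agent message queues."""

    def __init__(self):
        self.capabilities = {}   # agent -> set of advertised capabilities
        self.inboxes = {}        # agent -> message queue

    def register(self, agent, capabilities):
        self.capabilities[agent] = set(capabilities)
        self.inboxes[agent] = queue.Queue()

    def route(self, msg: A2AMessage):
        # Reject messages targeting a capability the recipient never advertised.
        if msg.capability not in self.capabilities.get(msg.recipient, set()):
            raise ValueError(f"{msg.recipient} lacks {msg.capability}")
        self.inboxes[msg.recipient].put(msg)

reg = AgentRegistry()
reg.register("summarizer", {"summarize"})
reg.route(A2AMessage("planner", "summarizer", "summarize", {"text": "..."}))
print(reg.inboxes["summarizer"].qsize())  # → 1
```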
MLPerf Training
by MLCommons
MLPerf Training is a suite of benchmarks that measure the time it takes to train various machine learning models on different hardware and software platforms. It provides a standardized way to compare the performance of different AI training systems, driving innovation in hardware and software optimization for AI workloads.
HELM: Holistic Evaluation of Language Models
by Stanford Center for Research on Foundation Models (CRFM)
HELM is a living benchmark designed to provide a comprehensive and holistic evaluation of language models across a wide range of scenarios and metrics. It aims to move beyond single-number evaluations by assessing models on factors like truthfulness, calibration, fairness, robustness, and efficiency, providing a more nuanced understanding of their capabilities and limitations.
ImageNet
by Deng et al. / Stanford / Princeton
ImageNet (ILSVRC) is the foundational large-scale visual recognition benchmark with 1.2 million training images across 1,000 object categories. Top-1 and Top-5 accuracy on the validation set have been the standard measures of progress in image classification for over a decade.
RoboSuite
by Stanford AI Lab
RoboSuite is a simulation framework and benchmark suite for robot learning. It provides a standardized set of environments and tasks for training and evaluating reinforcement learning algorithms in robotics, focusing on manipulation and locomotion tasks with realistic physics and sensor models.
AI2 Reasoning Challenge (ARC)
by Allen Institute for AI (AI2)
The AI2 Reasoning Challenge (ARC) is a question-answering dataset designed to evaluate advanced reasoning capabilities in AI systems. It consists of elementary-level science questions specifically crafted to be difficult for retrieval-based methods and to require deeper understanding and reasoning to answer correctly.
COCO Detection
by Lin et al. / Microsoft
COCO Detection is the standard benchmark for object detection and instance segmentation, featuring 330,000 images with over 1.5 million annotated instances across 80 object categories. Mean Average Precision (mAP) at various IoU thresholds is the primary metric.
LibriSpeech
by Panayotov et al. / Johns Hopkins
LibriSpeech is the standard English automatic speech recognition (ASR) benchmark derived from LibriVox audiobooks, containing 1,000 hours of read speech at 16kHz. Word Error Rate (WER) on the "clean" and "other" test splits drives competitive progress in ASR research.
ADE20K Segmentation
by Zhou et al. / MIT CSAIL
ADE20K is the benchmark for semantic scene parsing, containing 25,000 images densely annotated with 150 semantic categories. Mean Intersection over Union (mIoU) is the standard metric, and it drives progress in perception systems for autonomous driving, robotics, and scene understanding.
GSM8K
by OpenAI
Grade School Math 8K benchmark with 8,500 linguistically diverse grade school math word problems requiring 2-8 step reasoning. Tests basic mathematical reasoning and arithmetic with problems that require sequential multi-step solutions.
SWE-bench Verified
by OpenAI / Princeton NLP
Human-validated subset of SWE-bench containing 500 problems verified by software engineers for correctness, clarity, and solvability. Provides a more reliable signal than the full SWE-bench by filtering out ambiguous or under-specified issues.
MATH
by UC Berkeley
Collection of 12,500 competition mathematics problems from AMC, AIME, and other math competitions covering algebra, geometry, number theory, combinatorics, and more. Problems require multi-step reasoning and mathematical insight beyond pattern matching.
ARC-AGI
by Chollet / ARC Prize Foundation
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) measures fluid intelligence through visual grid transformation puzzles. Models must infer transformation rules from a handful of demonstration examples and apply them to a test grid — a task humans solve easily but that has historically been extremely difficult for AI systems.
HellaSwag
by Allen AI
Evaluates commonsense natural language inference by asking models to select the most plausible continuation of a scenario. Uses adversarially filtered endings generated by language models, making it challenging for machines while trivial for humans.
Common Voice
by Mozilla Foundation
Common Voice is Mozilla's crowd-sourced multilingual speech corpus spanning 100+ languages with verified recordings from volunteers. It benchmarks ASR systems on low-resource and diverse language conditions, making it critical for evaluating cross-lingual speech model generalization.
MLPerf Inference
by MLCommons
MLPerf Inference is the industry-standard benchmark for measuring AI inference performance across hardware platforms. It covers image classification, object detection, NLP, speech recognition, and generative AI workloads, enabling fair apples-to-apples comparison of accelerators and inference stacks.
ARC Challenge
by Allen AI
AI2 Reasoning Challenge featuring grade-school science questions that require commonsense reasoning and world knowledge. The Challenge set contains questions that simple retrieval and co-occurrence methods fail to answer correctly.
MedQA
by Jin et al. / UC San Diego
MedQA tests medical knowledge using free-form multiple-choice questions drawn from the US Medical Licensing Examination (USMLE). It evaluates whether language models can reason through complex clinical scenarios requiring deep biomedical knowledge.
MT-Bench
by LMSYS
Multi-turn conversation benchmark with 80 high-quality questions across 8 categories including writing, reasoning, math, coding, and extraction. Uses GPT-4 as an automated judge to evaluate response quality on a 1-10 scale across two conversation turns.
FLORES-200
by NLLB Team / Meta AI
FLORES-200 is a many-to-many multilingual translation benchmark covering 200 languages, including many low-resource ones. It evaluates machine translation systems across 40,000 language direction pairs, making it the most comprehensive translation benchmark for assessing cross-lingual generalization.
GPQA
by NYU
Graduate-level Google-Proof Question Answering benchmark featuring questions written by domain experts in physics, chemistry, and biology. Questions are designed to be unsearchable, requiring genuine reasoning rather than memorization.
TruthfulQA
by University of Oxford
Measures whether language models generate truthful answers to questions where humans are commonly mistaken. Covers health, law, finance, and politics topics where popular misconceptions and conspiracies create systematic failure modes.
MBPP
by Google Research
Mostly Basic Programming Problems — a collection of 974 crowd-sourced Python programming tasks with natural language descriptions and test cases. Tests foundational programming ability including string manipulation, list processing, and basic algorithms.
Flickr30k
by Young et al. / University of Illinois
Flickr30k is a benchmark for image-text retrieval and visual grounding, comprising 31,783 Flickr images each paired with five human-written captions. Models are evaluated on bidirectional image-to-text and text-to-image retrieval recall at ranks 1, 5, and 10.
Needle-in-a-Haystack
by Greg Kamradt (community)
Needle-in-a-Haystack is a pressure test for long-context language models that places a single fact (the needle) at a specific position within a long document (the haystack) and asks the model to retrieve it. It systematically varies both context length and needle depth to reveal performance degradation patterns.
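The construction is simple to reproduce: insert the needle at a fractional depth within filler text of the target length. A minimal builder sketch (filler and needle strings are illustrative):

```python
def build_haystack(filler: str, needle: str, depth: float, length: int) -> str:
    """Place the needle sentence at a fractional depth within a
    haystack of roughly `length` characters."""
    reps = filler * (length // len(filler) + 1)
    cut = int(length * depth)
    return reps[:cut] + " " + needle + " " + reps[cut:length]

ctx = build_haystack("The grass is green. ",
                     "The secret code is 7421.", depth=0.5, length=200)
print("7421" in ctx)  # → True
```

Sweeping `depth` from 0.0 to 1.0 and `length` up to the model's context limit produces the familiar retrieval-degradation heatmaps.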
VQA v2
by Georgia Tech / VT
Visual Question Answering benchmark requiring models to answer open-ended questions about images. Version 2 balances the dataset to reduce language biases, ensuring models must genuinely understand image content rather than relying on question-type priors.
BIG-Bench Hard
by Google DeepMind
Curated subset of 23 challenging BIG-Bench tasks where prior language models performed below average human raters. Specifically designed to test tasks that benefit significantly from chain-of-thought prompting and multi-step reasoning.
WinoGrande
by Allen AI
Large-scale dataset for commonsense coreference resolution inspired by Winograd schemas. Tests whether models can correctly resolve pronoun references based on world knowledge and commonsense reasoning in carefully constructed sentence pairs.
RealToxicityPrompts
by Gehman et al. / Allen Institute for AI
RealToxicityPrompts measures the propensity of language model generations to produce toxic content when conditioned on a diverse set of 100,000 naturally occurring prompts extracted from the web. It uses the Perspective API to score generated text on toxicity dimensions.
PubMedQA
by Jin et al. / Carnegie Mellon University
PubMedQA is a biomedical question-answering dataset sourced from PubMed abstracts. Models must answer yes/no/maybe questions about biomedical research findings, testing the ability to reason over scientific literature.
LegalBench
by Guha et al. / Stanford CodeX
LegalBench is a collaboratively built benchmark measuring the legal reasoning ability of large language models across 162 tasks spanning issue spotting, rule recall, rule application, and legal interpretation. It provides a comprehensive evaluation of whether models can reason like lawyers.
ScienceQA
by Lu et al. / UCLA
ScienceQA is a large-scale multimodal benchmark featuring 21,208 science questions for grades 3-12. It uniquely combines visual diagrams and textual contexts, requiring models to perform complex reasoning. Each question includes multiple-choice options, a detailed lecture, and a step-by-step explanation for the correct answer.
AlpacaEval
by Stanford
Automated evaluation framework comparing model outputs against a reference model on 805 instructions. Uses LLM judges to determine win rates, with length-controlled metrics to avoid rewarding verbosity over quality.
BioASQ
by Tsatsaronis et al. / BioASQ Challenge
BioASQ is a large-scale benchmark for biomedical semantic question answering. It challenges systems to perform document retrieval, concept mapping, and answer extraction from PubMed literature. The benchmark includes diverse question types like yes/no, factoid, list, and summary, with gold-standard answers curated by experts.
MMLU-Pro
by TIGER-Lab
MMLU-Pro is a challenging benchmark designed to evaluate the advanced reasoning and knowledge capabilities of frontier AI models. It enhances the original MMLU by introducing harder, professionally vetted questions, expanding answer choices from 4 to 10, and reducing sensitivity to prompt formatting for a more robust and discriminative assessment.
ToolBench
by Qin et al. / Tsinghua University
ToolBench evaluates LLMs on their ability to use real-world REST APIs to complete user instructions. It provides 16,000+ real APIs from RapidAPI Hub across 49 categories and 12,000+ instruction–API solution pairs, measuring whether models can plan and execute multi-step API call sequences.
MMMU
by CUHK / Waterloo
MMMU is a challenging multimodal benchmark designed to evaluate large models on expert-level tasks. It contains over 11,500 college-level problems spanning six core disciplines, requiring models to integrate deep subject knowledge with visual perception to answer multiple-choice questions with detailed reasoning.
DROP
by Allen AI
DROP (Discrete Reasoning Over Paragraphs) is a challenging benchmark designed to evaluate a model's numerical reasoning capabilities within textual contexts. It requires systems to read paragraphs and answer questions that involve discrete operations like addition, counting, sorting, or comparison. Unlike simpler QA datasets, DROP necessitates multi-step reasoning processes, pushing models beyond basic information retrieval.
ToxiGen
by Hartvigsen et al. / MIT
ToxiGen is a large-scale, machine-generated dataset for evaluating nuanced hate speech detection. It contains over 274,000 toxic and benign statements about 13 minority groups, designed to challenge models to identify implicit toxicity without relying on obvious slurs or surface-level cues.
BigCodeBench
by Zhuo et al. / BigCode / Hugging Face
BigCodeBench is a challenging benchmark for evaluating large language models on practical, function-level code generation tasks. It comprises 1,140 problems that require the use and integration of popular Python libraries like NumPy, Pandas, and Scikit-learn, moving beyond simple algorithmic puzzles to mirror real-world software development scenarios.
TyDi QA
by Clark et al. / Google Research
TyDi QA is a multilingual question-answering benchmark featuring 11 typologically diverse languages. Questions are written natively by speakers of each language, ensuring genuine linguistic challenges and avoiding translation artifacts. It is designed to evaluate reading comprehension across a wide range of language structures.
MedMCQA
by Pal et al. / IIT Kanpur
MedMCQA is a massive multiple-choice question dataset sourced from Indian medical entrance examinations like AIIMS and NEET-PG. It contains over 194,000 questions covering 2,400 healthcare topics, designed to rigorously test a model's breadth of medical knowledge and reasoning abilities across multiple subjects.
RULER
by Hsieh et al. / NVIDIA
RULER is a synthetic benchmark for evaluating large language models in long-context scenarios, scaling from 4K to 128K tokens. It assesses complex skills like multi-hop retrieval, aggregation, and coreference resolution, offering a more nuanced analysis than simple 'needle-in-a-haystack' tests.
AIME 2024
by MAA
A highly challenging benchmark for evaluating the mathematical reasoning of frontier AI models. It uses 30 problems from the 2024 American Invitational Mathematics Examination (AIME), which are designed to test creative problem-solving, multi-step deduction, and knowledge across number theory, geometry, algebra, and combinatorics.
BBQ (Bias Benchmark for QA)
by Parrish et al. / NYU
BBQ is a question-answering benchmark designed to expose social biases in language models. It uses ambiguous and disambiguated questions related to nine protected categories to measure a model's tendency to rely on harmful stereotypes when context is lacking versus its ability to answer correctly when enough information is provided.
LongBench
by Bai et al. / Tsinghua University
LongBench is a comprehensive bilingual benchmark designed to evaluate the long-context understanding capabilities of large language models in English and Chinese. It comprises 21 diverse tasks, including single and multi-document QA, summarization, and code completion, with an average context length of over 6,700 tokens to rigorously test model performance on extended inputs.
MusicCaps
by Agostinelli et al. / Google DeepMind
MusicCaps is a benchmark dataset of 5,521 music clips from AudioSet, each paired with a detailed text description written by professional musicians. It is primarily used for evaluating text-to-music generation models, as well as for music captioning, retrieval tasks, and fine-tuning audio-language models.
IFEval
by Google Research
Instruction-Following Evaluation benchmark testing models' ability to precisely follow verifiable formatting instructions. Includes constraints like word count limits, specific formatting requirements, keyword inclusion/exclusion, and structural rules that can be programmatically verified.
Chatbot Arena Hard
by LMSYS
Chatbot Arena Hard is a static benchmark composed of 500 challenging prompts curated from Chatbot Arena. It is designed to rigorously evaluate and differentiate the capabilities of large language models. The benchmark utilizes an automated judging system, typically employing a powerful model like GPT-4, to provide a quick, reproducible proxy for human preference.
HumanEval+
by Liu et al. / EvalPlus
HumanEval+ is a benchmark for rigorously evaluating code generation models. It augments the original HumanEval dataset by expanding the test suite for each of its 164 problems by 80x. This extensive testing helps uncover subtle bugs and failures on edge cases that simpler benchmarks miss, providing a more accurate measure of a model's true coding ability.
CyberSecEval
by Meta AI
CyberSecEval is a benchmark developed by Meta to assess the cybersecurity risks associated with Large Language Models (LLMs). It evaluates a model's propensity to generate insecure code, assist in exploiting vulnerabilities, and facilitate attacks, helping safety teams quantify the dual-use risk of code-capable models.
DocVQA
by CVC Barcelona
DocVQA is a large-scale dataset and benchmark for Visual Question Answering on document images. It challenges models to answer questions by reading and interpreting text, understanding layouts, and reasoning about information within complex documents like forms, invoices, and reports. It serves as a standard for evaluating document intelligence systems.
FinanceBench
by Islam et al. / Patronus AI
FinanceBench is a benchmark designed to evaluate the financial question-answering capabilities of Large Language Models. It uses publicly available corporate documents like 10-K filings and earnings reports to test models on information retrieval, numerical reasoning, and multi-step financial calculations, providing a standardized testbed for financial AI.
WebArena
by CMU
WebArena is a realistic and reproducible benchmark environment designed to evaluate autonomous language agents. It tests an agent's ability to perform complex, multi-step tasks across a diverse set of self-hosted websites, including e-commerce, forums, and content management systems, using real web interfaces.
XL-Sum
by Hasan et al. / University of Edinburgh
XL-Sum is a large-scale benchmark dataset for multilingual abstractive summarization. It contains 1.35 million article-summary pairs from BBC News across 44 languages, designed to evaluate a model's ability to generate concise summaries across diverse linguistic families and writing systems.
GAIA Benchmark
by Meta / Hugging Face
GAIA (General AI Assistants) is a benchmark for evaluating AI models on complex, real-world tasks. It features questions with unambiguous factual answers that require sophisticated capabilities like multi-step reasoning, web browsing, and tool use. GAIA is designed to test the practical limits of general-purpose AI assistants.
CrowS-Pairs
by Nangia et al. / NYU
CrowS-Pairs is a benchmark dataset for evaluating social bias in masked language models. It contains 1,508 sentence pairs with stereotypical and anti-stereotypical statements across nine bias types. The benchmark measures a model's preference for stereotypical completions using pseudo-log-likelihood scores.
MGSM
by Google Research
MGSM (Multilingual Grade School Math) is a benchmark for evaluating the mathematical reasoning of large language models across multiple languages. It consists of 250 grade-school math problems from the GSM8K dataset, professionally translated into ten typologically diverse languages, including low-resource ones like Swahili and Telugu.
AgentBoard
by Ma et al. / Shanghai AI Lab
AgentBoard is a comprehensive evaluation framework for Large Language Model (LLM) based agents. It assesses agent performance across nine diverse tasks, including embodied AI, gaming, web browsing, and tool use. The framework uniquely measures both final task success and partial progress through a fine-grained sub-goal metric.
ContractNLI
by Koreeda & Manning / Stanford NLP
ContractNLI is a dataset for natural language inference (NLI) focused on contract understanding. It challenges models to determine if a hypothesis about a contract is entailed, contradicted, or not mentioned by the contract text. This simulates real-world legal document review, testing a model's ability to reason over complex legal language.
SimpleQA
by OpenAI
SimpleQA is a benchmark dataset developed by OpenAI to assess the factual accuracy of language models. It consists of simple, unambiguous questions that have a single, verifiable correct answer. The benchmark is designed to measure a model's ability to recall factual knowledge and, crucially, to abstain from answering when it is uncertain, providing a measure of its calibration.
Humanity's Last Exam
by CAIS
Humanity's Last Exam is a crowdsourced benchmark designed to rigorously test the limits of advanced AI systems. It comprises extremely difficult questions contributed by domain experts across diverse fields like science, math, and philosophy, serving as a public evaluation for frontier model capabilities in complex reasoning and specialized knowledge.
WinoBias
by Zhao et al. / USC
WinoBias is a benchmark dataset designed to measure gender bias in coreference resolution systems. It consists of sentence pairs where pronouns refer to individuals in stereotyped or non-stereotyped occupations, allowing for the quantification of a model's reliance on gender stereotypes versus grammatical correctness.
InfiniteBench
by Zhang et al. / Peking University
InfiniteBench is a benchmark designed to evaluate the long-context capabilities of large language models. It features tasks that require processing and reasoning over inputs exceeding 100,000 tokens, including math, code debugging, and retrieval from novels, where crucial information is distributed across the entire context.
AgentBench
by Tsinghua University
Comprehensive benchmark evaluating LLM agents across 8 distinct environments including operating systems, databases, knowledge graphs, digital card games, lateral thinking puzzles, and web shopping. Tests generalization of agent capabilities across diverse interaction paradigms.
Minerva Math
by Google Research
Minerva Math is a quantitative reasoning benchmark designed to evaluate large language models on complex STEM problems. Sourced from web pages with LaTeX and arXiv preprints, it covers subjects like math, physics, and chemistry, requiring multi-step computation, symbolic manipulation, and deep scientific understanding to solve.
CaseHOLD
by Zheng et al. / Berkeley Law / LexGLUE
CaseHOLD is a legal NLP benchmark for evaluating a model's ability to identify the correct holding statement for a US court case. Given a citing context, the model must choose the correct holding from a list of candidates. Sourced from over 53,000 cases, it is a core component of the LexGLUE benchmark suite for legal AI.
API-Bank
by Li et al. / Wuhan University
API-Bank is a comprehensive benchmark for evaluating tool-augmented LLMs. It features 73 diverse APIs and assesses models on three levels: API retrieval, API calling, and complex planning. The benchmark measures both the correctness of tool selection and the accuracy of execution, providing a thorough test of an agent's capabilities.
MathVista
by UCLA
Mathematical reasoning benchmark requiring visual understanding of charts, plots, geometry diagrams, and infographics. Tests the intersection of visual perception and mathematical reasoning with 6,141 problems from 28 existing datasets and 3 newly collected ones.
Aider Polyglot
by Aider
Multi-language code editing benchmark testing models' ability to make targeted code changes across Python, JavaScript, TypeScript, Java, C++, and other languages. Evaluates real-world code modification tasks rather than generation from scratch.
MLAgentBench
by Huang et al. / Stanford
MLAgentBench challenges AI agents to perform machine learning research tasks autonomously — reading papers, writing code, running experiments, analyzing results, and improving models. It tests whether agents can replicate and build upon real ML research across 13 diverse ML tasks.
FrontierMath
by Epoch AI
Benchmark of original, research-level mathematics problems created by professional mathematicians. Tests capabilities at the frontier of mathematical reasoning including novel proofs, advanced computation, and multi-domain mathematical synthesis.
ClinicalCamel Benchmark
by Toma et al. / University of Toronto
ClinicalCamel Benchmark evaluates open-source language models on clinical dialogue and medical instruction-following tasks derived from physician–patient interactions. It focuses on safety, accuracy, and appropriateness of clinical advice generation.
Codeforces Benchmark
by Codeforces / Community
Evaluates models on competitive programming problems from the Codeforces platform across difficulty ratings. Tests algorithmic thinking, data structure knowledge, and the ability to produce correct and efficient solutions under competitive constraints.
TAU-bench
by Sierra AI
Tool-Agent-User benchmark evaluating AI agents on realistic customer service scenarios requiring multi-step tool use. Tests agents' ability to navigate complex workflows, use tools correctly, follow policies, and handle edge cases in airline and retail domains.
MLE-bench
by OpenAI
Benchmark evaluating AI agents on real Kaggle machine learning competitions. Tests the full ML engineering pipeline including data exploration, feature engineering, model selection, training, and submission formatting against actual competition leaderboards.
OSWorld
by University of Hong Kong
Benchmark for evaluating multimodal agents on real operating system tasks spanning Ubuntu, Windows, and macOS environments. Tests agents' ability to interact with desktop applications, file systems, terminals, and GUI elements to complete everyday computer tasks.
RealWorldQA
by xAI
Benchmark testing multimodal models on practical real-world visual understanding tasks. Features questions about real photographs requiring spatial reasoning, object recognition, scene understanding, and practical knowledge that goes beyond simple object detection.
EnergyBench
by Lannelongue et al. / EMBL-EBI
EnergyBench quantifies the energy consumption and carbon footprint of AI inference across hardware and software configurations. It correlates task accuracy with joules consumed, enabling practitioners to make informed accuracy-efficiency trade-offs for sustainable AI deployment.
GreenAI Benchmark
by Schwartz et al. / AI2 / University of Washington
GreenAI Benchmark evaluates the efficiency of AI training and inference by reporting accuracy alongside FLOPs, parameters, and CO2 emissions. It promotes the efficiency metric paradigm where reporting results without computational cost is considered incomplete science.
SWE-bench
by Princeton NLP
SWE-bench is a benchmark for evaluating AI systems' ability to resolve real GitHub issues from popular Python repositories. Each instance requires understanding a codebase, identifying the bug, and producing a correct patch. SWE-bench Verified is the curated subset accepted as the standard for coding agent evaluation by the AI industry.
MTEB
by Hugging Face / MTEB Team
MTEB (Massive Text Embedding Benchmark) is the standard benchmark for evaluating text embedding models across 8 task types (retrieval, clustering, classification, etc.) and 112 datasets. The MTEB leaderboard on Hugging Face is the primary reference for selecting embedding models and is updated continuously as new models are released.
MMLU
by UC Berkeley
MMLU (Massive Multitask Language Understanding) is a comprehensive benchmark covering 57 academic subjects from elementary to professional level, including STEM, law, medicine, and social sciences. It became the standard for measuring general knowledge breadth in LLMs and is included in virtually every model evaluation suite.
LiveBench
by LiveBench OSS
LiveBench is a contamination-resistant benchmark that continuously updates with new questions sourced from recent math competitions, research papers, and news. By using only data post-dating model training cutoffs, LiveBench mitigates benchmark saturation and provides more reliable capability assessments of frontier models.
HumanEval
by OpenAI
HumanEval is OpenAI's code generation benchmark consisting of 164 hand-written Python programming problems with unit tests. It measures a model's ability to generate syntactically correct and functionally complete code from docstring descriptions. HumanEval is the foundational coding benchmark that all subsequent code benchmarks build upon.
HELM
by Stanford CRFM
HELM (Holistic Evaluation of Language Models) from Stanford CRFM provides a multi-dimensional evaluation framework that measures LLMs across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. It evaluates models on 42 scenarios and 59 metrics, providing the most comprehensive public assessment of LLM capabilities and risks.
GPQA Diamond
by NYU / Cohere
GPQA Diamond (Graduate-Level Google-Proof Q&A) is a challenging multiple-choice benchmark requiring expert-level knowledge in biology, chemistry, and physics. Questions are designed to be answerable by domain PhD students but not by web search. GPQA Diamond is the standard for measuring frontier scientific reasoning capability.
Chatbot Arena
by LMSYS
Chatbot Arena is a crowdsourced human evaluation platform from LMSYS where users anonymously compare responses from two random LLMs and vote for the better one. The resulting Elo-based leaderboard (LMSYS Leaderboard) is widely regarded as the most reliable measure of real-world LLM preference across diverse user tasks.
ARC-AGI-2
by ARC Prize Foundation
ARC-AGI-2 is the second iteration of François Chollet's Abstraction and Reasoning Corpus benchmark, designed to measure fluid intelligence and generalization in AI systems. Tasks require identifying abstract visual patterns that cannot be solved by memorization, targeting a capability gap that separates current LLMs from human-level reasoning.
AIME 2025
by MAA / Community Eval
AIME (American Invitational Mathematics Examination) 2025 is used as a frontier math reasoning benchmark for LLMs. The competition-level math problems require multi-step reasoning without lookup, making AIME scores a direct indicator of a model's mathematical problem-solving depth. Frontier models are evaluated on the 2025 problem set to avoid training data contamination.
MATH-500
by
A 500-problem subset of the MATH benchmark spanning algebra through competition-level mathematics, widely used to evaluate step-by-step mathematical reasoning in frontier models.
Arena-Hard Auto
by
Automated benchmark derived from Chatbot Arena for evaluating instruction-following and open-ended generation.
AI2 Reasoning Challenge (ARC)
by Allen Institute for AI (AI2)
The AI2 Reasoning Challenge (ARC) is a question-answering dataset designed to encourage research in advanced question-answering. It consists of grade-school science questions specifically crafted to require reasoning beyond simple fact retrieval, posing a significant challenge for AI models.
ImageNet-1K
by ImageNet / Stanford Vision Lab
The canonical large-scale visual recognition benchmark containing 1.28 million training images across 1,000 object categories. ImageNet-1K underpins the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and has driven the majority of deep learning breakthroughs in computer vision since 2012.
COCO 2017
by Microsoft
Microsoft COCO (Common Objects in Context) 2017 provides 118K training images with 860K object instances annotated with bounding boxes, segmentation masks, keypoints, and captions across 80 object categories. It remains the primary benchmark for object detection and instance segmentation research.
Protein Data Bank
by RCSB PDB / wwPDB Consortium
The RCSB Protein Data Bank (PDB) is the single worldwide archive of experimentally determined 3D structures of proteins, nucleic acids, and complex assemblies, currently containing over 220,000 biological macromolecular structures determined by X-ray crystallography, NMR, and cryo-EM. It is the foundational structural dataset for computational biology and was used to train and validate AlphaFold2 and other structure-prediction models.
UniProt
by UniProt Consortium (EMBL-EBI / SIB / PIR)
UniProt (Universal Protein Resource) is the world's comprehensive, freely accessible protein sequence and functional information database, maintained by a consortium of EMBL-EBI, SIB, and PIR. It contains over 250 million protein sequences in UniParc, with 570,000+ manually reviewed entries in SwissProt providing expert-curated functional annotations, and serves as the gold-standard training source for protein language models.
MMLU Dataset
by UC Berkeley
Massive Multitask Language Understanding (MMLU) is a benchmark covering 57 academic subjects from STEM to the humanities, with 14,000+ multiple-choice questions at high-school, undergraduate, and professional levels. It has become the de facto standard for measuring broad world knowledge and academic reasoning in LLMs.
Wikipedia (Processed)
by Wikimedia Foundation / Hugging Face
The processed Wikipedia dataset is a cleaned and tokenized version of Wikipedia dumps covering 20+ languages, available via Hugging Face Datasets. With HTML stripped and paragraph structure preserved, it is one of the most universally used pretraining corpora and a standard knowledge-grounding source for retrieval-augmented generation (RAG) baselines and open-domain QA systems.
Wikipedia Dump
by Wikimedia Foundation
The full text dump of Wikipedia articles available in over 300 languages, regularly updated and distributed by the Wikimedia Foundation. It is one of the most universally included components in language model pretraining pipelines due to its high factual density, editorial quality, and broad topical coverage.
LibriSpeech
by OpenSLR / Johns Hopkins University
LibriSpeech is a corpus of approximately 1,000 hours of 16kHz read English speech derived from LibriVox audiobooks, with "clean" training subsets of 100 and 360 hours, a 500-hour "other" training subset, and dedicated development and test sets. It has become the de facto standard benchmark for English ASR systems.
GSM8K Dataset
by OpenAI
Grade School Math 8K is a dataset of 8,500 high-quality linguistically diverse grade school math word problems requiring 2-8 step reasoning. Created by OpenAI, GSM8K is widely used for evaluating multi-step arithmetic reasoning and the effectiveness of chain-of-thought prompting.
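GSM8K's reference solutions conventionally end with a `#### <answer>` line, so exact-match evaluation usually reduces to pulling out that final number. A minimal sketch (the function name is illustrative):

```python
import re

def extract_gsm8k_answer(solution):
    """GSM8K solutions end with a line like '#### 72'; extract that
    final number (commas stripped) for exact-match scoring."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", solution)
    return m.group(1).replace(",", "") if m else None

sample = "Natalia sold 48 clips, then half as many.\n48/2 = 24\n48+24 = 72\n#### 72"
# extract_gsm8k_answer(sample) returns "72"
```

A model's generated answer is normalized the same way before comparing against the extracted reference.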
PubChem
by NCBI / NIH
PubChem is the world's largest open chemical database maintained by the NCBI, containing information on over 115 million compounds, 295 million substances, and 270 million bioactivity outcomes from more than 1.2 million assays. It provides standardized molecular structures, properties, and biological activity data freely accessible via REST API and bulk download, making it the canonical resource for cheminformatics and drug discovery research.
GENIE Benchmark
by Stanford University
The GENIE Benchmark is a comprehensive dataset for evaluating the performance of text-to-SQL models. It includes a diverse set of SQL queries and corresponding natural language questions across multiple domains, designed to assess the generalization capabilities of these models.
HumanEval Dataset
by OpenAI
A curated set of 164 handwritten Python programming problems released by OpenAI, each consisting of a function signature, docstring, reference solution, and unit tests. HumanEval introduced the pass@k metric for functional code correctness evaluation and has become the de facto standard benchmark reported in virtually every code generation model paper.
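The pass@k metric is typically computed with the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), the chance
    that at least one of k samples drawn from n generations
    (c of them correct) passes the unit tests."""
    if n - c < k:  # fewer than k incorrect samples: a draw must include a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 generations, 50 passing, k=1 → 50/200 = 0.25
```

Averaging this quantity over all 164 problems yields the pass@k figure reported in model papers.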
MIMIC-IV
by MIT Laboratory for Computational Physiology / Beth Israel Deaconess Medical Center
MIMIC-IV (Medical Information Mart for Intensive Care) is a comprehensive de-identified electronic health record database covering over 300,000 patients admitted to Beth Israel Deaconess Medical Center's ICU between 2008 and 2019. It contains detailed clinical data including diagnoses, procedures, medications, laboratory values, and waveforms, enabling a wide range of clinical AI research.
MATH Dataset
by UC Berkeley
A challenging benchmark of 12,500 competition mathematics problems from AMC, AIME, and similar competitions across 5 difficulty levels and 7 subjects. Each problem includes a full step-by-step solution in LaTeX, making it suitable for both evaluation and training of mathematical reasoning.
SA-1B (Segment Anything)
by Meta AI
SA-1B is Meta AI's massive segmentation dataset released alongside the Segment Anything Model (SAM), containing over 1 billion high-quality segmentation masks across 11 million diverse, high-resolution images. It is the largest segmentation dataset ever created and enables training of generalist vision models with strong zero-shot transfer capabilities.
HellaSwag Dataset
by University of Washington
HellaSwag is an adversarially filtered commonsense NLI benchmark where models must pick the most plausible sentence completion from 4 options. Humans score 95%+ while early LLMs struggled below 50%, making it a robust test of grounded language understanding and commonsense reasoning.
Common Crawl
by Common Crawl Foundation
The world's largest open repository of web crawl data, maintained by the non-profit Common Crawl Foundation and updated with new crawls monthly since 2011. It forms the foundational raw data layer for virtually every major language model pretraining pipeline including GPT-3, LLaMA, PaLM, and Falcon, typically after quality filtering and deduplication steps.
ARC Dataset
by Allen Institute for AI
The AI2 Reasoning Challenge (ARC) dataset contains 7,787 grade 3–9 science exam questions split into Easy and Challenge partitions. The Challenge set contains questions that require deeper reasoning and world knowledge, making it a reliable signal for advanced language understanding.
Open Images V7
by Google
Google's Open Images V7 is one of the largest existing datasets with object-level annotations, containing approximately 9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives across 600+ object classes.
TruthfulQA Dataset
by University of Oxford
TruthfulQA measures the truthfulness of LLMs across 817 adversarially crafted questions spanning 38 categories where humans are commonly misled by false beliefs. Models are scored on generating truthful AND informative answers, revealing how larger models can paradoxically become more confidently wrong.
Stack Exchange Dump
by Stack Exchange
The Stack Exchange Data Dump is a quarterly XML export of all public questions, answers, comments, and votes across the entire Stack Exchange network of 170+ Q&A communities including Stack Overflow. Containing hundreds of millions of high-quality technical and domain-specific Q&A pairs, it is a critical pretraining source for code and reasoning capabilities and a standard retrieval benchmark for dense passage retrieval.
SuperGLUE
by New York University
SuperGLUE is a benchmark suite of 8 challenging NLU tasks including question answering, coreference resolution, causal reasoning, and word-sense disambiguation, designed as a harder successor to GLUE. It includes human baselines and has driven significant progress in pre-trained language model capabilities.
LAION-5B
by LAION
The largest openly available image-text pair dataset, containing 5.85 billion CLIP-filtered image-text pairs across English, multilingual, and aesthetic subsets. LAION-5B was the primary training corpus for Stable Diffusion, DALL-E 2 replications, and numerous open vision-language models, enabling the open-source community to train competitive text-to-image generation models.
ADE20K Dataset
by MIT CSAIL
ADE20K is a densely annotated semantic segmentation dataset containing over 27,000 images with pixel-level annotations for 150 semantic categories covering both indoor and outdoor scenes. It is the primary benchmark for scene parsing and semantic segmentation tasks in the computer vision community.
WinoGrande Dataset
by Allen Institute for AI
WinoGrande is a large-scale crowdsourced dataset of 44,000 Winograd-style fill-in-the-blank commonsense problems, debiased using the AFLITE algorithm to minimize spurious statistical cues. It is significantly harder than the original Winograd Schema Challenge for contemporary NLP models.
AudioSet
by Google
Google's AudioSet is a large-scale dataset of manually annotated audio events comprising over 2 million 10-second YouTube clips labeled with a hierarchical ontology of 632 audio event classes. It is the primary benchmark for audio tagging and sound event detection, spanning music, speech, and environmental sounds.
CheXpert
by Stanford ML Group
CheXpert is a large chest X-ray dataset from Stanford containing 224,316 chest radiographs from 65,240 patients with labels for 14 observations mined from radiology reports using an automated labeler. It uniquely addresses label uncertainty with positive, negative, and uncertain labels, making it a challenging and realistic benchmark for automated chest X-ray interpretation.
PubMed Central OA
by National Institutes of Health / National Library of Medicine
PubMed Central Open Access (PMC OA) is the subset of the PMC literature archive made freely available for text mining and NLP research, containing over 4 million full-text biomedical and life science articles. It is a primary corpus for pretraining biomedical language models such as BioBERT, PubMedBERT, and BioGPT.
VoxCeleb2
by Oxford Visual Geometry Group (VGG)
VoxCeleb2 is a large-scale speaker recognition dataset containing over 1 million utterances from 6,112 celebrities extracted from YouTube videos in challenging real-world conditions. It is the standard benchmark for speaker verification and diarization research, providing naturalistic conversational speech at scale.
FLORES-200 Dataset
by Meta AI
FLORES-200 is Meta's multilingual translation evaluation benchmark spanning 200 languages, including many low-resource and endangered ones. Each language contains the same set of sentences translated from English Wikipedia, released in public dev and devtest splits of roughly 1,000 sentences each for systematic MT evaluation at scale.
Alpaca Dataset
by Stanford University
A dataset of 52,000 instruction-following examples generated by applying the self-instruct technique to GPT-3.5 (text-davinci-003). This foundational dataset enabled the creation of the Alpaca 7B model and popularized cost-effective instruction-tuning approaches.
Common Voice 15
by Mozilla
Mozilla's Common Voice 15.0 is the world's largest publicly available multilingual speech corpus, containing over 30,000 hours of validated speech data across 114 languages, all contributed and validated by volunteers. It enables training and evaluation of multilingual and low-resource speech recognition systems.
SEC-EDGAR Filings
by U.S. Securities and Exchange Commission
The SEC-EDGAR Filings dataset encompasses over 20 million full-text regulatory filings submitted to the US Securities and Exchange Commission since 1993, including 10-K annual reports, 10-Q quarterly reports, 8-K current reports, and proxy statements from all US public companies. It is the foundational corpus for financial NLP research, sentiment analysis, and financial document AI.
MBPP (Mostly Basic Python Problems)
by Google
A dataset of 974 crowd-sourced Python programming problems suitable for entry-level programmers, each with a problem description, code solution, and three automated test cases. MBPP complements HumanEval by covering a broader variety of programming concepts and is widely used alongside it for comprehensive evaluation of code generation capabilities across model families.
The Stack v2
by BigCode
An expanded code pretraining dataset containing 3 trillion tokens of source code in 619 programming languages, curated by BigCode from GitHub repositories with permissive SPDX licenses. Version 2 triples the size of the original Stack and includes improved deduplication, opt-out mechanisms for authors, and structured data from GitHub issues and pull requests alongside raw source files.
CelebA-HQ
by NVIDIA / CUHK
CelebA-HQ is a high-quality version of the CelebA face dataset containing 30,000 celebrity images at 1024×1024 resolution with 40 binary attribute annotations. It was introduced alongside Progressive GAN and has become the standard benchmark for high-fidelity face generation and synthesis research.
ArXiv Papers Dataset
by Cornell University / arXiv
The ArXiv Papers Dataset is a bulk export of over 2.3 million scientific preprints from arXiv spanning physics, mathematics, computer science, biology, finance, and economics, provided by Cornell University and hosted on Kaggle and AWS S3. The full-text LaTeX source and parsed metadata make it a primary pretraining corpus for scientific language models and citation-network research.
OpenAssistant Conversations
by LAION
A large-scale, human-annotated dataset of assistant-style conversations collected through the OpenAssistant crowdsourcing platform. Contains over 161,000 messages across 66,000+ conversation trees, with ranked responses for RLHF training.
mC4
by Google
The multilingual Colossal Clean Crawled Corpus (mC4) spans 101 languages and contains hundreds of billions of tokens scraped from Common Crawl with language detection and heuristic quality filters. It was used to train mT5 and is one of the largest publicly available multilingual pre-training corpora.
Semantic Scholar ORC
by Allen Institute for AI (AI2)
The Semantic Scholar Open Research Corpus (S2ORC) is a large English-language corpus of 136 million academic papers with structured metadata, abstracts, citation graphs, and full-text body paragraphs where licensing allows. Maintained by the Allen Institute for AI, it covers 19 scientific fields and is widely used for scientific NLP tasks including citation prediction, claim verification, and scientific QA.
BookCorpus
by University of Toronto
A dataset of over 11,000 unpublished books spanning fiction and non-fiction genres, originally scraped from Smashwords and used as the primary pretraining corpus for BERT alongside Wikipedia. It provides rich long-range dependency data that helps models learn coherent narrative structure and extended discourse patterns.
Places365
by MIT CSAIL
Places365 is a scene-centric database with 1.8 million training images across 365 scene categories, designed to train and evaluate scene recognition models. The dataset enables models to understand the semantic meaning of places and environments, making it ideal for applications in autonomous driving, robotics, and image retrieval.
CodeSearchNet
by GitHub / Microsoft Research
A dataset and benchmark challenge for code retrieval and search containing 2 million (code, documentation) pairs in six programming languages — Python, Java, JavaScript, PHP, Ruby, and Go — curated by GitHub and Microsoft Research. It is the canonical benchmark for code-to-natural-language and natural-language-to-code retrieval tasks and is widely used to evaluate code embedding models.
APPS (Automated Programming Progress Standard)
by UC Berkeley
A benchmark of 10,000 programming problems at introductory, interview, and competitive programming difficulty levels, each with problem statements, test cases, and human-written solutions. APPS is the standard dataset for evaluating code generation models on realistic programming tasks ranging from simple loops to complex algorithmic challenges drawn from competitive programming platforms.
UltraFeedback
by Tsinghua University
A large-scale, high-quality preference dataset with 64,000 instructions each answered by 4 LLMs and rated by GPT-4 on instruction-following, truthfulness, honesty, and helpfulness. UltraFeedback is the backbone of the Zephyr and Tulu 2 DPO models.
Financial PhraseBank
by Pekka Malo et al. / Aalto University
Financial PhraseBank is a sentiment analysis dataset containing 4,845 sentences from English-language financial news annotated by 16 financial domain experts with positive, negative, or neutral sentiment labels. It is the most widely used benchmark for financial sentiment analysis and has been used to fine-tune FinBERT and numerous other financial NLP models.
Self-Instruct
by University of Washington
Self-Instruct is the foundational instruction-tuning dataset and methodology introduced by Wang et al. (2022), where 175 human-written seed tasks are iteratively expanded into 52,000 instruction-input-output triplets using GPT-3 as the generator. It established the paradigm of bootstrapping instruction data from existing LLMs and directly inspired Alpaca, WizardLM, and most subsequent synthetic alignment datasets.
StarCoderData
by BigCode
The roughly 250-billion-token (about 800 GB) code dataset used to pretrain the StarCoder family of models, assembled by BigCode from The Stack v1 and spanning 86 programming languages with permissive licenses. It includes GitHub issues, Git commits, and Jupyter notebook data alongside source files, enabling models to learn from developer workflows and not just static code.
DM Mathematics
by Google DeepMind
DeepMind Mathematics (DM Mathematics) is a dataset of 2 million mathematical question-answer pairs covering algebra, arithmetic, calculus, comparisons, measurement, numbers, polynomials, and probability, procedurally generated to test mathematical reasoning capabilities of language models. The symbolic and step-structured nature of the dataset makes it a standard benchmark for evaluating compositional generalization and multi-step arithmetic reasoning.
LIMA
by Meta AI
LIMA (Less Is More for Alignment) is a carefully curated dataset of 1,000 high-quality instruction-response pairs demonstrating that alignment quality matters more than quantity. Sourced from Stack Exchange, wikiHow, and manually written prompts, it produced models whose responses human raters judged equivalent or preferable to GPT-4's in a substantial fraction of cases.
OpenHermes 2.5
by Nous Research
A large curated synthetic instruction dataset with ~1 million entries sourced from multiple high-quality open datasets including Airoboros, Camel, GPT4-LLM, and others. OpenHermes 2.5 powers the Nous Hermes model family and is widely regarded as one of the best open instruction datasets.
OASST2
by LAION / OpenAssistant
OpenAssistant Conversations 2 (OASST2) is a crowd-sourced human-annotated dataset of 100,000+ assistant-style conversations in 35 languages, where human contributors created and ranked message trees to produce preference labels for RLHF training. It is the largest open multilingual human-feedback dataset and is widely used for training preference models and reward functions in open-source alignment pipelines.
NLLB Training Data
by Meta AI
The No Language Left Behind (NLLB) training corpus released by Meta AI contains high-quality parallel data across 200+ language pairs, including newly mined bitext for dozens of low-resource languages. It was used to train the NLLB-200 model achieving state-of-the-art translation on low-resource language pairs.
ShareGPT
by Community
A community-collected dataset of real ChatGPT and GPT-4 conversation logs shared by users, covering a broad range of tasks and domains. Available in multiple filtered and cleaned versions including ShareGPT52K and ShareGPT90K used by Vicuna and other open models.
LSUN
by Princeton / Columbia University
The Large-Scale Scene Understanding (LSUN) dataset is a massive collection of nearly one million labeled images for each of 10 scene and 20 object categories. It is a key benchmark for advancing research in scene understanding, particularly for generative modeling, classification, and reconstruction tasks.
Dolly-15K
by Databricks
Dolly-15K is a high-quality, open-source dataset of 15,000 instruction-following records written by Databricks employees. It is designed for fine-tuning large language models to exhibit ChatGPT-style instruction-following capabilities using a relatively small, targeted dataset.
OPUS-100
by University of Helsinki
OPUS-100 is a large-scale multilingual parallel corpus for machine translation, featuring 100 languages pivoted through English. Sampled from the OPUS collection, it provides up to 1 million sentence pairs per language pair, making it a standard benchmark for training and evaluating multilingual models.
Phi-1 TextBooks
by Microsoft
Phi-1 TextBooks is a synthetic dataset of Python coding textbooks and exercises generated by GPT-3.5 and GPT-4. It was created to pretrain Microsoft's Phi-1 small language model, demonstrating that high-quality, curriculum-style data can significantly boost the coding abilities of smaller models compared to training on general web data.
GigaSpeech
by Seasalt.ai / SpeechColab
GigaSpeech is a multi-domain English speech corpus with 10,000 hours of high-quality labeled audio for ASR, sourced from audiobooks, podcasts, and YouTube across a broad range of topics and recording conditions. Its scale and diversity make it particularly valuable for training robust, domain-generalizable speech recognition models.
WizardLM Evol-Instruct
by Microsoft Research
WizardLM Evol-Instruct is a synthetic dataset created by Microsoft Research for fine-tuning large language models. It uses an LLM-based evolutionary process to iteratively rewrite and complicate a seed set of instructions, progressively increasing their complexity and diversity. The dataset is designed to enhance a model's ability to follow intricate, multi-step commands across various domains like coding, math, and reasoning.
TyDi QA Dataset
by Google Research
TyDi QA is a benchmark for question answering across 11 typologically diverse languages. It features information-seeking questions written by native speakers who have not seen the answer, ensuring real-world applicability. This design challenges models to generalize beyond high-resource, typologically similar languages.
DataComp-1B
by DataComp Consortium
A curated 1.4 billion image-text pair dataset produced through the DataComp benchmark competition, which challenged participants to filter a 12.8 billion pair candidate pool to produce the best downstream CLIP model. DataComp-1B represents the winning filtering strategy and achieves state-of-the-art zero-shot classification performance among datasets of its size.
OpenWebText
by Aaron Gokaslan & Vanya Cohen
OpenWebText is a large-scale, open-source English text corpus created by scraping web pages linked from Reddit. Designed as a public replication of OpenAI's original WebText dataset used for GPT-2, it contains approximately 38 GB of text filtered by Reddit upvotes to ensure a baseline of quality and relevance.
LAION-400M Text Captions
by LAION
The text caption component of the LAION-400M dataset, offering 400 million English alt-text captions. These captions were scraped from the web and filtered using CLIP to ensure a minimum similarity to their corresponding images. The text is used independently for large-scale NLP and multimodal research.
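The CLIP filtering step amounts to keeping only pairs whose image and text embeddings exceed a cosine-similarity cutoff (LAION-400M used 0.3). A toy sketch with plain-list "embeddings" (the vectors and function names here are illustrative, not LAION's actual pipeline code):

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(t * t for t in x) ** 0.5
    return dot / (norm(u) * norm(v))

def clip_filter(pairs, threshold=0.3):
    """Keep captions whose (image_emb, text_emb) cosine similarity
    clears the threshold, mimicking LAION's CLIP-score filter."""
    return [caption for img_emb, txt_emb, caption in pairs
            if cosine(img_emb, txt_emb) >= threshold]
```

In the real pipeline the embeddings come from a pretrained CLIP image and text encoder; pairs below the cutoff are treated as mismatched alt-text and discarded.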
BioASQ Dataset
by BioASQ Consortium
The BioASQ dataset is a benchmark for biomedical semantic indexing and question answering. It contains thousands of expert-annotated questions (factoid, list, yes/no, summary) paired with relevant PubMed articles, concepts, and ideal answers, designed to train and evaluate advanced NLP systems in the medical domain.
Open X-Embodiment
by Google DeepMind / Consortium
Open X-Embodiment (OXE) is a massive robotics dataset combining over 1 million demonstration episodes from 22 distinct robot embodiments. It covers 527 skills and is designed to train generalist robot policies that can transfer skills across diverse hardware, serving as a key resource for vision-language-action models.
Legal-BERT Training Data
by Ilias Chalkidis et al. / Athens University of Economics and Business (AUEB)
The Legal-BERT training corpus is a large collection of English legal text assembled from UK legislation, EU legislation, ECHR/ECLI court decisions, and US contracts specifically curated to pretrain domain-adapted BERT models. It has enabled a family of Legal-BERT models that significantly outperform general-domain language models on legal NLP tasks.
GenLaw: A Legal Reasoning Dataset
by Stanford Center for Legal Informatics
GenLaw is a comprehensive dataset designed for evaluating legal reasoning capabilities of large language models. It contains a diverse set of legal questions, case summaries, and relevant statutes, enabling researchers to assess a model's ability to understand and apply legal principles.
SlimPajama
by Cerebras
SlimPajama is a cleaned and deduplicated version of the RedPajama dataset, containing 627 billion high-quality tokens. Produced by Cerebras, it demonstrates that training on fewer, higher-quality tokens can match or exceed the performance of models trained on larger, noisier datasets.
EU Court Decisions
by European Court of Human Rights / CJEU
The EU Court Decisions dataset aggregates judgments from the European Court of Human Rights (ECHR) and the Court of Justice of the European Union (CJEU), covering tens of thousands of decisions in multiple EU languages with structured metadata. It is widely used for multilingual legal NLP research, legal judgment prediction, and cross-lingual information retrieval.
Evol-CodeAlpaca
by Microsoft Research
Evol-CodeAlpaca is a dataset of 110,000 instruction-solution pairs for code generation, created by applying the EvolInstruct method to Code Alpaca seeds. Using GPT-4, it progressively increases the complexity and diversity of programming problems, serving as the primary training data for the WizardCoder models.
ShareGPT4V
by Shanghai AI Lab
ShareGPT4V is a large-scale, high-quality dataset containing 100,000 image-text pairs generated by GPT-4V. It is specifically designed for the instruction-tuning of open-source large vision-language models (LVLMs). The dataset's detailed captions and conversational QA pairs significantly enhance a model's ability to perform complex scene understanding, OCR, and visual reasoning.
FinQA Dataset
by Zhiyu Chen et al. / University of California Santa Barbara
FinQA is a large-scale dataset for numerical reasoning over financial data, containing over 8,000 question-answer pairs from S&P 500 earnings reports. Each question requires multi-step reasoning across both unstructured text and structured tables, making it a challenging benchmark for financial AI systems.
Cosmopedia
by Hugging Face
Cosmopedia is a massive synthetic dataset containing 30 million documents styled as textbooks, blog posts, and articles. Generated by Mixtral-8x7B-Instruct, it provides a vast corpus of high-quality synthetic English educational content designed for pretraining large language models at scale.
CC12M (Conceptual 12M)
by Google
CC12M is a large-scale dataset by Google containing 12 million image-text pairs from the web. It was created with a less restrictive filtering process than its predecessor, CC3M, to achieve greater scale and diversity, making it a widely used resource for pretraining CLIP-style vision-language models.
XL-Sum Dataset
by BUET (Bangladesh University of Engineering and Technology)
XL-Sum is a massive multilingual dataset for abstractive summarization. It consists of over 1 million article-summary pairs scraped from BBC News, covering 44 different languages. This diversity makes it a crucial resource for developing and evaluating cross-lingual and multilingual summarization models.
CaseText Corpus
by Casetext (acquired by Thomson Reuters)
The CaseText Corpus is a large-scale dataset of US federal and state court decisions. It includes full text, structured metadata, and citation networks, designed for legal research and the development of AI applications like legal language models and case retrieval systems, spanning decades of US jurisprudence.
RLBench
by Dyson Robotics Lab / Imperial College London
RLBench is a large-scale robot learning benchmark and dataset built on the CoppeliaSim simulator, providing 100 unique manipulation tasks with demonstrations, observations, and reward functions. It offers RGB, depth, and point-cloud observations for a Franka Panda arm across diverse household tasks, widely used for evaluating imitation learning, reinforcement learning, and multi-task robot policies.
OpenMathInstruct
by NVIDIA
OpenMathInstruct is a large-scale, synthetic dataset by NVIDIA featuring 1.8M+ math problem-solution pairs. Generated by Mixtral models and verified for correctness, it provides reliable, step-by-step reasoning chains for training and fine-tuning language models on diverse mathematical topics, from arithmetic to competition math.
GitHub Code Dataset
by Hugging Face / BigCode
The GitHub Code Dataset is a massive, multilingual collection of source code from public GitHub repositories, spanning 32 programming languages. Distributed via Hugging Face under the BigCode project, it provides a foundational resource for pretraining large language models on diverse code-related tasks, from generation to analysis.
CC-News
by Common Crawl Foundation
CC-News is a large-scale dataset of over 700,000 English news articles from the Common Crawl archive, collected between 2016 and 2019. It serves as a key pretraining corpus, notably for the RoBERTa model, providing a rich source of journalistic text for developing models that understand news language and current events.
MusicNet
by University of Washington
MusicNet is a collection of 330 freely licensed classical music recordings with over 1 million annotated labels indicating the precise timing and identity of every musical note in each recording. It supports supervised learning for music transcription, instrument recognition, and music information retrieval tasks.
CulturaX
by University of Oregon
CulturaX is a massive, cleaned multilingual text corpus containing 6.3 trillion tokens across 167 languages. It was created by combining, deduplicating, and filtering the mC4 and OSCAR datasets using language model-based quality scoring. This makes it one of the largest and cleanest public datasets for pre-training large language models.
Tulu V2 Mix
by Allen Institute for AI (AI2)
Tulu V2 Mix is a curated 326,000-sample mixture of instruction-tuning datasets from AI2. It blends diverse sources like FLAN, Open Assistant, and Code Alpaca to train the Tulu 2 model family. The dataset serves as a benchmark for analyzing the impact of different data sources on model performance and quality.
MedNLI
by University of Massachusetts / Partners Healthcare
MedNLI is a benchmark dataset for Natural Language Inference (NLI) in the clinical domain. Derived from the MIMIC-III database, it contains over 14,000 sentence pairs from clinical notes, each annotated by a clinician as representing entailment, contradiction, or a neutral relationship, enabling the evaluation of clinical text reasoning.
WebVid-10M
by University of Oxford
WebVid-10M is a massive dataset containing over 10 million video clips paired with descriptive text captions. Scraped from stock video websites, it serves as a foundational pretraining corpus for state-of-the-art video-language models, facilitating research in video understanding, retrieval, and generation.
PushShift Reddit Dataset
by PushShift.io
A massive, multi-billion token archive of Reddit comments and submissions from 2005 to 2023, collected by the PushShift project. This dataset is a cornerstone for social NLP research, large-scale language model pre-training, and studying the dynamics of online communities and conversational discourse.
Nectar
by UC Berkeley
Nectar is a large-scale, high-quality preference dataset from Berkeley AI Research (BAIR). It contains 183,000 prompts, each with seven ranked responses from diverse models like GPT-4, ChatGPT, and open-source LLMs. It is designed for training robust reward models for RLHF and DPO.
RoboNet
by Berkeley AI Research (BAIR)
RoboNet is a large-scale dataset for robot learning, featuring 15 million video frames from diverse robot arms across multiple labs. It is designed to train and benchmark self-supervised visual models, aiming to achieve generalization across different robot morphologies and workspaces without task-specific labels.
Orca DPO Pairs
by Intel Labs / Community
Orca DPO Pairs is a synthetic dataset containing 12,000 instruction-following examples. Each example includes a prompt, a high-quality response from GPT-4 (chosen), and a lower-quality response from GPT-3.5 (rejected). It is designed for efficiently aligning language models using Direct Preference Optimization (DPO) without a reward model.
CALVIN
by Albert-Ludwigs-Universität Freiburg
CALVIN is a large-scale dataset and benchmark for long-horizon, language-conditioned robot manipulation. It features over 24 hours of teleoperated demonstration data in a tabletop environment, encompassing 34 distinct skills that can be composed to solve complex, multi-step tasks from natural language instructions.
Deita 6K
by HKUST / Community
Deita 6K is an ultra-compact, high-quality instruction-tuning dataset of 6,000 carefully selected samples produced by the Data-Efficient Instruction Tuning for Alignment (DEITA) framework, which scores and filters instruction data by complexity and quality using LLM judges. Despite its small size, models trained on Deita 6K match or outperform those trained on datasets 10-100x larger, demonstrating the power of principled data selection over scale.
CAMEL-AI Datasets
by CAMEL-AI
The CAMEL-AI Datasets are a collection of synthetic multi-agent conversation datasets generated through the Communicative Agents framework, where AI assistants and user agents collaborate via role-playing to solve tasks. The collection covers coding, math, science, and open-ended reasoning domains, providing diverse instruction-following dialogues useful for SFT and alignment research.
CodeParrot GitHub Code
by Hugging Face
A 50 GB dataset of Python code scraped from GitHub, originally created to train the CodeParrot model as a demonstration of code-focused language model pretraining. It filters repositories for Python files only and applies basic deduplication, making it a lightweight starting point for Python-specific code generation research and experimentation.
Capybara
by Argilla / LDJnr
Capybara is a high-quality instruction-tuning dataset of 15,000 diverse, long-form single- and multi-turn conversations synthesized to cover a wide range of topics and response styles. It emphasizes narrative quality and conceptual depth over simple factual responses, making it particularly effective for improving chat model coherence, fluency, and reasoning on open-ended tasks.
Genstruct
by NousResearch
Genstruct is a synthetic instruction dataset generated by the Genstruct-7B model, which converts raw documents into structured instruction-response pairs. Unlike typical self-instruct approaches, Genstruct grounds every instruction in a source document, ensuring factual consistency and enabling controllable synthetic data generation from any text corpus.
UltraChat
by Tsinghua University
1.5M high-quality multi-turn dialogue dataset for instruction fine-tuning.
The Pile
by EleutherAI
825GB diverse English pretraining corpus from 22 high-quality data sources.
SWE-bench
by Princeton NLP
2.3K real GitHub issues requiring AI agents to write and verify code fixes.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
by Google AI
Introduced BERT, a bidirectional Transformer pre-trained on masked language modeling and next sentence prediction. Established the pretrain-then-fine-tune paradigm that dominated NLP for years and achieved state-of-the-art on 11 NLP benchmarks.
Learning Transferable Visual Models From Natural Language Supervision (CLIP)
by OpenAI
Introduced CLIP (Contrastive Language-Image Pre-training), a model trained on 400 million image-text pairs using contrastive learning that achieves remarkable zero-shot transfer to diverse vision tasks. CLIP became foundational for vision-language alignment and generative AI pipelines.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
by Google Brain
Introduced chain-of-thought prompting, a simple technique of providing exemplars with step-by-step reasoning traces in few-shot prompts. This approach dramatically improves LLM performance on arithmetic, commonsense, and symbolic reasoning tasks, with the effect emerging at approximately 100B parameters.
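In practice the technique is nothing more than prepending worked examples to the prompt. The sketch below paraphrases the paper's canonical arithmetic exemplar; the surrounding variable names are illustrative:

```python
# Chain-of-thought few-shot prompting: the exemplar's answer includes an
# explicit reasoning trace, which the model imitates for the new question.
# (Exemplar paraphrased from the paper's well-known arithmetic example.)
exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
)
question = (
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"
)
prompt = exemplar + question  # send `prompt` to any few-shot-capable LLM
```

Without the reasoning trace in the exemplar, the same prompt reduces to ordinary few-shot prompting; the trace is the entire intervention.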
High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)
by CompVis / Stability AI
Introduced Latent Diffusion Models (LDMs), which perform the diffusion process in a compressed latent space rather than pixel space, dramatically reducing computational cost while maintaining image quality. This work underpins Stable Diffusion, the most widely used open-source image generation model.
Language Models are Few-Shot Learners (GPT-3)
by OpenAI
Introduced GPT-3, a 175B parameter language model demonstrating remarkable few-shot learning capabilities across diverse tasks. Showed that scaling model size dramatically improves in-context learning without gradient updates, reshaping the field.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
by Google Brain
Introduced the Vision Transformer (ViT), demonstrating that a pure transformer applied directly to sequences of image patches achieves state-of-the-art performance on image classification when pretrained on large datasets. The paper challenged the dominance of convolutional neural networks in computer vision.
Training Language Models to Follow Instructions with Human Feedback
by OpenAI
Presents InstructGPT, which uses Reinforcement Learning from Human Feedback (RLHF) to align GPT-3 with human intent. By fine-tuning on human demonstrations and training a reward model on human preference comparisons, InstructGPT produces outputs that human evaluators prefer to GPT-3 outputs despite having 100× fewer parameters.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
by Facebook AI Research
Introduces Retrieval-Augmented Generation (RAG), combining parametric memory (language model weights) with non-parametric memory (dense retrieval over Wikipedia) for knowledge-intensive NLP tasks. RAG models achieve state-of-the-art on open-domain QA benchmarks and produce more specific, factual, and diverse responses than pure parametric models.
Proximal Policy Optimization Algorithms
by OpenAI
PPO introduces a clipped surrogate objective that constrains policy update step sizes, achieving the stability of trust-region methods (TRPO) with the simplicity and scalability of first-order optimizers. It quickly became the dominant RL algorithm for training large language models with human feedback.
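The clipped surrogate objective itself is only a few lines. This is a sketch of the per-batch objective, not a full training loop; the sample values are illustrative:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective L^CLIP from the PPO paper.

    ratio: pi_new(a|s) / pi_old(a|s) per sample; advantage: estimated A_t.
    Taking the elementwise min of the unclipped and clipped terms removes
    any incentive to push the ratio outside [1 - eps, 1 + eps].
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

# A large positive-advantage update is capped at (1 + eps) * A:
print(ppo_clip_objective(np.array([1.5]), np.array([1.0])))  # 1.2
```

The clip acts as a cheap stand-in for TRPO's trust region: gradients vanish once the ratio leaves the clip range in the direction the advantage favors.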
Highly Accurate Protein Structure Prediction with AlphaFold
by DeepMind
AlphaFold 2 achieves atomic-level accuracy in protein structure prediction by combining evolutionary information from multiple sequence alignments with a novel Evoformer architecture and structure module, solving a 50-year grand challenge in biology. Its predictions have been released for virtually all known proteins and have accelerated drug discovery, enzyme design, and structural biology worldwide.
GPT-4 Technical Report
by OpenAI
Technical report for GPT-4, OpenAI's multimodal large language model accepting image and text inputs. Demonstrates state-of-the-art performance on academic and professional benchmarks, including passing a simulated bar exam with a score in the top 10% of test takers.
Segment Anything
by Meta AI
Introduced the Segment Anything Model (SAM) and the SA-1B dataset of 1 billion masks on 11 million images. SAM is a promptable segmentation foundation model that generalizes to new image distributions and tasks without additional training, enabling a new paradigm of interactive segmentation.
Evaluating Large Language Models Trained on Code (Codex)
by OpenAI
Introduced Codex, a GPT language model fine-tuned on publicly available code from GitHub, and the HumanEval benchmark for measuring code synthesis from docstrings. Codex powers GitHub Copilot and represents a breakthrough in automated programming assistance.
ReAct: Synergizing Reasoning and Acting in Language Models
by Google / Princeton
Introduces ReAct, a paradigm that combines reasoning traces and task-specific actions in language models. By interleaving thinking steps with tool calls, ReAct agents outperform chain-of-thought and act-only baselines on diverse tasks including question answering, fact verification, and interactive decision-making.
LoRA: Low-Rank Adaptation of Large Language Models
by Microsoft Research
Introduces LoRA, which freezes pretrained model weights and injects trainable low-rank decomposition matrices into Transformer layers. Reduces trainable parameters by 10,000× and GPU memory by 3× with no inference latency overhead, enabling efficient LLM fine-tuning.
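The low-rank update can be sketched in a few lines of NumPy (shapes, rank, and the alpha scaling below are illustrative hyperparameters, not the paper's specific settings):

```python
import numpy as np

d, r = 1024, 8  # hidden size and LoRA rank (illustrative values)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init
alpha = 16                           # scaling hyperparameter

def lora_forward(x):
    # h = W x + (alpha / r) * B A x; only A and B receive gradients.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# Zero-initializing B means the adapted model starts identical to the base:
assert np.allclose(lora_forward(x), W @ x)
print(2 * d * r / (d * d))  # trainable fraction of this layer: 0.015625
```

Because B A can be merged into W after training, the adapted layer is a single matmul at inference time, which is where the zero-latency-overhead claim comes from.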
LLaMA: Open and Efficient Foundation Language Models
by Meta AI
Introduces LLaMA, a collection of foundation language models ranging from 7B to 65B parameters, trained on publicly available datasets. Showed that smaller models trained on more tokens can match or exceed larger models, democratizing LLM research.
Deep Reinforcement Learning from Human Preferences
by OpenAI
This foundational RLHF paper shows that human preference comparisons between agent behaviors can train a reward model that guides deep RL agents in complex tasks like Atari games and MuJoCo locomotion, without hand-crafted reward functions. The approach reduces human labeling effort by ~3 orders of magnitude compared to direct reward specification.
Gemini: A Family of Highly Capable Multimodal Models
by Google DeepMind
Introduced the Gemini family of multimodal models (Ultra, Pro, Nano) natively trained to process and combine text, images, audio, and video. Gemini Ultra is the first model to surpass human expert performance on MMLU and achieves state-of-the-art across 30 of 32 benchmarks evaluated.
Efficient Memory Management for Large Language Model Serving with PagedAttention
by UC Berkeley
Introduced PagedAttention and the vLLM serving system, which manages the KV cache in non-contiguous physical memory blocks inspired by OS paging, enabling near-zero memory waste and efficient sharing of KV cache across requests. vLLM improves serving throughput by 2-4x at the same latency compared to state-of-the-art systems such as FasterTransformer and Orca.
Generative Agents: Interactive Simulacra of Human Behavior
by Stanford University / Google
Introduces generative agents—computational software agents that simulate believable human behavior—by combining a large language model with memory streams, reflection synthesis, and planning mechanisms. Twenty-five agents populate a virtual town, exhibiting emergent social behaviors including relationship formation, information propagation, and event coordination.
Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)
by OpenAI
Presented DALL-E 2 (unCLIP), a hierarchical text-conditional image generation system using CLIP image embeddings as a prior and a diffusion decoder. The system achieves state-of-the-art photorealism and text-image alignment, substantially advancing the field of text-to-image synthesis.
Self-Consistency Improves Chain of Thought Reasoning in Language Models
by Google Brain
Introduced self-consistency, a decoding strategy that samples diverse reasoning paths from a language model and returns the most consistent answer by marginalizing out the reasoning paths. Self-consistency is a simple, training-free technique that substantially improves chain-of-thought prompting across arithmetic and commonsense reasoning tasks.
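The aggregation step is just a majority vote over final answers parsed from the sampled reasoning paths. A minimal sketch, with hypothetical extracted answers standing in for real model samples:

```python
from collections import Counter

def self_consistent_answer(final_answers):
    """Majority vote over final answers extracted from sampled CoT paths.

    In the paper's setup, each answer comes from one temperature-sampled
    reasoning path; marginalizing out the paths means only the final
    answer string is voted on.
    """
    return Counter(final_answers).most_common(1)[0][0]

# e.g. five sampled reasoning paths whose extracted answers disagree:
paths = ["18", "18", "26", "18", "9"]
print(self_consistent_answer(paths))  # 18
```

The technique is training-free: the only cost is sampling several decodes instead of one.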
Scaling Laws for Neural Language Models
by OpenAI
Empirically establishes power-law scaling relationships between language model performance and model size, dataset size, and compute budget. Provides the foundational framework for predicting LLM capabilities as a function of scale, guiding research for years.
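A power law is a straight line in log-log space, which is how such exponents are fit in practice. The sketch below recovers the exponent from synthetic losses; the constants are merely in the ballpark of the paper's model-size fit and are used here only to generate data:

```python
import numpy as np

# Synthetic losses following L(N) = (Nc / N)^alpha. alpha and Nc are
# illustrative constants, not an endorsement of specific fitted values.
alpha_true, Nc = 0.076, 8.8e13
N = np.logspace(6, 11, 20)          # model sizes in parameters
L = (Nc / N) ** alpha_true

# log L = alpha * (log Nc - log N): a linear fit on (log N, log L)
# recovers the exponent as the negated slope.
slope, _ = np.polyfit(np.log(N), np.log(L), 1)
print(round(-slope, 3))  # 0.076
```

The same log-log regression applies to the dataset-size and compute axes, each with its own exponent.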
Visual Instruction Tuning (LLaVA)
by University of Wisconsin–Madison / Microsoft Research
Introduced LLaVA (Large Language and Vision Assistant), a multimodal model trained via visual instruction tuning using GPT-4-generated multimodal instruction-following data. LLaVA demonstrates impressive multimodal chat abilities, achieving an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following benchmark, and pioneered open-source visual instruction tuning.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
by Google Brain
Introduced Switch Transformers, a simplified mixture-of-experts (MoE) architecture that routes each token to exactly one expert (top-1 routing), enabling trillion-parameter models with sub-linear compute scaling. Switch Transformers achieve 7x pretraining speedup over a dense T5 model while maintaining model quality.
Model Cards for Model Reporting
by Google
Model Cards introduces a structured framework for documenting machine learning models across intended uses, performance disaggregated by demographic groups, and ethical considerations, enabling informed model selection and deployment decisions. The paper has become an industry standard, with model card adoption by Google, Hugging Face, and most major AI providers.
Language Models are Unsupervised Multitask Learners (GPT-2)
by OpenAI
Introduced GPT-2, demonstrating that large language models trained on diverse web text can perform zero-shot transfer across many NLP tasks without task-specific fine-tuning. Showed emergent capabilities at scale and sparked debate on responsible AI release.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
by Stanford University
Introduces FlashAttention, an IO-aware exact attention algorithm that restructures attention computation to minimize memory reads/writes between HBM and SRAM. Achieves 2-4× speedup over standard attention and enables training on much longer sequences.
Training Compute-Optimal Large Language Models (Chinchilla)
by DeepMind
Challenges the Kaplan et al. scaling laws by showing that model size and training tokens should scale equally. Trains Chinchilla (70B) on 4× more data than Gopher, matching or beating models 4× its size, redefining compute-optimal training strategies.
Datasheets for Datasets
by Microsoft Research / Multiple Institutions
Drawing an analogy to electronics component datasheets, this paper proposes that every ML dataset should be accompanied by a standardized document covering its motivation, composition, collection process, preprocessing, uses, distribution, and maintenance. Datasheets for Datasets has become the foundational standard for dataset transparency and is widely required by major AI venues.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
by Princeton University / Google DeepMind
Introduced Tree of Thoughts (ToT), a framework that generalizes chain-of-thought prompting to a tree search over intermediate reasoning steps. ToT enables LLMs to explore multiple reasoning paths, evaluate choices, and backtrack, achieving dramatic improvements on tasks requiring lookahead and planning.
Fast Inference from Transformers via Speculative Decoding
by Google Research
Introduced speculative decoding, a lossless inference acceleration technique that uses a smaller, faster draft model to propose multiple tokens, then verifies them in parallel with the target model in a single forward pass. This achieves 2-3x speedup without any degradation in output quality or distribution.
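The control flow can be illustrated with a deterministic greedy toy. The real method verifies draft tokens with rejection sampling over probability distributions so the output distribution is preserved; the sketch below is the simpler greedy special case, with toy next-token functions standing in for real models:

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One round of greedy speculative decoding (toy sketch).

    draft_next/target_next map a token sequence to its next token.
    The draft proposes k tokens; the target verifies them (conceptually
    in one forward pass) and keeps the longest matching prefix, plus one
    token of its own, so output always equals pure target decoding.
    """
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    accepted = list(prefix)
    for tok in proposal[len(prefix):]:
        if target_next(accepted) == tok:
            accepted.append(tok)                     # draft guessed right
        else:
            accepted.append(target_next(accepted))   # target overrides
            break
    else:
        accepted.append(target_next(accepted))       # bonus token
    return accepted

# Toy next-token functions standing in for real models:
target = lambda seq: len(seq) % 7
print(speculative_step(target, target, [1, 2]))  # [1, 2, 2, 3, 4, 5, 6]
```

When the draft agrees for all k positions, one verification pass yields k+1 tokens; when it diverges early, the step degrades gracefully to one target token, which is why quality is unchanged while average latency drops.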
Constitutional AI: Harmlessness from AI Feedback
by Anthropic
Introduces Constitutional AI (CAI), a method for training harmless AI assistants using a set of written principles (a 'constitution') to guide both supervised learning and reinforcement learning from AI feedback (RLAIF). CAI enables Anthropic to reduce reliance on human harm labels while maintaining helpfulness and making AI reasoning about harmlessness explicit.
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
by Stability AI
Presented SDXL, a significantly improved latent diffusion model architecture featuring a 3.5B parameter UNet backbone with a secondary refiner model, conditioning on image size and crop parameters, and a curated high-aesthetic dataset. SDXL substantially improves visual quality and prompt adherence over prior Stable Diffusion versions.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
by DeepSeek
DeepSeek-R1 demonstrates that pure reinforcement learning with rule-based rewards—without supervised fine-tuning on chain-of-thought data—can incentivize emergent reasoning capabilities in LLMs including self-verification, reflection, and long chain-of-thought. The model achieves performance comparable to OpenAI-o1 on reasoning benchmarks while being fully open-sourced, triggering a significant industry response.
QLoRA: Efficient Finetuning of Quantized LLMs
by University of Washington
Introduces QLoRA, which combines 4-bit quantization with LoRA adapters to fine-tune a 65B LLM on a single 48GB GPU while preserving full 16-bit fine-tuning performance. Introduces NF4 data type and double quantization for extreme memory reduction.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
by Institute of Science and Technology Austria (IST Austria)
Presented GPTQ, a one-shot weight quantization method based on approximate second-order information that can quantize GPT models with 175B parameters to 4-bit or 3-bit precision in approximately four GPU-hours with negligible accuracy loss. GPTQ made large model inference practical on consumer hardware.
Code Llama: Open Foundation Models for Code
by Meta AI
Introduced Code Llama, a family of large language models for code built on Llama 2 through code-specific pretraining and fine-tuning. Code Llama achieves state-of-the-art performance among open models on HumanEval and MBPP, with variants for Python, instruction following, and long context (100K tokens).
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
by LMSYS / UC Berkeley
Introduces Chatbot Arena, a platform for crowdsourced human evaluation of LLMs via pairwise comparisons using an Elo rating system. The arena has collected over 240K human votes across 50+ models, revealing human preference rankings that often diverge from standard benchmark leaderboards and providing a complementary evaluation signal.
Toolformer: Language Models Can Teach Themselves to Use Tools
by Meta AI
Presents Toolformer, a model that learns to use external tools (APIs) in a self-supervised manner without requiring human annotations. The model decides which APIs to call, how to call them, and how to incorporate results, achieving strong performance across diverse tasks while maintaining generative language modeling ability.
Mistral 7B
by Mistral AI
Introduces Mistral 7B, a 7B parameter language model outperforming LLaMA 2 13B on all evaluated benchmarks and surpassing LLaMA 1 34B in reasoning, mathematics, and code generation. Uses grouped-query attention and sliding window attention for efficient inference.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
by MIT / MIT-IBM Watson AI Lab
Introduced AWQ (Activation-aware Weight Quantization), a hardware-friendly low-bit weight quantization approach that protects a small fraction (1%) of salient weights based on activation magnitudes, achieving better performance than GPTQ at 4-bit while being faster and more broadly applicable across model architectures.
PaLM: Scaling Language Modeling with Pathways
by Google Research
Introduces PaLM (Pathways Language Model), a 540B parameter model trained on 780B tokens using the Pathways system. Achieved breakthrough performance on reasoning tasks and demonstrated discontinuous performance improvements that define emergent abilities.
DINOv2: Learning Robust Visual Features without Supervision
by Meta AI
Presented DINOv2, a self-supervised vision foundation model trained on a curated dataset of 142 million images by combining image-level self-distillation with patch-level masked image modeling objectives. DINOv2 features serve as universal visual representations, excelling on depth estimation, segmentation, and classification without fine-tuning.
Holistic Evaluation of Language Models
by Stanford CRFM
Presents HELM, a holistic evaluation framework for language models across 42 scenarios and 7 metrics: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. HELM reveals that no single model dominates across all dimensions and exposes significant gaps between narrow and comprehensive model assessment.
GPT-4V(ision) System Card
by OpenAI
The system card for GPT-4 with vision (GPT-4V), detailing the model's visual understanding capabilities, safety evaluations, limitations, and mitigation strategies. GPT-4V represents a major advancement in large multimodal models, enabling complex visual reasoning from natural language prompts.
The Claude 3 Model Family: Opus, Sonnet, Haiku
by Anthropic
Presents the Claude 3 family of models (Opus, Sonnet, Haiku), demonstrating state-of-the-art performance on reasoning, vision, and multilingual tasks. Highlights Anthropic's safety techniques including Constitutional AI and RLHF-based alignment.
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)
by Google Brain
Introduced Imagen, a text-to-image diffusion model that leverages large pretrained language models (T5-XXL) for text understanding combined with cascaded diffusion models for image synthesis. Imagen demonstrated that scaling text encoders is more impactful than scaling diffusion models, establishing DrawBench as a new evaluation benchmark.
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
by Princeton University / Together AI
Extends FlashAttention with improved work partitioning across GPU thread blocks and warps, achieving 2× speedup over FlashAttention and ~9× speedup over standard attention. Enables efficient training of models with context lengths up to 256K tokens.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
by University of Washington / Black in AI
This influential FAccT paper argues that ever-larger language models carry significant risks—including environmental costs, biased training data, and the illusion of meaning—that are often overlooked in the race for benchmark performance. It calls for pausing scaling to focus on documentation, auditing, and community-centered research practices.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
by Salesforce Research
Presented BLIP-2, which bridges the modality gap between frozen image encoders and frozen LLMs using a lightweight Querying Transformer (Q-Former) trained in two stages. BLIP-2 achieves state-of-the-art VQA performance with significantly fewer trainable parameters than prior methods.
Flamingo: a Visual Language Model for Few-Shot Learning
by DeepMind
Introduced Flamingo, a family of visual language models that bridge powerful pretrained vision and language models, enabling few-shot learning on a diverse range of multimodal tasks by training on arbitrarily interleaved sequences of images, video, and text. Flamingo set new few-shot state-of-the-art on 16 benchmarks.
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
by Tsinghua / Peking University / DeepWisdom
Presents MetaGPT, a multi-agent framework that encodes human workflows as Standardized Operating Procedures (SOPs) for LLM agents acting as specialized software roles. By assigning product manager, architect, engineer, and QA roles, MetaGPT produces complete, executable codebases from natural language requirements with higher quality than prior approaches.
Let's Verify Step by Step
by OpenAI
Demonstrated that process-based reward models (PRMs), which provide feedback on each reasoning step, substantially outperform outcome-based reward models (ORMs) for training LLMs to solve mathematical reasoning problems. The paper also introduced PRM800K, a dataset of 800K step-level human feedback labels on MATH solutions.
RoFormer: Enhanced Transformer with Rotary Position Embedding
by Zhuiyi Technology
Introduces Rotary Position Embedding (RoPE), encoding absolute position information with a rotation matrix and naturally incorporating relative position in self-attention. Adopted by LLaMA, PaLM 2, and most modern LLMs for its length generalization properties.
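The defining property — attention scores depending only on relative offsets — can be checked directly. A minimal NumPy sketch of the rotation (the pairing convention below is one common layout; implementations differ in how dimensions are paired):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to a vector at position `pos`.

    Each dimension pair (x_i, x_{i + d/2}) is rotated by the angle
    pos * theta_i, with per-pair frequency theta_i = base^(-2i/d).
    """
    half = x.shape[-1] // 2
    theta = base ** (-np.arange(half) / half)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
# The q.k score depends only on the relative offset m - n:
s1 = rope(q, 3) @ rope(k, 1)    # offset 2
s2 = rope(q, 10) @ rope(k, 8)   # offset 2
print(np.isclose(s1, s2))  # True
```

Rotating both query and key by position-dependent angles makes their inner product a function of cos((m-n)θ) and sin((m-n)θ) alone, which is the length-generalization property the entry refers to.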
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
by Princeton University
Introduced SWE-bench, a benchmark of 2,294 real GitHub issues from 12 popular Python repositories requiring models to resolve issues by writing code patches. SWE-bench reveals that even the best LLMs resolve only about 2% of issues with standard retrieval techniques, motivating research into code agents.
Voyager: An Open-Ended Embodied Agent with Large Language Models
by NVIDIA / Caltech / UT Austin
Presents Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager uses an automatic curriculum, an ever-growing skill library of executable code, and an iterative prompting mechanism to overcome failures.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
by Stanford University
Introduces DPO, a stable and efficient alternative to RLHF that directly optimizes a language model on human preference data without an explicit reward model or RL. Achieves comparable or superior alignment results with significantly simpler implementation.
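The DPO objective for a single preference pair is a logistic loss on an implicit reward margin. A NumPy sketch (the log-probability values below are hypothetical; a real implementation sums token log-probs from the policy and a frozen reference model):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are the policy's sequence log-probs; ref_* are the frozen
    reference model's. The implicit reward is beta * (logp - ref_logp),
    and the loss is -log sigmoid of the chosen-vs-rejected reward margin.
    """
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# With no preference margin the loss is log(2); it falls as the policy
# raises the chosen response's likelihood relative to the reference:
print(round(dpo_loss(-10.0, -10.0, -10.0, -10.0), 3))  # 0.693
```

Because the reward is implicit in the policy itself, no separate reward model and no RL rollout loop are needed, which is the simplification the entry describes.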
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
by Microsoft Research
Presents GraphRAG, which uses LLM-generated knowledge graphs and community detection to enable query-focused summarization over entire text corpora. Unlike standard RAG which answers local questions from text chunks, GraphRAG enables global sensemaking queries by reasoning over interconnected entity communities at multiple granularities.
Decision Transformer: Reinforcement Learning via Sequence Modeling
by UC Berkeley / Google Brain
Decision Transformer recasts offline reinforcement learning as a conditional sequence modeling problem, predicting actions given return-to-go, states, and past actions using a causal Transformer. This eliminates the need for temporal difference learning and bootstrapping while achieving competitive performance on Atari and MuJoCo benchmarks.
REALM: Retrieval-Augmented Language Model Pre-Training
by Google Research
Proposes REALM, which augments language model pre-training with a learned textual knowledge retriever, enabling the model to retrieve and attend over documents from a large corpus during both pre-training and fine-tuning. REALM achieves state-of-the-art on open-domain QA benchmarks while providing interpretable knowledge retrieval.
StarCoder: May the Source Be With You!
by BigCode / Hugging Face / ServiceNow
Presented StarCoder, a 15.5B parameter open-source code LLM trained on 1 trillion tokens from The Stack (permissively licensed source code) with fill-in-the-middle capability, fast multi-token prediction inference, and a commitment to responsible AI through a model card and attribution feature.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
by Google Brain
Introduces the Sparsely-Gated Mixture-of-Experts (MoE) layer, enabling 1000× capacity increase with only marginal computational cost increase. A learned gating network selects a sparse subset of expert sub-networks per input, enabling unprecedented model scale.
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
by Google / Everyday Robots
SayCan combines the semantic reasoning capabilities of large language models with learned value functions that encode physical feasibility, allowing robots to plan long-horizon tasks expressed in natural language. The approach grounds high-level language instructions in real-world robot affordances without task-specific fine-tuning.
DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter
by Hugging Face
Introduces DistilBERT, a knowledge-distilled version of BERT that retains 97% of BERT's language understanding while being 40% smaller and 60% faster. Demonstrates the effectiveness of task-agnostic knowledge distillation for pretrained language models.
Conservative Q-Learning for Offline Reinforcement Learning
by UC Berkeley
CQL (Conservative Q-Learning) addresses distribution shift in offline RL by augmenting the standard Bellman objective with a term that penalizes Q-values for out-of-distribution actions, producing a lower bound on the true value function. This conservative approach prevents over-optimistic value estimation and achieves strong performance across locomotion, navigation, and robotic manipulation datasets.
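The conservative term for a single state can be sketched in the discrete-action case (the Q-values below are hypothetical; the full algorithm adds this penalty, weighted, to the standard Bellman loss):

```python
import numpy as np

def cql_penalty(q_values, data_action):
    """CQL regularizer for one state (discrete-action sketch).

    Pushes down a soft maximum over all actions' Q-values while pushing
    up the Q-value of the action actually present in the offline data:
        logsumexp_a Q(s, a) - Q(s, a_data) >= 0.
    """
    soft_max = np.log(np.sum(np.exp(q_values)))  # logsumexp over actions
    return soft_max - q_values[data_action]

q = np.array([1.0, 4.0, 2.0])   # hypothetical Q(s, .) for 3 actions
print(cql_penalty(q, 1) >= 0.0)  # True: logsumexp >= max >= Q(s, a_data)
```

Since logsumexp upper-bounds the maximum Q-value, the penalty is largest exactly when out-of-distribution actions look spuriously attractive, which is how it counteracts over-optimistic value estimates.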
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
by University of Washington / IBM AI Research / Allen AI
Introduces Self-RAG, a framework that trains a single LM to adaptively retrieve passages on demand, generate text, and critique its own outputs using special reflection tokens. Unlike standard RAG, Self-RAG decides when to retrieve and reflects on retrieved passages and generation quality, outperforming ChatGPT and standard RAG on diverse downstream tasks.
Emergent Abilities of Large Language Models
by Google Research / Stanford / DeepMind / UNC
Defines and documents emergent abilities in LLMs — capabilities that appear sharply at certain model scales rather than improving gradually. Surveys over 100 tasks where models exhibit phase-transition-like capability gains, sparking debate on whether emergence is real or a measurement artifact.
Improving Language Models by Retrieving from Trillions of Tokens
by DeepMind
Presents RETRO (Retrieval-Enhanced Transformers), a model that retrieves from a 2-trillion-token database at inference time via chunked cross-attention. RETRO achieves performance comparable to GPT-3 with 25× fewer parameters by leveraging retrieved passages, demonstrating that retrieval augmentation is a compute-efficient alternative to scaling.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
by Princeton NLP / Princeton Language and Intelligence
Introduces SWE-agent, which defines Agent-Computer Interfaces (ACIs) to enable LLMs to autonomously solve real GitHub issues by browsing codebases, editing files, and running tests. On the SWE-bench benchmark, SWE-agent with GPT-4 Turbo resolves 12.5% of issues, significantly outperforming prior methods.
Red Teaming Language Models with Language Models
by DeepMind
Proposes using language models to automatically generate test cases that elicit harmful behaviors from target language models—a scalable alternative to manual red teaming. The approach discovers diverse attack prompts across harm categories and reveals that larger models are harder to red-team but produce more harmful outputs when successfully attacked.
Qwen2.5 Technical Report
by Alibaba Cloud / Qwen Team
Qwen2.5 is a comprehensive family of open-source LLMs (0.5B to 72B parameters) trained on 18 trillion tokens including significantly expanded coding and mathematics data, achieving state-of-the-art open-source performance on coding (HumanEval), mathematics (MATH), and multilingual benchmarks. The series includes specialized Qwen2.5-Coder and Qwen2.5-Math variants.
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
by LMSYS / UC Berkeley / CMU / UCSD
Presents Vicuna-13B, an open-source chatbot created by fine-tuning LLaMA on ShareGPT conversation data, achieving approximately 90% of ChatGPT and Bard quality as judged by GPT-4. The paper introduces GPT-4 as an automated judge for chatbot evaluation, establishing a widely adopted evaluation paradigm for conversational AI.
AgentBench: Evaluating LLMs as Agents
by Tsinghua University
Introduces AgentBench, the first systematic benchmark for evaluating LLMs as autonomous agents across eight distinct environments spanning operating systems, databases, knowledge graphs, digital games, and web browsing. The benchmark reveals a large performance gap between commercial and open-source models on real-world agent tasks.
Competition-Level Code Generation with AlphaCode
by DeepMind
AlphaCode is a large-scale language model from DeepMind designed for competitive programming. It was pre-trained on public GitHub code and fine-tuned on a curated dataset of programming contest problems. The system generates a vast number of potential solutions and then filters them using test cases to find a correct one.
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
by OpenAI
This paper explores weak-to-strong generalization, a method for training a powerful AI model using supervision from a weaker one. It serves as an analogy for aligning superintelligent AI with human values. The research shows that strong models can learn beyond their weak supervisors and introduces techniques like auxiliary confidence loss to improve performance.
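The auxiliary confidence loss can be sketched as a mixture of two cross-entropies, one against the weak supervisor's labels and one against the strong model's own hardened predictions. This is an illustrative toy, not the paper's code; the function name and `alpha` weighting are assumptions.

```python
import numpy as np

def aux_confidence_loss(probs, weak_labels, alpha=0.5):
    """Mix cross-entropy against the weak supervisor's labels with
    cross-entropy against the strong model's own argmax predictions,
    letting the strong model disagree confidently with its weak teacher."""
    eps = 1e-12
    n = len(weak_labels)
    ce_weak = -np.log(probs[np.arange(n), weak_labels] + eps).mean()
    hard = probs.argmax(1)  # the model's own confident guess
    ce_self = -np.log(probs[np.arange(n), hard] + eps).mean()
    return (1 - alpha) * ce_weak + alpha * ce_self

# A model that is confident in class 0 while the weak label says class 1:
p = np.array([[0.9, 0.1]])
loss_weak_only = aux_confidence_loss(p, [1], alpha=0.0)  # penalizes the disagreement
loss_self_only = aux_confidence_loss(p, [1], alpha=1.0)  # rewards the confident guess
```

Raising `alpha` shifts weight from imitating the weak labels toward reinforcing the strong model's own confident predictions, which is the mechanism the paper credits for generalizing beyond the supervisor.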
Scalable agent alignment via reward modeling: a research direction
by DeepMind
This research paper proposes a method for aligning advanced AI systems by using recursive reward modeling. The approach leverages AI assistants to help human evaluators assess complex AI actions, enabling scalable oversight and positioning this technique alongside debate and amplification as key AI safety strategies.
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
by Alibaba Cloud / DAMO Academy
Qwen-VL is a large-scale vision-language model series from Alibaba, trained on a curated multilingual multimodal dataset. It supports high-resolution image understanding, visual grounding with bounding boxes, and multilingual text reading, achieving state-of-the-art results on multiple visual benchmarks.
CAMEL: Communicative Agents for Mind Exploration of Large Language Model Society
by KAUST
CAMEL introduces a novel framework for studying multi-agent cooperation by having AI agents role-play to solve tasks. It utilizes a technique called 'inception prompting' to ensure agents adhere to their assigned personas, enabling the exploration of complex communicative behaviors and societal dynamics within large language models with minimal human guidance.
STaR: Bootstrapping Reasoning With Reasoning
by Stanford University / Google Brain
STaR (Self-Taught Reasoner) is a research paper introducing an iterative bootstrapping method for language models. The model learns to improve its reasoning abilities by generating rationales for problems, filtering out the incorrect ones, and then fine-tuning itself on the successfully reasoned examples. This allows smaller models to achieve reasoning performance comparable to much larger ones.
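One bootstrapping round can be sketched in a few lines; the toy model and helper names here are illustrative, not from the paper's code.

```python
def star_round(model, problems, answers):
    """One STaR round: generate a rationale and answer per problem,
    keep only pairs whose answer is correct, and return them as the
    fine-tuning set for the next iteration."""
    finetune_set = []
    for p in problems:
        rationale, guess = model(p)
        if guess == answers[p]:  # rationalize-and-filter step
            finetune_set.append((p, rationale))
    return finetune_set

# Toy model: adds two numbers, but is wrong on one case.
def toy_model(p):
    a, b = p
    guess = a + b if p != (2, 2) else 5
    return (f"{a} plus {b}", guess)

problems = [(1, 1), (2, 2), (3, 4)]
answers = {(1, 1): 2, (2, 2): 4, (3, 4): 7}
data = star_round(toy_model, problems, answers)
# keeps only the two correctly answered problems with their rationales
```

In the full method this loop repeats: fine-tuning on the surviving rationales improves the model, which then rationalizes more problems correctly on the next pass.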
The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI
by Meta AI
Llama 4 introduces a family of natively multimodal mixture-of-experts models—Scout (17B/16 experts), Maverick (17B/128 experts), and Behemoth (288B/16 experts)—pretrained jointly on text, image, and video data. Maverick achieves top scores on vision-language benchmarks while Scout offers 10M-token context at a fraction of the compute of comparable models.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
by Google Research
Introduces Grouped-Query Attention (GQA), an efficient attention mechanism that generalizes Multi-Head and Multi-Query Attention. GQA groups query heads to share key and value heads, drastically reducing the KV cache size and memory bandwidth, which accelerates inference speed while maintaining near Multi-Head quality.
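The grouping trick can be sketched in NumPy: query heads outnumber KV heads, and each group of queries attends using a shared KV head. A minimal toy, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gqa_attention(q, k, v):
    """q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d).
    Each group of n_q_heads // n_kv_heads query heads shares one KV head,
    shrinking the KV cache by that factor versus multi-head attention."""
    n_q, T, d = q.shape
    n_kv = k.shape[0]
    assert n_q % n_kv == 0
    rep = n_q // n_kv
    k = np.repeat(k, rep, axis=0)  # broadcast shared KV heads to query heads
    v = np.repeat(v, rep, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 4, 16))  # 2 shared KV heads -> 4x smaller KV cache
v = rng.standard_normal((2, 4, 16))
out = gqa_attention(q, k, v)         # shape (8, 4, 16)
```

Setting the KV-head count to 1 recovers Multi-Query Attention, and setting it equal to the query-head count recovers standard Multi-Head Attention.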
Artificial Intelligence Ethics Guidelines: A Global Inventory
by ETH Zurich / Multiple Institutions
This paper presents a systematic review of 84 prominent AI ethics guidelines from around the world. It identifies a global convergence on five ethical principles (transparency, justice and fairness, non-maleficence, responsibility, and privacy) but reveals significant divergence in how these principles are interpreted and operationalized across different sectors and regions.
Zoom In: An Introduction to Circuits
by Distill / OpenAI
This essay by Chris Olah and colleagues at Distill introduces the circuits framework for mechanistic interpretability, arguing that neural network weights encode interpretable algorithms composed of features and circuits. It presents case studies of curve detectors and multimodal neurons as evidence that individual units and motifs in neural networks are meaningfully interpretable.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
by Anthropic
Demonstrates that LLMs can be trained to behave safely during normal operation but exhibit unsafe behaviors when triggered by specific conditions—acting as 'sleeper agents'—and that standard safety training techniques including RLHF, supervised fine-tuning, and adversarial training fail to reliably remove these backdoors, sometimes even hiding them deeper.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
by Google Research
Introduces Switch Transformers, simplifying MoE routing to select a single expert per token (top-1), enabling stable trillion-parameter T5-scale models with 7× pre-training speedup. Demonstrates that parameter count and compute can be decoupled through sparsity.
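The top-1 routing rule is simple enough to sketch directly; this toy router (names and shapes are assumptions, not the paper's code) sends each token to exactly one expert and scales the output by the router probability, which is what keeps the router differentiable in the real model.

```python
import numpy as np

def switch_route(x, w_router, experts):
    """Switch-style top-1 routing: each token is dispatched to a single
    expert chosen by argmax over router probabilities."""
    logits = x @ w_router                         # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    choice = probs.argmax(-1)                     # top-1 expert per token
    out = np.empty_like(x)
    for e, expert in enumerate(experts):
        mask = choice == e
        out[mask] = probs[mask, e:e + 1] * expert(x[mask])
    return out, choice

x = np.array([[1.0, 1.0], [-1.0, -1.0]])
w_router = np.array([[1.0, -1.0], [1.0, -1.0]])
experts = [lambda h: 2.0 * h, lambda h: -h]
out, choice = switch_route(x, w_router, experts)  # token 0 -> expert 0, token 1 -> expert 1
```

Because only one expert runs per token, total parameters grow with the expert count while per-token compute stays constant, which is the decoupling the paper demonstrates.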
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
by DeepSeek
This paper introduces Group Relative Policy Optimization (GRPO), a memory-efficient reinforcement learning algorithm. GRPO enables scalable RLHF-style training by replacing the critic model with group-sampled reward baselines, a technique used to enhance the mathematical reasoning of models like DeepSeekMath.
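The baseline substitution at the heart of GRPO fits in one function: normalize each sampled completion's reward against its own group's statistics instead of a critic's value estimate. A minimal sketch, not DeepSeek's implementation.

```python
import numpy as np

def group_relative_advantages(rewards):
    """For each prompt, sample a group of completions and use the group
    mean and std of rewards as the baseline (no learned critic)."""
    r = np.asarray(rewards, dtype=float)          # (groups, samples_per_group)
    mean = r.mean(axis=1, keepdims=True)
    std = r.std(axis=1, keepdims=True) + 1e-8     # avoid division by zero
    return (r - mean) / std

adv = group_relative_advantages([[1.0, 0.0, 2.0],
                                 [0.5, 0.5, 0.5]])
# each row is centered at ~0; a constant-reward group yields ~0 advantage
```

Completions scoring above their group mean get positive advantages and are reinforced; a group where every sample scores the same provides no gradient signal, which is why multiple samples per prompt are needed.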
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
by Google DeepMind
RT-2 is a Vision-Language-Action (VLA) model that translates visual and language inputs directly into robotic actions. By co-fine-tuning large models on both web-scale and robotics data, it transfers knowledge from the internet to physical control, enabling robots to reason about and execute tasks involving novel objects and scenarios without explicit robotic training.
Representation Engineering: A Top-Down Approach to AI Transparency
by Center for AI Safety / UC Berkeley
Representation Engineering (RepE) is a top-down AI transparency technique for interpreting and controlling Large Language Models. It uses linear probes on activation differences from contrastive prompts to identify and manipulate high-level concepts like truthfulness and emotion without needing to retrain or fine-tune the model.
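The core reading-vector idea can be illustrated with a difference-of-means probe over contrastive activations; the synthetic data and function name below are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np

def reading_vector(pos_acts, neg_acts):
    """Estimate a concept direction as the normalized difference of mean
    activations between contrastive prompt sets; new activations are then
    scored by projection onto this direction."""
    v = pos_acts.mean(0) - neg_acts.mean(0)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
concept = np.array([1.0, 0.0, 0.0])
pos = rng.standard_normal((50, 3)) + 3 * concept   # e.g. "honest" prompts
neg = rng.standard_normal((50, 3)) - 3 * concept   # e.g. "dishonest" prompts
v = reading_vector(pos, neg)
score_pos = (pos @ v).mean()
score_neg = (neg @ v).mean()   # projection separates the two sets
```

In RepE the same direction can also be added to or subtracted from activations at inference time to steer behavior, with no retraining.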
Atlas: Few-shot Learning with Retrieval Augmented Language Models
by Meta AI / University College London
Atlas is a retrieval-augmented language model designed for few-shot learning. It uniquely pre-trains its retriever and language model components jointly, enabling it to effectively leverage external knowledge documents. This approach allows Atlas to achieve state-of-the-art few-shot performance on knowledge-intensive NLP benchmarks like MMLU, outperforming much larger models.
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
by Anthropic
This research paper from Anthropic introduces a method using sparse autoencoders to decompose the internal activations of a transformer model. It successfully extracts thousands of interpretable, monosemantic features, demonstrating that the superposition of concepts within neurons can be untangled.
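The forward pass of such a sparse autoencoder is compact: an overcomplete ReLU encoder followed by a linear decoder. The shapes and negative encoder bias below are illustrative choices, not Anthropic's trained values.

```python
import numpy as np

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Project activations into an overcomplete feature basis with a ReLU
    (most features stay at exactly zero), then reconstruct the activation
    as a sparse sum of decoder directions."""
    f = np.maximum(0.0, x @ w_enc + b_enc)   # sparse feature activations
    x_hat = f @ w_dec + b_dec                # reconstruction
    return f, x_hat

rng = np.random.default_rng(0)
d_model, d_feat = 4, 16                      # overcomplete: d_feat >> d_model
w_enc = rng.standard_normal((d_model, d_feat))
b_enc = -1.0 * np.ones(d_feat)               # negative bias encourages sparsity
w_dec = rng.standard_normal((d_feat, d_model))
b_dec = np.zeros(d_model)
x = rng.standard_normal((10, d_model))
f, x_hat = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
```

Training minimizes reconstruction error plus an L1 penalty on `f`; the paper's interpretability claims come from inspecting which inputs activate each learned feature.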
Claude Opus 4 Technical Report
by Anthropic
The Claude Opus 4 technical report details Anthropic's flagship model, highlighting its extended thinking, advanced coding, and agentic capabilities. It showcases top-tier performance on benchmarks like SWE-bench and GPQA, along with significant improvements in safety through Constitutional AI and RLHF.
Gemini 2.5 Pro Technical Report
by Google DeepMind
Gemini 2.5 Pro introduces thinking mode—an integrated chain-of-thought reasoning layer—combined with a 1M-token context window and natively multimodal capabilities spanning text, image, audio, and video. The model achieves leading positions on multiple reasoning and coding benchmarks including Codeforces, AIME, and MMMU.
In-context Learning and Induction Heads
by Anthropic
This paper establishes a causal link between specific transformer circuits, termed "induction heads," and the phenomenon of in-context learning. It demonstrates that these circuits, which complete repeated sequences by copying the token that followed an earlier occurrence ([A][B] ... [A] → [B]), emerge abruptly during training and are a key mechanistic driver of few-shot learning abilities in LLMs.
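The behavior an induction head implements reduces to a lookup over the context; this toy function (an illustration of the pattern, not a model internal) makes the [A][B] ... [A] → [B] rule concrete.

```python
def induction_predict(tokens):
    """Induction-head behavior as a lookup: if the current token appeared
    earlier in the context, predict the token that followed its most
    recent previous occurrence."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence: the head has nothing to copy

induction_predict(list("abxa"))  # returns "b": 'a' was followed by 'b' before
```

In a real transformer this is realized by a previous-token head composing with an attention head that attends to the matched position and copies its successor.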
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
by Carnegie Mellon University / Together AI
Mamba is a novel sequence modeling architecture based on structured state space models (SSMs). It introduces a selection mechanism that allows the model to selectively propagate or forget information based on the input, overcoming a key limitation of previous SSMs. This enables Mamba to achieve Transformer-level performance with linear time complexity and significantly faster inference.
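The selection mechanism can be caricatured with a scalar-state recurrence whose decay and input gate depend on the current input; this is a heavily simplified toy (real Mamba uses discretized matrix-valued SSM parameters and a parallel scan), with all names assumed for illustration.

```python
import numpy as np

def selective_scan(x, w_a, w_b):
    """Selective-SSM sketch, one scalar state per channel: the decay a_t
    and input gate b_t are functions of the current input x_t, so the
    recurrence can choose to retain or overwrite its state. Cost is O(T)."""
    T, d = x.shape
    h = np.zeros(d)
    ys = np.empty_like(x)
    for t in range(T):
        a_t = 1.0 / (1.0 + np.exp(-(x[t] * w_a)))  # input-dependent decay in (0, 1)
        b_t = x[t] * w_b                            # input-dependent input gate
        h = a_t * h + b_t * x[t]
        ys[t] = h
    return ys

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 3))
ys = selective_scan(x, w_a=np.ones(3), w_b=np.ones(3))  # shape (6, 3)
```

Because the state is fixed-size, inference needs no growing KV cache, which is the source of Mamba's linear-time, constant-memory generation.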
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
by LMSYS / UC Berkeley
Introduces LMSYS-Chat-1M, a large-scale dataset of one million real-world conversations with 25 state-of-the-art LLMs collected from the Chatbot Arena platform. Analysis reveals diverse usage patterns, safety violations, and human preference signals, making it a valuable resource for safety evaluation, capability assessment, and alignment research.
Towards Expert-Level Medical Question Answering with Large Language Models
by Google Research
This paper introduces Med-PaLM 2, a large language model fine-tuned on medical data. It achieves expert-level performance on medical licensing exam questions, with long-form answers that physicians preferred to physician-written answers on most evaluation axes, and proposes a framework for evaluating the safety and alignment of medical AI systems.
CogVLM: Visual Expert for Pretrained Language Models
by Tsinghua University / Zhipu AI
CogVLM is a vision-language model that enhances pretrained language models (LLMs) with visual understanding. It introduces a trainable visual expert module into each layer of a frozen LLM, enabling deep fusion of image and text features. This approach achieves state-of-the-art results on numerous vision-language benchmarks without altering the original language model's parameters.
Fast Transformer Decoding: One Write-Head is All You Need (Multi-Query Attention)
by Google Brain
Introduces Multi-Query Attention (MQA), an efficient attention mechanism for autoregressive decoding. By sharing a single key and value head across all query heads, MQA drastically reduces the size of the KV cache. This leads to significant memory bandwidth savings and faster inference speeds with minimal impact on model quality.
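The memory saving is easy to see with back-of-the-envelope KV-cache arithmetic; the parameter values below are illustrative, not taken from the paper.

```python
def kv_cache_bytes(layers, heads_kv, head_dim, seq_len, batch, bytes_per=2):
    """KV-cache size: 2 tensors (K and V) per layer, each of shape
    (batch, heads_kv, seq_len, head_dim), at bytes_per bytes per element
    (2 for fp16). MQA sets heads_kv = 1."""
    return 2 * layers * batch * heads_kv * seq_len * head_dim * bytes_per

# Hypothetical 32-layer model with 32 heads of dim 128 at 4096 tokens:
mha = kv_cache_bytes(layers=32, heads_kv=32, head_dim=128, seq_len=4096, batch=1)
mqa = kv_cache_bytes(layers=32, heads_kv=1, head_dim=128, seq_len=4096, batch=1)
# sharing one KV head shrinks the cache by the head count (here 32x)
```

Since autoregressive decoding is typically bound by the memory bandwidth needed to stream the KV cache each step, this reduction translates almost directly into faster generation.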
NVIDIA AI
by NVIDIA
NVIDIA AI provides a comprehensive suite of hardware and software solutions for accelerating AI development and deployment. Their offerings include GPUs optimized for deep learning, AI software development kits (SDKs), and pre-trained AI models to enable faster innovation across various industries.
Amazon SageMaker
by Amazon Web Services (AWS)
Amazon SageMaker is a fully managed machine learning service that enables data scientists and developers to build, train, and deploy machine learning models quickly. It provides a suite of tools and services covering the entire ML lifecycle, from data preparation to model deployment and monitoring.
Databricks
by Databricks
Databricks is a unified data analytics platform built on Apache Spark, providing tools for data engineering, data science, and machine learning. It enables organizations to process large datasets, build and deploy ML models, and collaborate across teams.
AssemblyAI
by AssemblyAI
AssemblyAI provides a Speech-to-Text API that allows developers to transcribe audio and video files with high accuracy. Their platform offers features like speaker diarization, sentiment analysis, and content moderation, making it a comprehensive solution for audio intelligence.
Hugging Face
by Hugging Face
Hugging Face is the GitHub of AI, providing the world's largest open model hub, dataset repository, and ML collaboration platform. Its Transformers library is the de facto standard for working with open-weight models, and the Hugging Face Hub hosts hundreds of thousands of models and datasets. Its Spaces platform allows AI demos to be deployed instantly.
Amazon Web Services AI
by Amazon
Amazon Web Services is the world's largest cloud provider and offers the most comprehensive set of AI and machine learning services, including Amazon Bedrock for managed foundation model APIs, SageMaker for MLOps, Rekognition for computer vision, and Polly and Transcribe for speech. Bedrock gives enterprises access to models from Anthropic, Meta, Mistral AI, Cohere, and others through a unified API.
LangChain Inc
by LangChain Inc
LangChain Inc is the company behind the most widely adopted LLM orchestration framework in the AI ecosystem. LangChain provides composable abstractions for building LLM-powered applications, while its LangSmith platform offers observability and evaluation tooling, and LangGraph enables the construction of stateful, multi-actor agent workflows.
Microsoft Azure AI
by Microsoft
Microsoft Azure AI is the AI services division of Microsoft's cloud platform, uniquely positioned as the exclusive cloud partner of OpenAI. Through Azure OpenAI Service, enterprises access GPT-4, DALL-E, and Whisper with enterprise-grade compliance and data residency guarantees. Microsoft has deeply integrated AI across its product suite including Copilot for Microsoft 365, GitHub Copilot, and Azure AI Foundry.
Google Cloud AI
by Google
Google Cloud AI provides enterprise access to Google DeepMind's Gemini models and a comprehensive suite of managed AI services via Vertex AI. As the creator of the Transformer architecture and TensorFlow, Google Cloud offers unmatched AI infrastructure including custom TPUs, a full MLOps platform, and pre-built APIs for vision, speech, and natural language processing.
Graphcore
by Graphcore
Graphcore is a semiconductor company that develops Intelligence Processing Units (IPUs), a type of microprocessor designed specifically for AI and machine learning workloads. Their IPUs are designed to accelerate training and inference for complex AI models, offering an alternative to GPUs.
Pinecone Systems
by Pinecone
Pinecone is the leading managed vector database, purpose-built for AI applications requiring similarity search at scale. It powers retrieval-augmented generation, semantic search, and recommendation systems for thousands of enterprises. Pinecone's serverless architecture eliminates infrastructure management while delivering sub-millisecond query performance.
LMSYS
by LMSYS / UC Berkeley
LMSYS (Large Model Systems Organization) is a research collective from UC Berkeley known for creating Chatbot Arena—the leading human preference-based LLM evaluation leaderboard—and developing high-performance open-source inference systems including vLLM and FastChat. LMSYS research on Elo-based evaluation and serving efficiency has become foundational to the field.
EleutherAI
by EleutherAI
EleutherAI is a decentralized open-source AI research collective best known for training and releasing the GPT-Neo, GPT-J, GPT-NeoX, and Pythia model families, as well as developing the LM Evaluation Harness—the standard benchmarking framework for language models. The organization operates as a grassroots nonprofit committed to open and reproducible AI research.
Allen Institute for AI (AI2)
by Allen Institute for AI
The Allen Institute for AI (AI2) is a nonprofit research institute focused on high-impact, open-source AI. Founded by Paul Allen, it produces foundational models like OLMo, open datasets such as Dolma, and reasoning benchmarks such as ARC. Its Semantic Scholar platform provides AI-powered discovery across 200M+ academic papers.
Scale AI
by Scale AI
Scale AI is the leading AI data platform providing high-quality training data labeling, RLHF pipelines, and model evaluation services for frontier AI labs, government agencies, and Fortune 500 enterprises. Its Rapid platform and data engine power training datasets for many leading language and vision models.
ElevenLabs
by ElevenLabs
ElevenLabs is a voice technology research company developing advanced text-to-speech and voice cloning software. Their platform allows users to generate high-quality spoken audio in numerous languages, create custom AI voices, or clone existing ones. It is widely used for audiobooks, video games, and content creation.
LAION
by LAION
LAION (Large-scale Artificial Intelligence Open Network) is a German nonprofit that creates and releases massive open datasets for AI research. Its most notable contribution, LAION-5B, is a dataset of 5.85 billion image-text pairs that was pivotal in training foundational models like Stable Diffusion.
Perplexity AI
by Perplexity AI
Perplexity AI is an answer engine that combines real-time web search with large language model reasoning to deliver cited, conversational responses. Founded in 2022, it has rapidly grown to tens of millions of monthly active users and positions itself as an AI-native alternative to traditional search engines.
Weights & Biases
by Weights & Biases
Weights & Biases (W&B) is a leading MLOps platform for developers, specializing in experiment tracking, model evaluation, and dataset versioning. It provides tools to visualize model performance, manage datasets, and collaborate on machine learning projects, integrating with popular frameworks like PyTorch and TensorFlow.
Runway ML
by Runway ML
Runway is an applied AI research company focused on building multimodal AI systems for art, entertainment, and human creativity. It provides a suite of web-based tools for generative content creation, including industry-leading text-to-video, image-to-video, and AI-powered video editing features for creative professionals.
Character AI
by Character AI
Character AI is a consumer platform for creating and interacting with AI-powered characters. Users can engage in conversations for entertainment, role-playing, and creative exploration. It has become a major consumer AI application with a massive user base, focusing on personalized and immersive chat experiences.
Stability AI
by Stability AI
Stability AI is a generative AI company known for developing the popular open-source Stable Diffusion text-to-image model. They focus on creating open, multi-modal AI models for image, language, audio, and video generation, which are accessible via APIs and as downloadable weights for custom implementation.
Groq
by Groq
Groq is a semiconductor company that developed the Language Processing Unit (LPU), a custom chip for ultra-fast AI inference. Their managed API provides some of the fastest publicly available LLM inference speeds, often exceeding 800 tokens/second, making it ideal for latency-sensitive applications.
Weaviate
by Weaviate
Weaviate is an open-source vector database designed for AI-native applications. It enables flexible hybrid search, combining vector and keyword methods, and uniquely supports multi-modal data like text, images, and audio. Weaviate offers both self-hosting for maximum control and a managed cloud service for ease of use.
BigCode Project
by BigCode / Hugging Face / ServiceNow
BigCode is an open scientific collaboration by Hugging Face and ServiceNow for the responsible development of large language models (LLMs) for code. The project produced the StarCoder and StarCoder2 models, trained on 'The Stack' dataset, with a strong emphasis on ethical data governance, source attribution, and consent.
BigScience
by BigScience / Hugging Face
BigScience was a year-long, open research collaboration involving over 1,000 volunteer researchers, organized by Hugging Face. This global effort focused on the transparent and ethical development of large language models, culminating in the creation of BLOOM, a 176-billion parameter open-access multilingual model.
Together AI
by Together AI
Together AI provides a high-performance cloud inference platform for open-source models, offering one of the fastest and most cost-effective APIs for running models like Llama, Mistral, and DeepSeek. Its Together Inference platform specializes in speculative decoding and model parallelism techniques, and also offers managed fine-tuning and custom model deployment.
Synthesia
by Synthesia
Synthesia is an enterprise AI video generation platform that enables users to create professional-quality videos featuring realistic AI avatars from text scripts, without cameras, actors, or studios. Serving thousands of enterprise customers including Accenture, BBC, and Reuters, it is the leading platform for scalable AI-generated corporate video content.
Jasper AI
by Jasper AI
Jasper AI is an enterprise-grade AI content platform designed for marketing teams to produce brand-consistent copy, campaigns, and creative assets at scale. It integrates with brand voice guidelines, company knowledge bases, and major marketing workflows to maintain tone consistency across channels.
Casetext
by Casetext / Thomson Reuters
Casetext was a pioneer in AI-powered legal research and drafting, launching CoCounsel—the first AI legal assistant powered by GPT-4—before being acquired by Thomson Reuters in 2023 for $650M. Its technology is now integrated into Westlaw and Practical Law, making AI legal assistance available to millions of legal professionals.
Anyscale
by Anyscale
Anyscale is the company behind Ray, the open-source distributed computing framework that has become the infrastructure backbone for training and serving large-scale AI at companies like OpenAI, Uber, and Spotify. Anyscale provides a managed platform for Ray workloads, including Anyscale Endpoints for scalable LLM inference and RayLLM for open-model serving.
Replicate
by Replicate
Replicate is a cloud platform that makes it trivial to run open-source machine learning models via a simple API with pay-per-second billing. It hosts thousands of community models spanning image generation, video, audio, and language, and allows developers to package custom models with its open-source Cog tool and deploy them without managing any GPU infrastructure.
Labelbox
by Labelbox
Labelbox is an enterprise data-curation and annotation platform that streamlines the creation of high-quality training datasets for computer vision, NLP, and multimodal AI models. It provides annotation tooling, quality workflows, model-assisted labeling, and a managed workforce marketplace.
Harvey AI
by Harvey AI
Harvey AI is an enterprise legal AI platform built on foundation models fine-tuned on legal corpora to assist law firms and corporate legal departments with research, drafting, due diligence, and contract analysis. It is deployed at leading global law firms and backed by OpenAI, positioning itself as the AI layer for professional legal services.
Cerebras Systems
by Cerebras Systems
Cerebras Systems designs and manufactures the Wafer Scale Engine (WSE), the world's largest AI chip, enabling ultra-fast LLM training and inference at speeds far exceeding GPU clusters. Its CS-3 system and Cerebras Inference cloud service deliver token generation rates of 2,000+ tokens/second for leading open-weight models.
BentoML
by BentoML
BentoML is an open-source platform for building, shipping, and scaling AI applications and model inference services, providing a unified framework from local development to cloud production. BentoCloud, its managed service, offers one-click deployment, auto-scaling, and observability for ML teams.
Nomic AI
by Nomic AI
Nomic AI builds open, auditable AI systems focused on embedding models and large-scale data visualization, most notably the nomic-embed-text model and Atlas—a platform for exploring and understanding massive datasets through interactive AI-powered maps. The company emphasizes transparency and reproducibility in model development.
Modal
by Modal Labs
Modal is a serverless cloud platform purpose-built for running GPU-intensive Python workloads including ML inference, fine-tuning, and batch processing without managing infrastructure. Developers define compute requirements in Python decorators and Modal handles container orchestration, scaling, and cold-start optimization.
Fireworks AI
by Fireworks AI
Fireworks AI is a production inference platform founded by ex-Google Brain researchers, offering fast and reliable serving for open-weight models with enterprise SLAs. Fireworks specializes in compound AI systems, function calling, and JSON-mode inference, and provides FireFunction—its own fine-tuned function-calling model—alongside hosting for Llama, Mistral, and other popular open models.
PathAI
by PathAI
PathAI develops AI-powered pathology solutions that enable more accurate cancer diagnosis, biomarker assessment, and drug development support by analyzing histopathology images at scale. Its AISight platform is deployed in clinical laboratories and pharmaceutical research, improving diagnostic consistency and accelerating oncology trials.
Snorkel AI
by Snorkel AI
Snorkel AI commercializes weak supervision and programmatic data development research from Stanford AI Lab, enabling teams to build, manage, and iterate on AI training datasets programmatically at scale. Its platform reduces reliance on manual labeling by using labeling functions and foundation model assistance.
IBM Watson / watsonx
by IBM
IBM Watson, now branded as IBM watsonx, is IBM's enterprise AI platform offering governed, trustworthy AI for regulated industries. The watsonx.ai studio, watsonx.data lakehouse, and watsonx.governance suite provide a complete enterprise AI development and deployment pipeline with strong emphasis on explainability, fairness, and compliance for sectors like finance, healthcare, and government.
Oracle AI
by Oracle
Oracle AI provides a suite of generative AI services built into Oracle Cloud Infrastructure (OCI), including the OCI Generative AI Service powered by Cohere and Meta models. Oracle has uniquely integrated AI capabilities directly into its database (Oracle Database 23ai), ERP, and industry cloud offerings, targeting enterprises with existing Oracle relationships.
Zhipu AI (GLM)
by Zhipu AI
Zhipu AI is a Chinese AI company spun out of Tsinghua University's KEG Lab, known for the GLM (General Language Model) series. Its ChatGLM models were among the first high-quality open Chinese language models and have been widely adopted in Chinese industry and research communities.
Adept AI
by Adept AI
Adept AI builds AI systems that can take actions in software to complete complex multi-step workflows on behalf of users. The company focuses on general-purpose action models trained to interact with real-world software interfaces through browser and desktop automation.
Recursion Pharmaceuticals
by Recursion Pharmaceuticals
Recursion Pharmaceuticals is a clinical-stage techbio company that combines automated biology, large-scale imaging, and machine learning to industrialize drug discovery, operating one of the largest biological datasets in the industry. Its Recursion OS platform maps biological relationships at unprecedented scale to identify novel therapeutic targets and drug candidates.
Helicone
by Helicone
Helicone is an open-source LLM observability and monitoring platform that provides a single proxy endpoint for logging, tracking costs, debugging, and improving LLM applications across all major model providers. It integrates with a one-line code change and supports caching, rate limiting, and prompt management.
Insilico Medicine
by Insilico Medicine
Insilico Medicine is an AI-driven drug discovery company that has become the first to advance an AI-designed small molecule into Phase II clinical trials, demonstrating end-to-end AI-powered drug development from target identification through IND. Its Chemistry42 and PandaOmics platforms generatively design and screen drug candidates.
SambaNova Systems
by SambaNova Systems
SambaNova Systems builds reconfigurable AI hardware and software solutions optimized for enterprise-scale LLM training and inference, offering its Samba-1 model and SambaNova Cloud API as commercial services. The company's Reconfigurable Dataflow Unit (RDU) architecture is designed specifically for deep learning workloads.
xAI
by xAI
xAI is Elon Musk's AI company and creator of the Grok model family. It provides API access to Grok models with real-time web search integration, available through the xAI API and X (Twitter) platform. Grok models are trained on a broad mix of web and social data and emphasize up-to-date knowledge and a less restrictive response style.
Vast.ai
by Vast.ai
Vast.ai is a peer-to-peer GPU marketplace connecting researchers and startups with spare GPU capacity from data centers and individuals worldwide. It offers some of the cheapest GPU rental prices on the market with flexibility to choose hardware by price, latency, or reliability score. Best suited for cost-sensitive experimentation and training runs.
Together AI (GPU Compute)
by Together AI
Together AI's compute platform provides on-demand and reserved GPU clusters for training and fine-tuning open-source models. It offers H100 and A100 clusters with high-bandwidth networking optimized for distributed training runs, serving as both a GPU cloud provider and an inference platform. Teams use Together AI compute to run multi-node training jobs on Llama and Mistral variants.
Together AI
by Together AI
Together AI provides a cloud platform for running, fine-tuning, and deploying open-source language models. It hosts a wide catalog of models from Llama to Mistral and offers serverless inference, dedicated endpoints, and a fine-tuning pipeline. Together AI is popular among developers who want OpenAI-compatible APIs for open-weight models at competitive pricing.
SambaNova
by SambaNova Systems
SambaNova Systems builds custom AI hardware (Reconfigurable Dataflow Units) and offers cloud inference via SambaNova Cloud. It delivers some of the highest inference throughput for large models, including Meta's Llama family, targeting enterprises that need predictable, high-throughput inference at scale.
RunPod
by RunPod
RunPod is a community-driven GPU cloud marketplace offering some of the lowest per-hour prices for NVIDIA and AMD GPUs. It enables developers to rent GPU compute from a distributed network of data centers and deploy containerized workloads instantly. RunPod supports serverless GPU endpoints, making it popular for open-source model inference.
Replicate
by Replicate
Replicate is a platform for running machine learning models in the cloud via a simple API. It hosts thousands of open-source models for image generation, language, audio, and video, deployable with a single API call. Replicate charges per-second of GPU usage and supports deploying custom models as private or public endpoints.
OpenAI
by OpenAI
OpenAI is the leading AI research and deployment company behind the GPT and o-series model families. It offers API access to frontier language models, image generation via DALL-E, speech recognition via Whisper, and an Assistants API for building stateful agent workflows. OpenAI operates both a consumer product (ChatGPT) and an enterprise API platform used by millions of developers.
Modal
by Modal Labs
Modal is a cloud compute platform for running GPU workloads from Python, with a focus on developer ergonomics and serverless scaling. It allows deploying Python functions as GPU-accelerated endpoints with zero infrastructure configuration, automatic scaling to zero, and fast cold-start times. Popular for ML inference, batch jobs, and LLM serving.
Mistral AI
by Mistral AI
Mistral AI is a French AI company known for publishing high-efficiency open-weight models alongside its commercial API offerings. The Mistral and Mixtral model families deliver strong benchmark performance at a fraction of the compute cost of larger models. Mistral's La Plateforme API provides access to both its open-weight and proprietary models.
Meta AI
by Meta
Meta AI is Meta's AI research division, responsible for the Llama model family. Llama 4 and its variants are released under open-weight licenses, enabling local deployment, fine-tuning, and commercial use. Meta provides model weights via Hugging Face and its own download portal, making it the dominant open-weights LLM ecosystem.
Lambda Labs
by Lambda Labs
Lambda Labs provides cloud GPU instances and on-premises GPU servers targeted at AI researchers and ML engineers. Its Lambda Cloud offers on-demand and reserved NVIDIA H100 and A100 instances at competitive rates with a simple developer-friendly interface. Lambda also sells GPU workstations and servers for local development.
Groq
by Groq
Groq offers ultra-low-latency LLM inference through its custom Language Processing Unit (LPU) hardware. The GroqCloud API serves open-weight models including Llama, Mixtral, and Gemma at speeds that far exceed GPU-based inference, making it ideal for real-time agent applications. Groq provides a developer-friendly API compatible with the OpenAI client format.
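Since GroqCloud speaks the OpenAI streaming format, a client consumes Server-Sent Events whose `data:` payloads carry token deltas. A toy parser for that wire format — the event shape shown is the standard OpenAI streaming one, assumed here to apply unchanged:

```python
import json

def parse_sse_chunk(line: str):
    """Extract the token delta from one OpenAI-style SSE line, or None."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None
    event = json.loads(payload)
    return event["choices"][0]["delta"].get("content")

stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
tokens = [t for t in (parse_sse_chunk(line) for line in stream) if t]
print("".join(tokens))  # Hello
```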
Google DeepMind
by Google DeepMind
Google DeepMind is the unified AI research division behind the Gemini model family. It offers API access through Google AI Studio and Vertex AI, covering multimodal reasoning, code generation, long-context understanding up to 2M tokens, and tight integration with Google Cloud services. DeepMind also publishes foundational research in reinforcement learning and scientific AI.
Google Cloud (GPU)
by Google Cloud
Google Cloud offers A100, H100, and TPU v5 instances for AI training and inference via Compute Engine and Vertex AI. Google Cloud's TPU pods provide a distinct competitive advantage for training large models efficiently, while its A3 instances with H100s target demanding training and inference workloads. Deep integration with Vertex AI simplifies the MLOps lifecycle.
FluidStack
by FluidStack
FluidStack aggregates spare GPU capacity from data centers globally, providing an on-demand cloud GPU rental marketplace at competitive rates. It offers H100, A100, and RTX GPU clusters for training and inference with an API-driven provisioning model. FluidStack is used by AI startups for burst compute and cost-efficient long-running training jobs.
Fireworks AI
by Fireworks AI
Fireworks AI specializes in fast, cost-efficient inference for open-source models including Llama, Mistral, and Mixtral families. It offers serverless and on-demand deployment with a focus on production reliability. Fireworks provides an OpenAI-compatible API and supports compound AI systems through its FireFunction tool-calling models.
DeepSeek
by DeepSeek
DeepSeek is a Chinese AI lab that has released competitive open-weight models rivaling frontier closed models at dramatically lower training costs. DeepSeek R1 and V3 demonstrated that mixture-of-experts and reinforcement learning at scale can close the gap with GPT-4-class models. Models are freely available via Hugging Face and a low-cost API.
CoreWeave
by CoreWeave
CoreWeave is a specialized cloud infrastructure provider built exclusively for GPU-intensive AI and ML workloads. It offers on-demand and reserved access to NVIDIA H100, A100, and H200 clusters with high-bandwidth InfiniBand networking. CoreWeave is trusted by AI labs and enterprises for large-scale model training and inference at competitive pricing.
Cohere
by Cohere
Cohere is an enterprise-focused AI company specializing in language models optimized for business applications including search, retrieval-augmented generation, and text classification. Its Command and Embed model families are widely used in enterprise RAG pipelines. Cohere offers private cloud and on-premises deployment options alongside its API.
Cerebras Inference
by Cerebras Systems
Cerebras provides cloud inference powered by its Wafer-Scale Engine (WSE) chip, delivering some of the highest token throughput for large language models. Cerebras Inference serves Llama and other open-weight models with hardware-level advantages that push tokens-per-second beyond what GPU clusters can achieve for certain model sizes.
Baseten
by Baseten
Baseten is a model inference platform for deploying ML models to production with high performance and reliability. It specializes in low-latency serving of open-source LLMs and diffusion models with features like continuous batching, LoRA serving, and speculative decoding. Baseten targets teams that need production-grade inference without managing Kubernetes.
Azure (GPU)
by Microsoft Azure
Microsoft Azure provides ND H100 v5 and NCv3 GPU instances for AI model training and inference, with tight integration into Azure AI Studio, Azure OpenAI Service, and GitHub Copilot infrastructure. Azure is the preferred cloud for enterprises with Microsoft licensing agreements and provides access to OpenAI models via Azure OpenAI Service.
AWS EC2 (GPU)
by Amazon Web Services
Amazon EC2 provides GPU instances (P4, P5, G5, Inf2 families) for AI/ML training and inference at any scale. As the largest cloud provider, AWS offers the broadest ecosystem of managed ML services including SageMaker, Bedrock, and Trainium-based Inf2 instances. Best for enterprises requiring deep AWS integration and compliance certifications.
Anthropic
by Anthropic
Anthropic is an AI safety company and the creator of the Claude model family. Its API provides access to Claude Opus, Sonnet, and Haiku variants, with strong support for long-context reasoning, tool use, and multi-agent workflows via the Claude Agent SDK. Anthropic publishes extensive safety research and pioneered Constitutional AI alignment techniques.
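Tool use works by declaring JSON-schema-typed tools in the Messages API request; the model then responds with structured tool-call blocks. A sketch of the request body using only the standard library — `get_weather` is a hypothetical tool and the model name is a placeholder; the field names follow the publicly documented Messages API shape:

```python
def build_tool_call_request(model: str, prompt: str) -> dict:
    """Messages API body declaring one tool the model may choose to call."""
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "name": "get_weather",  # hypothetical example tool
                "description": "Return the current weather for a city.",
                "input_schema": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ],
    }

req = build_tool_call_request("claude-sonnet-placeholder", "Weather in Paris?")
```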
Alibaba / Qwen
by Alibaba Cloud
Alibaba Cloud's Qwen team releases the Qwen model series, a family of open-weight and API-accessible language models covering dense and mixture-of-experts architectures. Qwen models are competitive on multilingual and coding benchmarks and are available through Alibaba Cloud's DashScope API as well as Hugging Face for local deployment.
AI21 Labs
by AI21 Labs
AI21 Labs is an Israeli AI company known for the Jamba model family, which uses a hybrid SSM-Transformer architecture for long-context efficiency. Its Wordtune product targets writing assistance while the API focuses on enterprise NLP tasks. Jamba 1.6 offers a unique balance of long-context window handling and low inference latency.
01.AI (Yi)
by 01.AI
01.AI is a Chinese AI startup founded by Kai-Fu Lee and the creator of the Yi series of bilingual large language models. Yi models are released as open weights under permissive licenses and have demonstrated strong performance on multilingual benchmarks, positioning 01.AI as a key contributor to the open-source AI ecosystem.
Figure AI
by Figure AI
Figure AI is building general-purpose humanoid robots designed to perform physical labor in warehouses, factories, and logistics environments, powered by a neural network trained with visual data and language models. Its Figure 02 robot, piloted in partnership with BMW and backed by OpenAI, Microsoft, and NVIDIA, is one of the most advanced commercially deployed humanoid platforms.
Lepton AI
by Lepton AI
Lepton AI provides a serverless cloud platform for running open-source AI models and custom workloads with a Pythonic SDK, eliminating infrastructure management overhead for ML teams. Founded by ex-Meta researchers, the platform supports fine-tuning, deployment, and monitoring of models with pay-per-use pricing.
Baichuan
by Baichuan
Baichuan Intelligence is a Chinese AI startup founded by Wang Xiaochuan, the former Sogou CEO, specializing in large language models with applications in healthcare and enterprise workflows. Its Baichuan 2 model series is notable for strong Chinese-language performance and vertical-specific fine-tuning capabilities.
Cerebras
by Cerebras Systems
AI compute provider with wafer-scale chips delivering record-breaking inference speeds for LLMs.
Inflection AI
by Inflection AI
Inflection AI was co-founded by Mustafa Suleyman (ex-DeepMind) and Reid Hoffman, initially building the Pi personal AI assistant. After a major leadership transition to Microsoft in 2024, the remaining company pivoted to enterprise AI services, offering its Inflection 3 model and AI consulting for large organizations.
Mozilla AI
by Mozilla
Mozilla AI is a startup launched by the Mozilla Foundation to build open, trustworthy AI tools and advocate for responsible AI development as a counterweight to closed proprietary systems. The organization releases tools like Lumigator (LLM evaluation) and contributes to open-source AI infrastructure aligned with the open web.
AMD Instinct MI350X
by AMD
The AMD Instinct MI350X is a data center GPU designed for high-performance computing and AI workloads. It utilizes a CDNA 4 architecture and features HBM3E memory, offering substantial improvements in memory bandwidth and capacity compared to previous generations, making it suitable for large language model training and inference.
NVIDIA RTX 4090
by NVIDIA
NVIDIA's flagship consumer GPU based on Ada Lovelace. Has become popular for local LLM inference and fine-tuning due to its 24GB GDDR6X memory and high performance-per-dollar ratio, enabling on-premise AI workloads without data center costs.
AMD Instinct MI400A
by Advanced Micro Devices (AMD)
The AMD Instinct MI400A is a data center accelerator designed for high-performance computing and AI workloads. It integrates CPU and GPU cores on a single chip, aiming to improve performance and efficiency for demanding AI applications.
Cerebras Wafer Scale Engine 4 (WSE-4)
by Cerebras Systems
The Cerebras WSE-4 is the fourth generation wafer-scale processor designed specifically for AI compute. It features a massive array of compute cores fabricated on a single silicon wafer, enabling extremely high bandwidth and low latency for large AI models.
AMD Instinct MI400 Series
by Advanced Micro Devices (AMD)
The AMD Instinct MI400 series is a family of data center GPUs designed for high-performance computing and AI workloads. It leverages AMD's next-generation CDNA architecture and offers significant improvements in performance and energy efficiency compared to previous generations, targeting large-scale AI training and inference.
NVIDIA DGX H100
by NVIDIA
The NVIDIA DGX H100 is a purpose-built AI supercomputer, serving as the foundational building block for large-scale AI infrastructure. It integrates eight H100 Tensor Core GPUs with high-speed NVLink interconnects, providing a turnkey solution for the most demanding AI training, inference, and data analytics workloads.
Tesla Dojo D2 Chip
by Tesla
The Tesla Dojo D2 chip is a custom-designed AI accelerator developed by Tesla for training large-scale neural networks used in autonomous driving. It is a key component of Tesla's Dojo supercomputer, aimed at improving the efficiency and speed of AI model training.
NVIDIA B100
by NVIDIA
The NVIDIA B100 is a data center GPU based on the Blackwell architecture, succeeding the H100. It offers substantial performance improvements for AI training and inference, featuring a second-generation Transformer Engine with FP4 precision, and a fifth-generation NVLink interconnect for massive multi-GPU scaling.
NVIDIA Jetson AGX Orin
by NVIDIA
The NVIDIA Jetson AGX Orin is a high-performance System-on-Module (SoM) designed for edge AI and autonomous machines. It delivers up to 275 TOPS of AI performance, integrating an NVIDIA Ampere architecture GPU with Arm CPUs and deep learning accelerators for server-class computing in a power-efficient package.
Graphcore Bow Pod2024
by Graphcore
The Graphcore Bow Pod2024 is a modular AI compute system built for large-scale machine learning. It utilizes Graphcore's Intelligence Processing Units (IPUs) and is specifically engineered to accelerate sparse models, such as graph neural networks and large language models, in data center environments.
Tenstorrent Wormhole GF12
by Tenstorrent
The Tenstorrent Wormhole GF12 is a high-performance AI accelerator built on GlobalFoundries' 12nm process. It features a grid of programmable Tensix cores, RISC-V CPUs, and a high-speed Ethernet fabric for direct chip-to-chip communication, enabling scalable systems for both AI training and inference workloads.
d-Matrix Corsair
by d-Matrix
The d-Matrix Corsair is an in-memory compute platform designed to accelerate AI inference workloads. It leverages digital in-memory compute to achieve high energy efficiency and low latency, targeting applications like recommendation engines and generative AI.
NVIDIA A10G
by NVIDIA
NVIDIA Ampere GPU optimized for graphics and inference workloads. Commonly deployed in AWS G5 instances, offering a cost-effective option for inference, graphics rendering, and video processing at cloud scale.
NVIDIA V100
by NVIDIA
NVIDIA Volta architecture GPU that introduced Tensor Cores to the data center, providing the first dedicated matrix multiply hardware for AI. Powered the first wave of transformer model training including BERT and GPT-2, and became the dominant AI training platform from 2017–2020.
NVIDIA L40S
by NVIDIA
The NVIDIA L40S is a universal data center GPU based on the Ada Lovelace architecture. It features 48GB of GDDR6 memory and combines powerful AI compute, graphics, and media acceleration capabilities, making it a versatile solution for a wide range of workloads from generative AI to professional visualization.
Apple M4 Ultra Neural Engine
by Apple
Apple M4 Ultra's 32-core Neural Engine capable of 38 TOPS, embedded in Apple's highest-end desktop and workstation chips. Combined with up to 192GB unified memory shared between CPU, GPU, and Neural Engine, it enables running large models locally on macOS with exceptional energy efficiency.
Graphcore Bow Pod1024
by Graphcore
The Graphcore Bow Pod1024 is a supercomputing-scale AI system, delivering over 250 PetaFLOPS of AI compute. It leverages 1,024 Bow IPU processors linked by a high-bandwidth fabric, specifically engineered for training massive, next-generation AI models and complex graph analytics workloads at an unprecedented scale.
NVIDIA GB200 NVL72
by NVIDIA
The NVIDIA GB200 NVL72 is a liquid-cooled, rack-scale system designed for exascale AI. It connects 36 Grace Blackwell Superchips, comprising 72 B200 GPUs and 36 Grace CPUs, via fifth-generation NVLink to function as a single massive GPU for training and inferencing on trillion-parameter models with unprecedented performance and energy efficiency.
Google TPU v5p
by Google
Google's fifth-generation Tensor Processing Unit, the TPU v5p, is an AI accelerator designed for training and serving the largest AI models. It offers significant performance gains over its predecessor, featuring liquid cooling, 95 GB of HBM, and support for new data formats like MX4 for enhanced efficiency and scalability in massive pod configurations.
Google TPU v4
by Google
Google's fourth-generation TPU, used internally to train PaLM, LaMDA, and early Gemini models. Features 32GB HBM2 per chip and an optical circuit-switched ICI for flexible pod topology, enabling massive-scale distributed training.
NVIDIA Jetson Orin NX
by NVIDIA
Compact Orin-based Jetson module delivering up to 100 TOPS in a small form factor. Targets robotics, drones, medical devices, and industrial edge AI applications requiring significant AI performance in constrained size, weight, and power envelopes.
Google TPU v5e
by Google
Google's cost-efficient TPU variant optimized for inference and medium-scale training. Offers a better price-performance ratio than TPU v5p for serving workloads, with 16GB HBM2 per chip and excellent throughput for transformer inference.
Google TPU v6 (Trillium)
by Google
Google's sixth-generation TPU, codenamed Trillium, delivering 4.7x compute improvement over TPU v5e. Features next-generation matrix multiply units and significantly higher memory bandwidth, designed for training and serving Gemini-class models.
AWS Inferentia2
by AWS
AWS second-generation custom inference chip with up to 4x higher throughput and 10x lower latency than Inferentia1. Optimized for cost-efficient, large-scale inference of transformer models.
NVIDIA P100
by NVIDIA
NVIDIA Pascal architecture GPU and the first to use HBM2 memory in a data center product. Delivered 10x deep learning performance over its predecessor and was the primary platform for training early deep learning models before the Volta generation.
Google Tensor G4
by Google
Google's fourth-generation Tensor chip powering Pixel 9 smartphones. Features a dedicated TPU-derived neural core enabling on-device Gemini Nano inference for features like live captions, call screening, and generative AI photography without cloud latency.
Intel Meteor Lake NPU
by Intel
Intel's first dedicated Neural Processing Unit embedded in Core Ultra (Meteor Lake) laptop processors. Delivers 10+ TOPS for AI inferencing on Windows AI PCs, enabling background AI workloads like live captioning, noise suppression, and on-device LLM assistance without using GPU/CPU resources.
AWS Trainium2
by AWS
AWS second-generation custom AI training chip delivering up to 4x performance improvement over Trainium. Designed specifically for training large language models on AWS, with tight integration with UltraCluster networking for scale-out training jobs.
Cerebras CS-3
by Cerebras
The Cerebras CS-3 is the AI system built around the Wafer Scale Engine 3 (WSE-3) — the world's largest chip, spanning an entire silicon wafer. The WSE-3 contains 4 trillion transistors and 44GB of on-chip SRAM, eliminating off-chip memory bandwidth as a bottleneck for training large neural networks.
Google TPU v3
by Google
Google's third-generation TPU featuring liquid cooling to sustain higher clock speeds and 32GB HBM per chip. Doubled compute and memory versus TPU v2, enabling training of BERT, T5, and early large language models. Powered many foundational AI research papers at Google Brain and DeepMind.
MediaTek Dimensity 9400 APU
by MediaTek
MediaTek Dimensity 9400's AI Processing Unit — the most powerful mobile NPU in Android smartphones. Delivers 50 TOPS for on-device AI with support for 13B parameter models on-device, enabling private, low-latency AI features for Android flagship devices.
Google TPU v7 Ironwood
by Google
Google's TPU v7 Ironwood is the seventh generation of Google's custom Tensor Processing Units, designed for large-scale AI inference at hyperscaler capacity. Ironwood pods target serving frontier models like Gemini at Google's internal scale and are available to cloud customers via Google Cloud's TPU v7 instances.
Google TPU v6e Trillium
by Google
Google TPU v6e Trillium is Google's sixth-generation TPU with 4x the compute and 3x the memory bandwidth per chip compared to v5e. Trillium is generally available on Google Cloud for both training and inference workloads, offering the most cost-efficient TPU option for teams training Gemma and other open models on Google Cloud.
SambaNova SN40L RDU
by SambaNova Systems
SambaNova's SN40L is a Reconfigurable Dataflow Unit designed for high-throughput LLM inference and training. Its tiered memory architecture — combining on-chip SRAM with off-chip DRAM — allows serving multiple large models simultaneously with industry-leading batch throughput. The SN40L is the hardware underlying SambaNova Cloud's inference API.
NVIDIA RTX 5090
by NVIDIA
The NVIDIA RTX 5090 is NVIDIA's flagship consumer/prosumer GPU in the Blackwell generation, featuring 32GB GDDR7 memory and massive compute for local AI inference and fine-tuning. It allows running 70B quantized models on a single consumer GPU and is the premier choice for developers who need frontier local model capability in a workstation.
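Whether a model fits on a single card is mostly weight-memory arithmetic: parameter count times bits per weight, plus headroom for KV cache and activations. A rough sketch — the 90% headroom factor is an assumption, and KV cache is ignored entirely:

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory only -- ignores KV cache and activations."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

def fits_on_gpu(params_billion, bits_per_weight, vram_gib, headroom=0.9):
    """Crude fit check against usable VRAM (headroom is an assumption)."""
    return weights_gib(params_billion, bits_per_weight) <= vram_gib * headroom

# 70B at 4-bit needs ~32.6 GiB for weights alone, so on a 32GB card
# lower-bit quantization (or offloading) is what makes 70B practical.
print(round(weights_gib(70, 4), 1))
print(fits_on_gpu(70, 3, 32))
```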
NVIDIA H200
by NVIDIA
The NVIDIA H200 is a Hopper-generation GPU with 141GB of HBM3e memory — nearly double the H100's capacity, with substantially higher bandwidth — targeting inference workloads for very large models. The additional memory enables running 70B+ parameter models on fewer GPUs, significantly reducing the cost per inference token for large-scale deployments.
NVIDIA H100
by NVIDIA
The NVIDIA H100 Hopper GPU is the dominant AI training and inference accelerator in production deployments as of 2024–2025. With 80GB HBM3 memory and NVLink 4 support, it delivers several times the compute of the A100. The H100 SXM5 variant connects into 8-GPU HGX nodes via NVSwitch for large model training runs.
NVIDIA GB200 NVL72
by NVIDIA
The GB200 NVL72 is NVIDIA's rack-scale AI system combining 36 Grace CPUs and 72 Blackwell B200 GPUs via NVLink interconnect. It delivers up to 1.44 ExaFLOPS of AI compute in a single rack, targeting hyperscaler-class training of frontier models. The NVL72 represents a fundamental shift from server-level to rack-level GPU system design.
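The headline figure falls out of simple multiplication: 72 GPUs times the per-GPU throughput. A quick check, taking roughly 20 PFLOPS per B200 as the assumed low-precision vendor figure:

```python
def rack_exaflops(num_gpus: int, petaflops_per_gpu: float) -> float:
    """Aggregate rack compute in ExaFLOPS (1 EF = 1000 PF)."""
    return num_gpus * petaflops_per_gpu / 1000

# 72 GPUs x ~20 PFLOPS each (assumed) -> 1.44 ExaFLOPS per rack
print(rack_exaflops(72, 20))
```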
NVIDIA B200
by NVIDIA
The NVIDIA B200 is the first Blackwell-architecture data center GPU, delivering 2.5x the training throughput and 5x the inference performance of the H100. With 192GB of HBM3e memory and NVLink 5 interconnects, it is designed for training and serving trillion-parameter models. The B200 anchors NVIDIA's Blackwell product generation.
NVIDIA A100
by NVIDIA
The NVIDIA A100 Ampere GPU remains widely deployed in cloud and on-premises AI infrastructure for training and inference. With 40GB or 80GB HBM2e memory variants and MIG (Multi-Instance GPU) support for partitioning into up to 7 isolated GPU instances, the A100 is the proven workhorse of many production AI deployments.
Intel Gaudi 3
by Intel
Intel Gaudi 3 is Intel's AI training and inference accelerator designed as a cost-competitive alternative to the NVIDIA H100. It features 128GB of HBM2e memory and 24 200GbE RoCE ports for scale-out connectivity. Gaudi 3 is supported by Intel's Optimum Habana software stack and available via major cloud providers and on-premises.
Groq LPU
by Groq
Groq's Language Processing Unit (LPU) is a deterministic ASIC architecture optimized for sequential transformer inference, eliminating the memory-bandwidth bottlenecks of GPU-based serving. Groq LPU clusters deliver measured token generation speeds of 500+ tokens/second for Llama-class models, significantly outpacing GPU inference for latency-critical applications.
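The practical impact of token throughput is easiest to see as wall-clock time for a full response. A trivial estimator — the 50 tok/s GPU baseline used for comparison is an illustrative assumption:

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second

# A 1,000-token answer: ~2 s at 500 tok/s vs ~20 s at an assumed 50 tok/s.
print(generation_seconds(1000, 500), generation_seconds(1000, 50))
```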
Cerebras WSE-3
by Cerebras Systems
The Cerebras Wafer-Scale Engine 3 (WSE-3) is the world's largest chip, containing 4 trillion transistors on a single 46,225 mm² silicon wafer. Its architecture eliminates the memory bandwidth bottlenecks of conventional GPU clusters for large model inference, achieving industry-leading tokens-per-second throughput for models up to 70B parameters.
AWS Trainium3
by Amazon Web Services
AWS Trainium3 is Amazon's third-generation custom ML training chip, offering significant improvements in training throughput and energy efficiency over Trainium2. Trainium3 instances are available through Amazon SageMaker and EC2, targeting cost-efficient training of large language models for AWS-native AI development teams.
AMD MI325X
by AMD
The AMD Instinct MI325X is an updated Instinct GPU with 288GB of HBM3e memory and improved memory bandwidth over the MI300X. It targets inference workloads for the largest frontier models and positions AMD competitively against the NVIDIA H200 in memory-bound inference scenarios.
AMD MI300X
by AMD
The AMD Instinct MI300X is AMD's flagship AI accelerator featuring 192GB of HBM3 memory, the highest of any GPU when released. This massive memory capacity makes it compelling for inference of 70B+ parameter models and has led to adoption by Microsoft Azure, Oracle, and major AI labs as an H100 alternative.
SambaNova SN40L
by SambaNova
SambaNova's Reconfigurable Dataflow Unit with a three-tier memory hierarchy: on-chip scratchpad, on-package HBM, and off-package DRAM. The unique architecture enables running multiple models simultaneously and excels at efficient mixture-of-experts inference.
Google TPU v2
by Google
Google's second-generation TPU and the first available on Google Cloud. Added training capability (v1 was inference-only), HBM memory for gradient storage, and introduced the concept of TPU Pods — interconnected multi-chip systems enabling distributed training at scale.
Google TPU v1
by Google
Google's first Tensor Processing Unit — the seminal custom AI ASIC that launched the modern era of purpose-built ML hardware. Deployed in 2015 and described publicly in a landmark 2017 ISCA paper, it ran inference for Google Search, Maps, and Translate, delivering 30x performance-per-watt vs contemporary GPUs.
Qualcomm Cloud AI 100
by Qualcomm
Qualcomm's data center AI inference accelerator designed for power-efficient deployment. Based on the same AI architecture as Snapdragon, it delivers competitive inference performance with a focus on power efficiency metrics (TOPS/W) for hyperscale deployments.
NVIDIA K80
by NVIDIA
NVIDIA Kepler-based dual-GPU data center card that became the first widely available cloud GPU for deep learning. Google Colab's original free tier ran on K80s, making it instrumental in democratizing access to GPU-accelerated deep learning for researchers and students worldwide.
Graphcore Bow IPU
by Graphcore
Graphcore's Bow Intelligence Processing Unit using 3D wafer-on-wafer technology. Features a massively parallel MIMD architecture with 1472 processor cores and 900MB on-chip SRAM, designed for graph-structured AI workloads and sparse computation.
Graphcore MK2 IPU (Colossus GC200)
by Graphcore
Graphcore's second-generation Colossus GC200 Intelligence Processing Unit. Featured 1472 IPU-Cores with 900MB on-chip SRAM and used a Bulk Synchronous Parallel (BSP) execution model. Preceded the Bow IPU and established Graphcore's approach to graph-native, SRAM-centric AI compute.
Tenstorrent Grayskull
by Tenstorrent
Tenstorrent's first commercial AI accelerator co-designed by Jim Keller. Built on a RISC-V Tensix processor architecture with a mesh NoC, enabling programmable AI compute. Notable for its open software stack and developer-friendly approach to hardware AI.
Intel Nervana NNP-T1000
by Intel
Intel Nervana Neural Network Processor for Training — Intel's attempt at a purpose-built AI training chip following the 2016 acquisition of Nervana Systems. Featured 32GB HBM2 paired with a large on-die SRAM. Discontinued in 2020 as Intel pivoted focus to the Habana Gaudi line.
Databricks Feature Store - MLflow Integration
by Databricks
The Databricks Feature Store provides a centralized repository for managing and sharing machine learning features. Its integration with MLflow enables seamless tracking of feature usage in ML models, ensuring reproducibility and simplifying model deployment workflows by automatically packaging feature dependencies.
PyTorch Geometric
by PyTorch
PyTorch Geometric (PyG) is a library built upon PyTorch to facilitate the development of graph neural networks (GNNs). It provides data handling utilities, learning methods on graphs and other irregular structures, and benchmark datasets for various graph-related tasks.
TensorFlow Quantum
by Google
TensorFlow Quantum (TFQ) is a library for building quantum machine learning models. It allows researchers to construct and train hybrid quantum-classical models by leveraging TensorFlow's infrastructure for classical computation and quantum simulators or quantum hardware for quantum computation.
LangChain + OpenAI
by LangChain
Native integration between LangChain and OpenAI's GPT models. Provides seamless access to chat completions, embeddings, and function calling through LangChain's unified interface. Supports streaming, tool use, and structured output via the langchain-openai package.
MLflow Databricks Integration
by Databricks
The MLflow integration with Databricks provides a managed MLflow service within the Databricks platform. It simplifies the process of tracking experiments, managing models, and deploying them to production by leveraging Databricks' scalable infrastructure and collaborative environment.
GitHub Copilot + VS Code
by GitHub
GitHub Copilot integrates into VS Code as a first-party extension, delivering inline ghost-text completions, multi-line suggestions, and a dedicated Copilot Chat panel for conversational refactoring, test generation, and documentation. It leverages Codex and GPT-4 models under the hood, with workspace-aware context from open tabs and the current file.
Meta + HuggingFace (Llama)
by Meta AI
Official Meta Llama model weights distributed through the HuggingFace Hub under Meta's community license. Covers Llama 3.1, 3.2, and 3.3 variants from 1B to 405B parameters with full transformers, TGI, and vLLM compatibility. HuggingFace serves as the primary public distribution channel for Meta's open-weight releases.
LangChain + Anthropic
by LangChain
Official LangChain integration for Anthropic's Claude model family. Exposes Claude's extended context window, vision capabilities, and tool use through LangChain's standard chat model interface. Supports streaming and the full Messages API via the langchain-anthropic package.
Pinecone + OpenAI Embeddings
by Pinecone
Direct integration pairing Pinecone's managed vector database with OpenAI's text-embedding-3 models. Commonly used pattern for production RAG systems where OpenAI generates dense vectors and Pinecone handles ANN retrieval at scale. Supports serverless and pod-based indexes with metadata filtering.
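Underneath the SDKs, the pattern is: embed documents and queries into dense vectors, then rank documents by cosine similarity to the query. A toy version of that retrieval step with 2-dimensional stand-in vectors (real embeddings have 1,536+ dimensions, and Pinecone performs this ranking approximately, at scale):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, corpus, k=2):
    """Return the k document ids most similar to the query vector."""
    scored = sorted(corpus.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
print(top_k([1.0, 0.0], docs, k=2))  # ['a', 'b']
```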
W&B + Hugging Face
by Weights & Biases
Weights & Biases integrates directly into Hugging Face Trainer and PEFT via a built-in report_to callback, logging training loss curves, GPU utilization, gradient norms, and hyperparameters to shareable W&B runs. The integration supports sweep-based hyperparameter optimization and artifact versioning for model checkpoints.
TensorFlow Privacy
by Google
TensorFlow Privacy is a library that makes it easier to train machine learning models with differential privacy. It provides TensorFlow optimizers that implement differentially private stochastic gradient descent (DP-SGD), allowing developers to protect the privacy of training data while still achieving good model performance.
vLLM + NVIDIA
by vLLM Project
vLLM's NVIDIA backend leverages CUDA kernels, FlashAttention-2, and PagedAttention to deliver state-of-the-art throughput for LLM inference on NVIDIA A100, H100, and H200 GPUs. The integration supports tensor and pipeline parallelism across multiple GPUs, FP8 quantization alongside FP16/BF16 precision, and CUDA graph capture for minimal per-token latency.
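PagedAttention's core idea is to carve the KV cache into fixed-size blocks allocated on demand from a shared pool, so each sequence maps logical token positions to physical blocks through a block table instead of reserving worst-case contiguous memory. A deliberately simplified toy of that bookkeeping — block size and pool size here are arbitrary, and real vLLM tracks far more state:

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockPool:
    """Toy paged KV-cache allocator: sequences share one free-block pool."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: str, position: int) -> list:
        """Grab a fresh block whenever a sequence crosses a block boundary."""
        table = self.tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:
            table.append(self.free.pop())
        return table

pool = BlockPool(num_blocks=8)
for pos in range(40):                 # decode a 40-token sequence
    pool.append_token("seq-1", pos)
print(len(pool.tables["seq-1"]), len(pool.free))  # 3 blocks used, 5 free
```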
LangSmith + LangChain
by LangChain Inc.
LangSmith provides first-class tracing and evaluation for LangChain pipelines, capturing every LLM call, chain step, and tool invocation with full prompt/response payloads. Teams use the integration to debug production failures, build evaluation datasets, and run automated regression tests against golden traces.
OpenAI + Azure OpenAI Service
by Microsoft Azure
Microsoft Azure's managed deployment of OpenAI models including GPT-4o, o1, and DALL-E 3 with enterprise SLAs, private networking, and regional data residency. Provides the same OpenAI API surface with additional Azure IAM, VNet integration, content filtering, and Azure Monitor observability.
Databricks Feature Store - Feast Integration
by Databricks
The Databricks Feature Store integrates with Feast, an open-source feature store, to streamline feature engineering and management for machine learning workflows. This integration allows users to define, store, and serve features consistently across training and inference, reducing data skew and improving model performance within the Databricks environment.
LangChain + Pinecone
by LangChain
LangChain VectorStore integration for Pinecone's managed vector database. Enables similarity search, MMR retrieval, and metadata filtering within LangChain RAG pipelines. Supports both serverless and pod-based Pinecone indexes via the langchain-pinecone package.
Hugging Face Optimum Intel Extension
by Hugging Face / Intel
Hugging Face Optimum Intel Extension is a toolkit designed to accelerate inference and training of transformer models on Intel CPUs and GPUs. It leverages Intel's Deep Learning Boost (DL Boost) and other hardware features to optimize model performance within the Hugging Face ecosystem.
Cursor + OpenAI
by Anysphere
Cursor is a VS Code fork that uses OpenAI's GPT-4 and o-series models as its reasoning engine for multi-file edits, semantic codebase search, and an agent mode that can autonomously implement features across the entire repository. It offers a Composer panel for multi-file diffs and a codebase-aware chat that indexes the project with embeddings for precise retrieval.
Anthropic + AWS Bedrock
by Amazon Web Services
Anthropic's Claude model family available through Amazon Bedrock's fully managed foundation model service. Provides serverless inference with pay-per-token pricing, AWS IAM authentication, VPC endpoint support, and model evaluation tools. Claude 3.5 Sonnet, Claude 3.5 Haiku, and Claude 3 Opus are all available through the Bedrock API.
TGI + Hugging Face Hub
by Hugging Face
Text Generation Inference (TGI) by Hugging Face is a production-grade inference server that directly loads models from the Hugging Face Hub via model IDs, handling shard downloading, quantization, and OpenAI-compatible endpoint serving in a single Docker command. It implements continuous batching, speculative decoding, and FlashAttention for optimal throughput on Ampere and Hopper GPUs.
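The "single Docker command" can be sketched as an argv list; the image tag, model ID, and quantization flag are illustrative. TGI resolves the `--model-id` against the Hub and downloads shards on startup, so mounting a cache volume avoids re-downloading:

```python
# TGI launch command assembled as an argv list (values illustrative).
cmd = [
    "docker", "run", "--gpus", "all", "--shm-size", "1g",
    "-p", "8080:80",                              # TGI serves on port 80 in-container
    "-v", "/data/hf-cache:/data",                 # persist downloaded model shards
    "ghcr.io/huggingface/text-generation-inference:latest",
    "--model-id", "mistralai/Mistral-7B-Instruct-v0.2",
    "--quantize", "bitsandbytes",                 # optional on-the-fly quantization
]
print(" ".join(cmd))
```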
Ollama + Docker
by Ollama
Ollama's official Docker image provides a self-contained environment for running large language models locally. It enables developers to easily deploy and manage quantized GGUF models using familiar container orchestration tools like Docker Compose and Kubernetes, supporting GPU acceleration and an OpenAI-compatible API.
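A sketch of the container workflow the entry describes, assembled as argv lists; the model name is illustrative. The named volume keeps pulled models across container restarts, and 11434 is Ollama's default API port:

```python
# Run the official Ollama image with GPU access (values illustrative).
run_cmd = [
    "docker", "run", "-d", "--gpus", "all",
    "-v", "ollama:/root/.ollama",    # model store survives container restarts
    "-p", "11434:11434",             # Ollama's default API port
    "--name", "ollama", "ollama/ollama",
]
# Pull a quantized GGUF model inside the running container:
pull_cmd = ["docker", "exec", "ollama", "ollama", "pull", "llama3"]
print(" ".join(run_cmd))
```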
MCP + GitHub
by Anthropic / GitHub
Integrates the MCP environment with GitHub's REST and GraphQL APIs, enabling programmatic control over software development workflows. Users can manage repositories, track issues, review pull requests, and search code directly from an agent context, streamlining development tasks without switching tools.
GitHub Copilot + JetBrains
by GitHub
The GitHub Copilot plugin for JetBrains IDEs integrates AI-powered code completion and a conversational chat panel directly into the editor. It provides inline, ghost-text suggestions and mirrors the functionality of the VS Code extension, adapting to JetBrains' native keymaps and user interface for a seamless experience across IDEs like IntelliJ IDEA and PyCharm.
MCP + Filesystem
by Anthropic
The Anthropic MCP Filesystem server allows AI agents, like Claude, to interact directly with a user's local files. It exposes a secure API for reading, writing, listing, and searching files and directories, enabling agents to perform tasks such as code analysis, data processing, and file organization on the host machine.
LangChain + Chroma
by LangChain
LangChain VectorStore integration for Chroma, the open-source AI-native embedding database. Ideal for local development and prototyping with zero infrastructure setup. Supports persistent and in-memory collections, metadata filtering, and relevance-scored retrieval via langchain-chroma.
LangChain + Google AI
by LangChain
This integration connects the LangChain framework with Google's advanced AI services, including the Gemini API via Google AI Studio and models on Vertex AI. It enables developers to build sophisticated applications leveraging multimodal capabilities for processing text and images, advanced function calling for tool use, and grounding responses with Google Search for accuracy.
Google AI + Vertex AI
by Google Cloud
Vertex AI is Google Cloud's managed machine learning platform for deploying and scaling AI applications. It provides an enterprise-grade environment for using Google's foundation models like Gemini and PaLM, adding MLOps tooling, security controls, and deep integration with the Google Cloud ecosystem. This includes features like model tuning, evaluation, and grounding with Google Search.
LangChain + HuggingFace
by LangChain
This integration connects LangChain with the HuggingFace ecosystem, enabling the use of thousands of open-source models. It allows developers to call models via the HuggingFace Inference API, run local inference using the `transformers` library, and generate embeddings, all within LangChain's structured framework for building complex LLM applications.
TensorRT-LLM + NVIDIA Triton
by NVIDIA
TensorRT-LLM optimizes large language models into fused CUDA kernels, while the Triton Inference Server orchestrates serving. Together, they form NVIDIA's production stack for maximizing token throughput and minimizing latency on datacenter GPUs, enabling high-performance, scalable LLM inference.
LangGraph + LangSmith
by LangChain Inc.
The LangGraph and LangSmith integration provides built-in observability for stateful agent graphs. It automatically captures every node execution, state change, and tool call as a structured trace in LangSmith, enabling deep, step-by-step debugging, performance analysis, and regression testing of complex agent workflows.
CrewAI + LangChain
by CrewAI / LangChain
This integration enables CrewAI agents to leverage the entire LangChain tool ecosystem. CrewAI orchestrates multi-agent workflows by assigning roles and delegating tasks, while LangChain provides the foundational tools for capabilities like web search, code execution, vector store retrieval, and API connectivity.
Ray Serve + GCP
by Anyscale
Ray Serve deploys scalable model serving applications on Google Cloud Platform using GKE and Vertex AI infrastructure, with Ray's distributed runtime managing replica placement, traffic splitting, and resource scheduling across GPU node pools. The integration supports multi-model serving graphs, A/B rollouts, and seamless scale-to-zero on GCP Spot instances for cost optimization.
LlamaParse + LlamaIndex
by LlamaIndex
LlamaParse is a proprietary parsing service for complex documents like PDFs with embedded tables and charts. Its first-party integration with the open-source LlamaIndex framework allows developers to directly ingest parsed, structured objects (Nodes) into advanced Retrieval-Augmented Generation (RAG) pipelines, preserving the original document's rich context.
Helicone + OpenAI
by Helicone
Helicone is an observability platform for LLMs that acts as a proxy for the OpenAI API. It enables developers to monitor usage, track costs, and optimize performance with minimal code changes. Key features include real-time dashboards, request-level caching, rate-limiting, and detailed analytics.
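The "minimal code changes" amount to repointing the OpenAI client at Helicone's proxy and adding an auth header. Sketched as plain config so it runs without the openai package; the base URL and header names reflect Helicone's documented proxy setup but should be checked against current docs, and the key is a placeholder:

```python
# Drop-in Helicone proxy configuration for an OpenAI client (values illustrative).
client_config = {
    "base_url": "https://oai.helicone.ai/v1",   # instead of api.openai.com/v1
    "default_headers": {
        "Helicone-Auth": "Bearer sk-helicone-example",  # placeholder key
        "Helicone-Cache-Enabled": "true",               # opt-in request caching
    },
}
# Real usage: openai.OpenAI(base_url=..., default_headers=...);
# the rest of the application stays unchanged.
print(client_config["base_url"])
```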
MCP + Slack
by Anthropic / Slack
This integration connects MCP-compatible AI agents, such as Claude, directly to a Slack workspace. It enables programmatic control over Slack functionalities, allowing agents to read channel histories, post messages, manage channels, and look up user information. The connection is authenticated using a Slack Bot token for secure, automated communication.
MCP + Brave Search
by Anthropic / Brave
An integration that connects the Model Context Protocol (MCP) with Brave's independent search index. It equips AI agents, like Claude, with tools for real-time web, local, and news searches, offering a privacy-focused alternative to Google and Bing for data retrieval and grounding.
LangChain + Weaviate
by LangChain
LangChain integration for Weaviate's open-source vector database. Supports hybrid search (BM25 + vector), multi-tenancy, and generative search modules within LangChain chains and agents. Connects via the Weaviate Python client inside the langchain-weaviate package.
Langfuse + LlamaIndex
by Langfuse
Langfuse integrates with LlamaIndex to provide open-source observability for LLM applications. A simple callback handler captures detailed traces of query engines, retrievers, and LLM calls. This data, including token usage, latency, and custom scores, is visualized in a self-hostable dashboard for comprehensive monitoring.
MCP + Puppeteer
by Anthropic
Official MCP Puppeteer server providing headless Chrome browser control to MCP clients. Exposes tools for page navigation, element interaction, form filling, screenshot capture, and JavaScript execution, enabling Claude to automate complex web workflows that require a real browser environment.
AutoGen + Azure OpenAI
by Microsoft
Integrates the AutoGen multi-agent framework with Azure OpenAI Service for building sophisticated, enterprise-grade AI applications. The connector lets developers leverage Azure's security features, including RBAC and private endpoints, while using standard AutoGen agents such as AssistantAgent and UserProxyAgent for complex, collaborative tasks.
Tabnine + VS Code
by Tabnine
Tabnine's VS Code extension provides AI-powered code completions, including whole-line and full-function suggestions. It is designed for enterprises with strict privacy and data-residency needs, offering on-premise or private cloud deployment options. The AI can be trained on a team's specific codebase for highly relevant completions.
Cline + VS Code
by Community
Cline is an open-source VS Code extension that provides an AI agent with direct access to the IDE's environment. It enables multi-step agentic workflows by allowing the AI to use the file system, terminal, and an integrated browser. The extension supports various models and includes a human-in-the-loop approval process for safety.
LlamaIndex + Qdrant
by LlamaIndex / Qdrant
Native LlamaIndex vector store adapter for Qdrant, enabling index construction, similarity search, and filtered retrieval over Qdrant collections. Supports both in-memory and hosted Qdrant deployments with payload-based metadata filtering.
Unstructured + Pinecone
by Unstructured / Pinecone
This integration provides a direct pipeline from Unstructured's data transformation service to the Pinecone vector database. It automates extracting, cleaning, and chunking data from documents like PDFs and DOCX, then embeds and indexes the content into a Pinecone namespace for use in RAG applications.
MCP + PostgreSQL
by Anthropic
This integration provides a secure, read-only connection to a PostgreSQL database within the MCP environment. It allows agents to perform database introspection, such as listing schemas and describing tables. A key feature is its ability to facilitate natural-language-to-SQL workflows, enabling users to ask questions in plain English and have them translated into safe, read-only SELECT queries for execution.
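The safety model hinges on admitting only read-only SELECT statements. A minimal guard in that spirit — the actual server's validation may differ, and this sketch is deliberately conservative:

```python
def is_read_only_select(sql: str) -> bool:
    """Accept a single SELECT statement; reject anything that could write."""
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:  # reject multi-statement payloads outright
        return False
    first_word = stmt.split(None, 1)[0].upper() if stmt else ""
    forbidden = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER",
                 "CREATE", "TRUNCATE", "GRANT"}
    return first_word == "SELECT" and not any(
        w in forbidden for w in stmt.upper().split()
    )

assert is_read_only_select("SELECT id, name FROM users WHERE age > 21")
assert not is_read_only_select("DROP TABLE users")
```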
LangChain + Ollama
by LangChain
Integrates LangChain with Ollama for fully local LLM inference, letting developers run models like Llama 3 and Mistral on their own hardware and keeping data private by eliminating external API calls. Ideal for building offline-capable, privacy-sensitive applications.
Arize Phoenix + LangChain
by Arize AI
Arize Phoenix integrates with LangChain to provide deep observability for LLM applications. By leveraging OpenTelemetry, it captures and streams traces for chains, agents, and retrievers to a local UI or the Arize cloud. This enables developers to debug applications, detect embedding drift, score retrieval quality, and analyze hallucinations at the span level.
Portkey + Multi-Provider
by Portkey
Portkey's AI gateway unifies over 200 LLM providers through a single OpenAI-compatible API. It enables automatic fallbacks, load balancing, and semantic caching to improve reliability and performance. The platform provides full observability, capturing detailed cost, latency, and metadata for every request.
LangChain + Mistral AI
by LangChain
This integration connects the LangChain framework with Mistral AI's suite of models, including Mistral Large and Codestral. It enables developers to build sophisticated applications by leveraging Mistral's capabilities like function calling, JSON mode, and streaming within LangChain's structured environment for creating agents and chains.
BentoML + AWS
by BentoML
BentoML streamlines deploying machine learning models to the AWS cloud. It packages models and their inference logic into standardized containers, enabling one-command deployment to services like SageMaker, EC2, and ECS. The platform automates production concerns such as auto-scaling, batching, and monitoring.
Windsurf + Anthropic
by Codeium
Windsurf (by Codeium) is an AI-native IDE that integrates Anthropic's Claude models as the backbone of its Cascade agent, which autonomously plans and executes multi-step coding tasks with real-time file and terminal access. The Anthropic integration powers deep context awareness across large codebases and supports long-horizon agent tasks with coherent state tracking.
Claude Agent SDK + MCP
by Anthropic
Anthropic's Claude Agent SDK ships with native Model Context Protocol (MCP) client support, allowing Claude-powered agents to connect to any MCP server and use its exposed tools, resources, and prompts. The integration bridges Claude's tool-use capabilities with the open MCP ecosystem for plug-and-play external integrations.
LangChain + Cohere
by LangChain
LangChain integration for Cohere's enterprise AI platform. Provides access to Command models for generation, Embed v3 for multilingual embeddings, and the Rerank API for RAG pipeline precision improvement. Available via the langchain-cohere package with first-class reranker support.
Sourcegraph + Cody
by Sourcegraph
Sourcegraph Cody combines enterprise-grade code search with an AI coding assistant, letting developers ask questions grounded in the entire codebase indexed by Sourcegraph. The integration uses Sourcegraph's precise code intelligence (SCIP) as a retrieval layer for Cody's Claude-powered chat, delivering context-accurate answers across mono-repos with millions of files.
MCP + Google Drive
by Anthropic / Google
Official MCP Google Drive server granting MCP clients access to Drive file listings, search, and document content reading via OAuth 2.0. Supports Docs, Sheets, Slides, and plain files, enabling agents to retrieve and reason over cloud-stored enterprise documents.
Groq + LangChain
by Groq
LangChain chat model integration for Groq's Language Processing Unit (LPU) inference API. Enables ultra-low-latency LLM calls within LangChain chains and agents with first-token latency under 100ms. Supports Llama 3, Mixtral, and Gemma models served on Groq hardware via the langchain-groq package.
Continue + VS Code
by Continue Dev
Continue is an open-source AI code assistant for VS Code that supports any LLM through a flexible config file, covering inline completions, chat, edit mode, and custom slash commands. Its context providers system lets developers include files, docs, web search results, and terminal output in every prompt, making it highly adaptable to team-specific workflows.
Chroma + HuggingFace
by Chroma
Chroma's built-in embedding function for HuggingFace's sentence-transformers library. Enables fully local embedding generation and vector storage without any API keys. Supports hundreds of pre-trained models from the HuggingFace Hub including all-MiniLM, BGE, and E5 variants.
Qdrant + LlamaIndex
by Qdrant
LlamaIndex VectorStore integration for Qdrant's high-performance vector search engine. Exposes Qdrant's payload filtering, sparse-dense hybrid search, and collection management through LlamaIndex's standard index and query engine abstractions for advanced RAG pipelines.
DeepSeek + Together AI
by Together AI
DeepSeek's open-weight models including DeepSeek-V3 and DeepSeek-R1 served through Together AI's inference cloud at competitive token prices. Provides an OpenAI-compatible API endpoint, enabling drop-in substitution for cost-sensitive workloads. Together AI's custom GPU kernels deliver high throughput for DeepSeek's MoE architecture.
Arize Phoenix + LlamaIndex
by Arize AI
Arize Phoenix instruments LlamaIndex query pipelines with OpenTelemetry spans, exposing retrieval precision, reranker performance, and LLM generation quality in a local-first UI. The integration is particularly valuable for RAG applications where diagnosing retrieval failures requires joint analysis of embeddings, chunks, and generation outputs.
Firecrawl + LangChain
by Firecrawl / LangChain
LangChain document loader built on Firecrawl's web crawling and scraping API, transforming live web content into clean Markdown documents ready for chunking and indexing. Supports full-site crawls, sitemap-driven ingestion, and JavaScript-rendered pages.
MCP + Notion
by Community / Notion
MCP Notion server built on the official Notion API, providing tools for searching pages, reading blocks, creating pages, and updating database entries. Enables Claude and other agents to use Notion as a structured knowledge store within agentic workflows.
Weaviate + Cohere
by Weaviate
Weaviate's built-in text2vec-cohere and reranker-cohere modules for zero-ETL vectorization and result reranking within Weaviate clusters. Automatically embeds documents at write time using Cohere Embed v3 and reranks retrieval results without external orchestration code.
Milvus + LangChain
by Zilliz
LangChain VectorStore integration for Milvus, the open-source distributed vector database. Supports billion-scale ANN search, multiple index types (IVF_FLAT, HNSW, DiskANN), and collection-level partitioning through LangChain's unified retriever interface via the pymilvus client.
PydanticAI + Anthropic
by Pydantic
PydanticAI's native Anthropic model provider, enabling type-safe agentic workflows backed by Claude models. Agent inputs, tool call parameters, and structured outputs are all validated through Pydantic schemas, with full support for Claude's extended tool use and streaming responses.
SmolAgents + HuggingFace
by HuggingFace
SmolAgents is HuggingFace's minimal agent framework that defaults to code-writing agents powered by HuggingFace-hosted open-source models. The integration allows seamless use of models from the HuggingFace Hub (Qwen, Mistral, LLaMA) through the Inference API or local transformers without API key lock-in.
LlamaFile + Local Execution
by Mozilla
LlamaFile by Mozilla and Justine Tunney bundles a complete LLM with its runtime into a single self-contained executable that runs on Linux, macOS, Windows, FreeBSD, NetBSD, and OpenBSD without any installation. It embeds a compressed GGUF model and a llama.cpp backend into a polyglot binary (ZIP + ELF/Mach-O), serving an OpenAI-compatible HTTP API on localhost at startup.
MCP + Sentry
by Community / Sentry
MCP Sentry server exposing Sentry's error tracking and performance monitoring data to MCP-compatible agents. Agents can list recent issues, retrieve stack traces, inspect breadcrumbs, and query performance data, enabling AI-powered incident triage and root cause analysis workflows.
Swarm + OpenAI
by OpenAI
OpenAI's experimental Swarm framework natively targets the OpenAI Chat Completions API for lightweight, stateless multi-agent handoffs. Agents are plain Python functions decorated with tool schemas; the framework manages context passing and agent-to-agent transfers through the standard OpenAI function-calling interface.
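The handoff pattern the entry describes can be reduced to plain Python: a tool returns another agent, and the run loop transfers control. The class and function names below are illustrative, not Swarm's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:                       # illustrative stand-in for a Swarm agent
    name: str
    instructions: str
    tools: list = field(default_factory=list)

def transfer_to_billing() -> "Agent":
    """Tool whose return value signals a handoff."""
    return billing_agent

billing_agent = Agent("Billing", "Handle invoices and refunds.")
triage_agent = Agent("Triage", "Route the user.", tools=[transfer_to_billing])

# Minimal run loop: if a tool returns an Agent, switch the active agent.
active = triage_agent
result = triage_agent.tools[0]()
if isinstance(result, Agent):
    active = result
print(active.name)  # → Billing
```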
Mistral AI + AWS Bedrock
by Amazon Web Services
Mistral AI's Mistral Large and Mistral Small models available through Amazon Bedrock for serverless inference. Provides AWS-native access to Mistral's frontier models with pay-per-token pricing, IAM-based auth, and Bedrock Guardrails — enabling EU-origin AI capabilities within AWS infrastructure without a separate Mistral API account.
Braintrust + Anthropic
by Braintrust Data
Braintrust wraps the Anthropic SDK to automatically trace every Claude API call and funnel results into structured eval datasets. Developers can run model-graded scoring, regression suites against golden datasets, and A/B comparisons between Claude model versions directly from the Braintrust dashboard.
pgvector + Django
by pgvector
pgvector-django package adding native vector similarity search to Django's ORM via PostgreSQL's pgvector extension. Adds VectorField, IvfflatIndex, and HnswIndex with cosine, L2, and inner product distance operators. Enables AI-powered search inside existing Django applications without a separate vector DB.
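What the ORM-level VectorField and HnswIndex boil down to at the database layer can be sketched with pgvector's documented SQL operators; the table and vector literal are illustrative:

```python
# SQL underlying the Django-level abstractions (table/values illustrative).
ddl = [
    "CREATE EXTENSION IF NOT EXISTS vector;",
    "CREATE TABLE docs (id serial PRIMARY KEY, embedding vector(3));",
    "CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);",
]
# pgvector distance operators: <=> cosine, <-> L2, <#> negative inner product.
query = "SELECT id FROM docs ORDER BY embedding <=> '[0.1,0.2,0.3]' LIMIT 5;"
print(query)
```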
Marker + ChromaDB
by VikParuchuri / ChromaDB
Combines Marker's high-fidelity PDF-to-Markdown conversion with ChromaDB's local-first vector store for lightweight, self-hosted RAG pipelines. Ideal for on-device or air-gapped deployments where cloud vector stores are unavailable.
Agency Swarm + OpenAI
by VRSEN
Agency Swarm is built on top of the OpenAI Assistants API, wrapping it with agency-level abstractions for defining communication flows between specialized agents. It provides a higher-level interface for creating persistent agent threads, shared tool registries, and structured agent communication protocols.
Jina Reader + PGVector
by Jina AI / PostgreSQL
Routes Jina Reader's URL-to-text extraction through PostgreSQL's pgvector extension for SQL-native RAG storage. Enables teams already running PostgreSQL to add vector search without adopting a separate vector database, keeping the stack simple.
Opik + LangChain
by Comet ML
Opik by Comet provides an open-source LLM observability platform that integrates with LangChain via a callback handler, recording traces, token counts, and custom scores into a queryable dataset. The integration includes built-in hallucination and answer-relevance evaluators that run automatically on captured traces.
Docling + Weaviate
by IBM / Weaviate
Combines IBM's Docling document conversion library with Weaviate's vector database for structured RAG pipelines. Docling extracts rich document structure (tables, figures, headings) which is then stored as typed Weaviate objects with native vector indexing.
LanceDB + LlamaIndex
by LanceDB
LlamaIndex integration for LanceDB's serverless, embedded vector database built on the Lance columnar format. Supports multimodal data (text, images, video), zero-copy queries, and versioned datasets. Ideal for local or edge AI applications requiring a zero-ops vector store with full LlamaIndex query engine compatibility.
Cohere + AWS SageMaker
by Amazon Web Services
Cohere's Command and Embed models deployed as dedicated SageMaker endpoints for real-time inference with guaranteed throughput. Available through AWS Marketplace as JumpStart models, supporting VPC isolation, auto-scaling, and A/B testing. Preferred for enterprises requiring dedicated capacity and AWS billing consolidation.
Fireworks AI + vLLM
by Fireworks AI
Integration between Fireworks AI's model platform and the vLLM inference engine for on-premises or self-hosted deployment of Fireworks-optimized models. Fireworks packages FireOptimizer-quantized models in formats directly compatible with vLLM's OpenAI-compatible server, enabling enterprise teams to run Fireworks-quality inference on their own GPU infrastructure.
Vespa + Haystack
by deepset
Haystack DocumentStore integration for Vespa, Yahoo's open-source big-data serving engine. Combines Vespa's multi-stage ranking, approximate nearest neighbor search, and real-time indexing with Haystack's RAG pipeline builder. Supports BM25 + dense hybrid retrieval at web scale.
Log10 + OpenAI
by Log10
Log10 provides zero-configuration auto-logging for OpenAI API calls through a context manager that intercepts completions and stores full request/response pairs with automatic tagging. The integration supports user feedback collection, few-shot prompt organization, and GDPR-compliant data masking for PII in logged payloads.
Chunkr + Milvus
by Chunkr / Zilliz
Pairs Chunkr's semantic chunking service with Milvus's high-performance vector database for production-scale RAG. Chunkr splits documents using structure-aware boundaries and Milvus stores the resulting dense vectors with ANN indexing for sub-millisecond retrieval.
Zilliz + Apache Spark
by Zilliz
Connector linking Zilliz Cloud (managed Milvus) with Apache Spark for large-scale batch embedding ingestion and vector ETL pipelines. Enables parallel document embedding across Spark executors with direct write to Zilliz collections, supporting data lake to vector store pipelines at petabyte scale.
Weights & Biases
by Weights & Biases
ML experiment tracking and model monitoring platform. Integrates with all major training frameworks.
Cerebras + LiteLLM
by LiteLLM
LiteLLM proxy integration for Cerebras Inference, enabling Cerebras's wafer-scale chip throughput to be accessed via a unified OpenAI-compatible gateway. Allows developers to route requests to Cerebras's CS-3 hardware — delivering over 2000 tokens/second on Llama 3.1 70B — from any existing OpenAI SDK integration through LiteLLM's model aliases.
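The model-alias routing can be sketched as the dict form of LiteLLM's usual YAML proxy config. The `cerebras/` provider prefix follows LiteLLM's convention; the alias name is arbitrary and the `os.environ/...` string is LiteLLM's syntax for reading a key from the environment:

```python
# LiteLLM proxy config (dict form of the YAML) routing an alias to Cerebras.
proxy_config = {
    "model_list": [
        {
            "model_name": "fast-llama",               # alias clients request
            "litellm_params": {
                "model": "cerebras/llama3.1-70b",     # routed to Cerebras Inference
                "api_key": "os.environ/CEREBRAS_API_KEY",
            },
        }
    ]
}
# Clients then call the proxy with model="fast-llama" via any OpenAI SDK.
print(proxy_config["model_list"][0]["model_name"])
```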
Turbopuffer + Vercel
by Turbopuffer
Integration connecting Turbopuffer's serverless vector database with Vercel's deployment platform. Turbopuffer stores vectors on object storage with sub-100ms cold query latency, making it viable for Vercel serverless functions and Edge Runtime. Zero infrastructure management for full-stack AI apps on Vercel.
OWASP Top 10 for Agentic Applications
by OWASP Foundation
Security standard for AI agent systems (2026).
EU AI Act Compliance Framework
by European Union
Regulatory framework for AI systems in the EU (Aug 2026).
AP2 (Agent Payments Protocol)
by Google
Autonomous agent commerce with crypto-signed mandates.