Search, filter, and sort across the complete AI ecosystem database.
Full Index: Real-Time Search
Showing 1427 of 1427 entities
by PyTorch
A Python script using PyTorch for training and evaluating image classification models. It provides a modular structure for defining datasets, models, training loops, and evaluation metrics, enabling researchers and practitioners to quickly prototype and deploy image classification solutions.
by Hugging Face
The Hugging Face Transformers training script simplifies the process of training and fine-tuning transformer models for various NLP tasks. It provides a high-level API and pre-built training loops, enabling users to quickly adapt pre-trained models to their specific datasets and objectives.
by NVIDIA
NVIDIA AI provides a comprehensive suite of hardware and software solutions for accelerating AI development and deployment. Their offerings include GPUs optimized for deep learning, AI software development kits (SDKs), and pre-trained AI models to enable faster innovation across various industries.
by LangChain
LCEL is a declarative way to compose chains of language models and other primitives in LangChain. This script demonstrates how to use LCEL to build complex AI pipelines with features like streaming, parallel execution, and retry mechanisms, enabling developers to create robust and scalable AI applications.
by Google
The TensorFlow Model Garden is a repository containing a collection of example implementations for state-of-the-art (SOTA) machine learning models and modeling solutions for TensorFlow. It provides a wide variety of models, pre-trained weights, and scripts to help users quickly prototype and deploy TensorFlow-based AI solutions.
by MLCommons
MLPerf Training is a suite of benchmarks that measure the time it takes to train various machine learning models on different hardware and software platforms. It provides a standardized way to compare the performance of different AI training systems, driving innovation in hardware and software optimization for AI workloads.
by Scikit-learn
A Python script leveraging scikit-learn to comprehensively evaluate machine learning models. It calculates various performance metrics (e.g., accuracy, precision, recall, F1-score, AUC) and generates visualizations (e.g., confusion matrices, ROC curves) to provide insights into model behavior and facilitate informed decision-making.
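As a rough illustration of the metrics named above, the sketch below computes precision, recall, and F1 for binary labels in plain Python. It deliberately avoids scikit-learn calls, and the function name `binary_prf` is an assumption made for this example, not part of the listed script.

```python
def binary_prf(y_true, y_pred):
    """Return (precision, recall, f1) for 0/1 labels, treating 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Undefined ratios (no predicted or no actual positives) are reported as 0.0 here, which is one common convention among several.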
by Amazon Web Services (AWS)
Amazon SageMaker is a fully managed machine learning service that enables data scientists and developers to build, train, and deploy machine learning models quickly. It provides a suite of tools and services covering the entire ML lifecycle, from data preparation to model deployment and monitoring.
by Google
The TensorFlow Model Optimization Toolkit script provides tools and techniques to optimize TensorFlow models for deployment, including quantization, pruning, and clustering. It reduces model size and improves inference speed, making models more suitable for edge devices and resource-constrained environments.
by Stability AI
This script provides a streamlined method for performing image generation using Stable Diffusion XL Turbo. It leverages optimized inference techniques to achieve faster generation speeds, making it suitable for real-time applications and interactive experiences.
by Databricks
Databricks is a unified data analytics platform built on Apache Spark, providing tools for data engineering, data science, and machine learning. It enables organizations to process large datasets, build and deploy ML models, and collaborate across teams.
by Stanford Center for Research on Foundation Models (CRFM)
HELM is a living benchmark designed to provide a comprehensive and holistic evaluation of language models across a wide range of scenarios and metrics. It aims to move beyond single-number evaluations by assessing models on factors like truthfulness, calibration, fairness, robustness, and efficiency, providing a more nuanced understanding of their capabilities and limitations.
by Databricks
The Databricks Feature Store provides a centralized repository for managing and sharing machine learning features. Its integration with MLflow enables seamless tracking of feature usage in ML models, ensuring reproducibility and simplifying model deployment workflows by automatically packaging feature dependencies.
by AssemblyAI
AssemblyAI provides a Speech-to-Text API that allows developers to transcribe audio and video files with high accuracy. Their platform offers features like speaker diarization, sentiment analysis, and content moderation, making it a comprehensive solution for audio intelligence.
by Allen Institute for AI (AI2)
The AI2 Reasoning Challenge (ARC) is a question-answering dataset designed to encourage research in advanced question-answering. It consists of grade-school science questions specifically crafted to require reasoning beyond simple fact retrieval, posing a significant challenge for AI models.
by PyTorch
PyTorch Geometric (PyG) is a library built upon PyTorch to facilitate the development of graph neural networks (GNNs). It provides data handling utilities, learning methods on graphs and other irregular structures, and benchmark datasets for various graph-related tasks.
by Databricks
The MLflow integration with Databricks provides a managed MLflow service within the Databricks platform. It simplifies the process of tracking experiments, managing models, and deploying them to production by leveraging Databricks' scalable infrastructure and collaborative environment.
by Stanford AI Lab
RoboSuite is a simulation framework and benchmark suite for robot learning. It provides a standardized set of environments and tasks for training and evaluating reinforcement learning algorithms in robotics, focusing on manipulation and locomotion tasks with realistic physics and sensor models.
by Allen Institute for AI (AI2)
The AI2 Reasoning Challenge (ARC) is a question-answering dataset designed to evaluate advanced reasoning capabilities in AI systems. It consists of elementary-level science questions specifically crafted to be difficult for retrieval-based methods and require deeper understanding and reasoning to answer correctly.
by ImageNet / Stanford Vision Lab
The canonical large-scale visual recognition benchmark containing 1.28 million training images across 1,000 object categories. ImageNet-1K underpins the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and has driven the majority of deep learning breakthroughs in computer vision since 2012.
by AMD
The AMD Instinct MI350X is a data center GPU designed for high-performance computing and AI workloads. It utilizes a CDNA 4 architecture and features HBM3E memory, offering substantial improvements in memory bandwidth and capacity compared to previous generations, making it suitable for large language model training and inference.
by Microsoft
Microsoft COCO (Common Objects in Context) 2017 provides 118K training images with 860K object instances annotated with bounding boxes, segmentation masks, keypoints, and captions across 80 object categories. It remains the primary benchmark for object detection and instance segmentation research.
by OpenAI
Introduced CLIP (Contrastive Language-Image Pre-training), a model trained on 400 million image-text pairs using contrastive learning that achieves remarkable zero-shot transfer to diverse vision tasks. CLIP became foundational for vision-language alignment and generative AI pipelines.
by CompVis / Stability AI
Introduced Latent Diffusion Models (LDMs), which perform the diffusion process in a compressed latent space rather than pixel space, dramatically reducing computational cost while maintaining image quality. This work underpins Stable Diffusion, the most widely used open-source image generation model.
by Google Brain
Introduced chain-of-thought prompting, a simple technique of providing exemplars with step-by-step reasoning traces in few-shot prompts. This approach dramatically improves LLM performance on arithmetic, commonsense, and symbolic reasoning tasks, with the effect emerging at approximately 100B parameters.
by Google AI
Introduced BERT, a bidirectional Transformer pre-trained on masked language modeling and next sentence prediction. Established the pretrain-then-fine-tune paradigm that dominated NLP for years and achieved state-of-the-art on 11 NLP benchmarks.
by Stanford University
The GENIE Benchmark is a comprehensive dataset for evaluating the performance of text-to-SQL models. It includes a diverse set of SQL queries and corresponding natural language questions across multiple domains, designed to assess the generalization capabilities of these models.
by UniProt Consortium (EMBL-EBI / SIB / PIR)
UniProt (Universal Protein Resource) is the world's comprehensive, freely accessible protein sequence and functional information database, maintained by a consortium of EMBL-EBI, SIB, and PIR. It contains over 250 million protein sequences in UniParc, with 570,000+ manually reviewed entries in SwissProt providing expert-curated functional annotations, and serves as the gold-standard training source for protein language models.
by RCSB PDB / wwPDB Consortium
The RCSB Protein Data Bank (PDB) is the single worldwide archive of experimentally determined 3D structures of proteins, nucleic acids, and complex assemblies, currently containing over 220,000 biological macromolecular structures determined by X-ray crystallography, NMR, and cryo-EM. It is the foundational structural dataset for computational biology and was used to train and validate AlphaFold2 and other structure-prediction models.
by Google
TensorFlow Quantum (TFQ) is a library for building quantum machine learning models. It allows researchers to construct and train hybrid quantum-classical models by leveraging TensorFlow's infrastructure for classical computation and quantum simulators or quantum hardware for quantum computation.
by OpenAI
Presents InstructGPT, which uses Reinforcement Learning from Human Feedback (RLHF) to align GPT-3 with human intent. By fine-tuning on human demonstrations and training a reward model on human preference comparisons, the 1.3B-parameter InstructGPT model produces outputs that human evaluators prefer to those of the 175B-parameter GPT-3, despite having over 100× fewer parameters.
by OpenAI
PPO introduces a clipped surrogate objective that constrains policy update step sizes, achieving the stability of trust-region methods (TRPO) with the simplicity and scalability of first-order optimizers. It later became the dominant RL algorithm for training large language models with human feedback.
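The clipped surrogate objective can be sketched in a few lines. This is a minimal plain-Python version that assumes the probability ratios and advantage estimates are already computed. The function name and the default `eps=0.2` are illustrative choices, not taken from this entry.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples.

    ratios: new_policy_prob / old_policy_prob for each sampled action.
    advantages: advantage estimate for each sampled action.
    """
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - eps, min(r, 1.0 + eps))
        # Taking the minimum removes any incentive to push r outside the clip range.
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)
```

With a positive advantage, a ratio of 1.5 contributes only as much as a ratio of 1.2, so the update gains nothing from moving the policy further in that direction.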
by Microsoft Research
Introduces LoRA, which freezes pretrained model weights and injects trainable low-rank decomposition matrices into Transformer layers. Reduces trainable parameters by 10,000× and GPU memory by 3× with no inference latency overhead, enabling efficient LLM fine-tuning.
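A minimal sketch of the low-rank idea follows, using nested lists instead of a tensor library. The frozen weight `W` stays fixed while the product `B @ A`, scaled by `alpha / rank`, is added to its output. All names here are illustrative assumptions, and the parameter saving for a single matrix depends on its shape and the chosen rank.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in              # training W directly
    lora = rank * (d_in + d_out)     # training A (rank x d_in) and B (d_out x rank)
    return full, lora

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x), with W frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]
```

When `B` is initialized to zeros, the adapted layer reproduces the frozen layer exactly at the start of fine-tuning. Because `B @ A` can be merged into `W` after training, no extra latency is needed at inference time.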
by OpenAI
Technical report for GPT-4, OpenAI's multimodal large language model accepting image and text inputs. Demonstrates state-of-the-art performance on academic and professional benchmarks, including scoring around the top 10% of test takers on a simulated bar exam.
by OpenAI
Introduced GPT-3, a 175B parameter language model demonstrating remarkable few-shot learning capabilities across diverse tasks. Showed that scaling model size dramatically improves in-context learning without gradient updates, reshaping the field.
by Google Brain
Introduced the Vision Transformer (ViT), demonstrating that a pure transformer applied directly to sequences of image patches achieves state-of-the-art performance on image classification when pretrained on large datasets. The paper challenged the dominance of convolutional neural networks in computer vision.
by DeepMind
AlphaFold 2 achieves atomic-level accuracy in protein structure prediction by combining evolutionary information from multiple sequence alignments with a novel Evoformer architecture and structure module, solving a 50-year grand challenge in biology. Its predictions have been released for virtually all known proteins and have accelerated drug discovery, enzyme design, and structural biology worldwide.
by Advanced Micro Devices (AMD)
The AMD Instinct MI400A is a data center accelerator designed for high-performance computing and AI workloads. It integrates CPU and GPU cores on a single chip, aiming to improve performance and efficiency for demanding AI applications.
by Deng et al. / Stanford / Princeton
ImageNet (ILSVRC) is the foundational large-scale visual recognition benchmark with 1.2 million training images across 1,000 object categories. Top-1 and Top-5 accuracy on the validation set have been the standard measure of progress in image classification for over a decade.
by Lin et al. / Microsoft
COCO Detection is the standard benchmark for object detection and instance segmentation, featuring 330,000 images with over 1.5 million annotated instances across 80 object categories. Mean Average Precision (mAP) at various IoU thresholds is the primary metric.
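The IoU thresholds mentioned above rest on a simple overlap ratio. The sketch below assumes boxes in corner format `(x1, y1, x2, y2)`, which is a choice made for this example (COCO annotations store boxes as `[x, y, width, height]`). The threshold list reflects the 0.50 to 0.95 range, in steps of 0.05, used for the primary COCO AP.

```python
# IoU thresholds averaged over for the primary COCO AP metric.
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def box_area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def matches_at_thresholds(pred_box, gt_box):
    """For each COCO threshold, report whether the prediction counts as a match."""
    iou = box_iou(pred_box, gt_box)
    return [iou >= t for t in COCO_IOU_THRESHOLDS]
```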
by Wikimedia Foundation / Hugging Face
The processed Wikipedia dataset is a cleaned and tokenized version of Wikipedia dumps covering 20+ languages, available via Hugging Face Datasets. With HTML stripped and paragraph structure preserved, it is one of the most universally used pretraining corpora and a standard knowledge-grounding source for retrieval-augmented generation (RAG) baselines and open-domain QA systems.
by Wikimedia Foundation
The full text dump of Wikipedia articles available in over 300 languages, regularly updated and distributed by the Wikimedia Foundation. It is one of the most universally included components in language model pretraining pipelines due to its high factual density, editorial quality, and broad topical coverage.
by NCBI / NIH
PubChem is the world's largest open chemical database maintained by the NCBI, containing information on over 115 million compounds, 295 million substances, and 270 million bioactivity outcomes from more than 1.2 million assays. It provides standardized molecular structures, properties, and biological activity data freely accessible via REST API and bulk download, making it the canonical resource for cheminformatics and drug discovery research.
by UC Berkeley
Massive Multitask Language Understanding (MMLU) is a benchmark covering 57 academic subjects from STEM to humanities, with 14,000+ multiple-choice questions at undergraduate and professional level. It has become the de facto standard for measuring broad world knowledge and academic reasoning in LLMs.
by OpenSLR / Johns Hopkins University
LibriSpeech is a corpus of approximately 1,000 hours of 16kHz read English speech derived from LibriVox audiobooks. Its training data is split into two "clean" subsets of 100h and 360h and a 500h "other" subset, with dedicated development and test sets. It has become the de facto standard benchmark for English ASR systems.
by OpenAI
Grade School Math 8K is a dataset of 8,500 high-quality linguistically diverse grade school math word problems requiring 2-8 step reasoning. Created by OpenAI, GSM8K is widely used for evaluating multi-step arithmetic reasoning and the effectiveness of chain-of-thought prompting.
by Google
TensorFlow Privacy is a library that makes it easier to train machine learning models with differential privacy. It provides TensorFlow optimizers that implement differentially private stochastic gradient descent (DP-SGD), allowing developers to protect the privacy of training data while still achieving good model performance.
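The core of DP-SGD is per-example gradient clipping followed by Gaussian noise. The sketch below shows that step in plain Python and does not use the TensorFlow Privacy API. The function names and the parameters `max_norm` and `noise_multiplier` are assumptions chosen for illustration.

```python
import math
import random

def clip_gradient(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]
```

A call such as `dp_sgd_gradient(grads, 1.0, 1.1, random.Random(0))` returns the noisy average that an optimizer would then apply. Tracking the resulting privacy budget is a separate accounting step not shown here.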
by Databricks
The Databricks Feature Store integrates with Feast, an open-source feature store, to streamline feature engineering and management for machine learning workflows. This integration allows users to define, store, and serve features consistently across training and inference, reducing data skew and improving model performance within the Databricks environment.
by LangChain
Native integration between LangChain and OpenAI's GPT models. Provides seamless access to chat completions, embeddings, and function calling through LangChain's unified interface. Supports streaming, tool use, and structured output via the langchain-openai package.
by Meta AI
Introduced the Segment Anything Model (SAM) and the SA-1B dataset of 1 billion masks on 11 million images. SAM is a promptable segmentation foundation model that generalizes to new image distributions and tasks without additional training, enabling a new paradigm of interactive segmentation.
by Facebook AI Research
Introduces Retrieval-Augmented Generation (RAG), combining parametric memory (language model weights) with non-parametric memory (dense retrieval over Wikipedia) for knowledge-intensive NLP tasks. RAG models achieve state-of-the-art on open-domain QA benchmarks and produce more specific, factual, and diverse responses than pure parametric models.
by Meta AI
Introduces LLaMA, a collection of foundation language models ranging from 7B to 65B parameters, trained on publicly available datasets. Showed that smaller models trained on more tokens can match or exceed larger models, democratizing LLM research.
by Google DeepMind
Introduced the Gemini family of multimodal models (Ultra, Pro, Nano) natively trained to process and combine text, images, audio, and video. Gemini Ultra is the first model to surpass human expert performance on MMLU and achieves state-of-the-art across 30 of 32 benchmarks evaluated.
by OpenAI
Introduced Codex, a GPT language model fine-tuned on publicly available code from GitHub, and the HumanEval benchmark for measuring code synthesis from docstrings. Codex powers GitHub Copilot and represents a breakthrough in automated programming assistance.
by UC Berkeley
Introduced PagedAttention and the vLLM serving system, which manages the KV cache in non-contiguous physical memory blocks inspired by OS paging, enabling near-zero memory waste and efficient sharing of KV cache across requests. The paper reports 2-4x higher throughput than prior serving systems such as FasterTransformer and Orca at the same level of latency.
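The paging analogy can be made concrete with a toy block table. The class below is a simplified sketch, not vLLM's implementation: it maps each sequence to a list of fixed-size physical blocks allocated on demand, and omits cross-request sharing and copy-on-write. All names are hypothetical.

```python
class PagedKVCache:
    """Toy block-table allocator for KV cache slots."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # all existing blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are allocated only when the previous one fills, the unused space per sequence is bounded by one block, rather than by a preallocated maximum sequence length.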
by AaaS
Generates functional code from natural language descriptions, specifications, or partial implementations. Covers multiple languages and frameworks with support for boilerplate scaffolding, algorithm implementation, and API integration patterns.
by AaaS
Guides LLMs to produce step-by-step reasoning before arriving at a final answer. Dramatically improves performance on math, logic, and multi-step problems by making the model's reasoning process explicit and verifiable.
by Hugging Face
Hugging Face, often described as the GitHub of AI, provides the world's largest open model hub, dataset repository, and ML collaboration platform. Its Transformers library is the de facto standard for working with open-weight models, and the Hugging Face Hub hosts hundreds of thousands of models and datasets. Its Spaces platform allows AI demos to be deployed instantly.
by Panayotov et al. / Johns Hopkins
LibriSpeech is the standard English automatic speech recognition (ASR) benchmark derived from LibriVox audiobooks, containing 1,000 hours of read speech at 16kHz. Word Error Rate (WER) on clean and noisy test splits drives competitive progress in ASR research.
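Word Error Rate is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below assumes whitespace tokenization and no text normalization, both of which real evaluation pipelines usually handle more carefully.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words.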
by MIT Laboratory for Computational Physiology / Beth Israel Deaconess Medical Center
MIMIC-IV (Medical Information Mart for Intensive Care) is a comprehensive de-identified electronic health record database covering over 300,000 patients admitted to Beth Israel Deaconess Medical Center's ICU between 2008 and 2019. It contains detailed clinical data including diagnoses, procedures, medications, laboratory values, and waveforms, enabling a wide range of clinical AI research.
by OpenAI
A curated set of 164 handwritten Python programming problems released by OpenAI, each consisting of a function signature, docstring, reference solution, and unit tests. HumanEval introduced the pass@k metric for functional code correctness evaluation and has become the de facto standard benchmark reported in virtually every code generation model paper.
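The pass@k metric is usually computed with an unbiased estimator: generate n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples passes. A minimal sketch, with the function name chosen for this example:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k).

    n: total samples generated, c: samples that passed, k: evaluation budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over all problems.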
by Midjourney
Midjourney V6 represents a major leap in photorealism, prompt adherence, and artistic coherence, setting a new industry benchmark for AI image generation quality. It introduced native text rendering within images and dramatically improved its understanding of complex, multi-subject prompts.
by Hugging Face / Intel
Hugging Face Optimum Intel Extension is a toolkit designed to accelerate inference and training of transformer models on Intel CPUs and GPUs. It leverages Intel's Deep Learning Boost (DL Boost) and other hardware features to optimize model performance within the Hugging Face ecosystem.
by GitHub
GitHub Copilot integrates into VS Code as a first-party extension, delivering inline ghost-text completions, multi-line suggestions, and a dedicated Copilot Chat panel for conversational refactoring, test generation, and documentation. It leverages Codex and GPT-4 models under the hood, with workspace-aware context from open tabs and the current file.
by Stability AI
Presented SDXL, a significantly improved latent diffusion model architecture featuring a 2.6B parameter UNet backbone with a secondary refiner model, conditioning on image size and crop parameters, and a curated high-aesthetic dataset. SDXL substantially improves visual quality and prompt adherence over prior Stable Diffusion versions.
by Google / Princeton
Introduces ReAct, a paradigm that combines reasoning traces and task-specific actions in language models. By interleaving thinking steps with tool calls, ReAct agents outperform chain-of-thought and act-only baselines on diverse tasks including question answering, fact verification, and interactive decision-making.
by OpenAI
Introduces InstructGPT, fine-tuning GPT-3 with Reinforcement Learning from Human Feedback (RLHF) to follow instructions. A 1.3B InstructGPT model is preferred over 175B GPT-3 by human labelers, establishing RLHF as the dominant alignment technique.
by OpenAI
Presented DALL-E 2 (unCLIP), a hierarchical text-conditional image generation system using CLIP image embeddings as a prior and a diffusion decoder. The system achieves state-of-the-art photorealism and text-image alignment, substantially advancing the field of text-to-image synthesis.
by OpenAI
The system card for GPT-4 with vision (GPT-4V), detailing the model's visual understanding capabilities, safety evaluations, limitations, and mitigation strategies. GPT-4V represents a major advancement in large multimodal models, enabling complex visual reasoning from natural language prompts.
by Stanford University
Introduces FlashAttention, an IO-aware exact attention algorithm that restructures attention computation to minimize memory reads/writes between HBM and SRAM. Achieves 2-4× speedup over standard attention and enables training on much longer sequences.
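Exact attention can be computed block by block because the softmax can be accumulated online, with a running maximum and a running normalizer. The sketch below shows this for a single query with scalar values. It illustrates only the arithmetic, not the HBM/SRAM traffic or the GPU kernel, and the function names are assumptions.

```python
import math

def attention_row_naive(scores, values):
    """Softmax-weighted average computed over the full row at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def attention_row_tiled(scores, values, block_size):
    """Same result, visiting scores one block at a time."""
    running_max = float("-inf")
    normalizer = 0.0   # running sum of exp(score - running_max)
    accumulator = 0.0  # running sum of exp(score - running_max) * value
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial sums to the new maximum (factor is 0.0 on the first block).
        correction = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in s_blk]
        normalizer = normalizer * correction + sum(probs)
        accumulator = accumulator * correction + sum(p * v for p, v in zip(probs, v_blk))
        running_max = new_max
    return accumulator / normalizer
```

The tiled version never holds the full row of attention weights at once, which is what lets the real kernel keep its working set in fast on-chip memory.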
by Community
Leverages knowledge from a source domain to improve model performance on a target domain with limited labeled data. A foundational technique for reducing training costs and accelerating model development across diverse applications.
by AaaS
The foundational discipline of crafting effective prompts to elicit desired behaviors from language models. Covers system prompt design, instruction formatting, output structuring, temperature tuning, and iterative prompt refinement techniques.
by Advanced Micro Devices (AMD)
The AMD Instinct MI400 series is a family of data center GPUs designed for high-performance computing and AI workloads. It leverages AMD's CDNA 4 architecture and offers significant improvements in performance and energy efficiency compared to previous generations, targeting large-scale AI training and inference.
by Amazon
Amazon Web Services is the world's largest cloud provider and offers the most comprehensive set of AI and machine learning services, including Amazon Bedrock for managed foundation model APIs, SageMaker for MLOps, Rekognition for computer vision, and Alexa for voice AI. AWS Bedrock gives enterprises access to models from Anthropic, Meta, Mistral, Cohere, and others through a unified API.
by Meta AI
SA-1B is Meta AI's massive segmentation dataset released alongside the Segment Anything Model (SAM), containing over 1 billion high-quality segmentation masks across 11 million diverse, high-resolution images. It is the largest segmentation dataset ever created and enables training of generalist vision models with strong zero-shot transfer capabilities.
by Google
Google's Open Images V7 is one of the largest existing datasets with object-level annotations, containing approximately 9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives across 600+ object classes.
by UC Berkeley
A challenging benchmark of 12,500 competition mathematics problems from AMC, AIME, and similar competitions across 5 difficulty levels and 7 subjects. Each problem includes a full step-by-step solution in LaTeX, making it suitable for both evaluation and training of mathematical reasoning.
by University of Washington
HellaSwag is an adversarially filtered commonsense NLI benchmark where models must pick the most plausible sentence completion from 4 options. Humans score 95%+ while early LLMs struggled below 50%, making it a robust test of grounded language understanding and commonsense reasoning.
by OpenAI
OpenAI's state-of-the-art open-source automatic speech recognition model trained on 680K hours of multilingual audio. Supports 99 languages with near-human accuracy and includes translation, timestamp, and language detection capabilities.
by Meta AI
Official Meta Llama model weights distributed through the Hugging Face Hub under Meta's community license. Covers Llama 3.1, 3.2, and 3.3 variants from 1B to 405B parameters with full transformers, TGI, and vLLM compatibility. Hugging Face serves as the primary public distribution channel for Meta's open-weight releases.
by University of Wisconsin–Madison / Microsoft Research
Introduced LLaVA (Large Language and Vision Assistant), a multimodal model trained via visual instruction tuning using GPT-4-generated multimodal instruction-following data. LLaVA demonstrates impressive multimodal chat abilities, reaching an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset, and achieves 92.53% accuracy on Science QA when combined with GPT-4, pioneering open-source visual instruction tuning.
by Princeton University / Google DeepMind
Introduced Tree of Thoughts (ToT), a framework that generalizes chain-of-thought prompting to a tree search over intermediate reasoning steps. ToT enables LLMs to explore multiple reasoning paths, evaluate choices, and backtrack, achieving dramatic improvements on tasks requiring lookahead and planning.
by Google Brain
Introduced Switch Transformers, a simplified mixture-of-experts (MoE) architecture that routes each token to exactly one expert (top-1 routing), enabling trillion-parameter models with sub-linear compute scaling. Switch Transformers achieve up to 7x pretraining speedup over a dense T5 model while maintaining model quality.
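Top-1 routing can be sketched as a softmax over router logits followed by an argmax. The chosen expert's output is scaled by its gate probability. This toy version treats tokens as scalars and experts as plain callables. All names are illustrative, and load-balancing losses and capacity limits are left out.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(router_logits):
    """Return (expert_index, gate_probability) for one token."""
    probs = softmax(router_logits)
    expert = max(range(len(probs)), key=lambda i: probs[i])
    return expert, probs[expert]

def switch_layer(token, router_logits, experts):
    """Send the token to exactly one expert and scale its output by the gate."""
    expert, gate = route_top1(router_logits)
    return gate * experts[expert](token)
```

Only one expert runs per token, so compute per token stays roughly constant as more experts (and parameters) are added.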
by Google Brain
Introduced self-consistency, a decoding strategy that samples diverse reasoning paths from a language model and returns the most consistent answer by marginalizing out the reasoning paths. Self-consistency is a simple, training-free technique that substantially improves chain-of-thought prompting across arithmetic and commonsense reasoning tasks.
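In its simplest form, marginalizing over reasoning paths reduces to a majority vote over the final answers extracted from each sampled chain. The sketch below assumes the answers are already extracted as strings. The function name is hypothetical.

```python
from collections import Counter

def self_consistent_answer(sampled_answers):
    """Return (most_common_answer, fraction_of_samples_agreeing)."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)
```

For example, five sampled chains ending in "18", "18", "26", "18", and "20" yield the answer "18" with an agreement of 0.6.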
by OpenAI
Empirically establishes power-law scaling relationships between language model performance and model size, dataset size, and compute budget. Provides the foundational framework for predicting LLM capabilities as a function of scale, guiding research for years.
by University of Washington
Introduces QLoRA, which combines 4-bit quantization with LoRA adapters to fine-tune a 65B LLM on a single 48GB GPU while preserving full 16-bit fine-tuning performance. Introduces NF4 data type and double quantization for extreme memory reduction.
by Mistral AI
Introduces Mistral 7B, a 7B parameter language model outperforming LLaMA 2 13B on all evaluated benchmarks and LLaMA 1 34B on reasoning, mathematics, and code generation. Uses grouped-query attention and sliding window attention for efficient inference.
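Sliding window attention restricts each position to a fixed number of most recent positions, itself included. A minimal sketch of the corresponding boolean mask, with the function name and argument names chosen for illustration:

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Position i sees positions i - window + 1 through i (causal, fixed width).
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

With `seq_len=4` and `window=2`, the last row is `[False, False, True, True]`: the final token attends to itself and one predecessor, so the per-token cache cost is bounded by the window size.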
by Institute of Science and Technology Austria (IST Austria)
Presented GPTQ, a one-shot weight quantization method based on approximate second-order information that can quantize GPT models with 175B parameters to 4-bit or 3-bit precision in approximately four GPU-hours with negligible accuracy loss. GPTQ made large model inference practical on consumer hardware.
by Stanford University / Google
Introduces generative agents—computational software agents that simulate believable human behavior—by combining a large language model with memory streams, reflection synthesis, and planning mechanisms. Twenty-five agents populate a virtual town, exhibiting emergent social behaviors including relationship formation, information propagation, and event coordination.
by Princeton University / Together AI
Extends FlashAttention with improved work partitioning across GPU thread blocks and warps, achieving 2× speedup over FlashAttention and ~9× speedup over standard attention. Enables efficient training of models with context lengths up to 256K tokens.
by Google Research
Introduced speculative decoding, a lossless inference acceleration technique that uses a smaller, faster draft model to propose multiple tokens, then verifies them in parallel with the target model in a single forward pass. This achieves 2-3x speedup without any degradation in output quality or distribution.
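The verification step can be sketched as follows, assuming the draft and target distributions at each position are already available as probability lists. Each draft token is accepted with probability min(1, p/q). On the first rejection a replacement is sampled from the normalized positive part of p - q, and if every draft token is accepted one extra token is sampled from the target. Function and parameter names are illustrative.

```python
import random

def sample(dist, rng):
    return rng.choices(range(len(dist)), weights=dist)[0]

def residual_distribution(p, q):
    """Normalized max(0, p - q); only called after a rejection, where p != q."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff]

def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Return the tokens emitted for one speculative step.

    draft_probs[i] and target_probs[i] are full distributions at position i;
    target_probs has one extra entry used for the bonus token.
    """
    output = []
    for i, token in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[token] / q[token]):
            output.append(token)  # accepted draft token
        else:
            output.append(sample(residual_distribution(p, q), rng))
            return output         # stop at the first rejection
    output.append(sample(target_probs[len(draft_tokens)], rng))
    return output
```

This accept-or-resample rule is what keeps the output distribution identical to sampling from the target model alone. The speedup comes from scoring all draft positions in one target forward pass, which is not modeled here.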
by DeepSeek
DeepSeek-R1 demonstrates that pure reinforcement learning with rule-based rewards—without supervised fine-tuning on chain-of-thought data—can incentivize emergent reasoning capabilities in LLMs including self-verification, reflection, and long chain-of-thought. The model achieves performance comparable to OpenAI-o1 on reasoning benchmarks while being fully open-sourced, triggering a significant industry response.
by OpenAI
This foundational RLHF paper shows that human preference comparisons between agent behaviors can train a reward model that guides deep RL agents in complex tasks like Atari games and MuJoCo locomotion, without hand-crafted reward functions. The approach reduces human labeling effort by ~3 orders of magnitude compared to direct reward specification.
by Meta AI
Introduced Code Llama, a family of large language models for code built on Llama 2 through code-specific pretraining and fine-tuning. Code Llama achieves state-of-the-art performance among open models on HumanEval and MBPP, with variants for Python, instruction following, and long context (100K tokens).
by Anthropic
Presents the Claude 3 family of models (Opus, Sonnet, Haiku), demonstrating state-of-the-art performance on reasoning, vision, and multilingual tasks. Highlights Anthropic's safety techniques including Constitutional AI and RLHF-based alignment.
by MIT / MIT-IBM Watson AI Lab
Introduced AWQ (Activation-aware Weight Quantization), a hardware-friendly low-bit weight quantization approach that protects a small fraction (1%) of salient weights based on activation magnitudes, achieving better performance than GPTQ at 4-bit while being faster and more broadly applicable across model architectures.
by AaaS
Equips AI agents with the ability to select and use appropriate tools from a defined toolkit to accomplish tasks. Covers tool selection logic, input marshalling, output interpretation, and fallback strategies when tools fail or return unexpected results.
by AaaS
Enables LLMs to invoke external functions by generating structured JSON arguments matching defined schemas. Supports parallel function calls, error handling, and chained invocations for complex multi-step tool interactions.
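A minimal dispatcher illustrates the pattern: parse the model's JSON output, check the arguments against a declared schema, and call the matching function. The tool `get_weather`, the registry layout, and the simplified type-based schema are all hypothetical and stand in for a full JSON Schema validator.

```python
import json

def get_weather(city):
    # Hypothetical tool used only for illustration.
    return f"Sunny in {city}"

# Registry mapping tool names to a callable and its required argument types.
TOOLS = {"get_weather": {"fn": get_weather, "required": {"city": str}}}

def dispatch_tool_call(model_output):
    """Validate a JSON tool call emitted by a model and execute it."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return {"error": "model output is not valid JSON"}
    if not isinstance(call, dict) or call.get("name") not in TOOLS:
        return {"error": "unknown tool"}
    spec = TOOLS[call["name"]]
    args = call.get("arguments")
    if not isinstance(args, dict):
        return {"error": "arguments must be a JSON object"}
    for key, expected_type in spec["required"].items():
        if not isinstance(args.get(key), expected_type):
            return {"error": f"argument '{key}' must be {expected_type.__name__}"}
    # Pass only declared arguments so unexpected keys cannot reach the function.
    kwargs = {key: args[key] for key in spec["required"]}
    return {"result": spec["fn"](**kwargs)}
```

Returning errors as data rather than raising lets the caller feed the message back to the model for a retry.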
by Microsoft
Microsoft's autonomous agent within the Copilot ecosystem that operates across Microsoft 365 apps to automate business processes. Handles email triage, meeting preparation, document summarization, and cross-app workflow automation with enterprise-grade security.
by Cerebras Systems
The Cerebras WSE-4 is the fourth generation wafer-scale processor designed specifically for AI compute. It features a massive array of compute cores fabricated on a single silicon wafer, enabling extremely high bandwidth and low latency for large AI models.
by NVIDIA
NVIDIA's flagship consumer GPU based on Ada Lovelace. Has become popular for local LLM inference and fine-tuning due to its 24GB GDDR6X memory and high performance-per-dollar ratio, enabling on-premise AI workloads without data center costs.
by OpenAI
Production-grade ASR pipeline using OpenAI Whisper or faster-whisper with VAD-based chunking, speaker timestamp alignment, and SRT/VTT subtitle export. Handles long-form audio via sliding window segmentation and automatic language detection.
by Microsoft
Microsoft Azure AI is the AI services division of Microsoft's cloud platform, uniquely positioned as the exclusive cloud partner of OpenAI. Through Azure OpenAI Service, enterprises access GPT-4, DALL-E, and Whisper with enterprise-grade compliance and data residency guarantees. Microsoft has deeply integrated AI across its product suite including Copilot for Microsoft 365, GitHub Copilot, and Azure AI Foundry.
by MLCommons
MLPerf Inference is the industry-standard benchmark for measuring AI inference performance across hardware platforms. It covers image classification, object detection, NLP, speech recognition, and generative AI workloads, enabling fair apples-to-apples comparison of accelerators and inference stacks.
by OpenAI
Grade School Math 8K benchmark with 8,500 linguistically diverse grade school math word problems requiring 2-8 step reasoning. Tests basic mathematical reasoning and arithmetic with problems that require sequential multi-step solutions.
by Chollet / ARC Prize Foundation
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) measures fluid intelligence through visual grid transformation puzzles. Models must infer transformation rules from three or fewer examples and apply them to a test grid — a task trivially solved by humans but historically extremely difficult for AI systems.
by Zhou et al. / MIT CSAIL
ADE20K is a standard benchmark for semantic scene parsing, containing 25,000 images densely annotated with 150 semantic categories. Mean Intersection over Union (mIoU) is the standard metric, and the benchmark drives progress in perception systems for autonomous driving, robotics, and scene understanding.
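For readers unfamiliar with the mIoU metric named in this entry, here is a minimal sketch over flattened per-pixel label lists. It assumes integer class IDs and skips classes that appear in neither the prediction nor the ground truth, which is one common convention; benchmark toolkits typically accumulate intersections and unions over the whole dataset rather than per image.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes present in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both label maps
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class from range(num_classes) is present")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # Toy 4-pixel example with two classes.
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```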
by University of Oxford
TruthfulQA measures the truthfulness of LLMs across 817 adversarially crafted questions spanning 38 categories where humans are commonly misled by false beliefs. Models are scored on generating truthful AND informative answers, revealing how larger models can paradoxically become more confidently wrong.
by Stack Exchange
The Stack Exchange Data Dump is a quarterly XML export of all public questions, answers, comments, and votes across the entire Stack Exchange network of 170+ Q&A communities including Stack Overflow. Containing hundreds of millions of high-quality technical and domain-specific Q&A pairs, it is a critical pretraining source for code and reasoning capabilities and a standard retrieval benchmark for dense passage retrieval.
by Allen Institute for AI
The AI2 Reasoning Challenge (ARC) dataset contains 7,787 grade 3–9 science exam questions split into Easy and Challenge partitions. The Challenge set contains questions that require deeper reasoning and world knowledge, making it a reliable signal for advanced language understanding.
by Stability AI
Stability AI's high-resolution image generation model producing photorealistic and artistic images at 1024x1024 resolution. Features a two-stage architecture with a base model and refiner for enhanced detail and compositional quality.
by Meta
Meta's high-performance 70B parameter model closing the gap with proprietary frontier models. Achieved competitive results on major benchmarks while remaining fully open-source.
by vLLM Project
vLLM's NVIDIA backend leverages CUDA kernels, FlashAttention-2, and PagedAttention to deliver state-of-the-art throughput for LLM inference on NVIDIA A100, H100, and H200 GPUs. The integration supports tensor and pipeline parallelism across multiple GPUs, FP8 quantization alongside FP16/BF16 precision, and CUDA graph capture for minimal per-token latency.
by Pinecone
Direct integration pairing Pinecone's managed vector database with OpenAI's text-embedding-3 models. Commonly used pattern for production RAG systems where OpenAI generates dense vectors and Pinecone handles ANN retrieval at scale. Supports serverless and pod-based indexes with metadata filtering.
by LangChain
Official LangChain integration for Anthropic's Claude model family. Exposes Claude's extended context window, vision capabilities, and tool use through LangChain's standard chat model interface. Supports streaming and the full Messages API via the langchain-anthropic package.
by Princeton University
Introduced SWE-bench, a benchmark of 2,294 real GitHub issues from 12 popular Python repositories requiring models to resolve issues by writing code patches. SWE-bench reveals that even the best LLMs resolve fewer than 4% of issues with standard techniques, motivating research into code agents.
by Google
Model Cards introduces a structured framework for documenting machine learning models across intended uses, performance disaggregated by demographic groups, and ethical considerations, enabling informed model selection and deployment decisions. The paper has become an industry standard, with model card adoption by Google, Hugging Face, and most major AI providers.
by OpenAI
Introduced GPT-2, demonstrating that large language models trained on diverse web text can perform zero-shot transfer across many NLP tasks without task-specific fine-tuning. Showed emergent capabilities at scale and sparked debate on responsible AI release.
by Meta AI
Presented DINOv2, a self-supervised vision foundation model trained on a curated dataset of 142 million images using a combination of image-level self-distillation (DINO) and patch-level masked-image-modeling (iBOT) objectives. DINOv2 features serve as universal visual representations, excelling on depth estimation, segmentation, and classification without fine-tuning.
by DeepMind
Challenges the Kaplan et al. scaling laws by showing that model size and training tokens should scale equally. Trains Chinchilla (70B) on 4× more data than Gopher, matching or beating models 4× its size, redefining compute-optimal training strategies.
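The equal-scaling claim in this entry can be turned into back-of-the-envelope arithmetic. The sketch below relies on two commonly quoted approximations that are assumptions here rather than statements from the entry: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a ratio of roughly 20 training tokens per parameter.

```python
def compute_optimal(flops_budget, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens) under C ~ 6*N*D, D = r*N."""
    # Substituting D = r*N into C = 6*N*D gives N = sqrt(C / (6*r)).
    params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    # Budget matching a 70B-parameter model trained on 1.4T tokens.
    budget = 6.0 * 70e9 * 1.4e12
    n, d = compute_optimal(budget)
    print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```

Because both N and D grow with the square root of the budget under these assumptions, quadrupling compute doubles the model size and doubles the token count.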
by LMSYS / UC Berkeley
Introduces Chatbot Arena, a platform for crowdsourced human evaluation of LLMs via pairwise comparisons using an Elo rating system. The arena has collected over 240K human votes across 50+ models, revealing human preference rankings that often diverge from standard benchmark leaderboards and providing a complementary evaluation signal.
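To make the pairwise-comparison scoring concrete, the following is a minimal sketch of a standard Elo update applied after a single vote. The K-factor of 32 and the starting ratings are illustrative assumptions; the arena's actual rating pipeline may differ in its details.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Two hypothetical models start level; model A wins one human vote.
    print(update_elo(1000.0, 1000.0, score_a=1.0))
```

The update is zero-sum: whatever one model gains, the other loses, and an upset win against a higher-rated model moves the ratings more than an expected win.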
by AaaS
Teaches integration and optimization of automatic speech recognition (ASR) systems — from Whisper to streaming cloud APIs — for agentic voice pipelines. Covers language identification, word error rate reduction, punctuation restoration, and handling noisy audio environments.
by AaaS
Analyzes code for bugs, security vulnerabilities, performance issues, and style violations. Provides actionable feedback with severity levels and suggested fixes aligned to language-specific best practices and project conventions.
by Khanmigo (Khan Academy)
An adaptive tutoring agent that dynamically adjusts difficulty, pacing, and instructional modality based on individual learner performance signals. It maintains a persistent knowledge model per student, identifies misconceptions through Socratic questioning, and routes learners to mastery via spaced-repetition scheduling.
by OpenAI
OpenAI's managed agent platform for building custom AI assistants with persistent threads, built-in code interpreter, file search, and function calling. Handles conversation state, tool orchestration, and context management so developers can focus on business logic.
by Intercom
A fully autonomous customer support agent that unifies conversations across chat, email, SMS, and social DMs into a single threaded context window. It resolves tier-1 and tier-2 tickets using a retrieval-augmented knowledge base and maintains CSAT targets through sentiment-aware tone calibration.
by Graphcore
Graphcore is a semiconductor company that develops Intelligence Processing Units (IPUs), a type of microprocessor designed specifically for AI and machine learning workloads. Their IPUs are designed to accelerate training and inference for complex AI models, offering an alternative to GPUs.
by LangChain Inc
LangChain Inc is the company behind the most widely adopted LLM orchestration framework in the AI ecosystem. LangChain provides composable abstractions for building LLM-powered applications, while its LangSmith platform offers observability and evaluation tooling, and LangGraph enables the construction of stateful, multi-actor agent workflows.
by Princeton NLP
Human-validated subset of SWE-bench containing 500 problems verified by software engineers for correctness, clarity, and solvability. Provides a more reliable signal than the full SWE-bench by filtering out ambiguous or under-specified issues.
by UC Berkeley
Collection of 12,500 competition mathematics problems from AMC, AIME, and other math competitions covering algebra, geometry, number theory, combinatorics, and more. Problems require multi-step reasoning and mathematical insight beyond pattern matching.
by Allen AI
Evaluates commonsense natural language inference by asking models to select the most plausible continuation of a scenario. Uses adversarially filtered endings generated by language models, making it challenging for machines while trivial for humans.
by NYU
Graduate-level Google-Proof Question Answering benchmark featuring questions written by domain experts in physics, chemistry, and biology. Questions are designed to be unsearchable, requiring genuine reasoning rather than memorization.
by Mozilla Foundation
Common Voice is Mozilla's crowd-sourced multilingual speech corpus spanning 100+ languages with verified recordings from volunteers. It benchmarks ASR systems on low-resource and diverse language conditions, making it critical for evaluating cross-lingual speech model generalization.
by Allen AI
AI2 Reasoning Challenge featuring grade-school science questions that require commonsense reasoning and world knowledge. The Challenge set contains questions that simple retrieval and co-occurrence methods fail to answer correctly.
by Stanford Center for Legal Informatics
GenLaw is a comprehensive dataset designed for evaluating legal reasoning capabilities of large language models. It contains a diverse set of legal questions, case summaries, and relevant statutes, enabling researchers to assess a model's ability to understand and apply legal principles.
by Allen Institute for AI
WinoGrande is a large-scale crowdsourced dataset of 44,000 Winograd-style fill-in-the-blank commonsense problems, debiased using the AFLITE algorithm to minimize spurious statistical cues. It is significantly harder than the original Winograd Schema Challenge for contemporary NLP models.
by BigCode
An expanded code pretraining dataset containing 3 trillion tokens of source code in 619 programming languages, curated by BigCode from GitHub repositories with permissive SPDX licenses. Version 2 triples the size of the original Stack and includes improved deduplication, opt-out mechanisms for authors, and structured data from GitHub issues and pull requests alongside raw source files.
by U.S. Securities and Exchange Commission
The SEC-EDGAR Filings dataset encompasses over 20 million full-text regulatory filings submitted to the US Securities and Exchange Commission since 1993, including 10-K annual reports, 10-Q quarterly reports, 8-K current reports, and proxy statements from all US public companies. It is the foundational corpus for financial NLP research, sentiment analysis, and financial document AI.
by National Institutes of Health / National Library of Medicine
PubMedCentral Open Access (PMC OA) is a subset of the PMC literature archive made freely available for text mining and NLP research, containing over 4 million full-text biomedical and life science articles. It is the primary corpus used for pretraining biomedical language models such as BioBERT, PubMedBERT, and BioGPT.
by Mozilla
Mozilla's Common Voice 15.0 is the world's largest publicly available multilingual speech corpus, containing over 30,000 hours of validated speech data across 114 languages, all contributed and validated by volunteers. It enables training and evaluation of multilingual and low-resource speech recognition systems.
by Common Crawl Foundation
The world's largest open repository of web crawl data, maintained by the non-profit Common Crawl Foundation and updated with new crawls monthly since 2011. It forms the foundational raw data layer for virtually every major language model pretraining pipeline including GPT-3, LLaMA, PaLM, and Falcon, typically after quality filtering and deduplication steps.
by Cornell University / arXiv
The ArXiv Papers Dataset is a bulk export of over 2.3 million scientific preprints from arXiv spanning physics, mathematics, computer science, biology, finance, and economics, provided by Cornell University and hosted on Kaggle and AWS S3. The full-text LaTeX source and parsed metadata make it a primary pretraining corpus for scientific language models and citation-network research.
by MIT CSAIL
ADE20K is a densely annotated semantic segmentation dataset containing over 27,000 images with pixel-level annotations for 150 semantic categories covering both indoor and outdoor scenes. It is the primary benchmark for scene parsing and semantic segmentation tasks in the computer vision community.
by Meta
Meta's refined 70B model delivering performance comparable to the much larger 405B variant through improved training techniques. Offers the best performance-to-cost ratio in the Llama family.
by Weights & Biases
Weights & Biases integrates directly into Hugging Face Trainer and PEFT via a built-in report_to callback, logging training loss curves, GPU utilization, gradient norms, and hyperparameters to shareable W&B runs. The integration supports sweep-based hyperparameter optimization and artifact versioning for model checkpoints.
by Microsoft Azure
Microsoft Azure's managed deployment of OpenAI models including GPT-4o, o1, and DALL-E 3 with enterprise SLAs, private networking, and regional data residency. Provides the same OpenAI API surface with additional Azure IAM, VNet integration, content filtering, and Azure Monitor observability.
by LangChain Inc.
LangSmith provides first-class tracing and evaluation for LangChain pipelines, capturing every LLM call, chain step, and tool invocation with full prompt/response payloads. Teams use the integration to debug production failures, build evaluation datasets, and run automated regression tests against golden traces.
by BigCode / Hugging Face / ServiceNow
Presented StarCoder, a 15.5B parameter open-source code LLM trained on 1 trillion tokens from The Stack (permissively licensed source code) with fill-in-the-middle capability, fast large-batch inference enabled by multi-query attention, and a commitment to responsible AI through a model card and attribution feature.
by Zhuiyi Technology
Introduces Rotary Position Embedding (RoPE), encoding absolute position information with a rotation matrix and naturally incorporating relative position in self-attention. Adopted by LLaMA, PaLM, and most modern LLMs for its length generalization properties.
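The relative-position property mentioned in this entry can be checked numerically. The sketch below rotates adjacent pairs of vector components by position-dependent angles, using the conventional base of 10000 (an assumption here), and shows that the query-key dot product depends only on the position offset. Some implementations pair the first and second halves of the vector instead of adjacent components; this is a minimal illustration, not any library's implementation.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply a rotary position embedding to an even-length vector."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # The pair starting at index i uses frequency base^(-i/d).
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q, k = [0.3, -1.2, 0.7, 0.5], [1.1, 0.4, -0.6, 0.9]
    # Same offset (2) at different absolute positions gives the same score.
    print(dot(rope(q, 3), rope(k, 1)), dot(rope(q, 12), rope(k, 10)))
```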
by OpenAI
Demonstrated that process-based reward models (PRMs), which provide feedback on each reasoning step, substantially outperform outcome-based reward models (ORMs) for training LLMs to solve mathematical reasoning problems. The paper also introduced PRM800K, a dataset of 800K step-level human feedback labels on MATH solutions.
by Stanford University
Introduces DPO, a stable and efficient alternative to RLHF that directly optimizes a language model on human preference data without an explicit reward model or RL. Achieves comparable or superior alignment results with significantly simpler implementation.
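As a sketch of how the reward-model-free objective works, the per-example DPO loss can be written in terms of sequence log-probabilities under the trained policy and a frozen reference model. The argument names and the default beta of 0.1 are illustrative assumptions; a real implementation would compute these log-probabilities with an autodiff framework and average the loss over a batch.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    # Hypothetical log-probs: the policy already favors the chosen response
    # slightly more than the reference does, so the loss is below log(2).
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy equals the reference, the margin is zero and the loss is log 2; training lowers it by raising the chosen response's log-ratio relative to the rejected one.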
by Microsoft Research / Multiple Institutions
Drawing an analogy to electronics component datasheets, this paper proposes that every ML dataset should be accompanied by a standardized document covering its motivation, composition, collection process, preprocessing, uses, distribution, and maintenance. Datasheets for Datasets has become the foundational standard for dataset transparency and is widely required by major AI venues.
by Anthropic
Introduces Constitutional AI (CAI), a method for training harmless AI assistants using a set of written principles (a 'constitution') to guide both supervised learning and reinforcement learning from AI feedback (RLAIF). CAI enables Anthropic to reduce reliance on human harm labels while maintaining helpfulness and making AI reasoning about harmlessness explicit.
by AaaS
Teaches LLMs to perform tasks by providing a small number of input-output examples in the prompt. Enables rapid task adaptation without fine-tuning by demonstrating the desired pattern through carefully selected, representative examples.
by Community
Predicts user preferences by identifying patterns from collective user-item interaction histories, using memory-based neighborhood methods or model-based matrix factorization and neural approaches. The backbone of recommendation systems at scale across e-commerce, streaming, and social platforms.
by Ahrefs
A fully autonomous SEO agent that continuously crawls a target website, audits technical health, researches high-intent keywords, and generates prioritized optimization recommendations. It tracks ranking movements in real time and surfaces backlink opportunities from competitor gap analysis.
by Salesforce
Salesforce's autonomous AI agent built on the Einstein platform that handles customer interactions, resolves support cases, and automates sales workflows. Operates within the Salesforce ecosystem with full access to CRM data, knowledge bases, and business rules.
by Perplexity AI
AI-powered answer engine that combines real-time web search with LLM synthesis to provide cited, accurate answers. Features multi-step research capabilities, source verification, and conversational follow-up for deep topic exploration.
by d-Matrix
The d-Matrix Corsair is an in-memory compute platform designed to accelerate AI inference workloads. It leverages analog compute to achieve high energy efficiency and low latency, targeting applications like recommendation engines and generative AI.
by Graphcore
The Graphcore Bow Pod2024 is a modular AI compute system built for large-scale machine learning. It utilizes Graphcore's Intelligence Processing Units (IPUs) and is specifically engineered to accelerate sparse models, such as graph neural networks and large language models, in data center environments.
by Tenstorrent
The Tenstorrent Wormhole GF12 is a high-performance AI accelerator built on GlobalFoundries' 12nm process. It features a grid of programmable Tensix cores, RISC-V CPUs, and a high-speed Ethernet fabric for direct chip-to-chip communication, enabling scalable systems for both AI training and inference workloads.
by LMSYS
Multi-turn conversation benchmark with 80 high-quality questions across 8 categories including writing, reasoning, math, coding, and extraction. Uses GPT-4 as an automated judge to evaluate response quality on a 1-10 scale across two conversation turns.
by Jin et al. / UC San Diego
MedQA tests medical knowledge using free-form multiple-choice questions drawn from the US Medical Licensing Examination (USMLE). It evaluates whether language models can reason through complex clinical scenarios requiring deep biomedical knowledge.
by NLLB Team / Meta AI
FLORES-200 is a many-to-many multilingual translation benchmark covering 200 languages, including many low-resource ones. It evaluates machine translation systems across 40,000 language direction pairs, making it the most comprehensive translation benchmark for assessing cross-lingual generalization.
by Oxford Visual Geometry Group (VGG)
VoxCeleb2 is a large-scale speaker recognition dataset containing over 1 million utterances from 6,112 celebrities extracted from YouTube videos in challenging real-world conditions. It is the standard benchmark for speaker verification and diarization research, providing naturalistic conversational speech at scale.
by New York University
SuperGLUE is a benchmark suite of 8 challenging NLU tasks including question answering, coreference resolution, causal reasoning, and word-sense disambiguation, designed as a harder successor to GLUE. It includes human baselines and has driven significant progress in pre-trained language model capabilities.
by Google
A dataset of 974 crowd-sourced Python programming problems suitable for entry-level programmers, each with a problem description, code solution, and three automated test cases. MBPP complements HumanEval by covering a broader variety of programming concepts and is widely used alongside it for comprehensive evaluation of code generation capabilities across model families.
by LAION
The largest openly available image-text pair dataset, containing 5.85 billion CLIP-filtered image-text pairs across English, multilingual, and aesthetic subsets. LAION-5B was the primary training corpus for Stable Diffusion, DALL-E 2 replications, and numerous open vision-language models, enabling the open-source community to train competitive text-to-image generation models.
by Meta AI
FLORES-200 is Meta's few-shot translation evaluation benchmark spanning 200 languages, including many low-resource and endangered ones. Each language contains 1,012 parallel sentences translated from English Wikipedia, covering both devtest and test splits for systematic MT evaluation at scale.
by Stanford ML Group
CheXpert is a large chest X-ray dataset from Stanford containing 224,316 chest radiographs from 65,240 patients with labels for 14 observations mined from radiology reports using an automated labeler. It uniquely addresses label uncertainty with positive, negative, and uncertain labels, making it a challenging and realistic benchmark for automated chest X-ray interpretation.
by NVIDIA / CUHK
CelebA-HQ is a high-quality version of the CelebA face dataset containing 30,000 celebrity images at 1024×1024 resolution with 40 binary attribute annotations. It was introduced alongside Progressive GAN and has become the standard benchmark for high-fidelity face generation and synthesis research.
by Google
Google's AudioSet is a large-scale dataset of manually annotated audio events comprising over 2 million 10-second YouTube clips labeled with a hierarchical ontology of 632 audio event classes. It is the primary benchmark for audio tagging and sound event detection, spanning music, speech, and environmental sounds.
by Suno AI
Suno V3.5 is a text-to-song AI model that generates complete, radio-quality music tracks with vocals, instrumentation, and song structure directly from natural language prompts or custom lyrics. It supports an enormous range of genres and styles and is widely regarded as the most accessible and highest-quality text-to-music system for non-musicians.
by OpenAI
OpenAI's first reasoning model that uses extended internal chain-of-thought before responding. Achieves expert-level performance on competitive math (AIME), PhD-level science (GPQA), and complex coding tasks through deliberative alignment.
by Mistral AI
Mistral AI's breakthrough 7B parameter model that outperformed Llama 2 13B across all benchmarks at launch. Introduced sliding window attention and grouped-query attention for efficient inference.
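The sliding window attention mentioned in this entry restricts each token to a fixed number of recent positions. A minimal sketch of the corresponding boolean attention mask follows; it assumes the window counts the current token itself, which is one possible convention, and uses a toy window size rather than the model's actual setting.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    # Causal (j <= i) and within the last `window` positions (i - j < window).
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


if __name__ == "__main__":
    for row in sliding_window_mask(seq_len=5, window=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Because each row has at most `window` allowed positions, attention cost per token stays constant as the sequence grows, while stacked layers still let information propagate beyond a single window.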
by LangChain
LangChain VectorStore integration for Pinecone's managed vector database. Enables similarity search, MMR retrieval, and metadata filtering within LangChain RAG pipelines. Supports both serverless and pod-based Pinecone indexes via the langchain-pinecone package.
by Anysphere
Cursor is a VS Code fork that uses OpenAI's GPT-4 and o-series models as its reasoning engine for multi-file edits, semantic codebase search, and an agent mode that can autonomously implement features across the entire repository. It offers a Composer panel for multi-file diffs and a codebase-aware chat that indexes the project with embeddings for precise retrieval.
by Meta AI
Presents Toolformer, a model that learns to use external tools (APIs) in a self-supervised manner without requiring human annotations. The model decides which APIs to call, how to call them, and how to incorporate results, achieving strong performance across diverse tasks while maintaining generative language modeling ability.
by Google Research
Introduces PaLM (Pathways Language Model), a 540B parameter model trained on 780B tokens using the Pathways system. Achieved breakthrough performance on reasoning tasks and demonstrated discontinuous performance improvements that define emergent abilities.
by Google Brain
Introduces the Sparsely-Gated Mixture-of-Experts (MoE) layer, enabling 1000× capacity increase with only marginal computational cost increase. A learned gating network selects a sparse subset of expert sub-networks per input, enabling unprecedented model scale.
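The gating idea in this entry can be sketched in a few lines: keep only the k largest gate logits, apply a softmax over them, and evaluate only the selected experts. This simplified version omits the noise term and load-balancing losses used in the paper, takes the gate logits as an input rather than computing them with a learned network, and uses toy scalar experts; all names are illustrative.

```python
import math


def top_k_gate(logits, k):
    """Softmax over the k largest gate logits; every other expert gets 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def moe_forward(x, gate_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    weights = top_k_gate(gate_logits, k)
    return sum(w * experts[i](x) for i, w in enumerate(weights) if w > 0.0)


if __name__ == "__main__":
    # Three hypothetical scalar "experts"; the third is never evaluated.
    experts = [lambda x: x, lambda x: 3 * x, lambda x: 100 * x]
    print(moe_forward(2.0, [0.0, 0.0, -10.0], experts, k=2))
```

Only k experts run per input, so total parameter count can grow with the number of experts while per-input compute stays roughly fixed.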
by Meta AI
Llama 4 introduces a family of natively multimodal mixture-of-experts models—Scout (17B/16 experts), Maverick (17B/128 experts), and Behemoth (288B/16 experts)—pretrained jointly on text, image, and video data. Maverick achieves top scores on vision-language benchmarks while Scout offers 10M-token context at a fraction of the compute of comparable models.
by Stanford CRFM
Presents HELM, a holistic evaluation framework for language models across 42 scenarios and 59 metrics including accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. HELM reveals that no single model dominates across all dimensions and exposes significant gaps between narrow and comprehensive model assessment.
by Anthropic
The Claude Opus 4 technical report details Anthropic's flagship model, highlighting its extended thinking, advanced coding, and agentic capabilities. It showcases top-tier performance on benchmarks like SWE-bench and GPQA, along with significant improvements in safety through Constitutional AI and RLHF.
by Salesforce Research
Presented BLIP-2, which bridges the modality gap between frozen image encoders and frozen LLMs using a lightweight Querying Transformer (Q-Former) trained in two stages. BLIP-2 achieves state-of-the-art VQA performance with significantly fewer trainable parameters than prior methods.
by Community
Predicts future values of sequential, time-indexed data using classical statistical models (ARIMA, ETS), gradient boosting (LightGBM, XGBoost), and deep learning architectures (Transformers, N-BEATS, TFT). Handles trend, seasonality, exogenous covariates, and uncertainty quantification.
by AaaS
Condenses long documents into concise summaries while preserving key information and maintaining factual accuracy. Supports extractive, abstractive, and hierarchical summarization with configurable length, style, and focus area parameters.
by AaaS
Enables meaning-based retrieval by converting queries and documents into dense vector representations and finding nearest neighbors. Foundational skill for any RAG pipeline or knowledge-base-powered agent.
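A minimal sketch of the nearest-neighbor step follows, using cosine similarity over toy vectors. In practice the vectors would come from an embedding model and the search would use an approximate nearest neighbor index; the document IDs and values below are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec, doc_vecs, top_k=1):
    """Return the IDs of the top_k documents most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:top_k]]


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings for three documents.
    docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
    print(nearest([1.0, 0.05], docs, top_k=2))
```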
by Community
Adapts a general-purpose pretrained model to a narrow domain by continuing training on curated domain corpora or instruction datasets. Produces specialized models that outperform generalist baselines on domain-specific benchmarks while preserving broad language understanding.
by Coursera
A recommendation agent that maps learner skill profiles against target competency frameworks and synthesizes the shortest credentialed path to proficiency. It continuously reoptimizes routing as learners complete modules and integrates real-time labor-market signals to prioritize high-value skill sequences.
by Tesla
The Tesla Dojo D2 chip is a custom-designed AI accelerator developed by Tesla for training large-scale neural networks used in autonomous driving. It is a key component of Tesla's Dojo supercomputer, aimed at improving the efficiency and speed of AI model training.
by Community
Production-ready FastAPI template for AI-powered REST APIs, with pre-wired OpenAI/Anthropic client, async streaming endpoints, JWT authentication, rate limiting, structured logging, and OpenAPI docs. Includes Docker Compose stack with Redis rate-limit store and Prometheus metrics.
by Ultralytics
Bootstraps a production-ready object detection workflow using YOLOv8 or RT-DETR, including webcam/video stream ingestion, NMS post-processing, and annotation overlay rendering. Outputs annotated frames and a structured JSON detections log suitable for downstream analytics.
by Google
Google Cloud AI provides enterprise access to Google DeepMind's Gemini models and a comprehensive suite of managed AI services via Vertex AI. As the creator of the Transformer architecture and TensorFlow, Google Cloud offers unmatched AI infrastructure including custom TPUs, a full MLOps platform, and pre-built APIs for vision, speech, and natural language processing.
by ElevenLabs
ElevenLabs is a voice technology research company developing advanced text-to-speech and voice cloning software. Their platform allows users to generate high-quality spoken audio in numerous languages, create custom AI voices, or clone existing ones. It is widely used for audiobooks, video games, and content creation.
by University of Oxford
Measures whether language models generate truthful answers to questions where humans are commonly mistaken. Covers health, law, finance, and politics topics where popular misconceptions and conspiracies create systematic failure modes.
by Greg Kamradt (community)
Needle-in-a-Haystack is a pressure test for long-context language models that places a single fact (the needle) at a specific position within a long document (the haystack) and asks the model to retrieve it. It systematically varies both context length and needle depth to reveal performance degradation patterns.
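The test construction described in this entry can be sketched as follows: insert the needle at a fractional depth in a list of filler sentences, then sweep over context lengths and depths. Measuring length in sentences rather than tokens is a simplification, and the helper names are illustrative assumptions rather than the original test harness.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at fractional `depth` (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    idx = round(depth * len(filler_sentences))
    parts = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
    return " ".join(parts)


def sweep(filler_sentences, needle, lengths, depths):
    """Yield (length, depth, prompt_context) for every grid combination."""
    for n in lengths:
        for d in depths:
            yield n, d, build_haystack(filler_sentences[:n], needle, d)


if __name__ == "__main__":
    filler = [f"Filler sentence {i}." for i in range(8)]
    needle = "The secret code is 4271."
    for n, d, context in sweep(filler, needle, lengths=[4, 8], depths=[0.0, 0.5, 1.0]):
        print(n, d, context.index(needle))
```

Each generated context would then be sent to the model with a question about the needle, and retrieval accuracy recorded per (length, depth) cell.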
by Google Research
Mostly Basic Programming Problems — a collection of 974 crowd-sourced Python programming tasks with natural language descriptions and test cases. Tests foundational programming ability including string manipulation, list processing, and basic algorithms.
by Google DeepMind
Curated subset of 23 challenging BIG-Bench tasks where prior language models performed below average human raters. Specifically designed to test tasks that benefit significantly from chain-of-thought prompting and multi-step reasoning.
by Allen Institute for AI (AI2)
The Semantic Scholar Open Research Corpus (S2ORC) is a large English-language corpus of 136 million academic papers with structured metadata, abstracts, citation graphs, and full-text body paragraphs where licensing allows. Maintained by the Allen Institute for AI, it covers 19 scientific fields and is widely used for scientific NLP tasks including citation prediction, claim verification, and scientific QA.
by LAION
A large-scale, human-annotated dataset of assistant-style conversations collected through the OpenAssistant crowdsourcing platform. Contains over 161,000 messages across 66,000+ conversation trees, with ranked responses for RLHF training.
by Google
The multilingual Colossal Clean Crawled Corpus (mC4) spans 101 languages and contains hundreds of billions of tokens scraped from Common Crawl with language detection and heuristic quality filters. It was used to train mT5 and is one of the largest publicly available multilingual pre-training corpora.
by Stanford University
Stanford Alpaca's 52,000 instruction-following examples generated using the self-instruct technique applied to GPT-3.5 (text-davinci-003). This foundational dataset enabled the creation of the Alpaca 7B model and popularized cost-effective instruction-tuning approaches.
by Alibaba Cloud
The flagship open-weight model in the Qwen 2.5 series, offering substantial improvements in reasoning, instruction following, and structured output over its predecessor. Supports 128K context with strong performance across 29+ languages.
by OpenAI
A compact and cost-efficient reasoning model that delivers strong STEM performance at a fraction of o3's cost. Supports configurable reasoning effort (low/medium/high) to balance speed and accuracy for different use cases.
by Hugging Face
Text Generation Inference (TGI) by Hugging Face is a production-grade inference server that directly loads models from the Hugging Face Hub via model IDs, handling shard downloading, quantization, and OpenAI-compatible endpoint serving in a single Docker command. It implements continuous batching, speculative decoding, and FlashAttention for optimal throughput on Ampere and Hopper GPUs.
by Ollama
Ollama's official Docker image provides a self-contained environment for running large language models locally. It enables developers to easily deploy and manage quantized GGUF models using familiar container orchestration tools like Docker Compose and Kubernetes, supporting GPU acceleration and an OpenAI-compatible API.
by Anthropic / GitHub
Integrates the MCP environment with GitHub's REST and GraphQL APIs, enabling programmatic control over software development workflows. Users can manage repositories, track issues, review pull requests, and search code directly from an agent context, streamlining development tasks without switching tools.
by Amazon Web Services
Anthropic's Claude model family available through Amazon Bedrock's fully managed foundation model service. Provides serverless inference with pay-per-token pricing, AWS IAM authentication, VPC endpoint support, and model evaluation tools. Claude 3.5 Sonnet, Haiku, and Opus are all available through the Bedrock API.
by NVIDIA / Caltech / UT Austin
Presents Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager uses an automatic curriculum, an ever-growing skill library of executable code, and an iterative prompting mechanism to overcome failures.
by Princeton NLP / Princeton Language and Intelligence
Introduces SWE-agent, which defines Agent-Computer Interfaces (ACIs) to enable LLMs to autonomously solve real GitHub issues by browsing codebases, editing files, and running tests. On the SWE-bench benchmark, SWE-agent with GPT-4 Turbo resolves 12.5% of issues, significantly outperforming prior methods.
by Alibaba Cloud / DAMO Academy
Qwen-VL is a large-scale vision-language model series from Alibaba, trained on a curated multilingual multimodal dataset. It supports high-resolution image understanding, visual grounding with bounding boxes, and multilingual text reading, achieving state-of-the-art results on multiple visual benchmarks.
by Alibaba Cloud / Qwen Team
Qwen2.5 is a comprehensive family of open-source LLMs (0.5B to 72B parameters) trained on 18 trillion tokens including significantly expanded coding and mathematics data, achieving state-of-the-art open-source performance on coding (HumanEval), mathematics (MATH), and multilingual benchmarks. The series includes specialized Qwen2.5-Coder and Qwen2.5-Math variants.
by Google Brain
Introduced Imagen, a text-to-image diffusion model that leverages large pretrained language models (T5-XXL) for text understanding combined with cascaded diffusion models for image synthesis. Imagen demonstrated that scaling the text encoder is more impactful than scaling the diffusion model, and the paper introduced DrawBench as a new evaluation benchmark.
by University of Washington / Black in AI
This influential FAccT paper argues that ever-larger language models carry significant risks—including environmental costs, biased training data, and the illusion of meaning—that are often overlooked in the race for benchmark performance. It calls for pausing scaling to focus on documentation, auditing, and community-centered research practices.
by Tsinghua / Peking University / DeepWisdom
Presents MetaGPT, a multi-agent framework that encodes human workflows as Standardized Operating Procedures (SOPs) for LLM agents acting as specialized software roles. By assigning product manager, architect, engineer, and QA roles, MetaGPT produces complete, executable codebases from natural language requirements with higher quality than prior approaches.
by Microsoft Research
Presents GraphRAG, which uses LLM-generated knowledge graphs and community detection to enable query-focused summarization over entire text corpora. Unlike standard RAG which answers local questions from text chunks, GraphRAG enables global sensemaking queries by reasoning over interconnected entity communities at multiple granularities.
by Google Research
Introduces Grouped-Query Attention (GQA), an efficient attention mechanism that generalizes Multi-Head and Multi-Query Attention. GQA groups query heads to share key and value heads, drastically reducing the KV cache size and memory bandwidth, which accelerates inference speed while maintaining near Multi-Head quality.
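The head-sharing idea can be illustrated with a small sketch. It only shows how query heads map onto shared key/value heads and how that shrinks the KV cache; it does not compute attention. The function names and the example sizes (32 query heads, 8 KV heads, and so on) are assumptions chosen for illustration, not values from the paper.

```python
def kv_head_for_query(q_head, num_q_heads, num_kv_heads):
    """Index of the shared key/value head used by query head `q_head`."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def kv_cache_elements(num_kv_heads, head_dim, seq_len, num_layers):
    """Number of cached values; the factor 2 covers keys and values."""
    return 2 * num_kv_heads * head_dim * seq_len * num_layers


# Hypothetical configuration: 32 query heads sharing 8 KV heads.
mapping = [kv_head_for_query(q, 32, 8) for q in range(32)]
print(mapping[:8])  # first two groups of four query heads

mha = kv_cache_elements(num_kv_heads=32, head_dim=128, seq_len=4096, num_layers=32)
gqa = kv_cache_elements(num_kv_heads=8, head_dim=128, seq_len=4096, num_layers=32)
mqa = kv_cache_elements(num_kv_heads=1, head_dim=128, seq_len=4096, num_layers=32)
print(mha // gqa, mha // mqa)
```

Setting `num_kv_heads` equal to the number of query heads recovers Multi-Head Attention, and setting it to 1 recovers Multi-Query Attention, which is the sense in which GQA generalizes both.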
by Google DeepMind
Gemini 2.5 Pro introduces thinking mode—an integrated chain-of-thought reasoning layer—combined with a 1M-token context window and natively multimodal capabilities spanning text, image, audio, and video. The model achieves leading positions on multiple reasoning and coding benchmarks including Codeforces, AIME, and MMMU.
by DeepMind
Introduced Flamingo, a family of visual language models that bridge powerful pretrained vision and language models, enabling few-shot learning on a diverse range of multimodal tasks by training on arbitrarily interleaved sequences of images, video, and text. Flamingo set new few-shot state-of-the-art on 16 benchmarks.
by Google Research / Stanford / DeepMind / UNC
Defines and documents emergent abilities in LLMs — capabilities that appear sharply at certain model scales rather than improving gradually. Surveys over 100 tasks where models exhibit phase-transition-like capability gains, sparking debate on whether emergence is real or a measurement artifact.
by Hugging Face
Introduces DistilBERT, a knowledge-distilled version of BERT that retains 97% of BERT's language understanding while being 40% smaller and 60% faster. Demonstrates the effectiveness of task-agnostic knowledge distillation for pretrained language models.
by AaaS
A core computer vision skill that enables agents to identify and locate objects within an image or video stream. By predicting bounding boxes and class labels for each object, this skill forms the foundation for environmental understanding. It is crucial for applications requiring spatial awareness, from autonomous navigation to automated inspection.
by AaaS
Master structured prompting for text-to-image diffusion models like Stable Diffusion and Midjourney. Learn to control style, composition, and quality using techniques such as negative prompting, LoRA weights, and iterative refinement. This skill enables the programmatic generation of consistent, on-brand imagery at scale.
by Community
Combines collaborative filtering and content-based signals — along with contextual, knowledge-graph, and session-based features — into unified ranking models that outperform single-strategy approaches. Modern implementations use two-tower neural architectures for efficient retrieval followed by cross-attention reranking.
by AaaS
Diagnoses and resolves software bugs by analyzing error messages, stack traces, and code behavior. Applies systematic debugging strategies including root cause analysis, state inspection, and targeted fix generation with regression awareness.
by Zendesk
Zendesk's AI Agent is an autonomous customer support tool designed to resolve inquiries across email, chat, and messaging. Trained on billions of real service interactions, it understands intent and sentiment to provide resolutions without requiring human intervention, freeing up teams for complex issues.
by Snyk
AI-powered developer security agent that continuously scans code, dependencies, containers, and infrastructure-as-code for vulnerabilities. Provides automated fix pull requests, prioritizes issues by exploitability, and integrates directly into the developer workflow for shift-left security.
by ServiceNow
An autonomous AI agent built on the Now Platform, designed to automate end-to-end IT Service Management (ITSM) processes. It independently resolves common incidents, fulfills service requests, and executes standard change workflows by leveraging a proprietary knowledge graph and workflow engine, reducing the need for human intervention.
by Otter.ai
An autonomous agent that joins virtual meetings, transcribes conversations in real time with speaker diarization, and generates structured summaries containing decisions made, action items with owners and due dates, and key discussion points. It distributes follow-up notes to participants, syncs action items into project management tools, and maintains a searchable meeting knowledge base.
by Westlaw AI (Thomson Reuters)
Comprehensive legal research agent that queries case law databases, statutes, regulations, and secondary sources to synthesize jurisdiction-specific memos, identify controlling precedents, and map circuit splits. Generates formatted legal research memos with citation-verified sources and confidence scores.
by Community
Analyzes feature importance for scikit-learn compatible models using multiple advanced techniques. It computes SHAP values with Tree and Kernel Explainers, calculates permutation importance, and performs feature selection with Boruta. Results are compiled into an interactive HTML dashboard for easy interpretation and sharing.
by Pinecone
Pinecone is the leading managed vector database, purpose-built for AI applications requiring similarity search at scale. It powers retrieval-augmented generation, semantic search, and recommendation systems for thousands of enterprises. Pinecone's serverless architecture eliminates infrastructure management while delivering sub-millisecond query performance.
by Perplexity AI
Perplexity AI is an answer engine that combines real-time web search with large language model reasoning to deliver cited, conversational responses. Founded in 2022, it has rapidly grown to tens of millions of monthly active users and positions itself as an AI-native alternative to traditional search engines.
by Character AI
Character AI is a consumer platform for creating and interacting with AI-powered characters. Users can engage in conversations for entertainment, role-playing, and creative exploration. It has become a major consumer AI application with a massive user base, focusing on personalized and immersive chat experiences.
by Allen AI
Large-scale dataset for commonsense coreference resolution inspired by Winograd schemas. Tests whether models can correctly resolve pronoun references based on world knowledge and commonsense reasoning in carefully constructed sentence pairs.
by Georgia Tech / VT
Visual Question Answering benchmark requiring models to answer open-ended questions about images. Version 2 balances the dataset to reduce language biases, ensuring models must genuinely understand image content rather than relying on question-type priors.
by TIGER-Lab
MMLU-Pro is a challenging benchmark designed to evaluate the advanced reasoning and knowledge capabilities of frontier AI models. It enhances the original MMLU by introducing harder, professionally-vetted questions, expanding answer choices from 4 to 10, and reducing sensitivity to prompt formatting for a more robust and discriminative assessment.
by Young et al. / University of Illinois
Flickr30k is a benchmark for image-text retrieval and visual grounding, comprising 31,783 Flickr images each paired with five human-written captions. Models are evaluated on bidirectional image-to-text and text-to-image retrieval recall at ranks 1, 5, and 10.
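Recall at rank K, as used for this kind of retrieval evaluation, can be sketched in a few lines. The query and image identifiers below are made up for illustration, and `recall_at_k` is a hypothetical helper rather than part of any official evaluation script.

```python
def recall_at_k(rankings, relevant, k):
    """Fraction of queries with at least one relevant item in the top k."""
    hits = 0
    for query_id, ranked_ids in rankings.items():
        if set(ranked_ids[:k]) & relevant[query_id]:
            hits += 1
    return hits / len(rankings)


# Hypothetical text-to-image results: each caption maps to ranked image ids.
rankings = {
    "caption_1": ["img_7", "img_2", "img_9"],
    "caption_2": ["img_4", "img_5", "img_1"],
}
relevant = {"caption_1": {"img_2"}, "caption_2": {"img_1"}}

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(rankings, relevant, k):.2f}")
```

The same function covers the image-to-text direction by swapping the roles of queries and candidates, with each image's set of ground-truth captions as its relevant set.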
by Tsinghua University
A large-scale, high-quality preference dataset with 64,000 instructions each answered by 4 LLMs and rated by GPT-4 on instruction-following, truthfulness, honesty, and helpfulness. UltraFeedback is the backbone of the Zephyr and Tulu 2 DPO models.
by BigCode
The code dataset used to pretrain the StarCoder family of models (roughly 783 GB, or about 250 billion tokens), assembled by BigCode from The Stack v1 and spanning 86 programming languages with permissive licenses. It includes GitHub issues, Git commits, and Jupyter notebook data alongside source files, enabling models to learn from developer workflows and not just static code.
by MIT CSAIL
Places365 is a scene-centric database with 1.8 million training images across 365 scene categories, designed to train and evaluate scene recognition models. The dataset enables models to understand the semantic meaning of places and environments, making it ideal for applications in autonomous driving, robotics, and image retrieval.
by Nous Research
A large curated synthetic instruction dataset with ~1 million entries sourced from multiple high-quality open datasets including Airoboros, Camel, GPT4-LLM, and others. OpenHermes 2.5 powers the Nous Hermes model family and is widely regarded as one of the best open instruction datasets.
by LAION / OpenAssistant
OpenAssistant Conversations 2 (OASST2) is a crowd-sourced human-annotated dataset of 100,000+ assistant-style conversations in 35 languages, where human contributors created and ranked message trees to produce preference labels for RLHF training. It is the largest open multilingual human-feedback dataset and is widely used for training preference models and reward functions in open-source alignment pipelines.
by Pekka Malo et al. / Aalto University
Financial PhraseBank is a sentiment analysis dataset containing 4,845 sentences from English-language financial news annotated by 16 financial domain experts with positive, negative, or neutral sentiment labels. It is the most widely used benchmark for financial sentiment analysis and has been used to fine-tune FinBERT and numerous other financial NLP models.
by University of Toronto
A dataset of over 11,000 unpublished books spanning fiction and non-fiction genres, originally scraped from Smashwords and used as the primary pretraining corpus for BERT alongside Wikipedia. It provides rich long-range dependency data that helps models learn coherent narrative structure and extended discourse patterns.
by UC Berkeley
A benchmark of 10,000 programming problems at introductory, interview, and competitive programming difficulty levels, each with problem statements, test cases, and human-written solutions. APPS is the standard dataset for evaluating code generation models on realistic programming tasks ranging from simple loops to complex algorithmic challenges drawn from competitive programming platforms.
by Stability AI
Stable Diffusion 3 is a powerful text-to-image model using a Multimodal Diffusion Transformer (MMDiT) architecture. It excels at generating images with unprecedented text quality, adhering closely to complex prompts, and achieving high photorealism and compositional accuracy compared to its predecessors.
by Mistral AI
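Mistral AI's sparse mixture-of-experts model with 8 expert feed-forward networks per layer, of which a router activates only 2 per token. Because the experts share the attention layers, the model totals roughly 47B parameters (not 8 × 7B = 56B), with about 13B active per token. Matches or exceeds GPT-3.5 on most benchmarks while using a fraction of the compute of a dense model of the same size at inference.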
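Activating 2 of 8 experts per token can be sketched as top-k routing: pick the k highest router scores, softmax over just those, and mix the chosen experts' outputs. This is a toy sketch under stated assumptions, not Mistral's implementation; the experts here are hypothetical scalar functions and the router logits are hard-coded rather than produced by a learned gating layer.

```python
import math


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts,
    with weights given by a softmax over only the selected logits."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; the rest never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Hypothetical experts: expert i simply multiplies its input by (i + 1).
experts = [lambda x, scale=i + 1: scale * x for i in range(8)]
logits = [0.1, 2.0, -1.0, 1.0, 0.0, 0.3, -0.5, 0.2]

routes = top_k_route(logits, k=2)
print(routes)
print(moe_forward(1.0, experts, logits))
```

Only the two selected experts are evaluated for a token, which is why the active parameter count is much smaller than the total.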
by Meta
Meta's third-generation compact language model with significantly improved performance over Llama 2 at the same size class. Features an expanded 128K token vocabulary and improved tokenizer.
by Anthropic
The Anthropic MCP Filesystem server allows AI agents, like Claude, to interact directly with a user's local files. It exposes a secure API for reading, writing, listing, and searching files and directories, enabling agents to perform tasks such as code analysis, data processing, and file organization on the host machine.
by LangChain
This integration connects the LangChain framework with Google's advanced AI services, including the Gemini API via Google AI Studio and models on Vertex AI. It enables developers to build sophisticated applications leveraging multimodal capabilities for processing text and images, advanced function calling for tool use, and grounding responses with Google Search for accuracy.
by LangChain
LangChain VectorStore integration for Chroma, the open-source AI-native embedding database. Ideal for local development and prototyping with zero infrastructure setup. Supports persistent and in-memory collections, metadata filtering, and relevance-scored retrieval via langchain-chroma.
by GitHub
The GitHub Copilot plugin for JetBrains IDEs integrates AI-powered code completion and a conversational chat panel directly into the editor. It provides inline, ghost-text suggestions and mirrors the functionality of the VS Code extension, adapting to JetBrains' native keymaps and user interface for a seamless experience across IDEs like IntelliJ IDEA and PyCharm.
by University of Washington / IBM AI Research / Allen AI
Introduces Self-RAG, a framework that trains a single LM to adaptively retrieve passages on demand, generate text, and critique its own outputs using special reflection tokens. Unlike standard RAG, Self-RAG decides when to retrieve and reflects on retrieved passages and generation quality, outperforming ChatGPT and standard RAG on diverse downstream tasks.
by Google / Everyday Robots
SayCan combines the semantic reasoning capabilities of large language models with learned value functions that encode physical feasibility, allowing robots to plan long-horizon tasks expressed in natural language. The approach grounds high-level language instructions in real-world robot affordances without task-specific fine-tuning.
by UC Berkeley / Google Brain
Decision Transformer recasts offline reinforcement learning as a conditional sequence modeling problem, predicting actions given return-to-go, states, and past actions using a causal Transformer. This eliminates the need for temporal difference learning and bootstrapping while achieving competitive performance on Atari and MuJoCo benchmarks.
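The return-to-go conditioning can be made concrete with a short sketch: for each timestep, the return-to-go is the sum of rewards from that step to the end of the trajectory, and the model input interleaves (return-to-go, state, action) triples. The helper names and the toy trajectory are hypothetical, and the sketch uses undiscounted returns; the Transformer itself is omitted.

```python
def returns_to_go(rewards):
    """Undiscounted sum of rewards from each timestep to the end."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


def interleave(rtg, states, actions):
    """Flatten a trajectory into the (rtg, state, action) token order."""
    tokens = []
    for g, s, a in zip(rtg, states, actions):
        tokens += [("rtg", g), ("state", s), ("action", a)]
    return tokens


# Hypothetical three-step trajectory.
rewards = [1.0, 0.0, 2.0]
states = ["s0", "s1", "s2"]
actions = ["left", "right", "left"]

rtg = returns_to_go(rewards)
print(rtg)
print(interleave(rtg, states, actions)[:3])
```

At evaluation time the first return-to-go is set to a desired target return and decremented by each observed reward, so the model is prompted to produce actions consistent with that target.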
by AaaS
Queries CRM systems to retrieve customer account data, ticket history, subscription status, and interaction logs. Provides the customer context foundation that support, churn, and sales agents depend on for personalized actions.
by AaaS
Identifies deviations from normal system behavior across time-series telemetry data (CPU, memory, latency, error rates, request volumes). Uses statistical methods (z-score, IQR) and learned baselines to distinguish genuine anomalies from expected variance. A critical cross-foundry skill reused by SRE (F1), Fraud Detection (F6), and Supply Chain (F8) agents.
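The two statistical methods named above can be sketched with the standard library alone. The thresholds, function names, and latency samples are assumptions for illustration; a production detector would typically use rolling windows and learned baselines rather than a single static batch.

```python
import statistics


def zscore_anomalies(values, threshold=3.0):
    """Indices whose distance from the mean exceeds `threshold` std devs."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


def iqr_anomalies(values, k=1.5):
    """Indices falling outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical latency samples in milliseconds with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 103, 97, 100, 400, 101]

# With n samples, a single point's population z-score is bounded by
# sqrt(n - 1), so a small batch like this needs a threshold below 3.
print(zscore_anomalies(latency_ms, threshold=2.5))
print(iqr_anomalies(latency_ms))
```

The IQR rule is less sensitive to the outlier itself, since a large spike inflates the mean and standard deviation that the z-score depends on but barely moves the quartiles.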
by AaaS
Enables agents to answer free-form natural language questions about images by grounding language in visual features. Covers prompt construction for vision-language models, chain-of-thought visual reasoning, and failure modes such as hallucination and spatial confusion.
by AaaS
Automates the categorization of text into predefined classes. This skill leverages large language models to perform zero-shot and multi-label classification, eliminating the need for extensive training data. It can analyze documents, user feedback, or social media posts, assigning relevant labels from a simple list or a complex hierarchical taxonomy.
by AaaS
Automates the creation of test suites by analyzing source code, function signatures, or specifications. It generates unit tests, integration tests, and edge case scenarios for popular frameworks, complete with necessary mocks and assertions. This accelerates development cycles and improves code reliability.
by Community
Trains control policies for autonomous systems through environment interaction and reward signals using model-free (PPO, SAC, TD3) and model-based (MBPO, Dreamer) RL algorithms. Enables superhuman performance in complex continuous control tasks from locomotion to manipulation.
by Community
Applies deep learning directly to graph-structured data by passing and aggregating messages between connected nodes across multiple layers, enabling node classification, link prediction, and graph-level tasks. Powers state-of-the-art knowledge graph completion, molecular property prediction, and social network analysis.
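One round of message passing can be sketched without any framework. This toy version uses mean aggregation and a fixed 50/50 mix of a node's own features with the aggregated message; a real graph neural network would replace that fixed mix with learned weight matrices and a nonlinearity. The graph, features, and function name are hypothetical.

```python
def message_passing_step(features, edges):
    """One round of mean-aggregation message passing on an undirected graph.

    features: dict mapping node -> list of floats
    edges: iterable of (u, v) pairs
    """
    neighbors = {node: [] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    updated = {}
    for node, h in features.items():
        messages = [features[n] for n in neighbors[node]]
        if messages:
            # Average each feature dimension across neighbors.
            agg = [sum(col) / len(messages) for col in zip(*messages)]
        else:
            agg = [0.0] * len(h)
        # Fixed, unlearned update rule (assumption for illustration).
        updated[node] = [0.5 * own + 0.5 * msg for own, msg in zip(h, agg)]
    return updated


# Hypothetical path graph a - b - c with one feature per node.
features = {"a": [1.0], "b": [3.0], "c": [5.0]}
edges = [("a", "b"), ("b", "c")]

one_hop = message_passing_step(features, edges)
two_hop = message_passing_step(one_hop, edges)
print(one_hop)
print(two_hop)
```

Stacking rounds widens each node's receptive field: after two steps, node `a` has been influenced by node `c` even though they are not directly connected.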
by AaaS
Generates dense vector embeddings from text, images, or other data types for use in similarity search, clustering, and classification. Covers model selection, batch processing, dimensionality considerations, and normalization strategies for optimal retrieval performance.
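The normalization point can be illustrated briefly: once embeddings are scaled to unit length, cosine similarity reduces to a plain dot product, which is what most vector indexes compute. The vectors below are tiny made-up examples rather than outputs of a real embedding model, and the helper names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, corpus, k=2):
    """Rank corpus ids by cosine similarity to the query."""
    q = l2_normalize(query)
    scored = [(doc_id, dot(q, l2_normalize(vec))) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-dimensional "embeddings".
corpus = {
    "doc_a": [3.0, 4.0],
    "doc_b": [30.0, 40.0],  # same direction as doc_a, larger magnitude
    "doc_c": [-4.0, 3.0],   # orthogonal to doc_a
}

print(l2_normalize([3.0, 4.0]))
print(top_k([1.0, 1.0], corpus, k=3))
```

Because `doc_a` and `doc_b` point in the same direction, they receive the same score after normalization despite their different magnitudes.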
by AaaS
Code Refactoring is the disciplined process of restructuring existing computer code without altering its external behavior. It focuses on enhancing nonfunctional attributes like readability, maintainability, and performance. This practice is key to managing technical debt, applying design patterns, and modernizing legacy systems to align with current best practices.
by AaaS
Resolves up to 80% of Tier-1 and Tier-2 support requests by directly accessing the CRM and payment gateways. Processes refunds within configurable monetary caps, updates account settings, modifies subscriptions, and routes edge cases to human representatives with full conversation summaries and diagnostic context. Unlike basic chatbots that regurgitate FAQ documents, this agent takes transactional action — it resolves, not deflects.
by Sprout Social
A semi-autonomous agent that optimizes social media content for maximum reach. It analyzes platform-specific engagement patterns, rewrites posts, schedules them for peak audience times, and A/B tests caption variations to improve performance across channels.
by ServiceNow
This AI agent automates enterprise risk management (ERM) by continuously synthesizing data from internal systems and external intelligence. It identifies, categorizes, and scores diverse risks, maintaining a live risk register and mapping control effectiveness to provide a real-time, holistic view of the organization's risk posture.
by Datadog
An autonomous profiling agent that instruments application code, analyzes CPU flame graphs, memory heap snapshots, and database query plans to identify performance bottlenecks, then proposes and optionally applies targeted code optimizations. It tracks regression history, correlates deployments with latency spikes, and benchmarks fixes against baseline measurements before recommending production rollout.
by LangChain Inc.
LangChain's framework for building stateful, multi-agent applications using graph-based workflows. Provides fine-grained control over agent state, cycles, branching, and human-in-the-loop checkpoints for production-grade agentic systems.
by NVIDIA
The NVIDIA Jetson AGX Orin is a high-performance System-on-Module (SoM) designed for edge AI and autonomous machines. It delivers up to 275 TOPS of AI performance, integrating an NVIDIA Ampere architecture GPU with Arm CPUs and deep learning accelerators for server-class computing in a power-efficient package.
by NVIDIA
The NVIDIA B100 is a data center GPU based on the Blackwell architecture, succeeding the H100. It offers substantial performance improvements for AI training and inference, featuring a second-generation Transformer Engine with FP4 precision, and a fifth-generation NVLink interconnect for massive multi-GPU scaling.
by NVIDIA
NVIDIA Ampere GPU optimized for graphics and inference workloads. Commonly deployed in AWS G5 instances, offering a cost-effective option for inference, graphics rendering, and video processing at cloud scale.
by Community
This script provides a sophisticated OCR pipeline that intelligently routes documents to the most suitable engine—Tesseract, PaddleOCR, or a cloud API—based on image quality analysis. It processes various document types and outputs structured JSON containing text sorted by reading order, complete with bounding box coordinates and confidence scores for each word or line.
by Scale AI
Scale AI is the leading AI data platform providing high-quality training data labeling, RLHF pipelines, and model evaluation services for frontier AI labs, government agencies, and Fortune 500 enterprises. Its Rapid platform and data engine power training datasets for many leading language and vision models.
by Runway ML
Runway is an applied AI research company focused on building multimodal AI systems for art, entertainment, and human creativity. It provides a suite of web-based tools for generative content creation, including industry-leading text-to-video, image-to-video, and AI-powered video editing features for creative professionals.
by Lu et al. / UCLA
ScienceQA is a large-scale multimodal benchmark featuring 21,208 science questions for grades 3-12. It uniquely combines visual diagrams and textual contexts, requiring models to perform complex reasoning. Each question includes multiple-choice options, a detailed lecture, and a step-by-step explanation for the correct answer.
by Gehman et al. / Allen Institute for AI
RealToxicityPrompts measures the propensity of language models to generate toxic content when conditioned on a diverse set of 100,000 naturally occurring prompts extracted from the web. It uses the Perspective API to score generated text for toxicity.
by CUHK / Waterloo
MMMU is a challenging multimodal benchmark designed to evaluate large models on expert-level tasks. It contains over 11,500 college-level problems spanning six core disciplines, requiring models to integrate deep subject knowledge with visual perception to answer multiple-choice questions with detailed reasoning.
by Guha et al. / Stanford CodeX
LegalBench is a collaboratively built benchmark measuring the legal reasoning ability of large language models across 162 tasks spanning issue spotting, rule recall, rule application, and legal interpretation. It provides a comprehensive evaluation of whether models can reason like lawyers.
by Zhuo et al. / BigCode / Hugging Face
BigCodeBench is a challenging benchmark for evaluating large language models on practical, function-level code generation tasks. It comprises 1,140 problems that require the use and integration of popular Python libraries like NumPy, Pandas, and Scikit-learn, moving beyond simple algorithmic puzzles to mirror real-world software development scenarios.
by Stanford
Automated evaluation framework comparing model outputs against a reference model on 805 instructions. Uses LLM judges to determine win rates, with length-controlled metrics to avoid rewarding verbosity over quality.
by Community
A community-collected dataset of real ChatGPT and GPT-4 conversation logs shared by users, covering a broad range of tasks and domains. Available in multiple filtered and cleaned versions including ShareGPT52K and ShareGPT90K used by Vicuna and other open models.
by University of Helsinki
OPUS-100 is a large-scale multilingual parallel corpus for machine translation, featuring 100 languages pivoted through English. Sampled from the OPUS collection, it provides up to 1 million sentence pairs per language pair, making it a standard benchmark for training and evaluating multilingual models.
by Princeton / Columbia University
The Large-Scale Scene Understanding (LSUN) dataset is a massive collection of nearly one million labeled images for each of 10 scene and 20 object categories. It is a key benchmark for advancing research in scene understanding, particularly for generative modeling, classification, and reconstruction tasks.
by Meta AI
LIMA (Less Is More for Alignment) is a carefully curated dataset of 1,000 high-quality instruction-response pairs demonstrating that alignment data quality matters more than quantity. Sourced from StackExchange, wikiHow, and manually written prompts, it was used to fine-tune a 65B LLaMA model whose responses were judged equivalent or preferable to GPT-4's in 43% of cases in the paper's human preference study.
by Seasalt.ai / SpeechColab
GigaSpeech is a multi-domain English speech corpus with 10,000 hours of high-quality labeled audio for ASR, sourced from audiobooks, podcasts, and YouTube across a broad range of topics and recording conditions. Its scale and diversity make it particularly valuable for training robust, domain-generalizable speech recognition models.
by Databricks
Dolly-15K is a high-quality, open-source dataset of 15,000 instruction-following records generated by humans. Created by Databricks employees, it's designed for fine-tuning large language models to exhibit instruction-following capabilities, such as those seen in ChatGPT, using a relatively small, targeted dataset.
by Google DeepMind
DeepMind Mathematics (DM Mathematics) is a dataset of 2 million mathematical question-answer pairs covering algebra, arithmetic, calculus, comparisons, measurement, numbers, polynomials, and probability, procedurally generated to test mathematical reasoning capabilities of language models. The symbolic and step-structured nature of the dataset makes it a standard benchmark for evaluating compositional generalization and multi-step arithmetic reasoning.
by Hugging Face
Cosmopedia is a massive synthetic dataset containing 30 million documents styled as textbooks, blog posts, and articles. Generated by Mixtral-8x7B-Instruct, it provides a large English-language corpus of educational content designed for pretraining large language models at scale.
by GitHub / Microsoft Research
A dataset and benchmark challenge for code retrieval and search containing 2 million (code, documentation) pairs in six programming languages — Python, Java, JavaScript, PHP, Ruby, and Go — curated by GitHub and Microsoft Research. It is the canonical benchmark for code-to-natural-language and natural-language-to-code retrieval tasks and is widely used to evaluate code embedding models.
by Alibaba Cloud
Qwen 2.5 Coder 32B is an open-weight, code-specialized large language model from Alibaba Cloud. Fine-tuned on a massive corpus covering over 92 programming languages, it excels at code generation, completion, and debugging tasks, demonstrating performance on par with or exceeding proprietary models like GPT-4o on several benchmarks.
by LangChain Inc.
The LangGraph and LangSmith integration provides built-in observability for stateful agent graphs. It automatically captures every node execution, state change, and tool call as a structured trace in LangSmith, enabling deep, step-by-step debugging, performance analysis, and regression testing of complex agent workflows.
by Google Cloud
Vertex AI is Google Cloud's managed machine learning platform for deploying and scaling AI applications. It provides an enterprise-grade environment for using Google's foundation models like Gemini and PaLM, adding MLOps tooling, security controls, and deep integration with the Google Cloud ecosystem. This includes features like model tuning, evaluation, and grounding with Google Search.
by OpenAI
This paper explores weak-to-strong generalization, a method for training a powerful AI model using supervision from a weaker one. It serves as an analogy for aligning superintelligent AI with human values. The research shows that strong models can learn beyond their weak supervisors and introduces techniques like auxiliary confidence loss to improve performance.
by Google Research
Introduces Switch Transformers, simplifying MoE routing to select a single expert per token (top-1), enabling stable trillion-parameter T5-scale models with 7× pre-training speedup. Demonstrates that parameter count and compute can be decoupled through sparsity.
by Stanford University / Google Brain
STaR (Self-Taught Reasoner) is a research paper introducing an iterative bootstrapping method for language models. The model learns to improve its reasoning abilities by generating rationales for problems, filtering out the incorrect ones, and then fine-tuning itself on the successfully reasoned examples. This allows smaller models to achieve reasoning performance comparable to much larger ones.
by DeepMind
Proposes using language models to automatically generate test cases that elicit harmful behaviors from target language models—a scalable alternative to manual red teaming. The approach discovers diverse attack prompts across harm categories and reveals that larger models are harder to red-team but produce more harmful outputs when successfully attacked.
by Google Research
Proposes REALM, which augments language model pre-training with a learned textual knowledge retriever, enabling the model to retrieve and attend over documents from a large corpus during both pre-training and fine-tuning. REALM achieves state-of-the-art on Open-domain QA benchmarks while providing interpretable knowledge retrieval.
by LMSYS / UC Berkeley / CMU / UCSD
Presents Vicuna-13B, an open-source chatbot created by fine-tuning LLaMA on ShareGPT conversation data, achieving approximately 90% of ChatGPT and Bard quality as judged by GPT-4. The paper introduces GPT-4 as an automated judge for chatbot evaluation, establishing a widely adopted evaluation paradigm for conversational AI.
by UC Berkeley
CQL (Conservative Q-Learning) addresses distribution shift in offline RL by augmenting the standard Bellman objective with a term that penalizes Q-values for out-of-distribution actions, producing a lower bound on the true value function. This conservative approach prevents over-optimistic value estimation and achieves strong performance across locomotion, navigation, and robotic manipulation datasets.
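For a discrete action space, the penalty term can be illustrated as follows. This is a sketch under stated assumptions: it uses the log-sum-exp form of the regularizer, a single transition instead of a batch, and hypothetical names (`cql_penalty`, `cql_loss`, `alpha`) chosen for readability.

```python
import math

def logsumexp(values):
    # Stable log(sum(exp(v))) over a list of Q-values.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cql_penalty(q_values, dataset_action):
    """Push down Q over all actions, push up Q for the action seen in the data."""
    return logsumexp(q_values) - q_values[dataset_action]

def cql_loss(q_values, dataset_action, td_target, alpha=1.0):
    """Squared Bellman error plus the conservative penalty, weighted by alpha."""
    bellman_error = 0.5 * (q_values[dataset_action] - td_target) ** 2
    return bellman_error + alpha * cql_penalty(q_values, dataset_action)

# Hypothetical Q-values for three actions; action 0 is the one in the dataset.
q = [1.0, 5.0, 0.0]
print(cql_penalty(q, dataset_action=0))  # large: an unseen action has a high Q-value
print(cql_penalty(q, dataset_action=1))  # small: the dataset action already dominates
```

The penalty is never negative and shrinks only when the dataset action holds the largest Q-value, which is what discourages optimistic estimates for out-of-distribution actions.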
by DeepMind
AlphaCode is a large-scale language model from DeepMind designed for competitive programming. It was pre-trained on public GitHub code and fine-tuned on a curated dataset of programming contest problems. The system generates a vast number of potential solutions and then filters them using test cases to find a correct one.
by Tsinghua University
Introduces AgentBench, the first systematic benchmark for evaluating LLMs as autonomous agents across eight distinct environments spanning operating systems, databases, knowledge graphs, digital games, and web browsing. The benchmark reveals a large performance gap between commercial and open-source models on real-world agent tasks.
by AaaS
Parses, correlates, and summarizes structured and unstructured log streams from multiple sources (application logs, system logs, CI/CD logs). Identifies error patterns, correlates events across distributed services using trace IDs, and extracts actionable insights from high-volume log data. A foundational skill reused across DevOps, SRE, and security agents.
by Community
Path Planning is a fundamental capability in robotics and autonomous systems that computes a collision-free geometric path from a start to a goal configuration. It operates within a system's configuration space, using algorithms like A* or RRT to find optimal or feasible routes, distinct from motion planning which also considers dynamics like velocity and acceleration.
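As an illustration of the A* approach mentioned above, here is a minimal sketch on a 4-connected grid. The grid layout, unit step cost, and Manhattan-distance heuristic are assumptions chosen for the example; real planners work in the robot's configuration space.

```python
import heapq

def astar(grid, start, goal):
    """grid: list of strings where '#' marks an obstacle.
    Returns a list of (row, col) cells from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible for 4-connected unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {start: None}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry, a cheaper route was found already
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None

grid = ["....",
        ".##.",
        "...."]
path = astar(grid, (0, 0), (2, 3))
print(path)
```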
by AaaS
Builds end-to-end pipelines for extracting structured text from images, scanned documents, and PDFs using OCR engines combined with layout analysis. Teaches preprocessing, engine selection (Tesseract, PaddleOCR, Google Document AI), post-correction, and handoff to language models for structured extraction.
by AaaS
Hybrid search enhances information retrieval by merging the results of two distinct search methods: dense vector search for semantic understanding and sparse keyword search (like BM25) for lexical precision. This dual approach ensures that search results are not only contextually relevant but also capture exact term matches, significantly improving recall and relevance across diverse and complex queries.
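One common way to merge the two result lists is reciprocal rank fusion (RRF). The entry does not name a fusion method, so RRF is an assumption here, as are the document IDs and the constant `k=60`.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.
    Each document scores 1 / (k + rank) in every list where it appears."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from a dense (vector) and a sparse (BM25) retriever.
dense_hits = ["d3", "d1", "d7"]
sparse_hits = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # → ['d1', 'd3', 'd9', 'd7']
```

RRF needs only ranks, not scores, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.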
by AaaS
Splits large documents into semantically coherent chunks optimized for embedding and retrieval. Supports recursive, semantic, and sentence-based splitting strategies with configurable overlap and size parameters.
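A sentence-based strategy with overlap might look like the sketch below. The function name, the regex sentence splitter, and the default parameters are illustrative assumptions, not the skill's actual interface.

```python
import re

def chunk_sentences(text, max_chars=200, overlap=1):
    """Group sentences into chunks of roughly max_chars characters.
    The last `overlap` sentences of each chunk are repeated at the start
    of the next one, so a chunk may exceed max_chars slightly."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

chunks = chunk_sentences("One. Two. Three. Four.", max_chars=10, overlap=1)
print(chunks)  # → ['One. Two.', 'Two. Three.', 'Three. Four.']
```

The overlap keeps a sentence of shared context between neighboring chunks, which helps retrieval when an answer straddles a chunk boundary.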
by AaaS
Data Extraction is the process of automatically identifying and pulling structured information from unstructured or semi-structured sources like documents, web pages, and text. It uses NLP and computer vision to parse content into a predefined schema, enabling data to be used in databases, analytics, and automated workflows.
by Community
Recommends items by matching item feature profiles to user preference profiles derived from their interaction history, using TF-IDF, embeddings, and semantic similarity techniques. Effective for cold-start scenarios where user interaction data is sparse and item metadata is rich.
by AaaS DevOps Foundry
Detects anomalies in live system telemetry, runs deterministic diagnostics from the organization's top remediation runbooks, and autonomously resolves up to 40% of standard incidents without human intervention. Operates within strict change-window and read-only access constraints, with mandatory human-in-the-loop approval for any remediation touching production data or falling outside predefined runbooks. Reduces mean-time-to-recovery and augments on-call teams.
by AaaS
Ingests unstructured invoice data across wildly varying formats (multi-page PDFs, email attachments, CSV exports). Matches invoices deterministically against Purchase Orders and receipt logs in the ERP system, and authorizes payment for the established happy path — achieving 60-80% straight-through processing without human intervention. Exception invoices (mismatches, missing POs, duplicate detection) are routed to a human queue with full context. Every payment authorization generates an immutable audit trail.
by Turnitin
An automated assessment agent that generates item banks, administers adaptive quizzes, and provides calibrated scoring with detailed feedback explanations. It applies Item Response Theory to estimate learner proficiency and surfaces at-risk students to instructors via configurable alert thresholds.
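To make the Item Response Theory step concrete, the sketch below uses the two-parameter logistic (2PL) model and a coarse grid search for the maximum-likelihood proficiency. The choice of 2PL, the item parameters, and the grid range are assumptions; the entry does not say which IRT variant the agent uses.

```python
import math

def p_correct(theta, discrimination, difficulty):
    """2PL model: probability that a learner with proficiency theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

def estimate_theta(responses, items):
    """Grid-search maximum-likelihood proficiency on [-4, 4] in steps of 0.1.
    responses: list of 0/1 outcomes; items: list of (discrimination, difficulty)."""
    grid = [x / 10 for x in range(-40, 41)]

    def log_likelihood(theta):
        total = 0.0
        for correct, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            total += math.log(p if correct else 1.0 - p)
        return total

    return max(grid, key=log_likelihood)

# Hypothetical item bank: equal discrimination, easy / medium / hard items.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
stronger = estimate_theta([1, 1, 0], items)
weaker = estimate_theta([1, 0, 0], items)
print(stronger, weaker)
```

An adaptive quiz would typically pick the next item whose difficulty is close to the current estimate, since that is where a response is most informative.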
by Nuance PowerScribe
An AI assistant that accelerates radiology reporting by automatically drafting structured reports from imaging findings. It applies standard templates like ACR BI-RADS, extracts key measurements, and codes findings using RadLex terminology, significantly reducing radiologist documentation time and improving data consistency for analytics.
by SparkCognition
An IoT-connected agent that ingests vibration, temperature, acoustic, and electrical signals from industrial equipment to predict failure events hours to weeks in advance using ML anomaly detection and physics-based models. It generates work orders in CMMS systems, recommends spare parts pre-positioning, and calculates optimal maintenance windows to minimize production impact.
by PagerDuty
PagerDuty AI is an AIOps agent for incident management that automates triage and response. It intelligently groups related alerts to reduce noise, correlates events to identify root causes, and suggests or executes automated remediation runbooks. This helps teams minimize downtime and streamline their on-call processes.
by Intercom
Intercom Fin is an AI-powered chatbot designed for customer support automation, built on OpenAI's GPT-4. It autonomously resolves customer queries by leveraging a company's help center content and past conversation data. Fin provides human-like answers, can execute actions, and intelligently escalates complex issues to human agents.
by Graphcore
The Graphcore Bow Pod1024 is a supercomputing-scale AI system, delivering over 250 PetaFLOPS of AI compute. It leverages 1,024 Bow IPU processors linked by a high-bandwidth fabric, specifically engineered for training massive, next-generation AI models and complex graph analytics workloads at an unprecedented scale.
by NVIDIA
The NVIDIA DGX H100 is a purpose-built AI supercomputer, serving as the foundational building block for large-scale AI infrastructure. It integrates eight H100 Tensor Core GPUs with high-speed NVLink interconnects, providing a turnkey solution for the most demanding AI training, inference, and data analytics workloads.
by Community
End-to-end image classification pipeline that handles dataset loading, preprocessing, model inference, and result export using PyTorch and torchvision. Supports batch inference against any Hugging Face ViT or ResNet checkpoint with configurable confidence thresholds.
by Community
This is a complete machine learning pipeline for detecting fraudulent transactions in real-time. It employs a hybrid approach, using XGBoost or LightGBM for classification and an Isolation Forest for anomaly detection. The system is specifically designed to handle severely imbalanced datasets through SMOTE-Tomek resampling and cost-sensitive learning.
by Great Expectations
Automates data quality testing for tabular data using the Great Expectations library. This script profiles datasets to generate and validate 'Expectations' covering schema, statistical properties, and referential integrity. It produces a comprehensive HTML report (Data Docs) and can be integrated into CI/CD pipelines as a quality gate to prevent bad data from entering production systems.
by Clark et al. / Google Research
TyDi QA is a multilingual question-answering benchmark featuring 11 typologically diverse languages. Questions are written natively by speakers of each language, ensuring genuine linguistic challenges and avoiding translation artifacts. It is designed to evaluate reading comprehension across a wide range of language structures.
by Hartvigsen et al. / MIT
ToxiGen is a large-scale, machine-generated dataset for evaluating nuanced hate speech detection. It contains over 274,000 toxic and benign statements about 13 minority groups, designed to challenge models to identify implicit toxicity without relying on obvious slurs or surface-level cues.
by Qin et al. / Tsinghua University
ToolBench evaluates LLMs on their ability to use real-world REST APIs to complete user instructions. It provides 16,000+ real APIs from RapidAPI Hub across 49 categories and 12,000+ instruction–API solution pairs, measuring whether models can plan and execute multi-step API call sequences.
by Jin et al. / Carnegie Mellon University
PubMedQA is a biomedical question-answering dataset sourced from PubMed abstracts. Models must answer yes/no/maybe questions about biomedical research findings, testing the ability to reason over scientific literature.
by Google Research
Instruction-Following Evaluation benchmark testing models' ability to precisely follow verifiable formatting instructions. Includes constraints like word count limits, specific formatting requirements, keyword inclusion/exclusion, and structural rules that can be programmatically verified.
by BigCode
HumanEval+ is a benchmark for rigorously evaluating code generation models. It augments the original HumanEval dataset by expanding the test suite for each of its 164 problems by 80x. This extensive testing helps uncover subtle bugs and failures on edge cases that simpler benchmarks miss, providing a more accurate measure of a model's true coding ability.
by Allen AI
DROP (Discrete Reasoning Over Paragraphs) is a challenging benchmark designed to evaluate a model's numerical reasoning capabilities within textual contexts. It requires systems to read paragraphs and answer questions that involve discrete operations like addition, counting, sorting, or comparison. Unlike simpler QA datasets, DROP necessitates multi-step reasoning processes, pushing models beyond basic information retrieval.
by Tsatsaronis et al. / BioASQ Challenge
BioASQ is a large-scale benchmark for biomedical semantic question answering. It challenges systems to perform document retrieval, concept mapping, and answer extraction from PubMed literature. The benchmark includes diverse question types like yes/no, factoid, list, and summary, with gold-standard answers curated by experts.
by Microsoft Research
WizardLM Evol-Instruct is a synthetic dataset created by Microsoft Research for fine-tuning large language models. It uses an LLM-based evolutionary process to iteratively rewrite and complicate a seed set of instructions, progressively increasing their complexity and diversity. The dataset is designed to enhance a model's ability to follow intricate, multi-step commands across various domains like coding, math, and reasoning.
by Cerebras
SlimPajama is a cleaned and deduplicated version of the RedPajama dataset, containing 627 billion high-quality tokens. Produced by Cerebras, it demonstrates that training on fewer, higher-quality tokens can match or exceed the performance of models trained on larger, noisier datasets.
by Shanghai AI Lab
ShareGPT4V is a large-scale, high-quality dataset containing 100,000 image-text pairs generated by GPT-4V. It is specifically designed for the instruction-tuning of open-source large vision-language models (LVLMs). The dataset's detailed captions and conversational QA pairs significantly enhance a model's ability to perform complex scene understanding, OCR, and visual reasoning.
by University of Washington
Self-Instruct is the foundational instruction-tuning dataset and methodology introduced by Wang et al. (2022), where 175 human-written seed tasks are iteratively expanded into 52,000 instruction-input-output triplets using GPT-3 as the generator. It established the paradigm of bootstrapping instruction data from existing LLMs and directly inspired Alpaca, WizardLM, and most subsequent synthetic alignment datasets.
by EleutherAI
OpenWebText is a large-scale, open-source English text corpus created by scraping web pages linked from Reddit. Designed as a public replication of OpenAI's original WebText dataset used for GPT-2, it contains approximately 38 GB of text filtered by Reddit upvotes to ensure a baseline of quality and relevance.
by Meta AI
The No Language Left Behind (NLLB) training corpus released by Meta AI contains high-quality parallel data covering 200+ languages, including newly mined bitext for dozens of low-resource languages. It was used to train the NLLB-200 model, which achieves state-of-the-art translation on low-resource language pairs.
by Microsoft Research
Evol-CodeAlpaca is a dataset of 110,000 instruction-solution pairs for code generation, created by applying the Evol-Instruct method to Code Alpaca seeds. Using GPT-4, it progressively increases the complexity and diversity of programming problems, serving as the primary training data for the WizardCoder models.
by DataComp Consortium
A curated dataset of roughly 1.4 billion image-text pairs produced through the DataComp benchmark competition, which challenged participants to filter a 12.8 billion pair candidate pool to produce the best downstream CLIP model. DataComp-1B represents the winning filtering strategy and achieves state-of-the-art zero-shot classification performance among datasets of its size.
by Casetext (acquired by Thomson Reuters)
The CaseText Corpus is a large-scale dataset of US federal and state court decisions. It includes full text, structured metadata, and citation networks, designed for legal research and the development of AI applications like legal language models and case retrieval systems, spanning decades of US jurisprudence.
by OpenAI
OpenAI's TTS-1 is a text-to-speech model designed for real-time audio generation. It provides six distinct, natural-sounding preset voices and supports low-latency streaming, making it ideal for interactive applications. A higher-quality variant, tts-1-hd, is available for tasks where audio fidelity is prioritized over speed.
by Google
T5 (Text-To-Text Transfer Transformer) is Google's 2019 framework that reframes all NLP tasks as text-to-text problems, allowing a single model to be trained on a unified mixture of tasks. Its clean formulation and the C4 dataset became foundational references for multitask learning research, and T5 variants remain widely used in production and research.
by Runway
Runway Gen-3 Alpha is a professional-grade video generation model for high-fidelity, temporally consistent clips. It offers fine-grained control over motion, style, and camera behavior via text and image inputs, which suits it to film and advertising workflows that must meet commercial standards.
by Alibaba Cloud (Qwen Team)
Qwen2.5-VL-72B is Alibaba's flagship open vision-language model at 72 billion parameters, achieving top-tier performance on visual understanding benchmarks including chart analysis, document parsing, and fine-grained image understanding. It supports dynamic resolution image inputs and video understanding with native high-resolution processing.
by Alibaba Cloud
Qwen2-72B is a 72-billion parameter large language model from Alibaba's Qwen2 series. It offers state-of-the-art performance, particularly in multilingual understanding, reasoning, and coding tasks. As an open-weight model, it provides a powerful alternative to proprietary systems for a wide range of applications.
by Microsoft
Phi-3.5-mini is a 3.8B parameter instruction-tuned model from Microsoft, optimized for edge and mobile devices. Despite its compact size, it delivers performance comparable to much larger models on benchmarks for reasoning, coding, and language tasks, making it highly efficient for on-device AI applications.
by OpenAI
A smaller, faster, and more affordable reasoning model optimized for STEM tasks. Delivers 80% of o1's reasoning capability at roughly 80% lower cost, making it ideal for high-volume coding and math workloads.
by Mistral AI
Mistral Large is Mistral AI's flagship proprietary model, offering top-tier reasoning and multilingual capabilities. It is designed to compete with other frontier models like GPT-4, excelling in complex tasks that require deep understanding. Its native function calling and fluency in over 30 languages make it highly versatile for enterprise-grade applications.
by NVIDIA
TensorRT-LLM optimizes large language models into fused CUDA kernels, while the Triton Inference Server orchestrates serving. Together, they form NVIDIA's production stack for maximizing token throughput and minimizing latency on datacenter GPUs, enabling high-performance, scalable LLM inference.
by LlamaIndex
LlamaParse is a proprietary parsing service for complex documents like PDFs with embedded tables and charts. Its first-party integration with the open-source LlamaIndex framework allows developers to directly ingest parsed, structured objects (Nodes) into advanced Retrieval-Augmented Generation (RAG) pipelines, preserving the original document's rich context.
by LangChain
This integration connects LangChain with the Hugging Face ecosystem, enabling the use of thousands of open-source models. It allows developers to call models via the Hugging Face Inference API, run local inference using the `transformers` library, and generate embeddings, all within LangChain's structured framework for building complex LLM applications.
by CrewAI / LangChain
This integration enables CrewAI agents to leverage the entire LangChain tool ecosystem. CrewAI orchestrates multi-agent workflows by assigning roles and delegating tasks, while LangChain provides the foundational tools for capabilities like web search, code execution, vector store retrieval, and API connectivity.
by Distill / OpenAI
This essay by Chris Olah and colleagues at Distill introduces the circuits framework for mechanistic interpretability, arguing that neural network weights encode interpretable algorithms composed of features and circuits. It presents case studies of curve detectors and multimodal neurons as evidence that individual units and motifs in neural networks are meaningfully interpretable.
by Anthropic
Demonstrates that LLMs can be trained to behave safely during normal operation but exhibit unsafe behaviors when triggered by specific conditions—acting as 'sleeper agents'—and that standard safety training techniques including RLHF, supervised fine-tuning, and adversarial training fail to reliably remove these backdoors, sometimes even hiding them deeper.
by Google DeepMind
RT-2 is a Vision-Language-Action (VLA) model that translates visual and language inputs directly into robotic actions. By co-fine-tuning large models on both web-scale and robotics data, it transfers knowledge from the internet to physical control, enabling robots to reason about and execute tasks involving novel objects and scenarios without explicit robotic training.
by DeepMind
Presents RETRO (Retrieval-Enhanced Transformers), a model that retrieves from a 2-trillion-token database at inference time via chunked cross-attention. RETRO achieves performance comparable to GPT-3 with 25× fewer parameters by leveraging retrieved passages, demonstrating that retrieval augmentation is a compute-efficient alternative to scaling.
by Carnegie Mellon University / Together AI
Mamba is a novel sequence modeling architecture based on structured state space models (SSMs). It introduces a selection mechanism that allows the model to selectively propagate or forget information based on the input, overcoming a key limitation of previous SSMs. This enables Mamba to achieve Transformer-level performance with linear time complexity and significantly faster inference.
by DeepSeek
This paper introduces Group Relative Policy Optimization (GRPO), a memory-efficient reinforcement learning algorithm. GRPO enables scalable RLHF-style training by replacing the critic model with group-sampled reward baselines, a technique used to enhance the mathematical reasoning of models like DeepSeekMath.
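The group-sampled baseline can be sketched as follows: several completions are sampled for the same prompt, and each one's advantage is its reward normalized against the group. The helper name, the use of population standard deviation, and the `eps` term are assumptions made for this example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled completion's reward against its own group.
    The group mean plays the role a learned critic would otherwise fill."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four completions sampled from one prompt
# (1.0 = correct final answer, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # → [1.0, -1.0, -1.0, 1.0]
```

Since no value network is trained or stored, memory use drops relative to a critic-based method such as PPO.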
by Google Brain
Introduces Multi-Query Attention (MQA), an efficient attention mechanism for autoregressive decoding. By sharing a single key and value head across all query heads, MQA drastically reduces the size of the KV cache. This leads to significant memory bandwidth savings and faster inference speeds with minimal impact on model quality.
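The KV cache saving is easy to estimate with back-of-envelope arithmetic. The model dimensions below are hypothetical, chosen only to show the ratio; `kv_cache_bytes` is an illustrative helper, not part of any library.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Size of the KV cache: keys and values (factor 2) for every layer,
    KV head, position, and batch element, at bytes_per_value (2 = fp16)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical 32-layer model with 32 query heads of dimension 128, 4096-token context.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)  # one KV head per query head
mqa = kv_cache_bytes(layers=32, kv_heads=1, head_dim=128, seq_len=4096)   # one shared KV head

print(mha // 2**20, "MiB with multi-head attention")   # → 2048 MiB
print(mqa // 2**20, "MiB with multi-query attention")  # → 64 MiB
```

The cache shrinks by a factor equal to the number of query heads, which is why decoding becomes less bound by memory bandwidth.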
by KAUST
CAMEL introduces a novel framework for studying multi-agent cooperation by having AI agents role-play to solve tasks. It utilizes a technique called 'inception prompting' to ensure agents adhere to their assigned personas, enabling the exploration of complex communicative behaviors and societal dynamics within large language models with minimal human guidance.
by AaaS
Extracts structured data from unstructured documents (PDFs, scanned images, email attachments) using optical character recognition with layout-aware parsing. Handles multi-page invoices, varying formats, and poor scan quality — producing structured key-value pairs for downstream reconciliation.
by AaaS
Routes unresolved or high-risk incidents to the appropriate human responder with full diagnostic context. Determines escalation urgency (P1-P5), identifies the correct on-call engineer or team based on service ownership, and packages a complete incident summary (timeline, diagnostics run, hypothesis). A cross-foundry skill reused by Customer Success (F4) and Healthcare (F9) agents.
by AaaS
Provides the ability to translate text from a source language to a target language. It aims to preserve the original meaning, tone, and cultural context. The skill supports domain-specific terminology for fields like legal or medical, allows for register control between formal and informal language, and handles idiomatic expressions with contextually appropriate equivalents.
by AaaS
This skill involves implementing real-time, token-by-token data delivery from Large Language Models to end-users. It utilizes protocols like Server-Sent Events (SSE) or WebSockets to create interactive and responsive applications, such as chatbots or code assistants, by progressively displaying content as it's generated.
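On the wire, Server-Sent Events are plain text frames: each event is one or more `data:` lines followed by a blank line. The sketch below formats a token stream that way. The `[DONE]` end marker is a convention assumed for this example and is not part of the SSE format; the function names are likewise illustrative.

```python
def format_sse(data):
    """Encode one payload as an SSE event. Multi-line payloads become
    several 'data:' lines; a blank line terminates the event."""
    lines = data.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"

def stream_tokens(tokens):
    """Yield one SSE event per generated token, then an end-of-stream marker."""
    for token in tokens:
        yield format_sse(token)
    yield format_sse("[DONE]")  # assumed sentinel so the client knows to stop

# Stand-in for tokens arriving from a model.
events = list(stream_tokens(["Hel", "lo"]))
print(events)  # → ['data: Hel\n\n', 'data: lo\n\n', 'data: [DONE]\n\n']
```

A web framework would send these strings in a response with the `text/event-stream` content type, flushing after each one so the client can render tokens as they arrive.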
by AaaS
Implements the Reasoning + Acting (ReAct) paradigm where LLMs alternate between thinking steps and action steps. The model reasons about what to do next, takes an action (like searching or computing), observes the result, and continues reasoning until the task is complete.
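The think, act, observe loop can be sketched without any model dependency. Here `scripted_model` is a stand-in for an LLM call and `multiply` is a hypothetical tool; both are assumptions made so the example runs on its own.

```python
def react_loop(model, tools, question, max_steps=5):
    """Alternate thought and action steps until the model chooses 'finish'.
    model(transcript) must return (thought, action_name, argument)."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, argument = model(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            return argument, transcript
        observation = tools[action](argument)
        transcript.append(f"Action: {action}[{argument}]")
        transcript.append(f"Observation: {observation}")
    return None, transcript  # step budget exhausted without an answer

def scripted_model(transcript):
    """Stand-in for an LLM: act first, then finish once an observation exists."""
    prefix = "Observation: "
    observations = [line for line in transcript if line.startswith(prefix)]
    if not observations:
        return "I need to compute the product.", "multiply", "6,7"
    return "I have the result.", "finish", observations[-1][len(prefix):]

def multiply(argument):
    left, right = argument.split(",")
    return str(int(left) * int(right))

answer, transcript = react_loop(scripted_model, {"multiply": multiply}, "What is 6 times 7?")
print(answer)  # → 42
print("\n".join(transcript))
```

The `max_steps` cap matters in practice: without it, a model that never emits a finish action would loop indefinitely.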
by AaaS
Prompt Chaining is a technique for executing complex tasks by breaking them into a sequence of smaller, interconnected prompts. The output from one large language model (LLM) call serves as the input for the next, creating a multi-step workflow. This method enables more sophisticated reasoning, state management, and integration with external tools.
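A chain reduces to a loop that feeds each output into the next prompt template. In this sketch `fake_llm` is a stand-in for a real model call, and the `{input}` placeholder name is an assumption of the example.

```python
def run_chain(templates, initial_input, llm):
    """Run prompts in sequence; each step's output fills the next step's {input} slot."""
    output = initial_input
    for template in templates:
        output = llm(template.format(input=output))
    return output

def fake_llm(prompt):
    """Stand-in for a model call: wraps the prompt so the data flow is visible."""
    return f"<{prompt}>"

steps = ["Summarize: {input}", "Translate to French: {input}"]
result = run_chain(steps, "a long report", fake_llm)
print(result)  # → <Translate to French: <Summarize: a long report>>
```

Keeping each step small makes intermediate outputs easy to inspect, validate, or route to an external tool before the next call.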
by AaaS
Enables agents to create structured execution plans for multi-step tasks by analyzing goals, identifying sub-tasks, ordering dependencies, and allocating resources. Supports plan revision when steps fail or new information emerges during execution.
by AaaS
A core AI capability that enables agents to break down complex queries into a sequence of manageable, logical steps. By generating intermediate thoughts and verifying them, this process mimics human reasoning to solve problems that require planning, deduction, and synthesis of information over multiple stages.
by AaaS
Adapts pre-trained language models to specific domains, tasks, or styles through additional training on curated datasets. Covers full fine-tuning, parameter-efficient methods like LoRA and QLoRA, and best practices for dataset preparation, hyperparameter selection, and evaluation.
by Community
A machine learning technique that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging the data itself. It enables collaborative model training by aggregating locally computed updates, thereby preserving data privacy, security, and sovereignty.
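The aggregation step is often implemented as federated averaging (FedAvg), where the server takes a sample-weighted mean of client weights. The entry does not name an algorithm, so FedAvg is an assumption here, and the client values are made up.

```python
def federated_average(client_updates):
    """client_updates: list of (weights, num_samples) pairs, one per client.
    Returns the sample-weighted average of the weight vectors.
    Only weights leave each client; raw training data never does."""
    total_samples = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [
        sum(weights[i] * n for weights, n in client_updates) / total_samples
        for i in range(dim)
    ]

# Hypothetical round: two clients with locally trained 2-parameter models.
updates = [([1.0, 2.0], 1),   # client A trained on 1 sample
           ([3.0, 4.0], 3)]   # client B trained on 3 samples
global_weights = federated_average(updates)
print(global_weights)  # → [2.5, 3.5]
```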
by AaaS DevOps Foundry
Continuously observes CI/CD pipelines, code repositories, and incident logs. Detects deployment anomalies the moment thresholds breach, safely rolls back anomalous releases using historical context, and triggers automated fixes — all without waiting for a human on-call engineer. Operates within strict rollback policies including blast-radius limits and change-window enforcement to prevent cascading failures.
by AaaS
Continuously tracks buyer intent signals across website visits, email engagement, CRM activity, and third-party intent data providers. Scores leads against Ideal Customer Profiles, drafts hyper-personalized outreach sequences based on behavioral signals, and books meetings directly on sales representatives' calendars. Replaces generic outreach spam with data-driven, compliant prospecting that respects CAN-SPAM and GDPR opt-out requirements.
by AaaS
Continuously evaluates live transaction streams against episodic memories of individual user behavior patterns. Detects complex anomaly patterns that rule-based fraud detection systems miss due to high false-positive rates. Autonomously pauses suspicious transactions and triggers secure multi-factor escalation workflows. The agent can only pause and flag — it never approves or releases funds, ensuring a human always makes the final call on flagged transactions.
by AaaS
Holds persistent, long-term memory of historical cloud infrastructure utilization patterns. Autonomously monitors resource usage across regions and cloud providers, identifies idle or underutilized resources, right-sizes instances based on real traffic patterns, and executes cost-saving measures continuously — all without waiting for a human FinOps review. Operates within safety guardrails: never terminates production instances, enforces a 7-day cooldown before right-sizing, and logs all actions with rollback capability.
by AaaS
Tracks subtle drops in product usage, performs sentiment analysis on support tickets, and synthesizes disparate signals into churn risk scores for each account. Preemptively drafts retention plans, schedules proactive check-in calls, and highlights upselling opportunities — all before the critical renewal window opens. Transforms customer success from reactive firefighting into a data-driven retention engine.
by GameAnalytics
A behavioral analytics agent that ingests player telemetry streams to build individual and cohort behavior models, predict churn risk, and surface liveops intervention opportunities. It continuously segments the player base by engagement and monetization propensity, feeding recommendations into targeted push notification, reward, and re-engagement campaign engines.
by project44
An AI agent designed to solve complex vehicle routing problems (VRP) for logistics and supply chain operations. It optimizes multi-stop routes for entire fleets by considering constraints like time windows, vehicle capacity, and traffic. The agent dynamically reroutes in real-time to adapt to new orders, delays, or cancellations.
by NVIDIA
The NVIDIA L40S is a universal data center GPU based on the Ada Lovelace architecture. It features 48GB of GDDR6 memory and combines powerful AI compute, graphics, and media acceleration capabilities, making it a versatile solution for a wide range of workloads from generative AI to professional visualization.
by AaaS
This script provides a foundational Retrieval-Augmented Generation (RAG) pipeline. It handles core tasks like loading documents, splitting text into chunks, generating embeddings, and indexing them into a vector store. It includes a basic query interface, making it ideal for learning the RAG workflow and prototyping simple applications.
by Microsoft
An automated pipeline that leverages Microsoft Presidio to identify and remove personally identifiable information (PII) from text and structured data. It supports configurable entity recognizers for GDPR and HIPAA compliance and features a reversible pseudonymization capability with a secure vault for authorized re-identification.
by AaaS
This script automates the process of fine-tuning large language models using Low-Rank Adaptation (LoRA). It provides an end-to-end workflow, from preparing custom datasets to training lightweight adapters and merging them into a base model for efficient deployment. This enables domain-specific model specialization with significantly reduced computational costs.
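The merge step amounts to adding the scaled low-rank product back into the frozen weight: W' = W + (alpha / r) * B * A. The sketch below shows this with plain lists; the matrices are toy values and the helper names are hypothetical, not the script's actual functions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Fold a LoRA adapter into the base weight: W + (alpha / rank) * (B @ A)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy 2x2 base weight with a rank-1 adapter (B is 2x1, A is 1x2).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.5]]
merged = merge_lora(W, B, A, alpha=2, rank=1)
print(merged)  # → [[2.0, 1.0], [2.0, 3.0]]
```

After merging, inference uses a single weight matrix, so the adapter adds no extra latency at deployment time.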
by Meta AI
Runs Segment Anything Model (SAM 2) or Mask2Former on image batches, producing per-pixel segmentation masks with class labels and confidence scores. Includes utilities for mask overlay visualization and RLE-encoded mask export compatible with COCO annotation format.
by Weights & Biases
Weights & Biases (W&B) is a leading MLOps platform for developers, specializing in experiment tracking, model evaluation, and dataset versioning. It provides tools to visualize model performance, manage datasets, and collaborate on machine learning projects, integrating with popular frameworks like PyTorch and TensorFlow.
by LMSYS / UC Berkeley
LMSYS (Large Model Systems Organization) is a research collective from UC Berkeley known for creating Chatbot Arena—the leading human preference-based LLM evaluation leaderboard—and developing high-performance open-source inference systems including vLLM and FastChat. LMSYS research on Elo-based evaluation and serving efficiency has become foundational to the field.
by EleutherAI
EleutherAI is a decentralized open-source AI research collective best known for training and releasing the GPT-Neo, GPT-J, GPT-NeoX, and Pythia model families, as well as developing the LM Evaluation Harness—the standard benchmarking framework for language models. The organization operates as a grassroots nonprofit committed to open and reproducible AI research.
by Hsieh et al. / NVIDIA
RULER is a synthetic benchmark for evaluating large language models in long-context scenarios, scaling from 4K to 128K tokens. It assesses complex skills like multi-hop retrieval, aggregation, and coreference resolution, offering a more nuanced analysis than simple 'needle-in-a-haystack' tests.
by Agostinelli et al. / Google DeepMind
MusicCaps is a benchmark dataset of 5,521 music clips from AudioSet, each paired with a detailed text description written by professional musicians. It is primarily used for evaluating text-to-music generation models, as well as for music captioning, retrieval tasks, and fine-tuning audio-language models.
by Pal et al. / IIT Kanpur
MedMCQA is a massive multiple-choice question dataset sourced from Indian medical entrance examinations like AIIMS and NEET-PG. It contains over 194,000 questions covering 2,400 healthcare topics, designed to rigorously test a model's breadth of medical knowledge and reasoning abilities across multiple subjects.
by LMSYS
Chatbot Arena Hard is a static benchmark composed of 500 challenging prompts curated from Chatbot Arena. It is designed to rigorously evaluate and differentiate the capabilities of large language models. The benchmark utilizes an automated judging system, typically employing a powerful model like GPT-4, to provide a quick, reproducible proxy for human preference.
by CVC Barcelona
DocVQA is a large-scale dataset and benchmark for Visual Question Answering on document images. It challenges models to answer questions by reading and interpreting text, understanding layouts, and reasoning about information within complex documents like forms, invoices, and reports. It serves as a standard for evaluating document intelligence systems.
by Meta AI
CyberSecEval is a benchmark developed by Meta to assess the cybersecurity risks associated with Large Language Models (LLMs). It evaluates a model's propensity to generate insecure code, assist in exploiting vulnerabilities, and facilitate attacks, helping safety teams quantify the dual-use risk of code-capable models.
by Parrish et al. / NYU
BBQ is a question-answering benchmark designed to expose social biases in language models. It uses ambiguous and disambiguated questions related to nine protected categories to measure a model's tendency to rely on harmful stereotypes when context is lacking versus its ability to answer correctly when enough information is provided.
by MAA
A highly challenging benchmark for evaluating the mathematical reasoning of frontier AI models. It uses 30 problems from the 2024 American Invitational Mathematics Examination (AIME), which are designed to test creative problem-solving, multi-step deduction, and knowledge across number theory, geometry, algebra, and combinatorics.
by Google Research
TyDi QA is a benchmark for question answering across 11 typologically diverse languages. It features information-seeking questions written by native speakers who have not seen the answer, ensuring real-world applicability. This design challenges models to generalize beyond high-resource, typologically similar languages.
by Allen Institute for AI (AI2)
Tulu V2 Mix is a curated 326,000-sample mixture of instruction-tuning datasets from AI2. It blends diverse sources like FLAN, Open Assistant, and Code Alpaca to train the Tulu 2 model family. The dataset serves as a benchmark for analyzing the impact of different data sources on model performance and quality.
by Microsoft
Phi-1 TextBooks is a synthetic dataset of Python coding textbooks and exercises generated by GPT-3.5 and GPT-4. It was created to pretrain Microsoft's Phi-1 small language model, demonstrating that high-quality, curriculum-style data can significantly boost the coding abilities of smaller models compared to training on general web data.
by Google DeepMind / Consortium
Open X-Embodiment (OXE) is a massive robotics dataset combining over 1 million demonstration episodes from 22 distinct robot embodiments. It covers 527 skills and is designed to train generalist robot policies that can transfer skills across diverse hardware, serving as a key resource for vision-language-action models.
by NVIDIA
OpenMathInstruct is a large-scale, synthetic dataset by NVIDIA featuring 1.8M+ math problem-solution pairs. Generated by Mixtral models and verified for correctness, it provides reliable, step-by-step reasoning chains for training and fine-tuning language models on diverse mathematical topics, from arithmetic to competition math.
by Hugging Face / BigCode
The GitHub Code Dataset is a massive, multilingual collection of source code from public GitHub repositories, spanning 32 programming languages. Distributed via Hugging Face under the BigCode project, it provides a foundational resource for pretraining large language models on diverse code-related tasks, from generation to analysis.
by Zhiyu Chen et al. / University of California Santa Barbara
FinQA is a large-scale dataset for numerical reasoning over financial data, containing over 8,000 question-answer pairs from S&P 500 earnings reports. Each question requires multi-step reasoning across both unstructured text and structured tables, making it a challenging benchmark for financial AI systems.
by European Court of Human Rights / CJEU
The EU Court Decisions dataset aggregates judgments from the European Court of Human Rights (ECHR) and the Court of Justice of the European Union (CJEU), covering tens of thousands of decisions in multiple EU languages with structured metadata. It is widely used for multilingual legal NLP research, legal judgment prediction, and cross-lingual information retrieval.
by BioASQ Consortium
The BioASQ dataset is a benchmark for biomedical semantic indexing and question answering. It contains thousands of expert-annotated questions (factoid, list, yes/no, summary) paired with relevant PubMed articles, concepts, and ideal answers, designed to train and evaluate advanced NLP systems in the medical domain.
by Alibaba Cloud
Alibaba Cloud's most capable proprietary model in the Qwen 2.5 family, optimized for complex reasoning and enterprise applications. Available exclusively through Alibaba Cloud's Model Studio API with enhanced safety and alignment.
by Mistral AI
Mixtral 8x22B is a large-scale, open-source sparse Mixture-of-Experts (MoE) model from Mistral AI. It has 141 billion total parameters but activates only about 39 billion per token, so inference costs far less than for a dense model of the same size. The model excels at reasoning, code generation, and multilingual tasks, and includes native function calling capabilities.
by Anyscale
Ray Serve deploys scalable model serving applications on Google Cloud Platform using GKE and Vertex AI infrastructure, with Ray's distributed runtime managing replica placement, traffic splitting, and resource scheduling across GPU node pools. The integration supports multi-model serving graphs, A/B rollouts, and seamless scale-to-zero on GCP Spot instances for cost optimization.
by Anthropic / Slack
This integration connects MCP-compatible AI agents, such as Claude, directly to a Slack workspace. It enables programmatic control over Slack functionalities, allowing agents to read channel histories, post messages, manage channels, and look up user information. The connection is authenticated using a Slack Bot token for secure, automated communication.
by Anthropic / Brave
An integration that connects the Model Context Protocol (MCP) with Brave's independent search index. It equips AI agents, like Claude, with tools for real-time web, local, and news searches, offering a privacy-focused alternative to Google and Bing for data retrieval and grounding.
by Langfuse
Langfuse integrates with LlamaIndex to provide open-source observability for LLM applications. A simple callback handler captures detailed traces of query engines, retrievers, and LLM calls. This data, including token usage, latency, and custom scores, is visualized in a self-hostable dashboard for comprehensive monitoring.
by LangChain
LangChain integration for Weaviate's open-source vector database. Supports hybrid search (BM25 + vector), multi-tenancy, and generative search modules within LangChain chains and agents. Connects via the Weaviate Python client inside the langchain-weaviate package.
by Helicone
Helicone is an observability platform for LLMs that acts as a proxy for the OpenAI API. It enables developers to monitor usage, track costs, and optimize performance with minimal code changes. Key features include real-time dashboards, request-level caching, rate-limiting, and detailed analytics.
by Anthropic
This research paper from Anthropic introduces a method using sparse autoencoders to decompose the internal activations of a transformer model. It successfully extracts thousands of interpretable, monosemantic features, demonstrating that the superposition of concepts within neurons can be untangled.
by DeepMind
This research paper proposes a method for aligning advanced AI systems by using recursive reward modeling. The approach leverages AI assistants to help human evaluators assess complex AI actions, enabling scalable oversight and positioning this technique alongside debate and amplification as key AI safety strategies.
by Center for AI Safety / UC Berkeley
Representation Engineering (RepE) is a top-down AI transparency technique for interpreting and controlling Large Language Models. It uses linear probes on activation differences from contrastive prompts to identify and manipulate high-level concepts like truthfulness and emotion without needing to retrain or fine-tune the model.
by LMSYS / UC Berkeley
Introduces LMSYS-Chat-1M, a large-scale dataset of one million real-world conversations with 25 state-of-the-art LLMs collected from the Chatbot Arena platform. Analysis reveals diverse usage patterns, safety violations, and human preference signals, making it a valuable resource for safety evaluation, capability assessment, and alignment research.
by Tsinghua University / Zhipu AI
CogVLM is a vision-language model that adds visual understanding to pretrained large language models (LLMs). It introduces a trainable visual expert module into each layer of a frozen LLM, enabling deep fusion of image and text features. This approach achieves state-of-the-art results on numerous vision-language benchmarks without altering the original language model's parameters.
by EPFL / Multiple Institutions
This paper presents a systematic review of 84 prominent AI ethics guidelines from around the world. It identifies a global convergence on five key ethical principles, including transparency and justice, but reveals significant divergence in how these principles are interpreted and operationalized across different sectors and regions.
by AaaS
Classifies the emotional tone and sentiment polarity of customer text communications, including support tickets, survey responses, chat logs, and social mentions. Produces sentiment scores with confidence levels, so that churn-prevention and coaching agents can identify dissatisfied accounts before explicit complaints surface.
by AaaS
Classifies support tickets by category, urgency, and required expertise, then routes them to the correct queue or human agent. Handles auto-resolution for simple cases and escalates complex ones with full context summaries.
by AaaS
Assigns numerical scores to leads based on demographic fit, firmographic match, behavioral engagement, and intent signals. Enables agents to rank prospects by conversion likelihood and route high-scoring leads to immediate outreach while nurturing lower-scoring ones.
by AaaS
Retrieves relevant articles, documentation, and policy information from knowledge bases in response to real-time queries. Uses hybrid search (keyword + semantic) with cross-encoder reranking to surface the most contextually appropriate content for support and coaching agents.
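One common way to merge keyword and semantic result lists before reranking is reciprocal rank fusion (RRF). The sketch below shows only that fusion step, with hypothetical document IDs and the conventional constant k=60. The entry does not say which fusion method this skill uses, so RRF is a stand-in, and the cross-encoder reranking stage is omitted.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword (BM25-style) and a vector search.
keyword_hits = ["kb-12", "kb-07", "kb-31"]
semantic_hits = ["kb-07", "kb-44", "kb-12"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is the behavior a hybrid retriever wants before handing a short candidate list to a more expensive reranker.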
by AaaS
Evaluates real-time metrics against configurable thresholds (SLOs, SLIs, error budgets) and triggers appropriate responses. Supports static thresholds, dynamic baselines, and anomaly-based detection. Distinguishes between noise and genuine threshold breaches using historical context and burn-rate analysis.
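Burn-rate analysis can be sketched as follows: the burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and an alert fires only when both a short and a long window exceed a threshold, which filters out brief noise. The function names and the 14.4 threshold (a commonly cited value for a 99.9% SLO over 30 days) are assumptions for illustration, not this skill's actual configuration.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 1.0")
    return error_ratio / budget


def should_page(short_window_errors: float, long_window_errors: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn budget faster than the threshold."""
    return (burn_rate(short_window_errors, slo_target) >= threshold
            and burn_rate(long_window_errors, slo_target) >= threshold)


# A sustained 2% / 1.6% error ratio against a 99.9% SLO is a real breach.
print(should_page(0.02, 0.016, slo_target=0.999))
# A short spike with a quiet long window is treated as noise.
print(should_page(0.02, 0.001, slo_target=0.999))
```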
by AaaS
Continuously observes deployment pipelines and post-deploy health metrics. Detects anomalous deployment patterns (elevated error rates, latency spikes, failed health checks) within seconds of release. Integrates with canary and blue-green deployment strategies to provide real-time go/no-go signals based on configurable thresholds.
by Community
A process for creating artificial data that mimics the statistical properties and patterns of real-world datasets. It employs techniques like GANs, VAEs, and diffusion models to generate new data points, addressing issues of data scarcity, privacy, and imbalance. This enables robust model training and testing where real data is unavailable or sensitive.
by Community
Combines data from multiple heterogeneous sensors — cameras, LiDAR, radar, GPS, IMU — using probabilistic filters and deep learning to produce a unified, accurate state estimate of the environment. Foundational for autonomous vehicles, drones, and any robot requiring robust situational awareness.
by Community
Enables robots to interpret their surroundings by processing and fusing data from sensors like cameras, LiDAR, and IMUs. This capability allows machines to build environmental models, detect and track objects, and determine their own position and orientation (localization). It is a cornerstone of autonomous navigation and interaction.
by AaaS
Identifies and classifies named entities (people, organizations, locations, dates, etc.) within unstructured text. Supports custom entity types, relationship extraction between entities, and structured output formatting for downstream processing.
by AaaS
Multi-Agent Coordination involves designing systems where multiple autonomous agents collaborate to achieve a common goal. This skill encompasses architectural patterns like hierarchical supervision and peer-to-peer negotiation for task distribution and conflict resolution. It focuses on managing shared information and ensuring coherent collective action in complex, dynamic environments.
by AaaS
Covers semantic, instance, and panoptic segmentation techniques that enable agents to produce pixel-level masks for scene understanding. Includes practical guidance on using SAM 2, Mask R-CNN, and integrating segmentation outputs into multimodal pipelines.
by AaaS
This skill involves computing and communicating which input features most influenced a model's prediction. It leverages methods like SHAP, LIME, and Integrated Gradients for tabular, text, and image data. The core focus is on generating local and global explanations and presenting them visually for both technical and non-technical audiences.
by Community
Provides mathematically rigorous privacy guarantees by adding calibrated noise to query outputs or model gradients, ensuring individual data points cannot be inferred from published statistics or trained models. The de facto standard for privacy-preserving data analysis and compliant ML training.
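The idea of adding calibrated noise to a query output can be illustrated with the Laplace mechanism applied to a counting query. This is a minimal sketch using only the standard library; the helper names and sample data are hypothetical, and a production system would use a vetted differential-privacy library rather than hand-rolled sampling.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical data: how many people are 40 or older?
ages = [23, 41, 37, 58, 44, 19]
rng = random.Random(0)
print(private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng))
```

A smaller epsilon gives stronger privacy but a larger noise scale, so the released count becomes less accurate.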
by AaaS
A system that automatically screens text inputs and outputs for large language models (LLMs) to detect and manage harmful content. It uses multi-category classification to identify issues like toxicity, hate speech, and violence, applying configurable rules and thresholds to enforce safety policies and protect users.
by AaaS
Provides detailed, multi-level explanations for code snippets, functions, or entire repositories. It breaks down complex algorithms, clarifies control flow, and describes the purpose of variables and dependencies. The skill supports numerous programming languages, generating documentation-style overviews or granular, line-by-line analyses to accelerate learning and code reviews.
by AaaS
Monitors live campaign performance across Google Ads, Meta, LinkedIn, and other advertising channels continuously. Automatically reallocates budgets to highest-performing channels, dynamically personalizes ad copy for different audience segments, and kills underperformers before they waste spend. Operates within configurable budget guardrails including max daily spend caps, minimum ROAS thresholds, and A/B test significance gates.
by Vapi
Vapi AI is a developer-first platform for building and deploying real-time, conversational voice agents. It provides low-latency streaming, interruptible speech, and seamless integrations with various LLM, TTS, and STT providers. The platform is designed for developers to create sophisticated voice experiences with features like function calling and call analytics.
by PagerDuty
Monitors service-level agreement (SLA) compliance by tracking response and resolution times for all active tickets. The agent proactively alerts teams about potential SLA breaches, allowing them to act before a violation occurs. It can also automatically reprioritize ticket queues based on urgency and generates regular SLA performance reports for management.
by Landing AI
An AI agent that uses computer vision to perform real-time quality inspections on manufacturing lines. It automatically detects, classifies, and logs surface defects, dimensional inaccuracies, and assembly errors at production speed, triggering alerts or reject mechanisms to prevent faulty products from proceeding.
by BlackRock Aladdin
An advanced AI agent for constructing and managing investment portfolios. It leverages quantitative techniques like mean-variance optimization and the Black-Litterman model to align portfolios with specific investor goals, risk tolerances, and constraints such as ESG mandates, while continuously monitoring for drift and executing tax-efficient rebalancing.
by Arterys
Deep-learning agent that analyzes DICOM medical images across modalities — CT, MRI, X-ray, and PET — to surface anomalies, measure lesions, and generate structured findings. Integrates directly into PACS workflows and flags priority studies for radiologist review.
by Elicit AI
An AI-powered agent designed to automate systematic literature reviews. It queries major academic databases like PubMed and arXiv to identify, screen, and synthesize evidence from thousands of papers. The agent produces structured outputs including evidence tables, meta-analysis plots, and PRISMA-compliant reports with bias assessments.
by Apple
Apple M4 Ultra's 32-core Neural Engine capable of 38 TOPS, embedded in Apple's highest-end desktop and workstation chips. Combined with up to 192GB unified memory shared between CPU, GPU, and Neural Engine, it enables running large models locally on macOS with exceptional energy efficiency.
by Community
WebSocket server that proxies token-by-token LLM streaming to multiple simultaneous clients, with connection lifecycle management, heartbeat keep-alives, and per-session context persistence. Supports fan-out broadcasting for collaborative AI sessions and reconnection with message replay.
by pyannote
This script automates the process of creating turn-by-turn transcripts from multi-speaker audio files. It first uses the pyannote.audio library to perform speaker diarization, identifying who spoke and when. These speaker segments are then aligned and merged with a transcription generated by OpenAI's Whisper, producing a final text output that attributes each line of dialogue to a specific speaker.
by Community
Packages a trained ML model into a serverless function on AWS Lambda, Modal, or Google Cloud Run, handling cold-start optimization, dependency layering, and auto-scaling configuration. Includes health-check endpoints, structured logging, and a GitHub Actions workflow for automated rollout.
by Neo4j
Implements a GraphRAG pattern that stores document entities and relationships in Neo4j, then retrieves contextually relevant subgraphs at query time before passing them to an LLM. Includes automatic entity extraction with spaCy, relationship inference, and a Cypher query generator.
by Community
This script generates a production-ready chatbot foundation using Rasa for structured dialogue and an LLM for open-ended fallback. It provides a unified channel adapter for deploying to Web, WhatsApp, and Slack, and includes built-in conversation analytics and a Streamlit-based testing environment for rapid development.
by Groq
Groq is a semiconductor company that developed the Language Processing Unit (LPU), a custom chip for ultra-fast AI inference. Their managed API provides some of the fastest publicly available LLM inference speeds, often exceeding 800 tokens/second, making it ideal for latency-sensitive applications.
by Allen Institute for AI
The Allen Institute for AI (AI2) is a nonprofit research institute focused on high-impact, open-source AI. Founded by Paul Allen, it produces foundational models like OLMo, influential datasets such as Dolma, and reasoning benchmarks such as ARC. Its Semantic Scholar platform provides AI-powered discovery across 200M+ academic papers.
by Bai et al. / Tsinghua University
LongBench is a comprehensive bilingual benchmark designed to evaluate the long-context understanding capabilities of large language models in English and Chinese. It comprises 21 diverse tasks, including single and multi-document QA, summarization, and code completion, with an average context length of over 6,700 words for the English tasks to rigorously test model performance on extended inputs.
by Meta / Hugging Face
GAIA (General AI Assistants) is a benchmark for evaluating AI models on complex, real-world tasks. It features questions with unambiguous factual answers that require sophisticated capabilities like multi-step reasoning, web browsing, and tool use. GAIA is designed to test the practical limits of general-purpose AI assistants.
by Islam et al. / Patronus AI
FinanceBench is a benchmark designed to evaluate the financial question-answering capabilities of Large Language Models. It uses publicly available corporate documents like 10-K filings and earnings reports to test models on information retrieval, numerical reasoning, and multi-step financial calculations, providing a standardized testbed for financial AI.
by BUET (Bangladesh University of Engineering and Technology)
XL-Sum is a massive multilingual dataset for abstractive summarization. It consists of over 1 million article-summary pairs scraped from BBC News, covering 44 different languages. This diversity makes it a crucial resource for developing and evaluating cross-lingual and multilingual summarization models.
by Gerasimos Spanakis / Maastricht University
The Legal-BERT training corpus is a large collection of English legal text assembled from UK legislation, EU legislation, ECHR/ECLI court decisions, and US contracts specifically curated to pretrain domain-adapted BERT models. It has enabled a family of Legal-BERT models that significantly outperform general-domain language models on legal NLP tasks.
by LAION
The text caption component of the LAION-400M dataset, offering 400 million English alt-text captions. These captions were scraped from the web and filtered using CLIP to ensure a minimum similarity to their corresponding images. The text is used independently for large-scale NLP and multimodal research.
by University of Oregon
CulturaX is a massive, cleaned multilingual text corpus containing 6.3 trillion tokens across 167 languages. It was created by combining, deduplicating, and filtering the mC4 and OSCAR datasets using language model-based quality scoring. This makes it one of the largest and cleanest public datasets for pre-training large language models.
by Google
CC12M is a large-scale dataset by Google containing 12 million image-text pairs from the web. It was created with a less restrictive filtering process than its predecessor, CC3M, to achieve greater scale and diversity. This makes it a widely used resource for vision-and-language pretraining.
by OpenAI
Sora is a text-to-video diffusion transformer model by OpenAI that generates high-fidelity, minute-long videos from textual prompts. It demonstrates an advanced understanding of language and the physical world, enabling complex scenes with multiple characters, specific motions, and coherent narratives.
by Cohere
Cohere Rerank v3 is a state-of-the-art neural model designed to significantly boost the relevance of search results for Retrieval-Augmented Generation (RAG) systems. It re-scores a list of candidate documents from any keyword or vector search system, identifying the most pertinent information. It supports over 100 languages and can process long documents, making it highly versatile.
by Microsoft
Microsoft's Phi-3 Mini is a 3.8 billion parameter small language model (SLM) designed for high performance on resource-constrained devices. Despite its compact size, it exhibits strong reasoning and language understanding capabilities, making it suitable for on-device and edge AI applications. It is optimized for efficient inference.
by Meta AI
MusicGen is an open-source text-to-music model from Meta AI that generates high-quality instrumental music from text descriptions. It can also be conditioned on a melody reference, providing a strong, controllable baseline for both research and commercial applications, trained on 20K hours of licensed music.
by Microsoft Research
Multilingual-E5-Large is a powerful text embedding model from Microsoft supporting 100 languages. Trained on billions of text pairs using contrastive learning, it excels at cross-lingual information retrieval and semantic similarity, establishing a strong open-source baseline for multilingual NLP tasks.
by Codeium
Windsurf (by Codeium) is an AI-native IDE that integrates Anthropic's Claude models as the backbone of its Cascade agent, which autonomously plans and executes multi-step coding tasks with real-time file and terminal access. The Anthropic integration powers deep context awareness across large codebases and supports long-horizon agent tasks with coherent state tracking.
by Unstructured / Pinecone
This integration provides a direct pipeline from Unstructured's data transformation service to the Pinecone vector database. It automates extracting, cleaning, and chunking data from documents like PDFs and DOCX, then embeds and indexes the content into a Pinecone namespace for use in RAG applications.
by Tabnine
Tabnine's VS Code extension provides AI-powered code completions, including whole-line and full-function suggestions. It is designed for enterprises with strict privacy and data-residency needs, offering on-premise or private cloud deployment options. The AI can be trained on a team's specific codebase for highly relevant completions.
by Portkey
Portkey's AI gateway unifies over 200 LLM providers through a single OpenAI-compatible API. It enables automatic fallbacks, load balancing, and semantic caching to improve reliability and performance. The platform provides full observability, capturing detailed cost, latency, and metadata for every request.
by Anthropic
Official MCP Puppeteer server providing headless Chrome browser control to MCP clients. Exposes tools for page navigation, element interaction, form filling, screenshot capture, and JavaScript execution, enabling Claude to automate complex web workflows that require a real browser environment.
by Anthropic
This integration provides a secure, read-only connection to a PostgreSQL database within the MCP environment. It allows agents to perform database introspection, such as listing schemas and describing tables. A key feature is its ability to facilitate natural-language-to-SQL workflows, enabling users to ask questions in plain English and have them translated into safe, read-only SELECT queries for execution.
by LangChain
Integrate LangChain with Ollama for fully local LLM inference. This allows developers to use models like Llama 3 and Mistral on their own hardware, ensuring data privacy by eliminating external API calls. It's ideal for building offline-capable, privacy-sensitive applications.
by Community
Cline is an open-source VS Code extension that provides an AI agent with direct access to the IDE's environment. It enables multi-step agentic workflows by allowing the AI to use the file system, terminal, and an integrated browser. The extension supports various models and includes a human-in-the-loop approval process for safety.
by Microsoft
Integrate the AutoGen multi-agent framework with Azure OpenAI Service to build sophisticated, enterprise-grade AI applications. This connector enables developers to leverage Azure's security features, including RBAC and private endpoints, while using all standard AutoGen agents like AssistantAgent and UserProxyAgent for complex, collaborative tasks.
by Arize AI
Arize Phoenix integrates with LangChain to provide deep observability for LLM applications. By leveraging OpenTelemetry, it captures and streams traces for chains, agents, and retrievers to a local UI or the Arize cloud. This enables developers to debug applications, detect embedding drift, score retrieval quality, and analyze hallucinations at the span level.
by Google DeepMind / Multiple Institutions
Open X-Embodiment aggregates 527 robot skills from 22 different robot embodiments across 21 institutions into a unified dataset, enabling the training of RT-X models that transfer across robot types. This collaborative effort establishes a foundation for generalist robotic policies.
by Google Research
This paper introduces Med-PaLM 2, a large language model fine-tuned on medical data. It achieves expert-level performance on medical licensing exam questions, demonstrating clinical reasoning comparable to physicians, and proposes a framework for evaluating the safety and alignment of medical AI systems.
by Stanford CRFM
Presents HEIM, a comprehensive framework for evaluating text-to-image models across 12 aspects like alignment, quality, aesthetics, bias, and toxicity. The study benchmarks 26 models, revealing that no single model excels in all areas and highlighting significant safety gaps in current generative AI.
by Meta AI / University College London
Atlas is a retrieval-augmented language model designed for few-shot learning. It uniquely pre-trains its retriever and language model components jointly, enabling it to effectively leverage external knowledge documents. This approach allows Atlas to achieve state-of-the-art few-shot performance on knowledge-intensive NLP benchmarks like MMLU, outperforming much larger models.
by AaaS
Routes transactions, documents, and exceptions through configurable multi-step approval chains based on amount thresholds, risk levels, and organizational policies. Tracks approver actions with timestamps, sends reminders for pending items, and escalates stalled approvals — ensuring no payment or commitment is authorized without the required sign-offs.
by AaaS
Tracks product usage patterns over time — login frequency, feature adoption, session duration, and activity drops. Identifies accounts showing declining engagement that correlate with churn risk, enabling proactive retention before the customer disengages.
by AaaS
Processes customer refunds through payment gateway APIs within configurable monetary caps. Enforces refund policies (max per transaction, max per day, cooling periods) and generates immutable audit trails for every refund action. Escalates requests above caps to human approval.
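The policy checks described above might look like the following sketch. The `RefundPolicy` fields, the `decide_refund` function, and the `"hold"` outcome for cooling periods are hypothetical names chosen for illustration. The entry states only that requests above caps are escalated, so how a cooling-period violation is handled here is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RefundPolicy:
    max_per_transaction: float
    max_per_day: float
    cooling_period: timedelta


def decide_refund(amount, now, history, policy):
    """history: list of (timestamp, amount) refunds already issued to the customer."""
    if amount > policy.max_per_transaction:
        return "escalate"  # above the per-transaction cap
    refunded_last_day = sum(a for t, a in history if now - t < timedelta(days=1))
    if refunded_last_day + amount > policy.max_per_day:
        return "escalate"  # would exceed the rolling daily cap
    if any(now - t < policy.cooling_period for t, _ in history):
        return "hold"  # assumed outcome: wait out the cooling period
    return "approve"


policy = RefundPolicy(100.0, 250.0, timedelta(hours=1))
now = datetime(2024, 1, 2, 12, 0)
history = [(datetime(2024, 1, 2, 9, 0), 200.0)]
print(decide_refund(40.0, now, history, policy))
print(decide_refund(60.0, now, history, policy))
```

Every decision, including the inputs that produced it, would then be appended to the audit trail; that part is left out of the sketch.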
by AaaS
Matches extracted invoice data against Purchase Orders and receipt logs in ERP systems using deterministic matching rules (PO number, vendor, amount, line items). Handles partial matches, tolerance thresholds, and multi-line reconciliation. Routes exceptions to human queues with full mismatch details.
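A minimal sketch of deterministic matching with a tolerance threshold follows. The record fields (`po_number`, `vendor`, `amount`), the 1% default tolerance, and the return shape are assumptions for illustration. Line-item and receipt matching are omitted.

```python
def match_invoice(invoice, purchase_orders, tolerance=0.01):
    """Match one invoice against a dict of purchase orders keyed by PO number.

    Returns a status plus the list of mismatch reasons for the human queue.
    """
    po = purchase_orders.get(invoice["po_number"])
    if po is None:
        return {"status": "exception", "reasons": ["unknown PO number"]}

    reasons = []
    if po["vendor"] != invoice["vendor"]:
        reasons.append("vendor mismatch")
    # Amounts may differ by up to `tolerance` as a fraction of the PO amount.
    allowed = abs(po["amount"]) * tolerance
    if abs(po["amount"] - invoice["amount"]) > allowed:
        reasons.append("amount mismatch")

    status = "matched" if not reasons else "exception"
    return {"status": status, "reasons": reasons}


# Hypothetical ERP data.
pos = {"PO-1001": {"vendor": "Acme", "amount": 1000.0}}
ok = match_invoice({"po_number": "PO-1001", "vendor": "Acme", "amount": 1005.0}, pos)
bad = match_invoice({"po_number": "PO-1001", "vendor": "Acme", "amount": 1020.0}, pos)
print(ok, bad)
```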
by AaaS
Accesses multiple participants' calendars simultaneously and finds optimal meeting times across time zones, working hours, and scheduling constraints. Handles rescheduling, cancellations, and conflict resolution autonomously.
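Finding a common meeting time reduces to sweeping over everyone's busy intervals and looking for a long enough gap. The sketch below assumes all times have already been converted to minutes since midnight UTC and that the search window already reflects the participants' shared working hours. The function name and data are hypothetical.

```python
def first_common_slot(busy_by_person, window, duration):
    """Return the earliest (start, end) slot free for everyone, or None.

    busy_by_person maps a name to a list of (start, end) busy intervals.
    """
    window_start, window_end = window
    busy = sorted(iv for ivs in busy_by_person.values() for iv in ivs)
    cursor = window_start  # earliest time not yet known to be blocked
    for b_start, b_end in busy:
        if b_start >= window_end:
            break
        if b_start - cursor >= duration:
            return (cursor, cursor + duration)
        cursor = max(cursor, b_end)
    if window_end - cursor >= duration:
        return (cursor, cursor + duration)
    return None


# 09:00-17:00 UTC window, 30-minute meeting.
calendars = {
    "alice": [(540, 600), (660, 720)],
    "bob": [(570, 690)],
}
slot = first_common_slot(calendars, window=(540, 1020), duration=30)
print(slot)
```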
by AaaS
Ingests and analyzes telemetry data (metrics, traces, spans) from distributed systems. Correlates performance data across service boundaries using distributed tracing, identifies bottleneck services, and produces latency breakdowns. Provides the observability foundation that SRE Triage and Latency Budget Planner agents depend on.
by AaaS
Executes safe, policy-constrained rollbacks of failed deployments. Respects blast-radius limits (max affected services), rate limits (max rollbacks per hour), and change-window constraints. Supports multiple rollback strategies: Git revert, container image pinning, feature flag disabling, and traffic shifting. Produces a detailed rollback report with root cause hypothesis.
by AaaS
Generates complete, well-structured pull requests including: descriptive title, detailed body with change rationale, test results summary, dependency diff, and reviewer assignments. Follows the organization's PR template and conventional commit conventions. Produces PRs that human reviewers can approve quickly because all context is pre-packaged.
by AaaS
Constructs complete dependency graphs across package managers (npm, pip, cargo, Maven) and internal modules. Identifies version conflicts, circular dependencies, security-vulnerable transitive dependencies, and upgrade paths. Produces actionable dependency health reports that inform both the Codebase Architect and Dependency Guardian agents.
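One of the checks mentioned above, circular-dependency detection, can be illustrated with a depth-first search over an adjacency map. This is a generic sketch: the graph representation (package name mapped to a list of direct dependencies) is an assumption, and nothing here reflects how the listed skill is actually built.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if acyclic.

    graph maps each package to the packages it depends on directly.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in graph.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path: cycle found.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

For very deep graphs an explicit stack would avoid Python's recursion limit; the recursive form is used here only for readability.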
by AaaS
Empowers autonomous agents to interact with the web like a human user. This skill provides the core functionality to navigate to URLs, render pages including executing JavaScript, and parse DOM elements. It enables complex workflows such as filling out forms, clicking buttons, and extracting structured data for analysis or task completion.
by AaaS
Applies a cross-encoder or LLM-based reranker to refine initial retrieval results by scoring query-document pairs for relevance. Improves precision by promoting the most contextually relevant passages to the top of the result set.
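The rerank step itself is small once a pairwise scoring function exists. In the sketch below, `overlap_score` is a toy word-overlap scorer that only stands in for a real cross-encoder or LLM call; the function names and the `top_k` default are hypothetical.

```python
def overlap_score(query, doc):
    # Toy stand-in for a cross-encoder: fraction of query words found in doc.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def rerank(query, candidates, score_fn=overlap_score, top_k=3):
    """Score every (query, document) pair and keep the top_k documents."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

Swapping in a learned model only requires passing a different `score_fn`; the sorting and truncation stay the same.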
by Community
Motion Planning is the process of generating a valid trajectory for an autonomous system, such as a robot arm or self-driving car, from a starting state to a desired goal state. It computes a collision-free path that respects the system's kinematic and dynamic constraints, effectively bridging perception with physical action.
by Community
Builds structured knowledge graphs from unstructured text and semi-structured sources through entity recognition, relation extraction, coreference resolution, and entity linking. The resulting graphs power question answering, search, recommendation, and reasoning applications.
by Community
Adapts models to new target domains using only a handful of labeled examples, combining meta-learning, prompt engineering, and prototype-based methods. Critical for enterprise deployments where labeled data is scarce or expensive to acquire.
by AaaS
Securely accesses the calendar APIs of hiring managers, executives, and candidates simultaneously. Autonomously negotiates complex scheduling matrices across time zones, manages sudden cancellations and rescheduling, and sends contextual briefing documents with candidate summaries prior to each meeting. Reduces time-to-schedule from days to minutes, eliminating the email-thread coordination bottleneck.
by AaaS DevOps Foundry
Maps the entire dependency tree across an organization's codebases, tests library updates in isolated sandbox environments, writes localized unit tests to verify compatibility, and submits fully validated pull requests that respect architectural constraints. Prevents the cascade of breaking changes that plagues manual dependency updates, and avoids the failure mode in which an LLM that takes a prompt literally introduces version conflicts or accidentally removes necessary features.
by AaaS
Analyzes lengthy Master Service Agreements using comprehensive redlining playbooks, contextual clause analysis, and dynamic risk scoring. Evaluates interactions between indemnification provisions and liability caps — something generic LLMs miss because they analyze clauses in isolation. Maintains multi-document consistency across related agreements. Cuts contract review cycles by 60-75%. The agent redlines and scores; a human attorney approves or modifies every recommendation.
by AaaS
Performs deep semantic analysis of resumes, identifying non-obvious skill adjacencies, transferable experiences from non-traditional backgrounds, and behavioral intent signals. Surfaces high-potential candidates that keyword-based ATS parsers and human recruiters routinely overlook. Every screening batch runs through an automated disparate-impact bias audit to ensure compliance with anti-discrimination laws. The agent ranks and recommends; humans make the final screening decision.
by Inworld AI
An AI agent that uses reinforcement learning (RL) to generate dynamic NPC behaviors. Instead of relying on static scripts, it learns complex strategies through self-play and interaction, adapting its difficulty and tactics in real time to match a player's skill level, ensuring a consistently challenging and unpredictable experience.
by Moveworks
Moveworks is an enterprise AI copilot platform that automates employee support. It uses conversational AI to understand and resolve requests across IT, HR, finance, and other departments directly in collaboration tools like Slack and Microsoft Teams, reducing the need for manual intervention.
by Harvey AI
An AI agent that automates the creation of legal documents by leveraging structured data, template libraries, and firm-specific style guides. It generates jurisdiction-compliant agreements, pleadings, and regulatory filings, incorporating precedents and flagging potential issues for attorney review. The system streamlines drafting workflows, ensuring consistency and accuracy.
by JetBrains
JetBrains' AI coding agent deeply integrated into IntelliJ-based IDEs. Leverages the full power of JetBrains' code analysis engine, refactoring tools, and project model for context-aware code generation and transformation.
by Blue Yonder
This autonomous agent optimizes inventory management by analyzing real-time demand and supply chain data. It dynamically adjusts reorder points, safety stock, and order quantities to minimize carrying costs and prevent stockouts. The agent automates purchase orders and warehouse transfers for a resilient supply chain.
by H2O.ai
H2O.ai offers an open-source and enterprise AutoML platform that automates the machine learning lifecycle. It excels at automated model training, interpretation, and deployment, supporting distributed computing for large datasets. The platform provides comprehensive model explainability features like SHAP values, making complex models transparent.
by Coqui
Sets up a zero-shot voice cloning pipeline using Coqui XTTS-v2 or Tortoise-TTS, requiring only a short reference audio clip (about 6 seconds for XTTS-v2) to synthesize new speech in the target voice. Includes a FastAPI inference server, audio quality normalization, and speaker embedding export for reuse.
by Community
This script provides a complete framework for building a multimodal visual search engine. It uses CLIP to generate image and text embeddings, which are indexed in a vector database like Qdrant or Weaviate for efficient similarity search. The system supports both text-to-image and image-to-image queries and includes a FastAPI server for API access.
by Community
Ingests social media feeds, reviews, and support tickets in near real time, scores sentiment at entity and aspect level using a fine-tuned RoBERTa model, and renders a live Streamlit dashboard with trend charts, topic clustering, and configurable alert thresholds for brand-crisis detection.
by Community
This script provides a complete setup for a modern, two-stage recommendation engine. It uses a two-tower neural network for efficient candidate retrieval and a powerful Large Language Model (LLM) for nuanced re-ranking. The system integrates with a Feast feature store to leverage real-time user context, ensuring timely and relevant suggestions.
by AaaS
This script automates the deployment of a large language model using the vLLM inference engine. It creates a high-throughput, OpenAI-compatible API endpoint. Key features like PagedAttention and continuous batching are configured to maximize performance and memory efficiency, making it suitable for production environments.
by Community
Optimizes PyTorch and TensorFlow models for edge hardware by applying INT8/FP16 quantization and converting them to ONNX or TFLite formats. This script provides platform-specific tuning for ARM and NPU targets, benchmarking latency and memory usage while generating a report on accuracy trade-offs.
by AaaS
Cleans and normalizes text data for LLM consumption by removing HTML artifacts, fixing encoding issues, standardizing whitespace, deduplicating near-identical entries, and filtering low-quality content based on configurable quality heuristics.
by Alteryx
Applies Deep Feature Synthesis via Featuretools and AutoFeat to automatically generate hundreds of candidate features from relational tabular data, then prunes them using mutual information and SHAP-based importance filters. Produces a reproducible feature pipeline serializable to scikit-learn format.
by Synthesia
Synthesia is an enterprise AI video generation platform that enables users to create professional-quality videos featuring realistic AI avatars from text scripts, without cameras, actors, or studios. Serving thousands of enterprise customers including Accenture, BBC, and Reuters, it is the leading platform for scalable AI-generated corporate video content.
by LAION
LAION (Large-scale Artificial Intelligence Open Network) is a German nonprofit that creates and releases massive open datasets for AI research. Its most notable contribution, LAION-5B, is a dataset of 5.85 billion image-text pairs that was pivotal in training foundational models like Stable Diffusion.
by Hasan et al. / University of Edinburgh
XL-Sum is a large-scale benchmark dataset for multilingual abstractive summarization. It contains 1.35 million article-summary pairs from BBC News across 44 languages, designed to evaluate a model's ability to generate concise summaries across diverse linguistic families and writing systems.
by CMU
WebArena is a realistic and reproducible benchmark environment designed to evaluate autonomous language agents. It tests an agent's ability to perform complex, multi-step tasks across a diverse set of self-hosted websites, including e-commerce, forums, and content management systems, using real web interfaces.
by OpenAI
SimpleQA is a benchmark dataset developed by OpenAI to assess the factual accuracy of language models. It consists of simple, unambiguous questions that have a single, verifiable correct answer. The benchmark is designed to measure a model's ability to recall factual knowledge and, crucially, to abstain from answering when it is uncertain, providing a measure of its calibration.
by Google Research
MGSM (Multilingual Grade School Math) is a benchmark for evaluating the mathematical reasoning of large language models across multiple languages. It consists of 250 grade-school math problems from the GSM8K dataset, professionally translated into ten typologically diverse languages, including low-resource ones like Swahili and Telugu.
by Ma et al. / Shanghai AI Lab
AgentBoard is a comprehensive evaluation framework for Large Language Model (LLM) based agents. It assesses agent performance across nine diverse tasks, including embodied AI, gaming, web browsing, and tool use. The framework uniquely measures both final task success and partial progress through a fine-grained sub-goal metric.
by Dyson Robotics Lab / Imperial College London
RLBench is a large-scale robot learning benchmark and dataset built on the CoppeliaSim simulator, providing 100 unique manipulation tasks with demonstrations, observations, and reward functions. It offers RGB, depth, and point-cloud observations for a Franka Panda arm across diverse household tasks, widely used for evaluating imitation learning, reinforcement learning, and multi-task robot policies.
by Intel Labs / Community
Orca DPO Pairs is a synthetic dataset containing 12,000 instruction-following examples. Each example includes a prompt, a high-quality response from GPT-4 (chosen), and a lower-quality response from GPT-3.5 (rejected). It is designed for efficiently aligning language models using Direct Preference Optimization (DPO) without a reward model.
by UC Berkeley
Nectar is a large-scale, high-quality preference dataset from Berkeley AI Research (BAIR). It contains 183,000 prompts, each with seven ranked responses from diverse models like GPT-4, ChatGPT, and open-source LLMs. It is designed for training robust reward models for RLHF and DPO.
by University of Washington
MusicNet is a collection of 330 freely licensed classical music recordings with over 1 million annotated labels indicating the precise timing and identity of every musical note in each recording. It supports supervised learning for music transcription, instrument recognition, and music information retrieval tasks.
by University of Massachusetts / Partners Healthcare
MedNLI is a benchmark dataset for Natural Language Inference (NLI) in the clinical domain. Derived from the MIMIC-III database, it contains over 14,000 sentence pairs from clinical notes, each annotated by a clinician as representing entailment, contradiction, or a neutral relationship, enabling the evaluation of clinical text reasoning.
by CommonCrawl Foundation
CC-News is a large-scale dataset of over 700,000 English news articles from the CommonCrawl archive, collected between 2016 and 2019. It serves as a key pretraining corpus, notably for the RoBERTa model, providing a rich source of journalistic text for developing models that understand news language and current events.
by BigCode (ServiceNow + Hugging Face)
StarCoder2 15B is a powerful open-source code generation model from the BigCode project. Trained on The Stack v2 dataset spanning over 600 programming languages, it excels at code completion, generation, and fill-in-the-middle tasks, emphasizing data transparency and author opt-out.
by Alibaba / Qwen Team
QwQ-32B is a 32 billion parameter language model from Alibaba, specifically optimized for complex reasoning tasks. It utilizes a deep chain-of-thought methodology to excel at mathematical, scientific, and logical problems, achieving performance comparable to much larger models and showcasing high parameter efficiency.
by Pika Labs
Pika 1.5 is an accessible AI video generation model that transforms text prompts or images into high-quality videos. It is known for its expressive motion, diverse cinematic styles, and unique features like physics-based effects and automated lip-sync, making it popular among creators and consumers.
by Meta
Llama 3.2 11B Vision is Meta's first open-source multimodal model, integrating native image understanding with advanced text generation. At a compact 11B parameters, it's designed for efficiency, enabling visual question answering, image captioning, and complex reasoning across text and images in a single, deployable model.
by LlamaIndex / Qdrant
Native LlamaIndex vector store adapter for Qdrant, enabling index construction, similarity search, and filtered retrieval over Qdrant collections. Supports both in-memory and hosted Qdrant deployments with payload-based metadata filtering.
by LangChain
This integration connects the LangChain framework with Mistral AI's suite of models, including Mistral Large and Codestral. It enables developers to build sophisticated applications by leveraging Mistral's capabilities like function calling, JSON mode, and streaming within LangChain's structured environment for creating agents and chains.
by Anthropic
Anthropic's Claude Agent SDK ships with native Model Context Protocol (MCP) client support, allowing Claude-powered agents to connect to any MCP server and use its exposed tools, resources, and prompts. The integration bridges Claude's tool-use capabilities with the open MCP ecosystem for plug-and-play external integrations.
by BentoML
BentoML streamlines deploying machine learning models to the AWS cloud. It packages models and their inference logic into standardized containers, enabling one-command deployment to services like SageMaker, EC2, and ECS. The platform automates production concerns such as auto-scaling, batching, and monitoring.
by OpenAI
The Sparse Transformer introduces factored sparse attention patterns to reduce the self-attention mechanism's complexity from O(n²) to O(n√n). This innovation enables Transformer models to process and generate sequences thousands of steps long, making them effective for high-resolution generative tasks.
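To make the complexity claim concrete, the following Python sketch counts how many positions are attended under a strided pattern (a local window of about √n previous positions plus every √n-th earlier position) and compares that with dense causal attention. Taking the union of the two index sets is a simplification for counting purposes, and the function names are hypothetical.

```python
import math


def strided_attention_pairs(n):
    """Count (query, key) pairs under a strided sparse causal pattern."""
    stride = max(1, math.isqrt(n))
    total = 0
    for i in range(n):
        # Local component: the previous `stride` positions plus position i.
        local = set(range(max(0, i - stride), i + 1))
        # Strided component: earlier positions a multiple of `stride` away.
        strided = {j for j in range(i + 1) if (i - j) % stride == 0}
        total += len(local | strided)
    return total


def dense_causal_pairs(n):
    """Count (query, key) pairs under full causal attention."""
    return n * (n + 1) // 2
```

Each position attends to on the order of 2√n keys, so the total grows like n√n rather than n².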
by Center for AI Safety / UCSD
Representation Engineering (RepE) is a top-down AI transparency technique that identifies and manipulates high-level concepts within a model's activations. By finding linear directions corresponding to traits like honesty or power-seeking, it enables real-time monitoring and steering of model behavior, offering a scalable alternative to circuit-level analysis.
by Anthropic
This paper establishes a causal link between specific transformer circuits, termed "induction heads," and the phenomenon of in-context learning. It demonstrates that these two-layer attention patterns, which copy and complete sequences, emerge predictably during training and are a key mechanistic driver of few-shot learning abilities in LLMs.
by Beihang University / University of Illinois Urbana-Champaign / Microsoft
Corrective Retrieval Augmented Generation (CRAG) is an AI framework that enhances standard RAG by adding a self-correction layer. It uses a lightweight retrieval evaluator to score the relevance of retrieved documents. If documents are deemed irrelevant or ambiguous, CRAG triggers corrective actions like web searches to improve the knowledge source before generation.
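The control flow described above can be sketched as a single routing function. Everything here is illustrative: `evaluate` and `web_search` are caller-supplied stand-ins, the threshold values are made up, and the filtering step is a simplification of whatever refinement the framework applies to retained documents.

```python
def corrective_retrieve(query, retrieved, evaluate, web_search,
                        upper=0.7, lower=0.3):
    """Route retrieval results based on evaluator confidence.

    evaluate(query, doc) -> relevance score in [0, 1]   (hypothetical)
    web_search(query)    -> list of fallback documents  (hypothetical)
    Returns (action, documents_for_generation).
    """
    scores = [evaluate(query, doc) for doc in retrieved]
    best = max(scores, default=0.0)

    if best >= upper:
        # At least one clearly relevant document: keep the relevant ones.
        kept = [d for d, s in zip(retrieved, scores) if s >= upper]
        return "correct", kept
    if best < lower:
        # Nothing relevant: discard retrieval and fall back to web search.
        return "incorrect", web_search(query)
    # In between: combine plausible documents with web results.
    kept = [d for d, s in zip(retrieved, scores) if s >= lower]
    return "ambiguous", kept + web_search(query)
```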
by University of Edinburgh / Allen AI
This paper introduces a benchmark suite for evaluating autonomous agents like Auto-GPT on online decision-making tasks. It assesses their ability in multi-step planning and tool use, analyzes common failure modes, and highlights the challenges these agents face in reliably completing long-horizon goals.
by AaaS
Analyzes cloud infrastructure utilization across compute, storage, and network resources. Identifies idle instances, underutilized reservations, oversized resources, and orphaned volumes. Produces actionable right-sizing recommendations with projected savings.
by AaaS
Assigns real-time risk scores to individual transactions based on behavioral deviation, amount patterns, merchant category, and contextual signals. Produces scores on a 0-100 scale with configurable thresholds for pass/review/block decisions.
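Mapping a 0-100 score to a pass/review/block decision is a simple threshold check. The threshold values below are placeholders chosen for the example; in practice they would be tuned per merchant or portfolio.

```python
def decide(score, review_at=40, block_at=80):
    """Map a 0-100 risk score to a decision using configurable thresholds."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= block_at:
        return "block"
    if score >= review_at:
        return "review"
    return "pass"
```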
by AaaS
Produces concise, structured summaries of customer conversations for handoff to human agents or for audit trails. Extracts key issues raised, actions taken, unresolved items, and customer sentiment — ensuring no context is lost during escalation.
by AaaS
Monitors buyer intent signals across website visits, email opens, content downloads, CRM activity, and third-party intent data providers. Correlates engagement patterns to identify accounts showing active buying behavior, enabling agents to prioritize high-intent prospects over cold outreach.
by AaaS
This skill involves building Retrieval-Augmented Generation (RAG) systems that output structured data, like JSON, conforming to a predefined schema. Instead of unreliable free-form text, it uses techniques like constrained decoding and validation to ensure outputs are machine-readable and ready for direct use in APIs or databases.
by AaaS
Converts natural language questions into executable SQL queries against relational databases. Supports schema-aware generation, multi-table joins, aggregations, and query optimization with dialect-specific syntax for PostgreSQL, MySQL, SQLite, and others.
by AaaS
Detects and mitigates prompt injection attacks where malicious inputs attempt to override system instructions or extract sensitive information. Implements input sanitization, instruction hierarchy enforcement, and output monitoring to protect LLM-powered applications.
by AaaS
Identifies and flags personally identifiable information (PII) in text data, including names, addresses, phone numbers, SSNs, and financial details. Supports configurable sensitivity levels, redaction strategies, and compliance reporting for GDPR, HIPAA, and CCPA requirements.
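A minimal pattern-based pass shows the detect-and-redact idea. The three regular expressions below cover only simple email, US SSN, and US phone formats and are assumptions for illustration; names and addresses generally need a trained entity recognizer rather than regexes.

```python
import re

# Hypothetical, deliberately narrow patterns for demonstration only.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text):
    """Replace each detected PII span with a [LABEL] placeholder.

    Returns the redacted text and a list of (label, original_value)
    findings that could feed a compliance report.
    """
    findings = []
    for label, pattern in PII_PATTERNS.items():
        findings += [(label, m.group()) for m in pattern.finditer(text)]
        text = pattern.sub(f"[{label}]", text)
    return text, findings
```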
by AaaS
Validates LLM outputs against expected schemas, formats, and quality criteria before delivery to end users. Implements JSON schema validation, hallucination checks, citation verification, and automated retry logic for outputs that fail validation.
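The validate-then-retry loop can be sketched with the standard library alone. Here `generate` is a hypothetical callable standing in for the model call, and the "schema" is just a mapping from required keys to expected Python types rather than full JSON Schema.

```python
import json


def validate(raw, required):
    """Return the parsed object if it is a dict with correctly typed keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key, expected_type in required.items():
        if not isinstance(data.get(key), expected_type):
            return None
    return data


def generate_validated(generate, required, max_attempts=3):
    """Call generate(attempt) until its output validates or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        data = validate(generate(attempt), required)
        if data is not None:
            return data
    raise ValueError(f"no valid output after {max_attempts} attempts")
```

Passing the attempt number to `generate` lets the caller adjust the prompt on retries, for example by appending the previous validation failure.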
by AaaS
Generates technical documentation from source code, including API references, README files, inline comments, and architectural guides. Adapts tone and detail level for different audiences from developer guides to end-user documentation.
by AaaS
A set of techniques for managing the limited memory (context window) of Large Language Models. It involves strategically structuring prompts, summarizing or pruning conversation history, and selectively including relevant information to ensure efficient, cost-effective, and coherent long-form interactions with an AI.
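One of the simplest pruning strategies, keeping the system prompt plus the most recent turns that fit a token budget, might look like the sketch below. The message format and the whitespace-based token count are assumptions; a real deployment would use the model's own tokenizer.

```python
def prune_history(messages, budget,
                  count_tokens=lambda text: len(text.split())):
    """Keep system messages and the newest turns that fit within budget.

    messages is a list of {"role": ..., "content": ...} dicts in
    chronological order. count_tokens is a crude word-count proxy.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    used = sum(count_tokens(m["content"]) for m in system)
    kept = []
    # Walk backwards from the newest turn and stop at the first overflow.
    for message in reversed(rest):
        cost = count_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + kept[::-1]
```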
by Anthropic
Claude Code is Anthropic's agentic coding assistant that operates directly in the terminal. It reads codebases, edits files, runs tests, manages git, and executes shell commands autonomously, with support for MCP tools, custom agents, and persistent memory via CLAUDE.md.
by AaaS
Triggers a coordinated sequence of automated actions on a new hire's start date: provisions software licenses, configures identity and access management, generates a custom training schedule based on the specific role, and serves as a conversational guide for corporate policies during the first 90 days. Replaces the fragmented IT/HR onboarding experience with a single orchestrated workflow that completes in hours, not weeks.
by AaaS DevOps Foundry
Maps structural dependencies, architectural patterns, and historical technical decisions across enterprise codebases. When a critical service fails and the original developers are unavailable, this agent produces a semantic architecture map — dependency graphs, hotspot analysis, and knowledge gap identification — in minutes instead of weeks. Integrates deeply with repositories to understand code as architecture, not just text.
by Tavily
Tavily is a specialized search API designed for Large Language Models (LLMs) and AI agents. It provides real-time, fact-grounded web search results in a structured, clean format, eliminating the need for manual data cleaning. The API is optimized to deliver relevant, concise information, making it ideal for powering autonomous agents and RAG applications.
by Tabnine
Privacy-first AI code assistant that can run entirely on-premise or in air-gapped environments. Specializes in personalized code completions trained on your team's codebase with zero data retention and full IP protection.
by Princeton NLP
Princeton NLP's research agent that turns LLMs into autonomous software engineers. Achieves state-of-the-art results on SWE-bench by providing an agent-computer interface optimized for code navigation and editing.
by Resilinc
An intelligence agent that continuously monitors supplier financial health, geopolitical exposure, ESG compliance, news sentiment, and delivery performance to generate dynamic risk scores for every vendor in the supply network. It alerts procurement teams to emerging threats and recommends dual-sourcing or buffer stock adjustments before disruptions materialize.
by Loop Returns
This AI agent provides end-to-end automation for e-commerce returns. It intelligently classifies return reasons, verifies requests against your store's policies, and automatically issues shipping labels. The agent processes refunds or exchanges and integrates with warehouse systems to update inventory, while also flagging potential fraud.
by Preactor (Siemens)
An AI-powered optimization engine for dynamic production scheduling in complex manufacturing environments. It automates the creation and adjustment of job sequences to maximize throughput and on-time delivery. The agent reacts instantly to real-world disruptions like machine failures or rush orders, ensuring continuous operational efficiency.
by Symptom Checker Health
Conversational triage agent that collects structured symptom histories, applies validated acuity scoring (ESI, Manchester), and routes patients to appropriate care levels. Operates across web, SMS, and kiosk channels with multilingual support and EHR hand-off.
by PandasAI
Conversational data analysis agent that lets users query pandas DataFrames using natural language. Translates questions into Python code, executes against datasets, and returns results with visualizations, making data exploration accessible to non-technical users.
by All Hands AI
OpenHands is an open-source platform for creating autonomous AI software agents. It offers a secure, sandboxed environment where agents can execute complex development tasks by writing code, running commands, browsing the web, and interacting with APIs. It supports multi-agent delegation for tackling intricate problems.
by DeepWisdom
MetaGPT is an open-source multi-agent framework that automates software development by simulating a virtual company. It assigns distinct roles like product manager, architect, and engineer to different LLM agents. Starting from a single-line requirement, it follows Standardized Operating Procedures (SOPs) to generate comprehensive outputs, including user stories, system designs, diagrams, and executable code.
by Kensho Technologies (S&P Global)
High-frequency NLP agent that aggregates and scores market-moving information from earnings calls, SEC filings, news wires, social media, and analyst reports to generate real-time sentiment signals and thematic trend scores by sector and ticker. Feeds alpha signals to trading and research workflows.
by Guru
This autonomous agent streamlines knowledge management by ingesting data from support tickets, chat logs, and documents. It automatically generates, updates, and deduplicates knowledge base articles, identifying content gaps by analyzing unanswered user queries. New drafts are created for human review, ensuring a constantly improving self-service resource.
by HireVue
An AI recruiting agent that screens resumes at scale, scores candidates against job description criteria, conducts asynchronous video interview analysis, and shortlists top applicants while flagging potential bias signals for human review. It integrates with ATS platforms to automate interview scheduling for shortlisted candidates and maintains a structured candidate evaluation audit trail.
by Harness
Harness AI is an intelligent software delivery agent that automates CI/CD pipelines. It leverages machine learning to verify deployments, detect anomalies in real time, and automate rollback decisions to ensure service health. This helps reduce mean time to recovery (MTTR) and optimize pipeline execution in complex environments.
by NVIDIA
Compact Orin-based Jetson module delivering up to 100 TOPS in a small form factor. Targets robotics, drones, medical devices, and industrial edge AI applications requiring significant AI performance in constrained size, weight, and power envelopes.
by AaaS
Automated web scraping pipeline with configurable crawl depth, content extraction, and rate limiting. Converts web content into clean text documents suitable for embedding and RAG ingestion with support for dynamic JavaScript-rendered pages.
by Community
Implements statistically rigorous A/B and shadow-mode testing for competing ML model versions behind a feature flag router, logging predictions and latencies to a data warehouse for significance testing. Automatically computes sample size requirements and stops experiments when significance thresholds are met.
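The sample-size calculation mentioned above can be illustrated with the standard normal-approximation formula for comparing two proportions. The function name and default values for `alpha` and `power` are assumptions for this sketch; the listed script may use a different test or correction.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control, p_variant, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided two-proportion test."""
    if p_control == p_variant:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance quantile
    z_beta = NormalDist().inv_cdf(power)           # power quantile
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)
```

Smaller expected differences between model versions drive the required sample size up quadratically, which is why the minimum detectable effect should be fixed before the experiment starts.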
by AaaS
Automated pipeline for ingesting documents from multiple sources (files, URLs, APIs) into a vector store. Handles format detection, text extraction, chunking, deduplication, metadata enrichment, and incremental updates for growing knowledge bases.
by AaaS
Prepares datasets for LLM fine-tuning by converting raw data into instruction-following, conversation, or completion formats. Handles data cleaning, deduplication, train/val/test splitting, tokenization analysis, and quality filtering.
by Weaviate
Weaviate is an open-source vector database designed for AI-native applications. It enables flexible hybrid search, combining vector and keyword methods, and uniquely supports multi-modal data like text, images, and audio. Weaviate offers both self-hosting for maximum control and a managed cloud service for ease of use.
by Stability AI
Stability AI is a generative AI company known for developing the popular open-source Stable Diffusion text-to-image model. They focus on creating open, multi-modal AI models for image, language, audio, and video generation, which are accessible via APIs and as downloadable weights for custom implementation.
by Jasper AI
Jasper AI is an enterprise-grade AI content platform designed for marketing teams to produce brand-consistent copy, campaigns, and creative assets at scale. It integrates with brand voice guidelines, company knowledge bases, and major marketing workflows to maintain tone consistency across channels.
by UCLA
Mathematical reasoning benchmark requiring visual understanding of charts, plots, geometry diagrams, and infographics. Tests the intersection of visual perception and mathematical reasoning with 6,141 problems from 28 existing datasets and 3 newly collected ones.
by Zhang et al. / Peking University
InfiniteBench is a benchmark designed to evaluate the long-context capabilities of large language models. It features tasks that require processing and reasoning over inputs exceeding 100,000 tokens, including math, code debugging, and retrieval from novels, where crucial information is distributed across the entire context.
by CAIS
Humanity's Last Exam is a crowdsourced benchmark designed to rigorously test the limits of advanced AI systems. It comprises extremely difficult questions contributed by domain experts across diverse fields like science, math, and philosophy, serving as a public evaluation for frontier model capabilities in complex reasoning and specialized knowledge.
by Aider
Multi-language code editing benchmark testing models' ability to make targeted code changes across Python, JavaScript, TypeScript, Java, C++, and other languages. Evaluates real-world code modification tasks rather than generation from scratch.
by Tsinghua University
Comprehensive benchmark evaluating LLM agents across 8 distinct environments including operating systems, databases, knowledge graphs, digital card games, lateral thinking puzzles, and web shopping. Tests generalization of agent capabilities across diverse interaction paradigms.
by University of Oxford
WebVid-10M is a massive dataset containing over 10 million video clips paired with descriptive text captions. Scraped from stock video websites, it serves as a foundational pretraining corpus for state-of-the-art video-language models, facilitating research in video understanding, retrieval, and generation.
by PushShift.io
A massive, multi-billion token archive of Reddit comments and submissions from 2005 to 2023, collected by the PushShift project. This dataset is a cornerstone for social NLP research, large-scale language model pre-training, and studying the dynamics of online communities and conversational discourse.
by Coqui AI
XTTS-v2 is an open-source, cross-lingual text-to-speech model from Coqui AI. It excels at high-quality voice cloning from just a few seconds of audio and supports 17 languages. With real-time streaming inference, it's ideal for applications needing custom voices and low-latency output.
by Google DeepMind
Google DeepMind's state-of-the-art video generation model capable of producing high-resolution cinematic clips from text prompts. Supports diverse styles, camera controls, and realistic physics simulation.
by Udio
Udio is a high-fidelity text-to-music model developed by former Google DeepMind researchers. It generates full songs with vocals and instruments at production quality and is often cited as having higher audio quality than competitors, at the cost of slightly less stylistic range. It supports custom lyrics, genre blending, and audio extension, enabling iterative music production workflows.
by Google
Med-PaLM 2 is Google's large language model specialized for the medical domain. It achieves expert-level performance on medical licensing exams (USMLE) by leveraging advanced clinical reasoning and question-answering capabilities. The model is designed to generate accurate and helpful responses for healthcare professionals.
by Anthropic / Google
Official MCP Google Drive server granting MCP clients access to Drive file listings, search, and document content reading via OAuth 2.0. Supports Docs, Sheets, Slides, and plain files, enabling agents to retrieve and reason over cloud-stored enterprise documents.
by LangChain
LangChain integration for Cohere's enterprise AI platform. Provides access to Command models for generation, Embed v3 for multilingual embeddings, and the Rerank API for RAG pipeline precision improvement. Available via the langchain-cohere package with first-class reranker support.
by Groq
LangChain chat model integration for Groq's Language Processing Unit (LPU) inference API. Enables ultra-low-latency LLM calls within LangChain chains and agents with first-token latency under 100ms. Supports Llama 3, Mixtral, and Gemma models served on Groq hardware via the langchain-groq package.
by Continue Dev
Continue is an open-source AI code assistant for VS Code that supports any LLM through a flexible config file, covering inline completions, chat, edit mode, and custom slash commands. Its context providers system lets developers include files, docs, web search results, and terminal output in every prompt, making it highly adaptable to team-specific workflows.
by Sourcegraph
Sourcegraph Cody combines enterprise-grade code search with an AI coding assistant, letting developers ask questions grounded in the entire codebase indexed by Sourcegraph. The integration uses Sourcegraph's precise code intelligence (SCIP) as a retrieval layer for Cody's Claude-powered chat, delivering context-accurate answers across mono-repos with millions of files.
by Google / Everyday Robots
Inner Monologue incorporates closed-loop language feedback from the environment—scene descriptions, success detectors, and human feedback—into robotic planning, enabling robots to re-plan dynamically when steps fail. The paper shows that rich language-based feedback loops significantly improve task success rates over open-loop LLM planners.
by Microsoft Research
BioGPT is a domain-specific generative transformer pretrained on 15 million PubMed abstracts. Developed by Microsoft, it achieves state-of-the-art performance on various biomedical natural language processing tasks, including relation extraction, question answering, and document classification, outperforming general-domain models.
by AaaS
Drafts hyper-personalized outreach messages for each prospect using their specific firmographic profile, recent intent signals, and ICP match factors. Enforces brand voice and CAN-SPAM/GDPR compliance, adapts tone by channel (email, LinkedIn, phone script), and graduates from human-approved to autonomous sending as trust is established.
by AaaS
Scans cloud infrastructure for resources that have been inactive beyond configurable thresholds — unused EC2 instances, unattached EBS volumes, idle load balancers, and zero-traffic IP addresses. Produces a prioritized waste report with projected monthly savings and safe termination recommendations.
by AaaS
Divides prospect and customer databases into distinct behavioral and firmographic segments using clustering algorithms and rule-based criteria. Produces named audience segments with defining characteristics, size estimates, and conversion potential scores — enabling agents to tailor campaigns and content per segment.
by AaaS
Provisions software licenses and SaaS tool access for new employees based on their role, department, and team. Integrates with identity providers to create accounts, assign licenses, and configure initial permissions in a single orchestrated flow.
by AaaS
Composite skill that unifies buyer intent tracking, lead scoring, and ICP matching into a single prospecting engine. Monitors intent signals across channels, scores leads against the Ideal Customer Profile, and produces a ranked prospect list with personalized outreach recommendations. Designed to fill the Intent Scoring slot that requires all three capabilities working in concert.
by AaaS
Compares lead attributes against Ideal Customer Profile definitions including industry, company size, tech stack, budget range, and buying behavior patterns. Produces a match percentage and highlights gaps, enabling agents to focus on prospects that fit the target customer archetype.
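A minimal sketch of the match-percentage idea, assuming each ICP criterion is a simple pass/fail check and all criteria carry equal weight. The attribute names, thresholds, and the `icp_match` helper are hypothetical illustrations, not part of the skill itself.

```python
def icp_match(lead, icp):
    """Return (match percentage, list of attributes that failed or were missing)."""
    matched, gaps = [], []
    for attribute, accepts in icp.items():
        # A missing attribute is passed as None and treated as a gap.
        if accepts(lead.get(attribute)):
            matched.append(attribute)
        else:
            gaps.append(attribute)
    return 100.0 * len(matched) / len(icp), gaps


# Hypothetical ICP definition: one predicate per attribute.
icp = {
    "industry": lambda v: v in {"fintech", "saas"},
    "company_size": lambda v: v is not None and 50 <= v <= 500,
    "tech_stack": lambda v: v is not None and "kubernetes" in v,
    "budget": lambda v: v is not None and v >= 20000,
}

lead = {"industry": "fintech", "company_size": 120, "tech_stack": ["aws", "postgres"]}
percent, gaps = icp_match(lead, icp)
```

Here the lead passes two of four checks, so `percent` is 50.0 and `gaps` names the tech stack and the missing budget. A production scorer would likely weight criteria differently.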
by AaaS
Builds episodic behavioral profiles of individual users from transaction history, login patterns, and interaction data. Detects deviations from established behavioral baselines that indicate potential fraud — unusual transaction amounts, atypical timing, geographic impossibilities, and velocity anomalies.
by AaaS
Enables agents to understand codebases as architecture rather than text. Parses abstract syntax trees, maps structural dependencies between modules, identifies architectural patterns (MVC, microservices, event-driven), and detects knowledge gaps where documentation is missing or outdated. Goes beyond keyword search to find semantic relationships that define software architecture.
by AaaS
Executes code changes and dependency updates in isolated sandbox environments (Docker containers, ephemeral VMs) to verify compatibility before merging to the main codebase. Runs the project's existing test suite against the modified dependency set, captures build artifacts and test results, and produces a pass/fail report with detailed failure analysis.
by AaaS
Executes structured remediation runbooks deterministically against live systems. Supports YAML-defined runbooks with step sequencing, precondition checks, rollback steps, and human-approval gates. Logs every action with timestamps for audit compliance. Enables agents to resolve standard incidents using the same playbooks human SREs follow.
by AaaS
Compares test results and behavior metrics between the baseline (pre-change) and candidate (post-change) versions. Detects functional regressions (new test failures), performance regressions (latency/throughput degradation), and behavioral regressions (changed API responses). Produces a structured diff report that the Dependency Guardian uses to decide whether an update is safe to merge.
by AaaS
Teaches agents to synthesize speech in a target speaker's voice using few-shot and zero-shot voice cloning models, enabling personalized TTS experiences. Covers consent and ethical frameworks, reference audio quality requirements, model selection (ElevenLabs, Coqui XTTS, Tortoise), and anti-spoofing safeguards.
by AaaS
Covers temporal reasoning over video streams, including frame sampling strategies, action recognition, scene change detection, and dense video captioning. Teaches agents to leverage video-native models (Gemini 1.5 Pro, Video-LLaVA) and build efficient pipelines that avoid processing every frame.
by AaaS
Enables agents to segment audio recordings by speaker identity, answering 'who spoke when' for downstream summarization and analysis tasks. Covers embedding-based clustering (pyannote.audio, NeMo), overlapping speech handling, and merging diarization with ASR transcripts.
by Community
Sim-to-Real Transfer is a set of techniques used in robotics and AI to bridge the 'reality gap' between simulation and the real world. It enables models and control policies trained in a virtual environment to be deployed effectively on physical hardware, drastically reducing the need for costly, time-consuming, and potentially unsafe real-world data collection.
by AaaS
Allows agents to evaluate their own outputs, identify errors or weaknesses, and iteratively improve responses. Implements self-critique loops where the agent reviews its work against quality criteria and refines until standards are met.
by AaaS
Deploys and serves language models in production environments with high availability and low latency. Covers framework selection (vLLM, TGI, Triton), batching strategies, GPU memory management, and auto-scaling configurations for different workload profiles.
by AaaS
Enables AI agents to maintain state and context across multiple interactions by managing short-term and long-term memory. This is crucial for creating coherent, personalized experiences, moving beyond stateless request-response models. It uses techniques like conversation buffers, summarization, and vector-based retrieval.
by Community
A machine learning paradigm enabling models to learn sequentially from a continuous stream of data without forgetting previously acquired knowledge. Continual Learning, or Lifelong Learning, directly addresses the problem of catastrophic forgetting in neural networks using methods like regularization, memory replay, and dynamic architectures.
by AaaS
Continuously monitors official state and federal regulatory sources for new rules, amendments, and enforcement actions. Cross-references every regulatory change against active client case files and operational workflows. Automatically embeds structured audit checks into daily operational workflows with immutable timestamps. A compliance officer that never sleeps, never misses an update, and documents everything for legal defensibility.
by AaaS
Continuously ingests competitor pricing data, live inventory levels, and historical demand elasticity curves. Autonomously adjusts pricing in real-time to optimize revenue yield across SaaS subscriptions, logistics rates, and service pricing. Operates within strict pricing policy bounds including min/max price limits, rate-of-change constraints, and competitor parity alerts to prevent pricing wars or customer backlash.
by Codeium
Windsurf is an AI-powered IDE from Codeium built around Cascade, a deep agentic workflow engine. It maintains context across complex, multi-step coding tasks, allowing it to function autonomously. The platform combines the features of a coding copilot with the power of a fully agentic system in a single editor.
by HERE Technologies
This agent specializes in spatio-temporal traffic forecasting, predicting conditions up to 30 minutes in advance for intersections and corridors. It processes data from V2X communications, vehicle telemetry, and infrastructure sensors. The predictions are designed for fleet routing engines to optimize ETAs and alleviate urban congestion.
by Posit AI (formerly RStudio)
Autonomous scientific data analysis agent that ingests raw experimental datasets — omics, imaging, time-series, and tabular — applies appropriate statistical and machine learning pipelines, and generates publication-ready figures, statistical summaries, and reproducible analysis notebooks. Interprets results in the context of prior literature and flags confounders automatically.
by K8sGPT
AI-powered Kubernetes diagnostics agent that scans clusters for issues and provides plain-language explanations with remediation suggestions. Integrates with multiple LLM backends to analyze pod failures, misconfigurations, and performance bottlenecks in real time.
by Grin
An intelligent influencer discovery and vetting agent that scores creators against brand-safety criteria, audience-demographic fit, and historical engagement authenticity. It automates outreach sequencing, tracks negotiation status, and monitors campaign deliverable compliance post-activation.
by Inflection AI
Inflection Pi is a personal AI assistant designed for emotionally intelligent and natural conversations. It acts as a supportive companion, remembering past interactions to provide context-aware advice and support. Pi adapts its communication style to the user's preferences, aiming to be a kind and helpful conversational partner.
by NVIDIA
NVIDIA Volta architecture GPU that introduced Tensor Cores to the data center, NVIDIA's first dedicated matrix-multiply hardware for AI. Powered the first wave of transformer model training including BERT and GPT-2, and became the dominant AI training platform from 2017–2020.
by Google
Google's cost-efficient TPU variant optimized for inference and medium-scale training. Offers a better price-performance ratio than TPU v5p for serving workloads, with 16GB HBM2 per chip and excellent throughput for transformer inference.
by AaaS
Sets up a tool-calling agent with typed tool definitions, argument validation, error handling, and execution sandboxing. Includes example tools for web search, calculator, file operations, and database queries with a pluggable tool registry.
by Community
Generates comprehensive temporal features from time-series data including rolling statistics, lag features, Fourier transforms, and calendar encodings using tsfresh and custom transformers. Handles irregular time series with forward-fill interpolation and produces a point-in-time-correct feature matrix to prevent leakage.
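The point-in-time constraint can be illustrated with a small sketch in plain Python rather than tsfresh. The function names, lags, and window size below are assumptions chosen for illustration. The key property is that every feature at time `t` is computed only from values strictly before `t`.

```python
def forward_fill(values):
    """Replace missing observations (None) with the last seen value."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def lag_and_rolling_features(values, lags=(1, 2), window=3):
    """Build one feature row per time step using past values only."""
    rows = []
    for t in range(len(values)):
        row = {"t": t}
        for lag in lags:
            row[f"lag_{lag}"] = values[t - lag] if t - lag >= 0 else None
        history = values[max(0, t - window):t]  # excludes values[t] itself
        full = len(history) == window
        row[f"roll_mean_{window}"] = sum(history) / window if full else None
        rows.append(row)
    return rows


# Hypothetical series with gaps; it starts with a real observation so that
# forward fill leaves no leading None.
series = forward_fill([10.0, None, 14.0, 16.0, None, 20.0])
rows = lag_and_rolling_features(series)
```

Because the rolling window stops at `t - 1`, the target value at `t` never leaks into its own features.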
by AaaS
Production-grade RAG pipeline with hybrid search, reranking, contextual compression, and multi-index routing. Includes query decomposition, metadata filtering, evaluation metrics, and performance monitoring for enterprise deployments.
by AaaS
Generates embeddings at scale for large document collections with batching, rate limiting, checkpointing, and error recovery. Supports multiple embedding providers (OpenAI, Cohere, local models) with automatic dimension detection and output format selection.
by Community
Orchestrates progressive canary deployments of ML model services on Kubernetes using Istio traffic shifting, with automated rollback triggered by error-rate or latency SLO breaches. Integrates with Argo Rollouts for declarative release management and posts deployment status to Slack.
by AaaS
End-to-end setup script for deploying a production RAG pipeline. Provisions vector database, configures document ingestion, sets up embedding generation, and creates retrieval endpoints.
by Together AI
Together AI provides a high-performance cloud inference platform for open-source models, offering one of the fastest and most cost-effective APIs for running models like Llama, Mistral, and DeepSeek. Its Together Inference platform specializes in speculative decoding and model parallelism techniques, and also offers managed fine-tuning and custom model deployment.
by Google Research
Minerva Math is a quantitative reasoning benchmark designed to evaluate large language models on complex STEM problems. Sourced from web pages with LaTeX and arXiv preprints, it covers subjects like math, physics, and chemistry, requiring multi-step computation, symbolic manipulation, and deep scientific understanding to solve.
by Nangia et al. / NYU
CrowS-Pairs is a benchmark dataset for evaluating social bias in masked language models. It contains 1,508 sentence pairs with stereotypical and anti-stereotypical statements across nine bias types. The benchmark measures a model's preference for stereotypical completions using pseudo-log-likelihood scores.
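The scoring idea can be sketched as follows, assuming a function that returns the log-probability of one token given the rest of the sentence. `toy_masked_logprob` is a hypothetical stand-in for a real masked language model, and the token lists are placeholders rather than benchmark data.

```python
def pseudo_log_likelihood(tokens, shared_positions, masked_logprob):
    """Sum log P(token | all other tokens) over the tokens both sentences share."""
    return sum(masked_logprob(tokens, i) for i in shared_positions)


def stereotype_preference_rate(pairs, masked_logprob):
    """Percentage of pairs where the stereotypical sentence scores higher."""
    preferred = 0
    for stereo, anti, stereo_shared, anti_shared in pairs:
        stereo_score = pseudo_log_likelihood(stereo, stereo_shared, masked_logprob)
        anti_score = pseudo_log_likelihood(anti, anti_shared, masked_logprob)
        if stereo_score > anti_score:
            preferred += 1
    return 100.0 * preferred / len(pairs)


def toy_masked_logprob(tokens, index):
    # Hypothetical scorer: every token is less likely when the sentence
    # mentions "group_b". A real scorer would mask tokens[index] and read
    # the model's log-probability for the original token.
    return -2.0 if "group_b" in tokens else -1.0


# Each pair: (stereotypical tokens, anti-stereotypical tokens,
#             shared positions in the first, shared positions in the second)
pairs = [
    (["group_a", "people", "are", "trait_x"],
     ["group_b", "people", "are", "trait_x"], [1, 2, 3], [1, 2, 3]),
    (["group_b", "people", "are", "trait_y"],
     ["group_a", "people", "are", "trait_y"], [1, 2, 3], [1, 2, 3]),
]
rate = stereotype_preference_rate(pairs, toy_masked_logprob)
```

Only the shared tokens are scored, so the comparison is not dominated by how frequent the differing group terms are. A rate near 50 indicates no systematic preference for either sentence in a pair.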
by Koreeda & Manning / Stanford NLP
ContractNLI is a dataset for natural language inference (NLI) focused on contract understanding. It challenges models to determine if a hypothesis about a contract is entailed, contradicted, or not mentioned by the contract text. This simulates real-world legal document review, testing a model's ability to reason over complex legal language.
by Li et al. / Wuhan University
API-Bank is a comprehensive benchmark for evaluating tool-augmented LLMs. It features 73 diverse APIs and assesses models on three levels: API retrieval, API calling, and complex planning. The benchmark measures both the correctness of tool selection and the accuracy of execution, providing a thorough test of an agent's capabilities.
by HKUST / Community
Deita 6K is an ultra-compact, high-quality instruction-tuning dataset of 6,000 carefully selected samples produced by the Data-Efficient Instruction Tuning for Alignment (DEITA) framework, which scores and filters instruction data by complexity and quality using LLM judges. Despite its small size, models trained on Deita 6K match or outperform those trained on datasets 10-100x larger, demonstrating the power of principled data selection over scale.
by Argilla / LDJnr
Capybara is a high-quality instruction-tuning dataset of 15,000 diverse, long-form single- and multi-turn conversations synthesized to cover a wide range of topics and response styles, designed to improve model coherence and verbosity on open-ended tasks. It emphasizes narrative quality and conceptual depth over simple factual responses, making it particularly effective for improving chat model fluency and reasoning.
by Voyage AI
Voyage Code 3 is Voyage AI's third-generation code-specialized embedding model, optimized for code retrieval, code search, and programming language understanding. It achieves top performance on code retrieval benchmarks and is widely used in AI coding tools, IDEs, and code intelligence platforms.
by Alibaba Cloud
Alibaba Cloud's flagship vision-language model capable of understanding images, documents, charts, and diagrams alongside text. Excels at OCR, visual question answering, and document comprehension tasks across multiple languages.
by Amazon
Amazon's most capable Nova model balancing accuracy, speed, and cost for a wide range of enterprise tasks. Supports text, image, and video input with strong performance on analysis, reasoning, and content generation through AWS Bedrock.
by Mistral AI
Mistral AI's cost-optimized API model designed for high-volume enterprise workloads requiring low latency. Balances strong performance with significantly reduced cost per token compared to larger models.
by Luma AI
Luma Dream Machine is a fast, high-quality video generation model from Luma AI that produces 120-frame, 24fps video clips with strong 3D spatial consistency and realistic lighting derived from Luma's heritage in NeRF-based 3D capture. It was notable for its web-based accessibility and fast generation times at launch, generating 5-second clips in roughly 2 minutes.
by Qdrant
LlamaIndex VectorStore integration for Qdrant's high-performance vector search engine. Exposes Qdrant's payload filtering, sparse-dense hybrid search, and collection management through LlamaIndex's standard index and query engine abstractions for advanced RAG pipelines.
by Arize AI
Arize Phoenix instruments LlamaIndex query pipelines with OpenTelemetry spans, exposing retrieval precision, reranker performance, and LLM generation quality in a local-first UI. The integration is particularly valuable for RAG applications where diagnosing retrieval failures requires joint analysis of embeddings, chunks, and generation outputs.
by Community / Notion
MCP Notion server built on the official Notion API, providing tools for searching pages, reading blocks, creating pages, and updating database entries. Enables Claude and other agents to use Notion as a structured knowledge store within agentic workflows.
by Firecrawl / LangChain
LangChain document loader built on Firecrawl's web crawling and scraping API, transforming live web content into clean Markdown documents ready for chunking and indexing. Supports full-site crawls, sitemap-driven ingestion, and JavaScript-rendered pages.
by Together AI
DeepSeek's open-weight models including DeepSeek-V3 and DeepSeek-R1 served through Together AI's inference cloud at competitive token prices. Provides an OpenAI-compatible API endpoint, enabling drop-in substitution for cost-sensitive workloads. Together AI's custom GPU kernels deliver high throughput for DeepSeek's MoE architecture.
by Chroma
Chroma's built-in embedding function for HuggingFace's sentence-transformers library. Enables fully local embedding generation and vector storage without any API keys. Supports hundreds of pre-trained models from the HuggingFace Hub including all-MiniLM, BGE, and E5 variants.
by Idiap Research Institute / EPFL
Shows that replacing the softmax in attention with a kernel feature map lets causal transformers be expressed as linear RNNs, enabling autoregressive inference with constant time and memory per generated token. Introduces the linear attention framework that inspired many subsequent efficient attention variants.
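The recurrence behind this result can be sketched in plain Python. The feature map `elu(x) + 1` is assumed here, and the function names and toy inputs are illustrative. The sketch compares a quadratic reference against the RNN form, which carries only a fixed-size state from token to token.

```python
import math


def feature_map(x):
    # elu(x) + 1: equals x + 1 for positive inputs and exp(x) otherwise.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_attention_quadratic(queries, keys, values):
    """Reference form: position i re-reads every key/value j <= i."""
    dim_v = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        fq = feature_map(q)
        weights = [dot(fq, feature_map(keys[j])) for j in range(i + 1)]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for j, w in enumerate(weights)) / total
            for d in range(dim_v)
        ])
    return outputs


def causal_attention_recurrent(queries, keys, values):
    """RNN form: update a fixed-size state (S, z) once per token."""
    dim_k, dim_v = len(keys[0]), len(values[0])
    S = [[0.0] * dim_v for _ in range(dim_k)]  # running sum of phi(k) v^T
    z = [0.0] * dim_k                          # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fq, fk = feature_map(q), feature_map(k)
        for a in range(dim_k):
            z[a] += fk[a]
            for d in range(dim_v):
                S[a][d] += fk[a] * v[d]
        denom = dot(fq, z)
        outputs.append([
            sum(fq[a] * S[a][d] for a in range(dim_k)) / denom
            for d in range(dim_v)
        ])
    return outputs


queries = [[0.1, -0.2], [0.5, 0.3], [-1.0, 0.7]]
keys = [[0.4, 0.0], [-0.3, 0.9], [0.2, -0.5]]
values = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5], [2.0, 2.0, -2.0]]
slow = causal_attention_quadratic(queries, keys, values)
fast = causal_attention_recurrent(queries, keys, values)
```

Both functions produce the same outputs up to floating-point error, but the recurrent version does a constant amount of work per new token regardless of how many tokens came before.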
by Hugging Face / ETH Zurich
Investigates scaling behavior when data is limited and must be repeated, finding that repeating data for up to about four epochs costs little compared with training on unique data, while further repetition yields rapidly diminishing returns. Provides practical guidance on allocating compute between parameters and epochs for real-world data-constrained training.
by Anthropic
This paper from Anthropic scales sparse autoencoders (SAEs) to GPT-4-level models and provides rigorous evaluation methods for measuring dictionary quality, showing that SAE features are interpretable, monosemantic, and causally relevant to model behavior. The work establishes SAEs as a core tool for mechanistic interpretability at scale.
by Shanghai AI Laboratory / Fudan University
Presents OS-Copilot, a self-improving agent framework for generalizable computer task automation that interacts with operating system components including files, terminals, browsers, and applications. The framework includes a self-directed learning loop that enables agents to acquire new skills from online documentation without human intervention.
by Bloomberg LP
BloombergGPT is a 50-billion-parameter large language model specifically trained for the financial domain. By leveraging a massive, proprietary corpus of financial documents combined with general-purpose text, it achieves state-of-the-art results on financial NLP benchmarks while remaining competitive on general language tasks.
by Korea Advanced Institute of Science and Technology (KAIST)
Proposes Adaptive-RAG, a framework that learns to select the most suitable retrieval strategy for each question based on its complexity using a small classifier. The approach dynamically routes queries to no-retrieval, single-step, or multi-step retrieval strategies, balancing accuracy and efficiency across question types.
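The routing step can be sketched as a classifier that maps each question to one of three strategies. The keyword heuristic below is a hypothetical stand-in for the trained classifier, and the label and strategy names are assumptions made for illustration.

```python
# Complexity label -> retrieval strategy, following the three routes described above.
STRATEGIES = {
    "simple": "no_retrieval",
    "moderate": "single_step",
    "complex": "multi_step",
}

# Hypothetical cue words that suggest a question needs multiple reasoning hops.
MULTI_HOP_CUES = {"whose", "then", "before", "after"}


def toy_complexity_classifier(question):
    """Hypothetical heuristic; the real framework trains a small model instead."""
    words = question.lower().split()
    if any(word in MULTI_HOP_CUES for word in words):
        return "complex"
    if len(words) > 6:
        return "moderate"
    return "simple"


def route(question, classifier=toy_complexity_classifier):
    """Pick the retrieval strategy for a question from its predicted complexity."""
    return STRATEGIES[classifier(question)]


plan = [
    route("What is 2 + 2?"),
    route("Who wrote the novel that inspired the film Blade Runner?"),
    route("Which university did the author whose book inspired Blade Runner attend?"),
]
```

Keeping the classifier as a swappable argument means the heuristic can be replaced by a learned model without touching the routing logic.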
by AaaS
Analyzes customer usage patterns, feature adoption gaps, and engagement trends to identify accounts ready for upsell or cross-sell conversations. Produces ranked expansion opportunities with supporting evidence — which features they are actively using, which limits they are approaching, and the timing signals that indicate readiness.
by AaaS
Generates customized 30/60/90-day training schedules for new employees based on their role, department, seniority, and onboarding goals. Sequences mandatory compliance training, role-specific tool walkthroughs, and team introductions into a coherent calendar that integrates with the employee's actual availability.
by AaaS
Composite skill that unifies the three capabilities required for autonomous SRE incident triage: anomaly detection in live telemetry, deterministic runbook execution against production systems, and distributed telemetry analysis for root cause identification. Detects anomalies, correlates across service boundaries using trace IDs, selects the matching runbook, executes it within change-window constraints, and escalates with full diagnostic context if auto-resolution fails. Designed to fill the SRE Triage slot that individual skills cannot cover alone.
by AaaS
Analyzes resumes beyond keyword matching — understanding skill adjacencies, transferable experiences, and career trajectory patterns. Identifies high-potential non-traditional candidates that ATS keyword parsers miss by evaluating semantic meaning of accomplishments and role descriptions.
by AaaS
Recommends and executes instance type changes based on actual resource utilization patterns. Enforces safety guardrails: 7-day cooldown periods, production-instance protection, peak-headroom buffers, and rollback capability for every sizing action.
by AaaS
Configures identity and access management for new employees — creating accounts in the identity provider, assigning security groups, setting up MFA, and granting role-appropriate access levels. Ensures least-privilege principles from day one.
by AaaS
Categorizes unstructured user feedback from support tickets, interviews, NPS surveys, and social mentions into actionable themes. Identifies feature requests, bug reports, UX complaints, and praise clusters, enabling product agents to separate signal from noise across thousands of data points.
by AaaS
Analyzes real-time campaign performance across advertising channels and reallocates budget from underperforming to outperforming campaigns. Respects configurable guardrails including max daily spend caps, minimum ROAS thresholds, and channel floor allocations to prevent over-concentration.
by AaaS
Constructs the full transitive dependency tree from lockfiles and manifest files. Identifies vulnerable transitive dependencies, license conflicts, and bloated dependency chains. Produces a structured tree that the Dependency Guardian agent uses to plan safe update paths through isolated sandbox testing.
by AaaS
Covers heuristics and learned strategies for agents to select the right tool from a large catalog given a task description, including embedding-based tool retrieval, LLM-based routing, and multi-step tool chaining. Teaches fallback hierarchies, tool description engineering, and cost-aware selection to minimize unnecessary API calls.
by Community
Identifies and merges records across heterogeneous data sources that refer to the same real-world entity, using blocking, similarity scoring, and classification models to scale to large corpora. Critical for maintaining knowledge graph integrity and enabling cross-source analytics.
by Community
Causal Effect Estimation quantifies the true impact of an action or intervention by analyzing observational data. It moves beyond simple correlation to isolate causality using statistical methods, which is crucial for evaluating policies, business strategies, and medical treatments where A/B tests are infeasible.
by AaaS
Analyzes vast amounts of unstructured user feedback from support tickets, customer interviews, NPS surveys, and social mentions. Autonomously categorizes feature requests by theme, assigns tasks by engineering expertise match, and predicts project delays based on historical velocity data. Removes guesswork from roadmap prioritization and surfaces patterns that human product managers miss in the noise of daily feedback.
by AaaS
Continuously tracks anonymized engagement metrics, learning platform usage, internal mobility data, and team dynamics signals. Predicts flight risks among high-performing employees using composite scoring and suggests personalized upskilling or internal mobility paths to their direct managers. All individual data is anonymized at collection; predictions are shared only with the employee's direct manager and HR, never used for surveillance.
by Uizard Technologies
AI design assistant that transforms text descriptions, screenshots, and hand-drawn sketches into polished digital prototypes. Enables non-designers to rapidly create professional app and web mockups with zero design experience.
by Thomson Reuters ONESOURCE AI
AI tax advisory agent that analyzes entity structures, transaction types, and jurisdictional nexus to surface optimization strategies, compute effective tax rates, and identify planning opportunities across federal, state, and international tax regimes. Integrates with general ledger systems to automate provision calculations and tax return preparation.
by ScholarAI
AI research agent that provides access to full-text academic papers, not just abstracts. Searches peer-reviewed literature, reads entire PDFs, extracts figures and tables, and answers questions grounded in scientific publications.
by Retell AI
Conversational voice AI platform purpose-built for call center automation. Delivers sub-second latency, natural turn-taking, and enterprise-grade reliability for handling millions of concurrent voice interactions.
by Axiom SL (AxiomSL)
Proactive compliance agent that monitors regulatory feeds (SEC, FINRA, FCA, Basel IV), maps rule changes to affected business processes, and generates gap analysis reports with remediation workflows. Tracks obligations, manages evidence collection, and surfaces regulatory changes before effective dates.
by Hume AI
Hume AI offers a toolkit and APIs for building emotionally intelligent applications. It analyzes human expression across voice, face, and language to measure nuanced emotions. Its Empathic Voice Interface (EVI) enables conversational agents to adapt their tone and prosody in real-time for more natural, empathetic interactions.
by Google
Google's fifth-generation Tensor Processing Unit, the TPU v5p, is an AI accelerator designed for training and serving the largest AI models. It offers significant performance gains over its predecessor, featuring liquid cooling, 95 GB of HBM, and support for new data formats like MX4 for enhanced efficiency and scalability in massive pod configurations.
by AaaS
Specialized pipeline for extracting structured content from PDF documents including text, tables, images, and metadata. Supports OCR for scanned documents, layout analysis for complex formats, and chunking optimized for PDF document structures.
by Meta AI
Generates royalty-free music from text prompts using Meta's MusicGen or AudioCraft, with controls for tempo, key, duration, and genre conditioning. Provides a CLI for batch generation and a streaming mode that writes 30-second chunks to disk or an S3 bucket.
by AaaS
Side-by-side model comparison script that runs identical prompts through multiple LLM APIs and presents results in a structured format. Measures response quality, latency, token usage, and cost per query with automated scoring via LLM judges.
by Community
Parses SEC filings, earnings call transcripts, and annual reports using FinBERT for sentiment analysis and a table-extraction pipeline that converts HTML/XBRL financial statements into normalized pandas DataFrames. Exports structured financial metrics to a database and generates LLM-ready summaries for investor Q&A.
by Feast
Synchronizes feature definitions and materialized feature values between offline (BigQuery/Snowflake) and online (Redis/DynamoDB) feature stores using Feast or Tecton, with configurable freshness SLAs and backfill scheduling. Includes drift monitoring to alert when online and offline distributions diverge.
by Community
Configures a face recognition system using InsightFace or DeepFace, supporting gallery enrollment, real-time identification against a FAISS vector store, and liveness detection. Designed with privacy-first defaults and includes GDPR-compliant consent logging.
by AaaS
Containerizes ML models and inference servers with optimized Docker images for production deployment. Includes multi-stage builds for minimal image size, GPU support configuration, health checks, and docker-compose setups for full inference stacks.
by Community
Processes unstructured clinical notes using medspaCy and BioClinicalBERT to extract diagnoses, medications, procedures, and lab values, then maps entities to ICD-10 and SNOMED-CT codes. Outputs FHIR-compatible JSON bundles and includes a de-identification step compliant with HIPAA Safe Harbor.
by Casetext / Thomson Reuters
Casetext was a pioneer in AI-powered legal research and drafting, launching CoCounsel—the first AI legal assistant powered by GPT-4—before being acquired by Thomson Reuters in 2023 for $650M. Its technology is now integrated into Westlaw and Practical Law, making AI legal assistance available to millions of legal professionals.
by BigCode / Hugging Face / ServiceNow
BigCode is an open scientific collaboration by Hugging Face and ServiceNow for the responsible development of large language models (LLMs) for code. The project produced the StarCoder and StarCoder2 models, trained on 'The Stack' dataset, with a strong emphasis on ethical data governance, source attribution, and consent.
by Zhao et al. / USC
WinoBias is a benchmark dataset designed to measure gender bias in coreference resolution systems. It consists of sentence pairs where pronouns refer to individuals in stereotyped or non-stereotyped occupations, allowing for the quantification of a model's reliance on gender stereotypes versus grammatical correctness.
by Sierra AI
Tool-Agent-User benchmark evaluating AI agents on realistic customer service scenarios requiring multi-step tool use. Tests agents' ability to navigate complex workflows, use tools correctly, follow policies, and handle edge cases in airline and retail domains.
by OpenAI
Benchmark evaluating AI agents on real Kaggle machine learning competitions. Tests the full ML engineering pipeline including data exploration, feature engineering, model selection, training, and submission formatting against actual competition leaderboards.
by Huang et al. / Stanford
MLAgentBench challenges AI agents to perform machine learning research tasks autonomously — reading papers, writing code, running experiments, analyzing results, and improving models. It tests whether agents can replicate and build upon real ML research across 13 diverse ML tasks.
by Codeforces / Community
Evaluates models on competitive programming problems from the Codeforces platform across difficulty ratings. Tests algorithmic thinking, data structure knowledge, and the ability to produce correct and efficient solutions under competitive constraints.
by Zheng et al. / Berkeley Law / LexGLUE
CaseHOLD is a legal NLP benchmark for evaluating a model's ability to identify the correct holding statement for a US court case. Given a citing context, the model must choose the correct holding from a list of candidates. Comprising over 53,000 multiple-choice questions drawn from US case law, it is one of the tasks in the LexGLUE benchmark suite for legal AI.
by Berkeley AI Research (BAIR)
RoboNet is a large-scale dataset for robot learning, featuring 15 million video frames from diverse robot arms across multiple labs. It is designed to train and benchmark self-supervised visual models, aiming to achieve generalization across different robot morphologies and workspaces without task-specific labels.
by CAMEL-AI
The CAMEL-AI Datasets are a collection of synthetic multi-agent conversation datasets generated through the Communicative Agents framework, where AI assistants and user agents collaborate via role-playing to solve tasks. The collection covers coding, math, science, and open-ended reasoning domains, providing diverse instruction-following dialogues useful for SFT and alignment research.
by Albert-Ludwigs-Universität Freiburg
CALVIN is a large-scale dataset and benchmark for long-horizon, language-conditioned robot manipulation. It features over 24 hours of teleoperated demonstration data in a tabletop environment, encompassing 34 distinct skills that can be composed to solve complex, multi-step tasks from natural language instructions.
by Zhang Peiyuan et al. (Academic)
TinyLlama is a compact 1.1 billion parameter language model trained on approximately 3 trillion tokens using the LLaMA architecture and tokenizer. It achieves impressive performance for its size, particularly on commonsense reasoning tasks, making it ideal for resource-constrained environments and real-time applications.
by ByteDance
SDXL-Lightning is a distilled version of Stable Diffusion XL from ByteDance that generates high-quality 1024px images in as few as 1–4 diffusion steps, compared to the 25–50 steps required by the base model. It uses a novel progressive adversarial diffusion distillation technique, enabling near-real-time image generation without meaningful quality degradation.
by Google
PaLM (Pathways Language Model) is Google's 540 billion parameter language model trained using the Pathways system across 6,144 TPU v4 chips, demonstrating breakthrough capabilities on chain-of-thought reasoning, code generation, and multilingual tasks. It introduced the concept of 'discontinuous' capability jumps at scale and set new benchmarks on hundreds of NLP tasks upon release in 2022.
by OpenAI
A premium tier of o1 available through ChatGPT Pro that uses additional compute for more reliable and thorough reasoning. Achieves higher consistency on hard problems in math, science, and coding by spending more inference-time compute.
by NVIDIA
NV-Embed-v2 is NVIDIA's second-generation text embedding model, achieving top-1 performance on the MTEB benchmark at release with innovative decoder-only LLM architecture adaptations for embedding tasks. It introduces latent attention layers and removes causal masking to produce richer, task-aware text representations.
by Mistral AI
A 12B parameter model co-developed by Mistral AI and NVIDIA featuring the novel Tekken tokenizer for improved multilingual efficiency. Designed as a drop-in replacement for Mistral 7B with superior performance.
by Meshy
Meshy AI is a commercial 3D generation platform that produces game-ready, textured 3D meshes from text descriptions or images with PBR material support, making it one of the most production-ready AI 3D tools available. Its focus on texture quality, polygon efficiency, and export format compatibility (FBX, OBJ, GLB) targets game developers, 3D artists, and product designers directly.
by Weaviate
Weaviate's built-in text2vec-cohere and reranker-cohere modules for zero-ETL vectorization and result reranking within Weaviate clusters. Automatically embeds documents at write time using Cohere Embed v3 and reranks retrieval results without external orchestration code.
by Pydantic
PydanticAI's native Anthropic model provider, enabling type-safe agentic workflows backed by Claude models. Agent inputs, tool call parameters, and structured outputs are all validated through Pydantic schemas, with full support for Claude's extended tool use and streaming responses.
by Stanford University
VoxPoser uses LLMs and vision-language models to synthesize 3D voxel-based value and constraint maps that guide robot motion planners, enabling zero-shot generalization to novel language instructions and object configurations. The approach produces trajectories without any robot-specific training by composing affordance maps in 3D space.
by AaaS
Generates personalized upskilling and learning path recommendations for employees based on their current skill profile, career trajectory goals, and identified flight risk signals. Matches employees to specific courses, mentorship opportunities, and internal projects that address their development gaps and increase retention probability.
by AaaS
Identifies non-obvious skill adjacencies and transferable capabilities between different roles and domains. Maps how expertise in one area translates to another — for example, how a military logistics background maps to supply chain management, or how academic research skills map to product analytics. Surfaces high-potential candidates that traditional keyword-based ATS systems routinely discard.
by AaaS
Scores and ranks feature requests by impact (revenue, retention, adoption), effort (engineering complexity, dependencies), and strategic alignment. Produces prioritized backlogs with clear rationale that product managers can use to make data-driven roadmap decisions.
by AaaS
Assigns quantitative risk scores to contract clauses, transactions, regulatory exposures, and compliance gaps based on configurable risk frameworks. Supports multiple scoring methodologies (probability × impact, weighted checklist, Bayesian) with historical calibration.
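As a rough illustration of two of the scoring methodologies named above, the sketch below computes a probability × impact score and a weighted-checklist score. The `RiskItem` type, the 1-5 impact scale, and both function names are assumptions made for this example and do not describe the skill's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskItem:
    name: str
    probability: float  # likelihood in [0, 1]
    impact: float       # severity on an assumed 1-5 scale


def probability_impact_score(item: RiskItem) -> float:
    """Classic probability x impact score."""
    if not 0.0 <= item.probability <= 1.0:
        raise ValueError("probability must be within [0, 1]")
    return item.probability * item.impact


def weighted_checklist_score(passed: dict[str, bool], weights: dict[str, float]) -> float:
    """Fraction of total checklist weight attached to failed or missing checks."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    failed = sum(w for check, w in weights.items() if not passed.get(check, False))
    return failed / total
```

A checklist item that is absent from `passed` is treated as failed, which is the conservative choice for a compliance-style score.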
by AaaS
Drafts personalized retention plans based on churn risk signals — recommending specific interventions like check-in calls, discount offers, feature tutorials, or executive escalations. Each plan includes timing, channel, message, and success criteria.
by AaaS
Continuously monitors live performance metrics and applies optimization decisions in real-time without waiting for batch analysis cycles. Supports A/B test significance gates, automatic pause of degrading variants, and dynamic parameter tuning for campaigns, pricing, and content delivery.
by AaaS
Generates tamper-proof audit logs with cryptographic timestamps for every agent action, decision, and data access event. Ensures legal defensibility of automated actions by maintaining an unbroken chain of evidence that satisfies regulatory audit requirements.
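One common way to make a log tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below shows only that idea, assuming the caller supplies any timestamps inside `record`; the function names and entry layout are hypothetical, and a production system would add trusted timestamping and signed storage on top.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, record: dict) -> dict:
    """Append a record whose hash also covers the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    entry = {"record": record, "prev": prev, "hash": _digest(prev, record)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Return False if any entry was altered, removed, or reordered."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True
```

Editing any earlier record changes its digest, so every later entry stops verifying.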
by AaaS
Tracks anonymized employee engagement metrics across platforms — learning completions, collaboration tool usage, feedback survey responses, and internal mobility activity. Provides the raw signal data that flight-risk and attrition models depend on.
by AaaS
Analyzes individual contract clauses in the context of the full agreement — evaluating interactions between indemnification and liability caps, cross-referencing defined terms, and scoring risk based on deviation from standard playbook positions. Goes beyond clause-in-isolation analysis that generic LLMs provide.
by AaaS
Runs automated disparate-impact testing on screening batches to ensure no protected class is disproportionately affected. Produces bias audit reports with statistical evidence, flags potential violations of employment discrimination laws, and recommends corrective actions.
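A widely used first-pass check for disparate impact is the four-fifths rule: each group's selection rate is compared with the highest group's rate, and ratios below 0.8 are flagged for review. The sketch below is a minimal version of that check; the input layout and function names are assumptions, and a real audit would add statistical significance testing.

```python
def selection_rates(outcomes: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Map each group to selected / total, skipping empty groups."""
    return {g: sel / total for g, (sel, total) in outcomes.items() if total > 0}


def adverse_impact_ratios(outcomes: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Each group's selection rate divided by the highest group's rate."""
    rates = selection_rates(outcomes)
    best = max(rates.values(), default=0.0)
    if best == 0.0:
        # Nobody was selected in any group, so no group is disadvantaged.
        return {g: 1.0 for g in rates}
    return {g: rate / best for g, rate in rates.items()}


def flagged_groups(outcomes: dict[str, tuple[int, int]], threshold: float = 0.8) -> list[str]:
    """Groups whose impact ratio falls below the threshold (four-fifths by default)."""
    ratios = adverse_impact_ratios(outcomes)
    return sorted(g for g, ratio in ratios.items() if ratio < threshold)
```

For example, `flagged_groups({"group_a": (50, 100), "group_b": (30, 100)})` flags `group_b`, whose rate is 60% of the top group's.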
by AaaS
Designs and implements graceful degradation strategies for distributed systems. Configures circuit breakers, fallback responses, timeout cascades, and load shedding policies. Ensures that when individual components fail or exceed their latency budget, the overall system continues serving degraded (but functional) responses rather than failing completely.
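The circuit-breaker-with-fallback pattern mentioned above can be sketched as follows. This is a simplified, single-threaded illustration; the class name, parameters, and half-open behaviour are assumptions for the example rather than the skill's actual configuration.

```python
import time


class CircuitBreaker:
    """Serve a fallback while a dependency is failing, then retry after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()   # open: skip the dependency entirely
            self.opened_at = None   # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()       # degraded but functional response
        self.failures = 0           # success closes the circuit
        return result
```

Because the failure count is not reset when the circuit goes half-open, a single failed trial call re-opens it immediately.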
by AaaS
Extends RAG pipelines to index and retrieve across text, images, tables, and charts, enabling agents to answer questions grounded in visually rich documents like PDFs, slide decks, and technical manuals. Covers ColPali-style late interaction retrieval, multi-vector indexing, and vision-language model integration for answer synthesis.
by AaaS
Implements programmable guardrails that constrain LLM behavior within defined boundaries. Covers input validation, output format enforcement, topic restriction, factuality checking, and automated intervention when model responses deviate from acceptable parameters.
by AaaS
Teaches systematic evaluation of ML models for demographic disparities across protected attributes using established fairness metrics (demographic parity, equalized odds, calibration). Covers AIF360, Fairlearn, bias mitigation strategies, and producing audit-ready fairness reports for regulatory submissions.
by AaaS
Enables agents to retrieve images from text queries (or vice versa) by projecting both modalities into a shared embedding space using models like CLIP, ImageBind, and SigLIP. Covers index construction, cross-modal similarity scoring, and integration with vector databases for unified multimodal knowledge retrieval.
by AaaS
Runs probabilistic reasoning models across thousands of internal financial variables to simulate supply chain shocks, interest rate hikes, customer churn scenarios, and market disruptions. Produces highly specific, actionable risk mitigation strategies — not generic 'diversify your portfolio' advice. Finance teams use this for rapid scenario modeling during sudden market shifts instead of manually updating brittle spreadsheets.
by AaaS
Listens to live customer calls via event-driven audio streams, autonomously fetching relevant knowledge base articles, providing real-time sentiment analysis, and suggesting compliant dialogue paths to the human support agent. The AI never speaks directly to the customer — it coaches the human agent in real-time, reducing average handle time by up to 30% and ensuring policy adherence on every call.
by AaaS
Generates commercial creative assets (images, social posts, ad creatives) that pass automated brand compliance checks. Verifies typography hierarchy, spacing consistency, contrast ratios, and color palette adherence against the organization's brand kit. Eliminates the telltale AI artifacts, hallucinated logos, and off-brand content that plague generic image generation, producing assets that marketing teams can publish without extensive manual correction.
by Symbotic
An orchestration agent that coordinates autonomous mobile robots (AMRs), conveyor systems, and human pickers within warehouse environments to maximize throughput and minimize travel time. It dynamically slots SKUs based on velocity, directs wave picking operations, and monitors equipment health to preemptively flag maintenance needs.
by Jaggaer
A procurement intelligence agent that manages RFP/RFQ processes end-to-end by distributing questionnaires, scoring vendor responses against weighted evaluation criteria, aggregating references and financial health data, and generating comparative vendor scorecards with recommendation summaries for procurement committees. It tracks vendor performance post-award against SLAs and KPIs, triggering contract remediation workflows on breaches.
by Stanford NLP
STORM is an open-source AI research agent from Stanford University designed to automate the creation of comprehensive, Wikipedia-style articles. It simulates a human research process by generating diverse questions, searching the web for information, and synthesizing the findings into a well-structured, cited narrative based on a generated outline.
by Elsevier
A semi-autonomous research mentoring agent that guides students and early-career researchers through literature discovery, hypothesis formation, and academic writing. It performs systematic literature reviews across major databases, synthesizes contradictions between papers, and provides structured feedback on drafts aligned with target journal style guides.
by Waymo
This agent processes and fuses data from camera, LiDAR, and radar sensors in real time. It generates high-fidelity 3D object detections, lane segmentation, and drivable area maps. The system uses adaptive noise modeling to stay robust in adverse weather conditions and operates at a high frequency (100 Hz).
by OpenAI
OpenAI Swarm is an experimental, lightweight multi-agent orchestration framework. It focuses on agent handoffs and routines, designed for educational and exploratory use. The framework emphasizes simplicity and ergonomic patterns for coordinating multiple agents in a straightforward manner.
by MultiOn
Autonomous browser agent that completes real-world web tasks on behalf of users. Navigates websites, fills forms, makes purchases, and interacts with web applications through a simple API, handling complex multi-step workflows end to end.
by Spring Health
Validated digital screening agent that administers standardized mental health assessments (PHQ-9, GAD-7, AUDIT, PCL-5) through conversational interfaces, interprets scores, and escalates crisis situations to human clinicians. Supports longitudinal tracking and integrates with care management platforms.
by Julius AI
AI-powered data analysis agent that transforms raw data into insights through natural language conversation. Supports CSV, Excel, and database connections with automated chart generation, statistical analysis, and exportable reports without requiring programming skills.
by Patsnap AI
Comprehensive intellectual property agent that conducts freedom-to-operate analyses, prior art searches, and patent landscape mapping across global patent databases. Identifies claim overlaps, assesses infringement risk, and generates invalidity arguments with cited evidence for strategic IP prosecution and litigation support.
by NVIDIA
The NVIDIA GB200 NVL72 is a liquid-cooled, rack-scale system designed for exascale AI. It connects 36 Grace Blackwell Superchips, comprising 72 B200 GPUs and 36 Grace CPUs, via fifth-generation NVLink to function as a single massive GPU for training and inferencing on trillion-parameter models with unprecedented performance and energy efficiency.
by AaaS
Orchestrates multiple specialized AI agents in coordinated workflows with task routing, state management, and result aggregation. Implements supervisor and swarm patterns with configurable agent selection logic and inter-agent communication.
by AaaS
Comprehensive model evaluation script that runs models against standard benchmarks including MMLU, HumanEval, GSM8K, and custom evaluation sets. Produces detailed reports with per-category breakdowns, confidence intervals, and comparison charts.
by Community
Analyzes legal contracts and court documents using a fine-tuned LegalBERT model for clause classification, obligation extraction, and risk-flag detection, with outputs cross-referenced against a configurable playbook of standard clause definitions. Generates a redline-ready Word document and a structured JSON risk register.
by Community
Automatically constructs a knowledge graph from unstructured text by extracting subject-predicate-object triples using an LLM, then serializing them to RDF/OWL or property-graph formats. Supports ontology alignment, duplicate merging via entity resolution, and Turtle/JSON-LD export.
by AaaS
End-to-end script for building a searchable knowledge base from heterogeneous sources including documents, APIs, databases, and web content. Orchestrates ingestion, deduplication, embedding, and indexing, and creates a unified query interface across all sources.
by Community
Generates node and edge embeddings for knowledge graphs using Node2Vec, TransE, or a GNN (via PyTorch Geometric), then indexes them in a vector store for similarity search and link prediction. Includes training scripts, evaluation on standard link-prediction benchmarks, and a REST API for embedding lookup.
by AaaS
Converts Hugging Face model weights to GGUF format for use with llama.cpp and compatible inference engines. Supports multiple quantization levels (Q4_K_M, Q5_K_M, Q8_0), validates output integrity, and generates model cards with performance characteristics.
by AaaS
Calculates and projects LLM API costs based on usage patterns, model pricing, and workload forecasts. Compares costs across providers and models, identifies the most cost-effective configuration for a given quality threshold, and generates budget reports.
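The core arithmetic behind such a projection is simple: per-request cost is input tokens times input price plus output tokens times output price. The sketch below assumes per-million-token pricing and a quality score taken from your own evaluations; the `ModelOption` type, its field names, and any prices passed in are hypothetical placeholders, not real provider rates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelOption:
    name: str
    input_price_per_million: float   # currency units per 1M input tokens
    output_price_per_million: float  # currency units per 1M output tokens
    quality: float                   # score from your own evals (assumed 0-1)


def monthly_cost(option: ModelOption, requests: int,
                 input_tokens: int, output_tokens: int) -> float:
    """Projected cost for `requests` calls with average token counts per call."""
    per_request = (input_tokens * option.input_price_per_million
                   + output_tokens * option.output_price_per_million) / 1_000_000
    return requests * per_request


def cheapest_meeting_quality(options: list, min_quality: float, requests: int,
                             input_tokens: int, output_tokens: int) -> Optional[ModelOption]:
    """Lowest-cost option whose quality clears the threshold, or None."""
    eligible = [o for o in options if o.quality >= min_quality]
    if not eligible:
        return None
    return min(eligible, key=lambda o: monthly_cost(o, requests, input_tokens, output_tokens))
```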
by Replicate
Replicate is a cloud platform that makes it trivial to run open-source machine learning models via a simple API with pay-per-second billing. It hosts thousands of community models spanning image generation, video, audio, and language, and allows developers to package and deploy custom models as Cogs without managing any GPU infrastructure.
by Labelbox
Labelbox is an enterprise data-curation and annotation platform that streamlines the creation of high-quality training datasets for computer vision, NLP, and multimodal AI models. It provides annotation tooling, quality workflows, model-assisted labeling, and a managed workforce marketplace.
by BentoML
BentoML is an open-source platform for building, shipping, and scaling AI applications and model inference services, providing a unified framework from local development to cloud production. BentoCloud, its managed service, offers one-click deployment, auto-scaling, and observability for ML teams.
by xAI
Benchmark testing multimodal models on practical real-world visual understanding tasks. Features questions about real photographs requiring spatial reasoning, object recognition, scene understanding, and practical knowledge that goes beyond simple object detection.
by Epoch AI
Benchmark of original, research-level mathematics problems created by professional mathematicians. Tests capabilities at the frontier of mathematical reasoning including novel proofs, advanced computation, and multi-domain mathematical synthesis.
by Toma et al. / University of Toronto
ClinicalCamel Benchmark evaluates open-source language models on clinical dialogue and medical instruction-following tasks derived from physician–patient interactions. It focuses on safety, accuracy, and appropriateness of clinical advice generation.
by NousResearch
Genstruct is a synthetic instruction dataset generated by the Genstruct-7B model, which converts raw documents into structured instruction-response pairs. Unlike typical self-instruct approaches, Genstruct grounds every instruction in a source document, ensuring factual consistency and enabling controllable synthetic data generation from any text corpus.
by Hugging Face
A 50 GB dataset of Python code scraped from GitHub, originally created to train the CodeParrot model as a demonstration of code-focused language model pretraining. It filters repositories for Python files only and applies basic deduplication, making it a lightweight starting point for Python-specific code generation research and experimentation.
by Voyage AI
Voyage AI's third-generation embedding model designed for high-precision retrieval and RAG applications. Outperforms OpenAI and Cohere embeddings on retrieval benchmarks with optimized latent representations.
by Stability AI / Tripo AI
TripoSR is a fast, open-source image-to-3D reconstruction model developed by Stability AI and Tripo AI that generates high-quality 3D meshes from a single image in under 0.5 seconds on modern hardware. It is based on the Large Reconstruction Model (LRM) architecture and represents a step-change in accessible, real-time single-image 3D reconstruction quality.
by James Betker (neonbjb)
Tortoise TTS is a highly expressive, open-source multi-voice text-to-speech system created by James Betker that achieves exceptional naturalness and zero-shot voice cloning quality through an autoregressive + diffusion pipeline, at the cost of significantly higher inference time than real-time TTS systems. Despite its slow generation speed, Tortoise remains a gold standard for open-source TTS quality and is widely used for offline audiobook and creative narration tasks.
by Mistral AI
Mistral AI's natively multimodal model with a dedicated 400M parameter vision encoder alongside a 12B language backbone. Processes images at their native resolution without fixed-size tokenization.
by Microsoft
Microsoft's mixture-of-experts small language model with 42B total parameters but only 6.6B active per token. Delivers performance competitive with much larger dense models while maintaining efficient inference costs and strong multilingual capabilities.
by Tsinghua University (ModelBest)
MiniCPM-V 2.6 is a compact yet capable vision-language model from Tsinghua University, designed for deployment on edge devices and mobile platforms while achieving GPT-4V-level performance on several visual benchmarks. It supports high-resolution images and multi-image inputs, making it remarkably capable relative to its small footprint.
by Hugging Face
SmolAgents is Hugging Face's minimal agent framework that defaults to code-writing agents powered by Hugging Face-hosted open-source models. The integration allows seamless use of models from the Hugging Face Hub (Qwen, Mistral, LLaMA) through the Inference API or local transformers without API key lock-in.
by Zilliz
LangChain VectorStore integration for Milvus, the open-source distributed vector database. Supports billion-scale ANN search, multiple index types (IVF_FLAT, HNSW, DiskANN), and collection-level partitioning through LangChain's unified retriever interface via the pymilvus client.
by Community / Sentry
MCP Sentry server exposing Sentry's error tracking and performance monitoring data to MCP-compatible agents. Agents can list recent issues, retrieve stack traces, inspect breadcrumbs, and query performance data, enabling AI-powered incident triage and root cause analysis workflows.
by Mozilla
LlamaFile by Mozilla and Justine Tunney bundles a complete LLM with its runtime into a single self-contained executable that runs on Linux, macOS, Windows, FreeBSD, NetBSD, and OpenBSD without any installation. It embeds a compressed GGUF model and a llama.cpp backend into a polyglot binary (ZIP + ELF/Mach-O), serving an OpenAI-compatible HTTP API on localhost at startup.
by Meta AI
Galactica is a series of LLMs (125M to 120B parameters) trained on a curated corpus of 48M scientific papers, reference materials, and knowledge bases, designed to store, combine, and reason over scientific knowledge. Despite controversy at release, Galactica established important design principles for scientific LLMs including citation token formatting and working memory prompting.
by Databricks / CMU
Extends Chinchilla scaling laws by incorporating inference costs into the compute-optimal analysis. Shows that when expected inference demand is high, the Chinchilla-optimal allocation is no longer cost-optimal: real deployments should train smaller models on more tokens than Chinchilla suggests.
by AaaS
Forecasts engineering delivery timelines using historical sprint velocity, team composition, and dependency complexity. Identifies tasks at high risk of delay before they become blockers, producing adjusted delivery estimates and capacity recommendations that inform roadmap sequencing.
by AaaS
Analyzes historical usage patterns to recommend optimal reserved instance and savings plan commitments across cloud providers. Models break-even points, commitment risk given usage variability, and mixed on-demand/reserved strategies — producing purchase recommendations with payback periods and risk ratings.
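The break-even point for a commitment can be illustrated with a simplified model in which the reserved rate is billed for every hour of the year while on-demand is billed only for hours actually used. Both function names and the flat hourly pricing are assumptions for this sketch; real pricing also involves upfront payments and term lengths.

```python
HOURS_PER_YEAR = 8760


def break_even_utilization(on_demand_hourly: float, reserved_hourly: float) -> float:
    """Fraction of hours an instance must run before the commitment pays off."""
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    return reserved_hourly / on_demand_hourly


def annual_savings(on_demand_hourly: float, reserved_hourly: float,
                   utilization: float) -> float:
    """Positive when reserving is cheaper at the given utilization (0-1)."""
    on_demand_cost = on_demand_hourly * utilization * HOURS_PER_YEAR
    reserved_cost = reserved_hourly * HOURS_PER_YEAR  # billed whether used or not
    return on_demand_cost - reserved_cost
```

With an on-demand rate of 0.10 and an effective reserved rate of 0.06, the commitment only pays off when the instance runs more than about 60% of the time.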
by AaaS
Applies organization-specific contract playbooks to incoming agreements, flagging deviations from approved positions on key clause types — payment terms, limitation of liability, indemnification, and IP ownership. Tracks each deviation with severity level and recommended fallback language, requiring senior attorney approval to accept non-standard positions.
by AaaS
Runs thousands of probabilistic simulations across financial variables to model outcome distributions under uncertainty. Produces probability-weighted scenarios, confidence intervals, and tail-risk analysis for stress testing, capital planning, and strategic decision support.
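A minimal Monte Carlo sketch of this idea compounds a starting value through randomly drawn period returns and reads percentiles off the sorted outcomes. The normal-returns assumption, the parameter names, and the nearest-rank percentile are choices made for the example, not a description of the skill's actual model.

```python
import math
import random


def simulate_terminal_values(start: float, mean_return: float, volatility: float,
                             periods: int, runs: int, seed: int = 0) -> list:
    """Compound `start` through `periods` random returns, `runs` times; sorted result."""
    rng = random.Random(seed)  # seeded so results are reproducible
    results = []
    for _ in range(runs):
        value = start
        for _ in range(periods):
            value *= 1.0 + rng.gauss(mean_return, volatility)
        results.append(value)
    return sorted(results)


def percentile(sorted_values: list, q: float) -> float:
    """Nearest-rank percentile of an ascending list, with q in (0, 100]."""
    if not sorted_values:
        raise ValueError("no values to summarize")
    rank = max(1, math.ceil(q / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]
```

Reading the 5th, 50th, and 95th percentiles of the sorted outcomes gives a tail-risk figure, a central estimate, and an upside case from the same run.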
by AaaS
Expands user queries with synonyms, related terms, and generated sub-questions to improve retrieval recall. Uses LLMs to reformulate queries into multiple perspectives, capturing documents that a single query might miss.
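The synonym part of query expansion can be sketched without an LLM by substituting one term at a time from a lookup table. The `expand_query` function and the synonym table passed to it are hypothetical; in the skill described above, an LLM would generate the reformulations instead.

```python
def expand_query(query: str, synonyms: dict, max_variants: int = 5) -> list:
    """Return the normalized query plus variants with one term swapped for a synonym."""
    tokens = query.lower().split()
    variants = [" ".join(tokens)]
    for i, token in enumerate(tokens):
        for alt in synonyms.get(token, []):
            candidate = " ".join(tokens[:i] + [alt] + tokens[i + 1:])
            if candidate not in variants:
                variants.append(candidate)
            if len(variants) >= max_variants:
                return variants
    return variants
```

Each variant is then sent to the retriever and the result sets are merged, which raises recall at the cost of extra retrieval calls.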
by AaaS
Reduces model size and inference cost by converting weights from higher to lower precision (FP16 to INT8/INT4). Covers GPTQ, AWQ, GGUF, and bitsandbytes quantization methods with quality-preservation techniques that minimize accuracy degradation.
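To show what converting to lower precision means at its simplest, the sketch below applies symmetric absmax round-to-nearest INT8 quantization to a list of floats. This is a deliberately basic baseline and not GPTQ, AWQ, or any of the other methods listed above, which add calibration data and error compensation; the function names are assumptions.

```python
def quantize_int8(weights: list) -> tuple:
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return [0] * len(weights), 1.0
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list, scale: float) -> list:
    """Recover approximate float weights from the integers and the shared scale."""
    return [q * scale for q in quantized]
```

The reconstruction error per weight is at most about half of `scale`, so a single large outlier in a tensor inflates the error for every other weight. That is one reason practical methods quantize per channel or per group.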
by AaaS
Implements checkpoints where AI agents pause execution to request human approval, review, or input before proceeding with high-stakes actions. Supports configurable approval workflows, timeout handling, and escalation paths for different risk levels.
by AaaS
Implements the Corrective RAG (CRAG) framework where retrieved documents are evaluated for relevance before use, triggering web search fallbacks or query reformulation when confidence is low. Teaches practitioners to build retrieval evaluators, correction triggers, and knowledge refinement steps that significantly reduce hallucination in production RAG systems.
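The control flow of the evaluate-then-correct step can be sketched as below. The token-overlap `relevance` function is a stand-in assumption for a trained retrieval evaluator, and the thresholds, action labels, and `web_search` callback are illustrative names rather than the framework's actual API.

```python
def relevance(query: str, doc: str) -> float:
    """Toy evaluator: share of query tokens that also appear in the document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


def corrective_retrieve(query, retrieved, web_search, upper=0.7, lower=0.3):
    """Return (action, context) after grading the retrieved documents."""
    scores = [relevance(query, doc) for doc in retrieved]
    kept = [doc for doc, s in zip(retrieved, scores) if s >= lower]
    best = max(scores, default=0.0)
    if best >= upper:
        return "correct", kept                      # trust retrieval, drop weak docs
    if best < lower:
        return "incorrect", web_search(query)       # discard retrieval, fall back
    return "ambiguous", kept + web_search(query)    # combine both sources
```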
by AaaS
Applies Anthropic's Constitutional AI principles to self-supervise model outputs against a set of defined rules or principles. The model critiques and revises its own responses to ensure they align with safety guidelines, ethical principles, and quality standards.
by Community
Causal Discovery is a subfield of AI that infers causal relationships from observational data. It constructs a Directed Acyclic Graph (DAG) to represent these cause-and-effect links without manual intervention or controlled experiments, using statistical algorithms to distinguish correlation from causation.
by Anysphere
Cursor is an AI-native code editor built on VS Code that integrates LLMs deeply into the development workflow. It offers inline completions, multi-file chat, agent mode for autonomous task execution, and codebase-wide context via semantic search.
by AaaS DevOps Foundry
Decomposes end-to-end application latency into detailed per-component budgets for real-time and streaming pipeline architectures. Autonomously adds graceful degradation protocols, timeout handling configurations, and p50/p95 tracing metrics required for production multimodal systems. Where a generic AI produces streaming pipeline code without real-world latency considerations, this agent understands the physics of distributed systems and produces actionable latency allocation plans.
by AaaS
Extracts raw emissions data from deeply nested supply chain databases and maps them precisely against 600+ distinct global sustainability frameworks (ISSB, EU CSRD, GRI, TCFD). Generates audit-ready disclosure reports while proactively flagging high-risk carbon hotspots across multiple supplier tiers. Where generic AI produces dangerously non-compliant summaries, this agent understands the precise data requirements of each framework and produces reports that withstand regulatory scrutiny.
by AaaS
Accesses the specific employee's benefits tier, tenure, and family status to answer highly specific questions about deductibles, coverage limits, FSA/HSA eligibility, and retirement matching. Uses targeted retrieval-augmented generation against the actual policy documents — not generic summaries of 100-page PDFs. Every answer includes a source citation; the agent never recommends specific financial decisions and escalates ambiguous cases to HR.
by AaaS
The meta-agent that watches the agents. Audits the actions of all other operating agents across every foundry, verifying that automated decisions comply with anti-discrimination laws (Colorado AI Act, EU AI Act) and organizational policies. Maintains immutable logs of all machine-driven actions for legal defensibility. Cannot modify agent behavior directly — it can only log, alert, recommend, and escalate violations. Essential for any enterprise deploying autonomous AI at scale in the post-Colorado AI Act regulatory environment.
by NVIDIA (GameWorks)
A generative AI agent for game content that combines rule-based PCG with diffusion and language models to produce coherent levels, biomes, quest narratives, and item catalogues at scale. It enforces designer-specified constraints through a constraint satisfaction layer and integrates directly into studio content pipelines for one-click asset export.
by PathAI
Computational pathology agent that analyzes whole-slide digital pathology images to classify tissue subtypes, grade tumors, detect biomarkers (PD-L1, HER2, MSI), and quantify spatial cell distributions. Provides pathologists with AI-assisted preliminary reads and comprehensive morphometric data.
by Benchling AI
AI laboratory protocol agent that designs, optimizes, and translates experimental protocols for wet-lab execution and liquid-handling robot automation. Adapts published methods to lab-specific reagent inventories, instrument constraints, and safety requirements, and generates step-by-step SOPs with timing, volumes, and QC checkpoints.
by Google
Google's fourth-generation TPU, used internally to train PaLM, LaMDA, and early Gemini models. Features 32GB HBM2 per chip and an optical circuit-switched ICI for flexible pod topology, enabling massive-scale distributed training.
by AaaS
Automated testing framework for prompt engineering with test case management, assertion-based evaluation, regression detection, and A/B comparison. Validates prompt outputs against expected patterns, formats, and quality criteria with CI/CD integration.
by AaaS
Quantizes language models using GPTQ for efficient inference on consumer hardware. Performs calibration-based quantization, quality evaluation against the original model, and exports in formats compatible with vLLM, llama.cpp, and other inference engines.
by AaaS
Template for building Model Context Protocol (MCP) servers that expose tools, resources, and prompts to MCP-compatible clients. Includes typed tool handlers, resource providers, error handling, and transport configuration for stdio and HTTP modes.
by AaaS
Load tests LLM API endpoints with configurable concurrency, request patterns, and duration. Measures throughput, latency percentiles (p50/p95/p99), time-to-first-token, and error rates, and generates performance reports with degradation alerts.
by AaaS
Converts CSV data into vector embeddings with configurable column selection, text template formatting, and metadata extraction. Outputs to popular vector stores or file formats with chunking support for large CSV files that exceed memory limits.
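The column-selection and text-template step that precedes embedding can be sketched with the standard `csv` module. Rows are yielded one at a time, so a large file never has to fit in memory; the function name, the template syntax, and the column names in the usage note are assumptions, and the embedding call itself is left out.

```python
import csv
from typing import Iterator, TextIO


def rows_to_documents(csv_file: TextIO, template: str,
                      text_columns: list, metadata_columns: list) -> Iterator[tuple]:
    """Yield (text, metadata) pairs, streaming one CSV row at a time."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        # Only the selected columns are exposed to the template.
        text = template.format(**{col: row[col] for col in text_columns})
        metadata = {col: row[col] for col in metadata_columns}
        yield text, metadata
```

Calling it with a template such as `"{title}: {body}"` and `metadata_columns=["id"]` produces the string to embed plus the fields to store alongside the vector.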
by Community
Configures an audio classification system using Audio Spectrogram Transformer (AST) or YAMNet fine-tuned on AudioSet, with Mel spectrogram feature extraction and batch inference. Exports per-clip predictions with top-5 class probabilities and integrates with a streaming event bus for real-time use.
by Modal Labs
Modal is a serverless cloud platform purpose-built for running GPU-intensive Python workloads including ML inference, fine-tuning, and batch processing without managing infrastructure. Developers define compute requirements in Python decorators and Modal handles container orchestration, scaling, and cold-start optimization.
by Anyscale
Anyscale is the company behind Ray, the open-source distributed computing framework that has become the infrastructure backbone for training and serving large-scale AI at companies like OpenAI, Uber, and Spotify. Anyscale provides a managed platform for Ray workloads, including Anyscale Endpoints for scalable LLM inference and RayLLM for open-model serving.
by University of Hong Kong
Benchmark for evaluating multimodal agents on real operating system tasks spanning Ubuntu, Windows, and macOS environments. Tests agents' ability to interact with desktop applications, file systems, terminals, and GUI elements to complete everyday computer tasks.
by WhereIsAI (SeanLee97)
UAE-Large-V1 (Universal AnglE Embedding) is a high-performance text embedding model using the AnglE optimization method to mitigate vanishing gradient issues in embedding training. It achieves excellent MTEB scores in a moderately-sized package, making it practical for production deployment with a strong quality-to-size ratio.
by BigCode (Hugging Face)
StarCoder2-3B is the smallest model in the StarCoder2 family from BigCode, designed for efficient code generation on consumer hardware with minimal resource requirements. Despite its compact size, it delivers competitive performance on code completion benchmarks relative to much larger open code models.
by Stability AI
Stable Audio 2 from Stability AI is a latent diffusion model that generates stereo music and audio up to 3 minutes long at 44.1kHz, making it the first publicly released model to produce long-form audio at near-CD quality. It supports precise timing and structure control through natural language, enabling users to specify song sections, BPM, and mood with high fidelity.
by Google DeepMind
Google's previous-generation large language model trained with an emphasis on multilingual understanding, reasoning, and code generation. Succeeded by the Gemini family but still available in legacy deployments.
by NVIDIA
Nemotron-4-340B is NVIDIA's open-weight 340-billion-parameter model family, featuring both a base model and instruction-tuned variant optimized for synthetic data generation and RLHF pipelines. It delivers frontier performance on reasoning and knowledge benchmarks and is particularly valued as a teacher model for training smaller models.
by pgvector
pgvector-django package adding native vector similarity search to Django's ORM via PostgreSQL's pgvector extension. Adds VectorField, IvfflatIndex, and HnswIndex with cosine, L2, and inner product distance operators. Enables AI-powered search inside existing Django applications without a separate vector DB.
by Amazon Web Services
Mistral AI's Mistral Large and Mistral Small models available through Amazon Bedrock for serverless inference. Provides AWS-native access to Mistral's frontier models with pay-per-token pricing, IAM-based auth, and Bedrock Guardrails — enabling EU-origin AI capabilities within AWS infrastructure without a separate Mistral API account.
by Braintrust Data
Braintrust wraps the Anthropic SDK to automatically trace every Claude API call and funnel results into structured eval datasets. Developers can run model-graded scoring, regression suites against golden datasets, and A/B comparisons between Claude model versions directly from the Braintrust dashboard.
by DeepMind / Multiple Institutions
This paper argues that LLM safety evaluations must account for sociotechnical contexts—who uses a system, in what social settings, and with what deployment constraints—rather than treating safety as a purely technical property of the model. It proposes a framework integrating stakeholder analysis, deployment context, and systemic risk assessment into safety evaluation pipelines.
by AaaS
Translates identified risks and scenario simulation results into concrete, actionable mitigation strategies. For each risk factor, produces specific hedging recommendations, contingency plans, and trigger thresholds that activate pre-defined responses — moving from abstract risk scores to operational playbooks finance teams can act on.
by AaaS
Performs retrieval-augmented generation scoped to a specific employee's context — their benefits tier, tenure, family status, and location. Unlike generic document search, this skill pre-filters the retrieval corpus to only policy sections applicable to the individual, eliminating hallucinated answers from irrelevant plan tiers.
by AaaS
Identifies which input variables have the greatest impact on financial outcomes by systematically varying each parameter while holding others constant. Produces tornado charts and sensitivity tables that show which levers matter most for risk mitigation.
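Varying one parameter at a time while holding the others at baseline can be sketched as follows. This is an illustrative example only: the `profit` model, the parameter names, and the 10% perturbation are assumptions, not part of the listed skill.

```python
def one_at_a_time(model, baseline, delta=0.10):
    """Perturb each input by +/- delta (relative) with all other inputs held at baseline.

    Returns rows of (name, low_output, high_output, swing), sorted by swing
    descending, which is the ordering a tornado chart would use.
    """
    rows = []
    for name, value in baseline.items():
        low = model({**baseline, name: value * (1 - delta)})
        high = model({**baseline, name: value * (1 + delta)})
        rows.append((name, low, high, abs(high - low)))
    return sorted(rows, key=lambda row: row[3], reverse=True)


def profit(p):
    """Hypothetical financial model used only for illustration."""
    return p["units"] * (p["price"] - p["unit_cost"]) - p["fixed_cost"]


baseline = {"units": 1000, "price": 50.0, "unit_cost": 30.0, "fixed_cost": 5000.0}
for name, low, high, swing in one_at_a_time(profit, baseline):
    print(f"{name:<11} low={low:>9.0f} high={high:>9.0f} swing={swing:>8.0f}")
```

In this toy model, price produces the largest swing and fixed cost the smallest. One-at-a-time analysis ignores interactions between inputs, so it is a first-pass screen rather than a full risk model.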
by AaaS
Solves complex scheduling constraint problems — finding optimal time slots across multiple participants with varying availability, timezone differences, room requirements, and interview format constraints. Produces ranked options with trade-off explanations.
by AaaS
Continuously monitors official regulatory sources — Federal Register, state legislatures, SEC filings, EU Official Journal — for new rules, amendments, and enforcement actions. Filters by jurisdiction, topic, and relevance to active matters, producing structured alerts within hours of publication.
by AaaS
Analyzes live conversation streams (audio transcripts or chat) for emotional tone shifts in real time. Detects frustration, confusion, satisfaction, and urgency, enabling coaching agents to surface relevant responses before customer emotions escalate.
by AaaS
Predicts employee flight risk from anonymized engagement signals — learning platform usage drops, decreased collaboration, manager relationship changes, and peer departure patterns. Produces risk scores with contributing factor breakdown for HR intervention.
by AaaS
Models how price changes affect demand volume using historical transaction data, seasonal patterns, and market conditions. Produces elasticity curves that pricing agents use to find optimal price points that maximize revenue without triggering demand collapse.
by AaaS
Tracks competitor pricing, product changes, and market positioning across public sources, APIs, and data feeds. Detects price changes, new product launches, and positioning shifts, providing the real-time competitive context that pricing and strategy agents need.
by AaaS
Validates generated creative assets against a structured brand kit — checking typography hierarchy, color palette adherence, spacing rules, contrast ratios, and logo usage. Rejects assets with AI artifacts, hallucinated logos, or off-brand styling before they reach human review.
by AaaS
Breaks down end-to-end request latency into per-component, per-service, and per-operation budgets using distributed tracing data. Identifies which components consume disproportionate latency, recommends budget allocations, and detects when individual components exceed their allocation. Essential for designing real-time streaming and multimodal pipelines.
by AaaS
Identifies architectural patterns (monolith, microservices, event-driven, CQRS, hexagonal) from code structure and communication patterns. Classifies modules by their architectural role, detects anti-patterns (distributed monolith, god service), and produces architecture decision records that help teams understand why the system was built the way it was.
by Community
Discovers recurring motifs, sequential association rules, and temporal dependencies within time-series corpora using motif discovery, episode mining, and shapelets-based methods. Enables interpretable feature extraction and behavioral pattern analysis from complex temporal datasets.
by AaaS
Improves LLM accuracy by generating multiple independent reasoning paths for the same question and selecting the most frequent answer through majority voting. Reduces variance in model outputs and increases reliability on reasoning-heavy tasks.
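The majority-voting step can be sketched as below, assuming the final answers have already been extracted from each sampled reasoning path. The normalization (trim and lowercase) is a simplifying assumption; real pipelines often need task-specific answer parsing.

```python
from collections import Counter


def majority_vote(answers):
    """Return (most frequent answer, share of votes) over sampled final answers.

    Answers are normalized by trimming whitespace and lowercasing. Empty
    answers are ignored. Ties resolve to the answer that appeared first.
    """
    normalized = [a.strip().lower() for a in answers if a and a.strip()]
    if not normalized:
        return None, 0.0
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


# Hypothetical final answers from five independently sampled reasoning paths.
samples = ["42", " 42", "41", "42 ", "40"]
print(majority_vote(samples))
```

The vote share is a rough agreement signal: a low share suggests the sampled paths disagree and the answer may deserve review.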
by Community
Designs and maintains formal ontologies that define concepts, properties, and relationships within a domain, enabling machine-readable semantic interoperability and inference. Provides the schema layer for knowledge graphs and supports rule-based reasoning over structured knowledge bases.
by AaaS
Teaches strategies for combining heterogeneous inputs — text, image, audio, tabular — at the feature, decision, or representation level within a single model or agentic pipeline. Covers early fusion, late fusion, cross-attention fusion, and learned weighted aggregation for downstream classification or generation tasks.
by Community
Generates and evaluates counterfactual explanations — minimal input changes that would alter a model's prediction — using structural causal models and algorithmic recourse techniques. Provides actionable explanations for model decisions and supports causal effect estimation under interventions.
by Cognition
Devin is Cognition's fully autonomous software engineer agent. It operates in a sandboxed environment with browser, terminal, and editor access, capable of completing end-to-end engineering tasks from issue triage to deployment with minimal human supervision.
by Sweep AI
AI-powered GitHub bot that transforms issues into pull requests by reading the codebase, planning changes, and writing code. Acts as a junior developer handling routine tasks like bug fixes and small features.
by TransformerOptimus
Open-source autonomous agent framework for building, managing, and running production-grade AI agents. Features a marketplace of tools, concurrent agent execution, performance telemetry, and a GUI for agent management and monitoring.
by Motional
A hierarchical motion planning agent that converts destination goals into kinematically feasible trajectories using HD-map data, real-time traffic feeds, and upstream perception outputs. It performs behavior prediction for surrounding agents, resolves multi-agent interaction scenarios, and replans at 10Hz to handle dynamic obstacles.
by modl.ai
A fully autonomous QA agent that plays through game builds using learned behavioral policies, detects visual and functional regressions via screenshot diffing, and files structured bug reports with repro steps to the issue tracker. It prioritizes test coverage toward recent code changes using diff-awareness and runs 24/7 across cloud GPU fleets to accelerate release cycles.
by Langroid
Lightweight multi-agent programming framework that treats agents as first-class citizens communicating via structured message passing. Provides an intuitive developer experience with strong typing and minimal boilerplate.
by AaaS
Sets up Grafana dashboards and Prometheus metrics for LLM application monitoring. Includes pre-built dashboards for token usage, latency, error rates, cost tracking, and model performance with configurable alert rules and notification channels.
by AaaS
Performance benchmarking suite measuring LLM inference throughput, latency percentiles, time-to-first-token, and tokens-per-second under various load patterns. Generates detailed performance reports with charts for capacity planning and SLA validation.
by AaaS
Configures a hybrid search system combining dense vector similarity with sparse BM25 keyword matching. Sets up dual index creation, score fusion strategies, and query routing logic for optimal retrieval across different query types.
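One common score fusion strategy is reciprocal rank fusion (RRF), which merges ranked lists without needing dense and BM25 scores on a shared scale. The sketch below assumes RRF with the conventional constant k=60; the listing does not say which fusion strategy the script actually uses, and the document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs with reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1). Documents are returned by descending fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists from a dense (vector) index and a sparse (BM25) index.
dense_hits = ["d3", "d1", "d2"]
sparse_hits = ["d1", "d4", "d3"]
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever are still kept.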
by AaaS
Detects hallucinated content in LLM outputs by cross-referencing claims against source documents and knowledge bases. Uses claim decomposition, source attribution scoring, and consistency checking to flag unsupported or fabricated statements.
by Community
GraphQL gateway for multi-model AI services built with Strawberry Python, exposing query, mutation, and subscription resolvers for chat, embedding, and image generation endpoints across multiple LLM providers. Features a DataLoader-based batching layer and persisted query caching to minimize token usage.
by Community
Disambiguates named entities in text by linking them to canonical Wikidata or custom knowledge base entries, using a bi-encoder retriever followed by a cross-encoder reranker. Handles multilingual input via mBERT and outputs entity URIs with confidence scores for downstream graph population.
by AaaS
Classifies documents into predefined categories using LLM-based inference with configurable taxonomies. Supports batch processing, multi-label classification, confidence thresholds, and exports results to CSV or database with audit trails.
by OpenLineage
Instruments ETL and ML pipelines with OpenLineage events, shipping dataset-level provenance metadata to a Marquez or Apache Atlas backend. Generates interactive lineage DAGs showing data transformations from source to model artifact, supporting impact analysis and audit trails.
by AaaS
Analyzes LLM API usage patterns and identifies cost optimization opportunities. Recommends model downgrades for simple tasks, prompt compression strategies, caching opportunities, and batch processing windows based on historical usage data and cost metrics.
by Harvey AI
Harvey AI is an enterprise legal AI platform built on foundation models fine-tuned on legal corpora to assist law firms and corporate legal departments with research, drafting, due diligence, and contract analysis. It is deployed at leading global law firms and backed by OpenAI, positioning itself as the AI layer for professional legal services.
by BigScience / Hugging Face
BigScience was a year-long, open research collaboration involving over 1,000 volunteer researchers, organized by Hugging Face. This global effort focused on the transparent and ethical development of large language models, culminating in the creation of BLOOM, a 176-billion parameter open-access multilingual model.
by Columbia University (Li et al.)
StyleTTS2 is an open-source text-to-speech model from Columbia University that achieves human-level naturalness on LJSpeech and VCTK benchmarks by modeling speech styles as latent diffusion variables for zero-shot voice cloning and expressive synthesis. It surpassed commercial systems like ElevenLabs in several blind listening evaluations, establishing the highest quality bar achieved by any open-source TTS system at the time of publication.
by Stability AI
Stability AI's latent diffusion model for generating high-quality music and sound effects from text descriptions. Produces variable-length stereo audio at 44.1kHz, suitable for commercial music production and sound design workflows.
by Salesforce Research
SFR-Embedding-2 is Salesforce Research's second-generation text embedding model built on a large language model backbone, achieving top results on the MTEB leaderboard at release. It employs a novel training strategy combining supervised contrastive learning with LLM fine-tuning for exceptional retrieval and semantic similarity performance.
by Playground AI
Playground v2.5 is an open-source text-to-image model that achieved top scores on human preference benchmarks at the time of its release, surpassing SDXL and DALL-E 3 in aesthetic quality evaluations. Built by Playground AI, it uses a novel training approach focused on human aesthetic preference rather than just perceptual similarity metrics.
by Microsoft
Microsoft's internally developed large language model designed for Azure AI enterprise workloads. Built independently of any partnership models, MAI-1 represents Microsoft's own frontier research in training and serving large-scale transformers.
by LLaVA Team (UW-Madison / Microsoft)
Large Language and Vision Assistant with improved visual reasoning and OCR capabilities through dynamic high-resolution image encoding. An open-source multimodal model that rivals proprietary alternatives on vision benchmarks.
by OpenAI
OpenAI's experimental Swarm framework natively targets the OpenAI Chat Completions API for lightweight, stateless multi-agent handoffs. Agents are plain Python functions decorated with tool schemas; the framework manages context passing and agent-to-agent transfers through the standard OpenAI function-calling interface.
by AaaS
Interprets complex HR policy documents (benefits plans, handbooks, compliance requirements) and produces clear, accurate answers to employee questions. Every answer includes source citations and confidence scores. Refuses to give financial advice and escalates ambiguous cases.
by AaaS
Implements the Self-RAG framework where the model learns to autonomously decide when retrieval is needed, critique retrieved passages, and reflect on its own generated output — all within a single end-to-end trained model or prompted pipeline. Covers Self-RAG training, inference-time control tokens, and evaluation against standard RAG baselines.
by AaaS
Implements intelligent caching layers for LLM responses to reduce latency and API costs. Covers semantic caching (matching similar queries), exact-match caching, TTL-based invalidation, and cache warming strategies for predictable workloads.
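Exact-match caching with TTL-based invalidation can be sketched as below. This is a simplified in-memory version; the class name, key scheme, and injectable clock are assumptions made for illustration, and semantic caching (which needs an embedding model) is not shown.

```python
import hashlib
import time


class ExactMatchCache:
    """In-memory exact-match cache for LLM responses with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    @staticmethod
    def _key(model, prompt):
        # Hash model and prompt together so identical prompts sent to
        # different models do not collide.
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get(self, model, prompt):
        """Return the cached response, or None on a miss or an expired entry."""
        key = self._key(model, prompt)
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self._store[key]  # lazy invalidation on read
            return None
        return response

    def put(self, model, prompt, response):
        """Store a response that expires ttl_seconds from now."""
        self._store[self._key(model, prompt)] = (self.clock() + self.ttl, response)


cache = ExactMatchCache(ttl_seconds=300)
cache.put("example-model", "What is RAG?", "Retrieval-augmented generation.")
print(cache.get("example-model", "What is RAG?"))
```

A caller would check `get` before calling the provider and call `put` after a successful response. Expired entries are only removed when read, so a long-running service would also want periodic cleanup or a size bound.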
by AaaS
Compresses retrieved documents by extracting only the most relevant passages relative to the query before injecting them into the LLM context. Reduces token usage while maintaining answer quality by eliminating irrelevant content from retrieved chunks.
by AaaS
Translates source code between programming languages while preserving logic, idioms, and patterns. Handles framework-specific migrations, API mappings, and ecosystem-specific conventions for accurate cross-language porting.
by Relevance AI
No-code AI agent platform that enables teams to build, customize, and deploy AI workforce agents without engineering resources. Features a visual agent builder, multi-agent coordination, and pre-built templates for sales, support, and operations workflows.
by Google
Google's sixth-generation TPU, codenamed Trillium, delivering 4.7x compute improvement over TPU v5e. Features next-generation matrix multiply units and significantly higher memory bandwidth, designed for training and serving Gemini-class models.
by Google
Google's fourth-generation Tensor chip powering Pixel 9 smartphones. Features a dedicated TPU-derived neural core enabling on-device Gemini Nano inference for features like live captions, call screening, and generative AI photography without cloud latency.
by AWS
AWS's second-generation custom inference chip with 4x higher compute and 10x higher memory bandwidth than Inferentia1. Optimized for cost-efficient large-scale inference of transformer models with very high throughput and low latency.
by AaaS
Analyzes token usage patterns across LLM applications to identify optimization opportunities. Tracks input/output token ratios, identifies verbose prompts, detects unnecessary context, and recommends prompt engineering improvements for cost reduction.
by Community
Combines ML demand forecasting (Prophet + LightGBM) with constraint-based optimization (Google OR-Tools) to minimize inventory costs while meeting service-level targets across a multi-echelon supply chain. Outputs replenishment orders, safety stock recommendations, and a scenario simulation dashboard.
by AaaS
Comprehensive safety audit for LLM-powered applications testing for prompt injection vulnerabilities, PII leakage, harmful content generation, and policy violations. Generates detailed audit reports with severity ratings and remediation recommendations.
by AaaS
Detects regressions in LLM behavior across model updates, prompt changes, or configuration modifications. Runs golden test sets, compares outputs using semantic similarity and LLM judges, and flags significant quality degradation with detailed diff reports.
by AaaS
Extracts named entities and relationships from unstructured text at scale using LLM-powered NER with custom entity type support. Outputs structured data with entity linking, relationship graphs, and confidence scores for knowledge graph construction.
by AaaS
CI/CD pipeline for machine learning models with automated testing, evaluation, registry management, and staged deployment. Runs benchmark suites, compares against baseline metrics, and promotes models through staging environments with approval gates.
by AaaS
Evaluates AI agent performance across defined test scenarios with success criteria, step tracking, and automated scoring. Supports custom evaluation rubrics, regression detection, and generates detailed reports comparing agent versions over time.
by Nomic AI
Nomic AI builds open, auditable AI systems focused on embedding models and large-scale data visualization, most notably the nomic-embed-text model and Atlas—a platform for exploring and understanding massive datasets through interactive AI-powered maps. The company emphasizes transparency and reproducibility in model development.
by Fireworks AI
Fireworks AI is a production inference platform founded by ex-Google Brain researchers, offering fast and reliable serving for open-weight models with enterprise SLAs. Fireworks specializes in compound AI systems, function calling, and JSON-mode inference, and provides FireFunction—its own fine-tuned function-calling model—alongside hosting for Llama, Mistral, and other popular open models.
by Cerebras Systems
Cerebras Systems designs and manufactures the Wafer Scale Engine (WSE), the world's largest AI chip, enabling ultra-fast LLM training and inference at speeds far exceeding GPU clusters. Its CS-3 system and Cerebras Inference cloud service deliver token generation rates of 2,000+ tokens/second for leading open-weight models.
by Hugging Face
SmolLM 1.7B is Hugging Face's compact and highly capable language model trained on a curated dataset called SmolLM-Corpus consisting of high-quality web, code, and math data. It achieves remarkable performance for its size and was designed specifically for efficient on-device deployment and real-time inference.
by Minimax
Minimax Video-01 (also known as Hailuo AI) is a video generation model from Minimax that delivers exceptional subject consistency and smooth motion across long clips, making it particularly strong for character-driven narratives and story continuity. It is one of the few publicly accessible models capable of maintaining consistent human subject identity across multiple generated scenes.
by VikParuchuri / ChromaDB
Combines Marker's high-fidelity PDF-to-Markdown conversion with ChromaDB's local-first vector store for lightweight, self-hosted RAG pipelines. Ideal for on-device or air-gapped deployments where cloud vector stores are unavailable.
by AaaS
Verifies consistency of defined terms, cross-referenced clauses, and party obligations across related legal agreements — MSAs, SOWs, NDAs, and amendments. Detects conflicting definitions, inconsistent liability caps, and orphaned references that would create ambiguity or enforceability issues during disputes.
by AaaS
Embeds structured compliance check sequences directly into operational workflows, ensuring that audit verification steps execute automatically at defined trigger points. Each check is timestamped and linked to the regulatory requirement it satisfies, creating an unbroken compliance thread that survives personnel changes and process updates.
by AaaS
Cross-references new regulatory changes against active client case files, contracts, and operational workflows. Identifies which clients, matters, or business processes are affected by each regulatory update, enabling targeted compliance action rather than blanket review.
by AaaS
Trains agents to localize specific image regions described by natural language referring expressions, bridging the gap between language and spatial visual understanding. Covers grounding models (Grounding DINO, Grounded SAM), evaluation metrics (R@k, mAP), and integration into tool-use agents for UI automation and document analysis.
by AaaS
Extends chain-of-thought prompting by exploring multiple reasoning paths simultaneously and evaluating them as a branching tree. The model generates, evaluates, and prunes candidate solutions using breadth-first or depth-first search strategies for optimal problem solving.
by AaaS
Implements Reinforcement Learning from Human Feedback to align language models with human values and preferences. Covers the full pipeline: supervised fine-tuning, reward model training from comparison data, and policy optimization with PPO or similar algorithms.
by AaaS
Covers structured prompting strategies for text-to-music models (MusicGen, Suno, Udio) to generate on-brand, mood-appropriate audio tracks at scale. Teaches tempo, key, instrumentation, and style descriptors alongside iterative regeneration and stem separation workflows.
by AaaS
Provides a systematic framework for understanding the internal representations, circuits, and learned concepts of deep learning models beyond surface-level feature attribution. Covers probing classifiers, concept activation vectors (TCAV), sparse autoencoders for mechanistic interpretability, and best practices for communicating findings.
by AaaS
Detects and blocks jailbreak attempts that try to bypass LLM safety training through adversarial prompting techniques. Uses pattern recognition, semantic analysis, and classifier-based approaches to identify known and novel jailbreak vectors before they reach the model.
by AaaS
Augments traditional vector-based RAG with knowledge graph structures for multi-hop reasoning and relationship-aware retrieval. Builds entity-relationship graphs from documents and traverses them to answer complex queries requiring cross-document reasoning.
by AaaS
Teaches generation of 'what-if' explanations that show users the minimal input changes required to flip a model's decision, providing actionable recourse in high-stakes settings. Covers DiCE, NICE, and custom counterfactual search algorithms, with guidance on feasibility constraints and user-facing presentation.
by Intel
Intel's first dedicated Neural Processing Unit embedded in Core Ultra (Meteor Lake) laptop processors. Delivers 10+ TOPS for AI inferencing on Windows AI PCs, enabling background AI workloads like live captioning, noise suppression, and on-device LLM assistance without using GPU/CPU resources.
by AaaS
Configures intelligent rate limiting for LLM API proxies with per-user, per-model, and per-endpoint limits. Implements token bucket, sliding window, and adaptive rate limiting algorithms with Redis-backed distributed state and graceful degradation.
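Of the algorithms named above, the token bucket is the simplest to sketch. The version below is single-process and in-memory, with a hypothetical class name and an injectable clock; the Redis-backed distributed state described in the listing is not shown.

```python
import time


class TokenBucket:
    """Token bucket limiter: holds up to `capacity` tokens, refilled continuously."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so behavior can be tested without sleeping
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated_at = clock()

    def allow(self, cost=1.0):
        """Consume `cost` tokens and return True if available, else return False."""
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per (user, model) pair is one way to get per-user, per-model limits.
bucket = TokenBucket(capacity=5, refill_per_second=1.0)
print([bucket.allow() for _ in range(7)])
```

For LLM proxies, `cost` can be set to the estimated token count of a request rather than 1, so large prompts consume more of the budget.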
by AaaS
Benchmarks LLM API latency across providers, models, and prompt sizes with detailed statistical analysis. Measures time-to-first-token, inter-token latency, total response time, and generates comparison reports with confidence intervals and percentile distributions.
by AaaS
Deploys and manages LLM inference workloads on Kubernetes with GPU scheduling, auto-scaling based on queue depth, rolling updates, and canary deployments. Generates Helm charts and Kustomize configurations for reproducible deployments.
by Community
Forecasts electricity demand and renewable generation (solar/wind) using Temporal Fusion Transformer or N-HiTS via NeuralForecast, with weather feature integration and probabilistic intervals for grid balancing. Outputs 24-hour and 7-day ahead forecasts in an InfluxDB-compatible format.
by AaaS
Configures an API gateway for LLM inference endpoints with provider routing, rate limiting, authentication, request/response logging, and failover between multiple LLM providers. Includes usage tracking and cost allocation by API key.
by AaaS
Automated data annotation pipeline using LLMs for labeling, classification, and quality scoring of training data. Implements multi-annotator consensus, confidence thresholds, human review queuing for uncertain samples, and annotation analytics.
by AaaS
Deploys AI agents as production services with health checks, graceful shutdown, error recovery, and monitoring integration. Supports Docker and Kubernetes deployments with configurable scaling, environment management, and rollback capabilities.
by Helicone
Helicone is an open-source LLM observability and monitoring platform that provides a single proxy endpoint for logging, tracking costs, debugging, and improving LLM applications across all major model providers. It integrates with a one-line code change and supports caching, rate limiting, and prompt management.
by Stanford University (Academic)
OpenVLA is Stanford University's open-source vision-language-action model for robot manipulation, built on a 7B parameter vision-language model backbone and fine-tuned on the Open X-Embodiment dataset. It achieves strong performance on robot manipulation tasks while being fully open-source and reproducible, serving as a community standard for VLA research.
by Amazon
Amazon's cost-effective multimodal model optimized for fast processing of image, video, and text inputs at low latency. Ideal for high-volume workloads where speed and cost efficiency matter more than peak accuracy.
by Comet ML
Opik by Comet provides an open-source LLM observability platform that integrates with LangChain via a callback handler, recording traces, token counts, and custom scores into a queryable dataset. The integration includes built-in hallucination and answer-relevance evaluators that run automatically on captured traces.
by VRSEN
Agency Swarm is built on top of the OpenAI Assistants API, wrapping it with agency-level abstractions for defining communication flows between specialized agents. It provides a higher-level interface for creating persistent agent threads, shared tool registries, and structured agent communication protocols.
by Meta
Meta's efficient open-weight model family outperforming larger closed models.
by OpenAI
175B-parameter language model whose few-shot performance demonstrated emergent in-context learning.
by Google
Paper by Wei et al. showing that prompting for step-by-step reasoning substantially improves LLM accuracy on multi-step reasoning tasks.
by AaaS
Maps raw data against 600+ global sustainability and compliance frameworks (ISSB, EU CSRD, GRI, TCFD, SASB). Understands the precise data requirements of each framework and produces framework-specific disclosure reports that withstand regulatory scrutiny.
by AaaS
Audits the actions taken by other AI agents across all foundries — verifying that automated decisions comply with anti-discrimination laws, organizational policies, and regulatory requirements. The meta-skill that enables the AI Governance Auditor to watch the watchers.
by AWS
AWS's second-generation custom AI training chip, delivering up to 4x the performance of first-generation Trainium. Designed specifically for training large language models on AWS, with tight integration with UltraCluster networking for scale-out training jobs.
by AaaS
Automated red teaming toolkit that generates and tests adversarial prompts against LLM applications. Covers jailbreak attempts, prompt injection variants, social engineering patterns, and boundary probing with categorized attack vectors and success tracking.
by AaaS
RAG pipeline that queries multiple specialized vector indexes and merges results with intelligent routing. Implements source-aware retrieval with automatic query classification, per-source relevance scoring, and citation tracking across diverse knowledge domains.
by Community
Implements a GDPR-compliant consent management layer that records per-user data processing consents in an append-only ledger, enforces purpose limitation at the data access layer, and generates DSAR (data subject access request) reports on demand. Supports consent propagation to downstream ML training pipelines.
by AaaS
Detects demographic and topical biases in LLM outputs by running structured test prompts across protected categories. Measures response quality disparities, sentiment differences, and representation gaps with statistical significance testing and bias scorecards.
by AaaS
Sets up a monitoring dashboard for AI agent systems tracking task completion rates, error rates, latency, token usage, and cost. Integrates with Prometheus for metrics collection and Grafana for visualization with pre-built alert rules.
by AaaS
Framework for A/B testing different LLM configurations including models, prompts, temperatures, and system instructions. Runs controlled experiments with statistical significance testing, effect size calculation, and automated winner selection.
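As a rough illustration of the statistics involved, the sketch below compares the success rates of two configurations with a two-proportion z-test and reports Cohen's h as the effect size. The choice of test, the function name, and the sample counts are assumptions for the example, not the framework's documented method.

```python
import math


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Two-sided z-test for a difference between two success rates.

    Returns (z, p_value, cohens_h). Positive z means variant B did better.
    """
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Cohen's h: difference of arcsine-transformed proportions.
    cohens_h = 2 * math.asin(math.sqrt(p_b)) - 2 * math.asin(math.sqrt(p_a))
    return z, p_value, cohens_h


# Hypothetical counts: prompt A passes 400 of 1000 tasks, prompt B 460 of 1000.
z, p_value, effect = two_proportion_z_test(400, 1000, 460, 1000)
winner = "B" if p_value < 0.05 and z > 0 else "no clear winner"
```

Automated winner selection would normally also fix the sample size in advance, since repeatedly checking significance while an experiment runs inflates the false-positive rate.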
by Together AI
Together AI provides a cloud platform for running, fine-tuning, and deploying open-source language models. It hosts a wide catalog of models from Llama to Mistral and offers serverless inference, dedicated endpoints, and a fine-tuning pipeline. Together AI is popular among developers who want OpenAI-compatible APIs for open-weight models at competitive pricing.
by Replicate
Replicate is a platform for running machine learning models in the cloud via a simple API. It hosts thousands of open-source models for image generation, language, audio, and video, deployable with a single API call. Replicate charges per-second of GPU usage and supports deploying custom models as private or public endpoints.
by OpenAI
OpenAI is a leading AI research and deployment company behind the GPT and o-series model families. It offers API access to frontier language models, image generation via DALL-E, speech recognition via Whisper, and an Assistants API for building stateful agent workflows. OpenAI operates both a consumer product (ChatGPT) and an enterprise API platform used by millions of developers.
by Modal Labs
Modal is a cloud compute platform for running GPU workloads from Python, with a focus on developer ergonomics and serverless scaling. It allows deploying Python functions as GPU-accelerated endpoints with zero infrastructure configuration, automatic scaling to zero, and fast cold-start times. Popular for ML inference, batch jobs, and LLM serving.
by Groq
Groq offers ultra-low-latency LLM inference through its custom Language Processing Unit (LPU) hardware. The GroqCloud API serves open-weight models including Llama, Mixtral, and Gemma at speeds that far exceed GPU-based inference, making it ideal for real-time agent applications. Groq provides a developer-friendly API compatible with the OpenAI client format.
by Lannelongue et al. / EMBL-EBI
EnergyBench quantifies the energy consumption and carbon footprint of AI inference across hardware and software configurations. It correlates task accuracy with joules consumed, enabling practitioners to make informed accuracy-efficiency trade-offs for sustainable AI deployment.
by OpenAI
OpenAI's text-embedding-3-small offers a cost-optimized embedding model producing 1536-dimensional vectors. Despite its smaller size, it outperforms the older ada-002 model on most benchmarks. It is widely used for production RAG systems where cost per embedding is a key constraint.
by OpenAI
OpenAI's text-embedding-3-large is the highest-performance model in OpenAI's embedding lineup, producing 3072-dimensional vectors with top results on MTEB English benchmarks. It supports dimensionality reduction via the dimensions parameter, allowing developers to trade off performance against storage and compute cost.
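The storage trade-off can be pictured as keeping the leading components of a vector and rescaling the result to unit length. The sketch below does this client-side on a plain list. It is an approximation under stated assumptions: the helper name is hypothetical, and whether it reproduces the API's `dimensions` output exactly is not confirmed here.

```python
import math


def shorten_embedding(vector, dimensions):
    """Keep the first `dimensions` components and L2-normalize the result."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    head = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy 3-dimensional vector standing in for a 3072-dimensional embedding.
short = shorten_embedding([3.0, 4.0, 12.0], 2)
```

Renormalizing matters because cosine and dot-product similarity are only interchangeable for unit-length vectors.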
by Microsoft
Phi-4 is Microsoft's 14B-parameter small language model that outperforms much larger models on reasoning and STEM benchmarks through high-quality synthetic training data. Available as open weights on Hugging Face and via Azure AI, Phi-4 demonstrates that strategic data curation can substitute for raw parameter scale in many practical applications.
by OpenAI
OpenAI's o3 is a frontier reasoning model that uses chain-of-thought inference to solve complex scientific, mathematical, and coding problems. It achieves top scores on GPQA Diamond, AIME, and SWE-bench, representing the state of the art in step-by-step reasoning. o3 trades higher inference latency for significantly deeper problem-solving capability.
by DeepSeek
DeepSeek R1 is a reasoning-focused model trained with reinforcement learning, achieving performance comparable to o1 on math, code, and scientific reasoning benchmarks at a fraction of the training cost. Released as an open-weight model, R1 sparked widespread adoption in the community and demonstrated the viability of RL-trained reasoning models outside of large US labs.
by 01.AI
01.AI's bilingual large language model excelling at both Chinese and English tasks with a 200K extended context variant. Demonstrates that smaller, well-trained models can compete with much larger alternatives on reasoning benchmarks.
by OpenAI
Shap-E is OpenAI's improved 3D generation model that generates implicit neural representations of 3D objects (which can be rendered as textured meshes or NeRFs) from text or image prompts, producing richer geometry and appearance than its predecessor Point-E. It encodes 3D assets into a compact latent space and trains a diffusion model over that space, enabling more coherent shape-and-texture output.
by Google DeepMind
RT-2 (Robotics Transformer 2) is Google DeepMind's vision-language-action model that directly maps visual observations and language instructions to robot actions, enabling robots to perform novel tasks through generalization from web-scale pretraining. It represents a breakthrough in combining foundation model capabilities with physical robot control.
by UC Berkeley (Academic)
Octo is an open-source generalist robot policy model from UC Berkeley that can be fine-tuned to new robot setups with minimal data, trained on 800K robot trajectories from the Open X-Embodiment dataset. It uses a flexible transformer architecture that supports language instructions, image observations, and diverse action spaces across different robot morphologies.
by Nomic AI
Nomic AI's fully open-source and open-data text embedding model with auditable training data and competitive MTEB performance. First embedding model to be trained with full data transparency, enabling reproducibility and trust.
by LanceDB
LlamaIndex integration for LanceDB's serverless, embedded vector database built on the Lance columnar format. Supports multimodal data (text, images, video), zero-copy queries, and versioned datasets. Ideal for local or edge AI applications requiring a zero-ops vector store with full LlamaIndex query engine compatibility.
by Jina AI / PostgreSQL
Routes Jina Reader's URL-to-text extraction through PostgreSQL's pgvector extension for SQL-native RAG storage. Enables teams already running PostgreSQL to add vector search without adopting a separate vector database, keeping the stack simple.
by IBM / Weaviate
Combines IBM's Docling document conversion library with Weaviate's vector database for structured RAG pipelines. Docling extracts rich document structure (tables, figures, headings) which is then stored as typed Weaviate objects with native vector indexing.
by DeepSeek
DeepSeek's open reasoning model matching o1 via RL on chain-of-thought.
by AaaS
Verifies that automated decisions comply with anti-discrimination laws including the Colorado AI Act and EU AI Act. Tests for disparate impact across protected classes, validates that no prohibited factors influence decisions, and generates compliance reports with statistical evidence.
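A disparate impact check is often summarized as a ratio of favorable-outcome rates between groups. The sketch below computes that ratio against a reference group and flags values below 0.8, following the commonly cited four-fifths guideline. The threshold, function names, and counts are assumptions for illustration; the entry does not specify which test the skill applies, and a ratio alone is not a legal determination.

```python
def disparate_impact_ratios(outcomes, reference_group):
    """Return each group's favorable-outcome rate divided by the reference rate.

    outcomes maps a group name to (favorable_count, total_count).
    """
    ref_favorable, ref_total = outcomes[reference_group]
    ref_rate = ref_favorable / ref_total
    if ref_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return {
        group: (favorable / total) / ref_rate
        for group, (favorable, total) in outcomes.items()
    }


def flag_groups(ratios, threshold=0.8):
    """List groups whose ratio falls below the threshold (four-fifths rule)."""
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


# Hypothetical approval counts per group.
ratios = disparate_impact_ratios(
    {"group_a": (50, 100), "group_b": (30, 100)}, reference_group="group_a"
)
flagged = flag_groups(ratios)
```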
by AaaS
Routes queries to the most relevant vector index or knowledge source from a collection of specialized indexes. Enables agents to search across multiple domains, document types, or data sources with automatic index selection and result merging.
by AaaS
Implements Direct Preference Optimization for aligning language models with human preferences without requiring a separate reward model. Simplifies the RLHF pipeline by directly optimizing the policy model using preference pairs of chosen and rejected responses.
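For a single preference pair, the DPO objective compares how much the policy has moved toward the chosen response and away from the rejected one, relative to a frozen reference model. The sketch below evaluates that loss from four summed log-probabilities. It is a scalar illustration rather than a training loop, and the argument names and the default `beta=0.1` are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model.
    """
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy raised the chosen response and lowered the rejected one.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

No reward model appears anywhere in the computation, but the reference log-probabilities are still required, so a frozen copy of the starting model has to be kept available during training.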
by Cline
Cline is an open-source autonomous coding agent that runs as a VS Code extension. It can create and edit files, execute terminal commands, use the browser, and call MCP tools—all with a transparent approval workflow showing every action before execution.
by Zhejiang University / Tencent
Multimodal web agent that combines vision and language understanding to navigate and interact with real-world websites. Uses screenshot-based observation and structured action prediction to complete complex web tasks without relying on DOM access.
by BioRelate (Elsevier)
Knowledge-graph-driven hypothesis generation agent that identifies non-obvious connections between biological entities, molecular pathways, and phenotypic outcomes to propose novel, testable scientific hypotheses. Ranks generated hypotheses by novelty, biological plausibility, and experimental feasibility scores.
by Google
Google's TPU v7 Ironwood is the seventh generation of Google's custom Tensor Processing Units, designed for large-scale AI inference at hyperscaler capacity. Ironwood pods target serving frontier models like Gemini at Google's internal scale and are available to cloud customers via Google Cloud's TPU v7 instances.
by Google
Google TPU v6e Trillium is Google's sixth-generation TPU, with roughly 4.7x the peak compute per chip and double the HBM capacity and bandwidth of v5e. Trillium is generally available on Google Cloud for both training and inference workloads, offering a cost-efficient TPU option for teams training Gemma and other open models on Google Cloud.
by SambaNova Systems
SambaNova's SN40L is a Reconfigurable Dataflow Unit designed for high-throughput LLM inference and training. Its three-tier memory architecture, which combines on-chip SRAM, on-package HBM, and off-chip DDR DRAM, allows serving multiple large models simultaneously with high batch throughput. The SN40L is the hardware underlying SambaNova Cloud's inference API.
by NVIDIA
The NVIDIA RTX 5090 is NVIDIA's flagship consumer/prosumer GPU in the Blackwell generation, featuring 32GB of GDDR7 memory for local AI inference and fine-tuning. A 70B model quantized to 4 bits needs about 35GB for weights alone, so running one on a single RTX 5090 requires more aggressive quantization or partial CPU offload; models in the 30B class fit comfortably at 4 bits. It is a leading choice for developers who need high-end local model capability in a workstation.
by NVIDIA
The NVIDIA H200 is a Hopper-generation GPU with 141GB of HBM3e memory, nearly double the H100's 80GB capacity, targeting inference workloads for very large models. The additional memory enables running 70B+ parameter models on fewer GPUs, reducing the cost per inference token for large-scale deployments.
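The "fewer GPUs" claim follows from simple arithmetic on weight storage. The sketch below estimates the memory needed for model weights at a given precision and the minimum GPU count to hold them. It deliberately ignores KV cache, activations, and framework overhead, so real deployments need headroom beyond these figures; the function names are hypothetical.

```python
import math


def weight_memory_gb(num_params, bits_per_param):
    """Memory for model weights only, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


def min_gpus_for_weights(num_params, bits_per_param, gpu_memory_gb):
    """Smallest GPU count whose combined memory holds the weights."""
    return math.ceil(weight_memory_gb(num_params, bits_per_param) / gpu_memory_gb)


# A 70B-parameter model in 16-bit precision needs about 140 GB for weights.
fp16_gb = weight_memory_gb(70e9, 16)
gpus_141gb = min_gpus_for_weights(70e9, 16, 141)  # one 141GB GPU
gpus_80gb = min_gpus_for_weights(70e9, 16, 80)    # two 80GB GPUs
```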
by NVIDIA
The NVIDIA H100 Hopper GPU is the dominant AI training and inference accelerator in production deployments as of 2024–2025. With 80GB of HBM3 memory and fourth-generation NVLink, it delivers up to 4x the training performance of the A100 on large language models, according to NVIDIA. H100 SXM5 GPUs are linked into 8-GPU nodes via NVSwitch for large model training runs.
by NVIDIA
The GB200 NVL72 is NVIDIA's rack-scale AI system combining 36 Grace CPUs and 72 Blackwell B200 GPUs via NVLink interconnect. It delivers up to 1.44 ExaFLOPS of AI compute in a single rack, targeting hyperscaler-class training of frontier models. The NVL72 represents a fundamental shift from server-level to rack-level GPU system design.
by NVIDIA
The NVIDIA B200 is the first Blackwell-architecture data center GPU, delivering 2.5x the training throughput and 5x the inference performance of the H100. With 192GB of HBM3e memory and NVLink 5 interconnects, it is designed for training and serving trillion-parameter models. The B200 anchors NVIDIA's Blackwell product generation.
by NVIDIA
The NVIDIA A100 Ampere GPU remains widely deployed in cloud and on-premises AI infrastructure for training and inference. With 40GB or 80GB HBM2e memory variants and MIG (Multi-Instance GPU) support for partitioning into up to 7 isolated GPU instances, the A100 is the proven workhorse of many production AI deployments.
by Intel
Intel Gaudi 3 is Intel's AI training and inference accelerator designed as a cost-competitive alternative to NVIDIA H100. It features 128GB of HBM2e memory and 24 200GbE RoCE ports for scale-out connectivity. Gaudi 3 is supported by the Intel Gaudi software stack and Hugging Face's Optimum Habana library, and is available via major cloud providers and on-premises.
by Groq
Groq's Language Processing Unit (LPU) is a deterministic ASIC architecture optimized for sequential transformer inference, eliminating the memory-bandwidth bottlenecks of GPU-based serving. Groq LPU clusters deliver measured token generation speeds of 500+ tokens/second for Llama-class models, significantly outpacing GPU inference for latency-critical applications.
by Cerebras Systems
The Cerebras Wafer-Scale Engine 3 (WSE-3) is the world's largest chip, containing 4 trillion transistors on a single 46,225 mm² silicon wafer. Its architecture eliminates the memory bandwidth bottlenecks of conventional GPU clusters for large model inference, achieving industry-leading tokens-per-second throughput for models up to 70B parameters.
by Amazon Web Services
AWS Trainium3 is Amazon's third-generation custom ML training chip, offering significant improvements in training throughput and energy efficiency over Trainium2. Trainium3 instances are available through Amazon SageMaker and EC2, targeting cost-efficient training of large language models for AWS-native AI development teams.
by AMD
The AMD Instinct MI325X is an updated Instinct GPU with 256GB of HBM3e memory and improved memory bandwidth over the MI300X. It targets inference workloads for the largest frontier models and positions AMD competitively against the NVIDIA H200 in memory-bound inference scenarios.
by AMD
The AMD Instinct MI300X is AMD's flagship AI accelerator featuring 192GB of HBM3 memory, the highest of any GPU when released. This massive memory capacity makes it compelling for inference of 70B+ parameter models and has led to adoption by Microsoft Azure, Oracle, and major AI labs as an H100 alternative.
by MediaTek
MediaTek Dimensity 9400's AI Processing Unit — the most powerful mobile NPU in Android smartphones. Delivers 50 TOPS for on-device AI with support for 13B parameter models on-device, enabling private, low-latency AI features for Android flagship devices.
by AaaS
Migrates vector data between different vector database providers (Pinecone, Weaviate, Chroma, Qdrant, Milvus). Handles schema mapping, batch transfers, index recreation, metadata preservation, and validation with rollback support.
by AaaS
Merges multiple fine-tuned model checkpoints using strategies like SLERP, TIES, DARE, and linear interpolation. Enables combining specialized model capabilities without additional training, with automated quality validation against benchmark suites.
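Of the listed strategies, SLERP is the easiest to show in isolation: it interpolates along the arc between two weight vectors instead of the straight line between them. The sketch below applies it to two flat lists of floats standing in for one flattened tensor from each checkpoint. Applying it per tensor, and the fallback to linear interpolation for nearly parallel vectors, are assumptions for the example rather than the tool's documented behavior.

```python
import math


def slerp(weights_a, weights_b, t, eps=1e-8):
    """Spherical linear interpolation between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(weights_a, weights_b))
    norm_a = math.sqrt(sum(a * a for a in weights_a))
    norm_b = math.sqrt(sum(b * b for b in weights_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot interpolate with a zero vector")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Vectors are nearly parallel: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(weights_a, weights_b)]
    coeff_a = math.sin((1 - t) * omega) / sin_omega
    coeff_b = math.sin(t * omega) / sin_omega
    return [coeff_a * a + coeff_b * b for a, b in zip(weights_a, weights_b)]


# Halfway between two orthogonal unit vectors stays on the unit circle.
merged = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike linear interpolation, which would give `[0.5, 0.5]` here, the spherical midpoint keeps unit norm, which is the usual motivation for preferring SLERP when merging two checkpoints.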
by AaaS
Testing harness for AI agents with mock tool providers, simulated user interactions, and deterministic replay capabilities. Enables unit testing of agent logic, integration testing of tool chains, and end-to-end testing of complete agent workflows.
by xAI
xAI is Elon Musk's AI company and creator of the Grok model family. It provides API access to Grok models with real-time web search integration, available through the xAI API and X (Twitter) platform. Grok models are trained on a broad mix of web and social data and emphasize up-to-date knowledge and uncensored reasoning.
by Vast.ai
Vast.ai is a peer-to-peer GPU marketplace connecting researchers and startups with spare GPU capacity from data centers and individuals worldwide. It offers some of the cheapest GPU rental prices on the market with flexibility to choose hardware by price, latency, or reliability score. Best suited for cost-sensitive experimentation and training runs.
by Together AI
Together AI's compute platform provides on-demand and reserved GPU clusters for training and fine-tuning open-source models. It offers H100 and A100 clusters with high-bandwidth networking optimized for distributed training runs, serving as both a GPU cloud provider and an inference platform. Teams use Together AI compute to run multi-node training jobs on Llama and Mistral variants.
by SambaNova Systems
SambaNova Systems builds custom AI hardware (Reconfigurable Dataflow Units) and offers cloud inference via SambaNova Cloud. It delivers some of the highest throughput speeds for large models including Llama 3 and Meta's frontier releases, targeting enterprises that need predictable, high-throughput inference at scale.
by RunPod
RunPod is a community-driven GPU cloud marketplace offering some of the lowest per-hour prices for NVIDIA and AMD GPUs. It enables developers to rent GPU compute from a distributed network of data centers and deploy containerized workloads instantly. RunPod supports serverless GPU endpoints, making it popular for open-source model inference.
by Mistral AI
Mistral AI is a French AI company known for publishing high-efficiency open-weight models alongside its commercial API offerings. The Mistral and Mixtral model families deliver strong benchmark performance at a fraction of the compute cost of larger models. Mistral's La Plateforme API provides access to both open-weight and proprietary models.
by Meta
Meta AI is the open-source AI division of Meta, responsible for the Llama model family. Llama 4 and its variants are released under open weights licenses, enabling local deployment, fine-tuning, and commercial use. Meta provides model weights via Hugging Face and its own download portal, making it the dominant open-weights LLM ecosystem.
by Lambda Labs
Lambda Labs provides cloud GPU instances and on-premises GPU servers targeted at AI researchers and ML engineers. Its Lambda Cloud offers on-demand and reserved NVIDIA H100 and A100 instances at competitive rates with a simple developer-friendly interface. Lambda also sells GPU workstations and servers for local development.
by Google DeepMind
Google DeepMind is the unified AI research division behind the Gemini model family. It offers API access through Google AI Studio and Vertex AI, covering multimodal reasoning, code generation, long-context understanding up to 2M tokens, and tight integration with Google Cloud services. DeepMind also publishes foundational research in reinforcement learning and scientific AI.
by Google Cloud
Google Cloud offers A100, H100, and TPU v5 instances for AI training and inference via Compute Engine and Vertex AI. Google Cloud's TPU pods provide a distinctive advantage for training large models efficiently, while its A3 instances with H100s serve GPU-based training and inference workloads. Deep integration with Vertex AI simplifies the MLOps lifecycle.
by FluidStack
FluidStack aggregates spare GPU capacity from data centers globally, providing an on-demand cloud GPU rental marketplace at competitive rates. It offers H100, A100, and RTX GPU clusters for training and inference with an API-driven provisioning model. FluidStack is used by AI startups for burst compute and cost-efficient long-running training jobs.
by Fireworks AI
Fireworks AI specializes in fast, cost-efficient inference for open-source models including Llama, Mistral, and Mixtral families. It offers serverless and on-demand deployment with a focus on production reliability. Fireworks provides an OpenAI-compatible API and supports compound AI systems through its FireFunction tool-calling models.
by DeepSeek
DeepSeek is a Chinese AI lab that has released competitive open-weight models rivaling frontier closed models at dramatically lower training costs. DeepSeek R1 and V3 demonstrated that mixture-of-experts and reinforcement learning at scale can close the gap with GPT-4-class models. Models are freely available via Hugging Face and a low-cost API.
by CoreWeave
CoreWeave is a specialized cloud infrastructure provider built exclusively for GPU-intensive AI and ML workloads. It offers on-demand and reserved access to NVIDIA H100, A100, and H200 clusters with high-bandwidth InfiniBand networking. CoreWeave is trusted by AI labs and enterprises for large-scale model training and inference at competitive pricing.
by Cohere
Cohere is an enterprise-focused AI company specializing in language models optimized for business applications including search, retrieval-augmented generation, and text classification. Its Command and Embed model families are widely used in enterprise RAG pipelines. Cohere offers private cloud and on-premises deployment options alongside its API.
by Cerebras Systems
Cerebras provides cloud inference powered by its Wafer-Scale Engine (WSE) chip, delivering some of the highest token throughput for large language models. Cerebras Inference serves Llama and other open-weight models with hardware-level advantages that push tokens-per-second beyond what GPU clusters can achieve for certain model sizes.
by Baseten
Baseten is a model inference platform for deploying ML models to production with high performance and reliability. It specializes in low-latency serving of open-source LLMs and diffusion models with features like cascade batching, LoRA serving, and speculative decoding. Baseten targets teams that need production-grade inference without managing Kubernetes.
by Microsoft Azure
Microsoft Azure provides ND H100 v5 and NCv3 GPU instances for AI model training and inference, with tight integration into Azure AI Studio, Azure OpenAI Service, and GitHub Copilot infrastructure. Azure is the preferred cloud for enterprises with Microsoft licensing agreements and provides access to OpenAI models via Azure OpenAI Service.
by Amazon Web Services
Amazon EC2 provides GPU and accelerator instances (P4, P5, G5, Inf2 families) for AI/ML training and inference at any scale. As the largest cloud provider, AWS offers the broadest ecosystem of managed ML services, including SageMaker and Bedrock, alongside Trainium-based Trn1 and Inferentia2-based Inf2 instances. Best for enterprises requiring deep AWS integration and compliance certifications.
by Anthropic
Anthropic is an AI safety company and the creator of the Claude model family. Its API provides access to Claude Opus, Sonnet, and Haiku variants, with strong support for long-context reasoning, tool use, and multi-agent workflows via the Claude Agent SDK. Anthropic publishes extensive safety research and pioneered Constitutional AI alignment techniques.
by Alibaba Cloud
Alibaba Cloud's Qwen team releases the Qwen model series, a family of open-weight and API-accessible language models covering dense and mixture-of-experts architectures. Qwen models are competitive on multilingual and coding benchmarks and are available through Alibaba Cloud's DashScope API as well as Hugging Face for local deployment.
by AI21 Labs
AI21 Labs is an Israeli AI company known for the Jamba model family, which uses a hybrid SSM-Transformer architecture for long-context efficiency. Its Wordtune product targets writing assistance while the API focuses on enterprise NLP tasks. Jamba 1.6 offers a unique balance of long-context window handling and low inference latency.
by Oracle
Oracle AI provides a suite of generative AI services built into Oracle Cloud Infrastructure (OCI), including the OCI Generative AI Service powered by Cohere and Meta models. Oracle has uniquely integrated AI capabilities directly into its database (Oracle Database 23ai), ERP, and industry cloud offerings, targeting enterprises with existing Oracle relationships.
by Princeton NLP
SWE-bench is a benchmark for evaluating AI systems' ability to resolve real GitHub issues from popular Python repositories. Each instance requires understanding a codebase, identifying the bug, and producing a correct patch. SWE-bench Verified is the curated subset accepted as the standard for coding agent evaluation by the AI industry.
by Hugging Face / MTEB Team
MTEB (Massive Text Embedding Benchmark) is the standard benchmark for evaluating text embedding models across 8 task types (retrieval, clustering, classification, etc.), originally spanning 58 datasets and 112 languages. The MTEB leaderboard on Hugging Face is the primary reference for selecting embedding models and is updated continuously as new models are released.
by UC Berkeley
MMLU (Massive Multitask Language Understanding) is a comprehensive benchmark covering 57 academic subjects from elementary to professional level, including STEM, law, medicine, and social sciences. It became the standard for measuring general knowledge breadth in LLMs and is included in virtually every model evaluation suite.
by LiveBench OSS
LiveBench is a contamination-resistant benchmark that continuously updates with new questions sourced from recent math competitions, research papers, and news. By using only data post-dating model training cutoffs, LiveBench mitigates test-set contamination and provides more reliable capability assessments of frontier models.
by OpenAI
HumanEval is OpenAI's code generation benchmark consisting of 164 hand-written Python programming problems with unit tests. It measures a model's ability to generate functionally correct code from docstring descriptions, with correctness judged by whether the generated code passes the unit tests. HumanEval is a foundational coding benchmark that many subsequent code benchmarks build upon.
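HumanEval results are usually reported as pass@k, the probability that at least one of k sampled solutions passes the unit tests. The sketch below implements the unbiased estimator from the HumanEval paper, computed from n samples per problem of which c pass. The function name and the example counts are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for the problem
    c: samples that passed all unit tests
    k: sample budget being evaluated (k <= n)
    """
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples generated, 3 of them pass.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

A benchmark score is the mean of this estimate over all problems.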
by Stanford CRFM
HELM (Holistic Evaluation of Language Models) from Stanford CRFM provides a multi-dimensional evaluation framework that measures LLMs across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. It evaluates models on 42 scenarios and 59 metrics, providing the most comprehensive public assessment of LLM capabilities and risks.
by NYU / Cohere
GPQA Diamond (Graduate-Level Google-Proof Q&A) is a challenging multiple-choice benchmark requiring expert-level knowledge in biology, chemistry, and physics. Questions are designed to be answerable by domain PhD students but not by web search. GPQA Diamond is the standard for measuring frontier scientific reasoning capability.
by LMSys
Chatbot Arena is a crowdsourced human evaluation platform from LMSys where users anonymously compare responses from two random LLMs and vote for the better one. The resulting Elo-based leaderboard (LMSYS Leaderboard) is widely regarded as the most reliable measure of real-world LLM preference across diverse user tasks.
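An Elo-style rating moves after each pairwise vote in proportion to how surprising the result was. The sketch below shows the classic online update for one comparison. The K-factor of 32 and the starting ratings are assumptions for illustration, and the leaderboard's actual fitting procedure may differ from this per-vote update.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b) after one vote.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two models start level at 1000; a user votes for model A.
new_a, new_b = elo_update(1000, 1000, 1.0)
```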
by ARC Prize Foundation
ARC-AGI-2 is the second iteration of François Chollet's Abstraction and Reasoning Corpus benchmark, designed to measure fluid intelligence and generalization in AI systems. Tasks require identifying abstract visual patterns that cannot be solved by memorization, targeting a capability gap that separates current LLMs from human-level reasoning.
by MAA / Community Eval
AIME (American Invitational Mathematics Examination) 2025 is used as a frontier math reasoning benchmark for LLMs. The competition-level math problems require multi-step reasoning without lookup, making AIME scores a direct indicator of a model's mathematical problem-solving depth. Frontier models are evaluated on the 2025 problem set to avoid training data contamination.
by Tsinghua University
1.5M high-quality multi-turn dialogue dataset for instruction fine-tuning.
by EleutherAI
825GB diverse English pretraining corpus from 22 high-quality data sources.
by Princeton NLP
2.3K real GitHub issues requiring AI agents to write and verify code fixes.
by BigCode / HuggingFace
The Stack — 6.4TB permissively licensed source code across 350+ languages.
by Open-Orca
Curated 518K subset of OpenOrca GPT-4 explanations optimized for fine-tuning.
by Together AI
30 trillion token multilingual web dataset with quality annotations for pretraining.
by Together AI
1.2T token open reproduction of the LLaMA training dataset from 7 sources.
by LAION
OpenAssistant human-annotated conversational tree dataset for RLHF and SFT.
by UC Berkeley
Massive Multitask Language Understanding benchmark across 57 academic subjects.
by Community
Self-synthesized alignment dataset of 1M+ instruction pairs from frontier models.
by OpenAI
164 handwritten Python coding problems for evaluating code generation models.
by University of Washington
Commonsense NLI benchmark with adversarially filtered sentence completions.
by OpenAI
8.5K grade school math word problems requiring multi-step reasoning.
by Hugging Face
15T token filtered web dataset from Hugging Face optimized for LLM pretraining.
by Google
Colossal Clean Crawled Corpus of 156B tokens used to train T5 and many others.
by Voyage AI
Voyage AI's voyage-3-large is optimized for retrieval-augmented generation in code and long-form text contexts. It consistently achieves top rankings on MTEB benchmarks, particularly for code search and domain-specific retrieval. Anthropic partners with Voyage AI for recommended embedding use with Claude-based pipelines.
by Snowflake
Snowflake Arctic Embed is a family of open-source text embedding models optimized for enterprise retrieval tasks. The family includes multiple sizes for different performance-latency tradeoffs, all trained on high-quality curated data. Arctic Embed models achieve competitive MTEB retrieval scores and are optimized for integration with Snowflake Cortex.
by Alibaba Cloud
Qwen 3.6 Plus is Alibaba's latest large-scale model in the Qwen 3 series, featuring strong multilingual capabilities across 29+ languages alongside competitive coding and math performance. It uses a mixture-of-experts architecture for efficient inference and is available through DashScope API and as open weights on Hugging Face.
by Alibaba Cloud
Qwen 2.5 Coder is Alibaba's code-specialized model achieving performance comparable to GPT-4o on HumanEval and SWE-bench at 32B parameters. Available as open weights, it supports 92 programming languages and is widely adopted for code completion, debugging, and code review agent workflows.
by OpenAI
o4-mini is OpenAI's compact reasoning model designed for high-volume, cost-sensitive deployments that still require structured logical reasoning. It delivers much of o3's reasoning capability at a fraction of the cost and latency, making it practical for agentic loops, code review, and math tutoring applications.
by Nous Research
Nous Hermes 3 is a fine-tuned Llama-based model by Nous Research, optimized for agentic task following, function calling, and instruction adherence. It is one of the most popular community fine-tunes on Hugging Face for users wanting Llama-class capability with superior instruction compliance and tool use reliability.
by Nomic AI
Nomic Embed v2 is a fully open-source, fully reproducible embedding model with competitive performance on MTEB benchmarks. Nomic AI publishes complete training code, data, and model weights, making it the gold standard for transparency in the embedding model space. It supports 8192-token contexts and is available on Hugging Face.
by Mixedbread AI
mxbai-embed-large from Mixedbread AI is an open-weight embedding model delivering top-tier MTEB retrieval performance at the 335M parameter scale. It uses a novel training approach that removes dataset-specific instructions while maintaining generalization, and is widely adopted via Ollama for local embedding deployments.
by Mistral AI
Mistral Large 3 is Mistral AI's flagship model, offering strong performance on coding, multilingual tasks, and instruction following. It is available both as an open-weight model and through La Plateforme API, supporting function calling and JSON mode. Mistral Large 3 is competitive with GPT-4-class models while offering flexible deployment options.
by Mistral AI
Mistral Embed is Mistral AI's text embedding model designed for semantic search, clustering, and RAG retrieval. It produces 1024-dimensional vectors and is accessible via La Plateforme API. It offers a cost-competitive option for developers already using Mistral models who want a consistent provider for both generation and retrieval.
by Meta
Llama 4 Scout is Meta's efficient multimodal model in the Llama 4 family, targeting deployments where compute cost matters. It supports a long context window and demonstrates strong per-parameter efficiency. Scout is designed for use cases where teams need a capable, locally-runnable model without the resource requirements of Maverick.
by Meta
Llama 4 Maverick is Meta's multimodal mixture-of-experts flagship in the Llama 4 family, combining vision and language understanding with a large active parameter count. It delivers frontier-class performance on reasoning, coding, and multimodal benchmarks as an open-weight model, available for local deployment and fine-tuning by the community.
by Jina AI
Jina Embeddings v3 is a 570M-parameter multilingual embedding model with task-specific LoRA adapters for retrieval, clustering, classification, and text matching. It supports sequences up to 8192 tokens and is available via Jina AI's API and as open weights, making it flexible for both production and self-hosted deployments.
by AI21 Labs
Jamba 1.6 is AI21 Labs' hybrid SSM-Transformer model combining the long-context efficiency of Mamba state-space models with Transformer-style attention. It handles documents up to 256K tokens with low memory overhead, making it well-suited for enterprise document processing and knowledge-intensive tasks. Available as open weights and via AI21's API.
by xAI
Grok 4.3 is xAI's latest model with integrated real-time web access and reasoning capabilities. It is trained on X (Twitter) data alongside broader web corpora, giving it distinctive currency on current events and social sentiment. Grok 4.3 competes on frontier reasoning benchmarks and is available through the xAI API and X Premium.
by OpenAI
GPT-5.5 is OpenAI's advanced reasoning and instruction-following model positioned above GPT-5 in the product lineup. It delivers superior performance on complex multi-step reasoning, coding, and tool-use tasks, offering a balance of frontier capability with practical inference cost. Available through the OpenAI API with function calling and structured output support.
by OpenAI
GPT-5.4 is an OpenAI general-purpose model in the GPT-5 family, optimized for everyday API use cases including chat, summarization, classification, and tool-augmented tasks. It serves as the cost-performance sweet spot in the GPT-5 line, offering strong instruction following with lower latency than the flagship GPT-5.5.
by Google DeepMind
Google's text-embedding-004 is the latest production embedding model from Google, optimized for semantic similarity and retrieval tasks. Available through the Gemini API and Vertex AI, it offers competitive MTEB scores across English tasks and tight integration with Google Cloud's data and AI services.
by Google DeepMind
Gemma 3 27B is Google's largest open-weight model in the Gemma 3 family, competitive with much larger models on reasoning and instruction-following benchmarks. Released under a permissive license, it supports multimodal input and a 128K context window, making it the leading choice for teams needing a locally-runnable multimodal model at the 27B scale.
by Google DeepMind
Gemini 3 Flash is Google's high-speed, cost-efficient multimodal model designed for latency-sensitive applications and high-volume API usage. It maintains strong multimodal capability while delivering response times competitive with the fastest models available. Widely used for real-time chat, structured extraction, and agentic sub-tasks.
by Google DeepMind
Gemini 3.1 Pro is Google DeepMind's flagship multimodal model with native understanding of text, images, audio, video, and code. It supports a 2M-token context window and excels at long-document comprehension, scientific reasoning, and cross-modal tasks. Available through Google AI Studio and Vertex AI with grounding via Google Search.
by Microsoft Research
E5-Mistral-7B is a 7B-parameter embedding model from Microsoft Research that fine-tunes Mistral-7B using the E5 training recipe with synthetic data generation. It achieves state-of-the-art results on MTEB benchmarks, demonstrating that decoder-based LLMs can serve as powerful embedding models through instruction tuning.
by DeepSeek
DeepSeek V3.2 is an updated iteration of DeepSeek's general-purpose model, offering improvements in coding accuracy, instruction following, and multilingual performance. As an open-weight mixture-of-experts model, it provides competitive benchmark numbers against GPT-4-class closed models while remaining freely deployable.
by Nous Research
DeepHermes 3 by Nous Research is a reasoning-augmented open-weight model that combines DeepSeek R1-style chain-of-thought training with Hermes instruction tuning. It supports toggling between standard inference and extended thinking mode, offering one of the first open-source models with controllable reasoning depth for agentic applications.
by Cohere
Command A is Cohere's enterprise-grade instruction-following model optimized for RAG, tool use, and structured business workflows. It supports grounding with external documents and provides reliable JSON outputs for enterprise integrations. Command A is available through Cohere's API and supports private cloud and on-premises deployments.
by Cohere
Cohere embed-v4 is a state-of-the-art multimodal embedding model supporting text and images in 100+ languages. It delivers top MTEB benchmark scores for multilingual retrieval and is optimized for enterprise RAG pipelines with support for int8 and binary quantization for efficient storage and search.
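As a rough illustration of what int8 and binary quantization of embeddings mean for storage and search, here is a minimal sketch in plain Python. It is not Cohere's implementation: the function names, the max-absolute-value scaling rule, and the toy vectors are all assumptions made for the example.

```python
def quantize_int8(vec):
    """Scale floats into [-127, 127] using the vector's max absolute value.

    Assumption: simple per-vector scaling; real systems may calibrate differently.
    """
    scale = max(abs(x) for x in vec) or 1.0
    return [round(x / scale * 127) for x in vec], scale


def quantize_binary(vec):
    """Keep only the sign of each dimension (1 bit per dimension)."""
    return [1 if x > 0 else 0 for x in vec]


def hamming(a, b):
    """Number of differing bits; smaller means more similar."""
    return sum(x != y for x, y in zip(a, b))


# Toy 4-dimensional "embeddings" (hypothetical values).
query = [0.12, -0.40, 0.33, -0.05]
doc_a = [0.10, -0.35, 0.30, -0.01]   # points the same way as the query
doc_b = [-0.20, 0.45, -0.10, 0.25]   # points the opposite way

q_bits = quantize_binary(query)
print(hamming(q_bits, quantize_binary(doc_a)))  # 0
print(hamming(q_bits, quantize_binary(doc_b)))  # 4
```

Binary codes cut storage to one bit per dimension and let search use Hamming distance, at the cost of some ranking precision; int8 keeps more information at one byte per dimension.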
by Anthropic
Claude Sonnet 4.6 is Anthropic's best-value model for coding, analysis, and everyday agent workflows. It delivers near-Opus quality at lower cost and latency, making it the default choice for production AI applications. Sonnet 4.6 excels at software engineering tasks, tool use, and multi-turn conversations with consistent instruction following.
by Anthropic
Claude Opus 4.6 is Anthropic's most capable model, excelling at complex reasoning, nuanced writing, research synthesis, and agentic task execution. It supports a 200K-token context window, advanced tool use, and multi-agent coordination. Opus 4.6 is optimized for high-stakes production workflows where depth and accuracy are paramount.
by BAAI
BGE-M3 from BAAI is a versatile open-source embedding model supporting dense, sparse, and multi-vector retrieval in a single unified model. It handles over 100 languages and long documents up to 8192 tokens, making it a top choice for multilingual and hybrid retrieval tasks. Available on Hugging Face for local and cloud deployment.
by Amazon
Amazon's general-purpose foundation model for text generation, summarization, and conversational AI. Offers competitive performance with strong data privacy guarantees, as AWS does not use customer data to train Titan models.
by Stability AI
StableLM 3B is Stability AI's compact 3 billion parameter language model trained on 1 trillion tokens and designed for commercial use on edge and consumer hardware. It supports long context windows relative to its size and achieves strong performance on reasoning and instruction-following benchmarks for its parameter count.
by PixArt-alpha (HKUST / Huawei Noah's Ark Lab)
PixArt-Σ is a high-efficiency text-to-image diffusion transformer model that achieves competitive quality with models 10× larger by leveraging a novel weak-to-strong training strategy and high-quality data curation. It supports 4K resolution output and is fully open-source, making it an attractive choice for researchers and developers seeking quality without compute overhead.
by Amazon Web Services
Cohere's Command and Embed models deployed as dedicated SageMaker endpoints for real-time inference with guaranteed throughput. Available through AWS Marketplace as JumpStart models, supporting VPC isolation, auto-scaling, and A/B testing. Preferred for enterprises requiring dedicated capacity and AWS billing consolidation.
by Google DeepMind
Scaling inference compute via verifiers and search improves reasoning without training.
by OpenAI
Kaplan et al. power-law relationships between model size, data, compute, and loss.
by Princeton / Google
Interleaved reasoning and acting pattern enabling LLMs to use tools iteratively.
by Mistral AI
Mistral's sparse Mixture-of-Experts model matching GPT-3.5 at a fraction of the cost.
by OpenAI
RLHF fine-tuning showing alignment with human instructions beats raw scaling.
by OpenAI
OpenAI technical report on GPT-4's multimodal capabilities and safety evaluations.
by Stanford
Direct Preference Optimization aligning LLMs from preferences without a reward model.
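To make the "no reward model" point concrete, the sketch below computes the DPO loss for a single preference pair from sequence log-probabilities under the policy and a frozen reference model. The function name, argument names, and the default beta of 0.1 are illustrative assumptions; a real trainer would compute the log-probabilities from model outputs and average over a batch.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response.
    loss = -log(sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))))
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    z = -logits
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The preference signal enters only through the difference of log-probability ratios, which is why no separately trained reward model is needed.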
by Anthropic
Anthropic's paper on AI self-critique via constitutional principles for harmlessness.
by DeepMind
DeepMind's compute-optimal scaling study showing that model size and training tokens should be scaled in equal proportion as compute grows.
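A back-of-the-envelope version of this result can be sketched as follows. Two assumptions are not stated in this listing: training compute is approximated as C ≈ 6·N·D FLOPs, and the ratio of roughly 20 training tokens per parameter is a commonly quoted rule of thumb. The function and parameter names are illustrative.

```python
import math


def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget into parameter count N and token count D.

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n1, d1 = compute_optimal(1e21)
n2, d2 = compute_optimal(4e21)
# Quadrupling compute doubles both model size and data under these assumptions.
print(n2 / n1, d2 / d1)
```

Under these assumptions both quantities grow as the square root of compute, which is the "scale equally" behavior the entry describes.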
by Google
Bidirectional encoder pre-training establishing the fine-tuning paradigm for NLP.
by Google Brain
Introduced the Transformer architecture that underpins modern LLMs.
by Codeium
Windsurf (by Codeium) is an agentic IDE that introduces Cascade, an AI agent with deep awareness of the developer's actions and codebase state. Cascade can autonomously write code, run commands, fix bugs, and maintain coherent multi-step workflows across files.
by Replit
Replit Agent is a browser-based AI software engineer that builds, deploys, and iterates on full-stack applications in the cloud. It handles environment setup, dependency installation, debugging, and deployment within Replit's managed cloud infrastructure.
by OpenAI
OpenAI Codex is a cloud-based software engineering agent that runs in isolated sandboxes, executing tasks in parallel from natural language instructions. It integrates with GitHub to read repositories, write code, run tests, and create PRs autonomously.
by Amazon
Kiro is Amazon's spec-driven AI IDE that introduces a structured approach to agentic development. It generates product requirements, design documents, and task lists before writing code, combining planning rigor with autonomous implementation and steering capability.
by GitHub
GitHub Copilot is Microsoft's AI coding assistant integrated across GitHub, VS Code, and other IDEs. It provides inline completions, chat-based assistance, and an autonomous coding agent mode (Copilot Workspace) for planning and implementing changes across repositories.
by Google
Gemini CLI is Google's open-source terminal-based AI agent powered by Gemini models. It provides interactive and scriptable access to Gemini's capabilities directly from the command line, with tool use, file context, and integration with Google services.
by Amazon
Amazon Q Developer is AWS's AI coding assistant with deep AWS service knowledge. It offers inline completions, chat, security scanning, code transformation, and autonomous agent capabilities for tasks like upgrading Java versions and generating unit tests at scale.
by Paul Gauthier
Aider is an open-source command-line coding assistant that pairs with LLMs to edit code in local git repositories. It supports architect-editor multi-model workflows, automatic git commits, and works with dozens of LLMs including Claude, GPT-4, and local models.
by Snorkel AI
Snorkel AI commercializes weak supervision and programmatic data development research from Stanford AI Lab, enabling teams to build, manage, and iterate on AI training datasets programmatically at scale. Its platform reduces reliance on manual labeling by using labeling functions and foundation model assistance.
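The idea of labeling functions can be sketched in a few lines of plain Python. This is not Snorkel's API: the three heuristics, the label constants, and the majority-vote combiner are hypothetical stand-ins, and production systems typically weight labeling functions by estimated accuracy rather than counting votes.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1


# Each labeling function encodes one noisy heuristic and may abstain.
def lf_contains_link(text):
    return SPAM if "http" in text else ABSTAIN


def lf_short_message(text):
    return HAM if len(text.split()) < 5 else ABSTAIN


def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_short_message, lf_mentions_prize]


def weak_label(text):
    """Combine labeling-function votes by simple majority (a simplification)."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


print(weak_label("Claim your prize at http://example.com now please"))
print(weak_label("hi there"))
```

Examples on which every function abstains stay unlabeled, which is where manual labeling or model assistance would still be needed.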
by IBM
IBM Watson, now branded as IBM watsonx, is IBM's enterprise AI platform offering governed, trustworthy AI for regulated industries. The watsonx.ai studio, watsonx.data lakehouse, and watsonx.governance suite provide a complete enterprise AI development and deployment pipeline with strong emphasis on explainability, fairness, and compliance for sectors like finance, healthcare, and government.
by Schwartz et al. / AI2 / University of Washington
GreenAI Benchmark evaluates the efficiency of AI training and inference by reporting accuracy alongside FLOPs, parameters, and CO2 emissions. It promotes the efficiency metric paradigm where reporting results without computational cost is considered incomplete science.
by Meta AI
Voicebox is Meta AI's generative speech model based on non-autoregressive flow matching that achieves state-of-the-art performance on text-to-speech, noise removal, content editing, and style transfer tasks through a unified in-context learning approach. Its flow-matching architecture allows it to generalize to new voices and styles without fine-tuning, setting a new paradigm for zero-shot speech synthesis.
by PKU-YuanLab
Video-LLaVA is an open video-language model that extends the LLaVA architecture with temporal video understanding capabilities, enabling detailed question answering and reasoning over video content. It achieves strong performance on video QA benchmarks by aligning visual features from both images and videos into a shared representation space.
by Reka AI
Reka Core 2 is Reka AI's second-generation flagship multimodal model, offering improved performance on text, image, video, and audio understanding tasks. It competes with top-tier frontier models while maintaining a strong focus on enterprise reliability and API accessibility.
by Zihan Liu et al. (Academic)
PMC-LLaMA is an open-source medical language model built by fine-tuning LLaMA on 4.8 million PubMed Central biomedical papers and medical textbooks. It demonstrates competitive performance on medical QA benchmarks and is widely used in academic medical NLP research.
by Physical Intelligence
Pi0 (π0) is Physical Intelligence's general-purpose robot foundation model that uses a flow matching action expert architecture to handle diverse dexterous manipulation tasks across multiple robot platforms. It demonstrates unprecedented dexterity on tasks like folding laundry and assembling boxes, representing a major step toward general-purpose robot intelligence.
by Amazon
Amazon's fastest and most cost-effective text-only model in the Nova family, optimized for lowest latency responses. Provides strong text summarization, translation, and classification capabilities at a fraction of the cost of larger models.
by Alibaba AIDC
Marco-o1 is Alibaba International Digital Commerce's open-source reasoning model that extends o1-style step-by-step reasoning to multilingual settings, with particular strength in Chinese, English, and other languages. It combines Monte Carlo Tree Search with chain-of-thought prompting to improve reasoning on complex tasks requiring extended deliberation.
by deepset
Haystack DocumentStore integration for Vespa, Yahoo's open-source big-data serving engine. Combines Vespa's multi-stage ranking, approximate nearest neighbor search, and real-time indexing with Haystack's RAG pipeline builder. Supports BM25 + dense hybrid retrieval at web scale.
by Fireworks AI
Integration between Fireworks AI's model platform and the vLLM inference engine for on-premises or self-hosted deployment of Fireworks-optimized models. Fireworks packages FireOptimizer-quantized models in formats directly compatible with vLLM's OpenAI-compatible server, enabling enterprise teams to run Fireworks-quality inference on their own GPU infrastructure.
by Community
Enables computation on encrypted data so that ML inference or training can be performed without decrypting sensitive inputs, providing cryptographic confidentiality guarantees. Emerging technique for privacy-preserving AI inference in regulated industries such as healthcare and finance.
by AaaS
Runs large language model inference across multiple GPUs or nodes using tensor parallelism, pipeline parallelism, or expert parallelism. Covers distributed serving frameworks, inter-node communication, load balancing, and fault tolerance for enterprise-scale deployments.
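Tensor parallelism, one of the strategies named above, can be illustrated with a column-split matrix multiply. The sketch below simulates devices with plain Python lists; the function names are illustrative, and a real deployment would use a collective-communication library for the gather step.

```python
def matmul(x, w):
    """Multiply x (n x k) by w (k x m), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def split_columns(w, parts):
    """Split the columns of w into `parts` equal contiguous shards."""
    m = len(w[0])
    assert m % parts == 0, "column count must divide evenly across devices"
    step = m // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x, w, parts):
    """Each simulated device multiplies x by its own column shard of w."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # "All-gather": concatenate the partial outputs along the column axis.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, parts=2))  # [[1, 2, 4, 7], [3, 4, 10, 15]]
```

Each device holds only its shard of the weights, so memory per device shrinks with the number of devices while the gathered result matches the single-device product.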
by Rabbit Inc.
Dedicated AI hardware device powered by a Large Action Model (LAM) that learns and executes tasks across mobile apps and services. Operates through a push-to-talk interface with a camera for visual understanding, eliminating the need for traditional app-based workflows.
by Cerebras
Cerebras Wafer Scale Engine 3 — the world's largest chip, spanning an entire silicon wafer. Contains 4 trillion transistors and 44GB of on-chip SRAM, eliminating off-chip memory bandwidth as a bottleneck for training large neural networks.
by Zhipu AI
Zhipu AI is a Chinese AI company spun out of Tsinghua University's KEG Lab, known for the GLM (General Language Model) series. Its ChatGLM models were among the first high-quality open Chinese language models and have been widely adopted in Chinese industry and research communities.
by SambaNova Systems
SambaNova Systems builds reconfigurable AI hardware and software solutions optimized for enterprise-scale LLM training and inference, offering its Samba-1 model and SambaNova Cloud API as commercial services. The company's Reconfigurable Dataflow Unit (RDU) architecture is designed specifically for deep learning workloads.
by PathAI
PathAI develops AI-powered pathology solutions that enable more accurate cancer diagnosis, biomarker assessment, and drug development support by analyzing histopathology images at scale. Its AISight platform is deployed in clinical laboratories and pharmaceutical research, improving diagnostic consistency and accelerating oncology trials.
by Adept AI
Adept AI builds AI systems that can take actions in software to complete complex multi-step workflows on behalf of users. The company focuses on general-purpose action models trained to interact with real-world software interfaces through browser and desktop automation.
by OpenAI
Point-E is OpenAI's pioneering text-to-3D model that generates 3D point clouds from text prompts in seconds by first generating a 2D image and then lifting it to 3D, establishing a fast baseline for open-source 3D generation. While its geometric quality is limited compared to newer systems, its speed and simplicity made it a widely used research reference and starting point for subsequent 3D generation work.
by Apple
OpenELM is Apple's open-source family of efficient language models that uses a layer-wise scaling strategy to distribute parameters more effectively across transformer layers. The 1B variant achieves strong performance relative to its compute budget and is designed for on-device inference on Apple Silicon and mobile devices.
by Log10
Log10 provides zero-configuration auto-logging for OpenAI API calls through a context manager that intercepts completions and stores full request/response pairs with automatic tagging. The integration supports user feedback collection, few-shot prompt organization, and GDPR-compliant data masking for PII in logged payloads.
by Chunkr / Zilliz
Pairs Chunkr's semantic chunking service with Milvus's high-performance vector database for production-scale RAG. Chunkr splits documents using structure-aware boundaries and Milvus stores the resulting dense vectors with ANN indexing for sub-millisecond retrieval.
by AaaS
Configures Agent-to-Agent (A2A) communication infrastructure with message routing, capability discovery, and protocol compliance. Sets up agent registries, message queues, and typed message schemas for reliable inter-agent collaboration.
by Lepton AI
Lepton AI provides a serverless cloud platform for running open-source AI models and custom workloads with a Pythonic SDK, eliminating infrastructure management overhead for ML teams. Founded by ex-Meta researchers, the platform supports fine-tuning, deployment, and monitoring of models with pay-per-use pricing.
by Stability AI
Stability AI's compact open-source language model available in 1.6B and 12B parameter variants. Trained on multilingual data spanning seven languages, offering efficient text generation suitable for resource-constrained deployments and edge computing.
by Alibaba DAMO Academy
SeaLLM-v3 is a multilingual large language model specialized for Southeast Asian languages including Thai, Vietnamese, Indonesian, Malay, Burmese, Khmer, Lao, and Tagalog. Developed by DAMO Academy, it provides culturally aware and linguistically accurate generation for the SEA region, addressing a significant gap in multilingual AI coverage.
by Reka AI
Reka Flash is Reka AI's mid-tier multimodal model optimized for speed and cost-efficiency across text, image, and video inputs. It delivers strong multimodal capabilities at a fraction of the cost of top-tier models, making it suitable for high-throughput production applications.
by Zilliz
Connector linking Zilliz Cloud (managed Milvus) with Apache Spark for large-scale batch embedding ingestion and vector ETL pipelines. Enables parallel document embedding across Spark executors with direct write to Zilliz collections, supporting data lake to vector store pipelines at petabyte scale.
by LiteLLM
LiteLLM proxy integration for Cerebras Inference, enabling Cerebras's wafer-scale chip throughput to be accessed via a unified OpenAI-compatible gateway. Allows developers to route requests to Cerebras's CS-3 hardware — delivering over 2000 tokens/second on Llama 3.1 70B — from any existing OpenAI SDK integration through LiteLLM's model aliases.
by AaaS
Implements the RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) indexing strategy. Recursively clusters and summarizes document chunks into a tree hierarchy, enabling retrieval at multiple levels of abstraction for both detailed and high-level queries.
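The recursive cluster-and-summarize loop can be sketched as below. Both helpers are hypothetical stand-ins: `cluster` simply groups adjacent nodes where a real pipeline would cluster embeddings, and `summarize` truncates text where a real pipeline would call a language model.

```python
def summarize(texts):
    """Stand-in summarizer: keep the first three words of each text."""
    return " / ".join(" ".join(t.split()[:3]) for t in texts)


def cluster(nodes, size=2):
    """Stand-in clustering: group adjacent nodes instead of using embeddings."""
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


def build_tree(chunks, size=2):
    """Return tree levels, from leaf chunks up to a single root summary."""
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        groups = cluster(levels[-1], size)
        levels.append([summarize(group) for group in groups])
    return levels


chunks = ["a b c d", "e f g h", "i j k l", "m n o p"]
levels = build_tree(chunks)
print([len(level) for level in levels])  # [4, 2, 1]

# Retrieval can then search nodes from every level at once.
all_nodes = [node for level in levels for node in level]
```

Indexing all levels together is what lets a detailed query match a leaf chunk while a broad query matches a higher-level summary.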
by SambaNova
SambaNova's Reconfigurable Dataflow Unit with a three-tier memory hierarchy: on-chip scratchpad, on-package HBM, and off-package DRAM. The unique architecture enables running multiple models simultaneously and excels at efficient mixture-of-experts inference.
by NVIDIA
NVIDIA Pascal architecture GPU and the first to use HBM2 memory in a data center product. Delivered 10x deep learning performance over its predecessor and was the primary platform for training early deep learning models before the Volta generation.
by 01.AI
01.AI is a Chinese AI startup founded by Kai-Fu Lee, creator of the Yi series of bilingual large language models. Yi models are released as open weights under permissive licenses and have demonstrated strong performance on multilingual benchmarks, positioning 01.AI as a key contributor to the open-source AI ecosystem.
by WizardLM Team
WizardLM's code-focused model fine-tuned from StarCoder using Evol-Instruct methodology for complex code generation. Achieves strong performance on HumanEval by evolving instruction complexity during training.
by RWKV Foundation
RWKV-5 (Eagle) is a 1.5 billion parameter model from the RWKV architecture family that combines the parallelism of transformers during training with the efficiency of RNNs during inference. It achieves linear time and constant memory complexity, making it exceptionally efficient for long-context tasks and edge deployment without the quadratic cost of attention mechanisms.
by Recursion Pharmaceuticals
Recursion Pharmaceuticals is a clinical-stage techbio company that combines automated biology, large-scale imaging, and machine learning to industrialize drug discovery, operating one of the largest biological datasets in the industry. Its Recursion OS platform maps biological relationships at unprecedented scale to identify novel therapeutic targets and drug candidates.
by Qualcomm
Qualcomm's data center AI inference accelerator designed for power-efficient deployment. Based on the same AI architecture as Snapdragon, it delivers competitive inference performance with a focus on power efficiency metrics (TOPS/W) for hyperscale deployments.
by Insilico Medicine
Insilico Medicine is an AI-driven drug discovery company that has become the first to advance an AI-designed small molecule into Phase II clinical trials, demonstrating end-to-end AI-powered drug development from target identification through IND. Its Chemistry42 and PandaOmics platforms generatively design and screen drug candidates.
by 01.AI
01.AI's speed-optimized frontier model balancing top-tier reasoning with rapid inference throughput. Designed for production workloads requiring both high quality and low latency at competitive pricing.
by Allen Institute for AI (AI2)
AI2's fully open language model with released training data, code, weights, and evaluation framework. Designed for maximum scientific reproducibility with the Dolma dataset and comprehensive documentation of training decisions.
by Turbopuffer
Integration connecting Turbopuffer's serverless vector database with Vercel's deployment platform. Turbopuffer stores vectors on object storage with sub-100ms cold query latency, making it viable for Vercel serverless functions and Edge Runtime. Zero infrastructure management for full-stack AI apps on Vercel.
by Inflection AI
Inflection AI was co-founded by Mustafa Suleyman (ex-DeepMind) and Reid Hoffman, initially building the Pi personal AI assistant. After a major leadership transition to Microsoft in 2024, the remaining company pivoted to enterprise AI services, offering its Inflection 3 model and AI consulting for large organizations.
by Baichuan
Baichuan Intelligence is a Chinese AI startup founded by Wang Xiaochuan, a former Sogou CEO, specializing in large language models with applications in healthcare and enterprise workflows. Its Baichuan2 series models are notable for strong Chinese language performance and vertical-specific fine-tuning capabilities.
by Vikhyat Korrapati
Ultra-compact vision-language model designed for edge deployment and resource-constrained environments. Delivers surprisingly strong visual understanding in under 2B parameters, enabling on-device multimodal inference.
by Carnegie Mellon / Princeton
Second-generation selective state space model achieving transformer-competitive quality with linear-time sequence processing. Introduces structured state space duality (SSD) for 2-8x faster training throughput compared to the original Mamba architecture.
by Humane
Wearable AI device that clips to clothing and provides a screenless computing experience via voice, gesture, and laser projection. Integrates with cloud AI services to answer questions, translate languages, identify objects, and manage communications hands-free.
by Google
Google's third-generation TPU featuring liquid cooling to sustain higher clock speeds and 32GB HBM per chip. Doubled compute and memory versus TPU v2, enabling training of BERT, T5, and early large language models. Powered many foundational AI research papers at Google Brain and DeepMind.
by Mozilla
Mozilla AI is a startup launched by the Mozilla Foundation to build open, trustworthy AI tools and advocate for responsible AI development as a counterweight to closed proprietary systems. The organization releases tools like Lumigator (LLM evaluation) and contributes to open-source AI infrastructure aligned with the open web.
by Upstage
Upstage's model created through a depth up-scaling technique that duplicates a pretrained model, removes a block of layers from each copy, and concatenates the remaining layers into one deeper network. Achieves performance comparable to much larger models while maintaining efficient 10.7B parameter inference.
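Depth up-scaling can be pictured as list surgery on a stack of layers. The sketch below assumes a 32-layer base model with 8 layers dropped from each copy; those figures are recalled from the published description of the technique rather than taken from this listing, and the function name is illustrative.

```python
def depth_up_scale(layers, m):
    """Join two copies of a layer stack into one deeper stack.

    Drops the last m layers of the first copy and the first m layers of
    the second copy, then concatenates what remains.
    """
    first = layers[:len(layers) - m]
    second = layers[m:]
    return first + second


base = list(range(32))            # layer indices of the base model (assumed depth)
scaled = depth_up_scale(base, 8)  # m = 8 is an assumption
print(len(scaled))                # 48
```

The up-scaled stack reuses pretrained weights throughout, so continued pretraining starts from a much better point than a randomly initialized network of the same depth.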
by Figure AI
Figure AI is building general-purpose humanoid robots designed to perform physical labor in warehouses, factories, and logistics environments, powered by a neural network trained with visual data and language models. Its Figure 02 robot, developed in partnership with BMW and backed by OpenAI, Microsoft, and NVIDIA, is one of the most advanced humanoid platforms commercially deployed.
by RWKV Foundation
Sixth generation of the RWKV architecture combining transformer-level quality with RNN efficiency for linear-time inference. Enables constant memory usage regardless of sequence length, making it ideal for resource-constrained and streaming applications.
by Graphcore
Graphcore's Bow Intelligence Processing Unit using 3D wafer-on-wafer technology. Features a massively parallel MIMD architecture with 1472 processor cores and 900MB on-chip SRAM, designed for graph-structured AI workloads and sparse computation.
by Meta FAIR
MegalodonLM is a research model from Meta FAIR introducing the Megalodon architecture, a novel efficient sequence modeling approach that achieves Transformer-level quality with linear complexity scaling. It represents a fundamental architectural advance in long-context language modeling without quadratic attention costs.
by Tenstorrent
Tenstorrent's first commercial AI accelerator co-designed by Jim Keller. Built on a RISC-V Tensix processor architecture with a mesh NoC, enabling programmable AI compute. Notable for its open software stack and developer-friendly approach to AI hardware.
by Google
Google's second-generation TPU and the first available on Google Cloud. Added training capability (v1 was inference-only), HBM memory for gradient storage, and introduced the concept of TPU Pods — interconnected multi-chip systems enabling distributed training at scale.
by NVIDIA
NVIDIA Kepler-based dual-GPU data center card that became the first widely available cloud GPU for deep learning. Google Colab's original free tier ran on K80s, making it instrumental in democratizing access to GPU-accelerated deep learning for researchers and students worldwide.
by Google
Google's first Tensor Processing Unit — the seminal custom AI ASIC that launched the modern era of purpose-built ML hardware. Deployed in 2015 and described publicly in a landmark 2017 ISCA paper, it ran inference for Google Search, Maps, and Translate, delivering 30x performance-per-watt vs contemporary GPUs.
by Graphcore
Graphcore's second-generation Colossus GC200 Intelligence Processing Unit. Featured 1472 IPU-Cores with 900MB on-chip SRAM and used the Bulk Synchronous Parallel (BSP) execution model. Preceded the Bow IPU and established Graphcore's approach to graph-native, SRAM-centric AI compute.
by
AI compute provider with wafer-scale chips delivering record-breaking inference speeds for LLMs.
by Databricks
A human-generated instruction-following dataset of 15,000 high-quality prompt/response pairs, created by Databricks employees. It's designed to be used for fine-tuning large language models to follow instructions without relying on proprietary data.
by LAION, OpenAssistant Community
A diverse, high-quality, human-generated instruction-following dataset collected by the OpenAssistant project. It contains multi-turn conversations covering a wide range of topics, designed to train helpful and harmless chatbots.
by BigScience
A large-scale multilingual collection of instruction datasets, encompassing tasks like summarization, translation, question answering, and more, across over 40 languages. It's crucial for developing and evaluating instruction-following capabilities in diverse linguistic contexts.
by Meta AI
An advanced instruction-tuned large language model specifically designed for code generation, explanation, and debugging across multiple programming languages. This version offers improved performance and reduced hallucination rates compared to its predecessor.
by OpenAI
An updated version of OpenAI's Whisper model with expanded language support and improved accuracy for speech-to-text transcription.
by Stanford AI Lab
A specialized version of Meta's SAM, fine-tuned on medical imaging datasets to accurately segment anatomical structures and pathologies with minimal prompting.
by Meta AI Community
An instruct-tuned variant of Llama 3 8B, optimized for conversational AI and creative writing, showing significant improvements in coherence and factual accuracy.
by Stability AI
The next iteration of SDXL Turbo, offering real-time image generation with enhanced quality and consistency across diverse styles.
by HuggingFace (BigCode project)
A compact yet highly effective code generation model from the BigCode project, designed for efficiency and integration into development environments for tasks like code completion and bug fixing.
by TinyLlama Project
An exceptionally small yet capable language model, designed for efficiency and deployment on resource-constrained environments. Despite its compact size, it demonstrates surprisingly good performance in conversational tasks, making it ideal for edge computing.
by [unverified]
Mistral AI's flagship model with strong multilingual and coding capabilities.
by [unverified]
Anthropic's most capable model with deep reasoning, extended thinking, and superior coding performance.
by Google
An experimental open recurrent language model from Google, showcasing efficiency and performance with a novel recurrent architecture. It's designed to be memory-efficient and capable of handling long sequences effectively.
by Microsoft
A small, highly capable language model from Microsoft, optimized for performance on resource-constrained devices. Despite its size, Phi-3 Mini delivers impressive reasoning and language understanding, making it ideal for edge computing.
by
ML experiment tracking and model monitoring platform. Integrates with all major training frameworks.
by
Google research showing that prompting LLMs to show reasoning steps dramatically improves performance on complex tasks.
by
Paper introducing the ReAct framework for combining reasoning and acting in language model agents.
by SecureMind Solutions
An autonomous cybersecurity agent that continuously monitors networks, detects emerging threats, analyzes anomalies, and initiates automated responses to protect digital assets.
by DevGenius Labs
An autonomous coding agent that can understand complex requirements, generate code, test, debug, and deploy software across various programming languages and platforms.
by Microsoft
Microsoft's AutoGen is a framework for building multi-agent conversations and systems that makes it easier to create complex, collaborative AI workflows. This entry tracks its ongoing development and recent updates as its capabilities continue to expand.
by LangChain
Continuous updates and new integrations within the LangChain framework that give developers a wider array of tools and improved orchestration patterns for building more robust and capable AI agents. The framework trends consistently in the developer community.
by [unverified]
Mathematics benchmark testing advanced problem-solving from algebra to competition mathematics.
by BioCompute AI
A compact yet effective model for accelerated protein structure prediction, leveraging recent advancements in AI for bioinformatics. Ideal for initial screening and rapid hypothesis generation in drug discovery.
by Anthropic
A small, efficient model specifically designed for red-teaming and safety alignment of larger language models, identifying harmful outputs.
by SonicCraft Studios
A creative voice agent specializing in generating original spoken word content, from podcasts to audiobooks, with customizable voices and styles.
by Vocalix Technologies
A next-generation voice agent framework for building highly conversational and context-aware AI assistants across various platforms.
by Cognito Systems
A multi-agent orchestration platform that allows users to design, deploy, and monitor complex workflows involving multiple specialized AI agents.
by EmpathicAI
A voice agent platform specializing in empathetic and context-aware customer service interactions, reducing call volumes and improving satisfaction.
by FlowCode Labs
An intelligent coding agent that automates software development workflows from ideation to deployment, offering code generation, testing, and CI/CD integration.
by Unified Perception Inc.
A multi-modal AI agent capable of processing and generating content across various data types, including text, image, audio, and video, for comprehensive understanding.
by Scholarly AI Labs
An AI agent designed to revolutionize academic and scientific research by autonomously searching, synthesizing, and summarizing vast amounts of literature and data.
by Cognitive Nexus
A robust framework and platform for building, deploying, and managing sophisticated multi-agent systems, facilitating complex problem-solving through collaborative AI.
by EfficiencAI
An AI-powered productivity agent that intelligently manages tasks, schedules, communications, and workflows, adapting to user preferences to maximize efficiency.
by Sonic Intelligence
A state-of-the-art voice AI agent offering highly natural, context-aware, and emotionally intelligent conversational experiences for various applications.
by Community-driven
An open-source project aiming to replicate and expand upon the capabilities of Devin AI, providing a coding AI agent that can autonomously plan, execute, and debug code to solve complex engineering tasks, with active community contributions and recent progress.
by OpenAI
Leverages OpenAI's latest multimodal model, GPT-4o, to enable highly natural and real-time voice interactions for AI agents, including understanding nuanced emotions and responding with expressive speech, gaining significant developer traction since its announcement.
by [unverified]
Automated benchmark derived from Chatbot Arena for evaluating instruction-following and open-ended generation.
by ScoutLogic Corp.
An enterprise-grade browser agent for automated data collection and analysis from public web sources, ensuring compliance and scalability.
by PixelCraft AI
An AI agent that assists in game development by generating game assets, narratives, character behaviors, and optimizing game mechanics based on design principles.
by Google
Google's new multimodal AI agent initiative, showcased at Google I/O, aiming for a universal AI agent that can understand and interact with the world through vision, speech, and text in real-time. The impressive demonstrations have garnered significant attention.
by Intel
Intel Nervana Neural Network Processor for Training — Intel's attempt at a purpose-built AI training chip following the 2016 acquisition of Nervana Systems. Featured 32GB HBM2 and a novel MCDRAM+HBM architecture. Discontinued in 2020 as Intel pivoted focus to the Habana Gaudi line.
by Google DeepMind
An embodied AI model integrating visual perception with language understanding to enable complex robotic task planning and execution. Designed for general-purpose robotic manipulation in unstructured environments.
by AutoFlow Personal
A personal browser agent that learns user habits to automate repetitive online tasks, from managing emails to booking appointments and comparing prices.
by Browser Use
A platform for building and deploying AI agents that can interact with web browsers to perform tasks. It allows agents to navigate websites, click elements, fill forms, and extract information like a human user.
by
Massive multitask language understanding benchmark for evaluating LLM knowledge across 57 subjects.
by DebugAI Solutions
An AI-powered debugging agent that automatically identifies, diagnoses, and suggests fixes for code errors across multiple programming languages.
by
Anthropic paper on training AI systems with a set of principles (constitution) for harmlessness and helpfulness.
by [unverified]
A development platform for debugging, testing, evaluating, and monitoring LLM applications and agents, built by LangChain. It provides visibility into agent traces, helps identify issues, and facilitates iterative improvement of AI systems.
by [unverified]
A library for building stateful, multi-actor applications with LLMs, built on top of LangChain. It allows for the creation of cyclic graphs, enabling more complex and resilient agent behaviors including self-correction and human-in-the-loop processes.
by
ML experiment tracking and model monitoring platform. Integrates with all major training frameworks.
by [unverified]
Enterprise AI Education SaaS Platform with AI Tutor, RAG, Voice Learning Assistant, Personalized Study Planner, Learning Analytics, LMS Architecture, and Role-Based Authentication.
by
AI compute provider with wafer-scale chips delivering record-breaking inference speeds for LLMs.
by [unverified]
Real-time cognitive load assessment is essential for adaptive human-computer interaction but remains challenging due to limited labeled data and poor cross-subject generalization. Recent ECG foundation models pre-trained on millions of clinical recordings offer rich representations, but cannot be...
by [unverified]
Real-time cognitive load assessment from eye-tracking signals could enable adaptive human-centered AI in safety-critical applications such as driver vigilance monitoring or automated flight deck assistance, yet two challenges persist: handling frequent data missingness from blink...
by [unverified]
Survival analysis aims to estimate a time-to-event distribution from data with censored observations. Many existing methods either impose structural assumptions on the hazard function or discretize the time axis, which may limit flexibility and introduce approximation errors. We propose the Survi...
by [unverified]
LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid checkpoint and rollback (C/R) of the complete sandbox state, including files and process state (e.g., memory, contexts, etc.). Existing mechanisms duplicate th...
by [unverified]
Large language model (LLM)-based multi-agent systems increasingly rely on intermediate communication to coordinate complex tasks. While most existing systems communicate through natural language, recent work shows that latent communication, particularly through transformer key-value (KV) caches, ...
by [unverified]
Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations....
by [unverified]
Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all confine evolution to text-mutable artifacts -- skill file...
by [unverified]
We propose and analyze a conservative drifting method for one-step generative modeling. The method replaces the original displacement-based drifting velocity by a kernel density estimator (KDE)-gradient velocity, namely the difference of the kernel-smoothed data score and the kernel-smoothed mode...
by [unverified]
Robustness, domain adaptation, photometric and occlusion invariance, compositional generalisation, temporal robustness, alignment safety, and classical anisotropic regularisation are usually treated as separate problems with separate method families. This paper argues that much of their shared st...
by [unverified]
Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-speci...
by [unverified]
The ability of generative AI (GenAI) methods to photorealistically alter camera images has raised awareness about the authenticity of images shared online. Interestingly, images captured directly by our cameras are considered authentic and faithful. However, with the increasing integration of dee...
by [unverified]
Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction approaches have limitations: (1) closed-domain algorithms are restricted to predefined event types an...
by [unverified]
Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitorin...
by [unverified]
Maintaining instantaneous balance between electricity supply and demand is critical for reliability and grid stability. System operators achieve this by solving the Unit Commitment (UC) task, a high-dimensional, large-scale Mixed-integer Linear Programming (MILP) problem that is strictly...
by [unverified]
This paper introduces a new paradigm for AI game programming, leveraging large language models (LLMs) to extend and operationalize Claude Shannon's taxonomy of game-playing machines. Central to this paradigm is Nemobot, an interactive agentic engineering environment that enables users to create, ...
by [unverified]
As model sizes continue to grow, parameter-efficient fine-tuning has emerged as a powerful alternative to full fine-tuning. While LoRA is widely adopted among these methods, recent research has explored vector-based adaptation methods due to their extreme parameter efficiency. However, these meth...
by [unverified]
Deep-learning video super-resolution has progressed rapidly, but climate applications typically super-resolve (increase resolution) either space or time, and joint spatiotemporal models are often designed for a single pair of super-resolution (SR) factors (upscaling spatial and temporal ratio bet...
by [unverified]
Scientific workflow systems automate execution -- scheduling, fault tolerance, resource management -- but not the semantic translation that precedes it. Scientists still manually convert research questions into workflow specifications, a task requiring both domain knowledge and infrastructure exp...
by [unverified]
Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone ...
by [unverified]
How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as ...
by [unverified]
AI-assisted curation and organization for large media datasets
by
A diverse, high-quality, human-generated instruction-following dataset collected by the OpenAssistant project. It contains multi-turn conversations covering a wide range of topics, designed to train helpful and harmless chatbots.
by
A large-scale multilingual collection of instruction datasets, encompassing tasks like summarization, translation, question answering, and more, across over 40 languages. It's crucial for developing and evaluating instruction-following capabilities in diverse linguistic contexts.
by
Crowdsourced benchmark for evaluating LLMs through pairwise comparisons by human judges.
by
A comprehensive benchmark designed to measure an AI model's knowledge across 57 subjects, ranging from humanities to STEM. It assesses a model's understanding and reasoning capabilities in a zero-shot or few-shot setting, crucial for evaluating general intelligence.
by [unverified]
HuggingFace model (general). Tags: region:us
by [unverified]
HuggingFace model (general). Tags: safetensors, gpt2, region:us
by [unverified]
HuggingFace model (general). Tags: region:us
by [unverified]
HuggingFace model (general). Tags: pytorch, transformer, language-model, long-context, agillm
by [unverified]
HuggingFace model (translation). Tags: mlx, safetensors, hunyuan_v1_dense, quantized, translation
by [unverified]
HuggingFace model (general). Tags: region:us
by [unverified]
Clean from-scratch math core for shannon-prime-lattice: KSTE encoder, Friedman sieve, ARM (HRR in CRT cyclotomic ring), CRT NTT primitives, Position-as-Arithmetic.
by [unverified]
RHEL AI Full-Stack Setup Manager - 150+ AI/ML tools with a graphical interface
by [unverified]
🧠 Production-grade LLM Evaluation & Benchmarking Framework — GPT-4, Claude, Gemini, Mistral. Accuracy, latency, cost, hallucination, reasoning metrics.
by [unverified]
A comprehensive interview preparation guide covering all major RAG (Retrieval-Augmented Generation) architectures. 50 questions across 10 types, from Naive RAG to Agentic, Graph, Self-RAG, and beyond. Includes difficulty tags, detailed answers, a cheatsheet, and a decision tree.
by [unverified]
HuggingFace model (general). Tags: region:us
by [unverified]
HuggingFace model (general). Tags: safetensors, region:us
by [unverified]
HuggingFace model (reinforcement-learning). Tags: chess, game-ai, monte-carlo-tree-search, reinforcement-learning, zone-guidance
by [unverified]
HuggingFace model (general). Tags: region:us
by [unverified]
HuggingFace model (general). Tags: region:us
by [unverified]
HuggingFace model (text-to-image). Tags: text-to-image, base_model:Tongyi-MAI/Z-Image, base_model:finetune:Tongyi-MAI/Z-Image, license:apache-2.0, region:us
by [unverified]
HuggingFace model (general). Tags: region:us
by [unverified]
HuggingFace model (text-classification). Tags: transformers, safetensors, xlm-roberta, text-classification, arxiv:1910.09700
by [unverified]
HuggingFace model (general). Tags: tensorboard, region:us
by [unverified]
HuggingFace model (general). Tags: safetensors, region:us
by [unverified]
AI-powered resume analyzer using Google Gemini 2.5 Flash — ATS scoring, skill gap analysis, and structured feedback via prompt engineering | Python · Streamlit
by [unverified]
pytest lab for testing LLMs: RAG eval, red teaming, guardrails, drift monitoring — 14 modules, 142 tests, zero API calls needed
by [unverified]
Self-Healing RAG System using LlamaIndex + Ollama to improve answer reliability and prevent hallucinations.
by [unverified]
AI drug discovery on Apple Silicon. De novo generation + drug repurposing. 100% local, no cloud, fully auditable.
by [unverified]
Multimodal AI studio powered by Qwen3.6-35B-A3B. End-to-end web app exposing visual reasoning, image captioning, and document understanding tools from a single model with side-by-side output across versions.
by [unverified]
A specialized version of Meta's SAM, fine-tuned on medical imaging datasets to accurately segment anatomical structures and pathologies with minimal prompting.
by [unverified]
An exceptionally small yet capable language model, designed for efficiency and deployment on resource-constrained environments. Despite its compact size, it demonstrates surprisingly good performance in conversational tasks, making it ideal for edge computing.
by [unverified]
An experimental open recurrent language model from Google, showcasing efficiency and performance with a novel recurrent architecture. It's designed to be memory-efficient and capable of handling long sequences effectively.
by [unverified]
AI system that translates natural language to code. Powers GitHub Copilot and advanced code generation tasks.
by [unverified]
OpenHermes-2.5-Mistral-7B is a highly popular instruction-tuned variant built on the Mistral-7B base model, known for its exceptional performance in following complex instructions. It leverages a diverse dataset of high-quality instructions to achieve superior conversational abilities.
by [unverified]
A frontier open model with a Mixture-of-Experts (MoE) architecture and 1M context, designed for advanced AI applications.
by [unverified]
A mid-range Mixture-of-Experts (MoE) model with 1M context, available through NVIDIA NIM for optimized deployment.
by [unverified]
Mamba-2-10B is a state-of-the-art State Space Model (SSM) that offers an alternative to transformer architectures, providing linear scaling with sequence length. This makes it highly efficient for processing long contexts and real-time applications.
by [unverified]
Falcon-40B-Instruct is a robust large language model from TII (Technology Innovation Institute), fine-tuned for instruction following. It was one of the first truly open-source models to challenge proprietary models in performance, offering strong general-purpose capabilities.
by [unverified]
Baichuan2-13B-Chat is a high-performing large language model developed by Baichuan Inc., optimized for chat and general-purpose natural language understanding. It offers strong capabilities in Chinese and English, making it a powerful tool for various applications.
by OWASP Foundation
Security standard for AI agent systems (2026).
by
Security standard for AI agent systems (2026).
by
Regulatory framework for AI systems in the EU (Aug 2026).
by
Autonomous agent commerce with crypto-signed mandates.
by [unverified]
YouTube MCP server — transcripts, comments & channel analysis as AI research source
by [unverified]
The package layer for AI agents.
by [unverified]
Fast hybrid code search for agents. Pure Go, drop-in MCP-compatible with semble.
by [unverified]
Repository of the official Mobbin MCP server
by [unverified]
Feed it a recording of yourself singing, get back actual coaching feedback. Works with audio files (mp3, wav, flac, m4a) and video files (mp4, mov).
by [unverified]
AI-powered mental health risk detection and emotional trend analysis using NLP and transformer models
by [unverified]
AI-powered bias detection and mitigation platform using NLP, sentiment analysis, toxicity evaluation, and interactive dashboards.
by [unverified]
Wrap any CLI as a Model Context Protocol (MCP) server for Claude, ChatGPT, Cursor, Gemini and any MCP-compatible client — schema auto-inferred from --help.
by [unverified]
A simple multiuser multiboard task MCP server managed via kanban board interface.
by [unverified]
Plug vas3k.club into your AI — search members, cite posts, pull links right in chat. Remote MCP server on Cloudflare Workers.
by [unverified]
Python client, CLI, and MCP server for the Elegoo Centauri Carbon 3D printer
by [unverified]
DecisionGraph is an engineering decision memory system that captures evidence from GitHub/Slack/Jira, answers “why” questions, and runs pre-change guardrails via CLI/API/MCP.
by [unverified]
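A Claude Code skill that automatically generates bookkeeping ledgers for Korean startups, one-person corporations, freelancers, and sole proprietors. Card statement PDFs and bank CSVs → automatically generated financial statements and CSVs for handoff to a tax accountant. Applies Level 2 sensitive-information masking.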
by [unverified]
Your First LLM-Wiki Conversation Knowledge Base
by [unverified]
An AI-based voice sentiment analysis application that converts speech to text and detects emotions using NLP models, featuring real-time visualization, voice feedback, and an interactive UI.
by [unverified]
Real-time AI system monitoring dashboard with predictive analytics & anomaly detection
by Waggle-AI
Discover, search, invoke, and rate A2A (Agent-to-Agent) protocol agents.
by ya-xyz
Crypto payments for the agent economy — policy-gated vault operations under human control
by drivenbymyai-max
Sputnik X: EU trade 63M+, LV customs, salary, sanctions, EU AI Act, SoulLedger + x402 USDC.
by rubenayla
Search products in physical stores near you. Compare prices, check availability.
by preflyte
DeFi financial intelligence for AI agents — free MCP + paid REST via MPP/USDC.
by picoads
Micro ad network for AI agents.
by perpvue
Query workouts, browse exercises, and explore fitness data from BearTrail
by salexashenko
Dreamtap provides sources of inspiration to your AI to make it more creative.
by ThomsenDrake
21 tools for AI agents to manage BTCPay Server: invoices, payments, stores, webhooks, Lightning
by Arithym-io
Precision math engine for AI agents. 203 exact methods. Zero hallucination.
by api
Pre-execution safety layer for autonomous agent wallets via MCP and x402.
by 3keys
Wish game — every agent wish permanently reshapes a shared world. Free, no key needed.
by ezoic
AI-powered video publishing, channel management, and monetization via open.video
by propertyscoop
US property data: powerlines, noise, crime, hazards, schools and more
by platform
The Uno Platform MCP Server to get up-to-date docs and prompts
by Propbar
UK property research tools - crime stats, schools, demographics, valuations for AI.
by garchi
Manage structured website content in Garchi CMS through MCP.
by kaikhq
Subscription payment platform for Taiwan — manage products, customers, and billing
by jmagar
MCP server for Unraid API — provides tools to interact with an Unraid server's GraphQL API.
by jmagar
MCP server for Overseerr media requests and discovery.
by jmagar
MCP server for self-hosted Gotify push notifications and message management.
by kismet-tech
Semantic hotel search with real-time availability and price comparison for Kismet Travel
by 609NFT
Full Solana DeFi coverage: launchpads, tokens, trades, and wallets, decoded at scale.
by ref-tools
Token-efficient search for coding agents over public and private documentation.
by dgunning
SEC filing intelligence — insider sentiment, material events, financial ratios, disclosure search.
by funn-to
Generate Amazon affiliate sales funnels from product URLs. AI-powered, free.
by draup
Global labour & market data for skills, workforce, planning, stakeholders, jobs, news & profiles
by times-up
Skybridge Example App showcasing a Time's up game with the LLM
by productivity
Skybridge Productivity MCP App Example
by investigation-game
Find out who killed Claude by interrogating the LLM driven suspects...
by everything
Example and test app for ChatGPT and MCP showcasing all the features of the Skybridge framework
by ecommerce
Example MCP App based on the Skybridge framework showing e-commerce products based on a request
by capitals
Example MCP and ChatGPT App based on Skybridge framework showing capitals in a world map
by engageable-tech
Unified analytics MCP server for GA4, Mixpanel, and PostHog.
by agentbazaar
AI Agent Marketplace — Search, execute, chain, and sell 3,000+ agents via MCP. Free, no API key.
by bnmbnmai
36 tools: intel feeds, DeFi, crypto, OSINT, NLP, scraping, proxy. x402 micropayments.
by entia
20 tools: entity lookup, BORME, EU VAT, GLEIF, healthcare registries, economic data. 34 countries.
by prometheus
Knowledge Network for AI Agents and creators: Search, rate, and review programming guides via MCP
by supp
Support you can control from anywhere, for less. 15+ integrations. Solve tickets easily
by codereclaimers
AI support agent platform for small businesses. Query pricing, features, and examples.
by codereclaimers
Query DezignWorks reverse engineering software: features, compatibility, and pricing.
by codereclaimers
Query skills, project history, and availability for CodeReclaimers LLC consulting.
by approval
Approval Studio is an online proofing software for creative teams and everyone working with designs
by bopmarket
AI marketplace: search, buy, sell across Amazon, eBay, AliExpress. 13 tools.
by demo
Reserve a subdomain, upload .zip via MCP, get a link. Static only (Nginx; no PHP, no DBs).
by rally
Tools for Go-to-market teams creating sales materials, product demos, and deal rooms for customers.
by sombra
AI research library. Save, organise and reuse notes and webpages as clean markdown context.
by Pharaoh-so
Codebase knowledge graph — architectural awareness for AI coding agents via MCP
by mcp
Search the web in real time to get trustworthy, source-backed answers.
by jinkoso
Jinko is a travel MCP server that provides hotel search and booking capabilities.
by uldl
Persistent file storage for AI agents via MCP and curl. Upload, download, and version files.
by henry-ships
20+ pay-per-use APIs: image gen, crypto data, email verify, SSL check, web scraping, and more.
by buildinternet
An agent-friendly API for product changelogs. A unified registry via CLI, API, or MCP.
by 0xLighthouse
Public file storage for AI agents. Upload files and get a permalink back.
by jade-pico
Validates AI infra code on real VMs. Self-corrects until it works. No containers, no sandboxes.
by likarika
Manage shared household finances for two people - expenses, budgets, savings, and settlements.
by syncline
AI-powered meeting scheduling with intelligent auto-scheduling for Claude and AI agents.
by hivrich
Connect Claude to your Intervals.icu watch data for fitness, workout review, and plan writing.
by promtlabs
AI prompt library with collections, tags, version history and team collaboration.
by lagunapools
AI sales manager for composite swimming pools — recommendations, pricing, BIM/CAD, dealers
by PlainsightPro
Search the Plainsight playbook and review code against proven best practices.
by begonia
Local SEO: MCP for local business audits with tools and guidance.
by respira
WordPress MCP with native page-builder editing, full-page creation, and WooCommerce tools.
by cenogram
7M+ real estate transactions from Poland's RCN registry. Search, compare, and analyze prices.
by invoker
Deploy forms and sites to live URLs from AI conversations with automatic form capture.
by vivesca
Opinionated MCP server — Pydantic output schemas, 7 domains, built to evolve.
by Robbie1977
MCP server for Drosophila neuroscience data from VirtualFlyBrain
by icloud-calendar-mcp
MCP server for iCloud (Apple) Calendar access via CalDAV
by lightpaperorg
API-first publishing platform. Publish markdown as permanent web pages with quality scoring.
by IO-Aerospace-software-engineering
MCP server for aerospace calculations: orbital mechanics, ephemeris, DSN operations, ...
by thrill2212
Hut-to-hut hiking tours in the Alps with live availability from multiple booking systems.
by Aigen-Protocol
Token safety (27 scam patterns, 6 EVM chains) + AI agent economy with $AIGEN rewards.
by benevolabs
Collaborative whiteboard MCP server — create objects, connectors, C4 diagrams, and manage boards
by aquaview
AQUAVIEW MCP Server - Search and access global oceanographic and environmental datasets.
by paperlink
Document sharing, invoicing, and personal finance platform. 15+ AI tools via OAuth 2.1.
by albermm
AI sales — prospect discovery, ICP scoring, outreach generation.
by albermm
BJJ video analysis — YOLO pose detection + Claude AI technique analysis.
by albermm
AI marketing — content generation, competitor analysis, social publishing.
by albermm
AI analytics — sales analysis, ML forecasting, customer segmentation.
by prelude
Search and rent musical instruments in New Zealand. Pricing, teachers, and FAQs.
by deadpixel
Multi-model AI debates: GPT-4o, Claude, Gemini & 200+ models discuss, then synthesize insight.
by stellagent
Connect e-commerce and marketing data to AI assistants via MCP.
by piiiico
AEO audit: score any website 0-100 for AI visibility. Checks schema, meta, content, AI crawlers.
by RSNCNetwork
Find shopping deals, earn cashback, and redeem rewards across retail, dining, and travel brands.
by digikelly
Search the MOTE AI agent registry for machine-readable businesses.
by fortytwo
Ask high-complexity questions where the best answer is required — coding, hard reasoning, and more.
by pilotso11
Generate word clouds from text with custom fonts, colors, backgrounds, gradients, and shape masks
by flatirontek
Push notifications with personalized sounds - manage and trigger your vybits via MCP
by todoist
Official Todoist MCP server for AI assistants to manage tasks, projects, and workflows.
by sprtxbabe
Agent-native sports token network. 1,435 tokens across 98 sports, 9 global regions.
by kbhave
Professional DNS diagnostics: 14 tools for DNS, DNSSEC, email security, SSL, and propagation.
by singular
Marketing intelligence MCP server providing campaign performance data and analytics tools.
by settro
Public Settro sales MCP tools for missed-call ROI, direct-order recovery, social ordering, and fit.
by nymbo
Remote MCP server: fetch, search, Python, TTS, memory, image, video.
by madanc
AI credit card advisor - search cards, compare portfolios, and optimize rewards
by identifai
Detect AI-generated images, videos, and audio with identifAI's deepfake detection tools.
by gepuro
Search Japanese company database
by securityscan-api
Scan GitHub-hosted AI skills for vulnerabilities: prompt injection, malware, OWASP LLM Top 10.
by agentutil
Sub-cent factual claim verification against live data sources.
by agentutil
Reversibility intelligence — check if actions can be undone before executing them.
by agentutil
Intent security pre-flight checks for autonomous AI agents.
by agentutil
Statistical baselines — check if values are normal or anomalous for a given category.
by agentutil
Situational awareness — holidays, business hours, and platform status for timing-sensitive actions.
by agentsbase
Email for AI agents. Create mailboxes, send/receive emails, and auto-extract verification codes.
by portal
MCP server for El Cheff - point of sale for restaurants
by ai
Capture photos remotely from mobile devices via S3-backed upload URLs
by distro
Agentic press platform for managing publications, writing stories, and browsing content feeds.
by speedof
Official SpeedOf.Me server - Accurate speed tests via 129 global edge servers with analytics.
by remem
Browse Bible verse collections from Remember Me, a free memorization app in 48 languages
by pi22by7
Persistent codebase intelligence that gives AI assistants memory across sessions
by billy-enrizky
AI browser automation. Write async Python to navigate, click, type, and extract data.
by in-context
An interactive portfolio built for AI conversations. Browse work, services, and book calls.
by hikmahtech
Remote MCP server: 10 developer utilities (base64, JWT, DNS, UUID, URL, JSON, UA, IP lookup).
by install
Create guides as MCP servers to instruct coding agents to use your software (library, API, etc).
by UltravioletaDAO
AI agents publish bounties for real-world tasks. Gasless USDC payments via x402.
by app
Hosted MCP gateway for Web3 infra discovery across 20+ networks via one endpoint.
by taskman
Book London furniture assembly, wall mounting, handyman, electrical, and smart home jobs.
by bitcoinintel
Bitcoin 15-min prediction signals with on-chain commitment proofs. L402-gated.
by test5-ca2cc17f
Skybridge demo for QA and playground validation
by template-server-ed7df515
Description of my MCP server
by template-server-8e74ee32
Description of my MCP server
by tasks-example-ca49f0f6
A test MCP server for task management
by staging
An MCP server that provides email capabilities, hosted on Alpic platform
by template-nodejs-d3cd3099
A non functional MCP server (yet)
by template-0f352e26
A non functional MCP server (yet)
by system-prompts-of-ai-987eba72
Description of my MCP server
by skybridge-712c1350
A great map with info on Capitals all around the world
by server-template-3b6131fe
A non functional MCP server (yet)
by send-email-mcp-04d3fdca
Non functional server (yet)
by staging
Super description
by python-email-mcp-f5c8e0a7
Still WiP
by property-search-mcp-ce598409
Work in progress (for now)
by personal-chef-chatgp-c951fcb6
Ok
by staging
Description of my MCP server
by node-email-mcp-a629b9bc
Non functional server (yet)
by node-email-mcp-731b03b7
Non functional server (yet)
by staging
An MCP server that provides email capabilities, hosted on Alpic platform
by node-email-mcp-25ea0992
Still work in progress
by mcp-template-e37c5898
A non functional MCP server (yet)
by staging
Description of my MCP server
by mcp-server-template-99cbcf02
Not ready (yet!)
by mcp-server-template-90a89880
Work in progress
by mcp-server-template-8068f74b
A non functional MCP server (yet)
by mcp-server-template-7b99436a
Server not ready (yet!)
by mcp-server-template-5eacc5c3
A non functional MCP server (yet)
by staging
Description of my MCP server
by mcp-server-template--1488dac2
Non functional server (yet)
by mcp-server-a2d17427
A non functional MCP server (yet)
by mcp-server-6283dc54
A non functional MCP server (yet)
by everything-app-ce82911c
Still WiP
by everything-app-7f22928f
Still WiP
by staging
An MCP server that provides email capabilities, hosted on Alpic platform
by email-mcp-c880a7f7
Still work in progress
by email-mcp-3b5a375e
Still WiP!
by email-mcp-17ee097a
Still WiP
by demo-81c8a122
Description of my MCP server
by alpic-poc-frontend-c68d130f
Description of my MCP server
by staging
Description of my MCP server mega cool capabilities, lesgo
by send-email-49a8d7ee
Description of my MCP server
by qa-test-cli-b955e6a8
Testing publish command
by node-mcp-3a40efc5
Description of my MCP server
by node-email-mcp-b25f10ec
Description of my MCP server
by new-test-cli-239fbe01
Ask a question and get a deterministic Magic 8 Ball answer.
by mcpimmo-10e693d9
Description of my MCP server
by mcp-server-template-d2d1b3b5
Description of my MCP server
by mcp-server-template-7860f7d3
Description of my MCP server
by
Paper introducing the ReAct framework for combining reasoning and acting in language model agents.
by [unverified]
Hermes Agent CN desktop app, Windows-first, built with Tauri, TypeScript, and Rust. Isolated Hermes Agent core inside.
by [unverified]
MCP server (Go) for AI assistants: web search, content extraction, academic/patent/news research. Multi-provider routing, 4-tier scraping, search lenses. Works with Claude, Cursor, and any MCP client.
by [unverified]
Postman for MCP servers — inspect tools, test live, browse 150+ servers
by [unverified]
🤖 Cortex AI: #1 AI Trading Bot for Crypto, Forex & Stocks. Automated Arbitrage Engine v3.4. Works on Solana, TON, Binance & Bybit. Best Free Crypto Trading Bot 2026.
by [unverified]
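玄霖超脑 (Xuanlin Super Brain) · 无量网络 (Wuliang Network) v4 rewrite: a 66-module, 6-channel AI cognitive nervous system. Install once, and all your AI tools share a single brain that never forgets. Cross-session memory · multi-agent interoperability · automatic code review · zero configuration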
by [unverified]
One memory. Every AI tool. Yours to keep. Local-first, MCP-compatible, Apache 2.0.
by [unverified]
W.A.D.E. — A local-first autonomous AI runtime that runs on your hardware, owns your data, and acts without being prompted.
by [unverified]
Organize, assign, and run AI-agent Kanban tasks on macOS with load balancing, batch execution, and OpenAI-compatible modes
by [unverified]
API-first variant triage pipeline combining genomic filtering, annotation, and LLM-driven interpretation for clinical genomics workflows
by [unverified]
Build your genius agent🤔
by [unverified]
Memory hygiene skills for Hermes Agent — dreaming (3-phase consolidation) and lean check (surgical trimmer)
by [unverified]
Skill for generating full-page, hand-drawn-style images for Chinese technical slide decks (PPT) | 21:9 covers + 16:9 body illustrations | PNG output
by [unverified]
Claude Code skill that diagnoses the top 3 gaps in any project — code or not
by [unverified]
开源我自己 ("Open-Source Myself"): a Claude Code skill trained on Flood Sung's entire Zhihu corpus (152 articles + 178 pins + 254 answers). Fork it to open-source yourself.
by [unverified]
A Claude Code skill by Hao (駱君昊) that learns your Facebook voice and auto-posts to FB / IG / Threads / X with a 14-day content calendar. Mega-viral validated: 72K reach / 358 likes / 443 comments / +700 community members on first post. Includes Day 2 flop postmortem.
by [unverified]
ZeusHammer - AI Super Agent with Local Brain, Voice Interaction & Three-Tier Memory
by [unverified]
Soul-driven AI agent with permission-hardened tools, token budgets, and multi-channel access. Runs 24/7 from CLI or Telegram.
by [unverified]
Production Claude Code skills for n8n from a Verified Creator's 100+ workflows
by [unverified]
Building a Production-Grade MCP Server Architecture with a Multi-Agent System
by [unverified]
Bring Your Own Browser — let your AI agent use the Chrome you already have open
by [unverified]
Multi-Agent AI System for PV Solar Simulation with GUI and Cloud LLM Support
by [unverified]
An open-source framework that enables AI agents to truly evolve through the ancient wisdom of Yogacara Buddhism.
by [unverified]
High-performance Cognitive Memory Architecture for AI Agents. Features 4-layer hierarchy, Knowledge Graph, 3D Dashboard, and Hybrid Retrieval.
by [unverified]
An autonomous cybersecurity agent that continuously monitors networks, detects emerging threats, analyzes anomalies, and initiates automated responses to protect digital assets.
by [unverified]
An AI-powered productivity agent that intelligently manages tasks, schedules, communications, and workflows, adapting to user preferences to maximize efficiency.
by [unverified]
An AI agent designed to revolutionize academic and scientific research by autonomously searching, synthesizing, and summarizing vast amounts of literature and data.
by [unverified]
An AI agent that assists in game development by generating game assets, narratives, character behaviors, and optimizing game mechanics based on design principles.
by [unverified]
An autonomous coding agent that can understand complex requirements, generate code, test, debug, and deploy software across various programming languages and platforms.
by [unverified]
A minimalist AI coding agent designed to generate entire codebases from a single prompt, focusing on simplicity and efficiency. Gained recent attention for its lightweight approach.
by [unverified]
An open-source project aiming to replicate and expand upon the capabilities of Devin AI, providing a coding AI agent that can autonomously plan, execute, and debug code to solve complex engineering tasks, with active community contributions and recent progress.
by [unverified]
Recent updates to LlamaIndex's agent framework, focusing on integrating Retrieval Augmented Generation (RAG) more seamlessly with agentic workflows, enabling agents to leverage vast external knowledge bases effectively, with continuous community traction.
by [unverified]
Continuous updates and new integrations within the LangChain framework that empower developers to build more robust and capable AI agents by providing a wider array of tools and improved orchestration patterns, constantly trending in the developer community.
by [unverified]
Google's new multimodal AI agent initiative, showcased at Google I/O, aiming for a universal AI agent that can understand and interact with the world through vision, speech, and text in real-time. The impressive demonstrations have garnered significant attention.
by [unverified]
An AI assistant designed to handle administrative tasks, scheduling, and communication. It learns from user interactions to automate personalized workflows, acting as a personal AI assistant for professionals.
by [unverified]
A platform providing headless browsers as a service, specifically optimized for AI agents and web automation. It allows agents to interact with web pages reliably and at scale, enabling advanced data extraction and task execution.