AI Agents
Autonomous agents, assistants, and multi-agent systems
26 entities indexed
Tool Use
by AaaS
Equips AI agents with the ability to select and use appropriate tools from a defined toolkit to accomplish tasks. Covers tool selection logic, input marshalling, output interpretation, and fallback strategies when tools fail or return unexpected results.
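The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: `call_with_fallback`, `flaky_search`, and `cached_lookup` are hypothetical names, and treating an empty string as an "unexpected result" is an assumption made for the example.

```python
from typing import Callable, Sequence

Tool = Callable[[str], str]


def call_with_fallback(tools: Sequence[Tool], query: str) -> str:
    """Try each tool in priority order and return the first non-empty result."""
    errors = []
    for tool in tools:
        try:
            result = tool(query)
        except Exception as exc:  # a failing tool should not abort the whole task
            errors.append(f"{tool.__name__}: {exc}")
            continue
        if result:  # assumption: an empty result counts as unexpected output
            return result
        errors.append(f"{tool.__name__}: empty result")
    raise RuntimeError("all tools failed: " + "; ".join(errors))


def flaky_search(query: str) -> str:
    """Hypothetical primary tool that always times out."""
    raise TimeoutError("search backend timed out")


def cached_lookup(query: str) -> str:
    """Hypothetical secondary tool used as the fallback."""
    return f"cached answer for {query!r}"


answer = call_with_fallback([flaky_search, cached_lookup], "weather")
```

Here the primary tool raises, so the call falls through to `cached_lookup`; if every tool fails, the collected error messages are surfaced in a single exception.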
Toolformer: Language Models Can Teach Themselves to Use Tools
by Meta AI
Presents Toolformer, a model that learns to use external tools (APIs) in a self-supervised manner without requiring human annotations. The model decides which APIs to call, how to call them, and how to incorporate results, achieving strong performance across diverse tasks while maintaining generative language modeling ability.
Voyager: An Open-Ended Embodied Agent with Large Language Models
by NVIDIA / Caltech / UT Austin
Presents Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager uses an automatic curriculum, an ever-growing skill library of executable code, and an iterative prompting mechanism to overcome failures.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
by Princeton NLP / Princeton Language and Intelligence
Introduces SWE-agent, which defines Agent-Computer Interfaces (ACIs) to enable LLMs to autonomously solve real GitHub issues by browsing codebases, editing files, and running tests. On the SWE-bench benchmark, SWE-agent with GPT-4 Turbo resolves 12.5% of issues, significantly outperforming prior methods.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
by University of Washington / IBM AI Research / Allen AI
Introduces Self-RAG, a framework that trains a single LM to adaptively retrieve passages on demand, generate text, and critique its own outputs using special reflection tokens. Unlike standard RAG, Self-RAG decides when to retrieve and reflects on retrieved passages and generation quality, outperforming ChatGPT and standard RAG on diverse downstream tasks.
Planning
by AaaS
Enables agents to create structured execution plans for multi-step tasks by analyzing goals, identifying sub-tasks, ordering dependencies, and allocating resources. Supports plan revision when steps fail or new information emerges during execution.
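Dependency ordering, one part of the planning step described above, might look like the sketch below. It uses the standard-library `graphlib` module (Python 3.9+); the step names and the `order_steps` helper are illustrative assumptions, and plan revision and resource allocation are not covered.

```python
from graphlib import TopologicalSorter


def order_steps(dependencies: dict[str, set[str]]) -> list[str]:
    """Return the sub-tasks in an order that respects their dependencies.

    `dependencies` maps each step to the set of steps that must finish first.
    graphlib raises CycleError if the plan contains a circular dependency.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical plan: each key depends on the steps listed in its set.
plan = {
    "fetch_data": set(),
    "clean_data": {"fetch_data"},
    "train_model": {"clean_data"},
    "write_report": {"train_model", "clean_data"},
}

ordered = order_steps(plan)
```

Because this example plan forms a single chain, only one valid order exists; plans with independent branches admit several valid orders.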
ToolBench
by Qin et al. / Tsinghua University
ToolBench evaluates LLMs on their ability to use real-world REST APIs to complete user instructions. It provides 16,000+ real APIs from RapidAPI Hub across 49 categories and 12,000+ instruction–API solution pairs, measuring whether models can plan and execute multi-step API call sequences.
Improving Language Models by Retrieving from Trillions of Tokens
by DeepMind
Presents RETRO (Retrieval-Enhanced Transformer), a model that retrieves from a 2-trillion-token database at inference time via chunked cross-attention. RETRO achieves performance comparable to GPT-3 on the Pile with 25× fewer parameters by leveraging retrieved passages, demonstrating that retrieval augmentation is a compute-efficient alternative to scaling.
Web Browsing
by AaaS
Lets autonomous agents interact with the web as a human user would: navigating to URLs, rendering pages (including executing JavaScript), and parsing DOM elements. Supports workflows such as filling out forms, clicking buttons, and extracting structured data for analysis or task completion.
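As a rough illustration of the parsing step only, the sketch below extracts named form fields from static HTML with the standard-library `html.parser` module. It does not navigate, render pages, or execute JavaScript, which a real browsing agent would need a browser engine for; `FormFieldParser` and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect (name, type) pairs for every named <input> element."""

    def __init__(self) -> None:
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if "name" in attributes:  # unnamed inputs (e.g. bare submit buttons) are skipped
            self.fields.append((attributes["name"], attributes.get("type", "text")))


# Hypothetical page content an agent might have fetched.
PAGE = (
    '<form action="/login">'
    '<input name="user">'
    '<input name="pw" type="password">'
    '<input type="submit" value="Go">'
    "</form>"
)

parser = FormFieldParser()
parser.feed(PAGE)
fields = parser.fields
```

An agent could use the extracted field list to decide which values to supply before submitting the form.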
WebArena
by CMU
WebArena is a realistic and reproducible benchmark environment designed to evaluate autonomous language agents. It tests an agent's ability to perform complex, multi-step tasks across a diverse set of self-hosted websites, including e-commerce, forums, and content management systems, using real web interfaces.
Reflection
by AaaS
Allows agents to evaluate their own outputs, identify errors or weaknesses, and iteratively improve responses. Implements self-critique loops where the agent reviews its work against quality criteria and refines until standards are met.
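A self-critique loop of the kind described above could be structured like this. In a real agent the `critique` and `revise` callables would typically be LLM calls; here they are toy stand-ins, and all names (`refine`, `toy_critique`, `toy_revise`, `max_rounds`) are hypothetical.

```python
from typing import Callable, List


def refine(
    draft: str,
    critique: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Critique and revise the draft until no issues remain or rounds run out."""
    for _ in range(max_rounds):
        issues = critique(draft)
        if not issues:  # quality criteria met, stop early
            break
        draft = revise(draft, issues)
    return draft


def toy_critique(text: str) -> List[str]:
    """Stand-in quality check: flags stray whitespace and a missing period."""
    issues = []
    if text != text.strip():
        issues.append("stray whitespace")
    if not text.strip().endswith("."):
        issues.append("missing final period")
    return issues


def toy_revise(text: str, issues: List[str]) -> str:
    """Stand-in reviser: fixes exactly the problems toy_critique can report."""
    text = text.strip()
    if not text.endswith("."):
        text += "."
    return text


final = refine("  the agent finished  ", toy_critique, toy_revise)
```

The `max_rounds` cap matters in practice: without it, a critic that is never satisfied would loop indefinitely.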
Tool Calling Setup
by AaaS
Sets up a tool-calling agent with typed tool definitions, argument validation, error handling, and execution sandboxing. Includes example tools for web search, calculator, file operations, and database queries with a pluggable tool registry.
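A pluggable registry with argument validation might be sketched as below. This is an assumption-laden illustration rather than the template's actual code: `ToolRegistry` and the `add` tool are hypothetical, validation only checks plain class annotations via `inspect.signature`, and execution sandboxing is not shown.

```python
import inspect
from typing import Any, Callable, Dict


class ToolRegistry:
    """Maps tool names to functions and validates arguments before calling."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, func: Callable[..., Any]) -> Callable[..., Any]:
        """Decorator that registers a function under its own name."""
        self._tools[func.__name__] = func
        return func

    def call(self, name: str, arguments: Dict[str, Any]) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        func = self._tools[name]
        signature = inspect.signature(func)
        try:
            bound = signature.bind(**arguments)  # catches missing or extra arguments
        except TypeError as exc:
            raise ValueError(f"bad arguments for {name}: {exc}") from exc
        for arg_name, value in bound.arguments.items():
            expected = signature.parameters[arg_name].annotation
            if expected is inspect.Parameter.empty or not isinstance(expected, type):
                continue  # only plain class annotations are checked in this sketch
            if not isinstance(value, expected):
                raise ValueError(
                    f"{name}.{arg_name} expects {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        return func(*bound.args, **bound.kwargs)


registry = ToolRegistry()


@registry.register
def add(a: int, b: int) -> int:
    """Example calculator tool."""
    return a + b


total = registry.call("add", {"a": 2, "b": 3})
```

Validation errors are raised as `ValueError` so the agent loop can report them back to the model instead of crashing.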
Tool Selection Strategy
by AaaS
Covers heuristics and learned strategies for agents to select the right tool from a large catalog given a task description, including embedding-based tool retrieval, LLM-based routing, and multi-step tool chaining. Teaches fallback hierarchies, tool description engineering, and cost-aware selection to minimize unnecessary API calls.
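Embedding-based tool retrieval can be illustrated with the toy sketch below. To stay self-contained it swaps in bag-of-words count vectors for a learned embedding model, so it only demonstrates the ranking mechanics; the catalog, tool names, and helper functions are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would use a learned model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_tool(task: str, catalog: Dict[str, str]) -> str:
    """Return the tool whose description is most similar to the task."""
    task_vec = embed(task)
    return max(catalog, key=lambda name: cosine(task_vec, embed(catalog[name])))


# Hypothetical catalog mapping tool names to their descriptions.
CATALOG = {
    "calculator": "evaluate arithmetic expressions and compute numbers",
    "web_search": "search the web for pages and news about a topic",
    "file_reader": "read the contents of a local file from disk",
}

chosen = select_tool("search the web for news about rust", CATALOG)
```

Because selection depends entirely on the description text, the wording of each tool description directly affects which tool is retrieved.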
TAU-bench
by Sierra AI
TAU-bench (Tool-Agent-User) evaluates AI agents on realistic customer service scenarios that require multi-step tool use. It tests whether agents can navigate complex workflows, use tools correctly, follow policies, and handle edge cases in the airline and retail domains.
MLAgentBench
by Huang et al. / Stanford
MLAgentBench challenges AI agents to perform machine learning research tasks autonomously: reading papers, writing code, running experiments, analyzing results, and improving models. It tests whether agents can replicate and build upon real ML research across 13 diverse ML tasks.
Multi-Agent Orchestration
by AaaS
Orchestrates multiple specialized AI agents in coordinated workflows with task routing, state management, and result aggregation. Implements supervisor and swarm patterns with configurable agent selection logic and inter-agent communication.
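The supervisor pattern mentioned above might be sketched as follows. The agents are plain functions standing in for LLM-backed workers, and the keyword-based `route` method is a placeholder assumption for whatever configurable selection logic a real system would use; the swarm pattern is not shown.

```python
from typing import Callable, Dict, List, Tuple

Agent = Callable[[str], str]


def research_agent(task: str) -> str:
    """Hypothetical specialist that gathers information."""
    return f"notes on {task}"


def writer_agent(task: str) -> str:
    """Hypothetical specialist that produces text."""
    return f"draft about {task}"


class Supervisor:
    """Routes each task to one agent, records state, and aggregates results."""

    def __init__(self, agents: Dict[str, Agent]) -> None:
        self.agents = agents
        self.state: List[Tuple[str, str, str]] = []  # (agent name, task, result)

    def route(self, task: str) -> str:
        # Placeholder routing rule; a real supervisor might ask an LLM instead.
        return "writer" if task.startswith("write") else "researcher"

    def run(self, tasks: List[str]) -> List[str]:
        for task in tasks:
            name = self.route(task)
            result = self.agents[name](task)
            self.state.append((name, task, result))
        return [result for _, _, result in self.state]


supervisor = Supervisor({"researcher": research_agent, "writer": writer_agent})
results = supervisor.run(["find sources on MCP", "write summary"])
```

Keeping the routing decision in one method makes it easy to swap the selection logic without touching the agents themselves.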
MCP Server Template
by AaaS
Template for building Model Context Protocol (MCP) servers that expose tools, resources, and prompts to MCP-compatible clients. Includes typed tool handlers, resource providers, error handling, and transport configuration for stdio and HTTP modes.
OSWorld
by University of Hong Kong
Benchmark for evaluating multimodal agents on real operating system tasks spanning Ubuntu, Windows, and macOS environments. Tests agents' ability to interact with desktop applications, file systems, terminals, and GUI elements to complete everyday computer tasks.
Agent Monitoring Dashboard
by AaaS
Sets up a monitoring dashboard for AI agent systems tracking task completion rates, error rates, latency, token usage, and cost. Integrates with Prometheus for metrics collection and Grafana for visualization with pre-built alert rules.
Agent Testing Harness
by AaaS
Testing harness for AI agents with mock tool providers, simulated user interactions, and deterministic replay capabilities. Enables unit testing of agent logic, integration testing of tool chains, and end-to-end testing of complete agent workflows.
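A mock tool provider with deterministic replay could look like the sketch below. The class, the `price_agent` function under test, and the recorded response are all hypothetical; the sketch also assumes tool arguments are hashable so they can serve as lookup keys.

```python
from typing import Any, Dict, List, Tuple

CallKey = Tuple[str, Tuple[Tuple[str, Any], ...]]


class MockToolProvider:
    """Replays recorded tool responses and logs every call for assertions."""

    def __init__(self, responses: Dict[CallKey, Any]) -> None:
        self._responses = dict(responses)
        self.calls: List[CallKey] = []

    def call(self, tool: str, **arguments: Any) -> Any:
        # Sorting the arguments makes the key independent of keyword order.
        key = (tool, tuple(sorted(arguments.items())))
        self.calls.append(key)
        if key not in self._responses:
            raise LookupError(f"no recorded response for {key}")
        return self._responses[key]


def price_agent(tools: MockToolProvider, item: str) -> str:
    """Toy agent logic under test: looks up a price and formats a reply."""
    price = tools.call("lookup_price", item=item)
    return f"{item} costs ${price:.2f}"


# Recorded interaction to replay during the test.
recorded = {("lookup_price", (("item", "widget"),)): 4.5}
mock = MockToolProvider(recorded)
reply = price_agent(mock, "widget")
```

Since every response comes from the recording, the test produces the same result on each run and fails loudly if the agent makes a call that was never recorded.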
Adept AI
by Adept AI
Adept AI builds AI systems that can take actions in software to complete complex multi-step workflows on behalf of users. The company focuses on general-purpose action models trained to interact with real-world software interfaces through browser and desktop automation.
LyricLoom
by SonicCraft Studios
A creative voice agent specializing in generating original spoken word content, from podcasts to audiobooks, with customizable voices and styles.
AuraSpeak
by Vocalix Technologies
A next-generation voice agent framework for building highly conversational and context-aware AI assistants across various platforms.
DataScout AI
by ScoutLogic Corp.
An enterprise-grade browser agent for automated data collection and analysis from public web sources, ensuring compliance and scalability.
TaskWeaver
by AutoFlow Personal
A personal browser agent that learns user habits to automate repetitive online tasks, from managing emails to booking appointments and comparing prices.
BugFixer Bot
by DebugAI Solutions
An AI-powered debugging agent that automatically identifies, diagnoses, and suggests fixes for code errors across multiple programming languages.