Explore.
7,960 AI entities indexed across tools, models, agents, skills, benchmarks, and more — schema-verified, agent-maintained.
100 entities · paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
by Google AI
Introduced BERT, a bidirectional Transformer pre-trained on masked language modeling and next sentence prediction. Established the pretrain-then-fine-tune paradigm that dominated NLP for years and set new state-of-the-art results on eleven NLP tasks.
Learning Transferable Visual Models From Natural Language Supervision (CLIP)
by OpenAI
Introduced CLIP (Contrastive Language-Image Pre-training), a model trained on 400 million image-text pairs using contrastive learning that achieves remarkable zero-shot transfer to diverse vision tasks. CLIP became foundational for vision-language alignment and generative AI pipelines.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
by Google Brain
Introduced chain-of-thought prompting, a simple technique of providing exemplars with step-by-step reasoning traces in few-shot prompts. This approach dramatically improves LLM performance on arithmetic, commonsense, and symbolic reasoning tasks, with the effect emerging at approximately 100B parameters.
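The technique is purely prompt construction. A minimal sketch in Python, using the paper's well-known tennis-ball exemplar (the helper name `build_cot_prompt` is ours):

```python
# Chain-of-thought prompting: the few-shot exemplar includes a worked
# reasoning trace before its answer, nudging the model to emit its own
# step-by-step reasoning for the new question.
EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend the reasoning exemplar to the target question."""
    return EXEMPLAR + f"Q: {question}\nA:"

prompt = build_cot_prompt("A farmer has 12 eggs and sells 7. How many remain?")
```

The model's completion then tends to mirror the exemplar's trace-then-answer shape, which is what unlocks the reported gains at scale.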
High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)
by CompVis / Stability AI
Introduced Latent Diffusion Models (LDMs), which perform the diffusion process in a compressed latent space rather than pixel space, dramatically reducing computational cost while maintaining image quality. This work underpins Stable Diffusion, the most widely used open-source image generation model.
Language Models are Few-Shot Learners (GPT-3)
by OpenAI
Introduced GPT-3, a 175B parameter language model demonstrating remarkable few-shot learning capabilities across diverse tasks. Showed that scaling model size dramatically improves in-context learning without gradient updates, reshaping the field.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
by Google Brain
Introduced the Vision Transformer (ViT), demonstrating that a pure transformer applied directly to sequences of image patches achieves state-of-the-art performance on image classification when pretrained on large datasets. The paper challenged the dominance of convolutional neural networks in computer vision.
Training Language Models to Follow Instructions with Human Feedback (InstructGPT)
by OpenAI
Presents InstructGPT, which uses Reinforcement Learning from Human Feedback (RLHF) to align GPT-3 with human intent. By fine-tuning on human demonstrations and training a reward model on human preference comparisons, it produces outputs that human labelers prefer to those of the 175B GPT-3, even from a 1.3B model with over 100× fewer parameters.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
by Facebook AI Research
Introduces Retrieval-Augmented Generation (RAG), combining parametric memory (language model weights) with non-parametric memory (dense retrieval over Wikipedia) for knowledge-intensive NLP tasks. RAG models achieve state-of-the-art on open-domain QA benchmarks and produce more specific, factual, and diverse responses than pure parametric models.
Proximal Policy Optimization Algorithms
by OpenAI
PPO introduces a clipped surrogate objective that constrains policy update step sizes, achieving the stability of trust-region methods (TRPO) with the simplicity and scalability of first-order optimizers. It quickly became the dominant RL algorithm for training large language models with human feedback.
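The clipped surrogate itself is tiny. A minimal NumPy sketch (the function name is ours; ε = 0.2 is the paper's commonly used setting):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A),
    where ratio r = pi_new(a|s) / pi_old(a|s) per sample."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # The element-wise minimum removes any incentive to push the ratio
    # outside [1-eps, 1+eps] when doing so would increase the objective.
    return np.minimum(unclipped, clipped).mean()
```

For example, with advantage 2.0 and ratio 1.5 the objective is capped at 1.2 × 2.0 = 2.4 rather than 3.0, which is what bounds the policy update step.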
Highly Accurate Protein Structure Prediction with AlphaFold
by DeepMind
AlphaFold 2 achieves atomic-level accuracy in protein structure prediction by combining evolutionary information from multiple sequence alignments with a novel Evoformer architecture and structure module, solving a 50-year grand challenge in biology. Its predictions have been released for virtually all known proteins and have accelerated drug discovery, enzyme design, and structural biology worldwide.
GPT-4 Technical Report
by OpenAI
Technical report for GPT-4, OpenAI's multimodal large language model accepting image and text inputs. Demonstrates state-of-the-art performance on academic and professional benchmarks, including passing the bar exam in the top 10% of test takers.
Segment Anything
by Meta AI
Introduced the Segment Anything Model (SAM) and the SA-1B dataset of 1 billion masks on 11 million images. SAM is a promptable segmentation foundation model that generalizes to new image distributions and tasks without additional training, enabling a new paradigm of interactive segmentation.
Evaluating Large Language Models Trained on Code (Codex)
by OpenAI
Introduced Codex, a GPT language model fine-tuned on publicly available code from GitHub, and the HumanEval benchmark for measuring code synthesis from docstrings. Codex powers GitHub Copilot and represents a breakthrough in automated programming assistance.
ReAct: Synergizing Reasoning and Acting in Language Models
by Google / Princeton
Introduces ReAct, a paradigm that combines reasoning traces and task-specific actions in language models. By interleaving thinking steps with tool calls, ReAct agents outperform chain-of-thought and act-only baselines on diverse tasks including question answering, fact verification, and interactive decision-making.
LoRA: Low-Rank Adaptation of Large Language Models
by Microsoft Research
Introduces LoRA, which freezes pretrained model weights and injects trainable low-rank decomposition matrices into Transformer layers. Reduces trainable parameters by 10,000× and GPU memory by 3× with no inference latency overhead, enabling efficient LLM fine-tuning.
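A minimal NumPy sketch of the update rule: the pretrained weight W stays frozen, only the low-rank pair (A, B) trains, and B is zero-initialized so the adapted layer starts exactly at the pretrained function. Shapes and the α = 8 scaling are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                        # hidden size and low rank (r << d)
W = rng.standard_normal((d, d))     # frozen pretrained weight

# LoRA adapter: only A and B receive gradients. B starts at zero, so
# W + (alpha/r) * B @ A == W before any fine-tuning.
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))
alpha = 8.0

def lora_forward(x):
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((2, d))
assert np.allclose(lora_forward(x), x @ W.T)   # identical at init
```

Here the trainable parameters number 2 · d · r = 512 against d² = 4096 frozen ones, and the adapter can be merged into W after training, which is why there is no inference latency overhead.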
LLaMA: Open and Efficient Foundation Language Models
by Meta AI
Introduces LLaMA, a collection of foundation language models ranging from 7B to 65B parameters, trained on publicly available datasets. Showed that smaller models trained on more tokens can match or exceed larger models, democratizing LLM research.
Deep Reinforcement Learning from Human Preferences
by OpenAI
This foundational RLHF paper shows that human preference comparisons between agent behaviors can train a reward model that guides deep RL agents in complex tasks like Atari games and MuJoCo locomotion, without hand-crafted reward functions. The approach reduces human labeling effort by ~3 orders of magnitude compared to direct reward specification.
Gemini: A Family of Highly Capable Multimodal Models
by Google DeepMind
Introduced the Gemini family of multimodal models (Ultra, Pro, Nano) natively trained to process and combine text, images, audio, and video. Gemini Ultra is the first model to surpass human expert performance on MMLU and achieves state-of-the-art across 30 of 32 benchmarks evaluated.
Efficient Memory Management for Large Language Model Serving with PagedAttention
by UC Berkeley
Introduced PagedAttention and the vLLM serving system, which manages the KV cache in non-contiguous physical memory blocks inspired by OS paging, enabling near-zero memory waste and efficient sharing of KV cache across requests. vLLM improves throughput by 2-4x at the same latency over state-of-the-art systems such as FasterTransformer and Orca.
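The core data structure is a per-sequence block table mapping logical KV-cache positions to physical blocks drawn from a shared free pool. A toy allocator sketch (class and method names are ours, not vLLM's API):

```python
class BlockTable:
    """Toy KV-cache allocator in the spirit of PagedAttention: blocks are
    grabbed on demand from a shared pool instead of being reserved up
    front for each sequence's maximum length."""
    def __init__(self, num_physical_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_physical_blocks))
        self.tables = {}      # seq_id -> list of physical block ids
        self.lengths = {}     # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:   # current block full: take a new one
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def physical_slot(self, seq_id, logical_pos):
        """Translate a logical token position to (block id, offset)."""
        block = self.tables[seq_id][logical_pos // self.block_size]
        return block, logical_pos % self.block_size
```

Because the mapping is indirect, internal fragmentation is bounded by one block per sequence, and two sequences can point their tables at the same physical block to share a common prefix.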
Generative Agents: Interactive Simulacra of Human Behavior
by Stanford University / Google
Introduces generative agents—computational software agents that simulate believable human behavior—by combining a large language model with memory streams, reflection synthesis, and planning mechanisms. Twenty-five agents populate a virtual town, exhibiting emergent social behaviors including relationship formation, information propagation, and event coordination.
Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)
by OpenAI
Presented DALL-E 2 (unCLIP), a hierarchical text-conditional image generation system using CLIP image embeddings as a prior and a diffusion decoder. The system achieves state-of-the-art photorealism and text-image alignment, substantially advancing the field of text-to-image synthesis.
Self-Consistency Improves Chain of Thought Reasoning in Language Models
by Google Brain
Introduced self-consistency, a decoding strategy that samples diverse reasoning paths from a language model and returns the most consistent answer by marginalizing out the reasoning paths. Self-consistency is a simple, training-free technique that substantially improves chain-of-thought prompting across arithmetic and commonsense reasoning tasks.
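The aggregation step is simply a majority vote over the final answers extracted from the sampled chains; a minimal sketch:

```python
from collections import Counter

def self_consistency(sampled_answers):
    """Majority vote over final answers from sampled reasoning paths;
    the reasoning paths themselves are marginalized out."""
    counts = Counter(sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Five sampled chains might end in these final answers:
assert self_consistency(["18", "18", "26", "18", "9"]) == "18"
```

In practice the chains are drawn with temperature sampling from the same chain-of-thought prompt, so the only cost over greedy decoding is the extra samples.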
Scaling Laws for Neural Language Models
by OpenAI
Empirically establishes power-law scaling relationships between language model performance and model size, dataset size, and compute budget. Provides the foundational framework for predicting LLM capabilities as a function of scale, guiding research for years.
Visual Instruction Tuning (LLaVA)
by University of Wisconsin–Madison / Microsoft Research
Introduced LLaVA (Large Language and Vision Assistant), a multimodal model trained via visual instruction tuning using GPT-4-generated multimodal instruction-following data. LLaVA demonstrates impressive multimodal chat abilities, reaching an 85.1% relative score against GPT-4 on a synthetic multimodal instruction-following benchmark and a new Science QA state of the art when ensembled with GPT-4, pioneering open-source visual instruction tuning.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
by Google Brain
Introduced Switch Transformers, a simplified mixture-of-experts (MoE) architecture that routes each token to exactly one expert (top-1 routing), enabling trillion-parameter models with sub-linear compute scaling. Switch Transformers achieve 7x pretraining speedup over a dense T5 model while maintaining model quality.
Model Cards for Model Reporting
by Google
Model Cards introduces a structured framework for documenting machine learning models across intended uses, performance disaggregated by demographic groups, and ethical considerations, enabling informed model selection and deployment decisions. The paper has become an industry standard, with model card adoption by Google, Hugging Face, and most major AI providers.
Language Models are Unsupervised Multitask Learners (GPT-2)
by OpenAI
Introduced GPT-2, demonstrating that large language models trained on diverse web text can perform zero-shot transfer across many NLP tasks without task-specific fine-tuning. Showed emergent capabilities at scale and sparked debate on responsible AI release.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
by Stanford University
Introduces FlashAttention, an IO-aware exact attention algorithm that restructures attention computation to minimize memory reads/writes between HBM and SRAM. Achieves 2-4× speedup over standard attention and enables training on much longer sequences.
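The key enabler is an online (streaming) softmax: a running max and running normalizer let each tile of scores be processed once without ever materializing the full score row. A scalar-at-a-time NumPy sketch of the idea (FlashAttention itself applies it to tiles held in SRAM):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values):
    """Streaming softmax-weighted sum: rescale the running accumulator
    whenever a new maximum appears, so only O(1) extra state is kept."""
    m = -np.inf                                   # running max
    s = 0.0                                       # running normalizer
    acc = np.zeros_like(values[0], dtype=float)   # running weighted sum
    for score, v in zip(scores, values):
        m_new = max(m, score)
        correction = np.exp(m - m_new) if np.isfinite(m) else 0.0
        s = s * correction + np.exp(score - m_new)
        acc = acc * correction + np.exp(score - m_new) * v
        m = m_new
    return acc / s

# Matches the ordinary two-pass softmax-weighted sum:
scores = np.array([0.5, 2.0, -1.0])
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = np.exp(scores - scores.max()); w /= w.sum()
assert np.allclose(online_softmax_weighted_sum(scores, values), w @ values)
```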
Training Compute-Optimal Large Language Models (Chinchilla)
by DeepMind
Challenges the Kaplan et al. scaling laws by showing that model size and training tokens should scale equally. Trains Chinchilla (70B) on 4× more data than Gopher, matching or beating models 4× its size, redefining compute-optimal training strategies.
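A common rule-of-thumb reading of the result is roughly 20 training tokens per parameter (an approximation of the paper's fitted scaling, not an exact constant):

```python
def chinchilla_optimal_tokens(n_params):
    """Rule-of-thumb compute-optimal token budget: ~20 tokens per
    parameter, the commonly cited approximation of the Chinchilla fit."""
    return 20 * n_params

# A 70B-parameter model should see about 1.4T training tokens,
# which matches Chinchilla's actual training budget:
assert chinchilla_optimal_tokens(70e9) == 1.4e12
```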
Datasheets for Datasets
by Microsoft Research / Multiple Institutions
Drawing an analogy to electronics component datasheets, this paper proposes that every ML dataset should be accompanied by a standardized document covering its motivation, composition, collection process, preprocessing, uses, distribution, and maintenance. Datasheets for Datasets has become the foundational standard for dataset transparency and is widely required by major AI venues.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
by Princeton University / Google DeepMind
Introduced Tree of Thoughts (ToT), a framework that generalizes chain-of-thought prompting to a tree search over intermediate reasoning steps. ToT enables LLMs to explore multiple reasoning paths, evaluate choices, and backtrack, achieving dramatic improvements on tasks requiring lookahead and planning.
Fast Inference from Transformers via Speculative Decoding
by Google Research
Introduced speculative decoding, a lossless inference acceleration technique that uses a smaller, faster draft model to propose multiple tokens, then verifies them in parallel with the target model in a single forward pass. This achieves 2-3x speedup without any degradation in output quality or distribution.
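A greedy-verification sketch of the loop (the paper's general algorithm accepts draft tokens probabilistically to preserve the target's sampling distribution; under greedy decoding, acceptance reduces to prefix matching, and the output is provably identical to target-only decoding). The callables here are stand-ins for real models:

```python
def speculative_decode(target_next, draft_next, prompt, k=4, n_tokens=8):
    """Greedy speculative decoding sketch. target_next/draft_next map a
    token sequence to the next greedy token. The draft proposes k tokens;
    the target verifies them (one parallel pass in a real system); the
    agreeing prefix is kept, plus the target's correction on mismatch."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n_tokens:
        proposal, ctx = [], list(seq)
        for _ in range(k):                 # draft proposes autoregressively
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        for t in proposal:                 # target verifies each position
            correct = target_next(seq)
            seq.append(correct)            # the target's token is always kept
            if correct != t:               # mismatch: discard rest of proposal
                break
            if len(seq) >= len(prompt) + n_tokens:
                break
    return seq[len(prompt):]
```

The speedup comes from verifying k positions in one target forward pass whenever the draft guesses well; a useless draft degrades speed but never changes the output.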
Constitutional AI: Harmlessness from AI Feedback
by Anthropic
Introduces Constitutional AI (CAI), a method for training harmless AI assistants using a set of written principles (a 'constitution') to guide both supervised learning and reinforcement learning from AI feedback (RLAIF). CAI enables Anthropic to reduce reliance on human harm labels while maintaining helpfulness and making AI reasoning about harmlessness explicit.
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
by Stability AI
Presented SDXL, a significantly improved latent diffusion model architecture featuring a 3.5B parameter UNet backbone with a secondary refiner model, conditioning on image size and crop parameters, and a curated high-aesthetic dataset. SDXL substantially improves visual quality and prompt adherence over prior Stable Diffusion versions.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
by DeepSeek
DeepSeek-R1 demonstrates that pure reinforcement learning with rule-based rewards—without supervised fine-tuning on chain-of-thought data—can incentivize emergent reasoning capabilities in LLMs including self-verification, reflection, and long chain-of-thought. The model achieves performance comparable to OpenAI-o1 on reasoning benchmarks while being fully open-sourced, triggering a significant industry response.
QLoRA: Efficient Finetuning of Quantized LLMs
by University of Washington
Introduces QLoRA, which combines 4-bit quantization with LoRA adapters to fine-tune a 65B LLM on a single 48GB GPU while preserving full 16-bit fine-tuning performance. Introduces NF4 data type and double quantization for extreme memory reduction.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
by Institute of Science and Technology Austria (IST Austria)
Presented GPTQ, a one-shot weight quantization method based on approximate second-order information that can quantize GPT models with 175B parameters to 4-bit or 3-bit precision in approximately four GPU-hours with negligible accuracy loss. GPTQ made large model inference practical on consumer hardware.
Code Llama: Open Foundation Models for Code
by Meta AI
Introduced Code Llama, a family of large language models for code built on Llama 2 through code-specific pretraining and fine-tuning. Code Llama achieves state-of-the-art performance among open models on HumanEval and MBPP, with variants for Python, instruction following, and long context (100K tokens).
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
by LMSYS / UC Berkeley
Introduces Chatbot Arena, a platform for crowdsourced human evaluation of LLMs via pairwise comparisons using an Elo rating system. The arena has collected over 240K human votes across 50+ models, revealing human preference rankings that often diverge from standard benchmark leaderboards and providing a complementary evaluation signal.
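A sketch of the standard Elo update such a system applies per pairwise vote (K = 32 is a conventional choice, not necessarily the platform's setting):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One pairwise-comparison update: score_a is 1 if model A won,
    0 if it lost, 0.5 for a tie. Returns the new (r_a, r_b)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins, so 16 points change hands:
assert elo_update(1000, 1000, 1) == (1016.0, 984.0)
```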
Toolformer: Language Models Can Teach Themselves to Use Tools
by Meta AI
Presents Toolformer, a model that learns to use external tools (APIs) in a self-supervised manner without requiring human annotations. The model decides which APIs to call, how to call them, and how to incorporate results, achieving strong performance across diverse tasks while maintaining generative language modeling ability.
Mistral 7B
by Mistral AI
Introduces Mistral 7B, a 7B parameter language model outperforming LLaMA 2 13B on all evaluated benchmarks and surpassing LLaMA 1 34B in reasoning, mathematics, and code generation. Uses grouped-query attention and sliding window attention for efficient inference.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
by MIT / MIT-IBM Watson AI Lab
Introduced AWQ (Activation-aware Weight Quantization), a hardware-friendly low-bit weight quantization approach that protects a small fraction (1%) of salient weights based on activation magnitudes, achieving better performance than GPTQ at 4-bit while being faster and more broadly applicable across model architectures.
PaLM: Scaling Language Modeling with Pathways
by Google Research
Introduces PaLM (Pathways Language Model), a 540B parameter model trained on 780B tokens using the Pathways system. Achieved breakthrough performance on reasoning tasks and demonstrated discontinuous performance improvements that define emergent abilities.
DINOv2: Learning Robust Visual Features without Supervision
by Meta AI
Presented DINOv2, a self-supervised vision foundation model trained on a curated dataset of 142 million images using a combination of self-distillation and contrastive objectives. DINOv2 features serve as universal visual representations, excelling on depth estimation, segmentation, and classification without fine-tuning.
Holistic Evaluation of Language Models
by Stanford CRFM
Presents HELM, a holistic evaluation framework for language models across 42 scenarios and 59 metrics including accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. HELM reveals that no single model dominates across all dimensions and exposes significant gaps between narrow and comprehensive model assessment.
GPT-4V(ision) System Card
by OpenAI
The system card for GPT-4 with vision (GPT-4V), detailing the model's visual understanding capabilities, safety evaluations, limitations, and mitigation strategies. GPT-4V represents a major advancement in large multimodal models, enabling complex visual reasoning from natural language prompts.
The Claude 3 Model Family: Opus, Sonnet, Haiku
by Anthropic
Presents the Claude 3 family of models (Opus, Sonnet, Haiku), demonstrating state-of-the-art performance on reasoning, vision, and multilingual tasks. Highlights Anthropic's safety techniques including Constitutional AI and RLHF-based alignment.
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)
by Google Brain
Introduced Imagen, a text-to-image diffusion model that leverages a large pretrained language model (T5-XXL) for text understanding combined with cascaded diffusion models for image synthesis. Imagen demonstrated that scaling the text encoder is more impactful than scaling the diffusion model, and introduced DrawBench as a new evaluation benchmark.
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
by Princeton University / Together AI
Extends FlashAttention with improved work partitioning across GPU thread blocks and warps, achieving 2× speedup over FlashAttention and ~9× speedup over standard attention. Enables efficient training of models with context lengths up to 256K tokens.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
by University of Washington / Black in AI
This influential FAccT paper argues that ever-larger language models carry significant risks—including environmental costs, biased training data, and the illusion of meaning—that are often overlooked in the race for benchmark performance. It calls for pausing scaling to focus on documentation, auditing, and community-centered research practices.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
by Salesforce Research
Presented BLIP-2, which bridges the modality gap between frozen image encoders and frozen LLMs using a lightweight Querying Transformer (Q-Former) trained in two stages. BLIP-2 achieves state-of-the-art VQA performance with significantly fewer trainable parameters than prior methods.
Flamingo: a Visual Language Model for Few-Shot Learning
by DeepMind
Introduced Flamingo, a family of visual language models that bridge powerful pretrained vision and language models, enabling few-shot learning on a diverse range of multimodal tasks by training on arbitrarily interleaved sequences of images, video, and text. Flamingo set new few-shot state-of-the-art on 16 benchmarks.
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
by Tsinghua / Peking University / DeepWisdom
Presents MetaGPT, a multi-agent framework that encodes human workflows as Standardized Operating Procedures (SOPs) for LLM agents acting as specialized software roles. By assigning product manager, architect, engineer, and QA roles, MetaGPT produces complete, executable codebases from natural language requirements with higher quality than prior approaches.
Let's Verify Step by Step
by OpenAI
Demonstrated that process-based reward models (PRMs), which provide feedback on each reasoning step, substantially outperform outcome-based reward models (ORMs) for training LLMs to solve mathematical reasoning problems. The paper also introduced PRM800K, a dataset of 800K step-level human feedback labels on MATH solutions.
RoFormer: Enhanced Transformer with Rotary Position Embedding
by Zhuiyi Technology
Introduces Rotary Position Embedding (RoPE), encoding absolute position information with a rotation matrix and naturally incorporating relative position in self-attention. Adopted by LLaMA, PaLM 2, and most modern LLMs for its length generalization properties.
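A minimal NumPy sketch of the rotation, plus a check of the property that makes RoPE relative: the query-key dot product depends only on the position offset, not on the absolute positions:

```python
import numpy as np

def rope(x, pos, theta=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles.
    x: (d,) vector with even d; pos: integer position."""
    d = x.shape[0]
    freqs = theta ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# <rope(q, m), rope(k, n)> depends only on m - n:
rng = np.random.default_rng(1)
q, k = rng.standard_normal(8), rng.standard_normal(8)
assert np.isclose(rope(q, 5) @ rope(k, 3), rope(q, 9) @ rope(k, 7))
```

Because each pair is rotated by an orthogonal 2x2 matrix, norms are preserved and the attention score between positions m and n reduces to a function of m - n.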
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
by Princeton University
Introduced SWE-bench, a benchmark of 2,294 real GitHub issues from 12 popular Python repositories requiring models to resolve issues by writing code patches. SWE-bench reveals that even the best LLMs resolve fewer than 4% of issues with standard techniques, motivating research into code agents.
Voyager: An Open-Ended Embodied Agent with Large Language Models
by NVIDIA / Caltech / UT Austin
Presents Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager uses an automatic curriculum, an ever-growing skill library of executable code, and an iterative prompting mechanism to overcome failures.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
by Stanford University
Introduces DPO, a stable and efficient alternative to RLHF that directly optimizes a language model on human preference data without an explicit reward model or RL. Achieves comparable or superior alignment results with significantly simpler implementation.
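The entire method reduces to a supervised loss per preference pair. A sketch (β = 0.1 is within the range the paper explores, not a universal constant):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares the policy's and the frozen reference
    model's log-probabilities of the chosen vs rejected responses.
    No reward model, no RL rollouts."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At initialization the policy equals the reference, the margin is zero, and the loss is log 2; the loss falls as the policy raises the chosen response's probability relative to the rejected one.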
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
by Microsoft Research
Presents GraphRAG, which uses LLM-generated knowledge graphs and community detection to enable query-focused summarization over entire text corpora. Unlike standard RAG which answers local questions from text chunks, GraphRAG enables global sensemaking queries by reasoning over interconnected entity communities at multiple granularities.
Decision Transformer: Reinforcement Learning via Sequence Modeling
by UC Berkeley / Google Brain
Decision Transformer recasts offline reinforcement learning as a conditional sequence modeling problem, predicting actions given return-to-go, states, and past actions using a causal Transformer. This eliminates the need for temporal difference learning and bootstrapping while achieving competitive performance on Atari and MuJoCo benchmarks.
REALM: Retrieval-Augmented Language Model Pre-Training
by Google Research
Proposes REALM, which augments language model pre-training with a learned textual knowledge retriever, enabling the model to retrieve and attend over documents from a large corpus during both pre-training and fine-tuning. REALM achieves state-of-the-art on open-domain QA benchmarks while providing interpretable knowledge retrieval.
StarCoder: May the Source Be With You!
by BigCode / Hugging Face / ServiceNow
Presented StarCoder, a 15.5B parameter open-source code LLM trained on 1 trillion tokens from The Stack (permissively licensed source code) with fill-in-the-middle capability, fast multi-token prediction inference, and a commitment to responsible AI through a model card and attribution feature.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
by Google Brain
Introduces the Sparsely-Gated Mixture-of-Experts (MoE) layer, enabling 1000× capacity increase with only marginal computational cost increase. A learned gating network selects a sparse subset of expert sub-networks per input, enabling unprecedented model scale.
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
by Google / Everyday Robots
SayCan combines the semantic reasoning capabilities of large language models with learned value functions that encode physical feasibility, allowing robots to plan long-horizon tasks expressed in natural language. The approach grounds high-level language instructions in real-world robot affordances without task-specific fine-tuning.
DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter
by Hugging Face
Introduces DistilBERT, a knowledge-distilled version of BERT that retains 97% of BERT's language understanding while being 40% smaller and 60% faster. Demonstrates the effectiveness of task-agnostic knowledge distillation for pretrained language models.
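The distillation term DistilBERT builds on is the temperature-softened KL divergence of Hinton et al.; a NumPy sketch (DistilBERT's full objective also adds the masked-LM loss and a cosine embedding loss):

```python
import numpy as np

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target distillation: KL(teacher || student) over
    temperature-softened distributions, scaled by T^2 so gradient
    magnitudes stay comparable across temperatures."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()
    p = softmax(teacher_logits / T)   # teacher's soft targets
    q = softmax(student_logits / T)   # student's prediction
    return T * T * np.sum(p * (np.log(p) - np.log(q)))
```

A higher temperature exposes more of the teacher's "dark knowledge", the relative probabilities it assigns to wrong classes, which is what the smaller student learns from.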
Conservative Q-Learning for Offline Reinforcement Learning
by UC Berkeley
CQL (Conservative Q-Learning) addresses distribution shift in offline RL by augmenting the standard Bellman objective with a term that penalizes Q-values for out-of-distribution actions, producing a lower bound on the true value function. This conservative approach prevents over-optimistic value estimation and achieves strong performance across locomotion, navigation, and robotic manipulation datasets.
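For discrete actions the conservative term has a compact form: a soft maximum of Q over all actions minus Q at the logged action. A NumPy sketch of just that penalty (the full objective adds it, weighted, to the standard Bellman error):

```python
import numpy as np

def cql_penalty(q_values, action_idx):
    """Conservative penalty for one state: logsumexp over Q for all
    actions (pushed down) minus Q at the dataset action (pushed up), so
    out-of-distribution actions cannot retain inflated value estimates.
    q_values: (n_actions,) array; action_idx: action seen in the data."""
    m = q_values.max()
    logsumexp = np.log(np.sum(np.exp(q_values - m))) + m
    return logsumexp - q_values[action_idx]
```

When the logged action already dominates the Q-values the penalty is near zero; when an unseen action looks spuriously good, the penalty grows and its gradient pulls that estimate down.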
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
by University of Washington / IBM AI Research / Allen AI
Introduces Self-RAG, a framework that trains a single LM to adaptively retrieve passages on demand, generate text, and critique its own outputs using special reflection tokens. Unlike standard RAG, Self-RAG decides when to retrieve and reflects on retrieved passages and generation quality, outperforming ChatGPT and standard RAG on diverse downstream tasks.
Emergent Abilities of Large Language Models
by Google Research / Stanford / DeepMind / UNC
Defines and documents emergent abilities in LLMs — capabilities that appear sharply at certain model scales rather than improving gradually. Surveys over 100 tasks where models exhibit phase-transition-like capability gains, sparking debate on whether emergence is real or a measurement artifact.
Improving Language Models by Retrieving from Trillions of Tokens
by DeepMind
Presents RETRO (Retrieval-Enhanced Transformers), a model that retrieves from a 2-trillion-token database at inference time via chunked cross-attention. RETRO achieves performance comparable to GPT-3 with 25× fewer parameters by leveraging retrieved passages, demonstrating that retrieval augmentation is a compute-efficient alternative to scaling.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
by Princeton NLP / Princeton Language and Intelligence
Introduces SWE-agent, which defines Agent-Computer Interfaces (ACIs) to enable LLMs to autonomously solve real GitHub issues by browsing codebases, editing files, and running tests. On the SWE-bench benchmark, SWE-agent with GPT-4 Turbo resolves 12.5% of issues, significantly outperforming prior methods.
Red Teaming Language Models with Language Models
by DeepMind
Proposes using language models to automatically generate test cases that elicit harmful behaviors from target language models—a scalable alternative to manual red teaming. The approach discovers diverse attack prompts across harm categories and reveals that larger models are harder to red-team but produce more harmful outputs when successfully attacked.
Qwen2.5 Technical Report
by Alibaba Cloud / Qwen Team
Qwen2.5 is a comprehensive family of open-source LLMs (0.5B to 72B parameters) trained on 18 trillion tokens including significantly expanded coding and mathematics data, achieving state-of-the-art open-source performance on coding (HumanEval), mathematics (MATH), and multilingual benchmarks. The series includes specialized Qwen2.5-Coder and Qwen2.5-Math variants.
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* Quality
by LMSYS / UC Berkeley / CMU / UCSD
Presents Vicuna-13B, an open-source chatbot created by fine-tuning LLaMA on ShareGPT conversation data, achieving approximately 90% of ChatGPT and Bard quality as judged by GPT-4. The paper introduces GPT-4 as an automated judge for chatbot evaluation, establishing a widely adopted evaluation paradigm for conversational AI.
AgentBench: Evaluating LLMs as Agents
by Tsinghua University
Introduces AgentBench, the first systematic benchmark for evaluating LLMs as autonomous agents across eight distinct environments spanning operating systems, databases, knowledge graphs, digital games, and web browsing. The benchmark reveals a large performance gap between commercial and open-source models on real-world agent tasks.
Competition-Level Code Generation with AlphaCode
by DeepMind
AlphaCode is a large-scale language model from DeepMind designed for competitive programming. It was pre-trained on public GitHub code and fine-tuned on a curated dataset of programming contest problems. The system generates a vast number of potential solutions and then filters them using test cases to find a correct one.
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
by OpenAI
This paper explores weak-to-strong generalization, training a strong model with supervision from a weaker one, as an analogy for humans aligning superhuman AI. It shows that strong models can generalize beyond their weak supervisors and introduces techniques such as an auxiliary confidence loss to close more of the remaining gap.
Scalable agent alignment via reward modeling: a research direction
by DeepMind
This research paper proposes a method for aligning advanced AI systems by using recursive reward modeling. The approach leverages AI assistants to help human evaluators assess complex AI actions, enabling scalable oversight and positioning this technique alongside debate and amplification as key AI safety strategies.
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
by Alibaba Cloud / DAMO Academy
Qwen-VL is a large-scale vision-language model series from Alibaba, trained on a curated multilingual multimodal dataset. It supports high-resolution image understanding, visual grounding with bounding boxes, and multilingual text reading, achieving state-of-the-art results on multiple visual benchmarks.
CAMEL: Communicative Agents for Mind Exploration of Large Language Model Society
by KAUST
CAMEL introduces a novel framework for studying multi-agent cooperation by having AI agents role-play to solve tasks. It utilizes a technique called 'inception prompting' to ensure agents adhere to their assigned personas, enabling the exploration of complex communicative behaviors and societal dynamics within large language models with minimal human guidance.
STaR: Bootstrapping Reasoning With Reasoning
by Stanford University / Google Brain
STaR (Self-Taught Reasoner) is a research paper introducing an iterative bootstrapping method for language models. The model learns to improve its reasoning abilities by generating rationales for problems, filtering out the incorrect ones, and then fine-tuning itself on the successfully reasoned examples. This allows smaller models to achieve reasoning performance comparable to much larger ones.
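One bootstrap iteration is a generate-filter-finetune loop; a sketch with stand-in callables for the model and the answer checker:

```python
def star_iteration(problems, answers, generate_rationale, is_correct):
    """One STaR bootstrap pass: sample a rationale + answer per problem,
    keep only the examples whose final answer checks out, and return
    them as the fine-tuning set for the next round. generate_rationale
    and is_correct stand in for the model and the answer checker."""
    finetune_set = []
    for problem, gold in zip(problems, answers):
        rationale, answer = generate_rationale(problem)
        if is_correct(answer, gold):
            finetune_set.append((problem, rationale, answer))
    return finetune_set
```

The paper additionally uses "rationalization" (re-prompting with the gold answer as a hint) to recover training signal from problems the model initially gets wrong.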
The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI
by Meta AI
Llama 4 introduces a family of natively multimodal mixture-of-experts models—Scout (17B/16 experts), Maverick (17B/128 experts), and Behemoth (288B/16 experts)—pretrained jointly on text, image, and video data. Maverick achieves top scores on vision-language benchmarks while Scout offers 10M-token context at a fraction of the compute of comparable models.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
by Google Research
Introduces Grouped-Query Attention (GQA), an efficient attention mechanism that interpolates between Multi-Head and Multi-Query Attention. GQA groups query heads to share key and value heads, drastically reducing KV cache size and memory bandwidth, which accelerates inference while maintaining near Multi-Head quality. The paper also shows that existing multi-head checkpoints can be uptrained into GQA models with a small fraction of the original pre-training compute.
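A minimal sketch of the grouped-query mechanism (shapes and names are illustrative; batching, masking, and projections are omitted):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_q_heads // n_kv_heads query heads shares one K/V head."""
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group_size                      # shared K/V head for this group
        scores = q[h] @ k[kv].T / np.sqrt(d)      # (seq, seq)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

With `n_kv_heads == n_q_heads` this reduces to Multi-Head Attention, and with `n_kv_heads == 1` to Multi-Query Attention; intermediate group counts trade cache size against quality.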
Artificial Intelligence Ethics Guidelines: A Global Inventory
by EPFL / Multiple Institutions
This paper presents a systematic review of 84 prominent AI ethics guidelines from around the world. It identifies a global convergence on five key ethical principles, including transparency and justice, but reveals significant divergence in how these principles are interpreted and operationalized across different sectors and regions.
Zoom In: An Introduction to Circuits
by Distill / OpenAI
This essay by Chris Olah and colleagues at Distill introduces the circuits framework for mechanistic interpretability, arguing that neural network weights encode interpretable algorithms composed of features and circuits. It presents case studies of curve detectors and other circuits in InceptionV1 as evidence that individual units and recurring motifs in neural networks are meaningfully interpretable.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
by Anthropic
Demonstrates that LLMs can be trained to behave safely during normal operation but exhibit unsafe behaviors when triggered by specific conditions—acting as 'sleeper agents'—and that standard safety training techniques including RLHF, supervised fine-tuning, and adversarial training fail to reliably remove these backdoors, sometimes even hiding them deeper.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
by Google Research
Introduces Switch Transformers, simplifying MoE routing to select a single expert per token (top-1), enabling stable trillion-parameter T5-scale models with 7× pre-training speedup. Demonstrates that parameter count and compute can be decoupled through sparsity.
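The top-1 routing step can be sketched as follows (names, shapes, and the list-of-callables expert representation are illustrative; capacity factors and the load-balancing loss are omitted):

```python
import numpy as np

def switch_layer(x, w_router, experts):
    """Switch (top-1) routing: each token is sent to exactly one expert.
    x: (tokens, d); w_router: (d, n_experts); experts: list of callables."""
    logits = x @ w_router
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    top1 = probs.argmax(-1)                       # single expert per token
    out = np.empty_like(x)
    for i, e in enumerate(top1):
        # scale by the router probability so the router receives gradient
        out[i] = probs[i, e] * experts[e](x[i])
    return out, top1
```

Because each token activates only one expert's parameters, total parameter count grows with the number of experts while per-token compute stays roughly constant.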
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
by DeepSeek
This paper introduces Group Relative Policy Optimization (GRPO), a memory-efficient reinforcement learning algorithm. GRPO enables scalable RLHF-style training by replacing the critic model with group-sampled reward baselines, a technique used to enhance the mathematical reasoning of models like DeepSeekMath.
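The critic-free baseline can be sketched as follows (whether the paper normalizes by population or sample standard deviation, and the zero-std guard, are implementation details assumed here):

```python
import statistics

def grpo_advantages(rewards):
    """GRPO baseline (sketch): for a group of G completions sampled for the
    same prompt, each completion's advantage is its reward normalized by the
    group mean and standard deviation; no learned critic is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid /0 when all rewards equal
    return [(r - mean) / std for r in rewards]
```

These advantages then weight a clipped PPO-style policy-gradient objective, so the memory cost of a value network the size of the policy is avoided.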
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
by Google DeepMind
RT-2 is a Vision-Language-Action (VLA) model that translates visual and language inputs directly into robotic actions. By co-fine-tuning large models on both web-scale and robotics data, it transfers knowledge from the internet to physical control, enabling robots to reason about and execute tasks involving novel objects and instructions absent from the robot training data.
Representation Engineering: A Top-Down Approach to AI Transparency
by Center for AI Safety / UC Berkeley
Representation Engineering (RepE) is a top-down AI transparency technique for interpreting and controlling Large Language Models. It uses linear probes on activation differences from contrastive prompts to identify and manipulate high-level concepts like truthfulness and emotion without needing to retrain or fine-tune the model.
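A difference-of-means sketch of reading out a concept direction from contrastive activations (a simplification of the paper's PCA-based linear artificial tomography pipeline; names and shapes are illustrative):

```python
import numpy as np

def reading_vector(pos_acts, neg_acts):
    """Unit 'reading vector' for a concept (e.g. honesty), computed from
    hidden activations on contrastive prompt pairs."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def concept_score(acts, direction):
    """Project activations onto the concept direction to read it out."""
    return acts @ direction
```

Adding or subtracting a scaled reading vector from the residual stream is the corresponding control operation, steering the concept without retraining.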
Atlas: Few-shot Learning with Retrieval Augmented Language Models
by Meta AI / University College London
Atlas is a retrieval-augmented language model designed for few-shot learning. It uniquely pre-trains its retriever and language model components jointly, enabling it to effectively leverage external knowledge documents. This approach allows Atlas to achieve state-of-the-art few-shot performance on knowledge-intensive NLP benchmarks like MMLU, outperforming much larger models.
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
by Anthropic
This research paper from Anthropic introduces a method using sparse autoencoders to decompose the internal activations of a transformer model. It successfully extracts thousands of interpretable, monosemantic features, demonstrating that the superposition of concepts within neurons can be untangled.
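The forward pass and training objective can be sketched as follows (parameter names, the L1 coefficient, and the loss reduction are illustrative assumptions; the decoder-weight normalization used in practice is omitted):

```python
import numpy as np

def sae_forward(acts, w_enc, b_enc, w_dec, b_dec):
    """Sparse autoencoder pass (sketch): encode d-dim activations into an
    overcomplete ReLU feature basis, then linearly reconstruct them."""
    feats = np.maximum(acts @ w_enc + b_enc, 0.0)   # sparse, non-negative
    recon = feats @ w_dec + b_dec
    return feats, recon

def sae_loss(acts, recon, feats, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the features."""
    mse = ((acts - recon) ** 2).sum(axis=-1).mean()
    l1 = np.abs(feats).sum(axis=-1).mean()
    return mse + l1_coeff * l1
```

The L1 penalty drives most feature activations to zero, so each learned dictionary direction tends to fire for a single interpretable concept even when the underlying neurons are polysemantic.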
Claude Opus 4 Technical Report
by Anthropic
The Claude Opus 4 technical report details Anthropic's flagship model, highlighting its extended thinking, advanced coding, and agentic capabilities. It showcases top-tier performance on benchmarks like SWE-bench and GPQA, along with significant improvements in safety through Constitutional AI and RLHF.
Gemini 2.5 Pro Technical Report
by Google DeepMind
Gemini 2.5 Pro introduces thinking mode—an integrated chain-of-thought reasoning layer—combined with a 1M-token context window and natively multimodal capabilities spanning text, image, audio, and video. The model achieves leading positions on multiple reasoning and coding benchmarks including Codeforces, AIME, and MMMU.
In-context Learning and Induction Heads
by Anthropic
This paper establishes a causal link between specific transformer circuits, termed "induction heads," and the phenomenon of in-context learning. It demonstrates that these circuits, composed attention heads spanning at least two layers that find and copy repeated sequences, emerge predictably during training and are a key mechanistic driver of few-shot learning abilities in LLMs.
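The copy-and-complete behavior itself is simple to state as a toy function (a behavioral sketch only, not the attention-head mechanics):

```python
def induction_step(tokens):
    """Toy induction-head behavior: find the most recent earlier occurrence
    of the current token and predict the token that followed it
    ([A][B] ... [A] -> [B])."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None
```

In the actual circuit, a previous-token head in an earlier layer writes "what came before me" into each position, and the induction head in a later layer attends to that signal to do the lookup.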
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
by Carnegie Mellon University / Together AI
Mamba is a novel sequence modeling architecture based on structured state space models (SSMs). It introduces a selection mechanism that allows the model to selectively propagate or forget information based on the input, overcoming a key limitation of previous SSMs. This enables Mamba to achieve Transformer-level performance with linear time complexity and significantly faster inference.
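The core recurrence can be sketched with a scalar state (real Mamba uses vector states, a discretization step, and a hardware-aware parallel scan, all omitted here):

```python
def selective_scan(x, a, b, c):
    """Scalar-state selective SSM recurrence (sketch):
        h_t = a_t * h_{t-1} + b_t * x_t,   y_t = c_t * h_t.
    In Mamba the parameters a_t, b_t, c_t are computed from the input x_t,
    letting the model decide per token what to remember or forget, while the
    scan stays linear in sequence length."""
    h = 0.0
    ys = []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys
```

Input-dependence of `a`, `b`, `c` is the "selection" mechanism: earlier SSMs fixed these per channel, so they could not gate information based on content.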
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
by LMSYS / UC Berkeley
Introduces LMSYS-Chat-1M, a large-scale dataset of one million real-world conversations with 25 state-of-the-art LLMs collected from the Chatbot Arena platform. Analysis reveals diverse usage patterns, safety violations, and human preference signals, making it a valuable resource for safety evaluation, capability assessment, and alignment research.
Towards Expert-Level Medical Question Answering with Large Language Models
by Google Research
This paper introduces Med-PaLM 2, a large language model fine-tuned on medical data. It achieves expert-level performance on medical licensing exam questions, with long-form answers rated favorably against physician-written answers on most evaluation axes, and proposes a framework for evaluating the safety and alignment of medical AI systems.
CogVLM: Visual Expert for Pretrained Language Models
by Tsinghua University / Zhipu AI
CogVLM is a vision-language model that enhances pretrained language models (LLMs) with visual understanding. It introduces a trainable visual expert module into each layer of a frozen LLM, enabling deep fusion of image and text features. This approach achieves state-of-the-art results on numerous vision-language benchmarks without altering the original language model's parameters.
Fast Transformer Decoding: One Write-Head is All You Need (Multi-Query Attention)
by Google Brain
Introduces Multi-Query Attention (MQA), an efficient attention mechanism for autoregressive decoding. By sharing a single key and value head across all query heads, MQA drastically reduces the size of the KV cache. This leads to significant memory bandwidth savings and faster inference speeds with minimal impact on model quality.
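The memory saving follows directly from the KV cache size formula; the model dimensions below are hypothetical, chosen only to make the ratio concrete:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache size: keys and values (the factor of 2) for
    every layer, KV head, position, and channel."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 32-layer model, 32 heads of dim 128, 4096-token context, fp16:
mha = kv_cache_bytes(32, 32, 128, 4096)   # one K/V head per query head
mqa = kv_cache_bytes(32, 1, 128, 4096)    # a single shared K/V head
```

Since autoregressive decoding is typically bound by the memory bandwidth spent re-reading this cache every step, shrinking it by the head count translates almost directly into faster generation.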