Explore.
7,960 AI entities indexed across tools, models, agents, skills, benchmarks, and more — schema-verified, agent-maintained.
100 entities · paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
by Google AI
Introduced BERT, a bidirectional Transformer pre-trained on masked language modeling and next sentence prediction. Established the pretrain-then-fine-tune paradigm that dominated NLP for years and set new state-of-the-art results on eleven NLP tasks.
Learning Transferable Visual Models From Natural Language Supervision (CLIP)
by OpenAI
Introduced CLIP (Contrastive Language-Image Pre-training), a model trained on 400 million image-text pairs using contrastive learning that achieves remarkable zero-shot transfer to diverse vision tasks. CLIP became foundational for vision-language alignment and generative AI pipelines.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
by Google Brain
Introduced chain-of-thought prompting, a simple technique of providing exemplars with step-by-step reasoning traces in few-shot prompts. This approach dramatically improves LLM performance on arithmetic, commonsense, and symbolic reasoning tasks, with the effect emerging at approximately 100B parameters.
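The technique is purely prompt construction. A minimal sketch in Python, using the paper's well-known tennis-ball exemplar (the helper name `build_cot_prompt` is ours):

```python
# Chain-of-thought prompting: the few-shot exemplar includes a worked
# reasoning trace before its answer, nudging the model to emit its own
# step-by-step reasoning for the new question.
EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend the reasoning exemplar to the target question."""
    return EXEMPLAR + f"Q: {question}\nA:"

prompt = build_cot_prompt("A farmer has 12 eggs and sells 7. How many remain?")
```

The model's completion then tends to mirror the exemplar's trace-then-answer shape, which is what unlocks the reported gains at scale.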
High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)
by CompVis / Stability AI
Introduced Latent Diffusion Models (LDMs), which perform the diffusion process in a compressed latent space rather than pixel space, dramatically reducing computational cost while maintaining image quality. This work underpins Stable Diffusion, the most widely used open-source image generation model.
Language Models are Few-Shot Learners (GPT-3)
by OpenAI
Introduced GPT-3, a 175B parameter language model demonstrating remarkable few-shot learning capabilities across diverse tasks. Showed that scaling model size dramatically improves in-context learning without gradient updates, reshaping the field.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
by Google Brain
Introduced the Vision Transformer (ViT), demonstrating that a pure transformer applied directly to sequences of image patches achieves state-of-the-art performance on image classification when pretrained on large datasets. The paper challenged the dominance of convolutional neural networks in computer vision.
Training Language Models to Follow Instructions with Human Feedback (InstructGPT)
by OpenAI
Presents InstructGPT, which uses Reinforcement Learning from Human Feedback (RLHF) to align GPT-3 with human intent. By fine-tuning on human demonstrations and training a reward model on human preference comparisons, it produces outputs that human labelers prefer to those of the 175B GPT-3, even from a 1.3B model with over 100× fewer parameters.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
by Facebook AI Research
Introduces Retrieval-Augmented Generation (RAG), combining parametric memory (language model weights) with non-parametric memory (dense retrieval over Wikipedia) for knowledge-intensive NLP tasks. RAG models achieve state-of-the-art on open-domain QA benchmarks and produce more specific, factual, and diverse responses than pure parametric models.
Proximal Policy Optimization Algorithms
by OpenAI
PPO introduces a clipped surrogate objective that constrains policy update step sizes, achieving the stability of trust-region methods (TRPO) with the simplicity and scalability of first-order optimizers. It quickly became the dominant RL algorithm for training large language models with human feedback.
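The clipped surrogate itself is tiny. A minimal NumPy sketch (the function name is ours; ε = 0.2 is the paper's commonly used setting):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A),
    where ratio r = pi_new(a|s) / pi_old(a|s) per sample."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # The element-wise minimum removes any incentive to push the ratio
    # outside [1-eps, 1+eps] when doing so would increase the objective.
    return np.minimum(unclipped, clipped).mean()
```

For example, with advantage 2.0 and ratio 1.5 the objective is capped at 1.2 × 2.0 = 2.4 rather than 3.0, which is what bounds the policy update step.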
Highly Accurate Protein Structure Prediction with AlphaFold
by DeepMind
AlphaFold 2 achieves atomic-level accuracy in protein structure prediction by combining evolutionary information from multiple sequence alignments with a novel Evoformer architecture and structure module, solving a 50-year grand challenge in biology. Its predictions have been released for virtually all known proteins and have accelerated drug discovery, enzyme design, and structural biology worldwide.
GPT-4 Technical Report
by OpenAI
Technical report for GPT-4, OpenAI's multimodal large language model accepting image and text inputs. Demonstrates state-of-the-art performance on academic and professional benchmarks, including passing the bar exam in the top 10% of test takers.
Segment Anything
by Meta AI
Introduced the Segment Anything Model (SAM) and the SA-1B dataset of 1 billion masks on 11 million images. SAM is a promptable segmentation foundation model that generalizes to new image distributions and tasks without additional training, enabling a new paradigm of interactive segmentation.
Evaluating Large Language Models Trained on Code (Codex)
by OpenAI
Introduced Codex, a GPT language model fine-tuned on publicly available code from GitHub, and the HumanEval benchmark for measuring code synthesis from docstrings. Codex powers GitHub Copilot and represents a breakthrough in automated programming assistance.
ReAct: Synergizing Reasoning and Acting in Language Models
by Google / Princeton
Introduces ReAct, a paradigm that combines reasoning traces and task-specific actions in language models. By interleaving thinking steps with tool calls, ReAct agents outperform chain-of-thought and act-only baselines on diverse tasks including question answering, fact verification, and interactive decision-making.
LoRA: Low-Rank Adaptation of Large Language Models
by Microsoft Research
Introduces LoRA, which freezes pretrained model weights and injects trainable low-rank decomposition matrices into Transformer layers. Reduces trainable parameters by 10,000× and GPU memory by 3× with no inference latency overhead, enabling efficient LLM fine-tuning.
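A minimal NumPy sketch of the update rule: the pretrained weight W stays frozen, only the low-rank pair (A, B) trains, and B is zero-initialized so the adapted layer starts exactly at the pretrained function. Shapes and the α = 8 scaling are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                        # hidden size and low rank (r << d)
W = rng.standard_normal((d, d))     # frozen pretrained weight

# LoRA adapter: only A and B receive gradients. B starts at zero, so
# W + (alpha/r) * B @ A == W before any fine-tuning.
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))
alpha = 8.0

def lora_forward(x):
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((2, d))
assert np.allclose(lora_forward(x), x @ W.T)   # identical at init
```

Here the trainable parameters number 2 · d · r = 512 against d² = 4096 frozen ones, and the adapter can be merged into W after training, which is why there is no inference latency overhead.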
LLaMA: Open and Efficient Foundation Language Models
by Meta AI
Introduces LLaMA, a collection of foundation language models ranging from 7B to 65B parameters, trained on publicly available datasets. Showed that smaller models trained on more tokens can match or exceed larger models, democratizing LLM research.
Deep Reinforcement Learning from Human Preferences
by OpenAI
This foundational RLHF paper shows that human preference comparisons between agent behaviors can train a reward model that guides deep RL agents in complex tasks like Atari games and MuJoCo locomotion, without hand-crafted reward functions. The approach reduces human labeling effort by ~3 orders of magnitude compared to direct reward specification.
Gemini: A Family of Highly Capable Multimodal Models
by Google DeepMind
Introduced the Gemini family of multimodal models (Ultra, Pro, Nano) natively trained to process and combine text, images, audio, and video. Gemini Ultra is the first model to surpass human expert performance on MMLU and achieves state-of-the-art across 30 of 32 benchmarks evaluated.
Efficient Memory Management for Large Language Model Serving with PagedAttention
by UC Berkeley
Introduced PagedAttention and the vLLM serving system, which manages the KV cache in non-contiguous physical memory blocks inspired by OS paging, enabling near-zero memory waste and efficient sharing of KV cache across requests. vLLM improves throughput by 2-4x at the same latency over state-of-the-art systems such as FasterTransformer and Orca.
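The core data structure is a per-sequence block table mapping logical KV-cache positions to physical blocks drawn from a shared free pool. A toy allocator sketch (class and method names are ours, not vLLM's API):

```python
class BlockTable:
    """Toy KV-cache allocator in the spirit of PagedAttention: blocks are
    grabbed on demand from a shared pool instead of being reserved up
    front for each sequence's maximum length."""
    def __init__(self, num_physical_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_physical_blocks))
        self.tables = {}      # seq_id -> list of physical block ids
        self.lengths = {}     # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:   # current block full: take a new one
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def physical_slot(self, seq_id, logical_pos):
        """Translate a logical token position to (block id, offset)."""
        block = self.tables[seq_id][logical_pos // self.block_size]
        return block, logical_pos % self.block_size
```

Because the mapping is indirect, internal fragmentation is bounded by one block per sequence, and two sequences can point their tables at the same physical block to share a common prefix.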
Generative Agents: Interactive Simulacra of Human Behavior
by Stanford University / Google
Introduces generative agents—computational software agents that simulate believable human behavior—by combining a large language model with memory streams, reflection synthesis, and planning mechanisms. Twenty-five agents populate a virtual town, exhibiting emergent social behaviors including relationship formation, information propagation, and event coordination.
Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)
by OpenAI
Presented DALL-E 2 (unCLIP), a hierarchical text-conditional image generation system using CLIP image embeddings as a prior and a diffusion decoder. The system achieves state-of-the-art photorealism and text-image alignment, substantially advancing the field of text-to-image synthesis.
Self-Consistency Improves Chain of Thought Reasoning in Language Models
by Google Brain
Introduced self-consistency, a decoding strategy that samples diverse reasoning paths from a language model and returns the most consistent answer by marginalizing out the reasoning paths. Self-consistency is a simple, training-free technique that substantially improves chain-of-thought prompting across arithmetic and commonsense reasoning tasks.
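The aggregation step is simply a majority vote over the final answers extracted from the sampled chains; a minimal sketch:

```python
from collections import Counter

def self_consistency(sampled_answers):
    """Majority vote over final answers from sampled reasoning paths;
    the reasoning paths themselves are marginalized out."""
    counts = Counter(sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Five sampled chains might end in these final answers:
assert self_consistency(["18", "18", "26", "18", "9"]) == "18"
```

In practice the chains are drawn with temperature sampling from the same chain-of-thought prompt, so the only cost over greedy decoding is the extra samples.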
Scaling Laws for Neural Language Models
by OpenAI
Empirically establishes power-law scaling relationships between language model performance and model size, dataset size, and compute budget. Provides the foundational framework for predicting LLM capabilities as a function of scale, guiding research for years.
Visual Instruction Tuning (LLaVA)
by University of Wisconsin–Madison / Microsoft Research
Introduced LLaVA (Large Language and Vision Assistant), a multimodal model trained via visual instruction tuning using GPT-4-generated multimodal instruction-following data. LLaVA demonstrates impressive multimodal chat abilities, reaching an 85.1% relative score against GPT-4 on a synthetic multimodal instruction-following benchmark and a new Science QA state of the art when ensembled with GPT-4, pioneering open-source visual instruction tuning.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
by Google Brain
Introduced Switch Transformers, a simplified mixture-of-experts (MoE) architecture that routes each token to exactly one expert (top-1 routing), enabling trillion-parameter models with sub-linear compute scaling. Switch Transformers achieve 7x pretraining speedup over a dense T5 model while maintaining model quality.
Model Cards for Model Reporting
by Google
Model Cards introduces a structured framework for documenting machine learning models across intended uses, performance disaggregated by demographic groups, and ethical considerations, enabling informed model selection and deployment decisions. The paper has become an industry standard, with model card adoption by Google, Hugging Face, and most major AI providers.
Language Models are Unsupervised Multitask Learners (GPT-2)
by OpenAI
Introduced GPT-2, demonstrating that large language models trained on diverse web text can perform zero-shot transfer across many NLP tasks without task-specific fine-tuning. Showed emergent capabilities at scale and sparked debate on responsible AI release.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
by Stanford University
Introduces FlashAttention, an IO-aware exact attention algorithm that restructures attention computation to minimize memory reads/writes between HBM and SRAM. Achieves 2-4× speedup over standard attention and enables training on much longer sequences.
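The key enabler is an online (streaming) softmax: a running max and running normalizer let each tile of scores be processed once without ever materializing the full score row. A scalar-at-a-time NumPy sketch of the idea (FlashAttention itself applies it to tiles held in SRAM):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values):
    """Streaming softmax-weighted sum: rescale the running accumulator
    whenever a new maximum appears, so only O(1) extra state is kept."""
    m = -np.inf                                   # running max
    s = 0.0                                       # running normalizer
    acc = np.zeros_like(values[0], dtype=float)   # running weighted sum
    for score, v in zip(scores, values):
        m_new = max(m, score)
        correction = np.exp(m - m_new) if np.isfinite(m) else 0.0
        s = s * correction + np.exp(score - m_new)
        acc = acc * correction + np.exp(score - m_new) * v
        m = m_new
    return acc / s

# Matches the ordinary two-pass softmax-weighted sum:
scores = np.array([0.5, 2.0, -1.0])
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = np.exp(scores - scores.max()); w /= w.sum()
assert np.allclose(online_softmax_weighted_sum(scores, values), w @ values)
```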
Training Compute-Optimal Large Language Models (Chinchilla)
by DeepMind
Challenges the Kaplan et al. scaling laws by showing that model size and training tokens should scale equally. Trains Chinchilla (70B) on 4× more data than Gopher, matching or beating models 4× its size, redefining compute-optimal training strategies.
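A common rule-of-thumb reading of the result is roughly 20 training tokens per parameter (an approximation of the paper's fitted scaling, not an exact constant):

```python
def chinchilla_optimal_tokens(n_params):
    """Rule-of-thumb compute-optimal token budget: ~20 tokens per
    parameter, the commonly cited approximation of the Chinchilla fit."""
    return 20 * n_params

# A 70B-parameter model should see about 1.4T training tokens,
# which matches Chinchilla's actual training budget:
assert chinchilla_optimal_tokens(70e9) == 1.4e12
```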
Datasheets for Datasets
by Microsoft Research / Multiple Institutions
Drawing an analogy to electronics component datasheets, this paper proposes that every ML dataset should be accompanied by a standardized document covering its motivation, composition, collection process, preprocessing, uses, distribution, and maintenance. Datasheets for Datasets has become the foundational standard for dataset transparency and is widely required by major AI venues.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
by Princeton University / Google DeepMind
Introduced Tree of Thoughts (ToT), a framework that generalizes chain-of-thought prompting to a tree search over intermediate reasoning steps. ToT enables LLMs to explore multiple reasoning paths, evaluate choices, and backtrack, achieving dramatic improvements on tasks requiring lookahead and planning.
Fast Inference from Transformers via Speculative Decoding
by Google Research
Introduced speculative decoding, a lossless inference acceleration technique that uses a smaller, faster draft model to propose multiple tokens, then verifies them in parallel with the target model in a single forward pass. This achieves 2-3x speedup without any degradation in output quality or distribution.
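A greedy-verification sketch of the loop (the paper's general algorithm accepts draft tokens probabilistically to preserve the target's sampling distribution; under greedy decoding, acceptance reduces to prefix matching, and the output is provably identical to target-only decoding). The callables here are stand-ins for real models:

```python
def speculative_decode(target_next, draft_next, prompt, k=4, n_tokens=8):
    """Greedy speculative decoding sketch. target_next/draft_next map a
    token sequence to the next greedy token. The draft proposes k tokens;
    the target verifies them (one parallel pass in a real system); the
    agreeing prefix is kept, plus the target's correction on mismatch."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n_tokens:
        proposal, ctx = [], list(seq)
        for _ in range(k):                 # draft proposes autoregressively
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        for t in proposal:                 # target verifies each position
            correct = target_next(seq)
            seq.append(correct)            # the target's token is always kept
            if correct != t:               # mismatch: discard rest of proposal
                break
            if len(seq) >= len(prompt) + n_tokens:
                break
    return seq[len(prompt):]
```

The speedup comes from verifying k positions in one target forward pass whenever the draft guesses well; a useless draft degrades speed but never changes the output.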
Constitutional AI: Harmlessness from AI Feedback
by Anthropic
Introduces Constitutional AI (CAI), a method for training harmless AI assistants using a set of written principles (a 'constitution') to guide both supervised learning and reinforcement learning from AI feedback (RLAIF). CAI enables Anthropic to reduce reliance on human harm labels while maintaining helpfulness and making AI reasoning about harmlessness explicit.
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
by Stability AI
Presented SDXL, a significantly improved latent diffusion model architecture featuring a 3.5B parameter UNet backbone with a secondary refiner model, conditioning on image size and crop parameters, and a curated high-aesthetic dataset. SDXL substantially improves visual quality and prompt adherence over prior Stable Diffusion versions.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
by DeepSeek
DeepSeek-R1 demonstrates that pure reinforcement learning with rule-based rewards—without supervised fine-tuning on chain-of-thought data—can incentivize emergent reasoning capabilities in LLMs including self-verification, reflection, and long chain-of-thought. The model achieves performance comparable to OpenAI-o1 on reasoning benchmarks while being fully open-sourced, triggering a significant industry response.
QLoRA: Efficient Finetuning of Quantized LLMs
by University of Washington
Introduces QLoRA, which combines 4-bit quantization with LoRA adapters to fine-tune a 65B LLM on a single 48GB GPU while preserving full 16-bit fine-tuning performance. Introduces NF4 data type and double quantization for extreme memory reduction.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
by Institute of Science and Technology Austria (IST Austria)
Presented GPTQ, a one-shot weight quantization method based on approximate second-order information that can quantize GPT models with 175B parameters to 4-bit or 3-bit precision in approximately four GPU-hours with negligible accuracy loss. GPTQ made large model inference practical on consumer hardware.
Code Llama: Open Foundation Models for Code
by Meta AI
Introduced Code Llama, a family of large language models for code built on Llama 2 through code-specific pretraining and fine-tuning. Code Llama achieves state-of-the-art performance among open models on HumanEval and MBPP, with variants for Python, instruction following, and long context (100K tokens).
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
by LMSYS / UC Berkeley
Introduces Chatbot Arena, a platform for crowdsourced human evaluation of LLMs via pairwise comparisons using an Elo rating system. The arena has collected over 240K human votes across 50+ models, revealing human preference rankings that often diverge from standard benchmark leaderboards and providing a complementary evaluation signal.
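A sketch of the standard Elo update such a system applies per pairwise vote (K = 32 is a conventional choice, not necessarily the platform's setting):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One pairwise-comparison update: score_a is 1 if model A won,
    0 if it lost, 0.5 for a tie. Returns the new (r_a, r_b)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins, so 16 points change hands:
assert elo_update(1000, 1000, 1) == (1016.0, 984.0)
```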
Toolformer: Language Models Can Teach Themselves to Use Tools
by Meta AI
Presents Toolformer, a model that learns to use external tools (APIs) in a self-supervised manner without requiring human annotations. The model decides which APIs to call, how to call them, and how to incorporate results, achieving strong performance across diverse tasks while maintaining generative language modeling ability.
Mistral 7B
by Mistral AI
Introduces Mistral 7B, a 7B parameter language model outperforming LLaMA 2 13B on all evaluated benchmarks and surpassing LLaMA 1 34B in reasoning, mathematics, and code generation. Uses grouped-query attention and sliding window attention for efficient inference.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
by MIT / MIT-IBM Watson AI Lab
Introduced AWQ (Activation-aware Weight Quantization), a hardware-friendly low-bit weight quantization approach that protects a small fraction (1%) of salient weights based on activation magnitudes, achieving better performance than GPTQ at 4-bit while being faster and more broadly applicable across model architectures.
PaLM: Scaling Language Modeling with Pathways
by Google Research
Introduces PaLM (Pathways Language Model), a 540B parameter model trained on 780B tokens using the Pathways system. Achieved breakthrough performance on reasoning tasks and demonstrated discontinuous performance improvements that define emergent abilities.
DINOv2: Learning Robust Visual Features without Supervision
by Meta AI
Presented DINOv2, a self-supervised vision foundation model trained on a curated dataset of 142 million images using a combination of self-distillation and contrastive objectives. DINOv2 features serve as universal visual representations, excelling on depth estimation, segmentation, and classification without fine-tuning.
Holistic Evaluation of Language Models
by Stanford CRFM
Presents HELM, a holistic evaluation framework for language models across 42 scenarios and 59 metrics including accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. HELM reveals that no single model dominates across all dimensions and exposes significant gaps between narrow and comprehensive model assessment.
GPT-4V(ision) System Card
by OpenAI
The system card for GPT-4 with vision (GPT-4V), detailing the model's visual understanding capabilities, safety evaluations, limitations, and mitigation strategies. GPT-4V represents a major advancement in large multimodal models, enabling complex visual reasoning from natural language prompts.
The Claude 3 Model Family: Opus, Sonnet, Haiku
by Anthropic
Presents the Claude 3 family of models (Opus, Sonnet, Haiku), demonstrating state-of-the-art performance on reasoning, vision, and multilingual tasks. Highlights Anthropic's safety techniques including Constitutional AI and RLHF-based alignment.
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)
by Google Brain
Introduced Imagen, a text-to-image diffusion model that leverages a large pretrained language model (T5-XXL) for text understanding combined with cascaded diffusion models for image synthesis. Imagen demonstrated that scaling the text encoder is more impactful than scaling the diffusion model, and introduced DrawBench as a new evaluation benchmark.
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
by Princeton University / Together AI
Extends FlashAttention with improved work partitioning across GPU thread blocks and warps, achieving 2× speedup over FlashAttention and ~9× speedup over standard attention. Enables efficient training of models with context lengths up to 256K tokens.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
by University of Washington / Black in AI
This influential FAccT paper argues that ever-larger language models carry significant risks—including environmental costs, biased training data, and the illusion of meaning—that are often overlooked in the race for benchmark performance. It calls for pausing scaling to focus on documentation, auditing, and community-centered research practices.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
by Salesforce Research
Presented BLIP-2, which bridges the modality gap between frozen image encoders and frozen LLMs using a lightweight Querying Transformer (Q-Former) trained in two stages. BLIP-2 achieves state-of-the-art VQA performance with significantly fewer trainable parameters than prior methods.
Flamingo: a Visual Language Model for Few-Shot Learning
by DeepMind
Introduced Flamingo, a family of visual language models that bridge powerful pretrained vision and language models, enabling few-shot learning on a diverse range of multimodal tasks by training on arbitrarily interleaved sequences of images, video, and text. Flamingo set new few-shot state-of-the-art on 16 benchmarks.
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
by Tsinghua / Peking University / DeepWisdom
Presents MetaGPT, a multi-agent framework that encodes human workflows as Standardized Operating Procedures (SOPs) for LLM agents acting as specialized software roles. By assigning product manager, architect, engineer, and QA roles, MetaGPT produces complete, executable codebases from natural language requirements with higher quality than prior approaches.
Let's Verify Step by Step
by OpenAI
Demonstrated that process-based reward models (PRMs), which provide feedback on each reasoning step, substantially outperform outcome-based reward models (ORMs) for training LLMs to solve mathematical reasoning problems. The paper also introduced PRM800K, a dataset of 800K step-level human feedback labels on MATH solutions.
RoFormer: Enhanced Transformer with Rotary Position Embedding
by Zhuiyi Technology
Introduces Rotary Position Embedding (RoPE), encoding absolute position information with a rotation matrix and naturally incorporating relative position in self-attention. Adopted by LLaMA, PaLM 2, and most modern LLMs for its length generalization properties.
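A minimal NumPy sketch of the rotation, plus a check of the property that makes RoPE relative: the query-key dot product depends only on the position offset, not on the absolute positions:

```python
import numpy as np

def rope(x, pos, theta=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles.
    x: (d,) vector with even d; pos: integer position."""
    d = x.shape[0]
    freqs = theta ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# <rope(q, m), rope(k, n)> depends only on m - n:
rng = np.random.default_rng(1)
q, k = rng.standard_normal(8), rng.standard_normal(8)
assert np.isclose(rope(q, 5) @ rope(k, 3), rope(q, 9) @ rope(k, 7))
```

Because each pair is rotated by an orthogonal 2x2 matrix, norms are preserved and the attention score between positions m and n reduces to a function of m - n.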
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
by Princeton University
Introduced SWE-bench, a benchmark of 2,294 real GitHub issues from 12 popular Python repositories requiring models to resolve issues by writing code patches. SWE-bench reveals that even the best LLMs resolve fewer than 4% of issues with standard techniques, motivating research into code agents.
Voyager: An Open-Ended Embodied Agent with Large Language Models
by NVIDIA / Caltech / UT Austin
Presents Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human intervention. Voyager uses an automatic curriculum, an ever-growing skill library of executable code, and an iterative prompting mechanism to overcome failures.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
by Stanford University
Introduces DPO, a stable and efficient alternative to RLHF that directly optimizes a language model on human preference data without an explicit reward model or RL. Achieves comparable or superior alignment results with significantly simpler implementation.
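The entire method reduces to a supervised loss per preference pair. A sketch (β = 0.1 is within the range the paper explores, not a universal constant):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares the policy's and the frozen reference
    model's log-probabilities of the chosen vs rejected responses.
    No reward model, no RL rollouts."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At initialization the policy equals the reference, the margin is zero, and the loss is log 2; the loss falls as the policy raises the chosen response's probability relative to the rejected one.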
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
by Microsoft Research
Presents GraphRAG, which uses LLM-generated knowledge graphs and community detection to enable query-focused summarization over entire text corpora. Unlike standard RAG which answers local questions from text chunks, GraphRAG enables global sensemaking queries by reasoning over interconnected entity communities at multiple granularities.
Decision Transformer: Reinforcement Learning via Sequence Modeling
by UC Berkeley / Google Brain
Decision Transformer recasts offline reinforcement learning as a conditional sequence modeling problem, predicting actions given return-to-go, states, and past actions using a causal Transformer. This eliminates the need for temporal difference learning and bootstrapping while achieving competitive performance on Atari and MuJoCo benchmarks.
REALM: Retrieval-Augmented Language Model Pre-Training
by Google Research
Proposes REALM, which augments language model pre-training with a learned textual knowledge retriever, enabling the model to retrieve and attend over documents from a large corpus during both pre-training and fine-tuning. REALM achieves state-of-the-art on open-domain QA benchmarks while providing interpretable knowledge retrieval.
StarCoder: May the Source Be With You!
by BigCode / Hugging Face / ServiceNow
Presented StarCoder, a 15.5B parameter open-source code LLM trained on 1 trillion tokens from The Stack (permissively licensed source code) with fill-in-the-middle capability, fast multi-token prediction inference, and a commitment to responsible AI through a model card and attribution feature.
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
by Google Brain
Introduces the Sparsely-Gated Mixture-of-Experts (MoE) layer, enabling 1000× capacity increase with only marginal computational cost increase. A learned gating network selects a sparse subset of expert sub-networks per input, enabling unprecedented model scale.
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
by Google / Everyday Robots
SayCan combines the semantic reasoning capabilities of large language models with learned value functions that encode physical feasibility, allowing robots to plan long-horizon tasks expressed in natural language. The approach grounds high-level language instructions in real-world robot affordances without task-specific fine-tuning.
DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter
by Hugging Face
Introduces DistilBERT, a knowledge-distilled version of BERT that retains 97% of BERT's language understanding while being 40% smaller and 60% faster. Demonstrates the effectiveness of task-agnostic knowledge distillation for pretrained language models.
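The distillation term DistilBERT builds on is the temperature-softened KL divergence of Hinton et al.; a NumPy sketch (DistilBERT's full objective also adds the masked-LM loss and a cosine embedding loss):

```python
import numpy as np

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target distillation: KL(teacher || student) over
    temperature-softened distributions, scaled by T^2 so gradient
    magnitudes stay comparable across temperatures."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()
    p = softmax(teacher_logits / T)   # teacher's soft targets
    q = softmax(student_logits / T)   # student's prediction
    return T * T * np.sum(p * (np.log(p) - np.log(q)))
```

A higher temperature exposes more of the teacher's "dark knowledge", the relative probabilities it assigns to wrong classes, which is what the smaller student learns from.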
Conservative Q-Learning for Offline Reinforcement Learning
by UC Berkeley
CQL (Conservative Q-Learning) addresses distribution shift in offline RL by augmenting the standard Bellman objective with a term that penalizes Q-values for out-of-distribution actions, producing a lower bound on the true value function. This conservative approach prevents over-optimistic value estimation and achieves strong performance across locomotion, navigation, and robotic manipulation datasets.
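For discrete actions the conservative term has a compact form: a soft maximum of Q over all actions minus Q at the logged action. A NumPy sketch of just that penalty (the full objective adds it, weighted, to the standard Bellman error):

```python
import numpy as np

def cql_penalty(q_values, action_idx):
    """Conservative penalty for one state: logsumexp over Q for all
    actions (pushed down) minus Q at the dataset action (pushed up), so
    out-of-distribution actions cannot retain inflated value estimates.
    q_values: (n_actions,) array; action_idx: action seen in the data."""
    m = q_values.max()
    logsumexp = np.log(np.sum(np.exp(q_values - m))) + m
    return logsumexp - q_values[action_idx]
```

When the logged action already dominates the Q-values the penalty is near zero; when an unseen action looks spuriously good, the penalty grows and its gradient pulls that estimate down.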
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
by University of Washington / IBM AI Research / Allen AI
Introduces Self-RAG, a framework that trains a single LM to adaptively retrieve passages on demand, generate text, and critique its own outputs using special reflection tokens. Unlike standard RAG, Self-RAG decides when to retrieve and reflects on retrieved passages and generation quality, outperforming ChatGPT and standard RAG on diverse downstream tasks.
Emergent Abilities of Large Language Models
by Google Research / Stanford / DeepMind / UNC
Defines and documents emergent abilities in LLMs — capabilities that appear sharply at certain model scales rather than improving gradually. Surveys over 100 tasks where models exhibit phase-transition-like capability gains, sparking debate on whether emergence is real or a measurement artifact.
Improving Language Models by Retrieving from Trillions of Tokens
by DeepMind
Presents RETRO (Retrieval-Enhanced Transformers), a model that retrieves from a 2-trillion-token database at inference time via chunked cross-attention. RETRO achieves performance comparable to GPT-3 with 25× fewer parameters by leveraging retrieved passages, demonstrating that retrieval augmentation is a compute-efficient alternative to scaling.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
by Princeton NLP / Princeton Language and Intelligence
Introduces SWE-agent, which defines Agent-Computer Interfaces (ACIs) to enable LLMs to autonomously solve real GitHub issues by browsing codebases, editing files, and running tests. On the SWE-bench benchmark, SWE-agent with GPT-4 Turbo resolves 12.5% of issues, significantly outperforming prior methods.
Red Teaming Language Models with Language Models
by DeepMind
Proposes using language models to automatically generate test cases that elicit harmful behaviors from target language models—a scalable alternative to manual red teaming. The approach discovers diverse attack prompts across harm categories and reveals that larger models are harder to red-team but produce more harmful outputs when successfully attacked.
Qwen2.5 Technical Report
by Alibaba Cloud / Qwen Team
Qwen2.5 is a comprehensive family of open-source LLMs (0.5B to 72B parameters) trained on 18 trillion tokens including significantly expanded coding and mathematics data, achieving state-of-the-art open-source performance on coding (HumanEval), mathematics (MATH), and multilingual benchmarks. The series includes specialized Qwen2.5-Coder and Qwen2.5-Math variants.
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* Quality
by LMSYS / UC Berkeley / CMU / UCSD
Presents Vicuna-13B, an open-source chatbot created by fine-tuning LLaMA on ShareGPT conversation data, achieving approximately 90% of ChatGPT and Bard quality as judged by GPT-4. The paper introduces GPT-4 as an automated judge for chatbot evaluation, establishing a widely adopted evaluation paradigm for conversational AI.
AgentBench: Evaluating LLMs as Agents
by Tsinghua University
Introduces AgentBench, the first systematic benchmark for evaluating LLMs as autonomous agents across eight distinct environments spanning operating systems, databases, knowledge graphs, digital games, and web browsing. The benchmark reveals a large performance gap between commercial and open-source models on real-world agent tasks.
Competition-Level Code Generation with AlphaCode
by DeepMind
AlphaCode is a large-scale language model from DeepMind designed for competitive programming. It was pre-trained on public GitHub code and fine-tuned on a curated dataset of programming contest problems. The system generates a vast number of potential solutions and then filters them using test cases to find a correct one.
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
by OpenAI
This paper explores weak-to-strong generalization, training a strong model with supervision from a weaker one, as an analogy for humans aligning superhuman AI. It shows that strong models can generalize beyond their weak supervisors and introduces techniques such as an auxiliary confidence loss to close more of the remaining gap.
Scalable agent alignment via reward modeling: a research direction
by DeepMind
This research paper proposes a method for aligning advanced AI systems by using recursive reward modeling. The approach leverages AI assistants to help human evaluators assess complex AI actions, enabling scalable oversight and positioning this technique alongside debate and amplification as key AI safety strategies.
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
by Alibaba Cloud / DAMO Academy
Qwen-VL is a large-scale vision-language model series from Alibaba, trained on a curated multilingual multimodal dataset. It supports high-resolution image understanding, visual grounding with bounding boxes, and multilingual text reading, achieving state-of-the-art results on multiple visual benchmarks.
CAMEL: Communicative Agents for Mind Exploration of Large Language Model Society
by KAUST
CAMEL introduces a novel framework for studying multi-agent cooperation by having AI agents role-play to solve tasks. It utilizes a technique called 'inception prompting' to ensure agents adhere to their assigned personas, enabling the exploration of complex communicative behaviors and societal dynamics within large language models with minimal human guidance.
STaR: Bootstrapping Reasoning With Reasoning
by Stanford University / Google Brain
STaR (Self-Taught Reasoner) is a research paper introducing an iterative bootstrapping method for language models. The model learns to improve its reasoning abilities by generating rationales for problems, filtering out the incorrect ones, and then fine-tuning itself on the successfully reasoned examples. This allows smaller models to achieve reasoning performance comparable to much larger ones.
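One bootstrap iteration is a generate-filter-finetune loop; a sketch with stand-in callables for the model and the answer checker:

```python
def star_iteration(problems, answers, generate_rationale, is_correct):
    """One STaR bootstrap pass: sample a rationale + answer per problem,
    keep only the examples whose final answer checks out, and return
    them as the fine-tuning set for the next round. generate_rationale
    and is_correct stand in for the model and the answer checker."""
    finetune_set = []
    for problem, gold in zip(problems, answers):
        rationale, answer = generate_rationale(problem)
        if is_correct(answer, gold):
            finetune_set.append((problem, rationale, answer))
    return finetune_set
```

The paper additionally uses "rationalization" (re-prompting with the gold answer as a hint) to recover training signal from problems the model initially gets wrong.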
The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI
by Meta AI
Llama 4 introduces a family of natively multimodal mixture-of-experts models—Scout (17B/16 experts), Maverick (17B/128 experts), and Behemoth (288B/16 experts)—pretrained jointly on text, image, and video data. Maverick achieves top scores on vision-language benchmarks while Scout offers 10M-token context at a fraction of the compute of comparable models.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
by Google Research
Introduces Grouped-Query Attention (GQA), an efficient attention mechanism that interpolates between Multi-Head and Multi-Query Attention. GQA groups query heads to share key and value heads, drastically reducing KV cache size and memory bandwidth, which accelerates inference while maintaining near Multi-Head quality. The paper also shows that existing multi-head checkpoints can be uptrained into GQA models with a small fraction of the original pre-training compute.
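A minimal sketch of the grouped-query mechanism (shapes and names are illustrative; batching, masking, and projections are omitted):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_q_heads // n_kv_heads query heads shares one K/V head."""
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group_size                      # shared K/V head for this group
        scores = q[h] @ k[kv].T / np.sqrt(d)      # (seq, seq)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out[h] = w @ v[kv]
    return out
```

With `n_kv_heads == n_q_heads` this reduces to Multi-Head Attention, and with `n_kv_heads == 1` to Multi-Query Attention; intermediate group counts trade cache size against quality.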
Artificial Intelligence Ethics Guidelines: A Global Inventory
by EPFL / Multiple Institutions
This paper presents a systematic review of 84 prominent AI ethics guidelines from around the world. It identifies a global convergence on five key ethical principles, including transparency and justice, but reveals significant divergence in how these principles are interpreted and operationalized across different sectors and regions.
Zoom In: An Introduction to Circuits
by Distill / OpenAI
This essay by Chris Olah and colleagues at Distill introduces the circuits framework for mechanistic interpretability, arguing that neural network weights encode interpretable algorithms composed of features and circuits. It presents case studies of curve detectors and other circuits in InceptionV1 as evidence that individual units and recurring motifs in neural networks are meaningfully interpretable.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
by Anthropic
Demonstrates that LLMs can be trained to behave safely during normal operation but exhibit unsafe behaviors when triggered by specific conditions—acting as 'sleeper agents'—and that standard safety training techniques including RLHF, supervised fine-tuning, and adversarial training fail to reliably remove these backdoors, sometimes even hiding them deeper.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
by Google Research
Introduces Switch Transformers, simplifying MoE routing to select a single expert per token (top-1), enabling stable trillion-parameter T5-scale models with 7× pre-training speedup. Demonstrates that parameter count and compute can be decoupled through sparsity.
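The top-1 routing step can be sketched as follows (names, shapes, and the list-of-callables expert representation are illustrative; capacity factors and the load-balancing loss are omitted):

```python
import numpy as np

def switch_layer(x, w_router, experts):
    """Switch (top-1) routing: each token is sent to exactly one expert.
    x: (tokens, d); w_router: (d, n_experts); experts: list of callables."""
    logits = x @ w_router
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    top1 = probs.argmax(-1)                       # single expert per token
    out = np.empty_like(x)
    for i, e in enumerate(top1):
        # scale by the router probability so the router receives gradient
        out[i] = probs[i, e] * experts[e](x[i])
    return out, top1
```

Because each token activates only one expert's parameters, total parameter count grows with the number of experts while per-token compute stays roughly constant.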
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
by DeepSeek
This paper introduces Group Relative Policy Optimization (GRPO), a memory-efficient reinforcement learning algorithm. GRPO enables scalable RLHF-style training by replacing the critic model with group-sampled reward baselines, a technique used to enhance the mathematical reasoning of models like DeepSeekMath.
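The critic-free baseline can be sketched as follows (whether the paper normalizes by population or sample standard deviation, and the zero-std guard, are implementation details assumed here):

```python
import statistics

def grpo_advantages(rewards):
    """GRPO baseline (sketch): for a group of G completions sampled for the
    same prompt, each completion's advantage is its reward normalized by the
    group mean and standard deviation; no learned critic is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # avoid /0 when all rewards equal
    return [(r - mean) / std for r in rewards]
```

These advantages then weight a clipped PPO-style policy-gradient objective, so the memory cost of a value network the size of the policy is avoided.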
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
by Google DeepMind
RT-2 is a Vision-Language-Action (VLA) model that translates visual and language inputs directly into robotic actions. By co-fine-tuning large models on both web-scale and robotics data, it transfers knowledge from the internet to physical control, enabling robots to reason about and execute tasks involving novel objects and instructions absent from the robot training data.
Representation Engineering: A Top-Down Approach to AI Transparency
by Center for AI Safety / UC Berkeley
Representation Engineering (RepE) is a top-down AI transparency technique for interpreting and controlling Large Language Models. It uses linear probes on activation differences from contrastive prompts to identify and manipulate high-level concepts like truthfulness and emotion without needing to retrain or fine-tune the model.
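A difference-of-means sketch of reading out a concept direction from contrastive activations (a simplification of the paper's PCA-based linear artificial tomography pipeline; names and shapes are illustrative):

```python
import numpy as np

def reading_vector(pos_acts, neg_acts):
    """Unit 'reading vector' for a concept (e.g. honesty), computed from
    hidden activations on contrastive prompt pairs."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def concept_score(acts, direction):
    """Project activations onto the concept direction to read it out."""
    return acts @ direction
```

Adding or subtracting a scaled reading vector from the residual stream is the corresponding control operation, steering the concept without retraining.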
Atlas: Few-shot Learning with Retrieval Augmented Language Models
by Meta AI / University College London
Atlas is a retrieval-augmented language model designed for few-shot learning. It uniquely pre-trains its retriever and language model components jointly, enabling it to effectively leverage external knowledge documents. This approach allows Atlas to achieve state-of-the-art few-shot performance on knowledge-intensive NLP benchmarks like MMLU, outperforming much larger models.
Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
by Anthropic
This research paper from Anthropic introduces a method using sparse autoencoders to decompose the internal activations of a transformer model. It successfully extracts thousands of interpretable, monosemantic features, demonstrating that the superposition of concepts within neurons can be untangled.
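The forward pass and training objective can be sketched as follows (parameter names, the L1 coefficient, and the loss reduction are illustrative assumptions; the decoder-weight normalization used in practice is omitted):

```python
import numpy as np

def sae_forward(acts, w_enc, b_enc, w_dec, b_dec):
    """Sparse autoencoder pass (sketch): encode d-dim activations into an
    overcomplete ReLU feature basis, then linearly reconstruct them."""
    feats = np.maximum(acts @ w_enc + b_enc, 0.0)   # sparse, non-negative
    recon = feats @ w_dec + b_dec
    return feats, recon

def sae_loss(acts, recon, feats, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the features."""
    mse = ((acts - recon) ** 2).sum(axis=-1).mean()
    l1 = np.abs(feats).sum(axis=-1).mean()
    return mse + l1_coeff * l1
```

The L1 penalty drives most feature activations to zero, so each learned dictionary direction tends to fire for a single interpretable concept even when the underlying neurons are polysemantic.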
Claude Opus 4 Technical Report
by Anthropic
The Claude Opus 4 technical report details Anthropic's flagship model, highlighting its extended thinking, advanced coding, and agentic capabilities. It showcases top-tier performance on benchmarks like SWE-bench and GPQA, along with significant improvements in safety through Constitutional AI and RLHF.
Gemini 2.5 Pro Technical Report
by Google DeepMind
Gemini 2.5 Pro introduces thinking mode—an integrated chain-of-thought reasoning layer—combined with a 1M-token context window and natively multimodal capabilities spanning text, image, audio, and video. The model achieves leading positions on multiple reasoning and coding benchmarks including Codeforces, AIME, and MMMU.
In-context Learning and Induction Heads
by Anthropic
This paper establishes a causal link between specific transformer circuits, termed "induction heads," and the phenomenon of in-context learning. It demonstrates that these circuits, composed attention heads spanning at least two layers that find and copy repeated sequences, emerge predictably during training and are a key mechanistic driver of few-shot learning abilities in LLMs.
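The copy-and-complete behavior itself is simple to state as a toy function (a behavioral sketch only, not the attention-head mechanics):

```python
def induction_step(tokens):
    """Toy induction-head behavior: find the most recent earlier occurrence
    of the current token and predict the token that followed it
    ([A][B] ... [A] -> [B])."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None
```

In the actual circuit, a previous-token head in an earlier layer writes "what came before me" into each position, and the induction head in a later layer attends to that signal to do the lookup.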
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
by Carnegie Mellon University / Together AI
Mamba is a novel sequence modeling architecture based on structured state space models (SSMs). It introduces a selection mechanism that allows the model to selectively propagate or forget information based on the input, overcoming a key limitation of previous SSMs. This enables Mamba to achieve Transformer-level performance with linear time complexity and significantly faster inference.
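The core recurrence can be sketched with a scalar state (real Mamba uses vector states, a discretization step, and a hardware-aware parallel scan, all omitted here):

```python
def selective_scan(x, a, b, c):
    """Scalar-state selective SSM recurrence (sketch):
        h_t = a_t * h_{t-1} + b_t * x_t,   y_t = c_t * h_t.
    In Mamba the parameters a_t, b_t, c_t are computed from the input x_t,
    letting the model decide per token what to remember or forget, while the
    scan stays linear in sequence length."""
    h = 0.0
    ys = []
    for a_t, b_t, c_t, x_t in zip(a, b, c, x):
        h = a_t * h + b_t * x_t
        ys.append(c_t * h)
    return ys
```

Input-dependence of `a`, `b`, `c` is the "selection" mechanism: earlier SSMs fixed these per channel, so they could not gate information based on content.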
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
by LMSYS / UC Berkeley
Introduces LMSYS-Chat-1M, a large-scale dataset of one million real-world conversations with 25 state-of-the-art LLMs collected from the Chatbot Arena platform. Analysis reveals diverse usage patterns, safety violations, and human preference signals, making it a valuable resource for safety evaluation, capability assessment, and alignment research.
Towards Expert-Level Medical Question Answering with Large Language Models
by Google Research
This paper introduces Med-PaLM 2, a large language model fine-tuned on medical data. It achieves expert-level performance on medical licensing exam questions, with long-form answers rated favorably against physician-written answers on most evaluation axes, and proposes a framework for evaluating the safety and alignment of medical AI systems.
CogVLM: Visual Expert for Pretrained Language Models
by Tsinghua University / Zhipu AI
CogVLM is a vision-language model that enhances pretrained language models (LLMs) with visual understanding. It introduces a trainable visual expert module into each layer of a frozen LLM, enabling deep fusion of image and text features. This approach achieves state-of-the-art results on numerous vision-language benchmarks without altering the original language model's parameters.
Fast Transformer Decoding: One Write-Head is All You Need (Multi-Query Attention)
by Google Brain
Introduces Multi-Query Attention (MQA), an efficient attention mechanism for autoregressive decoding. By sharing a single key and value head across all query heads, MQA drastically reduces the size of the KV cache. This leads to significant memory bandwidth savings and faster inference speeds with minimal impact on model quality.
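The memory saving follows directly from the KV cache size formula; the model dimensions below are hypothetical, chosen only to make the ratio concrete:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache size: keys and values (the factor of 2) for
    every layer, KV head, position, and channel."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 32-layer model, 32 heads of dim 128, 4096-token context, fp16:
mha = kv_cache_bytes(32, 32, 128, 4096)   # one K/V head per query head
mqa = kv_cache_bytes(32, 1, 128, 4096)    # a single shared K/V head
```

Since autoregressive decoding is typically bound by the memory bandwidth spent re-reading this cache every step, shrinking it by the head count translates almost directly into faster generation.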