LLMs

chain-of-thought, reasoning, prompting

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

by Google Brain

Introduced chain-of-thought prompting, a simple technique of providing exemplars with step-by-step reasoning traces in few-shot prompts. This approach dramatically improves LLM performance on arithmetic, commonsense, and symbolic reasoning tasks, with the effect emerging at approximately 100B parameters.

82.1 A

gpt-3, few-shot, in-context-learning

Language Models are Few-Shot Learners (GPT-3)

by OpenAI

Introduced GPT-3, a 175B parameter language model demonstrating remarkable few-shot learning capabilities across diverse tasks. Showed that scaling model size dramatically improves in-context learning without gradient updates, reshaping the field.

82 A

GPT-4 Technical Report

by OpenAI

Technical report for GPT-4, OpenAI's multimodal large language model accepting image and text inputs. Demonstrates state-of-the-art performance on academic and professional benchmarks, including a simulated bar exam on which it scored around the top 10% of test takers.

gpt-4, multimodal, rlhf

81 A

Wikipedia Dump

by Wikimedia Foundation

The full text dump of Wikipedia articles available in over 300 languages, regularly updated and distributed by the Wikimedia Foundation. It is one of the most universally included components in language model pretraining pipelines due to its high factual density, editorial quality, and broad topical coverage.

nlp, encyclopedic, factual

80.2 A

codex, code-generation, github-copilot

Evaluating Large Language Models Trained on Code (Codex)

by OpenAI

Introduced Codex, a GPT language model fine-tuned on publicly available code from GitHub, and the HumanEval benchmark for measuring code synthesis from docstrings. Codex powers GitHub Copilot and represents a breakthrough in automated programming assistance.

79.2 B+

GPT-5

by OpenAI

OpenAI's frontier model with advanced reasoning, native multimodal understanding, and robust function calling. Designed for complex enterprise workflows and agentic applications.

llm, reasoning, multimodal

78.7 B+

GPT-4o

by OpenAI

OpenAI's natively multimodal flagship model processing text, image, and audio inputs with a single unified architecture. Delivers GPT-4 Turbo-level intelligence at 2x speed and 50% lower cost, with breakthrough real-time voice capabilities.

llm, multimodal, omni

78.1 B+

llama, open-source, efficient

LLaMA: Open and Efficient Foundation Language Models

by Meta AI

Introduced LLaMA, a collection of foundation language models ranging from 7B to 65B parameters, trained on publicly available datasets. Showed that smaller models trained on more tokens can match or exceed larger models, democratizing LLM research.

78.1 B+

Claude 4

by Anthropic

Anthropic's most capable model featuring advanced reasoning, coding, and multimodal capabilities. Excels at complex analysis, agentic tasks, and extended thinking with industry-leading safety.

llm, reasoning, coding

78 B+

GPT-4

by OpenAI

OpenAI's breakthrough large language model that demonstrated a significant leap in reasoning and factual accuracy over GPT-3.5. Widely adopted across enterprise and developer workflows for code generation, analysis, and complex problem-solving.

llm, reasoning, multimodal

77.9 B+

Claude 3.5 Sonnet

by Anthropic

Anthropic's breakout model that surpassed Claude 3 Opus at Sonnet-tier pricing, setting new industry benchmarks for coding. Its October 2024 upgrade introduced computer use in public beta, and the model became widely used through the API for its intelligence-to-cost ratio.

llm, coding, multimodal

77.7 B+

prompting, reasoning, chain-of-thought

Chain-of-Thought

by AaaS

Guides LLMs to produce step-by-step reasoning before arriving at a final answer. Dramatically improves performance on math, logic, and multi-step problems by making the model's reasoning process explicit and verifiable.

76.6 B+
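A minimal sketch of what a chain-of-thought prompt can look like in practice. The exemplar text, the "The answer is" marker, and the helper names `build_cot_prompt` and `extract_final_answer` are assumptions made for illustration; no model is called here, and a real setup would pass the prompt to whichever LLM client is in use.

```python
import re

# Illustrative exemplar: the reasoning trace is written by hand.
EXEMPLARS = [
    {
        "question": (
            "Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
            "How many tennis balls does he have now?"
        ),
        "reasoning": "Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11.",
        "answer": "11",
    },
]


def build_cot_prompt(exemplars, question):
    """Assemble a few-shot prompt whose exemplars show reasoning before the answer."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['reasoning']} The answer is {ex['answer']}."
        )
    # The final question is left open so the model continues with its own reasoning.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def extract_final_answer(completion):
    """Return the number after the last 'The answer is' marker, or None if absent."""
    matches = re.findall(r"The answer is\s+(-?[\d,]*\.?\d+)", completion)
    return matches[-1].replace(",", "") if matches else None
```

Because the reasoning is explicit in the completion, the final answer can be parsed separately from the intermediate steps, which also makes those steps available for inspection.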

prompting, engineering, optimization

Prompt Engineering

by AaaS

The foundational discipline of crafting effective prompts to elicit desired behaviors from language models. Covers system prompt design, instruction formatting, output structuring, temperature tuning, and iterative prompt refinement techniques.

76.5 B+

nlp, web-crawl, massive-scale

Common Crawl

by Common Crawl Foundation

One of the largest open repositories of web crawl data, maintained by the non-profit Common Crawl Foundation, with archives dating back to 2008 and new crawls released on a regular, roughly monthly schedule. It forms the raw data layer for many major language model pretraining pipelines, including GPT-3, LLaMA, PaLM, and Falcon, typically after quality filtering and deduplication.

76.4 B+

foundational, google, transformer

BERT

by Google

BERT (Bidirectional Encoder Representations from Transformers) is Google's landmark 2018 language model that introduced the bidirectional pre-training paradigm using masked language modeling and next sentence prediction. It revolutionized NLP by demonstrating that a single pre-trained model could achieve state-of-the-art results across dozens of downstream tasks with minimal fine-tuning.

76.3 B+

GSM8K

by OpenAI

Grade School Math 8K benchmark with 8,500 linguistically diverse grade school math word problems requiring 2-8 step reasoning. Tests basic mathematical reasoning and arithmetic with problems that require sequential multi-step solutions.

benchmark, evaluation, math

75.7 B+

benchmark, evaluation, mathematics

MATH

by UC Berkeley

Collection of 12,500 competition mathematics problems from AMC, AIME, and other math competitions covering algebra, geometry, number theory, combinatorics, and more. Problems require multi-step reasoning and mathematical insight beyond pattern matching.

74.4 B+

agi, abstract-reasoning, visual-patterns

ARC-AGI

by Chollet / ARC Prize Foundation

ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) measures fluid intelligence through visual grid transformation puzzles. Models must infer the transformation rule from a handful of demonstration pairs (typically around three) and apply it to a test grid, a task that most humans solve easily but that has historically been extremely difficult for AI systems.

74.1 B+

benchmark, evaluation, commonsense

HellaSwag

by Allen AI

Evaluates commonsense natural language inference by asking models to select the most plausible continuation of a scenario. Uses adversarially filtered endings generated by language models, making it challenging for machines while trivial for humans.

74 B+

prompting, few-shot, examples

Few-Shot Learning

by AaaS

Teaches LLMs to perform tasks by providing a small number of input-output examples in the prompt. Enables rapid task adaptation without fine-tuning by demonstrating the desired pattern through carefully selected, representative examples.

73.5 B+
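One way to picture the "carefully selected, representative examples" is the sketch below, which picks examples so that every label appears before any label repeats, then formats them as input-output pairs. The review pool, the sentiment task, and the helper names `select_examples` and `build_few_shot_prompt` are hypothetical; this is only one simple selection strategy among many.

```python
from collections import defaultdict
from itertools import zip_longest

# Hypothetical labelled pool used only for illustration.
POOL = [
    ("The battery lasts all day.", "positive"),
    ("Great screen, fast shipping.", "positive"),
    ("It broke after a week.", "negative"),
    ("Support never replied.", "negative"),
]


def select_examples(pool, k):
    """Pick up to k examples, cycling through labels so each one is represented."""
    by_label = defaultdict(list)
    for text, label in pool:
        by_label[label].append((text, label))
    # Interleave the per-label lists: first of each label, then second of each, ...
    interleaved = [
        example
        for group in zip_longest(*by_label.values())
        for example in group
        if example is not None
    ]
    return interleaved[:k]


def build_few_shot_prompt(examples, query):
    """Format the examples as input-output pairs and leave the query's output open."""
    blocks = ["Classify the sentiment of each review."]
    for text, label in examples:
        blocks.append(f"Review: {text}\nSentiment: {label}")
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)
```

The resulting string is sent to the model as-is; the task is adapted purely through the prompt, with no weight updates.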

inference, throughput, latency

MLPerf Inference

by MLCommons

MLPerf Inference is the industry-standard benchmark for measuring AI inference performance across hardware platforms. It covers image classification, object detection, NLP, speech recognition, and generative AI workloads, enabling fair apples-to-apples comparison of accelerators and inference stacks.

73.1 B+

benchmark, evaluation, science

ARC Challenge

by Allen AI

AI2 Reasoning Challenge featuring grade-school science questions that require commonsense reasoning and world knowledge. The Challenge set contains questions that simple retrieval and co-occurrence methods fail to answer correctly.

73.1 B+

BookCorpus

by University of Toronto

A dataset of over 11,000 unpublished books spanning fiction and non-fiction genres, originally scraped from Smashwords and used alongside English Wikipedia to pretrain BERT. It provides rich long-range dependency data that helps models learn coherent narrative structure and extended discourse patterns.

nlp, books, long-form

71.3 B+

summarization, condensation, nlp

Summarization

by AaaS

Condenses long documents into concise summaries while preserving key information and maintaining factual accuracy. Supports extractive, abstractive, and hierarchical summarization with configurable length, style, and focus area parameters.

69.8 B

rag, retrieval-augmented-generation, llm

RAG Retrieval

by AaaS

A technique that enhances large language models by dynamically retrieving relevant information from an external knowledge base. This process grounds the model's responses in factual data, reducing hallucinations and enabling it to answer questions about information not present in its original training data.

68.3 B
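The retrieve-then-ground flow can be sketched in a few lines. Everything here is an assumption for illustration: the knowledge base is a toy list of strings, the retriever ranks by simple word overlap as a stand-in for a real retriever, and `retrieve` and `build_grounded_prompt` are hypothetical helper names. No model is called.

```python
def tokenize(text):
    """Lowercase and split into a set of words, dropping basic punctuation."""
    return set(text.lower().replace("?", "").replace(".", "").split())


def retrieve(question, knowledge_base, k=1):
    """Rank passages by word overlap with the question (a stand-in for a real retriever)."""
    question_words = tokenize(question)
    ranked = sorted(
        knowledge_base,
        key=lambda passage: len(question_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def build_grounded_prompt(question, passages):
    """Place the retrieved passages in the prompt so the answer can cite them."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical knowledge base with facts the model was never trained on.
KNOWLEDGE_BASE = [
    "The warranty covers manufacturing defects for 24 months.",
    "Orders ship within 3 business days.",
]
```

The instruction to answer only from the supplied context is what ties the model's response to the retrieved passages rather than to its training data.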

search, embeddings, similarity

Semantic Search

by AaaS

Enables meaning-based retrieval by converting queries and documents into dense vector representations and finding nearest neighbors. Foundational skill for any RAG pipeline or knowledge-base-powered agent.

67.6 B
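As a rough sketch of the nearest-neighbor step, the code below ranks documents by cosine similarity to a query vector. The three-dimensional vectors are hand-made stand-ins for embeddings (a real system would obtain them from an embedding model and usually search with an approximate index), and the document IDs and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query_vec, index, k=2):
    """Return the IDs of the k documents most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in index.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy index: hand-made vectors standing in for document embeddings.
INDEX = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-rate-limits": [0.0, 0.1, 0.9],
}

# Toy query vector that points mostly in the "refund-policy" direction.
QUERY = [0.8, 0.3, 0.0]
```

With these toy values, `nearest_neighbors(QUERY, INDEX)` ranks `refund-policy` first and `shipping-times` second, since the query shares most of its direction with the first vector.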

OpenWebText

by Aaron Gokaslan and Vanya Cohen

OpenWebText is a large-scale, open-source English text corpus created by scraping web pages linked from Reddit. Designed as a public replication of OpenAI's original WebText dataset used for GPT-2, it contains approximately 38 GB of text filtered by Reddit upvotes to ensure a baseline of quality and relevance.

nlp, web-text, reddit

66.4 B

LAION-400M Text Captions

by LAION

The text caption component of the LAION-400M dataset, offering 400 million English alt-text captions. These captions were scraped from the web and filtered using CLIP to ensure a minimum similarity to their corresponding images. The text is used independently for large-scale NLP and multimodal research.

nlp, captions, image-text

66.3 B