LLMs
Large language models, fine-tuning, RAG, and inference
30 entities in this channel
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
by Google AI
Introduced BERT, a bidirectional Transformer pre-trained on masked language modeling and next sentence prediction. Established the pretrain-then-fine-tune paradigm that dominated NLP for years and achieved state-of-the-art on 11 NLP benchmarks.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
by Google Brain
Introduced chain-of-thought prompting, a simple technique of providing exemplars with step-by-step reasoning traces in few-shot prompts. This approach dramatically improves LLM performance on arithmetic, commonsense, and symbolic reasoning tasks, with the effect emerging at approximately 100B parameters.
Language Models are Few-Shot Learners (GPT-3)
by OpenAI
Introduced GPT-3, a 175B parameter language model demonstrating remarkable few-shot learning capabilities across diverse tasks. Showed that scaling model size dramatically improves in-context learning without gradient updates, reshaping the field.
GPT-4 Technical Report
by OpenAI
Technical report for GPT-4, OpenAI's multimodal large language model accepting image and text inputs. Demonstrates state-of-the-art performance on academic and professional benchmarks, including passing the bar exam in the top 10% of test takers.
Wikipedia Dump
by Wikimedia Foundation
The full text dump of Wikipedia articles available in over 300 languages, regularly updated and distributed by the Wikimedia Foundation. It is one of the most universally included components in language model pretraining pipelines due to its high factual density, editorial quality, and broad topical coverage.
Evaluating Large Language Models Trained on Code (Codex)
by OpenAI
Introduced Codex, a GPT language model fine-tuned on publicly available code from GitHub, and the HumanEval benchmark for measuring code synthesis from docstrings. Codex powers GitHub Copilot and represents a breakthrough in automated programming assistance.
GPT-5
by OpenAI
OpenAI's frontier model with advanced reasoning, native multimodal understanding, and robust function calling. Designed for complex enterprise workflows and agentic applications.
GPT-4o
by OpenAI
OpenAI's natively multimodal flagship model processing text, image, and audio inputs with a single unified architecture. Delivers GPT-4 Turbo-level intelligence at 2x speed and 50% lower cost, with breakthrough real-time voice capabilities.
LLaMA: Open and Efficient Foundation Language Models
by Meta AI
Introduced LLaMA, a collection of foundation language models ranging from 7B to 65B parameters, trained on publicly available datasets. Showed that smaller models trained on more tokens can match or exceed larger models, democratizing LLM research.
Claude 4
by Anthropic
Anthropic's most capable model featuring advanced reasoning, coding, and multimodal capabilities. Excels at complex analysis, agentic tasks, and extended thinking with industry-leading safety.
GPT-4
by OpenAI
OpenAI's breakthrough large language model that demonstrated a significant leap in reasoning and factual accuracy over GPT-3.5. Widely adopted across enterprise and developer workflows for code generation, analysis, and complex problem-solving.
Claude 3.5 Sonnet
by Anthropic
Anthropic's breakout model that surpassed Claude 3 Opus at Sonnet-tier pricing, setting new industry benchmarks for coding. Introduced computer use capability and became the most popular model on the API due to its exceptional intelligence-to-cost ratio.
Chain-of-Thought
by AaaS
Guides LLMs to produce step-by-step reasoning before arriving at a final answer. Dramatically improves performance on math, logic, and multi-step problems by making the model's reasoning process explicit and verifiable.
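A minimal sketch of a chain-of-thought prompt: one worked exemplar with explicit reasoning, followed by the new question. The exemplar and the `complete` callable are illustrative placeholders, not any particular provider's API.

```python
# One exemplar with step-by-step reasoning, then the question to solve.
# `complete` stands in for whatever text-generation call is in use (placeholder).
COT_PROMPT = """\
Q: A baker fills 5 trays with 12 muffins each, but 3 muffins burn. How many can be sold?
A: 5 trays x 12 muffins = 60 muffins. 60 - 3 burned = 57. The answer is 57.

Q: {question}
A:"""

def ask_with_cot(complete, question: str) -> str:
    # The exemplar sets the step-by-step format; the trailing "A:" invites the
    # model to reason before committing to a final answer.
    return complete(COT_PROMPT.format(question=question))
```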
Prompt Engineering
by AaaS
The foundational discipline of crafting effective prompts to elicit desired behaviors from language models. Covers system prompt design, instruction formatting, output structuring, temperature tuning, and iterative prompt refinement techniques.
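A short illustration of several of these elements in the common chat-message format: a system prompt constraining behavior, an explicit output structure, and a low temperature. The message schema and parameter names are generic assumptions, not a specific SDK.

```python
# Illustrative prompt structure only; adapt the message and parameter names to
# the provider SDK actually in use.
messages = [
    {
        "role": "system",
        "content": (
            "You are a careful assistant. Answer only from the provided text. "
            "If the answer is not present, reply exactly: UNKNOWN."
        ),
    },
    {
        "role": "user",
        "content": (
            "Text:\n<document goes here>\n\n"
            "Question: <question goes here>\n"
            'Respond as JSON: {"answer": "...", "supporting_quote": "..."}'
        ),
    },
]
params = {"temperature": 0.0, "max_tokens": 200}  # low temperature suits extraction-style tasks
```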
Common Crawl
by Common Crawl Foundation
The world's largest open repository of web crawl data, maintained by the non-profit Common Crawl Foundation and updated with new crawls monthly since 2011. It forms the foundational raw data layer for virtually every major language model pretraining pipeline including GPT-3, LLaMA, PaLM, and Falcon, typically after quality filtering and deduplication steps.
BERT
by Google
BERT (Bidirectional Encoder Representations from Transformers) is Google's landmark 2018 language model that introduced the bidirectional pre-training paradigm using masked language modeling and next sentence prediction. It revolutionized NLP by demonstrating that a single pre-trained model could achieve state-of-the-art results across dozens of downstream tasks with minimal fine-tuning.
GSM8K
by OpenAI
Grade School Math 8K benchmark with 8,500 linguistically diverse grade school math word problems requiring 2-8 step reasoning. Tests basic mathematical reasoning and arithmetic with problems that require sequential multi-step solutions.
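An illustrative GSM8K-style problem (made up for this description, not drawn from the dataset), with the kind of short sequential solution the benchmark expects:

```python
# "A library has 4 shelves with 28 books each. It lends out 37 books and then
#  receives a donation of 15. How many books does it hold now?"  (illustrative)
books = 4 * 28   # 112 books on the shelves
books -= 37      # 75 after lending
books += 15      # 90 after the donation
print(books)     # 90
```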
MATH
by UC Berkeley
Collection of 12,500 competition mathematics problems from AMC, AIME, and other math competitions covering algebra, geometry, number theory, combinatorics, and more. Problems require multi-step reasoning and mathematical insight beyond pattern matching.
ARC-AGI
by Chollet / ARC Prize Foundation
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) measures fluid intelligence through visual grid transformation puzzles. Models must infer transformation rules from three or fewer examples and apply them to a test grid — a task trivially solved by humans but historically extremely difficult for AI systems.
HellaSwag
by Allen AI
Evaluates commonsense natural language inference by asking models to select the most plausible continuation of a scenario. Uses adversarially filtered endings generated by language models, making it challenging for machines while trivial for humans.
Few-Shot Learning
by AaaS
Teaches LLMs to perform tasks by providing a small number of input-output examples in the prompt. Enables rapid task adaptation without fine-tuning by demonstrating the desired pattern through carefully selected, representative examples.
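A minimal sketch of a few-shot prompt for sentiment labeling: the input-output pairs define the task and format, and no gradient updates are involved. The example reviews are invented for illustration.

```python
# Three demonstrations establish the task; the final line leaves a slot for the
# model to complete. No fine-tuning is involved.
FEW_SHOT = """\
Review: "The battery dies within an hour." -> Sentiment: negative
Review: "Setup took two minutes and it just works." -> Sentiment: positive
Review: "Arrived on time, nothing special." -> Sentiment: neutral
Review: "{review}" -> Sentiment:"""

prompt = FEW_SHOT.format(review="Great screen, terrible speakers.")
```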
MLPerf Inference
by MLCommons
MLPerf Inference is the industry-standard benchmark for measuring AI inference performance across hardware platforms. It covers image classification, object detection, NLP, speech recognition, and generative AI workloads, enabling fair apples-to-apples comparison of accelerators and inference stacks.
ARC Challenge
by Allen AI
AI2 Reasoning Challenge featuring grade-school science questions that require commonsense reasoning and world knowledge. The Challenge set contains questions that simple retrieval and co-occurrence methods fail to answer correctly.
BookCorpus
by University of Toronto
A dataset of over 11,000 unpublished books spanning fiction and non-fiction genres, originally scraped from Smashwords and used as the primary pretraining corpus for BERT alongside Wikipedia. It provides rich long-range dependency data that helps models learn coherent narrative structure and extended discourse patterns.
Summarization
by AaaS
Condenses long documents into concise summaries while preserving key information and maintaining factual accuracy. Supports extractive, abstractive, and hierarchical summarization with configurable length, style, and focus area parameters.
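One way those parameters can surface in practice is as slots in a prompt template; the parameter names below are assumptions for illustration, not a fixed interface.

```python
# Hypothetical summarization template with length, style, and focus knobs.
SUMMARY_PROMPT = (
    "Summarize the document below in at most {max_sentences} sentences, "
    "in a {style} style, focusing on {focus}. Preserve the document's facts "
    "and do not add information that is not present.\n\n"
    "Document:\n{document}\n\nSummary:"
)

prompt = SUMMARY_PROMPT.format(
    max_sentences=3,
    style="neutral, plain-language",
    focus="key findings and caveats",
    document="<document text goes here>",
)
```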
RAG Retrieval
by AaaS
A technique that enhances large language models by dynamically retrieving relevant information from an external knowledge base. This process grounds the model's responses in factual data, reducing hallucinations and enabling it to answer questions about information not present in its original training data.
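A compact sketch of the retrieve-then-generate loop. The `retrieve` and `complete` callables are placeholders for a vector search function and an LLM call; one way to build `retrieve` is sketched under Semantic Search below.

```python
# Retrieval-augmented generation, assuming `retrieve(question, k)` returns the
# k most relevant text chunks and `complete(prompt)` calls an LLM (placeholders).
def rag_answer(question: str, retrieve, complete, k: int = 3) -> str:
    passages = retrieve(question, k)              # ground the answer in retrieved chunks
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return complete(prompt)
```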
Semantic Search
by AaaS
Enables meaning-based retrieval by converting queries and documents into dense vector representations and finding nearest neighbors. Foundational skill for any RAG pipeline or knowledge-base-powered agent.
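A minimal dense-retrieval sketch, assuming `embed` maps a string to a fixed-length vector (any sentence-embedding model); queries and documents share one vector space and ranking is by cosine similarity.

```python
import numpy as np

def semantic_search(query: str, documents: list[str], embed, k: int = 3) -> list[str]:
    # Embed and L2-normalize the corpus and the query so a dot product
    # equals cosine similarity.
    doc_vecs = np.stack([np.asarray(embed(d), dtype=float) for d in documents])
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    q = np.asarray(embed(query), dtype=float)
    q /= np.linalg.norm(q)
    scores = doc_vecs @ q                          # cosine similarity per document
    top = np.argsort(-scores)[:k]                  # indices of the nearest neighbors
    return [documents[i] for i in top]
```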
OpenWebText
by Gokaslan & Cohen
OpenWebText is a large-scale, open-source English text corpus created by scraping web pages linked from Reddit. Designed as a public replication of OpenAI's original WebText dataset used for GPT-2, it contains approximately 38 GB of text filtered by Reddit upvotes to ensure a baseline of quality and relevance.
LAION-400M Text Captions
by LAION
The text caption component of the LAION-400M dataset, offering 400 million English alt-text captions. These captions were scraped from the web and filtered using CLIP to ensure a minimum similarity to their corresponding images. The text is used independently for large-scale NLP and multimodal research.
SlimPajama
by Cerebras
SlimPajama is a cleaned and deduplicated version of the RedPajama dataset, containing 627 billion high-quality tokens. Produced by Cerebras, it demonstrates that training on fewer, higher-quality tokens can match or exceed the performance of models trained on larger, noisier datasets.