LLMs
Large language models, fine-tuning, RAG, and inference
25 entities indexed
Wikipedia Dump
by Wikimedia Foundation
The full text dump of Wikipedia articles, available in over 300 languages, regularly updated and distributed by the Wikimedia Foundation. It is one of the most commonly included components in language model pretraining pipelines due to its high factual density, editorial quality, and broad topical coverage.
Visual Instruction Tuning (LLaVA)
by University of Wisconsin–Madison / Microsoft Research
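Introduced LLaVA (Large Language and Vision Assistant), a multimodal model trained via visual instruction tuning using GPT-4-generated multimodal instruction-following data. LLaVA demonstrates strong multimodal chat abilities, reaching an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset, and achieves 92.53% accuracy on Science QA when fine-tuned and combined with GPT-4. It pioneered open-source visual instruction tuning.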
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
by Princeton University / Google DeepMind
Introduced Tree of Thoughts (ToT), a framework that generalizes chain-of-thought prompting to a tree search over intermediate reasoning steps. ToT enables LLMs to explore multiple reasoning paths, evaluate choices, and backtrack, substantially improving results on tasks requiring lookahead and planning. In the paper's Game of 24 experiment, GPT-4 solved 4% of tasks with chain-of-thought prompting and 74% with ToT.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
by Google Brain
Introduced Switch Transformers, a simplified mixture-of-experts (MoE) architecture that routes each token to exactly one expert (top-1 routing), enabling trillion-parameter models whose per-token compute stays roughly constant as experts are added. Switch Transformers report up to a 7x pretraining speedup over dense T5 baselines given the same computational resources, while maintaining model quality.
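The top-1 routing idea can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the toy experts, token vectors, and router logits are hypothetical, and the capacity factor and load-balancing loss used in practice are omitted.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top1_route(router_logits):
    """Return (expert_index, gate_probability) for each token."""
    routes = []
    for logits in router_logits:
        probs = softmax(logits)
        expert = max(range(len(probs)), key=probs.__getitem__)
        routes.append((expert, probs[expert]))
    return routes

def switch_layer(tokens, router_logits, experts):
    """Send each token to exactly one expert and scale the output by its gate."""
    outputs = []
    for token, (expert, gate) in zip(tokens, top1_route(router_logits)):
        expert_out = experts[expert](token)
        outputs.append([gate * v for v in expert_out])
    return outputs

# Hypothetical toy experts: simple elementwise functions standing in for FFNs.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
]
tokens = [[1.0, 2.0], [3.0, 4.0]]
router_logits = [[5.0, 0.0], [0.0, 5.0]]  # assumed router outputs, one row per token

routes = top1_route(router_logits)
outputs = switch_layer(tokens, router_logits, experts)
```

Because only the selected expert runs for each token, adding experts increases the parameter count without increasing the work done per token.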
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
by Princeton University
Introduced SWE-bench, a benchmark of 2,294 real GitHub issues from 12 popular Python repositories that requires models to resolve each issue by writing a code patch. In the original evaluation, even the best-performing LLMs resolved fewer than 4% of issues with standard techniques, motivating research into code agents.
Summarization
by AaaS
Condenses long documents into concise summaries while preserving key information and maintaining factual accuracy. Supports extractive, abstractive, and hierarchical summarization with configurable length, style, and focus area parameters.
Semantic Search
by AaaS
Enables meaning-based retrieval by converting queries and documents into dense vector representations and finding nearest neighbors. Foundational skill for any RAG pipeline or knowledge-base-powered agent.
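As a rough illustration of nearest-neighbor retrieval over dense vectors, the sketch below ranks documents by cosine similarity to a query. The three-dimensional vectors and document IDs are made up for the example; a real pipeline would obtain embeddings from an embedding model and would typically use an approximate nearest-neighbor index rather than a full scan.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest(query_vec, doc_vecs, k=2):
    """Return the IDs of the k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical precomputed document embeddings.
doc_vecs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-reference": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]  # hypothetical embedding of a refund-related query

top = nearest(query_vec, doc_vecs, k=2)
```

In a RAG pipeline, the text of the returned documents would then be placed into the model's prompt as context.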
TruthfulQA
by University of Oxford
Measures whether language models generate truthful answers to questions where humans are commonly mistaken. Covers health, law, finance, and politics topics where popular misconceptions and conspiracies create systematic failure modes.
WinoGrande
by Allen AI
Large-scale dataset for commonsense coreference resolution inspired by Winograd schemas. Tests whether models can correctly resolve pronoun references based on world knowledge and commonsense reasoning in carefully constructed sentence pairs.
Text Classification
by AaaS
Automates the categorization of text into predefined classes. This skill leverages large language models to perform zero-shot and multi-label classification, eliminating the need for extensive training data. It can analyze documents, user feedback, or social media posts, assigning relevant labels from a simple list or a complex hierarchical taxonomy.
TyDi QA
by Clark et al. / Google Research
TyDi QA is a multilingual question-answering benchmark featuring 11 typologically diverse languages. Questions are written natively by speakers of each language, ensuring genuine linguistic challenges and avoiding translation artifacts. It is designed to evaluate reading comprehension across a wide range of language structures.
SlimPajama
by Cerebras
SlimPajama is a cleaned and deduplicated version of the RedPajama dataset, containing 627 billion high-quality tokens. Produced by Cerebras, it demonstrates that training on fewer, higher-quality tokens can match or exceed the performance of models trained on larger, noisier datasets.
OpenWebText
by Aaron Gokaslan and Vanya Cohen (Brown University)
OpenWebText is a large-scale, open-source English text corpus created by scraping web pages linked from Reddit. Designed as a public replication of OpenAI's original WebText dataset used for GPT-2, it contains approximately 38 GB of text filtered by Reddit upvotes to ensure a baseline of quality and relevance.
Translation
by AaaS
Provides the ability to translate text from a source language to a target language. It aims to preserve the original meaning, tone, and cultural context. The skill supports domain-specific terminology for fields like legal or medical, allows for register control between formal and informal language, and handles idiomatic expressions with contextually appropriate equivalents.
LAION-400M Text Captions
by LAION
The text caption component of the LAION-400M dataset, offering 400 million English alt-text captions. These captions were scraped from the web and filtered using CLIP to ensure a minimum similarity to their corresponding images. The text is used independently for large-scale NLP and multimodal research.
XL-Sum
by Hasan et al. / Bangladesh University of Engineering and Technology (BUET)
XL-Sum is a large-scale benchmark dataset for multilingual abstractive summarization. It contains 1.35 million article-summary pairs from BBC News across 44 languages, designed to evaluate a model's ability to generate concise summaries across diverse linguistic families and writing systems.
SimpleQA
by OpenAI
SimpleQA is a benchmark dataset developed by OpenAI to assess the factual accuracy of language models. It consists of simple, unambiguous questions that have a single, verifiable correct answer. The benchmark is designed to measure a model's ability to recall factual knowledge and, crucially, to abstain from answering when it is uncertain, providing a measure of its calibration.
PushShift Reddit Dataset
by PushShift.io
A massive, multi-billion token archive of Reddit comments and submissions from 2005 to 2023, collected by the PushShift project. This dataset is a cornerstone for social NLP research, large-scale language model pre-training, and studying the dynamics of online communities and conversational discourse.
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
by Idiap Research Institute / EPFL
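Shows that replacing softmax attention with a dot product of kernel feature maps lets a causal transformer be written as a linear RNN. This reduces attention cost from quadratic to linear in sequence length and allows autoregressive generation with constant time and memory per generated token. Introduces the linear attention framework that inspired many subsequent efficient attention variants.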
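The equivalence between the attention form and the recurrent form can be checked numerically. The sketch below is a pure-Python illustration rather than the authors' code: it uses the elu(x) + 1 feature map and small made-up inputs, and compares a causal attention computation that revisits all earlier positions against a recurrence that only keeps two running sums.

```python
import math

def phi(x):
    # elu(x) + 1 applied elementwise; every feature stays positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention_quadratic(queries, keys, values):
    """Causal linear attention computed by revisiting every earlier position."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(keys[s]))) for s in range(t + 1)]
        total = sum(weights)
        outputs.append([sum(w * values[s][j] for s, w in enumerate(weights)) / total
                        for j in range(d_v)])
    return outputs

def linear_attention_recurrent(queries, keys, values):
    """Same computation as an RNN with a fixed-size state (S, z)."""
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        for i in range(d_k):
            z[i] += fk[i]
            for j in range(d_v):
                S[i][j] += fk[i] * v[j]
        fq = phi(q)
        denom = sum(fq[i] * z[i] for i in range(d_k))
        outputs.append([sum(fq[i] * S[i][j] for i in range(d_k)) / denom
                        for j in range(d_v)])
    return outputs

# Made-up inputs: three positions, key dimension 2, value dimension 2.
queries = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
keys = [[0.1, 0.4], [-0.7, 1.2], [0.9, -0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]

quadratic = linear_attention_quadratic(queries, keys, values)
recurrent = linear_attention_recurrent(queries, keys, values)
```

The recurrent version does the same amount of work at every step regardless of how many tokens came before, which is where the constant per-token cost comes from.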
Tree-of-Thought
by AaaS
Extends chain-of-thought prompting by exploring multiple reasoning paths and organizing them as a branching tree. The model generates, evaluates, and prunes candidate solutions using breadth-first or depth-first search strategies, which helps on problems where a single linear chain of reasoning tends to fail.
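The generate, evaluate, and prune loop can be sketched as a breadth-first search with a fixed beam width. This is only an illustration under stated assumptions: the functions propose, score, and is_solution are hypothetical stand-ins for what would be LLM calls, and the toy task (pick steps from a fixed set that sum to a target) is invented for the example.

```python
def tree_of_thought_bfs(root, propose, score, is_solution, beam_width=2, max_depth=4):
    """Breadth-first search that keeps only the best `beam_width` states per level."""
    frontier = [root]
    for _ in range(max_depth):
        # Generate: expand every state in the frontier into candidate next states.
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            return None
        for state in candidates:
            if is_solution(state):
                return state
        # Evaluate and prune: keep the highest-scoring candidates.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return None

TARGET = 10
STEPS = (1, 3, 4)

def propose(state):
    # Stand-in for asking a model for next thoughts: extend by one allowed step.
    return [state + (s,) for s in STEPS if sum(state) + s <= TARGET]

def score(state):
    # Stand-in for a model-based value estimate: closer to the target is better.
    return -abs(TARGET - sum(state))

def is_solution(state):
    return sum(state) == TARGET

solution = tree_of_thought_bfs((), propose, score, is_solution)
```

A depth-first variant would instead follow one branch until its score drops below a threshold and then backtrack to the most recent unexplored sibling.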
CodeLlama-70B-Instruct-v2
by Meta AI
An advanced instruction-tuned large language model specifically designed for code generation, explanation, and debugging across multiple programming languages. This version offers improved performance and reduced hallucination rates compared to its predecessor.
Whisper-Large-v4-Multilingual
by OpenAI
An updated version of OpenAI's Whisper model with expanded language support and improved accuracy for speech-to-text transcription.
ProteinFold-Mini-v1
by BioCompute AI
A compact yet effective model for accelerated protein structure prediction, leveraging recent advancements in AI for bioinformatics. Ideal for initial screening and rapid hypothesis generation in drug discovery.
RLHF-Guard-7B
by Anthropic
A small, efficient model specifically designed for red-teaming and safety alignment of larger language models, identifying harmful outputs.
RoboAction-VLM-8B
by Google DeepMind
An embodied AI model integrating visual perception with language understanding to enable complex robotic task planning and execution. Designed for general-purpose robotic manipulation in unstructured environments.