Self-Instruct
by University of Washington · open-source · Last verified 2026-03-17
Self-Instruct is the foundational instruction-tuning dataset and methodology introduced by Wang et al. (2022), in which 175 human-written seed tasks are iteratively expanded into roughly 52,000 machine-generated instructions paired with input-output instances, using GPT-3 as the generator. It established the paradigm of bootstrapping instruction data from existing LLMs and directly inspired Alpaca, WizardLM, and most subsequent synthetic alignment datasets.
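The bootstrapping loop can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `generate` stands in for the GPT-3 completion call, the demonstration sampling is simplified (the paper mixes 6 human-written and 2 model-generated tasks per prompt), and the ROUGE-L < 0.7 novelty filter matches the one described in the paper.

```python
import random


def rouge_l(a: str, b: str) -> float:
    """ROUGE-L F1 over whitespace tokens, via longest common subsequence."""
    x, y = a.split(), b.split()
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x[i] == y[j] else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall)


def bootstrap(seed_tasks, generate, target=52000, sim_threshold=0.7, n_demos=8):
    """Grow an instruction pool from seed tasks.

    `generate` is a hypothetical callable that takes a list of demonstration
    instructions and returns one new candidate instruction (in the paper,
    a GPT-3 completion). A candidate joins the pool only if its ROUGE-L
    similarity to every existing instruction is below `sim_threshold`.
    """
    pool = list(seed_tasks)
    while len(pool) < target:
        demos = random.sample(pool, min(n_demos, len(pool)))
        candidate = generate(demos)
        if candidate and all(rouge_l(candidate, t) < sim_threshold for t in pool):
            pool.append(candidate)
    return pool
```

In the full pipeline each accepted instruction is then classified (classification vs. non-classification), instantiated with inputs and outputs, and filtered again before being used for supervised fine-tuning; this sketch covers only the instruction-growth step.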
https://github.com/yizhongw/self-instruct
Overall grade: B (Above Average)
Adoption: B+ · Quality: B+ · Freshness: B · Citations: A+ · Engagement: F
Specifications
- License
- Apache 2.0
- Pricing
- open-source
- Capabilities
- instruction-tuning, data-generation, self-play
- Integrations
- huggingface-datasets
- Use Cases
- sft-training, instruction-data-generation, alignment-research
- API Available
- No
- Tags
- instruction-tuning, self-play, seed-tasks, gpt-3, alignment
- Added
- 2026-03-17
- Completeness
- 100%
Index Score: 69.8
- Adoption: 78
- Quality: 78
- Freshness: 60
- Citations: 92
- Engagement: 0