Self-Instruct
by University of Washington · open-source · Last verified 2026-03-17
Self-Instruct is the foundational instruction-tuning dataset and methodology introduced by Wang et al. (2022), in which 175 human-written seed tasks are iteratively expanded into roughly 52,000 machine-generated instructions paired with input-output instances, using GPT-3 as the generator. It established the paradigm of bootstrapping instruction data from existing LLMs and directly inspired Alpaca, WizardLM, and most subsequent synthetic alignment datasets.
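The bootstrapping loop can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `generate` stands in for the GPT-3 completion call, the demonstration sampling is simplified (the paper mixes 6 human-written and 2 model-generated tasks per prompt), and the ROUGE-L < 0.7 novelty filter matches the one described in the paper.

```python
import random


def rouge_l(a: str, b: str) -> float:
    """ROUGE-L F1 over whitespace tokens, via longest common subsequence."""
    x, y = a.split(), b.split()
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x[i] == y[j] else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall)


def bootstrap(seed_tasks, generate, target=52000, sim_threshold=0.7, n_demos=8):
    """Grow an instruction pool from seed tasks.

    `generate` is a hypothetical callable that takes a list of demonstration
    instructions and returns one new candidate instruction (in the paper,
    a GPT-3 completion). A candidate joins the pool only if its ROUGE-L
    similarity to every existing instruction is below `sim_threshold`.
    """
    pool = list(seed_tasks)
    while len(pool) < target:
        demos = random.sample(pool, min(n_demos, len(pool)))
        candidate = generate(demos)
        if candidate and all(rouge_l(candidate, t) < sim_threshold for t in pool):
            pool.append(candidate)
    return pool
```

In the full pipeline each accepted instruction is then classified (classification vs. non-classification), instantiated with inputs and outputs, and filtered again before being used for supervised fine-tuning; this sketch covers only the instruction-growth step.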
https://github.com/yizhongw/self-instruct
Overall grade: B (Above Average)
Adoption: B+ · Quality: B+ · Freshness: B · Citations: A+ · Engagement: F
Specifications
- License
- Apache 2.0
- Pricing
- open-source
- Capabilities
- instruction-tuning, data-generation, self-play
- Integrations
- huggingface-datasets
- Use Cases
- sft-training, instruction-data-generation, alignment-research
- API Available
- No
- Tags
- instruction-tuning, self-play, seed-tasks, gpt-3, alignment
- Added
- 2026-03-17
- Completeness
- 100%
Index Score: 69.8
- Adoption: 78
- Quality: 78
- Freshness: 60
- Citations: 92
- Engagement: 0