StarCoderData
by BigCode · open-source · Last verified 2026-03-17
The roughly 250-billion-token (783 GB) code dataset used to pretrain the StarCoder family of models, assembled by BigCode from The Stack v1 and spanning 86 programming languages under permissive licenses. Alongside source files it includes GitHub issues, Git commits, and Jupyter notebook data, so models learn from developer workflows rather than from static code alone.
https://huggingface.co/datasets/bigcode/starcoderdata ↗
Overall grade: B (Above Average)
- Adoption: B+
- Quality: A
- Freshness: B+
- Citations: A
- Engagement: F
Specifications
- License: Apache-2.0
- Pricing: open-source
- Capabilities: code-generation, pretraining, fill-in-the-middle
- Integrations: hugging-face
- Use Cases: code-model-pretraining, research
- API Available: Yes
- Tags: code, pretraining, github, permissive-license, bigcode
- Added: 2026-03-17
- Completeness: 100%
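Since fill-in-the-middle appears among the capabilities, here is a minimal sketch of how a source file can be turned into a FIM pretraining example using the sentinel tokens from the StarCoder tokenizer (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`). The random split points and the prefix-suffix-middle ordering shown here are illustrative assumptions, not the exact BigCode preprocessing pipeline.

```python
import random

def to_fim_psm(document: str, rng: random.Random) -> str:
    """Split a document at two random points and emit it in
    prefix-suffix-middle (PSM) order with StarCoder FIM sentinels.

    Note: split-point sampling here is a simplified assumption;
    the real pipeline operates on tokens, not characters.
    """
    # Pick two distinct cut points, sorted so prefix < middle < suffix.
    a, b = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    # PSM order: the model sees prefix and suffix, then predicts the middle.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

example = to_fim_psm("def add(a, b):\n    return a + b\n", random.Random(0))
```

Because the transformation only reorders the document around sentinel tokens, the original text can always be reconstructed by moving the middle segment back between prefix and suffix.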
Index Score: 69.7
- Adoption: 79
- Quality: 88
- Freshness: 70
- Citations: 82
- Engagement: 0