
StarCoderData

by BigCode · open-source · Last verified 2026-03-17

The 783 GB (roughly 250 billion tokens) code dataset used to pretrain the StarCoder family of models, assembled by BigCode from The Stack v1. It spans 86 programming languages, all under permissive licenses, and includes GitHub issues, Git commits, and Jupyter notebook data alongside source files, so models learn from developer workflows rather than static code alone.

https://huggingface.co/datasets/bigcode/starcoderdata
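The dataset is hosted on the Hugging Face Hub and organized into per-language subdirectories, so a single language subset can be streamed without downloading the full corpus. A minimal sketch, assuming the standard `datasets` library API; the subdirectory name `"python"` and the `"content"` column are assumptions taken from the dataset card conventions, so check the card for the exact layout:

```python
def starcoderdata_kwargs(language: str) -> dict:
    """Build load_dataset arguments for one language subset of StarCoderData."""
    return {
        "path": "bigcode/starcoderdata",
        "data_dir": language,   # per-language subdirectory, e.g. "python"
        "split": "train",
        "streaming": True,      # iterate lazily instead of downloading ~783 GB
    }

# Usage (requires `pip install datasets` and network access):
# from datasets import load_dataset
# ds = load_dataset(**starcoderdata_kwargs("python"))
# for row in ds.take(3):
#     print(row["content"][:80])
```

Streaming mode returns an iterable dataset, which is the practical choice here given the corpus size.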
Overall Grade: B (Above Average)
Adoption: B+ · Quality: A · Freshness: B+ · Citations: A · Engagement: F

Specifications

License: Apache-2.0
Pricing: open-source
Capabilities: code-generation, pretraining, fill-in-the-middle
Integrations: hugging-face
Use Cases: code-model-pretraining, research
API Available: Yes
Tags: code, pretraining, github, permissive-license, bigcode
Added: 2026-03-17
Completeness: 100%
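The fill-in-the-middle capability listed above refers to the FIM training objective used for the StarCoder models: a document is split into prefix, middle, and suffix, then rearranged with sentinel tokens so the model learns to infill code. A minimal sketch of the prefix-suffix-middle (PSM) transform; the sentinel token strings match those published for StarCoder, while the random split logic here is illustrative:

```python
import random

# StarCoder's FIM sentinel tokens
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

def to_fim_psm(code: str, rng: random.Random) -> str:
    """Split `code` at two random points and rearrange into
    prefix-suffix-middle (PSM) order with sentinel tokens."""
    a, b = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    # The model conditions on prefix and suffix, then predicts the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

sample = to_fim_psm("def add(a, b):\n    return a + b\n", random.Random(0))
```

Because the transform only reorders the text, the original document can always be reconstructed from a FIM-formatted sample, which is what makes it usable as a pretraining objective without losing data.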

Index Score: 69.7

Adoption: 79
Quality: 88
Freshness: 70
Citations: 82
Engagement: 0
