StarCoderData
by BigCode · open-source · Last verified 2026-03-17
The roughly 250-billion-token (783 GB) code dataset used to pretrain the StarCoder family of models, assembled by BigCode from The Stack v1 and spanning 86 programming languages under permissive licenses. Alongside source files it includes GitHub issues, Git commits, and Jupyter notebook data, so models learn from developer workflows rather than from static code alone.
https://huggingface.co/datasets/bigcode/starcoderdata ↗
Overall grade: B (Above Average)
- Adoption: B+
- Quality: A
- Freshness: B+
- Citations: A
- Engagement: F
Specifications
- License: Apache-2.0
- Pricing: open-source
- Capabilities: code-generation, pretraining, fill-in-the-middle
- Integrations: hugging-face
- Use Cases: code-model-pretraining, research
- API Available: Yes
- Tags: code, pretraining, github, permissive-license, bigcode
- Added: 2026-03-17
- Completeness: 100%
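Since fill-in-the-middle appears among the capabilities, here is a minimal sketch of how a source file can be turned into a FIM pretraining example using the sentinel tokens from the StarCoder tokenizer (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`). The random split points and the prefix-suffix-middle ordering shown here are illustrative assumptions, not the exact BigCode preprocessing pipeline.

```python
import random

def to_fim_psm(document: str, rng: random.Random) -> str:
    """Split a document at two random points and emit it in
    prefix-suffix-middle (PSM) order with StarCoder FIM sentinels.

    Note: split-point sampling here is a simplified assumption;
    the real pipeline operates on tokens, not characters.
    """
    # Pick two distinct cut points, sorted so prefix < middle < suffix.
    a, b = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    # PSM order: the model sees prefix and suffix, then predicts the middle.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

example = to_fim_psm("def add(a, b):\n    return a + b\n", random.Random(0))
```

Because the transformation only reorders the document around sentinel tokens, the original text can always be reconstructed by moving the middle segment back between prefix and suffix.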
Index Score: 69.7
- Adoption: 79
- Quality: 88
- Freshness: 70
- Citations: 82
- Engagement: 0