GitHub Code Dataset
by Hugging Face / BigCode · open-source · Last verified 2026-03-17
A large multilingual code dataset extracted from public GitHub repositories covering 32 programming languages, distributed through Hugging Face as part of the BigCode initiative. It serves as a versatile baseline for code model training, supporting language-specific subsetting and providing raw source files across a diverse range of domains including web development, systems programming, and data science.
https://huggingface.co/datasets/codeparrot/github-code
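The language-specific subsetting mentioned above might look roughly like the following sketch. It assumes the Hugging Face `datasets` library is installed, that the dataset's loader accepts a `languages` keyword, and that each record carries a `language` field, as described on the dataset card. The helper names `stream_github_code` and `take_matching` are hypothetical and not part of any library. Recent `datasets` releases may require `trust_remote_code=True` for script-based loaders or may not support them at all, so treat the loading call as illustrative.

```python
from __future__ import annotations

from itertools import islice
from typing import Iterable


def stream_github_code(languages: list[str]) -> Iterable[dict]:
    """Return a streaming view of the dataset restricted to `languages`.

    Hypothetical helper: needs network access and the `datasets` package.
    """
    # Imported lazily so take_matching() below works without `datasets`.
    from datasets import load_dataset

    # Assumption: `languages` is a loader-specific keyword taken from the
    # dataset card; it is not a general load_dataset() parameter.
    return load_dataset(
        "codeparrot/github-code",
        split="train",
        streaming=True,  # avoid downloading the full dataset up front
        languages=languages,
    )


def take_matching(
    records: Iterable[dict], languages: list[str], limit: int
) -> list[dict]:
    """Return up to `limit` records whose `language` field is in `languages`.

    Works on any iterable of dicts, so it can be tried on in-memory samples.
    """
    wanted = {lang.lower() for lang in languages}
    matches = (
        rec for rec in records
        if str(rec.get("language", "")).lower() in wanted
    )
    return list(islice(matches, limit))
```

With network access, something like `take_matching(stream_github_code(["Python"]), ["Python"], limit=5)` would pull a handful of Python files. The second filter is redundant when the loader already subsets by language, but it keeps the helper usable on any record stream.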
Overall grade: B (Above Average)
Adoption: B+ · Quality: B+ · Freshness: C+ · Citations: B+ · Engagement: F
Specifications
- License: Various
- Pricing: open-source
- Capabilities: code-generation, pretraining, multilingual-code
- Integrations: hugging-face
- Use Cases: code-model-pretraining, multilingual-code-research
- API Available: Yes
- Tags: code, multilingual-code, github, pretraining, large-scale
- Added: 2026-03-17
- Completeness: 100%
Index Score: 63.6
- Adoption: 75
- Quality: 78
- Freshness: 55
- Citations: 72
- Engagement: 0