Dataset · AI for Code · v1.0

GitHub Code Dataset

by Hugging Face / BigCode · open-source · Last verified 2026-03-17

A large multilingual code dataset extracted from public GitHub repositories, covering 32 programming languages and distributed through Hugging Face as part of the BigCode initiative. It serves as a baseline for training code models: it supports language-specific subsetting and provides raw source files from domains including web development, systems programming, and data science.

https://huggingface.co/datasets/codeparrot/github-code
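
The language-specific subsetting mentioned above can be illustrated with a minimal sketch. It assumes each record is a dictionary with `language`, `path`, and `code` fields; those field names, the helper `filter_by_language`, and the sample records are illustrative assumptions, not a confirmed schema. In practice the records would probably be streamed from the Hugging Face `datasets` library rather than defined inline.

```python
from typing import Dict, Iterable, Iterator


def filter_by_language(
    records: Iterable[Dict[str, str]], languages: Iterable[str]
) -> Iterator[Dict[str, str]]:
    """Lazily yield only the records whose `language` field is in `languages`.

    Works on any iterable of dict-like records, so it can wrap a streamed
    dataset without loading everything into memory.
    """
    wanted = set(languages)
    for record in records:
        if record.get("language") in wanted:
            yield record


# Hypothetical sample records standing in for rows of the dataset.
# The field names are assumptions made for this illustration.
sample_records = [
    {"path": "app/main.py", "language": "Python", "code": "print('hi')\n"},
    {"path": "src/lib.rs", "language": "Rust", "code": "fn main() {}\n"},
    {"path": "web/index.js", "language": "JavaScript", "code": "console.log(1);\n"},
]

python_only = list(filter_by_language(sample_records, ["Python"]))
print([r["path"] for r in python_only])
```

Because the helper is a generator, it filters one record at a time, which matters for a dataset too large to hold in memory.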
Grade: B (Above Average)
Adoption: B+ · Quality: B+ · Freshness: C+ · Citations: B+ · Engagement: F

Specifications

License: Various
Pricing: open-source
Capabilities: code-generation, pretraining, multilingual-code
Integrations: hugging-face
Use Cases: code-model-pretraining, multilingual-code-research
API Available: Yes
Tags: code, multilingual-code, github, pretraining, large-scale
Added: 2026-03-17
Completeness: 100%

Index Score

63.6
Adoption: 75
Quality: 78
Freshness: 55
Citations: 72
Engagement: 0
