GitHub Code Dataset
by Hugging Face / BigCode · free · Last verified 2026-03-17
The GitHub Code Dataset is a large, multilingual collection of source code from public GitHub repositories: roughly 115 million files (about 1 TB) across 32 programming languages. Hosted on the Hugging Face Hub as codeparrot/github-code, it is aimed at pretraining large language models for code tasks ranging from generation to analysis.
https://huggingface.co/datasets/codeparrot/github-code
B: Above Average
Adoption: B+ · Quality: B+ · Freshness: C+ · Citations: B+ · Engagement: F
Specifications
- License: Various
- Pricing: free
- Capabilities: Large-scale model pretraining, Code generation and synthesis, Code completion and suggestion, Cross-language code translation, Code summarization and documentation generation, Bug detection and code repair, Code search and retrieval, Language-specific model fine-tuning
- API Available: Yes
- Tags: code, multilingual-code, github, pretraining, large-scale, source-code, code-generation, hugging-face, bigcode, code-completion, code-analysis
- Added: 2026-03-17
- Completeness: 0.85%
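Because the license is listed as "Various", each file carries its own license, so a common first step is to stream the data and keep only the languages and licenses you need. The sketch below is one way to do that, not an official recipe. It assumes the Hugging Face `datasets` library is installed, that records expose `language` and `license` fields, and that the loader accepts a `languages` option as described on the dataset card. The helper names `stream_github_code` and `filter_records` are hypothetical, and license strings such as "mit" should be checked against the actual data.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Set


def stream_github_code(languages: Optional[List[str]] = None):
    """Open codeparrot/github-code in streaming mode (needs network access)."""
    # Imported lazily so the pure filtering helper below works without the library.
    from datasets import load_dataset

    kwargs = {"streaming": True, "split": "train"}
    if languages:
        # Assumption: the dataset's loader accepts a `languages` filter,
        # as described on its dataset card.
        kwargs["languages"] = languages
    return load_dataset("codeparrot/github-code", **kwargs)


def filter_records(
    records: Iterable[Dict],
    languages: Optional[Set[str]] = None,
    licenses: Optional[Set[str]] = None,
) -> Iterator[Dict]:
    """Lazily yield records whose language and license are in the allowed sets.

    A filter left as None is not applied.
    """
    for record in records:
        if languages is not None and record.get("language") not in languages:
            continue
        if licenses is not None and record.get("license") not in licenses:
            continue
        yield record
```

A typical call would be `filter_records(stream_github_code(), languages={"Python"}, licenses={"mit"})`, which pulls records on demand instead of downloading the full dataset. Depending on the installed `datasets` version, script-based datasets may need extra arguments or may not load at all; this sketch does not handle that case.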
Index Score: 63.6
- Adoption: 75
- Quality: 78
- Freshness: 55
- Citations: 72
- Engagement: 0