Dataset · AI for Code · v1.0

GitHub Code Dataset

by Hugging Face / BigCode · free · Last verified 2026-03-17

The GitHub Code Dataset is a large multilingual collection of source code from public GitHub repositories, spanning 32 programming languages. Distributed via the Hugging Face Hub (at codeparrot/github-code) and listed here under the BigCode project, it is a resource for pretraining large language models on code tasks ranging from generation to analysis.

https://huggingface.co/datasets/codeparrot/github-code
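A minimal sketch of how the dataset at the URL above might be streamed and filtered by language with the Hugging Face `datasets` library. The helper names `stream_records` and `filter_records` are hypothetical, and the `language` and `code` field names are assumptions about the record schema that should be checked against the dataset card. Streaming avoids downloading the full corpus up front.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

DATASET_ID = "codeparrot/github-code"  # repository id taken from the URL above


def stream_records() -> Iterator[Dict]:
    """Lazily stream records from the Hub (needs network access and `datasets`)."""
    # Imported here so the offline helper below works without the library installed.
    from datasets import load_dataset

    # Assumption: depending on the installed `datasets` version, a dataset that
    # ships a loading script may need trust_remote_code=True or may be unsupported.
    return iter(load_dataset(DATASET_ID, split="train", streaming=True))


def filter_records(
    records: Iterable[Dict], languages: List[str], max_items: int
) -> List[Dict]:
    """Return up to `max_items` records whose `language` field is in `languages`."""
    wanted = {lang.lower() for lang in languages}
    # The "language" key is an assumed field name; records without it are skipped.
    matches = (r for r in records if str(r.get("language", "")).lower() in wanted)
    return list(islice(matches, max_items))
```

With network access, `filter_records(stream_records(), ["Python"], 3)` would pull the first three matching records. Because the filter is a plain function over any iterable of dicts, it can be tried on hand-written samples first.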
Grade: B (Above Average)
Adoption: B+ · Quality: B+ · Freshness: C+ · Citations: B+ · Engagement: F

Specifications

License: Various
Pricing: free
Capabilities: Large-scale model pretraining, Code generation and synthesis, Code completion and suggestion, Cross-language code translation, Code summarization and documentation generation, Bug detection and code repair, Code search and retrieval, Language-specific model fine-tuning
API Available: Yes
Tags: code, multilingual-code, github, pretraining, large-scale, source-code, code-generation, hugging-face, bigcode, code-completion, code-analysis
Added: 2026-03-17
Completeness: 0.85%

Index Score

Overall: 63.6
Adoption: 75
Quality: 78
Freshness: 55
Citations: 72
Engagement: 0
