Dataset · AI for Code · v1.0

GitHub Code Dataset

by Hugging Face / BigCode · free · Last verified 2026-03-17

The GitHub Code Dataset is a large, multilingual collection of source code from public GitHub repositories, spanning 32 programming languages. Distributed via Hugging Face under the BigCode project, it is a resource for pretraining large language models on code-related tasks ranging from generation to analysis.

https://huggingface.co/datasets/codeparrot/github-code

Overall grade: C (Below Average)

Adoption: B+ · Quality: B+ · Freshness: C+ · Citations: F · Engagement: F

Specifications

License: Various
Pricing: free
Capabilities: Large-scale model pretraining, Code generation and synthesis, Code completion and suggestion, Cross-language code translation, Code summarization and documentation generation, Bug detection and code repair, Code search and retrieval, Language-specific model fine-tuning
API Available: Yes
Tags: code, multilingual-code, github, pretraining, large-scale, source-code, code-generation, hugging-face, bigcode, code-completion, code-analysis
Added: 2026-03-17
Completeness: 85%
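
Because the license is listed as "Various", a pretraining pipeline will usually want to filter samples by language and license before use. The following is a minimal sketch of that step, not an official loader. It assumes each record carries `language` and `license` fields with values such as "Python" and "mit" (as described on the dataset card), the `PERMISSIVE` set is an illustrative choice, and the helper names `keep_permissive` and `stream_github_code` are hypothetical.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator

# Illustrative set of license identifiers; adjust to your own policy.
PERMISSIVE = frozenset({"mit", "apache-2.0", "bsd-3-clause"})


def keep_permissive(
    records: Iterable[Dict],
    languages: Iterable[str],
    licenses: Iterable[str] = PERMISSIVE,
) -> Iterator[Dict]:
    """Yield records whose language and license both match (case-insensitive).

    Assumes each record has 'language' and 'license' keys; records missing
    either key are skipped.
    """
    wanted_langs = {name.lower() for name in languages}
    wanted_licenses = {name.lower() for name in licenses}
    for record in records:
        lang = str(record.get("language", "")).lower()
        lic = str(record.get("license", "")).lower()
        if lang in wanted_langs and lic in wanted_licenses:
            yield record


def stream_github_code(limit: int = 5) -> Iterator[Dict]:
    """Stream the first `limit` records without downloading the full dataset.

    Requires `pip install datasets` and network access. Depending on the
    installed `datasets` version, script-based datasets like this one may
    need extra opt-in arguments or may not load at all; treat this call as
    an assumption to verify against the dataset card.
    """
    from datasets import load_dataset  # imported lazily; optional dependency

    stream = load_dataset("codeparrot/github-code", split="train", streaming=True)
    return islice(stream, limit)
```

A typical use would be `keep_permissive(stream_github_code(1000), ["Python"])`. The filter is kept separate from the loader so it can be checked on a few hand-written records before touching the network.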
