ArXiv Papers Dataset
by Cornell University / arXiv · open-source · Last verified 2026-03-17
The ArXiv Papers Dataset is a bulk export of over 2.3 million scientific preprints from arXiv spanning physics, mathematics, computer science, quantitative biology, quantitative finance, and economics, provided by Cornell University and hosted on Kaggle and AWS S3. Its full-text LaTeX source and parsed metadata make it a widely used pretraining corpus for scientific language models and a common basis for citation-network research.
https://arxiv.org/abs/2302.02299
Grade: B+ (Good)
Adoption: A · Quality: A · Freshness: A+ · Citations: A · Engagement: F
Specifications
- License: CC0 1.0
- Pricing: open-source
- Capabilities: scientific-text-retrieval, citation-analysis, topic-modeling
- Integrations: huggingface-datasets, s3
- Use Cases: scientific-lm-pretraining, information-extraction, knowledge-graph-construction
- API Available: Yes
- Tags: scientific-papers, preprints, nlp, pretraining, multi-domain
- Added: 2026-03-17
- Completeness: 100%
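As a rough illustration of working with the parsed metadata across domains, the sketch below tallies papers per top-level arXiv archive. It assumes the metadata is distributed as JSON Lines (one JSON object per line) with a space-separated `categories` field; the sample records and the helper name `count_top_level_categories` are hypothetical and only stand in for a real snapshot file.

```python
import json
from collections import Counter

# Hypothetical sample records. The field names ("id", "title", "categories")
# are an assumption about the metadata layout and should be checked against
# the snapshot you actually download.
SAMPLE_LINES = [
    '{"id": "0001.0001", "title": "Sample A", "categories": "cs.CL cs.LG"}',
    '{"id": "0001.0002", "title": "Sample B", "categories": "math.CO"}',
    '{"id": "0001.0003", "title": "Sample C", "categories": "cs.CL q-fin.ST"}',
]


def count_top_level_categories(lines):
    """Count papers per top-level archive (e.g. "cs" from "cs.CL")."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # A paper may list several categories; collapse them into a set of
        # archives so each paper is counted at most once per archive.
        archives = {cat.split(".")[0] for cat in record.get("categories", "").split()}
        counts.update(archives)
    return counts


category_counts = count_top_level_categories(SAMPLE_LINES)
for archive, n in sorted(category_counts.items()):
    print(archive, n)
```

To run this against a real file, pass the open file handle in place of `SAMPLE_LINES`; the function reads line by line, so it never needs the whole file in memory.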
Index Score: 72.2
- Adoption: 88
- Quality: 85
- Freshness: 93
- Citations: 80
- Engagement: 0