Dataset · scientific · v2026

ArXiv Papers Dataset

by Cornell University / arXiv · open-source · Last verified 2026-03-17

The ArXiv Papers Dataset is a bulk export of more than 2.3 million scientific preprints from arXiv, spanning physics, mathematics, computer science, quantitative biology, quantitative finance, and economics. It is provided by Cornell University and hosted on Kaggle and AWS S3. Because it combines full-text LaTeX source with parsed metadata, it serves as a primary pretraining corpus for scientific language models and as a basis for citation-network research.

https://arxiv.org/abs/2302.02299
B+ (Good)
Adoption: A · Quality: A · Freshness: A+ · Citations: A · Engagement: F

Specifications

License: CC0 1.0
Pricing: open-source
Capabilities: scientific-text-retrieval, citation-analysis, topic-modeling
Integrations: huggingface-datasets, s3
Use Cases: scientific-lm-pretraining, information-extraction, knowledge-graph-construction
API Available: Yes
Tags: scientific-papers, preprints, nlp, pretraining, multi-domain
Added: 2026-03-17
Completeness: 100%
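
As a rough illustration of the topic-modeling and citation-analysis capabilities listed above, the sketch below tallies papers by primary category from a JSON Lines metadata export. It assumes one JSON object per line with `id`, `title`, and a space-separated `categories` field; those field names, the sample records, and the helper `count_primary_categories` are illustrative assumptions, not something this listing specifies.

```python
import io
import json
from collections import Counter

# Made-up records standing in for a metadata file. The field names
# ("id", "title", "categories") are an assumption about the export format.
SAMPLE = io.StringIO(
    '{"id": "example-1", "title": "Example A", "categories": "hep-th math-ph"}\n'
    '{"id": "example-2", "title": "Example B", "categories": "cs.CL"}\n'
    '{"id": "example-3", "title": "Example C", "categories": "cs.CL cs.LG"}\n'
)


def count_primary_categories(lines):
    """Count records by the first category listed on each record."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        categories = record.get("categories", "").split()
        if categories:
            counts[categories[0]] += 1
    return counts


counts = count_primary_categories(SAMPLE)
print(counts.most_common())
```

For a real export, an open file handle can be passed in place of `SAMPLE`; iterating line by line keeps memory use flat even when the file holds millions of records.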

Index Score

72.2
Adoption: 88
Quality: 85
Freshness: 93
Citations: 80
Engagement: 0
