Dataset · scientific · v2026

ArXiv Papers Dataset

by Cornell University / arXiv · open-source · Last verified 2026-03-17

The ArXiv Papers Dataset is a bulk export of more than 2.3 million scientific preprints from arXiv, spanning physics, mathematics, computer science, quantitative biology, quantitative finance, and economics. It is provided by Cornell University and hosted on Kaggle and AWS S3. The full-text LaTeX source and parsed metadata make it a widely used pretraining corpus for scientific language models and a common basis for citation-network research.

https://arxiv.org/abs/2302.02299
B+ (Good)
Adoption: A · Quality: A · Freshness: A+ · Citations: A · Engagement: F

Specifications

License: CC0 1.0
Pricing: open-source
Capabilities: scientific-text-retrieval, citation-analysis, topic-modeling
Integrations: huggingface-datasets, s3
Use Cases: scientific-lm-pretraining, information-extraction, knowledge-graph-construction
API Available: Yes
Tags: scientific-papers, preprints, nlp, pretraining, multi-domain
Added: 2026-03-17
Completeness: 100%
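
As a rough illustration of the topic-modeling and citation-analysis use cases listed above, the sketch below counts papers per primary category from a metadata stream. It assumes the metadata is stored as JSON Lines with a space-separated `categories` field whose first entry is the primary category; that layout, the sample records, and the helper name `count_primary_categories` are assumptions made for illustration and are not confirmed by this listing.

```python
import io
import json
from collections import Counter


def count_primary_categories(lines):
    """Count records per primary category in a JSON Lines metadata stream.

    Assumes each non-blank line is a JSON object with a space-separated
    "categories" string (hypothetical layout, not confirmed by this page).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        categories = record.get("categories", "").split()
        if categories:
            counts[categories[0]] += 1  # first entry treated as primary
    return counts


# Made-up sample records standing in for a real metadata file.
sample = io.StringIO(
    '{"id": "0001.0001", "title": "Example A", "categories": "cs.CL cs.LG"}\n'
    '{"id": "0001.0002", "title": "Example B", "categories": "math.CO"}\n'
    '{"id": "0001.0003", "title": "Example C", "categories": "cs.CL"}\n'
)
counts = count_primary_categories(sample)
print(counts.most_common())
```

Because the function takes any iterable of lines, an open file handle can be passed in place of the in-memory sample, so a large metadata file is read one record at a time rather than loaded whole.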

Index Score

72.2
Adoption: 88
Quality: 85
Freshness: 93
Citations: 80
Engagement: 0
