Dataset · scientific · v2026

ArXiv Papers Dataset

by Cornell University / arXiv · open-source · Last verified 2026-03-17

The ArXiv Papers Dataset is a bulk export of more than 2.3 million scientific preprints from arXiv, spanning physics, mathematics, computer science, quantitative biology, quantitative finance, and economics. It is provided by Cornell University and hosted on Kaggle and AWS S3. Because it combines full-text LaTeX source with parsed metadata, it is a common pretraining corpus for scientific language models and a common source for citation-network research.

https://arxiv.org/abs/2302.02299

Index Score: C+ (Average)

Adoption: A
Quality: A
Freshness: A+
Citations: F
Engagement: F

Specifications

License: CC0 1.0
Pricing: open-source
Capabilities: scientific-text-retrieval, citation-analysis, topic-modeling
Integrations: huggingface-datasets, s3
Use Cases: scientific-lm-pretraining, information-extraction, knowledge-graph-construction
API Available: Yes
Tags: scientific-papers, preprints, nlp, pretraining, multi-domain
Added: 2026-03-17
Completeness: 100%
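
As a rough illustration of the topic-modeling and retrieval use cases listed above, the sketch below counts papers by primary category from metadata stored as JSON Lines. It assumes each record carries a space-separated `categories` string, as in the metadata snapshot distributed on Kaggle; that field layout is an assumption here and should be checked against the copy you download. The sample records and IDs are made up for the example.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_primary_categories(records: Iterable[dict]) -> Counter:
    """Count papers by primary category.

    Assumption: `categories` is a space-separated string whose first
    token is the primary category (e.g. "cs.CL cs.LG").
    """
    counts: Counter = Counter()
    for rec in records:
        cats = rec.get("categories", "").split()
        if cats:
            counts[cats[0]] += 1
    return counts


# Made-up sample standing in for the real metadata file.
SAMPLE = """\
{"id": "0000.00001", "title": "Example A", "categories": "cs.CL cs.LG"}
{"id": "0000.00002", "title": "Example B", "categories": "math.PR"}
{"id": "0000.00003", "title": "Example C", "categories": "cs.CL"}
"""

counts = count_primary_categories(iter_records(SAMPLE.splitlines()))
print(counts.most_common())
```

For a real download, an open file handle can be passed to `iter_records` in place of `SAMPLE.splitlines()`, so the multi-gigabyte file is streamed line by line rather than loaded into memory.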

