Dataset · medical · v2024

PubMed Central OA

by National Institutes of Health / National Library of Medicine · free · Last verified 2026-03-17

PubMed Central Open Access (PMC OA) is the subset of the PMC literature archive whose licenses permit reuse, which makes it freely available for text mining and NLP research. It contains over 4 million full-text biomedical and life science articles. It is a widely used corpus for pretraining biomedical language models; BioBERT, for example, was pretrained on PubMed abstracts together with PMC full-text articles.

https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
B+ (Good)
Adoption: A · Quality: A · Freshness: A+ · Citations: A · Engagement: F

Specifications

License: Various open licenses (CC-BY, CC-BY-NC, etc.)
Pricing: free
Capabilities: biomedical-nlp, named-entity-recognition, relation-extraction, pretraining
Integrations: HuggingFace Datasets, NLTK, spaCy
Use Cases: language-model-pretraining, biomedical-nlp-research, information-extraction
API Available: Yes
Tags: biomedical-nlp, scientific-literature, full-text, open-access, pretraining
Added: 2026-03-17
Completeness: 100%
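
Because articles in the subset carry different open licenses, a common first step is to filter records by license before building a corpus. The sketch below is a minimal illustration, not an official client: it parses an XML string shaped like a response from the PMC OA web service and keeps only records under licenses you choose. The sample records, the `example.invalid` links, the attribute names (`id`, `license`, `format`, `href`), and the helper names `parse_records` and `filter_by_license` are all assumptions made for this example; check them against a real response before relying on them.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: the layout is assumed to resemble a PMC OA
# web service response; the IDs, citations, and links are made up.
SAMPLE_RESPONSE = """<OA>
  <records returned-count="2" total-count="2">
    <record id="PMC0000001" citation="Example J. 2020; 1:1" license="CC BY">
      <link format="tgz" href="ftp://example.invalid/oa_package/PMC0000001.tar.gz"/>
    </record>
    <record id="PMC0000002" citation="Example J. 2021; 2:2" license="CC BY-NC">
      <link format="tgz" href="ftp://example.invalid/oa_package/PMC0000002.tar.gz"/>
    </record>
  </records>
</OA>"""


def parse_records(xml_text):
    """Return one dict per <record> with its ID, license, and download links."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter("record"):
        # Map each link's format (e.g. "tgz") to its download URL.
        links = {link.get("format"): link.get("href") for link in rec.findall("link")}
        records.append(
            {"id": rec.get("id"), "license": rec.get("license"), "links": links}
        )
    return records


def filter_by_license(records, allowed):
    """Keep only records whose license string is in the allowed set."""
    return [r for r in records if r["license"] in allowed]


records = parse_records(SAMPLE_RESPONSE)
# Hypothetical policy: keep only licenses that allow commercial reuse.
commercial_ok = filter_by_license(records, {"CC BY", "CC0"})
print([r["id"] for r in commercial_ok])
```

Filtering on an explicit allow-list, rather than excluding known restrictive licenses, means that any unfamiliar license string is left out by default.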

Index Score

73.1
Adoption: 85
Quality: 88
Freshness: 95
Citations: 86
Engagement: 0
