PubMedCentral OA
by National Institutes of Health / National Library of Medicine · free · Last verified 2026-03-17
PubMed Central Open Access (PMC OA) is the subset of the PMC literature archive that is made freely available for text mining and NLP research. It contains over 4 million full-text biomedical and life science articles and is widely used, often alongside PubMed abstracts, to pretrain biomedical language models such as BioBERT and PubMedBERT.
https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
Grade: B+ (Good)
Adoption: A · Quality: A · Freshness: A+ · Citations: A · Engagement: F
Specifications
- License: Various open licenses (CC-BY, CC-BY-NC, etc.)
- Pricing: free
- Capabilities: biomedical-nlp, named-entity-recognition, relation-extraction, pretraining
- Integrations: HuggingFace Datasets, NLTK, spaCy
- Use Cases: language-model-pretraining, biomedical-nlp-research, information-extraction
- API Available: Yes
- Tags: biomedical-nlp, scientific-literature, full-text, open-access, pretraining
- Added: 2026-03-17
- Completeness: 100%
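
Because the articles carry a mix of open licenses, it can help to check each record's license before adding it to a corpus. The following is a minimal sketch, assuming the PMC OA Web Service returns XML in which each `record` element has `id` and `license` attributes and nested `link` elements. The endpoint URL, the sample response, the placeholder IDs, and the helper names (`oa_lookup_url`, `parse_records`, `allows_commercial_use`) are illustrative assumptions and should be checked against the current NCBI documentation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint of the PMC OA Web Service; verify before relying on it.
OA_SERVICE = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi"

# Illustrative response only: IDs and hrefs are placeholders, not real records.
SAMPLE_RESPONSE = """\
<OA>
  <records returned-count="2" total-count="2">
    <record id="PMC0000001" license="CC BY">
      <link format="tgz" href="ftp://example.org/oa_package/PMC0000001.tar.gz"/>
    </record>
    <record id="PMC0000002" license="CC BY-NC">
      <link format="tgz" href="ftp://example.org/oa_package/PMC0000002.tar.gz"/>
      <link format="pdf" href="ftp://example.org/oa_pdf/PMC0000002.pdf"/>
    </record>
  </records>
</OA>
"""


def oa_lookup_url(pmcid: str) -> str:
    """Build a lookup URL for one article ID (hypothetical helper)."""
    return f"{OA_SERVICE}?{urlencode({'id': pmcid})}"


def parse_records(xml_text: str) -> list[dict]:
    """Extract id, license, and download links from a response document."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter("record"):
        links = {link.get("format"): link.get("href") for link in rec.iter("link")}
        records.append(
            {"id": rec.get("id"), "license": rec.get("license"), "links": links}
        )
    return records


def allows_commercial_use(license_name) -> bool:
    """Rough heuristic: reject missing licenses and any with an NC clause."""
    if not license_name:
        return False
    tokens = license_name.upper().replace("-", " ").split()
    return "NC" not in tokens


if __name__ == "__main__":
    for record in parse_records(SAMPLE_RESPONSE):
        status = "ok" if allows_commercial_use(record["license"]) else "non-commercial"
        print(record["id"], record["license"], status)
```

The license check is deliberately simplistic and only looks for an NC token. A real pipeline would map each license string to explicit terms rather than rely on this heuristic.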
Index Score: 73.1
- Adoption: 85
- Quality: 88
- Freshness: 95
- Citations: 86
- Engagement: 0