PDF Extraction Pipeline
by AaaS · open-source · Last verified 2026-03-01
A specialized pipeline for extracting structured content from PDF documents, including text, tables, images, and metadata. It supports OCR for scanned documents, layout analysis for complex formats, and chunking optimized for PDF document structures.
https://aaas.blog/script/pdf-extraction-pipeline
C+ (Average)
Adoption: B, Quality: B+, Freshness: B+, Citations: C+, Engagement: F
Specifications
- License: MIT
- Pricing: open-source
- Capabilities: text-extraction, table-extraction, ocr-processing, layout-analysis, metadata-extraction
- Integrations: pypdf, unstructured, tesseract, langchain
- Use Cases: contract-processing, report-digitization, invoice-extraction, research-paper-processing
- API Available: No
- Language: python
- Dependencies: pypdf, unstructured, pytesseract, pdf2image, langchain
- Environment: Python 3.11+ with Tesseract OCR and poppler-utils
- Est. Runtime: 1-10 minutes per document depending on page count
- Tags: script, automation, pdf, extraction, ocr
- Added: 2026-03-17
- Completeness: 100%
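The listing does not show the pipeline's code, so the following is only a minimal sketch of how text extraction with an OCR fallback and page-aware chunking might fit together using the listed dependencies (pypdf, pdf2image, pytesseract). The names `PageText`, `needs_ocr`, `extract_pages`, and `chunk_pages`, along with the `min_chars` and `max_chars` thresholds, are hypothetical and chosen for illustration; they are not taken from the actual script.

```python
from dataclasses import dataclass


@dataclass
class PageText:
    """Text recovered from one PDF page (hypothetical container)."""
    page_number: int
    text: str
    used_ocr: bool


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption): a page with almost no embedded text is likely a scan."""
    return len(text.strip()) < min_chars


def extract_pages(pdf_path: str, min_chars: int = 20) -> list[PageText]:
    """Read embedded text with pypdf; fall back to Tesseract OCR for sparse pages."""
    # Third-party imports are local so the helpers above and below
    # can be used without Tesseract or poppler-utils installed.
    from pypdf import PdfReader
    from pdf2image import convert_from_path
    import pytesseract

    reader = PdfReader(pdf_path)
    pages = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        used_ocr = False
        if needs_ocr(text, min_chars):
            # Render only this page to an image, then OCR it.
            images = convert_from_path(pdf_path, first_page=number, last_page=number)
            text = pytesseract.image_to_string(images[0])
            used_ocr = True
        pages.append(PageText(number, text, used_ocr))
    return pages


def chunk_pages(pages: list[PageText], max_chars: int = 1000) -> list[dict]:
    """Pack paragraphs into chunks that never cross a page boundary."""
    chunks = []
    for page in pages:
        current = ""
        for para in page.text.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append({"page": page.page_number, "text": current})
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append({"page": page.page_number, "text": current})
    return chunks
```

Under these assumptions, `chunk_pages(extract_pages("report.pdf"))` would yield a list of chunks, each tagged with its source page, ready to be wrapped in whatever document type a downstream framework such as langchain expects. Keeping chunks within page boundaries is one possible design choice; it makes citations back to the source page straightforward at the cost of splitting paragraphs that span pages.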
Index Score: 54.3
- Adoption: 64
- Quality: 76
- Freshness: 78
- Citations: 54
- Engagement: 0