OCR Pipeline Script
by Community · open-source · Last verified 2026-03-17
Multi-engine OCR pipeline that routes documents to Tesseract, PaddleOCR, or a cloud OCR API based on image quality heuristics. Outputs structured JSON with bounding boxes, confidence scores, and reading-order-sorted text blocks ready for downstream NLP.
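As a rough illustration of quality-based routing, the sketch below scores a grayscale image for sharpness (variance of a 4-neighbour Laplacian) and contrast (standard deviation of pixel values), then picks an engine. The listing does not say which heuristics or thresholds the script uses, so the function names, the threshold values, and the mapping of quality levels to engines are all assumptions made for this example. It is written in pure Python so it runs without OpenCV; a real pipeline would more likely compute these metrics with opencv-python.

```python
from statistics import pstdev, pvariance


def laplacian_variance(gray):
    """Sharpness proxy: variance of a 4-neighbour Laplacian over interior pixels."""
    height, width = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return pvariance(responses)


def contrast(gray):
    """Contrast proxy: population standard deviation of all pixel values."""
    return pstdev([value for row in gray for value in row])


def choose_engine(gray, sharp_min=100.0, contrast_min=40.0):
    """Pick an engine name from two quality checks.

    Thresholds and the engine mapping are hypothetical, not taken from the script.
    """
    is_sharp = laplacian_variance(gray) >= sharp_min
    has_contrast = contrast(gray) >= contrast_min
    if is_sharp and has_contrast:
        return "tesseract"   # clean scan: assume the lightweight local engine suffices
    if is_sharp or has_contrast:
        return "paddleocr"   # partly degraded: assume the heavier local model
    return "cloud"           # poor quality: assume fallback to a cloud OCR API


# Synthetic 8x8 test images (lists of rows with values in 0..255)
checkerboard = [[255 * ((x + y) % 2) for x in range(8)] for y in range(8)]
gradient = [[32 * x for x in range(8)] for _ in range(8)]
flat = [[128] * 8 for _ in range(8)]
```

Calling `choose_engine` on `checkerboard`, `gradient`, and `flat` exercises the three branches in turn. The thresholds would need tuning against real scans before they mean anything.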
https://github.com/PaddlePaddle/PaddleOCR
Grade: B (Above Average)
Adoption: A · Quality: B+ · Freshness: A · Citations: C+ · Engagement: F
Specifications
- License: MIT
- Pricing: open-source
- Capabilities: multi-engine-routing, bounding-box-output, reading-order-sort, confidence-scoring
- Integrations: paddleocr, tesseract, opencv, google-cloud-vision
- Use Cases: invoice-processing, id-document-extraction, historical-document-digitization
- API Available: No
- Language: python
- Dependencies: paddlepaddle, paddleocr, pytesseract, opencv-python, pillow
- Environment: Python 3.9+
- Est. Runtime: 1-10 minutes
- Tags: ocr, text-extraction, document-ai, tesseract, paddleocr
- Added: 2026-03-17
- Completeness: 100%
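The reading-order-sort, confidence-scoring, and bounding-box-output capabilities listed above might combine roughly as in the following sketch. The block schema (`text`, `bbox` as `[x, y, width, height]`, `confidence`), the function names, and the line-grouping tolerance are assumptions for illustration; the listing does not document the script's actual JSON layout. The ordering shown is a simple top-to-bottom, left-to-right pass and does not handle multi-column layouts.

```python
import json


def sort_reading_order(blocks, line_tol=10):
    """Group blocks into lines by top edge, then order each line left to right.

    A block joins the current line when its top y is within line_tol pixels
    of the first block in that line (hypothetical tolerance).
    """
    by_top = sorted(blocks, key=lambda b: b["bbox"][1])
    lines = []
    for block in by_top:
        if lines and abs(block["bbox"][1] - lines[-1][0]["bbox"][1]) <= line_tol:
            lines[-1].append(block)
        else:
            lines.append([block])
    return [b for line in lines for b in sorted(line, key=lambda b: b["bbox"][0])]


def to_json(blocks, engine, min_confidence=0.0):
    """Serialize sorted blocks, dropping those below min_confidence."""
    kept = [b for b in sort_reading_order(blocks) if b["confidence"] >= min_confidence]
    return json.dumps({"engine": engine, "blocks": kept}, indent=2)


# Example input in arbitrary detection order
sample_blocks = [
    {"text": "world", "bbox": [120, 12, 60, 20], "confidence": 0.91},
    {"text": "Second", "bbox": [10, 50, 70, 20], "confidence": 0.45},
    {"text": "Hello", "bbox": [10, 10, 50, 20], "confidence": 0.98},
]

print(to_json(sample_blocks, engine="paddleocr", min_confidence=0.5))
```

Filtering after sorting keeps the ordering logic independent of the confidence cutoff, so the same sorted list can feed several downstream consumers with different thresholds.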
Index Score: 62.1
- Adoption: 80
- Quality: 78
- Freshness: 82
- Citations: 58
- Engagement: 0