Data Cleaning Script
by AaaS · open-source · Last verified 2026-03-01
Cleans and normalizes text data for LLM consumption by removing HTML artifacts, fixing encoding issues, standardizing whitespace, deduplicating near-identical entries, and filtering low-quality content based on configurable quality heuristics.
https://aaas.blog/script/data-cleaning-script
Grade: C+ (Average)
Adoption: B+ · Quality: B+ · Freshness: B+ · Citations: C+ · Engagement: F
Specifications
- License: MIT
- Pricing: open-source
- Capabilities: html-cleaning, encoding-normalization, deduplication, quality-filtering, whitespace-normalization
- Integrations: pandas, beautifulsoup4, ftfy, datasets
- Use Cases: training-data-cleaning, web-scrape-processing, content-normalization, data-pipeline-preprocessing
- API Available: No
- Language: python
- Dependencies: pandas, beautifulsoup4, ftfy, datasets, tqdm
- Environment: Python 3.11+
- Est. Runtime: 2-15 minutes depending on dataset size
- Tags: script, automation, cleaning, data-quality, preprocessing
- Added: 2026-03-17
- Completeness: 100%
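The listing does not include the script's source, so the following is only a minimal sketch of how the five listed capabilities might fit together. It is not the script's actual implementation. To stay self-contained it swaps in the Python standard library for the listed dependencies: `html.parser` stands in for beautifulsoup4, and Unicode NFKC normalization stands in for ftfy (it does not repair mojibake the way ftfy does). All names here (`CleanConfig`, `clean_text`, `passes_quality`, `is_near_duplicate`, `clean_corpus`) and all default thresholds are hypothetical.

```python
import re
import unicodedata
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class CleanConfig:
    """Hypothetical quality heuristics; the defaults are illustrative only."""
    min_words: int = 5             # drop entries shorter than this
    min_alpha_ratio: float = 0.6   # minimum share of alphabetic characters
    dedup_threshold: float = 0.85  # Jaccard similarity that counts as a duplicate


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def clean_text(raw):
    """Strip HTML, normalize Unicode, and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    text = unicodedata.normalize("NFKC", " ".join(parser.parts))
    # Drop control characters but keep whitespace for the collapse step.
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return re.sub(r"\s+", " ", text).strip()


def passes_quality(text, cfg):
    """Apply the word-count and alphabetic-ratio heuristics."""
    if not text or len(text.split()) < cfg.min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) >= cfg.min_alpha_ratio


def is_near_duplicate(a, b, threshold):
    """Compare two texts by Jaccard similarity of their lowercase word sets."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta or not tb:
        return ta == tb
    return len(ta & tb) / len(ta | tb) >= threshold


def clean_corpus(records, cfg=None):
    """Clean each record, then keep those that pass quality and are not near-duplicates."""
    cfg = cfg or CleanConfig()
    kept = []
    for raw in records:
        text = clean_text(raw)
        if not passes_quality(text, cfg):
            continue
        if any(is_near_duplicate(text, k, cfg.dedup_threshold) for k in kept):
            continue
        kept.append(text)
    return kept
```

The pairwise duplicate check is quadratic in the number of kept entries, which is acceptable for a sketch but would likely need a hashing scheme such as MinHash for large datasets.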
Index Score: 57.6
- Adoption: 72
- Quality: 74
- Freshness: 72
- Citations: 56
- Engagement: 0