Script · AI Infrastructure · v1.0

Data Cleaning Script

by AaaS · open-source · Last verified 2026-03-01

Cleans and normalizes text data for LLM consumption by removing HTML artifacts, fixing encoding issues, standardizing whitespace, deduplicating near-identical entries, and filtering low-quality content based on configurable quality heuristics.
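
The normalization steps in the description can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the script's actual code: the function name `clean_text` and both regular expressions are assumptions made for this example. The listing names beautifulsoup4 and ftfy as dependencies, so the real script presumably relies on those for HTML parsing and encoding repair rather than on the crude regex used here.

```python
import html
import re
import unicodedata

# Hypothetical helpers for illustration; not taken from the script itself.
TAG_RE = re.compile(r"<[^>]+>")   # crude tag matcher, not a real HTML parser
WS_RE = re.compile(r"\s+")        # any run of whitespace


def clean_text(raw: str) -> str:
    """Strip tags, decode entities, normalize Unicode, collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)                      # drop HTML tags
    decoded = html.unescape(no_tags)                    # &amp; -> &, &nbsp; -> U+00A0
    normalized = unicodedata.normalize("NFKC", decoded) # fold compatibility characters
    return WS_RE.sub(" ", normalized).strip()           # single spaces, trimmed ends


if __name__ == "__main__":
    print(clean_text("<p>Hello&nbsp;&amp;   world</p>"))
```

A regex tag stripper mishandles cases such as `>` inside attribute values or script bodies, which is one reason a parser-based approach is usually preferred for scraped web content.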

https://aaas.blog/script/data-cleaning-script
Grade: C+ (Average)
Adoption: B+ · Quality: B+ · Freshness: B+ · Citations: C+ · Engagement: F

Specifications

License: MIT
Pricing: open-source
Capabilities: html-cleaning, encoding-normalization, deduplication, quality-filtering, whitespace-normalization
Integrations: pandas, beautifulsoup4, ftfy, datasets
Use Cases: training-data-cleaning, web-scrape-processing, content-normalization, data-pipeline-preprocessing
API Available: No
Language: python
Dependencies: pandas, beautifulsoup4, ftfy, datasets, tqdm
Environment: Python 3.11+
Est. Runtime: 2-15 minutes depending on dataset size
Tags: script, automation, cleaning, data-quality, preprocessing
Added: 2026-03-17
Completeness: 100%
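
The deduplication and quality-filtering capabilities listed above might look something like the sketch below. Everything here is an assumption for illustration: the names `fingerprint`, `passes_quality`, and `dedupe_and_filter`, and the thresholds `min_words` and `min_alpha_ratio`, are hypothetical and do not come from the script. The near-duplicate check is a simple normalized-text match; the script's own method is not documented in this listing and may use a fuzzier similarity measure.

```python
import re


def fingerprint(text: str) -> str:
    """Normalize text so trivially different copies compare equal."""
    lowered = text.lower()
    no_punct = re.sub(r"[^\w\s]", "", lowered)  # drop punctuation
    return " ".join(no_punct.split())           # collapse whitespace


def passes_quality(text: str, min_words: int = 5,
                   min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical heuristic: enough words, and mostly alphabetic characters."""
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def dedupe_and_filter(texts):
    """Keep the first copy of each entry that passes the quality check."""
    seen = set()
    kept = []
    for text in texts:
        key = fingerprint(text)
        if key in seen or not passes_quality(text):
            continue
        seen.add(key)
        kept.append(text)
    return kept


if __name__ == "__main__":
    docs = [
        "The quick brown fox jumps over the lazy dog.",
        "the quick  brown fox jumps over the lazy dog",
        "12345 !!! ??? ### $$$ %%%",
        "too short",
    ]
    print(dedupe_and_filter(docs))
```

Exposing the thresholds as keyword arguments mirrors the "configurable quality heuristics" wording in the description, though the actual configuration surface of the script is not specified here.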

Index Score

57.6
Adoption: 72
Quality: 74
Freshness: 72
Citations: 56
Engagement: 0
