HTML cleaning, sanitization, and text processing utilities for Rust.
- HTML Cleaning: Remove unwanted elements (scripts, styles, forms)
- Tag Stripping: Remove tags while preserving text content
- Text Normalization: Collapse whitespace, trim text
- Link Processing: Make URLs absolute, filter links
- Content Deduplication: LRU-based duplicate detection
- Markdown Output: Convert HTML to Markdown with structure preservation
- Presets: Ready-to-use configurations for common scenarios
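
The LRU-based duplicate detection listed above can be pictured as a bounded cache of content fingerprints. The following is a minimal standard-library sketch of that idea, not the crate's `dedup` module: `SeenCache`, `fingerprint`, and `is_duplicate` are hypothetical names chosen for illustration, and the real API may differ.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashSet, VecDeque};
use std::hash::{Hash, Hasher};

/// Hypothetical bounded "recently seen" cache (illustration only,
/// not the html-cleaning `dedup` API).
pub struct SeenCache {
    capacity: usize,
    /// Fingerprints ordered by recency: front = least recently used.
    order: VecDeque<u64>,
    /// Fast membership test for the fingerprints held in `order`.
    seen: HashSet<u64>,
}

impl SeenCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, order: VecDeque::new(), seen: HashSet::new() }
    }

    /// Hash whitespace-normalized text so that copies differing only
    /// in spacing produce the same fingerprint.
    fn fingerprint(text: &str) -> u64 {
        let normalized = text.split_whitespace().collect::<Vec<_>>().join(" ");
        let mut hasher = DefaultHasher::new();
        normalized.hash(&mut hasher);
        hasher.finish()
    }

    /// Returns `true` if `text` was seen recently; otherwise records it,
    /// evicting the least recently used entry when the cache is full.
    pub fn is_duplicate(&mut self, text: &str) -> bool {
        let key = Self::fingerprint(text);
        if self.seen.contains(&key) {
            // Refresh recency by moving the key to the back.
            // O(n) here; a production LRU would use a linked structure.
            self.order.retain(|k| *k != key);
            self.order.push_back(key);
            return true;
        }
        if self.order.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.order.push_back(key);
        self.seen.insert(key);
        false
    }
}
```

Bounding the cache keeps memory constant on long crawls, at the cost of forgetting content that has not reappeared recently.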
use html_cleaning::{HtmlCleaner, presets};
use dom_query::Document;
// Use a preset for quick setup
let cleaner = HtmlCleaner::with_options(presets::standard());
let html = "<html><body><script>bad</script><p>Hello!</p></body></html>";
let doc = Document::from(html);
cleaner.clean(&doc);
// Scripts removed, paragraph content preserved

Add to your Cargo.toml:
[dependencies]
html-cleaning = "0.1"

With all features:
[dependencies]
html-cleaning = { version = "0.1", features = ["full"] }

use html_cleaning::{HtmlCleaner, CleaningOptions};
let options = CleaningOptions {
    tags_to_remove: vec!["script".into(), "style".into()],
    prune_empty: true,
    normalize_whitespace: true,
    ..Default::default()
};
let cleaner = HtmlCleaner::with_options(options);

use html_cleaning::CleaningOptions;
let options = CleaningOptions::builder()
    .remove_tags(&["script", "style", "noscript"])
    .remove_selectors(&[".advertisement", "#cookie-banner"])
    .prune_empty(true)
    .normalize_whitespace(true)
    .build();

use html_cleaning::presets;
// Minimal: Just scripts and styles
let minimal = presets::minimal();
// Standard: + forms, iframes, objects
let standard = presets::standard();
// Aggressive: + nav, header, footer, aside
let aggressive = presets::aggressive();
// Article extraction: Optimized for content extraction
let article = presets::article_extraction();

use html_cleaning::text;
let has_content = text::has_content(" hello "); // true
let normalized = text::normalize("  multiple   spaces  "); // "multiple spaces"
let words = text::word_count("hello world"); // 2

use html_cleaning::markdown::html_to_markdown;
let html = "<h1>Title</h1><p>Content with <strong>bold</strong></p>";
let md = html_to_markdown(html);
// Output: "# Title\n\nContent with **bold**\n"

| Feature | Default | Description |
|---|---|---|
| `presets` | Yes | Include prebuilt cleaning configurations |
| `regex` | No | Enable regex-based selectors |
| `url` | No | Enable URL processing with the `url` crate |
| `markdown` | No | Enable HTML to Markdown conversion |
| `full` | No | Enable all features |

| Module | Description |
|---|---|
| `cleaner` | Core `HtmlCleaner` and cleaning operations |
| `text` | Text processing utilities |
| `tree` | lxml-style text/tail tree manipulation |
| `dom` | DOM helper utilities |
| `dedup` | Content deduplication |
| `presets` | Ready-to-use cleaning configurations |
| `links` | URL and link processing (feature: `url`) |
| `markdown` | HTML to Markdown conversion (feature: `markdown`) |
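
The "lxml-style text/tail" model mentioned for the `tree` module stores an element's leading text in `text` and the text that follows its closing tag in `tail`. The sketch below illustrates that model with a hypothetical `Element` type and a hypothetical `remove_child_keep_tail` helper; it assumes nothing about the crate's actual `tree` API and only shows why removing an element must hand its tail to a neighbor.

```rust
/// Hypothetical node mirroring lxml's element model (illustration only).
#[derive(Debug, Default, Clone)]
pub struct Element {
    pub tag: String,
    /// Text between this element's start tag and its first child.
    pub text: Option<String>,
    /// Text between this element's end tag and the next sibling.
    pub tail: Option<String>,
    pub children: Vec<Element>,
}

impl Element {
    pub fn new(tag: &str) -> Self {
        Self { tag: tag.to_string(), ..Default::default() }
    }

    /// Concatenate all text inside this element in document order.
    /// The element's own tail is excluded, since it lies outside the element.
    pub fn text_content(&self) -> String {
        let mut out = String::new();
        self.collect_text(&mut out);
        out
    }

    fn collect_text(&self, out: &mut String) {
        if let Some(t) = &self.text {
            out.push_str(t);
        }
        for child in &self.children {
            child.collect_text(out);
            if let Some(t) = &child.tail {
                out.push_str(t);
            }
        }
    }

    /// Remove the child at `index` but keep its tail text, appending it to
    /// the previous sibling's tail, or to the parent's text if the child
    /// was first. Panics if `index` is out of bounds.
    pub fn remove_child_keep_tail(&mut self, index: usize) -> Element {
        let mut removed = self.children.remove(index);
        if let Some(tail) = removed.tail.take() {
            let slot = if index == 0 {
                &mut self.text
            } else {
                &mut self.children[index - 1].tail
            };
            slot.get_or_insert_with(String::new).push_str(&tail);
        }
        removed
    }
}
```

For `<p>Hello <script>x</script>world</p>`, the string `world` is the tail of `script`, so dropping the `script` node without this step would also drop `world`.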
`presets::minimal()`
- Removes: `script`, `style`, `noscript`
- Best for: Quick sanitization

`presets::standard()`
- Removes: `script`, `style`, `noscript`, `form`, `iframe`, `object`, `embed`, `svg`, `canvas`, `video`, `audio`
- Enables: `prune_empty`, `normalize_whitespace`
- Best for: General web scraping

`presets::aggressive()`
- Includes all of `standard()` plus:
- Removes: `nav`, `header`, `footer`, `aside`, `figure`, `figcaption`
- Enables: `strip_attributes` (preserves `href`, `src`, `alt`)
- Best for: Maximum content extraction

`presets::article_extraction()`
- Optimized for article content extraction
- Removes navigation and layout elements
- Strips wrapper tags (`div`, `span`) while preserving content
- Best for: News articles, blog posts

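
The tag lists in the preset descriptions above build on one another: the standard list extends the minimal one, and the aggressive list extends the standard one. The sketch below shows that layering with a hypothetical `Level` enum and `tags_to_remove` function; these names are assumptions for illustration and are not how the crate's `presets` module is necessarily implemented. Only the tag names come from the descriptions above.

```rust
/// Hypothetical preset levels (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Minimal,
    Standard,
    Aggressive,
}

/// Tags removed by the minimal preset.
const MINIMAL: &[&str] = &["script", "style", "noscript"];

/// Tags the standard preset removes in addition to the minimal ones.
const STANDARD_EXTRA: &[&str] = &[
    "form", "iframe", "object", "embed", "svg", "canvas", "video", "audio",
];

/// Tags the aggressive preset removes in addition to the standard ones.
const AGGRESSIVE_EXTRA: &[&str] = &[
    "nav", "header", "footer", "aside", "figure", "figcaption",
];

/// Build the cumulative removal list for a level.
pub fn tags_to_remove(level: Level) -> Vec<&'static str> {
    let mut tags = MINIMAL.to_vec();
    if level != Level::Minimal {
        tags.extend_from_slice(STANDARD_EXTRA);
    }
    if level == Level::Aggressive {
        tags.extend_from_slice(AGGRESSIVE_EXTRA);
    }
    tags
}
```

When none of the presets fits, the same lists can serve as a starting point for a custom `tags_to_remove` value in `CleaningOptions`.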
- rs-trafilatura - Web content extraction library (uses html-cleaning)
- dom_query - DOM manipulation library
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option.
Contributions are welcome! Please feel free to submit a Pull Request.