Murrough-Foley/rs-trafilatura-python
rs-trafilatura

Fast web content extraction, page type classification, HTML cleaning, and Markdown conversion for Python — powered by Rust.

rs-trafilatura is a Python package built with PyO3 that wraps four Rust crates into a single pip install. It extracts the main content from web pages, classifies page types, predicts extraction quality, cleans HTML, and converts HTML to Markdown — all at native Rust speed.

Why rs-trafilatura?

  • Fast: Rust-native extraction at ~44 ms per page on commodity hardware (36x faster than MinerU-HTML, the strongest neural baseline below)
  • Accurate: F1 0.859 across 7 page types on the Web Content Extraction Benchmark, outperforming Trafilatura (0.791), MinerU-HTML (0.827), and ReaderLM-v2 (0.741)
  • Page-type aware: XGBoost classifier detects articles, forums, products, collections, listings, documentation, and service pages — then applies type-specific extraction profiles
  • Quality scoring: ML-based confidence predictor (0.0–1.0) tells you when extraction might be unreliable, enabling hybrid pipelines with LLM fallback
  • Framework adapters: Drop-in integrations for crawl4ai, Scrapy, Firecrawl, and Crawlee

Install

pip install rs-trafilatura

Quick Start

import rs_trafilatura

# Extract main content from HTML
result = rs_trafilatura.extract(html, url="https://example.com")
print(result.title)                # Page title
print(result.main_content)         # Clean extracted text
print(result.page_type)            # article, forum, product, etc.
print(result.extraction_quality)   # 0.0–1.0 confidence score

API Reference

Content Extraction

# From a string
result = rs_trafilatura.extract(
    html,
    url="https://example.com",      # URL for page type classification
    page_type="product",             # Force a page type (bypasses classifier)
    favor_precision=True,            # Stricter filtering, less noise
    favor_recall=False,              # More inclusive extraction
    include_tables=True,             # Include table content
    include_images=True,             # Extract image metadata
    include_comments=False,          # Include comment sections
    output_markdown=True,            # Generate Markdown in content_markdown
)

# From raw bytes (auto-detects encoding)
result = rs_trafilatura.extract_bytes(
    response_bytes,
    url="https://example.com",
    output_markdown=True,
)
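The comment above says `extract_bytes` auto-detects encoding. As a rough illustration of why you'd pass raw bytes instead of decoding first, here is a minimal charset-sniffing sketch — the real library's detection logic is more involved; this helper and its name are purely illustrative:

```python
import re

def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Illustrative only: the charset often lives inside the HTML itself
    (e.g. <meta charset="...">), so decoding before inspecting the bytes
    can garble the page. Look in the first 2 KB for a charset declaration."""
    head = raw[:2048]
    m = re.search(rb'charset=["\']?([\w-]+)', head)
    return m.group(1).decode("ascii") if m else default
```

Decoding with the wrong charset before extraction is a common source of mojibake; handing the raw response bytes to the extractor avoids that.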

ExtractResult fields:

| Field | Type | Description |
| --- | --- | --- |
| `title` | `str \| None` | Page title |
| `author` | `str \| None` | Author name |
| `date` | `str \| None` | Publication date (ISO 8601) |
| `main_content` | `str` | Extracted main content as plain text |
| `content_markdown` | `str \| None` | Markdown output (when `output_markdown=True`) |
| `content_html` | `str \| None` | Extracted content as HTML |
| `page_type` | `str \| None` | Detected page type |
| `extraction_quality` | `float` | Confidence score (0.0–1.0) |
| `classification_confidence` | `float \| None` | Page type classifier confidence |
| `language` | `str \| None` | Detected language |
| `sitename` | `str \| None` | Site name |
| `description` | `str \| None` | Meta description |
| `images` | `list[ImageData]` | Extracted images with `src`, `alt`, `caption` |

Page Type Classification

# Fast URL-based heuristic (no HTML needed)
page_type, confidence = rs_trafilatura.classify_url("https://docs.example.com/api")
# ("documentation", 0.9) — or ("article", None) when no pattern matches
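To make the heuristic's behaviour concrete, here is a pure-Python sketch of how a URL-pattern classifier of this kind can work — the patterns, function name, and confidence values below are illustrative, not the library's actual rules:

```python
import re

# Illustrative URL patterns -> (page_type, confidence).
# Not the library's real rule set.
PATTERNS = [
    (re.compile(r"/docs?/|docs\."), "documentation", 0.9),
    (re.compile(r"/forum/|/thread/"), "forum", 0.9),
    (re.compile(r"/product/|/p/"), "product", 0.9),
]

def classify_url_sketch(url: str):
    """Return (page_type, confidence); fall back to ("article", None)
    when no pattern matches, mirroring the behaviour shown above."""
    for pattern, page_type, conf in PATTERNS:
        if pattern.search(url):
            return page_type, conf
    return "article", None
```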

# ML classifier with DOM features (higher accuracy)
page_type, confidence = rs_trafilatura.classify_page(
    numeric_features,   # 89 numeric features from the HTML DOM
    "page title text",  # Title + description for TF-IDF
)

Extraction Quality Prediction

# Predict how reliable an extraction is (for hybrid pipeline routing)
quality = rs_trafilatura.predict_quality(features)  # 27 post-extraction features
# Returns 0.0–1.0. Below 0.80 suggests routing to an LLM fallback.
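The 0.80 threshold above can be wrapped in a small routing helper for hybrid pipelines. A minimal sketch — the function name, return labels, and the idea of an "llm" path are illustrative; in practice you would pass `result.extraction_quality` from `extract()`:

```python
def route_extraction(quality: float, threshold: float = 0.80) -> str:
    """Pick the extraction path for a hybrid pipeline: trust the fast
    Rust result when predicted quality clears the threshold, otherwise
    route the page to a slower LLM-based extractor."""
    return "rust" if quality >= threshold else "llm"
```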

HTML Cleaning

# Remove scripts, styles, comments, SVGs, iframes — keep content
cleaned = rs_trafilatura.clean_html(raw_html)

HTML to Markdown

# Convert HTML to GitHub Flavored Markdown
markdown = rs_trafilatura.html_to_markdown(html)

Framework Integrations

crawl4ai

import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from rs_trafilatura.crawl4ai import RsTrafilaturaStrategy

config = CrawlerRunConfig(extraction_strategy=RsTrafilaturaStrategy(output_markdown=True))
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=config)
    data = json.loads(result.extracted_content)
    print(data[0]["main_content"])

Scrapy

# settings.py
ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
}
RS_TRAFILATURA_MARKDOWN = True  # optional

# spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = "example"

    def parse(self, response):
        yield {"url": response.url, "body": response.body}
        # Pipeline adds item["extraction"] with title, main_content, page_type, etc.

Firecrawl

from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result

app = FirecrawlApp(api_key="...")
result = app.scrape("https://example.com", formats=["html"])
extracted = extract_firecrawl_result(result)
print(extracted.title, extracted.main_content, extracted.page_type)

Crawlee

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from rs_trafilatura.crawlee import extract_crawlee_context

crawler = BeautifulSoupCrawler()

@crawler.router.default_handler
async def handler(context):
    extracted = extract_crawlee_context(context)
    print(extracted.title, extracted.main_content, extracted.page_type)

Benchmarks

Tested on the Web Content Extraction Benchmark (WCXB) — 1,497 pages across 7 page types:

| System | F1 | Speed |
| --- | --- | --- |
| rs-trafilatura | 0.859 | 44 ms/page |
| MinerU-HTML (0.6B) | 0.827 | 1,570 ms/page |
| Trafilatura (Python) | 0.791 | 94 ms/page |
| ReaderLM-v2 (1.5B) | 0.741 | 10,410 ms/page |

Per-page-type F1:

| Page Type | F1 |
| --- | --- |
| Article | 0.932 |
| Documentation | 0.931 |
| Service | 0.843 |
| Forum | 0.792 |
| Collection | 0.713 |
| Listing | 0.704 |
| Product | 0.670 |

What's Inside

This package bundles four Rust crates compiled into a single Python extension:

| Crate | What it does |
| --- | --- |
| `rs-trafilatura` | Content extraction with page-type-aware profiles |
| `web-page-classifier` | XGBoost page type classification + quality prediction |
| `html-cleaning` | HTML sanitisation and tag removal |
| `quick_html2md` | HTML to GFM Markdown conversion |

Links

License

MIT OR Apache-2.0
