Murrough-Foley/rs-trafilatura-python
rs-trafilatura

Fast web content extraction, page type classification, HTML cleaning, and Markdown conversion for Python — powered by Rust.

rs-trafilatura is a Python package built with PyO3 that wraps four Rust crates into a single pip install. It extracts the main content from web pages, classifies page types, predicts extraction quality, cleans HTML, and converts HTML to Markdown — all at native Rust speed.

Why rs-trafilatura?

  • Fast: Rust-native extraction at ~44 ms per page on commodity hardware (36x faster than MinerU-HTML, the strongest neural baseline below)
  • Accurate: F1 0.859 across 7 page types on the Web Content Extraction Benchmark, outperforming Trafilatura (0.791), MinerU-HTML (0.827), and ReaderLM-v2 (0.741)
  • Page-type aware: XGBoost classifier detects articles, forums, products, collections, listings, documentation, and service pages — then applies type-specific extraction profiles
  • Quality scoring: ML-based confidence predictor (0.0–1.0) tells you when extraction might be unreliable, enabling hybrid pipelines with LLM fallback
  • Framework adapters: Drop-in integrations for crawl4ai, Scrapy, Firecrawl, and Crawlee

Install

pip install rs-trafilatura

Quick Start

import rs_trafilatura

# Extract main content from HTML
result = rs_trafilatura.extract(html, url="https://example.com")
print(result.title)                # Page title
print(result.main_content)         # Clean extracted text
print(result.page_type)            # article, forum, product, etc.
print(result.extraction_quality)   # 0.0–1.0 confidence score

API Reference

Content Extraction

# From a string
result = rs_trafilatura.extract(
    html,
    url="https://example.com",      # URL for page type classification
    page_type="product",             # Force a page type (bypasses classifier)
    favor_precision=True,            # Stricter filtering, less noise
    favor_recall=False,              # More inclusive extraction
    include_tables=True,             # Include table content
    include_images=True,             # Extract image metadata
    include_comments=False,          # Include comment sections
    output_markdown=True,            # Generate Markdown in content_markdown
)

# From raw bytes (auto-detects encoding)
result = rs_trafilatura.extract_bytes(
    response_bytes,
    url="https://example.com",
    output_markdown=True,
)
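The comment above says `extract_bytes` auto-detects encoding. As a rough illustration of why you'd pass raw bytes instead of decoding first, here is a minimal charset-sniffing sketch — the real library's detection logic is more involved; this helper and its name are purely illustrative:

```python
import re

def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Illustrative only: the charset often lives inside the HTML itself
    (e.g. <meta charset="...">), so decoding before inspecting the bytes
    can garble the page. Look in the first 2 KB for a charset declaration."""
    head = raw[:2048]
    m = re.search(rb'charset=["\']?([\w-]+)', head)
    return m.group(1).decode("ascii") if m else default
```

Decoding with the wrong charset before extraction is a common source of mojibake; handing the raw response bytes to the extractor avoids that.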

ExtractResult fields:

| Field | Type | Description |
| --- | --- | --- |
| `title` | `str \| None` | Page title |
| `author` | `str \| None` | Author name |
| `date` | `str \| None` | Publication date (ISO 8601) |
| `main_content` | `str` | Extracted main content as plain text |
| `content_markdown` | `str \| None` | Markdown output (when `output_markdown=True`) |
| `content_html` | `str \| None` | Extracted content as HTML |
| `page_type` | `str \| None` | Detected page type |
| `extraction_quality` | `float` | Confidence score (0.0–1.0) |
| `classification_confidence` | `float \| None` | Page type classifier confidence |
| `language` | `str \| None` | Detected language |
| `sitename` | `str \| None` | Site name |
| `description` | `str \| None` | Meta description |
| `images` | `list[ImageData]` | Extracted images with `src`, `alt`, `caption` |

Page Type Classification

# Fast URL-based heuristic (no HTML needed)
page_type, confidence = rs_trafilatura.classify_url("https://docs.example.com/api")
# ("documentation", 0.9) — or ("article", None) when no pattern matches
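To make the heuristic's behaviour concrete, here is a pure-Python sketch of how a URL-pattern classifier of this kind can work — the patterns, function name, and confidence values below are illustrative, not the library's actual rules:

```python
import re

# Illustrative URL patterns -> (page_type, confidence).
# Not the library's real rule set.
PATTERNS = [
    (re.compile(r"/docs?/|docs\."), "documentation", 0.9),
    (re.compile(r"/forum/|/thread/"), "forum", 0.9),
    (re.compile(r"/product/|/p/"), "product", 0.9),
]

def classify_url_sketch(url: str):
    """Return (page_type, confidence); fall back to ("article", None)
    when no pattern matches, mirroring the behaviour shown above."""
    for pattern, page_type, conf in PATTERNS:
        if pattern.search(url):
            return page_type, conf
    return "article", None
```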

# ML classifier with DOM features (higher accuracy)
page_type, confidence = rs_trafilatura.classify_page(
    numeric_features,   # 89 numeric features from the HTML DOM
    "page title text",  # Title + description for TF-IDF
)

Extraction Quality Prediction

# Predict how reliable an extraction is (for hybrid pipeline routing)
quality = rs_trafilatura.predict_quality(features)  # 27 post-extraction features
# Returns 0.0–1.0. Below 0.80 suggests routing to an LLM fallback.
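The 0.80 threshold above can be wrapped in a small routing helper for hybrid pipelines. A minimal sketch — the function name, return labels, and the idea of an "llm" path are illustrative; in practice you would pass `result.extraction_quality` from `extract()`:

```python
def route_extraction(quality: float, threshold: float = 0.80) -> str:
    """Pick the extraction path for a hybrid pipeline: trust the fast
    Rust result when predicted quality clears the threshold, otherwise
    route the page to a slower LLM-based extractor."""
    return "rust" if quality >= threshold else "llm"
```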

HTML Cleaning

# Remove scripts, styles, comments, SVGs, iframes — keep content
cleaned = rs_trafilatura.clean_html(raw_html)

HTML to Markdown

# Convert HTML to GitHub Flavored Markdown
markdown = rs_trafilatura.html_to_markdown(html)

Framework Integrations

crawl4ai

import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from rs_trafilatura.crawl4ai import RsTrafilaturaStrategy

config = CrawlerRunConfig(extraction_strategy=RsTrafilaturaStrategy(output_markdown=True))
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=config)
    data = json.loads(result.extracted_content)
    print(data[0]["main_content"])

Scrapy

# settings.py
ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
}
RS_TRAFILATURA_MARKDOWN = True  # optional

# spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = "example"

    def parse(self, response):
        yield {"url": response.url, "body": response.body}
        # Pipeline adds item["extraction"] with title, main_content, page_type, etc.

Firecrawl

from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result

app = FirecrawlApp(api_key="...")
result = app.scrape("https://example.com", formats=["html"])
extracted = extract_firecrawl_result(result)
print(extracted.title, extracted.main_content, extracted.page_type)

Crawlee

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from rs_trafilatura.crawlee import extract_crawlee_context

crawler = BeautifulSoupCrawler()

@crawler.router.default_handler
async def handler(context):
    extracted = extract_crawlee_context(context)
    print(extracted.title, extracted.main_content, extracted.page_type)

Benchmarks

Tested on the Web Content Extraction Benchmark (WCXB) — 1,497 pages across 7 page types:

| System | F1 | Speed |
| --- | --- | --- |
| rs-trafilatura | 0.859 | 44 ms/page |
| MinerU-HTML (0.6B) | 0.827 | 1,570 ms/page |
| Trafilatura (Python) | 0.791 | 94 ms/page |
| ReaderLM-v2 (1.5B) | 0.741 | 10,410 ms/page |

Per-page-type F1:

| Page Type | F1 |
| --- | --- |
| Article | 0.932 |
| Documentation | 0.931 |
| Service | 0.843 |
| Forum | 0.792 |
| Collection | 0.713 |
| Listing | 0.704 |
| Product | 0.670 |

What's Inside

This package bundles four Rust crates compiled into a single Python extension:

| Crate | What it does |
| --- | --- |
| `rs-trafilatura` | Content extraction with page-type-aware profiles |
| `web-page-classifier` | XGBoost page type classification + quality prediction |
| `html-cleaning` | HTML sanitisation and tag removal |
| `quick_html2md` | HTML to GFM Markdown conversion |

Links

License

MIT OR Apache-2.0
