Murrough-Foley/web-page-classifier

web-page-classifier

Fast web page type classification using an XGBoost model stored in a compact binary format.

Classifies web pages into 7 types: Article, Forum, Product, Collection, Listing, Documentation, Service.

Features

  • Three-stage classification: URL heuristics → HTML signals → ML model
  • Compact embedded model: ~1.1MB XGBoost binary (200 trees, 181 features)
  • Zero dependencies: Pure Rust, no ML frameworks required
  • Fast: Classification in <1ms per page
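
The three-stage flow can be pictured as a cascade in which each cheap stage either returns an answer or defers to the next one. The sketch below is illustrative only: the `stage_*` functions, their matching rules, and `classify_cascade` are hypothetical stand-ins, not this crate's API or its actual heuristics.

```rust
/// Page types listed in this README.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageType {
    Article,
    Forum,
    Product,
    Collection,
    Listing,
    Documentation,
    Service,
}

/// Stage 1 (hypothetical rules): decide from the URL alone when a pattern is unambiguous.
fn stage_url(url: &str) -> Option<PageType> {
    let u = url.to_ascii_lowercase();
    if u.contains("://docs.") || u.contains("/docs/") {
        Some(PageType::Documentation)
    } else if u.contains("/forum") {
        Some(PageType::Forum)
    } else if u.contains("/product/") {
        Some(PageType::Product)
    } else {
        None
    }
}

/// Stage 2 (hypothetical rules): look for strong markers in the raw HTML.
fn stage_html(html: &str) -> Option<PageType> {
    if html.contains("\"@type\":\"Product\"") {
        Some(PageType::Product)
    } else if html.contains("<article") {
        Some(PageType::Article)
    } else {
        None
    }
}

/// Stage 3 stand-in: a real implementation would evaluate the model here.
fn stage_model(_url: &str, _html: &str) -> PageType {
    PageType::Service
}

/// Runs the stages in order of cost; the first stage that returns an answer wins.
pub fn classify_cascade(url: &str, html: &str) -> PageType {
    stage_url(url)
        .or_else(|| stage_html(html))
        .unwrap_or_else(|| stage_model(url, html))
}
```

Ordering the stages from cheapest to most expensive means the model only runs for pages the simpler rules cannot settle.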

Quick Start

use web_page_classifier::{classify_url, classify_ml, PageType, N_NUMERIC_FEATURES};

// Stage 1: URL-only classification (fast, no HTML needed)
let page_type = classify_url("https://docs.example.com/api/reference");
assert_eq!(page_type, PageType::Documentation);

// Final stage: ML classification (higher accuracy, needs extracted features).
// The all-zero vector below is a placeholder for real extracted features.
let features = vec![0.0f64; N_NUMERIC_FEATURES];
let (page_type, confidence) = classify_ml(&features, "Article about technology");
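
Since `classify_ml` returns a confidence alongside the page type, a caller may want to keep the model's answer only above some threshold. The helper below is a minimal sketch of that idea: `accept_or_fallback` is hypothetical, the confidence is assumed to be an `f64`, and the local `PageType` enum is a stand-in so the snippet compiles on its own.

```rust
/// Local stand-in for the crate's `PageType`, so the sketch is self-contained.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageType {
    Article,
    Forum,
    Product,
    Collection,
    Listing,
    Documentation,
    Service,
}

/// Keeps the model's answer only when its confidence reaches `min_confidence`;
/// otherwise returns the caller-supplied fallback (for example, a URL-stage result).
/// A NaN confidence fails the comparison and therefore also yields the fallback.
pub fn accept_or_fallback(
    prediction: (PageType, f64),
    min_confidence: f64,
    fallback: PageType,
) -> PageType {
    let (page_type, confidence) = prediction;
    if confidence >= min_confidence {
        page_type
    } else {
        fallback
    }
}
```

The threshold is application-specific; nothing in this README prescribes a particular value.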

Model Details

  • Algorithm: XGBoost (200 estimators, max depth 8)
  • Features: 81 numeric (URL patterns, HTML structure, DOM signals) + 100 TF-IDF (181 total)
  • Training: 1,497 pages across 7 types with SMOTE oversampling
  • Accuracy: 87.3% (macro F1: 0.824)
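
Accuracy and macro F1 measure different things, which is why the two figures above differ: accuracy counts correct predictions overall, while macro F1 averages the per-class F1 scores without weighting by class size, so weak performance on a small class pulls it down. The sketch below computes both from a confusion matrix; the function names are illustrative and the matrix in the usage note is made-up toy data, not this project's evaluation results.

```rust
/// Accuracy from a square confusion matrix where `cm[t][p]` counts
/// samples of true class `t` that were predicted as class `p`.
pub fn accuracy(cm: &[Vec<u32>]) -> f64 {
    let correct: u32 = (0..cm.len()).map(|i| cm[i][i]).sum();
    let total: u32 = cm.iter().flatten().sum();
    correct as f64 / total as f64
}

/// Macro F1: the unweighted mean of the per-class F1 scores.
pub fn macro_f1(cm: &[Vec<u32>]) -> f64 {
    let n = cm.len();
    let mut sum = 0.0;
    for c in 0..n {
        let tp = cm[c][c] as f64;
        // Column sum: everything predicted as class `c` (TP + FP).
        let predicted: f64 = (0..n).map(|t| cm[t][c] as f64).sum();
        // Row sum: everything that truly is class `c` (TP + FN).
        let actual: f64 = cm[c].iter().map(|&v| v as f64).sum();
        // F1 = 2TP / (2TP + FP + FN); treated as 0 when the class never appears.
        let denom = predicted + actual;
        sum += if denom > 0.0 { 2.0 * tp / denom } else { 0.0 };
    }
    sum / n as f64
}
```

For the toy matrix `vec![vec![90, 0], vec![8, 2]]`, accuracy is 0.92 while macro F1 is roughly 0.645, because the second class is mostly misclassified even though it contributes few samples.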

Note on Binary Size

The embedded model adds ~1.1MB to binary size. This is the cost of shipping a production ML model with zero runtime dependencies.
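
One common way to ship a model without runtime dependencies is to compile its bytes into the executable and parse them with plain standard-library code. The sketch below assumes a made-up layout (two little-endian `u32` header fields) purely for illustration; `ModelHeader`, `parse_header`, and the `MODEL` bytes are hypothetical and do not describe this crate's actual binary format.

```rust
/// Hypothetical header for an embedded model blob.
#[derive(Debug, PartialEq)]
pub struct ModelHeader {
    pub n_trees: u32,
    pub n_features: u32,
}

/// Reads a little-endian u32 at `offset`, or `None` if the slice is too short.
fn read_u32_le(bytes: &[u8], offset: usize) -> Option<u32> {
    let end = offset.checked_add(4)?;
    let b = bytes.get(offset..end)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Parses the assumed 8-byte header: tree count, then feature count.
pub fn parse_header(blob: &[u8]) -> Option<ModelHeader> {
    Some(ModelHeader {
        n_trees: read_u32_le(blob, 0)?,
        n_features: read_u32_le(blob, 4)?,
    })
}

// In a real crate the blob would typically be pulled in at compile time, e.g.
//     static MODEL: &[u8] = include_bytes!("model.bin"); // hypothetical path
// An inline array keeps this sketch self-contained. The values mirror the
// tree and feature counts quoted in this README (200 and 181).
pub static MODEL: [u8; 8] = [200, 0, 0, 0, 181, 0, 0, 0];
```

Calling `parse_header(&MODEL)` yields a header with 200 trees and 181 features, and a truncated slice yields `None`. Bytes embedded this way become part of the compiled artifact, which is where the added binary size comes from.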

License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.

