
allignement #1066

Merged
VinciGit00 merged 14 commits into pre/beta from main
Apr 19, 2026
Conversation

@VinciGit00
Member

No description provided.

VinciGit00 and others added 13 commits March 31, 2026 09:41
Update all SDK usage to match the new v2 API from ScrapeGraphAI/scrapegraph-py#82:
- smartscraper() → extract(url=, prompt=)
- searchscraper() → search(query=)
- markdownify() → scrape(url=)
- Bump dependency to scrapegraph-py>=2.0.0

BREAKING CHANGE: requires scrapegraph-py v2.0.0+

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
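The method renames above can be sketched as a thin compatibility shim for call sites that still use the v1 names. This is illustrative only: the `Client` class below is a stand-in, not the real scrapegraph-py client, and the v1 parameter names (`website_url`, `user_prompt`) are assumptions.

```python
class Client:
    """Stand-in for the scrapegraph-py v2 client surface."""

    def extract(self, url, prompt):      # was: smartscraper()
        return {"url": url, "prompt": prompt}

    def search(self, query):             # was: searchscraper()
        return {"query": query}

    def scrape(self, url):               # was: markdownify()
        return {"url": url}


class LegacyShim:
    """Accept the old v1 method names and forward to the v2 methods."""

    def __init__(self, client):
        self._client = client

    def smartscraper(self, website_url, user_prompt):
        return self._client.extract(url=website_url, prompt=user_prompt)

    def searchscraper(self, user_prompt):
        return self._client.search(query=user_prompt)

    def markdownify(self, website_url):
        return self._client.scrape(url=website_url)
```

A shim like this lets a large call graph migrate one site at a time instead of in a single breaking sweep.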
…eeded)

Closes #1055

Plasmate (https://github.com/plasmate-labs/plasmate) is an open-source
Rust browser engine that outputs Structured Object Model (SOM) instead
of raw HTML. It requires no Chrome process, uses ~64MB RAM per session
vs ~300MB, and delivers 10-100x fewer tokens per page.

Changes:
- Add scrapegraphai/docloaders/plasmate.py: PlasmateLoader
  - Implements BaseLoader (lazy_load + alazy_load)
  - Calls plasmate binary via subprocess (pip install plasmate)
  - Supports output_format: 'text' (default), 'som', 'markdown', 'links'
  - Supports --selector, --header, --timeout flags
  - Optional fallback_to_chrome=True for JS-heavy SPAs
  - Async-safe: runs subprocess in executor thread pool
- Update scrapegraphai/docloaders/__init__.py: export PlasmateLoader
- Update scrapegraphai/nodes/fetch_node.py: support plasmate config dict
  in FetchNode (alongside browser_base and scrape_do)
- Add tests/test_plasmate.py: 25 unit tests (init, cmd building,
  lazy_load, alazy_load, fallback, error handling)

Usage:
  from scrapegraphai.docloaders import PlasmateLoader

  loader = PlasmateLoader(
      urls=['https://docs.python.org/3/library/json.html'],
      output_format='text',
      timeout=30,
      fallback_to_chrome=True,  # optional: retry with Chrome for SPAs
  )
  docs = loader.load()

  # Or via FetchNode config:
  graph_config = {
      'plasmate': {
          'output_format': 'text',
          'timeout': 30,
          'fallback_to_chrome': False,
      }
  }
feat: add PlasmateLoader as lightweight scraping backend (no Chrome needed)
## [1.76.0](v1.75.1...v1.76.0) (2026-04-09)

### Features

* add PlasmateLoader as lightweight scraping backend (no Chrome needed) ([9dd1fb5](9dd1fb5)), closes [#1055](#1055)

### CI

* reduce GitHub Actions costs by ~85% on PRs ([403080a](403080a))
- Pass output_schema to extract() so Pydantic schemas are forwarded to the v2 API
- Use context manager pattern (with Client(...) as client) for proper resource cleanup
- Simplify examples to match the v2 SDK style from scrapegraph-py
- Remove unused sgai_logger import (v2 client handles its own logging)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
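The context-manager pattern and `output_schema` forwarding described above can be sketched as follows. The `Client` and `PageSchema` classes here are stand-ins (not the real scrapegraph-py client or a real Pydantic model); only the shape of the call is meant to match the commit.

```python
class Client:
    """Stand-in client showing the `with Client(...) as client` pattern."""

    def __init__(self, api_key):
        self.api_key = api_key
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()       # guarantee cleanup even on exceptions
        return False

    def close(self):
        self.closed = True

    def extract(self, url, prompt, output_schema=None):
        # A real client would forward output_schema to the v2 API so the
        # response is validated against the Pydantic model.
        return {"url": url, "prompt": prompt,
                "schema": getattr(output_schema, "__name__", None)}


class PageSchema:  # stand-in for a Pydantic model
    pass


with Client(api_key="sgai-...") as client:
    result = client.extract(
        url="https://example.com",
        prompt="Extract the page title",
        output_schema=PageSchema,
    )
```

The `with` block guarantees the underlying HTTP session is closed whether or not the extraction raises, which is the resource-cleanup point the commit makes.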
Support both the v2 Client API (PR #82) and the newer ScrapeGraphAI API
(PR #84) which uses Pydantic request models and ApiResult[T] wrappers.

- Add scrapegraph_py_compat helper with runtime API detection
- Route smart_scraper_graph through the compat layer
- Add v3-style examples for extract, search, and scrape

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
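Runtime API detection of the kind this commit describes can be sketched with attribute probes. The probe names below (`ApiResult`, `Client`) are assumptions about what PR #84 and PR #82 expose; the real `scrapegraph_py_compat` helper may inspect something else.

```python
import types


def detect_api(mod):
    """Guess which scrapegraph-py API surface a module exposes.

    Heuristic sketch only: the newer surface (PR #84) is assumed to ship
    an ApiResult wrapper type, while the v2 surface (PR #82) exposes a
    plain keyword-argument Client.
    """
    if hasattr(mod, "ApiResult"):
        return "v3"   # Pydantic request models + ApiResult[T] wrappers
    if hasattr(mod, "Client"):
        return "v2"   # plain keyword-argument Client methods
    raise RuntimeError("unsupported scrapegraph-py API surface")


# Simulated modules standing in for the two SDK generations:
v3_sdk = types.SimpleNamespace(ApiResult=object, Client=object)
v2_sdk = types.SimpleNamespace(Client=object)
```

Routing `smart_scraper_graph` through a probe like this keeps one code path working against both installed SDK generations.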
scrapegraph-py 2.0.0 requires Python >=3.12, so bump the project's
requires-python to match. Simplify the test workflow to a single
unit-test job on Python 3.12 / ubuntu-latest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
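The corresponding `pyproject.toml` change would look roughly like the fragment below (standard PEP 621 metadata fields; the dependency pin mirrors the earlier commit, but the surrounding file contents are assumptions):

```toml
[project]
requires-python = ">=3.12"
dependencies = [
    "scrapegraph-py>=2.0.0",
]
```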
Removed CodeQL badge from the README.
Removed the hero image section from the README.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ph-py-v2

feat!: migrate to scrapegraph-py v2 API surface
Comment thread tests/test_plasmate.py
cmd = loader._build_cmd("https://example.com")
assert "plasmate" in cmd[0]
assert "fetch" in cmd
assert "https://example.com" in cmd
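A command builder consistent with these assertions might look like the sketch below. The flag names (`--format`, `--selector`, `--header`, `--timeout`) are inferred from the PR description, not taken from the actual `PlasmateLoader` implementation.

```python
def build_cmd(url, output_format="text", selector=None,
              headers=None, timeout=None):
    """Assemble a plasmate CLI invocation: plasmate fetch <url> [flags]."""
    cmd = ["plasmate", "fetch", url, "--format", output_format]
    if selector:
        cmd += ["--selector", selector]
    for key, value in (headers or {}).items():
        cmd += ["--header", f"{key}: {value}"]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    return cmd
```

Keeping the builder as a pure function of its arguments is what makes assertions like the ones in this thread possible without spawning a subprocess.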
Comment thread tests/test_plasmate.py
docs = asyncio.run(run())
assert len(docs) == 2
sources = {d.metadata["source"] for d in docs}
assert "https://a.com" in sources
Comment thread tests/test_plasmate.py
assert len(docs) == 2
sources = {d.metadata["source"] for d in docs}
assert "https://a.com" in sources
assert "https://b.com" in sources
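The async-safe behaviour these assertions exercise — running the blocking subprocess call off the event loop — follows the standard `run_in_executor` pattern. The sketch below substitutes a plain function for the real subprocess call; `fake_fetch` and the document shape are stand-ins.

```python
import asyncio


async def alazy_load_sketch(urls, fetch_blocking):
    # Dispatch each blocking fetch to the default thread pool so the
    # event loop is never blocked, then gather the results in order.
    loop = asyncio.get_running_loop()
    return await asyncio.gather(
        *(loop.run_in_executor(None, fetch_blocking, url) for url in urls)
    )


def fake_fetch(url):  # stand-in for the plasmate subprocess call
    return {"source": url, "text": f"content of {url}"}


docs = asyncio.run(
    alazy_load_sketch(["https://a.com", "https://b.com"], fake_fetch)
)
```

`run_in_executor` is why `alazy_load` can coexist with other coroutines even though the underlying binary invocation is synchronous.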
Comment on lines +12 to +34
name: Unit Tests
runs-on: ubuntu-latest

strategy:
fail-fast: false
matrix:
test-group: [smart-scraper, multi-graph, file-formats]

steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
python-version: '3.12'

- name: Install uv
uses: astral-sh/setup-uv@v4

- name: Install dependencies
run: |
uv sync
run: uv sync

- name: Install Playwright browsers
run: |
uv run playwright install chromium

- name: Run integration tests
env:
OPENAI_APIKEY: ${{ secrets.OPENAI_APIKEY }}
ANTHROPIC_APIKEY: ${{ secrets.ANTHROPIC_APIKEY }}
GROQ_APIKEY: ${{ secrets.GROQ_APIKEY }}
run: |
uv run pytest tests/integration/ -m integration --integration -v

- name: Upload test results
uses: actions/upload-artifact@v4
if: always()
with:
name: integration-test-results-${{ matrix.test-group }}
path: |
htmlcov/
benchmark_results/

benchmark-tests:
name: Performance Benchmarks
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'

steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'

- name: Install uv
uses: astral-sh/setup-uv@v4

- name: Install dependencies
run: |
uv sync

- name: Run performance benchmarks
env:
OPENAI_APIKEY: ${{ secrets.OPENAI_APIKEY }}
run: |
uv run pytest tests/ -m benchmark --benchmark -v

- name: Upload benchmark results
uses: actions/upload-artifact@v4
with:
name: benchmark-results
path: benchmark_results/

- name: Compare with baseline
if: github.event_name == 'pull_request'
run: |
# Download baseline from main branch
# Compare and comment on PR if regression detected
echo "Benchmark comparison would run here"

code-quality:
name: Code Quality Checks
runs-on: ubuntu-latest
if: github.event_name == 'push'

steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'

- name: Install uv
uses: astral-sh/setup-uv@v4

- name: Install dependencies
run: |
uv sync

- name: Run Ruff linting
run: |
uv run ruff check scrapegraphai/ tests/

- name: Run Black formatting check
run: |
uv run black --check scrapegraphai/ tests/

- name: Run isort check
run: |
uv run isort --check-only scrapegraphai/ tests/

- name: Run type checking with mypy
run: |
uv run mypy scrapegraphai/
continue-on-error: true
run: uv run playwright install chromium

test-coverage-report:
name: Test Coverage Report
needs: [unit-tests, integration-tests]
runs-on: ubuntu-latest
if: always()

steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Download coverage artifacts
uses: actions/download-artifact@v4

- name: Generate coverage report
run: |
echo "Coverage report generation would run here"

- name: Comment coverage on PR
if: github.event_name == 'pull_request'
uses: py-cov-action/python-coverage-comment-action@v3
with:
GITHUB_TOKEN: ${{ github.token }}

test-summary:
name: Test Summary
needs: [unit-tests, integration-tests, code-quality]
runs-on: ubuntu-latest
if: always()

steps:
- name: Check test results
run: |
echo "All test jobs completed"
echo "Unit tests: ${{ needs.unit-tests.result }}"
echo "Integration tests: ${{ needs.integration-tests.result }}"
echo "Code quality: ${{ needs.code-quality.result }}"
- name: Run unit tests
run: uv run pytest tests/ -m "unit or not integration"
@dosubot dosubot bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Apr 19, 2026
## [2.0.0](v1.76.0...v2.0.0) (2026-04-19)

### ⚠ BREAKING CHANGES

* requires scrapegraph-py v2.0.0+

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

### Features

* add scrapegraph-py PR [#84](#84) SDK compatibility ([e8b2a28](e8b2a28)), closes [#82](#82)
* align with scrapegraph-py v2 API surface from PR [#82](#82) ([c0f5fd5](c0f5fd5))
* migrate to scrapegraph-py v2 API surface ([fd23bb0](fd23bb0)), closes [ScrapeGraphAI/scrapegraph-py#82](ScrapeGraphAI/scrapegraph-py#82)

### CI

* bump min Python to 3.12 and trim test suite ([5fda03f](5fda03f))
@dosubot dosubot bot added the enhancement New feature or request label Apr 19, 2026
@VinciGit00 VinciGit00 merged commit 1bc4c49 into pre/beta Apr 19, 2026
6 of 7 checks passed
@github-actions

🎉 This PR is included in version 2.1.0-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

Labels

enhancement New feature or request released on @dev size:XL This PR changes 500-999 lines, ignoring generated files.
