LLM benchmarking framework with SystemDS, vLLM & OpenAI backends - LDE Project by kubraaksux · Pull Request #2431 · apache/systemds

kubraaksux · 2026-02-16T14:43:52Z

Adds the llmPredict DML built-in and a benchmarking framework that evaluates it against OpenAI API and vLLM across 5 workloads. Developed as part of the LDE course. (Supersedes the closed #2430.)

What this PR adds

Java (llmPredict built-in):

LlmPredictCPInstruction.java — dedicated CP instruction class, extracted from ParameterizedBuiltinCPInstruction
Structured error handling: ConnectException, SocketTimeoutException, MalformedURLException, HTTP non-200 with error body readback
Negative tests: testServerUnreachable, testInvalidUrl, testHttpErrorResponse, testMalformedJsonResponse, testMissingChoicesInResponse — all with message assertions

Python (benchmark framework in scripts/staging/llm-bench/):

Runner with OpenAI, vLLM, and SystemDS backends
5 workloads: math (GSM8K), reasoning (BoolQ), summarization (XSum), JSON extraction (CoNLL-2003), embeddings (STS-B)
SystemDS backend with JMLC pipeline latency breakdown (compile, marshal, exec, unmarshal)
Evaluation, aggregation, and HTML report generation
131 unit tests covering accuracy checks, extraction logic, runner validation
License headers on all files

Key results (n=50 per workload, 4-run APC experiment)

Metric	OpenAI gpt-4.1-mini	vLLM Qwen 3B (H100)	SystemDS Qwen 3B (H100)
Accuracy (math)	96%	68%	68%
Accuracy (reasoning)	88%	58%	58%
Accuracy (summarization)	86%	50%	62%
Accuracy (json_extraction)	61%	66%	66%
Accuracy (embeddings)	88%	90%	90%
Latency (math, mean ms)	4577	1913	1917
JMLC overhead	—	—	<3% vs vLLM (generation workloads)

SystemDS matches vLLM accuracy on 4/5 workloads exactly (math, reasoning, json_extraction, embeddings). The summarization difference (50% vs 62%) is caused by vLLM Automatic Prefix Caching (APC) — see README for the 4-run reverse-order experiment proving this.

JMLC pipeline breakdown (ms, n=50)

Workload	compile	marshal	exec/prompt	unmarshal	overhead
math	316	113	1908	0.8	483
reasoning	240	43	1128	0.8	337
summarization	304	52	355	0.8	412
json_extraction	299	48	259	0.9	403
embeddings	338	166	50	1.4	563

Cost comparison

Workload	OpenAI API Cost	vLLM Compute Cost	SystemDS Compute Cost
math	$0.0223	$0.0560	$0.0561
reasoning	$0.0100	$0.0324	$0.0332
summarization	$0.0075	$0.0107	$0.0106
json_extraction	$0.0056	$0.0078	$0.0078
embeddings	$0.0019	$0.0014	$0.0018
Total (5 workloads)	$0.047	$0.108	$0.109

How costs are computed:

OpenAI: Per-token API pricing (gpt-4.1-mini: $0.40/M input, $1.60/M output).
vLLM / SystemDS: Estimated from hardware ownership as cost = electricity + amortization, where electricity = (350 W / 1000) * (wall_s / 3600) * $0.30/kWh and amortization = ($30,000 / 15,000 h) * (wall_s / 3600), with wall_s the wall-clock run time in seconds.
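
As a worked check of the two cost formulas above, the following Python sketch computes both. The function names are illustrative assumptions and are not taken from the benchmark code; the constants are the ones quoted above.

```python
def openai_cost_usd(input_tokens, output_tokens,
                    usd_per_m_input=0.40, usd_per_m_output=1.60):
    """Per-token API cost using the gpt-4.1-mini rates quoted above."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def local_compute_cost_usd(wall_s, power_w=350.0, usd_per_kwh=0.30,
                           hardware_usd=30_000.0, lifetime_h=15_000.0):
    """Electricity plus hardware amortization for wall_s seconds of run time."""
    hours = wall_s / 3600.0
    electricity = (power_w / 1000.0) * hours * usd_per_kwh
    amortization = (hardware_usd / lifetime_h) * hours
    return electricity + amortization


# One hour of wall-clock time: $0.105 electricity + $2.00 amortization.
print(f"{local_compute_cost_usd(3600):.3f}")  # prints 2.105
```

With these constants, amortization ($2.00 per hour) dominates electricity ($0.105 per hour), so the local compute cost scales almost linearly with wall-clock time.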

Full documentation

scripts/staging/llm-bench/README.md — full methodology, all results tables, reverse-order experiment, JMLC pipeline breakdown, evaluation criteria, and setup instructions.

Generic LLM benchmark suite for evaluating inference performance across different backends (vLLM, Ollama, OpenAI, MLX).

Features:
- Multiple workload categories: math (GSM8K), reasoning (BoolQ, LogiQA), summarization (XSum, CNN/DM), JSON extraction
- Pluggable backend architecture for different inference engines
- Performance metrics: latency, throughput, memory usage
- Accuracy evaluation per workload type
- HTML report generation

This framework can be used to evaluate SystemDS LLM inference components once they are developed.

- Connection.java: Changed loadModel(modelName) to loadModel(modelName, workerScriptPath)
- Connection.java: Removed findPythonScript() method
- LLMCallback.java: Added Javadoc for generate() method
- JMLCLLMInferenceTest.java: Updated to pass script path to loadModel()

- Connection.java: Auto-find available ports for Py4J communication
- Connection.java: Add loadModel() overload for manual port override
- Connection.java: Use destroyForcibly() with waitFor() for clean shutdown
- llm_worker.py: Accept python_port as command line argument
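
The Java change itself is not shown in this PR text. As a rough illustration of the port auto-discovery idea, here is the usual bind-to-port-0 technique in Python, the language of the worker script. The helper name find_free_port is hypothetical.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


java_port = find_free_port()
python_port = find_free_port()
print(java_port, python_port)
```

The port is released when the socket closes, so another process could claim it before it is reused. A manual port override, as in the loadModel() overload mentioned above, is a reasonable escape hatch for that case.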

Move worker script from src/main/python/systemds/ to src/main/python/ to avoid shadowing Python stdlib operator module.

- Add generateWithTokenCount() returning JSON with input/output token counts
- Update generateBatchWithMetrics() to include input_tokens and output_tokens columns
- Add CUDA auto-detection with device_map=auto for multi-GPU support in llm_worker.py
- Check Python process liveness during startup instead of blind 60s timeout

- Fix duplicate accuracy computation in runner.py
- Add --model flag and error handling to run_all_benchmarks.sh
- Fix ttft_stats and timing_stats logic bugs
- Extract shared helpers into scripts/utils.py
- Add HuggingFace download fallback to all loaders
- Fix reasoning accuracy false positives with word-boundary regex
- Pin dependency versions in requirements.txt
- Clean up dead code and unify config keys across backends
- Fix README clone URL and repo structure
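
To illustrate the word-boundary fix for reasoning answers, the sketch below contrasts a naive substring check with a \b-anchored regex. The function names and the pattern are assumptions for illustration; the actual extractor in the benchmark code is not shown here.

```python
import re

YES_NO = re.compile(r"\b(yes|no)\b", re.IGNORECASE)


def naive_says_no(text):
    return "no" in text.lower()  # also fires on "not", "know", "another"


def extract_yes_no(text):
    """Return 'yes' or 'no' if it appears as a whole word, else None."""
    match = YES_NO.search(text)
    return match.group(1).lower() if match else None


reply = "I do not know another way to tell."
print(naive_says_no(reply), extract_yes_no(reply))  # True None
print(extract_yes_no("Answer: No."))                # no
```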

- Use real token counts from Ollama/vLLM APIs, omit when unavailable
- Correct TTFT and cost estimates
- Add --gpu-hour-cost and --gpu-count flags for server benchmarks

- 121 unit tests for all accuracy checkers, loaders, and metrics
- ROUGE-1/2/L scoring for summarization (replaces quality-gate heuristic)
- Concurrent request benchmarking with --concurrency flag
- GPU profiling via pynvml
- Real TTFT for MLX backend via stream_generate
- Backend factory pattern and config validation
- Proper logging across all components
- Updated configs to n_samples=50

Replace declare -A (bash 4+ only) with a case function for default model lookup. macOS ships with bash 3.x.

- New embeddings workload using STS-Benchmark from HuggingFace
- Model rates semantic similarity between sentence pairs (0-5 scale)
- 21 new tests for score extraction, accuracy check, sample loading
- Total: 142 tests passing across 5 workloads

- Add electricity + hardware amortization cost estimation to runner (--power-draw-w, --electricity-rate, --hardware-cost flags)
- Fix aggregate.py cost key mismatch (api_cost_usd vs cost_total_usd)
- Add compute cost columns to CSV output and HTML report
- Update README with cost model documentation and embeddings workload

Include all 10 benchmark runs (5 OpenAI + 5 Ollama, 50 samples each) with metrics, samples, configs, HTML report, and aggregated CSV.

- 5 workloads x 2 models on NVIDIA H100 PCIe via vLLM
- Mistral-7B-Instruct-v0.3: strong reasoning (68%), fast embeddings (129ms)
- Qwen2.5-3B-Instruct: best embeddings accuracy (90%), 75ms latency
- Compute costs reflect H100 electricity (350W) + hardware amortization
- Regenerated summary.csv and benchmark_report.html with all 20 runs

Integrate SystemDS as a benchmark backend using the JMLC API. All prompts are processed through PreparedScript.generateBatchWithMetrics() which returns results in a typed FrameBlock with per-prompt timing and token metrics. Benchmark results for 4 workloads with distilgpt2 on H100.

Run the embeddings (semantic similarity) workload with SystemDS JMLC, bringing SystemDS to 5 workloads matching all other backends.

Run all 5 workloads with Qwen/Qwen2.5-3B-Instruct through the SystemDS JMLC backend, replacing the distilgpt2 toy model. This enables a direct apples-to-apples comparison with vLLM Qwen 3B: same model, different serving path (raw HuggingFace via JMLC vs optimized vLLM inference).

Replace distilgpt2 toy model with same models used by vLLM backends:
- SystemDS + Qwen 3B (5 workloads) vs vLLM + Qwen 3B
- SystemDS + Mistral 7B (5 workloads) vs vLLM + Mistral 7B

All runs include compute cost flags (350W, $0.30/kWh, $30k hardware). Increase JMLC worker timeout from 60s to 300s for larger models.

Correct SystemDS concurrency scaling numbers to match actual metrics.json data (throughput-based instead of incorrect per-prompt estimates). Update latency table, concurrency scaling table, run_all_benchmarks.sh for automatic c=1/c=4 runs, and regenerate HTML report.

- Remove broken base SystemDS result directories (0% accuracy, 0ms latency from failed earlier run)
- Remove fabricated cost per query table (benchmarks were run without --power-draw-w/--hardware-cost flags, all cost data was $0)
- Fix accuracy claim: c=1 matches vLLM exactly, c=4 shows minor variation on reasoning (64% vs 60%) and summarization (62% vs 50%) due to vLLM batching non-determinism
- Add SystemDS c=1 and c=4 columns to accuracy tables
- Fix report.py to show c=1 and c=4 as separate backends instead of merging them into one "systemds (Qwen2.5-3B)" column
- Fix floating point truncation bug in accuracy tooltip (int(50*0.58)=28, now uses accuracy_count from metrics.json directly)
- Replace stale "Py4J bridge cost" references with "JMLC overhead"
- Regenerate HTML report and summary CSV
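
The tooltip truncation bug mentioned above is easy to reproduce: 0.58 has no exact binary representation, so multiplying by 50 lands just below 29 and int() truncates to 28. A minimal sketch in plain Python; the variable names are illustrative and not taken from report.py.

```python
n_samples = 50
accuracy = 0.58  # accuracy stored as a fraction

product = n_samples * accuracy
truncated = int(product)   # int() drops the fractional part: 28
rounded = round(product)   # rounding to nearest recovers the intended 29

print(product, truncated, rounded)
```

Reading the stored integer count (accuracy_count in metrics.json, per the commit message) avoids the round trip through a float entirely.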

…usions

Major changes:
- Restructure README: move SystemDS architecture section before results, add compilation pipeline files, add JMLC code example
- Add measurement methodology note: vLLM uses Python streaming HTTP while SystemDS uses Java non-streaming HttpURLConnection, making per-prompt latency not directly comparable across backends
- Rewrite conclusions to be evidence-based: llmPredict correctness proven by accuracy match, concurrency scaling quantified, model-vs-backend distinction made explicit, latency caveat explained
- Remove MLX from supported backends table (not benchmarked), mark as "not benchmarked" in repo structure
- Remove fabricated OpenAI cost claim ($0.02-0.03)
- Remove "All backends overview" table (redundant with other tables)
- Simplify concurrency scaling table to throughput only (remove misleading effective latency columns)
- Put accuracy table first (apples-to-apples metric) before latency

…and evaluation methodology

- Fix bold-pattern regex in math number extraction: allow arbitrary text between number and closing ** (fixes 3 false negatives in OpenAI math, 44/50 -> 47/50)
- Re-score all 30 result sets from raw samples.jsonl (only OpenAI math changed)
- Add complete cost comparison table with all backends including OpenAI API cost + local compute cost
- Add cost calculation formula with hardware assumptions
- Add evaluation methodology section explaining per-workload accuracy criteria
- Add cross-backend comparisons (SystemDS vs vLLM, OpenAI vs local, Qwen 3B vs Mistral 7B, Ollama analysis)
- Fix PR description scope: this is the benchmark framework PR, not llmPredict
- Fix hardware claims: Ollama/OpenAI ran on MacBook, not H100
- Add model names to SystemDS column headers (SystemDS Qwen 3B c=1/c=4)
- Explain Mistral's low math results (verbose output confuses extractor)
- Regenerate HTML report
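
The bold-pattern fix described above can be pictured with two regexes: one that requires the closing ** immediately after the number, and one that tolerates trailing text such as units. Both patterns and the helper name are hypothetical; the real extractor's regexes are not shown in this PR text.

```python
import re

# Hypothetical patterns for illustration only.
STRICT_BOLD = re.compile(r"\*\*\s*\$?(-?\d[\d,]*(?:\.\d+)?)\s*\*\*")
RELAXED_BOLD = re.compile(r"\*\*\s*\$?(-?\d[\d,]*(?:\.\d+)?)[^*]*\*\*")


def extract_bold_number(text, pattern):
    """Return the last bolded number in text as a float, or None."""
    matches = pattern.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


answer = "So she makes **18 dollars per day**."
print(extract_bold_number(answer, STRICT_BOLD))   # None
print(extract_bold_number(answer, RELAXED_BOLD))  # 18.0
```

A bolded answer with units after the number is a false negative under the strict pattern and a hit under the relaxed one, which is the kind of case the commit reports fixing.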

The previous explanation attributed all failures to the number extractor. Analysis of raw samples shows 20 of 31 incorrect answers were genuinely wrong (wrong formulas, negative results, refusing to solve), while only 10 had the correct answer present but extracted the wrong number.

…ests, and license headers

- Extract llmPredict logic from ParameterizedBuiltinCPInstruction into dedicated LlmPredictCPInstruction class for better separation of concerns
- Add structured error handling: ConnectException, SocketTimeoutException, MalformedURLException, HTTP non-200 responses with error body readback
- Add conn.disconnect() in finally block for proper cleanup
- Add negative tests (testServerUnreachable, testInvalidUrl) with message assertions verifying error messages reach the user
- Add Apache license headers to llm_server.py and llm_worker.py (CI fix)
- Rewrite benchmark framework with SystemDS JMLC backend, strict HuggingFace dataset loaders, and run_all_benchmarks.sh orchestration script
- Fresh benchmark results: vLLM and SystemDS with Qwen2.5-3B on H100, 5 workloads (math, reasoning, summarization, json_extraction, embeddings)

- Run OpenAI gpt-4.1-mini on all 5 workloads (math 96%, reasoning 88%, summarization 86%, json_extraction 61%, embeddings 88%)
- Update README with comprehensive results: OpenAI, vLLM Qwen 3B, and SystemDS Qwen 3B side-by-side accuracy, latency, throughput, and cost
- Regenerate summary.csv and benchmark_report.html with 15 total runs

…#2430

- Remove unused backends: mlx_backend.py, ollama_backend.py
- Remove unused files: llm_worker.py, benchmark_report.html
- Add computed electricity and hardware amortization costs to vLLM and SystemDS metrics.json files (H100: 350W, $0.30/kWh, $30k hardware, 15k hour lifetime)
- Update aggregate.py cost_per_1m logic for local backends
- Clean stale ollama/mlx references from report.py, runner.py, run_all_benchmarks.sh, requirements.txt
- Add pynvml to requirements.txt (used for GPU profiling)
- Update README with cost comparison tables and methodology
- Regenerate summary.csv

- math: remove 'last number anywhere' and 'last sentence-ending number' fallbacks from extract_number_from_response (returns None if no explicit answer marker found)
- reasoning: remove 'last short standalone line' fallback from _extract_answer (returns None if no marker found)
- embeddings: reject out-of-range scores instead of clamping (6.0 now returns -1.0 instead of 5.0)
- summarization: remove silent fallback to unigram overlap when rouge-score not installed (rouge-score is a required dependency), remove unused _tokenize helper and re import
- openai: remove str(resp) fallback when resp.output_text fails (let the error propagate instead of silently returning response repr)
- Updated tests to match new strict behavior
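
As an illustration of the reject-instead-of-clamp rule for embeddings scores, here is a minimal Python sketch. Only the behavior (6.0 yields -1.0 rather than 5.0) comes from the commit message; the helper name and the parsing details are assumptions.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_similarity_score(text, lo=0.0, hi=5.0):
    """Return the first number in text if it lies in [lo, hi], else -1.0."""
    match = _NUMBER.search(text)
    if match is None:
        return -1.0
    score = float(match.group())
    if score < lo or score > hi:
        return -1.0  # strict: reject rather than clamp to the nearest bound
    return score


print(extract_similarity_score("Score: 3.5"))  # 3.5
print(extract_similarity_score("Score: 6.0"))  # -1.0
```

Returning a sentinel keeps an invalid model output from being silently scored as the maximum similarity.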

…ed results

- Switch vllm_backend.py to stream=false to match SystemDS
- Update results/ with CUBLAS deterministic run (vLLM + SystemDS, H100, Mar 2)
- README: add CUBLAS experiment results table (207/250 = 82.8% identical)
- README: document vLLM Automatic Prefix Caching (APC) as root cause of remaining 43 divergent samples; proven by order-reversal experiment (43/43 swap, 0 exceptions)
- README: add server log evidence with prefix cache hit rate 9% -> 55%
- README: add concrete swap examples including factual hallucinations (xsum-30 athlete name, xsum-42 year, xsum-89 country) that follow cache state not backend identity
- README: explain APC mechanism: cold cache runs full prefill kernel, warm cache skips prefill and loads stored KV tensors through different code path; outputs deterministic given fixed cache state, temperature=0, deterministic cuBLAS, sequential requests
- Fix LlmPredictCPInstruction, JMLCLLMInferenceTest, workload configs, and test files

Explain why benchmark results are tracked in the repository: reproducibility, peer review, and data verification.

- Change vllm_backend.py default from port 8000 to 8080 to match systemds_backend.py
- Update README with screen-based server lifecycle and GPU troubleshooting
- Add vLLM shutdown reminder to run_all_benchmarks.sh

…esults

- Re-score json_extraction results with correct entity-level F1 evaluator (was using strict 90% field-match, now uses entity F1 >= 0.5 for NER)
  Both backends: 15.2% -> 65.2% accuracy (same model outputs, fixed scorer)
- Add reverse-order experiment section to README: SystemDS first, vLLM second confirms accuracy differences are per-backend, not APC artifacts
- Add JAR rebuild reminder to README SystemDS backend section
- Update accuracy tables and key observations to reflect new numbers
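
The entity-level F1 criterion can be sketched as follows. This assumes entities are compared as exact (text, type) pairs and that an empty prediction against an empty gold set counts as a perfect score; neither detail is confirmed by the PR text, and the example entities are made up.

```python
def entity_f1(predicted, gold):
    """F1 over sets of (entity_text, entity_type) pairs, matched exactly."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # assumption: nothing to find and nothing predicted
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def is_correct(predicted, gold, threshold=0.5):
    """A sample passes when its entity F1 reaches the threshold."""
    return entity_f1(predicted, gold) >= threshold


gold = [("Angela Merkel", "PER"), ("Berlin", "LOC"), ("EU", "ORG")]
pred = [("Angela Merkel", "PER"), ("Berlin", "LOC"), ("Germany", "LOC")]
print(round(entity_f1(pred, gold), 3), is_correct(pred, gold))  # 0.667 True
```

Under a strict field-match rule the same prediction would fail because one entity is wrong, which is consistent with the large accuracy shift reported after the scorer change.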

boolq-7 flipped to correct in both backends after _extract_boolean() fix. vLLM: 30->31, SystemDS: 32->33. All 15 result sets now verified consistent with current evaluation code.

- Fix vLLM backend: explicitly re-raise RuntimeError for model validation
- Fix report.py: calculate per-query cost correctly (total cost / total queries)
- Fix run_all_benchmarks.sh: robust argument parsing for optional model arg
- Fix json_extraction: update config comment and fix loader sample bias
- Enrich runner.py: add timestamp, platform, and config details to run_config.json
- Update README: clarify APC vs GPU non-determinism based on 4-way analysis

Made-with: Cursor

Remove shampoo optimizer, results_new experiment data, and other files that were accidentally included from the base branch.

- Add 3 mock-server negative tests (HTTP 500, malformed JSON, missing choices) using Java HttpServer, which run without an external LLM server
- Instrument SystemDS backend with 4-phase latency breakdown: compile, marshal, exec, unmarshal
- Document latency measurement methodology and all Java tests in README

- New results include compile_ms, marshal_ms, exec_wall_ms, unmarshal_ms, compile_cache_hit, and pipeline_overhead_ms per sample
- Add JMLC pipeline breakdown table to README
- Update all results tables with current numbers
- Explain embeddings overhead (+46%) due to fixed pipeline cost on short requests
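
One way to collect per-phase timings like the fields named above is a small context manager around each phase. This is a sketch under assumptions: the phase bodies are placeholders rather than the real JMLC calls, and the helper names are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(record, key):
    """Store the elapsed wall-clock time of the with-block in record[key], in ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        record[key] = (time.perf_counter() - start) * 1000.0


def run_pipeline(prompt):
    # The four phase bodies below are placeholders, not the real pipeline.
    timings = {}
    with timed(timings, "compile_ms"):
        script = ("llmPredict", prompt)
    with timed(timings, "marshal_ms"):
        payload = list(script)
    with timed(timings, "exec_wall_ms"):
        time.sleep(0.01)  # stand-in for the request to the model server
        raw = payload[1].upper()
    with timed(timings, "unmarshal_ms"):
        result = str(raw)
    return result, timings


result, timings = run_pipeline("2 + 2 = ?")
print(result, sorted(timings))
```

Timing each phase separately is what makes a fixed pipeline cost visible on short requests, where exec time is small relative to compile and marshal.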

Old results had inconsistent data: different code versions, missing cost data, missing latency breakdown fields, missing vLLM reverse. Will re-run all 4 configurations (vLLM, SystemDS, vLLM reverse, SystemDS reverse) with final code, cost flags, and breakdown instrumentation in a single session. OpenAI results retained.

Fresh runs from same code, same server, with vLLM restart between sessions. All results include cost data and JMLC latency breakdown.

Session 1 (normal): vLLM first, SystemDS second
Session 2 (reverse): SystemDS first, vLLM second

Key findings:
- SystemDS matches vLLM on 4/5 workloads (byte-for-byte identical)
- Summarization: 1st-run always 25/50, 2nd-run always 31/50 (APC)
- Same-position runs are 100% text-identical across sessions
- JMLC overhead: <3% on generation workloads, ~29% on embeddings

Document the architectural evolution from the previous Py4J callback approach (PR apache#2430) to the current llmPredict DML built-in with HTTP.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

kubraaksux added 30 commits January 19, 2026 15:04

Add LLM inference support to JMLC API via Py4J bridge

8e7d6da

Move llm_worker.py to fix Python module collision

dacdc1c

Use python3 with fallback to python in Connection.java

29f657c

Add batch inference with FrameBlock and metrics support

e40e4f2

Clean up test: extract constants and shared setup method

fdd1684

Fix fake metrics and add compute cost tracking

1510e8a

Fix bash 3.x compatibility in run_all_benchmarks.sh

a18979f

Add benchmark results for project submission

7239460

Add LLM inference support to JMLC API via Py4J bridge

fd3a117

Move llm_worker.py to fix Python module collision

0cc05f6

Use python3 with fallback to python in Connection.java

036a221

Add batch inference with FrameBlock and metrics support

ef8c1f4

Clean up test: extract constants and shared setup method

af54019

Add embeddings workload for SystemDS backend

190d952

Trim verbose docstring in systemds_backend.py

a39078c

kubraaksux added 4 commits February 16, 2026 23:38

kubraaksux force-pushed the llm-benchmark branch from cf6a6b3 to 83b90e4 Compare February 16, 2026 23:41

kubraaksux added 3 commits February 17, 2026 00:44

kubraaksux changed the title ~~LLM benchmarking framework with SystemDS & Ollama & VLLM Backends - LDE Project~~ LLM benchmarking framework with SystemDS, vLLM & OpenAI backends - LDE Project Feb 27, 2026

Update README: llmPredict implementation merged from closed PR apache…

20e666d

…#2430

kubraaksux mentioned this pull request Feb 27, 2026

Add LLM benchmarking framework with SystemDS JMLC backend kubraaksux/systemds#1

Closed

kubraaksux added 3 commits February 27, 2026 22:18

kubraaksux force-pushed the llm-benchmark branch from 2b427fb to 8684bc1 Compare March 5, 2026 00:20

kubraaksux added 6 commits March 5, 2026 04:36

Add note on committed results for reproducibility

dcb5c72

Align vLLM backend default port to 8080 and add server management docs

bb61323

Re-score reasoning results with current boolean extraction code

787d279

Clean up non-project files from branch history

e4176a2

kubraaksux force-pushed the llm-benchmark branch from 8c6d927 to e4176a2 Compare March 5, 2026 03:45

kubraaksux and others added 8 commits March 5, 2026 04:46

Remove duplicate nested directories from reverse results SCP

f3dbd3c

Fix precision loss in HTTP latency: keep float instead of int truncation

a692e00

Add gpu-apc mode to benchmark script for reverse-order APC experiment

21456e9

Add Py4J comparison section to README

6777293

kubraaksux wants to merge 91 commits into apache:main from kubraaksux:llm-benchmark
