CLEAR is an open-source toolkit for LLM error analysis using an LLM-as-a-Judge approach.
CLEAR provides systematic error analysis for:
- Single LLM Responses – Analyze quality issues in model outputs for tasks like Q&A, summarization, and generation
- Agentic Workflows – Evaluate complex workflows with multiple components, tool usage, and multi-step task trajectories
It combines automated LLM-as-a-judge evaluation with interactive dashboards to help you:
- Identify recurring error patterns across your dataset
- Quantify issue frequencies and severity
- Drill down into specific failure cases
- Prioritize improvements based on data-driven insights
CLEAR operates in two phases:
- Analysis – Generates per-instance textual feedback, identifies system-level error categories, and quantifies their frequencies.
- Interactive Dashboard – Explore aggregate visualizations, apply dynamic filters, and drill down into individual failure examples.
CLEAR supports two distinct analysis modes, each with its own pipeline, dashboard, and documentation:
Evaluate standard LLM outputs – generation quality, correctness, and recurring error patterns. Provide a CSV with prompts and responses, and CLEAR will score each instance, generate textual critiques, and surface system-level issues.

| Input | CSV with model inputs and responses |
|---|---|
| Output | Per-record scores, evaluation text, aggregated issue categories |
| Dashboard | Streamlit-based interactive explorer |
Evaluate multi-agent system trajectories – step-by-step agent interactions and full trajectory analysis. Supports traces from LangGraph, CrewAI, and other frameworks via MLflow or Langfuse.

| Input | Raw JSON traces or preprocessed trajectory CSVs |
|---|---|
| Output | Per-step CLEAR analysis, trajectory-level scores, rubric evaluations |
| Dashboard | NiceGUI-based workflow visualization with path and temporal analysis |
Agentic Workflows Guide → | Agentic Dashboard Guide →
| Feature | Description |
|---|---|
| LLM-as-a-Judge | Automated evaluation for any text generation task |
| Agentic Workflows | Evaluate agent trajectories, step by step and as a whole |
| Multiple Backends | LangChain, LiteLLM (100+ providers), or direct HTTP endpoints |
| External Judges | Plug in custom evaluation functions |
| Interactive Dashboards | Standard and agentic-specific visualizations |
| Flexible Configuration | YAML config files, CLI flags, or Python API |
Requires Python 3.10+
Install from PyPI:

```shell
pip install clear-eval
```

Or install from source:

```shell
git clone https://github.com/IBM/CLEAR.git
cd CLEAR
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .
```

CLEAR requires a supported LLM provider. Set the appropriate environment variables for your provider (e.g., `OPENAI_API_KEY` for OpenAI). See the Providers and Credentials Guide for all supported providers and backends.
With no data path specified, CLEAR runs on a built-in GSM8K sample dataset using default settings:
```shell
run-clear-eval-analysis --provider openai --eval-model-name gpt-4o
```

Results are saved to `results/gsm8k/sample_output/`.
```shell
run-clear-eval-analysis \
  --provider openai \
  --eval-model-name gpt-4o \
  --data-path path/to/your_data.csv \
  --output-dir results/my_run/ \
  --run-name my_run
```

Your CSV should contain at minimum `id`, `model_input`, and `response` columns. See the LLM Analysis Guide for the full input format specification.
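For reference, here is a minimal sketch of building a valid input file with the standard library. The column names follow the specification above; the two example rows are illustrative placeholders, not real data:

```python
import csv

# Minimal input CSV with the three required columns: id, model_input, response.
# Extra columns may be present; see the LLM Analysis Guide for the full format.
rows = [
    {"id": "1", "model_input": "What is 2 + 2?", "response": "4"},
    {"id": "2", "model_input": "Name the capital of France.", "response": "Paris"},
]
with open("my_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "model_input", "response"])
    writer.writeheader()
    writer.writerows(rows)
```

The resulting `my_data.csv` can then be passed via `--data-path`.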
```shell
run-clear-eval-dashboard
```

Upload the generated ZIP file from the results directory to explore issues, scores, and individual examples.
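If you are unsure which file to upload, a small helper like the following can list the ZIP archives under a results directory. This is a hypothetical convenience sketch, not part of CLEAR, and it assumes only that the run produces `.zip` files somewhere under the output directory:

```python
from pathlib import Path

def find_result_archives(results_dir):
    """Return all ZIP archives found under results_dir, sorted by path.

    The exact archive name depends on your run configuration, so this
    simply globs for *.zip rather than assuming a fixed filename.
    """
    return sorted(Path(results_dir).glob("**/*.zip"))
```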
```shell
# Full pipeline
run-clear-eval-analysis --provider openai --eval-model-name gpt-4o --config_path path/to/config.yaml

# Evaluation only (using existing responses)
run-clear-eval-evaluation --provider openai --eval-model-name gpt-4o --config_path path/to/config.yaml
```

The same pipeline can also be run from Python:

```python
from clear_eval.analysis_runner import run_clear_eval_analysis

run_clear_eval_analysis(
    run_name="my_run",
    provider="openai",
    data_path="my_data.csv",
    eval_model_name="gpt-4o",
    output_dir="results/",
)
```

For agentic workflows:

```shell
run-clear-agentic-eval \
  --data-dir data/my_traces \
  --results-dir results \
  --from-raw-traces true \
  --eval-model-name gpt-4o \
  --provider openai

# Launch agentic dashboard
run-clear-agentic-dashboard
```

See the Agentic Workflows Guide for full details.
| Guide | Description |
|---|---|
| LLM Analysis Guide | Full pipeline reference – input formats, CLI arguments, configuration, and external judges |
| Agentic Workflows Guide | Multi-agent evaluation – trace preprocessing, step-by-step and trajectory analysis, configuration reference |
| Agentic Dashboard Guide | Dashboard features – workflow view, node analysis, trajectory explorer, path and temporal analysis |
| Providers and Credentials | Inference backends (LangChain, LiteLLM, Endpoint), provider setup, and configuration examples |
| Provider | Backend | Credentials |
|---|---|---|
| OpenAI | LangChain, LiteLLM, Endpoint | OPENAI_API_KEY |
| WatsonX | LangChain, LiteLLM, Endpoint | WATSONX_APIKEY, WATSONX_URL, WATSONX_PROJECT_ID |
| Anthropic | LiteLLM | ANTHROPIC_API_KEY |
| AWS Bedrock | LiteLLM | AWS credentials |
| Google Vertex AI | LiteLLM | GCP credentials |
| 100+ more | LiteLLM | Provider-specific |
See the Providers and Credentials Guide for backend configuration details and examples.
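To fail fast when credentials are missing, a pre-flight check like the one below can help. This is a hypothetical sketch, not part of CLEAR; the variable names are taken directly from the table above:

```python
import os

# Required environment variables per provider, per the table above.
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "watsonx": ["WATSONX_APIKEY", "WATSONX_URL", "WATSONX_PROJECT_ID"],
    "anthropic": ["ANTHROPIC_API_KEY"],
}

def missing_credentials(provider):
    """Return the required variables that are unset or empty for a provider."""
    return [v for v in REQUIRED_VARS.get(provider, []) if not os.environ.get(v)]
```

Running this before launching an analysis gives an immediate, readable error instead of a mid-run authentication failure.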
```
CLEAR/
├── README.md                        # This file
├── src/clear_eval/
│   ├── pipeline/                    # LLM analysis pipeline
│   ├── dashboard/                   # LLM dashboard (Streamlit)
│   ├── agentic/
│   │   ├── README.md                # Agentic Workflows Guide
│   │   ├── pipeline/                # Agentic pipeline
│   │   ├── dashboard/
│   │   │   └── README_DASHBOARD.md  # Agentic Dashboard Guide
│   │   └── ...
│   └── sample_data/                 # Sample datasets
├── docs/
│   ├── ANALYSIS_README.md           # LLM Analysis Guide
│   └── PROVIDERS.md                 # Providers and Credentials Guide
├── examples/                        # Configuration examples
└── tests/                           # Test suite
```
Apache 2.0 – see LICENSE for details.