64 changes: 64 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,64 @@
# Changelog

All notable changes to `alignrl` are documented in this file. The format is
based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this
project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.4.0] - 2026-04-13

### Added

- **Expanded public API.** `alignrl` now lazily exports reward helpers
(`math_verify_reward`, `format_reward`, `extract_answer`), evaluation
helpers (`compare_stages`, `parse_results`, `BENCHMARK_PRESETS`), and the
`Trainer` protocol. `dir(alignrl)` reflects every lazy export for better
IDE discoverability.
- **`BaseTrainConfig.to_yaml()`** — serialize a validated config back to YAML
  for round-tripping from CLI overrides to committed config files. When a
  path is given, parent directories are created and the file is written
  (see the sketch after this list).
- **`alignrl version` subcommand** and a top-level `-V` / `--version` flag
on the CLI.
- **CLI `eval` flags**: `--num-fewshot` and `--batch-size` for configuring
few-shot prompting and lm-eval batch size from the command line.
- **CLI `serve` flags**: `--temperature` and `--max-tokens` are now piped
through to every `ModelServer` in the Gradio comparison demo.
- **Config validation.** `BaseTrainConfig` now uses Pydantic field
constraints (`gt=0`, `ge=0`, etc.) for numeric fields and `extra="forbid"`
so typos in YAML configs fail loudly at load time instead of silently
falling back to defaults.
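
A minimal sketch of the `to_yaml()` round-trip promised above (the YAML
paths and the `learning_rate` override are illustrative, not part of this
release):

```python
from pathlib import Path

from alignrl import BaseTrainConfig

# Load a committed config, apply a CLI-style override, and write the
# validated result back out as YAML.
base = BaseTrainConfig.from_yaml(Path("configs/sft.yaml"))
tuned = base.model_copy(update={"learning_rate": 1e-4})

# to_yaml() returns the YAML string; with a path it also creates parent
# directories and writes the file.
tuned.to_yaml(Path("configs/sft-lr1e-4.yaml"))
assert BaseTrainConfig.from_yaml(Path("configs/sft-lr1e-4.yaml")) == tuned
```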

### Changed

- **Reward normalization is more robust.** `_normalize_numeric` now handles
thousands separators (`1,234`), currency prefixes (`$42`, `\$42`),
trailing percent (`50%`), and strips trailing periods. `_answers_match`
performs case-insensitive comparison before numeric normalization.
- **`extract_answer` supports more formats.** The regex now matches
  `final answer: X` and `answer X` variants, accepts commas inside numeric
  answers, and unwraps `\text{...}` inside `\boxed{...}` groups (see the
  sketch after this list).
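
A rough sketch of the new extraction behavior (the exact signature of
`extract_answer` isn't shown in this diff, so `extract_answer(text)`
returning the extracted string is an assumption based on the entries above):

```python
from alignrl import extract_answer

samples = [
    "final answer: 1,234",   # 'final answer:' variant with comma grouping
    "answer 42",             # bare 'answer X' variant
    r"\boxed{\text{7}}",     # \text{...} unwrapped inside \boxed{...}
    "The answer is 12.",     # trailing period no longer carried along
]
for text in samples:
    print(extract_answer(text))
```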

### Fixed

- Trailing punctuation (`.`, `,`, `;`, `:`) is no longer carried into
extracted answers from `"the answer is …"` / `"= …"` patterns, which
previously caused spurious reward mismatches.

## [0.3.0] - 2026-03-25

### Added

- Public API lazy-imports surface (`alignrl.SFTConfig`, `alignrl.GRPORunner`, …).
- W&B integration: `detect_wandb`, `log_eval_to_wandb`, CLI `--wandb` flag.
- HuggingFace Hub helpers: `push_adapter`, `merge_and_push`.
- Benchmark presets for `EvalConfig` (`core`, `reasoning`, `leaderboard`).
- Docker support (Dockerfile, docker-compose) for GPU-ready workflows.

### Fixed

- Guard against empty `loss_history` in all trainers.
- Copy preset list to prevent aliasing mutation in `EvalConfig`.
- Cache lazy imports in module globals after first resolution.
- Pass LoRA adapter to vLLM via `LoRARequest` instead of silently dropping it.

[0.4.0]: https://github.com/sacredvoid/alignrl/releases/tag/v0.4.0
[0.3.0]: https://github.com/sacredvoid/alignrl/releases/tag/v0.3.0
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "alignrl"
version = "0.3.0"
version = "0.4.0"
description = "LLM post-training playbook: SFT, GRPO, DPO, eval, and inference"
readme = "README.md"
license = "MIT"
24 changes: 23 additions & 1 deletion src/alignrl/__init__.py
@@ -4,25 +4,42 @@

import importlib

__version__ = "0.3.0"
__version__ = "0.4.0"

_LAZY_IMPORTS: dict[str, str] = {
# Config
"BaseTrainConfig": "alignrl.config",
# SFT
"SFTConfig": "alignrl.sft",
"SFTRunner": "alignrl.sft",
# DPO
"DPOConfig": "alignrl.dpo",
"DPORunner": "alignrl.dpo",
# GRPO
"GRPOConfig": "alignrl.grpo",
"GRPORunner": "alignrl.grpo",
# Evaluation
"EvalConfig": "alignrl.eval",
"EvalRunner": "alignrl.eval",
"compare_stages": "alignrl.eval",
"parse_results": "alignrl.eval",
"BENCHMARK_PRESETS": "alignrl.eval",
# Inference
"InferenceConfig": "alignrl.inference",
"ModelServer": "alignrl.inference",
"build_prompt": "alignrl.inference",
# Shared types / protocols
"TrainResult": "alignrl.types",
"EvalResult": "alignrl.types",
"Trainer": "alignrl.types",
# Rewards
"math_verify_reward": "alignrl.rewards",
"format_reward": "alignrl.rewards",
"extract_answer": "alignrl.rewards",
# HF Hub helpers
"push_adapter": "alignrl.hub",
"merge_and_push": "alignrl.hub",
# W&B integration
"detect_wandb": "alignrl.callbacks",
"log_eval_to_wandb": "alignrl.callbacks",
}
@@ -37,4 +54,9 @@ def __getattr__(name: str):
raise AttributeError(f"module 'alignrl' has no attribute {name!r}")


def __dir__() -> list[str]:
"""Expose lazy exports in ``dir(alignrl)`` for discoverability."""
return sorted([*_LAZY_IMPORTS, "__version__"])


__all__ = [*_LAZY_IMPORTS, "__version__"]
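
The export table above resolves through PEP 562 module-level `__getattr__`,
so `import alignrl` stays cheap. A short usage sketch:

```python
import alignrl

# dir() lists every lazy export without importing any submodule.
assert "GRPORunner" in dir(alignrl)

# First attribute access imports alignrl.grpo and caches the symbol in the
# package globals, so later lookups bypass __getattr__ entirely.
runner_cls = alignrl.GRPORunner
```
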
57 changes: 56 additions & 1 deletion src/alignrl/cli.py
@@ -8,6 +8,13 @@
from pathlib import Path


def cmd_version(args: argparse.Namespace) -> None:
"""Print the installed alignrl version."""
from alignrl import __version__

print(f"alignrl {__version__}")


def cmd_train(args: argparse.Namespace) -> None:
"""Run a training pipeline."""
config_path = Path(args.config)
@@ -59,6 +66,10 @@ def cmd_eval(args: argparse.Namespace) -> None:
config_kwargs["tasks"] = args.tasks.split(",")
if args.preset:
config_kwargs["preset"] = args.preset
if getattr(args, "num_fewshot", None) is not None:
config_kwargs["num_fewshot"] = args.num_fewshot
if getattr(args, "batch_size", None) is not None:
config_kwargs["batch_size"] = args.batch_size

config = EvalConfig(**config_kwargs)
runner = EvalRunner(config)
@@ -89,14 +100,32 @@ def cmd_serve(args: argparse.Namespace) -> None:
name, _, path = spec.partition("=")
stages[name] = path if path else None

demo = create_demo(stages=stages, model_name=args.model)
demo_kwargs: dict = {"stages": stages, "model_name": args.model}
if getattr(args, "temperature", None) is not None:
demo_kwargs["temperature"] = args.temperature
if getattr(args, "max_tokens", None) is not None:
demo_kwargs["max_tokens"] = args.max_tokens

demo = create_demo(**demo_kwargs)
demo.launch(server_name="0.0.0.0", server_port=args.port, share=args.share)


def main() -> None:
from alignrl import __version__

parser = argparse.ArgumentParser(prog="alignrl", description="LLM Post-Training Playbook")
parser.add_argument(
"-V",
"--version",
action="version",
version=f"alignrl {__version__}",
)
sub = parser.add_subparsers(dest="command", required=True)

# Version (as a subcommand for scripting use)
version_p = sub.add_parser("version", help="Print the installed alignrl version")
version_p.set_defaults(func=cmd_version)

# Train
train_p = sub.add_parser("train", help="Run training pipeline")
train_p.add_argument("stage", choices=["sft", "grpo", "dpo"])
@@ -118,6 +147,19 @@ def main() -> None:
choices=["core", "reasoning", "leaderboard"],
help="Benchmark preset (default: core)",
)
eval_p.add_argument(
"--num-fewshot",
dest="num_fewshot",
type=int,
default=None,
help="Number of few-shot examples (default: 0)",
)
eval_p.add_argument(
"--batch-size",
dest="batch_size",
default=None,
help="Batch size for evaluation (e.g. 'auto', 8)",
)
eval_p.add_argument("--limit", type=int, default=None)
eval_p.add_argument("--output", default="./results")
eval_p.add_argument("--wandb", action="store_true", help="Log results to W&B")
@@ -134,6 +176,19 @@ def main() -> None:
)
serve_p.add_argument("--port", type=int, default=7860)
serve_p.add_argument("--share", action="store_true")
serve_p.add_argument(
"--temperature",
type=float,
default=None,
help="Sampling temperature for generation (default: 0.7)",
)
serve_p.add_argument(
"--max-tokens",
dest="max_tokens",
type=int,
default=None,
help="Maximum tokens to generate per response (default: 512)",
)
serve_p.set_defaults(func=cmd_serve)

args = parser.parse_args()
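
A design note on the new flags: they default to `None` and are copied into
the config kwargs only when actually supplied, so `EvalConfig` and
`ModelServer` defaults stay authoritative. A minimal sketch of the pattern:

```python
import argparse

parser = argparse.ArgumentParser()
# default=None distinguishes "flag not passed" from "flag passed with the
# documented default value" (0 few-shot examples is a legitimate choice).
parser.add_argument("--num-fewshot", dest="num_fewshot", type=int, default=None)
args = parser.parse_args(["--num-fewshot", "5"])

config_kwargs: dict = {}
if getattr(args, "num_fewshot", None) is not None:
    config_kwargs["num_fewshot"] = args.num_fewshot  # only set when supplied
```
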
52 changes: 38 additions & 14 deletions src/alignrl/config.py
@@ -6,33 +6,43 @@
from typing import TYPE_CHECKING

import yaml
from pydantic import BaseModel, Field
from pydantic import BaseModel, ConfigDict, Field

if TYPE_CHECKING:
from typing_extensions import Self


class BaseTrainConfig(BaseModel):
"""Shared training configuration."""
"""Shared training configuration.

All training stage configs (SFT, GRPO, DPO) inherit from this. Fields
are validated at construction time via Pydantic, so malformed YAML
files fail fast rather than partway through a training run.
"""

# Use a custom config dict to forbid unknown keys. This catches typos
# in YAML files (e.g. ``learnign_rate: 2e-4``) before any training
# begins, which is much friendlier than a silent default.
model_config = ConfigDict(extra="forbid")

model_name: str = "Qwen/Qwen2.5-3B"
output_dir: Path = Path("./outputs")
max_seq_length: int = 2048
per_device_train_batch_size: int = 4
gradient_accumulation_steps: int = 4
learning_rate: float = 2e-4
num_train_epochs: int = 1
max_steps: int = -1
warmup_steps: int = 10
max_seq_length: int = Field(default=2048, gt=0)
per_device_train_batch_size: int = Field(default=4, gt=0)
gradient_accumulation_steps: int = Field(default=4, gt=0)
learning_rate: float = Field(default=2e-4, gt=0)
num_train_epochs: int = Field(default=1, ge=0)
max_steps: int = Field(default=-1, ge=-1)
warmup_steps: int = Field(default=10, ge=0)
optim: str = "adamw_8bit"
seed: int = 42
seed: int = Field(default=42, ge=0)
report_to: str = "none"
logging_steps: int = 10
logging_steps: int = Field(default=10, gt=0)

# LoRA
lora_r: int = 16
lora_alpha: int = 32
lora_dropout: float = 0.0
lora_r: int = Field(default=16, gt=0)
lora_alpha: int = Field(default=32, gt=0)
lora_dropout: float = Field(default=0.0, ge=0.0, le=1.0)
lora_target_modules: list[str] = Field(
default=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
)
@@ -42,10 +52,24 @@ class BaseTrainConfig(BaseModel):

@classmethod
def from_yaml(cls, path: Path) -> Self:
"""Load a config from a YAML file. Missing keys use defaults."""
with open(path) as f:
data = yaml.safe_load(f)
return cls(**(data or {}))

def to_yaml(self, path: Path | None = None) -> str:
"""Serialize the config to YAML.

Returns the YAML string. If ``path`` is provided, also writes the
YAML to disk (parent directories are created as needed).
"""
data = self.model_dump(mode="json")
text: str = yaml.safe_dump(data, sort_keys=False, default_flow_style=False)
if path is not None:
Path(path).parent.mkdir(parents=True, exist_ok=True)
Path(path).write_text(text)
return text


# ChatML template used as fallback when the tokenizer doesn't have one set.
# This is the standard format for Qwen, Yi, and many other models.
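
A quick illustration of the fail-loudly behavior that `extra="forbid"` and
the field constraints buy (the misspelled key mirrors the comment in the
diff; `ValidationError` is pydantic's standard validation exception):

```python
from pydantic import ValidationError

from alignrl import BaseTrainConfig

try:
    # Typo'd key: previously ignored silently, now rejected at load time.
    BaseTrainConfig(learnign_rate=2e-4)
except ValidationError as err:
    print(err)  # pydantic reports "Extra inputs are not permitted"

# Range constraints fail the same way, e.g. learning_rate must be > 0:
# BaseTrainConfig(learning_rate=0.0) also raises ValidationError.
```
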
6 changes: 6 additions & 0 deletions src/alignrl/demo.py
@@ -13,12 +13,16 @@
def create_demo(
stages: dict[str, str | None],
model_name: str = "Qwen/Qwen2.5-3B",
temperature: float = 0.7,
max_tokens: int = 512,
):
"""Create a Gradio demo comparing model outputs across training stages.

Args:
stages: {stage_name: adapter_path_or_None}
model_name: base model name
temperature: sampling temperature passed to each backend
max_tokens: maximum number of tokens to generate per response
"""
import gradio as gr

@@ -28,6 +32,8 @@ def create_demo(
model_name=model_name,
adapter_path=adapter_path,
backend="unsloth",
temperature=temperature,
max_tokens=max_tokens,
)
server = ModelServer(config)
server.load()
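
Putting the new generation knobs together, a hedged sketch of driving
`create_demo` directly (the stage names and adapter path are made up):

```python
from alignrl.demo import create_demo

demo = create_demo(
    stages={"base": None, "sft": "outputs/sft-adapter"},  # hypothetical paths
    model_name="Qwen/Qwen2.5-3B",
    temperature=0.2,  # forwarded to every ModelServer
    max_tokens=256,
)
demo.launch(server_name="0.0.0.0", server_port=7860)
```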