Add pipe execution mode for distributed HPC job orchestration #854

Open
alongd wants to merge 7 commits into main from pipe

alongd commented Apr 1, 2026

Implements a new "pipe" execution mode that orchestrates hundreds of subjobs within a single SLURM/PBS/SGE/HTCondor array allocation using a
distributed, lease-based state machine backed by the filesystem.

Architecture:

  • pipe_state.py — Task/run state machines, data models (TaskSpec, TaskStateRecord), file-locked atomic I/O, claim tokens for ownership verification
  • pipe_run.py — PipeRun orchestrator: staging, submit-script generation, reconciliation with orphan detection, retry budgets, run.json persistence,
    from_dir() reconstruction
  • pipe_worker.py — Standalone worker script that loops claiming PENDING tasks, dispatches by task family, writes result.json, verifies ownership
    before terminal writes
  • scheduler.py — Pipe API on Scheduler: eligibility checks, pipe routing for conformer/TS/species-side/scan jobs, family-based ingestion dispatch,
    polling loop integration
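
The claim step described above might look roughly like the following. This is a minimal sketch, not the code in pipe_state.py: the state names other than PENDING, the `status` field, the transition table, the lock-file approach (`O_CREAT | O_EXCL`), and the helper names `file_lock`, `write_json_atomic`, and `try_claim` are all assumptions made for illustration. Only `claimed_by` and `claim_token` are taken from the PR description.

```python
import json
import os
import tempfile
import time
import uuid
from contextlib import contextmanager
from enum import Enum


class TaskState(str, Enum):
    # Only PENDING is named in the PR text; the other states are assumed.
    PENDING = "PENDING"
    CLAIMED = "CLAIMED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Hypothetical transition table for the task state machine.
ALLOWED = {
    TaskState.PENDING: {TaskState.CLAIMED},
    TaskState.CLAIMED: {TaskState.COMPLETED, TaskState.FAILED, TaskState.PENDING},
    TaskState.COMPLETED: set(),
    TaskState.FAILED: set(),
}


@contextmanager
def file_lock(path, timeout=10.0, poll=0.05):
    """Hold an exclusive lock by creating `path + '.lock'` with O_EXCL."""
    lock_path = path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not lock {path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def write_json_atomic(path, data):
    """Write to a temp file in the same directory, then rename over `path`."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def try_claim(state_path, worker_id):
    """Return a fresh claim token if this worker won the task, else None."""
    with file_lock(state_path):
        with open(state_path) as f:
            record = json.load(f)
        current = TaskState(record["status"])
        if TaskState.CLAIMED not in ALLOWED[current]:
            return None  # already claimed or terminal
        token = uuid.uuid4().hex
        record.update(status=TaskState.CLAIMED.value,
                      claimed_by=worker_id, claim_token=token)
        write_json_atomic(state_path, record)
        return token
```

Because the read, the transition check, and the write all happen while the lock is held, two workers racing for the same task cannot both receive a token. A lock file left behind by a crashed worker would need separate cleanup, which this sketch does not attempt.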

Supported task families: conf_opt, conf_sp, ts_guess_batch_method, ts_opt, species_sp, species_freq, irc, rotor_scan_1d
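
Dispatching by task family could be sketched with a small handler registry. The family names below are the ones listed above; the `handles` decorator, the `task_family` key, the handler functions, and their return values are hypothetical and only illustrate the routing idea, not how pipe_worker.py is actually written.

```python
from typing import Callable, Dict

# Family names as listed in the PR description.
SUPPORTED_FAMILIES = (
    "conf_opt", "conf_sp", "ts_guess_batch_method", "ts_opt",
    "species_sp", "species_freq", "irc", "rotor_scan_1d",
)

# Hypothetical registry mapping a family name to the function that runs it.
_HANDLERS: Dict[str, Callable[[dict], dict]] = {}


def handles(*families):
    """Register one handler for one or more task families."""
    def decorator(func):
        for family in families:
            if family not in SUPPORTED_FAMILIES:
                raise ValueError(f"unknown task family: {family}")
            _HANDLERS[family] = func
        return func
    return decorator


@handles("conf_opt", "ts_opt")
def run_optimization(spec):
    # Placeholder: a real worker would run a job adapter here.
    return {"family": spec["task_family"], "kind": "opt"}


@handles("conf_sp", "species_sp")
def run_single_point(spec):
    # Placeholder: a real worker would run a job adapter here.
    return {"family": spec["task_family"], "kind": "sp"}


def dispatch(spec):
    """Route a task spec to its handler based on its family."""
    family = spec["task_family"]
    try:
        handler = _HANDLERS[family]
    except KeyError:
        raise NotImplementedError(f"no handler for task family {family!r}") from None
    return handler(spec)
```

A registry like this makes an unhandled family fail loudly at dispatch time instead of being silently skipped.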

Key design rules:

  • Pipe executes only ready leaf jobs; all QA, troubleshooting, and downstream branching stay in mother ARC
  • One family / one engine / one level per PipeRun (homogeneity enforced at staging)
  • Ingestion happens only after full PipeRun completion
  • Workers verify ownership via claimed_by + claim_token before writing terminal state
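
Two of these rules lend themselves to a short illustration: the homogeneity check at staging and the ownership check before a terminal write. The sketch below is an assumption-laden stand-in, not the PR's code: `Spec` is a hypothetical simplification of TaskSpec, and `check_homogeneous`, `owns_task`, and `write_terminal` are invented names. Only the `claimed_by` and `claim_token` fields come from the description.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Spec:
    # Hypothetical, reduced stand-in for the PR's TaskSpec.
    task_id: str
    family: str
    engine: str
    level: str


def check_homogeneous(specs: Iterable[Spec]) -> Tuple[str, str, str]:
    """Return the shared (family, engine, level) or raise if specs are mixed."""
    keys = {(s.family, s.engine, s.level) for s in specs}
    if not keys:
        raise ValueError("cannot stage an empty run")
    if len(keys) > 1:
        raise ValueError(f"mixed family/engine/level in one run: {sorted(keys)}")
    return next(iter(keys))


def owns_task(record: dict, worker_id: str, claim_token: str) -> bool:
    """True only if both the worker id and the token match the stored claim."""
    return (record.get("claimed_by") == worker_id
            and record.get("claim_token") == claim_token)


def write_terminal(record: dict, worker_id: str, claim_token: str,
                   status: str) -> Optional[dict]:
    """Return the updated record, or None if this worker lost ownership."""
    if not owns_task(record, worker_id, claim_token):
        return None  # task was reclaimed; do not overwrite the new owner's state
    return {**record, "status": status}
```

Checking the token as well as the worker id covers the case where the same worker id claims a task twice, for example after a lease expired and the task was handed out again.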

Legacy cleanup: Removed the old HDF5-based DataPoint/write_hdf5/determine_job_array_parameters infrastructure from JobAdapter. Updated pipe_submit
templates in settings/submit.py from the old pipe.py design to the new pipe_worker design.


Copilot AI left a comment


Pull request overview

Adds a new filesystem-backed “pipe” execution mode to orchestrate many homogeneous leaf tasks inside a single scheduler array allocation, with a dedicated PipeRun orchestrator + pipe_worker consumer, and removes the legacy HDF5/array infrastructure from JobAdapter.

Changes:

  • Introduces pipe task/run state machines and atomic, file-locked state updates (pipe_state.py) plus a run orchestrator with staging, submit-script generation, and reconciliation (pipe_run.py).
  • Adds a standalone worker loop that claims PENDING tasks, runs adapters in incore mode, persists per-attempt outputs/results, and uses claim tokens to protect terminal writes (pipe_worker.py).
  • Integrates pipe routing/polling/ingestion into Scheduler, and removes legacy job-array/HDF5 machinery from adapters/tests.
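
Reconciliation with orphan detection and a retry budget could be sketched as a pure function over task records. Everything here is an assumption for illustration: the `Record` fields (`attempts`, `lease_expires_at`), the status strings other than PENDING, the default `max_attempts`, and the function name `reconcile` are not taken from pipe_run.py.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple

TERMINAL = ("COMPLETED", "FAILED")  # assumed terminal status names


@dataclass(frozen=True)
class Record:
    # Hypothetical, reduced stand-in for the PR's TaskStateRecord.
    task_id: str
    status: str
    attempts: int = 0
    lease_expires_at: Optional[float] = None


def reconcile(records: List[Record], now: float,
              max_attempts: int = 3) -> Tuple[List[Record], bool]:
    """Requeue or fail orphaned tasks; report whether the run is complete."""
    updated = []
    for r in records:
        orphaned = (r.status == "CLAIMED"
                    and r.lease_expires_at is not None
                    and r.lease_expires_at < now)
        if not orphaned:
            updated.append(r)
            continue
        attempts = r.attempts + 1  # the expired lease consumed one attempt
        status = "PENDING" if attempts < max_attempts else "FAILED"
        updated.append(replace(r, status=status, attempts=attempts,
                               lease_expires_at=None))
    # Ingestion would only be allowed once every task is terminal.
    done = all(r.status in TERMINAL for r in updated)
    return updated, done
```

A task whose lease has expired goes back to PENDING while it still has budget, and is marked FAILED once the budget is spent, so a run cannot retry the same task forever.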

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| arc/settings/submit.py | Adds scheduler-type keyed templates for pipe_worker array submission. |
| arc/scripts/pipe_worker.py | Implements the pipe worker claim/execute loop and result persistence. |
| arc/scripts/pipe_worker_test.py | Unit tests for claiming, execution, dispatch routing, ownership, cleanup, and loop behavior. |
| arc/scripts/__init__.py | Adjusts scripts package import to use absolute module path. |
| arc/scheduler.py | Adds pipe routing helpers, active pipe polling loop integration, and ingestion hooks. |
| arc/scheduler_pipe_test.py | Extensive tests for Scheduler pipe eligibility, submission, polling, routing, and ingestion behavior. |
| arc/job/pipe_state.py | Defines pipe task/run state machines, models, and locked atomic state updates. |
| arc/job/pipe_state_test.py | Tests for state transitions, spec validation, locking semantics, and persistence helpers. |
| arc/job/pipe_run.py | Implements PipeRun staging, submit script generation, reconcile/orphan/retry logic, and from_dir restore. |
| arc/job/pipe_run_test.py | Tests for PipeRun staging/restore, submit script content, reconcile behavior, and homogeneity rules. |
| arc/job/adapters/psi_4.py | Removes legacy array/HDF5 initialization hook. |
| arc/job/adapters/common.py | Removes legacy job-array parameter determination call during adapter init. |
| arc/job/adapter.py | Removes legacy HDF5/array infrastructure and makes adapter-level pipe execution explicitly unsupported. |
| arc/job/adapter_test.py | Removes DataPoint/HDF5-related tests and pandas dependency usage. |


Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.



Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.



codecov bot commented Apr 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.61%. Comparing base (69df219) to head (e0631a7).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #854      +/-   ##
==========================================
+ Coverage   58.83%   59.61%   +0.77%     
==========================================
  Files          97      101       +4     
  Lines       29355    30075     +720     
  Branches     7791     7870      +79     
==========================================
+ Hits        17271    17929     +658     
- Misses       9877     9889      +12     
- Partials     2207     2257      +50     
| Flag | Coverage Δ |
| --- | --- |
| functionaltests | 59.61% <ø> (+0.77%) ⬆️ |
| unittests | 59.61% <ø> (+0.77%) ⬆️ |

Flags with carried forward coverage won't be shown.



Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.



Copilot AI left a comment


Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 7 comments.



Copilot AI left a comment


Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

