
feat(block): Allow wait block to wait up to 30 days#4331

Open
TheodoreSpeaks wants to merge 4 commits into staging from feat/long-waits

Conversation

@TheodoreSpeaks
Collaborator

Summary

Reuses the human-in-the-loop logic to allow wait blocks to wait up to 30 days. This could be useful for things like email automation, where you want to send follow-ups after x days.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • Other: ___________

Testing

  • Tested locally with a wait of 6 minutes. Ran resume and validated that later blocks run.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

Screenshots/Videos

@vercel

vercel Bot commented Apr 29, 2026

The latest updates on your projects.

1 Skipped Deployment
| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Skipped | Skipped | Apr 30, 2026 10:36pm |


@TheodoreSpeaks
Collaborator Author

@BugBot review

@cursor

cursor Bot commented Apr 29, 2026

PR Summary

Medium Risk
Adds an automatic, cron-driven resume path for paused executions and a new DB column/index to schedule resumes; mistakes could cause stuck executions or unexpected resumes.

Overview
Enables the Wait block to pause workflows for up to 30 days by splitting waits into in-process sleeps (<=5 minutes) vs time-based suspension (>5 minutes) that resumes later.

Introduces pauseKind (human vs time) and optional resumeAt on pause metadata/points, persists the earliest due time as paused_executions.next_resume_at, and adds a new authenticated cron endpoint (/api/resume/poll) plus Helm CronJob config to automatically enqueue/start due resumes.

Tightens manual resume to human-only pauses (allowedPauseKinds: ['human']) and hides time-based pause points from the resume UI/listing APIs, with updated wait block config/options (adds hours/days) and expanded tests for suspension behavior.
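The in-process vs. suspended branching described above can be sketched as follows. This is a hypothetical helper, not the PR's actual wait-handler code: `decideWait`, `WaitDecision`, and the threshold constants are illustrative names built from the thresholds the summary states (5-minute in-process limit, 30-day ceiling).

```typescript
// Sketch of the wait-duration branching described in the PR summary.
// All names here are assumptions; only the thresholds come from the PR.
type PauseKind = 'human' | 'time'

const IN_PROCESS_LIMIT_MS = 5 * 60 * 1000 // <= 5 minutes: sleep in-process
const MAX_WAIT_MS = 30 * 24 * 60 * 60 * 1000 // 30-day ceiling

type WaitDecision =
  | { mode: 'in-process'; waitMs: number }
  | { mode: 'suspend'; pauseKind: PauseKind; resumeAt: Date }

function decideWait(waitMs: number, now: Date = new Date()): WaitDecision {
  if (waitMs < 0 || waitMs > MAX_WAIT_MS) {
    throw new Error(`wait must be between 0 and 30 days, got ${waitMs}ms`)
  }
  if (waitMs <= IN_PROCESS_LIMIT_MS) {
    // Short waits block the executor in-process, as before.
    return { mode: 'in-process', waitMs }
  }
  // Long waits suspend the workflow with a time-based pause point that
  // the cron poller resumes once resumeAt is reached.
  return {
    mode: 'suspend',
    pauseKind: 'time',
    resumeAt: new Date(now.getTime() + waitMs),
  }
}
```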

Reviewed by Cursor Bugbot for commit 7678245. Bugbot is set up for automated code reviews on this repo.

Comment thread apps/sim/lib/core/config/feature-flags.ts Outdated
Comment thread apps/sim/app/api/resume/poll/route.ts
@TheodoreSpeaks TheodoreSpeaks marked this pull request as ready for review April 29, 2026 02:19
@greptile-apps
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR extends the Wait block to support durations up to 30 days by reusing the human-in-the-loop pause/resume infrastructure. Waits ≤ 5 minutes continue to execute in-process; longer waits suspend the workflow by writing pauseKind: 'time' pause points to pausedExecutions, and a new per-minute cron endpoint (/api/resume/poll) resumes them when their resumeAt timestamp is reached.

  • P1 — permanently stranded executions on dispatch failure: In route.ts, when enqueueOrStartResume throws for a due pause point, the error is caught but nextRemaining (which controls the rescheduled nextResumeAt) only tracks future points. After the loop, nextResumeAt is set to NULL, so the cron query (isNotNull(nextResumeAt)) never selects the row again and the workflow hangs indefinitely with no retry or alerting.

Confidence Score: 3/5

Not safe to merge until the failed-dispatch silent-strand bug is fixed; any transient error during resume dispatch permanently locks a workflow.

A confirmed P1 defect in the new poll route means any transient DB or lock error during dispatch will permanently orphan a paused execution with no retry or observability. The rest of the implementation (schema migration, in-process vs. suspended branching, UI filtering, cron config) is well-structured and correct.

apps/sim/app/api/resume/poll/route.ts — the failed-dispatch no-retry bug and missing ORDER BY

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/app/api/resume/poll/route.ts | New cron-driven polling endpoint that resumes time-based paused executions; has a P1 bug where failed dispatches permanently strand executions, and lacks ORDER BY on the batch query |
| apps/sim/executor/handlers/wait/wait-handler.ts | Refactored to support in-process (≤5 min) and suspended (>5 min, up to 30 days) waits; implementation is clean but executeWithNode signature is narrower than the BlockHandler interface |
| apps/sim/executor/types.ts | Adds PauseKind union type and pauseKind/resumeAt fields to PauseMetadata and PausePoint; backward compat handled via ?? fallbacks in the manager |
| apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts | Correctly propagates pauseKind/resumeAt through persistPauseResult and computes nextResumeAt from the earliest time-based pause point |
| packages/db/migrations/0201_brave_kylun.sql | Adds next_resume_at column and partial index on paused_executions; migration looks correct |
| apps/sim/executor/handlers/wait/wait-handler.test.ts | Replaces old max-enforcement tests with suspended-workflow tests for hours/days/minutes above threshold; good coverage of the new branching logic |
| apps/sim/executor/handlers/human-in-the-loop/human-in-the-loop-handler.ts | Adds explicit pauseKind: 'human' to pause metadata; single-line change is correct and unambiguous |
| apps/sim/blocks/blocks/wait.ts | Adds hours/days options and updates documentation to reflect the new 30-day ceiling and dual-mode execution |
| helm/sim/values.yaml | Adds a 1-minute CronJob for the new poll endpoint with Forbid concurrency policy; configuration is consistent with other cron jobs |
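The earliest-due-time computation attributed to human-in-the-loop-manager.ts above could look roughly like this. A sketch only: `computeNextResumeAt` and the `PausePoint` shape are assumptions reconstructed from the review, including the `?? 'human'` backward-compat fallback it mentions.

```typescript
// Hypothetical sketch of computing nextResumeAt as the earliest resumeAt
// among time-based pause points. Human pauses contribute nothing, and
// rows written before pauseKind existed default to 'human'.
interface PausePoint {
  contextId: string
  pauseKind?: 'human' | 'time' // older points omit this; treated as 'human'
  resumeAt?: string // ISO timestamp, present only for time pauses
}

function computeNextResumeAt(points: PausePoint[]): Date | null {
  let earliest: Date | null = null
  for (const p of points) {
    if ((p.pauseKind ?? 'human') !== 'time' || !p.resumeAt) continue
    const at = new Date(p.resumeAt)
    if (Number.isNaN(at.getTime())) continue // skip unparseable timestamps
    if (!earliest || at < earliest) earliest = at
  }
  // null means no time-based points: the row is human-only and the cron
  // poller (which filters on isNotNull) will ignore it.
  return earliest
}
```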

Sequence Diagram

```mermaid
sequenceDiagram
    participant W as WaitBlockHandler
    participant E as ExecutionEngine
    participant M as PauseResumeManager
    participant DB as pausedExecutions DB
    participant C as CronJob (/api/resume/poll)

    W->>W: execute(inputs)
    alt waitMs ≤ 5 min (in-process)
        W->>W: sleep(waitMs)
        W-->>E: {status: 'completed'}
    else waitMs > 5 min (suspended)
        W-->>E: {status: 'waiting', _pauseMetadata: {pauseKind: 'time', resumeAt}}
        E->>M: persistPauseResult(pausePoints)
        M->>M: compute nextResumeAt (earliest time pause point)
        M->>DB: INSERT/UPDATE pausedExecutions {nextResumeAt}
    end

    loop Every 1 minute
        C->>DB: SELECT WHERE status='paused' AND nextResumeAt <= now LIMIT 200
        DB-->>C: dueRows[]
        loop for each dueRow
            loop for each duePoint (pauseKind='time', resumeAt <= now)
                C->>M: enqueueOrStartResume({executionId, contextId})
                M-->>C: {status: 'starting', ...}
                C->>M: startResumeExecution() [fire and forget]
            end
            C->>DB: UPDATE SET nextResumeAt = nextRemaining (null if all done)
        end
    end
```

Reviews (1): Last reviewed commit: "restore ff"

Comment on lines +92 to +141
```typescript
  for (const point of duePoints) {
    const contextId = point.contextId
    if (!contextId) continue
    try {
      const enqueueResult = await PauseResumeManager.enqueueOrStartResume({
        executionId: row.executionId,
        contextId,
        resumeInput: {},
        userId,
      })

      if (enqueueResult.status === 'starting') {
        PauseResumeManager.startResumeExecution({
          resumeEntryId: enqueueResult.resumeEntryId,
          resumeExecutionId: enqueueResult.resumeExecutionId,
          pausedExecution: enqueueResult.pausedExecution,
          contextId: enqueueResult.contextId,
          resumeInput: enqueueResult.resumeInput,
          userId: enqueueResult.userId,
        }).catch((error) => {
          logger.error('Background time-pause resume failed', {
            executionId: row.executionId,
            contextId,
            error: toError(error).message,
          })
        })
      }
      dispatched++
    } catch (error) {
      const message = toError(error).message
      logger.warn('Failed to dispatch time-pause resume', {
        executionId: row.executionId,
        contextId,
        error: message,
      })
      failures.push({ executionId: row.executionId, contextId, error: message })
    }
  }

  await db
    .update(pausedExecutions)
    .set({ nextResumeAt: nextRemaining })
    .where(eq(pausedExecutions.id, row.id))
}

logger.info('Time-pause resume poll completed', {
  requestId,
  claimedRows,
  dispatched,
  failureCount: failures.length,
```
Contributor


P1 Failed dispatches permanently strand executions

When enqueueOrStartResume throws for a due pause point, the error is caught and pushed to failures[], but nextRemaining is unaffected (it only tracks future points). The loop then runs UPDATE … SET next_resume_at = nextRemaining (effectively NULL when all points were due). After this update, the row no longer satisfies the cron query (isNotNull(nextResumeAt)), so it is silently abandoned and the workflow is permanently stuck in status = 'paused'.

Any transient failure — DB timeout, lock contention, network hiccup inside enqueueOrStartResume — turns into a permanent hang with no visible alert and no retry path.

A simple fix is to re-schedule failed points by putting their resumeAt back into nextRemaining:

```typescript
for (const point of duePoints) {
  const contextId = point.contextId
  if (!contextId) continue
  try {
    // ... dispatch ...
    dispatched++
  } catch (error) {
    const message = toError(error).message
    logger.warn('Failed to dispatch time-pause resume', { ... })
    failures.push({ executionId: row.executionId, contextId, error: message })
    // Re-queue failed point
    if (point.resumeAt) {
      const retryAt = new Date(point.resumeAt)
      if (!Number.isNaN(retryAt.getTime())) {
        if (!nextRemaining || retryAt < nextRemaining) nextRemaining = retryAt
      }
    }
  }
}
```

Alternatively, schedule a short retry (e.g. new Date(Date.now() + 60_000)) to avoid hammering a bad point at full frequency.
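That short-retry alternative can be sketched as a small helper. Hypothetical only: `rescheduleAfterFailure` and the backoff constant are illustrative, not code from the PR.

```typescript
// Sketch of the short-retry alternative: on dispatch failure, schedule the
// row's next attempt one backoff interval out instead of re-using the
// point's original (already past) resumeAt, so a bad point is retried
// without being hammered at full cron frequency.
const RETRY_BACKOFF_MS = 60_000 // assumption: one poll interval

function rescheduleAfterFailure(
  nextRemaining: Date | null,
  now: Date = new Date()
): Date {
  const retryAt = new Date(now.getTime() + RETRY_BACKOFF_MS)
  // Keep the earlier of the existing schedule and the retry time so a
  // failure never delays other still-pending points.
  if (nextRemaining && nextRemaining < retryAt) return nextRemaining
  return retryAt
}
```

Either way, the important property is that `nextResumeAt` is never left NULL while an undispatched due point remains.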

Comment on lines +56 to +66
```typescript
      metadata: pausedExecutions.metadata,
    })
    .from(pausedExecutions)
    .where(
      and(
        eq(pausedExecutions.status, 'paused'),
        isNotNull(pausedExecutions.nextResumeAt),
        lte(pausedExecutions.nextResumeAt, now)
      )
    )
    .limit(POLL_BATCH_LIMIT)
```
Contributor


P2 No ORDER BY on batch query — high-volume queues risk row starvation

Without an explicit ORDER BY, PostgreSQL returns rows in an unspecified order. When the queue depth exceeds POLL_BATCH_LIMIT = 200, the same 200 rows may be returned on every invocation (e.g. lowest physical heap order), while later-inserted rows are perpetually skipped. Adding .orderBy(pausedExecutions.nextResumeAt) ensures the most-overdue entries are always processed first and that all rows are eventually drained.

```typescript
    .orderBy(pausedExecutions.nextResumeAt)
    .limit(POLL_BATCH_LIMIT)
```

Comment on lines +106 to +117
```typescript
async executeWithNode(
  ctx: ExecutionContext,
  block: SerializedBlock,
  inputs: Record<string, any>,
  nodeMetadata: {
    nodeId: string
    loopId?: string
    parallelId?: string
    branchIndex?: number
    branchTotal?: number
  }
): Promise<BlockOutput> {
```
Contributor


P2 executeWithNode signature is narrower than the BlockHandler interface

BlockHandler.executeWithNode in types.ts declares nodeMetadata with three additional optional fields (originalBlockId, isLoopNode, executionOrder). The WaitBlockHandler implementation omits all three, so the method technically does not satisfy the declared interface contract. While TypeScript currently allows this (the extra fields are optional and ignored at runtime), it means callers that pass full nodeMetadata objects will silently drop fields the handler might need in a future iteration. Widening the implementation's parameter type to match the interface definition prevents this drift.
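One way to prevent that drift is to name the metadata shape once and reuse it in both the interface and the handler. A sketch: `NodeMetadata` is a hypothetical type name, and the field list follows this review comment rather than the repo's actual types.ts.

```typescript
// Sketch: a single named type for node metadata, so the handler's
// parameter cannot silently fall behind the BlockHandler interface.
// Field names follow the review comment; the rest is illustrative.
interface NodeMetadata {
  nodeId: string
  loopId?: string
  parallelId?: string
  branchIndex?: number
  branchTotal?: number
  originalBlockId?: string
  isLoopNode?: boolean
  executionOrder?: number
}

// The handler accepts the full interface shape even if it only reads a
// subset of the fields today; extra fields are no longer dropped by the
// narrower inline type.
function executeWithNode(nodeMetadata: NodeMetadata): string {
  return nodeMetadata.nodeId
}
```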


@cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 0c32dd4.

Comment thread apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts
