Node ui state stuck in error despite ldk running fine #845

@ovitrif

Summary

A user on v2.1.0 (178) reported that their Lightning node was completely non-functional. The app shows "LDK Node error: The node is already running", while the node status shows "Could not initiate". Log analysis reveals that the LDK node is actually running fine (connected to peers, syncing chain), but the app's NodeLifecycleState is stuck in ErrorStarting(AlreadyRunning), an unrecoverable dead-end state. Force-closing the app and relaunching it resolves the problem: this is purely an in-memory state issue, with no data corruption.

Root Cause: Two Concurrent Start Paths

The node startup has two callers, both invoking lightningRepo.start():

  1. LightningNodeService.setupService() — foreground service, uses serviceScope with SupervisorJob(), independent of UI lifecycle
  2. WalletViewModel.startNode() — runs in viewModelScope, tied to ViewModel/Activity lifecycle

Both race for lifecycleMutex in LightningRepo.start(). If the ViewModel's call wins the mutex, the critical startup runs in a cancellable scope. If that scope is cancelled after node.start() succeeds but before nodeLifecycleState is updated to Running, the app gets stuck.

Note: Even if we consolidated to a single start path (e.g., foreground service only), that wouldn't be a complete fix — users may eventually be able to disable the foreground service to save battery. The real fix is making the start logic robust against AlreadyRunning regardless of which caller runs it.

Trigger Sequence (from user logs)

22:48:18.422  LightningService: "Starting node..."         (1st node.start() call)
22:48:19.372  LDK: "Startup complete."                      (node IS running)
22:48:19.379  LightningService.start returns                (954ms, success)
22:48:19.386  LightningRepo:373: JobCancellationException   (coroutine cancelled AFTER success)
22:48:19.390  2nd node.start() call                         (immediate retry)
22:48:19.399  LightningRepo:373: AlreadyRunning             (of course — node IS running)
22:48:21.406  3rd node.start() call                         (delayed retry, shouldRetry=false)
              → State permanently stuck as ErrorStarting(AlreadyRunning)

The user was stuck in this state for nearly 12 hours (22:48 Mar 11 through 10:46 Mar 12).

Four Compounding Bugs

Bug 1 — runCatching catches CancellationException (LightningRepo.kt:299)

runCatching { ... } catches every Throwable, including CancellationException, which is a known Kotlin coroutines anti-pattern. When lightningService.start() returns from suspension, Kotlin checks for cancellation before resuming the caller and throws CancellationException if the scope has been cancelled, so line 329 (nodeLifecycleState = Running) never executes. The error handler at line 365 sees that the state is Starting, not Running, and falls into the retry path instead of the success path.
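
The difference can be sketched with the standard library alone. runCatchingCancellable below is a hypothetical helper name, not something taken from the app's codebase; it is assumed to behave like runCatching except that it rethrows CancellationException, so a cancelled scope is not mistaken for a failed start.

```kotlin
import kotlin.coroutines.cancellation.CancellationException

// Hypothetical helper (not from the app): runCatching that lets cancellation propagate.
inline fun <T> runCatchingCancellable(block: () -> T): Result<T> =
    try {
        Result.success(block())
    } catch (e: CancellationException) {
        throw e // cancellation is not a failure of the block; let the scope handle it
    } catch (e: Throwable) {
        Result.failure(e)
    }

// Plain runCatching turns the cancellation into an ordinary failure Result.
fun swallowedByRunCatching(): Boolean =
    runCatching { throw CancellationException("scope cancelled") }.isFailure

// The helper rethrows it, so the caller's error handler never sees it as a start error.
fun propagatesFromHelper(): Boolean =
    try {
        runCatchingCancellable<Unit> { throw CancellationException("scope cancelled") }
        false
    } catch (e: CancellationException) {
        true
    }
```

With a helper of this shape, the retry path would only run for genuine start failures, while cancellation would unwind the coroutine as usual.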

Bug 2 — Circular getStatus() guard (LightningRepo.kt:1211)

fun getStatus(): NodeStatus? =
    if (_lightningState.value.nodeLifecycleState.isRunning()) lightningService.status else null

The safety pre-check at line 317 (if (getStatus()?.isRunning == true)) goes through getStatus(), which returns null while the state is Starting, so the check can never detect that the LDK node is actually running during a retry. The check should use lightningService.status?.isRunning directly.
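
A minimal sketch of why the gated check is blind during a retry. Lifecycle, NodeStatus and RepoSketch are simplified stand-ins invented for illustration; only the shape of getStatus() is taken from the snippet above.

```kotlin
// Simplified stand-ins; only the shape of getStatus() comes from the issue.
enum class Lifecycle { Stopped, Starting, Running }

data class NodeStatus(val isRunning: Boolean)

class RepoSketch(var lifecycle: Lifecycle, private val serviceStatus: NodeStatus?) {
    // Same gating as LightningRepo.getStatus(): null unless the app state is Running.
    fun getStatus(): NodeStatus? =
        if (lifecycle == Lifecycle.Running) serviceStatus else null

    // Pre-check as described at line 317: goes through the gated getter.
    fun gatedPreCheck(): Boolean = getStatus()?.isRunning == true

    // Proposed pre-check: ask the service directly.
    fun directPreCheck(): Boolean = serviceStatus?.isRunning == true
}
```

With lifecycle set to Starting and a service status that reports running, gatedPreCheck() is false while directPreCheck() is true, which matches the retry behaviour seen in the logs.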

Bug 3 — ErrorStarting is a dead end (NodeLifecycleState.kt)

Once in ErrorStarting:

  • Not Running → all operations blocked ("Cannot execute X: node is ErrorStarting")
  • Not Stopped/Stopping → stop() guard at line 437 rejects it
  • No recovery path exists
  • User is permanently stuck until force-close
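
The dead end can be illustrated with a small model. SketchState is a hypothetical reconstruction of the states named in this issue, and currentGuardAllowsStop is only an assumed shape for the guard at line 437 (the real condition may differ); proposedGuardAllowsStop shows one possible way to open a recovery path.

```kotlin
// Hypothetical model of the states named in the issue; the real NodeLifecycleState may differ.
sealed interface SketchState {
    object Stopped : SketchState
    object Starting : SketchState
    object Running : SketchState
    object Stopping : SketchState
    data class ErrorStarting(val cause: Throwable) : SketchState
}

// Assumed shape of the current guard: stop() only proceeds from Running.
fun currentGuardAllowsStop(state: SketchState): Boolean =
    state is SketchState.Running

// Possible relaxed guard: also accept ErrorStarting so the user can recover without a force-close.
fun proposedGuardAllowsStop(state: SketchState): Boolean =
    state is SketchState.Running || state is SketchState.ErrorStarting
```

Under these assumptions, a stop followed by a fresh start would be reachable from ErrorStarting, instead of requiring the process to be killed.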

Bug 4 — No single source of truth for node lifecycle

NodeLifecycleState is a manual mirror of LDK's own internal state, maintained separately by the app. Both Android and iOS have their own implementations. This violates single-source-of-truth and doubles the bug surface. LDK-node itself knows whether it's running — the app shouldn't need to independently track this.

Suggested Fixes (priority order)

  1. Handle AlreadyRunning as success — In LightningRepo.start(), when node.start() throws AlreadyRunning, check lightningService.status?.isRunning directly (not via getStatus()) and transition to Running if the node is healthy
  2. Fix the circular getStatus() guard — Line 317's pre-check should use lightningService.status?.isRunning directly, bypassing the isRunning() state gate
  3. Don't catch CancellationException — Replace runCatching at line 299 with explicit try-catch that rethrows CancellationException (standard Kotlin coroutines practice)
  4. Make stop() accept ErrorStarting — Allow recovery from the dead-end state
  5. Long-term: Move lifecycle state to ldk-node — Eliminate the manual state mirror entirely (affects both iOS and Android)
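
A rough sketch of how fixes 1 and 3 could combine, using stand-in types. AlreadyRunningException, StartState and FakeNode are invented here for illustration and are not the ldk-node or app APIs; the real start() is a suspend function with retry logic that this sketch leaves out.

```kotlin
import kotlin.coroutines.cancellation.CancellationException

// Hypothetical stand-ins for the ldk-node error and the app's state type.
class AlreadyRunningException : Exception("The node is already running")

sealed interface StartState {
    object Running : StartState
    data class ErrorStarting(val cause: Throwable) : StartState
}

// Fake node that mimics the behaviour in the logs: a second start() throws AlreadyRunning.
class FakeNode {
    var isRunning = false
        private set

    fun start() {
        if (isRunning) throw AlreadyRunningException()
        isRunning = true
    }
}

fun startTreatingAlreadyRunningAsSuccess(node: FakeNode): StartState =
    try {
        node.start()
        StartState.Running
    } catch (e: CancellationException) {
        throw e // fix 3: never swallow cancellation
    } catch (e: AlreadyRunningException) {
        // fix 1: ask the node itself, not the app's mirrored state
        if (node.isRunning) StartState.Running else StartState.ErrorStarting(e)
    } catch (e: Exception) {
        StartState.ErrorStarting(e)
    }
```

Calling the function twice on the same FakeNode yields Running both times, so a retry after a cancelled first attempt would no longer land in ErrorStarting(AlreadyRunning).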

Key Files

  • LightningRepo.kt:270-413: start() with retry logic; circular getStatus() at line 1211
  • LightningRepo.kt:435-457: stop() that rejects ErrorStarting
  • LightningNodeService.kt:68-82: Foreground service start path
  • WalletViewModel.kt:293-317: ViewModel start path
  • NodeLifecycleState.kt: State enum with no recovery from ErrorStarting
  • LightningService.kt:233+: node.start() call

User Impact

  • Lightning node appears completely non-functional
  • Cannot send or receive payments
  • Cannot export logs (no lightning connection)
  • Persists across ordinary app restarts (only a force-close resolves it)
  • Wallet data and channels are safe — purely in-memory state issue

Labels: bug