Summary
A user on v2.1.0 (178) reported their Lightning node completely non-functional. The app shows "LDK Node error: The node is already running" but the node status shows "Could not initiate". Log analysis reveals the LDK node is actually running fine (connected to peers, syncing chain), but the app's NodeLifecycleState is stuck in ErrorStarting(AlreadyRunning) — an unrecoverable dead-end state. Force-closing and restarting the app resolves it (purely in-memory state issue, no data corruption).
Root Cause: Two Concurrent Start Paths
The node startup has two callers, both invoking lightningRepo.start():
- LightningNodeService.setupService(): foreground service, uses serviceScope with SupervisorJob(), independent of UI lifecycle
- WalletViewModel.startNode(): runs in viewModelScope, tied to ViewModel/Activity lifecycle
Both race for lifecycleMutex in LightningRepo.start(). If the ViewModel's call wins the mutex, the critical startup runs in a cancellable scope. If that scope is cancelled after node.start() succeeds but before nodeLifecycleState is updated to Running, the app gets stuck.
Note: Even if we consolidated to a single start path (e.g., foreground service only), that wouldn't be a complete fix — users may eventually be able to disable the foreground service to save battery. The real fix is making the start logic robust against AlreadyRunning regardless of which caller runs it.
Trigger Sequence (from user logs)
22:48:18.422 LightningService: "Starting node..." (1st node.start() call)
22:48:19.372 LDK: "Startup complete." (node IS running)
22:48:19.379 LightningService.start returns (954ms, success)
22:48:19.386 LightningRepo:373: JobCancellationException (coroutine cancelled AFTER success)
22:48:19.390 2nd node.start() call (immediate retry)
22:48:19.399 LightningRepo:373: AlreadyRunning (of course — node IS running)
22:48:21.406 3rd node.start() call (delayed retry, shouldRetry=false)
→ State permanently stuck as ErrorStarting(AlreadyRunning)
The user was stuck in this loop for roughly 12 hours (22:48 Mar 11 through 10:46 Mar 12).
Four Compounding Bugs
Bug 1 — runCatching catches CancellationException (LightningRepo.kt:299)
runCatching { ... } catches CancellationException, which is a known Kotlin coroutines anti-pattern. After lightningService.start() returns from suspension, Kotlin checks for cancellation before resuming — so line 329 (nodeLifecycleState = Running) never executes. The error handler at line 365 sees state is Starting, not Running, and falls into the retry path instead of the success path.
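A minimal sketch of the safer pattern, assuming a hypothetical helper named runCatchingCancellable (it is not part of the codebase or the Kotlin standard library). It behaves like runCatching but rethrows CancellationException, so cancellation keeps propagating instead of being routed into the retry path:

```kotlin
import kotlin.coroutines.cancellation.CancellationException

// Hypothetical helper: like runCatching, but never swallows cancellation.
// Because it is inline, the block may call suspend functions when used inside one.
inline fun <T> runCatchingCancellable(block: () -> T): Result<T> =
    try {
        Result.success(block())
    } catch (e: CancellationException) {
        throw e // let structured concurrency handle cancellation
    } catch (e: Throwable) {
        Result.failure(e)
    }

fun main() {
    // Ordinary failures are still captured as Result.failure.
    val failed = runCatchingCancellable<Unit> { error("boom") }
    println(failed.isFailure) // true

    // Cancellation escapes instead of being treated as a startup error.
    val escaped = try {
        runCatchingCancellable<Unit> { throw CancellationException("scope cancelled") }
        false
    } catch (e: CancellationException) {
        true
    }
    println(escaped) // true
}
```

On its own this only keeps cancellation out of the error handler. The node can still be running while the state remains Starting, so the AlreadyRunning handling is still needed.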
Bug 2 — Circular getStatus() guard (LightningRepo.kt:1211)
fun getStatus(): NodeStatus? =
    if (_lightningState.value.nodeLifecycleState.isRunning()) lightningService.status else null
The safety pre-check at line 317 (if (getStatus()?.isRunning == true)) returns null when state is Starting, so it can never detect that the LDK node is actually running during a retry. The check should use lightningService.status?.isRunning directly.
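The circularity can be reproduced with simplified stand-ins. In this sketch, FakeLightningService, Repo, and isNodeActuallyRunning are hypothetical names, and the type shapes are assumptions rather than the app's real classes:

```kotlin
// Simplified stand-ins for the app's types; names mirror the issue, shapes are assumed.
enum class NodeLifecycleState {
    Stopped, Starting, Running, ErrorStarting;

    fun isRunning() = this == Running
}

data class NodeStatus(val isRunning: Boolean)

class FakeLightningService(var status: NodeStatus?)

class Repo(private val service: FakeLightningService) {
    var state = NodeLifecycleState.Stopped

    // Current shape: gated on the app-side mirror, so it is blind while state != Running.
    fun getStatus(): NodeStatus? = if (state.isRunning()) service.status else null

    // Proposed pre-check: ask the service directly, bypassing the state gate.
    fun isNodeActuallyRunning(): Boolean = service.status?.isRunning == true
}

fun main() {
    val repo = Repo(FakeLightningService(NodeStatus(isRunning = true)))
    repo.state = NodeLifecycleState.Starting     // retry in progress, node already up
    println(repo.getStatus()?.isRunning == true) // false: the gate hides the real status
    println(repo.isNodeActuallyRunning())        // true
}
```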
Bug 3 — ErrorStarting is a dead end (NodeLifecycleState.kt)
Once in ErrorStarting:
- Not Running → all operations blocked ("Cannot execute X: node is ErrorStarting")
- Not Stopped/Stopping → stop() guard at line 437 rejects it
- No recovery path exists
- User is permanently stuck until force-close
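One possible way out of the dead end is a stop() guard that treats ErrorStarting as stoppable. The sketch below uses a hypothetical canStop function and an assumed sealed-class shape for NodeLifecycleState. The real guard at line 437 and the real state type may look different, and whether Starting should also be stoppable is left open here:

```kotlin
// Assumed shape: ErrorStarting carries a cause, so a sealed class is used for illustration.
sealed class NodeLifecycleState {
    object Stopped : NodeLifecycleState()
    object Starting : NodeLifecycleState()
    object Running : NodeLifecycleState()
    object Stopping : NodeLifecycleState()
    data class ErrorStarting(val cause: Throwable) : NodeLifecycleState()
}

// Hypothetical guard for stop(): ErrorStarting is accepted so the user can recover.
fun canStop(state: NodeLifecycleState): Boolean = when (state) {
    is NodeLifecycleState.Running,
    is NodeLifecycleState.ErrorStarting -> true
    is NodeLifecycleState.Stopped,
    is NodeLifecycleState.Starting,
    is NodeLifecycleState.Stopping -> false
}

fun main() {
    val stuck = NodeLifecycleState.ErrorStarting(IllegalStateException("AlreadyRunning"))
    println(canStop(stuck))                      // true
    println(canStop(NodeLifecycleState.Stopped)) // false
}
```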
Bug 4 — No single source of truth for node lifecycle
NodeLifecycleState is a manual mirror of LDK's own internal state, maintained separately by the app. Both Android and iOS have their own implementations. This violates single-source-of-truth and doubles the bug surface. LDK-node itself knows whether it's running — the app shouldn't need to independently track this.
Suggested Fixes (priority order)
- Handle AlreadyRunning as success: In LightningRepo.start(), when node.start() throws AlreadyRunning, check lightningService.status?.isRunning directly (not via getStatus()) and transition to Running if the node is healthy
- Fix the circular getStatus() guard: Line 317's pre-check should use lightningService.status?.isRunning directly, bypassing the isRunning() state gate
- Don't catch CancellationException: Replace runCatching at line 299 with explicit try-catch that rethrows CancellationException (standard Kotlin coroutines practice)
- Make stop() accept ErrorStarting: Allow recovery from the dead-end state
- Long-term, move lifecycle state to ldk-node: Eliminate the manual state mirror entirely (affects both iOS and Android)
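The first fix can be sketched as follows. AlreadyRunningException, FakeLightningService, Lifecycle, and Repo are hypothetical stand-ins. The real ldk-node error type and the app's service API may differ, and the actual start() is a suspend function with retry logic that is omitted here:

```kotlin
// Hypothetical stand-ins; the real error type and service API are not shown in the issue.
class AlreadyRunningException : Exception("The node is already running")

data class NodeStatus(val isRunning: Boolean)

class FakeLightningService {
    var status: NodeStatus? = null
        private set

    fun start() {
        if (status?.isRunning == true) throw AlreadyRunningException()
        status = NodeStatus(isRunning = true)
    }
}

enum class Lifecycle { Stopped, Starting, Running, ErrorStarting }

class Repo(private val service: FakeLightningService) {
    var state = Lifecycle.Stopped
        private set

    fun start() {
        state = Lifecycle.Starting
        state = try {
            service.start()
            Lifecycle.Running
        } catch (e: AlreadyRunningException) {
            // Trust the node itself, not the app-side mirror.
            if (service.status?.isRunning == true) Lifecycle.Running else Lifecycle.ErrorStarting
        }
    }
}

fun main() {
    val service = FakeLightningService()
    service.start()      // simulates the first start whose state update was lost
    val repo = Repo(service)
    repo.start()         // the retry hits AlreadyRunning
    println(repo.state)  // Running
}
```

Checking the service's own status before declaring success keeps the transition honest: if the node reports AlreadyRunning but is not healthy, the state still ends in ErrorStarting.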
Key Files
- LightningRepo.kt:270-413: start() with retry logic; circular getStatus() at line 1211
- LightningRepo.kt:435-457: stop() that rejects ErrorStarting
- LightningNodeService.kt:68-82: Foreground service start path
- WalletViewModel.kt:293-317: ViewModel start path
- NodeLifecycleState.kt: State enum with no recovery from ErrorStarting
- LightningService.kt:233+: node.start() call
User Impact
- Lightning node appears completely non-functional
- Cannot send or receive payments
- Cannot export logs (no lightning connection)
- Persists when the app is simply closed and reopened (only a force-close resolves it)
- Wallet data and channels are safe — purely in-memory state issue