"More than one version of commitment transaction" must be fixed in node software side, not VLS side

Issue: Channel State Splitbrain, And VLS `policy-commitment-retry-same`
=======================================================================

There is a race condition where the CLN node can ask VLS to sign a new
counterparty commitment (to which VLS then signs and responds), then
CLN crashes before it can write the new state to disk.

On restart, CLN will forget about the new state.
It will then `channel_reestablish` with its peer.

The issue is that the CLN node and the peer can now negotiate another
new state, with a set of HTLCs different from the new state from before
the crash.

However, VLS has already signed new state before the crash, and by its
policy `policy-commitment-retry-same`, it will reject the signing request
after restart, and there is no way to recover other than to close the
channel.
Fortunately, this state is not a funds-loss event; as far as VLS is
concerned, both the old state and new state are valid, and as far as
CLN (and presumably its counterparty) is concerned, there is only the old
state, and CLN is unable to sign the new state.

In VLS, `policy-commitment-retry-same` is the policy which requires
that every `hsmd_sign_remote_commitment_tx` is either the same as the
previous call, ***or*** is at the next higher state index.
In this issue, VLS thinks we are alreaady at the new state `N`, but CLN
and its peer believes the new state is `N - 1` and attempt to
re-negotiate a different state index `N`.

A similar issue exists with rust-lightning.

In rust-lightning, depending on what exactly got saved on disk (it
seems to be dependent on whether the persister is asynchronous or
not), rust-lightning may re-call at state index `N - 1`, or have a
similar problem with CLN where it re-negotiates a different state
index `N`.

Incorrect Solution
==================

CLN proposed a solution of "VLS should always just accept signing any
alternate new state `N` as long as state `N - 1` is not invalidated."

This violates the VLS policy `policy-commitment-retry-same`.

The issue here is that VLS has the goal of "the node can be compromised,
but this will not lead to a funds loss event".

The problem is that VLS signing ***any*** alternate new state `N` (the
solution proposed by CLN) leads to a ***funds-loss vulnerability***
that is exploitable if the counterparty has compromised the node.

The following user story shows the funds-loss vulnerability:

* Actors:
  - `R`, the counterparty.
    - secretly malicious.
  - `A`, the node (CLN or rust-lightning).
    - secretly compromised by `R`.
  - `V`, the Validating Lightning Signer.
* Start with `R` commitment at state index `N - 1`.
* `R` sends `update_add_htlc` with `H1` with the tiniest amount.
* `A` decides to sign the next `R` state `N`, and sends
  `hsmd_sign_remote_commitment_tx` to `V`.
* `V` signs the new state and sends back the signature in response.
  Let us call this state `<R.N> H1`.
* `A` receives the signature for `<R.N> H1` and gives it to `R`.
  - `R` is now in possession of `<R.N> H1`.
* `A` ***PRETENDS TO CRASH***.
* On "restart", `A` ***PRETENDS*** to perform `channel_reestablish`
  with `R` at state `N - 1`.
* `R` sends `update_add_htlc` with `H2` with the largest possible
  amount, to be forwarded elsewhere (to another node owned by `R`).
* `A` requests, to `V`, a signature for the next `R` state `N` with `H2`
  as the HTLC state.
* `V` signs the new state due to lack of `policy-commitment-retry-same`.
  Call this signature `<R.N> H2`.
* `R` sends the revocation for state `R.N - 1`.
* `V` now believes that `H2` is "irrevocably committed" and authorizes
  its forwarding onwards.
* `R` completes the signature for `<R.N> H1`.
  - This can confirm, as it is now signed by both `V` and `R`.
  - `V` cannot punish it it, because only state `R.N - 1` is revoked.
  - `R` loses only the tiny amount `H1` (and it really only gets locked
    until timeout) but gains the amount of `H2`.
  - `R` WINS.

Due to this proposed solution allowing funds loss (and the issue itself
***not*** leading to funds loss, other than the "cost of doing
business" for fees for closing and re-opening channels), VLS rejects
this solution as worse than the problem.

Correct Solution
================

The correct solution is to write an ***intent*** to sign to disk,
***before*** performing the signing.

This is valid as a solution: writing the ***intent*** to a log (called
the "write-ahead log" or "***intent*** log") is how all modern
databases implement ACID transactions; Sqlite3 is faster if you
enable WAL mode while still as safe, and CLN often uses Sqlite3
anyway.

Thus, writing the "intent to sign this new state" to the log ***is***
semantically equivalent to "I signed this new state".
Once it has been written and committed to the database, it is
equivalent to atomically having signed the new state.

The advantage is that this is effectively atomic from the point of
view of both VLS and CLN!
Thus, it makes atomic the two operations "make signature" and
"write new state to disk", even if the two operations are done on
different machines:

* Suppose CLN and VLS crash before it can write the intent-to-sign
  on disk.
  On restart, the view is:
  - CLN: `N - 1`
  - VLS: `N - 1`
  - Both are in synchrony.
* Suppose CLN writes intent-to-sign to disk, but crashes before it
  can make the request to VLS.
  On restart, the view is:
  - CLN: `N`
  - VLS: `N - 1`
  - On restart, CLN sees the intent-to-sign and calls VLS to make
    the request to sign `N`.
    This is accepted by VLS because it is a state advancement
    from `N - 1` to `N`.
* Suppose CLN writes intent-to-sign to disk and is able to make the
  request to VLS, but VLS crashes before it can sign and updates its
  own disk.
  On VLS restart, the view is:
  - CLN: `N`
  - VLS: `N - 1`
  - The CLN-VLS integration code will repeat the last request that
    has not been responded to on reconnection.
    This is the same request that signs `N`, which is accepted by
    VLS as a state advancement.
* Suppose VLS signs and writes to disk, but crashes before it can
  send the response to CLN.
  On VLS restart, the view is:
  - CLN: `N`
  - VLS: `N`
  - As noted, the CLN-VLS integration code will repeat the last
    request on reconnection of VLS.
    As this request is the same as the request VLS received before
    crashing, this passes `policy-commitment-retry-same`.
* Suppose VLS is able to respond to CLN, but CLN crashes before it
  can send the signature in a `commitment_signed`.
  On restart, the view is:
  - CLN: `N`
  - VLS: `N`
  - On restart, CLN will see the intent-to-sign on disk, and re-do
    the signing.
    This is the same request as what VLS received before, which
    passes `policy-commitment-retry-same`.
    Then CLN will use the signature for
    `connection_reestablish`.

In CLN, the primary change is to move the call to `calc_commitsigs`,
in `sign_commit_part` of `channeld/channeld.c`, to after the code
that notifies the master.
The only dependency on the output of `calc_commitsigs` appears to be a
`tal_count(htlc_sigs)` in the `status_debug` message, which can be
replaced with `tal_count(txs) - 1`; otherwise, the outputs of
`calc_commitsigs` seems to only be used to generate the actual
`commitment_signed` message.

`calc_commitsigs` accepts `txs` as non-`const`, but does not appear
to actually modify it.
The `commit_sig` argument is an output, and is the only other
non-`const` pointer amongst its arguments.

I am uncertain if `openingd/openingd.c` also needs modifications.

Expanding View
==============

There are four different things that a signer has to handle for the
channel state advancement:

* Our commitment:
  - Validate signature from counterparty for `N`.
  - Provide revocation to counterparty for `N - 1`.
* Counterparty commitment:
  - Provide signature to counterparty for `N`.
  - Validate revocaation from counterparty for `N - 1`.

Uniquely, this splitbrain phenomoenon only affects the "provide
signature to counterparty for `N`".

* Validate signature from counterparty for `N`.
  - If the counterparty makes signatures for different `N` states
    instead of just a single `N` state, that means the counterparty is
    opening itself to the vulnerability described here.
  - This means that this is not a funds loss for *us* at least.
  - VLS can safely replace its own state with the latest signature and
    latest state from the counterparty or node.
* Provide revocation to counterparty for `N - 1`.
  - The revocation key is independent of any HTLCs or other state.
  - The only validation VLS makes here is to ensure that "validate
    signature from counterparty" has been done.
* Validate revocation from counterparty for `N - 1`.
  - The revocation key is independent of any HTLCs or other state.
  - CLN can repeat this request to VLS any number of times and the
    result will be the same each time (i.e. fully idempotent).

Thus, changes are unnnecessary in CLN for the other parts.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"More than one version of commitment transaction" must be fixed in node software side, not VLS side #9213

Issue: Channel State Splitbrain, And VLS `policy-commitment-retry-same`

Incorrect Solution

Correct Solution

Expanding View

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

"More than one version of commitment transaction" must be fixed in node software side, not VLS side #9213

Description

Issue: Channel State Splitbrain, And VLS policy-commitment-retry-same

Incorrect Solution

Correct Solution

Expanding View

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Issue: Channel State Splitbrain, And VLS `policy-commitment-retry-same`