Issue: Channel State Splitbrain, And VLS policy-commitment-retry-same
There is a race condition where the CLN node can ask VLS to sign a new
counterparty commitment (to which VLS then signs and responds), then
CLN crashes before it can write the new state to disk.
On restart, CLN will forget about the new state.
It will then channel_reestablish with its peer.
The issue is that the CLN node and the peer can now negotiate another
new state, with a set of HTLCs different from the new state from before
the crash.
However, VLS has already signed new state before the crash, and by its
policy policy-commitment-retry-same, it will reject the signing request
after restart, and there is no way to recover other than to close the
channel.
Fortunately, this state is not a funds-loss event; as far as VLS is
concerned, both the old state and new state are valid, and as far as
CLN (and presumably its counterparty) is concerned, there is only the old
state, and CLN is unable to sign the new state.
In VLS, policy-commitment-retry-same is the policy which requires
that every hsmd_sign_remote_commitment_tx is either the same as the
previous call, or is at the next higher state index.
In this issue, VLS thinks we are alreaady at the new state N, but CLN
and its peer believes the new state is N - 1 and attempt to
re-negotiate a different state index N.
A similar issue exists with rust-lightning.
In rust-lightning, depending on what exactly got saved on disk (it
seems to be dependent on whether the persister is asynchronous or
not), rust-lightning may re-call at state index N - 1, or have a
similar problem with CLN where it re-negotiates a different state
index N.
Incorrect Solution
CLN proposed a solution of "VLS should always just accept signing any
alternate new state N as long as state N - 1 is not invalidated."
This violates the VLS policy policy-commitment-retry-same.
The issue here is that VLS has the goal of "the node can be compromised,
but this will not lead to a funds loss event".
The problem is that VLS signing any alternate new state N (the
solution proposed by CLN) leads to a funds-loss vulnerability
that is exploitable if the counterparty has compromised the node.
The following user story shows the funds-loss vulnerability:
- Actors:
R, the counterparty.
A, the node (CLN or rust-lightning).
- secretly compromised by
R.
V, the Validating Lightning Signer.
- Start with
R commitment at state index N - 1.
R sends update_add_htlc with H1 with the tiniest amount.
A decides to sign the next R state N, and sends
hsmd_sign_remote_commitment_tx to V.
V signs the new state and sends back the signature in response.
Let us call this state <R.N> H1.
A receives the signature for <R.N> H1 and gives it to R.
R is now in possession of <R.N> H1.
A PRETENDS TO CRASH.
- On "restart",
A PRETENDS to perform channel_reestablish
with R at state N - 1.
R sends update_add_htlc with H2 with the largest possible
amount, to be forwarded elsewhere (to another node owned by R).
A requests, to V, a signature for the next R state N with H2
as the HTLC state.
V signs the new state due to lack of policy-commitment-retry-same.
Call this signature <R.N> H2.
R sends the revocation for state R.N - 1.
V now believes that H2 is "irrevocably committed" and authorizes
its forwarding onwards.
R completes the signature for <R.N> H1.
- This can confirm, as it is now signed by both
V and R.
V cannot punish it it, because only state R.N - 1 is revoked.
R loses only the tiny amount H1 (and it really only gets locked
until timeout) but gains the amount of H2.
R WINS.
Due to this proposed solution allowing funds loss (and the issue itself
not leading to funds loss, other than the "cost of doing
business" for fees for closing and re-opening channels), VLS rejects
this solution as worse than the problem.
Correct Solution
The correct solution is to write an intent to sign to disk,
before performing the signing.
This is valid as a solution: writing the intent to a log (called
the "write-ahead log" or "intent log") is how all modern
databases implement ACID transactions; Sqlite3 is faster if you
enable WAL mode while still as safe, and CLN often uses Sqlite3
anyway.
Thus, writing the "intent to sign this new state" to the log is
semantically equivalent to "I signed this new state".
Once it has been written and committed to the database, it is
equivalent to atomically having signed the new state.
The advantage is that this is effectively atomic from the point of
view of both VLS and CLN!
Thus, it makes atomic the two operations "make signature" and
"write new state to disk", even if the two operations are done on
different machines:
- Suppose CLN and VLS crash before it can write the intent-to-sign
on disk.
On restart, the view is:
- CLN:
N - 1
- VLS:
N - 1
- Both are in synchrony.
- Suppose CLN writes intent-to-sign to disk, but crashes before it
can make the request to VLS.
On restart, the view is:
- CLN:
N
- VLS:
N - 1
- On restart, CLN sees the intent-to-sign and calls VLS to make
the request to sign N.
This is accepted by VLS because it is a state advancement
from N - 1 to N.
- Suppose CLN writes intent-to-sign to disk and is able to make the
request to VLS, but VLS crashes before it can sign and updates its
own disk.
On VLS restart, the view is:
- CLN:
N
- VLS:
N - 1
- The CLN-VLS integration code will repeat the last request that
has not been responded to on reconnection.
This is the same request that signs N, which is accepted by
VLS as a state advancement.
- Suppose VLS signs and writes to disk, but crashes before it can
send the response to CLN.
On VLS restart, the view is:
- CLN:
N
- VLS:
N
- As noted, the CLN-VLS integration code will repeat the last
request on reconnection of VLS.
As this request is the same as the request VLS received before
crashing, this passes policy-commitment-retry-same.
- Suppose VLS is able to respond to CLN, but CLN crashes before it
can send the signature in a commitment_signed.
On restart, the view is:
- CLN:
N
- VLS:
N
- On restart, CLN will see the intent-to-sign on disk, and re-do
the signing.
This is the same request as what VLS received before, which
passes policy-commitment-retry-same.
Then CLN will use the signature for
connection_reestablish.
In CLN, the primary change is to move the call to calc_commitsigs,
in sign_commit_part of channeld/channeld.c, to after the code
that notifies the master.
The only dependency on the output of calc_commitsigs appears to be a
tal_count(htlc_sigs) in the status_debug message, which can be
replaced with tal_count(txs) - 1; otherwise, the outputs of
calc_commitsigs seems to only be used to generate the actual
commitment_signed message.
calc_commitsigs accepts txs as non-const, but does not appear
to actually modify it.
The commit_sig argument is an output, and is the only other
non-const pointer amongst its arguments.
I am uncertain if openingd/openingd.c also needs modifications.
Expanding View
There are four different things that a signer has to handle for the
channel state advancement:
- Our commitment:
- Validate signature from counterparty for
N.
- Provide revocation to counterparty for
N - 1.
- Counterparty commitment:
- Provide signature to counterparty for
N.
- Validate revocaation from counterparty for
N - 1.
Uniquely, this splitbrain phenomoenon only affects the "provide
signature to counterparty for N".
- Validate signature from counterparty for
N.
- If the counterparty makes signatures for different
N states
instead of just a single N state, that means the counterparty is
opening itself to the vulnerability described here.
- This means that this is not a funds loss for us at least.
- VLS can safely replace its own state with the latest signature and
latest state from the counterparty or node.
- Provide revocation to counterparty for
N - 1.
- The revocation key is independent of any HTLCs or other state.
- The only validation VLS makes here is to ensure that "validate
signature from counterparty" has been done.
- Validate revocation from counterparty for
N - 1.
- The revocation key is independent of any HTLCs or other state.
- CLN can repeat this request to VLS any number of times and the
result will be the same each time (i.e. fully idempotent).
Thus, changes are unnnecessary in CLN for the other parts.
Issue: Channel State Splitbrain, And VLS
policy-commitment-retry-sameThere is a race condition where the CLN node can ask VLS to sign a new
counterparty commitment (to which VLS then signs and responds), then
CLN crashes before it can write the new state to disk.
On restart, CLN will forget about the new state.
It will then
channel_reestablishwith its peer.The issue is that the CLN node and the peer can now negotiate another
new state, with a set of HTLCs different from the new state from before
the crash.
However, VLS has already signed new state before the crash, and by its
policy
policy-commitment-retry-same, it will reject the signing requestafter restart, and there is no way to recover other than to close the
channel.
Fortunately, this state is not a funds-loss event; as far as VLS is
concerned, both the old state and new state are valid, and as far as
CLN (and presumably its counterparty) is concerned, there is only the old
state, and CLN is unable to sign the new state.
In VLS,
policy-commitment-retry-sameis the policy which requiresthat every
hsmd_sign_remote_commitment_txis either the same as theprevious call, or is at the next higher state index.
In this issue, VLS thinks we are alreaady at the new state
N, but CLNand its peer believes the new state is
N - 1and attempt tore-negotiate a different state index
N.A similar issue exists with rust-lightning.
In rust-lightning, depending on what exactly got saved on disk (it
seems to be dependent on whether the persister is asynchronous or
not), rust-lightning may re-call at state index
N - 1, or have asimilar problem with CLN where it re-negotiates a different state
index
N.Incorrect Solution
CLN proposed a solution of "VLS should always just accept signing any
alternate new state
Nas long as stateN - 1is not invalidated."This violates the VLS policy
policy-commitment-retry-same.The issue here is that VLS has the goal of "the node can be compromised,
but this will not lead to a funds loss event".
The problem is that VLS signing any alternate new state
N(thesolution proposed by CLN) leads to a funds-loss vulnerability
that is exploitable if the counterparty has compromised the node.
The following user story shows the funds-loss vulnerability:
R, the counterparty.A, the node (CLN or rust-lightning).R.V, the Validating Lightning Signer.Rcommitment at state indexN - 1.Rsendsupdate_add_htlcwithH1with the tiniest amount.Adecides to sign the nextRstateN, and sendshsmd_sign_remote_commitment_txtoV.Vsigns the new state and sends back the signature in response.Let us call this state
<R.N> H1.Areceives the signature for<R.N> H1and gives it toR.Ris now in possession of<R.N> H1.APRETENDS TO CRASH.APRETENDS to performchannel_reestablishwith
Rat stateN - 1.Rsendsupdate_add_htlcwithH2with the largest possibleamount, to be forwarded elsewhere (to another node owned by
R).Arequests, toV, a signature for the nextRstateNwithH2as the HTLC state.
Vsigns the new state due to lack ofpolicy-commitment-retry-same.Call this signature
<R.N> H2.Rsends the revocation for stateR.N - 1.Vnow believes thatH2is "irrevocably committed" and authorizesits forwarding onwards.
Rcompletes the signature for<R.N> H1.VandR.Vcannot punish it it, because only stateR.N - 1is revoked.Rloses only the tiny amountH1(and it really only gets lockeduntil timeout) but gains the amount of
H2.RWINS.Due to this proposed solution allowing funds loss (and the issue itself
not leading to funds loss, other than the "cost of doing
business" for fees for closing and re-opening channels), VLS rejects
this solution as worse than the problem.
Correct Solution
The correct solution is to write an intent to sign to disk,
before performing the signing.
This is valid as a solution: writing the intent to a log (called
the "write-ahead log" or "intent log") is how all modern
databases implement ACID transactions; Sqlite3 is faster if you
enable WAL mode while still as safe, and CLN often uses Sqlite3
anyway.
Thus, writing the "intent to sign this new state" to the log is
semantically equivalent to "I signed this new state".
Once it has been written and committed to the database, it is
equivalent to atomically having signed the new state.
The advantage is that this is effectively atomic from the point of
view of both VLS and CLN!
Thus, it makes atomic the two operations "make signature" and
"write new state to disk", even if the two operations are done on
different machines:
on disk.
On restart, the view is:
N - 1N - 1can make the request to VLS.
On restart, the view is:
NN - 1the request to sign
N.This is accepted by VLS because it is a state advancement
from
N - 1toN.request to VLS, but VLS crashes before it can sign and updates its
own disk.
On VLS restart, the view is:
NN - 1has not been responded to on reconnection.
This is the same request that signs
N, which is accepted byVLS as a state advancement.
send the response to CLN.
On VLS restart, the view is:
NNrequest on reconnection of VLS.
As this request is the same as the request VLS received before
crashing, this passes
policy-commitment-retry-same.can send the signature in a
commitment_signed.On restart, the view is:
NNthe signing.
This is the same request as what VLS received before, which
passes
policy-commitment-retry-same.Then CLN will use the signature for
connection_reestablish.In CLN, the primary change is to move the call to
calc_commitsigs,in
sign_commit_partofchanneld/channeld.c, to after the codethat notifies the master.
The only dependency on the output of
calc_commitsigsappears to be atal_count(htlc_sigs)in thestatus_debugmessage, which can bereplaced with
tal_count(txs) - 1; otherwise, the outputs ofcalc_commitsigsseems to only be used to generate the actualcommitment_signedmessage.calc_commitsigsacceptstxsas non-const, but does not appearto actually modify it.
The
commit_sigargument is an output, and is the only othernon-
constpointer amongst its arguments.I am uncertain if
openingd/openingd.calso needs modifications.Expanding View
There are four different things that a signer has to handle for the
channel state advancement:
N.N - 1.N.N - 1.Uniquely, this splitbrain phenomoenon only affects the "provide
signature to counterparty for
N".N.Nstatesinstead of just a single
Nstate, that means the counterparty isopening itself to the vulnerability described here.
latest state from the counterparty or node.
N - 1.signature from counterparty" has been done.
N - 1.result will be the same each time (i.e. fully idempotent).
Thus, changes are unnnecessary in CLN for the other parts.