pyln-testing: replace ephemeral-port-reserve with a filesystem lock approach by daywalker90 · Pull Request #9211 · ElementsProject/lightning

daywalker90 · 2026-06-12T11:27:10Z

using pytest-xdist the ephemeral-port-reserve approach would only ensure no port conflicts per pytest worker. There could still be race conditions where we get "address already in use".

This new filesystem approach should work across pytest workers

Changelog-None

cdecker · 2026-06-12T14:29:55Z

I don't quite get why it is necessary to keep a list of reserved ports at all. The idea is to have the OS itself track the reservations, and have them time out after the usual 30s - 60s cleanup timeout, after which the TIME_WAIT state on the socket expires, and the port goes back into the pool of available ports (i.e., bind with that specific port no longer needs SO_REUSEADDR set, and the random port selection may return it).

So unless we have a test that somehow assumes we have control over a port, while not actively bound to it, for prolonged periods (30s - 60s+) we should never need to materialize the list of reservations, even worse, the list of reservations may think it has control over a port, when the OS has already given that port to someone else. So the materialized port list may actually be causing collisions that are harder to debug. To me, a test that runs for longer than 30s and still expects to have exclusive control over a port is the bigger problem. Due to the port returning to the free-pool, and the OS being allowed to reassign it, the list reservation approach will never work I think.

e7fe59b changed the way lightningd refuses rpc calls during shutdown and the now resulting ConnectionRefusedError was never caught so the test plugin would not exit and the teardown of those tests would hang for the full timeout Changelog-None

daywalker90 · 2026-06-16T07:35:19Z

I don't quite get why it is necessary to keep a list of reserved ports at all. The idea is to have the OS itself track the reservations, and have them time out after the usual 30s - 60s cleanup timeout, after which the TIME_WAIT state on the socket expires, and the port goes back into the pool of available ports (i.e., bind with that specific port no longer needs SO_REUSEADDR set, and the random port selection may return it).

So unless we have a test that somehow assumes we have control over a port, while not actively bound to it, for prolonged periods (30s - 60s+) we should never need to materialize the list of reservations, even worse, the list of reservations may think it has control over a port, when the OS has already given that port to someone else. So the materialized port list may actually be causing collisions that are harder to debug. To me, a test that runs for longer than 30s and still expects to have exclusive control over a port is the bigger problem. Due to the port returning to the free-pool, and the OS being allowed to reassign it, the list reservation approach will never work I think.

For some reason ports go back into the available pool sooner than what we would expect. I'm not sure why, but it clearly happens.
ephemeral_port_reserve puts them in TIME_WAIT, we can bind to them immediately because of SO_REUSEADDR set, but then, probably because of node restarts, during a test the port is somehow returned to a state where ephemeral_port_reserve will return it again immediately for another test.
So we need an additional mechanism for the ports to stay consistent during a test with node restarts. Our current mechanism was flawed in combination with pytest-xdist since it was tracking per worker, not per machine. I did five runs with this on my fork repo and one run here with no "address already in use" errors, where before it would happen in atleast 50% of runs.

…ady first `test_grpc_custommsg_notification` was often failing with a timeout here: ``` def test_grpc_custommsg_notification(node_factory): l1, l2 = node_factory.get_nodes(2) # Test the connect notification custommsg_stream = l1.grpc.SubscribeCustomMsg(clnpb.StreamCustomMsgRequest()) l2.connect(l1) # Send the custom-msg to node l1 # The payload doesn't matter. # It just needs to be valid hex which encodes to an odd BOLT-8 msg id l2.rpc.sendcustommsg(l1.info["id"], "3131313174657374") > for custommsg in custommsg_stream: ``` Changelog-None

… exit this would cause pytest teardowns to take the full timeout of killing a plugin that should exit themselves Changelog-None

…pproach For some reason ports go back into the available pool sooner than what we would expect. I'm not sure why, but it clearly happens. ephemeral_port_reserve puts them in TIME_WAIT, we can bind to them immediately because of SO_REUSEADDR set, but then, probably because of node restarts, during a test the port is somehow returned to a state where ephemeral_port_reserve will return it again immediately for another test. So we need an additional mechanism for the ports to stay consistent during a test with node restarts. Our current mechanism was flawed in combination with pytest-xdist since it was tracking per worker, not per machine. This new filesystem approach should work across pytest workers Changelog-None

daywalker90 · 2026-06-16T14:20:44Z

Did three more full runs and sadly got one with one case of "address already in use" (😭 ), still feels much more reliable though..

Also added two fixes for slow teardown because of improper exiting after shutdown notification and a fix for a grpc test flake, where there was a race between the grpc channel being ready and the custommsg already being sent.

daywalker90 requested a review from cdecker as a code owner June 12, 2026 11:27

daywalker90 marked this pull request as draft June 12, 2026 12:30

daywalker90 force-pushed the address-in-use-fix-67 branch from afaa3ca to bcedc58 Compare June 15, 2026 21:56

daywalker90 linked an issue Jun 16, 2026 that may be closed by this pull request

CI flake test_websocket #9220

Open

daywalker90 marked this pull request as ready for review June 16, 2026 07:35

daywalker90 added the Status::Ready for Review The work has been completed and is now awaiting evaluation or approval. label Jun 16, 2026

daywalker90 force-pushed the address-in-use-fix-67 branch from bcedc58 to 2492c7a Compare June 16, 2026 07:39

daywalker90 mentioned this pull request Jun 16, 2026

CI flake test_websocket #9220

Open

daywalker90 force-pushed the address-in-use-fix-67 branch from 2492c7a to 6858b33 Compare June 16, 2026 09:20

daywalker90 added 3 commits June 16, 2026 12:47

cln-subscribe-wildcard: properly handle shutdown notification with an…

b9be98e

… exit this would cause pytest teardowns to take the full timeout of killing a plugin that should exit themselves Changelog-None

daywalker90 force-pushed the address-in-use-fix-67 branch from 6858b33 to 09d9fb0 Compare June 16, 2026 10:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pyln-testing: replace ephemeral-port-reserve with a filesystem lock approach#9211

pyln-testing: replace ephemeral-port-reserve with a filesystem lock approach#9211
daywalker90 wants to merge 4 commits into
ElementsProject:masterfrom
daywalker90:address-in-use-fix-67

daywalker90 commented Jun 12, 2026

Uh oh!

cdecker commented Jun 12, 2026

Uh oh!

daywalker90 commented Jun 16, 2026

Uh oh!

daywalker90 commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

daywalker90 commented Jun 12, 2026

Uh oh!

cdecker commented Jun 12, 2026

Uh oh!

daywalker90 commented Jun 16, 2026

Uh oh!

daywalker90 commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants