pyln-testing: replace ephemeral-port-reserve with a filesystem lock approach#9211
pyln-testing: replace ephemeral-port-reserve with a filesystem lock approach#9211daywalker90 wants to merge 4 commits into
Conversation
|
I don't quite get why it is necessary to keep a list of reserved ports at all. The idea is to have the OS itself track the reservations, and have them time out after the usual 30s - 60s cleanup timeout, after which the So unless we have a test that somehow assumes we have control over a port, while not actively bound to it, for prolonged periods (30s - 60s+) we should never need to materialize the list of reservations, even worse, the list of reservations may think it has control over a port, when the OS has already given that port to someone else. So the materialized port list may actually be causing collisions that are harder to debug. To me, a test that runs for longer than 30s and still expects to have exclusive control over a port is the bigger problem. Due to the port returning to the free-pool, and the OS being allowed to reassign it, the list reservation approach will never work I think. |
e7fe59b changed the way lightningd refuses rpc calls during shutdown and the now resulting ConnectionRefusedError was never caught so the test plugin would not exit and the teardown of those tests would hang for the full timeout Changelog-None
afaa3ca to
bcedc58
Compare
For some reason ports go back into the available pool sooner than what we would expect. I'm not sure why, but it clearly happens. |
bcedc58 to
2492c7a
Compare
2492c7a to
6858b33
Compare
…ady first
`test_grpc_custommsg_notification` was often failing with a timeout here:
```
def test_grpc_custommsg_notification(node_factory):
l1, l2 = node_factory.get_nodes(2)
# Test the connect notification
custommsg_stream = l1.grpc.SubscribeCustomMsg(clnpb.StreamCustomMsgRequest())
l2.connect(l1)
# Send the custom-msg to node l1
# The payload doesn't matter.
# It just needs to be valid hex which encodes to an odd BOLT-8 msg id
l2.rpc.sendcustommsg(l1.info["id"], "3131313174657374")
> for custommsg in custommsg_stream:
```
Changelog-None
… exit this would cause pytest teardowns to take the full timeout of killing a plugin that should exit themselves Changelog-None
…pproach For some reason ports go back into the available pool sooner than what we would expect. I'm not sure why, but it clearly happens. ephemeral_port_reserve puts them in TIME_WAIT, we can bind to them immediately because of SO_REUSEADDR set, but then, probably because of node restarts, during a test the port is somehow returned to a state where ephemeral_port_reserve will return it again immediately for another test. So we need an additional mechanism for the ports to stay consistent during a test with node restarts. Our current mechanism was flawed in combination with pytest-xdist since it was tracking per worker, not per machine. This new filesystem approach should work across pytest workers Changelog-None
6858b33 to
09d9fb0
Compare
|
Did three more full runs and sadly got one with one case of "address already in use" (😭 ), still feels much more reliable though.. Also added two fixes for slow teardown because of improper exiting after shutdown notification and a fix for a grpc test flake, where there was a race between the grpc channel being ready and the custommsg already being sent. |
using pytest-xdist the ephemeral-port-reserve approach would only ensure no port conflicts per pytest worker. There could still be race conditions where we get "address already in use".
This new filesystem approach should work across pytest workers
Changelog-None