Enabling Distributed Processing #61

Open
markmac99 wants to merge 237 commits into wmpg:master from markmac99:distrib_processing

markmac99 commented Feb 14, 2026

Contributor

An upgrade to the RMS solver, CorrelateRMS, to enable distributed processing across multiple servers. To assist with this, the first step has been split into two: one to create candidates and one to perform initial simple solutions.

This PR also replaces the JSON database with SQLite, which is necessary to enable distributed processing and also brings performance benefits. The PR adds two new command-line arguments: one to control how much data to retain in the databases, and one to suffix the log name with the phase name, e.g. correlate_rms_20260101_12345_cands.log. This ensures each phase's log file is uniquely named and makes it easier to monitor and debug.
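The log-name suffixing described above could look roughly like the following. This is only an illustrative sketch: `phase_log_name` and its parameters are hypothetical names, not the function or argument names used in the PR.

```python
from pathlib import Path


def phase_log_name(base_name: str, phase_suffix: str = "") -> str:
    """Insert an optional phase suffix before the extension of a log file name.

    Hypothetical helper for illustration; the PR's real implementation may differ.
    """
    path = Path(base_name)
    if not phase_suffix:
        # No phase given: keep the log name unchanged
        return path.name
    # e.g. "name.log" + "cands" -> "name_cands.log"
    return f"{path.stem}_{phase_suffix}{path.suffix}"


print(phase_log_name("correlate_rms_20260101_12345.log", "cands"))
```

With a distinct suffix per phase, two phases started from the same base name write to different files.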

dvida commented Jun 3, 2026

Contributor

Agreed that issue 5 is a false alarm; I fixed this recently. Not sure why it picked it up. Let me know once things are ready to rock and roll.

markmac99 marked this pull request as draft June 3, 2026 14:39
@markmac99

Contributor Author

Ha. During testing I remembered why I was removing the remote copy even if the download failed...

When a child node downloads a file from the server, it then attempts to move it to the "processed" folder on the server, to avoid reprocessing it on the next pass. If the downloaded file can't be moved to the "processed" folder, it is because a copy ALREADY exists there, i.e. it was already processed. So it should indeed be deleted from the server, as otherwise the next pass will download and process it again, and so on ad infinitum.

On the other hand, if I can't download the file in the first place, it's virtually certain to be because of a network glitch. If after 10s of retries it still can't be downloaded, then we probably do want to leave it, though the rename/delete would also almost certainly fail.

I'll update the code to remove the file from the server if the move/rename operation fails.
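As a rough sketch of the decision logic described above (not the PR's actual code): the three callables and `max_retries` are hypothetical stand-ins for whatever the child node uses to talk to the server.

```python
from typing import Callable


def fetch_and_archive(
    name: str,
    download: Callable[[str], bool],
    move_to_processed: Callable[[str], bool],
    delete_remote: Callable[[str], None],
    max_retries: int = 10,  # assumed value, for illustration only
) -> str:
    """Download a file, then archive it on the server or delete the remote copy.

    Returns "left_on_server", "archived" or "deleted".
    """
    for _ in range(max_retries):
        if download(name):
            break
    else:
        # Download never succeeded: most likely a network glitch,
        # so leave the file on the server for the next pass.
        return "left_on_server"

    if move_to_processed(name):
        return "archived"

    # The move failed, which is taken to mean a copy already exists in
    # "processed": remove the remote file so it is not picked up again.
    delete_remote(name)
    return "deleted"
```

The remote file is only deleted after a successful download followed by a failed move, which is the case taken above to mean the file was already processed.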

@markmac99

Contributor Author

Okay, all done and tested. During testing I spotted and corrected a couple of other small issues, and improved the logging so it's easier to track which trajectories got assigned to child nodes.

markmac99 marked this pull request as ready for review June 5, 2026 21:45

markmac99 commented Jun 10, 2026

Contributor Author

I think this is all good to go, pending the documentation changes you asked for. However, I found one interesting problem.

When a Monte Carlo solution is run on a phase1 solution, the reference timestamp sometimes changes by a few milliseconds.

When run on a single machine, the software can detect this because each improved solution has the original trajectory path stored in the pre_mc_longname field, and it can therefore remove the old folder.

However, if phase2 solutions are being distributed to child nodes, each node has its own copy of the files. The software will correctly delete the "old" copy on the child node, but when the data are uploaded to the server, the server simply moves the uploaded files to the trajectories folder. It does not check whether the solution is an improvement on an existing solution, so we can sometimes end up with two solutions on disk.

Logic already exists to delete these when doing the MC phase on the master node, so I just need to extend this to run when merging in data from child nodes.
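For illustration only, the merge-time check might look something like this. The sketch assumes a flat trajectories folder and that pre_mc_longname holds the name of the original solution folder; `merge_uploaded_solution` and the folder layout are hypothetical, and the existing logic on the master node may work differently.

```python
import shutil
from pathlib import Path
from typing import Optional


def merge_uploaded_solution(
    upload_dir: Path, traj_root: Path, pre_mc_longname: Optional[str]
) -> Path:
    """Move an uploaded solution into traj_root and drop the one it supersedes."""
    target = traj_root / upload_dir.name
    if target.exists():
        # Assumption: an upload with the same name replaces the existing folder
        shutil.rmtree(target)
    shutil.move(str(upload_dir), str(target))

    # If the MC run shifted the reference timestamp, the new folder name
    # differs from the original one, which is now stale and can be removed.
    if pre_mc_longname and pre_mc_longname != target.name:
        old = traj_root / pre_mc_longname
        if old.is_dir():
            shutil.rmtree(old)
    return target
```

Run as part of the merge, this would leave one folder per trajectory on the server even when the child node's solution carries a slightly different timestamp.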

If we prefer to wait until it's resolved, I am fine with that. It should only take a few days.

Meanwhile the workaround is simple - don't distribute the MCPhase to child nodes :)
