[9.0] feat(Resources): introduce fabric in SSHCE #7703

Open
aldbr wants to merge 1 commit into DIRACGrid:integration from aldbr:v9.0_FEAT_use-fabric-in-SSHCE

Contributor

aldbr commented Jun 27, 2024

Replace the Dirac-specific SSH class by fabric.

BEGINRELEASENOTES
*Resources
CHANGE: Replace SSH by fabric in SSHComputingElement
ENDRELEASENOTES
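
For readers unfamiliar with fabric, here is a minimal sketch of how a remote command could be run through it. It is not the code from this PR: `SSHTarget`, `connection_kwargs` and `run_remote` are hypothetical names chosen for illustration, and only `fabric.Connection`, its `connect_kwargs` argument and `Connection.run(..., hide=True, warn=True)` are taken from the fabric API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SSHTarget:
    """Hypothetical container for the options an SSH-based CE needs."""

    host: str
    user: str
    port: int = 22
    key_file: Optional[str] = None


def connection_kwargs(target: SSHTarget) -> dict[str, Any]:
    """Translate the target into keyword arguments for fabric.Connection."""
    kwargs: dict[str, Any] = {
        "host": target.host,
        "user": target.user,
        "port": target.port,
    }
    if target.key_file:
        # connect_kwargs is forwarded to paramiko's SSHClient.connect().
        kwargs["connect_kwargs"] = {"key_filename": target.key_file}
    return kwargs


def run_remote(target: SSHTarget, command: str) -> tuple[int, str, str]:
    """Run one command remotely and return (exit code, stdout, stderr)."""
    # Imported lazily because fabric is a third-party dependency.
    from fabric import Connection

    with Connection(**connection_kwargs(target)) as conn:
        # hide=True captures the output instead of echoing it;
        # warn=True returns a Result on a non-zero exit instead of raising.
        result = conn.run(command, hide=True, warn=True)
    return result.exited, result.stdout, result.stderr
```

Keeping the keyword-argument construction separate from the network call makes the option handling checkable without a reachable SSH host.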

aldbr force-pushed the v9.0_FEAT_use-fabric-in-SSHCE branch 2 times, most recently from 206c55e to 6fbeaed, on June 27, 2024 08:07
Contributor

fstagni commented Jun 28, 2024

Just a note: we do not (yet) have a way to do proper integration tests for the Computing Elements, but one may think about adding them to our integration tests setup. Something to think about; it would be nice if it were in this PR.
It involves creating the "site", with the "CE" (this would be yet another container) and the SiteDirector could send pilots to it.

aldbr linked an issue on Jul 3, 2024 that may be closed by this pull request
aldbr force-pushed the v9.0_FEAT_use-fabric-in-SSHCE branch from 6fbeaed to 1c60e47 on July 25, 2024 15:24
Contributor Author

aldbr commented Jul 25, 2024

> Just a note: we do not (yet) have a way to do proper integration tests for the Computing Elements, but one may think about adding them to our integration tests setup. Something to think about; it would be nice if it were in this PR. It involves creating the "site", with the "CE" (this would be yet another container) and the SiteDirector could send pilots to it.

I agree it would be great to add integration tests for CEs, at least to test basic features. But it will likely become complex because:

  • if we want to test things properly, we need to set up a CE and a Batch System.
  • we will have to choose one configuration, but it might not reflect the configuration of the sites in production.

I will give it a try with the SSHCE, let's see.

Contributor Author

aldbr commented Nov 29, 2024

I wonder if it really makes sense to add CEs (and Batch Systems) in the integration tests: while it would be great to have a "grid in a box" in a controlled environment, it would be cumbersome to maintain in the long term and would not be representative of all the instances we can find out there (e.g. Arc v6, v6 with a hack, v7, transferring jobs to Slurm, HTCondor, SSH, SSH tunnel, HTCondor with local scheduler, with remote scheduler...).

It would probably make more sense to add some scripts to run during the hackathons. For each type of CE supported it would:

  • get all the instances related to the given type of CE and for each of them:
    • submit a "hello world" job
    • get the CE status
    • get the job status until it reaches a final state
    • get the job output and logging info (if available)

Basically, it would be very similar to (i) submitting pilots with the Site Director and (ii) checking their results manually. But it would be more focused on the CE interfaces and would be more automated (though a human would need to check whether errors come from the CE instance itself or the Dirac CE interface).
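
The per-CE check described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the certification script itself: `CEInterface` is a hypothetical, simplified interface (it is not the DIRAC ComputingElement API), and the set of final states and the polling parameters are assumptions.

```python
import time
from typing import Optional, Protocol

# Assumed set of final job states, for illustration only.
FINAL_STATES = {"Done", "Failed", "Aborted"}


class CEInterface(Protocol):
    """Hypothetical, simplified view of a CE interface."""

    def submit(self, executable: str) -> str: ...
    def ce_status(self) -> dict: ...
    def job_status(self, job_id: str) -> str: ...
    def job_output(self, job_id: str) -> str: ...


def check_ce(
    ce: CEInterface,
    executable: str = "echo 'hello world'",
    poll_interval: float = 30.0,
    max_polls: int = 20,
) -> dict:
    """Submit a test job, poll it until a final state, and collect a report."""
    report: dict = {}
    job_id = ce.submit(executable)
    report["job_id"] = job_id
    report["ce_status"] = ce.ce_status()

    # Poll until the job reaches a final state or the poll budget is spent.
    status = ce.job_status(job_id)
    polls = 1
    while status not in FINAL_STATES and polls < max_polls:
        time.sleep(poll_interval)
        status = ce.job_status(job_id)
        polls += 1
    report["job_status"] = status

    # Only fetch the output once the job has actually finished.
    output: Optional[str] = None
    if status in FINAL_STATES:
        output = ce.job_output(job_id)
    report["output"] = output
    return report
```

A human would still have to read the report to decide whether a failure comes from the CE instance or from the Dirac CE interface.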

Any opinion @fstagni ?

Contributor

fstagni commented Nov 29, 2024

I think the only one that would make sense to set up here is the SSHCE. The others, the "proper Grid ones", cannot be tested here.

Contributor Author

aldbr commented Nov 29, 2024

I don't even know if testing SSHCE in an integration test makes sense. The only easy test we can set up would be SSHCE + Host, which is not representative of what we can have in production.

Contributor

fstagni commented Nov 29, 2024

OK OK, give up on the idea...

Contributor Author

aldbr commented Nov 29, 2024

I will add a certification test focused on the CE interfaces as I explained (+ a card in the kanban board to explain how to execute it). I will execute it in the LHCb environment to make sure the changes in this PR are correct.

And I can also try to add a container that would act as a "Site" and use SSH + Host so that we can at least test the Site Director "in a box". Would it be okay?

Contributor

fstagni commented Nov 29, 2024

Sure, thanks

Contributor Author

aldbr commented Jan 12, 2026

Tested with #8420 in LHCb production + now used by the LHCb Site Directors for a few hours without any issue so far.

This PR does not modify the interfaces so the transition is expected to be transparent.
The only risky point I identified is the SSHTunnel, which can take any command as value: I need to extract the hostname as well as the port from it.
My function is supposed to be flexible and support various cases, but I can't guarantee that this is going to work fine if the value is tricky.
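
To illustrate why this extraction is fragile, here is a minimal sketch of such a parser. It is not the function from this PR: `parse_ssh_command` is a hypothetical name, it assumes the SSHTunnel value looks like an `ssh` command line, and the option table is only a subset of the flags listed in ssh(1).

```python
import shlex
from typing import Optional

# ssh options that consume the next token as their value (subset of ssh(1)).
_OPTIONS_WITH_VALUE = {
    "-b", "-c", "-D", "-E", "-e", "-F", "-I", "-i", "-J", "-L",
    "-l", "-m", "-O", "-o", "-p", "-Q", "-R", "-S", "-W", "-w",
}


def parse_ssh_command(command: str) -> tuple[Optional[str], str, int]:
    """Return (user, host, port) from e.g. 'ssh -p 2222 user@host'."""
    tokens = shlex.split(command)
    # Drop the leading executable if it is ssh (possibly with a path).
    if tokens and tokens[0].split("/")[-1] == "ssh":
        tokens = tokens[1:]

    user: Optional[str] = None
    port = 22
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in _OPTIONS_WITH_VALUE:
            if i + 1 >= len(tokens):
                raise ValueError(f"option {token} has no value")
            value = tokens[i + 1]
            if token == "-p":
                port = int(value)
            elif token == "-l":
                user = value
            i += 2
        elif token.startswith("-p") and token[2:].isdigit():
            # Attached form, e.g. -p2222.
            port = int(token[2:])
            i += 1
        elif token.startswith("-"):
            # Flag without a separate value, e.g. -A or -q.
            i += 1
        else:
            # First positional argument is the destination;
            # anything after it is the remote command.
            destination = token
            break
    else:
        raise ValueError("no destination found in command")

    maybe_user, sep, host = destination.rpartition("@")
    if sep:
        user = maybe_user
    return user, host, port
```

Values that wrap `ssh` in another program, rely on `~/.ssh/config` aliases, or use flags missing from the table would defeat this sketch, which is the kind of tricky value the comment above warns about.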

aldbr force-pushed the v9.0_FEAT_use-fabric-in-SSHCE branch from 6b383ed to 4a20778 on January 12, 2026 14:07
Contributor Author

aldbr commented Jan 12, 2026

Another risky point is the SSHBatchCE, because I don't know how to test it properly: we don't have any instance in LHCb as far as I know.

Contributor

fstagni commented Jan 12, 2026

Does this need a new DIRACOS release?

aldbr force-pushed the v9.0_FEAT_use-fabric-in-SSHCE branch from 4a20778 to 2bc6401 on January 13, 2026 09:47
Contributor

fstagni commented Mar 4, 2026

A few comments after asking the relevant admins:

  • the SSHTunnel option is not used at the moment by any user other than LHCb. It is nevertheless felt that it can be useful (and it looks like it is!)
  • SSHBatch CE is there for testing purposes and can sometimes be used for short-lived "CEs". There is an SSHBatch CE defined in the certification machine.

This PR should be tested in the certification setup.

Contributor Author

aldbr commented Mar 10, 2026

Alright, then I will simplify the SSHTunnel option.
For the SSHBatch CE: I still fail to understand its goal. What would you test with that exactly? Is it really needed? I see the SSHBatch CE in certification, it looks like it is serving only 1 host and could be just an SSHCE. Am I wrong?

Contributor

fstagni commented Mar 10, 2026

The one in certification is just an example, of course. More than a single IP can be added.

SSHBatch can be used for creating a grid in a box, and in this case it's just for local testing and/or demonstration purposes.
My impression is that its functionalities could be included in SSHCE (if BatchSystem == "Host" and an SSHHost list is defined).

Contributor Author

aldbr commented Mar 10, 2026

Given that:

  • there is no real use case in production (IIUC)
  • the "grid in a box" use case is hypothetical at this point (again IIUC), and we could just use an SSHCE per host
    • btw, if we ever have a "grid in a box", then it would likely be better to use the SSHCE to test it (as we said in earlier comments)
  • other CEs work the same way: 1 CE = 1 endpoint

Shouldn't we just drop that feature?

@arrabito
Contributor

For me dropping SSHBatch is fine.

Then about:

> the SSHTunnel option is not used at the moment by any other user but LHCb. It is anyway felt that it can be useful (and looks like it is!)

As I said in my answer to the admin list, we do use the SSHTunnel option combined with SSHCE.
To be honest, I've just configured my SSHCE with it, as I'm not sure what the alternatives are.

To clarify the context of our usage, we have 2 use cases:

1/ in production for a site which has neither an HTCondor nor an ARC CE (although the site will adopt HTCondor very soon, so we will not need to use SSHCE anymore)

2/ in our certification instance and also in our kind deployment used in the CI -> 'grid in a box'

To avoid any misunderstanding I share here our configurations.

In production the Site using the SSHCE is configured as follows:

[Screenshot 2026-03-11 at 08:21:46]

Is my understanding correct that in this case we are using the SSHTunnel to execute slurm submission commands?

For the certification instance, on the other hand, the Site is a Host and is configured as follows:

[Screenshot 2026-03-11 at 08:24:04]

Again, my understanding is that we are also using the SSHTunnel in this case. Is that correct?

Contributor Author

aldbr commented Mar 11, 2026

> Is my understanding correct that in this case we are using the SSHTunnel to execute slurm submission commands?

From what I can see, you are not using the SSHTunnel option here (I don't see it in the screenshot you provided).
SSHTunnel is used when you are passing through one (or multiple) jump node(s) to access the target node where the batch system commands are available.
Here from what I understand, you just SSH to a target node where slurm commands are available.
Am I understanding correctly?
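
As an illustration of the jump-node case, here is a minimal sketch of how a chain of jump hosts could be expressed with fabric, whose `Connection` accepts another `Connection` as its `gateway` argument. This is not the code from this PR: `build_connection` and its `connection_factory` parameter are hypothetical, the latter being there only so the chaining logic can be exercised without fabric installed.

```python
from typing import Any, Callable, Optional


def build_connection(
    target: dict[str, Any],
    jump_hosts: list[dict[str, Any]],
    connection_factory: Optional[Callable[..., Any]] = None,
) -> Any:
    """Build a connection to `target`, tunnelled through `jump_hosts` in order.

    Each dict holds keyword arguments for the connection class
    (for fabric.Connection: host, user, port, connect_kwargs, ...).
    """
    if connection_factory is None:
        # Imported lazily because fabric is a third-party dependency.
        from fabric import Connection

        connection_factory = Connection

    gateway = None
    for hop in jump_hosts:
        # Each hop is reached through the previous one (None for the first).
        gateway = connection_factory(gateway=gateway, **hop)
    # With an empty jump list this is a plain, direct connection.
    return connection_factory(gateway=gateway, **target)
```

With an empty `jump_hosts` list the result is the direct case discussed here: a single SSH connection to the node where the batch system commands are available.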

@arrabito
Contributor

> From what I can see, you are not using the SSHTunnel option here (I don't see it in the screenshot you provided).
> SSHTunnel is used when you are passing through one (or multiple) jump node(s) to access the target node where the batch system commands are available.
> Here from what I understand, you just SSH to a target node where slurm commands are available.
> Am I understanding correctly?

Yes that's correct. So in this case we don't use the SSHTunnel option. Thank you for the clarification.

Development

Successfully merging this pull request may close these issues.

Replace the SSH class by a Python library?

4 participants