Auto-generate session names for serverless SSH connect #4701
simonfaltum left a comment:
[Agent Swarm Review] Verdict: Not ready yet
- 1 Critical
- 3 Major
- 1 Gap
- 2 Nit
- 2 Suggestion
The feature design (auto-generate session names, track sessions, offer reconnection) is good UX. However, there are several correctness issues that need addressing before this is safe to ship:
- Critical: Any probe failure triggers irreversible remote resource cleanup, even for transient network errors
- Major: No identity tracking means profile switches can cause cross-identity destructive cleanup
- Major: Expired sessions accumulate forever (remote resources leaked)
- Major: CLI version changes break reconnection within the 24h window
See inline comments for details.
```go
	alive = append(alive, s)
} else {
	cleanupStaleSession(ctx, client, s, version)
}
```
[Agent Swarm Review] [Critical]
Any probe error is treated as proof that the session is stale.
resolveServerlessSession() calls cleanupStaleSession() for every getServerMetadata() failure. That probe can fail for transient auth, network, workspace API, or version-mismatch reasons. In those cases the CLI will delete local SSH config, remove the session from state, and best-effort delete secret scopes and workspace content for a session that may still be alive.
Both reviewers flagged this. Isaac confirmed Critical in cross-review due to irreversible blast radius.
Suggestion: Only run destructive cleanup on definitive stale signals (e.g., 404/not-found). For transient errors, keep the session and surface a warning.
Fixed. Now only errServerMetadata (definitive not-found) triggers cleanup. Transient errors (network, auth) are logged as warnings and the session is kept.
```go
Accelerator   string    `json:"accelerator"`
WorkspaceHost string    `json:"workspace_host"`
CreatedAt     time.Time `json:"created_at"`
ClusterID     string    `json:"cluster_id,omitempty"`
```
[Agent Swarm Review] [Major]
Session matching ignores Databricks identity. No username/profile is stored, so switching profiles on the same workspace causes cross-identity session mixup. Combined with the probe-failure cleanup issue, probing another identity's session fails and triggers cleanup of their remote resources.
Suggestion: Persist an identity key (e.g., workspace username) in the Session struct and include it in FindMatching.
Added UserName to the Session struct. FindMatching now requires and filters by user name, preventing cross-identity session mixups.
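The identity filter can be sketched as below; the `Session` fields beyond `UserName` are trimmed from the snippet above, and the function shape is an assumption (the real `FindMatching` also handles loading and expiry):

```go
package main

import "fmt"

// Session is a trimmed sketch of the struct discussed above.
type Session struct {
	Name          string
	WorkspaceHost string
	Accelerator   string
	UserName      string
}

// findMatching filters stored sessions by workspace, accelerator, and the
// current user's identity, so switching profiles never surfaces (or later
// cleans up) another identity's sessions.
func findMatching(all []Session, host, accel, user string) []Session {
	var out []Session
	for _, s := range all {
		if s.WorkspaceHost == host && s.Accelerator == accel && s.UserName == user {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	sessions := []Session{
		{Name: "a", WorkspaceHost: "h", Accelerator: "gpu-a10", UserName: "alice"},
		{Name: "b", WorkspaceHost: "h", Accelerator: "gpu-a10", UserName: "bob"},
	}
	fmt.Println(len(findMatching(sessions, "h", "gpu-a10", "alice"))) // 1
}
```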
```go
		result = append(result, s)
	}
}
return result, nil
```
[Agent Swarm Review] [Major]
Expired sessions accumulate in state file indefinitely. FindMatching filters out expired sessions from results but never removes them from the store on disk or triggers cleanup of their remote resources (secret scopes, workspace content). Over time the state file grows unboundedly and remote resources are leaked.
Suggestion: Prune expired sessions during Load or FindMatching, saving the cleaned store back.
Fixed. FindMatching now prunes expired sessions from disk when it encounters them, so the state file stays bounded.
```go
date := time.Now().Format("20060102")
b := make([]byte, 3)
_, _ = rand.Read(b)
```
[Agent Swarm Review] [Nit]
_, _ = rand.Read(b) discards the error. If crypto/rand.Read fails, b is all zeros, producing predictable session names. Consider checking the error or adding a comment explaining why it's intentionally ignored.
Now panics on rand.Read failure (which should never happen with crypto/rand on a healthy system) instead of silently producing predictable names.
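The panic-on-failure pattern looks roughly like this (a sketch; the helper name is illustrative):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomSuffix returns n random bytes, hex-encoded. crypto/rand.Read only
// fails if the OS entropy source is broken, so panicking (rather than
// silently producing all-zero, predictable names) is the safer choice.
func randomSuffix(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(fmt.Sprintf("crypto/rand failed: %v", err))
	}
	return hex.EncodeToString(b)
}

func main() {
	fmt.Println(len(randomSuffix(3))) // 6 hex characters
}
```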
```go
hostKeyChecking := "StrictHostKeyChecking=accept-new"
if opts.IsServerlessMode() {
	hostKeyChecking = "StrictHostKeyChecking=no"
}

sshArgs := []string{
	"-l", userName,
	"-i", privateKeyPath,
	"-o", "IdentitiesOnly=yes",
	"-o", hostKeyChecking,
	"-o", "ConnectTimeout=360",
	"-o", "ProxyCommand=" + proxyCommand,
}
if opts.IsServerlessMode() {
	sshArgs = append(sshArgs, "-o", "UserKnownHostsFile=/dev/null")
} else if opts.UserKnownHostsFile != "" {
```
We've had options like this before, but the security team didn't like it.
With auto host name generation we should not have that many host conflicts, right?
Before you would get them if you re-used the same name to connect to a different workspace. Re-using the same name for the same workspace is fine, as our server will get the server ssh key from the secrets scope that's tied to the name (and with the same name the scope will be the same). But across different workspaces we will get a problem, since server keys will be different.
Can we also add workspace id (real one, or based on the host url) to the generated session name?
Good call — removed StrictHostKeyChecking=no entirely. Instead, session names now include a workspace host hash (4 hex chars from MD5 of the host URL), so different workspaces get different session names and won't have known_hosts conflicts.
```go
// resolveServerlessSession handles auto-generation and reconnection for serverless sessions.
// It checks local state for existing sessions matching the workspace and accelerator,
// probes them to see if they're still alive, and prompts the user to reconnect or create new.
func resolveServerlessSession(ctx context.Context, client *databricks.WorkspaceClient, opts *ClientOptions) error {
```
Nit, but this could be a method on the ClientOptions struct; that might make it easier to see that we are mutating the options here.
Done — resolveServerlessSession is now a method on *ClientOptions.
```go
func resolveServerlessSession(ctx context.Context, client *databricks.WorkspaceClient, opts *ClientOptions) error {
	version := build.GetInfo().Version

	matching, err := sessions.FindMatching(ctx, client.Config.Host, opts.Accelerator)
```
I feel like the majority of this logic can be moved to the sessions package (up until line 788); getServerMetadata can be passed as an argument. Then it will be easier to test.
Same for cleanupStaleSession. Or will there be circular dependencies if we do that? (since that function has a lot of them)
Good idea — I'll consider moving the probe/prompt logic to the sessions package in a follow-up. For now the probing depends on getServerMetadata and cleanupStaleSession which have heavy workspace client dependencies, so extracting it cleanly would need passing those as function arguments. Keeping it here for now to avoid a larger refactor in this PR.
```go
// GenerateSessionName creates a human-readable session name from the accelerator type.
// Format: <prefix>-<random_hex>, e.g. "gpu-a10-f3a2b1c0".
func GenerateSessionName(accelerator string) string {
```
As mentioned above, will it help with known_hosts conflicts if we add a workspace id/host here?
Yes! Added a 4-char hash of the workspace host to generated session names (format: databricks-gpu-a10-20260316-<wshash><random>). Different workspaces produce different names, avoiding known_hosts conflicts without needing StrictHostKeyChecking=no.
Move detailed diagnostic messages (SSH key paths, secrets scope, remote user/port, job submission details, upload progress) from cmdio.LogString to log.Infof so they only appear with --log-level=info. Add spinners for long-running operations: binary upload, cluster state check, job startup wait, and metadata polling. Keep concise user-facing step messages (Connecting, Uploading, Starting, Connected) for progress visibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use single sp.Close() after UploadTunnelReleases instead of duplicated calls
- Use defer sp.Close() in ensureSSHServerIsRunning for consistent cleanup
- Keep job run ID in cmdio.LogString so it's visible without --log-level=info

Co-authored-by: Isaac
Add TrackElapsedTime() method to the spinner that shows a running MM:SS stopwatch next to the spinner message. The time updates on every spinner tick (200ms). Enable it for all SSH connect spinners so users can see how long each step has been running.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Show elapsed time as prefix (e.g., "00:15 ⣾ Starting SSH server...") so it doesn't jump around as the message changes
- Replace TrackElapsedTime() method with WithElapsedTime() construction option
- Remove elapsedTimeMsg in favor of setting startTime directly in the model

Co-authored-by: Isaac
…upport

Remove the requirement for --name in serverless SSH connect. Sessions are now auto-generated with human-readable names (e.g. databricks-gpu-a10-20260310-a1b2c3), tracked in ~/.databricks/ssh-tunnel-sessions.json, and offered for reconnection on subsequent runs. Stale sessions are cleaned up automatically. Sessions expire after 24 hours. Also fixes known_hosts key mismatches for serverless by disabling strict host key checking (identity verified via Databricks auth).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…stKeyChecking=no

- Add UserName to Session struct and include in FindMatching to prevent cross-identity session mixups when switching profiles
- Only run cleanup on definitive errServerMetadata errors; log and skip on transient failures (network, auth) to avoid deleting live sessions
- Add workspace host hash to generated session names to avoid SSH known_hosts conflicts across workspaces, removing the need for StrictHostKeyChecking=no and UserKnownHostsFile=/dev/null
- Prune expired sessions from disk during FindMatching
- Make resolveServerlessSession a method on ClientOptions
- Handle rand.Read error explicitly

Co-authored-by: Isaac
## Summary
- Wrap the settings JSON preview in `{ }` with proper indentation so it
stands out visually
- Add blank lines around the settings block for padding
- Default the "Apply these settings?" prompt to yes (`[Y/n]`) — pressing
Enter accepts
- Shorten inline comments (`// Global setting` instead of `// Global
setting that affects all remote ssh connections`)
Stacked on #4701.
## Test plan
- [x] Existing vscode settings tests pass
- [ ] Manual test: verify the prompt renders with proper formatting and
padding
- [ ] Manual test: pressing Enter without typing accepts the settings
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- `--name` is no longer required in serverless `ssh connect`; `--accelerator` alone is now sufficient
- Session names are auto-generated and human-readable (e.g. `databricks-gpu-a10-20260310-f3a2b1c0`), with a workspace host hash to avoid SSH known_hosts conflicts across workspaces
- Sessions are tracked in `~/.databricks/ssh-tunnel-sessions.json` with user identity to prevent cross-profile session mixups

Stacked on #4697.

## Test plan
- `databricks ssh connect --accelerator GPU_1xA10` creates a new session

🤖 Generated with Claude Code