Skip to content

Add databricks doctor command#4730

Open
simonfaltum wants to merge 17 commits intomainfrom
simonfaltum/doctor-command
Open

Add databricks doctor command#4730
simonfaltum wants to merge 17 commits intomainfrom
simonfaltum/doctor-command

Conversation

@simonfaltum
Copy link
Member

@simonfaltum simonfaltum commented Mar 12, 2026

Why

Users debugging CLI setup issues (auth failures, config problems, network issues) have no single command to diagnose their environment. They must manually run separate commands to check auth, config, and connectivity.

Changes

Before: Users had to manually run separate commands to check auth, config, and connectivity.
Now: A new databricks doctor command runs all diagnostic checks and reports results as a checklist:

  • CLI version (info)
  • Config file readability and profile count (pass/fail)
  • Active profile (info)
  • Authentication validity and auth type (pass/fail)
  • User identity via CurrentUser.Me (pass/fail)
  • Network connectivity to workspace host (pass/fail)

Text output uses colored status icons ([ok], [FAIL], etc.) to stdout. JSON output (--output json) returns a structured array. Auth failures are reported as check results, not command errors.

Post-review fixes

  • Fix unified-host account profile misclassification: only classify as account-level when AccountID is set AND WorkspaceID is empty
  • Fix empty HTTP method in auth check request: use http.MethodGet with proper URL
  • Delegate config file path resolution to the SDK instead of manual filepath.Join
  • Add "skip" to CheckResult status doc comment
  • Add acceptance test under acceptance/cmd/doctor/
  • Fix context-backed env resolution regression: use env.NewConfigLoader(ctx) to participate in full auth resolution

Open item

  • Top-level command deny list: Like the global flags, the doctor command name should be added to a deny list for new API names in the universe API linters, so future auto-generated API commands don't collide with it. Tracked separately.

Test plan

  • Unit tests for each check function
  • Unit tests for both text and JSON rendering
  • Tests for graceful error handling (auth failure, missing config)
  • Regression tests for unified-host account profiles
  • Acceptance test for top-level command
  • make lintfull passes
  • make checks passes

Adds a top-level `databricks doctor` command that validates CLI setup by
running sequential diagnostic checks: CLI version, config file readability,
active profile, authentication, user identity, and network connectivity.

Auth failures are reported as check results, not command errors. Supports
both text output (colored status icons) and JSON output (`--output json`).

Co-authored-by: Isaac
@eng-dev-ecosystem-bot
Copy link
Collaborator

eng-dev-ecosystem-bot commented Mar 12, 2026

Commit: 4f9375e

Run: 23182727529

Env 🟨​KNOWN 🔄​flaky 💚​RECOVERED 🙈​SKIP ✅​pass 🙈​skip Time
🟨​ aws linux 7 1 9 268 793 8:42
🟨​ aws windows 7 1 9 270 791 8:57
🔄​ aws-ucws linux 2 7 9 364 708 7:03
🔄​ aws-ucws windows 2 7 9 366 706 6:21
💚​ azure linux 2 11 271 791 7:12
💚​ azure windows 2 11 273 789 5:24
🔄​ azure-ucws linux 2 1 11 369 704 9:00
🔄​ azure-ucws windows 2 1 11 371 702 6:33
💚​ gcp linux 2 11 267 794 10:08
💚​ gcp windows 2 11 269 792 8:27
18 interesting tests: 9 SKIP, 7 KNOWN, 2 flaky
Test Name aws linux aws windows aws-ucws linux aws-ucws windows azure linux azure windows azure-ucws linux azure-ucws windows gcp linux gcp windows
🟨​ TestAccept 🟨​K 🟨​K 🔄​f 🔄​f 💚​R 💚​R 💚​R 🔄​f 💚​R 💚​R
🙈​ TestAccept/bundle/resources/permissions 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions 🟨​K 🟨​K 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct 🟨​K 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 🟨​K 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions 🟨​K 🟨​K 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct 🟨​K 🟨​K 💚​R 💚​R
🟨​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 🟨​K 🟨​K 💚​R 💚​R
🙈​ TestAccept/bundle/resources/postgres_branches/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/update_protected 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/without_branch_id 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_projects/update_display_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/synced_database_tables/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🔄​ TestAccept/ssh/connect-serverless-gpu 🙈​s 🙈​s 🔄​f 🔄​f 🙈​s 🙈​s 🔄​f 🔄​f 🙈​s 🙈​s
🔄​ TestAccept/ssh/connection 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 🔄​f 💚​R 💚​R 💚​R
Top 27 slowest tests (at least 2 minutes):
duration env testname
6:59 gcp linux TestSecretsPutSecretStringValue
6:44 gcp linux TestAccept/ssh/connection
5:57 gcp windows TestSecretsPutSecretStringValue
5:28 gcp windows TestAccept/ssh/connection
4:52 aws linux TestSecretsPutSecretStringValue
4:03 aws windows TestSecretsPutSecretStringValue
3:13 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:13 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:13 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:11 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:08 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:08 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:06 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:46 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:43 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:43 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:37 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:30 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:23 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:15 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:13 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:10 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:10 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:10 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:09 azure linux TestSecretsPutSecretStringValue
2:08 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:06 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct

The test used env.Set(ctx, ...) to set DATABRICKS_HOST and
DATABRICKS_TOKEN, but checkAuth creates a bare config.Config{}
that reads from real environment variables via os.Getenv, not
the context-based env layer. Use t.Setenv instead so the SDK
can see the values.

Co-authored-by: Isaac
@simonfaltum simonfaltum marked this pull request as ready for review March 13, 2026 11:34
Copy link
Contributor

@shreyas-goenka shreyas-goenka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This review was posted by Claude (AI assistant). Shreyas will do a separate, more thorough review pass.

Priority: HIGH — Config resolution diverges from real CLI auth path

MAJOR: resolveConfig diverges from real CLI auth

The resolveConfig function in databricks doctor constructs its own config resolution path instead of going through the standard SDK/CLI authentication flow. This means the doctor command could report "config is fine" while the real CLI fails (or vice versa). If the goal is to diagnose auth issues, it should use the same code path the CLI uses.

MEDIUM: Network check bypasses SDK HTTP client

The connectivity check uses http.DefaultClient directly instead of going through the SDK's configured HTTP client. In enterprise environments with proxies or custom TLS, this will give misleading results — the check might fail even though the SDK would succeed (or vice versa).

Other Observations

  • Good idea for a diagnostic command overall
  • The step-by-step output format is user-friendly
  • Missing test coverage for the core diagnostic logic

When the workspace client is unavailable but config is resolved,
the network check was falling back to http.DefaultClient. This
ignores proxy and custom TLS settings from the SDK config, giving
misleading results in enterprise environments. Use
configuredNetworkHTTPClient(cfg) instead, which respects
HTTPTransport and InsecureSkipVerify from the config.
…allback, skip status

- Detect account-level configs (AccountID + account host) and use
  NewAccountClient instead of always using NewWorkspaceClient
- Add 15s per-check deadline for auth and identity checks to prevent
  hangs on unresponsive IdP
- Network check now tries even when config resolution fails, as long
  as a host URL is available from partial config resolution
- Identity marked as 'skip' (not 'fail') when auth failed or when
  using account-level profile, avoiding double failures from one root cause
- Add skip status rendering in text output
- isAccountLevelConfig now handles both AccountHost and UnifiedHost
- checkIdentity uses isAccountLevelConfig instead of raw HostType check
- Use http.MethodGet and authCfg.Host instead of empty strings
- Remove hardcoded config file path (let SDK resolve it)
- Add "skip" to CheckResult status doc comment
- checkCurrentProfile shows resolved profile from config file

Co-authored-by: Isaac
resolveConfig now checks for [__settings__].default_profile when no
explicit profile is set via --profile flag or DATABRICKS_CONFIG_PROFILE.
This mirrors the resolution logic in cmd/root/auth.go so users who rely
on `databricks auth switch` see the correct profile in doctor output.

Co-authored-by: Isaac
Table-driven tests for isAccountLevelConfig covering classic account
host, unified host with/without account ID, workspace host, and missing
host. Also adds a test for checkIdentity skipping the /me endpoint for
unified-host account profiles.

Co-authored-by: Isaac
Remove the custom Loaders override in resolveConfig. The CLI's
env.NewConfigLoader panics on non-string config attributes (e.g.
DATABRICKS_RATE_LIMIT which is an int), and the SDK's default loaders
already handle env vars properly. This aligns with how cmd/root/auth.go
creates config objects.

Also add an acceptance test under acceptance/cmd/doctor/ that exercises
the full doctor flow with a mock HTTP server.

Co-authored-by: Isaac
…tion

Use env.NewConfigLoader(ctx) in the doctor command's config loader chain
so that all SDK config attributes (HOST, TOKEN, PROFILE, etc.) set via
env.Set(ctx, ...) participate in auth resolution, not just CONFIG_FILE.

Fix isAccountLevelConfig to treat unified-host profiles with both
account_id and workspace_id as workspace-level, matching the profiler
classification in libs/databrickscfg/profile/profiler.go.

Co-authored-by: Isaac
The GetConfiguredDefaultProfile function is from the auth switch PR
which has not been merged into this branch's base yet. Remove the
dependency to fix the build.

Co-authored-by: Isaac
Use SetS instead of Set to properly convert string env var values
to the target type (int, bool, etc.) before assignment.

Co-authored-by: Isaac
Use filepath.ToSlash so the config file path always uses forward
slashes, making acceptance test output stable across OSes.

Co-authored-by: Isaac
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants