Add MCPServerEntry validation controller and MCPGroup integration #4664
Conversation
Codecov Report (❌ patch coverage)

Coverage diff:

```
@@           Coverage Diff            @@
##             main    #4664    +/-  ##
========================================
- Coverage   68.84%   68.71%   -0.13%
========================================
  Files         509      510       +1
  Lines       52668    53025     +357
========================================
+ Hits        36259    36438     +179
- Misses      13606    13777     +171
- Partials     2803     2810       +7
```
```go
		ObservedGeneration: entry.Generation,
	})
	return false
}
```
This only checks that the MCPGroup exists, but not that it's in Ready phase. Both the MCPServer and MCPRemoteProxy controllers check group.Status.Phase != MCPGroupPhaseReady here and set a GroupRefNotReady reason — and the types file already defines ConditionReasonMCPServerEntryGroupRefNotReady for this, so it looks like the intent was there.
Should be a straightforward addition after the successful Get.
Good catch, yeah the intent was definitely there... I just forgot to wire it up. Fixed now. The controller checks group.Status.Phase != MCPGroupPhaseReady after the successful Get and sets GroupRefNotReady with a message showing the current phase. Matches what MCPServer and MCPRemoteProxy do.
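The phase gate described in that reply can be sketched in a self-contained way. This uses local stand-in types rather than the real `mcpv1alpha1` API, and the function name `checkGroupReady` is illustrative, not taken from the PR:

```go
package main

import "fmt"

// Local stand-ins for the real mcpv1alpha1 types.
type MCPGroupPhase string

const MCPGroupPhaseReady MCPGroupPhase = "Ready"

type MCPGroup struct {
	Status struct{ Phase MCPGroupPhase }
}

// checkGroupReady mirrors the readiness gate added after the successful Get:
// a group that exists but is not in the Ready phase yields a GroupRefNotReady
// reason and a message that includes the current phase.
func checkGroupReady(group *MCPGroup) (ok bool, reason, message string) {
	if group.Status.Phase != MCPGroupPhaseReady {
		return false, "GroupRefNotReady",
			fmt.Sprintf("Referenced MCPGroup is not ready (phase: %s)", group.Status.Phase)
	}
	return true, "", ""
}

func main() {
	pending := &MCPGroup{}
	pending.Status.Phase = "Pending"
	ok, reason, msg := checkGroupReady(pending)
	fmt.Println(ok, reason, msg)
}
```

The real controller would set the returned reason and message on the entry's status condition, matching what MCPServer and MCPRemoteProxy do.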
```go
		handler.EnqueueRequestsFromMapFunc(r.findEntriesForAuthConfig),
	).
	Complete(r)
}
```
This watches MCPExternalAuthConfig changes but not MCPGroup or ConfigMap (CA bundle). Since this is a validation-only controller, stale conditions are its main failure mode.
If an MCPGroup or CA bundle ConfigMap is created after the MCPServerEntry, the entry stays Pending forever — unlike MCPServer/MCPRemoteProxy which have secondary triggers from Deployment changes, there's nothing else to re-trigger reconciliation here.
Worth adding watches for both, or at minimum a periodic requeue as fallback.
Yeah, you're right... stale conditions are basically the one thing this controller can get wrong since it does nothing else. Added watches for both MCPGroup and corev1.ConfigMap now. The ConfigMap watch uses a field index on spec.caBundleRef.configMapRef.name so it only re-queues entries that actually reference the changed ConfigMap (same pattern as the auth config watch, but with server-side filtering this time).
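The extractor behind that `spec.caBundleRef.configMapRef.name` field index can be sketched with local stand-in types (the real one is registered via `mgr.GetFieldIndexer().IndexField` in `SetupWithManager`; the types and function name below are illustrative):

```go
package main

import "fmt"

// Local stand-ins for the relevant slice of the MCPServerEntry API.
type ConfigMapRef struct{ Name string }
type CABundleRef struct{ ConfigMapRef *ConfigMapRef }
type MCPServerEntrySpec struct{ CABundleRef *CABundleRef }
type MCPServerEntry struct{ Spec MCPServerEntrySpec }

// caBundleConfigMapIndexer returns the referenced ConfigMap name, or nil when
// the optional ref is absent, so the ConfigMap watch only re-queues entries
// that actually reference the changed ConfigMap (server-side filtering).
func caBundleConfigMapIndexer(entry *MCPServerEntry) []string {
	if entry.Spec.CABundleRef == nil || entry.Spec.CABundleRef.ConfigMapRef == nil {
		return nil
	}
	return []string{entry.Spec.CABundleRef.ConfigMapRef.Name}
}

func main() {
	e := &MCPServerEntry{}
	e.Spec.CABundleRef = &CABundleRef{ConfigMapRef: &ConfigMapRef{Name: "ca-bundle"}}
	fmt.Println(caBundleConfigMapIndexer(e))
}
```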
```go
	client.Client
}

// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpserverentries,verbs=get;list;watch;create;update;patch;delete
```
The controller only does Get, List, and Status().Update() — it never creates, updates, patches, or deletes MCPServerEntry objects. The RBAC marker here grants full CRUD though. Compare with the MCPGroup controller on line 41 which correctly uses get;list;watch for the same resource.
Should be:

```go
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpserverentries,verbs=get;list;watch
```

Then re-run `task operator-manifests`.
Yep, fixed. Reduced to `get;list;watch` and re-ran `task operator-manifests`. The generated ClusterRole now groups mcpserverentries with the other read-only resources.
```go
groupKey := types.NamespacedName{Namespace: entry.Namespace, Name: entry.Spec.GroupRef}

if err := r.Get(ctx, groupKey, group); err != nil {
	reason := mcpv1alpha1.ConditionReasonMCPServerEntryGroupRefNotFound
```
When r.Get fails with a transient error (timeout, RBAC issue), the reason is still set to GroupRefNotFound even though it's not actually a 404. The message differentiates but the machine-readable reason doesn't. Same thing in the other two validate methods.
Might be worth either adding a separate reason for transient failures, or returning the error to force a requeue instead of persisting a potentially misleading condition.
Went with the "return the error" approach. All three validate methods now return (bool, error). If the Get fails with anything other than NotFound, we return the error directly so the controller-runtime requeue logic kicks in. No misleading condition gets persisted, and the detailed error is logged server-side only (no more raw err.Error() in status conditions either, which addresses the security concern about leaking internal details).
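The `(bool, error)` contract from that reply can be sketched in isolation. `errNotFound`/`errTransient` below stand in for real Kubernetes API errors (the actual code would discriminate with `apierrors.IsNotFound`); the function body is a sketch of the control flow, not the PR's implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for Kubernetes API errors.
var (
	errNotFound  = errors.New("not found")
	errTransient = errors.New("connection timeout")
)

func isNotFound(err error) bool { return errors.Is(err, errNotFound) }

// validateGroupRef sketches the (bool, error) contract: a NotFound from the
// Get persists a condition (ok=false, nil error), while any other error is
// returned so controller-runtime requeues instead of recording a misleading
// GroupRefNotFound condition. Details are logged server-side only.
func validateGroupRef(getErr error) (ok bool, err error) {
	if getErr != nil {
		if isNotFound(getErr) {
			// Persist ConditionReasonMCPServerEntryGroupRefNotFound here.
			return false, nil
		}
		// Transient failure: surface it for requeue; no condition is written.
		return false, getErr
	}
	return true, nil
}

func main() {
	_, err := validateGroupRef(errTransient)
	fmt.Println("transient error returned for requeue:", err != nil)
}
```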
```diff
@@ -0,0 +1,283 @@
// SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc.
```
No unit tests for this controller. It's 284 lines with non-trivial validation logic (GroupRef, ExternalAuthConfigRef, CABundleRef, phase transitions, watch mapping). The analogous MCPRemoteProxy controller has 14 test functions.
At minimum I'd expect table-driven tests covering: happy path (all valid), each ref missing, optional refs being nil, not-found entry, and status update conflict.
Fair point. Added TestMCPServerEntryReconciler_Reconcile with 8 table-driven subtests covering: happy path (all refs valid), optional refs nil, group not found, group not ready, auth config not found, CA bundle not found, entry not found (no-op), and CA bundle with nil configMapRef. All passing.
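The table-driven shape of those subtests can be sketched without the fake client. `refSet` and `entryValid` below are hypothetical stand-ins that condense the validation outcome each subtest asserts on; the real tests would build `MCPServerEntry`/`MCPGroup` fixtures against a fake client:

```go
package main

import "fmt"

// refSet is a hypothetical stand-in for the referenced objects a fake client
// would hold in each subtest.
type refSet struct {
	groupExists, groupReady, authExists, caExists bool
}

// entryValid condenses the validation outcome: the group must exist and be
// Ready; optional refs only need to resolve when they are set.
func entryValid(refs refSet, hasAuthRef, hasCARef bool) bool {
	if !refs.groupExists || !refs.groupReady {
		return false
	}
	if hasAuthRef && !refs.authExists {
		return false
	}
	if hasCARef && !refs.caExists {
		return false
	}
	return true
}

func main() {
	cases := []struct {
		name           string
		refs           refSet
		hasAuth, hasCA bool
		want           bool
	}{
		{"happy path, all refs valid", refSet{true, true, true, true}, true, true, true},
		{"optional refs nil", refSet{true, true, false, false}, false, false, true},
		{"group not found", refSet{false, false, false, false}, false, false, false},
		{"group not ready", refSet{true, false, false, false}, false, false, false},
		{"auth config not found", refSet{true, true, false, true}, true, true, false},
		{"CA bundle not found", refSet{true, true, true, false}, true, true, false},
	}
	for _, tc := range cases {
		if got := entryValid(tc.refs, tc.hasAuth, tc.hasCA); got != tc.want {
			panic("case failed: " + tc.name)
		}
		fmt.Println("ok:", tc.name)
	}
}
```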
```go
}

// findReferencingMCPServerEntries finds all MCPServerEntries that reference the given MCPGroup
func (r *MCPGroupReconciler) findReferencingMCPServerEntries(
```
The four new MCPServerEntry methods (findReferencingMCPServerEntries, populateEntryStatus, findMCPGroupForMCPServerEntry, updateReferencingEntriesOnDeletion) don't have tests. The analogous MCPRemoteProxy methods all do — e.g., TestMCPGroupReconciler_findReferencingMCPRemoteProxies, TestMCPGroupReconciler_findMCPGroupForMCPRemoteProxy, etc.
Also none of the existing reconciliation tests create MCPServerEntry objects, so Status.EntryCount/Status.Entries are never asserted nonzero, and the deletion cleanup path for entries is exercised only with an empty list.
Added two new test functions:
- `TestMCPGroupReconciler_MCPServerEntryIntegration` that creates a group with 2 entries and asserts `EntryCount=2` and `Entries=["entry1","entry2"]` after reconciliation.
- `TestMCPGroupReconciler_EntryDeletionHandler` that calls `updateReferencingEntriesOnDeletion` and verifies both entries get `GroupRefValidated=False`.
So the non-zero EntryCount path and the deletion cleanup path are both covered now.
Force-pushed 94d87f1 to a57d60d
ChrisJBurns left a comment:
Agent-Assisted Code Review
This review was performed by three specialized agents: kubernetes-expert, code-reviewer, and security-advisor. Each finding is attributed to the agent(s) that identified it.
Summary
| Priority | Count | Description |
|---|---|---|
| Must fix | 3 | Integration test nesting bug, AllowInsecureAnnotation dead code, overly broad RBAC |
| Should fix/defer | 4 | Failed phase never used, unused constant, condition naming inconsistency, missing ConfigMap watch |
| Medium | 4 | SSRF protection, client-side filtering, all-or-nothing status clearing, error info leak |
| Minor | 1 | Test field indexer duplication |
Positive Aspects
- Clean validate-then-update-status pattern with no infrastructure side effects
- Good use of `RequeueAfter: 500ms` for conflict handling (consistent with existing controllers)
- Correct finalizer decision (none needed for validation-only controller)
- `setupGroupRefFieldIndexes` extraction is a nice improvement to `main.go`
- MCPGroup integration faithfully follows established patterns for MCPServer and MCPRemoteProxy
- CRD YAML, Helm charts, RBAC, and API docs are consistent and complete
- Short name `mse` is well-chosen
Generated with Claude Code
```go
// Set up field indexing for MCPServerEntry.Spec.GroupRef
err = k8sManager.GetFieldIndexer().IndexField(
	context.Background(),
```
[CRITICAL] Found by: kubernetes-expert
Bug: MCPServerEntry field indexer is nested inside the MCPRemoteProxy error handler — it will never execute on the happy path.
The MCPServerEntry field indexer registration (lines 125-140) is placed inside the if err != nil block for the MCPRemoteProxy indexer (line 122). This means the MCPServerEntry indexer only runs when the MCPRemoteProxy indexer fails. In the normal (success) path, it is skipped entirely.
The same bug exists in all four integration test suites:
- `mcp-oidc-config/suite_test.go` (here)
- `mcp-remote-proxy/suite_test.go`
- `mcp-server/suite_test.go`
- `virtualmcp/suite_test.go`
Note: mcp-group/suite_test.go has it correct — the indexer is a separate statement outside the if err block.
Fix: Close the if err block (line 123) before the MCPServerEntry indexer, so both run unconditionally:

```go
	}); err != nil {
		Expect(err).ToNot(HaveOccurred())
	}

	// Set up field indexing for MCPServerEntry.Spec.GroupRef
	err = k8sManager.GetFieldIndexer().IndexField(...)
```
Nice catch! Yeah that was a sneaky nesting bug. Fixed in all four suites. The MCPServerEntry field indexer now lives outside the if err block so it runs unconditionally, matching how mcp-group/suite_test.go already had it.
```go
	}); err != nil {
		Expect(err).ToNot(HaveOccurred())

	// Set up field indexing for MCPServerEntry.Spec.GroupRef
```
[CRITICAL] Found by: kubernetes-expert
Same nesting bug as mcp-oidc-config/suite_test.go — the MCPServerEntry field indexer is inside the if err != nil block for the MCPRemoteProxy indexer and will never run on the happy path. See the comment on that file for the fix.
Fixed, same as above.
```go
	}); err != nil {
		Expect(err).ToNot(HaveOccurred())

	// Set up field indexing for MCPServerEntry.Spec.GroupRef
```
[CRITICAL] Found by: kubernetes-expert
Same nesting bug — MCPServerEntry field indexer unreachable on happy path. See mcp-oidc-config/suite_test.go comment for details.
```go
	}); err != nil {
		Expect(err).ToNot(HaveOccurred())

	// Set up field indexing for MCPServerEntry.Spec.GroupRef
```
[CRITICAL] Found by: kubernetes-expert
Same nesting bug — MCPServerEntry field indexer unreachable on happy path. See mcp-oidc-config/suite_test.go comment for details.
```go
	ConditionReasonMCPServerEntryCABundleNotConfigured = "CABundleNotConfigured"
)

// AllowInsecureAnnotation is the annotation key to allow HTTP URLs for development.
```
[CRITICAL] Found by: kubernetes-expert, code-reviewer, security-advisor
All three agents flagged this: `AllowInsecureAnnotation` is defined and documented in the RemoteURL field comment (lines 17-18: "HTTPS is enforced by default; set the annotation...to allow HTTP for development"), but:
- The kubebuilder pattern `^https?://` (line 20) already allows both HTTP and HTTPS at admission time
- The controller never reads or checks this annotation
This creates a false sense of security — users who read the docs will believe HTTPS is enforced when it is not. HTTP URLs are silently accepted regardless of the annotation.
Options:
- (a) Implement the check in the controller: reject HTTP URLs unless the annotation is `"true"`, and set a warning condition when the annotation is present
- (b) Remove the annotation constant and update the RemoteURL doc comment to accurately reflect that URL scheme enforcement is not in scope for this PR
- (c) Document it as a known TODO for a follow-up PR, but update the doc comment to not mislead users in the interim
Went with option (b). Removed the AllowInsecureAnnotation constant entirely and updated the RemoteURL doc comment to just say both HTTP and HTTPS are accepted at admission time. No point having a documented annotation that nothing checks... that's worse than having no annotation at all. If we want actual HTTPS enforcement later, we can add it properly in a follow-up with the controller actually checking.
```go
reason := mcpv1alpha1.ConditionReasonMCPServerEntryGroupRefNotFound
message := "Referenced MCPGroup not found"
if !errors.IsNotFound(err) {
	message = "Failed to get referenced MCPGroup: " + err.Error()
```
[MEDIUM] Found by: security-advisor
For non-NotFound errors, the raw err.Error() is included in the condition message: "Failed to get referenced MCPGroup: " + err.Error(). API server errors could contain internal details (network topology, internal service names, etc.) that are then visible to anyone who can read the MCPServerEntry status.
The same pattern appears in validateExternalAuthConfigRef (line 151) and validateCABundleRef (line 200).
Consider using a generic message for non-NotFound errors and logging the detailed error server-side only:

```go
if !errors.IsNotFound(err) {
	ctxLogger.Error(err, "Failed to get referenced MCPGroup")
	message = "Failed to get referenced MCPGroup"
}
```
Fixed by changing the approach entirely. Transient errors (non-NotFound) now return the error to the controller-runtime so it handles the requeue. No condition gets persisted at all for transient failures, which means no raw err.Error() ends up in status. The detailed error is only logged server-side via ctxLogger.Error().
```go
// List all MCPServerEntries in the same namespace
entryList := &mcpv1alpha1.MCPServerEntryList{}
if err := r.List(ctx, entryList, client.InNamespace(authConfig.Namespace)); err != nil {
```
[MEDIUM] Found by: kubernetes-expert
findEntriesForAuthConfig lists all MCPServerEntries in the namespace and filters client-side. By contrast, findReferencingMCPServerEntries in the MCPGroup controller uses client.MatchingFields with a field index for efficient server-side filtering.
For consistency and scalability, consider adding a field indexer on spec.externalAuthConfigRef.name and using client.MatchingFields here instead.
Done. Added a field index on spec.externalAuthConfigRef.name in SetupWithManager and switched findEntriesForAuthConfig to use client.MatchingFields. Same pattern as the MCPGroup controller's findReferencingMCPServerEntries. Also added a similar field index for spec.caBundleRef.configMapRef.name for the new ConfigMap watch.
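The difference between client-side filtering and an index lookup can be shown with a map standing in for the field index (controller-runtime maintains the real index and `client.MatchingFields{"spec.externalAuthConfigRef.name": name}` queries it; `entryIndex` and the function below are illustrative stand-ins):

```go
package main

import "fmt"

// entryIndex stands in for the field index on spec.externalAuthConfigRef.name:
// auth config name -> names of entries referencing it.
type entryIndex map[string][]string

// findEntriesForAuthConfig returns only the entries referencing the changed
// auth config, instead of listing every entry in the namespace and filtering
// client-side.
func findEntriesForAuthConfig(idx entryIndex, authConfigName string) []string {
	return idx[authConfigName]
}

func main() {
	idx := entryIndex{
		"auth-a": {"entry-1", "entry-3"},
		"auth-b": {"entry-2"},
	}
	fmt.Println(findEntriesForAuthConfig(idx, "auth-a"))
}
```

In the real watch mapping, each returned name would become a `reconcile.Request` for that entry.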
```go
// HTTPS is enforced by default; set the annotation
// toolhive.stacklok.dev/allow-insecure: "true" to allow HTTP for development.
// +kubebuilder:validation:Required
// +kubebuilder:validation:Pattern=`^https?://`
```
[MEDIUM] Found by: security-advisor
The URL pattern `^https?://` only validates the scheme prefix. It allows potentially dangerous URLs:
- Cloud metadata: `http://169.254.169.254/latest/meta-data/`
- Localhost: `http://127.0.0.1:8080/`
- Internal K8s services: `http://kubernetes.default.svc/`
Even though MCPServerEntry creates no infrastructure, the URL is consumed by downstream vMCP components. If any component fetches the URL in-cluster, these become SSRF vectors.
Consider adding validation (either CEL or in the controller) to reject well-known internal/metadata IP ranges, or at minimum document that the URL must point to an externally accessible endpoint.
Valid concern. I'd rather tackle URL validation properly in a follow-up though, since it touches how vMCP consumes these URLs downstream and I don't want to half-do it here. The MCPRemoteProxy CRD has the same `^https?://` pattern today too, so this isn't a regression specific to MCPServerEntry. Worth tracking as a separate issue for both resource types.
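For reference, a controller-side guard of the kind the finding suggests could look roughly like this, using only the Go standard library. This is a sketch under stated assumptions, not the deferred follow-up: it only catches literal IPs in loopback, link-local (which includes the 169.254.169.254 metadata endpoint), and RFC 1918 private ranges, and does not handle hostnames that resolve to internal IPs (that would need a resolve-time check):

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// rejectInternalURL rejects remote URLs whose host is a literal IP in a
// well-known internal range. Hostname-based SSRF (e.g. kubernetes.default.svc)
// is out of scope for this sketch.
func rejectInternalURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	ip := net.ParseIP(u.Hostname())
	if ip == nil {
		return nil // not a literal IP; would need DNS resolution to check
	}
	if ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsPrivate() || ip.IsUnspecified() {
		return fmt.Errorf("remoteURL %q points at an internal address", raw)
	}
	return nil
}

func main() {
	for _, u := range []string{
		"http://169.254.169.254/latest/meta-data/",
		"http://127.0.0.1:8080/",
		"https://mcp.example.com/",
	} {
		fmt.Println(u, "->", rejectInternalURL(u))
	}
}
```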
```diff
 })

-// Clear both resource types' status fields to avoid stale data when entering Failed state
+// Clear all resource types' status fields to avoid stale data when entering Failed state
```
[MEDIUM] Found by: kubernetes-expert
If listing MCPServerEntries fails but listing MCPServers and MCPRemoteProxies succeeded, all three counts/lists are zeroed (lines 169-174). The already-retrieved server and proxy data is discarded.
This means a transient failure listing entries causes the MCPGroup to report zero servers too, which could confuse operators and monitoring. Consider only clearing the status for the resource type that failed, or document why the all-or-nothing approach was chosen.
The all-or-nothing approach is intentional here. If any list operation fails, we don't know the actual membership state, so reporting partial data could be more misleading than reporting zero. Imagine the entry list fails but we report "2 servers, 0 entries" when there should be 5 entries... that partial view could cause downstream consumers to make bad routing decisions.
The Failed phase + the condition message saying which resource type failed to list should give operators enough signal to know what went wrong. But I'm open to changing this if you feel strongly about it.
```go
WithIndex(&mcpv1alpha1.MCPServerEntry{}, "spec.groupRef", func(obj client.Object) []string {
	mcpServerEntry := obj.(*mcpv1alpha1.MCPServerEntry)
	if mcpServerEntry.Spec.GroupRef == "" {
		return nil
```
[MINOR] Found by: kubernetes-expert
The same 7-line MCPServerEntry field indexer block is copy-pasted into 11 test function builders in this file. Combined with the 5 integration test suites, that's 16 copies of the same code. A shared test helper like addGroupRefFieldIndexes(builder) would significantly reduce duplication and maintenance burden.
Agreed, there's a lot of copy-paste here. I'll track this as a follow-up to extract a shared addGroupRefFieldIndexes(builder) helper. Didn't want to refactor the existing test infrastructure in this PR since it's already touching a bunch of files.
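A minimal sketch of the deduplication, assuming the helper name `addGroupRefFieldIndexes` from the thread. The types and registry are local stand-ins; the real helper would take a `*fake.ClientBuilder` and call `WithIndex` with the `mcpv1alpha1` types, so the 7-line extractor lives in one place instead of 16:

```go
package main

import "fmt"

// Simplified stand-in for the real type (spec nesting flattened).
type MCPServerEntry struct{ GroupRef string }

// groupRefExtractor is the single shared copy of the index function the
// call sites could reuse instead of re-declaring the closure inline.
func groupRefExtractor(e *MCPServerEntry) []string {
	if e.GroupRef == "" {
		return nil
	}
	return []string{e.GroupRef}
}

// addGroupRefFieldIndexes registers the shared extractor; a registry map
// stands in here for the fake client builder's WithIndex call.
func addGroupRefFieldIndexes(registry map[string]func(*MCPServerEntry) []string) {
	registry["spec.groupRef"] = groupRefExtractor
}

func main() {
	registry := map[string]func(*MCPServerEntry) []string{}
	addGroupRefFieldIndexes(registry)
	fmt.Println(registry["spec.groupRef"](&MCPServerEntry{GroupRef: "team-a"}))
}
```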
Introduce a validation-only controller for MCPServerEntry that checks referenced resources (MCPGroup, MCPExternalAuthConfig, CA bundle ConfigMap) exist and sets status conditions accordingly. The controller never creates infrastructure or probes remote URLs.

Integrate MCPServerEntry into the MCPGroup controller so groups track entry membership in Entries/EntryCount status fields, handle entry deletion notifications, and watch for entry changes.

Wire up field indexing for MCPServerEntry.Spec.GroupRef and register the controller in the operator's server controller setup. Extract field indexer setup into setupGroupRefFieldIndexes to keep setupServerControllers under the cyclomatic complexity limit.

Refs: #4656

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uites

The Chainsaw E2E tests assert on the exact ClusterRole contents, and the integration test suites need field indexers for all types that reference MCPGroup via spec.groupRef.

Add mcpserverentries and mcpserverentries/status to the RBAC golden files for both multi-tenancy and single-tenancy Chainsaw test setups. Register MCPServerEntry field indexer in all five integration test suite BeforeSuite blocks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix RBAC: reduce mcpserverentries verbs to get;list;watch (validation-only)
- Fix field indexer nesting bug in 4 integration test suites where MCPServerEntry indexer was unreachable on the happy path
- Add MCPGroup Ready phase check in validateGroupRef, matching MCPServer/MCPRemoteProxy behavior
- Return transient errors for requeue instead of persisting misleading NotFound conditions; sanitize error messages in status conditions
- Add watches for MCPGroup and ConfigMap changes to re-validate entries
- Use field indexes for auth config and CA bundle ConfigMap lookups
- Use Failed phase when validations fail (Pending is initial-only)
- Remove unused AllowInsecureAnnotation constant and misleading doc
- Align condition type names with shared constants from mcpserver_types
- Add unit tests for MCPServerEntry controller and MCPGroup entry methods
- Regenerate RBAC manifests and CRD API docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed 6fcd0e8 to 3062bb6
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
```markdown
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation
```

Alternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Summary
MCPServerEntry needs a controller to validate that referenced resources exist and to integrate with MCPGroup for discovery. This is the second PR in the RFC-55 implementation series, building on #4662 (CRD types).
- Extract `setupGroupRefFieldIndexes` to stay under cyclomatic complexity limits

Refs #4656
Type of change
Test plan
- `task test` — operator unit tests pass (including updated MCPGroup controller tests with MCPServerEntry field index)
- `task lint-fix` — 0 issues
- `task operator-manifests`

Changes
- `cmd/thv-operator/controllers/mcpserverentry_controller.go`
- `cmd/thv-operator/controllers/mcpgroup_controller.go`
- `cmd/thv-operator/controllers/mcpgroup_controller_test.go`
- `cmd/thv-operator/main.go` — `setupGroupRefFieldIndexes`, add MCPServerEntry controller setup
- `deploy/charts/operator/templates/clusterrole/role.yaml`

Special notes for reviewers
The `Reconcile()` duplicate log was removed — `updateGroupMemberStatus()` already logs on success.

Generated with Claude Code