Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
53 changes: 12 additions & 41 deletions cmd/thv-operator/DESIGN.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,40 +29,17 @@ When building operators, the decision of when to use a `podTemplateSpec` and whe

### Status Management Design

**Decision**: Use batched status updates via StatusCollector pattern instead of individual field updates.
**Decision**: Use standard Kubernetes workload status pattern matching MCPServer — flat `Phase` + `Ready` condition + `ReadyReplicas` + `URL`.

**Rationale**:
- Prevents race conditions between multiple status updates
- Reduces API server load with fewer update calls
- Ensures consistent status across reconciliation cycles
- Handles resource version conflicts gracefully
- Consistency with MCPServer and standard Kubernetes workload patterns
- Enables `kubectl wait --for=condition=Ready` and standard monitoring
- The operator only needs to track deployment readiness, not internal registry server state
- Tracking internal sync/API states would require the operator to call the registry server, which with auth enabled is not feasible

**Implementation**: StatusCollector interface collects all changes and applies them atomically.
**Implementation**: Controller sets `Phase`, `Message`, `URL`, `ReadyReplicas`, and a `Ready` condition directly based on the API deployment's readiness. The latest resource version is refetched before status updates to avoid conflicts.

### Sync Operation Design

**Decision**: Separate sync decision logic from sync execution with clear interfaces.

**Rationale**:
- Testability: Mock sync decisions independently from execution
- Flexibility: Different sync strategies without changing core logic
- Maintainability: Clear separation of concerns

**Key Patterns**:
- Idempotent operations for safe retry
- Manual vs automatic sync distinction
- Data preservation on failures

### Storage Architecture

**Decision**: Abstract storage via StorageManager interface with ConfigMap as default implementation.

**Rationale**:
- Future flexibility: Easy addition of new storage backends (OCI, databases)
- Testability: Mock storage for unit tests
- Consistency: Single interface for all storage operations

**Current Implementation**: ConfigMap-based with owner references for automatic cleanup.
**History**: The original design used a `StatusCollector` pattern (`mcpregistrystatus` package) that batched status changes from multiple independent sources — an `APIStatusCollector` for deployment state and originally a sync collector — then applied them atomically via a single `Status().Update()`. A `StatusDeriver` computed the overall phase from sub-phases (`SyncPhase` + `APIPhase` → `MCPRegistryPhase`). This was removed because with sync operations moved to the registry server itself, only one status source remained (deployment readiness), making the batching/derivation indirection unnecessary. The new approach produces the same number of API server calls with less abstraction.

### Registry API Service Pattern

Expand All @@ -78,27 +55,21 @@ When building operators, the decision of when to use a `podTemplateSpec` and whe

### Error Handling Strategy

**Decision**: Structured error types with progressive retry backoff.
**Decision**: Structured error types (`registryapi.Error`) with condition metadata.

**Rationale**:
- Different error types need different handling strategies
- Progressive backoff prevents thundering herd problems
- Structured errors enable better observability
- Structured errors carry `ConditionReason` for setting Kubernetes conditions with specific failure reasons (e.g., `ConfigMapFailed`, `DeploymentFailed`)
- Enables better observability via condition reasons

**Implementation**: 5m initial retry, exponential backoff with cap, manual sync bypass.
**Implementation**: `registryapi.Error` carries `ConditionReason` and `Message`. The controller uses `errors.As` to extract structured fields when available, falling back to generic `NotReady` reason for unstructured errors.

### Performance Design Decisions

#### Resource Optimization
- **Status Updates**: Batched to reduce API calls (implemented)
- **Source Fetching**: Planned caching to avoid repeated downloads
- **Status Updates**: Single refetch-then-update per reconciliation cycle
- **API Deployment**: Lazy creation only when needed (implemented)

#### Memory Management
- **Git Operations**: Shallow clones to minimize disk usage (implemented)
- **Large Registries**: Stream processing planned for future
- **Status Objects**: Efficient field-level updates (implemented)

### Security Architecture

#### Permission Model
Expand Down
223 changes: 22 additions & 201 deletions cmd/thv-operator/api/v1alpha1/mcpregistry_types.go
Original file line number Diff line number Diff line change
Expand Up @@ -492,203 +492,66 @@ type MCPRegistryOAuthProviderConfig struct {

// MCPRegistryStatus defines the observed state of MCPRegistry
type MCPRegistryStatus struct {
// ObservedGeneration reflects the generation most recently observed by the controller
// +optional
ObservedGeneration int64 `json:"observedGeneration,omitempty"`

// Phase represents the current overall phase of the MCPRegistry
// Derived from sync and API status
// +optional
Phase MCPRegistryPhase `json:"phase,omitempty"`

// Message provides additional information about the current phase
// +optional
Message string `json:"message,omitempty"`

// SyncStatus provides detailed information about data synchronization
// +optional
SyncStatus *SyncStatus `json:"syncStatus,omitempty"`

// APIStatus provides detailed information about the API service
// +optional
APIStatus *APIStatus `json:"apiStatus,omitempty"`

// LastAppliedFilterHash is the hash of the last applied filter
// +optional
LastAppliedFilterHash string `json:"lastAppliedFilterHash,omitempty"`

// StorageRef is a reference to the internal storage location
// +optional
StorageRef *StorageReference `json:"storageRef,omitempty"`

// LastManualSyncTrigger tracks the last processed manual sync annotation value
// Used to detect new manual sync requests via toolhive.stacklok.dev/sync-trigger annotation
// +optional
LastManualSyncTrigger string `json:"lastManualSyncTrigger,omitempty"`

// Conditions represent the latest available observations of the MCPRegistry's state
// +optional
// +listType=map
// +listMapKey=type
Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// SyncStatus provides detailed information about data synchronization
type SyncStatus struct {
// Phase represents the current synchronization phase
// +kubebuilder:validation:Enum=Syncing;Complete;Failed
Phase SyncPhase `json:"phase"`

// Message provides additional information about the sync status
// +optional
Message string `json:"message,omitempty"`

// LastAttempt is the timestamp of the last sync attempt
// +optional
LastAttempt *metav1.Time `json:"lastAttempt,omitempty"`

// AttemptCount is the number of sync attempts since last success
// +optional
// +kubebuilder:validation:Minimum=0
AttemptCount int32 `json:"attemptCount,omitempty"`

// LastSyncTime is the timestamp of the last successful sync
// +optional
LastSyncTime *metav1.Time `json:"lastSyncTime,omitempty"`
Conditions []metav1.Condition `json:"conditions,omitempty"`

// LastSyncHash is the hash of the last successfully synced data
// Used to detect changes in source data
// ObservedGeneration reflects the generation most recently observed by the controller
// +optional
LastSyncHash string `json:"lastSyncHash,omitempty"`
ObservedGeneration int64 `json:"observedGeneration,omitempty"`

// ServerCount is the total number of servers in the registry
// Phase represents the current overall phase of the MCPRegistry
// +optional
// +kubebuilder:validation:Minimum=0
ServerCount int32 `json:"serverCount,omitempty"`
}

// APIStatus provides detailed information about the API service
type APIStatus struct {
// Phase represents the current API service phase
// +kubebuilder:validation:Enum=NotStarted;Deploying;Ready;Unhealthy;Error
Phase APIPhase `json:"phase"`
Phase MCPRegistryPhase `json:"phase,omitempty"`

// Message provides additional information about the API status
// Message provides additional information about the current phase
// +optional
Message string `json:"message,omitempty"`

// Endpoint is the URL where the API is accessible
// URL is the URL where the registry API can be accessed
// +optional
Endpoint string `json:"endpoint,omitempty"`
URL string `json:"url,omitempty"`

// ReadySince is the timestamp when the API became ready
// ReadyReplicas is the number of ready registry API replicas
// +optional
ReadySince *metav1.Time `json:"readySince,omitempty"`
}

// SyncPhase represents the data synchronization state
// +kubebuilder:validation:Enum=Syncing;Complete;Failed
type SyncPhase string

const (
// SyncPhaseSyncing means sync is currently in progress
SyncPhaseSyncing SyncPhase = "Syncing"

// SyncPhaseComplete means sync completed successfully
SyncPhaseComplete SyncPhase = "Complete"

// SyncPhaseFailed means sync failed
SyncPhaseFailed SyncPhase = "Failed"
)

// APIPhase represents the API service state
// +kubebuilder:validation:Enum=NotStarted;Deploying;Ready;Unhealthy;Error
type APIPhase string

const (
// APIPhaseNotStarted means API deployment has not been created
APIPhaseNotStarted APIPhase = "NotStarted"

// APIPhaseDeploying means API is being deployed
APIPhaseDeploying APIPhase = "Deploying"

// APIPhaseReady means API is ready to serve requests
APIPhaseReady APIPhase = "Ready"

// APIPhaseUnhealthy means API is deployed but not healthy
APIPhaseUnhealthy APIPhase = "Unhealthy"

// APIPhaseError means API deployment failed
APIPhaseError APIPhase = "Error"
)

// StorageReference defines a reference to internal storage
type StorageReference struct {
// Type is the storage type (configmap)
// +kubebuilder:validation:Enum=configmap
Type string `json:"type"`

// ConfigMapRef is a reference to a ConfigMap storage
// Only used when Type is "configmap"
// +optional
ConfigMapRef *corev1.LocalObjectReference `json:"configMapRef,omitempty"`
ReadyReplicas int32 `json:"readyReplicas,omitempty"`
}

// MCPRegistryPhase represents the phase of the MCPRegistry
// +kubebuilder:validation:Enum=Pending;Ready;Failed;Syncing;Terminating
// +kubebuilder:validation:Enum=Pending;Running;Failed;Terminating
type MCPRegistryPhase string

const (
// MCPRegistryPhasePending means the MCPRegistry is being initialized
MCPRegistryPhasePending MCPRegistryPhase = "Pending"

// MCPRegistryPhaseReady means the MCPRegistry is ready and operational
MCPRegistryPhaseReady MCPRegistryPhase = "Ready"
// MCPRegistryPhaseRunning means the MCPRegistry is running and operational
MCPRegistryPhaseRunning MCPRegistryPhase = "Running"

// MCPRegistryPhaseFailed means the MCPRegistry has failed
MCPRegistryPhaseFailed MCPRegistryPhase = "Failed"

// MCPRegistryPhaseSyncing means the MCPRegistry is currently syncing data
MCPRegistryPhaseSyncing MCPRegistryPhase = "Syncing"

// MCPRegistryPhaseTerminating means the MCPRegistry is being deleted
MCPRegistryPhaseTerminating MCPRegistryPhase = "Terminating"
)

// Condition types for MCPRegistry
// Condition reasons for MCPRegistry
const (
// ConditionSourceAvailable indicates whether the source is available and accessible
ConditionSourceAvailable = "SourceAvailable"
// ConditionReasonRegistryReady indicates the MCPRegistry is ready
ConditionReasonRegistryReady = "Ready"

// ConditionDataValid indicates whether the registry data is valid
ConditionDataValid = "DataValid"

// ConditionSyncSuccessful indicates whether the last sync was successful
ConditionSyncSuccessful = "SyncSuccessful"

// ConditionAPIReady indicates whether the registry API is ready
ConditionAPIReady = "APIReady"

// ConditionRegistryPodTemplateValid indicates whether the PodTemplateSpec is valid
ConditionRegistryPodTemplateValid = "PodTemplateValid"
)

// Condition reasons for MCPRegistry PodTemplateSpec validation
const (
// ConditionReasonRegistryPodTemplateValid indicates PodTemplateSpec validation succeeded
ConditionReasonRegistryPodTemplateValid = "ValidPodTemplateSpec"

// ConditionReasonRegistryPodTemplateInvalid indicates PodTemplateSpec validation failed
ConditionReasonRegistryPodTemplateInvalid = "InvalidPodTemplateSpec"
// ConditionReasonRegistryNotReady indicates the MCPRegistry is not ready
ConditionReasonRegistryNotReady = "NotReady"
)

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status
//+kubebuilder:printcolumn:name="Phase",type="string",JSONPath=".status.phase"
//+kubebuilder:printcolumn:name="Sync",type="string",JSONPath=".status.syncStatus.phase"
//+kubebuilder:printcolumn:name="API",type="string",JSONPath=".status.apiStatus.phase"
//+kubebuilder:printcolumn:name="Servers",type="integer",JSONPath=".status.syncStatus.serverCount"
//+kubebuilder:printcolumn:name="Last Sync",type="date",JSONPath=".status.syncStatus.lastSyncTime"
//+kubebuilder:printcolumn:name="Status",type="string",JSONPath=".status.phase"
//+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.conditions[?(@.type=='Ready')].status"
//+kubebuilder:printcolumn:name="Replicas",type="integer",JSONPath=".status.readyReplicas"
//+kubebuilder:printcolumn:name="URL",type="string",JSONPath=".status.url"
//+kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
//+kubebuilder:resource:shortName=mcpreg;registry,scope=Namespaced,categories=toolhive
//nolint:lll
Expand Down Expand Up @@ -722,48 +585,6 @@ func (r *MCPRegistry) GetAPIResourceName() string {
return fmt.Sprintf("%s-api", r.Name)
}

// DeriveOverallPhase determines the overall MCPRegistry phase based on sync and API status
func (r *MCPRegistry) DeriveOverallPhase() MCPRegistryPhase {
syncStatus := r.Status.SyncStatus
apiStatus := r.Status.APIStatus

// Default phases if status not set
var syncPhase SyncPhase
if syncStatus != nil {
syncPhase = syncStatus.Phase
}

apiPhase := APIPhaseNotStarted
if apiStatus != nil {
apiPhase = apiStatus.Phase
}

// If sync failed, overall is Failed
if syncPhase == SyncPhaseFailed {
return MCPRegistryPhaseFailed
}

// If sync in progress, overall is Syncing
if syncPhase == SyncPhaseSyncing {
return MCPRegistryPhaseSyncing
}

// If sync is complete (no sync needed), check API status
if syncPhase == SyncPhaseComplete {
switch apiPhase {
case APIPhaseReady:
return MCPRegistryPhaseReady
case APIPhaseError:
return MCPRegistryPhaseFailed
case APIPhaseNotStarted, APIPhaseDeploying, APIPhaseUnhealthy:
return MCPRegistryPhasePending // API still starting/not healthy
}
}

// Default to pending for initial states
return MCPRegistryPhasePending
}

func init() {
SchemeBuilder.Register(&MCPRegistry{}, &MCPRegistryList{})
}
Expand Down
Loading
Loading