HYPERFLEET-896 - doc: Config Driven Generic Resource API Design #122

rh-amarin wants to merge 1 commit into openshift-hyperfleet:main
Conversation
Walkthrough

A new design document, "Generic Resource Registry," was added. It specifies a descriptor-driven, contract-first API that consolidates many managed entities into a single generic `Resource` type.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Router
    participant ResourceHandler
    participant ResourceService
    participant ResourceDao
    participant DB
    Client->>Router: HTTP request (e.g., POST /resources?type=Cluster)
    Router->>ResourceHandler: dispatch based on route/descriptor
    ResourceHandler->>ResourceService: validate & apply descriptor rules
    ResourceService->>ResourceDao: persist/query by resource type
    ResourceDao->>DB: write/read resources + labels/conditions
    DB-->>ResourceDao: result
    ResourceDao-->>ResourceService: result
    ResourceService-->>ResourceHandler: build response
    ResourceHandler-->>Router: HTTP response
    Router-->>Client: HTTP response
```

```mermaid
sequenceDiagram
    participant Startup
    participant DescriptorLoader
    participant RouteRegistrar
    participant Router
    Startup->>DescriptorLoader: load EntityDescriptors
    DescriptorLoader-->>RouteRegistrar: descriptors
    RouteRegistrar->>Router: register top-level & nested route trees (CRUD, /statuses)
    Router-->>Startup: routes registered
```
Actionable comments posted: 7
🧹 Nitpick comments (1)
hyperfleet/docs/generic-resource-registry-design.md (1)
703-706: Add error handling for JSON unmarshaling in PresentResource.

The code silently ignores the error from `json.Unmarshal(r.Spec, &spec)`. While the `Spec` field is stored as GORM `datatypes.JSON` and should contain valid JSON, defensive error handling would improve robustness and make debugging easier if data corruption occurs. Consider logging the error or returning a fallback value:

```go
var spec map[string]interface{}
if err := json.Unmarshal(r.Spec, &spec); err != nil {
	log.Warnf("failed to unmarshal spec for resource %s: %v", r.ID, err)
	spec = make(map[string]interface{}) // or return error to caller
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hyperfleet/docs/generic-resource-registry-design.md` around lines 703 - 706, In PresentResource, handle the error returned by json.Unmarshal(r.Spec, &spec) instead of ignoring it: after calling json.Unmarshal on r.Spec, check the error and either log a warning/error that includes the resource identifier (e.g., r.ID) and the error details, then set spec = make(map[string]interface{}) as a safe fallback, or propagate/return the error to the caller; update the PresentResource function to perform this check and logging around json.Unmarshal to make failures observable and prevent nil/spec panics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hyperfleet/docs/generic-resource-registry-design.md`:
- Line 9: Change the misspelled word "altternatives" to "alternatives" in the
introduction of the Generic Resource Registry proposal (look for the line
containing "This document contains the proposal for the Generic Resource
Registry and altternatives" in the generic-resource-registry-design.md file and
correct the typo).
- Around line 1309-1396: The doc exposes a security risk: authorization is
opt-in via EntityDescriptor.Authz (EntityAuthzConfig) so new entities default to
no checks; update the design by (1) adding a note in §11 documenting
default-allow risk and audit burden, (2) specifying a registry validation step
that inspects registrations (registry.Register/EntityDescriptor) and emits a
warning or fails if Authz == nil for critical types, and (3) defining a
configuration toggle (e.g., EnforceAuthForAllEntities) that, when enabled, makes
the registry validation error instead of warn and requires OperationPermissions
be present (ResourceCheck optional).
- Around line 406-438: The Replace method in sqlResourceDao reads
existing.Generation then writes back without checking the row hasn't changed,
allowing lost increments under concurrent Replace calls; modify the Save step in
Replace to perform an optimistic-locking conditional update by adding a
Where("generation = ?", existing.Generation) before
Omit("Labels","Conditions").Save(resource) and detect a failed update (e.g.,
gorm.ErrRecordNotFound or RowsAffected == 0) to mark the transaction for
rollback and return a conflict error (e.g., errors.Conflict("resource was
modified concurrently")), keeping the existing db.MarkForRollback(ctx, err)
behavior for other errors.
- Around line 1240-1258: Add three new risk subsections after "R2 — Loss of
domain modeling flexibility": create "R3 — Schema evolution and backward
compatibility" documenting how to add/remove EntityDescriptor fields,
deprecation/versioning strategy, and schema migration guidance; "R4 — Security
considerations" listing checks for SpecSchemaName existence, label validation,
authorization gaps, and audit logging requirements; and "R5 — Debugging and
observability" covering entity-specific error messages, metrics/logging tagging,
and tracing. For each new risk include a short remediation paragraph (e.g.,
validation checks, descriptor versioning, RBAC/audit policy, and observability
tags) mirroring the style of the existing "R1" and "R2" sections so they
integrate consistently into the Risks section.
- Around line 572-574: registry.Validate() currently only checks ParentType
references but must also ensure descriptor.Plural values are unique to prevent
ambiguous routing in RegisterEntityRoutes (which uses descriptor.Plural as the
path segment). Update Validate() to iterate over All() descriptors, track seen
plurals in a map[string]string, and panic with a descriptive message including
both descriptor.Type values when a duplicate Plural is found (e.g., "duplicate
Plural %q: used by both %q and %q"). Ensure the new check runs alongside the
existing ParentType validation.
- Around line 918-940: The Delete method ignores errors from s.dao.CountByOwner
and s.dao.FindByTypeAndOwner which can hide DB failures and cause incorrect
delete behavior; update sqlResourceService.Delete to capture the returned error
from CountByOwner and if non-nil return it (wrap via handleDeleteError or
convert to *errors.ServiceError) instead of assuming count==0, and likewise
capture and return any error from FindByTypeAndOwner before iterating children
so cascade deletion only runs when the query succeeded; keep the existing call
to handleDeleteError(s.dao.Delete(...)) for the final delete.
- Around line 607-641: Update the design text to explicitly state
owner-reference immutability and how Replace behaves: specify that OwnerID,
OwnerType, and OwnerHref are immutable once created and that resources.href is
computed once from id, Plural and the resolved parent chain and will not change;
clarify that ResourcePatchRequest intentionally excludes owner fields and that
Replace cannot modify owner references (it updates other mutable resource fields
but must reject any incoming OwnerID/OwnerType/OwnerHref changes), and note that
the service will validate and return an error if an update attempts to change
owner fields to preserve href correctness and enforce the constraint described
in §10.6.
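The optimistic-locking fix suggested for `Replace` can be illustrated with an in-memory analogue. This is a sketch, not the service's actual code: `resource`, `replaceIfUnchanged`, and `errConflict` are hypothetical names, and the in-memory generation check stands in for the suggested `Where("generation = ?", existing.Generation)` clause plus the `RowsAffected == 0` conflict detection.

```go
package main

import (
	"errors"
	"fmt"
)

// resource is a minimal stand-in for the stored row; only the fields
// relevant to the concurrency check are modeled.
type resource struct {
	Generation int64
	Spec       string
}

var errConflict = errors.New("resource was modified concurrently")

// replaceIfUnchanged writes newSpec only if the stored generation still
// equals the generation the caller read earlier. In SQL this corresponds to
// UPDATE ... SET generation = generation + 1 WHERE id = ? AND generation = ?,
// with zero rows affected signalling the conflict.
func replaceIfUnchanged(store map[string]*resource, id string, readGen int64, newSpec string) error {
	r, ok := store[id]
	if !ok {
		return errors.New("not found")
	}
	if r.Generation != readGen {
		return errConflict // a concurrent Replace won the race
	}
	r.Generation = readGen + 1
	r.Spec = newSpec
	return nil
}

func main() {
	store := map[string]*resource{"c1": {Generation: 5, Spec: "a"}}

	// First writer read generation 5 and succeeds.
	fmt.Println(replaceIfUnchanged(store, "c1", 5, "b")) // <nil>
	// Second writer also read generation 5; its stale update is rejected.
	fmt.Println(replaceIfUnchanged(store, "c1", 5, "c")) // resource was modified concurrently
	fmt.Println(store["c1"].Generation)                  // 6
}
```

Either writer losing the race gets a conflict error it can surface as HTTP 409, which is the behavior the review comment asks the design to specify.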
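The duplicate-`Plural` check requested for `registry.Validate()` is plain map bookkeeping. A hedged sketch follows; `entityDescriptor` and `validatePlurals` are hypothetical trimmed-down names, and the design text calls for a panic at registration time, whereas this sketch returns an error just to keep it testable.

```go
package main

import "fmt"

// entityDescriptor is a trimmed, hypothetical version of the design's
// EntityDescriptor carrying only the fields the uniqueness check needs.
type entityDescriptor struct {
	Type   string
	Plural string
}

// validatePlurals reports an error naming both descriptor types when two
// descriptors share a Plural, which would make route registration ambiguous
// (Plural is used as the URL path segment).
func validatePlurals(descriptors []entityDescriptor) error {
	seen := map[string]string{} // Plural -> Type that claimed it
	for _, d := range descriptors {
		if prev, ok := seen[d.Plural]; ok {
			return fmt.Errorf("duplicate Plural %q: used by both %q and %q", d.Plural, prev, d.Type)
		}
		seen[d.Plural] = d.Type
	}
	return nil
}

func main() {
	ok := []entityDescriptor{{"Cluster", "clusters"}, {"NodePool", "nodepools"}}
	fmt.Println(validatePlurals(ok)) // <nil>

	dup := []entityDescriptor{{"Cluster", "clusters"}, {"LegacyCluster", "clusters"}}
	fmt.Println(validatePlurals(dup))
}
```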
---
Nitpick comments:
In `@hyperfleet/docs/generic-resource-registry-design.md`:
- Around line 703-706: In PresentResource, handle the error returned by
json.Unmarshal(r.Spec, &spec) instead of ignoring it: after calling
json.Unmarshal on r.Spec, check the error and either log a warning/error that
includes the resource identifier (e.g., r.ID) and the error details, then set
spec = make(map[string]interface{}) as a safe fallback, or propagate/return the
error to the caller; update the PresentResource function to perform this check
and logging around json.Unmarshal to make failures observable and prevent
nil/spec panics.
```go
case registry.OnParentDeleteCascade:
	children, _ := s.dao.FindByTypeAndOwner(ctx, child.Type, id)
	for _, c := range children {
		if err := s.Delete(ctx, c.Type, c.ID); err != nil {
```
This recursive s.Delete call runs DFS within a single transaction-per-request. For wide or deep hierarchies (e.g., a Cluster with hundreds of NodePools, each owning children), this can exceed the request timeout — the same problem acknowledged for "Alternative B — Cascade always" in §10.2.
The pull-secret-service DDR already uses background jobs and batched processing for heavy operations (reconciler jobs, batched re-encryption). Should large cascades follow that same pattern — async job with progress tracking instead of synchronous inline recursion?
My take on this is:
- First we will implement the simplest solution
- Then we measure to find what numbers may break the feature
- If required, all owned objects could be updated in a single query, since they are on the same table, and it would be cheap
No background jobs please. We should avoid these at all costs. It's a slippery slope to end up like CS with workers/controllers.
Also agree with Angel's comment, but just want to take a hard stance on background jobs. They should always be our last and only resort IMO
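The related review finding (the cascade path above discards the error from `FindByTypeAndOwner` via `children, _ := ...`) can be sketched with a stubbed DAO. This is an illustrative analogue, not the service code: `childLister`, `cascadeDelete`, and `failingDao` are hypothetical names, and the point is only that cascade deletion runs solely when the child lookup succeeded.

```go
package main

import (
	"errors"
	"fmt"
)

type child struct{ Type, ID string }

// childLister is a hypothetical slice of the DAO interface: only the
// lookup the cascade path needs.
type childLister interface {
	FindByTypeAndOwner(childType, ownerID string) ([]child, error)
}

// cascadeDelete surfaces the lookup error instead of discarding it with
// `children, _ := ...`, so a DB failure cannot masquerade as "no children".
func cascadeDelete(dao childLister, childType, ownerID string, deleteOne func(child) error) error {
	children, err := dao.FindByTypeAndOwner(childType, ownerID)
	if err != nil {
		return fmt.Errorf("listing %s children of %s: %w", childType, ownerID, err)
	}
	for _, c := range children {
		if err := deleteOne(c); err != nil {
			return err
		}
	}
	return nil
}

// failingDao simulates a DB failure on lookup.
type failingDao struct{}

func (failingDao) FindByTypeAndOwner(string, string) ([]child, error) {
	return nil, errors.New("db unavailable")
}

func main() {
	err := cascadeDelete(failingDao{}, "NodePool", "cluster-1", func(child) error { return nil })
	fmt.Println(err != nil) // true: the failure propagates instead of silently deleting nothing
}
```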
```yaml
spec: { type: object, additionalProperties: true }
labels: { type: object, additionalProperties: { type: string } }
```

```yaml
ResourceList:
```
ResourceList uses offset-based pagination (page + size). The pull-secret-service DDR explicitly chose cursor-based pagination and documents why:
"Pagination Strategy: Cursor-based pagination (not offset-based). Why cursor-based: Prevents missing/duplicate results when new rotations are created during pagination"
Since all entity types now share a single resources table (higher write concurrency than per-type tables), the offset-based approach is more susceptible to the same skipped/duplicated results problem. Is the divergence from the cursor-based pattern intentional? If so, worth documenting the rationale. If not, consider aligning with the established pattern.
IMO we can accept the risk of the skipped/duplicated results problem.

My rationale is that the HyperFleet API does not create/delete elements at high frequency scoped to a cluster or a search of related clusters.

What operations do we expect to list resources?
- Listing clusters of a user
- Listing child objects of a user (e.g. nodepools)
- statuses

IMO, none of these operations will be impacted by the update frequency of the shared resources table. One customer looking at a paginated search of their cluster objects won't be affected by another customer creating/destroying other clusters.

But this topic now begs the question:
- Should we allow an unrestricted `GET /api/hyperfleet/v1/resources` operation in the API?
- What can be the purpose of listing all entities at root level, mixing clusters/nodepools/others in the same API call?
- Should we always add a restriction by `?kind=xxx`?
- Should we remove the endpoint completely?

We can go with it in a first version:
- It can be useful for administrative purposes and debugging
- It shouldn't be accessed by external customers
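For reference, the cursor-based alternative the reviewer cites is keyset pagination: page boundaries are fixed by a `(created_at, id)` cursor instead of a row offset, so concurrent inserts or deletes elsewhere in the table cannot shift a page. A minimal in-memory sketch follows; `row` and `pageAfter` are hypothetical names, and a real implementation would express the same predicate in SQL (`WHERE (created_at, id) > (?, ?) ORDER BY created_at, id LIMIT ?`).

```go
package main

import (
	"fmt"
	"sort"
)

type row struct {
	CreatedAt int64 // creation timestamp; ties broken by ID
	ID        string
}

// pageAfter returns up to size rows strictly after the cursor in
// (CreatedAt, ID) order. Unlike OFFSET-based paging, rows inserted or
// deleted elsewhere cannot cause this page to skip or duplicate results.
func pageAfter(rows []row, cursor row, size int) []row {
	sort.Slice(rows, func(i, j int) bool {
		if rows[i].CreatedAt != rows[j].CreatedAt {
			return rows[i].CreatedAt < rows[j].CreatedAt
		}
		return rows[i].ID < rows[j].ID
	})
	out := []row{}
	for _, r := range rows {
		after := r.CreatedAt > cursor.CreatedAt ||
			(r.CreatedAt == cursor.CreatedAt && r.ID > cursor.ID)
		if after {
			out = append(out, r)
			if len(out) == size {
				break
			}
		}
	}
	return out
}

func main() {
	rows := []row{{1, "a"}, {1, "b"}, {2, "c"}, {3, "d"}}
	page := pageAfter(rows, row{CreatedAt: 1, ID: "a"}, 2)
	fmt.Println(page) // [{1 b} {2 c}]
}
```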
> `UpdateConditions` — called exclusively by the status aggregation path. Replaces all condition rows for the resource atomically.
>
> `Replace` and `UpdateConditions` are on separate code paths deliberately: `Replace` is called by user-initiated spec/label updates; `UpdateConditions` is called only by the status aggregation pipeline. This separation keeps user-writable and system-computed state from interfering.
The separation of Replace and UpdateConditions into different code paths is a good design. However, the doc doesn't discuss what happens when they execute concurrently:

1. `Replace` reads `Generation=5`, increments to `6`, writes back
2. `UpdateConditions` reads `Generation=5` (before the Replace commit), writes `ObservedGeneration=5`
3. Now `ObservedGeneration` (5) < `Generation` (6), even though the condition was computed after the spec change

Is this temporary inconsistency acceptable? If so, it's worth stating explicitly. If not, does UpdateConditions need to re-read Generation inside its own transaction?
I've removed some code examples from the design, as they are more on the implementation phase than design.
But the value for the updated generation should stay consistent within the whole transaction
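The observable consequence of the interleaving in this thread is a staleness predicate that callers can evaluate either way: conditions are current only when `ObservedGeneration == Generation`. A tiny hedged sketch, with `resourceStatus` and `isStale` as hypothetical names:

```go
package main

import "fmt"

type resourceStatus struct {
	Generation         int64 // bumped by Replace on each spec change
	ObservedGeneration int64 // written by UpdateConditions from the generation it read
}

// isStale reports whether the recorded conditions may predate the latest
// spec. After the race described above (Replace commits generation 6 while
// UpdateConditions writes ObservedGeneration=5), this reports true even
// though the conditions were computed after the spec change, which is the
// temporary inconsistency the reviewer asks the doc to address.
func isStale(s resourceStatus) bool {
	return s.ObservedGeneration < s.Generation
}

func main() {
	fmt.Println(isStale(resourceStatus{Generation: 6, ObservedGeneration: 5})) // true
	fmt.Println(isStale(resourceStatus{Generation: 6, ObservedGeneration: 6})) // false
}
```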
```go
// Authz defines per-operation authorization configuration.
// nil = no authorization checks (current default for all entities).
// Cannot be expressed in the config file — set in Go for entities requiring custom authz logic.
```
This comment says Authz "cannot be expressed in the config file". But if authorization becomes mandatory for all entity types (likely in a multi-tenant system), then every entity will need Go code — breaking the core promise of "no Go code required for standard entities".
Could OperationPermissions at least be expressible in config YAML? The permission strings are plain data, not logic:
```yaml
entities:
  - type: Cluster
    authz:
      operationPermissions:
        GET: hyperfleet.clusters.view
        POST: hyperfleet.clusters.create
```

ResourceCheck would remain Go-only since it's a function, but OperationPermissions is just a `map[string]string` — no reason it can't live in config.
I added this option to the Appendix B, we will leave the implementation for a later stage though
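Since the per-operation permissions proposed above are plain data, the lookup side reduces to a map access. A hedged sketch, with `entityAuthzConfig` and `requiredPermission` as hypothetical names; the nil-config case is made explicit because the thread flags nil-means-no-checks as the risky default:

```go
package main

import "fmt"

// entityAuthzConfig mirrors the shape proposed above: a plain
// HTTP-method -> permission-string map, expressible in YAML; any
// function-valued ResourceCheck stays Go-only via Register().
type entityAuthzConfig struct {
	OperationPermissions map[string]string
}

// requiredPermission returns the permission for an operation and whether one
// is configured. Absence (including a nil config) means the caller must
// choose the default, allow or deny, explicitly rather than silently.
func requiredPermission(c *entityAuthzConfig, method string) (string, bool) {
	if c == nil {
		return "", false
	}
	p, ok := c.OperationPermissions[method]
	return p, ok
}

func main() {
	cfg := &entityAuthzConfig{OperationPermissions: map[string]string{
		"GET":  "hyperfleet.clusters.view",
		"POST": "hyperfleet.clusters.create",
	}}
	p, ok := requiredPermission(cfg, "GET")
	fmt.Println(p, ok) // hyperfleet.clusters.view true
	_, ok = requiredPermission(nil, "GET")
	fmt.Println(ok) // false
}
```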
> ### 5.4 Href generation
>
> Each resource carries a relative `href` field that uniquely identifies it within the API. The href is computed once at creation time by the service layer using the entity's descriptor and its resolved parent chain and stored in the `resources.href` column. It is never recomputed after creation because the `id` (UUID) and the `Plural` path segment are both stable for the lifetime of the resource.
"It is never recomputed after creation" — this works as long as three invariants hold:

- Resources are never reparented (moved to a different owner)
- The `Plural` path segment of an entity type never changes
- The ownership hierarchy is stable

If any of these change in the future, all child hrefs become stale with no recalculation mechanism. Consider explicitly documenting these as system invariants so future design decisions don't accidentally violate them. Also worth noting: what happens if reparenting is ever needed? Is it a "delete + recreate" operation?
I think these possibilities are very unlikely to happen, so IMO we should not complicate the design with "exotic" invariants
If we ever need to change resource, plural or ownership, the future task will have to deal with the migration
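The create-time href computation being discussed is a straightforward path join over the resolved parent chain. A hedged sketch follows; `pathSegment`, `buildHref`, and the `/api/hyperfleet/v1` prefix placement are illustrative assumptions, not the design's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// pathSegment pairs a descriptor's Plural with a concrete resource ID.
type pathSegment struct {
	Plural string
	ID     string
}

// buildHref joins the resolved parent chain (outermost owner first) ending
// with the resource's own Plural and ID. Computed once at creation and
// stored; stable because IDs and Plural segments never change.
func buildHref(apiPrefix string, chain []pathSegment) string {
	parts := []string{strings.TrimRight(apiPrefix, "/")}
	for _, s := range chain {
		parts = append(parts, s.Plural, s.ID)
	}
	return strings.Join(parts, "/")
}

func main() {
	href := buildHref("/api/hyperfleet/v1", []pathSegment{
		{Plural: "clusters", ID: "c-123"},
		{Plural: "nodepools", ID: "np-456"},
	})
	fmt.Println(href) // /api/hyperfleet/v1/clusters/c-123/nodepools/np-456
}
```

This also makes the invariants in the thread concrete: changing a `Plural` or reparenting a resource would invalidate every stored href built through that segment.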
ciaranRoche left a comment
Overall I like the direction you took in this 😄 my main concern I called out inline, just to explicitly call out how we ensure openapi structs and config stay in sync. Prob worth a discussion with the group
```go
Type string `gorm:"column:type;size:100;not null"`
Kind string `gorm:"column:kind;size:100;not null"`
```
For my own clarity, what is the difference between Type and Kind?
```go
// nil = no authorization checks (current default for all entities).
// OperationPermissions is expressible in the config file (plain string map).
// ResourceCheck is Go-only (it is a function); entities requiring it must be registered via Register().
Authz *EntityAuthzConfig
```
Prob best to remove this, let's remove inherited auth as part of this work.
I don't want to include this until it becomes a hard requirement
> - Con: Dynamic route registration is fundamentally incompatible with gorilla/mux, which builds its routing tree at startup; routes cannot be added at runtime without restarting the router
> - Con: Adds a hard dependency on the Kubernetes API and `controller-runtime` or `client-go` — significant operational complexity
> - Con: Entity type changes become a cluster operation rather than a code change; harder to test locally
> - Con: Duplication of OpenAPI schemas. Providers already have their OpenAPI schema for their external API but now they have to break the types into CRDs and keep them aligned.
I don't think this con is correct.
I think we need to have a conversation here. I am not arguing for or against the current design, but I think there is a silent gap in it: how do we ensure the OpenAPI spec and the config stay in sync?
We get this for 'free' with CRDs as it's a single config, at the cost of our users having to translate to CRD.
Regardless, we should document it, to ensure we have test cases covering that there are no silent failures if the config matching fails
> There is no caller-supplied query parameter. The API surface is simply:
>
> ```
> DELETE /clusters/{id} → 204 No Content (NodePools cascade)
> ```
204 No Content? What is this OCM 😆
I could be wrong but I think it's a 202 with the intent captured on the resource that it is marked for deletion
Summary
- Adds a design document (`hyperfleet/docs/generic-resource-registry-design.md`)
- Introduces a single generic `Resource` type + declarative `EntityDescriptor` registry
- Covers storage (shared `resources` table), spec validation via JSON Schema, authorization config, delete semantics, and trade-offs

Context
HYPERFLEET-896. Currently adding a new entity type (e.g. NodePool) requires duplicated handler, DAO, and route code plus a coordinated deploy. This design eliminates that by letting teams register new entities via a descriptor at startup — routes, validation, status aggregation, and delete behavior are derived automatically.