HYPERFLEET-896 - doc: Config Driven Generic Resource API Design by rh-amarin · Pull Request #122 · openshift-hyperfleet/architecture

rh-amarin · 2026-04-09T16:46:56Z

Summary

Adds design document for the CRD-Driven Generic Resource API (hyperfleet/docs/generic-resource-registry-design.md)
Proposes replacing per-entity handler code with a single Resource type + declarative EntityDescriptor registry
Covers OpenAPI contract, data layer (GORM + single resources table), spec validation via JSON Schema, authorization config, delete semantics, and trade-offs

Context

HYPERFLEET-896. Currently adding a new entity type (e.g. NodePool) requires duplicated handler, DAO, and route code plus a coordinated deploy. This design eliminates that by letting teams register new entities via a descriptor at startup — routes, validation, status aggregation, and delete behavior are derived automatically.

Summary by CodeRabbit

Documentation
- Added a draft design for a Generic Resource Registry: a contract-first API with a single generic Resource schema using a type discriminator; descriptor-driven generation of top-level and nested routes with CRUD and /statuses endpoints; consolidated data-layer approach and separate handling for labels and status conditions; descriptor-led parent-delete semantics (cascade vs restrict); outlines breaking-change scope, alternatives, tradeoffs, and risks.

openshift-ci · 2026-04-09T16:47:02Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2026-04-09T16:47:04Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign pnguyen44 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-04-09T16:47:05Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: aadd3e38-fa42-47d6-a4b2-c314a105300b

📥 Commits

Reviewing files that changed from the base of the PR and between 5e30455 and 4858a83.

📒 Files selected for processing (1)

hyperfleet/docs/generic-resource-registry-design.md

✅ Files skipped from review due to trivial changes (1)

hyperfleet/docs/generic-resource-registry-design.md

Walkthrough

A new design document, "Generic Resource Registry," was added. It specifies a descriptor-driven, contract-first API that consolidates many managed entities into a single generic Resource model (discriminator type, untyped spec at API surface, entity-specific OpenAPI component schemas for validation). It describes autogenerated top-level and nested CRUD and /statuses routes from EntityDescriptors, a single resources table with separate label and condition tables, consolidated handlers/DAOs/services, and descriptor-driven cascade/restrict deletion semantics.

Sequence Diagram(s)

sequenceDiagram
participant Client
participant Router
participant ResourceHandler
participant ResourceService
participant ResourceDao
participant DB
Client->>Router: HTTP request (e.g., POST /resources?type=Cluster)
Router->>ResourceHandler: dispatch based on route/descriptor
ResourceHandler->>ResourceService: validate & apply descriptor rules
ResourceService->>ResourceDao: persist/query by resource type
ResourceDao->>DB: write/read resources + labels/conditions
DB-->>ResourceDao: result
ResourceDao-->>ResourceService: result
ResourceService-->>ResourceHandler: build response
ResourceHandler-->>Router: HTTP response
Router-->>Client: HTTP response

sequenceDiagram
participant Startup
participant DescriptorLoader
participant RouteRegistrar
participant Router
Startup->>DescriptorLoader: load EntityDescriptors
DescriptorLoader-->>RouteRegistrar: descriptors
RouteRegistrar->>Router: register top-level & nested route trees (CRUD, /statuses)
Router-->>Startup: routes registered

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: adding a design document for a config-driven generic resource API architecture, which is the core objective of this PR.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

🧹 Nitpick comments (1)

hyperfleet/docs/generic-resource-registry-design.md (1)
703-706: Add error handling for JSON unmarshaling in PresentResource.

The code silently ignores the error from json.Unmarshal(r.Spec, &spec). While the Spec field is stored as GORM datatypes.JSON and should contain valid JSON, defensive error handling would improve robustness and make debugging easier if data corruption occurs.

Consider logging the error or returning a fallback value:
var spec map[string]interface{}
if err := json.Unmarshal(r.Spec, &spec); err != nil {
    log.Warnf("failed to unmarshal spec for resource %s: %v", r.ID, err)
    spec = make(map[string]interface{}) // or return error to caller
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/docs/generic-resource-registry-design.md` around lines 703 - 706,
In PresentResource, handle the error returned by json.Unmarshal(r.Spec, &spec)
instead of ignoring it: after calling json.Unmarshal on r.Spec, check the error
and either log a warning/error that includes the resource identifier (e.g.,
r.ID) and the error details, then set spec = make(map[string]interface{}) as a
safe fallback, or propagate/return the error to the caller; update the
PresentResource function to perform this check and logging around json.Unmarshal
to make failures observable and prevent nil/spec panics.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hyperfleet/docs/generic-resource-registry-design.md`:
- Line 9: Change the misspelled word "altternatives" to "alternatives" in the
introduction of the Generic Resource Registry proposal (look for the line
containing "This document contains the proposal for the Generic Resource
Registry and altternatives" in the generic-resource-registry-design.md file and
correct the typo).
- Around line 1309-1396: The doc exposes a security risk: authorization is
opt-in via EntityDescriptor.Authz (EntityAuthzConfig) so new entities default to
no checks; update the design by (1) adding a note in §11 documenting
default-allow risk and audit burden, (2) specifying a registry validation step
that inspects registrations (registry.Register/EntityDescriptor) and emits a
warning or fails if Authz == nil for critical types, and (3) defining a
configuration toggle (e.g., EnforceAuthForAllEntities) that, when enabled, makes
the registry validation error instead of warn and requires OperationPermissions
be present (ResourceCheck optional).
- Around line 406-438: The Replace method in sqlResourceDao reads
existing.Generation then writes back without checking the row hasn't changed,
allowing lost increments under concurrent Replace calls; modify the Save step in
Replace to perform an optimistic-locking conditional update by adding a
Where("generation = ?", existing.Generation) before
Omit("Labels","Conditions").Save(resource) and detect a failed update (e.g.,
gorm.ErrRecordNotFound or RowsAffected == 0) to mark the transaction for
rollback and return a conflict error (e.g., errors.Conflict("resource was
modified concurrently")), keeping the existing db.MarkForRollback(ctx, err)
behavior for other errors.
- Around line 1240-1258: Add three new risk subsections after "R2 — Loss of
domain modeling flexibility": create "R3 — Schema evolution and backward
compatibility" documenting how to add/remove EntityDescriptor fields,
deprecation/versioning strategy, and schema migration guidance; "R4 — Security
considerations" listing checks for SpecSchemaName existence, label validation,
authorization gaps, and audit logging requirements; and "R5 — Debugging and
observability" covering entity-specific error messages, metrics/logging tagging,
and tracing. For each new risk include a short remediation paragraph (e.g.,
validation checks, descriptor versioning, RBAC/audit policy, and observability
tags) mirroring the style of the existing "R1" and "R2" sections so they
integrate consistently into the Risks section.
- Around line 572-574: registry.Validate() currently only checks ParentType
references but must also ensure descriptor.Plural values are unique to prevent
ambiguous routing in RegisterEntityRoutes (which uses descriptor.Plural as the
path segment). Update Validate() to iterate over All() descriptors, track seen
plurals in a map[string]string, and panic with a descriptive message including
both descriptor.Type values when a duplicate Plural is found (e.g., "duplicate
Plural %q: used by both %q and %q"). Ensure the new check runs alongside the
existing ParentType validation.
- Around line 918-940: The Delete method ignores errors from s.dao.CountByOwner
and s.dao.FindByTypeAndOwner which can hide DB failures and cause incorrect
delete behavior; update sqlResourceService.Delete to capture the returned error
from CountByOwner and if non-nil return it (wrap via handleDeleteError or
convert to *errors.ServiceError) instead of assuming count==0, and likewise
capture and return any error from FindByTypeAndOwner before iterating children
so cascade deletion only runs when the query succeeded; keep the existing call
to handleDeleteError(s.dao.Delete(...)) for the final delete.
- Around line 607-641: Update the design text to explicitly state
owner-reference immutability and how Replace behaves: specify that OwnerID,
OwnerType, and OwnerHref are immutable once created and that resources.href is
computed once from id, Plural and the resolved parent chain and will not change;
clarify that ResourcePatchRequest intentionally excludes owner fields and that
Replace cannot modify owner references (it updates other mutable resource fields
but must reject any incoming OwnerID/OwnerType/OwnerHref changes), and note that
the service will validate and return an error if an update attempts to change
owner fields to preserve href correctness and enforce the constraint described
in §10.6.

---

Nitpick comments:
In `@hyperfleet/docs/generic-resource-registry-design.md`:
- Around line 703-706: In PresentResource, handle the error returned by
json.Unmarshal(r.Spec, &spec) instead of ignoring it: after calling
json.Unmarshal on r.Spec, check the error and either log a warning/error that
includes the resource identifier (e.g., r.ID) and the error details, then set
spec = make(map[string]interface{}) as a safe fallback, or propagate/return the
error to the caller; update the PresentResource function to perform this check
and logging around json.Unmarshal to make failures observable and prevent
nil/spec panics.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6e1cf04f-6d26-4d8a-8415-3a107600bc1c

📥 Commits

Reviewing files that changed from the base of the PR and between 967e5eb and c4aa85d.

📒 Files selected for processing (1)

hyperfleet/docs/generic-resource-registry-design.md

rafabene · 2026-04-10T13:08:53Z

+        case registry.OnParentDeleteCascade:
+            children, _ := s.dao.FindByTypeAndOwner(ctx, child.Type, id)
+            for _, c := range children {
+                if err := s.Delete(ctx, c.Type, c.ID); err != nil {


This recursive s.Delete call runs DFS within a single transaction-per-request. For wide or deep hierarchies (e.g., a Cluster with hundreds of NodePools, each owning children), this can exceed the request timeout — the same problem acknowledged for "Alternative B — Cascade always" in §10.2.

The pull-secret-service DDR already uses background jobs and batched processing for heavy operations (reconciler jobs, batched re-encryption). Should large cascades follow that same pattern — async job with progress tracking instead of synchronous inline recursion?

My take on this is:

First we will implement the simplest solution

Then we measure to find what numbers may break the feature

If required, all owned objects could be updated in a single query, since they are on the same table, and it would be cheap

No background jobs please. We should avoid these at all costs. Its a slippery slope to end up like CS with workers/controllers.

Also agree with Angel's comment, but just want to take a hard stance on background jobs. They should always be our last and only resort IMO

rafabene · 2026-04-10T13:09:04Z

+    spec:   { type: object, additionalProperties: true }
+    labels: { type: object, additionalProperties: { type: string } }
+
+ResourceList:


ResourceList uses offset-based pagination (page + size). The pull-secret-service DDR explicitly chose cursor-based pagination and documents why:

"Pagination Strategy: Cursor-based pagination (not offset-based). Why cursor-based: Prevents missing/duplicate results when new rotations are created during pagination"

Since all entity types now share a single resources table (higher write concurrency than per-type tables), the offset-based approach is more susceptible to the same skipped/duplicated results problem. Is the divergence from the cursor-based pattern intentional? If so, worth documenting the rationale. If not, consider aligning with the established pattern.

IMO we can accept the risk of skipped/duplicated problem.

My rationale is that

HyperFleet API does not create/delete elements at high frequency scoped to a cluster or a search of related clusters

What operations do we expect to list resources?

Listing clusters of a user

Listing child objects of a user (e.g. nodepools)

statuses

IMO, none of these operations will be impacted by the update frequency of the shared resources table. One customer looking at a paginated search of their cluster objects won't be affected by another customer creating/destroying other clusters

But, this topic now begs the question

Should we allow unrestricted GET /api/hyperfleet/v1/resources operation in the API?

What can be the purpose of listing all entities at root level? mixing clusters/nodepools/others in the same API call?

Should we add a restriction by ?kind=xxx always?

Should we remove completely the endpoint?

We can go with it in a first version:

It can be useful for administrative purposes and debugging

It shouldn't be accessed by external customers

rafabene · 2026-04-10T13:09:17Z

+
+`UpdateConditions` — called exclusively by the status aggregation path. Replaces all condition rows for the resource atomically.
+
+`Replace` and `UpdateConditions` are on separate code paths deliberately: `Replace` is called by user-initiated spec/label updates; `UpdateConditions` is called only by the status aggregation pipeline. This separation keeps user-writable and system-computed state from interfering.


The separation of Replace and UpdateConditions into different code paths is a good design. However, the doc doesn't discuss what happens when they execute concurrently:

Replace reads Generation=5, increments to 6, writes back

UpdateConditions reads Generation=5 (before the Replace commit), writes ObservedGeneration=5

Now ObservedGeneration (5) < Generation (6), even though the condition was computed after the spec change

Is this temporary inconsistency acceptable? If so, it's worth stating explicitly. If not, does UpdateConditions need to re-read Generation inside its own transaction?

I've removed some code examples from the design, as they are more on the implementation phase than design.
But the value for the updated generation should stay consistent within the whole transaction

rafabene · 2026-04-10T13:09:27Z

+
+    // Authz defines per-operation authorization configuration.
+    // nil = no authorization checks (current default for all entities).
+    // Cannot be expressed in the config file — set in Go for entities requiring custom authz logic.


This comment says Authz "cannot be expressed in the config file". But if authorization becomes mandatory for all entity types (likely in a multi-tenant system), then every entity will need Go code — breaking the core promise of "no Go code required for standard entities".

Could OperationPermissions at least be expressible in config YAML? The permission strings are plain data, not logic:

entities: - type: Cluster authz: operationPermissions: GET: hyperfleet.clusters.view POST: hyperfleet.clusters.create

ResourceCheck would remain Go-only since it's a function, but OperationPermissions is just a map[string]string — no reason it can't live in config.

I added this option to the Appendix B, we will leave the implementation for a later stage though

rafabene · 2026-04-10T13:09:40Z

+
+### 5.4 Href generation
+
+Each resource carries a relative `href` field that uniquely identifies it within the API. The href is computed once at creation time by the service layer using the entity's descriptor and its resolved parent chain and stored in the `resources.href` column. It is never recomputed after creation because the `id` (UUID) and the `Plural` path segment are both stable for the lifetime of the resource.


"It is never recomputed after creation" — this works as long as three invariants hold:

Resources are never reparented (moved to a different owner)

The Plural path segment of an entity type never changes

The ownership hierarchy is stable

If any of these change in the future, all child hrefs become stale with no recalculation mechanism. Consider explicitly documenting these as system invariants so future design decisions don't accidentally violate them. Also worth noting: what happens if reparenting is ever needed? Is it a "delete + recreate" operation?

I think these possibilities are very unlikely to happen, so IMO we should not complicate the design with "exotic" invariants

If we ever need to change resource, plural or ownership, the future task will have to deal with the migration

ciaranRoche

Overall I like the direction you took in this 😄 my main concern I called out inline, just to explicitly call out how we ensure openapi structs and config stay in sync. Prob worth a discussion with the group

ciaranRoche · 2026-04-13T07:11:42Z

+    Type       string         `gorm:"column:type;size:100;not null"`
+    Kind       string         `gorm:"column:kind;size:100;not null"`


For my own clarity, what is the difference between Type and Kind?

ciaranRoche · 2026-04-13T07:35:03Z

+    // nil = no authorization checks (current default for all entities).
+    // OperationPermissions is expressible in the config file (plain string map).
+    // ResourceCheck is Go-only (it is a function); entities requiring it must be registered via Register().
+    Authz *EntityAuthzConfig


Prob best to remove this, lets remove inherited auth as part of this work.

I dont want to include this until it becomes a hard requirement

ciaranRoche · 2026-04-13T07:44:19Z

+- Con : Dynamic route registration is fundamentally incompatible with gorilla/mux, which builds its routing tree at startup; routes cannot be added at runtime without restarting the router
+- Con : Adds a hard dependency on the Kubernetes API and `controller-runtime` or `client-go` — significant operational complexity
+- Con : Entity type changes become a cluster operation rather than a code change; harder to test locally
+- Con : Duplication of OpenAPI schemas. Providers already have their OpenAPI schema for their external API but now they have to break the types into CRDs and keep them aligned.


I dont think this con is correct.

I think we need to have a conversation here, I am not arguing for or against the current design, but I think there is a silent gap in it, that is how do we ensure OpenAPI spec and Config stay in sync?

We get this for 'free' with CRD's as its a single config, comes at the cost of our users having to translate to CRD.

Regardless, we should document it, to ensure we have test cases to cover there are no silent failures if the config matching fails

ciaranRoche · 2026-04-13T07:49:00Z

+There is no caller-supplied query parameter. The API surface is simply:
+
+```
+DELETE /clusters/{id} → 204 No Content  (NodePools cascade)


204 No Content? What is this OCM 😆

I could be wrong but think its a 202 with the intent captured on the resource that it is marked for deletion

ciaranRoche · 2026-04-13T07:51:06Z

+        case registry.OnParentDeleteCascade:
+            children, _ := s.dao.FindByTypeAndOwner(ctx, child.Type, id)
+            for _, c := range children {
+                if err := s.Delete(ctx, c.Type, c.ID); err != nil {


No background jobs please. We should avoid these at all costs. Its a slippery slope to end up like CS with workers/controllers.

Also agree with Angel's comment, but just want to take a hard stance on background jobs. They should always be our last and only resort IMO

openshift-ci bot added the do-not-merge/work-in-progress label Apr 9, 2026

rh-amarin force-pushed the HYPERFLEET-896 branch 2 times, most recently from 99bdd5a to c4aa85d Compare April 10, 2026 06:22

rh-amarin marked this pull request as ready for review April 10, 2026 06:23

openshift-ci bot removed the do-not-merge/work-in-progress label Apr 10, 2026

openshift-ci bot requested review from aredenba-rh and pnguyen44 April 10, 2026 06:23

coderabbitai bot reviewed Apr 10, 2026

View reviewed changes

rh-amarin force-pushed the HYPERFLEET-896 branch from c4aa85d to 5e30455 Compare April 10, 2026 08:10

rh-amarin changed the title ~~HYPERFLEET-896 - doc: CRD-Driven Generic Resource API Design~~ HYPERFLEET-896 - doc:Config Driven Generic Resource API Design Apr 10, 2026

rh-amarin changed the title ~~HYPERFLEET-896 - doc:Config Driven Generic Resource API Design~~ HYPERFLEET-896 - doc: Config Driven Generic Resource API Design Apr 10, 2026

rafabene reviewed Apr 10, 2026

View reviewed changes

HYPERFLEET-896 - docs: CRD-Driven Generic Resource API Design

4858a83

rh-amarin force-pushed the HYPERFLEET-896 branch from 5e30455 to 4858a83 Compare April 13, 2026 06:30

ciaranRoche reviewed Apr 13, 2026

View reviewed changes


		`UpdateConditions` — called exclusively by the status aggregation path. Replaces all condition rows for the resource atomically.

		`Replace` and `UpdateConditions` are on separate code paths deliberately: `Replace` is called by user-initiated spec/label updates; `UpdateConditions` is called only by the status aggregation pipeline. This separation keeps user-writable and system-computed state from interfering.


		### 5.4 Href generation

		Each resource carries a relative `href` field that uniquely identifies it within the API. The href is computed once at creation time by the service layer using the entity's descriptor and its resolved parent chain and stored in the `resources.href` column. It is never recomputed after creation because the `id` (UUID) and the `Plural` path segment are both stable for the lifetime of the resource.

		Type string `gorm:"column:type;size:100;not null"`
		Kind string `gorm:"column:kind;size:100;not null"`

Conversation

rh-amarin commented Apr 9, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Context

Summary by CodeRabbit

Uh oh!

openshift-ci bot commented Apr 9, 2026

Uh oh!

openshift-ci bot commented Apr 9, 2026

Uh oh!

coderabbitai bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Sequence Diagram(s)

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rh-amarin Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ciaranRoche left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rh-amarin commented Apr 9, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 9, 2026 •

edited

Loading

rh-amarin Apr 13, 2026 •

edited

Loading