Skip to content

HYPERFLEET-896 - doc: Config Driven Generic Resource API Design#122

Open
rh-amarin wants to merge 1 commit intoopenshift-hyperfleet:mainfrom
rh-amarin:HYPERFLEET-896
Open

HYPERFLEET-896 - doc: Config Driven Generic Resource API Design#122
rh-amarin wants to merge 1 commit intoopenshift-hyperfleet:mainfrom
rh-amarin:HYPERFLEET-896

Conversation

@rh-amarin
Copy link
Copy Markdown
Collaborator

@rh-amarin rh-amarin commented Apr 9, 2026

Summary

  • Adds design document for the CRD-Driven Generic Resource API (hyperfleet/docs/generic-resource-registry-design.md)
  • Proposes replacing per-entity handler code with a single Resource type + declarative EntityDescriptor registry
  • Covers OpenAPI contract, data layer (GORM + single resources table), spec validation via JSON Schema, authorization config, delete semantics, and trade-offs

Context

HYPERFLEET-896. Currently adding a new entity type (e.g. NodePool) requires duplicated handler, DAO, and route code plus a coordinated deploy. This design eliminates that by letting teams register new entities via a descriptor at startup — routes, validation, status aggregation, and delete behavior are derived automatically.

Summary by CodeRabbit

  • Documentation
    • Added a draft design for a Generic Resource Registry: a contract-first API with a single generic Resource schema using a type discriminator; descriptor-driven generation of top-level and nested routes with CRUD and /statuses endpoints; consolidated data-layer approach and separate handling for labels and status conditions; descriptor-led parent-delete semantics (cascade vs restrict); outlines breaking-change scope, alternatives, tradeoffs, and risks.

@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Apr 9, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Apr 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign pnguyen44 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: aadd3e38-fa42-47d6-a4b2-c314a105300b

📥 Commits

Reviewing files that changed from the base of the PR and between 5e30455 and 4858a83.

📒 Files selected for processing (1)
  • hyperfleet/docs/generic-resource-registry-design.md
✅ Files skipped from review due to trivial changes (1)
  • hyperfleet/docs/generic-resource-registry-design.md

Walkthrough

A new design document, "Generic Resource Registry," was added. It specifies a descriptor-driven, contract-first API that consolidates many managed entities into a single generic Resource model (discriminator type, untyped spec at API surface, entity-specific OpenAPI component schemas for validation). It describes autogenerated top-level and nested CRUD and /statuses routes from EntityDescriptors, a single resources table with separate label and condition tables, consolidated handlers/DAOs/services, and descriptor-driven cascade/restrict deletion semantics.

Sequence Diagram(s)

sequenceDiagram
participant Client
participant Router
participant ResourceHandler
participant ResourceService
participant ResourceDao
participant DB
Client->>Router: HTTP request (e.g., POST /resources?type=Cluster)
Router->>ResourceHandler: dispatch based on route/descriptor
ResourceHandler->>ResourceService: validate & apply descriptor rules
ResourceService->>ResourceDao: persist/query by resource type
ResourceDao->>DB: write/read resources + labels/conditions
DB-->>ResourceDao: result
ResourceDao-->>ResourceService: result
ResourceService-->>ResourceHandler: build response
ResourceHandler-->>Router: HTTP response
Router-->>Client: HTTP response
Loading
sequenceDiagram
participant Startup
participant DescriptorLoader
participant RouteRegistrar
participant Router
Startup->>DescriptorLoader: load EntityDescriptors
DescriptorLoader-->>RouteRegistrar: descriptors
RouteRegistrar->>Router: register top-level & nested route trees (CRUD, /statuses)
Router-->>Startup: routes registered
Loading

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding a design document for a config-driven generic resource API architecture, which is the core objective of this PR.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@rh-amarin rh-amarin force-pushed the HYPERFLEET-896 branch 2 times, most recently from 99bdd5a to c4aa85d Compare April 10, 2026 06:22
@rh-amarin rh-amarin marked this pull request as ready for review April 10, 2026 06:23
@openshift-ci openshift-ci bot requested review from aredenba-rh and pnguyen44 April 10, 2026 06:23
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (1)
hyperfleet/docs/generic-resource-registry-design.md (1)

703-706: Add error handling for JSON unmarshaling in PresentResource.

The code silently ignores the error from json.Unmarshal(r.Spec, &spec). While the Spec field is stored as GORM datatypes.JSON and should contain valid JSON, defensive error handling would improve robustness and make debugging easier if data corruption occurs.

Consider logging the error or returning a fallback value:

var spec map[string]interface{}
if err := json.Unmarshal(r.Spec, &spec); err != nil {
    log.Warnf("failed to unmarshal spec for resource %s: %v", r.ID, err)
    spec = make(map[string]interface{}) // or return error to caller
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hyperfleet/docs/generic-resource-registry-design.md` around lines 703 - 706,
In PresentResource, handle the error returned by json.Unmarshal(r.Spec, &spec)
instead of ignoring it: after calling json.Unmarshal on r.Spec, check the error
and either log a warning/error that includes the resource identifier (e.g.,
r.ID) and the error details, then set spec = make(map[string]interface{}) as a
safe fallback, or propagate/return the error to the caller; update the
PresentResource function to perform this check and logging around json.Unmarshal
to make failures observable and prevent nil/spec panics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hyperfleet/docs/generic-resource-registry-design.md`:
- Line 9: Change the misspelled word "altternatives" to "alternatives" in the
introduction of the Generic Resource Registry proposal (look for the line
containing "This document contains the proposal for the Generic Resource
Registry and altternatives" in the generic-resource-registry-design.md file and
correct the typo).
- Around line 1309-1396: The doc exposes a security risk: authorization is
opt-in via EntityDescriptor.Authz (EntityAuthzConfig) so new entities default to
no checks; update the design by (1) adding a note in §11 documenting
default-allow risk and audit burden, (2) specifying a registry validation step
that inspects registrations (registry.Register/EntityDescriptor) and emits a
warning or fails if Authz == nil for critical types, and (3) defining a
configuration toggle (e.g., EnforceAuthForAllEntities) that, when enabled, makes
the registry validation error instead of warn and requires OperationPermissions
be present (ResourceCheck optional).
- Around line 406-438: The Replace method in sqlResourceDao reads
existing.Generation then writes back without checking the row hasn't changed,
allowing lost increments under concurrent Replace calls; modify the Save step in
Replace to perform an optimistic-locking conditional update by adding a
Where("generation = ?", existing.Generation) before
Omit("Labels","Conditions").Save(resource) and detect a failed update (e.g.,
gorm.ErrRecordNotFound or RowsAffected == 0) to mark the transaction for
rollback and return a conflict error (e.g., errors.Conflict("resource was
modified concurrently")), keeping the existing db.MarkForRollback(ctx, err)
behavior for other errors.
- Around line 1240-1258: Add three new risk subsections after "R2 — Loss of
domain modeling flexibility": create "R3 — Schema evolution and backward
compatibility" documenting how to add/remove EntityDescriptor fields,
deprecation/versioning strategy, and schema migration guidance; "R4 — Security
considerations" listing checks for SpecSchemaName existence, label validation,
authorization gaps, and audit logging requirements; and "R5 — Debugging and
observability" covering entity-specific error messages, metrics/logging tagging,
and tracing. For each new risk include a short remediation paragraph (e.g.,
validation checks, descriptor versioning, RBAC/audit policy, and observability
tags) mirroring the style of the existing "R1" and "R2" sections so they
integrate consistently into the Risks section.
- Around line 572-574: registry.Validate() currently only checks ParentType
references but must also ensure descriptor.Plural values are unique to prevent
ambiguous routing in RegisterEntityRoutes (which uses descriptor.Plural as the
path segment). Update Validate() to iterate over All() descriptors, track seen
plurals in a map[string]string, and panic with a descriptive message including
both descriptor.Type values when a duplicate Plural is found (e.g., "duplicate
Plural %q: used by both %q and %q"). Ensure the new check runs alongside the
existing ParentType validation.
- Around line 918-940: The Delete method ignores errors from s.dao.CountByOwner
and s.dao.FindByTypeAndOwner which can hide DB failures and cause incorrect
delete behavior; update sqlResourceService.Delete to capture the returned error
from CountByOwner and if non-nil return it (wrap via handleDeleteError or
convert to *errors.ServiceError) instead of assuming count==0, and likewise
capture and return any error from FindByTypeAndOwner before iterating children
so cascade deletion only runs when the query succeeded; keep the existing call
to handleDeleteError(s.dao.Delete(...)) for the final delete.
- Around line 607-641: Update the design text to explicitly state
owner-reference immutability and how Replace behaves: specify that OwnerID,
OwnerType, and OwnerHref are immutable once created and that resources.href is
computed once from id, Plural and the resolved parent chain and will not change;
clarify that ResourcePatchRequest intentionally excludes owner fields and that
Replace cannot modify owner references (it updates other mutable resource fields
but must reject any incoming OwnerID/OwnerType/OwnerHref changes), and note that
the service will validate and return an error if an update attempts to change
owner fields to preserve href correctness and enforce the constraint described
in §10.6.

---

Nitpick comments:
In `@hyperfleet/docs/generic-resource-registry-design.md`:
- Around line 703-706: In PresentResource, handle the error returned by
json.Unmarshal(r.Spec, &spec) instead of ignoring it: after calling
json.Unmarshal on r.Spec, check the error and either log a warning/error that
includes the resource identifier (e.g., r.ID) and the error details, then set
spec = make(map[string]interface{}) as a safe fallback, or propagate/return the
error to the caller; update the PresentResource function to perform this check
and logging around json.Unmarshal to make failures observable and prevent
nil/spec panics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6e1cf04f-6d26-4d8a-8415-3a107600bc1c

📥 Commits

Reviewing files that changed from the base of the PR and between 967e5eb and c4aa85d.

📒 Files selected for processing (1)
  • hyperfleet/docs/generic-resource-registry-design.md

Comment thread hyperfleet/docs/generic-resource-registry-design.md Outdated
Comment thread hyperfleet/docs/generic-resource-registry-design.md Outdated
Comment thread hyperfleet/docs/generic-resource-registry-design.md
Comment thread hyperfleet/docs/generic-resource-registry-design.md
Comment thread hyperfleet/docs/generic-resource-registry-design.md
Comment thread hyperfleet/docs/generic-resource-registry-design.md
Comment thread hyperfleet/docs/generic-resource-registry-design.md
@rh-amarin rh-amarin changed the title HYPERFLEET-896 - doc: CRD-Driven Generic Resource API Design HYPERFLEET-896 - doc:Config Driven Generic Resource API Design Apr 10, 2026
@rh-amarin rh-amarin changed the title HYPERFLEET-896 - doc:Config Driven Generic Resource API Design HYPERFLEET-896 - doc: Config Driven Generic Resource API Design Apr 10, 2026
case registry.OnParentDeleteCascade:
children, _ := s.dao.FindByTypeAndOwner(ctx, child.Type, id)
for _, c := range children {
if err := s.Delete(ctx, c.Type, c.ID); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This recursive s.Delete call runs DFS within a single transaction-per-request. For wide or deep hierarchies (e.g., a Cluster with hundreds of NodePools, each owning children), this can exceed the request timeout — the same problem acknowledged for "Alternative B — Cascade always" in §10.2.

The pull-secret-service DDR already uses background jobs and batched processing for heavy operations (reconciler jobs, batched re-encryption). Should large cascades follow that same pattern — async job with progress tracking instead of synchronous inline recursion?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My take on this is:

  • First we will implement the simplest solution
  • Then we measure to find what numbers may break the feature
  • If required, all owned objects could be updated in a single query, since they are on the same table, and it would be cheap

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No background jobs please. We should avoid these at all costs. Its a slippery slope to end up like CS with workers/controllers.

Also agree with Angel's comment, but just want to take a hard stance on background jobs. They should always be our last and only resort IMO

spec: { type: object, additionalProperties: true }
labels: { type: object, additionalProperties: { type: string } }

ResourceList:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResourceList uses offset-based pagination (page + size). The pull-secret-service DDR explicitly chose cursor-based pagination and documents why:

"Pagination Strategy: Cursor-based pagination (not offset-based). Why cursor-based: Prevents missing/duplicate results when new rotations are created during pagination"

Since all entity types now share a single resources table (higher write concurrency than per-type tables), the offset-based approach is more susceptible to the same skipped/duplicated results problem. Is the divergence from the cursor-based pattern intentional? If so, worth documenting the rationale. If not, consider aligning with the established pattern.

Copy link
Copy Markdown
Collaborator Author

@rh-amarin rh-amarin Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we can accept the risk of skipped/duplicated problem.

My rationale is that

HyperFleet API does not create/delete elements at high frequency scoped to a cluster or a search of related clusters

What operations do we expect to list resources?

  • Listing clusters of a user
  • Listing child objects of a user (e.g. nodepools)
  • statuses

IMO, none of these operations will be impacted by the update frequency of the shared resources table. One customer looking at a paginated search of their cluster objects won't be affected by another customer creating/destroying other clusters

But, this topic now begs the question

  • Should we allow unrestricted GET /api/hyperfleet/v1/resources operation in the API?
  • What can be the purpose of listing all entities at root level? mixing clusters/nodepools/others in the same API call?
  • Should we add a restriction by ?kind=xxx always?
  • Should we remove completely the endpoint?

We can go with it in a first version:

  • It can be useful for administrative purposes and debugging
  • It shouldn't be accessed by external customers


`UpdateConditions` — called exclusively by the status aggregation path. Replaces all condition rows for the resource atomically.

`Replace` and `UpdateConditions` are on separate code paths deliberately: `Replace` is called by user-initiated spec/label updates; `UpdateConditions` is called only by the status aggregation pipeline. This separation keeps user-writable and system-computed state from interfering.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The separation of Replace and UpdateConditions into different code paths is a good design. However, the doc doesn't discuss what happens when they execute concurrently:

  1. Replace reads Generation=5, increments to 6, writes back
  2. UpdateConditions reads Generation=5 (before the Replace commit), writes ObservedGeneration=5
  3. Now ObservedGeneration (5) < Generation (6), even though the condition was computed after the spec change

Is this temporary inconsistency acceptable? If so, it's worth stating explicitly. If not, does UpdateConditions need to re-read Generation inside its own transaction?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed some code examples from the design, as they are more on the implementation phase than design.
But the value for the updated generation should stay consistent within the whole transaction


// Authz defines per-operation authorization configuration.
// nil = no authorization checks (current default for all entities).
// Cannot be expressed in the config file — set in Go for entities requiring custom authz logic.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment says Authz "cannot be expressed in the config file". But if authorization becomes mandatory for all entity types (likely in a multi-tenant system), then every entity will need Go code — breaking the core promise of "no Go code required for standard entities".

Could OperationPermissions at least be expressible in config YAML? The permission strings are plain data, not logic:

entities:
  - type: Cluster
    authz:
      operationPermissions:
        GET: hyperfleet.clusters.view
        POST: hyperfleet.clusters.create

ResourceCheck would remain Go-only since it's a function, but OperationPermissions is just a map[string]string — no reason it can't live in config.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this option to the Appendix B, we will leave the implementation for a later stage though


### 5.4 Href generation

Each resource carries a relative `href` field that uniquely identifies it within the API. The href is computed once at creation time by the service layer using the entity's descriptor and its resolved parent chain and stored in the `resources.href` column. It is never recomputed after creation because the `id` (UUID) and the `Plural` path segment are both stable for the lifetime of the resource.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"It is never recomputed after creation" — this works as long as three invariants hold:

  1. Resources are never reparented (moved to a different owner)
  2. The Plural path segment of an entity type never changes
  3. The ownership hierarchy is stable

If any of these change in the future, all child hrefs become stale with no recalculation mechanism. Consider explicitly documenting these as system invariants so future design decisions don't accidentally violate them. Also worth noting: what happens if reparenting is ever needed? Is it a "delete + recreate" operation?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these possibilities are very unlikely to happen, so IMO we should not complicate the design with "exotic" invariants

If we ever need to change resource, plural or ownership, the future task will have to deal with the migration

Copy link
Copy Markdown
Contributor

@ciaranRoche ciaranRoche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I like the direction you took in this 😄 my main concern I called out inline, just to explicitly call out how we ensure openapi structs and config stay in sync. Prob worth a discussion with the group

Comment on lines +209 to +210
Type string `gorm:"column:type;size:100;not null"`
Kind string `gorm:"column:kind;size:100;not null"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own clarity, what is the difference between Type and Kind?

// nil = no authorization checks (current default for all entities).
// OperationPermissions is expressible in the config file (plain string map).
// ResourceCheck is Go-only (it is a function); entities requiring it must be registered via Register().
Authz *EntityAuthzConfig
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prob best to remove this, lets remove inherited auth as part of this work.

I dont want to include this until it becomes a hard requirement

- Con : Dynamic route registration is fundamentally incompatible with gorilla/mux, which builds its routing tree at startup; routes cannot be added at runtime without restarting the router
- Con : Adds a hard dependency on the Kubernetes API and `controller-runtime` or `client-go` — significant operational complexity
- Con : Entity type changes become a cluster operation rather than a code change; harder to test locally
- Con : Duplication of OpenAPI schemas. Providers already have their OpenAPI schema for their external API but now they have to break the types into CRDs and keep them aligned.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think this con is correct.

I think we need to have a conversation here, I am not arguing for or against the current design, but I think there is a silent gap in it, that is how do we ensure OpenAPI spec and Config stay in sync?

We get this for 'free' with CRD's as its a single config, comes at the cost of our users having to translate to CRD.

Regardless, we should document it, to ensure we have test cases to cover there are no silent failures if the config matching fails

There is no caller-supplied query parameter. The API surface is simply:

```
DELETE /clusters/{id} → 204 No Content (NodePools cascade)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

204 No Content? What is this OCM 😆

I could be wrong but think its a 202 with the intent captured on the resource that it is marked for deletion

case registry.OnParentDeleteCascade:
children, _ := s.dao.FindByTypeAndOwner(ctx, child.Type, id)
for _, c := range children {
if err := s.Delete(ctx, c.Type, c.ID); err != nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No background jobs please. We should avoid these at all costs. Its a slippery slope to end up like CS with workers/controllers.

Also agree with Angel's comment, but just want to take a hard stance on background jobs. They should always be our last and only resort IMO

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants