feat: optimize Dockerfile build speed with BuildKit caching #665

Closed
jeremyeder wants to merge 1 commit into ambient-code:main from jeremyeder:feature/dockerfile-alpine-optimization

@jeremyeder (Collaborator) commented Feb 21, 2026

Summary

Add BuildKit cache mounts and layer optimizations to speed up Docker builds across all components. Fixes Go version mismatch in public-api that broke builds.

  • Add BuildKit cache mounts for Go module and build cache (public-api, backend, operator)
  • Add npm cache mount for frontend npm ci
  • Fix public-api golang:1.23-alpine → golang:1.24-alpine (go.mod requires >= 1.24.0)
  • Bump public-api runtime alpine:3.19 → alpine:3.21
  • Upgrade runner from ubi9/python-311 to ubi9/python-312 with multi-stage pip install
  • Merge runner's two separate dnf install layers into single RUN (eliminates intermediate cache)
  • Move operator ARG declarations below cached layers (prevents cache busting on every build)
  • Remove unused procps from operator runtime
  • Add .dockerignore for Go components to reduce build context

Build Speed Improvements

| Optimization | Components | Effect |
| --- | --- | --- |
| BuildKit go mod cache | public-api, backend, operator | Skip module download on source-only changes |
| BuildKit go build cache | public-api, backend, operator | Reuse compiled object files across builds |
| npm cache mount | frontend | Faster npm ci when lockfile unchanged |
| .dockerignore files | public-api, backend, operator | Smaller build context, fewer cache invalidations |
| ARG reordering | operator | Build metadata changes no longer bust dependency cache |
| Merged dnf layers | runner | Single layer instead of two (cleaner, no intermediate cache) |

Bug Fixes

  • public-api/Dockerfile: golang:1.23-alpine → golang:1.24-alpine. go.mod requires go 1.24.0, so builds were failing.

Test plan

  • Build public-api — passes
  • Build backend — passes
  • Build operator — passes
  • Build frontend — passes (Turbopack/SWC compiles successfully)
  • Build runner — passes
  • Run make kind-up-local to validate full stack
  • Verify runner can execute Claude Code sessions
  • Build each image twice to confirm cache mounts speed up second build
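For the last item, a rough way to compare build times is sketched below. The helper name elapsed_seconds, the image tag, and the components/backend build context are assumptions for illustration, not paths confirmed by this PR. Rebuilding with no changes mostly measures the ordinary layer cache; to exercise the cache mounts, a source file has to change between the two builds so the go build step actually reruns.

```shell
#!/bin/sh
# elapsed_seconds CMD [ARGS...]
# Runs the command with its output discarded and prints the whole seconds it took.
# Returns non-zero (and prints nothing) if the command fails.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1 || return 1
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical usage (tag and context path are assumptions):
#   cold=$(elapsed_seconds docker build -t backend:cache-test components/backend)
#   ...edit a .go file so the go build layer is invalidated...
#   warm=$(elapsed_seconds docker build -t backend:cache-test components/backend)
#   echo "cold=${cold}s warm=${warm}s"
```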

🤖 Generated with Claude Code

@jeremyeder force-pushed the feature/dockerfile-alpine-optimization branch from c0afd64 to 225c47b on February 21, 2026 06:03
@jeremyeder changed the title from "feat: standardize all Dockerfiles on Alpine (~87% image size reduction)" to "feat: optimize Dockerfiles with BuildKit caching and reduced image sizes (~80% reduction)" on Feb 21, 2026
@jeremyeder (Collaborator, Author) commented

This patch brings the images from 5 GB down to 1 GB. I looked at using Alpine-family images everywhere, but that only got the total for all container images down to 645 MB.

@github-actions bot (Contributor) commented Feb 21, 2026

Claude Code Review

Summary

This PR optimizes Dockerfiles across all components using BuildKit cache mounts and multi-stage builds. The changes are primarily infrastructure-level (no application code affected). The runner Dockerfile sees the most substantial refactoring (single-stage → multi-stage, Python 3.11 → 3.12, ~73% claimed size reduction). Backend, operator, and public-api builder stages switch from ubi9/go-toolset to golang:1.24-alpine.

Overall direction is sound, but there are several correctness and consistency concerns that need attention before merge.


Issues by Severity

🚫 Blocker Issues

1. :latest tag in components/public-api/Dockerfile runtime stage

# ❌ Non-deterministic, will silently pick up breaking UBI updates
FROM registry.access.redhat.com/ubi9/ubi-minimal:latest

Using :latest for a runtime base image breaks reproducible builds. A future UBI update could silently change behavior or break the image. Pin to a specific version:

# ✅ Pin to a specific version
FROM registry.access.redhat.com/ubi9/ubi-minimal:9.4

All other runtime stages in this project use pinned versions — this should too.


🔴 Critical Issues

2. CGO compatibility risk: Alpine builder → UBI runtime (backend, operator, public-api)

The builder stages for backend, operator, and public-api now use golang:1.24-alpine (musl libc), while runtime stages remain UBI-based (glibc). If CGO is enabled during go build, the resulting binary will be linked against musl and will crash at runtime on glibc.

The build commands in the diff do not explicitly set CGO_ENABLED=0:

# ⚠️ No CGO_ENABLED=0 — depends on environment default
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    go build -o /app/backend ./...

Fix: Explicitly disable CGO when building on Alpine for a glibc runtime:

RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    CGO_ENABLED=0 go build -o /app/backend ./...

This is critical — if any dependency uses CGO (even indirectly), the binary will silently build but fail to start. Verify with ldd /app/backend in the container.
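One way to script the ldd check is sketched below. It assumes glibc's ldd (as found in UBI images), which reports "not a dynamic executable" or "statically linked" for static binaries. The helper name classify_linkage is made up for this example.

```shell
#!/bin/sh
# classify_linkage
# Reads `ldd` output on stdin and prints "static" if it contains one of the
# messages glibc's ldd emits for statically linked binaries, else "dynamic".
classify_linkage() {
    if grep -Eq 'not a dynamic executable|statically linked'; then
        echo static
    else
        echo dynamic
    fi
}

# Usage inside the built image (binary path taken from the snippet above):
#   ldd /app/backend 2>&1 | classify_linkage
```

A result of "dynamic" for a binary built on Alpine would be the signal to look at which libc it was linked against before shipping it on a UBI runtime.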

3. Python 3.11 → 3.12 upgrade in runner with no dependency validation evidence

components/runners/claude-code-runner/Dockerfile upgrades from ubi9/python-311 to ubi9/python-312. Python 3.12 removed several deprecated modules, including distutils and imp, that some packages still depend on. The platform's key dependencies (claude-code-sdk>=0.0.23, anthropic>=0.68.0) need verification against Python 3.12.

There is no mention of updated requirements.txt, lock file, or CI test results confirming the runner works on Python 3.12. This upgrade should be explicitly validated before merge.


🟡 Major Issues

4. Inconsistent base image strategy across components

The PR introduces a mixed image strategy:

| Component | Builder | Runtime |
| --- | --- | --- |
| backend | golang:1.24-alpine (NEW) | ubi9/ubi-minimal (unchanged) |
| operator | golang:1.24-alpine (NEW) | ubi9/ubi-minimal (unchanged) |
| public-api | golang:1.24-alpine (NEW) | ubi9/ubi-minimal:latest (NEW) |
| runner | ubi9/python-312 | ubi9/python-312 |

The public-api runtime was changed to UBI-minimal, but this version uses :latest (see Blocker #1). More importantly: if the motivation for switching Go builders to Alpine is build speed via layer caching, consider whether using a consistent golang:1.24-alpine or staying with ubi9/go-toolset uniformly is the right long-term call. Mixing Alpine builders with UBI runtimes adds musl/glibc friction (see Critical #2).

5. Implicit curl dependency in public-api health check

The public-api runtime stage installs only ca-certificates via microdnf, but the HEALTHCHECK uses curl:

RUN microdnf install -y ca-certificates && microdnf clean all

HEALTHCHECK --interval=30s --timeout=5s --start-period=5s --retries=3 \
    CMD curl -sf http://localhost:8080/health || exit 1

curl is available in ubi9/ubi-minimal by default, but relying on this implicit inclusion is fragile. Make the dependency explicit:

RUN microdnf install -y ca-certificates curl && microdnf clean all

🔵 Minor Issues

6. Duplicate .dockerignore content — not DRY

All four new .dockerignore files (backend, operator, public-api, runner) are byte-for-byte identical. This works, but means future updates require 4 edits. Consider documenting this intentional duplication or noting it in a comment, so future contributors don't accidentally diverge them.

7. BuildKit requirement not documented

The --mount=type=cache syntax requires BuildKit (DOCKER_BUILDKIT=1 or Docker >= 23 with BuildKit default). This is a build-time requirement. If CI or local dev environments still use the legacy builder (an older Docker version without DOCKER_BUILDKIT=1), the build fails, because the legacy builder does not support RUN --mount. Verify that the CI workflow (components-build-deploy.yml) and the local dev documentation explicitly enable BuildKit.
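A local pre-flight check along these lines could catch the problem early. It is only a sketch that encodes the Docker 23 cutoff mentioned above; the function name buildkit_default_for is hypothetical.

```shell
#!/bin/sh
# buildkit_default_for VERSION
# Prints "yes" if a Docker Engine version uses BuildKit by default
# (major version 23 or later, per the cutoff above), "no" if it is older,
# and "unknown" (with a non-zero status) if the version does not start
# with a number.
buildkit_default_for() {
    major=${1%%.*}
    case "$major" in
        ''|*[!0-9]*) echo unknown; return 1 ;;
    esac
    if [ "$major" -ge 23 ]; then echo yes; else echo no; fi
}

# Example against the local daemon:
#   buildkit_default_for "$(docker version --format '{{.Server.Version}}')"
# On older engines, BuildKit can still be enabled per invocation:
#   DOCKER_BUILDKIT=1 docker build .
```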

8. adduser syntax change in public-api may not produce equivalent results

Old (Alpine): adduser -D -u 1001 appuser
New (UBI-minimal): adduser -u 1001 appuser

The -D flag in Alpine disables password. In UBI/shadow-utils, the equivalent behavior differs. Verify the resulting account has no password set and has the expected shell (/sbin/nologin or similar).


Positive Highlights

  • Smart ARG placement in operator: Moving ARG GIT_COMMIT, GIT_BRANCH, etc. to after COPY . . is an excellent cache optimization — these volatile build args no longer bust the go mod download cache layer. This is the right approach and should be applied to backend/public-api as well.

  • Runner multi-stage build: The refactor from single-stage to multi-stage for the runner is a meaningful architectural improvement. Separating pip dependency installation from the runtime image is the correct pattern and will reduce both image size and attack surface.

  • Consolidated dnf layers in runner: Merging multiple RUN dnf install calls into a single RUN ending with dnf clean all && rm -rf /var/cache/dnf is correct and reduces image size.

  • Removed commented-out Playwright/Chromium code: Cleaning ~40 lines of dead code from the runner Dockerfile is good hygiene and removes confusion about whether those dependencies are active.

  • Frontend cache mount: The --mount=type=cache,target=/root/.npm addition for npm ci is low-risk and provides meaningful CI speedups.

  • .dockerignore additions: Adding **/*_test.go, .env, .env.*, *.md exclusions is good practice — reduces build context size and prevents accidental secret exposure.


Recommendations

Priority order before merge:

  1. Fix :latest tag in public-api runtime — pin to ubi9/ubi-minimal:9.4 or equivalent (1-line fix)
  2. Add CGO_ENABLED=0 to all Go build commands in backend, operator, and public-api Dockerfiles (1-line fix each, verify with file or ldd on the binary)
  3. Verify Python 3.12 compatibility — run the runner's test suite (tests/test_privacy_masking.py at minimum) against the new image and confirm key dependencies work
  4. Make curl explicit in public-api microdnf install
  5. Confirm BuildKit is enabled in .github/workflows/components-build-deploy.yml and docs/developer/local-development/kind.md

🤖 Generated with Claude Code



What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

Add BuildKit cache mounts and layer optimizations to speed up
Docker builds across all components. Fix Go version mismatch
in public-api that broke builds.

Key changes:
- Add BuildKit cache mounts for Go module/build cache (3 Go services)
- Add npm cache mount for frontend dependency installs
- Fix public-api Go version: 1.23 -> 1.24 (matches go.mod requirement)
- Bump public-api alpine: 3.19 -> 3.21
- Upgrade runner from python-311 to python-312
- Merge runner dnf install layers into single RUN
- Download jq binary directly (not available in ubi9/python-312 repos)
- Move operator ARG declarations below cached layers
- Remove unused procps from operator runtime
- Add .dockerignore for Go components (public-api, backend, operator)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jeremyeder force-pushed the feature/dockerfile-alpine-optimization branch from 225c47b to fd6907f on February 22, 2026 02:04
@jeremyeder changed the title from "feat: optimize Dockerfiles with BuildKit caching and reduced image sizes (~80% reduction)" to "feat: optimize Dockerfile build speed with BuildKit caching" on Feb 22, 2026
@jeremyeder (Collaborator, Author) commented

Closing in favor of a minimal one-line fix for the Go version mismatch. The broader Dockerfile optimizations (BuildKit caching, Alpine migration, etc.) didn't yield meaningful wins with UBI base images.

@jeremyeder closed this on Feb 22, 2026

@github-actions bot (Contributor) commented Feb 22, 2026

Claude Code Review

Summary

PR #665 optimizes Docker build speed by adding BuildKit cache mounts across all components, fixes a Go version mismatch that was breaking public-api builds, and upgrades the Claude runner from Python 3.11 to 3.12 with a multi-stage build. The optimization intent is sound, but several issues warrant attention before merging — one supply-chain security concern stands out as the most important.


Issues by Severity

🔴 Critical Issues

1. jq binary downloaded from GitHub with no integrity verification (runner Dockerfile:25)

curl -Lo /usr/local/bin/jq https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-$(uname -m | sed 's/aarch64/arm64/;s/x86_64/amd64/') && \
chmod +x /usr/local/bin/jq

This downloads an arbitrary binary from the internet at build time and makes it executable without verifying its SHA256 checksum. This is a classic supply-chain attack vector — a compromised CDN, DNS hijack, or MITM between the builder and GitHub could silently substitute a malicious binary into every runner image. The PR description acknowledges jq isn't available in UBI9 python-312 repos, but the solution introduces a more serious risk than the problem it solves.

Recommended fix — add integrity verification:

curl -Lo /usr/local/bin/jq https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-$(uname -m | sed 's/aarch64/arm64/;s/x86_64/amd64/') && \
echo "<sha256_hash>  /usr/local/bin/jq" | sha256sum -c - && \
chmod +x /usr/local/bin/jq

Use the published SHA256 hashes from the jq 1.7.1 release page for both amd64 and arm64.
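A sketch of how the per-architecture check could be factored, using only sha256sum. The function names are illustrative, and the expected hashes are deliberately left as placeholders to be filled in from the release page.

```shell
#!/bin/sh
# verify_sha256 FILE EXPECTED_HEX
# Succeeds only if the SHA-256 of FILE equals EXPECTED_HEX.
verify_sha256() {
    echo "$2  $1" | sha256sum -c - >/dev/null 2>&1
}

# jq_asset_arch MACHINE
# Maps `uname -m` output to the suffix used in the jq release asset name.
jq_asset_arch() {
    case "$1" in
        x86_64)  echo amd64 ;;
        aarch64) echo arm64 ;;
        *)       echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Hypothetical use in the runner build step (hash values are placeholders):
#   arch=$(jq_asset_arch "$(uname -m)") || exit 1
#   case "$arch" in
#       amd64) expected='<sha256_for_amd64>' ;;
#       arm64) expected='<sha256_for_arm64>' ;;
#   esac
#   curl -fLo /usr/local/bin/jq \
#       "https://github.com/jqlang/jq/releases/download/jq-1.7.1/jq-linux-${arch}"
#   verify_sha256 /usr/local/bin/jq "$expected" || { rm -f /usr/local/bin/jq; exit 1; }
#   chmod +x /usr/local/bin/jq
```

Deleting the file on a mismatch keeps an unverified binary from lingering in the layer if a later step ignores the failure.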

2. Backend and operator base image switched from ubi9/go-toolset to golang:alpine

# Before
FROM registry.access.redhat.com/ubi9/go-toolset:1.24 AS builder
# After
FROM golang:1.24-alpine AS builder

ubi9/go-toolset is the Red Hat-vetted, FIPS-compatible Go toolchain image intended for OpenShift workloads. Switching to golang:1.24-alpine (musl libc) for the build stage creates a mixed-provenance model: alpine-compiled binaries running on ubi-minimal runtime. While CGO is disabled so musl vs glibc won't cause direct linkage issues, this change:

  • Bypasses Red Hat's security scanning pipeline for the build environment
  • May conflict with OpenShift's image admission policies (some clusters restrict non-Red Hat base images even in build stages)
  • Removes FIPS compliance from the build environment, which may matter if the platform runs in regulated environments

If the motivation is that golang:1.24-alpine is lighter/faster, the BuildKit cache mounts deliver most of the speed benefit regardless of base image. Reverting the base image change while keeping the cache mounts would preserve both the optimization and the compliance posture.


🟡 Major Issues

3. Runner base image digest pin removed

# Before (pinned)
FROM registry.access.redhat.com/ubi9/python-311@sha256:d0b35f779ca0ae87deaf17cd1923461904f52d3ef249a53dbd487e02bdabdde6

# After (floating tag)
FROM registry.access.redhat.com/ubi9/python-312

Removing the SHA pin means builds are no longer reproducible — a routine registry update to python-312 could silently change the runner environment between builds. The upgrade to Python 3.12 is a good move, but consider pinning the new image: FROM registry.access.redhat.com/ubi9/python-312@sha256:<digest>.
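A simple lint for floating base images could look like the sketch below. The function name is hypothetical, and the check is deliberately naive: it also flags FROM lines that reference an earlier build stage or scratch.

```shell
#!/bin/sh
# unpinned_from_lines
# Reads a Dockerfile on stdin and prints each FROM line that does not
# reference its image by sha256 digest. The exit status is grep's:
# 0 if at least one unpinned line was printed, 1 if none were found.
unpinned_from_lines() {
    grep -Ei '^[[:space:]]*from[[:space:]]' | grep -Ev '@sha256:[0-9a-f]{64}'
}

# Example:
#   unpinned_from_lines < components/runners/claude-code-runner/Dockerfile
```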

4. ARG declarations in operator Dockerfile are unused in the build command

ARG GIT_COMMIT=unknown
ARG GIT_BRANCH=unknown
ARG GIT_REPO=unknown
ARG GIT_VERSION=unknown
ARG BUILD_DATE=unknown
ARG BUILD_USER=unknown

RUN ... CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o operator .

The ARGs are declared but never referenced in the build command (e.g., via -ldflags="-X main.version=${GIT_VERSION}"). Moving them below the COPY is the right caching optimization, but they remain dead configuration. Either wire them into the build (inject version metadata via ldflags) or remove them entirely to avoid misleading future contributors.
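If the metadata is meant to reach the binary, the flags could be assembled as in this sketch. The Go variable names main.gitVersion and main.gitCommit are assumptions; the operator would need matching package-level string variables for -X to have any effect.

```shell
#!/bin/sh
# build_ldflags VERSION COMMIT
# Prints a linker flag string that strips debug info (-s -w) and sets two
# package-level string variables via -X. Values must not contain whitespace.
build_ldflags() {
    printf '%s\n' "-s -w -X main.gitVersion=$1 -X main.gitCommit=$2"
}

# Hypothetical use in the operator's RUN step:
#   CGO_ENABLED=0 GOOS=linux go build \
#       -ldflags="$(build_ldflags "$GIT_VERSION" "$GIT_COMMIT")" -o operator .
```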

5. BuildKit cache mount scope in multi-platform CI

The Makefile and GitHub Actions CI builds images for both linux/amd64 and linux/arm64. The cache mounts (/go/pkg/mod, /root/.cache/go-build) are shared by default across all builds on a given builder. In multi-platform docker buildx builds, the --mount=type=cache entries should include a scoped id (e.g., --mount=type=cache,id=amd64-gomod,target=/go/pkg/mod) to prevent architecture-specific compiled artifacts from contaminating cross-platform caches. Without this, intermittent build failures may occur in CI.


🔵 Minor Issues

6. Python packages copied to /usr/local in runner multi-stage build

RUN pip install --no-cache-dir --prefix=/install '/build/claude-runner[all]' uv
# ...
COPY --from=builder /install /usr/local

Copying the entire /install tree into /usr/local of the runtime image is non-standard and could conflict with existing UBI9 system files. A more conventional pattern is to copy to an isolated directory (e.g., /opt/venv) and set PYTHONPATH/PATH accordingly. Worth verifying this works correctly across arm64 builds.

7. .dockerignore excludes test files — document the tradeoff

**/*_test.go

This is correct for production builds. However, it means a future Dockerfile test stage (e.g., RUN go test ./...) would find no test files and pass without running any tests. A comment noting this intentional tradeoff would help future contributors.


Positive Highlights

  • Bug fix for public-api: Correcting golang:1.23-alpine → 1.24-alpine to match go.mod's go 1.24.0 requirement is an important correctness fix. Builds were silently broken.
  • ARG reordering in operator: Moving build metadata ARGs below the COPY . . layer is a correct and well-explained caching optimization.
  • Merged dnf RUN layers in runner: Combining the previously split dnf install commands into a single RUN reduces intermediate layers and eliminates an unnecessary cache layer.
  • Multi-stage build for runner Python deps: The builder/runtime split is a genuine improvement — pip install in the builder, copy artifacts to runtime, avoiding cache pollution in the final image.
  • npm cache mount for frontend: Clean, idiomatic BuildKit optimization.
  • Removed unused procps from operator runtime: Good housekeeping; smaller attack surface.
  • Added .dockerignore files: Reduces build context for all Go components, preventing git history from busting caches.
  • Comment cleanup in frontend Dockerfile: Removing stale "uncomment the following line" comments that described actions already taken.

Recommendations

Priority order for follow-up:

  1. [Must fix] Add SHA256 integrity verification for the jq binary download before this merges.
  2. [Should discuss] Evaluate whether the alpine base image change for backend/operator is acceptable given OpenShift/FIPS constraints. If compliance matters, revert to ubi9/go-toolset and keep only the cache mount additions.
  3. [Should fix] Pin the ubi9/python-312 image to a specific digest for reproducible runner builds.
  4. [Should fix] Scope the BuildKit cache mounts by architecture in CI multi-platform builds.
  5. [Low priority] Either wire the operator ARG values into the binary via ldflags or remove them.

Review performed by Claude Code against project standards in CLAUDE.md and .claude/context/security-standards.md.


