[OTEL] Add OpenTelemetry observability support#285

Open
royischoss wants to merge 33 commits into mlrun:development from royischoss:ceml-641

Contributor

@royischoss royischoss commented Apr 5, 2026

Adds OTel-based observability to MLRun CE with automatic Python instrumentation, deployment-mode metrics collection, and Prometheus integration.
https://iguazio.atlassian.net/browse/CEML-685

Changes

OTel operator sub-chart

  • Added opentelemetry-operator v0.78.1 as an optional dependency
  • crds.create: false — CRD rendering disabled on the sub-chart; the parent chart owns the CRDs via crds/ (see below)
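
A minimal values sketch of the sub-chart toggle described above. The `crds.create: false` setting comes from this PR; the `enabled` key is an assumed name for the dependency condition and may differ in the actual chart.

```yaml
# values.yaml (sketch; the `enabled` key name is an assumption)
opentelemetry-operator:
  enabled: true      # hypothetical condition key for the optional dependency
  crds:
    create: false    # sub-chart renders no CRDs; the parent chart's crds/ owns them
```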

CRD bootstrap via crds/ directory

  • Three minimal stub CRDs added to charts/mlrun-ce/crds/:
    • crd-opentelemetrycollector.yaml
    • crd-opentelemetryinstrumentation.yaml
    • crd-opampbridges.yaml
  • Helm applies crds/ before any templates or hooks, so the OTel CRD types are established before the crd-readiness-job hook runs — no CRD polling needed
  • Stubs use x-kubernetes-preserve-unknown-fields: true (minimal schema); the operator's admission webhook handles full CR validation once it's running
  • tests/package.sh replaces the large CRD files inside the opentelemetry-operator sub-chart tarball with 41-byte stubs, keeping the Helm release Secret well under the 3 MB Kubernetes API limit
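
For illustration, a stub CRD of the kind described above could look roughly like this. It is a sketch, not the file from this PR: the served version (`v1beta1`) is an assumption, and the real stubs may list other versions.

```yaml
# crds/crd-opentelemetrycollector.yaml (illustrative stub only)
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: opentelemetrycollectors.opentelemetry.io   # must be <plural>.<group>
spec:
  group: opentelemetry.io
  scope: Namespaced
  names:
    kind: OpenTelemetryCollector
    listKind: OpenTelemetryCollectorList
    plural: opentelemetrycollectors
    singular: opentelemetrycollector
  versions:
    - name: v1beta1            # assumed version
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          # minimal schema: accept any fields and leave validation to the operator webhook
          x-kubernetes-preserve-unknown-fields: true
```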

New templates (templates/opentelemetry/)

  • Pre-install hook to label/annotate the namespace for OTel webhook injection and namespace-wide Python auto-instrumentation
  • collector.yaml and instrumentation.yaml — placeholder files; the actual CRs are applied by otel-cr-installer.yaml (post-install/post-upgrade hook) after the operator webhook is ready
  • RBAC for hook jobs
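
As a rough picture of the namespace state the pre-install hook aims for, the sketch below shows a labeled and annotated namespace. The namespace name and the label key/value are hypothetical; `instrumentation.opentelemetry.io/inject-python` is the OpenTelemetry operator's annotation for Python auto-instrumentation.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mlrun                             # hypothetical namespace name
  labels:
    opentelemetry-injection: enabled      # hypothetical label matched by the webhook namespaceSelector
  annotations:
    # "true" injects the Instrumentation resource found in this namespace
    instrumentation.opentelemetry.io/inject-python: "true"
```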

Metrics: push model (OTLP → Prometheus)

  • OTel collector exports metrics by pushing directly to Prometheus via the otlphttp/prometheus exporter at http://prometheus-operated.<namespace>.svc:9090/api/v1/otlp
  • Prometheus is configured with --enable-feature=otlp-write-receiver and --web.enable-otlp-receiver (the former is the Prometheus v2 feature flag; Prometheus v3 enables the OTLP receiver with --web.enable-otlp-receiver)
  • Consistent with how the non-CE production system handles metrics collection
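
The push path above can be sketched as a collector CR. This is an assumption-laden example, not the manifest from this PR: the resource name and the receiver settings are hypothetical, and `<namespace>` is left as a placeholder.

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: mlrun-otel               # hypothetical name
spec:
  mode: deployment
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      otlphttp/prometheus:
        # the otlphttp exporter appends /v1/metrics to this base endpoint
        endpoint: http://prometheus-operated.<namespace>.svc:9090/api/v1/otlp
    service:
      pipelines:
        metrics:
          receivers: [otlp]
          exporters: [otlphttp/prometheus]
```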

Instrumentation CR

  • Deployment-mode collector — single pod per namespace receiving OTLP from all instrumented workloads
  • Disabled aws_lambda OTel instrumentor to suppress irrelevant Lambda warnings
  • Removed duplicate OTEL_RESOURCE_ATTRIBUTES_* env vars (auto-injected by the operator)
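
A possible shape for the Instrumentation CR, shown only as a sketch. The resource name and the collector Service in the endpoint are hypothetical, and the disabled-instrumentation value simply follows this PR's wording (`aws_lambda`); the exact instrumentor name expected by the Python agent is not verified here.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: mlrun-python                             # hypothetical name
spec:
  exporter:
    endpoint: http://mlrun-otel-collector:4318   # hypothetical collector Service, OTLP/HTTP port
  python:
    env:
      - name: OTEL_PYTHON_DISABLED_INSTRUMENTATIONS
        value: aws_lambda                        # assumed value, taken from the PR text
```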

MLRun API crash fix

  • Added mlrun.api.extraEnvKeyValue.PYTHONPATH — OTel operator injects PYTHONPATH=/otel-auto-instrumentation-python:$(PYTHONPATH) using K8s env var expansion, which can't see Docker image ENV vars. Without this explicit K8s env var, $(PYTHONPATH) resolves to empty, dropping the MLRun services package path and crashing the API
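
In values form, the fix amounts to something like the fragment below. The key path is the one named in this PR; the value is a placeholder, since the actual path is not stated here.

```yaml
mlrun:
  api:
    extraEnvKeyValue:
      # placeholder: copy the PYTHONPATH that the MLRun API image sets via ENV
      PYTHONPATH: "<pythonpath-from-image-ENV>"
```

Declaring the variable at the Kubernetes level gives the operator-injected `$(PYTHONPATH)` reference something to expand.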

Admin / non-admin split

  • Admin: installs OTel operator with namespace-selector webhook; CRs disabled
  • User namespace: operator disabled; collector + instrumentation CRs enabled
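
The split could be expressed as two values overlays along these lines. Every `enabled` key below is a hypothetical name used for illustration; check the chart's values.yaml for the real toggles.

```yaml
# Admin install (sketch): operator only, no CRs
opentelemetry-operator:
  enabled: true          # hypothetical key
opentelemetry:
  collector:
    enabled: false       # hypothetical key
  instrumentation:
    enabled: false       # hypothetical key
---
# User-namespace install (sketch): CRs only, no operator
opentelemetry-operator:
  enabled: false
opentelemetry:
  collector:
    enabled: true
  instrumentation:
    enabled: true
```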

🤖 Generated with Claude Code

Comment thread charts/mlrun-ce/values.yaml
@royischoss royischoss marked this pull request as ready for review April 9, 2026 07:50
@royischoss royischoss requested a review from davesh0812 April 12, 2026 10:16
Comment thread charts/mlrun-ce/templates/_helpers.tpl Outdated
Comment thread charts/mlrun-ce/README.md Outdated
…ion accordingly. add request and limit for crdReadinessJob and namespaceLabelJob
# Conflicts:
#	charts/mlrun-ce/Chart.yaml
#	charts/mlrun-ce/README.md
#	charts/mlrun-ce/requirements.lock
…, change naming for otel metrics using metadata.name fieldRef
… empty templates, kubectl image

  - Move the hardcoded OTel collector pipeline config into values.yaml under opentelemetry.collector.config, so users can override receivers, processors, and exporters without forking the chart. The Prometheus endpoint now uses short DNS (prometheus-operated:9090), which removes namespace interpolation from the helper.
  - Add opentelemetry.kubectlImage to values.yaml (defaults to bitnami/kubectl:latest) and reference it in both crd-readiness-job.yaml and namespace-label.yaml instead of a hardcoded tag.
  - Fix namespace-label.yaml: replace indent with nindent for correct YAML formatting; change restartPolicy: Never to OnFailure so the job retries on transient failures.
  - Delete the empty collector.yaml and instrumentation.yaml template files, which generated no resources and were misleading. Move their documentation comment into crd-readiness-job.yaml, where the actual CR creation happens.
  - Replace the 50-line hardcoded collector manifest in _helpers.tpl with toYaml .Values.opentelemetry.collector.config | nindent 4.
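
With the config in values, a user override might look like the sketch below. It assumes `kubectlImage` is a plain image string and that the chart's default config already defines an `otlp` receiver; the `batch` processor is added purely as an example.

```yaml
opentelemetry:
  kubectlImage: bitnami/kubectl:latest   # default stated in the commit; assumed to be a single string
  collector:
    config:
      processors:
        batch: {}                        # example addition, not part of this PR
      exporters:
        otlphttp/prometheus:
          endpoint: http://prometheus-operated:9090/api/v1/otlp
      service:
        pipelines:
          metrics:
            receivers: [otlp]            # assumes the default values define this receiver
            processors: [batch]
            exporters: [otlphttp/prometheus]
```

Helm deep-merges maps from user values into the chart defaults, so only the overridden keys need to appear.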
Comment thread charts/mlrun-ce/templates/_helpers.tpl
3 participants