[OTEL] Add OpenTelemetry observability support #285
Open
royischoss wants to merge 33 commits into mlrun:development from
…R_SPACE and MLRUN_MODEL_ENDPOINT_MONITORING__STORE_PREFIXES__MONITORING_APPLICATION plus removes MLRUN_MODEL_ENDPOINT_MONITORING__ENDPOINT_STORE_CONNECTION
# Conflicts: # charts/mlrun-ce/Chart.yaml # charts/mlrun-ce/README.md # charts/mlrun-ce/requirements.lock # charts/mlrun-ce/values.yaml # tests/kind-test.sh
royischoss commented Apr 9, 2026
…ion accordingly. add request and limit for crdReadinessJob and namespaceLabelJob
# Conflicts: # charts/mlrun-ce/Chart.yaml # charts/mlrun-ce/README.md # charts/mlrun-ce/requirements.lock
…, change naming for otel metrics using metadata.name fieldRef
…, change naming for otel metrics using metadata.name fieldRef
… empty templates, kubectl image
- Move hardcoded OTel collector pipeline config into values.yaml under opentelemetry.collector.config, so users can now override receivers, processors, and exporters without forking the chart. The Prometheus endpoint uses short DNS (prometheus-operated:9090), removing namespace interpolation from the helper.
- Add opentelemetry.kubectlImage to values.yaml (defaults to bitnami/kubectl:latest) and reference it in both crd-readiness-job.yaml and namespace-label.yaml instead of a hardcoded tag.
- Fix namespace-label.yaml: replace indent with nindent for correct YAML formatting; change restartPolicy: Never to OnFailure so the job retries on transient failures.
- Delete the empty collector.yaml and instrumentation.yaml template files, which generated no resources and were misleading. Move their documentation comment into crd-readiness-job.yaml, where the actual CR creation happens.
- Replace the 50-line hardcoded collector manifest in _helpers.tpl with toYaml .Values.opentelemetry.collector.config | nindent 4.
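As a rough illustration of the override this commit enables, a user-supplied values file might look like the sketch below. The opentelemetry.collector.config key, the otlphttp/prometheus exporter name, and the prometheus-operated:9090 endpoint with the /api/v1/otlp path come from this pull request. The file name and everything else nested under config (the receiver, processor, and pipeline entries) are assumptions based on the standard OpenTelemetry Collector configuration layout, not the chart's actual defaults.

```yaml
# my-values.yaml: hypothetical override file, not part of the chart
opentelemetry:
  collector:
    config:
      receivers:
        otlp:                      # assumed receiver; standard OTLP receiver
          protocols:
            grpc: {}
            http: {}
      processors:
        batch: {}                  # assumed processor
      exporters:
        otlphttp/prometheus:
          # Short DNS name, resolved inside the release namespace
          endpoint: http://prometheus-operated:9090/api/v1/otlp
      service:
        pipelines:
          metrics:
            receivers: [otlp]
            processors: [batch]
            exporters: [otlphttp/prometheus]
```

Because the helper renders this block with toYaml, the keys reach the collector as written; passing the file with helm's -f flag would apply it at install or upgrade time.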
royischoss commented Apr 20, 2026
Adds OTel-based observability to MLRun CE with automatic Python instrumentation, deployment-mode metrics collection, and Prometheus integration.
https://iguazio.atlassian.net/browse/CEML-685
Changes
OTel operator sub-chart
- opentelemetry-operator v0.78.1 as an optional dependency
- crds.create: false, so CRD rendering is disabled on the sub-chart; the parent chart owns the CRDs via crds/ (see below)

CRD bootstrap via crds/ directory
- charts/mlrun-ce/crds/:
  - crd-opentelemetrycollector.yaml
  - crd-opentelemetryinstrumentation.yaml
  - crd-opampbridges.yaml
- Helm installs crds/ before any templates or hooks, so the OTel CRD types are established before the crd-readiness-job hook runs; no CRD polling is needed
- x-kubernetes-preserve-unknown-fields: true (minimal schema); the operator's admission webhook handles full CR validation once it's running
- tests/package.sh replaces the large CRD files inside the opentelemetry-operator sub-chart tarball with 41-byte stubs, keeping the Helm release Secret well under the 3 MB Kubernetes API limit

New templates (templates/opentelemetry/)
- collector.yaml and instrumentation.yaml: placeholder files; the actual CRs are applied by otel-cr-installer.yaml (post-install/post-upgrade hook) after the operator webhook is ready

Metrics: push model (OTLP → Prometheus)
- otlphttp/prometheus exporter at http://prometheus-operated.<namespace>.svc:9090/api/v1/otlp
- Prometheus flags --enable-feature=otlp-write-receiver and --web.enable-otlp-receiver (both required in Prometheus v3)

Instrumentation CR
- Disables the aws_lambda OTel instrumentor to suppress irrelevant Lambda warnings
- OTEL_RESOURCE_ATTRIBUTES_* env vars (auto-injected by the operator)

MLRun API crash fix
- mlrun.api.extraEnvKeyValue.PYTHONPATH: the OTel operator injects PYTHONPATH=/otel-auto-instrumentation-python:$(PYTHONPATH) using K8s env var expansion, which can't see Docker image ENV vars. Without this explicit K8s env var, $(PYTHONPATH) cannot resolve to the image's value, so the MLRun services package path is dropped and the API crashes

Admin / non-admin split
🤖 Generated with Claude Code
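The MLRun API crash fix described above can be sketched as a values fragment. The key path mlrun.api.extraEnvKeyValue.PYTHONPATH is taken from the description; the value shown is a placeholder assumption, since the actual MLRun services package path is not stated in this pull request.

```yaml
mlrun:
  api:
    extraEnvKeyValue:
      # Placeholder (assumption): replace with the path the MLRun image
      # normally sets through its Dockerfile ENV PYTHONPATH.
      PYTHONPATH: /path/to/mlrun/services
```

With the variable declared at the Kubernetes level, the $(PYTHONPATH) reference in the operator-injected value has a container env entry to expand to, so the MLRun path stays on the import path alongside /otel-auto-instrumentation-python.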