From 1377c9f6bd713989228cc8ce1edcf594fdfd46ae Mon Sep 17 00:00:00 2001
From: "ohotnikov.ivan"
Date: Mon, 16 Feb 2026 10:41:29 +0300
Subject: [PATCH 1/4] docs(networking): add network architecture overview

Add comprehensive documentation covering the Cozystack networking stack:
MetalLB load balancing (L2 and BGP modes), Cilium eBPF as kube-proxy
replacement, Kube-OVN centralized IPAM, and tenant isolation with
identity-based eBPF policies. All diagrams use Mermaid.

Signed-off-by: ohotnikov.ivan
---
 content/en/docs/v1/networking/architecture.md | 366 ++++++++++++++++++
 1 file changed, 366 insertions(+)
 create mode 100644 content/en/docs/v1/networking/architecture.md

diff --git a/content/en/docs/v1/networking/architecture.md b/content/en/docs/v1/networking/architecture.md
new file mode 100644
index 00000000..052c3898
--- /dev/null
+++ b/content/en/docs/v1/networking/architecture.md
@@ -0,0 +1,366 @@
+---
+title: "Network Architecture"
+linkTitle: "Architecture"
+description: "Overview of Cozystack cluster network architecture: MetalLB load balancing, Cilium eBPF networking, and tenant isolation with Kube-OVN."
+weight: 5
+---
+
+## Overview
+
+Cozystack uses a multi-layered networking stack designed for bare-metal Kubernetes clusters. The architecture combines several components, each responsible for a specific layer of the network:
+
+| Layer | Component | Purpose |
+| --- | --- | --- |
+| External load balancing | MetalLB | Publishing services to external networks |
+| Service load balancing | Cilium eBPF | kube-proxy replacement, in-kernel DNAT |
+| Network policies | Cilium eBPF | Tenant isolation and security enforcement |
+| Pod networking (CNI) | Kube-OVN | Centralized IPAM, overlay networking |
+| Observability | Hubble | Network traffic visibility |
+
+```mermaid
+flowchart TD
+    EXT["External Clients"]
+    RTR["Upstream Router / Gateway"]
+    MLB["MetalLB<br>(L2 ARP / BGP)"]
+    CIL["Cilium eBPF<br>(Service Load Balancing + Network Policies)"]
+    OVN["Kube-OVN<br>(Pod Networking + IPAM)"]
+    PODS["Pods"]
+
+    EXT --> RTR
+    RTR --> MLB
+    MLB --> CIL
+    CIL --> OVN
+    OVN --> PODS
+```
+
+## Cluster Network Configuration
+
+| Parameter | Default Value |
+| --- | --- |
+| Pod CIDR | 10.244.0.0/16 |
+| Service CIDR | 10.96.0.0/16 |
+| CNI | Kube-OVN + Cilium (kube-proxy replacement) |
+
+### Pod CIDR Allocation (Kube-OVN)
+
+Kube-OVN uses a **shared Pod CIDR** model:
+
+- All pods draw from a single shared IP pool (10.244.0.0/16)
+- IP addresses are allocated centrally through Kube-OVN's IPAM
+- There is no per-node CIDR splitting (unlike Calico or Flannel)
+- This enables live migration of pods between nodes without changing their IP addresses
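+
+As an illustration, such a shared pool corresponds to a single cluster-wide Kube-OVN `Subnet` object. A minimal sketch (the object name and optional field values here are illustrative, not verified Cozystack defaults):
+
+```yaml
+# Sketch of a cluster-wide Kube-OVN subnet backing the shared Pod CIDR.
+apiVersion: kubeovn.io/v1
+kind: Subnet
+metadata:
+  name: ovn-default          # Kube-OVN's conventional default subnet name
+spec:
+  protocol: IPv4
+  cidrBlock: 10.244.0.0/16   # the shared pod pool from the table above
+  gateway: 10.244.0.1        # illustrative gateway address
+  natOutgoing: true          # SNAT pod egress toward external networks
+```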
+
+## External Traffic Ingress with MetalLB
+
+MetalLB is a load balancer implementation for bare-metal Kubernetes clusters. It assigns external IP addresses to Services of type `LoadBalancer`, allowing external traffic to reach the cluster.
+
+```mermaid
+flowchart TD
+    CLIENT["External Client"]
+    RTR["Upstream Router"]
+
+    subgraph CLUSTER["Kubernetes Cluster"]
+        S1["Node 1<br>MetalLB Speaker"]
+        S2["Node 2<br>MetalLB Speaker"]
+        S3["Node 3<br>MetalLB Speaker"]
+        CIL["Cilium (eBPF)<br>Service Load Balancing<br>DNAT to Pod IP"]
+        POD["Target Pod<br>(Pod CIDR)"]
+    end
+
+    CLIENT -->|"Traffic to external IP<br>(e.g. 10.x.x.20)"| RTR
+    RTR -->|"L2 (ARP) or BGP"| S1
+    RTR -->|"L2 (ARP) or BGP"| S2
+    RTR -->|"L2 (ARP) or BGP"| S3
+    S1 --> CIL
+    S2 --> CIL
+    S3 --> CIL
+    CIL --> POD
+```
+
+### Layer 2 Mode (ARP)
+
+In L2 mode, MetalLB responds to ARP requests for the Service's external IP. A single node becomes the "leader" for that IP and receives all traffic.
+
+How it works:
+
+1. A MetalLB speaker on one node claims the external IP
+2. The speaker responds to ARP requests: "IP X is at MAC aa:bb:cc:dd:ee:ff"
+3. All traffic for that IP goes to the leader node
+4. Cilium on the node performs DNAT to the actual pod
+
+```mermaid
+sequenceDiagram
+    participant C as Client
+    participant L as Node (MetalLB Leader)
+    participant CIL as Cilium (eBPF)
+    participant P as Pod
+
+    C->>L: ARP: Who has 10.x.x.20?
+    L-->>C: ARP Reply: 10.x.x.20 is at aa:bb:cc:dd:ee:ff
+    C->>L: Send traffic to 10.x.x.20
+    L->>CIL: Packet enters kernel
+    CIL->>P: DNAT → Pod 10.244.x.x:8080
+```
+
+{{% alert color="info" %}}
+In L2 mode, only one node handles traffic for a given Service IP. Failover occurs if the leader node goes down, but there is no true load balancing across nodes for a single Service.
+{{% /alert %}}
+
+### BGP Mode
+
+In BGP mode, MetalLB establishes BGP sessions with upstream routers and announces /32 routes for Service IPs. This enables true ECMP load balancing across nodes.
+
+How it works:
+
+1. MetalLB speakers establish BGP sessions with the upstream router(s)
+2. Each speaker announces the Service IP as a /32 route
+3. The router has multiple next-hops for the same prefix
+4. ECMP distributes traffic across nodes
+5. Cilium on the receiving node performs DNAT to the actual pod
+
+```mermaid
+sequenceDiagram
+    participant S1 as Node 1 (Speaker)
+    participant S2 as Node 2 (Speaker)
+    participant S3 as Node 3 (Speaker)
+    participant R as Upstream Router
+    participant CIL as Cilium (eBPF)
+    participant P as Pod
+
+    S1->>R: BGP UPDATE: 10.x.x.20/32 via Node 1
+    S2->>R: BGP UPDATE: 10.x.x.20/32 via Node 2
+    S3->>R: BGP UPDATE: 10.x.x.20/32 via Node 3
+    Note over R: ECMP: 3 next-hops for 10.x.x.20/32
+    R->>S1: Traffic (1/3)
+    R->>S2: Traffic (1/3)
+    R->>S3: Traffic (1/3)
+    S1->>CIL: Packet enters kernel
+    CIL->>P: DNAT → Pod
+```
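+
+Both modes are configured through MetalLB's CRDs. A minimal sketch for L2 mode (the pool name and address range are placeholders):
+
+```yaml
+apiVersion: metallb.io/v1beta1
+kind: IPAddressPool
+metadata:
+  name: external-pool            # placeholder name
+  namespace: metallb-system
+spec:
+  addresses:
+    - 10.100.0.20-10.100.0.50    # example external IP range
+---
+apiVersion: metallb.io/v1beta1
+kind: L2Advertisement
+metadata:
+  name: external-l2
+  namespace: metallb-system
+spec:
+  ipAddressPools:
+    - external-pool              # announce this pool via ARP
+```
+
+For BGP mode, an analogous `BGPAdvertisement` is used instead, together with a `BGPPeer` resource (API group `metallb.io/v1beta2`) describing the session to the upstream router.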
+
+### VLAN Integration for External Traffic
+
+External traffic can be delivered to the cluster through additional VLANs (client VLANs, DMZ, public networks, etc.), which are then routed to services via MetalLB and Cilium.
+
+```mermaid
+flowchart TD
+    EXT["External Traffic"]
+
+    subgraph VLANs["Additional VLANs<br>(Client, DMZ, Public, etc.)"]
+        V1["VLAN A"]
+        V2["VLAN B"]
+    end
+
+    subgraph LB["MetalLB"]
+        L2["L2 Mode → Service → Pod"]
+        BGP["BGP Mode → Service → Pod"]
+    end
+
+    EXT --> VLANs
+    V1 --> L2
+    V2 --> BGP
+```
+
+## Cilium as kube-proxy Replacement
+
+Cilium replaces kube-proxy by attaching eBPF programs directly in the Linux kernel. This provides more efficient packet processing and advanced capabilities.
+
+### Traditional kube-proxy (iptables) vs Cilium eBPF
+
+```mermaid
+flowchart LR
+    subgraph IPTABLES["kube-proxy (iptables)"]
+        direction LR
+        P1["Packet"] --> IPT["iptables<br>PREROUTING"]
+        IPT --> NAT["NAT chains<br>O(n) rule traversal"]
+        NAT --> DNAT1["DNAT to Pod"]
+        DNAT1 --> POD1["Pod"]
+    end
+
+    subgraph EBPF["Cilium (eBPF)"]
+        direction LR
+        P2["Packet"] --> BPF["eBPF program<br>(TC/XDP)"]
+        BPF --> MAP["eBPF map lookup<br>O(1) hash"]
+        MAP --> DNAT2["DNAT"]
+        DNAT2 --> POD2["Pod"]
+    end
+```
+
+Key differences:
+
+| Aspect | kube-proxy (iptables) | Cilium (eBPF) |
+| --- | --- | --- |
+| Lookup complexity | O(n) rule traversal | O(1) hash-based lookup |
+| Service translation point | Netfilter hooks in the packet path | Socket-level (connect) and TC/XDP hooks |
+| Rule updates | Userspace kube-proxy rewrites iptables chains | eBPF maps updated in place |
+| Scalability | Degrades with service count | Constant performance |
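+
+This behavior is switched on through Cilium's configuration. A hedged sketch of the relevant upstream Cilium Helm values (names follow the upstream chart; the exact values used by Cozystack may differ):
+
+```yaml
+# Upstream Cilium Helm values enabling the kube-proxy replacement.
+kubeProxyReplacement: true
+k8sServiceHost: <api-server-address>   # placeholder: reachable API server
+k8sServicePort: 6443
+```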
+
+### eBPF Architecture
+
+```mermaid
+flowchart TD
+    subgraph KERNEL["Kernel Space"]
+        subgraph BPF["eBPF Programs"]
+            TC["TC<br>(ingress/egress)"]
+            XDP["XDP<br>(fastest path)"]
+            SOCK["Socket-level<br>(connect, sendmsg)"]
+        end
+
+        subgraph MAPS["eBPF Maps"]
+            SVC["Service Tables"]
+            EP["Endpoint Maps"]
+            POL["Policy Maps"]
+        end
+
+        TC --> MAPS
+        XDP --> MAPS
+        SOCK --> MAPS
+    end
+```
+
+## Tenant Isolation with Kube-OVN and Cilium
+
+In a multi-tenant Cozystack cluster, all tenants share the same Pod CIDR. Kube-OVN manages a single shared IP pool (no per-node splitting), and Cilium enforces strong isolation using eBPF-based network policies.
+
+### CNI Architecture
+
+```mermaid
+flowchart TD
+    subgraph KO["Kube-OVN"]
+        IPAM["Centralized IPAM — Shared pool 10.244.0.0/16"]
+        OVN["OVN/OVS Overlay Network"]
+        SUBNET["Subnet management per namespace/tenant"]
+    end
+
+    subgraph CIL["Cilium"]
+        POLICY["eBPF Network Policies"]
+        SVCBAL["Service Load Balancing (kube-proxy replacement)"]
+        IDENT["Identity-based Security"]
+        HUB["Observability via Hubble"]
+    end
+
+    KO --> CIL
+```
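+
+In this split, Cilium runs in CNI chaining mode: Kube-OVN creates the pod interface, and Cilium attaches its eBPF programs to it. A sketch of the corresponding upstream Cilium Helm values (illustrative; the exact values shipped by Cozystack may differ):
+
+```yaml
+# Chained-CNI mode: Cilium attaches its eBPF programs to the veth
+# devices created by the primary CNI (here, Kube-OVN).
+cni:
+  chainingMode: generic-veth
+  customConf: true   # keep the primary CNI's config instead of Cilium's
+```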
+
+### Tenant Isolation Model
+
+```mermaid
+flowchart TD
+    TA["Tenant A — Namespace app-a<br>Pods: 10.244.0.10, 10.244.0.11"]
+    TB["Tenant B — Namespace app-b<br>Pods: 10.244.1.20, 10.244.1.21"]
+    TC["Tenant C — Namespace app-c<br>Pods: 10.244.2.30, 10.244.2.31"]
+
+    ENGINE{"Cilium eBPF Policy Engine"}
+
+    TA --> ENGINE
+    TB --> ENGINE
+    TC --> ENGINE
+
+    ENGINE -->|"A ↔ A — ALLOWED"| ALLOW["Same-tenant traffic passes"]
+    ENGINE -->|"A ↔ B — DENIED"| DENY["Cross-tenant traffic dropped"]
+```
+
+### Identity-based Security
+
+Cilium assigns each endpoint (pod) a **security identity** based on its labels. Policies are enforced using these identities rather than IP addresses.
+
+```mermaid
+flowchart LR
+    POD["Pod: frontend-abc123<br>Labels: app=frontend,<br>tenant=acme, env=prod"]
+    AGENT["Cilium Agent<br>Hash(labels) → Identity: 12345"]
+    BPFMAP["eBPF Map<br>10.244.0.10 → Identity 12345"]
+
+    POD --> AGENT
+    AGENT --> BPFMAP
+```
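+
+Because policy decisions use identities derived from labels, a rule is written purely in terms of labels. A minimal illustration (names are hypothetical, not part of Cozystack's generated policies):
+
+```yaml
+apiVersion: cilium.io/v2
+kind: CiliumNetworkPolicy
+metadata:
+  name: allow-frontend-to-backend   # hypothetical example
+spec:
+  endpointSelector:
+    matchLabels:
+      app: backend          # applies to endpoints holding this identity
+  ingress:
+    - fromEndpoints:
+        - matchLabels:
+            app: frontend   # only identities carrying this label may connect
+```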
+
+### Policy Enforcement in Kernel
+
+When a packet is sent between pods, Cilium enforces policies entirely within kernel space:
+
+```mermaid
+flowchart TD
+    PKT["Packet: 10.244.0.10 → 10.244.1.20"]
+    STEP1["1. Lookup source identity:<br>10.244.0.10 → ID 12345 (tenant-a)"]
+    STEP2["2. Lookup destination identity:<br>10.244.1.20 → ID 67890 (tenant-b)"]
+    STEP3["3. Check policy map:<br>(12345, 67890, TCP, 80) → DENY"]
+    DROP["4. DROP packet"]
+
+    PKT --> STEP1 --> STEP2 --> STEP3 --> DROP
+```
+
+All of this happens in kernel space, on the order of 100 nanoseconds.
+
+### Why eBPF Enforcement is Secure
+
+| Property | Description |
+| --- | --- |
+| **Verifier** | eBPF programs are verified before loading — no crashes, no infinite loops |
+| **Isolation** | Programs run in a restricted kernel context |
+| **No userspace bypass** | All network traffic must pass through eBPF hooks |
+| **Atomic updates** | Policy changes are atomic — no race conditions |
+| **In-kernel** | No context switches needed, faster than userspace |
(Tenant A)"] + PODB["Pod B
(Tenant B)"] + NOTE["Cannot bypass policy —
traffic MUST go through kernel"] + end + + subgraph KS["Kernel Space"] + EBPF["eBPF Programs
• Attached to network interfaces
• Run in privileged kernel context
• Verified by kernel
• Cannot be bypassed by userspace
• Atomic policy updates"] + end + + US -->|"all traffic"| KS +``` + +### Default Deny with Namespace Isolation + +Example `CiliumNetworkPolicy` for tenant isolation: + +```yaml +apiVersion: cilium.io/v2 +kind: CiliumNetworkPolicy +metadata: + name: isolate-tenant + namespace: tenant-a +spec: + endpointSelector: {} # Apply to all pods in namespace + ingress: + - fromEndpoints: + - matchLabels: + io.kubernetes.pod.namespace: tenant-a # Only from same namespace + egress: + - toEndpoints: + - matchLabels: + io.kubernetes.pod.namespace: tenant-a # Only to same namespace + - toEntities: + - kube-apiserver # Allow API server access + - cluster # Allow cluster DNS +``` + +## Traffic Flow Summary + +### External Access + +```mermaid +flowchart LR + C["Client"] --> R["Router"] + R --> M["MetalLB
(L2/BGP)"] + M --> N["Node"] + N --> E["Cilium eBPF"] + E --> P["Pod"] +``` + +### Tenant Isolation + +```mermaid +flowchart LR + A["Pod A"] --> CHECK{"eBPF
Policy Check"} + CHECK -->|"Cross-tenant"| DENY["DENY"] + CHECK -->|"Same tenant"| ALLOW["ALLOW → Pod A'"] +``` From 29973a50eb595a32c82b1e3ed489f41f8ae503a2 Mon Sep 17 00:00:00 2001 From: "ohotnikov.ivan" Date: Wed, 18 Feb 2026 10:18:05 +0300 Subject: [PATCH 2/4] docs(networking): address review feedback for network architecture Add missing technical details: networking stack variants, GENEVE overlay, Join CIDR, cluster domain, Hubble observability section. Improve tenant isolation description with Cozystack-specific labels, automatic policy application, and CNI chaining details. Signed-off-by: ohotnikov.ivan --- content/en/docs/v1/networking/architecture.md | 102 +++++++++++++++--- 1 file changed, 89 insertions(+), 13 deletions(-) diff --git a/content/en/docs/v1/networking/architecture.md b/content/en/docs/v1/networking/architecture.md index 052c3898..53cc9825 100644 --- a/content/en/docs/v1/networking/architecture.md +++ b/content/en/docs/v1/networking/architecture.md @@ -3,6 +3,8 @@ title: "Network Architecture" linkTitle: "Architecture" description: "Overview of Cozystack cluster network architecture: MetalLB load balancing, Cilium eBPF networking, and tenant isolation with Kube-OVN." weight: 5 +aliases: + - /docs/v1/reference/applications/architecture --- ## Overview @@ -15,7 +17,7 @@ Cozystack uses a multi-layered networking stack designed for bare-metal Kubernet | Service load balancing | Cilium eBPF | kube-proxy replacement, in-kernel DNAT | | Network policies | Cilium eBPF | Tenant isolation and security enforcement | | Pod networking (CNI) | Kube-OVN | Centralized IPAM, overlay networking | -| Observability | Hubble | Network traffic visibility | +| Observability | Hubble (optional) | Network traffic visibility (disabled by default) | ```mermaid flowchart TD @@ -39,7 +41,36 @@ flowchart TD | --- | --- | | Pod CIDR | 10.244.0.0/16 | | Service CIDR | 10.96.0.0/16 | -| CNI | Kube-OVN + Cilium (kube-proxy replacement) | +| Join CIDR | 100.64.0.0/16 | +| Cluster domain | cozy.local | +| Overlay type | GENEVE | +| CNI | Kube-OVN | +| kube-proxy replacement | Cilium eBPF | + +### Networking Stack Variants + +Cozystack supports several networking stack variants to accommodate different +cluster types. The variant is selected via `bundles.system.variant` in the +platform configuration. + +| Variant | Components | Target Platform | +| --- | --- | --- | +| `kubeovn-cilium` | Kube-OVN + Cilium (default) | Talos Linux | +| `kubeovn-cilium-generic` | Kube-OVN + Cilium | kubeadm, k3s, RKE2 | +| `cilium` | Cilium only | Talos Linux | +| `cilium-generic` | Cilium only | kubeadm, k3s, RKE2 | +| `noop` | None (bring your own CNI) | Any | + +In Kube-OVN variants, Cilium operates as a chained CNI (`generic-veth` mode): +Kube-OVN handles pod networking and IPAM, while Cilium provides service load +balancing, network policy enforcement, and optional observability via Hubble. + +In Cilium-only variants, Cilium serves as both the CNI and the service load +balancer. + +{{% alert color="info" %}} +The rest of this document describes the default `kubeovn-cilium` variant. 
+
+{{% alert color="info" %}}
+The rest of this document describes the default `kubeovn-cilium` variant.
+{{% /alert %}}
 
 ### Pod CIDR Allocation (Kube-OVN)
 
@@ -48,7 +79,8 @@ Kube-OVN uses a **shared Pod CIDR** model:
 
 - All pods draw from a single shared IP pool (10.244.0.0/16)
 - IP addresses are allocated centrally through Kube-OVN's IPAM
 - There is no per-node CIDR splitting (unlike Calico or Flannel)
-- This enables live migration of pods between nodes without changing their IP addresses
+- Because IPs are not tied to node-specific CIDR blocks, pods can be rescheduled to different nodes while retaining their addresses
+- Inter-node pod communication uses GENEVE tunnels (Join CIDR: 100.64.0.0/16)
 
 ## External Traffic Ingress with MetalLB
 
@@ -220,7 +252,7 @@ flowchart TD
 
 ## Tenant Isolation with Kube-OVN and Cilium
 
-In a multi-tenant Cozystack cluster, all tenants share the same Pod CIDR. Kube-OVN manages a single shared IP pool (no per-node splitting), and Cilium enforces strong isolation using eBPF-based network policies.
+In a multi-tenant Cozystack cluster, all tenants share the same Pod CIDR. This is secure because isolation is enforced by Cilium eBPF policies at the kernel level, not by network segmentation. Tenants cannot communicate even though they share the same IP pool. Kube-OVN manages a single shared IP pool (no per-node splitting), and Cilium enforces strong isolation using eBPF-based network policies.
 
 ### CNI Architecture
 
@@ -228,7 +260,7 @@ flowchart TD
 flowchart TD
     subgraph KO["Kube-OVN"]
         IPAM["Centralized IPAM — Shared pool 10.244.0.0/16"]
-        OVN["OVN/OVS Overlay Network"]
+        OVN["OVN/OVS Overlay Network (GENEVE)"]
         SUBNET["Subnet management per namespace/tenant"]
     end
 
@@ -242,6 +274,12 @@ flowchart TD
     KO --> CIL
 ```
 
+Kube-OVN provides the primary CNI plugin for pod networking and IPAM. Kube-OVN's
+own network policy engine is disabled (`ENABLE_NP: false`), and all policy
+enforcement is delegated to Cilium. Cilium operates as a chained CNI component
+(`generic-veth` mode) that enforces network policies via eBPF and replaces
+kube-proxy for service load balancing.
+
 ### Tenant Isolation Model
 
 ```mermaid
@@ -320,29 +358,67 @@ flowchart TD
 
 ### Default Deny with Namespace Isolation
 
-Example `CiliumNetworkPolicy` for tenant isolation:
+
+{{% alert color="warning" %}}
+By default, Kubernetes allows all pod-to-pod traffic. Cozystack applies
+CiliumNetworkPolicy and CiliumClusterwideNetworkPolicy resources automatically
+when a tenant is created. These policies enforce namespace-level isolation and
+restrict access to system ports (etcd, kubelet, controllers).
+{{% /alert %}}
+
+Cozystack uses hierarchical tenant labels for isolation. Policies match on
+`tenant.cozystack.io/*` namespace labels, which allows parent tenants to
+include sub-tenant namespaces.
Example: ```yaml apiVersion: cilium.io/v2 kind: CiliumNetworkPolicy metadata: - name: isolate-tenant - namespace: tenant-a + name: allow-internal-communication + namespace: tenant-example spec: - endpointSelector: {} # Apply to all pods in namespace + endpointSelector: {} ingress: - fromEndpoints: - matchLabels: - io.kubernetes.pod.namespace: tenant-a # Only from same namespace + k8s:io.cilium.k8s.namespace.labels.tenant.cozystack.io/tenant-example: "" egress: - toEndpoints: - matchLabels: - io.kubernetes.pod.namespace: tenant-a # Only to same namespace + k8s:io.cilium.k8s.namespace.labels.tenant.cozystack.io/tenant-example: "" - toEntities: - - kube-apiserver # Allow API server access - - cluster # Allow cluster DNS + - kube-apiserver + - cluster ``` +## Observability with Hubble + +Hubble provides network traffic visibility for the Cilium data plane. It is +included in the Cozystack networking stack but **disabled by default** to +minimize resource usage. + +When enabled, Hubble provides: + +- Real-time flow logs for all pod-to-pod and external traffic +- DNS query visibility +- HTTP/gRPC request-level metrics +- Prometheus metrics integration +- Web UI for traffic visualization + +To enable Hubble, set the following in the Cilium configuration: + +```yaml +cilium: + hubble: + enabled: true + relay: + enabled: true + ui: + enabled: true +``` + +See Enabling Hubble for full configuration details. + ## Traffic Flow Summary ### External Access From 3d67bce31f01fb2a651c72bad729ff747aec820f Mon Sep 17 00:00:00 2001 From: Ivan Stukov Date: Wed, 18 Feb 2026 18:39:18 +0500 Subject: [PATCH 3/4] docs (networking): add to v0 Signed-off-by: Ivan Stukov --- content/en/docs/v0/networking/architecture.md | 442 ++++++++++++++++++ content/en/docs/v1/networking/architecture.md | 5 +- 2 files changed, 444 insertions(+), 3 deletions(-) create mode 100644 content/en/docs/v0/networking/architecture.md diff --git a/content/en/docs/v0/networking/architecture.md b/content/en/docs/v0/networking/architecture.md new file mode 100644 index 00000000..6568e9eb --- /dev/null +++ b/content/en/docs/v0/networking/architecture.md @@ -0,0 +1,442 @@ +--- +title: "Network Architecture" +linkTitle: "Architecture" +description: "Overview of Cozystack cluster network architecture: MetalLB load balancing, Cilium eBPF networking, and tenant isolation with Kube-OVN." +weight: 5 +aliases: + - /docs/v0/reference/applications/architecture + - /docs/reference/applications/architecture +--- + +## Overview + +Cozystack uses a multi-layered networking stack designed for bare-metal Kubernetes clusters. The architecture combines several components, each responsible for a specific layer of the network: + +| Layer | Component | Purpose | +| --- | --- | --- | +| External load balancing | MetalLB | Publishing services to external networks | +| Service load balancing | Cilium eBPF | kube-proxy replacement, in-kernel DNAT | +| Network policies | Cilium eBPF | Tenant isolation and security enforcement | +| Pod networking (CNI) | Kube-OVN | Centralized IPAM, overlay networking | +| Observability | Hubble (optional) | Network traffic visibility (disabled by default) | + +```mermaid +flowchart TD + EXT["External Clients"] + RTR["Upstream Router / Gateway"] + MLB["MetalLB
(L2 ARP / BGP)"] + CIL["Cilium eBPF
(Service Load Balancing + Network Policies)"] + OVN["Kube-OVN
(Pod Networking + IPAM)"] + PODS["Pods"] + + EXT --> RTR + RTR --> MLB + MLB --> CIL + CIL --> OVN + OVN --> PODS +``` + +## Cluster Network Configuration + +| Parameter | Default Value | +| --- | --- | +| Pod CIDR | 10.244.0.0/16 | +| Service CIDR | 10.96.0.0/16 | +| Join CIDR | 100.64.0.0/16 | +| Cluster domain | cozy.local | +| Overlay type | GENEVE | +| CNI | Kube-OVN | +| kube-proxy replacement | Cilium eBPF | + +### Networking Stack Variants + +Cozystack supports several networking stack variants to accommodate different +cluster types. The variant is selected via `bundles.system.variant` in the +platform configuration. + +| Variant | Components | Target Platform | +| --- | --- | --- | +| `kubeovn-cilium` | Kube-OVN + Cilium (default) | Talos Linux | +| `kubeovn-cilium-generic` | Kube-OVN + Cilium | kubeadm, k3s, RKE2 | +| `cilium` | Cilium only | Talos Linux | +| `cilium-generic` | Cilium only | kubeadm, k3s, RKE2 | +| `noop` | None (bring your own CNI) | Any | + +In Kube-OVN variants, Cilium operates as a chained CNI (`generic-veth` mode): +Kube-OVN handles pod networking and IPAM, while Cilium provides service load +balancing, network policy enforcement, and optional observability via Hubble. + +In Cilium-only variants, Cilium serves as both the CNI and the service load +balancer. + +{{% alert color="info" %}} +The rest of this document describes the default `kubeovn-cilium` variant. +{{% /alert %}} + +### Pod CIDR Allocation (Kube-OVN) + +Kube-OVN uses a **shared Pod CIDR** model: + +- All pods draw from a single shared IP pool (10.244.0.0/16) +- IP addresses are allocated centrally through Kube-OVN's IPAM +- There is no per-node CIDR splitting (unlike Calico or Flannel) +- Because IPs are not tied to node-specific CIDR blocks, pods can be rescheduled to different nodes while retaining their addresses +- Inter-node pod communication uses GENEVE tunnels (Join CIDR: 100.64.0.0/16) + +## External Traffic Ingress with MetalLB + +MetalLB is a load balancer implementation for bare-metal Kubernetes clusters. It assigns external IP addresses to Services of type `LoadBalancer`, allowing external traffic to reach the cluster. + +```mermaid +flowchart TD + CLIENT["External Client"] + RTR["Upstream Router"] + + subgraph CLUSTER["Kubernetes Cluster"] + S1["Node 1
MetalLB Speaker"] + S2["Node 2
MetalLB Speaker"] + S3["Node 3
MetalLB Speaker"] + CIL["Cilium (eBPF)
Service Load Balancing
DNAT to Pod IP"] + POD["Target Pod
(Pod CIDR)"] + end + + CLIENT -->|"Traffic to external IP
(e.g. 10.x.x.20)"| RTR + RTR -->|"L2 (ARP) or BGP"| S1 + RTR -->|"L2 (ARP) or BGP"| S2 + RTR -->|"L2 (ARP) or BGP"| S3 + S1 --> CIL + S2 --> CIL + S3 --> CIL + CIL --> POD +``` + +### Layer 2 Mode (ARP) + +In L2 mode, MetalLB responds to ARP requests for the Service's external IP. A single node becomes the "leader" for that IP and receives all traffic. + +How it works: + +1. A MetalLB speaker on one node claims the external IP +2. The speaker responds to ARP requests: "IP X is at MAC aa:bb:cc:dd:ee:ff" +3. All traffic for that IP goes to the leader node +4. Cilium on the node performs DNAT to the actual pod + +```mermaid +sequenceDiagram + participant C as Client + participant L as Node (MetalLB Leader) + participant CIL as Cilium (eBPF) + participant P as Pod + + C->>L: ARP: Who has 10.x.x.20? + L-->>C: ARP Reply: 10.x.x.20 is at aa:bb:cc:dd:ee:ff + C->>L: Send traffic to 10.x.x.20 + L->>CIL: Packet enters kernel + CIL->>P: DNAT → Pod 10.244.x.x:8080 +``` + +{{% alert color="info" %}} +In L2 mode, only one node handles traffic for a given Service IP. Failover occurs if the leader node goes down, but there is no true load balancing across nodes for a single Service. +{{% /alert %}} + +### BGP Mode + +In BGP mode, MetalLB establishes BGP sessions with upstream routers and announces /32 routes for Service IPs. This enables true ECMP load balancing across nodes. + +How it works: + +1. MetalLB speakers establish BGP sessions with the upstream router(s) +2. Each speaker announces the Service IP as a /32 route +3. The router has multiple next-hops for the same prefix +4. ECMP distributes traffic across nodes +5. Cilium on the receiving node performs DNAT to the actual pod + +```mermaid +sequenceDiagram + participant S1 as Node 1 (Speaker) + participant S2 as Node 2 (Speaker) + participant S3 as Node 3 (Speaker) + participant R as Upstream Router + participant CIL as Cilium (eBPF) + participant P as Pod + + S1->>R: BGP UPDATE: 10.x.x.20/32 via Node 1 + S2->>R: BGP UPDATE: 10.x.x.20/32 via Node 2 + S3->>R: BGP UPDATE: 10.x.x.20/32 via Node 3 + Note over R: ECMP: 3 next-hops for 10.x.x.20/32 + R->>S1: Traffic (1/3) + R->>S2: Traffic (1/3) + R->>S3: Traffic (1/3) + S1->>CIL: Packet enters kernel + CIL->>P: DNAT → Pod +``` + +### VLAN Integration for External Traffic + +External traffic can be delivered to the cluster through additional VLANs (client VLANs, DMZ, public networks, etc.) which are then routed to services via MetalLB and Cilium. + +```mermaid +flowchart TD + EXT["External Traffic"] + + subgraph VLANs["Additional VLANs
(Client, DMZ, Public, etc.)"] + V1["VLAN A"] + V2["VLAN B"] + end + + subgraph LB["MetalLB"] + L2["L2 Mode → Service → Pod"] + BGP["BGP Mode → Service → Pod"] + end + + EXT --> VLANs + V1 --> L2 + V2 --> BGP +``` + +## Cilium as kube-proxy Replacement + +Cilium replaces kube-proxy by attaching eBPF programs directly in the Linux kernel. This provides more efficient packet processing and advanced capabilities. + +### Traditional kube-proxy (iptables) vs Cilium eBPF + +```mermaid +flowchart LR + subgraph IPTABLES["kube-proxy (iptables)"] + direction LR + P1["Packet"] --> IPT["iptables
PREROUTING"] + IPT --> NAT["NAT chains
O(n) rule traversal"] + NAT --> DNAT1["DNAT to Pod"] + DNAT1 --> POD1["Pod"] + end + + subgraph EBPF["Cilium (eBPF)"] + direction LR + P2["Packet"] --> BPF["eBPF program
(TC/XDP)"] + BPF --> MAP["eBPF map lookup
O(1) hash"] + MAP --> DNAT2["DNAT"] + DNAT2 --> POD2["Pod"] + end +``` + +Key differences: + +| Aspect | kube-proxy (iptables) | Cilium (eBPF) | +| --- | --- | --- | +| Lookup complexity | O(n) rule traversal | O(1) hash-based lookup | +| Execution context | Userspace overhead | Native in-kernel | +| Context switches | Required | None | +| Scalability | Degrades with service count | Constant performance | + +### eBPF Architecture + +```mermaid +flowchart TD + subgraph KERNEL["Kernel Space"] + subgraph BPF["eBPF Programs"] + TC["TC
(ingress/egress)"] + XDP["XDP
(fastest path)"] + SOCK["Socket-level
(connect, sendmsg)"] + end + + subgraph MAPS["eBPF Maps"] + SVC["Service Tables"] + EP["Endpoint Maps"] + POL["Policy Maps"] + end + + TC --> MAPS + XDP --> MAPS + SOCK --> MAPS + end +``` + +## Tenant Isolation with Kube-OVN and Cilium + +In a multi-tenant Cozystack cluster, all tenants share the same Pod CIDR. This is secure because isolation is enforced by Cilium eBPF policies at the kernel level, not by network segmentation. Tenants cannot communicate even though they share the same IP pool. Kube-OVN allocates IPs from this shared pool centrally, without per-node CIDR splitting. + +### CNI Architecture + +```mermaid +flowchart TD + subgraph KO["Kube-OVN"] + IPAM["Centralized IPAM — Shared pool 10.244.0.0/16"] + OVN["OVN/OVS Overlay Network (GENEVE)"] + SUBNET["Subnet management per namespace/tenant"] + end + + subgraph CIL["Cilium"] + POLICY["eBPF Network Policies"] + SVCBAL["Service Load Balancing (kube-proxy replacement)"] + IDENT["Identity-based Security"] + HUB["Observability via Hubble"] + end + + KO --> CIL +``` + +Kube-OVN provides the primary CNI plugin for pod networking and IPAM. Kube-OVN's +own network policy engine is disabled (`ENABLE_NP: false`), and all policy +enforcement is delegated to Cilium. Cilium operates as a chained CNI component +(`generic-veth` mode) that enforces network policies via eBPF and replaces +kube-proxy for service load balancing. + +### Tenant Isolation Model + +```mermaid +flowchart TD + TA["Tenant A — Namespace app-a
Pods: 10.244.0.10, 10.244.0.11"] + TB["Tenant B — Namespace app-b
Pods: 10.244.1.20, 10.244.1.21"] + TC["Tenant C — Namespace app-c
Pods: 10.244.2.30, 10.244.2.31"] + + ENGINE{"Cilium eBPF Policy Engine"} + + TA --> ENGINE + TB --> ENGINE + TC --> ENGINE + + ENGINE -->|"A ↔ A — ALLOWED"| ALLOW["Same-tenant traffic passes"] + ENGINE -->|"A ↔ B — DENIED"| DENY["Cross-tenant traffic dropped"] +``` + +### Identity-based Security + +Cilium assigns each endpoint (pod) a **security identity** based on its labels. Policies are enforced using these identities rather than IP addresses. + +```mermaid +flowchart LR + POD["Pod: frontend-abc123
Labels: app=frontend,
tenant=acme, env=prod"] + AGENT["Cilium Agent
Hash(labels) → Identity: 12345"] + BPFMAP["eBPF Map
10.244.0.10 → Identity 12345"] + + POD --> AGENT + AGENT --> BPFMAP +``` + +### Policy Enforcement in Kernel + +When a packet is sent between pods, Cilium enforces policies entirely within kernel space: + +```mermaid +flowchart TD + PKT["Packet: 10.244.0.10 → 10.244.1.20"] + STEP1["1. Lookup source identity:
10.244.0.10 → ID 12345 (tenant-a)"] + STEP2["2. Lookup destination identity:
10.244.1.20 → ID 67890 (tenant-b)"] + STEP3["3. Check policy map:
(12345, 67890, TCP, 80) → DENY"] + DROP["4. DROP packet"] + + PKT --> STEP1 --> STEP2 --> STEP3 --> DROP +``` + +All of this happens in kernel space in approximately 100 nanoseconds. + +### Why eBPF Enforcement is Secure + +| Property | Description | +| --- | --- | +| **Verifier** | eBPF programs are verified before loading — no crashes, no infinite loops | +| **Isolation** | Programs run in a restricted kernel context | +| **No userspace bypass** | All network traffic must pass through eBPF hooks | +| **Atomic updates** | Policy changes are atomic — no race conditions | +| **In-kernel** | No context switches needed, faster than userspace | + +### Kernel-level Enforcement + +```mermaid +flowchart TD + subgraph US["User Space"] + PODA["Pod A
(Tenant A)"] + PODB["Pod B
(Tenant B)"] + NOTE["Cannot bypass policy —
traffic MUST go through kernel"] + end + + subgraph KS["Kernel Space"] + EBPF["eBPF Programs
• Attached to network interfaces
• Run in privileged kernel context
• Verified by kernel
• Cannot be bypassed by userspace
• Atomic policy updates"] + end + + US -->|"all traffic"| KS +``` + +### Default Deny with Namespace Isolation + +{{% alert color="warning" %}} +By default, Kubernetes allows all pod-to-pod traffic. Cozystack applies +CiliumNetworkPolicy and CiliumClusterwideNetworkPolicy resources automatically +when a tenant is created. These policies enforce namespace-level isolation and +restrict access to system ports (etcd, kubelet, controllers). +{{% /alert %}} + +Cozystack uses hierarchical tenant labels for isolation. Policies match on +`tenant.cozystack.io/*` namespace labels, which allows parent tenants to +include sub-tenant namespaces. Example: + +```yaml +apiVersion: cilium.io/v2 +kind: CiliumNetworkPolicy +metadata: + name: allow-internal-communication + namespace: tenant-example +spec: + endpointSelector: {} + ingress: + - fromEndpoints: + - matchLabels: + k8s:io.cilium.k8s.namespace.labels.tenant.cozystack.io/tenant-example: "" + egress: + - toEndpoints: + - matchLabels: + k8s:io.cilium.k8s.namespace.labels.tenant.cozystack.io/tenant-example: "" + - toEntities: + - kube-apiserver + - cluster +``` + +## Observability with Hubble + +Hubble provides network traffic visibility for the Cilium data plane. It is +included in the Cozystack networking stack but **disabled by default** to +minimize resource usage. + +When enabled, Hubble provides: + +- Real-time flow logs for all pod-to-pod and external traffic +- DNS query visibility +- HTTP/gRPC request-level metrics +- Prometheus metrics integration +- Web UI for traffic visualization + +To enable Hubble, set the following in the Cilium configuration: + +```yaml +cilium: + hubble: + enabled: true + relay: + enabled: true + ui: + enabled: true +``` + +See [Enabling Hubble](https://docs.cilium.io/en/stable/observability/hubble/) for full configuration details. + +## Traffic Flow Summary + +### External Access + +```mermaid +flowchart LR + C["Client"] --> R["Router"] + R --> M["MetalLB
(L2/BGP)"] + M --> N["Node"] + N --> E["Cilium eBPF"] + E --> P["Pod"] +``` + +### Tenant Isolation + +```mermaid +flowchart LR + A["Pod A"] --> CHECK{"eBPF
Policy Check"} + CHECK -->|"Cross-tenant"| DENY["DENY"] + CHECK -->|"Same tenant"| ALLOW["ALLOW → Pod A'"] +``` diff --git a/content/en/docs/v1/networking/architecture.md b/content/en/docs/v1/networking/architecture.md index 53cc9825..5d51ed30 100644 --- a/content/en/docs/v1/networking/architecture.md +++ b/content/en/docs/v1/networking/architecture.md @@ -252,7 +252,7 @@ flowchart TD ## Tenant Isolation with Kube-OVN and Cilium -In a multi-tenant Cozystack cluster, all tenants share the same Pod CIDR. This is secure because isolation is enforced by Cilium eBPF policies at the kernel level, not by network segmentation. Tenants cannot communicate even though they share the same IP pool. Kube-OVN manages a single shared IP pool (no per-node splitting), and Cilium enforces strong isolation using eBPF-based network policies. +In a multi-tenant Cozystack cluster, all tenants share the same Pod CIDR. This is secure because isolation is enforced by Cilium eBPF policies at the kernel level, not by network segmentation. Tenants cannot communicate even though they share the same IP pool. Kube-OVN allocates IPs from this shared pool centrally, without per-node CIDR splitting. ### CNI Architecture @@ -358,7 +358,6 @@ flowchart TD ### Default Deny with Namespace Isolation - {{% alert color="warning" %}} By default, Kubernetes allows all pod-to-pod traffic. Cozystack applies CiliumNetworkPolicy and CiliumClusterwideNetworkPolicy resources automatically @@ -417,7 +416,7 @@ cilium: enabled: true ``` -See Enabling Hubble for full configuration details. +See [Enabling Hubble](https://docs.cilium.io/en/stable/observability/hubble/) for full configuration details. ## Traffic Flow Summary From d1857488fadc4b04075533ef170aa86262e23b1c Mon Sep 17 00:00:00 2001 From: Ivan Stukov Date: Fri, 20 Feb 2026 19:10:41 +0500 Subject: [PATCH 4/4] docs(networking): add cozy-proxy and multus to architecture overview Co-Authored-By: Claude Signed-off-by: Ivan Stukov --- content/en/docs/v0/networking/architecture.md | 442 ------------------ content/en/docs/v1/networking/architecture.md | 4 + 2 files changed, 4 insertions(+), 442 deletions(-) delete mode 100644 content/en/docs/v0/networking/architecture.md diff --git a/content/en/docs/v0/networking/architecture.md b/content/en/docs/v0/networking/architecture.md deleted file mode 100644 index 6568e9eb..00000000 --- a/content/en/docs/v0/networking/architecture.md +++ /dev/null @@ -1,442 +0,0 @@ ---- -title: "Network Architecture" -linkTitle: "Architecture" -description: "Overview of Cozystack cluster network architecture: MetalLB load balancing, Cilium eBPF networking, and tenant isolation with Kube-OVN." -weight: 5 -aliases: - - /docs/v0/reference/applications/architecture - - /docs/reference/applications/architecture ---- - -## Overview - -Cozystack uses a multi-layered networking stack designed for bare-metal Kubernetes clusters. 
The architecture combines several components, each responsible for a specific layer of the network: - -| Layer | Component | Purpose | -| --- | --- | --- | -| External load balancing | MetalLB | Publishing services to external networks | -| Service load balancing | Cilium eBPF | kube-proxy replacement, in-kernel DNAT | -| Network policies | Cilium eBPF | Tenant isolation and security enforcement | -| Pod networking (CNI) | Kube-OVN | Centralized IPAM, overlay networking | -| Observability | Hubble (optional) | Network traffic visibility (disabled by default) | - -```mermaid -flowchart TD - EXT["External Clients"] - RTR["Upstream Router / Gateway"] - MLB["MetalLB
(L2 ARP / BGP)"] - CIL["Cilium eBPF
(Service Load Balancing + Network Policies)"] - OVN["Kube-OVN
(Pod Networking + IPAM)"] - PODS["Pods"] - - EXT --> RTR - RTR --> MLB - MLB --> CIL - CIL --> OVN - OVN --> PODS -``` - -## Cluster Network Configuration - -| Parameter | Default Value | -| --- | --- | -| Pod CIDR | 10.244.0.0/16 | -| Service CIDR | 10.96.0.0/16 | -| Join CIDR | 100.64.0.0/16 | -| Cluster domain | cozy.local | -| Overlay type | GENEVE | -| CNI | Kube-OVN | -| kube-proxy replacement | Cilium eBPF | - -### Networking Stack Variants - -Cozystack supports several networking stack variants to accommodate different -cluster types. The variant is selected via `bundles.system.variant` in the -platform configuration. - -| Variant | Components | Target Platform | -| --- | --- | --- | -| `kubeovn-cilium` | Kube-OVN + Cilium (default) | Talos Linux | -| `kubeovn-cilium-generic` | Kube-OVN + Cilium | kubeadm, k3s, RKE2 | -| `cilium` | Cilium only | Talos Linux | -| `cilium-generic` | Cilium only | kubeadm, k3s, RKE2 | -| `noop` | None (bring your own CNI) | Any | - -In Kube-OVN variants, Cilium operates as a chained CNI (`generic-veth` mode): -Kube-OVN handles pod networking and IPAM, while Cilium provides service load -balancing, network policy enforcement, and optional observability via Hubble. - -In Cilium-only variants, Cilium serves as both the CNI and the service load -balancer. - -{{% alert color="info" %}} -The rest of this document describes the default `kubeovn-cilium` variant. -{{% /alert %}} - -### Pod CIDR Allocation (Kube-OVN) - -Kube-OVN uses a **shared Pod CIDR** model: - -- All pods draw from a single shared IP pool (10.244.0.0/16) -- IP addresses are allocated centrally through Kube-OVN's IPAM -- There is no per-node CIDR splitting (unlike Calico or Flannel) -- Because IPs are not tied to node-specific CIDR blocks, pods can be rescheduled to different nodes while retaining their addresses -- Inter-node pod communication uses GENEVE tunnels (Join CIDR: 100.64.0.0/16) - -## External Traffic Ingress with MetalLB - -MetalLB is a load balancer implementation for bare-metal Kubernetes clusters. It assigns external IP addresses to Services of type `LoadBalancer`, allowing external traffic to reach the cluster. - -```mermaid -flowchart TD - CLIENT["External Client"] - RTR["Upstream Router"] - - subgraph CLUSTER["Kubernetes Cluster"] - S1["Node 1
MetalLB Speaker"] - S2["Node 2
MetalLB Speaker"] - S3["Node 3
MetalLB Speaker"] - CIL["Cilium (eBPF)
Service Load Balancing
DNAT to Pod IP"] - POD["Target Pod
(Pod CIDR)"] - end - - CLIENT -->|"Traffic to external IP
(e.g. 10.x.x.20)"| RTR - RTR -->|"L2 (ARP) or BGP"| S1 - RTR -->|"L2 (ARP) or BGP"| S2 - RTR -->|"L2 (ARP) or BGP"| S3 - S1 --> CIL - S2 --> CIL - S3 --> CIL - CIL --> POD -``` - -### Layer 2 Mode (ARP) - -In L2 mode, MetalLB responds to ARP requests for the Service's external IP. A single node becomes the "leader" for that IP and receives all traffic. - -How it works: - -1. A MetalLB speaker on one node claims the external IP -2. The speaker responds to ARP requests: "IP X is at MAC aa:bb:cc:dd:ee:ff" -3. All traffic for that IP goes to the leader node -4. Cilium on the node performs DNAT to the actual pod - -```mermaid -sequenceDiagram - participant C as Client - participant L as Node (MetalLB Leader) - participant CIL as Cilium (eBPF) - participant P as Pod - - C->>L: ARP: Who has 10.x.x.20? - L-->>C: ARP Reply: 10.x.x.20 is at aa:bb:cc:dd:ee:ff - C->>L: Send traffic to 10.x.x.20 - L->>CIL: Packet enters kernel - CIL->>P: DNAT → Pod 10.244.x.x:8080 -``` - -{{% alert color="info" %}} -In L2 mode, only one node handles traffic for a given Service IP. Failover occurs if the leader node goes down, but there is no true load balancing across nodes for a single Service. -{{% /alert %}} - -### BGP Mode - -In BGP mode, MetalLB establishes BGP sessions with upstream routers and announces /32 routes for Service IPs. This enables true ECMP load balancing across nodes. - -How it works: - -1. MetalLB speakers establish BGP sessions with the upstream router(s) -2. Each speaker announces the Service IP as a /32 route -3. The router has multiple next-hops for the same prefix -4. ECMP distributes traffic across nodes -5. Cilium on the receiving node performs DNAT to the actual pod - -```mermaid -sequenceDiagram - participant S1 as Node 1 (Speaker) - participant S2 as Node 2 (Speaker) - participant S3 as Node 3 (Speaker) - participant R as Upstream Router - participant CIL as Cilium (eBPF) - participant P as Pod - - S1->>R: BGP UPDATE: 10.x.x.20/32 via Node 1 - S2->>R: BGP UPDATE: 10.x.x.20/32 via Node 2 - S3->>R: BGP UPDATE: 10.x.x.20/32 via Node 3 - Note over R: ECMP: 3 next-hops for 10.x.x.20/32 - R->>S1: Traffic (1/3) - R->>S2: Traffic (1/3) - R->>S3: Traffic (1/3) - S1->>CIL: Packet enters kernel - CIL->>P: DNAT → Pod -``` - -### VLAN Integration for External Traffic - -External traffic can be delivered to the cluster through additional VLANs (client VLANs, DMZ, public networks, etc.) which are then routed to services via MetalLB and Cilium. - -```mermaid -flowchart TD - EXT["External Traffic"] - - subgraph VLANs["Additional VLANs
(Client, DMZ, Public, etc.)"] - V1["VLAN A"] - V2["VLAN B"] - end - - subgraph LB["MetalLB"] - L2["L2 Mode → Service → Pod"] - BGP["BGP Mode → Service → Pod"] - end - - EXT --> VLANs - V1 --> L2 - V2 --> BGP -``` - -## Cilium as kube-proxy Replacement - -Cilium replaces kube-proxy by attaching eBPF programs directly in the Linux kernel. This provides more efficient packet processing and advanced capabilities. - -### Traditional kube-proxy (iptables) vs Cilium eBPF - -```mermaid -flowchart LR - subgraph IPTABLES["kube-proxy (iptables)"] - direction LR - P1["Packet"] --> IPT["iptables
PREROUTING"] - IPT --> NAT["NAT chains
O(n) rule traversal"] - NAT --> DNAT1["DNAT to Pod"] - DNAT1 --> POD1["Pod"] - end - - subgraph EBPF["Cilium (eBPF)"] - direction LR - P2["Packet"] --> BPF["eBPF program
(TC/XDP)"] - BPF --> MAP["eBPF map lookup
O(1) hash"] - MAP --> DNAT2["DNAT"] - DNAT2 --> POD2["Pod"] - end -``` - -Key differences: - -| Aspect | kube-proxy (iptables) | Cilium (eBPF) | -| --- | --- | --- | -| Lookup complexity | O(n) rule traversal | O(1) hash-based lookup | -| Execution context | Userspace overhead | Native in-kernel | -| Context switches | Required | None | -| Scalability | Degrades with service count | Constant performance | - -### eBPF Architecture - -```mermaid -flowchart TD - subgraph KERNEL["Kernel Space"] - subgraph BPF["eBPF Programs"] - TC["TC
(ingress/egress)"] - XDP["XDP
(fastest path)"] - SOCK["Socket-level
(connect, sendmsg)"] - end - - subgraph MAPS["eBPF Maps"] - SVC["Service Tables"] - EP["Endpoint Maps"] - POL["Policy Maps"] - end - - TC --> MAPS - XDP --> MAPS - SOCK --> MAPS - end -``` - -## Tenant Isolation with Kube-OVN and Cilium - -In a multi-tenant Cozystack cluster, all tenants share the same Pod CIDR. This is secure because isolation is enforced by Cilium eBPF policies at the kernel level, not by network segmentation. Tenants cannot communicate even though they share the same IP pool. Kube-OVN allocates IPs from this shared pool centrally, without per-node CIDR splitting. - -### CNI Architecture - -```mermaid -flowchart TD - subgraph KO["Kube-OVN"] - IPAM["Centralized IPAM — Shared pool 10.244.0.0/16"] - OVN["OVN/OVS Overlay Network (GENEVE)"] - SUBNET["Subnet management per namespace/tenant"] - end - - subgraph CIL["Cilium"] - POLICY["eBPF Network Policies"] - SVCBAL["Service Load Balancing (kube-proxy replacement)"] - IDENT["Identity-based Security"] - HUB["Observability via Hubble"] - end - - KO --> CIL -``` - -Kube-OVN provides the primary CNI plugin for pod networking and IPAM. Kube-OVN's -own network policy engine is disabled (`ENABLE_NP: false`), and all policy -enforcement is delegated to Cilium. Cilium operates as a chained CNI component -(`generic-veth` mode) that enforces network policies via eBPF and replaces -kube-proxy for service load balancing. - -### Tenant Isolation Model - -```mermaid -flowchart TD - TA["Tenant A — Namespace app-a
Pods: 10.244.0.10, 10.244.0.11"] - TB["Tenant B — Namespace app-b
Pods: 10.244.1.20, 10.244.1.21"] - TC["Tenant C — Namespace app-c
Pods: 10.244.2.30, 10.244.2.31"] - - ENGINE{"Cilium eBPF Policy Engine"} - - TA --> ENGINE - TB --> ENGINE - TC --> ENGINE - - ENGINE -->|"A ↔ A — ALLOWED"| ALLOW["Same-tenant traffic passes"] - ENGINE -->|"A ↔ B — DENIED"| DENY["Cross-tenant traffic dropped"] -``` - -### Identity-based Security - -Cilium assigns each endpoint (pod) a **security identity** based on its labels. Policies are enforced using these identities rather than IP addresses. - -```mermaid -flowchart LR - POD["Pod: frontend-abc123
Labels: app=frontend,
tenant=acme, env=prod"] - AGENT["Cilium Agent
Hash(labels) → Identity: 12345"] - BPFMAP["eBPF Map
10.244.0.10 → Identity 12345"] - - POD --> AGENT - AGENT --> BPFMAP -``` - -### Policy Enforcement in Kernel - -When a packet is sent between pods, Cilium enforces policies entirely within kernel space: - -```mermaid -flowchart TD - PKT["Packet: 10.244.0.10 → 10.244.1.20"] - STEP1["1. Lookup source identity:
10.244.0.10 → ID 12345 (tenant-a)"] - STEP2["2. Lookup destination identity:
10.244.1.20 → ID 67890 (tenant-b)"] - STEP3["3. Check policy map:
(12345, 67890, TCP, 80) → DENY"] - DROP["4. DROP packet"] - - PKT --> STEP1 --> STEP2 --> STEP3 --> DROP -``` - -All of this happens in kernel space in approximately 100 nanoseconds. - -### Why eBPF Enforcement is Secure - -| Property | Description | -| --- | --- | -| **Verifier** | eBPF programs are verified before loading — no crashes, no infinite loops | -| **Isolation** | Programs run in a restricted kernel context | -| **No userspace bypass** | All network traffic must pass through eBPF hooks | -| **Atomic updates** | Policy changes are atomic — no race conditions | -| **In-kernel** | No context switches needed, faster than userspace | - -### Kernel-level Enforcement - -```mermaid -flowchart TD - subgraph US["User Space"] - PODA["Pod A
(Tenant A)"] - PODB["Pod B
(Tenant B)"] - NOTE["Cannot bypass policy —
traffic MUST go through kernel"] - end - - subgraph KS["Kernel Space"] - EBPF["eBPF Programs
• Attached to network interfaces
• Run in privileged kernel context
• Verified by kernel
• Cannot be bypassed by userspace
• Atomic policy updates"] - end - - US -->|"all traffic"| KS -``` - -### Default Deny with Namespace Isolation - -{{% alert color="warning" %}} -By default, Kubernetes allows all pod-to-pod traffic. Cozystack applies -CiliumNetworkPolicy and CiliumClusterwideNetworkPolicy resources automatically -when a tenant is created. These policies enforce namespace-level isolation and -restrict access to system ports (etcd, kubelet, controllers). -{{% /alert %}} - -Cozystack uses hierarchical tenant labels for isolation. Policies match on -`tenant.cozystack.io/*` namespace labels, which allows parent tenants to -include sub-tenant namespaces. Example: - -```yaml -apiVersion: cilium.io/v2 -kind: CiliumNetworkPolicy -metadata: - name: allow-internal-communication - namespace: tenant-example -spec: - endpointSelector: {} - ingress: - - fromEndpoints: - - matchLabels: - k8s:io.cilium.k8s.namespace.labels.tenant.cozystack.io/tenant-example: "" - egress: - - toEndpoints: - - matchLabels: - k8s:io.cilium.k8s.namespace.labels.tenant.cozystack.io/tenant-example: "" - - toEntities: - - kube-apiserver - - cluster -``` - -## Observability with Hubble - -Hubble provides network traffic visibility for the Cilium data plane. It is -included in the Cozystack networking stack but **disabled by default** to -minimize resource usage. - -When enabled, Hubble provides: - -- Real-time flow logs for all pod-to-pod and external traffic -- DNS query visibility -- HTTP/gRPC request-level metrics -- Prometheus metrics integration -- Web UI for traffic visualization - -To enable Hubble, set the following in the Cilium configuration: - -```yaml -cilium: - hubble: - enabled: true - relay: - enabled: true - ui: - enabled: true -``` - -See [Enabling Hubble](https://docs.cilium.io/en/stable/observability/hubble/) for full configuration details. - -## Traffic Flow Summary - -### External Access - -```mermaid -flowchart LR - C["Client"] --> R["Router"] - R --> M["MetalLB
(L2/BGP)"] - M --> N["Node"] - N --> E["Cilium eBPF"] - E --> P["Pod"] -``` - -### Tenant Isolation - -```mermaid -flowchart LR - A["Pod A"] --> CHECK{"eBPF
Policy Check"} - CHECK -->|"Cross-tenant"| DENY["DENY"] - CHECK -->|"Same tenant"| ALLOW["ALLOW → Pod A'"] -``` diff --git a/content/en/docs/v1/networking/architecture.md b/content/en/docs/v1/networking/architecture.md index 5d51ed30..ed291737 100644 --- a/content/en/docs/v1/networking/architecture.md +++ b/content/en/docs/v1/networking/architecture.md @@ -5,6 +5,7 @@ description: "Overview of Cozystack cluster network architecture: MetalLB load b weight: 5 aliases: - /docs/v1/reference/applications/architecture + - /docs/reference/applications/architecture --- ## Overview @@ -17,6 +18,8 @@ Cozystack uses a multi-layered networking stack designed for bare-metal Kubernet | Service load balancing | Cilium eBPF | kube-proxy replacement, in-kernel DNAT | | Network policies | Cilium eBPF | Tenant isolation and security enforcement | | Pod networking (CNI) | Kube-OVN | Centralized IPAM, overlay networking | +| VM IP passthrough | [cozy-proxy](https://github.com/cozystack/cozy-proxy/) | Passing through external IPs into virtual machines | +| VM secondary interfaces | [Multus CNI](https://github.com/k8snetworkplumbingwg/multus-cni) | Attaching secondary L2 interfaces to virtual machines | | Observability | Hubble (optional) | Network traffic visibility (disabled by default) | ```mermaid @@ -59,6 +62,7 @@ platform configuration. | `kubeovn-cilium-generic` | Kube-OVN + Cilium | kubeadm, k3s, RKE2 | | `cilium` | Cilium only | Talos Linux | | `cilium-generic` | Cilium only | kubeadm, k3s, RKE2 | +| `cilium-kilo` | Cilium + Kilo | Talos Linux | | `noop` | None (bring your own CNI) | Any | In Kube-OVN variants, Cilium operates as a chained CNI (`generic-veth` mode):