Technical Deep Dive
This page unpacks the technical specifications of the On-Premise Kubernetes Platform. If you’re looking for the design rationale and tradeoffs, start with the On-Premise Kubernetes Platform overview instead.
Production context: This cluster runs a multi-tenant SIEM platform that accepts logs from over 1,500 remote agents across WAN boundaries, sustaining 6,000–10,000 logs per second around the clock. The cluster has maintained 100% uptime over the past 90 days (as of this writing).
Physical Infrastructure: Proxmox Hyperconverged Cluster
The Kubernetes cluster runs on virtual machines provisioned on a 5-node Proxmox hyperconverged cluster. Understanding the physical layer matters because it sets the constraints on what the Kubernetes cluster can do—and explains why I distribute virtual nodes the way I do.
The hyperconverged design means adding a Proxmox node automatically expands both compute capacity and Ceph storage. This is intentional—I wanted infrastructure that scales linearly without separate storage procurement.
Physical Servers
High-Performance Tier (2 servers):
| Model | CPU | RAM | Role |
|---|---|---|---|
| Dell PowerEdge R7525 | 2x AMD EPYC 7763 64-Core (256 threads) | 1,007 GB | GPU workloads, Pool C workers |
| Supermicro Super Server | 2x AMD EPYC 7742 64-Core (256 threads) | 1,007 GB | Pool B workers |
General-Purpose Tier (3 servers):
| Model | CPU | RAM | Role |
|---|---|---|---|
| Supermicro AS-1024US-TRT | 2x AMD EPYC 7532 32-Core (128 threads) | 503 GB | Pool A workers, Control plane, Ceph MON/MGR |
Aggregate Physical Resources
| Resource | Total |
|---|---|
| CPU Threads | 896 |
| Total RAM | 3,523 GB (~3.5 TB) |
| GPU | 1x NVIDIA A100 40GB |
Ceph Distributed Storage
| Component | Details |
|---|---|
| Monitors | 4 (distributed across hosts) |
| Managers | 5 (one active, rest standby) |
| MDS | 2 active, 2 standby |
| OSDs | 20 (all up and in) |
| Total Capacity | 62 TiB |
| Objects | 3.82M |
Virtualization Strategy: Blast Radius Containment
With 5 physical servers and 21 Kubernetes nodes, I had choices about how to distribute virtual machines. The distribution is designed around blast radius containment.
Kubernetes provides resilience regardless of whether a virtual node or a physical host goes down—pods get rescheduled to surviving nodes either way. But splitting each physical server into multiple virtual nodes reduces the blast radius when a single Kubernetes node fails. If I ran one massive VM per physical host, losing that VM (kernel panic, misconfiguration, failed upgrade) would take a large chunk of cluster capacity offline. By running multiple smaller VMs per host, a single VM failure only loses a fraction of that host’s resources.
The tradeoff is overhead: more VMs means more operating system instances, more memory reserved for each VM’s kernel, and more coordination. I sized the VMs to balance blast radius against that overhead—large enough to run meaningful workloads efficiently, small enough that losing one doesn’t cascade.
- Control plane nodes are distributed one-per-host across the three general-purpose servers. A single host failure loses one of three control plane nodes, leaving quorum intact.
- Pool A workers are spread 3-per-host across the same three servers. A host failure loses 3 of 9 workers; a single VM failure loses 1 of 9.
- Pool B and Pool C run on dedicated high-performance hosts. These pools run workloads with application-level redundancy (database replicas, distributed caches) that can tolerate node-level failures.
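The blast-radius tradeoff above is easy to quantify. A quick sketch in plain Python (the helper name is mine, not part of the platform):

```python
def capacity_lost(total_nodes: int, nodes_affected: int) -> float:
    """Fraction of a pool's capacity lost when some of its nodes go down."""
    return nodes_affected / total_nodes

# Pool A: 9 workers spread 3-per-host across 3 hosts.
one_vm = capacity_lost(9, 1)    # single VM failure: ~11% of the pool
one_host = capacity_lost(9, 3)  # whole host failure: ~33% of the pool

# Alternative layout: one giant VM per host (3 nodes total).
# Now even a single VM failure costs a full third of the pool.
monolith_vm = capacity_lost(3, 1)
```

The host-failure cost is identical in both layouts (a third of the pool either way); the multiple-VM layout only shrinks the cost of the far more common single-VM failure.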
Kubernetes Cluster Specifications
| Property | Value |
|---|---|
| Kubernetes Version | v1.32.0 |
| Talos Version | v1.11.6 |
| Total Nodes | 21 |
| Control Plane Nodes | 3 |
| Worker Nodes | 18 |
Aggregate Resources
| Resource | Total |
|---|---|
| vCPUs | 456 |
| RAM | 2,688 GB (~2.69 TB) |
| GPU | 1x NVIDIA A100 40GB |
Control Plane Design
I keep the control plane minimal and dedicated—these nodes run etcd and the Kubernetes API server, not workloads. The sizing is intentionally modest because the control plane isn’t where compute-intensive work happens.
| Property | Value |
|---|---|
| Nodes | 3 |
| Resources per node | 8 vCPUs, 32 GB RAM |
| Total | 24 vCPUs, 96 GB RAM |
The three-node control plane provides quorum for etcd: one node can fail and the remaining two still form a majority, so the cluster keeps making scheduling decisions. Losing two of three would break quorum. I distribute them across different Proxmox hosts to ensure a single host failure doesn’t take out the control plane.
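The quorum arithmetic generalizes to any member count. A minimal sketch in plain Python (the function name is mine):

```python
def etcd_fault_tolerance(members: int) -> int:
    """Number of etcd members that can fail while a majority survives."""
    quorum = members // 2 + 1  # majority of the member set
    return members - quorum

# 3 members tolerate 1 failure; growing to 5 is needed to tolerate 2.
# Note that 4 members tolerate no more failures than 3 do, which is
# why even-sized etcd clusters are rarely worth the extra node.
```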
Worker Pool Design
I run three worker pools with distinct resource profiles and scheduling characteristics. The reasoning: not all workloads have the same shape, and trying to run everything on a homogeneous pool means either over-provisioning everywhere or starving some workloads.
Pool A – General Purpose (9 nodes)
| Property | Value |
|---|---|
| Distribution | 3 workers per host, spread across 3 hosts |
| Resources per node | 16 vCPUs, 96 GB RAM |
| Labels | node.kubernetes.io/pool=pool-a, node.kubernetes.io/workload=general-purpose |
| Total | 144 vCPUs, 864 GB RAM |
This pool handles the bulk of workloads—web services, background jobs, platform services. The nodes are sized to run multiple medium-sized pods without contention, and spreading across three hosts provides resilience against single-host failures.
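Workloads land on this pool by selecting the pool label from the table above. A sketch of a Deployment pinned to Pool A (the app name and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-web            # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-web
  template:
    metadata:
      labels:
        app: example-web
    spec:
      nodeSelector:
        node.kubernetes.io/pool: pool-a   # label from the table above
      containers:
        - name: web
          image: nginx:1.27               # placeholder image
```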
Pool B – High Performance (4 nodes)
| Property | Value |
|---|---|
| Hardware | AMD EPYC 7742 hosts |
| Resources per node | 32 vCPUs, 192 GB RAM |
| Labels | node.kubernetes.io/pool=pool-b, node.kubernetes.io/workload=high-performance |
| Features | NUMA enabled |
| Total | 128 vCPUs, 768 GB RAM |
This pool runs latency-sensitive and memory-intensive workloads—databases, caches, search indices. NUMA awareness helps here because these workloads benefit from memory locality. The larger per-node sizing means fewer pods per node, which reduces noisy-neighbor effects.
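Memory locality only pays off when the kubelet can actually align a pod to a NUMA node, which requires Guaranteed QoS (requests equal to limits, with integer CPU counts) plus a static CPU-manager and Topology Manager policy on the node. A sketch of such a pod spec, assuming those node-level policies are configured (names and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-db             # placeholder name
spec:
  nodeSelector:
    node.kubernetes.io/pool: pool-b
  containers:
    - name: db
      image: postgres:16       # placeholder image
      resources:
        requests:
          cpu: "8"             # integer CPU count, required for pinning
          memory: 32Gi
        limits:                # equal to requests => Guaranteed QoS
          cpu: "8"
          memory: 32Gi
```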
Pool C – High Performance + GPU (5 nodes)
| Property | Value |
|---|---|
| Hardware | AMD EPYC 7763 hosts |
| Resources per node | 32 vCPUs, 192 GB RAM |
| GPU node | 1x NVIDIA A100 40GB (PCIe passthrough) |
| Labels | node.kubernetes.io/pool=pool-c, node.kubernetes.io/workload=high-performance |
| GPU Labels | nvidia.com/gpu.present=true |
| Taints (GPU) | nvidia.com/gpu=present:NoSchedule |
| Total | 160 vCPUs, 960 GB RAM, 1x A100 GPU |
The GPU taint ensures only workloads that explicitly request GPU resources get scheduled to the GPU node. Without this, the scheduler might place general workloads there and starve GPU workloads of the node’s CPU and memory.
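A GPU workload therefore needs two things: a toleration matching the taint, and an explicit GPU resource request so the device plugin allocates the card. A sketch (pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-gpu-job        # placeholder name
spec:
  tolerations:
    - key: nvidia.com/gpu      # matches the taint from the table above
      operator: Equal
      value: present
      effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1    # requests the A100 via the device plugin
```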
Storage Classes
I use multiple storage backends because different workloads have different storage requirements. A database wants block storage with strong consistency; a shared config directory wants a filesystem that multiple pods can mount simultaneously.
| Storage Class | Type | Default | Use Case |
|---|---|---|---|
| ceph-rbd | Block (RBD) | Yes | General workloads requiring persistent block storage |
| cephfs | Filesystem | No | Shared storage (RWX) for distributed workloads |
| truenas-iscsi | iSCSI | No | TrueNAS-backed storage for specific performance profiles |
Ceph Integration (Rook)
Ceph provides the primary storage tier, integrated via Rook’s CSI driver. The underlying Ceph cluster runs on the same Proxmox hosts, which means storage performance scales with compute—adding a node improves both.
- RBD (Block): ReadWriteOnce volumes for databases, stateful workloads
- CephFS (File): ReadWriteMany volumes for shared data across pods
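Picking a backend is just a matter of naming the storage class on the claim. Two sketches, one RWO block volume and one RWX shared volume (claim names and sizes are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data                 # placeholder name
spec:
  storageClassName: ceph-rbd    # block storage, single-node attach
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-config           # placeholder name
spec:
  storageClassName: cephfs      # filesystem, multi-node attach
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 10Gi
```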
TrueNAS Integration (Democratic CSI)
TrueNAS provides an alternative storage tier via the Democratic CSI driver. I use this for workloads that benefit from ZFS features (snapshots, clones) or need a different performance profile than Ceph provides.
Related
- On-Premise Kubernetes Platform — Design rationale and tradeoffs
- Observability Platform — How I monitor this cluster