Technical Deep Dive
This page unpacks the technical specifications of the On-Premise Kubernetes Platform. If you’re looking for the design rationale and tradeoffs, start with the On-Premise Kubernetes Platform overview instead.
Production context: This cluster runs a multi-tenant SIEM platform that accepts logs from over 1,500 remote agents across WAN boundaries, sustaining 6,000–10,000 logs per second around the clock. The cluster has maintained 100% uptime over the past 90 days (as of this writing).
Physical Infrastructure: Proxmox Hyperconverged Cluster
The Kubernetes cluster runs on virtual machines provisioned on a 5-node Proxmox hyperconverged cluster. Understanding the physical layer matters because it sets the constraints on what the Kubernetes cluster can do—and explains why I distribute virtual nodes the way I do.
The hyperconverged design means adding a Proxmox node automatically expands both compute capacity and Ceph storage. This is intentional—I wanted infrastructure that scales linearly without separate storage procurement.
Physical Servers
High-Performance Tier (2 servers):
| Model | CPU | RAM | Role |
|---|---|---|---|
| Dell PowerEdge R7525 | 2x AMD EPYC 7763 64-Core (256 threads) | 1,007 GB | GPU workloads, Pool C workers |
| Supermicro Super Server | 2x AMD EPYC 7742 64-Core (256 threads) | 1,007 GB | Pool B workers |
General-Purpose Tier (3 servers):
| Model | CPU | RAM | Role |
|---|---|---|---|
| Supermicro AS-1024US-TRT | 2x AMD EPYC 7532 32-Core (128 threads) | 503 GB | Pool A workers, Control plane, Ceph MON/MGR |
Aggregate Physical Resources
| Resource | Total |
|---|---|
| CPU Threads | 896 |
| Total RAM | 3,523 GB (~3.5 TB) |
| GPU | 1x NVIDIA A100 40GB |
Ceph Distributed Storage
| Component | Details |
|---|---|
| Monitors | 4 (distributed across hosts) |
| Managers | 5 (one active, rest standby) |
| MDS | 2 active, 2 standby |
| OSDs | 20 (all up and in) |
| Total Capacity | 62 TiB |
| Objects | 3.82M |
Virtualization Strategy: Blast Radius Containment
With 5 physical servers and 21 Kubernetes nodes, I had choices about how to distribute virtual machines. The distribution is designed around blast radius containment.
Kubernetes provides resilience regardless of whether a virtual node or a physical host goes down—pods get rescheduled to surviving nodes either way. But splitting each physical server into multiple virtual nodes reduces the blast radius when a single Kubernetes node fails. If I ran one massive VM per physical host, losing that VM (kernel panic, misconfiguration, failed upgrade) would take a large chunk of cluster capacity offline. By running multiple smaller VMs per host, a single VM failure only loses a fraction of that host’s resources.
The tradeoff is overhead: more VMs means more operating system instances, more memory reserved for each VM’s kernel, and more coordination. I sized the VMs to balance blast radius against that overhead—large enough to run meaningful workloads efficiently, small enough that losing one doesn’t cascade.
- Control plane nodes are distributed one-per-host across the three general-purpose servers. A single host failure loses one of three control plane nodes, leaving quorum intact.
- Pool A workers are spread 3-per-host across the same three servers. A host failure loses 3 of 9 workers; a single VM failure loses 1 of 9.
- Pool B and Pool C run on dedicated high-performance hosts. These pools run workloads with application-level redundancy (database replicas, distributed caches) that can tolerate node-level failures.
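The blast-radius tradeoff above is easy to quantify. A quick sketch in plain Python (the helper name is mine, not part of the platform):

```python
def capacity_lost(total_nodes: int, nodes_affected: int) -> float:
    """Fraction of a pool's capacity lost when some of its nodes go down."""
    return nodes_affected / total_nodes

# Pool A: 9 workers spread 3-per-host across 3 hosts.
one_vm = capacity_lost(9, 1)    # single VM failure: ~11% of the pool
one_host = capacity_lost(9, 3)  # whole host failure: ~33% of the pool

# Alternative layout: one giant VM per host (3 nodes total).
# Now even a single VM failure costs a full third of the pool.
monolith_vm = capacity_lost(3, 1)
```

The host-failure cost is identical in both layouts (a third of the pool either way); the multiple-VM layout only shrinks the cost of the far more common single-VM failure.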
Kubernetes Cluster Specifications
| Property | Value |
|---|---|
| Kubernetes Version | v1.32.0 |
| Talos Version | v1.11.6 |
| Total Nodes | 21 |
| Control Plane Nodes | 3 |
| Worker Nodes | 18 |
Aggregate Resources
| Resource | Total |
|---|---|
| vCPUs | 456 |
| RAM | 2,688 GB (~2.69 TB) |
| GPU | 1x NVIDIA A100 40GB |
Control Plane Design
I keep the control plane minimal and dedicated—these nodes run etcd and the Kubernetes API server, not workloads. The sizing is intentionally modest because the control plane isn’t where compute-intensive work happens.
| Property | Value |
|---|---|
| Nodes | 3 |
| Resources per node | 8 vCPUs, 32 GB RAM |
| Total | 24 vCPUs, 96 GB RAM |
The three-node control plane provides quorum for etcd: one node can fail and the remaining two still form a majority, so the cluster keeps making scheduling decisions. Losing two of three would break quorum. I distribute them across different Proxmox hosts to ensure a single host failure doesn’t take out the control plane.
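The quorum arithmetic generalizes to any member count. A minimal sketch in plain Python (the function name is mine):

```python
def etcd_fault_tolerance(members: int) -> int:
    """Number of etcd members that can fail while a majority survives."""
    quorum = members // 2 + 1  # majority of the member set
    return members - quorum

# 3 members tolerate 1 failure; growing to 5 is needed to tolerate 2.
# Note that 4 members tolerate no more failures than 3 do, which is
# why even-sized etcd clusters are rarely worth the extra node.
```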
Worker Pool Design
I run three worker pools with distinct resource profiles and scheduling characteristics. The reasoning: not all workloads have the same shape, and trying to run everything on a homogeneous pool means either over-provisioning everywhere or starving some workloads.
Pool A – General Purpose (9 nodes)
| Property | Value |
|---|---|
| Distribution | 3 workers per host, spread across 3 hosts |
| Resources per node | 16 vCPUs, 96 GB RAM |
| Labels | node.kubernetes.io/pool=pool-a, node.kubernetes.io/workload=general-purpose |
| Total | 144 vCPUs, 864 GB RAM |
This pool handles the bulk of workloads—web services, background jobs, platform services. The nodes are sized to run multiple medium-sized pods without contention, and spreading across three hosts provides resilience against single-host failures.
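Workloads land on this pool by selecting the pool label from the table above. A sketch of a Deployment pinned to Pool A (the app name and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-web            # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-web
  template:
    metadata:
      labels:
        app: example-web
    spec:
      nodeSelector:
        node.kubernetes.io/pool: pool-a   # label from the table above
      containers:
        - name: web
          image: nginx:1.27               # placeholder image
```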
Pool B – High Performance (4 nodes)
| Property | Value |
|---|---|
| Hardware | AMD EPYC 7742 hosts |
| Resources per node | 32 vCPUs, 192 GB RAM |
| Labels | node.kubernetes.io/pool=pool-b, node.kubernetes.io/workload=high-performance |
| Features | NUMA enabled |
| Total | 128 vCPUs, 768 GB RAM |
This pool runs latency-sensitive and memory-intensive workloads—databases, caches, search indices. NUMA awareness helps here because these workloads benefit from memory locality. The larger per-node sizing means fewer pods per node, which reduces noisy-neighbor effects.
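Memory locality only pays off when the kubelet can actually align a pod to a NUMA node, which requires Guaranteed QoS (requests equal to limits, with integer CPU counts) plus a static CPU-manager and Topology Manager policy on the node. A sketch of such a pod spec, assuming those node-level policies are configured (names and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-db             # placeholder name
spec:
  nodeSelector:
    node.kubernetes.io/pool: pool-b
  containers:
    - name: db
      image: postgres:16       # placeholder image
      resources:
        requests:
          cpu: "8"             # integer CPU count, required for pinning
          memory: 32Gi
        limits:                # equal to requests => Guaranteed QoS
          cpu: "8"
          memory: 32Gi
```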
Pool C – High Performance + GPU (5 nodes)
| Property | Value |
|---|---|
| Hardware | AMD EPYC 7763 hosts |
| Resources per node | 32 vCPUs, 192 GB RAM |
| GPU node | 1x NVIDIA A100 40GB (PCIe passthrough) |
| Labels | node.kubernetes.io/pool=pool-c, node.kubernetes.io/workload=high-performance |
| GPU Labels | nvidia.com/gpu.present=true |
| Taints (GPU) | nvidia.com/gpu=present:NoSchedule |
| Total | 160 vCPUs, 960 GB RAM, 1x A100 GPU |
The GPU taint ensures only workloads that explicitly request GPU resources get scheduled to the GPU node. Without this, the scheduler might place general workloads there and starve GPU workloads of the node’s CPU and memory.
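A GPU workload therefore needs two things: a toleration matching the taint, and an explicit GPU resource request so the device plugin allocates the card. A sketch (pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-gpu-job        # placeholder name
spec:
  tolerations:
    - key: nvidia.com/gpu      # matches the taint from the table above
      operator: Equal
      value: present
      effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1    # requests the A100 via the device plugin
```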
Storage Classes
I use multiple storage backends because different workloads have different storage requirements. A database wants block storage with strong consistency; a shared config directory wants a filesystem that multiple pods can mount simultaneously.
| Storage Class | Type | Default | Use Case |
|---|---|---|---|
| ceph-rbd | Block (RBD) | Yes | General workloads requiring persistent block storage |
| cephfs | Filesystem | No | Shared storage (RWX) for distributed workloads |
| truenas-iscsi | iSCSI | No | TrueNAS-backed storage for specific performance profiles |
Ceph Integration (Rook)
Ceph provides the primary storage tier, integrated via Rook’s CSI driver. The underlying Ceph cluster runs on the same Proxmox hosts, which means storage performance scales with compute—adding a node improves both.
- RBD (Block): ReadWriteOnce volumes for databases, stateful workloads
- CephFS (File): ReadWriteMany volumes for shared data across pods
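Picking a backend is just a matter of naming the storage class on the claim. Two sketches, one RWO block volume and one RWX shared volume (claim names and sizes are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data                 # placeholder name
spec:
  storageClassName: ceph-rbd    # block storage, single-node attach
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-config           # placeholder name
spec:
  storageClassName: cephfs      # filesystem, multi-node attach
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 10Gi
```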
TrueNAS Integration (Democratic CSI)
TrueNAS provides an alternative storage tier via the Democratic CSI driver. I use this for workloads that benefit from ZFS features (snapshots, clones) or need a different performance profile than Ceph provides.
Related
- On-Premise Kubernetes Platform — Design rationale and tradeoffs
- Observability Platform — How I monitor this cluster