Sunday, October 5, 2025

CPU affinity and Kubernetes

These days I'm working on a fun project: migrating some trading components from other orchestrators to EKS. Pretty soon we hit a problem - network packet loss.

Applications that are part of a trading engine can be very picky about the underlying infrastructure, especially single-threaded low-latency components. They like CPU affinity and a fancy network setup with expensive network cards with kernel bypass, and they don't like interruptions.

The initial approach was checking the network configuration and drivers, upgrading the Solarflare firmware and the Onload version, and running sockperf tests. All looking decent. CPU affinity was configured as usual - isolcpus, nohz_full, P-states disabled, frequency scaling off, etc.
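
For reference, a minimal sketch of the kernel boot parameters this kind of setup usually involves - the core list 2-7 is an assumption, adjust it to the cores you actually isolate:

# /etc/default/grub (hypothetical isolated cores 2-7)
GRUB_CMDLINE_LINUX="isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7 intel_pstate=disable"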

Then I started looking into what the process was doing. Some of the findings were:

  • /proc/interrupts was showing interrupts from the network cards landing on the isolated cores, which was weird (see the checks below)
  • perf was showing CPU interruptions on the isolated cores
  • Solarflare support confirmed that the interrupts hitting the Onload stacks were the probable cause of the network loss
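
For reference, the kind of checks used to spot this - the core number and interface name are assumptions:

# Interrupt counters per CPU for the NIC queues (interface name is hypothetical)
watch -n1 "grep -E 'CPU|eth0' /proc/interrupts"

# Hardware IRQs and context switches hitting one of the isolated cores (core 2 assumed)
perf stat -C 2 -e irq:irq_handler_entry,context-switches sleep 10
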
The first issue was easily solved by configuring irqbalance to ban the isolated cores from its list with IRQBALANCE_BANNED_CPUS.
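
A minimal sketch, assuming the isolated cores are 2-7 - the value is a hex CPU mask, so adjust it to your layout:

# /etc/sysconfig/irqbalance (or /etc/default/irqbalance on Debian-based hosts)
# 0xfc = 11111100 in binary = ban CPUs 2-7 from receiving balanced IRQs
IRQBALANCE_BANNED_CPUS=fc

systemctl restart irqbalance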

The core interruptions took a bit longer to figure out - it all came down to the resource definition in Kubernetes and the cgroups. With other orchestrators, CPU isolation is easier to handle. But in Kubernetes we have to consider the QoS classes:
  • Burstable class - when the CPU requests and limits have different values
  • Guaranteed class - when the CPU and memory requests and limits have the same values
  • Best effort class - when we omit the resource definitions (requests and limits) entirely
Depending on the QoS class we land in, we will or will not get a CPU quota enforced in the cgroups.

The kubelet orchestrates the cgroup slices dynamically based on our deployment settings. From what I observed in my tests, the first two classes will cause CPU interruptions; the last one is just fine.
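
A quick way to confirm which class a pod landed in (pod name is a placeholder):

kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'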

Looking into the cgroups for the first two options:

Burstable manifest definition:
resources:
  limits:
    cpu: "2"
  requests:
    cpu: "1"

# cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podXXXXX.slice/cri-containerd-YYYYY.scope/cpu.max

200000 100000



Guaranteed manifest definition:
resources:
  limits:
    cpu: "1000m"
    memory: "1Gi"
  requests:
    cpu: "1000m"
    memory: "1Gi"

# cat /sys/fs/cgroup/kubepods.slice/kubepods-podYYYYY.slice/cri-containerd-XXXXX.scope/cpu.max

100000 100000

In both definitions a specific CPU runtime is allocated per interval - cpu.max shows the quota and the period in microseconds, so 200000 100000 means 200ms of runtime (two CPUs' worth) every 100ms period. Looking at the onload stack we can see some interrupts:

onload_stackdump lots | grep interrupts

interrupts: 148378


When switching to Best effort, where the resources block is omitted from the manifest entirely, we see some changes:
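
A minimal sketch of what that looks like in the pod spec (container name and image are hypothetical):

containers:
  - name: trading-engine          # hypothetical name
    image: trading-engine:latest  # hypothetical image
    # no resources: block at all -> BestEffort QoS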

# cat /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-AAAAA.slice/cri-containerd-BBBBB.scope/cpu.max

max 100000

We can see the difference: no quota is enforced ("max"), so the process gets the maximum possible runtime. With this configuration, no interruptions were observed.

Looking again at the onload stack stats, we can confirm we don't have any more interrupts:

onload_stackdump lots | grep interrupts

interrupts: 0


This is only part of the story, because the cgroups still have all the cores in their cpuset (we are not using the CPU manager policy in Kubernetes yet). But thanks to our CPU isolation policies, only processes with explicit CPU affinity settings land on the designated cores; otherwise this would need addressing.
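
If we do adopt the CPU manager later, a sketch of what the kubelet configuration could look like - the reserved core list is an assumption:

# KubeletConfiguration snippet (values are assumptions)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static      # Guaranteed pods with integer CPU requests get exclusive cores
reservedSystemCPUs: "0,1"     # keep kubelet/system housekeeping off the trading cores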

After removing all the cpu resource definitions and applying the new irqbalance configuration, the network looks happy and healthy.

Other approaches could have been testing kubelet cgroup configurations to define a custom slice, and/or disabling the cpu and cpuset controllers on the slices - although that is not ideal, as the slices change across deployments. A manual, temporary approach could be disabling the cpuset and cpu controllers on the slices:

echo '-cpuset' > /sys/fs/cgroup/kubepods.slice/kubepods-<CLASS>.slice/kubepods-XXXXX.slice/cgroup.subtree_control


echo '-cpu' > /sys/fs/cgroup/kubepods.slice/kubepods-<CLASS>.slice/kubepods-XXXXX.slice/cgroup.subtree_control
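
To verify which controllers remain enabled for the children of a slice:

cat /sys/fs/cgroup/kubepods.slice/kubepods-<CLASS>.slice/kubepods-XXXXX.slice/cgroup.subtree_control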


The migration journey to EKS has just begun, and I expect to find more problems to solve. From what I heard, some exchanges migrated to EKS then migrated back due to performance complications. This is going to be fun.