🚀 Linux Kernel Performance Tuning

Modern high-throughput systems rarely fail for lack of hardware; they fail because of inefficient defaults. The Linux kernel is powerful, but its out-of-the-box configuration is designed for general-purpose workloads, not latency-sensitive or throughput-intensive production systems.

This guide walks through practical kernel tuning techniques used in real-world systems, along with a data-driven benchmarking methodology so you can validate improvements instead of relying on assumptions.


🧠 Tuning Philosophy

Before touching any kernel parameter:

  • Measure first, then tune
  • Change one variable at a time
  • Validate using repeatable benchmarks
  • Always have a rollback strategy

Kernel tuning without measurement is just guesswork.
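
A rollback strategy can be as simple as saving the current values before changing anything. The sketch below is one way to do that, not a standard tool: the function name `snapshot_sysctl`, the `PROC_SYS` override (only there so it can be tried against a test directory), and the output filename are all assumptions made for illustration.

```sh
# snapshot_sysctl OUTFILE KEY...
# Writes "key = value" lines for each KEY so the file can later be
# replayed with `sysctl -p OUTFILE`.
# PROC_SYS is a hypothetical override for testing; it defaults to /proc/sys.
snapshot_sysctl() (
    out=$1; shift
    root=${PROC_SYS:-/proc/sys}
    : > "$out"
    for key in "$@"; do
        # vm.swappiness -> /proc/sys/vm/swappiness
        path="$root/$(printf '%s' "$key" | tr . /)"
        if [ -r "$path" ]; then
            printf '%s = %s\n' "$key" "$(cat "$path")" >> "$out"
        else
            printf 'skipping %s (not readable)\n' "$key" >&2
        fi
    done
)
```

For example, `snapshot_sysctl baseline.conf vm.swappiness vm.dirty_ratio` records the two values, and `sudo sysctl -p baseline.conf` puts them back.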


⚙️ CPU & Scheduler Optimization

🧩 CPU Governor

Set the CPU frequency governor to performance to avoid frequency-scaling latency:

echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

The sysfs write does not survive a reboot. On Debian/Ubuntu, persist it via cpufrequtils:

sudo apt install cpufrequtils
echo 'GOVERNOR="performance"' | sudo tee /etc/default/cpufrequtils
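
To confirm the governor actually took effect on every core, it helps to summarize what each CPU reports. This is a minimal sketch; `governor_summary` is a hypothetical helper name, and the optional argument exists only so the function can be pointed at a different directory.

```sh
# governor_summary [CPU_ROOT]
# Prints "<count> <governor>" for every governor currently in use.
governor_summary() (
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs without cpufreq support (or an unmatched glob).
        [ -r "$f" ] && cat "$f"
    done | sort | uniq -c | awk '{ print $1, $2 }'
)
```

A single line such as `8 performance` means all eight cores agree; more than one line means some cores were missed.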

🔍 I/O Scheduler Selection

Check the current I/O scheduler for a block device (the active one is shown in brackets):

cat /sys/block/sda/queue/scheduler

Common options:

  • mq-deadline → balanced, good default for SSD
  • none → best for NVMe devices (no scheduling overhead)
  • bfq → desktop workloads

Set the scheduler (this does not persist across reboots):

echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
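
The scheduler file lists every available scheduler and marks the active one with brackets, for example `[mq-deadline] kyber bfq none`. The sketch below extracts the bracketed entry and lists it per device; both function names are hypothetical helpers, not existing tools.

```sh
# active_scheduler: reads a scheduler line on stdin and prints the
# entry enclosed in brackets.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# list_schedulers [BLOCK_ROOT]
# Prints "<device> <active scheduler>" for every block device.
list_schedulers() (
    root=${1:-/sys/block}
    for f in "$root"/*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f#"$root"/}     # sda/queue/scheduler
        dev=${dev%%/*}        # sda
        printf '%s %s\n' "$dev" "$(active_scheduler < "$f")"
    done
)
```

Running `list_schedulers` before and after a change is a quick way to check that the write landed on the device you intended.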

📌 CPU Pinning (Isolation)

Pin workloads to specific cores to reduce cross-core migrations, cache misses, and context switching:

taskset -c 2,3 ./your_app

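Or isolate CPUs at boot by setting the kernel command line in /etc/default/grub:

GRUB_CMDLINE_LINUX="isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3"

Then regenerate the GRUB configuration and reboot:

sudo update-grub
sudo reboot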
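
After the reboot, the kernel reports the isolated set in `/sys/devices/system/cpu/isolated` using the same list syntax as the boot parameter (for example `2-3,6`). The helper below, a hypothetical `expand_cpulist`, turns such a list into individual CPU numbers so it can be compared with what was requested. It is a sketch that handles plain ranges and commas only, not the less common stride syntax.

```sh
# expand_cpulist "2-3,6"  prints: 2 3 6
expand_cpulist() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        [ -n "$lo" ] || continue
        # A single CPU ("6") has no upper bound, so reuse the lower one.
        seq "$lo" "${hi:-$lo}"
    done | tr '\n' ' ' | sed 's/ $//'
}
```

Usage: `expand_cpulist "$(cat /sys/devices/system/cpu/isolated)"`. Empty output means no CPUs are isolated.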

🧮 Memory Management Tuning

🔄 vm.swappiness

Controls how aggressively the kernel swaps out anonymous memory instead of reclaiming page cache.

sudo sysctl vm.swappiness=10

Persist:

echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.conf

  • 0–10 → latency-sensitive workloads
  • 60 → default (often too swap-happy for latency-sensitive production services)
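
Appending with `tee -a` adds a new line on every run, so repeated tuning passes leave duplicate and possibly conflicting entries. One alternative, sketched here under the assumption that you manage your own settings file, is to replace any existing assignment before writing the new one. `persist_sysctl` is a hypothetical helper name.

```sh
# persist_sysctl FILE KEY VALUE
# Ensures FILE contains exactly one "KEY = VALUE" line.
persist_sysctl() (
    file=$1 key=$2 value=$3
    touch "$file"
    tmp=$(mktemp)
    # Drop any existing assignment of KEY (grep exits 1 if nothing is left).
    grep -v "^[[:space:]]*$key[[:space:]]*=" "$file" > "$tmp" || true
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Overwrite in place so the original file's permissions are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
)
```

Run as root, this could target a drop-in such as `/etc/sysctl.d/99-tuning.conf` (an example filename), followed by `sudo sysctl --system` to load it.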

🧱 Transparent Huge Pages (THP)

THP can introduce latency spikes in some workloads (databases especially).

Disable:

echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

Persist via systemd. Create the unit file:

sudo nano /etc/systemd/system/disable-thp.service

With the following contents:

[Unit]
Description=Disable Transparent Huge Pages

[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"

[Install]
WantedBy=multi-user.target

Then reload systemd and enable the service:

sudo systemctl daemon-reload
sudo systemctl enable --now disable-thp
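
Before and after disabling THP, it is worth checking how much anonymous memory is actually backed by huge pages. The `AnonHugePages` line in `/proc/meminfo` reports this in kB. A minimal sketch, with `thp_anon_kib` as a hypothetical helper name:

```sh
# thp_anon_kib [MEMINFO]
# Prints the amount of anonymous memory backed by transparent huge pages, in kB.
thp_anon_kib() {
    awk '$1 == "AnonHugePages:" { print $2 }' "${1:-/proc/meminfo}"
}
```

If this already reports 0 under your normal workload, disabling THP is unlikely to change much, which is one more reason to measure before tuning.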

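📊 vm.dirty_ratio & vm.dirty_background_ratio

Control when the kernel flushes dirty pages to disk. Background writeback starts at vm.dirty_background_ratio; at vm.dirty_ratio, writing processes are throttled until pages are flushed.

sudo sysctl vm.dirty_background_ratio=5
sudo sysctl vm.dirty_ratio=10

  • Lower values → more consistent latency
  • Higher values → better throughput, worse spikes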
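
Percentages are easier to reason about once converted to an absolute size. The sketch below estimates how much dirty data a given ratio allows on the current machine. It is an approximation: the kernel applies the ratio to available memory (free plus reclaimable), not to `MemTotal`, so treat the result as an upper bound. `dirty_limit_mib` is a hypothetical helper name.

```sh
# dirty_limit_mib PERCENT [MEMINFO]
# Prints PERCENT of MemTotal in MiB (rounded down).
dirty_limit_mib() {
    awk -v pct="$1" '$1 == "MemTotal:" { printf "%d\n", $2 * pct / 100 / 1024 }' \
        "${2:-/proc/meminfo}"
}
```

On a 64 GiB machine, `dirty_limit_mib 10` comes out at roughly 6.4 GiB of dirty pages before writers are throttled, which is a lot of data to flush in one burst on a slow disk.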

💾 Disk & I/O Optimization

⚡ Read-Ahead Buffer

Check (the value is in 512-byte sectors):

sudo blockdev --getra /dev/sda

Set (example: 4096 sectors = 2 MiB):

sudo blockdev --setra 4096 /dev/sda

  • Larger values → good for sequential workloads
  • Smaller values → better for random I/O
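
Because `blockdev` works in 512-byte sectors while `/sys/block/<dev>/queue/read_ahead_kb` reports the same setting in KiB, it is easy to be off by a factor of two. These two hypothetical helpers do the conversion:

```sh
# ra_sectors_to_kib SECTORS : 512-byte sectors -> KiB
ra_sectors_to_kib() { echo $(( $1 * 512 / 1024 )); }

# ra_kib_to_sectors KIB : KiB -> 512-byte sectors
ra_kib_to_sectors() { echo $(( $1 * 1024 / 512 )); }
```

So `ra_sectors_to_kib 4096` gives 2048 KiB (2 MiB), and a 4 MiB read-ahead needs `blockdev --setra "$(ra_kib_to_sectors 4096)"`, that is, 8192 sectors.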

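🧵 I/O Queue Depth

Check the current request queue size (NVMe example):

cat /sys/block/nvme0n1/queue/nr_requests

Tune (the kernel may reject values above what the device or active scheduler supports, so confirm the write succeeded):

echo 1024 | sudo tee /sys/block/nvme0n1/queue/nr_requests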

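🔀 Multi-Queue (blk-mq)

Kernels since 5.0 use the multi-queue block layer (blk-mq) exclusively. To see how many hardware queues a device exposes, list its mq directory, which has one subdirectory per queue:

ls /sys/block/sda/mq/

More hardware queues → better parallelism across CPUs. The count is determined by the driver and the device, not by a tunable.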
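
To compare devices at a glance, the per-queue subdirectories under `/sys/block/<dev>/mq/` can simply be counted. A minimal sketch, with `count_hw_queues` as a hypothetical helper name:

```sh
# count_hw_queues DEVICE [BLOCK_ROOT]
# Prints the number of blk-mq hardware queues the device exposes.
count_hw_queues() (
    dev=$1 root=${2:-/sys/block}
    n=0
    for d in "$root/$dev"/mq/*/; do
        [ -d "$d" ] && n=$((n + 1))
    done
    echo "$n"
)
```

For example, `count_hw_queues nvme0n1`. NVMe devices typically expose several queues, while SATA devices usually show one.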


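🌐 Network Stack Optimization

📦 Increase Socket Buffers

These raise the ceiling for buffers that applications request via setsockopt (128 MiB here). TCP autotuning limits are set separately by net.ipv4.tcp_rmem and net.ipv4.tcp_wmem.

sudo sysctl -w net.core.rmem_max=134217728
sudo sysctl -w net.core.wmem_max=134217728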
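
A reasonable way to size socket buffers is the bandwidth-delay product: the amount of data that can be in flight on the link at once. The helper below is a hypothetical sketch of that arithmetic for a link speed in Mbit/s and a round-trip time in milliseconds.

```sh
# bdp_bytes MBIT_PER_S RTT_MS
# bytes = (Mbit/s * 1,000,000 / 8) * (RTT_MS / 1000) = Mbit/s * RTT_MS * 125
bdp_bytes() { echo $(( $1 * $2 * 125 )); }
```

`bdp_bytes 10000 100` gives 125000000 bytes, so the 128 MiB value above roughly matches a 10 Gbit/s path with 100 ms of round-trip latency. A 1 Gbit/s link at 20 ms needs only about 2.5 MB.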

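🚀 TCP Tuning

Check that BBR is available (if bbr is not listed, load it with sudo modprobe tcp_bbr):

sysctl net.ipv4.tcp_available_congestion_control

Enable BBR with the fq queueing discipline:

sudo sysctl -w net.core.default_qdisc=fq
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr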
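
In a provisioning script, the availability check can gate the change so that an unsupported kernel is left alone instead of failing halfway. A minimal sketch, where `cc_available` is a hypothetical helper that looks for a name in the space-separated list the kernel reports:

```sh
# cc_available NAME "LIST"
# Succeeds if NAME appears as a whole word in the space-separated LIST.
cc_available() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

Usage: `cc_available bbr "$(sysctl -n net.ipv4.tcp_available_congestion_control)" || sudo modprobe tcp_bbr`.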

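🔌 File Descriptors

Raise the limit for the current shell (values above the hard limit require root):

ulimit -n 1048576

Persist for PAM login sessions:

echo "* soft nofile 1048576" | sudo tee -a /etc/security/limits.conf
echo "* hard nofile 1048576" | sudo tee -a /etc/security/limits.conf

Services started by systemd do not read limits.conf; set LimitNOFILE= in the unit file instead.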
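
The limit that matters is the one the running process actually has, which can differ from what your shell reports. Each process exposes its limits in `/proc/<pid>/limits`. This sketch pulls out the soft and hard open-file limits; `nofile_limits` is a hypothetical helper name.

```sh
# nofile_limits [LIMITS_FILE]
# Prints "<soft> <hard>" from the "Max open files" row.
nofile_limits() {
    awk '/^Max open files/ { print $4, $5 }' "${1:-/proc/self/limits}"
}
```

For example, `nofile_limits /proc/1234/limits`, where 1234 stands in for your service's PID.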

📏 Benchmarking Methodology

🧪 Tools

  • sysbench → CPU, memory, I/O
  • fio → disk benchmarking
  • iperf3 → network throughput
  • perf → kernel-level profiling

📊 Example: Disk Benchmark

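fio --name=randread \
    --ioengine=libaio \
    --direct=1 \
    --rw=randread \
    --bs=4k \
    --iodepth=32 \
    --numjobs=4 \
    --size=1G \
    --runtime=60 \
    --time_based \
    --group_reporting

Here --direct=1 bypasses the page cache so the device itself is measured, --iodepth=32 gives libaio requests to keep in flight, and --time_based makes the job run for the full 60 seconds. fio creates its test files in the current directory, so run it on the filesystem you want to measure.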

🧠 Key Metrics

Track:

  • Throughput (MB/s)
  • Latency (avg, p95, p99)
  • CPU utilization
  • Context switches
  • I/O wait
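
Averages hide tail latency, so p95 and p99 deserve their own numbers. If you have raw latency samples, one per line, they can be computed with standard tools. The sketch below uses the nearest-rank method; `percentile` is a hypothetical helper name and it expects an integer percentile.

```sh
# percentile P
# Reads one number per line on stdin and prints the P-th percentile
# using the nearest-rank method.
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((p * NR + 99) / 100)   # ceil(p/100 * NR) for integer p
            if (rank < 1) rank = 1
            print v[rank]
        }'
}
```

Usage: `percentile 99 < latencies.txt`, where latencies.txt is an example file of samples in a single unit such as microseconds.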

🔁 Testing Strategy

  1. Capture baseline
  2. Apply one tuning change
  3. Re-run benchmark
  4. Compare results
  5. Keep or rollback
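
The compare step is easier to act on when it is expressed as a relative change. A minimal sketch, with `pct_change` as a hypothetical helper name:

```sh
# pct_change BASELINE NEW
# Prints the relative change from BASELINE to NEW as a percentage.
pct_change() {
    awk -v b="$1" -v n="$2" 'BEGIN {
        if (b == 0) exit 1          # undefined for a zero baseline
        printf "%.1f\n", (n - b) / b * 100
    }'
}
```

`pct_change 200 250` prints 25.0 (for throughput, an improvement), and `pct_change 4.0 3.0` prints -25.0 (for latency, also an improvement). Decide the threshold for keeping a change before running the benchmark, and repeat runs to make sure the difference is larger than the run-to-run noise.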

⚠️ Common Pitfalls

  • Tuning everything at once → impossible to isolate impact
  • Ignoring workload type → wrong optimizations
  • Blindly copying configs → dangerous in production
  • Not testing under realistic load

🏁 Putting It All Together

A typical production-tuned system might include:

  • CPU governor → performance
  • Scheduler → none (NVMe)
  • vm.swappiness=10
  • THP disabled
  • Tuned dirty ratios
  • Increased I/O queue depth
  • BBR congestion control
  • High file descriptor limits

But the exact combination depends on your workload.


📌 Final Thoughts

Linux kernel tuning is not about memorizing sysctl values. It's about:

  • Understanding system behavior
  • Identifying bottlenecks
  • Applying targeted optimizations
  • Validating with data

The best engineers don't just tune systems; they prove improvements with evidence.


🔗 Next Steps

If you're building high-performance platforms:

  • Automate tuning via Ansible or Terraform
  • Integrate benchmarks into CI pipelines
  • Combine kernel tuning with observability (eBPF, perf, tracing)

Measure. Tune. Validate. Repeat.