Skip to main content

scx_layered

This is a single user-defined scheduler used within sched_ext, which is a Linux kernel feature which enables implementing kernel thread schedulers in BPF and dynamically loading them. Read more about sched_ext.

Overview

A highly configurable multi-layer BPF / user space hybrid scheduler.

scx_layered allows the user to classify tasks into multiple layers, and apply different scheduling policies to those layers. For example, a layer could be created of all tasks that are part of the user.slice cgroup slice, and a policy could be specified that ensures that the layer is given at least 80% CPU utilization for some subset of CPUs on the system.

How To Install

Available as a Rust crate: cargo add scx_layered

Typical Use Case

scx_layered is designed to be highly customizable, and can be targeted for specific applications. For example, if you had a high-priority service that required priority access to all but 1 physical core to ensure acceptable p99 latencies, you could specify that the service would get priority access to all but 1 core on the system. If that service ends up not utilizing all of those cores, they could be used by other layers until they're needed.

Production Ready?

Yes. If tuned correctly, scx_layered should be performant across various CPU architectures and workloads.

That said, you may run into an issue with infeasible weights, where a task with a very high weight may cause the scheduler to incorrectly leave cores idle because it thinks they're necessary to accommodate the compute for a single task. This can also happen in CFS, and should soon be addressed for scx_layered.

Tuning scx_layered

scx_layered is designed with specific use cases in mind and may not perform as well as a general purpose scheduler for all workloads. It does have topology awareness, which can be disabled with the -t flag. This may impact performance on NUMA machines, as layers will be able to span NUMA nodes by default. For configuring scx_layered to span across multiple NUMA nodes simply setting all nodes in the nodes field of the config.

For controlling the performance level of different levels (i.e. CPU frequency) the perf field can be set. This must be used in combination with the schedutil frequency governor. The value should be from 0-1024 with 1024 being maximum performance. Depending on the system hardware it will translate to frequency, which can also trigger turbo boosting if the value is high enough and turbo is enabled.

Layer affinities can be defined using the nodes or llcs layer configs. This allows for restricting a layer to a NUMA node or LLC. Layers will by default attempt to grow within the same NUMA node, however this may change to suppport different layer growth strategies in the future. When tuning the util_range for a layer there should be some consideration for how the layer should grow. For example, if the util_range lower bound is too high, it may lead to the layer shrinking excessively. This could be ideal for core compaction strategies for a layer, but may poorly utilize hardware, especially in low system utilization. The upper bound of the util_range controls how the layer grows, if set too aggressively the layer could grow fast and prevent other layers from utilizing CPUs. Lastly, the slice_us can be used to tune the timeslice per layer. This is useful if a layer has more latency sensitive tasks, where timeslices should be shorter. Conversely if a layer is largely CPU bound with less concerns of latency it may be useful to increase the slice_us parameter.

scx_layered can provide performance wins, for certain workloads when sufficient tuning on the layer config.