scx_layered
This is a single user-defined scheduler used within sched_ext, which is a Linux kernel feature which enables implementing kernel thread schedulers in BPF and dynamically loading them. Read more about sched_ext.
Overview
A highly configurable multi-layer BPF / user space hybrid scheduler.
scx_layered allows the user to classify tasks into multiple layers, and apply
different scheduling policies to those layers. For example, a layer could be
created of all tasks that are part of the user.slice
cgroup slice, and a
policy could be specified that ensures that the layer is given at least 80% CPU
utilization for some subset of CPUs on the system.
How To Install
Available as a Rust crate: cargo add scx_layered
Typical Use Case
scx_layered is designed to be highly customizable, and can be targeted for specific applications. For example, if you had a high-priority service that required priority access to all but 1 physical core to ensure acceptable p99 latencies, you could specify that the service would get priority access to all but 1 core on the system. If that service ends up not utilizing all of those cores, they could be used by other layers until they're needed.
Production Ready?
Yes. If tuned correctly, scx_layered should be performant across various CPU architectures and workloads.
That said, you may run into an issue with infeasible weights, where a task with a very high weight may cause the scheduler to incorrectly leave cores idle because it thinks they're necessary to accommodate the compute for a single task. This can also happen in CFS, and should soon be addressed for scx_layered.
Tuning scx_layered
scx_layered
is designed with specific use cases in mind and may not perform
as well as a general purpose scheduler for all workloads. It does have topology
awareness, which can be disabled with the -t
flag. This may impact
performance on NUMA machines, as layers will be able to span NUMA nodes by
default. For configuring scx_layered
to span across multiple NUMA nodes simply
setting all nodes in the nodes
field of the config.
For controlling the performance level of different levels (i.e. CPU frequency)
the perf
field can be set. This must be used in combination with the
schedutil
frequency governor. The value should be from 0-1024 with 1024 being
maximum performance. Depending on the system hardware it will translate to
frequency, which can also trigger turbo boosting if the value is high enough
and turbo is enabled.
Layer affinities can be defined using the nodes
or llcs
layer configs. This
allows for restricting a layer to a NUMA node or LLC. Layers will by default
attempt to grow within the same NUMA node, however this may change to suppport
different layer growth strategies in the future. When tuning the util_range
for a layer there should be some consideration for how the layer should grow.
For example, if the util_range
lower bound is too high, it may lead to the
layer shrinking excessively. This could be ideal for core compaction strategies
for a layer, but may poorly utilize hardware, especially in low system
utilization. The upper bound of the util_range
controls how the layer grows,
if set too aggressively the layer could grow fast and prevent other layers from
utilizing CPUs. Lastly, the slice_us
can be used to tune the timeslice
per layer. This is useful if a layer has more latency sensitive tasks, where
timeslices should be shorter. Conversely if a layer is largely CPU bound with
less concerns of latency it may be useful to increase the slice_us
parameter.
scx_layered
can provide performance wins, for certain workloads when
sufficient tuning on the layer config.