183 lines
6.3 KiB
ReStructuredText
183 lines
6.3 KiB
ReStructuredText
================================
|
|
PSI - Pressure Stall Information
|
|
================================
|
|
|
|
:Date: April, 2018
|
|
:Author: Johannes Weiner <hannes@cmpxchg.org>
|
|
|
|
When CPU, memory or IO devices are contended, workloads experience
|
|
latency spikes, throughput losses, and run the risk of OOM kills.
|
|
|
|
Without an accurate measure of such contention, users are forced to
|
|
either play it safe and under-utilize their hardware resources, or
|
|
roll the dice and frequently suffer the disruptions resulting from
|
|
excessive overcommit.
|
|
|
|
The psi feature identifies and quantifies the disruptions caused by
|
|
such resource crunches and the time impact it has on complex workloads
|
|
or even entire systems.
|
|
|
|
Having an accurate measure of productivity losses caused by resource
|
|
scarcity aids users in sizing workloads to hardware--or provisioning
|
|
hardware according to workload demand.
|
|
|
|
As psi aggregates this information in realtime, systems can be managed
|
|
dynamically using techniques such as load shedding, migrating jobs to
|
|
other systems or data centers, or strategically pausing or killing low
|
|
priority or restartable batch jobs.
|
|
|
|
This allows maximizing hardware utilization without sacrificing
|
|
workload health or risking major disruptions such as OOM kills.
|
|
|
|
Pressure interface
|
|
==================
|
|
|
|
Pressure information for each resource is exported through the
|
|
respective file in /proc/pressure/ -- cpu, memory, and io.
|
|
|
|
The format for CPU is as such::
|
|
|
|
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
|
|
|
|
and for memory and IO::
|
|
|
|
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
|
|
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
|
|
|
|
The "some" line indicates the share of time in which at least some
|
|
tasks are stalled on a given resource.
|
|
|
|
The "full" line indicates the share of time in which all non-idle
|
|
tasks are stalled on a given resource simultaneously. In this state
|
|
actual CPU cycles are going to waste, and a workload that spends
|
|
extended time in this state is considered to be thrashing. This has
|
|
severe impact on performance, and it's useful to distinguish this
|
|
situation from a state where some tasks are stalled but the CPU is
|
|
still doing productive work. As such, time spent in this subset of the
|
|
stall state is tracked separately and exported in the "full" averages.
|
|
|
|
The ratios (in %) are tracked as recent trends over ten, sixty, and
|
|
three hundred second windows, which gives insight into short term events
|
|
as well as medium and long term trends. The total absolute stall time
|
|
(in us) is tracked and exported as well, to allow detection of latency
|
|
spikes which wouldn't necessarily make a dent in the time averages,
|
|
or to average trends over custom time frames.
|
|
|
|
Monitoring for pressure thresholds
|
|
==================================
|
|
|
|
Users can register triggers and use poll() to be woken up when resource
|
|
pressure exceeds certain thresholds.
|
|
|
|
A trigger describes the maximum cumulative stall time over a specific
|
|
time window, e.g. 100ms of total stall time within any 500ms window to
|
|
generate a wakeup event.
|
|
|
|
To register a trigger user has to open psi interface file under
|
|
/proc/pressure/ representing the resource to be monitored and write the
|
|
desired threshold and time window. The open file descriptor should be
|
|
used to wait for trigger events using select(), poll() or epoll().
|
|
The following format is used::
|
|
|
|
<some|full> <stall amount in us> <time window in us>
|
|
|
|
For example writing "some 150000 1000000" into /proc/pressure/memory
|
|
would add 150ms threshold for partial memory stall measured within
|
|
1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
|
|
would add 50ms threshold for full io stall measured within 1sec time window.
|
|
|
|
Triggers can be set on more than one psi metric and more than one trigger
|
|
for the same psi metric can be specified. However for each trigger a separate
|
|
file descriptor is required to be able to poll it separately from others,
|
|
therefore for each trigger a separate open() syscall should be made even
|
|
when opening the same psi interface file.
|
|
|
|
Monitors activate only when system enters stall state for the monitored
|
|
psi metric and deactivates upon exit from the stall state. While system is
|
|
in the stall state psi signal growth is monitored at a rate of 10 times per
|
|
tracking window.
|
|
|
|
The kernel accepts window sizes ranging from 500ms to 10s, therefore min
|
|
monitoring update interval is 50ms and max is 1s. Min limit is set to
|
|
prevent overly frequent polling. Max limit is chosen as a high enough number
|
|
after which monitors are most likely not needed and psi averages can be used
|
|
instead.
|
|
|
|
When activated, psi monitor stays active for at least the duration of one
|
|
tracking window to avoid repeated activations/deactivations when system is
|
|
bouncing in and out of the stall state.
|
|
|
|
Notifications to the userspace are rate-limited to one per tracking window.
|
|
|
|
The trigger will de-register when the file descriptor used to define the
|
|
trigger is closed.
|
|
|
|
Userspace monitor usage example
|
|
===============================
|
|
|
|
::
|
|
|
|
#include <errno.h>
|
|
#include <fcntl.h>
|
|
#include <stdio.h>
|
|
#include <poll.h>
|
|
#include <string.h>
|
|
#include <unistd.h>
|
|
|
|
/*
|
|
* Monitor memory partial stall with 1s tracking window size
|
|
* and 150ms threshold.
|
|
*/
|
|
int main() {
|
|
const char trig[] = "some 150000 1000000";
|
|
struct pollfd fds;
|
|
int n;
|
|
|
|
fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
|
|
if (fds.fd < 0) {
|
|
printf("/proc/pressure/memory open error: %s\n",
|
|
strerror(errno));
|
|
return 1;
|
|
}
|
|
fds.events = POLLPRI;
|
|
|
|
if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
|
|
printf("/proc/pressure/memory write error: %s\n",
|
|
strerror(errno));
|
|
return 1;
|
|
}
|
|
|
|
printf("waiting for events...\n");
|
|
while (1) {
|
|
n = poll(&fds, 1, -1);
|
|
if (n < 0) {
|
|
printf("poll error: %s\n", strerror(errno));
|
|
return 1;
|
|
}
|
|
if (fds.revents & POLLERR) {
|
|
printf("got POLLERR, event source is gone\n");
|
|
return 0;
|
|
}
|
|
if (fds.revents & POLLPRI) {
|
|
printf("event triggered!\n");
|
|
} else {
|
|
printf("unknown event received: 0x%x\n", fds.revents);
|
|
return 1;
|
|
}
|
|
}
|
|
|
|
return 0;
|
|
}
|
|
|
|
Cgroup2 interface
|
|
=================
|
|
|
|
In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem
|
|
mounted, pressure stall information is also tracked for tasks grouped
|
|
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
|
|
cpu.pressure, memory.pressure, and io.pressure files; the format is
|
|
the same as the /proc/pressure/ files.
|
|
|
|
Per-cgroup psi monitors can be specified and used the same way as
|
|
system-wide ones.
|