.. _smp_arch:

Symmetric Multiprocessing
#########################

On multiprocessor architectures, Zephyr supports the use of multiple
physical CPUs running Zephyr application code. This support is
"symmetric" in the sense that no specific CPU is treated specially by
default. Any processor is capable of running any Zephyr thread, with
access to all standard Zephyr APIs supported.

No special application code needs to be written to take advantage of
this feature. If there are two Zephyr application threads runnable on
a supported dual processor device, they will both run simultaneously.

SMP configuration is controlled under the :kconfig:option:`CONFIG_SMP`
kconfig variable. This must be set to "y" to enable SMP features,
otherwise a uniprocessor kernel will be built. In general the platform
default will have enabled this anywhere it's supported. When enabled,
the number of physical CPUs available is visible at build time as
:kconfig:option:`CONFIG_MP_MAX_NUM_CPUS`. Likewise, the default for this
will be the number of available CPUs on the platform and it is not
expected that typical apps will change it. But it is legal and
supported to set this to a smaller (but obviously not larger) number
for special purposes (e.g. for testing, or to reserve a physical CPU
for running non-Zephyr code).

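
For example, a minimal application configuration fragment might look
like the following sketch (the two-CPU limit is purely illustrative):

.. code-block:: cfg

   # Enable SMP; restrict the kernel to two CPUs for illustration.
   CONFIG_SMP=y
   CONFIG_MP_MAX_NUM_CPUS=2
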
Synchronization
***************

At the application level, core Zephyr IPC and synchronization
primitives all behave identically under an SMP kernel. For example,
semaphores used to implement blocking mutual exclusion continue to be
a proper application choice.

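
For instance, a semaphore-guarded critical section needs no changes
when moving to SMP. A minimal sketch (names are illustrative):

.. code-block:: c

   #include <zephyr/kernel.h>

   /* A semaphore guarding a shared resource behaves identically on
    * uniprocessor and SMP kernels.
    */
   K_SEM_DEFINE(res_sem, 1, 1);

   void use_resource(void)
   {
           k_sem_take(&res_sem, K_FOREVER);
           /* ... access the shared resource ... */
           k_sem_give(&res_sem);
   }
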
At the lowest level, however, Zephyr code has often used the
:c:func:`irq_lock`/:c:func:`irq_unlock` primitives to implement fine-grained
critical sections using interrupt masking. These APIs continue to
work via an emulation layer (see below), but the masking technique
does not: the fact that your CPU will not be interrupted while you are
in your critical section says nothing about whether a different CPU
will be running simultaneously and be inspecting or modifying the same
data!

Spinlocks
=========

SMP systems provide a more constrained :c:func:`k_spin_lock` primitive
that not only masks interrupts locally, as done by :c:func:`irq_lock`, but
also atomically validates that a shared lock variable has been
modified before returning to the caller, "spinning" on the check if
needed to wait for the other CPU to exit the lock. The default Zephyr
implementation of :c:func:`k_spin_lock` and :c:func:`k_spin_unlock` is built
on top of the pre-existing :c:struct:`atomic_` layer (itself usually
implemented using compiler intrinsics), though facilities exist for
architectures to define their own for performance reasons.

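
Usage follows a simple acquire/release pattern around the shared data.
A minimal sketch (the lock and counter names are illustrative):

.. code-block:: c

   #include <zephyr/spinlock.h>

   static struct k_spinlock my_lock;
   static int shared_counter;

   void increment(void)
   {
           /* Masks interrupts locally and spins until the lock is
            * free; the returned key restores interrupt state later.
            */
           k_spinlock_key_t key = k_spin_lock(&my_lock);

           shared_counter++;

           k_spin_unlock(&my_lock, key);
   }
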
One important difference between IRQ locks and spinlocks is that the
earlier API was naturally recursive: the lock was global, so it was
legal to acquire a nested lock inside of a critical section.
Spinlocks are separable: you can have many locks for separate
subsystems or data structures, preventing CPUs from contending on a
single global resource. But that means that spinlocks must not be
used recursively. Code that holds a specific lock must not try to
re-acquire it or it will deadlock (it is perfectly legal to nest
**distinct** spinlocks, however). A validation layer is available to
detect and report bugs like this.

When used on a uniprocessor system, the data component of the spinlock
(the atomic lock variable) is unnecessary and elided. Except for the
recursive semantics described above, spinlocks in single-CPU contexts
produce identical code to legacy IRQ locks. In fact the entirety of the
Zephyr core kernel has now been ported to use spinlocks exclusively.

Legacy irq_lock() emulation
===========================

For the benefit of applications written to the uniprocessor locking
API, :c:func:`irq_lock` and :c:func:`irq_unlock` continue to work compatibly on
SMP systems with identical semantics to their legacy versions. They
are implemented as a single global spinlock, with a nesting count and
the ability to be atomically reacquired on context switch into locked
threads. The kernel will ensure that only one thread across all CPUs
can hold the lock at any time, that it is released on context switch,
and that it is re-acquired when necessary to restore the lock state
when a thread is switched in. Other CPUs will spin waiting for the
release to happen.

The overhead involved in this process has measurable performance
impact, however. Unlike uniprocessor apps, SMP apps using
:c:func:`irq_lock` are not simply invoking a very short (often ~1
instruction) interrupt masking operation. That, and the fact that the
IRQ lock is global, means that code expecting to be run in an SMP
context should be using the spinlock API wherever possible.

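
The conversion is usually mechanical, as in this hedged sketch
(function and lock names are illustrative):

.. code-block:: c

   #include <zephyr/kernel.h>
   #include <zephyr/spinlock.h>

   static struct k_spinlock data_lock;

   /* Legacy pattern: one global lock, costly under SMP. */
   void update_legacy(void)
   {
           unsigned int key = irq_lock();
           /* ... touch shared data ... */
           irq_unlock(key);
   }

   /* SMP-friendly pattern: a dedicated per-subsystem spinlock. */
   void update_smp(void)
   {
           k_spinlock_key_t key = k_spin_lock(&data_lock);
           /* ... touch shared data ... */
           k_spin_unlock(&data_lock, key);
   }
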
CPU Mask
********

It is often desirable for real time applications to deliberately
partition work across physical CPUs instead of relying solely on the
kernel scheduler to decide on which threads to execute. Zephyr
provides an API, controlled by the :kconfig:option:`CONFIG_SCHED_CPU_MASK`
kconfig variable, which can associate a specific set of CPUs with each
thread, indicating on which CPUs it can run.

By default, new threads can run on any CPU. Calling
:c:func:`k_thread_cpu_mask_disable` with a particular CPU ID will prevent
that thread from running on that CPU in the future. Likewise
:c:func:`k_thread_cpu_mask_enable` will re-enable execution. There are also
:c:func:`k_thread_cpu_mask_clear` and :c:func:`k_thread_cpu_mask_enable_all` APIs
available for convenience. For obvious reasons, these APIs are
illegal if called on a runnable thread. The thread must be blocked or
suspended, otherwise an ``-EINVAL`` will be returned.

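
A minimal sketch, assuming :kconfig:option:`CONFIG_SCHED_CPU_MASK` is
enabled and the thread was created with a ``K_FOREVER`` start delay so
that it is not yet runnable (names are illustrative):

.. code-block:: c

   #include <zephyr/kernel.h>

   /* Pin a not-yet-started thread so it may run only on CPU 0. */
   void pin_to_cpu0(k_tid_t tid)
   {
           /* Clear all enable bits, then allow exactly CPU 0. */
           k_thread_cpu_mask_clear(tid);
           k_thread_cpu_mask_enable(tid, 0);

           k_thread_start(tid);
   }
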
Note that when this feature is enabled, the scheduler algorithm
involved in doing the per-CPU mask test requires that the list be
traversed in full. The kernel does not keep a per-CPU run queue.
That means that the performance benefits from the
:kconfig:option:`CONFIG_SCHED_SCALABLE` and :kconfig:option:`CONFIG_SCHED_MULTIQ`
scheduler backends cannot be realized. CPU mask processing is
available only when :kconfig:option:`CONFIG_SCHED_DUMB` is the selected
backend. This requirement is enforced in the configuration layer.

SMP Boot Process
****************

A Zephyr SMP kernel begins boot identically to a uniprocessor kernel.
Auxiliary CPUs begin in a disabled state in the architecture layer.
All standard kernel initialization, including device initialization,
happens on a single CPU before other CPUs are brought online.

Just before entering the application :c:func:`main` function, the kernel
calls :c:func:`z_smp_init` to launch the SMP initialization process. This
enumerates over the configured CPUs, calling into the architecture
layer using :c:func:`arch_cpu_start` for each one. This function is
passed a memory region to use as a stack on the foreign CPU (in
practice it uses the area that will become that CPU's interrupt
stack), the address of a local :c:func:`smp_init_top` callback function to
run on that CPU, and a pointer to a "start flag" address which will be
used as an atomic signal.

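
In outline, the sequence looks roughly like the following sketch. This
is a loose paraphrase of the flow just described, not the literal
kernel source; the ``z_interrupt_stacks`` usage and the flag handling
are simplified:

.. code-block:: c

   /* Condensed sketch of kernel-internal SMP bring-up. */
   static atomic_t start_flag;

   void z_smp_init(void)
   {
           (void)atomic_clear(&start_flag);

           for (int i = 1; i < CONFIG_MP_MAX_NUM_CPUS; i++) {
                   /* Each CPU gets its interrupt stack, the local
                    * init callback, and the shared start flag.
                    */
                   arch_cpu_start(i, z_interrupt_stacks[i],
                                  CONFIG_ISR_STACK_SIZE,
                                  smp_init_top, &start_flag);
           }

           /* Release the auxiliary CPUs into the scheduler. */
           (void)atomic_set(&start_flag, 1);
   }
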
The local SMP initialization (:c:func:`smp_init_top`) on each CPU is then
invoked by the architecture layer. Note that interrupts are still
masked at this point. This routine is responsible for calling
:c:func:`smp_timer_init` to set up any needed state in the timer driver. On
many architectures the timer is a per-CPU device and needs to be
configured specially on auxiliary CPUs. Then it waits (spinning) for
the atomic "start flag" to be released in the main thread, to
guarantee that all SMP initialization is complete before any Zephyr
application code runs, and finally calls :c:func:`z_swap` to transfer
control to the appropriate runnable thread via the standard scheduler
API.

.. figure:: smpinit.svg
   :align: center
   :alt: SMP Initialization
   :figclass: align-center

   Example SMP initialization process, showing a configuration with
   two CPUs and two app threads which begin operating simultaneously.

Interprocessor Interrupts
*************************

When running in multiprocessor environments, it is occasionally the
case that state modified on the local CPU needs to be synchronously
handled on a different processor.

One example is the Zephyr :c:func:`k_thread_abort` API, which cannot return
until the thread that had been aborted is no longer runnable. If it
is currently running on another CPU, that becomes difficult to
implement.

Another is low power idle. It is a firm requirement on many devices
that system idle be implemented using a low-power mode with as many
interrupts (including periodic timer interrupts) disabled or deferred
as is possible. If a CPU is in such a state, and on another CPU a
thread becomes runnable, the idle CPU has no way to "wake up" to
handle the newly-runnable load.

So where possible, Zephyr SMP architectures should implement an
interprocessor interrupt. The current framework is very simple: the
architecture provides at least an :c:func:`arch_sched_broadcast_ipi` call,
which when invoked will flag an interrupt on all CPUs (except the current one,
though that is allowed behavior). If the architecture supports directed IPIs
(see :kconfig:option:`CONFIG_ARCH_HAS_DIRECTED_IPIS`), then the
architecture also provides an :c:func:`arch_sched_directed_ipi` call, which
when invoked will flag an interrupt on the specified CPUs. When an interrupt is
flagged on the CPUs, the :c:func:`z_sched_ipi` function implemented in the
scheduler will get invoked on those CPUs. The expectation is that these
APIs will evolve over time to encompass more functionality (e.g. cross-CPU
calls), and that the scheduler-specific calls here will be implemented in
terms of a more general framework.

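
On the architecture side, the broadcast call is typically a single
write to an interrupt controller register. The sketch below uses
hypothetical register definitions (``HYPO_IPI_REG`` and
``HYPO_IPI_ALL_OTHERS`` are placeholders, not real hardware):

.. code-block:: c

   #include <zephyr/sys/sys_io.h>

   /* Hypothetical interrupt controller definitions. */
   #define HYPO_IPI_REG         0x40001000UL
   #define HYPO_IPI_ALL_OTHERS  0x1U

   /* The IPI handler on each receiving CPU then calls z_sched_ipi(). */
   void arch_sched_broadcast_ipi(void)
   {
           sys_write32(HYPO_IPI_ALL_OTHERS, HYPO_IPI_REG);
   }
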
Note that not all SMP architectures will have a usable IPI mechanism
(either missing, or just undocumented/unimplemented). In those cases
Zephyr provides fallback behavior that is correct, but perhaps
suboptimal.

Using this, :c:func:`k_thread_abort` becomes only slightly more
complicated in SMP: for the case where a thread is actually running on
another CPU (we can detect this atomically inside the scheduler), we
broadcast an IPI and spin, waiting for the thread to either become
"DEAD" or for it to re-enter the queue (in which case we terminate it
the same way we would have in uniprocessor mode). Note that the
"aborted" check happens on any interrupt exit, so there is no special
handling needed in the IPI per se. This allows us to implement a
reasonable fallback when IPI is not available: we can simply spin,
waiting until the foreign CPU receives any interrupt, though this may
be a much longer time!

Likewise idle wakeups are trivially implementable with an empty IPI
handler. If a thread is added to an empty run queue (i.e. there may
have been idle CPUs), we broadcast an IPI. A foreign CPU will then be
able to see the new thread when exiting from the interrupt and will
switch to it if available.

Without an IPI, however, a low power idle that requires an interrupt
will not work to synchronously run new threads. The workaround in
that case is more invasive: Zephyr will **not** enter the system idle
handler and will instead spin in its idle loop, testing the scheduler
state at high frequency (not spinning on it though, as that would
involve severe lock contention) for new threads. The expectation is
that power constrained SMP applications are always going to provide an
IPI, and this code will only be used for testing purposes or on
systems without power consumption requirements.

IPI Cascades
============

The kernel cannot control the order in which IPIs are processed by the CPUs
in the system. In general, this is not an issue and a single set of IPIs is
sufficient to trigger a reschedule on the N CPUs that results in them
scheduling the highest N priority ready threads to execute. When CPU masking
is used, there may be more than one valid set of threads (not to be confused
with an optimal set of threads) that can be scheduled on the N CPUs and a
single set of IPIs may be insufficient to result in any of these valid sets.

.. note::
   When CPU masking is not in play, the optimal set of threads is the same
   as the valid set of threads. However, when CPU masking is in play, there
   may be more than one valid set, one of which may be optimal.

To better illustrate the distinction, consider a 2-CPU system with ready
threads T1 and T2 at priorities 1 and 2 respectively. Let T2 be pinned to
CPU0 and T1 not be pinned. If CPU0 is executing T2 and CPU1 is executing T1,
then this set is both valid and optimal. However, if CPU0 is executing
T1 and CPU1 is idling, then this too would be valid though not optimal.

In those cases where a single set of IPIs is not sufficient to generate a valid
set, the resulting set of executing threads is expected to be close to a valid
set, and subsequent IPIs can generally be expected to correct the situation
soon. However, for cases where neither the approximation nor the delay is
acceptable, enabling :kconfig:option:`CONFIG_SCHED_IPI_CASCADE` will allow the
kernel to generate cascading IPIs until the kernel has selected a valid set of
ready threads for the CPUs.

There are three types of costs/penalties associated with IPI cascades, and
for these reasons they are disabled by default. The first is a cost incurred
by the CPU producing the IPI when a new thread preempts the old thread, as
checks must be done to compare the old thread against the threads executing
on the other CPUs. The second is a cost incurred by the CPUs receiving the
IPIs as they must be processed. The third is the apparent sputtering of a
thread as it "winks in" and then "winks out" due to cascades stemming from
the aforementioned first cost.

SMP Kernel Internals
********************

In general, Zephyr kernel code is SMP-agnostic and, like application
code, will work correctly regardless of the number of CPUs available.
But in a few areas there are notable changes in structure or behavior.

Per-CPU data
============

Many elements of the core kernel data need to be implemented for each
CPU in SMP mode. For example, the ``_current`` thread pointer obviously
needs to reflect what is running locally, since there are many threads
running concurrently. Likewise a kernel-provided interrupt stack
needs to be created and assigned for each physical CPU, as does the
interrupt nesting count used to detect ISR state.

These fields are now moved into a separate struct :c:struct:`_cpu` instance
within the :c:struct:`_kernel` struct, which has a ``cpus[]`` array indexed by ID.
Compatibility fields are provided for legacy uniprocessor code trying
to access the fields of ``cpus[0]`` using the older syntax and assembly
offsets.

Note that an important requirement on the architecture layer is that
the pointer to this CPU struct be available rapidly when in kernel
context. The expectation is that :c:func:`arch_curr_cpu` will be
implemented using a CPU-provided register or addressing mode that can
store this value across arbitrary context switches or interrupts and
make it available to any kernel-mode code.

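
For example, kernel-internal code can test per-CPU state through this
pointer. A minimal sketch using the ISR nesting count mentioned above:

.. code-block:: c

   /* Sketch (kernel-internal context): true if the current CPU is
    * executing in interrupt context.
    */
   static inline bool cpu_in_isr(void)
   {
           return arch_curr_cpu()->nested != 0;
   }
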
Similarly, where on a uniprocessor system Zephyr could simply create a
global "idle thread" at the lowest priority, in SMP we may need one
for each CPU. This makes the internal predicate test for "_is_idle()"
in the scheduler, which is a performance-critical hot path, more
complicated than simply testing the thread pointer for equality with a
known static variable. In SMP mode, idle threads are distinguished by
a separate field in the thread struct.

Switch-based context switching
==============================

The traditional Zephyr context switch primitive has been :c:func:`z_swap`.
Unfortunately, this function takes no argument specifying a thread to
switch to. The expectation has always been that the scheduler has
already made its preemption decision when its state was last modified
and cached the resulting "next thread" pointer in a location where
architecture context switch primitives can find it via a simple struct
offset. That technique will not work in SMP, because the other CPU
may have modified scheduler state since the current CPU last exited
the scheduler (for example: it might already be running that cached
thread!).

Instead, the SMP "switch to" decision needs to be made synchronously
with the swap call, and as we don't want per-architecture assembly
code to be handling scheduler internal state, Zephyr requires a
somewhat lower-level context switch primitive for SMP systems:
:c:func:`arch_switch` is always called with interrupts masked, and takes
exactly two arguments. The first is an opaque (architecture defined)
handle to the context to which it should switch, and the second is a
pointer to such a handle into which it should store the handle
resulting from the thread that is being switched out.
The kernel then implements a portable :c:func:`z_swap` implementation on top
of this primitive which includes the relevant scheduler logic in a
location where the architecture doesn't need to understand it.

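
The prose above corresponds to a two-argument prototype along these
lines (a sketch; the architecture interface headers hold the
authoritative declaration):

.. code-block:: c

   /* Switch to the context identified by switch_to, and store the
    * outgoing context's handle through switched_from once it has
    * been fully saved.
    */
   void arch_switch(void *switch_to, void **switched_from);
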
Similarly, on interrupt exit, switch-based architectures are expected
to call :c:func:`z_get_next_switch_handle` to retrieve the next thread to
run from the scheduler. The argument to :c:func:`z_get_next_switch_handle`
is either the interrupted thread's "handle" reflecting the same opaque type
used by :c:func:`arch_switch`, or NULL if that thread cannot be released
to the scheduler just yet. The choice between a handle value or NULL
depends on the way CPU interrupt mode is implemented.

Architectures with a large CPU register file would typically preserve only
the caller-saved registers on the current thread's stack when interrupted
in order to minimize interrupt latency, and preserve the callee-saved
registers only when :c:func:`arch_switch` is called to minimize context
switching latency. Such architectures must use NULL as the argument to
:c:func:`z_get_next_switch_handle` to determine if there is a new thread
to schedule, and follow through with their own :c:func:`arch_switch` or
derivative if so, or directly leave interrupt mode otherwise.
In the former case it is up to that switch code to store the handle
resulting from the thread that is being switched out in that thread's
"switch_handle" field after its context has fully been saved.

Architectures whose entry in interrupt mode already preserves the entire
thread state may pass that thread's handle directly to
:c:func:`z_get_next_switch_handle` and be done in one step.

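
Schematically, the full-save case reduces the interrupt exit path to
something like this sketch (``restore_context`` is a hypothetical
architecture-level helper, not a real kernel API):

.. code-block:: c

   /* Sketch: interrupt exit on an architecture that saves the entire
    * thread context on interrupt entry.
    */
   void isr_exit(struct k_thread *interrupted)
   {
           /* Hand the interrupted thread back to the scheduler and
            * resume whichever thread it selects (possibly the same).
            */
           void *next = z_get_next_switch_handle(interrupted->switch_handle);

           restore_context(next); /* hypothetical helper */
   }
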
Note that while SMP requires :kconfig:option:`CONFIG_USE_SWITCH`, the reverse is not
true. A uniprocessor architecture built with :kconfig:option:`CONFIG_SMP` set to "n"
might still decide to implement its context switching using
:c:func:`arch_switch`.

API Reference
*************

.. doxygengroup:: spinlock_apis