doc: add developer primer

Add a Developer Primer and images, plus a tweak to figure formatting.
Also renamed from Hypervisor Primer to just Developer Primer, since the
doc talks about the Device Model too.

Signed-off-by: David B. Kinder <david.b.kinder@intel.com>
@@ -1,4 +1,4 @@
doxygen
_build
devicemodel
hypervisor
*.bak
*.sav
@@ -1,23 +0,0 @@
.. _hypervisor_primer:

Hypervisor Developer Primer
###########################

This Developer Primer introduces the fundamental components and
virtualization technology used by this open source reference hypervisor
stack. Code level documentation and additional details can be found by
consulting the :ref:`hypercall_apis` documentation and the source code
in GitHub.

The Hypervisor acts as a host with full control of the processor(s) and
the hardware (physical memory, interrupt management and I/O). It
provides the Guest OS with an abstraction of a virtual processor,
allowing the guest to think it is executing directly on a logical
processor.

.. _source tree structure:

Source Tree Structure
*********************

blah blah
@@ -27,7 +27,7 @@ Sections
   introduction/index.rst
   hardware.rst
   getting_started/index.rst
   hypervisor_primer/index.rst
   primer/index.rst
   release_notes.rst
   contribute.rst
   api/index.rst
(10 binary image files added for the Developer Primer figures; sizes range from 14 KiB to 89 KiB)
@@ -0,0 +1,903 @@
.. _primer:

Developer Primer
################

This Developer Primer introduces the fundamental components of ACRN and
the virtualization technology used by this open source reference stack.
Code level documentation and additional details can be found by
consulting the :ref:`acrn_apis` documentation and the `source code in
GitHub`_.

.. _source code in GitHub: https://github.com/projectacrn

The ACRN Hypervisor acts as a host with full control of the processor(s)
and the hardware (physical memory, interrupt management, and I/O). It
provides the User OS with an abstraction of a virtual platform, allowing
the guest to behave as if it were executing directly on a logical
processor.

.. _source tree structure:

Source Tree Structure
*********************

Understanding the ACRN hypervisor and ACRN device model source tree
structure is helpful for locating the code associated with a particular
hypervisor or device emulation feature. The ACRN hypervisor and ACRN
device model source trees provide the following top-level directories:

ACRN hypervisor source tree
===========================

**arch/x86/**
   hypervisor architecture, including the x86-specific source files
   needed to run the hypervisor, covering CPU, memory, interrupt, and
   VMX support.

**boot/**
   boot support, mainly ACPI-related.

**bsp/**
   board support package, used to support the NUC with UEFI.

**common/**
   common hypervisor source files, including the VM hypercall
   definitions, VM main loop, and VM software loader.

**debug/**
   all debug-related source files (not compiled into the release
   version), mainly the console, UART, logmsg, and shell.

**include/**
   include files for all public APIs (doxygen comments in these source
   files are used to generate the :ref:`acrn_apis` documentation).

**lib/**
   runtime service libraries.

ACRN Device Model source tree
=============================

**core/**
   ACRN Device Model core logic (main loop, SOS interface, etc.)

**hw/**
   hardware emulation code, with the following subdirectories:

   **acpi/**
      ACPI table generator.

   **pci/**
      PCI devices, including VBS-Us (virtio backend drivers in
      user-space).

   **platform/**
      platform devices such as the UART and keyboard.

**include/**
   include files for all public APIs (doxygen comments in these source
   files are used to generate the :ref:`acrn_apis` documentation).

**samples/**
   sample scripts and configurations for launching a User OS.

ACRN documentation source tree
==============================

Project ACRN documentation is written using the reStructuredText markup
language (.rst file extension) with Sphinx extensions, and processed
using Sphinx to create a formatted stand-alone website (the one you're
reading now). Developers can view this content either in its raw form
as .rst markup files in the acrn-documentation repo, or they can
generate the HTML content and view it with a web browser directly on
their workstation, which is useful when contributing documentation to
the project.

**api/**
   ReST files for API document generation

**custom-doxygen/**
   customization files for the doxygen-generated HTML output (we
   currently don't include the doxygen HTML output, but we do use the
   doxygen XML output to feed into the Sphinx generation process)

**getting_started/**
   ReST files and images for the Getting Started Guide

**primer/**
   ReST files and images for the Developer Primer

**images/**
   image files not specific to a document (logos, and such)

**introduction/**
   ReST files and images for the Introduction to Project ACRN

**scripts/**
   files used to assist building the documentation set

**static/**
   Sphinx folder for extras added to the generated output (such as
   custom CSS additions)

CPU virtualization
******************

The ACRN hypervisor uses static partitioning of the physical CPU cores,
providing each User OS a virtualized environment containing at least one
statically assigned physical CPU core. The CPUID features for a
partitioned physical core are the same as the native CPU features. CPU
power management (Cx/Px) is managed by the User OS.

The supported Intel |reg| NUC platform (see :ref:`hardware`) has a CPU
with four cores. The Service OS is assigned one core, and the other
three cores are assigned to the User OS. ``XSAVE`` and ``XRSTOR``
instructions (used to perform a full save/restore of the extended state
in the processor to/from memory) are currently not supported in the
User OS, so the kernel boot parameters must specify ``noxsave``.
Processor core sharing among User OSes is planned for a future release.

The following sections introduce CPU virtualization related concepts
and technologies.

Host GDT
========

The ACRN hypervisor initializes the host Global Descriptor Table (GDT),
used to define the characteristics of the various memory areas during
program execution. Code Segment ``CS:0x8`` and Data Segment ``DS:0x10``
are configured as hypervisor selectors, with their settings in the host
GDT as shown in :numref:`host-gdt`:

.. figure:: images/primer-host-gdt.png
   :align: center
   :name: host-gdt

   Host GDT
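
For illustration only, here is a minimal C sketch (an assumption-laden
sketch, not ACRN's actual code) of a host GDT with this layout: a
mandatory null descriptor, a ring-0 64-bit code segment at selector
0x8, and a data segment at selector 0x10. The descriptor constants are
the canonical long-mode encodings:

.. code-block:: c

   #include <stdint.h>

   static uint64_t host_gdt[] = {
       0x0000000000000000ULL, /* 0x00: mandatory null descriptor     */
       0x00209a0000000000ULL, /* 0x08: code, ring 0, long mode (L=1) */
       0x0000920000000000ULL, /* 0x10: data, ring 0, writable        */
   };

   struct gdt_ptr {
       uint16_t limit;        /* size of the GDT in bytes, minus one */
       uint64_t base;         /* linear address of the first entry   */
   } __attribute__((packed));

   static void load_host_gdt(void)
   {
       struct gdt_ptr ptr = {
           .limit = sizeof(host_gdt) - 1,
           .base  = (uint64_t)host_gdt,
       };

       /* Load the GDT register; segment registers are reloaded after. */
       __asm__ volatile("lgdt %0" : : "m"(ptr));
   }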

Host IDT
========

The ACRN hypervisor installs interrupt gates for both exceptions and
interrupt vectors, which means interrupts are automatically disabled
when an exception or interrupt handler is entered. The
``HOST_GDT_RING0_CODE_SEL`` selector is used in the host IDT entries.

Guest SMP Booting
=================

The Bootstrap Processor (BSP) vCPU for the User OS boots into x64 long
mode directly, while the Application Processor (AP) vCPUs boot into
real mode. The virtualized Local Advanced Programmable Interrupt
Controller (vLAPIC) for the User OS in the hypervisor emulates the
INIT/STARTUP signals.

The AP vCPUs belonging to the User OS begin in an infinite loop,
waiting for an INIT signal. Once the User OS issues a Startup IPI
(SIPI) signal to another vCPU, the vLAPIC traps the request, resets the
target vCPU, and then enters the ``INIT->STARTUP#1->STARTUP#2`` cycle
to boot the vCPUs for the User OS.

VMX configuration
=================

The ACRN hypervisor uses the Virtual Machine Extensions (VMX)
configuration shown in :numref:`VMX_MSR` below. (These configuration
settings may change in the future, according to virtualization
policies.)

.. table:: VMX Configuration
   :align: center
   :widths: auto
   :name: VMX_MSR

   +----------------------------------------+----------------+---------------------------------------+
   | **VMX MSR**                            | **Bits**       | **Description**                       |
   +========================================+================+=======================================+
   | **MSR\_IA32\_VMX\_PINBASED\_CTLS**     | Bit0 set       | Enable External IRQ VM Exit           |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit6 set       | Enable HV pre-40ms preemption timer   |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit7 clear     | Posted interrupts not supported       |
   +----------------------------------------+----------------+---------------------------------------+
   | **MSR\_IA32\_VMX\_PROCBASED\_CTLS**    | Bit25 set      | Enable I/O bitmap                     |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit28 set      | Enable MSR bitmap                     |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit19,20 set   | Enable CR8 store/load                 |
   +----------------------------------------+----------------+---------------------------------------+
   | **MSR\_IA32\_VMX\_PROCBASED\_CTLS2**   | Bit1 set       | Enable EPT                            |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit7 set       | Allow guest real mode                 |
   +----------------------------------------+----------------+---------------------------------------+
   | **MSR\_IA32\_VMX\_EXIT\_CTLS**         | Bit15          | Acknowledge interrupt on VM Exit      |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit18,19       | MSR IA32\_PAT save/load               |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit20,21       | MSR IA32\_EFER save/load              |
   +                                        +----------------+---------------------------------------+
   |                                        | Bit9           | 64-bit mode after VM Exit             |
   +----------------------------------------+----------------+---------------------------------------+

CPUID and Guest TSC calibration
===============================

User OS access to CPUID is trapped by the ACRN hypervisor; however, the
hypervisor passes through most of the native CPUID information to the
guest, with the exception of CPUID leaf 0x1, which is virtualized (to
provide a fake x86_model).

The Time Stamp Counter (TSC) is a 64-bit register present on all x86
processors that counts the number of cycles since reset. The ACRN
hypervisor also virtualizes ``MSR_PLATFORM_INFO`` and
``MSR_ATOM_FSB_FREQ``, so the guest can read the TSC frequency directly
instead of calibrating it.
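
To see the trapped-but-mostly-passed-through behavior from inside a
guest, a small user-space program can query CPUID leaf 0x1. This is a
minimal illustrative sketch using the standard GCC ``cpuid.h`` helper,
not ACRN code; under ACRN the EAX family/model/stepping value is the
virtualized one:

.. code-block:: c

   #include <cpuid.h>
   #include <stdio.h>

   int main(void)
   {
       unsigned int eax, ebx, ecx, edx;

       /* Leaf 0x1: family/model/stepping and feature flags. */
       if (!__get_cpuid(0x1, &eax, &ebx, &ecx, &edx))
           return 1;

       printf("family/model/stepping (EAX): 0x%08x\n", eax);
       printf("feature flags (ECX/EDX):     0x%08x 0x%08x\n", ecx, edx);
       return 0;
   }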

RDTSC/RDTSCP
============

User OS vCPU reads of ``RDTSC``, ``RDTSCP``, or ``MSR_IA32_TSC_AUX`` do
not cause a VM Exit to the hypervisor. Thus the vCPU ID provided by
``MSR_IA32_TSC_AUX`` can be changed by the User OS.

The ``RDTSCP`` instruction is widely used by the ACRN hypervisor to
identify the current CPU (and read the current value of the processor's
time-stamp counter). Because there is no VM Exit for the
``MSR_IA32_TSC_AUX`` register, the hypervisor saves and restores the
``MSR_IA32_TSC_AUX`` value on every VM Exit and Enter. Before the
hypervisor restores the host CPU ID, we must not use a ``RDTSCP``
instruction, because it would return the vCPU ID instead of the host
CPU ID.
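
The following minimal sketch (standard compiler intrinsics, not ACRN
code) shows why ``MSR_IA32_TSC_AUX`` matters: ``RDTSCP`` returns the
TSC and also loads that MSR (which the OS typically programs with the
CPU number) into its output parameter, so the hypervisor must preserve
it across VM Exit/Enter:

.. code-block:: c

   #include <stdint.h>
   #include <stdio.h>
   #include <x86intrin.h>

   int main(void)
   {
       unsigned int aux;

       /* __rdtscp reads the TSC and stores MSR_IA32_TSC_AUX in aux. */
       uint64_t tsc = __rdtscp(&aux);

       printf("TSC=%llu, TSC_AUX (CPU id)=%u\n",
              (unsigned long long)tsc, aux);
       return 0;
   }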

CR Register virtualization
==========================

Guest CR8 access causes a VM Exit and is emulated in the hypervisor so
the vLAPIC can update its PPR register. Guest access to CR3 does not
cause a VM Exit.

MSR BITMAP
==========

In the ACRN hypervisor, only these model-specific registers (MSRs) are
supported:

**MSR_IA32_TSC_DEADLINE**
   emulates the guest TSC-deadline timer

**MSR_PLATFORM_INFO**
   emulates a fake x86 model

**MSR_ATOM_FSB_FREQ**
   provides the CPU frequency directly via this MSR to avoid TSC
   calibration

I/O BITMAP
==========

All User OS I/O port accesses are trapped into the ACRN hypervisor by
default. Most of the Service OS I/O port accesses are not trapped into
the ACRN hypervisor, allowing the Service OS direct access to the
hardware port.

The Service OS I/O trap policy is:

**0x3F8/0x3FC**
   trapped, for the vUART emulated inside the hypervisor (SOS only)

**0x20/0xA0/0x460**
   trapped, for vPIC emulation in the hypervisor

**0xCF8/0xCFC**
   trapped, for hypervisor PCI device interception

Exceptions
==========

The User OS handles its exceptions inside the VM, including page
faults, #GP, etc. The #MC and #DB exceptions cause a VM Exit and are
reported on the ACRN hypervisor console.

Memory virtualization
*********************

The ACRN hypervisor provides memory virtualization by using a static
partition of system memory. Each virtual machine owns its own
contiguous partition of memory, with the Service OS staying in lower
memory and the User OS instances in high memory. (High memory is memory
which is not permanently mapped in the kernel address space, while low
memory is always mapped, so you can access it in the kernel simply by
dereferencing a pointer.) In future implementations, this will evolve
to utilize EPT/VT-d.

ACRN hypervisor memory is not visible to any User OS. In the ACRN
hypervisor, there are a few kinds of memory accesses that need to work
efficiently:

- ACRN hypervisor access to host memory
- vCPU per VM access to guest memory
- vCPU per VM access to host memory
- vCPU per VM access to MMIO memory

The rest of this section introduces how these kinds of memory accesses
are managed. It gives an overview of the physical memory layout,
paravirtualization (MMU) memory mapping in the hypervisor and VMs, and
host-guest Extended Page Table (EPT) memory mapping for each VM.

Physical Memory Layout
======================

An example physical memory layout for the Service OS and User OS is
shown in :numref:`primer-mem-layout` below:

.. figure:: images/primer-mem-layout.png
   :align: center
   :name: primer-mem-layout

   Memory Layout

:numref:`primer-mem-layout` shows an example of the physical memory
layout of the Service and User OS. The Service OS accepts the whole
e820 table (all usable memory address ranges not reserved for use by
the BIOS), after the hypervisor memory has been filtered out. From the
SOS's point of view, it takes control of all available physical memory
not used by the hypervisor (or BIOS), including User OS memory. Each
User OS's memory is allocated from (high) SOS memory, and the User OS
controls only that section of memory.

Some of the physical memory of a 32-bit machine needs to be sacrificed
by making it hidden, so memory-mapped I/O (MMIO) devices have room to
communicate. This creates an MMIO hole: VMs may access some ranges of
MMIO addresses directly to communicate with devices, or they may need
the hypervisor to trap some ranges of MMIO for device emulation. This
access control is done through EPT mapping.

PV (MMU) Memory Mapping in the Hypervisor
=========================================

.. figure:: images/primer-pv-mapping.png
   :align: center
   :name: primer-pv-mapping

   ACRN Hypervisor PV Mapping Example

The ACRN hypervisor is trusted and can access and control all system
memory, as shown in :numref:`primer-pv-mapping`. Because the hypervisor
is running in protected mode, an MMU page table must be prepared for
its PV translation. To simplify things, the PV translation page table
is set as a 1:1 mapping. Some MMIO range mappings could be removed if
they are not needed. This PV page table is created when the hypervisor
memory is first initialized.

PV (MMU) Memory Mapping in VMs
==============================

As mentioned earlier, the primary vCPU starts to run in protected mode
when its VM is started. But before it begins, a temporary PV (MMU) page
table must be prepared.

This page table is a 1:1 mapping for 4 GB, and it only lives for a
short time when the vCPU first runs. After the vCPU starts to run its
kernel image (for example Linux\*), the kernel creates its own PV page
tables, after which the temporary page table becomes obsolete.

Host-Guest (EPT) Memory Mapping
===============================

The VMs (both SOS and UOS) need an Extended Page Table (EPT) to access
host physical memory based on their guest physical memory. The guest
VMs also need MMIO traps that trigger EPT violations for device
emulation (such as the IOAPIC and LAPIC). This memory layout is shown
in :numref:`primer-sos-ept-mapping`:

.. figure:: images/primer-sos-ept-mapping.png
   :align: center
   :name: primer-sos-ept-mapping

   SOS EPT Mapping Example

The SOS takes control of all the host physical memory space: its EPT
mapping covers almost all of the host memory, except the memory
reserved for the hypervisor (HV) and a few MMIO trap ranges for IOAPIC
and LAPIC emulation. The guest-to-host mapping for the SOS is 1:1.

.. figure:: images/primer-uos-ept-mapping.png
   :align: center
   :name: primer-uos-ept-mapping

   UOS EPT Mapping Example

For the UOS, however, the memory EPT mapping is linear but with an
offset (as shown in :numref:`primer-uos-ept-mapping`). The MMIO hole is
left unmapped so that all MMIO accesses from the UOS are trapped (and
emulated in the device model). To support pass-through devices in the
future, some MMIO range mappings may be added.
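
The following minimal sketch (hypothetical constants and names, not
ACRN code) captures the linear-with-offset idea just described: every
guest physical address outside the MMIO hole maps to host memory at a
fixed offset, while addresses inside the hole are deliberately left
unmapped so they trap for device emulation:

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define UOS_HPA_OFFSET   0x100000000ULL /* assumed per-VM base offset */
   #define MMIO_HOLE_START  0xC0000000ULL  /* illustrative hole range    */
   #define MMIO_HOLE_END    0x100000000ULL

   /* Returns true and fills *hpa when the GPA is backed by host memory;
    * returns false for the MMIO hole (access traps via EPT violation). */
   static bool uos_gpa_to_hpa(uint64_t gpa, uint64_t *hpa)
   {
       if (gpa >= MMIO_HOLE_START && gpa < MMIO_HOLE_END)
           return false;            /* unmapped: EPT violation, emulate */

       *hpa = gpa + UOS_HPA_OFFSET; /* linear mapping with an offset */
       return true;
   }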

Graphic mediation
*****************

Intel |reg| Graphics Virtualization Technology -g (Intel |reg| GVT-g)
provides GPU sharing capability to multiple VMs by using a mediated
pass-through technique. This allows a VM to access performance-critical
I/O resources (usually partitioned) directly, without intervention from
the hypervisor in most cases.

Privileged operations from this VM are trapped and emulated to provide
secure isolation among VMs. The hypervisor must ensure that no
vulnerability is exposed when assigning performance-critical resources
to each VM. When a performance-critical resource cannot be partitioned,
a scheduler must be implemented (either in software or hardware) to
allow time-based sharing among multiple VMs. In this case, the device
must allow the hypervisor to save and restore the hardware state
associated with the shared resource, either through direct I/O register
read/write (when there is no software-invisible state) or through a
device-specific context save/restore mechanism (when there is a
software-invisible state).

In the initial release of Project ACRN, graphic mediation is not
enabled; it is planned for a future release.

I/O emulation
*************

The I/O path is explained in the :ref:`ACRN-io-mediator` section of
the :ref:`introduction`. The following sections provide an introduction
to device assignment management and the PIO/MMIO trap flow.

Device Assignment Management
============================

The ACRN hypervisor provides the major device assignment management.
Since the hypervisor owns all native vectors and IRQs, there must be a
mapping table from guest IRQ/vector to host IRQ/vector. Currently we
assign all devices to VM0 except the UART.

If a PCI device (with MSI/MSI-x) is assigned to a guest, the User OS
will program the PCI config space and set the guest vector for this
device. A hypercall ``CWP_VM_PCI_MSIX_FIXUP`` is provided: once the
guest programs the guest vector, the User OS may call this hypercall to
notify the ACRN hypervisor. The hypervisor allocates a host vector,
creates a guest-host mapping relation, and replaces the guest vector
with a real native vector for the device:

**PCI MSI/MSI-X**
   PCI Message Signaled Interrupts (MSI/MSI-x) from devices can be
   triggered from a hypercall when a guest programs vectors. All PCI
   devices are programmed with real vectors allocated by the
   hypervisor.

**PCI/INTx**
   Device assignment is triggered when the guest programs the virtual
   Advanced I/O Programmable Interrupt Controller (vIOAPIC) Redirection
   Table Entries (RTE).

**Legacy**
   Legacy devices are assigned to VM0.

User OS device assignment is similar to the above, except the User OS
doesn't issue the hypercall. Instead, guest programming of the PCI
configuration space is trapped into the Device Model, and the Device
Model may issue the hypercall to notify the hypervisor that the guest
vector is changing.

Currently, two types of I/O emulation are supported: MMIO and PORTIO
trap handling. MMIO emulation is triggered by an EPT violation VM Exit
only. If an EPT misconfiguration VM Exit occurs, the hypervisor will
halt the system. (Because the hypervisor sets up all EPT page table
mappings at the beginning of the guest boot, there should be no EPT
misconfiguration.)

There are multiple places where I/O emulation can happen: in the ACRN
hypervisor, in the Service OS kernel VHM module, or in the Service OS
user-land ACRN Device Model.

PIO/MMIO trap Flow
==================

Here is a description of the PIO/MMIO trap flow:

1. Instruction decoder: get the Guest Physical Address (GPA) from the
   VM Exit, going through the gla2gpa() page walker if necessary.

2. Emulate the instruction. Here the hypervisor does an address range
   check to see whether the hypervisor is interested in this I/O port
   or MMIO GPA access.

3. The hypervisor emulates the vLAPIC, vIOAPIC, vPIC, and vUART only
   (the vUART for the Service OS only). Any other emulation requests
   are forwarded to the SOS for handling. The vCPU raising the I/O
   request will halt until this I/O request is processed successfully.
   An IPI is sent to vCPU0 of the SOS to notify it that an I/O request
   is waiting for service.

4. The Service OS VHM module takes the I/O request and dispatches it to
   one of multiple clients. These clients could be the SOS kernel-space
   VBS-K, MPT, or the user-land Device Model. The VHM I/O request
   server selects a default fallback client that is responsible for
   handling any I/O request not handled by the other clients. (The
   Device Model is the default fallback client.) Each client needs to
   register its I/O range or specific PCI bus/device/function (BDF)
   numbers. If an I/O request falls into a client's range, the I/O
   request server will send the request to that client (see the sketch
   below).

5. The multiple clients are the fallback client (Device Model in
   user-land), the VBS-K client, and the MPT client. Once the I/O
   request emulation completes, the client updates the request status
   and notifies the hypervisor through a hypercall. The hypervisor
   picks up that request, does any necessary cleanup, and resumes the
   guest vCPU.

Most I/O emulation tasks are done by the SOS CPU, and requests come
from UOS vCPUs.
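
Here is a minimal sketch (hypothetical types and names, not the actual
VHM API) of the client-dispatch idea in step 4: each client registers
an I/O range, and a request that matches no registered range goes to
the fallback client (the Device Model):

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   struct io_client {
       const char *name;
       uint64_t    start;  /* first address of the registered range */
       uint64_t    end;    /* last address of the registered range  */
       void      (*handle)(uint64_t addr, int is_write, uint32_t *val);
   };

   static struct io_client *clients[8];
   static size_t num_clients;
   static struct io_client *fallback_client;  /* the Device Model */

   static void dispatch_io_request(uint64_t addr, int is_write,
                                   uint32_t *val)
   {
       for (size_t i = 0; i < num_clients; i++) {
           if (addr >= clients[i]->start && addr <= clients[i]->end) {
               clients[i]->handle(addr, is_write, val);
               return;
           }
       }
       /* No registered range matched: use the default fallback client. */
       fallback_client->handle(addr, is_write, val);
   }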

Virtual interrupt
*****************

All interrupts received by the User OS come from a virtual interrupt
injected by a vLAPIC, vIOAPIC, or vPIC. All device emulation is done
inside the SOS user-space Device Model. However, for performance
reasons, the vLAPIC, vIOAPIC, and vPIC devices are emulated inside the
ACRN hypervisor directly. From the guest point of view, the vPIC uses
Virtual Wire Mode via the vIOAPIC.

The symmetric I/O mode is shown in :numref:`primer-symmetric-io`:

.. figure:: images/primer-symmetric-io.png
   :align: center
   :name: primer-symmetric-io

   Symmetric I/O Mode

**Kernel boot param with vPIC**
   add ``maxcpus=0`` to have the User OS use the PIC

**Kernel boot param with vIOAPIC**
   add ``maxcpus=1`` (any value other than "0") to have the User OS use
   the IOAPIC. IOAPIC pin 2 is kept as the source of PIC interrupts.

Virtual LAPIC
=============

The LAPIC (Local Advanced Programmable Interrupt Controller) is
virtualized for the SOS and UOS. The vLAPIC is currently emulated by a
guest MMIO trap on the GPA address range 0xFEE00000 - 0xFEF00000
(1 MB). The ACRN hypervisor will support APIC-v and posted interrupts
in a future release.

The vLAPIC provides the same features as a native LAPIC:

- Mask/unmask vectors
- Inject virtual vectors (level or edge trigger mode) to a vCPU
- Notify the vIOAPIC of EOI processing
- Provide the TSC timer service
- Support CR8 updates of the TPR
- INIT/STARTUP handling

Virtual IOAPIC
==============

A vIOAPIC is emulated by the hypervisor when the guest accesses the
MMIO GPA range 0xFEC00000 - 0xFEC01000. The vIOAPIC for the SOS matches
the pin numbers of the native hardware IOAPIC. The vIOAPIC for the UOS
provides 24 pins only. When a vIOAPIC pin is asserted, the vIOAPIC
calls vLAPIC APIs to inject the vector into the guest.

Virtual PIC
===========

A vPIC is required for TSC calculation. Normally the UOS boots with the
vIOAPIC. The vPIC is a source of external interrupts to the guest. On
every VM Exit, the hypervisor checks whether there are pending external
PIC interrupts.

Virtual Interrupt Injection
===========================

Virtual interrupts come from either the Device Model or from assigned
devices:

**SOS assigned devices**
   Because all devices are assigned to the SOS directly, whenever a
   device's physical interrupt arrives, the corresponding virtual
   interrupt is injected into the SOS via the vLAPIC/vIOAPIC. In this
   case, the SOS doesn't use the vPIC and does not have emulated
   devices.

**UOS assigned devices**
   Only PCI devices are assigned to the UOS; virtual interrupt
   injection follows the same path as for the SOS. A virtual interrupt
   injection operation is triggered when the device's physical
   interrupt is triggered.

**UOS emulated devices**
   The Device Model (the user-land device model) is responsible for the
   interrupt lifecycle management of UOS emulated devices. The Device
   Model knows when an emulated device needs to assert a virtual
   IOAPIC/PIC pin or needs to send a virtual MSI vector to the guest.
   This logic is entirely handled by the Device Model.

:numref:`primer-hypervisor-interrupt` shows how the hypervisor handles
interrupt processing and pending interrupts (acrn_do_intr_process):

.. figure:: images/primer-hypervisor-interrupt.png
   :align: center
   :name: primer-hypervisor-interrupt

   Hypervisor Interrupt Handler

There are many cases where the guest RFLAGS.IF is cleared and
interrupts are disabled. The hypervisor checks whether the guest IRQ
window is available before injection; an NMI is injected regardless of
the guest IRQ window status, since it is unmaskable. If the IRQ window
is not currently available, the hypervisor enables
``MSR_IA32_VMX_PROCBASED_CTLS_IRQ_WIN`` (PROCBASED_CTRL.bit[2]) and
does a VM Enter directly. The injection will then be done on the next
VM Exit, once the guest issues STI (RFLAGS.IF=1).
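
A minimal sketch of this interrupt-window logic follows; the VMCS
accessor names and the injection helper are assumptions for
illustration, not ACRN's actual functions:

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define RFLAGS_IF              (1ULL << 9) /* interrupt enable flag   */
   #define PROCBASED_CTRL_IRQ_WIN (1U << 2)   /* interrupt-window exiting */

   /* Assumed VMCS accessors and injection helper, for illustration. */
   uint64_t vmcs_get_guest_rflags(void);
   uint32_t vmcs_get_procbased_ctls(void);
   void     vmcs_set_procbased_ctls(uint32_t val);
   void     vmcs_inject_vector(uint8_t vector);

   static void inject_or_arm_window(uint8_t vector, bool is_nmi)
   {
       bool window_open = vmcs_get_guest_rflags() & RFLAGS_IF;

       if (is_nmi || window_open) {
           /* NMI ignores the IRQ window; normal IRQs need RFLAGS.IF=1. */
           vmcs_inject_vector(vector);
       } else {
           /* Arm interrupt-window exiting: the CPU forces a VM Exit as
            * soon as the guest re-enables interrupts (STI), and the
            * injection is done then. */
           vmcs_set_procbased_ctls(vmcs_get_procbased_ctls() |
                                   PROCBASED_CTRL_IRQ_WIN);
       }
   }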

VT-x and VT-d
*************

Since 2006, Intel CPUs have supported hardware-assisted virtualization
(the VT-x instructions), where the CPU itself traps specific guest
instructions and register accesses directly into the VMM, without the
need for binary translation (and modification) of the guest operating
system. Guest operating systems can run natively without modification,
although it is common to still install virtualization-aware
para-virtualized drivers into the guests to improve functionality. One
common example is access to storage via emulated SCSI devices.

Intel CPUs and chipsets support various Virtualization Technology (VT)
features, such as VT-x and VT-d. Physical events on the platform
trigger CPU **VM Exits** (a trap into the VMM) to handle physical
events such as physical device interrupts.

In the ACRN hypervisor design, VT-d can be used for DMA remapping, such
as address translation and isolation.
:numref:`primer-dma-address-mapping` is an example of address
translation:

.. figure:: images/primer-dma-address-mapping.png
   :align: center
   :name: primer-dma-address-mapping

   DMA address mapping

Hypercall
*********

The ACRN hypervisor currently supports fewer than a dozen
:ref:`hypercall_apis` and VHM upcall APIs to support the necessary VM
management, I/O request distribution, and guest memory mappings. The
hypervisor and Service OS (SOS) reserve vector 0xF4 for hypervisor
notification to the SOS. This upcall is necessary whenever device
emulation is required by the SOS. The upcall vector 0xF4 is injected
into SOS vCPU0.
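
For orientation, here is a minimal sketch of how a hypercall is
typically issued from a kernel context with the ``VMCALL`` instruction.
The register convention shown (hypercall ID in R8, arguments in
RDI/RSI, return value in RAX) is an assumption for illustration, not
ACRN's documented ABI:

.. code-block:: c

   #include <stdint.h>

   /* Issue a two-argument hypercall; must run in guest kernel mode,
    * since VMCALL faults (#UD/#GP) when executed from user space. */
   static inline int64_t hypercall2(uint64_t id, uint64_t arg0,
                                    uint64_t arg1)
   {
       int64_t ret;

       __asm__ volatile("movq %1, %%r8\n\t"
                        "vmcall"
                        : "=a"(ret)
                        : "r"(id), "D"(arg0), "S"(arg1)
                        : "r8", "memory");
       return ret;
   }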

Refer to the :ref:`acrn_apis` documentation for details.

Device emulation
****************

The ACRN Device Model emulates different kinds of platform devices,
such as the RTC, LPC, UART, PCI devices, and virtio block device. The
most important thing about device emulation is handling the I/O
requests from different devices. An I/O request could be a PIO, MMIO,
or PCI CFG SPACE access. For example:

- a CMOS RTC device may access 0x70/0x71 PIO to get the CMOS time,
- a GPU PCI device may access its MMIO or PIO BAR space to complete
  its framebuffer rendering, or
- the bootloader may access a PCI device's CFG SPACE for BAR
  reprogramming.

The ACRN Device Model also injects interrupts/MSIs to its frontend
devices when necessary; for example, an RTC device needs to get its
ALARM interrupt, and a PCI device with MSI capability needs to get its
MSI. The Device Model also provides a PIRQ routing mechanism for
platform devices.

Virtio Devices
**************

This section introduces the virtio devices supported by ACRN.
Currently, all the back-end (BE) virtio drivers are implemented using
the virtio APIs, and the front-end (FE) drivers reuse the standard
Linux front-end virtio drivers.
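
For reference, the descriptor layout shared by FE and BE drivers for
each virtqueue, as defined in the virtio specification, looks like
this in C: the FE driver fills descriptors with guest-physical buffer
addresses, and the BE driver reads or writes those buffers and marks
the request used:

.. code-block:: c

   #include <stdint.h>

   #define VIRTQ_DESC_F_NEXT   1  /* buffer continues in `next` field  */
   #define VIRTQ_DESC_F_WRITE  2  /* buffer is written by the device   */

   struct virtq_desc {
       uint64_t addr;   /* guest-physical address of the buffer */
       uint32_t len;    /* length of the buffer in bytes        */
       uint16_t flags;  /* VIRTQ_DESC_F_* bits                  */
       uint16_t next;   /* index of the chained descriptor      */
   } __attribute__((packed));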

Virtio-rnd
==========

The virtio-rnd entropy device supplies high-quality randomness for
guest use. The virtio device ID of the virtio-rnd device is 4, and it
supports one virtqueue of 64 entries (configurable in the source code).
No feature bits are defined.

When the FE driver requires random bytes, the BE device places bytes of
random data onto the virtqueue.

To launch the virtio-rnd device, you can use the following command:

.. code-block:: bash

   ./acrn-dm -A -m 1168M \
     -s 0:0,hostbridge \
     -s 1,virtio-blk,./uos.img \
     -s 2,virtio-rnd \
     -k bzImage \
     -B "root=/dev/vda rw rootwait noxsave maxcpus=0 nohpet \
         console=hvc0 no_timer_check ignore_loglevel \
         log_buf_len=16M consoleblank=0 tsc=reliable" vm1

To verify the result on the User OS side, you can use the following
command:

.. code-block:: bash

   od /dev/random

Virtio-blk
==========

The virtio-blk device is a simple virtual block device. The FE driver
places read, write, and other requests onto the virtqueue, so that the
BE driver can process them accordingly.

The virtio device ID of virtio-blk is 2, and it supports one virtqueue
with 64 entries, configurable in the source code. The feature bits
supported by the BE device are as follows:

**VTBLK\_F\_SEG\_MAX (bit 2)**
   maximum number of segments in a request is in seg\_max.

**VTBLK\_F\_BLK\_SIZE (bit 6)**
   block size of disk is in blk\_size.

**VTBLK\_F\_FLUSH (bit 9)**
   cache flush command support.

**VTBLK\_F\_TOPOLOGY (bit 10)**
   device exports information on optimal I/O alignment.

To use the virtio-blk device, use the following command:

.. code-block:: bash

   ./acrn-dm -A -m 1168M \
     -s 0:0,hostbridge \
     -s 1,virtio-blk,./uos.img \
     -k bzImage -B "root=/dev/vda rw rootwait noxsave maxcpus=0 \
        nohpet console=hvc0 no_timer_check ignore_loglevel \
        log_buf_len=16M consoleblank=0 tsc=reliable" vm1

To verify the result, you should expect the User OS to boot
successfully.

Virtio-net
==========

The virtio-net device is a virtual Ethernet device. The virtio device
ID of virtio-net is 1. The virtio-net device supports two virtqueues,
one for transmitting packets and the other for receiving packets. The
FE driver places empty buffers onto one virtqueue for receiving
packets, and enqueues outgoing packets onto the other virtqueue for
transmission. Currently the size of each virtqueue is 1000,
configurable in the source code.

To access the external network from the User OS, an L2 virtual switch
should be created in the Service OS, and the BE driver is bonded to a
tap/tun device linked under the L2 virtual switch. See
:numref:`primer-virtio-net`:

.. figure:: images/primer-virtio-net.png
   :align: center
   :name: primer-virtio-net

   Accessing the external network from the User OS

Currently the feature bits supported by the BE device are:

**VIRTIO\_NET\_F\_MAC (bit 5)**
   device has a given MAC address.

**VIRTIO\_NET\_F\_MRG\_RXBUF (bit 15)**
   BE driver can merge receive buffers.

**VIRTIO\_NET\_F\_STATUS (bit 16)**
   configuration status field is available.

**VIRTIO\_F\_NOTIFY\_ON\_EMPTY (bit 24)**
   device will issue an interrupt if it runs out of available
   descriptors on a virtqueue.

To enable the virtio-net device, use the following command:

.. code-block:: bash

   ./acrn-dm -A -m 1168M \
     -s 0:0,hostbridge \
     -s 1,virtio-blk,./uos.img \
     -s 2,virtio-net,tap0 \
     -k bzImage -B "root=/dev/vda rw rootwait noxsave maxcpus=0 \
        nohpet console=hvc0 no_timer_check ignore_loglevel \
        log_buf_len=16M consoleblank=0 tsc=reliable" vm1

To verify the correctness of the device, the external network should be
accessible from the User OS.

Virtio-console
==============

The virtio-console device is a simple device for data input and output.
The virtio device ID of the virtio-console device is 3. A device can
have from one to 16 ports. Each port has a pair of input and output
virtqueues used to communicate information between the FE and BE
drivers. Currently the size of each virtqueue is 64, configurable in
the source code.

Similar to the virtio-net device, the two virtqueues specific to a port
are a transmit virtqueue and a receive virtqueue. The FE driver places
empty buffers onto the receive virtqueue for incoming data, and
enqueues outgoing characters onto the transmit virtqueue.

Currently the feature bits supported by the BE device are:

**VTCON\_F\_SIZE (bit 0)**
   configuration columns and rows are valid.

**VTCON\_F\_MULTIPORT (bit 1)**
   device supports multiple ports, and control virtqueues will be used.

**VTCON\_F\_EMERG\_WRITE (bit 2)**
   device supports emergency write.

Virtio-console supports redirecting guest output to various backend
devices, including stdio/pty/tty. Use the syntax below to specify which
backend to use:

.. code-block:: none

   virtio-console,[@]stdio|tty|pty:portname[=portpath][,[@]stdio|tty|pty:portname[=portpath]]

For example, to use stdio as a virtio-console backend, use the
following command:

.. code-block:: bash

   ./acrn-dm -A -m 1168M \
     -s 0:0,hostbridge \
     -s 1,virtio-blk,./uos.img \
     -s 3,virtio-console,@stdio:stdio_port \
     -k bzImage -B "root=/dev/vda rw rootwait noxsave maxcpus=0 \
        nohpet console=hvc0 no_timer_check ignore_loglevel \
        log_buf_len=16M consoleblank=0 tsc=reliable" vm1

You can then log into the User OS:

.. code-block:: none

   Ubuntu 17.04 xubuntu hvc0
   xubuntu login: root
   Password:

To use pty as a virtio-console backend, use the following command:

.. code-block:: bash

   ./acrn-dm -A -m 1168M \
     -s 0:0,hostbridge \
     -s 1,virtio-blk,./uos.img \
     -s 2,virtio-net,tap0 \
     -s 3,virtio-console,@pty:pty_port \
     -k ./bzImage -B "root=/dev/vda rw rootwait noxsave maxcpus=0 \
        nohpet console=hvc0 no_timer_check ignore_loglevel \
        log_buf_len=16M consoleblank=0 tsc=reliable" vm1 &

When ACRN-DM boots the User OS successfully, a log similar to the
following is shown:

.. code-block:: none

   **************************************************************
   virt-console backend redirected to /dev/pts/0
   **************************************************************

You can then use one of the following commands to log into the User OS:

.. code-block:: bash

   minicom -D /dev/pts/0

or

.. code-block:: bash

   screen /dev/pts/0

@@ -18,11 +18,17 @@
  color: rgba(255,255,255,1);
}

/* add some space before the figure caption */
p.caption {
  /* border-top: 1px solid; */
  margin-top: 1em;
}

/* add a colon after the figure/table number (before the caption) */
span.caption-number::after {
  content: ": ";
}

/* make .. hlist:: tables fill the page */
table.hlist {
  width: 95% !important;