In the local APIC passthrough case, when a device triggers an INTx interrupt,
the interrupt is delivered to the vCPU directly. For this to work, the virtual
vector needs to be programmed into the 'Interrupt Vector' field (bits 7:0) of
the physical IOxAPIC I/O REDIRECTION TABLE REGISTER and into the 'Vector'
field of the VT-d Interrupt Remapping Table Entry (IRTE) for remapped
interrupts.
Assumptions:
(a) IOAPIC pins won't be shared between the LAPIC PT guest and other guests;
(b) The guest will not trigger this IRQ before it switches to x2APIC mode.
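A rough sketch of the vector programming described above (the RTE/IRTE layouts
follow the IOAPIC and VT-d specs; the type and field names here are
illustrative, not the actual ACRN definitions):

    /* stand-in layouts: RTE bits 7:0 = vector, IRTE low qword bits 23:16 = vector */
    struct ioapic_rte { uint64_t full; };
    struct dmar_irte  { uint64_t lo_64; };

    static void remap_guest_intx_vector(struct ioapic_rte *rte, struct dmar_irte *irte,
                                        uint8_t virt_vector)
    {
        /* 'Interrupt Vector' field of the physical I/O REDIRECTION TABLE REGISTER */
        rte->full = (rte->full & ~0xFFUL) | (uint64_t)virt_vector;
        /* 'Vector' field of the VT-d IRTE for remapped interrupts */
        irte->lo_64 = (irte->lo_64 & ~(0xFFUL << 16U)) | ((uint64_t)virt_vector << 16U);
    }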
Tracked-On: #5923
Signed-off-by: Zide Chen <zide.chen@intel.com>
Check some essential VMX capabilities; panic if the processor doesn't support
them.
Tracked-On: #6584
Signed-off-by: Mingqiang Chi <mingqiang.chi@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
In the current RDT code, if CDP is configured, the num_closids calculation for
the L2/L3 resources is wrong:
res_cap_info[res].num_closids = (uint16_t)((edx & 0xffffU) >> 1U) + 1U;
It should be:
res_cap_info[res].num_closids = (uint16_t)(((edx & 0xffffU) + 1U) >> 1U);
since EDX[15:0] reports the maximum CLOS ID (the number of CLOS IDs minus one)
and CDP halves the number of usable CLOS IDs. For example, if EDX[15:0] is 15
(16 CLOS IDs), CDP leaves 8 usable CLOS IDs.
Also, in order to enable CDP system-wide, the CDP bit (bit 0) needs to be
enabled on all pcpus, not just on pcpu 0.
Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
Rename the clos_max field in struct rdt_info to num_closids
Rename variable valid_clos_num to common_num_closids and make it static
Tracked-On: #5917
Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>
These dirty flags are supposed to be per VMCS12, so move them from the
per vCPU acrn_nested struct to the newly added acrn_vvmcs struct.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
This variable represents the L1 GPA of the current VMCS12. But it's
no longer needed in the multiple active VMCS12 case, which uses the
following variables for this purpose.
- nested->current_vvmcs refers to the vvmcs[] entry which contains the
cached current VMCS12, its associated VMCS02, and other context info.
- nested->current_vvmcs->vmcs12_gpa refers to the L1 GPA of this
current VMCS12.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Add an array of struct acrn_vvmcs to struct acrn_nested, so it is
possible to cache multiple active VMCS12s.
This patch declares the size of this array as 1, meaning that there is
only one active VMCS12. This minimizes the logical code changes.
Add pointer current_vvmcs to struct acrn_nested, which refers to the
current vvmcs[] entry. In this patch, if any VMCS12 is active, it
always points to vvmcs[0].
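A minimal sketch of the resulting layout (field names beyond the ones
mentioned above are assumptions):

    #define MAX_ACTIVE_VVMCS_NUM   1U   /* one active VMCS12 for now */

    struct acrn_vvmcs {
        uint64_t vmcs12_gpa;        /* L1 GPA of the cached VMCS12 */
        /* cached VMCS12, associated VMCS02, dirty flags, ... */
    };

    struct acrn_nested {
        struct acrn_vvmcs vvmcs[MAX_ACTIVE_VVMCS_NUM];
        struct acrn_vvmcs *current_vvmcs;  /* vvmcs[0] whenever a VMCS12 is active */
        /* ... */
    };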
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
No logical changes; this patch prepares for multiple active VMCS12 support.
- Currently it's easy to get the vmcs12 pointer from the vcpu pointer. In the
multiple active VMCS12 case, we need to explicitly add "struct
acrn_vmcs12 *vmcs12" to certain APIs' input argument lists in order to
get the desired vmcs12 pointer.
- Merge flush_current_vmcs12() into clear_vmcs02() for multiple reasons:
a) it's called only once; b) we don't wrap the opposite operation
(loading vmcs12) in an API; c) this API has simple and clear logic.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
By changing the way L1 VPIDs are assigned from bottom-up to top-down, the
chance of VPID conflicts between L1 and L2 guests becomes small.
Then we only need to flush the VPID when a conflict actually occurs.
Tracked-On: #6289
Signed-off-by: Anthony Xu <anthony.xu@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
At run time it's rare for L1 to write to the intercepted non-host-state
VMCS fields, so using multiple dirty flags is not necessary.
This patch uses a single dirty flag to manage all non-host-state VMCS
fields. This simplifies the current code, and in the future we may not
need to declare new dirty flags when we intercept more VMCS fields.
Tracked-On: #5923
Signed-off-by: Zide Chen <zide.chen@intel.com>
Currently vmptrld_vmexit_handler() doesn't sync VMX_EPT_POINTER_FULL
from vmcs12 to vmcs02; instead it sets gpa_field_dirty and relies on
nested_vmentry() to sync the EPTP in the next nested VM entry.
This creates a readability issue, since all other intercepted VMCS fields
are synced in sync_vmcs12_to_vmcs02(). Another issue is that the other
VMCS fields managed by gpa_field_dirty are repeatedly synced in both the
vmptrld and nested VM-entry handlers.
This patch moves get_nept_desc() ahead of sync_vmcs12_to_vmcs02(), so that
shadow_eptp is allocated before sync_vmcs12_to_vmcs02() and the EPTP can
be synced properly.
BTW, in nested_vmexit_handler() we don't need to read the exit reason from
the VMCS, since vcpu->arch.exit_reason already has it.
Tracked-On: #5923
Signed-off-by: Zide Chen <zide.chen@intel.com>
The locked btr instruction is expensive. This patch changes the logic to
ensure that the bitmap is non-zero before executing
bitmap_test_and_clear_lock().
The VMX transition time improves significantly: with SOS running on TGL,
the CPUID round trip drops from ~2400 cycles to ~2000 cycles.
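Conceptually the change looks like the following (a sketch, not the exact
ACRN code; the bit index and bitmap are whatever the caller already uses):

    if (*bitmap != 0UL) {                          /* cheap unlocked read first */
        if (bitmap_test_and_clear_lock(bit, bitmap)) {
            /* handle the pending request as before */
        }
    }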
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Currently ACRN assumes that firmware sets up IA32_PAT correctly. This patch
explicitly initializes the host IA32_PAT MSR according to ISDM Table 11-12,
"Memory Type Setting of PAT Entries Following a Power-up or Reset".
ACRN creates host page tables based on PAT0 (WB) and PAT3 (UC).
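A sketch of the initialization, using the power-up/reset defaults from that
table (PA0=WB, PA1=WT, PA2=UC-, PA3=UC, repeated for PA4-PA7; the macro name
is illustrative):

    /* bytes PA0..PA7, low to high: 06h 04h 07h 00h 06h 04h 07h 00h */
    #define PAT_POWER_ON_VALUE   0x0007040600070406UL

    msr_write(MSR_IA32_PAT, PAT_POWER_ON_VALUE);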
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
is_lapic_pt_enabled() is called at least twice in one loop of the vCPU
thread, and it's called frequently in vmexit_handler() if the LAPIC is not
passed through. Thus the efficiency of this function has a direct impact
on system performance.
Since the LAPIC mode is not changed at run time, we don't have to
calculate it on the fly in is_lapic_pt_enabled().
BTW, removed the unused lapic_mask from struct acrn_vcpu_arch.
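The idea is to compute the mode once when it changes and have
is_lapic_pt_enabled() return a cached flag; roughly (the field name is an
assumption):

    /* set once, when the guest switches to x2APIC mode with LAPIC pass-through */
    vcpu->arch.lapic_pt_enabled = true;

    bool is_lapic_pt_enabled(struct acrn_vcpu *vcpu)
    {
        return vcpu->arch.lapic_pt_enabled;   /* no recalculation on the hot path */
    }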
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
For platforms that do not support XSAVES/XRSTORS instructions, like QEMU,
executing these instructions causes #UD.
This patch adds the check before the execution of XSAVES/XRSTORS instructions.
It also refines the logic inside rstore_xsave_area for the following reason:
If XSAVES/XRSTORS instructions are supported, restore XSAVE area if any of the
following conditions is met:
1. "vcpu->launched" is false (state initialization for guest)
2. "vcpu->arch.xsave_enabled" is true (state restoring for guest)
* Before vCPU is launched, condition 1 is satisfied.
* After vCPU is launched, condition 2 is satisfied because
is_valid_xsave_combination() guarantees that "vcpu->arch.xsave_enabled"
is consistent with pcpu_has_cap(X86_FEATURE_XSAVES).
Therefore, the check against "vcpu->launched" and "vcpu->arch.xsave_enabled"
can be eliminated here.
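After the change, the restore path roughly looks like this (a sketch; the
write_xcr/xrstors helper names are assumptions, and the real function also
handles more state):

    static void rstore_xsave_area(const struct acrn_vcpu *vcpu,
                                  const struct ext_context *ectx)
    {
        if (pcpu_has_cap(X86_FEATURE_XSAVES)) {
            /*
             * No launched/xsave_enabled check needed: before launch this is
             * state initialization, and after launch xsave_enabled is kept
             * consistent with XSAVES support.
             */
            write_xcr(0, ectx->xcr0);
            msr_write(MSR_IA32_XSS, vcpu_get_guest_msr(vcpu, MSR_IA32_XSS));
            xrstors(&ectx->xs_area, UINT64_MAX);
        }
    }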
Tracked-On: #6481
Signed-off-by: Shiqing Gao <shiqing.gao@intel.com>
Acked-by: Eddie Dong <eddie.dong@Intel.com>
Use an unused MSR on the host to save the ACRN pcpu ID, avoiding the save
and restore of the TSC_AUX MSR on VMX transitions.
Tracked-On: #6289
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Reviewed-by: Eddie Dong <eddie.dong@intel.com>
This feature is guarded by the config CONFIG_SECURITY_VM_FIXUP, which
should be disabled by default.
This patch passes through native SMBIOS information to the pre-launched VM.
SMBIOS consists of a small entry point structure and a table; the entry
point structure is put in the 0xf0000-0xfffff region of the guest address
space, and the table is put in the ACPI_NVS region of the guest address
space.
v2 -> v3:
uuid_is_equal moved to util.h as inline API
result -> pVendortable, in function efi_search_guid
recalc_checksum -> generate_checksum
efi_search_smbios -> efi_search_smbios_eps
scan_smbios_eps -> mem_search_smbios_eps
EFI GUID definition kept
Tracked-On: #6320
Signed-off-by: Yifan Liu <yifan1.liu@intel.com>
This helps to improve performance:
- Don't need to execute VMREAD in vcpu_get_efer(), which is frequently
called.
- VMX_EXIT_CTLS_SAVE_EFER can be removed from VM-Exit Controls.
- If the value of IA32_EFER MSR is identical between the host and guest
(highly likely), adjust the VMX controls not to load IA32_EFER on
VMExit and VMEntry.
It's convenient to keep using the existing vcpu_s/get_efer() APIs rather
than the common vcpu_s/get_guest_msr().
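A sketch of the control adjustment for the case where host and guest EFER
match (the control/macro names are assumptions about the surrounding code):

    uint32_t exit_ctls  = exec_vmread32(VMX_EXIT_CONTROLS);
    uint32_t entry_ctls = exec_vmread32(VMX_ENTRY_CONTROLS);

    if (vcpu_get_efer(vcpu) == msr_read(MSR_IA32_EFER)) {
        /* no need to load IA32_EFER on VM exit / VM entry */
        exit_ctls  &= ~VMX_EXIT_CTLS_LOAD_EFER;
        entry_ctls &= ~VMX_ENTRY_CTLS_LOAD_EFER;
        exec_vmwrite32(VMX_EXIT_CONTROLS, exit_ctls);
        exec_vmwrite32(VMX_ENTRY_CONTROLS, entry_ctls);
    }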
Tracked-On: #6289
Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com>
Signed-off-by: Zide Chen <zide.chen@intel.com>
Mask off support of 57-bit linear addresses and five-level paging.
ICX-D has LA57 but ACRN doesn't support 5-level paging yet.
Tracked-On: #6357
Signed-off-by: Liang Yi <yi.liang@intel.com>
Signed-off-by: Li, Fei1 <fei1.li@intel.com>
If SOS uses kernel 5.4, the hypervisor panics with #GP.
Here is an example on KBL showing how the panic occurs when kernel 5.4 is
used:
Notes:
* Physical MSR_IA32_XSS[bit 8] is 1 when physical CPU boots up.
* vcpu_get_guest_msr(vcpu, MSR_IA32_XSS)[bit 8] is initialized to 0.
The following thread switches happen at run time:
1. idle thread -> vcpu thread
context_switch_in happens and rstore_xsave_area is called.
At this moment, vcpu->arch.xsave_enabled is false as vcpu is not launched yet
and init_vmcs is not called yet (where xsave_enabled is set to true).
Thus, physical MSR_IA32_XSS is not updated with the value of guest MSR_IA32_XSS.
States at this point:
* Physical MSR_IA32_XSS[bit 8] is 1.
* vcpu_get_guest_msr(vcpu, MSR_IA32_XSS)[bit 8] is 0.
2. vcpu thread -> idle thread
context_switch_out happens and save_xsave_area is called.
At this moment, vcpu->arch.xsave_enabled is true. Processor state is saved
to memory with XSAVES instruction. As physical MSR_IA32_XSS[bit 8] is 1,
ectx->xs_area.xsave_hdr.hdr.xcomp_bv[bit 8] is set to 1 after the execution
of XSAVES instruction.
States at this point:
* Physical MSR_IA32_XSS[bit 8] is 1.
* vcpu_get_guest_msr(vcpu, MSR_IA32_XSS)[bit 8] is 0.
* ectx->xs_area.xsave_hdr.hdr.xcomp_bv[bit 8] is 1.
3. idle thread -> vcpu thread
context_switch_in happens and rstore_xsave_area is called.
At this moment, vcpu->arch.xsave_enabled is true. Physical MSR_IA32_XSS is
updated with the value of guest MSR_IA32_XSS, which is 0.
States at this point:
* Physical MSR_IA32_XSS[bit 8] is 0.
* vcpu_get_guest_msr(vcpu, MSR_IA32_XSS)[bit 8] is 0.
* ectx->xs_area.xsave_hdr.hdr.xcomp_bv[bit 8] is 1.
Processor state is restored from memory with XRSTORS instruction afterwards.
According to SDM Vol1 13.12 OPERATION OF XRSTORS, a #GP occurs if XCOMP_BV
sets a bit in the range 62:0 that is not set in XCR0 | IA32_XSS.
So, #GP occurs once XRSTORS instruction is executed.
This issue does not happen with kernel 5.10, because kernel 5.10 writes to
MSR_IA32_XSS during initialization, while kernel 5.4 does not.
Once the guest writes to MSR_IA32_XSS, the write is trapped to the
hypervisor; then the physical MSR_IA32_XSS and the value of MSR_IA32_XSS in
vcpu->arch.guest_msrs are updated with the value specified by the guest.
So, in point 2 above, the correct processor state is saved, and #GP does
not happen in point 3.
This patch initializes the XSAVE-related processor state for the guest.
If the vcpu is not launched yet, the processor state is initialized
according to the initial value of vcpu_get_guest_msr(vcpu, MSR_IA32_XSS),
ectx->xcr0, and ectx->xs_area. With this approach, the physical processor
state is consistent with the one presented to the guest.
Tracked-On: #6434
Signed-off-by: Shiqing Gao <shiqing.gao@intel.com>
Reviewed-by: Li Fei1 <fei1.li@intel.com>
Currently init_vmx_msrs() emulates the same value for the IA32_VMX_xxx_CTLS
and IA32_VMX_TRUE_xxx_CTLS MSRs.
But the values of the physical MSRs could differ between the pair, and we
need to adjust the emulated values accordingly.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
check_vmx_permission() is called in vmresume_vmexit_handler() and
vmlaunch_vmexit_handler() already.
Tracked-On: #6289
Signed-off-by: Zide Chen <zide.chen@intel.com>
Remove the ACPI loading function from elf_loader, rawimage_loader and
bzimage_loader, and call it in one place in vm_sw_loader.
Now vm_sw_loader's job is no longer just loading SW, so rename it to
prepare_os_image.
Tracked-On: #6323
Signed-off-by: Zhou, Wu <wu.zhou@intel.com>
Reviewed-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
vboot_info.h also declares the VM loader functions, so rename the file to
vboot.h.
Tracked-On: #6323
Signed-off-by: Victor Sun <victor.sun@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
When passing a GPU through to a pre-launched Linux guest, the GPU OpRegion
also needs to be passed to the guest.
Here are the detailed steps:
1. reserve a memory region in the ve820 table for the GPU OpRegion
2. build the EPT mapping for the GPU OpRegion to pass the OpRegion through
to the guest
3. emulate the PCI config register for the OpRegion
For the third step, here is the detailed description:
The address of the OpRegion is located at PCI config space offset 0xFC.
A normal Linux guest won't write this register, so we can regard it as
read-only.
When the guest reads this register, return the emulated value.
When the guest writes this register, ignore the operation.
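A sketch of the config-space handling for offset 0xFC (the function, field
and macro names here are illustrative, not the actual ACRN ones):

    #define OPREGION_CFG_OFFSET   0xFCU   /* OpRegion address register */

    /* read: return the emulated (guest) OpRegion address */
    static uint32_t read_opregion_reg(const struct pci_vdev *vdev)
    {
        return vdev->opregion_gpa;
    }

    /* write: the register is treated as read-only, so the write is dropped */
    static void write_opregion_reg(struct pci_vdev *vdev, uint32_t val)
    {
        (void)vdev;
        (void)val;
    }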
Tracked-On: #6387
Signed-off-by: Liu,Junming <junming.liu@intel.com>
Relocate the ACPI address to 0x7fe00000 and ACPI NVS to 0x7ff00000
correspondingly. This way we can include the TPM event log region
[0x7ffb0000, 0x80000000) in ACPI NVS.
Tracked-On: #6320
Signed-off-by: Fei Li <fei1.li@intel.com>
ACRN used to prepare the vTPM2 ACPI table for the pre-launched VM at build
time using the config tools. This is fine if the TPM2 ACPI table never
changes. However, the TPM2 ACPI table may change in some conditions: a BIOS
configuration change or a BIOS update.
This patch does a TPM2 fixup to update the vTPM2 ACPI table and the TPM2
MMIO resource configuration according to the physical TPM2 ACPI table.
Tracked-On: #6366
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Signed-off-by: Fei Li <fei1.li@intel.com>
1. Add a name field to indicate what the MMIO device is.
2. Add two more MMIO resources to the acrn_mmiodev data structure.
Tracked-On: #6366
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Signed-off-by: Fei Li <fei1.li@intel.com>
ACRN can run without the XSAVE capability, so remove the XSAVE dependence
to support more (hardware or virtual) platforms.
Tracked-On: #6287
Signed-off-by: Fei Li <fei1.li@intel.com>
Check whether the condition is met before checking whether the time is out
after iommu_read32. This is because iommu_read32 could cause a timeout on
some virtual platforms even though the current DMAR status meets the
precondition.
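The polling loop conceptually becomes (a sketch; the timing helpers stand
in for whatever the real loop uses):

    uint64_t start = cpu_ticks();
    uint32_t status;

    do {
        status = iommu_read32(dmar_unit, offset);
        if ((status & mask) == pre_condition) {
            break;                     /* condition met: never report a timeout */
        }
        if ((cpu_ticks() - start) > timeout_ticks) {
            panic("DMAR operation timed out");
        }
    } while (true);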
Tracked-On: #6371
Signed-off-by: Fei Li <fei1.li@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
If the HV enables #GP for uc-lock and is about to emulate guest uc-lock
instructions, it should trap guest #GP. A guest uc-lock instruction
triggers #GP, which causes a VM exit; the HV handles this VM exit and
emulates the uc-lock instruction.
Tracked-On: #6299
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
There are some virtual platforms which don't meet this constraint, so
remove it.
Tracked-On: #6329
Signed-off-by: Fei Li <fei1.li@intel.com>
For a core-partitioned VM (like an RTVM), PMC is always used for
performance profiling/tuning, so expose the PMC capability and pass
through its MSRs to the VM.
Tracked-On: #6307
Signed-off-by: Minggui Cao <minggui.cao@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
If an array is used only locally and its size is not referenced
externally, use the ARRAY_SIZE macro to calculate its size.
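For reference, this is the usual idiom (the example array is illustrative):

    #define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))

    static const uint32_t local_table[] = {1U, 2U, 3U};

    for (uint32_t i = 0U; i < ARRAY_SIZE(local_table); i++) {
        /* no separately maintained size constant needed */
    }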
Tracked-On: #6307
Signed-off-by: Minggui Cao <minggui.cao@intel.com>
Reviewed-by: Junjie Mao <junjie.mao@intel.com>
In some scenarios (e.g., nested) where lapic-pt is enabled for a vcpu
running on a pcpu hosting the console timer, the HV console becomes
inaccessible.
This patch adds the console callback to every VM-exit event so that the
console can still be somewhat functional under such circumstances.
Since this is VM-exit driven, the VM-exits/second can be low in certain
cases (e.g., idle or running a stress workload). In extreme cases where
the guest panics/hangs, there will be no VM-exits at all.
In most cases, the shell is laggy but functional (probably enough for
debugging purposes).
Tracked-On: #6312
Signed-off-by: Yifan Liu <yifan1.liu@intel.com>
An atomic operation using bus locking generates a LOCK# bus signal if it
has a non-WB memory operand. This is a UC lock, and it ruins the RT
behavior of the system.
If MSR_IA32_CORE_CAPABILITIES[bit 4] is 1, the CPU can trigger #GP for
instructions which cause a UC lock. This feature is controlled by
MSR_TEST_CTL[bit 28].
This patch enables #GP for guest UC locks.
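A sketch of the enabling sequence, using the MSR/bit numbers given above
(the macro names and MSR addresses are assumptions):

    #define MSR_IA32_CORE_CAPABILITIES   0xCFU
    #define CORE_CAP_UC_LOCK_GP          (1UL << 4U)    /* bit 4 */
    #define MSR_TEST_CTL                 0x33U
    #define TEST_CTL_GP_ON_UC_LOCK       (1UL << 28U)   /* bit 28 */

    if ((msr_read(MSR_IA32_CORE_CAPABILITIES) & CORE_CAP_UC_LOCK_GP) != 0UL) {
        msr_write(MSR_TEST_CTL, msr_read(MSR_TEST_CTL) | TEST_CTL_GP_ON_UC_LOCK);
    }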
Tracked-On: #6299
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Because the emulation code is shared between split-lock and uc-lock,
rename splitlock.c/splitlock.h to lock_instr_emul.c/lock_instr_emul.h.
Tracked-On: #6299
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Because the emulation code is shared between split-lock and uc-lock,
rename these APIs:
vcpu_kick_splitlock_emulation() -> vcpu_kick_lock_instr_emulation()
vcpu_complete_splitlock_emulation() -> vcpu_complete_lock_instr_emulation()
emulate_splitlock() -> emulate_lock_instr()
Tracked-On: #6299
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
Split-lock emulation can be reused for uc-lock. Currently
emulate_splitlock() only works if the VM exit is for an #AC trap, the guest
does not handle split-lock, and the HV enables #AC for split-lock.
Add another condition so that emulate_splitlock() also works when the VM
exit is for a #GP trap, the guest does not handle uc-lock, and the HV
enables #GP for uc-lock.
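The combined guard roughly becomes (a sketch; the helper flags are
illustrative stand-ins for the actual checks):

    if (((exception_vector == IDT_AC) && splitlock_ac_enabled) ||
        ((exception_vector == IDT_GP) && uclock_gp_enabled)) {
        /* proceed with the shared lock-instruction emulation */
    }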
Tracked-On: #6299
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
When ACRN uses decode_instruction to emulate a split-lock/uc-lock
instruction, it is actually a try-decode to see whether the instruction is
XCHG.
If it is an XCHG instruction, ACRN must emulate it (injecting #PF if that
is triggered) with the peer vCPUs paused, and advance the guest IP. If the
instruction is a LOCK-prefixed instruction accessing UC memory, ACRN halts
the peer vCPUs, advances the IP to skip the LOCK prefix, and then lets the
vCPU execute one instruction by enabling the IRQ-window VM exit. For other
cases, ACRN injects the exception back to the vCPU without emulating it.
So change the API to decode_instruction(vcpu, bool full_decode). When
full_decode is true, the API does the same thing as before. When
full_decode is false, the difference is that if decode_instruction() meets
an unknown instruction, it still returns -1 but does not inject #UD. We can
use this to distinguish that a #UD has been skipped and that the #AC/#GP
needs to be injected back.
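A sketch of the try-decode use (the return-value handling is illustrative):

    int32_t err = decode_instruction(vcpu, false);   /* full_decode == false */

    if (err < 0) {
        /* unknown instruction: no #UD was injected, so inject the original
         * #AC/#GP back to the guest instead of emulating */
    } else {
        /* XCHG: emulate with peer vCPUs paused and advance the guest IP */
    }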
Tracked-On: #6299
Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>
The guest may not use the INVEPT instruction after changing any of bits
2:0 of a present EPT entry from 0 to 1, so the shadow EPT entry has no
chance to sync with the guest EPT entry. According to the SDM,
"""
Software may use the INVEPT instruction after modifying a present EPT
paging-structure entry (see Section 28.2.2) to change any of the
privilege bits 2:0 from 0 to 1. Failure to do so may cause an EPT
violation that would not otherwise occur. Because an EPT violation
invalidates any mappings that would be used by the access that caused
the EPT violation (see Section 28.3.3.1), an EPT violation will not
recur if the original access is performed again, even if the INVEPT
instruction is not executed.
"""
Sync such late changes to the privilege bits from the guest EPT entry to
the shadow EPT entry to cover the above case.
Tracked-On: #5923
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
MSR_IA32_VMX_EPT_VPID_CAP is 64 bits. Using 32-bit macros with it may make
the bit expressions wrong.
Unify the MSR_IA32_VMX_EPT_VPID_CAP operations with 64-bit definitions.
Tracked-On: #5923
Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>
Acked-by: Eddie Dong <eddie.dong@intel.com>
When the hypervisor boots, the multiboot modules have already been loaded
into host space by the bootloader. The space range of the pre-launched VM
modules is also exposed to the SOS VM, so the SOS VM kernel might pick this
range to extract its kernel into when KASLR is enabled. This would corrupt
the pre-launched VM modules and result in pre-launched VM boot failure.
This patch fixes the issue by not loading the SOS VM into guest space until
all pre-launched VMs are loaded successfully.
Tracked-On: #5879
Signed-off-by: Victor Sun <victor.sun@intel.com>
Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>