acrn-hypervisor

Commit Graph

Author	SHA1	Message	Date
Tao Yuhong	7926504011	HV: rename split-lock emulation APIs Because the emulation code is for both split-lock and uc-lock, Changed these API names: vcpu_kick_splitlock_emulation() -> vcpu_kick_lock_instr_emulation() vcpu_complete_splitlock_emulation() -> vcpu_complete_lock_instr_emulation() emulate_splitlock() -> emulate_lock_instr() Tracked-On: #6299 Signed-off-by: Tao Yuhong <yuhong.tao@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-07-21 11:25:47 +08:00
Tao Yuhong	bbd7b7091b	HV: re-use split-lock emulation code for uc-lock Split-lock emulation can be re-used for uc-lock. In emulate_splitlock(), it only work if this vmexit is for #AC trap and guest do not handle split-lock and HV enable #AC for splitlock. Add another condition to let emulate_splitlock() also work for #GP trap and guest do not handle uc-lock and HV enable #GP for uc-lock. Tracked-On: #6299 Signed-off-by: Tao Yuhong <yuhong.tao@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-07-21 11:25:47 +08:00
Tao Yuhong	553d59644b	HV: Fix decode_instruction() trigger #UD for emulating UC-lock When ACRN uses decode_instruction to emulate split-lock/uc-lock instruction, It is actually a try-decode to see if it is XCHG. If the instruction is XCHG instruction, ACRN must emulate it (inject #PF if it is triggered) with peer VCPUs paused, and advance the guest IP. If the instruction is a LOCK prefixed instruction with accessing the UC memory, ACRN Halted the peer VCPUs, and advance the IP to skip the LOCK prefix, and then let the VCPU Executes one instruction by enabling IRQ Windows vm-exit. For other cases, ACRN injects the exception back to VCPU without emulating it. So change the API to decode_instruction(vcpu, bool full_decode), when full_decode is true, the API does same thing as before. When full_decode is false, the different is if decode_instruction() meet unknown instruction, will keep return = -1 and do not inject #UD. We can use this to distinguish that an #UD has been skipped, and need inject #AC/#GP back. Tracked-On: #6299 Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>	2021-07-21 11:25:47 +08:00
Shuo A Liu	6e0b12180c	hv: dm: Use new power management data structures struct cpu_px_data -> struct acrn_pstate_data struct cpu_cx_data -> struct acrn_cstate_data enum pm_cmd_type -> enum acrn_pm_cmd_type struct acpi_generic_address -> struct acrn_acpi_generic_address cpu_cx_data -> acrn_cstate_data cpu_px_data -> acrn_pstate_data IC_PM_GET_CPU_STATE -> ACRN_IOCTL_PM_GET_CPU_STATE PMCMD_GET_PX_CNT -> ACRN_PMCMD_GET_PX_CNT PMCMD_GET_CX_CNT -> ACRN_PMCMD_GET_CX_CNT PMCMD_GET_PX_DATA -> ACRN_PMCMD_GET_PX_DATA PMCMD_GET_CX_DATA -> ACRN_PMCMD_GET_CX_DATA Tracked-On: #6282 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>	2021-07-15 11:53:54 +08:00
Shuo A Liu	9c910bae44	hv: dm: Use new I/O request data structures struct vhm_request -> struct acrn_io_request union vhm_request_buffer -> struct acrn_io_request_buffer struct pio_request -> struct acrn_pio_request struct mmio_request -> struct acrn_mmio_request struct ioreq_notify -> struct acrn_ioreq_notify VHM_REQ_PIO_INVAL -> IOREQ_PIO_INVAL VHM_REQ_MMIO_INVAL -> IOREQ_MMIO_INVAL REQ_PORTIO -> ACRN_IOREQ_TYPE_PORTIO REQ_MMIO -> ACRN_IOREQ_TYPE_MMIO REQ_PCICFG -> ACRN_IOREQ_TYPE_PCICFG REQ_WP -> ACRN_IOREQ_TYPE_WP REQUEST_READ -> ACRN_IOREQ_DIR_READ REQUEST_WRITE -> ACRN_IOREQ_DIR_WRITE REQ_STATE_PROCESSING -> ACRN_IOREQ_STATE_PROCESSING REQ_STATE_PENDING -> ACRN_IOREQ_STATE_PENDING REQ_STATE_COMPLETE -> ACRN_IOREQ_STATE_COMPLETE REQ_STATE_FREE -> ACRN_IOREQ_STATE_FREE IC_CREATE_IOREQ_CLIENT -> ACRN_IOCTL_CREATE_IOREQ_CLIENT IC_DESTROY_IOREQ_CLIENT -> ACRN_IOCTL_DESTROY_IOREQ_CLIENT IC_ATTACH_IOREQ_CLIENT -> ACRN_IOCTL_ATTACH_IOREQ_CLIENT IC_NOTIFY_REQUEST_FINISH -> ACRN_IOCTL_NOTIFY_REQUEST_FINISH IC_CLEAR_VM_IOREQ -> ACRN_IOCTL_CLEAR_VM_IOREQ HYPERVISOR_CALLBACK_VHM_VECTOR -> HYPERVISOR_CALLBACK_HSM_VECTOR arch_fire_vhm_interrupt() -> arch_fire_hsm_interrupt() get_vhm_notification_vector() -> get_hsm_notification_vector() set_vhm_notification_vector() -> set_hsm_notification_vector() acrn_vhm_notification_vector -> acrn_hsm_notification_vector get_vhm_req_state() -> get_io_req_state() set_vhm_req_state() -> set_io_req_state() Below structures have slight difference with former ones. struct acrn_ioreq_notify strcut acrn_io_request Tracked-On: #6282 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>	2021-07-15 11:53:54 +08:00
Shuo A Liu	107cae316a	hv: dm: Use new ioctl ACRN_IOCTL_SET_VCPU_REGS struct acrn_set_vcpu_regs -> struct acrn_vcpu_regs struct acrn_vcpu_regs -> struct acrn_regs IC_SET_VCPU_REGS -> ACRN_IOCTL_SET_VCPU_REGS Tracked-On: #6282 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>	2021-07-15 11:53:54 +08:00
Shuo A Liu	f476ca55ab	hv: dm: Use new VM management ioctls IC_CREATE_VM -> ACRN_IOCTL_CREATE_VM IC_DESTROY_VM -> ACRN_IOCTL_DESTROY_VM IC_START_VM -> ACRN_IOCTL_START_VM IC_PAUSE_VM -> ACRN_IOCTL_PAUSE_VM IC_RESET_VM -> ACRN_IOCTL_RESET_VM struct acrn_create_vm -> struct acrn_vm_creation Tracked-On: #6282 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>	2021-07-15 11:53:54 +08:00
Shuo A Liu	9c1caad25a	hv: nested: Keep privilege bits sync in shadow EPT entry Guest may not use INVEPT instruction after enabling any of bits 2:0 from 0 to 1 of a present EPT entry, then the shadow EPT entry has no chance to sync guest EPT entry. According to the SDM, """ Software may use the INVEPT instruction after modifying a present EPT paging-structure entry (see Section 28.2.2) to change any of the privilege bits 2:0 from 0 to 1.1 Failure to do so may cause an EPT violation that would not otherwise occur. Because an EPT violation invalidates any mappings that would be used by the access that caused the EPT violation (see Section 28.3.3.1), an EPT violation will not recur if the original access is performed again, even if the INVEPT instruction is not executed. """ Sync the afterthought of privilege bits from guest EPT entry to shadow EPT entry to cover above case. Tracked-On: #5923 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-07-02 09:24:12 +08:00
Shuo A Liu	a431cff94e	hv: Use 64 bits definition for 64 bits MSR_IA32_VMX_EPT_VPID_CAP operation MSR_IA32_VMX_EPT_VPID_CAP is 64 bits. Using 32 bits MACROs with it may cause the bit expression wrong. Unify the MSR_IA32_VMX_EPT_VPID_CAP operation with 64 bits definition. Tracked-On: #5923 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-07-02 09:24:12 +08:00
Victor Sun	e371432695	HV: avoid pre-launched VM modules being corrupted by SOS kernel load When hypervisor boots, the multiboot modules have been loaded to host space by bootloader already. The space range of pre-launched VM modules is also exposed to SOS VM, so SOS VM kernel might pick this range to extract kernel when KASLR enabled. This would corrupt pre-launched VM modules and result in pre-launched VM boot fail. This patch will try to fix this issue. The SOS VM will not be loaded to guest space until all pre-launched VMs are loaded successfully. Tracked-On: #5879 Signed-off-by: Victor Sun <victor.sun@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-06-11 10:06:02 +08:00
Victor Sun	268d4c3f3c	HV: boot guest with boot params Previously the load GPA of LaaG boot params like zeropage/cmdline and initgdt are all hard-coded, this would bring potential LaaG boot issues. The patch will try to fix this issue by finding a 32KB load_params memory block for LaaG to store these guest boot params. For other guest with raw image, in general only vgdt need to be cared of so the load_params will be put at 0x800 since it is a common place that most guests won't touch for entering protected mode. Tracked-On: #5626 Signed-off-by: Victor Sun <victor.sun@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-06-11 10:06:02 +08:00
Victor Sun	ed97022646	HV: add find_space_from_ve820() api The API would search ve820 table and return a valid GPA when the requested size of memory is available in the specified memory range, or return INVALID_GPA if the requested memory slot is not available; Tracked-On: #5626 Signed-off-by: Victor Sun <victor.sun@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-06-11 10:06:02 +08:00
Victor Sun	6127c0c5d2	HV: modify low 1MB area for pre-launched VM e820 The memory range of [0xA0000, 0xFFFFF] is a known reserved area for BIOS, actually Linux kernel would enforce this area to be reserved during its boot stage. Set this area to usable would cause potential compatibility issues. The patch set the range to reserved type to make it consistent with the real world. BTW, There should be a EBDA(Entended BIOS DATA Area) with reserved type exist right before 0xA0000 in real world for non-EFI boot. But given ACRN has no legacy BIOS emulation, we simply skipped the EBDA in vE820. Tracked-On: #5626 Signed-off-by: Victor Sun <victor.sun@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-06-11 10:06:02 +08:00
Victor Sun	28b7cee412	HV: modularization: rename multiboot.h to boot.h Given the structure in multiboot.h could be used for any boot protocol, use a more generic name "boot.h" instead; Tracked-On: #5661 Signed-off-by: Victor Sun <victor.sun@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-06-11 10:06:02 +08:00
Shuo A Liu	9ae32f96af	hv: Wrap same code as a static function vmptrld_vmexit_handler() has a same code snippet with vmclear_vmexit_handler(). Wrap the same code snippet as a static function clear_vmcs02(). There is only a small logic change that add nested->current_vmcs12_ptr = INVALID_GPA in vmptrld_vmexit_handler() for the old VMCS. That's reasonable. Tracked-On: #5923 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-06-09 10:07:05 +08:00
Shuo A Liu	387ea23961	hv: Rename get_ept_entry() to get_eptp() get_ept_entry() actually returns the EPTP of a VM. So rename it to get_eptp() for readability. Tracked-On: #5923 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-06-09 10:07:05 +08:00
Zide Chen	b6b5373818	hv: deny access to HV owned legacy PIO UART from SOS We need to deny accesses from SOS to the HV owned UART device, otherwise SOS could have direct access to this physical device and mess up the HV console. If ACRN debug UART is configured as PIO based, For example, CONFIG_SERIAL_PIO_BASE is generated from acrn-config tool, or the UART config is overwritten by hypervisor parameter "uart=port@<port address>", it could run into problem if ACRN doesn't emulate this UART PIO port to SOS. For example: - none of the ACRN emulated vUART devices has same PIO port with the port of the debug UART device. - ACRN emulates PCI vUART for SOS (configure "console_vuart" with PCI_VUART in the scenario configuration) This patch fixes the above issue by masking PIO accesses from SOS. deny_hv_owned_devices() is moved after setup_io_bitmap() where vm->arch_vm.io_bitmap is initialized. Commit `50d852561` ("HV: deny HV owned PCI bar access from SOS") handles the case that ACRN debug UART is configured as a PCI device. e.g., hypervisor parameter "uart=bdf@<BDF value>" is appended. If the hypervisor debug UART is MMIO based, need to configured it as a PCI type device, so that it can be hidden from SOS. Tracked-On: #5923 Signed-off-by: Zide Chen <zide.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-06-08 16:16:14 +08:00
Yonghua Huang	25c0e3817e	hv: validate input for dmar_free_irte function Malicious input 'index' may trigger buffer overflow on array 'irte_alloc_bitmap[]'. This patch validate that 'index' shall be less than 'CONFIG_MAX_IR_ENTRIES' and also remove unnecessary check on 'index' in 'ptirq_free_irte()' function with this fix. Tracked-On: #6132 Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>	2021-06-08 09:03:10 +08:00
Yonghua Huang	4acaeb91bd	hv: remove unnecessary ASSERT in vlapic_write vlapic_write handle 'offset' that is valid and ignore all other invalid 'offset'. so ASSERT on this 'offset' input is unnecessary. This patch removes above ASSERT to avoid potential hypervisor crash by guest malicious input when debug build is used. Tracked-On: #6131 Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>	2021-06-08 09:03:10 +08:00
Shuo A Liu	15e6c5b9cf	hv: nested: audit guest EPT mapping during shadow EPT entries setup generate_shadow_ept_entry() didn't verify the correctness of the requested guest EPT mapping. That might leak host memory access to L2 VM. To simplify the implementation of the guest EPT audit, hide capabilities 'map 2-Mbyte page' and 'map 1-Gbyte page' from L1 VM. In addition, minimize the attribute bits of EPT entry when create a shadow EPT entry. Also, for invalid requested mapping address, reflect the EPT_VIOLATION to L1 VM. Here, we have some TODOs: 1) Enable large page support in generate_shadow_ept_entry() 2) Evaluate if need to emulate the invalid GPA access of L2 in HV directly. 3) Minimize EPT entry attributes. Tracked-On: #5923 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-06-04 13:53:47 +08:00
Shuo A Liu	3110e70d0a	hv: nested: INVEPT emulation supports shadow EPT L1 VM changes the guest EPT and do INVEPT to invalidate the previous TLB cache of EPT entries. The shadow EPT replies on INVEPT instruction to do the update. The target shadow EPTs can be found according to the 'type' of INVEPT. Here are two types and their target shadow EPT, 1) Single-context invalidation Get the EPTP from the INVEPT descriptor. Then find the target shadow EPT. 2) Global invalidation All shadow EPTs of the L1 VM. The INVEPT emulation handler invalidate all the EPT entries of the target shadow EPTs. Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-06-04 13:53:47 +08:00
Shuo A Liu	1dc7b7f798	hv: nested: Introduce shadow EPT release function When a shadow EPT is not used anymore, its resources need to be released. free_sept_table() is introduced to walk the whole shadow EPT table and free the pagetable pages. Please note, the PML4E page of shadow EPT is not freed by free_sept_table() as it still be used to present a shadow EPT pointer. Tracked-On: #5923 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-06-04 13:53:47 +08:00
Shuo A Liu	b10b5658bd	hv: nested: Introduce L2 VM EPT VIOLATION handler With shadow EPT, the hypervisor walks through guest EPT table: * If the entry is not present in guest EPT, ACRN injects EPT_VIOLATION to L1 VM and resumes to L1 VM. * If the entry is present in guest EPT, do the EPT_MISCONFIG check. Inject EPT_MISCONFIG to L1 VM if the check failed. * If the entry is present in guest EPT, do permission check. Reflect EPT_VIOLATION to L1 VM if the check failed. * If the entry is present in guest EPT but shadow EPT entry is not present, create the shadow entry and resumes to L2 VM. * If the entry is present in guest EPT but the GPA in the entry is invalid, injects EPT_VIOLATION to L1 VM and resumes L1 VM. Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-06-04 13:53:47 +08:00
Shuo A Liu	8565750bbe	hv: nested: Hide some capability bits from L1 guest * Hide 5 level EPT capability, let L1 guest stick to 4 level EPT. * Access/Dirty bits are not support currently, hide corresponding EPT capability bits. * "Mode-based execute control for EPT" is also not support well currently, hide its capability bit from MSR_IA32_VMX_PROCBASED_CTLS2. Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-06-04 13:53:47 +08:00
Shuo A Liu	540a484147	hv: nested: Manage shadow EPTP according to guest VMCS change 'struct nept_desc' is used to associate guest EPTP with a shadow EPTP. It's created in the first reference and be freed while no reference. The life cycle seems like, While guest VMCS VMX_EPT_POINTER_FULL is changed, the 'struct nept_desc' of the new guest EPTP is referenced; the 'struct nept_desc' of the old guest EPTP is dereferenced. While guest VMCS be cleared(by VMCLEAR in L1 VM), the 'struct nept_desc' of the old guest EPTP is dereferenced. While a new guest VMCS be loaded(by VMPTRLD in L1 VM), the 'struct nept_desc' of the new guest EPTP is referenced. The 'struct nept_desc' of the old guest EPTP is dereferenced. Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-06-04 13:53:47 +08:00
Shuo A Liu	10ec896f99	hv: nested: Introduce shadow EPT infrastructure To shadow guest EPT, the hypervisor needs construct a shadow EPT for each guest EPT. The key to associate a shadow EPT and a guest EPT is the EPTP (EPT pointer). This patch provides following structure to do the association. struct nept_desc { /* * A shadow EPTP. * The format is same with 'EPT pointer' in VMCS. * Its PML4 address field is a HVA of the hypervisor. / uint64_t shadow_eptp; / * An guest EPTP configured by L1 VM. * The format is same with 'EPT pointer' in VMCS. * Its PML4 address field is a GPA of the L1 VM. */ uint64_t guest_eptp; uint32_t ref_count; }; Due to lack of dynamic memory allocation of the hypervisor, a array nept_bucket of type 'struct nept_desc' is introduced to store those association information. A guest EPT might be shared between different L2 vCPUs, so this patch provides several functions to handle the reference of the structure. Interface get_shadow_eptp() also is introduced. To find the shadow EPTP of a specified guest EPTP. Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-06-04 13:53:47 +08:00
Shuo A Liu	17bc7f08c9	hv: nested: Create a page pool for shadow EPT construction Shadow EPT uses lots of pages to construct the shadow page table. To utilize the memory more efficient, a page poll sept_page_pool is introduced. For simplicity, total platform RAM size is considered to calculate the memory needed for shadow page tables. This is not an accurate upper bound. This can satisfy typical use-cases where there is not a lot of overcommitment and sharing of memory between L2 VMs. Memory of the pool is marked as reserved from E820 table in early stage. Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-06-04 13:53:47 +08:00
Zide Chen	811e367ad9	hv: nested: implement nested VM exit handler Nested VM exits happen when vCPU is in guest mode (VMCS02 is current). Initially we reflect all nested VM exits to L1 hypervisor. To prepare the environment to run L1 guest: - restore some VMCS fields to the value as what L1 hypervisor programmed. - VMCLEAR VMCS02, VMPTRLD VMCS01 and enable VMCS shadowing. - load the non-shadowing host states from VMCS12 to VMCS01 guest states. - VMRESUME to L1 guest with this modified VMCS01. Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Alexander Merritt <alex.merritt@intel.com>	2021-06-03 15:23:25 +08:00
Zide Chen	22d225663f	hv: nested: update run_vcpu() function for nested case Since L2 guest vCPU mode and VPID are managed by L1 hypervisor, so we can skip these handling in run_vcpu(). And be careful that we can't cache L2 registers in struct acrn_vcpu. Tracked-On: #5923 Signed-off-by: Zide Chen <zide.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-06-03 15:23:25 +08:00
Zide Chen	4acc65eacc	hv: nested: support for INVEPT and INVVPID emulation invvpid and invept instructions cause VM exits unconditionally. For initial support, we pass all the instruction operands as is to the pCPU. Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-06-03 15:23:25 +08:00
Zide Chen	4c29a0bb29	hv: nested: support for VMLAUNCH and VMRESUME emulation Implement the VMLAUNCH and VMRESUME instructions, allowing a L1 hypervisor to run nested guests. - merge VMCS control fields and VMCS guest fields to VMCS02 - clear shadow VMCS indicator on VMCS02 and load VMCS02 as current - set VMCS12 launch state to "launched" in VMLAUNCH handler Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Alex Merritt <alex.merritt@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-06-03 15:23:25 +08:00
Li Fei1	0d5f12e281	hv: vlapic: a minor refine about vlapic_x2apic_pt_icr_access In physical destination mode, the destination processor is specified by its local APIC ID. When a CPU switch xAPIC Mode to x2APIC Mode or vice versa, the local APIC ID is not changed. So a vcpu in x2APIC Mode could use physical Destination Mode to send an IPI to another vcpu in xAPIC Mode by writing ICR. This patch adds support for a vCPU A could write ICR to send IPI to another vCPU B which is in different APIC mode. Tracked-On: #5923 Signed-off-by: Li Fei1 <fei1.li@intel.com>	2021-05-27 09:00:08 +08:00
dongshen	5e3c6ae941	hv: vcpuid: passthrough host CPUID leaf.0BH to guest VMs Using physical APIC IDs as vLAPIC IDs for pre-Launched and post-launched VMs is not sufficient to replicate the host CPU and cache topologies in guest VMs, we also need to passthrough host CPUID leaf.0BH to guest VMs, otherwise, guest VMs may see weird CPU topology. Note that in current code, ACRN has already passthroughed host cache CPUID leaf 04H to guest VMs Tracked-On: #6020 Reviewed-by: Wang, Yu1 <yu1.wang@intel.com> Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>	2021-05-26 11:23:06 +08:00
dongshen	f332ef15b2	hv: vlapic: use physical APIC IDs as vLAPIC IDs for pre-launched and post-launched VMs In current code, ACRN uses physical APIC IDs as vLAPIC IDs for SOS, and vCPU ids (contiguous) as vLAPIC IDs for pre-Launched and post-Launched VMs. Using vCPU ids as vLAPIC IDs for pre-Launched and post-Launched VMs would result in wrong CPU and cache topologies showing in the guest VMs, and could adversely affect performance if the guest VM chooses to detect CPU and cache topologies and optimize its behavior accordingly. Uses physical APIC IDs as vLAPIC IDs (and related CPU/cache topology enumeration CPUIDs passthrough) will replicate the host CPU and cache topologies in pre-Launched and post-Launched VMs. Tracked-On: #6020 Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com> Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>	2021-05-26 11:23:06 +08:00
Zide Chen	6428ca8f5b	hv: VMPTRLD and VMCLEAR VMCS with the common APIs Remove the direct calls to exec_vmptrld() or exec_vmclear(), and replace with the wrapper APIs load_va_vmcs() and clear_va_vmcs(). Tracked-On: #5923 Signed-off-by: Zide Chen <zide.chen@intel.com>	2021-05-26 11:22:26 +08:00
Zide Chen	6d69058a9d	hv: nested: support for VMREAD and VMWRITE emulation This patch implements the VMREAD and VMWRITE instructions. When L1 guest is running with an active VMCS12, the “VMCS shadowing” VM-execution control is always set to 1 in VMCS01. Thus the possible behavior of VMREAD or VMWRITE from L1 could be: - It causes a VM exit to L0 if the bit corresponds to the target VMCS field in the VMREAD bitmap or VMWRITE bitmap is set to 1. - It accesses the VMCS referenced by VMCS01 link pointer (VMCS02 in our case) if the above mentioned bit is set to 0. This patch handles the VMREAD and VMWRITE VM exits in this way: - on VMWRITE, it writes the desired VMCS value to the respective field in the cached VMCS12. For VMCS fields that need to be synced to VMCS02, sets the corresponding dirty flag. - on VMREAD, it reads the desired VMCS value from the cached VMCS12. Tracked-On: #5923 Signed-off-by: Alex Merritt <alex.merritt@intel.com> Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@Intel.com>	2021-05-24 10:34:01 +08:00
Zide Chen	2bd269c11c	hv: nested: support for VMCLEAR emulation This patch is to emulate VMCLEAR instruction. L1 hypervisor issues VMCLEAR on a VMCS12 whose state could be any of these: active and current, active but not current, not yet VMPTRLDed. To emulate the VMCLEAR instruction, ACRN sets the VMCS12 launch state to "clear", and if L0 already cached this VMCS12, need to sync it back to guest memory: - sync shadow fields from shadow VMCS VMCS to cache VMCS12 - copy cache VMCS12 to L1 guest memory Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com>	2021-05-24 10:34:01 +08:00
Zide Chen	5379b14108	hv: nested: define VMCS shadow fields Enable VMCS shadowing for most of the VMCS fields, so that execution of the VMREAD or VMWRITE on these shadow VMCS fields from L1 hypervisor won't cause VM exits, but read from or write to the shadow VMCS. Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Alexander Merritt <alex.merritt@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com>	2021-05-24 10:34:01 +08:00
Zide Chen	863e58e539	hv: nested: define software layout for VMCS12 and helper functions Software layout of VMCS12 data is a contract between L1 guest and L0 hypervisor to run a L2 guest. ACRN hypervisor caches the VMCS12 which is passed down from L1 hypervisor by the VMPTRLD instructin. At the time of VMCLEAR, ACRN syncs the cached VMCS12 back to L1 guest memory. Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@Intel.com>	2021-05-24 10:34:01 +08:00
Zide Chen	f5744174b5	hv: nested: support for VMPTRLD emulation This patch emulates the VMPTRLD instruction. L0 hypervisor (ACRN) caches the VMCS12 that is passed down from the VMPTRLD instruction, and merges it with VMCS01 to create VMCS02 to run the nested VM. - Currently ACRN can't cache multiple VMCS12 on one vCPU, so it needs to flushes active but not current VMCS12s to L1 guest. - ACRN creates VMCS02 to run nested VM based on VMCS12: 1) copy VMCS12 from guest memory to the per vCPU cache VMCS12 2) initialize VMCS02 revision ID and host-state area 3) load shadow fields from cache VMCS12 to VMCS02 4) enable VMCS shadowing before L1 Vm entry Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com>	2021-05-24 10:34:01 +08:00
Zide Chen	0a1ac2f4a0	hv: nested: support for VMXOFF emulation This patch implements the VMXOFF instruction. By issuing VMXOFF, L1 guest Leaves VMX Operation. - cleanup VCPU nested virtualization context states in VMXOFF handler. - implement check_vmx_permission() to check permission for VMX operation for VMXOFF and other VMX instructions. Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@Intel.com>	2021-05-24 10:34:01 +08:00
Zide Chen	3fdad3c6d1	hv: nested: check prerequisites to enter VMX operation According to VMXON Instruction Reference, do the following checks in the virtual hardware environment: vCPU CPL, guest CR0, CR4, revision ID in VMXON region, etc. Currently ACRN doesn't support 32-bit L1 hypervisor, and injects an #UD exception if L1 hypervisor is not running in 64-bit mode. Tracked-On: #5923 Signed-off-by: Zide Chen <zide.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@Intel.com>	2021-05-24 10:34:01 +08:00
Zide Chen	fc8f07e740	hv: nested: support for VMXON emulation This patch emulates VMXON instruction. Basically checks some prerequisites to enable VMX operation on L1 guest (next patch), and prepares some virtual hardware environment in L0. Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@Intel.com>	2021-05-24 10:34:01 +08:00
Li Fei1	6077152c4b	hv: vlapic: extend vlapic_x2apic_pt_icr_access to support more destination mode Now guest would use `Destination Shorthand` to broadcast IPIs if there're more than one destination. However, it is not supported when the guest is in LAPIC passthru situation, and all active VCPUs are working in X2APIC mode. As a result, the guest would not work properly since this kind broadcast IPIs was ignored by ACRN. What's worse, ACRN Hypervisor would inject GP to the guest in this case. This patch extend vlapic_x2apic_pt_icr_access to support more destination modes (both `Physical` and `Logical`) and destination shorthand (`No Shorthand`, `Self`, `All Including Self` and `All Excluding Self`). Tracked-On: #5923 Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Li Fei1 <fei1.li@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-05-24 10:27:32 +08:00
Li Fei1	a69e67b58b	hv: vlapic: wrap a function to calculate destination vcpu mask by shorthand 1. Rename vlapic_calc_dest to vlapic_calc_dest_noshort 2. Remove vlapic_calc_dest_lapic_pt, use vlapic_calc_dest_noshort instead 3. Wrap vlapic_calc_dest to calculate destination vcpu mask according shorthand Tracked-On: #5923 Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Li Fei1 <fei1.li@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-05-24 10:27:32 +08:00
Liang Yi	6805510d77	hv/mod_timer: refine timer interface 1. do not allow external modules to touch internal field of a timer. 2. make timer mode internal, period_in_ticks will decide the mode. API wise: 1. the "mode" parameter was taken out of initialize_timer(). 2. a new function update_timer() was added to update the timeout and period fields. 3. the timer_expired() function was extended with an output parameter to return the remaining cycles before expiration. Also, the "fire_tsc" field name of hv_timer was renamed to "timeout". With the new API, however, this change should not concern user code. Tracked-On: #5920 Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-05-18 16:43:28 +08:00
Liang Yi	5a2b89b0a4	hv/mod_timer: split tsc handling code from timer. Generalize and split basic cpu cycle/tick routines from x86/timer: - Instead of rdstc(), use cpu_ticks() in generic code. - Instead of get_tsc_khz(), use cpu_tickrate() in generic code. - Include "common/ticks.h" instead of "x86/timer.h" in generic code. - CYCLES_PER_MS is renamed to TICKS_PER_MS. The x86 specific API rdstc() and get_tsc_khz(), as well as TSC_PER_MS are still available in arch/x86/tsc.h but only for x86 specific usage. Tracked-On: #5920 Signed-off-by: Rong Liu <rong2.liu@intel.com> Signed-off-by: Yi Liang <yi.liang@intel.com>	2021-05-18 16:43:28 +08:00
Zide Chen	c9982e8c7e	hv: nested: setup emulated VMX MSRs We emulated these MSRs: - MSR_IA32_VMX_PINBASED_CTLS - MSR_IA32_VMX_PROCBASED_CTLS - MSR_IA32_VMX_PROCBASED_CTLS2 - MSR_IA32_VMX_EXIT_CTLS - MSR_IA32_VMX_ENTRY_CTLS - MSR_IA32_VMX_BASIC: emulate VMCS revision ID, etc. - MSR_IA32_VMX_MISC For the following MSRs, we pass through the physical value to L1 guests: - MSR_IA32_VMX_EPT_VPID_CAP - MSR_IA32_VMX_VMCS_ENUM - MSR_IA32_VMX_CR0_FIXED0 - MSR_IA32_VMX_CR0_FIXED1 - MSR_IA32_VMX_CR4_FIXED0 - MSR_IA32_VMX_CR4_FIXED1 Tracked-On: #5923 Signed-off-by: Zide Chen <zide.chen@intel.com> Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-05-16 19:05:21 +08:00
Zide Chen	4930992118	hv: nested: implement the framework for VMX MSR emulation Define LIST_OF_VMX_MSRS which includes a list of MSRs that are visible to L1 guests if nested virtualization is enabled. - If CONFIG_NVMX_ENABLED is set, these MSRs are included in emulated_guest_msrs[]. - otherwise, they are included in unsupported_msrs[]. In this way we can take advantage of the existing infrastructure to emulate these MSRs. Tracked-On: #5923 Spick igned-off-by: Zide Chen <zide.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-05-16 19:05:21 +08:00
Zide Chen	97df220f49	hv: vmsr: emulate IA32_FEATURE_CONTORL MSR for nested virtualization In order to support nested virtualization, need to expose the "Enable VMX outside SMX operation" bit to L1 hypervisor. Tracked-On: #5923 Signed-off-by: Zide Chen <zide.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-05-16 19:05:21 +08:00
Li Fei1	30febed0e1	hv: cache: wrap common APIs Wrap three common Cache APIs: - flush_invalidate_all_cache - flush_cacheline - flush_cache_range Tracked-On: #5830 Signed-off-by: Li Fei1 <fei1.li@intel.com>	2021-05-14 09:18:00 +08:00
Li Fei1	d6362b6e0a	hv: paging: rename ppt_set/clear_ATTR to set_paging_ATTR Rename ppt_set/clear_(attribute) to set_paging_(attribute) Tracked-On: #5830 Signed-off-by: Li Fei1 <fei1.li@intel.com>	2021-05-14 09:18:00 +08:00
Zide Chen	ccfdf9cdd7	hv: nested: enable nested virtualization Allow guest set CR4_VMXE if CONFIG_NVMX_ENABLED is set: - move CR4_VMXE from CR4_EMULATED_RESERVE_BITS to CR4_TRAP_AND_EMULATE_BITS so that CR4_VMXE is removed from cr4_reserved_bits_mask. - force CR4_VMXE to be removed from cr4_rsv_bits_guest_value so that CR4_VMXE is able to be set. Expose VMX feature (CPUID01.01H:ECX[5]) to L1 guests whose GUEST_FLAG_NVMX_ENABLED is set. Assuming guest hypervisor (L1) is KVM, and KVM uses EPT for L2 guests. Constraints on ACRN VM. - LAPIC passthrough should be enabled. - use SCHED_NOOP scheduler. Tracked-On: #5923 Signed-off-by: Sainath Grandhi <sainath.grandhi@intel.com> Signed-off-by: Zide Chen <zide.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-05-13 16:16:30 +08:00
Shuo A Liu	3fffa68665	hv: Support WAITPKG instructions in guest VM TPAUSE, UMONITOR or UMWAIT instructions execution in guest VM cause a #UD if "enable user wait and pause" (bit 26) of VMX_PROCBASED_CTLS2 is not set. To fix this issue, set the bit 26 of VMX_PROCBASED_CTLS2. Besides, these WAITPKG instructions uses MSR_IA32_UMWAIT_CONTROL. So load corresponding vMSR value during context switch in of a vCPU. Please note, the TPAUSE or UMWAIT instruction causes a VM exit if the "RDTSC exiting" and "enable user wait and pause" are both 1. In ACRN hypervisor, "RDTSC exiting" is always 0. So TPAUSE or UMWAIT doesn't cause a VM exit. Performance impact: MSR_IA32_UMWAIT_CONTROL read costs ~19 cycles; MSR_IA32_UMWAIT_CONTROL write costs ~63 cycles. Tracked-On: #6006 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com>	2021-05-13 14:19:50 +08:00
dongshen	ebadf00de8	hv: some coding style fixes Fix issues reported by checkpatch.pl Tracked-On: #5917 Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>	2021-05-12 16:50:34 +08:00
Junjie Mao	ea4eadf0a5	hv: hypercalls: refactor permission-checking and dispatching logic The current permission-checking and dispatching mechanism of hypercalls is not unified because: 1. Some hypercalls require the exact vCPU initiating the call, while the others only need to know the VM. 2. Different hypercalls have different permission requirements: the trusty-related ones are enabled by a guest flag, while the others require the initiating VM to be the Service OS. Without a unified logic it could be hard to scale when more kinds of hypercalls are added later. The objectives of this patch are as follows. 1. All hypercalls have the same prototype and are dispatched by a unified logic. 2. Permissions are checked by a unified logic without consulting the hypercall ID. To achieve the first objective, this patch modifies the type of the first parameter of hcall_* functions (which are the callbacks implementing the hypercalls) from `struct acrn_vm ` to `struct acrn_vcpu `. The doxygen-style documentations are updated accordingly. To achieve the second objective, this patch adds to `struct hc_dispatch` a `permission_flags` field which specifies the guest flags that must ALL be set for a VM to be able to invoke the hypercall. The default value (which is 0UL) indicates that this hypercall is for SOS only. Currently only the `permission_flag` of trusty-related hypercalls have the non-zero value GUEST_FLAG_SECURE_WORLD_ENABLED. With `permission_flag`, the permission checking logic of hypercalls is unified as follows. 1. General checks i. If the VM is neither SOS nor having any guest flag that allows certain hypercalls, it gets #UD upon executing the `vmcall` instruction. ii. If the VM is allowed to execute the `vmcall` instruction, but attempts to execute it in ring 1, 2 or 3, the VM gets #GP(0). 2. Hypercall-specific checks i. If the hypercall is for SOS (i.e. `permission_flag` is 0), the initiating VM must be SOS and the specified target VM cannot be a pre-launched VM. Otherwise the hypercall returns -EINVAL without further actions. ii. If the hypercall requires certain guest flags, the initiating VM must have all the required flags. Otherwise the hypercall returns -EINVAL without further actions. iii. A hypercall with an unknown hypercall ID makes the hypercall returns -EINVAL without further actions. The logic above is different from the current implementation in the following aspects. 1. A pre-launched VM now gets #UD (rather than #GP(0)) when it attempts to execute `vmcall` in ring 1, 2 or 3. 2. A pre-launched VM now gets #UD (rather than the return value -EPERM) when it attempts to execute a trusty hypercall in ring 0. 3. The SOS now gets the return value -EINVAL (rather than -EPERM) when it attempts to invoke a trusty hypercall. 4. A post-launched VM with trusty support now gets the return value -EINVAL (rather than #UD) when it attempts to invoke a non-trusty hypercall or an invalid hypercall. v1 -> v2: - Update documentation that describe hypercall behavior. - Fix Doxygen warnings Tracked-On: #5924 Signed-off-by: Junjie Mao <junjie.mao@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-05-12 13:43:41 +08:00
Liang Yi	688a41c290	hv: mod: do not use explicit arch name when including headers Instead of "#include <x86/foo.h>", use "#include <asm/foo.h>". In other words, we are adopting the same practice in Linux kernel. Tracked-On: #5920 Signed-off-by: Liang Yi <yi.liang@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-05-08 11:15:46 +08:00
Shuo A Liu	dc88c2e397	hv: Save/restore MSR_IA32_CSTAR during context switch Both Windows guest and Linux guest use the MSR MSR_IA32_CSTAR, while Linux uses it rarely. Now vcpu context switch doesn't save/restore it. Windows detects the change of the MSR and rises a exception. Do the save/resotre MSR_IA32_CSTAR during context switch. Tracked-On: #5899 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-04-23 11:21:52 +08:00
Jian Jun Chen	31b8b698ce	hv: TLFS: Add tsc_offset support for reference time TLFS spec defines that when a VM is created, the value of HV_X64_MSR_TIME_REF_COUNT is set to zero. Now tsc_offset is not supported properly, so guest get a drifted reference time. This patch implements tsc_offset. tsc_scale and tsc_offset are calculated when a VM is launched and are saved in struct acrn_hyperv of struct acrn_vm. Tracked-On: #5956 Signed-off-by: Jian Jun Chen <jian.jun.chen@intel.com> Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-04-23 10:48:07 +08:00
Jian Jun Chen	b4312efbd7	hv: TLFS: inject #GP to guest VM for writing of read-only MSRs TLFS spec defines that HV_X64_MSR_VP_INDEX and HV_X64_MSR_TIME_REF_COUNT are read-only MSRs. Any attempt to write to them results in a #GP fault. Fix the issue by returning error in handler hyperv_wrmsr() of MSRs HV_X64_MSR_VP_INDEX/HV_X64_MSR_TIME_REF_COUNT emulation. Tracked-On: #5956 Signed-off-by: Jian Jun Chen <jian.jun.chen@intel.com> Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-04-23 10:48:07 +08:00
Jian Jun Chen	dd524d076d	hv: TLFS: Setup hypercall page according to the vcpu mode TLFS spec defines different hypercall ABIs for X86 and x64. Currently x64 hypercall interface is not supported well. Setup the hypercall interface page according to the vcpu mode. Tracked-On: #5956 Signed-off-by: Jian Jun Chen <jian.jun.chen@intel.com> Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-04-23 10:48:07 +08:00
Li Fei1	628bca5cad	hv: pgtable: use new algo to calculate PPT/EPT_PD_PAGE_NUM In order to support platform (such as Ander Lake) which physical address width bits is 46, the current code need to reserve 2^16 PD page ((2^46) / (2^30)). This is a complete waste of memory. This patch would reserve PD page by three parts: 1. DRAM - may take PD_PAGE_NUM(CONFIG_PLATFORM_RAM_SIZE) PD pages at most; 2. low MMIO - may take PD_PAGE_NUM(MEM_1G << 2U) PD pages at most; 3. high MMIO - may takes (CONFIG_MAX_PCI_DEV_NUM * 6U) PD pages (may plus PDPT entries if its size is larger than 1GB ) at most for: (a) MMIO BAR size must be a power of 2 from 16 bytes; (b) MMIO BAR base address must be power of two in size and are aligned with its size. Tracked-On: #5929 Signed-off-by: Li Fei1 <fei1.li@intel.com>	2021-04-22 14:35:57 +08:00
Li Fei1	3a465388d4	hv: guest: remove get_mem_range_info in prepare_sos_vm_memmap We used get_mem_range_info to get the top memory address and then use this address as the high 64 bits max memory address of SOS. This assumes the platform must have high memory space. This patch removes the assumption. It will set high 64 bits max memory address of SOS to 4G by default (Which means there's no 64 bits high memory), then update the high 64 bits max memory address if the SOS really has high memory space. Tracked-On: #5830 Signed-off-by: Li Fei1 <fei1.li@intel.com> Acked-by: eddie Dong <eddie.dong@intel.com>	2021-04-21 14:00:44 +08:00
Li Fei1	901e8c869e	hv: vE820: calculate SOS memory size by vE820 tables SOS's memory size could be calculated by its vE820 Tables easily. Tracked-On: #5830 Signed-off-by: Li Fei1 <fei1.li@intel.com> Acked-by: eddie Dong <eddie.dong@intel.com>	2021-04-21 14:00:44 +08:00
Li Fei1	e049abb542	hv: vcpuid: hide new cpuid 0x1b/0x1f Hide CPUID 0x1b (PCONFIG) and 0x1f (Extended Topology Enumeration Leaf) Tracked-On: #5929 Signed-off-by: Li Fei1 <fei1.li@intel.com>	2021-04-20 13:28:44 +08:00
Yifan Liu	b80c388b52	hv: Hide HLAT to guest For platform with HLAT (Hypervisor-managed Linear Address Translation) capability, the hypervisor shall hide this feature to its guest. This patch adds MSR_IA32_VMX_PROCBASED_CTLS3 MSR to unsupported MSR list. The presence of this MSR is determined by 1-setting of bit 49 of MSR MSR_IA32_VMX_PROCBASED_CTLS. which is already in unsupported MSR list. [2] Related documentations: [1] Intel Architecture Instruction Set Extensions, version Feb 16, 2021, Ch 6.12 [2] Intel KeyLocker Specification, Sept 2020, Ch 7.2 Tracked-On: #5895 Signed-off-by: Yifan Liu <yifan1.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-04-07 13:47:47 +08:00
Li Fei1	d1ae797742	hv: pgtable: move sanitize_pte into pagetable.c sanitize_pte is used to set page table entry to map to an sanitized page to mitigate l1tf. It should belongs to pgtable module. So move it to pagetable.c Tracked-On: #5830 Signed-off-by: Li Fei1 <fei1.li@intel.com>	2021-03-29 13:28:55 +08:00
Li Fei1	ef90bb6db3	hv:pgtable: rename lookup_address to pgtable_lookup_entry lookup_address is used to lookup a pagetable entry by an address. So rename it to pgtable_lookup_entry to indicate this clearly. Tracked-On: #5830 Signed-off-by: Li Fei1 <fei1.li@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-03-29 13:28:55 +08:00
Li Fei1	36ddd87a09	hv: pgtable: remove alloc_ept_page alloc_page/free_page should been called in pagetable module. In order to do this, we add pgtable_create_root and pgtable_create_trusty_root to create PML4 page table page for normal world and secure world. After this done, no one uses alloc_ept_page. So remove it. Tracked-On: #5830 Signed-off-by: Li Fei1 <fei1.li@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-03-29 13:28:55 +08:00
Li Fei1	ea701c63c7	hv: pgtable: add pgtable_create_trusty_root Add pgtable_create_trusty_root to allocate a page for trusty PML4 page table page. This function also copy PDPT entries from Normal world to Secure world. Tracked-On: #5830 Signed-off-by: Li Fei1 <fei1.li@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-03-29 13:28:55 +08:00
Li Fei1	596c349600	hv: pgtable: add pgtable_create_root Add pgtable_create_root to allocate a page for PMl4 page table page. Tracked-On: #5830 Signed-off-by: Li Fei1 <fei1.li@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-03-29 13:28:55 +08:00
Li Fei1	eb52e2193a	hv: pgtable: refine name for pgtable add/modify/del Rename mmu_add to pgtable_add_map; Rename mmu_modify_or_del to pgtable_modify_or_del_map. And move these functions declaration into pgtable.h Tracked-On: #5830 Signed-off-by: Li Fei1 <fei1.li@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-03-29 13:28:55 +08:00
Liang Yi	33ef656462	hv/mod-irq: use arch specific header files Requires explicit arch path name in the include directive. The config scripts was also updated to reflect this change. Tracked-On: #5825 Signed-off-by: Peter Fang <peter.fang@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-03-24 11:38:14 +08:00
Liang Yi	df36da1b80	hv/mod_irq: do not include x86/irq.h in common/irq.h Each .c file includes the arch specific irq header file (with full path) by itself if required. Tracked-On: #5825 Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-03-24 11:38:14 +08:00
Liang Yi	ff732cfb2a	hv/mod_irq: move guest interrupt API out of x86/irq.h A new x86/guest/virq.h head file now contains all guest related interrupt handling API. Tracked-On: #5825 Signed-off-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-03-24 11:38:14 +08:00
Li Fei1	0278a3f46e	hv: pgatble: move the EPT page table related APIs to ept.c Move the EPT page table related APIs to ept.c. page module only provides APIs to allocate/free page for page table page. pagetabl module only provides APIs to add/modify/delete/lookup page table entry. The page pool and the page table related APIs for EPT should defined in EPT module. Tracked-On: #5830 Signed-off-by: Li Fei1 <fei1.li@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2021-03-11 13:48:52 +08:00
Li Fei1	80bd3ac02a	hv: trusty: move post_uos_sworld_memory into vm.c post_uos_sworld_memory are used for post-launched VM which support trusty. It's more VM related. So move it definition into vm.c Tracked-On: #5830 Signed-off-by: Li Fei1 <fei1.li@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-03-11 13:48:52 +08:00
Yonghua Huang	1a011bd91b	hv: disable guest MONITOR-WAIT support when SW SRAM is configured Per-core software SRAM L2 cache may be flushed by 'mwait' extension instruction, which guest VM may execute to enter core deep sleep. Such kind of flushing is not expected when software SRAM is enabled for RTVM. Hypervisor disables MONITOR-WAIT support on both hypervisor and VMs sides to protect above software SRAM from being flushed. This patch disable ACRN guest MONITOR-WAIT support if software SRAM is configured. Tracked-On: #5649 Signed-off-by: Yonghua Huang <yonghua.huang@intel.com> Reviewed-by: Fei Li <fei1.li@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-03-11 09:42:44 +08:00
Yonghua Huang	ea44bb6c4d	hv: wrap function to check software SRAM support Below boolean function are defined in this patch: - is_software_sram_enabled() to check if SW SRAM feature is enabled or not. - set global variable 'is_sw_sram_initialized' to file static. Tracked-On: #5649 Signed-off-by: Yonghua Huang <yonghua.huang@intel.com> Reviewed-by: Fei Li <fei1.li@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-03-11 09:42:44 +08:00
Li Fei1	768e483cd2	hv: pgtable: rename 'struct memory_ops' to 'struct pgtable' The fields and APIs in old 'struct memory_ops' are used to add/modify/delete page table (page or entry). So rename 'struct memory_ops' to 'struct pgtable'. Tracked-On: #5830 Signed-off-by: Li Fei1 <fei1.li@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-03-10 11:42:13 +08:00
Li Fei1	7c6a52037a	refine ept_flush_leaf_page Refine the logic how to skip the pSRAM region when flushing cache. Tracked-On: #5330 Signed-off-by: Li Fei1 <fei1.li@intel.com>	2021-03-03 14:44:25 +08:00
Li Fei1	1db32f4d03	hv: ept: build 4KB page mapping in EPT for code pages of rtvm RTVM is enforced to use 4KB pages to mitigate CVE-2018-12207 and performance jitter, which may be introduced by splitting large page into 4KB pages on demand. It works fine in previous hardware platform where the size of address space for the RTVM is relatively small. However, this is a problem when the platforms support 64 bits high MMIO space, which could be super large and therefore consumes large # of EPT page table pages. This patch optimize it by using large page for purely data pages, such as MMIO spaces, even for the RTVM. Signed-off-by: Li Fei1 <fei1.li@intel.com> Tracked-On: #5788	2021-03-03 13:46:49 +08:00
Li Fei1	8d9f12f3b7	hv: page: use dynamic page allocation for pagetable mapping For FuSa's case, we remove all dynamic memory allocation use in ACRN HV. Instead, we use static memory allocation or embedded data structure. For pagetable page, we prefer to use an index (hva for MMU, gpa for EPT) to get a page from a special page pool. The special page pool should be big enougn for each possible index. This is not a big problem when we don't support 64 bits MMIO. Without 64 bits MMIO support, we could use the index to search addrss not larger than DRAM_SIZE + 4G. However, if ACRN plan to support 64 bits MMIO in SOS, we could not use the static memory alocation any more. This is because there's a very huge hole between the top DRAM address and the bottom 64 bits MMIO address. We could not reserve such many pages for pagetable mapping as the CPU physical address bits may very large. This patch will use dynamic page allocation for pagetable mapping. We also need reserve a big enough page pool at first. For HV MMU, we don't use 4K granularity page table mapping, we need reserve PML4, PDPT and PD pages according the maximum physical address space (PPT va and pa are identical mapping); For each VM EPT, we reserve PML4, PDPT and PD pages according to the maximum physical address space too, (the EPT address sapce can't beyond the physical address space), and we reserve PT pages by real use cases of DRAM, low MMIO and high MMIO. Signed-off-by: Li Fei1 <fei1.li@intel.com> Tracked-On: #5788	2021-03-01 13:10:04 +08:00
Li Fei1	5621fabbcb	hv: memory: remove get_sworld_memory_base API memory_ops structure will be changed to store page table related fields. However, secure world memory base address is not one of them, it's VM related. So save sworld_memory_base_hva in vm_arch structure directly. Signed-off-by: Li Fei1 <fei1.li@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com> Tracked-On: #5788	2021-03-01 13:10:04 +08:00
Yonghua Huang	fdfd28b140	hv: unmap software region of pre-RTVM from Service VM EPT Accessing to software SRAM region is not allowed when software SRAM is pass-thru to prelaunch RTVM. This patch removes software SRAM region from service VM EPT if it is enabled for prelaunch RTVM. Tracked-On: #5649 Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>	2021-02-25 09:35:31 +08:00
Tao Yuhong	50d8525618	HV: deny HV owned PCI bar access from SOS This patch denies Service VM the access permission to device resources owned by hypervisor. HV may own these devices: (1) debug uart pci device for debug version (2) type 1 pci device if have pre-launched VMs. Current implementation exposes the mmio/pio resource of HV owned devices to SOS, should remove them from SOS. Tracked-On: #5615 Signed-off-by: Tao Yuhong <yuhong.tao@intel.com>	2021-02-03 14:01:23 +08:00
Tao Yuhong	6e7ce4a73f	HV: deny pre-launched VM ptdev bar access from SOS This patch denies Service VM the access permission to device resources owned by pre-launched VMs. Rationale: * Pre-launched VMs in ACRN are independent of service VM, and should be immune to attacks from service VM. However, current implementation exposes the bar resource of passthru devices to service VM for some reason. This makes it possible for service VM to crash or attack pre-launched VMs. * It is same for hypervisor owned devices. NOTE: * The MMIO spaces pre-allocated to VFs are still presented to Service VM. The SR-IOV capable devices assigned to pre-launched VMs doesn't have the SR-IOV capability. So the MMIO address spaces pre-allocated by BIOS for VFs are not decoded by hardware and couldn't be enabled by guest. SOS may live with seeing the address space or not. We will revisit later. Tracked-On: #5615 Signed-off-by: Tao Yuhong <yuhong.tao@intel.com> Reviewed-by: Fei Li <fei1.li@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-02-03 14:01:23 +08:00
Shuo A Liu	d4aaf99d86	hv: keylocker: Support keylocker backup MSRs for Guest VM The logical processor scoped IWKey can be copied to or from a platform-scope storage copy called IWKeyBackup. Copying IWKey to IWKeyBackup is called ‘backing up IWKey’ and copying from IWKeyBackup to IWKey is called ‘restoring IWKey’. IWKeyBackup and the path between it and IWKey are protected against software and simple hardware attacks. This means that IWKeyBackup can be used to distribute an IWKey within the logical processors in a platform in a protected manner. Linux keylocker implementation uses this feature, so they are introduced by this patch. Tracked-On: #5695 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-02-03 13:54:45 +08:00
Shuo A Liu	38cd5b481d	hv: keylocker: host keylocker iwkey context switch Different vCPU may have different IWKeys. Hypervisor need do the iwkey context switch. This patch introduce a load_iwkey() function to do that. Switches the host iwkey when the switch_in vCPU satisfies: 1) keylocker feature enabled 2) Different from the current loaded one. Two opportunities to do the load_iwkey(): 1) Guest enables CR4.KL bit. 2) vCPU thread context switch. load_iwkey() costs ~600 cycles when do the load IWKey action. Tracked-On: #5695 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-02-03 13:54:45 +08:00
Shuo A Liu	c11c07e0fe	hv: keylocker: Support Key Locker feature for guest VM KeyLocker is a new security feature available in new Intel CPUs that protects data-encryption keys for the Advanced Encryption Standard (AES) algorithm. These keys are more valuable than what they guard. If stolen once, the key can be repeatedly used even on another system and even after vulnerability closed. It also introduces a CPU-internal wrapping key (IWKey), which is a key- encryption key to wrap AES keys into handles. While the IWKey is inaccessible to software, randomizing the value during the boot-time helps its value unpredictable. Keylocker usage: - New “ENCODEKEY” instructions take original key input and returns HANDLE crypted by an internal wrap key (IWKey, init by “LOADIWKEY” instruction) - Software can then delete the original key from memory - Early in boot/software, less likely to have vulnerability that allows stealing original key - Later encrypt/decrypt can use the HANDLE through new AES KeyLocker instructions - Note: * Software can use original key without knowing it (use HANDLE) * HANDLE cannot be used on other systems or after warm/cold reset * IWKey cannot be read from CPU after it's loaded (this is the nature of this feature) and only 1 copy of IWKey inside CPU. The virtualization implementation of Key Locker on ACRN is: - Each vCPU has a 'struct iwkey' to store its IWKey in struct acrn_vcpu_arch. - At initilization, every vCPU is created with a random IWKey. - Hypervisor traps the execution of LOADIWKEY (by 'LOADIWKEY exiting' VM-exectuion control) of vCPU to capture and save the IWKey if guest set a new IWKey. Don't support randomization (emulate CPUID to disable) of the LOADIWKEY as hypervisor cannot capture and save the random IWKey. From keylocker spec: "Note that a VMM may wish to enumerate no support for HW random IWKeys to the guest (i.e. enumerate CPUID.19H:ECX[1] as 0) as such IWKeys cannot be easily context switched. A guest ENCODEKEY will return the type of IWKey used (IWKey.KeySource) and thus will notice if a VMM virtualized a HW random IWKey with a SW specified IWKey." - In context_switch_in() of each vCPU, hypervisor loads that vCPU's IWKey into pCPU by LOADIWKEY instruction. - There is an assumption that ACRN hypervisor will never use the KeyLocker feature itself. This patch implements the vCPU's IWKey management and the next patch implements host context save/restore IWKey logic. Tracked-On: #5695 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-02-03 13:54:45 +08:00
Shuo A Liu	4483e93bd1	hv: keylocker: Enable the tertiary VM-execution controls In order for a VMM to capture the IWKey values of guests, processors that support Key Locker also support a new "LOADIWKEY exiting" VM-execution control in bit 0 of the tertiary processor-based VM-execution controls. This patch enables the tertiary VM-execution controls. Tracked-On: #5695 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-02-03 13:54:45 +08:00
Shuo A Liu	e9247dbca0	hv: keylocker: Simulate CPUID of keylocker caps for guest VM KeyLocker is a new security feature available in new Intel CPUs that protects data-encryption keys for the Advanced Encryption Standard (AES) algorithm. This patch emulates Keylocker CPUID leaf 19H to support Keylocker feature for guest VM. To make the hypervisor being able to manage the IWKey correctly, this patch doesn't expose hardware random IWKey capability (CPUID.0x19.ECX[1]) to guest VM. Tracked-On: #5695 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@Intel.com>	2021-02-03 13:54:45 +08:00
Shuo A Liu	15c967ad34	hv: keylocker: Add CR4 bit CR4_KL as CR4_TRAP_AND_PASSTHRU_BITS Bit19 (CR4_KL) of CR4 is CPU KeyLocker feature enable bit. Hypervisor traps the bit's writing to track the keylocker feature on/off of guest. While the bit is set by guest, - set cr4_kl_enabled to indicate the vcpu's keylocker feature enabled status - load vcpu's IWKey in host (will add in later patch) While the bit is clear by guest, - clear cr4_kl_enabled This patch trap and passthru the CR4_KL bit to guest for operation. Tracked-On: #5695 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-02-03 13:54:45 +08:00
Li Fei1	94a980c923	hv: hypercall: prevent sos can touch hv/pre-launched VM resource Current implementation, SOS may allocate the memory region belonging to hypervisor/pre-launched VM to a post-launched VM. Because it only verifies the start address rather than the entire memory region. This patch verifies the validity of the entire memory region before allocating to a post-launched VM so that the specified memory can only be allocated to a post-launched VM if the entire memory region is mapped in SOS’s EPT. Tracked-On: #5555 Signed-off-by: Li Fei1 <fei1.li@intel.com> Reviewed-by: Yonghua Huang <yonghua.huang@intel.com>	2021-02-02 16:55:40 +08:00
Yonghua Huang	8bec63a6ea	hv: remove the hardcoding of Software SRAM GPA base Currently, we hardcode the GPA base of Software SRAM to an address that is derived from TGL platform, as this GPA is identical with HPA for Pre-launch VM, This hardcoded address may not work on other platforms if the HPA bases of Software SRAM are different. Now, Offline tool configures above GPA based on the detection of Software SRAM on specific platform. This patch removes the hardcoding GPA of Software SRAM, and also renames MACRO 'SOFTWARE_SRAM_BASE_GPA' to 'PRE_RTVM_SW_SRAM_BASE_GPA' to avoid confusing, as it is for Prelaunch VM only. Tracked-On: #5649 Signed-off-by: Yonghua Huang <yonghua.huang@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-01-30 13:41:02 +08:00
Yonghua Huang	a6e666dbe7	hv: remove hardcoding of SW SRAM HPA base Physical address to SW SRAM region maybe different on different platforms, this hardcoded address may result in address mismatch for SW SRAM operations. This patch removes above hardcoded address and uses the physical address parsed from native RTCT. Tracked-On: #5649 Signed-off-by: Yonghua Huang <yonghua.huang@intel.com> Reviewed-by: Fei Li <fei1.li@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-01-28 11:29:25 +08:00
Yonghua Huang	a6420e8cfa	hv: cleanup legacy terminologies in RTCM module This patch updates below terminologies according to the latest TCC Spec: PTCT -> RTCT PTCM -> RTCM pSRAM -> Software SRAM Tracked-On: #5649 Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>	2021-01-28 11:29:25 +08:00
Yonghua Huang	806f479108	hv: rename RTCM source files 'ptcm' and 'ptct' are legacy name according to the latest TCC spec, hence rename below files to avoid confusing: ptcm.c -> rtcm.c ptcm.h -> rtcm.h ptct.h -> rtct.h Tracked-On: #5649 Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>	2021-01-28 11:29:25 +08:00
Jie Deng	5c5d272358	hv: remove bitmap_clear_lock of split-lock after completing emulation When "signal_event" is called, "wait_event" will actually not block. So it is ok to remove this line. Tracked-On: #5605 Signed-off-by: Jie Deng <jie.deng@intel.com>	2021-01-13 15:32:27 +08:00
Yin Fengwei	ef411d4ac3	hv: ptirq: Shouldn't change sid if intx irq mapping was added Now, we use hash table to maintain intx irq mapping by using the key generated from sid. So once the entry is added,we can not update source ide any more. Otherwise, we can't locate the entry with the key generated from new source ide. For source id change, remove_remapping/add_remapping is used instead of update source id directly if entry was added already. Tracked-On: #5640 Signed-off-by: Yin Fengwei <fengwei.yin@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2021-01-12 15:23:44 +08:00
Jie Deng	8aebf5526f	hv: move split-lock logic into dedicated file This patch move the split-lock logic into dedicated file to reduce LOC. This may make the logic more clear. Tracked-On: #5605 Signed-off-by: Jie Deng <jie.deng@intel.com>	2021-01-08 17:37:20 +08:00
Jie Deng	27d5711b62	hv: add a cache register for VMX_PROC_VM_EXEC_CONTROLS This patch adds a cache register for VMX_PROC_VM_EXEC_CONTROLS to avoid the frequent VMCS access. Tracked-On: #5605 Signed-off-by: Jie Deng <jie.deng@intel.com>	2021-01-08 17:37:20 +08:00
Jie Deng	f291997811	hv: split-lock: using MTF instead of TF(#DB) The TF is visible to guest which may be modified by the guest, so it is not a safe method to emulate the split-lock. While MTF is specifically designed for single-stepping in x86/Intel hardware virtualization VT-x technology which is invisible to the guest. Use MTF to single step the VCPU during the emulation of split lock. Tracked-On: #5605 Signed-off-by: Jie Deng <jie.deng@intel.com>	2021-01-08 17:37:20 +08:00
Jie Deng	6852438e3a	hv: Support concurrent split-lock emulation on SMP. For a SMP guest, split-lock check may happen on multiple vCPUs simultaneously. In this case, one vCPU at most can be allowed running in the split-lock emulation window. And if the vCPU is doing the emulation, it should never be blocked in the hypervisor, it should go back to the guest to execute the lock instruction immediately and trap back to the hypervisor with #DB to complete the split-lock emulation. Tracked-On: #5605 Signed-off-by: Jie Deng <jie.deng@intel.com>	2021-01-08 17:37:20 +08:00
Li Fei1	0b18389d95	hv: vcpuid: expose mce feature to guest Windows64 seems only support processor which has MCE (Machine Check Error) feature. Tracked-On: #5638 Signed-off-by: Li Fei1 <fei1.li@intel.com>	2021-01-08 17:22:34 +08:00
Jie Deng	b14c32a110	hv: Retain RIP only for fault exception. We have trapped the #DB for split-lock emulation. Only fault exception need RIP being retained. Tracked-On: #5605 Signed-off-by: Jie Deng <jie.deng@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-12-31 11:12:33 +08:00
Jie Deng	977e862192	hv: Add split-lock emulation for xchg xchg may also cause the #AC for split-lock check. This patch adds this emulation. 1. Kick other vcpus of the guest to stop execution if the guest has more than one vcpu. 2. Emulate the xchg instruction. 3. Notify other vcpus (if any) to restart execution. Tracked-On: #5605 Signed-off-by: Jie Deng <jie.deng@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-12-31 11:12:33 +08:00
Jie Deng	47e193a7bb	hv: Add split-lock emulation for LOCK prefix instruction This patch adds the split-lock emulation. If a #AC is caused by instruction with LOCK prefix then emulate it, otherwise, inject it back as it used to be. 1. Kick other vcpus of the guest to stop execution and set the TF flag to have #DB if the guest has more than one vcpu. 2. Skip over the LOCK prefix and resume the current vcpu back to guest for execution. 3. Notify other vcpus to restart exception at the end of handling the #DB since we have completed the LOCK prefix instruction emulation. Tracked-On: #5605 Signed-off-by: Jie Deng <jie.deng@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-12-31 11:12:33 +08:00
Yonghua Huang	643bbcfe34	hv: check the availability of guest CR4 features Check hardware support for all features in CR4, and hide bits from guest by vcpuid if they're not supported for guests OS. Tracked-On: #5586 Signed-off-by: Yonghua Huang <yonghua.huang@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-12-18 11:21:22 +08:00
Yonghua Huang	442fc30117	hv: refine virtualization flow for cr0 and cr4 - The current code to virtualize CR0/CR4 is not well designed, and hard to read. This patch reshuffle the logic to make it clear and classify those bits into PASSTHRU, TRAP_AND_PASSTHRU, TRAP_AND_EMULATE & reserved bits. Tracked-On: #5586 Signed-off-by: Eddie Dong <eddie.dong@intel.com> Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>	2020-12-18 11:21:22 +08:00
Yonghua Huang	08c42f91c9	hv: rename hypercall for hv-emulated device management Coding style cleanup, use add/remove instead of create/destroy. Tracked-On: #5586 Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>	2020-12-07 16:25:17 +08:00
Shiqing Gao	6f10bd00bf	hv: coding style clean-up related to Boolean While following two styles are both correct, the 2nd one is simpler. bool is_level_triggered; 1. if (is_level_triggered == true) {...} 2. if (is_level_triggered) {...} This patch cleans up the style in hypervisor. Tracked-On: #861 Signed-off-by: Shiqing Gao <shiqing.gao@intel.com>	2020-11-28 14:51:32 +08:00
Junming Liu	1cd932e568	hv: refine code style refine code style Tracked-On: #4020 Signed-off-by: Junming Liu <junming.liu@intel.com>	2020-11-26 12:56:28 +08:00
Junming Liu	56eb859ea4	hv: vmexit: refine xsetbv_vmexit_handler API From SDM Vol.2C - XSETBV instruction description, If CR4.OSXSAVE[bit 18] = 0, execute "XSETBV" instruction will generate #UD exception. From SDM Vol.3C 25.1.1,#UD exception has priority over VM exits, So if vCPU execute "XSETBV" instruction when CR4.OSXSAVE[bit 18] = 0, VM exits won't happen. While hv inject #GP if vCPU execute "XSETBV" instruction when CR4.OSXSAVE[bit 18] = 0. It's a wrong behavior, this patch will fix the bug. Tracked-On: #4020 Signed-off-by: Junming Liu <junming.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-11-26 12:56:28 +08:00
Peter Fang	68dc8d9f8f	hv: pm: avoid duplicate shutdowns on RTVM It is possible for more than one vCPUs to trigger shutdown on an RTVM. We need to avoid entering VM_READY_TO_POWEROFF state again after the RTVM has been paused or shut down. Also, make sure an RTVM enters VM_READY_TO_POWEROFF state before it can be paused. v1 -> v2: - rename to poweroff_if_rt_vm for better clarity Tracked-On: #5411 Signed-off-by: Peter Fang <peter.fang@intel.com>	2020-11-11 14:05:39 +08:00
dongshen	ca5683f78d	hv: add support for shutdown for pre-launched VMs Currently, ACRN only support shutdown when triple fault happens, because ACRN doesn't present/emulate a virtual HW, i.e. port IO, to support shutdown. This patch emulate a virtual shutdown component, and the vACPI method for guest OS to use. Pre-launched VM uses ACPI reduced HW mode, intercept the virtual sleep control/status registers for pre-launched VMs shutdown Tracked-On: #5411 Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>	2020-11-04 10:33:31 +08:00
dongshen	8f79ceefbd	hv: fix out-of-date comments related to pre-launched VMs rebooting Like post-launched VMs, for pre-launched VMs, the ACPI reset register is also fixed at 0xcf9 and the reset value is 0xE, so pre-launched VMs now also use ACPI reset register for rebooting. Tracked-On: #5411 Signed-off-by: dongshen <dongsheng.x.zhang@intel.com>	2020-11-04 10:33:31 +08:00
Peter Fang	70b1218952	hv: pm: support shutting down multiple VMs when pCPUs are shared More than one VM may request shutdown on the same pCPU before shutdown_vm_from_idle() is called in the idle thread when pCPUs are shared among VMs. Use a per-pCPU bitmap to store all the VMIDs requesting shutdown. v1 -> v2: - use vm_lock to avoid a race on shutdown Tracked-On: #5411 Signed-off-by: Peter Fang <peter.fang@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-11-04 10:33:31 +08:00
Li Fei1	c6f9404f55	hv: psram: add kconfig to enable psram Add two Kconfig pSRAM config: one for whether to enable the pSRAM on the platfrom or not; another for if the pSRAM is enabled on the platform whether to enable the pSRAM in the pre-launched RTVM. If we enable the pSRAM on the platform, we should remove the pSRAM EPT mapping from the SOS to prevent it could flush the pSRAM cache. Tracked-On: #5330 Signed-off-by: Qian Wang <qian1.wang@intel.com>	2020-11-02 15:56:30 +08:00
Qian Wang	99ee76781f	hv: pSRAM: add pSRAM support for pre-launched RTVM 1.Modified the virtual e820 table for pre-launched VM. We added a segment for pSRAM, and thus lowmem RAM is split into two parts. Logics are added to deal with the split. 2.Added EPT mapping of pSRAM segment for pre-launched RTVM if it uses pSRAM. Tracked-On: #5330 Signed-off-by: Qian Wang <qian1.wang@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-11-02 15:56:30 +08:00
Qian Wang	a557105e71	hv: ept: set EPT cache attribute to WB for pSRAM pSRAM memory should be cachable. However, it's not a RAM or a normal MMIO, so we can't use the an exist API to do the EPT mapping and set the EPT cache attribute to WB for it. Now we assume that SOS must assign the PSRAM area as a whole and as a separate memory region whose base address is PSRAM_BASE_HPA. If the hpa of the EPT mapping region is equal to PSRAM_BASE_HPA, we think this EPT mapping is for pSRAM, we change the EPT mapping cache attribute to WB. And fix a minor bug when SOS trap out to emulate wbinvd when pSRAM is enabled. Tracked-On: #5330 Signed-off-by: Qian Wang <qian1.wang@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-11-02 15:56:30 +08:00
Qian Wang	ca2aee225c	hv: skip pSRAM for guest WBINVD emulation Use ept_flush_leaf_page to emulate guest WBINVD when PTCM is enabled and skip the pSRAM in ept_flush_leaf_page. TODO: do we need to emulate WBINVD in HV side. Tracked-On: #5330 Signed-off-by: Qian Wang <qian1.wang@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-11-02 10:29:43 +08:00
Qian Wang	77269c15c5	hv: vcr: remove wbinvd for CR0.CD emulation According 11.5.1 Cache Control Registers and Bits, Intel SDM Vol 3, change CR0.CD will not flush cache to insure memory coherency. So it's not needed to call wbinvd to flush cache in ACRN Hypervisor. That's what the guest should do. Tracked-On: #5330 Signed-off-by: Qian Wang <qian1.wang@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-11-02 10:29:43 +08:00
Tao Yuhong	4120bd391a	HV: decouple legacy vuart interface from acrn_vuart layer support pci-vuart type, and refine: 1.Rename init_vuart() to init_legacy_vuarts(), only init PIO type. 2.Rename deinit_vuart() to deinit_legacy_vuarts(), only deinit PIO type. 3.Move io handler code out of setup_vuart(), into init_legacy_vuarts() 4.add init_pci_vuart(), deinit_pci_vuart, for one pci vuart vdev. and some change from requirement: 1.Increase MAX_VUART_NUM_PER_VM to 8. Tracked-On: #5394 Signed-off-by: Tao Yuhong <yuhong.tao@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com> Reviewed-by: Wang, Yu1 <yu1.wang@intel.com>	2020-10-30 20:41:34 +08:00
Yonghua Huang	3ea1ae1e11	hv: refine msi interrupt injection functions 1. refine the prototype of 'inject_msi_lapic_pt()' 2. rename below function: - rename 'vlapic_intr_msi()' to 'vlapic_inject_msi()' - rename 'inject_msi_lapic_pt()' to 'inject_msi_for_lapic_pt()' - rename 'inject_msi_lapic_virt()' to 'inject_msi_for_non_lapic_pt()' Tracked-On: #5407 Signed-off-by: Yonghua Huang <yonghua.huang@intel.com> Reviewed-by: Li Fei <fei1.li@intel.com> Reviewed-by: Wang, Yu1 <yu1.wang@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-10-26 08:44:13 +08:00
Yonghua Huang	012927d0bd	hv: move function 'inject_msi_lapic_pt()' to vlapic.c This function can be used by other modules instead of hypercall handling only, hence move it to vlapic.c Tracked-On: #5407 Signed-off-by: Yonghua Huang <yonghua.huang@intel.com> Reviewed-by: Li, Fei <fei1.li@intel.com> Reviewed-by: Wang, Yu1 <yu1.wang@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-10-26 08:44:13 +08:00
Zide Chen	bebffb29fc	hv: remove de-privilege boot mode support and remove vboot wrappers Now ACRN supports direct boot mode, which could be SBL/ABL, or GRUB boot. Thus the vboot wrapper layer can be removed and the direct boot functions don't need to be wrapped in direct_boot.c: - remove call to init_vboot(), and call e820_alloc_memory() directly at the time when the trampoline buffer is actually needed. - Similarly, call CPU_IRQ_ENABLE() instead of the wrapper init_vboot_irq(). - remove get_ap_trampoline_buf(), since the existing function get_trampoline_start16_paddr() returns the exact same value. - merge init_general_vm_boot_info() into init_vm_boot_info(). - remove vm_sw_loader pointer, and call direct_boot_sw_loader() directly. - move get_rsdp_ptr() from vboot_wrapper.c to multiboot.c, and remove the wrapper over two boot modes. Tracked-On: #5197 Signed-off-by: Zide Chen <zide.chen@intel.com>	2020-10-21 15:09:26 +08:00
Victor Sun	c63899fc81	HV: correct hpa calculation for pre-launched VM The commit of `da81a0041d` "HV: add e820 ACPI entry for pre-launched VM" introduced a issue that the base_hpa and remaining_hpa_size are also calculated on the entry of 32bit PCI hole which from 0x80000000 to 0xffffffff, which is incorrect; Tracked-On: #5266 Signed-off-by: Victor Sun <victor.sun@intel.com>	2020-09-15 09:45:10 +08:00
Li Fei1	a2fd8c5a9d	pci: mcfg: limit device bus numbers which could access by ECAM Per PCI Firmware Specification Revision 3.0, 4.1.2. MCFG Table Description: Memory Mapped Enhanced Configuration Space Base Address Allocation Structure assign the Start Bus Number and the End Bus Number which could decoded by the Host Bridge. We should not access the PCI device which bus number outside of the range of [Start Bus Number, End Bus Number). For ACRN, we should: 1. Don't detect PCI device which bus number outside the range of [Start Bus Number, End Bus Number) of MCFG ACPI Table. 2. Only trap the ECAM MMIO size: [MMCFG_BASE_ADDRESS, MMCFG_BASE_ADDRESS + (End Bus Number - Start Bus Number + 1) * 0x100000) for SOS. Tracked-On: #5233 Signed-off-by: Li Fei1 <fei1.li@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-09-09 09:31:56 +08:00
Victor Sun	34547e1e19	HV: add acpi module support for pre-launched VM Previously we use a pre-defined structure as vACPI table for pre-launched VM, the structure is initialized by HV code. Now change the method to use a pre-loaded multiboot module instead. The module file will be generated by acrn-config tool and loaded to GPA 0x7ff00000, a hardcoded RSDP table at GPA 0x000f2400 will point to the XSDT table which at GPA 0x7ff00080; Tracked-On: #5266 Signed-off-by: Victor Sun <victor.sun@intel.com> Signed-off-by: Shuang Zheng <shuang.zheng@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-09-08 19:52:25 +08:00
Victor Sun	da81a0041d	HV: add e820 ACPI entry for pre-launched VM Previously the ACPI table was stored in F segment which might not be big enough for a customized ACPI table, hence reserve 1MB space in pre-launched VM e820 table to store the ACPI related data: 0x7ff00000 ~ 0x7ffeffff : ACPI Reclaim memory 0x7fff0000 ~ 0x7fffffff : ACPI NVS memory Tracked-On: #5266 Signed-off-by: Victor Sun <victor.sun@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-09-08 19:52:25 +08:00
Nishioka, Toshiki	77fb21e98c	hv: add vgpio device model support When HV pass through the P2SB MMIO device to pre-launched VM, vgpio device model traps MMIO access to the GPIO registers within P2SB so that it can expose virtual IOAPIC pins to the VM in accordance with the programmed mappings between gsi and vgsi. Tracked-On: #5246 Signed-off-by: Toshiki Nishioka <toshiki.nishioka@intel.com> Reviewed-by: Junjie Mao <junjie.mao@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-09-07 14:52:02 +08:00
Nishioka, Toshiki	ba99984f69	hv: add INTx mapping for pre-launched VMs Add the capability of forwarding specified physical IOAPIC interrupt lines to pre-launched VMs as virtual IOAPIC interrupts. This is for the sake of the certain MMIO pass-thru devices on EHL CRB which can support only INTx interrupts. Tracked-On: #5245 Signed-off-by: Toshiki Nishioka <toshiki.nishioka@intel.com> Reviewed-by: Junjie Mao <junjie.mao@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-09-07 14:52:02 +08:00
Yonghua Huang	c03623f3fb	hv[v2]: Remove deprecated term in vPIC submodule This patch cleanup below deprecated terms: 'master' -> 'primary' 'slave' -> 'secondary' v2 update: Refine comments. Tracked-On: #5249 Signed-off-by: Yonghua Huang <yonghua.huang@intel.com>	2020-09-01 09:30:08 +08:00
Yuan Liu	8a34cf03ca	hv: add new hypercalls to create and destroy an emulated device in hypervisor Add HC_CREATE_VDEV and HC_DESTROY_VDEV two hypercalls that are used to create and destroy an emulated device(PCI device or legacy device) in hypervisor v3: 1) change HC_CREATE_DEVICE and HC_DESTROY_DEVICE to HC_CREATE_VDEV and HC_DESTROY_VDEV 2) refine code style v4: 1) remove unnecessary parameter 2) add VM state check for HC_CREATE_VDEV and HC_DESTROY hypercalls Tracked-On: #4853 Reviewed-by: Wang, Yu1 <yu1.wang@intel.com> Signed-off-by: Yuan Liu <yuan1.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-08-28 16:53:12 +08:00
Mingqiang Chi	53b11d1048	refine hypercall -- use an array to fast locate the hypercall handler to replace switch case. -- uniform hypercall handler as below: int32_t (*handler)(sos_vm, target_vm, param1, param2) Tracked-On: #4958 Signed-off-by: Mingqiang Chi <mingqiang.chi@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com> Reviewed-by: Eddie Dong <eddie.dong@intel.com>	2020-08-26 14:55:24 +08:00
Junming Liu	3631a85c3c	hv:cpu-caps:refine is_apl_platform func and clean up duplicated code Fix the bug for "is_apl_platform" func. "monitor_cap_buggy" is identical to "is_apl_platform", so remove it. On apl platform: 1) ACRN doesn't use monitor/mwait instructions 2) ACRN disable GPU IOMMU Tracked-On:#3675 Signed-off-by: Junming Liu <junming.liu@intel.com>	2020-08-14 10:08:50 +08:00
Mingqiang Chi	a67a85c70d	hv:refine vm & vcpu lock -- move vm_state_lock to other place in vm structure to avoid the memory waste because of the page-aligned. -- remove the memset from create_vm -- explicitly set max_emul_mmio_regions and vcpuid_entry_nr to 0 inside create_vm to avoid use without initialization. -- rename max_emul_mmio_regions to nr_emul_mmio_regions v1->v2: add deinit_emul_io in shutdown_vm Tracked-On: #4958 Signed-off-by: Mingqiang Chi <mingqiang.chi@intel.com> Reviewed-by: Grandhi, Sainath <sainath.grandhi@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-08-05 13:39:28 +08:00
Victor Sun	b9ad04d24d	HV: add cpu affinity info for SOS VM Previously the CPU affinity of SOS VM is initialized at runtime during sanitize_vm_config() stage, follow the policy that all physical CPUs except ocuppied by Pre-launched VMs are all belong to SOS_VM. Now change the process that SOS CPU affinity should be initialized at build time and has the assumption that its validity is guarenteed before runtime. Tracked-On: #5077 Signed-off-by: Victor Sun <victor.sun@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-08-04 09:05:29 +08:00
Shuo A Liu	112f02851c	hv: Disable XSAVE-managed CET state of guest VM To hide CET feature from guest VM completely, the MSR IA32_MSR_XSS also need to be intercepted because it comprises CET_U and CET_S feature bits of xsave/xstors operations. Mask these two bits in IA32_MSR_XSS writing. With IA32_MSR_XSS interception, member 'xss' of 'struct ext_context' can be removed because it is duplicated with the MSR store array 'vcpu->arch.guest_msrs[]'. Tracked-On: #5074 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2020-07-23 20:15:57 +08:00
Shuo A Liu	ac598b0856	hv: Hide CET feature from guest VM Return-oriented programming (ROP), and similarly CALL/JMP-oriented programming (COP/JOP), have been the prevalent attack methodologies for stealth exploit writers targeting vulnerabilities in programs. CET (Control-flow Enforcement Technology) provides the following capabilities to defend against ROP/COP/JOP style control-flow subversion attacks: * Shadow stack: Return address protection to defend against ROP. * Indirect branch tracking: Free branch protection to defend against COP/JOP The full support of CET for Linux kernel has not been merged yet. As the first stage, hide CET from guest VM. Tracked-On: #5074 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com>	2020-07-23 20:15:57 +08:00
Li Fei1	5e605e0daf	hv: vmcall: check vm id in dispatch_sos_hypercall Check whether vm_id is valid in dispatch_sos_hypercall Tracked-On: #4550 Signed-off-by: Li Fei1 <fei1.li@intel.com>	2020-07-23 20:13:20 +08:00
Li Fei1	acc69007e2	hv: mmio_dev: add mmio device pass through support Add mmio device pass through support for pre-launched VM. When we pass through a MMIO device to pre-launched VM, we would remove its resource from the SOS. Now these resources only include the MMIO regions. Tracked-On: #5053 Acked-by: Eddie Dong <eddie.dong@intel.com> Signed-off-by: Li Fei1 <fei1.li@intel.com>	2020-07-23 20:13:20 +08:00
Li Fei1	baf77a79ad	hv: mmio_dev: add hypercall to support mmio device pass through Add two hypercalls to support MMIO device pass through for post-launched VM. And when we support MMIO pass through for pre-launched VM, we could re-use the code in mmio_dev.c Tracked-On: #5053 Signed-off-by: Li Fei1 <fei1.li@intel.com>	2020-07-23 20:13:20 +08:00
Conghui Chen	821c65b40c	hv: fix possible SSE region mismatch issue During context switch in hypervisor, xsave/xrstore are used to save/resotre the XSAVE area according to the XCR0 and XSS. The legacy region in XSAVE area include FPU and SSE, we should make sure the legacy region be saved during contex switch. FPU in XCR0 is always enabled according to SDM. For SSE, we enable it in XCR0 during context switch. Tracked-On: #5062 Signed-off-by: Conghui Chen <conghui.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-07-22 14:19:21 +08:00
Conghui Chen	53d4a7169b	hv: remove kick_thread from scheduler module kick_thread function is only used by kick_vcpu to kick vcpu out of non-root mode, the implementation in it is sending IPI to target CPU if target obj is running and target PCPU is not current one; while for runnable obj, it will just make reschedule request. So the kick_thread is not actually belong to scheduler module, we can drop it and just do the cpu notification in kick_vcpu. Tracked-On: #5057 Signed-off-by: Conghui Chen <conghui.chen@intel.com> Reviewed-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-07-22 13:38:41 +08:00
Conghui Chen	b6422f8985	hv: remove 'running' from vcpu structure vcpu->running is duplicated with THREAD_STS_RUNNING status of thread object. Introduce an API sleep_thread_sync(), which can utilize the inner status of thread object, to do the sync sleep for zombie_vcpu(). Tracked-On: #5057 Signed-off-by: Conghui Chen <conghui.chen@intel.com> Reviewed-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-07-22 13:38:41 +08:00
Mingqiang Chi	aa89eb3541	hv:add per-vm lock for vm & vcpu state change -- replace global hypercall lock with per-vm lock -- add spinlock protection for vm & vcpu state change v1-->v2: change get_vm_lock/put_vm_lock parameter from vm_id to vm move lock obtain before vm state check move all lock from vmcall.c to hypercall.c Tracked-On: #4958 Signed-off-by: Mingqiang Chi <mingqiang.chi@intel.com Reviewed-by: Jason Chen CJ <jason.cj.chen@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-07-20 11:22:17 +08:00
Li Fei1	80c7da8f1c	hv: vioapic: expose ioapic to guest unconditionally Some OSes assume the platform must have the IOAPIC. For example: Linux Kernel allocates IRQ force from GSI (0 if there's no PIC and IOAPIC) on x86. And it thinks IRQ 0 is an architecture special IRQ, not for device driver. As a result, the device driver may goes wrong if the allocated IRQ is 0 for RTVM. This patch expose vIOAPIC to RTVM with LAPIC passthru even though the RTVM can't use IOAPIC, it servers as a place holder to fullfil the guest assumption. After vIOAPIC has exposed to guest unconditionally, the 'ready' field could be removed since we do vIOAPIC initialization for each guest. Tracked-On: #4691 Signed-off-by: Li Fei1 <fei1.li@intel.com> Acked-by: Eddie Dong <eddie.dong@intel.com>	2020-07-10 19:33:46 +08:00
Shuo A Liu	0397cb7174	hv: Fix the interrupts lost issue with PI support Currently, not all platforms support posted interrupt processing of both VT-x and VT-d. On EHL, VT-d doesn't support posted interrupt processing. So in such scenario, is_pi_capable() in vcpu_handle_pi_notification() will bypass the PIR pending bits check which might cause a self-NV-IPI lost. With commit "bf1ff8c98 (hv: Offload syncing PIR to vIRR to processor hardware)", the syncing PIR to vIRR is postponed and it is handled by a self-NV-IPI in the following VMEnter. The process looks like, a) vcpu A accepts a virtual interrupt -> 1) ACRN_REQUEST_EVENT is set 2) corresponding bit in PIR is set 3) Posted Interrupt ON bit is set b) vcpu A does virtual interrupt injection on resume path due to the pending ACRN_REQUEST_EVENT -> 1) hypervisor disables host interrupt 2) ACRN_REQUEST_EVENT is cleared 3) a self-NV-IPI is sent via ICR of LAPIC. 4) IRR bit of the self-NV-IPI is set c) (VM-ENTRY) vcpu A returns into non-root mode 1) host interrupt enable(by HW) 2) posted interrupt processing clears the ON bit, sync PIR to vIRR 3) deliver the virtual interrupt if guest rflags.IF=1 d) (VM-EXIT) vcpu A traps due to a instruction execution (e.g. HLT) 1) host interrupt disable(by HW) 2) hypervisor enable host interrupt Above illustrates a normal process of the virtual interrupt injection with cpu PI support. However, a failing case is observed. The failing case is that the self-NV-IPI from b-3 is not accepted by the core until a timing between d-1 and d-2. b-4 happening between d-1 and d-2 is observed by debug trace. So the self-NV-IPI will be handled in root-mode which cannot do the syncing PIR to vIRR processing. Due to the bug described in the first paragraph, vcpu_handle_pi_notification() cannot succeed the virtual interrupt injection request. This patch fix it by removing the wrong check in vcpu_handle_pi_notification() because vcpu_handle_pi_notification() only happens on platform with cpu PI support. Here are some cost data for sending IPI via LAPIC ICR regsiter. Normally, the cycles between ICR write and IRR got set is 140~260, which is not accurate due to the MSR read overhead. And from b-3 to c is about 560 cycles. So b-4 happens during this period. But in bad case, b-4 doesn't happen even c is triggered. The worse case i captured is that ICR write and IRR got set costs more than 1900 cycles. Now, the best GUESS of the huge cost of IPI via ICR is the ACPI bus arbitration(refer to SDM 10.6.3, 10.7 and Figure 10-17). Tracked-On: #4937 Signed-off-by: Shuo A Liu <shuo.a.liu@intel.com> Acked-by： Eddie Dong <eddie.dong@intel.com>	2020-06-28 10:33:22 +08:00

1 2 3 4 5 ...

1268 Commits