.. _rt_performance_tuning:

Real-Time (RT) Performance Analysis on ACRN
###########################################

This document describes methods for collecting trace and performance data for
real-time performance analysis of an ACRN Real-Time VM (RTVM). It covers two
parts:

- A method to trace ``vmexit`` occurrences for analysis.
- A method to collect Performance Monitoring Counter (PMC) information for
  tuning, based on the Performance Monitoring Unit (PMU).

``vmexit`` analysis for ACRN RT performance
*******************************************

A ``vmexit`` is triggered in response to certain instructions and events, and
VM exits are a key source of performance degradation in virtual machines.
While a hard RTVM is running on ACRN, the following operations cause a
``vmexit`` and therefore affect real-time deterministic latency:

- CPUID
- TSC_Adjust read/write
- TSC write
- APICID/LDR read
- ICR write

Generally, no ``vmexit`` should occur during the critical section of the RT task.

The methodology of ``vmexit`` analysis is straightforward. First, clearly
identify the **critical section** of the RT task, that is, the period of time
during which no ``vmexit`` should occur. Different RT tasks have different
critical sections. This document uses the cyclictest benchmark as an example
of how to do ``vmexit`` analysis.

The critical sections
=====================

Here is example pseudocode of a cyclictest implementation.

.. code-block:: none

   while (!shutdown) {
         …
         clock_nanosleep(&next)
         clock_gettime(&now)
         latency = calcdiff(now, next)
         …
         next += interval
   }

Time point ``now`` is the actual point at which the cyclictest app is woken up
and scheduled. Time point ``next`` is the expected point at which we want
the cyclictest app to be woken up and scheduled. The latency is therefore
``now - next``. We don't want to see any ``vmexit`` between ``next`` and
``now``, so we define the start point of the critical section as ``next``
and the end point as ``now``.

Log and trace data collection
=============================

#. Add timestamps (in TSC) at ``next`` and ``now``.
#. Capture the log with the above timestamps in the RTVM.
#. Capture the ``acrntrace`` log in the Service VM at the same time.

Offline analysis
================

#. Convert the raw trace data to a human-readable format.
#. Merge the logs in the RTVM and the ACRN hypervisor trace based on timestamps (in TSC).
#. Check whether any ``vmexit`` occurred within the critical sections. The pattern is as follows:

   .. figure:: images/vm_exits_log.png
      :align: center
      :name: vm_exits_log

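The merge-and-check steps above can be sketched in a few lines of Python.
This is a minimal sketch under stated assumptions, not part of the ACRN
tooling: it assumes that the ``next``/``now`` timestamps from the RTVM log and
the ``vmexit`` timestamps from the converted ``acrntrace`` output have already
been parsed into integer TSC values taken from the same TSC domain. The names
``CriticalSection`` and ``vmexits_in_critical_sections`` are hypothetical, and
the sample data is made up.

.. code-block:: python

   from bisect import bisect_left, bisect_right
   from typing import List, NamedTuple, Tuple


   class CriticalSection(NamedTuple):
       """One cyclictest iteration: TSC values logged at ``next`` and ``now``."""
       next_tsc: int
       now_tsc: int


   def vmexits_in_critical_sections(
       sections: List[CriticalSection],
       vmexit_tscs: List[int],
   ) -> List[Tuple[CriticalSection, List[int]]]:
       """Return every critical section that contains at least one vmexit,
       paired with the TSC timestamps of the vmexits that fall inside it."""
       exits = sorted(vmexit_tscs)
       hits = []
       for section in sections:
           # Slice of exits with next_tsc <= tsc <= now_tsc.
           lo = bisect_left(exits, section.next_tsc)
           hi = bisect_right(exits, section.now_tsc)
           if lo < hi:
               hits.append((section, exits[lo:hi]))
       return hits


   # Made-up sample data: two iterations and three vmexit timestamps.
   sections = [CriticalSection(1_000, 1_400), CriticalSection(2_000, 2_300)]
   vmexit_tscs = [500, 2_100, 2_250]
   for section, exits in vmexits_in_critical_sections(sections, vmexit_tscs):
       print(f"vmexit inside [{section.next_tsc}, {section.now_tsc}]: {exits}")

Any critical section reported by such a check is a candidate for closer
inspection in the merged log.
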
Collecting Performance Monitoring Counters data
***********************************************

Enable Performance Monitoring Unit (PMU) support in VM
======================================================

By default, the ACRN hypervisor doesn't expose the PMU-related CPUID and
MSRs to the guest VM. To use Performance Monitoring Counters (PMCs)
in the guest VM, modify the ACRN hypervisor code to expose the
capability to the RTVM.

Note that Precise Event Based Sampling (PEBS) is not yet enabled in the VM.

#. Expose the CPUID leaf 0xA as below:

   .. code-block:: none

      --- a/hypervisor/arch/x86/guest/vcpuid.c
      +++ b/hypervisor/arch/x86/guest/vcpuid.c
      @@ -345,7 +345,7 @@ int32_t set_vcpuid_entries(struct acrn_vm *vm)
                               break;
                       /* These features are disabled */
                       /* PMU is not supported */
      -               case 0x0aU:
      +               //case 0x0aU:
                       /* Intel RDT */
                       case 0x0fU:
                       case 0x10U:

#. Expose the PMU-related MSRs to the VM as below:

   .. code-block:: none

      --- a/hypervisor/arch/x86/guest/vmsr.c
      +++ b/hypervisor/arch/x86/guest/vmsr.c
      @@ -337,6 +337,41 @@ void init_msr_emulation(struct acrn_vcpu *vcpu)
               /* don't need to intercept rdmsr for these MSRs */
               enable_msr_interception(msr_bitmap, MSR_IA32_TIME_STAMP_COUNTER, INTERCEPT_WRITE);

      +
      +        /* Passthru PMU related MSRs to guest */
      +        enable_msr_interception(msr_bitmap, MSR_IA32_FIXED_CTR_CTL, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PERF_GLOBAL_CTRL, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PERF_GLOBAL_STATUS, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PERF_GLOBAL_OVF_CTRL, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PERF_GLOBAL_STATUS_SET, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PERF_GLOBAL_INUSE, INTERCEPT_DISABLE);
      +
      +        enable_msr_interception(msr_bitmap, MSR_IA32_FIXED_CTR0, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_FIXED_CTR1, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_FIXED_CTR2, INTERCEPT_DISABLE);
      +
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PMC0, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PMC1, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PMC2, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PMC3, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PMC4, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PMC5, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PMC6, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PMC7, INTERCEPT_DISABLE);
      +
      +        enable_msr_interception(msr_bitmap, MSR_IA32_A_PMC0, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_A_PMC1, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_A_PMC2, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_A_PMC3, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_A_PMC4, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_A_PMC5, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_A_PMC6, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_A_PMC7, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PERFEVTSEL0, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PERFEVTSEL1, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PERFEVTSEL2, INTERCEPT_DISABLE);
      +        enable_msr_interception(msr_bitmap, MSR_IA32_PERFEVTSEL3, INTERCEPT_DISABLE);
      +
               /* Setup MSR bitmap - Intel SDM Vol3 24.6.9 */
               value64 = hva2hpa(vcpu->arch.msr_bitmap);
               exec_vmwrite64(VMX_MSR_BITMAP_FULL, value64);

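Once the changes above are applied, one way to confirm what the guest sees is
to read CPUID leaf 0xA inside the RTVM and decode it. The following is a
minimal sketch, assuming the raw ``EAX`` and ``EDX`` values of the leaf have
already been obtained with some CPUID dump utility. The field layout follows
the architectural performance monitoring leaf as described in the Intel SDM.
The names ``PmuCapabilities`` and ``decode_cpuid_0a`` are hypothetical, and the
register values are illustrative, not taken from an ACRN system.

.. code-block:: python

   from typing import NamedTuple


   class PmuCapabilities(NamedTuple):
       version: int              # EAX[7:0]   architectural PMU version ID
       gp_counters: int          # EAX[15:8]  general-purpose counters per logical CPU
       gp_counter_width: int     # EAX[23:16] bit width of general-purpose counters
       fixed_counters: int       # EDX[4:0]   fixed-function counters (version > 1)
       fixed_counter_width: int  # EDX[12:5]  bit width of fixed-function counters


   def decode_cpuid_0a(eax: int, edx: int) -> PmuCapabilities:
       """Decode the EAX/EDX registers returned by CPUID leaf 0xA."""
       return PmuCapabilities(
           version=eax & 0xFF,
           gp_counters=(eax >> 8) & 0xFF,
           gp_counter_width=(eax >> 16) & 0xFF,
           fixed_counters=edx & 0x1F,
           fixed_counter_width=(edx >> 5) & 0xFF,
       )


   # Illustrative register values only.
   caps = decode_cpuid_0a(eax=0x07300404, edx=0x00000603)
   print(caps)
   if caps.version == 0:
       print("CPUID leaf 0xA reports no architectural PMU to this VM")

A version ID of 0 indicates that no architectural PMU is reported, which is
what a guest would be expected to see while the leaf is still hidden by the
hypervisor.
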
Perf/PMU tools in performance analysis
======================================

After exposing PMU-related CPUID/MSRs to the VM, performance analysis tools
such as **perf** and **pmu-tools** can be used inside the VM to locate
the bottleneck of the application.

**Perf** is a profiler tool for Linux 2.6+ based systems that abstracts away
CPU hardware differences in Linux performance measurements and presents a
simple command line interface. Perf is based on the ``perf_events`` interface
exported by recent versions of the Linux kernel. Refer to the following links
for perf usage:

- https://perf.wiki.kernel.org/index.php/Main_Page
- https://perf.wiki.kernel.org/index.php/Tutorial

**PMU tools** is a collection of tools for profile collection and performance
analysis on Intel CPUs on top of Linux Perf. Refer to
https://github.com/andikleen/pmu-tools for pmu-tools usage.

Top-down Micro-Architecture Analysis Method (TMAM)
==================================================

The Top-down Micro-Architecture Analysis Method (TMAM), based on the Top-Down
Characterization methodology, aims to provide insight into whether you
have made wise choices with your algorithms and data structures. See the
Intel |reg| 64 and IA-32 `Architectures Optimization Reference Manual <http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf>`_,
Appendix B.1 for more details on TMAM. See also this `technical paper <https://fd.io/wp-content/uploads/sites/34/2018/01/performance_analysis_sw_data_planes_dec21_2017.pdf>`_,
which adopts TMAM for systematic performance benchmarking and analysis
of compute-native Network Function data planes that are executed on
Commercial-Off-The-Shelf (COTS) servers using available open-source
measurement tools.

Example: Using Perf to analyze TMAM level 1 on CPU core 1:

.. code-block:: console

   perf stat --topdown -C 1 taskset -c 1 dd if=/dev/zero of=/dev/null count=10
   10+0 records in
   10+0 records out
   5120 bytes (5.1 kB, 5.0 KiB) copied, 0.00336348 s, 1.5 MB/s

    Performance counter stats for 'CPU(s) 1':

                 retiring   bad speculation   frontend bound   backend bound
   S0-C1    1       10.6%              1.5%             3.9%           84.0%

          0.006737123 seconds time elapsed

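The four level 1 categories always sum to 100% of the pipeline issue slots.
As a rough illustration of where these percentages come from, the sketch below
computes them from raw event counts using the level 1 formulas commonly given
for 4-wide Intel cores. It is a minimal sketch under stated assumptions: the
pipeline width of 4, the event names in the comments, and the function name
``tmam_level1`` are assumptions for illustration, the input numbers are made
up, and the exact events that ``perf stat --topdown`` uses depend on the CPU
and kernel.

.. code-block:: python

   from typing import Dict

   PIPELINE_WIDTH = 4  # issue slots per cycle; assumed, fits 4-wide cores only


   def tmam_level1(
       cycles: int,               # CPU_CLK_UNHALTED.THREAD
       uops_not_delivered: int,   # IDQ_UOPS_NOT_DELIVERED.CORE
       uops_issued: int,          # UOPS_ISSUED.ANY
       uops_retired_slots: int,   # UOPS_RETIRED.RETIRE_SLOTS
       recovery_cycles: int,      # INT_MISC.RECOVERY_CYCLES
   ) -> Dict[str, float]:
       """Return the four TMAM level 1 categories as fractions of all slots."""
       slots = PIPELINE_WIDTH * cycles
       frontend_bound = uops_not_delivered / slots
       bad_speculation = (
           uops_issued - uops_retired_slots + PIPELINE_WIDTH * recovery_cycles
       ) / slots
       retiring = uops_retired_slots / slots
       # Whatever is left over is attributed to the backend.
       backend_bound = 1.0 - (frontend_bound + bad_speculation + retiring)
       return {
           "retiring": retiring,
           "bad speculation": bad_speculation,
           "frontend bound": frontend_bound,
           "backend bound": backend_bound,
       }


   # Made-up counter values, for illustration only.
   breakdown = tmam_level1(
       cycles=1_000,
       uops_not_delivered=400,
       uops_issued=1_300,
       uops_retired_slots=1_200,
       recovery_cycles=25,
   )
   for category, fraction in breakdown.items():
       print(f"{category:>16}: {fraction:.1%}")

A high backend bound share, as in the ``dd`` example above, suggests looking
at memory and execution resource stalls before anything else.
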