Or if things get really involved and you want a debug script:

printf 'break qemu_add_opts
run
' > data/vm.gdb
./run --debug-vm-file data/vm.gdb
Our default emulator builds are optimized with gcc -O2 -g. To use -O0 instead, build and run with:
The message also shows on User mode simulation deadlocks, for example in userland/posix/pthread_deadlock.c.

Other related fs.py options are:

--abs-max-tick: set the maximum guest simulation time. The same scale as the ExecAll trace is used. E.g., for the above example with 3 instructions, the same trace would be achieved with a value of 3000.
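For example, assuming the usual LKMC convention that arguments after -- are forwarded to the gem5 config script (a sketch, not an invocation taken from the repository; check ./run --help):

./run --arch aarch64 --emulator gem5 -- --abs-max-tick 3000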
Related atomic examples:

userland/cpp/atomic/aarch64_add.cpp: non-synchronized aarch64 inline assembly
userland/cpp/atomic/aarch64_ldaxr_stlxr.cpp: see: ARM LDXR and STXR instructions
userland/cpp/atomic/aarch64_ldadd.cpp: synchronized aarch64 inline assembly with the ARM Large System Extensions (LSE) LDADD instruction. TODO: let's understand that fully one day.

21.2.1.2. C++ std::memory_order
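As a minimal illustration (our own sketch, not a repository file): relaxed operations are still atomic, so a pure counter stays exact even though they impose no ordering between threads:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<unsigned long> counter(0);

static void work() {
    for (int i = 0; i < 1000000; ++i)
        // relaxed: atomicity without inter-thread ordering,
        // which is enough for a pure counter
        counter.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::thread t(work);
    work();
    t.join();
    // Exact total despite relaxed ordering: atomicity is per-operation.
    std::printf("%lu\n", counter.load()); // 2000000
}

With the default std::memory_order_seq_cst the result here is the same; the difference only shows up when other memory operations must be ordered relative to the counter.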
21.2.1.3. C++ parallel algorithms
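As a minimal illustration (our own sketch, not a repository file; assumes C++17 <execution>, and with GCC typically needs -std=c++17 plus -ltbb):

#include <algorithm>
#include <cstdio>
#include <execution>
#include <vector>

int main() {
    std::vector<int> v{3, 1, 2};
    // std::execution::par permits the implementation to parallelize.
    std::sort(std::execution::par, v.begin(), v.end());
    for (int x : v)
        std::printf("%d\n", x); // 1 2 3
}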
There are no non-locking atomic types or atomic primitives in POSIX, see: x86 thread synchronization primitives.
21.6.2.1.1. NPM data-files

Illustrates how to add extra non-code data files to an NPM package, and then use those files at runtime.
userland/arch/aarch64/inline_asm/inc_32.c: how to use 32-bit w registers in aarch64. We have to add w to the % as in %w[io] instead of %[io], as sketched below.
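A minimal sketch of the %w modifier, in the spirit of that example but not a copy of it:

#include <cstdint>
#include <cstdio>

int main() {
    uint32_t io = 1;
    // %w[io] selects the 32-bit w view of the register;
    // plain %[io] would select the 64-bit x view.
    asm volatile(
        "add %w[io], %w[io], #1\n"
        : [io] "+r" (io)
    );
    std::printf("%u\n", io); // 2
}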
The futex syscall is rarely used on its own, and there isn't even a glibc wrapper for it: you almost always just want to use the pthreads or C++ multithreading wrappers, which use it for you to implement higher level constructs like mutexes.

Futexes are a bit complicated, because in order to achieve their efficiency, basically nothing is guaranteed: the wait might not wait, and the wakes might not wake. So you are basically forced to use atomic operations on the futex memory address in order to be sure of anything (we encourage you to try without :-)).

Minimal examples:

lkmc/futex.h: our futex wrapper
userland/linux/futex.c: minimal example. It:
first spawns a child
then sleeps for 1 second and wakes up the futex if anyone is sleeping on it
the child sleeps on the futex if it reaches that futex before the end of the parent's sleep (likely); if it did reach that FUTEX_WAIT, it gets awoken by the parent

So what you see is:

main start
child start
[wait 1s]
parent after sleep
child after parent sleep
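To make the wait/wake handshake above concrete, here is a minimal self-contained sketch in C++ (our own, analogous to but not a copy of userland/linux/futex.c); since glibc provides no wrapper, it calls syscall(SYS_futex, ...) directly:

#include <atomic>
#include <cstdio>
#include <new>
#include <linux/futex.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

// Raw wrapper: glibc does not provide one.
static long futex(std::atomic<int> *uaddr, int futex_op, int val) {
    return syscall(SYS_futex, uaddr, futex_op, val, nullptr, nullptr, 0);
}

int main() {
    // Shared anonymous mapping so parent and child see the same word.
    void *mem = mmap(nullptr, sizeof(std::atomic<int>),
                     PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    auto *f = new (mem) std::atomic<int>(0);
    if (fork() == 0) {
        std::puts("child start");
        // Sleeps only if *f still equals 0: this mandatory re-check is
        // why atomics are needed, since the wait might not wait.
        futex(f, FUTEX_WAIT, 0);
        std::puts("child after parent sleep");
        return 0;
    }
    std::puts("main start");
    sleep(1);
    f->store(1);             // publish the new value first...
    futex(f, FUTEX_WAKE, 1); // ...then wake at most one waiter
    std::puts("parent after sleep");
    wait(nullptr);
}

Note how the parent stores 1 before FUTEX_WAKE: FUTEX_WAIT atomically re-checks that the word still holds the expected value 0 before sleeping, which is what saves the child from sleeping forever if it only reaches the wait after the wake.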
Parent section: Userland multithreading.
ARMv8 architecture reference manual db A1.7.3 "The ARMv8.1 architecture extension"
ARMv8 architecture reference manual db "ARMv8.1-LSE, ARMv8.1 Large System Extensions"

Parent section: atomic.cpp
LDXR and STXR vs LDAXR and STLXR: https://stackoverflow.com/questions/21535058/arm64-ldxr-stxr-vs-ldaxr-stlxr TODO: understand better and add an example.

LDXR and STXR form a so-called "Load-link/store-conditional" (LLSC) pattern: https://en.wikipedia.org/wiki/Load-link/store-conditional which appears in many RISC ISAs.

This pattern basically works as follows:
LDXR marks an address for exclusive access by the current CPU
STXR:
marks the address as not being exclusive to other CPUs that may have done an LDXR before
stores fine if the address is still marked as exclusive, and writes 0 to a third register for success
fails to store if the address is not, and writes 1 to the third register for failure
In case of failure, we just have to loop back to just before the LDXR and try again.

This is therefore basically a spinlock, and should only be used to cover very short critical sections such as atomic increments; see the sketch below.
C++ std::atomic uses this for increments before the v8.1 ARM Large System Extensions (LSE): https://stackoverflow.com/questions/56810/how-do-i-start-threads-in-plain-c/52453291#52453291
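A minimal sketch of such an increment loop, in the spirit of userland/cpp/atomic/aarch64_ldaxr_stlxr.cpp but not a copy of it (aarch64 inline assembly, GCC syntax):

#include <cstdint>
#include <cstdio>

// Atomically increment *addr with an LDAXR/STLXR retry loop.
static void llsc_increment(uint64_t *addr) {
    uint64_t tmp;
    uint32_t fail;
    asm volatile(
        "1:\n"
        "ldaxr %[tmp], [%[addr]]\n"           // load, mark exclusive
        "add %[tmp], %[tmp], #1\n"
        "stlxr %w[fail], %[tmp], [%[addr]]\n" // store iff still exclusive
        "cbnz %w[fail], 1b\n"                 // 1 = lost exclusivity: retry
        : [tmp] "=&r" (tmp), [fail] "=&r" (fail)
        : [addr] "r" (addr)
        : "memory"
    );
}

int main() {
    uint64_t x = 0;
    llsc_increment(&x);
    std::printf("%lu\n", static_cast<unsigned long>(x)); // 1
}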
Set of atomic and synchronization primitives added in the ARMv8.1 architecture extension.

Documented at ARMv8 architecture reference manual db "ARMv8.1-LSE, ARMv8.1 Large System Extensions"
ARMv8 architecture reference manual db A1.7.3 "The ARMv8.1 architecture extension"
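A minimal sketch of an LSE increment, in the spirit of userland/cpp/atomic/aarch64_ldadd.cpp but not a copy of it: a single LDADD replaces the whole LDXR/STXR retry loop shown above (assumes a v8.1+ CPU and e.g. -march=armv8.1-a):

#include <cstdint>
#include <cstdio>

// Atomic *addr += 1 in one instruction; the old value comes back in [old].
static uint64_t lse_fetch_add_one(uint64_t *addr) {
    uint64_t one = 1, old;
    asm volatile(
        "ldadd %[one], %[old], [%[addr]]\n"
        : [old] "=&r" (old)
        : [one] "r" (one), [addr] "r" (addr)
        : "memory"
    );
    return old;
}

int main() {
    uint64_t x = 0;
    lse_fetch_add_one(&x);
    std::printf("%lu\n", static_cast<unsigned long>(x)); // 1
}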
Good getting started tutorials:
The official manuals were stored in http://infocenter.arm.com but as of 2017 they started to slowly move to https://developer.arm.com.
Bibliography: https://www.quora.com/Where-can-I-find-the-official-documentation-of-ARM-instruction-set-architectures-ISAs
ARM also releases documentation specific to each given processor.
This adds extra details to the more portable ARMv8 architecture reference manual ISA documentation.
SEV is not the only thing that can wake up a WFE, it is only an explicit software way to do it. Notably, global monitor operations on memory accesses of regions marked by LDAXR and STLXR instructions can also wake up a WFE sleeping core. This is done to allow spinlock releases to automatically wake up WFE sleeping cores when the lock is freed, without the need for an explicit SEV.
Quotes for the above, from ARMv8 architecture reference manual db G1.18.1 "Wait For Event and Send Event":
For how userland spinlocks and mutexes are implemented see Userland mutex implementation.
WFE and SEV are usable from userland, and are part of an efficient spinlock implementation, although userland should arguably stay away from that and rather use the futex system call, which allows for non-busy sleep instead.
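A minimal sketch of the WFE side of such a spinlock (our own, not a repository file): the waiter naps in WFE between checks instead of spinning hot, and the releaser publishes its store and then issues SEV:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> lock_word(1); // 1 = taken, 0 = free

// Spin-wait that sleeps in WFE between checks. SEV (or a global
// monitor event) sets the event register, so a wake between the
// load and the WFE is not lost: WFE then returns immediately.
static void wait_until_free() {
    while (lock_word.load(std::memory_order_acquire) != 0)
        asm volatile("wfe" ::: "memory");
}

// Release: publish the store, then wake any WFE sleepers.
static void release() {
    lock_word.store(0, std::memory_order_release);
    asm volatile("dsb ish\n"
                 "sev\n" ::: "memory");
}

int main() {
    std::thread waiter(wait_until_free);
    release();
    waiter.join();
    std::puts("done");
}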
There is a control bit SCTLR_EL1.nTWE that determines if WFE is trapped or not: if that bit is set, WFE is not trapped; if it is clear, executing WFE at EL0 raises an exception into EL1.

Linux v5.2.1 does not trap WFE however, tested with gem5 tracing with --trace ExecAll,Faults and the dump_regs kernel module in a full system simulation.

The kernel seems to set up nTWE at:
include/asm/sysreg.h

#define SCTLR_EL1_SET (SCTLR_ELx_M | SCTLR_ELx_C | SCTLR_ELx_SA |\
    ...
    SCTLR_EL1_NTWE | SCTLR_ELx_IESB | SCTLR_EL1_SPAN |\
and:
mm/proc.S

/*
 * Prepare SCTLR
 */
mov_q x0, SCTLR_EL1_SET
To reduce the number of instructions in our trace, we first boot, and then restore a post-boot checkpoint with gem5 checkpoint restore, running a different script with a restore command that runs userland/arch/aarch64/freestanding/linux/wfe_wfe.S:
./run --arch aarch64 --emulator gem5 --gem5-worktree master --gem5-restore 1 --gem5-readfile 'arch/aarch64/freestanding/linux/wfe_wfe.out' --trace ExecAll,Faults,FmtFlag,Thread
On the traces, we search for wfe, and there are just two hits, so they must be our instructions!
The traces then look like this at LKMC 777b7cbbd1d553baf2be9bc2075102be740054dd:
112285501668497000: Thread: system.cpu: suspend contextId 0
112285501668497000: ExecEnable: system.cpu: A0 T0 : 0x400078 : wfe : IntAlu : D=0x0000000000000000 flags=(IsSerializeAfter|IsNonSpeculative|IsQuiesce|IsUnverifiable)
112285501668497501: Thread: system.cpu: activate contextId 0
112285501668498000: Thread: system.cpu: suspend contextId 0
112285501668498000: ExecEnable: system.cpu: A0 T0 : 0x40007c : wfe : IntAlu : D=0x0000000000000000 flags=(IsSerializeAfter|IsNonSpeculative|IsQuiesce|IsUnverifiable)
112285501909320284: Thread: system.cpu: activate contextId 0
112285501909320500: Faults: IRQ: Invoking Fault (AArch64 target EL):IRQ cpsr:0x4003c5 PC:0x400080 elr:0x400080 newVec: 0xffffff8010082480
112285501909320500: ExecEnable: system.cpu: A0 T0 : @vectors+1152 : nop : IntAlu : flags=(IsNop)
112285501909321000: ExecEnable: system.cpu: A0 T0 : @vectors+1156 : nop : IntAlu : flags=(IsNop)

[more exception handler, no ERET here]

112285501923080500: ExecEnable: system.cpu: A0 T0 : @finish_ret_to_user+188 : ldr x30, [sp, #240] : MemRead : D=0x0000000000000000 A=0xffffff8010cb3fb0 flags=(IsInteger|IsMemRef|IsLoad)
112285501923081000: ExecEnable: system.cpu: A0 T0 : @finish_ret_to_user+192 : add sp, sp, #320 : IntAlu : D=0xffffff8010cb4000 flags=(IsInteger)
112285501923081500: ExecEnable: system.cpu: A0 T0 : 0xffffff8010084144 : eret : IntAlu : D=0x0000000000000001 flags=(IsControl|IsSerializeAfter|IsNonSpeculative|IsSquashAfter)
112285501923082000: ExecEnable: system.cpu: A0 T0 : 0x400080 : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
112285501923082500: ExecEnable: system.cpu: A0 T0 : 0x400084 : movz x8, #93, #0 : IntAlu : D=0x000000000000005d flags=(IsInteger)
112285501923083000: ExecEnable: system.cpu: A0 T0 : 0x400088 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
so we conclude that:
the second WFE made the CPU stop running instructions at time 112285501668498000 and PC 0x40007c
the next thing that happened, a long time later (at 112285501909320500, whereas a following instruction would have executed at 112285501668498000 + 1000), was an interrupt, presumably the ARM timer
after a few interrupt handler instructions, the first ERET instruction exits the handler and comes back directly to the instruction after the WFE, at PC 0x400080 == 0x40007c + 4

Therefore, a WFE in userland is treated much like a busy loop by the Linux kernel: the kernel does not seem to try and explicitly make room for other processes as would happen on a futex.

gem5 390a74f59934b85d91489f8a563450d8321b602d does not sleep on the first WFE on either syscall emulation or full system, because the code does:
First, also consider the userland bibliography: Section 24.9, "ARM assembly bibliography".
The most useful ARM baremetal example sets we’ve seen so far are: