Or if things get really involved and you want a debug script:
printf 'break qemu_add_opts
run
' > data/vm.gdb
./run --debug-vm-file data/vm.gdb
Our default emulator builds are optimized with gcc -O2 -g. To use -O0 instead, build and run with:
Other related fs.py options are:
--abs-max-tick: set the maximum guest simulation time. The same scale as the ExecAll trace is used. E.g., for the above example with 3 instructions, the same trace would be achieved with a value of 3000.
The message also shows on User mode simulation deadlocks, for example in userland/posix/pthread_deadlock.c:
userland/cpp/atomic/aarch64_add.cpp: non synchronized aarch64 inline assembly
userland/cpp/atomic/aarch64_ldaxr_stlxr.cpp: see: ARM LDXR and STXR instructions
userland/cpp/atomic/aarch64_ldadd.cpp: synchronized aarch64 inline assembly with the ARM Large System Extensions (LSE) LDADD instruction
TODO let’s understand that fully one day.
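To make the LDADD item above more concrete, here is a minimal hedged sketch of an LSE-based fetch-and-add as C inline assembly. The function name and operand constraints are ours, it is not the in-tree aarch64_ldadd.cpp, and it must be compiled with -march=armv8.1-a or later:

#include <stdint.h>

/* Atomically do: old = *addr; *addr += val; and return old.
 * This is the relaxed LDADD; LDADDAL would add acquire-release ordering. */
uint64_t lse_fetch_add(uint64_t *addr, uint64_t val) {
    uint64_t old;
    __asm__ volatile (
        "ldadd %[val], %[old], [%[addr]]"
        : [old] "=&r" (old)
        : [val] "r" (val), [addr] "r" (addr)
        : "memory"
    );
    return old;
}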
emulator topics:
Illustrates how to add extra non-code data files to an NPM package, and then use those files at runtime.
userland/arch/aarch64/inline_asm/inc_32.c: how to use 32-bit w registers in aarch64. We have to add w to the % as in %w[io] instead of %[io]
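As a quick hedged illustration of that modifier (a toy function of our own, not the in-tree inc_32.c):

#include <stdint.h>

/* Increment a 32-bit value through a w register: %w[io] prints w0/w1/...,
 * while %[io] would print the 64-bit x register name. */
uint32_t inc32(uint32_t io) {
    __asm__ ("add %w[io], %w[io], 1" : [io] "+r" (io));
    return io;
}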
This syscall is rarely used on its own, and there isn’t even a glibc wrapper for it: you almost always just want to use the pthreads or C++ multithreading wrappers which use it for you to implement higher level constructs like mutexes.
Futexes are a bit complicated, because in order to achieve their efficiency, basically nothing is guaranteed: the wait might not wait, and the wakes might not wake.
So you are basically forced to use atomic operations on the futex memory address in order to be sure of anything (we encourage you to try without :-)).
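To make the raw interface concrete before looking at the in-tree files, here is a minimal hedged sketch of a FUTEX_WAIT / FUTEX_WAKE pair guarded by an atomic word. All names are ours and this is not userland/linux/futex.c; build with -pthread on Linux:

#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int word;

/* Thin wrapper over the raw system call: there is no glibc futex() function. */
static long futex(atomic_int *uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void *waiter(void *arg) {
    (void)arg;
    /* Loop: FUTEX_WAIT returns immediately if *uaddr != val, and may also
     * wake spuriously, so only the atomic load tells us the real state. */
    while (atomic_load(&word) == 0)
        futex(&word, FUTEX_WAIT, 0);
    puts("waiter woken");
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    sleep(1);
    atomic_store(&word, 1);      /* publish the state change */
    futex(&word, FUTEX_WAKE, 1); /* wake at most one waiter */
    pthread_join(t, NULL);
    return 0;
}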
Minimal examples:
lkmc/futex.h: our futex wrapper
userland/linux/futex.c: minimal example. It:
first spawns a child
then sleeps for 1 second and wakes up the futex if anyone is sleeping on it
the child sleeps on the futex if it reaches that futex before the end of the parent’s sleep (likely). If it did reach that FUTEX_WAIT there, it gets awoken by the parent.
So what you see is:
main start
child start
[wait 1s]
parent after sleep
child after parent sleep
Parent section: Userland multithreading.
ARMv8 architecture reference manual db A1.7.3 "The ARMv8.1 architecture extension"
Parent section: atomic.cpp
ARMv8 architecture reference manual db "ARMv8.1-LSE, ARMv8.1 Large System Extensions"
LDXR and STXR vs LDAXR and STLXR: https://stackoverflow.com/questions/21535058/arm64-ldxr-stxr-vs-ldaxr-stlxr TODO understand better and add an example.
LDXR and STXR implement a so-called "Load-link/store-conditional" (LLSC) pattern: https://en.wikipedia.org/wiki/Load-link/store-conditional which appears in many RISC ISAs.
The pattern works basically as follows:
LDXR marks an address for exclusive access by the current CPU
STXR:
marks the address as no longer exclusive to other CPUs that may have done an LDXR before
stores fine if the address is still marked as exclusive to this CPU, and writes 0 to a third register to indicate success
fails to store if the address is not, and writes 1 to the third register to indicate failure
In case of failure, we just have to loop back to just before the LDXR and try again.
This is therefore basically a spinlock and should only be used to cover very short critical sections such as atomic increments, as sketched below.
C++ std::atomic uses this for increments before the ARMv8.1 ARM Large System Extensions (LSE): https://stackoverflow.com/questions/56810/how-do-i-start-threads-in-plain-c/52453291#52453291
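Here is a minimal hedged sketch of such a retry loop as C inline assembly. The function name and operand constraints are ours; the in-tree example, which uses the acquire/release variants, is userland/cpp/atomic/aarch64_ldaxr_stlxr.cpp:

#include <stdint.h>

/* Atomically increment *addr with an LDXR/STXR retry loop. */
void llsc_increment(uint64_t *addr) {
    uint64_t value;
    uint32_t failed;
    __asm__ volatile (
        "1:\n"
        /* Load *addr and mark the address for exclusive access. */
        "ldxr %[value], [%[addr]]\n"
        "add %[value], %[value], 1\n"
        /* Store back only if the address is still exclusive:
         * failed == 0 on success, 1 on failure. */
        "stxr %w[failed], %[value], [%[addr]]\n"
        /* On failure, loop back to the LDXR and try again. */
        "cbnz %w[failed], 1b\n"
        : [value] "=&r" (value), [failed] "=&r" (failed)
        : [addr] "r" (addr)
        : "memory"
    );
}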
Set of atomic and synchronization primitives added in the ARMv8.1 architecture extension.
Documented at ARMv8 architecture reference manual db "ARMv8.1-LSE, ARMv8.1 Large System Extensions"
ARMv8 architecture reference manual db A1.7.3 "The ARMv8.1 architecture extension"
Good getting started tutorials:
The official manuals were stored in http://infocenter.arm.com but as of 2017 they started to slowly move to https://developer.arm.com.
Bibliography: https://www.quora.com/Where-can-I-find-the-official-documentation-of-ARM-instruction-set-architectures-ISAs
ARM also releases documentation specific to each given processor.
This adds extra details to the more portable ARMv8 architecture reference manual ISA documentation.
SEV is not the only thing that can wake up a WFE, it is only an explicit software way to do it. Notably, global monitor operations on memory accesses of regions marked by LDAXR and STLXR instructions can also wake up a WFE sleeping core. This is done to allow spinlock releases to automatically wake up WFE sleeping cores at unlock time without the need for an explicit SEV.
Quotes for the above from ARMv8 architecture reference manual db G1.18.1 "Wait For Event and Send Event":
WFE and SEV are usable from userland, and are part of an efficient spinlock implementation (which userland should arguably stay away from in favor of the futex system call, which allows for non-busy sleep).
For how userland spinlocks and mutexes are implemented see Userland mutex implementation.
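As a reference point, here is a hedged sketch of what such a WFE-based lock can look like as C inline assembly. This is our own illustration of the classic sevl/wfe/ldaxr pattern, not code taken from Linux or from this repository:

#include <stdint.h>

/* Spin until *lock becomes 0, then set it to 1, sleeping with WFE while it is
 * held. The unlock store clears our exclusive monitor, which wakes the WFE. */
void spin_lock(uint32_t *lock) {
    uint32_t tmp, failed;
    __asm__ volatile (
        "sevl\n"                                /* make the first wfe fall through */
        "1: wfe\n"                              /* sleep until some event arrives */
        "2: ldaxr %w[tmp], [%[lock]]\n"         /* load-acquire, mark exclusive */
        "cbnz %w[tmp], 1b\n"                    /* still held: go back to sleep */
        "stxr %w[failed], %w[one], [%[lock]]\n" /* try to take the lock */
        "cbnz %w[failed], 2b\n"                 /* lost the monitor: retry */
        : [tmp] "=&r" (tmp), [failed] "=&r" (failed)
        : [lock] "r" (lock), [one] "r" (1)
        : "memory"
    );
}

void spin_unlock(uint32_t *lock) {
    /* Store-release of 0; also clears waiters' exclusive monitors, waking their WFE. */
    __asm__ volatile ("stlr wzr, [%[lock]]" : : [lock] "r" (lock) : "memory");
}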
There is a control bit SCTLR_EL1.nTWE that determines if WFE is trapped or not: if that bit is cleared, then EL0 execution of WFE is trapped and raises an exception into EL1.
Linux v5.2.1 does not trap WFE however, tested with gem5 tracing with --trace ExecAll,Faults and the dump_regs kernel module in a full system simulation.
The kernel seems to set nTWE (which disables the trap) at:
include/asm/sysreg.h
#define SCTLR_EL1_SET (SCTLR_ELx_M | SCTLR_ELx_C | SCTLR_ELx_SA |\
                       ...
                       SCTLR_EL1_NTWE | SCTLR_ELx_IESB | SCTLR_EL1_SPAN |\
and:
mm/proc.S
/*
 * Prepare SCTLR
 */
    mov_q x0, SCTLR_EL1_SET
To reduce the number of instructions in our trace, we first boot, and then restore a checkpoint after boot with gem5 checkpoint restore and run a different script with a restore command that runs userland/arch/aarch64/freestanding/linux/wfe_wfe.S:
./run --arch aarch64 --emulator gem5 --gem5-worktree master --gem5-restore 1 --gem5-readfile 'arch/aarch64/freestanding/linux/wfe_wfe.out' --trace ExecAll,Faults,FmtFlag,Thread
On the traces, we search for wfe, and there are just two hits, so they must be our instructions!
The traces then look like this at LKMC 777b7cbbd1d553baf2be9bc2075102be740054dd:
112285501668497000: Thread: system.cpu: suspend contextId 0
112285501668497000: ExecEnable: system.cpu: A0 T0 : 0x400078 : wfe : IntAlu : D=0x0000000000000000 flags=(IsSerializeAfter|IsNonSpeculative|IsQuiesce|IsUnverifiable)
112285501668497501: Thread: system.cpu: activate contextId 0
112285501668498000: Thread: system.cpu: suspend contextId 0
112285501668498000: ExecEnable: system.cpu: A0 T0 : 0x40007c : wfe : IntAlu : D=0x0000000000000000 flags=(IsSerializeAfter|IsNonSpeculative|IsQuiesce|IsUnverifiable)
112285501909320284: Thread: system.cpu: activate contextId 0
112285501909320500: Faults: IRQ: Invoking Fault (AArch64 target EL):IRQ cpsr:0x4003c5 PC:0x400080 elr:0x400080 newVec: 0xffffff8010082480
112285501909320500: ExecEnable: system.cpu: A0 T0 : @vectors+1152 : nop : IntAlu : flags=(IsNop)
112285501909321000: ExecEnable: system.cpu: A0 T0 : @vectors+1156 : nop : IntAlu : flags=(IsNop)

[more exception handler, no ERET here]

112285501923080500: ExecEnable: system.cpu: A0 T0 : @finish_ret_to_user+188 : ldr x30, [sp, #240] : MemRead : D=0x0000000000000000 A=0xffffff8010cb3fb0 flags=(IsInteger|IsMemRef|IsLoad)
112285501923081000: ExecEnable: system.cpu: A0 T0 : @finish_ret_to_user+192 : add sp, sp, #320 : IntAlu : D=0xffffff8010cb4000 flags=(IsInteger)
112285501923081500: ExecEnable: system.cpu: A0 T0 : 0xffffff8010084144 : eret : IntAlu : D=0x0000000000000001 flags=(IsControl|IsSerializeAfter|IsNonSpeculative|IsSquashAfter)
112285501923082000: ExecEnable: system.cpu: A0 T0 : 0x400080 : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
112285501923082500: ExecEnable: system.cpu: A0 T0 : 0x400084 : movz x8, #93, #0 : IntAlu : D=0x000000000000005d flags=(IsInteger)
112285501923083000: ExecEnable: system.cpu: A0 T0 : 0x400088 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
so we conclude that:
the second WFE made the CPU stop running instructions at time 112285501668498000 and PC 0x40007c
the next thing that happened, a long time later (at 112285501909320500, while the following instruction would otherwise have run at 112285501668498000 + 1000), was an interrupt, presumably from the ARM timer
after a few interrupt handler instructions, the first ERET instruction exits the handler and comes back directly to the instruction after the WFE, at PC 0x400080 == 0x40007c + 4
Therefore, a WFE in userland is treated much like a busy loop by the Linux kernel: the kernel does not seem to try and explicitly make room for other processes as would happen on a futex.
gem5 390a74f59934b85d91489f8a563450d8321b602d does not sleep on the first WFE in either syscall emulation or full system, because the code does:
First, also consider the userland bibliography: Section 24.9, “ARM assembly bibliography”.
The most useful ARM baremetal example sets we’ve seen so far are: