For a more minimal baremetal multicore setup, see: Section 27.8.3, “ARM baremetal multicore”.
We can set and get which cores the Linux kernel allows a program to run on with sched_getaffinity and sched_setaffinity:
The following kernel modules and Baremetal executables dump and disassemble various registers which cannot be observed from userland (usually "system registers", "control registers"):
Some of those programs use:
Alternatively, you can also get their values from inside GDB step debug with:
+info registers all+
or the short version:
+i r a+
or to get just specific registers, e.g. just ARMv8’s SCTLR:
+i r SCTLR+
but it is sometimes just more convenient to run an executable to get the registers at the point of interest.
See also:
25007500: time count in some unit. Note how the microops execute at later timestamps.
system.cpu: distinguishes between CPUs when there are more than one. For example, running Section 27.8.3, “ARM baremetal multicore” with two cores produces system.cpu0 and system.cpu1
T0: thread number. TODO: hyperthread? How to play with it?
config.ini has --param 'system.multi_thread = True' --param 'system.cpu[0].numThreads = 2', but in ARM baremetal multicore the first one alone does not produce T1, and with the second one simulation blows up with:
Those instructions are exposed through the in-tree m5 executable.
To make things simpler to understand, you can play around with our own minimized educational m5 subset: userland/c/m5ops.c
The instructions used by ./c/m5ops.out are present in lkmc/m5ops.h in a form of inline assembly that is simple to understand and reuse.
ARM baremetal multicore with a single CPU stays stopped at a WFE sleep instruction
this sample bug on se.py multithreading: https://github.com/cirosantilli/linux-kernel-module-cheat/issues/81
There are no non-locking atomic types or atomic primitives in POSIX.
+python3 python/hello.py+
or:
+./python/hello.py+
LKMC 50ac89b779363774325c81157ec8b9a6bdb50a2f gem5 390a74f59934b85d91489f8a563450d8321b602d:
./run \
  --emulator gem5 \
  --userland "$(./getvar buildroot_target_dir)/usr/bin/python3" \
  --userland-args rootfs_overlay/lkmc/python/hello.py \
;
fails with:
+fatal: Syscall 318 out of range+
which corresponds to the glorious inotify_rm_watch syscall: https://github.com/torvalds/linux/blob/v5.4/arch/arm/tools/syscall.tbl#L335
and aarch64:
./run \
  --arch aarch64 \
  --emulator gem5 \
  --userland "$(./getvar --arch aarch64 buildroot_target_dir)/usr/bin/python3" \
  --userland-args rootfs_overlay/lkmc/python/hello.py \
;
To debug these instructions, you can see the register values in GDB with:
I’m not sure how much stdlib functionality is supposed to work without the pre-main stuff, but I guess we’ll just have to find out!
I was going to ask the following Markdown question, but I noticed halfway through that:
without -static, I see a bunch of dynamic loader instructions, so not much is gained
with -static, the program segfaults, including on the host, with stack:
#0  0x0000000000429625 in _IO_cleanup ()
#1  0x0000000000400c72 in __run_exit_handlers ()
#2  0x0000000000400caa in exit ()
#3  0x0000000000400a01 in _start () at exit.S:4
so I didn’t really have a good question.
The Markdown question that was almost asked:
When working in emulators, I often want to keep my workloads as small as possible to more easily study instruction traces and reproduce bugs.

One of the ways I often want to do that, especially when doing [user mode simulations](https://wiki.debian.org/QemuUserEmulation), is by not running [the code that normally runs before main](https://stackoverflow.com/questions/53570678/what-happens-before-main-in-c) so that I can start directly in the instructions of interest that I control myself, which can be achieved with the `gcc -nostartfiles` option and by starting the program directly at `_start`.

Here is a tiny example that calls just `exit` from the C standard library:

exit.S

```
.global _start
_start:
    mov $0, %rdi
    call exit
```

Compile and run with:

```
gcc -ggdb3 -nostartfiles -static -o exit.out exit.S
qemu-x86_64 -d in_asm exit.out
```

However, for programming convenience, and to potentially keep my examples more OS portable, I would like to avoid making raw system calls, which would of course work, by using C standard library functions instead.

But I'm afraid that some of those C standard library functions will fail in subtle ways because I have skipped required initialization steps that would normally happen before `main`.

Is there an easy way to determine which functions I can use or not, in case there are any that I can't?
This is how threads either:
request the kernel to sleep until they are woken up by other threads
request the kernel to wake up other threads that are waiting on a given futex
This syscall is rarely used on its own, and there isn’t even a glibc wrapper for it: you almost always just want to use the pthreads or C++ multithreading wrappers, which use it for you to implement higher level constructs like mutexes.
Futexes are a bit complicated, because in order to achieve their efficiency, basically nothing is guaranteed: the wait might not wait, and the wakes might not wake. So you are basically forced to use atomic operations on the futex memory address in order to be sure of anything.
Minimal examples:
lkmc/futex.h: our futex wrapper
userland/linux/futex.c: minimal example, the main thread:
spawns a child
the child waits on a futex
the main thread sleeps for one second
the main thread wakes up the child
the child returns
So what you see is:
main start
child start
[wait 1s]
main after sleep
child end
The best article to understand spinlocks is: https://eli.thegreenplace.net/2018/basics-of-futexes/
The example in man futex is also a must.
Parent section: ARM baremetal multicore.
ARMv8 architecture reference manual db "ARMv8.1-LSE, ARMv8.1 Large System Extensions"
Examples:
The WFE and SEV instructions are just hints: a compliant implementation can treat them as NOPs.
Concrete examples of these instructions can be seen at:
userland/arch/aarch64/freestanding/linux/wfe_wfe.S: run WFE twice, because gem5 390a74f59934b85d91489f8a563450d8321b602d does not sleep on the first, see also: gem5 ARM WFE
baremetal/arch/aarch64/no_bootloader/wfe_loop.S, see: gem5 simulate() limit reached
userland/arch/aarch64/inline_asm/wfe_sev.cpp: one Linux thread runs WFE and the other runs SEV to wake it up
ARM baremetal multicore shows baremetal examples where WFE sleeps and another thread wakes it up:
However, likely no implementation does (TODO confirm), since:
SEV is not the only thing that can wake up a WFE: it is only an explicit software way to do it. Notably, global monitor operations on memory accesses of regions marked by LDAXR and STLXR instructions can also wake up a WFE-sleeping core. This is done to allow spinlock unlocks to automatically wake up WFE-sleeping cores at lock release time without the need for an explicit SEV.
WFE and SEV are usable from userland, and are part of an efficient spinlock implementation, though maybe that is not something userland should ever do, and it should just stick to mutexes?
There is a control bit SCTLR_EL1.nTWE that determines if WFE is trapped or not: if that bit is clear, then EL0 execution of WFE is trapped and raises an exception into EL1. Linux v5.2.1 does not seem to trap however, tested with --trace ExecAll and the dump_regs kernel module in a full system simulation. But then, how does the kernel prevent CPUs from going to sleep randomly and instead reschedule other tasks? Does the kernel check if CPUs are in WFE when it wakes up on the timer, and only then reschedule? This would allow userland to implement fast spinlocks if the spinlock returns faster than the timer. The kernel seems to set up NTWE at:
include/asm/sysreg.h
#define SCTLR_EL1_SET (SCTLR_ELx_M | SCTLR_ELx_C | SCTLR_ELx_SA |\
    ...
    SCTLR_EL1_NTWE | SCTLR_ELx_IESB | SCTLR_EL1_SPAN |\
and:
mm/proc.S
/*
 * Prepare SCTLR
 */
mov_q   x0, SCTLR_EL1_SET
Quotes for the above, from ARMv8 architecture reference manual db G1.18.1 "Wait For Event and Send Event":
The best article to understand spinlocks is: https://eli.thegreenplace.net/2018/basics-of-futexes/
For how userland spinlocks and mutexes are implemented see: Userland mutex implementation.
gem5 390a74f59934b85d91489f8a563450d8321b602d does not sleep on the first WFE in either syscall emulation or full system, because the code does:
Fault WfeInst::execute(
    ExecContext *xc, Trace::InstRecord *traceData) const
{
[...]
    if (SevMailbox == 1) {
        SevMailbox = 0;
        PseudoInst::quiesceSkip(tc);
    } else if (tc->getCpuPtr()->getInterruptController(
                tc->threadId())->checkInterrupts(tc)) {
        PseudoInst::quiesceSkip(tc);
    } else {
        fault = trapWFx(tc, cpsr, scr, true);
        if (fault == NoFault) {
            PseudoInst::quiesce(tc);
        } else {
            PseudoInst::quiesceSkip(tc);
        }
    }
where "quiesce" means "sleep" for laymen like Ciro, and quiesceSkip means don’t sleep.
SevMailbox is read from MISCREG_SEV_MAILBOX which is initialized to 1 at:
ISA::clear()
{
[...]
    miscRegs[MISCREG_SEV_MAILBOX] = 1;
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | qemu --arch aarch64 | 10^10 | 68
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | gem5 --arch aarch64 | 10^7 | 100
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | gem5 --arch aarch64 -- --cpu-type MinorCPU --caches | 10^6 | 31
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | gem5 --arch aarch64 -- --cpu-type DerivO3CPU --caches | 10^6 | 52
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | gem5 --arch aarch64 --gem5-build-id MOESI_CMP_directory -- --cpu-type DerivO3CPU --caches --ruby | 1 * 1000000 = 10^6 | 63
605448f07e6380634b1aa7e9732d111759f69fd | Dhrystone -O3 | gem5 --arch aarch64 | 4 * 10^5 | 68
5d233f2664a78789f9907d27e2a40e86cefad595 | STREAM benchmark -O3 | gem5 --arch aarch64 --userland-args 300000 2 | 3 * 10^5 * 2 | 64 | 9.9674773 * 10^7 | 1.6
ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 | gem5 --arch aarch64 --userland-args e | 1 | 2 | 1.26479 * 10^5 | 0.05
ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 | gem5 --arch aarch64 --userland-args e | 1 | 2 | 2.385012 * 10^6 | 1