./run \
  --userland "$(./getvar buildroot_target_dir)/bin/echo" \
-  --userland-args='asdf' \
+  --cli-args='asdf' \
;
diff --git a/index.html b/index.html index 70fe219..777125c 100644 --- a/index.html +++ b/index.html @@ -1808,68 +1808,70 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
TODO: the carriage returns are a bit different than in QEMU, see: Section 27.4, “gem5 baremetal carriage return”.
+TODO: the carriage returns are a bit different than in QEMU, see: Section 27.6, “gem5 baremetal carriage return”.
Note that ./build-baremetal requires the --emulator gem5 option, and generates separate executable images for both, as can be seen from:
For a more minimal baremetal multicore setup, see: Section 27.8.3, “ARM baremetal multicore”.
+For a more minimal baremetal multicore setup, see: Section 27.10.3, “ARM baremetal multicore”.
We can set and get which cores the Linux kernel allows a program to run on with sched_getaffinity and sched_setaffinity:
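As a rough illustration of what those two calls do, here is a minimal Python sketch using the standard library wrappers `os.sched_getaffinity` and `os.sched_setaffinity` (Linux only). The helper name `pin_to_first_allowed_cpu` is made up for this example and is not part of this repository:

```python
import os


def pin_to_first_allowed_cpu(pid=0):
    """Restrict pid (0 means the calling process) to its lowest-numbered allowed CPU.

    Returns the allowed CPU sets before and after the change.
    """
    before = os.sched_getaffinity(pid)
    os.sched_setaffinity(pid, {min(before)})
    after = os.sched_getaffinity(pid)
    return before, after


if hasattr(os, "sched_getaffinity"):  # these wrappers only exist on Linux
    before, after = pin_to_first_allowed_cpu()
    print("allowed before:", sorted(before))
    print("allowed after: ", sorted(after))
    # Restore the original mask so the rest of the process is unaffected.
    os.sched_setaffinity(0, before)
```

The kernel only schedules the process on CPUs present in the mask, so after the call the process stays on a single core until the mask is widened again.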
./build user-mode-qemu
./run \
  --userland userland/c/command_line_arguments.c \
-  --userland-args='asdf "qw er"' \
+  --cli-args='asdf "qw er"' \
;
./run \
  --userland "$(./getvar buildroot_target_dir)/bin/echo" \
-  --userland-args='asdf' \
+  --cli-args='asdf' \
;
./run \
  --arch aarch64 \
  --userland "$(./getvar --arch aarch64 buildroot_target_dir)/bin/sh" \
-  --userland-args='-c "uname -a && pwd"' \
+  --cli-args='-c "uname -a && pwd"' \
;
@@ -7727,7 +7729,7 @@ qemu: uncaught target signal 6 (Aborted) - core dumped
  --arch aarch64 \
  --static \
  --userland userland/c/command_line_arguments.c \
-  --userland-args 'asdf "qw er"' \
+  --cli-args 'asdf "qw er"' \
;
@@ -7931,7 +7933,7 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex
  --arch aarch64 \
  --emulator gem5 \
  --userland userland/c/command_line_arguments.c \
-  --userland-args 'asdf "qw er"' \
+  --cli-args 'asdf "qw er"' \
;
@@ -7948,7 +7950,7 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex
  --emulator gem5 \
  --gdb-wait \
  --userland userland/c/command_line_arguments.c \
-  --userland-args 'asdf "qw er"' \
+  --cli-args 'asdf "qw er"' \
;
./run-gdb \
  --arch aarch64 \
@@ -8053,7 +8055,7 @@ hello
./run --userland userland/posix/count_to.c --userland-args 3
+./run --userland userland/posix/count_to.c --cli-args 3
25007500: time count in some unit. Note how the microops execute at further timestamps.
system.cpu: distinguishes between CPUs when there are more than one. For example, running Section 27.8.3, “ARM baremetal multicore” with two cores produces system.cpu0 and system.cpu1
system.cpu: distinguishes between CPUs when there are more than one. For example, running Section 27.10.3, “ARM baremetal multicore” with two cores produces system.cpu0 and system.cpu1
T0: thread number. TODO: hyperthread? How to play with it?
runs are deterministic by default, unlike QEMU, which needs a special QEMU record and replay mode that requires first playing the content once and then replaying it
gem5 ARM at least appears to implement more low level CPU functionality than QEMU, e.g. QEMU only added EL2 in 2018: https://stackoverflow.com/questions/42824706/qemu-system-aarch64-entering-el1-when-emulating-a53-power-up See also: Section 27.8.1, “ARM exception levels”
+gem5 ARM at least appears to implement more low level CPU functionality than QEMU, e.g. QEMU only added EL2 in 2018: https://stackoverflow.com/questions/42824706/qemu-system-aarch64-entering-el1-when-emulating-a53-power-up See also: Section 27.10.1, “ARM exception levels”
gem5 offers more advanced logging, even for non-microarchitectural things which QEMU models in some way, e.g. QEMU trace memory accesses, because QEMU’s binary translation optimizations reduce visibility
@@ -18309,7 +18311,7 @@ getconf _NPROCESSORS_CONF
./run --userland userland/posix/pthread_count.c --userland-args 4
+./run --userland userland/posix/pthread_count.c --cli-args 4
ps Haux | grep qemu | wc
./run --cpus 1 --emulator gem5 --userland userland/posix/pthread_self.c --userland-args 1
+./run --cpus 1 --emulator gem5 --userland userland/posix/pthread_self.c --cli-args 1
./run --cpus 2 --emulator gem5 --userland userland/posix/pthread_self.c --userland-args 1
+./run --cpus 2 --emulator gem5 --userland userland/posix/pthread_self.c --cli-args 1
./run --cpus 2 --emulator gem5 --userland userland/posix/pthread_self.c --userland-args '1 2'
+./run --cpus 2 --emulator gem5 --userland userland/posix/pthread_self.c --cli-args '1 2'
In order to debug checkpoint restore bugs, this minimal setup using userland/freestanding/gem5_checkpoint_restore.S can be handy:
+In order to debug checkpoint restore bugs, this minimal setup using userland/freestanding/gem5_checkpoint.S can be handy:
./build-userland --arch aarch64 --static
-./run --arch aarch64 --emulator gem5 --static --userland userland/freestanding/gem5_checkpoint_restore.S --trace-insts-stdout
-./run --arch aarch64 --emulator gem5 --static --userland userland/freestanding/gem5_checkpoint_restore.S --trace-insts-stdout --gem5-restore 1
-./run --arch aarch64 --emulator gem5 --static --userland userland/freestanding/gem5_checkpoint_restore.S --trace-insts-stdout --gem5-restore 1 -- --cpu-type=DerivO3CPU --restore-with-cpu=DerivO3CPU --caches
+./run --arch aarch64 --emulator gem5 --static --userland userland/freestanding/gem5_checkpoint.S --trace-insts-stdout
+./run --arch aarch64 --emulator gem5 --static --userland userland/freestanding/gem5_checkpoint.S --trace-insts-stdout --gem5-restore 1
+./run --arch aarch64 --emulator gem5 --static --userland userland/freestanding/gem5_checkpoint.S --trace-insts-stdout --gem5-restore 1 -- --cpu-type=DerivO3CPU --restore-with-cpu=DerivO3CPU --caches
And now you will notice that everything happens much slower in the guest terminal!
One even more direct and minimal way to observe this is with userland/freestanding/gem5_checkpoint_restore.S which was mentioned at gem5 checkpoint userland minimal example plus some logging:
+One even more direct and minimal way to observe this is with userland/freestanding/gem5_checkpoint.S which was mentioned at gem5 checkpoint userland minimal example plus some logging:
But let’s give it a try anyways with userland/freestanding/gem5_checkpoint_restore.S which was mentioned at gem5 checkpoint userland minimal example
+But let’s give it a try anyways with userland/freestanding/gem5_checkpoint.S which was mentioned at gem5 checkpoint userland minimal example
./run --arch aarch64 --emulator gem5 --userland "$(./getvar --arch aarch64 out_rootfs_overlay_bin_dir)/m5" --userland-args dumpstats
+./run --arch aarch64 --emulator gem5 --userland "$(./getvar --arch aarch64 out_rootfs_overlay_bin_dir)/m5" --cli-args dumpstats
./run --arch aarch64 --emulator gem5 --userland userland/c/m5ops.c --userland-args 'd 1000'
+./run --arch aarch64 --emulator gem5 --userland userland/c/m5ops.c --cli-args 'd 1000'
./run \
  --emulator gem5 \
  --userland userland/posix/pthread_deadlock.c \
-  --userland-args 1 \
+  --cli-args 1 \
;
./run --userland "$(./getvar buildroot_target_dir)/usr/bin/python3" --userland-args rootfs_overlay/lkmc/python/hello.py
+./run --userland "$(./getvar buildroot_target_dir)/usr/bin/python3" --cli-args rootfs_overlay/lkmc/python/hello.py
./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone" --userland-args 100000000
+./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone" --cli-args 100000000
Build for Baremetal execution and run it in baremetal QEMU. TODO: fix the build, just need to factor out all run arguments from build-baremetal into common.py and it should just work, no missing syscalls.
+Build Dhrystone for Baremetal and run it on QEMU:
# Build our Newlib stubs.
./build-baremetal --arch aarch64
./build-dhrystone --arch aarch64 --mode baremetal
-./run --arch aarch64 --baremetal "$(./getvar baremetal_build_dir)/submodules/dhrystone/dhrystone"
+./run --arch aarch64 --baremetal "$(./getvar --arch aarch64 baremetal_build_dir)/submodules/dhrystone/dhrystone" --cli-args 10000
+or with gem5:
+# Build our Newlib stubs.
+./build-baremetal --arch aarch64
+./build-dhrystone --arch aarch64 --emulator gem5 --mode baremetal
+./run --arch aarch64 --baremetal "$(./getvar --arch aarch64 --emulator gem5 baremetal_build_dir)/submodules/dhrystone/dhrystone" --cli-args 10000 --emulator gem5
./run --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe" --userland-args '100 2'
+./run --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe" --cli-args '100 2'
./build-stream --optimization-level 3
-./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe" --userland-args '1000 2'
+./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe" --cli-args '1000 2'
userland/freestanding/: freestanding programs that work on any ISA
+the stack pointer
TODO: we don’t do this currently but maybe we should set up BSS
@@ -32342,12 +32358,88 @@ AArch64, see Procedure Call Standard for the ARM 64-bit Architecture.
Semihosting is a publicly documented interface specified by ARM Holdings that allows us to do some magic operations very useful in development.
+For things to work in baremetal, we often have to lay out memory in specific ways.
Semihosting is implemented both on some real devices and on simulators such as QEMU and gem5 semihosting.
+Notably, since we start with paging disabled, there are more constraints on where memory can or cannot go.
+Especially for C programs, this memory layout is specified by a "linker script", which is present at: baremetal/link.ld
+Note how our linker script also exposes some symbols to C:
+lkmc_heap_low = .;
+lkmc_heap_top = .;
+Those for example are required to implement malloc in Newlib. We can play with those variables more explicitly with baremetal/linker_variables.c:
./run --arch aarch64 --baremetal baremetal/linker_variables.c
+QEMU and gem5 currently support baremetal CLI arguments!
+You can see them in action e.g. with:
+./run --arch aarch64 --baremetal userland/c/command_line_arguments.c --cli-args 'aa bb cc'
+./run --arch aarch64 --userland userland/c/command_line_arguments.c --cli-args 'aa bb cc'
+both of which output the exact same thing:
+aa
+bb
+cc
+This is implemented by parsing the command line arguments and placing them into memory where the code will find them.
+This works by:
+fixing the argc and argv addresses in memory in the Baremetal linker script
the Baremetal bootloaders pass those addresses correctly to the call of main
our Python scripts write the desired binary memory values to a file
+QEMU loads those files into memory with -device loader: https://github.com/qemu/qemu/blob/60905286cb5150de854e08279bca7dfc4b549e91/docs/generic-loader.txt
It is worth noting that e.g. ARM has a Semihosting mechanism for loading CLI arguments through SYS_GET_CMDLINE, but our mechanism works in principle for any ISA.
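To make the "write the desired binary memory values to a file" step more concrete, here is a minimal Python sketch of how a `--cli-args` string could be turned into a memory image. The layout (a 64-bit little-endian `argc`, then the NULL-terminated `argv` pointer array, then the strings), the function name `pack_cli_args` and the base address are assumptions made for illustration only, and not necessarily what the scripts in this repository do:

```python
import shlex
import struct


def pack_cli_args(cli_args, base_addr):
    """Pack a shell-like argument string into a flat memory image.

    Hypothetical layout, starting at base_addr:
      8 bytes                argc (little endian)
      8 bytes * (argc + 1)   argv pointers, terminated by a NULL pointer
      remaining bytes        NUL-terminated argument strings
    """
    args = shlex.split(cli_args)  # 'asdf "qw er"' -> ['asdf', 'qw er']
    argc = len(args)
    header_size = 8 + 8 * (argc + 1)
    strings = b""
    pointers = []
    for arg in args:
        # Each pointer is the absolute address the string will have once
        # the image is loaded at base_addr.
        pointers.append(base_addr + header_size + len(strings))
        strings += arg.encode() + b"\0"
    pointers.append(0)  # argv[argc] == NULL
    header = struct.pack("<Q", argc) + struct.pack("<%dQ" % len(pointers), *pointers)
    return header + strings


# 0x40000000 is an arbitrary example address, not taken from the linker script.
image = pack_cli_args('asdf "qw er"', 0x40000000)
print(len(image), "bytes")
```

The resulting bytes would then be written to a file and handed to the emulator, for example through QEMU's `-device loader` mentioned above, so that the bootloader finds `argc` and `argv` at the addresses fixed by the linker script.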
Semihosting is a publicly documented interface specified by ARM Holdings that allows us to do some magic operations very useful in development, such as writing to the terminal or reading and writing host files.
It is documented at: https://developer.arm.com/docs/100863/latest/introduction
@@ -32463,7 +32555,7 @@ svc 0x00123456
For gem5, you need:
TODO: our example is printing newlines without automatic carriage return \r as in:
For arm, some baremetal examples compile fine with:
Didn’t get it working, tracking at: https://github.com/cirosantilli/linux-kernel-module-cheat/issues/119
It is incredible, but GDB also has a CPU simulator inside of it as documented at: https://sourceware.org/gdb/onlinedocs/gdb/Target-Commands.html
Since I had this compiled, I also decided to try it out on userland.
In this section we will focus on learning ARM architecture concepts that can only be learnt on baremetal setups.
Userland information can be found at: https://github.com/cirosantilli/arm-assembly-cheat
ARM exception levels are analogous to x86 rings.
According to the ARMv7 architecture reference manual, access to that register is controlled by other registers, NSACR.{CP11, CP10} and HCPTR, so those must be turned off. I have not investigated this further yet: even just trying to dump those registers in userland/arch/arm/dump_regs.c also leads to exceptions…
TODO. Create a minimal runnable example of going into EL0 and jumping to EL1.
See ARMv8 architecture reference manual db D1.6.2 "The stack pointer registers".
This is the most basic example of exception handling we have.
The vector table format is described on ARMv8 architecture reference manual Table D1-7 "Vector offsets from vector table base address".
Exception Syndrome Register.
See example at: Section 27.8.2, “ARM SVC instruction”
+See example at: Section 27.10.2, “ARM SVC instruction”
Documentation: ARMv8 architecture reference manual db D12.2.36 "ESR_EL1, Exception Syndrome Register (EL1)".
Exception Link Register.
See the example at: Section 27.8.2, “ARM SVC instruction”
+See the example at: Section 27.10.2, “ARM SVC instruction”
Examples:
Bibliography: https://stackoverflow.com/questions/980999/what-does-multicore-assembly-language-look-like/33651438#33651438
The WFE and SEV instructions are just hints: a compliant implementation can treat them as NOPs.
For how userland spinlocks and mutexes are implemented see Userland mutex implementation.
Examples:
WFE and SEV are usable from userland, and are part of an efficient spinlock implementation. Arguably, though, userland should stay away from spinlocks altogether and stick to mutexes built on the futex system call, which allows sleeping without busy waiting.
gem5 390a74f59934b85d91489f8a563450d8321b602d does not sleep on the first WFE on either syscall emulation or full system, because the code does:
Can be used to implement atomic variables, see also:
In QEMU, CPU 1 starts in a halted state. This can be observed from GDB, where:
TODO: create and study a minimal example in gem5 where the DMB instruction leads to fewer cycles: https://stackoverflow.com/questions/15491751/real-life-use-cases-of-barriers-dsb-dmb-isb-in-arm
The ARM timer is the simplest way to generate hardware interrupts periodically, and therefore serves as the simplest example of ARM GIC usage.
Generic Interrupt Controller.
TODO create a minimal working aarch64 example analogous to the x86 one at: https://github.com/cirosantilli/x86-bare-metal-examples/blob/6dc9a73830fc05358d8d66128f740ef9906f7677/paging.S
First, also consider the userland bibliography: Section 24.9, “ARM assembly bibliography”.
It is nice when things just work.
But you can also learn a thing or two from how I actually made them work in the first place.
Enter the QEMU console:
Inside baremetal/lib/aarch64.S there is a chunk of code that enables floating point operations:
Baremetal tests work exactly like User mode tests, except that you have to add the --mode baremetal option, for example:
gem5 busy loop
a18f28e263c91362519ef550150b5c9d75fa3679 + 1
./run --arch aarch64 --emulator gem5 --static --userland userland/gcc/busy_loop.c --userland-args 1000000
./run --arch aarch64 --emulator gem5 --static --userland userland/gcc/busy_loop.c --cli-args 1000000
10^6
18
2.4005699 * 10^7
gem5 busy loop for a debug build
a18f28e263c91362519ef550150b5c9d75fa3679 + 1
./run --arch aarch64 --emulator gem5 --gem5-build-type debug --static --userland userland/gcc/busy_loop.c --userland-args 100000
./run --arch aarch64 --emulator gem5 --gem5-build-type debug --static --userland userland/gcc/busy_loop.c --cli-args 100000
10^5
33
2.405682 * 10^6
gem5 busy loop for a fast build
0d5a41a3f88fcd7ed40fc19474fe5aed0463663f + 1
userland/gcc/busy_loop.c -O0 -static
./run --arch aarch64 --emulator gem5 --gem5-build-type fast --static --userland userland/gcc/busy_loop.c --userland-args 1000000
./run --arch aarch64 --emulator gem5 --gem5-build-type fast --static --userland userland/gcc/busy_loop.c --cli-args 1000000
10^6
15
2.4005699 * 10^7
gem5 busy loop for a TimingSimpleCPU
a18f28e263c91362519ef550150b5c9d75fa3679 + 1
./run --arch aarch64 --emulator gem5 --arch aarch64 --static --userland userland/gcc/busy_loop.c --userland-args 1000000 -- --cpu-type TimingSimpleCPU --caches
./run --arch aarch64 --emulator gem5 --arch aarch64 --static --userland userland/gcc/busy_loop.c --cli-args 1000000 -- --cpu-type TimingSimpleCPU --caches
10^6
26
2.4005699 * 10^7
gem5 busy loop for a MinorCPU
a18f28e263c91362519ef550150b5c9d75fa3679 + 1
./run --arch aarch64 --emulator gem5 --arch aarch64 --userland userland/gcc/busy_loop.c --userland-args 1000000 -- --cpu-type MinorCPU --caches
./run --arch aarch64 --emulator gem5 --arch aarch64 --userland userland/gcc/busy_loop.c --cli-args 1000000 -- --cpu-type MinorCPU --caches
10^6
31
1.1018152 * 10^7
5d233f2664a78789f9907d27e2a40e86cefad595
STREAM benchmark -O3
./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --userland-args 1000000 --trace ExecAll
./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args 1000000 --trace ExecAll
3 * 10^5 * 2
64
9.9674773 * 10^7
glibc C pre-main effects
ab6f7331406b22f8ab6e2df5f8b8e464fb35b611
gem5 --arch aarch64 --userland-args e
gem5 --arch aarch64 --cli-args e
1
2
1.26479 * 10^5
ab6f7331406b22f8ab6e2df5f8b8e464fb35b611
glibc C pre-main userland/c/m5ops.c -O0
gem5 --arch aarch64 --userland-args e --gem5-build-type debug
gem5 --arch aarch64 --cli-args e --gem5-build-type debug
1
2
1.26479 * 10^5
ab6f7331406b22f8ab6e2df5f8b8e464fb35b611
glibc C++ pre-main userland/cpp/m5ops.cpp -O0
gem5 --arch aarch64 --userland-args e
gem5 --arch aarch64 --cli-args e
1
2
2.385012 * 10^6
ab6f7331406b22f8ab6e2df5f8b8e464fb35b611
glibc C++ pre-main userland/cpp/m5ops.cpp -O0
gem5 --arch aarch64 --userland-args e --gem5-build-type debug
gem5 --arch aarch64 --cli-args e --gem5-build-type debug
1
25
2.385012 * 10^6
Check the effect of an ExecAll log (log every instruction) on execution time, compare to analogous run without it. trace.txt size: 3.5GB. 5x slowdown observed with output to a hard disk.
d29a07ddad499f273cc90dd66e40f8474b5dfc40
./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --userland-args 1000000 --gem5-worktree master --trace ExecAll
./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args 1000000 --gem5-worktree master --trace ExecAll
10^6
2.4106774 * 10^7
136
Same as above but with run command manually hacked to output to a ramfs. Slightly faster, but the bulk was still just in log format operations!
d29a07ddad499f273cc90dd66e40f8474b5dfc40
./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --userland-args 1000000 --gem5-worktree master --trace ExecAll
./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args 1000000 --gem5-worktree master --trace ExecAll
10^6
2.4106774 * 10^7
107
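To compare the busy loop rows above more directly, the following Python sketch computes a rough simulation throughput for three of them. It assumes that the small integer in each row is the wall-clock time in seconds and that the large number is the simulated instruction count, since the column headers are not visible in this excerpt. The labels `default`, `debug` and `fast` are shorthand for the rows without `--gem5-build-type`, with `debug` and with `fast`:

```python
# (label, simulated instructions, wall-clock seconds) copied from the
# busy loop rows; the meaning of the columns is an assumption.
runs = [
    ("default", 2.4005699e7, 18),
    ("debug", 2.405682e6, 33),
    ("fast", 2.4005699e7, 15),
]


def instructions_per_second(instructions, seconds):
    """Simulated guest instructions executed per host second."""
    return instructions / seconds


rates = {label: instructions_per_second(n, t) for label, n, t in runs}
for label, rate in rates.items():
    print("%-8s %12.0f instructions/s" % (label, rate))

# How many times slower the debug build is than the default build.
debug_slowdown = rates["default"] / rates["debug"]
print("debug slowdown: %.1fx" % debug_slowdown)
```

Normalizing by instruction count matters here because the debug row uses a loop that is ten times smaller than the other two.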
./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --userland-args '1 10000000'
+./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args '1 10000000'
./gem5-stat --arch aarch64 sim_insts