diff --git a/index.html b/index.html
index 44bfc51..88d37e2 100644
--- a/index.html
+++ b/index.html
@@ -7520,10 +7520,10 @@ qemu: uncaught target signal 6 (Aborted) - core dumped
gem5 user mode currently only supports static executables as mentioned at: Section 10.7, “gem5 syscall emulation mode”
+QEMU x86_64 guest on x86_64 host was failing with stack smashing detected, but we found a workaround
QEMU x86_64 guest on x86_64 host was failing with stack smashing detected, but we found a workaround
+gem5 user only supported static executables in the past, as mentioned at: Section 10.7, “gem5 syscall emulation mode”
 printf a > f.tmp
-./run --emulator gem5 --userland userland/c/getchar.c --static --stdin-file f.tmp
+./run --emulator gem5 --userland userland/c/getchar.c --stdin-file f.tmp
There are many more unimplemented syscalls in gem5 than in QEMU. Many of those are trivial to implement, however.
As of 185c2730cc78d5adda683d76c0e3b35e7cb534f0, dynamically linked executables only work on x86, and they can only use the host libraries, which is ugly:
-If you try dynamically linked executables on ARM, they fail with:
-fatal: Unable to open dynamic executable's interpreter.-
Support for dynamic linking was added in November 2019: https://stackoverflow.com/questions/50542222/how-to-run-a-dynamically-linked-executable-syscall-emulation-mode-se-py-in-gem5/50696098#50696098
So let’s just play with some static ones:
-./build-userland \
-  --arch aarch64 \
-  --static \
-;
+./build-userland --arch aarch64
 ./run \
   --arch aarch64 \
   --emulator gem5 \
@@ -7745,14 +7724,12 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex
   --arch aarch64 \
   --emulator gem5 \
   --gdb-wait \
-  --static \
   --userland userland/c/command_line_arguments.c \
   --userland-args 'asdf "qw er"' \
 ;
 ./run-gdb \
   --arch aarch64 \
   --emulator gem5 \
-  --static \
   --userland userland/c/command_line_arguments.c \
   main \
 ;
@@ -7765,7 +7742,7 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex
-./run --dry-run --emulator gem5 --static --userland userland/c/false.c
+./run --dry-run --emulator gem5 --userland userland/c/false.c
 ./run \
   --emulator gem5 \
-  --static userland/arch/x86_64/freestanding/linux/hello.S \
-  --userland \
+  --userland userland/arch/x86_64/freestanding/linux/hello.S \
   --trace-stdout \
   --trace ExecAll,SyscallBase,SyscallVerbose \
 ;
@@ -17961,7 +17937,7 @@ ps Haux | grep qemu | wc
-./run --cpus 1 --emulator gem5 --static --userland userland/posix/pthread_self.c --userland-args 1
+./run --cpus 1 --emulator gem5 --userland userland/posix/pthread_self.c --userland-args 1
-./run --cpus 2 --emulator gem5 --static --userland userland/posix/pthread_self.c --userland-args 1
+./run --cpus 2 --emulator gem5 --userland userland/posix/pthread_self.c --userland-args 1
-./run --cpus 1 --userland userland/cpp/thread_hardware_concurrency.cpp --emulator gem5 --static
-./run --cpus 2 --userland userland/cpp/thread_hardware_concurrency.cpp --emulator gem5 --static
+./run --cpus 1 --userland userland/cpp/thread_hardware_concurrency.cpp --emulator gem5
+./run --cpus 2 --userland userland/cpp/thread_hardware_concurrency.cpp --emulator gem5
-./run --arch aarch64 --cpus 3 --emulator gem5 --static --userland userland/posix/pthread_self.c --userland-args 2
+./run --arch aarch64 --cpus 3 --emulator gem5 --userland userland/posix/pthread_self.c --userland-args 2
 ./run \
   --emulator gem5 \
-  --static \
   --userland \userland/arch/x86_64/freestanding/linux/hello.S \
   --trace-insts-stdout \
   -- \
@@ -19763,7 +19737,6 @@ Exiting @ tick 3000 because all threads reached the max instruction count
 ./run \
   --emulator gem5 \
-  --static \
   --userland \userland/arch/x86_64/freestanding/linux/hello.S \
   --trace-insts-stdout \
   -- \
@@ -19778,7 +19751,6 @@ Exiting @ tick 3000 because all threads reached the max instruction count
 ./run \
   --emulator gem5 \
-  --static \
   --userland userland/posix/pthread_deadlock.c \
   --userland-args 1 \
 ;
@@ -23502,13 +23474,11 @@ xdg-open bst_vs_heap_vs_hashmap.tmp.png
   --arch x86_64 \
   --ccflags='-DLKMC_M5OPS_ENABLE=1' \
   --force-rebuild userland/cpp/bst_vs_heap_vs_hashmap.cpp \
-  --static \
   --optimization-level 3 \
 ;
 ./run \
   --arch x86_64 \
   --emulator gem5 \
-  --static \
   --userland userland/cpp/bst_vs_heap_vs_hashmap.cpp \
   --userland-args='100000 1 0' \
   -- \
@@ -23674,6 +23644,25 @@ cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans,3,3,2 ,1, A,3, B,
Created in the 80’s, it is not a representative measure of performance in modern computers anymore. It has mostly been replaced by SPEC, which is… closed source! Unbelievable.
Dhrystone is very simple:
+there is one loop in the dhry_1.c main function that gets executed N times
that loop calls 9 short functions called Proc_0 to Proc_9, most of which are defined in dhry_1.c, and a few others in dhry_2.c
The benchmark is single-threaded.
+After a quick look at it, Dhrystone in -O3 is very likely completely CPU bound, as there are no loops over variable-sized arrays, except for some dummy ones that only run once. It just does a bunch of operations on local and global C variables, which are very likely to be inlined and treated fully in registers until the final write back, or to fit entirely in cache. TODO confirm with some kind of measurement. The benchmark also makes no syscalls except for measuring time and reporting results.
Buildroot has a dhrystone package, but because it is so interesting to us, we decided to also build it ourselves, which allows things like static and baremetal compilation more easily.
-./build-dhrystone --optimization-level 3 --static
-./run --emulator gem5 --userland "$(./getvar --static userland_build_dir)/submodules/dhrystone/dhrystone"
+./build-dhrystone --optimization-level 3
+./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
-./build-stream --optimization-level 3 --static
-./run --emulator gem5 --userland "$(./getvar --static userland_build_dir)/submodules/stream-benchmark/stream_c.exe" --userland-args '1000 2'
+./build-stream --optimization-level 3
+./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe" --userland-args '1000 2'
-./build-userland --static userland/arch/x86_64/inline_asm/rdtsc.S
+./build-userland userland/arch/x86_64/inline_asm/rdtsc.S
 ./run --eval './arch/x86_64/rdtsc.out;m5 exit;' --emulator gem5
 ./gem5-stat
-./run --arch aarch64 --userland userland/arch/aarch64/sve_addvl.S --static --emulator gem5 -- --param 'system.cpu[:].isa[:].sve_vl_se = 1'
-./run --arch aarch64 --userland userland/arch/aarch64/sve_addvl.S --static --emulator gem5 -- --param 'system.cpu[:].isa[:].sve_vl_se = 2'
-./run --arch aarch64 --userland userland/arch/aarch64/sve_addvl.S --static --emulator gem5 -- --param 'system.cpu[:].isa[:].sve_vl_se = 4'
+./run --arch aarch64 --userland userland/arch/aarch64/sve_addvl.S --emulator gem5 -- --param 'system.cpu[:].isa[:].sve_vl_se = 1'
+./run --arch aarch64 --userland userland/arch/aarch64/sve_addvl.S --emulator gem5 -- --param 'system.cpu[:].isa[:].sve_vl_se = 2'
+./run --arch aarch64 --userland userland/arch/aarch64/sve_addvl.S --emulator gem5 -- --param 'system.cpu[:].isa[:].sve_vl_se = 4'
For now we can just run on gem5 to estimate the instruction count per input size and extrapolate?
For example, the simplest scalable CPU content would be a busy loop: userland/gcc/busy_loop.c, so let’s focus on that for now.
+For example, the simplest scalable CPU content would be a busy loop: userland/gcc/busy_loop.c, so let’s start by analyzing that one.
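The extrapolation idea can be sketched with a bit of shell arithmetic. This is only a rough sketch: it assumes the instruction count grows linearly with the loop count and ignores any constant startup cost. The helper name `extrapolate_insts` is made up for illustration and is not part of LKMC. The sample numbers come from the busy loop row of the results table (10^6 loops, 1.1005150 * 10^7 instructions).

```shell
# Hypothetical helper, not part of LKMC: linearly extrapolate an instruction count.
# usage: extrapolate_insts <measured_insts> <measured_loops> <target_loops>
extrapolate_insts() {
  awk -v insts="$1" -v loops="$2" -v target="$3" \
    'BEGIN { printf "%.0f\n", insts * target / loops }'
}

# 11005150 instructions were measured for 10^6 loops; estimate the count for 10^7 loops.
extrapolate_insts 11005150 1000000 10000000
```

A second measurement at a different loop count would be needed to check how good the linear assumption actually is.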
Summary of manually collected results on P51 at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: Table 7, “Busy loop MIPS for different simulator setups”. As expected, the less native / more detailed / more complex simulations are slower!
@@ -31515,15 +31504,19 @@ instructions 124346081| Simulator | +LKMC | +Benchmark | +Emulator | Loops | Time (s) | Instruction count | @@ -31532,6 +31525,8 @@ instructions 124346081
|---|---|---|---|---|---|---|
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 |
+userland/gcc/busy_loop.c -O0 |
|
10^10 |
68 |
@@ -31539,6 +31534,8 @@ instructions 124346081
2000 |
|
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 |
+userland/gcc/busy_loop.c -O0 |
|
10^7 |
100 |
@@ -31546,6 +31543,8 @@ instructions 124346081
1 |
|
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 |
+userland/gcc/busy_loop.c -O0 |
|
10^6 |
31 |
@@ -31553,6 +31552,8 @@ instructions 124346081
0.4 |
|
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 |
+userland/gcc/busy_loop.c -O0 |
|
10^6 |
52 |
@@ -31560,23 +31561,43 @@ instructions 124346081
0.2 |
|
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 |
+userland/gcc/busy_loop.c -O0 |
|
1 * 1000000 = 10^6 |
63 |
1.1005150 * 10^7 |
0.2 |
605448f07e6380634b1aa7e9732d111759f69fd |
+Dhrystone -O3 |
+
|
+4 * 10^5 |
+68 |
+9.2034139 * 10^7 |
+1.6 |
+
5d233f2664a78789f9907d27e2a40e86cefad595 |
+STREAM benchmark -O3 |
+
|
+3 * 10^5 * 2 |
+64 |
+9.9674773 * 10^7 |
+1.6 |
+
The first step is to determine a number of loops that will run long enough to have meaningful results, but not too long that we will get bored.
+The first step is to determine a number of loops that will run long enough to have meaningful results, but not so long that we get bored, so about 1 minute.
On our P51 machine, we found 10^7 (10 million == 1000 times 10000) loops to be a good number for a gem5 atomic simulation:
-./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --userland-args '1 10000000' --static
+./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --userland-args '1 10000000'
 ./gem5-stat --arch aarch64 sim_insts
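The MIPS column of the table can be reproduced from an instruction count such as the `sim_insts` stat and a run time in seconds. The following is a minimal sketch, assuming the time is host wall-clock time measured separately; the helper name `mips` is hypothetical and not an LKMC script.

```shell
# Hypothetical helper, not part of LKMC:
# millions of simulated instructions per host second.
# usage: mips <instruction_count> <wall_time_seconds>
mips() {
  awk -v insts="$1" -v secs="$2" \
    'BEGIN { printf "%.2f\n", insts / secs / 1000000 }'
}

# Busy loop row of the table: 1.1005150 * 10^7 instructions in 63 s.
mips 11005150 63
```

This prints 0.17, which agrees with the 0.2 reported for that row once rounded to one decimal place.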
Lists coming from parent directories are extended instead of overwritten by children, this is especially useful for C compiler flags.
To quickly determine which properties a path has, you can use getprops, e.g.:
+./getprops userland/c/hello.c
+which outputs values such as:
+allowed_archs=None
+allowed_emulators=None
+arm_aarch32=False
+arm_sve=False
+baremetal=True
+
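For scripting, a single property can be picked out of that output. The sketch below assumes the output is always one `key=value` pair per line, as in the sample; `prop_value` is a hypothetical helper, not something LKMC provides.

```shell
# Hypothetical helper, not part of LKMC: print the value of one key
# from key=value lines read on stdin.
# usage: ./getprops userland/c/hello.c | prop_value baremetal
prop_value() {
  awk -F= -v key="$1" '$1 == key { print $2 }'
}

# Sample input copied from the getprops output shown in this section.
printf 'allowed_archs=None\narm_sve=False\nbaremetal=True\n' | prop_value baremetal
```

Because the split is on `=`, a value that itself contains `=` would be cut short, so this only suits simple values like the ones in the sample.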