diff --git a/index.html b/index.html
index 44bfc51..88d37e2 100644
--- a/index.html
+++ b/index.html
@@ -7520,10 +7520,10 @@ qemu: uncaught target signal 6 (Aborted) - core dumped
@@ -7673,7 +7673,7 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex
printf a > f.tmp
-./run --emulator gem5 --userland userland/c/getchar.c --static --stdin-file f.tmp
+./run --emulator gem5 --userland userland/c/getchar.c --stdin-file f.tmp
@@ -7696,35 +7696,14 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex

There are many more unimplemented syscalls in gem5 than in QEMU. Many of those are trivial to implement, however.

-

As of 185c2730cc78d5adda683d76c0e3b35e7cb534f0, dynamically linked executables only work on x86, and they can only use the host libraries, which is ugly:

-
-
- -
-
-

If you try dynamically linked executables on ARM, they fail with:

-
-
-
-
fatal: Unable to open dynamic executable's interpreter.
-
+

Support for dynamic linking was added in November 2019: https://stackoverflow.com/questions/50542222/how-to-run-a-dynamically-linked-executable-syscall-emulation-mode-se-py-in-gem5/50696098#50696098

So let’s just play with some static ones:

-
./build-userland \
-  --arch aarch64 \
-  --static \
-;
+
./build-userland --arch aarch64
 ./run \
   --arch aarch64 \
   --emulator gem5 \
@@ -7745,14 +7724,12 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex
   --arch aarch64 \
   --emulator gem5 \
   --gdb-wait \
-  --static \
   --userland userland/c/command_line_arguments.c \
   --userland-args 'asdf "qw er"' \
 ;
 ./run-gdb \
   --arch aarch64 \
   --emulator gem5 \
-  --static \
   --userland userland/c/command_line_arguments.c \
   main \
 ;
@@ -7765,7 +7742,7 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex
-
./run --dry-run --emulator gem5 --static --userland userland/c/false.c
+
./run --dry-run --emulator gem5 --userland userland/c/false.c
@@ -7806,8 +7783,7 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex
./run \
   --emulator gem5 \
-  --static userland/arch/x86_64/freestanding/linux/hello.S \
-  --userland \
+  --userland userland/arch/x86_64/freestanding/linux/hello.S \
   --trace-stdout \
   --trace ExecAll,SyscallBase,SyscallVerbose \
 ;
@@ -17961,7 +17937,7 @@ ps Haux | grep qemu | wc
-
./run --cpus 1 --emulator gem5 --static --userland userland/posix/pthread_self.c --userland-args 1
+
./run --cpus 1 --emulator gem5 --userland userland/posix/pthread_self.c --userland-args 1
@@ -17977,7 +17953,7 @@ ps Haux | grep qemu | wc
-
./run --cpus 2 --emulator gem5 --static --userland userland/posix/pthread_self.c --userland-args 1
+
./run --cpus 2 --emulator gem5 --userland userland/posix/pthread_self.c --userland-args 1
@@ -17988,8 +17964,8 @@ ps Haux | grep qemu | wc
-
./run --cpus 1 --userland userland/cpp/thread_hardware_concurrency.cpp --emulator gem5 --static
-./run --cpus 2 --userland userland/cpp/thread_hardware_concurrency.cpp --emulator gem5 --static
+
./run --cpus 1 --userland userland/cpp/thread_hardware_concurrency.cpp --emulator gem5
+./run --cpus 2 --userland userland/cpp/thread_hardware_concurrency.cpp --emulator gem5
@@ -18000,7 +17976,7 @@ ps Haux | grep qemu | wc
-
./run --arch aarch64 --cpus 3 --emulator gem5 --static --userland userland/posix/pthread_self.c --userland-args 2
+
./run --arch aarch64 --cpus 3 --emulator gem5 --userland userland/posix/pthread_self.c --userland-args 2
@@ -19081,7 +19057,6 @@ m5 execfile
   --arch aarch64 \
   --ccflags='-DLKMC_M5OPS_ENABLE=1' \
   --force-rebuild \
-  --static \
   userland/c/m5ops.c \
 ;
 ./build-buildroot --arch aarch64
@@ -19732,7 +19707,6 @@ system.cpu[0].max_insts_any_thread
./run \
   --emulator gem5 \
-  --static \
   --userland userland/arch/x86_64/freestanding/linux/hello.S \
   --trace-insts-stdout \
   -- \
@@ -19763,7 +19737,6 @@ Exiting @ tick 3000 because all threads reached the max instruction count
./run \
   --emulator gem5 \
-  --static \
   --userland userland/arch/x86_64/freestanding/linux/hello.S \
   --trace-insts-stdout \
   -- \
@@ -19778,7 +19751,6 @@ Exiting @ tick 3000 because all threads reached the max instruction count
./run \
   --emulator gem5 \
-  --static \
   --userland userland/posix/pthread_deadlock.c \
   --userland-args 1 \
 ;
@@ -23502,13 +23474,11 @@ xdg-open bst_vs_heap_vs_hashmap.tmp.png
   --arch x86_64 \
   --ccflags='-DLKMC_M5OPS_ENABLE=1' \
   --force-rebuild userland/cpp/bst_vs_heap_vs_hashmap.cpp \
-  --static \
   --optimization-level 3 \
 ;
 ./run \
   --arch x86_64 \
   --emulator gem5 \
-  --static \
   --userland userland/cpp/bst_vs_heap_vs_hashmap.cpp \
   --userland-args='100000 1 0' \
   -- \
@@ -23674,6 +23644,25 @@ cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans,3,3,2 ,1, A,3, B,

Created in the 80’s, it is not a representative measure of performance in modern computers anymore. It has mostly been replaced by SPEC, which is…​ closed source! Unbelievable.

+

Dhrystone is very simple:

+
+
+
    +
  • +

    there is one loop in the dhry_1.c main function that gets executed N times

    +
  • +
  • +

    that loop calls 9 short functions called Proc_0 to Proc_9, most of which are defined in dhry_1.c, and a few others in dhry_2.c

    +
  • +
+
+
+

The benchmark is single-threaded.

+
+
+

After a quick look at it, Dhrystone at -O3 is very likely completely CPU bound, as there are no loops over variable-sized arrays, except for some dummy ones that only run once. It just does a bunch of operations on local and global C variables, which are very likely to be inlined and kept fully in registers until the final write back, or to fit entirely in cache. TODO confirm with some kind of measurement. The benchmark also makes no syscalls except for measuring time and reporting results.

+
+

Buildroot has a dhrystone package, but because it is so interesting to us, we decided to also build it ourselves, which allows things like static and baremetal compilation more easily.

@@ -23702,8 +23691,8 @@ cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans,3,3,2 ,1, A,3, B,
-
./build-dhrystone --optimization-level 3 --static
-./run --emulator gem5 --userland "$(./getvar --static userland_build_dir)/submodules/dhrystone/dhrystone"
+
./build-dhrystone --optimization-level 3
+./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
@@ -23812,8 +23801,8 @@ times[3 * ntimes + k] = mysecond() - times[3 * ntimes + k];
-
./build-stream --optimization-level 3 --static
-./run --emulator gem5 --userland "$(./getvar --static userland_build_dir)/submodules/stream-benchmark/stream_c.exe" --userland-args '1000 2'
+
./build-stream --optimization-level 3
+./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe" --userland-args '1000 2'
@@ -26800,7 +26789,7 @@ pop %rbp
-
./build-userland --static userland/arch/x86_64/inline_asm/rdtsc.S
+
./build-userland userland/arch/x86_64/inline_asm/rdtsc.S
 ./run --eval './arch/x86_64/rdtsc.out;m5 exit;' --emulator gem5
 ./gem5-stat
@@ -28717,9 +28706,9 @@ AArch64, see Procedure Call Standard for the ARM 64-bit Architecture.

-
./run --arch aarch64 --userland userland/arch/aarch64/sve_addvl.S --static --emulator gem5 -- --param 'system.cpu[:].isa[:].sve_vl_se = 1'
-./run --arch aarch64 --userland userland/arch/aarch64/sve_addvl.S --static --emulator gem5 -- --param 'system.cpu[:].isa[:].sve_vl_se = 2'
-./run --arch aarch64 --userland userland/arch/aarch64/sve_addvl.S --static --emulator gem5 -- --param 'system.cpu[:].isa[:].sve_vl_se = 4'
+
./run --arch aarch64 --userland userland/arch/aarch64/sve_addvl.S --emulator gem5 -- --param 'system.cpu[:].isa[:].sve_vl_se = 1'
+./run --arch aarch64 --userland userland/arch/aarch64/sve_addvl.S --emulator gem5 -- --param 'system.cpu[:].isa[:].sve_vl_se = 2'
+./run --arch aarch64 --userland userland/arch/aarch64/sve_addvl.S --emulator gem5 -- --param 'system.cpu[:].isa[:].sve_vl_se = 4'
@@ -31507,7 +31496,7 @@ instructions 124346081

For now we can just run on gem5 to estimate the instruction count per input size and extrapolate?

-

For example, the simplest scalable CPU content would be a busy loop: userland/gcc/busy_loop.c, so let’s focus on that for now.

+

For example, the simplest scalable CPU content would be a busy loop: userland/gcc/busy_loop.c, so let’s start by analyzing that one.

Summary of manually collected results on P51 at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: Table 7, “Busy loop MIPS for different simulator setups”. As expected, the less native / more detailed / more complex simulations are slower!

@@ -31515,15 +31504,19 @@ instructions 124346081 -----+++++++ - + + + @@ -31532,6 +31525,8 @@ instructions 124346081 + + @@ -31539,6 +31534,8 @@ instructions 124346081 + + @@ -31546,6 +31543,8 @@ instructions 124346081 + + @@ -31553,6 +31552,8 @@ instructions 124346081 + + @@ -31560,23 +31561,43 @@ instructions 124346081 + + + + + + + + + + + + + + + + + + + +
Table 7. Busy loop MIPS for different simulator setups

LKMC | Benchmark | Emulator | Loops | Time (s) | Instruction count | MIPS
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | qemu --arch aarch64 | 10^10 | 68 | | 2000
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | gem5 --arch aarch64 | 10^7 | 100 | | 1
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | gem5 --arch aarch64 -- --cpu-type MinorCPU --caches | 10^6 | 31 | | 0.4
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | gem5 --arch aarch64 -- --cpu-type DerivO3CPU --caches | 10^6 | 52 | | 0.2
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 | userland/gcc/busy_loop.c -O0 | gem5 --arch aarch64 --gem5-build-id MOESI_CMP_directory -- --cpu-type DerivO3CPU --caches --ruby | 1 * 1000000 = 10^6 | 63 | 1.1005150 * 10^7 | 0.2
605448f07e6380634b1aa7e9732d111759f69fd | Dhrystone -O3 | gem5 --arch aarch64 | 4 * 10^5 | 68 | 9.2034139 * 10^7 | 1.6
5d233f2664a78789f9907d27e2a40e86cefad595 | STREAM benchmark -O3 | gem5 --arch aarch64 --userland-args 300000 2 | 3 * 10^5 * 2 | 64 | 9.9674773 * 10^7 | 1.6
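
The last value of each row in Table 7 appears to be the instruction count divided by the run time, in millions of instructions per second. As a rough sketch of that arithmetic, here is a small shell helper. The name `mips` is made up for this illustration and is not part of LKMC:

```shell
# Hypothetical helper, not part of LKMC: MIPS = instructions / seconds / 10^6.
mips() {
  # $1: simulated instruction count, $2: wall clock time in seconds
  LC_ALL=C awk -v insts="$1" -v secs="$2" \
    'BEGIN { printf "%.1f\n", insts / secs / 1e6 }'
}

# STREAM row of Table 7: 9.9674773 * 10^7 instructions in 64 s.
mips 99674773 64
```

For the STREAM row this prints 1.6, which matches the value in the table.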

-

The first step is to determine a number of loops that will run long enough to have meaningful results, but not too long that we will get bored.

+

The first step is to determine a number of loops that will run long enough to give meaningful results, but not so long that we get bored, so about 1 minute.

On our P51 machine, we found 10^7 (10 million == 1000 times 10000) loops to be a good number for a gem5 atomic simulation:

-
./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --userland-args '1 10000000' --static
+
./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --userland-args '1 10000000'
 ./gem5-stat --arch aarch64 sim_insts
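
Once sim_insts is known for one loop count, a first guess for other loop counts could be a plain linear scaling. The sketch below assumes that the instruction count grows proportionally with the number of loops and ignores any fixed startup cost, which we have not verified. The helper name `extrapolate_insts` is made up for this illustration and is not part of LKMC:

```shell
# Hypothetical helper, not part of LKMC: scale an instruction count
# linearly from a measured loop count to a target loop count.
extrapolate_insts() {
  # $1: measured instructions, $2: measured loops, $3: target loops
  LC_ALL=C awk -v insts="$1" -v loops="$2" -v target="$3" \
    'BEGIN { printf "%.0f\n", insts * target / loops }'
}

# 1.1005150 * 10^7 instructions were measured for 10^6 loops in Table 7;
# scale that to 10^7 loops.
extrapolate_insts 11005150 1000000 10000000
```

This prints 110051500, that is about 1.1 * 10^8 instructions for 10^7 loops under the linear assumption.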
@@ -34132,6 +34153,26 @@ git submodule update

Lists coming from parent directories are extended instead of overwritten by children; this is especially useful for C compiler flags.

+
+

To quickly determine which properties a path has, you can use getprops, e.g.:

+
+
+
+
./getprops userland/c/hello.c
+
+
+
+

which outputs values such as:

+
+
+
+
allowed_archs=None
+allowed_emulators=None
+arm_aarch32=False
+arm_sve=False
+baremetal=True
+
+
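
Since the output shown above is one key=value pair per line, it should be easy to filter with standard tools. As a minimal sketch, assuming the output format is exactly what is shown above, the following helper prints only the properties that are set to True. The name `true_props` is made up for this illustration and is not part of LKMC:

```shell
# Hypothetical helper, not part of LKMC: read key=value lines on stdin
# and print the keys whose value is exactly True.
true_props() {
  awk -F= '$2 == "True" { print $1 }'
}

# Sample input copied from the getprops output shown above. With the real
# tool this would be: ./getprops userland/c/hello.c | true_props
printf '%s\n' \
  'allowed_archs=None' \
  'allowed_emulators=None' \
  'arm_aarch32=False' \
  'arm_sve=False' \
  'baremetal=True' |
  true_props
```

On the sample input this prints only baremetal.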