mirror of
https://github.com/cirosantilli/linux-kernel-module-cheat.git
synced 2026-01-23 02:05:57 +01:00
Build improvements
12
.gitmodules
vendored
@@ -11,9 +11,6 @@
 [submodule "submodules/crosstool-ng"]
 	path = submodules/crosstool-ng
 	url = https://github.com/cirosantilli/crosstool-ng
-[submodule "submodules/dhrystone"]
-	path = submodules/dhrystone
-	url = https://github.com/cirosantilli/dhrystone
 [submodule "submodules/gcc"]
 	path = submodules/gcc
 	url = https://github.com/cirosantilli/gcc
@@ -36,12 +33,6 @@
 [submodule "submodules/linux"]
 	path = submodules/linux
 	url = https://github.com/cirosantilli/linux
-[submodule "submodules/lmbench"]
-	path = submodules/lmbench
-	url = https://github.com/cirosantilli/lmbench
-[submodule "submodules/omp-validation"]
-	path = submodules/omp-validation
-	url = https://github.com/cirosantilli/omp-validation
 [submodule "submodules/parsec-benchmark"]
 	path = submodules/parsec-benchmark
 	url = https://github.com/cirosantilli/parsec-benchmark
@@ -51,9 +42,6 @@
 [submodule "submodules/xen"]
 	path = submodules/xen
 	url = https://github.com/cirosantilli/xen
-[submodule "submodules/stream-benchmark"]
-	path = submodules/stream-benchmark
-	url = https://github.com/cirosantilli/stream-benchmark
 [submodule "submodules/freebsd"]
 	path = submodules/freebsd
 	url = https://github.com/cirosantilli/freebsd
596
README.adoc
@@ -11337,7 +11337,7 @@ ____

OK, this is why we used gem5 in the first place: performance measurements!

-Let's see how many cycles <<dhrystone>>, which Buildroot provides, takes for a few different input parameters.
+Let's see how many cycles dhrystone, which Buildroot provides, takes for a few different input parameters.

We will do that for various input parameters on full system, by taking a checkpoint after a fast atomic CPU boot finishes, and then restoring in a more detailed mode to run the benchmark:
@@ -20138,84 +20138,6 @@ The implementation lives under `libgomp` in the GCC tree, and is documented at:

`strace` shows that OpenMP makes `clone()` syscalls in Linux. TODO: does it actually call `pthread_` functions, or does it make syscalls directly? Or in other words, can it work on <<freestanding-programs>>? A quick grep shows many references to pthreads.
====== OpenMP validation

https://github.com/uhhpctools/omp-validation

Host build on Ubuntu 20.04:

....
git submodule update --init submodules/omp-validation
cd submodules/omp-validation
PERL5LIB="${PERL5LIB}:." make -j `nproc` ctest
....

This both builds and runs; it took about 5 minutes on <<p51>>, but had some test failures for some reason:

....
Summary:
S Number of tested Open MP constructs: 62
S Number of used tests: 123
S Number of failed tests: 4
S Number of successful tests: 119
S + from this were verified: 115

Normal tests:
N Number of failed tests: 2
N + from this fail compilation: 0
N + from this timed out 0
N Number of successful tests: 60
N + from this were verified: 58

Orphaned tests:
O Number of failed tests: 2
O + from this fail compilation: 0
O + from this timed out 0
O Number of successful tests: 59
O + from this were verified: 57
....

The tests and run results are placed under `bin/c/`, e.g.:

....
test_omp_threadprivate
test_omp_threadprivate.c
test_omp_threadprivate.log
test_omp_threadprivate.out
test_omp_threadprivate_compile.log
....

C files are also present, as some kind of code generation is used.

Build only, and then run one of them manually:

....
make -j`nproc` omp_my_sleep omp_testsuite
PERL5LIB="${PERL5LIB}:." ./runtest.pl --lang=c --norun testlist-c.txt
./bin/c/test_omp_barrier
....

The `bin/c` directory is hardcoded in the executable, so to run it you must ensure that it exists relative to the CWD, e.g.:

....
cd bin/c
mkdir -p bin/c
./test_omp_barrier
....

Manually cross compile all tests and optionally add some extra options, e.g. `-static` to <<gem5-dynamic-linked-executables-in-syscall-emulation,more conveniently run in gem5>>:

....
PERL5LIB="${PERL5LIB}:." ./runtest.pl --lang=c --makeopts 'CC=aarch64-linux-gnu-gcc CFLAGS_EXTRA=-static' --norun testlist-c.txt
./../../run --arch aarch64 --emulator gem5 --userland submodules/omp-validation/bin/c/test_omp_parallel_reduction --cpus 8 --memory 8G
....

Build a single test:

....
make bin/c/test_omp_sections_reduction
....

[[cpp]]
=== C++
@@ -21391,176 +21313,6 @@ Don't forget to explicitly rebuild PARSEC with:

You may also want to test if your patches are still functionally correct inside of QEMU first, which is a faster emulator.

* sell your soul, and compile natively inside the guest. We won't do this, not only because it is evil, but also because Buildroot explicitly does not support it: https://buildroot.org/downloads/manual/manual.html#faq-no-compiler-on-target ARM employees have been known to do this: https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/qemu-patch.diff
===== Coremark

https://en.wikipedia.org/wiki/Coremark

Part of https://en.wikipedia.org/wiki/EEMBC[EEMBC].

They have two versions:

* 2009: https://github.com/eembc/coremark
* 2015: https://github.com/eembc/coremark-pro
+
The README describes very clearly what tests it does. Most of them are understandable high level operations.
+
In particular, it contains "a greatly improved version of the https://en.wikipedia.org/wiki/Livermore_loops[Livermore loops]"

Both have a custom license, so yeah, no patience to read this stuff.

Coremark-pro build and run on Ubuntu 20.04:

....
git submodule update --init submodules/coremark-pro
cd submodules/coremark-pro
make TARGET=linux64 build
make TARGET=linux64 XCMD='-c4' certify-all
....
This uses `4` contexts. TODO: what are contexts? Are they the same as threads? You likely want to use `-c$(nproc)` in practice instead.
Finishes in a few seconds, <<p51>> results:

....
                                                 MultiCore SingleCore
Workload Name                                     (iter/s)   (iter/s)    Scaling
----------------------------------------------- ---------- ---------- ----------
cjpeg-rose7-preset                                  526.32     178.57       2.95
core                                                  7.39       2.16       3.42
linear_alg-mid-100x100-sp                           684.93     238.10       2.88
loops-all-mid-10k-sp                                 27.65       7.80       3.54
nnet_test                                            32.79      10.57       3.10
parser-125k                                          71.43      25.00       2.86
radix2-big-64k                                     2320.19     623.44       3.72
sha-test                                            555.56     227.27       2.44
zip-test                                            363.64     166.67       2.18

MARK RESULTS TABLE

Mark Name                                        MultiCore SingleCore    Scaling
----------------------------------------------- ---------- ---------- ----------
CoreMark-PRO                                      18743.79    6306.76       2.97
....
More sample results: <<p51-coremark-pro>>.

And scaling appears to be the ratio between multicore (4 contexts due to `-c4`) and single core performance; each benchmark gets run twice, once multicore and once single core.
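As a sanity check, the Scaling column can be reproduced by dividing the multicore rate by the single core rate; the numbers below are just a few rows from the sample table above:

```python
# Reproduce the Scaling column of the CoreMark-Pro summary table:
# scaling = multicore rate / single core rate.
rows = {
    'cjpeg-rose7-preset': (526.32, 178.57),
    'radix2-big-64k': (2320.19, 623.44),
    'zip-test': (363.64, 166.67),
}
for name, (multi, single) in rows.items():
    # Matches the reported 2.95, 3.72 and 2.18 up to rounding.
    print('{}: {:.2f}'.format(name, multi / single))
```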
The tester script also outputs test commands, some of which are:

....
builds/linux64/gcc64/bin/zip-test.exe -c1 -w1 -c4 -v1
builds/linux64/gcc64/bin/zip-test.exe -c1 -w1 -c4 -v0
builds/linux64/gcc64/bin/zip-test.exe -c4 -v1
builds/linux64/gcc64/bin/zip-test.exe -c4 -v0
....
`-v1` appears to be a fast verification run, and both `-c1` and `-c4` get run to obtain the single vs multicore performance.
Sample `-c4 -v0` output:

....
- Info: Starting Run...
-- Workload:zip-test=946108807
-- zip-test:time(ns)=11
-- zip-test:contexts=4
-- zip-test:iterations=4
-- zip-test:time(secs)= 0.011
-- zip-test:secs/workload= 0.00275
-- zip-test:workloads/sec= 363.636
-- Done:zip-test=946108807
....

and so we see that the `zip-test:workloads/sec= 363.636` output is the key value, which is close to the `zip-test 363.64` entry in the earlier full summarized result.
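The lines of that output are internally consistent, assuming `secs/workload` is `time(secs)` divided by `iterations`, and `workloads/sec` is its reciprocal:

```python
# Derive secs/workload and workloads/sec from the time and iteration
# counts in the sample zip-test output above.
time_secs = 0.011
iterations = 4
secs_per_workload = time_secs / iterations
workloads_per_sec = 1 / secs_per_workload
print(secs_per_workload)            # 0.00275
print(round(workloads_per_sec, 3))  # 363.636
```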
Cross compile statically for aarch64. From LKMC toplevel:

....
make \
  -C submodules/coremark-pro \
  LINKER_FLAGS='-static' \
  LINKER_LAST='-lm -lpthread -lrt' \
  TARGET=gcc-cross-linux \
  TOOLCHAIN=gcc-cross-linux \
  TOOLS="$(./getvar --arch aarch64 buildroot_host_usr_dir)" \
  TPREF="$(./getvar --arch aarch64 buildroot_toolchain_prefix)-" \
  build \
;
....
Run a single executable on QEMU:

....
./run --arch aarch64 --userland submodules/coremark-pro/builds/gcc-cross-linux/bin/zip-test.exe --cli-args='-c4 -v0'
....

Finishes in about 1 second, and gives `zip-test:workloads/sec= 74.0741`, so it ran about 5x slower than the native host.
Run a single executable on gem5 in a verification run:

....
./run \
  --arch aarch64 \
  --cli-args='-c1 -v1' \
  --emulator gem5 \
  --userland submodules/coremark-pro/builds/gcc-cross-linux/bin/zip-test.exe \
;
....
TODO: hangs for at least 15 minutes, there must be something wrong. It is stuck in an evolving `strlen` loop:

....
7837834500: system.cpu: A0 T0 : @__strlen_generic+112 : ldp
7837834500: system.cpu: A0 T0 : @__strlen_generic+112. 0 : addxi_uop ureg0, x1, #16 : IntAlu : D=0x0000003ffff07170 flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
7837835000: system.cpu: A0 T0 : @__strlen_generic+112. 1 : ldp_uop x2, x3, [ureg0] : MemRead : D=0x20703c0a3e702f3c A=0x3ffff07170 flags=(IsInteger|IsMemRef|IsLoad|IsMicroop|IsLastMicroop)
7837835500: system.cpu: A0 T0 : @__strlen_generic+116 : sub x4, x2, x8 : IntAlu : D=0x3d607360632e3b34 flags=(IsInteger)
7837836000: system.cpu: A0 T0 : @__strlen_generic+120 : sub x6, x3, x8 : IntAlu : D=0x1f6f3b093d6f2e3b flags=(IsInteger)
7837836500: system.cpu: A0 T0 : @__strlen_generic+124 : orr x5, x4, x6 : IntAlu : D=0x3f6f7b697f6f3f3f flags=(IsInteger)
7837837000: system.cpu: A0 T0 : @__strlen_generic+128 : ands x5, x8, LSL #7 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
7837837500: system.cpu: A0 T0 : @__strlen_generic+132 : b.eq <__strlen_generic+88> : IntAlu : flags=(IsControl|IsDirectControl|IsCondControl)
7837838000: system.cpu: A0 T0 : @__strlen_generic+88 : ldp
7837838000: system.cpu: A0 T0 : @__strlen_generic+88. 0 : addxi_uop ureg0, x1, #32 : IntAlu : D=0x0000003ffff07180 flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
7837838500: system.cpu: A0 T0 : @__strlen_generic+88. 1 : ldp_uop x2, x3, [ureg0] : MemRead : D=0x6565686b636f4c27 A=0x3ffff07180 flags=(IsInteger|IsMemRef|IsLoad|IsMicroop|IsDelayedCommit)
7837839000: system.cpu: A0 T0 : @__strlen_generic+88. 2 : addxi_uop x1, ureg0, #0 : IntAlu : D=0x0000003ffff07180 flags=(IsInteger|IsMicroop|IsLastMicroop)
7837839500: system.cpu: A0 T0 : @__strlen_generic+92 : sub x4, x2, x8 : IntAlu : D=0x3c786d606f6c6e62 flags=(IsInteger)
7837840000: system.cpu: A0 T0 : @__strlen_generic+96 : sub x6, x3, x8 : IntAlu : D=0x6464676a626e4b26 flags=(IsInteger)
7837840500: system.cpu: A0 T0 : @__strlen_generic+100 : orr x5, x4, x6 : IntAlu : D=0x7c7c6f6a6f6e6f66 flags=(IsInteger)
7837841000: system.cpu: A0 T0 : @__strlen_generic+104 : ands x5, x8, LSL #7 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
....
Instructions before `__strlen_generic` starts:

....
7831019000: system.cpu: A0 T0 : @define_params_zip+664 : add x1, sp, #168 : IntAlu : D=0x0000007ffffef988 flags=(IsInteger)
7831019500: system.cpu: A0 T0 : @define_params_zip+668 : orr x0, xzr, x24 : IntAlu : D=0x0000003ffff00010 flags=(IsInteger)
7831020000: system.cpu: A0 T0 : @define_params_zip+672 : bl <th_strcat> : IntAlu : D=0x000000000040a4c4 flags=(IsInteger|IsControl|IsDirectControl|IsUncondControl|IsCall)
7831020500: system.cpu: A0 T0 : @th_strcat : b <strcat> : IntAlu : flags=(IsControl|IsDirectControl|IsUncondControl)
7831021000: system.cpu: A0 T0 : @strcat : stp
7831021000: system.cpu: A0 T0 : @strcat. 0 : addxi_uop ureg0, sp, #-48 : IntAlu : D=0x0000007ffffef8b0 flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
7831021500: system.cpu: A0 T0 : @strcat. 1 : strxi_uop x29, [ureg0] : MemWrite : D=0x0000007ffffef8e0 A=0x7ffffef8b0 flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsDelayedCommit)
7831022000: system.cpu: A0 T0 : @strcat. 2 : strxi_uop x30, [ureg0, #8] : MemWrite : D=0x000000000040a4c4 A=0x7ffffef8b8 flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsDelayedCommit)
7831022500: system.cpu: A0 T0 : @strcat. 3 : addxi_uop sp, ureg0, #0 : IntAlu : D=0x0000007ffffef8b0 flags=(IsInteger|IsMicroop|IsLastMicroop)
7831023000: system.cpu: A0 T0 : @strcat+4 : add x29, sp, #0 : IntAlu : D=0x0000007ffffef8b0 flags=(IsInteger)
7831023500: system.cpu: A0 T0 : @strcat+8 : str x19, [sp, #16] : MemWrite : D=0x00000000004d6560 A=0x7ffffef8c0 flags=(IsInteger|IsMemRef|IsStore)
7831024000: system.cpu: A0 T0 : @strcat+12 : orr x19, xzr, x0 : IntAlu : D=0x0000003ffff00010 flags=(IsInteger)
7831024500: system.cpu: A0 T0 : @strcat+16 : str x1, [sp, #40] : MemWrite : D=0x0000007ffffef988 A=0x7ffffef8d8 flags=(IsInteger|IsMemRef|IsStore)
7831025000: system.cpu: A0 T0 : @strcat+20 : bl <_init+120> : IntAlu : D=0x00000000004464c8 flags=(IsInteger|IsControl|IsDirectControl|IsUncondControl|IsCall)
7831025500: system.cpu: A0 T0 : @_init+120 : adrp x16, #835584 : IntAlu : D=0x00000000004cc000 flags=(IsInteger)
7831026000: system.cpu: A0 T0 : @_init+124 : ldr x17, [x16, #48] : MemRead : D=0x0000000000449680 A=0x4cc030 flags=(IsInteger|IsMemRef|IsLoad)
7831026500: system.cpu: A0 T0 : @_init+128 : add x16, x16, #48 : IntAlu : D=0x00000000004cc030 flags=(IsInteger)
7831027000: system.cpu: A0 T0 : @_init+132 : br x17 : IntAlu : flags=(IsInteger|IsControl|IsIndirectControl|IsUncondControl)
....
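For reference when reading these traces: the leading numbers are gem5 ticks, and assuming gem5's default 1 THz tick rate (1 tick = 1 ps), the 500-tick spacing between consecutive instructions corresponds to the atomic CPU retiring one instruction per cycle at the default 2 GHz clock:

```python
# Convert the 500-tick spacing seen in the ExecAll traces above into an
# instruction rate, assuming gem5's default 1 THz tick rate (1 tick = 1 ps).
TICKS_PER_SECOND = 10 ** 12
delta_ticks = 7837835000 - 7837834500
seconds_per_insn = delta_ticks / TICKS_PER_SECOND
print(seconds_per_insn)        # 5e-10
print(1 / seconds_per_insn)    # 2000000000.0, i.e. 2e9 instructions/second
```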
Their build/run system is nice, it even supports user mode simulators out-of-the-box! TODO give it a shot. See:

....
RUN =
RUN_FLAGS =
....

under `util/make/linux64.mak`.

Tested on a7ae8e6a8e29ef46d79eb9178d8599d1faeea0e5 + 1.

==== Microbenchmarks

It eventually has to come to that, doesn't it?
@@ -21581,297 +21333,6 @@ Bibliography:

* https://stackoverflow.com/questions/2842695/what-is-microbenchmarking
===== Dhrystone

https://en.wikipedia.org/wiki/Dhrystone

Created in the 80's, it is not a representative measure of performance in modern computers anymore. It has mostly been replaced by https://en.wikipedia.org/wiki/SPECint[SPEC], which is... closed source! Unbelievable.

Dhrystone is very simple:

* there is one loop in the `dhry_1.c` main function that gets executed N times
* that loop calls 9 short functions called `Proc_0` to `Proc_9`, most of which are defined in `dhry_1.c`, and a few others in `dhry_2.c`

The benchmark is single-threaded.

After a quick look at it, Dhrystone in `-O3` is very likely completely CPU bound, as there are no loops over variable sized arrays, except for some dummy ones that only run once. It just does a bunch of operations on local and global C variables, which are very likely to be inlined and treated fully in registers until the final write back, or to fit entirely in cache. TODO confirm with some kind of measurement. The benchmark also makes no syscalls except for measuring time and reporting results.

<<buildroot>> has a `dhrystone` package, but because it is so interesting to us, we decided to also build it ourselves, which allows things like static and baremetal compilation more easily.
Build and run on QEMU <<user-mode-simulation>>:

....
git submodule update --init submodules/dhrystone
./build-dhrystone --optimization-level 3
./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
....

TODO: automate the run more nicely to dispense with `getvar`.
Increase the number of loops to try and reach more meaningful results:

....
./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone" --cli-args 100000000
....

Build and run on gem5 user mode:

....
./build-dhrystone --optimization-level 3
./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
....

Run natively on the host:

....
./build-dhrystone --host
"$(./getvar --host userland_build_dir)/submodules/dhrystone/dhrystone" 1000000000
....
Sample output for <<p51>> Ubuntu 20.04:

....
Microseconds for one run through Dhrystone: 0.1
Dhrystones per Second: 16152479.0
....
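The two output lines are consistent with each other up to rounding (1 / 16152479 s is about 0.06 µs, which the benchmark reports rounded as 0.1). A commonly quoted derived figure, not printed by the benchmark itself, is DMIPS: Dhrystones per second divided by 1757, the score of the VAX 11/780 reference machine:

```python
# Derive the per-run time and DMIPS from the sample output above.
dhrystones_per_second = 16152479.0
microseconds_per_run = 1e6 / dhrystones_per_second
dmips = dhrystones_per_second / 1757  # VAX 11/780 reference score
print(round(microseconds_per_run, 3))  # 0.062, reported rounded as 0.1
print(round(dmips))                    # 9193
```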
Build Dhrystone for <<baremetal>> and run it on QEMU:

....
# Build our Newlib stubs.
./build-baremetal --arch aarch64
./build-dhrystone --arch aarch64 --mode baremetal
./run --arch aarch64 --baremetal "$(./getvar --arch aarch64 baremetal_build_dir)/submodules/dhrystone/dhrystone" --cli-args 10000
....

or with gem5:

....
# Build our Newlib stubs.
./build-baremetal --arch aarch64
./build-dhrystone --arch aarch64 --emulator gem5 --mode baremetal
./run --arch aarch64 --baremetal "$(./getvar --arch aarch64 --emulator gem5 baremetal_build_dir)/submodules/dhrystone/dhrystone" --cli-args 10000 --emulator gem5
....

If you really want the Buildroot package for some reason, build it with:

....
./build-buildroot --config 'BR2_PACKAGE_DHRYSTONE=y'
....

and run inside the guest from `PATH` with:

....
dhrystone
....
===== LMbench

http://www.bitmover.com/lmbench/

Canonical source at https://sourceforge.net/projects/lmbench/ but Intel has a fork at https://github.com/intel/lmbench which has more recent build updates, so I think that's the one I'd put my money on as of 2020.

Feels old, and guessing it is not representative anymore, like <<dhrystone>>. But hey, history!

Ubuntu 20.04 AMD64 native build and run:
....
git submodule update --init submodules/lmbench
cd submodules/lmbench
cd src
make results
....

TODO it hangs for a long time at:

....
Hang on, we are calculating your cache line size.
....

Bug report: https://github.com/intel/lmbench/issues/15

If I kill it, the configuration process continues:
....
Killed
OK, it looks like your cache line is  bytes.
....

and continues with a few more interactive questions until finally:

....
Confguration done, thanks.
....

where it again hangs for at least 2 hours, so I lost patience and killed it.
TODO: how to do a non-interactive config? After the above procedure, `bin/x86_64-linux-gnu/CONFIG.ciro-p51` contains:

....
DISKS=""
DISK_DESC=""
OUTPUT=/dev/null
ENOUGH=50000
FASTMEM="NO"
FILE=/var/tmp/XXX
FSDIR=/var/tmp
INFO=INFO.ciro-p51
LINE_SIZE=
LOOP_O=0.00000000
MAIL=no
TOTAL_MEM=31903
MB=22332
MHZ="-1 System too busy"
MOTHERBOARD=""
NETWORKS=""
OS="x86_64-linux-gnu"
PROCESSORS="8"
REMOTE=""
SLOWFS="NO"
SYNC_MAX="1"
LMBENCH_SCHED="DEFAULT"
TIMING_O=0
RSH=rsh
RCP=rcp
VERSION=lmbench-3alpha4
BENCHMARK_HARDWARE=YES
BENCHMARK_OS=YES
BENCHMARK_SYSCALL=
BENCHMARK_SELECT=
BENCHMARK_PROC=
BENCHMARK_CTX=
BENCHMARK_PAGEFAULT=
BENCHMARK_FILE=
BENCHMARK_MMAP=
BENCHMARK_PIPE=
BENCHMARK_UNIX=
BENCHMARK_UDP=
BENCHMARK_TCP=
BENCHMARK_CONNECT=
BENCHMARK_RPC=
BENCHMARK_HTTP=
BENCHMARK_BCOPY=
BENCHMARK_MEM=
BENCHMARK_OPS=
....
Native build only, without running tests:

....
cd src
make
....
Interestingly, one of the creators of LMbench, Larry McVoy (https://www.linkedin.com/in/larrymcvoy/[], https://en.wikipedia.org/wiki/Larry_McVoy[]), is also a co-founder of https://en.wikipedia.org/wiki/BitKeeper[BitKeeper]. Their SCM must be blazingly fast!!! Also his LinkedIn says Intel uses it. But it will forever be remembered as "the closed source Git precursor that died N years ago", RIP.
===== STREAM benchmark

http://www.cs.virginia.edu/stream/ref.html

Very simple memory bandwidth benchmark with one C and one Fortran version, originally published in 1991; the latest version at the time of writing is from 2013.

Its operation is very simple: fork one thread for each CPU in the system (using OpenMP) and do the following four array operations (4 separate loops of individual operations):
....
    /* Copy. */
    times[0 * ntimes + k] = mysecond();
#pragma omp parallel for
    for (j=0; j<stream_array_size; j++)
        c[j] = a[j];
    times[0 * ntimes + k] = mysecond() - times[0 * ntimes + k];

    /* Scale. */
    times[1 * ntimes + k] = mysecond();
#pragma omp parallel for
    for (j=0; j<stream_array_size; j++)
        b[j] = scalar*c[j];
    times[1 * ntimes + k] = mysecond() - times[1 * ntimes + k];

    /* Add. */
    times[2 * ntimes + k] = mysecond();
#pragma omp parallel for
    for (j=0; j<stream_array_size; j++)
        c[j] = a[j]+b[j];
    times[2 * ntimes + k] = mysecond() - times[2 * ntimes + k];

    /* Triad. */
    times[3 * ntimes + k] = mysecond();
#pragma omp parallel for
    for (j=0; j<stream_array_size; j++)
        a[j] = b[j]+scalar*c[j];
    times[3 * ntimes + k] = mysecond() - times[3 * ntimes + k];
}
....
See also: https://stackoverflow.com/questions/56086993/what-does-stream-memory-bandwidth-benchmark-really-measure

Ubuntu 20.04 native build and run:

....
git submodule update --init submodules/stream-benchmark
cd submodules/stream-benchmark
make
./stream_c.exe
....
Sample output:

....
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 7027 microseconds.
   (= 7027 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           20123.2     0.008055     0.007951     0.008267
Scale:          20130.4     0.008032     0.007948     0.008177
Add:            22528.8     0.010728     0.010653     0.010867
Triad:          22448.4     0.010826     0.010691     0.011352
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
....
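The reported rates follow from the array size and the best (minimum) times: Copy reads one array and writes another (2 accesses of 8 bytes per element), while Triad touches three arrays, and STREAM counts MB as 10^6 bytes:

```python
# Reproduce the Copy and Triad "Best Rate MB/s" values from the sample
# output above (MB = 10**6 bytes in STREAM).
elements = 10_000_000
bytes_per_element = 8
copy_bytes = 2 * elements * bytes_per_element   # read a[], write c[]
triad_bytes = 3 * elements * bytes_per_element  # read b[], c[], write a[]
print(copy_bytes / 0.007951 / 1e6)   # ~20123, the reported Copy rate
print(triad_bytes / 0.010691 / 1e6)  # ~22449, vs the reported 22448.4
```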
The LKMC usage of STREAM is analogous to that of <<dhrystone>>. Build and run on QEMU <<user-mode-simulation>>:

....
./build-stream --optimization-level 3
./run --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe"
....

Decrease the benchmark size and the retry count to finish the simulation faster, at the cost of a possibly less representative result:

....
./run --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe" --cli-args '100 2'
....

Build and run on gem5 user mode:

....
./build-stream --optimization-level 3
./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe" --cli-args '1000 2'
....
[[userland-libs-directory]]
=== userland/libs directory
@@ -26836,28 +26297,6 @@ Summary of manually collected results on <<p51>> at LKMC a18f28e263c91362519ef55

|
|605448f07e6380634b1aa7e9732d111759f69fd
|<<dhrystone>> `-O3`
|`gem5 --arch aarch64`
|4 * 10^5
|68
|9.2034139 * 10^7
|1.6

|
|5d233f2664a78789f9907d27e2a40e86cefad595
|<<stream-benchmark>> `-O3`
|`./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args 1000000 --trace ExecAll`
|3 * 10^5 * 2
|64
|9.9674773 * 10^7
|1.6

|glibc C pre-main effects
|ab6f7331406b22f8ab6e2df5f8b8e464fb35b611
|link:userland/c/m5ops.c[] `-O0`
@@ -26973,9 +26412,7 @@ We can then repeat the experiment for other gem5 CPUs to see how they compare.

Let's see if user mode runs considerably faster than full system or not, ignoring the kernel boot.

-First we build <<dhrystone>> manually statically since dynamic linking is broken in gem5 as explained at: xref:gem5-syscall-emulation-mode[xrefstyle=full].
+TODO: move this section to our new custom dhrystone setup: xref:dhrystone[xrefstyle=full].
+First we build dhrystone manually statically since dynamic linking is broken in gem5 as explained at: xref:gem5-syscall-emulation-mode[xrefstyle=full].

gem5 user mode:
@@ -27231,35 +26668,6 @@ Reddit threads:

===== P51 benchmarks

<<dhrystone>> on Ubuntu 20.04 results at <<dhrystone>>.

<<stream-benchmark>> on Ubuntu 20.04 results at <<stream-benchmark>>.

====== P51 CoreMark-Pro

<<coremark,CoreMark-Pro>> d5b4f2ba7ba31e37a5aa93423831e7d5eb933868 on Ubuntu 20.04 with `XCMD="-c$(nproc)"`:
....
                                                 MultiCore SingleCore
Workload Name                                     (iter/s)   (iter/s)    Scaling
----------------------------------------------- ---------- ---------- ----------
cjpeg-rose7-preset                                  769.23     175.44       4.38
core                                                  7.98       2.11       3.78
linear_alg-mid-100x100-sp                           892.86     233.64       3.82
loops-all-mid-10k-sp                                 35.84       7.58       4.73
nnet_test                                            35.09      10.05       3.49
parser-125k                                         125.00      20.41       6.12
radix2-big-64k                                     3278.69     630.91       5.20
sha-test                                            625.00     227.27       2.75
zip-test                                            615.38     166.67       3.69

MARK RESULTS TABLE

Mark Name                                        MultiCore SingleCore    Scaling
----------------------------------------------- ---------- ---------- ----------
CoreMark-PRO                                      25016.00    6079.70       4.11
....

===== P51 maintenance history

Bought: 2017 for approximately 2400 pounds.
@@ -1,53 +0,0 @@
#!/usr/bin/env python3

import os
import shutil

import common
import shlex
from shell_helpers import LF

class Main(common.BuildCliFunction):
    def __init__(self):
        super().__init__(
            description='''\
https://cirosantilli.com/linux-kernel-module-cheat#dhrystone
'''
        )
        self._add_argument('--ccflags')
        self._add_argument('--force-rebuild')
        self._add_argument('--optimization-level')

    def setup(self, env):
        self.root_relpath = os.path.join('submodules', 'dhrystone')

    def build(self):
        build_dir = self.get_build_dir()
        extra_flags = []
        if self.env['force_rebuild']:
            extra_flags.extend(['-B', LF])
        ret = self.sh.run_cmd(
            [
                'make', LF,
                '-j', str(self.env['nproc']), LF,
                '-C', os.path.join(self.env['submodules_dir'], 'dhrystone'), LF,
                'CC={}'.format(self.env['gcc_path']), LF,
                'CFLAGS_EXTRA={}'.format(self.sh.cmd_to_string(self.env['ccflags'], force_oneline=True)), LF,
                'LDFLAGS_EXTRA={}'.format(self.sh.cmd_to_string(self.env['ldflags'], force_oneline=True)), LF,
                'EXTRA_OBJS={}'.format(' '.join(self.env['extra_objs'])), LF,
                'OUT_DIR={}'.format(build_dir), LF,
            ]
            + extra_flags
        )
        if ret == 0 and self.env['copy_overlay']:
            self.sh.copy_file_if_update(
                os.path.join(build_dir, 'dhrystone'),
                os.path.join(self.env['out_rootfs_overlay_lkmc_dir'], self.root_relpath, 'dhrystone'),
            )
        return ret

    def get_build_dir(self):
        return os.path.join(self.env['build_dir'], self.root_relpath)

if __name__ == '__main__':
    Main().cli()
67
build-stream
@@ -1,67 +0,0 @@
#!/usr/bin/env python3

'''
TODO factor this out with the ./build-buildroot setup which is more advanced.
'''

import os
import shutil

import common
import shlex
from shell_helpers import LF

class Main(common.BuildCliFunction):
    def __init__(self):
        super().__init__(
            description='''\
https://cirosantilli.com/linux-kernel-module-cheat#stream-benchmark
'''
        )
        self._add_argument('--ccflags')
        self._add_argument('--force-rebuild')
        self._add_argument('--optimization-level')

    def setup(self, env):
        self.root_relpath = os.path.join('submodules', 'stream-benchmark')

    def build(self):
        build_dir = self.get_build_dir()
        cflags = ['-O{}'.format(self.env['optimization_level'])]
        extra_flags = []
        if self.env['static']:
            cflags.extend(['-static'])
        if self.env['force_rebuild']:
            extra_flags.extend(['-B', LF])
        if self.env['mode'] == 'baremetal':
            extra_objs = [
                self.env['baremetal_syscalls_obj'],
                self.env['baremetal_syscalls_asm_obj']
            ]
        else:
            extra_objs = []
        ret = self.sh.run_cmd(
            [
                'make', LF,
                '-j', str(self.env['nproc']), LF,
                '-C', os.path.join(self.env['submodules_dir'], 'stream-benchmark'), LF,
                'CC={}'.format(self.env['gcc_path']), LF,
                'CFLAGS_EXTRA={}'.format(' '.join(cflags)), LF,
                'EXTRA_OBJS={}'.format(' '.join(extra_objs)), LF,
                'FC={}'.format(self.env['gfortran_path']), LF,
                'OUT_DIR={}'.format(build_dir), LF,
            ]
            + extra_flags
        )
        if ret == 0 and self.env['copy_overlay']:
            self.sh.copy_file_if_update(
                os.path.join(build_dir, 'stream_c.exe'),
                os.path.join(self.env['out_rootfs_overlay_lkmc_dir'], self.root_relpath, 'stream-benchmark'),
            )
        return ret

    def get_build_dir(self):
        return os.path.join(self.env['build_dir'], self.root_relpath)

if __name__ == '__main__':
    Main().cli()
Submodule submodules/dhrystone deleted from 35f961d841
Submodule submodules/lmbench deleted from 5ddee65534
Submodule submodules/omp-validation deleted from 926f8e1619
Submodule submodules/stream-benchmark deleted from eaa0a90ded