mirror of
https://github.com/cirosantilli/linux-kernel-module-cheat.git
synced 2026-01-23 02:05:57 +01:00
Build improvements
12
.gitmodules
vendored
@@ -11,9 +11,6 @@
 [submodule "submodules/crosstool-ng"]
 	path = submodules/crosstool-ng
 	url = https://github.com/cirosantilli/crosstool-ng
-[submodule "submodules/dhrystone"]
-	path = submodules/dhrystone
-	url = https://github.com/cirosantilli/dhrystone
 [submodule "submodules/gcc"]
 	path = submodules/gcc
 	url = https://github.com/cirosantilli/gcc
@@ -36,12 +33,6 @@
 [submodule "submodules/linux"]
 	path = submodules/linux
 	url = https://github.com/cirosantilli/linux
-[submodule "submodules/lmbench"]
-	path = submodules/lmbench
-	url = https://github.com/cirosantilli/lmbench
-[submodule "submodules/omp-validation"]
-	path = submodules/omp-validation
-	url = https://github.com/cirosantilli/omp-validation
 [submodule "submodules/parsec-benchmark"]
 	path = submodules/parsec-benchmark
 	url = https://github.com/cirosantilli/parsec-benchmark
@@ -51,9 +42,6 @@
 [submodule "submodules/xen"]
 	path = submodules/xen
 	url = https://github.com/cirosantilli/xen
-[submodule "submodules/stream-benchmark"]
-	path = submodules/stream-benchmark
-	url = https://github.com/cirosantilli/stream-benchmark
 [submodule "submodules/freebsd"]
 	path = submodules/freebsd
 	url = https://github.com/cirosantilli/freebsd
596
README.adoc
@@ -11337,7 +11337,7 @@ ____

OK, this is why we used gem5 in the first place: performance measurements!

-Let's see how many cycles <<dhrystone>>, which Buildroot provides, takes for a few different input parameters.
+Let's see how many cycles dhrystone, which Buildroot provides, takes for a few different input parameters.

We will do that for various input parameters on full system, by taking a checkpoint after a fast atomic CPU boot finishes, and then restoring in a more detailed mode to run the benchmark:
@@ -20138,84 +20138,6 @@ The implementation lives under `libgomp` in the GCC tree, and is documented at:

`strace` shows that OpenMP makes `clone()` syscalls in Linux. TODO: does it actually call `pthread_` functions, or does it make syscalls directly? Or in other words, can it work on <<freestanding-programs>>? A quick grep shows many references to pthreads.
====== OpenMP validation

https://github.com/uhhpctools/omp-validation

Host build on Ubuntu 20.04:

....
git submodule update --init submodules/omp-validation
cd submodules/omp-validation
PERL5LIB="${PERL5LIB}:." make -j `nproc` ctest
....

This both builds and runs; it took about 5 minutes on <<p51>>, but had some test failures for some reason:

....
Summary:
S Number of tested Open MP constructs: 62
S Number of used tests: 123
S Number of failed tests: 4
S Number of successful tests: 119
S + from this were verified: 115

Normal tests:
N Number of failed tests: 2
N + from this fail compilation: 0
N + from this timed out 0
N Number of successful tests: 60
N + from this were verified: 58

Orphaned tests:
O Number of failed tests: 2
O + from this fail compilation: 0
O + from this timed out 0
O Number of successful tests: 59
O + from this were verified: 57
....

The tests and run results are placed under `bin/c/`, e.g.:

....
test_omp_threadprivate
test_omp_threadprivate.c
test_omp_threadprivate.log
test_omp_threadprivate.out
test_omp_threadprivate_compile.log
....

C files are also present, as some kind of code generation is used.

Build only, and then run one of them manually:

....
make -j`nproc` omp_my_sleep omp_testsuite
PERL5LIB="${PERL5LIB}:." ./runtest.pl --lang=c --norun testlist-c.txt
./bin/c/test_omp_barrier
....

The `bin/c` directory is hardcoded in the executable, so to run it you must ensure that it exists relative to the CWD, e.g.:

....
cd bin/c
mkdir -p bin/c
./test_omp_barrier
....

Manually cross compile all tests and optionally add some extra options, e.g. `-static` to <<gem5-dynamic-linked-executables-in-syscall-emulation,more conveniently run in gem5>>:

....
PERL5LIB="${PERL5LIB}:." ./runtest.pl --lang=c --makeopts 'CC=aarch64-linux-gnu-gcc CFLAGS_EXTRA=-static' --norun testlist-c.txt
./../../run --arch aarch64 --emulator gem5 --userland submodules/omp-validation/bin/c/test_omp_parallel_reduction --cpus 8 --memory 8G
....

Build a single test:

....
make bin/c/test_omp_sections_reduction
....

[[cpp]]
=== C++
@@ -21391,176 +21313,6 @@ Don't forget to explicitly rebuild PARSEC with:

You may also want to test if your patches are still functionally correct inside of QEMU first, which is a faster emulator.

* sell your soul, and compile natively inside the guest. We won't do this, not only because it is evil, but also because Buildroot explicitly does not support it: https://buildroot.org/downloads/manual/manual.html#faq-no-compiler-on-target ARM employees have been known to do this: https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/qemu-patch.diff
===== Coremark

https://en.wikipedia.org/wiki/Coremark

Part of https://en.wikipedia.org/wiki/EEMBC[EEMBC].

They have two versions:

* 2009: https://github.com/eembc/coremark
* 2015: https://github.com/eembc/coremark-pro
+
The README describes very clearly what tests it does. Most of them are understandable high level operations.
+
In particular, it contains "a greatly improved version of the https://en.wikipedia.org/wiki/Livermore_loops[Livermore loops]"

Both have a custom license, so yeah, no patience to read this stuff.

Coremark-pro build and run on Ubuntu 20.04:

....
git submodule update --init submodules/coremark-pro
cd submodules/coremark-pro
make TARGET=linux64 build
make TARGET=linux64 XCMD='-c4' certify-all
....
This uses `4` contexts. TODO: what are contexts? Are they the same as threads? You likely want to use `-c$(nproc)` in practice instead.
Finishes in a few seconds, <<p51>> results:

....
                                                 MultiCore SingleCore
Workload Name                                     (iter/s)   (iter/s)    Scaling
----------------------------------------------- ---------- ---------- ----------
cjpeg-rose7-preset                                  526.32     178.57       2.95
core                                                  7.39       2.16       3.42
linear_alg-mid-100x100-sp                           684.93     238.10       2.88
loops-all-mid-10k-sp                                 27.65       7.80       3.54
nnet_test                                            32.79      10.57       3.10
parser-125k                                          71.43      25.00       2.86
radix2-big-64k                                     2320.19     623.44       3.72
sha-test                                            555.56     227.27       2.44
zip-test                                            363.64     166.67       2.18

MARK RESULTS TABLE

Mark Name                                        MultiCore SingleCore    Scaling
----------------------------------------------- ---------- ---------- ----------
CoreMark-PRO                                      18743.79    6306.76       2.97
....
More sample results: <<p51-coremark-pro>>.

And scaling appears to be the ratio between multicore (4 contexts due to `-c4`) and single core performance; each benchmark gets run twice, once multicore and once single core.
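As a sanity check, the Scaling column can be reproduced by dividing the multicore rate by the single core rate; the numbers below are just a few rows from the sample table above:

```python
# Reproduce the Scaling column of the CoreMark-Pro summary table:
# scaling = multicore rate / single core rate.
rows = {
    'cjpeg-rose7-preset': (526.32, 178.57),
    'radix2-big-64k': (2320.19, 623.44),
    'zip-test': (363.64, 166.67),
}
for name, (multi, single) in rows.items():
    # Matches the reported 2.95, 3.72 and 2.18 up to rounding.
    print('{}: {:.2f}'.format(name, multi / single))
```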
The tester script also outputs test commands, some of which are:

....
builds/linux64/gcc64/bin/zip-test.exe -c1 -w1 -c4 -v1
builds/linux64/gcc64/bin/zip-test.exe -c1 -w1 -c4 -v0
builds/linux64/gcc64/bin/zip-test.exe -c4 -v1
builds/linux64/gcc64/bin/zip-test.exe -c4 -v0
....
`-v1` appears to be a fast verification run, and both `-c1` and `-c4` get run to obtain the single vs multicore performance.
Sample `-c4 -v0` output:

....
- Info: Starting Run...
-- Workload:zip-test=946108807
-- zip-test:time(ns)=11
-- zip-test:contexts=4
-- zip-test:iterations=4
-- zip-test:time(secs)= 0.011
-- zip-test:secs/workload= 0.00275
-- zip-test:workloads/sec= 363.636
-- Done:zip-test=946108807
....

and so we see that the `zip-test:workloads/sec= 363.636` output is the key value, which is close to the `zip-test 363.64` entry in the earlier full summarized result.
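The lines of that output are internally consistent, assuming `secs/workload` is `time(secs)` divided by `iterations`, and `workloads/sec` is its reciprocal:

```python
# Derive secs/workload and workloads/sec from the time and iteration
# counts in the sample zip-test output above.
time_secs = 0.011
iterations = 4
secs_per_workload = time_secs / iterations
workloads_per_sec = 1 / secs_per_workload
print(secs_per_workload)            # 0.00275
print(round(workloads_per_sec, 3))  # 363.636
```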
Cross compile statically for aarch64. From LKMC toplevel:

....
make \
  -C submodules/coremark-pro \
  LINKER_FLAGS='-static' \
  LINKER_LAST='-lm -lpthread -lrt' \
  TARGET=gcc-cross-linux \
  TOOLCHAIN=gcc-cross-linux \
  TOOLS="$(./getvar --arch aarch64 buildroot_host_usr_dir)" \
  TPREF="$(./getvar --arch aarch64 buildroot_toolchain_prefix)-" \
  build \
;
....
Run a single executable on QEMU:

....
./run --arch aarch64 --userland submodules/coremark-pro/builds/gcc-cross-linux/bin/zip-test.exe --cli-args='-c4 -v0'
....

Finishes in about 1 second, and gives `zip-test:workloads/sec= 74.0741`, so it ran about 5x slower than the native host.
Run a single executable on gem5 in a verification run:

....
./run \
  --arch aarch64 \
  --cli-args='-c1 -v1' \
  --emulator gem5 \
  --userland submodules/coremark-pro/builds/gcc-cross-linux/bin/zip-test.exe \
;
....
TODO: hangs for at least 15 minutes, there must be something wrong. It is stuck in an evolving `strlen` loop:

....
7837834500: system.cpu: A0 T0 : @__strlen_generic+112 : ldp
7837834500: system.cpu: A0 T0 : @__strlen_generic+112. 0 : addxi_uop ureg0, x1, #16 : IntAlu : D=0x0000003ffff07170 flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
7837835000: system.cpu: A0 T0 : @__strlen_generic+112. 1 : ldp_uop x2, x3, [ureg0] : MemRead : D=0x20703c0a3e702f3c A=0x3ffff07170 flags=(IsInteger|IsMemRef|IsLoad|IsMicroop|IsLastMicroop)
7837835500: system.cpu: A0 T0 : @__strlen_generic+116 : sub x4, x2, x8 : IntAlu : D=0x3d607360632e3b34 flags=(IsInteger)
7837836000: system.cpu: A0 T0 : @__strlen_generic+120 : sub x6, x3, x8 : IntAlu : D=0x1f6f3b093d6f2e3b flags=(IsInteger)
7837836500: system.cpu: A0 T0 : @__strlen_generic+124 : orr x5, x4, x6 : IntAlu : D=0x3f6f7b697f6f3f3f flags=(IsInteger)
7837837000: system.cpu: A0 T0 : @__strlen_generic+128 : ands x5, x8, LSL #7 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
7837837500: system.cpu: A0 T0 : @__strlen_generic+132 : b.eq <__strlen_generic+88> : IntAlu : flags=(IsControl|IsDirectControl|IsCondControl)
7837838000: system.cpu: A0 T0 : @__strlen_generic+88 : ldp
7837838000: system.cpu: A0 T0 : @__strlen_generic+88. 0 : addxi_uop ureg0, x1, #32 : IntAlu : D=0x0000003ffff07180 flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
7837838500: system.cpu: A0 T0 : @__strlen_generic+88. 1 : ldp_uop x2, x3, [ureg0] : MemRead : D=0x6565686b636f4c27 A=0x3ffff07180 flags=(IsInteger|IsMemRef|IsLoad|IsMicroop|IsDelayedCommit)
7837839000: system.cpu: A0 T0 : @__strlen_generic+88. 2 : addxi_uop x1, ureg0, #0 : IntAlu : D=0x0000003ffff07180 flags=(IsInteger|IsMicroop|IsLastMicroop)
7837839500: system.cpu: A0 T0 : @__strlen_generic+92 : sub x4, x2, x8 : IntAlu : D=0x3c786d606f6c6e62 flags=(IsInteger)
7837840000: system.cpu: A0 T0 : @__strlen_generic+96 : sub x6, x3, x8 : IntAlu : D=0x6464676a626e4b26 flags=(IsInteger)
7837840500: system.cpu: A0 T0 : @__strlen_generic+100 : orr x5, x4, x6 : IntAlu : D=0x7c7c6f6a6f6e6f66 flags=(IsInteger)
7837841000: system.cpu: A0 T0 : @__strlen_generic+104 : ands x5, x8, LSL #7 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
....
Instructions before `__strlen_generic` starts:

....
7831019000: system.cpu: A0 T0 : @define_params_zip+664 : add x1, sp, #168 : IntAlu : D=0x0000007ffffef988 flags=(IsInteger)
7831019500: system.cpu: A0 T0 : @define_params_zip+668 : orr x0, xzr, x24 : IntAlu : D=0x0000003ffff00010 flags=(IsInteger)
7831020000: system.cpu: A0 T0 : @define_params_zip+672 : bl <th_strcat> : IntAlu : D=0x000000000040a4c4 flags=(IsInteger|IsControl|IsDirectControl|IsUncondControl|IsCall)
7831020500: system.cpu: A0 T0 : @th_strcat : b <strcat> : IntAlu : flags=(IsControl|IsDirectControl|IsUncondControl)
7831021000: system.cpu: A0 T0 : @strcat : stp
7831021000: system.cpu: A0 T0 : @strcat. 0 : addxi_uop ureg0, sp, #-48 : IntAlu : D=0x0000007ffffef8b0 flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
7831021500: system.cpu: A0 T0 : @strcat. 1 : strxi_uop x29, [ureg0] : MemWrite : D=0x0000007ffffef8e0 A=0x7ffffef8b0 flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsDelayedCommit)
7831022000: system.cpu: A0 T0 : @strcat. 2 : strxi_uop x30, [ureg0, #8] : MemWrite : D=0x000000000040a4c4 A=0x7ffffef8b8 flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsDelayedCommit)
7831022500: system.cpu: A0 T0 : @strcat. 3 : addxi_uop sp, ureg0, #0 : IntAlu : D=0x0000007ffffef8b0 flags=(IsInteger|IsMicroop|IsLastMicroop)
7831023000: system.cpu: A0 T0 : @strcat+4 : add x29, sp, #0 : IntAlu : D=0x0000007ffffef8b0 flags=(IsInteger)
7831023500: system.cpu: A0 T0 : @strcat+8 : str x19, [sp, #16] : MemWrite : D=0x00000000004d6560 A=0x7ffffef8c0 flags=(IsInteger|IsMemRef|IsStore)
7831024000: system.cpu: A0 T0 : @strcat+12 : orr x19, xzr, x0 : IntAlu : D=0x0000003ffff00010 flags=(IsInteger)
7831024500: system.cpu: A0 T0 : @strcat+16 : str x1, [sp, #40] : MemWrite : D=0x0000007ffffef988 A=0x7ffffef8d8 flags=(IsInteger|IsMemRef|IsStore)
7831025000: system.cpu: A0 T0 : @strcat+20 : bl <_init+120> : IntAlu : D=0x00000000004464c8 flags=(IsInteger|IsControl|IsDirectControl|IsUncondControl|IsCall)
7831025500: system.cpu: A0 T0 : @_init+120 : adrp x16, #835584 : IntAlu : D=0x00000000004cc000 flags=(IsInteger)
7831026000: system.cpu: A0 T0 : @_init+124 : ldr x17, [x16, #48] : MemRead : D=0x0000000000449680 A=0x4cc030 flags=(IsInteger|IsMemRef|IsLoad)
7831026500: system.cpu: A0 T0 : @_init+128 : add x16, x16, #48 : IntAlu : D=0x00000000004cc030 flags=(IsInteger)
7831027000: system.cpu: A0 T0 : @_init+132 : br x17 : IntAlu : flags=(IsInteger|IsControl|IsIndirectControl|IsUncondControl)
....
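For reference when reading these traces: the leading numbers are gem5 ticks, and assuming gem5's default 1 THz tick rate (1 tick = 1 ps), the 500-tick spacing between consecutive instructions corresponds to the atomic CPU retiring one instruction per cycle at the default 2 GHz clock:

```python
# Convert the 500-tick spacing seen in the ExecAll traces above into an
# instruction rate, assuming gem5's default 1 THz tick rate (1 tick = 1 ps).
TICKS_PER_SECOND = 10 ** 12
delta_ticks = 7837835000 - 7837834500
seconds_per_insn = delta_ticks / TICKS_PER_SECOND
print(seconds_per_insn)        # 5e-10
print(1 / seconds_per_insn)    # 2000000000.0, i.e. 2e9 instructions/second
```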
Their build/run system is nice, it even supports user mode simulators out-of-the-box! TODO give it a shot. See:

....
RUN =
RUN_FLAGS =
....

under `util/make/linux64.mak`.

Tested on a7ae8e6a8e29ef46d79eb9178d8599d1faeea0e5 + 1.

==== Microbenchmarks

It eventually has to come to that, doesn't it?
@@ -21581,297 +21333,6 @@ Bibliography:

* https://stackoverflow.com/questions/2842695/what-is-microbenchmarking
===== Dhrystone

https://en.wikipedia.org/wiki/Dhrystone

Created in the 80's, it is not a representative measure of performance in modern computers anymore. It has mostly been replaced by https://en.wikipedia.org/wiki/SPECint[SPEC], which is... closed source! Unbelievable.

Dhrystone is very simple:

* there is one loop in the `dhry_1.c` main function that gets executed N times
* that loop calls 9 short functions called `Proc_0` to `Proc_9`, most of which are defined in `dhry_1.c`, and a few others in `dhry_2.c`

The benchmark is single-threaded.

After a quick look at it, Dhrystone in `-O3` is very likely completely CPU bound, as there are no loops over variable sized arrays, except for some dummy ones that only run once. It just does a bunch of operations on local and global C variables, which are very likely to be inlined and treated fully in registers until the final write back, or to fit entirely in cache. TODO confirm with some kind of measurement. The benchmark also makes no syscalls except for measuring time and reporting results.

<<buildroot>> has a `dhrystone` package, but because it is so interesting to us, we decided to also build it ourselves, which allows things like static and baremetal compilation more easily.
Build and run on QEMU <<user-mode-simulation>>:

....
git submodule update --init submodules/dhrystone
./build-dhrystone --optimization-level 3
./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
....

TODO: automate the run more nicely to dispense with `getvar`.
Increase the number of loops to try and reach more meaningful results:

....
./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone" --cli-args 100000000
....

Build and run on gem5 user mode:

....
./build-dhrystone --optimization-level 3
./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
....

Run natively on the host:

....
./build-dhrystone --host
"$(./getvar --host userland_build_dir)/submodules/dhrystone/dhrystone" 1000000000
....
Sample output for <<p51>> Ubuntu 20.04:

....
Microseconds for one run through Dhrystone: 0.1
Dhrystones per Second: 16152479.0
....
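The two output lines are consistent with each other up to rounding (1 / 16152479 s is about 0.06 µs, which the benchmark reports rounded as 0.1). A commonly quoted derived figure, not printed by the benchmark itself, is DMIPS: Dhrystones per second divided by 1757, the score of the VAX 11/780 reference machine:

```python
# Derive the per-run time and DMIPS from the sample output above.
dhrystones_per_second = 16152479.0
microseconds_per_run = 1e6 / dhrystones_per_second
dmips = dhrystones_per_second / 1757  # VAX 11/780 reference score
print(round(microseconds_per_run, 3))  # 0.062, reported rounded as 0.1
print(round(dmips))                    # 9193
```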
Build Dhrystone for <<baremetal>> and run it on QEMU:

....
# Build our Newlib stubs.
./build-baremetal --arch aarch64
./build-dhrystone --arch aarch64 --mode baremetal
./run --arch aarch64 --baremetal "$(./getvar --arch aarch64 baremetal_build_dir)/submodules/dhrystone/dhrystone" --cli-args 10000
....

or with gem5:

....
# Build our Newlib stubs.
./build-baremetal --arch aarch64
./build-dhrystone --arch aarch64 --emulator gem5 --mode baremetal
./run --arch aarch64 --baremetal "$(./getvar --arch aarch64 --emulator gem5 baremetal_build_dir)/submodules/dhrystone/dhrystone" --cli-args 10000 --emulator gem5
....

If you really want the Buildroot package for some reason, build it with:

....
./build-buildroot --config 'BR2_PACKAGE_DHRYSTONE=y'
....

and run inside the guest from `PATH` with:

....
dhrystone
....
===== LMbench

http://www.bitmover.com/lmbench/

Canonical source at https://sourceforge.net/projects/lmbench/ but Intel has a fork at https://github.com/intel/lmbench which has more recent build updates, so I think that's the one I'd put my money on as of 2020.

Feels old, and guessing it is not representative anymore, like <<dhrystone>>. But hey, history!

Ubuntu 20.04 AMD64 native build and run:
....
git submodule update --init submodules/lmbench
cd submodules/lmbench
cd src
make results
....

TODO it hangs for a long time at:

....
Hang on, we are calculating your cache line size.
....

Bug report: https://github.com/intel/lmbench/issues/15

If I kill it, the configuration process continues:
....
Killed
OK, it looks like your cache line is  bytes.
....

and continues with a few more interactive questions until finally:

....
Confguration done, thanks.
....

where it again hangs for at least 2 hours, so I lost patience and killed it.
TODO: how to do a non-interactive config? After the above procedure, `bin/x86_64-linux-gnu/CONFIG.ciro-p51` contains:

....
DISKS=""
DISK_DESC=""
OUTPUT=/dev/null
ENOUGH=50000
FASTMEM="NO"
FILE=/var/tmp/XXX
FSDIR=/var/tmp
INFO=INFO.ciro-p51
LINE_SIZE=
LOOP_O=0.00000000
MAIL=no
TOTAL_MEM=31903
MB=22332
MHZ="-1 System too busy"
MOTHERBOARD=""
NETWORKS=""
OS="x86_64-linux-gnu"
PROCESSORS="8"
REMOTE=""
SLOWFS="NO"
SYNC_MAX="1"
LMBENCH_SCHED="DEFAULT"
TIMING_O=0
RSH=rsh
RCP=rcp
VERSION=lmbench-3alpha4
BENCHMARK_HARDWARE=YES
BENCHMARK_OS=YES
BENCHMARK_SYSCALL=
BENCHMARK_SELECT=
BENCHMARK_PROC=
BENCHMARK_CTX=
BENCHMARK_PAGEFAULT=
BENCHMARK_FILE=
BENCHMARK_MMAP=
BENCHMARK_PIPE=
BENCHMARK_UNIX=
BENCHMARK_UDP=
BENCHMARK_TCP=
BENCHMARK_CONNECT=
BENCHMARK_RPC=
BENCHMARK_HTTP=
BENCHMARK_BCOPY=
BENCHMARK_MEM=
BENCHMARK_OPS=
....
Native build only, without running tests:

....
cd src
make
....
Interestingly, one of the creators of LMbench, Larry McVoy (https://www.linkedin.com/in/larrymcvoy/[], https://en.wikipedia.org/wiki/Larry_McVoy[]), is also a co-founder of https://en.wikipedia.org/wiki/BitKeeper[BitKeeper]. Their SCM must be blazingly fast!!! Also his LinkedIn says Intel uses it. But it will forever be remembered as "the closed source Git precursor that died N years ago", RIP.
===== STREAM benchmark

http://www.cs.virginia.edu/stream/ref.html

Very simple memory bandwidth benchmark with one C and one Fortran version, originally published in 1991; the latest version at the time of writing is from 2013.

Its operation is very simple: fork one thread for each CPU in the system (using OpenMP) and do the following four array operations (4 separate loops of individual operations):
....
    /* Copy. */
    times[0 * ntimes + k] = mysecond();
#pragma omp parallel for
    for (j=0; j<stream_array_size; j++)
        c[j] = a[j];
    times[0 * ntimes + k] = mysecond() - times[0 * ntimes + k];

    /* Scale. */
    times[1 * ntimes + k] = mysecond();
#pragma omp parallel for
    for (j=0; j<stream_array_size; j++)
        b[j] = scalar*c[j];
    times[1 * ntimes + k] = mysecond() - times[1 * ntimes + k];

    /* Add. */
    times[2 * ntimes + k] = mysecond();
#pragma omp parallel for
    for (j=0; j<stream_array_size; j++)
        c[j] = a[j]+b[j];
    times[2 * ntimes + k] = mysecond() - times[2 * ntimes + k];

    /* Triad. */
    times[3 * ntimes + k] = mysecond();
#pragma omp parallel for
    for (j=0; j<stream_array_size; j++)
        a[j] = b[j]+scalar*c[j];
    times[3 * ntimes + k] = mysecond() - times[3 * ntimes + k];
}
....
See also: https://stackoverflow.com/questions/56086993/what-does-stream-memory-bandwidth-benchmark-really-measure

Ubuntu 20.04 native build and run:

....
git submodule update --init submodules/stream-benchmark
cd submodules/stream-benchmark
make
./stream_c.exe
....
Sample output:

....
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 7027 microseconds.
   (= 7027 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           20123.2     0.008055     0.007951     0.008267
Scale:          20130.4     0.008032     0.007948     0.008177
Add:            22528.8     0.010728     0.010653     0.010867
Triad:          22448.4     0.010826     0.010691     0.011352
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
....
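The reported rates follow from the array size and the best (minimum) times: Copy reads one array and writes another (2 accesses of 8 bytes per element), while Triad touches three arrays, and STREAM counts MB as 10^6 bytes:

```python
# Reproduce the Copy and Triad "Best Rate MB/s" values from the sample
# output above (MB = 10**6 bytes in STREAM).
elements = 10_000_000
bytes_per_element = 8
copy_bytes = 2 * elements * bytes_per_element   # read a[], write c[]
triad_bytes = 3 * elements * bytes_per_element  # read b[], c[], write a[]
print(copy_bytes / 0.007951 / 1e6)   # ~20123, the reported Copy rate
print(triad_bytes / 0.010691 / 1e6)  # ~22449, vs the reported 22448.4
```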
The LKMC usage of STREAM is analogous to that of <<dhrystone>>. Build and run on QEMU <<user-mode-simulation>>:

....
./build-stream --optimization-level 3
./run --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe"
....

Decrease the benchmark size and the retry count to finish the simulation faster, at the cost of a possibly less representative result:

....
./run --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe" --cli-args '100 2'
....

Build and run on gem5 user mode:

....
./build-stream --optimization-level 3
./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe" --cli-args '1000 2'
....
[[userland-libs-directory]]
=== userland/libs directory
@@ -26836,28 +26297,6 @@ Summary of manually collected results on <<p51>> at LKMC a18f28e263c91362519ef55

|
|605448f07e6380634b1aa7e9732d111759f69fd
|<<dhrystone>> `-O3`
|`gem5 --arch aarch64`
|4 * 10^5
|68
|9.2034139 * 10^7
|1.6

|
|5d233f2664a78789f9907d27e2a40e86cefad595
|<<stream-benchmark>> `-O3`
|`./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args 1000000 --trace ExecAll`
|3 * 10^5 * 2
|64
|9.9674773 * 10^7
|1.6

|glibc C pre-main effects
|ab6f7331406b22f8ab6e2df5f8b8e464fb35b611
|link:userland/c/m5ops.c[] `-O0`
@@ -26973,9 +26412,7 @@ We can then repeat the experiment for other gem5 CPUs to see how they compare.

Let's see if user mode runs considerably faster than full system or not, ignoring the kernel boot.

-First we build <<dhrystone>> manually statically since dynamic linking is broken in gem5 as explained at: xref:gem5-syscall-emulation-mode[xrefstyle=full].
+TODO: move this section to our new custom dhrystone setup: xref:dhrystone[xrefstyle=full].
+First we build dhrystone manually statically since dynamic linking is broken in gem5 as explained at: xref:gem5-syscall-emulation-mode[xrefstyle=full].

gem5 user mode:
@@ -27231,35 +26668,6 @@ Reddit threads:

===== P51 benchmarks

<<dhrystone>> on Ubuntu 20.04 results at <<dhrystone>>.

<<stream-benchmark>> on Ubuntu 20.04 results at <<stream-benchmark>>.

====== P51 CoreMark-Pro

<<coremark,CoreMark-Pro>> d5b4f2ba7ba31e37a5aa93423831e7d5eb933868 on Ubuntu 20.04 with `XCMD="-c$(nproc)"`:
....
                                                 MultiCore SingleCore
Workload Name                                     (iter/s)   (iter/s)    Scaling
----------------------------------------------- ---------- ---------- ----------
cjpeg-rose7-preset                                  769.23     175.44       4.38
core                                                  7.98       2.11       3.78
linear_alg-mid-100x100-sp                           892.86     233.64       3.82
loops-all-mid-10k-sp                                 35.84       7.58       4.73
nnet_test                                            35.09      10.05       3.49
parser-125k                                         125.00      20.41       6.12
radix2-big-64k                                     3278.69     630.91       5.20
sha-test                                            625.00     227.27       2.75
zip-test                                            615.38     166.67       3.69

MARK RESULTS TABLE

Mark Name                                        MultiCore SingleCore    Scaling
----------------------------------------------- ---------- ---------- ----------
CoreMark-PRO                                      25016.00    6079.70       4.11
....

===== P51 maintenance history

Bought: 2017 for approximately 2400 pounds.
@@ -1,53 +0,0 @@
#!/usr/bin/env python3

import os
import shutil

import common
import shlex
from shell_helpers import LF

class Main(common.BuildCliFunction):
    def __init__(self):
        super().__init__(
            description='''\
https://cirosantilli.com/linux-kernel-module-cheat#dhrystone
'''
        )
        self._add_argument('--ccflags')
        self._add_argument('--force-rebuild')
        self._add_argument('--optimization-level')

    def setup(self, env):
        self.root_relpath = os.path.join('submodules', 'dhrystone')

    def build(self):
        build_dir = self.get_build_dir()
        extra_flags = []
        if self.env['force_rebuild']:
            extra_flags.extend(['-B', LF])
        ret = self.sh.run_cmd(
            [
                'make', LF,
                '-j', str(self.env['nproc']), LF,
                '-C', os.path.join(self.env['submodules_dir'], 'dhrystone'), LF,
                'CC={}'.format(self.env['gcc_path']), LF,
                'CFLAGS_EXTRA={}'.format(self.sh.cmd_to_string(self.env['ccflags'], force_oneline=True)), LF,
                'LDFLAGS_EXTRA={}'.format(self.sh.cmd_to_string(self.env['ldflags'], force_oneline=True)), LF,
                'EXTRA_OBJS={}'.format(' '.join(self.env['extra_objs'])), LF,
                'OUT_DIR={}'.format(build_dir), LF,
            ]
            + extra_flags
        )
        if ret == 0 and self.env['copy_overlay']:
            self.sh.copy_file_if_update(
                os.path.join(build_dir, 'dhrystone'),
                os.path.join(self.env['out_rootfs_overlay_lkmc_dir'], self.root_relpath, 'dhrystone'),
            )
        return ret

    def get_build_dir(self):
        return os.path.join(self.env['build_dir'], self.root_relpath)

if __name__ == '__main__':
    Main().cli()
67
build-stream
@@ -1,67 +0,0 @@
#!/usr/bin/env python3

'''
TODO factor this out with the ./build-buildroot setup which is more advanced.
'''

import os
import shutil

import common
import shlex
from shell_helpers import LF

class Main(common.BuildCliFunction):
    def __init__(self):
        super().__init__(
            description='''\
https://cirosantilli.com/linux-kernel-module-cheat#stream-benchmark
'''
        )
        self._add_argument('--ccflags')
        self._add_argument('--force-rebuild')
        self._add_argument('--optimization-level')

    def setup(self, env):
        self.root_relpath = os.path.join('submodules', 'stream-benchmark')

    def build(self):
        build_dir = self.get_build_dir()
        cflags = ['-O{}'.format(self.env['optimization_level'])]
        extra_flags = []
        if self.env['static']:
            cflags.extend(['-static'])
        if self.env['force_rebuild']:
            extra_flags.extend(['-B', LF])
        if self.env['mode'] == 'baremetal':
            extra_objs = [
                self.env['baremetal_syscalls_obj'],
                self.env['baremetal_syscalls_asm_obj']
            ]
        else:
            extra_objs = []
        ret = self.sh.run_cmd(
            [
                'make', LF,
                '-j', str(self.env['nproc']), LF,
                '-C', os.path.join(self.env['submodules_dir'], 'stream-benchmark'), LF,
                'CC={}'.format(self.env['gcc_path']), LF,
                'CFLAGS_EXTRA={}'.format(' '.join(cflags)), LF,
                'EXTRA_OBJS={}'.format(' '.join(extra_objs)), LF,
                'FC={}'.format(self.env['gfortran_path']), LF,
                'OUT_DIR={}'.format(build_dir), LF,
            ]
            + extra_flags
        )
        if ret == 0 and self.env['copy_overlay']:
            self.sh.copy_file_if_update(
                os.path.join(build_dir, 'stream_c.exe'),
                os.path.join(self.env['out_rootfs_overlay_lkmc_dir'], self.root_relpath, 'stream-benchmark'),
            )
        return ret

    def get_build_dir(self):
        return os.path.join(self.env['build_dir'], self.root_relpath)

if __name__ == '__main__':
    Main().cli()
Submodule submodules/dhrystone deleted from 35f961d841
Submodule submodules/lmbench deleted from 5ddee65534
Submodule submodules/omp-validation deleted from 926f8e1619
Submodule submodules/stream-benchmark deleted from eaa0a90ded