Not satisfied with mere software? OK then, let’s hack up the QEMU x86 CPU identification:
    vim submodules/qemu/target/i386/cpu.c
and modify:
    .model_id = "QEMU Virtual CPU version " QEMU_HW_VERSION,
to contain:
    .model_id = "QEMU Virtual CPU version HACKED " QEMU_HW_VERSION,
then as usual rebuild and re-run:
    ./build-qemu
    ./run --eval-after 'grep "model name" /proc/cpuinfo'
and once again, there is your message: QEMU communicated it to the Linux kernel, which printed it out.
You have now gone from newb to hardware hacker in a mere 15 minutes, your rate of progress is truly astounding!!!
Seriously though, if you want to be a real hardware hacker, it just can’t be done with open source tools as of 2018. The root obstacle is that:
- silicon fabs don’t publish their design rules
- which implies that there are no decent standard cell libraries, see also: https://www.quora.com/Are-there-good-open-source-standard-cell-libraries-to-learn-IC-synthesis-with-EDA-tools/answer/Ciro-Santilli
- which implies that people can’t develop open source EDA tools
- which implies that you can’t get decent power, performance and area estimates
The only thing you can do with open source is purely functional designs with Verilator, but you will never know if they can actually be produced and how efficient they can be.
If you really want to develop semiconductors, your only choice is to join a university or a semiconductor company that has the EDA licenses.
While hacking QEMU, you will likely want to step through its source with GDB. That is trivial since QEMU is just another userland program like any other, but our setup has a shortcut to make it even more convenient, see: Section 22.8, “Debug the emulator”.
We use glibc as our default libc now, and it is tracked as an unmodified submodule at submodules/glibc, at the exact same version that Buildroot has it, which can be found at: package/glibc/glibc.mk. Buildroot 2018.05 applies no patches.
Have you ever felt that a single inc instruction was not enough? Really? Me too!
OK, now time to hack GCC.
This also allows you to modify QEMU if you’re into that sort of thing.
To build the kernel modules as in Your first kernel module hack, do:
The device tree is a data structure defined by the Linux kernel that informs the kernel how the hardware is set up.
platform_device contains a minimal runnable example of device tree manipulation.
Device trees serve to reduce the need for hardware vendors to patch the kernel: vendors just provide a device tree file instead, which is much simpler. The sketch below shows how a driver typically binds to a device tree node.
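To make that concrete, here is a minimal sketch of how a kernel driver binds to a device tree node and reads a property from it. The compatible string lkmc,hypothetical-dev and the property lkmc,myprop are made-up names for illustration, not the actual platform_device example:

    /* Sketch of a platform driver that binds to a device tree node.
     * "lkmc,hypothetical-dev" and "lkmc,myprop" are hypothetical names. */
    #include <linux/module.h>
    #include <linux/of.h>
    #include <linux/platform_device.h>

    /* Matches a node declaring: compatible = "lkmc,hypothetical-dev"; */
    static const struct of_device_id lkmc_of_match[] = {
        { .compatible = "lkmc,hypothetical-dev" },
        {},
    };
    MODULE_DEVICE_TABLE(of, lkmc_of_match);

    static int lkmc_probe(struct platform_device *pdev)
    {
        u32 val;

        /* Read a 32-bit property from the matched device tree node. */
        if (of_property_read_u32(pdev->dev.of_node, "lkmc,myprop", &val))
            return -EINVAL;
        dev_info(&pdev->dev, "lkmc,myprop = %u\n", val);
        return 0;
    }

    static struct platform_driver lkmc_driver = {
        .probe = lkmc_probe,
        .driver = {
            .name = "lkmc_hypothetical",
            .of_match_table = lkmc_of_match,
        },
    };
    module_platform_driver(lkmc_driver);
    MODULE_LICENSE("GPL");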
Running dynamically linked executables in QEMU user mode requires pointing it to the root filesystem with the -L option so that it can find the dynamic linker and shared libraries.
We pass -L by default, so everything just works.
Support for dynamic linking was added in November 2019: https://stackoverflow.com/questions/50542222/how-to-run-a-dynamically-linked-executable-syscall-emulation-mode-se-py-in-gem5/50696098#50696098
Note that as shown at Section 34.2.2, “Benchmark emulators on userland executables”, the dynamic version runs 200x more instructions, which might have an impact on smaller simulations in detailed CPUs.
    CONFIG_VIRTIO_PCI=y

and furthermore interrupts 1 and 12 happen immediately. TODO why, were they somehow pending?
So to see something interesting, you need to monitor an interrupt that is rarer than the keyboard, e.g. platform_device.
This device was added in our fork of QEMU, and has only been tested in x86.
PCI driver for our minimal pci_min.c QEMU fork device:
    ./run -- -device lkmc_pci_min
then:
    insmod pci_min.ko
Sources:
- Kernel module: kernel_modules/pci_min.c.
- QEMU device: https://github.com/cirosantilli/qemu/blob/lkmc/hw/misc/lkmc_pci_min.c
Outcome:

    <4>[ 10.608241] pci_min: loading out-of-tree module taints kernel.
    <6>[ 10.609935] probe
    <6>[ 10.651881] dev->irq = 11
    lkmc_pci_min mmio_write addr = 0 val = 12345678 size = 4
    <6>[ 10.668515] irq_handler irq = 11 dev = 251
    lkmc_pci_min mmio_write addr = 4 val = 0 size = 4
What happened:
- right at probe time, we write to a register
- our hardware model is coded such that it generates an interrupt when written to
- the Linux kernel interrupt handler writes to another register, which tells the hardware to stop sending interrupts
Kernel messages and the printfs from inside QEMU are shown interleaved; to see that more clearly, run in QEMU graphic mode instead.
We don’t enable the device by default because it does not work on vanilla QEMU, which we often want to test with this repository.
Probe already does an MMIO write, which generates an IRQ and tests everything. The sketch below shows the shape of this probe/IRQ handshake.
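In driver terms, the handshake boils down to something like the following fragment. The register offsets 0 and 4 match the log above, but the function and variable names are illustrative, not the literal pci_min.c source:

    /* Illustrative pci_min-style fragment: probe writes register 0, the
     * device raises an IRQ, and the handler acks by writing register 4. */
    #include <linux/interrupt.h>
    #include <linux/module.h>
    #include <linux/pci.h>

    static void __iomem *mmio;

    static irqreturn_t lkmc_irq_handler(int irq, void *dev)
    {
        pr_info("irq_handler irq = %d\n", irq);
        /* Writing register 4 tells the hardware model to stop the IRQ. */
        iowrite32(0, mmio + 4);
        return IRQ_HANDLED;
    }

    static int lkmc_probe_fragment(struct pci_dev *pdev)
    {
        /* BAR 0 holds the device registers. */
        mmio = pci_iomap(pdev, 0, 0);
        if (!mmio)
            return -ENOMEM;
        if (request_irq(pdev->irq, lkmc_irq_handler, IRQF_SHARED,
                        "lkmc_pci_min", pdev))
            return -EBUSY;
        /* Writing register 0 makes the model fire an interrupt. */
        iowrite32(0x12345678, mmio + 0);
        return 0;
    }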
Small upstream educational PCI device:
https://stackoverflow.com/questions/62831327/add-memory-device-to-qemu
https://stackoverflow.com/questions/64539528/qemu-pci-dma-read-and-pci-dma-write-does-not-work
In this section we will try to interact with PCI devices directly from userland, without kernel modules.
which we identify as being the QEMU edu PCI device by the magic number: 1234:11e8.
Alternatively, we can also use the QEMU monitor:

    dev: edu, id ""
      addr = 06.0
      romfile = ""
      rombar = 1 (0x1)

and read the BAR with either of:

    setpci -s 0000:00:06.0 BASE_ADDRESS_0
    setpci -d 1234:11e8 BASE_ADDRESS_0
which writes to the first register of the edu device. The same write can also be done from a small C program, sketched below.
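That write can also be done by mmapping BAR 0 through the sysfs resource0 file, which saves you from computing physical addresses by hand. A minimal sketch, assuming the device is at 0000:00:06.0 as in the setpci example above:

    /* Userland MMIO write to BAR 0 of a PCI device, no kernel module.
     * Assumes the device sits at 0000:00:06.0; adjust the path otherwise. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/sys/bus/pci/devices/0000:00:06.0/resource0",
                      O_RDWR | O_SYNC);
        if (fd == -1) { perror("open"); return 1; }
        void *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
                         fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }
        volatile uint32_t *regs = map;
        regs[0] = 0x12345678; /* write the first 32-bit register */
        munmap(map, 4096);
        close(fd);
        return 0;
    }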
The device then fires an interrupt at irq 11, which is unhandled, which leads the kernel to say you are a bad person:
    lkmc_pci_min mmio_write addr = 0 val = 12345678 size = 4
    <5>[ 1064.042435] random: crng init done
    <3>[ 1065.567742] irq 11: nobody cared (try booting with the "irqpoll" option)
Our kernel module handles the interrupt, but does not acknowledge it like our proper edu kernel module, and so it keeps firing, which leads to infinitely many messages being printed:
There are two versions of setpci and lspci:
The PCI standard is non-free, like almost everything in the low level world: https://pcisig.com/specifications, but Google gives several illegal PDF hits :-)
lspci -k shows something like:
Minimal platform device example coded into the -M versatilepb SoC of our QEMU fork.
Using this device now requires checking out the branch:

    git checkout platform-device
    git submodule sync

before building; it does not work on master.
Rationale: we found out that the kernels that build for qemu -M versatilepb don’t work on gem5, because versatilepb is an old pre-v7 platform, and gem5 requires ARMv7. So we migrated over to -M virt to have a single kernel for both gem5 and QEMU, and broke this, since the single kernel was more important. TODO port to -M virt.
The module itself can be found at: https://github.com/cirosantilli/linux-kernel-module-cheat/blob/platform-device/kernel_modules/platform_device.c
Uses:
- hw/misc/lkmc_platform_device.c: minimal device added in our QEMU fork to -M versatilepb
- the device tree entry we added to our Linux kernel fork: https://github.com/cirosantilli/linux/blob/361bb623671a52a36a077a6dd45843389a687a33/arch/arm/boot/dts/versatile-pb.dts#L42
Expected outcome after insmod:
- QEMU reports MMIO accesses with printfs
- IRQs are generated and handled by this module, which logs to dmesg
Without insmodding this module, try writing to the register with /dev/mem:

    devmem 0x101e9000 w 0x12345678
We can also observe the interrupt with dummy-irq:
    modprobe dummy-irq irq=34
    insmod platform_device.ko
The IRQ number 34 was found in dmesg after:

    insmod platform_device.ko
TODO get some working!
OK, this is why we used gem5 in the first place, performance measurements!
Let’s see how many cycles Dhrystone, which Buildroot provides, takes for a few different input parameters.
We will do that for various input parameters on full system, by taking a checkpoint after a fast atomic CPU boot finishes, and then restoring in a more detailed CPU mode to run the benchmark:
gem5 allocates some magic instructions on unused instruction encodings for convenient guest instrumentation.
There are a few different mechanisms that can be used to implement identical m5ops:
- magic instructions reserved in the encoding space
- magic addresses: m5ops magic addresses
- unused Semihosting address space on ARM platforms
All of those methods are exposed through the in-tree gem5 m5 executable. You can select which method to use when calling the executable, e.g.:
    m5 exit
    # Same as the above.
    m5 --inst exit
    # The address is mandatory if not configured at build time.
    m5 --addr 0x10010000 exit
    m5 --semi exit
To make things simpler to understand, you can play around with our own minimized educational m5 subset:
These are magic addresses that when accessed lead to an m5op.
The base address is given by system.m5ops_base, and then each m5op happens at a different address offset from that base.
If system.m5ops_base is 0, then the memory m5ops are disabled.
Note that the address is physical, and therefore when running in full system on top of the Linux kernel, you must first map a virtual to physical address with /dev/mem as mentioned at: Userland physical address experiments.
One advantage of this method is that it can work with gem5 KVM, whereas the magic instructions don’t, since the host cannot handle them and it is hard to hook into that.
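For example, here is a minimal sketch of triggering the exit m5op through its magic address from Linux userland via /dev/mem. It assumes system.m5ops_base=0x10010000 as in the command below, and that the exit op lives at offset 0x2100 from the base, consistent with the 0x10012100 address in the fatal message further down; treat both constants as assumptions:

    /* Trigger the gem5 exit m5op by touching its magic physical address.
     * Assumes system.m5ops_base=0x10010000 and exit offset 0x2100. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd == -1) { perror("open"); return 1; }
        /* Map the physical page range that holds the m5ops. */
        void *map = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE, MAP_SHARED,
                         fd, 0x10010000);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }
        /* An access at base + 0x2100 performs the exit m5op. */
        *(volatile uint64_t *)((char *)map + 0x2100) = 0;
        return 0; /* not reached if the m5op fires */
    }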
A baremetal example of that can be found at: baremetal/arch/aarch64/no_bootloader/m5_exit_addr.S.
However, as of gem5 0d5a80cb469f515b95e03f23ddaf70c9fd2ecbf2, fs.py --baremetal disables the memory m5ops for some reason, therefore you should run that program as:
    ./run --arch aarch64 --baremetal baremetal/arch/aarch64/no_bootloader/m5_exit_addr.S --emulator gem5 --trace-insts-stdout -- --param 'system.m5ops_base=0x10010000'
TODO failing with:
    info: Entering event queue @ 0. Starting simulation...
    fatal: Unable to find destination for [0x10012100:0x10012108] on system.iobus
Let’s study how the gem5 m5 executable uses them:
- magic instructions, which don’t exist in the corresponding arch
- magic memory addresses on a given page: m5ops magic addresses
TODO: what is the advantage of magic memory addresses over magic instructions? They seem to require more setup work, since you have to tell the kernel never to touch the magic page. For the magic instructions, the only thing that could go wrong is if you run some crazy kind of fuzzing workload that generates random instructions.
Then, in aarch64 magic instructions for example, the lines:
include/gem5/asm/generic/m5ops.h also describes some annotation instructions.
and configure it into Eclipse as usual.
One downside of this setup is that if you want to nuke your build directory to get a clean build, then the Eclipse configuration files present in it might get deleted. Maybe it is possible to store configuration files outside of the directory, but we are now mitigating that by making a backup copy of those configuration files before removing the directory, and restoring it when you do ./build-gem5 --clean.
Unlike in SimpleThread however, O3ThreadContext does not contain the register data itself, e.g. O3ThreadContext::readIntRegFlat instead forwards to cpu:
    git submodule update --init submodules/gensim
    sudo apt install libantlr3c-dev
    cd submodules/gensim
    make

The gensim code generation command looks like:

    cd /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/models/armv8 && \
      /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/build/dist/bin/gensim \
        -a /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/models/armv8/aarch64.ac \
        -s module,arch,decode,disasm,ee_interp,ee_blockjit,jumpinfo,function,makefile \
        -o decode.GenerateDotGraph=1,makefile.libtrace_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/support/libtrace/inc,makefile.archsim_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/archsim/inc,makefile.llvm_path=,makefile.Optimise=2,makefile.Debug=1 \
        -t /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/build/models/armv8/output-aarch64/
strace shows that OpenMP makes clone() syscalls in Linux. TODO: does it actually call pthread_ functions, or does it make syscalls directly? Or in other words, can it work on Freestanding programs? A quick grep shows many references to pthreads.
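For reference, here is a minimal OpenMP program you can strace to watch those clone() calls; build it with gcc -fopenmp:

    /* Minimal OpenMP program: strace it to see one clone() per extra
     * worker thread. Build: gcc -fopenmp -o omp_hello omp_hello.c */
    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        #pragma omp parallel
        {
            /* Each OpenMP thread runs this block once. */
            printf("thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }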
Host build on Ubuntu 20.04:
    git submodule update --init submodules/omp-validation
    cd submodules/omp-validation
    PERL5LIB="${PERL5LIB}:." make -j `nproc` ctest
This both builds and runs; it took about 5 minutes on a 2017 Lenovo ThinkPad P51, but had a few test failures for some reason:

    Summary:
    S Number of tested Open MP constructs: 62
    S Number of used tests: 123
    S Number of failed tests: 4
    S Number of successful tests: 119
    S + from this were verified: 115

    Normal tests:
    N Number of failed tests: 2
    N + from this fail compilation: 0
    N + from this timed out 0
    N Number of successful tests: 60
    N + from this were verified: 58

    Orphaned tests:
    O Number of failed tests: 2
    O + from this fail compilation: 0
    O + from this timed out 0
    O Number of successful tests: 59
    O + from this were verified: 57
The tests and run results are placed under bin/c/, e.g.:

    test_omp_threadprivate
    test_omp_threadprivate.c
    test_omp_threadprivate.log
    test_omp_threadprivate.out
    test_omp_threadprivate_compile.log
C files are also present, as some kind of code generation is used.
Build only and run one of them manually:

    make -j`nproc` omp_my_sleep omp_testsuite
    PERL5LIB="${PERL5LIB}:." ./runtest.pl --lang=c --norun testlist-c.txt
    ./bin/c/test_omp_barrier
The bin/c directory is hardcoded in the executable, so to run it you must ensure that it exists relative to the CWD, e.g.:

    cd bin/c
    mkdir -p bin/c
    ./test_omp_barrier
Manually cross compile all tests and optionally add some extra options, e.g. -static to more conveniently run in gem5:
    PERL5LIB="${PERL5LIB}:." ./runtest.pl --lang=c --makeopts 'CC=aarch64-linux-gnu-gcc CFLAGS_EXTRA=-static' --norun testlist-c.txt

and then run one of them on gem5:

    ./../../run --arch aarch64 --emulator gem5 --userland submodules/omp-validation/bin/c/test_omp_parallel_reduction --cpus 8 --memory 8G
Build a single test:

    make bin/c/test_omp_sections_reduction
templates
- userland/cpp/template.cpp: basic example
- userland/cpp/template_class_with_static_member.cpp: https://stackoverflow.com/questions/3229883/static-member-initialization-in-a-class-template
- userland/cpp/if_constexpr.cpp: C++17 if constexpr: https://stackoverflow.com/questions/12160765/if-else-at-compile-time-in-c/54647315#54647315
iostream
Not possible to do the typecheck automatically without explicitly giving type constraints: https://stackoverflow.com/questions/53441832/sfinae-automatically-check-that-function-body-compiles-without-explicit-constrai
decltype: C++11 keyword.
decltype(expr) is replaced with the type of the expression at compile time.
More powerful than auto, as you can use it in more places.
rootfs_overlay/lkmc/python/unittest_find/ contains examples to test how tests are found by unittest within directories. Related questions:
rootfs_overlay/lkmc/python/relative_import/ contains examples to test how to do relative imports in Python.
This subject is impossible to understand.
Related questions:
- https://stackoverflow.com/questions/16981921/relative-imports-in-python-3
- https://stackoverflow.com/questions/14132789/relative-imports-for-the-billionth-time
- https://stackoverflow.com/questions/21490860/relative-imports-with-unittest-in-python
- https://stackoverflow.com/questions/714063/importing-modules-from-parent-folder
Buildroot has a Python package that can be added to the guest image:
At LKMC 50ac89b779363774325c81157ec8b9a6bdb50a2f gem5 390a74f59934b85d91489f8a563450d8321b602da:
Here we will add some better examples and explanations for: https://docs.python.org/3/extending/embedding.html#very-high-level-embedding
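For orientation, the whole of the "very high level" embedding interface boils down to a couple of calls; this mirrors the official docs rather than our examples:

    /* Run a string of Python code from C with the very high level API.
     * Build roughly with: gcc embed.c $(python3-config --embed --cflags --ldflags) */
    #include <Python.h>

    int main(void) {
        Py_Initialize(); /* start the embedded interpreter */
        PyRun_SimpleString("print('hello from embedded Python')");
        return Py_FinalizeEx() < 0 ? 1 : 0; /* shut down, report errors */
    }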
util.inspect.custom and toString override experiment: https://stackoverflow.com/questions/24902061/is-there-an-repr-equivalent-for-javascript/26698403#26698403
Output:

    util.inspect
    my type is MyClassUtilInspectCustom and a is 1 and b is 2
    console.log
    my type is MyClassUtilInspectCustom and a is 1 and b is 2
    toString
    [object Object]

    util.inspect
    MyClassToString { a: 1, b: 2 }
    console.log
    MyClassToString { a: 1, b: 2 }
    toString
    my type is MyClassToString and a is 1 and b is 2

rootfs_overlay/lkmc/nodejs/object_to_json.js: toJSON examples
Part of EEMBC.
They have two versions:
- 2015: https://github.com/eembc/coremark-pro
Describes very clearly in the README what tests it does. Most of them are understandable high level operations. In particular, it contains "a greatly improved version of the Livermore loops".
Both have a custom license, so yeah, no patience to read this stuff.
CoreMark-Pro build and run on Ubuntu 20.04:

    git submodule update --init submodules/coremark-pro
    cd submodules/coremark-pro
    make TARGET=linux64 build
    make TARGET=linux64 XCMD='-c4' certify-all
This uses 4 contexts. TODO: what are contexts? Are they the same as threads? You likely want to use -c$(nproc) in practice instead.
Finishes in a few seconds, 2017 Lenovo ThinkPad P51 results:
                                                    MultiCore SingleCore
    Workload Name                                    (iter/s)   (iter/s)    Scaling
    ----------------------------------------------- ---------- ---------- ----------
    cjpeg-rose7-preset                                  526.32     178.57       2.95
    core                                                  7.39       2.16       3.42
    linear_alg-mid-100x100-sp                           684.93     238.10       2.88
    loops-all-mid-10k-sp                                 27.65       7.80       3.54
    nnet_test                                            32.79      10.57       3.10
    parser-125k                                          71.43      25.00       2.86
    radix2-big-64k                                     2320.19     623.44       3.72
    sha-test                                            555.56     227.27       2.44
    zip-test                                            363.64     166.67       2.18

    MARK RESULTS TABLE

    Mark Name                                       MultiCore SingleCore    Scaling
    ----------------------------------------------- ---------- ---------- ----------
    CoreMark-PRO                                      18743.79    6306.76       2.97
More sample results: P51 CoreMark-Pro.
And Scaling appears to be the ratio between multicore (4 contexts, due to -c4) and single core performance; each benchmark gets run twice, once multicore and once single core.
The tester script also outputs test commands, some of which are:
    builds/linux64/gcc64/bin/zip-test.exe -c1 -w1 -c4 -v1
    builds/linux64/gcc64/bin/zip-test.exe -c1 -w1 -c4 -v0
    builds/linux64/gcc64/bin/zip-test.exe -c4 -v1
    builds/linux64/gcc64/bin/zip-test.exe -c4 -v0

-v1 appears to be a fast verification run, and both -c1 and -c4 get run to obtain the single core vs multicore performance.
Sample -c4 -v0 output:
    - Info: Starting Run...
    -- Workload:zip-test=946108807
    -- zip-test:time(ns)=11
    -- zip-test:contexts=4
    -- zip-test:iterations=4
    -- zip-test:time(secs)= 0.011
    -- zip-test:secs/workload= 0.00275
    -- zip-test:workloads/sec= 363.636
    -- Done:zip-test=946108807

and so we see that the zip-test:workloads/sec= 363.636 output is the key value (iterations / time(secs) = 4 / 0.011 ≈ 363.6), which matches the zip-test 363.64 in the earlier full summarized result.
Cross compile statically for aarch64. From LKMC toplevel:
    make \
      -C submodules/coremark-pro \
      LINKER_FLAGS='-static' \
      LINKER_LAST='-lm -lpthread -lrt' \
      TARGET=gcc-cross-linux \
      TOOLCHAIN=gcc-cross-linux \
      TOOLS="$(./getvar --arch aarch64 buildroot_host_usr_dir)" \
      TPREF="$(./getvar --arch aarch64 buildroot_toolchain_prefix)-" \
      build \
    ;
Run a single executable on QEMU:
    ./run --arch aarch64 --userland submodules/coremark-pro/builds/gcc-cross-linux/bin/zip-test.exe --cli-args='-c4 -v0'
Finishes in about 1 second, and gives zip-test:workloads/sec= 74.0741 so we see that it ran about 5x slower than the native host.
Run a single executable on gem5 in a verification run:
    ./run \
      --arch aarch64 \
      --cli-args='-c1 -v1' \
      --emulator gem5 \
      --userland submodules/coremark-pro/builds/gcc-cross-linux/bin/zip-test.exe \
    ;
TODO: hangs for at least 15 minutes, there must be something wrong. Stuck on a loop around __strlen_generic:
    7837834500: system.cpu: A0 T0 : @__strlen_generic+112 : ldp
    7837834500: system.cpu: A0 T0 : @__strlen_generic+112. 0 : addxi_uop ureg0, x1, #16 : IntAlu : D=0x0000003ffff07170 flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
    7837835000: system.cpu: A0 T0 : @__strlen_generic+112. 1 : ldp_uop x2, x3, [ureg0] : MemRead : D=0x20703c0a3e702f3c A=0x3ffff07170 flags=(IsInteger|IsMemRef|IsLoad|IsMicroop|IsLastMicroop)
    7837835500: system.cpu: A0 T0 : @__strlen_generic+116 : sub x4, x2, x8 : IntAlu : D=0x3d607360632e3b34 flags=(IsInteger)
    7837836000: system.cpu: A0 T0 : @__strlen_generic+120 : sub x6, x3, x8 : IntAlu : D=0x1f6f3b093d6f2e3b flags=(IsInteger)
    7837836500: system.cpu: A0 T0 : @__strlen_generic+124 : orr x5, x4, x6 : IntAlu : D=0x3f6f7b697f6f3f3f flags=(IsInteger)
    7837837000: system.cpu: A0 T0 : @__strlen_generic+128 : ands x5, x8, LSL #7 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
    7837837500: system.cpu: A0 T0 : @__strlen_generic+132 : b.eq <__strlen_generic+88> : IntAlu : flags=(IsControl|IsDirectControl|IsCondControl)
    7837838000: system.cpu: A0 T0 : @__strlen_generic+88 : ldp
    7837838000: system.cpu: A0 T0 : @__strlen_generic+88. 0 : addxi_uop ureg0, x1, #32 : IntAlu : D=0x0000003ffff07180 flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
    7837838500: system.cpu: A0 T0 : @__strlen_generic+88. 1 : ldp_uop x2, x3, [ureg0] : MemRead : D=0x6565686b636f4c27 A=0x3ffff07180 flags=(IsInteger|IsMemRef|IsLoad|IsMicroop|IsDelayedCommit)
    7837839000: system.cpu: A0 T0 : @__strlen_generic+88. 2 : addxi_uop x1, ureg0, #0 : IntAlu : D=0x0000003ffff07180 flags=(IsInteger|IsMicroop|IsLastMicroop)
    7837839500: system.cpu: A0 T0 : @__strlen_generic+92 : sub x4, x2, x8 : IntAlu : D=0x3c786d606f6c6e62 flags=(IsInteger)
    7837840000: system.cpu: A0 T0 : @__strlen_generic+96 : sub x6, x3, x8 : IntAlu : D=0x6464676a626e4b26 flags=(IsInteger)
    7837840500: system.cpu: A0 T0 : @__strlen_generic+100 : orr x5, x4, x6 : IntAlu : D=0x7c7c6f6a6f6e6f66 flags=(IsInteger)
    7837841000: system.cpu: A0 T0 : @__strlen_generic+104 : ands x5, x8, LSL #7 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
Instructions before __strlen_generic starts:
    7831019000: system.cpu: A0 T0 : @define_params_zip+664 : add x1, sp, #168 : IntAlu : D=0x0000007ffffef988 flags=(IsInteger)
    7831019500: system.cpu: A0 T0 : @define_params_zip+668 : orr x0, xzr, x24 : IntAlu : D=0x0000003ffff00010 flags=(IsInteger)
    7831020000: system.cpu: A0 T0 : @define_params_zip+672 : bl <th_strcat> : IntAlu : D=0x000000000040a4c4 flags=(IsInteger|IsControl|IsDirectControl|IsUncondControl|IsCall)
    7831020500: system.cpu: A0 T0 : @th_strcat : b <strcat> : IntAlu : flags=(IsControl|IsDirectControl|IsUncondControl)
    7831021000: system.cpu: A0 T0 : @strcat : stp
    7831021000: system.cpu: A0 T0 : @strcat. 0 : addxi_uop ureg0, sp, #-48 : IntAlu : D=0x0000007ffffef8b0 flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
    7831021500: system.cpu: A0 T0 : @strcat. 1 : strxi_uop x29, [ureg0] : MemWrite : D=0x0000007ffffef8e0 A=0x7ffffef8b0 flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsDelayedCommit)
    7831022000: system.cpu: A0 T0 : @strcat. 2 : strxi_uop x30, [ureg0, #8] : MemWrite : D=0x000000000040a4c4 A=0x7ffffef8b8 flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsDelayedCommit)
    7831022500: system.cpu: A0 T0 : @strcat. 3 : addxi_uop sp, ureg0, #0 : IntAlu : D=0x0000007ffffef8b0 flags=(IsInteger|IsMicroop|IsLastMicroop)
    7831023000: system.cpu: A0 T0 : @strcat+4 : add x29, sp, #0 : IntAlu : D=0x0000007ffffef8b0 flags=(IsInteger)
    7831023500: system.cpu: A0 T0 : @strcat+8 : str x19, [sp, #16] : MemWrite : D=0x00000000004d6560 A=0x7ffffef8c0 flags=(IsInteger|IsMemRef|IsStore)
    7831024000: system.cpu: A0 T0 : @strcat+12 : orr x19, xzr, x0 : IntAlu : D=0x0000003ffff00010 flags=(IsInteger)
    7831024500: system.cpu: A0 T0 : @strcat+16 : str x1, [sp, #40] : MemWrite : D=0x0000007ffffef988 A=0x7ffffef8d8 flags=(IsInteger|IsMemRef|IsStore)
    7831025000: system.cpu: A0 T0 : @strcat+20 : bl <_init+120> : IntAlu : D=0x00000000004464c8 flags=(IsInteger|IsControl|IsDirectControl|IsUncondControl|IsCall)
    7831025500: system.cpu: A0 T0 : @_init+120 : adrp x16, #835584 : IntAlu : D=0x00000000004cc000 flags=(IsInteger)
    7831026000: system.cpu: A0 T0 : @_init+124 : ldr x17, [x16, #48] : MemRead : D=0x0000000000449680 A=0x4cc030 flags=(IsInteger|IsMemRef|IsLoad)
    7831026500: system.cpu: A0 T0 : @_init+128 : add x16, x16, #48 : IntAlu : D=0x00000000004cc030 flags=(IsInteger)
    7831027000: system.cpu: A0 T0 : @_init+132 : br x17 : IntAlu : flags=(IsInteger|IsControl|IsIndirectControl|IsUncondControl)
Their build/run system is nice, it even supports user mode simulators out-of-the-box! TODO give it a shot. See:

    RUN =
    RUN_FLAGS =

under util/make/linux64.mak.
Tested on a7ae8e6a8e29ef46d79eb9178d8599d1faeea0e5 + 1.
Created in the 80’s, it is not a representative measure of performance on modern computers anymore. It has mostly been replaced by SPEC, which is… closed source! Unbelievable.
Dhrystone is very simple:
- there is one loop in the dhry_1.c main function that gets executed N times
- that loop calls several short functions named Proc_1 to Proc_8, most of which are defined in dhry_1.c, and a few others in dhry_2.c
The benchmark is single-threaded.
After a quick look at it, Dhrystone at -O3 is very likely completely CPU bound, as there are no loops over variable sized arrays, except for some dummy ones that only run once. It just does a bunch of operations on local and global C variables, which are very likely to be inlined and treated fully in registers until the final write back, or to fit entirely in cache. TODO confirm with some kind of measurement. The benchmark also makes no syscalls except for measuring time and reporting results. Schematically, its shape is something like the sketch below.
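This is a simplified sketch with made-up names to show the structure, not the actual dhry_1.c source:

    /* Simplified sketch of Dhrystone's structure, not the real source:
     * one timed loop over short calls touching globals and locals. */
    #include <stdio.h>
    #include <time.h>

    int Int_Glob; /* global scalar state, trivially cache resident */

    static void Proc_small(int *x) { *x = *x * 2 + Int_Glob; }

    int main(void) {
        long runs = 100000000, i;
        int Int_Loc = 1;
        clock_t begin = clock();
        for (i = 0; i < runs; i++) { /* the one main loop, run N times */
            Proc_small(&Int_Loc);    /* short call, likely inlined at -O3 */
            Int_Glob = Int_Loc % 3;
        }
        double secs = (double)(clock() - begin) / CLOCKS_PER_SEC;
        printf("loops per second: %f\n", runs / secs);
        return 0;
    }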
Buildroot has a dhrystone package, but because it is so interesting to us, we decided to also build it ourselves, which allows things like static and baremetal compilation more easily.
Build and run on QEMU User mode simulation:
    git submodule update --init submodules/dhrystone
    ./build-dhrystone --optimization-level 3
    ./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
TODO automate the run more nicely to dispense with getvar.
Increase the number of loops to try and reach more meaningful results:
    ./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone" --cli-args 100000000
Build and run on gem5 user mode:
    ./build-dhrystone --optimization-level 3
    ./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
Run natively on the host:
    ./build-dhrystone --host
    "$(./getvar --host userland_build_dir)/submodules/dhrystone/dhrystone" 1000000000
Sample output for 2017 Lenovo ThinkPad P51 Ubuntu 20.04:
    Microseconds for one run through Dhrystone: 0.1
    Dhrystones per Second: 16152479.0
Build Dhrystone for baremetal and run it on QEMU:
    # Build our Newlib stubs.
    ./build-baremetal --arch aarch64
    ./build-dhrystone --arch aarch64 --mode baremetal
    ./run --arch aarch64 --baremetal "$(./getvar --arch aarch64 baremetal_build_dir)/submodules/dhrystone/dhrystone" --cli-args 10000
or with gem5:
    # Build our Newlib stubs.
    ./build-baremetal --arch aarch64
    ./build-dhrystone --arch aarch64 --emulator gem5 --mode baremetal
    ./run --arch aarch64 --baremetal "$(./getvar --arch aarch64 --emulator gem5 baremetal_build_dir)/submodules/dhrystone/dhrystone" --cli-args 10000 --emulator gem5
If you really want the Buildroot package for some reason, build it with:
    ./build-buildroot --config 'BR2_PACKAGE_DHRYSTONE=y'
and run inside the guest from PATH with:
    dhrystone
Canonical source at https://sourceforge.net/projects/lmbench/ but Intel has a fork at: https://github.com/intel/lmbench which has more recent build updates, so I think that’s the one I’d put my money on as of 2020.
Feels old; guessing it is not representative anymore, like Dhrystone. But hey, history!
Ubuntu 20.04 AMD64 native build and run:

    git submodule update --init submodules/lmbench
    cd submodules/lmbench
    cd src
    make results
TODO it hangs for a long time at:
    Hang on, we are calculating your cache line size.
Bug report: https://github.com/intel/lmbench/issues/15
If I kill it, the configuration process continues:

    Killed
    OK, it looks like your cache line is  bytes.
and continues with a few more interactive questions until finally:
    Confguration done, thanks.
where it again hangs for at least 2 hours, so I lost patience and killed it.
TODO: how to do a non-interactive config? After the above procedure, bin/x86_64-linux-gnu/CONFIG.ciro-p51 contains:

    DISKS=""
    DISK_DESC=""
    OUTPUT=/dev/null
    ENOUGH=50000
    FASTMEM="NO"
    FILE=/var/tmp/XXX
    FSDIR=/var/tmp
    INFO=INFO.ciro-p51
    LINE_SIZE=
    LOOP_O=0.00000000
    MAIL=no
    TOTAL_MEM=31903
    MB=22332
    MHZ="-1 System too busy"
    MOTHERBOARD=""
    NETWORKS=""
    OS="x86_64-linux-gnu"
    PROCESSORS="8"
    REMOTE=""
    SLOWFS="NO"
    SYNC_MAX="1"
    LMBENCH_SCHED="DEFAULT"
    TIMING_O=0
    RSH=rsh
    RCP=rcp
    VERSION=lmbench-3alpha4
    BENCHMARK_HARDWARE=YES
    BENCHMARK_OS=YES
    BENCHMARK_SYSCALL=
    BENCHMARK_SELECT=
    BENCHMARK_PROC=
    BENCHMARK_CTX=
    BENCHMARK_PAGEFAULT=
    BENCHMARK_FILE=
    BENCHMARK_MMAP=
    BENCHMARK_PIPE=
    BENCHMARK_UNIX=
    BENCHMARK_UDP=
    BENCHMARK_TCP=
    BENCHMARK_CONNECT=
    BENCHMARK_RPC=
    BENCHMARK_HTTP=
    BENCHMARK_BCOPY=
    BENCHMARK_MEM=
    BENCHMARK_OPS=
Native build only, without running the tests:

    cd src
    make
Interestingly, one of the creators of LMbench, Larry McVoy (https://www.linkedin.com/in/larrymcvoy/, https://en.wikipedia.org/wiki/Larry_McVoy), is also a co-founder of BitKeeper. Their SCM must be blazingly fast!!! Also, his LinkedIn says Intel uses it. But they will forever be remembered as "the closed source Git precursor that died N years ago", RIP.
Very simple memory bandwidth benchmark with one C and one Fortran version, originally published in 1991; the latest version at the time of writing is from 2013.
Its operation is very simple: fork one thread for each CPU in the system (using OpenMP) and do the following four array operations (4 separate loops of individual operations):

    /* Copy. */
    times[0 * ntimes + k] = mysecond();
    #pragma omp parallel for
    for (j=0; j<stream_array_size; j++)
        c[j] = a[j];
    times[0 * ntimes + k] = mysecond() - times[0 * ntimes + k];

    /* Scale. */
    times[1 * ntimes + k] = mysecond();
    #pragma omp parallel for
    for (j=0; j<stream_array_size; j++)
        b[j] = scalar*c[j];
    times[1 * ntimes + k] = mysecond() - times[1 * ntimes + k];

    /* Add. */
    times[2 * ntimes + k] = mysecond();
    #pragma omp parallel for
    for (j=0; j<stream_array_size; j++)
        c[j] = a[j]+b[j];
    times[2 * ntimes + k] = mysecond() - times[2 * ntimes + k];

    /* Triad. */
    times[3 * ntimes + k] = mysecond();
    #pragma omp parallel for
    for (j=0; j<stream_array_size; j++)
        a[j] = b[j]+scalar*c[j];
    times[3 * ntimes + k] = mysecond() - times[3 * ntimes + k];
    }
Ubuntu 20.04 native build and run:

    git submodule update --init submodules/stream-benchmark
    cd submodules/stream-benchmark
    make
    ./stream_c.exe
Sample output:
    -------------------------------------------------------------
    STREAM version $Revision: 5.10 $
    -------------------------------------------------------------
    This system uses 8 bytes per array element.
    -------------------------------------------------------------
    Array size = 10000000 (elements), Offset = 0 (elements)
    Memory per array = 76.3 MiB (= 0.1 GiB).
    Total memory required = 228.9 MiB (= 0.2 GiB).
    Each kernel will be executed 10 times.
     The *best* time for each kernel (excluding the first iteration)
     will be used to compute the reported bandwidth.
    -------------------------------------------------------------
    Number of Threads requested = 8
    Number of Threads counted = 8
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 7027 microseconds.
       (= 7027 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function    Best Rate MB/s  Avg time     Min time     Max time
    Copy:           20123.2     0.008055     0.007951     0.008267
    Scale:          20130.4     0.008032     0.007948     0.008177
    Add:            22528.8     0.010728     0.010653     0.010867
    Triad:          22448.4     0.010826     0.010691     0.011352
    -------------------------------------------------------------
    Solution Validates: avg error less than 1.000000e-13 on all three arrays
    -------------------------------------------------------------
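As a sanity check on the Best Rate column: the Copy kernel touches two arrays of 10^7 8-byte elements per iteration, i.e. 1.6 * 10^8 bytes, and 1.6 * 10^8 bytes / 0.007951 s ≈ 2.01 * 10^10 bytes/s, i.e. about 20123 MB/s with STREAM’s 10^6-byte MB, matching the reported Copy rate.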
The LKMC usage of STREAM is analogous to that of Dhrystone. Build and run on QEMU User mode simulation:
    ./build-stream --optimization-level 3
    ./run --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe"
Decrease the benchmark size and the retry count to finish simulation faster, but possibly have a less representative result:
    ./run --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe" --cli-args '100 2'
Build and run on gem5 user mode:
    ./build-stream --optimization-level 3
    ./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe" --cli-args '1000 2'
This is analogous to step debugging baremetal examples.
Related:
- https://stackoverflow.com/questions/4783404/is-main-really-start-of-a-c-program/64116561#64116561 "Is main() really start of a C++ program?"
- https://electronics.stackexchange.com/questions/258896/what-happens-before-main/404298#404298
- https://electronics.stackexchange.com/questions/55767/who-receives-the-value-returned-by-main: more microcontroller focused, should be entitled "how to quit a program in a microcontroller"
- https://stackoverflow.com/questions/53570678/what-happens-before-main-in-c "What happens before main in C++?"
- https://www.quora.com/What-is-happening-before-the-main-function-is-called-in-C++-programming
- https://stackoverflow.com/questions/2053029/how-exactly-does-attribute-constructor-work, demonstrated below
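The last question is easy to demonstrate with a minimal example of the GCC constructor and destructor attributes:

    /* Code that runs before main() is entered and after it returns.
     * Expected output order: before main / main / after main. */
    #include <stdio.h>

    __attribute__((constructor))
    static void before_main(void) { puts("before main"); }

    __attribute__((destructor))
    static void after_main(void) { puts("after main"); }

    int main(void) {
        puts("main");
        return 0;
    }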
605448f07e6380634b1aa7e9732d111759f69fd
Dhrystone -O3
gem5 --arch aarch64
4 * 10^5
68
9.2034139 * 10^7
1.6
5d233f2664a78789f9907d27e2a40e86cefad595
STREAM benchmark -O3
./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args 1000000 --trace ExecAll
3 * 10^5 * 2
64
9.9674773 * 10^7
1.6
glibc C pre-main effects
ab6f7331406b22f8ab6e2df5f8b8e464fb35b611
Let’s see if user mode runs considerably faster than full system or not, ignoring the kernel boot.
First we build Dhrystone manually and statically, since dynamic linking is broken in gem5 as explained at: Section 10.7, “gem5 syscall emulation mode”.
gem5 user mode:
    tail -n+1 ../linux-kernel-module-cheat-regression/*/gem5-bench-build-*.txt

and then copy the link command to a separate Bash file. Then you can time and modify it easily.
Some approximate reference values on a 2017 Lenovo ThinkPad P51, tested at LKMC d4b3e064adeeace3c3e7d106801f95c14637c12f + 1 (doing multiple runs to warm up disk caches):
On LKMC 220c3a434499e4713664d4a47c246cb81ee0a06a gem5 63e96992568d8a8a0dccac477b8b7f1370ac7e98 (Sep 2020), for the opt build:
- default link: 18.32user 3.99system 0:22.33elapsed 99%CPU (0avgtext+0avgdata 4622908maxresident)k
- LDFLAGS_EXTRA=-fuse-ld=lld (after a build with the default linker): 6.74user 1.81system 0:03.85elapsed 222%CPU (0avgtext+0avgdata 7025292maxresident)k
- LDFLAGS_EXTRA=-fuse-ld=gold: 7.70user 1.36system 0:09.44elapsed 95%CPU (0avgtext+0avgdata 5959152maxresident)k
- LDFLAGS_EXTRA=-fuse-ld=gold -Wl,--threads -Wl,--thread-count=8: 9.66user 1.86system 0:04.62elapsed 249%CPU (0avgtext+0avgdata 5989916maxresident)k
Arghhh, it does not use multiple threads by default… https://stackoverflow.com/questions/5142753/can-gcc-use-multiple-cores-when-linking/42302047#42302047
STREAM benchmark on Ubuntu 20.04: results at STREAM benchmark.
CoreMark-Pro d5b4f2ba7ba31e37a5aa93423831e7d5eb933868 on Ubuntu 20.04 with XCMD="-c$(nproc)":

                                                    MultiCore SingleCore
    Workload Name                                    (iter/s)   (iter/s)    Scaling
    ----------------------------------------------- ---------- ---------- ----------
    cjpeg-rose7-preset                                  769.23     175.44       4.38
    core                                                  7.98       2.11       3.78
    linear_alg-mid-100x100-sp                           892.86     233.64       3.82
    loops-all-mid-10k-sp                                 35.84       7.58       4.73
    nnet_test                                            35.09      10.05       3.49
    parser-125k                                         125.00      20.41       6.12
    radix2-big-64k                                     3278.69     630.91       5.20
    sha-test                                            625.00     227.27       2.75
    zip-test                                            615.38     166.67       3.69

    MARK RESULTS TABLE

    Mark Name                                       MultiCore SingleCore    Scaling
    ----------------------------------------------- ---------- ---------- ----------
    CoreMark-PRO                                      25016.00    6079.70       4.11