diff --git a/index.html b/index.html
index ad16af9..d8388ed 100644
--- a/index.html
+++ b/index.html
@@ -567,7 +567,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 2.9. GDB step debug multicore userland
  • 2.10. Linux kernel GDB scripts
  • 2.11. Debug the GDB remote protocol
@@ -1538,29 +1542,33 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 21.8. Benchmarks
  • -
  • 21.9. Micro benchmarks
  • -
  • 21.10. userland/libs directory +
  • 21.9. userland/libs directory
  • -
  • 21.11. Userland content filename conventions
  • -
  • 21.12. Userland content bibliography
  • +
  • 21.10. Userland content filename conventions
  • +
  • 21.11. Userland content bibliography
  • 22. Userland assembly
@@ -5826,6 +5834,202 @@ pwd
  • +
    +
    2.10.1.1. CONFIG_PID_IN_CONTEXTIDR
    +
    +

    As discussed at https://stackoverflow.com/questions/54133479/accessing-logical-software-thread-id-in-gem5, on ARM the kernel can store an indication of the PID in the CONTEXTIDR_EL1 register, which makes the PID much easier to observe from simulators.

    +
    +
    +

    In particular, gem5 prints that number out by default on ExecAll messages!

    +
    +
    +

    Let’s test it out with Linux kernel build variants + gem5 checkpoint restore and run a different script:

    +
    +
    +
    +
    ./build-linux --arch aarch64 --linux-build-id CONFIG_PID_IN_CONTEXTIDR --config 'CONFIG_PID_IN_CONTEXTIDR=y'
    +# Checkpoint run.
    +./run --arch aarch64 --emulator gem5 --linux-build-id CONFIG_PID_IN_CONTEXTIDR --eval './gem5.sh'
    +# Trace run.
    +./run \
    +  --arch aarch64 \
    +  --emulator gem5 \
    +  --gem5-readfile 'posix/getpid.out; posix/getpid.out' \
    +  --gem5-restore 1 \
    +  --linux-build-id CONFIG_PID_IN_CONTEXTIDR \
    +  --trace FmtFlag,ExecAll,-ExecSymbol \
    +;
    +
    +
    +
    +

    The terminal runs both programs which output their PID to stdout:

    +
    +
    +
    +
    pid=44
    +pid=45
    +
    +
    +
    +

    By quickly inspecting the trace.txt file, we immediately notice that the system.cpu: A<n> part of the logs, which used to always be system.cpu: A0, now has a few different values! Nice!

    +
    +
    +

    We can briefly summarize those values by removing repetitions:

    +
    +
    +
    +
    cut -d' ' -f4 "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)" | uniq -c
    +
    +
    +
    +

    gives:

    +
    +
    +
    +
      97227 A39
    + 147476 A38
    + 222052 A40
    +      1 terminal
    +1117724 A40
    +  27529 A31
    +  43868 A40
    +  27487 A31
    + 138349 A40
    +  13781 A38
    + 231246 A40
    +  25536 A38
    +  28337 A40
    + 214799 A38
    + 963561 A41
    +  92603 A38
    +  27511 A31
    + 224384 A38
    + 564949 A42
    + 182360 A38
    + 729009 A43
    +   8398 A23
    +  20200 A10
    + 636848 A43
    + 187995 A44
    +  27529 A31
    +  70071 A44
    +  16981 A0
    + 623806 A44
    +  16981 A0
    + 139319 A44
    +  24487 A0
    + 174986 A44
    +  25420 A0
    +  89611 A44
    +  16981 A0
    + 183184 A44
    +  24728 A0
    +  89608 A44
    +  17226 A0
    + 899075 A44
    +  24974 A0
    + 250608 A44
    + 137700 A43
    +1497997 A45
    + 227485 A43
    + 138147 A38
    + 482646 A46
    +
    +
    +
    +
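Since uniq -c only collapses adjacent repeats, the same ID shows up many times in the listing above. As a rough sketch, the totals per ID can be obtained by counting with awk instead, for instance to see how many trace lines ran under A44 and A45, the IDs that appear to match the PIDs printed on the terminal. The function name count_contexts is hypothetical, and the sketch assumes that the ID is the 4th whitespace-separated field, as in the cut command above:

```shell
# Hypothetical helper: total number of trace lines per context ID.
# Assumes the ID (e.g. A44) is the 4th whitespace-separated field of each line.
count_contexts() {
  awk '{ count[$4]++ } END { for (id in count) print count[id], id }' |
    sort -k1,1nr -k2,2
}

# Usage sketch, reading the same trace file as the cut command above:
# count_contexts < "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"
```

The output is sorted by decreasing line count, so the most active context IDs come first.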

    I’m not smart enough to be able to deduce all of those IDs, but we can at least see that:

    +
    +
    + +
    +
    +

    Or if you want to have some real fun, try: baremetal/arch/aarch64/contextidr_el1.c:

    +
    +
    +
    +
    ./run --arch aarch64 --emulator gem5 --baremetal baremetal/arch/aarch64/contextidr_el1.c --trace-insts-stdout
    +
    +
    +
    +

    in which we directly set the register ourselves! Output excerpt:

    +
    +
    +
    +
      31500: system.cpu: A0 T0 : @main+12    :   ldr   x0, [sp, #12]      : MemRead :  D=0x0000000000000001 A=0x82fffffc  flags=(IsInteger|IsMemRef|IsLoad)
    +  32000: system.cpu: A1 T0 : @main+16    :   msr   contextidr_el1, x0 : IntAlu :  D=0x0000000000000001  flags=(IsInteger|IsSerializeAfter|IsNonSpeculative)
    +  32500: system.cpu: A1 T0 : @main+20    :   ldr   x0, [sp, #12]      : MemRead :  D=0x0000000000000001 A=0x82fffffc  flags=(IsInteger|IsMemRef|IsLoad)
    +  33000: system.cpu: A1 T0 : @main+24    :   add   w0, w0, #1         : IntAlu :  D=0x0000000000000002  flags=(IsInteger)
    +  33500: system.cpu: A1 T0 : @main+28    :   str   x0, [sp, #12]      : MemWrite :  D=0x0000000000000002 A=0x82fffffc  flags=(IsInteger|IsMemRef|IsStore)
    +  34000: system.cpu: A1 T0 : @main+32    :   ldr   x0, [sp, #12]      : MemRead :  D=0x0000000000000002 A=0x82fffffc  flags=(IsInteger|IsMemRef|IsLoad)
    +  34500: system.cpu: A1 T0 : @main+36    :   subs   w0, #9            : IntAlu :  D=0x0000000000000000  flags=(IsInteger)
    +  35000: system.cpu: A1 T0 : @main+40    :   b.le   <main+12>         : IntAlu :   flags=(IsControl|IsDirectControl|IsCondControl)
    +  35500: system.cpu: A1 T0 : @main+12    :   ldr   x0, [sp, #12]      : MemRead :  D=0x0000000000000002 A=0x82fffffc  flags=(IsInteger|IsMemRef|IsLoad)
    +  36000: system.cpu: A2 T0 : @main+16    :   msr   contextidr_el1, x0 : IntAlu :  D=0x0000000000000002  flags=(IsInteger|IsSerializeAfter|IsNonSpeculative)
    +  36500: system.cpu: A2 T0 : @main+20    :   ldr   x0, [sp, #12]      : MemRead :  D=0x0000000000000002 A=0x82fffffc  flags=(IsInteger|IsMemRef|IsLoad)
    +  37000: system.cpu: A2 T0 : @main+24    :   add   w0, w0, #1         : IntAlu :  D=0x0000000000000003  flags=(IsInteger)
    +  37500: system.cpu: A2 T0 : @main+28    :   str   x0, [sp, #12]      : MemWrite :  D=0x0000000000000003 A=0x82fffffc  flags=(IsInteger|IsMemRef|IsStore)
    +  38000: system.cpu: A2 T0 : @main+32    :   ldr   x0, [sp, #12]      : MemRead :  D=0x0000000000000003 A=0x82fffffc  flags=(IsInteger|IsMemRef|IsLoad)
    +  38500: system.cpu: A2 T0 : @main+36    :   subs   w0, #9            : IntAlu :  D=0x0000000000000000  flags=(IsInteger)
    +  39000: system.cpu: A2 T0 : @main+40    :   b.le   <main+12>         : IntAlu :   flags=(IsControl|IsDirectControl|IsCondControl)
    +  39500: system.cpu: A2 T0 : @main+12    :   ldr   x0, [sp, #12]      : MemRead :  D=0x0000000000000003 A=0x82fffffc  flags=(IsInteger|IsMemRef|IsLoad)
    +  40000: system.cpu: A3 T0 : @main+16    :   msr   contextidr_el1, x0 : IntAlu :  D=0x0000000000000003  flags=(IsInteger|IsSerializeAfter|IsNonSpeculative)
    +
    +
    +
    +

    ARMv8 architecture reference manual db D13.2.27 "CONTEXTIDR_EL1, Context ID Register (EL1)" documents CONTEXTIDR_EL1 as:

    +
    +
    +
    +
    +

    Identifies the current Process Identifier.

    +
    +
    +

    The value of the whole of this register is called the Context ID and is used by:

    +
    +
    +
      +
    • +

      The debug logic, for Linked and Unlinked Context ID matching.

      +
    • +
    • +

      The trace logic, to identify the current process.

      +
    • +
    +
    +
    +

    The significance of this register is for debug and trace use only.

    +
    +
    +
    +
    +

    Tested on 145769fc387dc5ee63ec82e55e6b131d9c968538 + 1.

    +
    +
    @@ -14108,7 +14312,17 @@ pid 63
    -

    Source: userland/linux/pagemap_dump.c

    +

    Source:

    +
    +
    +

    Adapted from: https://github.com/dwks/pagemap/blob/8a25747bc79d6080c8b94eac80807a4dceeda57a/pagemap2.c

    @@ -20981,6 +21195,17 @@ system.cpu.dtb.inst_hits

    We also note, however, that the stat dump made such a simulation, which just loops and dumps, considerably slower: from 3s to 15s on P51. Fascinating, we are definitely not disk bound there.

    +
    +

    We enable HDF5 on the build by default with USE_HDF5=1. To disable it, you can add USE_HDF5=0 to the build as in:

    +
    +
    +
    +
    ./build-gem5 -- USE_HDF5=0
    +
    +
    +
    +

    Library support is automatically detected, and only built if you have it installed. But there have been some compilation bugs with HDF5, which is why you might want to turn it off sometimes, e.g.: https://gem5.atlassian.net/browse/GEM5-365

    +
    19.9.3.2. gem5 only dump selected stats
    @@ -28252,7 +28477,8 @@ build/ARM/config/the_isa.hh
    -
    sudo apt install libantlr3c-dev
    +
    git submodule update --init submodules/gensim-simulator
    +sudo apt install libantlr3c-dev
     cd submodule/gensim
     make
    @@ -28685,7 +28911,7 @@ make menuconfig

    Also mentioned at: https://stackoverflow.com/questions/47320800/how-to-clean-only-target-in-buildroot

    -

    See this for a sample manual workaround: Section 21.8.5.4, “PARSEC uninstall”.

    +

    See this for a sample manual workaround: Section 21.8.1.4, “PARSEC uninstall”.

    @@ -31425,23 +31651,503 @@ cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans,3,3,2 ,1, A,3, B,
    -

    21.8.1. Boost

    +

    21.8.1. PARSEC benchmark

    -

    https://en.wikipedia.org/wiki/Boost_(C%2B%2B_libraries)

    +

    We have ported parts of the PARSEC benchmark for cross compilation at: https://github.com/cirosantilli/parsec-benchmark See the documentation on that repo to find out which benchmarks have been ported. Some of the benchmarks are segfaulting; they are documented in that repo.

    -

    userland/libs/boost

    +

    There are two ways to run PARSEC with this repo:

    +
    +
    +
    21.8.1.1. PARSEC benchmark without parsecmgmt
    +
    +
    +
    ./build --arch arm --download-dependencies gem5-buildroot parsec-benchmark
    +./build-buildroot --arch arm --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y'
    +./run --arch arm --emulator gem5
    +
    +
    +
    +

    Once inside the guest, launch one of the test input sized benchmarks manually as in:

    +
    +
    +
    +
    cd /parsec/ext/splash2x/apps/fmm/run
    +../inst/arm-linux.gcc/bin/fmm 1 < input_1
    +
    +
    +
    +

    To find out how to run many of the benchmarks, have a look at the test.sh script of the parsec-benchmark repo.

    +
    +
    +

    From the guest, you can also run it as:

    +
    +
    +
    +
    cd /parsec
    +./test.sh
    +
    +
    +
    +

    but this might be a bit time consuming in gem5.

    +
    +
    +
    +
    21.8.1.2. PARSEC change the input size
    +
    +

    Running a benchmark of a size different than test, e.g. simsmall, requires a rebuild with:

    +
    +
    +
    +
    ./build-buildroot \
    +  --arch arm \
    +  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
    +  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_INPUT_SIZE="simsmall"' \
    +  -- parsec_benchmark-reconfigure \
    +;
    +
    +
    +
    +

    Large input may also require tweaking:

    +
    +
    + +
    +
    +

    test.sh only contains the run commands for the test size, and cannot be used for simsmall.

    +
    +
    +

    The easiest thing to do is to scroll up on the host shell after the build, and look for a line of type:

    +
    +
    +
    +
    Running /root/linux-kernel-module-cheat/out/aarch64/buildroot/build/parsec-benchmark-custom/ext/splash2x/apps/ocean_ncp/inst/aarch64-linux.gcc/bin/ocean_ncp -n2050 -p1 -e1e-07 -r20000 -t28800
    +
    +
    +
    +

    and then tweak the command found in test.sh accordingly.

    +
    +
    +

    Yes, we do run the benchmarks on the host just to unpack / generate inputs. They are expected to fail to run since they were built for the guest instead of the host, including for an x86_64 guest, which has a different interpreter than the host’s (see file myexecutable).

    +
    +
    +

    The rebuild is required because we unpack input files on the host.

    +
    +
    +

    Separating input sizes also allows creating smaller images when only running the smaller benchmarks.

    +
    +
    +

    This limitation exists because parsecmgmt generates the input files just before running via the Bash scripts, but we can’t run parsecmgmt on gem5 as it is too slow!

    +
    +
    +

    One option would be to do that inside the guest with QEMU.

    +
    +
    +

    Also, we can’t generate all input sizes at once, because many of them have the same name and would overwrite one another…​

    +
    +
    +

    PARSEC simply wasn’t designed with non native machines in mind…​

    +
    +
    +
    +
    21.8.1.3. PARSEC benchmark with parsecmgmt
    +
    +

    Most users won’t want to use this method because:

    +
    +
    +
      +
    • +

      running the parsecmgmt Bash scripts takes forever before it ever starts running the actual benchmarks on gem5

      +
      +

      Running on QEMU is feasible, but not the main use case, since QEMU cannot be used for performance measurements

      +
      +
    • +
    • +

      it requires putting the full .tar inputs on the guest, which makes the image twice as large (1x for the .tar, 1x for the unpacked input files)

      +
    • +
    +
    +
    +

    It would be awesome if it were possible to use this method, since this is what PARSEC supports officially, and so:

    +
    +
    +
      +
    • +

      you don’t have to dig into what raw command to run

      +
    • +
    • +

      there is an easy way to run all the benchmarks in one go to test them out

      +
    • +
    • +

      you can just run any of the benchmarks that you want

      +
    • +
    +
    +
    +

    but it simply is not feasible in gem5 because it takes too long.

    +
    +
    +

    If you still want to run this, try it out with:

    +
    +
    +
    +
    ./build-buildroot \
    +  --arch aarch64 \
    +  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
    +  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_PARSECMGMT=y' \
    +  --config 'BR2_TARGET_ROOTFS_EXT2_SIZE="3G"' \
    +  -- parsec_benchmark-reconfigure \
    +;
    +
    +
    +
    +

    And then you can run it just as you would on the host:

    +
    +
    +
    +
    cd /parsec/
    +bash
    +. env.sh
    +parsecmgmt -a run -p splash2x.fmm -i test
    +
    +
    +
    +
    +
    21.8.1.4. PARSEC uninstall
    +
    +

    If you want to remove PARSEC later, Buildroot doesn’t provide an automated package removal mechanism as mentioned at: Section 20.6, “Remove Buildroot packages”, but the following procedure should be satisfactory:

    +
    +
    +
    +
    rm -rf \
    +  "$(./getvar buildroot_download_dir)"/parsec-* \
    +  "$(./getvar buildroot_build_dir)"/build/parsec-* \
    +  "$(./getvar buildroot_build_dir)"/build/packages-file-list.txt \
    +  "$(./getvar buildroot_build_dir)"/images/rootfs.* \
    +  "$(./getvar buildroot_build_dir)"/target/parsec-* \
    +;
    +./build-buildroot --arch arm
    +
    +
    +
    +
    +
    21.8.1.5. PARSEC benchmark hacking
    +
    +

    If you end up going inside submodules/parsec-benchmark to hack up the benchmark (you will!), these tips will be helpful.

    +
    +
    +

    Buildroot was not designed to deal with large images, and currently cross rebuilds are a bit slow, due to some image generation and validation steps.

    +
    +
    +

    A few workarounds are:

    +
    +
    +
      +
    • +

      develop on the host first as much as you can. Our PARSEC fork supports it.

      +
      +

      If you do this, don’t forget to do a:

      +
      +
      +
      +
      cd "$(./getvar parsec_source_dir)"
      +git clean -xdf .
      +
      +
      +
      +

      before going for the cross compile build.

      +
      +
    • +
    • +

      patch Buildroot to work well, and keep cross compiling all the way. This should be totally viable, and we should do it.

      +
      +

      Don’t forget to explicitly rebuild PARSEC with:

      +
      +
      +
      +
      ./build-buildroot \
      +  --arch arm \
      +  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
      +  -- parsec_benchmark-reconfigure \
      +;
      +
      +
      +
      +

      You may also want to test if your patches are still functionally correct inside of QEMU first, which is a faster emulator.

      +
      +
    • +
    • +

      sell your soul, and compile natively inside the guest. We won’t do this, not only because it is evil, but also because Buildroot explicitly does not support it: https://buildroot.org/downloads/manual/manual.html#faq-no-compiler-on-target ARM employees have been known to do this: https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/qemu-patch.diff

    +
    +
    21.8.1.6. Coremark
    + +
    +

    Part of EEMBC.

    +
    +
    +

    They have two versions:

    +
    +
    + +
    +
    +

    Both have a custom license, so yeah, no patience to read this stuff.

    +
    +
    +

    Coremark-pro build and run on Ubuntu 20.04:

    +
    +
    +
    +
    git submodule update --init submodules/coremark-pro
    +cd submodules/coremark-pro
    +make TARGET=linux64 build
    +make TARGET=linux64 XCMD='-c4' certify-all
    +
    +
    +
    +

    This uses 4 contexts. TODO what are contexts? Are they the same as threads?

    +
    +
    +

    Finishes in a few seconds, P51 results:

    +
    +
    +
    +
    Workload Name                                     (iter/s)   (iter/s)    Scaling
    +----------------------------------------------- ---------- ---------- ----------
    +cjpeg-rose7-preset                                  526.32     178.57       2.95
    +core                                                  7.39       2.16       3.42
    +linear_alg-mid-100x100-sp                           684.93     238.10       2.88
    +loops-all-mid-10k-sp                                 27.65       7.80       3.54
    +nnet_test                                            32.79      10.57       3.10
    +parser-125k                                          71.43      25.00       2.86
    +radix2-big-64k                                     2320.19     623.44       3.72
    +sha-test                                            555.56     227.27       2.44
    +zip-test                                            363.64     166.67       2.18
    +
    +MARK RESULTS TABLE
    +
    +Mark Name                                        MultiCore SingleCore    Scaling
    +----------------------------------------------- ---------- ---------- ----------
    +CoreMark-PRO                                      18743.79    6306.76       2.97
    +
    +
    +
    +

    Scaling appears to be the ratio between multicore (4 contexts due to -c4) and single core performance: each benchmark gets run twice, once multicore and once single core.

    +
    +
    +
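As a quick sanity check of that reading, the Scaling column can be recomputed from the two iter/s columns of the table above. This is only a sketch: the helper name scaling is hypothetical, and the numbers are copied from the cjpeg-rose7-preset row and from the CoreMark-PRO mark row:

```shell
# Hypothetical helper: ratio of multicore to single core throughput,
# rounded to two decimals like the Scaling column.
scaling() {
  awk -v multi="$1" -v single="$2" 'BEGIN { printf "%.2f\n", multi / single }'
}

scaling 526.32 178.57     # cjpeg-rose7-preset row
scaling 18743.79 6306.76  # CoreMark-PRO mark row
```

Both calls reproduce the values in the Scaling column (2.95 and 2.97), which is consistent with scaling being a plain multicore / single core ratio.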

    The tester script also outputs test commands, some of which are:

    +
    +
    +
    +
    builds/linux64/gcc64/bin/zip-test.exe -c1 -w1 -c4 -v1
    +builds/linux64/gcc64/bin/zip-test.exe -c1 -w1 -c4 -v0
    +builds/linux64/gcc64/bin/zip-test.exe -c4 -v1
    +builds/linux64/gcc64/bin/zip-test.exe -c4 -v0
    +
    +
    +
    +

    -v1 appears to be a fast verification run, and both -c1 and -c4 get run to compare single vs multicore performance.

    +
    +
    +

    Sample -c4 -v0 output:

    +
    +
    +
    +
    -  Info: Starting Run...
    +-- Workload:zip-test=946108807
    +-- zip-test:time(ns)=11
    +-- zip-test:contexts=4
    +-- zip-test:iterations=4
    +-- zip-test:time(secs)=   0.011
    +-- zip-test:secs/workload= 0.00275
    +-- zip-test:workloads/sec= 363.636
    +-- Done:zip-test=946108807
    +
    +
    +
    +

    and so we see the zip-test:workloads/sec= 363.636 output is the key value, which is close to that of the zip-test 363.64 in the earlier full summarized result.

    +
    +
    +
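To pull out just that key value when comparing runs, a small filter over the benchmark output may help. This is a sketch under the assumption that the output format matches the sample above; the function name workloads_per_sec is hypothetical:

```shell
# Hypothetical helper: print the workloads/sec value found in benchmark
# output read from stdin. Assumes lines such as:
#   -- zip-test:workloads/sec= 363.636
workloads_per_sec() {
  sed -n 's/^.*:workloads\/sec= *\([0-9.][0-9.]*\).*$/\1/p'
}

# Usage sketch, from inside submodules/coremark-pro:
# builds/linux64/gcc64/bin/zip-test.exe -c4 -v0 | workloads_per_sec
```

Lines that do not contain a workloads/sec field are ignored, so the whole program output can be piped in directly.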

    Cross compile statically for aarch64. From LKMC toplevel:

    +
    +
    +
    +
    make \
    +  -C submodules/coremark-pro \
    +  LINKER_FLAGS='-static' \
    +  LINKER_LAST='-lm -lpthread -lrt' \
    +  TARGET=gcc-cross-linux \
    +  TOOLCHAIN=gcc-cross-linux \
    +  TOOLS="$(./getvar --arch aarch64 buildroot_host_usr_dir)" \
    +  TPREF="$(./getvar --arch aarch64 buildroot_toolchain_prefix)-" \
    +  build \
    +;
    +
    +
    +
    +

    Run a single executable on QEMU:

    +
    +
    +
    +
    ./run --arch aarch64 --userland submodules/coremark-pro/builds/gcc-cross-linux/bin/zip-test.exe --cli-args='-c4 -v0'
    +
    +
    +
    +

    Finishes in about 1 second, and gives zip-test:workloads/sec= 74.0741 so we see that it ran about 5x slower than the native host.

    +
    +
    +

    Run a single executable on gem5 in a verification run:

    +
    +
    +
    +
    ./run \
    +  --arch aarch64 \
    +  --cli-args='-c1 -v1' \
    +  --emulator gem5 \
    +  --userland submodules/coremark-pro/builds/gcc-cross-linux/bin/zip-test.exe \
    +;
    +
    +
    +
    +

    TODO: hangs for at least 15 minutes, there must be something wrong. Stuck on an evolving strlen loop:

    +
    +
    +
    +
    7837834500: system.cpu: A0 T0 : @__strlen_generic+112    : ldp
    +7837834500: system.cpu: A0 T0 : @__strlen_generic+112. 0 :   addxi_uop   ureg0, x1, #16 : IntAlu :  D=0x0000003ffff07170  flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
    +7837835000: system.cpu: A0 T0 : @__strlen_generic+112. 1 :   ldp_uop   x2, x3, [ureg0] : MemRead :  D=0x20703c0a3e702f3c A=0x3ffff07170  flags=(IsInteger|IsMemRef|IsLoad|IsMicroop|IsLastMicroop)
    +7837835500: system.cpu: A0 T0 : @__strlen_generic+116    :   sub   x4, x2, x8         : IntAlu :  D=0x3d607360632e3b34  flags=(IsInteger)
    +7837836000: system.cpu: A0 T0 : @__strlen_generic+120    :   sub   x6, x3, x8         : IntAlu :  D=0x1f6f3b093d6f2e3b  flags=(IsInteger)
    +7837836500: system.cpu: A0 T0 : @__strlen_generic+124    :   orr   x5, x4, x6         : IntAlu :  D=0x3f6f7b697f6f3f3f  flags=(IsInteger)
    +7837837000: system.cpu: A0 T0 : @__strlen_generic+128    :   ands   x5, x8, LSL #7    : IntAlu :  D=0x0000000000000000  flags=(IsInteger)
    +7837837500: system.cpu: A0 T0 : @__strlen_generic+132    :   b.eq   <__strlen_generic+88> : IntAlu :   flags=(IsControl|IsDirectControl|IsCondControl)
    +7837838000: system.cpu: A0 T0 : @__strlen_generic+88    : ldp
    +7837838000: system.cpu: A0 T0 : @__strlen_generic+88. 0 :   addxi_uop   ureg0, x1, #32 : IntAlu :  D=0x0000003ffff07180  flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
    +7837838500: system.cpu: A0 T0 : @__strlen_generic+88. 1 :   ldp_uop   x2, x3, [ureg0] : MemRead :  D=0x6565686b636f4c27 A=0x3ffff07180  flags=(IsInteger|IsMemRef|IsLoad|IsMicroop|IsDelayedCommit)
    +7837839000: system.cpu: A0 T0 : @__strlen_generic+88. 2 :   addxi_uop   x1, ureg0, #0 : IntAlu :  D=0x0000003ffff07180  flags=(IsInteger|IsMicroop|IsLastMicroop)
    +7837839500: system.cpu: A0 T0 : @__strlen_generic+92    :   sub   x4, x2, x8         : IntAlu :  D=0x3c786d606f6c6e62  flags=(IsInteger)
    +7837840000: system.cpu: A0 T0 : @__strlen_generic+96    :   sub   x6, x3, x8         : IntAlu :  D=0x6464676a626e4b26  flags=(IsInteger)
    +7837840500: system.cpu: A0 T0 : @__strlen_generic+100    :   orr   x5, x4, x6         : IntAlu :  D=0x7c7c6f6a6f6e6f66  flags=(IsInteger)
    +7837841000: system.cpu: A0 T0 : @__strlen_generic+104    :   ands   x5, x8, LSL #7    : IntAlu :  D=0x0000000000000000  flags=(IsInteger)
    +
    +
    +
    +

    Instructions before __strlen_generic starts:

    +
    +
    +
    +
    7831019000: system.cpu: A0 T0 : @define_params_zip+664    :   add   x1, sp, #168       : IntAlu :  D=0x0000007ffffef988  flags=(IsInteger)
    +7831019500: system.cpu: A0 T0 : @define_params_zip+668    :   orr   x0, xzr, x24       : IntAlu :  D=0x0000003ffff00010  flags=(IsInteger)
    +7831020000: system.cpu: A0 T0 : @define_params_zip+672    :   bl   <th_strcat>         : IntAlu :  D=0x000000000040a4c4  flags=(IsInteger|IsControl|IsDirectControl|IsUncondControl|IsCall)
    +7831020500: system.cpu: A0 T0 : @th_strcat    :   b   <strcat>             : IntAlu :   flags=(IsControl|IsDirectControl|IsUncondControl)
    +7831021000: system.cpu: A0 T0 : @strcat    : stp
    +7831021000: system.cpu: A0 T0 : @strcat. 0 :   addxi_uop   ureg0, sp, #-48 : IntAlu :  D=0x0000007ffffef8b0  flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
    +7831021500: system.cpu: A0 T0 : @strcat. 1 :   strxi_uop   x29, [ureg0] : MemWrite :  D=0x0000007ffffef8e0 A=0x7ffffef8b0  flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsDelayedCommit)
    +7831022000: system.cpu: A0 T0 : @strcat. 2 :   strxi_uop   x30, [ureg0, #8] : MemWrite :  D=0x000000000040a4c4 A=0x7ffffef8b8  flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsDelayedCommit)
    +7831022500: system.cpu: A0 T0 : @strcat. 3 :   addxi_uop   sp, ureg0, #0 : IntAlu :  D=0x0000007ffffef8b0  flags=(IsInteger|IsMicroop|IsLastMicroop)
    +7831023000: system.cpu: A0 T0 : @strcat+4    :   add   x29, sp, #0        : IntAlu :  D=0x0000007ffffef8b0  flags=(IsInteger)
    +7831023500: system.cpu: A0 T0 : @strcat+8    :   str   x19, [sp, #16]     : MemWrite :  D=0x00000000004d6560 A=0x7ffffef8c0  flags=(IsInteger|IsMemRef|IsStore)
    +7831024000: system.cpu: A0 T0 : @strcat+12    :   orr   x19, xzr, x0       : IntAlu :  D=0x0000003ffff00010  flags=(IsInteger)
    +7831024500: system.cpu: A0 T0 : @strcat+16    :   str   x1, [sp, #40]      : MemWrite :  D=0x0000007ffffef988 A=0x7ffffef8d8  flags=(IsInteger|IsMemRef|IsStore)
    +7831025000: system.cpu: A0 T0 : @strcat+20    :   bl   <_init+120>         : IntAlu :  D=0x00000000004464c8  flags=(IsInteger|IsControl|IsDirectControl|IsUncondControl|IsCall)
    +7831025500: system.cpu: A0 T0 : @_init+120    :   adrp   x16, #835584      : IntAlu :  D=0x00000000004cc000  flags=(IsInteger)
    +7831026000: system.cpu: A0 T0 : @_init+124    :   ldr   x17, [x16, #48]    : MemRead :  D=0x0000000000449680 A=0x4cc030  flags=(IsInteger|IsMemRef|IsLoad)
    +7831026500: system.cpu: A0 T0 : @_init+128    :   add   x16, x16, #48      : IntAlu :  D=0x00000000004cc030  flags=(IsInteger)
    +7831027000: system.cpu: A0 T0 : @_init+132    :   br   x17                 : IntAlu :   flags=(IsInteger|IsControl|IsIndirectControl|IsUncondControl)
    +
    +
    +
    +

    Their build/run system is nice, it even supports user mode simulators out-of-the-box! TODO give it a shot. See:

    +
    +
    +
    +
    RUN =
    +RUN_FLAGS =
    +
    +
    +
    +

    under util/make/linux64.mak.

    +
    +
    +

    Tested on a7ae8e6a8e29ef46d79eb9178d8599d1faeea0e5 + 1.

    +
    +
    +
    -

    21.8.2. Dhrystone

    +

    21.8.2. Microbenchmarks

    +
    +

    It eventually has to come to that, hasn’t it?

    +
    +
    + +
    +
    +

    Of course, there is a continuum between what is a "microbenchmark" and a "macrobenchmark".

    +
    +
    +

    One would hope that every microbenchmark exercises a concentrated subset of part of an important macro benchmark, otherwise what’s the point, right?

    +
    +
    +

    Also, for a parametrized "macro benchmark", you can always in theory reduce the problem size to be so small that it might be more appropriate to call it a micro benchmark.

    +
    +
    +

    So our working definition will be more of the type: "does it solve an understandable useful high level problem from start to end?".

    +
    +
    +

    If the answer is yes, then we call it a macro benchmark, otherwise micro.

    +
    +
    +

    Bibliography:

    +
    +
    + +
    +
    +
    21.8.2.1. Dhrystone
    @@ -31506,7 +32212,16 @@ cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans,3,3,2 ,1, A,3, B,
    ./build-dhrystone --host
    -"$(./getvar --host userland_build_dir)/submodules/dhrystone/dhrystone"
    +"$(./getvar --host userland_build_dir)/submodules/dhrystone/dhrystone" 1000000000 +
    +
    +
    +

    Output for P51 Ubuntu 20.04:

    +
    +
    +
    +
    Microseconds for one run through Dhrystone:    0.1
    +Dhrystones per Second:                      16152479.0
    @@ -31548,8 +32263,8 @@ cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans,3,3,2 ,1, A,3, B,
    -
    -

    21.8.3. LMbench

    +
    +
    21.8.2.2. LMbench
    @@ -31666,8 +32381,8 @@ make

    Interestingly, one of the creators of LMbench, Larry McVoy (https://www.linkedin.com/in/larrymcvoy/, https://en.wikipedia.org/wiki/Larry_McVoy), is also a co-founder of BitKeeper. Their SCM must be blazingly fast!!! Also his LinkedIn says Intel uses it. But they will forever be remembered as "the closed source Git precursor that died N years ago", RIP.

    -
    -

    21.8.4. STREAM benchmark

    +
    +
    21.8.2.3. STREAM benchmark
    @@ -31740,272 +32455,10 @@ times[3 * ntimes + k] = mysecond() - times[3 * ntimes + k];
    -
    -

    21.8.5. PARSEC benchmark

    -
    -

    We have ported parts of the PARSEC benchmark for cross compilation at: https://github.com/cirosantilli/parsec-benchmark See the documentation on that repo to find out which benchmarks have been ported. Some of the benchmarks were are segfaulting, they are documented in that repo.

    -
    -
    -

    There are two ways to run PARSEC with this repo:

    -
    -
    - -
    -
    -
    21.8.5.1. PARSEC benchmark without parsecmgmt
    -
    -
    -
    ./build --arch arm --download-dependencies gem5-buildroot parsec-benchmark
    -./build-buildroot --arch arm --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y'
    -./run --arch arm --emulator gem5
    -
    -
    -
    -

    Once inside the guest, launch one of the test input sized benchmarks manually as in:

    -
    -
    -
    -
    cd /parsec/ext/splash2x/apps/fmm/run
    -../inst/arm-linux.gcc/bin/fmm 1 < input_1
    -
    -
    -
    -

    To find run out how to run many of the benchmarks, have a look at the test.sh script of the parse-benchmark repo.

    -
    -
    -

    From the guest, you can also run it as:

    -
    -
    -
    -
    cd /parsec
    -./test.sh
    -
    -
    -
    -

    but this might be a bit time consuming in gem5.

    -
    -
    -
    -
    21.8.5.2. PARSEC change the input size
    -
    -

    Running a benchmark of a size different than test, e.g. simsmall, requires a rebuild with:

    -
    -
    -
    -
    ./build-buildroot \
    -  --arch arm \
    -  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
    -  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_INPUT_SIZE="simsmall"' \
    -  -- parsec_benchmark-reconfigure \
    -;
    -
    -
    -
    -

    Large input may also require tweaking:

    -
    -
    - -
    -
    -

    test.sh only contains the run commands for the test size, and cannot be used for simsmall.

    -
    -
    -

    The easiest thing to do, is to scroll up on the host shell after the build, and look for a line of type:

    -
    -
    -
    -
    Running /root/linux-kernel-module-cheat/out/aarch64/buildroot/build/parsec-benchmark-custom/ext/splash2x/apps/ocean_ncp/inst/aarch64-linux.gcc/bin/ocean_ncp -n2050 -p1 -e1e-07 -r20000 -t28800
    -
    -
    -
    -

    and then tweak the command found in test.sh accordingly.

    -
    -
    -

    Yes, we do run the benchmarks on host just to unpack / generate inputs. They are expected fail to run since they were build for the guest instead of host, including for x86_64 guest which has a different interpreter than the host’s (see file myexecutable).

    -
    -
    -

    The rebuild is required because we unpack input files on the host.

    -
    -
    -

    Separating input sizes also allows to create smaller images when only running the smaller benchmarks.

    -
    -
    -

    This limitation exists because parsecmgmt generates the input files just before running via the Bash scripts, but we can’t run parsecmgmt on gem5 as it is too slow!

    -
    -
    -

    One option would be to do that inside the guest with QEMU.

    -
    -
    -

    Also, we can’t generate all input sizes at once, because many of them have the same name and would overwrite one another…​

    -
    -
    -

    PARSEC simply wasn’t designed with non native machines in mind…​

    -
    -
    -
    -
    21.8.5.3. PARSEC benchmark with parsecmgmt
    -
    -

    Most users won’t want to use this method because:

    -
    -
    -
      -
    • -

      running the parsecmgmt Bash scripts takes forever before it ever starts running the actual benchmarks on gem5

      -
      -

      Running on QEMU is feasible, but not the main use case, since QEMU cannot be used for performance measurements

      -
      -
    • -
    • -

      it requires putting the full .tar inputs on the guest, which makes the image twice as large (1x for the .tar, 1x for the unpacked input files)

      -
    • -
    -
    -
    -

    It would be awesome if it were possible to use this method, since this is what PARSEC supports officially, and so:

    -
    -
    -
      -
    • -

      you don’t have to dig into what raw command to run

      -
    • -
    • -

      there is an easy way to run all the benchmarks in one go to test them out

      -
    • -
    • -

      you can just run any of the benchmarks that you want

      -
    • -
    -
    -
    -

    but it simply is not feasible in gem5 because it takes too long.

    -
    -
    -

    If you still want to run this, try it out with:

    -
    -
    -
    -
    ./build-buildroot \
    -  --arch aarch64 \
    -  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
    -  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_PARSECMGMT=y' \
    -  --config 'BR2_TARGET_ROOTFS_EXT2_SIZE="3G"' \
    -  -- parsec_benchmark-reconfigure \
    -;
    -
    -
    -
    -

    And then you can run it just as you would on the host:

    -
    -
    -
    -
    cd /parsec/
    -bash
    -. env.sh
    -parsecmgmt -a run -p splash2x.fmm -i test
    -
    -
    -
    -
    -
    21.8.5.4. PARSEC uninstall
    -
    -

    If you want to remove PARSEC later, Buildroot doesn’t provide an automated package removal mechanism as mentioned at: Section 20.6, “Remove Buildroot packages”, but the following procedure should be satisfactory:

    -
    -
    -
    -
    rm -rf \
    -  "$(./getvar buildroot_download_dir)"/parsec-* \
    -  "$(./getvar buildroot_build_dir)"/build/parsec-* \
    -  "$(./getvar buildroot_build_dir)"/build/packages-file-list.txt \
    -  "$(./getvar buildroot_build_dir)"/images/rootfs.* \
    -  "$(./getvar buildroot_build_dir)"/target/parsec-* \
    -;
    -./build-buildroot --arch arm
    -
    -
    -
    -
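Before running the rm -rf from the uninstall procedure, it may be worth previewing which paths the globs actually match. This is a minimal sketch. preview_removal is a hypothetical helper, not an LKMC script:

```shell
# Hypothetical helper: print every existing path among the arguments,
# so the list can be reviewed before passing the same globs to rm -rf.
preview_removal() {
  for p in "$@"; do
    # A glob that matches nothing is passed through literally; skip it.
    if [ -e "$p" ]; then
      printf '%s\n' "$p"
    fi
  done
}

# Usage from the LKMC checkout, with the same globs as the rm -rf command:
# preview_removal \
#   "$(./getvar buildroot_download_dir)"/parsec-* \
#   "$(./getvar buildroot_build_dir)"/build/parsec-* \
#   "$(./getvar buildroot_build_dir)"/target/parsec-* \
# ;
```

The helper only lists paths and deletes nothing, so it is safe to run repeatedly while adjusting the globs.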
    -
    21.8.5.5. PARSEC benchmark hacking
    -
    -

    If you end up going inside submodules/parsec-benchmark to hack up the benchmark (you will!), these tips will be helpful.

    -
    -
    -

    Buildroot was not designed to deal with large images, and currently cross rebuilds are a bit slow, due to some image generation and validation steps.

    -
    -
    -

    A few workarounds are:

    -
    -
    -
      -
    • -

      develop on the host first as much as you can. Our PARSEC fork supports it.

      -
      -

      If you do this, don’t forget to do a:

      -
      -
      -
      -
      cd "$(./getvar parsec_source_dir)"
      -git clean -xdf .
      -
      -
      -
      -

      before going for the cross compile build.

      -
      -
    • -
    • -

      patch Buildroot to work well, and keep cross compiling all the way. This should be totally viable, and we should do it.

      -
      -

      Don’t forget to explicitly rebuild PARSEC with:

      -
      -
      -
      -
      ./build-buildroot \
      -  --arch arm \
      -  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
      -  -- parsec_benchmark-reconfigure \
      -;
      -
      -
      -
      -

      You may also want to test if your patches are still functionally correct inside of QEMU first, which is a faster emulator.

      -
      -
    • -
    • -

      sell your soul, and compile natively inside the guest. We won’t do this, not only because it is evil, but also because Buildroot explicitly does not support it: https://buildroot.org/downloads/manual/manual.html#faq-no-compiler-on-target (although ARM employees have been known to do this: https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/qemu-patch.diff )

      -
    • -
    -
    -
    -

    21.9. Micro benchmarks

    -
    -

    It eventually has to come to that, hasn’t it?

    -
    -
    - -
    -
    -
    -

    21.10. userland/libs directory

    +

    21.9. userland/libs directory

    Tests under userland/libs require certain optional libraries to be installed on the target, and are not built or tested by default. You must enable them with either:

    @@ -32019,7 +32472,23 @@ git clean -xdf .

    See for example BLAS.

    -

    21.10.1. HDF5

    +

    21.9.1. Boost

    +
    +

    https://en.wikipedia.org/wiki/Boost_(C%2B%2B_libraries)

    +
    +
    +

    userland/libs/boost

    +
    +
    + +
    +
    +
    +

    21.9.2. HDF5

    https://en.wikipedia.org/wiki/Hierarchical_Data_Format

    @@ -32042,7 +32511,7 @@ git clean -xdf .
    -

    21.11. Userland content filename conventions

    +

    21.10. Userland content filename conventions

    The following basenames should always refer to programs that do the same thing, but in different languages:

    @@ -32071,7 +32540,7 @@ git clean -xdf .
    -

    21.12. Userland content bibliography

    +

    21.11. Userland content bibliography

    -

    Summary of manually collected results on P51 at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: Table 7, “Busy loop MIPS for different simulator setups”. As expected, the less native / more detailed / more complex simulations are slower!

    +

    Summary of manually collected results on P51 at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: Table 7, “Busy loop MIPS for different simulator setups”. As expected, the less native/more detailed/more complex simulations are slower!

    @@ -40377,16 +40846,28 @@ instructions 124346081 - - + + - + + + + + + + + + + + + + - + @@ -40658,7 +41139,7 @@ instructions 124346081

    First we build Dhrystone manually statically since dynamic linking is broken in gem5 as explained at: Section 10.7, “gem5 syscall emulation mode”.

    -

    TODO: move this section to our new custom dhrystone setup: Section 21.8.2, “Dhrystone”.

    +

    TODO: move this section to our new custom dhrystone setup: Section 21.8.2.1, “Dhrystone”.

    gem5 user mode:

    Table 7. Busy loop MIPS for different simulator setups
    Time (s) Instruction count Approximate MIPS gem5 version Host Hardware version Host OS

    QEMU busy loop

    Native busy loop

    a7ae8e6a8e29ef46d79eb9178d8599d1faeea0e5 + 1

    userland/gcc/busy_loop.c -O0

    ./run --emulator native --userland userland/gcc/busy_loop.c --cli-args 10000000000

    10^10

    27

    P51

    Ubuntu 20.04

    QEMU aarch64 busy loop

    a18f28e263c91362519ef550150b5c9d75fa3679 + 1

    userland/gcc/busy_loop.c -O0

    `./run --arch aarch64 --userland userland/gcc/busy_loop.c `

    ./run --arch aarch64 --userland userland/gcc/busy_loop.c --cli-args 10000000000

    10^10

    68

    1.1 * 10^11 (approx)