diff --git a/index.html b/index.html
index 5f97bd2..11e1664 100644
--- a/index.html
+++ b/index.html
@@ -686,18 +686,18 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
-
  • 10.6. gem5 syscall emulation mode +
  • 10.6. syscall emulation mode program stdin
  • +
  • 10.7. gem5 syscall emulation mode
  • -
  • 10.7. QEMU user mode quirks +
  • 10.8. QEMU user mode quirks @@ -1104,23 +1104,6 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 19.2.2.6. gem5 clock frequency
  • -
  • 19.2.3. Interesting benchmarks - -
  • 19.3. gem5 kernel command line parameters
  • @@ -1303,7 +1286,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 21.3.2.1. Fork bomb
  • -
  • 21.3.3. pthreads
  • +
  • 21.3.3. pthreads + +
  • 21.3.4. sysconf
  • 21.3.5. mmap
  • -
  • 21.6. Interpreted languages
  • -
  • 21.7. Node.js
  • -
  • 21.8. Userland content bibliography
  • +
  • 21.6. Interpreted languages + +
  • +
  • 21.7. Algorithms + +
  • +
  • 21.8. Benchmarks + +
  • +
  • 21.9. Userland content bibliography
  • 22. Userland assembly @@ -3334,7 +3345,7 @@ dmesg

    POSIX

  • -

    https://github.com/cirosantilli/algorithm-cheat TODO will be good to move here for performance analysis with gem5

    +

    Algorithms

  • @@ -7324,7 +7335,7 @@ qw er

    Tests under userland/libs/ depend on certain libraries being available on the target, e.g. BLAS for userland/libs/openblas. They are not run by default, but can be enabled with --package and --package-all.

    -

    The gem5 tests require building statically with build id static, see also: Section 10.6, “gem5 syscall emulation mode”. TODO automate this better.

    +

    The gem5 tests require building statically with build id static, see also: Section 10.7, “gem5 syscall emulation mode”. TODO automate this better.

    See: Section 33.13, “Test this repo” for more useful testing tips.

    @@ -7506,7 +7517,7 @@ qemu: uncaught target signal 6 (Aborted) - core dumped
    -

    10.6. gem5 syscall emulation mode

    +

    10.6. syscall emulation mode program stdin

    +
    +

    The following work on both QEMU and gem5 as of LKMC 99d6bc6bc19d4c7f62b172643be95d9c43c26145 + 1. Interactive input:

    +
    +
    +
    +
    ./run --userland userland/c/getchar.c
    +
    +
    +
    +

    Source: userland/c/getchar.c

    +
    +
    +

    A line like the following should show:

    +
    +
    +
    +
    enter a character:
    +
    +
    +
    +

    and after pressing say a and Enter, we get:

    +
    +
    +
    +
    you entered: a
    +
    +
    +
    +

    Note, however, that because QEMU user mode does not show stdout immediately, we don’t really see the initial enter a character line.

    +
    +
    +

    Non-interactive input from a file, by forwarding the emulator’s stdin implicitly through our Python scripts:

    +
    +
    +
    +
    printf a > f.tmp
    +./run --userland userland/c/getchar.c < f.tmp
    +
    +
    +
    +

    Input from a file by explicitly requesting our scripts to use it via the Python API:

    +
    +
    +
    +
    printf a > f.tmp
    +./run --emulator gem5 --userland userland/c/getchar.c --static --stdin-file f.tmp
    +
    +
    +
    +

    This is especially useful when running tests that require stdin input.
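As a rough host-side illustration of the redirection above, the following sketch uses a shell function as a stand-in for userland/c/getchar.c. The name getchar_demo is hypothetical and not part of this repository; it only mimics the prompt and echo shown above, assuming the real program reads a single character from stdin.

```shell
#!/bin/sh
# Hypothetical stand-in for userland/c/getchar.c (assumption: the real
# program prints a prompt, reads one character, and echoes it back).
getchar_demo() {
  printf 'enter a character: '
  # Read exactly one byte from stdin.
  c="$(dd bs=1 count=1 2>/dev/null)"
  printf 'you entered: %s\n' "$c"
}

# Non-interactive input: stdin comes from a file rather than the terminal.
tmp="$(mktemp)"
printf a > "$tmp"
getchar_demo < "$tmp"
rm -f "$tmp"
```

Because stdin is a file here, the prompt and the reply end up on the same line, since nobody presses Enter after the prompt.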

    +
    +
    +
    +

    10.7. gem5 syscall emulation mode

    Less robust than QEMU’s, but still usable:

    @@ -7691,7 +7756,7 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex
    -

    10.6.1. gem5 syscall emulation exit status

    +

    10.7.1. gem5 syscall emulation exit status

    As of gem5 7fa4c946386e7207ad5859e8ade0bbfc14000d91, the crappy se.py script does not forward the exit status of syscall emulation mode, you can test it with:

    @@ -7730,33 +7795,7 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex
    -

    10.6.2. gem5 syscall emulation mode program stdin

    -
    -

    gem5 shows its own stdout to terminal, and does not allow you to type stdin to programs.

    -
    -
    -

    Instead, you must pass stdin non-interactively with the through a file with the --se.py --input option, e.g.:

    -
    -
    -
    -
    printf a > f
    -./run --emulator gem5 --userland userland/c/getchar.c --static -- --input f
    -
    -
    -
    -

    leads to gem5 output:

    -
    -
    -
    -
    enter a character: you entered: a
    -
    -
    -
    -

    Source: userland/c/getchar.c

    -
    -
    -
    -

    10.6.3. gem5 syscall emulation mode syscall tracing

    +

    10.7.2. gem5 syscall emulation mode syscall tracing

    Since gem5 has to implement syscalls itself in syscall emulation mode, it can of course clearly see which syscalls are being made, and we can log them for debug purposes with gem5 tracing, e.g.:

    @@ -7804,9 +7843,9 @@ hello
    -

    10.7. QEMU user mode quirks

    +

    10.8. QEMU user mode quirks

    -

    10.7.1. QEMU user mode does not show stdout immediately

    +

    10.8.1. QEMU user mode does not show stdout immediately

    At 8d8307ac0710164701f6e14c99a69ee172ccbb70 + 1, I noticed that if you run userland/posix/count.c:

    @@ -7834,7 +7873,7 @@ hello

    TODO: investigate further and then possibly post on QEMU mailing list.

    -
    10.7.1.1. QEMU user mode does not show errors
    +
    10.8.1.1. QEMU user mode does not show errors

    Similarly to QEMU user mode does not show stdout immediately, QEMU error messages do not show at all through pipes.

    @@ -17789,6 +17828,19 @@ cat out/gem5-bench-dhrystone.txt
    +

    Interesting algorithms and benchmarks for this game are being collected at:

    +
    +
    + +
    +

    To find out why your program is slow, a good first step is to have a look at the gem5 m5out/stats.txt file.
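A minimal sketch of pulling a single value out of such a file, assuming each stat line has the stat name in the first column and its value in the second. The helper name m5_stat and the stat names in the sample file are placeholders made up for this illustration; real stat names depend on the gem5 version and configuration.

```shell
#!/bin/sh
# Print the value column of one stat from a gem5-style stats file.
# Assumption: each stat line looks like "<name> <value> # <description>".
m5_stat() {
  # $1: stat name, $2: path to the stats file
  awk -v name="$1" '$1 == name { print $2 }' "$2"
}

# Example with a tiny hand-written file; the stat names are placeholders.
stats="$(mktemp)"
cat > "$stats" <<'EOF'
---------- Begin Simulation Statistics ----------
example_ticks                              1000      # placeholder stat
example_insts                               250      # placeholder stat
---------- End Simulation Statistics   ----------
EOF
m5_stat example_insts "$stats"
rm -f "$stats"
```

If the file contains several stat dumps, this prints one value per dump, in order.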

    @@ -18329,577 +18381,6 @@ m5 dumpstats
    -
    -

    19.2.3. Interesting benchmarks

    -
    -

    Buildroot built-in libraries, mostly under Libraries > Other:

    -
    -
    - -
    -
    -

    Open source but not in Buildroot:

    -
    -
    - -
    -
    -

    There are not yet enabled, but it should be easy to so, see: Section 20.5, “Add new Buildroot packages”

    -
    -
    -
    19.2.3.1. Dhrystone
    - -
    -

    Created in the 80’s, it is not a representative measure of performance in modern computers anymore. It has mostly been replaced by SPEC, which is…​ closed source! Unbelievable.

    -
    -
    -

    Buildroot has a dhrystone package, but because it is so interesting to us, we decided to also build it ourselves, which allows things like static and baremetal compilation more easily.

    -
    -
    -

    Build and run on QEMU User mode simulation:

    -
    -
    -
    -
    git submodule update --init submodules/dhrystone
    -./build-dhrystone --mode userland
    -./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
    -
    -
    -
    -

    Build and run on gem5 user mode:

    -
    -
    -
    -
    ./build-dhrystone --mode userland --static --force-rebuild
    -./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
    -
    -
    -
    -

    TODO automate run more nicely.

    -
    -
    -

    Build for Baremetal execution and run it in baremetal QEMU:

    -
    -
    -
    -
    # Build our Newlib stubs.
    -./build-baremetal --arch aarch64
    -./build-dhrystone --arch aarch64 --mode baremetal
    -./run --arch aarch64 --baremetal "$(./getvar baremetal_build_dir)/submodules/dhrystone/dhrystone"
    -
    -
    -
    -

    TODO: fix the build, just need to factor out all run arguments from build-baremetal into common.py and it should just work, no missing syscalls.

    -
    -
    -

    If you really want the Buildroot package for some reason, build it with:

    -
    -
    -
    -
    ./build-buildroot --config 'BR2_PACKAGE_DHRYSTONE=y'
    -
    -
    -
    -

    and run inside the guest from PATH with:

    -
    -
    -
    -
    dhrystone
    -
    -
    -
    -
    -
    19.2.3.2. BST vs heap vs hashmap
    -
    -

    TODO: move benchmark graph from userland/cpp/bst_vs_heap_vs_hashmap.cpp to userland/algorithm/set.

    -
    -
    -

    The following benchmark setup works both:

    -
    -
    -
      -
    • -

      on host through timers + granule

      -
    • -
    • -

      gem5 with dumpstats, which can get more precise results with granule == 1

      -
    • -
    -
    -
    -

    It has been used to answer:

    -
    - -
    -

    To benchmark on the host, we do:

    -
    -
    -
    -
    ./build-userland-in-tree \
    -  --force-rebuild \
    -  --optimization-level 3 \
    -  ./userland/cpp/bst_vs_heap_vs_hashmap.cpp \
    -;
    -./userland/cpp/bst_vs_heap_vs_hashmap.out 10000000 10000 0 | tee bst_vs_heap_vs_hashmap.dat
    -gnuplot \
    -  -e 'input_noext="bst_vs_heap_vs_hashmap"' \
    -  -e 'heap_zoom_max=50' \
    -  -e 'hashmap_zoom_max=400' \
    -  ./bst-vs-heap-vs-hashmap.gnuplot \
    -;
    -xdg-open bst_vs_heap_vs_hashmap.tmp.png
    -
    -
    -
    -

    The parameters heap_zoom_max and hashmap_zoom_max are chosen manually interactively to best showcase the regions of interest in those plots.

    -
    -
    -

    To benchmark on gem5, we first build the benchmark with m5ops instructions enabled, and then we run it and extract the stats:

    -
    -
    -
    -
    ./build-userland \
    -  --arch x86_64 \
    -  --ccflags='-DLKMC_M5OPS_ENABLE=1' \
    -  --force-rebuild userland/cpp/bst_vs_heap_vs_hashmap.cpp \
    -  --static \
    -  --optimization-level 3 \
    -;
    -./run \
    -  --arch x86_64 \
    -  --emulator gem5 \
    -  --static \
    -  --userland userland/cpp/bst_vs_heap_vs_hashmap.cpp \
    -  --userland-args='100000 1 0' \
    -  -- \
    -  --cpu-type=DerivO3CPU \
    -  --caches \
    -  --l2cache \
    -  --l1d_size=32kB \
    -  --l1i_size=32kB \
    -  --l2_size=256kB \
    -  --l3_size=20MB \
    -;
    -./bst-vs-heap-vs-hashmap-gem5-stats --arch x86_64 | tee bst_vs_heap_vs_hashmap_gem5.dat
    -gnuplot \
    -  -e 'input_noext="bst_vs_heap_vs_hashmap_gem5"' \
    -  -e 'heap_zoom_max=500' \
    -  -e 'hashmap_zoom_max=400' \
    -  ./bst-vs-heap-vs-hashmap.gnuplot \
    -;
    -xdg-open bst_vs_heap_vs_hashmap_gem5.tmp.png
    -
    -
    -
    -

    TODO: the gem5 simulation blows up on a tcmalloc allocation somewhere near 25k elements as of 3fdd83c2c58327d9714fa2347c724b78d7c05e2b + 1, likely linked to the extreme inefficiency of the stats collection?

    -
    -
    -

    The cache sizes were chosen to match the host P51 to improve the comparison. Ideally we should also use the same standard library.

    -
    -
    -

    Note that this will take a long time, and will produce a humongous ~40Gb stats file as explained at: Section 19.9.2.1, “gem5 only dump selected stats”

    -
    -
    -

    Sources:

    -
    - -
    -
    -
    19.2.3.3. BLAS
    -
    -

    Buildroot supports it, which makes everything just trivial:

    -
    -
    -
    -
    ./build-buildroot --config 'BR2_PACKAGE_OPENBLAS=y'
    -./build-userland --package openblas -- userland/libs/openblas/hello.c
    -./run --eval-after './libs/openblas/hello.out; echo $?'
    -
    -
    -
    -

    Outcome: the test passes:

    -
    -
    -
    -
    0
    -
    -
    - -
    -

    The test performs a general matrix multiplication:

    -
    -
    -
    -
        |  1.0 -3.0 |   |  1.0  2.0  1.0 |       |  0.5  0.5  0.5 |   |  11.0 - 9.0  5.0 |
    -1 * |  2.0  4.0 | * | -3.0  4.0 -1.0 | + 2 * |  0.5  0.5  0.5 | = | - 9.0  21.0 -1.0 |
    -    |  1.0 -1.0 |                            |  0.5  0.5  0.5 |   |   5.0 - 1.0  3.0 |
    -
    -
    -
    -

    This can be deduced from the Fortran interfaces at

    -
    -
    -
    -
    less "$(./getvar buildroot_build_build_dir)"/openblas-*/reference/dgemmf.f
    -
    -
    -
    -

    which we can map to our call as:

    -
    -
    -
    -
    C := alpha*op( A )*op( B ) + beta*C,
    -SUBROUTINE DGEMMF(               TRANA,        TRANB,     M,N,K,  ALPHA,A,LDA,B,LDB,BETA,C,LDC)
    -cblas_dgemm(      CblasColMajor, CblasNoTrans, CblasTrans,3,3,2  ,1,    A,3,  B,3,  2   ,C,3  );
    -
    -
    -
    -
    -
    19.2.3.4. Eigen
    -
    -

    Header only linear algebra library with a mainline Buildroot package:

    -
    -
    -
    -
    ./build-buildroot --config 'BR2_PACKAGE_EIGEN=y'
    -./build-userland --package eigen -- userland/libs/eigen/hello.cpp
    -
    -
    -
    -

    Just create an array and print it:

    -
    -
    -
    -
    ./run --eval-after './libs/eigen/hello.out'
    -
    -
    -
    -

    Output:

    -
    -
    -
    -
      3  -1
    -2.5 1.5
    -
    -
    - -
    -

    This example just creates a matrix and prints it out.

    -
    - -
    -
    -
    19.2.3.5. PARSEC benchmark
    -
    -

    We have ported parts of the PARSEC benchmark for cross compilation at: https://github.com/cirosantilli/parsec-benchmark See the documentation on that repo to find out which benchmarks have been ported. Some of the benchmarks were are segfaulting, they are documented in that repo.

    -
    -
    -

    There are two ways to run PARSEC with this repo:

    -
    -
    - -
    -
    -
    19.2.3.5.1. PARSEC benchmark without parsecmgmt
    -
    -
    -
    ./build --arch arm --download-dependencies gem5-buildroot parsec-benchmark
    -./build-buildroot --arch arm --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y'
    -./run --arch arm --emulator gem5
    -
    -
    -
    -

    Once inside the guest, launch one of the test input sized benchmarks manually as in:

    -
    -
    -
    -
    cd /parsec/ext/splash2x/apps/fmm/run
    -../inst/arm-linux.gcc/bin/fmm 1 < input_1
    -
    -
    -
    -

    To find run out how to run many of the benchmarks, have a look at the test.sh script of the parse-benchmark repo.

    -
    -
    -

    From the guest, you can also run it as:

    -
    -
    -
    -
    cd /parsec
    -./test.sh
    -
    -
    -
    -

    but this might be a bit time consuming in gem5.

    -
    -
    -
    -
    19.2.3.5.2. PARSEC change the input size
    -
    -

    Running a benchmark of a size different than test, e.g. simsmall, requires a rebuild with:

    -
    -
    -
    -
    ./build-buildroot \
    -  --arch arm \
    -  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
    -  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_INPUT_SIZE="simsmall"' \
    -  -- parsec_benchmark-reconfigure \
    -;
    -
    -
    -
    -

    Large input may also require tweaking:

    -
    -
    - -
    -
    -

    test.sh only contains the run commands for the test size, and cannot be used for simsmall.

    -
    -
    -

    The easiest thing to do, is to scroll up on the host shell after the build, and look for a line of type:

    -
    -
    -
    -
    Running /root/linux-kernel-module-cheat/out/aarch64/buildroot/build/parsec-benchmark-custom/ext/splash2x/apps/ocean_ncp/inst/aarch64-linux.gcc/bin/ocean_ncp -n2050 -p1 -e1e-07 -r20000 -t28800
    -
    -
    -
    -

    and then tweak the command found in test.sh accordingly.

    -
    -
    -

    Yes, we do run the benchmarks on host just to unpack / generate inputs. They are expected fail to run since they were build for the guest instead of host, including for x86_64 guest which has a different interpreter than the host’s (see file myexecutable).

    -
    -
    -

    The rebuild is required because we unpack input files on the host.

    -
    -
    -

    Separating input sizes also allows to create smaller images when only running the smaller benchmarks.

    -
    -
    -

    This limitation exists because parsecmgmt generates the input files just before running via the Bash scripts, but we can’t run parsecmgmt on gem5 as it is too slow!

    -
    -
    -

    One option would be to do that inside the guest with QEMU.

    -
    -
    -

    Also, we can’t generate all input sizes at once, because many of them have the same name and would overwrite one another…​

    -
    -
    -

    PARSEC simply wasn’t designed with non native machines in mind…​

    -
    -
    -
    -
    19.2.3.5.3. PARSEC benchmark with parsecmgmt
    -
    -

    Most users won’t want to use this method because:

    -
    -
    -
      -
    • -

      running the parsecmgmt Bash scripts takes forever before it ever starts running the actual benchmarks on gem5

      -
      -

      Running on QEMU is feasible, but not the main use case, since QEMU cannot be used for performance measurements

      -
      -
    • -
    • -

      it requires putting the full .tar inputs on the guest, which makes the image twice as large (1x for the .tar, 1x for the unpacked input files)

      -
    • -
    -
    -
    -

    It would be awesome if it were possible to use this method, since this is what Parsec supports officially, and so:

    -
    -
    -
      -
    • -

      you don’t have to dig into what raw command to run

      -
    • -
    • -

      there is an easy way to run all the benchmarks in one go to test them out

      -
    • -
    • -

      you can just run any of the benchmarks that you want

      -
    • -
    -
    -
    -

    but it simply is not feasible in gem5 because it takes too long.

    -
    -
    -

    If you still want to run this, try it out with:

    -
    -
    -
    -
    ./build-buildroot \
    -  --arch aarch64 \
    -  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
    -  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_PARSECMGMT=y' \
    -  --config 'BR2_TARGET_ROOTFS_EXT2_SIZE="3G"' \
    -  -- parsec_benchmark-reconfigure \
    -;
    -
    -
    -
    -

    And then you can run it just as you would on the host:

    -
    -
    -
    -
    cd /parsec/
    -bash
    -. env.sh
    -parsecmgmt -a run -p splash2x.fmm -i test
    -
    -
    -
    -
    -
    19.2.3.5.4. PARSEC uninstall
    -
    -

    If you want to remove PARSEC later, Buildroot doesn’t provide an automated package removal mechanism as mentioned at: Section 20.6, “Remove Buildroot packages”, but the following procedure should be satisfactory:

    -
    -
    -
    -
    rm -rf \
    -  "$(./getvar buildroot_download_dir)"/parsec-* \
    -  "$(./getvar buildroot_build_dir)"/build/parsec-* \
    -  "$(./getvar buildroot_build_dir)"/build/packages-file-list.txt \
    -  "$(./getvar buildroot_build_dir)"/images/rootfs.* \
    -  "$(./getvar buildroot_build_dir)"/target/parsec-* \
    -;
    -./build-buildroot --arch arm
    -
    -
    -
    -
    -
    19.2.3.5.5. PARSEC benchmark hacking
    -
    -

    If you end up going inside submodules/parsec-benchmark to hack up the benchmark (you will!), these tips will be helpful.

    -
    -
    -

    Buildroot was not designed to deal with large images, and currently cross rebuilds are a bit slow, due to some image generation and validation steps.

    -
    -
    -

    A few workarounds are:

    -
    -
    -
      -
    • -

      develop in host first as much as you can. Our PARSEC fork supports it.

      -
      -

      If you do this, don’t forget to do a:

      -
      -
      -
      -
      cd "$(./getvar parsec_source_dir)"
      -git clean -xdf .
      -
      -
      -
      -

      before going for the cross compile build.

      -
      -
    • -
    • -

      patch Buildroot to work well, and keep cross compiling all the way. This should be totally viable, and we should do it.

      -
      -

      Don’t forget to explicitly rebuild PARSEC with:

      -
      -
      -
      -
      ./build-buildroot \
      -  --arch arm \
      -  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
      -  -- parsec_benchmark-reconfigure \
      -;
      -
      -
      -
      -

      You may also want to test if your patches are still functionally correct inside of QEMU first, which is a faster emulator.

      -
      -
    • -
    • -

      sell your soul, and compile natively inside the guest. We won’t do this, not only because it is evil, but also because Buildroot explicitly does not support it: https://buildroot.org/downloads/manual/manual.html#faq-no-compiler-on-target ARM employees have been known to do this: https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/qemu-patch.diff

      -
    • -
    -
    -
    -
    -

    19.3. gem5 kernel command line parameters

    @@ -21800,7 +21281,7 @@ make menuconfig

    Also mentioned at: https://stackoverflow.com/questions/47320800/how-to-clean-only-target-in-buildroot

    -

    See this for a sample manual workaround: Section 19.2.3.5.4, “PARSEC uninstall”.

    +

    See this for a sample manual workaround: Section 21.8.2.4, “PARSEC uninstall”.

    @@ -22500,7 +21981,7 @@ printf '%x\n' 4198400

    which is documented in man proc.

    -

    The default value is 0, which I can’t find a precise documentation for. 2 is precisly documented but I’m lazy to do all calculations. So let’s just verify 0 vs 1 by trying to mmap 1GiB of memory:

    +

    The default value is 0, which I can’t find a precise documentation for. 2 is precisely documented but I’m lazy to do all calculations. So let’s just verify 0 vs 1 by trying to mmap 1GiB of memory:

    @@ -22889,6 +22370,19 @@ time ./mutex.out 4 100000000
    +

    Related examples:

    +
    +
    + +
    +

    Bibliography:

    @@ -23028,6 +22522,42 @@ fork() return = 13039
    +
    +
    21.3.3.1. pthread_mutex
    +
    +

    userland/posix/pthread_count.c exemplifies the functions:

    +
    +
    + +
    +
    +

    That example has the same interface as: atomic.cpp. +There are no non-locking atomic types or atomic primitives in POSIX: http://stackoverflow.com/questions/1130018/unix-portable-atomic-operations

    +
    +
    +

    pthread_mutex_lock and pthread_mutex_unlock and many other pthread functions already enforce cross thread memory synchronization:

    +
    +
    + +
    +

    21.3.4. sysconf

    @@ -23222,16 +22752,8 @@ fork() return = 13039

    Maybe some day someone will use this setup to study the performance of interpreters:

    -
    - -
    -
    -
    -

    21.7. Node.js

    +
    +

    21.6.1. Node.js

    Parent section: Interpreted languages.

    @@ -23292,8 +22814,710 @@ fork() return = 13039
    +
    -

    21.8. Userland content bibliography

    +

    21.7. Algorithms

    +
    +

    userland/algorithm

    +
    +
    +

    This is still work in progress and needs better automation, but is already a good sketch. Key missing features:

    +
    +
    + +
    +
    +

    The idea was originally started at: https://github.com/cirosantilli/algorithm-cheat

    +
    +
    +

    The key idea is that input / output pairs are present in human readable files generated either:

    +
    +
    + +
    +
    +

    Test programs then:

    +
    +
    + +
    +
    +

    so that we can compare the output to the expected one.

    +
    +
    +

    This way, tests can be reused across several implementations in different languages, emulating the many multi-language programming competition websites out there.

    +
    +
    +

    For example, for a native run we can run a set / sorting test:

    +
    +
    +
    +
    cd userland/algorithm/set
    +./build
    +
    +# Run with a small hand written test.
    +./std_set.out < test_data/8.i > tmp.raw
    +
    +# Extract the output from the sorted stdout, which also
    +# contained some timing information.
    +./parse_output output < tmp.raw > tmp.o
    +
    +# Compare the output to the Expected one.
    +cmp tmp.o test_data/8.e
    +
    +# Same but now with a large randomly generated input.
    +./generate_io
    +./std_set.out < tmp.i | ./parse_output output > tmp.o
    +cmp tmp.o tmp.e
    +
    +
    +
    +
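The manual steps above could be looped over a whole directory of input / expected pairs. The following is only a sketch: the helper name run_io_tests is hypothetical, and it assumes that the command's stdout is already in the expected format, i.e. that any filtering such as parse_output is part of the command being run.

```shell
#!/bin/sh
# Run a command on every <name>.i file in a directory and compare its
# stdout with the matching <name>.e file.
# Assumption: the command's stdout is already in the expected format.
run_io_tests() {
  # $1: test data directory; remaining arguments: command to run
  dir="$1"
  shift
  failed=0
  for input in "$dir"/*.i; do
    # Skip the unexpanded pattern when the directory has no .i files.
    [ -e "$input" ] || continue
    expected="${input%.i}.e"
    if "$@" < "$input" | cmp -s - "$expected"; then
      echo "PASS $input"
    else
      echo "FAIL $input"
      failed=1
    fi
  done
  return "$failed"
}

# Example: use "sort -n" as a stand-in implementation.
data="$(mktemp -d)"
printf '3\n1\n2\n' > "$data/small.i"
printf '1\n2\n3\n' > "$data/small.e"
run_io_tests "$data" sort -n
rm -rf "$data"
```

The function returns non-zero if any pair differs, so it can be used directly as a test step.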

    It is also possible to run the algorithm tests normally from emulators in User mode simulation by setting stdin as explained at syscall emulation mode program stdin, e.g.:

    +
    +
    +
    +
    ./run --arch aarch64 -u userland/algorithm/set/std_set.cpp --stdin-file userland/algorithm/set/test_data/8.i
    +
    +
    +
    +

    Sources:

    +
    +
    + +
    +
    +

    userland/algorithm/set/parse_output is needed because timing instrumentation measurements must be embedded in the program itself to allow:

    +
    +
    + +
    +
    +

    The following are also interesting Buildroot libraries that we could benchmark:

    +
    +
    + +
    +
    +

    These are good targets for performance analysis with gem5, and there is some overlap between this section and Benchmarks.

    +
    +
    +

    21.7.1. BST vs heap vs hashmap

    +
    +

    TODO: move benchmark graph from userland/cpp/bst_vs_heap_vs_hashmap.cpp to userland/algorithm/set.

    +
    +
    +

    The following benchmark setup works both:

    +
    +
    +
      +
    • +

      on host through timers + granule

      +
    • +
    • +

      gem5 with dumpstats, which can get more precise results with granule == 1

      +
    • +
    +
    +
    +

    It has been used to answer:

    +
    + +
    +

    To benchmark on the host, we do:

    +
    +
    +
    +
    ./build-userland-in-tree \
    +  --force-rebuild \
    +  --optimization-level 3 \
    +  ./userland/cpp/bst_vs_heap_vs_hashmap.cpp \
    +;
    +./userland/cpp/bst_vs_heap_vs_hashmap.out 10000000 10000 0 | tee bst_vs_heap_vs_hashmap.dat
    +gnuplot \
    +  -e 'input_noext="bst_vs_heap_vs_hashmap"' \
    +  -e 'heap_zoom_max=50' \
    +  -e 'hashmap_zoom_max=400' \
    +  ./bst-vs-heap-vs-hashmap.gnuplot \
    +;
    +xdg-open bst_vs_heap_vs_hashmap.tmp.png
    +
    +
    +
    +

    The parameters heap_zoom_max and hashmap_zoom_max are chosen manually interactively to best showcase the regions of interest in those plots.

    +
    +
    +

    To benchmark on gem5, we first build the benchmark with m5ops instructions enabled, and then we run it and extract the stats:

    +
    +
    +
    +
    ./build-userland \
    +  --arch x86_64 \
    +  --ccflags='-DLKMC_M5OPS_ENABLE=1' \
    +  --force-rebuild userland/cpp/bst_vs_heap_vs_hashmap.cpp \
    +  --static \
    +  --optimization-level 3 \
    +;
    +./run \
    +  --arch x86_64 \
    +  --emulator gem5 \
    +  --static \
    +  --userland userland/cpp/bst_vs_heap_vs_hashmap.cpp \
    +  --userland-args='100000 1 0' \
    +  -- \
    +  --cpu-type=DerivO3CPU \
    +  --caches \
    +  --l2cache \
    +  --l1d_size=32kB \
    +  --l1i_size=32kB \
    +  --l2_size=256kB \
    +  --l3_size=20MB \
    +;
    +./bst-vs-heap-vs-hashmap-gem5-stats --arch x86_64 | tee bst_vs_heap_vs_hashmap_gem5.dat
    +gnuplot \
    +  -e 'input_noext="bst_vs_heap_vs_hashmap_gem5"' \
    +  -e 'heap_zoom_max=500' \
    +  -e 'hashmap_zoom_max=400' \
    +  ./bst-vs-heap-vs-hashmap.gnuplot \
    +;
    +xdg-open bst_vs_heap_vs_hashmap_gem5.tmp.png
    +
    +
    +
    +

    TODO: the gem5 simulation blows up on a tcmalloc allocation somewhere near 25k elements as of 3fdd83c2c58327d9714fa2347c724b78d7c05e2b + 1, likely linked to the extreme inefficiency of the stats collection?

    +
    +
    +

    The cache sizes were chosen to match the host P51 to improve the comparison. Ideally we should also use the same standard library.

    +
    +
    +

    Note that this will take a long time, and will produce a humongous ~40Gb stats file as explained at: Section 19.9.2.1, “gem5 only dump selected stats”

    +
    +
    +

    Sources:

    +
    + +
    +
    +

    21.7.2. BLAS

    +
    +

    Buildroot supports it, which makes everything just trivial:

    +
    +
    +
    +
    ./build-buildroot --config 'BR2_PACKAGE_OPENBLAS=y'
    +./build-userland --package openblas -- userland/libs/openblas/hello.c
    +./run --eval-after './libs/openblas/hello.out; echo $?'
    +
    +
    +
    +

    Outcome: the test passes:

    +
    +
    +
    +
    0
    +
    +
    + +
    +

    The test performs a general matrix multiplication:

    +
    +
    +
    +
        |  1.0 -3.0 |   |  1.0  2.0  1.0 |       |  0.5  0.5  0.5 |   |  11.0 - 9.0  5.0 |
    +1 * |  2.0  4.0 | * | -3.0  4.0 -1.0 | + 2 * |  0.5  0.5  0.5 | = | - 9.0  21.0 -1.0 |
    +    |  1.0 -1.0 |                            |  0.5  0.5  0.5 |   |   5.0 - 1.0  3.0 |
    +
    +
    +
    +

    This can be deduced from the Fortran interfaces at

    +
    +
    +
    +
    less "$(./getvar buildroot_build_build_dir)"/openblas-*/reference/dgemmf.f
    +
    +
    +
    +

    which we can map to our call as:

    +
    +
    +
    +
    C := alpha*op( A )*op( B ) + beta*C,
    +SUBROUTINE DGEMMF(               TRANA,        TRANB,     M,N,K,  ALPHA,A,LDA,B,LDB,BETA,C,LDC)
    +cblas_dgemm(      CblasColMajor, CblasNoTrans, CblasTrans,3,3,2  ,1,    A,3,  B,3,  2   ,C,3  );
    +
    +
    +
    +
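The arithmetic of the general matrix multiplication above can be double checked by hand. The following is only a sketch in awk with the matrices written out in mathematical (row by row) order; it does not call OpenBLAS and does not reproduce the column-major memory layout of the real call. The function name dgemm_check is made up for this illustration.

```shell
#!/bin/sh
# Recompute alpha*A*B^T + beta*C for the matrices shown above with a
# naive triple loop, to check the expected result by hand.
dgemm_check() {
  awk 'BEGIN {
    alpha = 1; beta = 2
    # A is 3x2. B is 3x2 as well, so B^T is the 2x3 matrix shown above.
    split("1 -3 2 4 1 -1", a, " ")
    split("1 -3 2 4 1 -1", b, " ")
    for (i = 0; i < 3; i++) {
      line = ""
      for (j = 0; j < 3; j++) {
        sum = 0
        for (k = 0; k < 2; k++) {
          # Row-major indexing into the flattened 3x2 arrays (1-based).
          sum += a[i * 2 + k + 1] * b[j * 2 + k + 1]
        }
        # Every entry of the initial C is 0.5.
        value = alpha * sum + beta * 0.5
        sep = (j ? " " : "")
        line = line sep value
      }
      print line
    }
  }'
}

dgemm_check
```

The three printed rows match the right-hand side of the equation above: 11 -9 5, then -9 21 -1, then 5 -1 3.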
    +

    21.7.3. Eigen

    +
    +

    Header only linear algebra library with a mainline Buildroot package:

    +
    +
    +
    +
    ./build-buildroot --config 'BR2_PACKAGE_EIGEN=y'
    +./build-userland --package eigen -- userland/libs/eigen/hello.cpp
    +
    +
    +
    +

    Just create an array and print it:

    +
    +
    +
    +
    ./run --eval-after './libs/eigen/hello.out'
    +
    +
    +
    +

    Output:

    +
    +
    +
    +
      3  -1
    +2.5 1.5
    +
    +
    + +
    +

    This example just creates a matrix and prints it out.

    +
    + +
    +
    +
    +

    21.8. Benchmarks

    +
    +

    These are good targets for performance analysis with gem5.

    +
    +
    +

    TODO also consider the following:

    +
    +
    + +
    +
    +

    21.8.1. Dhrystone

    + +
    +

    Created in the 80’s, it is not a representative measure of performance in modern computers anymore. It has mostly been replaced by SPEC, which is…​ closed source! Unbelievable.

    +
    +
    +

    Buildroot has a dhrystone package, but because it is so interesting to us, we decided to also build it ourselves, which allows things like static and baremetal compilation more easily.

    +
    +
    +

    Build and run on QEMU User mode simulation:

    +
    +
    +
    +
    git submodule update --init submodules/dhrystone
    +./build-dhrystone --mode userland
    +./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
    +
    +
    +
    +

    Build and run on gem5 user mode:

    +
    +
    +
    +
    ./build-dhrystone --mode userland --static --force-rebuild
    +./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
    +
    +
    +
    +

    TODO automate run more nicely.

    +
    +
    +

    Build for Baremetal execution and run it in baremetal QEMU:

    +
    +
    +
    +
    # Build our Newlib stubs.
    +./build-baremetal --arch aarch64
    +./build-dhrystone --arch aarch64 --mode baremetal
    +./run --arch aarch64 --baremetal "$(./getvar baremetal_build_dir)/submodules/dhrystone/dhrystone"
    +
    +
    +
    +

    TODO: fix the build, just need to factor out all run arguments from build-baremetal into common.py and it should just work, no missing syscalls.

    +
    +
    +

    If you really want the Buildroot package for some reason, build it with:

    +
    +
    +
    +
    ./build-buildroot --config 'BR2_PACKAGE_DHRYSTONE=y'
    +
    +
    +
    +

    and run inside the guest from PATH with:

    +
    +
    +
    +
    dhrystone
    +
    +
    +
    +
    +

    21.8.2. PARSEC benchmark

    +
    +

    We have ported parts of the PARSEC benchmark for cross compilation at: https://github.com/cirosantilli/parsec-benchmark See the documentation on that repo to find out which benchmarks have been ported. Some of the benchmarks are segfaulting; they are documented in that repo.

    +
    +
    +

    There are two ways to run PARSEC with this repo:

    +
    +
    + +
    +
    +
    21.8.2.1. PARSEC benchmark without parsecmgmt
    +
    +
    +
    ./build --arch arm --download-dependencies gem5-buildroot parsec-benchmark
    +./build-buildroot --arch arm --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y'
    +./run --arch arm --emulator gem5
    +
    +
    +
    +

    Once inside the guest, launch one of the test input sized benchmarks manually as in:

    +
    +
    +
    +
    cd /parsec/ext/splash2x/apps/fmm/run
    +../inst/arm-linux.gcc/bin/fmm 1 < input_1
    +
    +
    +
    +

    To find out how to run many of the benchmarks, have a look at the test.sh script of the parsec-benchmark repo.

    +
    +
    +

    From the guest, you can also run it as:

    +
    +
    +
    +
    cd /parsec
    +./test.sh
    +
    +
    +
    +

    but this might be a bit time consuming in gem5.

    +
    +
    +
    +
    21.8.2.2. PARSEC change the input size
    +
    +

    Running a benchmark of a size different than test, e.g. simsmall, requires a rebuild with:

    +
    +
    +
    +
    ./build-buildroot \
    +  --arch arm \
    +  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
    +  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_INPUT_SIZE="simsmall"' \
    +  -- parsec_benchmark-reconfigure \
    +;
    +
    +
    +
    +

    Large input may also require tweaking:

    +
    +
    + +
    +
    +

    test.sh only contains the run commands for the test size, and cannot be used for simsmall.

    +
    +
    +

    The easiest thing to do is to scroll up on the host shell after the build, and look for a line of the form:

    +
    +
    +
    +
    Running /root/linux-kernel-module-cheat/out/aarch64/buildroot/build/parsec-benchmark-custom/ext/splash2x/apps/ocean_ncp/inst/aarch64-linux.gcc/bin/ocean_ncp -n2050 -p1 -e1e-07 -r20000 -t28800
    +
    +
    +
    +

    and then tweak the command found in test.sh accordingly.
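
    As a rough sketch of that tweak, the host build directory prefix of such a line can be rewritten mechanically. The helper name host_to_guest_cmd is hypothetical, and the mapping assumes that the guest tree lives under /parsec, as in the fmm example above, and that the Buildroot build directory is named parsec-benchmark-custom, as in the log line above:

    ```shell
    # Hypothetical helper: read build log lines on stdin and print only the
    # "Running ..." ones, with the host build directory prefix replaced by
    # the assumed guest location /parsec.
    host_to_guest_cmd() {
      sed -n 's|^Running .*/parsec-benchmark-custom/|/parsec/|p'
    }

    # Example with the log line shown above.
    printf '%s\n' 'Running /root/linux-kernel-module-cheat/out/aarch64/buildroot/build/parsec-benchmark-custom/ext/splash2x/apps/ocean_ncp/inst/aarch64-linux.gcc/bin/ocean_ncp -n2050 -p1 -e1e-07 -r20000 -t28800' |
      host_to_guest_cmd
    ```

    The printed command still needs to be run from the right directory inside the guest, which is what test.sh takes care of for the test size.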

    +
    +
    +

    Yes, we do run the benchmarks on the host just to unpack / generate inputs. They are expected to fail to run since they were built for the guest instead of the host, including for the x86_64 guest, which has a different interpreter than the host’s (see file myexecutable).

    +
    +
    +

    The rebuild is required because we unpack input files on the host.

    +
    +
    +

    Separating input sizes also allows creating smaller images when only running the smaller benchmarks.

    +
    +
    +

    This limitation exists because parsecmgmt generates the input files just before running via the Bash scripts, but we can’t run parsecmgmt on gem5 as it is too slow!

    +
    +
    +

    One option would be to do that inside the guest with QEMU.

    +
    +
    +

    Also, we can’t generate all input sizes at once, because many of them have the same name and would overwrite one another…​

    +
    +
    +

    PARSEC simply wasn’t designed with non native machines in mind…​

    +
    +
    +
    +
    21.8.2.3. PARSEC benchmark with parsecmgmt
    +
    +

    Most users won’t want to use this method because:

    +
    +
    +
      +
    • +

      running the parsecmgmt Bash scripts takes forever before it ever starts running the actual benchmarks on gem5

      +
      +

      Running on QEMU is feasible, but not the main use case, since QEMU cannot be used for performance measurements

      +
      +
    • +
    • +

      it requires putting the full .tar inputs on the guest, which makes the image twice as large (1x for the .tar, 1x for the unpacked input files)

      +
    • +
    +
    +
    +

    It would be awesome if it were possible to use this method, since this is what PARSEC supports officially, and so:

    +
    +
    +
      +
    • +

      you don’t have to dig into what raw command to run

      +
    • +
    • +

      there is an easy way to run all the benchmarks in one go to test them out

      +
    • +
    • +

      you can just run any of the benchmarks that you want

      +
    • +
    +
    +
    +

    but it simply is not feasible in gem5 because it takes too long.

    +
    +
    +

    If you still want to run this, try it out with:

    +
    +
    +
    +
    ./build-buildroot \
    +  --arch aarch64 \
    +  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
    +  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_PARSECMGMT=y' \
    +  --config 'BR2_TARGET_ROOTFS_EXT2_SIZE="3G"' \
    +  -- parsec_benchmark-reconfigure \
    +;
    +
    +
    +
    +

    And then you can run it just as you would on the host:

    +
    +
    +
    +
    cd /parsec/
    +bash
    +. env.sh
    +parsecmgmt -a run -p splash2x.fmm -i test
    +
    +
    +
    +
    +
    21.8.2.4. PARSEC uninstall
    +
    +

    If you want to remove PARSEC later, Buildroot doesn’t provide an automated package removal mechanism as mentioned at: Section 20.6, “Remove Buildroot packages”, but the following procedure should be satisfactory:

    +
    +
    +
    +
    rm -rf \
    +  "$(./getvar buildroot_download_dir)"/parsec-* \
    +  "$(./getvar buildroot_build_dir)"/build/parsec-* \
    +  "$(./getvar buildroot_build_dir)"/build/packages-file-list.txt \
    +  "$(./getvar buildroot_build_dir)"/images/rootfs.* \
    +  "$(./getvar buildroot_build_dir)"/target/parsec-* \
    +;
    +./build-buildroot --arch arm
    +
    +
    +
    +
    +
    21.8.2.5. PARSEC benchmark hacking
    +
    +

    If you end up going inside submodules/parsec-benchmark to hack up the benchmark (you will!), these tips will be helpful.

    +
    +
    +

    Buildroot was not designed to deal with large images, and currently cross rebuilds are a bit slow, due to some image generation and validation steps.

    +
    +
    +

    A few workarounds are:

    +
    +
    +
      +
    • +

      develop on the host first as much as you can. Our PARSEC fork supports it.

      +
      +

      If you do this, don’t forget to do a:

      +
      +
      +
      +
      cd "$(./getvar parsec_source_dir)"
      +git clean -xdf .
      +
      +
      +
      +

      before going for the cross compile build.

      +
      +
    • +
    • +

      patch Buildroot to work well, and keep cross compiling all the way. This should be totally viable, and we should do it.

      +
      +

      Don’t forget to explicitly rebuild PARSEC with:

      +
      +
      +
      +
      ./build-buildroot \
      +  --arch arm \
      +  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
      +  -- parsec_benchmark-reconfigure \
      +;
      +
      +
      +
      +

      You may also want to test if your patches are still functionally correct inside of QEMU first, which is a faster emulator.

      +
      +
    • +
    • +

      sell your soul, and compile natively inside the guest. We won’t do this, not only because it is evil, but also because Buildroot explicitly does not support it: https://buildroot.org/downloads/manual/manual.html#faq-no-compiler-on-target ARM employees have been known to do this: https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/qemu-patch.diff

      +
    • +
    +
    +
    +
    +
    +
    +

    21.9. Userland content bibliography

    -

    Summary of manually collected results on P51 at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: Table 7, “Busy loop DMIPS for different simulator setups”. As expected, the less native / more detailed / more complex simulations are slower!

    +

    Summary of manually collected results on P51 at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: Table 7, “Busy loop MIPS for different simulator setups”. As expected, the less native / more detailed / more complex simulations are slower!

    - +@@ -30783,18 +31007,25 @@ instructions 124346081 + + + + + + +
    Table 7. Busy loop DMIPS for different simulator setupsTable 7. Busy loop MIPS for different simulator setups

    1.1018128 * 10^7

    0.2

    gem5 --arch aarch64 --gem5-build-id MOESI_CMP_directory -- --cpu-type DerivO3CPU --caches --ruby

    1 * 1000000 = 10^6

    63

    1.1005150 * 10^7

    0.2

    The first step is to determine a number of loops that will run long enough to give meaningful results, but not so long that we get bored.

    -

    On our P51 machine, we found 10^7 (10 million == 1000 times 10000) loops to be a good number:

    +

    On our P51 machine, we found 10^7 (10 million == 1000 times 10000) loops to be a good number for a gem5 atomic simulation:

    -
    ./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --userland-args '1000 10000' --static
    -./get-stat sim_insts
    +
    ./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --userland-args '1 10000000' --static
    +./gem5-stat --arch aarch64 sim_insts
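
    Once sim_insts is known, a MIPS figure like the ones in Table 7 is just the instruction count divided by the wall time in seconds and by 10^6. A minimal sketch of that arithmetic follows. The helper name mips is hypothetical, and we assume that the 63 in the table is the wall time in seconds for the 1.1005150 * 10^7 instruction run:

    ```shell
    # Hypothetical helper: mips <instructions> <wall seconds>
    # Prints millions of simulated instructions per host second,
    # rounded to one decimal place.
    mips() {
      awk -v insts="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", insts / secs / 1e6 }'
    }

    # 1.1005150 * 10^7 instructions in an assumed 63 seconds.
    mips 11005150 63
    # prints 0.2
    ```

    Under that assumption the result agrees with the 0.2 reported in the table.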
    @@ -30898,10 +31129,10 @@ instructions 124346081

    Let’s see if user mode runs considerably faster than full system or not, ignoring the kernel boot.

    -

    First we build Dhrystone manually statically since dynamic linking is broken in gem5 as explained at: Section 10.6, “gem5 syscall emulation mode”.

    +

    First we build Dhrystone manually statically since dynamic linking is broken in gem5 as explained at: Section 10.7, “gem5 syscall emulation mode”.

    -

    TODO: move this section to our new custom dhrystone setup: Section 19.2.3.1, “Dhrystone”.

    +

    TODO: move this section to our new custom dhrystone setup: Section 21.8.1, “Dhrystone”.

    gem5 user mode:

    @@ -33096,7 +33327,7 @@ echo $?

    gem5: m5 fail works on all archs

  • -

    user mode: QEMU forwards exit status, for gem5 we do some log parsing as described at: Section 10.6.1, “gem5 syscall emulation exit status”

    +

    user mode: QEMU forwards exit status, for gem5 we do some log parsing as described at: Section 10.7.1, “gem5 syscall emulation exit status”