mirror of https://github.com/cirosantilli/linux-kernel-module-cheat.git
synced 2026-01-27 12:04:27 +01:00

start moving algorithm in

This commit is contained in:
829	README.adoc
@@ -1007,7 +1007,7 @@ Notable userland content included / moving into this repository includes:
 * <<c>>
 * <<cpp>>
 * <<posix>>
-* https://github.com/cirosantilli/algorithm-cheat TODO will be good to move here for performance analysis <<gem5-run-benchmark,with gem5>>
+* <<algorithms>>
 
 ==== Userland setup getting started
 
@@ -10645,6 +10645,11 @@ Now you can play a fun little game with your friends:
 * make a program that solves the computation problem, and outputs output to stdout
 * write the code that runs the correct computation in the smallest number of cycles possible
 
+Interesting algorithms and benchmarks for this game are being collected at:
+
+* <<algorithms>>
+* <<benchmarks>>
+
 To find out why your program is slow, a good first step is to have a look at the <<gem5-m5out-stats-txt-file>>.
 
 ==== Skip extra benchmark instructions
@@ -11028,386 +11033,6 @@ TODO: why doesn't this exist:
 ls /sys/devices/system/cpu/cpu0/cpufreq
 ....
 
-==== Interesting benchmarks
-
-Buildroot built-in libraries, mostly under Libraries > Other:
-
-* Armadillo `C++`: linear algebra
-* fftw: Fourier transform
-* Flann
-* GSL: various
-* liblinear
-* libspatialindex
-* libtommath
-* qhull
-
-Open source but not in Buildroot:
-
-* https://github.com/kozyraki/stamp transactional memory benchmarks
-
-They are not yet enabled, but it should be easy to do so, see: xref:add-new-buildroot-packages[xrefstyle=full]
-
-===== Dhrystone
-
-https://en.wikipedia.org/wiki/Dhrystone
-
-Created in the 1980s, it is no longer a representative measure of performance on modern computers. It has mostly been replaced by https://en.wikipedia.org/wiki/SPECint[SPEC], which is... closed source! Unbelievable.
-
-<<buildroot>> has a `dhrystone` package, but because it is so interesting to us, we decided to also build it ourselves, which more easily allows things like static and baremetal compilation.
-
-Build and run on QEMU <<user-mode-simulation>>:
-
-....
-git submodule update --init submodules/dhrystone
-./build-dhrystone --mode userland
-./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
-....
-
-Build and run on gem5 user mode:
-
-....
-./build-dhrystone --mode userland --static --force-rebuild
-./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
-....
-
-TODO: automate the run more nicely.
-
-Build for <<baremetal>> execution and run it in baremetal QEMU:
-
-....
-# Build our Newlib stubs.
-./build-baremetal --arch aarch64
-./build-dhrystone --arch aarch64 --mode baremetal
-./run --arch aarch64 --baremetal "$(./getvar baremetal_build_dir)/submodules/dhrystone/dhrystone"
-....
-
-TODO: fix the build; we just need to factor out all run arguments from link:build-baremetal[] into link:common.py[], and it should just work with no missing syscalls.
-
-If you really want the Buildroot package for some reason, build it with:
-
-....
-./build-buildroot --config 'BR2_PACKAGE_DHRYSTONE=y'
-....
-
-and run inside the guest from `PATH` with:
-
-....
-dhrystone
-....
-
-===== BST vs heap vs hashmap
-
-TODO: move benchmark graph from link:userland/cpp/bst_vs_heap_vs_hashmap.cpp[] to link:userland/algorithm/set[].
-
-The following benchmark setup works both:
-
-* on the host, through timers + a https://stackoverflow.com/questions/51952471/why-do-i-get-a-constant-instead-of-logarithmic-curve-for-an-insert-time-benchmar/51953081#51953081[granule]
-* on gem5, with <<m5ops-instructions,dumpstats>>, which can get more precise results with `granule == 1`
-
-It has been used to answer:
-
-* BST vs heap: https://stackoverflow.com/questions/6147243/heap-vs-binary-search-tree-bst/29548834#29548834
-* `std::set`: https://stackoverflow.com/questions/2558153/what-is-the-underlying-data-structure-of-a-stl-set-in-c/51944661#51944661
-* `std::map`: https://stackoverflow.com/questions/18414579/what-data-structure-is-inside-stdmap-in-c/51945119#51945119
-
-To benchmark on the host, we do:
-
-....
-./build-userland-in-tree \
-  --force-rebuild \
-  --optimization-level 3 \
-  ./userland/cpp/bst_vs_heap_vs_hashmap.cpp \
-;
-./userland/cpp/bst_vs_heap_vs_hashmap.out 10000000 10000 0 | tee bst_vs_heap_vs_hashmap.dat
-gnuplot \
-  -e 'input_noext="bst_vs_heap_vs_hashmap"' \
-  -e 'heap_zoom_max=50' \
-  -e 'hashmap_zoom_max=400' \
-  ./bst-vs-heap-vs-hashmap.gnuplot \
-;
-xdg-open bst_vs_heap_vs_hashmap.tmp.png
-....
-
-The parameters `heap_zoom_max` and `hashmap_zoom_max` are chosen manually and interactively to best showcase the regions of interest in those plots.
-
-To benchmark on gem5, we first build the benchmark with <<m5ops-instructions>> enabled, and then we run it and extract the stats:
-
-....
-./build-userland \
-  --arch x86_64 \
-  --ccflags='-DLKMC_M5OPS_ENABLE=1' \
-  --force-rebuild userland/cpp/bst_vs_heap_vs_hashmap.cpp \
-  --static \
-  --optimization-level 3 \
-;
-./run \
-  --arch x86_64 \
-  --emulator gem5 \
-  --static \
-  --userland userland/cpp/bst_vs_heap_vs_hashmap.cpp \
-  --userland-args='100000 1 0' \
-  -- \
-  --cpu-type=DerivO3CPU \
-  --caches \
-  --l2cache \
-  --l1d_size=32kB \
-  --l1i_size=32kB \
-  --l2_size=256kB \
-  --l3_size=20MB \
-;
-./bst-vs-heap-vs-hashmap-gem5-stats --arch x86_64 | tee bst_vs_heap_vs_hashmap_gem5.dat
-gnuplot \
-  -e 'input_noext="bst_vs_heap_vs_hashmap_gem5"' \
-  -e 'heap_zoom_max=500' \
-  -e 'hashmap_zoom_max=400' \
-  ./bst-vs-heap-vs-hashmap.gnuplot \
-;
-xdg-open bst_vs_heap_vs_hashmap_gem5.tmp.png
-....
-
-TODO: the gem5 simulation blows up on a tcmalloc allocation somewhere near 25k elements as of 3fdd83c2c58327d9714fa2347c724b78d7c05e2b + 1, likely linked to the extreme inefficiency of the stats collection?
-
-The cache sizes were chosen to match the host <<p51>> to improve the comparison. Ideally we should also use the same standard library.
-
-Note that this will take a long time, and will produce a humongous ~40GB stats file as explained at: xref:gem5-only-dump-selected-stats[xrefstyle=full]
-
-Sources:
-
-* link:userland/cpp/bst_vs_heap_vs_hashmap.cpp[]
-* link:bst-vs-heap-vs-hashmap-gem5-stats[]
-* link:bst-vs-heap-vs-hashmap.gnuplot[]
-
-===== BLAS
-
-Buildroot supports it, which makes everything trivial:
-
-....
-./build-buildroot --config 'BR2_PACKAGE_OPENBLAS=y'
-./build-userland --package openblas -- userland/libs/openblas/hello.c
-./run --eval-after './libs/openblas/hello.out; echo $?'
-....
-
-Outcome: the test passes:
-
-....
-0
-....
-
-Source: link:userland/libs/openblas/hello.c[]
-
-The test performs a general matrix multiplication:
-
-....
-    | 1.0 -3.0 |   |  1.0 2.0  1.0 |       | 0.5 0.5 0.5 |   | 11.0 -9.0  5.0 |
-1 * | 2.0  4.0 | * | -3.0 4.0 -1.0 | + 2 * | 0.5 0.5 0.5 | = | -9.0 21.0 -1.0 |
-    | 1.0 -1.0 |                           | 0.5 0.5 0.5 |   |  5.0 -1.0  3.0 |
-....
-
-This can be deduced from the Fortran interfaces at
-
-....
-less "$(./getvar buildroot_build_build_dir)"/openblas-*/reference/dgemmf.f
-....
-
-which we can map to our call as:
-
-....
-C := alpha*op( A )*op( B ) + beta*C,
-SUBROUTINE DGEMMF( TRANA, TRANB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )
-cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans, 3, 3, 2, 1, A, 3, B, 3, 2, C, 3 );
-....
-
-===== Eigen
-
-Header-only linear algebra library with a mainline Buildroot package:
-
-....
-./build-buildroot --config 'BR2_PACKAGE_EIGEN=y'
-./build-userland --package eigen -- userland/libs/eigen/hello.cpp
-....
-
-Just create an array and print it:
-
-....
-./run --eval-after './libs/eigen/hello.out'
-....
-
-Output:
-
-....
-  3  -1
-2.5 1.5
-....
-
-Source: link:userland/libs/eigen/hello.cpp[]
-
-This example just creates a matrix and prints it out.
-
-Tested on: https://github.com/cirosantilli/linux-kernel-module-cheat/commit/a4bdcf102c068762bb1ef26c591fcf71e5907525[a4bdcf102c068762bb1ef26c591fcf71e5907525]
-
-===== PARSEC benchmark
-
-We have ported parts of the http://parsec.cs.princeton.edu[PARSEC benchmark] for cross compilation at: https://github.com/cirosantilli/parsec-benchmark See the documentation on that repo to find out which benchmarks have been ported. Some of the benchmarks are segfaulting; they are documented in that repo.
-
-There are two ways to run PARSEC with this repo:
-
-* <<parsec-benchmark-without-parsecmgmt,without `parsecmgmt`>>, most likely what you want
-* <<parsec-benchmark-with-parsecmgmt,with `parsecmgmt`>>
-
-====== PARSEC benchmark without parsecmgmt
-
-....
-./build --arch arm --download-dependencies gem5-buildroot parsec-benchmark
-./build-buildroot --arch arm --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y'
-./run --arch arm --emulator gem5
-....
-
-Once inside the guest, launch one of the `test` input sized benchmarks manually, as in:
-
-....
-cd /parsec/ext/splash2x/apps/fmm/run
-../inst/arm-linux.gcc/bin/fmm 1 < input_1
-....
-
-To find out how to run many of the benchmarks, have a look at the `test.sh` script of the `parsec-benchmark` repo.
-
-From the guest, you can also run it as:
-
-....
-cd /parsec
-./test.sh
-....
-
-but this might be a bit time consuming in gem5.
-
-====== PARSEC change the input size
-
-Running a benchmark of a size different from `test`, e.g. `simsmall`, requires a rebuild with:
-
-....
-./build-buildroot \
-  --arch arm \
-  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
-  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_INPUT_SIZE="simsmall"' \
-  -- parsec_benchmark-reconfigure \
-;
-....
-
-Larger inputs may also require tweaking:
-
-* <<br2-target-rootfs-ext2-size>>, if the unpacked inputs are large
-* <<memory-size>>, unless you want to meet the OOM killer, which is admittedly kind of fun
-
-`test.sh` only contains the run commands for the `test` size, and cannot be used for `simsmall`.
-
-The easiest thing to do is to https://superuser.com/questions/231002/how-can-i-search-within-the-output-buffer-of-a-tmux-shell/1253137#1253137[scroll up on the host shell] after the build, and look for a line of the type:
-
-....
-Running /root/linux-kernel-module-cheat/out/aarch64/buildroot/build/parsec-benchmark-custom/ext/splash2x/apps/ocean_ncp/inst/aarch64-linux.gcc/bin/ocean_ncp -n2050 -p1 -e1e-07 -r20000 -t28800
-....
-
-and then tweak the command found in `test.sh` accordingly.
-
-Yes, we do run the benchmarks on the host, just to unpack / generate the inputs. They are expected to fail to run, since they were built for the guest instead of the host, including for the x86_64 guest, which has a different interpreter than the host's (see `file myexecutable`).
-
-The rebuild is required because we unpack input files on the host.
-
-Separating input sizes also allows us to create smaller images when only running the smaller benchmarks.
-
-This limitation exists because `parsecmgmt` generates the input files just before running via the Bash scripts, but we can't run `parsecmgmt` on gem5, as it is too slow!
-
-One option would be to do that inside the guest with QEMU.
-
-Also, we can't generate all input sizes at once, because many of them have the same name and would overwrite one another...
-
-PARSEC simply wasn't designed with non-native machines in mind...
-
-====== PARSEC benchmark with parsecmgmt
-
-Most users won't want to use this method because:
-
-* running the `parsecmgmt` Bash scripts takes forever before it ever starts running the actual benchmarks on gem5
-+
-Running on QEMU is feasible, but not the main use case, since QEMU cannot be used for performance measurements
-* it requires putting the full `.tar` inputs on the guest, which makes the image twice as large (1x for the `.tar`, 1x for the unpacked input files)
-
-It would be awesome if it were possible to use this method, since this is what PARSEC supports officially, and so:
-
-* you don't have to dig into what raw command to run
-* there is an easy way to run all the benchmarks in one go to test them out
-* you can just run any of the benchmarks that you want
-
-but it simply is not feasible in gem5 because it takes too long.
-
-If you still want to run this, try it out with:
-
-....
-./build-buildroot \
-  --arch aarch64 \
-  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
-  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_PARSECMGMT=y' \
-  --config 'BR2_TARGET_ROOTFS_EXT2_SIZE="3G"' \
-  -- parsec_benchmark-reconfigure \
-;
-....
-
-And then you can run it just as you would on the host:
-
-....
-cd /parsec/
-bash
-. env.sh
-parsecmgmt -a run -p splash2x.fmm -i test
-....
-
-====== PARSEC uninstall
-
-If you want to remove PARSEC later, Buildroot doesn't provide an automated package removal mechanism, as mentioned at: xref:remove-buildroot-packages[xrefstyle=full], but the following procedure should be satisfactory:
-
-....
-rm -rf \
-  "$(./getvar buildroot_download_dir)"/parsec-* \
-  "$(./getvar buildroot_build_dir)"/build/parsec-* \
-  "$(./getvar buildroot_build_dir)"/build/packages-file-list.txt \
-  "$(./getvar buildroot_build_dir)"/images/rootfs.* \
-  "$(./getvar buildroot_build_dir)"/target/parsec-* \
-;
-./build-buildroot --arch arm
-....
-
-====== PARSEC benchmark hacking
-
-If you end up going inside link:submodules/parsec-benchmark[] to hack up the benchmark (you will!), these tips will be helpful.
-
-Buildroot was not designed to deal with large images, and currently cross rebuilds are a bit slow, due to some image generation and validation steps.
-
-A few workarounds are:
-
-* develop on the host first as much as you can. Our PARSEC fork supports it.
-+
-If you do this, don't forget to do a:
-+
-....
-cd "$(./getvar parsec_source_dir)"
-git clean -xdf .
-....
-+
-before going for the cross compile build.
-* patch Buildroot to work well, and keep cross compiling all the way. This should be totally viable, and we should do it.
-+
-Don't forget to explicitly rebuild PARSEC with:
-+
-....
-./build-buildroot \
-  --arch arm \
-  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
-  -- parsec_benchmark-reconfigure \
-;
-....
-+
-You may also want to test if your patches are still functionally correct inside of QEMU first, which is a faster emulator.
-* sell your soul, and compile natively inside the guest. We won't do this, not only because it is evil, but also because Buildroot explicitly does not support it: https://buildroot.org/downloads/manual/manual.html#faq-no-compiler-on-target ARM employees have been known to do this: https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/qemu-patch.diff
-
 === gem5 kernel command line parameters
 
 Analogous <<kernel-command-line-parameters,to QEMU>>:
@@ -14209,9 +13834,7 @@ Example: link:userland/c/memory_leak.c[]
 
 Maybe some day someone will use this setup to study the performance of interpreters:
 
-* <<node-js>>
-
-=== Node.js
+==== Node.js
 
 Parent section: <<interpreted-languages>>.
 
@@ -14237,6 +13860,444 @@ Examples:
 ** link:rootfs_overlay/lkmc/nodejs/file_write_read.js[]
 ** link:rootfs_overlay/lkmc/nodejs/read_stdin_to_string.js[] Question: https://stackoverflow.com/questions/30441025/read-all-text-from-stdin-to-a-string
 
+=== Algorithms
+
+link:userland/algorithm[]
+
+This is still work in progress and needs better automation, but it is already a good sketch. The idea was originally started at: https://github.com/cirosantilli/algorithm-cheat
+
+The key idea is that input / output pairs are present in human readable files, generated either:
+
+* manually, for small test inputs
+* with a Python script, for larger randomized tests
+
+Test programs then:
+
+* read input from stdin
+* produce output to stdout
+
+so that we can compare the output to the expected one.
+
+This way, tests can be reused across several implementations in different languages, emulating the many multi-language programming competition websites out there.
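The input / output pair pattern above can be sketched in a few lines of Python. The `generate_io` / `solve` / `check` helpers here are illustrative stand-ins (working on strings rather than files), not the actual scripts in link:userland/algorithm/set[]:

```python
# Sketch of the input / output pair testing pattern: a generator produces a
# (random) input together with the expected output, the program under test
# reads the input and writes its answer, and a comparator checks the two.
import random


def generate_io(n, seed=0):
    """Return (input_text, expected_output_text) for a sorting task."""
    rng = random.Random(seed)
    values = [rng.randrange(1000) for _ in range(n)]
    inp = "\n".join(map(str, values)) + "\n"
    exp = "\n".join(map(str, sorted(values))) + "\n"
    return inp, exp


def solve(inp):
    """The 'program under test': read numbers from the input, output them sorted."""
    return "\n".join(map(str, sorted(map(int, inp.split())))) + "\n"


def check(inp, exp):
    """Compare the actual output to the expected one, like `cmp tmp.o tmp.e`."""
    return solve(inp) == exp


# Hand-written small test, analogous to test_data/8.i / test_data/8.e:
assert check("3\n1\n2\n", "1\n2\n3\n")
# Larger randomized test, analogous to ./generate_io:
inp, exp = generate_io(1000)
assert check(inp, exp)
```

Because only stdin text and stdout text cross the boundary, the same `(inp, exp)` pairs can exercise implementations in any language.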
+
+For example, for a <<userland-setup-getting-started-natively,native run>> we can run a set / sorting test:
+
+....
+cd userland/algorithm/set
+./build
+
+# Run with a small hand written test.
+./std_set.out < test_data/8.i > tmp.raw
+
+# Extract the output from the sorted stdout, which also
+# contained some timing information.
+./parse_output output < tmp.raw > tmp.o
+
+# Compare the output to the expected one.
+cmp tmp.o test_data/8.e
+
+# Same, but now with a large randomly generated input.
+./generate_io
+./std_set.out < tmp.i | ./parse_output output > tmp.o
+cmp tmp.o tmp.e
+....
+
+Sources:
+
+* link:userland/algorithm/set/generate_input[]
+* link:userland/algorithm/set/main.hpp[]
+* link:userland/algorithm/set/parse_output[]
+* link:userland/algorithm/set/std_set.cpp[]
+* link:userland/algorithm/set/test_data/8.e[]
+* link:userland/algorithm/set/test_data/8.i[]
+
+link:userland/algorithm/set/parse_output[] is needed because timing instrumentation measurements must be embedded in the program's output itself, to allow:
+
+* discounting the input reading / output writing operations from the actual "read / write to / from memory" algorithm itself
+* measuring the evolution of the benchmark midway, e.g. to see how the current container size affects insertion time: <<bst-vs-heap-vs-hashmap>>
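As a rough illustration of that embedding, here is a Python sketch in which a benchmark interleaves tagged timing lines with its normal output, and a `parse_output`-style filter separates the streams again. The `time` / `output` line tags and the granule logic are illustrative assumptions, not the exact format used by the repository's scripts:

```python
# The benchmark tags every line it emits, so a single stdout stream can carry
# both the algorithm's answer and mid-run timing samples; a filter then
# recovers either stream, like `./parse_output output`.
import time


def benchmark(values, granule=100):
    """Insert values into a container, timing every `granule` insertions."""
    lines = []
    container = set()
    start = time.perf_counter()
    for i, v in enumerate(values, 1):
        container.add(v)
        if i % granule == 0:
            # Timing sample taken mid-run, at the current container size.
            lines.append(f"time {i} {time.perf_counter() - start:.9f}")
            start = time.perf_counter()
    # The actual algorithm output, emitted after all timed work is done.
    for v in sorted(container):
        lines.append(f"output {v}")
    return lines


def parse_output(lines, which="output"):
    """Keep only one tagged stream, dropping the tag prefix."""
    return [line.split(" ", 1)[1] for line in lines if line.startswith(which + " ")]


lines = benchmark(range(1000))
print(parse_output(lines)[:3])         # -> ['0', '1', '2']
print(len(parse_output(lines, "time")))  # -> 10
```

The correctness check only ever sees the `output` stream, so adding or removing timing samples cannot break the expected-output comparison.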
+
+The following are also interesting Buildroot libraries that we could benchmark:
+
+* Armadillo `C++`: linear algebra
+* fftw: Fourier transform
+* Flann
+* GSL: various
+* liblinear
+* libspatialindex
+* libtommath
+* qhull
+
+These are good targets for <<gem5-run-benchmark,performance analysis with gem5>>, and there is some overlap between this section and <<benchmarks>>.
+
+==== BST vs heap vs hashmap
+
+TODO: move benchmark graph from link:userland/cpp/bst_vs_heap_vs_hashmap.cpp[] to link:userland/algorithm/set[].
+
+The following benchmark setup works both:
+
+* on the host, through timers + a https://stackoverflow.com/questions/51952471/why-do-i-get-a-constant-instead-of-logarithmic-curve-for-an-insert-time-benchmar/51953081#51953081[granule]
+* on gem5, with <<m5ops-instructions,dumpstats>>, which can get more precise results with `granule == 1`
+
+It has been used to answer:
+
+* BST vs heap: https://stackoverflow.com/questions/6147243/heap-vs-binary-search-tree-bst/29548834#29548834
+* `std::set`: https://stackoverflow.com/questions/2558153/what-is-the-underlying-data-structure-of-a-stl-set-in-c/51944661#51944661
+* `std::map`: https://stackoverflow.com/questions/18414579/what-data-structure-is-inside-stdmap-in-c/51945119#51945119
+
+To benchmark on the host, we do:
+
+....
+./build-userland-in-tree \
+  --force-rebuild \
+  --optimization-level 3 \
+  ./userland/cpp/bst_vs_heap_vs_hashmap.cpp \
+;
+./userland/cpp/bst_vs_heap_vs_hashmap.out 10000000 10000 0 | tee bst_vs_heap_vs_hashmap.dat
+gnuplot \
+  -e 'input_noext="bst_vs_heap_vs_hashmap"' \
+  -e 'heap_zoom_max=50' \
+  -e 'hashmap_zoom_max=400' \
+  ./bst-vs-heap-vs-hashmap.gnuplot \
+;
+xdg-open bst_vs_heap_vs_hashmap.tmp.png
+....
+
+The parameters `heap_zoom_max` and `hashmap_zoom_max` are chosen manually and interactively to best showcase the regions of interest in those plots.
+
+To benchmark on gem5, we first build the benchmark with <<m5ops-instructions>> enabled, and then we run it and extract the stats:
+
+....
+./build-userland \
+  --arch x86_64 \
+  --ccflags='-DLKMC_M5OPS_ENABLE=1' \
+  --force-rebuild userland/cpp/bst_vs_heap_vs_hashmap.cpp \
+  --static \
+  --optimization-level 3 \
+;
+./run \
+  --arch x86_64 \
+  --emulator gem5 \
+  --static \
+  --userland userland/cpp/bst_vs_heap_vs_hashmap.cpp \
+  --userland-args='100000 1 0' \
+  -- \
+  --cpu-type=DerivO3CPU \
+  --caches \
+  --l2cache \
+  --l1d_size=32kB \
+  --l1i_size=32kB \
+  --l2_size=256kB \
+  --l3_size=20MB \
+;
+./bst-vs-heap-vs-hashmap-gem5-stats --arch x86_64 | tee bst_vs_heap_vs_hashmap_gem5.dat
+gnuplot \
+  -e 'input_noext="bst_vs_heap_vs_hashmap_gem5"' \
+  -e 'heap_zoom_max=500' \
+  -e 'hashmap_zoom_max=400' \
+  ./bst-vs-heap-vs-hashmap.gnuplot \
+;
+xdg-open bst_vs_heap_vs_hashmap_gem5.tmp.png
+....
+
+TODO: the gem5 simulation blows up on a tcmalloc allocation somewhere near 25k elements as of 3fdd83c2c58327d9714fa2347c724b78d7c05e2b + 1, likely linked to the extreme inefficiency of the stats collection?
+
+The cache sizes were chosen to match the host <<p51>> to improve the comparison. Ideally we should also use the same standard library.
+
+Note that this will take a long time, and will produce a humongous ~40GB stats file as explained at: xref:gem5-only-dump-selected-stats[xrefstyle=full]
+
+Sources:
+
+* link:userland/cpp/bst_vs_heap_vs_hashmap.cpp[]
+* link:bst-vs-heap-vs-hashmap-gem5-stats[]
+* link:bst-vs-heap-vs-hashmap.gnuplot[]
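For a quick host-only feel of the same comparison, a Python sketch can reproduce the granule-timing idea with stand-in containers: `heapq` for the heap, `dict` for the hashmap, and `bisect.insort` on a list in place of a BST (which Python lacks as a built-in). This is an illustrative sketch, not the C++ benchmark above:

```python
# Time n random insertions into three container kinds, sampling elapsed time
# once per `granule` insertions so growth trends can be plotted, in the spirit
# of bst_vs_heap_vs_hashmap.cpp.
import bisect
import heapq
import random
import time


def bench(insert, n, granule=1000):
    """Return [(element_count, seconds_per_granule)] for n random insertions."""
    rng = random.Random(0)
    samples, start = [], time.perf_counter()
    for i in range(1, n + 1):
        insert(rng.random())
        if i % granule == 0:
            samples.append((i, time.perf_counter() - start))
            start = time.perf_counter()
    return samples


heap, ordered, hashmap = [], [], {}
results = {
    "heap": bench(lambda x: heapq.heappush(heap, x), 10000),
    "ordered": bench(lambda x: bisect.insort(ordered, x), 10000),
    "hashmap": bench(lambda x: hashmap.__setitem__(x, None), 10000),
}
for name, samples in results.items():
    print(name, len(samples))
```

Note that `bisect.insort` is O(n) per insertion due to list shifting, so its curve grows much faster than a real balanced BST's O(log n) would; it only stands in for "ordered container" here.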
+
+==== BLAS
+
+Buildroot supports it, which makes everything trivial:
+
+....
+./build-buildroot --config 'BR2_PACKAGE_OPENBLAS=y'
+./build-userland --package openblas -- userland/libs/openblas/hello.c
+./run --eval-after './libs/openblas/hello.out; echo $?'
+....
+
+Outcome: the test passes:
+
+....
+0
+....
+
+Source: link:userland/libs/openblas/hello.c[]
+
+The test performs a general matrix multiplication:
+
+....
+    | 1.0 -3.0 |   |  1.0 2.0  1.0 |       | 0.5 0.5 0.5 |   | 11.0 -9.0  5.0 |
+1 * | 2.0  4.0 | * | -3.0 4.0 -1.0 | + 2 * | 0.5 0.5 0.5 | = | -9.0 21.0 -1.0 |
+    | 1.0 -1.0 |                           | 0.5 0.5 0.5 |   |  5.0 -1.0  3.0 |
+....
+
+This can be deduced from the Fortran interfaces at
+
+....
+less "$(./getvar buildroot_build_build_dir)"/openblas-*/reference/dgemmf.f
+....
+
+which we can map to our call as:
+
+....
+C := alpha*op( A )*op( B ) + beta*C,
+SUBROUTINE DGEMMF( TRANA, TRANB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )
+cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans, 3, 3, 2, 1, A, 3, B, 3, 2, C, 3 );
+....
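The arithmetic of that call can be double checked with a few lines of pure Python. This is a verification sketch, not the actual link:userland/libs/openblas/hello.c[]; it assumes, as the displayed matrices imply, that B stores the same 3x2 data as A and is accessed transposed (`CblasTrans`), with alpha = 1, beta = 2, M = 3, N = 3, K = 2:

```python
# C := alpha * A * B^T + beta * C, reproducing the dgemm call by hand.
A = [[1.0, -3.0], [2.0, 4.0], [1.0, -1.0]]  # 3x2, CblasNoTrans
B = [[1.0, -3.0], [2.0, 4.0], [1.0, -1.0]]  # stored 3x2, used transposed (2x3)
C = [[0.5] * 3 for _ in range(3)]           # 3x3, all 0.5 before the call
alpha, beta = 1.0, 2.0

for i in range(3):
    for j in range(3):
        # B accessed transposed: (B^T)[k][j] == B[j][k].
        acc = sum(A[i][k] * B[j][k] for k in range(2))
        C[i][j] = alpha * acc + beta * 0.5  # beta times the original C entry

for row in C:
    print(row)
# -> [11.0, -9.0, 5.0]
#    [-9.0, 21.0, -1.0]
#    [5.0, -1.0, 3.0]
```

The result matches the 3x3 matrix shown in the literal block above.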
|
==== Eigen
|
||||||
|
|
||||||
|
Header only linear algebra library with a mainline Buildroot package:
|
||||||
|
|
||||||
|
....
|
||||||
|
./build-buildroot --config 'BR2_PACKAGE_EIGEN=y'
|
||||||
|
./build-userland --package eigen -- userland/libs/eigen/hello.cpp
|
||||||
|
....
|
||||||
|
|
||||||
|
Just create an array and print it:
|
||||||
|
|
||||||
|
....
|
||||||
|
./run --eval-after './libs/eigen/hello.out'
|
||||||
|
....
|
||||||
|
|
||||||
|
Output:
|
||||||
|
|
||||||
|
....
|
||||||
|
3 -1
|
||||||
|
2.5 1.5
|
||||||
|
....
|
||||||
|
|
||||||
|
Source: link:userland/libs/eigen/hello.cpp[]
|
||||||
|
|
||||||
|
This example just creates a matrix and prints it out.
|
||||||
|
|
||||||
|
Tested on: https://github.com/cirosantilli/linux-kernel-module-cheat/commit/a4bdcf102c068762bb1ef26c591fcf71e5907525[a4bdcf102c068762bb1ef26c591fcf71e5907525]
|
||||||
|
|
||||||
|
=== Benchmarks
|
||||||
|
|
||||||
|
These are good targets for <<gem5-run-benchmark,performance analysis with gem5>>.
|
||||||
|
|
||||||
|
TODO also consider the following:
|
||||||
|
|
||||||
|
* https://github.com/kozyraki/stamp transactional memory benchmarks
|
||||||
|
|
||||||
|
==== Dhrystone
|
||||||
|
|
||||||
|
https://en.wikipedia.org/wiki/Dhrystone
|
||||||
|
|
||||||
|
Created in the 80's, it is not a representative measure of performance in modern computers anymore. It has mostly been replaced by https://en.wikipedia.org/wiki/SPECint[SPEC], which is... closed source! Unbelievable.
|
||||||
|
|
||||||
|
<<buildroot>> has a `dhrystone` package, but because it is so interesting to us, we decided to also build it ourselves, which allows things like static and baremetal compilation more easily.
|
||||||
|
|
||||||
|
Build and run on QEMU <<user-mode-simulation>>:
|
||||||
|
|
||||||
|
....
|
||||||
|
git submodule update --init submodules/dhrystone
|
||||||
|
./build-dhrystone --mode userland
|
||||||
|
./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
|
||||||
|
....
|
||||||
|
|
||||||
|
Build and run on gem5 user mode:
|
||||||
|
|
||||||
|
....
|
||||||
|
./build-dhrystone --mode userland --static --force-rebuild
|
||||||
|
./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
|
||||||
|
....
|
||||||
|
|
||||||
|
TODO automate run more nicely.
|
||||||
|
|
||||||
|
Build for <<baremetal>> execution and run it in baremetal QEMU:
|
||||||
|
|
||||||
|
....
|
||||||
|
# Build our Newlib stubs.
|
||||||
|
./build-baremetal --arch aarch64
|
||||||
|
./build-dhrystone --arch aarch64 --mode baremetal
|
||||||
|
./run --arch aarch64 --baremetal "$(./getvar baremetal_build_dir)/submodules/dhrystone/dhrystone"
|
||||||
|
....
|
||||||
|
|
||||||
|
TODO: fix the build, just need to factor out all run arguments from link:build-baremetal[] into link:common.py[] and it should just work, no missing syscalls.
|
||||||
|
|
||||||
|
If you really want the Buildroot package for some reason, build it with:
|
||||||
|
|
||||||
|
....
|
||||||
|
./build-buildroot --config 'BR2_PACKAGE_DHRYSTONE=y'
|
||||||
|
....
|
||||||
|
|
||||||
|
and run inside the guest from `PATH` with:
|
||||||
|
|
||||||
|
....
|
||||||
|
dhrystone
|
||||||
|
....
|
||||||
|
|
||||||
|
==== PARSEC benchmark

We have ported parts of the http://parsec.cs.princeton.edu[PARSEC benchmark] for cross compilation at: https://github.com/cirosantilli/parsec-benchmark See the documentation on that repo to find out which benchmarks have been ported. Some of the benchmarks are segfaulting; they are documented in that repo.

There are two ways to run PARSEC with this repo:

* <<parsec-benchmark-without-parsecmgmt,without `parsecmgmt`>>, most likely what you want
* <<parsec-benchmark-with-parsecmgmt,with `parsecmgmt`>>

===== PARSEC benchmark without parsecmgmt

....
./build --arch arm --download-dependencies gem5-buildroot parsec-benchmark
./build-buildroot --arch arm --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y'
./run --arch arm --emulator gem5
....

Once inside the guest, launch one of the `test` input sized benchmarks manually, as in:

....
cd /parsec/ext/splash2x/apps/fmm/run
../inst/arm-linux.gcc/bin/fmm 1 < input_1
....

To find out how to run many of the benchmarks, have a look at the `test.sh` script of the `parsec-benchmark` repo.

From the guest, you can also run it as:

....
cd /parsec
./test.sh
....

but this might be a bit time consuming in gem5.
===== PARSEC change the input size

Running a benchmark with a size other than `test`, e.g. `simsmall`, requires a rebuild with:

....
./build-buildroot \
  --arch arm \
  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_INPUT_SIZE="simsmall"' \
  -- parsec_benchmark-reconfigure \
;
....

Large inputs may also require tweaking:

* <<br2-target-rootfs-ext2-size>> if the unpacked inputs are large
* <<memory-size>>, unless you want to meet the OOM killer, which is admittedly kind of fun

`test.sh` only contains the run commands for the `test` size, and cannot be used for `simsmall`.

The easiest thing to do is to https://superuser.com/questions/231002/how-can-i-search-within-the-output-buffer-of-a-tmux-shell/1253137#1253137[scroll up on the host shell] after the build, and look for a line of type:

....
Running /root/linux-kernel-module-cheat/out/aarch64/buildroot/build/parsec-benchmark-custom/ext/splash2x/apps/ocean_ncp/inst/aarch64-linux.gcc/bin/ocean_ncp -n2050 -p1 -e1e-07 -r20000 -t28800
....

and then tweak the command found in `test.sh` accordingly.

Yes, we do run the benchmarks on the host just to unpack / generate the inputs. They are expected to fail to run, since they were built for the guest instead of the host, including for an x86_64 guest, which has a different interpreter than the host's (see `file myexecutable`).

The rebuild is required because we unpack the input files on the host.

Separating input sizes also allows creating smaller images when only running the smaller benchmarks.

This limitation exists because `parsecmgmt` generates the input files via the Bash scripts just before running, but we can't run `parsecmgmt` on gem5 as it is too slow!

One option would be to do that inside the guest with QEMU.

Also, we can't generate all input sizes at once, because many of them have the same name and would overwrite one another...

PARSEC simply wasn't designed with non-native machines in mind...
===== PARSEC benchmark with parsecmgmt

Most users won't want to use this method because:

* running the `parsecmgmt` Bash scripts takes forever before it ever starts running the actual benchmarks on gem5
+
Running on QEMU is feasible, but not the main use case, since QEMU cannot be used for performance measurements
* it requires putting the full `.tar` inputs on the guest, which makes the image twice as large (1x for the `.tar`, 1x for the unpacked input files)

It would be awesome if it were possible to use this method, since this is what PARSEC supports officially, and so:

* you don't have to dig into what raw command to run
* there is an easy way to run all the benchmarks in one go to test them out
* you can just run any of the benchmarks that you want

but it simply is not feasible in gem5 because it takes too long.

If you still want to run this, try it out with:

....
./build-buildroot \
  --arch aarch64 \
  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_PARSECMGMT=y' \
  --config 'BR2_TARGET_ROOTFS_EXT2_SIZE="3G"' \
  -- parsec_benchmark-reconfigure \
;
....

And then you can run it just as you would on the host:

....
cd /parsec/
bash
. env.sh
parsecmgmt -a run -p splash2x.fmm -i test
....
===== PARSEC uninstall

If you want to remove PARSEC later, Buildroot doesn't provide an automated package removal mechanism as mentioned at: xref:remove-buildroot-packages[xrefstyle=full], but the following procedure should be satisfactory:

....
rm -rf \
  "$(./getvar buildroot_download_dir)"/parsec-* \
  "$(./getvar buildroot_build_dir)"/build/parsec-* \
  "$(./getvar buildroot_build_dir)"/build/packages-file-list.txt \
  "$(./getvar buildroot_build_dir)"/images/rootfs.* \
  "$(./getvar buildroot_build_dir)"/target/parsec-* \
;
./build-buildroot --arch arm
....
===== PARSEC benchmark hacking

If you end up going inside link:submodules/parsec-benchmark[] to hack up the benchmark (you will!), these tips will be helpful.

Buildroot was not designed to deal with large images, and currently cross rebuilds are a bit slow, due to some image generation and validation steps.

A few workarounds are:

* develop on the host first as much as you can. Our PARSEC fork supports it.
+
If you do this, don't forget to do a:
+
....
cd "$(./getvar parsec_source_dir)"
git clean -xdf .
....
+
before going for the cross compile build.
* patch Buildroot to work well, and keep cross compiling all the way. This should be totally viable, and we should do it.
+
Don't forget to explicitly rebuild PARSEC with:
+
....
./build-buildroot \
  --arch arm \
  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
  -- parsec_benchmark-reconfigure \
;
....
+
You may also want to test if your patches are still functionally correct inside QEMU first, which is a faster emulator.
* sell your soul, and compile natively inside the guest. We won't do this, not only because it is evil, but also because Buildroot explicitly does not support it: https://buildroot.org/downloads/manual/manual.html#faq-no-compiler-on-target ARM employees have been known to do this: https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/qemu-patch.diff

=== Userland content bibliography

* The Linux Programming Interface by Michael Kerrisk https://www.amazon.co.uk/Linux-Programming-Interface-System-Handbook/dp/1593272200 Lots of open source POSIX examples: https://github.com/cirosantilli/linux-programming-interface-kerrisk
common.py
@@ -1941,3 +1941,36 @@ class TestCliFunction(LkmcCliFunction):
             self.log_error('A test failed')
             return 1
         return 0
+
+# IO format.
+
+class LkmcList(list):
+    '''
+    list with a lightweight serialization format for algorithm IO.
+    '''
+    def __init__(self, *args, **kwargs):
+        if 'oneline' in kwargs:
+            self.oneline = kwargs['oneline']
+            del kwargs['oneline']
+        else:
+            self.oneline = False
+        super().__init__(*args, **kwargs)
+    def __str__(self):
+        if self.oneline:
+            sep = ' '
+        else:
+            sep = '\n'
+        return sep.join([str(item) for item in self])
+
+class LkmcOrderedDict(collections.OrderedDict):
+    '''
+    dict with a lightweight serialization format for algorithm IO.
+    '''
+    def __str__(self):
+        out = []
+        for key in self:
+            out.extend([
+                str(key),
+                str(self[key]) + '\n',
+            ])
+        return '\n'.join(out)
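To illustrate the serialization format these two classes produce, here is a minimal standalone sketch: the class bodies are copied from the `common.py` hunk above, lightly condensed, so it runs without the rest of `common.py`.

```python
import collections

class LkmcList(list):
    '''list with a lightweight serialization format for algorithm IO.'''
    def __init__(self, *args, **kwargs):
        # Behaviorally equivalent to the 'oneline' kwarg handling above.
        self.oneline = kwargs.pop('oneline', False)
        super().__init__(*args, **kwargs)
    def __str__(self):
        sep = ' ' if self.oneline else '\n'
        return sep.join(str(item) for item in self)

class LkmcOrderedDict(collections.OrderedDict):
    '''dict with a lightweight serialization format for algorithm IO.'''
    def __str__(self):
        out = []
        for key in self:
            out.extend([str(key), str(self[key]) + '\n'])
        return '\n'.join(out)

data = LkmcOrderedDict()
data['output'] = LkmcList([0, 1, 2])
data['times'] = LkmcList([LkmcList([10, 20], oneline=True)])
# Each section: header line, one value per line, blank line terminator.
print(data)
```

Each dict key becomes a section header, each list item a line, and sections are separated by blank lines, which is the same format the benchmarks write to stdout.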
userland/algorithm/set/generate_io (new executable file, 30 lines)
@@ -0,0 +1,30 @@
#!/usr/bin/env python3

import argparse
import random
import sys
import os

sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))))
import common

# Handle CLI arguments.
parser = argparse.ArgumentParser()
parser.add_argument('--min', type=int, default=0)
parser.add_argument('--max', type=int, default=(2**32 - 1))
parser.add_argument('--seed', type=int)
parser.add_argument('--size', type=int, default=1000000)
parser.add_argument('--unique', type=bool, default=True,
                    help='if True, remove duplicates from the expected output')
args = parser.parse_args()
random.seed(args.seed)
input_data = common.LkmcList()
for i in range(args.size):
    input_data.append(random.randint(args.min, args.max))
with open('tmp.i', 'w') as i:
    i.write(str(input_data) + '\n')
if args.unique:
    input_data = common.LkmcList(set(input_data))
input_data.sort()
with open('tmp.e', 'w') as e:
    e.write(str(input_data) + '\n')
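As a sketch of the `tmp.i` / `tmp.e` pair the script produces, without the `common.LkmcList` dependency (sizes and ranges here are just illustrative): the input file gets one random integer per line, and the expected-output file gets the sorted, deduplicated version.

```python
import random

# Illustrative stand-in for generate_io's core logic: one random integer
# per line for the input, sorted unique integers for the expected output.
random.seed(0)
input_data = [random.randint(0, 99) for _ in range(10)]
expected = sorted(set(input_data))

input_text = '\n'.join(str(x) for x in input_data) + '\n'   # what tmp.i holds
expected_text = '\n'.join(str(x) for x in expected) + '\n'  # what tmp.e holds
print(input_text)
print(expected_text)
```

One caveat about the script as committed: `argparse`'s `type=bool` converts any non-empty string to `True`, so `--unique False` on the command line would still deduplicate; only an empty string yields `False`.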
@@ -20,7 +20,7 @@
 int main(int argc, char **argv) {
     typedef uint64_t T;
 #if LKMC_ALGORITHM_SET_STD_PRIORITY_QUEUE
-    std::priority_queue<T> set;
+    std::priority_queue<T, std::vector<T>, std::greater<int>> set;
 #endif
 #if LKMC_ALGORITHM_SET_STD_SET
     std::set<T> set;
@@ -28,9 +28,8 @@ int main(int argc, char **argv) {
 #if LKMC_ALGORITHM_SET_STD_UNORDERED_SET
     std::unordered_set<T> set;
 #endif
-    std::vector<T> randoms;
+    std::vector<T> input;
     size_t i, j = 0, n, granule, base;
-    unsigned int seed;
 #ifndef LKMC_M5OPS_ENABLE
     std::vector<std::chrono::nanoseconds::rep> dts;
     std::vector<decltype(base)> bases;
@@ -38,26 +37,21 @@ int main(int argc, char **argv) {
 
     // CLI arguments.
     if (argc > 1) {
-        n = std::stoi(argv[1]);
-    } else {
-        n = 10;
-    }
-    if (argc > 2) {
-        granule = std::stoi(argv[2]);
+        granule = std::stoi(argv[1]);
     } else {
         granule = 1;
     }
-    if (argc > 3) {
-        seed = std::stoi(argv[3]);
-    } else {
-        seed = std::random_device()();
+
+    // Read input from stdin.
+    std::string str;
+    while (std::getline(std::cin, str)) {
+        if (str == "")
+            break;
+        input.push_back(std::stoll(str));
     }
+    n = input.size();
 
     // Action.
-    for (i = 0; i < n; ++i) {
-        randoms.push_back(i);
-    }
-    std::shuffle(randoms.begin(), randoms.end(), std::mt19937(seed));
     for (i = 0; i < n / granule; ++i) {
 #ifndef LKMC_M5OPS_ENABLE
         using clk = std::chrono::high_resolution_clock;
@@ -71,9 +65,9 @@ int main(int argc, char **argv) {
         for (j = 0; j < granule; ++j) {
 #endif
 #if LKMC_ALGORITHM_SET_STD_PRIORITY_QUEUE
-            set.emplace(randoms[base + j]);
+            set.emplace(input[base + j]);
 #else
-            set.insert(randoms[base + j]);
+            set.insert(input[base + j]);
 #endif
 #ifdef LKMC_M5OPS_ENABLE
             LKMC_M5OPS_DUMPSTATS;
@@ -87,8 +81,29 @@ int main(int argc, char **argv) {
     }
 
     // Report results.
+    std::cout << "output" << std::endl;
+#if LKMC_ALGORITHM_SET_STD_PRIORITY_QUEUE
+    while (!set.empty()) {
+        std::cout << set.top() << std::endl;
+        set.pop();
+    }
+    //T last_val = set.top();
+    //std::cout << last_val << std::endl;
+    //set.pop();
+    //while (!set.empty()) {
+    //    const auto& val = set.top();
+    //    if (val != last_val)
+    //        std::cout << val << std::endl;
+    //    last_val = val;
+    //    set.pop();
+    //}
+#else
+    for (const auto& item : set) {
+        std::cout << item << std::endl;
+    }
+#endif
+    std::cout << std::endl;
 #ifndef LKMC_M5OPS_ENABLE
-    // Output.
     std::cout << "times" << std::endl;
     auto bases_it = bases.begin();
     auto dts_it = dts.begin();
@@ -99,17 +114,5 @@ int main(int argc, char **argv) {
         bases_it++;
         dts_it++;
     }
-    std::cout << std::endl;
-    std::cout << "output" << std::endl;
-#if LKMC_ALGORITHM_SET_STD_PRIORITY_QUEUE
-    while (!set.empty()) {
-        std::cout << set.top() << std::endl;
-        set.pop();
-    }
-#else
-    for (const auto& item : set) {
-        std::cout << item << std::endl;
-    }
-#endif
 #endif
 }
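Putting the hunks together, the benchmark's new IO protocol can be summarized with a Python sketch (a simplified model of the `std::set` variants, not the actual harness): integers are read from stdin one per line until a blank line, inserted into the set, and the final contents are printed under an `output` header terminated by a blank line.

```python
import io

def run_set_benchmark(stdin_text):
    # Read one integer per line until a blank line, like the C++ main().
    values = set()
    for line in io.StringIO(stdin_text):
        line = line.rstrip()
        if line == '':
            break
        values.add(int(line))
    # "output" header, one value per line in sorted order, blank line terminator.
    return 'output\n' + '\n'.join(str(v) for v in sorted(values)) + '\n\n'

print(run_set_benchmark('1\n3\n2\n0\n\n'), end='')
```

`std::set` iterates in ascending key order, so `sorted()` models its output here; the min-priority-queue variant (`std::greater` comparator) also pops in ascending order, but without deduplication, which is what the commented-out `last_val` code in the hunk above is about.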
userland/algorithm/set/parse_output (new executable file, 40 lines)
@@ -0,0 +1,40 @@
#!/usr/bin/env python3

import argparse
import collections
import sys
import os

sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))))
import common

data = common.LkmcOrderedDict()

# Parse

output = common.LkmcList()
next(sys.stdin)
for line in sys.stdin:
    line = line.rstrip()
    if line == '':
        break
    output.append(int(line))
data['output'] = output

times = common.LkmcList()
next(sys.stdin)
for line in sys.stdin:
    line = line.rstrip()
    if line == '':
        break
    times.append(common.LkmcList([int(i) for i in line.split(' ')], oneline=True))
data['times'] = times

# Handle CLI arguments.
parser = argparse.ArgumentParser()
parser.add_argument('key', nargs='?')
args = parser.parse_args()
if args.key:
    print(data[args.key])
else:
    print(data)
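The parsing above can be exercised on an in-memory stream; this sketch mirrors the script's logic with plain lists instead of `common.LkmcList`:

```python
import io

def parse(stream):
    # Each section: header line, one value per line, blank line terminator.
    data = {}
    next(stream)  # skip the 'output' header
    output = []
    for line in stream:
        line = line.rstrip()
        if line == '':
            break
        output.append(int(line))
    data['output'] = output
    next(stream)  # skip the 'times' header
    times = []
    for line in stream:
        line = line.rstrip()
        if line == '':
            break
        times.append([int(i) for i in line.split(' ')])
    data['times'] = times
    return data

data = parse(io.StringIO('output\n0\n1\n\ntimes\n10 20\n30 40\n\n'))
print(data)
```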
userland/algorithm/set/test_data/3.e (new file)
0
1
2

userland/algorithm/set/test_data/3.i (new file)
1
2
0

userland/algorithm/set/test_data/4.e (new file)
0
1
2
3

userland/algorithm/set/test_data/4.i (new file)
1
3
2
0

userland/algorithm/set/test_data/5.e (new file)
0
1
2
3
4

userland/algorithm/set/test_data/5.i (new file)
1
4
0
2
3

userland/algorithm/set/test_data/8.e (new file)
0
1
2
3
4
5
6
7

userland/algorithm/set/test_data/8.i (new file)
4
5
6
2
1
3
0
7
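Each `.e` file is simply the sorted, deduplicated contents of the matching `.i` file (the committed inputs happen to be duplicate-free), which a couple of lines of Python can check using the `8.i` / `8.e` data above:

```python
# Contents of test_data/8.i and test_data/8.e as committed above.
input_8 = [4, 5, 6, 2, 1, 3, 0, 7]
expected_8 = [0, 1, 2, 3, 4, 5, 6, 7]

# The expected output of the set benchmark is the sorted unique input.
assert sorted(set(input_8)) == expected_8
print('ok')
```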