mirror of https://github.com/cirosantilli/linux-kernel-module-cheat.git
synced 2026-01-25 03:01:36 +01:00

Commit: start moving algorithm in
Changed file: README.adoc

@@ -1007,7 +1007,7 @@ Notable userland content included / moving into this repository includes:

* <<c>>
* <<cpp>>
* <<posix>>
* https://github.com/cirosantilli/algorithm-cheat TODO will be good to move here for performance analysis <<gem5-run-benchmark,with gem5>>
* <<algorithms>>

==== Userland setup getting started

@@ -10645,6 +10645,11 @@ Now you can play a fun little game with your friends:

* make a program that solves the computation problem, and writes its output to stdout
* write the code that runs the correct computation in the smallest number of cycles possible

Interesting algorithms and benchmarks for this game are being collected at:

* <<algorithms>>
* <<benchmarks>>

To find out why your program is slow, a good first step is to have a look at the <<gem5-m5out-stats-txt-file>>.
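
To make that first step concrete, here is a small sketch of pulling selected counters out of a gem5 `m5out/stats.txt` dump. The `name value # description` line format is gem5's usual one, but exact stat names (e.g. `system.cpu.numCycles`) vary per CPU model, so treat the ones below as assumptions:

```python
# Sketch: extract selected counters from a gem5 stats.txt dump.
# gem5 stats lines typically look like:
#   system.cpu.numCycles   12345   # number of cpu cycles simulated
# Stat names vary by CPU model; the ones used here are assumptions.
import re

STAT_RE = re.compile(r'^(\S+)\s+([0-9][0-9.eE+-]*)')

def parse_stats(path, wanted):
    """Return {stat_name: value} for the first dump that defines each stat."""
    stats = {}
    with open(path) as f:
        for line in f:
            m = STAT_RE.match(line)
            if m and m.group(1) in wanted and m.group(1) not in stats:
                stats[m.group(1)] = float(m.group(2))
    return stats

# Example usage:
# parse_stats('m5out/stats.txt', {'system.cpu.numCycles', 'simInsts'})
```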

==== Skip extra benchmark instructions

@@ -11028,386 +11033,6 @@ TODO: why doesn't this exist:

ls /sys/devices/system/cpu/cpu0/cpufreq
....

==== Interesting benchmarks

Buildroot built-in libraries, mostly under Libraries > Other:

* Armadillo `C++`: linear algebra
* fftw: Fourier transform
* Flann
* GSL: various
* liblinear
* libspatialindex
* libtommath
* qhull

Open source but not in Buildroot:

* https://github.com/kozyraki/stamp transactional memory benchmarks

These are not yet enabled, but it should be easy to do so, see: xref:add-new-buildroot-packages[xrefstyle=full]

===== Dhrystone

https://en.wikipedia.org/wiki/Dhrystone

Created in the 1980s, it is no longer a representative measure of performance on modern computers. It has mostly been replaced by https://en.wikipedia.org/wiki/SPECint[SPEC], which is... closed source! Unbelievable.

<<buildroot>> has a `dhrystone` package, but because it is so interesting to us, we decided to also build it ourselves, which more easily allows things like static and baremetal compilation.

Build and run on QEMU <<user-mode-simulation>>:

....
git submodule update --init submodules/dhrystone
./build-dhrystone --mode userland
./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
....

Build and run on gem5 user mode:

....
./build-dhrystone --mode userland --static --force-rebuild
./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
....

TODO: automate the run more nicely.

Build for <<baremetal>> execution and run it in baremetal QEMU:

....
# Build our Newlib stubs.
./build-baremetal --arch aarch64
./build-dhrystone --arch aarch64 --mode baremetal
./run --arch aarch64 --baremetal "$(./getvar baremetal_build_dir)/submodules/dhrystone/dhrystone"
....

TODO: fix the build; we just need to factor out all run arguments from link:build-baremetal[] into link:common.py[], and then it should just work with no missing syscalls.

If you really want the Buildroot package for some reason, build it with:

....
./build-buildroot --config 'BR2_PACKAGE_DHRYSTONE=y'
....

and run inside the guest from `PATH` with:

....
dhrystone
....

===== BST vs heap vs hashmap

TODO: move benchmark graph from link:userland/cpp/bst_vs_heap_vs_hashmap.cpp[] to link:userland/algorithm/set[].

The following benchmark setup works both:

* on the host, through timers + a https://stackoverflow.com/questions/51952471/why-do-i-get-a-constant-instead-of-logarithmic-curve-for-an-insert-time-benchmar/51953081#51953081[granule]
* on gem5, with <<m5ops-instructions,dumpstats>>, which can get more precise results with `granule == 1`

It has been used to answer:

* BST vs heap: https://stackoverflow.com/questions/6147243/heap-vs-binary-search-tree-bst/29548834#29548834
* `std::set`: https://stackoverflow.com/questions/2558153/what-is-the-underlying-data-structure-of-a-stl-set-in-c/51944661#51944661
* `std::map`: https://stackoverflow.com/questions/18414579/what-data-structure-is-inside-stdmap-in-c/51945119#51945119

To benchmark on the host, we do:

....
./build-userland-in-tree \
  --force-rebuild \
  --optimization-level 3 \
  ./userland/cpp/bst_vs_heap_vs_hashmap.cpp \
;
./userland/cpp/bst_vs_heap_vs_hashmap.out 10000000 10000 0 | tee bst_vs_heap_vs_hashmap.dat
gnuplot \
  -e 'input_noext="bst_vs_heap_vs_hashmap"' \
  -e 'heap_zoom_max=50' \
  -e 'hashmap_zoom_max=400' \
  ./bst-vs-heap-vs-hashmap.gnuplot \
;
xdg-open bst_vs_heap_vs_hashmap.tmp.png
....

The parameters `heap_zoom_max` and `hashmap_zoom_max` are chosen manually and interactively to best showcase the regions of interest in those plots.

To benchmark on gem5, we first build the benchmark with <<m5ops-instructions>> enabled, and then we run it and extract the stats:

....
./build-userland \
  --arch x86_64 \
  --ccflags='-DLKMC_M5OPS_ENABLE=1' \
  --force-rebuild userland/cpp/bst_vs_heap_vs_hashmap.cpp \
  --static \
  --optimization-level 3 \
;
./run \
  --arch x86_64 \
  --emulator gem5 \
  --static \
  --userland userland/cpp/bst_vs_heap_vs_hashmap.cpp \
  --userland-args='100000 1 0' \
  -- \
  --cpu-type=DerivO3CPU \
  --caches \
  --l2cache \
  --l1d_size=32kB \
  --l1i_size=32kB \
  --l2_size=256kB \
  --l3_size=20MB \
;
./bst-vs-heap-vs-hashmap-gem5-stats --arch x86_64 | tee bst_vs_heap_vs_hashmap_gem5.dat
gnuplot \
  -e 'input_noext="bst_vs_heap_vs_hashmap_gem5"' \
  -e 'heap_zoom_max=500' \
  -e 'hashmap_zoom_max=400' \
  ./bst-vs-heap-vs-hashmap.gnuplot \
;
xdg-open bst_vs_heap_vs_hashmap_gem5.tmp.png
....

TODO: the gem5 simulation blows up on a tcmalloc allocation somewhere near 25k elements as of 3fdd83c2c58327d9714fa2347c724b78d7c05e2b + 1, likely linked to the extreme inefficiency of the stats collection?

The cache sizes were chosen to match the host <<p51>> to improve the comparison. Ideally we should also use the same standard library.

Note that this will take a long time, and will produce a humongous ~40GB stats file, as explained at: xref:gem5-only-dump-selected-stats[xrefstyle=full]

Sources:

* link:userland/cpp/bst_vs_heap_vs_hashmap.cpp[]
* link:bst-vs-heap-vs-hashmap-gem5-stats[]
* link:bst-vs-heap-vs-hashmap.gnuplot[]

===== BLAS

Buildroot supports it, which makes everything trivial:

....
./build-buildroot --config 'BR2_PACKAGE_OPENBLAS=y'
./build-userland --package openblas -- userland/libs/openblas/hello.c
./run --eval-after './libs/openblas/hello.out; echo $?'
....

Outcome: the test passes:

....
0
....

Source: link:userland/libs/openblas/hello.c[]

The test performs a general matrix multiplication:

....
    | 1.0 -3.0 |   |  1.0 2.0  1.0 |       | 0.5 0.5 0.5 |   | 11.0 - 9.0  5.0 |
1 * | 2.0  4.0 | * | -3.0 4.0 -1.0 | + 2 * | 0.5 0.5 0.5 | = | - 9.0 21.0 -1.0 |
    | 1.0 -1.0 |                           | 0.5 0.5 0.5 |   |  5.0 - 1.0  3.0 |
....

This can be deduced from the Fortran interfaces at:

....
less "$(./getvar buildroot_build_build_dir)"/openblas-*/reference/dgemmf.f
....

which we can map to our call as:

....
C := alpha*op( A )*op( B ) + beta*C,
SUBROUTINE DGEMMF( TRANA, TRANB, M,N,K, ALPHA,A,LDA,B,LDB,BETA,C,LDC)
cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans,3,3,2 ,1, A,3, B,3, 2 ,C,3 );
....

===== Eigen

Header-only linear algebra library with a mainline Buildroot package:

....
./build-buildroot --config 'BR2_PACKAGE_EIGEN=y'
./build-userland --package eigen -- userland/libs/eigen/hello.cpp
....

The example just creates a matrix and prints it:

....
./run --eval-after './libs/eigen/hello.out'
....

Output:

....
  3  -1
2.5 1.5
....

Source: link:userland/libs/eigen/hello.cpp[]

Tested on: https://github.com/cirosantilli/linux-kernel-module-cheat/commit/a4bdcf102c068762bb1ef26c591fcf71e5907525[a4bdcf102c068762bb1ef26c591fcf71e5907525]

===== PARSEC benchmark

We have ported parts of the http://parsec.cs.princeton.edu[PARSEC benchmark] for cross compilation at: https://github.com/cirosantilli/parsec-benchmark See the documentation on that repo to find out which benchmarks have been ported. Some of the benchmarks are segfaulting; they are documented in that repo.

There are two ways to run PARSEC with this repo:

* <<parsec-benchmark-without-parsecmgmt,without `parsecmgmt`>>, most likely what you want
* <<parsec-benchmark-with-parsecmgmt,with `parsecmgmt`>>

====== PARSEC benchmark without parsecmgmt

....
./build --arch arm --download-dependencies gem5-buildroot parsec-benchmark
./build-buildroot --arch arm --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y'
./run --arch arm --emulator gem5
....

Once inside the guest, launch one of the `test` input sized benchmarks manually, as in:

....
cd /parsec/ext/splash2x/apps/fmm/run
../inst/arm-linux.gcc/bin/fmm 1 < input_1
....

To find out how to run many of the benchmarks, have a look at the `test.sh` script of the `parsec-benchmark` repo.

From the guest, you can also run it as:

....
cd /parsec
./test.sh
....

but this might be a bit time consuming in gem5.

====== PARSEC change the input size

Running a benchmark of a size different than `test`, e.g. `simsmall`, requires a rebuild with:

....
./build-buildroot \
  --arch arm \
  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_INPUT_SIZE="simsmall"' \
  -- parsec_benchmark-reconfigure \
;
....

Large inputs may also require tweaking:

* <<br2-target-rootfs-ext2-size>>, if the unpacked inputs are large
* <<memory-size>>, unless you want to meet the OOM killer, which is admittedly kind of fun

`test.sh` only contains the run commands for the `test` size, and cannot be used for `simsmall`.

The easiest thing to do is to https://superuser.com/questions/231002/how-can-i-search-within-the-output-buffer-of-a-tmux-shell/1253137#1253137[scroll up on the host shell] after the build, and look for a line of the type:

....
Running /root/linux-kernel-module-cheat/out/aarch64/buildroot/build/parsec-benchmark-custom/ext/splash2x/apps/ocean_ncp/inst/aarch64-linux.gcc/bin/ocean_ncp -n2050 -p1 -e1e-07 -r20000 -t28800
....

and then tweak the command found in `test.sh` accordingly.

Yes, we do run the benchmarks on the host just to unpack / generate the inputs. They are expected to fail to run, since they were built for the guest instead of the host, including for the x86_64 guest, which has a different interpreter than the host's (see `file myexecutable`).

The rebuild is required because we unpack the input files on the host.

Separating input sizes also allows creating smaller images when only running the smaller benchmarks.

This limitation exists because `parsecmgmt` generates the input files just before running via the Bash scripts, but we can't run `parsecmgmt` on gem5, as it is too slow!

One option would be to do that inside the guest with QEMU.

Also, we can't generate all input sizes at once, because many of them have the same name and would overwrite one another...

PARSEC simply wasn't designed with non-native machines in mind...

====== PARSEC benchmark with parsecmgmt

Most users won't want to use this method because:

* running the `parsecmgmt` Bash scripts takes forever before it ever starts running the actual benchmarks on gem5
+
Running on QEMU is feasible, but not the main use case, since QEMU cannot be used for performance measurements.
* it requires putting the full `.tar` inputs on the guest, which makes the image twice as large (1x for the `.tar`, 1x for the unpacked input files)

It would be awesome if it were possible to use this method, since this is what PARSEC supports officially, and so:

* you don't have to dig into what raw command to run
* there is an easy way to run all the benchmarks in one go to test them out
* you can just run any of the benchmarks that you want

but it simply is not feasible in gem5 because it takes too long.

If you still want to run this, try it out with:

....
./build-buildroot \
  --arch aarch64 \
  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_PARSECMGMT=y' \
  --config 'BR2_TARGET_ROOTFS_EXT2_SIZE="3G"' \
  -- parsec_benchmark-reconfigure \
;
....

And then you can run it just as you would on the host:

....
cd /parsec/
bash
. env.sh
parsecmgmt -a run -p splash2x.fmm -i test
....

====== PARSEC uninstall

If you want to remove PARSEC later, Buildroot doesn't provide an automated package removal mechanism, as mentioned at: xref:remove-buildroot-packages[xrefstyle=full], but the following procedure should be satisfactory:

....
rm -rf \
  "$(./getvar buildroot_download_dir)"/parsec-* \
  "$(./getvar buildroot_build_dir)"/build/parsec-* \
  "$(./getvar buildroot_build_dir)"/build/packages-file-list.txt \
  "$(./getvar buildroot_build_dir)"/images/rootfs.* \
  "$(./getvar buildroot_build_dir)"/target/parsec-* \
;
./build-buildroot --arch arm
....

====== PARSEC benchmark hacking

If you end up going inside link:submodules/parsec-benchmark[] to hack up the benchmark (you will!), these tips will be helpful.

Buildroot was not designed to deal with large images, and currently cross rebuilds are a bit slow, due to some image generation and validation steps.

A few workarounds are:

* develop on the host first as much as you can. Our PARSEC fork supports it.
+
If you do this, don't forget to do a:
+
....
cd "$(./getvar parsec_source_dir)"
git clean -xdf .
....
+
before going for the cross compile build.
* patch Buildroot to work well, and keep cross compiling all the way. This should be totally viable, and we should do it.
+
Don't forget to explicitly rebuild PARSEC with:
+
....
./build-buildroot \
  --arch arm \
  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
  -- parsec_benchmark-reconfigure \
;
....
+
You may also want to test if your patches are still functionally correct inside of QEMU first, which is a faster emulator.
* sell your soul, and compile natively inside the guest. We won't do this, not only because it is evil, but also because Buildroot explicitly does not support it: https://buildroot.org/downloads/manual/manual.html#faq-no-compiler-on-target ARM employees have been known to do this: https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/qemu-patch.diff

=== gem5 kernel command line parameters

Analogous <<kernel-command-line-parameters,to QEMU>>:

@@ -14209,9 +13834,7 @@ Example: link:userland/c/memory_leak.c[]

Maybe some day someone will use this setup to study the performance of interpreters:

* <<node-js>>

=== Node.js
==== Node.js

Parent section: <<interpreted-languages>>.

@@ -14237,6 +13860,444 @@ Examples:

** link:rootfs_overlay/lkmc/nodejs/file_write_read.js[]
** link:rootfs_overlay/lkmc/nodejs/read_stdin_to_string.js[] Question: https://stackoverflow.com/questions/30441025/read-all-text-from-stdin-to-a-string

=== Algorithms

link:userland/algorithm[]

This is still a work in progress and needs better automation, but it is already a good sketch. The idea was originally started at: https://github.com/cirosantilli/algorithm-cheat

The key idea is that input / output pairs are present in human readable files, generated either:

* manually, for small test inputs
* with a Python script, for larger randomized tests

Test programs then:

* read input from stdin
* produce output to stdout

so that we can compare the output to the expected one.

This way, tests can be reused across several implementations in different languages, emulating the many multi-language programming competition websites out there.

For example, for a <<userland-setup-getting-started-natively,native run>> we can run a set / sorting test:

....
cd userland/algorithm/set
./build

# Run with a small hand written test.
./std_set.out < test_data/8.i > tmp.raw

# Extract the output from the sorted stdout, which also
# contained some timing information.
./parse_output output < tmp.raw > tmp.o

# Compare the output to the expected one.
cmp tmp.o test_data/8.e

# Same but now with a large randomly generated input.
./generate_io
./std_set.out < tmp.i | ./parse_output output > tmp.o
cmp tmp.o tmp.e
....
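
The same contract can be mimicked in a few lines of Python, which is handy for prototyping a new test before wiring it into the harness. This is only an illustrative stand-in for the scripts above, not the repo's actual code:

```python
# Stand-in for the test contract above: generate a random input and its
# expected output, "solve" by sorting, then compare, mirroring the
# generate / run / cmp pipeline. Names here are invented for the sketch.
import random

def generate_io(n, hi=10**6, seed=0):
    """Return (input_lines, expected_output_lines), like a generator script."""
    rng = random.Random(seed)
    inp = [rng.randrange(hi) for _ in range(n)]
    return [str(x) for x in inp], [str(x) for x in sorted(inp)]

def solve(input_lines):
    """The 'program under test': read numbers, output them sorted."""
    return [str(x) for x in sorted(int(line) for line in input_lines)]

inp, expected = generate_io(1000)
assert solve(inp) == expected  # the cmp step
```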

Sources:

* link:userland/algorithm/set/generate_input[]
* link:userland/algorithm/set/main.hpp[]
* link:userland/algorithm/set/parse_output[]
* link:userland/algorithm/set/std_set.cpp[]
* link:userland/algorithm/set/test_data/8.e[]
* link:userland/algorithm/set/test_data/8.i[]

link:userland/algorithm/set/parse_output[] is needed because timing instrumentation measurements must be embedded in the program itself, to allow:

* discounting the input reading / output writing operations from the actual "read / write to / from memory" algorithm itself
* measuring the evolution of the benchmark midway, e.g. to see how the current container size affects insertion time: <<bst-vs-heap-vs-hashmap>>
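
A minimal sketch of that separation: the benchmark interleaves tagged lines on stdout, and a `parse_output`-style filter keeps one tag's payload, so timings never pollute the diffable output. The tag names here are invented for the illustration; check the actual link:userland/algorithm/set/parse_output[] script for the real format:

```python
# Sketch: a benchmark prints tagged lines, mixing payload with timings:
#   output 13
#   time 1.5e-06
# and a parse_output-like filter extracts the lines of a single tag.
# Tag names are invented for this sketch.
def filter_tag(lines, tag):
    selected = []
    for line in lines:
        key, _, rest = line.partition(' ')
        if key == tag:
            selected.append(rest)
    return selected

mixed = ['output 13', 'time 1.5e-06', 'output 42', 'time 2.0e-06']
assert filter_tag(mixed, 'output') == ['13', '42']
assert filter_tag(mixed, 'time') == ['1.5e-06', '2.0e-06']
```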

The following are also interesting Buildroot libraries that we could benchmark:

* Armadillo `C++`: linear algebra
* fftw: Fourier transform
* Flann
* GSL: various
* liblinear
* libspatialindex
* libtommath
* qhull

These are good targets for <<gem5-run-benchmark,performance analysis with gem5>>, and there is some overlap between this section and <<benchmarks>>.

==== BST vs heap vs hashmap

TODO: move benchmark graph from link:userland/cpp/bst_vs_heap_vs_hashmap.cpp[] to link:userland/algorithm/set[].

The following benchmark setup works both:

* on the host, through timers + a https://stackoverflow.com/questions/51952471/why-do-i-get-a-constant-instead-of-logarithmic-curve-for-an-insert-time-benchmar/51953081#51953081[granule]
* on gem5, with <<m5ops-instructions,dumpstats>>, which can get more precise results with `granule == 1`
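
The granule trick from the linked answer can be sketched as follows: time each batch of `granule` insertions instead of each single insertion, so that the timer overhead amortizes. This is a pure-Python stand-in for the C++ benchmark, with an arbitrary container and sizes:

```python
# Sketch of granule-based timing: timing a single insertion mostly measures
# the clock call itself, so we time batches of `granule` insertions and
# emit one (container_size, mean_seconds_per_insert) point per batch.
import time

def bench_insert(n, granule):
    container = set()  # stand-in for std::set / heap / hashmap
    points = []
    for start in range(0, n, granule):
        t0 = time.perf_counter()
        for i in range(start, start + granule):
            container.add(i)
        t1 = time.perf_counter()
        points.append((len(container), (t1 - t0) / granule))
    return points

points = bench_insert(100_000, granule=1_000)
```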

It has been used to answer:

* BST vs heap: https://stackoverflow.com/questions/6147243/heap-vs-binary-search-tree-bst/29548834#29548834
* `std::set`: https://stackoverflow.com/questions/2558153/what-is-the-underlying-data-structure-of-a-stl-set-in-c/51944661#51944661
* `std::map`: https://stackoverflow.com/questions/18414579/what-data-structure-is-inside-stdmap-in-c/51945119#51945119

To benchmark on the host, we do:

....
./build-userland-in-tree \
  --force-rebuild \
  --optimization-level 3 \
  ./userland/cpp/bst_vs_heap_vs_hashmap.cpp \
;
./userland/cpp/bst_vs_heap_vs_hashmap.out 10000000 10000 0 | tee bst_vs_heap_vs_hashmap.dat
gnuplot \
  -e 'input_noext="bst_vs_heap_vs_hashmap"' \
  -e 'heap_zoom_max=50' \
  -e 'hashmap_zoom_max=400' \
  ./bst-vs-heap-vs-hashmap.gnuplot \
;
xdg-open bst_vs_heap_vs_hashmap.tmp.png
....

The parameters `heap_zoom_max` and `hashmap_zoom_max` are chosen manually and interactively to best showcase the regions of interest in those plots.

To benchmark on gem5, we first build the benchmark with <<m5ops-instructions>> enabled, and then we run it and extract the stats:

....
./build-userland \
  --arch x86_64 \
  --ccflags='-DLKMC_M5OPS_ENABLE=1' \
  --force-rebuild userland/cpp/bst_vs_heap_vs_hashmap.cpp \
  --static \
  --optimization-level 3 \
;
./run \
  --arch x86_64 \
  --emulator gem5 \
  --static \
  --userland userland/cpp/bst_vs_heap_vs_hashmap.cpp \
  --userland-args='100000 1 0' \
  -- \
  --cpu-type=DerivO3CPU \
  --caches \
  --l2cache \
  --l1d_size=32kB \
  --l1i_size=32kB \
  --l2_size=256kB \
  --l3_size=20MB \
;
./bst-vs-heap-vs-hashmap-gem5-stats --arch x86_64 | tee bst_vs_heap_vs_hashmap_gem5.dat
gnuplot \
  -e 'input_noext="bst_vs_heap_vs_hashmap_gem5"' \
  -e 'heap_zoom_max=500' \
  -e 'hashmap_zoom_max=400' \
  ./bst-vs-heap-vs-hashmap.gnuplot \
;
xdg-open bst_vs_heap_vs_hashmap_gem5.tmp.png
....

TODO: the gem5 simulation blows up on a tcmalloc allocation somewhere near 25k elements as of 3fdd83c2c58327d9714fa2347c724b78d7c05e2b + 1, likely linked to the extreme inefficiency of the stats collection?

The cache sizes were chosen to match the host <<p51>> to improve the comparison. Ideally we should also use the same standard library.

Note that this will take a long time, and will produce a humongous ~40GB stats file, as explained at: xref:gem5-only-dump-selected-stats[xrefstyle=full]

Sources:

* link:userland/cpp/bst_vs_heap_vs_hashmap.cpp[]
* link:bst-vs-heap-vs-hashmap-gem5-stats[]
* link:bst-vs-heap-vs-hashmap.gnuplot[]

==== BLAS

Buildroot supports it, which makes everything trivial:

....
./build-buildroot --config 'BR2_PACKAGE_OPENBLAS=y'
./build-userland --package openblas -- userland/libs/openblas/hello.c
./run --eval-after './libs/openblas/hello.out; echo $?'
....

Outcome: the test passes:

....
0
....

Source: link:userland/libs/openblas/hello.c[]

The test performs a general matrix multiplication:

....
    | 1.0 -3.0 |   |  1.0 2.0  1.0 |       | 0.5 0.5 0.5 |   | 11.0 - 9.0  5.0 |
1 * | 2.0  4.0 | * | -3.0 4.0 -1.0 | + 2 * | 0.5 0.5 0.5 | = | - 9.0 21.0 -1.0 |
    | 1.0 -1.0 |                           | 0.5 0.5 0.5 |   |  5.0 - 1.0  3.0 |
....

This can be deduced from the Fortran interfaces at:

....
less "$(./getvar buildroot_build_build_dir)"/openblas-*/reference/dgemmf.f
....

which we can map to our call as:

....
C := alpha*op( A )*op( B ) + beta*C,
SUBROUTINE DGEMMF( TRANA, TRANB, M,N,K, ALPHA,A,LDA,B,LDB,BETA,C,LDC)
cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans,3,3,2 ,1, A,3, B,3, 2 ,C,3 );
....
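
The arithmetic in that worked example is easy to double check in pure Python, no BLAS needed; this is just a verification of the matrices shown above, with `alpha = 1`, `beta = 2` and `B` passed transposed, as in the `cblas_dgemm` call:

```python
# Verify the dgemm example by hand: C = alpha * A * B^T + beta * C0.
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

A = [[1.0, -3.0], [2.0, 4.0], [1.0, -1.0]]   # 3x2
BT = [[1.0, 2.0, 1.0], [-3.0, 4.0, -1.0]]    # B transposed: 2x3
C0 = [[0.5] * 3 for _ in range(3)]           # 3x3, all 0.5
alpha, beta = 1.0, 2.0

AB = matmul(A, BT)
C = [[alpha * AB[i][j] + beta * C0[i][j] for j in range(3)] for i in range(3)]
assert C == [[11.0, -9.0, 5.0], [-9.0, 21.0, -1.0], [5.0, -1.0, 3.0]]
```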

==== Eigen

Header-only linear algebra library with a mainline Buildroot package:

....
./build-buildroot --config 'BR2_PACKAGE_EIGEN=y'
./build-userland --package eigen -- userland/libs/eigen/hello.cpp
....

The example just creates a matrix and prints it:

....
./run --eval-after './libs/eigen/hello.out'
....

Output:

....
  3  -1
2.5 1.5
....

Source: link:userland/libs/eigen/hello.cpp[]

Tested on: https://github.com/cirosantilli/linux-kernel-module-cheat/commit/a4bdcf102c068762bb1ef26c591fcf71e5907525[a4bdcf102c068762bb1ef26c591fcf71e5907525]

=== Benchmarks

These are good targets for <<gem5-run-benchmark,performance analysis with gem5>>.

TODO: also consider the following:

* https://github.com/kozyraki/stamp transactional memory benchmarks

==== Dhrystone

https://en.wikipedia.org/wiki/Dhrystone

Created in the 1980s, it is no longer a representative measure of performance on modern computers. It has mostly been replaced by https://en.wikipedia.org/wiki/SPECint[SPEC], which is... closed source! Unbelievable.
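
For reference on how Dhrystone results are usually quoted: the raw score is Dhrystones per second, conventionally normalized to DMIPS by dividing by 1757, the score of the VAX 11/780, which is defined as 1 MIPS. A tiny helper, with the example numbers being hypothetical:

```python
# Convert a raw Dhrystone score (Dhrystones per second) to DMIPS.
# 1757 Dhrystones/s is the VAX 11/780 reference score (1 DMIPS by definition);
# DMIPS/MHz divides that by the clock to compare across frequencies.
VAX_11_780_DHRYSTONES_PER_S = 1757.0

def dmips(dhrystones_per_second):
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_S

def dmips_per_mhz(dhrystones_per_second, clock_mhz):
    return dmips(dhrystones_per_second) / clock_mhz

# Hypothetical example: 8785000 Dhrystones/s at 1000 MHz.
print(dmips(8785000.0))                  # 5000.0
print(dmips_per_mhz(8785000.0, 1000.0))  # 5.0
```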

<<buildroot>> has a `dhrystone` package, but because it is so interesting to us, we decided to also build it ourselves, which more easily allows things like static and baremetal compilation.

Build and run on QEMU <<user-mode-simulation>>:

....
git submodule update --init submodules/dhrystone
./build-dhrystone --mode userland
./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
....

Build and run on gem5 user mode:

....
./build-dhrystone --mode userland --static --force-rebuild
./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"
....

TODO: automate the run more nicely.

Build for <<baremetal>> execution and run it in baremetal QEMU:

....
# Build our Newlib stubs.
./build-baremetal --arch aarch64
./build-dhrystone --arch aarch64 --mode baremetal
./run --arch aarch64 --baremetal "$(./getvar baremetal_build_dir)/submodules/dhrystone/dhrystone"
....

TODO: fix the build; we just need to factor out all run arguments from link:build-baremetal[] into link:common.py[], and then it should just work with no missing syscalls.

If you really want the Buildroot package for some reason, build it with:

....
./build-buildroot --config 'BR2_PACKAGE_DHRYSTONE=y'
....

and run inside the guest from `PATH` with:

....
dhrystone
....

==== PARSEC benchmark

We have ported parts of the http://parsec.cs.princeton.edu[PARSEC benchmark] for cross compilation at: https://github.com/cirosantilli/parsec-benchmark See the documentation on that repo to find out which benchmarks have been ported. Some of the benchmarks are segfaulting; they are documented in that repo.

There are two ways to run PARSEC with this repo:

* <<parsec-benchmark-without-parsecmgmt,without `parsecmgmt`>>, most likely what you want
* <<parsec-benchmark-with-parsecmgmt,with `parsecmgmt`>>

===== PARSEC benchmark without parsecmgmt

....
./build --arch arm --download-dependencies gem5-buildroot parsec-benchmark
./build-buildroot --arch arm --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y'
./run --arch arm --emulator gem5
....

Once inside the guest, launch one of the `test` input sized benchmarks manually, as in:

....
cd /parsec/ext/splash2x/apps/fmm/run
../inst/arm-linux.gcc/bin/fmm 1 < input_1
....

To find out how to run many of the benchmarks, have a look at the `test.sh` script of the `parsec-benchmark` repo.

From the guest, you can also run it as:

....
cd /parsec
./test.sh
....

but this might be a bit time consuming in gem5.

===== PARSEC change the input size

Running a benchmark of a size different than `test`, e.g. `simsmall`, requires a rebuild with:

....
./build-buildroot \
  --arch arm \
  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_INPUT_SIZE="simsmall"' \
  -- parsec_benchmark-reconfigure \
;
....

Large inputs may also require tweaking:

* <<br2-target-rootfs-ext2-size>>, if the unpacked inputs are large
* <<memory-size>>, unless you want to meet the OOM killer, which is admittedly kind of fun

`test.sh` only contains the run commands for the `test` size, and cannot be used for `simsmall`.

The easiest thing to do is to https://superuser.com/questions/231002/how-can-i-search-within-the-output-buffer-of-a-tmux-shell/1253137#1253137[scroll up on the host shell] after the build, and look for a line of the type:

....
Running /root/linux-kernel-module-cheat/out/aarch64/buildroot/build/parsec-benchmark-custom/ext/splash2x/apps/ocean_ncp/inst/aarch64-linux.gcc/bin/ocean_ncp -n2050 -p1 -e1e-07 -r20000 -t28800
....

and then tweak the command found in `test.sh` accordingly.

Yes, we do run the benchmarks on the host just to unpack / generate the inputs. They are expected to fail to run, since they were built for the guest instead of the host, including for the x86_64 guest, which has a different interpreter than the host's (see `file myexecutable`).

The rebuild is required because we unpack the input files on the host.

Separating input sizes also allows creating smaller images when only running the smaller benchmarks.

This limitation exists because `parsecmgmt` generates the input files just before running via the Bash scripts, but we can't run `parsecmgmt` on gem5, as it is too slow!

One option would be to do that inside the guest with QEMU.

Also, we can't generate all input sizes at once, because many of them have the same name and would overwrite one another...

PARSEC simply wasn't designed with non-native machines in mind...

===== PARSEC benchmark with parsecmgmt

Most users won't want to use this method because:

* running the `parsecmgmt` Bash scripts takes forever before it ever starts running the actual benchmarks on gem5
+
Running on QEMU is feasible, but not the main use case, since QEMU cannot be used for performance measurements.
* it requires putting the full `.tar` inputs on the guest, which makes the image twice as large (1x for the `.tar`, 1x for the unpacked input files)

It would be awesome if it were possible to use this method, since this is what PARSEC supports officially, and so:

* you don't have to dig into what raw command to run
* there is an easy way to run all the benchmarks in one go to test them out
* you can just run any of the benchmarks that you want

but it simply is not feasible in gem5 because it takes too long.

If you still want to run this, try it out with:

....
./build-buildroot \
  --arch aarch64 \
  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_PARSECMGMT=y' \
  --config 'BR2_TARGET_ROOTFS_EXT2_SIZE="3G"' \
  -- parsec_benchmark-reconfigure \
;
....

And then you can run it just as you would on the host:

....
cd /parsec/
bash
. env.sh
parsecmgmt -a run -p splash2x.fmm -i test
....
===== PARSEC uninstall
|
||||
|
||||
If you want to remove PARSEC later, Buildroot doesn't provide an automated package removal mechanism as mentioned at: xref:remove-buildroot-packages[xrefstyle=full], but the following procedure should be satisfactory:
|
||||
|
||||
....
|
||||
rm -rf \
|
||||
"$(./getvar buildroot_download_dir)"/parsec-* \
|
||||
"$(./getvar buildroot_build_dir)"/build/parsec-* \
|
||||
"$(./getvar buildroot_build_dir)"/build/packages-file-list.txt \
|
||||
"$(./getvar buildroot_build_dir)"/images/rootfs.* \
|
||||
"$(./getvar buildroot_build_dir)"/target/parsec-* \
|
||||
;
|
||||
./build-buildroot --arch arm
|
||||
....
|
||||
|
||||
===== PARSEC benchmark hacking
|
||||
|
||||
If you end up going inside link:submodules/parsec-benchmark[] to hack up the benchmark (you will!), these tips will be helpful.
|
||||
|
||||
Buildroot was not designed to deal with large images, and currently cross rebuilds are a bit slow, due to some image generation and validation steps.
|
||||
|
||||
A few workarounds are:
|
||||
|
||||
* develop in host first as much as you can. Our PARSEC fork supports it.
|
||||
+
|
||||
If you do this, don't forget to do a:
|
||||
+
|
||||
....
|
||||
cd "$(./getvar parsec_source_dir)"
|
||||
git clean -xdf .
|
||||
....
|
||||
before going for the cross compile build.
|
||||
+
|
||||
* patch Buildroot to work well, and keep cross compiling all the way. This should be totally viable, and we should do it.
|
||||
+
|
||||
Don't forget to explicitly rebuild PARSEC with:
|
||||
+
|
||||
....
|
||||
./build-buildroot \
|
||||
--arch arm \
|
||||
--config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
|
||||
-- parsec_benchmark-reconfigure \
|
||||
;
|
||||
....
|
||||
+
|
||||
You may also want to test if your patches are still functionally correct inside of QEMU first, which is a faster emulator.
|
||||
* sell your soul, and compile natively inside the guest. We won't do this, not only because it is evil, but also because Buildroot explicitly does not support it: https://buildroot.org/downloads/manual/manual.html#faq-no-compiler-on-target ARM employees have been known to do this: https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/qemu-patch.diff
|
||||
|
||||
=== Userland content bibliography
|
||||
|
||||
* The Linux Programming Interface by Michael Kerrisk https://www.amazon.co.uk/Linux-Programming-Interface-System-Handbook/dp/1593272200 Lots of open source POSIX examples: https://github.com/cirosantilli/linux-programming-interface-kerrisk
|
||||
|
||||
common.py
@@ -1941,3 +1941,36 @@ class TestCliFunction(LkmcCliFunction):
            self.log_error('A test failed')
            return 1
        return 0

# IO format.

class LkmcList(list):
    '''
    list with a lightweight serialization format for algorithm IO.
    '''
    def __init__(self, *args, **kwargs):
        if 'oneline' in kwargs:
            self.oneline = kwargs['oneline']
            del kwargs['oneline']
        else:
            self.oneline = False
        super().__init__(*args, **kwargs)
    def __str__(self):
        if self.oneline:
            sep = ' '
        else:
            sep = '\n'
        return sep.join([str(item) for item in self])

class LkmcOrderedDict(collections.OrderedDict):
    '''
    dict with a lightweight serialization format for algorithm IO.
    '''
    def __str__(self):
        out = []
        for key in self:
            out.extend([
                str(key),
                str(self[key]) + '\n',
            ])
        return '\n'.join(out)
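As an illustration of the serialization format these helpers define, here is a standalone sketch (a reimplementation for demonstration purposes, not the actual `common.py` code):

```python
import collections

# Minimal standalone sketch of the LkmcList / LkmcOrderedDict serialization
# format: lists join items with newlines (or spaces when oneline=True), and
# dicts emit "key\nvalues\n" blocks separated by blank lines.
class LkmcList(list):
    def __init__(self, *args, **kwargs):
        self.oneline = kwargs.pop('oneline', False)
        super().__init__(*args, **kwargs)
    def __str__(self):
        sep = ' ' if self.oneline else '\n'
        return sep.join(str(item) for item in self)

class LkmcOrderedDict(collections.OrderedDict):
    def __str__(self):
        out = []
        for key in self:
            out.extend([str(key), str(self[key]) + '\n'])
        return '\n'.join(out)

data = LkmcOrderedDict()
data['output'] = LkmcList([0, 1, 2])
data['times'] = LkmcList([10, 20], oneline=True)
print(str(data))
```

This prints each section name on its own line, followed by its values, followed by a blank line, which is the same shape the benchmark tooling below reads back.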
userland/algorithm/set/generate_io (new executable file)
@@ -0,0 +1,30 @@
#!/usr/bin/env python3

import argparse
import random
import sys
import os

sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))))
import common

# Handle CLI arguments.
parser = argparse.ArgumentParser()
parser.add_argument('--min', type=int, default=0)
parser.add_argument('--max', type=int, default=(2**32 - 1))
parser.add_argument('--seed', type=int)
parser.add_argument('--size', type=int, default=1000000)
# Note: argparse's type=bool treats any non-empty string as True,
# so parse truthiness explicitly.
parser.add_argument('--unique', default=True,
        type=lambda s: s.lower() not in ('0', 'false', 'no'),
        help='if True, remove duplicates from the expected output')
args = parser.parse_args()
random.seed(args.seed)

# Generate the random input and write it to tmp.i.
input_data = common.LkmcList()
for i in range(args.size):
    input_data.append(random.randint(args.min, args.max))
with open('tmp.i', 'w') as f:
    f.write(str(input_data) + '\n')

# Write the expected (sorted, optionally deduplicated) output to tmp.e.
if args.unique:
    input_data = common.LkmcList(set(input_data))
input_data.sort()
with open('tmp.e', 'w') as e:
    e.write(str(input_data) + '\n')
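The resulting `tmp.i` / `tmp.e` pair can be sketched in miniature as follows (standalone, without the `common` module; the two file bodies are built as strings rather than written to disk):

```python
import random

# Miniature sketch of what generate_io produces: tmp.i holds the raw random
# input, one integer per line; tmp.e holds the expected output, i.e. the
# same values sorted and (with --unique) deduplicated.
random.seed(42)
size, lo, hi = 8, 0, 9
values = [random.randint(lo, hi) for _ in range(size)]

input_body = '\n'.join(str(v) for v in values) + '\n'       # contents of tmp.i
expected = sorted(set(values))                              # --unique behavior
expected_body = '\n'.join(str(v) for v in expected) + '\n'  # contents of tmp.e

print(input_body)
print(expected_body)
```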
@@ -20,7 +20,7 @@
 int main(int argc, char **argv) {
     typedef uint64_t T;
 #if LKMC_ALGORITHM_SET_STD_PRIORITY_QUEUE
-    std::priority_queue<T> set;
+    std::priority_queue<T, std::vector<T>, std::greater<int>> set;
 #endif
 #if LKMC_ALGORITHM_SET_STD_SET
     std::set<T> set;
@@ -28,9 +28,8 @@ int main(int argc, char **argv) {
 #if LKMC_ALGORITHM_SET_STD_UNORDERED_SET
     std::unordered_set<T> set;
 #endif
-    std::vector<T> randoms;
+    std::vector<T> input;
     size_t i, j = 0, n, granule, base;
-    unsigned int seed;
 #ifndef LKMC_M5OPS_ENABLE
     std::vector<std::chrono::nanoseconds::rep> dts;
     std::vector<decltype(base)> bases;
@@ -38,26 +37,21 @@ int main(int argc, char **argv) {

     // CLI arguments.
     if (argc > 1) {
-        n = std::stoi(argv[1]);
-    } else {
-        n = 10;
-    }
-    if (argc > 2) {
-        granule = std::stoi(argv[2]);
+        granule = std::stoi(argv[1]);
     } else {
         granule = 1;
     }
-    if (argc > 3) {
-        seed = std::stoi(argv[3]);
-    } else {
-        seed = std::random_device()();
-    }
+
+    // Read input from stdin.
+    std::string str;
+    while (std::getline(std::cin, str)) {
+        if (str == "")
+            break;
+        input.push_back(std::stoll(str));
+    }
+    n = input.size();

     // Action.
-    for (i = 0; i < n; ++i) {
-        randoms.push_back(i);
-    }
-    std::shuffle(randoms.begin(), randoms.end(), std::mt19937(seed));
     for (i = 0; i < n / granule; ++i) {
 #ifndef LKMC_M5OPS_ENABLE
         using clk = std::chrono::high_resolution_clock;
@@ -71,9 +65,9 @@ int main(int argc, char **argv) {
         for (j = 0; j < granule; ++j) {
 #endif
 #if LKMC_ALGORITHM_SET_STD_PRIORITY_QUEUE
-            set.emplace(randoms[base + j]);
+            set.emplace(input[base + j]);
 #else
-            set.insert(randoms[base + j]);
+            set.insert(input[base + j]);
 #endif
 #ifdef LKMC_M5OPS_ENABLE
             LKMC_M5OPS_DUMPSTATS;
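The granule-batched timing that the `bases` / `dts` vectors implement can be sketched in Python as follows (a standalone illustration, not project code):

```python
import time

def timed_batches(items, granule=2):
    """Insert items into a set in batches of `granule` elements, recording
    (batch_start_index, elapsed_ns) per batch, mirroring the roles of the
    C++ `bases` and `dts` vectors."""
    s = set()
    results = []
    for base in range(0, (len(items) // granule) * granule, granule):
        t0 = time.monotonic_ns()
        for j in range(granule):
            s.add(items[base + j])
        results.append((base, time.monotonic_ns() - t0))
    return s, results

s, results = timed_batches([4, 5, 6, 2, 1, 3, 0, 7], granule=2)
print(sorted(s))
print([base for base, dt in results])
```

Batching by granule amortizes the timing overhead over several insertions, which matters when a single insertion is too fast to measure reliably.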
@@ -87,8 +81,29 @@ int main(int argc, char **argv) {
     }

+    // Report results.
+    std::cout << "output" << std::endl;
+#if LKMC_ALGORITHM_SET_STD_PRIORITY_QUEUE
+    while (!set.empty()) {
+        std::cout << set.top() << std::endl;
+        set.pop();
+    }
+    //T last_val = set.top();
+    //std::cout << last_val << std::endl;
+    //set.pop();
+    //while (!set.empty()) {
+    //    const auto& val = set.top();
+    //    if (val != last_val)
+    //        std::cout << val << std::endl;
+    //    last_val = val;
+    //    set.pop();
+    //}
+#else
+    for (const auto& item : set) {
+        std::cout << item << std::endl;
+    }
+#endif
+    std::cout << std::endl;
 #ifndef LKMC_M5OPS_ENABLE
     // Output.
     std::cout << "times" << std::endl;
     auto bases_it = bases.begin();
     auto dts_it = dts.begin();
@@ -99,17 +114,5 @@ int main(int argc, char **argv) {
         bases_it++;
         dts_it++;
     }
     std::cout << std::endl;
-    std::cout << "output" << std::endl;
-#if LKMC_ALGORITHM_SET_STD_PRIORITY_QUEUE
-    while (!set.empty()) {
-        std::cout << set.top() << std::endl;
-        set.pop();
-    }
-#else
-    for (const auto& item : set) {
-        std::cout << item << std::endl;
-    }
-#endif
 #endif
 }
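Functionally, the benchmark's new stdin contract can be cross-checked against a tiny Python reference (a sketch of the expected behavior; the `times` section and M5OPS instrumentation are omitted):

```python
def run_set_reference(input_text):
    """Reference for the functional part of the set benchmark: read one
    integer per line until a blank line, insert into a set, then emit the
    'output' section (sorted unique values, matching std::set iteration
    order) terminated by a blank line."""
    values = set()
    for line in input_text.split('\n'):
        if line == '':
            break
        values.add(int(line))
    out_lines = ['output']
    out_lines.extend(str(v) for v in sorted(values))
    out_lines.append('')
    return '\n'.join(out_lines) + '\n'

# Same input as test_data/3.i below.
print(run_set_reference('1\n2\n0\n'))
```

This is the property the `test_data/*.i` / `test_data/*.e` pairs below encode: each `.e` file is the sorted deduplicated version of its `.i` file.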
userland/algorithm/set/parse_output (new executable file)
@@ -0,0 +1,40 @@
#!/usr/bin/env python3

import argparse
import collections
import sys
import os

sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))))
import common

data = common.LkmcOrderedDict()

# Parse the "output" section: skip the header line, then read one
# integer per line until a blank line.
output = common.LkmcList()
next(sys.stdin)
for line in sys.stdin:
    line = line.rstrip()
    if line == '':
        break
    output.append(int(line))
data['output'] = output

# Parse the "times" section the same way; each line is a
# space-separated list of integers.
times = common.LkmcList()
next(sys.stdin)
for line in sys.stdin:
    line = line.rstrip()
    if line == '':
        break
    times.append(common.LkmcList([int(i) for i in line.split(' ')], oneline=True))
data['times'] = times

# Handle CLI arguments.
parser = argparse.ArgumentParser()
parser.add_argument('key', nargs='?')
args = parser.parse_args()
if args.key:
    print(data[args.key])
else:
    print(data)
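The parsing logic can be sketched standalone (operating on a string instead of stdin, without the `common` module):

```python
def parse_sections(text):
    """Parse the benchmark's stdout: named sections ('output', 'times'),
    each a header line followed by value lines and terminated by a blank
    line. Returns {section_name: [value lines]}."""
    data = {}
    lines = iter(text.split('\n'))
    for header in lines:
        if header == '':
            continue  # tolerate trailing blank lines between sections
        body = []
        for line in lines:
            if line == '':
                break
            body.append(line)
        data[header] = body
    return data

parsed = parse_sections('output\n0\n1\n2\n\ntimes\n0 120\n2 95\n\n')
print(parsed['output'])
print(parsed['times'])
```

Unlike the script above, this sketch discovers section names from the stream rather than hard-coding the `output` / `times` order, which is a design choice, not project behavior.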
userland/algorithm/set/test_data/3.e (new file)
@@ -0,0 +1,3 @@
0
1
2

userland/algorithm/set/test_data/3.i (new file)
@@ -0,0 +1,3 @@
1
2
0

userland/algorithm/set/test_data/4.e (new file)
@@ -0,0 +1,4 @@
0
1
2
3

userland/algorithm/set/test_data/4.i (new file)
@@ -0,0 +1,4 @@
1
3
2
0

userland/algorithm/set/test_data/5.e (new file)
@@ -0,0 +1,5 @@
0
1
2
3
4

userland/algorithm/set/test_data/5.i (new file)
@@ -0,0 +1,5 @@
1
4
0
2
3

userland/algorithm/set/test_data/8.e (new file)
@@ -0,0 +1,8 @@
0
1
2
3
4
5
6
7

userland/algorithm/set/test_data/8.i (new file)
@@ -0,0 +1,8 @@
4
5
6
2
1
3
0
7