diff --git a/index.html b/index.html
index 5f97bd2..11e1664 100644
--- a/index.html
+++ b/index.html
@@ -686,18 +686,18 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
 </li>
 </ul>
 </li>
-<li><a href="#gem5-syscall-emulation-mode">10.6. gem5 syscall emulation mode</a>
+<li><a href="#syscall-emulation-mode-program-stdin">10.6. syscall emulation mode program stdin</a></li>
+<li><a href="#gem5-syscall-emulation-mode">10.7. gem5 syscall emulation mode</a>
 <ul class="sectlevel3">
-<li><a href="#gem5-syscall-emulation-exit-status">10.6.1. gem5 syscall emulation exit status</a></li>
-<li><a href="#gem5-syscall-emulation-mode-program-stdin">10.6.2. gem5 syscall emulation mode program stdin</a></li>
-<li><a href="#gem5-syscall-emulation-mode-syscall-tracing">10.6.3. gem5 syscall emulation mode syscall tracing</a></li>
+<li><a href="#gem5-syscall-emulation-exit-status">10.7.1. gem5 syscall emulation exit status</a></li>
+<li><a href="#gem5-syscall-emulation-mode-syscall-tracing">10.7.2. gem5 syscall emulation mode syscall tracing</a></li>
 </ul>
 </li>
-<li><a href="#qemu-user-mode-quirks">10.7. QEMU user mode quirks</a>
+<li><a href="#qemu-user-mode-quirks">10.8. QEMU user mode quirks</a>
 <ul class="sectlevel3">
-<li><a href="#qemu-user-mode-does-not-show-stdout-immediately">10.7.1. QEMU user mode does not show stdout immediately</a>
+<li><a href="#qemu-user-mode-does-not-show-stdout-immediately">10.8.1. QEMU user mode does not show stdout immediately</a>
 <ul class="sectlevel4">
-<li><a href="#qemu-user-mode-does-not-show-errors">10.7.1.1. QEMU user mode does not show errors</a></li>
+<li><a href="#qemu-user-mode-does-not-show-errors">10.8.1.1. QEMU user mode does not show errors</a></li>
 </ul>
 </li>
 </ul>
@@ -1104,23 +1104,6 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
 <li><a href="#gem5-clock-frequency">19.2.2.6. gem5 clock frequency</a></li>
 </ul>
 </li>
-<li><a href="#interesting-benchmarks">19.2.3. Interesting benchmarks</a>
-<ul class="sectlevel4">
-<li><a href="#dhrystone">19.2.3.1. Dhrystone</a></li>
-<li><a href="#bst-vs-heap-vs-hashmap">19.2.3.2. BST vs heap vs hashmap</a></li>
-<li><a href="#blas">19.2.3.3. BLAS</a></li>
-<li><a href="#eigen">19.2.3.4. Eigen</a></li>
-<li><a href="#parsec-benchmark">19.2.3.5. PARSEC benchmark</a>
-<ul class="sectlevel5">
-<li><a href="#parsec-benchmark-without-parsecmgmt">19.2.3.5.1. PARSEC benchmark without parsecmgmt</a></li>
-<li><a href="#parsec-change-the-input-size">19.2.3.5.2. PARSEC change the input size</a></li>
-<li><a href="#parsec-benchmark-with-parsecmgmt">19.2.3.5.3. PARSEC benchmark with parsecmgmt</a></li>
-<li><a href="#parsec-uninstall">19.2.3.5.4. PARSEC uninstall</a></li>
-<li><a href="#parsec-benchmark-hacking">19.2.3.5.5. PARSEC benchmark hacking</a></li>
-</ul>
-</li>
-</ul>
-</li>
 </ul>
 </li>
 <li><a href="#gem5-kernel-command-line-parameters">19.3. gem5 kernel command line parameters</a></li>
@@ -1303,7 +1286,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
 <li><a href="#fork-bomb">21.3.2.1. Fork bomb</a></li>
 </ul>
 </li>
-<li><a href="#pthreads">21.3.3. pthreads</a></li>
+<li><a href="#pthreads">21.3.3. pthreads</a>
+<ul class="sectlevel4">
+<li><a href="#pthread-mutex">21.3.3.1. pthread_mutex</a></li>
+</ul>
+</li>
 <li><a href="#sysconf">21.3.4. sysconf</a></li>
 <li><a href="#mmap-2">21.3.5. mmap</a>
 <ul class="sectlevel4">
@@ -1322,9 +1309,33 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
 <li><a href="#memory-leaks">21.5.2. Memory leaks</a></li>
 </ul>
 </li>
-<li><a href="#interpreted-languages">21.6. Interpreted languages</a></li>
-<li><a href="#node-js">21.7. Node.js</a></li>
-<li><a href="#userland-content-bibliography">21.8. Userland content bibliography</a></li>
+<li><a href="#interpreted-languages">21.6. Interpreted languages</a>
+<ul class="sectlevel3">
+<li><a href="#node-js">21.6.1. Node.js</a></li>
+</ul>
+</li>
+<li><a href="#algorithms">21.7. Algorithms</a>
+<ul class="sectlevel3">
+<li><a href="#bst-vs-heap-vs-hashmap">21.7.1. BST vs heap vs hashmap</a></li>
+<li><a href="#blas">21.7.2. BLAS</a></li>
+<li><a href="#eigen">21.7.3. Eigen</a></li>
+</ul>
+</li>
+<li><a href="#benchmarks">21.8. Benchmarks</a>
+<ul class="sectlevel3">
+<li><a href="#dhrystone">21.8.1. Dhrystone</a></li>
+<li><a href="#parsec-benchmark">21.8.2. PARSEC benchmark</a>
+<ul class="sectlevel4">
+<li><a href="#parsec-benchmark-without-parsecmgmt">21.8.2.1. PARSEC benchmark without parsecmgmt</a></li>
+<li><a href="#parsec-change-the-input-size">21.8.2.2. PARSEC change the input size</a></li>
+<li><a href="#parsec-benchmark-with-parsecmgmt">21.8.2.3. PARSEC benchmark with parsecmgmt</a></li>
+<li><a href="#parsec-uninstall">21.8.2.4. PARSEC uninstall</a></li>
+<li><a href="#parsec-benchmark-hacking">21.8.2.5. PARSEC benchmark hacking</a></li>
+</ul>
+</li>
+</ul>
+</li>
+<li><a href="#userland-content-bibliography">21.9. Userland content bibliography</a></li>
 </ul>
 </li>
 <li><a href="#userland-assembly">22. Userland assembly</a>
@@ -3334,7 +3345,7 @@ dmesg</pre>
 <p><a href="#posix">POSIX</a></p>
 </li>
 <li>
-<p><a href="https://github.com/cirosantilli/algorithm-cheat" class="bare">https://github.com/cirosantilli/algorithm-cheat</a> TODO will be good to move here for performance analysis <a href="#gem5-run-benchmark">with gem5</a></p>
+<p><a href="#algorithms">Algorithms</a></p>
 </li>
 </ul>
 </div>
@@ -7324,7 +7335,7 @@ qw er</pre>
 <p>Tests under <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/libs/">userland/libs/</a> depend on certain libraries being available on the target, e.g. <a href="#blas">BLAS</a> for <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/libs/openblas">userland/libs/openblas</a>. They are not run by default, but can be enabled with <code>--package</code> and <code>--package-all</code>.</p>
 </div>
 <div class="paragraph">
-<p>The gem5 tests require building statically with build id <code>static</code>, see also: <a href="#gem5-syscall-emulation-mode">Section 10.6, &#8220;gem5 syscall emulation mode&#8221;</a>. TODO automate this better.</p>
+<p>The gem5 tests require building statically with build id <code>static</code>, see also: <a href="#gem5-syscall-emulation-mode">Section 10.7, &#8220;gem5 syscall emulation mode&#8221;</a>. TODO automate this better.</p>
 </div>
 <div class="paragraph">
 <p>See: <a href="#test-this-repo">Section 33.13, &#8220;Test this repo&#8221;</a> for more useful testing tips.</p>
@@ -7506,7 +7517,7 @@ qemu: uncaught target signal 6 (Aborted) - core dumped</pre>
 <div class="ulist">
 <ul>
 <li>
-<p>gem5 user mode currently only supports static executables as mentioned at: <a href="#gem5-syscall-emulation-mode">Section 10.6, &#8220;gem5 syscall emulation mode&#8221;</a></p>
+<p>gem5 user mode currently only supports static executables as mentioned at: <a href="#gem5-syscall-emulation-mode">Section 10.7, &#8220;gem5 syscall emulation mode&#8221;</a></p>
 </li>
 <li>
 <p>QEMU x86_64 guest on x86_64 host was failing with <a href="#stack-smashing-detected">stack smashing detected</a>, but we found a workaround</p>
@@ -7613,7 +7624,61 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex
 </div>
 </div>
 <div class="sect2">
-<h3 id="gem5-syscall-emulation-mode"><a class="anchor" href="#gem5-syscall-emulation-mode"></a><a class="link" href="#gem5-syscall-emulation-mode">10.6. gem5 syscall emulation mode</a></h3>
+<h3 id="syscall-emulation-mode-program-stdin"><a class="anchor" href="#syscall-emulation-mode-program-stdin"></a><a class="link" href="#syscall-emulation-mode-program-stdin">10.6. syscall emulation mode program stdin</a></h3>
+<div class="paragraph">
+<p>The following methods work on both QEMU and gem5 as of LKMC 99d6bc6bc19d4c7f62b172643be95d9c43c26145 + 1. Interactive input:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./run --userland userland/c/getchar.c</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Source: <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/c/getchar.c">userland/c/getchar.c</a></p>
+</div>
+<div class="paragraph">
+<p>A line like the following should show up:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>enter a character:</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>and after pressing, say, <code>a</code> and Enter, we get:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>you entered: a</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Note however that, because <a href="#qemu-user-mode-does-not-show-stdout-immediately">QEMU user mode does not show stdout immediately</a>, we don&#8217;t actually see the initial <code>enter a character</code> line on QEMU.</p>
+</div>
+<div class="paragraph">
+<p>Non-interactive input from a file, with our Python scripts implicitly forwarding their stdin to the emulator:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>printf a &gt; f.tmp
+./run --userland userland/c/getchar.c &lt; f.tmp</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Input from a file, by explicitly asking our scripts to use it via the Python API:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>printf a &gt; f.tmp
+./run --emulator gem5 --userland userland/c/getchar.c --static --stdin-file f.tmp</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>This is especially useful when running tests that require stdin input.</p>
+</div>
+</div>
+<div class="sect2">
+<h3 id="gem5-syscall-emulation-mode"><a class="anchor" href="#gem5-syscall-emulation-mode"></a><a class="link" href="#gem5-syscall-emulation-mode">10.7. gem5 syscall emulation mode</a></h3>
 <div class="paragraph">
 <p>Less robust than QEMU&#8217;s, but still usable:</p>
 </div>
@@ -7691,7 +7756,7 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex
 </div>
 </div>
 <div class="sect3">
-<h4 id="gem5-syscall-emulation-exit-status"><a class="anchor" href="#gem5-syscall-emulation-exit-status"></a><a class="link" href="#gem5-syscall-emulation-exit-status">10.6.1. gem5 syscall emulation exit status</a></h4>
+<h4 id="gem5-syscall-emulation-exit-status"><a class="anchor" href="#gem5-syscall-emulation-exit-status"></a><a class="link" href="#gem5-syscall-emulation-exit-status">10.7.1. gem5 syscall emulation exit status</a></h4>
 <div class="paragraph">
 <p>As of gem5 7fa4c946386e7207ad5859e8ade0bbfc14000d91, the crappy <code>se.py</code> script does not forward the exit status of syscall emulation mode, you can test it with:</p>
 </div>
@@ -7730,33 +7795,7 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex
 </div>
 </div>
 <div class="sect3">
-<h4 id="gem5-syscall-emulation-mode-program-stdin"><a class="anchor" href="#gem5-syscall-emulation-mode-program-stdin"></a><a class="link" href="#gem5-syscall-emulation-mode-program-stdin">10.6.2. gem5 syscall emulation mode program stdin</a></h4>
-<div class="paragraph">
-<p>gem5 shows its own stdout to terminal, and does not allow you to type stdin to programs.</p>
-</div>
-<div class="paragraph">
-<p>Instead, you must pass stdin non-interactively with the through a file with the <code>--se.py --input</code> option, e.g.:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>printf a &gt; f
-./run --emulator gem5 --userland userland/c/getchar.c --static -- --input f</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>leads to gem5 output:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>enter a character: you entered: a</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>Source: <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/c/getchar.c">userland/c/getchar.c</a></p>
-</div>
-</div>
-<div class="sect3">
-<h4 id="gem5-syscall-emulation-mode-syscall-tracing"><a class="anchor" href="#gem5-syscall-emulation-mode-syscall-tracing"></a><a class="link" href="#gem5-syscall-emulation-mode-syscall-tracing">10.6.3. gem5 syscall emulation mode syscall tracing</a></h4>
+<h4 id="gem5-syscall-emulation-mode-syscall-tracing"><a class="anchor" href="#gem5-syscall-emulation-mode-syscall-tracing"></a><a class="link" href="#gem5-syscall-emulation-mode-syscall-tracing">10.7.2. gem5 syscall emulation mode syscall tracing</a></h4>
 <div class="paragraph">
 <p>Since gem5 has to implement syscalls itself in syscall emulation mode, it can of course clearly see which syscalls are being made, and we can log them for debug purposes with <a href="#gem5-tracing">gem5 tracing</a>, e.g.:</p>
 </div>
@@ -7804,9 +7843,9 @@ hello
 </div>
 </div>
 <div class="sect2">
-<h3 id="qemu-user-mode-quirks"><a class="anchor" href="#qemu-user-mode-quirks"></a><a class="link" href="#qemu-user-mode-quirks">10.7. QEMU user mode quirks</a></h3>
+<h3 id="qemu-user-mode-quirks"><a class="anchor" href="#qemu-user-mode-quirks"></a><a class="link" href="#qemu-user-mode-quirks">10.8. QEMU user mode quirks</a></h3>
 <div class="sect3">
-<h4 id="qemu-user-mode-does-not-show-stdout-immediately"><a class="anchor" href="#qemu-user-mode-does-not-show-stdout-immediately"></a><a class="link" href="#qemu-user-mode-does-not-show-stdout-immediately">10.7.1. QEMU user mode does not show stdout immediately</a></h4>
+<h4 id="qemu-user-mode-does-not-show-stdout-immediately"><a class="anchor" href="#qemu-user-mode-does-not-show-stdout-immediately"></a><a class="link" href="#qemu-user-mode-does-not-show-stdout-immediately">10.8.1. QEMU user mode does not show stdout immediately</a></h4>
 <div class="paragraph">
 <p>At 8d8307ac0710164701f6e14c99a69ee172ccbb70 + 1, I noticed that if you run <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/posix/count.c">userland/posix/count.c</a>:</p>
 </div>
@@ -7834,7 +7873,7 @@ hello
 <p>TODO: investigate further and then possibly post on QEMU mailing list.</p>
 </div>
 <div class="sect4">
-<h5 id="qemu-user-mode-does-not-show-errors"><a class="anchor" href="#qemu-user-mode-does-not-show-errors"></a><a class="link" href="#qemu-user-mode-does-not-show-errors">10.7.1.1. QEMU user mode does not show errors</a></h5>
+<h5 id="qemu-user-mode-does-not-show-errors"><a class="anchor" href="#qemu-user-mode-does-not-show-errors"></a><a class="link" href="#qemu-user-mode-does-not-show-errors">10.8.1.1. QEMU user mode does not show errors</a></h5>
 <div class="paragraph">
 <p>Similarly to <a href="#qemu-user-mode-does-not-show-stdout-immediately">QEMU user mode does not show stdout immediately</a>, QEMU error messages do not show at all through pipes.</p>
 </div>
@@ -17789,6 +17828,19 @@ cat out/gem5-bench-dhrystone.txt</pre>
 </ul>
 </div>
 <div class="paragraph">
+<p>Interesting algorithms and benchmarks for this game are being collected at:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p><a href="#algorithms">Algorithms</a></p>
+</li>
+<li>
+<p><a href="#benchmarks">Benchmarks</a></p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
 <p>To find out why your program is slow, a good first step is to have a look at the <a href="#gem5-m5out-stats-txt-file">gem5 m5out/stats.txt file</a>.</p>
 </div>
 <div class="sect3">
@@ -18329,577 +18381,6 @@ m5 dumpstats</pre>
 </div>
 </div>
 </div>
-<div class="sect3">
-<h4 id="interesting-benchmarks"><a class="anchor" href="#interesting-benchmarks"></a><a class="link" href="#interesting-benchmarks">19.2.3. Interesting benchmarks</a></h4>
-<div class="paragraph">
-<p>Buildroot built-in libraries, mostly under Libraries &gt; Other:</p>
-</div>
-<div class="ulist">
-<ul>
-<li>
-<p>Armadillo <code>C++</code>: linear algebra</p>
-</li>
-<li>
-<p>fftw: Fourier transform</p>
-</li>
-<li>
-<p>Flann</p>
-</li>
-<li>
-<p>GSL: various</p>
-</li>
-<li>
-<p>liblinear</p>
-</li>
-<li>
-<p>libspacialindex</p>
-</li>
-<li>
-<p>libtommath</p>
-</li>
-<li>
-<p>qhull</p>
-</li>
-</ul>
-</div>
-<div class="paragraph">
-<p>Open source but not in Buildroot:</p>
-</div>
-<div class="ulist">
-<ul>
-<li>
-<p><a href="https://github.com/kozyraki/stamp" class="bare">https://github.com/kozyraki/stamp</a> transactional memory benchmarks</p>
-</li>
-</ul>
-</div>
-<div class="paragraph">
-<p>There are not yet enabled, but it should be easy to so, see: <a href="#add-new-buildroot-packages">Section 20.5, &#8220;Add new Buildroot packages&#8221;</a></p>
-</div>
-<div class="sect4">
-<h5 id="dhrystone"><a class="anchor" href="#dhrystone"></a><a class="link" href="#dhrystone">19.2.3.1. Dhrystone</a></h5>
-<div class="paragraph">
-<p><a href="https://en.wikipedia.org/wiki/Dhrystone" class="bare">https://en.wikipedia.org/wiki/Dhrystone</a></p>
-</div>
-<div class="paragraph">
-<p>Created in the 80&#8217;s, it is not a representative measure of performance in modern computers anymore. It has mostly been replaced by <a href="https://en.wikipedia.org/wiki/SPECint">SPEC</a>, which is&#8230;&#8203; closed source! Unbelievable.</p>
-</div>
-<div class="paragraph">
-<p><a href="#buildroot">Buildroot</a> has a <code>dhrystone</code> package, but because it is so interesting to us, we decided to also build it ourselves, which allows things like static and baremetal compilation more easily.</p>
-</div>
-<div class="paragraph">
-<p>Build and run on QEMU <a href="#user-mode-simulation">User mode simulation</a>:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>git submodule update --init submodules/dhrystone
-./build-dhrystone --mode userland
-./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>Build and run on gem5 user mode:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>./build-dhrystone --mode userland --static --force-rebuild
-./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>TODO automate run more nicely.</p>
-</div>
-<div class="paragraph">
-<p>Build for <a href="#baremetal">Baremetal</a> execution and run it in baremetal QEMU:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre># Build our Newlib stubs.
-./build-baremetal --arch aarch64
-./build-dhrystone --arch aarch64 --mode baremetal
-./run --arch aarch64 --baremetal "$(./getvar baremetal_build_dir)/submodules/dhrystone/dhrystone"</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>TODO: fix the build, just need to factor out all run arguments from <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/build-baremetal">build-baremetal</a> into <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/common.py">common.py</a> and it should just work, no missing syscalls.</p>
-</div>
-<div class="paragraph">
-<p>If you really want the Buildroot package for some reason, build it with:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>./build-buildroot --config 'BR2_PACKAGE_DHRYSTONE=y'</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>and run inside the guest from <code>PATH</code> with:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>dhrystone</pre>
-</div>
-</div>
-</div>
-<div class="sect4">
-<h5 id="bst-vs-heap-vs-hashmap"><a class="anchor" href="#bst-vs-heap-vs-hashmap"></a><a class="link" href="#bst-vs-heap-vs-hashmap">19.2.3.2. BST vs heap vs hashmap</a></h5>
-<div class="paragraph">
-<p>TODO: move benchmark graph from <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/cpp/bst_vs_heap_vs_hashmap.cpp">userland/cpp/bst_vs_heap_vs_hashmap.cpp</a> to <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/algorithm/set">userland/algorithm/set</a>.</p>
-</div>
-<div class="paragraph">
-<p>The following benchmark setup works both:</p>
-</div>
-<div class="ulist">
-<ul>
-<li>
-<p>on host through timers + <a href="https://stackoverflow.com/questions/51952471/why-do-i-get-a-constant-instead-of-logarithmic-curve-for-an-insert-time-benchmar/51953081#51953081">granule</a></p>
-</li>
-<li>
-<p>gem5 with <a href="#m5ops-instructions">dumpstats</a>, which can get more precise results with <code>granule == 1</code></p>
-</li>
-</ul>
-</div>
-<div class="paragraph">
-<p>It has been used to answer:</p>
-</div>
-<div class="ulist">
-<ul>
-<li>
-<p>BST vs heap: <a href="https://stackoverflow.com/questions/6147243/heap-vs-binary-search-tree-bst/29548834#29548834" class="bare">https://stackoverflow.com/questions/6147243/heap-vs-binary-search-tree-bst/29548834#29548834</a></p>
-</li>
-<li>
-<p><code>std::set</code>: <a href="https://stackoverflow.com/questions/2558153/what-is-the-underlying-data-structure-of-a-stl-set-in-c/51944661#51944661" class="bare">https://stackoverflow.com/questions/2558153/what-is-the-underlying-data-structure-of-a-stl-set-in-c/51944661#51944661</a></p>
-</li>
-<li>
-<p><code>std::map</code>: <a href="https://stackoverflow.com/questions/18414579/what-data-structure-is-inside-stdmap-in-c/51945119#51945119" class="bare">https://stackoverflow.com/questions/18414579/what-data-structure-is-inside-stdmap-in-c/51945119#51945119</a></p>
-</li>
-</ul>
-</div>
-<div class="paragraph">
-<p>To benchmark on the host, we do:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>./build-userland-in-tree \
-  --force-rebuild \
-  --optimization-level 3 \
-  ./userland/cpp/bst_vs_heap_vs_hashmap.cpp \
-;
-./userland/cpp/bst_vs_heap_vs_hashmap.out 10000000 10000 0 | tee bst_vs_heap_vs_hashmap.dat
-gnuplot \
-  -e 'input_noext="bst_vs_heap_vs_hashmap"' \
-  -e 'heap_zoom_max=50' \
-  -e 'hashmap_zoom_max=400' \
-  ./bst-vs-heap-vs-hashmap.gnuplot \
-;
-xdg-open bst_vs_heap_vs_hashmap.tmp.png</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>The parameters <code>heap_zoom_max</code> and <code>hashmap_zoom_max</code> are chosen manually interactively to best showcase the regions of interest in those plots.</p>
-</div>
-<div class="paragraph">
-<p>To benchmark on gem5, we first build the benchmark with <a href="#m5ops-instructions">m5ops instructions</a> enabled, and then we run it and extract the stats:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>./build-userland \
-  --arch x86_64 \
-  --ccflags='-DLKMC_M5OPS_ENABLE=1' \
-  --force-rebuild userland/cpp/bst_vs_heap_vs_hashmap.cpp \
-  --static \
-  --optimization-level 3 \
-;
-./run \
-  --arch x86_64 \
-  --emulator gem5 \
-  --static \
-  --userland userland/cpp/bst_vs_heap_vs_hashmap.cpp \
-  --userland-args='100000 1 0' \
-  -- \
-  --cpu-type=DerivO3CPU \
-  --caches \
-  --l2cache \
-  --l1d_size=32kB \
-  --l1i_size=32kB \
-  --l2_size=256kB \
-  --l3_size=20MB \
-;
-./bst-vs-heap-vs-hashmap-gem5-stats --arch x86_64 | tee bst_vs_heap_vs_hashmap_gem5.dat
-gnuplot \
-  -e 'input_noext="bst_vs_heap_vs_hashmap_gem5"' \
-  -e 'heap_zoom_max=500' \
-  -e 'hashmap_zoom_max=400' \
-  ./bst-vs-heap-vs-hashmap.gnuplot \
-;
-xdg-open bst_vs_heap_vs_hashmap_gem5.tmp.png</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>TODO: the gem5 simulation blows up on a tcmalloc allocation somewhere near 25k elements as of 3fdd83c2c58327d9714fa2347c724b78d7c05e2b + 1, likely linked to the extreme inefficiency of the stats collection?</p>
-</div>
-<div class="paragraph">
-<p>The cache sizes were chosen to match the host <a href="#p51">P51</a> to improve the comparison. Ideally we should also use the same standard library.</p>
-</div>
-<div class="paragraph">
-<p>Note that this will take a long time, and will produce a humongous ~40Gb stats file as explained at: <a href="#gem5-only-dump-selected-stats">Section 19.9.2.1, &#8220;gem5 only dump selected stats&#8221;</a></p>
-</div>
-<div class="paragraph">
-<p>Sources:</p>
-</div>
-<div class="ulist">
-<ul>
-<li>
-<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/cpp/bst_vs_heap_vs_hashmap.cpp">userland/cpp/bst_vs_heap_vs_hashmap.cpp</a></p>
-</li>
-<li>
-<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/bst-vs-heap-vs-hashmap-gem5-stats">bst-vs-heap-vs-hashmap-gem5-stats</a></p>
-</li>
-<li>
-<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/bst-vs-heap-vs-hashmap.gnuplot">bst-vs-heap-vs-hashmap.gnuplot</a></p>
-</li>
-</ul>
-</div>
-</div>
-<div class="sect4">
-<h5 id="blas"><a class="anchor" href="#blas"></a><a class="link" href="#blas">19.2.3.3. BLAS</a></h5>
-<div class="paragraph">
-<p>Buildroot supports it, which makes everything just trivial:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>./build-buildroot --config 'BR2_PACKAGE_OPENBLAS=y'
-./build-userland --package openblas -- userland/libs/openblas/hello.c
-./run --eval-after './libs/openblas/hello.out; echo $?'</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>Outcome: the test passes:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>0</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>Source: <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/libs/openblas/hello.c">userland/libs/openblas/hello.c</a></p>
-</div>
-<div class="paragraph">
-<p>The test performs a general matrix multiplication:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>    |  1.0 -3.0 |   |  1.0  2.0  1.0 |       |  0.5  0.5  0.5 |   |  11.0 - 9.0  5.0 |
-1 * |  2.0  4.0 | * | -3.0  4.0 -1.0 | + 2 * |  0.5  0.5  0.5 | = | - 9.0  21.0 -1.0 |
-    |  1.0 -1.0 |                            |  0.5  0.5  0.5 |   |   5.0 - 1.0  3.0 |</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>This can be deduced from the Fortran interfaces at</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>less "$(./getvar buildroot_build_build_dir)"/openblas-*/reference/dgemmf.f</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>which we can map to our call as:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>C := alpha*op( A )*op( B ) + beta*C,
-SUBROUTINE DGEMMF(               TRANA,        TRANB,     M,N,K,  ALPHA,A,LDA,B,LDB,BETA,C,LDC)
-cblas_dgemm(      CblasColMajor, CblasNoTrans, CblasTrans,3,3,2  ,1,    A,3,  B,3,  2   ,C,3  );</pre>
-</div>
-</div>
-</div>
-<div class="sect4">
-<h5 id="eigen"><a class="anchor" href="#eigen"></a><a class="link" href="#eigen">19.2.3.4. Eigen</a></h5>
-<div class="paragraph">
-<p>Header only linear algebra library with a mainline Buildroot package:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>./build-buildroot --config 'BR2_PACKAGE_EIGEN=y'
-./build-userland --package eigen -- userland/libs/eigen/hello.cpp</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>Just create an array and print it:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>./run --eval-after './libs/eigen/hello.out'</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>Output:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>  3  -1
-2.5 1.5</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>Source: <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/libs/eigen/hello.cpp">userland/libs/eigen/hello.cpp</a></p>
-</div>
-<div class="paragraph">
-<p>This example just creates a matrix and prints it out.</p>
-</div>
-<div class="paragraph">
-<p>Tested on: <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/commit/a4bdcf102c068762bb1ef26c591fcf71e5907525">a4bdcf102c068762bb1ef26c591fcf71e5907525</a></p>
-</div>
-</div>
-<div class="sect4">
-<h5 id="parsec-benchmark"><a class="anchor" href="#parsec-benchmark"></a><a class="link" href="#parsec-benchmark">19.2.3.5. PARSEC benchmark</a></h5>
-<div class="paragraph">
-<p>We have ported parts of the <a href="http://parsec.cs.princeton.edu">PARSEC benchmark</a> for cross compilation at: <a href="https://github.com/cirosantilli/parsec-benchmark" class="bare">https://github.com/cirosantilli/parsec-benchmark</a> See the documentation on that repo to find out which benchmarks have been ported. Some of the benchmarks were are segfaulting, they are documented in that repo.</p>
-</div>
-<div class="paragraph">
-<p>There are two ways to run PARSEC with this repo:</p>
-</div>
-<div class="ulist">
-<ul>
-<li>
-<p><a href="#parsec-benchmark-without-parsecmgmt">without <code>pasecmgmt</code></a>, most likely what you want</p>
-</li>
-<li>
-<p><a href="#parsec-benchmark-with-parsecmgmt">with <code>pasecmgmt</code></a></p>
-</li>
-</ul>
-</div>
-<div class="sect5">
-<h6 id="parsec-benchmark-without-parsecmgmt"><a class="anchor" href="#parsec-benchmark-without-parsecmgmt"></a><a class="link" href="#parsec-benchmark-without-parsecmgmt">19.2.3.5.1. PARSEC benchmark without parsecmgmt</a></h6>
-<div class="literalblock">
-<div class="content">
-<pre>./build --arch arm --download-dependencies gem5-buildroot parsec-benchmark
-./build-buildroot --arch arm --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y'
-./run --arch arm --emulator gem5</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>Once inside the guest, launch one of the <code>test</code> input sized benchmarks manually as in:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>cd /parsec/ext/splash2x/apps/fmm/run
-../inst/arm-linux.gcc/bin/fmm 1 &lt; input_1</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>To find run out how to run many of the benchmarks, have a look at the <code>test.sh</code> script of the <code>parse-benchmark</code> repo.</p>
-</div>
-<div class="paragraph">
-<p>From the guest, you can also run it as:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>cd /parsec
-./test.sh</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>but this might be a bit time consuming in gem5.</p>
-</div>
-</div>
-<div class="sect5">
-<h6 id="parsec-change-the-input-size"><a class="anchor" href="#parsec-change-the-input-size"></a><a class="link" href="#parsec-change-the-input-size">19.2.3.5.2. PARSEC change the input size</a></h6>
-<div class="paragraph">
-<p>Running a benchmark of a size different than <code>test</code>, e.g. <code>simsmall</code>, requires a rebuild with:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>./build-buildroot \
-  --arch arm \
-  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
-  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_INPUT_SIZE="simsmall"' \
-  -- parsec_benchmark-reconfigure \
-;</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>Large input may also require tweaking:</p>
-</div>
-<div class="ulist">
-<ul>
-<li>
-<p><a href="#br2-target-rootfs-ext2-size">BR2_TARGET_ROOTFS_EXT2_SIZE</a> if the unpacked inputs are large</p>
-</li>
-<li>
-<p><a href="#memory-size">Memory size</a>, unless you want to meet the OOM killer, which is admittedly kind of fun</p>
-</li>
-</ul>
-</div>
-<div class="paragraph">
-<p><code>test.sh</code> only contains the run commands for the <code>test</code> size, and cannot be used for <code>simsmall</code>.</p>
-</div>
-<div class="paragraph">
-<p>The easiest thing to do, is to <a href="https://superuser.com/questions/231002/how-can-i-search-within-the-output-buffer-of-a-tmux-shell/1253137#1253137">scroll up on the host shell</a> after the build, and look for a line of type:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>Running /root/linux-kernel-module-cheat/out/aarch64/buildroot/build/parsec-benchmark-custom/ext/splash2x/apps/ocean_ncp/inst/aarch64-linux.gcc/bin/ocean_ncp -n2050 -p1 -e1e-07 -r20000 -t28800</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>and then tweak the command found in <code>test.sh</code> accordingly.</p>
-</div>
-<div class="paragraph">
-<p>Yes, we do run the benchmarks on host just to unpack / generate inputs. They are expected fail to run since they were build for the guest instead of host, including for x86_64 guest which has a different interpreter than the host&#8217;s (see <code>file myexecutable</code>).</p>
-</div>
-<div class="paragraph">
-<p>The rebuild is required because we unpack input files on the host.</p>
-</div>
-<div class="paragraph">
-<p>Separating input sizes also allows to create smaller images when only running the smaller benchmarks.</p>
-</div>
-<div class="paragraph">
-<p>This limitation exists because <code>parsecmgmt</code> generates the input files just before running via the Bash scripts, but we can&#8217;t run <code>parsecmgmt</code> on gem5 as it is too slow!</p>
-</div>
-<div class="paragraph">
-<p>One option would be to do that inside the guest with QEMU.</p>
-</div>
-<div class="paragraph">
-<p>Also, we can&#8217;t generate all input sizes at once, because many of them have the same name and would overwrite one another&#8230;&#8203;</p>
-</div>
-<div class="paragraph">
-<p>PARSEC simply wasn&#8217;t designed with non native machines in mind&#8230;&#8203;</p>
-</div>
-</div>
-<div class="sect5">
-<h6 id="parsec-benchmark-with-parsecmgmt"><a class="anchor" href="#parsec-benchmark-with-parsecmgmt"></a><a class="link" href="#parsec-benchmark-with-parsecmgmt">19.2.3.5.3. PARSEC benchmark with parsecmgmt</a></h6>
-<div class="paragraph">
-<p>Most users won&#8217;t want to use this method because:</p>
-</div>
-<div class="ulist">
-<ul>
-<li>
-<p>running the <code>parsecmgmt</code> Bash scripts takes forever before it ever starts running the actual benchmarks on gem5</p>
-<div class="paragraph">
-<p>Running on QEMU is feasible, but not the main use case, since QEMU cannot be used for performance measurements</p>
-</div>
-</li>
-<li>
-<p>it requires putting the full <code>.tar</code> inputs on the guest, which makes the image twice as large (1x for the <code>.tar</code>, 1x for the unpacked input files)</p>
-</li>
-</ul>
-</div>
-<div class="paragraph">
-<p>It would be awesome if it were possible to use this method, since this is what Parsec supports officially, and so:</p>
-</div>
-<div class="ulist">
-<ul>
-<li>
-<p>you don&#8217;t have to dig into what raw command to run</p>
-</li>
-<li>
-<p>there is an easy way to run all the benchmarks in one go to test them out</p>
-</li>
-<li>
-<p>you can just run any of the benchmarks that you want</p>
-</li>
-</ul>
-</div>
-<div class="paragraph">
-<p>but it simply is not feasible in gem5 because it takes too long.</p>
-</div>
-<div class="paragraph">
-<p>If you still want to run this, try it out with:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>./build-buildroot \
-  --arch aarch64 \
-  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
-  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_PARSECMGMT=y' \
-  --config 'BR2_TARGET_ROOTFS_EXT2_SIZE="3G"' \
-  -- parsec_benchmark-reconfigure \
-;</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>And then you can run it just as you would on the host:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>cd /parsec/
-bash
-. env.sh
-parsecmgmt -a run -p splash2x.fmm -i test</pre>
-</div>
-</div>
-</div>
-<div class="sect5">
-<h6 id="parsec-uninstall"><a class="anchor" href="#parsec-uninstall"></a><a class="link" href="#parsec-uninstall">19.2.3.5.4. PARSEC uninstall</a></h6>
-<div class="paragraph">
-<p>If you want to remove PARSEC later, Buildroot doesn&#8217;t provide an automated package removal mechanism as mentioned at: <a href="#remove-buildroot-packages">Section 20.6, &#8220;Remove Buildroot packages&#8221;</a>, but the following procedure should be satisfactory:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>rm -rf \
-  "$(./getvar buildroot_download_dir)"/parsec-* \
-  "$(./getvar buildroot_build_dir)"/build/parsec-* \
-  "$(./getvar buildroot_build_dir)"/build/packages-file-list.txt \
-  "$(./getvar buildroot_build_dir)"/images/rootfs.* \
-  "$(./getvar buildroot_build_dir)"/target/parsec-* \
-;
-./build-buildroot --arch arm</pre>
-</div>
-</div>
-</div>
-<div class="sect5">
-<h6 id="parsec-benchmark-hacking"><a class="anchor" href="#parsec-benchmark-hacking"></a><a class="link" href="#parsec-benchmark-hacking">19.2.3.5.5. PARSEC benchmark hacking</a></h6>
-<div class="paragraph">
-<p>If you end up going inside <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/submodules/parsec-benchmark">submodules/parsec-benchmark</a> to hack up the benchmark (you will!), these tips will be helpful.</p>
-</div>
-<div class="paragraph">
-<p>Buildroot was not designed to deal with large images, and currently cross rebuilds are a bit slow, due to some image generation and validation steps.</p>
-</div>
-<div class="paragraph">
-<p>A few workarounds are:</p>
-</div>
-<div class="ulist">
-<ul>
-<li>
-<p>develop in host first as much as you can. Our PARSEC fork supports it.</p>
-<div class="paragraph">
-<p>If you do this, don&#8217;t forget to do a:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>cd "$(./getvar parsec_source_dir)"
-git clean -xdf .</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>before going for the cross compile build.</p>
-</div>
-</li>
-<li>
-<p>patch Buildroot to work well, and keep cross compiling all the way. This should be totally viable, and we should do it.</p>
-<div class="paragraph">
-<p>Don&#8217;t forget to explicitly rebuild PARSEC with:</p>
-</div>
-<div class="literalblock">
-<div class="content">
-<pre>./build-buildroot \
-  --arch arm \
-  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
-  -- parsec_benchmark-reconfigure \
-;</pre>
-</div>
-</div>
-<div class="paragraph">
-<p>You may also want to test if your patches are still functionally correct inside of QEMU first, which is a faster emulator.</p>
-</div>
-</li>
-<li>
-<p>sell your soul, and compile natively inside the guest. We won&#8217;t do this, not only because it is evil, but also because Buildroot explicitly does not support it: <a href="https://buildroot.org/downloads/manual/manual.html#faq-no-compiler-on-target" class="bare">https://buildroot.org/downloads/manual/manual.html#faq-no-compiler-on-target</a> ARM employees have been known to do this: <a href="https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/qemu-patch.diff" class="bare">https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/qemu-patch.diff</a></p>
-</li>
-</ul>
-</div>
-</div>
-</div>
-</div>
 </div>
 <div class="sect2">
 <h3 id="gem5-kernel-command-line-parameters"><a class="anchor" href="#gem5-kernel-command-line-parameters"></a><a class="link" href="#gem5-kernel-command-line-parameters">19.3. gem5 kernel command line parameters</a></h3>
@@ -21800,7 +21281,7 @@ make menuconfig</pre>
 <p>Also mentioned at: <a href="https://stackoverflow.com/questions/47320800/how-to-clean-only-target-in-buildroot" class="bare">https://stackoverflow.com/questions/47320800/how-to-clean-only-target-in-buildroot</a></p>
 </div>
 <div class="paragraph">
-<p>See this for a sample manual workaround: <a href="#parsec-uninstall">Section 19.2.3.5.4, &#8220;PARSEC uninstall&#8221;</a>.</p>
+<p>See this for a sample manual workaround: <a href="#parsec-uninstall">Section 21.8.2.4, &#8220;PARSEC uninstall&#8221;</a>.</p>
 </div>
 </div>
 <div class="sect2">
@@ -22500,7 +21981,7 @@ printf '%x\n' 4198400
 <p>which is documented in <code>man proc</code>.</p>
 </div>
 <div class="paragraph">
-<p>The default value is <code>0</code>, which I can&#8217;t find a precise documentation for. <code>2</code> is precisly documented but I&#8217;m lazy to do all calculations. So let&#8217;s just verify <code>0</code> vs <code>1</code> by trying to <code>mmap</code> 1GiB of memory:</p>
+<p>The default value is <code>0</code>, which I can&#8217;t find precise documentation for. <code>2</code> is precisely documented, but I&#8217;m too lazy to do all the calculations. So let&#8217;s just verify <code>0</code> vs <code>1</code> by trying to <code>mmap</code> 1GiB of memory:</p>
 </div>
 <div class="literalblock">
 <div class="content">
@@ -22889,6 +22370,19 @@ time ./mutex.out 4 100000000</pre>
 </div>
 </div>
 <div class="paragraph">
+<p>Related examples:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>POSIX <a href="#pthread-mutex">pthread_mutex</a></p>
+</li>
+<li>
+<p>C11 <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/c/atomic.c">userland/c/atomic.c</a> documented at <a href="#c-multithreading">C multithreading</a></p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
 <p>Bibliography:</p>
 </div>
 <div class="ulist">
@@ -23028,6 +22522,42 @@ fork() return = 13039</pre>
 </li>
 </ul>
 </div>
+<div class="sect4">
+<h5 id="pthread-mutex"><a class="anchor" href="#pthread-mutex"></a><a class="link" href="#pthread-mutex">21.3.3.1. pthread_mutex</a></h5>
+<div class="paragraph">
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/posix/pthread_count.c">userland/posix/pthread_count.c</a> exemplifies the functions:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p><code>pthread_mutex_lock</code></p>
+</li>
+<li>
+<p><code>pthread_mutex_unlock</code></p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>That example has the same interface as: <a href="#atomic-cpp">atomic.cpp</a>.
+There are no non-locking atomic types or atomic primitives in POSIX: <a href="http://stackoverflow.com/questions/1130018/unix-portable-atomic-operations" class="bare">http://stackoverflow.com/questions/1130018/unix-portable-atomic-operations</a></p>
+</div>
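+<div class="paragraph">
+<p>C11, on the other hand, does provide lock-free atomics in <code>&lt;stdatomic.h&gt;</code>. For contrast with the mutex approach, here is a minimal sketch. It is not taken from <code>atomic.cpp</code> or <code>userland/c/atomic.c</code>, and it assumes a C11 compiler plus POSIX threads (e.g. <code>gcc -pthread</code>). <code>run_atomic_count</code>, <code>NTHREADS</code> and <code>NITERS</code> are hypothetical names. Each thread increments a shared counter with <code>atomic_fetch_add</code>, so no increment is lost even though no lock is ever taken:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>/* Hypothetical sketch: lock-free shared counter with C11 atomics.
+ * Build with e.g.: gcc -pthread atomic_sketch.c */
+#include &lt;assert.h&gt;
+#include &lt;pthread.h&gt;
+#include &lt;stdatomic.h&gt;
+#include &lt;stdio.h&gt;
+
+enum { NTHREADS = 4, NITERS = 100000 };
+
+static atomic_ulong counter;
+
+static void *worker(void *arg) {
+    (void)arg;
+    for (int i = 0; i &lt; NITERS; i++)
+        /* Atomic read-modify-write: concurrent increments are never lost. */
+        atomic_fetch_add(&amp;counter, 1);
+    return NULL;
+}
+
+unsigned long run_atomic_count(void) {
+    pthread_t threads[NTHREADS];
+    atomic_store(&amp;counter, 0);
+    for (int i = 0; i &lt; NTHREADS; i++)
+        pthread_create(&amp;threads[i], NULL, worker, NULL);
+    for (int i = 0; i &lt; NTHREADS; i++)
+        pthread_join(threads[i], NULL);
+    return atomic_load(&amp;counter);
+}
+
+int main(void) {
+    unsigned long total = run_atomic_count();
+    printf("%lu\n", total);
+    assert(total == (unsigned long)NTHREADS * NITERS);
+    return 0;
+}</pre>
+</div>
+</div>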
+<div class="paragraph">
+<p><code>pthread_mutex_lock</code>, <code>pthread_mutex_unlock</code> and many other pthread functions already enforce cross-thread memory synchronization:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p><a href="https://stackoverflow.com/questions/78172/using-c-pthreads-do-shared-variables-need-to-be-volatile/58935671#58935671" class="bare">https://stackoverflow.com/questions/78172/using-c-pthreads-do-shared-variables-need-to-be-volatile/58935671#58935671</a></p>
+</li>
+<li>
+<p><a href="https://stackoverflow.com/questions/3208060/does-guarding-a-variable-with-a-pthread-mutex-guarantee-its-also-not-cached" class="bare">https://stackoverflow.com/questions/3208060/does-guarding-a-variable-with-a-pthread-mutex-guarantee-its-also-not-cached</a></p>
+</li>
+<li>
+<p><a href="https://stackoverflow.com/questions/24137964/does-pthread-mutex-lock-contains-memory-fence-instruction" class="bare">https://stackoverflow.com/questions/24137964/does-pthread-mutex-lock-contains-memory-fence-instruction</a></p>
+</li>
+</ul>
+</div>
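+<div class="paragraph">
+<p>To illustrate the points above, here is a minimal sketch assuming only POSIX threads (build with e.g. <code>gcc -pthread</code>). It is in the spirit of <code>pthread_count.c</code> but is not a copy of it, and <code>run_mutex_count</code>, <code>NTHREADS</code> and <code>NITERS</code> are hypothetical names. Several threads increment a plain, non-<code>volatile</code> counter. Every access happens between <code>pthread_mutex_lock</code> and <code>pthread_mutex_unlock</code>, and the final read happens after <code>pthread_join</code>, so the result is always exactly <code>NTHREADS * NITERS</code>:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>/* Hypothetical sketch: mutex-protected shared counter.
+ * Build with e.g.: gcc -pthread mutex_sketch.c */
+#include &lt;assert.h&gt;
+#include &lt;pthread.h&gt;
+#include &lt;stdio.h&gt;
+
+enum { NTHREADS = 4, NITERS = 100000 };
+
+/* Plain, non-volatile shared variable: the mutex provides both mutual
+ * exclusion and the cross-thread memory synchronization. */
+static unsigned long counter;
+static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+
+static void *worker(void *arg) {
+    (void)arg;
+    for (int i = 0; i &lt; NITERS; i++) {
+        pthread_mutex_lock(&amp;lock);
+        counter++; /* critical section */
+        pthread_mutex_unlock(&amp;lock);
+    }
+    return NULL;
+}
+
+unsigned long run_mutex_count(void) {
+    pthread_t threads[NTHREADS];
+    counter = 0;
+    for (int i = 0; i &lt; NTHREADS; i++)
+        pthread_create(&amp;threads[i], NULL, worker, NULL);
+    /* pthread_join also synchronizes memory, so reading counter afterwards is safe. */
+    for (int i = 0; i &lt; NTHREADS; i++)
+        pthread_join(threads[i], NULL);
+    return counter;
+}
+
+int main(void) {
+    unsigned long total = run_mutex_count();
+    printf("%lu\n", total);
+    assert(total == (unsigned long)NTHREADS * NITERS);
+    return 0;
+}</pre>
+</div>
+</div>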
+</div>
 </div>
 <div class="sect3">
 <h4 id="sysconf"><a class="anchor" href="#sysconf"></a><a class="link" href="#sysconf">21.3.4. sysconf</a></h4>
@@ -23222,16 +22752,8 @@ fork() return = 13039</pre>
 <div class="paragraph">
 <p>Maybe some day someone will use this setup to study the performance of interpreters:</p>
 </div>
-<div class="ulist">
-<ul>
-<li>
-<p><a href="#node-js">Node.js</a></p>
-</li>
-</ul>
-</div>
-</div>
-<div class="sect2">
-<h3 id="node-js"><a class="anchor" href="#node-js"></a><a class="link" href="#node-js">21.7. Node.js</a></h3>
+<div class="sect3">
+<h4 id="node-js"><a class="anchor" href="#node-js"></a><a class="link" href="#node-js">21.6.1. Node.js</a></h4>
 <div class="paragraph">
 <p>Parent section: <a href="#interpreted-languages">Interpreted languages</a>.</p>
 </div>
@@ -23292,8 +22814,710 @@ fork() return = 13039</pre>
 </ul>
 </div>
 </div>
+</div>
 <div class="sect2">
-<h3 id="userland-content-bibliography"><a class="anchor" href="#userland-content-bibliography"></a><a class="link" href="#userland-content-bibliography">21.8. Userland content bibliography</a></h3>
+<h3 id="algorithms"><a class="anchor" href="#algorithms"></a><a class="link" href="#algorithms">21.7. Algorithms</a></h3>
+<div class="paragraph">
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/algorithm">userland/algorithm</a></p>
+</div>
+<div class="paragraph">
+<p>This is still a work in progress and needs better automation, but it is already a good sketch. Key missing features:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>actually check that outputs are correct in <code>./test</code></p>
+</li>
+<li>
+<p>create a mechanism to run all or some selected hand-coded inputs</p>
+</li>
+<li>
+<p>create a mechanism to run generated input</p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>The idea was originally started at: <a href="https://github.com/cirosantilli/algorithm-cheat" class="bare">https://github.com/cirosantilli/algorithm-cheat</a></p>
+</div>
+<div class="paragraph">
+<p>The key idea is that input / output pairs are present in human-readable files generated either:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>manually for small test inputs</p>
+</li>
+<li>
+<p>with a Python script for larger randomized tests</p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>Test programs then:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>read input from stdin</p>
+</li>
+<li>
+<p>produce output to stdout</p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>so that we can compare the output to the expected one.</p>
+</div>
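+<div class="paragraph">
+<p>A minimal sketch of such a stdin / stdout test program in C follows. It assumes a hypothetical input format of whitespace-separated integers and an expected output of the distinct values in increasing order, one per line. The actual format of the <code>test_data</code> files is not described here, and <code>sorted_unique</code> is a hypothetical name:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>/* Hypothetical sketch of a test program: reads whitespace-separated
+ * integers from stdin until EOF and writes the distinct values to stdout
+ * in increasing order, one per line. */
+#include &lt;assert.h&gt;
+#include &lt;stdio.h&gt;
+#include &lt;stdlib.h&gt;
+
+static int cmp_int(const void *a, const void *b) {
+    int x = *(const int *)a, y = *(const int *)b;
+    return (x &gt; y) - (x &lt; y); /* avoids the overflow of x - y */
+}
+
+/* Sorts v in place, removes duplicates and returns the new length. */
+size_t sorted_unique(int *v, size_t n) {
+    size_t out = 0;
+    qsort(v, n, sizeof *v, cmp_int);
+    for (size_t i = 0; i &lt; n; i++)
+        if (out == 0 || v[i] != v[out - 1])
+            v[out++] = v[i];
+    return out;
+}
+
+int main(void) {
+    size_t n = 0, cap = 16;
+    int *v = malloc(cap * sizeof *v);
+    int x;
+    if (!v)
+        return 1;
+    while (scanf("%d", &amp;x) == 1) {
+        if (n == cap) { /* grow geometrically */
+            int *bigger = realloc(v, 2 * cap * sizeof *v);
+            if (!bigger) {
+                free(v);
+                return 1;
+            }
+            v = bigger;
+            cap *= 2;
+        }
+        v[n++] = x;
+    }
+    n = sorted_unique(v, n);
+    for (size_t i = 0; i &lt; n; i++)
+        printf("%d\n", v[i]);
+    free(v);
+    return 0;
+}</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>For instance, feeding it <code>3 1 3 2</code> on stdin prints <code>1</code>, <code>2</code> and <code>3</code>, one per line.</p>
+</div>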
+<div class="paragraph">
+<p>This way, tests can be reused across several implementations in different languages, emulating the many multi-language programming competition websites out there.</p>
+</div>
+<div class="paragraph">
+<p>For example, for a <a href="#userland-setup-getting-started-natively">native run</a> we can run a set / sorting test:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>cd userland/algorithm/set
+./build
+
+# Run with a small hand written test.
+./std_set.out &lt; test_data/8.i &gt; tmp.raw
+
+# Extract the sorted output from stdout, which also
+# contains some timing information.
+./parse_output output &lt; tmp.raw &gt; tmp.o
+
+# Compare the output to the expected one.
+cmp tmp.o test_data/8.e
+
+# Same but now with a large randomly generated input.
+./generate_io
+./std_set.out &lt; tmp.i | ./parse_output output &gt; tmp.o
+cmp tmp.o tmp.e</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>It is also possible to run the algorithm tests from emulators in <a href="#user-mode-simulation">User mode simulation</a> by setting stdin as explained at <a href="#syscall-emulation-mode-program-stdin">syscall emulation mode program stdin</a>, e.g.:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./run --arch aarch64 -u userland/algorithm/set/std_set.cpp --stdin-file userland/algorithm/set/test_data/8.i</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Sources:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/algorithm/set/generate_io">userland/algorithm/set/generate_io</a></p>
+</li>
+<li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/algorithm/set/main.hpp">userland/algorithm/set/main.hpp</a></p>
+</li>
+<li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/algorithm/set/parse_output">userland/algorithm/set/parse_output</a></p>
+</li>
+<li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/algorithm/set/std_set.cpp">userland/algorithm/set/std_set.cpp</a></p>
+</li>
+<li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/algorithm/set/test_data/8.e">userland/algorithm/set/test_data/8.e</a></p>
+</li>
+<li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/algorithm/set/test_data/8.i">userland/algorithm/set/test_data/8.i</a></p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/algorithm/set/parse_output">userland/algorithm/set/parse_output</a> is needed because timing instrumentation measurements must be embedded in the program itself to allow:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>discounting the input reading / output writing operations from the actual "read / write to / from memory algorithm" itself</p>
+</li>
+<li>
+<p>measuring the evolution of the benchmark mid way, e.g. to see how the current container size affects insertion time: <a href="#bst-vs-heap-vs-hashmap">BST vs heap vs hashmap</a></p>
+</li>
+</ul>
+</div>
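+<div class="paragraph">
+<p>The first point can be sketched as follows: the data is already in memory before the clock starts, and only the algorithm call sits between the two timestamps. This is a minimal sketch assuming POSIX <code>clock_gettime</code> with <code>CLOCK_MONOTONIC</code>. <code>time_algorithm</code> and <code>sort_ints</code> are hypothetical names, and <code>qsort</code> merely stands in for the algorithm under test. The sketch also swaps in a different design from the repo: it writes the timing to stderr instead of mixing it into stdout:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>/* Hypothetical sketch: time only the algorithm, with the data already in
+ * memory, so that input reading and output writing are not measured. */
+#define _POSIX_C_SOURCE 200809L
+#include &lt;assert.h&gt;
+#include &lt;stdio.h&gt;
+#include &lt;stdlib.h&gt;
+#include &lt;time.h&gt;
+
+static double now_seconds(void) {
+    struct timespec ts;
+    clock_gettime(CLOCK_MONOTONIC, &amp;ts);
+    return ts.tv_sec + ts.tv_nsec / 1e9;
+}
+
+static int cmp_int(const void *a, const void *b) {
+    int x = *(const int *)a, y = *(const int *)b;
+    return (x &gt; y) - (x &lt; y);
+}
+
+static void sort_ints(int *v, size_t n) {
+    qsort(v, n, sizeof *v, cmp_int);
+}
+
+/* Returns the seconds spent inside algo(v, n) and nothing else. */
+double time_algorithm(void (*algo)(int *, size_t), int *v, size_t n) {
+    double start = now_seconds();
+    algo(v, n);
+    return now_seconds() - start;
+}
+
+int main(void) {
+    const size_t n = 1000000;
+    int *v = malloc(n * sizeof *v);
+    if (!v)
+        return 1;
+    /* Input phase, standing in for reading stdin: not timed. */
+    srand(0);
+    for (size_t i = 0; i &lt; n; i++)
+        v[i] = rand();
+    double seconds = time_algorithm(sort_ints, v, n);
+    /* Output phase, here only a sortedness check: not timed either. */
+    for (size_t i = 1; i &lt; n; i++)
+        assert(v[i - 1] &lt;= v[i]);
+    /* Timing goes to stderr in this sketch; the repo instead mixes it into
+     * stdout and separates it again with parse_output. */
+    fprintf(stderr, "sort_seconds %f\n", seconds);
+    free(v);
+    return 0;
+}</pre>
+</div>
+</div>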
+<div class="paragraph">
+<p>The following are also interesting Buildroot libraries that we could benchmark:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>Armadillo <code>C++</code>: linear algebra</p>
+</li>
+<li>
+<p>fftw: Fourier transform</p>
+</li>
+<li>
+<p>Flann</p>
+</li>
+<li>
+<p>GSL: various</p>
+</li>
+<li>
+<p>liblinear</p>
+</li>
+<li>
+<p>libspatialindex</p>
+</li>
+<li>
+<p>libtommath</p>
+</li>
+<li>
+<p>qhull</p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>These are good targets for <a href="#gem5-run-benchmark">performance analysis with gem5</a>, and there is some overlap between this section and <a href="#benchmarks">Benchmarks</a>.</p>
+</div>
+<div class="sect3">
+<h4 id="bst-vs-heap-vs-hashmap"><a class="anchor" href="#bst-vs-heap-vs-hashmap"></a><a class="link" href="#bst-vs-heap-vs-hashmap">21.7.1. BST vs heap vs hashmap</a></h4>
+<div class="paragraph">
+<p>TODO: move benchmark graph from <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/cpp/bst_vs_heap_vs_hashmap.cpp">userland/cpp/bst_vs_heap_vs_hashmap.cpp</a> to <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/algorithm/set">userland/algorithm/set</a>.</p>
+</div>
+<div class="paragraph">
+<p>The following benchmark setup works both:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>on host through timers + <a href="https://stackoverflow.com/questions/51952471/why-do-i-get-a-constant-instead-of-logarithmic-curve-for-an-insert-time-benchmar/51953081#51953081">granule</a></p>
+</li>
+<li>
+<p>on gem5 with <a href="#m5ops-instructions">dumpstats</a>, which can get more precise results with <code>granule == 1</code></p>
+</li>
+</ul>
+</div>
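+<div class="paragraph">
+<p>The host-side granule idea can be sketched as follows. A single insertion is too short for host timers to resolve well, so each batch of <code>granule</code> insertions is timed as a whole and divided by the batch size. This is a minimal sketch that assumes a plain array-based binary max-heap, not whatever containers <code>bst_vs_heap_vs_hashmap.cpp</code> actually uses. <code>heap_push</code>, the command line arguments and the output format are hypothetical:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>/* Hypothetical sketch of granule-based timing of binary heap insertions.
+ * Usage: ./heap_granule.out [n] [granule] */
+#define _POSIX_C_SOURCE 200809L
+#include &lt;assert.h&gt;
+#include &lt;stdio.h&gt;
+#include &lt;stdlib.h&gt;
+#include &lt;time.h&gt;
+
+static double now_ns(void) {
+    struct timespec ts;
+    clock_gettime(CLOCK_MONOTONIC, &amp;ts);
+    return ts.tv_sec * 1e9 + ts.tv_nsec;
+}
+
+/* Appends value and sifts it up until its parent is not smaller. */
+void heap_push(int *heap, size_t *size, int value) {
+    size_t i = (*size)++;
+    heap[i] = value;
+    while (i &gt; 0) {
+        size_t parent = (i - 1) / 2;
+        if (heap[parent] &gt;= heap[i])
+            break;
+        int tmp = heap[parent];
+        heap[parent] = heap[i];
+        heap[i] = tmp;
+        i = parent;
+    }
+}
+
+int main(int argc, char **argv) {
+    size_t n = argc &gt; 1 ? strtoul(argv[1], NULL, 10) : 1000000;
+    size_t granule = argc &gt; 2 ? strtoul(argv[2], NULL, 10) : 1000;
+    int *values = malloc(n * sizeof *values);
+    int *heap = malloc(n * sizeof *heap);
+    size_t size = 0;
+    if (!values || !heap || granule == 0)
+        return 1;
+    /* Generate the inputs up front so that rand() is not timed. */
+    srand(0);
+    for (size_t i = 0; i &lt; n; i++)
+        values[i] = rand();
+    for (size_t i = 0; i &lt; n; i += granule) {
+        size_t end = i + granule &lt; n ? i + granule : n;
+        double start = now_ns();
+        for (size_t j = i; j &lt; end; j++)
+            heap_push(heap, &amp;size, values[j]);
+        /* One line per batch: heap size before the batch, average ns per insert. */
+        printf("%zu %f\n", i, (now_ns() - start) / (end - i));
+    }
+    assert(size == n);
+    free(values);
+    free(heap);
+    return 0;
+}</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Each output line pairs the current container size with the average insertion time, which is two-column data that is easy to plot against container size.</p>
+</div>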
+<div class="paragraph">
+<p>It has been used to answer:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>BST vs heap: <a href="https://stackoverflow.com/questions/6147243/heap-vs-binary-search-tree-bst/29548834#29548834" class="bare">https://stackoverflow.com/questions/6147243/heap-vs-binary-search-tree-bst/29548834#29548834</a></p>
+</li>
+<li>
+<p><code>std::set</code>: <a href="https://stackoverflow.com/questions/2558153/what-is-the-underlying-data-structure-of-a-stl-set-in-c/51944661#51944661" class="bare">https://stackoverflow.com/questions/2558153/what-is-the-underlying-data-structure-of-a-stl-set-in-c/51944661#51944661</a></p>
+</li>
+<li>
+<p><code>std::map</code>: <a href="https://stackoverflow.com/questions/18414579/what-data-structure-is-inside-stdmap-in-c/51945119#51945119" class="bare">https://stackoverflow.com/questions/18414579/what-data-structure-is-inside-stdmap-in-c/51945119#51945119</a></p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>To benchmark on the host, we do:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./build-userland-in-tree \
+  --force-rebuild \
+  --optimization-level 3 \
+  ./userland/cpp/bst_vs_heap_vs_hashmap.cpp \
+;
+./userland/cpp/bst_vs_heap_vs_hashmap.out 10000000 10000 0 | tee bst_vs_heap_vs_hashmap.dat
+gnuplot \
+  -e 'input_noext="bst_vs_heap_vs_hashmap"' \
+  -e 'heap_zoom_max=50' \
+  -e 'hashmap_zoom_max=400' \
+  ./bst-vs-heap-vs-hashmap.gnuplot \
+;
+xdg-open bst_vs_heap_vs_hashmap.tmp.png</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>The parameters <code>heap_zoom_max</code> and <code>hashmap_zoom_max</code> are chosen manually and interactively to best showcase the regions of interest in those plots.</p>
+</div>
+<div class="paragraph">
+<p>To benchmark on gem5, we first build the benchmark with <a href="#m5ops-instructions">m5ops instructions</a> enabled, and then we run it and extract the stats:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./build-userland \
+  --arch x86_64 \
+  --ccflags='-DLKMC_M5OPS_ENABLE=1' \
+  --force-rebuild userland/cpp/bst_vs_heap_vs_hashmap.cpp \
+  --static \
+  --optimization-level 3 \
+;
+./run \
+  --arch x86_64 \
+  --emulator gem5 \
+  --static \
+  --userland userland/cpp/bst_vs_heap_vs_hashmap.cpp \
+  --userland-args='100000 1 0' \
+  -- \
+  --cpu-type=DerivO3CPU \
+  --caches \
+  --l2cache \
+  --l1d_size=32kB \
+  --l1i_size=32kB \
+  --l2_size=256kB \
+  --l3_size=20MB \
+;
+./bst-vs-heap-vs-hashmap-gem5-stats --arch x86_64 | tee bst_vs_heap_vs_hashmap_gem5.dat
+gnuplot \
+  -e 'input_noext="bst_vs_heap_vs_hashmap_gem5"' \
+  -e 'heap_zoom_max=500' \
+  -e 'hashmap_zoom_max=400' \
+  ./bst-vs-heap-vs-hashmap.gnuplot \
+;
+xdg-open bst_vs_heap_vs_hashmap_gem5.tmp.png</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>TODO: the gem5 simulation blows up on a tcmalloc allocation somewhere near 25k elements as of 3fdd83c2c58327d9714fa2347c724b78d7c05e2b + 1, likely linked to the extreme inefficiency of the stats collection?</p>
+</div>
+<div class="paragraph">
+<p>The cache sizes were chosen to match the host <a href="#p51">P51</a> to improve the comparison. Ideally we should also use the same standard library.</p>
+</div>
+<div class="paragraph">
+<p>Note that this will take a long time, and will produce a humongous ~40Gb stats file, as explained at: <a href="#gem5-only-dump-selected-stats">Section 19.9.2.1, &#8220;gem5 only dump selected stats&#8221;</a>.</p>
+</div>
+<div class="paragraph">
+<p>Sources:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/cpp/bst_vs_heap_vs_hashmap.cpp">userland/cpp/bst_vs_heap_vs_hashmap.cpp</a></p>
+</li>
+<li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/bst-vs-heap-vs-hashmap-gem5-stats">bst-vs-heap-vs-hashmap-gem5-stats</a></p>
+</li>
+<li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/bst-vs-heap-vs-hashmap.gnuplot">bst-vs-heap-vs-hashmap.gnuplot</a></p>
+</li>
+</ul>
+</div>
+</div>
+<div class="sect3">
+<h4 id="blas"><a class="anchor" href="#blas"></a><a class="link" href="#blas">21.7.2. BLAS</a></h4>
+<div class="paragraph">
+<p>Buildroot supports it, which makes everything just trivial:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./build-buildroot --config 'BR2_PACKAGE_OPENBLAS=y'
+./build-userland --package openblas -- userland/libs/openblas/hello.c
+./run --eval-after './libs/openblas/hello.out; echo $?'</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Outcome: the test passes:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>0</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Source: <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/libs/openblas/hello.c">userland/libs/openblas/hello.c</a></p>
+</div>
+<div class="paragraph">
+<p>The test performs a general matrix multiplication:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>    |  1.0 -3.0 |   |  1.0  2.0  1.0 |       |  0.5  0.5  0.5 |   |  11.0 - 9.0  5.0 |
+1 * |  2.0  4.0 | * | -3.0  4.0 -1.0 | + 2 * |  0.5  0.5  0.5 | = | - 9.0  21.0 -1.0 |
+    |  1.0 -1.0 |                            |  0.5  0.5  0.5 |   |   5.0 - 1.0  3.0 |</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>This can be deduced from the Fortran interfaces at:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>less "$(./getvar buildroot_build_build_dir)"/openblas-*/reference/dgemmf.f</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>which we can map to our call as:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>C := alpha*op( A )*op( B ) + beta*C,
+SUBROUTINE DGEMMF(               TRANA,        TRANB,     M,N,K,  ALPHA,A,LDA,B,LDB,BETA,C,LDC)
+cblas_dgemm(      CblasColMajor, CblasNoTrans, CblasTrans,3,3,2  ,1,    A,3,  B,3,  2   ,C,3  );</pre>
+</div>
+</div>
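+<div class="paragraph">
+<p>The column-major, transposed-B convention is easy to get wrong, so a naive triple loop performing the same computation can help check results by hand. This is a minimal sketch, not OpenBLAS code. <code>naive_dgemm_nt</code> is a hypothetical name, and it only handles the <code>CblasColMajor</code>, <code>CblasNoTrans</code>, <code>CblasTrans</code> case used above. It also assumes that A and B hold the same column-major data <code>{1, 2, 1, -3, 4, -1}</code>, which is consistent with the matrices shown above:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>/* Hypothetical reference implementation of the cblas_dgemm call above:
+ * column-major storage, op(A) = A, op(B) = B^T, i.e.
+ * C := alpha * A * B^T + beta * C, where A is m x k and B is n x k. */
+#include &lt;assert.h&gt;
+#include &lt;stdio.h&gt;
+
+void naive_dgemm_nt(int m, int n, int k, double alpha,
+                    const double *a, int lda, const double *b, int ldb,
+                    double beta, double *c, int ldc) {
+    for (int j = 0; j &lt; n; j++)
+        for (int i = 0; i &lt; m; i++) {
+            double sum = 0.0;
+            for (int p = 0; p &lt; k; p++)
+                /* A(i,p) is a[i + p*lda]; B^T(p,j) = B(j,p) is b[j + p*ldb]. */
+                sum += a[i + p * lda] * b[j + p * ldb];
+            c[i + j * ldc] = alpha * sum + beta * c[i + j * ldc];
+        }
+}
+
+int main(void) {
+    /* Column-major 3x2 matrix [1 -3; 2 4; 1 -1], used as both A and B. */
+    double a[6] = {1.0, 2.0, 1.0, -3.0, 4.0, -1.0};
+    double c[9] = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5};
+    /* Same arguments as: cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
+     *                               3, 3, 2, 1, A, 3, B, 3, 2, C, 3) */
+    naive_dgemm_nt(3, 3, 2, 1.0, a, 3, a, 3, 2.0, c, 3);
+    for (int i = 0; i &lt; 3; i++) {
+        for (int j = 0; j &lt; 3; j++)
+            printf("%6.1f", c[i + j * 3]);
+        printf("\n");
+    }
+    assert(c[0] == 11.0 &amp;&amp; c[4] == 21.0 &amp;&amp; c[8] == 3.0);
+    return 0;
+}</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>It prints the same 3x3 result as the matrix equation above, row by row: 11 -9 5, -9 21 -1 and 5 -1 3.</p>
+</div>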
+</div>
+<div class="sect3">
+<h4 id="eigen"><a class="anchor" href="#eigen"></a><a class="link" href="#eigen">21.7.3. Eigen</a></h4>
+<div class="paragraph">
+<p>Header-only linear algebra library with a mainline Buildroot package:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./build-buildroot --config 'BR2_PACKAGE_EIGEN=y'
+./build-userland --package eigen -- userland/libs/eigen/hello.cpp</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Run it with:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./run --eval-after './libs/eigen/hello.out'</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Output:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>  3  -1
+2.5 1.5</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Source: <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/libs/eigen/hello.cpp">userland/libs/eigen/hello.cpp</a></p>
+</div>
+<div class="paragraph">
+<p>This example just creates a matrix and prints it out.</p>
+</div>
+<div class="paragraph">
+<p>Tested on: <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/commit/a4bdcf102c068762bb1ef26c591fcf71e5907525">a4bdcf102c068762bb1ef26c591fcf71e5907525</a></p>
+</div>
+</div>
+</div>
+<div class="sect2">
+<h3 id="benchmarks"><a class="anchor" href="#benchmarks"></a><a class="link" href="#benchmarks">21.8. Benchmarks</a></h3>
+<div class="paragraph">
+<p>These are good targets for <a href="#gem5-run-benchmark">performance analysis with gem5</a>.</p>
+</div>
+<div class="paragraph">
+<p>TODO also consider the following:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p><a href="https://github.com/kozyraki/stamp" class="bare">https://github.com/kozyraki/stamp</a> transactional memory benchmarks</p>
+</li>
+</ul>
+</div>
+<div class="sect3">
+<h4 id="dhrystone"><a class="anchor" href="#dhrystone"></a><a class="link" href="#dhrystone">21.8.1. Dhrystone</a></h4>
+<div class="paragraph">
+<p><a href="https://en.wikipedia.org/wiki/Dhrystone" class="bare">https://en.wikipedia.org/wiki/Dhrystone</a></p>
+</div>
+<div class="paragraph">
+<p>Created in the 80&#8217;s, it is not a representative measure of performance in modern computers anymore. It has mostly been replaced by <a href="https://en.wikipedia.org/wiki/SPECint">SPEC</a>, which is&#8230;&#8203; closed source! Unbelievable.</p>
+</div>
+<div class="paragraph">
+<p><a href="#buildroot">Buildroot</a> has a <code>dhrystone</code> package, but because it is so interesting to us, we decided to also build it ourselves, which makes things like static and baremetal compilation easier.</p>
+</div>
+<div class="paragraph">
+<p>Build and run on QEMU <a href="#user-mode-simulation">User mode simulation</a>:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>git submodule update --init submodules/dhrystone
+./build-dhrystone --mode userland
+./run --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Build and run on gem5 user mode:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./build-dhrystone --mode userland --static --force-rebuild
+./run --emulator gem5 --userland "$(./getvar userland_build_dir)/submodules/dhrystone/dhrystone"</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>TODO: automate running it more nicely.</p>
+</div>
+<div class="paragraph">
+<p>Build for <a href="#baremetal">Baremetal</a> execution and run it in baremetal QEMU:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre># Build our Newlib stubs.
+./build-baremetal --arch aarch64
+./build-dhrystone --arch aarch64 --mode baremetal
+./run --arch aarch64 --baremetal "$(./getvar baremetal_build_dir)/submodules/dhrystone/dhrystone"</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>TODO: fix the build. We just need to factor out all run arguments from <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/build-baremetal">build-baremetal</a> into <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/common.py">common.py</a>, and it should just work, since there are no missing syscalls.</p>
+</div>
+<div class="paragraph">
+<p>If you really want the Buildroot package for some reason, build it with:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./build-buildroot --config 'BR2_PACKAGE_DHRYSTONE=y'</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>and run inside the guest from <code>PATH</code> with:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>dhrystone</pre>
+</div>
+</div>
+</div>
+<div class="sect3">
+<h4 id="parsec-benchmark"><a class="anchor" href="#parsec-benchmark"></a><a class="link" href="#parsec-benchmark">21.8.2. PARSEC benchmark</a></h4>
+<div class="paragraph">
+<p>We have ported parts of the <a href="http://parsec.cs.princeton.edu">PARSEC benchmark</a> for cross compilation at: <a href="https://github.com/cirosantilli/parsec-benchmark" class="bare">https://github.com/cirosantilli/parsec-benchmark</a>. See the documentation on that repo to find out which benchmarks have been ported. Some of the benchmarks are segfaulting; they are documented in that repo.</p>
+</div>
+<div class="paragraph">
+<p>There are two ways to run PARSEC with this repo:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p><a href="#parsec-benchmark-without-parsecmgmt">without <code>parsecmgmt</code></a>, most likely what you want</p>
+</li>
+<li>
+<p><a href="#parsec-benchmark-with-parsecmgmt">with <code>parsecmgmt</code></a></p>
+</li>
+</ul>
+</div>
+<div class="sect4">
+<h5 id="parsec-benchmark-without-parsecmgmt"><a class="anchor" href="#parsec-benchmark-without-parsecmgmt"></a><a class="link" href="#parsec-benchmark-without-parsecmgmt">21.8.2.1. PARSEC benchmark without parsecmgmt</a></h5>
+<div class="literalblock">
+<div class="content">
+<pre>./build --arch arm --download-dependencies gem5-buildroot parsec-benchmark
+./build-buildroot --arch arm --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y'
+./run --arch arm --emulator gem5</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Once inside the guest, launch one of the <code>test</code> input sized benchmarks manually as in:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>cd /parsec/ext/splash2x/apps/fmm/run
+../inst/arm-linux.gcc/bin/fmm 1 &lt; input_1</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>To find out how to run many of the benchmarks, have a look at the <code>test.sh</code> script of the <code>parsec-benchmark</code> repo.</p>
+</div>
+<div class="paragraph">
+<p>From the guest, you can also run it as:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>cd /parsec
+./test.sh</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>but this might be a bit time-consuming in gem5.</p>
+</div>
+</div>
+<div class="sect4">
+<h5 id="parsec-change-the-input-size"><a class="anchor" href="#parsec-change-the-input-size"></a><a class="link" href="#parsec-change-the-input-size">21.8.2.2. PARSEC change the input size</a></h5>
+<div class="paragraph">
+<p>Running a benchmark of a size different from <code>test</code>, e.g. <code>simsmall</code>, requires a rebuild with:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./build-buildroot \
+  --arch arm \
+  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
+  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_INPUT_SIZE="simsmall"' \
+  -- parsec_benchmark-reconfigure \
+;</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Large inputs may also require tweaking:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p><a href="#br2-target-rootfs-ext2-size">BR2_TARGET_ROOTFS_EXT2_SIZE</a> if the unpacked inputs are large</p>
+</li>
+<li>
+<p><a href="#memory-size">Memory size</a>, unless you want to meet the OOM killer, which is admittedly kind of fun</p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p><code>test.sh</code> only contains the run commands for the <code>test</code> size, and cannot be used for <code>simsmall</code>.</p>
+</div>
+<div class="paragraph">
+<p>The easiest thing to do is to <a href="https://superuser.com/questions/231002/how-can-i-search-within-the-output-buffer-of-a-tmux-shell/1253137#1253137">scroll up on the host shell</a> after the build and look for a line of the form:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>Running /root/linux-kernel-module-cheat/out/aarch64/buildroot/build/parsec-benchmark-custom/ext/splash2x/apps/ocean_ncp/inst/aarch64-linux.gcc/bin/ocean_ncp -n2050 -p1 -e1e-07 -r20000 -t28800</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>and then tweak the command found in <code>test.sh</code> accordingly.</p>
+</div>
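Instead of scrolling, you could also save the build output to a file and filter it. The sketch below assumes that the log contains <code>Running</code> lines like the one above, and that the host directory <code>parsec-benchmark-custom/</code> corresponds to <code>/parsec/</code> in the guest, as the <code>fmm</code> example suggests; <code>parsec_guest_cmds</code> is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the benchmark commands that the build ran on the
# host, rewritten to guest paths. ASSUMPTION: the host source tree
# .../build/parsec-benchmark-custom/ maps to /parsec/ in the guest.
# Capture the log first, e.g.:
#   ./build-buildroot ... 2>&1 | tee build.log
parsec_guest_cmds() {
  grep '^Running ' "$1" |
    sed -e 's/^Running //' \
        -e 's|^[^ ]*/parsec-benchmark-custom/|/parsec/|'
}
```

Usage: <code>parsec_guest_cmds build.log</code>. Some benchmarks read their inputs relative to the current directory (e.g. <code>&lt; input_1</code> for <code>fmm</code>), so you may still have to <code>cd</code> into the benchmark's <code>run</code> directory before running the printed command.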
+<div class="paragraph">
+<p>Yes, we do run the benchmarks on the host just to unpack / generate inputs. They are expected to fail to run, since they were built for the guest rather than the host; this includes the x86_64 guest, whose executables use a different interpreter than the host&#8217;s (see <code>file myexecutable</code>).</p>
+</div>
+<div class="paragraph">
+<p>The rebuild is required because we unpack input files on the host.</p>
+</div>
+<div class="paragraph">
+<p>Separating input sizes also allows us to create smaller images when only the smaller benchmarks will be run.</p>
+</div>
+<div class="paragraph">
+<p>This limitation exists because <code>parsecmgmt</code> generates the input files with its Bash scripts just before running each benchmark, and we can&#8217;t run <code>parsecmgmt</code> on gem5 as it is too slow!</p>
+</div>
+<div class="paragraph">
+<p>One option would be to generate the inputs with <code>parsecmgmt</code> inside the guest under QEMU instead.</p>
+</div>
+<div class="paragraph">
+<p>Also, we can&#8217;t generate all input sizes at once, because many of them have the same name and would overwrite one another&#8230;&#8203;</p>
+</div>
+<div class="paragraph">
+<p>PARSEC simply wasn&#8217;t designed with non-native machines in mind&#8230;&#8203;</p>
+</div>
+</div>
+<div class="sect4">
+<h5 id="parsec-benchmark-with-parsecmgmt"><a class="anchor" href="#parsec-benchmark-with-parsecmgmt"></a><a class="link" href="#parsec-benchmark-with-parsecmgmt">21.8.2.3. PARSEC benchmark with parsecmgmt</a></h5>
+<div class="paragraph">
+<p>Most users won&#8217;t want to use this method because:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>running the <code>parsecmgmt</code> Bash scripts takes forever before it ever starts running the actual benchmarks on gem5</p>
+<div class="paragraph">
+<p>Running on QEMU is feasible, but not the main use case, since QEMU cannot be used for performance measurements</p>
+</div>
+</li>
+<li>
+<p>it requires putting the full <code>.tar</code> inputs on the guest, which makes the image twice as large (1x for the <code>.tar</code>, 1x for the unpacked input files)</p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>It would be awesome if it were possible to use this method, since this is what PARSEC officially supports, and so:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>you don&#8217;t have to dig into what raw command to run</p>
+</li>
+<li>
+<p>there is an easy way to run all the benchmarks in one go to test them out</p>
+</li>
+<li>
+<p>you can just run any of the benchmarks that you want</p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>but it simply is not feasible in gem5 because it takes too long.</p>
+</div>
+<div class="paragraph">
+<p>If you still want to run this, try it out with:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./build-buildroot \
+  --arch aarch64 \
+  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
+  --config 'BR2_PACKAGE_PARSEC_BENCHMARK_PARSECMGMT=y' \
+  --config 'BR2_TARGET_ROOTFS_EXT2_SIZE="3G"' \
+  -- parsec_benchmark-reconfigure \
+;</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>And then you can run it just as you would on the host:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>cd /parsec/
+bash
+. env.sh
+parsecmgmt -a run -p splash2x.fmm -i test</pre>
+</div>
+</div>
+</div>
+<div class="sect4">
+<h5 id="parsec-uninstall"><a class="anchor" href="#parsec-uninstall"></a><a class="link" href="#parsec-uninstall">21.8.2.4. PARSEC uninstall</a></h5>
+<div class="paragraph">
+<p>Buildroot doesn&#8217;t provide an automated package removal mechanism, as mentioned at: <a href="#remove-buildroot-packages">Section 20.6, &#8220;Remove Buildroot packages&#8221;</a>, but if you want to remove PARSEC later, the following procedure should be satisfactory:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>rm -rf \
+  "$(./getvar buildroot_download_dir)"/parsec-* \
+  "$(./getvar buildroot_build_dir)"/build/parsec-* \
+  "$(./getvar buildroot_build_dir)"/build/packages-file-list.txt \
+  "$(./getvar buildroot_build_dir)"/images/rootfs.* \
+  "$(./getvar buildroot_build_dir)"/target/parsec-* \
+;
+./build-buildroot --arch arm</pre>
+</div>
+</div>
+</div>
+<div class="sect4">
+<h5 id="parsec-benchmark-hacking"><a class="anchor" href="#parsec-benchmark-hacking"></a><a class="link" href="#parsec-benchmark-hacking">21.8.2.5. PARSEC benchmark hacking</a></h5>
+<div class="paragraph">
+<p>If you end up going inside <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/submodules/parsec-benchmark">submodules/parsec-benchmark</a> to hack up the benchmark (you will!), these tips will be helpful.</p>
+</div>
+<div class="paragraph">
+<p>Buildroot was not designed to deal with large images, and currently cross rebuilds are a bit slow, due to some image generation and validation steps.</p>
+</div>
+<div class="paragraph">
+<p>A few workarounds are:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>develop on the host first as much as you can: our PARSEC fork supports it.</p>
+<div class="paragraph">
+<p>If you do this, don&#8217;t forget to run:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>cd "$(./getvar parsec_source_dir)"
+git clean -xdf .</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>before going back to the cross-compiled build.</p>
+</div>
+</li>
+<li>
+<p>patch Buildroot to work well, and keep cross-compiling all the way. This should be totally viable, and we should do it.</p>
+<div class="paragraph">
+<p>Don&#8217;t forget to explicitly rebuild PARSEC with:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./build-buildroot \
+  --arch arm \
+  --config 'BR2_PACKAGE_PARSEC_BENCHMARK=y' \
+  -- parsec_benchmark-reconfigure \
+;</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>You may also want to first check in QEMU, which is a faster emulator, that your patches are still functionally correct.</p>
+</div>
+</li>
+<li>
+<p>sell your soul, and compile natively inside the guest. We won&#8217;t do this, not only because it is evil, but also because Buildroot explicitly does not support it: <a href="https://buildroot.org/downloads/manual/manual.html#faq-no-compiler-on-target" class="bare">https://buildroot.org/downloads/manual/manual.html#faq-no-compiler-on-target</a>. ARM employees have been known to do this: <a href="https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/qemu-patch.diff" class="bare">https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/qemu-patch.diff</a></p>
+</li>
+</ul>
+</div>
+</div>
+</div>
+</div>
+<div class="sect2">
+<h3 id="userland-content-bibliography"><a class="anchor" href="#userland-content-bibliography"></a><a class="link" href="#userland-content-bibliography">21.9. Userland content bibliography</a></h3>
 <div class="ulist">
 <ul>
 <li>
@@ -30734,10 +30958,10 @@ instructions 124346081</pre>
 <p>For example, the simplest scalable CPU content would be a busy loop: <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/gcc/busy_loop.c">userland/gcc/busy_loop.c</a>, so let&#8217;s focus on that for now.</p>
 </div>
 <div class="paragraph">
-<p>Summary of manually collected results on <a href="#p51">P51</a> at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: <a href="#table-busy-loop-dmips">Table 7, &#8220;Busy loop DMIPS for different simulator setups&#8221;</a>. As expected, the less native / more detailed / more complex simulations are slower!</p>
+<p>Summary of manually collected results on <a href="#p51">P51</a> at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: <a href="#table-busy-loop-dmips">Table 7, &#8220;Busy loop MIPS for different simulator setups&#8221;</a>. As expected, the less native / more detailed / more complex simulations are slower!</p>
 </div>
 <table id="table-busy-loop-dmips" class="tableblock frame-all grid-all stretch">
-<caption class="title">Table 7. Busy loop DMIPS for different simulator setups</caption>
+<caption class="title">Table 7. Busy loop MIPS for different simulator setups</caption>
 <colgroup>
 <col style="width: 20%;">
 <col style="width: 20%;">
@@ -30783,18 +31007,25 @@ instructions 124346081</pre>
 <td class="tableblock halign-left valign-top"><p class="tableblock">1.1018128 * 10^7</p></td>
 <td class="tableblock halign-left valign-top"><p class="tableblock">0.2</p></td>
 </tr>
+<tr>
+<td class="tableblock halign-left valign-top"><p class="tableblock"><code>gem5 --arch aarch64 --gem5-build-id MOESI_CMP_directory -- --cpu-type DerivO3CPU --caches --ruby</code></p></td>
+<td class="tableblock halign-left valign-top"><p class="tableblock">1 * 1000000 = 10^6</p></td>
+<td class="tableblock halign-left valign-top"><p class="tableblock">63</p></td>
+<td class="tableblock halign-left valign-top"><p class="tableblock">1.1005150 * 10^7</p></td>
+<td class="tableblock halign-left valign-top"><p class="tableblock">0.2</p></td>
+</tr>
 </tbody>
 </table>
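The MIPS column can be cross-checked from the other columns. Assuming the third column is the run time in seconds and the fourth the instruction count, MIPS is instructions / (seconds * 10^6): for the Ruby row, 1.1005150 * 10^7 / 63 / 10^6 is about 0.17, which rounds to the 0.2 in the table. A minimal sketch, where <code>mips</code> is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: millions of instructions per second, given an
# instruction count ($1) and a run time in seconds ($2).
mips() {
  awk -v insts="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", insts / secs / 1e6 }'
}
```

For example, <code>mips 1.1005150e7 63</code> prints <code>0.17</code>.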
 <div class="paragraph">
 <p>The first step is to determine a number of loops that will run long enough to have meaningful results, but not too long that we will get bored.</p>
 </div>
 <div class="paragraph">
-<p>On our <a href="#p51">P51</a> machine, we found 10^7 (10 million == 1000 times 10000) loops to be a good number:</p>
+<p>On our <a href="#p51">P51</a> machine, we found 10^7 (10 million) loops, passed as <code>1 10000000</code> below, to be a good number for a gem5 atomic simulation:</p>
 </div>
 <div class="literalblock">
 <div class="content">
-<pre>./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --userland-args '1000 10000' --static
-./get-stat sim_insts</pre>
+<pre>./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --userland-args '1 10000000' --static
+./gem5-stat --arch aarch64 sim_insts</pre>
 </div>
 </div>
 <div class="paragraph">
@@ -30898,10 +31129,10 @@ instructions 124346081</pre>
 <p>Let&#8217;s see if user mode runs considerably faster than full system or not, ignoring the kernel boot.</p>
 </div>
 <div class="paragraph">
-<p>First we build <a href="#dhrystone">Dhrystone</a> manually statically since dynamic linking is broken in gem5 as explained at: <a href="#gem5-syscall-emulation-mode">Section 10.6, &#8220;gem5 syscall emulation mode&#8221;</a>.</p>
+<p>First we build <a href="#dhrystone">Dhrystone</a> manually and statically, since dynamic linking is broken in gem5 as explained at: <a href="#gem5-syscall-emulation-mode">Section 10.7, &#8220;gem5 syscall emulation mode&#8221;</a>.</p>
 </div>
 <div class="paragraph">
-<p>TODO: move this section to our new custom dhrystone setup: <a href="#dhrystone">Section 19.2.3.1, &#8220;Dhrystone&#8221;</a>.</p>
+<p>TODO: move this section to our new custom Dhrystone setup: <a href="#dhrystone">Section 21.8.1, &#8220;Dhrystone&#8221;</a>.</p>
 </div>
 <div class="paragraph">
 <p>gem5 user mode:</p>
@@ -33096,7 +33327,7 @@ echo $?</pre>
 <p>gem5: <a href="#m5-fail">m5 fail</a> works on all archs</p>
 </li>
 <li>
-<p>user mode: QEMU forwards exit status, for gem5 we do some log parsing as described at: <a href="#gem5-syscall-emulation-exit-status">Section 10.6.1, &#8220;gem5 syscall emulation exit status&#8221;</a></p>
+<p>user mode: QEMU forwards exit status, for gem5 we do some log parsing as described at: <a href="#gem5-syscall-emulation-exit-status">Section 10.7.1, &#8220;gem5 syscall emulation exit status&#8221;</a></p>
 </li>
 </ul>
 </div>
