Or if things get really involved and you want a debug script:
printf 'break qemu_add_opts
run
' > data/vm.gdb
./run --debug-vm-file data/vm.gdb
Our default emulator builds are optimized with gcc -O2 -g. To use -O0 instead, build and run with:
Other related fs.py options are:
--abs-max-tick: set the maximum guest simulation time. The same scale as the ExecAll trace is used. E.g., for the above example with 3 instructions, the same trace would be achieved with a value of 3000.
The message also shows on User mode simulation deadlocks, for example in userland/posix/pthread_deadlock.c:
userland/cpp/atomic/aarch64_add.cpp: non synchronized aarch64 inline assembly
userland/cpp/atomic/aarch64_ldaxr_stlxr.cpp: see: ARM LDXR and STXR instructions
userland/cpp/atomic/aarch64_ldadd.cpp: synchronized aarch64 inline assembly with the ARM Large System Extensions (LSE) LDADD instruction
TODO let’s understand that fully one day.
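To make the LDADD item above more concrete, here is a minimal hedged sketch of an LSE-based fetch-and-add as C inline assembly. The function name and operand constraints are ours, it is not the in-tree aarch64_ldadd.cpp, and it must be compiled with -march=armv8.1-a or later:

#include <stdint.h>

/* Atomically do: old = *addr; *addr += val; and return old.
 * This is the relaxed LDADD; LDADDAL would add acquire-release ordering. */
uint64_t lse_fetch_add(uint64_t *addr, uint64_t val) {
    uint64_t old;
    __asm__ volatile (
        "ldadd %[val], %[old], [%[addr]]"
        : [old] "=&r" (old)
        : [val] "r" (val), [addr] "r" (addr)
        : "memory"
    );
    return old;
}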
emulator topics:
Illustrates how to add extra non-code data files to an NPM package, and then use those files at runtime.
userland/arch/aarch64/inline_asm/inc_32.c: how to use 32-bit w registers in aarch64. We have to add w to the % as in %w[io] instead of %[io]
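As a quick hedged illustration of that modifier (a toy function of our own, not the in-tree inc_32.c):

#include <stdint.h>

/* Increment a 32-bit value through a w register: %w[io] prints w0/w1/...,
 * while %[io] would print the 64-bit x register name. */
uint32_t inc32(uint32_t io) {
    __asm__ ("add %w[io], %w[io], 1" : [io] "+r" (io));
    return io;
}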
This syscall is rarely used on its own, and there isn’t even a glibc wrapper for it: you almost always just want to use the pthreads or C++ multithreading wrappers which use it for you to implement higher level constructs like mutexes.
Futexes are a bit complicated, because in order to achieve their efficiency, basically nothing is guaranteed: the wait might not wait, and the wakes might not wake.
So you are basically forced to use atomic operations on the futex memory address in order to be sure of anything (we encourage you to try without :-)).
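To make the raw interface concrete before looking at the in-tree files, here is a minimal hedged sketch of a FUTEX_WAIT / FUTEX_WAKE pair guarded by an atomic word. All names are ours and this is not userland/linux/futex.c; build with -pthread on Linux:

#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int word;

/* Thin wrapper over the raw system call: there is no glibc futex() function. */
static long futex(atomic_int *uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void *waiter(void *arg) {
    (void)arg;
    /* Loop: FUTEX_WAIT returns immediately if *uaddr != val, and may also
     * wake spuriously, so only the atomic load tells us the real state. */
    while (atomic_load(&word) == 0)
        futex(&word, FUTEX_WAIT, 0);
    puts("waiter woken");
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    sleep(1);
    atomic_store(&word, 1);      /* publish the state change */
    futex(&word, FUTEX_WAKE, 1); /* wake at most one waiter */
    pthread_join(t, NULL);
    return 0;
}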
Minimal examples:
lkmc/futex.h: our futex wrapper
userland/linux/futex.c: minimal example. It:
first spawns a child
then sleeps for 1 second and wakes up the futex if anyone is sleeping on it
the child sleeps on the futex if it reaches that futex before the end of the parent’s sleep (likely). If it did reach that FUTEX_WAIT there, it gets awoken by the parent.
So what you see is:
main start
child start
[wait 1s]
parent after sleep
child after parent sleep
Parent section: Userland multithreading.
ARMv8 architecture reference manual db A1.7.3 "The ARMv8.1 architecture extension"
Parent section: atomic.cpp
ARMv8 architecture reference manual db "ARMv8.1-LSE, ARMv8.1 Large System Extensions"
LDXR and STXR vs LDAXR and STLXR: https://stackoverflow.com/questions/21535058/arm64-ldxr-stxr-vs-ldaxr-stlxr TODO understand better and add an example.
LDXR and STXR implement a so-called "Load-link/store-conditional" (LLSC) pattern: https://en.wikipedia.org/wiki/Load-link/store-conditional which appears in many RISC ISAs.
The pattern works basically as follows:
LDXR marks an address for exclusive access by the current CPU
STXR:
marks the address as no longer exclusive to other CPUs that may have done an LDXR before
stores fine if the address is still marked as exclusive to this CPU, and writes 0 to a third register to indicate success
fails to store if the address is not, and writes 1 to the third register to indicate failure
In case of failure, we just have to loop back to just before the LDXR and try again.
This is therefore basically a spinlock and should only be used to cover very short critical sections such as atomic increments, as sketched below.
C++ std::atomic uses this for increments before the ARMv8.1 ARM Large System Extensions (LSE): https://stackoverflow.com/questions/56810/how-do-i-start-threads-in-plain-c/52453291#52453291
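Here is a minimal hedged sketch of such a retry loop as C inline assembly. The function name and operand constraints are ours; the in-tree example, which uses the acquire/release variants, is userland/cpp/atomic/aarch64_ldaxr_stlxr.cpp:

#include <stdint.h>

/* Atomically increment *addr with an LDXR/STXR retry loop. */
void llsc_increment(uint64_t *addr) {
    uint64_t value;
    uint32_t failed;
    __asm__ volatile (
        "1:\n"
        /* Load *addr and mark the address for exclusive access. */
        "ldxr %[value], [%[addr]]\n"
        "add %[value], %[value], 1\n"
        /* Store back only if the address is still exclusive:
         * failed == 0 on success, 1 on failure. */
        "stxr %w[failed], %[value], [%[addr]]\n"
        /* On failure, loop back to the LDXR and try again. */
        "cbnz %w[failed], 1b\n"
        : [value] "=&r" (value), [failed] "=&r" (failed)
        : [addr] "r" (addr)
        : "memory"
    );
}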
Set of atomic and synchronization primitives added in the ARMv8.1 architecture extension.
Documented at ARMv8 architecture reference manual db "ARMv8.1-LSE, ARMv8.1 Large System Extensions"
ARMv8 architecture reference manual db A1.7.3 "The ARMv8.1 architecture extension"
Good getting started tutorials:
The official manuals were stored in http://infocenter.arm.com but as of 2017 they started to slowly move to https://developer.arm.com.
Bibliography: https://www.quora.com/Where-can-I-find-the-official-documentation-of-ARM-instruction-set-architectures-ISAs
ARM also releases documentation specific to each given processor.
This adds extra details to the more portable ARMv8 architecture reference manual ISA documentation.
SEV is not the only thing that can wake up a WFE, it is only an explicit software way to do it. Notably, global monitor operations on memory accesses of regions marked by LDAXR and STLXR instructions can also wake up a WFE sleeping core. This is done to allow spinlock releases to automatically wake up WFE sleeping cores at unlock time without the need for an explicit SEV.
Quotes for the above from ARMv8 architecture reference manual db G1.18.1 "Wait For Event and Send Event":
WFE and SEV are usable from userland, and are part of an efficient spinlock implementation (which userland should arguably stay away from in favor of the futex system call, which allows for non-busy sleep).
For how userland spinlocks and mutexes are implemented see Userland mutex implementation.
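As a reference point, here is a hedged sketch of what such a WFE-based lock can look like as C inline assembly. This is our own illustration of the classic sevl/wfe/ldaxr pattern, not code taken from Linux or from this repository:

#include <stdint.h>

/* Spin until *lock becomes 0, then set it to 1, sleeping with WFE while it is
 * held. The unlock store clears our exclusive monitor, which wakes the WFE. */
void spin_lock(uint32_t *lock) {
    uint32_t tmp, failed;
    __asm__ volatile (
        "sevl\n"                                /* make the first wfe fall through */
        "1: wfe\n"                              /* sleep until some event arrives */
        "2: ldaxr %w[tmp], [%[lock]]\n"         /* load-acquire, mark exclusive */
        "cbnz %w[tmp], 1b\n"                    /* still held: go back to sleep */
        "stxr %w[failed], %w[one], [%[lock]]\n" /* try to take the lock */
        "cbnz %w[failed], 2b\n"                 /* lost the monitor: retry */
        : [tmp] "=&r" (tmp), [failed] "=&r" (failed)
        : [lock] "r" (lock), [one] "r" (1)
        : "memory"
    );
}

void spin_unlock(uint32_t *lock) {
    /* Store-release of 0; also clears waiters' exclusive monitors, waking their WFE. */
    __asm__ volatile ("stlr wzr, [%[lock]]" : : [lock] "r" (lock) : "memory");
}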
There is a control bit SCTLR_EL1.nTWE that determines if WFE is trapped or not: if that bit is cleared, then EL0 execution of WFE is trapped and raises an exception into EL1.
Linux v5.2.1 does not trap WFE however, tested with gem5 tracing with --trace ExecAll,Faults and the dump_regs kernel module in a full system simulation.
The kernel seems to set nTWE (which disables the trap) at:
include/asm/sysreg.h
#define SCTLR_EL1_SET (SCTLR_ELx_M | SCTLR_ELx_C | SCTLR_ELx_SA |\
                       ...
                       SCTLR_EL1_NTWE | SCTLR_ELx_IESB | SCTLR_EL1_SPAN |\
and:
mm/proc.S
/*
 * Prepare SCTLR
 */
    mov_q x0, SCTLR_EL1_SET
To reduce the number of instructions in our trace, we first boot, and then restore a checkpoint after boot with gem5 checkpoint restore and run a different script with a restore command that runs userland/arch/aarch64/freestanding/linux/wfe_wfe.S:
./run --arch aarch64 --emulator gem5 --gem5-worktree master --gem5-restore 1 --gem5-readfile 'arch/aarch64/freestanding/linux/wfe_wfe.out' --trace ExecAll,Faults,FmtFlag,Thread
On the traces, we search for wfe, and there are just two hits, so they must be our instructions!
The traces then look like this at LKMC 777b7cbbd1d553baf2be9bc2075102be740054dd:
112285501668497000: Thread: system.cpu: suspend contextId 0
112285501668497000: ExecEnable: system.cpu: A0 T0 : 0x400078 : wfe : IntAlu : D=0x0000000000000000 flags=(IsSerializeAfter|IsNonSpeculative|IsQuiesce|IsUnverifiable)
112285501668497501: Thread: system.cpu: activate contextId 0
112285501668498000: Thread: system.cpu: suspend contextId 0
112285501668498000: ExecEnable: system.cpu: A0 T0 : 0x40007c : wfe : IntAlu : D=0x0000000000000000 flags=(IsSerializeAfter|IsNonSpeculative|IsQuiesce|IsUnverifiable)
112285501909320284: Thread: system.cpu: activate contextId 0
112285501909320500: Faults: IRQ: Invoking Fault (AArch64 target EL):IRQ cpsr:0x4003c5 PC:0x400080 elr:0x400080 newVec: 0xffffff8010082480
112285501909320500: ExecEnable: system.cpu: A0 T0 : @vectors+1152 : nop : IntAlu : flags=(IsNop)
112285501909321000: ExecEnable: system.cpu: A0 T0 : @vectors+1156 : nop : IntAlu : flags=(IsNop)

[more exception handler, no ERET here]

112285501923080500: ExecEnable: system.cpu: A0 T0 : @finish_ret_to_user+188 : ldr x30, [sp, #240] : MemRead : D=0x0000000000000000 A=0xffffff8010cb3fb0 flags=(IsInteger|IsMemRef|IsLoad)
112285501923081000: ExecEnable: system.cpu: A0 T0 : @finish_ret_to_user+192 : add sp, sp, #320 : IntAlu : D=0xffffff8010cb4000 flags=(IsInteger)
112285501923081500: ExecEnable: system.cpu: A0 T0 : 0xffffff8010084144 : eret : IntAlu : D=0x0000000000000001 flags=(IsControl|IsSerializeAfter|IsNonSpeculative|IsSquashAfter)
112285501923082000: ExecEnable: system.cpu: A0 T0 : 0x400080 : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
112285501923082500: ExecEnable: system.cpu: A0 T0 : 0x400084 : movz x8, #93, #0 : IntAlu : D=0x000000000000005d flags=(IsInteger)
112285501923083000: ExecEnable: system.cpu: A0 T0 : 0x400088 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
so we conclude that:
the second WFE made the CPU stop running instructions at time 112285501668498000 and PC 0x40007c
the next thing that happened, a long time later (at 112285501909320500, while the following instruction would otherwise have run at 112285501668498000 + 1000), was an interrupt, presumably from the ARM timer
after a few interrupt handler instructions, the first ERET instruction exits the handler and comes back directly to the instruction after the WFE, at PC 0x400080 == 0x40007c + 4
Therefore, a WFE in userland is treated much like a busy loop by the Linux kernel: the kernel does not seem to try and explicitly make room for other processes as would happen on a futex.
gem5 390a74f59934b85d91489f8a563450d8321b602d does not sleep on the first WFE in either syscall emulation or full system, because the code does:
First, also consider the userland bibliography: Section 24.9, “ARM assembly bibliography”.
The most useful ARM baremetal example sets we’ve seen so far are: