diff --git a/index.html b/index.html
index 31505b1..b24c3c7 100644
--- a/index.html
+++ b/index.html
@@ -1280,7 +1280,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
 <li><a href="#what-is-the-coherency-protocol-implemented-by-the-classic-cache-system-in-gem5">19.20.4.3.1. What is the coherency protocol implemented by the classic cache system in gem5?</a></li>
 </ul>
 </li>
-<li><a href="#gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus">19.20.4.4. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs</a></li>
+<li><a href="#gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus">19.20.4.4. gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs</a>
+<ul class="sectlevel5">
+<li><a href="#gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus-and-ruby">19.20.4.4.1. gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs and Ruby</a></li>
+</ul>
+</li>
 <li><a href="#gem5-event-queue-minorcpu-syscall-emulation-freestanding-example-analysis">19.20.4.5. gem5 event queue MinorCPU syscall emulation freestanding example analysis</a></li>
 <li><a href="#gem5-event-queue-derivo3cpu-syscall-emulation-freestanding-example-analysis">19.20.4.6. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis</a></li>
 </ul>
@@ -1389,7 +1393,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
 </li>
 </ul>
 </li>
-<li><a href="#c-multithreading">21.1.2. C multithreading</a></li>
+<li><a href="#c-multithreading">21.1.2. C multithreading</a>
+<ul class="sectlevel4">
+<li><a href="#atomic-c">21.1.2.1. atomic.c</a></li>
+</ul>
+</li>
 <li><a href="#gcc-c-extensions">21.1.3. GCC C extensions</a>
 <ul class="sectlevel4">
 <li><a href="#c-empty-struct">21.1.3.1. C empty struct</a></li>
@@ -1403,7 +1411,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
 <li><a href="#cpp-initialization-types">21.2.1. C++ initialization types</a></li>
 <li><a href="#cpp-multithreading">21.2.2. C++ multithreading</a>
 <ul class="sectlevel4">
-<li><a href="#atomic-cpp">21.2.2.1. atomic.cpp</a></li>
+<li><a href="#atomic-cpp">21.2.2.1. atomic.cpp</a>
+<ul class="sectlevel5">
+<li><a href="#detailed-gem5-analysis-of-how-data-races-happen">21.2.2.1.1. Detailed gem5 analysis of how data races happen</a></li>
+</ul>
+</li>
 <li><a href="#cpp-memory-order">21.2.2.2. C++ std::memory_order</a></li>
 <li><a href="#cpp-parallel-algorithms">21.2.2.3. C++ parallel algorithms</a></li>
 </ul>
@@ -2006,16 +2018,21 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
 <li><a href="#hardware-threads">32.1. Hardware threads</a></li>
 <li><a href="#cache-coherence">32.2. Cache coherence</a>
 <ul class="sectlevel3">
-<li><a href="#can-caches-snoop-data-from-other-caches">32.2.1. Can caches snoop data from other caches?</a></li>
-<li><a href="#vi-cache-coherence-protocol">32.2.2. VI cache coherence protocol</a></li>
-<li><a href="#msi-cache-coherence-protocol">32.2.3. MSI cache coherence protocol</a>
+<li><a href="#memory-consistency">32.2.1. Memory consistency</a>
 <ul class="sectlevel4">
-<li><a href="#msi-cache-coherence-protocol-with-transient-states">32.2.3.1. MSI cache coherence protocol with transient states</a></li>
+<li><a href="#sequential-consistency">32.2.1.1. Sequential Consistency</a></li>
 </ul>
 </li>
-<li><a href="#mesi-cache-coherence-protocol">32.2.4. MESI cache coherence protocol</a></li>
-<li><a href="#mosi-cache-coherence-protocol">32.2.5. MOSI cache coherence protocol</a></li>
-<li><a href="#moesi">32.2.6. MOESI cache coherence protocol</a></li>
+<li><a href="#can-caches-snoop-data-from-other-caches">32.2.2. Can caches snoop data from other caches?</a></li>
+<li><a href="#vi-cache-coherence-protocol">32.2.3. VI cache coherence protocol</a></li>
+<li><a href="#msi-cache-coherence-protocol">32.2.4. MSI cache coherence protocol</a>
+<ul class="sectlevel4">
+<li><a href="#msi-cache-coherence-protocol-with-transient-states">32.2.4.1. MSI cache coherence protocol with transient states</a></li>
+</ul>
+</li>
+<li><a href="#mesi-cache-coherence-protocol">32.2.5. MESI cache coherence protocol</a></li>
+<li><a href="#mosi-cache-coherence-protocol">32.2.6. MOSI cache coherence protocol</a></li>
+<li><a href="#moesi">32.2.7. MOESI cache coherence protocol</a></li>
 </ul>
 </li>
 </ul>
@@ -6670,6 +6687,23 @@ asdf=qwer</pre>
 <div class="paragraph">
 <p>Source: <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/linux/init_env_poweroff.c">userland/linux/init_env_poweroff.c</a>.</p>
 </div>
+<div class="paragraph">
+<p>As of the Linux kernel v5.7 (possibly earlier, I&#8217;ve skipped a few releases), boot also shows the init arguments and environment very clearly, which is a great addition:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>&lt;6&gt;[    0.309984] Run /sbin/init as init process
+&lt;7&gt;[    0.309991]   with arguments:
+&lt;7&gt;[    0.309997]     /sbin/init
+&lt;7&gt;[    0.310004]     nokaslr
+&lt;7&gt;[    0.310010]     -
+&lt;7&gt;[    0.310016]   with environment:
+&lt;7&gt;[    0.310022]     HOME=/
+&lt;7&gt;[    0.310028]     TERM=linux
+&lt;7&gt;[    0.310035]     earlyprintk=pl011,0x1c090000
+&lt;7&gt;[    0.310041]     lkmc_home=/lkmc</pre>
+</div>
+</div>
 <div class="sect3">
 <h4 id="init-arguments"><a class="anchor" href="#init-arguments"></a><a class="link" href="#init-arguments">6.4.1. init arguments</a></h4>
 <div class="paragraph">
@@ -15903,11 +15937,17 @@ ps</pre>
 <p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/kernel_modules/dump_regs.c">kernel_modules/dump_regs.c</a></p>
 </li>
 <li>
-<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/baremetal/arch/aarch64/dump_regs.c">baremetal/arch/aarch64/dump_regs.c</a></p>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/arch/arm/dump_regs.c">userland/arch/arm/dump_regs.c</a></p>
+</li>
+<li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/arch/aarch64/dump_regs.c">userland/arch/aarch64/dump_regs.c</a></p>
 </li>
 <li>
 <p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/baremetal/arch/arm/dump_regs.c">baremetal/arch/arm/dump_regs.c</a></p>
 </li>
+<li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/baremetal/arch/aarch64/dump_regs.c">baremetal/arch/aarch64/dump_regs.c</a></p>
+</li>
 </ul>
 </div>
 <div class="paragraph">
@@ -18568,14 +18608,36 @@ getconf _NPROCESSORS_CONF</pre>
 </div>
 </div>
 <div class="paragraph">
-<p>Or from <a href="#user-mode-simulation">User mode simulation</a>, we can use <a href="#sysconf">sysconf</a> with <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/linux/sysconf.c">userland/linux/sysconf.c</a> or <a href="#cpp-multithreading">C++ multithreading</a>'s <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/cpp/thread_hardware_concurrency.cpp">userland/cpp/thread_hardware_concurrency.cpp</a>:</p>
+<p>Or from <a href="#user-mode-simulation">User mode simulation</a>, we can use either of:</p>
 </div>
+<div class="ulist">
+<ul>
+<li>
+<p><a href="#sysconf">sysconf</a> with <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/linux/sysconf.c">userland/linux/sysconf.c</a></p>
 <div class="literalblock">
 <div class="content">
-<pre>./run --cpus 2 --emulator gem5 --userland userland/linux/sysconf.c | grep _SC_NPROCESSORS_ONLN
-./run --cpus 2 --emulator gem5 --userland userland/cpp/thread_hardware_concurrency.cpp</pre>
+<pre>./run --cpus 2 --emulator gem5 --userland userland/linux/sysconf.c | grep _SC_NPROCESSORS_ONLN</pre>
 </div>
 </div>
+</li>
+<li>
+<p><a href="#cpp-multithreading">C++ multithreading</a>'s <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/cpp/thread_hardware_concurrency.cpp">userland/cpp/thread_hardware_concurrency.cpp</a>:</p>
+<div class="literalblock">
+<div class="content">
+<pre>./run --cpus 2 --emulator gem5 --userland userland/cpp/thread_hardware_concurrency.cpp</pre>
+</div>
+</div>
+</li>
+<li>
+<p>direct access to several special filesystem files that contain this information, e.g. via <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/c/cat.c">userland/c/cat.c</a>:</p>
+<div class="literalblock">
+<div class="content">
+<pre>./run --cpus 2 --emulator gem5 --userland userland/c/cat.c --cli-args /proc/cpuinfo</pre>
+</div>
+</div>
+</li>
+</ul>
+</div>
 <div class="sect5">
 <h6 id="qemu-user-mode-multithreading"><a class="anchor" href="#qemu-user-mode-multithreading"></a><a class="link" href="#qemu-user-mode-multithreading">19.2.2.1.1. QEMU user mode multithreading</a></h6>
 <div class="paragraph">
@@ -21058,6 +21120,9 @@ Exiting @ tick 18446744073709551615 because simulate() limit reached</pre>
 <p>How it goes faster is explained at: <a href="https://stackoverflow.com/questions/59860091/how-to-increase-the-simulation-speed-of-a-gem5-run/59861375#59861375" class="bare">https://stackoverflow.com/questions/59860091/how-to-increase-the-simulation-speed-of-a-gem5-run/59861375#59861375</a></p>
 </div>
 <div class="paragraph">
+<p>Disables debug symbols (no <code>-g</code>) for some reason.</p>
+</div>
+<div class="paragraph">
 <p>Benchmarks present at:</p>
 </div>
 <div class="ulist">
@@ -21325,7 +21390,7 @@ cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"</pre>
 <p>This is the simplest of all protocols, and therefore the first one you should study to learn how Ruby works.</p>
 </div>
 <div class="paragraph">
-<p>To study it, we can take an approach similar to what was done at: <a href="#gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus">gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs</a>.</p>
+<p>To study it, we can take an approach similar to what was done at: <a href="#gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus">gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs</a>.</p>
 </div>
 <div class="paragraph">
 <p>Our full command line will be something like</p>
@@ -21367,7 +21432,7 @@ cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"</pre>
 <p>One simple example of its operation can be seen at: <a href="#gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis">Section 19.20.4.2, &#8220;gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis&#8221;</a></p>
 </div>
 <div class="paragraph">
-<p>But arguably interesting effects can only be observed when we have more than 1 CPUs as in <a href="#gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus">gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs</a>.</p>
+<p>But arguably interesting effects can only be observed when we have more than one CPU as in <a href="#gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus">gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs</a>.</p>
 </div>
 <div class="paragraph">
 <p>TODO: describe it in more detail. It appears to be a very simple mechanism.</p>
@@ -22010,6 +22075,16 @@ static EmbeddedPyBind embed_obj("BadDevice", module_init, "BasicPioDevice");</pr
 <p>To get a feeling of how <code>SimObject</code> objects are run, see: <a href="#gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis">gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis</a>.</p>
 </div>
 <div class="paragraph">
+<p>Bibliography:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p><a href="https://stackoverflow.com/questions/61910993/viewing-the-parameters-of-the-branch-predictor-in-gem5/61914449#61914449" class="bare">https://stackoverflow.com/questions/61910993/viewing-the-parameters-of-the-branch-predictor-in-gem5/61914449#61914449</a></p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
 <p>Tested on gem5 08c79a194d1a3430801c04f37d13216cc9ec1da3.</p>
 </div>
 </div>
@@ -22127,7 +22202,7 @@ for module in modules.keys():
 <p>And from <a href="#debug-gem5-python-scripts">IPDB</a> we see that this appears to loop over every object string of type <code>m5.objects.modulename</code>.</p>
 </div>
 <div class="paragraph">
-<p>This <code><em>init</em></code> gets called from <code>src/python/importer.py</code> at the <code>exec</code>:</p>
+<p>This <code>__init__</code> gets called from <code>src/python/importer.py</code> at the <code>exec</code>:</p>
 </div>
 <div class="literalblock">
 <div class="content">
@@ -24338,15 +24413,12 @@ type=SetAssociative</pre>
 <p>The actual representation is done via separate state bits: <a href="https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/cache/cache_blk.hh#L66" class="bare">https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/cache/cache_blk.hh#L66</a> and MOESI appears explicitly only on the pretty printing.</p>
 </div>
 <div class="paragraph">
-<p>This pretty printing appears for example in the <code>--trace Cache</code> lines as shown at <a href="#gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches">gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches</a>.</p>
+<p>This pretty printing appears for example in the <code>--trace Cache</code> lines as shown at <a href="#gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches">gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches</a> and with a few more transitions visible at <a href="#gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus">Section 19.20.4.4, &#8220;gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs&#8221;</a>.</p>
 </div>
 </div>
 </div>
 <div class="sect4">
-<h5 id="gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus"><a class="anchor" href="#gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus"></a><a class="link" href="#gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus">19.20.4.4. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs</a></h5>
-<div class="paragraph">
-<p>TODO is this the minimal setup that allows us to see the <a href="#gem5-crossbar-interconnect">gem5 crossbar interconnect</a>? Can we see anything in <code>AtomicSimpleCPU</code>?</p>
-</div>
+<h5 id="gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus"><a class="anchor" href="#gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus"></a><a class="link" href="#gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus">19.20.4.4. gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs</a></h5>
 <div class="paragraph">
 <p>It would be amazing to analyze a simple example with interconnect packets possibly invalidating caches of other CPUs.</p>
 </div>
@@ -24357,20 +24429,22 @@ type=SetAssociative</pre>
 <p>If we don&#8217;t use such instructions that flush memory, we would only see the interconnect at work when caches run out.</p>
 </div>
 <div class="paragraph">
-<p>For this study, we will use the same CLI as <a href="#gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis">gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis</a> but with multiple CPUs and a content like <a href="#atomic-cpp">atomic.cpp</a> which shares a variable across threads.</p>
+<p>For this study, we will use the same CLI as <a href="#gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis">gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis</a> but with multiple CPUs and a multithreaded program which shares a variable across threads.</p>
 </div>
 <div class="paragraph">
-<p>We will then focus on the behaviour or the memory of the shared variable to see if we can observe cache coherency on the crossbar:</p>
+<p>We can use <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/c/atomic.c">userland/c/atomic.c</a> (see also <a href="#c-multithreading">C multithreading</a>) at LKMC 7c01b29f1ee7da878c7cc9cb4565f3f3cf516a92 and gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1, as was done in <a href="#detailed-gem5-analysis-of-how-data-races-happen">Detailed gem5 analysis of how data races happen</a>:</p>
 </div>
 <div class="literalblock">
 <div class="content">
 <pre>./run \
   --arch aarch64 \
-  --cli-args '2 100' \
+  --cli-args '2 10' \
   --cpus 3 \
   --emulator gem5 \
-  --gem5-worktree master3 \
-  --userland userland/cpp/atomic/aarch64_add.cpp \
+  --trace FmtFlag,Cache,DRAM,ExecAll,XBar \
+  --userland userland/c/atomic.c \
+  -- \
+  --caches \
 ;</pre>
 </div>
 </div>
@@ -24383,6 +24457,218 @@ type=SetAssociative</pre>
 </div>
 <div class="title">Figure 5. <code>config.dot.svg</code> for a system with two TimingSimpleCPU with caches.</div>
 </div>
+<div class="paragraph">
+<p>Once again we focus on the shared function region <code>my_thread_main</code> which is where the interesting cross core memory collisions will be happening.</p>
+</div>
+<div class="paragraph">
+<p>As a maybe-not-so-interesting warm-up, we first have a look at the very first <code>my_thread_main</code> icache accesses:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>93946000: Cache: system.cpu1.icache: access for ReadReq [8b0:8b3] IF miss
+93946000: Cache: system.cpu1.icache: createMissPacket: created ReadCleanReq [880:8bf] IF from ReadReq [8b0:8b3] IF
+93946000: Cache: system.cpu1.icache: handleAtomicReqMiss: Sending an atomic ReadCleanReq [880:8bf] IF
+93946000: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[5] packet ReadCleanReq [880:8bf] IF
+93946000: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[5] packet ReadCleanReq [880:8bf] IF SF size: 1 lat: 1
+93946000: Cache: system.cpu0.icache: handleSnoop: snoop hit for ReadCleanReq [880:8bf] IF, old state is state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
+93946000: Cache: system.cpu0.icache: new state is state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
+93946000: DRAM: system.mem_ctrls: recvAtomic: ReadCleanReq 0x880
+93946000: Cache: system.cpu1.icache: handleAtomicReqMiss: Receive response: ReadResp [880:8bf] IF in state 0
+93946000: Cache: system.cpu1.icache: Block addr 0x880 (ns) moving from state 0 to state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
+93946000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main    :   sub   sp, sp, #48        : IntAlu :  D=0x0000003fffd6b9a0  flags=(IsInteger)
+93946500: Cache: system.cpu1.icache: access for ReadReq [8b4:8b7] IF hit state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
+93946500: Cache: system.cpu1.dcache: access for WriteReq [a19a8:a19af] hit state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0x14 set: 0x66 way: 0
+93946500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+4    :   str   x0, [sp, #8]       : MemWrite :  D=0x0000007ffffefc70 A=0x3fffd6b9a8  flags=(IsInteger|IsMemRef|IsStore)</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Now that we know how to read cache logs from <a href="#gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches">gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches</a>, it is easier to understand what happened:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>the physical address of <code>my_thread_main</code> is 0x8b0. The request for it is a miss, since this is the first time CPU1 goes near that region: CPU1 was previously executing standard library code far from our text segment</p>
+</li>
+<li>
+<p>CPU0 already has that cache line (0x880) in its cache at <a href="#what-is-the-coherency-protocol-implemented-by-the-classic-cache-system-in-gem5">state E of MOESI</a>, so it snoops and moves to S. We can look up the logs to see exactly where CPU0 had previously read that address:</p>
+<div class="literalblock">
+<div class="content">
+<pre>table: 1, dirty: 0
+59135500: Cache: system.cpu0.icache: Block addr 0x880 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
+59135500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[1] packet WritebackClean [8880:88bf]
+59135500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[1] packet WritebackClean [8880:88bf] SF size: 0 lat: 1
+59135500: DRAM: system.mem_ctrls: recvAtomic: WritebackClean 0x8880
+59135500: ExecEnable: system.cpu0: A0 T0 : @frame_dummy    : stp</pre>
+</div>
+</div>
+</li>
+<li>
+<p>the request does touch RAM: it does not get served by the other cache directly. CPU1 is now also at state S for the block</p>
+</li>
+<li>
+<p>the second cache request from CPU1 is 4 bytes further ahead, at 0x8b4, and this time it is of course a hit.</p>
+<div class="paragraph">
+<p>Since this is an STR, it also does a dcache access, to 0xA19A8 in this case near its stack SP, and it is a hit, which is not surprising, since basically stack accesses are the very first thing any C code does, and there must be some setup code running on CPU1 before <code>my_thread_main</code>.</p>
+</div>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>Now let&#8217;s look for the incremented integer address that is shared across threads. We know from <a href="#detailed-gem5-analysis-of-how-data-races-happen">Detailed gem5 analysis of how data races happen</a> that the read happens at <code>my_thread_main+36</code>, so we search for the first occurrence:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>93952500: Cache: system.cpu1.icache: access for ReadReq [8d4:8d7] IF hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x23 way: 0
+93952500: Cache: system.cpu1.dcache: access for ReadReq [2060:2063] miss
+93952500: Cache: system.cpu1.dcache: createMissPacket: created ReadSharedReq [2040:207f] from ReadReq [2060:2063]
+93952500: Cache: system.cpu1.dcache: handleAtomicReqMiss: Sending an atomic ReadSharedReq [2040:207f]
+93952500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[6] packet ReadSharedReq [2040:207f]
+93952500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[6] packet ReadSharedReq [2040:207f] SF size: 0 lat: 1
+93952500: DRAM: system.mem_ctrls: recvAtomic: ReadSharedReq 0x2040
+93952500: Cache: system.cpu1.dcache: handleAtomicReqMiss: Receive response: ReadResp [2040:207f] in state 0
+93952500: Cache: system.cpu1.dcache: Block addr 0x2040 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
+93952500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000000 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>so we determine that its physical address is 0x2060. It was a miss, and then it went into E.</p>
+</div>
+<div class="paragraph">
+<p>So we look ahead to the following accesses to that physical address, before CPU2 reaches that point of the code and starts making requests as well.</p>
+</div>
+<div class="paragraph">
+<p>First there is the STR for the first LDR which is of course a hit:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>93954500: Cache: system.cpu1.dcache: access for WriteReq [2060:2063] hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
+93954500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+52    :   str   x1, [x0]           : MemWrite :  D=0x0000000000000001 A=0x411060  flags=(IsInteger|IsMemRef|IsStore)</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>It found the line in E, so we presume that it moves to M. Then the second read confirms that it was in M:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>93964500: Cache: system.cpu1.dcache: access for ReadReq [2060:2063] hit state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
+93964500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000001 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>and so on.</p>
+</div>
+<div class="paragraph">
+<p>Now let&#8217;s jump to when CPU2 starts making requests.</p>
+</div>
+<div class="paragraph">
+<p>The first time this happens is on its first LDR at:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>94058500: Cache: system.cpu2.dcache: access for ReadReq [2060:2063] miss
+94058500: Cache: system.cpu2.dcache: createMissPacket: created ReadSharedReq [2040:207f] from ReadReq [2060:2063]
+94058500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Sending an atomic ReadSharedReq [2040:207f]
+94058500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet ReadSharedReq [2040:207f]
+94058500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet ReadSharedReq [2040:207f] SF size: 1 lat: 1
+94058500: Cache: system.cpu1.dcache: handleSnoop: snoop hit for ReadSharedReq [2040:207f], old state is state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
+94058500: Cache: system.cpu1.dcache: new state is state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
+94058500: CoherentXBar: system.membus: recvAtomicBackdoor: Not forwarding ReadSharedReq [2040:207f]
+94058500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Receive response: ReadResp [2040:207f] in state 0
+94058500: Cache: system.cpu2.dcache: Block addr 0x2040 (ns) moving from state 0 to state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
+94058500: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000009 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>and from this we see:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>CPU1 moves from M to O</p>
+</li>
+<li>
+<p>CPU2 moves from I to S</p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>It also appears that no DRAM was accessed since there are no logs for it, so did the XBar get the value directly from the other cache? TODO: why did the earlier <code>93946000: DRAM</code> read happen then, since CPU0 had the line when CPU1 asked for it?</p>
+</div>
+<div class="paragraph">
+<p>The above log sequence also makes it clear that it is the XBar that maintains coherency: it appears that the CPU2 cache tells the XBar what it is doing, and then the XBar tells the caches of other CPUs about it, which leads CPU1 to move to O.</p>
+</div>
+<div class="paragraph">
+<p>Then CPU1 hits its LDR on O:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>94060500: Cache: system.cpu1.dcache: access for ReadReq [2060:2063] hit state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
+94060500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000009 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>and then CPU2 writes, moving to M and moving CPU1 to I:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>94060500: Cache: system.cpu2.dcache: access for WriteReq [2060:2063] hit state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
+94060500: Cache: system.cpu2.dcache: createMissPacket: created UpgradeReq [2040:207f] from WriteReq [2060:2063]
+94060500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Sending an atomic UpgradeReq [2040:207f]
+94060500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet UpgradeReq [2040:207f]
+94060500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet UpgradeReq [2040:207f] SF size: 1 lat: 1
+94060500: Cache: system.cpu1.dcache: handleSnoop: snoop hit for UpgradeReq [2040:207f], old state is state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
+94060500: Cache: system.cpu1.dcache: new state is state: 0 (I) valid: 0 writable: 0 readable: 0 dirty: 0 | tag: 0xffffffffffffffff set: 0x81 way: 0
+94060500: CoherentXBar: system.membus: recvAtomicBackdoor: Not forwarding UpgradeReq [2040:207f]
+94060500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Receive response: UpgradeResp [2040:207f] in state 5
+94060500: Cache: system.cpu2.dcache: Block addr 0x2040 (ns) moving from state 5 to state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
+94060500: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+52    :   str   x1, [x0]           : MemWrite :  D=0x000000000000000a A=0x411060  flags=(IsInteger|IsMemRef|IsStore)</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>and so on, they just keep fighting over that address and changing one another&#8217;s state.</p>
+</div>
+<div class="sect5">
+<h6 id="gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus-and-ruby"><a class="anchor" href="#gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus-and-ruby"></a><a class="link" href="#gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus-and-ruby">19.20.4.4.1. gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs and Ruby</a></h6>
+<div class="paragraph">
+<p>Now let&#8217;s do the exact same thing we did for <a href="#gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus">gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs</a>, but with <a href="#gem5-ruby-build">Ruby</a> rather than the classic system.</p>
+</div>
+<div class="paragraph">
+<p>Since we have fully understood coherency in that previous example, it should now be easier to understand what is going on with Ruby:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./run \
+  --arch aarch64 \
+  --cli-args '2 10' \
+  --cpus 3 \
+  --emulator gem5 \
+  --trace FmtFlag,DRAM,ExecAll,Ruby \
+  --userland userland/c/atomic.c \
+  -- \
+  --cpu-type AtomicSimpleCPU \
+  --ruby \
+;</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>Note that now the <code>--trace Cache,XBar</code> flags have no effect, since Ruby replaces those classic memory model components entirely with the Ruby version, so we enable the <code>Ruby</code> flag version instead. Note however that this flag is very verbose and produces about 10x more output than the classic memory experiment.</p>
+</div>
+<div class="paragraph">
+<p>Also remember that ARM&#8217;s default Ruby protocol is <code>'MOESI_CMP_directory'</code>.</p>
+</div>
+<div class="paragraph">
+<p>First we note that the output of the experiment is the same:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>atomic 20
+non-atomic 19</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>TODO</p>
+</div>
+</div>
 </div>
 <div class="sect4">
 <h5 id="gem5-event-queue-minorcpu-syscall-emulation-freestanding-example-analysis"><a class="anchor" href="#gem5-event-queue-minorcpu-syscall-emulation-freestanding-example-analysis"></a><a class="link" href="#gem5-event-queue-minorcpu-syscall-emulation-freestanding-example-analysis">19.20.4.5. gem5 event queue MinorCPU syscall emulation freestanding example analysis</a></h5>
@@ -27059,6 +27345,13 @@ cd ../..
 <ul>
 <li>
 <p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/c/file_write_read.c">userland/c/file_write_read.c</a></p>
+<div class="ulist">
+<ul>
+<li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/c/cat.c">userland/c/cat.c</a>: a quick and dirty <code>cat</code> implementation for interactive <a href="#user-mode-simulation">User mode simulation</a> tests</p>
+</li>
+</ul>
+</div>
 </li>
 <li>
 <p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/linux/open_o_tmpfile.c">userland/linux/open_o_tmpfile.c</a>: <a href="https://stackoverflow.com/questions/4508998/what-is-an-anonymous-inode-in-linux/44388030#44388030" class="bare">https://stackoverflow.com/questions/4508998/what-is-an-anonymous-inode-in-linux/44388030#44388030</a></p>
@@ -27069,6 +27362,19 @@ cd ../..
 </ul>
 </div>
 </li>
+<li>
+<p><code>time.h</code></p>
+<div class="ulist">
+<ul>
+<li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/c/timespec_get.c">userland/c/timespec_get.c</a>: <code>timespec_get</code> is a C11 alternative to <code>clock_gettime</code> <a href="http://stackoverflow.com/questions/361363/how-to-measure-time-in-milliseconds-using-ansi-c/36095407#36095407" class="bare">http://stackoverflow.com/questions/361363/how-to-measure-time-in-milliseconds-using-ansi-c/36095407#36095407</a></p>
+<div class="paragraph">
+<p>Vs <code>clock()</code>: <a href="http://stackoverflow.com/questions/12392278/measure-time-in-linux-getrusage-vs-clock-gettime-vs-clock-vs-gettimeofday" class="bare">http://stackoverflow.com/questions/12392278/measure-time-in-linux-getrusage-vs-clock-gettime-vs-clock-vs-gettimeofday</a></p>
+</div>
+</li>
+</ul>
+</div>
+</li>
 </ul>
 </div>
 </li>
@@ -27240,23 +27546,100 @@ echo 1 &gt; /proc/sys/vm/overcommit_memory
 <div class="paragraph">
 <p>Added in C11!</p>
 </div>
-<div class="ulist">
-<ul>
-<li>
-<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/c/atomic.c">userland/c/atomic.c</a>: <code>atomic_int</code> and <code>thrd_create</code></p>
-</li>
-</ul>
-</div>
 <div class="paragraph">
 <p>Bibliography:</p>
 </div>
 <div class="ulist">
 <ul>
 <li>
+<p><a href="#userland-multithreading">Userland multithreading</a></p>
+</li>
+<li>
 <p><a href="https://stackoverflow.com/questions/3908031/how-to-multithread-c-code/52453354#52453354" class="bare">https://stackoverflow.com/questions/3908031/how-to-multithread-c-code/52453354#52453354</a></p>
 </li>
 </ul>
 </div>
+<div class="sect4">
+<h5 id="atomic-c"><a class="anchor" href="#atomic-c"></a><a class="link" href="#atomic-c">21.1.2.1. atomic.c</a></h5>
+<div class="paragraph">
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/c/atomic.c">userland/c/atomic.c</a></p>
+</div>
+<div class="paragraph">
+<p>Demonstrates <code>atomic_int</code> and <code>thrd_create</code>.</p>
+</div>
+<div class="paragraph">
+<p>Disassembly with GDB at LKMC 619fef4b04bddc4a5a38aec5e207dd4d5a25d206 + 1:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./run-toolchain \
+  --arch aarch64 gdb \
+  -- \
+  -batch \
+  -ex 'disas/rs my_thread_main' $(./getvar \
+  --arch aarch64 userland_build_dir)/c/atomic.out \
+;</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>shows on ARM:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>16              ++cnt;
+   0x00000000004008cc &lt;+28&gt;:    80 00 00 b0     adrp    x0, 0x411000 &lt;malloc@got.plt&gt;
+   0x00000000004008d0 &lt;+32&gt;:    00 80 01 91     add     x0, x0, #0x60
+   0x00000000004008d4 &lt;+36&gt;:    00 00 40 b9     ldr     w0, [x0]
+   0x00000000004008d8 &lt;+40&gt;:    01 04 00 11     add     w1, w0, #0x1
+   0x00000000004008dc &lt;+44&gt;:    80 00 00 b0     adrp    x0, 0x411000 &lt;malloc@got.plt&gt;
+   0x00000000004008e0 &lt;+48&gt;:    00 80 01 91     add     x0, x0, #0x60
+   0x00000000004008e4 &lt;+52&gt;:    01 00 00 b9     str     w1, [x0]
+
+17              ++acnt;
+   0x00000000004008e8 &lt;+56&gt;:    20 00 80 52     mov     w0, #0x1                        // #1
+   0x00000000004008ec &lt;+60&gt;:    e0 1b 00 b9     str     w0, [sp, #24]
+   0x00000000004008f0 &lt;+64&gt;:    e0 1b 40 b9     ldr     w0, [sp, #24]
+   0x00000000004008f4 &lt;+68&gt;:    e2 03 00 2a     mov     w2, w0
+   0x00000000004008f8 &lt;+72&gt;:    80 00 00 b0     adrp    x0, 0x411000 &lt;malloc@got.plt&gt;
+   0x00000000004008fc &lt;+76&gt;:    00 70 01 91     add     x0, x0, #0x5c
+   0x0000000000400900 &lt;+80&gt;:    03 00 e2 b8     ldaddal w2, w3, [x0]
+   0x0000000000400904 &lt;+84&gt;:    61 00 02 0b     add     w1, w3, w2
+   0x0000000000400908 &lt;+88&gt;:    e0 03 01 2a     mov     w0, w1
+   0x000000000040090c &lt;+92&gt;:    e0 1f 00 b9     str     w0, [sp, #28]</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>so:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>the atomic increment uses <a href="#arm-lse"><code>ldadd</code></a></p>
+</li>
+<li>
+<p>the non-atomic increment just does LDR, ADD, STR: <a href="#arm-lse"><code>ldadd</code></a></p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>With <code>-O3</code>:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>16              ++cnt;
+   0x0000000000400a00 &lt;+32&gt;:    60 00 40 b9     ldr     w0, [x3]
+   0x0000000000400a04 &lt;+36&gt;:    00 04 00 11     add     w0, w0, #0x1
+   0x0000000000400a08 &lt;+40&gt;:    60 00 00 b9     str     w0, [x3]
+
+17              ++acnt;
+   0x0000000000400a0c &lt;+44&gt;:    20 00 80 52     mov     w0, #0x1                        // #1
+   0x0000000000400a10 &lt;+48&gt;:    40 00 e0 b8     ldaddal w0, w0, [x2]</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>so the situation is the same but without all the horrible stack noise.</p>
+</div>
+</div>
 </div>
 <div class="sect3">
 <h4 id="gcc-c-extensions"><a class="anchor" href="#gcc-c-extensions"></a><a class="link" href="#gcc-c-extensions">21.1.3. GCC C extensions</a></h4>
@@ -27338,6 +27721,16 @@ echo 1 &gt; /proc/sys/vm/overcommit_memory
 </ul>
 </div>
 </li>
+<li>
+<p><code>virtual</code> and polymorphism</p>
+<div class="ulist">
+<ul>
+<li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/cpp/virtual.cpp">userland/cpp/virtual.cpp</a></p>
+</li>
+</ul>
+</div>
+</li>
 </ul>
 </div>
 </li>
@@ -27691,6 +28084,111 @@ time ./mutex.out 4 100000000</pre>
 </li>
 </ul>
 </div>
+<div class="sect5">
+<h6 id="detailed-gem5-analysis-of-how-data-races-happen"><a class="anchor" href="#detailed-gem5-analysis-of-how-data-races-happen"></a><a class="link" href="#detailed-gem5-analysis-of-how-data-races-happen">21.2.2.1.1. Detailed gem5 analysis of how data races happen</a></h6>
+<div class="paragraph">
+<p>The smallest data race we managed to come up with as of LKMC 7c01b29f1ee7da878c7cc9cb4565f3f3cf516a92 and gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1 was with <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/c/atomic.c">userland/c/atomic.c</a> (see also <a href="#c-multithreading">C multithreading</a>):</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>./run \
+  --arch aarch64 \
+  --cli-args '2 10' \
+  --cpus 3 \
+  --emulator gem5 \
+  --userland userland/c/atomic.c \
+;</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>which outputs:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>atomic 20
+non-atomic 19</pre>
+</div>
+</div>
+<div class="paragraph">
+<div class="title">/run -aA -eg -u userland/c/atomic.c --cli-args '2 200' --cpus 3 --userland-build-id o3 -N1 --trace ExecAll&#8201;&#8212;&#8201;--caches --cpu-type TimingSimpleCPU</div>
+<p>Note that the system is very minimal, and doesn&#8217;t even have caches, so I&#8217;m curious as to how this can happen at all.</p>
+</div>
+<div class="paragraph">
+<p>So first we do a run with <a href="#gem5-tracing"><code>--trace Exec</code></a> and look at the <code>my_thread_main</code> entries.</p>
+</div>
+<div class="paragraph">
+<p>From there we see that first CPU1 enters the function, since it was spawned first.</p>
+</div>
+<div class="paragraph">
+<p>Then for some time, both CPU1 and CPU2 are running at the same time.</p>
+</div>
+<div class="paragraph">
+<p>Finally, CPU1 exits, then CPU2 runs alone for a while to finish its loops, and then CPU2 exits.</p>
+</div>
+<div class="paragraph">
+<p>By grepping the LDR data read from the log, we are able to easily spot the moment where things started to go wrong based on the <code>D=</code> data:</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>grep -E 'my_thread_main\+36' trace.txt &gt; trace-ldr.txt</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>The <code>grep</code> output contains</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>94024500: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000006 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
+94036500: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000007 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
+94048500: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000008 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
+94058500: system.cpu2: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000009 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
+94060500: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000009 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
+94070500: system.cpu2: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x000000000000000a A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
+94082500: system.cpu2: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x000000000000000b A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>and so we see that it is at <code>94058500</code> that things started going bad, since two consecutive loads from different CPUs read the same value <code>D=9</code>! Actually, things were not too bad afterwards, because by coincidence this was the last CPU1 read. We would have missed many more increments if the number of iterations had been larger.</p>
+</div>
+<div class="paragraph">
+<p>Now that we have the first bad time, let&#8217;s look at the fuller disassembly to better understand what happens around that point.</p>
+</div>
+<div class="literalblock">
+<div class="content">
+<pre>94058500: system.cpu2: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000009 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
+94059000: system.cpu2: A0 T0 : @my_thread_main+40    :   add   w1, w0, #1         : IntAlu :  D=0x000000000000000a  flags=(IsInteger)
+94059000: system.cpu1: A0 T0 : @my_thread_main+120    :   b.cc   &lt;my_thread_main+28&gt; : IntAlu :   flags=(IsControl|IsDirectControl|IsCondControl)
+94059500: system.cpu1: A0 T0 : @my_thread_main+28    :   adrp   x0, #69632        : IntAlu :  D=0x0000000000411000  flags=(IsInteger)
+94059500: system.cpu2: A0 T0 : @my_thread_main+44    :   adrp   x0, #69632        : IntAlu :  D=0x0000000000411000  flags=(IsInteger)
+94060000: system.cpu2: A0 T0 : @my_thread_main+48    :   add   x0, x0, #96        : IntAlu :  D=0x0000000000411060  flags=(IsInteger)
+94060000: system.cpu1: A0 T0 : @my_thread_main+32    :   add   x0, x0, #96        : IntAlu :  D=0x0000000000411060  flags=(IsInteger)
+94060500: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000009 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
+94060500: system.cpu2: A0 T0 : @my_thread_main+52    :   str   x1, [x0]           : MemWrite :  D=0x000000000000000a A=0x411060  flags=(IsInteger|IsMemRef|IsStore)</pre>
+</div>
+</div>
+<div class="paragraph">
+<p>and from this, all becomes crystal clear:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>94058500: CPU2 loads</p>
+</li>
+<li>
+<p>94060500: CPU1 loads</p>
+</li>
+<li>
+<p>94060500: CPU2 stores</p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>so we see that CPU2 just happened to store after CPU1 loads.</p>
+</div>
+<div class="paragraph">
+<p>We also understand why LDADD solves the race problem in AtomicSimpleCPU: it does the load and store in one single go!</p>
+</div>
+</div>
 </div>
 <div class="sect4">
 <h5 id="cpp-memory-order"><a class="anchor" href="#cpp-memory-order"></a><a class="link" href="#cpp-memory-order">21.2.2.2. C++ std::memory_order</a></h5>
@@ -27700,6 +28198,9 @@ time ./mutex.out 4 100000000</pre>
 <div class="paragraph">
 <p>TODO let&#8217;s understand that fully one day.</p>
 </div>
+<div class="paragraph">
+<p>This is the C++ version of the more general <a href="#memory-consistency">Memory consistency</a> concept.</p>
+</div>
 </div>
 <div class="sect4">
 <h5 id="cpp-parallel-algorithms"><a class="anchor" href="#cpp-parallel-algorithms"></a><a class="link" href="#cpp-parallel-algorithms">21.2.2.3. C++ parallel algorithms</a></h5>
@@ -30593,6 +31094,9 @@ child after parent sleep</pre>
 <p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/linux/getcpu.c">userland/linux/getcpu.c</a>: a wrapper close to the syscall that also returns the current NUMA node</p>
 </li>
 <li>
+<p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/linux/getcpu_syscall.c">userland/linux/getcpu_syscall.c</a>: the wrapper segfaults on error handling, so double checking with the real syscall: <a href="https://stackoverflow.com/questions/9260937/unix-socket-error-14-efault-bad-address/61879849#61879849" class="bare">https://stackoverflow.com/questions/9260937/unix-socket-error-14-efault-bad-address/61879849#61879849</a></p>
+</li>
+<li>
 <p><a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/linux/sched_getcpu_barrier.c">userland/linux/sched_getcpu_barrier.c</a>: this uses a barrier to ensure that gem5 will run each thread on one separate CPU</p>
 </li>
 </ul>
@@ -38266,6 +38770,31 @@ west build -b qemu_aarch64 samples/hello_world</pre>
 <p><a href="https://en.wikipedia.org/wiki/Cache_coherence" class="bare">https://en.wikipedia.org/wiki/Cache_coherence</a></p>
 </div>
 <div class="paragraph">
+<p>In simple terms, when a certain group of caches of different CPUs are coherent, reads on one core always see the writes previously made by other cores. TODO: is it that strict, or just ordering? TODO what about simultaneous read and writes?</p>
+</div>
+<div class="paragraph">
+<p><a href="http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf" class="bare">http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf</a> mentions that:</p>
+</div>
+<div class="paragraph">
+<p>Cache coherence:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>guarantees eventual write propagation</p>
+</li>
+<li>
+<p>guarantees a single order of all writes to same location</p>
+</li>
+<li>
+<p>no guarantees on when writes propagate</p>
+</li>
+</ul>
+</div>
+<div class="paragraph">
+<p>And notably it contrasts that with <a href="#memory-consistency">Memory consistency</a>, which according to them is about ordering requirements on <em>different</em> addresses.</p>
+</div>
+<div class="paragraph">
 <p>Algorithms to keep the caches of different cores of a system coherent. Only matters for multicore systems.</p>
 </div>
 <div class="paragraph">
@@ -38275,13 +38804,25 @@ west build -b qemu_aarch64 samples/hello_world</pre>
 <p>The main software use case example to have in mind is that of multiple threads incrementing an atomic counter as in <a href="https://github.com/cirosantilli/linux-kernel-module-cheat/blob/master/userland/cpp/atomic/std_atomic.cpp">userland/cpp/atomic/std_atomic.cpp</a>, see also: <a href="#atomic-cpp">atomic.cpp</a>. Then, if one processor writes to the cache, other processors have to know about it before they read from that address.</p>
 </div>
 <div class="paragraph">
-<p>Note that cache coherency only applies to memory read/write instructions that explicitly make coherency requirements.</p>
-</div>
-<div class="paragraph">
-<p>In most ISAs, this tends to be the minority of instructions, and is only used when something is going to modify memory that is known to be shared across threads. For example, the a <a href="#x86-thread-synchronization-primitives">x86 LOCK</a> would be used to increment atomic counters that get incremented across several threads. Outside of those cases, cache coherency is not guaranteed, and behaviour is undefined.</p>
+<p>Even if caches are coherent, this is still not enough to avoid data race conditions, because this does not enforce atomicity of read modify write sequences. This is for example shown at: <a href="#detailed-gem5-analysis-of-how-data-races-happen">Detailed gem5 analysis of how data races happen</a>.</p>
 </div>
 <div class="sect3">
-<h4 id="can-caches-snoop-data-from-other-caches"><a class="anchor" href="#can-caches-snoop-data-from-other-caches"></a><a class="link" href="#can-caches-snoop-data-from-other-caches">32.2.1. Can caches snoop data from other caches?</a></h4>
+<h4 id="memory-consistency"><a class="anchor" href="#memory-consistency"></a><a class="link" href="#memory-consistency">32.2.1. Memory consistency</a></h4>
+<div class="paragraph">
+<p>According to <a href="http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf" class="bare">http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf</a> "memory consistency" is about ordering requirements of different memory addresses.</p>
+</div>
+<div class="paragraph">
+<p>This is represented explicitly in C++, for example by <a href="#cpp-memory-order">C++ std::memory_order</a>.</p>
+</div>
+<div class="sect4">
+<h5 id="sequential-consistency"><a class="anchor" href="#sequential-consistency"></a><a class="link" href="#sequential-consistency">32.2.1.1. Sequential Consistency</a></h5>
+<div class="paragraph">
+<p>According to <a href="http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf" class="bare">http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf</a>, this is the strongest possible consistency model: everything is nicely ordered as you&#8217;d expect.</p>
+</div>
+</div>
+</div>
+<div class="sect3">
+<h4 id="can-caches-snoop-data-from-other-caches"><a class="anchor" href="#can-caches-snoop-data-from-other-caches"></a><a class="link" href="#can-caches-snoop-data-from-other-caches">32.2.2. Can caches snoop data from other caches?</a></h4>
 <div class="paragraph">
 <p>Either they can snoop only control, or both control and data can be snooped.</p>
 </div>
@@ -38296,7 +38837,7 @@ west build -b qemu_aarch64 samples/hello_world</pre>
 </div>
 </div>
 <div class="sect3">
-<h4 id="vi-cache-coherence-protocol"><a class="anchor" href="#vi-cache-coherence-protocol"></a><a class="link" href="#vi-cache-coherence-protocol">32.2.2. VI cache coherence protocol</a></h4>
+<h4 id="vi-cache-coherence-protocol"><a class="anchor" href="#vi-cache-coherence-protocol"></a><a class="link" href="#vi-cache-coherence-protocol">32.2.3. VI cache coherence protocol</a></h4>
 <div class="paragraph">
 <p>Mentioned at:</p>
 </div>
@@ -38543,7 +39084,7 @@ west build -b qemu_aarch64 samples/hello_world</pre>
 </div>
 </div>
 <div class="sect3">
-<h4 id="msi-cache-coherence-protocol"><a class="anchor" href="#msi-cache-coherence-protocol"></a><a class="link" href="#msi-cache-coherence-protocol">32.2.3. MSI cache coherence protocol</a></h4>
+<h4 id="msi-cache-coherence-protocol"><a class="anchor" href="#msi-cache-coherence-protocol"></a><a class="link" href="#msi-cache-coherence-protocol">32.2.4. MSI cache coherence protocol</a></h4>
 <div class="paragraph">
 <p><a href="https://en.wikipedia.org/wiki/MSI_protocol" class="bare">https://en.wikipedia.org/wiki/MSI_protocol</a></p>
 </div>
@@ -38855,7 +39396,7 @@ CACHE2 S nyy
 <p>TODO gem5 concrete example.</p>
 </div>
 <div class="sect4">
-<h5 id="msi-cache-coherence-protocol-with-transient-states"><a class="anchor" href="#msi-cache-coherence-protocol-with-transient-states"></a><a class="link" href="#msi-cache-coherence-protocol-with-transient-states">32.2.3.1. MSI cache coherence protocol with transient states</a></h5>
+<h5 id="msi-cache-coherence-protocol-with-transient-states"><a class="anchor" href="#msi-cache-coherence-protocol-with-transient-states"></a><a class="link" href="#msi-cache-coherence-protocol-with-transient-states">32.2.4.1. MSI cache coherence protocol with transient states</a></h5>
 <div class="paragraph">
 <p>TODO understand well why those are needed.</p>
 </div>
@@ -38875,7 +39416,7 @@ CACHE2 S nyy
 </div>
 </div>
 <div class="sect3">
-<h4 id="mesi-cache-coherence-protocol"><a class="anchor" href="#mesi-cache-coherence-protocol"></a><a class="link" href="#mesi-cache-coherence-protocol">32.2.4. MESI cache coherence protocol</a></h4>
+<h4 id="mesi-cache-coherence-protocol"><a class="anchor" href="#mesi-cache-coherence-protocol"></a><a class="link" href="#mesi-cache-coherence-protocol">32.2.5. MESI cache coherence protocol</a></h4>
 <div class="paragraph">
 <p><a href="https://en.wikipedia.org/wiki/MESI_protocol" class="bare">https://en.wikipedia.org/wiki/MESI_protocol</a></p>
 </div>
@@ -38896,22 +39437,48 @@ CACHE2 S nyy
 <p>Exclusive is entered from Invalid after a PrRd, but only if the reply came from DRAM (<a href="#can-caches-snoop-data-from-other-caches">or if we snooped that no one sent the reply to DRAM for us to read it</a>)! If the reply came from another cache, we go directly to shared instead. It is this extra information that allows for the split of S.</p>
 </div>
 <div class="paragraph">
+<p>This is why the simplified transition diagram shown in many places e.g.: <a href="https://upload.wikimedia.org/wikipedia/commons/c/c1/Diagrama_MESI.GIF" class="bare">https://upload.wikimedia.org/wikipedia/commons/c/c1/Diagrama_MESI.GIF</a> is not a proper state machine: I can go to either S or E given a PrRd.</p>
+</div>
+<div class="paragraph">
 <p>The advantage of this over MSI is that when we move from Exclusive to Modified, no invalidate message is required, reducing bus traffic: <a href="https://en.wikipedia.org/wiki/MESI_protocol#Advantages_of_MESI_over_MSI" class="bare">https://en.wikipedia.org/wiki/MESI_protocol#Advantages_of_MESI_over_MSI</a></p>
 </div>
 <div class="paragraph">
-<p>This is a common case on read write modify loops. On MSI, it would first do PrRd, send BusRd (to move any M to S), get data, and go to Shared, then PrWr must send BusUpgr to invalidate other Shared and move to M.</p>
+<p>This is a common case on read modify write loops. On MSI, it would:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>first do PrRd</p>
+</li>
+<li>
+<p>send BusRd (to move any M to S), get data, and go to Shared</p>
+</li>
+<li>
+<p>then PrWr must send BusUpgr to invalidate other Shared and move to M</p>
+</li>
+</ul>
 </div>
 <div class="paragraph">
-<p>With MESI, the PrRd could go to E instead of S depending on who services it. If it does go to E, then the PrWr only moves it to M, there is no need to send BusUpgr because we know that no one else is in S.</p>
+<p>With MESI:</p>
+</div>
+<div class="ulist">
+<ul>
+<li>
+<p>the PrRd could go to E instead of S depending on who services it</p>
+</li>
+<li>
+<p>if it does go to E, then the PrWr only moves it to M, there is no need to send BusUpgr because we know that no one else is in S</p>
+</li>
+</ul>
 </div>
 <div class="paragraph">
 <p>gem5 12c917de54145d2d50260035ba7fa614e25317a3 has two <a href="#gem5-ruby-build">Ruby</a> MESI models implemented: <code>MESI_Two_Level</code> and <code>MESI_Three_Level</code>.</p>
 </div>
 </div>
 <div class="sect3">
-<h4 id="mosi-cache-coherence-protocol"><a class="anchor" href="#mosi-cache-coherence-protocol"></a><a class="link" href="#mosi-cache-coherence-protocol">32.2.5. MOSI cache coherence protocol</a></h4>
+<h4 id="mosi-cache-coherence-protocol"><a class="anchor" href="#mosi-cache-coherence-protocol"></a><a class="link" href="#mosi-cache-coherence-protocol">32.2.6. MOSI cache coherence protocol</a></h4>
 <div class="paragraph">
-<p><a href="https://en.wikipedia.org/wiki/MOSI_protocol" class="bare">https://en.wikipedia.org/wiki/MOSI_protocol</a> The critical MSI vs MOSI section was a bit bogus though: <a href="https://en.wikipedia.org/w/index.php?title=MOSI_protocol&amp;oldid=895443023" class="bare">https://en.wikipedia.org/w/index.php?title=MOSI_protocol&amp;oldid=895443023</a> we have to edit it.</p>
+<p><a href="https://en.wikipedia.org/wiki/MOSI_protocol" class="bare">https://en.wikipedia.org/wiki/MOSI_protocol</a> The critical MSI vs MOSI section was a bit bogus though: <a href="https://en.wikipedia.org/w/index.php?title=MOSI_protocol&amp;oldid=895443023" class="bare">https://en.wikipedia.org/w/index.php?title=MOSI_protocol&amp;oldid=895443023</a> but I edited it :-)</p>
 </div>
 <div class="paragraph">
 <p>In MSI, it feels wasteful that an MS transaction needs to flush to memory: why do we need to flush right now, since even more caches now have that data? Why not wait until later and try to gain something from this deferral?</p>
@@ -38969,7 +39536,7 @@ CACHE2 S nyy
 </div>
 </div>
 <div class="sect3">
-<h4 id="moesi"><a class="anchor" href="#moesi"></a><a class="link" href="#moesi">32.2.6. MOESI cache coherence protocol</a></h4>
+<h4 id="moesi"><a class="anchor" href="#moesi"></a><a class="link" href="#moesi">32.2.7. MOESI cache coherence protocol</a></h4>
 <div class="paragraph">
 <p><a href="https://en.wikipedia.org/wiki/MOESI_protocol" class="bare">https://en.wikipedia.org/wiki/MOESI_protocol</a></p>
 </div>
@@ -38980,6 +39547,9 @@ CACHE2 S nyy
 <p>In gem5 9fc9c67b4242c03f165951775be5cd0812f2a705, MOESI is the default cache coherency protocol of the <a href="#gem5-ruby-build">classic memory system</a> as shown at <a href="#what-is-the-coherency-protocol-implemented-by-the-classic-cache-system-in-gem5">Section 19.20.4.3.1, &#8220;What is the coherency protocol implemented by the classic cache system in gem5?&#8221;</a>.</p>
 </div>
 <div class="paragraph">
+<p>A good and simple example showing several MOESI transitions in the classic memory model can be seen at: <a href="#gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus">Section 19.20.4.4, &#8220;gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs&#8221;</a>.</p>
+</div>
+<div class="paragraph">
 <p>gem5 12c917de54145d2d50260035ba7fa614e25317a3 has several <a href="#gem5-ruby-build">Ruby</a> MOESI models implemented: <code>MOESI_AMD_Base</code>, <code>MOESI_CMP_directory</code>, <code>MOESI_CMP_token</code> and <code>MOESI_hammer</code>.</p>
 </div>
 </div>