start_kernel is a good definition of the Linux kernel entry point: https://stackoverflow.com/questions/18266063/does-kernel-have-main-function/33422401#33422401
In gem5 aarch64 Linux v4.18, experimentally the entry point of secondary CPUs seems to be secondary_holding_pen as shown at https://gist.github.com/cirosantilli2/34a7bc450fcb6c1c1a910369be1fdd90
This describes the internals of the gem5 m5out/stats.txt file.
Well, run minimal examples, and reverse engineer them up!
We can start with userland/arch/aarch64/freestanding/linux/hello.S on atomic with the gem5 ExecAll trace format.
./run \
  --arch aarch64 \
  --emulator gem5 \
  --userland userland/arch/aarch64/freestanding/linux/hello.S \
  --trace ExecAll \
  --trace-stdout \
;
which gives:
0: system.cpu: A0 T0 : @_start : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
500: system.cpu: A0 T0 : @_start+4 : adr x1, #28 : IntAlu : D=0x0000000000400098 flags=(IsInteger)
1000: system.cpu: A0 T0 : @_start+8 : ldr w2, #4194464 : MemRead : D=0x0000000000000006 A=0x4000a0 flags=(IsInteger|IsMemRef|IsLoad)
1500: system.cpu: A0 T0 : @_start+12 : movz x8, #64, #0 : IntAlu : D=0x0000000000000040 flags=(IsInteger)
2000: system.cpu: A0 T0 : @_start+16 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
2500: system.cpu: A0 T0 : @_start+20 : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
3000: system.cpu: A0 T0 : @_start+24 : movz x8, #93, #0 : IntAlu : D=0x000000000000005d flags=(IsInteger)
3500: system.cpu: A0 T0 : @_start+28 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
The most important stat of all is usually the cycle count, which is a direct measure of performance if you modelled your system well:
sim_ticks 3500 # Number of ticks simulated
Next, sim_insts and sim_ops are often critical:
sim_insts 6 # Number of instructions simulated
sim_ops 6 # Number of ops (including micro ops) simulated
sim_ops is like sim_insts but it also includes gem5 microops.
In gem5 syscall emulation mode, syscall instructions are magic, and therefore appear not to be counted, which is why we get 6 instructions instead of 8.
which is kind of really cool because it allows you to view graph arrows on hover. This can be very useful because the PDF and SVG often overlap so many arrows together that you just can’t know which one is coming from/going to where.
It is worth noting that if you are running a bunch of short simulations, dot/SVG/PDF generation can have a significant impact on simulation startup time, so it is something to watch out for. As per https://gem5-review.googlesource.com/c/public/gem5/+/29232 it can be turned off.
It presumably implements a crossbar switch along the lines of: https://en.wikipedia.org/wiki/Crossbar_switch
One simple example of its operation can be seen at: Section 19.21.4.2, “gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis”
This is the best introductory example analysis we have so far: gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs. It contains more or less the most minimal example in which something interesting can be observed: multiple cores fighting over a single data memory variable.
But arguably interesting effects can only be observed when we have more than 1 CPU as in gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs.
Long story short: the interconnect contains the snoop mechanism, and it forwards packets coming from the caches of one CPU to the caches of the other CPUs in which the block is present.
It is therefore the heart of the Cache coherence mechanism, as it informs other caches of bus transactions they need to know about.
TODO: describe it in more detail. It appears to be a very simple mechanism.
cur_sys.boot_loader = [ loc('boot_emm.arm64'), loc('boot_emm.arm') ]

The bootloader basically just sets up a bit of CPU state and jumps to the kernel entry point.
In aarch64 at least, CPUs other than CPU0 are also started up briefly, run some initialization, and are made to wait on a WFE. This can be seen easily by booting a multicore Linux kernel run with the gem5 ExecAll trace format.
Parent section: gem5 internals.
The gem5 memory system is connected in a very flexible way through the port system.
This system exists to allow seamlessly connecting any combination of CPU, caches, interconnects, DRAM and peripherals.
A Packet is the basic information unit that gets sent across ports.
gem5 memory requests can be classified in the following broad categories:
functional: get the value magically, do not update caches, see also: gem5 functional requests
atomic: get the value now without making a separate event, but do not update caches. Cannot work in Ruby due to fundamental limitations, mentioned in passing at: https://gem5.atlassian.net/browse/GEM5-676
timing: get the value simulating delays and updating caches
This trichotomy can be notably seen in the definition of the MasterPort class:

class MasterPort : public Port, public AtomicRequestProtocol,
                   public TimingRequestProtocol, public FunctionalRequestProtocol
and the base classes are defined under src/mem/protocol/.
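To see the three flavors side by side, here is a minimal sketch of how a CPU-side master port could issue each kind of access. This is illustrative only: the exact Request constructor signature varies between gem5 versions, and a real timing request would use a heap-allocated packet so that it can outlive the call, as discussed further below for TimingSimpleCPU.

// Sketch only: assumes a connected MasterPort `port` and a valid MasterID.
// The three send calls are the ones MasterPort inherits from the protocol
// base classes shown above.
void readThreeWays(MasterPort &port, Addr paddr, MasterID masterId)
{
    uint64_t buf = 0;
    auto req = std::make_shared<Request>(paddr, sizeof(buf), 0, masterId);
    Packet pkt(req, MemCmd::ReadReq);
    pkt.dataStatic(&buf);

    // Functional: magic debugger-style read, no events, no cache modelling.
    port.sendFunctional(&pkt);

    // Atomic: returns the latency immediately, still without scheduling events.
    Tick latency = port.sendAtomic(&pkt);
    (void)latency;

    // Timing: only *tries* to send; the reply arrives later through
    // recvTimingResp() on this port, typically many events later.
    bool accepted = port.sendTimingReq(&pkt);
    (void)accepted;
}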
Then, by reading the rest of the class, we see that the send methods are all boring, and just forward to some polymorphic receiver that does the actual interesting activity:
Tick
sendAtomicSnoop(PacketPtr pkt)
{
    return AtomicResponseProtocol::sendSnoop(_masterPort, pkt);
}

Tick
AtomicResponseProtocol::sendSnoop(AtomicRequestProtocol *peer, PacketPtr pkt)
{
    assert(pkt->isRequest());
    return peer->recvAtomicSnoop(pkt);
}
The receive methods are therefore the interesting ones, and must be overridden on derived classes if they ever expect to receive such requests:
Tick
recvAtomicSnoop(PacketPtr pkt) override
{
    panic("%s was not expecting an atomic snoop request\n", name());
    return 0;
}

void
recvFunctionalSnoop(PacketPtr pkt) override
{
    panic("%s was not expecting a functional snoop request\n", name());
}

void
recvTimingSnoopReq(PacketPtr pkt) override
{
    panic("%s was not expecting a timing snoop request.\n", name());
}
One question that comes up now is: but why do CPUs need to care about snoop requests?
And one big answer is: to be able to implement LLSC atomicity as mentioned at: ARM LDXR and STXR instructions, since when other cores update memory, they could invalidate the lock of the current core.
Then, as you might expect, we can see that for example AtomicSimpleCPU does not override recvTimingSnoopReq.
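To make the LLSC point a bit more concrete, here is a generic sketch of an exclusive monitor. This is not gem5's actual implementation (gem5 tracks locked addresses inside the memory system), just the logic that explains why a core must observe snooped writes: a store exclusive has to fail if another master wrote to the reserved address in between.

#include <cstdint>

// Illustrative only: a per-core exclusive monitor in the spirit of LDXR/STXR.
struct ExclusiveMonitor {
    bool valid = false;
    uint64_t addr = 0;

    // LDXR: take a reservation on the address.
    void loadExclusive(uint64_t a) { valid = true; addr = a; }

    // Called when a write from another core is snooped.
    void observeRemoteWrite(uint64_t a) {
        if (valid && a == addr)
            valid = false;              // reservation is lost
    }

    // STXR: succeeds only if the reservation survived.
    bool storeExclusive(uint64_t a) {
        bool ok = valid && a == addr;
        valid = false;
        return ok;
    }
};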
Now let's see which requests are generated by an ordinary ARM LDR instruction. We run:
./run \
  --arch aarch64 \
  --debug-vm \
  --emulator gem5 \
  --gem5-build-type debug \
  --userland userland/arch/aarch64/freestanding/linux/hello.S
and then break at the methods of the LDR class LDRXL64_LIT: gem5 execute vs initiateAcc vs completeAcc.
Before starting, we of course guess that:
AtomicSimpleCPU will be making atomic accesses from execute
TimingSimpleCPU will be making timing accesses from initiateAcc, which must generate the event which leads to completeAcc
so let’s confirm it.
We break on ArmISAInst::LDRXL64_LIT::execute which is what AtomicSimpleCPU uses, and that leads as expected to:

MasterPort::sendAtomic
AtomicSimpleCPU::sendPacket
AtomicSimpleCPU::readMem
SimpleExecContext::readMem
readMemAtomic<(ByteOrder)1, ExecContext, unsigned long>
readMemAtomicLE<ExecContext, unsigned long>
ArmISAInst::LDRXL64_LIT::execute
AtomicSimpleCPU::tick
Notably, AtomicSimpleCPU::readMem immediately translates the address, creates a packet, sends the atomic request, and gets the response back without any events.
And now if we do the same with --cpu-type TimingSimpleCPU and break at ArmISAInst::LDRXL64_LIT::initiateAcc, and then add another break for the next event schedule b EventManager::schedule (which we imagine is the memory read) we reach:
EventManager::schedule
DRAMCtrl::addToReadQueue
DRAMCtrl::recvTimingReq
DRAMCtrl::MemoryPort::recvTimingReq
TimingRequestProtocol::sendReq
MasterPort::sendTimingReq
CoherentXBar::recvTimingReq
CoherentXBar::CoherentXBarSlavePort::recvTimingReq
TimingRequestProtocol::sendReq
MasterPort::sendTimingReq
TimingSimpleCPU::handleReadPacket
TimingSimpleCPU::sendData
TimingSimpleCPU::finishTranslation
DataTranslation<TimingSimpleCPU*>::finish
ArmISA::TLB::translateComplete
ArmISA::TLB::translateTiming
ArmISA::TLB::translateTiming
TimingSimpleCPU::initiateMemRead
SimpleExecContext::initiateMemRead
initiateMemRead<ExecContext, unsigned long>
ArmISAInst::LDRXL64_LIT::initiateAcc
TimingSimpleCPU::completeIfetch
TimingSimpleCPU::IcachePort::ITickEvent::process
EventQueue::serviceOne
so as expected we have TimingRequestProtocol::sendReq.
Remember however that timing requests are a bit more complicated due to paging, since the page table walk can itself lead to further memory requests.
In this particular instance, the address being read with the ldr x2, =len ARM LDR pseudo-instruction is likely placed just after the text section, and therefore its translation is already in the TLB due to previous instruction fetches. This is why the translation finishes immediately, going straight through TimingSimpleCPU::finishTranslation; some key snippets are:
TLB::translateComplete(const RequestPtr &req, ThreadContext *tc,
    Translation *translation, Mode mode, TLB::ArmTranslationType tranType,
    bool callFromS2)
{
    bool delay = false;
    Fault fault;
    if (FullSystem)
        fault = translateFs(req, tc, mode, translation, delay, true, tranType);
    else
        fault = translateSe(req, tc, mode, translation, delay, true);
    if (!delay)
        translation->finish(fault, req, tc, mode);
    else
        translation->markDelayed();
and then translateSe does not use delay at all, so we learn that in syscall emulation, delay is always false and things progress immediately there. And then further down TimingSimpleCPU::finishTranslation does some more fault checking:
void
TimingSimpleCPU::finishTranslation(WholeTranslationState *state)
{
    if (state->getFault() != NoFault) {
        translationFault(state->getFault());
    } else {
        if (!state->isSplit) {
            sendData(state->mainReq, state->data, state->res,
                     state->mode == BaseTLB::Read);
Tested in gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.
As seen at gem5 functional vs atomic vs timing memory requests, functional requests are not used in common simulation, since the core must always go through caches.
Functional accesses are therefore only used for more magic simulation functionality.
One such functionality is the gem5 syscall emulation mode implementation of the futex system call, which is done at futexFunc in src/sim/syscall_emul.hh.
As seen from man futex, the Linux kernel reads the value from an address that is given as the first argument of the call.
Therefore, here it makes sense for the gem5 syscall implementation, which does not actually have a real kernel running, to just make a functional request and be done with it, since the impact of the cache changes done by this read would be insignificant compared to the cost of the actual full context switch that would happen on a real syscall.
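A hedged sketch of what such a functional read boils down to: the real futexFunc goes through the process' memory proxy rather than hand-building a packet, so the names below are illustrative, but the end result is a functional access like this one.

// Illustrative: read the 32-bit futex word without disturbing timing or caches.
uint32_t functionalReadWord(MasterPort &port, Addr paddr, MasterID masterId)
{
    uint32_t val = 0;
    auto req = std::make_shared<Request>(paddr, sizeof(val), 0, masterId);
    Packet pkt(req, MemCmd::ReadReq);
    pkt.dataStatic(&val);
    port.sendFunctional(&pkt);  // completes immediately, no events scheduled
    return val;
}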
It is generally hard to implement functional requests for Ruby runs, because packets are flying through the memory system in a transient state, and there is no simple way of finding exactly which ones might have the latest version of the memory. See for example:
The typical error message in that case is:

fatal: Ruby functional read failed for address
Packet vs Request

Packet

Packet is what goes through ports: a single packet is sent out to the memory system, gets modified when it hits valid data, and then returns with the reply.
Packet is what CPUs create and send to get memory values. E.g. on gem5 AtomicSimpleCPU:
void
AtomicSimpleCPU::tick()
{
    ...
    Packet ifetch_pkt = Packet(ifetch_req, MemCmd::ReadReq);
    ifetch_pkt.dataStatic(&inst);

    icache_latency = sendPacket(icachePort, &ifetch_pkt);

Tick
AtomicSimpleCPU::sendPacket(MasterPort &port, const PacketPtr &pkt)
{
    return port.sendAtomic(pkt);
}
On TimingSimpleCPU, we note that the packet is dynamically allocated, unlike for the AtomicSimpleCPU, since it must exist across multiple events which happen on separate function calls, whereas atomic memory access is done immediately in a single call:
void
TimingSimpleCPU::sendFetch(const Fault &fault, const RequestPtr &req,
                           ThreadContext *tc)
{
    if (fault == NoFault) {
        DPRINTF(SimpleCPU, "Sending fetch for addr %#x(pa: %#x)\n",
                req->getVaddr(), req->getPaddr());
        ifetch_pkt = new Packet(req, MemCmd::ReadReq);
        ifetch_pkt->dataStatic(&inst);
        DPRINTF(SimpleCPU, " -- pkt addr: %#x\n", ifetch_pkt->getAddr());

        if (!icachePort.sendTimingReq(ifetch_pkt)) {
It must later delete the reply packet that it gets back, e.g. for the ifetch:
TimingSimpleCPU::completeIfetch(PacketPtr pkt)
{
    if (pkt) {
        delete pkt;
    }
The most important properties of a Packet are:
PacketDataPtr data;: the data coming back from a reply packet or being sent via it
Addr addr;: the physical address of the data. TODO comment says could be virtual too, when?

    /// The address of the request. This address could be virtual or
    /// physical, depending on the system configuration.
    Addr addr;

Flags flags;: flags describing properties of the Packet
MemCmd cmd;: this is basically an enumeration of possible commands, stuff like:
enum Command
{
    InvalidCmd,
    ReadReq,
    ReadResp,
+Each command has a fixed number of attributes defined in the static array:
static const CommandInfo commandInfo[];
which gets initialized in the .cc file in the same order as the Command enum.
const MemCmd::CommandInfo
MemCmd::commandInfo[] =
{
    /* InvalidCmd */
    { 0, InvalidCmd, "InvalidCmd" },
    /* ReadReq - Read issued by a non-caching agent such as a CPU or
     * device, with no restrictions on alignment. */
    { SET3(IsRead, IsRequest, NeedsResponse), ReadResp, "ReadReq" },
    /* ReadResp */
    { SET3(IsRead, IsResponse, HasData), InvalidCmd, "ReadResp" },
From this we see for example that both ReadReq and ReadResp are marked with the IsRead attribute.
The second field of this array also specifies the corresponding reply of each request. E.g. the reply of a ReadReq is a ReadResp. InvalidCmd is just a placeholder for commands that are already replies.
struct CommandInfo
{
    /// Set of attribute flags.
    const std::bitset<NUM_COMMAND_ATTRIBUTES> attributes;
    /// Corresponding response for requests; InvalidCmd if no
    /// response is applicable.
    const Command response;
    /// String representation (for printing)
    const std::string str;
};
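This table is what lets a packet turn itself into the matching response. A small sketch of that usage; MemCmd::responseCommand() and Packet::makeResponse() do exist in the gem5 source, but treat the details here as an approximation:

// Sketch: how the commandInfo table drives request -> response conversion.
void turnIntoResponse(PacketPtr pkt)
{
    assert(pkt->needsResponse());              // e.g. ReadReq has NeedsResponse
    MemCmd resp = pkt->cmd.responseCommand();  // ReadReq -> ReadResp via the table
    (void)resp;
    pkt->makeResponse();                       // flips the packet into its response
}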
Request

One good way to think about Request vs Packet could be "it is what the instruction definitions see", a bit like ExecContext vs ThreadContext.
Request is passed to the constructor of Packet, and Packet keeps a reference to it:
Packet(const RequestPtr &_req, MemCmd _cmd)
    : cmd(_cmd), id((PacketId)_req.get()), req(_req),
      data(nullptr), addr(0), _isSecure(false), size(0),
      _qosValue(0), headerDelay(0), snoopDelay(0),
      payloadDelay(0), senderState(NULL)
{
    if (req->hasPaddr()) {
        addr = req->getPaddr();
        flags.set(VALID_ADDR);
        _isSecure = req->isSecure();
    }
    if (req->hasSize()) {
        size = req->getSize();
        flags.set(VALID_SIZE);
    }
}
where RequestPtr is defined as:

typedef std::shared_ptr<Request> RequestPtr;
so we see that shared pointers to requests are basically passed around.
Some key fields include:
_paddr:

    /**
     * The physical address of the request. Valid only if validPaddr
     * is set.
     */
    Addr _paddr = 0;

_vaddr:

    /** The virtual address of the request. */
    Addr _vaddr = MaxAddr;
Request in AtomicSimpleCPU

In AtomicSimpleCPU, a single request of each type is kept for the entire CPU, e.g.:

RequestPtr ifetch_req;
and it gets created at construction time:
AtomicSimpleCPU::AtomicSimpleCPU(AtomicSimpleCPUParams *p)
{
    ifetch_req = std::make_shared<Request>();

and then it gets modified for each request:

setupFetchRequest(ifetch_req);

which does:

req->setVirt(fetchPC, sizeof(MachInst), Request::INST_FETCH,
             instMasterId(), instAddr);

Virtual to physical address translation done by the CPU stores the physical address:

fault = thread->dtb->translateAtomic(req, thread->getTC(),
                                     BaseTLB::Read);
which eventually calls e.g. on fs with MMU enabled:
Fault
TLB::translateMmuOn(ThreadContext* tc, const RequestPtr &req, Mode mode,
                    Translation *translation, bool &delay, bool timing,
                    bool functional, Addr vaddr,
                    ArmFault::TranMethod tranMethod)
{
    req->setPaddr(pa);
Request in TimingSimpleCPU

In TimingSimpleCPU, the request gets created per memory read:
Fault
TimingSimpleCPU::initiateMemRead(Addr addr, unsigned size,
                                 Request::Flags flags,
                                 const std::vector<bool>& byte_enable)
{
    ...
    RequestPtr req = std::make_shared<Request>(
        addr, size, flags, dataMasterId(), pc, thread->contextId());
and from gem5 functional vs atomic vs timing memory requests we remember that initiateMemRead is actually started from the initiateAcc instruction definitions for timing:
Fault LDRWL64_LIT::initiateAcc(ExecContext *xc,
        Trace::InstRecord *traceData) const
{
    ...
    fault = initiateMemRead(xc, traceData, EA, Mem, memAccessFlags);
From this we see that initiateAcc memory instructions are basically extracting the required information for the request, notably the address EA and flags.

MSHR

Mentioned at: http://pages.cs.wisc.edu/~swilson/gem5-docs/gem5MemorySystem.html
Each cache object owns a MSHRQueue:
class BaseCache : public ClockedObject
{
    /** Miss status registers */
    MSHRQueue mshrQueue;

BaseCache is the base class of Cache and NoncoherentCache.
MSHRQueue is a Queue of MSHR:
class MSHRQueue : public Queue<MSHR>
and Queue is also a gem5 class under src/mem/cache/queue.hh.
CommMonitor

You can place this SimObject in between two ports to get extra statistics about the packets that are going through.
It only works on timing requests, and does not seem to dump any memory values, only add extra statistics.
For example, the patch patches/manual/gem5-commmonitor-se.patch hacks a CommMonitor between the CPU and the L1 cache on top of gem5 1c3662c9557c85f0d25490dc4fbde3f8ab0cb350:
patch -d "$(./getvar gem5_source_dir)" -p 1 < patches/manual/gem5-commmonitor-se.patch
That patch was done largely by copying what fs.py --memcheck does with a MemChecker object.
You can then run with:
./run \
  --arch aarch64 \
  --emulator gem5 \
  --userland userland/arch/aarch64/freestanding/linux/hello.S \
  -- \
  --caches \
  --cpu-type TimingSimpleCPU \
;
and now we have some new extra histogram statistics such as:
system.cpu.dcache_mon.readBurstLengthHist::samples 1
One neat thing about this is that it is agnostic to the memory object type, so you don’t have to recode those statistics for every new type of object that operates on memory packets.
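The statistics themselves are ordinary gem5 Stats objects. A hedged sketch of how a monitor-like object can sample a histogram as packets pass through it; the class and member names here are illustrative rather than the exact CommMonitor code:

// Illustrative pass-through observer sampling read packet sizes.
class PacketSizeMonitor
{
  public:
    void regStats()
    {
        readBurstLengthHist
            .init(8)                                  // number of buckets
            .name("readBurstLengthHist")
            .desc("Histogram of read burst lengths");
    }

    // Called on the path that forwards the packet from one port to the other.
    void observe(PacketPtr pkt)
    {
        if (pkt->isRead())
            readBurstLengthHist.sample(pkt->getSize());
    }

  private:
    Stats::Histogram readBurstLengthHist;
};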
The interaction uses the Python C extension interface https://docs.python.org/2/extending/extending.html through the pybind11 helper library: https://github.com/pybind/pybind11
The main is at: src/sim/main.cc. It calls:
Tested at gem5 b4879ae5b0b6644e6836b0881e4da05c64a6550d.
m5.objects module

All SimObjects seem to be automatically added to the m5.objects namespace, and this is done in a very convoluted way, so let's try to understand it a bit:
gem5 is an event based simulator, and as such the event queue is one of the crucial elements in the system.
Then, once we had that, the most perfect thing ever would be to make the full event graph containing which events schedule which events!
Let’s now analyze every single event on a minimal gem5 syscall emulation mode in the simplest CPU that we have:
Tested in gem5 12c917de54145d2d50260035ba7fa614e25317a3.
Let’s have a closer look at the initial magically scheduled events of the simulation.
Inside AtomicSimpleCPU::tick() we saw previously that the reschedule happens at:
It will be interesting to see how AtomicSimpleCPU makes memory access on GDB and to compare that with TimingSimpleCPU.
Happens on EmulationPageTable, and seems to happen atomically without making any extra memory requests.
Now, let’s move on to TimingSimpleCPU, which is just like AtomicSimpleCPU internally, but now the memory requests don’t actually finish immediately: gem5 CPU types!
Schedules TimingSimpleCPU::fetch through:
Backtrace:
This is just the startup of the second rank, see: TimingSimpleCPU analysis #1.
From the timing we know what that one is: the end of time exit event, like for AtomicSimpleCPU.
Executes TimingSimpleCPU::fetch().
Schedules DRAMCtrl::processNextReqEvent through:
Schedules BaseXBar::Layer::releaseLayer through:
Executes DRAMCtrl::processNextReqEvent.
Schedules DRAMCtrl::Rank::processActivateEvent through:
Schedules DRAMCtrl::processRespondEvent through:
Schedules DRAMCtrl::processNextReqEvent through:
Executes DRAMCtrl::Rank::processActivateEvent.
Schedules DRAMCtrl::Rank::processPowerEvent through:
Executes DRAMCtrl::Rank::processPowerEvent.
Executes BaseXBar::Layer<SrcType, DstType>::releaseLayer.
Executes DRAMCtrl::processNextReqEvent().
Executes DRAMCtrl::processRespondEvent().
Schedules PacketQueue::processSendEvent() through:
Executes PacketQueue::processSendEvent().
Schedules PacketQueue::processSendEvent through:
Schedules BaseXBar::Layer<SrcType, DstType>::releaseLayer through:
Executes BaseXBar::Layer<SrcType, DstType>::releaseLayer.
Executes PacketQueue::processSendEvent.
Schedules TimingSimpleCPU::IcachePort::ITickEvent::process() through:
Executes TimingSimpleCPU::IcachePort::ITickEvent::process().
Schedules DRAMCtrl::processNextReqEvent through:
Schedules BaseXBar::Layer<SrcType, DstType>::releaseLayer through:
Execute DRAMCtrl::processNextReqEvent.
Schedule DRAMCtrl::processRespondEvent().
One important thing we want to check now, is how the memory reads are going to make the processor stall in the middle of an instruction.
Let’s just add --caches to gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis to see if things go any faster, and add Cache to --trace as in:
Contrast this with the non --cache version seen at TimingSimpleCPU analysis #5 in which DRAM only actually reads the 4 required bytes.
The only cryptic thing about the messages is the IF flag, but good computer architects would have guessed correctly that it means "instruction fetch", and src/mem/packet.cc confirms:
At:
#7 0: Cache: system.cpu.icache: access for ReadReq [78:7b] IF miss
#8 0: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 1000
#9 1000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 executed @ 1000
#10 1000: Cache: system.cpu.icache: sendMSHRQueuePacket: MSHR ReadReq [78:7b] IF
#12 1000: Cache: system.cpu.icache: createMissPacket: created ReadCleanReq [40:7f] IF from ReadReq [78:7b] IF
we can briefly see the gem5 MSHR doing its thing.
At time 0, the CPU icache wants to read, so it creates a packet that reads 4 bytes only ([78:7b]) for the instruction, and that goes into the MSHR, to be treated in a future event.
At 1000, the future event is executed, and so it reads the original packet from the MSHR, and uses that to create a new request [40:7f] which gets forwarded.
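The [40:7f] range is just cache-line alignment arithmetic. A minimal sketch, assuming 64-byte cache lines (which is what the 0x40..0x7f span implies):

#include <cassert>
#include <cstdint>

int main()
{
    const uint64_t blkSize = 64;                  // assumed cache line size
    uint64_t reqAddr = 0x78;                      // CPU asked for [78:7b], 4 bytes
    uint64_t blkAddr = reqAddr & ~(blkSize - 1);  // align down -> 0x40
    uint64_t blkEnd = blkAddr + blkSize - 1;      // -> 0x7f
    assert(blkAddr == 0x40 && blkEnd == 0x7f);    // the ReadCleanReq [40:7f]
    return 0;
}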
MOESI cache coherence protocol: https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/cache/cache_blk.hh#L352
The actual representation is done via separate state bits: https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/cache/cache_blk.hh#L66 and MOESI appears explicitly only on the pretty printing.
This pretty printing appears for example in the --trace Cache lines as shown at gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and with a few more transitions visible at Section 19.22.4.4, “gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs”.
It would be amazing to analyze a simple example with interconnect packets possibly invalidating caches of other CPUs.
Like gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs but with gem5 TimingSimpleCPU and userland/c/atomic/aarch64_add.c:
./build-userland --arch aarch64 --optimization-level 3 --userland-build-id o3
./run \
  --arch aarch64 \
  --cli-args '2 1000' \
  --cpus 3 \
  --emulator gem5 \
  --trace FmtFlag,CacheAll,DRAM,Event,ExecAll,SimpleCPU,XBar \
  --userland userland/c/atomic/aarch64_add.c \
  --userland-build-id o3 \
  -- \
  --caches \
  --cpu-type TimingSimpleCPU \
;
This is arguably the best experiment to study the gem5 crossbar interconnect.
We increase the loop count because a smaller count did not show memory conflicts. The output is:
expect 200
global 147
Let’s double check what it compiles to with disas:
./disas --arch aarch64 --userland userland/c/atomic/aarch64_add.c --userland-build-id o3 my_thread_main
which contains:
0x0000000000400a70 <+0>:  03 00 40 f9  ldr x3, [x0]
0x0000000000400a74 <+4>:  63 01 00 b4  cbz x3, 0x400aa0 <my_thread_main+48>
0x0000000000400a78 <+8>:  82 00 00 d0  adrp x2, 0x412000 <malloc@got.plt>
0x0000000000400a7c <+12>: 42 a0 01 91  add x2, x2, #0x68
0x0000000000400a80 <+16>: 00 00 80 d2  mov x0, #0x0 // #0
0x0000000000400a84 <+20>: 1f 20 03 d5  nop
0x0000000000400a88 <+24>: 41 00 40 f9  ldr x1, [x2]
0x0000000000400a8c <+28>: 21 04 00 91  add x1, x1, #0x1
0x0000000000400a90 <+32>: 41 00 00 f9  str x1, [x2]
0x0000000000400a94 <+36>: 00 04 00 91  add x0, x0, #0x1
0x0000000000400a98 <+40>: 7f 00 00 eb  cmp x3, x0
0x0000000000400a9c <+44>: 68 ff ff 54  b.hi 0x400a88 <my_thread_main+24> // b.pmore
0x0000000000400aa0 <+48>: 00 00 80 52  mov w0, #0x0 // #0
0x0000000000400aa4 <+52>: c0 03 5f d6  ret
Grepping the logs for my_thread_main+24 shows where the first non-atomic interleavings happen:
471039000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+24 : ldr x1, [x2] : MemRead : D=0x000000000000002f A=0x412068 flags=(IsInteger|IsMemRef|IsLoad)
471034000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+24 : ldr x1, [x2] : MemRead : D=0x000000000000002f A=0x412068 flags=(IsInteger|IsMemRef|IsLoad)
471059000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+44 : b.hi <my_thread_main+24> : IntAlu : flags=(IsControl|IsDirectControl|IsCondControl)
471070000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+44 : b.hi <my_thread_main+24> : IntAlu : flags=(IsControl|IsDirectControl|IsCondControl)
471071000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+24 : ldr x1, [x2] : MemRead : D=0x0000000000000030 A=0x412068 flags=(IsInteger|IsMemRef|IsLoad)
after a long string of cpu1 hits, since CPU1 was forked first and therefore had more time to run that operation.
From those and the surrounding logs we deduce that:
the shared address of interest is 0x412068
the physical address is 2068
it fits into the cache line for 2040:207f
With that guide, we look at the fuller logs around that region of interest. We start at the first ifetch that CPU2 does for our LDR of interest at 0x400a88:
471033000: SimpleCPU: system.cpu2: Fetch
471033000: SimpleCPU: system.cpu2: Translating address 0x400a88
Things get a bit interleaved with CPU1, but soon afterwards we see the miss forwarding via gem5 MSHR as in gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches:
471034000: Cache: system.cpu2.dcache: access for ReadReq [2068:206f] D=b0d989c328560000 ptr=0x5628c3d26f00 miss
471034000: CachePort: system.cpu2.dcache.mem_side: Scheduling send event at 471035000
471034000: Event: system.cpu2.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 140 scheduled @ 471035000
Before the request moves on, some CPU1 action happens: a CPU1 STR finished! It hit the cache, and now we know the cache state: M:
471034000: Cache: system.cpu1.dcache: access for WriteReq [2068:206f] D=2f00000000000000 ptr=0x5628c3d26c80 hit state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
471034000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+32 : str x1, [x2] : MemWrite : D=0x000000000000002f A=0x412068 flags=(IsInteger|IsMemRef|IsStore)
After this is done, CPU2 dcache finally decides that it is time to forward its request, and now we see the crux of this experiment happen.
First createMissPacket creates a new packet for the cache request, and then it sends that packet into CoherentXBar.
471035000: Event: system.cpu2.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 140 executed @ 471035000 +471035000: Cache: system.cpu2.dcache: sendMSHRQueuePacket: MSHR ReadReq [2068:206f] D=b0d989c328560000 ptr=0x5628c3d26f00 +471035000: Cache: system.cpu2.dcache: createMissPacket: created ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 from ReadReq [2068:206f] D=b0d989c328560000 ptr=0x5628c3d26f00 +471035000: CoherentXBar: system.membus: recvTimingReq: src system.membus.slave[10] packet ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80+
Now, the SnoopFilter which lives inside the crossbar decides if any other CPUs care about that address:
471035000: SnoopFilter: system.membus.snoop_filter: lookupRequest: src system.membus.slave[10] packet ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 +471035000: SnoopFilter: system.membus.snoop_filter: lookupRequest: SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000 +471035000: SnoopFilter: system.membus.snoop_filter: lookupRequest: new SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000 +471035000: CoherentXBar: system.membus: recvTimingReq: src system.membus.slave[10] packet ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 SF size: 1 lat: 1+
and the answer is yes: CPU1 does care about that address obviously! So the packet is forwarded as is to CPU1:
+471035000: CoherentXBar: system.membus: forwardTiming for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 +471035000: CacheVerbose: system.cpu1.dcache: recvTimingSnoopReq: for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 +471035000: CacheVerbose: system.cpu1.dcache: handleSnoop: for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 +471035000: Cache: system.cpu1.dcache: handleSnoop: snoop hit for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80, old state is state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0 +471035000: Cache: system.cpu1.dcache: new state is state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0 +471035000: Cache: system.cpu1.dcache: doTimingSupplyResponse: for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 +471035000: CacheVerbose: system.cpu1.dcache: doTimingSupplyResponse: created response: ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 tick: 471044000 +471035000: Event: system.cpu1.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 94 scheduled @ 471044000 +471035000: CoherentXBar: system.membus: recvTimingReq: Not forwarding ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 +471035000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 164 scheduled @ 471036000 +471035000: BaseXBar: system.membus.reqLayer0: The crossbar layer is now busy from tick 471035000 to 471036000+
and from this we see that this read request from CPU2 made the cache line in CPU1 go from M to O!
Then, the CPU1 dcache actually goes ahead and creates a response for CPU2, since it has the data. This response is sent back to the crossbar, which will forward it back to CPU2.
This also makes the crossbar not forward the original request to DRAM, as seen in the "Not forwarding" message.
This reply from CPU1 reaches the crossbar at:
+471044000: Event: system.cpu1.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 94 executed @ 471044000 +471044000: CoherentXBar: system.membus: recvTimingSnoopResp: src system.membus.slave[6] packet ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 +471044000: SnoopFilter: system.membus.snoop_filter: updateSnoopResponse: rsp system.membus.slave[6] req system.membus.slave[10] packet ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 +471044000: SnoopFilter: system.membus.snoop_filter: updateSnoopResponse: old SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000 +471044000: SnoopFilter: system.membus.snoop_filter: updateSnoopResponse: new SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101000 +471044000: CoherentXBar: system.membus: recvTimingSnoopResp: src system.membus.slave[6] packet ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 FWD RESP +471044000: Event: system.membus.slave[10]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 186 scheduled @ 471046000 +471044000: Event: system.membus.respLayer10.wrapped_function_event: EventFunctionWrapped 187 scheduled @ 471049000 +471044000: BaseXBar: system.membus.respLayer10: The crossbar layer is now busy from tick 471044000 to 471049000+
and finally, at long last, CPU2 receives the snoop reply that was created in CPU1 and sent back through the crossbar, and the LDR completes:
+471046000: Event: system.membus.slave[10]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 186 executed @ 471046000 +471046000: Cache: system.cpu2.dcache: recvTimingResp: Handling response ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 +471046000: Cache: system.cpu2.dcache: Block for addr 0x2040 being updated in Cache +471046000: CacheRepl: system.cpu2.dcache: Replacement victim: state: 0 (I) valid: 0 writable: 0 readable: 0 dirty: 0 | tag: 0xffffffffffffffff set: 0x81 way: 0 +471046000: Cache: system.cpu2.dcache: Block addr 0x2040 (ns) moving from state 0 (I) to state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0 +471046000: Cache: system.cpu2.dcache: serviceMSHRTargets: updated cmd to ReadRespWithInvalidate [2068:206f] D=2f00000000000000 ptr=0x5628c3d26f00 +471046000: Event: system.cpu2.dcache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 138 scheduled @ 471047000 +471046000: Cache: system.cpu2.dcache: processing deferred snoop... +471046000: CacheVerbose: system.cpu2.dcache: handleSnoop: for UpgradeReq [2040:207f] D= ptr=0x5628c2d37b80 +471046000: Cache: system.cpu2.dcache: handleSnoop: snoop hit for UpgradeReq [2040:207f] D= ptr=0x5628c2d37b80, old state is state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0 +471046000: Cache: system.cpu2.dcache: new state is state: 0 (I) valid: 0 writable: 0 readable: 0 dirty: 0 | tag: 0xffffffffffffffff set: 0x81 way: 0 +471046000: CacheVerbose: system.cpu2.dcache: recvTimingResp: Leaving with ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 +471047000: Event: system.cpu2.dcache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 138 executed @ 471047000 +471047000: SimpleCPU: system.cpu2.dcache_port: Received load/store response 0x2068 +471047000: Event: Event_136: Timing CPU dcache tick 136 scheduled @ 471047000 +471047000: Event: Event_136: Timing CPU dcache tick 136 executed @ 471047000 +471034000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+24 : ldr x1, [x2] : MemRead : D=0x000000000000002f A=0x412068 flags=(IsInteger|IsMemRef|IsLoad)+
We note therefore that no DRAM access was involved: one cache services the other directly!
Tested on LKMC d429552cdeb0fc0a08cff8e627bf501eaffb068f + 1, gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772.
Now let's do the exact same thing we did for gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs, but with Ruby rather than the classic system, and with TimingSimpleCPU (atomic does not work with Ruby).
The events for the Atomic CPU were pretty simple: basically just ticks.
TODO like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard but with the hazard.
Like gem5 event queue MinorCPU syscall emulation freestanding example analysis but even more complex since for the gem5 DerivO3CPU!
This section and children are tested at LKMC 144a552cf926ea630ef9eadbb22b79fe2468c456.
Let’s have a look at the arguably simplest example userland/arch/aarch64/freestanding/linux/hazardless.S.
so we observe that the first two instructions arrived, and the CPU noticed that 0x400080 hasn’t been fetched yet.
Then for several cycles that follow, the fetch stage just says that it is blocked on data returning:
Now let’s do the same as in gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazardless but with a hazard: userland/arch/aarch64/freestanding/linux/hazard.S.
Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard but now with an LDR stall: userland/arch/aarch64/freestanding/linux/stall.S.
Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall but now with an LDR stall: userland/arch/aarch64/freestanding/linux/stall-gain.S.
Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall-gain but now with some dependencies after the LDR: userland/arch/aarch64/freestanding/linux/stall-hazard4.S.
Now let’s try to see some Speculative execution in action with userland/arch/aarch64/freestanding/linux/speculative.S.
This is one of the parts of gem5 that rely on semi-useless code generation inside the .isa sublanguage.
We also notice that the key argument passed to those instructions is of type ExecContext, which is discussed further at: Section 19.22.6.3, “gem5 ExecContext”.
The file is an include so that compilation can be split up into chunks by the autogenerated includers
Tested in gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.
execute vs initiateAcc vs completeAcc

These are the key methods defined in instruction definitions, so let's see when each one gets called and what they do, more or less.
This can be seen concretely in GDB from the analysis done at: TimingSimpleCPU analysis: LDR stall and for more memory details see gem5 functional vs atomic vs timing memory requests.
completeAcc

completeAcc is boring on most simple store memory instructions, e.g. a simple STR:
Some gem5 instructions break down into multiple microops.
ThreadContext vs ThreadState vs ExecContext vs Process

These classes get used everywhere, and they have a somewhat convoluted relation with one another, so let's figure out this mess.
This section and all children tested at gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.
ThreadContext

As we delve into more details below, we will reach the following conclusion: a ThreadContext represents one thread of a CPU with multiple Hardware threads.
Essentially all methods of the base ThreadContext are pure virtual.
SimpleThread

SimpleThread storage is defined on BaseSimpleCPU for simple CPUs like AtomicSimpleCPU:
O3ThreadContext

Instantiation happens in the FullO3CPU constructor:
ThreadState

One is owned per ThreadContext.
ExecContext

ExecContext gets used in gem5 instruction definitions, e.g.:
This makes sense, since each ThreadContext represents one CPU register set, and therefore needs a separate ExecContext which allows instruction implementations to access those registers.
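As a rough sketch of what that looks like from the instruction side, here is a toy execute() in the style of the generated ArmISAInst code; the class name and operand indices are made up for illustration, and only the ExecContext accessors are the point:

Fault
MyAddInst::execute(ExecContext *xc, Trace::InstRecord *traceData) const
{
    // Operands are resolved through the ExecContext, never directly
    // against a concrete CPU model's register file.
    uint64_t op1 = xc->readIntRegOperand(this, 0);
    uint64_t op2 = xc->readIntRegOperand(this, 1);
    uint64_t dest = op1 + op2;
    xc->setIntRegOperand(this, 0, dest);
    if (traceData)
        traceData->setData(dest);
    return NoFault;
}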
ExecContext::readIntRegOperand register resolution

Let's have a look at how ExecContext::readIntRegOperand actually matches registers to decoded register IDs, since it is not obvious.
Process

The Process class is used only for gem5 syscall emulation mode, and it represents a process like a Linux userland process, in addition to any further gem5 specific data needed to represent the process.
Each instruction is marked with a class, and each class can execute in a given functional unit.
MinorCPU default functional units

Which units are available is visible for example on the gem5 config.ini of a gem5 MinorCPU run. Functional units are not present in simple CPUs like gem5 TimingSimpleCPU.
On gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772, after running:
gem5 uses a ton of code generation, which makes the project horrendous:
But it has been widely overused to insanity. It likely also exists partly because when the project started in 2003 C++ compilers weren’t that good, so you couldn’t rely on features like templates that much.
Generated code at: build/<ISA>/config/the_isa.hh which e.g. for ARM contains:
gem5 moves a bit slowly, and if your host compiler is very new, the gem5 build might be broken for it, e.g. this was the case for Ubuntu 19.10 with GCC 9 and gem5 62d75e7105fe172eb906d4f80f360ff8591d4178 from Dec 2019.
E.g. src/cpu/decode_cache.hh includes:
Upstream request: https://gem5.atlassian.net/browse/GEM5-469
userland/c/atomic/: files in this directory use the same technique as atomic.cpp, i.e. with one special case per file.
Maybe userland/c/atomic.c should be deprecated in favor of those more minimal ones.
This was added because the C++ pre-main is too bloated, especially when we turn on a gazillion gem5 logs; it makes me want to cry.
And we want a single operation per test, rather than several as in atomic.c, because when using gem5 we want absolute control over the microbenchmark.
Demonstrates atomic_int and thrd_create.
C version at: atomic.c.
In this set of examples, we exemplify various synchronization mechanisms, including assembly specific ones, by using the convenience of C++ multithreading:
rootfs_overlay/lkmc/nodejs/http.js: http module to create a simple HTTP server: https://nodejs.org/api/http.html
https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec15.pdf contains some of the first pictures you should see.
In a direct-mapped cache architecture (every address has a single possible block), a memory address can be broken up into:

+-----+-------+--------------+
|     |       |              | full address
+-----+-------+--------------+
|     |       |              |
| tag | index | block offset |

where:
index: determines in which block the address will go. This is the "index/ID of the block" it will go into!
tag: allows us to differentiate between multiple addresses that have the same index
We really want the tag to be the higher bits, so that consecutive blocks can be present in the cache at once.
block offset: address within the cache line. Not used to find cache blocks at all! Only used to find the data within the cache line.
If the cache is set associative, we simply make the index smaller and add bits to the tag.
For example, for a 2-way associative cache, we remove one bit from the index and add it to the tag.
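A small sketch of the decomposition, assuming for concreteness 64-byte lines and 256 sets (example parameters only). With those numbers, the physical address 0x2068 from the earlier cache traces decomposes into tag 0, index 0x81 and offset 0x28, which happens to match the "tag: 0 set: 0x81" printed by the cache logs.

#include <cstdint>
#include <cstdio>

int main()
{
    const uint64_t blockBits = 6;   // 64-byte lines -> 6 offset bits
    const uint64_t indexBits = 8;   // 256 sets      -> 8 index bits (example)
    uint64_t addr = 0x2068;

    uint64_t offset = addr & ((1ULL << blockBits) - 1);
    uint64_t index = (addr >> blockBits) & ((1ULL << indexBits) - 1);
    uint64_t tag = addr >> (blockBits + indexBits);

    // For a 2-way set associative cache of the same size we would use
    // indexBits = 7 and the freed bit would become part of the tag.
    printf("tag=%#lx index=%#lx offset=%#lx\n", tag, index, offset);
    return 0;
}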
Even if caches are coherent, this is still not enough to avoid data race conditions, because coherence does not enforce atomicity of read-modify-write sequences. This is for example shown at: Detailed gem5 analysis of how data races happen.
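A minimal C++ illustration of that point, along the lines of what the userland/c/atomic examples do: with coherent caches both threads always see up-to-date values, yet the plain read-modify-write still loses updates, while the atomic one does not.

#include <atomic>
#include <cstdio>
#include <thread>

int plain = 0;                    // plain read-modify-write: races
std::atomic<int> atomic_cnt{0};   // atomic read-modify-write: does not

void worker(int n)
{
    for (int i = 0; i < n; ++i) {
        plain = plain + 1;        // load, add, store: updates can be lost
        atomic_cnt.fetch_add(1);  // single atomic RMW: never loses updates
    }
}

int main()
{
    const int n = 1000000;
    std::thread t1(worker, n), t2(worker, n);
    t1.join();
    t2.join();
    // plain typically ends up below 2 * n, atomic_cnt is exactly 2 * n.
    printf("expect %d plain %d atomic %d\n", 2 * n, plain, atomic_cnt.load());
    return 0;
}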
According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf "memory consistency" is about ordering requirements of different memory addresses.
This is represented explicitly in C++ for example C++ std::memory_order.
According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf, sequential consistency is the strongest possible consistency: everything is nicely ordered as you'd expect.
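In C++ the ordering requirement is chosen per atomic operation: memory_order_seq_cst is the default and corresponds to the sequential consistency just mentioned, while weaker orders such as acquire/release or relaxed allow more reordering. A minimal sketch:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data{0};
std::atomic<bool> ready{false};

void producer()
{
    data.store(42, std::memory_order_relaxed);
    // Release: everything written before this store becomes visible to an
    // acquire load that observes `true`.
    ready.store(true, std::memory_order_release);
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) {}
    // Guaranteed with acquire/release (or the default seq_cst). With
    // relaxed on both sides it could fail on a weakly ordered CPU like ARM.
    assert(data.load(std::memory_order_relaxed) == 42);
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}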
Either they can snoop only control, or both control and data can be snooped.
If data snoops are not possible, then data must always go to DRAM first.
Mentioned at:
why PrWr stays in invalid? Why do writes always go to memory? Why not wait until eviction?
TODO gem5 concrete example.
TODO understand well why those are needed.
gem5 12c917de54145d2d50260035ba7fa614e25317a3 has two Ruby MESI models implemented: MESI_Two_Level and MESI_Three_Level.
https://en.wikipedia.org/wiki/MOSI_protocol The critical MSI vs MOSI section was a bit bogus though: https://en.wikipedia.org/w/index.php?title=MOSI_protocol&oldid=895443023 but I edited it :-)
This therefore saves one memory write through and its bus traffic.
MESI cache coherence protocol + MOSI cache coherence protocol, not much else to it!
In gem5 9fc9c67b4242c03f165951775be5cd0812f2a705, MOESI is the default cache coherency protocol of the classic memory system as shown at Section 19.22.4.3.1, “What is the coherency protocol implemented by the classic cache system in gem5?”.
A good and simple example showing several MOESI transitions in the classic memory model can be seen at: Section 19.22.4.4, “gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs”.
gem5 12c917de54145d2d50260035ba7fa614e25317a3 has several Ruby MOESI models implemented: MOESI_AMD_Base, MOESI_CMP_directory, MOESI_CMP_token and MOESI_hammer.