mirror of https://github.com/cirosantilli/linux-kernel-module-cheat.git

master packets and requests a bit further

README.adoc

@@ -11785,41 +11785,6 @@ and we see that the first instruction runs at 0x80000000:

TODO: what are the boot memory and NOR FLASH used for?

==== gem5 disk and network latency

TODO These look promising:

@@ -14058,10 +14023,510 @@ and their selection can be seen under: `src/dev/arm/RealView.py`, e.g.:

cur_sys.boot_loader = [ loc('boot_emm.arm64'), loc('boot_emm.arm') ]
....

=== gem5 memory system

Parent section: <<gem5-internals>>.

==== gem5 port system

The gem5 memory system is connected in a very flexible way through the port system.

This system exists to allow seamlessly connecting any combination of CPU, caches, interconnects, DRAM and peripherals.

A <<gem5-packet,`Packet`>> is the basic unit of information that gets sent across ports.

===== gem5 functional vs atomic vs timing memory requests

gem5 memory requests can be classified into the following broad categories:

* functional: get the value magically, without updating caches, see also: <<gem5-functional-requests>>
* atomic: get the value now without making a <<gem5-event-queue,separate event>>, but do not update caches. Cannot work in <<gem5-ruby-build,Ruby>> due to fundamental limitations, mentioned in passing at: https://gem5.atlassian.net/browse/GEM5-676
* timing: get the value while simulating delays and updating caches

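As a rough sketch of what the three flavors look like from the requester's side (a hedged illustration: these `MasterPort` send methods do exist, but the packet setup is omitted and exact signatures vary a bit across gem5 versions):

....
// Functional: magic instantaneous access, bypasses the timing model.
// The reply data is already in pkt when the call returns.
masterPort.sendFunctional(pkt);

// Atomic: completes within this single call and returns an
// approximate latency, but no events are scheduled.
Tick latency = masterPort.sendAtomic(pkt);

// Timing: only *initiates* the access. The peer may refuse it for now,
// and the reply arrives later as a separate event via recvTimingResp().
bool accepted = masterPort.sendTimingReq(pkt);
....
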
This trichotomy can notably be seen in the definition of the https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/port.hh#L75[MasterPort class]:

....
class MasterPort : public Port, public AtomicRequestProtocol,
    public TimingRequestProtocol, public FunctionalRequestProtocol
....

and the base classes are defined under `src/mem/protocol/`.

Then, by reading the rest of the class, we see that the send methods are all boring, and just forward to some polymorphic receiver that does the actual interesting activity:

....
Tick
sendAtomicSnoop(PacketPtr pkt)
{
    return AtomicResponseProtocol::sendSnoop(_masterPort, pkt);
}

Tick
AtomicResponseProtocol::sendSnoop(AtomicRequestProtocol *peer, PacketPtr pkt)
{
    assert(pkt->isRequest());
    return peer->recvAtomicSnoop(pkt);
}
....

The receive methods are therefore the interesting ones, and must be overridden on derived classes if they ever expect to receive such requests:

....
Tick
recvAtomicSnoop(PacketPtr pkt) override
{
    panic("%s was not expecting an atomic snoop request\n", name());
    return 0;
}

void
recvFunctionalSnoop(PacketPtr pkt) override
{
    panic("%s was not expecting a functional snoop request\n", name());
}

void
recvTimingSnoopReq(PacketPtr pkt) override
{
    panic("%s was not expecting a timing snoop request.\n", name());
}
....

One question that comes up now is: why do CPUs need to care about <<cache-coherence,snoop requests>> at all?

One big answer is: to be able to implement LLSC atomicity as mentioned at <<arm-ldxr-and-stxr-instructions>>, since when other cores update memory, they could invalidate the lock of the current core.

Then, as you might expect, we can see that for example `AtomicSimpleCPU` does not override `recvTimingSnoopReq`.

Now let's see which requests are generated by an ordinary <<arm-ldr-instruction>>. We run:

....
./run \
--arch aarch64 \
--debug-vm \
--emulator gem5 \
--gem5-build-type debug \
--userland userland/arch/aarch64/freestanding/linux/hello.S \
....

and then break at the methods of the LDR class `LDRXL64_LIT`: <<gem5-execute-vs-initiateacc-vs-completeacc>>.

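Concretely, from inside the GDB session that `--debug-vm` opens, breakpoints along these lines do the job (a hypothetical session; the method names are the ones discussed below):

....
b ArmISAInst::LDRXL64_LIT::execute
b ArmISAInst::LDRXL64_LIT::initiateAcc
b ArmISAInst::LDRXL64_LIT::completeAcc
run
bt
....
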
Before starting, we of course guess that:

* `AtomicSimpleCPU` will be making atomic accesses from `execute`
* `TimingSimpleCPU` will be making timing accesses from `initiateAcc`, which must generate the event that leads to `completeAcc`

so let's confirm it.

We break on `ArmISAInst::LDRXL64_LIT::execute`, which is what `AtomicSimpleCPU` uses, and that leads as expected to:

....
MasterPort::sendAtomic
AtomicSimpleCPU::sendPacket
AtomicSimpleCPU::readMem
SimpleExecContext::readMem
readMemAtomic<(ByteOrder)1, ExecContext, unsigned long>
readMemAtomicLE<ExecContext, unsigned long>
ArmISAInst::LDRXL64_LIT::execute
AtomicSimpleCPU::tick
....

Notably, `AtomicSimpleCPU::readMem` immediately translates the address, creates a packet, sends the atomic request, and gets the response back without any events.

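A condensed paraphrase of what `AtomicSimpleCPU::readMem` does (not a verbatim quote, details elided):

....
// Build and translate the request right now, no events involved.
req->setVirt(addr, size, flags, dataMasterId(), pc);
fault = thread->dtb->translateAtomic(req, thread->getTC(), BaseTLB::Read);

// Stack-allocated packet: it only needs to live for this one call.
Packet pkt(req, Packet::makeReadCmd(req));
pkt.dataStatic(data);  // the reply will fill `data` in place

// Blocking atomic send: the reply is in pkt when this returns.
dcache_latency += sendPacket(dcachePort, &pkt);
....
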
And now if we do the same with `--cpu-type TimingSimpleCPU`, break at `ArmISAInst::LDRXL64_LIT::initiateAcc`, and then add another breakpoint for the next event schedule with `b EventManager::schedule` (which we imagine is the memory read), we reach:

....
EventManager::schedule
DRAMCtrl::addToReadQueue
DRAMCtrl::recvTimingReq
DRAMCtrl::MemoryPort::recvTimingReq
TimingRequestProtocol::sendReq
MasterPort::sendTimingReq
CoherentXBar::recvTimingReq
CoherentXBar::CoherentXBarSlavePort::recvTimingReq
TimingRequestProtocol::sendReq
MasterPort::sendTimingReq
TimingSimpleCPU::handleReadPacket
TimingSimpleCPU::sendData
TimingSimpleCPU::finishTranslation
DataTranslation<TimingSimpleCPU*>::finish
ArmISA::TLB::translateComplete
ArmISA::TLB::translateTiming
ArmISA::TLB::translateTiming
TimingSimpleCPU::initiateMemRead
SimpleExecContext::initiateMemRead
initiateMemRead<ExecContext, unsigned long>
ArmISAInst::LDRXL64_LIT::initiateAcc
TimingSimpleCPU::completeIfetch
TimingSimpleCPU::IcachePort::ITickEvent::process
EventQueue::serviceOne
....

so as expected we have `TimingRequestProtocol::sendReq`.

Remember however that timing requests are a bit more complicated due to <<arm-paging,paging>>, since the page table walk can itself lead to further memory requests.

In this particular instance, the address being read with `ldr x2, =len` <<arm-ldr-pseudo-instruction>> is likely placed just after the text section, so its translation is already in the TLB due to previous instruction fetches. We can tell because the translation finished immediately, going through `TimingSimpleCPU::finishTranslation`. Some key snippets are:

....
TLB::translateComplete(const RequestPtr &req, ThreadContext *tc,
    Translation *translation, Mode mode, TLB::ArmTranslationType tranType,
    bool callFromS2)
{
    bool delay = false;
    Fault fault;
    if (FullSystem)
        fault = translateFs(req, tc, mode, translation, delay, true, tranType);
    else
        fault = translateSe(req, tc, mode, translation, delay, true);
    if (!delay)
        translation->finish(fault, req, tc, mode);
    else
        translation->markDelayed();
....

and then `translateSe` does not use `delay` at all, so we learn that in syscall emulation `delay` is always `false` and things progress immediately. Further down, `TimingSimpleCPU::finishTranslation` does some more fault checking:

....
void
TimingSimpleCPU::finishTranslation(WholeTranslationState *state)
{
    if (state->getFault() != NoFault) {
        translationFault(state->getFault());
    } else {
        if (!state->isSplit) {
            sendData(state->mainReq, state->data, state->res,
                     state->mode == BaseTLB::Read);
....

Tested in gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.

====== gem5 functional requests

As seen at <<gem5-functional-vs-atomic-vs-timing-memory-requests>>, functional requests are not used in common simulation, since the core must always go through the caches.

Functional accesses are therefore only used for the more magic simulation functionalities.

One such functionality is the <<gem5-syscall-emulation-mode>> implementation of the <<futex-system-call>>, which is done at `futexFunc` in https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/sim/syscall_emul.hh#L394[`src/sim/syscall_emul.hh`].

As seen from `man futex`, the Linux kernel reads the value from an address that is given as the first argument of the call.

Therefore, here it makes sense for the gem5 syscall implementation, which does not actually have a real kernel running, to just make a functional request and be done with it, since the impact of the cache changes done by this read would be insignificant compared to the cost of the full context switch that would happen on a real syscall.

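The gist of it, as a hedged sketch (names approximate, they vary across gem5 versions; `uaddr` and `val` are the futex syscall arguments):

....
// Read the futex word with a functional access: no packets through
// the timing model, no cache state touched.
uint32_t futex_val = tc->getVirtProxy().read<uint32_t>(uaddr);
if (futex_val != val)
    return -EWOULDBLOCK;  // value changed, FUTEX_WAIT must not sleep
....
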
It is generally hard to implement functional requests for <<gem5-ruby-build,Ruby>> runs, because packets are flying through the memory system in a transient state, and there is no simple way of finding exactly which ones might have the latest version of the memory. See for example:

* https://gem5.atlassian.net/browse/GEM5-496
* https://gem5.atlassian.net/browse/GEM5-604
* https://gem5.atlassian.net/browse/GEM5-675
* https://gem5.atlassian.net/browse/GEM5-676

The typical error message in that case is:

....
fatal: Ruby functional read failed for address
....

==== gem5 `Packet` vs `Request`

===== gem5 `Packet`

`Packet` is what goes through <<gem5-port-system,ports>>: a single packet is sent out to the memory system, gets modified when it hits valid data, and then returns with the reply.

`Packet` is what CPUs create and send to get memory values. E.g. on <<gem5-atomicsimplecpu>>:

....
void
AtomicSimpleCPU::tick()
{
    ...
    Packet ifetch_pkt = Packet(ifetch_req, MemCmd::ReadReq);
    ifetch_pkt.dataStatic(&inst);

    icache_latency = sendPacket(icachePort, &ifetch_pkt);

Tick
AtomicSimpleCPU::sendPacket(MasterPort &port, const PacketPtr &pkt)
{
    return port.sendAtomic(pkt);
}
....

On <<gem5-timingsimplecpu,TimingSimpleCPU>>, we note that the packet is dynamically allocated, unlike in `AtomicSimpleCPU`: it must survive across multiple <<gem5-event-queue,events>>, which happen on separate function calls, whereas an atomic access completes immediately within a single call:

....
void
TimingSimpleCPU::sendFetch(const Fault &fault, const RequestPtr &req,
                           ThreadContext *tc)
{
    if (fault == NoFault) {
        DPRINTF(SimpleCPU, "Sending fetch for addr %#x(pa: %#x)\n",
                req->getVaddr(), req->getPaddr());
        ifetch_pkt = new Packet(req, MemCmd::ReadReq);
        ifetch_pkt->dataStatic(&inst);
        DPRINTF(SimpleCPU, " -- pkt addr: %#x\n", ifetch_pkt->getAddr());

        if (!icachePort.sendTimingReq(ifetch_pkt)) {
....

It must then delete the reply packet that it gets back, e.g. for the ifetch:

....
TimingSimpleCPU::completeIfetch(PacketPtr pkt)
{
    if (pkt) {
        delete pkt;
    }
....

The most important properties of a `Packet` are:

* `PacketDataPtr data;`: the data coming back from a reply packet, or being sent out through it
* `Addr addr;`: the physical address of the data. TODO: the comment says it could be virtual too, when?
+
....
/// The address of the request. This address could be virtual or
/// physical, depending on the system configuration.
Addr addr;
....
* `Flags flags;`: flags describing properties of the `Packet`
* `MemCmd cmd;`: this is basically an enumeration of possible commands, stuff like:
+
....
enum Command
{
    InvalidCmd,
    ReadReq,
    ReadResp,
....
+
Each command has a fixed number of attributes defined in the static array:
+
....
static const CommandInfo commandInfo[];
....
+
which gets initialized in the .cc file in the same order as the `Command` enum:
+
....
const MemCmd::CommandInfo
MemCmd::commandInfo[] =
{
    /* InvalidCmd */
    { 0, InvalidCmd, "InvalidCmd" },
    /* ReadReq - Read issued by a non-caching agent such as a CPU or
     * device, with no restrictions on alignment. */
    { SET3(IsRead, IsRequest, NeedsResponse), ReadResp, "ReadReq" },
    /* ReadResp */
    { SET3(IsRead, IsResponse, HasData), InvalidCmd, "ReadResp" },
....
+
From this we see for example that both `ReadReq` and `ReadResp` are marked with the `IsRead` attribute.
+
The second field of this array also specifies the corresponding reply of each request, e.g. the reply of a `ReadReq` is a `ReadResp`. `InvalidCmd` is just a placeholder for commands that are already replies:
+
....
struct CommandInfo
{
    /// Set of attribute flags.
    const std::bitset<NUM_COMMAND_ATTRIBUTES> attributes;
    /// Corresponding response for requests; InvalidCmd if no
    /// response is applicable.
    const Command response;
    /// String representation (for printing)
    const std::string str;
};
....

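That `response` field is what lets memory objects turn a request packet into its own reply in place; `Packet::makeResponse` uses it, roughly as follows (a hedged paraphrase, not a verbatim quote):

....
// Hedged paraphrase: convert this request packet into the matching
// response, e.g. ReadReq becomes ReadResp.
void
Packet::makeResponse()
{
    assert(needsResponse());
    assert(isRequest());
    origCmd = cmd;
    cmd = cmd.responseCommand();
}
....
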
===== gem5 `Request`

One good way to think about `Request` vs `Packet` could be that the `Request` "is what the <<gem5-instruction-definitions,instruction definitions>> see", a bit like `ExecContext` vs `ThreadContext`.

`Request` is passed to the constructor of `Packet`, and `Packet` keeps a reference to it:

....
Packet(const RequestPtr &_req, MemCmd _cmd)
    : cmd(_cmd), id((PacketId)_req.get()), req(_req),
      data(nullptr), addr(0), _isSecure(false), size(0),
      _qosValue(0), headerDelay(0), snoopDelay(0),
      payloadDelay(0), senderState(NULL)
{
    if (req->hasPaddr()) {
        addr = req->getPaddr();
        flags.set(VALID_ADDR);
        _isSecure = req->isSecure();
    }
    if (req->hasSize()) {
        size = req->getSize();
        flags.set(VALID_SIZE);
    }
}
....

where `RequestPtr` is defined as:

....
typedef std::shared_ptr<Request> RequestPtr;
....

so we see that shared pointers to requests are basically what gets passed around.

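Putting the two together, the typical requester-side pattern looks something like this (a hedged sketch; `paddr`, `size`, `flags`, `masterId` and `buf` are placeholders, not names from the source):

....
// The Request is shared_ptr-managed and may outlive the Packet.
RequestPtr req = std::make_shared<Request>(paddr, size, flags, masterId);

// The Packet borrows the Request and owns the transient transport state.
PacketPtr pkt = new Packet(req, MemCmd::ReadReq);
pkt->dataStatic(buf);  // reply data will be written into buf
....
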
Some key fields include:

* `_paddr`:
+
....
/**
 * The physical address of the request. Valid only if validPaddr
 * is set.
 */
Addr _paddr = 0;
....
* `_vaddr`:
+
....
/** The virtual address of the request. */
Addr _vaddr = MaxAddr;
....

====== gem5 `Request` in `AtomicSimpleCPU`

In `AtomicSimpleCPU`, a single request of each type is kept for the entire CPU, e.g.:

....
RequestPtr ifetch_req;
....

and it gets created at construction time:

....
AtomicSimpleCPU::AtomicSimpleCPU(AtomicSimpleCPUParams *p)
{
    ifetch_req = std::make_shared<Request>();
....

and then it gets modified for each access:

....
setupFetchRequest(ifetch_req);
....

which does:

....
req->setVirt(fetchPC, sizeof(MachInst), Request::INST_FETCH,
             instMasterId(), instAddr);
....

Virtual to physical address translation done by the CPU then stores the physical address into the request:

....
fault = thread->dtb->translateAtomic(req, thread->getTC(),
                                     BaseTLB::Read);
....

which eventually calls, e.g. on full system with the MMU enabled:

....
Fault
TLB::translateMmuOn(ThreadContext* tc, const RequestPtr &req, Mode mode,
    Translation *translation, bool &delay, bool timing,
    bool functional, Addr vaddr,
    ArmFault::TranMethod tranMethod)
{
    req->setPaddr(pa);
....

====== gem5 `Request` in `TimingSimpleCPU`

In <<gem5-timingsimplecpu,TimingSimpleCPU>>, the request gets created per memory read:

....
Fault
TimingSimpleCPU::initiateMemRead(Addr addr, unsigned size,
                                 Request::Flags flags,
                                 const std::vector<bool>& byte_enable)
{
    ...
    RequestPtr req = std::make_shared<Request>(
        addr, size, flags, dataMasterId(), pc, thread->contextId());
....

and from <<gem5-functional-vs-atomic-vs-timing-memory-requests>> we remember that `initiateMemRead` is actually started from the `initiateAcc` instruction definitions for timing:

....
Fault LDRWL64_LIT::initiateAcc(ExecContext *xc,
    Trace::InstRecord *traceData) const
{
    ...
    fault = initiateMemRead(xc, traceData, EA, Mem, memAccessFlags);
....

From this we see that the `initiateAcc` of memory instructions is basically extracting the information required for the request, notably the address `EA` and the flags.

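For completeness, the other half of the story: when the reply event finally fires, the CPU calls back into the instruction to deposit the loaded value. Roughly (a hedged paraphrase of `TimingSimpleCPU::completeDataAccess`):

....
// The reply packet has arrived; let the instruction consume it
// via its completeAcc method, which writes the loaded value back.
fault = curStaticInst->completeAcc(pkt, threadInfo[curThread], traceData);
....
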
==== gem5 `MSHR`

Mentioned at: http://pages.cs.wisc.edu/~swilson/gem5-docs/gem5MemorySystem.html

Each cache object owns an `MSHRQueue`:

....
class BaseCache : public ClockedObject
{
    /** Miss status registers */
    MSHRQueue mshrQueue;
....

`BaseCache` is the base class of `Cache` and `NoncoherentCache`.

`MSHRQueue` is a `Queue` of `MSHR`:

....
class MSHRQueue : public Queue<MSHR>
....

and `Queue` is also a gem5 class, defined under `src/mem/cache/queue.hh`.

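For intuition, this is roughly what the queue is for on the timing miss path (a hedged paraphrase of `BaseCache`, not verbatim):

....
// On a miss, check whether a fetch of this block is already in flight.
Addr blk_addr = pkt->getBlockAddr(blkSize);
MSHR *mshr = mshrQueue.findMatch(blk_addr, pkt->isSecure());
if (mshr) {
    // Yes: just attach this packet as another target of the same MSHR.
    mshr->allocateTarget(pkt, forward_time, order++, allocOnFill(pkt->cmd));
} else {
    // No: allocate a new MSHR and send a fetch down the hierarchy.
    allocateMissBuffer(pkt, forward_time);
}
....
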
==== gem5 `CommMonitor`

You can place this <<gem5-python-c-interaction,SimObject>> in between two <<gem5-port-system,ports>> to get extra statistics about the packets that are going through.

It only works on <<gem5-functional-vs-atomic-vs-timing-memory-requests,timing requests>>, and does not seem to dump any memory values, only adding extra <<gem5-m5out-stats-txt-file,statistics>>.

For example, the patch link:patches/manual/gem5-commmonitor-se.patch[] hacks a `CommMonitor` in between the CPU and the L1 cache on top of gem5 1c3662c9557c85f0d25490dc4fbde3f8ab0cb350:

....
patch -d "$(./getvar gem5_source_dir)" -p 1 < patches/manual/gem5-commmonitor-se.patch
....

That patch was done largely by copying what `fs.py --memcheck` does with a `MemChecker` object.

You can then run with:

....
./run \
--arch aarch64 \
--emulator gem5 \
--userland userland/arch/aarch64/freestanding/linux/hello.S \
-- \
--caches \
--cpu-type TimingSimpleCPU \
;
....

and we now have some new extra histogram statistics such as:

....
system.cpu.dcache_mon.readBurstLengthHist::samples 1
....

One neat thing about this is that it is agnostic to the memory object type, so you don't have to recode those statistics for every new type of object that operates on memory packets.

=== gem5 internals

Internals under other sections:

* <<gem5-memory-system>>
* <<gem5-trace-internals>>
* <<gem5-checkpoint-internals>>
* <<gem5-graphic-mode-internals>>

@@ -15989,7 +16454,7 @@ We can confirm this with `--trace DRAM` which shows:

Contrast this with the non `--caches` version seen at <<timingsimplecpu-analysis-5>>, in which DRAM only actually reads the 4 required bytes.

The only cryptic thing about the messages is the `IF` flag, but good computer architects will have correctly guessed that it means "instruction fetch", and https://github.com/gem5/gem5/blob/fa70478413e4650d0058cbfe81fd5ce362101994/src/mem/packet.cc#L372[src/mem/packet.cc] confirms:

....
void

@@ -17323,204 +17788,6 @@ From their names, which are of course not specified in the <<armarm8>>, we guess

From the gem5 source code, we see that STP is a `class LdpStp : public PairMemOp`, and then the constructor of `PairMemOp` sets up the microops depending on the exact type of LDP/STP:

==== gem5 `ThreadContext` vs `ThreadState` vs `ExecContext` vs `Process`

These classes get used everywhere, and they have a somewhat convoluted relation to one another, so let's untangle this mess.

@@ -26858,7 +27125,33 @@ fails with:

fatal: fatal condition interrupts.size() != numThreads occurred: CPU system.cpu has 1 interrupt controllers, but is expecting one per thread (2)
....

=== Caches

https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec15.pdf contains some of the first pictures you should see.

In a direct-mapped cache architecture (every address maps to a single possible block), a memory address can be broken up into:

....
+-----+-------+--------------+
|     |       |              | full address
+-----+-------+--------------+
|     |       |              |
| tag | index | block offset |
....

where:

* index: determines in which block the address will go. This is the "index/ID of the block" it will go into!
* tag: allows us to differentiate between multiple addresses that have the same index
+
We really want the tag to be the higher bits, so that consecutive blocks can be present in the cache at the same time.
* block offset: the offset within the cache line. Not used to locate the line at all! Only used to find the data within the cache line.

If the cache is set associative, we simply make the index smaller and add bits to the tag.

For example, for a 2-way associative cache, we remove one bit from the index and add it to the tag.

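As a concrete worked example (numbers of my own choosing, not from the source): with 32-bit addresses, 64-byte lines and 1024 blocks, the offset takes 6 bits, the index 10 bits, and the tag the remaining 16 bits:

....
#include <cstdint>

// Hypothetical direct-mapped cache: 64-byte lines, 1024 blocks.
constexpr unsigned blockBits = 6;   // log2(64)
constexpr unsigned indexBits = 10;  // log2(1024)

uint32_t blockOffset(uint32_t addr) { return addr & ((1u << blockBits) - 1); }
uint32_t index(uint32_t addr) { return (addr >> blockBits) & ((1u << indexBits) - 1); }
uint32_t tag(uint32_t addr) { return addr >> (blockBits + indexBits); }

// Making this 2-way associative would drop indexBits to 9 (512 sets of 2),
// moving the freed bit into the tag.
....
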
==== Cache coherence

https://en.wikipedia.org/wiki/Cache_coherence

@@ -26882,17 +27175,17 @@ The main software use case example to have in mind is that of multiple threads i

Even if caches are coherent, this is still not enough to avoid data race conditions, because coherence does not enforce the atomicity of read-modify-write sequences. This is for example shown at: <<detailed-gem5-analysis-of-how-data-races-happen>>.

===== Memory consistency

According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf[], "memory consistency" is about the ordering requirements between accesses to different memory addresses.

This is represented explicitly in C++ for example by <<cpp-memory-order>>.

====== Sequential Consistency

According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf[], this is the strongest possible consistency: everything is nicely ordered as you'd expect.

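For a flavor of how memory ordering surfaces in C++ (a minimal example of my own, not from the source):

....
#include <atomic>

std::atomic<int> x{0};

void writer() {
    x.store(1, std::memory_order_relaxed);  // ordering left to the hardware
    x.store(2, std::memory_order_seq_cst);  // globally ordered with all
                                            // other seq_cst operations
}
....
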
===== Can caches snoop data from other caches?

Either they can snoop only control, or both control and data can be snooped.

@@ -26902,7 +27195,7 @@ This is the central point in question at: https://electronics.stackexchange.com/

If data snoops are not possible, then data must always go to DRAM first.

===== VI cache coherence protocol

Mentioned at:

@@ -26993,7 +27286,7 @@ TODO at:

Why does PrWr stay in Invalid? Why do writes always go to memory? Why not wait until eviction?

===== MSI cache coherence protocol

https://en.wikipedia.org/wiki/MSI_protocol

@@ -27135,7 +27428,7 @@ TODO Wikipedia requires a Flush there, why? https://electronics.stackexchange.co

TODO gem5 concrete example.

====== MSI cache coherence protocol with transient states

TODO understand well why those are needed.

@@ -27143,7 +27436,7 @@ TODO understand well why those are needed.

* https://www.researchgate.net/figure/MSI-Protocol-with-Transient-States-Adapted-from-30_fig3_2531432
* http://csg.csail.mit.edu/6.823S16/lectures/L15.pdf page 28

===== MESI cache coherence protocol

https://en.wikipedia.org/wiki/MESI_protocol

@@ -27171,7 +27464,7 @@ With MESI:

gem5 12c917de54145d2d50260035ba7fa614e25317a3 has two <<gem5-ruby-build,Ruby>> MESI models implemented: `MESI_Two_Level` and `MESI_Three_Level`.

===== MOSI cache coherence protocol

https://en.wikipedia.org/wiki/MOSI_protocol The critical MSI vs MOSI section was a bit bogus though: https://en.wikipedia.org/w/index.php?title=MOSI_protocol&oldid=895443023 but I edited it :-)

@@ -27200,7 +27493,7 @@ and MOSI would do:

This therefore saves one memory write-through and its bus traffic.

[[moesi]]
===== MOESI cache coherence protocol

https://en.wikipedia.org/wiki/MOESI_protocol