From bcdc72552d9df7d8550335977f485dbee61c08f8 Mon Sep 17 00:00:00 2001
From: Ciro Santilli 六四事件 法轮功
Date: Thu, 7 May 2020 01:00:00 +0000
Subject: [PATCH] 252dd80500cd3aa36cffc1ea6474f32a1900ec91

---
 index.html | 1305 +++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 1245 insertions(+), 60 deletions(-)

diff --git a/index.html b/index.html
index 558cf84..bf1647e 100644
--- a/index.html
+++ b/index.html
@@ -1226,7 +1226,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
@@ -7863,6 +7890,24 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex

And a native Ubuntu 18.04 AMD64 run with static compilation segfaults.

+

As of LKMC f5d4998ff51a548ed3f5153aacb0411d22022058 the aarch64 error:

+
+
+
+
./run --arch aarch64 --userland userland/cpp/atomic/fail.cpp --static
+
+
+
+

is:

+
+
+
+
terminate called after throwing an instance of 'std::system_error'
+  what():  Unknown error 16781344
+qemu: uncaught target signal 6 (Aborted) - core dumped
+
+
+

The workaround:

@@ -7871,7 +7916,7 @@ qemu-x86_64: /path/to/linux-kernel-module-cheat/submodules/qemu/accel/tcg/cpu-ex
-

fixes some of the problems, but not all, so we are just skipping those tests for now.

+

fixes some of the problems, but not all (TODO: which ones were still failing?), so we are just skipping those tests for now.

@@ -21153,6 +21198,9 @@ Indirect leak of 1346 byte(s) in 2 object(s) allocated from:
+

Note that the --ruby option has some crazy side effects besides enabling Ruby, e.g. it sets the default --cpu-type to TimingSimpleCPU instead of the otherwise default AtomicSimpleCPU. But why?

+
+

It is not possible to build more than one Ruby system into a single build, and this is a major pain point for testing Ruby: https://gem5.atlassian.net/browse/GEM5-467

@@ -21239,6 +21287,36 @@ cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"

This is the simplest of all protocols, and therefore the first one you should study to learn how Ruby works.

+ +
+

Our full command line will be something like

+
+
+
+
./build-gem5 --arch aarch64 --gem5-build-id MI_example
+./run \
+  --arch aarch64 \
+  --cli-args '2 100' \
+  --cpus 3 \
+  --emulator gem5 \
+  --userland userland/cpp/atomic/aarch64_add.cpp \
+  --gem5-build-id MI_example \
+  -- \
+  --ruby \
+;
+
+
+
+

which produces a config.dot.svg like the following, but with 3 CPUs instead of 2:

+
+
+
gem5 config TimingSimpleCPU 3 CPUs MI example b1623cb2087873f64197e503ab8894b5e4d4c7b4
+
Figure 2. config.dot.svg for a system with three TimingSimpleCPU CPUs with the Ruby MI_example protocol.
+
19.15.6.2. gem5 crossbar interconnect
@@ -21965,6 +22043,170 @@ exec filecode in scope

Tested at gem5 b4879ae5b0b6644e6836b0881e4da05c64a6550d.

+
+
19.20.3.1. gem5 m5.objects module
+
+

All SimObjects seem to be automatically added to the m5.objects namespace, and this is done in a very convoluted way, let’s try to understand a bit:

+
+
+
+
src/python/m5/objects/__init__.py
+
+
+
+

contains:

+
+
+
+
modules = __loader__.modules
+
+for module in modules.keys():
+    if module.startswith('m5.objects.'):
+        exec("from %s import *" % module)
+
+
+
+

And from IPDB we see that this appears to loop over every module name string of the form m5.objects.modulename.

+
+
+

This init gets called from src/python/importer.py at the exec:

+
+
+
+
class CodeImporter(object):
+    def load_module(self, fullname):
+            override = os.environ.get('M5_OVERRIDE_PY_SOURCE', 'false').lower()
+            if override in ('true', 'yes') and  os.path.exists(abspath):
+                src = open(abspath, 'r').read()
+                code = compile(src, abspath, 'exec')
+
+            if os.path.basename(srcfile) == '__init__.py':
+                mod.__path__ = fullname.split('.')
+                mod.__package__ = fullname
+            else:
+                mod.__package__ = fullname.rpartition('.')[0]
+            mod.__file__ = srcfile
+
+            exec(code, mod.__dict__)
+
+import sys
+importer = CodeImporter()
+add_module = importer.add_module
+sys.meta_path.append(importer)
+
+
+
+

Here, as a bonus, we also see how M5_OVERRIDE_PY_SOURCE works.

+
+
+

In src/SConscript we see that SimObject is just a PySource with module equal to m5.objects:

+
+
+
+
class SimObject(PySource):
+    def __init__(self, source, tags=None, add_tags=None):
+        '''Specify the source file and any tags (automatically in
+        the m5.objects package)'''
+        super(SimObject, self).__init__('m5.objects', source, tags, add_tags)
+
+
+
+

The add_module method seems to be doing the magic and is called from src/sim/init.cc:

+
+
+
+
bool
+EmbeddedPython::addModule() const
+{
+    PyObject *code = getCode();
+    PyObject *result = PyObject_CallMethod(importerModule, PyCC("add_module"),
+
+
+
+

which is called from:

+
+
+
+
int
+EmbeddedPython::initAll()
+{
+    // Load the importer module
+    PyObject *code = importer->getCode();
+    importerModule = PyImport_ExecCodeModule(PyCC("importer"), code);
+    if (!importerModule) {
+        PyErr_Print();
+        return 1;
+    }
+
+    // Load the rest of the embedded python files into the embedded
+    // python importer
+    list<EmbeddedPython *>::iterator i = getList().begin();
+    list<EmbeddedPython *>::iterator end = getList().end();
+    for (; i != end; ++i)
+        if (!(*i)->addModule())
+
+
+
+

and getList comes from:

+
+
+
+
EmbeddedPython::EmbeddedPython(const char *filename, const char *abspath,
+    const char *modpath, const unsigned char *code, int zlen, int len)
+    : filename(filename), abspath(abspath), modpath(modpath), code(code),
+      zlen(zlen), len(len)
+{
+    // if we've added the importer keep track of it because we need it
+    // to bootstrap.
+    if (string(modpath) == string("importer"))
+        importer = this;
+    else
+        getList().push_back(this);
+}
+
+list<EmbeddedPython *> &
+EmbeddedPython::getList()
+{
+    static list<EmbeddedPython *> the_list;
+    return the_list;
+}
+
+
+
+

and the constructor in turn gets called from per-SimObject autogenerated files such as dev/storage/Ide.py.cc for src/dev/storage/Ide.py:

+
+
+
+
EmbeddedPython embedded_m5_objects_Ide(
+    "m5/objects/Ide.py",
+    "/home/ciro/bak/git/linux-kernel-module-cheat/data/gem5/master4/src/dev/storage/Ide.py",
+    "m5.objects.Ide",
+    data_m5_objects_Ide,
+    947,
+    2099);
+
+} // anonymous namespace
+
+
+
+

which get autogenerated at src/SConscript:

+
+
+
+
def embedPyFile(target, source, env):
+
+for source in PySource.all:
+    base_py_env.Command(source.cpp, [ py_marshal, source.tnode ],
+                        MakeAction(embedPyFile, Transform("EMBED PY")))
+
+
+
+

where PySource.all, as you might expect, is a static list of all PySource source files, which gets updated in the constructor.
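
This is the same self-registration idiom as EmbeddedPython::getList above. A minimal standalone C++ rendering of the pattern (my own sketch, not actual gem5 or SCons code):

#include <list>

// Every constructed instance appends itself to a function-local static
// list, so all instances ever created can be iterated later, just like
// PySource.all and EmbeddedPython::getList().
struct SelfRegistering {
    SelfRegistering() { all().push_back(this); }
    static std::list<SelfRegistering *> &all() {
        static std::list<SelfRegistering *> the_list; // created on first use
        return the_list;
    }
};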

+
+
+

Tested in gem5 d9cb548d83fa81858599807f54b52e5be35a6b03.

+
+

19.20.4. gem5 event queue

@@ -21972,6 +22214,9 @@ exec filecode in scope

gem5 is an event based simulator, and as such the event queue is one of the crucial elements in the system.

+

Every single action that takes time (e.g. notably reading from memory) models that time delay by scheduling an event in the future.

+
+

The gem5 event queue stores one callback event for each future point in time.
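
To make the idea concrete, here is a minimal self-contained C++ sketch of an event-driven loop (hypothetical code, not gem5's actual EventQueue classes): callbacks are keyed by a future tick and run in time order.

#include <cstdint>
#include <cstdio>
#include <functional>
#include <map>

int main() {
    using Tick = std::uint64_t;
    // The event queue: callbacks keyed by the tick at which they fire.
    std::multimap<Tick, std::function<void()>> queue;
    Tick now = 0;

    // Model "memory answers 100 ticks from now" by scheduling an event.
    queue.emplace(now + 100, [] { std::puts("memory response arrives"); });

    // Main loop: jump straight to the next scheduled tick and run it.
    while (!queue.empty()) {
        auto it = queue.begin();
        now = it->first;
        it->second();
        queue.erase(it);
    }
}

gem5's real queue adds more machinery on top of this basic shape, e.g. event priorities for events scheduled at the same tick.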

@@ -22477,8 +22722,26 @@ clock=500
-
        fault = thread->itb->translateAtomic(ifetch_req, thread->getTC(),
-                                                BaseTLB::Execute);
+
fault = thread->itb->translateAtomic(ifetch_req, thread->getTC(),
+                                        BaseTLB::Execute);
+
+
+
+

and later on after translation the memory is obtained at:

+
+
+
+
icache_latency = sendPacket(icachePort, &ifetch_pkt);
+
+
+
+

which sends the packet atomically through the port:

+
+
+
+
AtomicSimpleCPU::sendPacket(MasterPort &port, const PacketPtr &pkt) {
+    return port.sendAtomic(pkt);
+}
@@ -22486,15 +22749,15 @@ clock=500
-
        thread->itb->translateTiming(ifetch_req, thread->getTC(),
-                &fetchTranslation, BaseTLB::Execute);
+
thread->itb->translateTiming(ifetch_req, thread->getTC(),
+        &fetchTranslation, BaseTLB::Execute);

and so there it is: the ITB classes are the same, but there are separate Atomic and Timing methods!

-

The Timing one calls ArmISA::TLB::translateComplete

+

The timing request is shown further at: sends the packet atomically.

Tested in gem5 b4879ae5b0b6644e6836b0881e4da05c64a6550d.

@@ -22708,13 +22971,13 @@ info: Entering event queue @ 0. Starting simulation...
-

Looking into the generated config.dot.svg can give a better intuition on the shape of the memory system: Figure 2, “config.dot.svg for a TimingSimpleCPU without caches.”, so it is good to keep that in mind.

+

Looking into the generated config.dot.svg can give a better intuition on the shape of the memory system: Figure 3, “config.dot.svg for a TimingSimpleCPU without caches.”, so it is good to keep that in mind.

gem5 config TimingSimpleCPU 12c917de54145d2d50260035ba7fa614e25317a3
-
Figure 2. config.dot.svg for a TimingSimpleCPU without caches.
+
Figure 3. config.dot.svg for a TimingSimpleCPU without caches.

It is also helpful to see this as a tree of events where one execute event schedules other events:

@@ -23621,6 +23884,9 @@ TimingSimpleCPU::IcachePort::ITickEvent::process

One important thing we want to check now, is how the memory reads are going to make the processor stall in the middle of an instruction.

+

This is also discussed at: gem5 execute vs initiateAcc vs completeAcc.

+
+

Since we were using a simple CPU without a pipeline, the data memory access stalls everything: there is no further progress until memory comes back.

@@ -23653,20 +23919,16 @@ TimingSimpleCPU::IcachePort::ITickEvent::process
-

So, where is the execute happening? Well, I’ll satisfy myself with a quick source grep and guess:

+

We can verify that execute never happens by putting a breakpoint on ArmISAInst::LDRXL64_LIT::execute which never gets called.

-
- +
+

Later on, when the memory access completes the event calls TimingSimpleCPU::completeDataAccess which calls ArmISAInst::LDRXL64_LIT::completeAcc, which sets the register value to what was read from memory.

+
+
+

More memory event details can be seen at: gem5 functional vs atomic vs timing memory requests.

The following is the region of interest of the event log:

@@ -23805,7 +24067,7 @@ TimingSimpleCPU::IcachePort::ITickEvent::process

Notably, we now see that very little time passed between the first and second instructions which are marked with ExecEnable in #39 and #47, presumably because rather than going out all the way to the DRAM system the event chain stops right at the icache.cpu_side when a hit happens, which must have been the case for the second instruction, which is just adjacent to the first one.

-

It is also interested to look into the generated config.dot.svg to compare it to the one without caches: Figure 2, “config.dot.svg for a TimingSimpleCPU without caches.”. With caches: Figure 3, “config.dot.svg for a TimingSimpleCPU with caches.”.

+

It is also interesting to look into the generated config.dot.svg to compare it to the one without caches: Figure 3, “config.dot.svg for a TimingSimpleCPU without caches.”. With caches: Figure 4, “config.dot.svg for a TimingSimpleCPU with caches.”.

We can see from there, that we now have icache and dcache elements inside the CPU block, and that the CPU icache and dcache ports go through the caches to the SystemXBar rather than being directly connected as before.

@@ -23817,7 +24079,7 @@ TimingSimpleCPU::IcachePort::ITickEvent::process
gem5 config TimingSimpleCPU caches 12c917de54145d2d50260035ba7fa614e25317a3
-
Figure 3. config.dot.svg for a TimingSimpleCPU with caches.
+
Figure 4. config.dot.svg for a TimingSimpleCPU with caches.

We can break down the events between the instructions as follows.

@@ -24036,14 +24298,32 @@ type=SetAssociative

If we don’t use such instructions that flush memory, we would only see the interconnect at work when caches run out.

+
+

For this study, we will use the same CLI as gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis but with multiple CPUs and a program like atomic.cpp which shares a variable across threads.

+
+
+

We will then focus on the behaviour of the memory of the shared variable to see if we can observe cache coherency on the crossbar:

+
+
+
+
./run \
+  --arch aarch64 \
+  --cli-args '2 100' \
+  --cpus 3 \
+  --emulator gem5 \
+  --gem5-worktree master3 \
+  --userland userland/cpp/atomic/aarch64_add.cpp \
+;
+
+
+
+

The config.dot.svg now looks like this but with 3 CPUs instead of 2:

+
gem5 config TimingSimpleCPU caches 2 CPUs 12c917de54145d2d50260035ba7fa614e25317a3
-
Figure 4. config.dot.svg for a system with two TimingSimpleCPU with caches.
-
-
-

The simplest setup to understand will be to use gem5 syscall emulation multiple executables.

+
Figure 5. config.dot.svg for a system with two TimingSimpleCPU with caches.
@@ -24086,7 +24366,601 @@ type=SetAssociative
-

19.20.5. gem5 ThreadContext vs ThreadState vs ExecContext vs Process

+

19.20.5. gem5 instruction definitions

+
+

This is one of the parts of gem5 that rely on semi-useless code generation inside the .isa sublanguage.

+
+
+

Which is mostly Python, with some magic letters thrown in for good measure.

+
+
+

The class definitions get all dumped into one humongous C++ include file:

+
+
+
+
build/ARM/arch/arm/generated/exec-ns.cc.inc
+
+
+
+

That file defines the key methods of each instruction, e.g. the ARM immediate ADD instruction has its execute method defined there:

+
+
+
+
    Fault AddImm::execute(
+        ExecContext *xc, Trace::InstRecord *traceData) const
+
+
+
+

or for example the key methods of an ARM 64-bit (X) STR with an immediate offset (STR <Xt>, [<Xn|SP>], #<simm>):

+
+
+
+
    Fault STRX64_IMM::execute(ExecContext *xc,
+                                  Trace::InstRecord *traceData) const
+
+    Fault STRX64_IMM::initiateAcc(ExecContext *xc,
+                                      Trace::InstRecord *traceData) const
+
+    Fault STRX64_IMM::completeAcc(PacketPtr pkt, ExecContext *xc,
+                                      Trace::InstRecord *traceData) const
+    {
+        return NoFault;
+    }
+
+
+
+

We also notice that the key argument passed to those instructions is of type ExecContext, which is discussed further at: Section 19.20.7.3, “gem5 ExecContext”.

+
+
+

The file is an include so that compilation can be split up into chunks by the autogenerated includers:

+
+
+
+
build/ARM/arch/arm/generated/generic_cpu_1.cc
+build/ARM/arch/arm/generated/generic_cpu_2.cc
+...
+
+
+
+

via the __SPLIT macro as in:

+
+
+
+
#include "exec-g.cc.inc"
+#include "cpu/exec_context.hh"
+#include "decoder.hh"
+namespace ArmISAInst {
+#define __SPLIT 1
+#include "exec-ns.cc.inc"
+}
+
+
+
+

This is likely done to not overload the C++ compiler. But it sure enough overloads IDEs and GDB, which take forever to load the source of any frames going through it.

+
+
+

We should split that file into one per class for the love of God.

+
+
+

The autogenerated instruction class declarations can be found at:

+
+
+
+
build/ARM/arch/arm/generated/decoder-ns.hh.inc
+
+
+
+

and the autogenerated bulk of the decoder:

+
+
+
+
build/ARM/arch/arm/generated/decoder-ns.cc.inc
+
+
+
+

which also happens to contain the constructor definitions of the instruction classes, e.g. for the ADD immediate because why not:

+
+
+
+
    AddImm::AddImm(ExtMachInst machInst,
+                                          IntRegIndex _dest,
+                                          IntRegIndex _op1,
+                                          uint32_t _imm,
+                                          bool _rotC)
+
+
+
+

The above files get tied together in the autogenerated:

+
+
+
+
build/ARM/arch/arm/generated/decoder.hh
+
+
+
+

which contains:

+
+
+
+
#include "decoder-g.hh.inc"
+namespace ArmISAInst {
+#include "decoder-ns.hh.inc"
+}
+
+
+
+

Different instructions inherit from different classes, e.g. the ARM immediate ADD instruction is a DataImmOp:

+
+
+
+
class AddImm : public DataImmOp
+{
+    public:
+        // Constructor
+        AddImm(ExtMachInst machInst, IntRegIndex _dest,
+                IntRegIndex _op1, uint32_t _imm, bool _rotC=true);
+        Fault execute(ExecContext *, Trace::InstRecord *) const override;
+};
+
+
+
+

and STRX64_IMM is an ArmISA::MemoryImm64:

+
+
+
+
    class STRX64_IMM : public ArmISA::MemoryImm64
+    {
+      public:
+
+        /// Constructor.
+        STRX64_IMM(ExtMachInst machInst,
+                IntRegIndex _dest, IntRegIndex _base, int64_t _imm);
+
+        Fault execute(ExecContext *, Trace::InstRecord *) const override;
+        Fault initiateAcc(ExecContext *, Trace::InstRecord *) const override;
+        Fault completeAcc(PacketPtr, ExecContext *,
+                          Trace::InstRecord *) const override;
+
+        void
+        annotateFault(ArmFault *fault) override
+        {
+                    fault->annotate(ArmFault::SAS, 3);
+                    fault->annotate(ArmFault::SSE, false);
+                    fault->annotate(ArmFault::SRT, dest);
+                    fault->annotate(ArmFault::SF, true);
+                    fault->annotate(ArmFault::AR, false);
+        }
+    };
+
+
+
+

but different memory instructions can have different base classes too e.g. STXR:

+
+
+
+
class STXRX64 : public ArmISA::MemoryEx64
+
+
+
+

A summarized class hierarchy for the above is:

+
+
+
• StaticInst
    • ArmISA::ArmStaticInst
        • ArmISA::PredOp
            • ArmISA::DataImmOp
                • ArmISA::AddImm
            • ArmISA::MightBeMicro64
                • ArmISA::Memory64
                    • ArmISA::MemoryImm64
                        • ArmISA::MemoryEx64
                            • ArmISA::STXRX64
+
+
+

Tested in gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.

+
+
+
19.20.5.1. gem5 execute vs initiateAcc vs completeAcc
+
+

These are the key methods defined in instruction definitions, so let’s see when each one gets called and what they do, more or less.

+
+
+

execute is the only one of the three that gets defined by "non-memory" instructions.

+
+
+

Memory instructions define all three.

+
+
+

The three methods are present in the base class StaticInst:

+
+
+
+
    virtual Fault execute(ExecContext *xc,
+                          Trace::InstRecord *traceData) const = 0;
+
+    virtual Fault initiateAcc(ExecContext *xc,
+                              Trace::InstRecord *traceData) const
+    {
+        panic("initiateAcc not defined!");
+    }
+
+    virtual Fault completeAcc(Packet *pkt, ExecContext *xc,
+                              Trace::InstRecord *traceData) const
+    {
+        panic("completeAcc not defined!");
+    }
+
+
+
+

so we see that all instructions must implement execute, while overriding initiateAcc and completeAcc is optional and only done by classes for which those might get called: memory instructions.

+
+
+

execute is what does the actual job for non-memory instructions (obviously, since it is the only one of the three methods that is defined as not panic for those).

+
+
+

Memory instructions however run either:

+
+
+
    +
  • +

    execute in AtomicSimpleCPU: this does the entire memory access in one go

    +
  • +
  • +

    initiateAcc + completeAcc in timing CPUs. initiateAcc is called when the instruction starts executing, and completeAcc is called when the memory fetch returns from the memory system.

    +
  • +
+
+
+

This can be seen concretely in GDB from the analysis done at: TimingSimpleCPU analysis: LDR stall, and for more memory details see gem5 functional vs atomic vs timing memory requests.
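
The control flow difference can be summarized with a minimal C++ sketch (hypothetical helper names; only execute, initiateAcc and completeAcc mirror gem5):

struct ExecContext {};
struct Packet {};

struct StaticInst {
    virtual ~StaticInst() = default;
    virtual void execute(ExecContext *xc) const = 0;                 // atomic path
    virtual void initiateAcc(ExecContext *xc) const {}               // timing: start access
    virtual void completeAcc(Packet *pkt, ExecContext *xc) const {}  // timing: finish access
};

// Atomic-style CPU: the whole access happens synchronously in execute().
void atomicTick(const StaticInst &inst, ExecContext &xc) {
    inst.execute(&xc);
}

// Timing-style CPU: initiateAcc() issues the request, and when the
// memory system's response event later fires, completeAcc() is called
// with the response packet.
void timingInitiate(const StaticInst &inst, ExecContext &xc) {
    inst.initiateAcc(&xc);
}
void timingMemoryResponse(const StaticInst &inst, ExecContext &xc, Packet &pkt) {
    inst.completeAcc(&pkt, &xc);
}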

+
+
+
19.20.5.1.1. gem5 completeAcc
+
+

completeAcc is boring on most simple store memory instructions, e.g. a simple STR:

+
+
+
+
    Fault STRX64_IMM::completeAcc(PacketPtr pkt, ExecContext *xc,
+                                      Trace::InstRecord *traceData) const
+    {
+        return NoFault;
+    }
+
+
+
+

This is because the store basically does all of its job in initiateAcc, creating the memory write request.

+
+
+

Loads however have non-trivial completeAcc, because now we have at the very least, to save the value read from memory into a CPU register.

+
+
+

Things are much more interesting however on more complex instructions, for example STXR (hand formatted here):

+
+
+
+
Fault STXRX64::completeAcc(PacketPtr pkt, ExecContext *xc,
+                                    Trace::InstRecord *traceData) const {
+    Fault fault = NoFault;
+    uint64_t XResult = 0;
+    uint32_t SevMailbox = 0;
+    uint32_t LLSCLock = 0;
+    uint64_t writeResult = pkt->req->getExtraData();
+    XResult = !writeResult; SevMailbox = 1; LLSCLock = 0;
+    if (fault == NoFault) {
+        {
+            uint64_t final_val = XResult;
+            xc->setIntRegOperand(this, 0, (XResult) & mask(aarch64 ? 64 : 32));
+            if (traceData) { traceData->setData(final_val); }
+        }
+        xc->setMiscRegOperand(this, 1, SevMailbox);
+        if (traceData) { traceData->setData(SevMailbox); }
+        xc->setMiscRegOperand(this, 2, LLSCLock);
+        if (traceData) { traceData->setData(LLSCLock); }
+    }
+    return fault;
+}
+
+
+
+

From GDB on TimingSimpleCPU analysis: LDR stall we see that completeAcc gets called from TimingSimpleCPU::completeDataAccess.

+
+
+
+
+
+

19.20.6. gem5 port system

+
+

The gem5 memory system is connected in a very flexible way through the port system.

+
+
+

This system exists to allow seamlessly connecting any combination of CPU, caches, interconnects, DRAM and peripherals.
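
The core idea can be shown with a toy self-contained C++ sketch (made-up names, not gem5's Port classes): two ports are bound as peers, and a send on one side becomes a receive callback on the other, no matter which components sit on each side.

#include <cassert>
#include <cstdio>

struct SketchPort {
    SketchPort *peer = nullptr;
    void bind(SketchPort &other) { peer = &other; other.peer = this; }
    void send(int packet) { assert(peer); peer->recv(packet); }
    virtual void recv(int packet) = 0;
    virtual ~SketchPort() = default;
};

struct SketchCpuPort : SketchPort {
    void recv(int packet) override { std::printf("cpu got %d\n", packet); }
};

struct SketchMemPort : SketchPort {
    void recv(int packet) override { std::printf("mem got %d\n", packet); }
};

int main() {
    SketchCpuPort cpu;
    SketchMemPort mem;
    cpu.bind(mem); // any component pair can be wired the same way
    cpu.send(42);  // arrives at mem.recv()
    mem.send(7);   // arrives at cpu.recv()
}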

+
+
+
19.20.6.1. gem5 functional vs atomic vs timing memory requests
+
+

gem5 memory requests can be classified in the following broad categories:

+
+
+
    +
  • +

    functional: get the value magically, do not update caches, see also: gem5 functional requests

    +
  • +
  • +

    atomic: get the value now without making a separate event, but do not update caches

    +
  • +
  • +

    timing: get the value simulating delays and updating caches

    +
  • +
+
+
+

This trichotomy can be notably seen in the definition of the MasterPort class:

+
+
+
+
class MasterPort : public Port, public AtomicRequestProtocol,
+    public TimingRequestProtocol, public FunctionalRequestProtocol
+
+
+
+

and the base classes are defined under src/mem/protocol/.

+
+
+

Then, by reading the rest of the class, we see that the send methods are all boring, and just forward to some polymorphic receiver that does the actual interesting activity:

+
+
+
+
    Tick
+    sendAtomicSnoop(PacketPtr pkt)
+    {
+        return AtomicResponseProtocol::sendSnoop(_masterPort, pkt);
+    }
+
+    Tick
+    AtomicResponseProtocol::sendSnoop(AtomicRequestProtocol *peer, PacketPtr pkt)
+    {
+        assert(pkt->isRequest());
+        return peer->recvAtomicSnoop(pkt);
+    }
+
+
+
+

The receive methods are therefore the interesting ones, and must be overridden on derived classes if they ever expect to receive such requests:

+
+
+
+
    Tick
+    recvAtomicSnoop(PacketPtr pkt) override
+    {
+        panic("%s was not expecting an atomic snoop request\n", name());
+        return 0;
+    }
+
+    void
+    recvFunctionalSnoop(PacketPtr pkt) override
+    {
+        panic("%s was not expecting a functional snoop request\n", name());
+    }
+
+    void
+    recvTimingSnoopReq(PacketPtr pkt) override
+    {
+        panic("%s was not expecting a timing snoop request.\n", name());
+    }
+
+
+
+

One question that comes up now is: but why do CPUs need to care about snoop requests?

+
+
+

And one big answer is: to be able to implement LLSC atomicity as mentioned at: ARM LDXR and STXR instructions, since when other cores update memory, they could invalidate the lock of the current core.
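
A conceptual sketch of why that matters (made-up names; gem5's actual LLSC tracking lives in the CPU and cache models): a snooped write to the monitored address clears the local exclusivity monitor, making a subsequent STXR fail.

#include <cstdint>

struct SketchLLSCMonitor {
    bool valid = false;
    std::uint64_t addr = 0;

    // LDXR: start monitoring the loaded address.
    void onLoadExclusive(std::uint64_t a) { valid = true; addr = a; }

    // Snooped write from another core: lose exclusivity on a hit.
    void onSnoopWrite(std::uint64_t a) { if (valid && a == addr) valid = false; }

    // STXR: succeeds only if the monitor is still valid.
    bool onStoreExclusive(std::uint64_t a) { return valid && a == addr; }
};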

+
+
+

Then, as you might expect, we can see that for example AtomicSimpleCPU does not override recvTimingSnoopReq.

+
+
+

Now let’s see which requests are generated by an ordinary ARM LDR instruction. We run:

+
+
+
+
./run \
+  --arch aarch64 \
+  --debug-vm \
+  --emulator gem5 \
+  --gem5-build-type debug \
+  --userland userland/arch/aarch64/freestanding/linux/hello.S \
+;
+
+
+
+

and then break at the methods of the LDR class LDRXL64_LIT: gem5 execute vs initiateAcc vs completeAcc.

+
+
+

Before starting, we of course guess that:

+
+
+
    +
  • +

    AtomicSimpleCPU will be making atomic accesses from execute

    +
  • +
  • +

    TimingSimpleCPU will be making timing accesses from initiateAcc, which must generate the event which leads to completeAcc

    +
  • +
+
+
+

so let’s confirm it.

+
+
+

We break on ArmISAInst::LDRXL64_LIT::execute which is what AtomicSimpleCPU uses, and that leads as expected to:

+
+
+
+
MasterPort::sendAtomic
+AtomicSimpleCPU::sendPacket
+AtomicSimpleCPU::readMem
+SimpleExecContext::readMem
+readMemAtomic<(ByteOrder)1, ExecContext, unsigned long>
+readMemAtomicLE<ExecContext, unsigned long>
+ArmISAInst::LDRXL64_LIT::execute
+AtomicSimpleCPU::tick
+
+
+
+

Notably, AtomicSimpleCPU::readMem immediately translates the address, creates a packet, sends the atomic request, and gets the response back without any events.

+
+
+

And now if we do the same with --cpu-type TimingSimpleCPU and break at ArmISAInst::LDRXL64_LIT::initiateAcc, and then add another break for the next event schedule with b EventManager::schedule (which we imagine is the memory read), we reach:

+
+
+
+
EventManager::schedule
+DRAMCtrl::addToReadQueue
+DRAMCtrl::recvTimingReq
+DRAMCtrl::MemoryPort::recvTimingReq
+TimingRequestProtocol::sendReq
+MasterPort::sendTimingReq
+CoherentXBar::recvTimingReq
+CoherentXBar::CoherentXBarSlavePort::recvTimingReq
+TimingRequestProtocol::sendReq
+MasterPort::sendTimingReq
+TimingSimpleCPU::handleReadPacket
+TimingSimpleCPU::sendData
+TimingSimpleCPU::finishTranslation
+DataTranslation<TimingSimpleCPU*>::finish
+ArmISA::TLB::translateComplete
+ArmISA::TLB::translateTiming
+ArmISA::TLB::translateTiming
+TimingSimpleCPU::initiateMemRead
+SimpleExecContext::initiateMemRead
+initiateMemRead<ExecContext, unsigned long>
+ArmISAInst::LDRXL64_LIT::initiateAcc
+TimingSimpleCPU::completeIfetch
+TimingSimpleCPU::IcachePort::ITickEvent::process
+EventQueue::serviceOne
+
+
+
+

so as expected we have TimingRequestProtocol::sendReq.

+
+
+

Remember however that timing requests are a bit more complicated due to paging, since the page table walk can itself lead to further memory requests.

+
+
+

In this particular instance, the address being read with the ldr x2, =len ARM LDR pseudo-instruction is likely placed just after the text section, and therefore its page table entry is already in the TLB due to previous instruction fetches, which is why the translation finishes immediately, going through TimingSimpleCPU::finishTranslation. Some key snippets are:

+
+
+
+
TLB::translateComplete(const RequestPtr &req, ThreadContext *tc,
+        Translation *translation, Mode mode, TLB::ArmTranslationType tranType,
+        bool callFromS2)
+{
+    bool delay = false;
+    Fault fault;
+    if (FullSystem)
+        fault = translateFs(req, tc, mode, translation, delay, true, tranType);
+    else
+        fault = translateSe(req, tc, mode, translation, delay, true);
+    if (!delay)
+        translation->finish(fault, req, tc, mode);
+    else
+        translation->markDelayed();
+
+
+
+

and then translateSe does not use delay at all, so we learn that in syscall emulation, delay is always false and things progress immediately there. And then further down TimingSimpleCPU::finishTranslation does some more fault checking:

+
+
+
+
void
+TimingSimpleCPU::finishTranslation(WholeTranslationState *state)
+{
+    if (state->getFault() != NoFault) {
+        translationFault(state->getFault());
+    } else {
+        if (!state->isSplit) {
+            sendData(state->mainReq, state->data, state->res,
+                     state->mode == BaseTLB::Read);
+
+
+
+

Tested in gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.

+
+
+
19.20.6.1.1. gem5 functional requests
+
+

As seen at gem5 functional vs atomic vs timing memory requests, functional requests are not used in common simulation, since the core must always go through caches.

+
+
+

Functional accesses are therefore only used for more magic simulation functionalities.

+
+
+

One such functionality is the gem5 syscall emulation mode implementation of the futex system call, which is done at futexFunc in src/sim/syscall_emul.hh.

+
+
+

As seen from man futex, the Linux kernel reads the value from an address that is given as the first argument of the call.

+
+
+

Therefore, here it makes sense for the gem5 syscall implementation, which does not actually have a real kernel running, to just make a functional request and be done with it, since the impact of cache changes done by this read would be insignificant compared to the cost of the actual full context switch that would happen on a real syscall.

+
+
+
+
+
+

19.20.7. gem5 ThreadContext vs ThreadState vs ExecContext vs Process

These classes get used everywhere, and they have a somewhat convoluted relation with one another, so let’s figure out this mess.

@@ -24097,7 +24971,7 @@ type=SetAssociative

This section and all children tested at gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.

-
19.20.5.1. gem5 ThreadContext
+
19.20.7.1. gem5 ThreadContext

As we delve into more details below, we will reach the following conclusion: a ThreadContext represents one thread of a CPU with multiple hardware threads.

@@ -24147,7 +25021,7 @@ typedef SimpleThread MinorThread;

Essentially all methods of the base ThreadContext are pure virtual.

-
19.20.5.1.1. gem5 SimpleThread
+
19.20.7.1.1. gem5 SimpleThread

SimpleThread storage defined on BaseSimpleCPU for simple CPUs like AtomicSimpleCPU:

@@ -24242,7 +25116,7 @@ typedef SimpleThread MinorThread;
-
19.20.5.1.2. gem5 O3ThreadContext
+
19.20.7.1.2. gem5 O3ThreadContext

Instantiation happens in the FullO3CPU constructor:

@@ -24343,7 +25217,7 @@ FullO3CPU<Impl>::readArchIntReg(int reg_idx, ThreadID tid)
-
19.20.5.2. gem5 ThreadState
+
19.20.7.2. gem5 ThreadState

Owned one per ThreadContext.

@@ -24389,12 +25263,17 @@ class O3ThreadContext : public ThreadContext
-
19.20.5.3. gem5 ExecContext
+
19.20.7.3. gem5 ExecContext
-

ExecContext gets used in instruction definitions, e.g.:

+

ExecContext gets used in gem5 instruction definitions, e.g.:

+
+
+
+
build/ARM/arch/arm/generated/exec-ns.cc.inc
+
-

build/ARM/arch/arm/generated/exec-ns.cc.inc

+

contains:

@@ -24543,9 +25422,281 @@ class O3ThreadContext : public ThreadContext

This makes sense, since each ThreadContext represents one CPU register set, and therefore needs a separate ExecContext which allows instruction implementations to access those registers.

+
+
19.20.7.3.1. gem5 ExecContext::readIntRegOperand register resolution
+
+

Let’s have a look at how ExecContext::readIntRegOperand actually matches operands to decoded register IDs, since it is not obvious.

+
+
+

Let’s study a simple aarch64 register register addition:

+
+
+
+
add x0, x1, x2
+
+
+
+

which corresponds to the AddXSReg instruction (formatted and simplified):

+
+
+
+
Fault AddXSReg::execute(ExecContext *xc, Trace::InstRecord *traceData) const {
+    uint64_t Op264 = 0;
+    uint64_t Dest64 = 0;
+    uint64_t Op164 = 0;
+    Op264 = ((xc->readIntRegOperand(this, 0)) & mask(intWidth));
+    Op164 = ((xc->readIntRegOperand(this, 1)) & mask(intWidth));
+    uint64_t secOp = shiftReg64(Op264, shiftAmt, shiftType, intWidth);
+    Dest64 = Op164 + secOp;
+    uint64_t final_val = Dest64;
+    xc->setIntRegOperand(this, 0, (Dest64) & mask(intWidth));
+    if (traceData) { traceData->setData(final_val); }
+    return NoFault;
+}
+
+
+
+

So what are those magic 0 and 1 constants on xc->readIntRegOperand(this, 0) and xc->readIntRegOperand(this, 1)?

+
+
+

First, we guess that they must be related to the reading of x1 and x2, which are the inputs of the addition.

+
+
+

Next, we also guess that the 0 read must correspond to x2, since it later gets potentially shifted as mentioned at Section 24.4.4.1, “ARM shift suffixes”.

+
+
+

Let’s also have a look at the decoder code that builds the instruction instance in build/ARM/arch/arm/generated/decoder-ns.cc.inc:

+
+
+
+
ArmShiftType type =
+    (ArmShiftType)(uint8_t)bits(machInst, 23, 22);
+if (type == ROR)
+    return new Unknown64(machInst);
+uint8_t imm6 = bits(machInst, 15, 10);
+if (!bits(machInst, 31) && bits(imm6, 5))
+    return new Unknown64(machInst);
+IntRegIndex rd = (IntRegIndex)(uint8_t)bits(machInst, 4, 0);
+IntRegIndex rdzr = makeZero(rd);
+IntRegIndex rn = (IntRegIndex)(uint8_t)bits(machInst, 9, 5);
+IntRegIndex rm = (IntRegIndex)(uint8_t)bits(machInst, 20, 16);
+
+return new AddXSReg(machInst, rdzr, rn, rm, imm6, type);
+
+
+
+

and the ARM assembly syntax from the ARMv8 architecture reference manual:

+
+
+
+
ADD <Xd>, <Xn>, <Xm>{, <shift> #<amount>}
+
+
+
+

and the constructor:

+
+
+
+
AddXSReg::AddXSReg(ExtMachInst machInst,
+    IntRegIndex _dest,
+    IntRegIndex _op1,
+    IntRegIndex _op2,
+    int32_t _shiftAmt,
+    ArmShiftType _shiftType
+) : DataXSRegOp("add", machInst, IntAluOp,
+                _dest, _op1, _op2, _shiftAmt, _shiftType) {
+    _numSrcRegs = 0;
+    _numDestRegs = 0;
+    _numFPDestRegs = 0;
+    _numVecDestRegs = 0;
+    _numVecElemDestRegs = 0;
+    _numVecPredDestRegs = 0;
+    _numIntDestRegs = 0;
+    _numCCDestRegs = 0;
+    _srcRegIdx[_numSrcRegs++] = RegId(IntRegClass, op2);
+    _destRegIdx[_numDestRegs++] = RegId(IntRegClass, dest);
+    _numIntDestRegs++;
+    _srcRegIdx[_numSrcRegs++] = RegId(IntRegClass, op1);
+    flags[IsInteger] = true;;
+}
+
+
+
+

where RegId is just a container class, and so the lines that we care about for now are:

+
+
+
+
_srcRegIdx[_numSrcRegs++] = RegId(IntRegClass, op2);
+_srcRegIdx[_numSrcRegs++] = RegId(IntRegClass, op1);
+
+
+
+

which matches the guess we made earlier: op2 is 0 and op1 is 1 (op1 and op2 are the same as _op1 and _op2 which are set in the base constructor DataXSRegOp).
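
To make the mapping concrete, here is a standalone worked example for add x0, x1, x2 that mirrors the constructor's assignments with plain ints (my own sketch, not gem5 code):

#include <cassert>

int main() {
    // Registers decoded from the instruction: dest = x0, op1 = x1, op2 = x2.
    const int dest = 0, op1 = 1, op2 = 2;

    // Mirror the constructor's operand table assignments.
    int srcRegIdx[2], destRegIdx[1];
    int numSrcRegs = 0, numDestRegs = 0;
    srcRegIdx[numSrcRegs++] = op2;    // operand index 0 -> x2
    destRegIdx[numDestRegs++] = dest; // dest index 0 -> x0
    srcRegIdx[numSrcRegs++] = op1;    // operand index 1 -> x1

    // So readIntRegOperand(this, 0) reads x2 and readIntRegOperand(this, 1) reads x1.
    assert(srcRegIdx[0] == 2 && srcRegIdx[1] == 1 && destRegIdx[0] == 0);
}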

+
+
+

We also note that the register decodings (which the ARM spec says are 1 for x1 and 2 for x2) are actually passed as enum IntRegIndex:

+
+
+
+
    IntRegIndex _op1,
+    IntRegIndex _op2,
+
+
+
+

which are defined at src/arch/arm/intregs.hh:

+
+
+
+
enum IntRegIndex
+{
+    /* All the unique register indices. */
+    INTREG_R0,
+    INTREG_R1,
+    INTREG_R2,
+
+
+
+

Then SimpleExecContext::readIntRegOperand does:

+
+
+
+
    /** Reads an integer register. */
+    RegVal
+    readIntRegOperand(const StaticInst *si, int idx) override
+    {
+        numIntRegReads++;
+        const RegId& reg = si->srcRegIdx(idx);
+        assert(reg.isIntReg());
+        return thread->readIntReg(reg.index());
+    }
+
+
+
+

and:

+
+
+
+
const RegId& srcRegIdx(int i)  const { return _srcRegIdx[i]; }
+
+
+
+

which is what is populated in the constructor.

+
+
+

Then, RegIndex::index() { return regIdx; } just returns the decoded register index, and now SimpleThread::readIntReg:

+
+
+
+
RegVal readIntReg(RegIndex reg_idx) const override {
+    int flatIndex = isa->flattenIntIndex(reg_idx);
+    return readIntRegFlat(flatIndex);
+}
+
+
+
+

readIntRegFlat is what finally reads from the int register array:

+
+
+
+
RegVal SimpleThreadContext::readIntRegFlat(RegIndex idx) const override { return intRegs[idx]; }
+
+std::array<RegVal, TheISA::NumIntRegs> SimpleThreadContext::intRegs;
+
+
+
+

and then there is the flattening magic at:

+
+
+
+
int
+flattenIntIndex(int reg) const
+{
+    assert(reg >= 0);
+    if (reg < NUM_ARCH_INTREGS) {
+        return intRegMap[reg];
+    } else if (reg < NUM_INTREGS) {
+        return reg;
+    } else if (reg == INTREG_SPX) {
+        CPSR cpsr = miscRegs[MISCREG_CPSR];
+        ExceptionLevel el = opModeToEL(
+            (OperatingMode) (uint8_t) cpsr.mode);
+        if (!cpsr.sp && el != EL0)
+            return INTREG_SP0;
+        switch (el) {
+            case EL3:
+            return INTREG_SP3;
+            case EL2:
+            return INTREG_SP2;
+            case EL1:
+            return INTREG_SP1;
+            case EL0:
+            return INTREG_SP0;
+            default:
+            panic("Invalid exception level");
+            return 0;  // Never happens.
+        }
+    } else {
+        return flattenIntRegModeIndex(reg);
+    }
+}
+
+
+
+

Then:

+
+
+
+
    NUM_ARCH_INTREGS = 32,
+
+
+
+

so we understand that this covers x0 to x31. NUM_INTREGS is also 32, so I’m a bit confused: that case is never reached.

+
+
+
+
    INTREG_SPX = NUM_INTREGS,
+
+
+
+

SP is 32, but it is a bit more magic, since in ARM there is one SP per exception level as mentioned at ARM SP0 vs SPx.
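
Hand-evaluating the flattening code above for a couple of cases (my own worked examples, not gem5 output):

// flattenIntIndex(1)          == intRegMap[1]  // x1: plain architectural register
// flattenIntIndex(INTREG_SPX) == INTREG_SP1    // when cpsr.sp == 1 and the current
//                                              // exception level is EL1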


We can also have a quick look at the AddXImm instruction which corresponds to a simple addition of an immediate as shown in userland/arch/aarch64/add.S:

+
+
+
+
add x0, x1, 2
+
+
+
+

Its execute method in build/ARM/arch/arm/generated/exec-ns.cc.inc contains (hand formatted and slightly simplified):

+
+
+
+
Fault AddXImm::execute(ExecContext *xc, Trace::InstRecord *traceData) const {
+    uint64_t Dest64 = 0;
+    uint64_t Op164 = 0;
+    Op164 = ((xc->readIntRegOperand(this, 0)) & mask(intWidth));
+    Dest64 = Op164 + imm;
+    uint64_t final_val = Dest64;
+    xc->setIntRegOperand(this, 0, (Dest64) & mask(intWidth));
+    if (traceData) { traceData->setData(final_val); }
+    return NoFault;
+}
+
+
+
+

and imm is set directly in the constructor.

+
+
-
19.20.5.4. gem5 Process
+
19.20.7.4. gem5 Process

The Process class is used only for gem5 syscall emulation mode, and it represents a process like a Linux userland process, in addition to any further gem5 specific data needed to represent the process.

@@ -24633,7 +25784,7 @@ readFunc(SyscallDesc *desc, ThreadContext *tc,
-

19.20.6. gem5 code generation

+

19.20.8. gem5 code generation

gem5 uses a ton of code generation, which makes the project horrendous:

@@ -24643,7 +25794,7 @@ readFunc(SyscallDesc *desc, ThreadContext *tc,

lots of magic happen on top of pybind11, which is already magic, to more automatically glue the C++ and Python worlds: gem5 Python C++ interaction

  • -

    .isa code which describes most of the instructions

    +

    .isa code which describes most of the instructions: gem5 instruction definitions

  • Ruby for memory systems

    @@ -24678,7 +25829,7 @@ readFunc(SyscallDesc *desc, ThreadContext *tc,

    But it has been widely overused to insanity. It likely also exists partly because when the project started in 2003 C++ compilers weren’t that good, so you couldn’t rely on features like templates that much.

  • -
    19.20.6.1. gem5 THE_ISA
    +
    19.20.8.1. gem5 THE_ISA

    Generated code at: build/<ISA>/config/the_isa.hh which e.g. for ARM contains:

    @@ -24724,9 +25875,24 @@ enum class Arch {
    -

    19.20.7. gem5 build system

    +

    19.20.9. gem5 build system

    -
    19.20.7.1. gem5 build broken on recent compiler version
    +
    19.20.9.1. M5_OVERRIDE_PY_SOURCE
    + +
    +

Running gem5 with the M5_OVERRIDE_PY_SOURCE=true environment variable allows you to modify a file under src/python and run it without rebuilding gem5.

    +
    +
    +

    We set this environment variable by default in our run script.

    +
    +
    +

How M5_OVERRIDE_PY_SOURCE works is shown at: gem5 m5.objects module.

    +
    +
    +
    +
    19.20.9.2. gem5 build broken on recent compiler version

    gem5 moves a bit slowly, and if your host compiler is very new, the gem5 build might be broken for it, e.g. this was the case for Ubuntu 19.10 with GCC 9 and gem5 62d75e7105fe172eb906d4f80f360ff8591d4178 from Dec 2019.

    @@ -24751,7 +25917,7 @@ enum class Arch {
    -
    19.20.7.2. gem5 polymorphic ISA includes
    +
    19.20.9.3. gem5 polymorphic ISA includes

    E.g. src/cpu/decode_cache.hh includes:

    @@ -24830,7 +25996,7 @@ build/ARM/config/the_isa.hh
    -
    19.20.7.3. Why are all C++ symlinked into the gem5 build dir?
    +
    19.20.9.4. Why are all C++ symlinked into the gem5 build dir?
    @@ -26436,7 +27602,7 @@ global 10000

    so we clearly see that basically a lock addq is used to do an atomic read and write to memory every single time, just like in our other example userland/cpp/atomic/x86_64_lock_inc.cpp.

    -

    This setup can also be used to benchmark different synchronization mechanisms. For example, std::mutex was about 1.5x slower with two cores than std::atomic, presumably because it relies on the futex system call as can be seen from strace -f -s999 -v logs, while std::atomic uses just userland instructions: https://www.quora.com/How-does-std-atomic-work-in-C++11/answer/Ciro-Santilli Tested in -O3 with:

    +

    This setup can also be used to benchmark different synchronization mechanisms. For example, std::mutex was about 1.5x slower with two cores than std::atomic, presumably because it relies on the futex system call as can be seen from strace -f -s999 -v logs, while std::atomic uses just userland instructions: https://www.quora.com/How-does-std-atomic-work-in-C++11/answer/Ciro-Santilli Tested in -O3 with:

@@ -26626,13 +27792,16 @@ fork() return = 13039
@@ -29345,7 +30514,17 @@ child after parent sleep

    22.7.2. getcpu system call and the sched_getaffinity glibc wrapper

    -

    Example: userland/linux/sched_getcpu.c

    +

    Examples:

    +
    +
    +

    Returns the CPU that the process/thread is currently running on:

    @@ -34000,6 +35179,12 @@ CurrentEL.EL 0x3

    See ARMv8 architecture reference manual db D1.6.2 "The stack pointer registers".

    +

    There is one SP per exception level.

    +
    +
    +

This can also be seen clearly in the analysis at gem5 ExecContext::readIntRegOperand register resolution.

    +
    +

    TODO create a minimal runnable example.