diff --git a/index.html b/index.html
index 2f75bd3..1b5bc67 100644
--- a/index.html
+++ b/index.html
@@ -1185,7 +1185,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 19.15.3. gem5 prof and perf builds
  • 19.15.4. gem5 clang build
  • 19.15.5. gem5 sanitation build
  • -
  • 19.15.6. gem5 Ruby build
  • +
  • 19.15.6. gem5 Ruby build + +
  • 19.15.7. gem5 Python 3 build
  • @@ -1210,10 +1214,52 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 19.19.3. gem5 entry point
  • 19.19.4. gem5 event queue
  • 19.19.5. gem5 stats internals
  • @@ -1224,8 +1270,9 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 19.19.7. gem5 build system
  • @@ -1889,10 +1936,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
@@ -2884,12 +2932,7 @@ j = 0

    Remember that the gem5 boot is considerably slower than QEMU since the simulation is more detailed.

    -

    gem5 moves a bit slowly, and if your host compiler is very new, the build might be broken for it, e.g. this was the case for Ubuntu 19.10 with GCC 9 and gem5 62d75e7105fe172eb906d4f80f360ff8591d4178 from Dec 2019. You can work around that by installing an older compiler and using it with something like:

    -
    -
    -
    -
    ./build-gem5 -- CC=gcc-8 CXX=g++-8
    -
    +

    If you have a relatively new GCC version and the gem5 build fails on your machine, see: gem5 build broken on recent compiler version.

    To get a terminal, either open a new shell and run:

    @@ -20526,13 +20569,13 @@ Indirect leak of 1346 byte(s) in 2 object(s) allocated from:
    -

    The Ruby memory system includes the SLICC domain specific language to describe memory systems: http://gem5.org/Ruby

    +

    The Ruby memory system includes the SLICC domain specific language to describe memory systems: http://gem5.org/Ruby. SLICC transpiles to auto-generated C++ files under build/<isa>/mem/ruby/protocol/.

    -

    It seems to have usage outside of gem5, but the naming overload with the Ruby programming language, which also has domain specific languages as a concept, makes it impossible to google anything about it!

    +

    Ruby seems to have usage outside of gem5, but the naming overload with the Ruby programming language, which also has domain specific languages as a concept, makes it impossible to google anything about it!

    -

    Since it is not the default, Ruby is generally less stable than the classic memory model. However, because it allows describing a wide variety of important coherency protocols, while the classic system only describes a single protocol, Ruby is a very important feature of gem5.

    +

    Since it is not the default, Ruby is generally less stable than the classic memory model. However, because it allows describing a wide variety of important cache coherency protocols, while the classic system only describes a single protocol, Ruby is a very important feature of gem5.

    Ruby support must be enabled at compile time with the scons PROTOCOL= flag, which compiles support for the desired memory system type.

    @@ -20554,15 +20597,15 @@ Indirect leak of 1346 byte(s) in 2 object(s) allocated from:
    -

    For example, to use a two level MESI cache coherence protocol, we can do:

    +

    For example, to use a two level MESI protocol, we can do:

    @@ -20630,7 +20673,7 @@ cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"
  • -

    without --ruby, we instead see XBar (Coherent Crossbar) related messages such as CoherentXBar:, which I believe is the more precise name for the memory model that the classic memory system uses

    +

    without --ruby, we instead see XBar (Coherent Crossbar) related messages such as CoherentXBar:, which I believe is the more precise name for the memory model that the classic memory system uses: gem5 crossbar interconnect.

  • @@ -20640,6 +20683,50 @@ cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"

    Tested in gem5 d7d9bc240615625141cd6feddbadd392457e49eb.

    +
    +
    19.15.6.1. gem5 crossbar interconnect
    +
    +

    Crossbar, or XBar in the code, is the default CPU interconnect that gets used by fs.py if --ruby is not given.

    +
    + + +
    +

    TODO: describe it in more detail. It appears to be a very simple mechanism.

    +
    +
    +

    Under src/mem/ we see that there is both a coherent and a non-coherent XBar.

    +
    +
    +

    In se.py it is set at:

    +
    +
    +
    +
    if options.ruby:
    +    ...
    +else:
    +    MemClass = Simulation.setMemClass(options)
    +    system.membus = SystemXBar()
    +
    +
    +
    +

    and SystemXBar is defined at src/mem/XBar.py with a nice comment:

    +
    +
    +
    +
    # One of the key coherent crossbar instances is the system
    +# interconnect, tying together the CPU clusters, GPUs, and any I/O
    +# coherent masters, and DRAM controllers.
    +class SystemXBar(CoherentXBar):
    +
    +
    +
    +

    Tested in gem5 12c917de54145d2d50260035ba7fa614e25317a3.

    +
    +

    19.15.7. gem5 Python 3 build

    @@ -20749,6 +20836,9 @@ cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"

    Its 4 stage pipeline is described at the "MinorCPU" section of gem5 ARM RSK.

    +

    There is also an in-tree doxygen at: src/doc/inside-minor.doxygen and rendered at: http://pages.cs.wisc.edu/~swilson/gem5-docs/minor.html

    +
    +

    As of 2019, in-order cores are mostly present in low power / cost contexts, for example the little cores of ARM big.LITTLE.

    @@ -21256,12 +21346,12 @@ exec filecode in scope

    CPU ticks

  • -

    TODO peripherals and memory

    +

    peripherals and memory

  • -

    At the beginning of simulation, gem5 sets up exactly two events:

    +

    At gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis we see, for example, that at the beginning of an AtomicSimpleCPU simulation, gem5 sets up exactly two events:

    -

    Tick events then get triggered one by one as simulation progresses, in addition to any other system events.

    +

    Then, at the end of the callback of one tick event, another tick is scheduled.

    +
    +
    +

    And so the simulation progresses tick by tick, until an exit event happens.
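    To make the pattern concrete, here is a minimal self-contained Python sketch of the same schedule / execute / reschedule idea: a tick callback that, at the end of its own execution, schedules the next tick. This is a toy model for illustration only, not gem5 code, and all names in it (ToyEventQueue, etc.) are made up.

import heapq
import itertools

# Toy discrete event simulation illustrating the pattern described above.
# Purely illustrative, not gem5 code.
class ToyEventQueue:
    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker for events at the same tick

    def schedule(self, when, name, callback):
        print(f'schedule {name} @ {when}')
        heapq.heappush(self._heap, (when, next(self._order), name, callback))

    def run(self):
        while self._heap:
            when, _, name, callback = heapq.heappop(self._heap)
            print(f'execute  {name} @ {when}')
            if callback(when) == 'exit':
                break  # the equivalent of an exit event ending the simulation

CLOCK_PERIOD = 500  # ticks, i.e. picoseconds for the 2 GHz clock used in the examples
queue = ToyEventQueue()
instructions = iter(['movz', 'adr', 'ldr', 'movz', 'svc'])

def tick(now):
    insn = next(instructions, None)
    if insn is None:
        return 'exit'
    print(f'  run instruction: {insn}')
    # the end of one tick callback schedules the next tick
    queue.schedule(now + CLOCK_PERIOD, 'tick', tick)

queue.schedule(0, 'tick', tick)  # the single initial event
queue.run()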

    The EventQueue class has one awesome dump() function that prints a human friendly representation of the queue, and can be easily called from GDB. TODO example.

    @@ -21293,6 +21386,46 @@ exec filecode in scope

    This calls the Event::process method of the event.

    +
    +

    Another important technique is to use GDB and break at interesting points such as:

    +
    +
    +
    +
    b Trace::OstreamLogger::logMessage()
    +b EventManager::schedule
    +b EventFunctionWrapper::process
    +
    +
    +
    +

    although stepping into EventFunctionWrapper::process, which goes through std::function, is a bit of a pain: https://stackoverflow.com/questions/59429401/how-to-step-into-stdfunction-user-code-from-c-functional-with-gdb

    +
    +
    +

    Another potentially useful technique is to use:

    +
    +
    +
    +
    --trace Event,ExecAll,FmtFlag,FmtStackTrace --trace-stdout
    +
    +
    +
    +

    which automates the logging of Trace::OstreamLogger::logMessage() backtraces.

    +
    +
    +

    But alas, it misses which function callback is being scheduled, which is the awesome thing we actually want.

    +
    +
    + +
    +
    +

    Then, once we had that, the most perfect thing ever would be to make the full event graph containing which events schedule which events!
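    Here is a rough sketch of how such a graph could be scraped from the --trace Event output alone, using the same heuristic applied manually in the analysis below: every schedule logged between one "executed" line and the next "executed" line is attributed to the currently executing event. This is a hypothetical helper, not part of LKMC or gem5, and the regex is only a guess at the "NAME ID scheduled/executed @ TICK" line format shown in the logs below.

import re
import sys
from collections import defaultdict

# Hypothetical sketch: build a "which event scheduled which" graph from a
# gem5 --trace Event log. Heuristic: schedules printed while event X is
# executing are attributed to X. Not part of gem5 or LKMC.
line_re = re.compile(
    r'^\s*(?P<tick>\d+): Event: (?P<name>.*?) (?P<id>\d+) '
    r'(?P<action>scheduled|rescheduled|executed) @ (?P<when>\d+)')

def build_event_graph(lines):
    children = defaultdict(list)  # executing event id -> scheduled event ids
    current = None                # id of the event currently executing
    for line in lines:
        m = line_re.match(line)
        if not m:
            continue
        event_id, action = m.group('id'), m.group('action')
        if action == 'executed':
            current = event_id
        elif current is not None:  # scheduled or rescheduled
            children[current].append(event_id)
    return children

if __name__ == '__main__':
    graph = build_event_graph(sys.stdin)
    for parent, kids in graph.items():
        print(parent, '->', ' '.join(kids))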

    +
    19.19.4.1. gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis
    @@ -21304,79 +21437,149 @@ exec filecode in scope
   --arch aarch64 \
   --emulator gem5 \
   --userland userland/arch/aarch64/freestanding/linux/hello.S \
-  --trace Event \
+  --trace Event,ExecAll,FmtFlag \
   --trace-stdout \
   ;
    -

    At LKMC a0ea29835b9bacc6aa1cceb24c79d895315991d4 + 1 this outputs:

    +

    which gives:

    -
          0: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    +
          0: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 scheduled @ 0
     **** REAL SIMULATION ****
    -      0: Event_70: generic event scheduled @ 0
    +      0: Event: Event_70: generic 70 scheduled @ 0
     info: Entering event queue @ 0.  Starting simulation...
    -      0: Event_70: generic event rescheduled @ 18446744073709551615
    -      0: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped event rescheduled @ 500
    -    500: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped event rescheduled @ 1000
    -   1000: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped event rescheduled @ 1500
    -   1500: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped event rescheduled @ 2000
    +      0: Event: Event_70: generic 70 rescheduled @ 18446744073709551615
    +      0: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 executed @ 0
    +      0: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    +      0: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 rescheduled @ 500
    +    500: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 executed @ 500
    +    500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  flags=(IsInteger)
    +    500: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 rescheduled @ 1000
    +   1000: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 executed @ 1000
    +   1000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8    :   ldr   w2, #4194464       : MemRead :  D=0x0000000000000006 A=0x4000a0  flags=(IsInteger|IsMemRef|IsLoad)
    +   1000: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 rescheduled @ 1500
    +   1500: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 executed @ 1500
    +   1500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+12    :   movz   x8, #64, #0       : IntAlu :  D=0x0000000000000040  flags=(IsInteger)
    +   1500: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 rescheduled @ 2000
    +   2000: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 executed @ 2000
    +   2000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+16    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
     hello
    -   2000: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped event rescheduled @ 2500
    -   2500: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped event rescheduled @ 3000
    -   3000: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped event rescheduled @ 3500
    -   3500: Event_71: generic event scheduled @ 3500
+   2000: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 rescheduled @ 2500
+   2500: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 executed @ 2500
+   2500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+20    :   movz   x0, #0, #0        : IntAlu :  D=0x0000000000000000  flags=(IsInteger)
+   2500: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 rescheduled @ 3000
+   3000: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 executed @ 3000
+   3000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+24    :   movz   x8, #93, #0       : IntAlu :  D=0x000000000000005d  flags=(IsInteger)
+   3000: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 rescheduled @ 3500
+   3500: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 executed @ 3500
+   3500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+28    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
+   3500: Event: Event_71: generic 71 scheduled @ 3500
+   3500: Event: Event_71: generic 71 executed @ 3500
    -

    which we immediately guess means that there is one event per tick when comparing to the ExecAll trace:

    +

    On the event trace, we can first see:

    -
          0: system.cpu A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    -    500: system.cpu A0 T0 : @asm_main_after_prologue+4    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  flags=(IsInteger)
    -   1000: system.cpu A0 T0 : @asm_main_after_prologue+8    :   ldr   w2, #4194464       : MemRead :  D=0x0000000000000006 A=0x4000a0  flags=(IsInteger|IsMemRef|IsLoad)
    -   1500: system.cpu A0 T0 : @asm_main_after_prologue+12    :   movz   x8, #64, #0       : IntAlu :  D=0x0000000000000040  flags=(IsInteger)
    -   2000: system.cpu A0 T0 : @asm_main_after_prologue+16    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
    -hello
    -   2500: system.cpu A0 T0 : @asm_main_after_prologue+20    :   movz   x0, #0, #0        : IntAlu :  D=0x0000000000000000  flags=(IsInteger)
    -   3000: system.cpu A0 T0 : @asm_main_after_prologue+24    :   movz   x8, #93, #0       : IntAlu :  D=0x000000000000005d  flags=(IsInteger)
    -   3500: system.cpu A0 T0 : @asm_main_after_prologue+28    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
    -Exiting @ tick 3500 because exiting with last active thread context
    +
    0: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 scheduled @ 0
    -

    On the event trace, we can see:

    +

    This schedules a tick event for time 0, and leads to the first clock tick.

    +
    +
    +

    Then:

    +
    +
    +
    +
    0: Event: Event_70: generic 70 scheduled @ 0
    +0: Event: Event_70: generic 70 rescheduled @ 18446744073709551615
    +
    +
    +
    +

    schedules the end of time event for time 0, which is later rescheduled to the actual end of time: 18446744073709551615 is 2^64 - 1, i.e. the maximum possible tick value.

    +
    +
    +

    At:

    +
    +
    +
    +
    0: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 executed @ 0
    +0: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    +0: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 rescheduled @ 500
    +
    +
    +
    +

    the tick event happens, the instruction runs, and then the tick event is rescheduled 500 time units later. This is done at the end of AtomicSimpleCPU::tick():

    -
    -
    -

    Let’s study the first event. From GDB, let’s break at the point that prints messages: Trace::OstreamLogger::logMessage() to see where events are being scheduled from:

    +

    the exit system call is called, and then it schedules an exit event, which gets executed and the simulation ends.

    +
    +
    +

    We then guess that Event_71 comes from the SE implementation of the exit syscall. To confirm, the trace contains:

    +
    +
    +
    +
    exitSimLoop() at sim_events.cc:97 0x5555594746e0
    +exitImpl() at syscall_emul.cc:215 0x55555948c046
    +exitFunc() at syscall_emul.cc:225 0x55555948c147
    +SyscallDesc::doSyscall() at syscall_desc.cc:72 0x5555594949b6
    +Process::syscall() at process.cc:401 0x555559484717
    +SimpleThread::syscall() at 0x555559558059
    +ArmISA::SupervisorCall::invoke() at faults.cc:856 0x5555572950d7
    +BaseSimpleCPU::advancePC() at base.cc:681 0x555559083133
    +AtomicSimpleCPU::tick() at atomic.cc:757 0x55555907834c
    +
    +
    +
    +

    and exitSimLoop() does:

    +
    +
    +
    +
    new GlobalSimLoopExitEvent(when + simQuantum, message, exit_code, repeat);
    +
    +
    +
    +

    Tested in gem5 12c917de54145d2d50260035ba7fa614e25317a3.

    +
    +
    +
    19.19.4.1.1. AtomicSimpleCPU initial events
    +
    +

    Let’s have a closer look at the initial magically scheduled events of the simulation.

    +
    +
    +

    Most events come from other events, but at least one initial event must be scheduled somehow from elsewhere to kick things off.

    +
    +
    +

    The initial tick event:

    +
    +
    +
    +
    0: Event: AtomicSimpleCPU tick.wrapped_function_event: EventFunctionWrapped 39 scheduled @ 0
    +
    +
    +
    +

    we’ll study by breaking at the point that prints messages, b Trace::OstreamLogger::logMessage(), to see where events are being scheduled from:

    @@ -21445,6 +21648,14 @@ ArmLinuxProcess64::initState() at process.cc:1,777 0x5555572d5e5e
    +

    and this gets called from the toplevel Python scripts, e.g. for se.py, configs/common/Simulation.py does:

    +
    +
    +
    +
    m5.instantiate(checkpoint_dir)
    +
    +
    +

    As we can see, initState is just one stage of generic SimObject initialization. root.descendants() goes over the entire SimObject tree calling initState().
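    The following is a minimal Python sketch of that pattern: a tree of objects whose instantiation step walks all descendants and calls initState() on each, which is the rough shape of what m5.instantiate does. The class and function names here are made up to mirror the gem5 ones; this is not the real m5 code.

# Toy sketch of the "walk the SimObject tree and call initState" pattern.
# Mirrors the shape of m5.instantiate, but is not the real m5 code.
class ToySimObject:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def descendants(self):
        yield self
        for child in self.children:
            yield from child.descendants()

    def initState(self):
        print(f'initState: {self.name}')

def instantiate(root):
    # the real m5.instantiate also does port binding, checkpoint restore, etc.
    for obj in root.descendants():
        obj.initState()

root = ToySimObject('root', [
    ToySimObject('system', [
        ToySimObject('system.cpu'),
        ToySimObject('system.membus'),
        ToySimObject('system.mem_ctrls'),
    ]),
])
instantiate(root)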

    @@ -21521,10 +21732,16 @@ ArmLinuxProcess64::initState() at process.cc:1,777 0x5555572d5e5e
    -

    So that’s how the main atomic tick loop works, fully understood!

    +

    And the second magically scheduled event is the exit event:

    +
    +
    +
    +
    0: Event: Event_70: generic 70 scheduled @ 0
    +0: Event: Event_70: generic 70 rescheduled @ 18446744073709551615
    +
    -

    The second event has backtrace:

    +

    which is scheduled with backtrace:

    @@ -21540,7 +21757,7 @@ simulate() at simulate.cc:104 0x555559476d6f
    -

    so gets scheduled automatically at object creation simulate() through the GlobalEvent() constructor:

    +

    which gets scheduled at object creation inside simulate() through the GlobalEvent() constructor:

    @@ -21561,36 +21778,165 @@ simulate() at simulate.cc:104 0x555559476d6f
    -

    And at last, we can guess without reading the code that Event_71 is comes from the SE implementation of the exit syscall, so let’s just confirm, the trace contains:

    +

    Tested in gem5 12c917de54145d2d50260035ba7fa614e25317a3.

    +
    +
    +
    +
    19.19.4.1.2. AtomicSimpleCPU tick reschedule timing
    +
    +

    Inside AtomicSimpleCPU::tick() we saw previously that the reschedule happens at:

    -
    exitSimLoop() at sim_events.cc:97 0x5555594746e0
    -exitImpl() at syscall_emul.cc:215 0x55555948c046
    -exitFunc() at syscall_emul.cc:225 0x55555948c147
    -SyscallDesc::doSyscall() at syscall_desc.cc:72 0x5555594949b6
    -Process::syscall() at process.cc:401 0x555559484717
    -SimpleThread::syscall() at 0x555559558059
    -ArmISA::SupervisorCall::invoke() at faults.cc:856 0x5555572950d7
    -BaseSimpleCPU::advancePC() at base.cc:681 0x555559083133
    -AtomicSimpleCPU::tick() at atomic.cc:757 0x55555907834c
    +
        if (latency < clockPeriod())
    +        latency = clockPeriod();
    +
    +    if (_status != Idle)
    +        reschedule(tickEvent, curTick() + latency, true);
    -

    and exitSimLoop() does:

    +

    so it is interesting to learn where that latency comes from.

    +
    +
    +

    From our logs, we see that all events happened with a 500 time unit interval between them, so that must be the value for all instructions of our simple example.

    +
    +
    +

    By GDBing it a bit, we see that none of our instructions incremented latency, and so it got set to clockPeriod(), which comes from ClockDomain::clockPeriod(), which then likely comes from:

    -
    new GlobalSimLoopExitEvent(when + simQuantum, message, exit_code, repeat);
    +
        parser.add_option("--cpu-clock", action="store", type="string",
    +                      default='2GHz',
    -

    Tested at b4879ae5b0b6644e6836b0881e4da05c64a6550d.

    +

    because the time unit is picoseconds, so the default 2 GHz clock corresponds to a period of 500 ticks. This then shows up in config.ini as:

    +
    +
    +
    +
    [system.cpu_clk_domain]
    +type=SrcClockDomain
    +clock=500
    +
    +
    +
    +
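    As a quick sanity check of that number, assuming 1 tick = 1 picosecond as stated above:

# 2 GHz clock with picosecond ticks gives a 500 tick period, matching
# clock=500 in config.ini and the 500 tick gaps between events in the logs.
ticks_per_second = 10**12          # 1 tick == 1 picosecond
cpu_clock_hz = 2 * 10**9           # the --cpu-clock default of 2GHz
print(ticks_per_second // cpu_clock_hz)  # 500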
    +
    19.19.4.1.3. AtomicSimpleCPU memory access
    +
    +

    It will be interesting to see how AtomicSimpleCPU makes memory accesses under GDB and to compare that with TimingSimpleCPU.

    +
    +
    +

    We assume that the memory access still goes through the CoherentXBar, but instead of generating an event to model delayed response, it must be doing the access directly.

    +
    +
    +

    Inside AtomicSimpleCPU::tick, we track ifetch_req and see:

    +
    +
    +
    +
            fault = thread->itb->translateAtomic(ifetch_req, thread->getTC(),
    +                                                BaseTLB::Execute);
    +
    +
    +
    +

    We can compare that with what happens in TimingSimpleCPU:

    +
    +
    +
    +
            thread->itb->translateTiming(ifetch_req, thread->getTC(),
    +                &fetchTranslation, BaseTLB::Execute);
    +
    +
    +
    +

    and so there it is: the ITB classes are the same, but there are separate Atomic and Timing methods!

    +
    +
    +

    The Timing one calls ArmISA::TLB::translateComplete.
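    The difference between the two calling conventions can be sketched like this: the atomic variant returns its result immediately, while the timing variant takes a continuation object and calls finish() on it later. This is a made-up Python illustration of the pattern, not the ArmISA::TLB code.

# Sketch of the atomic vs timing calling conventions: atomic returns now,
# timing takes a "translation" object and calls finish() on it later.
class ToyTLB:
    def translate_atomic(self, vaddr):
        return vaddr - 0x400000                 # result available immediately

    def translate_timing(self, vaddr, translation):
        paddr = vaddr - 0x400000
        # real code: finish() may be invoked from a later event instead
        translation.finish(paddr)

class FetchTranslation:
    def finish(self, paddr):
        print(f'fetch can now be sent for paddr {hex(paddr)}')

tlb = ToyTLB()
print(hex(tlb.translate_atomic(0x400078)))      # 0x78
tlb.translate_timing(0x400078, FetchTranslation())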

    +
    +
    +

    Tested in gem5 b4879ae5b0b6644e6836b0881e4da05c64a6550d.

    +
    +
    +
    +
    19.19.4.1.4. gem5 se.py page translation
    +
    +

    Happens on EmulationPageTable, and seems to happen atomically without making any extra memory requests.

    +
    +
    +

    TODO confirm from code, notably by seeing where the translation table is set.

    +
    +
    +

    But we can confirm it by logging with:

    +
    +
    +
    +
    --trace DRAM,ExecAll,FmtFlag
    +
    +
    +
    +

    which gives:

    +
    +
    +
    +
          0: DRAM: system.mem_ctrls: recvAtomic: ReadReq 0x78
    +      0: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    +    500: DRAM: system.mem_ctrls: recvAtomic: ReadReq 0x7c
    +    500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  flags=(IsInteger)
    +   1000: DRAM: system.mem_ctrls: recvAtomic: ReadReq 0x80
    +   1000: DRAM: system.mem_ctrls: recvAtomic: ReadReq 0xa0
    +   1000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8    :   ldr   w2, #4194464       : MemRead :  D=0x0000000000000006 A=0x4000a0  flags=(IsInteger|IsMemRef|IsLoad)
    +   1500: DRAM: system.mem_ctrls: recvAtomic: ReadReq 0x84
    +   1500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+12    :   movz   x8, #64, #0       : IntAlu :  D=0x0000000000000040  flags=(IsInteger)
    +   2000: DRAM: system.mem_ctrls: recvAtomic: ReadReq 0x88
    +   2000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+16    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
    +hello
    +   2500: DRAM: system.mem_ctrls: recvAtomic: ReadReq 0x8c
    +   2500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+20    :   movz   x0, #0, #0        : IntAlu :  D=0x0000000000000000  flags=(IsInteger)
    +   3000: DRAM: system.mem_ctrls: recvAtomic: ReadReq 0x90
    +   3000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+24    :   movz   x8, #93, #0       : IntAlu :  D=0x000000000000005d  flags=(IsInteger)
    +   3500: DRAM: system.mem_ctrls: recvAtomic: ReadReq 0x94
    +   3500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+28    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
    +Exiting @ tick 3500 because exiting with last active thread context
    +   3500: DRAM: system.mem_ctrls_0: Computing stats due to a dump callback
    +   3500: DRAM: system.mem_ctrls_1: Computing stats due to a dump callback
    +
    +
    +
    +

    So we see that before every instruction execution there was a DRAM event! Also, each read happens 4 bytes after the previous one, which is consistent with instruction fetches.

    +
    +
    +

    The DRAM addresses are very close to zero, e.g. 0x78 for the first instruction, and therefore we guess that they are physical addresses, since the ELF entry point is much higher:

    +
    +
    +
    +
    ./run-toolchain --arch aarch64 readelf -- -h "$(./getvar --arch aarch64 userland_build_dir)/arch/aarch64/freestanding/linux/hello.out"
    +
    +
    +
    +

    at:

    +
    +
    +
    +
      Entry point address:               0x400078
    +
    +
    +
    +

    For LDR, we see that there was an extra DRAM read as well after the fetch read, as expected.
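    A tiny sanity check of those observations, using only values copied from the traces above and the guessed mapping physical = virtual - 0x400000:

# Check the guessed virtual -> physical mapping and the 4-byte fetch stride
# against the values seen in the DRAM and ExecAll traces above.
entry_vaddr = 0x400078            # ELF entry point from readelf
first_fetch_paddr = 0x78          # first DRAM ReadReq in the trace
offset = entry_vaddr - first_fetch_paddr
print(hex(offset))                # 0x400000

# instruction fetches: one 4-byte read per instruction
fetch_paddrs = [0x78, 0x7c, 0x80, 0x84, 0x88, 0x8c, 0x90, 0x94]
assert all(b - a == 4 for a, b in zip(fetch_paddrs, fetch_paddrs[1:]))

# the LDR data access: vaddr 0x4000a0 in ExecAll, paddr 0xa0 in the DRAM trace
assert 0x4000a0 - offset == 0xa0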

    +
    +
    +

    Tested in gem5 b4879ae5b0b6644e6836b0881e4da05c64a6550d.

    +
    19.19.4.2. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis
    +

    Now, let’s move on to TimingSimpleCPU, which is just like AtomicSimpleCPU internally, except that now the memory requests don’t actually finish immediately: see gem5 CPU types.

    +
    +
    +

    This means that simulation will be much more accurate, and the DRAM memory will be modelled.

    +
    +

    TODO: analyze better what each of the memory events means. For now, we have just collected a bunch of data there, but it needs interpreting. The CPU specifics in this section are already insightful, however.

    @@ -21601,8 +21947,9 @@ AtomicSimpleCPU::tick() at atomic.cc:757 0x55555907834c
    ./run \
       --arch aarch64 \
       --emulator gem5 \
    +  --gem5-build-type gem5 \
       --userland userland/arch/aarch64/freestanding/linux/hello.S \
    -  --trace Event,ExecAll \
    +  --trace Event,ExecAll,FmtFlag \
       --trace-stdout \
       -- \
       --cpu-type TimingSimpleCPU \
    @@ -21610,7 +21957,7 @@ AtomicSimpleCPU::tick() at atomic.cc:757 0x55555907834c
    -

    As of LKMC 9bfbff244d713de40e5686bd370eadb20cf78c7b + 1 the log is now much more complex.

    +

    As of LKMC 78ce2dabe18ef1d87dc435e5bc9369ce82e8d6d2 and gem5 12c917de54145d2d50260035ba7fa614e25317a3 the log is now much more complex.

    Here is an abridged version with:

    @@ -21626,43 +21973,224 @@ AtomicSimpleCPU::tick() at atomic.cc:757 0x55555907834c
    -

    because all that happens in between is exactly the same as the first two instructions and therefore boring:

    +

    because all that happens in between is exactly the same as the first two instructions and therefore boring.

    +
    +
    +

    We have also manually added #N markers to the log lines, so that individual events can be referred to from the analysis sections below:

    +
    +
    +
    -
          0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    +
    #0       0: Event: system.cpu.wrapped_function_event: EventFunctionWrapped 43 scheduled @ 0
     **** REAL SIMULATION ****
    -      0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
    -      0: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
    -      0: Event_74: generic event scheduled @ 0
    +#1       0: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 14 scheduled @ 7786250
    +#2       0: Event: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped 20 scheduled @ 7786250
    +#3       0: Event: Event_74: generic 74 scheduled @ 0
     info: Entering event queue @ 0.  Starting simulation...
    -      0: Event_74: generic event rescheduled @ 18446744073709551615
    -      0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    -      0: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
    -      0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    -      0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
    -      0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 5000
    -      0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    -  46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
    -  74250: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
    -  74250: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
    -  77000: Event_40: Timing CPU icache tick event scheduled @ 77000
    -  77000: system.cpu A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    -  77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
    -  77000: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 78000
    -  77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 95750
    -  77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
    -  95750: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 123750
    - 123750: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 126000
    - 123750: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 126000
    - 126000: Event_40: Timing CPU icache tick event scheduled @ 126000
    -      [...]
    - 469000: system.cpu A0 T0 : @asm_main_after_prologue+28    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
    - 469000: Event_75: generic event scheduled @ 469000
+#4       0: Event: Event_74: generic 74 rescheduled @ 18446744073709551615
+
+#5       0: Event: system.cpu.wrapped_function_event: EventFunctionWrapped 43 executed @ 0
+#6       0: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 0
+#7       0: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 60 scheduled @ 1000
+
+#8       0: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 0
+#9       0: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 12 scheduled @ 0
+#10      0: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 scheduled @ 46250
+#11      0: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 5000
+
+#12      0: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 12 executed @ 0
+#13      0: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 15 scheduled @ 0
+
+#14      0: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 15 executed @ 0
+
+#15   1000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 60 executed @ 1000
+
+#16   5000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 5000
+
+#17  46250: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 executed @ 46250
+#18  46250: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 scheduled @ 74250
+
+#19  74250: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 executed @ 74250
+#20  74250: Event: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 64 scheduled @ 77000
+#21  74250: Event: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped 65 scheduled @ 77000
+
+#22  77000: Event: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped 65 executed @ 77000
+
+#23  77000: Event: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 64 executed @ 77000
+#24  77000: Event: Event_40: Timing CPU icache tick 40 scheduled @ 77000
+
+#25  77000: Event: Event_40: Timing CPU icache tick 40 executed @ 77000
+  77000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
+#26  77000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 77000
+#27  77000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 60 scheduled @ 78000
+
+#28  77000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 77000
+#29  77000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 scheduled @ 95750
+#30  77000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 77000
+
+#31  77000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 77000
+
+#32  78000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 60 executed @ 78000
+
+#33  95750: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 executed @ 95750
+#34  95750: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 scheduled @ 123750
+
+#35 123750: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 executed @ 123750
+#36 123750: Event: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 64 scheduled @ 126000
+#37 123750: Event: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped 65 scheduled @ 126000
+
+#38 126000: Event: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped 65 executed @ 126000
+
+#39 126000: Event: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 64 executed @ 126000
+#40 126000: Event: Event_40: Timing CPU icache tick 40 scheduled @ 126000
+
+#41 126000: Event: Event_40: Timing CPU icache tick 40 executed @ 126000
+ 126000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  flags=(IsInteger)
+#42 126000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 126000
+#43 126000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 60 scheduled @ 127000
+
+ [...]
+
+ 469000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+28    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
+ 469000: Event: Event_75: generic 75 scheduled @ 469000
+ 469000: Event: Event_75: generic 75 executed @ 469000
    -

    0: system.cpu.wrapped_function_event schedule the initial tick, much like for for AtomicSimpleCPU. This time however, it is not a tick, but rather a fetch event that gets scheduled:

    +

    Looking at the generated config.dot.svg gives a better intuition of the shape of the memory system: Figure 2, “config.dot.svg for a TimingSimpleCPU without caches.”, so it is good to keep it in mind.

    +
    +
    +
    +gem5 config TimingSimpleCPU 12c917de54145d2d50260035ba7fa614e25317a3 +
    +
    Figure 2. config.dot.svg for a TimingSimpleCPU without caches.
    +
    +
    +

    It is also helpful to see this as a tree of events where one execute event schedules other events:

    +
    +
    +
    +
        | | | | |
    +    0 1 2 3 4   0 TimingSimpleCPU::fetch
    +    5
    +    |
    +    +---+
    +    |   |
    +    6   7       6 DRAMCtrl::processNextReqEvent
    +    8   15      7 BaseXBar::Layer::releaseLayer
    +    |
    ++---+---+
    +|   |   |
    +9   10  11      9 DRAMCtrl::Rank::processActivateEvent
    +12  17  16     10 DRAMCtrl::processRespondEvent
    +|   |          11 DRAMCtrl::processNextReqEvent
    +|   |
    +13  18         13 DRAMCtrl::Rank::processPowerEvent
    +14  19         18 PacketQueue::processSendEvent
    +    |
    +    +---+
    +    |   |
    +    20  21     20 PacketQueue::processSendEvent
    +    23  22     21 BaseXBar::Layer<SrcType, DstType>::releaseLayer
    +    |
    +    24         24 TimingSimpleCPU::IcachePort::ITickEvent::process
    +    25
    +    |
    +    +---+
    +    |   |
    +    26  27     26 DRAMCtrl::processNextReqEvent
    +    28  32     27 BaseXBar::Layer<SrcType, DstType>::releaseLayer
    +    |
    +    +---+
    +    |   |
    +    29  30     29 DRAMCtrl::processRespondEvent
    +    33  31     30 DRAMCtrl::processNextReqEvent
    +    |
    +    34         34 PacketQueue::processSendEvent
    +    35
    +    |
    +    +---+
    +    |   |
    +    36  37     36 PacketQueue::processSendEvent
    +    39  38     37 BaseXBar::Layer<SrcType, DstType>::releaseLayer
    +    |
    +    40         40 TimingSimpleCPU::IcachePort::ITickEvent::process
    +    41
    +    |
    +    +---+
    +    |   |
    +    42  43     42 DRAMCtrl::processNextReqEvent
    +               43 BaseXBar::Layer<SrcType, DstType>::releaseLayer
    +
    +
    +
    +

    Note that every schedule is followed by an execution, so we put them together, for example:

    +
    +
    +
    +
        |   |
    +    6   7    6 DRAMCtrl::processNextReqEvent
    +    8   15   7 BaseXBar::Layer::releaseLayer
    +    |
    +
    +
    +
    +

    means:

    +
    +
    + +
    +
    +

    With this, we can focus on going up the event tree from an event of interest until we see what originally caused it!

    +
    +
    +

    Notably, the above tree contains the execution of the first two instructions.

    +
    +
    +

    Observe how the events leading up to the second instruction are basically a copy of those of the first one; this is the basic TimingSimpleCPU event loop in action.
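    One way to automate walking up the tree, building on the hypothetical parsing sketch shown earlier: invert the children map into a scheduled_by map (event id -> id of the event whose execution scheduled it) and follow it upwards from the event of interest. The helper below is made up for illustration, and the ids in the example map are hand-extracted from the log above.

# Hypothetical helper: given "which executing event scheduled me" links
# (event ids from the Event trace), print the chain of causes for one event.
def ancestors(scheduled_by, event_id):
    chain = [event_id]
    while chain[-1] in scheduled_by:
        chain.append(scheduled_by[chain[-1]])
    return chain

# Hand-extracted from the log above for the first instruction's icache tick:
# 40 was scheduled while 64 executed, 64 while 8 executed, and so on back to
# the initial fetch event 43.
scheduled_by = {'40': '64', '64': '8', '8': '10', '10': '9', '9': '43'}
print(' <- '.join(ancestors(scheduled_by, '40')))  # 40 <- 64 <- 8 <- 10 <- 9 <- 43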

    +
    +
    +
    19.19.4.2.1. TimingSimpleCPU analysis #0
    +
    +

    Schedules TimingSimpleCPU::fetch through:

    +
    +
    +
    +
    EventManager::schedule
    +TimingSimpleCPU::activateContext
    +SimpleThread::activate
    +Process::initState
    +ArmProcess64::initState
    +ArmLinuxProcess64::initState
    +
    +
    +
    +

    This schedules the initial tick, much like for AtomicSimpleCPU.

    +
    +
    +

    This time, however, it is not a tick as in AtomicSimpleCPU, but rather a fetch event that gets scheduled for later on, since reading DRAM memory now takes time:

    @@ -21682,13 +22210,26 @@ info: Entering event queue @ 0. Starting simulation...
    -

    We have a fetch instead of a tick here compared to AtomicSimpleCPU, because in the timing CPU we must first get the instruction opcode from DRAM, which takes some cycles to return!

    +

    By looking at the source, we see that fetchEvent runs TimingSimpleCPU::fetch.

    -

    By looking at the source, we see that fetchEvent runs TimingSimpleCPU::fetch.

    +

    Just like for AtomicSimpleCPU, this call comes from the initState call, which is exposed on SimObject and ultimately comes from Python.

    +
    +
    +
    +
    19.19.4.2.2. TimingSimpleCPU analysis #1
    +
    +

    Backtrace:

    +
    +
    +
    +
    EventManager::schedule
    +DRAMCtrl::Rank::startup
    +DRAMCtrl::startup
    +
    -

    0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250: from GDB we see that it comes from DRAMCtrl::startup in mem/dram_ctrl.cc which contains:

    +

    Snippets:

    @@ -21736,12 +22277,35 @@ DRAMCtrl::Rank::startup(Tick ref_tick)
    +

    DRAMCtrl::startup is itself a SimObject method exposed to Python and called from simulate in src/python/m5/simulate.py:

    +
    +
    +
    +
    def simulate(*args, **kwargs):
    +    global need_startup
    +
    +    if need_startup:
    +        root = objects.Root.getInstance()
    +        for obj in root.descendants(): obj.startup()
    +
    +
    +
    +

    where simulate happens after m5.instantiate, and both are called directly from the toplevel scripts, e.g. for se.py in configs/common/Simulation.py:

    +
    +
    +
    +
    def run(options, root, testsys, cpu_class):
    +    ...
    +            exit_event = m5.simulate()
    +
    +
    +

    By looking up some variable definitions in the source, we can now see some memory parameters clearly:

    +
    +
    19.19.4.2.3. TimingSimpleCPU analysis #2
    -

    0: Event_74: generic event scheduled @ 0 and 0: Event_74: generic event rescheduled @ 18446744073709551615: schedule the final exit event, same as for AtomicSimpleCPU

    +

    This is just the startup of the second rank, see: TimingSimpleCPU analysis #1.

    -

    The next interesting event is:

    +

    se.py allocates the memory controller at configs/common/MemConfig.py:

    -
    system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    +
    def config_mem(options, system):
    +
    +    ...
    +
    +    opt_mem_channels = options.mem_channels
    +
    +    ...
    +
    +    nbr_mem_ctrls = opt_mem_channels
    +
    +    ...
    +
    +    for r in system.mem_ranges:
    +        for i in range(nbr_mem_ctrls):
    +            mem_ctrl = create_mem_ctrl(cls, r, i, nbr_mem_ctrls, intlv_bits,
    +                                       intlv_size)
    +
    +            ...
    +
    +            mem_ctrls.append(mem_ctrl)
    +
    +
    +
    19.19.4.2.4. TimingSimpleCPU analysis #3 and #4
    +
    +

    From the timestamps we know what these are: the scheduling and then rescheduling of the end of time exit event, like for AtomicSimpleCPU.

    +
    +
    +
    +
    19.19.4.2.5. TimingSimpleCPU analysis #5
    +
    +

    Executes TimingSimpleCPU::fetch().

    +
    -

    which comes from:

    +

    The log shows that event ID 43 is now executing: we had previously seen event 43 get scheduled and had analyzed it to be the initial fetch.

    +
    +
    +

    We can step into TimingSimpleCPU::fetch() to confirm that the expected ELF entry point is being fetched. We can inspect the ELF with:

    -
    #0  Trace::OstreamLogger::logMessage
    -#1  void Trace::Logger::dprintf
    -#2  Event::trace
    -#3  EventQueue::schedule
    -#4  EventManager::schedule
    -#5  DRAMCtrl::addToReadQueue
    -#6  DRAMCtrl::recvTimingReq
    -#7  DRAMCtrl::MemoryPort::recvTimingReq
    -#8  TimingRequestProtocol::sendReq
    -#9  MasterPort::sendTimingReq
    -#10 CoherentXBar::recvTimingReq
    -#11 CoherentXBar::CoherentXBarSlavePort::recvTimingReq(Packet*))
    -#12 TimingRequestProtocol::sendReq
    -#13 MasterPort::sendTimingReq
    -#14 TimingSimpleCPU::sendFetch
    -#15 TimingSimpleCPU::FetchTranslation::finish
    -#16 ArmISA::TLB::translateComplete
    -#17 ArmISA::TLB::translateTiming
    -#18 ArmISA::TLB::translateTiming
    -#19 TimingSimpleCPU::fetch
    -#20 TimingSimpleCPU::<lambda()>::operator()(void)
    -#21 std::_Function_handler<void(), TimingSimpleCPU::TimingSimpleCPU(TimingSimpleCPUParams*)::<lambda()> >
    -#22 std::function<void)>::operator()() const
    -#23 EventFunctionWrapper::process
    -#24 EventQueue::serviceOne
    -#25 doSimLoop
    -#26 simulate
    +
    ./run-toolchain --arch aarch64 readelf -- -h "$(./getvar --arch aarch64 userland_build_dir)/arch/aarch64/freestanding/linux/hello.out"
    -

    From the trace, we see that we are already running from the event queue. Therefore, we must have been running a previously scheduled event, and the previous event logs, the only such event is 0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0 which scheduled a memory fetch!

    +

    which contains:

    +
    +
    +
    +
      Entry point address:               0x400078
    +
    +
    +
    +

    and by the time we go past:

    +
    +
    +
    +
    TimingSimpleCPU::fetch()
    +{
    +    ...
    +    if (needToFetch) {
    +        ...
    +        setupFetchRequest(ifetch_req);
    +        DPRINTF(SimpleCPU, "Translating address %#x\n", ifetch_req->getVaddr());
    +        thread->itb->translateTiming(ifetch_req, thread->getTC(),
    +                &fetchTranslation, BaseTLB::Execute);
    +
    +
    +
    +

    BaseSimpleCPU::setupFetchRequest sets up the fetch of the expected entry point by reading the PC:

    +
    +
    +
    +
    p/x ifetch_req->getVaddr()
    +
    +
    +
    +

    Still during the execution of the fetch, execution then moves into the address translation ArmISA::TLB::translateTiming, and after a call to:

    +
    +
    +
    +
    TLB::translateSe
    +
    +
    +
    +

    the packet now contains the physical address:

    +
    +
    +
    +
    _paddr = 0x78
    +
    +
    +
    +
    +
    19.19.4.2.6. TimingSimpleCPU analysis #6
    +
    +

    Schedules DRAMCtrl::processNextReqEvent through:

    +
    +
    +
    +
    EventManager::schedule
    +DRAMCtrl::addToReadQueue
    +DRAMCtrl::recvTimingReq
    +DRAMCtrl::MemoryPort::recvTimingReq
    +TimingRequestProtocol::sendReq
    +MasterPort::sendTimingReq
    +CoherentXBar::recvTimingReq
    +CoherentXBar::CoherentXBarSlavePort::recvTimingReq
    +TimingRequestProtocol::sendReq
    +MasterPort::sendTimingReq
    +TimingSimpleCPU::sendFetch
    +TimingSimpleCPU::FetchTranslation::finish
    +ArmISA::TLB::translateComplete
    +ArmISA::TLB::translateTiming
    +ArmISA::TLB::translateTiming
    +TimingSimpleCPU::fetch
    +
    +
    +
    +

    The event loop has started, and the magic initialization scheduling is not happening anymore: now every event is being scheduled from another event.

    +
    +
    +

    From the trace, we see that we are already running from the event queue under TimingSimpleCPU::fetch as expected.

    From the backtrace we see the tortuous path that the data request takes, going through:

    @@ -21847,7 +22499,10 @@ DRAMCtrl::Rank::startup(Tick ref_tick)
    -

    The scheduling happens at frame #5:

    +

    This matches the config.ini system image, since we see that the request goes through the CoherentXBar before reaching memory, like all other CPU memory accesses; see also: gem5 crossbar interconnect.

    +
    +
    +

    The scheduling happens at frame DRAMCtrl::addToReadQueue:

    @@ -21860,94 +22515,387 @@ DRAMCtrl::Rank::startup(Tick ref_tick)
    -

    and from a quick source grep we see that nextReqEvent is a DRAMCtrl::processNextReqEvent.

    +

    From this we deduce that the DRAM has a request queue of some sort (a toy sketch of this pattern follows the list below), and that the fetch:

    +
    +
    +
      +
    • +

      has added a read request to that queue

      +
    • +
    • +

      and has made a future request to read from the queue

      +
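    Here is a toy Python sketch of that deduced pattern: receiving a timing request enqueues the packet and schedules a "process next request" event, whose execution in turn schedules the response later. This is illustrative only, not the real DRAMCtrl code; the method names imitate gem5 ones and the latency number is made up in the spirit of the log timestamps.

import heapq
import itertools

# Toy model of the deduced DRAM controller behaviour. Not gem5 code.
events, order = [], itertools.count()

def schedule(tick, desc, callback):
    print(f'{tick}: {desc} scheduled')
    heapq.heappush(events, (tick, next(order), desc, callback))

class ToyDRAMCtrl:
    RESPOND_LATENCY = 46250  # assumed, roughly matching the log above

    def __init__(self):
        self.read_queue = []

    def recv_timing_req(self, now, addr):
        self.read_queue.append(addr)
        schedule(now, 'processNextReqEvent', self.process_next_req)

    def process_next_req(self, now):
        addr = self.read_queue.pop(0)
        schedule(now + self.RESPOND_LATENCY, 'processRespondEvent',
                 lambda tick: print(f'{tick}: respond with data for {hex(addr)}'))

dram = ToyDRAMCtrl()
dram.recv_timing_req(0, 0x78)  # the fetch of the first instruction
while events:
    tick, _, desc, callback = heapq.heappop(events)
    print(f'{tick}: {desc} executed')
    callback(tick)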
    • +
    -

    The next schedule:

    +

    The signature of the function is:

    -
    0: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
    +
    DRAMCtrl::addToReadQueue(PacketPtr pkt, unsigned int pktCount)
    -

    and does a BaseXBar::Layer::releaseLayer event.

    +

    where PacketPtr is of class Packet, and so clearly the packet is coming from above.

    -

    This one is also coming from the request queue at TimingSimpleCPU::fetch. We deduce therefore that the single previous fetch event scheduled not one, but two events!

    -
    -
    -

    Now:

    +

    From:

    -
          0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    +
    p/x *pkt
    -

    comes from the previously scheduled DRAMCtrl::processNextReqEvent and schedules DRAMCtrl::Rank::processPrechargeEvent.

    -
    -
    -

    Now:

    +

    we see:

    -
          0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
    +
    addr = 0x78
    -

    also runs from DRAMCtrl::processNextReqEvent and schedules a DRAMCtrl::processRespondEvent.

    +

    which from TimingSimpleCPU analysis #5 we know is the physical address of the ELF entry point.

    -

    I’m getting bored, let’s skip to the line that appears to matter for the first instruction:

    +

    Communication goes through certain components via the class Port interface, e.g. at TimingSimpleCPU::sendFetch a call is made to send the packet forward:

    -
      46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
    +
    icachePort.sendTimingReq(ifetch_pkt)
    -

    But I got even more bored, and I will now skip to the first event before the instruction:

    +

    which ends up calling:

    -
      77000: Event_40: Timing CPU icache tick event scheduled @ 77000
    -  77000: system.cpu A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    +
    peer->recvTimingReq(pkt);
    -

    This event comes from PacketQueue::processSendEvent and schedules itself:

    +

    to reach the receiving side:

    -
    void
    -TimingSimpleCPU::TimingCPUPort::TickEvent::schedule(PacketPtr _pkt, Tick t)
    -{
    -    pkt = _pkt;
    -    cpu->schedule(this, t);
    -}
    +
    CoherentXBar::CoherentXBarSlavePort::recvTimingReq
    -

    which polymorphically resolves to:

    +

    Ports are also used to connect the XBar and the DRAM.

    +
    +
    +

    We will then see that at TimingSimpleCPU analysis #20 a reply packet will come back through the port interface down to the icache port, and only then does the decoding and execution happen.
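    Here is a tiny Python sketch of that port peering pattern: two port objects are bound to each other, and sending a timing request on one side simply calls the receive method on its peer. The class names imitate gem5's master/slave ports, but this is not the real Port code.

# Toy sketch of gem5-style port peering: send on one side calls recv on the
# peer. Names imitate gem5 (MasterPort/SlavePort) but this is not gem5 code.
class ToyPort:
    def __init__(self, name, owner):
        self.name = name
        self.owner = owner
        self.peer = None

    def bind(self, peer):
        self.peer = peer
        peer.peer = self

class ToyMasterPort(ToyPort):
    def send_timing_req(self, pkt):
        return self.peer.recv_timing_req(pkt)   # e.g. CPU -> XBar

class ToySlavePort(ToyPort):
    def recv_timing_req(self, pkt):
        print(f'{self.owner} received request {pkt} on {self.name}')
        return True

icache_port = ToyMasterPort('icache_port', 'cpu')
xbar_slave = ToySlavePort('slave[1]', 'membus')
icache_port.bind(xbar_slave)
icache_port.send_timing_req({'addr': 0x78, 'cmd': 'ReadReq'})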

    +
    +
    +
    +
    19.19.4.2.7. TimingSimpleCPU analysis #7
    +
    +

    Schedules BaseXBar::Layer::releaseLayer through:

    -
    void
    -TimingSimpleCPU::IcachePort::ITickEvent::process()
    -{
    -    cpu->completeIfetch(pkt);
    -}
    +
    EventManager::schedule
    +BaseXBar::Layer<SlavePort, MasterPort>::occupyLayer
    +BaseXBar::Layer<SlavePort, MasterPort>::succeededTiming
    +CoherentXBar::recvTimingReq
    +CoherentXBar::CoherentXBarSlavePort::recvTimingReq
    +TimingRequestProtocol::sendReq
    +MasterPort::sendTimingReq
    +TimingSimpleCPU::sendFetch
    +TimingSimpleCPU::FetchTranslation::finish
    +ArmISA::TLB::translateComplete
    +ArmISA::TLB::translateTiming
    +ArmISA::TLB::translateTiming
    +TimingSimpleCPU::fetch
    -

    and so TimingSimpleCPU::completeIfetch is, at last, the interesting TimingSimpleCPU function!

    +

    which schedules a SimpleMemory::release.

    +
    +
    +
    +
    19.19.4.2.8. TimingSimpleCPU analysis #8
    +
    +

    Executes DRAMCtrl::processNextReqEvent.

    +
    +
    +
    +
    19.19.4.2.9. TimingSimpleCPU analysis #9
    +
    +

    Schedules DRAMCtrl::Rank::processActivateEvent through:

    +
    +
    +
    +
    EventManager::schedule
    +DRAMCtrl::activateBank
    +DRAMCtrl::doDRAMAccess
    +DRAMCtrl::processNextReqEvent
    +
    +
    +
    +
    +
    19.19.4.2.10. TimingSimpleCPU analysis #10
    +
    +

    Schedules DRAMCtrl::processRespondEvent through:

    +
    +
    +
    +
    EventManager::schedule
    +DRAMCtrl::processNextReqEvent
    +
    +
    +
    +
    +
    19.19.4.2.11. TimingSimpleCPU analysis #11
    +
    +

    Schedules DRAMCtrl::processNextReqEvent through:

    +
    +
    +
    +
    EventManager::schedule
    +DRAMCtrl::processNextReqEvent
    +
    +
    +
    +
    +
    19.19.4.2.12. TimingSimpleCPU analysis #12
    +
    +

    Executes DRAMCtrl::Rank::processActivateEvent.

    -

    The end of this instruction must be setting things up in a way that can continue the PC walk loop, and by looking at the source and traces, it is clearly from: TimingSimpleCPU::advanceInst which calls TimingSimpleCPU::fetch, which is the very thing we did in this simulation!!! OMG, that’s the loop.

    +

    which schedules:

    +
    +
    +
    +
    19.19.4.2.13. TimingSimpleCPU analysis #13
    +
    +

    Schedules DRAMCtrl::Rank::processPowerEvent through:

    +
    +
    +
    +
    EventManager::schedule
    +DRAMCtrl::Rank::schedulePowerEvent
    +DRAMCtrl::Rank::processActivateEvent
    +
    +
    +
    +
    +
    19.19.4.2.14. TimingSimpleCPU analysis #14
    +
    +

    Executes DRAMCtrl::Rank::processPowerEvent.

    -

    One final thing to check, is how the memory reads are going to make the processor stall in the middle of an instruction.

    +

    This must just be some power statistics stuff, as it does not schedule anything else.

    +
    +
    +
    +
    19.19.4.2.15. TimingSimpleCPU analysis #15
    +
    +

    Executes BaseXBar::Layer<SrcType, DstType>::releaseLayer.

    +
    +
    +
    +
    19.19.4.2.16. TimingSimpleCPU analysis #16
    +
    +

    Executes DRAMCtrl::processNextReqEvent().

    +
    +
    +
    +
    19.19.4.2.17. TimingSimpleCPU analysis #17
    +
    +

    Executes DRAMCtrl::processRespondEvent().

    +
    +
    +
    +
    19.19.4.2.18. TimingSimpleCPU analysis #18
    +
    +

    Schedules PacketQueue::processSendEvent() through:

    +
    +
    +
    +
    PacketQueue::schedSendEvent
    +PacketQueue::schedSendTiming
    +QueuedSlavePort::schedTimingResp
    +DRAMCtrl::accessAndRespond
    +DRAMCtrl::processRespondEvent
    +
    +
    +
    +
    +
    19.19.4.2.19. TimingSimpleCPU analysis #19
    +
    +

    Executes PacketQueue::processSendEvent().

    +
    +
    +
    +
    19.19.4.2.20. TimingSimpleCPU analysis #20
    +
    +

    Schedules PacketQueue::processSendEvent through:

    +
    +
    +
    +
    EventManager::schedule
    +PacketQueue::schedSendEvent
    +PacketQueue::schedSendTiming
    +QueuedSlavePort::schedTimingResp
    +CoherentXBar::recvTimingResp
    +CoherentXBar::CoherentXBarMasterPort::recvTimingResp
    +TimingResponseProtocol::sendResp
    +SlavePort::sendTimingResp
    +RespPacketQueue::sendTiming
    +PacketQueue::sendDeferredPacket
    +PacketQueue::processSendEvent
    +
    +
    +
    +

    From this backtrace, we see that this event is happening as the fetch reply packet finally comes back from DRAM.

    +
    +
    +
    +
    19.19.4.2.21. TimingSimpleCPU analysis #21
    +
    +

    Schedules BaseXBar::Layer<SrcType, DstType>::releaseLayer through:

    +
    +
    +
    +
    EventManager::schedule
    +BaseXBar::Layer<MasterPort, SlavePort>::occupyLayer
    +BaseXBar::Layer<MasterPort, SlavePort>::succeededTiming
    +CoherentXBar::recvTimingResp
    +CoherentXBar::CoherentXBarMasterPort::recvTimingResp
    +TimingResponseProtocol::sendResp
    +SlavePort::sendTimingResp
    +RespPacketQueue::sendTiming
    +PacketQueue::sendDeferredPacket
    +PacketQueue::processSendEvent
    +
    +
    +
    +
    +
    19.19.4.2.22. TimingSimpleCPU analysis #22
    +
    +

    Executes BaseXBar::Layer<SrcType, DstType>::releaseLayer.

    +
    +
    +
    +
    19.19.4.2.23. TimingSimpleCPU analysis #23
    +
    +

    Executes PacketQueue::processSendEvent.

    +
    +
    +
    +
    19.19.4.2.24. TimingSimpleCPU analysis #24
    +
    +

    Schedules TimingSimpleCPU::IcachePort::ITickEvent::process() through:

    +
    +
    +
    +
    EventManager::schedule
    +TimingSimpleCPU::TimingCPUPort::TickEvent::schedule
    +TimingSimpleCPU::IcachePort::recvTimingResp
    +TimingResponseProtocol::sendResp
    +SlavePort::sendTimingResp
    +RespPacketQueue::sendTiming
    +PacketQueue::sendDeferredPacket
    +PacketQueue::processSendEvent
    +
    +
    +
    +
    +
    19.19.4.2.25. TimingSimpleCPU analysis #25
    +
    +

    Executes TimingSimpleCPU::IcachePort::ITickEvent::process().

    +
    +
    +

    This custom process then calls TimingSimpleCPU::completeIfetch(PacketPtr pkt), and that finally executes the very first instruction:

    +
    +
    +
    +
    77000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    +
    +
    +
    +

    The end of this instruction must be setting things up in a way that can continue the PC walk loop, and by looking at the source and traces, it clearly comes from TimingSimpleCPU::advanceInst, which calls TimingSimpleCPU::fetch.

    +
    +
    +

    And TimingSimpleCPU::fetch is the very thing we did in this simulation at TimingSimpleCPU analysis #0!!! OMG, that’s the loop.
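    The loop can be caricatured in a few lines of Python: fetch sends a request and returns, and when the response eventually comes back, completeIfetch executes the instruction and advanceInst starts the next fetch. This is a deliberately simplified sketch with the whole memory round trip collapsed into a direct callback; it is not the real TimingSimpleCPU code.

# Caricature of the TimingSimpleCPU main loop:
# fetch -> (memory responds later) -> completeIfetch -> advanceInst -> fetch.
class ToyTimingCPU:
    def __init__(self, program):
        self.program = program
        self.pc = 0

    def fetch(self):
        if self.pc >= len(self.program):
            print('no more instructions, exiting')
            return
        # real code: translate, build a packet, icachePort.sendTimingReq(...)
        self.memory_respond(self.program[self.pc])

    def memory_respond(self, insn):
        # real code: this happens much later, via the port/event machinery
        self.complete_ifetch(insn)

    def complete_ifetch(self, insn):
        print(f'executing {insn} at pc={self.pc}')
        self.advance_inst()

    def advance_inst(self):
        self.pc += 1
        self.fetch()          # and that is the loop

ToyTimingCPU(['movz x0, #1', 'adr x1, #28', 'svc #0']).fetch()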

    +
    +
    +
    +
    19.19.4.2.26. TimingSimpleCPU analysis #26
    +
    +

    Schedules DRAMCtrl::processNextReqEvent through:

    +
    +
    +
    +
    EventManager::schedule
    +DRAMCtrl::addToReadQueue
    +DRAMCtrl::recvTimingReq
    +DRAMCtrl::MemoryPort::recvTimingReq
    +TimingRequestProtocol::sendReq
    +MasterPort::sendTimingReq
    +CoherentXBar::recvTimingReq
    +CoherentXBar::CoherentXBarSlavePort::recvTimingReq
    +TimingRequestProtocol::sendReq
    +MasterPort::sendTimingReq
    +TimingSimpleCPU::sendFetch
    +TimingSimpleCPU::FetchTranslation::finish
    +ArmISA::TLB::translateComplete
    +ArmISA::TLB::translateTiming
    +ArmISA::TLB::translateTiming
    +TimingSimpleCPU::fetch
    +TimingSimpleCPU::advanceInst
    +TimingSimpleCPU::completeIfetch
    +TimingSimpleCPU::IcachePort::ITickEvent::process
    +
    +
    +
    +
    +
    19.19.4.2.27. TimingSimpleCPU analysis #27
    +
    +

    Schedules BaseXBar::Layer<SrcType, DstType>::releaseLayer through:

    +
    +
    +
    +
    EventManager::schedule
    +BaseXBar::Layer<SlavePort, MasterPort>::occupyLayer
    +BaseXBar::Layer<SlavePort, MasterPort>::succeededTiming
    +CoherentXBar::recvTimingReq
    +CoherentXBar::CoherentXBarSlavePort::recvTimingReq
    +TimingRequestProtocol::sendReq
    +MasterPort::sendTimingReq
    +TimingSimpleCPU::sendFetch
    +TimingSimpleCPU::FetchTranslation::finish
    +ArmISA::TLB::translateComplete
    +ArmISA::TLB::translateTiming
    +ArmISA::TLB::translateTiming
    +TimingSimpleCPU::fetch
    +TimingSimpleCPU::advanceInst
    +TimingSimpleCPU::completeIfetch
    +TimingSimpleCPU::IcachePort::ITickEvent::process
    +
    +
    +
    +
    +
    19.19.4.2.28. TimingSimpleCPU analysis #28
    +
    +

Executes DRAMCtrl::processNextReqEvent.

    +
    +
    +
    +
    19.19.4.2.29. TimingSimpleCPU analysis #29
    +
    +

Schedules DRAMCtrl::processRespondEvent().

    +
    +
    +
    +
    19.19.4.2.30. TimingSimpleCPU analysis: LDR stall
    +
    +

One important thing we want to check now is how memory reads are going to make the processor stall in the middle of an instruction.

    +
    +
    +

Since we are using a simple CPU without a pipeline, a data memory access stalls everything: there is no further progress until memory comes back.

For that, we can use GDB to break at the TimingSimpleCPU::completeIfetch of the first LDR done in our test program.

    @@ -21994,48 +22942,168 @@ TimingSimpleCPU::IcachePort::ITickEvent::process()
    +
    +

    The following is the region of interest of the event log:

    +
    +
    +
    +
     175000: Event: Event_40: Timing CPU icache tick 40 executed @ 175000
    + 175000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 175000
    + 175000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 60 scheduled @ 176000
    +
    + 175000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 175000
    + 175000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 scheduled @ 193750
    + 175000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 175000
    +
    + 175000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 175000
    +
    + 176000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 60 executed @ 176000
    +
    + 193750: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 executed @ 193750
    + 193750: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 scheduled @ 221750
    +
    + 221750: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 executed @ 221750
    + 221750: Event: system.membus.slave[2]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 66 scheduled @ 224000
    + 221750: Event: system.membus.respLayer2.wrapped_function_event: EventFunctionWrapped 67 scheduled @ 224000
    +
    + 224000: Event: system.membus.respLayer2.wrapped_function_event: EventFunctionWrapped 67 executed @ 224000
    +
    + 224000: Event: system.membus.slave[2]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 66 executed @ 224000
    + 224000: Event: Event_42: Timing CPU dcache tick 42 scheduled @ 224000
    +
    + 224000: Event: Event_42: Timing CPU dcache tick 42 executed @ 224000
    + 175000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8    :   ldr   w2, #4194464       : MemRead :  D=0x0000000000000006 A=0x4000a0  flags=(IsInteger|IsMemRef|IsLoad)
    +
    +
    +
    +

    We first find it by looking for the ExecEnable of LDR.

    +
    +
    +

Then we go up to the previous Timing CPU icache tick event, which, from the analysis of the previous instructions, we know is where the instruction execution starts: the LDR instruction fetch is done by then!

    +
    +
    +

Next, several events happen as the data request percolates through the memory system; it must be very similar to the instruction fetches. TODO: analyze the event function names.

    +
    +
    +

Finally, we reach:

    +
    +
    +
    +
     224000: Event: Event_42: Timing CPU dcache tick 42 executed @ 224000
    + 175000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8    :   ldr   w2, #4194464       : MemRead :  D=0x0000000000000006 A=0x4000a0  flags=(IsInteger|IsMemRef|IsLoad)
    +
    +
    +
    +

    from which we guess:

    +
    +
    +
      +
    • +

      224000: this is the time that the data request finally returned, and at which execute gets called

      +
    • +
    • +

175000: the log line is only printed at the end of execution, but it shows not the actual time at which things finished, but rather the time at which the ifetch finished, which happened in the past

      +
    • +
    +
    +
    19.19.4.3. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches
    +

    Let’s just add --caches to see if things go any faster:

    -
          0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    +
          0: Event: system.cpu.wrapped_function_event: EventFunctionWrapped 43 scheduled @ 0
     **** REAL SIMULATION ****
    -      0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
    -      0: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
    -      0: Event_84: generic event scheduled @ 0
    +      0: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 14 scheduled @ 7786250
    +      0: Event: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped 20 scheduled @ 7786250
    +      0: Event: Event_84: generic 84 scheduled @ 0
     info: Entering event queue @ 0.  Starting simulation...
    -      0: Event_84: generic event rescheduled @ 18446744073709551615
    -      0: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
    -   1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
    -   1000: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 2000
    -   1000: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
    -   1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
    -   1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 5000
    -   1000: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
    -  46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
    -  74250: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
    -  74250: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 80000
    -  77000: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 78000
    -  78000: Event_40: Timing CPU icache tick event scheduled @ 78000
    -  78000: system.cpu A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    -  78000: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 83000
    -  83000: Event_40: Timing CPU icache tick event scheduled @ 83000
    -  83000: system.cpu A0 T0 : @asm_main_after_prologue+4    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  flags=(IsInteger)
    +      0: Event: Event_84: generic 84 rescheduled @ 18446744073709551615
    +      0: Event: system.cpu.wrapped_function_event: EventFunctionWrapped 43 executed @ 0
    +      0: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 1000
    +   1000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 executed @ 1000
    +   1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 1000
    +   1000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 70 scheduled @ 2000
    +   1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 1000
    +   1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 12 scheduled @ 1000
    +   1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 scheduled @ 46250
    +   1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 5000
    +   1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 12 executed @ 1000
    +   1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 15 scheduled @ 1000
    +   1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 15 executed @ 1000
    +   2000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 70 executed @ 2000
    +   5000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 5000
    +  46250: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 executed @ 46250
    +  46250: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 scheduled @ 74250
    +  74250: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 executed @ 74250
    +  74250: Event: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 74 scheduled @ 77000
    +  74250: Event: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped 75 scheduled @ 80000
    +  77000: Event: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 74 executed @ 77000
    +  77000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 scheduled @ 78000
    +  78000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 executed @ 78000
    +  78000: Event: Event_40: Timing CPU icache tick 40 scheduled @ 78000
    +  78000: Event: Event_40: Timing CPU icache tick 40 executed @ 78000
    +  78000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    +  78000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 scheduled @ 83000
    +  80000: Event: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped 75 executed @ 80000
    +  83000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 executed @ 83000
    +  83000: Event: Event_40: Timing CPU icache tick 40 scheduled @ 83000
    +  83000: Event: Event_40: Timing CPU icache tick 40 executed @ 83000
    +  83000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  flags=(IsInteger)
    +  83000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 84000
       [...]
    - 191000: system.cpu A0 T0 : @asm_main_after_prologue+28    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
    - 191000: Event_85: generic event scheduled @ 191000
+ 191000: Event: Event_85: generic 85 scheduled @ 191000
+ 191000: Event: Event_85: generic 85 executed @ 191000

    So yes, --caches does work here, leading to a runtime of 191000 rather than 469000 without caches!

    +
    +

Notably, we now see that very little time passed between the first and second instructions, presumably because, rather than going out all the way to the DRAM system, the event chain stops right at the icache.cpu_side when a hit happens, which must have been the case for the second instruction, since it is adjacent to the first one.

    +
    +
    +

It is also interesting to look into the generated config.dot.svg to compare it to the one without caches: Figure 2, “config.dot.svg for a TimingSimpleCPU without caches.”. With caches: Figure 3, “config.dot.svg for a TimingSimpleCPU with caches.”.

    +
    +
    +

    We can see from there, that we now have icache and dcache elements inside the CPU block, and that the CPU icache and dcache ports go through the caches to the SystemXBar rather than being directly connected as before.

    +
    +
    +

    It is worth noting that the caches do not affect the ArmITB and ArmDTB TLBs, since those are already caches themselves.

    +
    +
    +
    +
    Figure 3. config.dot.svg for a TimingSimpleCPU with caches.
    +
    -
    19.19.4.4. gem5 event queue MinorCPU syscall emulation freestanding example analysis
    +
    19.19.4.4. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs
    +
    +

    TODO is this the minimal setup that allows us to see the gem5 crossbar interconnect? Can we see anything in AtomicSimpleCPU?

    +
    +
    +

    It would be amazing to analyze a simple example with interconnect packets possibly invalidating caches of other CPUs.

    +
    +
    +

To observe it, we could create a well controlled workload with instructions that flush memory, and run it on two CPUs.

    +
    +
    +

If we don’t use such memory flushing instructions, we would only see the interconnect at work when the caches run out of space.

    +
    +
    +
    +
    Figure 4. config.dot.svg for a system with two TimingSimpleCPU with caches.
    +
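For a concrete starting point, the following is a hypothetical sketch of such a workload (the file name, build command and the choice of DC CIVAC as the memory flushing instruction are my assumptions, not something taken from this repo): two threads keep writing to the same cache line and explicitly cleaning and invalidating it, which should force coherency traffic on every iteration rather than only on natural evictions. It assumes an aarch64 Linux-like target where EL0 cache maintenance is allowed, and that the simulated CPU model implements the operation.

// Hypothetical workload sketch, e.g. flush_pingpong.cpp (names made up).
// Build e.g. with: aarch64-linux-gnu-g++ -O2 -static -pthread flush_pingpong.cpp
#include <atomic>
#include <cstdint>
#include <thread>

// One cache-line-aligned shared location that both CPUs fight over.
alignas(64) static std::atomic<std::uint64_t> shared_line[8];

static void flush_line(void *p) {
    // Clean and invalidate the line to the point of coherency, then barrier.
    asm volatile("dc civac, %0" : : "r"(p) : "memory");
    asm volatile("dsb sy" : : : "memory");
}

static void worker(std::uint64_t value) {
    for (int i = 0; i < 1000; ++i) {
        shared_line[0].store(value, std::memory_order_relaxed);
        flush_line(&shared_line[0]);
    }
}

int main() {
    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
}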
    +
    +
    +
    19.19.4.5. gem5 event queue MinorCPU syscall emulation freestanding example analysis

    The events for the Atomic CPU were pretty simple: basically just ticks.

    @@ -22043,6 +23111,12 @@ info: Entering event queue @ 0. Starting simulation...

    But as we venture into more complex CPU models such as MinorCPU, the events get much more complex and interesting.

    +

The memory system part must be similar to that of the TimingSimpleCPU that we previously studied in gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis: the main thing we want to see is how the CPU pipeline speeds up execution by preventing some memory stalls.

    +
    +
    +

The config.dot.svg also indicates that everything is exactly as in gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches, except that the CPU is a MinorCPU instead of a TimingSimpleCPU, and the --caches are now mandatory.

    +
    +

    TODO: analyze the trace for:

    @@ -22060,6 +23134,12 @@ info: Entering event queue @ 0. Starting simulation...
    +
    +
    19.19.4.6. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis
    +
    +

    TODO: like gem5 event queue MinorCPU syscall emulation freestanding example analysis but even more complex!

    +
    +

    19.19.5. gem5 stats internals

    @@ -22209,7 +23289,32 @@ enum class Arch {

    19.19.7. gem5 build system

    -
    19.19.7.1. gem5 polymorphic ISA includes
    +
    19.19.7.1. gem5 build broken on recent compiler version
    +
    +

    gem5 moves a bit slowly, and if your host compiler is very new, the gem5 build might be broken for it, e.g. this was the case for Ubuntu 19.10 with GCC 9 and gem5 62d75e7105fe172eb906d4f80f360ff8591d4178 from Dec 2019.

    +
    +
    +

    This happens mostly because GCC keeps getting more strict with warnings and gem5 uses -Werror.

    +
    +
    +

The specific problem mentioned above was later fixed, but if it ever happens again, you can work around it either by disabling -Werror:

    +
    +
    +
    +
    ./build-gem5 -- CCFLAGS=-Wno-error
    +
    +
    +
    +

    or by installing an older compiler and using it with something like:

    +
    +
    +
    +
    ./build-gem5 -- CC=gcc-8 CXX=g++-8
    +
    +
    +
    +
    +
    19.19.7.2. gem5 polymorphic ISA includes

    E.g. src/cpu/decode_cache.hh includes:

    @@ -22288,7 +23393,7 @@ build/ARM/config/the_isa.hh
    -
    19.19.7.2. Why are all C++ symlinked into the gem5 build dir?
    +
    19.19.7.3. Why are all C++ symlinked into the gem5 build dir?

    Some scons madness.

    @@ -23518,6 +24623,36 @@ echo 1 > /proc/sys/vm/overcommit_memory

    userland/cpp/hello.cpp

  • +

    classes

    +
    + +
    +
  • +
  • templates

      @@ -23586,26 +24721,6 @@ echo 1 > /proc/sys/vm/overcommit_memory
  • -
  • -

    Language madness

    -
    - -
    -
@@ -26166,24 +27281,27 @@ One of the ways I often want to do that, especially when doing [user mode simula

Here is a tiny example that calls just `exit` from the C standard library:

main.S

.global _start
_start:
    mov $0, %rdi
    call exit

Compile and run with:

gcc -ggdb3 -nostartfiles -static -o exit.out exit.S
qemu-x86_64 -d in_asm exit.out

However, for programming convenience, and to potentially keep my examples more OS portable, I would like to avoid making raw system calls, which would of course work, by using C standard library functions instead.
     
     But I'm afraid that some of those C standard library functions will fail in subtle ways because I have skipped required initialization steps that would normally happen before `main`.
     
    @@ -34170,10 +35288,263 @@ west build -b qemu_aarch64 samples/hello_world

    Algorithms to keep the caches of different cores of a system coherent.

    +

The main goal of such systems is to reduce the number of messages that have to be sent on the coherency bus, and most importantly to memory (accesses to which first pass through the coherency bus).

    +
    +

E.g.: if one processor writes to its cache, other processors have to know about it before they read from that address.

    +
    +

    The main software use case example to have in mind is that of multiple threads incrementing an atomic counter as in userland/cpp/atomic/std_atomic.cpp, see also: atomic.cpp.

    +
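To make the use case concrete, here is a minimal standalone counter in the spirit of that example (written from scratch here, not copied from the repo): several threads increment the same std::atomic, so its cache line keeps migrating between the cores' private caches, and it is the coherency protocol that guarantees the final count is exact.

// Minimal atomic counter: the interesting part for this chapter is not the
// C++ semantics, but the cache line of `counter` bouncing between cores.
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::atomic<unsigned long> counter{0};
    const int nthreads = 4;
    const unsigned long iters = 100000;
    std::vector<std::thread> threads;
    for (int i = 0; i < nthreads; ++i) {
        threads.emplace_back([&counter, iters] {
            for (unsigned long j = 0; j < iters; ++j)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto &t : threads)
        t.join();
    // Always prints nthreads * iters, i.e. 400000.
    std::cout << counter.load() << std::endl;
}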
    -

    32.1.1. MSI protocol

    +

    32.1.1. VI protocol

    +
    +

    Mentioned at:

    +
    + +
    +

This is the most trivial protocol, but it is likely so bad that most sources don’t even mention it.

    +
    +
    +

In what follows I make up some of the design choice comparisons; this needs confirmation.

    +
    +
    +

In this protocol, every cache line needs only a single bit of state: validity.

    +
    +
    +

    At the start, everything is invalid.

    +
    +
    +

Then, when you need to read and are invalid, you send a read on the bus. If there is another valid cache in another CPU, it services the request. Otherwise, the request goes to memory. After the read you become valid.

    +
    +
    +

A read while valid generates no bus requests, which is good.

    +
    +
    +

    When you write, if you are invalid, you must first read to get the full cache line, like for any other protocol.

    +
    +
    +

    Then, there are two possible design choices, either:

    +
    +
    +
      +
    • +

      that read is marked as exclusive, and all caches that had it snoop and become invalid.

      +
      +

      Upside: no need to send the new data to the bus.

      +
      +
      +

      Downside: more invalidations. But those are not too serious, because future invalid reads tend to just hit the remaining valid cache.

      +
      +
    • +
    • +

      after the read and write, you send the data on the bus, and those that had it update and become valid.

      +
      +

Downside: much more data on the bus, so likely this is not going to be the best choice.

      +
      +
    • +
    +
    +
    +

    So we take the first option.

    +
    +
    +

When you write and are valid, you don’t need to read. But you still have to invalidate everyone else, because multiple reads can lead to multiple valid holders, and otherwise the other valid holders would keep reading old values.

    +
    +
    +

We could either do this with an exclusive read, and ignore the return, or with a new Invalidate request that has no reply. This invalidation is called BusUpgr to match Wikipedia’s naming.

    +
    +
    +

    Write also has two other possible design choices, either:

    +
    +
    +
      +
    • +

      every write writes through to memory. This is likely never the best option.

      +
    • +
    • +

      when the cache is full, eviction leads to a write to memory.

      +
      +

If multiple valid holders may exist, then this may lead to multiple writebacks of the same line to memory.

      +
      +
    • +
    +
    +
    +

    So we take the second option.

    +
    +
    +

    With this we would have:

• V
  • PrRd: stay in V, no bus message
  • PrWr: stay in V, send BusUpgr
  • BusRd: stay in V, send BusData
  • BusRdX: go to I, send BusData
  • BusUpgr: go to I, no bus message
• I
  • PrRd: go to V, send BusRd
  • PrWr: go to V, send BusRdX
  • BusRd: stay in I, no bus message
  • BusRdX: stay in I, no bus message
  • BusUpgr: stay in I, no bus message

    Here Flush and BusData replies are omitted since those never lead to a change of state, nor to the sending of further messages.

    +
    +
    +
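The list above can also be written down directly as a state machine. Below is a minimal standalone C++ sketch of it (the names and types are made up for this illustration; this is not gem5 or SLICC code): given the current state of a line and an observed processor or bus event, it returns the next state plus the bus message to send, if any.

// Standalone sketch of the VI state machine from the list above.
#include <cstdio>

enum class State { I, V };
enum class Event { PrRd, PrWr, BusRd, BusRdX, BusUpgr };
enum class Msg { None, BusRd, BusRdX, BusUpgr, BusData };

struct Transition { State next; Msg send; };

static Transition vi(State s, Event e) {
    if (s == State::V) {
        switch (e) {
        case Event::PrRd:    return {State::V, Msg::None};
        case Event::PrWr:    return {State::V, Msg::BusUpgr};
        case Event::BusRd:   return {State::V, Msg::BusData};
        case Event::BusRdX:  return {State::I, Msg::BusData};
        case Event::BusUpgr: return {State::I, Msg::None};
        }
    } else {
        switch (e) {
        case Event::PrRd:    return {State::V, Msg::BusRd};
        case Event::PrWr:    return {State::V, Msg::BusRdX};
        case Event::BusRd:
        case Event::BusRdX:
        case Event::BusUpgr: return {State::I, Msg::None};
        }
    }
    return {s, Msg::None};
}

int main() {
    // A line starts invalid, the local CPU writes it (read for ownership),
    // then a remote CPU's BusRdX is snooped and we fall back to invalid.
    Transition t = vi(State::I, Event::PrWr);
    std::printf("after PrWr:   next=%d send=%d\n", (int)t.next, (int)t.send);
    t = vi(t.next, Event::BusRdX);
    std::printf("after BusRdX: next=%d send=%d\n", (int)t.next, (int)t.send);
}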

    TODO at:

    +
    + +
    +

Why does PrWr stay in invalid? Why do writes always go to memory? Why not wait until eviction?

    +
    +
    +
    +

    32.1.2. MSI protocol

    @@ -34399,6 +35770,9 @@ CACHE2 S nyy

Since the writer will become the new sole owner of the line, it can get the cache line directly from us without going to DRAM at all! This is fine because DRAM can remain out of date without problems: the writer now holds the only up-to-date copy.

    +
    • @@ -34464,7 +35838,7 @@ CACHE2 S nyy
    -

    32.1.2. MESI protocol

    +

    32.1.3. MESI protocol

    @@ -34499,7 +35873,7 @@ CACHE2 S nyy
    -

    32.1.3. MOSI protocol

    +

    32.1.4. MOSI protocol

    https://en.wikipedia.org/wiki/MOSI_protocol

    @@ -34508,7 +35882,7 @@ CACHE2 S nyy
    -

    32.1.4. MOESI protocol

    +

    32.1.5. MOESI protocol

    https://en.wikipedia.org/wiki/MOESI_protocol