mirror of https://github.com/cirosantilli/linux-kernel-module-cheat.git
synced 2026-01-27 20:14:27 +01:00

gem5: follow most of TimingSimpleCPU

355 README.adoc
@@ -12185,7 +12185,13 @@ Implementations:

* `AtomicSimpleCPU`: the default one. Memory accesses happen instantaneously. The fastest simulation except for KVM, but not realistic at all.
+
Useful to <<gem5-restore-checkpoint-with-a-different-cpu,boot Linux fast and then checkpoint and switch to a more detailed CPU>>.
* `TimingSimpleCPU`: memory accesses are realistic, but the CPU has no pipeline. The simulation is faster than detailed models, but slower than `AtomicSimpleCPU`.
+
To fully understand `TimingSimpleCPU`, see: <<gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis>>.
+
Without caches, the CPU just stalls all the time waiting for memory requests for every advance of the PC or memory read from an instruction!
+
Caches do make a difference here of course, and lead to much faster memory return times.

<<gem5-kvm,KVM CPUs>> are an alternative way of fast forwarding boot when they work.
@@ -12570,7 +12576,7 @@ if (_status != Idle)

+
A new event is scheduled for the current time itself. TODO what is this event?

Let's study the first event. From <<debug-the-emulator,GDB>>, let's break at the point that prints the messages, `Trace::OstreamLogger::logMessage()`, to see where events are being scheduled from:

....
Trace::OstreamLogger::logMessage() at trace.cc:149 0x5555593b3b1e
@@ -12658,6 +12664,351 @@ new GlobalSimLoopExitEvent(when + simQuantum, message, exit_code, repeat);

Tested at b4879ae5b0b6644e6836b0881e4da05c64a6550d.

===== gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis

<<gem5-basesimplecpu,TimingSimpleCPU>> should be the second simplest CPU to analyze, so let's give it a try:

....
./run \
  --arch aarch64 \
  --emulator gem5 \
  --userland userland/arch/aarch64/freestanding/linux/hello.S \
  --trace Event,ExecAll \
  --trace-stdout \
  -- \
  --cpu-type TimingSimpleCPU \
;
....

As of LKMC 9bfbff244d713de40e5686bd370eadb20cf78c7b + 1 the log is now much more complex.

Here is an abridged version with:

* the beginning up to the second instruction
* the ending

because all that happens in between is exactly the same as the first two instructions and therefore boring:

....
0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
**** REAL SIMULATION ****
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
0: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
0: Event_74: generic event scheduled @ 0
info: Entering event queue @ 0. Starting simulation...
0: Event_74: generic event rescheduled @ 18446744073709551615
0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
0: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 5000
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
74250: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
74250: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
77000: Event_40: Timing CPU icache tick event scheduled @ 77000
77000: system.cpu A0 T0 : @asm_main_after_prologue : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
77000: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 78000
77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 95750
77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
95750: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 123750
123750: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 126000
123750: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 126000
126000: Event_40: Timing CPU icache tick event scheduled @ 126000
[...]
469000: system.cpu A0 T0 : @asm_main_after_prologue+28 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
469000: Event_75: generic event scheduled @ 469000
....

`0: system.cpu.wrapped_function_event` schedules the initial tick, much like for `AtomicSimpleCPU`. This time however, it is not a tick, but rather a fetch event that gets scheduled:

....
TimingSimpleCPU::activateContext(ThreadID thread_num)
{
    DPRINTF(SimpleCPU, "ActivateContext %d\n", thread_num);

    assert(thread_num < numThreads);

    threadInfo[thread_num]->notIdleFraction = 1;
    if (_status == BaseSimpleCPU::Idle)
        _status = BaseSimpleCPU::Running;

    // kick things off by initiating the fetch of the next instruction
    if (!fetchEvent.scheduled())
        schedule(fetchEvent, clockEdge(Cycles(0)));
....

We have a fetch instead of a tick here compared to `AtomicSimpleCPU`, because in the timing CPU we must first get the instruction opcode from DRAM, which takes some cycles to return!

By looking at the source, we see that `fetchEvent` runs `TimingSimpleCPU::fetch`.

`0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250`: from GDB we see that it comes from `DRAMCtrl::startup` in `mem/dram_ctrl.cc`, which contains:

....
void
DRAMCtrl::startup()
{
    // remember the memory system mode of operation
    isTimingMode = system()->isTimingMode();

    if (isTimingMode) {
        // timestamp offset should be in clock cycles for DRAMPower
        timeStampOffset = divCeil(curTick(), tCK);

        // update the start tick for the precharge accounting to the
        // current tick
        for (auto r : ranks) {
            r->startup(curTick() + tREFI - tRP);
        }

        // shift the bus busy time sufficiently far ahead that we never
        // have to worry about negative values when computing the time for
        // the next request, this will add an insignificant bubble at the
        // start of simulation
        nextBurstAt = curTick() + tRP + tRCD;
    }
}
....

which then calls:

....
void
DRAMCtrl::Rank::startup(Tick ref_tick)
{
    assert(ref_tick > curTick());

    pwrStateTick = curTick();

    // kick off the refresh, and give ourselves enough time to
    // precharge
    schedule(refreshEvent, ref_tick);
}
....

By looking up some variable definitions in the source, we now see some memory parameters clearly:

* `ranks`: `std::vector<DRAMCtrl::Rank*>` with 2 elements. TODO why do we have 2? What does it represent? Likely linked to <<gem5-config-ini,`config.ini`>> at `system.mem_ctrls.ranks_per_channel=2`
* `tCK=1250`, `tREFI=7800000`, `tRP=13750`, `tRCD=13750`: all defined in a single code location with a comment:
+
....
/**
 * Basic memory timing parameters initialized based on parameter
 * values.
 */
....
+
Their values can be seen under `config.ini` and they are documented in `src/mem/DRAMCtrl.py`, e.g.:
+
....
# the base clock period of the DRAM
tCK = Param.Latency("Clock period")

# minimum time between a precharge and subsequent activate
tRP = Param.Latency("Row precharge time")

# the amount of time in nanoseconds from issuing an activate command
# to the data being available in the row buffer for a read/write
tRCD = Param.Latency("RAS to CAS delay")

# refresh command interval, how often a "ref" command needs
# to be sent. It is 7.8 us for a 64ms refresh requirement
tREFI = Param.Latency("Refresh command interval")
....

So we realize that we are going into deep DRAM modelling, more detail than a mere mortal should ever need to know.

`curTick() + tREFI - tRP = 0 + 7800000 - 13750 = 7786250`, which is when that `refreshEvent` was scheduled. Our simulation ends way before that point however, so we will never know what it did, thank God.

`0: Event_74: generic event scheduled @ 0` and `0: Event_74: generic event rescheduled @ 18446744073709551615`: schedule the final exit event, same as for `AtomicSimpleCPU`.

The next interesting event is:

....
system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
....

which comes from:

....
#0 Trace::OstreamLogger::logMessage
#1 void Trace::Logger::dprintf
#2 Event::trace
#3 EventQueue::schedule
#4 EventManager::schedule
#5 DRAMCtrl::addToReadQueue
#6 DRAMCtrl::recvTimingReq
#7 DRAMCtrl::MemoryPort::recvTimingReq
#8 TimingRequestProtocol::sendReq
#9 MasterPort::sendTimingReq
#10 CoherentXBar::recvTimingReq
#11 CoherentXBar::CoherentXBarSlavePort::recvTimingReq(Packet*)
#12 TimingRequestProtocol::sendReq
#13 MasterPort::sendTimingReq
#14 TimingSimpleCPU::sendFetch
#15 TimingSimpleCPU::FetchTranslation::finish
#16 ArmISA::TLB::translateComplete
#17 ArmISA::TLB::translateTiming
#18 ArmISA::TLB::translateTiming
#19 TimingSimpleCPU::fetch
#20 TimingSimpleCPU::<lambda()>::operator()(void)
#21 std::_Function_handler<void(), TimingSimpleCPU::TimingSimpleCPU(TimingSimpleCPUParams*)::<lambda()> >
#22 std::function<void ()>::operator()() const
#23 EventFunctionWrapper::process
#24 EventQueue::serviceOne
#25 doSimLoop
#26 simulate
....

From the trace, we see that we are already running from the event queue. Therefore, we must have been running a previously scheduled event, and from the previous event logs, the only such event is `0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0`, which scheduled a memory fetch!

From the backtrace we see the tortuous path that the data request takes, going through:

* `ArmISA::TLB`
* `CoherentXBar`
* `DRAMCtrl`

The scheduling happens at frame `#5`:

....
// If we are not already scheduled to get a request out of the
// queue, do so now
if (!nextReqEvent.scheduled()) {
    DPRINTF(DRAM, "Request scheduled immediately\n");
    schedule(nextReqEvent, curTick());
}
....

and from a quick source grep we see that `nextReqEvent` is a `DRAMCtrl::processNextReqEvent`.

The next schedule:

....
0: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
....

schedules a `BaseXBar::Layer::releaseLayer` event.

This one is also coming from the request queue at `TimingSimpleCPU::fetch`. We deduce therefore that the single previous fetch event scheduled not one, but two events!

Now:

....
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
....

comes from the previously scheduled `DRAMCtrl::processNextReqEvent` and schedules `DRAMCtrl::Rank::processPrechargeEvent`.

Now:

....
0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
....

also runs from `DRAMCtrl::processNextReqEvent` and schedules a `DRAMCtrl::processRespondEvent`.

I'm getting bored, let's skip to the line that appears to matter for the first instruction:

....
46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
....

But I got even more bored, and I will now skip to the first event before the instruction:

....
77000: Event_40: Timing CPU icache tick event scheduled @ 77000
77000: system.cpu A0 T0 : @asm_main_after_prologue : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
....

This event comes from `PacketQueue::processSendEvent` and schedules itself:

....
void
TimingSimpleCPU::TimingCPUPort::TickEvent::schedule(PacketPtr _pkt, Tick t)
{
    pkt = _pkt;
    cpu->schedule(this, t);
}
....

which polymorphically resolves to:

....
void
TimingSimpleCPU::IcachePort::ITickEvent::process()
{
    cpu->completeIfetch(pkt);
}
....

and so `TimingSimpleCPU::completeIfetch` is, at last, the interesting `TimingSimpleCPU` function!

The end of this instruction must be setting things up in a way that can continue the PC walk loop, and by looking at the source and traces, it clearly comes from `TimingSimpleCPU::advanceInst`, which calls `TimingSimpleCPU::fetch`, which is the very thing we did in this simulation!!! OMG, that's the loop.

One final thing to check is how the memory reads are going to make the processor stall in the middle of an instruction.

For that, we can GDB to the `TimingSimpleCPU::completeIfetch` of the first LDR done in our test program.

By doing that, we see that this time at:

....
if (curStaticInst && curStaticInst->isMemRef()) {
    // load or store: just send to dcache
    Fault fault = curStaticInst->initiateAcc(&t_info, traceData);

    if (_status == BaseSimpleCPU::Running) {
    }
} else if (curStaticInst) {
    // non-memory instruction: execute completely now
    Fault fault = curStaticInst->execute(&t_info, traceData);
....

* `curStaticInst->isMemRef()` is true, and there is no instruction `execute` call in that part of the branch, only for instructions that don't touch memory
* `_status` is `BaseSimpleCPU::Status::DcacheWaitResponse` and `advanceInst` is not yet called

So, where is the `execute` happening? Well, I'll satisfy myself with a quick source grep and guess:

* `curStaticInst->initiateAcc` sets up some memory request events
* which likely leads up to: `TimingSimpleCPU::completeDataAccess`, which off the bat ends in `advanceInst`.
+
It also calls `curStaticInst->completeAcc`, which pairs up with the `initiateAcc` call.

===== gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches

Let's just add `--caches` to see if things go any faster:

....
0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
**** REAL SIMULATION ****
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
0: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
0: Event_84: generic event scheduled @ 0
info: Entering event queue @ 0. Starting simulation...
0: Event_84: generic event rescheduled @ 18446744073709551615
0: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
1000: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 2000
1000: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 5000
1000: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
74250: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
74250: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 80000
77000: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 78000
78000: Event_40: Timing CPU icache tick event scheduled @ 78000
78000: system.cpu A0 T0 : @asm_main_after_prologue : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
78000: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 83000
83000: Event_40: Timing CPU icache tick event scheduled @ 83000
83000: system.cpu A0 T0 : @asm_main_after_prologue+4 : adr x1, #28 : IntAlu : D=0x0000000000400098 flags=(IsInteger)
[...]
191000: system.cpu A0 T0 : @asm_main_after_prologue+28 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
191000: Event_85: generic event scheduled @ 191000
....

So yes, `--caches` does work here, leading to a runtime of 191000 ticks rather than 469000 without caches!

===== gem5 event queue MinorCPU syscall emulation freestanding example analysis

The events <<gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis,for the Atomic CPU>> were pretty simple: basically just ticks.