diff --git a/README.adoc b/README.adoc
index 5e19bec..a539e56 100644
--- a/README.adoc
+++ b/README.adoc
@@ -12185,7 +12185,13 @@ Implementations:
 * `AtomicSimpleCPU`: the default one. Memory accesses happen instantaneously. The fastest simulation except for KVM, but not realistic at all.
 +
 Useful to <>.
-* `TimingSimpleCPU`: memory accesses are realistic, but the CPU has no pipeline. The simulation is faster than detailed models, but slower than `AtomicSimpleCPU`. TODO: application?
+* `TimingSimpleCPU`: memory accesses are realistic, but the CPU has no pipeline. The simulation is faster than detailed models, but slower than `AtomicSimpleCPU`.
++
+To fully understand `TimingSimpleCPU`, see: <>.
++
+Without caches, the CPU just stalls all the time, waiting for memory for every advance of the PC and for every memory read made by an instruction!
++
+Caches do make a difference here of course, and lead to much faster memory return times.

 <> are an alternative way of fast forwarding boot when they work.

@@ -12570,7 +12576,7 @@ if (_status != Idle)
 +
 A new event is scheduled for the current time itself. TODO what is this event?

-Let's study the first event. From GDB, it's stack trace is:
+Let's study the first event. From <>, let's break at the point that prints the trace messages, `Trace::OstreamLogger::logMessage()`, to see where events are being scheduled from:

 ....
 Trace::OstreamLogger::logMessage() at trace.cc:149 0x5555593b3b1e
@@ -12658,6 +12664,351 @@ new GlobalSimLoopExitEvent(when + simQuantum, message, exit_code, repeat);

 Tested at b4879ae5b0b6644e6836b0881e4da05c64a6550d.
+
+===== gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis
+
+<> should be the second simplest CPU to analyze, so let's give it a try:
+
+....
+./run \
+    --arch aarch64 \
+    --emulator gem5 \
+    --userland userland/arch/aarch64/freestanding/linux/hello.S \
+    --trace Event,ExecAll \
+    --trace-stdout \
+    -- \
+    --cpu-type TimingSimpleCPU \
+;
+....
+
+As of LKMC 9bfbff244d713de40e5686bd370eadb20cf78c7b + 1, the log is now much more complex.
+
+Here is an abridged version with:
+
+* the beginning, up to the second instruction
+* the ending
+
+because everything that happens in between is exactly the same as for the first two instructions, and therefore boring:
+
+....
+      0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
+**** REAL SIMULATION ****
+      0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
+      0: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
+      0: Event_74: generic event scheduled @ 0
+info: Entering event queue @ 0. Starting simulation...
+      0: Event_74: generic event rescheduled @ 18446744073709551615
+      0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
+      0: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
+      0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
+      0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
+      0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 5000
+      0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
+  46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
+  74250: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
+  74250: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
+  77000: Event_40: Timing CPU icache tick event scheduled @ 77000
+  77000: system.cpu A0 T0 : @asm_main_after_prologue : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
+  77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
+  77000: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 78000
+  77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 95750
+  77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
+  95750: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 123750
+ 123750: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 126000
+ 123750: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 126000
+ 126000: Event_40: Timing CPU icache tick event scheduled @ 126000
+  [...]
+ 469000: system.cpu A0 T0 : @asm_main_after_prologue+28 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
+ 469000: Event_75: generic event scheduled @ 469000
+....
+
+`0: system.cpu.wrapped_function_event` schedules the initial CPU event, much like for `AtomicSimpleCPU`. This time however, it is not a tick, but rather a fetch event that gets scheduled:
+
+....
+void
+TimingSimpleCPU::activateContext(ThreadID thread_num)
+{
+    DPRINTF(SimpleCPU, "ActivateContext %d\n", thread_num);
+
+    assert(thread_num < numThreads);
+
+    threadInfo[thread_num]->notIdleFraction = 1;
+    if (_status == BaseSimpleCPU::Idle)
+        _status = BaseSimpleCPU::Running;
+
+    // kick things off by initiating the fetch of the next instruction
+    if (!fetchEvent.scheduled())
+        schedule(fetchEvent, clockEdge(Cycles(0)));
+....
+
+We have a fetch instead of a tick here compared to `AtomicSimpleCPU`, because in the timing CPU we must first get the instruction opcode from DRAM, which takes some cycles to return!
+
+By looking at the source, we see that `fetchEvent` runs `TimingSimpleCPU::fetch`.
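+
+As an aside, to understand what `clockEdge(Cycles(0))` evaluates to, here is a minimal standalone sketch of the arithmetic it performs: round the current tick up to the next clock edge, or stay put if we are already on one. This is only an illustration of the concept, not gem5's actual implementation:
+
+....
+#include <cassert>
+#include <cstdint>
+#include <iostream>
+
+using Tick = std::uint64_t;
+
+// Illustration only, not gem5's Clocked::clockEdge(): round curTick up to
+// the next multiple of the clock period, then advance by whole cycles.
+Tick clockEdge(Tick curTick, Tick period, unsigned cycles = 0) {
+    assert(period > 0);
+    Tick next = ((curTick + period - 1) / period) * period;
+    return next + cycles * period;
+}
+
+int main() {
+    // With gem5's default 1 tick = 1 ps, a 2 GHz clock has a 500 tick period.
+    std::cout << clockEdge(0, 500) << '\n';    // 0: already on an edge
+    std::cout << clockEdge(1, 500) << '\n';    // 500: next edge after tick 1
+    std::cout << clockEdge(1, 500, 2) << '\n'; // 1500: two cycles later
+}
+....
+
+At tick 0 the CPU is already on a clock edge, which is why the fetch event lands at tick 0 in the log above.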
+
+`0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250`: from GDB, we see that it comes from `DRAMCtrl::startup` in `mem/dram_ctrl.cc`, which contains:
+
+....
+void
+DRAMCtrl::startup()
+{
+    // remember the memory system mode of operation
+    isTimingMode = system()->isTimingMode();
+
+    if (isTimingMode) {
+        // timestamp offset should be in clock cycles for DRAMPower
+        timeStampOffset = divCeil(curTick(), tCK);
+
+        // update the start tick for the precharge accounting to the
+        // current tick
+        for (auto r : ranks) {
+            r->startup(curTick() + tREFI - tRP);
+        }
+
+        // shift the bus busy time sufficiently far ahead that we never
+        // have to worry about negative values when computing the time for
+        // the next request, this will add an insignificant bubble at the
+        // start of simulation
+        nextBurstAt = curTick() + tRP + tRCD;
+    }
+}
+....
+
+which then calls:
+
+....
+void
+DRAMCtrl::Rank::startup(Tick ref_tick)
+{
+    assert(ref_tick > curTick());
+
+    pwrStateTick = curTick();
+
+    // kick off the refresh, and give ourselves enough time to
+    // precharge
+    schedule(refreshEvent, ref_tick);
+}
+....
+
+By looking up some variable definitions in the source, we now see some memory parameters clearly:
+
+* `ranks`: `std::vector` with 2 elements. TODO why do we have 2? What does it represent? Likely linked to <> at `system.mem_ctrls.ranks_per_channel=2`
+* `tCK=1250`, `tREFI=7800000`, `tRP=13750`, `tRCD=13750`: all defined in a single code location with a comment:
++
+....
+    /**
+     * Basic memory timing parameters initialized based on parameter
+     * values.
+     */
+....
++
+Their values can be seen under `config.ini`, and they are documented in `src/mem/DRAMCtrl.py`, e.g.:
++
+....
+    # the base clock period of the DRAM
+    tCK = Param.Latency("Clock period")
+
+    # minimum time between a precharge and subsequent activate
+    tRP = Param.Latency("Row precharge time")
+
+    # the amount of time in nanoseconds from issuing an activate command
+    # to the data being available in the row buffer for a read/write
+    tRCD = Param.Latency("RAS to CAS delay")
+
+    # refresh command interval, how often a "ref" command needs
+    # to be sent. It is 7.8 us for a 64ms refresh requirement
+    tREFI = Param.Latency("Refresh command interval")
+....
+
+So we realize that we are going into deep DRAM modelling, in more detail than a mere mortal should ever need to know.
+
+`curTick() + tREFI - tRP = 0 + 7800000 - 13750 = 7786250`, which is when that `refreshEvent` was scheduled. Our simulation ends way before that point however, so we will never know what it did, thank God.
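+
+As a sanity check, we can redo that arithmetic and convert the parameters to human time units. This assumes gem5's default resolution of 1 tick = 1 ps:
+
+....
+#include <cstdint>
+#include <iostream>
+
+int main() {
+    std::uint64_t tREFI = 7800000; // refresh command interval, in ticks
+    std::uint64_t tRP   = 13750;   // row precharge time, in ticks
+    // First refreshEvent tick, as computed by DRAMCtrl::startup:
+    std::cout << 0 + tREFI - tRP << '\n';      // 7786250, as seen in the log
+    // tREFI in microseconds, matching the "7.8 us" comment above,
+    // assuming 1 tick = 1 ps:
+    std::cout << tREFI / 1e6 << " us" << '\n'; // 7.8 us
+}
+....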
+
+`0: Event_74: generic event scheduled @ 0` and `0: Event_74: generic event rescheduled @ 18446744073709551615`: these schedule the final exit event, same as for `AtomicSimpleCPU`.
+
+The next interesting event is:
+
+....
+0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
+....
+
+which comes from:
+
+....
+#0  Trace::OstreamLogger::logMessage
+#1  void Trace::Logger::dprintf
+#2  Event::trace
+#3  EventQueue::schedule
+#4  EventManager::schedule
+#5  DRAMCtrl::addToReadQueue
+#6  DRAMCtrl::recvTimingReq
+#7  DRAMCtrl::MemoryPort::recvTimingReq
+#8  TimingRequestProtocol::sendReq
+#9  MasterPort::sendTimingReq
+#10 CoherentXBar::recvTimingReq
+#11 CoherentXBar::CoherentXBarSlavePort::recvTimingReq(Packet*)
+#12 TimingRequestProtocol::sendReq
+#13 MasterPort::sendTimingReq
+#14 TimingSimpleCPU::sendFetch
+#15 TimingSimpleCPU::FetchTranslation::finish
+#16 ArmISA::TLB::translateComplete
+#17 ArmISA::TLB::translateTiming
+#18 ArmISA::TLB::translateTiming
+#19 TimingSimpleCPU::fetch
+#20 TimingSimpleCPU::<lambda()>::operator()(void)
+#21 std::_Function_handler<...>
+#22 std::function<...>::operator()() const
+#23 EventFunctionWrapper::process
+#24 EventQueue::serviceOne
+#25 doSimLoop
+#26 simulate
+....
+
+From the trace, we see that we are already running from the event queue. Therefore, we must have been running a previously scheduled event, and from the previous event logs, the only such event is `0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0`, which scheduled the memory fetch!
+
+From the backtrace, we see the tortuous path that the data request takes, going through:
+
+* `ArmISA::TLB`
+* `CoherentXBar`
+* `DRAMCtrl`
+
+The scheduling happens at frame `#5`:
+
+....
+    // If we are not already scheduled to get a request out of the
+    // queue, do so now
+    if (!nextReqEvent.scheduled()) {
+        DPRINTF(DRAM, "Request scheduled immediately\n");
+        schedule(nextReqEvent, curTick());
+    }
+....
+
+and from a quick source grep we see that `nextReqEvent` runs `DRAMCtrl::processNextReqEvent`.
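+
+This is the general shape of the entire simulation: each event handler does a bit of work and schedules further events, producing chains like the ones in the log. To make that pattern concrete, here is a minimal standalone event queue in the style of gem5's `EventQueue` and `doSimLoop`; a toy illustration only, not gem5's actual code:
+
+....
+#include <cstdint>
+#include <functional>
+#include <iostream>
+#include <queue>
+#include <vector>
+
+using Tick = std::uint64_t;
+
+struct Event {
+    Tick when;
+    std::function<void()> process;
+    bool operator>(const Event &other) const { return when > other.when; }
+};
+
+struct EventQueue {
+    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> events;
+    Tick curTick = 0;
+
+    void schedule(Tick when, std::function<void()> cb) {
+        std::cout << curTick << ": event scheduled @ " << when << '\n';
+        events.push({when, std::move(cb)});
+    }
+
+    // Like doSimLoop: repeatedly service the earliest scheduled event.
+    void run() {
+        while (!events.empty()) {
+            Event e = events.top();
+            events.pop();
+            curTick = e.when;
+            e.process();
+        }
+    }
+};
+
+int main() {
+    EventQueue eq;
+    // A "request" event that schedules its own "response" later on, much
+    // like processNextReqEvent scheduling processRespondEvent above.
+    eq.schedule(0, [&eq] {
+        eq.schedule(eq.curTick + 46250, [] { std::cout << "response\n"; });
+    });
+    eq.run();
+}
+....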
+
+The next schedule:
+
+....
+0: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
+....
+
+is a `BaseXBar::Layer::releaseLayer` event.
+
+This one is also scheduled from the request path started at `TimingSimpleCPU::fetch`. We deduce therefore that the single previous fetch event scheduled not one, but two events!
+
+Now:
+
+....
+0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
+....
+
+comes from the previously scheduled `DRAMCtrl::processNextReqEvent` and schedules `DRAMCtrl::Rank::processPrechargeEvent`.
+
+Now:
+
+....
+0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
+....
+
+also runs from `DRAMCtrl::processNextReqEvent` and schedules a `DRAMCtrl::processRespondEvent`.
+
+I'm getting bored, so let's skip to the line that appears to matter for the first instruction:
+
+....
+  46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
+....
+
+But I got even more bored, and I will now skip to the first event before the instruction:
+
+....
+  77000: Event_40: Timing CPU icache tick event scheduled @ 77000
+  77000: system.cpu A0 T0 : @asm_main_after_prologue : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
+....
+
+This event is scheduled from `PacketQueue::processSendEvent` through:
+
+....
+void
+TimingSimpleCPU::TimingCPUPort::TickEvent::schedule(PacketPtr _pkt, Tick t)
+{
+    pkt = _pkt;
+    cpu->schedule(this, t);
+}
+....
+
+and when the event fires, its `process` method polymorphically resolves to:
+
+....
+void
+TimingSimpleCPU::IcachePort::ITickEvent::process()
+{
+    cpu->completeIfetch(pkt);
+}
+....
+
+and so `TimingSimpleCPU::completeIfetch` is, at last, the interesting `TimingSimpleCPU` function!
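+
+The pattern here is worth spelling out: the event object stores the packet, and when the event queue fires it, the virtual `process` method hands the packet back to a method of the owning CPU. Here is a toy standalone sketch of that pattern, with made-up types, purely for illustration:
+
+....
+#include <iostream>
+#include <string>
+
+struct Packet { std::string inst; };
+
+struct CPU {
+    void completeIfetch(Packet *pkt) {
+        std::cout << "ifetch complete: " << pkt->inst << '\n';
+    }
+};
+
+struct TickEvent {
+    CPU *cpu;
+    Packet *pkt = nullptr;
+    explicit TickEvent(CPU *c) : cpu(c) {}
+    virtual ~TickEvent() = default;
+    // The real schedule() also enqueues the event at a tick; here we
+    // only keep the packet around for process() to use later.
+    void schedule(Packet *p) { pkt = p; }
+    virtual void process() = 0; // called by the event queue when due
+};
+
+// Like ITickEvent: process() forwards the stored packet to the CPU.
+struct ITickEvent : TickEvent {
+    using TickEvent::TickEvent;
+    void process() override { cpu->completeIfetch(pkt); }
+};
+
+int main() {
+    CPU cpu;
+    ITickEvent ev(&cpu);
+    Packet pkt{"movz x0, #1, #0"};
+    ev.schedule(&pkt); // stand-in for TickEvent::schedule(pkt, tick)
+    ev.process();      // stand-in for the event queue firing the event
+}
+....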
+
+The end of this instruction must be setting things up in a way that can continue the PC walk loop, and by looking at the source and traces, that clearly happens in `TimingSimpleCPU::advanceInst`, which calls `TimingSimpleCPU::fetch`, which is the very thing we did in this simulation!!! OMG, that's the loop.
+
+One final thing to check is how the memory reads are going to make the processor stall in the middle of an instruction.
+
+For that, we can use GDB to break at the `TimingSimpleCPU::completeIfetch` of the first LDR done in our test program.
+
+By doing that, we see that this time, at:
+
+....
+if (curStaticInst && curStaticInst->isMemRef()) {
+    // load or store: just send to dcache
+    Fault fault = curStaticInst->initiateAcc(&t_info, traceData);
+
+    if (_status == BaseSimpleCPU::Running) {
+        // ... (abridged)
+    }
+} else if (curStaticInst) {
+    // non-memory instruction: execute completely now
+    Fault fault = curStaticInst->execute(&t_info, traceData);
+....
+
+* `curStaticInst->isMemRef()` is true, and there is no instruction `execute` call in that branch: that only happens for instructions that don't touch memory
+* `_status` is `BaseSimpleCPU::Status::DcacheWaitResponse` and `advanceInst` is not yet called
+
+So, where is the `execute` happening? Well, I'll satisfy myself with a quick source grep and guess:
+
+* `curStaticInst->initiateAcc` sets up some memory request events
+* which likely lead up to `TimingSimpleCPU::completeDataAccess`, which right off the bat ends in `advanceInst`.
++
+It also calls `curStaticInst->completeAcc`, which pairs up with the `initiateAcc` call.
+
+===== gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches
+
+Let's just add `--caches` to see if things go any faster:
+
+....
+      0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
+**** REAL SIMULATION ****
+      0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
+      0: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
+      0: Event_84: generic event scheduled @ 0
+info: Entering event queue @ 0. Starting simulation...
+      0: Event_84: generic event rescheduled @ 18446744073709551615
+      0: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
+   1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
+   1000: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 2000
+   1000: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
+   1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
+   1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 5000
+   1000: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
+  46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
+  74250: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
+  74250: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 80000
+  77000: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 78000
+  78000: Event_40: Timing CPU icache tick event scheduled @ 78000
+  78000: system.cpu A0 T0 : @asm_main_after_prologue : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
+  78000: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 83000
+  83000: Event_40: Timing CPU icache tick event scheduled @ 83000
+  83000: system.cpu A0 T0 : @asm_main_after_prologue+4 : adr x1, #28 : IntAlu : D=0x0000000000400098 flags=(IsInteger)
+  [...]
+ 191000: system.cpu A0 T0 : @asm_main_after_prologue+28 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
+ 191000: Event_85: generic event scheduled @ 191000
+....
+
+So yes, `--caches` does work here, leading to a simulated runtime of 191000 ticks rather than 469000 without caches!
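+
+We can also read per-instruction fetch latencies straight out of the two logs: without caches every fetch pays a full DRAM round trip, while with caches only the first fetch misses. A quick back-of-the-envelope check of the numbers quoted above:
+
+....
+#include <iostream>
+
+int main() {
+    // Gap between the first two instructions, from the logs above:
+    std::cout << 126000 - 77000 << '\n'; // 49000 ticks per fetch, no caches
+    std::cout << 83000 - 78000 << '\n';  // 5000 ticks per fetch, icache hit
+    // Overall speedup of the whole run:
+    std::cout << 469000.0 / 191000.0 << '\n'; // ~2.5x
+}
+....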
+
 ===== gem5 event queue MinorCPU syscall emulation freestanding example analysis

 The events <> were pretty simple: basically just ticks.