TimingSimpleCPU: memory accesses are realistic, but the CPU has no pipeline. The simulation is faster than detailed models, but slower than AtomicSimpleCPU.
To fully understand TimingSimpleCPU, see: gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis.
Without caches, the CPU just stalls all the time, waiting for a memory request at every advance of the PC for the instruction fetch, and at every memory read done by an instruction!
Caches do make a difference here of course, and lead to much faster memory return times.
Let's study the first event. From GDB, let's break at the point that prints the trace messages, Trace::OstreamLogger::logMessage(), to see where events are being scheduled from.
And at last, we can guess without reading the code that Event_71 comes from the SE implementation of the exit syscall, so let's just confirm it from the trace.
TimingSimpleCPU should be the second simplest CPU to analyze, so let’s give it a try:
./run \
  --arch aarch64 \
  --emulator gem5 \
  --userland userland/arch/aarch64/freestanding/linux/hello.S \
  --trace Event,ExecAll \
  --trace-stdout \
  -- \
  --cpu-type TimingSimpleCPU \
;
As of LKMC 9bfbff244d713de40e5686bd370eadb20cf78c7b + 1 the log is now much more complex.
Here is an abridged version with:
the beginning up to the second instruction
the ending
because all that happens in between is exactly the same as the first two instructions and therefore boring:
0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
**** REAL SIMULATION ****
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
0: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
0: Event_74: generic event scheduled @ 0
info: Entering event queue @ 0. Starting simulation...
0: Event_74: generic event rescheduled @ 18446744073709551615
0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
0: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 5000
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
74250: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
74250: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
77000: Event_40: Timing CPU icache tick event scheduled @ 77000
77000: system.cpu A0 T0 : @asm_main_after_prologue : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
77000: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 78000
77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 95750
77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
95750: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 123750
123750: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 126000
123750: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 126000
126000: Event_40: Timing CPU icache tick event scheduled @ 126000
[...]
469000: system.cpu A0 T0 : @asm_main_after_prologue+28 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
469000: Event_75: generic event scheduled @ 469000
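Before going event by event, it is worth adding up the deltas visible in the trace to see where the first instruction's 77000 ticks go: the DRAM controller has the data at 46250, its response packet queue holds it until 74250, and the memory bus until 77000, which is exactly when the first movz commits. A minimal sketch of that arithmetic (the attribution of each delta to a component is my reading of the object names in the trace, not something the log states explicitly):

#include <cstdint>

int main() {
    // Timestamps copied from the trace above, in ticks (gem5 defaults to 1 tick = 1 ps).
    constexpr std::int64_t dram_access     = 46250;          //     0 ->  46250: system.mem_ctrls
    constexpr std::int64_t ctrl_resp_queue = 74250 - 46250;  // 46250 ->  74250: port-RespPacketQueue
    constexpr std::int64_t membus_resp     = 77000 - 74250;  // 74250 ->  77000: membus respLayer1
    static_assert(dram_access + ctrl_resp_queue + membus_resp == 77000,
                  "the first instruction can only execute at tick 77000");
    return 0;
}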
0: system.cpu.wrapped_function_event schedules the initial event, much like for AtomicSimpleCPU. This time however, it is not a tick, but rather a fetch event that gets scheduled:
void
TimingSimpleCPU::activateContext(ThreadID thread_num)
{
    DPRINTF(SimpleCPU, "ActivateContext %d\n", thread_num);

    assert(thread_num < numThreads);

    threadInfo[thread_num]->notIdleFraction = 1;
    if (_status == BaseSimpleCPU::Idle)
        _status = BaseSimpleCPU::Running;

    // kick things off by initiating the fetch of the next instruction
    if (!fetchEvent.scheduled())
        schedule(fetchEvent, clockEdge(Cycles(0)));
We have a fetch instead of a tick here compared to AtomicSimpleCPU, because in the timing CPU we must first get the instruction opcode from DRAM, which takes some cycles to return!
By looking at the source, we see that fetchEvent runs TimingSimpleCPU::fetch.
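To get a feel for how such an event is wired up, here is a self-contained toy sketch of the pattern. This is not gem5 code: the real EventFunctionWrapper and EventQueue are much more involved, but the idea of wrapping a member function call in a lambda and handing it to a tick-ordered queue is the same one that shows up in the GDB backtrace further down:

#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Toy event queue: an event is just a callback ordered by the tick it is scheduled for.
struct Event { unsigned long long when; std::function<void()> run; };
struct Later { bool operator()(const Event &a, const Event &b) const { return a.when > b.when; } };
std::priority_queue<Event, std::vector<Event>, Later> eventq;

struct ToyTimingCPU {
    // Roughly how the real constructor builds fetchEvent: a lambda that calls fetch().
    std::function<void()> fetchEvent = [this] { fetch(); };
    // Like activateContext: schedule(fetchEvent, clockEdge(Cycles(0))).
    void activateContext() { eventq.push(Event{0, fetchEvent}); }
    void fetch() { std::printf("fetch: send the ifetch request to memory\n"); }
};

int main() {
    ToyTimingCPU cpu;
    cpu.activateContext();
    // Like doSimLoop / EventQueue::serviceOne: run events in tick order.
    while (!eventq.empty()) { Event e = eventq.top(); eventq.pop(); e.run(); }
}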
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250: from GDB we see that it comes from DRAMCtrl::startup in mem/dram_ctrl.cc which contains:
void
DRAMCtrl::startup()
{
    // remember the memory system mode of operation
    isTimingMode = system()->isTimingMode();

    if (isTimingMode) {
        // timestamp offset should be in clock cycles for DRAMPower
        timeStampOffset = divCeil(curTick(), tCK);

        // update the start tick for the precharge accounting to the
        // current tick
        for (auto r : ranks) {
            r->startup(curTick() + tREFI - tRP);
        }

        // shift the bus busy time sufficiently far ahead that we never
        // have to worry about negative values when computing the time for
        // the next request, this will add an insignificant bubble at the
        // start of simulation
        nextBurstAt = curTick() + tRP + tRCD;
    }
}
which then calls:

void
DRAMCtrl::Rank::startup(Tick ref_tick)
{
    assert(ref_tick > curTick());

    pwrStateTick = curTick();

    // kick off the refresh, and give ourselves enough time to
    // precharge
    schedule(refreshEvent, ref_tick);
}
By looking up some variable definitions in the source, we now see some memory parameters clearly:
ranks: std::vector<DRAMCtrl::Rank*> with 2 elements. TODO why do we have 2? What does it represent? Likely linked to config.ini at system.mem_ctrls.ranks_per_channel=2.
tCK=1250, tREFI=7800000, tRP=13750, tRCD=13750: all defined in a single code location with a comment:
/**
 * Basic memory timing parameters initialized based on parameter
 * values.
 */
Their values can be seen under config.ini and they are documented in src/mem/DRAMCtrl.py e.g.:
    # the base clock period of the DRAM
    tCK = Param.Latency("Clock period")

    # minimum time between a precharge and subsequent activate
    tRP = Param.Latency("Row precharge time")

    # the amount of time in nanoseconds from issuing an activate command
    # to the data being available in the row buffer for a read/write
    tRCD = Param.Latency("RAS to CAS delay")

    # refresh command interval, how often a "ref" command needs
    # to be sent. It is 7.8 us for a 64ms refresh requirement
    tREFI = Param.Latency("Refresh command interval")
So we realize that we are going into deep DRAM modelling, in more detail than a mere mortal should ever need to know.
curTick() + tREFI - tRP = 0 + 7800000 - 13750 = 7786250, which is when that refreshEvent was scheduled. Our simulation ends way before that point however, so we will never know what it did, thank God.
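A quick self-contained check of that arithmetic, with the parameter values taken from config.ini; the microsecond interpretation assumes gem5's default tick resolution of 1 ps, which makes tREFI = 7800000 ticks the 7.8 us mentioned in the DRAMCtrl.py comment:

#include <cstdint>

int main() {
    // Values from config.ini, in ticks (assumed to be picoseconds).
    constexpr std::int64_t tREFI    = 7800000;  // 7.8 us refresh command interval
    constexpr std::int64_t tRP      = 13750;
    constexpr std::int64_t cur_tick = 0;        // curTick() is still 0 when startup() runs
    // ref_tick passed to DRAMCtrl::Rank::startup():
    static_assert(cur_tick + tREFI - tRP == 7786250,
                  "refreshEvent lands at tick 7786250, long after our simulation ends");
    return 0;
}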
0: Event_74: generic event scheduled @ 0 and 0: Event_74: generic event rescheduled @ 18446744073709551615: schedule the final exit event, same as for AtomicSimpleCPU
The next interesting event is:
system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
which comes from:
#0  Trace::OstreamLogger::logMessage
#1  void Trace::Logger::dprintf
#2  Event::trace
#3  EventQueue::schedule
#4  EventManager::schedule
#5  DRAMCtrl::addToReadQueue
#6  DRAMCtrl::recvTimingReq
#7  DRAMCtrl::MemoryPort::recvTimingReq
#8  TimingRequestProtocol::sendReq
#9  MasterPort::sendTimingReq
#10 CoherentXBar::recvTimingReq
#11 CoherentXBar::CoherentXBarSlavePort::recvTimingReq(Packet*)
#12 TimingRequestProtocol::sendReq
#13 MasterPort::sendTimingReq
#14 TimingSimpleCPU::sendFetch
#15 TimingSimpleCPU::FetchTranslation::finish
#16 ArmISA::TLB::translateComplete
#17 ArmISA::TLB::translateTiming
#18 ArmISA::TLB::translateTiming
#19 TimingSimpleCPU::fetch
#20 TimingSimpleCPU::<lambda()>::operator()(void)
#21 std::_Function_handler<void(), TimingSimpleCPU::TimingSimpleCPU(TimingSimpleCPUParams*)::<lambda()> >
#22 std::function<void ()>::operator()() const
#23 EventFunctionWrapper::process
#24 EventQueue::serviceOne
#25 doSimLoop
#26 simulate
From the backtrace, we see that we are already running from the event queue. Therefore, we must have been running a previously scheduled event, and from the previous event logs, the only such event is 0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0, which scheduled a memory fetch!
From the backtrace we see the tortuous path that the data request takes (sketched in toy form right after this list), going through:
ArmISA::TLB
CoherentXBar
DRAMCtrl
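Here is a toy, self-contained sketch of that layering, with simplified stand-ins rather than gem5's real classes or signatures: the CPU's instruction port hands the request to the crossbar, which forwards it to the DRAM controller, which finally queues it and schedules an event, producing exactly this kind of deeply nested backtrace:

#include <cstdio>

// Simplified stand-ins for the layers seen in the backtrace; not gem5's real classes.
struct ToyDRAMCtrl {
    void recvTimingReq() {
        // like DRAMCtrl::addToReadQueue: queue the read and schedule nextReqEvent
        std::printf("DRAM: request queued, nextReqEvent scheduled @ curTick\n");
    }
};
struct ToyCoherentXBar {
    ToyDRAMCtrl *mem;
    void recvTimingReq() { mem->recvTimingReq(); }  // forward over the memory-side port
};
struct ToyTimingCPU {
    ToyCoherentXBar *membus;
    void fetch() { membus->recvTimingReq(); }  // after ITLB translation, send the ifetch packet
};

int main() {
    ToyDRAMCtrl dram;
    ToyCoherentXBar xbar{&dram};
    ToyTimingCPU cpu{&xbar};
    cpu.fetch();  // nests through all three layers, like frames #5 to #19 above
}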
The scheduling happens at frame #5:
    // If we are not already scheduled to get a request out of the
    // queue, do so now
    if (!nextReqEvent.scheduled()) {
        DPRINTF(DRAM, "Request scheduled immediately\n");
        schedule(nextReqEvent, curTick());
    }
and from a quick source grep we see that nextReqEvent runs DRAMCtrl::processNextReqEvent.
The next schedule:
0: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
and it is a BaseXBar::Layer::releaseLayer event.
This one also comes from the request made at TimingSimpleCPU::fetch. We therefore deduce that the single previous fetch event scheduled not one, but two events!
Now:
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
comes from the previously scheduled DRAMCtrl::processNextReqEvent and schedules DRAMCtrl::Rank::processPrechargeEvent.
Now:
0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
also runs from DRAMCtrl::processNextReqEvent and schedules a DRAMCtrl::processRespondEvent.
I’m getting bored, let’s skip to the line that appears to matter for the first instruction:
46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
But I got even more bored, and I will now skip to the first event before the instruction:
77000: Event_40: Timing CPU icache tick event scheduled @ 77000
77000: system.cpu A0 T0 : @asm_main_after_prologue : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
This event comes from PacketQueue::processSendEvent and gets scheduled through:
void
TimingSimpleCPU::TimingCPUPort::TickEvent::schedule(PacketPtr _pkt, Tick t)
{
    pkt = _pkt;
    cpu->schedule(this, t);
}
whose process() polymorphically resolves to:
void
TimingSimpleCPU::IcachePort::ITickEvent::process()
{
    cpu->completeIfetch(pkt);
}
and so TimingSimpleCPU::completeIfetch is, at last, the interesting TimingSimpleCPU function!
The end of this instruction must be setting things up so that the PC walk loop can continue, and by looking at the source and traces, it clearly happens in TimingSimpleCPU::advanceInst, which calls TimingSimpleCPU::fetch, the very thing that kicked off this simulation!!! OMG, that's the loop.
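In toy form, the loop looks like this. This is a self-contained sketch, with the whole memory round trip collapsed into a direct call; in gem5 each step is a separate event some ticks in the future:

#include <cstdio>

struct ToyTimingCPU {
    int pc = 0, numInsts = 3;
    void fetch()          { std::printf("fetch   pc=%d\n", pc); completeIfetch(); }
    void completeIfetch() { std::printf("execute pc=%d\n", pc); advanceInst(); }
    void advanceInst()    { if (++pc < numInsts) fetch(); }  // the loop closes here
};

int main() {
    ToyTimingCPU cpu;
    cpu.fetch();  // like the initial fetchEvent from activateContext
}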
One final thing to check is how the memory reads are going to make the processor stall in the middle of an instruction.
For that, we can break in GDB at the TimingSimpleCPU::completeIfetch of the first LDR done in our test program.
By doing that, we see that this time at:
    if (curStaticInst && curStaticInst->isMemRef()) {
        // load or store: just send to dcache
        Fault fault = curStaticInst->initiateAcc(&t_info, traceData);

        if (_status == BaseSimpleCPU::Running) {
        }
    } else if (curStaticInst) {
        // non-memory instruction: execute completely now
        Fault fault = curStaticInst->execute(&t_info, traceData);
curStaticInst->isMemRef() is true, and there is no instruction execute call in that part of the branch, only for instructions that don't touch memory
_status is BaseSimpleCPU::Status::DcacheWaitResponse and advanceInst is not yet called
So, where is the execute happening? Well, I’ll satisfy myself with a quick source grep and guess:
curStaticInst->initiateAcc sets up some memory request events
which likely leads up to TimingSimpleCPU::completeDataAccess, which ends up calling advanceInst.
It also calls curStaticInst->completeAcc, which pairs up with the initiateAcc call.
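Putting the guess together, here is a toy, self-contained sketch of what a load looks like under this split scheme. The names mirror the gem5 functions discussed above, but the mechanics are heavily simplified and only my reading of the flow:

#include <cstdio>
#include <functional>
#include <vector>

// Heavily simplified: the real versions carry the packet, fault and trace plumbing.
std::vector<std::function<void()>> pendingResponses;  // stand-in for future dcache events
enum Status { Running, DcacheWaitResponse };
Status status = Running;

void advanceInst()        { std::printf("advanceInst: schedule the next fetch\n"); }
void completeDataAccess() {
    std::printf("completeDataAccess: completeAcc writes the loaded value back\n");
    status = Running;
    advanceInst();
}
void initiateAcc() {
    std::printf("initiateAcc: send the LDR's read request to the dcache port\n");
    status = DcacheWaitResponse;                 // the CPU now stalls mid-instruction
    pendingResponses.push_back(completeDataAccess);
}

int main() {
    initiateAcc();                               // called from completeIfetch for mem refs
    if (status == DcacheWaitResponse)
        std::printf("...ticks pass while DRAM services the read...\n");
    for (auto &resp : pendingResponses) resp();  // the response event finally fires
}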
Let’s just add --caches to see if things go any faster:
0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
**** REAL SIMULATION ****
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
0: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
0: Event_84: generic event scheduled @ 0
info: Entering event queue @ 0. Starting simulation...
0: Event_84: generic event rescheduled @ 18446744073709551615
0: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
1000: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 2000
1000: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 5000
1000: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
74250: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
74250: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 80000
77000: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 78000
78000: Event_40: Timing CPU icache tick event scheduled @ 78000
78000: system.cpu A0 T0 : @asm_main_after_prologue : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
78000: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 83000
83000: Event_40: Timing CPU icache tick event scheduled @ 83000
83000: system.cpu A0 T0 : @asm_main_after_prologue+4 : adr x1, #28 : IntAlu : D=0x0000000000400098 flags=(IsInteger)
[...]
191000: system.cpu A0 T0 : @asm_main_after_prologue+28 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
191000: Event_85: generic event scheduled @ 191000
So yes, --caches does work here, leading to a runtime of 191000 ticks rather than 469000 without caches!
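The numbers from the two traces also give a feel for where the time goes: roughly a 2.5x overall speedup for this tiny program, and the gap between the first two executed instructions drops from 49000 ticks (a full DRAM round trip per fetch) to 5000 ticks (presumably an icache hit on the line brought in by the first fetch):

#include <cstdio>

int main() {
    // Ticks of the final svc instruction in each trace above.
    std::printf("speedup: %.2fx\n", 469000.0 / 191000.0);         // ~2.46x
    // Gap between the first two ExecAll lines in each trace.
    std::printf("no caches:   %d ticks\n", 126000 - 77000);       // 49000
    std::printf("with caches: %d ticks\n", 83000 - 78000);        // 5000
    return 0;
}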
The events for the Atomic CPU were pretty simple: basically just ticks.
http://www.cs.virginia.edu/stream/ref.html STREAM memory bandwidth benchmarks.
https://github.com/kozyraki/stamp transactional memory benchmarks.