gem5: follow most of TimingSimpleCPU

Ciro Santilli 六四事件 法轮功
2019-11-22 00:00:00 +00:00
parent 9bfbff244d
commit b9b38ed579


@@ -12185,7 +12185,13 @@ Implementations:
* `AtomicSimpleCPU`: the default one. Memory accesses happen instantaneously. The fastest simulation except for KVM, but not realistic at all.
+
Useful to <<gem5-restore-checkpoint-with-a-different-cpu,boot Linux fast and then checkpoint and switch to a more detailed CPU>>.
* `TimingSimpleCPU`: memory accesses are realistic, but the CPU has no pipeline. The simulation is faster than detailed models, but slower than `AtomicSimpleCPU`.
+
To fully understand `TimingSimpleCPU`, see: <<gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis>>.
+
Without caches, the CPU just stalls all the time, waiting on memory requests for every advance of the PC and for every memory read made by an instruction!
+
Caches do make a difference here of course, and lead to much faster memory return times.
<<gem5-kvm,KVM CPUs>> are an alternative way of fast forwarding boot when they work.
@@ -12570,7 +12576,7 @@ if (_status != Idle)
+
A new event is scheduled for the current time itself. TODO what is this event?
Let's study the first event. From <<debug-the-emulator,GDB>>, let's break at the point that prints messages: `Trace::OstreamLogger::logMessage()` to see where events are being scheduled from:
....
Trace::OstreamLogger::logMessage() at trace.cc:149 0x5555593b3b1e
@@ -12658,6 +12664,351 @@ new GlobalSimLoopExitEvent(when + simQuantum, message, exit_code, repeat);
Tested at b4879ae5b0b6644e6836b0881e4da05c64a6550d.
===== gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis
<<gem5-basesimplecpu,TimingSimpleCPU>> should be the second simplest CPU to analyze, so let's give it a try:
....
./run \
--arch aarch64 \
--emulator gem5 \
--userland userland/arch/aarch64/freestanding/linux/hello.S \
--trace Event,ExecAll \
--trace-stdout \
-- \
--cpu-type TimingSimpleCPU \
;
....
As of LKMC 9bfbff244d713de40e5686bd370eadb20cf78c7b + 1, the log is now much more complex.
Here is an abridged version with:
* the beginning up to the second instruction
* the ending
because all that happens in between is exactly the same as the first two instructions and therefore boring:
....
0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
**** REAL SIMULATION ****
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
0: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
0: Event_74: generic event scheduled @ 0
info: Entering event queue @ 0. Starting simulation...
0: Event_74: generic event rescheduled @ 18446744073709551615
0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
0: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 5000
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
74250: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
74250: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
77000: Event_40: Timing CPU icache tick event scheduled @ 77000
77000: system.cpu A0 T0 : @asm_main_after_prologue : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
77000: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 78000
77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 95750
77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
95750: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 123750
123750: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 126000
123750: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 126000
126000: Event_40: Timing CPU icache tick event scheduled @ 126000
[...]
469000: system.cpu A0 T0 : @asm_main_after_prologue+28 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
469000: Event_75: generic event scheduled @ 469000
....
`0: system.cpu.wrapped_function_event` schedules the initial event, much like for `AtomicSimpleCPU`. This time however, it is not a tick, but rather a fetch event that gets scheduled:
....
void
TimingSimpleCPU::activateContext(ThreadID thread_num)
{
    DPRINTF(SimpleCPU, "ActivateContext %d\n", thread_num);
    assert(thread_num < numThreads);
    threadInfo[thread_num]->notIdleFraction = 1;
    if (_status == BaseSimpleCPU::Idle)
        _status = BaseSimpleCPU::Running;
    // kick things off by initiating the fetch of the next instruction
    if (!fetchEvent.scheduled())
        schedule(fetchEvent, clockEdge(Cycles(0)));
....
We have a fetch instead of a tick here, compared to `AtomicSimpleCPU`, because in the timing CPU we must first get the instruction opcode from DRAM, which takes some cycles to return!
By looking at the source, we see that `fetchEvent` runs `TimingSimpleCPU::fetch`.
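As a mental model, here is a self-contained toy sketch of what that machinery boils down to (this is not gem5 code, just an illustration of the event wrapper pattern): an event is a tick plus a callback on a priority queue, and `activateContext` just pushes the fetch callback for tick 0:
....
// Toy model of gem5's event queue and EventFunctionWrapper, not real gem5 code.
#include <cstdint>
#include <functional>
#include <iostream>
#include <queue>

using Tick = std::uint64_t;

struct Event {
    Tick when;
    std::function<void()> callback;
    // Invert the comparison so the priority queue pops the earliest tick first.
    bool operator<(const Event &other) const { return when > other.when; }
};

std::priority_queue<Event> eventQueue;
Tick curTick = 0;

void schedule(Tick when, std::function<void()> cb) {
    eventQueue.push(Event{when, std::move(cb)});
}

void fetch() {
    std::cout << curTick << ": fetch: request instruction from memory\n";
    // Fake a fixed memory latency like the 77000 tick one seen in the log.
    schedule(curTick + 77000, [] {
        std::cout << curTick << ": icache response: execute instruction\n";
    });
}

int main() {
    // TimingSimpleCPU::activateContext: kick things off with the fetch event.
    schedule(0, fetch);
    // doSimLoop / EventQueue::serviceOne: pop and run events in tick order.
    while (!eventQueue.empty()) {
        Event event = eventQueue.top();
        eventQueue.pop();
        curTick = event.when;
        event.callback();
    }
}
....
In real gem5 the queue is `EventQueue`, the callback lives in an `EventFunctionWrapper`, and the simulation only ever advances by servicing events like these.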
`0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250`: from GDB we see that it comes from `DRAMCtrl::startup` in `mem/dram_ctrl.cc` which contains:
....
void
DRAMCtrl::startup()
{
    // remember the memory system mode of operation
    isTimingMode = system()->isTimingMode();
    if (isTimingMode) {
        // timestamp offset should be in clock cycles for DRAMPower
        timeStampOffset = divCeil(curTick(), tCK);
        // update the start tick for the precharge accounting to the
        // current tick
        for (auto r : ranks) {
            r->startup(curTick() + tREFI - tRP);
        }
        // shift the bus busy time sufficiently far ahead that we never
        // have to worry about negative values when computing the time for
        // the next request, this will add an insignificant bubble at the
        // start of simulation
        nextBurstAt = curTick() + tRP + tRCD;
    }
}
....
which then calls:
....
void
DRAMCtrl::Rank::startup(Tick ref_tick)
{
    assert(ref_tick > curTick());
    pwrStateTick = curTick();
    // kick off the refresh, and give ourselves enough time to
    // precharge
    schedule(refreshEvent, ref_tick);
}
....
By looking up some variable definitions in the source, we now see some memory parameters clearly:
* `ranks`: a `std::vector<DRAMCtrl::Rank*>` with 2 elements. TODO why do we have 2? What does it represent? Likely linked to <<gem5-config-ini,`config.ini`>> at `system.mem_ctrls.ranks_per_channel=2`.
* `tCK=1250`, `tREFI=7800000`, `tRP=13750`, `tRCD=13750`: all defined in a single code location with a comment:
+
....
/**
* Basic memory timing parameters initialized based on parameter
* values.
*/
....
+
Their values can be seen under `config.ini` and they are documented in `src/mem/DRAMCtrl.py` e.g.:
+
....
# the base clock period of the DRAM
tCK = Param.Latency("Clock period")
# minimum time between a precharge and subsequent activate
tRP = Param.Latency("Row precharge time")
# the amount of time in nanoseconds from issuing an activate command
# to the data being available in the row buffer for a read/write
tRCD = Param.Latency("RAS to CAS delay")
# refresh command interval, how often a "ref" command needs
# to be sent. It is 7.8 us for a 64ms refresh requirement
tREFI = Param.Latency("Refresh command interval")
....
So we realize that we are going into deep DRAM modelling, in more detail than a mere mortal should ever need to know.
`curTick() + tREFI - tRP = 0 + 7800000 - 13750 = 7786250`, which is when that `refreshEvent` was scheduled. Our simulation ends way before that point however, so we will never know what it did, thank God.
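As a sanity check on those magic numbers, assuming gem5's default time resolution of 1 tick = 1 ps (my assumption here, but it is consistent with the "7.8 us" comment in `DRAMCtrl.py` above):
....
// Check the refresh scheduling arithmetic, assuming 1 tick == 1 ps.
#include <cassert>
#include <cstdint>

int main() {
    const std::uint64_t tREFI = 7800000; // 7.8 us refresh command interval
    const std::uint64_t tRP   = 13750;   // 13.75 ns row precharge time
    const std::uint64_t curTick = 0;     // startup() runs at tick 0
    // Matches the "scheduled @ 7786250" log line.
    assert(curTick + tREFI - tRP == 7786250);
}
....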
`0: Event_74: generic event scheduled @ 0` and `0: Event_74: generic event rescheduled @ 18446744073709551615`: schedules the final exit event, same as for `AtomicSimpleCPU`.
The next interesting event is:
....
system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
....
which comes from:
....
#0 Trace::OstreamLogger::logMessage
#1 void Trace::Logger::dprintf
#2 Event::trace
#3 EventQueue::schedule
#4 EventManager::schedule
#5 DRAMCtrl::addToReadQueue
#6 DRAMCtrl::recvTimingReq
#7 DRAMCtrl::MemoryPort::recvTimingReq
#8 TimingRequestProtocol::sendReq
#9 MasterPort::sendTimingReq
#10 CoherentXBar::recvTimingReq
#11 CoherentXBar::CoherentXBarSlavePort::recvTimingReq(Packet*)
#12 TimingRequestProtocol::sendReq
#13 MasterPort::sendTimingReq
#14 TimingSimpleCPU::sendFetch
#15 TimingSimpleCPU::FetchTranslation::finish
#16 ArmISA::TLB::translateComplete
#17 ArmISA::TLB::translateTiming
#18 ArmISA::TLB::translateTiming
#19 TimingSimpleCPU::fetch
#20 TimingSimpleCPU::<lambda()>::operator()(void)
#21 std::_Function_handler<void(), TimingSimpleCPU::TimingSimpleCPU(TimingSimpleCPUParams*)::<lambda()> >
#22 std::function<void ()>::operator()() const
#23 EventFunctionWrapper::process
#24 EventQueue::serviceOne
#25 doSimLoop
#26 simulate
....
From the trace, we see that we are already running from the event queue. Therefore, we must have been running a previously scheduled event, and from the previous event logs, the only candidate is `0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0`, which scheduled the memory fetch!
From the backtrace we see the tortuous path that the fetch request takes (a toy sketch of the port forwarding follows the list), going through:
* `ArmISA::TLB`
* `CoherentXBar`
* `DRAMCtrl`
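Here is that toy sketch (not gem5 code, and heavily simplified: real gem5 ports are peer pairs that can also reject and retry packets): each component's `recvTimingReq` either forwards the packet onward or queues it and schedules an event:
....
// Toy sketch of gem5's timing request forwarding, not real gem5 code.
#include <iostream>
#include <string>

struct Packet { std::string req; };

// Simplified stand-in for a slave port: anything that can receive a request.
struct Responder {
    virtual ~Responder() = default;
    virtual void recvTimingReq(Packet &pkt) = 0;
};

struct DRAMCtrl : Responder {
    void recvTimingReq(Packet &pkt) override {
        // Real gem5: DRAMCtrl::addToReadQueue, then schedule nextReqEvent.
        std::cout << "DRAMCtrl queued: " << pkt.req << "\n";
    }
};

struct CoherentXBar : Responder {
    Responder &memSide;
    explicit CoherentXBar(Responder &mem) : memSide(mem) {}
    void recvTimingReq(Packet &pkt) override {
        // Real gem5 arbitrates for a bus layer before forwarding.
        memSide.recvTimingReq(pkt);
    }
};

int main() {
    DRAMCtrl dram;
    CoherentXBar xbar(dram);
    Packet fetch{"ifetch of the first instruction"};
    // TimingSimpleCPU::sendFetch -> MasterPort::sendTimingReq -> peer port.
    xbar.recvTimingReq(fetch);
}
....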
The scheduling happens at frame `#5`:
....
// If we are not already scheduled to get a request out of the
// queue, do so now
if (!nextReqEvent.scheduled()) {
    DPRINTF(DRAM, "Request scheduled immediately\n");
    schedule(nextReqEvent, curTick());
}
....
and from a quick source grep we see that `nextReqEvent` runs `DRAMCtrl::processNextReqEvent`.
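The `!nextReqEvent.scheduled()` guard is there because a gem5 event object can sit in the event queue at most once: several requests can arrive before the event fires, and only the first one must schedule it. A minimal sketch of the idiom (toy code, not gem5):
....
// Toy sketch of the "schedule only if not already scheduled" idiom.
#include <iostream>

struct Event {
    bool inQueue = false;
    bool scheduled() const { return inQueue; }
};

void schedule(Event &event) {
    // gem5 asserts that the event is not already in the queue.
    event.inQueue = true;
    std::cout << "event scheduled\n";
}

int main() {
    Event nextReqEvent;
    // Three requests arrive before the event fires: only the first schedules it.
    for (int request = 0; request < 3; ++request) {
        if (!nextReqEvent.scheduled())
            schedule(nextReqEvent);
    }
}
....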
The next schedule:
....
0: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
....
and schedules a `BaseXBar::Layer::releaseLayer` event.
This one also comes from the same call stack rooted at `TimingSimpleCPU::fetch`. We therefore deduce that the single previous fetch event scheduled not one, but two events!
Now:
....
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
....
comes from the previously scheduled `DRAMCtrl::processNextReqEvent` and schedules `DRAMCtrl::Rank::processPrechargeEvent`.
Now:
....
0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
....
also runs from `DRAMCtrl::processNextReqEvent` and schedules a `DRAMCtrl::processRespondEvent`.
I'm getting bored, let's skip to the line that appears to matter for the first instruction:
....
46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
....
But I got even more bored, and I will now skip to the first event before the instruction:
....
77000: Event_40: Timing CPU icache tick event scheduled @ 77000
77000: system.cpu A0 T0 : @asm_main_after_prologue : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
....
This event is scheduled from `PacketQueue::processSendEvent` through:
....
void
TimingSimpleCPU::TimingCPUPort::TickEvent::schedule(PacketPtr _pkt, Tick t)
{
    pkt = _pkt;
    cpu->schedule(this, t);
}
....
and when the event fires, its `process()` polymorphically resolves to:
....
void
TimingSimpleCPU::IcachePort::ITickEvent::process()
{
    cpu->completeIfetch(pkt);
}
....
and so `TimingSimpleCPU::completeIfetch` is, at last, the interesting `TimingSimpleCPU` function!
The end of this instruction must set things up in a way that continues the PC walk loop, and by looking at the source and traces, it clearly happens in `TimingSimpleCPU::advanceInst`, which calls `TimingSimpleCPU::fetch`, which is the very thing that kicked off this simulation!!! OMG, that's the loop.
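Here is the whole loop as a self-contained toy sketch (not gem5 code; the 49000 tick latency is made up to resemble the log): unlike `AtomicSimpleCPU`'s single self-rescheduling tick event, each instruction is a chain `fetch -> memory latency -> completeIfetch -> advanceInst -> fetch`:
....
// Toy model of the TimingSimpleCPU fetch/complete/advance loop, not gem5 code.
#include <cstdint>
#include <functional>
#include <iostream>
#include <queue>

using Tick = std::uint64_t;

struct Event {
    Tick when;
    std::function<void()> callback;
    bool operator<(const Event &other) const { return when > other.when; }
};

std::priority_queue<Event> eventQueue;
Tick curTick = 0;
const Tick memLatency = 49000; // made-up DRAM round trip per fetch
int pc = 0;

void fetch();

void advanceInst() {
    if (++pc < 3)
        fetch(); // that's the loop
}

void completeIfetch() {
    std::cout << curTick << ": execute instruction " << pc << "\n";
    advanceInst();
}

void fetch() {
    // The CPU now stalls: nothing happens until the response event fires.
    eventQueue.push(Event{curTick + memLatency, completeIfetch});
}

int main() {
    eventQueue.push(Event{0, fetch}); // activateContext
    while (!eventQueue.empty()) {
        Event event = eventQueue.top();
        eventQueue.pop();
        curTick = event.when;
        event.callback();
    }
}
....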
One final thing to check is how memory reads make the processor stall in the middle of an instruction.
For that, we can break in GDB at the `TimingSimpleCPU::completeIfetch` of the first LDR done in our test program.
By doing that, we see that this time at:
....
if (curStaticInst && curStaticInst->isMemRef()) {
    // load or store: just send to dcache
    Fault fault = curStaticInst->initiateAcc(&t_info, traceData);
    if (_status == BaseSimpleCPU::Running) {
        // [...]
    }
} else if (curStaticInst) {
    // non-memory instruction: execute completely now
    Fault fault = curStaticInst->execute(&t_info, traceData);
* `curStaticInst->isMemRef()` is true, and there is no instruction `execute` call in that branch; `execute` is only called for instructions that don't touch memory
* `_status` is `BaseSimpleCPU::Status::DcacheWaitResponse` and `advanceInst` is not yet called
So, where is the `execute` happening? Well, I'll satisfy myself with a quick source grep and guess (a toy sketch follows the list):
* `curStaticInst->initiateAcc` sets up some memory request events
* which likely lead up to `TimingSimpleCPU::completeDataAccess`, which right off the bat ends in `advanceInst`.
+
It also calls `curStaticInst->completeAcc`, which pairs up with the `initiateAcc` call.
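A toy sketch of that guessed split (not gem5's real signatures): memory instructions are two-phase, an initiate half that fires off the request and a complete half that consumes the response, with the CPU parked in `DcacheWaitResponse` in between:
....
// Toy sketch of a split (two-phase) memory instruction, not gem5's real API.
#include <cstdint>
#include <iostream>

enum class Status { Running, DcacheWaitResponse };

struct LoadInst {
    // First half: issue the memory request, no result available yet.
    void initiateAcc() { std::cout << "ldr: dcache request sent\n"; }
    // Second half: runs only when the response packet comes back.
    void completeAcc(std::uint64_t data) {
        std::cout << "ldr: write back 0x" << std::hex << data << "\n";
    }
};

int main() {
    Status status = Status::Running;
    LoadInst ldr;

    // completeIfetch, isMemRef() branch: initiate only, no execute() call.
    ldr.initiateAcc();
    status = Status::DcacheWaitResponse; // the CPU stalls here

    // ... the event queue runs until the memory system responds ...

    // completeDataAccess: finish the instruction, then advanceInst().
    ldr.completeAcc(0x42);
    status = Status::Running;
    (void)status; // silence unused-but-set warnings in this toy
}
....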
===== gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches
Let's just add `--caches` to see if things go any faster:
....
0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
**** REAL SIMULATION ****
0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
0: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
0: Event_84: generic event scheduled @ 0
info: Entering event queue @ 0. Starting simulation...
0: Event_84: generic event rescheduled @ 18446744073709551615
0: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
1000: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 2000
1000: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 5000
1000: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
74250: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
74250: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 80000
77000: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 78000
78000: Event_40: Timing CPU icache tick event scheduled @ 78000
78000: system.cpu A0 T0 : @asm_main_after_prologue : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
78000: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 83000
83000: Event_40: Timing CPU icache tick event scheduled @ 83000
83000: system.cpu A0 T0 : @asm_main_after_prologue+4 : adr x1, #28 : IntAlu : D=0x0000000000400098 flags=(IsInteger)
[...]
191000: system.cpu A0 T0 : @asm_main_after_prologue+28 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
191000: Event_85: generic event scheduled @ 191000
....
So yes, `--caches` does work here, leading to a total runtime of 191000 ticks rather than 469000 without caches! The trace also shows why: with caches, the second instruction retires at 83000, just 5000 ticks after the first one at 78000, because its fetch hits the icache, while without caches every single fetch pays the full DRAM round trip, e.g. 77000 to 126000 for the second instruction.
===== gem5 event queue MinorCPU syscall emulation freestanding example analysis
The events <<gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis,for the Atomic CPU>> were pretty simple: basically just ticks.