From 2deb2c25b3f24961f5b36bfbc19dae2395586640 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ciro=20Santilli=20=E5=85=AD=E5=9B=9B=E4=BA=8B=E4=BB=B6=20?=
 =?UTF-8?q?=E6=B3=95=E8=BD=AE=E5=8A=9F?=
Date: Mon, 25 Nov 2019 00:00:00 +0000
Subject: [PATCH] b9b38ed5792f7e0af2546a39f1718836330c63f1

---
 index.html | 467 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 462 insertions(+), 5 deletions(-)

diff --git a/index.html b/index.html
index 11e1664..6ea8ed0 100644
--- a/index.html
+++ b/index.html
@@ -1189,7 +1189,9 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 19.19.4. gem5 event queue
  • 19.19.5. gem5 stats internals
  • @@ -20020,7 +20022,16 @@ Indirect leak of 1346 byte(s) in 2 object(s) allocated from:
  • -

    TimingSimpleCPU: memory accesses are realistic, but the CPU has no pipeline. The simulation is faster than detailed models, but slower than AtomicSimpleCPU. TODO: application?

    +

    TimingSimpleCPU: memory accesses are realistic, but the CPU has no pipeline. The simulation is faster than detailed models, but slower than AtomicSimpleCPU.

    + +
    +

    Without caches, the CPU just stalls all the time waiting for memory requests, both for every advance of the PC and for every memory read made by an instruction!

    +
    +
    +

    Caches do make a difference here of course, and lead to much faster memory return times.

    +
  • @@ -20606,7 +20617,7 @@ Exiting @ tick 3500 because exiting with last active thread context
    -

    Let’s study the first event. From GDB, it’s stack trace is:

    +

    Let’s study the first event. From GDB, we can break at the point that prints the messages, Trace::OstreamLogger::logMessage(), to see where events are being scheduled from:

    @@ -20687,7 +20698,7 @@ simulate() at simulate.cc:104 0x555559476d6f
    -

    And at long, we can guess without reading the code that Event_71 is comes from the SE implementation of the exit syscall, so let’s just confirm, the trace contains:

    +

    And at last, we can guess without reading the code that Event_71 comes from the SE implementation of the exit syscall, so let’s just confirm: the trace contains:

    @@ -20715,7 +20726,450 @@ AtomicSimpleCPU::tick() at atomic.cc:757 0x55555907834c
    -
    19.19.4.2. gem5 event queue MinorCPU syscall emulation freestanding example analysis
    +
    19.19.4.2. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis
    +
    +

    TimingSimpleCPU should be the second simplest CPU to analyze, so let’s give it a try:

    +
    +
    +
    +
    ./run \
    +  --arch aarch64 \
    +  --emulator gem5 \
    +  --userland userland/arch/aarch64/freestanding/linux/hello.S \
    +  --trace Event,ExecAll \
    +  --trace-stdout \
    +  -- \
    +  --cpu-type TimingSimpleCPU \
    +;
    +
    +
    +
    +

    As of LKMC 9bfbff244d713de40e5686bd370eadb20cf78c7b + 1, the log is now much more complex.

    +
    +
    +

    Here is an abridged version with:

    +
    +
    +
      +
    • +

      the beginning up to the second instruction

      +
    • +
    • +

      the ending

      +
    • +
    +
    +
    +

    because all that happens in between is exactly the same as the first two instructions and therefore boring:

    +
    +
    +
    +
          0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    +**** REAL SIMULATION ****
    +      0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
    +      0: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
    +      0: Event_74: generic event scheduled @ 0
    +info: Entering event queue @ 0.  Starting simulation...
    +      0: Event_74: generic event rescheduled @ 18446744073709551615
    +      0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    +      0: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
    +      0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    +      0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
    +      0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 5000
    +      0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    +  46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
    +  74250: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
    +  74250: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
    +  77000: Event_40: Timing CPU icache tick event scheduled @ 77000
    +  77000: system.cpu A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    +  77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
    +  77000: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 78000
    +  77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 95750
    +  77000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
    +  95750: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 123750
    + 123750: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 126000
    + 123750: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 126000
    + 126000: Event_40: Timing CPU icache tick event scheduled @ 126000
    +      [...]
    + 469000: system.cpu A0 T0 : @asm_main_after_prologue+28    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
    + 469000: Event_75: generic event scheduled @ 469000
    +
    +
    +
    +

    0: system.cpu.wrapped_function_event schedules the initial CPU event, much like for AtomicSimpleCPU. This time, however, it is not a tick, but rather a fetch event that gets scheduled:

    +
    +
    +
    +
    TimingSimpleCPU::activateContext(ThreadID thread_num)
    +{
    +    DPRINTF(SimpleCPU, "ActivateContext %d\n", thread_num);
    +
    +    assert(thread_num < numThreads);
    +
    +    threadInfo[thread_num]->notIdleFraction = 1;
    +    if (_status == BaseSimpleCPU::Idle)
    +        _status = BaseSimpleCPU::Running;
    +
    +    // kick things off by initiating the fetch of the next instruction
    +    if (!fetchEvent.scheduled())
    +        schedule(fetchEvent, clockEdge(Cycles(0)));
    +
    +
    +
    +

    We have a fetch instead of a tick here compared to AtomicSimpleCPU, because in the timing CPU we must first get the instruction opcode from DRAM, which takes some cycles to return! Indeed, in the trace above, the fetch scheduled at tick 0 only leads to the first movz executing at tick 77000.

    +
    +
    +

    By looking at the source, we see that fetchEvent runs TimingSimpleCPU::fetch.
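
    The TimingSimpleCPU::&lt;lambda()&gt; frames that show up in the backtraces below come from exactly this kind of wrapping: an EventFunctionWrapper-style event stores a lambda, and servicing the event just invokes that lambda, which calls back into the owning object. Here is a tiny self-contained toy model of the pattern (not gem5 code; every name here is made up for illustration):

    // Toy model of gem5's "function wrapper" events: the event object stores
    // a lambda, and servicing the event just invokes that lambda, which calls
    // back into the owning object (here, fetch()).
    #include <functional>
    #include <iostream>
    #include <string>

    struct ToyEvent {
        std::function<void()> callback;
        std::string name;
        void process() { callback(); } // what the event queue calls when serviced
    };

    struct ToyCpu {
        // analogous in spirit to a fetchEvent member wrapping [this]{ fetch(); }
        ToyEvent fetchEvent{[this] { fetch(); }, "toy_cpu.fetchEvent"};
        void fetch() { std::cout << "fetch runs\n"; }
    };

    int main() {
        ToyCpu cpu;
        cpu.fetchEvent.process();
    }

    In gem5 proper, the process() call is made for us from the event loop, as the EventQueue::serviceOne frames at the bottom of the backtraces below show.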

    +
    +
    +

    0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250: from GDB we see that it comes from DRAMCtrl::startup in mem/dram_ctrl.cc which contains:

    +
    +
    +
    +
    void
    +DRAMCtrl::startup()
    +{
    +    // remember the memory system mode of operation
    +    isTimingMode = system()->isTimingMode();
    +
    +    if (isTimingMode) {
    +        // timestamp offset should be in clock cycles for DRAMPower
    +        timeStampOffset = divCeil(curTick(), tCK);
    +
    +        // update the start tick for the precharge accounting to the
    +        // current tick
    +        for (auto r : ranks) {
    +            r->startup(curTick() + tREFI - tRP);
    +        }
    +
    +        // shift the bus busy time sufficiently far ahead that we never
    +        // have to worry about negative values when computing the time for
    +        // the next request, this will add an insignificant bubble at the
    +        // start of simulation
    +        nextBurstAt = curTick() + tRP + tRCD;
    +    }
    +}
    +
    +
    +
    +

    which then calls:

    +
    +
    +
    +
    void
    +DRAMCtrl::Rank::startup(Tick ref_tick)
    +{
    +    assert(ref_tick > curTick());
    +
    +    pwrStateTick = curTick();
    +
    +    // kick off the refresh, and give ourselves enough time to
    +    // precharge
    +    schedule(refreshEvent, ref_tick);
    +}
    +
    +
    +
    +

    By looking up some variable definitions in the source, we now see some memory parameters clearly:

    +
    +
    +
      +
    • +

      ranks: std::vector<DRAMCtrl::Rank*> with 2 elements. TODO why do we have 2? What does it represent? Likely linked to config.ini at system.mem_ctrls.ranks_per_channel=2

      +
    • +
    • +

      tCK=1250, tREFI=7800000, tRP=13750, tRCD=13750: all defined in a single code location with a comment:

      +
      +
      +
           /**
      +     * Basic memory timing parameters initialized based on parameter
      +     * values.
      +     */
      +
      +
      +
      +

      Their values can be seen under config.ini and they are documented in src/mem/DRAMCtrl.py e.g.:

      +
      +
      +
      +
          # the base clock period of the DRAM
      +    tCK = Param.Latency("Clock period")
      +
      +    # minimum time between a precharge and subsequent activate
      +    tRP = Param.Latency("Row precharge time")
      +
      +    # the amount of time in nanoseconds from issuing an activate command
      +    # to the data being available in the row buffer for a read/write
      +    tRCD = Param.Latency("RAS to CAS delay")
      +
      +    # refresh command interval, how often a "ref" command needs
      +    # to be sent. It is 7.8 us for a 64ms refresh requirement
      +    tREFI = Param.Latency("Refresh command interval")
      +
      +
      +
    • +
    +
    +
    +

    So we realize that we are going into deep DRAM modelling, more detail than a mere mortal should ever need to know.

    +
    +
    +

    curTick() + tREFI - tRP = 0 + 7800000 - 13750 = 7786250, which is when that refreshEvent was scheduled. Our simulation ends way before that point however, so we will never know what it did, thank God.

    +
    +
    +

    0: Event_74: generic event scheduled @ 0 and 0: Event_74: generic event rescheduled @ 18446744073709551615: these schedule the final exit event, same as for AtomicSimpleCPU. 18446744073709551615 is just 2^64 - 1, i.e. MaxTick.

    +
    +
    +

    The next interesting event is:

    +
    +
    +
    +
    system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    +
    +
    +
    +

    which comes from:

    +
    +
    +
    +
    #0  Trace::OstreamLogger::logMessage
    +#1  void Trace::Logger::dprintf
    +#2  Event::trace
    +#3  EventQueue::schedule
    +#4  EventManager::schedule
    +#5  DRAMCtrl::addToReadQueue
    +#6  DRAMCtrl::recvTimingReq
    +#7  DRAMCtrl::MemoryPort::recvTimingReq
    +#8  TimingRequestProtocol::sendReq
    +#9  MasterPort::sendTimingReq
    +#10 CoherentXBar::recvTimingReq
    +#11 CoherentXBar::CoherentXBarSlavePort::recvTimingReq(Packet*)
    +#12 TimingRequestProtocol::sendReq
    +#13 MasterPort::sendTimingReq
    +#14 TimingSimpleCPU::sendFetch
    +#15 TimingSimpleCPU::FetchTranslation::finish
    +#16 ArmISA::TLB::translateComplete
    +#17 ArmISA::TLB::translateTiming
    +#18 ArmISA::TLB::translateTiming
    +#19 TimingSimpleCPU::fetch
    +#20 TimingSimpleCPU::<lambda()>::operator()(void)
    +#21 std::_Function_handler<void(), TimingSimpleCPU::TimingSimpleCPU(TimingSimpleCPUParams*)::<lambda()> >
    +#22 std::function<void ()>::operator()() const
    +#23 EventFunctionWrapper::process
    +#24 EventQueue::serviceOne
    +#25 doSimLoop
    +#26 simulate
    +
    +
    +
    +

    From the backtrace, we see that we are already running from the event queue. Therefore, we must have been running a previously scheduled event, and from the previous event logs, the only such event is 0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0, which scheduled a memory fetch!

    +
    +
    +

    From the backtrace we see the tortuous path that the fetch request takes, going through:

    +
    +
    +
      +
    • +

      ArmISA::TLB

      +
    • +
    • +

      CoherentXBar

      +
    • +
    • +

      DRAMCtrl

      +
    • +
    +
    +
    +

    The scheduling happens at frame #5:

    +
    +
    +
    +
         // If we are not already scheduled to get a request out of the
    +     // queue, do so now
    +     if (!nextReqEvent.scheduled()) {
    +         DPRINTF(DRAM, "Request scheduled immediately\n");
    +         schedule(nextReqEvent, curTick());
    +     }
    +
    +
    +
    +

    and from a quick source grep we see that nextReqEvent runs DRAMCtrl::processNextReqEvent.

    +
    +
    +

    The next schedule:

    +
    +
    +
    +
    0: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
    +
    +
    +
    +

    and it runs a BaseXBar::Layer::releaseLayer event.

    +
    +
    +

    This one also comes from the same request path that started at TimingSimpleCPU::fetch. We therefore deduce that the single previous fetch event scheduled not one, but two events!

    +
    +
    +

    Now:

    +
    +
    +
    +
          0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    +
    +
    +
    +

    comes from the previously scheduled DRAMCtrl::processNextReqEvent and schedules DRAMCtrl::Rank::processPrechargeEvent.

    +
    +
    +

    Now:

    +
    +
    +
    +
          0: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
    +
    +
    +
    +

    also runs from DRAMCtrl::processNextReqEvent and schedules a DRAMCtrl::processRespondEvent.

    +
    +
    +

    I’m getting bored, let’s skip to the line that appears to matter for the first instruction:

    +
    +
    +
    +
      46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
    +
    +
    +
    +

    But I got even more bored, and I will now skip to the first event before the instruction:

    +
    +
    +
    +
      77000: Event_40: Timing CPU icache tick event scheduled @ 77000
    +  77000: system.cpu A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    +
    +
    +
    +

    This event comes from PacketQueue::processSendEvent and gets scheduled through:

    +
    +
    +
    +
    void
    +TimingSimpleCPU::TimingCPUPort::TickEvent::schedule(PacketPtr _pkt, Tick t)
    +{
    +    pkt = _pkt;
    +    cpu->schedule(this, t);
    +}
    +
    +
    +
    +

    which polymorphically resolves to:

    +
    +
    +
    +
    void
    +TimingSimpleCPU::IcachePort::ITickEvent::process()
    +{
    +    cpu->completeIfetch(pkt);
    +}
    +
    +
    +
    +

    and so TimingSimpleCPU::completeIfetch is, at last, the interesting TimingSimpleCPU function!

    +
    +
    +

    The end of this instruction must set things up in a way that continues the PC walk loop, and by looking at the source and traces, it clearly comes from TimingSimpleCPU::advanceInst, which calls TimingSimpleCPU::fetch, which is the very thing that kicked off this simulation!!! OMG, that’s the loop.
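
    To make that loop concrete, here is a minimal self-contained toy sketch of the cycle we just deduced (not gem5 code; all names are made up, and the flat memory latency is only loosely inspired by the 49000 tick gap between the first two instructions in the trace above): each fetch only schedules a completion event for a later tick, the completion "executes" the instruction and advances the PC, and advancing the PC schedules the next fetch:

    // Toy model of the TimingSimpleCPU PC walk loop deduced above:
    // fetch -> (memory latency) -> completeIfetch -> advanceInst -> fetch ...
    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <map>

    int main() {
        // a minimal event queue: tick -> callback
        std::multimap<uint64_t, std::function<void()>> eventQueue;
        uint64_t pc = 0;
        const uint64_t memLatency = 49000; // made-up flat latency per fetch

        std::function<void(uint64_t)> fetch = [&](uint64_t now) {
            // the fetch only schedules the completion; the CPU stalls meanwhile
            eventQueue.emplace(now + memLatency, [&, now] {
                uint64_t when = now + memLatency;
                std::cout << "tick " << when << ": execute insn at pc " << pc << "\n";
                pc += 4;
                if (pc < 16)   // advanceInst: keep walking the PC for a while
                    fetch(when);
            });
        };

        fetch(0);
        while (!eventQueue.empty()) { // the doSimLoop equivalent
            auto it = eventQueue.begin();
            auto cb = it->second;
            eventQueue.erase(it);
            cb();
        }
    }

    The important property is that nothing loops explicitly: the simulation only keeps going because each serviced event schedules the next one, which is exactly what the Event trace lines show.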

    +
    +
    +

    One final thing to check is how the memory reads are going to make the processor stall in the middle of an instruction.

    +
    +
    +

    For that, we can use GDB to break at the TimingSimpleCPU::completeIfetch of the first LDR done in our test program.

    +
    +
    +

    By doing that, we see that this time, at:

    +
    +
    +
    +
    if (curStaticInst && curStaticInst->isMemRef()) {
    +    // load or store: just send to dcache
    +    Fault fault = curStaticInst->initiateAcc(&t_info, traceData);
    +
    +    if (_status == BaseSimpleCPU::Running) {
    +    }
    +} else if (curStaticInst) {
    +    // non-memory instruction: execute completely now
    +    Fault fault = curStaticInst->execute(&t_info, traceData);
    +
    +
    +
    +
      +
    • +

      curStaticInst→isMemRef() is true, and there is no instruction execute call in that branch of the if: execute is only called for instructions that don’t touch memory

      +
    • +
    • +

      _status is BaseSimpleCPU::Status::DcacheWaitResponse and advanceInst is not yet called

      +
    • +
    +
    +
    +

    So, where is the execute happening? Well, I’ll satisfy myself with a quick source grep and guess:

    +
    +
    +
      +
    • +

      curStaticInst→initiateAcc sets up some memory request events

      +
    • +
    • +

      which likely leads up to TimingSimpleCPU::completeDataAccess, which right off the bat ends in advanceInst (see the toy sketch right after this list).

      +
      +

      It also calls curStaticInst→completeAcc, which pairs up with the initiateAcc call.

      +
      +
    • +
    +
    +
    +
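
    Putting that guess into a toy picture: memory instructions are split into an initiate phase that just fires off the request and leaves the CPU stalled in DcacheWaitResponse, and a complete phase that only runs when the dcache response comes back, finishing the instruction and advancing the PC. A minimal sketch of that split-phase pattern (not gem5 code; the names are invented for illustration):

    // Toy model of the split-phase memory access guessed above:
    // initiateAcc() starts the request and the CPU waits in DcacheWaitResponse;
    // completeDataAccess() later runs completeAcc() and then advanceInst().
    #include <iostream>

    enum class Status { Running, DcacheWaitResponse };

    struct ToySplitPhaseCpu {
        Status status = Status::Running;

        void executeMemInsn() {
            initiateAcc();
            // note: no execute/advance here, the instruction is not finished yet
        }
        void initiateAcc() {
            std::cout << "request sent to dcache, stalling\n";
            status = Status::DcacheWaitResponse;
        }
        void recvDcacheResponse() { completeDataAccess(); }
        void completeDataAccess() {
            completeAcc();
            status = Status::Running;
            advanceInst();
        }
        void completeAcc() { std::cout << "loaded value written back\n"; }
        void advanceInst() { std::cout << "PC advances, next fetch scheduled\n"; }
    };

    int main() {
        ToySplitPhaseCpu cpu;
        cpu.executeMemInsn();      // phase 1: instruction starts, CPU stalls
        cpu.recvDcacheResponse();  // phase 2: memory responds some ticks later
    }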
    +
    19.19.4.3. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches
    +
    +

    Let’s just add --caches to see if things go any faster:

    +
    +
    +
    +
          0: system.cpu.wrapped_function_event: EventFunctionWrapped event scheduled @ 0
    +**** REAL SIMULATION ****
    +      0: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
    +      0: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped event scheduled @ 7786250
    +      0: Event_84: generic event scheduled @ 0
    +info: Entering event queue @ 0.  Starting simulation...
    +      0: Event_84: generic event rescheduled @ 18446744073709551615
    +      0: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
    +   1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
    +   1000: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped event scheduled @ 2000
    +   1000: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
    +   1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 46250
    +   1000: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped event scheduled @ 5000
    +   1000: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped event scheduled @ 1000
    +  46250: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 74250
    +  74250: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped event scheduled @ 77000
    +  74250: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped event scheduled @ 80000
    +  77000: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 78000
    +  78000: Event_40: Timing CPU icache tick event scheduled @ 78000
    +  78000: system.cpu A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    +  78000: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped event scheduled @ 83000
    +  83000: Event_40: Timing CPU icache tick event scheduled @ 83000
    +  83000: system.cpu A0 T0 : @asm_main_after_prologue+4    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  flags=(IsInteger)
    +  [...]
    + 191000: system.cpu A0 T0 : @asm_main_after_prologue+28    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
    + 191000: Event_85: generic event scheduled @ 191000
    +
    +
    +
    +

    So yes, --caches does work here, leading to a runtime of 191000 ticks rather than 469000 without caches, roughly a 2.5x reduction!

    +
    +
    +
    +
    19.19.4.4. gem5 event queue MinorCPU syscall emulation freestanding example analysis

    The events for the Atomic CPU were pretty simple: basically just ticks.

    @@ -23197,6 +23651,9 @@ cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans,3,3,2 ,1, A,3, B,