From 3ccbce567914c0ec8206b6ec7d44094ecfb7d550 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ciro=20Santilli=20=E5=85=AD=E5=9B=9B=E4=BA=8B=E4=BB=B6=20?= =?UTF-8?q?=E6=B3=95=E8=BD=AE=E5=8A=9F?= Date: Wed, 29 Apr 2020 01:00:00 +0000 Subject: [PATCH] f5d4998ff51a548ed3f5153aacb0411d22022058 --- index.html | 653 ++++++++++++++++++++++++++++++++++++++--------------- 1 file changed, 477 insertions(+), 176 deletions(-) diff --git a/index.html b/index.html index e2c8db1..558cf84 100644 --- a/index.html +++ b/index.html @@ -692,7 +692,12 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 10.8. QEMU user mode quirks @@ -1100,9 +1105,7 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 19.2.2.1. Number of cores
  • 19.2.2.2. gem5 cache size
  • @@ -1197,7 +1200,8 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 19.15.5. gem5 sanitation build
  • 19.15.6. gem5 Ruby build
  • 19.15.7. gem5 Python 3 build
  • @@ -1267,7 +1271,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 19.20.4.2.30. TimingSimpleCPU analysis: LDR stall
  • -
  • 19.20.4.3. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches
  • +
  • 19.20.4.3. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches + +
  • 19.20.4.4. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs
  • 19.20.4.5. gem5 event queue MinorCPU syscall emulation freestanding example analysis
  • 19.20.4.6. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis
  • @@ -1980,7 +1988,7 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 32.2.4. MESI cache coherence protocol
  • 32.2.5. MOSI cache coherence protocol
  • -
  • 32.2.6. MOESI cache coherence protocol
  • +
  • 32.2.6. MOESI cache coherence protocol
  • @@ -8063,37 +8071,121 @@ hello
    -

    10.7.3. gem5 syscall emulation multiple executables

    +

    10.7.3. gem5 syscall emulation multithreading

    -

    This is not currently nicely exposed in LKMC, but gem5 syscall emulation does allow you to run multiple executables "at once".

    +

    gem5 user mode multithreading has been particularly flaky compared to QEMU’s, but work is being put into improving it.

    -

    --cmd takes a semicolon separated list, so we could do:

    +

In gem5 syscall simulation, the fork syscall checks if there is a free CPU, and if there is one, the new thread runs on that CPU.

    +
    +
    +

Otherwise, the fork call, and therefore higher level interfaces to fork such as pthread_create, fails and returns a failure status in the guest.

    +
    +
    +
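For reference, a minimal standalone program along the lines of userland/posix/pthread_self.c (a sketch, not the exact LKMC source) only has to check the return value of pthread_create to observe this failure:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

static void *worker(void *arg) {
    (void)arg;
    /* Just prove that the thread ran. */
    printf("pthread_self = %lu\n", (unsigned long)pthread_self());
    return NULL;
}

int main(void) {
    pthread_t thread;
    int ret = pthread_create(&thread, NULL, worker, NULL);
    if (ret != 0) {
        /* With a single gem5 CPU this is the expected outcome: EAGAIN,
         * which glibc renders as "Resource temporarily unavailable". */
        fprintf(stderr, "pthread_create: %s\n", strerror(ret));
        return 1;
    }
    pthread_join(thread, NULL);
    return 0;
}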

For example, if we use just one CPU for userland/posix/pthread_self.c, which spawns one thread besides main:

    -
    ./run --arch aarch64 --emulator gem5 --userland userland/posix/getpid.c --cpus 2
    +
    ./run --cpus 1 --emulator gem5 --userland userland/posix/pthread_self.c --cli-args 1
    -

    and then hack the produced command by replacing:

    +

    fails with this error message coming from the guest stderr:

    -
      --cmd /home/ciro/bak/git/linux-kernel-module-cheat/out/userland/default/aarch64/posix/getpid.out \
    -  --param 'system.cpu[0].workload[:].release = "5.4.3"' \
    +
    pthread_create: Resource temporarily unavailable
    -

    with:

    +

It works, however, if we add one extra CPU:

    -
      --cmd '/home/ciro/bak/git/linux-kernel-module-cheat/out/userland/default/aarch64/posix/getpid.out;/home/ciro/bak/git/linux-kernel-module-cheat/out/userland/default/aarch64/posix/getpid.out' \
    -  --param 'system.cpu[:].workload[:].release = "5.4.3"' \
    +
    ./run --cpus 2 --emulator gem5 --userland userland/posix/pthread_self.c --cli-args 1
    +

Once threads exit, their CPU is freed and becomes available for new fork calls. For example, the following run spawns a thread, joins it, and then spawns again, so 2 CPUs are enough:

    +
    +
    +
    +
    ./run --cpus 2 --emulator gem5 --userland userland/posix/pthread_self.c --cli-args '1 2'
    +
    +
    +
    +

    because at each point in time, only up to two threads are running.

    +
    +
    +
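In other words, the workload follows a spawn-join-spawn pattern like the sketch below (this is not the exact semantics of pthread_self.c's CLI arguments, just the shape of the schedule):

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    return NULL;
}

int main(void) {
    /* Because each worker is joined before the next one is created, at most
     * two threads (main plus one worker) exist at any instant, so two gem5
     * CPUs are enough even though we spawn twice overall. */
    for (int round = 0; round < 2; round++) {
        pthread_t t;
        if (pthread_create(&t, NULL, worker, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }
        pthread_join(t, NULL);
    }
    return 0;
}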

    gem5 syscall emulation does show the expected number of cores when queried, e.g.:

    +
    +
    +
    +
    ./run --cpus 1 --userland userland/cpp/thread_hardware_concurrency.cpp --emulator gem5
    +./run --cpus 2 --userland userland/cpp/thread_hardware_concurrency.cpp --emulator gem5
    +
    +
    +
    +

    outputs 1 and 2 respectively.

    +
    +
    +
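For reference, the C analogue of that C++ query would be something like the sketch below via sysconf; whether this particular glibc code path is faithfully emulated by gem5 syscall emulation is an assumption we have not verified here, we only know that the C++ program above gets the right answer:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Number of CPUs currently online, the C counterpart of
     * std::thread::hardware_concurrency() used in the C++ example above. */
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    printf("%ld\n", ncpus);
    return 0;
}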

This can also be seen clearly by running sched_getcpu:

    +
    +
    +
    +
    ./run \
    +  --arch aarch64 \
    +  --cli-args  4 \
    +  --cpus 8 \
    +  --emulator gem5 \
    +  --userland userland/linux/sched_getcpu.c \
    +;
    +
    +
    +
    +

    which necessarily produces an output containing the CPU numbers from 1 to 4 and no higher:

    +
    +
    +
    +
    1
    +3
    +4
    +2
    +
    +
    +
    +
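A minimal program in the spirit of userland/linux/sched_getcpu.c could look like the following sketch (hypothetical: we assume it spawns as many threads as given on the CLI and has each one print the CPU it runs on; the real LKMC source may differ):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void *print_cpu(void *arg) {
    (void)arg;
    /* Each thread prints the CPU it is currently scheduled on. */
    printf("%d\n", sched_getcpu());
    return NULL;
}

int main(int argc, char **argv) {
    int nthreads = argc > 1 ? atoi(argv[1]) : 1;
    pthread_t *threads = malloc(nthreads * sizeof(*threads));
    for (int i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, print_cpu, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);
    free(threads);
    return 0;
}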

    TODO why does the 2 come at the end here? Would be good to do a detailed assembly run analysis.

    +
    +
    +
    +

    10.7.4. gem5 syscall emulation multiple executables

    +
    +

    gem5 syscall emulation has the nice feature of allowing you to run multiple executables "at once".

    +
    +
    +

    Each executable starts running on the next free core much as if it had been forked right at the start of simulation: gem5 syscall emulation multithreading.

    +
    +
    +

This can be useful to quickly create deterministic multi-CPU workloads.

    +
    +
    +

se.py --cmd takes a semicolon separated list, which LKMC exposes by taking --userland multiple times, as in:

    +
    +
    +
    +
    ./run \
    +  --arch aarch64 \
    +  --cpus 2 \
    +  --emulator gem5 \
    +  --userland userland/posix/getpid.c \
    +  --userland userland/posix/getpid.c \
    +;
    +
    +
    +
    +
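For reference, userland/posix/getpid.c is presumably little more than the following sketch (not the exact LKMC source):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    /* Print the PID that the emulator assigned to this process. */
    printf("pid=%ld\n", (long)getpid());
    return 0;
}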

    We need at least one CPU per executable, just like when forking new processes.

    +
    +

    The outcome of this is that we see two different pid messages printed to stdout:

    @@ -8103,7 +8195,7 @@ pid=100
    -

    since from gem5 Process we can see that se.py sets up one different PID per executable starting at `100:

    +

since from gem5 Process we can see that se.py sets up a different PID for each executable, starting at 100:

    @@ -8114,9 +8206,6 @@ pid=100
    -

    This is basically starts running one process per CPU much like if it had been forked.

    -
    -

    We can also see that these processes are running concurrently with gem5 tracing by hacking:

    @@ -8139,6 +8228,37 @@ pid=100

    and therefore shows one instruction running on each CPU for each process at the same time.

    +
    +
    10.7.4.1. gem5 syscall emulation --smt
    +
    +

gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4 syscall emulation has an --smt option, presumably for Hardware threads, but it seems to have been neglected for a long time: https://github.com/cirosantilli/linux-kernel-module-cheat/issues/104

    +
    +
    +

    If we start from the manually hacked working command from gem5 syscall emulation multiple executables and try to add:

    +
    +
    +
    +
--cpu 1 --cpu-type DerivO3CPU --caches
    +
    +
    +
    +

    We choose DerivO3CPU because of the se.py assert:

    +
    +
    +
    +
    example/se.py:115:        assert(options.cpu_type == "DerivO3CPU")
    +
    +
    +
    +

    But then that fails with:

    +
    +
    +
    +
    gem5.opt: /path/to/linux-kernel-module-cheat/out/gem5/master3/build/ARM/cpu/o3/cpu.cc:205: FullO3CPU<Impl>::FullO3CPU(DerivO3CPUParams*) [with Impl = O3CPUImpl]: Assertion `params->numPhysVecPredRegs >= numThreads * TheISA::NumVecPredRegs' failed.
    +Program aborted at tick 0
    +
    +
    +
    @@ -18418,103 +18538,7 @@ ps Haux | grep qemu | wc
    -
    19.2.2.1.2. gem5 syscall emulation multithreading
    -
    -

    gem5 user mode multithreading has been particularly flaky compared to QEMU’s, but work is being put into improving it.

    -
    -
    -

    In gem5 syscall simulation, the fork syscall checks if there is a free CPU, and if there is a free one, the new threads runs on that CPU.

    -
    -
    -

    Otherwise, the fork call, and therefore higher level interfaces to fork such as pthread_create also fail and return a failure return status in the guest.

    -
    -
    -

    For example, if we use just one CPU for userland/posix/pthread_self.c which spawns one thread besides main:

    -
    -
    -
    -
    ./run --cpus 1 --emulator gem5 --userland userland/posix/pthread_self.c --cli-args 1
    -
    -
    -
    -

    fails with this error message coming from the guest stderr:

    -
    -
    -
    -
    pthread_create: Resource temporarily unavailable
    -
    -
    -
    -

    It works however if we add on extra CPU:

    -
    -
    -
    -
    ./run --cpus 2 --emulator gem5 --userland userland/posix/pthread_self.c --cli-args 1
    -
    -
    -
    -

    Once threads exit, their CPU is freed and becomes available for new fork calls: For example, the following run spawns a thread, joins it, and then spawns again, and 2 CPUs are enough:

    -
    -
    -
    -
    ./run --cpus 2 --emulator gem5 --userland userland/posix/pthread_self.c --cli-args '1 2'
    -
    -
    -
    -

    because at each point in time, only up to two threads are running.

    -
    -
    -

    gem5 syscall emulation does show the expected number of cores when queried, e.g.:

    -
    -
    -
    -
    ./run --cpus 1 --userland userland/cpp/thread_hardware_concurrency.cpp --emulator gem5
    -./run --cpus 2 --userland userland/cpp/thread_hardware_concurrency.cpp --emulator gem5
    -
    -
    -
    -

    outputs 1 and 2 respectively.

    -
    -
    -

    This can also be clearly by running sched_getcpu:

    -
    -
    -
    -
    ./run \
    -  --arch aarch64 \
    -  --cli-args  4 \
    -  --cpus 8 \
    -  --emulator gem5 \
    -  --userland userland/linux/sched_getcpu.c \
    -;
    -
    -
    -
    -

    which necessarily produces an output containing the CPU numbers from 1 to 4 and no higher:

    -
    -
    -
    -
    1
    -3
    -4
    -2
    -
    -
    -
    -

    TODO why does the 2 come at the end here? Would be good to do a detailed assembly run analysis.

    -
    -
    -
    -
    19.2.2.1.3. gem5 se.py user mode with 2 or more pthreads fails with because simulate() limit reached
    - - -
    -
    -
    19.2.2.1.4. gem5 ARM full system with more than 8 cores
    +
    19.2.2.1.2. gem5 ARM full system with more than 8 cores
    @@ -20133,7 +20157,7 @@ git -C "$(./getvar linux_source_dir)" checkout -

    drm: Add component-aware simple encoder allows you to see images through VNC, see: Section 13.3, “gem5 graphic mode”

  • -

    gem5: Add support for gem5’s extended GIC mode adds support for more than 8 cores, see: Section 19.2.2.1.4, “gem5 ARM full system with more than 8 cores”

    +

    gem5: Add support for gem5’s extended GIC mode adds support for more than 8 cores, see: Section 19.2.2.1.2, “gem5 ARM full system with more than 8 cores”

  • @@ -21085,7 +21109,7 @@ Indirect leak of 1346 byte(s) in 2 object(s) allocated from:
    -

    Then, with fs.py and se.py, you can choose to use either the classic or built-in ruby system at runtime with the --ruby option:

    +

Then, with fs.py and se.py, you can choose at runtime between the classic memory system and the Ruby system type that was selected at build time with PROTOCOL=, by passing the --ruby option:

    +

    It is not possible to build more than one Ruby system into a single build, and this is a major pain point for testing Ruby: https://gem5.atlassian.net/browse/GEM5-467

    +
    +

    For example, to use a two level MESI cache coherence protocol we can do:

    @@ -21208,15 +21235,24 @@ cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"

    Tested in gem5 d7d9bc240615625141cd6feddbadd392457e49eb.

    -
    19.15.6.1. gem5 crossbar interconnect
    +
    19.15.6.1. gem5 Ruby MI_example protocol
    +
    +

    This is the simplest of all protocols, and therefore the first one you should study to learn how Ruby works.

    +
    +
    +
    +
    19.15.6.2. gem5 crossbar interconnect

Crossbar, or XBar in the code, is the default CPU interconnect used by fs.py if --ruby is not given.

    -

    One simple example of its operation can be seen at: Section 19.20.4.2, “gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis”.

    +

    It presumably implements a crossbar switch along the lines of: https://en.wikipedia.org/wiki/Crossbar_switch

    +
    +

But arguably, interesting effects can only be observed when we have more than one CPU, as in gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs.

    TODO: describe it in more detail. It appears to be a very simple mechanism.

    @@ -22554,7 +22590,6 @@ Exiting @ tick 3500 because exiting with last active thread context
    ./run \
       --arch aarch64 \
       --emulator gem5 \
    -  --gem5-build-type gem5 \
       --userland userland/arch/aarch64/freestanding/linux/hello.S \
       --trace Event,ExecAll,FmtFlag \
       --trace-stdout \
    @@ -23045,7 +23080,8 @@ DRAMCtrl::Rank::startup(Tick ref_tick)
     
    -
    ./run-toolchain --arch aarch64 readelf -- -h "$(./getvar --arch aarch64 userland_build_dir)/arch/aarch64/freestanding/linux/hello.out
    +
    ./run-toolchain --arch aarch64 readelf -- \
    +  -h "$(./getvar --arch aarch64 userland_build_dir)/arch/aarch64/freestanding/linux/hello.out"
    @@ -23096,6 +23132,50 @@ DRAMCtrl::Rank::startup(Tick ref_tick)
    _paddr = 0x78
    +
    +

so we deduce that the virtual address 0x400078 maps to the physical address 0x78. But of course, let me log that for you by adding --trace MMU:

    +
    +
    +
    +
          0: MMU: system.cpu.workload: Translating: 0x400078->0x78
    +
    +
    +
    +

    If we try --trace DRAM we can see:

    +
    +
    +
    +
          0: DRAM: system.mem_ctrls: recvTimingReq: request ReadReq addr 120 size 4
    +
    +
    +
    +

where 120 == 0x78 (it logs addresses in decimal? Really??) and the size is 4, which is the instruction width.

    +
    +
    +

    Now that we are here, we might as well learn how to log the data that was fetched from DRAM.

    +
    +
    +

First we determine the expected bytes from:

    +
    +
    +
    +
    ./run-toolchain --arch aarch64 objdump -- \
    +  -D "$(./getvar --arch aarch64 userland_build_dir)/arch/aarch64/freestanding/linux/hello.out"
    +
    +
    +
    +

    which shows us the initial instruction encodings near the entry point _start:

    +
    +
    +
    +
    0000000000400078 <_start>:
    +  400078:       d2800020        mov     x0, #0x1                        // #1
    +  40007c:       100000e1        adr     x1, 400098 <msg>
    +
    +
    +
    +

Now, TODO :-) The DRAM logs don’t contain data. Maybe this can be done with CommMonitor, but it is not exposed in fs.py.

    +
    19.20.4.2.6. TimingSimpleCPU analysis #6
    @@ -23625,7 +23705,7 @@ TimingSimpleCPU::IcachePort::ITickEvent::process

    We first find it by looking for the ExecEnable of LDR.

    -

    Then, we go up to the previous Timing CPU icache tick event, which from the analysis of previous instruction instructions, we know is where the instruction execution starts, the LDR instruction fetch is done by then!

    +

Then, we go up to the previous Timing CPU icache tick event, which, from the analysis of previous instruction traces, we know is where the instruction execution starts; the LDR instruction fetch is done by then!

Next, several events happen as the data request percolates through the memory system; it must be very similar to the instruction fetches. TODO analyze event function names.

    @@ -23657,59 +23737,72 @@ TimingSimpleCPU::IcachePort::ITickEvent::process
    19.20.4.3. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches
    -

    Let’s just add --caches to see if things go any faster:

    +

    Let’s just add --caches to gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis to see if things go any faster, and add Cache to --trace as in:

    -
          0: Event: system.cpu.wrapped_function_event: EventFunctionWrapped 43 scheduled @ 0
    -**** REAL SIMULATION ****
    -      0: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 14 scheduled @ 7786250
    -      0: Event: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped 20 scheduled @ 7786250
    -      0: Event: Event_84: generic 84 scheduled @ 0
    -info: Entering event queue @ 0.  Starting simulation...
    -      0: Event: Event_84: generic 84 rescheduled @ 18446744073709551615
    -      0: Event: system.cpu.wrapped_function_event: EventFunctionWrapped 43 executed @ 0
    -      0: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 1000
    -   1000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 executed @ 1000
    -   1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 1000
    -   1000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 70 scheduled @ 2000
    -   1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 1000
    -   1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 12 scheduled @ 1000
    -   1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 scheduled @ 46250
    -   1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 5000
    -   1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 12 executed @ 1000
    -   1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 15 scheduled @ 1000
    -   1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 15 executed @ 1000
    -   2000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 70 executed @ 2000
    -   5000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 5000
    -  46250: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 executed @ 46250
    -  46250: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 scheduled @ 74250
    -  74250: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 executed @ 74250
    -  74250: Event: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 74 scheduled @ 77000
    -  74250: Event: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped 75 scheduled @ 80000
    -  77000: Event: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 74 executed @ 77000
    -  77000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 scheduled @ 78000
    -  78000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 executed @ 78000
    -  78000: Event: Event_40: Timing CPU icache tick 40 scheduled @ 78000
    -  78000: Event: Event_40: Timing CPU icache tick 40 executed @ 78000
    -  78000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    -  78000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 scheduled @ 83000
    -  80000: Event: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped 75 executed @ 80000
    -  83000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 executed @ 83000
    -  83000: Event: Event_40: Timing CPU icache tick 40 scheduled @ 83000
    -  83000: Event: Event_40: Timing CPU icache tick 40 executed @ 83000
    -  83000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  flags=(IsInteger)
    -  83000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 84000
    -  [...]
    - 191000: Event: Event_85: generic 85 scheduled @ 191000
    - 191000: Event: Event_85: generic 85 executed @ 191000
    +
    --trace Cache,Event,ExecAll,-ExecSymbol,FmtFlag
    +
    +
    +
    +

    The resulting trace is:

    +
    +
    +
    +
    #0         0: Event: system.cpu.wrapped_function_event: EventFunctionWrapped 43 scheduled @ 0
    +#2         0: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 14 scheduled @ 7786250
    +#3         0: Event: system.mem_ctrls_1.wrapped_function_event: EventFunctionWrapped 20 scheduled @ 7786250
    +#4         0: Event: Event_84: generic 84 scheduled @ 0
    +#5         0: Event: Event_84: generic 84 rescheduled @ 18446744073709551615
    +#6         0: Event: system.cpu.wrapped_function_event: EventFunctionWrapped 43 executed @ 0
    +#7         0: Cache: system.cpu.icache: access for ReadReq [78:7b] IF miss
    +#8         0: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 1000
    +#9      1000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 executed @ 1000
    +#10     1000: Cache: system.cpu.icache: sendMSHRQueuePacket: MSHR ReadReq [78:7b] IF
    +#12     1000: Cache: system.cpu.icache: createMissPacket: created ReadCleanReq [40:7f] IF from ReadReq [78:7b] IF
    +#13     1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 1000
    +#14     1000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 70 scheduled @ 2000
    +#15     1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 1000
    +#16     1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 12 scheduled @ 1000
    +#17     1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 scheduled @ 46250
    +#18     1000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 5000
    +#19     1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 12 executed @ 1000
    +#20     1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 15 scheduled @ 1000
    +#22     1000: Event: system.mem_ctrls_0.wrapped_function_event: EventFunctionWrapped 15 executed @ 1000
    +#23     2000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 70 executed @ 2000
    +#24     5000: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 9 executed @ 5000
    +#25    46250: Event: system.mem_ctrls.wrapped_function_event: EventFunctionWrapped 10 executed @ 46250
    +#26    46250: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 scheduled @ 74250
    +#27    74250: Event: system.mem_ctrls.port-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 8 executed @ 74250
    +#28    74250: Event: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 74 scheduled @ 77000
    +#29    74250: Event: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped 75 scheduled @ 80000
    +#30    77000: Event: system.membus.slave[1]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 74 executed @ 77000
    +#32    77000: Cache: system.cpu.icache: recvTimingResp: Handling response ReadResp [40:7f] IF
    +#33    77000: Cache: system.cpu.icache: Block for addr 0x40 being updated in Cache
    +#34    77000: Cache: system.cpu.icache: Block addr 0x40 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0
    +#35    77000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 scheduled @ 78000
    +#36    78000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 executed @ 78000
    +#37    78000: Event: Event_40: Timing CPU icache tick 40 scheduled @ 78000
    +#38    78000: Event: Event_40: Timing CPU icache tick 40 executed @ 78000
    +#39    78000: ExecEnable: system.cpu: A0 T0 : 0x400078    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    +#40    78000: Cache: system.cpu.icache: access for ReadReq [7c:7f] IF hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0
    +#42    78000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 scheduled @ 83000
    +#43    80000: Event: system.membus.respLayer1.wrapped_function_event: EventFunctionWrapped 75 executed @ 80000
    +#44    83000: Event: system.cpu.icache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 57 executed @ 83000
    +#45    83000: Event: Event_40: Timing CPU icache tick 40 scheduled @ 83000
    +#46    83000: Event: Event_40: Timing CPU icache tick 40 executed @ 83000
    +#47    83000: ExecEnable: system.cpu: A0 T0 : 0x40007c    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  flags=(IsInteger)
    +#48    83000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 84000
    +          [...]
    +      191000: Event: Event_85: generic 85 scheduled @ 191000
    +      191000: Event: Event_85: generic 85 executed @ 191000

    So yes, --caches does work here, leading to a runtime of 191000 rather than 469000 without caches!

    -

    Notably, we now see that very little time passed between the first and second instructions, presumably because rather than going out all the way to the DRAM system, the event chain stops right at the icache.cpu_side when a hit happens, which must have been the case for the second instruction, which is just adjacent to the first one.

    +

Notably, we now see that very little time passes between the first and second instructions, which are marked with ExecEnable in #39 and #47. Presumably, rather than going all the way out to the DRAM system, the event chain stops right at icache.cpu_side when a hit happens, which must have been the case for the second instruction, since it is adjacent to the first one.

It is also interesting to look at the generated config.dot.svg and compare it to the one without caches: Figure 2, “config.dot.svg for a TimingSimpleCPU without caches.”. With caches: Figure 3, “config.dot.svg for a TimingSimpleCPU with caches.”.

    @@ -23726,6 +23819,208 @@ info: Entering event queue @ 0. Starting simulation...
    Figure 3. config.dot.svg for a TimingSimpleCPU with caches.
    +
    +

    We can break down the events between the instructions as follows.

    +
    +
    +

First, based on TimingSimpleCPU analysis #5, we b TimingSimpleCPU::fetch to see how the initial magically scheduled fetch, and its necessary cache miss, work:

    +
    +
    +
    +
    EventManager::schedule
    +PacketQueue::schedSendEvent
    +BaseCache::CacheMasterPort::schedSendEvent
    +BaseCache::schedMemSideSendEvent
    +BaseCache::allocateMissBuffer
    +BaseCache::handleTimingReqMiss
    +Cache::handleTimingReqMiss
    +BaseCache::recvTimingReq
    +Cache::recvTimingReq
    +BaseCache::CpuSidePort::recvTimingReq
    +TimingRequestProtocol::sendReq
    +MasterPort::sendTimingReq
    +TimingSimpleCPU::sendFetch
    +TimingSimpleCPU::FetchTranslation::finish
    +ArmISA::TLB::translateComplete
    +ArmISA::TLB::translateTiming
    +ArmISA::TLB::translateTiming
    +TimingSimpleCPU::fetch
    +
    +
    +
    +

    By comparing this to the uncached access at TimingSimpleCPU analysis #25, we see that this one does not reach the CoherentXBar at all: the cache must be scheduling an event in the future to model a delay between the cache request and XBar communication.

    +
    +
    +

A quick structural view of the source shows that the code for non-Ruby caches such as the ones from this example is located under:

    +
    +
    +
    +
    src/mem/cache
    +
    +
    +
    +

and has the following simple class hierarchy:

    +
    +
    +
• BaseCache
  • Cache
  • NoncoherentCache
    +
    +
    +

Next, we fast forward to #39 with b TimingSimpleCPU::IcachePort::ITickEvent::process, which, as we know from previous sections, is the event that executes instructions, and therefore leaves us at the start of the second instruction.

    +
    +
    +

    Then, we b EventManager::schedule to see what that schedules:

    +
    +
    +
    +
    EventManager::schedule
    +PacketQueue::schedSendEvent
    +PacketQueue::schedSendTiming
    +QueuedSlavePort::schedTimingResp
    +BaseCache::handleTimingReqHit
    +Cache::handleTimingReqHit
    +BaseCache::recvTimingReq
    +Cache::recvTimingReq
    +BaseCache::CpuSidePort::recvTimingReq
    +TimingRequestProtocol::sendReq
    +MasterPort::sendTimingReq
    +TimingSimpleCPU::sendFetch
    +TimingSimpleCPU::FetchTranslation::finish
    +ArmISA::TLB::translateComplete
    +ArmISA::TLB::translateTiming
    +ArmISA::TLB::translateTiming
    +TimingSimpleCPU::fetch
    +TimingSimpleCPU::advanceInst
    +TimingSimpleCPU::completeIfetch
    +TimingSimpleCPU::IcachePort::ITickEvent::process
    +
    +
    +
    +

By comparing this trace from the cache hit with the previous one from the cache miss, we see that BaseCache::recvTimingReq decides between Cache::handleTimingReqHit and Cache::handleTimingReqMiss, and from there we see that the key function that decides if the block is present is BaseCache::access.

    +
    +
    +
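To summarize that decision structure, here is a heavily simplified, hypothetical C sketch of the control flow we just traced; gem5's real code is C++ and considerably more involved, so treat this as a mnemonic rather than the actual API:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for BaseCache::access: in gem5 this looks the block up in the
 * tag array; here we just pretend the first access misses and later
 * accesses to the now-filled line hit. */
static bool cache_access(unsigned long addr) {
    static bool line_filled = false;
    bool hit = line_filled;
    line_filled = true;
    return hit;
}

static void handle_timing_req_hit(unsigned long addr) {
    printf("hit  0x%lx: queue the response on the CPU-side port\n", addr);
}

static void handle_timing_req_miss(unsigned long addr) {
    printf("miss 0x%lx: allocate an MSHR and schedule a memory-side send\n", addr);
}

/* Stand-in for BaseCache::recvTimingReq: dispatch on hit vs miss. */
static void recv_timing_req(unsigned long addr) {
    if (cache_access(addr))
        handle_timing_req_hit(addr);
    else
        handle_timing_req_miss(addr);
}

int main(void) {
    recv_timing_req(0x78); /* first fetch: miss, fills the 0x40-0x7f line */
    recv_timing_req(0x7c); /* second fetch: hit in the same line */
    return 0;
}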

We can see the access behaviour in the log lines, e.g.:

    +
    +
    +
    +
    #7         0: Cache: system.cpu.icache: access for ReadReq [78:7b] IF miss
    +#40    78000: Cache: system.cpu.icache: access for ReadReq [7c:7f] IF hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0
    +
    +
    +
    +

    which makes sense since from TimingSimpleCPU analysis #5 we know that the physical address of the initial instruction is 0x78, and 4 bytes are read for each instruction, so the second instruction access is at 0x7c.

    +
    +
    +

    The hit line also shows the precise cache state E from the MOESI protocol: What is the coherency protocol implemented by the classic cache system in gem5?.

    +
    +
    +

    The other log lines are also very clear, e.g. for the miss we see the following lines:

    +
    +
    +
    +
    #10     1000: Cache: system.cpu.icache: sendMSHRQueuePacket: MSHR ReadReq [78:7b] IF
    +#12     1000: Cache: system.cpu.icache: createMissPacket: created ReadCleanReq [40:7f] IF from ReadReq [78:7b] IF
    +#32    77000: Cache: system.cpu.icache: recvTimingResp: Handling response ReadResp [40:7f] IF
    +#33    77000: Cache: system.cpu.icache: Block for addr 0x40 being updated in Cache
    +#34    77000: Cache: system.cpu.icache: Block addr 0x40 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0
    +
    +
    +
    +

This shows us that the cache miss fills the cache line 40:7f, so we deduce that the cache block size is 0x40 == 64 bytes. The second access barely hit, landing in the last bytes of the block!

    +
    +
    +

It also informs us that the cache line moved to the E state (from the initial I) since a memory read was done.

    +
    +
    +

    We can confirm this with --trace DRAM which shows:

    +
    +
    +
    +
       1000: DRAM: system.mem_ctrls: recvTimingReq: request ReadCleanReq addr 64 size 64
    +
    +
    +
    +

Contrast this with the non --caches version seen at TimingSimpleCPU analysis #5, in which DRAM only reads the 4 required bytes.

    +
    +
    +

    Another interesting observation of running with --trace Cache,DRAM,XBar is that between the execution of both instructions, there is a Cache event, but no DRAM or XBar events:

    +
    +
    +
    +
      78000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    +  78000: Cache: system.cpu.icache: access for ReadReq [7c:7f] IF hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x1 way: 0
    +  83000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  flags=(IsInteger)
    +
    +
    +
    +

which is further consistent with the cache hit idea: no traffic goes down to the DRAM or the crossbar.

    +
    +
    +

This block size parameter can be seen in the gem5 config.ini file:

    +
    +
    +
    +
    [system]
    +cache_line_size=64
    +
    +
    +
    +

    so it is runtime configurable. The other key cache parameters can be seen further down in the config:

    +
    +
    +
    +
    [system.cpu.dcache]
    +assoc=2
    +size=65536
    +
    +[system.cpu.dcache.replacement_policy]
    +type=LRURP
    +
    +[system.cpu.dcache.tags.indexing_policy]
    +type=SetAssociative
    +
    +
    +
    +

so we understand that from this config, by default the classic dcache:

• is 64 KiB in size (size=65536)
• is 2-way set associative (assoc=2)
• uses an LRU replacement policy (LRURP)
• uses set associative indexing (SetAssociative)
    +
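As a sanity check, the little standalone calculation below reproduces the tag/set values printed in the Cache trace lines above, assuming plain set-associative indexing and using the dcache parameters from the config (the icache size may differ, but the arithmetic is the same):

#include <stdio.h>

int main(void) {
    /* Assumed parameters from config.ini above: 64 B lines, 64 KiB, 2-way. */
    unsigned long line_size = 64, size = 65536, assoc = 2;
    unsigned long num_sets = size / (line_size * assoc); /* 512 sets */
    unsigned long addrs[] = {0x40, 0x78, 0x7c};
    for (unsigned i = 0; i < sizeof(addrs) / sizeof(addrs[0]); i++) {
        unsigned long a = addrs[i];
        unsigned long block = a & ~(line_size - 1);     /* line base address */
        unsigned long set = (a / line_size) % num_sets; /* set index */
        unsigned long tag = a / (line_size * num_sets); /* tag */
        printf("addr 0x%lx: block 0x%lx set 0x%lx tag 0x%lx\n",
               a, block, set, tag);
    }
    return 0;
}

All three addresses land in block 0x40, set 0x1, tag 0, matching the "tag: 0 set: 0x1 way: 0" seen in the trace.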
    19.20.4.4. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs
    @@ -23747,6 +24042,9 @@ info: Entering event queue @ 0. Starting simulation...
    Figure 4. config.dot.svg for a system with two TimingSimpleCPU with caches.
    +
    +

    The simplest setup to understand will be to use gem5 syscall emulation multiple executables.

    +
    19.20.4.5. gem5 event queue MinorCPU syscall emulation freestanding example analysis
    @@ -36672,7 +36970,7 @@ west build -b qemu_aarch64 samples/hello_world
    -

    gem5 appears to possibly have attempted to implement hardware threads in gem5 syscall emulation mode: https://github.com/cirosantilli/linux-kernel-module-cheat/issues/104 when using gem5 syscall emulation multiple executables.

    +

gem5 appears to have attempted to implement hardware threads in syscall emulation mode, as mentioned at gem5 syscall emulation --smt.

On fs.py it is not exposed in any in-tree config, however, and as pointed out in the above issue, O3 FS has an assert that prevents it in src/cpu/o3/cpu.cc:

    @@ -37300,7 +37598,7 @@ CACHE2 S nyy
    32.2.3.1. MSI cache coherence protocol with transient states
    -

    TODO underestand well why those are needed.

    +

    TODO understand well why those are needed.

      @@ -37336,7 +37634,7 @@ CACHE2 S nyy
    -

    Exclusive is entered from Invalid after a PrRd, but only if the reply came from DRAM (or if we snoped that no one sent the reply to DRAM for us to read it)! If the reply came from another cache, we go directly to shared instead. It is this extra information that allows for the split of S.

    +

    Exclusive is entered from Invalid after a PrRd, but only if the reply came from DRAM (or if we snooped that no one sent the reply to DRAM for us to read it)! If the reply came from another cache, we go directly to shared instead. It is this extra information that allows for the split of S.

    The advantage of this over MSI is that when we move from Exclusive to Modified, no invalidate message is required, reducing bus traffic: https://en.wikipedia.org/wiki/MESI_protocol#Advantages_of_MESI_over_MSI

    @@ -37382,7 +37680,7 @@ CACHE2 S nyy
    -

    An MSI cahe 1 would do:

    +

    An MSI cache 1 would do:

      @@ -37412,7 +37710,7 @@ CACHE2 S nyy
    +

    In gem5 9fc9c67b4242c03f165951775be5cd0812f2a705, MOESI is the default cache coherency protocol of the classic memory system as shown at Section 19.20.4.3.1, “What is the coherency protocol implemented by the classic cache system in gem5?”.

    +
    +

    gem5 12c917de54145d2d50260035ba7fa614e25317a3 has several Ruby MOESI models implemented: MOESI_AMD_Base, MOESI_CMP_directory, MOESI_CMP_token and MOESI_hammer.