diff --git a/index.html b/index.html
index 096eac8..bf68eb3 100644
--- a/index.html
+++ b/index.html
@@ -1134,9 +1134,8 @@ pre{ white-space:pre }
  • 19.3.3.3. gem5 DRAM setup
  • -
  • 19.3.4. gem5 CommMonitor
  • -
  • 19.3.5. gem5 disk and network latency
  • -
  • 19.3.6. gem5 clock frequency
  • +
  • 19.3.4. gem5 disk and network latency
  • +
  • 19.3.5. gem5 clock frequency
  • 19.4. gem5 kernel command line parameters
  • @@ -1195,7 +1194,8 @@ pre{ white-space:pre }
  • 19.10.4. gem5 config.ini
@@ -1257,137 +1257,155 @@ pre{ white-space:pre }
  • 19.18. gem5 ARM platforms
  • 19.19. gem5 upstream images
  • 19.20. gem5 bootloaders
  • -
  • 19.21. gem5 internals +
  • 19.21. gem5 memory system +
  • +
  • 19.22. gem5 internals +
    -
    19.21.4.2.1. TimingSimpleCPU analysis #0
    +
    19.22.4.2.1. TimingSimpleCPU analysis #0

    Schedules TimingSimpleCPU::fetch through:

    @@ -24475,7 +25178,7 @@ ArmLinuxProcess64::initState
    -
    19.21.4.2.2. TimingSimpleCPU analysis #1
    +
    19.22.4.2.2. TimingSimpleCPU analysis #1

    Backtrace:

    @@ -24606,7 +25309,7 @@ DRAMCtrl::Rank::startup(Tick ref_tick)
    -
    19.21.4.2.3. TimingSimpleCPU analysis #2
    +
    19.22.4.2.3. TimingSimpleCPU analysis #2

    This is just the startup of the second rank, see: TimingSimpleCPU analysis #1.

    @@ -24639,13 +25342,13 @@ DRAMCtrl::Rank::startup(Tick ref_tick)
    -
    19.21.4.2.4. TimingSimpleCPU analysis #3 and #4
    +
    19.22.4.2.4. TimingSimpleCPU analysis #3 and #4

    From the timing we know what that one is: the end-of-time exit event, like for AtomicSimpleCPU.

    -
    19.21.4.2.5. TimingSimpleCPU analysis #5
    +
    19.22.4.2.5. TimingSimpleCPU analysis #5

    Executes TimingSimpleCPU::fetch().

    @@ -24753,7 +25456,7 @@ DRAMCtrl::Rank::startup(Tick ref_tick)
    -
    19.21.4.2.6. TimingSimpleCPU analysis #6
    +
    19.22.4.2.6. TimingSimpleCPU analysis #6

    Schedules DRAMCtrl::processNextReqEvent through:

    @@ -24890,7 +25593,7 @@ TimingSimpleCPU::fetch
    -
    19.21.4.2.7. TimingSimpleCPU analysis #7
    +
    19.22.4.2.7. TimingSimpleCPU analysis #7

    Schedules BaseXBar::Layer::releaseLayer through:

    @@ -24916,13 +25619,13 @@ TimingSimpleCPU::fetch
    -
    19.21.4.2.8. TimingSimpleCPU analysis #8
    +
    19.22.4.2.8. TimingSimpleCPU analysis #8

    Executes DRAMCtrl::processNextReqEvent.

    -
    19.21.4.2.9. TimingSimpleCPU analysis #9
    +
    19.22.4.2.9. TimingSimpleCPU analysis #9

    Schedules DRAMCtrl::Rank::processActivateEvent through:

    @@ -24936,7 +25639,7 @@ DRAMCtrl::processNextReqEvent
    -
    19.21.4.2.10. TimingSimpleCPU analysis #10
    +
    19.22.4.2.10. TimingSimpleCPU analysis #10

    Schedules DRAMCtrl::processRespondEvent through:

    @@ -24948,7 +25651,7 @@ DRAMCtrl::processNextReqEvent
    -
    19.21.4.2.11. TimingSimpleCPU analysis #11
    +
    19.22.4.2.11. TimingSimpleCPU analysis #11

    Schedules DRAMCtrl::processNextReqEvent through:

    @@ -24960,7 +25663,7 @@ DRAMCtrl::processNextReqEvent
    -
    19.21.4.2.12. TimingSimpleCPU analysis #12
    +
    19.22.4.2.12. TimingSimpleCPU analysis #12

    Executes DRAMCtrl::Rank::processActivateEvent.

    @@ -24969,7 +25672,7 @@ DRAMCtrl::processNextReqEvent
    -
    19.21.4.2.13. TimingSimpleCPU analysis #13
    +
    19.22.4.2.13. TimingSimpleCPU analysis #13

    Schedules DRAMCtrl::Rank::processPowerEvent through:

    @@ -24982,7 +25685,7 @@ DRAMCtrl::Rank::processActivateEvent
    -
    19.21.4.2.14. TimingSimpleCPU analysis #14
    +
    19.22.4.2.14. TimingSimpleCPU analysis #14

    Executes DRAMCtrl::Rank::processPowerEvent.

    @@ -24991,25 +25694,25 @@ DRAMCtrl::Rank::processActivateEvent
    -
    19.21.4.2.15. TimingSimpleCPU analysis #15
    +
    19.22.4.2.15. TimingSimpleCPU analysis #15

    Executes BaseXBar::Layer<SrcType, DstType>::releaseLayer.

    -
    19.21.4.2.16. TimingSimpleCPU analysis #16
    +
    19.22.4.2.16. TimingSimpleCPU analysis #16

    Executes DRAMCtrl::processNextReqEvent().

    -
    19.21.4.2.17. TimingSimpleCPU analysis #17
    +
    19.22.4.2.17. TimingSimpleCPU analysis #17

    Executes DRAMCtrl::processRespondEvent().

    -
    19.21.4.2.18. TimingSimpleCPU analysis #18
    +
    19.22.4.2.18. TimingSimpleCPU analysis #18

    Schedules PacketQueue::processSendEvent() through:

    @@ -25024,13 +25727,13 @@ DRAMCtrl::processRespondEvent
    -
    19.21.4.2.19. TimingSimpleCPU analysis #19
    +
    19.22.4.2.19. TimingSimpleCPU analysis #19

    Executes PacketQueue::processSendEvent().

    -
    19.21.4.2.20. TimingSimpleCPU analysis #20
    +
    19.22.4.2.20. TimingSimpleCPU analysis #20

    Schedules PacketQueue::processSendEvent through:

    @@ -25054,7 +25757,7 @@ PacketQueue::processSendEvent
    -
    19.21.4.2.21. TimingSimpleCPU analysis #21
    +
    19.22.4.2.21. TimingSimpleCPU analysis #21

    Schedules BaseXBar::Layer<SrcType, DstType>::releaseLayer through:

    @@ -25074,19 +25777,19 @@ PacketQueue::processSendEvent
    -
    19.21.4.2.22. TimingSimpleCPU analysis #22
    +
    19.22.4.2.22. TimingSimpleCPU analysis #22

    Executes BaseXBar::Layer<SrcType, DstType>::releaseLayer.

    -
    19.21.4.2.23. TimingSimpleCPU analysis #23
    +
    19.22.4.2.23. TimingSimpleCPU analysis #23

    Executes PacketQueue::processSendEvent.

    -
    19.21.4.2.24. TimingSimpleCPU analysis #24
    +
    19.22.4.2.24. TimingSimpleCPU analysis #24

    Schedules TimingSimpleCPU::IcachePort::ITickEvent::process() through:

    @@ -25104,7 +25807,7 @@ PacketQueue::processSendEvent
    -
    19.21.4.2.25. TimingSimpleCPU analysis #25
    +
    19.22.4.2.25. TimingSimpleCPU analysis #25

    Executes TimingSimpleCPU::IcachePort::ITickEvent::process().

    @@ -25124,7 +25827,7 @@ PacketQueue::processSendEvent
    -
    19.21.4.2.26. TimingSimpleCPU analysis #26
    +
    19.22.4.2.26. TimingSimpleCPU analysis #26

    Schedules DRAMCtrl::processNextReqEvent through:

    @@ -25153,7 +25856,7 @@ TimingSimpleCPU::IcachePort::ITickEvent::process
    -
    19.21.4.2.27. TimingSimpleCPU analysis #27
    +
    19.22.4.2.27. TimingSimpleCPU analysis #27

    Schedules BaseXBar::Layer<SrcType, DstType>::releaseLayer through:

    @@ -25179,19 +25882,19 @@ TimingSimpleCPU::IcachePort::ITickEvent::process
    -
    19.21.4.2.28. TimingSimpleCPU analysis #28
    +
    19.22.4.2.28. TimingSimpleCPU analysis #28

    Executes DRAMCtrl::processNextReqEvent.

    -
    19.21.4.2.29. TimingSimpleCPU analysis #29
    +
    19.22.4.2.29. TimingSimpleCPU analysis #29

    Schedules DRAMCtrl::processRespondEvent().

    -
    19.21.4.2.30. TimingSimpleCPU analysis: LDR stall
    +
    19.22.4.2.30. TimingSimpleCPU analysis: LDR stall

    One important thing we want to check now is how the memory reads are going to make the processor stall in the middle of an instruction.

    @@ -25309,7 +26012,7 @@ TimingSimpleCPU::IcachePort::ITickEvent::process
    -
    19.21.4.3. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches
    +
    19.22.4.3. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches

    Let’s just add --caches to gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis to see if things go any faster, and add Cache to --trace as in:

    @@ -25530,7 +26233,7 @@ TimingSimpleCPU::IcachePort::ITickEvent::process

    Contrast this with the version without --caches seen at TimingSimpleCPU analysis #5, in which the DRAM only actually reads the 4 required bytes.

    -

    The only cryptic thing about the messages is the IF flag, but good computer architects would have guessed it correctly, and src/mem/packet.cc confirms:

    +

    The only cryptic thing about the messages is the IF flag, but good computer architects will have guessed correctly that it means "instruction fetch", and src/mem/packet.cc confirms:

    @@ -25603,8 +26306,29 @@ type=SetAssociative
  • +
    +

    At:

    +
    +
    +
    +
    #7         0: Cache: system.cpu.icache: access for ReadReq [78:7b] IF miss
    +#8         0: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 1000
    +#9      1000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 executed @ 1000
    +#10     1000: Cache: system.cpu.icache: sendMSHRQueuePacket: MSHR ReadReq [78:7b] IF
    +#12     1000: Cache: system.cpu.icache: createMissPacket: created ReadCleanReq [40:7f] IF from ReadReq [78:7b] IF
    +
    +
    +
    +

    we can briefly see the gem5 MSHR doing its thing.

    +
    +
    +

    At time 0, the CPU icache wants to read, so it creates a packet that reads 4 bytes only ([78:7b]) for the instruction, and that goes into the MSHR, to be treated in a future event.

    +
    +
    +

    At 1000, the future event is executed, and so it reads the original packet from the MSHR, and uses that to create a new request [40:7f] which gets forwarded.
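
    For intuition on the [78:7b] to [40:7f] widening, the miss packet simply covers the whole cache line containing the 4 requested bytes. Here is a minimal sketch of that alignment arithmetic; the 64-byte line size is an assumption read off the [40:7f] range, this is not gem5 code:

    #include <cinttypes>
    #include <cstdint>
    #include <cstdio>

    int main() {
        const uint64_t blkSize = 64;    // assumed cache line size, consistent with [40:7f]
        const uint64_t reqAddr = 0x78;  // start of the 4-byte ReadReq [78:7b]
        const uint64_t reqEnd  = 0x7b;
        // Align the request down to its containing cache line, which is what the
        // createMissPacket step effectively does before going towards memory.
        const uint64_t lineStart = reqAddr & ~(blkSize - 1);
        const uint64_t lineEnd   = lineStart + blkSize - 1;
        printf("miss [%" PRIx64 ":%" PRIx64 "] -> line [%" PRIx64 ":%" PRIx64 "]\n",
               reqAddr, reqEnd, lineStart, lineEnd);
    }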

    +
    -
    19.21.4.3.1. What is the coherency protocol implemented by the classic cache system in gem5?
    +
    19.22.4.3.1. What is the coherency protocol implemented by the classic cache system in gem5?

    MOESI cache coherence protocol: https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/cache/cache_blk.hh#L352

    @@ -25612,12 +26336,12 @@ type=SetAssociative

    The actual representation is done via separate state bits: https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/cache/cache_blk.hh#L66 and MOESI appears explicitly only on the pretty printing.

    -

    This pretty printing appears for example in the --trace Cache lines as shown at gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and with a few more transitions visible at Section 19.21.4.4, “gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs”.

    +

    This pretty printing appears for example in the --trace Cache lines as shown at gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and with a few more transitions visible at Section 19.22.4.4, “gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs”.

    -
    19.21.4.4. gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs
    +
    19.22.4.4. gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs

    It would be amazing to analyze a simple example with interconnect packets possibly invalidating caches of other CPUs.

    @@ -25827,7 +26551,228 @@ type=SetAssociative
    -
    19.21.4.5. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs and Ruby
    +
    19.22.4.5. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs
    +
    +

    Like gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs but with gem5 TimingSimpleCPU and userland/c/atomic/aarch64_add.c:

    +
    +
    +
    +
    ./build-userland --arch aarch64 --optimization-level 3 --userland-build-id o3
    +./run \
    +  --arch aarch64 \
    +  --cli-args '2 1000' \
    +  --cpus 3 \
    +  --emulator gem5 \
    +  --trace FmtFlag,CacheAll,DRAM,Event,ExecAll,SimpleCPU,XBar \
    +  --userland userland/c/atomic/aarch64_add.c \
    +  --userland-build-id o3 \
    +  -- \
    +  --caches \
    +  --cpu-type TimingSimpleCPU \
    +;
    +
    +
    +
    +

    This is arguably the best experiment to study the gem5 crossbar interconnect.

    +
    +
    +

    We increase the loop count to 1000 because 100 loops did not show memory conflicts. The output is:

    +
    +
    +
    +
    expect 200
    +global 147
    +
    +
    +
    +
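
    For intuition on where the missing increments go, here is a minimal host-side sketch of the same race, using std::thread and a plain non-atomic counter. The loop count is arbitrary and this is not the aarch64_add.c source, just the same idea:

    #include <cstdio>
    #include <thread>

    // Plain, non-atomic shared counter: the two threads' read-modify-write
    // sequences (ldr; add; str) can interleave and lose updates, which is why
    // "global" ends up smaller than "expect" above.
    volatile unsigned long global = 0;

    void worker(unsigned long loops) {
        for (unsigned long i = 0; i < loops; ++i)
            global = global + 1;  // not atomic: load, add, store
    }

    int main() {
        const unsigned long loops = 1000;  // arbitrary count for the sketch
        std::thread t1(worker, loops);
        std::thread t2(worker, loops);
        t1.join();
        t2.join();
        printf("expect %lu\n", 2 * loops);
        printf("global %lu\n", (unsigned long)global);
    }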

    Let’s double check what it compiles to with disas:

    +
    +
    +
    +
    ./disas --arch aarch64 --userland userland/c/atomic/aarch64_add.c --userland-build-id o3 my_thread_main
    +
    +
    +
    +

    which contains:

    +
    +
    +
    +
       0x0000000000400a70 <+0>:     03 00 40 f9     ldr     x3, [x0]
    +   0x0000000000400a74 <+4>:     63 01 00 b4     cbz     x3, 0x400aa0 <my_thread_main+48>
    +   0x0000000000400a78 <+8>:     82 00 00 d0     adrp    x2, 0x412000 <malloc@got.plt>
    +   0x0000000000400a7c <+12>:    42 a0 01 91     add     x2, x2, #0x68
    +   0x0000000000400a80 <+16>:    00 00 80 d2     mov     x0, #0x0                        // #0
    +   0x0000000000400a84 <+20>:    1f 20 03 d5     nop
    +   0x0000000000400a88 <+24>:    41 00 40 f9     ldr     x1, [x2]
    +   0x0000000000400a8c <+28>:    21 04 00 91     add     x1, x1, #0x1
    +   0x0000000000400a90 <+32>:    41 00 00 f9     str     x1, [x2]
    +   0x0000000000400a94 <+36>:    00 04 00 91     add     x0, x0, #0x1
    +   0x0000000000400a98 <+40>:    7f 00 00 eb     cmp     x3, x0
    +   0x0000000000400a9c <+44>:    68 ff ff 54     b.hi    0x400a88 <my_thread_main+24>  // b.pmore
    +   0x0000000000400aa0 <+48>:    00 00 80 52     mov     w0, #0x0                        // #0
    +   0x0000000000400aa4 <+52>:    c0 03 5f d6     ret
    +
    +
    +
    +

    Grepping the logs for my_thread_main+24 shows where the first non-atomic interleavings happen:

    +
    +
    +
    +
    471039000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+24    :   ldr   x1, [x2]           : MemRead :  D=0x000000000000002f A=0x412068  flags=(IsInteger|IsMemRef|IsLoad)
    +471034000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+24    :   ldr   x1, [x2]           : MemRead :  D=0x000000000000002f A=0x412068  flags=(IsInteger|IsMemRef|IsLoad)
    +471059000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+44    :   b.hi   <my_thread_main+24> : IntAlu :   flags=(IsControl|IsDirectControl|IsCondControl)
    +471070000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+44    :   b.hi   <my_thread_main+24> : IntAlu :   flags=(IsControl|IsDirectControl|IsCondControl)
    +471071000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+24    :   ldr   x1, [x2]           : MemRead :  D=0x0000000000000030 A=0x412068  flags=(IsInteger|IsMemRef|IsLoad)
    +
    +
    +
    +

    after a long string of cpu1 hits, since CPU1 was forked first and therefore had more time to run that operation.

    +
    +
    +

    From those lines and the logs around them we deduce that:

    +
    +
    + +
    +
    +

    With that guide, we look at the fuller logs around that region of interest. We start at the first ifetch that CPU2 does for our LDR of interest at 0x400a88:

    +
    +
    +
    +
    471033000: SimpleCPU: system.cpu2: Fetch
    +471033000: SimpleCPU: system.cpu2: Translating address 0x400a88
    +
    +
    +
    +

    Things get a bit interleaved with CPU1, but soon afterwards we see the miss forwarding via gem5 MSHR as in gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches:

    +
    +
    +
    +
    471034000: Cache: system.cpu2.dcache: access for ReadReq [2068:206f] D=b0d989c328560000 ptr=0x5628c3d26f00 miss
    +471034000: CachePort: system.cpu2.dcache.mem_side: Scheduling send event at 471035000
    +471034000: Event: system.cpu2.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 140 scheduled @ 471035000
    +
    +
    +
    +

    Before the request moves on, some CPU1 action happens: a CPU1 STR finished! It hit the cache, and now we know the cache state: M:

    +
    +
    +
    +
    471034000: Cache: system.cpu1.dcache: access for WriteReq [2068:206f] D=2f00000000000000 ptr=0x5628c3d26c80 hit state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
    +471034000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+32    :   str   x1, [x2]           : MemWrite :  D=0x000000000000002f A=0x412068  flags=(IsInteger|IsMemRef|IsStore)
    +
    +
    +
    +

    After this is done, CPU2 dcache finally decides that it is time to forward its request, and now we see the crux of this experiment happen.

    +
    +
    +

    First createMissPacket creates a new packet for the cache request, and then it sends that packet into CoherentXBar.

    +
    +
    +
    +
    471035000: Event: system.cpu2.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 140 executed @ 471035000
    +471035000: Cache: system.cpu2.dcache: sendMSHRQueuePacket: MSHR ReadReq [2068:206f] D=b0d989c328560000 ptr=0x5628c3d26f00
    +471035000: Cache: system.cpu2.dcache: createMissPacket: created ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 from ReadReq [2068:206f] D=b0d989c328560000 ptr=0x5628c3d26f00
    +471035000: CoherentXBar: system.membus: recvTimingReq: src system.membus.slave[10] packet ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80
    +
    +
    +
    +

    Now, the SnoopFilter, which lives inside the crossbar, decides if any other CPUs care about that address:

    +
    +
    +
    +
    471035000: SnoopFilter: system.membus.snoop_filter: lookupRequest: src system.membus.slave[10] packet ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80
    +471035000: SnoopFilter: system.membus.snoop_filter: lookupRequest:   SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000
    +471035000: SnoopFilter: system.membus.snoop_filter: lookupRequest:   new SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000
    +471035000: CoherentXBar: system.membus: recvTimingReq: src system.membus.slave[10] packet ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 SF size: 1 lat: 1
    +
    +
    +
    +

    and the answer is yes: CPU1 does care about that address obviously! So the packet is forwarded as is to CPU1:

    +
    +
    +
    +
    471035000: CoherentXBar: system.membus: forwardTiming for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80
    +471035000: CacheVerbose: system.cpu1.dcache: recvTimingSnoopReq: for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80
    +471035000: CacheVerbose: system.cpu1.dcache: handleSnoop: for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80
    +471035000: Cache: system.cpu1.dcache: handleSnoop: snoop hit for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80, old state is state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
    +471035000: Cache: system.cpu1.dcache: new state is state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
    +471035000: Cache: system.cpu1.dcache: doTimingSupplyResponse: for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80
    +471035000: CacheVerbose: system.cpu1.dcache: doTimingSupplyResponse: created response: ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 tick: 471044000
    +471035000: Event: system.cpu1.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 94 scheduled @ 471044000
    +471035000: CoherentXBar: system.membus: recvTimingReq: Not forwarding ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80
    +471035000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 164 scheduled @ 471036000
    +471035000: BaseXBar: system.membus.reqLayer0: The crossbar layer is now busy from tick 471035000 to 471036000
    +
    +
    +
    +

    and from this we see that this read request from CPU2 made a cache line of CPU1 go from M to O!

    +
    +
    +

    Then, the CPU1 dcache actually goes ahead and creates a response for CPU2, since it has the data. This response is sent back to the crossbar, which will forward it back to CPU2.

    +
    +
    +

    This also makes the crossbar not forward the original request to DRAM as mentioned at Not forwarding.

    +
    +
    +

    This reply from CPU1 reaches the crossbar at:

    +
    +
    +
    +
    471044000: Event: system.cpu1.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 94 executed @ 471044000
    +471044000: CoherentXBar: system.membus: recvTimingSnoopResp: src system.membus.slave[6] packet ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000
    +471044000: SnoopFilter: system.membus.snoop_filter: updateSnoopResponse: rsp system.membus.slave[6] req system.membus.slave[10] packet ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000
    +471044000: SnoopFilter: system.membus.snoop_filter: updateSnoopResponse:   old SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000
    +471044000: SnoopFilter: system.membus.snoop_filter: updateSnoopResponse:   new SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101000
    +471044000: CoherentXBar: system.membus: recvTimingSnoopResp: src system.membus.slave[6] packet ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 FWD RESP
    +471044000: Event: system.membus.slave[10]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 186 scheduled @ 471046000
    +471044000: Event: system.membus.respLayer10.wrapped_function_event: EventFunctionWrapped 187 scheduled @ 471049000
    +471044000: BaseXBar: system.membus.respLayer10: The crossbar layer is now busy from tick 471044000 to 471049000
    +
    +
    +
    +

    and finally, at long last, CPU2 receives the snoop reply that was created in CPU1 and sent back through the crossbar, and the LDR completes:

    +
    +
    +
    +
    471046000: Event: system.membus.slave[10]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 186 executed @ 471046000
    +471046000: Cache: system.cpu2.dcache: recvTimingResp: Handling response ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000
    +471046000: Cache: system.cpu2.dcache: Block for addr 0x2040 being updated in Cache
    +471046000: CacheRepl: system.cpu2.dcache: Replacement victim: state: 0 (I) valid: 0 writable: 0 readable: 0 dirty: 0 | tag: 0xffffffffffffffff set: 0x81 way: 0
    +471046000: Cache: system.cpu2.dcache: Block addr 0x2040 (ns) moving from state 0 (I) to state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
    +471046000: Cache: system.cpu2.dcache: serviceMSHRTargets: updated cmd to ReadRespWithInvalidate [2068:206f] D=2f00000000000000 ptr=0x5628c3d26f00
    +471046000: Event: system.cpu2.dcache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 138 scheduled @ 471047000
    +471046000: Cache: system.cpu2.dcache: processing deferred snoop...
    +471046000: CacheVerbose: system.cpu2.dcache: handleSnoop: for UpgradeReq [2040:207f] D= ptr=0x5628c2d37b80
    +471046000: Cache: system.cpu2.dcache: handleSnoop: snoop hit for UpgradeReq [2040:207f] D= ptr=0x5628c2d37b80, old state is state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
    +471046000: Cache: system.cpu2.dcache: new state is state: 0 (I) valid: 0 writable: 0 readable: 0 dirty: 0 | tag: 0xffffffffffffffff set: 0x81 way: 0
    +471046000: CacheVerbose: system.cpu2.dcache: recvTimingResp: Leaving with ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000
    +471047000: Event: system.cpu2.dcache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 138 executed @ 471047000
    +471047000: SimpleCPU: system.cpu2.dcache_port: Received load/store response 0x2068
    +471047000: Event: Event_136: Timing CPU dcache tick 136 scheduled @ 471047000
    +471047000: Event: Event_136: Timing CPU dcache tick 136 executed @ 471047000
    +471034000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+24    :   ldr   x1, [x2]           : MemRead :  D=0x000000000000002f A=0x412068  flags=(IsInteger|IsMemRef|IsLoad)
    +
    +
    +
    +

    We note therefore that no DRAM access was involved, one cache services the other directly!

    +
    +
    +

    Tested on LKMC d429552cdeb0fc0a08cff8e627bf501eaffb068f + 1, gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772.

    +
    +
    +
    +
    19.22.4.6. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs and Ruby

    Now let’s do the exact same thing we did for gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs, but with Ruby rather than the classic memory system, and with TimingSimpleCPU (atomic does not work with Ruby).

    @@ -25869,7 +26814,7 @@ non-atomic 19
    -
    19.21.4.6. gem5 event queue MinorCPU syscall emulation freestanding example analysis
    +
    19.22.4.7. gem5 event queue MinorCPU syscall emulation freestanding example analysis

    The events for the Atomic CPU were pretty simple: basically just ticks.

    @@ -26039,14 +26984,14 @@ non-atomic 19
    -
    19.21.4.6.1. gem5 event queue MinorCPU syscall emulation freestanding example analysis: hazard
    +
    19.22.4.7.1. gem5 event queue MinorCPU syscall emulation freestanding example analysis: hazard

    TODO like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard but with the hazard.

    -
    19.21.4.7. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis
    +
    19.22.4.8. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis

    Like gem5 event queue MinorCPU syscall emulation freestanding example analysis but even more complex, since it is for the gem5 DerivO3CPU!

    @@ -26074,7 +27019,7 @@ non-atomic 19

    This section and children are tested at LKMC 144a552cf926ea630ef9eadbb22b79fe2468c456.

    -
    19.21.4.7.1. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazardless
    +
    19.22.4.8.1. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazardless

    Let’s have a look at the arguably simplest example userland/arch/aarch64/freestanding/linux/hazardless.S.

    @@ -26266,7 +27211,7 @@ non-atomic 19

    so we observe that the first two instructions arrived, and the CPU noticed that 0x400080 hasn’t been fetched yet.

    -

    Then for several cycles that follow, the fetch stage just says that it is blocked on data returning, e.g. the

    +

    Then for several cycles that follow, the fetch stage just says that it is blocked on data returning:

    @@ -26313,7 +27258,7 @@ non-atomic 19
    -
    19.21.4.7.2. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard
    +
    19.22.4.8.2. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard

    Now let’s do the same as in gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazardless but with a hazard: userland/arch/aarch64/freestanding/linux/hazard.S.

    @@ -26357,7 +27302,7 @@ non-atomic 19
    -
    19.21.4.7.3. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard4
    +
    19.22.4.8.3. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard4

    Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard but a hazard of depth 4: userland/arch/aarch64/freestanding/linux/hazard.S.

    @@ -26398,7 +27343,7 @@ non-atomic 19
    -
    19.21.4.7.4. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall
    +
    19.22.4.8.4. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall

    Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard but now with an LDR stall: userland/arch/aarch64/freestanding/linux/stall.S.

    @@ -26449,7 +27394,7 @@ non-atomic 19
    -
    19.21.4.7.5. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall-gain
    +
    19.22.4.8.5. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall-gain

    Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall but now with an LDR stall: userland/arch/aarch64/freestanding/linux/stall-gain.S.

    @@ -26536,7 +27481,7 @@ non-atomic 19
    -
    19.21.4.7.6. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall-hazard4
    +
    19.22.4.8.6. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall-hazard4

    Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall-gain but now with some dependencies after the LDR: userland/arch/aarch64/freestanding/linux/stall-hazard4.S.

    @@ -26603,7 +27548,7 @@ non-atomic 19
    -
    19.21.4.7.7. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: speculative
    +
    19.22.4.8.7. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: speculative

    Now let’s try to see some Speculative execution in action with userland/arch/aarch64/freestanding/linux/speculative.S.

    @@ -26792,7 +27737,7 @@ wbActual:0
    -

    19.21.5. gem5 instruction definitions

    +

    19.22.5. gem5 instruction definitions

    This is one of the parts of gem5 that rely on semi-useless code generation inside the .isa sublanguage.

    @@ -26835,7 +27780,7 @@ wbActual:0
    -

    We also notice that the key argument passed to those instructions is of type ExecContext, which is discussed further at: Section 19.21.7.3, “gem5 ExecContext.

    +

    We also notice that the key argument passed to those instructions is of type ExecContext, which is discussed further at: Section 19.22.6.3, “gem5 ExecContext”.

    The file is an include so that compilation can be split up into chunks by the autogenerated includers.

    @@ -27040,7 +27985,7 @@ namespace ArmISAInst {

    Tested in gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.

    -
    19.21.5.1. gem5 execute vs initiateAcc vs completeAcc
    +
    19.22.5.1. gem5 execute vs initiateAcc vs completeAcc

    These are the key methods defined in instruction definitions, so let’s see when each one gets called and more or less what they do.
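
    To keep the three apart, here is a rough, self-contained sketch with stand-in types (not actual gem5 code, and exact signatures vary between gem5 versions) of the entry points a memory instruction class exposes, and which CPU model drives which:

    // Stand-ins for gem5's Fault, ExecContext and Packet types.
    struct Fault {};
    struct ExecContext {};
    struct Packet {};
    using PacketPtr = Packet *;

    struct MemInstSketch {
        // AtomicSimpleCPU: translate, access memory and write back the result,
        // all within this single call.
        Fault execute(ExecContext *xc) const { return {}; }

        // TimingSimpleCPU and other timing models: only *initiate* the access;
        // the instruction stays pending while the request travels through the
        // memory system as events...
        Fault initiateAcc(ExecContext *xc) const { return {}; }

        // ...and this runs when the response packet comes back, e.g. writing
        // the loaded data into the destination register via the ExecContext.
        Fault completeAcc(PacketPtr pkt, ExecContext *xc) const { return {}; }
    };

    int main() {
        MemInstSketch inst;
        ExecContext xc;
        inst.execute(&xc);
    }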

    @@ -27094,7 +28039,7 @@ namespace ArmISAInst {

    This can be seen concretely in GDB from the analysis done at: TimingSimpleCPU analysis: LDR stall and for more memory details see gem5 functional vs atomic vs timing memory requests.

    -
    19.21.5.1.1. gem5 completeAcc
    +
    19.22.5.1.1. gem5 completeAcc

    completeAcc is boring on most simple store memory instructions, e.g. a simple STR:

    @@ -27147,7 +28092,7 @@ namespace ArmISAInst {
    -
    19.21.5.2. gem5 microops
    +
    19.22.5.2. gem5 microops

    Some gem5 instructions break down into multiple microops.

    @@ -27208,272 +28153,7 @@ namespace ArmISAInst {
    -

    19.21.6. gem5 port system

    -
    -

    The gem5 memory system is connected in a very flexible way through the port system.

    -
    -
    -

    This system exists to allow seamlessly connecting any combination of CPU, caches, interconnects, DRAM and peripherals.
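
    As a toy illustration of the idea, with made-up classes rather than gem5's real Port hierarchy: a "send" on one side just forwards to the "recv" of whichever peer port was bound, which is what lets arbitrary components be wired together:

    #include <cassert>
    #include <cstdio>

    struct Port {
        Port *peer = nullptr;
        virtual ~Port() = default;
        void bind(Port &p) { peer = &p; p.peer = this; }
        // Overridden by components that actually expect to receive requests.
        virtual void recv(int packet) { assert(!"this port does not expect requests"); }
        // Sending is boring: just forward to the bound peer.
        void send(int packet) { assert(peer); peer->recv(packet); }
    };

    struct MemSidePort : Port {};  // e.g. a CPU's port towards the memory system

    struct CpuSidePort : Port {    // e.g. a memory controller's port towards the CPU
        void recv(int packet) override { printf("memory received packet %d\n", packet); }
    };

    int main() {
        MemSidePort cpu;
        CpuSidePort mem;
        cpu.bind(mem);  // connect the two components
        cpu.send(42);   // forwarded to CpuSidePort::recv
    }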

    -
    -
    -
    19.21.6.1. gem5 functional vs atomic vs timing memory requests
    -
    -

    gem5 memory requests can be classified in the following broad categories: functional, atomic and timing.

    -
    -
    - -
    -
    -

    This trichotomy can be notably seen in the definition of the MasterPort class:

    -
    -
    -
    -
    class MasterPort : public Port, public AtomicRequestProtocol,
    -    public TimingRequestProtocol, public FunctionalRequestProtocol
    -
    -
    -
    -

    and the base classes are defined under src/mem/protocol/.

    -
    -
    -

    Then, by reading the rest of the class, we see that the send methods are all boring, and just forward to some polymorphic receiver that does the actual interesting activity:

    -
    -
    -
    -
        Tick
    -    sendAtomicSnoop(PacketPtr pkt)
    -    {
    -        return AtomicResponseProtocol::sendSnoop(_masterPort, pkt);
    -    }
    -
    -    Tick
    -    AtomicResponseProtocol::sendSnoop(AtomicRequestProtocol *peer, PacketPtr pkt)
    -    {
    -        assert(pkt->isRequest());
    -        return peer->recvAtomicSnoop(pkt);
    -    }
    -
    -
    -
    -

    The receive methods are therefore the interesting ones, and must be overridden on derived classes if they ever expect to receive such requests:

    -
    -
    -
    -
        Tick
    -    recvAtomicSnoop(PacketPtr pkt) override
    -    {
    -        panic("%s was not expecting an atomic snoop request\n", name());
    -        return 0;
    -    }
    -
    -    void
    -    recvFunctionalSnoop(PacketPtr pkt) override
    -    {
    -        panic("%s was not expecting a functional snoop request\n", name());
    -    }
    -
    -    void
    -    recvTimingSnoopReq(PacketPtr pkt) override
    -    {
    -        panic("%s was not expecting a timing snoop request.\n", name());
    -    }
    -
    -
    -
    -

    One question that comes up now is: but why do CPUs need to care about snoop requests?

    -
    -
    -

    And one big answer is: to be able to implement LLSC atomicity as mentioned at: ARM LDXR and STXR instructions, since when other cores update memory, they could invalidate the lock of the current core.

    -
    -
    -

    Then, as you might expect, we can see that for example AtomicSimpleCPU does not override recvTimingSnoopReq.

    -
    -
    -

    Now let’s see which requests are generated by an ordinary ARM LDR instruction. We run:

    -
    -
    -
    -
    ./run \
    -  --arch aarch64 \
    -  --debug-vm \
    -  --emulator gem5 \
    -  --gem5-build-type debug \
    -  --userland userland/arch/aarch64/freestanding/linux/hello.S \
    -
    -
    -
    -

    and then break at the methods of the LDR class LDRXL64_LIT: gem5 execute vs initiateAcc vs completeAcc.

    -
    -
    -

    Before starting, we of course guess that:

    -
    -
    -
      -
    • -

      AtomicSimpleCPU will be making atomic accesses from execute

      -
    • -
    • -

      TimingSimpleCPU will be making timing accesses from initiateAcc, which must generate the event which leads to completeAcc

      -
    • -
    -
    -
    -

    so let’s confirm it.

    -
    -
    -

    We break on ArmISAInst::LDRXL64_LIT::execute which is what AtomicSimpleCPU uses, and that leads as expected to:

    -
    -
    -
    -
    MasterPort::sendAtomic
    -AtomicSimpleCPU::sendPacket
    -AtomicSimpleCPU::readMem
    -SimpleExecContext::readMem
    -readMemAtomic<(ByteOrder)1, ExecContext, unsigned long>
    -readMemAtomicLE<ExecContext, unsigned long>
    -ArmISAInst::LDRXL64_LIT::execute
    -AtomicSimpleCPU::tick
    -
    -
    -
    -

    Notably, AtomicSimpleCPU::readMem immediately translates the address, creates a packet, sends the atomic request, and gets the response back without any events.

    -
    -
    -

    And now if we do the same with --cpu-type TimingSimpleCPU and break at ArmISAInst::LDRXL64_LIT::initiateAcc, and then add another breakpoint for the next event schedule with b EventManager::schedule (which we imagine is the memory read), we reach:

    -
    -
    -
    -
    EventManager::schedule
    -DRAMCtrl::addToReadQueue
    -DRAMCtrl::recvTimingReq
    -DRAMCtrl::MemoryPort::recvTimingReq
    -TimingRequestProtocol::sendReq
    -MasterPort::sendTimingReq
    -CoherentXBar::recvTimingReq
    -CoherentXBar::CoherentXBarSlavePort::recvTimingReq
    -TimingRequestProtocol::sendReq
    -MasterPort::sendTimingReq
    -TimingSimpleCPU::handleReadPacket
    -TimingSimpleCPU::sendData
    -TimingSimpleCPU::finishTranslation
    -DataTranslation<TimingSimpleCPU*>::finish
    -ArmISA::TLB::translateComplete
    -ArmISA::TLB::translateTiming
    -ArmISA::TLB::translateTiming
    -TimingSimpleCPU::initiateMemRead
    -SimpleExecContext::initiateMemRead
    -initiateMemRead<ExecContext, unsigned long>
    -ArmISAInst::LDRXL64_LIT::initiateAcc
    -TimingSimpleCPU::completeIfetch
    -TimingSimpleCPU::IcachePort::ITickEvent::process
    -EventQueue::serviceOne
    -
    -
    -
    -

    so as expected we have TimingRequestProtocol::sendReq.

    -
    -
    -

    Remember however that timing requests are a bit more complicated due to paging, since the page table walk can itself lead to further memory requests.

    -
    -
    -

    In this particular instance, the address being read with the ldr x2, =len ARM LDR pseudo-instruction is likely placed just after the text section, so its translation is already in the TLB due to previous instruction fetches. That is why the translation finishes immediately by going through TimingSimpleCPU::finishTranslation; some key snippets are:

    -
    -
    -
    -
    TLB::translateComplete(const RequestPtr &req, ThreadContext *tc,
    -        Translation *translation, Mode mode, TLB::ArmTranslationType tranType,
    -        bool callFromS2)
    -{
    -    bool delay = false;
    -    Fault fault;
    -    if (FullSystem)
    -        fault = translateFs(req, tc, mode, translation, delay, true, tranType);
    -    else
    -        fault = translateSe(req, tc, mode, translation, delay, true);
    -    if (!delay)
    -        translation->finish(fault, req, tc, mode);
    -    else
    -        translation->markDelayed();
    -
    -
    -
    -

    and then translateSe does not use delay at all, so we learn that in syscall emulation, delay is always false and things progress immediately there. And then further down TimingSimpleCPU::finishTranslation does some more fault checking:

    -
    -
    -
    -
    void
    -TimingSimpleCPU::finishTranslation(WholeTranslationState *state)
    -{
    -    if (state->getFault() != NoFault) {
    -        translationFault(state->getFault());
    -    } else {
    -        if (!state->isSplit) {
    -            sendData(state->mainReq, state->data, state->res,
    -                     state->mode == BaseTLB::Read);
    -
    -
    -
    -

    Tested in gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.

    -
    -
    -
    19.21.6.1.1. gem5 functional requests
    -
    -

    As seen at gem5 functional vs atomic vs timing memory requests, functional requests are not used in common simulation, since the core must always go through caches.

    -
    -
    -

    Functional accesses are therefore only used for more magical simulation functionality.

    -
    -
    -

    One such functionality is the gem5 syscall emulation mode implementation of the futex system call, which is done at futexFunc in src/sim/syscall_emul.hh.

    -
    -
    -

    As seen from man futex, the Linux kernel reads the value from an address that is given as the first argument of the call.

    -
    -
    -

    Therefore, here it makes sense for the gem5 syscall implementation, which does not actually have a real kernel running, to just make a functional request and be done with it: the impact of the cache changes this read would cause is insignificant compared to the cost of the full context switch that a real syscall would entail.
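
    To make the FUTEX_WAIT semantics concrete, here is a minimal user-space sketch of what the kernel (or gem5's futexFunc) conceptually does with that first argument: read the word, compare it to the expected value, and only block the caller if they still match. These are plain C++ stand-ins, not gem5 code:

    #include <cstdint>
    #include <cstdio>

    // Conceptual FUTEX_WAIT check: the kernel/simulator reads the word at uaddr
    // and only puts the caller to sleep if it still equals 'expected'.
    // In gem5 syscall emulation this read is done with a functional memory
    // access, since no real kernel and no cache effects need to be modeled here.
    bool futexWaitWouldBlock(const uint32_t *uaddr, uint32_t expected) {
        return *uaddr == expected;
    }

    int main() {
        uint32_t word = 1;
        printf("would block: %d\n", futexWaitWouldBlock(&word, 1));  // 1: still locked, sleep
        word = 0;
        printf("would block: %d\n", futexWaitWouldBlock(&word, 1));  // 0: value changed, retry
    }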

    -
    -
    -

    It is generally hard to implement functional requests for Ruby runs, because packets are flying through the memory system in a transient state, and there is no simple way of finding exactly which ones might have the latest version of the memory. See for example:

    -
    - -
    -

    The typical error message in that case is:

    -
    -
    -
    -
    fatal: Ruby functional read failed for address
    -
    -
    -
    -
    -
    -
    -

    19.21.7. gem5 ThreadContext vs ThreadState vs ExecContext vs Process

    +

    19.22.6. gem5 ThreadContext vs ThreadState vs ExecContext vs Process

    These classes get used everywhere, and they have a somewhat convoluted relation to one another, so let’s figure out this mess.

    @@ -27484,7 +28164,7 @@ TimingSimpleCPU::finishTranslation(WholeTranslationState *state)

    This section and all children tested at gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.

    -
    19.21.7.1. gem5 ThreadContext
    +
    19.22.6.1. gem5 ThreadContext

    As we delve into more details below, we will reach the following conclusion: a ThreadContext represents one thread of a CPU with multiple hardware threads.

    @@ -27534,7 +28214,7 @@ typedef SimpleThread MinorThread;

    Essentially all methods of the base ThreadContext are pure virtual.

    -
    19.21.7.1.1. gem5 SimpleThread
    +
    19.22.6.1.1. gem5 SimpleThread

    SimpleThread storage is defined on BaseSimpleCPU for simple CPUs like AtomicSimpleCPU:

    @@ -27629,7 +28309,7 @@ typedef SimpleThread MinorThread;
    -
    19.21.7.1.2. gem5 O3ThreadContext
    +
    19.22.6.1.2. gem5 O3ThreadContext

    Instantiation happens in the FullO3CPU constructor:

    @@ -27730,7 +28410,7 @@ FullO3CPU<Impl>::readArchIntReg(int reg_idx, ThreadID tid)
    -
    19.21.7.2. gem5 ThreadState
    +
    19.22.6.2. gem5 ThreadState

    One ThreadState is owned per ThreadContext.

    @@ -27776,7 +28456,7 @@ class O3ThreadContext : public ThreadContext
    -
    19.21.7.3. gem5 ExecContext
    +
    19.22.6.3. gem5 ExecContext

    ExecContext gets used in gem5 instruction definitions, e.g.:

    @@ -27936,7 +28616,7 @@ class O3ThreadContext : public ThreadContext

    This makes sense, since each ThreadContext represents one CPU register set, and therefore needs a separate ExecContext which allows instruction implementations to access those registers.

    -
    19.21.7.3.1. gem5 ExecContext::readIntRegOperand register resolution
    +
    19.22.6.3.1. gem5 ExecContext::readIntRegOperand register resolution

    Let’s have a look at how ExecContext::readIntRegOperand actually matches registers to decoded register IDs, since it is not obvious.

    @@ -28209,7 +28889,7 @@ flattenIntIndex(int reg) const
    -
    19.21.7.4. gem5 Process
    +
    19.22.6.4. gem5 Process

    The Process class is used only in gem5 syscall emulation mode. It represents a userland process, much like a Linux process, together with any further gem5-specific data needed to model it.

    @@ -28297,12 +28977,12 @@ readFunc(SyscallDesc *desc, ThreadContext *tc,
    -

    19.21.8. gem5 functional units

    +

    19.22.7. gem5 functional units

    Each instruction is marked with a class, and each class can execute in a given functional unit.

    -
    19.21.8.1. gem5 MinorCPU default functional units
    +
    19.22.7.1. gem5 MinorCPU default functional units

    Which units are available is visible for example on the gem5 config.ini of a gem5 MinorCPU run. Functional units are not present in simple CPUs like gem5 TimingSimpleCPU.

    @@ -28461,7 +29141,7 @@ opClass=IntAlu
    -
    19.21.8.2. gem5 DerivO3CPU default functional units
    +
    19.22.7.2. gem5 DerivO3CPU default functional units

    On gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772, after running:

    @@ -28559,7 +29239,7 @@ pipelined=false
    -

    19.21.9. gem5 code generation

    +

    19.22.8. gem5 code generation

    gem5 uses a ton of code generation, which makes the project horrendous:

    @@ -28604,7 +29284,7 @@ pipelined=false

    But it has been widely overused to insanity. It likely also exists partly because when the project started in 2003 C++ compilers weren’t that good, so you couldn’t rely on features like templates that much.

    -
    19.21.9.1. gem5 THE_ISA
    +
    19.22.8.1. gem5 THE_ISA

    Generated code at: build/<ISA>/config/the_isa.hh which e.g. for ARM contains:

    @@ -28650,9 +29330,9 @@ enum class Arch {
    -

    19.21.10. gem5 build system

    +

    19.22.9. gem5 build system

    -
    19.21.10.1. M5_OVERRIDE_PY_SOURCE
    +
    19.22.9.1. M5_OVERRIDE_PY_SOURCE
    @@ -28667,7 +29347,7 @@ enum class Arch {
    -
    19.21.10.2. gem5 build broken on recent compiler version
    +
    19.22.9.2. gem5 build broken on recent compiler version

    gem5 moves a bit slowly, and if your host compiler is very new, the gem5 build might be broken for it, e.g. this was the case for Ubuntu 19.10 with GCC 9 and gem5 62d75e7105fe172eb906d4f80f360ff8591d4178 from Dec 2019.

    @@ -28692,7 +29372,7 @@ enum class Arch {
    -
    19.21.10.3. gem5 polymorphic ISA includes
    +
    19.22.9.3. gem5 polymorphic ISA includes

    E.g. src/cpu/decode_cache.hh includes:

    @@ -28771,7 +29451,7 @@ build/ARM/config/the_isa.hh
    -
    19.21.10.4. Why are all C++ symlinked into the gem5 build dir?
    +
    19.22.9.4. Why are all C++ symlinked into the gem5 build dir?

    Upstream request: https://gem5.atlassian.net/browse/GEM5-469

    @@ -30099,8 +30779,24 @@ echo 1 > /proc/sys/vm/overcommit_memory
    22.1.2.1. atomic.c
    -
    +
    +
      +
    • userland/c/atomic.c

      +
    • +
    • +

      userland/c/atomic/: files in this directory use the same technique as atomic.cpp, i.e. with one special case per file.

      +
      +

      Maybe userland/c/atomic.c should be deprecated in favor of those more minimal ones.

      +
      +
      +

      This was added because the C++ pre-main is too bloated, especially when we turn on a gazillion gem5 logs: it makes me want to cry.

      +
      +
      +

      And we want a single operation per test rather than two as in atomic.c, because when using gem5 we want absolute control over the microbenchmark.

      +
      +
    • +

    Demonstrates atomic_int and thrd_create.

    @@ -30514,6 +31210,9 @@ mkdir -p bin/c

    userland/cpp/atomic/

    +

    C version at: atomic.c.

    +
    +

    In this set of examples, we exemplify various synchronization mechanisms, including assembly specific ones, by using the convenience of C++ multithreading:

    @@ -31574,6 +32273,9 @@ my type is MyClassToString and a is 1 and b is 2
    +
  • +

    rootfs_overlay/lkmc/nodejs/http.js: http module to create a simple HTTP server: https://nodejs.org/api/http.html

    +
  • @@ -42624,7 +43326,49 @@ west build -b qemu_aarch64 samples/hello_world
    -

    33.6. Cache coherence

    +

    33.6. Caches

    +
    +

    https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec15.pdf contains some of the first pictures you should see.

    +
    +
    +

    In a direct-mapped cache architecture (every address has a single possible block), a memory address can be broken up into:

    +
    +
    +
    +
    +-----+-------+--------------+
    +|     |       |              | full address
    ++-----+-------+--------------+
    +|     |       |              |
    +| tag | index | block offset |
    +
    +
    +
    +

    where:

      • block offset: selects the byte within the cache block
      • index: selects which cache line (or set, for associative caches) the address maps to
      • tag: the remaining high bits, stored alongside the block and compared against to detect hits

    If the cache is set associative, we simply make the index smaller and add bits to the tag.

    +
    +
    +

    For example, for a 2-way associative cache, we remove one bit from the index and add it to the tag.
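
    As a quick sanity check of that bit accounting, here is a small sketch that computes the field widths for a hypothetical 32 KiB cache with 64-byte blocks, in direct-mapped and 2-way configurations; all sizes are assumptions for the example, and __builtin_ctz assumes GCC or Clang:

    #include <cstdio>

    int main() {
        // Hypothetical cache used only for this example: 32 KiB total,
        // 64 B blocks, 64-bit addresses.
        const unsigned addrBits  = 64;
        const unsigned cacheSize = 32 * 1024;
        const unsigned blockSize = 64;
        for (unsigned ways = 1; ways <= 2; ways *= 2) {
            const unsigned sets       = cacheSize / (blockSize * ways);
            const unsigned offsetBits = __builtin_ctz(blockSize);  // byte within block
            const unsigned indexBits  = __builtin_ctz(sets);       // which set
            const unsigned tagBits    = addrBits - indexBits - offsetBits;
            // Going from 1-way to 2-way moves exactly one bit from index to tag.
            printf("%u-way: offset=%u index=%u tag=%u\n",
                   ways, offsetBits, indexBits, tagBits);
        }
    }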

    +
    +
    +

    33.6.1. Cache coherence

    @@ -42665,23 +43409,23 @@ west build -b qemu_aarch64 samples/hello_world

    Even if caches are coherent, this is still not enough to avoid data race conditions, because this does not enforce atomicity of read modify write sequences. This is for example shown at: Detailed gem5 analysis of how data races happen.

    -
    -

    33.6.1. Memory consistency

    +
    +
    33.6.1.1. Memory consistency

    According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf "memory consistency" is about ordering requirements of different memory addresses.

    This is represented explicitly in C++ for example C++ std::memory_order.

    -
    -
    33.6.1.1. Sequential Consistency
    +
    +
    33.6.1.1.1. Sequential Consistency

    According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf, the strongest possible consistency, everything nicely ordered as you’d expect.

    -
    -

    33.6.2. Can caches snoop data from other caches?

    +
    +
    33.6.1.2. Can caches snoop data from other caches?

    Either they can snoop only control, or both control and data can be snooped.

    @@ -42695,8 +43439,8 @@ west build -b qemu_aarch64 samples/hello_world

    If data snoops are not possible, then data must always go to DRAM first.

    -
    -

    33.6.3. VI cache coherence protocol

    +
    +
    33.6.1.3. VI cache coherence protocol

    Mentioned at:

    @@ -42942,8 +43686,8 @@ west build -b qemu_aarch64 samples/hello_world

    why PrWr stays in invalid? Why do writes always go to memory? Why not wait until eviction?

    -
    -

    33.6.4. MSI cache coherence protocol

    +
    +
    33.6.1.4. MSI cache coherence protocol
    @@ -43254,8 +43998,8 @@ CACHE2 S nyy

    TODO gem5 concrete example.

    -
    -
    33.6.4.1. MSI cache coherence protocol with transient states
    +
    +
    33.6.1.4.1. MSI cache coherence protocol with transient states

    TODO understand well why those are needed.

    @@ -43274,8 +44018,8 @@ CACHE2 S nyy
    -
    -

    33.6.5. MESI cache coherence protocol

    +
    +
    33.6.1.5. MESI cache coherence protocol
    @@ -43334,8 +44078,8 @@ CACHE2 S nyy

    gem5 12c917de54145d2d50260035ba7fa614e25317a3 has two Ruby MESI models implemented: MESI_Two_Level and MESI_Three_Level.

    -
    -

    33.6.6. MOSI cache coherence protocol

    +
    +
    33.6.1.6. MOSI cache coherence protocol

    https://en.wikipedia.org/wiki/MOSI_protocol The critical MSI vs MOSI section was a bit bogus though: https://en.wikipedia.org/w/index.php?title=MOSI_protocol&oldid=895443023 but I edited it :-)

    @@ -43394,8 +44138,8 @@ CACHE2 S nyy

    This therefore saves one memory write through and its bus traffic.

    -
    -

    33.6.7. MOESI cache coherence protocol

    +
    -

    In gem5 9fc9c67b4242c03f165951775be5cd0812f2a705, MOESI is the default cache coherency protocol of the classic memory system as shown at Section 19.21.4.3.1, “What is the coherency protocol implemented by the classic cache system in gem5?”.

    +

    In gem5 9fc9c67b4242c03f165951775be5cd0812f2a705, MOESI is the default cache coherency protocol of the classic memory system as shown at Section 19.22.4.3.1, “What is the coherency protocol implemented by the classic cache system in gem5?”.

    -

    A good an simple example showing several MOESI transitions in the classic memory model can be seen at: Section 19.21.4.4, “gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs”.

    +

    A good and simple example showing several MOESI transitions in the classic memory model can be seen at: Section 19.22.4.4, “gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs”.

    gem5 12c917de54145d2d50260035ba7fa614e25317a3 has several Ruby MOESI models implemented: MOESI_AMD_Base, MOESI_CMP_directory, MOESI_CMP_token and MOESI_hammer.

    @@ -43415,6 +44159,7 @@ CACHE2 S nyy
    +