diff --git a/index.html b/index.html index 31505b1..b24c3c7 100644 --- a/index.html +++ b/index.html @@ -1280,7 +1280,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 19.20.4.3.1. What is the coherency protocol implemented by the classic cache system in gem5?
  • -
  • 19.20.4.4. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs
  • +
  • 19.20.4.4. gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs + +
  • 19.20.4.5. gem5 event queue MinorCPU syscall emulation freestanding example analysis
  • 19.20.4.6. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis
  • @@ -1389,7 +1393,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b -
  • 21.1.2. C multithreading
  • +
  • 21.1.2. C multithreading + +
  • 21.1.3. GCC C extensions @@ -6670,6 +6687,23 @@ asdf=qwer

    Source: userland/linux/init_env_poweroff.c.

    +
    +

    As of the Linux kernel v5.7 (possibly earlier, I’ve skipped a few releases), boot also shows the init arguments and environment very clearly, which is a great addition:

    +
    +
    +
    +
    <6>[    0.309984] Run /sbin/init as init process
    +<7>[    0.309991]   with arguments:
    +<7>[    0.309997]     /sbin/init
    +<7>[    0.310004]     nokaslr
    +<7>[    0.310010]     -
    +<7>[    0.310016]   with environment:
    +<7>[    0.310022]     HOME=/
    +<7>[    0.310028]     TERM=linux
    +<7>[    0.310035]     earlyprintk=pl011,0x1c090000
    +<7>[    0.310041]     lkmc_home=/lkmc
    +
    +

    6.4.1. init arguments

    @@ -15903,11 +15937,17 @@ ps

    kernel_modules/dump_regs.c

  • -

    baremetal/arch/aarch64/dump_regs.c

    +

    userland/arch/arm/dump_regs.c

    +
  • +
  • +

    userland/arch/aarch64/dump_regs.c

  • baremetal/arch/arm/dump_regs.c

  • +
  • +

    baremetal/arch/aarch64/dump_regs.c

    +
  • @@ -18568,14 +18608,36 @@ getconf _NPROCESSORS_CONF
    -

    Or from User mode simulation, we can use sysconf with userland/linux/sysconf.c or C++ multithreading's userland/cpp/thread_hardware_concurrency.cpp:

    +

    Or from User mode simulation, we can use either of:

    +
    + +
    19.2.2.1.1. QEMU user mode multithreading
    @@ -21058,6 +21120,9 @@ Exiting @ tick 18446744073709551615 because simulate() limit reached

    How it goes faster is explained at: https://stackoverflow.com/questions/59860091/how-to-increase-the-simulation-speed-of-a-gem5-run/59861375#59861375

    +

    Disables debug symbols (no -g) for some reason.

    +
    +

    Benchmarks present at:

    @@ -21325,7 +21390,7 @@ cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"

    This is the simplest of all protocols, and therefore the first one you should study to learn how Ruby works.

    -

    To study it, we can take an approach similar to what was done at: gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs.

    +

    To study it, we can take an approach similar to what was done at: gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs.

    Our full command line will be something like

    @@ -21367,7 +21432,7 @@ cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"

    One simple example of its operation can be seen at: Section 19.20.4.2, “gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis”

    -

    But arguably interesting effects can only be observed when we have more than 1 CPUs as in gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs.

    +

    But arguably interesting effects can only be observed when we have more than one CPU, as in gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs.

    TODO: describe it in more detail. It appears to be a very simple mechanism.

    @@ -22010,6 +22075,16 @@ static EmbeddedPyBind embed_obj("BadDevice", module_init, "BasicPioDevice");To get a feeling of how SimObject objects are run, see: gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis.

    +

    Bibliography:

    +
    +
    + +
    +

    Tested on gem5 08c79a194d1a3430801c04f37d13216cc9ec1da3.

    @@ -22127,7 +22202,7 @@ for module in modules.keys():

    And from IPDB we see that this appears to loop over every object string of type m5.objects.modulename.

    -

    This init gets called from src/python/importer.py at the exec:

    +

    This __init__ gets called from src/python/importer.py at the exec:

    @@ -24338,15 +24413,12 @@ type=SetAssociative

    The actual representation is done via separate state bits: https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/cache/cache_blk.hh#L66 and MOESI appears explicitly only on the pretty printing.

    -

    This pretty printing appears for example in the --trace Cache lines as shown at gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches.

    +

    This pretty printing appears for example in the --trace Cache lines as shown at gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and with a few more transitions visible at Section 19.20.4.4, “gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs”.
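The hex `state:` value printed in those Cache lines can be decoded from the four flags printed next to it. The following is a minimal sketch, assuming the bit values valid=1, writable=2, readable=4 and dirty=8. That assignment is inferred from trace lines such as `state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1` and was not checked against cache_blk.hh. The function name `moesi_letter` is made up for this illustration.

```python
# Bit values inferred from the trace output (assumption, not taken from gem5 source).
VALID, WRITABLE, READABLE, DIRTY = 0x1, 0x2, 0x4, 0x8


def moesi_letter(state: int) -> str:
    """Map a state bitmask from a Cache trace line to a MOESI letter.

    Only the five combinations seen in the logs (0, 5, 7, d, f) are
    meaningful here; other bit patterns are not treated specially.
    """
    if not state & VALID:
        return "I"
    dirty = bool(state & DIRTY)
    writable = bool(state & WRITABLE)
    if dirty:
        # Dirty and writable: Modified. Dirty but not writable: Owned.
        return "M" if writable else "O"
    # Clean and writable: Exclusive. Clean and not writable: Shared.
    return "E" if writable else "S"


for value in (0x0, 0x5, 0x7, 0xD, 0xF):
    print(f"{value:x} ({moesi_letter(value)})")
```

This reproduces the `state: 7 (E)`, `state: 5 (S)`, `state: f (M)`, `state: d (O)` and `state: 0 (I)` pairs that appear in the trace output.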

    -
    19.20.4.4. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs
    -
    -

    TODO is this the minimal setup that allows us to see the gem5 crossbar interconnect? Can we see anything in AtomicSimpleCPU?

    -
    +
    19.20.4.4. gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs

    It would be amazing to analyze a simple example with interconnect packets possibly invalidating caches of other CPUs.

    @@ -24357,20 +24429,22 @@ type=SetAssociative

    If we don’t use such instructions that flush memory, we would only see the interconnect at work when caches run out.

    -

    For this study, we will use the same CLI as gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis but with multiple CPUs and a content like atomic.cpp which shares a variable across threads.

    +

    For this study, we will use the same CLI as gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis but with multiple CPUs and a multithreaded program which shares a variable across threads.

    -

    We will then focus on the behaviour or the memory of the shared variable to see if we can observe cache coherency on the crossbar:

    +

    We can use userland/c/atomic.c (see also C multithreading) at LKMC 7c01b29f1ee7da878c7cc9cb4565f3f3cf516a92 and gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1, as in Detailed gem5 analysis of how data races happen:

    ./run \
       --arch aarch64 \
    -  --cli-args '2 100' \
    +  --cli-args '2 10' \
       --cpus 3 \
       --emulator gem5 \
    -  --gem5-worktree master3 \
    -  --userland userland/cpp/atomic/aarch64_add.cpp \
    +  --trace FmtFlag,Cache,DRAM,ExecAll,XBar \
    +  --userland userland/c/atomic.c \
    +  -- \
    +  --caches \
     ;
    @@ -24383,6 +24457,218 @@ type=SetAssociative
    Figure 5. config.dot.svg for a system with two TimingSimpleCPU with caches.
    +
    +

    Once again we focus on the shared function region my_thread_main which is where the interesting cross core memory collisions will be happening.

    +
    +
    +

    As a maybe-not-so-interesting aside, we have a look at the very first my_thread_main icache hit points:

    +
    +
    +
    +
    93946000: Cache: system.cpu1.icache: access for ReadReq [8b0:8b3] IF miss
    +93946000: Cache: system.cpu1.icache: createMissPacket: created ReadCleanReq [880:8bf] IF from ReadReq [8b0:8b3] IF
    +93946000: Cache: system.cpu1.icache: handleAtomicReqMiss: Sending an atomic ReadCleanReq [880:8bf] IF
    +93946000: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[5] packet ReadCleanReq [880:8bf] IF
    +93946000: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[5] packet ReadCleanReq [880:8bf] IF SF size: 1 lat: 1
    +93946000: Cache: system.cpu0.icache: handleSnoop: snoop hit for ReadCleanReq [880:8bf] IF, old state is state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
    +93946000: Cache: system.cpu0.icache: new state is state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
    +93946000: DRAM: system.mem_ctrls: recvAtomic: ReadCleanReq 0x880
    +93946000: Cache: system.cpu1.icache: handleAtomicReqMiss: Receive response: ReadResp [880:8bf] IF in state 0
    +93946000: Cache: system.cpu1.icache: Block addr 0x880 (ns) moving from state 0 to state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
    +93946000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main    :   sub   sp, sp, #48        : IntAlu :  D=0x0000003fffd6b9a0  flags=(IsInteger)
    +93946500: Cache: system.cpu1.icache: access for ReadReq [8b4:8b7] IF hit state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
    +93946500: Cache: system.cpu1.dcache: access for WriteReq [a19a8:a19af] hit state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0x14 set: 0x66 way: 0
    +93946500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+4    :   str   x0, [sp, #8]       : MemWrite :  D=0x0000007ffffefc70 A=0x3fffd6b9a8  flags=(IsInteger|IsMemRef|IsStore)
    +
    +
    +
    +

    Now that we know how to read cache logs from gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches, it is easier to understand what happened:

    +
    +
    + +
    +
    +

    Now let’s look for the incremented integer address that is shared across threads. We know from Detailed gem5 analysis of how data races happen that the read happens at my_thread_main+36, so searching for the first occurrence:

    +
    +
    +
    +
    93952500: Cache: system.cpu1.icache: access for ReadReq [8d4:8d7] IF hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x23 way: 0
    +93952500: Cache: system.cpu1.dcache: access for ReadReq [2060:2063] miss
    +93952500: Cache: system.cpu1.dcache: createMissPacket: created ReadSharedReq [2040:207f] from ReadReq [2060:2063]
    +93952500: Cache: system.cpu1.dcache: handleAtomicReqMiss: Sending an atomic ReadSharedReq [2040:207f]
    +93952500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[6] packet ReadSharedReq [2040:207f]
    +93952500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[6] packet ReadSharedReq [2040:207f] SF size: 0 lat: 1
    +93952500: DRAM: system.mem_ctrls: recvAtomic: ReadSharedReq 0x2040
    +93952500: Cache: system.cpu1.dcache: handleAtomicReqMiss: Receive response: ReadResp [2040:207f] in state 0
    +93952500: Cache: system.cpu1.dcache: Block addr 0x2040 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
    +93952500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000000 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
    +
    +
    +
    +

    so we determine that its physical address is 0x2060. It was a miss, and then it went into E.

    +
    +
    +

    So we look ahead to the following accesses to that physical address, before CPU2 reaches that point of the code and starts making requests as well.

    +
    +
    +

    First there is the STR for the first LDR which is of course a hit:

    +
    +
    +
    +
    93954500: Cache: system.cpu1.dcache: access for WriteReq [2060:2063] hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
    +93954500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+52    :   str   x1, [x0]           : MemWrite :  D=0x0000000000000001 A=0x411060  flags=(IsInteger|IsMemRef|IsStore)
    +
    +
    +
    +

    It found the line in E, so we presume that it moves it to M. Then the second read confirms that it was in M:

    +
    +
    +
    +
    93964500: Cache: system.cpu1.dcache: access for ReadReq [2060:2063] hit state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
    +93964500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000001 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
    +
    +
    +
    +

    and so on.

    +
    +
    +

    Now let’s jump to when CPU2 starts making requests.

    +
    +
    +

    The first time this happens is on its first LDR at:

    +
    +
    +
    +
    94058500: Cache: system.cpu2.dcache: access for ReadReq [2060:2063] miss
    +94058500: Cache: system.cpu2.dcache: createMissPacket: created ReadSharedReq [2040:207f] from ReadReq [2060:2063]
    +94058500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Sending an atomic ReadSharedReq [2040:207f]
    +94058500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet ReadSharedReq [2040:207f]
    +94058500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet ReadSharedReq [2040:207f] SF size: 1 lat: 1
    +94058500: Cache: system.cpu1.dcache: handleSnoop: snoop hit for ReadSharedReq [2040:207f], old state is state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
    +94058500: Cache: system.cpu1.dcache: new state is state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
    +94058500: CoherentXBar: system.membus: recvAtomicBackdoor: Not forwarding ReadSharedReq [2040:207f]
    +94058500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Receive response: ReadResp [2040:207f] in state 0
    +94058500: Cache: system.cpu2.dcache: Block addr 0x2040 (ns) moving from state 0 to state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
    +94058500: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000009 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
    +
    +
    +
    +

    and from this we see:

    +
    +
    + +
    +
    +

    It also appears that no DRAM was accessed since there are no logs for it, so did the XBar get the value directly from the other cache? TODO: why did the earlier 93946000: DRAM read happen then, since CPU0 had the line when CPU1 asked for it?

    +
    +
    +

    The above log sequence also makes it clear that it is the XBar that maintains coherency: it appears that the CPU2 cache tells the XBar what it is doing, and then the XBar tells the caches of other CPUs about it, which leads CPU1 to move to O.

    +
    +
    +

    Then CPU1 hits its LDR on O:

    +
    +
    +
    +
    94060500: Cache: system.cpu1.dcache: access for ReadReq [2060:2063] hit state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
    +94060500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000009 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
    +
    +
    +
    +

    and then CPU2 writes, moving itself to M and moving CPU1 to I:

    +
    +
    +
    +
    94060500: Cache: system.cpu2.dcache: access for WriteReq [2060:2063] hit state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
    +94060500: Cache: system.cpu2.dcache: createMissPacket: created UpgradeReq [2040:207f] from WriteReq [2060:2063]
    +94060500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Sending an atomic UpgradeReq [2040:207f]
    +94060500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet UpgradeReq [2040:207f]
    +94060500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet UpgradeReq [2040:207f] SF size: 1 lat: 1
    +94060500: Cache: system.cpu1.dcache: handleSnoop: snoop hit for UpgradeReq [2040:207f], old state is state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
    +94060500: Cache: system.cpu1.dcache: new state is state: 0 (I) valid: 0 writable: 0 readable: 0 dirty: 0 | tag: 0xffffffffffffffff set: 0x81 way: 0
    +94060500: CoherentXBar: system.membus: recvAtomicBackdoor: Not forwarding UpgradeReq [2040:207f]
    +94060500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Receive response: UpgradeResp [2040:207f] in state 5
    +94060500: Cache: system.cpu2.dcache: Block addr 0x2040 (ns) moving from state 5 to state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
    +94060500: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+52    :   str   x1, [x0]           : MemWrite :  D=0x000000000000000a A=0x411060  flags=(IsInteger|IsMemRef|IsStore)
    +
    +
    +
    +

    and so on, they just keep fighting over that address and changing one another’s state.
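The transitions seen in the log excerpts above can be collected into a small lookup table and replayed per cache. This is only a toy summary of what this particular trace shows, not a model of how gem5 implements the classic cache. The event names (`local_read_no_sharer`, `snoop_upgrade` and so on) are labels invented for this sketch and are not gem5 identifiers.

```python
# (current state, event) -> next state, limited to what the trace above shows.
# Event names are hypothetical labels for this sketch, not gem5 identifiers.
OBSERVED = {
    ("I", "local_read_no_sharer"): "E",    # cpu1.dcache at 93952500
    ("I", "local_read_with_sharer"): "S",  # cpu2.dcache at 94058500
    ("E", "local_write"): "M",             # cpu1.dcache, seen as M at 93964500
    ("E", "snoop_read"): "S",              # cpu0.icache at 93946000
    ("M", "snoop_read"): "O",              # cpu1.dcache at 94058500
    ("S", "local_write_upgrade"): "M",     # cpu2.dcache at 94060500
    ("O", "snoop_upgrade"): "I",           # cpu1.dcache at 94060500
}


def replay(state, events):
    """Apply a sequence of events to one cache line and return its final state."""
    for event in events:
        state = OBSERVED[(state, event)]
    return state


# What happened to the line holding the shared counter in each dcache.
cpu1 = replay(
    "I",
    ["local_read_no_sharer", "local_write", "snoop_read", "snoop_upgrade"],
)
cpu2 = replay("I", ["local_read_with_sharer", "local_write_upgrade"])
print(cpu1, cpu2)  # I M
```

Any (state, event) pair that the trace did not exercise raises a `KeyError`, which is deliberate: the table only records what was actually observed.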

    +
    +
    +
    19.20.4.4.1. gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs and Ruby
    +
    +

    Now let’s do the exact same thing we did for gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs, but with Ruby rather than the classic system.

    +
    +
    +

    Since we have fully understood coherency in that previous example, it should now be easier to understand what is going on with Ruby:

    +
    +
    +
    +
    ./run \
    +  --arch aarch64 \
    +  --cli-args '2 10' \
    +  --cpus 3 \
    +  --emulator gem5 \
    +  --trace FmtFlag,DRAM,ExecAll,Ruby \
    +  --userland userland/c/atomic.c \
    +  -- \
    +  --cpu-type AtomicSimpleCPU \
    +  --ruby \
    +;
    +
    +
    +
    +

    Note that now the --trace Cache,XBar flags have no effect, since Ruby replaces those classic memory model components entirely with the Ruby version, so we enable the Ruby flag version instead. Note however that this flag is very verbose and produces about 10x more output than the classic memory experiment.

    +
    +
    +

    Also remember that ARM’s default Ruby protocol is 'MOESI_CMP_directory'.

    +
    +
    +

    First we note that the output of the experiment is the same:

    +
    +
    +
    +
    atomic 20
    +non-atomic 19
    +
    +
    +
    +

    TODO

    +
    +
    19.20.4.5. gem5 event queue MinorCPU syscall emulation freestanding example analysis
    @@ -27059,6 +27345,13 @@ cd ../..
    +
  • +

    time.h

    +
    + +
    +
  • @@ -27240,23 +27546,100 @@ echo 1 > /proc/sys/vm/overcommit_memory

    Added in C11!

    -
    - -

    Bibliography:

    +
    +
    21.1.2.1. atomic.c
    +
    +

    userland/c/atomic.c

    +
    +
    +

    Demonstrates atomic_int and thrd_create.

    +
    +
    +

    Disassembly with GDB at LKMC 619fef4b04bddc4a5a38aec5e207dd4d5a25d206 + 1:

    +
    +
    +
    +
    ./run-toolchain \
    +  --arch aarch64 gdb \
    +  -- \
    +  -batch \
    +  -ex 'disas/rs my_thread_main' $(./getvar \
    +  --arch aarch64 userland_build_dir)/c/atomic.out \
    +;
    +
    +
    +
    +

    shows on ARM:

    +
    +
    +
    +
    16              ++cnt;
    +   0x00000000004008cc <+28>:    80 00 00 b0     adrp    x0, 0x411000 <malloc@got.plt>
    +   0x00000000004008d0 <+32>:    00 80 01 91     add     x0, x0, #0x60
    +   0x00000000004008d4 <+36>:    00 00 40 b9     ldr     w0, [x0]
    +   0x00000000004008d8 <+40>:    01 04 00 11     add     w1, w0, #0x1
    +   0x00000000004008dc <+44>:    80 00 00 b0     adrp    x0, 0x411000 <malloc@got.plt>
    +   0x00000000004008e0 <+48>:    00 80 01 91     add     x0, x0, #0x60
    +   0x00000000004008e4 <+52>:    01 00 00 b9     str     w1, [x0]
    +
    +17              ++acnt;
    +   0x00000000004008e8 <+56>:    20 00 80 52     mov     w0, #0x1                        // #1
    +   0x00000000004008ec <+60>:    e0 1b 00 b9     str     w0, [sp, #24]
    +   0x00000000004008f0 <+64>:    e0 1b 40 b9     ldr     w0, [sp, #24]
    +   0x00000000004008f4 <+68>:    e2 03 00 2a     mov     w2, w0
    +   0x00000000004008f8 <+72>:    80 00 00 b0     adrp    x0, 0x411000 <malloc@got.plt>
    +   0x00000000004008fc <+76>:    00 70 01 91     add     x0, x0, #0x5c
    +   0x0000000000400900 <+80>:    03 00 e2 b8     ldaddal w2, w3, [x0]
    +   0x0000000000400904 <+84>:    61 00 02 0b     add     w1, w3, w2
    +   0x0000000000400908 <+88>:    e0 03 01 2a     mov     w0, w1
    +   0x000000000040090c <+92>:    e0 1f 00 b9     str     w0, [sp, #28]
    +
    +
    +
    +

    so:

    +
    +
    + +
    +
    +

    With -O3:

    +
    +
    +
    +
    16              ++cnt;
    +   0x0000000000400a00 <+32>:    60 00 40 b9     ldr     w0, [x3]
    +   0x0000000000400a04 <+36>:    00 04 00 11     add     w0, w0, #0x1
    +   0x0000000000400a08 <+40>:    60 00 00 b9     str     w0, [x3]
    +
    +17              ++acnt;
    +   0x0000000000400a0c <+44>:    20 00 80 52     mov     w0, #0x1                        // #1
    +   0x0000000000400a10 <+48>:    40 00 e0 b8     ldaddal w0, w0, [x2]
    +
    +
    +
    +

    so the situation is the same but without all the horrible stack noise.

    +
    +

    21.1.3. GCC C extensions

    @@ -27338,6 +27721,16 @@ echo 1 > /proc/sys/vm/overcommit_memory
    +
  • +

    virtual and polymorphism

    +
    + +
    +
  • @@ -27691,6 +28084,111 @@ time ./mutex.out 4 100000000 +
    +
    21.2.2.1.1. Detailed gem5 analysis of how data races happen
    +
    +

    The smallest data race we managed to come up with as of LKMC 7c01b29f1ee7da878c7cc9cb4565f3f3cf516a92 and gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1 was with userland/c/atomic.c (see also C multithreading):

    +
    +
    +
    +
    ./run \
    +  --arch aarch64 \
    +  --cli-args '2 10' \
    +  --cpus 3 \
    +  --emulator gem5 \
    +  --userland userland/c/atomic.c \
    +;
    +
    +
    +
    +

    which outputs:

    +
    +
    +
    +
    atomic 20
    +non-atomic 19
    +
    +
    +
    +
    ./run -aA -eg -u userland/c/atomic.c --cli-args '2 200' --cpus 3 --userland-build-id o3 -N1 --trace ExecAll -- --caches --cpu-type TimingSimpleCPU
    +

    Note that the system is very minimal, and doesn’t even have caches, so I’m curious as to how this can happen at all.

    +
    +
    +

    So first we do a run with --trace Exec and look at the my_thread_main entries.

    +
    +
    +

    From there we see that first CPU1 enters the function, since it was spawned first.

    +
    +
    +

    Then for some time, both CPU1 and CPU2 are running at the same time.

    +
    +
    +

    Finally, CPU1 exits, then CPU2 runs alone for a while to finish its loops, and then CPU2 exits.

    +
    +
    +

    By grepping the LDR data read from the log, we are able to easily spot the moment where things started to go wrong based on the D= data:

    +
    +
    +
    +
    grep -E 'my_thread_main\+36' trace.txt > trace-ldr.txt
    +
    +
    +
    +

    The grep output contains

    +
    +
    +
    +
    94024500: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000006 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
    +94036500: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000007 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
    +94048500: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000008 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
    +94058500: system.cpu2: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000009 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
    +94060500: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000009 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
    +94070500: system.cpu2: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x000000000000000a A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
    +94082500: system.cpu2: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x000000000000000b A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
    +
    +
    +
    +

    and so we see that it is at 94058500 that things started going bad, since two consecutive loads from different CPUs read the same value D=9! Actually, things were not too bad afterwards because this was by coincidence the last CPU1 read. We would have missed many more increments if the number of iterations had been larger.
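Spotting the bad moment by eye works for a short log, but the same check can be automated. The following is a rough sketch that scans grep output in the format shown above and reports the first load whose value is read again by the next load from a different CPU. The function name `first_duplicate_load` and the regular expression are assumptions made for this illustration, and the sample lines are shortened copies of the grep output.

```python
import re

# Matches "<tick>: [<flag>: ]system.cpuN: ... D=0x<hex>"; the optional flag
# prefix covers traces taken with FmtFlag enabled.
LINE_RE = re.compile(r"^(\d+): (?:\w+: )?system\.(cpu\d+):.*D=0x([0-9a-f]+)")


def first_duplicate_load(lines):
    """Return (tick, value) of the first load whose value is read again by the
    next load from a different CPU, or None if there is no such pair."""
    prev = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        tick, cpu, value = int(match.group(1)), match.group(2), int(match.group(3), 16)
        if prev is not None and prev[1] != cpu and prev[2] == value:
            return prev[0], value
        prev = (tick, cpu, value)
    return None


SAMPLE = [
    "94048500: system.cpu1: A0 T0 : @my_thread_main+36 : ldr x0, [x0] : MemRead : D=0x0000000000000008 A=0x411060",
    "94058500: system.cpu2: A0 T0 : @my_thread_main+36 : ldr x0, [x0] : MemRead : D=0x0000000000000009 A=0x411060",
    "94060500: system.cpu1: A0 T0 : @my_thread_main+36 : ldr x0, [x0] : MemRead : D=0x0000000000000009 A=0x411060",
    "94070500: system.cpu2: A0 T0 : @my_thread_main+36 : ldr x0, [x0] : MemRead : D=0x000000000000000a A=0x411060",
]
print(first_duplicate_load(SAMPLE))  # (94058500, 9)
```

To run it on a real log, pass the lines of trace-ldr.txt instead of `SAMPLE`.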

    +
    +
    +

    Now that we have the first bad time, let’s look at the fuller disassembly to better understand what happens around that point.

    +
    +
    +
    +
    94058500: system.cpu2: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000009 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
    +94059000: system.cpu2: A0 T0 : @my_thread_main+40    :   add   w1, w0, #1         : IntAlu :  D=0x000000000000000a  flags=(IsInteger)
    +94059000: system.cpu1: A0 T0 : @my_thread_main+120    :   b.cc   <my_thread_main+28> : IntAlu :   flags=(IsControl|IsDirectControl|IsCondControl)
    +94059500: system.cpu1: A0 T0 : @my_thread_main+28    :   adrp   x0, #69632        : IntAlu :  D=0x0000000000411000  flags=(IsInteger)
    +94059500: system.cpu2: A0 T0 : @my_thread_main+44    :   adrp   x0, #69632        : IntAlu :  D=0x0000000000411000  flags=(IsInteger)
    +94060000: system.cpu2: A0 T0 : @my_thread_main+48    :   add   x0, x0, #96        : IntAlu :  D=0x0000000000411060  flags=(IsInteger)
    +94060000: system.cpu1: A0 T0 : @my_thread_main+32    :   add   x0, x0, #96        : IntAlu :  D=0x0000000000411060  flags=(IsInteger)
    +94060500: system.cpu1: A0 T0 : @my_thread_main+36    :   ldr   x0, [x0]           : MemRead :  D=0x0000000000000009 A=0x411060  flags=(IsInteger|IsMemRef|IsLoad)
    +94060500: system.cpu2: A0 T0 : @my_thread_main+52    :   str   x1, [x0]           : MemWrite :  D=0x000000000000000a A=0x411060  flags=(IsInteger|IsMemRef|IsStore)
    +
    +
    +
    +

    and from this, all becomes crystal clear:

    +
    +
    + +
    +
    +

    so we see that CPU2 just happened to store after CPU1 loads.

    +
    +
    +

    We also understand why LDADD solves the race problem in AtomicSimpleCPU: it does the load and store in one single go!
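The lost update can be replayed deterministically outside of the simulator. The following is a toy sketch, not gem5 or ARM code: `racy_increment` replays separate load, add and store steps in a given order, while `atomic_increment` stands in for an LDADD-like operation where the load and store cannot be separated. Both function names are made up for this illustration. The order of the LDR and STR steps follows the trace around tick 94058500, and CPU1's subsequent add and store are assumed to follow as in any other iteration.

```python
def racy_increment(memory, schedule):
    """Replay (cpu, op) steps, where op is 'ldr', 'add' or 'str'."""
    regs = {}
    for cpu, op in schedule:
        if op == "ldr":
            regs[cpu] = memory["cnt"]   # load the shared counter into a register
        elif op == "add":
            regs[cpu] += 1              # increment the private register copy
        elif op == "str":
            memory["cnt"] = regs[cpu]   # write the register back to memory
    return memory["cnt"]


def atomic_increment(memory, cpus):
    """Each CPU increments in a single indivisible step."""
    for _ in cpus:
        memory["cnt"] += 1
    return memory["cnt"]


# CPU2 loads 9, CPU1 loads 9 before CPU2 stores, so one increment is lost.
schedule = [
    ("cpu2", "ldr"), ("cpu2", "add"),
    ("cpu1", "ldr"), ("cpu2", "str"),
    ("cpu1", "add"), ("cpu1", "str"),
]
print(racy_increment({"cnt": 9}, schedule))             # 10
print(atomic_increment({"cnt": 9}, ["cpu2", "cpu1"]))   # 11
```

With the interleaved schedule both CPUs store 10, which matches the single missing increment in the `non-atomic 19` output.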

    +
    +
    21.2.2.2. C++ std::memory_order
    @@ -27700,6 +28198,9 @@ time ./mutex.out 4 100000000

    TODO let’s understand that fully one day.

    +
    +

    This is the C++ version of the more general Memory consistency concept.

    +
    21.2.2.3. C++ parallel algorithms
    @@ -30593,6 +31094,9 @@ child after parent sleep

    userland/linux/getcpu.c: a wrapper close to the syscall that also returns the current NUMA node

  • +

    userland/linux/getcpu_syscall.c: the wrapper segfaults on error handling, so double checking with the real syscall: https://stackoverflow.com/questions/9260937/unix-socket-error-14-efault-bad-address/61879849#61879849

    +
  • +
  • userland/linux/sched_getcpu_barrier.c: this uses a barrier to ensure that gem5 will run each thread on one separate CPU

  • @@ -38266,6 +38770,31 @@ west build -b qemu_aarch64 samples/hello_world

    https://en.wikipedia.org/wiki/Cache_coherence

    +

    In simple terms, when a certain group of caches of different CPUs are coherent, reads on one core always see the writes previously made by other cores. TODO: is it that strict, or just ordering? TODO: what about simultaneous reads and writes?

    +
    +
    +

    http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf mentions that:

    +
    +
    +

    Cache coherence:

    +
    +
    + +
    +
    +

    And notably it contrasts that with Memory consistency, which according to them is about ordering requirements on different addresses.

    +
    +

    Algorithms to keep the caches of different cores of a system coherent. Only matters for multicore systems.

    @@ -38275,13 +38804,25 @@ west build -b qemu_aarch64 samples/hello_world

    The main software use case example to have in mind is that of multiple threads incrementing an atomic counter as in userland/cpp/atomic/std_atomic.cpp, see also: atomic.cpp. Then, if one processor writes to the cache, other processors have to know about it before they read from that address.

    -

    Note that cache coherency only applies to memory read/write instructions that explicitly make coherency requirements.

    -
    -
    -

    In most ISAs, this tends to be the minority of instructions, and is only used when something is going to modify memory that is known to be shared across threads. For example, the a x86 LOCK would be used to increment atomic counters that get incremented across several threads. Outside of those cases, cache coherency is not guaranteed, and behaviour is undefined.

    +

    Even if caches are coherent, this is still not enough to avoid data race conditions, because this does not enforce atomicity of read modify write sequences. This is for example shown at: Detailed gem5 analysis of how data races happen.

    -

    32.2.1. Can caches snoop data from other caches?

    +

    32.2.1. Memory consistency

    +
    +

    According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf "memory consistency" is about ordering requirements of different memory addresses.

    +
    +
    +

    This is represented explicitly in C++, for example by C++ std::memory_order.

    +
    +
    +
    32.2.1.1. Sequential Consistency
    +
    +

    According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf, this is the strongest possible consistency: everything is nicely ordered as you’d expect.

    +
    +
    +
    +
    +

    32.2.2. Can caches snoop data from other caches?

    Either they can snoop only control, or both control and data can be snooped.

    @@ -38296,7 +38837,7 @@ west build -b qemu_aarch64 samples/hello_world
    -

    32.2.2. VI cache coherence protocol

    +

    32.2.3. VI cache coherence protocol

    Mentioned at:

    @@ -38543,7 +39084,7 @@ west build -b qemu_aarch64 samples/hello_world
    -

    32.2.3. MSI cache coherence protocol

    +

    32.2.4. MSI cache coherence protocol

    https://en.wikipedia.org/wiki/MSI_protocol

    @@ -38855,7 +39396,7 @@ CACHE2 S nyy

    TODO gem5 concrete example.

    -
    32.2.3.1. MSI cache coherence protocol with transient states
    +
    32.2.4.1. MSI cache coherence protocol with transient states

    TODO understand well why those are needed.

    @@ -38875,7 +39416,7 @@ CACHE2 S nyy
    -

    32.2.4. MESI cache coherence protocol

    +

    32.2.5. MESI cache coherence protocol

    https://en.wikipedia.org/wiki/MESI_protocol

    @@ -38896,22 +39437,48 @@ CACHE2 S nyy

    Exclusive is entered from Invalid after a PrRd, but only if the reply came from DRAM (or if we snooped that no one sent the reply to DRAM for us to read it)! If the reply came from another cache, we go directly to shared instead. It is this extra information that allows for the split of S.

    +

    This is why the simplified transition diagram shown in many places e.g.: https://upload.wikimedia.org/wikipedia/commons/c/c1/Diagrama_MESI.GIF is not a proper state machine: I can go to either S or E given a PrRd.

    +
    +

    The advantage of this over MSI is that when we move from Exclusive to Modified, no invalidate message is required, reducing bus traffic: https://en.wikipedia.org/wiki/MESI_protocol#Advantages_of_MESI_over_MSI

    -

    This is a common case on read write modify loops. On MSI, it would first do PrRd, send BusRd (to move any M to S), get data, and go to Shared, then PrWr must send BusUpgr to invalidate other Shared and move to M.

    +

    This is a common case in read-modify-write loops. On MSI, it would:

    +
    +
    +
    -

    With MESI, the PrRd could go to E instead of S depending on who services it. If it does go to E, then the PrWr only moves it to M, there is no need to send BusUpgr because we know that no one else is in S.

    +

    With MESI:

    +
    +
    +

    gem5 12c917de54145d2d50260035ba7fa614e25317a3 has two Ruby MESI models implemented: MESI_Two_Level and MESI_Three_Level.

    -

    32.2.5. MOSI cache coherence protocol

    +

    32.2.6. MOSI cache coherence protocol

    -

    https://en.wikipedia.org/wiki/MOSI_protocol The critical MSI vs MOSI section was a bit bogus though: https://en.wikipedia.org/w/index.php?title=MOSI_protocol&oldid=895443023 we have to edit it.

    +

    https://en.wikipedia.org/wiki/MOSI_protocol The critical MSI vs MOSI section was a bit bogus though: https://en.wikipedia.org/w/index.php?title=MOSI_protocol&oldid=895443023 but I edited it :-)

    In MSI, it feels wasteful that an MS transaction needs to flush to memory: why do we need to flush right now, since even more caches now have that data? Why not wait until later and try to gain something from this deferral?

    @@ -38969,7 +39536,7 @@ CACHE2 S nyy
    -

    32.2.6. MOESI cache coherence protocol

    +

    32.2.7. MOESI cache coherence protocol

    https://en.wikipedia.org/wiki/MOESI_protocol

    @@ -38980,6 +39547,9 @@ CACHE2 S nyy

    In gem5 9fc9c67b4242c03f165951775be5cd0812f2a705, MOESI is the default cache coherency protocol of the classic memory system as shown at Section 19.20.4.3.1, “What is the coherency protocol implemented by the classic cache system in gem5?”.

    +

    A good and simple example showing several MOESI transitions in the classic memory model can be seen at: Section 19.20.4.4, “gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs”.

    +
    +

    gem5 12c917de54145d2d50260035ba7fa614e25317a3 has several Ruby MOESI models implemented: MOESI_AMD_Base, MOESI_CMP_directory, MOESI_CMP_token and MOESI_hammer.