gem5: one concrete minimal example of a coherentxbar snoop
217 README.adoc
@@ -6311,6 +6311,8 @@ This likely comes from the ifdef split at `init/main.c`:

`start_kernel` is a good definition of it: https://stackoverflow.com/questions/18266063/does-kernel-have-main-function/33422401#33422401

In gem5 aarch64 Linux v4.18, experimentally the entry point of secondary CPUs seems to be `secondary_holding_pen`, as shown at https://gist.github.com/cirosantilli2/34a7bc450fcb6c1c1a910369be1fdd90

=== Kernel module APIs

==== Kernel module parameters

@@ -13723,9 +13725,11 @@ Crossbar or `XBar` in the code, is the default <<cache-coherence,CPU interconnec

It presumably implements a crossbar switch along the lines of: https://en.wikipedia.org/wiki/Crossbar_switch

One simple example of its operation can be seen at: xref:gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis[xrefstyle=full]
This is the best introductory example analysis we have so far: <<gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus>>. It contains more or less the most minimal example in which something interesting can be observed: multiple cores fighting over a single data memory variable.

But arguably interesting effects can only be observed when we have more than 1 CPU, as in <<gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus>>.
Long story short: the interconnect contains the snoop mechanism, and it forwards packets coming from the caches of one CPU to the caches of other CPUs in which the block is present.

It is therefore the heart of the <<cache-coherence>> mechanism, as it informs other caches of bus transactions they need to know about.

TODO: describe it in more detail. It appears to be a very simple mechanism.
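
To make this concrete, here is a minimal sketch in plain C, entirely ours rather than gem5 code, of the kind of bookkeeping a snoop filter can do: one sharer bit per cache per line, which is also how the `SF value` bitmasks read in the logs further below:

....
#include <assert.h>
#include <stdint.h>

/* Hypothetical snoop filter entry: who may hold one given cache line. */
typedef struct {
    uint64_t line_addr; /* block-aligned address */
    uint8_t sharers;    /* bit i set: cache i may hold this line */
} snoop_entry_t;

/* A read request for this line arrives from cache src: return the set of
 * other caches that must be snooped, and record src as a new sharer. */
uint8_t snoop_lookup_request(snoop_entry_t *e, unsigned src) {
    uint8_t to_snoop = e->sharers & (uint8_t)~(1u << src);
    e->sharers |= (uint8_t)(1u << src);
    return to_snoop;
}

int main(void) {
    snoop_entry_t e = {0x2040, 1u << 1};            /* CPU1 holds the line */
    assert(snoop_lookup_request(&e, 2) == 1u << 1); /* CPU2 reads: snoop CPU1 */
    assert(e.sharers == ((1u << 1) | (1u << 2)));   /* both are sharers now */
}
....
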
@@ -14023,6 +14027,10 @@ and their selection can be seen under: `src/dev/arm/RealView.py`, e.g.:
cur_sys.boot_loader = [ loc('boot_emm.arm64'), loc('boot_emm.arm') ]
....

The bootloader basically just sets up a bit of CPU state and jumps to the kernel entry point.

In aarch64 at least, CPUs other than CPU0 are also started up briefly, run some initialization, and are made to wait on a WFE. This can be seen easily by booting a multicore Linux kernel run with <<gem5-execall-trace-format>>.

=== gem5 memory system

Parent section: <<gem5-internals>>.

@@ -16509,6 +16517,22 @@ so we understand that by default the classic cache:
* has 16KiB total size
* uses LRURP https://en.wikipedia.org/wiki/Cache_replacement_policies[replacement policy]. LRU is a well known policy, "LRU RP" seems to simply stand for "LRU Replacement Policy". Other policies can be seen under: https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/cache/replacement_policies/[src/mem/cache/replacement_policies/]

At:

....
#7 0: Cache: system.cpu.icache: access for ReadReq [78:7b] IF miss
#8 0: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 1000
#9 1000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 executed @ 1000
#10 1000: Cache: system.cpu.icache: sendMSHRQueuePacket: MSHR ReadReq [78:7b] IF
#12 1000: Cache: system.cpu.icache: createMissPacket: created ReadCleanReq [40:7f] IF from ReadReq [78:7b] IF
....

we can briefly see the <<gem5-mshr>> doing its thing.

At time 0, the CPU icache wants to read, so it creates a <<gem5-packet,packet>> that reads 4 bytes only (`[78:7b]`) for the instruction, and that goes into the MSHR, to be treated in a future event.

At 1000, the future event is executed, and so it reads the original packet from the MSHR, and uses that to create a new request `[40:7f]` which gets forwarded.
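
The widening is simple address arithmetic. A quick sanity check of those numbers in a standalone C snippet of ours, assuming the 64-byte cache line size these logs suggest:

....
#include <assert.h>
#include <stdint.h>

int main(void) {
    uint64_t req_addr = 0x78; /* the 4-byte instruction fetch: [78:7b] */
    uint64_t line_size = 64;
    uint64_t line_base = req_addr & ~(line_size - 1);
    assert(line_base == 0x40);                 /* the miss packet: [40:7f] */
    assert(line_base + line_size - 1 == 0x7f);
}
....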

====== What is the coherency protocol implemented by the classic cache system in gem5?

<<moesi>>: https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/cache/cache_blk.hh#L352

@@ -16673,6 +16697,182 @@ and then CPU2 writes moving to M and moving CPU1 to I:

and so on, they just keep fighting over that address and changing one another's state.

===== gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs

Like <<gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus>> but with <<gem5-timingsimplecpu>> and link:userland/c/atomic/aarch64_add.c[]:

....
./build-userland --arch aarch64 --optimization-level 3 --userland-build-id o3
./run \
  --arch aarch64 \
  --cli-args '2 1000' \
  --cpus 3 \
  --emulator gem5 \
  --trace FmtFlag,CacheAll,DRAM,Event,ExecAll,SimpleCPU,XBar \
  --userland userland/c/atomic/aarch64_add.c \
  --userland-build-id o3 \
  -- \
  --caches \
  --cpu-type TimingSimpleCPU \
;
....

This is arguably the best experiment to study the <<gem5-crossbar-interconnect>>.

We increase the loop count to 1000 because a lower count of 100 did not show memory conflicts. The output is:

....
expect 200
global 147
....

Let's double check what it compiles to with <<disas>>:

....
./disas --arch aarch64 --userland userland/c/atomic/aarch64_add.c --userland-build-id o3 my_thread_main
....

which contains:

....
0x0000000000400a70 <+0>: 03 00 40 f9 ldr x3, [x0]
0x0000000000400a74 <+4>: 63 01 00 b4 cbz x3, 0x400aa0 <my_thread_main+48>
0x0000000000400a78 <+8>: 82 00 00 d0 adrp x2, 0x412000 <malloc@got.plt>
0x0000000000400a7c <+12>: 42 a0 01 91 add x2, x2, #0x68
0x0000000000400a80 <+16>: 00 00 80 d2 mov x0, #0x0 // #0
0x0000000000400a84 <+20>: 1f 20 03 d5 nop
0x0000000000400a88 <+24>: 41 00 40 f9 ldr x1, [x2]
0x0000000000400a8c <+28>: 21 04 00 91 add x1, x1, #0x1
0x0000000000400a90 <+32>: 41 00 00 f9 str x1, [x2]
0x0000000000400a94 <+36>: 00 04 00 91 add x0, x0, #0x1
0x0000000000400a98 <+40>: 7f 00 00 eb cmp x3, x0
0x0000000000400a9c <+44>: 68 ff ff 54 b.hi 0x400a88 <my_thread_main+24> // b.pmore
0x0000000000400aa0 <+48>: 00 00 80 52 mov w0, #0x0 // #0
0x0000000000400aa4 <+52>: c0 03 5f d6 ret
....

Grepping the logs for `my_thread_main+24` shows where the first non-atomic interleaves happen:

....
471039000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+24 : ldr x1, [x2] : MemRead : D=0x000000000000002f A=0x412068 flags=(IsInteger|IsMemRef|IsLoad)
471034000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+24 : ldr x1, [x2] : MemRead : D=0x000000000000002f A=0x412068 flags=(IsInteger|IsMemRef|IsLoad)
471059000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+44 : b.hi <my_thread_main+24> : IntAlu : flags=(IsControl|IsDirectControl|IsCondControl)
471070000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+44 : b.hi <my_thread_main+24> : IntAlu : flags=(IsControl|IsDirectControl|IsCondControl)
471071000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+24 : ldr x1, [x2] : MemRead : D=0x0000000000000030 A=0x412068 flags=(IsInteger|IsMemRef|IsLoad)
....

after a long string of cpu1 hits, since CPU1 was forked first and therefore had more time to run that operation.

From those logs and the ones around them, we deduce that:

* the shared address of interest is 0x412068
* the physical address is 0x2068
* it fits into the cache line `[2040:207f]`
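
Those numbers can be sanity checked with a bit of arithmetic, here as a standalone C snippet of ours, assuming the same 64-byte lines: note how the block number 0x2068 >> 6 also matches the `set: 0x81` values printed in the cache logs below:

....
#include <assert.h>
#include <stdint.h>

int main(void) {
    uint64_t paddr = 0x2068;
    assert((paddr & ~UINT64_C(0x3f)) == 0x2040); /* line [2040:207f] */
    assert((paddr >> 6) == 0x81);                /* "set: 0x81" in the logs */
}
....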

With that guide, we look at the fuller logs around that region of interest. We start at the first ifetch that CPU2 does for our LDR of interest at 0x400a88:

....
471033000: SimpleCPU: system.cpu2: Fetch
471033000: SimpleCPU: system.cpu2: Translating address 0x400a88
....

Things get a bit interleaved with CPU1, but soon afterwards we see the miss forwarding via <<gem5-mshr>> as in <<gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches>>:

....
471034000: Cache: system.cpu2.dcache: access for ReadReq [2068:206f] D=b0d989c328560000 ptr=0x5628c3d26f00 miss
471034000: CachePort: system.cpu2.dcache.mem_side: Scheduling send event at 471035000
471034000: Event: system.cpu2.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 140 scheduled @ 471035000
....

Before the request moves on, some CPU1 action happens: a CPU1 STR finishes! It hits the cache, and now we know the cache state: M:

....
471034000: Cache: system.cpu1.dcache: access for WriteReq [2068:206f] D=2f00000000000000 ptr=0x5628c3d26c80 hit state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
471034000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+32 : str x1, [x2] : MemWrite : D=0x000000000000002f A=0x412068 flags=(IsInteger|IsMemRef|IsStore)
....

After this is done, the CPU2 dcache finally decides that it is time to forward its request, and _now_ we see the crux of this experiment happen.

First `createMissPacket` creates a new packet for the cache request, and then it sends that packet into the `CoherentXBar`:

....
471035000: Event: system.cpu2.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 140 executed @ 471035000
471035000: Cache: system.cpu2.dcache: sendMSHRQueuePacket: MSHR ReadReq [2068:206f] D=b0d989c328560000 ptr=0x5628c3d26f00
471035000: Cache: system.cpu2.dcache: createMissPacket: created ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 from ReadReq [2068:206f] D=b0d989c328560000 ptr=0x5628c3d26f00
471035000: CoherentXBar: system.membus: recvTimingReq: src system.membus.slave[10] packet ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80
....

Now the `SnoopFilter`, which lives inside the crossbar, decides if any other CPUs care about that address:

....
471035000: SnoopFilter: system.membus.snoop_filter: lookupRequest: src system.membus.slave[10] packet ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80
471035000: SnoopFilter: system.membus.snoop_filter: lookupRequest: SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000
471035000: SnoopFilter: system.membus.snoop_filter: lookupRequest: new SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000
471035000: CoherentXBar: system.membus: recvTimingReq: src system.membus.slave[10] packet ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 SF size: 1 lat: 1
....

and the answer is yes: CPU1 obviously does care about that address! So the packet is forwarded as-is to CPU1:

....
471035000: CoherentXBar: system.membus: forwardTiming for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80
471035000: CacheVerbose: system.cpu1.dcache: recvTimingSnoopReq: for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80
471035000: CacheVerbose: system.cpu1.dcache: handleSnoop: for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80
471035000: Cache: system.cpu1.dcache: handleSnoop: snoop hit for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80, old state is state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
471035000: Cache: system.cpu1.dcache: new state is state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
471035000: Cache: system.cpu1.dcache: doTimingSupplyResponse: for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80
471035000: CacheVerbose: system.cpu1.dcache: doTimingSupplyResponse: created response: ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 tick: 471044000
471035000: Event: system.cpu1.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 94 scheduled @ 471044000
471035000: CoherentXBar: system.membus: recvTimingReq: Not forwarding ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80
471035000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 164 scheduled @ 471036000
471035000: BaseXBar: system.membus.reqLayer0: The crossbar layer is now busy from tick 471035000 to 471036000
....

and from this we see that this read request from CPU2 made the corresponding cache line of CPU1 go from M to O!

Then, the CPU1 dcache actually goes ahead and creates a response for CPU2, since it has the data. This response is sent back to the crossbar, which will forward it on to CPU2.

This also makes the crossbar not forward the original request to DRAM, as indicated by the `Not forwarding` message.
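
The transitions we just saw, plus the S to I invalidation that shows up a few events below, are the standard <<moesi>> snooping rules. Here is a compact sketch of just those rules in plain C, ours rather than gem5's actual state encoding:

....
#include <assert.h>

typedef enum { I, S, E, O, M } moesi_t;

/* Another cache wants to read: a dirty owner (M) degrades to Owned and
 * supplies the data itself, which is why DRAM is not needed here. */
moesi_t on_snoop_read(moesi_t s) {
    if (s == M) return O; /* what CPU1's dcache does above */
    if (s == E) return S;
    return s;             /* I, S and O keep their state */
}

/* Another cache wants to write (e.g. an UpgradeReq): every other copy
 * must be invalidated. */
moesi_t on_snoop_write(moesi_t s) {
    (void)s;
    return I;             /* what happens to CPU2's line further below */
}

int main(void) {
    assert(on_snoop_read(M) == O);  /* the M to O transition in this log */
    assert(on_snoop_write(S) == I); /* the S to I transition further below */
}
....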

This reply from CPU1 reaches the crossbar at:

....
471044000: Event: system.cpu1.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 94 executed @ 471044000
471044000: CoherentXBar: system.membus: recvTimingSnoopResp: src system.membus.slave[6] packet ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000
471044000: SnoopFilter: system.membus.snoop_filter: updateSnoopResponse: rsp system.membus.slave[6] req system.membus.slave[10] packet ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000
471044000: SnoopFilter: system.membus.snoop_filter: updateSnoopResponse: old SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000
471044000: SnoopFilter: system.membus.snoop_filter: updateSnoopResponse: new SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101000
471044000: CoherentXBar: system.membus: recvTimingSnoopResp: src system.membus.slave[6] packet ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 FWD RESP
471044000: Event: system.membus.slave[10]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 186 scheduled @ 471046000
471044000: Event: system.membus.respLayer10.wrapped_function_event: EventFunctionWrapped 187 scheduled @ 471049000
471044000: BaseXBar: system.membus.respLayer10: The crossbar layer is now busy from tick 471044000 to 471049000
....

and finally, at long last, CPU2 receives the snoop reply that was created in CPU1 and sent back through the crossbar, and the LDR completes:

....
471046000: Event: system.membus.slave[10]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 186 executed @ 471046000
471046000: Cache: system.cpu2.dcache: recvTimingResp: Handling response ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000
471046000: Cache: system.cpu2.dcache: Block for addr 0x2040 being updated in Cache
471046000: CacheRepl: system.cpu2.dcache: Replacement victim: state: 0 (I) valid: 0 writable: 0 readable: 0 dirty: 0 | tag: 0xffffffffffffffff set: 0x81 way: 0
471046000: Cache: system.cpu2.dcache: Block addr 0x2040 (ns) moving from state 0 (I) to state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
471046000: Cache: system.cpu2.dcache: serviceMSHRTargets: updated cmd to ReadRespWithInvalidate [2068:206f] D=2f00000000000000 ptr=0x5628c3d26f00
471046000: Event: system.cpu2.dcache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 138 scheduled @ 471047000
471046000: Cache: system.cpu2.dcache: processing deferred snoop...
471046000: CacheVerbose: system.cpu2.dcache: handleSnoop: for UpgradeReq [2040:207f] D= ptr=0x5628c2d37b80
471046000: Cache: system.cpu2.dcache: handleSnoop: snoop hit for UpgradeReq [2040:207f] D= ptr=0x5628c2d37b80, old state is state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
471046000: Cache: system.cpu2.dcache: new state is state: 0 (I) valid: 0 writable: 0 readable: 0 dirty: 0 | tag: 0xffffffffffffffff set: 0x81 way: 0
471046000: CacheVerbose: system.cpu2.dcache: recvTimingResp: Leaving with ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000
471047000: Event: system.cpu2.dcache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 138 executed @ 471047000
471047000: SimpleCPU: system.cpu2.dcache_port: Received load/store response 0x2068
471047000: Event: Event_136: Timing CPU dcache tick 136 scheduled @ 471047000
471047000: Event: Event_136: Timing CPU dcache tick 136 executed @ 471047000
471034000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+24 : ldr x1, [x2] : MemRead : D=0x000000000000002f A=0x412068 flags=(IsInteger|IsMemRef|IsLoad)
....

We note therefore that no DRAM access was involved: one cache services the other directly!

Tested on LKMC d429552cdeb0fc0a08cff8e627bf501eaffb068f + 1, gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772.

===== gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs and Ruby

Now let's do the exact same thing we did for <<gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus>>, but with <<gem5-ruby-build,Ruby>> rather than the classic system, and with TimingSimpleCPU (atomic does not work with Ruby).

@@ -17036,7 +17236,7 @@ FullO3CPU: Ticking main, FullO3CPU.

so we observe that the first two instructions arrived, and the CPU noticed that 0x400080 hasn't been fetched yet.

Then for several cycles that follow, the fetch stage just says that it is blocked on data returning:

....
FullO3CPU: Ticking main, FullO3CPU.

@@ -19604,7 +19804,14 @@ Bibliography:

===== atomic.c

* link:userland/c/atomic.c[]
* link:userland/c/atomic/[]: files in this directory use the same technique as <<atomic-cpp>>, i.e. with one special case per file.
+
Maybe link:userland/c/atomic.c[] should be deprecated in favor of those more minimal ones.
+
This was added because the C++ pre-main is too bloated, especially when we turn on a gazillion <<gem5>> logs, it makes me want to cry.
+
And we want a single operation per test rather than several as in `atomic.c`, because when using gem5 we want absolute control over the microbenchmark.

Demonstrates `atomic_int` and `thrd_create`.

@@ -19828,6 +20035,8 @@ Good rule:

link:userland/cpp/atomic/[]

C version at: <<atomic-c>>.

In this set of examples, we exemplify various synchronization mechanisms, including assembly-specific ones, by using the convenience of C++ multithreading:

* link:userland/cpp/atomic/main.hpp[]: contains all the code, which is then specialized in separate `.cpp` files with macros

@@ -659,6 +659,18 @@ path_properties_tuples = (
    },
    {
        'abort.c': {'signal_received': signal.Signals.SIGABRT},
        'atomic': (
            {
                'test_run_args': {'cpus': 3},
            },
            {
                'aarch64_add.c': {'allowed_archs': {'aarch64'}},
                'aarch64_ldadd.c': {'allowed_archs': {'aarch64'}},
                'aarch64_ldaxr_stlxr.c': {'allowed_archs': {'aarch64'}},
                'x86_64_inc.c': {'allowed_archs': {'x86_64'}},
                'x86_64_lock_inc.c': {'allowed_archs': {'x86_64'}},
            },
        ),
        'atomic.c': {
            'baremetal': False,
            'test_run_args': {'cpus': 3},

@@ -4,9 +4,9 @@
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <threads.h>

atomic_int acnt;
int cnt;

1 userland/c/atomic/README.adoc Normal file
@@ -0,0 +1 @@
https://cirosantilli.com/linux-kernel-module-cheat#atomic-c

2 userland/c/atomic/aarch64_add.c Normal file
@@ -0,0 +1,2 @@
#define LKMC_USERLAND_ATOMIC_AARCH64_ADD 1
#include "main.h"

2 userland/c/atomic/aarch64_ldadd.c Normal file
@@ -0,0 +1,2 @@
#define LKMC_USERLAND_ATOMIC_AARCH64_LDADD 1
#include "main.h"

2 userland/c/atomic/aarch64_ldaxr_stlxr.c Normal file
@@ -0,0 +1,2 @@
#define LKMC_USERLAND_ATOMIC_LDAXR_STLXR 1
#include "main.h"

1 userland/c/atomic/build Symbolic link
@@ -0,0 +1 @@
../build

2 userland/c/atomic/fail.c Normal file
@@ -0,0 +1,2 @@
#define LKMC_USERLAND_ATOMIC_FAIL 1
#include "main.h"

118 userland/c/atomic/main.h Normal file
@@ -0,0 +1,118 @@
/* https://cirosantilli.com/linux-kernel-module-cheat#atomic-c */

#if __STDC_VERSION__ >= 201112L && !defined(__STDC_NO_THREADS__)
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <threads.h>

#if LKMC_USERLAND_ATOMIC_STD_ATOMIC
atomic_uint global = 0;
#else
uint64_t global = 0;
#endif

int my_thread_main(void *thr_data) {
    size_t niters = *(size_t *)thr_data;
    for (size_t i = 0; i < niters; ++i) {
#if LKMC_USERLAND_ATOMIC_X86_64_INC
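        /* Non-atomic increment: plain incq without the LOCK prefix, so
         * concurrent read-modify-writes from other cores can be lost. */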
        __asm__ __volatile__ (
            "incq %0;"
            : "+g" (global),
              "+g" (i)
            :
            :
        );
#elif LKMC_USERLAND_ATOMIC_X86_64_LOCK_INC
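        /* Atomic increment: the LOCK prefix makes the whole
         * read-modify-write atomic with respect to other cores. */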
        __asm__ __volatile__ (
            "lock incq %0;"
            : "+m" (global),
              "+g" (i)
            :
            :
        );
#elif LKMC_USERLAND_ATOMIC_AARCH64_ADD
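        /* Non-atomic increment: the value is loaded into a register,
         * incremented and stored back as separate instructions, which is
         * the race analyzed in the gem5 sections above. */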
        __asm__ __volatile__ (
            "add %0, %0, 1;"
            : "+r" (global),
              "+g" (i)
            :
            :
        );
#elif LKMC_USERLAND_ATOMIC_LDAXR_STLXR
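        /* Atomic increment via an exclusive load/store (LL/SC) retry
         * loop: STLXR fails if another core touched the location since
         * the LDAXR, and CBNZ then retries. */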
        uint64_t scratch64;
        uint32_t scratch32;
        __asm__ __volatile__ (
            "1:"
            "ldaxr %[scratch64], [%[addr]];"
            "add %[scratch64], %[scratch64], 1;"
            "stlxr %w[scratch32], %[scratch64], [%[addr]];"
            "cbnz %w[scratch32], 1b;"
            : "=m" (global),
              "+g" (i),
              [scratch64] "=&r" (scratch64),
              [scratch32] "=&r" (scratch32)
            : [addr] "r" (&global)
            :
        );
#elif LKMC_USERLAND_ATOMIC_AARCH64_LDADD
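        /* Atomic increment with the single ARMv8.1 LSE instruction
         * LDADD, discarding the fetched old value into XZR. */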
        __asm__ __volatile__ (
            "ldadd %[inc], xzr, [%[addr]];"
            : "=m" (global),
              "+g" (i)
            : [inc] "r" (1),
              [addr] "r" (&global)
            :
        );
#else
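        /* Fallback: plain C non-atomic increment, with an empty asm
         * statement to keep the compiler from optimizing the loop away. */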
        __asm__ __volatile__ (
            ""
            : "+g" (i)
            : "g" (global)
            :
        );
        global++;
#endif
    }
    return 0;
}
#endif

int main(int argc, char **argv) {
#if __STDC_VERSION__ >= 201112L && !defined(__STDC_NO_THREADS__)
    size_t niters, nthreads;
    thrd_t *threads;
    if (argc > 1) {
        nthreads = strtoull(argv[1], NULL, 0);
    } else {
        nthreads = 2;
    }
    if (argc > 2) {
        niters = strtoull(argv[2], NULL, 0);
    } else {
        niters = 10;
    }
    threads = malloc(sizeof(thrd_t) * nthreads);
    for (size_t i = 0; i < nthreads; ++i)
        assert(thrd_create(threads + i, my_thread_main, &niters) == thrd_success);
    for (size_t i = 0; i < nthreads; ++i)
        assert(thrd_join(threads[i], NULL) == thrd_success);
    free(threads);
    uint64_t expect = nthreads * niters;
#if LKMC_USERLAND_ATOMIC_FAIL || \
    LKMC_USERLAND_ATOMIC_X86_64_INC || \
    LKMC_USERLAND_ATOMIC_AARCH64_ADD
    printf("expect %ju\n", (uintmax_t)expect);
    printf("global %ju\n", (uintmax_t)global);
#else
    assert(global == expect);
#endif
#else
    (void)argc;
    (void)argv;
#endif
}

2 userland/c/atomic/mutex.c Normal file
@@ -0,0 +1,2 @@
#define LKMC_USERLAND_ATOMIC_MUTEX 1
#include "main.h"

2 userland/c/atomic/std_atomic.c Normal file
@@ -0,0 +1,2 @@
#define LKMC_USERLAND_ATOMIC_STD_ATOMIC 1
#include "main.h"

1 userland/c/atomic/test Symbolic link
@@ -0,0 +1 @@
../test

2 userland/c/atomic/x86_64_inc.c Normal file
@@ -0,0 +1,2 @@
#define LKMC_USERLAND_ATOMIC_X86_64_INC 1
#include "main.h"

2 userland/c/atomic/x86_64_lock_inc.c Normal file
@@ -0,0 +1,2 @@
#define LKMC_USERLAND_ATOMIC_X86_64_LOCK_INC 1
#include "main.h"

@@ -1 +1 @@
https://cirosantilli.com/linux-kernel-module-cheat#atomic-cpp