Note that secondary cores in gem5 are kind of broken however: gem5 GDB step debug secondary cores.
There are many more unimplemented syscalls in gem5 than in QEMU. Many of them are trivial to implement however.
Support for dynamic linking was added in November 2019: https://stackoverflow.com/questions/50542222/how-to-run-a-dynamically-linked-executable-syscall-emulation-mode-se-py-in-gem5/50696098#50696098
Note that as shown at Section 29.2.2, “Benchmark emulators on userland executables”, the dynamic version runs 200x more instructions, which might have an impact on smaller simulations in detailed CPUs.

As of gem5 7fa4c946386e7207ad5859e8ade0bbfc14000d91, the crappy se.py script does not forward the exit status of syscall emulation mode; you can test it with:
Since gem5 has to implement syscalls itself in syscall emulation mode, it can of course clearly see which syscalls are being made, and we can log them for debug purposes with gem5 tracing, e.g.:
gem5 user mode multithreading has been particularly flaky compared to QEMU’s, but work is being put into improving it.
gem5 syscall emulation has the nice feature of allowing you to run multiple executables "at once".
and therefore shows one instruction running on each CPU for each process at the same time.
gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4 syscall emulation has an --smt option, presumably for hardware threads, but it seems to have been neglected forever: https://github.com/cirosantilli/linux-kernel-module-cheat/issues/104
Check with:
Can be checked with /proc/cpuinfo or getconf in Ubuntu 18.04:

getconf -a | grep CACHE
lscpu
cat /sys/devices/system/cpu/cpu0/cache/index2/size
and on an Ubuntu 20.04 host, but not on Buildroot 1.31.1:

getconf -a | grep CACHE
and we also have an easy to use userland executable using sysconf at userland/linux/sysconf.c:

./run --emulator gem5 --userland userland/linux/sysconf.c
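For reference, the gist of such a sysconf query looks like this (a sketch only; the exact contents of userland/linux/sysconf.c may differ, and the _SC_LEVEL*_CACHE_* names are glibc extensions):

/* Minimal sketch of a sysconf-based cache query, analogous to what
 * userland/linux/sysconf.c does (the exact file contents may differ). */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* These may return 0 when the kernel does not expose the cache
     * geometry, which appears to be the case under gem5 here. */
    printf("_SC_LEVEL1_DCACHE_SIZE     = %ld\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("_SC_LEVEL1_DCACHE_LINESIZE = %ld\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("_SC_LEVEL2_CACHE_SIZE      = %ld\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("_SC_NPROCESSORS_ONLN       = %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
    printf("_SC_PAGESIZE               = %ld\n", sysconf(_SC_PAGESIZE));
    return 0;
}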
but for some reason the Linux kernel is not seeing the cache sizes:
The only precise option is therefore to look at the gem5 config.ini, as done at: gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches.

Or for a quick and dirty performance measurement approach:
Some info at: TimingSimpleCPU analysis #1 but highly TODO :-)
TODO These look promising:

TODO: how to verify this with the Linux kernel, besides raw performance benchmarks?
Now for a raw simplistic benchmark on TimingSimpleCPU without caches via a C busy loop:

./run --arch aarch64 --cli-args 1000000 --emulator gem5 --userland userland/gcc/busy_loop.c -- --cpu-type TimingSimpleCPU
LKMC eb22fd3b6e7fff7e9ef946a88b208debf5b419d5 gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1 outputs:
Exiting @ tick 897173931000 because exiting with last active thread context
and now, because:

we have no caches, so each instruction is fetched from memory

each loop contains 11 instructions as shown at Section 31.2, “C busy loop”

and supposing that the loop dominates the executable's pre/post-main code, which we know is true since, as shown in Benchmark emulators on userland executables, an empty dynamically linked C program only has about 100k instructions, while our loop runs 1000000 * 11 ≈ 11M instructions
we should have about 1000000 * 11 / 897173931000 ps ~ 12260722 ~ 12MB/s of random accesses. The default memory type used is DDR3_1600_8x8 as per:

common/Options.py:101: parser.add_option("--mem-type", type="choice", default="DDR3_1600_8x8

and according to https://en.wikipedia.org/wiki/DDR3_SDRAM that reaches 6400 MB/s, so we are only off by a factor of 50x :-) TODO. Maybe if the minimum transaction is 64 bytes, we would be on point.
Another example we could use later on is userland/gcc/busy_loop.c, but then that mixes icache and dcache accesses, so the analysis is a bit more complex:

./run --arch aarch64 --cli-args 0x1000000 --emulator gem5 --userland userland/gcc/busy_loop.c -- --cpu-type TimingSimpleCPU
./run --memory 512M

get_avphys_pages() * sysconf(_SC_PAGESIZE) = 0x1D178000
AV means available and gives the free memory: https://stackoverflow.com/questions/14386856/c-check-available-ram/57659190#57659190
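A minimal sketch of the glibc calls involved, assuming only the <sys/sysinfo.h> extensions get_phys_pages and get_avphys_pages plus sysconf:

/* Total and available RAM in bytes via glibc extensions.
 * get_phys_pages() and get_avphys_pages() are declared in <sys/sysinfo.h>. */
#include <stdio.h>
#include <sys/sysinfo.h>
#include <unistd.h>

int main(void) {
    long page_size = sysconf(_SC_PAGESIZE);
    printf("total RAM     = 0x%lX\n", get_phys_pages()   * page_size);
    printf("available RAM = 0x%lX\n", get_avphys_pages() * page_size);
    return 0;
}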
TODO These look promising:
Clock frequency: TODO how does it affect performance in benchmarks?
As of gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1, fs.py defaults to 2GHz:

parser.add_option("--cpu-clock", action="store", type="string",
                  default='2GHz',
                  help="Clock for blocks running at CPU speed")

and it can be set explicitly with for example:

./run --arch aarch64 --emulator gem5 -- --cpu-clock 10000000
We can check that very easily by looking at the timestamps of an Exec trace of a gem5 AtomicSimpleCPU without any caches:
./run \
  --arch aarch64 \
  --emulator gem5 \
  --userland userland/arch/aarch64/freestanding/linux/hello.S \
  --trace-insts-stdout \
;
which shows:
   0: system.cpu: A0 T0 : @asm_main_after_prologue    : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
 500: system.cpu: A0 T0 : @asm_main_after_prologue+4  : adr x1, #28 : IntAlu : D=0x0000000000400098 flags=(IsInteger)
1000: system.cpu: A0 T0 : @asm_main_after_prologue+8  : ldr w2, #4194464 : MemRead : D=0x0000000000000006 A=0x4000a0 flags=(IsInteger|IsMemRef|IsLoad)
1500: system.cpu: A0 T0 : @asm_main_after_prologue+12 : movz x8, #64, #0 : IntAlu : D=0x0000000000000040 flags=(IsInteger)
2000: system.cpu: A0 T0 : @asm_main_after_prologue+16 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
hello
2500: system.cpu: A0 T0 : @asm_main_after_prologue+20 : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
3000: system.cpu: A0 T0 : @asm_main_after_prologue+24 : movz x8, #93, #0 : IntAlu : D=0x000000000000005d flags=(IsInteger)
3500: system.cpu: A0 T0 : @asm_main_after_prologue+28 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
so we see that it runs one instruction every 500 ps, which corresponds to 2GHz.

So if we change the frequency to say 1GHz and re-run it:
./run \
  --arch aarch64 \
  --emulator gem5 \
  --userland userland/arch/aarch64/freestanding/linux/hello.S \
  --trace-insts-stdout \
  -- \
  --cpu-clock 1GHz \
;
we get as expected:
   0: system.cpu: A0 T0 : @asm_main_after_prologue    : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
1000: system.cpu: A0 T0 : @asm_main_after_prologue+4  : adr x1, #28 : IntAlu : D=0x0000000000400098 flags=(IsInteger)
2000: system.cpu: A0 T0 : @asm_main_after_prologue+8  : ldr w2, #4194464 : MemRead : D=0x0000000000000006 A=0x4000a0 flags=(IsInteger|IsMemRef|IsLoad)
3000: system.cpu: A0 T0 : @asm_main_after_prologue+12 : movz x8, #64, #0 : IntAlu : D=0x0000000000000040 flags=(IsInteger)
4000: system.cpu: A0 T0 : @asm_main_after_prologue+16 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
hello
5000: system.cpu: A0 T0 : @asm_main_after_prologue+20 : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
6000: system.cpu: A0 T0 : @asm_main_after_prologue+24 : movz x8, #93, #0 : IntAlu : D=0x000000000000005d flags=(IsInteger)
7000: system.cpu: A0 T0 : @asm_main_after_prologue+28 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
As of gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1, the CPU clock, like the gem5 cache size, does not get propagated to the guest, and is for example not visible at:
breaks when m5 is run on guest, but does not show the source code.
gem5’s secondary core GDB setup is a hack and spawns one gdbserver for each core in separate ports, e.g. 7000, 7001, etc.
Partly because of this, it is basically unusable / very hard to use: you can't attach to a core that is stopped, either because it hasn't been initialized yet, or because you are already debugging another core.

This affects both full system and userland, and is described in more detail at: https://gem5.atlassian.net/browse/GEM5-626

In LKMC 0a3ce2f41f12024930bcdc74ff646b66dfc46999, we can easily test attaching to another core by passing --run-id, e.g. to connect to the second core we can use --run-id 1:

./run-gdb --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --run-id 1
At gem5 2235168b72537535d74c645a70a85479801e0651, the first run does everything in AtomicSimpleCPU:
An example of such a file can be seen at: config.dot.svg for a TimingSimpleCPU without caches.
On Ubuntu 20.04, you can also see the dot file "directly" with xdot:
xdot "$(./getvar --arch arm --emulator gem5 m5out_dir)/config.dot"
which is kind of really cool because it allows you to follow the graph arrows with clicks.
It is worth noting that if you are running a bunch of short simulations, dot/SVG/PDF generation can have a significant impact on simulation startup time, so it is something to watch out for. As per https://gem5-review.googlesource.com/c/public/gem5/+/29232 it can be turned off with:

gem5.opt --dot-config=''

or in LKMC:

./run --gem5-exe-args='--dot-config= --json-config= --dump-config='
The time difference can be readily observed on minimal examples by running gem5 with time.
By looking into gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1 src/python/m5/util/dot_writer.py we can try to remove the SVG/PDF conversion to see if those dominate the runtime:
def do_dot(root, outdir, dotFilename):
    if not pydot:
        warn("No dot file generated. " +
             "Please install pydot to generate the dot file and pdf.")
        return
    # * use ranksep > 1.0 for for vertical separation between nodes
    # especially useful if you need to annotate edges using e.g. visio
    # which accepts svg format
    # * no need for hoizontal separation as nothing moves horizonally
    callgraph = pydot.Dot(graph_type='digraph', ranksep='1.3')
    dot_create_nodes(root, callgraph)
    dot_create_edges(root, callgraph)
    dot_filename = os.path.join(outdir, dotFilename)
    callgraph.write(dot_filename)
    try:
        # dot crashes if the figure is extremely wide.
        # So avoid terminating simulation unnecessarily
        callgraph.write_svg(dot_filename + ".svg")
        callgraph.write_pdf(dot_filename + ".pdf")
    except:
        warn("failed to generate dot output from %s", dot_filename)
but nope, they don't: dot_create_nodes and dot_create_edges are the culprits, so the only way to gain speed is to remove .dot generation altogether. It is tempting to do this by default in LKMC and add an option to enable dot generation when desired, so we could be a bit faster by default… but I'm too lazy to document the option right now. Maybe when it annoys me further :-)
Simple abstract CPU without a pipeline.
They are therefore completely unrealistic. But they also run much faster.
Implementations:
AtomicSimpleCPU: the default one. Memory accesses happen instantaneously. The fastest simulation except for KVM, but not realistic at all.
TimingSimpleCPU: memory accesses are realistic, but the CPU has no pipeline. The simulation is faster than detailed models, but slower than AtomicSimpleCPU.
To fully understand TimingSimpleCPU, see: gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis.
Caches do make a difference here of course, and lead to much faster memory return times.
KVM CPUs are an alternative way of fast forwarding boot when they work.
Its 4 stage pipeline is described at the "MinorCPU" section of gem5 ARM RSK.
A commented execution example can be seen at: gem5 event queue MinorCPU syscall emulation freestanding example analysis.
There is also an in-tree doxygen at src/doc/inside-minor.doxygen, rendered at: http://pages.cs.wisc.edu/~swilson/gem5-docs/minor.html
As of 2019, in-order cores are mostly present in low power/cost contexts, for example the little cores of ARM bigLITTLE.
The following models extend the MinorCPU class by parametrization to make it match existing CPUs more closely:
Generic out-of-order core. "O3" stands for "Out Of Order"!
Analogous to MinorCPU, but modelling an out-of-order core instead of an in-order one.
A commented execution example can be seen at: gem5 event queue DerivO3CPU syscall emulation freestanding example analysis.
Existing parametrizations:
At gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis we see for example that at the beginning of an AtomicSimpleCPU simulation, gem5 sets up exactly two events:
TODO: analyze better what each of the memory events means. For now, we have just collected a bunch of data there, but it needs interpreting. The CPU specifics in this section are already insightful however.
TimingSimpleCPU should be the second simplest CPU to analyze, so let's give it a try:
TODO
Some gem5 instructions break down into multiple microops.

Microops are very similar to regular instructions, and they show up in the gem5 ExecAll trace format, since that flag implies ExecMicro.
On aarch64 for example, one of the simplest microoped instructions is STP, which does the relatively complex operation of storing two values to memory at once, and is therefore a good candidate for being broken down into microops.
TODO

Each instruction is marked with a class, and each class can execute in a given functional unit.

Which units are available is visible for example in the gem5 config.ini of a gem5 MinorCPU run. Functional units are not present in simple CPUs like the gem5 TimingSimpleCPU.
For example, on gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1, the config.ini of a minor run:
./run \
  --arch aarch64 \
  --emulator gem5 \
  --userland userland/arch/aarch64/freestanding/linux/hello.S \
  --trace-insts-stdout \
  -N1 \
  -- \
  --cpu-type MinorCPU \
  --caches
contains:
[system.cpu]
type=MinorCPU
children=branchPred dcache dtb executeFuncUnits icache interrupts isa itb power_state tracer workload
gem5 uses a ton of code generation, which makes the project horrendous:
But it has been widely overused to insanity. It likely also exists partly because when the project started in 2003 C++ compilers weren’t that good, so you couldn’t rely on features like templates that much.
Generated code at: build/<ISA>/config/the_isa.hh which e.g. for ARM contains:
gem5 moves a bit slowly, and if your host compiler is very new, the gem5 build might be broken for it, e.g. this was the case for Ubuntu 19.10 with GCC 9 and gem5 62d75e7105fe172eb906d4f80f360ff8591d4178 from Dec 2019.
E.g. src/cpu/decode_cache.hh includes:
Upstream request: https://gem5.atlassian.net/browse/GEM5-469
POSIX C example that prints all environment variables: userland/posix/environ.c
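The gist of it is just to walk the environ global until the NULL terminator (a sketch; the actual userland/posix/environ.c may differ slightly):

/* Print all environment variables, one per line, via the POSIX environ global. */
#include <stdio.h>

extern char **environ;

int main(void) {
    for (char **p = environ; *p; p++)
        puts(*p);
    return 0;
}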
POSIX' multiprocess API. Contrast with pthreads, which are for threads.
Read the source comments and understand everything that is going on!
The minimal interesting example is to use fork and observe different PIDs.
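A minimal sketch of that pattern, not necessarily identical to the LKMC example:

/* Parent and child print different PIDs after fork. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        printf("child:  pid=%ld ppid=%ld\n", (long)getpid(), (long)getppid());
    } else {
        printf("parent: pid=%ld child=%ld\n", (long)getpid(), (long)pid);
        wait(NULL); /* reap the child to avoid a zombie */
    }
    return 0;
}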
POSIX' multithreading API. Contrast with fork which is for processes.
userland/posix/pthread_count.c exemplifies the functions:
Note that this blows up on gem5 userland due to NPROCESSORS_ONLN however: https://gem5.atlassian.net/browse/GEM5-622
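A minimal sketch of the pthread_create / pthread_join / mutex pattern that pthread_count.c revolves around; the thread count is hardcoded here to sidestep the NPROCESSORS_ONLN issue mentioned above:

/* Several threads increment a shared counter under a mutex. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS 1000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long counter;

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, work, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    /* Always NTHREADS * NITERS thanks to the mutex. */
    printf("counter = %lu\n", counter);
    return 0;
}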
getconf is also specified by POSIX at: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/getconf.html, but not the -a option, which shows all configurations.
Busybox 1.31.1 clearly states that getconf is not implemented however at docs/posix_conformance.txt:
POSIX Tools not supported:
    asa, at, batch, bc, c99, command, compress, csplit, ex, fc, file,
    gencat, getconf, iconv, join, link, locale, localedef, lp, m4,
The mmap system call allows advanced memory operations.
Basic mmap example, do the same as userland/c/malloc.c, but with mmap.
Memory mapped file example: userland/posix/mmap_file.c
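A sketch of the anonymous mmap case, which behaves much like malloc (the actual LKMC example may differ):

/* Anonymous mmap used like malloc: MAP_ANONYMOUS memory is not backed by any file. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 4096;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    strcpy(p, "hello mmap");
    puts(p);
    munmap(p, len);
    return 0;
}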
A bit like read and write, but from / to the Internet!
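A self-contained loopback TCP sketch of that idea (hypothetical example, not one of the LKMC userland files; error checking omitted for brevity): the parent listens on 127.0.0.1, the forked child connects and writes, and the parent reads.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int server = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */
    bind(server, (struct sockaddr *)&addr, sizeof(addr));
    socklen_t len = sizeof(addr);
    getsockname(server, (struct sockaddr *)&addr, &len); /* learn the chosen port */
    listen(server, 1);
    if (fork() == 0) {
        /* Child: connect and send a message, just like write() to a file. */
        int client = socket(AF_INET, SOCK_STREAM, 0);
        connect(client, (struct sockaddr *)&addr, sizeof(addr));
        write(client, "hello socket", 12);
        close(client);
        _exit(0);
    }
    int conn = accept(server, NULL, NULL);
    char buf[64] = {0};
    read(conn, buf, sizeof(buf) - 1);
    printf("received: %s\n", buf);
    close(conn);
    close(server);
    wait(NULL);
    return 0;
}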
userland/gcc/busy_loop.c described at C busy loop
For now we can just run on gem5 to estimate the instruction count per input size and extrapolate?
For example, the simplest scalable CPU content would be a C busy loop, so let's start by analyzing that one.
Summary of manually collected results on P51 at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: Table 7, “Busy loop MIPS for different simulator setups”. As expected, the less native / more detailed / more complex simulations are slower!
| Comment | LKMC revision | Loops | Time (s) | Instruction count | Approximate MIPS | gem5 version | Host |
|---|---|---|---|---|---|---|---|
| QEMU busy loop | | 10^10 | 68 | 1.1 * 10^11 (approx) | 2000 | | |
| gem5 busy loop | a18f28e263c91362519ef550150b5c9d75fa3679 | 10^6 | 18 | 2.4005699 * 10^7 | 1.3 | | |
| gem5 empty C program statically linked | eb22fd3b6e7fff7e9ef946a88b208debf5b419d5 | 1 | 0 | 5475 | | 872cb227fdc0b4d60acc7840889d567a6936b6e1 | Ubuntu 20.04 |
| gem5 empty C program dynamically linked | eb22fd3b6e7fff7e9ef946a88b208debf5b419d5 | 1 | 0 | 106999 | | 872cb227fdc0b4d60acc7840889d567a6936b6e1 | Ubuntu 20.04 |
| gem5 busy loop for a debug build | | | 33 | 2.405682 * 10^6 | 0.07 | | |
| gem5 busy loop for a fast build | | | 15 | 2.4005699 * 10^7 | 1.6 | | |
| gem5 busy loop for a TimingSimpleCPU | | | 26 | 2.4005699 * 10^7 | 0.9 | | |
| gem5 busy loop for a MinorCPU | | | 31 | 1.1018152 * 10^7 | 0.4 | | |
| gem5 busy loop for a DerivO3CPU | | | 52 | 1.1018128 * 10^7 | 0.2 | | |
| | | | 63 | 1.1005150 * 10^7 | 0.2 | | |
| | | | 68 | 9.2034139 * 10^7 | 1.6 | | |
| | | | 64 | 9.9674773 * 10^7 | 1.6 | | |
| glibc C pre-main effects | | | 2 | 1.26479 * 10^5 | 0.05 | | |
| | | | 2 | 1.26479 * 10^5 | 0.05 | | |
| | | | 2 | 2.385012 * 10^6 | 1 | | |
| | | | 25 | 2.385012 * 10^6 | 0.1 | | |
| gem5 optimized build immediate exit on first instruction to benchmark the simulator startup time | | | 1 | 1 | | | |
| same as above but debug build | | | 1 | 1 | | | |
| Check the effect of an ExecAll log (log every instruction) on execution time, compare to analogous run without it. | | | 136 | 2.4106774 * 10^7 | 0.2 | | |
| Same as above but with run command manually hacked to output to a ramfs. Slightly faster, but the bulk was still just in log format operations! | d29a07ddad499f273cc90dd66e40f8474b5dfc40 | 10^6 | 107 | 2.4106774 * 10^7 | 0.2 | | |
so ~ 110 million instructions / 100 seconds makes ~ 1 MIPS (million instructions per second).
This experiment also suggests that each loop is about 11 instructions long (110M instructions / 10M loops), which we confirm at Section 31.2, “C busy loop”, bingo!
Then for QEMU, we experimentally turn the number of loops up to 10^10 loops (100000 100000), which contains an expected 11 * 10^10 instructions, and the runtime is 00:01:08, so we have 1.1 * 10^11 instructions / 68 seconds ~ 2 * 10^9 = 2000 MIPS!
The hard part is how to prevent the compiler from optimizing it away: https://stackoverflow.com/questions/7083482/how-to-prevent-gcc-from-optimizing-out-a-busy-wait-loop/58758133#58758133
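The core of the loop looks roughly like this, reconstructed from the disassembly below (a sketch: the actual userland/gcc/busy_loop.c may differ in details such as argument handling):

#include <stdlib.h>

void busy_loop(unsigned long long max, unsigned long long max2) {
    for (unsigned long long i = 0; i < max2; i++) {
        for (unsigned long long j = 0; j < max; j++) {
            /* Empty inline assembly that "reads and writes" i and j:
             * prevents GCC from optimizing the loop away without
             * emitting any extra instructions. */
            __asm__ __volatile__ ("" : "+g" (i), "+g" (j) : :);
        }
    }
}

int main(int argc, char **argv) {
    unsigned long long max = argc > 1 ? strtoull(argv[1], NULL, 0) : 1;
    unsigned long long max2 = argc > 2 ? strtoull(argv[2], NULL, 0) : 1;
    busy_loop(max, max2);
    return 0;
}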
Disassembly analysis:
./run-toolchain --arch aarch64 gdb -- -nh -batch -ex 'disas/rs busy_loop' "$(./getvar --arch aarch64 userland_build_dir)/gcc/busy_loop.out"

which contains at LKMC eb22fd3b6e7fff7e9ef946a88b208debf5b419d5:
10 ) {
   0x0000000000400700 <+0>:   ff 83 00 d1 sub sp, sp, #0x20
   0x0000000000400704 <+4>:   e0 07 00 f9 str x0, [sp, #8]
   0x0000000000400708 <+8>:   e1 03 00 f9 str x1, [sp]

11 for (unsigned long long i = 0; i < max2; i++) {
   0x000000000040070c <+12>:  ff 0f 00 f9 str xzr, [sp, #24]
   0x0000000000400710 <+16>:  11 00 00 14 b 0x400754 <busy_loop+84>

12 for (unsigned long long j = 0; j < max; j++) {
   0x0000000000400714 <+20>:  ff 0b 00 f9 str xzr, [sp, #16]
   0x0000000000400718 <+24>:  08 00 00 14 b 0x400738 <busy_loop+56>

13 __asm__ __volatile__ ("" : "+g" (i), "+g" (j) : :);
   0x000000000040071c <+28>:  e1 0f 40 f9 ldr x1, [sp, #24]
   0x0000000000400720 <+32>:  e0 0b 40 f9 ldr x0, [sp, #16]
   0x0000000000400724 <+36>:  e1 0f 00 f9 str x1, [sp, #24]
   0x0000000000400728 <+40>:  e0 0b 00 f9 str x0, [sp, #16]

12 for (unsigned long long j = 0; j < max; j++) {
   0x000000000040072c <+44>:  e0 0b 40 f9 ldr x0, [sp, #16]
   0x0000000000400730 <+48>:  00 04 00 91 add x0, x0, #0x1
   0x0000000000400734 <+52>:  e0 0b 00 f9 str x0, [sp, #16]
   0x0000000000400738 <+56>:  e1 0b 40 f9 ldr x1, [sp, #16]
   0x000000000040073c <+60>:  e0 07 40 f9 ldr x0, [sp, #8]
   0x0000000000400740 <+64>:  3f 00 00 eb cmp x1, x0
   0x0000000000400744 <+68>:  c3 fe ff 54 b.cc 0x40071c <busy_loop+28>  // b.lo, b.ul, b.last

11 for (unsigned long long i = 0; i < max2; i++) {
   0x0000000000400748 <+72>:  e0 0f 40 f9 ldr x0, [sp, #24]
   0x000000000040074c <+76>:  00 04 00 91 add x0, x0, #0x1
   0x0000000000400750 <+80>:  e0 0f 00 f9 str x0, [sp, #24]
   0x0000000000400754 <+84>:  e1 0f 40 f9 ldr x1, [sp, #24]
   0x0000000000400758 <+88>:  e0 03 40 f9 ldr x0, [sp]
   0x000000000040075c <+92>:  3f 00 00 eb cmp x1, x0
   0x0000000000400760 <+96>:  a3 fd ff 54 b.cc 0x400714 <busy_loop+20>  // b.lo, b.ul, b.last

14 }
15 }
16 }
   0x0000000000400764 <+100>: 1f 20 03 d5 nop
   0x0000000000400768 <+104>: ff 83 00 91 add sp, sp, #0x20
   0x000000000040076c <+108>: c0 03 5f d6 ret
We look for the internal backwards jumps, and we find two:

   0x0000000000400744 <+68>:  c3 fe ff 54 b.cc 0x40071c <busy_loop+28>  // b.lo, b.ul, b.last
   0x0000000000400760 <+96>:  a3 fd ff 54 b.cc 0x400714 <busy_loop+20>  // b.lo, b.ul, b.last

and so clearly the one at 0x400744 happens first and jumps to a larger address than the other one, so the inner loop must be between 0x40071c and 0x400744, which contains exactly 11 instructions.
Oh my God, unoptimized code is so horrendously inefficient, even I can't stand all those useless loads and stores to memory variables!!!
In gem5, this can be seen at:

gem5's MinorCPU implements a similar, but 4-stage, pipeline. TODO: why didn't they go with the classic RISC pipeline instead?

http://www.lighterra.com/papers/modernmicroprocessors/ explains it well.

You basically decode

TODO in gem5? gem5 definitely has functional units explicitly modelled: gem5 functional units, so do gem5 MinorCPU or gem5 DerivO3CPU have them?

gem5's model is gem5 DerivO3CPU.

Intel name: "Hyperthreading"
Even if caches are coherent, this is still not enough to avoid data race conditions, because this does not enforce atomicity of read modify write sequences. This is for example shown at: Detailed gem5 analysis of how data races happen.
According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf "memory consistency" is about ordering requirements of different memory addresses.
This is represented explicitly in C++ for example by std::memory_order.
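C11 exposes the same memory orders through <stdatomic.h>; a minimal release/acquire message passing sketch for illustration:

#include <stdatomic.h>
#include <stdio.h>

int payload;
atomic_int ready;

void producer(void) {
    payload = 42;
    /* Release: the payload write cannot be reordered after this store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

void consumer(void) {
    /* Acquire: if we observe ready == 1, we are guaranteed to see payload == 42. */
    if (atomic_load_explicit(&ready, memory_order_acquire))
        printf("payload = %d\n", payload);
}

int main(void) {
    /* Run single-threaded here just to illustrate the API. */
    producer();
    consumer();
    return 0;
}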
According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf, this is the strongest possible consistency: everything is nicely ordered as you'd expect.
Either they can snoop only control, or both control and data can be snooped.
Mentioned at:
TODO gem5 concrete example.
TODO understand well why those are needed.
https://en.wikipedia.org/wiki/MOSI_protocol The critical MSI vs MOSI section was a bit bogus though: https://en.wikipedia.org/w/index.php?title=MOSI_protocol&oldid=895443023 but I edited it :-)
gem5: m5 fail works on all archs
user mode: QEMU forwards the exit status; for gem5 we do some log parsing as described at: Section 10.7.2, “gem5 syscall emulation exit status”