From da4c3a029209dd73e2d7f01361672a1563f0f384 Mon Sep 17 00:00:00 2001
From: Ciro Santilli 六四事件 法轮功
Date: Thu, 11 Jun 2020 01:00:00 +0000
Subject: [PATCH] ce3ea9faea95daf46dea80d4236a30a0891c3ca5
---
 index.html | 835 ++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 631 insertions(+), 204 deletions(-)

diff --git a/index.html b/index.html
index b24c3c7..944201a 100644
--- a/index.html
+++ b/index.html
  • 10.6. syscall emulation mode program stdin
  • 10.7. gem5 syscall emulation mode
  • 19.2.2.2. gem5 cache size
- • 19.2.2.3. gem5 memory latency
- • 19.2.2.4. Memory size
- • 19.2.2.5. gem5 disk and network latency
- • 19.2.2.6. gem5 clock frequency
+ • 19.2.2.3. gem5 DRAM model
+ • 19.2.2.4. gem5 disk and network latency
+ • 19.2.2.5. gem5 clock frequency
  • 19.5. gem5 checkpoint
+ • 19.20.5.2. gem5 microops
  • 19.20.6. gem5 port system
  • 19.20.7.4. gem5 Process
- • 19.20.8. gem5 code generation
+ • 19.20.8. gem5 functional units
+ • 19.20.9. gem5 code generation
- • 19.20.9. gem5 build system
+ • 19.20.10. gem5 build system
  • 21.3. POSIX
  • 21.4. Userland multithreading
  • 31. Compilers
  • 32. Computer architecture
-So we take a performance measurement approach instead:

+The only precise option is therefore to look at the gem5 config.ini, as done at: gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches.
+
+Or instead, for a quick and dirty performance measurement approach:

    @@ -18870,7 +18918,12 @@ instructions 91738770
-19.2.2.3. gem5 memory latency
+19.2.2.3. gem5 DRAM model

Some info at: TimingSimpleCPU analysis #1, but highly TODO :-)

19.2.2.3.1. gem5 memory latency

    TODO These look promising:

    @@ -18886,9 +18939,60 @@ instructions 91738770

TODO: how to verify this with the Linux kernel, besides raw performance benchmarks?


Now for a raw simplistic benchmark on TimingSimpleCPU without caches via a C busy loop:

-19.2.2.4. Memory size
    ./run --arch aarch64 --cli-args 1000000 --emulator gem5 --userland userland/gcc/busy_loop.c -- --cpu-type TimingSimpleCPU

    LKMC eb22fd3b6e7fff7e9ef946a88b208debf5b419d5 gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1 outputs:

    Exiting @ tick 897173931000 because exiting with last active thread context

    and now because:

  • we have no caches, each instruction is fetched from memory

  • each loop contains 11 instructions as shown at Section 31.2, “C busy loop”

  • and supposing that the loop dominates the executable’s pre/post-main instructions, which we know is true since, as shown in Benchmark emulators on userland executables, an empty dynamically linked C program only has about 100k instructions, while our loop runs 1000000 * 11 = 11M.


we should have about 1000000 * 11 / 897173931000 ps ~ 12260722 ~ 12M random accesses per second. The default memory type used is DDR3_1600_8x8 as per:

common/Options.py:101:    parser.add_option("--mem-type", type="choice", default="DDR3_1600_8x8",

and according to https://en.wikipedia.org/wiki/DDR3_SDRAM that reaches 6400 MB/s, so we are only off by a factor of 50x :-) TODO. Maybe if the minimum transaction is 64 bytes, we would be on point.
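To make that arithmetic easy to replay, a minimal plain-Python sketch of the same estimate; the only assumption is gem5’s default tick resolution of 1 ps:

    # Rough bandwidth estimate for the TimingSimpleCPU busy loop run above.
    loops = 1000000           # --cli-args 1000000
    insts_per_loop = 11       # see Section 31.2, "C busy loop"
    ticks = 897173931000      # "Exiting @ tick ..." above
    seconds = ticks * 1e-12   # gem5 default: 1 tick = 1 ps

    accesses_per_s = loops * insts_per_loop / seconds
    print(accesses_per_s)     # ~12.26e6: the ~12M figure above

    # If each instruction fetch were a 64-byte DRAM transaction:
    print(accesses_per_s * 64 / 1e6, 'MB/s, vs the quoted 6400 MB/s peak')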


    Another example we could use later on is userland/gcc/busy_loop.c, but then that mixes icache and dcache accesses, so the analysis is a bit more complex:

    ./run --arch aarch64 --cli-args 0x1000000 --emulator gem5 --userland userland/gcc/busy_loop.c -- --cpu-type TimingSimpleCPU
    19.2.2.3.2. Memory size
    ./run --memory 512M
    @@ -18988,8 +19092,9 @@ get_avphys_pages() * sysconf(_SC_PAGESIZE) = 0x1D178000

    AV means available and gives the free memory: https://stackoverflow.com/questions/14386856/c-check-available-ram/57659190#57659190
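For comparison, the same quantities can be queried from Python; a minimal sketch (Linux-only, since SC_AVPHYS_PAGES is a Linux extension, just like get_avphys_pages):

    import os

    # Available ("AV") vs total physical memory, analogous to the C example's
    # get_avphys_pages() * sysconf(_SC_PAGESIZE).
    page_size = os.sysconf('SC_PAGE_SIZE')
    print(hex(os.sysconf('SC_AVPHYS_PAGES') * page_size))  # free memory
    print(hex(os.sysconf('SC_PHYS_PAGES') * page_size))    # total memory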

-19.2.2.5. gem5 disk and network latency
+19.2.2.4. gem5 disk and network latency

    TODO These look promising:

    @@ -19004,35 +19109,82 @@ get_avphys_pages() * sysconf(_SC_PAGESIZE) = 0x1D178000
-19.2.2.6. gem5 clock frequency
+19.2.2.5. gem5 clock frequency

    Clock frequency: TODO how does it affect performance in benchmarks?


As of gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1, it defaults to 2GHz for fs.py:

-./run --arch aarch64 --emulator gem5 -- --cpu-clock 10000000
        parser.add_option("--cpu-clock", action="store", type="string",
    +                      default='2GHz',
    +                      help="Clock for blocks running at CPU speed")
-Check with:

+We can check that very easily by looking at the timestamps of an Exec trace of a gem5 AtomicSimpleCPU run without any caches:

-m5 resetstats
-sleep 10
-m5 dumpstats
    ./run \
    +  --arch aarch64 \
    +  --emulator gem5 \
    +  --userland userland/arch/aarch64/freestanding/linux/hello.S \
    +  --trace-insts-stdout \
    +;
-and then:

+which shows:

-./gem5-stat --arch aarch64
          0: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    +    500: system.cpu: A0 T0 : @asm_main_after_prologue+4    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  flags=(IsInteger)
    +   1000: system.cpu: A0 T0 : @asm_main_after_prologue+8    :   ldr   w2, #4194464       : MemRead :  D=0x0000000000000006 A=0x4000a0  flags=(IsInteger|IsMemRef|IsLoad)
    +   1500: system.cpu: A0 T0 : @asm_main_after_prologue+12    :   movz   x8, #64, #0       : IntAlu :  D=0x0000000000000040  flags=(IsInteger)
    +   2000: system.cpu: A0 T0 : @asm_main_after_prologue+16    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
    +hello
    +   2500: system.cpu: A0 T0 : @asm_main_after_prologue+20    :   movz   x0, #0, #0        : IntAlu :  D=0x0000000000000000  flags=(IsInteger)
    +   3000: system.cpu: A0 T0 : @asm_main_after_prologue+24    :   movz   x8, #93, #0       : IntAlu :  D=0x000000000000005d  flags=(IsInteger)
    +   3500: system.cpu: A0 T0 : @asm_main_after_prologue+28    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
-TODO: why doesn’t this exist:

+so we see that it runs one instruction every 500 ps, which corresponds to 2GHz.
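The tick arithmetic, spelled out (assuming gem5’s default resolution of one tick per picosecond):

    # One instruction retires every 500 ticks in the trace above.
    tick_seconds = 1e-12          # gem5 default: 10^12 ticks per second
    ticks_per_inst = 500
    print(1 / (ticks_per_inst * tick_seconds))  # 2e9 Hz == 2GHz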


So if we change the frequency to, say, 1GHz and re-run it:

    ./run \
    +  --arch aarch64 \
    +  --emulator gem5 \
    +  --userland userland/arch/aarch64/freestanding/linux/hello.S \
    +  --trace-insts-stdout \
    +  -- \
    +  --cpu-clock 1GHz \
    +;

    we get as expected:

          0: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  flags=(IsInteger)
    +   1000: system.cpu: A0 T0 : @asm_main_after_prologue+4    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  flags=(IsInteger)
    +   2000: system.cpu: A0 T0 : @asm_main_after_prologue+8    :   ldr   w2, #4194464       : MemRead :  D=0x0000000000000006 A=0x4000a0  flags=(IsInteger|IsMemRef|IsLoad)
    +   3000: system.cpu: A0 T0 : @asm_main_after_prologue+12    :   movz   x8, #64, #0       : IntAlu :  D=0x0000000000000040  flags=(IsInteger)
    +   4000: system.cpu: A0 T0 : @asm_main_after_prologue+16    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
    +hello
    +   5000: system.cpu: A0 T0 : @asm_main_after_prologue+20    :   movz   x0, #0, #0        : IntAlu :  D=0x0000000000000000  flags=(IsInteger)
    +   6000: system.cpu: A0 T0 : @asm_main_after_prologue+24    :   movz   x8, #93, #0       : IntAlu :  D=0x000000000000005d  flags=(IsInteger)
    +   7000: system.cpu: A0 T0 : @asm_main_after_prologue+28    :   svc   #0x0               : IntAlu :   flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)

As of gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1, like the gem5 cache size, the clock frequency does not get propagated to the guest, and is not for example visible at:

    @@ -19145,6 +19297,26 @@ m5 dumpstats

    breaks when m5 is run on guest, but does not show the source code.


    19.4.3. gem5 GDB step debug secondary cores


gem5’s secondary core GDB setup is a hack: it spawns one gdbserver per core, each on a separate port, e.g. 7000, 7001, etc.


Partly because of this, it is basically unusable: you can’t attach to a core that is stopped, either because it hasn’t been initialized yet, or because you are already debugging another core.


    This affects both full system and userland, and is described in more detail at: https://gem5.atlassian.net/browse/GEM5-626


    In LKMC 0a3ce2f41f12024930bcdc74ff646b66dfc46999, we can easily test attaching to another core by passing --run-id, e.g. to connect to the second core we can use --run-id 1:

    ./run-gdb --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --run-id 1

    19.5. gem5 checkpoint

    @@ -19527,7 +19699,7 @@ cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"
At gem5 2235168b72537535d74c645a70a85479801e0651, the first run does everything in AtomicSimpleCPU:

    @@ -20733,6 +20905,67 @@ xdg-open "$(./getvar --arch arm --emulator gem5 m5out_dir)/config.dot.svg"

An example of such a file can be seen at: config.dot.svg for a TimingSimpleCPU without caches.


    On Ubuntu 20.04, you can also see the dot file "directly" with xdot:

    xdot "$(./getvar --arch arm --emulator gem5 m5out_dir)/config.dot"

which is really cool because it allows you to follow the graph arrows with clicks.


It is worth noting that if you are running a bunch of short simulations, dot/SVG/PDF generation can have a significant impact on simulation startup time, so it is something to watch out for. As per https://gem5-review.googlesource.com/c/public/gem5/+/29232 it can be turned off with:

    gem5.opt --dot-config=''

    or in LKMC:

    ./run --gem5-exe-args='--dot-config= --json-config= --dump-config='

    The time difference can be readily observed on minimal examples by running gem5 with time.


By looking into gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1 src/python/m5/util/dot_writer.py we can try to remove the SVG/PDF conversion to see if those dominate the runtime:

    def do_dot(root, outdir, dotFilename):
    +    if not pydot:
    +        warn("No dot file generated. " +
    +             "Please install pydot to generate the dot file and pdf.")
    +        return
    +    # * use ranksep > 1.0 for for vertical separation between nodes
    +    # especially useful if you need to annotate edges using e.g. visio
    +    # which accepts svg format
    +    # * no need for hoizontal separation as nothing moves horizonally
    +    callgraph = pydot.Dot(graph_type='digraph', ranksep='1.3')
    +    dot_create_nodes(root, callgraph)
    +    dot_create_edges(root, callgraph)
    +    dot_filename = os.path.join(outdir, dotFilename)
    +    callgraph.write(dot_filename)
    +    try:
    +        # dot crashes if the figure is extremely wide.
    +        # So avoid terminating simulation unnecessarily
    +        callgraph.write_svg(dot_filename + ".svg")
    +        callgraph.write_pdf(dot_filename + ".pdf")
    +    except:
    +        warn("failed to generate dot output from %s", dot_filename)

but nope, they don’t: dot_create_nodes and dot_create_edges are the culprits, so the only way to gain speed is to remove .dot generation altogether. It is tempting to do this by default on LKMC and add an option to enable dot generation when desired so we can be a bit faster by default… but I’m too lazy to document the option right now. When it annoys me further, maybe :-)
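A minimal sketch of what such a hack could look like, assuming we simply short-circuit do_dot in src/python/m5/util/dot_writer.py (hypothetical local patch, not upstream):

    def do_dot(root, outdir, dotFilename):
        # Hypothetical LKMC hack: dot_create_nodes/dot_create_edges dominate
        # the startup cost, so skip all dot generation unconditionally.
        return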

    @@ -21577,7 +21810,7 @@ class SystemXBar(CoherentXBar):

    Simple abstract CPU without a pipeline.

-They are therefore completely unrealistic. But they also run much faster.
+They are therefore completely unrealistic. But they also run much faster. KVM CPUs are an alternative way of fast forwarding boot when they work.

    Implementations:

@@ -21585,13 +21818,27 @@ class SystemXBar(CoherentXBar):
19.16.1.1.1. gem5 AtomicSimpleCPU

    AtomicSimpleCPU: the default one. Memory accesses happen instantaneously. The fastest simulation except for KVM, but not realistic at all.

19.16.1.1.2. gem5 TimingSimpleCPU

    TimingSimpleCPU: memory accesses are realistic, but the CPU has no pipeline. The simulation is faster than detailed models, but slower than AtomicSimpleCPU.

    @@ -21601,11 +21848,6 @@ class SystemXBar(CoherentXBar):

    Caches do make a difference here of course, and lead to much faster memory return times.

-KVM CPUs are an alternative way of fast forwarding boot when they work.

    @@ -21629,10 +21871,13 @@ class SystemXBar(CoherentXBar):

    Its 4 stage pipeline is described at the "MinorCPU" section of gem5 ARM RSK.


    A commented execution example can be seen at: gem5 event queue MinorCPU syscall emulation freestanding example analysis.


    There is also an in-tree doxygen at: src/doc/inside-minor.doxygen and rendered at: http://pages.cs.wisc.edu/~swilson/gem5-docs/minor.html

As of 2019, in-order cores are mostly present in low power/cost contexts, for example the little cores of ARM bigLITTLE.

    The following models extend the MinorCPU class by parametrization to make it match existing CPUs more closely:

    @@ -21674,12 +21919,15 @@ class SystemXBar(CoherentXBar):
    19.16.1.3. gem5 DerivO3CPU
Generic out-of-order core. "O3" stands for "Out Of Order"!

Analogous to MinorCPU, but modelling an out-of-order core instead of an in-order one.


    A commented execution example can be seen at: gem5 event queue DerivO3CPU syscall emulation freestanding example analysis.


    Existing parametrizations:

    @@ -22372,7 +22620,7 @@ for source in PySource.all:
At gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis we see for example that at the beginning of an AtomicSimpleCPU simulation, gem5 sets up exactly two events:

      @@ -22979,7 +23227,7 @@ Exiting @ tick 3500 because exiting with last active thread context

TODO: analyze better what each of the memory events means. For now, we have just collected a bunch of data there, but it needs interpreting. The CPU specifics in this section are already insightful however.

TimingSimpleCPU should be the second simplest CPU to analyze, so let’s give it a try:

    @@ -25064,6 +25312,21 @@ namespace ArmISAInst {
19.20.5.2. gem5 microops

    TODO


    Some gem5 instructions break down into multiple microops.


Microops are very similar to regular instructions, and show up in the gem5 ExecAll trace format, since that flag implies ExecMicro.


    On aarch64 for example, one of the simplest microoped instructions is STP, which does the relatively complex operation of storing two values to memory at once, and is therefore a good candidate for being broken down into microops.
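As a rough way of spotting them in a trace, a minimal sketch that filters ExecAll output for microops; it assumes microops show up with a .N index appended to the PC field (e.g. @main+16.1), so the regex may need adjusting to the trace you actually see:

    import re
    import sys

    # Keep only lines whose PC field carries a microop index such as
    # "@asm_main_after_prologue+16.1" (assumed ExecAll microop format).
    microop_re = re.compile(r'@\S+\.\d+\s')
    for line in sys.stdin:
        if microop_re.search(line):
            print(line, end='')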


    19.20.6. gem5 port system

    @@ -26128,7 +26391,45 @@ readFunc(SyscallDesc *desc, ThreadContext *tc,
-19.20.8. gem5 code generation
+19.20.8. gem5 functional units


    TODO


    Each instruction is marked with a class, and each class can execute in a given functional unit.


Which units are available is visible for example in the gem5 config.ini of a gem5 MinorCPU run. Functional units are not present in simple CPUs like gem5 TimingSimpleCPU.


For example, on gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1, the config.ini of a MinorCPU run:

    ./run   \
    +  --arch aarch64 \
    +  --emulator gem5 \
    +  --userland userland/arch/aarch64/freestanding/linux/hello.S \
    +  --trace-insts-stdout \
    +  -N1 \
    +  -- \
    +  --cpu-type MinorCPU \
    +  --caches

    contains:

    [system.cpu]
    +type=MinorCPU
    +children=branchPred dcache dtb executeFuncUnits icache interrupts isa itb power_state tracer workload
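Since config.ini is a plain INI file, the functional unit pool can also be inspected programmatically; a minimal sketch with Python’s configparser, where the section and option names (e.g. that the pool lives under system.cpu.executeFuncUnits with a funcUnits vector parameter) are assumptions based on the layout above:

    import configparser

    # Each SimObject in the hierarchy is one INI section; vector parameters
    # such as funcUnits are space-separated lists of child section names.
    config = configparser.ConfigParser()
    config.read('m5out/config.ini')

    pool = config['system.cpu.executeFuncUnits']
    for unit in pool['funcUnits'].split():
        print(unit)
        # The unit's own section then describes it, e.g. its opClasses child.
        for key, value in config[unit].items():
            print(' ', key, '=', value)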

    19.20.9. gem5 code generation

    gem5 uses a ton of code generation, which makes the project horrendous:

    @@ -26173,7 +26474,7 @@ readFunc(SyscallDesc *desc, ThreadContext *tc,

    But it has been widely overused to insanity. It likely also exists partly because when the project started in 2003 C++ compilers weren’t that good, so you couldn’t rely on features like templates that much.

-19.20.8.1. gem5 THE_ISA
+19.20.9.1. gem5 THE_ISA

    Generated code at: build/<ISA>/config/the_isa.hh which e.g. for ARM contains:

    @@ -26219,9 +26520,9 @@ enum class Arch {
-19.20.9. gem5 build system
+19.20.10. gem5 build system

-19.20.9.2. gem5 build broken on recent compiler version
+19.20.10.2. gem5 build broken on recent compiler version

    gem5 moves a bit slowly, and if your host compiler is very new, the gem5 build might be broken for it, e.g. this was the case for Ubuntu 19.10 with GCC 9 and gem5 62d75e7105fe172eb906d4f80f360ff8591d4178 from Dec 2019.

    @@ -26261,7 +26562,7 @@ enum class Arch {
-19.20.9.3. gem5 polymorphic ISA includes
+19.20.10.3. gem5 polymorphic ISA includes

    E.g. src/cpu/decode_cache.hh includes:

    @@ -26340,7 +26641,7 @@ build/ARM/config/the_isa.hh
-19.20.9.4. Why are all C++ files symlinked into the gem5 build dir?
+19.20.10.4. Why are all C++ files symlinked into the gem5 build dir?
    @@ -27262,6 +27563,9 @@ cd ../..
-21.3.1. unistd.h
+21.3.1. Environment variables


    POSIX C example that prints all environment variables: userland/posix/environ.c
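For comparison, a minimal Python equivalent of that example (not part of LKMC, just an illustration of the same interface):

    import os

    # Print every environment variable as KEY=value, like walking the
    # POSIX environ array does in C.
    for key, value in os.environ.items():
        print(f'{key}={value}')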


    21.3.2. unistd.h

@@ -28269,7 +28579,7 @@ non-atomic 19
-21.3.2. fork
+21.3.3. fork

    POSIX' multiprocess API. Contrast with pthreads which are for threads.

    @@ -28294,7 +28604,7 @@ fork() return = 13039

    Read the source comments and understand everything that is going on!

-21.3.2.1. getpid
+21.3.3.1. getpid

    The minimal interesting example is to use fork and observe different PIDs.

    @@ -28306,7 +28616,7 @@ fork() return = 13039
-21.3.2.2. Fork bomb
+21.3.3.2. Fork bomb
    @@ -28341,7 +28651,7 @@ fork() return = 13039
-21.3.3. pthreads
+21.3.4. pthreads

    POSIX' multithreading API. Contrast with fork which is for processes.

    @@ -28365,7 +28675,7 @@ fork() return = 13039
-21.3.3.1. pthread_mutex
+21.3.4.1. pthread_mutex

    userland/posix/pthread_count.c exemplifies the functions:

@@ -28402,7 +28712,7 @@ There are no non-locking atomic types or atomic primitives in POSIX:

-21.3.4. sysconf
+21.3.5. sysconf

@@ -28416,6 +28726,9 @@ There are no non-locking atomic types or atomic primitives in POSIX:

userland/linux/sysconf.c showcases Linux extensions to POSIX


    Note that this blows up on gem5 userland due to NPROCESSORS_ONLN however: https://gem5.atlassian.net/browse/GEM5-622

    @@ -28433,9 +28746,19 @@ There are no non-locking atomic types or atomic primitives in POSIX:

    getconf is also specified by POSIX at: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/getconf.html but not the -a option which shows all configurations.


Busybox 1.31.1 clearly states at docs/posix_conformance.txt that getconf is not implemented, however:

    POSIX Tools not supported:
    +  asa, at, batch, bc, c99, command, compress, csplit, ex, fc, file,
    +  gencat, getconf, iconv, join, link, locale, localedef, lp, m4,
-21.3.5. mmap
+21.3.6. mmap

    The mmap system call allows advanced memory operations.

@@ -28446,7 +28769,7 @@ There are no non-locking atomic types or atomic primitives in POSIX:
-21.3.5.1. mmap MAP_ANONYMOUS
+21.3.6.1. mmap MAP_ANONYMOUS

Basic mmap example, which does the same as userland/c/malloc.c, but with mmap.

@@ -28464,7 +28787,7 @@ There are no non-locking atomic types or atomic primitives in POSIX:
-21.3.5.2. mmap file
+21.3.6.2. mmap file

    Memory mapped file example: userland/posix/mmap_file.c

@@ -28476,7 +28799,7 @@ There are no non-locking atomic types or atomic primitives in POSIX:
-21.3.5.3. brk
+21.3.6.3. brk

    Previously POSIX, but was deprecated in favor of malloc

@@ -28492,7 +28815,7 @@ There are no non-locking atomic types or atomic primitives in POSIX:

-21.3.6. socket
+21.3.7. socket

    A bit like read and write, but from / to the Internet!

@@ -29788,7 +30111,7 @@ git clean -xdf .

@@ -37925,7 +38248,7 @@ instructions 124346081

    For now we can just run on gem5 to estimate the instruction count per input size and extrapolate?

-For example, the simplest scalable CPU content would be a busy loop: userland/gcc/busy_loop.c, so let’s start by analyzing that one.
+For example, the simplest scalable CPU content would be a C busy loop, so let’s start by analyzing that one.

    Summary of manually collected results on P51 at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: Table 7, “Busy loop MIPS for different simulator setups”. As expected, the less native / more detailed / more complex simulations are slower!

    Table 7. Busy loop MIPS for different simulator setups
Each row lists, where the table provided them: description, time (s), instruction count, approximate MIPS, plus LKMC/gem5 versions, program, command, CLI args and host. Rows whose description cell did not survive extraction are given as bare numbers.

  • QEMU busy loop (the QEMU run computed at the end of this section): 68 s, 1.1 * 10^11 instructions (approx), 2000 MIPS
  • gem5 busy loop: 18 s, 2.4005699 * 10^7 instructions, 1.3 MIPS; LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1, a18f28e263c91362519ef550150b5c9d75fa3679; userland/gcc/busy_loop.c -O0; ./run --arch aarch64 --emulator gem5 --static --userland userland/gcc/busy_loop.c --cli-args 1000000; CLI args 10^6
  • gem5 empty C program statically linked: 1 s, 5475 instructions, 0 MIPS; LKMC eb22fd3b6e7fff7e9ef946a88b208debf5b419d5, gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1; userland/c/empty.c -O0; ./run --arch aarch64 --emulator gem5 --static --userland userland/c/empty.c; Ubuntu 20.04
  • gem5 empty C program dynamically linked: 1 s, 106999 instructions, 0 MIPS; LKMC eb22fd3b6e7fff7e9ef946a88b208debf5b419d5, gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1; userland/c/empty.c -O0; ./run --arch aarch64 --emulator gem5 --userland userland/c/empty.c; Ubuntu 20.04
  • gem5 busy loop for a debug build: 33 s, 2.405682 * 10^6 instructions, 0.07 MIPS
  • gem5 busy loop for a fast build: 15 s, 2.4005699 * 10^7 instructions, 1.6 MIPS
  • gem5 busy loop for a TimingSimpleCPU: 26 s, 2.4005699 * 10^7 instructions, 0.9 MIPS
  • gem5 busy loop for a MinorCPU: 31 s, 1.1018152 * 10^7 instructions, 0.4 MIPS
  • gem5 busy loop for a DerivO3CPU: 52 s, 1.1018128 * 10^7 instructions, 0.2 MIPS
  • 63 s, 1.1005150 * 10^7 instructions, 0.2 MIPS
  • 68 s, 9.2034139 * 10^7 instructions, 1.6 MIPS
  • 64 s, 9.9674773 * 10^7 instructions, 1.6 MIPS
  • glibc C pre-main effects: 2 s, 1.26479 * 10^5 instructions, 0.05 MIPS
  • 2 s, 1.26479 * 10^5 instructions, 0.05 MIPS
  • 2 s, 2.385012 * 10^6 instructions, 1 MIPS
  • 25 s, 2.385012 * 10^6 instructions, 0.1 MIPS
  • gem5 optimized build, immediate exit on first instruction to benchmark the simulator startup time: 1 s, 1 instruction
  • same as above but debug build: 1 s, 1 instruction
  • Check the effect of an ExecAll log (log every instruction) on execution time, compare to the analogous run without it. trace.txt size: 3.5GB. 5x slowdown observed with output to a hard disk: 136 s, 2.4106774 * 10^7 instructions, 0.2 MIPS
  • Same as above but with the run command manually hacked to output to a ramfs. Slightly faster, but the bulk was still just in log format operations! 107 s, 2.4106774 * 10^7 instructions, 0.2 MIPS; LKMC d29a07ddad499f273cc90dd66e40f8474b5dfc40; userland/gcc/busy_loop.c -O0; ./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --cli-args 1000000 --gem5-worktree master --trace ExecAll; CLI args 10^6
    @@ -38166,77 +38541,7 @@ instructions 124346081

    so ~ 110 million instructions / 100 seconds makes ~ 1 MIPS (million instructions per second).

-This experiment also suggests that each loop is about 11 instructions long (110M instructions / 10M loops), so we look at the disassembly:
-
-./run-toolchain --arch aarch64 gdb -- -batch -ex 'disas busy_loop' "$(./getvar --arch aarch64 userland_build_dir)/gcc/busy_loop.out"
-which contains:
    8       ) {
    -   0x0000000000400698 <+0>:     ff 83 00 d1     sub     sp, sp, #0x20
    -   0x000000000040069c <+4>:     e0 07 00 f9     str     x0, [sp, #8]
    -   0x00000000004006a0 <+8>:     e1 03 00 f9     str     x1, [sp]
    -
    -9           for (unsigned i = 0; i < max; i++) {
    -   0x00000000004006a4 <+12>:    ff 1f 00 b9     str     wzr, [sp, #28]
    -   0x00000000004006a8 <+16>:    11 00 00 14     b       0x4006ec <busy_loop+84>
    -
    -10              for (unsigned j = 0; j < max2; j++) {
    -   0x00000000004006ac <+20>:    ff 1b 00 b9     str     wzr, [sp, #24]
    -   0x00000000004006b0 <+24>:    08 00 00 14     b       0x4006d0 <busy_loop+56>
    -
    -11                  __asm__ __volatile__ ("" : "+g" (j), "+g" (j) : :);
    -   0x00000000004006b4 <+28>:    e1 1b 40 b9     ldr     w1, [sp, #24]
    -   0x00000000004006b8 <+32>:    e0 1b 40 b9     ldr     w0, [sp, #24]
    -   0x00000000004006bc <+36>:    e1 1b 00 b9     str     w1, [sp, #24]
    -   0x00000000004006c0 <+40>:    e0 17 00 b9     str     w0, [sp, #20]
    -
    -10              for (unsigned j = 0; j < max2; j++) {
    -   0x00000000004006c4 <+44>:    e0 17 40 b9     ldr     w0, [sp, #20]
    -   0x00000000004006c8 <+48>:    00 04 00 11     add     w0, w0, #0x1
    -   0x00000000004006cc <+52>:    e0 1b 00 b9     str     w0, [sp, #24]
    -   0x00000000004006d0 <+56>:    e0 1b 40 b9     ldr     w0, [sp, #24]
    -   0x00000000004006d4 <+60>:    e1 03 40 f9     ldr     x1, [sp]
    -   0x00000000004006d8 <+64>:    3f 00 00 eb     cmp     x1, x0
    -   0x00000000004006dc <+68>:    c8 fe ff 54     b.hi    0x4006b4 <busy_loop+28>  // b.pmore
    -
    -9           for (unsigned i = 0; i < max; i++) {
    -   0x00000000004006e0 <+72>:    e0 1f 40 b9     ldr     w0, [sp, #28]
    -   0x00000000004006e4 <+76>:    00 04 00 11     add     w0, w0, #0x1
    -   0x00000000004006e8 <+80>:    e0 1f 00 b9     str     w0, [sp, #28]
    -   0x00000000004006ec <+84>:    e0 1f 40 b9     ldr     w0, [sp, #28]
    -   0x00000000004006f0 <+88>:    e1 07 40 f9     ldr     x1, [sp, #8]
    -   0x00000000004006f4 <+92>:    3f 00 00 eb     cmp     x1, x0
    -   0x00000000004006f8 <+96>:    a8 fd ff 54     b.hi    0x4006ac <busy_loop+20>  // b.pmore
    -
    -12              }
    -13          }
    -14      }
    -   0x00000000004006fc <+100>:   1f 20 03 d5     nop
    -   0x0000000000400700 <+104>:   ff 83 00 91     add     sp, sp, #0x20
    -   0x0000000000400704 <+108>:   c0 03 5f d6     ret

    We look for the internal backwards jumps, and we find two:

       0x00000000004006dc <+68>:    c8 fe ff 54     b.hi    0x4006b4 <busy_loop+28>  // b.pmore
    -   0x00000000004006f8 <+96>:    a8 fd ff 54     b.hi    0x4006ac <busy_loop+20>  // b.pmore

    and so clearly the one at 0x4006dc happens first and jumps to a larger address than the other one, so the internal loop must be between 4006dc and 4006b4, which contains exactly 11 instructions! Bingo!


    Oh my God, unoptimized code is so horrendously inefficient, even I can’t stand all those useless loads and stores to memory variables!!!


    This experiment also suggests that each loop is about 11 instructions long (110M instructions / 10M loops), which we confirm at Section 31.2, “C busy loop”, bingo!

Then for QEMU, we experimentally turn the number of loops up to 10^10 loops (100000 100000), which contains an expected 11 * 10^10 instructions, and the runtime is 00:01:08, so we have 1.1 * 10^11 instructions / 68 seconds ~ 2 * 10^9 = 2000 MIPS!
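Replaying that arithmetic (plain Python, numbers from the run above):

    loops = 10**10        # --cli-args '100000 100000'
    insts = 11 * loops    # 11 instructions per loop iteration
    seconds = 68          # runtime 00:01:08
    print(insts / seconds / 1e6, 'MIPS')  # ~1.6e3, on the order of 2000 MIPS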

    @@ -38701,13 +39006,86 @@ west build -b qemu_aarch64 samples/hello_world
-31.2. Infinite busy loop
+31.2. C busy loop


    Disassembly analysis:

    ./run-toolchain --arch aarch64 gdb -- -nh -batch -ex 'disas/rs busy_loop' "$(./getvar --arch aarch64 userland_build_dir)/gcc/busy_loop.out"

    which contains at LKMC eb22fd3b6e7fff7e9ef946a88b208debf5b419d5:

    10      ) {
    +   0x0000000000400700 <+0>:     ff 83 00 d1     sub     sp, sp, #0x20
    +   0x0000000000400704 <+4>:     e0 07 00 f9     str     x0, [sp, #8]
    +   0x0000000000400708 <+8>:     e1 03 00 f9     str     x1, [sp]
    +
    +11          for (unsigned long long i = 0; i < max2; i++) {
    +   0x000000000040070c <+12>:    ff 0f 00 f9     str     xzr, [sp, #24]
    +   0x0000000000400710 <+16>:    11 00 00 14     b       0x400754 <busy_loop+84>
    +
    +12              for (unsigned long long j = 0; j < max; j++) {
    +   0x0000000000400714 <+20>:    ff 0b 00 f9     str     xzr, [sp, #16]
    +   0x0000000000400718 <+24>:    08 00 00 14     b       0x400738 <busy_loop+56>
    +
    +13                  __asm__ __volatile__ ("" : "+g" (i), "+g" (j) : :);
    +   0x000000000040071c <+28>:    e1 0f 40 f9     ldr     x1, [sp, #24]
    +   0x0000000000400720 <+32>:    e0 0b 40 f9     ldr     x0, [sp, #16]
    +   0x0000000000400724 <+36>:    e1 0f 00 f9     str     x1, [sp, #24]
    +   0x0000000000400728 <+40>:    e0 0b 00 f9     str     x0, [sp, #16]
    +
    +12              for (unsigned long long j = 0; j < max; j++) {
    +   0x000000000040072c <+44>:    e0 0b 40 f9     ldr     x0, [sp, #16]
    +   0x0000000000400730 <+48>:    00 04 00 91     add     x0, x0, #0x1
    +   0x0000000000400734 <+52>:    e0 0b 00 f9     str     x0, [sp, #16]
    +   0x0000000000400738 <+56>:    e1 0b 40 f9     ldr     x1, [sp, #16]
    +   0x000000000040073c <+60>:    e0 07 40 f9     ldr     x0, [sp, #8]
    +   0x0000000000400740 <+64>:    3f 00 00 eb     cmp     x1, x0
    +   0x0000000000400744 <+68>:    c3 fe ff 54     b.cc    0x40071c <busy_loop+28>  // b.lo, b.ul, b.last
    +
    +11          for (unsigned long long i = 0; i < max2; i++) {
    +   0x0000000000400748 <+72>:    e0 0f 40 f9     ldr     x0, [sp, #24]
    +   0x000000000040074c <+76>:    00 04 00 91     add     x0, x0, #0x1
    +   0x0000000000400750 <+80>:    e0 0f 00 f9     str     x0, [sp, #24]
    +   0x0000000000400754 <+84>:    e1 0f 40 f9     ldr     x1, [sp, #24]
    +   0x0000000000400758 <+88>:    e0 03 40 f9     ldr     x0, [sp]
    +   0x000000000040075c <+92>:    3f 00 00 eb     cmp     x1, x0
    +   0x0000000000400760 <+96>:    a3 fd ff 54     b.cc    0x400714 <busy_loop+20>  // b.lo, b.ul, b.last
    +
    +14              }
    +15          }
    +16      }
    +   0x0000000000400764 <+100>:   1f 20 03 d5     nop
    +   0x0000000000400768 <+104>:   ff 83 00 91     add     sp, sp, #0x20
    +   0x000000000040076c <+108>:   c0 03 5f d6     ret

    We look for the internal backwards jumps, and we find two:

   0x0000000000400744 <+68>:    c3 fe ff 54     b.cc    0x40071c <busy_loop+28>  // b.lo, b.ul, b.last
   0x0000000000400760 <+96>:    a3 fd ff 54     b.cc    0x400714 <busy_loop+20>  // b.lo, b.ul, b.last

and so clearly the one at 0x400744 happens first and jumps to a larger address than the other one, so the internal loop must be between 0x40071c and 0x400744, which contains exactly 11 instructions.


    Oh my God, unoptimized code is so horrendously inefficient, even I can’t stand all those useless loads and stores to memory variables!!!

    @@ -38715,7 +39093,56 @@ west build -b qemu_aarch64 samples/hello_world

    32. Computer architecture

-32.1. Hardware threads
+32.1. Instruction pipelining


In gem5, this can be seen on:


    32.1.1. Classic RISC pipeline

    + +
    +

    gem5’s gem5 MinorCPU implements a similar but 4 stage pipeline. TODO why didn’t they go with the classic RISC pipeline instead?

    +
    +
    +
    +
    +

    32.2. Superscalar processor

    + + +
    +

    You basically decode

    +
    +
    +

    TODO in gem5? gem5 definitely has functional units explicitly modelled: gem5 functional units, so do gem5 MinorCPU or gem5 DerivO3CPU have it?

    +
    +
    + +
    +

    32.4. Hardware threads

    Intel name: "Hyperthreading"

    @@ -38765,7 +39192,7 @@ west build -b qemu_aarch64 samples/hello_world
    -

    32.2. Cache coherence

    +

    32.5. Cache coherence

    @@ -38807,7 +39234,7 @@ west build -b qemu_aarch64 samples/hello_world

    Even if caches are coherent, this is still not enough to avoid data race conditions, because this does not enforce atomicity of read modify write sequences. This is for example shown at: Detailed gem5 analysis of how data races happen.

-32.2.1. Memory consistency
+32.5.1. Memory consistency

    According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf "memory consistency" is about ordering requirements of different memory addresses.

    @@ -38815,14 +39242,14 @@ west build -b qemu_aarch64 samples/hello_world

This is represented explicitly in C++, for example by C++ std::memory_order.

-32.2.1.1. Sequential Consistency
+32.5.1.1. Sequential Consistency

    According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf, the strongest possible consistency, everything nicely ordered as you’d expect.

-32.2.2. Can caches snoop data from other caches?
+32.5.2. Can caches snoop data from other caches?

    Either they can snoop only control, or both control and data can be snooped.

    @@ -38837,7 +39264,7 @@ west build -b qemu_aarch64 samples/hello_world
-32.2.3. VI cache coherence protocol
+32.5.3. VI cache coherence protocol

    Mentioned at:

    @@ -39084,7 +39511,7 @@ west build -b qemu_aarch64 samples/hello_world
-32.2.4. MSI cache coherence protocol
+32.5.4. MSI cache coherence protocol

    @@ -39396,7 +39823,7 @@ CACHE2 S nyy

    TODO gem5 concrete example.
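While the gem5 example is TODO, a minimal software sketch of the textbook MSI state machine (stable states only, no transient states) makes the transitions concrete:

    # (state, event) -> (next state, bus action);
    # states: Modified / Shared / Invalid.
    MSI = {
        ('I', 'PrRd'):   ('S', 'issue BusRd'),
        ('I', 'PrWr'):   ('M', 'issue BusRdX'),
        ('S', 'PrWr'):   ('M', 'issue BusUpgr'),
        ('S', 'BusRdX'): ('I', 'invalidate'),
        ('M', 'BusRd'):  ('S', 'flush (write back)'),
        ('M', 'BusRdX'): ('I', 'flush, invalidate'),
    }

    state = 'I'
    for event in ['PrRd', 'PrWr', 'BusRd']:
        state, action = MSI[(state, event)]
        print(f'{event}: -> {state} ({action})')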

-32.2.4.1. MSI cache coherence protocol with transient states
+32.5.4.1. MSI cache coherence protocol with transient states

    TODO understand well why those are needed.

    @@ -39416,7 +39843,7 @@ CACHE2 S nyy
-32.2.5. MESI cache coherence protocol
+32.5.5. MESI cache coherence protocol

    @@ -39476,7 +39903,7 @@ CACHE2 S nyy
-32.2.6. MOSI cache coherence protocol
+32.5.6. MOSI cache coherence protocol

    https://en.wikipedia.org/wiki/MOSI_protocol The critical MSI vs MOSI section was a bit bogus though: https://en.wikipedia.org/w/index.php?title=MOSI_protocol&oldid=895443023 but I edited it :-)

    @@ -39536,7 +39963,7 @@ CACHE2 S nyy
-32.2.7. MOESI cache coherence protocol
+32.5.7. MOESI cache coherence protocol

    @@ -41117,7 +41544,7 @@ echo $?

    gem5: m5 fail works on all archs

  • -

    user mode: QEMU forwards exit status, for gem5 we do some log parsing as described at: Section 10.7.1, “gem5 syscall emulation exit status”

    +

    user mode: QEMU forwards exit status, for gem5 we do some log parsing as described at: Section 10.7.2, “gem5 syscall emulation exit status”