From e11bc6eb0e5c8926e5612621a3581219439a96b1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ciro=20Santilli=20=E5=85=AD=E5=9B=9B=E4=BA=8B=E4=BB=B6=20?= =?UTF-8?q?=E6=B3=95=E8=BD=AE=E5=8A=9F?=
Date: Wed, 29 Jul 2020 01:00:00 +0000
Subject: [PATCH] a0d6fa15a207cb40cd8ce090c77aa9b55d7605a6

---
 index.html | 478 +++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 444 insertions(+), 34 deletions(-)

diff --git a/index.html b/index.html
index d8388ed..3ac4aeb 100644
--- a/index.html
+++ b/index.html
@@ -2057,7 +2057,21 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 29.3. Benchmark machines
  • 29.4. Benchmark Internets
    @@ -18343,7 +18357,7 @@ less "$(./getvar gem5_source_dir)/src/cpu/exetrace.cc"

    We can make the trace smaller by naming the trace file trace.txt.gz, which enables GZIP compression, but that is not currently exposed in our scripts, since you usually just need something human readable to work on.

    -

    Enabling tracing made the runtime about 4x slower on the P51, with or without .gz compression.

    +

    Enabling tracing made the runtime about 4x slower on the 2017 Lenovo ThinkPad P51, with or without .gz compression.

    Trace the source lines just like for QEMU with:

    @@ -21193,7 +21207,7 @@ system.cpu.dtb.inst_hits

    and there, indeed, we see that the file size fell from 39MB for stats.txt to 3.2MB for stats.m5, so the increase observed previously was just due to some initial size overhead (considering the patched gem5 with no spaces in the text file).

    -

    We also note however that the stat dump made the such a simulation that just loops and dumps considerably slower, from 3s to 15s on P51. Fascinating, we are definitely not disk bound there.

    +

    We also note however that the stat dump made such a simulation that just loops and dumps considerably slower, from 3s to 15s on the 2017 Lenovo ThinkPad P51. Fascinating: we are definitely not disk bound there.

    We enable HDF5 on the build by default with USE_HDF5=1. To disable it, you can add USE_HDF5=0 to the build as in:

    @@ -21639,7 +21653,7 @@ xdg-open "$(./getvar --arch arm --emulator gem5 m5out_dir)/config.dot.svg"
    -

    Sample run time: 87 minutes on P51 Ubuntu 20.04 gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1.

    +

    Sample run time: 87 minutes on 2017 Lenovo ThinkPad P51 Ubuntu 20.04 gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1.

    After the first run has downloaded the test binaries for you, you can speed up the process a little bit by skipping a useless SCons call:

    @@ -22615,7 +22629,7 @@ less o3pipeview.tmp.log
    mkdir aarch-system-201901106
     cd aarch-system-201901106
    -wget http://www.gem5.org/dist/current/arm/aarch-system-201901106.tar.bz2
    +wget http://dist.gem5.org/dist/current/arm/aarch-system-201901106.tar.bz2
     tar xvf aarch-system-201901106.tar.bz2
     cd ..
     ./run --arch aarch64 --emulator gem5 --linux-exec aarch-system-201901106/binaries/vmlinux.arm64
    @@ -28479,7 +28493,7 @@ build/ARM/config/the_isa.hh
    git submodule update --init submodules/gensim-simulator
     sudo apt install libantlr3c-dev
    -cd submodule/gensim
    +cd submodules/gensim-simulator
     make
    @@ -28525,12 +28539,12 @@ Aborted (core dumped)
    -
    cd /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/models/armv8 && \
    -  /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/build/dist/bin/gensim \
    -  -a /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/models/armv8/aarch64.ac \
    +
    cd /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim-simulator/models/armv8 && \
    +  /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim-simulator/build/dist/bin/gensim \
    +  -a /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim-simulator/models/armv8/aarch64.ac \
       -s module,arch,decode,disasm,ee_interp,ee_blockjit,jumpinfo,function,makefile \
    -  -o decode.GenerateDotGraph=1,makefile.libtrace_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/support/libtrace/inc,makefile.archsim_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/archsim/inc,makefile.llvm_path=,makefile.Optimise=2,makefile.Debug=1 \
    -  -t /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/build/models/armv8/output-aarch64/
    +  -o decode.GenerateDotGraph=1,makefile.libtrace_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim-simulator/support/libtrace/inc,makefile.archsim_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim-simulator/archsim/inc,makefile.llvm_path=,makefile.Optimise=2,makefile.Debug=1 \
    +  -t /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim-simulator/build/models/armv8/output-aarch64/
    @@ -28548,7 +28562,7 @@ gensim/models/armv8/isa.ac

    and where gensim/models/armv8/isa.ac contains __builtin_abs64 usages.

    -

    GDB on gensim shows that the error comes from a call to gci.GenerateExecuteBodyFor(body_str, *action);, so it looks like there are some missing cases in EmitFixedCode.

    +

    Rebuilding with -DCMAKE_BUILD_TYPE=DEBUG + GDB on gensim shows that the error comes from a call to gci.GenerateExecuteBodyFor(body_str, *action);, so it looks like there are some missing cases in gensim/src/generators/GenCInterpreter/InterpreterNodeWalker.cpp function SSAIntrinsicStatementWalker::EmitFixedCode, e.g. there should be one for __builtin_abs64.

    This is completely broken academic code! They must be using an off-tree version of part of the tool and forgot to commit it.

    @@ -29867,7 +29881,7 @@ PERL5LIB="${PERL5LIB}:." make -j `nproc` ctest
    -

    This both builds and runs, took about 5 minutes on P51, but had build failues for some reason:

    +

    This both builds and runs; it took about 5 minutes on the 2017 Lenovo ThinkPad P51, but had build failures for some reason:

    @@ -30192,7 +30206,7 @@ mkdir -p bin/c

    All examples do exactly the same thing: spawn N threads and loop M times in each thread, incrementing a global integer.
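    For concreteness, the unsynchronized variant boils down to something like the following (a minimal pthread sketch with hardcoded N = 2 and M = 10000; the actual LKMC examples differ in detail, and the synchronized variants protect the increment, e.g. with a mutex or an atomic):

    #include <pthread.h>
    #include <stdio.h>

    enum { NTHREADS = 2, NLOOPS = 10000 };

    /* Shared counter, deliberately not synchronized: the final value
     * is very likely to be smaller than NTHREADS * NLOOPS. */
    static unsigned long global = 0;

    static void *work(void *arg) {
        (void)arg;
        for (unsigned long i = 0; i < NLOOPS; ++i)
            global++; /* non-atomic read-modify-write: data race */
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];
        for (int i = 0; i < NTHREADS; ++i)
            pthread_create(&threads[i], NULL, work, NULL);
        for (int i = 0; i < NTHREADS; ++i)
            pthread_join(threads[i], NULL);
        /* 20000 if there were no race; often less in practice. */
        printf("%lu\n", global);
        return 0;
    }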

    -

    For inputs large enough, the non-synchronized examples are extremely likely to produce "wrong" results, for example on P51 Ubuntu 19.10 native with 2 threads and 10000 loops:

    +

    For inputs large enough, the non-synchronized examples are extremely likely to produce "wrong" results, for example on 2017 Lenovo ThinkPad P51 Ubuntu 19.10 native with 2 threads and 10000 loops:

    @@ -31519,7 +31533,7 @@ xdg-open bst_vs_heap_vs_hashmap_gem5.tmp.png

    TODO: the gem5 simulation blows up on a tcmalloc allocation somewhere near 25k elements as of 3fdd83c2c58327d9714fa2347c724b78d7c05e2b + 1, likely linked to the extreme inefficiency of the stats collection?

    -

    The cache sizes were chosen to match the host P51 to improve the comparison. Ideally we should also use the same standard library.

    +

    The cache sizes were chosen to match the host 2017 Lenovo ThinkPad P51 to improve the comparison. Ideally we should also use the same standard library.

    Note that this will take a long time, and will produce a humongous ~40GB stats file as explained at: Section 19.9.3.2, “gem5 only dump selected stats”

    @@ -31941,10 +31955,10 @@ make TARGET=linux64 XCMD='-c4' certify-all
    -

    This uses 4 contexts. TODO what are contexts? Is the same as threads?

    +

    This uses 4 contexts. TODO: what are contexts? Are they the same as threads? You likely want to use -c$(nproc) in practice instead.

    -

    Finishes in a few seconds, P51 results:

    +

    Finishes in a few seconds, 2017 Lenovo ThinkPad P51 results:

    @@ -31968,6 +31982,9 @@ CoreMark-PRO 18743.79 6306.76 2.97
    +

    More sample results: P51 CoreMark-Pro.

    +
    +

    And the scaling appears to be the ratio between the multicore (4 contexts due to -c4) and the single core performance, e.g. 18743.79 / 6306.76 ≈ 2.97 above; each benchmark gets run twice, once multicore and once single core.

    @@ -32216,7 +32233,7 @@ RUN_FLAGS =
    -

    Output for P51 Ubuntu 20.04:

    +

    Sample output for 2017 Lenovo ThinkPad P51 Ubuntu 20.04:

    @@ -32428,12 +32445,62 @@ times[3 * ntimes + k] = mysecond() - times[3 * ntimes + k];

    See also: https://stackoverflow.com/questions/56086993/what-does-stream-memory-bandwidth-benchmark-really-measure
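    For intuition on what is being measured, each STREAM kernel is just a simple loop over large double arrays; the Triad kernel, for example, is essentially the following (a minimal sketch of the kernel only, without the benchmark's timing, OpenMP pragmas or validation code):

    #include <stdlib.h>

    #define STREAM_ARRAY_SIZE 10000000

    int main(void) {
        /* The arrays are sized to be much larger than the caches, so the
         * loop below is limited by memory bandwidth, not by compute. */
        double *a = malloc(sizeof(double) * STREAM_ARRAY_SIZE);
        double *b = malloc(sizeof(double) * STREAM_ARRAY_SIZE);
        double *c = malloc(sizeof(double) * STREAM_ARRAY_SIZE);
        double scalar = 3.0;
        for (size_t j = 0; j < STREAM_ARRAY_SIZE; ++j) {
            b[j] = 2.0;
            c[j] = 1.0;
        }
        /* Triad kernel: a[j] = b[j] + scalar * c[j].
         * Copy, Scale and Add are analogous one-line loops. */
        for (size_t j = 0; j < STREAM_ARRAY_SIZE; ++j)
            a[j] = b[j] + scalar * c[j];
        free(a); free(b); free(c);
        return 0;
    }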

    -

    The LKMC usage of STREAM is analogous to that of Dhrystone. Build and run on QEMU User mode simulation:

    +

    Ubuntu 20.04 native build and run:

    git submodule update --init submodules/stream-benchmark
    -./build-stream --optimization-level 3
    +cd submodules/stream-benchmark
    +make
    +./stream_c.exe
    +
    +
    +
    +

    Sample output:

    +
    +
    +
    +
    -------------------------------------------------------------
    +STREAM version $Revision: 5.10 $
    +-------------------------------------------------------------
    +This system uses 8 bytes per array element.
    +-------------------------------------------------------------
    +Array size = 10000000 (elements), Offset = 0 (elements)
    +Memory per array = 76.3 MiB (= 0.1 GiB).
    +Total memory required = 228.9 MiB (= 0.2 GiB).
    +Each kernel will be executed 10 times.
    + The *best* time for each kernel (excluding the first iteration)
    + will be used to compute the reported bandwidth.
    +-------------------------------------------------------------
    +Number of Threads requested = 8
    +Number of Threads counted = 8
    +-------------------------------------------------------------
    +Your clock granularity/precision appears to be 1 microseconds.
    +Each test below will take on the order of 7027 microseconds.
    +   (= 7027 clock ticks)
    +Increase the size of the arrays if this shows that
    +you are not getting at least 20 clock ticks per test.
    +-------------------------------------------------------------
    +WARNING -- The above is only a rough guideline.
    +For best results, please be sure you know the
    +precision of your system timer.
    +-------------------------------------------------------------
    +Function    Best Rate MB/s  Avg time     Min time     Max time
    +Copy:           20123.2     0.008055     0.007951     0.008267
    +Scale:          20130.4     0.008032     0.007948     0.008177
    +Add:            22528.8     0.010728     0.010653     0.010867
    +Triad:          22448.4     0.010826     0.010691     0.011352
    +-------------------------------------------------------------
    +Solution Validates: avg error less than 1.000000e-13 on all three arrays
    +-------------------------------------------------------------
    +
    +
    +
    +

    The LKMC usage of STREAM is analogous to that of Dhrystone. Build and run on QEMU User mode simulation:

    +
    +
    +
    +
    ./build-stream --optimization-level 3
     ./run --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe"
    @@ -35136,7 +35203,7 @@ pop %rbp

    The exact data to show depends on the value of EAX, and in a few cases also on ECX. When it depends on ECX, it is called a sub-leaf. Our test program prints the output for eax == 0.
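    To reproduce such a query without the LKMC assembly example, GCC and Clang also expose CPUID through the cpuid.h helper (a minimal C sketch, not the actual test program):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        /* Leaf 0: EAX gets the maximum supported leaf, and
         * EBX:EDX:ECX spell the vendor string, e.g. "GenuineIntel". */
        if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
            return 1;
        printf("eax 0x%08x\nebx 0x%08x\necx 0x%08x\nedx 0x%08x\n",
               eax, ebx, ecx, edx);
        return 0;
    }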

    -

    On P51 for example the output EAX, EBX, ECX and EDX are:

    +

    On 2017 Lenovo ThinkPad P51 for example the output EAX, EBX, ECX and EDX are:

    @@ -35437,7 +35504,7 @@ pop %rbp
    • -

      userland/arch/x86_64/vfmadd132pd.S: VFMADD132PD: "Multiply packed double-precision floating-point values from xmm1 and xmm3/mem, add to xmm2 and put result in xmm1." TODO: but I don’t understand the manual, experimentally on P51 Ubuntu 19.04 host the result is stored in XMM2!

      +

      userland/arch/x86_64/vfmadd132pd.S: VFMADD132PD: "Multiply packed double-precision floating-point values from xmm1 and xmm3/mem, add to xmm2 and put result in xmm1." TODO: but I don’t understand the manual, experimentally on 2017 Lenovo ThinkPad P51 Ubuntu 19.04 host the result is stored in XMM2!

    @@ -35565,7 +35632,7 @@ taskset -c 1 ./userland/arch/x86_64/rdtscp.out | tail -n 1
    -

    There is also the RDPID instruction that reads just the processor ID, but it appears to be very new for QEMU 4.0.0 or P51, as it fails with SIGILL on both.

    +

    There is also the RDPID instruction, which reads just the processor ID, but it appears to be too new for QEMU 4.0.0 or the 2017 Lenovo ThinkPad P51, as it fails with SIGILL on both.

    Bibliography: https://stackoverflow.com/questions/22310028/is-there-an-x86-instruction-to-tell-which-core-the-instruction-is-being-run-on/56622112#56622112
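    As discussed in that link, until RDPID is widely available the processor ID can also be extracted from the IA32_TSC_AUX value returned by RDTSCP, since Linux stores the CPU (and NUMA node) number in that MSR; a minimal sketch using the compiler intrinsic (not the actual LKMC rdtscp example, and the low-12-bits CPU / upper-bits node split is an assumption about the Linux encoding):

    #include <stdio.h>
    #include <x86intrin.h>

    int main(void) {
        unsigned int aux;
        /* RDTSCP returns the TSC and stores IA32_TSC_AUX into aux. */
        unsigned long long tsc = __rdtscp(&aux);
        printf("tsc %llu cpu %u node %u\n",
               tsc, aux & 0xfff, aux >> 12);
        return 0;
    }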

    @@ -40597,7 +40664,7 @@ import /init.${ro.zygote}.rc

    So currently, we are running benchmarks manually when it seems reasonable and uploading them to: https://github.com/cirosantilli/linux-kernel-module-cheat-regression

    -

    All benchmarks were run on the P51 machine, unless stated otherwise.

    +

    All benchmarks were run on the 2017 Lenovo ThinkPad P51 machine, unless stated otherwise.

    Run all benchmarks and upload the results:

    @@ -40742,7 +40809,7 @@ instructions 124346081

    TODO: aarch64 gem5 and QEMU use the same kernel, so why is the gem5 instruction count so much higher?

    -

    P51 Ubuntu 19.10 LKMC b11e3cd9fb5df0e3fe61de28e8264bbc95ea9005 gem5 e779c19dbb51ad2f7699bd58a5c7827708e12b55 aarch64: 143s. Why huge increases from 70s on above table? Kernel size is also huge BTW: 147MB.

    +

    2017 Lenovo ThinkPad P51 Ubuntu 19.10 LKMC b11e3cd9fb5df0e3fe61de28e8264bbc95ea9005 gem5 e779c19dbb51ad2f7699bd58a5c7827708e12b55 aarch64: 143s. Why the huge increase from the 70s in the table above? The kernel size is also huge BTW: 147MB.

    Note that https://gem5.atlassian.net/browse/GEM5-337 "ARM PAuth patch slows down Linux boot 2x from 2 minutes to 4 minutes" was already semi-fixed at that point.

    @@ -40820,7 +40887,7 @@ instructions 124346081

    For example, the simplest scalable CPU content would be a C busy loop, so let’s start by analyzing that one.
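    Such a busy loop can be as simple as the following (a minimal sketch, not the exact LKMC source; the empty asm statement is a GCC-style trick to keep the loop from being optimized away, and the two nested counts correspond to the "1000 times 10000" used further down):

    #include <stdlib.h>

    int main(int argc, char **argv) {
        /* Total iterations = outer * inner, e.g. 1000 * 10000 = 10^7. */
        unsigned long long outer = argc > 1 ? strtoull(argv[1], NULL, 0) : 1000;
        unsigned long long inner = argc > 2 ? strtoull(argv[2], NULL, 0) : 10000;
        for (unsigned long long i = 0; i < outer; ++i)
            for (unsigned long long j = 0; j < inner; ++j)
                /* Prevents the compiler from removing the empty loop. */
                __asm__ volatile ("" : : : "memory");
        return 0;
    }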

    -

    Summary of manually collected results on P51 at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: Table 7, “Busy loop MIPS for different simulator setups”. As expected, the less native/more detailed/more complex simulations are slower!

    +

    Summary of manually collected results on 2017 Lenovo ThinkPad P51 at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: Table 7, “Busy loop MIPS for different simulator setups”. As expected, the less native/more detailed/more complex simulations are slower!

    @@ -40860,7 +40927,7 @@ instructions 124346081
    -
    +
    @@ -41097,7 +41164,7 @@ instructions 124346081

    The first step is to determine a number of loops that will run long enough to have meaningful results, but not so long that we get bored, so about 1 minute.

    -

    On our P51 machine, we found 10^7 (10 million == 1000 times 10000) loops to be a good number for a gem5 atomic simulation:

    +

    On our 2017 Lenovo ThinkPad P51 machine, we found 10^7 (10 million == 1000 times 10000) loops to be a good number for a gem5 atomic simulation:

    @@ -41197,7 +41264,7 @@ time \
    -

    Result on P51 at bad30f513c46c1b0995d3a10c0d9bc2a33dc4fa0:

    +

    Result on 2017 Lenovo ThinkPad P51 at bad30f513c46c1b0995d3a10c0d9bc2a33dc4fa0:

      @@ -41337,7 +41404,7 @@ xdg-open graph-size.pdf

      We will update this whenever the gem5 submodule is updated.

    -

    All benchmarks done on P51.

    +

    All benchmarks done on 2017 Lenovo ThinkPad P51.

    Get results with:

    @@ -41397,7 +41464,7 @@ tail -n+1 ../linux-kernel-module-cheat-regression/*/gem5-bench-build-*.txt

    and then copy the link command to a separate Bash file. Then you can time and modify it easily.

    -

    Some approximate reference values on P51:

    +

    Some approximate reference values on 2017 Lenovo ThinkPad P51:

    Table 7. Busy loop MIPS for different simulator setups

    27

    P51

    2017 Lenovo ThinkPad P51

    Ubuntu 20.04