diff --git a/index.html b/index.html
index d8388ed..3ac4aeb 100644
--- a/index.html
+++ b/index.html
@@ -2057,7 +2057,21 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
29.4. Benchmark Internets
@@ -18343,7 +18357,7 @@ less "$(./getvar gem5_source_dir)/src/cpu/exetrace.cc"
We can make the trace smaller by naming the trace file as trace.txt.gz, which enables GZIP compression, but that is not currently exposed on our scripts, since you usually just need something human readable to work on.
-
Enabling tracing made the runtime about 4x slower on the P51, with or without .gz compression.
+
Enabling tracing made the runtime about 4x slower on the 2017 Lenovo ThinkPad P51, with or without .gz compression.
Trace the source lines just like for QEMU with:
@@ -21193,7 +21207,7 @@ system.cpu.dtb.inst_hits
and there yes, we see that the file size fell from 39MB on stats.txt to 3.2MB on stats.m5, so the increase observed previously was just due to some initial size overhead (considering the patched gem5 with no spaces in the text file).
-
We also note however that the stat dump made the such a simulation that just loops and dumps considerably slower, from 3s to 15s on P51. Fascinating, we are definitely not disk bound there.
+
We also note, however, that the stat dump made a simulation that just loops and dumps considerably slower, from 3s to 15s on 2017 Lenovo ThinkPad P51. Fascinating, we are definitely not disk bound there.
We enable HDF5 on the build by default with USE_HDF5=1. To disable it, you can add USE_HDF5=0 to the build as in:
@@ -21639,7 +21653,7 @@ xdg-open "$(./getvar --arch arm --emulator gem5 m5out_dir)/config.dot.svg"
-
Sample run time: 87 minutes on P51 Ubuntu 20.04 gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1.
+
Sample run time: 87 minutes on 2017 Lenovo ThinkPad P51 Ubuntu 20.04 gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1.
After the first run has downloaded the test binaries for you, you can speed up the process a little bit by skipping a useless SCons call:
@@ -22615,7 +22629,7 @@ less o3pipeview.tmp.log
mkdir aarch-system-201901106
cd aarch-system-201901106
-wget http://www.gem5.org/dist/current/arm/aarch-system-201901106.tar.bz2
+wget http://dist.gem5.org/dist/current/arm/aarch-system-201901106.tar.bz2
tar xvf aarch-system-201901106.tar.bz2
cd ..
./run --arch aarch64 --emulator gem5 --linux-exec aarch-system-201901106/binaries/vmlinux.arm64
@@ -28479,7 +28493,7 @@ build/ARM/config/the_isa.hh
git submodule update --init submodules/gensim-simulator
sudo apt install libantlr3c-dev
-cd submodule/gensim
+cd submodules/gensim-simulator
make
@@ -28525,12 +28539,12 @@ Aborted (core dumped)
-
cd /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/models/armv8 && \
- /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/build/dist/bin/gensim \
- -a /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/models/armv8/aarch64.ac \
+cd /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim-simulator/models/armv8 && \
+ /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim-simulator/build/dist/bin/gensim \
+ -a /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim-simulator/models/armv8/aarch64.ac \
-s module,arch,decode,disasm,ee_interp,ee_blockjit,jumpinfo,function,makefile \
- -o decode.GenerateDotGraph=1,makefile.libtrace_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/support/libtrace/inc,makefile.archsim_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/archsim/inc,makefile.llvm_path=,makefile.Optimise=2,makefile.Debug=1 \
- -t /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/build/models/armv8/output-aarch64/
+ -o decode.GenerateDotGraph=1,makefile.libtrace_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim-simulator/support/libtrace/inc,makefile.archsim_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim-simulator/archsim/inc,makefile.llvm_path=,makefile.Optimise=2,makefile.Debug=1 \
+ -t /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim-simulator/build/models/armv8/output-aarch64/
@@ -28548,7 +28562,7 @@ gensim/models/armv8/isa.ac
and where gensim/models/armv8/isa.ac contains __builtin_abs64 usages.
-
GDB on gensim shows that the error comes from a call to gci.GenerateExecuteBodyFor(body_str, *action);, so it looks like there are some missing cases in EmitFixedCode.
+
Rebuilding with -DCMAKE_BUILD_TYPE=DEBUG + GDB on gensim shows that the error comes from a call to gci.GenerateExecuteBodyFor(body_str, *action);, so it looks like there are some missing cases in gensim/src/generators/GenCInterpreter/InterpreterNodeWalker.cpp function SSAIntrinsicStatementWalker::EmitFixedCode, e.g. there should be one for __builtin_abs64.
This is completely broken academic code! They must be using an out-of-tree version of part of the tool and forgot to commit it.
@@ -29867,7 +29881,7 @@ PERL5LIB="${PERL5LIB}:." make -j `nproc` ctest
-
This both builds and runs, took about 5 minutes on P51, but had build failues for some reason:
+
This both builds and runs, took about 5 minutes on 2017 Lenovo ThinkPad P51, but had build failures for some reason:
@@ -30192,7 +30206,7 @@ mkdir -p bin/c
All examples do exactly the same thing: spawn N threads and loop M times in each thread incrementing a global integer.
-
For inputs large enough, the non-synchronized examples are extremely likely to produce "wrong" results, for example on P51 Ubuntu 19.10 native with 2 threads and 10000 loops:
+
For inputs large enough, the non-synchronized examples are extremely likely to produce "wrong" results, for example on 2017 Lenovo ThinkPad P51 Ubuntu 19.10 native with 2 threads and 10000 loops:
@@ -31519,7 +31533,7 @@ xdg-open bst_vs_heap_vs_hashmap_gem5.tmp.png
TODO: the gem5 simulation blows up on a tcmalloc allocation somewhere near 25k elements as of 3fdd83c2c58327d9714fa2347c724b78d7c05e2b + 1, likely linked to the extreme inefficiency of the stats collection?
-
The cache sizes were chosen to match the host P51 to improve the comparison. Ideally we should also use the same standard library.
+
The cache sizes were chosen to match the host 2017 Lenovo ThinkPad P51 to improve the comparison. Ideally we should also use the same standard library.
-
This uses 4 contexts. TODO what are contexts? Is the same as threads?
+
This uses 4 contexts. TODO what are contexts? Are they the same as threads? You likely want to use -c$(nproc) in practice instead?
@@ -31968,6 +31982,9 @@ CoreMark-PRO 18743.79 6306.76 2.97
+
And scaling appears to be the ratio between multicore (4 due to -c4) and single core performance: each benchmark gets run twice, once with multicore and once with single core.
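As a quick sanity check of that reading, here is a minimal sketch that recomputes the Scaling column from the two throughput columns. The function name `scaling` is hypothetical, and the assumption that Scaling is a plain ratio comes from the numbers in this document, not from the CoreMark-PRO documentation.

```python
def scaling(multicore_iter_s: float, singlecore_iter_s: float) -> float:
    """Assumed definition: multi-core throughput divided by single-core throughput."""
    return multicore_iter_s / singlecore_iter_s


# CoreMark-PRO mark line from the -c4 run quoted in this hunk's header.
print(f"{scaling(18743.79, 6306.76):.2f}")

# CoreMark-PRO mark line from the -c$(nproc) run on the P51 further down.
print(f"{scaling(25016.00, 6079.70):.2f}")
```

Both values agree with the Scaling column printed by the benchmark (2.97 and 4.11), which supports the ratio interpretation.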
@@ -32216,7 +32233,7 @@ RUN_FLAGS =
git submodule update --init submodules/stream-benchmark
-./build-stream --optimization-level 3
+cd submodules/stream-benchmark
+make
+./stream_c.exe
+
+
+
+
+
+
-------------------------------------------------------------
+STREAM version $Revision: 5.10 $
+-------------------------------------------------------------
+This system uses 8 bytes per array element.
+-------------------------------------------------------------
+Array size = 10000000 (elements), Offset = 0 (elements)
+Memory per array = 76.3 MiB (= 0.1 GiB).
+Total memory required = 228.9 MiB (= 0.2 GiB).
+Each kernel will be executed 10 times.
+ The *best* time for each kernel (excluding the first iteration)
+ will be used to compute the reported bandwidth.
+-------------------------------------------------------------
+Number of Threads requested = 8
+Number of Threads counted = 8
+-------------------------------------------------------------
+Your clock granularity/precision appears to be 1 microseconds.
+Each test below will take on the order of 7027 microseconds.
+ (= 7027 clock ticks)
+Increase the size of the arrays if this shows that
+you are not getting at least 20 clock ticks per test.
+-------------------------------------------------------------
+WARNING -- The above is only a rough guideline.
+For best results, please be sure you know the
+precision of your system timer.
+-------------------------------------------------------------
+Function Best Rate MB/s Avg time Min time Max time
+Copy: 20123.2 0.008055 0.007951 0.008267
+Scale: 20130.4 0.008032 0.007948 0.008177
+Add: 22528.8 0.010728 0.010653 0.010867
+Triad: 22448.4 0.010826 0.010691 0.011352
+-------------------------------------------------------------
+Solution Validates: avg error less than 1.000000e-13 on all three arrays
+-------------------------------------------------------------
+
+
+
+
+
+
./build-stream --optimization-level 3
./run --userland "$(./getvar userland_build_dir)/submodules/stream-benchmark/stream_c.exe"
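The "Best Rate MB/s" column in the STREAM output above can be reproduced by hand from the array size and the "Min time" column. The sketch below assumes the usual STREAM accounting (Copy and Scale touch two arrays per iteration, Add and Triad touch three, and 1 MB is 10^6 bytes); the helper names are hypothetical.

```python
ARRAY_ELEMENTS = 10_000_000  # "Array size" in the output above
BYTES_PER_ELEMENT = 8        # "This system uses 8 bytes per array element"

# Arrays touched per kernel iteration: Copy and Scale read one array and
# write one; Add and Triad read two and write one.
ARRAYS_PER_KERNEL = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def mib_per_array() -> float:
    """Size of one array in MiB, as in the "Memory per array" line."""
    return ARRAY_ELEMENTS * BYTES_PER_ELEMENT / 2**20


def best_rate_mb_s(kernel: str, min_time_s: float) -> float:
    """Bandwidth in MB/s (10^6 bytes/s) from the best (minimum) time."""
    bytes_moved = ARRAYS_PER_KERNEL[kernel] * BYTES_PER_ELEMENT * ARRAY_ELEMENTS
    return 1e-6 * bytes_moved / min_time_s


print(f"{mib_per_array():.1f} MiB per array")
# "Min time" values copied from the table above.
for kernel, min_time in [("Copy", 0.007951), ("Scale", 0.007948),
                         ("Add", 0.010653), ("Triad", 0.010691)]:
    print(f"{kernel}: {best_rate_mb_s(kernel, min_time):.1f} MB/s")
```

The results match the table to within about 0.5 MB/s; the small difference comes from the min time being printed with only six decimal places.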
@@ -35136,7 +35203,7 @@ pop %rbp
The exact data to show depends on the value of EAX, and in a few cases also on ECX. When it depends on ECX, it is called a sub-leaf. Our test program prints eax == 0.
-
On P51 for example the output EAX, EBX, ECX and EDX are:
+
On 2017 Lenovo ThinkPad P51 for example the output EAX, EBX, ECX and EDX are:
@@ -35437,7 +35504,7 @@ pop %rbp
-
-
userland/arch/x86_64/vfmadd132pd.S: VFMADD132PD: "Multiply packed double-precision floating-point values from xmm1 and xmm3/mem, add to xmm2 and put result in xmm1." TODO: but I don’t understand the manual, experimentally on P51 Ubuntu 19.04 host the result is stored in XMM2!
+userland/arch/x86_64/vfmadd132pd.S: VFMADD132PD: "Multiply packed double-precision floating-point values from xmm1 and xmm3/mem, add to xmm2 and put result in xmm1." TODO: but I don’t understand the manual, experimentally on 2017 Lenovo ThinkPad P51 Ubuntu 19.04 host the result is stored in XMM2!
@@ -35565,7 +35632,7 @@ taskset -c 1 ./userland/arch/x86_64/rdtscp.out | tail -n 1
-
There is also the RDPID instruction that reads just the processor ID, but it appears to be very new for QEMU 4.0.0 or P51, as it fails with SIGILL on both.
+
There is also the RDPID instruction that reads just the processor ID, but it appears to be very new for QEMU 4.0.0 or 2017 Lenovo ThinkPad P51, as it fails with SIGILL on both.
-
All benchmarks were run on the P51 machine, unless stated otherwise.
+
All benchmarks were run on the 2017 Lenovo ThinkPad P51 machine, unless stated otherwise.
Run all benchmarks and upload the results:
@@ -40742,7 +40809,7 @@ instructions 124346081
TODO: aarch64 gem5 and QEMU use the same kernel, so why is the gem5 instruction count so much much higher?
-
P51 Ubuntu 19.10 LKMC b11e3cd9fb5df0e3fe61de28e8264bbc95ea9005 gem5 e779c19dbb51ad2f7699bd58a5c7827708e12b55 aarch64: 143s. Why huge increases from 70s on above table? Kernel size is also huge BTW: 147MB.
+
2017 Lenovo ThinkPad P51 Ubuntu 19.10 LKMC b11e3cd9fb5df0e3fe61de28e8264bbc95ea9005 gem5 e779c19dbb51ad2f7699bd58a5c7827708e12b55 aarch64: 143s. Why huge increases from 70s on above table? Kernel size is also huge BTW: 147MB.
Note that https://gem5.atlassian.net/browse/GEM5-337 "ARM PAuth patch slows down Linux boot 2x from 2 minutes to 4 minutes" was already semi fixed at that point.
@@ -40820,7 +40887,7 @@ instructions 124346081
For example, the simplest scalable CPU content would be a C busy loop, so let’s start by analyzing that one.
Table 7. Busy loop MIPS for different simulator setups
@@ -40860,7 +40927,7 @@ instructions 124346081
27 |
|
|
-P51 |
+2017 Lenovo ThinkPad P51 |
Ubuntu 20.04 |
@@ -41097,7 +41164,7 @@ instructions 124346081
The first step is to determine a number of loops that will run long enough to have meaningful results, but not so long that we will get bored, so about 1 minute.
-
On our P51 machine, we found 10^7 (10 million == 1000 times 10000) loops to be a good number for a gem5 atomic simulation:
+
On our 2017 Lenovo ThinkPad P51 machine, we found 10^7 (10 million == 1000 times 10000) loops to be a good number for a gem5 atomic simulation:
@@ -41197,7 +41264,7 @@ time \
-
Result on P51 at bad30f513c46c1b0995d3a10c0d9bc2a33dc4fa0:
+
Result on 2017 Lenovo ThinkPad P51 at bad30f513c46c1b0995d3a10c0d9bc2a33dc4fa0:
Get results with:
@@ -41397,7 +41464,7 @@ tail -n+1 ../linux-kernel-module-cheat-regression/*/gem5-bench-build-*.txt
and then copy the link command to a separate Bash file. Then you can time and modify it easily.
@@ -41452,12 +41519,355 @@ tail -n+1 ../linux-kernel-module-cheat-regression/*/gem5-bench-build-*.txt
-
+
-
Lenovo ThinkPad P51 laptop with the Latest stable Ubuntu.
+
Serial number: TYPE 20HH-CTO1WW S/N PF-0V5V5N 17/11
+
+
+
Summary string of key hardware for copy paste:
+
+
+
+
+
+-
+
Hard disk: Seagate ST1000LM035-1RK1 1TB hard disk
+
+-
+
GPU: NVIDIA Quadro M1200 4GB GDDR5 GPU
+
+-
+
Pre-installed OS:
+
+
+-
+
Display: 15.6" FHD (1920x1080), anti-glare, IPS
+
+-
+
With Color Sensor
+
+-
+
720p HD Camera with Microphone
+
+-
+
Keyboard with Number Pad - Euro English
+
+-
+
3+3BCP, Fingerprint Reader,Color Sensor
+
+-
+
Integrated Fingerprint Reader
+
+-
+
Hardware dTPM2.0 Enabled
+
+-
+
1TB 5400rpm HDD
+
+-
+
170W AC Adapter - UK(3pin)
+
+-
+
6 Cell Li-Polymer Battery, 90Wh
+
+-
+
Intel Dual Band Wireless AC(2x2) 8265, Bluetooth Version 4.1, vPro
+
+
+
+
+
+
+-
+
keyboard FRU number: 01HW271 (written on part; Payton2Walter2 NBL KBD,USI,DFN according to https://support.lenovo.com/us/en/partslookup). That website says 01ER981 is equivalent (Payton2Walter2 NBL KBD,USI,CHY), just a different manufacturer.
+
+
+
+
+
+
+
+
+
+
+
+
+
CoreMark-Pro d5b4f2ba7ba31e37a5aa93423831e7d5eb933868 on Ubuntu 20.04 with XCMD="-c$(nproc)":
+
+
+
+
MultiCore SingleCore
+Workload Name (iter/s) (iter/s) Scaling
+----------------------------------------------- ---------- ---------- ----------
+cjpeg-rose7-preset 769.23 175.44 4.38
+core 7.98 2.11 3.78
+linear_alg-mid-100x100-sp 892.86 233.64 3.82
+loops-all-mid-10k-sp 35.84 7.58 4.73
+nnet_test 35.09 10.05 3.49
+parser-125k 125.00 20.41 6.12
+radix2-big-64k 3278.69 630.91 5.20
+sha-test 625.00 227.27 2.75
+zip-test 615.38 166.67 3.69
+
+MARK RESULTS TABLE
+
+Mark Name MultiCore SingleCore Scaling
+----------------------------------------------- ---------- ---------- ----------
+CoreMark-PRO 25016.00 6079.70 4.11
+
+
+
+
+
+
+
+
Bought: 2017 for approximately 2400 pounds.
+
+
+
Ubuntu 17.10 setup after buying it:
+
+
+
+
Battery life shown by Ubuntu battery app after installation:
+
+
+
+
2019-04-17: popup asking about "ThinkPad P51 Management Engine Update" from 182.29.3287 to 184.60.3561, said yes.
+
+
+
2020-06-06: dropped some lemon juice on the bottom left of touchpad. Bottom left button not working anymore… I’m an idiot. There are many other alternatives, but very aggravating, I’ll replace it for sure. Can’t find the exact replacement part or any videos showing its replacement online easily, dang. For the T430: https://www.youtube.com/watch?v=F3lzV9uXRjU Asked at: https://forums.lenovo.com/t5/ThinkPad-P-and-W-Series-Mobile-Workstations/P51-left-bottom-button-below-trackpad-mouse-left-click-stopped-working-possible-to-replace/m-p/5019903 Also I could not access it because you need to remove the HDD first: https://www.youtube.com/watch?v=5Klawxc7T_Y and I can’t pull it out even with considerable force, unlike in the video… And OMG, those button caps are impossible to re-install once removed!!! Then when I put the whole thing back together, the upper buttons were not working anymore. FUUUUUUUUCK. When first opening I pulled on it without properly removing the cap and it came off, but it didn’t look broken in any way and I put it back in. Keyboard works thank God, so right black connector is fine, left white one appears to be the one for upper keys and trackpoint, both of which stopped working. The hardware manual confirms that they are both part of the same device, so basically a mouse :-) TODO can it be bought separately from the keyboard? Doesn’t look like it, photo of keyboard part includes those buttons. The manual also confirms that the bottom buttons are one device with the trackpad "trackpad with buttons", thus forming the second entire mouse.
+
+
+
+
+
+
+
+
Max frequency: 3.90GHz
+
+
+
+
+
Recommended customer price: 378.00 USD
+
+
+
+
+
cat /proc/cpuinfo of one CPU on Ubuntu 20.04 Linux kernel 5.4.0:
+
+
+
+
processor : 0
+vendor_id : GenuineIntel
+cpu family : 6
+model : 158
+model name : Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+stepping : 9
+microcode : 0xd6
+cpu MHz : 1025.664
+cache size : 8192 KB
+physical id : 0
+siblings : 8
+core id : 0
+cpu cores : 4
+apicid : 0
+initial apicid : 0
+fpu : yes
+fpu_exception : yes
+cpuid level : 22
+wp : yes
+flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
+bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds
+bogomips : 5799.77
+clflush size : 64
+cache_alignment : 64
+address sizes : 39 bits physical, 48 bits virtual
+power management:
+
+
+
+
getconf -a | grep CACHE on Ubuntu 20.04 Linux kernel 5.4.0:
+
+
+
+
LEVEL1_ICACHE_SIZE 32768
+LEVEL1_ICACHE_ASSOC 8
+LEVEL1_ICACHE_LINESIZE 64
+LEVEL1_DCACHE_SIZE 32768
+LEVEL1_DCACHE_ASSOC 8
+LEVEL1_DCACHE_LINESIZE 64
+LEVEL2_CACHE_SIZE 262144
+LEVEL2_CACHE_ASSOC 4
+LEVEL2_CACHE_LINESIZE 64
+LEVEL3_CACHE_SIZE 8388608
+LEVEL3_CACHE_ASSOC 16
+LEVEL3_CACHE_LINESIZE 64
+LEVEL4_CACHE_SIZE 0
+LEVEL4_CACHE_ASSOC 0
+LEVEL4_CACHE_LINESIZE 0
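When matching simulated caches to this host, the number of sets per cache is often needed as well, and it follows from the values above as size / (associativity * line size). The sketch below is only illustrative: the helper names are hypothetical and the input is the getconf output pasted in as a string.

```python
# Values copied from the getconf output above.
GETCONF_OUTPUT = """\
LEVEL1_ICACHE_SIZE 32768
LEVEL1_ICACHE_ASSOC 8
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE 32768
LEVEL1_DCACHE_ASSOC 8
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE 262144
LEVEL2_CACHE_ASSOC 4
LEVEL2_CACHE_LINESIZE 64
LEVEL3_CACHE_SIZE 8388608
LEVEL3_CACHE_ASSOC 16
LEVEL3_CACHE_LINESIZE 64
LEVEL4_CACHE_SIZE 0
LEVEL4_CACHE_ASSOC 0
LEVEL4_CACHE_LINESIZE 0
"""


def parse_getconf(text):
    """Map each getconf variable name to its integer value."""
    values = {}
    for line in text.splitlines():
        name, value = line.split()
        values[name] = int(value)
    return values


def num_sets(values, prefix):
    """Sets = size / (ways * line size); 0 when the cache level is absent."""
    size = values[prefix + "_SIZE"]
    ways = values[prefix + "_ASSOC"]
    line_size = values[prefix + "_LINESIZE"]
    if size == 0:
        return 0
    return size // (ways * line_size)


values = parse_getconf(GETCONF_OUTPUT)
for prefix in ("LEVEL1_ICACHE", "LEVEL1_DCACHE", "LEVEL2_CACHE",
               "LEVEL3_CACHE", "LEVEL4_CACHE"):
    print(prefix, num_sets(values, prefix))
```

For example, the 32 KiB 8-way L1 data cache with 64-byte lines works out to 64 sets.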
+
+
+
+
+
+
+
Nominal speed: 2400 Mbps
+
+
+
+
+
+
+
+
+
+
+
+
sudo hdparm -Tt /dev/nvme0n1p5 on Ubuntu 20.04:
+
+
+
+
Timing cached reads: 29812 MB in 1.99 seconds = 15007.00 MB/sec
+ HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
+ Timing buffered disk reads: 6328 MB in 3.00 seconds = 2109.00 MB/sec
+
+
+
+
Nominal maximum sequential read speed: 3,000 MB/s
+
+
+
+
+
+
+
+
+
sudo hdparm -Tt /dev/sda3 on Ubuntu 20.04:
+
+
+
+
Timing cached reads: 29594 MB in 1.99 seconds = 14893.89 MB/sec
+ Timing buffered disk reads: 386 MB in 3.01 seconds = 128.07 MB/sec
+
+
+
+
Nominal maximum speed: 140MB/s
+
+
+