readme: explain how to run benchmarks with readfile straight out

gem5 cache automate benchmarks, and remove results that were scattered all over.

Move console fun under linux kernel section, since not important enough
for getting started.

Set arch x86_64 by default through common on all scripts.
This commit is contained in:
Ciro Santilli
2018-04-03 22:19:18 +01:00
parent 0dd444d8eb
commit f482f24f87
10 changed files with 94 additions and 154 deletions


@@ -280,27 +280,6 @@ This makes things a bit more reproducible, since the microsecond in which you pr
+
But on the other hand maybe you are interested in observing the interrupts generated by key presses.
==== Console fun
You can also try those on the Ctrl + Alt + F3 of your Ubuntu host, but it is much more fun inside a VM!
Stop blinking:
echo 0 > /sys/class/graphics/fbcon/cursor_blink
Rotate the console 90 degrees!
echo 1 > /sys/class/graphics/fbcon/rotate
Requires `CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y`.
Documented under: `fb/`.
TODO: font and keymap. Mentioned at: https://cmcenroe.me/2017/05/05/linux-console.html and I think can be done with Busybox `loadkmap` and `loadfont`, we just have to understand their formats, related:
* https://unix.stackexchange.com/questions/177024/remap-keyboard-on-the-linux-console
* https://superuser.com/questions/194202/remapping-keys-system-wide-in-linux-not-just-in-x
=== Automatic startup commands
When debugging a module, it becomes tedious to wait for build and re-type:
@@ -1567,6 +1546,33 @@ Those commits change `BR2_LINUX_KERNEL_LATEST_VERSION` in `/linux/Config.in`.
You should then look up if there is a branch that supports that kernel. Staying on branches is a good idea as they will get backports, in particular ones that fix the build as newer host versions come out.
=== Console fun
You can also try those on the Ctrl + Alt + F3 of your Ubuntu host, but it is much more fun inside a VM!
Must be run in <<text-mode,graphical mode>>.
Stop the cursor from blinking:
....
echo 0 > /sys/class/graphics/fbcon/cursor_blink
....
Rotate the console 90 degrees!
....
echo 1 > /sys/class/graphics/fbcon/rotate
....
Requires `CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y`.
Documented under: `fb/`.
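A minimal sketch of guarding the writes above, assuming the attributes live under `/sys/class/graphics/fbcon` as shown. The function name `fbcon_set` and the `FBCON_DIR` override variable are hypothetical, introduced only for illustration. The guard prints a message when the attribute file is missing or not writable, instead of failing with a raw redirection error:

```shell
#!/bin/sh
# Hypothetical helper: write a value to an fbcon sysfs attribute, if present.
# FBCON_DIR is an illustrative override so the function can be tried on a
# scratch directory; by default it points at the path used above.
fbcon_set() {
  attr="$1"
  value="$2"
  file="${FBCON_DIR:-/sys/class/graphics/fbcon}/$attr"
  if [ ! -w "$file" ]; then
    echo "fbcon_set: $file missing or not writable" >&2
    return 1
  fi
  printf '%s\n' "$value" > "$file"
}
```

With this in place, `fbcon_set cursor_blink 0` and `fbcon_set rotate 1` would correspond to the two commands above.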
TODO: font and keymap. Mentioned at: https://cmcenroe.me/2017/05/05/linux-console.html and I think can be done with Busybox `loadkmap` and `loadfont`, we just have to understand their formats, related:
* https://unix.stackexchange.com/questions/177024/remap-keyboard-on-the-linux-console
* https://superuser.com/questions/194202/remapping-keys-system-wide-in-linux-not-just-in-x
=== ftrace
Trace a single function:
@@ -2452,38 +2458,52 @@ One methodology problem is that gem5 and QEMU were run with different kernel con
OK, this is why we used gem5 in the first place, performance measurements!
https://stackoverflow.com/questions/48944587/how-to-count-the-number-of-cpu-clock-cycles-between-the-start-and-end-of-a-bench
Let's benchmark https://en.wikipedia.org/wiki/Dhrystone[Dhrystone] which Buildroot provides.
Let's benchmark https://en.wikipedia.org/wiki/Dhrystone[Dhrystone] which Buildroot provides:
The most flexible way is to do:
....
./gem5-bench dhrystone 1000
./gem5-bench -r dhrystone 1000
# Generate a checkpoint after Linux boots.
# The boot takes a while, be patient young Padawan.
printf 'm5 exit' >readfile.gitignore
./run -a aarch64 -g -E 'm5 checkpoint;m5 readfile > a.sh;sh a.sh'
# Restore the checkpoint, and run the benchmark with parameter 1.000.
# We skip the boot completely, saving time!
printf 'm5 resetstats;dhrystone 1000;m5 exit' >readfile.gitignore
./run -a aarch64 -g -- -r 1
./gem5-ncycles -a aarch64
# Now with another parameter 10.000.
printf 'm5 resetstats;dhrystone 10000;m5 exit' >readfile.gitignore
./run -a aarch64 -g -- -r 1
./gem5-ncycles -a aarch64
....
These commands output the approximate number of CPU cycles it took Dhrystone to run.
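The restore step repeats for every parameter, so it can be wrapped in a loop. The following is a minimal sketch that reuses only the commands shown above. The function names `bench_script` and `bench_all` are hypothetical, and it assumes that the checkpoint from the first `./run` already exists as checkpoint 1 and that `./gem5-ncycles` prints the cycle count to stdout:

```shell
#!/bin/sh
# Hypothetical helper: print the guest script for one Dhrystone run.
bench_script() {
  printf 'm5 resetstats;dhrystone %s;m5 exit' "$1"
}

# Hypothetical helper: restore checkpoint 1 once per parameter and
# print the cycle count after each run. Assumes it is called from the
# repository root, after the checkpoint was generated as shown above.
bench_all() {
  for n in "$@"; do
    bench_script "$n" > readfile.gitignore
    ./run -a aarch64 -g -- -r 1
    echo "dhrystone $n:"
    ./gem5-ncycles -a aarch64
  done
}
```

For example, `bench_all 1000 10000` would run the two restores shown above one after the other, without booting again.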
It works like this:
A more naive and simpler to understand approach would be to run it directly:
* the first command boots Linux with the default simplified `AtomicSimpleCPU`, and generates a <<gem5-checkpoint,checkpoint>> after the kernel boots and before running the benchmark
* the second command restores the checkpoint with the more detailed `HPI` CPU model, and runs the benchmark. We don't boot with it because that is much slower.
....
./run -a aarch64 -g -E 'm5 checkpoint;m5 resetstats;dhrystone 10000;m5 exit'
....
ARM employees have just been modifying benchmarking code with instrumentation directly: https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/xcompile-patch.diff#L230
but the problem is that this method does not make it easy to run a different script without running the boot again, see: <<gem5-restore-new-scrip>>
A few imperfections of our benchmarking method are:
* when we do `m5 resetstats` and `m5 exit`, some time passes before the `exec` system call returns and the actual benchmark starts and ends
* the benchmark outputs to stdout, which means some extra cycles in addition to the actual computation. But TODO: how to get the output to check that it is correct without such IO cycles?
Solutions to these problems include:
* modify benchmark code with instrumentation directly, as PARSEC and ARM employees have been doing: https://github.com/arm-university/arm-gem5-rsk/blob/aa3b51b175a0f3b6e75c9c856092ae0c8f2a7cdc/parsec_patches/xcompile-patch.diff#L230
* monitor known addresses
Discussion at: https://stackoverflow.com/questions/48944587/how-to-count-the-number-of-cpu-clock-cycles-between-the-start-and-end-of-a-bench/48944588#48944588
Those problems should be insignificant if the benchmark runs for long enough, however.
TODO: even if we don't switch to the detailed CPU model, the cycle counts on the original run and the one with checkpoint restore differ slightly. Why? Multiple checkpoint restores give the same results as expected however:
....
./run -a arm -E 'm5 checkpoint;m5 resetstats;dhrystone 1000;m5 exit' -g
./run -a arm -g -- -r 1
....
Now you can play a fun little game with your friends:
* pick a computational problem
@@ -2525,6 +2545,8 @@ getconf _NPROCESSORS_CONF
===== gem5 cache size
https://stackoverflow.com/questions/49624061/how-to-run-gem5-simulator-in-fs-mode-without-cache/49634544#49634544
A quick `+./run -g -- -h+` leads us to the options:
....
@@ -2541,10 +2563,22 @@ But keep in mind that it only affects benchmark performance of the most detailed
[options="header"]
|===
|arch |CPU type |caches used
|X86 |`AtomicSimpleCPU` | no
|X86 |`DerivO3CPU` | ?*
|ARM |`AtomicSimpleCPU` | no
|ARM |`HPI` | yes
|X86
|`AtomicSimpleCPU`
|no
|X86
|`DerivO3CPU`
|?*
|ARM
|`AtomicSimpleCPU`
|no
|ARM
|`HPI`
|yes
|===
{empty}*: couldn't test because of:
@@ -2552,91 +2586,6 @@ But keep in mind that it only affects benchmark performance of the most detailed
* https://stackoverflow.com/questions/49011096/how-to-switch-cpu-models-in-gem5-after-restoring-a-checkpoint-and-then-observe-t
* https://github.com/gem5/gem5/issues/16
This has been verified with:
....
m5 resetstats && dhrystone 10000 && m5 dumpstats
....
at commit da79d6c6cde0fbe5473ce868c9be4771160a003b with the following gem5 commands cycle counts:
....
# 11M
./run -a arm -g
./run -a arm -g -- --caches --l2cache
# 175M
./run -a arm -g -- --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024 --cpu-type=HPI
# 16M
./run -a arm -g -- --caches --l1d_size=1024MB --l1i_size=1024MB --l2cache --l2_size=1024MB --l3_size=1024MB --cpu-type=HPI
# 20M
./run -a x86_64 -g -- --caches --l1d_size=1024 --l2cache --l2_size=1024 --l3_size=1024
./run -a x86_64 -g -- --caches --l1d_size=1024MB --l2cache --l2_size=1024MB --l3_size=1024MB
....
At commit f3503b4cc810556df3c736d0a147cc54e05efc83:
....
cmd='./run -a aarch64 -g'
cache_small='--caches --l2cache --l1d_size=1024 --l1i_size=1024 --l2_size=1024 --l3_size=1024'
cache_large='--caches --l2cache --l1d_size=1024kB --l1i_size=1024kB --l2_size=1024kB --l3_size=1024kB'
printf '#!/bin/sh
m5 resetstats
dhrystone 1000
m5 exit
' >readfile.gitignore
chmod +x readfile.gitignore
# Create the checkpoints after the kernel boot.
# cpt 1: no caches
$cmd -E 'm5 checkpoint;m5 readfile > a.sh;sh a.sh'
# cpt 2: small caches
$cmd -E 'm5 checkpoint;m5 readfile > a.sh;sh a.sh' -- $cache_small
# cpt 3: large caches
$cmd -E 'm5 checkpoint;m5 readfile > a.sh;sh a.sh' -- $cache_large
# cpt 4: small caches HPI
$cmd -E 'm5 checkpoint;m5 readfile > a.sh;sh a.sh' -- $cache_small --cpu-type=HPI
# cpt 5: large caches HPI
$cmd -E 'm5 checkpoint;m5 readfile > a.sh;sh a.sh' -- $cache_large --cpu-type=HPI
# dhrystone 1.000
# 2.738.340
$cmd -- -r 1
# 2.738.343
$cmd -- -r 2 $cache_small
# 2.738.307
$cmd -- -r 3 $cache_large
sed -Ei 's/^dhrystone .*/dhrystone 10000/' readfile.gitignore
# 10.995.467
$cmd -- -r 1
# 10.995.470
$cmd -- -r 2 $cache_small
# 10.995.434
$cmd -- -r 3 $cache_large
sed -Ei 's/^dhrystone .*/dhrystone 100000/' readfile.gitignore
# 93.475.029
$cmd -- -r 1
# 93.475.032
$cmd -- -r 2 $cache_small
# 93.475.091
$cmd -- -r 3 $cache_large
# 50.193.186
$cmd -E 'm5 checkpoint;m5 readfile > a.sh;sh a.sh' -- $cache_small --cpu-type=HPI
# 5.924.610
$cmd -E 'm5 checkpoint;m5 readfile > a.sh;sh a.sh' -- $cache_large --cpu-type=HPI
# 2.736.509
$cmd -E 'm5 checkpoint;m5 readfile > a.sh;sh a.sh' -- $cache_small --restore-with-cpu=HPI -r 2
# 2.736.949
$cmd -E 'm5 checkpoint;m5 readfile > a.sh;sh a.sh' -- $cache_large --restore-with-cpu=HPI -r 3
....
Cache sizes can in theory be checked with the methods described at: link:https://superuser.com/questions/55776/finding-l2-cache-size-in-linux[]:
....
@@ -2655,6 +2604,15 @@ Behaviour breakdown:
* arm QEMU and gem5 (both `AtomicSimpleCPU` or `HPI`), x86 gem5: `/sys` files don't exist, and `getconf` and `lscpu` values empty
* x86 QEMU: `/sys` files exist, but `getconf` and `lscpu` values still empty
So we take a performance measurement approach instead:
....
./gem5-bench-cache -a aarch64
cat out/aarch64/gem5/bench-cache.txt
....
TODO: sort out HPI, and then paste results here. Why does `--cpu-type=HPI` there always generate a `switch_cpu`, even if the original run was also on HPI?
===== gem5 memory latency
TODO These look promising:
@@ -3126,14 +3084,15 @@ Internals:
* the checkpoints are stored under `out/$arch/gem5/m5out/cpt.$todo_whatisthis`
* <<m5>> is a guest utility present inside the gem5 tree which we cross-compiled and installed into the guest
[[gem5-restore-new-scrip]]
===== gem5 checkpoint restore and run a different script
You want to automate running several tests from a single pristine post-boot state.
The problem is that after the checkpoint, the memory and disk states are fixed, so you can't for example:
* hack up an existing rc script
* inject new kernel boot command line options
* hack up an existing rc script, since the disk is fixed
* inject new kernel boot command line options, since those have already been put into memory by the bootloader
There is however one loophole: <<m5-readfile>>, which reads whatever is present on the host, so we can do it like: