diff --git a/.gitignore b/.gitignore
index ec4bc4e..04677c6 100644
--- a/.gitignore
+++ b/.gitignore
@@ -25,8 +25,11 @@ __pycache__
 # Accidents.
 /core
 /m5out
+
+# In-tree userland builds.
 *.o
 *.out
+*.so

 # Kernel modules.
 *.ko
@@ -40,3 +43,7 @@ modules.order

 # node.js
 node_modules
+
+# Performance profiling stuff.
+perf.data
+callgrind.out.*
diff --git a/index.html b/index.html
index 21df1d2..3a7e67f 100644
--- a/index.html
+++ b/index.html
@@ -673,8 +673,8 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
see also: Section 19.18, “gem5 ARM platforms”.
+see also: Section 19.17, “gem5 ARM platforms”.
This generates yet more separate images with new magic constants:
@@ -7457,7 +7482,7 @@ qw er
At 125d14805f769104f93c510bedaa685a52ec025d we moved Buildroot from uClibc to glibc, and caused some user mode pain, which we document here.
glibc has a check for kernel version, likely obtained from the uname syscall, and if the kernel is not new enough, it quits.
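As an illustration (a hedged sketch, not glibc's actual code), the value glibc checks can be inspected from the guest with a plain uname call:

#include <stdio.h>
#include <sys/utsname.h>

int main(void) {
    /* glibc compares the running kernel's release string against the
     * minimum version it was configured with (--enable-kernel) and
     * aborts with "FATAL: kernel too old" if it is older. */
    struct utsname info;
    if (uname(&info) == 0)
        printf("kernel release: %s\n", info.release);
    return 0;
}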
For some reason QEMU / glibc x86_64 picks up the host libc, which breaks things.
QEMU x86_64 guest on x86_64 host was failing with stack smashing detected, but we found a workaround
+QEMU x86_64 guest on x86_64 host was failing with stack smashing detected when using glibc, but we found a workaround
gem5 user only supported static executables in the past, as mentioned at: Section 10.7, “gem5 syscall emulation mode”
@@ -17890,6 +17915,12 @@ root
gem5 however has tended towards horrendously intensive code generation in order to support all its different hardware types
gem5 also has a complex Python interface which is also largely auto-generated, which greatly increases the maintenance complexity of the project: Embedding Python in another application.
+This is done so that platforms can be reconfigured quickly without recompiling, and it is amazing when it works, but the maintenance costs are also very high.
+but the problem is that this method does not make it easy to run a different script without running the boot again. The ./gem5.sh script works around that by using m5 readfile as explained further at: Section 19.5.2, “gem5 checkpoint restore and run a different script”.
but the problem is that this method does not allow to easily run a different script without running the boot again. The ./gem5.sh script works around that by using m5 readfile as explained further at: Section 19.5.3, “gem5 checkpoint restore and run a different script”.
Now you can play a fun little game with your friends:
@@ -18147,10 +18178,13 @@ ps Haux | grep qemu | wc
gem5 user mode multithreading has been particularly flaky compared to QEMU’s.
+gem5 user mode multithreading has been particularly flaky compared to QEMU’s, but work is being put into improving it.
You have the limitation that you must have at least one core per guest thread, otherwise pthread_create fails. For example:
In gem5 syscall emulation, the fork syscall checks if there is a free CPU, and if so, the new thread runs on that CPU. Otherwise, the fork call fails, and therefore higher level interfaces to fork such as pthread_create also fail and return a failure status in the guest.
For example, if we use just one CPU for userland/posix/pthread_self.c which spawns one thread besides main:
fails because that process has a total of 2 threads: one for main and one extra thread spawned: userland/posix/pthread_self.c The error message is:
fails with this error message coming from the guest stderr:
This has to do with the fact that gem5 has a more simplistic threading implementation that does not spawn one host thread per guest thread. Maybe this is required to achieve reproducible runs? What is the task switch algorithm then?
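To make the failure mode concrete, here is a minimal hedged sketch (not the repository's pthread_self.c itself) of how a guest program observes the failure: pthread_create simply returns a nonzero error code when gem5 has no free CPU:

#include <pthread.h>
#include <stdio.h>

static void *routine(void *arg) {
    return arg;
}

int main(void) {
    /* On gem5 syscall emulation with all CPUs busy, the underlying
     * clone fails, so pthread_create returns a nonzero error code
     * instead of 0. Compile with: gcc -pthread */
    pthread_t thread;
    int ret = pthread_create(&thread, NULL, routine, NULL);
    if (ret != 0) {
        fprintf(stderr, "pthread_create failed: %d\n", ret);
        return 1;
    }
    pthread_join(thread, NULL);
    return 0;
}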
+Once threads exit, their CPU is freed and becomes available for new fork calls. For example, the following run spawns a thread, joins it, and then spawns again, and 2 CPUs are enough:
./run --cpus 2 --emulator gem5 --userland userland/posix/pthread_self.c --userland-args '1 2'+
gem5 threading does however show the expected number of cores, e.g.:
+because at each point in time, only up to two threads are running.
+gem5 syscall emulation does show the expected number of cores when queried, e.g.:
outputs 1 and 2 respectively.
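For reference, such a query can boil down to something as simple as the following sketch; glibc typically implements it via interfaces such as sched_getaffinity, which gem5 syscall emulation handles:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Prints the number of online CPUs as seen by the guest,
     * e.g. 1 or 2 depending on the --cpus option. */
    printf("%ld\n", sysconf(_SC_NPROCESSORS_ONLN));
    return 0;
}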
TODO: aarch64 seems to be failing to spawn more than 2 threads at 369a47fc6e5c2f4a7f911c1c058b6088f8824463 + 1:
-./run --arch aarch64 --cpus 3 --emulator gem5 --userland userland/posix/pthread_self.c --userland-args 2-
fails with:
-Exiting @ tick 18446744073709551615 because simulate() limit reached-
Documentation: http://gem5.org/Checkpoints
To see it in action try:
+./run --arch arm --emulator gem5+
./run --arch aarch64 --emulator gem5
since boot has already happened, and the parameters are already in the RAM of the snapshot.
In order to debug checkpoint restore bugs, this minimal setup using userland/freestanding/gem5_checkpoint_restore.S can be handy:
+./build-userland --arch aarch64 --static
+./run --arch aarch64 --emulator gem5 --static --userland userland/freestanding/gem5_checkpoint_restore.S --trace-insts-stdout
+./run --arch aarch64 --emulator gem5 --static --userland userland/freestanding/gem5_checkpoint_restore.S --trace-insts-stdout --gem5-restore 1
+./run --arch aarch64 --emulator gem5 --static --userland userland/freestanding/gem5_checkpoint_restore.S --trace-insts-stdout --gem5-restore 1 -- --cpu-type=DerivO3CPU --restore-with-cpu=DerivO3CPU --caches+
On the initial run, we see that all instructions are executed and the checkpoint is taken:
+0: system.cpu: A0 T0 : @asm_main_after_prologue : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
+ 500: system.cpu: A0 T0 : @asm_main_after_prologue+4 : movz x1, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
+ 1000: system.cpu: A0 T0 : @asm_main_after_prologue+8 : m5checkpoint : IntAlu : flags=(IsInteger|IsNonSpeculative|IsUnverifiable)
+Writing checkpoint
+warn: Checkpoints for file descriptors currently do not work.
+info: Entering event queue @ 1000. Starting simulation...
+ 1500: system.cpu: A0 T0 : @asm_main_after_prologue+12 : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
+ 2000: system.cpu: A0 T0 : @asm_main_after_prologue+16 : m5exit : No_OpClass : flags=(IsInteger|IsNonSpeculative)
+Exiting @ tick 2000 because m5_exit instruction encountered+
Then, on the first restore run, the checkpoint is restored, and only instructions after the checkpoint are executed:
+info: Entering event queue @ 1000. Starting simulation...
+ 1500: system.cpu: A0 T0 : @asm_main_after_prologue+12 : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
+ 2000: system.cpu: A0 T0 : @asm_main_after_prologue+16 : m5exit : No_OpClass : flags=(IsInteger|IsNonSpeculative)
+Exiting @ tick 2000 because m5_exit instruction encountered+
and a similar thing happens for the restore with a different CPU type:
+info: Entering event queue @ 1000. Starting simulation...
+ 79000: system.cpu: A0 T0 : @asm_main_after_prologue+12 : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 FetchSeq=1 CPSeq=1 flags=(IsInteger)
+Exiting @ tick 84500 because m5_exit instruction encountered+
Here we don’t see the last m5exit instruction in the log, but it must just be something to do with the O3 logging.
Checkpoints are stored inside the m5out directory at:
You want to automate running several tests from a single pristine post-boot state.
gem5 can switch to a different CPU model when restoring a checkpoint.
A common combo is to boot Linux with a fast CPU, make a checkpoint and then replay the benchmark of interest with a slower CPU.
An illustrative interactive run:
+This can be observed interactively in full system with:
./run --arch arm --emulator gem5+
./run --arch aarch64 --emulator gem5
In guest:
+Then in the guest terminal after boot ends:
m5 checkpoint+
sh -c 'm5 checkpoint;sh'
+m5 exit
And then restore the checkpoint with a different CPU:
+And then restore the checkpoint with a different slower CPU:
./run --arch arm --emulator gem5 --gem5-restore 1 -- --caches --restore-with-cpu=HPI+
./run --arch arm --emulator gem5 --gem5-restore 1 -- --caches --cpu-type=DerivO3CPU+
And now you will notice that everything happens much slower in the guest terminal!
+One even more direct and minimal way to observe this is with userland/freestanding/gem5_checkpoint_restore.S, which was mentioned at gem5 checkpoint userland minimal example, plus some logging:
+./run \
+ --arch aarch64 \
+ --emulator gem5 \
+ --static \
+ --trace ExecAll,FmtFlag,O3CPU,SimpleCPU \
+ --userland userland/freestanding/gem5_checkpoint_restore.S \
+;
+cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"
+./run \
+ --arch aarch64 \
+ --emulator gem5 \
+ --gem5-restore 1 \
+ --static \
+ --trace ExecAll,FmtFlag,O3CPU,SimpleCPU \
+ --userland userland/freestanding/gem5_checkpoint_restore.S \
+ -- \
+ --caches \
+ --cpu-type DerivO3CPU \
+ --restore-with-cpu DerivO3CPU \
+;
+cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"+
At gem5 2235168b72537535d74c645a70a85479801e0651, the first run does everything in AtomicSimpleCPU:
+...
+ 0: SimpleCPU: system.cpu.dcache_port: received snoop pkt for addr:0x1f92 WriteReq
+ 0: SimpleCPU: system.cpu.dcache_port: received snoop pkt for addr:0x1e40 WriteReq
+ 0: SimpleCPU: system.cpu.dcache_port: received snoop pkt for addr:0x1e30 WriteReq
+ 0: SimpleCPU: system.cpu: Tick
+ 0: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
+ 500: SimpleCPU: system.cpu: Tick
+ 500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4 : movz x1, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
+ 1000: SimpleCPU: system.cpu: Tick
+ 1000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8 : m5checkpoint : IntAlu : flags=(IsInteger|IsNonSpeculative|IsUnverifiable)
+ 1000: SimpleCPU: system.cpu: Resume
+ 1500: SimpleCPU: system.cpu: Tick
+ 1500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+12 : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
+ 2000: SimpleCPU: system.cpu: Tick
+ 2000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+16 : m5exit : No_OpClass : flags=(IsInteger|IsNonSpeculative)+
and after restore we see as expected a single ExecEnable instruction executed amidst O3CPU noise:
FullO3CPU: Ticking main, FullO3CPU.
+ 79000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+12 : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 FetchSeq=1 CPSeq=1 flags=(IsInteger)
+ 82500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400084=>0x400088).(0=>1) [sn:1]
+ 82500: O3CPU: system.cpu: Removing instruction, [tid:0] [sn:1] PC (0x400084=>0x400088).(0=>1)
+ 82500: O3CPU: system.cpu: Scheduling next tick!
+ 83000: O3CPU: system.cpu:+
which is the movz after the checkpoint. The final m5exit does not appear due to DerivO3CPU logging insanity.
Bibliography:
+Besides switching CPUs after a checkpoint restore, fs.py also has the --fast-forward option to automatically run the script from the start on a less detailed CPU, and switch to a more detailed CPU at a given tick.
This is generally useless compared to checkpoint restoring because:
+checkpoint restore allows running multiple contents after the restore, and restoring to multiple different system states, which you almost always want to do
+we generally don’t know the exact tick at which the region of interest will start, especially as the binaries change. It is much easier to just instrument the content with a checkpoint m5op, as sketched below
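For example, a hedged sketch of such instrumentation, assuming the program is linked against gem5's libm5 so that its m5ops.h header is available:

#include <gem5/m5ops.h>

/* Stand-in for the actual region of interest. */
static void benchmark(void) {}

int main(void) {
    /* Take a checkpoint right before the region of interest... */
    m5_checkpoint(0, 0);
    benchmark();
    /* ...and end the simulation right after it. */
    m5_exit(0);
    return 0;
}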
+But let’s give it a try anyways with userland/freestanding/gem5_checkpoint_restore.S, which was mentioned at gem5 checkpoint userland minimal example:
+./run \
+ --arch aarch64 \
+ --emulator gem5 \
+ --static \
+ --trace ExecAll,FmtFlag,O3CPU,SimpleCPU \
+ --userland userland/freestanding/gem5_checkpoint_restore.S \
+ -- \
+ --caches \
+ --cpu-type DerivO3CPU \
+ --fast-forward 1000 \
+;
+cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"+
At gem5 2235168b72537535d74c645a70a85479801e0651 we see something like:
+0: O3CPU: system.switch_cpus: Creating O3CPU object.
+ 0: O3CPU: system.switch_cpus: Workload[0] process is 0
+ 0: SimpleCPU: system.cpu: ActivateContext 0
+ 0: SimpleCPU: system.cpu.dcache_port: received snoop pkt for addr:0 WriteReq
+ 0: SimpleCPU: system.cpu.dcache_port: received snoop pkt for addr:0x40 WriteReq
+...
+
+ 0: SimpleCPU: system.cpu.dcache_port: received snoop pkt for addr:0x1f92 WriteReq
+ 0: SimpleCPU: system.cpu.dcache_port: received snoop pkt for addr:0x1e40 WriteReq
+ 0: SimpleCPU: system.cpu.dcache_port: received snoop pkt for addr:0x1e30 WriteReq
+ 0: SimpleCPU: system.cpu: Tick
+ 0: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
+ 500: SimpleCPU: system.cpu: Tick
+ 500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4 : movz x1, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
+ 1000: SimpleCPU: system.cpu: Tick
+ 1000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8 : m5checkpoint : IntAlu : flags=(IsInteger|IsNonSpeculative|IsUnverifiable)
+ 1000: O3CPU: system.switch_cpus: [tid:0] Calling activate thread.
+ 1000: O3CPU: system.switch_cpus: [tid:0] Adding to active threads list
+ 1500: O3CPU: system.switch_cpus:
+
+FullO3CPU: Ticking main, FullO3CPU.
+ 1500: O3CPU: system.switch_cpus: Scheduling next tick!
+ 2000: O3CPU: system.switch_cpus:
+
+FullO3CPU: Ticking main, FullO3CPU.
+ 2000: O3CPU: system.switch_cpus: Scheduling next tick!
+ 2500: O3CPU: system.switch_cpus:
+
+...
+
+FullO3CPU: Ticking main, FullO3CPU.
+ 44500: ExecEnable: system.switch_cpus: A0 T0 : @asm_main_after_prologue+12 : movz x0, #0, #0 : IntAlu : D=0x00000000000
+ 48000: O3CPU: system.switch_cpus: Removing committed instruction [tid:0] PC (0x400084=>0x400088).(0=>1) [sn:1]
+ 48000: O3CPU: system.switch_cpus: Removing instruction, [tid:0] [sn:1] PC (0x400084=>0x400088).(0=>1)
+ 48000: O3CPU: system.switch_cpus: Scheduling next tick!
+ 48500: O3CPU: system.switch_cpus:
+
+...+
We can also compare that to the same log but without --fast-forward and other CPU switch options:
0: SimpleCPU: system.cpu.dcache_port: received snoop pkt for addr:0x1e40 WriteReq
+ 0: SimpleCPU: system.cpu.dcache_port: received snoop pkt for addr:0x1e30 WriteReq
+ 0: SimpleCPU: system.cpu: Tick
+ 0: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
+ 500: SimpleCPU: system.cpu: Tick
+ 500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4 : movz x1, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
+ 1000: SimpleCPU: system.cpu: Tick
+ 1000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8 : m5checkpoint : IntAlu : flags=(IsInteger|IsNonSpeculative|IsUnverifiable)
+ 1000: SimpleCPU: system.cpu: Resume
+ 1500: SimpleCPU: system.cpu: Tick
+ 1500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+12 : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
+ 2000: SimpleCPU: system.cpu: Tick
+ 2000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+16 : m5exit : No_OpClass : flags=(IsInteger|IsNonSpeculative)+
Therefore, it is clear that what we wanted happened:
+up until tick 1000, SimpleCPU was ticking
after tick 1000, the O3CPU started ticking
Bibliography:
+because glibc was built to expect a newer Linux kernel as shown at: Section 10.4.1, “FATAL: kernel too old”. Your choices to solve this are:
+because glibc was built to expect a newer Linux kernel as shown at: Section 10.4.1, “FATAL: kernel too old failure in userland simulation”. Your choices to solve this are:
It is obviously not possible to understand what they actually do from their commit message, so let’s explain them one by one here as we understand them:
+It is obviously not possible to understand what the Linux kernel fork commits actually do from their commit message, so let’s explain them one by one here as we understand them:
the type parameter is present on every node, and it maps to a Python object that inherits from SimObject.
For example, AtomicSimpleCPU is defined at src/cpu/simple/AtomicSimpleCPU.py.
All those tests could in theory be added to this repo instead of to gem5, and this is actually the superior setup as it is cross emulator.
+But can the people from the project be convinced of that?
+These are just very small GTest tests that test a single class in isolation; they don’t run any executables.
Note that the command and its corresponding results don’t necessarily show up consecutively on stdout because tests are run in parallel. You just have to match them based on the class name CircleBufTest to the file circlebuf.test.cpp.
This section is about running the gem5 in-tree tests.
+After the first run has downloaded the test binaries for you, you can speed up the process a little bit by skipping an useless scons call:
+After the first run has downloaded the test binaries for you, you can speed up the process a little bit by skipping a useless SCons call:
Note however that --skip-build is required at least once per branch to download the test binaries, because the test interface is bad.
Note however that running without --skip-build is required at least once to download the test binaries, because the test interface is bad.
List available tests instead of running them:
+./gem5-regression --gem5-worktree master --arch aarch64 --cmd list+
You can then pick one suite (it has to be a suite, not an "individual test") from the list and run just that one, e.g. with:
+./gem5-regression --arch aarch64 -- --uid SuiteUID:tests/gem5/cpu_tests/test.py:cpu_test_AtomicSimpleCPU_Bubblesort-ARM-opt+
This error happens when the following instruction limits are reached:
In order to use different build options, you might also want to use gem5 build variants to keep the build outputs separate from one another.
Explained at: Section 18.7, “Debug the emulator”.
./build-gem5 --gem5-build-type fast+
How it goes faster is explained at: https://stackoverflow.com/questions/59860091/how-to-increase-the-simulation-speed-of-a-gem5-run/59861375#59861375
+Benchmarks are present at:
+Profiling builds as of 3cea7d9ce49bda49c50e756339ff1287fd55df77 both use -g -O3 and disable asserts and logging like the gem5 fast build, and:
prof uses -pg for gprof
perf uses -lprofiler for google-pprof
See also: Profiling userland programs.
+TODO test properly, benchmark vs GCC.
If gem5 appears to have a C++ undefined behaviour bug, which is often very difficult to track down, you can try to build it with the following extra SCons options:
Ruby is a system that includes the SLICC domain specific language to describe memory systems: http://gem5.org/Ruby
+gem5 has two types of memory system:
+the classic memory system, which is used by default
+the Ruby memory system
+The Ruby memory system includes the SLICC domain specific language to describe memory systems: http://gem5.org/Ruby
It seems to have usage outside of gem5, but the naming overload with the Ruby programming language, which also has domain specific languages as a concept, makes it impossible to google anything about it!
Ruby is activated at compile time with the PROTOCOL flag, which specifies the desired memory system time.
Since it is not the default, Ruby is generally less stable than the classic memory model. However, because it allows describing a wide variety of important coherency protocols, while the classic system only describes a single protocol, Ruby is a very important feature of gem5.
+Ruby support must be enabled at compile time with the scons PROTOCOL= flag, which compiles support for the desired memory system type.
Note however that most ISAs already implicitly set PROTOCOL via the build_opts/ directory, e.g. build_opts/ARM contains:
PROTOCOL = 'MOESI_CMP_directory'+
and therefore ARM already compiles MOESI_CMP_directory by default.
Then, with fs.py and se.py, you can choose to use either the classic or the built-in Ruby system at runtime with the --ruby option:
if --ruby is given, use the ruby memory system
otherwise, use the classic memory system
+For example, to use a two level MESI cache coherence protocol, we can do:
@@ -20173,10 +20578,7 @@ Indirect leak of 1346 byte(s) in 2 object(s) allocated from:
which shows that dozens of C++ files are being generated from Ruby SLICC.
TODO observe it doing something during a run.
-The relevant source files live in the source tree under:
+The relevant Ruby source files live in the source tree under:
We already pass the SLICC_HTML flag by default to the build, which generates an HTML summary of each memory protocol under:
We already pass the SLICC_HTML flag by default to the build, which generates an HTML summary of each memory protocol under (TODO broken: https://gem5.atlassian.net/browse/GEM5-357):
A minimized ruby config which was not merged upstream can be found for study at: https://gem5-review.googlesource.com/c/public/gem5/+/13599/1
One easy way to see that Ruby is being used without understanding it in detail is to enable some logging:
+./run \
+ --arch aarch64 \
+ --emulator gem5 \
+ --gem5-worktree master \
+ --userland userland/arch/aarch64/freestanding/linux/hello.S \
+ --static \
+ --trace ExecAll,FmtFlag,Ruby,XBar \
+ -- \
+ --ruby \
+;
+cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"+
Then:
+when the --ruby flag is given, we see a gazillion Ruby related messages prefixed e.g. by RubyPort:.
We also observe from ExecEnable lines that instruction timing is not simple anymore, so the memory system must have latencies
without --ruby, we instead see XBar (Coherent Crossbar) related messages such as CoherentXBar:, which I believe is the more precise name for the memory model that the classic memory system uses
Certain features may not work in Ruby. For example, gem5 checkpoint creation is only possible in Ruby protocols that support flush, which is the case for PROTOCOL=MOESI_hammer but not PROTOCOL=MESI_Three_Level: https://www.mail-archive.com/gem5-users@gem5.org/msg17418.html
Tested in gem5 d7d9bc240615625141cd6feddbadd392457e49eb.
+Python 3 support was mostly added in 2019 Q3 at around a347a1a68b8a6e370334be3a1d2d66675891e0f1 but remained buggy for some time afterwards.
gem5 has a few in tree CPU models for different purposes.
Both of those can be checked with git log and git blame.
Simple abstract CPU without a pipeline.
Generic in-order core that does not model any specific CPU.
Generic out-of-order core. "O3" stands for "Out Of Order"!
The gem5 platform is selectable with the --machine option, which is named after the analogous QEMU -machine option, and which sets the underlying gem5 --machine-type option.
Internals under other sections:
In order to develop complex C++ software such as gem5, a good IDE setup is fundamental.
The interaction uses the Python C extension interface https://docs.python.org/2/extending/extending.html interface through the pybind11 helper library: https://github.com/pybind/pybind11
+The interaction uses the Python C extension interface (https://docs.python.org/2/extending/extending.html) through the pybind11 helper library: https://github.com/pybind/pybind11
The C++ executable both:
@@ -20558,7 +21000,7 @@ cd ..
then gem5 magic simobject class adds some crazy stuff on top of it further… is is a mess. in particular, it auto generates params/ headers. TODO: why is this mess needed at all? pybind11 seems to handle constructor arguments just fine:
then the gem5 magic SimObject class adds some crazy stuff on top of it; it is a mess. In particular, it auto generates params/ headers. TODO: why is this mess needed at all? pybind11 seems to handle constructor arguments just fine:
Since BadDevice has no __init__ method, and neither BasicPioDevice, it all just falls through until the SimObject.init constructor.
Since BadDevice has no __init__ method, and neither does BasicPioDevice, it all just falls through until the SimObject.__init__ constructor.
This constructor will loop through the inheritance chain and give the Python parameters to the C++ BadDeviceParams class as follows.
@@ -20689,11 +21131,17 @@ static EmbeddedPyBind embed_obj("BadDevice", module_init, "BasicPioDevice");It has been found that this usage of pybind11 across hundreds of SimObject files accounted for 50% of the gem5 build time at one point: https://gem5.atlassian.net/browse/GEM5-366
To get a feeling of how SimObject objects are run, see: gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis.
Tested on gem5 08c79a194d1a3430801c04f37d13216cc9ec1da3.
The main is at: src/sim/main.cc. It calls:
and that is where doSimLoop the main event loop, doSimLoop gets called and starts kicking off the gem5 event queue.
and that is where the main event loop, doSimLoop, gets called and starts kicking off the gem5 event queue.
Tested at gem5 b4879ae5b0b6644e6836b0881e4da05c64a6550d.
gem5 is an event based simulator, and as such the event queue is one of the crucial elements in the system.
This calls the Event::process method of the event.
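As a rough mental model only (this is not gem5's actual implementation), the event queue can be pictured as a tick-ordered list of callbacks that the simulation loop pops and processes:

#include <stdio.h>

/* Rough mental model, not gem5's code: events ordered by tick,
 * popped and processed one by one until none remain. */
typedef struct Event {
    unsigned long long tick;
    void (*process)(struct Event *self);
    struct Event *next;
} Event;

static Event *queue;

static void schedule(Event *e) {
    /* Insert into the list, keeping it sorted by tick. */
    Event **p = &queue;
    while (*p && (*p)->tick <= e->tick)
        p = &(*p)->next;
    e->next = *p;
    *p = e;
}

static void simulate(void) {
    while (queue) {
        Event *e = queue;
        queue = e->next;
        /* Corresponds conceptually to Event::process in gem5. */
        e->process(e);
    }
}

static void hello(Event *e) {
    printf("tick %llu\n", e->tick);
}

int main(void) {
    Event e1 = {500, hello, NULL}, e2 = {0, hello, NULL};
    schedule(&e1);
    schedule(&e2);
    simulate(); /* prints tick 0, then tick 500 */
    return 0;
}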
Let’s now analyze every single event on a minimal gem5 syscall emulation mode in the simplest CPU that we have:
which calls EventManager::schedule.
AtomicSimpleCPU is an EventManager because SimObject inherits from it.
tickEvent is an EventFunctionWrapper which contains a std::function<void(void)> callback;, and is initialized in the constructor as:
The call stack above ArmLinuxProcess64::initState is pybind11 fuzziness, but if we grep a bit we find the Python call point:
src/python/m5/simulate.py
+def instantiate(ckpt_dir=None):
+
+    ...
+
+    # Create the C++ sim objects and connect ports
+    for obj in root.descendants(): obj.createCCObject()
+    for obj in root.descendants(): obj.connectPorts()
+
+    # Do a second pass to finish initializing the sim objects
+    for obj in root.descendants(): obj.init()
+
+    ...
+
+    # Restore checkpoint (if any)
+    if ckpt_dir:
+        ...
+    else:
+        for obj in root.descendants(): obj.initState()+
As we can see, initState is just one stage of generic SimObject initialization. root.descendants() goes over the entire SimObject tree calling initState().
Finally, we see that initState is part of the SimObject C++ API:
src/sim/sim_object.hh
+class SimObject : public EventManager, public Serializable, public Drainable,
+ public Stats::Group
+{
+
+ ...
+
+ /**
+ * initState() is called on each SimObject when *not* restoring
+ * from a checkpoint. This provides a hook for state
+ * initializations that are only required for a "cold start".
+ */
+ virtual void initState();
+Then, we see that initState is exposed to the Python API at:
build/ARM/python/_m5/param_SimObject.cc
+module_init(py::module &m_internal)
+{
+ py::module m = m_internal.def_submodule("param_SimObject");
+ py::class_<SimObjectParams, std::unique_ptr<SimObjectParams, py::nodelete>>(m, "SimObjectParams")
+ .def_readwrite("name", &SimObjectParams::name)
+ .def_readwrite("eventq_index", &SimObjectParams::eventq_index)
+ ;
+
+ py::class_<SimObject, Drainable, Serializable, Stats::Group, std::unique_ptr<SimObject, py::nodelete>>(m, "SimObject")
+ .def("init", &SimObject::init)
+ .def("initState", &SimObject::initState)
+ .def("memInvalidate", &SimObject::memInvalidate)
+ .def("memWriteback", &SimObject::memWriteback)
+ .def("regProbePoints", &SimObject::regProbePoints)
+ .def("regProbeListeners", &SimObject::regProbeListeners)
+ .def("startup", &SimObject::startup)
+ .def("loadState", &SimObject::loadState, py::arg("cp"))
+ .def("getPort", &SimObject::getPort, pybind11::return_value_policy::reference, py::arg("if_name"), py::arg("idx"))
+ ;
+
+}
+which is more magical than the other param classes since py::class_<SimObject has non-trivial methods; those are auto-generated by the cxx_exports code generation mechanism:
class SimObject(object):
+
+ ...
+
+ cxx_exports = [
+ PyBindMethod("init"),
+ PyBindMethod("initState"),
+ PyBindMethod("memInvalidate"),
+ PyBindMethod("memWriteback"),
+ PyBindMethod("regProbePoints"),
+ PyBindMethod("regProbeListeners"),
+ PyBindMethod("startup"),
+ ]
+So that’s how the main atomic tick loop works, fully understood!
TODO: analyze better what each of the memory event mean. For now, we have just collected a bunch of data there, but needs interpreting. The CPU specifics in this section are already insightful however.
Let’s just add --caches to see if things go any faster:
The events for the Atomic CPU were pretty simple: basically just ticks.
This describes the internals of the gem5 m5out/stats.txt file.
gem5 uses a ton of code generation, which makes the project horrendous:
But it has been widely overused to insanity. It likely also exists partly because when the project started in 2003 C++ compilers weren’t that good, so you couldn’t rely on features like templates that much.
Generated code at: build/<ISA>/config/the_isa.hh which contains amongst other lines:
E.g. src/cpu/decode_cache.hh includes:
Some SCons madness.
It is likely preventing ccache hits when building to different output paths, because it makes the -I includes point to different paths. This is especially important for gem5 Ruby build, which could have the exact same source files as the non-Ruby builds: https://stackoverflow.com/questions/60340271/can-ccache-handle-symlinks-to-the-same-input-source-file-as-hits
when debugging the emulator, it shows you directories inside the build directory rather than in the source tree
Certain ISAs like ARM have bootloaders that are automatically run before the main image to setup basic system state.
libguestfs: https://serverfault.com/questions/246835/convert-directory-to-qemu-kvm-virtual-disk-image/916697#916697, in particular vfs-minimum-size
use methods described at: Section 19.5.2, “gem5 checkpoint restore and run a different script” instead of putting builds on the root filesystem
+use methods described at: Section 19.5.3, “gem5 checkpoint restore and run a different script” instead of putting builds on the root filesystem
Leads to the dreadful "Stack smashing detected" message, which is infinitely better than a silent break in any case.
We had also seen this error in our repository at: stack smashing detected when using glibc.
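For reference, a deliberately buggy program along these lines typically triggers that message when compiled with GCC's stack protector enabled (e.g. -fstack-protector-all); this is just an illustrative sketch, not one of the repository's examples:

#include <string.h>

int main(void) {
    char buf[8];
    /* Deliberately overflow the buffer: the canary that the stack
     * protector placed after buf gets clobbered, so glibc aborts with
     * "*** stack smashing detected ***" when main returns. */
    memset(buf, 'A', 32);
    return 0;
}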
+OK, we have to learn this stuff.
+Examples:
+userland/gcc/profile.c: simple profiling example, where certain calls of a certain function can dominate the runtime; a sketch of the idea follows
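The general shape of such a program (a hypothetical sketch, not the exact contents of profile.c) is a hot function dominating a cheap one, which a profiler like gprof then makes obvious:

#include <stdio.h>
#include <stdlib.h>

/* Cheap function: negligible runtime. */
static unsigned long fast(unsigned long n) {
    return n + 1;
}

/* Hot function: dominates the profile. */
static unsigned long slow(unsigned long n) {
    unsigned long sum = 0;
    for (unsigned long i = 0; i < 100000; i++)
        sum += i % (n + 1);
    return sum;
}

int main(int argc, char **argv) {
    unsigned long iters = argc > 1 ? strtoul(argv[1], NULL, 10) : 1000;
    unsigned long acc = 0;
    for (unsigned long i = 0; i < iters; i++) {
        acc += fast(i);
        acc += slow(i);
    }
    printf("%lu\n", acc);
    return 0;
}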
+Build and install the interpreter on the target:
+Examples:
+rootfs_overlay/lkmc/python/hello.py: hello world
+time
rootfs_overlay/lkmc/python/count.py: count once every second
+Buildroot has a Python package that can be added to the guest image:
LKMC 50ac89b779363774325c81157ec8b9a6bdb50a2f gem5 390a74f59934b85d91489f8a563450d8321b602da:
+At LKMC 50ac89b779363774325c81157ec8b9a6bdb50a2f gem5 390a74f59934b85d91489f8a563450d8321b602da:
Examples:
+Here we will add some better examples and explanations for: https://docs.python.org/3/extending/embedding.html#very-high-level-embedding
+"Embedding Python" basically means calling the Python interpreter from C, and possibly passing values between the two.
+These examples show how to embed the Python interpreter into a C/C++ application and how to interface between the two
rootfs_overlay/lkmc/python/hello.py: hello world
+userland/libs/python_embed/eval.c: this example simply evals a Python string from C, and doesn’t communicate any values between the two.
It could be used to call external commands that have external side effects, but it is not very exciting.
+userland/libs/python_embed/pure.c: this example actually defines some Python classes and functions from C, implementing them entirely in C; see the sketch after this list.
+The C program that defines those classes then instantiates the interpreter and calls some regular Python code from it: userland/libs/python_embed/pure.py
+The regular Python code can then use the native C classes as if they were defined in Python.
+Finally, the Python code returns values back to the C code that called the interpreter.
+userland/libs/python_embed/pure_cpp.cpp: C++ version of the above; the main goal of this example is to show how to interface with C++ classes.
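A minimal hedged sketch of the core mechanism (hypothetical, not the exact contents of pure.c): define a native function in C, register it as a module before interpreter startup, and call it from embedded Python code:

#include <Python.h>

/* A native function implemented in C, callable from Python. */
static PyObject *native_add(PyObject *self, PyObject *args) {
    long a, b;
    if (!PyArg_ParseTuple(args, "ll", &a, &b))
        return NULL;
    return PyLong_FromLong(a + b);
}

static PyMethodDef methods[] = {
    {"add", native_add, METH_VARARGS, "Add two integers in C."},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT, "native", NULL, -1, methods,
    NULL, NULL, NULL, NULL
};

static PyObject *make_module(void) {
    return PyModule_Create(&module);
}

int main(void) {
    /* The module must be registered before Py_Initialize. */
    PyImport_AppendInittab("native", make_module);
    Py_Initialize();
    /* Regular Python code using the native C function. */
    PyRun_SimpleString(
        "import native\n"
        "print('from C:', native.add(1, 2))\n");
    return Py_FinalizeEx() ? 1 : 0;
}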
One notable user of Python embedding is the gem5 simulator, see also: gem5 vs QEMU. gem5 embeds the Python interpreter in order to interpret scripts as seen from the CLI:
+build/ARM/gem5.opt configs/example/fs.py+
gem5 then runs that Python script, which instantiates C classes defined from Python, and then finally hands back control to the C runtime to run the actual simulation faster.
+See for example BLAS.
Binary format to store data. TODO vs databases, notably SQLite: https://datascience.stackexchange.com/questions/262/hierarchical-data-format-what-are-the-advantages-compared-to-alternative-format
+Examples: userland/libs/hdf5
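For a flavor of the C API those examples use, here is a minimal hedged sketch (assuming libhdf5 is installed) that writes a small 1D integer dataset:

#include <hdf5.h>

int main(void) {
    /* Write the array {1, 2, 3, 4} as dataset "numbers" in out.h5. */
    int data[4] = {1, 2, 3, 4};
    hsize_t dims[1] = {4};
    hid_t file = H5Fcreate("out.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate2(file, "numbers", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}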
+The following basenames should always refer to programs that do the same thing, but in different languages:
+count: count to infinity, sleep one second between each number; a C sketch follows this list
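For instance, a C version of count might look like this minimal sketch:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Count to infinity, sleeping one second between each number. */
    for (unsigned long i = 0; ; i++) {
        printf("%lu\n", i);
        fflush(stdout);
        sleep(1);
    }
    return 0;
}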
| Comment | LKMC | Benchmark build | Emulator command |
@@ -32689,60 +33391,87 @@ instructions 124346081
|||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
QEMU busy loop |
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 |
- |
|
+`./run --arch aarch64 --userland userland/gcc/busy_loop.c ` |
10^10 |
68 |
1.1 * 10^11 (approx) |
2000 |
||||
gem5 busy loop |
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 |
- |
|
-10^7 |
-100 |
-1.10018162 * 10^8 |
-1 |
+
|
+10^6 |
+18 |
+2.4005699 * 10^7 |
+1.3 |
gem5 busy loop for a debug build |
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 |
- |
|
+
|
10^5 |
-32 |
-2.528728 * 10^6 |
-0.08 |
+33 |
+2.405682 * 10^6 |
+0.07 |
|
gem5 busy loop for a fast build |
+0d5a41a3f88fcd7ed40fc19474fe5aed0463663f + 1 |
+userland/gcc/busy_loop.c |
+
|
+10^6 |
+15 |
+2.4005699 * 10^7 |
+1.6 |
+|||||
gem5 busy loop for a TimingSimpleCPU |
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 |
- |
|
+
|
+10^6 |
+26 |
+2.4005699 * 10^7 |
+0.9 |
+||||
gem5 busy loop for a MinorCPU |
+a18f28e263c91362519ef550150b5c9d75fa3679 + 1 |
++ |
|
10^6 |
31 |
1.1018152 * 10^7 |
0.4 |
|||||
gem5 busy loop for a DerivO3CPU |
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 |
- |
|
+
|
10^6 |
52 |
1.1018128 * 10^7 |
0.2 |
||||
a18f28e263c91362519ef550150b5c9d75fa3679 + 1 |
- |
|
+
|
1 * 1000000 = 10^6 |
63 |
1.1005150 * 10^7 |
0.2 |
|||||
605448f07e6380634b1aa7e9732d111759f69fd |
Dhrystone |
|
@@ -32752,17 +33481,19 @@ instructions 124346081
1.6 |
|||||||||
5d233f2664a78789f9907d27e2a40e86cefad595 |
STREAM benchmark |
-
|
+
|
3 * 10^5 * 2 |
64 |
9.9674773 * 10^7 |
1.6 |
|||||
glibc C pre-main effects |
ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 |
-glibc C pre-main userland/c/m5ops.c |
+
|
1 |
2 |
@@ -32770,6 +33501,7 @@ instructions 124346081
0.05 |
||||||
ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 |
glibc C pre-main userland/c/m5ops.c |
|
@@ -32779,6 +33511,7 @@ instructions 124346081
0.05 |
|||||||||
ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 |
glibc C++ pre-main userland/cpp/m5ops.cpp |
|
@@ -32788,6 +33521,7 @@ instructions 124346081
1 |
|||||||||
ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 |
glibc C++ pre-main userland/cpp/m5ops.cpp |
|
@@ -32797,22 +33531,44 @@ instructions 124346081
0.1 |
|||||||||
gem5 optimized build immediate exit on first instruction to benchmark the simulator startup time |
ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 |
-immediate exit userland/arch/aarch64/freestanding/linux/gem5_exit.S |
+immediate exit userland/freestanding/gem5_exit.S |
|
1 |
1 |
-+ | 1 |
||||
same as above but debug build |
ab6f7331406b22f8ab6e2df5f8b8e464fb35b611 |
-immediate exit userland/arch/aarch64/freestanding/linux/gem5_exit.S |
+
|
1 |
1 |
+1 |
- | + | ||||
Check the effect of an ExecAll log (log every instruction) on execution time, compare to analogous run without it. |
+d29a07ddad499f273cc90dd66e40f8474b5dfc40 |
++ |
|
+10^6 |
+2.4106774 * 10^7 |
+136 |
+0.2 |
+|||||
Same as above but with run command manually hacked to output to a ramfs. Slightly faster, but the bulk was still just in log format operations! |
+d29a07ddad499f273cc90dd66e40f8474b5dfc40 |
++ |
|
+10^6 |
+2.4106774 * 10^7 |
+107 |
+0.2 |
We will update this whenever the gem5 submodule is updated.
All benchmarks done on P51.
+Sample results at gem5 2a9573f5942b5416fb0570cf5cb6cdecba733392: 10 to 12 minutes.
Ubuntu 19.10, GCC 9.2.1, LKMC 7c6bb29bc89ec3f1056c0680c3f08bd64018a7bc, gem5 d7d9bc240615625141cd6feddbadd392457e49eb (18-02-2020), ./build --arch aarch64 --gem5-worktree master --no-cache: 19:33 TODO must investigate why it got so much worse.
Lenovo ThinkPad P51 laptop:
+Lenovo ThinkPad P51 laptop with the latest stable Ubuntu.
2500 USD in 2018 (high end)
-Intel Core i7-7820HQ Processor (8MB Cache, up to 3.90GHz) (4 cores 8 threads)
-32GB(16+16) DDR4 2400MHz SODIMM
-512GB SSD PCIe TLC OPAL2
-NVIDIA Quadro M1200 Mobile, latest Ubuntu supported proprietary driver
-Latest Ubuntu
-Full specs and benchmark scores will be maintained at the latest version of: https://github.com/cirosantilli/notes/blob/0c038b0e430d0017f12d028c6a0e7c0b99ec957f/my-hardware.adoc#thinkpad-p51
ccache can be disabled with the --no-ccache option as in:
./build-gem5 --no-ccache+
This can be useful to benchmark builds.