If you don’t know which one to go for, start with QEMU Buildroot setup getting started.
Design goals of this project are documented at: Section 33.18.1, “Design goals”.
This setup has been mostly tested on Ubuntu. For other host operating systems see: Section 33.1, “Supported hosts”. For greater stability, consider using the latest release instead of master: https://github.com/cirosantilli/linux-kernel-module-cheat/releases
Reserve 12 GiB of disk and run:
You don’t need to clone recursively even though we have .git submodules: download-dependencies fetches just the submodules that you need for this build to save time.
If something goes wrong, see: Section 33.2, “Common build issues” and use our issue tracker: https://github.com/cirosantilli/linux-kernel-module-cheat/issues
The initial build will take a while (30 minutes to 2 hours) to clone and build, see Benchmark builds for more details.
To avoid typing --arch aarch64 many times, you can set the default arch as explained at: Section 33.4, “Default command line arguments”
I now urge you to read the following sections which contain widely applicable information:
If you really want to develop semiconductors, your only choice is to join a university or a semiconductor company that has the EDA licenses.
See also: Section 33.19.2, “Should you waste your life with systems programming?”.
While hacking QEMU, you will likely want to GDB step its source. That is trivial since QEMU is just another userland program like any other, but our setup has a shortcut to make it even more convenient, see: Section 18.7, “Debug the emulator”.
One of the major features of this repository is that we try to support the --dry-run option really well for all scripts.
This option, as the name suggests, outputs the external commands that would be run (or more precisely: equivalent commands), without actually running them.
This allows you to just clone this repository and get full working commands to integrate into your project, without having to build or use this setup further!
For example, we can obtain a QEMU run for the file userland/c/hello.c in User mode simulation by adding --dry-run to the normal command:
./run --dry-run --userland userland/c/hello.c
which as of LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1 outputs:
+ /path/to/linux-kernel-module-cheat/out/qemu/default/opt/x86_64-linux-user/qemu-x86_64 \
  -L /path/to/linux-kernel-module-cheat/out/buildroot/build/default/x86_64/target \
  -r 5.2.1 \
  -seed 0 \
  -trace enable=load_file,file=/path/to/linux-kernel-module-cheat/out/run/qemu/x86_64/0/trace.bin \
  -cpu max \
  /path/to/linux-kernel-module-cheat/out/userland/default/x86_64/c/hello.out \
;
Observe that the output contains:
+: a sign to differentiate the command from program stdout, much like bash -x output. It is not however a valid part of the generated Bash command.
the actual command, nicely indented with one argument per line and continuation backslashes, so you can copy-paste it straight into a terminal
;: both a valid part of the Bash command, and a visual marker for the end of the command
For the specific case of running emulators such as QEMU, the last command is also automatically placed in a file for your convenience and later inspection:
+cat "$(./getvar run_dir)/run.sh"+
Furthermore, --dry-run also automatically specifies, in valid Bash shell syntax:
environment variables used to run the command, with the syntax: ENV_VAR_1=abc ENV_VAR_2=def ./some/command
changes in working directory, with the syntax: cd /some/new/path && ./some/command
This setup is like the QEMU Buildroot setup, but it uses gem5 instead of QEMU as a system simulator.
For the most part, just add the --emulator gem5 option or the *-gem5 suffix to all commands, and everything should magically work.
This repository has been tested inside clean Docker containers.
This is a good option if you are on a Linux host, but the native setup failed due to your weird host distribution, and you have better things to do with your life than to debug it. See also: Section 33.1, “Supported hosts”.
For example, to do a QEMU Buildroot setup inside Docker, run:
This setup uses prebuilt binaries that we upload to GitHub from time to time.
can’t GDB step debug the kernel, since the source and cross toolchain with GDB are not available. Buildroot cannot easily use a host toolchain: Section 29.2.3.1.1, “Buildroot use prebuilt host toolchain”.
Maybe we could work around this by just downloading the kernel source somehow, and using a host prebuilt GDB, but we felt that it would be too messy and unreliable.
Check out the latest tag and use the Ubuntu packaged QEMU to boot Linux:
THIS IS DANGEROUS (AND FUN), YOU HAVE BEEN WARNED
Minimal host build system example:
In order to test the kernel and emulators, userland content in the form of executables and scripts is of course required, and we store it mostly under:
There are several ways to run our Userland content, notably:
natively on the host as shown at: Section 1.7.2.1, “Userland setup getting started natively”
Can only run examples compatible with your host CPU architecture and OS, but has the fastest setup and runtimes.
the host prebuilt toolchain: Section 1.7.2.2, “Userland setup getting started with prebuilt toolchain and QEMU user mode”
the Buildroot toolchain you built yourself: Section 10.1, “QEMU user mode getting started”
With this setup, we will use the host toolchain and execute the executables directly on the host.
If you are too lazy to build the Buildroot toolchain and QEMU, but want to run e.g. ARM Userland assembly in User mode simulation, you can get away on Ubuntu 18.04 with just:
This presents the usual trade-offs of using prebuilts as mentioned at: Section 1.5, “Prebuilt setup”.
Other functionality is analogous, e.g. testing:
First ensure that the QEMU Buildroot setup is working.
This setup does not use the Linux kernel nor Buildroot at all: it just runs your very own minimal OS.
Every .c file inside baremetal/ and .S file inside baremetal/arch/<arch>/ generates a separate baremetal image.
You don’t need to depend on GitHub.
More information about our documentation internals can be found at: Section 33.5, “Documentation”
./run --userland path resolution is analogous to that of ./run --baremetal.
./build user-mode-qemu first builds Buildroot, and then runs ./build-userland, which is further documented at: Section 1.7, “Userland setup”. It also builds QEMU. If you have already done a QEMU Buildroot setup previously, this will be very fast.
If you modify the userland programs, rebuild simply with:
The gem5 tests require building statically with build id static, see also: Section 10.6, “gem5 syscall emulation mode”. TODO automate this better.
See: Section 33.13, “Test this repo” for more useful testing tips.
Running statically linked executables sometimes makes things break:
TODO understand why:
./run --static --userland userland/c/file_write_read.c
fails our assertion that the data was read back correctly:
Assertion `strcmp(data, output) == 0' failed
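For context, here is a minimal sketch in C of what such a write-then-read-back test does; this is an illustration with assumed file name and data, not the actual contents of userland/c/file_write_read.c:

#include <assert.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Hypothetical test data and temporary file name. */
    char data[] = "asdf";
    char output[sizeof(data)] = {0};
    FILE *f;

    /* Write the data out. */
    f = fopen("file_write_read.tmp", "w");
    fwrite(data, 1, sizeof(data), f);
    fclose(f);

    /* Read it back. */
    f = fopen("file_write_read.tmp", "r");
    fread(output, 1, sizeof(output), f);
    fclose(f);

    /* This is the assertion that fails under --static. */
    assert(strcmp(data, output) == 0);
    return 0;
}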
modules built with Buildroot, see: Section 33.12.2.1, “kernel_modules buildroot package”
modules built from the kernel tree itself, see: Section 15.12.2, “dummy-irq”
We disable networking by default because it starts a userland process, and we want to keep the number of userland processes to a minimum to make the system more understandable as explained at: Section 33.18.3, “Resource tradeoff guidelines”
To enable networking on Buildroot, simply run:
This also makes this repo the perfect setup to develop the Linux kernel.
In case something breaks while updating the Linux kernel, you can try to bisect it to understand the root cause, see: Section 33.14, “Bisection”.
First, use the branching procedure described at: Section 33.16, “Update a forked submodule”
Because the kernel is so central to this repository, almost all tests must be re-run, so basically just follow the full testing procedure described at: Section 33.13, “Test this repo”. The only tests that can be skipped are essentially the Baremetal tests.
Before committing, don’t forget to update:
The --verbose option is optional, but it clearly shows each GCC build command so that you can confirm what --*-build-type is doing.
The build outputs are automatically stored in different directories for optimized and debug builds, which prevents debug files from overwriting opt ones. Therefore, --gem5-build-id is not required.
The price to pay for debuggability is high however: a Linux kernel boot was about 3x slower in QEMU and 14x slower in gem5 debug compared to opt, see benchmarks at: Section 29.2.1, “Benchmark Linux kernel boot”
While step debugging any complex program, you always end up feeling the need to step in reverse to reach the last call to some function that was called before the failure point, in order to trace back the problem to the actual bug source.
While GDB "has" this feature, it is just too broken to be usable, and so we expose the amazing Mozilla RR tool conveniently in this repo: https://stackoverflow.com/questions/1470434/how-does-reverse-debugging-work/53063242#53063242
and you can now reliably use reverse debugging commands such as reverse-continue, reverse-finish and reverse-next!
To restart debugging again after quitting rr, simply run on your host terminal:
PANDA can list memory addresses, so I bet it can also decode the instructions: https://github.com/panda-re/panda/blob/883c85fa35f35e84a323ed3d464ff40030f06bd6/panda/docs/LINE_Censorship.md I wonder why they don’t just upstream those things to QEMU’s tracing: https://github.com/panda-re/panda/issues/290
gem5 can do it as shown at: Section 18.8.8, “gem5 tracing”.
gem5 also provides a tracing mechanism documented at: http://www.gem5.org/Trace_Based_Debugging:
TODO: at 7452d399290c9c1fc6366cdad129ef442f323564, ./trace2line is too slow and takes hours. QEMU’s processing of 170k events takes 7 seconds. gem5’s processing is analogous, but there are 140M events, so it should take ~7000 seconds ~ 2 hours, which seems consistent with what I observe, so maybe there is no way to speed this up… The workaround is to just use gem5’s ExecSymbol to get function granularity, and then GDB individually if line detail is needed?
gem5 traces are generated from DPRINTF(<trace-id>, …) calls scattered throughout the code, except for ExecAll instruction traces, which use Debug::ExecEnable directly.
This debug flag traces all instructions.
This flag shows a more detailed register usage than gem5 ExecAll trace format.
As of gem5 16eeee5356585441a49d05c78abc328ef09f7ace the default tracer is ExeTracer. It is set at:
Getting started at: Section 1.3, “gem5 Buildroot setup”.
Open source but not in Buildroot:
https://github.com/kozyraki/stamp: transactional memory benchmarks
These are not yet enabled, but it should be easy to do so, see: Section 20.5, “Add new Buildroot packages”
Build and run on gem5 user mode:
The clean is necessary because the source files didn’t change, so make would just check the timestamps and not build anything.
You will then likely want to make those more permanent as explained at: Section 33.4, “Default command line arguments”.
If none of those methods are flexible enough for you, you can just fork or hack up the sample package buildroot_packages/sample_package to do what you want.
For how to use that package, see: Section 33.12.2, “buildroot_packages directory”.
Then iterate trying to do what you want and reading the manual until it works: https://buildroot.org/downloads/manual/manual.html
Then, you will also want to do a Bisection to pinpoint the exact commit to blame, and CC that developer.
Finally, give the images you used to save upstream developers' time as shown at: Section 33.17.2, “release-zip”.
For Buildroot problems, you should either provide the config you have:
This section documents our test and educational userland content, such as C, C++ and POSIX examples, present mostly under userland/.
Getting started at: Section 1.7, “Userland setup”
Userland assembly content is located at: Section 22, “Userland assembly”. It was split from this section basically because we were hitting the HTML h6 limit, stupid web :-)
This content makes up the bulk of the userland/ directory.
The quickest way to run the arch agnostic examples, which comprise the majority of the examples, is natively as shown at: Section 1.7.2.1, “Userland setup getting started natively”
This section was originally moved in here from: https://github.com/cirosantilli/cpp-cheat
Like other userland programs, these programs can be run as explained at: Section 1.7, “Userland setup”.
As a quick reminder, the fastest setups to get started are:
Userland assembly is generally simpler, and a pre-requisite for the Baremetal setup.
System-land assembly cheats will be put under: Section 1.8, “Baremetal setup”.
Getting started at: Section 1.8, “Baremetal setup”
In baremetal, we detect if tests failed by parsing logs for the Magic failure string.
See: Section 33.13, “Test this repo” for more useful testing tips.
Let’s see how fast our simulators are running some well known or easy to understand userland benchmarks!
TODO: would be amazing to have an automated guest instructions per second count, but I’m not sure how to do that nicely for QEMU: QEMU get guest instruction count.
TODO: automate this further, produce the results table automatically, possibly by generalizing test-executables.
For now we can just run on gem5 to estimate the instruction count per input size and extrapolate?
For example, the simplest scalable CPU content would be a busy loop: userland/gcc/busy_loop.c, so let’s focus on that for now.
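For reference, the inner function can be reconstructed from the source lines that show up in the disassembly further down; the main() wrapper and argument parsing here are assumptions for illustration:

#include <stdlib.h>

void busy_loop(unsigned long long max, unsigned long long max2) {
    for (unsigned i = 0; i < max; i++) {
        for (unsigned j = 0; j < max2; j++) {
            /* Empty inline assembly: forces the compiler to keep the
             * loop, without adding any instructions to it. */
            __asm__ __volatile__ ("" : "+g" (j), "+g" (j) : :);
        }
    }
}

int main(int argc, char **argv) {
    /* Assumed CLI matching --userland-args '1000 10000':
     * total loops == argv[1] * argv[2]. */
    (void)argc;
    busy_loop(strtoull(argv[1], NULL, 10), strtoull(argv[2], NULL, 10));
    return 0;
}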
Summary of manually collected results on P51 at LKMC a18f28e263c91362519ef550150b5c9d75fa3679 + 1: Table 7, “Busy loop DMIPS for different simulator setups”. As expected, the less native / more detailed / more complex simulations are slower!
| Simulator | Loops | Time (s) | Instruction count | Approximate MIPS |
|---|---|---|---|---|
| QEMU user mode | 10^10 | 68 | 1.1 * 10^11 (approx) | 2000 |
| gem5 user mode, default CPU | 10^7 | 100 | 1.10018162 * 10^8 | 1 |
| gem5 user mode, another CPU model | 10^6 | 31 | 1.1018152 * 10^7 | 0.4 |
| gem5 user mode, another CPU model | 10^6 | 52 | 1.1018128 * 10^7 | 0.2 |
The first step is to determine a number of loops that will run long enough to have meaningful results, but not so long that we get bored.
On our P51 machine, we found 10^7 (10 million == 1000 times 10000) loops to be a good number:
./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --userland-args '1000 10000' --static
./get-stat sim_insts
as it gives:
time: 00:01:40
instructions: 110018162 ~ 110 million
so ~110 million instructions / 100 seconds makes ~1 MIPS (million instructions per second).
This experiment also suggests that each loop iteration is about 11 instructions long (110M instructions / 10M loops), so we look at the disassembly:
./run-toolchain --arch aarch64 gdb -- -batch -ex 'disas busy_loop' "$(./getvar --arch aarch64 userland_build_dir)/gcc/busy_loop.out"
which contains:
8   ) {
   0x0000000000400698 <+0>:   ff 83 00 d1   sub   sp, sp, #0x20
   0x000000000040069c <+4>:   e0 07 00 f9   str   x0, [sp, #8]
   0x00000000004006a0 <+8>:   e1 03 00 f9   str   x1, [sp]

9       for (unsigned i = 0; i < max; i++) {
   0x00000000004006a4 <+12>:  ff 1f 00 b9   str   wzr, [sp, #28]
   0x00000000004006a8 <+16>:  11 00 00 14   b     0x4006ec <busy_loop+84>

10          for (unsigned j = 0; j < max2; j++) {
   0x00000000004006ac <+20>:  ff 1b 00 b9   str   wzr, [sp, #24]
   0x00000000004006b0 <+24>:  08 00 00 14   b     0x4006d0 <busy_loop+56>

11              __asm__ __volatile__ ("" : "+g" (j), "+g" (j) : :);
   0x00000000004006b4 <+28>:  e1 1b 40 b9   ldr   w1, [sp, #24]
   0x00000000004006b8 <+32>:  e0 1b 40 b9   ldr   w0, [sp, #24]
   0x00000000004006bc <+36>:  e1 1b 00 b9   str   w1, [sp, #24]
   0x00000000004006c0 <+40>:  e0 17 00 b9   str   w0, [sp, #20]

10          for (unsigned j = 0; j < max2; j++) {
   0x00000000004006c4 <+44>:  e0 17 40 b9   ldr   w0, [sp, #20]
   0x00000000004006c8 <+48>:  00 04 00 11   add   w0, w0, #0x1
   0x00000000004006cc <+52>:  e0 1b 00 b9   str   w0, [sp, #24]
   0x00000000004006d0 <+56>:  e0 1b 40 b9   ldr   w0, [sp, #24]
   0x00000000004006d4 <+60>:  e1 03 40 f9   ldr   x1, [sp]
   0x00000000004006d8 <+64>:  3f 00 00 eb   cmp   x1, x0
   0x00000000004006dc <+68>:  c8 fe ff 54   b.hi  0x4006b4 <busy_loop+28>  // b.pmore

9       for (unsigned i = 0; i < max; i++) {
   0x00000000004006e0 <+72>:  e0 1f 40 b9   ldr   w0, [sp, #28]
   0x00000000004006e4 <+76>:  00 04 00 11   add   w0, w0, #0x1
   0x00000000004006e8 <+80>:  e0 1f 00 b9   str   w0, [sp, #28]
   0x00000000004006ec <+84>:  e0 1f 40 b9   ldr   w0, [sp, #28]
   0x00000000004006f0 <+88>:  e1 07 40 f9   ldr   x1, [sp, #8]
   0x00000000004006f4 <+92>:  3f 00 00 eb   cmp   x1, x0
   0x00000000004006f8 <+96>:  a8 fd ff 54   b.hi  0x4006ac <busy_loop+20>  // b.pmore

12          }
13      }
14  }
   0x00000000004006fc <+100>: 1f 20 03 d5   nop
   0x0000000000400700 <+104>: ff 83 00 91   add   sp, sp, #0x20
   0x0000000000400704 <+108>: c0 03 5f d6   ret
We look for the internal backwards jumps, and we find two:
0x00000000004006dc <+68>:  c8 fe ff 54   b.hi  0x4006b4 <busy_loop+28>  // b.pmore
0x00000000004006f8 <+96>:  a8 fd ff 54   b.hi  0x4006ac <busy_loop+20>  // b.pmore
and so clearly the one at 0x4006dc happens first and jumps to a larger address than the other one, so the internal loop must be between 0x4006b4 and 0x4006dc, which contains exactly (0x4006dc - 0x4006b4) / 4 + 1 = 11 instructions! Bingo!
Oh my God, unoptimized code is so horrendously inefficient, even I can’t stand all those useless loads and stores to memory variables!!!
Then for QEMU, we experimentally turn the number of loops up to 10^10 (100000 100000), which contains an expected 11 * 10^10 instructions, and the runtime is 00:01:08, so we have 1.1 * 10^11 instructions / 68 seconds ~ 2 * 10^9 = 2000 MIPS!
We can then repeat the experiment for other gem5 CPUs to see how they compare.
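For example, assuming that, as in other parts of this repo, extra options after -- are forwarded to the gem5 config script, a detailed CPU model could be selected along these lines:

./run --arch aarch64 --emulator gem5 --userland userland/gcc/busy_loop.c --userland-args '1000 10000' --static -- --cpu-type=DerivO3CPU --caches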
Let’s see if user mode runs considerably faster than full system or not, ignoring the kernel boot.
First we build Dhrystone manually statically since dynamic linking is broken in gem5 as explained at: Section 10.6, “gem5 syscall emulation mode”.
TODO: move this section to our new custom dhrystone setup: Section 19.2.3.1, “Dhrystone”.
gem5 user mode:
./build-buildroot --arch arm --config 'BR2_PACKAGE_DHRYSTONE=y'
make \
  -B \
  -C "$(./getvar --arch arm buildroot_build_build_dir)/dhrystone-2" \
  CC="$(./run-toolchain --arch arm --print-tool gcc)" \
  CFLAGS=-static \
;
time \
  ./run \
  --arch arm \
  --emulator gem5 \
  --userland "$(./getvar --arch arm buildroot_build_build_dir)/dhrystone-2/dhrystone" \
  --userland-args 'asdf qwer' \
;
gem5 full system:
time \
  ./run \
  --arch arm \
  --eval-after './gem5.sh' \
  --emulator gem5 \
  --gem5-readfile 'dhrystone 100000' \
;
QEMU user mode:
time qemu-arm "$(./getvar --arch arm buildroot_build_build_dir)/dhrystone-2/dhrystone" 100000000
QEMU full system:
time \
  ./run \
  --arch arm \
  --eval-after 'time dhrystone 100000000;./linux/poweroff.out' \
;
Result on P51 at bad30f513c46c1b0995d3a10c0d9bc2a33dc4fa0:
gem5 user: 33 seconds
gem5 full system: 51 seconds
QEMU user: 45 seconds
QEMU full system: 223 seconds
The build times are calculated after doing ./configure and make source, which downloads the sources, and basically benchmarks the Internet.
./build-buildroot -- graph-build graph-size graph-depends
xdg-open graph-size.pdf
The biggest build time hog is always GCC, and it does not look like we can use a precompiled one: https://stackoverflow.com/questions/10833672/buildroot-environment-with-host-toolchain
This is the minimal build we could expect to get away with.
How long it takes to build gem5 itself.
This is the critical development parameter, and is dominated by the link time of huge binaries.
Algorithms to keep the caches of different cores of a system coherent.
E.g.: if one processor writes to its cache, other processors have to know about it before they read from that address.
This is the most basic non-trivial coherency protocol, and therefore the first one you should learn.
Helpful video: https://www.youtube.com/watch?v=gAUVAel-2Fg "MSI Coherence - Georgia Tech - HPCA: Part 5" by Udacity.
Let’s focus on a single cache line representing a given memory address.
The system looks like this:
      +----+
      |DRAM|
      +----+
        ^
        |
        v
   +--------+
   |  BUS   |
   +--------+
    ^      ^
    |      |
    v      v
+------+  +------+
|CACHE1|  |CACHE2|
+------+  +------+
   ^         ^
   |         |
   |         |
+----+    +----+
|CPU1|    |CPU2|
+----+    +----+
MSI stands for the states that each cache can be in for a given cache line. The states are:
Modified: a single cache has the valid data, and it has been modified from DRAM.
Both reads and writes are free, because we don’t have to worry about other processors.
Shared: the data is synchronized with DRAM, and may be present in multiple caches.
Reads are free, but writes need to do extra work.
This is the "most interesting" state of the protocol, as it allows for those free reads, even when multiple processors are using some address.
Invalid: the cache does not have the data; CPU reads and writes need to do extra work.
The above allowed states can be summarized in the following table:
             CACHE2
             M S I
CACHE1  M    n n y
        S    n y y
        I    y y y

(y: the state combination is allowed, n: it is forbidden)
The whole goal of the protocol is to maintain that state at all times, so that we can get those free reads when in shared state!
To do so, the caches have to pass messages between themselves! This means generating bus traffic, which has a cost and must be kept to a minimum.
The system components can receive and send the following messages:
CPUn can send to CACHEn:
"Local read": CPU reads from cache
"Local write": CPU writes to cache
CACHEn to itself:
"Evict": the cache is running out of space due to another request
CACHEn can send the following messages to the bus:
"Bus read": the cache needs to get the data. The reply will contain the full data line. It can come either from another cache that has the data, or from DRAM if none do.
"Bus write": the cache wants to modify some data, and it does not have the line.
The reply must contain the full data line, because maybe the processor just wants to change one byte, but the line is much larger.
That’s why this request can also be called "Read Exclusive", as it is basically a "Bus Read" + "Invalidate" in one.
"Invalidate": the cache wants to modify some data, but it knows that all other caches are up to date, because it is in shared state.
Therefore, it does not need to fetch the data, which saves bus traffic compared to "Bus write", since the data itself does not need to be sent.
"Write back": send the data on the bus and tell someone to pick it up: either DRAM or another cache
When a message is sent to the bus:
all other caches and the DRAM will see it; this is called "snooping"
either caches or DRAM can reply if a reply is needed, but other caches get priority to reply earlier if they can, e.g. to serve a cache request from other caches rather than going all the way to DRAM
When a cache receives a message, it does one or both of:
change to another MSI state
send a message to the bus
And finally, the transitions are (a toy C sketch of the Modified-state transitions follows the list):
Modified:
"Local read": don’t need to do anything because only the current cache holds the data
"Local write": don’t need to do anything because only the current cache holds the data
"Evict": have to save data to DRAM so that our local modifications won’t be lost
Move to: Invalid
Send message: "Write back"
"Bus read": another cache is trying to read the address which we owned exclusively.
Since we know what the latest data is, we can move to "Shared" rather than "Invalid" to possibly save time on future reads.
But to do that, we need to write the data back to DRAM to keep the shared state consistent. The MESI protocol prevents that extra write back in some cases.
And it has to happen either before the other cache gets its data from DRAM, or better, the other cache can get its data from our write back itself, just like the DRAM.
Move to: Shared
Send message: "Write back"
"Bus write": someone else will write to our address.
We don’t know what they will write, so the best bet is to move to invalid.
Since the writer will become the new sole data owner, the writer can get the cache line from us without going to DRAM at all! This is fine, because the writer will be the new sole owner of the line, so DRAM can remain dirty without problems.
Move to: Invalid
Send message: "Write back"
Shared: TODO
"Local read":
"Local write":
"Evict":
"Bus read":
"Bus write":
Invalid: TODO
"Local read":
"Local write":
"Evict":
"Bus read":
"Bus write":
TODO gem5 concrete example.
Splits the Shared state of the MSI protocol into a new Exclusive state:
MESI Exclusive: clean, but only present in one cache
MESI Shared: clean, but may be present in more than one cache
TODO advantage: I think the advantages over MSI are:
when we move from Exclusive to Shared, no DRAM write back is needed, because we know that the cache is clean
when we move from Exclusive to Modified, no invalidate message is required, reducing bus traffic
Exclusive is entered from Invalid after a "Local read", but only if the reply came from DRAM! If the reply came from another cache, we go directly to Shared instead.
TODO compare to MSI and understand advantages. From Wikipedia it seems that MOSI can get data from the Owned cache while MSI cannot get data from Shared caches and must go to memory, but why not? Why do we need that Owned? Is it because there are multiple Shared caches, and them all replying at the same time would lead to problems?
MESI protocol + MOSI protocol, not much else to it!
The host requirements depend a lot on which examples you want to run.
If ./build --download-dependencies fails with:
It does not work if you just download the .zip with the sources for this repository from GitHub, because we use Git submodules: you must clone this repo.
If you just want to run a command after boot ends without thinking much about it, just use the --eval-after option, e.g.:
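For example, something along these lines, mirroring the Dhrystone full system run shown earlier:

./run --eval-after 'echo hello;./linux/poweroff.out'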
It gets annoying to retype --arch aarch64 for every single command, or to remember --config setups.
To learn how to build the documentation see: Section 1.9, “Build the documentation”.
When running build-doc, we do the following checks:
The script prints what you have to fix and exits with an error status if there are any errors.
Documentation for asciidoctor/extract-link-targets
Documentation for asciidoctor/extract-header-ids
The Asciidoctor extension scripts:
As mentioned before the TOC, we have to push this README to GitHub pages due to: https://github.com/isaacs/github/issues/1610
You did something crazy, and nothing seems to work anymore?
ccache might save you a lot of rebuild time when you decide to Clean the build or create a new build variant.
It is not possible to rebuild the root filesystem while running QEMU, because QEMU holds the qcow2 file:
When doing long simulations sweeping across multiple system parameters, it becomes fundamental to do multiple simulations in parallel.
To run multiple gem5 checkouts, see: Section 33.11.3.1, “gem5 worktree”.
Implementation note: we create multiple namespaces for two things:
It often happens that you are comparing two versions of the build, a good and a bad one, and trying to figure out why the bad one is bad.
Our build variants system allows you to keep multiple built versions of all major components, so that you can easily switch between running one or the other.
If you want to keep two builds around, one for the latest Linux version, and the other for Linux v4.16:
To run both kernels simultaneously, one on each QEMU instance, see: Section 33.10, “Simultaneous runs”.
Analogous to the Linux kernel build variants but with the --qemu-build-id option instead:
Analogous to the Linux kernel build variants but with the --gem5-build-id option instead:
Therefore, you must not forget to check out the sources to those of the corresponding build before running, unless you explicitly tell gem5 to use a non-default source tree with gem5 worktree. This becomes inevitable when you want to launch multiple simultaneous runs at different checkouts.
--gem5-build-id goes a long way, but if you want to seamlessly switch between two gem5 trees without checking out multiple times, then --gem5-worktree is for you.
Suppose that you are working on a private fork of gem5, but you want to use this repository to develop it as well.
Allows you to have multiple versions of the GCC toolchain or root filesystem.
lkmc/ contains sources and headers that are shared across kernel modules, userland and baremetal examples.
Another option would have been to name it as includes/lkmc, but that would make paths longer, and we might want to store source code in that directory as well in the future.
When factoring out functionality across userland examples, there are two main options:
Source: buildroot_packages/.
A custom build script can give you more flexibility: e.g. the package can be made to work with other root filesystems more easily, have better 9P support, and rebuild faster as it evades some Buildroot boilerplate.
Has the following structure:
Patches in this directory are never applied automatically: it is up to users to manually apply them before usage following the instructions in this documentation.
Source: rootfs_overlay.
This way you can just hack away the scripts and try them out immediately without any further operations.
This path can be found with:
The files:
Print out several parameters that normally change randomly from boot to boot:
lkmc_home refers to the target base directory in which we put all our custom built stuff, such as userland executables and kernel modules.
Run almost all tests:
test does not run all possible tests, because there are too many possible variations and that would take forever. The rationale is the same as for ./build all and is explained in ./build --help.
You can select multiple archs and emulators of interest, as for any other command, with:
By default, the tests continue running even after the first failure happens, and a summary is shown at the end.
TODO: we really need a mechanism to generate the test list automatically, e.g. based on path_properties; currently there are many tests missing, and we have to add everything manually, which is very annoying.
We have some pexpect automated tests for GDB for both userland and baremetal programs!
We do not know of any way to set the emulator exit status in QEMU arm full system.
For the Linux kernel, do the following manual tests for now.
You should also test that the Internet works:
build-userland and test-executables have a wide variety of target selection modes, and it was hard to keep them all working without some tests:
When updating the Linux kernel, QEMU and gem5, things sometimes break.
In order to build and run each userland and baremetal example properly, we need per-file metadata such as compiler flags and required number of cores.
This is a template update procedure for submodules for which we have some patches on top of mainline.
Ensure that the Automated tests are passing on a clean build:
The ./build-test command builds a superset of what will be downloaded, which also tests other things we would like to be working on the release. For the minimal build to generate the files to be uploaded, see: Section 33.17.2, “release-zip”
The clean build is necessary as it generates clean images, since it is not possible to remove Buildroot packages.
Create a zip containing all files required for the Prebuilt setup:
After:
This project was created to help me understand, modify and test low level system components by using system simulators.
The trade-offs between the different setups are basically a balance between:
compatibility: how likely is it that all the components will work well together: emulator, compiler, kernel, standard library, …
guest software availability: how wide is your choice of easily installed guest software packages? See also: Section 33.18.4, “Linux distro choice”
Choosing which features go into our default builds means making tradeoffs, here are our guidelines:
We haven’t found the ultimate distro yet, here is a summary table of trade-offs that we care about: Table 8, “Comparison of Linux distros for usage in this repository”.