patch -d "$(./getvar gem5_source_dir)" -p 1 < patches/manual/gem5-commmonitor-se.patch+
diff --git a/index.html b/index.html
index f4cec6e..5850538 100644
--- a/index.html
+++ b/index.html
@@ -1254,493 +1254,494 @@ pre{ white-space:pre }
CommMonitorThreadContext vs ThreadState vs ExecContext vs Process
+ThreadContext vs ThreadState vs ExecContext vs Process
If you don’t know which one to go for, start with QEMU Buildroot setup getting started.
Design goals of this project are documented at: Section 33.20.1, “Design goals”.
+Design goals of this project are documented at: Section 34.20.1, “Design goals”.
This setup has been mostly tested on Ubuntu. For other host operating systems see: Section 33.1, “Supported hosts”. For greater stability, consider using the latest release instead of master: https://github.com/cirosantilli/linux-kernel-module-cheat/releases
+This setup has been mostly tested on Ubuntu. For other host operating systems see: Section 34.1, “Supported hosts”. For greater stability, consider using the latest release instead of master: https://github.com/cirosantilli/linux-kernel-module-cheat/releases
Reserve 12 GB of disk space and run:
@@ -2405,7 +2406,7 @@ cd linux-kernel-module-cheat
You don’t need to clone recursively even though we have .git submodules: download-dependencies fetches just the submodules that you need for this build to save time.
If something goes wrong, see: Section 33.2, “Common build issues” and use our issue tracker: https://github.com/cirosantilli/linux-kernel-module-cheat/issues
+If something goes wrong, see: Section 34.2, “Common build issues” and use our issue tracker: https://github.com/cirosantilli/linux-kernel-module-cheat/issues
The initial build will take a while (30 minutes to 2 hours) to clone and build, see Benchmark builds for more details.
@@ -2488,7 +2489,7 @@ hello2 cleanup
To avoid typing --arch aarch64 many times, you can set the default arch as explained at: Section 33.4, “Default command line arguments”
To avoid typing --arch aarch64 many times, you can set the default arch as explained at: Section 34.4, “Default command line arguments”
I now urge you to read the following sections which contain widely applicable information:
@@ -3329,7 +3330,7 @@ j = 0
This repository has been tested inside clean Docker containers.
This is a good option if you are on a Linux host, but the native setup failed due to your weird host distribution, and you have better things to do with your life than to debug it. See also: Section 33.1, “Supported hosts”.
+This is a good option if you are on a Linux host, but the native setup failed due to your weird host distribution, and you have better things to do with your life than to debug it. See also: Section 34.1, “Supported hosts”.
For example, to do a QEMU Buildroot setup inside Docker, run:
@@ -3517,7 +3518,7 @@ j = 0
can’t GDB step debug the kernel, since the source and cross toolchain with GDB are not available. Buildroot cannot easily use a host toolchain: Section 29.2.3.1.1, “Buildroot use prebuilt host toolchain”.
+can’t GDB step debug the kernel, since the source and cross toolchain with GDB are not available. Buildroot cannot easily use a host toolchain: Section 30.2.3.1.1, “Buildroot use prebuilt host toolchain”.
Maybe we could work around this by just downloading the kernel source somehow, and using a host prebuilt GDB, but we felt that it would be too messy and unreliable.
TODO: the carriage returns are a bit different than in QEMU, see: Section 27.6, “gem5 baremetal carriage return”.
+TODO: the carriage returns are a bit different than in QEMU, see: Section 28.6, “gem5 baremetal carriage return”.
Note that ./build-baremetal requires the --emulator gem5 option, and generates separate executable images for both, as can be seen from:
But just stick to newer and better VExpress_GEM5_V1 unless you have a good reason to use RealViewPBX.
When doing baremetal programming, it is likely that you will want to learn userland assembly first, see: Section 22, “Userland assembly”.
+When doing baremetal programming, it is likely that you will want to learn userland assembly first, see: Section 23, “Userland assembly”.
For more information on baremetal, see the section: Section 27, “Baremetal”.
+For more information on baremetal, see the section: Section 28, “Baremetal”.
The following subjects are particularly important:
@@ -4407,7 +4408,7 @@ xdg-open README.html
More information about our documentation internals can be found at: Section 33.5, “Documentation”
+More information about our documentation internals can be found at: Section 34.5, “Documentation”
For a more minimal baremetal multicore setup, see: Section 27.10.3, “ARM baremetal multicore”.
+For a more minimal baremetal multicore setup, see: Section 28.10.3, “ARM baremetal multicore”.
We can set and get which cores the Linux kernel allows a program to run on with sched_getaffinity and sched_setaffinity:
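As a rough illustration of the same two calls outside of C, and not one of this repository's examples: on Linux the Python standard library exposes them as os.sched_getaffinity and os.sched_setaffinity. The helper name pin_to_one_cpu is made up for this sketch.

```python
import os


def pin_to_one_cpu(pid=0):
    """Restrict pid (0 means the calling process) to a single allowed CPU.

    Returns (allowed_before, allowed_after). Linux only.
    pin_to_one_cpu is a hypothetical helper name, not part of any library.
    """
    allowed_before = os.sched_getaffinity(pid)
    # Pick a CPU that we are already allowed to run on.
    target = min(allowed_before)
    os.sched_setaffinity(pid, {target})
    allowed_after = os.sched_getaffinity(pid)
    return allowed_before, allowed_after


if __name__ == "__main__":
    before, after = pin_to_one_cpu()
    print("before:", sorted(before))
    print("after: ", sorted(after))
    # Restore the original mask so the rest of the program is unaffected.
    os.sched_setaffinity(0, before)
```

The mask is inherited by child processes, which is why the sketch restores it at the end.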
To stop at the very first instruction of a freestanding program, just use --no-continue. A good example of this is shown at: Section 22.5.1, “Freestanding programs”.
To stop at the very first instruction of a freestanding program, just use --no-continue. A good example of this is shown at: Section 23.5.1, “Freestanding programs”.
The gem5 tests require building statically with build id static, see also: Section 10.7, “gem5 syscall emulation mode”. TODO automate this better.
See: Section 33.16, “Test this repo” for more useful testing tips.
+See: Section 34.16, “Test this repo” for more useful testing tips.
Support for dynamic linking was added in November 2019: https://stackoverflow.com/questions/50542222/how-to-run-a-dynamically-linked-executable-syscall-emulation-mode-se-py-in-gem5/50696098#50696098
Note that as shown at Section 29.2.2, “Benchmark emulators on userland executables”, the dynamic version runs 200x more instructions, which might have an impact on smaller simulations in detailed CPUs.
+Note that as shown at Section 30.2.2, “Benchmark emulators on userland executables”, the dynamic version runs 200x more instructions, which might have an impact on smaller simulations in detailed CPUs.
modules built with Buildroot, see: Section 33.15.2.1, “kernel_modules buildroot package”
+modules built with Buildroot, see: Section 34.15.2.1, “kernel_modules buildroot package”
modules built from the kernel tree itself, see: Section 15.12.2, “dummy-irq”
@@ -9052,7 +9053,7 @@ Program aborted at tick 0
no need to regenerate the root filesystem at all and reboot
overcomes the check_bin_arch problem as shown at: Section 20.8, “Buildroot rebuild is slow when the root filesystem is large”
overcomes the check_bin_arch problem as shown at: Section 21.8, “Buildroot rebuild is slow when the root filesystem is large”
We disable networking by default because it starts a userland process, and we want to keep the number of userland processes to a minimum to make the system more understandable as explained at: Section 33.20.3, “Resource tradeoff guidelines”
+We disable networking by default because it starts a userland process, and we want to keep the number of userland processes to a minimum to make the system more understandable as explained at: Section 34.20.3, “Resource tradeoff guidelines”
To enable networking on Buildroot, simply run:
@@ -10684,15 +10685,15 @@ git log | grep -E ' Linux [0-9]+\.' | head
This also makes this repo the perfect setup to develop the Linux kernel.
In case something breaks while updating the Linux kernel, you can try to bisect it to understand the root cause, see: Section 33.17, “Bisection”.
+In case something breaks while updating the Linux kernel, you can try to bisect it to understand the root cause, see: Section 34.17, “Bisection”.
First, use the branching procedure described at: Section 33.18, “Update a forked submodule”
+First, use the branching procedure described at: Section 34.18, “Update a forked submodule”
Because the kernel is so central to this repository, almost all tests must be re-run, so basically just follow the full testing procedure described at: Section 33.16, “Test this repo”. The only tests that can be skipped are essentially the Baremetal tests.
+Because the kernel is so central to this repository, almost all tests must be re-run, so basically just follow the full testing procedure described at: Section 34.16, “Test this repo”. The only tests that can be skipped are essentially the Baremetal tests.
Before committing, don’t forget to update:
@@ -15240,7 +15241,7 @@ detected buffer overflow in strlen
SELinux requires glibc as mentioned at: Section 20.10, “libc choice”.
+SELinux requires glibc as mentioned at: Section 21.10, “libc choice”.
STRESS_NG is likely the best, but it requires glibc, see: Section 20.10, “libc choice”.
STRESS_NG is likely the best, but it requires glibc, see: Section 21.10, “libc choice”.
Websites:
@@ -17809,10 +17810,10 @@ run
The build outputs are automatically stored in different directories for optimized and debug builds, which prevents debug files from overwriting opt ones. Therefore, --gem5-build-id is not required.
The price to pay for debuggability is high however: a Linux kernel boot was about 3x slower in QEMU and 14 times slower in gem5 debug compared to opt, see benchmarks at: Section 29.2.1, “Benchmark Linux kernel boot”.
+The price to pay for debuggability is high however: a Linux kernel boot was about 3x slower in QEMU and 14 times slower in gem5 debug compared to opt, see benchmarks at: Section 30.2.1, “Benchmark Linux kernel boot”.
Similar slowdowns can be observed at: Section 29.2.2, “Benchmark emulators on userland executables”.
+Similar slowdowns can be observed at: Section 30.2.2, “Benchmark emulators on userland executables”.
When in QEMU text mode, using --debug-vm makes Ctrl-C not get passed to the QEMU guest anymore: it is instead captured by GDB itself, to allow breaking. So e.g. you won’t be able to easily quit from a guest program like:
25007500: time count in some unit. Note how the microops execute at further timestamps.
system.cpu: distinguishes between CPUs when there are more than one. For example, running Section 27.10.3, “ARM baremetal multicore” with two cores produces system.cpu0 and system.cpu1
system.cpu: distinguishes between CPUs when there are more than one. For example, running Section 28.10.3, “ARM baremetal multicore” with two cores produces system.cpu0 and system.cpu1
T0: thread number. TODO: hyperthread? How to play with it?
runs are deterministic by default, unlike QEMU which has a special QEMU record and replay mode, that requires first playing the content once and then replaying
gem5 ARM at least appears to implement more low level CPU functionality than QEMU, e.g. QEMU only added EL2 in 2018: https://stackoverflow.com/questions/42824706/qemu-system-aarch64-entering-el1-when-emulating-a53-power-up See also: Section 27.10.1, “ARM exception levels”
+gem5 ARM at least appears to implement more low level CPU functionality than QEMU, e.g. QEMU only added EL2 in 2018: https://stackoverflow.com/questions/42824706/qemu-system-aarch64-entering-el1-when-emulating-a53-power-up See also: Section 28.10.1, “ARM exception levels”
gem5 offers more advanced logging, even for non micro architectural things which QEMU models in some way, e.g. QEMU trace memory accesses, because QEMU’s binary translation optimizations reduce visibility
@@ -18852,7 +18853,7 @@ root
slower than QEMU, see: Section 29.2.1, “Benchmark Linux kernel boot”
+slower than QEMU, see: Section 30.2.1, “Benchmark Linux kernel boot”
This implies that the user base is much smaller, since no Android devs.
we have no caches, each instruction is fetched from memory
each loop contains 11 instructions as shown at Section 31.2, “C busy loop”
+each loop contains 11 instructions as shown at Section 32.2, “C busy loop”
and supposing that the loop dominated executable pre/post main, which we know is true since as shown in Benchmark emulators on userland executables an empty dynamically linked C program only has about 100k instructions, while our loop runs 1000000 * 11 = 11M.
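The arithmetic behind that last assumption can be checked in a few lines. This is only a sanity check of the numbers quoted above (1000000 iterations, 11 instructions per iteration, about 100k instructions of startup overhead); the function name loop_share is made up for the sketch.

```python
LOOP_ITERATIONS = 1_000_000      # from the text
INSTRUCTIONS_PER_LOOP = 11       # from the text
STARTUP_INSTRUCTIONS = 100_000   # "about 100k", from the text


def loop_share(iterations, per_loop, startup):
    """Return (loop_instructions, fraction of all instructions spent in the loop)."""
    loop_instructions = iterations * per_loop
    return loop_instructions, loop_instructions / (loop_instructions + startup)


loop_instructions, share = loop_share(
    LOOP_ITERATIONS, INSTRUCTIONS_PER_LOOP, STARTUP_INSTRUCTIONS
)
print(loop_instructions)  # 11000000
print(round(share, 3))    # 0.991
```

So under these numbers the loop accounts for roughly 99% of the executed instructions, which is what "dominated" means here.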
It presumably implements a crossbar switch along the lines of: https://en.wikipedia.org/wiki/Crossbar_switch
One simple example of its operation can be seen at: Section 19.20.4.2, “gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis”
+One simple example of its operation can be seen at: Section 19.21.4.2, “gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis”
But arguably interesting effects can only be observed when we have more than one CPU as in gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs.
@@ -22762,7 +22763,50 @@ cd ..
CommMonitor
You can place this SimObject in between two ports to get extra statistics about the packets that are going through.
+It only works on timing CPUs, and does not seem to dump any memory values, it only adds extra statistics.
+For example, the patch patches/manual/gem5-commmonitor-se.patch hacks a CommMonitor between the CPU and the L1 cache on top of gem5 1c3662c9557c85f0d25490dc4fbde3f8ab0cb350:
patch -d "$(./getvar gem5_source_dir)" -p 1 < patches/manual/gem5-commmonitor-se.patch+
which you can run with:
+./run \
+  --arch aarch64 \
+  --emulator gem5 \
+  --userland userland/arch/aarch64/freestanding/linux/hello.S \
+  -- \
+  --caches \
+  --cpu-type TimingSimpleCPU \
+;
+
and now we have some new extra histogram statistics such as:
+system.cpu.dcache_mon.readBurstLengthHist::samples 1+
One neat thing about this is that it is agnostic to the memory object type, so you don’t have to recode those statistics for every new type of object that operates on memory packets.
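To make the "agnostic to the memory object type" point concrete, here is a toy pass-through monitor in plain Python. It is only a conceptual sketch and does not use the gem5 API: Packet, ToyMemory and ToyMonitor are hypothetical names, and the histogram only loosely mirrors the readBurstLengthHist statistic shown above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Packet:
    """Hypothetical stand-in for a memory packet."""
    is_read: bool
    addr: int
    size: int


class ToyMemory:
    """Stand-in for whatever sits downstream (cache, crossbar, DRAM...)."""

    def __init__(self):
        self.received = []

    def recv(self, pkt):
        self.received.append(pkt)
        return True


class ToyMonitor:
    """Forwards packets unchanged and counts read sizes on the way."""

    def __init__(self, downstream):
        self.downstream = downstream
        self.read_burst_length_hist = Counter()

    def recv(self, pkt):
        if pkt.is_read:
            self.read_burst_length_hist[pkt.size] += 1
        # The monitor never inspects what kind of object is downstream.
        return self.downstream.recv(pkt)


mem = ToyMemory()
mon = ToyMonitor(mem)
mon.recv(Packet(is_read=True, addr=0x1000, size=8))
mon.recv(Packet(is_read=False, addr=0x1008, size=4))
print(sum(mon.read_burst_length_hist.values()))  # 1
print(len(mem.received))                         # 2
```

Because the monitor only relies on the downstream object accepting packets, the same statistics code works unchanged in front of any object with that interface.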
+Internals under other sections:
The interaction uses the Python C extension interface https://docs.python.org/2/extending/extending.html through the pybind11 helper library: https://github.com/pybind/pybind11
https://stackoverflow.com/questions/62969566/attributes-of-system-object-in-gem5/62970092#62970092
+The main is at: src/sim/main.cc. It calls:
Tested at gem5 b4879ae5b0b6644e6836b0881e4da05c64a6550d.
m5.objects module
All SimObjects seem to be automatically added to the m5.objects namespace, and this is done in a very convoluted way, let’s try to understand a bit:
gem5 is an event based simulator, and as such the event queue is one of the crucial elements in the system.
Then, once we had that, the most perfect thing ever would be to make the full event graph containing which events schedule which events!
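To fix ideas, here is a minimal sketch of an event queue that also records which event scheduled which, i.e. the edges of the event graph mentioned above. It is a toy in plain Python and not gem5's implementation: ToyEventQueue, fetch and mem_response are names made up for the illustration, and the tick offsets are arbitrary.

```python
import heapq
import itertools


class ToyEventQueue:
    """Toy discrete event queue that remembers who scheduled what."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker for equal ticks
        self.now = 0
        self.current = None  # name of the event being executed, None at startup
        self.edges = []      # (scheduler, scheduled) pairs
        self.log = []        # (tick, name) in execution order

    def schedule(self, when, name, callback):
        self.edges.append((self.current, name))
        heapq.heappush(self._heap, (when, next(self._order), name, callback))

    def run(self):
        while self._heap:
            self.now, _, name, callback = heapq.heappop(self._heap)
            self.current = name
            self.log.append((self.now, name))
            callback(self)
        self.current = None


def fetch(q):
    # A fetch asks memory for data, which answers some ticks later.
    q.schedule(q.now + 10, "mem_response", mem_response)


def mem_response(q):
    # Once the data is back, fetch the next instruction, up to a tick limit.
    if q.now < 30:
        q.schedule(q.now + 1, "fetch", fetch)


q = ToyEventQueue()
q.schedule(0, "fetch", fetch)  # the initial "magically scheduled" event
q.run()
print(q.log)
print(sorted(set(q.edges), key=str))
```

The set of distinct edges is the event graph: here the startup schedules fetch, fetch schedules mem_response, and mem_response schedules fetch again.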
Let’s now analyze every single event on a minimal gem5 syscall emulation mode in the simplest CPU that we have:
Tested in gem5 12c917de54145d2d50260035ba7fa614e25317a3.
Let’s have a closer look at the initial magically scheduled events of the simulation.
Inside AtomicSimpleCPU::tick() we saw previously that the reschedule happens at:
It will be interesting to see how AtomicSimpleCPU makes memory access on GDB and to compare that with TimingSimpleCPU.
Happens on EmulationPageTable, and seems to happen atomically without making any extra memory requests.
Now, let’s move on to TimingSimpleCPU, which is just like AtomicSimpleCPU internally, but now the memory requests don’t actually finish immediately: gem5 CPU types!
Schedules TimingSimpleCPU::fetch through:
Backtrace:
This is just the startup of the second rank, see: TimingSimpleCPU analysis #1.
From the timing we know what that one is: the end of time exit event, like for AtomicSimpleCPU.
Executes TimingSimpleCPU::fetch().
Schedules DRAMCtrl::processNextReqEvent through:
Schedules BaseXBar::Layer::releaseLayer through:
Executes DRAMCtrl::processNextReqEvent.
Schedules DRAMCtrl::Rank::processActivateEvent through:
Schedules DRAMCtrl::processRespondEvent through:
Schedules DRAMCtrl::processNextReqEvent through:
Executes DRAMCtrl::Rank::processActivateEvent.
Schedules DRAMCtrl::Rank::processPowerEvent through:
Executes DRAMCtrl::Rank::processPowerEvent.
Executes BaseXBar::Layer<SrcType, DstType>::releaseLayer.
Executes DRAMCtrl::processNextReqEvent().
Executes DRAMCtrl::processRespondEvent().
Schedules PacketQueue::processSendEvent() through:
Executes PacketQueue::processSendEvent().
Schedules PacketQueue::processSendEvent through:
Schedules BaseXBar::Layer<SrcType, DstType>::releaseLayer through:
Executes BaseXBar::Layer<SrcType, DstType>::releaseLayer.
Executes PacketQueue::processSendEvent.
Schedules TimingSimpleCPU::IcachePort::ITickEvent::process() through:
Executes TimingSimpleCPU::IcachePort::ITickEvent::process().
Schedules DRAMCtrl::processNextReqEvent through:
Schedules BaseXBar::Layer<SrcType, DstType>::releaseLayer through:
Execute DRAMCtrl::processNextReqEvent.
Schedule DRAMCtrl::processRespondEvent().
One important thing we want to check now, is how the memory reads are going to make the processor stall in the middle of an instruction.
Let’s just add --caches to gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis to see if things go any faster, and add Cache to --trace as in:
MOESI cache coherence protocol: https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/cache/cache_blk.hh#L352
The actual representation is done via separate state bits: https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/cache/cache_blk.hh#L66 and MOESI appears explicitly only on the pretty printing.
This pretty printing appears for example in the --trace Cache lines as shown at gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and with a few more transitions visible at Section 19.20.4.4, “gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs”.
This pretty printing appears for example in the --trace Cache lines as shown at gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and with a few more transitions visible at Section 19.21.4.4, “gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs”.
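A sketch of what "separate state bits, with MOESI only in the pretty printer" can look like. The bit names and the bits-to-letter mapping below are assumptions for illustration (the usual textbook reading: M is writable and dirty, O is dirty but not writable, E is writable but clean, S is neither, I is not valid); check them against the linked cache_blk.hh before relying on them.

```python
from enum import IntFlag


class BlkState(IntFlag):
    """Hypothetical bit names for this sketch, not copied from gem5."""
    VALID = 1
    WRITABLE = 2
    DIRTY = 4


def moesi_letter(state):
    """Derive a MOESI letter from independent state bits, for printing only."""
    if not state & BlkState.VALID:
        return "I"
    writable = bool(state & BlkState.WRITABLE)
    dirty = bool(state & BlkState.DIRTY)
    return {
        (True, True): "M",
        (False, True): "O",
        (True, False): "E",
        (False, False): "S",
    }[(writable, dirty)]


print(moesi_letter(BlkState.VALID | BlkState.WRITABLE | BlkState.DIRTY))  # M
print(moesi_letter(BlkState.VALID | BlkState.DIRTY))                      # O
print(moesi_letter(BlkState.VALID | BlkState.WRITABLE))                   # E
print(moesi_letter(BlkState.VALID))                                       # S
print(moesi_letter(BlkState(0)))                                          # I
```

The point of the design is that the simulator logic tests individual bits, and the single-letter state only exists as a derived label for traces.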
It would be amazing to analyze a simple example with interconnect packets possibly invalidating caches of other CPUs.
and so on, they just keep fighting over that address and changing one another’s state.
Now let’s do the exact same we did for gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs, but with Ruby rather than the classic system.
The events for the Atomic CPU were pretty simple: basically just ticks.
so now we are ready to run the third and fourth instructions of the program:
,…
- ldr x2, =len
- mov x8, 64
-,…
+ldr x2, =len
+ mov x8, 64
+
The LDR goes all the way down to FU 6 which is the memory one:
@@ -25794,14 +25841,14 @@ non-atomic 19
TODO like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard but with the hazard.
Like gem5 event queue MinorCPU syscall emulation freestanding example analysis but even more complex, since it is for the gem5 DerivO3CPU!
This section and children are tested at LKMC 144a552cf926ea630ef9eadbb22b79fe2468c456.
Let’s have a look at the arguably simplest example userland/arch/aarch64/freestanding/linux/hazardless.S.
Now let’s do the same as in gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazardless but with a hazard: userland/arch/aarch64/freestanding/linux/hazard.S.
Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard but now with an LDR stall: userland/arch/aarch64/freestanding/linux/stall.S.
Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall but now with an LDR stall: userland/arch/aarch64/freestanding/linux/stall-gain.S.
Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall-gain but now with some dependencies after the LDR: userland/arch/aarch64/freestanding/linux/stall-hazard4.S.
Now let’s try to see some Speculative execution in action with userland/arch/aarch64/freestanding/linux/speculative.S.
This is one of the parts of gem5 that rely on semi-useless code generation inside the .isa sublanguage.
We also notice that the key argument passed to those instructions is of type ExecContext, which is discussed further at: Section 19.20.7.3, “gem5 ExecContext”.
We also notice that the key argument passed to those instructions is of type ExecContext, which is discussed further at: Section 19.21.7.3, “gem5 ExecContext”.
The file is an include so that compilation can be split up into chunks by the autogenerated includers
@@ -26795,7 +26842,7 @@ namespace ArmISAInst {
Tested in gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.
execute vs initiateAcc vs completeAcc
These are the key methods defined in instruction definitions, so let’s see when each one gets called and what they do more or less.
This can be seen concretely in GDB from the analysis done at: TimingSimpleCPU analysis: LDR stall and for more memory details see gem5 functional vs atomic vs timing memory requests.
completeAcc
completeAcc is boring on most simple store memory instructions, e.g. a simple STR:
Some gem5 instructions break down into multiple microops.
The gem5 memory system is connected in a very flexible way through the port system.
This system exists to allow seamlessly connecting any combination of CPU, caches, interconnects, DRAM and peripherals.
gem5 memory requests can be classified in the following broad categories:
Tested in gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.
As seen at gem5 functional vs atomic vs timing memory requests, functional requests are not used in common simulation, since the core must always go through caches.
ThreadContext vs ThreadState vs ExecContext vs Process
These classes get used everywhere, and they have a somewhat convoluted relation with one another, so let’s figure out this mess.
This section and all children tested at gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.
ThreadContext
As we delve into more details below, we will reach the following conclusion: a ThreadContext represents one thread of a CPU with multiple hardware threads.
Essentially all methods of the base ThreadContext are pure virtual.
SimpleThread
SimpleThread storage defined on BaseSimpleCPU for simple CPUs like AtomicSimpleCPU:
O3ThreadContext
Instantiation happens in the FullO3CPU constructor:
ThreadState
Owned one per ThreadContext.
ExecContext
ExecContext gets used in gem5 instruction definitions, e.g.:
This makes sense, since each ThreadContext represents one CPU register set, and therefore needs a separate ExecContext which allows instruction implementations to access those registers.
ExecContext::readIntRegOperand register resolution
Let’s have a look at how ExecContext::readIntRegOperand actually matches registers to decoded register IDs, since it is not obvious.
First, we guess that they must be related to the reading of x1 and x2, which are the inputs of the addition.
Next, we also guess that the 0 read must correspond to x2, since it later gets potentially shifted as mentioned at Section 24.4.4.1, “ARM shift suffixes”.
Next, we also guess that the 0 read must correspond to x2, since it later gets potentially shifted as mentioned at Section 25.4.4.1, “ARM shift suffixes”.
Let’s also have a look at the decoder code that builds the instruction instance in build/ARM/arch/arm/generated/decoder-ns.cc.inc:
Process
The Process class is used only for gem5 syscall emulation mode, and it represents a process like a Linux userland process, in addition to any further gem5 specific data needed to represent the process.
Each instruction is marked with a class, and each class can execute in a given functional unit.
MinorCPU default functional units
Which units are available is visible for example on the gem5 config.ini of a gem5 MinorCPU run. Functional units are not present in simple CPUs like gem5 TimingSimpleCPU.
On gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772, after running:
gem5 uses a ton of code generation, which makes the project horrendous:
But it has been widely overused to insanity. It likely also exists partly because when the project started in 2003 C++ compilers weren’t that good, so you couldn’t rely on features like templates that much.
Generated code at: build/<ISA>/config/the_isa.hh which e.g. for ARM contains:
gem5 moves a bit slowly, and if your host compiler is very new, the gem5 build might be broken for it, e.g. this was the case for Ubuntu 19.10 with GCC 9 and gem5 62d75e7105fe172eb906d4f80f360ff8591d4178 from Dec 2019.
E.g. src/cpu/decode_cache.hh includes:
Upstream request: https://gem5.atlassian.net/browse/GEM5-469
Buildroot is a set of Make scripts that download and compile from source compatible versions of:
Linux kernel
C standard library: Buildroot supports several implementations, see: Section 20.10, “libc choice”
+C standard library: Buildroot supports several implementations, see: Section 21.10, “libc choice”
BusyBox: provides the shell and basic command line utilities
@@ -28696,7 +28745,7 @@ gensim/models/armv8/isa.ac
It therefore produces a pristine, blob-less, debuggable setup, where all moving parts are configured to work perfectly together.
Perhaps the awesomeness of Buildroot only sinks in once you notice that all it takes is 4 commands as explained at Section 20.11, “Buildroot hello world”.
+Perhaps the awesomeness of Buildroot only sinks in once you notice that all it takes is 4 commands as explained at Section 21.11, “Buildroot hello world”.
The downsides of Buildroot are:
@@ -28741,7 +28790,7 @@ gensim/models/armv8/isa.ac
We provide the following mechanisms:
The clean is necessary because the source files didn’t change, so make would just check the timestamps and not build anything.
You will then likely want to make those more permanent as explained at: Section 33.4, “Default command line arguments”.
+You will then likely want to make those more permanent as explained at: Section 34.4, “Default command line arguments”.
If you are benchmarking compiled programs instead of hand written assembly, remember that we configure Buildroot to disable optimizations by default with:
if you already have a full -O0 build, you can choose to rebuild just your package of interest to save some time as described at: Section 20.2, “Custom Buildroot configs”
if you already have a full -O0 build, you can choose to rebuild just your package of interest to save some time as described at: Section 21.2, “Custom Buildroot configs”
./build-buildroot \
@@ -28847,7 +28896,7 @@ gensim/models/armv8/isa.ac
make menuconfig is a convenient way to find Buildroot configurations:
At startup, we login automatically as the root user.
Replace on inittab:
There are basically two choices:
First, see if you can’t get away without actually adding a new package, for example:
if you have a standalone C file with no dependencies besides the C standard library to be compiled with GCC, just add a new file under buildroot_packages/sample_package and you are done
if you have a dependency on a library, first check if Buildroot doesn’t have a package for it already with ls buildroot/package. If yes, just enable that package as explained at: Section 20.2, “Custom Buildroot configs”
if you have a dependency on a library, first check if Buildroot doesn’t have a package for it already with ls buildroot/package. If yes, just enable that package as explained at: Section 21.2, “Custom Buildroot configs”
If none of those methods are flexible enough for you, you can just fork or hack up the sample package buildroot_packages/sample_package to do what you want.
For how to use that package, see: Section 33.15.2, “buildroot_packages directory”.
+For how to use that package, see: Section 34.15.2, “buildroot_packages directory”.
Then iterate trying to do what you want and reading the manual until it works: https://buildroot.org/downloads/manual/manual.html
@@ -29012,7 +29061,7 @@ make menuconfig
Once you’ve built a package into the image, there is no easy way to remove it.
Also mentioned at: https://stackoverflow.com/questions/47320800/how-to-clean-only-target-in-buildroot
See this for a sample manual workaround: Section 21.8.1.4, “PARSEC uninstall”.
+See this for a sample manual workaround: Section 22.8.1.4, “PARSEC uninstall”.
When adding a new large package to the Buildroot root filesystem, it may fail with the message:
Bibliography: https://stackoverflow.com/questions/49211241/is-there-a-way-to-automatically-detect-the-minimum-required-br2-target-rootfs-ex
SquashFS creation with mksquashfs does not take fixed sizes, and I have successfully booted from it, but it is readonly, which is unacceptable.
Buildroot is not designed for large root filesystem images, and the rebuild becomes very slow when we add a large package to it.
When asking for help on upstream repositories outside of this repository, you will need to provide the commands that you are running in detail without referencing our scripts.
Then, you will also want to do a Bisection to pinpoint the exact commit to blame, and CC that developer.
Finally, provide the images you used, to save upstream developers' time, as shown at: Section 33.19.2, “release-zip”.
+Finally, provide the images you used, to save upstream developers' time, as shown at: Section 34.19.2, “release-zip”.
For Buildroot problems, you should either provide the config you have:
@@ -29205,7 +29254,7 @@ git -C "$(./getvar qemu_source_dir)" checkout -
Buildroot supports several libc implementations, including:
This repo doesn’t do much more than setting a bunch of Buildroot configurations and building it.
Users of this repo will often want to update the compilation toolchain to the latest version to get fresh new features like new ISA instructions.
In this section we cover the most common cases.
This is of course the simplest case.
Now it gets fun, but well, guess what, we will try to do the same as Section 20.12.1, “Update GCC: GCC supported by Buildroot” but:
+Now it gets fun, but well, guess what, we will try to do the same as Section 21.12.1, “Update GCC: GCC supported by Buildroot” but:
By default, our build system uses build-linux, and the Buildroot kernel build is disabled: https://stackoverflow.com/questions/52231793/can-buildroot-build-the-root-filesystem-without-building-the-linux-kernel
This section documents our test and educational userland content, such as C, C++ and POSIX examples, present mostly under userland/.
@@ -29531,7 +29580,7 @@ cd ../..
Getting started at: Section 1.8, “Userland setup”
Userland assembly content is located at: Section 22, “Userland assembly”. It was split from this section basically because we were hitting the HTML h6 limit, stupid web :-)
Userland assembly content is located at: Section 23, “Userland assembly”. It was split from this section basically because we were hitting the HTML h6 limit, stupid web :-)
This content makes up the bulk of the userland/ directory.
@@ -29543,7 +29592,7 @@ cd ../..
This section was originally moved in here from: https://github.com/cirosantilli/cpp-cheat
Programs under userland/c/ are examples of ANSI C programming:
Allocate memory! Vs using the stack: https://stackoverflow.com/questions/4584089/what-is-the-function-of-the-push-pop-instructions-used-on-registers-in-x86-ass/33583134#33583134
malloc leads to the infinite joys of Memory leaks.
TODO: the exact answer is going to be hard.
General overview at: https://stackoverflow.com/questions/2798330/maximum-memory-which-malloc-can-allocate
If we start using the pages, the OOM killer would sooner or later step in and kill our process: Linux out-of-memory killer.
We can observe the OOM in LKMC 1e969e832f66cb5a72d12d57c53fb09e9721d589 which defaults to 256MiB of memory with:
Added in C11!
Example: userland/gcc/empty_struct.c
GCC provides an implementation of the OpenMP threading API: https://stackoverflow.com/questions/3949901/pthreads-vs-openmp
strace shows that OpenMP makes clone() syscalls in Linux. TODO: does it actually call pthread_ functions, or does it make syscalls directly? Or in other words, can it work on Freestanding programs? A quick grep shows many references to pthreads.
Programs under userland/cpp/ are examples of ISO C++ programming.
OMG this is hell, understand when primitive variables are initialized or not:
The smallest data race we managed to come up with as of LKMC 7c01b29f1ee7da878c7cc9cb4565f3f3cf516a92 and gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1 was with userland/c/atomic.c (see also C multithreading):
Like for C, you have to pay for the standards… insane. So we just use the closest free drafts instead.
https://stackoverflow.com/questions/81656/where-do-i-find-the-current-c-or-c-standard-documents
Programs under userland/posix/ are examples of POSIX C programming.
POSIX C example that prints all environment variables: userland/posix/environ.c
POSIX' multiprocess API. Contrast with pthreads which are for threads.
Read the source comments and understand everything that is going on!
The minimal interesting example is to use fork and observe different PIDs.
POSIX' multithreading API. Contrast with fork which is for processes.
userland/posix/pthread_count.c exemplifies the functions:
The mmap system call allows advanced memory operations.
Basic mmap example, do the same as userland/c/malloc.c, but with mmap.
Memory mapped file example: userland/posix/mmap_file.c
A bit like read and write, but from / to the Internet!
The following sections are related to multithreading in userland:
Let’s group the hard-to-debug undefined-behaviour-like stuff found in C / C++ here and how to tackle those problems.
Maybe some day someone will use this setup to study the performance of interpreters.
Examples:
Buildroot has a Python package that can be added to the guest image:
At LKMC 50ac89b779363774325c81157ec8b9a6bdb50a2f gem5 390a74f59934b85d91489f8a563450d8321b602da:
Here we will add some better examples and explanations for: https://docs.python.org/3/extending/embedding.html#very-high-level-embedding
Host installation shown at: https://askubuntu.com/questions/594656/how-to-install-the-latest-versions-of-nodejs-and-npm/971612#971612
Illustrates how to add extra non-code data files to an NPM package, and then use those files at runtime.
No OpenJDK package as of 2018.08: https://stackoverflow.com/questions/28874150/buildroot-with-jamvm-2-0-for-java-8/59290927#59290927 partly because their build system is shit like the rest of the project’s setup.
These are good targets for performance analysis with gem5, and there is some overlap between this section and Benchmarks.
TODO: move benchmark graph from userland/cpp/bst_vs_heap_vs_hashmap.cpp to userland/algorithm/set.
Buildroot supports it, which makes everything just trivial:
Header only linear algebra library with a mainline Buildroot package:
These are good targets for performance analysis with gem5.
We have ported parts of the PARSEC benchmark for cross compilation at: https://github.com/cirosantilli/parsec-benchmark See the documentation on that repo to find out which benchmarks have been ported. Some of the benchmarks are segfaulting, they are documented in that repo.
./build --arch arm --download-dependencies gem5-buildroot parsec-benchmark
@@ -31815,7 +31864,7 @@ cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans,3,3,2 ,1, A,3, B,
Running a benchmark of a size different than test, e.g. simsmall, requires a rebuild with:
Most users won’t want to use this method because:
If you want to remove PARSEC later, Buildroot doesn’t provide an automated package removal mechanism as mentioned at: Section 20.6, “Remove Buildroot packages”, but the following procedure should be satisfactory:
+If you want to remove PARSEC later, Buildroot doesn’t provide an automated package removal mechanism as mentioned at: Section 21.6, “Remove Buildroot packages”, but the following procedure should be satisfactory:
If you end up going inside submodules/parsec-benchmark to hack up the benchmark (you will!), these tips will be helpful.
It eventually has to come to that, hasn’t it?
Tests under userland/libs require certain optional libraries to be installed on the target, and are not built or tested by default, you must enable them with either:
See for example BLAS.
The following basenames should always refer to programs that do the same thing, but in different languages:
Programs under userland/arch/<arch>/ are examples of userland assembly programming.
registers, see: Section 22.1, “Assembly registers”
+registers, see: Section 23.1, “Assembly registers”
jumping:
@@ -32961,14 +33010,14 @@ error: asm_main returned 1 at line 8
After seeing an ADD hello world, you need to learn the general registers:
x86, see: Section 23.1, “x86 registers”
+x86, see: Section 24.1, “x86 registers”
arm
@@ -32999,7 +33048,7 @@ error: asm_main returned 1 at line 8
Bibliography: ARMv7 architecture reference manual A2.3 "ARM core registers".
Example: userland/arch/aarch64/x31.S
Keep in mind that many ISAs started floating point as an optional thing, and it later got better integrated into the main CPU, side by side with SIMD.
Much like ADD for non-SIMD, start learning SIMD instructions by looking at the integer and floating point SIMD ADD instructions of each ISA:
Bibliography: https://stackoverflow.com/questions/1389712/getting-started-with-intel-x86-sse-simd-instructions/56409539#56409539
Fused multiply add:
By "userland assembly", we mean "the parts of the ISA which can be freely used from userland".
One big difference between both is that we can run userland assembly on Userland setup, which is easier to get running and debug.
In particular, most userland assembly examples link to the C standard library, see: Section 22.5, “Userland assembly C standard library”.
+In particular, most userland assembly examples link to the C standard library, see: Section 23.5, “Userland assembly C standard library”.
Userland assembly is generally simpler, and a pre-requisite for Baremetal setup.
@@ -33286,7 +33335,7 @@ When instructions do not interpret this operand encoding as the zero register, u
All examples except the Freestanding programs link to the C standard library.
Unlike most our other assembly examples, which use the C standard library for portability, examples under freestanding/ directories don’t link to the C standard library:
This is analogous to step debugging baremetal examples.
Assembly examples under nostartfiles directories can use the standard library, but they don’t use the pre-main boilerplate and start directly at our explicitly given _start:
Examples under arch/<arch>/c/ directories show how to use inline assembly from higher level languages such as C:
Used notably in some of the Linux system calls setups:
In arm, it is the only way to achieve this effect: https://stackoverflow.com/questions/10831792/how-to-use-specific-register-in-arm-inline-assembler
This feature is notably useful for making system calls from C, see: Section 22.7, “Linux system calls”.
+This feature is notably useful for making system calls from C, see: Section 23.7, “Linux system calls”.
How to use temporary registers in inline assembly:
An example of using the & early-clobber modifier: link:userland/arch/aarch64/earlyclobber.c
Not documented as of GCC 8.2, but possible: https://stackoverflow.com/questions/53960240/armv8-floating-point-output-inline-assembly
Pre-existing C wrappers using inline assembly, this is what production programs should use instead of inline assembly for SIMD:
Good official cheatsheet with all intrinsics and what they expand to: https://software.intel.com/sites/landingpage/IntrinsicsGuide
The following Userland setup programs illustrate how to make system calls:
This is how threads either:
The best article to understand spinlocks is: https://eli.thegreenplace.net/2018/basics-of-futexes/
getcpu system call and the sched_getaffinity glibc wrapper
Examples:
A summary of results is shown at: Table 3, “Summary of Linux calling conventions for several architectures”.
Examples:
Call C standard library functions from assembly and vice versa.
GNU GAS is the default assembler used by GCC, and therefore it completely dominates in Linux.
The Linux kernel in particular uses GNU GAS assembly extensively for the arch specific parts under arch/.
In this tutorial, we use exclusively C Preprocessor /**/ comments because:
Summary:
Let’s see how many bytes go into each data type:
There are two types of ARMv7 assemblies:
cannot have implicit destination with shift, see: Section 24.4.4.1, “ARM shift suffixes”
+cannot have implicit destination with shift, see: Section 25.4.4.1, “ARM shift suffixes”
When reading disassembly, many instructions have either a .n or .w suffix.
Arch agnostic infrastructure getting started at: Section 22, “Userland assembly”.
+Arch agnostic infrastructure getting started at: Section 23, “Userland assembly”.
link:userland/arch/x86_64/registers.S
Example: userland/arch/x86_64/address_modes.S
5.1.1 "Data Transfer Instructions"
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 7.3.1.2 "Exchange Instructions":
TODO: concrete multi-thread GCC inline assembly examples of how all those instructions are normally used as synchronization primitives.
Examples:
This is partly why the ternary ? C operator exists: https://stackoverflow.com/questions/3565368/ternary-operator-vs-if-else
It is interesting to compare this with ARMv7 conditional execution, which is available for all instructions, as shown at: Section 24.2.5, “ARM conditional execution”.
+It is interesting to compare this with ARMv7 conditional execution, which is available for all instructions, as shown at: Section 25.2.5, “ARM conditional execution”.
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.1.2 "Binary Arithmetic Instructions":
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.1.4 "Logical Instructions"
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.1.5 "Shift and Rotate Instructions"
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.1.6 "Bit and Byte Instructions"
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.1.7 "Control Transfer Instructions"
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.1.8 "String Instructions"
However, as computer architecture evolved, those instructions might not offer considerable speedups anymore, and modern glibc such as 2.29 just uses x86 SIMD operations instead, see also: https://stackoverflow.com/questions/33480999/how-can-the-rep-stosb-instruction-execute-faster-than-the-equivalent-loop
Example: userland/arch/x86_64/rep.S
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.1.13 "Miscellaneous Instructions"
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.1.15 Random Number Generator Instructions
RDRAND sets the carry flag when data is ready so we must loop if the carry flag isn’t set.
Example: userland/arch/x86_64/cpuid.S
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.2 "X87 FPU INSTRUCTIONS"
Parent section: Section 22.3, “SIMD assembly”
+Parent section: Section 23.3, “SIMD assembly”
History:
@@ -35529,12 +35578,12 @@ pop %rbp
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.5 "SSE INSTRUCTIONS"
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.5.1.1 "SSE Data Transfer Instructions"
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.5.1.2 "SSE Packed Arithmetic Instructions"
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.5.1.6 "SSE Conversion Instructions"
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.6 "SSE2 INSTRUCTIONS"
userland/arch/x86_64/paddq.S: PADDQ, PADDL, PADDW, PADDB
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.15 "FUSED-MULTIPLY-ADD (FMA)"
Intel 64 and IA-32 Architectures Software Developer’s Manuals Volume 1 5.20 "SYSTEM INSTRUCTIONS"
Sources:
RDTSCP is like RDTSC, but it also stores the CPU ID into ECX: this is convenient because the value of RDTSC depends on which core we are currently on, so you often also want the core ID when you want the RDTSC.
TODO We didn’t manage to find a working ARM analogue to the x86 RDTSC instruction: kernel_modules/pmccntr.c is oopsing, and even if it weren’t, it likely wouldn’t give the cycle count since boot since it needs to be activated before it starts counting anything:
Inline assembly example at: userland/cpp/atomic/x86_64_lock_inc.cpp, see also: atomic.cpp.
We are using the May 2019 version unless otherwise noted.
Also I can’t find older versions on the website easily, so I just web archive everything.
Arch general getting started at: Section 22, “Userland assembly”.
+Arch general getting started at: Section 23, “Userland assembly”.
Instructions here are loosely grouped based on that of the ARMv7 architecture reference manual Chapter A4 "The Instruction Sets".
@@ -35848,7 +35897,7 @@ taskset -c 1 ./userland/arch/x86_64/rdtscp.out | tail -n 1
We cover here mostly ARMv7, and then treat aarch64 differentially, since much of the ARMv7 userland is the same in aarch32.
The ARM architecture has been used on the vast majority of mobile phones in the 2010s, and on a large fraction of microcontrollers.
ARM Holdings was bought by the Japanese giant SoftBank in 2016.
ARMv7 is the older architecture described at: ARMv7 architecture reference manual.
They are described at: ARMv8 architecture reference manual A1.7 "ARMv8 architecture extensions".
32-bit mode of operation of ARMv8.
A great summary of differences can be found at: https://en.wikipedia.org/wiki/ARM_architecture#AArch64_features
aarch32 has two encodings: Thumb and ARM: Section 24.1.3, “ARM instruction encodings”
+aarch32 has two encodings: Thumb and ARM: Section 25.1.3, “ARM instruction encodings”
in ARMv8, the stack can be enforced to 16-byte alignment: Section 24.3.2.2.1, “ARMV8 aarch64 stack alignment”
+in ARMv8, the stack can be enforced to 16-byte alignment: Section 25.3.2.2.1, “ARMV8 aarch64 stack alignment”
The ARM instruction set is itself protected by patents / copyright / whatever, and you have to pay ARM Holdings a licence to implement it, even if you are creating your own custom Verilog code.
Understanding the basics of instruction encodings is fundamental to remembering what instructions do and why some things are possible or not, notably the ARM LDR pseudo-instruction and the ADRP instruction.
Thumb examples are available at:
ARM can switch between big and little endian mode on the fly!
Unconditional branch.
Branch if equal based on the status registers.
Branch with link, i.e. branch and store the return address on the LR register.
Example: userland/arch/aarch64/ret.S
Compare and branch if zero.
Weirdly, ARM B instruction and family are not the only instructions that can execute conditionally on the flags: the same also applies to most instructions, e.g. ADD.
In ARM, there are only two instruction families that do memory access:
LDR can be either a regular instruction that loads stuff from memory into a register, or also a pseudo-instruction (assembler magic): http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0041c/Babbfdih.html
Example: userland/arch/arm/address_modes.S
As an application of the post-indexed addressing mode, let’s increment an array.
There are LDR variants that load less than the full 4 bytes:
Store from registers into memory.
PC-relative STR is not possible in aarch64.
Push a pair of registers to the stack.
In ARMv8, the stack can be enforced to 16-byte alignment.
Pop values from the stack into the register and optionally update the address register.
Arithmetic:
Example: userland/arch/aarch64/cset.S
Bitwise Bit Clear: clear some bits.
Unsigned Bitfield Move.
TODO: explain full behaviour. Very complicated. Has several simpler to understand aliases.
Alias for:
TODO: explain. Similar to UBFM but leave untouched bits unmodified.
Examples:
Move an immediate to a register, or a register to another register.
Cannot load from or to memory, since only the LDR and STR instruction families can do that in ARM as mentioned at: Section 24.3, “ARM load and store instructions”.
+Cannot load from or to memory, since only the LDR and STR instruction families can do that in ARM as mentioned at: Section 25.3, “ARM load and store instructions”.
Example: userland/arch/arm/mov.S
@@ -36951,7 +37000,7 @@ ldmia sp!, reglist
Assemblers however support magic memory allocations which may hide what is truly going on: https://stackoverflow.com/questions/14046686/why-use-ldr-over-mov-or-vice-versa-in-arm-assembly Always ask your friendly disassembly for a good confirmation.
Set the higher or lower 16 bits of a register to an immediate in one go.
Fill a 64-bit register 16 bits at a time, with one instruction per 16-bit chunk (4 in total).
Set 16-bits negated and the rest to 1.
Most data processing instructions can also optionally shift the second register operand.
Example: userland/arch/arm/s_suffix.S
Similar rationale to the ARM LDR pseudo-instruction, allowing to easily store a PC-relative reachable address into a register in one go, to overcome the 4-byte fixed instruction size.
Parent section: Section 22.10, “NOP instructions”
+Parent section: Section 23.10, “NOP instructions”
There are a few different ways to encode NOP, notably MOV a register into itself, and a dedicated miscellaneous instruction.
@@ -37091,7 +37140,7 @@ ldmia sp!, reglist
Guaranteed undefined! Therefore it raises an illegal instruction signal. Used by GCC __builtin_trap apparently: https://stackoverflow.com/questions/16081618/programmatically-cause-undefined-instruction-exception
Examples of using them can be found at: dump_regs
Each aarch64 system register is specified in the encoding of ARM system register instructions by 5 integer numbers:
Parent section: Section 22.3, “SIMD assembly”
+Parent section: Section 23.3, “SIMD assembly”
The name for the ARMv7 and AArch32 floating point and SIMD instructions / registers.
TODO example
userland/arch/arm/vadd_scalar.S: see also: Section 22.2, “Floating point assembly”
+userland/arch/arm/vadd_scalar.S: see also: Section 23.2, “Floating point assembly”
userland/arch/arm/vadd_vector.S: see also: Section 22.3, “SIMD assembly”
+userland/arch/arm/vadd_vector.S: see also: Section 23.3, “SIMD assembly”
Example: userland/arch/arm/vcvt.S
Example: userland/arch/arm/vcvtr.S
Example: userland/arch/arm/vcvt.S
The ARMv8 architecture reference manual specifies floating point and SIMD support in the main architecture at A1.5 "Advanced SIMD and floating-point support".
The Linux kernel shows /proc/cpuinfo compatibility as neon, which is yet another intermediate name that came up at some point, see: Section 24.6.2.2, “ARM NEON”.
+The Linux kernel shows /proc/cpuinfo compatibility as neon, which is yet another intermediate name that came up at some point, see: Section 25.6.2.2, “ARM NEON”.
Vs ARM VFP: https://stackoverflow.com/questions/4097034/arm-cortex-a8-whats-the-difference-between-vfp-and-neon
Support is semi-mandatory. ARMv8 architecture reference manual A1.5 "Advanced SIMD and floating-point support":
Just an informal name for the "Advanced SIMD instructions"? Very confusing.
TODO example.
userland/arch/aarch64/fadd_vector.S: see also: Section 22.3, “SIMD assembly”
+userland/arch/aarch64/fadd_vector.S: see also: Section 23.3, “SIMD assembly”
userland/arch/aarch64/fadd_scalar.S: see also: Section 22.2, “Floating point assembly”
+userland/arch/aarch64/fadd_scalar.S: see also: Section 23.2, “Floating point assembly”
It is very confusing, but FADDS and FADDD in Aarch32 are pre-UAL for vadd.f32 and vadd.f64 which we use in this tutorial, see: Section 24.6.1.2, “ARM VADD instruction”
+It is very confusing, but FADDS and FADDD in Aarch32 are pre-UAL for vadd.f32 and vadd.f64 which we use in this tutorial, see: Section 25.6.1.2, “ARM VADD instruction”
The same goes for most ARMv7 mnemonics: f* is old, and v* is the newer better syntax.
Also keep in mind that fused multiply add is FMADD.
Examples at: Section 22.3, “SIMD assembly”
+Examples at: Section 23.3, “SIMD assembly”
Example: userland/arch/aarch64/ld2.S
Scalable Vector Extension.
Using SVE normally requires setting the CPACR_EL1.FPEN and ZEN bits, which as of LKMC 29fd625f3fda79f5e0ee6cac43517ba74340d513 + 1 we also enable in our Baremetal bootloaders, see also: aarch64 baremetal NEON setup.
Get the SVE vector length. The following programs do that and print it to stdout:
ARMv8 architecture reference manual A1.7 "ARMv8 architecture extensions" says:
Parent section: Userland multithreading.
Parent section: atomic.cpp
Set of atomic and synchronization primitives added in ARMv8.1 architecture extension.
ARMv8 architecture reference manual db A1.7.3 "The ARMv8.1 architecture extension"
Good getting started tutorials:
The official manuals were stored in http://infocenter.arm.com but as of 2017 they started to slowly move to https://developer.arm.com.
Bibliography: https://www.quora.com/Where-can-I-find-the-official-documentation-of-ARM-instruction-set-architectures-ISAs
ARM also releases documentation specific to each given processor.
https://en.wikipedia.org/wiki/Executable_and_Linkable_Format
@@ -38094,7 +38143,7 @@ AArch64, see Procedure Call Standard for the ARM 64-bit Architecture.
Getting started at: Section 1.9, “Baremetal setup”
GDB step debug works on baremetal exactly as it does on the Linux kernel, which is described at: Section 2, “GDB step debug”.
As can be seen from Baremetal GDB step debug, all examples under baremetal/, with the exception of baremetal/arch/<arch>/no_bootloader, start from our tiny bootloaders:
the stack pointer
TODO: we don’t do this currently but maybe we should setup BSS
@@ -38265,7 +38314,7 @@ AArch64, see Procedure Call Standard for the ARM 64-bit Architecture.
For things to work in baremetal, we often have to lay out memory in specific ways.
QEMU and gem5 currently support baremetal CLI arguments!
It is worth noting that e.g. ARM has a Semihosting mechanism for loading CLI arguments through SYS_GET_CMDLINE, but our mechanism works in principle for any ISA.
Currently not supported, so we just hardcode argc 0 on the arm baremetal bootloader.
Semihosting is a publicly documented interface specified by ARM Holdings that allows us to do some magic operations very useful in development, such as writing to the terminal or reading and writing host files.
For gem5, you need:
+For gem5, you need patches/manual/gem5-semihost.patch:
TODO: our example is printing newlines without automatic carriage return \r as in:
For arm, some baremetal examples compile fine with:
Didn’t get it working, tracking at: https://github.com/cirosantilli/linux-kernel-module-cheat/issues/119
It is incredible, but GDB also has a CPU simulator inside of it as documented at: https://sourceware.org/gdb/onlinedocs/gdb/Target-Commands.html
Since I had this compiled, I also decided to try it out on userland.
In this section we will focus on learning ARM architecture concepts that can only be learnt on baremetal setups.
Userland information can be found at: https://github.com/cirosantilli/arm-assembly-cheat
ARM exception levels are analogous to x86 rings.
According to ARMv7 architecture reference manual, access to that register is controlled by other registers NSACR.{CP11, CP10} and HCPTR so those must be turned off, but I’m too lazy to investigate now: even just trying to dump those registers in userland/arch/arm/dump_regs.c also leads to exceptions…
TODO. Create a minimal runnable example of going into EL0 and jumping to EL1.
See ARMv8 architecture reference manual db D1.6.2 "The stack pointer registers".
This is the most basic example of exception handling we have.
The vector table format is described on ARMv8 architecture reference manual Table D1-7 "Vector offsets from vector table base address".
Exception Syndrome Register.
See example at: Section 27.10.2, “ARM SVC instruction”
+See example at: Section 28.10.2, “ARM SVC instruction”
Documentation: ARMv8 architecture reference manual db D12.2.36 "ESR_EL1, Exception Syndrome Register (EL1)".
Exception Link Register.
See the example at: Section 27.10.2, “ARM SVC instruction”
+See the example at: Section 28.10.2, “ARM SVC instruction”
Examples:
Bibliography: https://stackoverflow.com/questions/980999/what-does-multicore-assembly-language-look-like/33651438#33651438
The WFE and SEV instructions are just hints: a compliant implementation can treat them as NOPs.
For how userland spinlocks and mutexes are implemented see Userland mutex implementation.
Examples:
WFE and SEV are usable from userland, and are part of an efficient spinlock implementation, although userland should arguably stay away from spinlocks and instead stick to mutexes built on the futex system call, which allows for non-busy sleep.
gem5 390a74f59934b85d91489f8a563450d8321b602d does not sleep on the first WFE on either syscall emulation or full system, because the code does:
Can be used to implement atomic variables, see also:
In QEMU, CPU 1 starts in a halted state. This can be observed from GDB, where:
TODO: create and study a minimal example in gem5 where the DMB instruction leads to fewer cycles: https://stackoverflow.com/questions/15491751/real-life-use-cases-of-barriers-dsb-dmb-isb-in-arm
The ARM timer is the simplest way to generate hardware interrupts periodically, and therefore serves as the simplest example of ARM GIC usage.
Generic Interrupt Controller.
TODO create a minimal working aarch64 example analogous to the x86 one at: https://github.com/cirosantilli/x86-bare-metal-examples/blob/6dc9a73830fc05358d8d66128f740ef9906f7677/paging.S
First, also consider the userland bibliography: Section 24.9, “ARM assembly bibliography”.
+First, also consider the userland bibliography: Section 25.9, “ARM assembly bibliography”.
The most useful ARM baremetal example sets we’ve seen so far are:
@@ -40094,7 +40143,7 @@ cntvct_el0 0x3CF516F
It is nice when things just work.
But you can also learn a thing or two from how I actually made them work in the first place.
Enter the QEMU console:
Inside baremetal/lib/aarch64.S there is a chunk of code that enables floating point operations:
Baremetal tests work exactly like User mode tests, except that you have to add the --mode baremetal option, for example:
In baremetal, we detect if tests failed by parsing logs for the Magic failure string.
See: Section 33.16, “Test this repo” for more useful testing tips.
+See: Section 34.16, “Test this repo” for more useful testing tips.
Remember: Android AOSP is a huge undocumented piece of bloatware. Its integration into this repo will likely never be super good. See also: https://cirosantilli.com#android
@@ -40430,7 +40479,7 @@ ISB
Tested on: 8.1.0_r60.
Tested on: 8.1.0_r60.
From mount, we can see that some of the mounted images are ro.
When I install an app like F-Droid, it goes under /data according to:
I don’t know how to download files from the web on Vanilla android, the default browser does not download anything, and there is no wget:
For Linux in general, see: Section 6, “init”.
TODO: didn’t fully port during refactor after 3b0a343647bed577586989fb702b760bd280844a. Reimplementing should not be hard.
@@ -40776,7 +40825,7 @@ cd -
We have explored a few Continuous integration solutions.
We haven’t setup any of them yet.
We tried to automate it on Travis with .travis.yml but it hits the current 50 minute job timeout: https://travis-ci.org/cirosantilli/linux-kernel-module-cheat/builds/296454523 And I bet it would likely hit a disk maxout either way if it went on.
This setup successfully built gem5 on every commit: .circleci/config.yml
Run all kernel boot benchmarks for one arch:
TODO 62f6870e4e0b384c4bd2d514116247e81b241251 takes 33 minutes to finish at 62f6870e4e0b384c4bd2d514116247e81b241251:
Let’s see how fast our simulators are running some well known or easy to understand userland benchmarks!
so ~ 110 million instructions / 100 seconds makes ~ 1 MIPS (million instructions per second).
This experiment also suggests that each loop is about 11 instructions long (110M instructions / 10M loops), which we confirm at Section 31.2, “C busy loop”, bingo!
+This experiment also suggests that each loop is about 11 instructions long (110M instructions / 10M loops), which we confirm at Section 32.2, “C busy loop”, bingo!
Then for QEMU, we experimentally turn the number of loops up to 10^10 loops (100000 100000), which contains an expected 11 * 10^10 instructions, and the runtime is 00:01:08, so we have 1.1 * 10^11 instructions / 68 seconds ~ 1.6 * 10^9 instructions per second, i.e. roughly 2000 MIPS!
We can then repeat the experiment for other gem5 CPUs to see how they compare.
Let’s see if user mode runs considerably faster than full system or not, ignoring the kernel boot.
First we build Dhrystone manually statically since dynamic linking is broken in gem5 as explained at: Section 10.7, “gem5 syscall emulation mode”.
TODO: move this section to our new custom dhrystone setup: Section 21.8.2.1, “Dhrystone”.
+TODO: move this section to our new custom dhrystone setup: Section 22.8.2.1, “Dhrystone”.
gem5 user mode:
@@ -41383,7 +41432,7 @@ time \
The build times are calculated after doing ./configure and make source, which downloads the sources, and basically benchmarks the Internet.
./build-buildroot -- graph-build graph-size graph-depends
@@ -41419,14 +41468,14 @@ xdg-open graph-size.pdf
The biggest build time hog is always GCC, and it does not look like we can use a precompiled one: https://stackoverflow.com/questions/10833672/buildroot-environment-with-host-toolchain
This is the minimal build we could expect to get away with.
How long it takes to build gem5 itself.
A profiling of the build has been done at: https://gem5.atlassian.net/browse/GEM5-277 Analysis there showed that d7d9bc240615625141cd6feddbadd392457e49eb (2018-06-17) is also composed of 50% pybind11 and with no obvious time sinks.
This is the critical development parameter, and is dominated by the link time of huge binaries.
Serial number: TYPE 20HH-CTO1WW S/N PF-0V5V5N 17/11
STREAM benchmark on Ubuntu 20.04 results at STREAM benchmark.
CoreMark-Pro d5b4f2ba7ba31e37a5aa93423831e7d5eb933868 on Ubuntu 20.04 with XCMD="-c$(nproc)":
Bought: 2017 for approximately 2400 pounds.
Nominal speed: 2400 Mbps
PCIe TLC OPAL2.
1TB.
2c12b21b304178a81c9912817b782ead0286d282:
gem5:
Argh, compilers are boring, let’s learn a bit about them.
In gem5, can be seen on:
As mentioned at: https://stackoverflow.com/questions/10074831/what-is-general-difference-between-superscalar-and-ooo-execution it is in theory possible for an out-of-order CPU to not be a superscalar processor, but the combination is so natural (since you can look ahead, you might as well run it!) that non-superscalar out-of-order CPUs are not common.
Intel name: "Hyperthreading"
Even if caches are coherent, this is still not enough to avoid data race conditions, because this does not enforce atomicity of read modify write sequences. This is for example shown at: Detailed gem5 analysis of how data races happen.
According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf "memory consistency" is about ordering requirements of different memory addresses.
This is represented explicitly in C++ for example C++ std::memory_order.
According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf, this is the strongest possible consistency: everything is nicely ordered as you’d expect.
Either they can snoop only control, or both control and data can be snooped.
Mentioned at:
TODO gem5 concrete example.
TODO understand well why those are needed.
https://en.wikipedia.org/wiki/MOSI_protocol The critical MSI vs MOSI section was a bit bogus though: https://en.wikipedia.org/w/index.php?title=MOSI_protocol&oldid=895443023 but I edited it :-)
MESI cache coherence protocol + MOSI cache coherence protocol, not much else to it!
In gem5 9fc9c67b4242c03f165951775be5cd0812f2a705, MOESI is the default cache coherency protocol of the classic memory system as shown at Section 19.20.4.3.1, “What is the coherency protocol implemented by the classic cache system in gem5?”.
+In gem5 9fc9c67b4242c03f165951775be5cd0812f2a705, MOESI is the default cache coherency protocol of the classic memory system as shown at Section 19.21.4.3.1, “What is the coherency protocol implemented by the classic cache system in gem5?”.
A good and simple example showing several MOESI transitions in the classic memory model can be seen at: Section 19.20.4.4, “gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs”.
+A good and simple example showing several MOESI transitions in the classic memory model can be seen at: Section 19.21.4.4, “gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs”.
gem5 12c917de54145d2d50260035ba7fa614e25317a3 has several Ruby MOESI models implemented: MOESI_AMD_Base, MOESI_CMP_directory, MOESI_CMP_token and MOESI_hammer.
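To illustrate what "MESI + MOSI" means for MOESI, here is a minimal sketch of the textbook state machine for a single line in a single cache. It is a simplification under stated assumptions: the function and event names are hypothetical, and it is not how the gem5 classic cache or the Ruby models implement the protocol. `PrRd` and `PrWr` are accesses by the local CPU, `BusRd` and `BusRdX` are a read and a read-for-ownership observed from another cache.

```python
def next_state(state, event, others_have_copy=False):
    """Return the next MOESI state of one cache line (textbook version)."""
    if event == "PrWr":
        # A local write always ends in Modified (other copies get invalidated).
        return "M"
    if event == "BusRdX":
        # Another cache wants to write: our copy becomes Invalid.
        return "I"
    if event == "PrRd":
        if state == "I":
            # Read miss: Exclusive (from MESI) if nobody else holds the line.
            return "S" if others_have_copy else "E"
        return state
    if event == "BusRd":
        # Another cache reads: a dirty line becomes Owned (from MOSI)
        # and keeps supplying the data instead of writing it back.
        return {"M": "O", "O": "O", "E": "S", "S": "S", "I": "I"}[state]
    raise ValueError(event)


state = "I"
trace = [state]
for event in ["PrRd", "PrWr", "BusRd", "BusRdX"]:
    state = next_state(state, event)
    trace.append(state)
print(" -> ".join(trace))  # I -> E -> M -> O -> I
```

The `E` step saves a bus transaction on the later write, and the `O` step avoids a write-back when a dirty line becomes shared.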
The host requirements depend a lot on which examples you want to run.
If ./build --download-dependencies fails with:
It does not work if you just download the .zip with the sources for this repository from GitHub because we use Git submodules: you must clone this repo.
If you just want to run a command after boot ends without thinking much about it, just use the --eval-after option, e.g.:
It gets annoying to retype --arch aarch64 for every single command, or to remember --config setups.
To learn how to build the documentation see: Section 1.10, “Build the documentation”.
When running build-doc, we do the following checks:
The script prints what you have to fix and exits with an error status if there are any errors.
Documentation for asciidoctor/extract-link-targets
Documentation for asciidoctor/extract-header-ids
The Asciidoctor extension scripts:
As mentioned before the TOC, we have to push this README to GitHub pages due to: https://github.com/isaacs/github/issues/1610
You did something crazy, and nothing seems to work anymore?
For now there is no way to change the build directory from out/ (resp. out.docker for Docker) to something else.
ccache might save you a lot of rebuild time when you decide to Clean the build or create a new build variant.
The getvar helper script can print the values of internal LKMC variables.
For this reason, we use it particularly often in this README to reduce the need for refactoring.
While you could just manually find/learn the path to toolchain tools, e.g. in LKMC b15a0e455d691afa49f3b813ad9b09394dfb02b7 they are:
Since disassembly of a single function of an LKMC executable with GDB is such a common use case for run-toolchain via https://stackoverflow.com/questions/22769246/how-to-disassemble-one-single-function-using-objdump, we have this shortcut for it.
It is not possible to rebuild the root filesystem while running QEMU because QEMU holds the qcow2 file:
When doing long simulations sweeping across multiple system parameters, it becomes fundamental to do multiple simulations in parallel.
To run multiple gem5 checkouts, see: Section 33.13.3.1, “gem5 worktree”.
+To run multiple gem5 checkouts, see: Section 34.13.3.1, “gem5 worktree”.
Implementation note: we create multiple namespaces for two things:
@@ -43903,7 +43952,7 @@ less "$(./getvar --arch aarch64 --emulator gem5 --run-id 1 termout_file)"
It often happens that you are comparing two versions of the build, a good and a bad one, and trying to figure out why the bad one is bad.
Our build variants system allows you to keep multiple built versions of all major components, so that you can easily switch between running one or the other.
If you want to keep two builds around, one for the latest Linux version, and the other for Linux v4.16:
To run both kernels simultaneously, one on each QEMU instance, see: Section 33.12, “Simultaneous runs”.
+To run both kernels simultaneously, one on each QEMU instance, see: Section 34.12, “Simultaneous runs”.
Analogous to the Linux kernel build variants but with the --qemu-build-id option instead:
Analogous to the Linux kernel build variants but with the --gem5-build-id option instead:
Therefore, you must not forget to check out the sources corresponding to the build before running, unless you explicitly tell gem5 to use a non-default source tree with gem5 worktree. This becomes inevitable when you want to launch multiple simultaneous runs at different checkouts.
--gem5-build-id goes a long way, but if you want to seamlessly switch between two gem5 trees without checking out multiple times, then --gem5-worktree is for you.
Suppose that you are working on a private fork of gem5, but you want to use this repository to develop it as well.
Allows you to have multiple versions of the GCC toolchain or root filesystem.
The --optimization-level option is available on all build scripts and sets the given GCC -O optimization level where it has been implemented for guest binaries.
lkmc/ contains sources and headers that are shared across kernel modules, userland and baremetal examples.
Another option would have been to name it as includes/lkmc, but that would make paths longer, and we might want to store source code in that directory as well in the future.
When factoring out functionality across userland examples, there are two main options:
Source: buildroot_packages/.
A custom build script can give you more flexibility: e.g. the package can be made to work with other root filesystems more easily, have better 9P support, and rebuild faster as it evades some Buildroot boilerplate.
Has the following structure:
Patches in this directory are never applied automatically: it is up to users to manually apply them before usage following the instructions in this documentation.
Source: rootfs_overlay.
This way you can just hack away the scripts and try them out immediately without any further operations.
This path can be found with:
The files:
lkmc_home refers to the target base directory in which we put all our custom built stuff, such as userland executables and kernel modules.
In order to build and run each userland and baremetal example properly, we need per-file metadata such as compiler flags and required number of cores.
Print out several parameters that normally change randomly from boot to boot:
Run almost all tests:
test does not run all possible tests, because there are too many possible variations and that would take forever. The rationale is the same as for ./build all and is explained in ./build --help.
You can select multiple archs and emulators of interest, as for any other command, with:
By default, the tests continue running even after the first failure happens, and they show a summary at the end.
TODO: we really need a mechanism to automatically generate the test list e.g. based on path_properties.py, currently there are many tests missing, and we have to add everything manually which is very annoying.
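The continue-on-failure behaviour can be sketched as follows. This is a generic illustration with hypothetical names and hypothetical tests, not the implementation of this repository's test scripts: every test runs regardless of earlier failures, and the failures are only reported once at the end.

```python
def run_all(tests):
    """Run every (name, callable) pair, recording failures instead of stopping."""
    failures = []
    for name, func in tests:
        try:
            func()
        except Exception as exc:
            failures.append((name, exc))
    return failures


def passing():
    pass


def failing():
    raise AssertionError("boom")


tests = [("first", failing), ("second", passing), ("third", failing)]
failures = run_all(tests)

# Summary at the end, after all tests have had a chance to run.
print(f"{len(tests) - len(failures)} passed, {len(failures)} failed")
for name, exc in failures:
    print(f"FAIL {name}: {exc}")
```

A caller would typically also exit with a non-zero status when `failures` is not empty.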
We have some pexpect automated tests for GDB for both userland and baremetal programs!
We do not know of any way to set the emulator exit status in QEMU arm full system.
For the Linux kernel, do the following manual tests for now.
You should also test that the Internet works:
build-userland and test-executables have a wide variety of target selection modes, and it was hard to keep them all working without some tests:
When updating the Linux kernel, QEMU and gem5, things sometimes break.
This is a template update procedure for submodules for which we have some patches on top of mainline.
Ensure that the Automated tests are passing on a clean build:
The ./build-test command builds a superset of what will be downloaded which also tests other things we would like to be working on the release. For the minimal build to generate the files to be uploaded, see: Section 33.19.2, “release-zip”
+The ./build-test command builds a superset of what will be downloaded which also tests other things we would like to be working on the release. For the minimal build to generate the files to be uploaded, see: Section 34.19.2, “release-zip”
The clean build is necessary as it generates clean images since it is not possible to remove Buildroot packages
@@ -45121,7 +45170,7 @@ git push --follow-tags
Create a zip containing all files required for Prebuilt setup:
After:
This project was created to help me understand, modify and test low level system components by using system simulators.
The trade-offs between the different setups are basically a balance between:
compatibility: how likely it is that all the components will work well together: emulator, compiler, kernel, standard library, …
guest software availability: how wide is your choice of easily installed guest software packages? See also: Section 33.20.4, “Linux distro choice”
+guest software availability: how wide is your choice of easily installed guest software packages? See also: Section 34.20.4, “Linux distro choice”
Choosing which features go into our default builds means making trade-offs. Here are our guidelines:
In order to learn how to measure some of those aspects, see: Section 29, “Benchmark this repo”.
+In order to learn how to measure some of those aspects, see: Section 30, “Benchmark this repo”.
We haven’t found the ultimate distro yet, here is a summary table of trade-offs that we care about: Table 8, “Comparison of Linux distros for usage in this repository”.