And no, I do have the symbols from arch/arm/boot/compressed/vmlinux, but the breakpoints still don’t work.
When booting Linux on a slow emulator like gem5, what you observe is that:
first nothing shows for a while
then, all at once, a bunch of message lines show up, followed on aarch64 Linux 5.4.3 by:
[ 0.081311] printk: console [ttyAMA0] enabled
This means of course that all the previous messages had been generated earlier and stored, but were only printed to the terminal once the terminal itself was enabled.
Notably for example the very first message:
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd070]
happens very early in the boot process.
If you get a failure before that point, it will be hard to see the print messages at all.
One possible solution is to parse the dmesg buffer from memory; gem5 actually implements that, see: gem5 m5out/system.dmesg file.
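To make the buffering behaviour concrete, here is a minimal conceptual sketch of the store-then-replay pattern described above. This is not the kernel’s actual printk implementation, just an illustration of the mechanism:

[source,cpp]
----
// Conceptual sketch of printk-style buffering, NOT the actual kernel code:
// messages always go to an in-memory buffer, and are only replayed to the
// console once a console driver registers itself.
#include <iostream>
#include <string>
#include <vector>

struct KernelLog {
    std::vector<std::string> buffer;
    bool console_enabled = false;

    void printk(const std::string &msg) {
        buffer.push_back(msg);          // always stored, even with no console
        if (console_enabled)
            std::cout << msg << '\n';
    }

    void register_console() {
        console_enabled = true;
        for (const auto &msg : buffer)  // replay everything buffered so far
            std::cout << msg << '\n';
    }
};

int main() {
    KernelLog log;
    log.printk("[    0.000000] Booting Linux on physical CPU 0x0");
    log.printk("[    0.050000] some early init message");
    // Nothing has been shown yet. Now the serial console comes up:
    log.register_console();             // both messages appear at once
    log.printk("[    0.081311] printk: console [ttyAMA0] enabled");
}
----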
modules built with Buildroot, see: Section 33.14.2.1, “kernel_modules buildroot package”
modules built from the kernel tree itself, see: Section 15.12.2, “dummy-irq”
Unlike insmod, modprobe deals with kernel module dependencies for us.
First get kernel_modules buildroot package working.
Then, for example:
Getting started at: Section 1.3, “gem5 Buildroot setup”.
gem5 has a bunch of crappiness, mostly described at: gem5 vs QEMU, but it does deserve some credit on the following points:
insanely configurable system topology from Python without recompiling, made possible in part due to a well defined memory packet structure that allows adding caches and buses transparently (sketched below)
each microarchitectural model (gem5 CPU types) works with all ISAs
TODO confirm and create minimal example.
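To illustrate the first point: because every component speaks the same packet-over-port interface, a cache can be slotted between any two components without either side knowing. The following is a hypothetical sketch of that idea, not gem5’s real Packet/Port API:

[source,cpp]
----
// Hypothetical sketch of the idea, not gem5's real Packet/Port classes:
// because everything exchanges packets through a common interface, a cache
// can be inserted between CPU and memory without changing either side.
#include <cstdint>
#include <iostream>
#include <unordered_map>

struct Packet { uint64_t addr; bool is_read; uint64_t data; };

struct MemObject {
    virtual void recv(Packet &pkt) = 0;
    virtual ~MemObject() = default;
};

struct Memory : MemObject {
    std::unordered_map<uint64_t, uint64_t> mem;
    void recv(Packet &pkt) override {
        if (pkt.is_read) pkt.data = mem[pkt.addr];
        else mem[pkt.addr] = pkt.data;
    }
};

struct Cache : MemObject {
    MemObject *downstream;              // whatever is below: memory, a bus...
    std::unordered_map<uint64_t, uint64_t> lines;
    explicit Cache(MemObject *d) : downstream(d) {}
    void recv(Packet &pkt) override {
        if (pkt.is_read && lines.count(pkt.addr)) {  // hit: no downstream traffic
            pkt.data = lines[pkt.addr];
            return;
        }
        downstream->recv(pkt);          // miss or write: forward unchanged
        lines[pkt.addr] = pkt.data;
    }
};

int main() {
    Memory mem;
    Cache cache(&mem);                  // topology chosen here, not in the CPU
    MemObject &port = cache;            // the "CPU" only sees a MemObject
    Packet w{0x80000000, false, 42}; port.recv(w);
    Packet r{0x80000000, true, 0};   port.recv(r);
    std::cout << r.data << '\n';        // 42
}
----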
I think this file is capable of showing kernel messages before they reach the terminal by parsing the dmesg buffer from memory.
This could be used to debug the Linux kernel boot if problems happen before the serial is enabled: Linux kernel early boot messages.
The file appears to get dumped only on kernel panic, which gem5 can detect by the PC address: Exit gem5 on panic.
This file contains important statistics about the run:
For x86, it is interesting to try and correlate numCycles with:
TODO
The m5out/config.ini file contains a very good high level description of the system:
./gem5-regression --arch aarch64 --cmd list
Ruby seems to have usage outside of gem5, but the naming overload with the Ruby programming language, which also has domain specific languages as a concept, makes it impossible to google anything about it!
Since it is not the default, Ruby is generally less stable than the classic memory model. However, because it allows describing a wide variety of important cache coherence protocols, while the classic system only describes a single protocol, Ruby is a very important feature of gem5.
Ruby support must be enabled at compile time with the scons PROTOCOL= flag, which compiles support for the desired memory system type.
For example, to use a two level MESI cache coherence protocol we can do:
6 7 6 DRAMCtrl::processNextReqEvent (0)
8 15 7 BaseXBar::Layer::releaseLayer (0)
6: schedule DRAMCtrl::processNextReqEvent to run in 0 ns after the execution that scheduled it
8: execute DRAMCtrl::processNextReqEvent
7: schedule BaseXBar::Layer::releaseLayer to run in 0 ns after the execution that scheduled it
15: execute BaseXBar::Layer::releaseLayer
Observe how the events leading up to the second instruction are basically a copy of those of the first one: this is the basic TimingSimpleCPU event loop in action.
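To make the schedule/execute numbers above easier to follow, here is a toy discrete event queue in the same style. This is a conceptual sketch, not gem5’s actual EventQueue API:

[source,cpp]
----
// Toy discrete event simulator, conceptually similar to (but much simpler
// than) gem5's event queue: executing one event may schedule further events
// at the current tick plus some delay, and the queue always runs the
// earliest pending event next.
#include <cstdint>
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

struct Event {
    uint64_t tick;
    std::function<void()> action;
    bool operator>(const Event &o) const { return tick > o.tick; }
};

std::priority_queue<Event, std::vector<Event>, std::greater<Event>> events;
uint64_t cur_tick = 0;

void schedule(uint64_t delay, std::function<void()> action) {
    events.push({cur_tick + delay, std::move(action)});
}

int main() {
    // As in the trace above: an event scheduled with 0 ns delay still runs
    // after the event that scheduled it, at the same tick.
    schedule(0, [] {
        std::cout << cur_tick << ": processNextReqEvent-like event\n";
        schedule(0, [] { std::cout << cur_tick << ": releaseLayer-like event\n"; });
    });
    while (!events.empty()) {
        Event e = events.top(); events.pop();
        cur_tick = e.tick;
        e.action();
    }
}
----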
One line summary of events:
#5: adds the request to the DRAM queue, and immediately schedules a DRAMCtrl::processNextReqEvent, which will later see that request
#8: picks up the only request from the DRAM read queue (readQueue) and services that.
If there were multiple requests, priority arbitration under DRAMCtrl::chooseNext could choose a different one than the first based on packet priorities.
This puts the request on the response queue respQueue and schedules another DRAMCtrl::processNextReqEvent, but the request queue is empty, so that does not schedule further events.
#17: picks up the only request from the DRAM response queue and services that by placing it in yet another queue, and scheduling the PacketQueue::processSendEvent which will later pick up that packet
#19: picks up the request from the previous queue, and forwards it to another queue, and schedules yet another PacketQueue::processSendEvent
The current one is the DRAM passing the message to the XBar, and the next processSendEvent is the XBar finally sending it back to the CPU
#23: the XBar port is actually sending the reply back.
It knows to which CPU core to send the reply because ports keep a map of requests to their source:
const auto route_lookup = routeTo.find(pkt->req);
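The idea behind that lookup, sketched with simplified stand-in types (gem5’s real routeTo bookkeeping is more involved):

[source,cpp]
----
// Sketch of the routing idea with simplified stand-in types: when a request
// goes through the crossbar, remember which port it came from, so that the
// response (which carries the same request pointer) can be routed back.
#include <cassert>
#include <unordered_map>

using RequestPtr = const void *;  // stand-in for gem5's RequestPtr
using PortID = int;

std::unordered_map<RequestPtr, PortID> routeTo;

void recvRequest(RequestPtr req, PortID src_port) {
    routeTo[req] = src_port;      // remember where the request came from
    // ... forward the request downstream ...
}

PortID recvResponse(RequestPtr req) {
    const auto route_lookup = routeTo.find(req);  // the line quoted above
    assert(route_lookup != routeTo.end());
    PortID dest = route_lookup->second;
    routeTo.erase(route_lookup);  // entry no longer needed
    return dest;                  // send the response back through this port
}

int main() {
    int some_request;
    recvRequest(&some_request, 3);
    assert(recvResponse(&some_request) == 3);
}
----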
Perhaps the awesomeness of Buildroot only sinks in once you notice that all it takes is 4 commands as explained at Section 20.11, “Buildroot hello world”.
This repo basically wraps around that, and tries to make everything even more awesome for kernel developers by adding the capability of seamlessly running the stuff you’ve built on emulators, usually via ./run.
The downsides of Buildroot are:
The hard part is dealing with crappy third party build systems and huge dependency chains.
it is written in Make and Bash rather than Python like LKMC
As this repo develops however, we’ve started taking some of the build out of Buildroot, e.g. notably the Linux kernel to have more build flexibility and faster build startup times.
Therefore, more and more, this repo wants to take over everything that Buildroot does, and one day completely replace it to achieve emulation Nirvana.
Users of this repo will often want to update the compilation toolchain to the latest version to get fresh new features like new ISA instructions.
By default, our build system uses build-linux, and the Buildroot kernel build is disabled: https://stackoverflow.com/questions/52231793/can-buildroot-build-the-root-filesystem-without-building-the-linux-kernel
There are however some cases where we want that ability, e.g.: kernel_modules buildroot package and Benchmark Linux kernel boot.
The build of the kernel can be enabled with the --build-linux option of build-buildroot.
For example, to build the kernel and then boot it you could do:
./build-buildroot --arch aarch64 --build-linux
./run --arch aarch64 --linux-exec "$(./getvar --arch aarch64 TODO)/vmlinux"
TODO: fails on LKMC d53ffcff18aa26d24ea34b86fb80e4a5694378dch with "ERROR: No hash found for linux-4.19.16.tar.xz": https://github.com/cirosantilli/linux-kernel-module-cheat/issues/115
Note that this kernel is not configured at all by LKMC, and there is no support to do that currently: the Buildroot default kernel configs for a target are used unchanged, e.g. make qemu_aarch64_virt_defconfig, see also: About Buildroot’s kernel configs.
Therefore, this kernel might be missing certain key capabilities, e.g. filesystem support required to boot.
The cache sizes were chosen to match the host P51 to improve the comparison. Ideally we should also use the same standard library.
Note that this will take a long time, and will produce a humongous ~40Gb stats file as explained at: Section 19.9.3.1, “gem5 only dump selected stats”
Sources:
It eventually has to come to that, doesn’t it?
userland/gcc/busy_loop.c described at Infinite busy loop
Tests under userland/libs require certain optional libraries to be installed on the target, and are not built or tested by default; you must enable them with either:
See for example BLAS.
The following basenames should always refer to programs that do the same thing, but in different languages:
Sample results at LKMC 8fb9db39316d43a6dbd571e04dd46ae73915027f:
TODO: aarch64 gem5 and QEMU use the same kernel, so why is the gem5 instruction count so much higher?
P51 Ubuntu 19.10 LKMC b11e3cd9fb5df0e3fe61de28e8264bbc95ea9005 gem5 e779c19dbb51ad2f7699bd58a5c7827708e12b55 aarch64: 143s. Why the huge increase from 70s in the above table? Kernel size is also huge BTW: 147MB.
Note that https://gem5.atlassian.net/browse/GEM5-337 "ARM PAuth patch slows down Linux boot 2x from 2 minutes to 4 minutes" was already semi fixed at that point.
Same but with the Buildroot vanilla kernel (kernel v4.19): 44s to blow up at "Please append a correct "root=" boot option; here are the available partitions" because some filesystem mount option is missing. But it likely wouldn’t take much longer until after boot, since we are almost done by then! Therefore this vanilla kernel is much, much faster! TODO find which config or kernel commit added so much time! Also that kernel is tiny at 8.5MB.
Same but with: gem5 arm Linux kernel patches at v4.15: 73s, kernel size: 132M.
Same but gem5 d7d9bc240615625141cd6feddbadd392457e49eb (2018-06-17) hacked with -Wnoerror: 11m 37s. So there was a huge regression in the last two years! We have to find it.
A profiling of the build has been done at: https://gem5.atlassian.net/browse/GEM5-277. Analysis there showed that d7d9bc240615625141cd6feddbadd392457e49eb (2018-06-17) is also composed of 50% pybind11, with no obvious time sinks.
Zephyr is an RTOS that has POSIX support. I think it works much like our Baremetal setup, which uses Newlib and generates individual ELF files that contain both our C program’s code and the Zephyr libraries.
TODO get a hello world working, and then consider further integration in this repo, e.g. being able to run all C userland content on it.
TODO: Cortex-A CPUs are not currently supported: there are some qemu_cortex_m0 boards, but I can’t find a QEMU Cortex-A one. There is an x86_64 qemu board, but we don’t currently have an x86 baremetal toolchain. For this reason, we won’t touch this further for now.
However, unlike Newlib, Zephyr must be setting up a simple pre-main runtime to be able to handle threads.
Failed attempt:
west build -b qemu_aarch64 samples/hello_world
The build system of that project is a bit excessive / wonky. You need an edge CMake not present in Ubuntu 18.04, which I don’t want to install right now, and it uses the weird custom west build tool frontend.
TODO minimal setup to run it on QEMU? Possible?
Algorithms to keep the caches of different cores of a system coherent. This only matters for multicore systems.
The main goal of such systems is to reduce the number of messages that have to be sent on the coherency bus, and even more importantly, to memory (which passes first through the coherency bus).
The main software use case example to have in mind is that of multiple threads incrementing an atomic counter as in userland/cpp/atomic/std_atomic.cpp, see also: atomic.cpp. Then, if one processor writes to the cache, other processors have to know about it before they read from that address.
Note that cache coherency only applies to memory read/write instructions that explicitly make coherency requirements.
In most ISAs, this tends to be the minority of instructions, and is only used when something is going to modify memory that is known to be shared across threads. For example, the x86 LOCK prefix would be used to increment atomic counters that get incremented across several threads. Outside of those cases, cache coherency is not guaranteed, and behaviour is undefined.
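For concreteness, here is a minimal standalone version of that atomic counter use case, in the spirit of userland/cpp/atomic/std_atomic.cpp (a fresh sketch, not a copy of that file):

[source,cpp]
----
// Minimal atomic counter example in the spirit of
// userland/cpp/atomic/std_atomic.cpp (a fresh sketch, not that exact file).
// On x86, fetch_add on a std::atomic typically compiles to a LOCK-prefixed
// instruction, which is what forces the caches to stay coherent here.
// Build e.g. with: g++ -pthread atomic_counter.cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

int main() {
    std::atomic<unsigned long> counter{0};
    std::vector<std::thread> threads;
    const unsigned nthreads = 4, niters = 100000;
    for (unsigned i = 0; i < nthreads; ++i)
        threads.emplace_back([&] {
            for (unsigned j = 0; j < niters; ++j)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto &t : threads)
        t.join();
    // With a plain (non-atomic) counter this could fail due to lost
    // updates between cores.
    assert(counter == nthreads * niters);
}
----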
Either they can snoop only control, or both control and data can be snooped.
The answer to this determines if some of the following design decisions make sense.
This is the central point in question at: https://electronics.stackexchange.com/questions/484830/why-is-a-flush-needed-in-the-msi-cache-coherency-protocol-when-moving-from-modif
If data snoops are not possible, then data must always go to DRAM first.
Mentioned at:
that read is marked as exclusive, and all caches that had it snoop it and become invalid.
Upside: no need to send the new data to the bus.
when the cache is full, eviction leads to a write to memory.
If multiple valid holders may exist, then this may lead to multiple write through evictions of the same thing.
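A minimal sketch of the single-valid-holder Valid/Invalid scheme described above, under the interpretation that any snooped bus access invalidates our copy (event names are made up for illustration):

[source,cpp]
----
// Sketch of the Valid/Invalid scheme described above (event names are made
// up for illustration): a single Valid holder, every snooped bus access
// invalidates everyone else, and evicting a Valid line writes it to memory.
#include <iostream>

enum class State { V, I };
enum class Event { PrRd, PrWr, BusAccess, Evict };

// Returns the next state; write_back is set when memory must be updated.
State vi_next(State s, Event e, bool &write_back) {
    write_back = false;
    switch (e) {
    case Event::PrRd:
    case Event::PrWr:      return State::V;  // our access: we become the holder
    case Event::BusAccess: return State::I;  // someone else accessed: invalidate
    case Event::Evict:
        write_back = (s == State::V);        // Valid data must reach memory
        return State::I;
    }
    return s;
}

int main() {
    bool wb;
    State s = vi_next(State::I, Event::PrRd, wb);   // I -> V
    s = vi_next(s, Event::Evict, wb);               // V -> I, write back
    // This is the wasteful write through eviction after a mere read:
    std::cout << "write back on eviction: " << wb << '\n';  // 1
}
----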
This is the most basic non-trivial coherency protocol, and therefore the first one you should learn.
Compared to the VI cache coherence protocol, MSI:
adds one bit of knowledge per cache line (shared)
splits Valid into Modified and Shared depending on the shared bit
this allows us to not send BusUpgr messages on the bus when writing to Modified, since now we know that the data is not present in any other cache!
Helpful video: https://www.youtube.com/watch?v=gAUVAel-2Fg "MSI Coherence - Georgia Tech - HPCA: Part 5" by Udacity.
Therefore, it does not need to fetch the data, which saves bus traffic compared to "Bus write" since the data itself does not need to be sent.
This is also called a Bus Upgrade message or BusUpgr, as it informs others that the value is going to be upgraded.
+"Write back": send the data on the bus and tell someone to pick it up: either DRAM or another cache
Since we know what the latest data is, we can move to "Shared" rather than "Invalid" to possibly save time on future reads.
But to do that, we need to write the data back to DRAM to keep the Shared state consistent. The MESI cache coherence protocol prevents that extra read in some cases.
And it has to happen either before the other cache gets its data from DRAM, or better, the other cache can get its data from our write back itself, just like the DRAM does.
TODO gem5 concrete example.
TODO understand well why those are needed.
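While the gem5 example is TODO, here is a compact sketch of the standard textbook MSI transitions, using the PrRd/PrWr (processor side) and BusRd/BusRdX/BusUpgr (snooped bus side) terminology of this section:

[source,cpp]
----
// Compact sketch of the standard textbook MSI transitions, using the
// PrRd/PrWr (processor side) and BusRd/BusRdX/BusUpgr (snooped bus side)
// terminology from this section.
#include <iostream>

enum class State { M, S, I };
enum class Event { PrRd, PrWr, BusRd, BusRdX, BusUpgr };

// Returns the next state; sets flush when we must put our data on the bus.
State msi_next(State s, Event e, bool &flush) {
    flush = false;
    switch (s) {
    case State::I:
        if (e == Event::PrRd) return State::S;   // send BusRd
        if (e == Event::PrWr) return State::M;   // send BusRdX
        return State::I;
    case State::S:
        if (e == Event::PrWr) return State::M;   // send BusUpgr: no data needed
        if (e == Event::BusRdX || e == Event::BusUpgr) return State::I;
        return State::S;                         // PrRd and BusRd: no change
    case State::M:
        if (e == Event::BusRd)  { flush = true; return State::S; }
        if (e == Event::BusRdX) { flush = true; return State::I; }
        return State::M;                         // PrRd/PrWr hit: no bus traffic
    }
    return s;
}

int main() {
    bool flush;
    State s = msi_next(State::I, Event::PrWr, flush);  // I -> M via BusRdX
    s = msi_next(s, Event::BusRd, flush);              // M -> S, flush to bus
    std::cout << "flushed: " << flush << '\n';         // 1
}
----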
Splits the Shared state of the MSI cache coherence protocol into a new Exclusive state:
MESI Exclusive: clean but only present in one cache
MESI Shared: clean but present in more than one cache
The advantages over MSI are:
when we move from Exclusive to Shared, no DRAM write back is needed, because we know that the cache is clean
when we move from Exclusive to Modified, no invalidate message is required, reducing bus traffic: https://en.wikipedia.org/wiki/MESI_protocol#Advantages_of_MESI_over_MSI
Exclusive is entered from Invalid after a PrRd, but only if the reply came from DRAM (or if we snooped that no other cache sent the reply, so it must have come from DRAM)! If the reply came from another cache, we go directly to Shared instead. It is this extra information that allows the split of S.
The no-invalidate upgrade is a common case in read-modify-write loops. In MSI, the PrRd would send BusRd (to move any M to S), get the data and go to Shared; the PrWr must then send BusUpgr to invalidate other Shared copies and move to M. With MESI, the PrRd can go to E instead of S depending on who services it. If it does go to E, then the PrWr only moves it to M: there is no need to send BusUpgr, because we know that no one else is in S.
gem5 12c917de54145d2d50260035ba7fa614e25317a3 has two Ruby MESI models implemented: MESI_Two_Level and MESI_Three_Level.
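The read-modify-write saving boils down to one rule: a write hit only needs a bus message when the line might be shared. A sketch of that rule (textbook MESI, simplified):

[source,cpp]
----
// Sketch of the MESI write-hit rule: only a write to a possibly-shared
// line needs a BusUpgr; writes to Exclusive upgrade to Modified silently.
#include <iostream>

enum class Mesi { M, E, S, I };

// Returns true if a PrWr hit in this state must send BusUpgr on the bus.
bool write_needs_busupgr(Mesi s) {
    switch (s) {
    case Mesi::S: return true;   // others may hold the line: invalidate them
    case Mesi::E:                // we are the only holder: silent E -> M
    case Mesi::M: return false;  // already exclusive and dirty
    case Mesi::I: return false;  // not a hit: a write miss sends BusRdX instead
    }
    return false;
}

int main() {
    // In MSI the read of a read-modify-write loop can only land in S, so the
    // following write always costs a BusUpgr; in MESI a read serviced by DRAM
    // lands in E and the write upgrade is free:
    std::cout << write_needs_busupgr(Mesi::S) << ' '   // 1: MSI-like case
              << write_needs_busupgr(Mesi::E) << '\n'; // 0: MESI benefit
}
----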
See: https://en.wikipedia.org/wiki/MOSI_protocol. The critical MSI vs MOSI section was a bit bogus though: https://en.wikipedia.org/w/index.php?title=MOSI_protocol&oldid=895443023 and we have to edit it.
In MSI, it feels wasteful that an M to S transition needs to flush to memory: why do we need to flush right now, since even more caches now have that data? Why not wait until later and try to gain something from this deferral?
The problem with doing that in MSI is that not flushing on an M to S transition would force us to flush on every S eviction. So we would end up flushing even after reads!
MOSI solves that by making M move to O instead of S on BusRd. Now, O alone is responsible for the flush back on eviction.
So, in case we had:
processor 1: M
processor 2: I, then a read
processor 1: then a write
an MSI cache 1 would do:
write to main memory, go to S
BusUpgr, go back to M, cache 2 goes back to I
and a MOSI cache 1 would do:
go to O (no bus traffic)
BusUpgr, go back to M
This therefore saves one memory write through and its bus traffic.
TODO compare to MSI and understand advantages. From Wikipedia it seems that MOSI can get data from the Owned cache while MSI cannot get data from Shared caches and must go to memory, but why not? Why do we need that Owned? Is it because there are multiple Shared caches and them all replying at the same time would lead to problems?
MESI cache coherence protocol + MOSI cache coherence protocol, not much else to it!
gem5 12c917de54145d2d50260035ba7fa614e25317a3 has several Ruby MOESI models implemented: MOESI_AMD_Base, MOESI_CMP_directory, MOESI_CMP_token and MOESI_hammer.
A custom build script can give you more flexibility: e.g. the package can be made to work with other root filesystems more easily, have better 9P support, and rebuild faster as it evades some Buildroot boilerplate.
As you have just seen, this sets up everything so that modprobe can correctly find the module.
./build-buildroot --build-linux and ./run --buildroot-linux are needed because the Buildroot kernel modules must use the Buildroot Linux kernel at build and run time, see also: Buildroot vanilla kernel.
The --no-overlay is required otherwise our modules.order generated by ./build-linux and installed with BR2_ROOTFS_OVERLAY overwrites the Buildroot generated one.