diff --git a/index.html b/index.html
index 1b5bc67..d34c89f 100644
--- a/index.html
+++ b/index.html
@@ -548,6 +548,7 @@
  • 2.5. GDB step debug early boot
  • 2.6. GDB step debug userland processes
@@ -1160,12 +1161,13 @@
  • 19.9. m5out directory
  • 19.10. m5term
@@ -1309,12 +1311,13 @@
  • 20.9. Report upstream bugs
  • 20.10. libc choice
  • 20.11. Buildroot hello world
-  • 20.12. Update the toolchain
+  • 20.12. Update the Buildroot toolchain
+  • 20.13. Buildroot vanilla kernel
  • 21. Userland content
@@ -1432,15 +1435,16 @@
  • 21.8.4.5. PARSEC benchmark hacking
-  • 21.8.5. userland/libs directory
+  • 21.9. Micro benchmarks
+  • 21.10. userland/libs directory
-  • 21.9. Userland content filename conventions
-  • 21.10. Userland content bibliography
+  • 21.11. Userland content filename conventions
+  • 21.12. Userland content bibliography
  • 22. Userland assembly
@@ -1925,7 +1929,12 @@
  • 29.5. Benchmark this repo bibliography
-  • 30. Xephyr
+  • 30. RTOS
  • 31. Compilers
    -

    For example, to use a two level MESI protocol we can do:

    +

    For example, to use a two level MESI cache coherence protocol we can do:

@@ -22085,24 +22168,24 @@ info: Entering event queue @ 0. Starting simulation...
     |
     +---+
     |   |
-    6   7        6 DRAMCtrl::processNextReqEvent
+    6   7        6 DRAMCtrl::processNextReqEvent (0)
     8   15       7 BaseXBar::Layer::releaseLayer
     |
 +---+---+
 |   |   |
 9   10  11      9 DRAMCtrl::Rank::processActivateEvent
-12  17  16      10 DRAMCtrl::processRespondEvent
-|   |           11 DRAMCtrl::processNextReqEvent
+12  17  16      10 DRAMCtrl::processRespondEvent (46.25)
+|   |           11 DRAMCtrl::processNextReqEvent (5)
 |   |
 13  18          13 DRAMCtrl::Rank::processPowerEvent
-14  19          18 PacketQueue::processSendEvent
+14  19          18 PacketQueue::processSendEvent (28)
 |
 +---+
 |   |
-    20  21      20 PacketQueue::processSendEvent
+    20  21      20 PacketQueue::processSendEvent (2.75)
     23  22      21 BaseXBar::Layer<SrcType, DstType>::releaseLayer
     |
-    24          24 TimingSimpleCPU::IcachePort::ITickEvent::process
+    24          24 TimingSimpleCPU::IcachePort::ITickEvent::process (0)
     25
     |
     +---+
@@ -22138,8 +22221,8 @@ info: Entering event queue @ 0. Starting simulation...
        |   |
    -    6   7    6 DRAMCtrl::processNextReqEvent
    -    8   15   7 BaseXBar::Layer::releaseLayer
    +    6   7    6 DRAMCtrl::processNextReqEvent (0)
    +    8   15   7 BaseXBar::Layer::releaseLayer (0)
         |
    @@ -22149,13 +22232,13 @@ info: Entering event queue @ 0. Starting simulation...
    • -

      6: schedule DRAMCtrl::processNextReqEvent

      +

      6: schedule DRAMCtrl::processNextReqEvent to run in 0 ns after the execution that scheduled it

    • 8: execute DRAMCtrl::processNextReqEvent

    • -

      7: schedule BaseXBar::Layer::releaseLayer

      +

      7: schedule BaseXBar::Layer::releaseLayer to run in 0 ns after the execution that scheduled it

    • 15: execute BaseXBar::Layer::releaseLayer

      @@ -22171,6 +22254,45 @@ info: Entering event queue @ 0. Starting simulation...

Observe how the events leading up to the second instruction are basically a copy of those of the first one: this is the basic TimingSimpleCPU event loop in action.

      +
      +

      One line summary of events:

      +
      +
      +
        +
      • +

#5: adds the request to the DRAM queue, and schedules a DRAMCtrl::processNextReqEvent, which immediately sees that request when it runs later on.

        +
      • +
      • +

        #8: picks up the only request from the DRAM read queue (readQueue) and services that.

        +
        +

If there were multiple requests, priority arbitration under DRAMCtrl::chooseNext could choose a different one than the first based on packet priorities.

        +
        +
        +

This puts the request on the response queue respQueue and schedules another DRAMCtrl::processNextReqEvent, but the request queue is now empty, so that event does not schedule any further events.

        +
        +
      • +
      • +

        #17: picks up the only request from the DRAM response queue and services that by placing it in yet another queue, and scheduling the PacketQueue::processSendEvent which will later pick up that packet

        +
      • +
      • +

        #19: picks up the request from the previous queue, and forwards it to another queue, and schedules yet another PacketQueue::processSendEvent

        +
        +

        The current one is the DRAM passing the message to the XBar, and the next processSendEvent is the XBar finally sending it back to the CPU

        +
        +
      • +
      • +

        #23: the XBar port is actually sending the reply back.

        +
        +

It knows which CPU core to send it to because ports keep a map of requests to sources:

        +
        +
        +
        +
        const auto route_lookup = routeTo.find(pkt->req);
        +
        +
        +
      • +
      +
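
All of the above is just gem5’s discrete event queue at work: executing one event schedules further events at later ticks. The following is a rough, self-contained toy model of that pattern (plain C++, not gem5 code; the event names and the 0/46/28 ns delays are copied loosely from the trace above, and the schedule helper is invented for the illustration). In real gem5 the equivalent machinery is EventFunctionWrapper plus schedule(event, tick).

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <queue>
    #include <string>
    #include <vector>

    struct Event {
        uint64_t tick;              // when to run, in ns for this toy
        uint64_t id;                // insertion order breaks ties
        std::string name;
        std::function<void()> run;
    };

    struct Compare {
        bool operator()(const Event &a, const Event &b) const {
            return a.tick != b.tick ? a.tick > b.tick : a.id > b.id;
        }
    };

    int main() {
        std::priority_queue<Event, std::vector<Event>, Compare> queue;
        uint64_t cur_tick = 0, next_id = 0;

        // "schedule X to run in <delay> ns after the execution that scheduled it"
        auto schedule = [&](const std::string &name, uint64_t delay,
                            std::function<void()> run) {
            queue.push(Event{cur_tick + delay, next_id++, name, std::move(run)});
        };

        // #5: enqueue the DRAM request and schedule processNextReqEvent in 0 ns
        schedule("DRAMCtrl::processNextReqEvent", 0, [&] {
            // #8: service the request, put the response on respQueue and
            // schedule the respond event for later
            schedule("DRAMCtrl::processRespondEvent", 46, [&] {
                // #17: hand the packet over to the port's send queue
                schedule("PacketQueue::processSendEvent", 28, [&] {});
            });
        });

        // the event loop: pop the earliest event, advance time, execute it
        while (!queue.empty()) {
            Event e = queue.top();
            queue.pop();
            cur_tick = e.tick;
            std::cout << cur_tick << " ns: " << e.name << "\n";
            e.run();
        }
    }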
      19.19.4.2.1. TimingSimpleCPU analysis #0
      @@ -23485,9 +23607,6 @@ build/ARM/config/the_isa.hh

      Perhaps the awesomeness of Buildroot only sinks in once you notice that all it takes is 4 commands as explained at Section 20.11, “Buildroot hello world”.

      -

      This repo basically wraps around that, and tries to make everything even more awesome for kernel developers.

      -
      -

      The downsides of Buildroot are:

      @@ -23504,6 +23623,28 @@ build/ARM/config/the_isa.hh

      The hard part is dealing with crappy third party build systems and huge dependency chains.

    • +
    • +

      it is written in Make and Bash rather than Python like LKMC

      +
    • +
    +
    +
    +

This repo basically wraps around that, and tries to make everything even more awesome for kernel developers by adding the capability of seamlessly running the stuff you’ve built on emulators, usually via ./run.

    +
    +
    +

As this repo develops, however, we’ve started taking some of the build out of Buildroot, notably the Linux kernel, to have more build flexibility and faster build startup times.

    +
    +
    +

    Therefore, more and more, this repo wants to take over everything that Buildroot does, and one day completely replace it to achieve emulation Nirvana, see e.g.:

    +
    +
    @@ -24065,7 +24206,7 @@ git -C "$(./getvar qemu_source_dir)" checkout -
    -

    20.12. Update the toolchain

    +

    20.12. Update the Buildroot toolchain

    Users of this repo will often want to update the compilation toolchain to the latest version to get fresh new features like new ISA instructions.

    @@ -24256,6 +24397,36 @@ cd ../..
    +
    +

    20.13. Buildroot vanilla kernel

    + +
    +

There are however some cases where we want that ability, e.g. the kernel_modules buildroot package and Benchmark Linux kernel boot.

    +
    +
    +

The build of the kernel can be enabled with the --build-linux option of build-buildroot.

    +
    +
    +

    For example, to build the kernel and then boot it you could do:

    +
    +
    +
    +
    ./build-buildroot --arch aarch64 --build-linux
    +./run --arch aarch64 --linux-exec "$(./getvar --arch aarch64 TODO)/vmlinux"
    +
    +
    +
    +

    TODO: fails on LKMC d53ffcff18aa26d24ea34b86fb80e4a5694378dch with "ERROR: No hash found for linux-4.19.16.tar.xz": https://github.com/cirosantilli/linux-kernel-module-cheat/issues/115

    +
    +
    +

Note that this kernel is not configured at all by LKMC, and there is no support to do that currently: the Buildroot default kernel configs for a target are used unchanged, e.g. make qemu_aarch64_virt_defconfig, see also: About Buildroot’s kernel configs.

    +
    +
    +

    Therefore, this kernel might be missing certain key capabilities, e.g. filesystem support required to boot.

    +
    +
    @@ -25940,7 +26111,7 @@ xdg-open bst_vs_heap_vs_hashmap_gem5.tmp.png

    The cache sizes were chosen to match the host P51 to improve the comparison. Ideally we should also use the same standard library.

    -

    Note that this will take a long time, and will produce a humongous ~40Gb stats file as explained at: Section 19.9.2.1, “gem5 only dump selected stats”

    +

    Note that this will take a long time, and will produce a humongous ~40Gb stats file as explained at: Section 19.9.3.1, “gem5 only dump selected stats”

    Sources:

    @@ -26505,8 +26676,22 @@ git clean -xdf .
    -
    -

    21.8.5. userland/libs directory

    +
    +
    +

    21.9. Micro benchmarks

    +
    +

It eventually has to come to that, doesn’t it?

    +
    + +
    +
    +

    21.10. userland/libs directory

Tests under userland/libs require certain optional libraries to be installed on the target, and are not built or tested by default; you must enable them with either:

    @@ -26519,8 +26704,8 @@ git clean -xdf .

    See for example BLAS.

    -
    -
    -

    21.9. Userland content filename conventions

    +

    21.11. Userland content filename conventions

    The following basenames should always refer to programs that do the same thing, but in different languages:

    @@ -26563,7 +26747,7 @@ git clean -xdf .
    -

    21.10. Userland content bibliography

    +

    21.12. Userland content bibliography

@@ -34381,7 +34565,7 @@ cat "$(./getvar test_boot_benchmark_file)"
    -

    Sample results at 8fb9db39316d43a6dbd571e04dd46ae73915027f:

    +

    Sample results at LKMC 8fb9db39316d43a6dbd571e04dd46ae73915027f:

    @@ -34455,6 +34639,18 @@ instructions 124346081

    TODO: aarch64 gem5 and QEMU use the same kernel, so why is the gem5 instruction count so much much higher?

    +
    +

P51 Ubuntu 19.10 LKMC b11e3cd9fb5df0e3fe61de28e8264bbc95ea9005 gem5 e779c19dbb51ad2f7699bd58a5c7827708e12b55 aarch64: 143s. Why the huge increase from 70s in the table above? Kernel size is also huge BTW: 147MB.

    +
    +
    +

    Note that https://gem5.atlassian.net/browse/GEM5-337 "ARM PAuth patch slows down Linux boot 2x from 2 minutes to 4 minutes" was already semi fixed at that point.

    +
    +
    +

Same but with Buildroot vanilla kernel (kernel v4.19): 44s to blow up at "Please append a correct "root=" boot option; here are the available partitions" because it is missing some filesystem mount option. But it likely wouldn’t take much longer until after boot, since we are almost done by then! Therefore this vanilla kernel is much much faster! TODO find which config or kernel commit added so much time! Also that kernel is tiny at 8.5MB.

    +
    +
    +

    Same but with: gem5 arm Linux kernel patches at v4.15: 73s, kernel size: 132M.

    +
    29.2.1.1. gem5 arm HPI boot takes much longer than aarch64
    @@ -35052,7 +35248,7 @@ tail -n+1 ../linux-kernel-module-cheat-regression/*/gem5-bench-build-*.txt

    Same but gem5 d7d9bc240615625141cd6feddbadd392457e49eb (2018-06-17) hacked with -Wnoerror: 11m 37s. So there was a huge regression in the last two years! We have to find it out.

    -

    A profiling of the build has been done at: https://gem5.atlassian.net/browse/GEM5-277

    +

A profiling of the build has been done at https://gem5.atlassian.net/browse/GEM5-277 and analysis there showed that d7d9bc240615625141cd6feddbadd392457e49eb (2018-06-17) is also composed of 50% pybind11, with no other obvious time sinks.

    29.2.3.3.1. pybind11 accounts for 50% of gem5 build time
    @@ -35206,10 +35402,15 @@ tail -n+1 ../linux-kernel-module-cheat-regression/*/gem5-bench-build-*.txt
    -

    30. Xephyr

    +

    30. RTOS

    +
    +

    30.1. Zephyr

    -

    Xephyr is an RTOS that has POSIX support. I think it works much like our Baremetal setup which uses Newlib and generates individual ELF files that contain both our C program’s code, and the Xephyr libraries.

    +

    https://en.wikipedia.org/wiki/Zephyr_(operating_system)

    +
    +
    +

    Zephyr is an RTOS that has POSIX support. I think it works much like our Baremetal setup which uses Newlib and generates individual ELF files that contain both our C program’s code, and the Zephyr libraries.

    TODO get a hello world working, and then consider further integration in this repo, e.g. being able to run all C userland content on it.

    @@ -35218,7 +35419,7 @@ tail -n+1 ../linux-kernel-module-cheat-regression/*/gem5-bench-build-*.txt

TODO: Cortex-A CPUs are not currently supported: there are some qemu_cortex_m0 boards, but we can’t find a QEMU Cortex-A one. There is an x86_64 qemu board, but we don’t currently have an x86 baremetal toolchain. For this reason, we won’t touch this further for now.

    -

    However, unlike Newlib, Xephyr must be setting up a simple pre-main runtime to be able to handle threads.

    +

    However, unlike Newlib, Zephyr must be setting up a simple pre-main runtime to be able to handle threads.

    Failed attempt:

    @@ -35246,6 +35447,16 @@ west build -b qemu_aarch64 samples/hello_world

    The build system of that project is a bit excessive / wonky. You need an edge CMake not present in Ubuntu 18.04, which I don’t want to install right now, and it uses the weird custom west build tool frontend.

    +
    +

    30.2. ARM Mbed

    + +
    +

    TODO minimal setup to run it on QEMU? Possible?

    +
    +
    +

    31. Compilers

    @@ -35285,19 +35496,37 @@ west build -b qemu_aarch64 samples/hello_world

    https://en.wikipedia.org/wiki/Cache_coherence

    -

    Algorithms to keep the caches of different cores of a system coherent.

    +

    Algorithms to keep the caches of different cores of a system coherent. Only matters for multicore systems.

    -

    The main goal of such systems is to reduce the number of messages that have to be sent on the coherency bus, and most importantly, to memory (which passes first through the coherency bus).

    +

    The main goal of such systems is to reduce the number of messages that have to be sent on the coherency bus, and even more importantly, to memory (which passes first through the coherency bus).

    -

    E.g.: if one processors writes to the cache, other processors have to know about it before they read from that address.

    +

The main software use case example to have in mind is that of multiple threads incrementing an atomic counter as in userland/cpp/atomic/std_atomic.cpp, see also: atomic.cpp. Then, if one processor writes to the cache, other processors have to know about it before they read from that address.

    -

    The main software use case example to have in mind is that of multiple threads incrementing an atomic counter as in userland/cpp/atomic/std_atomic.cpp, see also: atomic.cpp.

    +

Note that cache coherency only applies to memory read/write instructions that explicitly request coherency.

    +
    +
    +

In most ISAs, this tends to be the minority of instructions, and is only used when something is going to modify memory that is known to be shared across threads. For example, the x86 LOCK prefix would be used to increment atomic counters that get incremented across several threads. Outside of those cases, cache coherency is not guaranteed, and behaviour is undefined.
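
A minimal sketch of that counter use case (not the actual userland/cpp/atomic/std_atomic.cpp, just the same idea with arbitrary thread and iteration counts): each fetch_add below compiles down to one of those instructions with coherency requirements, e.g. a LOCK-prefixed add on x86.

    #include <atomic>
    #include <cassert>
    #include <thread>
    #include <vector>

    int main() {
        std::atomic<unsigned long> counter(0);
        std::vector<std::thread> threads;
        for (int i = 0; i < 4; ++i) {
            threads.emplace_back([&]() {
                for (int j = 0; j < 100000; ++j)
                    // one coherent read-modify-write, e.g. lock xadd on x86
                    counter.fetch_add(1, std::memory_order_relaxed);
            });
        }
        for (auto &t : threads)
            t.join();
        // coherency plus atomicity guarantee that no increment is lost
        assert(counter == 4 * 100000ul);
    }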

    -

    32.1.1. VI protocol

    +

    32.1.1. Can caches snoop data from other caches?

    +
    +

    Either they can snoop only control, or both control and data can be snooped.

    +
    +
    +

    The answer to this determines if some of the following design decisions make sense.

    +
    + +
    +

If data snoops are not possible, then data must always go to DRAM first.

    +
    +
    +
    +

    32.1.2. VI cache coherence protocol

    Mentioned at:

    @@ -35338,7 +35567,7 @@ west build -b qemu_aarch64 samples/hello_world
    • -

      that read is marked as exclusive, and all caches that had it snoop and become invalid.

      +

that read is marked as exclusive, and all caches that had it snoop it and become invalid.

      Upside: no need to send the new data to the bus.

      @@ -35374,7 +35603,7 @@ west build -b qemu_aarch64 samples/hello_world
    • when the cache is full, eviction leads to a write to memory.

      -

      If multiple valid holders may exist, then this may lead to multiple

      +

      If multiple valid holders may exist, then this may lead to multiple write through evictions of the same thing.

    @@ -35544,7 +35773,7 @@ west build -b qemu_aarch64 samples/hello_world
    -

    32.1.2. MSI protocol

    +

    32.1.3. MSI cache coherence protocol

    @@ -35552,6 +35781,22 @@ west build -b qemu_aarch64 samples/hello_world

    This is the most basic non-trivial coherency protocol, and therefore the first one you should learn.

    +

    Compared to the VI cache coherence protocol, MSI:

    +
    +
    +
      +
    • +

      adds one bit of knowledge per cache line (shared)

      +
    • +
    • +

      splits Valid into Modified and Shared depending on the shared bit

      +
    • +
    • +

this allows us to not send BusUpgr messages on the bus when writing to Modified, since we now know that the data is not present in any other cache!

      +
    • +
    +
    +

    Helpful video: https://www.youtube.com/watch?v=gAUVAel-2Fg "MSI Coherence - Georgia Tech - HPCA: Part 5" by Udacity.
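
To make the states and bus messages concrete, here is a toy single-cache-line MSI transition table (a pedagogical sketch with invented names, not how gem5 Ruby or real hardware implements it):

    #include <cassert>

    enum class State { Modified, Shared, Invalid };
    enum class CpuOp { PrRd, PrWr };
    enum class BusMsg { None, BusRd, BusRdX, BusUpgr, Flush };

    // What this cache does when its own CPU reads or writes the line.
    BusMsg cpu_access(State &s, CpuOp op) {
        switch (s) {
        case State::Invalid:
            s = (op == CpuOp::PrRd) ? State::Shared : State::Modified;
            return (op == CpuOp::PrRd) ? BusMsg::BusRd : BusMsg::BusRdX;
        case State::Shared:
            if (op == CpuOp::PrWr) { s = State::Modified; return BusMsg::BusUpgr; }
            return BusMsg::None;   // read hit: no bus traffic
        case State::Modified:
            return BusMsg::None;   // we own the only copy: no BusUpgr needed
        }
        return BusMsg::None;
    }

    // What this cache does when it snoops another cache's bus message.
    BusMsg snoop(State &s, BusMsg msg) {
        if (s == State::Modified && (msg == BusMsg::BusRd || msg == BusMsg::BusRdX)) {
            s = (msg == BusMsg::BusRd) ? State::Shared : State::Invalid;
            return BusMsg::Flush;  // write the dirty data back
        }
        if (s == State::Shared && (msg == BusMsg::BusRdX || msg == BusMsg::BusUpgr))
            s = State::Invalid;
        return BusMsg::None;
    }

    int main() {
        State c1 = State::Invalid, c2 = State::Invalid;
        assert(cpu_access(c1, CpuOp::PrRd) == BusMsg::BusRd);   // c1: I -> S
        assert(cpu_access(c2, CpuOp::PrWr) == BusMsg::BusRdX);  // c2: I -> M
        assert(snoop(c1, BusMsg::BusRdX) == BusMsg::None);      // c1: S -> I
        assert(cpu_access(c2, CpuOp::PrWr) == BusMsg::None);    // write hit on M: silent
        assert(c1 == State::Invalid && c2 == State::Modified);
    }

Note how the Modified case of cpu_access returns no bus message: that is exactly the BusUpgr saving described in the list above.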

    @@ -35677,6 +35922,9 @@ CACHE2 S nyy

    Therefore, it does not need to fetch the data, which saves bus traffic compared to "Bus write" since the data itself does not need to be sent.

    +
    +

This is also called a Bus Upgrade message or BusUpgr, as it informs others that our copy of the line is about to be upgraded.

    +
  • "Write back": send the data on the bus and tell someone to pick it up: either DRAM or another cache

    @@ -35746,7 +35994,7 @@ CACHE2 S nyy

    Since we know what the latest data is, we can move to "Shared" rather than "Invalid" to possibly save time on future reads.

    -

    But to do that, we need to write the data back to DRAM to maintain the shared state consistent. The MESI protocol prevents that extra read in some cases.

    +

But to do that, we need to write the data back to DRAM to keep the Shared state consistent. The MESI cache coherence protocol prevents that extra read in some cases.

    And it has to be either: before the other cache gets its data from DRAM, or better, the other cache can get its data from our write back itself just like the DRAM.

    @@ -35836,14 +36084,33 @@ CACHE2 S nyy

    TODO gem5 concrete example.

    +
    -

    32.1.3. MESI protocol

    +

    32.1.4. MESI cache coherence protocol

    -

    Splits the Shared of MSI protocol into a new Exclusive state:

    +

Splits the Shared state of the MSI cache coherence protocol into a new Exclusive state:

      @@ -35851,43 +36118,96 @@ CACHE2 S nyy

      MESI Exclusive: clean but only present in one cache

    • -

      MESI Shared: clean but may be present in more that one cache

      +

MESI Shared: clean but present in more than one cache

    -

    TODO advantage: I think the advantages over MSI are:

    +

Exclusive is entered from Invalid after a PrRd, but only if the reply came from DRAM (or if we snooped that no one sent the reply to DRAM for us to read it)! If the reply came from another cache, we go directly to Shared instead. It is this extra information that allows for the split of S.

    +
    +
    +

    The advantage of this over MSI is that when we move from Exclusive to Modified, no invalidate message is required, reducing bus traffic: https://en.wikipedia.org/wiki/MESI_protocol#Advantages_of_MESI_over_MSI

    +
    +
    +

This is a common case in read-modify-write loops. On MSI, it would first do PrRd, send BusRd (to move any M to S), get the data, and go to Shared; then the PrWr must send BusUpgr to invalidate other Shared copies and move to M.

    +
    +
    +

With MESI, the PrRd could go to E instead of S depending on who services it. If it does go to E, then the PrWr only moves it to M: there is no need to send BusUpgr because we know that no one else is in S.
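
A hypothetical illustration of that read-then-write case (the histogram, its padding and the bucket count are invented for the example; the MSI/MESI transitions are spelled out in the comments):

    #include <cstddef>

    constexpr std::size_t kBuckets = 64;

    // one counter per cache line so only one core touches a given line
    struct alignas(64) Bucket { long count; };
    Bucket histogram[kBuckets];

    void count_event(std::size_t b) {
        // load (PrRd) then store (PrWr) on the same line:
        //   MSI:  I --PrRd/BusRd--> S, then S --PrWr--> M still needs a BusUpgr
        //   MESI: I --PrRd/BusRd--> E (no other cache replies),
        //         then E --PrWr--> M silently, with no bus message at all
        histogram[b].count += 1;
    }

    int main() {
        for (std::size_t i = 0; i < kBuckets; ++i)
            count_event(i);
    }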

    +
    +
    +

    gem5 12c917de54145d2d50260035ba7fa614e25317a3 has two Ruby MESI models implemented: MESI_Two_Level and MESI_Three_Level.

    +
    +
    +
    +

    32.1.5. MOSI cache coherence protocol

    +
    +

https://en.wikipedia.org/wiki/MOSI_protocol The critical MSI vs MOSI section there was a bit bogus though (https://en.wikipedia.org/w/index.php?title=MOSI_protocol&oldid=895443023): we have to edit it.

    +
    +
    +

In MSI, it feels wasteful that an M to S transition needs to flush to memory: why do we need to flush right now, since even more caches now have that data? Why not wait until later and try to gain something from this deferral?

    +
    +
    +

The problem with doing that in MSI is that not flushing on an M to S transition would force us to flush on every S eviction. So we would end up flushing even after reads!

    +
    +
    +

MOSI solves that by making M move to O instead of S on BusRd. Now, the O holder is the only one responsible for the flush back on eviction.

    +
    +
    +

    So, in case we had:

    • -

      when we move from Exclusive to Shared, no DRAM write back is needed, because we know that the cache is clean

      +

      processor 1: M

    • -

      when we move from Exclusive to Modified, no invalidate message is required, reducing bus traffic

      +

      processor 2: I then read

      +
    • +
    • +

      processor 1: write

    -

    Exclusive is entered from Invalid after a "Local read", but only if the reply came from DRAM! If the reply came from another cache, we go directly to shared instead.

    +

An MSI cache 1 would do:

    +
    +
    +
      +
    • +

      write to main memory, go to S

      +
    • +
    • +

BusUpgr, go back to M, cache 2 goes back to I

      +
    • +
    +
    +
    +

    and MOSI would do:

    +
    +
    +
      +
    • +

go to O (no write back to memory)

      +
    • +
    • +

      BusUpgr, go back to M

      +
    • +
    +
    +
    +

    This therefore saves one memory write through and its bus traffic.
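
The M to O move can be sketched as a snooping-side rule (toy code with invented names, not gem5 Ruby’s MOESI implementations):

    #include <cassert>

    enum class State { Modified, Owned, Shared, Invalid };

    // Snooping a BusRd while we hold the line dirty: move to Owned and supply
    // the data cache-to-cache, deferring the DRAM write back until eviction.
    bool snoop_bus_rd(State &s) {
        if (s == State::Modified || s == State::Owned) {
            s = State::Owned;
            return true;   // we put the data on the bus for the requester
        }
        return false;      // clean or invalid copies let memory (or the owner) answer
    }

    // Only a dirty line (Modified or Owned) needs a write back when evicted.
    bool evict_needs_writeback(State s) {
        return s == State::Modified || s == State::Owned;
    }

    int main() {
        State c1 = State::Modified;
        assert(snoop_bus_rd(c1));           // processor 2 reads: we supply the data
        assert(c1 == State::Owned);         // no DRAM write happened yet
        assert(evict_needs_writeback(c1));  // it only happens when O is evicted
    }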

    -

    32.1.4. MOSI protocol

    - -
    -

    TODO compare to MSI and understand advantages. From Wikipedia it seems that MOSI can get data from the Owned cache while MSI cannot get data from Shared caches and must go to memory, but why not? Why do we need that Owned? Is it because there are multiple Shared caches and them all replying at the same time would lead to problems?

    -
    -
    -
    -

    32.1.5. MOESI protocol

    +

    32.1.6. MOESI cache coherence protocol

    -

    MESI protocol + MOSI protocol, not much else to it!

    +

    MESI cache coherence protocol + MOSI cache coherence protocol, not much else to it!

    +
    +
    +

    gem5 12c917de54145d2d50260035ba7fa614e25317a3 has several Ruby MOESI models implemented: MOESI_AMD_Base, MOESI_CMP_directory, MOESI_CMP_token and MOESI_hammer.

    @@ -36920,7 +37240,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -

    A custom build script can give you more flexibility: e.g. the package can be made work with other root filesystems more easily, have better 9P support, and rebuild faster as it evades some Buildroot boilerplate.

    -
    33.14.2.1. kernel_modules buildroot package
    +
    33.14.2.1. kernel_modules buildroot package
    @@ -36956,7 +37276,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -

    As you have just seen, this sets up everything so that modprobe can correctly find the module.

    -

    ./build-buildroot --build-linux and ./run --buildroot-linux are needed because the Buildroot kernel modules must use the Buildroot Linux kernel at build and run time.

    +

    ./build-buildroot --build-linux and ./run --buildroot-linux are needed because the Buildroot kernel modules must use the Buildroot Linux kernel at build and run time, see also: Buildroot vanilla kernel.

The --no-overlay is required, as otherwise our modules.order generated by ./build-linux and installed with BR2_ROOTFS_OVERLAY overwrites the Buildroot generated one.