diff --git a/index.html b/index.html index 944201a..ad16af9 100644 --- a/index.html +++ b/index.html @@ -1026,73 +1026,74 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 18. QEMU
  • 19. gem5 @@ -1140,6 +1141,7 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 19.5.4.1. gem5 fast forward
  • +
  • 19.5.5. gem5 checkpoint upgrader
  • 19.6. Pass extra options to gem5
  • @@ -1224,7 +1226,13 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 19.16.1.2. gem5 MinorCPU
  • -
  • 19.16.1.3. gem5 DerivO3CPU
  • +
  • 19.16.1.3. gem5 DerivO3CPU + +
  • 19.16.2. gem5 ARM RSK
  • @@ -1296,8 +1304,22 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 19.20.4.4.1. gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs and Ruby
  • -
  • 19.20.4.5. gem5 event queue MinorCPU syscall emulation freestanding example analysis
  • -
  • 19.20.4.6. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis
  • +
  • 19.20.4.5. gem5 event queue MinorCPU syscall emulation freestanding example analysis + +
  • +
  • 19.20.4.6. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis + +
  • 19.20.5. gem5 instruction definitions @@ -1336,7 +1358,12 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 19.20.7.4. gem5 Process
  • -
  • 19.20.8. gem5 functional units
  • +
  • 19.20.8. gem5 functional units + +
  • 19.20.9. gem5 code generation
  • +
  • 19.21. Gensim
  • 20. Buildroot @@ -1414,7 +1442,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 21.1.3. GCC C extensions
  • @@ -1508,14 +1540,15 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b @@ -1800,6 +1833,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 24.6. ARM SIMD @@ -1866,13 +1904,16 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 24.9.2.1. ARMv7 architecture reference manual
  • 24.9.2.2. ARMv8 architecture reference manual
  • 24.9.2.3. ARMv8 architecture reference manual db
  • -
  • 24.9.2.4. Programmer’s Guide for ARMv8-A
  • -
  • 24.9.2.5. Arm A64 Instruction Set Architecture: Future Architecture Technologies in the A architecture profile Documentation
  • -
  • 24.9.2.6. ARM processor documentation +
  • 24.9.2.4. ARMv8 architecture reference manual db
  • +
  • 24.9.2.5. Programmer’s Guide for ARMv8-A
  • +
  • 24.9.2.6. Arm A64 Instruction Set Architecture: Future Architecture Technologies in the A architecture profile Documentation
  • +
  • 24.9.2.7. ARM processor documentation
  • +
  • 24.9.2.8. Arm Cortex‑A77 Technical Reference Manual r1p1
  • +
  • 24.9.2.9. Arm Cortex‑A77 Software Optimization Guide r1p1
  • @@ -1886,7 +1927,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 27.1. Baremetal GDB step debug
  • 27.2. Baremetal bootloaders
  • 27.3. Baremetal linker script
  • -
  • 27.4. Baremetal command line arguments
  • +
  • 27.4. Baremetal command line arguments + +
  • 27.5. Semihosting
  • -
  • 32.2. Superscalar processor
  • -
  • 32.3. Out-of-order execution
  • -
  • 32.4. Hardware threads
  • -
  • 32.5. Cache coherence +
  • 32.2. Superscalar processor @@ -2089,7 +2149,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 33.9. ccache
  • 33.10. getvar
  • 33.11. Rebuild Buildroot while running
  • @@ -2107,79 +2171,80 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 33.13.4. Buildroot build variants
  • -
  • 33.14. Directory structure +
  • 33.14. Optimization level of a build
  • +
  • 33.15. Directory structure
  • -
  • 33.15. Test this repo +
  • 33.16. Test this repo
  • -
  • 33.16. Bisection
  • -
  • 33.17. Update a forked submodule
  • -
  • 33.18. Release +
  • 33.17. Bisection
  • +
  • 33.18. Update a forked submodule
  • +
  • 33.19. Release
  • -
  • 33.19. Design rationale +
  • 33.20. Design rationale
  • -
  • 33.20. Soft topics +
  • 33.21. Soft topics
  • -
  • 33.21. Bibliography
  • +
  • 33.22. Bibliography
  • @@ -2196,7 +2261,7 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b

    If you don’t know which one to go for, start with QEMU Buildroot setup getting started.

    -

    Design goals of this project are documented at: Section 33.19.1, “Design goals”.

    +

    Design goals of this project are documented at: Section 33.20.1, “Design goals”.

    1.1. QEMU Buildroot setup

    @@ -2613,10 +2678,10 @@ hello /root/.profile

    If you really want to develop semiconductors, your only choice is to join a university or a semiconductor company that has the EDA licenses.

    -

    See also: Section 33.20.2, “Should you waste your life with systems programming?”.

    +

    See also: Section 33.21.2, “Should you waste your life with systems programming?”.

    -

    While hacking QEMU, you will likely want to GDB step its source. That is trivial since QEMU is just another userland program like any other, but our setup has a shortcut to make it even more convenient, see: Section 18.7, “Debug the emulator”.

    +

    While hacking QEMU, you will likely want to GDB step its source. That is trivial since QEMU is just another userland program like any other, but our setup has a shortcut to make it even more convenient, see: Section 18.8, “Debug the emulator”.

    @@ -3752,7 +3817,7 @@ cd userland

    Here we used --force-rebuild to force rebuild since the sources weren’t modified since the last build.

    -

    Some CLI options have more specialized flags, e.g. -O optimization level:

    +

    Some CLI options have more specialized flags, e.g. -O for the Optimization level of a build:

    @@ -3801,7 +3866,7 @@ cd userland
    • -

      put the host executables in a separate build-variant to avoid conflict with Buildroot builds.

      +

      put the host executables in a separate build variant to avoid conflict with Buildroot builds.

    • ran with the --emulator native option to run the program natively

      @@ -3817,7 +3882,7 @@ cd userland
    -

    as shown at: Section 18.7, “Debug the emulator”, although direct GDB host usage works as well of course.

    +

    as shown at: Section 18.8, “Debug the emulator”, although direct GDB host usage works as well of course.

    @@ -4309,6 +4374,9 @@ continue

    So get ready for some weird jumps, and <value optimized out> fun. Why, Linux, why.

    +
    +

    The -O level of some other userland content can be controlled as explained at: Optimization level of a build.

    +
    @@ -6916,7 +6984,7 @@ cat f

    which can be good for automated tests, as it ensures that you are using a pristine unmodified system image every time.

    -

    Not however that we already disable disk persistency by default on ext2 filesystems even without --initrd: Section 18.2, “Disk persistency”.

    +

    Note however that we already disable disk persistency by default on ext2 filesystems even without --initrd: Section 18.3, “Disk persistency”.

    One downside of this method is that it has to put the entire filesystem into memory, and could lead to a panic:

    @@ -7469,6 +7537,19 @@ sudo ./setup -y
    +

    also mentioned at:

    +
    +
    + +
    +

    Bibliography:

    @@ -7679,7 +7760,7 @@ qw er

    The gem5 tests require building statically with build id static, see also: Section 10.7, “gem5 syscall emulation mode”. TODO automate this better.

    -

    See: Section 33.15, “Test this repo” for more useful testing tips.

    +

    See: Section 33.16, “Test this repo” for more useful testing tips.

    @@ -8531,7 +8612,7 @@ Program aborted at tick 0
    +

    Example:

    +
    +
    +
    +
    ./poll.sh
    +
    +
    +
    +

    Outcome: jiffies gets printed to stdout every second from userland, e.g.:

    +
    +
    +
    +
    poll
    +<6>[    4.275305] poll
    +<6>[    4.275580] return POLLIN
    +revents = 1
    +POLLIN n=10 buf=4294893337
    +poll
    +<6>[    4.276627] poll
    +<6>[    4.276911] return 0
    +<6>[    5.271193] wake_up
    +<6>[    5.272326] poll
    +<6>[    5.273207] return POLLIN
    +revents = 1
    +POLLIN n=10 buf=4294893588
    +poll
    +<6>[    5.276367] poll
    +<6>[    5.276618] return 0
    +<6>[    6.275178] wake_up
    +<6>[    6.276370] poll
    +<6>[    6.277269] return POLLIN
    +revents = 1
    +POLLIN n=10 buf=4294893839
    +
    +
    +
    +
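The buf values above are raw jiffies counts, so the one second period can be cross-checked with a bit of arithmetic. A minimal sketch, assuming only the three POLLIN lines of the sample output; the variable names are illustrative and not part of the example module:

```python
import re

# Userland lines copied from the sample output above.
SAMPLE_OUTPUT = """\
POLLIN n=10 buf=4294893337
POLLIN n=10 buf=4294893588
POLLIN n=10 buf=4294893839
"""

# Extract the jiffies value printed after "buf=" on each POLLIN line.
jiffies = [int(m.group(1)) for m in re.finditer(r"buf=(\d+)", SAMPLE_OUTPUT)]

# Difference between consecutive wake-ups, in jiffies.
deltas = [b - a for a, b in zip(jiffies, jiffies[1:])]
print(deltas)  # [251, 251]
```

Since the kernel timestamps of consecutive wake_up lines are about one second apart, a delta of 251 suggests that this kernel counts roughly 250 jiffies per second. That is an inference from this log, not something the example configures.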

    Force the poll file_operation to return 0 to see what happens more clearly:

    +
    +
    +
    +
    ./poll.sh pol0=1
    +
    +
    +
    +

    Sample output:

    +
    +
    +
    +
    poll
    +<6>[   85.674801] poll
    +<6>[   85.675788] return 0
    +<6>[   86.675182] wake_up
    +<6>[   86.676431] poll
    +<6>[   86.677373] return 0
    +<6>[   87.679198] wake_up
    +<6>[   87.680515] poll
    +<6>[   87.681564] return 0
    +<6>[   88.683198] wake_up
    +
    +
    +
    +

    From this we see that control is not returned to userland: the kernel just keeps calling the poll file_operation again and again.

    +
    +

    Typically, we are waiting for some hardware to make some piece of data available to the kernel.

    @@ -12774,7 +12914,17 @@ cd

    To simplify this example, we just fake the hardware interrupts with a kthread that sleeps for a second in an infinite loop.

    -

    Bibliography: https://stackoverflow.com/questions/30035776/how-to-add-poll-function-to-the-kernel-module-code/44645336#44645336

    +

    Bibliography:

    +
    +
    +
    @@ -16137,7 +16287,7 @@ ps

    If you are familiar with VirtualBox, then QEMU basically does the same thing: it opens a "window" inside your desktop that can run an operating system inside your operating system.

    -

    Also both can use very similar techniques: either binary translation or KVM. VirtualBox' binary translator is / was based on QEMU’s it seems: https://en.wikipedia.org/wiki/VirtualBox#Software-based_virtualization

    +

    Also both can use very similar techniques: either Binary translation or KVM. VirtualBox' binary translator is / was based on QEMU’s it seems: https://en.wikipedia.org/wiki/VirtualBox#Software-based_virtualization

    The huge advantage of QEMU over VirtualBox is that it supports cross arch simulation, e.g. simulate an ARM guest on an x86 host.

    @@ -16159,7 +16309,16 @@ ps
    -

    18.2. Disk persistency

    +

    18.2. Binary translation

    +
    +

    https://en.wikipedia.org/wiki/Binary_translation

    +
    +
    +

    Used by QEMU and Gensim.

    +
    +
    +
    +

    18.3. Disk persistency

    We disable disk persistency for both QEMU and gem5 by default, to prevent the emulator from putting the image in an unknown state.

    @@ -16214,7 +16373,7 @@ ps

    Disk persistency is useful to re-run shell commands from the history of a previous session with Ctrl-R, but we felt that the loss of determinism was not worth it.

    -

    18.2.1. gem5 disk persistency

    +

    18.3.1. gem5 disk persistency

    TODO how to make gem5 disk writes persistent?

    @@ -16244,7 +16403,7 @@ index 17498c42b..76b8b351d 100644
    -

    18.3. gem5 qcow2

    +

    18.4. gem5 qcow2

    qcow2 does not appear to be supported: there are no hits in the source tree, and there is a mention on Nate’s 2009 wishlist: http://gem5.org/Nate%27s_Wish_List

    @@ -16253,7 +16412,7 @@ index 17498c42b..76b8b351d 100644
    -

    18.4. Snapshot

    +

    18.5. Snapshot

    QEMU allows us to take snapshots at any time through the monitor.

    @@ -16351,7 +16510,7 @@ index 17498c42b..76b8b351d 100644

    Bibliography: https://stackoverflow.com/questions/40227651/does-qemu-emulator-have-checkpoint-function/48724371#48724371

    -

    18.4.1. Snapshot internals

    +

    18.5.1. Snapshot internals

    Snapshots are stored inside the .qcow2 images themselves.

    @@ -16400,7 +16559,7 @@ Format specific information:
    -

    18.5. Device models

    +

    18.6. Device models

    This section documents:

    @@ -16445,12 +16604,12 @@ Format specific information:
    -

    18.5.1. PCI

    +

    18.6.1. PCI

    Only tested in x86.

    -
    18.5.1.1. pci_min
    +
    18.6.1.1. pci_min

    PCI driver for our minimal pci_min.c QEMU fork device:

    @@ -16520,7 +16679,7 @@ lkmc_pci_min mmio_write addr = 4 val = 0 size = 4
    -
    18.5.1.2. QEMU edu PCI device
    +
    18.6.1.2. QEMU edu PCI device

    Small upstream educational PCI device:

    @@ -16578,16 +16737,19 @@ lkmc_pci_min mmio_write addr = 4 val = 0 size = 4
    -
    18.5.1.3. Manipulate PCI registers directly
    +
    18.6.1.3. Manipulate PCI registers directly

    In this section we will try to interact with PCI devices directly from userland without kernel modules.

    @@ -16733,7 +16895,7 @@ devmem 0xfeb54000 w 0x12345678
    -
    18.5.1.4. pciutils
    +
    18.6.1.4. pciutils

    There are two versions of setpci and lspci:

    @@ -16749,7 +16911,7 @@ devmem 0xfeb54000 w 0x12345678
    -
    18.5.1.5. Introduction to PCI
    +
    18.6.1.5. Introduction to PCI

    The PCI standard is non-free, obviously like everything in low level: https://pcisig.com/specifications but Google gives several illegal PDF hits :-)

    @@ -16809,7 +16971,7 @@ devmem 0xfeb54000 w 0x12345678
    -
    18.5.1.6. PCI BFD
    +
    18.6.1.6. PCI BFD

    lspci -k shows something like:

    @@ -16863,7 +17025,7 @@ devmem 0xfeb54000 w 0x12345678
    -
    18.5.1.7. PCI BAR
    +
    18.6.1.7. PCI BAR

    https://stackoverflow.com/questions/30190050/what-is-base-address-register-bar-in-pcie/44716618#44716618

    @@ -16905,7 +17067,7 @@ pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &edu->mmio);
    -

    18.5.2. GPIO

    +

    18.6.2. GPIO

    TODO: broken. Was working before we moved arm from -M versatilepb to -M virt around af210a76711b7fa4554dcc2abd0ddacfc810dfd4. Either make it work on -M virt if that is possible, or document precisely how to make it work with versatilepb, or hopefully vexpress which is newer.

    @@ -16948,7 +17110,7 @@ pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &edu->mmio);
    -

    18.5.3. LEDs

    +

    18.6.3. LEDs

    TODO: broken when arm moved to -M virt, same as GPIO.

    @@ -17020,7 +17182,7 @@ echo 255 >brightness
    -

    18.5.4. platform_device

    +

    18.6.4. platform_device

    Minimal platform device example coded into the -M versatilepb SoC of our QEMU fork.

    @@ -17098,7 +17260,7 @@ insmod platform_device.ko
    -

    18.5.5. gem5 educational hardware models

    +

    18.6.5. gem5 educational hardware models

    TODO get some working!

    @@ -17108,7 +17270,7 @@ insmod platform_device.ko
    -

    18.6. QEMU monitor

    +

    18.7. QEMU monitor

    The QEMU monitor is a magic terminal that allows you to send text commands to the QEMU VM itself: https://en.wikibooks.org/wiki/QEMU/Monitor

    @@ -17228,7 +17390,7 @@ insmod platform_device.ko
    -

    18.6.1. QEMU monitor from guest

    +

    18.7.1. QEMU monitor from guest

    Peter Maydell said potentially not possible nicely as of August 2018: https://stackoverflow.com/questions/51747744/how-to-run-a-qemu-monitor-command-from-inside-the-guest/51764110#51764110

    @@ -17245,7 +17407,7 @@ insmod platform_device.ko
    -

    18.6.2. QEMU monitor from GDB

    +

    18.7.2. QEMU monitor from GDB

    When doing GDB step debug it is possible to send QEMU monitor commands through the GDB monitor command, which saves you the trouble of opening yet another shell.

    @@ -17261,7 +17423,7 @@ monitor info qtree
    -

    18.7. Debug the emulator

    +

    18.8. Debug the emulator

    When you start hacking QEMU or gem5, it is useful to see what is going on inside the emulators themselves.

    @@ -17274,7 +17436,15 @@ monitor info qtree
    -

    Or for a faster development loop:

    +

    Or, for a faster development loop, you can pass GDB -ex commands as a semicolon-separated list:

    +
    +
    +
    +
    ./run --debug-vm-ex 'break qemu_add_opts;run'
    +
    +
    +
    +

    which is equivalent to the more verbose:

    @@ -17282,6 +17452,9 @@ monitor info qtree
    +

    if you ever need anything besides -ex.
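To make the equivalence concrete, here is a minimal Python sketch of how a semicolon separated string could be expanded into repeated GDB -ex arguments. The helper name expand_ex is hypothetical, and this is not necessarily how ./run implements --debug-vm-ex:

```python
def expand_ex(commands):
    """Turn 'cmd1;cmd2' into ['-ex', 'cmd1', '-ex', 'cmd2']."""
    args = []
    for command in commands.split(";"):
        command = command.strip()
        if command:  # ignore empty pieces such as a trailing ';'
            args.extend(["-ex", command])
    return args


print(expand_ex("break qemu_add_opts;run"))
# ['-ex', 'break qemu_add_opts', '-ex', 'run']
```

GDB runs the -ex commands in the order given, so the breakpoint is set before run starts the emulator.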

    +
    +

    Or if things get really involved and you want a debug script:

    @@ -17330,7 +17503,7 @@ run

    You can still send key presses to QEMU however even without the mouse capture, just either click on the title bar, or alt tab to give it focus.

    -

    18.7.1. Reverse debug the emulator

    +

    18.8.1. Reverse debug the emulator

    While step debugging any complex program, you always end up feeling the need to step in reverse to reach the last call to some function that was called before the failure point, in order to trace back the problem to the actual bug source.

    @@ -17419,7 +17592,7 @@ reverse-next
    -

    18.7.2. Debug gem5 Python scripts

    +

    18.8.2. Debug gem5 Python scripts

    Start pdb at the first instruction:

    @@ -17453,7 +17626,7 @@ reverse-next
    -

    18.8. Tracing

    +

    18.9. Tracing

    QEMU can log several different events.

    @@ -17544,7 +17717,7 @@ Call Trace:
    -

    18.8.1. QEMU -d tracing

    +

    18.9.1. QEMU -d tracing

    QEMU also has a second trace mechanism in addition to -trace, find out the events with:

    @@ -17585,7 +17758,7 @@ IN:
    -

    18.8.2. QEMU trace register values

    +

    18.9.2. QEMU trace register values

    TODO: is it possible to show the register values for each instruction?

    @@ -17615,11 +17788,11 @@ IN:

    PANDA can list memory addresses, so I bet it can also decode the instructions: https://github.com/panda-re/panda/blob/883c85fa35f35e84a323ed3d464ff40030f06bd6/panda/docs/LINE_Censorship.md I wonder why they don’t just upstream those things to QEMU’s tracing: https://github.com/panda-re/panda/issues/290

    -

    gem5 can do it as shown at: Section 18.8.8, “gem5 tracing”.

    +

    gem5 can do it as shown at: Section 18.9.8, “gem5 tracing”.

    -

    18.8.3. QEMU trace memory accesses

    +

    18.9.3. QEMU trace memory accesses

    Not possible apparently, not even with the memory_region_ops_read and memory_region_ops_write trace events, Peter comments https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg07482.html

    @@ -17638,7 +17811,7 @@ of guest operations.

    -

    18.8.4. Trace source lines

    +

    18.9.4. Trace source lines

    We can further use Binutils' addr2line to get the line that corresponds to each address:

    @@ -17694,7 +17867,7 @@ less "$(./getvar --arch x86_64 run_dir)/trace-lines.txt"
    -

    18.8.5. QEMU record and replay

    +

    18.9.5. QEMU record and replay

    QEMU runs, unlike gem5, are not deterministic by default, however it does support a record and replay mechanism that allows you to replay a previous run deterministically.

    @@ -17801,7 +17974,7 @@ less "$(./getvar --arch x86_64 run_dir)/trace-lines.txt"

    Solved on unmerged c42634d8e3428cfa60672c3ba89cabefc720cde9 from https://github.com/ispras/qemu/tree/rr-180725

    -
    18.8.5.1. QEMU reverse debugging
    +
    18.9.5.1. QEMU reverse debugging

    TODO get working.

    @@ -17840,7 +18013,7 @@ reverse-continue
    -

    18.8.6. QEMU trace multicore

    +

    18.9.6. QEMU trace multicore

    TODO: is there any way to distinguish which instruction runs on each core? Doing:

    @@ -17855,13 +18028,13 @@ reverse-continue
    -

    18.8.7. QEMU get guest instruction count

    +

    18.9.7. QEMU get guest instruction count

    TODO: https://stackoverflow.com/questions/58766571/how-to-count-the-number-of-guest-instructions-qemu-executed-from-the-beginning-t

    -

    18.8.8. gem5 tracing

    +

    18.9.8. gem5 tracing

    gem5 also provides a tracing mechanism documented at: http://www.gem5.org/Trace_Based_Debugging:

    @@ -17972,7 +18145,7 @@ less "$(./getvar --arch aarch64 run_dir)/trace-lines.txt"

    TODO: 7452d399290c9c1fc6366cdad129ef442f323564 ./trace2line this is too slow and takes hours. QEMU’s processing of 170k events takes 7 seconds. gem5’s processing is analogous, but there are 140M events, so it should take 7000 seconds ~ 2 hours which seems consistent with what I observe, so maybe there is no way to speed this up…​ The workaround is to just use gem5’s ExecSymbol to get function granularity, and then GDB individually if line detail is needed?
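The estimate above can be reproduced with a short calculation. A sketch, assuming the per-event processing cost is the same for QEMU and gem5 traces, which is the assumption the paragraph already makes:

```python
# Numbers quoted in the paragraph above.
qemu_events = 170_000
qemu_seconds = 7.0
gem5_events = 140_000_000

# Linear extrapolation: same cost per event for both trace types.
seconds_per_event = qemu_seconds / qemu_events
gem5_seconds = gem5_events * seconds_per_event

print(round(gem5_seconds), "s, about", round(gem5_seconds / 3600, 1), "h")
```

This gives somewhat less than the rounded 7000 seconds, but the conclusion is the same: hours rather than seconds.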

    -
    18.8.8.1. gem5 trace internals
    +
    18.9.8.1. gem5 trace internals

    gem5 traces are generated from DPRINTF(<trace-id> calls scattered throughout the code, except for ExecAll instruction traces, which use Debug::ExecEnable directly.

    @@ -18009,7 +18182,7 @@ extern SimpleFlag ExecEnable;
    -
    18.8.8.2. gem5 ExecAll trace format
    +
    18.9.8.2. gem5 ExecAll trace format

    This debug flag traces all instructions.

    @@ -18064,7 +18237,7 @@ extern SimpleFlag ExecEnable;

    @start_kernel: we are in the start_kernel function. Awesome feature! Implemented with libelf https://sourceforge.net/projects/elftoolchain/ copy pasted in-tree ext/libelf. To get raw addresses, remove the ExecSymbol, which is enabled by Exec. This can be done with Exec,-ExecSymbol.

  • -

    .1 as in @start_kernel.1: index of the microop

    +

    .1 as in @start_kernel.1: index of the gem5 microops

  • stp: instruction disassembly. Note however that the disassembly of many instructions is very broken as of 2019q2, and you can’t just trust it blindly.

    @@ -18092,7 +18265,7 @@ extern SimpleFlag ExecEnable;
  • -
    18.8.8.3. gem5 Registers trace format
    +
    18.9.8.3. gem5 Registers trace format

    This flag shows a more detailed register usage than gem5 ExecAll trace format.

    @@ -18147,13 +18320,13 @@ add x1, x0, 2
    -
    18.8.8.4. gem5 TARMAC traces
    +
    18.9.8.4. gem5 TARMAC traces

    https://stackoverflow.com/questions/54882466/how-to-use-the-tarmac-tracer-with-gem5

    -
    18.8.8.5. gem5 tracing internals
    +
    18.9.8.5. gem5 tracing internals

    As of gem5 16eeee5356585441a49d05c78abc328ef09f7ace the default tracer is ExeTracer. It is set at:

    @@ -18226,7 +18399,7 @@ src/arch/x86/nativetrace.hh:41:class X86NativeTrace : public NativeTrace
    -

    18.9. QEMU GUI is unresponsive

    +

    18.10. QEMU GUI is unresponsive

    Sometimes in Ubuntu 14.04, after the QEMU SDL GUI starts, it does not get updated after keyboard strokes, and there are artifacts like disappearing text.

    @@ -18711,7 +18884,24 @@ ps Haux | grep qemu | wc

    https://stackoverflow.com/questions/50248067/how-to-run-a-gem5-arm-aarch64-full-system-simulation-with-fs-py-with-more-than-8

    -

    Build the kernel with the gem5 arm Linux kernel patches, and then run:

    +

    With GICv3, tested at LKMC 224fae82e1a79d9551b941b19196c7e337663f22 gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772 on vanilla kernel:

    +
    +
    +
    +
    ./run \
    +  --arch aarch64 \
    +  --emulator gem5 \
    +  --cpus 16 \
    +  -- \
    +  --machine-type VExpress_GEM5_V2 \
    +;
    +
    +
    +
    +

    boots to a shell and nproc shows 16.

    +
    +
    +

    For the GICv2 extension method, build the kernel with the gem5 arm Linux kernel patches, and then run:

    @@ -19463,6 +19653,15 @@ Exiting @ tick 84500 because m5_exit instruction encountered

    19.5.2. gem5 checkpoint internals

    +

    A quick way to get a gem5 syscall emulation mode or full system checkpoint to observe is:

    +
    +
    +
    +
    ./run --arch aarch64 --emulator gem5 --baremetal userland/freestanding/gem5_checkpoint.S --trace-insts-stdout
    +./run --arch aarch64 --emulator gem5 --userland userland/freestanding/gem5_checkpoint.S --trace-insts-stdout
    +
    +
    +

    Checkpoints are stored inside the m5out directory at:

    @@ -19485,6 +19684,22 @@ Exiting @ tick 84500 because m5_exit instruction encountered

    The -r N integer value is just pure fs.py sugar: the backend at m5.instantiate just takes the actual checkpoint directory path as input.

    +
    +

    The file m5out/cpt.1000/m5.cpt contains almost everything in the checkpoint except memory.

    +
    +
    +

    It is a Python configparser compatible file with a section structure that matches the SimObject tree e.g.:

    +
    +
    +
    +
    [system.cpu.itb.walker.power_state]
    +currState=0
    +prvEvalTick=0
    +
    +
    +
    +

    When a checkpoint is taken, each SimObject calls its overridden serialize method to generate the checkpoint, and when loading, unserialize is called.
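Since m5.cpt is configparser compatible, it can be inspected from a few lines of Python. A minimal sketch, using the section shown above as inline sample data rather than a real checkpoint file; load_checkpoint is a hypothetical helper name:

```python
import configparser

# Sample data copied from the m5.cpt excerpt above.
SAMPLE_CPT = """\
[system.cpu.itb.walker.power_state]
currState=0
prvEvalTick=0
"""


def load_checkpoint(text):
    """Parse checkpoint text into a ConfigParser object."""
    # Disable interpolation so that any '%' in values is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    # Keep key case as written (the default would lowercase currState).
    parser.optionxform = str
    parser.read_string(text)
    return parser


cpt = load_checkpoint(SAMPLE_CPT)
for section in cpt.sections():
    # Section names mirror the SimObject tree path.
    print(section, dict(cpt[section]))
```

For a real checkpoint, the same parser could presumably be fed the contents of m5out/cpt.1000/m5.cpt instead of the inline string.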

    +
    +
    +

    19.5.5. gem5 checkpoint upgrader

    +
    +

    The in-tree util/cpt_upgrader.py is a tool to upgrade checkpoints taken from an older version of gem5 to be compatible with the newest version, so you can update gem5 without having to re-run the simulation that generated the checkpoints.

    +
    +
    +

    For example, whenever a system register is added in ARMv8, old checkpoints break unless upgraded.

    +
    +
    +

    Unfortunately, since the process is not very automated (automatable?), and requires manually patching the upgrader every time a new breaking change is done, the upgrader tends to break soon if you try to move many versions of gem5 ahead as of 2020. This is evidenced in bug reports such as this one: https://gem5.atlassian.net/browse/GEM5-472

    +
    +
    +

    The script can be used as:

    +
    +
    +
    +
    util/cpt_upgrader.py m5out/cpt.1000/m5.cpt
    +
    +
    +
    +

    This updates the m5.cpt file in-place, and a m5out/cpt.1000/m5.cpt.bak is generated as a backup of the old file.

    +
    +
    +

    The upgrader determines which upgrades are needed by checking the version_tags entry of the checkpoint:

    +
    +
    +
    +
    [Globals]
    +version_tags=arm-ccregs arm-contextidr-el2 arm-gem5-gic-ext ...
    +
    +
    +
    +

    Each of those tags corresponds to a Python file under util/cpt_upgraders/ e.g. util/cpt_upgraders/arm-ccregs.py.
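The tag to file relationship can be sketched in Python as follows. This only illustrates the naming rule stated above and is not the upgrader's own code; the sample contains just the three tags visible in the excerpt, while a real checkpoint lists more:

```python
import configparser
import posixpath

# Only the tags visible in the excerpt above; the real list is longer.
SAMPLE_GLOBALS = """\
[Globals]
version_tags=arm-ccregs arm-contextidr-el2 arm-gem5-gic-ext
"""

parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # keep key case as written
parser.read_string(SAMPLE_GLOBALS)

# version_tags is a whitespace separated list of tag names.
tags = parser["Globals"]["version_tags"].split()

# One Python file per tag under util/cpt_upgraders/.
upgrader_files = [posixpath.join("util/cpt_upgraders", tag + ".py") for tag in tags]
for path in upgrader_files:
    print(path)
```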

    +
    +

    19.6. Pass extra options to gem5

    @@ -21161,6 +21414,9 @@ xdg-open "$(./getvar --arch arm --emulator gem5 m5out_dir)/config.dot.svg"
    +

    Sample run time: 87 minutes on P51 Ubuntu 20.04 gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1.

    +
    +

    After the first run has downloaded the test binaries for you, you can speed up the process a little bit by skipping a useless SCons call:

    @@ -21176,7 +21432,7 @@ xdg-open "$(./getvar --arch arm --emulator gem5 m5out_dir)/config.dot.svg"
    -
    ./gem5-regression --arch aarch64 --cmd list
    +
    ./gem5-regression --arch aarch64 --cmd list -- --length quick --length long
    @@ -21333,13 +21589,13 @@ Exiting @ tick 18446744073709551615 because simulate() limit reached

    19.15.1. gem5 debug build

    -

    How to use it in LKMC: Section 18.7, “Debug the emulator”.

    +

    How to use it in LKMC: Section 18.8, “Debug the emulator”.

    If you build gem5 with scons build/ARM/gem5.debug, then that is a .debug build.

    -

    It relates to the more common .opt build just as explained at Section 18.7, “Debug the emulator”: both .opt and .debug have -g, but .opt uses -O2 while .debug uses -O0.

    +

    It relates to the more common .opt build just as explained at Section 18.8, “Debug the emulator”: both .opt and .debug have -g, but .opt uses -O2 while .debug uses -O0.

    @@ -21533,7 +21789,7 @@ Indirect leak of 1346 byte(s) in 2 object(s) allocated from:
    -

    Note that the --ruby option has some crazy side effects besides enabling Ruby, e.g. it sets the default --cpu-type to TimingSimpleCPU instead of the otherwise default AtomicSimpleCPU. But why.

    +

    Note that the --ruby option has some crazy side effects besides enabling Ruby, e.g. it sets the default --cpu-type to TimingSimpleCPU instead of the otherwise default AtomicSimpleCPU. TODO: I have been told that this is because AtomicSimpleCPU sends packets atomically, and atomic requests do not work with Ruby, only timing.

    It is not possible to build more than one Ruby system into a single build, and this is a major pain point for testing Ruby: https://gem5.atlassian.net/browse/GEM5-467

    @@ -21786,7 +22042,7 @@ class SystemXBar(CoherentXBar):
    @@ -21853,13 +22109,13 @@ class SystemXBar(CoherentXBar):
    19.16.1.2. gem5 MinorCPU
    -

    Generic in-order core that does not model any specific CPU.

    +

    Generic in-order superscalar core.

    Its C++ implementation can be parametrized to more closely match real cores.

    -

    Note that since gem5 is highly parametrizable, the parametrization could even change which instructions a CPU can execute by altering its available functional units, which are used to model performance.

    +

    Note that since gem5 is highly parametrizable, the parametrization could even change which instructions a CPU can execute by altering its available functional units, which are used to model performance.

    For example, MinorCPU allows all implemented instructions, including ARM SVE instructions, but a derived class modelling, say, an ARM Cortex A7 core, might not, since SVE is a newer feature and the A7 core does not have SVE.

    @@ -21917,17 +22173,40 @@ class SystemXBar(CoherentXBar):
    -
    19.16.1.3. gem5 DerivO3CPU
    +
    19.16.1.3. gem5 DerivO3CPU

    Generic out-of-order core. "O3" stands for "Out Of Order"!

    +

    Basic documentation on the old gem5 wiki: http://www.m5sim.org/O3CPU

    +
    +

    Analogous to MinorCPU, but modelling an out of order core instead of in order.

    +

    The default functional units are described at: gem5 DerivO3CPU default functional units. All default widths are set to 8 instructions, from the config.ini:

    +
    +
    +
    +
    [system.cpu]
    +type=DerivO3CPU
    +commitWidth=8
    +decodeWidth=8
    +dispatchWidth=8
    +fetchWidth=8
    +issueWidth=8
    +renameWidth=8
    +squashWidth=8
    +wbWidth=8
    +
    +
    + +
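To check the widths of a given run without reading config.ini by eye, the section can be filtered with Python. A minimal sketch, assuming the file parses with configparser and using the excerpt above as inline sample data instead of a real m5out/config.ini:

```python
import configparser

# Sample data copied from the config.ini excerpt above.
SAMPLE_CONFIG_INI = """\
[system.cpu]
type=DerivO3CPU
commitWidth=8
decodeWidth=8
dispatchWidth=8
fetchWidth=8
issueWidth=8
renameWidth=8
squashWidth=8
wbWidth=8
"""

parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # keep key case, e.g. commitWidth
parser.read_string(SAMPLE_CONFIG_INI)

cpu = parser["system.cpu"]
# Keep only the pipeline width parameters and convert them to integers.
widths = {key: int(value) for key, value in cpu.items() if key.endswith("Width")}
for name, width in sorted(widths.items()):
    print(name, width)
```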

    Existing parametrizations:

    @@ -21953,6 +22232,77 @@ class SystemXBar(CoherentXBar):
    +
    +
    19.16.1.3.1. gem5 DerivO3CPU pipeline stages
    +
    +
      +
    • +

      fetch: besides obviously fetching the instruction, this is also where branch prediction runs. Presumably because you need to branch predict before deciding what to fetch next.

      +
    • +
    • +

      retire: the instruction is completely and totally done with.

      +
      +

      Misspeculated instructions never reach this stage as can be seen at: gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: speculative.

      +
      +
      +

      The ExecAll trace is also emitted at this stage, and therefore it does not show misspeculated instructions.

      +
      +
    • +
    +
    +
    +
    +
    19.16.1.3.2. gem5 util/o3-pipeview.py O3 pipeline viewer
    + +
    +
    +
    ./run \
    +  --arch aarch64 \
    +  --emulator gem5 \
    +  --userland userland/arch/aarch64/freestanding/linux/hello.S \
    +  --trace O3PipeView \
    +  --trace-stdout \
    +  -- \
    +  --cpu-type DerivO3CPU \
    +  --caches \
    +;
    +"$(./getvar gem5_source_dir)/util/o3-pipeview.py" -c 500 -o o3pipeview.tmp.log --color "$(./getvar --arch aarch64 trace_txt_file)"
    +less -R o3pipeview.tmp.log
    +
    +
    +
    +

    Or without color:

    +
    +
    +
    +
    "$(./getvar gem5_source_dir)/util/o3-pipeview.py" -c 500 -o o3pipeview.tmp.log "$(./getvar --arch aarch64 trace_txt_file)"
    +less o3pipeview.tmp.log
    +
    +
    +
    +

    A sample output for this can be seen at: [hazardless-o3-pipeline].

    +
    +
    +
    +
    19.16.1.3.3. gem5 Konata O3 pipeline viewer
    + + +
    +

    Appears to be browser based, so you can zoom in and out, rather than being stuck with the forced wrapping of [gem5-util-o3-pipeview-py-o3-pipeline-viewer].

    +
    +
    +

    Uses the same data source as util/o3-pipeview.py.

    +
    +
    +

    gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall-gain shows how the text-based visualization can get problematic due to stalls requiring wraparounds.

    +
    +
    @@ -22660,7 +23010,7 @@ for source in PySource.all:
    -
    b Trace::OstreamLogger::logMessage()
    +
    b Trace::OstreamLogger::logMessage
     b EventManager::schedule
     b EventFunctionWrapper::process
    @@ -23777,7 +24127,7 @@ DRAMCtrl::Rank::startup(Tick ref_tick)
    -

    se we deduce that the vitual address 0x400078 maps to the physical address 0x78. But of course, let me log that for you byu adding --trace MMU:

    +

    so we deduce that the virtual address 0x400078 maps to the physical address 0x78. But of course, let me log that for you by adding --trace MMU:

    @@ -23799,12 +24149,11 @@ DRAMCtrl::Rank::startup(Tick ref_tick)

    Now that we are here, we might as well learn how to log the data that was fetched from DRAM.

    -

    Fist we determine the expected bytes from:

    +

    First we determine the expected bytes from the disassembly:

    -
    ./run-toolchain --arch aarch64 objdump -- \
    -  -D "$(./getvar --arch aarch64 userland_build_dir)/arch/aarch64/freestanding/linux/hello.out"
    +
    ./disas --arch aarch64 --userland userland/arch/aarch64/freestanding/linux/hello.S _start
    @@ -23812,9 +24161,8 @@ DRAMCtrl::Rank::startup(Tick ref_tick)
    -
    0000000000400078 <_start>:
    -  400078:       d2800020        mov     x0, #0x1                        // #1
    -  40007c:       100000e1        adr     x1, 400098 <msg>
    +
       0x0000000000400078 <+0>:     20 00 80 d2     mov     x0, #0x1                        // #1
    +   0x000000000040007c <+4>:     e1 00 00 10     adr     x1, 0x400098 <msg>
    @@ -24599,6 +24947,26 @@ TimingSimpleCPU::IcachePort::ITickEvent::process

    Contrast this with the non --cache version seen at TimingSimpleCPU analysis #5 in which DRAM only actually reads the 4 required bytes.

    +

    The only cryptic thing about the messages is the IF flag, but good computer architects would have guessed it correctly, and src/mem/packet.cc confirms:

    +
    +
    +
    +
    void
    +Packet::print(std::ostream &o, const int verbosity,
    +              const std::string &prefix) const
    +{
    +    ccprintf(o, "%s%s [%x:%x]%s%s%s%s%s%s", prefix, cmdString(),
    +             getAddr(), getAddr() + getSize() - 1,
    +             req->isSecure() ? " (s)" : "",
    +             req->isInstFetch() ? " IF" : "",
    +             req->isUncacheable() ? " UC" : "",
    +             isExpressSnoop() ? " ES" : "",
    +             req->isToPOC() ? " PoC" : "",
    +             req->isToPOU() ? " PoU" : "");
    +}
    +
    +
    +

    Another interesting observation of running with --trace Cache,DRAM,XBar is that between the execution of both instructions, there is a Cache event, but no DRAM or XBar events:

    @@ -24930,10 +25298,7 @@ non-atomic 19

    The memory system part must be similar to that of TimingSimpleCPU, which we previously studied at gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis: the main thing we want to see is how the CPU pipeline speeds up execution by preventing some memory stalls.

    -

    The config.dot.svg also indicates that: everything is exactly as in gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches, except that the CPU is a MinorCPU instead of TimingSimpleCPU, and the --caches are now mandatory.

    -
    -
    -

    TODO: analyze the trace for:

    +

    The config.dot.svg also indicates that: everything is exactly as in gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches, except that the CPU is a MinorCPU instead of TimingSimpleCPU, and the --caches are now mandatory:

    @@ -24941,7 +25306,7 @@ non-atomic 19 --arch aarch64 \ --emulator gem5 \ --userland userland/arch/aarch64/freestanding/linux/hello.S \ - --trace Event \ + --trace FmtFlag,Cache,Event,ExecAll,Minor \ --trace-stdout \ -- \ --cpu-type MinorCPU \ @@ -24949,11 +25314,898 @@ non-atomic 19 ;
    +
    +

    and here’s a handy link to the source: userland/arch/aarch64/freestanding/linux/hello.S.

    +
    +
    +

    On LKMC ce3ea9faea95daf46dea80d4236a30a0891c3ca5 gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1 we see the following.

    +
    +
    +

    First there is a missed instruction fetch for the initial entry address which we know from gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches is the virtual address 0x400078 which maps to physical 0x78:

    +
    +
    +
    +
        500: Cache: system.cpu.icache: access for ReadReq [40:7f] IF miss
    +
    +
    +
    +

    The memory request comes back later on at:

    +
    +
    +
    +
      77000: Cache: system.cpu.icache: recvTimingResp: Handling response ReadResp [40:7f] IF
    +
    +
    +
    +

    and soon after the CPU also ifetches across the cache line boundary:

    +
    +
    +
    +
      79000: Cache: system.cpu.icache: access for ReadReq [80:bf] IF miss
    +
    +
    +
    +

    TODO why? We have 0x78 and 0x7c, and those should be it since we are dual issue, right? Is this prefetching at work?

    +
    +
    +

    Later on we see the first instruction, our MOVZ, was decoded:

    +
    +
    +
    +
      80000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/1/1.1 pc: 0x400078 (movz) to FU: 0
    +
    +
    +
    +

    and that issue succeeds, because the functional unit 0 (FU 0) is an IntAlu as shown at gem5 functional units:

    +
    +
    +
    +
      80000: MinorExecute: system.cpu.execute: Issuing inst: 0/1.1/1/1.1 pc: 0x400078 (movz) into FU 0
    +
    +
    +
    +

    At the very same tick, the second instruction is also decoded, our ADR:

    +
    +
    +
    +
      80000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/1/2.2 pc: 0x40007c (adr) to FU: 0
    +  80000: MinorExecute: system.cpu.execute: Can't issue as FU: 0 is already busy
    +  80000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/1/2.2 pc: 0x40007c (adr) to FU: 1
    +  80000: MinorExecute: system.cpu.execute: Issuing inst: 0/1.1/1/2.2 pc: 0x40007c (adr) into FU 1
    +
    +
    +
    +

    This is also an IntAlu instruction, and it can’t run on FU 0 because the first instruction is already running there. But to our luck, FU 1 is also an IntAlu unit, and so it runs there.
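As a rough mental model of the issue scan seen in the log above, here is a minimal Python sketch. It is not gem5 code: `FUNCTIONAL_UNITS` and `try_issue` are hypothetical names, the messages only echo the wording of the trace, and the only facts taken from the text are that FU 0 and FU 1 are both IntAlu units. The third entry of the table is made up for illustration.

```python
# Hypothetical functional unit table: index -> set of supported op classes.
# FU 0 and FU 1 being IntAlu comes from the text, the last entry is assumed.
FUNCTIONAL_UNITS = [
    {"IntAlu"},
    {"IntAlu"},
    {"IntMult"},
]


def try_issue(op_class, busy):
    """Scan FUs in index order, return (chosen FU index or None, log lines)."""
    log = []
    for index, capabilities in enumerate(FUNCTIONAL_UNITS):
        if op_class not in capabilities:
            log.append(f"Can't issue as FU: {index} isn't capable")
        elif index in busy:
            log.append(f"Can't issue as FU: {index} is already busy")
        else:
            busy.add(index)  # the unit stays busy until the caller frees it
            log.append(f"Issuing into FU {index}")
            return index, log
    return None, log


busy_units = set()
movz_fu, movz_log = try_issue("IntAlu", busy_units)  # first instruction
adr_fu, adr_log = try_issue("IntAlu", busy_units)    # second, same tick
print(movz_fu, adr_fu)
print("\n".join(adr_log))
```

The point of the sketch is only the in-order scan: the second IntAlu instruction skips the busy FU 0 and lands on FU 1, like the ADR in the trace.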

    +
    +
    +

    Crap, those Minor logs should say what OpClass each instruction is, that would make things clearer.

    +
    +
    +

    TODO what is that 0/1.1/1/1.1 notation that shows up everywhere? Must be important, let’s look at the source.

    +
    +
    +

    Soon after (1500 ticks later, which would be 3 cycles assuming a 500-tick clock period, so guessing due to opLat=3?), the execution appears to be over already since we see the ExecAll come through, which generally happens at the very end:

    +
    +
    +
    +
      81500: MinorExecute: system.cpu.execute: Attempting to commit [tid:0]
    +  81500: MinorExecute: system.cpu.execute: Committing micro-ops for interrupt[tid:0]
    +  81500: MinorExecute: system.cpu.execute: Trying to commit canCommitInsts: 1
    +  81500: MinorExecute: system.cpu.execute: Trying to commit from FUs
    +  81500: MinorExecute: global: ExecContext setting PC: (0x400078=>0x40007c).(0=>1)
    +  81500: MinorExecute: system.cpu.execute: Committing inst: 0/1.1/1/1.1 pc: 0x400078 (movz)
    +  81500: MinorExecute: system.cpu.execute: Unstalling 0 for inst 0/1.1/1/1.1
    +  81500: MinorExecute: system.cpu.execute: Completed inst: 0/1.1/1/1.1 pc: 0x400078 (movz)
    +  81500: MinorScoreboard: system.cpu.execute.scoreboard0: Clearing inst: 0/1.1/1/1.1 pc: 0x400078 (movz) regIndex: 0 final numResults: 0
    +  81500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #1, #0        : IntAlu :  D=0x0000000000000001  FetchSeq=1  CPSeq=1  flags=(IsInteger)
    +  81500: MinorExecute: system.cpu.execute: Trying to commit canCommitInsts: 1
    +  81500: MinorExecute: system.cpu.execute: Trying to commit from FUs
    +  81500: MinorExecute: global: ExecContext setting PC: (0x40007c=>0x400080).(0=>1)
    +  81500: MinorExecute: system.cpu.execute: Committing inst: 0/1.1/1/2.2 pc: 0x40007c (adr)
    +  81500: MinorExecute: system.cpu.execute: Unstalling 1 for inst 0/1.1/1/2.2
    +  81500: MinorExecute: system.cpu.execute: Completed inst: 0/1.1/1/2.2 pc: 0x40007c (adr)
    +  81500: MinorScoreboard: system.cpu.execute.scoreboard0: Clearing inst: 0/1.1/1/2.2 pc: 0x40007c (adr) regIndex: 1 final numResults: 0
    +  81500: MinorExecute: system.cpu.execute: Reached inst commit limit
    +  81500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4    :   adr   x1, #28            : IntAlu :  D=0x0000000000400098  FetchSeq=2  CPSeq=2  flags=(IsInteger)
    +
    +
    +
    +
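Since the trace timestamps are in ticks rather than cycles, a tiny conversion helper makes the latencies above easier to reason about. This is a sketch under one loud assumption: the clock period is taken to be 500 ticks, which is only inferred from the spacing of timestamps in these traces, and `ticks_to_cycles` is a made-up helper name.

```python
CLOCK_PERIOD_TICKS = 500  # assumption inferred from the trace timestamps


def ticks_to_cycles(start_tick, end_tick, period=CLOCK_PERIOD_TICKS):
    """Convert a tick interval into whole CPU cycles."""
    delta = end_tick - start_tick
    if delta % period:
        raise ValueError(f"{delta} ticks is not a whole number of cycles")
    return delta // period


# Issue of the MOVZ at tick 80000, commit at tick 81500.
issue_to_commit = ticks_to_cycles(80000, 81500)
# First icache miss at tick 500, response handled at tick 77000.
first_miss_latency = ticks_to_cycles(500, 77000)
print(issue_to_commit, first_miss_latency)
```

Under that assumption the issue to commit distance is 3 cycles, which is at least consistent with the opLat=3 guess, while the first instruction fetch miss costs 153 cycles.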

    The ifetch for the third instruction returns at:

    +
    +
    +
    +
     129000: Cache: system.cpu.icache: recvTimingResp: Handling response ReadResp [80:bf] IF
    +
    +
    +
    +

    so now we are ready to run the third and fourth instructions of the program:

    +
    +
    +

    ldr x2, =len
    +mov x8, 64

    +
    +
    +

    The LDR goes all the way down to FU 6 which is the memory one:

    +
    +
    +
    +
     132000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) to FU: 0
    + 132000: MinorExecute: system.cpu.execute: Can't issue as FU: 0 isn't capable
    + 132000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) to FU: 1
    + 132000: MinorExecute: system.cpu.execute: Can't issue as FU: 1 isn't capable
    + 132000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) to FU: 2
    + 132000: MinorExecute: system.cpu.execute: Can't issue as FU: 2 isn't capable
    + 132000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) to FU: 3
    + 132000: MinorExecute: system.cpu.execute: Can't issue as FU: 3 isn't capable
    + 132000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) to FU: 4
    + 132000: MinorExecute: system.cpu.execute: Can't issue as FU: 4 isn't capable
    + 132000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) to FU: 5
    + 132000: MinorExecute: system.cpu.execute: Can't issue as FU: 5 isn't capable
    + 132000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) to FU: 6
    + 132000: MinorExecute: system.cpu.execute: Issuing inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) into FU 6
    +
    +
    +
    +

    and then the MOV issue follows soon afterwards (TODO why not at the same time like for the previous pair?):

    +
    +
    +
    +
     132500: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/4.4 pc: 0x400084 (movz) to FU: 0
    + 132500: MinorExecute: system.cpu.execute: Issuing inst: 0/1.1/2/4.4 pc: 0x400084 (movz) into FU 0
    +
    +
    +
    +
    19.20.4.5.1. gem5 event queue MinorCPU syscall emulation freestanding example analysis: hazard
    + +
    19.20.4.6. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis
    -

    TODO: like gem5 event queue MinorCPU syscall emulation freestanding example analysis but even more complex!

    +

    Like gem5 event queue MinorCPU syscall emulation freestanding example analysis but even more complex, since this time it is for the gem5 DerivO3CPU!

    +
    +
    +

    The key new debug flag is O3CPUAll:

    +
    +
    +
    +
    ./run \
    +  --arch aarch64 \
    +  --emulator gem5 \
    +  --userland userland/arch/aarch64/freestanding/linux/hello.S \
    +  --trace FmtFlag,Cache,Event,ExecAll,O3CPUAll \
    +  --trace-stdout \
    +  -- \
    +  --cpu-type DerivO3CPU \
    +  --caches \
    +;
    +
    +
    +
    +

    The output is huge and contains about 7 thousand lines!!!

    +
    +
    +

    This section and children are tested at LKMC 144a552cf926ea630ef9eadbb22b79fe2468c456.

    +
    +
    +
    19.20.4.6.1. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazardless
    +
    +

    Let’s have a look at the arguably simplest example userland/arch/aarch64/freestanding/linux/hazardless.S.

    +
    +
    +

    First let’s start with a [gem5-util-o3-pipeview-py-o3-pipeline-viewer] visualization:

    +
    +
    +
    +
    // f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
    +
    +                                     timeline                                             tick          pc.upc     disasm                      seq_num
    +[.ic.r........................................................................fdn]-(          40000) 0x00400078.0 movz x0, #0, #0           [         1]
    +[.ic.r........................................................................fdn]-(          40000) 0x0040007c.0 movz x1, #1, #0           [         2]
    +[....................fdn.ic.r....................................................]-(         120000) 0x00400080.0 movz x2, #2, #0           [         3]
    +[....................fdn.ic.r....................................................]-(         120000) 0x00400084.0 movz x3, #3, #0           [         4]
    +[....................fdn.ic.r....................................................]-(         120000) 0x00400088.0 movz x4, #4, #0           [         5]
    +[....................fdn.ic.r....................................................]-(         120000) 0x0040008c.0 movz x5, #5, #0           [         6]
    +[....................fdn.ic.r....................................................]-(         120000) 0x00400090.0 movz x6, #6, #0           [         7]
    +[....................fdn.ic.r....................................................]-(         120000) 0x00400094.0 movz x7, #7, #0           [         8]
    +[....................fdn.pic.r...................................................]-(         120000) 0x00400098.0 movz x8, #8, #0           [         9]
    +[....................fdn.pic.r...................................................]-(         120000) 0x0040009c.0 movz x9, #9, #0           [        10]
    +[.....................fdn.ic.r...................................................]-(         120000) 0x004000a0.0 movz x10, #10, #0         [        11]
    +[.....................fdn.ic.r...................................................]-(         120000) 0x004000a4.0 movz x11, #11, #0         [        12]
    +[.....................fdn.ic.r...................................................]-(         120000) 0x004000a8.0 movz x12, #12, #0         [        13]
    +[.....................fdn.ic.r...................................................]-(         120000) 0x004000ac.0 movz x13, #13, #0         [        14]
    +[.....................fdn.pic.r..................................................]-(         120000) 0x004000b0.0 movz x14, #14, #0         [        15]
    +[.....................fdn.pic.r..................................................]-(         120000) 0x004000b4.0 movz x15, #15, #0         [        16]
    +[.....................fdn.pic.r..................................................]-(         120000) 0x004000b8.0 movz x16, #16, #0         [        17]
    +[.....................fdn.pic.r..................................................]-(         120000) 0x004000bc.0 movz x17, #17, #0         [        18]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000c0.0 movz x18, #18, #0         [        19]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000c4.0 movz x19, #19, #0         [        20]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000c8.0 movz x20, #20, #0         [        21]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000cc.0 movz x21, #21, #0         [        22]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000d0.0 movz x22, #22, #0         [        23]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000d4.0 movz x23, #23, #0         [        24]
    +[............................................fdn.pic.r...........................]-(         160000) 0x004000d8.0 movz x24, #24, #0         [        25]
    +[............................................fdn.pic.r...........................]-(         160000) 0x004000dc.0 movz x25, #25, #0         [        26]
    +[.............................................fdn.ic.r...........................]-(         160000) 0x004000e0.0 movz x26, #26, #0         [        27]
    +[.............................................fdn.ic.r...........................]-(         160000) 0x004000e4.0 movz x27, #27, #0         [        28]
    +[.............................................fdn.ic.r...........................]-(         160000) 0x004000e8.0 movz x28, #28, #0         [        29]
    +[.............................................fdn.ic.r...........................]-(         160000) 0x004000ec.0 movz x29, #29, #0         [        30]
    +[.............................................fdn.pic.r..........................]-(         160000) 0x004000f0.0 movz x0, #0, #0           [        31]
    +[.............................................fdn.pic.r..........................]-(         160000) 0x004000f4.0 movz x1, #1, #0           [        32]
    +[.............................................fdn.pic.r..........................]-(         160000) 0x004000f8.0 movz x2, #2, #0           [        33]
    +[.............................................fdn.pic.r..........................]-(         160000) 0x004000fc.0 movz x3, #3, #0           [        34]
    +
    +
    +
    +

    The first group of instructions has only two instructions because the first instruction is at address 0x400078, so only two instructions fit on that cache line, as the next cache line starts at 0x400080!
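The cache line arithmetic behind that observation can be sketched in a few lines of Python. The 64-byte line size is taken from the `[40:7f]` style ranges in the cache traces, the 4-byte instruction size is the fixed AArch64 encoding width, and `instructions_left_in_line` is just an illustrative helper name.

```python
CACHE_LINE_BYTES = 64  # implied by ranges such as [40:7f] in the cache trace
INSTRUCTION_BYTES = 4  # fixed-width AArch64 instructions


def instructions_left_in_line(address):
    """Count the instructions from `address` up to the end of its cache line."""
    line_start = address - (address % CACHE_LINE_BYTES)
    next_line = line_start + CACHE_LINE_BYTES
    return (next_line - address) // INSTRUCTION_BYTES


entry_count = instructions_left_in_line(0x400078)
full_line_count = instructions_left_in_line(0x400080)
print(entry_count, full_line_count)
```

So the entry point leaves room for 2 instructions before the line boundary at 0x400080, and each following full line holds 16.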

    +
    +
    +

    The initial fdn on top middle is likely bugged out, did it wrap around? But the rest makes sense.

    +
    +
    +

    From this, we clearly see that up to 8 instructions can be issued concurrently, which matches the default width values we had seen at gem5 DerivO3CPU.

    +
    +
    +

    For example, we can clearly see how:

    +
    +
    +
      +
    • +

      movz x2 through to movz x9 start running at the exact same time. TODO why does movz x7 do fdn.ic.r while movz x8 does fdn.pic.r? How are they different?

      +
    • +
    • +

      movz x10 through movz x17 then start running one cycle later. This second chunk is fully pipelined with the first instruction pack.

      +
    • +
    • +

      then comes a pause while the next fetch comes back. This group of 16 instructions took up the entire 64-byte cacheline that had been read

      +
    • +
    +
    +
    +
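To compare rows like the ones in the bullets above without squinting at dots, the pipeline viewer output can be parsed with a small script. This is a sketch only: the regular expression was written against the uncolored sample rows shown in this section, not against the o3-pipeview.py source, and `parse_rows` is a made-up helper. Rows whose timeline wrapped around will list their stage letters in wrapped order.

```python
import re

# One row of the uncolored o3-pipeview.py output, as seen in the sample above.
ROW = re.compile(
    r"^\[(?P<timeline>[^\]]+)\]-\(\s*(?P<tick>\d+)\)\s+"
    r"(?P<pc>0x[0-9a-f]+)\.(?P<upc>\d+)\s+"
    r"(?P<disasm>.*?)\s+\[\s*(?P<seq>\d+)\]$"
)

# Two rows copied from the sample output (sequence numbers 8 and 9).
SAMPLE = """\
[....................fdn.ic.r....................................................]-(         120000) 0x00400094.0 movz x7, #7, #0           [         8]
[....................fdn.pic.r...................................................]-(         120000) 0x00400098.0 movz x8, #8, #0           [         9]
"""


def parse_rows(text):
    """Return one dict per instruction row, skipping legend and header lines."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue
        rows.append({
            "seq": int(match["seq"]),
            "pc": int(match["pc"], 16),
            "disasm": match["disasm"],
            # Keep only the stage letters, in the order they are drawn.
            "stages": "".join(ch for ch in match["timeline"] if ch.isalpha()),
        })
    return rows


rows = parse_rows(SAMPLE)
with_dispatch = [row["seq"] for row in rows if "p" in row["stages"]]
print(rows[0]["stages"], rows[1]["stages"], with_dispatch)
```

Listing the sequence numbers that show a visible `p` column makes it easy to see which instructions of a group differ, even if it does not by itself explain why.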

    First we can have a look at ExecEnable to get an initial idea of how many instructions are run at one time:

    +
    +
    +
    +
      78500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #0, #0        : IntAlu :  D=0x0000000000000000  FetchSeq=1  CPSeq=1  flags=(IsInteger)
    +  78500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4    :   movz   x1, #1, #0        : IntAlu :  D=0x0000000000000001  FetchSeq=2  CPSeq=2  flags=(IsInteger)
    +
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8    :   movz   x2, #2, #0        : IntAlu :  D=0x0000000000000002  FetchSeq=3  CPSeq=3  flags=(IsInteger)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+12    :   movz   x3, #3, #0        : IntAlu :  D=0x0000000000000003  FetchSeq=4  CPSeq=4  flags=(IsInteger)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+16    :   movz   x4, #4, #0        : IntAlu :  D=0x0000000000000004  FetchSeq=5  CPSeq=5  flags=(IsInteger)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+20    :   movz   x5, #5, #0        : IntAlu :  D=0x0000000000000005  FetchSeq=6  CPSeq=6  flags=(IsInteger)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+24    :   movz   x6, #6, #0        : IntAlu :  D=0x0000000000000006  FetchSeq=7  CPSeq=7  flags=(IsInteger)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+28    :   movz   x7, #7, #0        : IntAlu :  D=0x0000000000000007  FetchSeq=8  CPSeq=8  flags=(IsInteger)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+32    :   movz   x8, #8, #0        : IntAlu :  D=0x0000000000000008  FetchSeq=9  CPSeq=9  flags=(IsInteger)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+36    :   movz   x9, #9, #0        : IntAlu :  D=0x0000000000000009  FetchSeq=10  CPSeq=10  flags=(IsInteger)
    +
    + 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+40    :   movz   x10, #10, #0      : IntAlu :  D=0x000000000000000a  FetchSeq=11  CPSeq=11  flags=(IsInteger)
    + 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+44    :   movz   x11, #11, #0      : IntAlu :  D=0x000000000000000b  FetchSeq=12  CPSeq=12  flags=(IsInteger)
    + 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+48    :   movz   x12, #12, #0      : IntAlu :  D=0x000000000000000c  FetchSeq=13  CPSeq=13  flags=(IsInteger)
    + 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+52    :   movz   x13, #13, #0      : IntAlu :  D=0x000000000000000d  FetchSeq=14  CPSeq=14  flags=(IsInteger)
    + 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+56    :   movz   x14, #14, #0      : IntAlu :  D=0x000000000000000e  FetchSeq=15  CPSeq=15  flags=(IsInteger)
    + 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+60    :   movz   x15, #15, #0      : IntAlu :  D=0x000000000000000f  FetchSeq=16  CPSeq=16  flags=(IsInteger)
    + 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+64    :   movz   x16, #16, #0      : IntAlu :  D=0x0000000000000010  FetchSeq=17  CPSeq=17  flags=(IsInteger)
    + 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+68    :   movz   x17, #17, #0      : IntAlu :  D=0x0000000000000011  FetchSeq=18  CPSeq=18  flags=(IsInteger)
    +
    +
    +
    +

    This suggests 8, but remember that ExecEnable shows issue time labels, which do not coincide necessarily with commit times. As we saw in the pipeline viewer above, instructions 9 and 10 have one extra stage.
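Counting ExecEnable lines per timestamp by hand gets old quickly on a 7 thousand line log, so here is a minimal sketch of how the grouping could be automated. The `count_per_tick` helper and its regular expression are illustrative assumptions based on the line format seen above, and the sample only contains three lines copied from the trace.

```python
import re
from collections import Counter

# Matches the "<tick>: ExecEnable: ... FetchSeq=<n>" lines shown above.
EXEC_LINE = re.compile(r"^\s*(?P<tick>\d+): ExecEnable: .*FetchSeq=(?P<seq>\d+)")

# Three lines copied from the trace above.
SAMPLE_TRACE = """\
  78500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue    :   movz   x0, #0, #0        : IntAlu :  D=0x0000000000000000  FetchSeq=1  CPSeq=1  flags=(IsInteger)
  78500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4    :   movz   x1, #1, #0        : IntAlu :  D=0x0000000000000001  FetchSeq=2  CPSeq=2  flags=(IsInteger)
 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8    :   movz   x2, #2, #0        : IntAlu :  D=0x0000000000000002  FetchSeq=3  CPSeq=3  flags=(IsInteger)
"""


def count_per_tick(trace_text):
    """Return {tick: number of ExecEnable lines carrying that timestamp}."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = EXEC_LINE.match(line)
        if match:
            counts[int(match["tick"])] += 1
    return dict(counts)


per_tick = count_per_tick(SAMPLE_TRACE)
print(per_tick)
```

Run on the full trace, the same function should report 8 lines for each of the 130000 and 130500 timestamps listed above, keeping in mind the caveat about what those timestamps mean.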

    +
    +
    +

    After the initial two execs from the first cache line, the full commit log chunk around the first group of six `ExecEnable`s looks like:

    +
    +
    +
    +
     133500: Commit: system.cpu.commit: Getting instructions from Rename stage.
    + 133500: Commit: system.cpu.commit: Trying to commit instructions in the ROB.
    +
    + 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:3]
    + 133500: Commit: system.cpu.commit: [tid:0] [sn:3] Committing instruction with PC (0x400080=>0x400084).(0=>1)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8    :   movz   x2, #2, #0        : IntAlu :  D=0x0000000000000002  FetchSeq=3  CPSeq=3  flags=(IsInteger)
    + 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400080=>0x400084).(0=>1), [sn:3]
    + 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400080=>0x400084).(0=>1) [sn:3]
    +
    + 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:4]
    + 133500: Commit: system.cpu.commit: [tid:0] [sn:4] Committing instruction with PC (0x400084=>0x400088).(0=>1)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+12    :   movz   x3, #3, #0        : IntAlu :  D=0x0000000000000003  FetchSeq=4  CPSeq=4  flags=(IsInteger)
    + 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400084=>0x400088).(0=>1), [sn:4]
    + 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400084=>0x400088).(0=>1) [sn:4]
    +
    + 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:5]
    + 133500: Commit: system.cpu.commit: [tid:0] [sn:5] Committing instruction with PC (0x400088=>0x40008c).(0=>1)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+16    :   movz   x4, #4, #0        : IntAlu :  D=0x0000000000000004  FetchSeq=5  CPSeq=5  flags=(IsInteger)
    + 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400088=>0x40008c).(0=>1), [sn:5]
    + 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400088=>0x40008c).(0=>1) [sn:5]
    +
    + 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:6]
    + 133500: Commit: system.cpu.commit: [tid:0] [sn:6] Committing instruction with PC (0x40008c=>0x400090).(0=>1)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+20    :   movz   x5, #5, #0        : IntAlu :  D=0x0000000000000005  FetchSeq=6  CPSeq=6  flags=(IsInteger)
    + 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x40008c=>0x400090).(0=>1), [sn:6]
    + 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x40008c=>0x400090).(0=>1) [sn:6]
    +
    + 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:7]
    + 133500: Commit: system.cpu.commit: [tid:0] [sn:7] Committing instruction with PC (0x400090=>0x400094).(0=>1)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+24    :   movz   x6, #6, #0        : IntAlu :  D=0x0000000000000006  FetchSeq=7  CPSeq=7  flags=(IsInteger)
    + 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400090=>0x400094).(0=>1), [sn:7]
    + 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400090=>0x400094).(0=>1) [sn:7]
    +
    + 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:8]
    + 133500: Commit: system.cpu.commit: [tid:0] [sn:8] Committing instruction with PC (0x400094=>0x400098).(0=>1)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+28    :   movz   x7, #7, #0        : IntAlu :  D=0x0000000000000007  FetchSeq=8  CPSeq=8  flags=(IsInteger)
    + 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400094=>0x400098).(0=>1), [sn:8]
    + 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400094=>0x400098).(0=>1) [sn:8]
    +
    + 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x400098=>0x40009c).(0=>1), [sn:9] ready within ROB.
    + 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x40009c=>0x4000a0).(0=>1), [sn:10] ready within ROB.
    + 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000a0=>0x4000a4).(0=>1), [sn:11] ready within ROB.
    + 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000a4=>0x4000a8).(0=>1), [sn:12] ready within ROB.
    + 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000a8=>0x4000ac).(0=>1), [sn:13] ready within ROB.
    + 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000ac=>0x4000b0).(0=>1), [sn:14] ready within ROB.
    + 133500: Commit: system.cpu.commit: [tid:0] Instruction [sn:9] PC (0x400098=>0x40009c).(0=>1) is head of ROB and ready to commit
    + 133500: Commit: system.cpu.commit: [tid:0] ROB has 10 insts & 182 free entries.
    +
    +
    +
    +

    ROB stands for Re-order buffer.
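For readers who have not met a re-order buffer before, the following toy model shows the one property that matters for reading the log above: instructions may become ready in any order, but they retire strictly from the head. It is a sketch and not gem5's implementation; the `ReorderBuffer` class and its methods are invented for illustration, and the capacity of 192 is simply the sum of the "10 insts & 182 free entries" reported in the log line.

```python
from collections import deque


class ReorderBuffer:
    """Toy re-order buffer: out-of-order completion, in-order retirement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # sequence numbers, oldest at the left
        self.ready = set()      # sequence numbers that finished executing

    def insert(self, seq_num):
        if len(self.entries) >= self.capacity:
            raise RuntimeError("ROB full")
        self.entries.append(seq_num)

    def mark_ready(self, seq_num):
        self.ready.add(seq_num)

    def free_entries(self):
        return self.capacity - len(self.entries)

    def commit(self, width):
        """Retire up to `width` instructions, stopping at the first unready head."""
        committed = []
        while (self.entries and len(committed) < width
               and self.entries[0] in self.ready):
            seq_num = self.entries.popleft()
            self.ready.discard(seq_num)
            committed.append(seq_num)
        return committed


rob = ReorderBuffer(capacity=192)  # 10 insts + 182 free entries in the log
for seq_num in range(3, 13):       # ten in-flight instructions
    rob.insert(seq_num)
free_with_ten = rob.free_entries()

rob.mark_ready(4)                  # a younger instruction finishes first
blocked = rob.commit(width=8)      # head [sn:3] is not ready, nothing retires
rob.mark_ready(3)
retired = rob.commit(width=8)      # now 3 and 4 retire in program order
print(free_with_ten, blocked, retired)
```

This mirrors the "is head of ROB and ready to commit" wording of the log: readiness alone is not enough, the instruction also has to reach the head.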

    +
    +
    +

    0x400080⇒0x400084 shows the old and new PC values of the first committed instruction.

    +
    +
    +

    Another thing we can do is to try to follow one of the instructions back as it goes through the pipeline. Searching for example for the address 0x400080, we find:

    +
    +
    +

    The first mention of the address happens when the fetch of the two initial instructions completes. TODO not sure why it doesn’t just also fetch the next cache line at the same time:

    +
    +
    +
    +
    FullO3CPU: Ticking main, FullO3CPU.
    +  78500: Fetch: system.cpu.fetch: Running stage.
    +  78500: Fetch: system.cpu.fetch: Attempting to fetch from [tid:0]
    +  78500: Fetch: system.cpu.fetch: [tid:0] Icache miss is complete.
    +  78500: Fetch: system.cpu.fetch: [tid:0] Adding instructions to queue to decode.
    +  78500: DynInst: global: DynInst: [sn:1] Instruction created. Instcount for system.cpu = 1
    +  78500: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x400078 (0) created [sn:1].
    +  78500: Fetch: system.cpu.fetch: [tid:0] Instruction is:   movz   x0, #0, #0
    +  78500: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (1/32).
    +  78500: DynInst: global: DynInst: [sn:2] Instruction created. Instcount for system.cpu = 2
    +  78500: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x40007c (0) created [sn:2].
    +  78500: Fetch: system.cpu.fetch: [tid:0] Instruction is:   movz   x1, #1, #0
    +  78500: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (2/32).
    +  78500: Fetch: system.cpu.fetch: [tid:0] Issuing a pipelined I-cache access, starting at PC (0x400080=>0x400084).(0=>1).
    +  78500: Fetch: system.cpu.fetch: [tid:0] Fetching cache line 0x400080 for addr 0x400080
    +
    +
    +
    +

    so we observe that the first two instructions arrived, and the CPU noticed that 0x400080 hasn’t been fetched yet.

    +
    +
    +

    Then for several cycles that follow, the fetch stage just says that it is blocked on data returning, e.g.:

    +
    +
    +
    +
    FullO3CPU: Ticking main, FullO3CPU.
    +  79000: Fetch: system.cpu.fetch: Running stage.
    +  79000: Fetch: system.cpu.fetch: There are no more threads available to fetch from.
    +  79000: Fetch: system.cpu.fetch: [tid:0] Fetch is waiting cache response!
    +
    +
    +
    +

    At the same time, the execution of the initial 2 instructions progresses through the pipeline.

    +
    +
    +

    These progress up until:

    +
    +
    +
    +
      88000: O3CPU: system.cpu: Idle!
    +
    +
    +
    +

    at which point there are no more events scheduled besides waiting for the second cache line to come back.

    +
    +
    +

    After this, some time passes without events, and the next tick happens when the fetch data returns:

    +
    +
    +
    +
    FullO3CPU: Ticking main, FullO3CPU.
    + 130000: Fetch: system.cpu.fetch: Running stage.
    + 130000: Fetch: system.cpu.fetch: Attempting to fetch from [tid:0]
    + 130000: Fetch: system.cpu.fetch: [tid:0] Icache miss is complete.
    + 130000: Fetch: system.cpu.fetch: [tid:0] Adding instructions to queue to decode.
    + 130000: DynInst: global: DynInst: [sn:3] Instruction created. Instcount for system.cpu = 1
    + 130000: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x400080 (0) created [sn:3].
    + 130000: Fetch: system.cpu.fetch: [tid:0] Instruction is:   movz   x2, #2, #0
    + 130000: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (1/32).
    + 130000: DynInst: global: DynInst: [sn:4] Instruction created. Instcount for system.cpu = 2
    + 130000: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x400084 (0) created [sn:4].
    + 130000: Fetch: system.cpu.fetch: [tid:0] Instruction is:   movz   x3, #3, #0
    + 130000: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (2/32).
    + 130000: DynInst: global: DynInst: [sn:5] Instruction created. Instcount for system.cpu = 3
    +
    +
    +
    +
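The gap between the idle message and the fetch response can be pictured with a toy event queue. This is only a sketch of the general idea of event-driven simulation, not gem5's actual EventQueue code: the `run` helper and the event descriptions are made up for illustration, and only the three tick values are taken from the logs above.

```python
import heapq

def run(events):
    """Pop (tick, description) events in tick order.

    Returns a list of (tick, ticks_skipped_since_previous_event, description).
    """
    queue = list(events)
    heapq.heapify(queue)
    now = 0
    visited = []
    while queue:
        tick, what = heapq.heappop(queue)
        # Simulated time jumps straight to the next scheduled event:
        # nothing is simulated for the ticks in between.
        visited.append((tick, tick - now, what))
        now = tick
    return visited

# Hypothetical event list, using ticks seen in the logs above.
events = [
    (130000, "icache response arrives, fetch resumes"),
    (78500, "fetch requests cache line 0x400080"),
    (88000, "CPU goes idle"),
]
for tick, skipped, what in run(events):
    print(f"{tick:>7}: (+{skipped}) {what}")
```

In this toy model the jump from 88000 to 130000 skips 42000 ticks in a single step, which is the "some time passes without events" period described above.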
    +
    19.20.4.6.2. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard
    + +
    +
    +
    // f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
    +
    +                                     timeline                                             tick          pc.upc     disasm                      seq_num
    +[.ic.r........................................................................fdn]-(          40000) 0x00400078.0 movz x0, #0, #0           [         1]
    +[.ic.r........................................................................fdn]-(          40000) 0x0040007c.0 movz x1, #1, #0           [         2]
    +[....................fdn.ic.r....................................................]-(         120000) 0x00400080.0 movz x2, #2, #0           [         3]
    +[....................fdn.pic.r...................................................]-(         120000) 0x00400084.0 add x3, x2, #1            [         4]
    +[....................fdn.ic..r...................................................]-(         120000) 0x00400088.0 movz x4, #4, #0           [         5]
    +[....................fdn.ic..r...................................................]-(         120000) 0x0040008c.0 movz x5, #5, #0           [         6]
    +[....................fdn.ic..r...................................................]-(         120000) 0x00400090.0 movz x6, #6, #0           [         7]
    +[....................fdn.ic..r...................................................]-(         120000) 0x00400094.0 movz x7, #7, #0           [         8]
    +[....................fdn.ic..r...................................................]-(         120000) 0x00400098.0 movz x8, #8, #0           [         9]
    +[....................fdn.pic.r...................................................]-(         120000) 0x0040009c.0 movz x9, #9, #0           [        10]
    +[.....................fdn.ic.r...................................................]-(         120000) 0x004000a0.0 movz x10, #10, #0         [        11]
    +[.....................fdn.ic..r..................................................]-(         120000) 0x004000a4.0 movz x11, #11, #0         [        12]
    +[.....................fdn.ic..r..................................................]-(         120000) 0x004000a8.0 movz x12, #12, #0         [        13]
    +[.....................fdn.ic..r..................................................]-(         120000) 0x004000ac.0 movz x13, #13, #0         [        14]
    +[.....................fdn.pic.r..................................................]-(         120000) 0x004000b0.0 movz x14, #14, #0         [        15]
    +[.....................fdn.pic.r..................................................]-(         120000) 0x004000b4.0 movz x15, #15, #0         [        16]
    +[.....................fdn.pic.r..................................................]-(         120000) 0x004000b8.0 movz x16, #16, #0         [        17]
    +[.....................fdn.pic.r..................................................]-(         120000) 0x004000bc.0 movz x17, #17, #0         [        18]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000c0.0 movz x18, #18, #0         [        19]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000c4.0 movz x19, #19, #0         [        20]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000c8.0 movz x20, #20, #0         [        21]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000cc.0 movz x21, #21, #0         [        22]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000d0.0 movz x22, #22, #0         [        23]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000d4.0 movz x23, #23, #0         [        24]
    +[............................................fdn.pic.r...........................]-(         160000) 0x004000d8.0 movz x24, #24, #0         [        25]
    +[............................................fdn.pic.r...........................]-(         160000) 0x004000dc.0 movz x25, #25, #0         [        26]
    +[.............................................fdn.ic.r...........................]-(         160000) 0x004000e0.0 movz x0, #0, #0           [        27]
    +[.............................................fdn.ic.r...........................]-(         160000) 0x004000e4.0 movz x8, #93, #0          [        28]
    +
    +
    +
    +

    TODO understand how the hazard happens in detail.

    +
    +
    +
    +
    19.20.4.6.3. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard4
    + +
    +
    +
    // f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
    +
    +                                     timeline                                             tick          pc.upc     disasm                      seq_num
    +[.ic.r........................................................................fdn]-(          40000) 0x00400078.0 movz x0, #0, #0           [         1]
    +[.ic.r........................................................................fdn]-(          40000) 0x0040007c.0 movz x1, #1, #0           [         2]
    +[....................fdn.ic.r....................................................]-(         120000) 0x00400080.0 movz x2, #2, #0           [         3]
    +[....................fdn.pic.r...................................................]-(         120000) 0x00400084.0 add x3, x2, #1            [         4]
    +[....................fdn.p.ic.r..................................................]-(         120000) 0x00400088.0 add x4, x3, #1            [         5]
    +[....................fdn.p..ic.r.................................................]-(         120000) 0x0040008c.0 add x5, x4, #1            [         6]
    +[....................fdn.p...ic.r................................................]-(         120000) 0x00400090.0 add x6, x5, #1            [         7]
    +[....................fdn.ic.....r................................................]-(         120000) 0x00400094.0 movz x7, #7, #0           [         8]
    +[....................fdn.ic.....r................................................]-(         120000) 0x00400098.0 movz x8, #8, #0           [         9]
    +[....................fdn.ic.....r................................................]-(         120000) 0x0040009c.0 movz x9, #9, #0           [        10]
    +[.....................fdn.ic....r................................................]-(         120000) 0x004000a0.0 movz x10, #10, #0         [        11]
    +[.....................fdn.ic....r................................................]-(         120000) 0x004000a4.0 movz x11, #11, #0         [        12]
    +[.....................fdn.ic....r................................................]-(         120000) 0x004000a8.0 movz x12, #12, #0         [        13]
    +[.....................fdn.ic....r................................................]-(         120000) 0x004000ac.0 movz x13, #13, #0         [        14]
    +[.....................fdn.ic.....r...............................................]-(         120000) 0x004000b0.0 movz x14, #14, #0         [        15]
    +[.....................fdn.pic....r...............................................]-(         120000) 0x004000b4.0 movz x15, #15, #0         [        16]
    +[.....................fdn.pic....r...............................................]-(         120000) 0x004000b8.0 movz x16, #16, #0         [        17]
    +[.....................fdn.pic....r...............................................]-(         120000) 0x004000bc.0 movz x17, #17, #0         [        18]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000c0.0 movz x18, #18, #0         [        19]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000c4.0 movz x19, #19, #0         [        20]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000c8.0 movz x20, #20, #0         [        21]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000cc.0 movz x21, #21, #0         [        22]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000d0.0 movz x22, #22, #0         [        23]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000d4.0 movz x23, #23, #0         [        24]
    +[............................................fdn.pic.r...........................]-(         160000) 0x004000d8.0 movz x24, #24, #0         [        25]
    +[............................................fdn.pic.r...........................]-(         160000) 0x004000dc.0 movz x25, #25, #0         [        26]
    +[.............................................fdn.ic.r...........................]-(         160000) 0x004000e0.0 movz x0, #0, #0           [        27]
    +[.............................................fdn.ic.r...........................]-(         160000) 0x004000e4.0 movz x8, #93, #0          [        28]
    +
    +
    +
    +
    +
    19.20.4.6.4. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall
    + +
    +

    We can see here that:

    +
    +
    +
      +
    • +

      the addition of a data section entry shifted our previous address layout a bit: the entry point is now 0x004000b0, so 4 instructions fit in the first cacheline instead of 2

      +
    • +
    • +

      the LDR happens to be the fourth instruction, so it takes a long time to retire. The time is about 40k ticks which, as expected, is about the same as the time it takes for the instruction fetch.

      +
    • +
    • +

      fetch does not continue past the LDR, and so nothing is gained in this particular example, since the next instructions haven’t been fetched from memory yet!

      +
    • +
    +
    +
    +
    +
    // f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
    +
    +                                     timeline                                             tick          pc.upc     disasm                      seq_num
    +[.ic.r........................................................................fdn]-(          40000) 0x004000b0.0 movz x0, #0, #0           [         1]
    +[.ic.r........................................................................fdn]-(          40000) 0x004000b4.0 movz x1, #1, #0           [         2]
    +[.ic.r........................................................................fdn]-(          40000) 0x004000b8.0 adr x2, #65780            [         3]
    +[.............................................................................fdn]-(          40000) 0x004000bc.0 ldr x3, [x2]              [         4]
    +[.pic............................................................................]-(          80000)     ...
    +[................................r...............................................]-(         120000)     ...
    +[....................fdn.ic......r...............................................]-(         120000) 0x004000c0.0 movz x4, #4, #0           [         5]
    +[....................fdn.ic......r...............................................]-(         120000) 0x004000c4.0 movz x5, #5, #0           [         6]
    +[....................fdn.ic......r...............................................]-(         120000) 0x004000c8.0 movz x6, #6, #0           [         7]
    +[....................fdn.ic......r...............................................]-(         120000) 0x004000cc.0 movz x7, #7, #0           [         8]
    +[....................fdn.ic......r...............................................]-(         120000) 0x004000d0.0 movz x8, #8, #0           [         9]
    +[....................fdn.ic......r...............................................]-(         120000) 0x004000d4.0 movz x9, #9, #0           [        10]
    +[....................fdn.pic.....r...............................................]-(         120000) 0x004000d8.0 movz x10, #10, #0         [        11]
    +[....................fdn.pic......r..............................................]-(         120000) 0x004000dc.0 movz x11, #11, #0         [        12]
    +[.....................fdn.ic......r..............................................]-(         120000) 0x004000e0.0 movz x12, #12, #0         [        13]
    +[.....................fdn.ic......r..............................................]-(         120000) 0x004000e4.0 movz x13, #13, #0         [        14]
    +[.....................fdn.ic......r..............................................]-(         120000) 0x004000e8.0 movz x14, #14, #0         [        15]
    +[.....................fdn.ic......r..............................................]-(         120000) 0x004000ec.0 movz x15, #15, #0         [        16]
    +[.....................fdn.pic.....r..............................................]-(         120000) 0x004000f0.0 movz x16, #16, #0         [        17]
    +[.....................fdn.pic.....r..............................................]-(         120000) 0x004000f4.0 movz x17, #17, #0         [        18]
    +[.....................fdn.pic.....r..............................................]-(         120000) 0x004000f8.0 movz x18, #18, #0         [        19]
    +[.....................fdn.pic......r.............................................]-(         120000) 0x004000fc.0 movz x19, #19, #0         [        20]
    +
    +
    +
    +
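The "4 instructions instead of 2" count can be checked with a bit of arithmetic. The sketch below assumes 64-byte cache lines (consistent with the fetch groups starting at 0x400080 and 0x4000c0 in the traces) and 4-byte instructions. The helper name `instructions_left_in_line` is made up for illustration.

```python
CACHE_LINE_BYTES = 64   # assumption: matches fetch groups at 0x400080 and 0x4000c0
INSTRUCTION_BYTES = 4   # fixed-width AArch64 instructions

def instructions_left_in_line(pc):
    """Number of instructions from pc up to the end of its cache line."""
    offset = pc % CACHE_LINE_BYTES
    return (CACHE_LINE_BYTES - offset) // INSTRUCTION_BYTES

print(instructions_left_in_line(0x400078))  # entry point of the earlier examples: 2
print(instructions_left_in_line(0x4000b0))  # entry point with the data section entry: 4
```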
    +
    19.20.4.6.5. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall-gain
    + +
    +

    So in this case we see that there were actual potential gains, since the movz x11 started running immediately. We just stopped at movz x20 because a new ifetch was needed.

    +
    +
    +
    +
    // f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
    +
    +                                     timeline                                             tick          pc.upc     disasm                      seq_num
    +[.ic.r........................................................................fdn]-(          40000) 0x004000b0.0 movz x0, #0, #0           [         1]
    +[.ic.r........................................................................fdn]-(          40000) 0x004000b4.0 movz x1, #1, #0           [         2]
    +[.ic.r........................................................................fdn]-(          40000) 0x004000b8.0 movz x2, #4, #0           [         3]
    +[.ic.r........................................................................fdn]-(          40000) 0x004000bc.0 movz x3, #5, #0           [         4]
    +[....................fdn.ic.r....................................................]-(         120000) 0x004000c0.0 adr x4, #65772            [         5]
    +[....................fdn.pic.....................................................]-(         120000) 0x004000c4.0 ldr x5, [x4]              [         6]
    +[........................................................r.......................]-(         160000)     ...
    +[....................fdn.ic......................................................]-(         120000) 0x004000c8.0 movz x6, #6, #0           [         7]
    +[........................................................r.......................]-(         160000)     ...
    +[....................fdn.ic......................................................]-(         120000) 0x004000cc.0 movz x7, #7, #0           [         8]
    +[........................................................r.......................]-(         160000)     ...
    +[....................fdn.ic......................................................]-(         120000) 0x004000d0.0 movz x8, #8, #0           [         9]
    +[........................................................r.......................]-(         160000)     ...
    +[....................fdn.ic......................................................]-(         120000) 0x004000d4.0 movz x9, #9, #0           [        10]
    +[........................................................r.......................]-(         160000)     ...
    +[....................fdn.ic......................................................]-(         120000) 0x004000d8.0 movz x10, #10, #0         [        11]
    +[........................................................r.......................]-(         160000)     ...
    +[....................fdn.pic.....................................................]-(         120000) 0x004000dc.0 movz x11, #11, #0         [        12]
    +[........................................................r.......................]-(         160000)     ...
    +[.....................fdn.ic.....................................................]-(         120000) 0x004000e0.0 movz x12, #12, #0         [        13]
    +[........................................................r.......................]-(         160000)     ...
    +[.....................fdn.ic.....................................................]-(         120000) 0x004000e4.0 movz x13, #13, #0         [        14]
    +[.........................................................r......................]-(         160000)     ...
    +[.....................fdn.ic.....................................................]-(         120000) 0x004000e8.0 movz x14, #14, #0         [        15]
    +[.........................................................r......................]-(         160000)     ...
    +[.....................fdn.ic.....................................................]-(         120000) 0x004000ec.0 movz x15, #15, #0         [        16]
    +[.........................................................r......................]-(         160000)     ...
    +[.....................fdn.ic.....................................................]-(         120000) 0x004000f0.0 movz x16, #16, #0         [        17]
    +[.........................................................r......................]-(         160000)     ...
    +[.....................fdn.pic....................................................]-(         120000) 0x004000f4.0 movz x17, #17, #0         [        18]
    +[.........................................................r......................]-(         160000)     ...
    +[.....................fdn.pic....................................................]-(         120000) 0x004000f8.0 movz x18, #18, #0         [        19]
    +[.........................................................r......................]-(         160000)     ...
    +[.....................fdn.pic....................................................]-(         120000) 0x004000fc.0 movz x19, #19, #0         [        20]
    +[.........................................................r......................]-(         160000)     ...
    +[............................................fdn.ic.......r......................]-(         160000) 0x00400100.0 movz x20, #20, #0         [        21]
    +[............................................fdn.ic........r.....................]-(         160000) 0x00400104.0 movz x21, #21, #0         [        22]
    +[............................................fdn.ic........r.....................]-(         160000) 0x00400108.0 movz x22, #22, #0         [        23]
    +[............................................fdn.ic........r.....................]-(         160000) 0x0040010c.0 movz x23, #23, #0         [        24]
    +[............................................fdn.ic........r.....................]-(         160000) 0x00400110.0 movz x24, #24, #0         [        25]
    +[............................................fdn.ic........r.....................]-(         160000) 0x00400114.0 movz x25, #25, #0         [        26]
    +[............................................fdn.pic.......r.....................]-(         160000) 0x00400118.0 movz x26, #26, #0         [        27]
    +[............................................fdn.pic.......r.....................]-(         160000) 0x0040011c.0 movz x27, #27, #0         [        28]
    +[.............................................fdn.ic.......r.....................]-(         160000) 0x00400120.0 movz x28, #28, #0         [        29]
    +[.............................................fdn.ic........r....................]-(         160000) 0x00400124.0 movz x29, #29, #0         [        30]
    +[.............................................fdn.ic........r....................]-(         160000) 0x00400128.0 movz x0, #0, #0           [        31]
    +[.............................................fdn.ic........r....................]-(         160000) 0x0040012c.0 movz x1, #1, #0           [        32]
    +[.............................................fdn.pic.......r....................]-(         160000) 0x00400130.0 movz x2, #2, #0           [        33]
    +[.............................................fdn.pic.......r....................]-(         160000) 0x00400134.0 movz x3, #3, #0           [        34]
    +[.............................................fdn.pic.......r....................]-(         160000) 0x00400138.0 movz x4, #4, #0           [        35]
    +[.............................................fdn.pic.......r....................]-(         160000) 0x0040013c.0 movz x5, #5, #0           [        36]
    +
    +
    +
    +

    We now also understand the graph better from lines such as this:

    +
    +
    +
    +
    [....................fdn.pic.....................................................]-(         120000) 0x004000c4.0 ldr x5, [x4]              [         6]
    +[........................................................r.......................]-(         160000)     ...
    +[....................fdn.ic......................................................]-(         120000) 0x004000c8.0 movz x6, #6, #0           [         7]
    +[........................................................r.......................]-(         160000)     ...
    +
    +
    +
    +

    We see that extra lines are drawn (the 160000 …​ lines here) whenever something stalls for a period longer than the width of the visualisation.

    +
    +
    +

    Things are still relatively readable because the wrapping aligns the extra lines with events that actually happened in that tick range, e.g. the line 160000) 0x00400100.0 movz x20, #20, #0.

    +
    +
    +

    But from this we kind of see the need for: gem5 Konata O3 pipeline viewer.
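Until then, the rows can be decoded by hand. The following is a rough sketch of how to turn one timeline row into absolute ticks, based only on reading the traces above and not on the viewer's source code. It assumes that each column is 500 ticks (each 80-column row covers 40000 ticks) and that stage letters drawn to the left of the `f` on the same row have wrapped around the end of the row. The `stage_ticks` helper is hypothetical.

```python
import re

TICKS_PER_COLUMN = 500  # assumption: 40000 ticks per row / 80 columns

ROW_RE = re.compile(r"\[([^\]]*)\]-\(\s*(\d+)\)")

def stage_ticks(row):
    """Map each stage letter in one timeline row to its absolute tick."""
    match = ROW_RE.search(row)
    if match is None:
        raise ValueError("not a timeline row: " + row)
    timeline, base = match.group(1), int(match.group(2))
    fetch_col = timeline.find("f")
    ticks = {}
    for col, stage in enumerate(timeline):
        if stage == ".":
            continue
        # Stages drawn to the left of the fetch column wrapped around the row end.
        if fetch_col != -1 and col < fetch_col:
            col += len(timeline)
        ticks[stage] = base + col * TICKS_PER_COLUMN
    return ticks

# First row of the traces above, rebuilt so that the timeline is exactly 80 columns.
row = "[.ic.r" + "." * 72 + "fdn]-(          40000) 0x00400078.0 movz x0, #0, #0"
print(stage_ticks(row))
```

For this row the fetch lands on tick 78500, which agrees with the Fetch log line at that tick. The continuation rows (the ones with `...` instead of a disassembly) contain no `f`, so their columns are simply added to their own base tick.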

    +
    +
    +
    +
    19.20.4.6.6. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall-hazard4
    + +
    +

    So in this case the issue and complete stages (ic) of dependent instructions such as add x6, x5, #1 have to wait until the LDR is finished:

    +
    +
    +
    +
    // f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
    +
    +                                     timeline                                             tick          pc.upc     disasm                      seq_num
    +[.ic.r........................................................................fdn]-(          40000) 0x004000b0.0 movz x0, #0, #0           [         1]
    +[.ic.r........................................................................fdn]-(          40000) 0x004000b4.0 movz x1, #1, #0           [         2]
    +[.ic.r........................................................................fdn]-(          40000) 0x004000b8.0 movz x2, #4, #0           [         3]
    +[.ic.r........................................................................fdn]-(          40000) 0x004000bc.0 movz x3, #5, #0           [         4]
    +[....................fdn.ic.r....................................................]-(         120000) 0x004000c0.0 adr x4, #65772            [         5]
    +[....................fdn.pic.....................................................]-(         120000) 0x004000c4.0 ldr x5, [x4]              [         6]
    +[........................................................r.......................]-(         160000)     ...
    +[....................fdn.p.......................................................]-(         120000) 0x004000c8.0 add x6, x5, #1            [         7]
    +[......................................................ic.r......................]-(         160000)     ...
    +[....................fdn.p.......................................................]-(         120000) 0x004000cc.0 add x7, x6, #1            [         8]
    +[.......................................................ic.r.....................]-(         160000)     ...
    +[....................fdn.p.......................................................]-(         120000) 0x004000d0.0 add x8, x7, #1            [         9]
    +[........................................................ic.r....................]-(         160000)     ...
    +[....................fdn.p.......................................................]-(         120000) 0x004000d4.0 add x9, x8, #1            [        10]
    +[.........................................................ic.r...................]-(         160000)     ...
    +[....................fdn.ic......................................................]-(         120000) 0x004000d8.0 movz x10, #10, #0         [        11]
    +[............................................................r...................]-(         160000)     ...
    +[....................fdn.ic......................................................]-(         120000) 0x004000dc.0 movz x11, #11, #0         [        12]
    +[............................................................r...................]-(         160000)     ...
    +[.....................fdn.ic.....................................................]-(         120000) 0x004000e0.0 movz x12, #12, #0         [        13]
    +[............................................................r...................]-(         160000)     ...
    +[.....................fdn.ic.....................................................]-(         120000) 0x004000e4.0 movz x13, #13, #0         [        14]
    +[............................................................r...................]-(         160000)     ...
    +[.....................fdn.ic.....................................................]-(         120000) 0x004000e8.0 movz x14, #14, #0         [        15]
    +[............................................................r...................]-(         160000)     ...
    +[.....................fdn.ic.....................................................]-(         120000) 0x004000ec.0 movz x15, #15, #0         [        16]
    +[............................................................r...................]-(         160000)     ...
    +[.....................fdn.ic.....................................................]-(         120000) 0x004000f0.0 movz x16, #16, #0         [        17]
    +[............................................................r...................]-(         160000)     ...
    +[.....................fdn.ic.....................................................]-(         120000) 0x004000f4.0 movz x17, #17, #0         [        18]
    +[.............................................................r..................]-(         160000)     ...
    +[.....................fdn.pic....................................................]-(         120000) 0x004000f8.0 movz x18, #18, #0         [        19]
    +[.............................................................r..................]-(         160000)     ...
    +[.....................fdn.pic....................................................]-(         120000) 0x004000fc.0 movz x19, #19, #0         [        20]
    +[.............................................................r..................]-(         160000)     ...
    +[............................................fdn.ic...........r..................]-(         160000) 0x00400100.0 movz x20, #20, #0         [        21]
    +[............................................fdn.ic...........r..................]-(         160000) 0x00400104.0 movz x21, #21, #0         [        22]
    +[............................................fdn.ic...........r..................]-(         160000) 0x00400108.0 movz x22, #22, #0         [        23]
    +[............................................fdn.ic...........r..................]-(         160000) 0x0040010c.0 movz x23, #23, #0         [        24]
    +[............................................fdn.ic...........r..................]-(         160000) 0x00400110.0 movz x24, #24, #0         [        25]
    +[............................................fdn.ic............r.................]-(         160000) 0x00400114.0 movz x25, #25, #0         [        26]
    +[............................................fdn.pic...........r.................]-(         160000) 0x00400118.0 movz x26, #26, #0         [        27]
    +[............................................fdn.pic...........r.................]-(         160000) 0x0040011c.0 movz x27, #27, #0         [        28]
    +[.............................................fdn.ic...........r.................]-(         160000) 0x00400120.0 movz x28, #28, #0         [        29]
    +[.............................................fdn.ic...........r.................]-(         160000) 0x00400124.0 movz x29, #29, #0         [        30]
    +[.............................................fdn.ic...........r.................]-(         160000) 0x00400128.0 movz x0, #0, #0           [        31]
    +[.............................................fdn.ic...........r.................]-(         160000) 0x0040012c.0 movz x1, #1, #0           [        32]
    +[.............................................fdn.pic..........r.................]-(         160000) 0x00400130.0 movz x2, #2, #0           [        33]
    +[.............................................fdn.pic...........r................]-(         160000) 0x00400134.0 movz x3, #3, #0           [        34]
    +[.............................................fdn.pic...........r................]-(         160000) 0x00400138.0 movz x4, #4, #0           [        35]
    +[.............................................fdn.pic...........r................]-(         160000) 0x0040013c.0 movz x5, #5, #0           [        36]
    +
    +
    +
    +
    +
    19.20.4.6.7. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: speculative
    + +
    +

    That program is set up so that the branch is not taken if an extra CLI argument is passed with --cli-args.

    +
    +
    +

    We purposefully set things up so that the speculated instructions run from the icache, which lets us see what is going on more clearly, without ifetch stalls.

    +
    +
    +

    Without an extra CLI argument (the branch is taken):

    +
    +
    +
    +
    // f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
    +
    +                                     timeline                                             tick          pc.upc     disasm                      seq_num
    +[.............................................................................fdn]-(          40000) 0x00400078.0 ldr x0, [sp]              [         1]
    +[.ic.............................................................................]-(          80000)     ...
    +[................................r...............................................]-(         120000)     ...
    +[.............................................................................fdn]-(          40000) 0x0040007c.0 movz x1, #1, #0           [         2]
    +[.ic.............................................................................]-(          80000)     ...
    +[................................r...............................................]-(         120000)     ...
    +[....................fdn.ic......r...............................................]-(         120000) 0x00400080.0 movz x2, #2, #0           [         3]
    +[....................fdn.ic......r...............................................]-(         120000) 0x00400084.0 movz x3, #3, #0           [         4]
    +[....................fdn.ic......r...............................................]-(         120000) 0x00400088.0 movz x4, #4, #0           [         5]
    +[....................fdn.ic......r...............................................]-(         120000) 0x0040008c.0 movz x5, #5, #0           [         6]
    +[....................fdn.ic......r...............................................]-(         120000) 0x00400090.0 movz x6, #6, #0           [         7]
    +[....................fdn.p.....ic..r.............................................]-(         120000) 0x00400094.0 subs x0, #2               [         8]
    +[....................fdn.ic........r.............................................]-(         120000) 0x00400098.0 movz x0, #3, #0           [         9]
    +[....................fdn.p......ic.r.............................................]-(         120000) 0x0040009c.0 b.lt 0x400080             [        10]
    +[=====================fdn=ic=====================================================]-(         120000) 0x004000a0.0 -----movz x10, #10, #0    [        11]
    +[=====================fdn=ic=====================================================]-(         120000) 0x004000a4.0 -----movz x11, #11, #0    [        12]
    +[=====================fdn=ic=====================================================]-(         120000) 0x004000a8.0 -----movz x12, #12, #0    [        13]
    +[=====================fdn=ic=====================================================]-(         120000) 0x004000ac.0 -----movz x13, #13, #0    [        14]
    +[=====================fdn=ic=====================================================]-(         120000) 0x004000b0.0 -----movz x14, #14, #0    [        15]
    +[=====================fdn=ic=====================================================]-(         120000) 0x004000b4.0 -----movz x15, #15, #0    [        16]
    +[=====================fdn=pic====================================================]-(         120000) 0x004000b8.0 -----movz x16, #16, #0    [        17]
    +[=====================fdn=pic====================================================]-(         120000) 0x004000bc.0 -----movz x17, #17, #0    [        18]
    +[.....................................fdn.ic.r...................................]-(         120000) 0x00400080.0 movz x2, #2, #0           [        19]
    +[.....................................fdn.ic.r...................................]-(         120000) 0x00400084.0 movz x3, #3, #0           [        20]
    +[.....................................fdn.ic.r...................................]-(         120000) 0x00400088.0 movz x4, #4, #0           [        21]
    +[.....................................fdn.ic.r...................................]-(         120000) 0x0040008c.0 movz x5, #5, #0           [        22]
    +[.....................................fdn.ic.r...................................]-(         120000) 0x00400090.0 movz x6, #6, #0           [        23]
    +[.....................................fdn.pic.r..................................]-(         120000) 0x00400098.0 movz x0, #3, #0           [        25]
    +[.....................................fdn.pic.r..................................]-(         120000) 0x0040009c.0 b.lt 0x400080             [        26]
    +[......................................fdn.ic.r..................................]-(         120000) 0x004000a0.0 movz x10, #10, #0         [        27]
    +[......................................fdn.ic.r..................................]-(         120000) 0x004000a4.0 movz x11, #11, #0         [        28]
    +[......................................fdn.ic.r..................................]-(         120000) 0x004000a8.0 movz x12, #12, #0         [        29]
    +[......................................fdn.ic.r..................................]-(         120000) 0x004000ac.0 movz x13, #13, #0         [        30]
    +[......................................fdn.pic.r.................................]-(         120000) 0x004000b0.0 movz x14, #14, #0         [        31]
    +[......................................fdn.pic.r.................................]-(         120000) 0x004000b4.0 movz x15, #15, #0         [        32]
    +[......................................fdn.pic.r.................................]-(         120000) 0x004000b8.0 movz x16, #16, #0         [        33]
    +[......................................fdn.pic.r.................................]-(         120000) 0x004000bc.0 movz x17, #17, #0         [        34]
    +[.............................................fdn.ic.r...........................]-(         160000) 0x004000c0.0 movz x0, #0, #0           [        35]
    +[.............................................fdn.ic.r...........................]-(         160000) 0x004000c4.0 movz x8, #93, #0          [        36]
    +
    +
    +
    +

    So here we see that the CPU mispredicted! After the BLT instruction, the CPU continued to run movz x10, assuming that the branch would not be taken.

    +
    +
    +

    Then, at time 120000, the LDR data came back, after the wrong prediction had already been fully executed.

    +
    +
    +

    The CPU then noticed that it mispredicted, and so it started again from the correct branch target movz x2, and the instructions that were thrown away are marked as ===== in the timeline.
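The squashed rows can also be picked out mechanically. The following is a minimal Python sketch, assuming the row layout shown in the timeline above (`[timeline]-(tick) pc.upc disasm [seq_num]`); the function name `squashed_instructions` and the regular expression are illustrative and not part of gem5:

```python
import re

# Assumed row layout, copied from the timeline above:
# [timeline]-(tick) pc.upc disasm [seq_num]
ROW = re.compile(
    r"^\[(?P<timeline>[^\]]*)\]-\(\s*(?P<tick>\d+)\)\s+"
    r"(?P<pc>0x[0-9a-f]+)\.(?P<upc>\d+)\s+"
    r"(?P<disasm>.*?)\s+\[\s*(?P<seq>\d+)\]$"
)


def squashed_instructions(lines):
    """Return (seq_num, pc, disasm) for rows whose timeline is drawn with '='."""
    out = []
    for line in lines:
        m = ROW.match(line.strip())
        if m is None:
            # Headers and continuation rows ("...") carry no pc, so skip them.
            continue
        if "=" in m.group("timeline"):
            # Squashed rows prefix the disassembly with dashes; drop them.
            disasm = m.group("disasm").lstrip("-")
            out.append((int(m.group("seq")), int(m.group("pc"), 16), disasm))
    return out


SAMPLE = [
    "[....................fdn.p......ic.r.............................................]-(         120000) 0x0040009c.0 b.lt 0x400080             [        10]",
    "[=====================fdn=ic=====================================================]-(         120000) 0x004000a0.0 -----movz x10, #10, #0    [        11]",
    "[.....................................fdn.ic.r...................................]-(         120000) 0x00400080.0 movz x2, #2, #0           [        19]",
]
result = squashed_instructions(SAMPLE)
print(result)
```

Only the first row of each instruction is inspected, which is enough for the short instructions in this example.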

    +
    +
    +

    We can also see some Branch predictor log lines in the O3CPUAll log:

    +
    +
    +
    +
     130000: Fetch: system.cpu.fetch: [tid:0] [sn:10] Branch at PC 0x40009c predicted to be not taken
    + 130000: Fetch: system.cpu.fetch: [tid:0] [sn:10] Branch at PC 0x40009c predicted to go to (0x4000a0=>0x4000a4).(0=>1)
    +
    + 131500: Commit: system.cpu.commit: [tid:0] [sn:10] Inserting PC (0x40009c=>0x4000a0).(0=>1) into ROB.
    + 131500: ROB: system.cpu.rob: Adding inst PC (0x40009c=>0x4000a0).(0=>1) to the ROB.
    + 131500: ROB: system.cpu.rob: [tid:0] Now has 10 instructions.
    +
    + 132000: IEW: system.cpu.iew: [tid:0] Issue: Adding PC (0x40009c=>0x4000a0).(0=>1) [sn:10] [tid:0] to IQ.
    + 132000: IQ: system.cpu.iq: Adding instruction [sn:10] PC (0x40009c=>0x4000a0).(0=>1) to the IQ.
    + 132000: IQ: system.cpu.iq: Instruction PC (0x40009c=>0x4000a0).(0=>1) has src reg 6 (CCRegClass) that is being added to the dependency chain.
    + 132000: IQ: system.cpu.iq: Instruction PC (0x40009c=>0x4000a0).(0=>1) has src reg 8 (CCRegClass) that is being added to the dependency chain.
    + 132000: IQ: system.cpu.iq: Instruction PC (0x40009c=>0x4000a0).(0=>1) has src reg 7 (CCRegClass) that is being added to the dependency chain.
    +
    + 135500: IQ: system.cpu.iq: Waking up a dependent instruction, [sn:10] PC (0x40009c=>0x4000a0).(0=>1).
    + 135500: IQ: global: [sn:10] has 1 ready out of 3 sources. RTI 0)
    + 135500: IQ: system.cpu.iq: Waking any dependents on register 7 (CCRegClass).
    + 135500: IQ: system.cpu.iq: Waking up a dependent instruction, [sn:10] PC (0x40009c=>0x4000a0).(0=>1).
    + 135500: IQ: global: [sn:10] has 2 ready out of 3 sources. RTI 0)
    + 135500: IQ: system.cpu.iq: Waking any dependents on register 8 (CCRegClass).
    + 135500: IQ: system.cpu.iq: Waking up a dependent instruction, [sn:10] PC (0x40009c=>0x4000a0).(0=>1).
    + 135500: IQ: global: [sn:10] has 3 ready out of 3 sources. RTI 0)
    + 135500: IQ: system.cpu.iq: Instruction is ready to issue, putting it onto the ready list, PC (0x40009c=>0x4000a0).(0=>1) opclass:1 [sn:10].
    + 135500: IEW: system.cpu.iew: Setting Destination Register 6 (CCRegClass)
    + 135500: Scoreboard: system.cpu.scoreboard: Setting reg 6 (CCRegClass) as ready
    + 135500: IEW: system.cpu.iew: Setting Destination Register 7 (CCRegClass)
    + 135500: Scoreboard: system.cpu.scoreboard: Setting reg 7 (CCRegClass) as ready
    + 135500: IEW: system.cpu.iew: Setting Destination Register 8 (CCRegClass)
    + 135500: Scoreboard: system.cpu.scoreboard: Setting reg 8 (CCRegClass) as ready
    + 135500: IQ: system.cpu.iq: Attempting to schedule ready instructions from the IQ.
    + 135500: IQ: system.cpu.iq: Thread 0: Issuing instruction PC (0x40009c=>0x4000a0).(0=>1) [sn:10]
    +
    + 136000: IEW: system.cpu.iew: Execute: Processing PC (0x40009c=>0x4000a0).(0=>1), [tid:0] [sn:10].
    + 136000: IEW: global: RegFile: Access to cc register 6, has data 0x2
    + 136000: IEW: global: RegFile: Access to cc register 8, has data 0
    + 136000: IEW: global: RegFile: Access to cc register 7, has data 0
    + 136000: IEW: system.cpu.iew: Current wb cycle: 0, width: 8, numInst: 0
    +wbActual:0
    + 136000: IEW: system.cpu.iew: [tid:0] [sn:10] Execute: Branch mispredict detected.
    + 136000: IEW: system.cpu.iew: [tid:0] [sn:10] Predicted target was PC: (0x4000a0=>0x4000a4).(0=>1)
    + 136000: IEW: system.cpu.iew: [tid:0] [sn:10] Execute: Redirecting fetch to PC: (0x40009c=>0x400080).(0=>1)
    + 136000: IEW: system.cpu.iew: [tid:0] [sn:10] Squashing from a specific instruction, PC: (0x40009c=>0x400080).(0=>1)
    +
    + 136500: Commit: system.cpu.commit: [tid:0] Squashing due to branch mispred PC:0x40009c [sn:10]
    + 136500: Commit: system.cpu.commit: [tid:0] Redirecting to PC 0x400084
    + 136500: ROB: system.cpu.rob: Starting to squash within the ROB.
    + 136500: ROB: system.cpu.rob: [tid:0] Squashing instructions until [sn:10].
    + 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000bc=>0x4000c0).(0=>1), seq num 18.
    + 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000b8=>0x4000bc).(0=>1), seq num 17.
    + 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000b4=>0x4000b8).(0=>1), seq num 16.
    + 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000b0=>0x4000b4).(0=>1), seq num 15.
    + 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000ac=>0x4000b0).(0=>1), seq num 14.
    + 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000a8=>0x4000ac).(0=>1), seq num 13.
    + 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000a4=>0x4000a8).(0=>1), seq num 12.
    + 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000a0=>0x4000a4).(0=>1), seq num 11.
    + 136500: ROB: system.cpu.rob: [tid:0] Done squashing instructions.
    + 136500: Commit: system.cpu.commit: [tid:0] Marking PC (0x40009c=>0x400080).(0=>1), [sn:10] ready within ROB.
    +
    + 137000: Commit: system.cpu.commit: [tid:0] [sn:10] Committing instruction with PC (0x40009c=>0x400080).(0=>1)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+36    :   b.lt   0x400080          : IntAlu :   FetchSeq=10  CPSeq=10  flags=(IsControl|IsDirectControl|IsCondControl)
    + 137000: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x40009c=>0x400080).(0=>1), [sn:10]
    + 137000: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x40009c=>0x400080).(0=>1) [sn:10]
    + 137000: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:11]
    + 137000: Commit: system.cpu.commit: Retiring squashed instruction from ROB.
    +
    + 137000: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:10]
    + 137000: Commit: system.cpu.commit: [tid:0] [sn:10] Committing instruction with PC (0x40009c=>0x400080).(0=>1)
    + 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+36    :   b.lt   0x400080          : IntAlu :   FetchSeq=10  CPSeq=10  flags=(IsControl|IsDirectControl|IsCondControl)
    +
    + 138500: Fetch: system.cpu.fetch: [tid:0] [sn:26] Branch at PC 0x40009c predicted to be not taken
    + 138500: Fetch: system.cpu.fetch: [tid:0] [sn:26] Branch at PC 0x40009c predicted to go to (0x4000a0=>0x4000a4).(0=>1)
    +
    + 142500: Commit: system.cpu.commit: [tid:0] [sn:26] Committing instruction with PC (0x40009c=>0x4000a0).(0=>1)
    + 138500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+36    :   b.lt   0x400080          : IntAlu :   FetchSeq=26  CPSeq=18  flags=(IsControl|IsDirectControl|IsCondControl)
    + 142500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x40009c=>0x4000a0).(0=>1), [sn:26]
    +
    +
    +
    +
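To correlate predictions with mispredictions in a long O3CPUAll log, a small filter helps. This is a hedged Python sketch that relies only on the two message shapes visible in the log above ("Branch at PC ... predicted to be ..." and "Execute: Branch mispredict detected"); the helper name `branch_outcomes` is made up for illustration:

```python
import re

# Message shapes taken from the log excerpt above.
PREDICTED = re.compile(
    r"\[sn:(\d+)\] Branch at PC (0x[0-9a-f]+) predicted to be (not taken|taken)"
)
MISPREDICT = re.compile(r"\[sn:(\d+)\] Execute: Branch mispredict detected")


def branch_outcomes(log_lines):
    """Map seq_num -> (pc, prediction, was_mispredicted)."""
    predictions = {}
    mispredicted = set()
    for line in log_lines:
        m = PREDICTED.search(line)
        if m:
            predictions[int(m.group(1))] = (m.group(2), m.group(3))
        m = MISPREDICT.search(line)
        if m:
            mispredicted.add(int(m.group(1)))
    return {
        sn: (pc, prediction, sn in mispredicted)
        for sn, (pc, prediction) in predictions.items()
    }


SAMPLE_LOG = [
    " 130000: Fetch: system.cpu.fetch: [tid:0] [sn:10] Branch at PC 0x40009c predicted to be not taken",
    " 136000: IEW: system.cpu.iew: [tid:0] [sn:10] Execute: Branch mispredict detected.",
    " 138500: Fetch: system.cpu.fetch: [tid:0] [sn:26] Branch at PC 0x40009c predicted to be not taken",
]
outcomes = branch_outcomes(SAMPLE_LOG)
print(outcomes)
```

On the excerpt, this reports that the first execution of the branch (sn 10) was mispredicted and the second one (sn 26) was not, which matches the timeline.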

    With an extra CLI argument (the branch is not taken):

    +
    +
    +
    +
    // f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
    +
    +                                     timeline                                             tick          pc.upc     disasm                      seq_num
    +[.............................................................................fdn]-(          40000) 0x00400078.0 ldr x0, [sp]              [         1]
    +[.ic.............................................................................]-(          80000)     ...
    +[................................r...............................................]-(         120000)     ...
    +[.............................................................................fdn]-(          40000) 0x0040007c.0 movz x1, #1, #0           [         2]
    +[.ic.............................................................................]-(          80000)     ...
    +[................................r...............................................]-(         120000)     ...
    +[....................fdn.ic......r...............................................]-(         120000) 0x00400080.0 movz x2, #2, #0           [         3]
    +[....................fdn.ic......r...............................................]-(         120000) 0x00400084.0 movz x3, #3, #0           [         4]
    +[....................fdn.ic......r...............................................]-(         120000) 0x00400088.0 movz x4, #4, #0           [         5]
    +[....................fdn.ic......r...............................................]-(         120000) 0x0040008c.0 movz x5, #5, #0           [         6]
    +[....................fdn.ic......r...............................................]-(         120000) 0x00400090.0 movz x6, #6, #0           [         7]
    +[....................fdn.ic.......r..............................................]-(         120000) 0x00400098.0 movz x0, #3, #0           [         9]
    +[....................fdn.p......ic.r.............................................]-(         120000) 0x0040009c.0 b.lt 0x400080             [        10]
    +[.....................fdn.ic.......r.............................................]-(         120000) 0x004000a0.0 movz x10, #10, #0         [        11]
    +[.....................fdn.ic.......r.............................................]-(         120000) 0x004000a4.0 movz x11, #11, #0         [        12]
    +[.....................fdn.ic.......r.............................................]-(         120000) 0x004000a8.0 movz x12, #12, #0         [        13]
    +[.....................fdn.ic.......r.............................................]-(         120000) 0x004000ac.0 movz x13, #13, #0         [        14]
    +[.....................fdn.ic.......r.............................................]-(         120000) 0x004000b0.0 movz x14, #14, #0         [        15]
    +[.....................fdn.ic.......r.............................................]-(         120000) 0x004000b4.0 movz x15, #15, #0         [        16]
    +[.....................fdn.pic......r.............................................]-(         120000) 0x004000b8.0 movz x16, #16, #0         [        17]
    +[.....................fdn.pic.......r............................................]-(         120000) 0x004000bc.0 movz x17, #17, #0         [        18]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000c0.0 movz x0, #0, #0           [        19]
    +[............................................fdn.ic.r............................]-(         160000) 0x004000c4.0 movz x8, #93, #0          [        20]
    +
    +
    +
    +

    So this time the prediction was correct. Retire is delayed until the memory comes back, but we otherwise just kept running forward until hitting the next ifetch cache line.

    +
    @@ -25315,16 +26567,61 @@ namespace ArmISAInst {
    19.20.5.2. gem5 microops
    -

    TODO

    -
    -

    Some gem5 instructions break down into multiple microops.

    Microops are very similar to regular instructions, and show up in the gem5 ExecAll trace format since that flag implies ExecMicro.

    -

    On aarch64 for example, one of the simplest microoped instructions is STP, which does the relatively complex operation of storing two values to memory at once, and is therefore a good candidate for being broken down into microops.

    +

    On aarch64 for example, one of the simplest microoped instructions is STP, which does the relatively complex operation of storing two values to memory at once, and is therefore a good candidate for being broken down into microops. We can observe it when executing:

    +
    +
    +
    +
    ./run \
    +  --arch aarch64 \
    +  --emulator gem5 \
    +  --trace-insts-stdout \
    +  --userland userland/arch/aarch64/freestanding/linux/disassembly_test.S \
    +;
    +
    +
    +
    +

    which shows, in gem5’s broken-ish disassembly, that the input:

    +
    +
    +
    +
    stp x1, x2, [x0, 16]
    +
    +
    +
    +

    generated the output:

    +
    +
    +
    +
      16500: system.cpu: A0 T0 : @_start+108    : stp
    +  16500: system.cpu: A0 T0 : @_start+108. 0 :   addxi_uop   ureg0, x0, #16 : IntAlu :  D=0x0000000000420010  flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
    +  17000: system.cpu: A0 T0 : @_start+108. 1 :   strxi_uop   w1, [ureg0]  : MemWrite :  D=0x000000009abcdef0 A=0x420010  flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsDelayedCommit)
    +  17500: system.cpu: A0 T0 : @_start+108. 2 :   strxi_uop   w2, [ureg0, #8] : MemWrite :  D=0x0000000000000002 A=0x420018  flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsLastMicroop)
    +
    +
    +
    +

    Where @_start+108. 0, @_start+108. 1 and @_start+108. 2 all happen at the same PC, and are therefore microops of STP.
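The grouping by PC can be sketched in Python as follows. This assumes the ExecAll line layout shown above, where a microop carries a `. N` suffix after the symbol and the macroop line does not; the name `group_microops` is hypothetical, and keying by symbol plus offset would merge repeated executions of the same PC (for example inside a loop):

```python
import re

# "@symbol+offset", optionally followed by ". N" for microop number N,
# then " : mnemonic". Layout assumed from the trace excerpt above.
ROW = re.compile(r"@(?P<sym>\S+?)(?:\.\s*(?P<upc>\d+))?\s*:\s*(?P<mnemonic>\S+)")


def group_microops(trace_lines):
    """Map each PC symbol to its macroop mnemonic and ordered microops."""
    groups = {}
    for line in trace_lines:
        m = ROW.search(line)
        if m is None:
            continue
        entry = groups.setdefault(m.group("sym"), {"macro": None, "uops": []})
        if m.group("upc") is None:
            entry["macro"] = m.group("mnemonic")
        else:
            entry["uops"].append((int(m.group("upc")), m.group("mnemonic")))
    return groups


SAMPLE_TRACE = [
    "  16500: system.cpu: A0 T0 : @_start+108    : stp",
    "  16500: system.cpu: A0 T0 : @_start+108. 0 :   addxi_uop   ureg0, x0, #16 : IntAlu :  D=0x0000000000420010  flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)",
    "  17000: system.cpu: A0 T0 : @_start+108. 1 :   strxi_uop   w1, [ureg0]  : MemWrite :  D=0x000000009abcdef0 A=0x420010  flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsDelayedCommit)",
    "  17500: system.cpu: A0 T0 : @_start+108. 2 :   strxi_uop   w2, [ureg0, #8] : MemWrite :  D=0x0000000000000002 A=0x420018  flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsLastMicroop)",
]
groups = group_microops(SAMPLE_TRACE)
print(groups)
```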

    +
    +
    +

    From their names, which are of course not specified in the ARMv8 architecture reference manual, we guess that:

    +
    +
    + +
    +
    +

    From the gem5 source code, we see that STP is a class LdpStp : public PairMemOp, and then the constructor of PairMemOp sets up the microops depending on the exact type of LDP/STP:

    @@ -25347,7 +26644,7 @@ namespace ArmISAInst {

    functional: get the value magically, do not update caches, see also: gem5 functional requests

  • -

    atomic: get the value now without making a separate event, but do not update caches

    +

    atomic: get the value now without making a separate event, but do not update caches. Cannot work in Ruby due to fundamental limitations, mentioned in passing at: https://gem5.atlassian.net/browse/GEM5-676

  • timing: get the value simulating delays and updating caches

    @@ -25563,6 +26860,33 @@ TimingSimpleCPU::finishTranslation(WholeTranslationState *state)

    Therefore, here it makes sense for gem5 syscall implementation, which does not actually have a real kernel running, to just make a functional request and be done with it, since the impact of cache changes done by this read would be insignificant compared to the cost of an actual full context switch that would happen on a real syscall.

    +
    +

    It is generally hard to implement functional requests for Ruby runs, because packets are flying through the memory system in a transient state, and there is no simple way of finding exactly which ones might have the latest version of the memory. See for example:

    +
    +
    + +
    +
    +

    The typical error message in that case is:

    +
    +
    +
    +
    fatal: Ruby functional read failed for address
    +
    +
    @@ -26393,11 +27717,10 @@ readFunc(SyscallDesc *desc, ThreadContext *tc,

    19.20.8. gem5 functional units

    -

    TODO

    -
    -
    -

    Each instruction is marked with a class, and each class can execute in a given functional unit.

    +

    Each instruction is marked with a class, and each class can execute in a given functional unit.

    +
    +
    19.20.8.1. gem5 MinorCPU default functional units

    Which units are available is visible for example on the gem5 config.ini of a gem5 MinorCPU run. Functional units are not present in simple CPUs like gem5 TimingSimpleCPU.

    @@ -26411,7 +27734,6 @@ readFunc(SyscallDesc *desc, ThreadContext *tc, --emulator gem5 \ --userland userland/arch/aarch64/freestanding/linux/hello.S \ --trace-insts-stdout \ - -N1 \ -- \ --cpu-type MinorCPU \ --caches @@ -26424,7 +27746,233 @@ readFunc(SyscallDesc *desc, ThreadContext *tc,
    [system.cpu]
     type=MinorCPU
    -children=branchPred dcache dtb executeFuncUnits icache interrupts isa itb power_state tracer workload
    +children=branchPred dcache dtb executeFuncUnits icache interrupts isa itb power_state tracer workload +executeInputWidth=2 +executeIssueLimit=2 +
    +
    +
    +

    Here also note the executeInputWidth=2 and executeIssueLimit=2 suggesting that this is a dual issue superscalar processor.

    +
    +
    +

    The executeFuncUnits child of system.cpu points to:

    +
    +
    +
    +
    [system.cpu.executeFuncUnits]
    +type=MinorFUPool
    +children=funcUnits0 funcUnits1 funcUnits2 funcUnits3 funcUnits4 funcUnits5 funcUnits6 funcUnits7
    +
    +
    +
    +

    and the two first units are in full:

    +
    +
    +
    +
    [system.cpu.executeFuncUnits.funcUnits0]
    +type=MinorFU
    +children=opClasses timings
    +opClasses=system.cpu.executeFuncUnits.funcUnits0.opClasses
    +opLat=3
    +
    +[system.cpu.executeFuncUnits.funcUnits0.opClasses]
    +type=MinorOpClassSet
    +children=opClasses
    +
    +[system.cpu.executeFuncUnits.funcUnits0.opClasses.opClasses]
    +type=MinorOpClass
    +opClass=IntAlu
    +
    +
    +
    +

    and:

    +
    +
    +
    +
    [system.cpu.executeFuncUnits.funcUnits1]
    +type=MinorFU
    +children=opClasses timings
    +opLat=3
    +
    +[system.cpu.executeFuncUnits.funcUnits1.opClasses]
    +type=MinorOpClassSet
    +children=opClasses
    +opClasses=system.cpu.executeFuncUnits.funcUnits1.opClasses.opClasses
    +
    +[system.cpu.executeFuncUnits.funcUnits1.opClasses.opClasses]
    +type=MinorOpClass
    +opClass=IntAlu
    +
    +
    +
    +

    So we understand that:

    +
    +
    +
      +
    • +

      the first and second functional units are IntAlu, so doing integer arithmetic operations

      +
    • +
    • +

      both have a latency of 3

      +
    • +
    • +

      each functional unit can have a set of opClass with more than one type. Those first two units just happen to have a single type.

      +
    • +
    +
    +
    +
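Reading these values by eye gets tedious for all eight units, so here is a minimal Python sketch that extracts them with the standard library `configparser`. It assumes the section layout of the funcUnits0 excerpt above; the function name `minor_functional_units` and the prefix-based lookup of `MinorOpClass` sections are illustrative choices, not gem5 APIs:

```python
import configparser

# Excerpt copied from the config.ini shown above.
CONFIG_EXCERPT = """\
[system.cpu.executeFuncUnits.funcUnits0]
type=MinorFU
children=opClasses timings
opClasses=system.cpu.executeFuncUnits.funcUnits0.opClasses
opLat=3

[system.cpu.executeFuncUnits.funcUnits0.opClasses]
type=MinorOpClassSet
children=opClasses

[system.cpu.executeFuncUnits.funcUnits0.opClasses.opClasses]
type=MinorOpClass
opClass=IntAlu
"""


def minor_functional_units(ini_text):
    """Map each MinorFU section name to (opLat, [opClass, ...])."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the case of keys such as opLat and opClass
    parser.read_string(ini_text)
    units = {}
    for name in parser.sections():
        section = parser[name]
        if section.get("type") != "MinorFU":
            continue
        # Assumption: every MinorOpClass section nested below the unit's
        # name belongs to that unit.
        op_classes = [
            parser[other]["opClass"]
            for other in parser.sections()
            if other.startswith(name + ".")
            and parser[other].get("type") == "MinorOpClass"
        ]
        units[name] = (int(section["opLat"]), op_classes)
    return units


units = minor_functional_units(CONFIG_EXCERPT)
print(units)
```

To run it on a real simulation, the text of the generated config.ini would be passed instead of the excerpt.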

    The full list is:

    +
    +
    +
      +
    • +

      0, 1: IntAlu, opLat=3

      +
    • +
    • +

      2: IntMult, opLat=3

      +
    • +
    • +

      3: IntDiv, opLat=9. So we see that a more complex operation such as division has higher latency.

      +
    • +
    • +

      4: FloatAdd, FloatCmp, and a gazillion other floating point related things. opLat=6.

      +
    • +
    • +

      5: SimdPredAlu: TODO SVE-related? opLat=3

      +
    • +
    • +

      6: MemRead, MemWrite, FloatMemRead, FloatMemWrite. opLat=1

      +
    • +
    • +

      7: IprAccess (TODO), InstPrefetch

      +
    • +
    +
    +
    +

    These are of course all specified in the Python at src/cpu/minor/MinorCPU.py:

    +
    +
    +
    +
    class MinorDefaultFUPool(MinorFUPool):
    +    funcUnits = [MinorDefaultIntFU(), MinorDefaultIntFU(),
    +        MinorDefaultIntMulFU(), MinorDefaultIntDivFU(),
    +        MinorDefaultFloatSimdFU(), MinorDefaultPredFU(),
    +        MinorDefaultMemFU(), MinorDefaultMiscFU()]
    +
    +
    +
    +

    We then expect that each instruction has a certain opClass that determines on which unit it can run.

    +
    +
    +

    For example: class AddImm, which is what we get on a simple add x1, x2, 0, sets itself as an IntAluOp on the constructor as expected:

    +
    +
    +
    +
        AddImm::AddImm(ExtMachInst machInst,
    +                                          IntRegIndex _dest,
    +                                          IntRegIndex _op1,
    +                                          uint32_t _imm,
    +                                          bool _rotC)
    +        : DataImmOp("add", machInst, IntAluOp,
    +                         _dest, _op1, _imm, _rotC)
    +
    +
    +
    +
    +
    19.20.8.2. gem5 DerivO3CPU default functional units
    +
    +

    On gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772, after running:

    +
    +
    +
    +
    ./run   \
    +  --arch aarch64 \
    +  --emulator gem5 \
    +  --userland userland/arch/aarch64/freestanding/linux/hello.S \
    +  --trace-insts-stdout \
    +  -- \
    +  --cpu-type DerivO3CPU \
    +  --caches
    +
    +
    +
    +

    we see:

    +
    +
    +
    +
    [system.cpu]
    +type=DerivO3CPU
    +children=branchPred dcache dtb fuPool icache interrupts isa itb power_state tracer workload
    +
    +
    +
    +

    and following fuPool:

    +
    +
    +
    +
    [system.cpu.fuPool]
    +type=FUPool
    +children=FUList0 FUList1 FUList2 FUList3 FUList4 FUList5 FUList6 FUList7 FUList8 FUList9
    +
    +
    +
    +

    so for example FUList0 is:

    +
    +
    +
    +
    [system.cpu.fuPool.FUList0]
    +type=FUDesc
    +children=opList
    +count=6
    +eventq_index=0
    +opList=system.cpu.fuPool.FUList0.opList
    +
    +[system.cpu.fuPool.FUList0.opList]
    +type=OpDesc
    +eventq_index=0
    +opClass=IntAlu
    +opLat=1
    +pipelined=true
    +
    +
    +
    +

    and FUList1:

    +
    +
    +
    +
    [system.cpu.fuPool.FUList1.opList0]
    +type=OpDesc
    +eventq_index=0
    +opClass=IntMult
    +opLat=3
    +pipelined=true
    +
    +[system.cpu.fuPool.FUList1.opList1]
    +type=OpDesc
    +eventq_index=0
    +opClass=IntDiv
    +opLat=20
    +pipelined=false
    +
    +
    +
    +

    So summarizing all units we have:

    +
    +
    +
      +
    • +

      0: IntAlu with opLat=1

      +
    • +
    • +

      1: IntMult with opLat=3 and IntDiv with opLat=20

      +
    • +
    • +

      2: FloatAdd, FloatCmp, FloatCvt with opLat=2

      +
    • +
    • +

      TODO lazy to finish the list :-)

      +
    • +
    @@ -26679,6 +28227,107 @@ build/ARM/config/the_isa.hh +
    +

    19.21. Gensim

    + + +
    +

    MIT licensed Binary translation simulator, so a bit like an MIT QEMU.

    +
    +
    +

    Video showing it boot Linux fast: https://www.youtube.com/watch?v=aZXx17oYumc

    +
    +
    +

    Its name is unfortunately completely and totally overshadowed by an unrelated software with the same name: https://radimrehurek.com/gensim/

    +
    +
    +

    TODO: advantages over QEMU. Like the name implies, they seem to have a nice ISA description language. From a quick look at the internals, it seems to generate LLVM intermediate language, which sounds good.

    +
    +
    +

    Build on Ubuntu 20.04:

    +
    +
    +
    +
    sudo apt install libantlr3c-dev
    +cd submodules/gensim
    +make
    +
    +
    +
    +

    First fails with:

    +
    +
    +
    +
    arm-none-eabi-gcc: error: unrecognized -march target: armv5
    +
    +
    +
    +

    Let’s try just armv8, who cares about armv5!!!

    +
    +
    +
    +
    mkdir build
    +cd build
    +cmake -DTESTING_ENABLED=FALSE -DCMAKE_BUILD_TYPE=DEBUGOPT ..
    +make -j`nproc` model-armv8
    +
    +
    + +
    +
    +
    terminate called after throwing an instance of 'std::logic_error'
    +  what():  Unrecognised intrinsic: __builtin_abs64
    +Aborted (core dumped)
    +
    +
    +
    +

    Get the failing command with:

    +
    +
    +
    +
    make VERBOSE=1 model-armv8
    +
    +
    +
    +

    and we see some code generation step:

    +
    +
    +
    +
    cd /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/models/armv8 && \
    +  /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/build/dist/bin/gensim \
    +  -a /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/models/armv8/aarch64.ac \
    +  -s module,arch,decode,disasm,ee_interp,ee_blockjit,jumpinfo,function,makefile \
    +  -o decode.GenerateDotGraph=1,makefile.libtrace_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/support/libtrace/inc,makefile.archsim_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/archsim/inc,makefile.llvm_path=,makefile.Optimise=2,makefile.Debug=1 \
    +  -t /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/build/models/armv8/output-aarch64/
    +
    +
    +
    +

    We can see an inclusion path:

    +
    +
    +
    +
    gensim/models/armv8/aarch64.ac
    +		ac_isa("isa.ac");
    +gensim/models/armv8/isa.ac
    +		ac_execute("execute.simd");
    +
    +
    +
    +

    and where gensim/models/armv8/execute.simd contains __builtin_abs64 usages.

    +
    +
    +

    GDB on gensim shows that the error comes from a call to gci.GenerateExecuteBodyFor(body_str, *action);, so it looks like there are some missing cases in EmitFixedCode.

    +
    +
    +

    This is completely broken academic code! They must be using an off-tree part of the tool and forgot to commit.

    +
    +
    @@ -27017,7 +28666,7 @@ make menuconfig

    If none of those methods are flexible enough for you, you can just fork or hack up buildroot_packages/sample_package the sample package to do what you want.

    -

    For how to use that package, see: Section 33.14.2, “buildroot_packages directory”.

    +

    For how to use that package, see: Section 33.15.2, “buildroot_packages directory”.

    Then iterate trying to do what you want and reading the manual until it works: https://buildroot.org/downloads/manual/manual.html

    @@ -27036,7 +28685,7 @@ make menuconfig

    Also mentioned at: https://stackoverflow.com/questions/47320800/how-to-clean-only-target-in-buildroot

    -

    See this for a sample manual workaround: Section 21.8.4.4, “PARSEC uninstall”.

    +

    See this for a sample manual workaround: Section 21.8.5.4, “PARSEC uninstall”.

    @@ -27203,7 +28852,7 @@ git -C "$(./getvar qemu_source_dir)" checkout -

    Then, you will also want to do a Bisection to pinpoint the exact commit to blame, and CC that developer.

    -

    Finally, give the images you used save upstream developers' time as shown at: Section 33.18.2, “release-zip”.

    +

    Finally, give the images you used to save upstream developers' time as shown at: Section 33.19.2, “release-zip”.

    For Buildroot problems, you should either provide the config you have:

    @@ -27872,17 +29521,11 @@ echo 1 > /proc/sys/vm/overcommit_memory

    Demonstrates atomic_int and thrd_create.

    -

    Disassembly with GDB at LKMC 619fef4b04bddc4a5a38aec5e207dd4d5a25d206 + 1:

    +

    Disassembly with GDB at LKMC 619fef4b04bddc4a5a38aec5e207dd4d5a25d206 + 1:

    -
    ./run-toolchain \
    -  --arch aarch64 gdb \
    -  -- \
    -  -batch \
    -  -ex 'disas/rs my_thread_main' $(./getvar \
    -  --arch aarch64 userland_build_dir)/c/atomic.out \
    -;
    +
    ./disas --arch aarch64 --userland userland/c/atomic.c my_thread_main
    @@ -27982,6 +29625,101 @@ echo 1 > /proc/sys/vm/overcommit_memory

    strace shows that OpenMP makes clone() syscalls in Linux. TODO: does it actually call pthread_ functions, or does it make syscalls directly? Or in other words, can it work on Freestanding programs? A quick grep shows many references to pthreads.

    +
    +
    21.1.3.2.1. OpenMP validation
    + +
    +

    Host build on Ubuntu 20.04:

    +
    +
    +
    +
    git submodule update --init submodules/omp-validation
    +cd submodules/omp-validation
    +PERL5LIB="${PERL5LIB}:." make -j `nproc` ctest
    +
    +
    +
    +

    This both builds and runs the tests. It took about 5 minutes on a P51, but there were some failures for some reason:

    +
    +
    +
    +
    Summary:
    +S Number of tested Open MP constructs: 62
    +S Number of used tests:                123
    +S Number of failed tests:              4
    +S Number of successful tests:          119
    +S + from this were verified:           115
    +
    +Normal tests:
    +N Number of failed tests:              2
    +N + from this fail compilation:        0
    +N + from this timed out                0
    +N Number of successful tests:          60
    +N + from this were verified:           58
    +
    +Orphaned tests:
    +O Number of failed tests:              2
    +O + from this fail compilation:        0
    +O + from this timed out                0
    +O Number of successful tests:          59
    +O + from this were verified:           57
    +
    +
    +
    +

    The tests and run results are placed under bin/c/, e.g.:

    +
    +
    +
    +
    test_omp_threadprivate
    +test_omp_threadprivate.c
    +test_omp_threadprivate.log
    +test_omp_threadprivate.out
    +test_omp_threadprivate_compile.log
    +
    +
    +
    +

    C files are also present as some kind of code generation is used.

    +
    +
    +

    Build only and run one of them manually:

    +
    +
    +
    +
    make -j`nproc` omp_my_sleep omp_testsuite
    +PERL5LIB="${PERL5LIB}:." ./runtest.pl --lang=c --norun testlist-c.txt
    +./bin/c/test_omp_barrier
    +
    +
    +
    +

    The bin/c directory is hardcoded in the executable, so to run it you must ensure that it exists relative to CWD, e.g.:

    +
    +
    +
    +
    cd bin/c
    +mkdir -p bin/c
    +./test_omp_barrier
    +
    +
    +
    +

    Manually cross compile all tests and optionally add some extra options, e.g. -static to more conveniently run in gem5:

    +
    +
    +
    +
    PERL5LIB="${PERL5LIB}:." ./runtest.pl --lang=c --makeopts 'CC=aarch64-linux-gnu-gcc CFLAGS_EXTRA=-static' --norun testlist-c.txt
    +./../../run --arch aarch64 --emulator gem5 --userland submodules/omp-validation/bin/c/test_omp_parallel_reduction --cpus 8 --memory 8G
    +
    +
    +
    +

    Build a single test:

    +
    +
    +
    +
    make bin/c/test_omp_sections_reduction
    +
    +
    +
    @@ -28248,7 +29986,7 @@ global 12676

    The actual value is much smaller, because the threads have often overwritten one another with older values.

    -

    With --optimization-level 3, the result almost always equals that of a single thread, e.g.:

    +

    With --optimization-level 3, the result almost always equals that of a single thread, e.g.:

    @@ -28414,7 +30152,6 @@ non-atomic 19
    -
    /run -aA -eg -u userland/c/atomic.c --cli-args '2 200' --cpus 3 --userland-build-id o3 -N1 --trace ExecAll — --caches --cpu-type TimingSimpleCPU

    Note that the system is very minimal, and doesn’t even have caches, so I’m curious as to how this can happen at all.

    @@ -28966,6 +30703,9 @@ There are no non-locking atomic types or atomic primitives in POSIX: rootfs_overlay/lkmc/python/count.py: count once every second

  • +
  • +

    rootfs_overlay/lkmc/python/iter_method.py: how to implement iter on a class

    +
  • @@ -29219,6 +30959,36 @@ There are no non-locking atomic types or atomic primitives in POSIX: + + +
    @@ -29779,7 +31549,125 @@ cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans,3,3,2 ,1, A,3, B,
    -

    21.8.3. STREAM benchmark

    +

    21.8.3. LMbench

    +
    +

    http://www.bitmover.com/lmbench/

    +
    +
    +

    Canonical source at https://sourceforge.net/projects/lmbench/ but Intel has a fork at: https://github.com/intel/lmbench which has more recent build updates, so I think that’s the one I’d put my money on as of 2020.

    +
    +
    +

    Feels old, guessing not representative anymore like Dhrystone. But hey, history!

    +
    +
    +

    Ubuntu 20.04 AMD64 native build and run:

    +
    +
    +
    +
    git submodule update --init submodules/lmbench
    +cd submodules/lmbench
    +cd src
    +make results
    +
    +
    +
    +

    TODO it hangs for a long time at:

    +
    +
    +
    +
    Hang on, we are calculating your cache line size.
    +
    +
    +
    +

    Bug report: https://github.com/intel/lmbench/issues/15

    +
    +
    +

    If I kill it, the configuration process continues:

    +
    +
    +
    +
    Killed
    +OK, it looks like your cache line is  bytes.
    +
    +
    +
    +

    and continues with a few more interactive questions until finally:

    +
    +
    +
    +
    Confguration done, thanks.
    +
    +
    +
    +

    where it again hangs for at least 2 hours, so I lost patience and killed it.

    +
    +
    +

    TODO: how to do a non-interactive config? After the above procedure, bin/x86_64-linux-gnu/CONFIG.ciro-p51 contains:

    +
    +
    +
    +
    DISKS=""
    +DISK_DESC=""
    +OUTPUT=/dev/null
    +ENOUGH=50000
    +FASTMEM="NO"
    +FILE=/var/tmp/XXX
    +FSDIR=/var/tmp
    +INFO=INFO.ciro-p51
    +LINE_SIZE=
    +LOOP_O=0.00000000
    +MAIL=no
    +TOTAL_MEM=31903
    +MB=22332
    +MHZ="-1 System too busy"
    +MOTHERBOARD=""
    +NETWORKS=""
    +OS="x86_64-linux-gnu"
    +PROCESSORS="8"
    +REMOTE=""
    +SLOWFS="NO"
    +SYNC_MAX="1"
    +LMBENCH_SCHED="DEFAULT"
    +TIMING_O=0
    +RSH=rsh
    +RCP=rcp
    +VERSION=lmbench-3alpha4
    +BENCHMARK_HARDWARE=YES
    +BENCHMARK_OS=YES
    +BENCHMARK_SYSCALL=
    +BENCHMARK_SELECT=
    +BENCHMARK_PROC=
    +BENCHMARK_CTX=
    +BENCHMARK_PAGEFAULT=
    +BENCHMARK_FILE=
    +BENCHMARK_MMAP=
    +BENCHMARK_PIPE=
    +BENCHMARK_UNIX=
    +BENCHMARK_UDP=
    +BENCHMARK_TCP=
    +BENCHMARK_CONNECT=
    +BENCHMARK_RPC=
    +BENCHMARK_HTTP=
    +BENCHMARK_BCOPY=
    +BENCHMARK_MEM=
    +BENCHMARK_OPS=
    +
    +
    +
    +

    Native build only without running tests:

    +
    +
    +
    +
    cd src
    +make
    +
    +
    +
    +

    Interestingly, one of the creators of LMbench, Larry McVoy (https://www.linkedin.com/in/larrymcvoy/, https://en.wikipedia.org/wiki/Larry_McVoy), is also a co-founder of BitKeeper. Their SCM must be blazingly fast!!! Also his LinkedIn says Intel uses it. But they will forever be remembered as "the closed source Git precursor that died N years ago", RIP.

    +
    +
    +
    +

    21.8.4. STREAM benchmark

    http://www.cs.virginia.edu/stream/ref.html

    @@ -29853,7 +31741,7 @@ times[3 * ntimes + k] = mysecond() - times[3 * ntimes + k];
    -

    21.8.4. PARSEC benchmark

    +

    21.8.5. PARSEC benchmark

    We have ported parts of the PARSEC benchmark for cross compilation at: https://github.com/cirosantilli/parsec-benchmark See the documentation on that repo to find out which benchmarks have been ported. Some of the benchmarks are segfaulting, they are documented in that repo.

    @@ -29871,7 +31759,7 @@ times[3 * ntimes + k] = mysecond() - times[3 * ntimes + k];
    -
    21.8.4.1. PARSEC benchmark without parsecmgmt
    +
    21.8.5.1. PARSEC benchmark without parsecmgmt
    ./build --arch arm --download-dependencies gem5-buildroot parsec-benchmark
    @@ -29905,7 +31793,7 @@ times[3 * ntimes + k] = mysecond() - times[3 * ntimes + k];
     
    -
    21.8.4.2. PARSEC change the input size
    +
    21.8.5.2. PARSEC change the input size

    Running a benchmark of a size different than test, e.g. simsmall, requires a rebuild with:

    @@ -29969,7 +31857,7 @@ times[3 * ntimes + k] = mysecond() - times[3 * ntimes + k];
    -
    21.8.4.3. PARSEC benchmark with parsecmgmt
    +
    21.8.5.3. PARSEC benchmark with parsecmgmt

    Most users won’t want to use this method because:

    @@ -30032,7 +31920,7 @@ parsecmgmt -a run -p splash2x.fmm -i test
    -
    21.8.4.4. PARSEC uninstall
    +
    21.8.5.4. PARSEC uninstall

    If you want to remove PARSEC later, Buildroot doesn’t provide an automated package removal mechanism as mentioned at: Section 20.6, “Remove Buildroot packages”, but the following procedure should be satisfactory:

    @@ -30050,7 +31938,7 @@ parsecmgmt -a run -p splash2x.fmm -i test
    -
    21.8.4.5. PARSEC benchmark hacking
    +
    21.8.5.5. PARSEC benchmark hacking

    If you end up going inside submodules/parsec-benchmark to hack up the benchmark (you will!), these tips will be helpful.

    @@ -31250,10 +33138,10 @@ zmmintrin.h AVX512

    userland/arch/x86_64/freestanding/linux/int_system_call.S

  • -

    userland/arch/x86_64/inline_asm/freestanding/linux/hello.c

    +

    userland/arch/x86_64/inline_asm/freestanding/linux/hello.c: this shows how to do system calls from inline assembly without any C standard library helpers like syscall

  • -

    userland/arch/x86_64/inline_asm/freestanding/linux/hello_regvar.c

    +

    userland/arch/x86_64/inline_asm/freestanding/linux/hello_regvar.c: same as userland/arch/x86_64/inline_asm/freestanding/linux/hello.c but using register variables instead of register constraints

  • @@ -31266,7 +33154,7 @@ zmmintrin.h AVX512

    userland/arch/arm/freestanding/linux/hello.S

  • -

    userland/arch/arm/inline_asm/freestanding/linux/hello.c

    +

    userland/arch/arm/inline_asm/freestanding/linux/hello.c: there are no register constraints in ARM, so register variables are the most efficient way of storing variables in specific general purpose registers: https://stackoverflow.com/questions/3929442/how-to-specify-an-individual-register-as-constraint-in-arm-gcc-inline-assembly/54845046#54845046

  • @@ -34587,6 +36475,158 @@ ldmia sp!, reglist

    GNU GAS 2.29 does not have a mnemonic for it in A64 because it is very recent: it shows up in ARMv8 architecture reference manual db but not ca.

    +
    +

    24.5.3. ARM system register instructions

    +
    +

    Examples of using them can be found at: dump_regs

    +
    +
    +

    aarch64 only uses exactly 2 instructions:

    +
    +
    + +
    +
    +

    aarch32 is a bit more messy due to older setups, we have both:

    +
    +
    + +
    +
    +

    TODO why both? For example, as mentioned at https://stackoverflow.com/questions/62920281/cross-compilng-c-program-for-armv8-a-in-linux-x86-64-system/62922677#62922677 a register that was accessed with MRC in armv7 can move to MRS in aarch64, as is the case for:

    +
    +
    +
    +
    mrc p15, 0, r0, c0, c0, 1 /* aarch32: CTR */
    +mrs x0, ctr_el0 /* aarch64: CTR_EL0 */
    +
    +
    +
    +

    Other functionality has moved away from coprocessors into actual instructions, e.g. cache invalidation:

    +
    +
    +
    +
    /* aarch32: DCISW, Data Cache line Invalidate by Set/Way. */
    +mcr     p15, 0, r5, c7, c6, 2
    +
    +/* aarch64: moved to one of the DC instruction variants. */
    +dc isw, x5
    +
    +
    +
    +

    ARMv8 architecture reference manual db G1.19.4 "Background to the System register interface" says that only CP14 and CP15 are specified by the ISA:

    +
    +
    +
    +
    +

    The interface to the System registers was originally defined as part of a generic coprocessor interface, that gave access to 15 coprocessors, CP0 - CP15. Of these, CP8 - CP15 were reserved for use by Arm, while CP0 - CP7 were available for IMPLEMENTATION DEFINED coprocessors.

    +
    +
    +
    +
    +

    and the actual coprocessor registers are specified in Chapter G7 "AArch32 System Register Encoding" at:

    +
    +
    + +
    +
    +

    The operand order of the actual MRC/MCR assembly does not exactly match the column order of that table. This is how you can decode it, taking a sample MCR:

    +
    +
    +
    +
    mcr     p15, 0, r5, c7, c6, 2
    +
    +
    +
    +

    what each part means:

    +
    +
    +
    +
    mcr     p<coproc>, <opc1>, <src-dest-reg>, <CRn>, <CRm>, <opc2>
    +
    +
    +
    +
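The operand layout above can be sketched as a small parser. This is only an illustration: `CoprocAccess` and `parse_coproc_access` are hypothetical names that are not part of this repository, and the regular expression only handles the plain `mcr`/`mrc` form shown above (no condition codes, no `mcr2`).

```python
import re
from typing import NamedTuple


class CoprocAccess(NamedTuple):
    """Operands of an AArch32 MCR/MRC instruction, in assembly order."""
    mnemonic: str
    coproc: int
    opc1: int
    reg: str
    crn: int
    crm: int
    opc2: int


# Hypothetical helper pattern: mcr|mrc p<coproc>, <opc1>, <reg>, c<CRn>, c<CRm>, <opc2>
_PATTERN = re.compile(
    r"^\s*(mcr|mrc)\s+p(\d+)\s*,\s*(\d+)\s*,\s*(\w+)\s*,"
    r"\s*c(\d+)\s*,\s*c(\d+)\s*,\s*(\d+)\s*$",
    re.IGNORECASE,
)


def parse_coproc_access(line):
    """Split one MCR/MRC assembly line into its named fields."""
    match = _PATTERN.match(line)
    if match is None:
        raise ValueError("not a plain mcr/mrc line: " + repr(line))
    mnemonic, coproc, opc1, reg, crn, crm, opc2 = match.groups()
    return CoprocAccess(
        mnemonic.lower(), int(coproc), int(opc1), reg,
        int(crn), int(crm), int(opc2),
    )


# The DCISW example from above.
access = parse_coproc_access("mcr     p15, 0, r5, c7, c6, 2")
print(access)
```

With the fields named like this, the lookup in the manual table becomes a search for the row with the matching coproc, CRn, opc1, CRm and opc2 values.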
    24.5.3.1. ARM system register encodings
    +
    +

    Each aarch64 system register is specified in the encoding of ARM system register instructions by 5 integer numbers:

    +
    +
    +
      +
    • +

      op0

      +
    • +
    • +

      op1

      +
    • +
    • +

      CRn

      +
    • +
    • +

      CRm

      +
    • +
    • +

      op2

      +
    • +
    +
    +
    +

    The encodings are given on large tables in ARMv8 architecture reference manual db Chapter D12 "AArch64 System Register Encoding".

    +
    +
    +

    As shown in baremetal/arch/aarch64/dump_regs.c as of LKMC 4e05b00d23c73cc4d3b83be94affdb6f28008d99, you can use the encoding parameters directly in GNU GAS assembly:

    +
    +
    +
    +
    uint32_t id_isar6_el1;
    +__asm__ ("mrs %0, s3_0_c0_c2_7" : "=r" (id_isar6_el1) : :);
    +LKMC_DUMP_SYSTEM_REGS_PRINTF("ID_ISAR6_EL1 0x%" PRIX32 "\n", id_isar6_el1);
    +
    +
    +
    +

    This can be useful to refer to new system registers which your older version of GNU GAS does not yet have a name for.

    +
    +
    +

    The Linux kernel also uses explicit sysreg encoding extensively since it is of course a very early user of many new system registers, this is done at arch/arm64/include/asm/sysreg.h in Linux v5.4.
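As a rough sketch of how the 5 numbers map to both the generic GNU GAS register name and the instruction bits: the function names below are made up for illustration, and the bit positions are my understanding of the A64 MRS layout (op0 at bit 19, op1 at bit 16, CRn at bit 12, CRm at bit 8, op2 at bit 5, Rt in the low 5 bits), so double check them against Chapter D12 before relying on them.

```python
def sysreg_gas_name(op0, op1, crn, crm, op2):
    """Generic name accepted by GNU GAS, e.g. s3_0_c0_c2_7."""
    return f"s{op0}_{op1}_c{crn}_c{crm}_{op2}"


def sysreg_field(op0, op1, crn, crm, op2):
    """Pack the 5 encoding parameters into their instruction bit positions."""
    return (op0 << 19) | (op1 << 16) | (crn << 12) | (crm << 8) | (op2 << 5)


def mrs_encoding(rt, op0, op1, crn, crm, op2):
    """32-bit encoding of 'mrs x<rt>, <sysreg>' (assumed layout, see above)."""
    if op0 not in (2, 3):
        raise ValueError("MRS can only encode op0 2 or 3")
    if not 0 <= rt <= 31:
        raise ValueError("rt must be in 0..31")
    return 0xD5200000 | sysreg_field(op0, op1, crn, crm, op2) | rt


# ID_ISAR6_EL1, matching the s3_0_c0_c2_7 name used in dump_regs.c.
ID_ISAR6_EL1 = (3, 0, 0, 2, 7)
print(sysreg_gas_name(*ID_ISAR6_EL1))
print(hex(mrs_encoding(0, *ID_ISAR6_EL1)))
```

One way to sanity check the sketch is to assemble `mrs x0, s3_0_c0_c2_7` with GNU GAS and compare the objdump bytes with the printed value.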

    +
    +
    +

    24.6. ARM SIMD

    @@ -35326,7 +37366,13 @@ AArch64, see Procedure Call Standard for the ARM 64-bit Architecture.

    -
    24.9.2.4. Programmer’s Guide for ARMv8-A
    +
    24.9.2.4. ARMv8 architecture reference manual db
    +
    +

    https://static.docs.arm.com/ddi0487/fa/DDI0487F_a_armv8_arm.pdf

    +
    +
    +
    +
    24.9.2.5. Programmer’s Guide for ARMv8-A

    https://static.docs.arm.com/den0024/a/DEN0024A_v8_architecture_PG.pdf

    @@ -35341,7 +37387,7 @@ AArch64, see Procedure Call Standard for the ARM 64-bit Architecture.

    -
    24.9.2.5. Arm A64 Instruction Set Architecture: Future Architecture Technologies in the A architecture profile Documentation
    +
    24.9.2.6. Arm A64 Instruction Set Architecture: Future Architecture Technologies in the A architecture profile Documentation

    https://developer.arm.com/docs/ddi0602/b

    @@ -35350,15 +37396,31 @@ AArch64, see Procedure Call Standard for the ARM 64-bit Architecture.

    -
    24.9.2.6. ARM processor documentation
    +
    24.9.2.7. ARM processor documentation

    ARM also releases documentation specific to each given processor.

    This adds extra details to the more portable ARMv8 architecture reference manual ISA documentation.

    +
    +

    For every processor, there are basically two key documents:

    +
    +
    + +
    -
    24.9.2.6.1. ARM Cortex-A15 MPCore Processor Technical Reference Manual r4p0
    +
    24.9.2.7.1. ARM Cortex-A15 MPCore Processor Technical Reference Manual r4p0
    @@ -35367,6 +37429,18 @@ AArch64, see Procedure Call Standard for the ARM 64-bit Architecture.

    +
    +
    24.9.2.8. Arm Cortex‑A77 Technical Reference Manual r1p1
    +
    +

    https://static.docs.arm.com/101111/0101/arm_cortex_a77_trm_101111_0101_04_en.pdf

    +
    +
    +
    +
    24.9.2.9. Arm Cortex‑A77 Software Optimization Guide r1p1
    +
    +

    https://static.docs.arm.com/swog011050/c/Arm_Cortex-A77_Software_Optimization_Guide.pdf

    +
    +
    @@ -35634,6 +37708,15 @@ cc

    It is worth noting that e.g. ARM has a Semihosting mechanism for loading CLI arguments through SYS_GET_CMDLINE, but our mechanism works in principle for any ISA.

    +
    +

    27.4.1. gem5 baremetal arm CLI args

    +
    +

    Currently not supported, so we just hardcode argc 0 on the arm baremetal bootloader.

    +
    +
    +

    I think we have to keep the CLI args below 32 GiB, otherwise argc cannot be correctly set up. But currently the gem5 text segment is exactly at 32 GiB, and we always place the CLI args higher in the Baremetal linker script.

    +
    +

    27.5. Semihosting

    @@ -36846,6 +38929,15 @@ IN: main
  • userland/arch/aarch64/inline_asm/wfe_ldxr_str.cpp

  • +
  • +

    userland/arch/aarch64/inline_asm/futex_ldxr_stxr.c: tests that ldxr and stxr do not interact with futexes. This was leading to problems in gem5 syscall emulation mode at one point: https://gem5.atlassian.net/browse/GEM5-537

    +
    +

    Correct outcome: gem5 simulate() limit reached.

    +
    +
    +

    Incorrect behaviour due to: https://gem5.atlassian.net/browse/GEM5-537: Exits successfully.

    +
    +
  • @@ -37650,7 +39742,7 @@ ISB

    In baremetal, we detect if tests failed by parsing logs for the Magic failure string.

    -

    See: Section 33.15, “Test this repo” for more useful testing tips.

    +

    See: Section 33.16, “Test this repo” for more useful testing tips.

    @@ -38195,6 +40287,14 @@ instructions 124346081

    Same but with: gem5 arm Linux kernel patches at v4.15: 73s, kernel size: 132M.

    +
    +

    On Ubuntu 20.04 gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772 this took 22 minutes 53 seconds:

    +
    +
    +
    +
    ./run -aa -eg --cpus 2 --tmux --quit-after-boot -- --cpu-type DerivO3CPU --caches
    +
    +
    29.2.1.1. gem5 arm HPI boot takes much longer than aarch64
    @@ -39014,11 +41114,11 @@ west build -b qemu_aarch64 samples/hello_world

    The hard part is how to prevent the compiler from optimizing it away: https://stackoverflow.com/questions/7083482/how-to-prevent-gcc-from-optimizing-out-a-busy-wait-loop/58758133#58758133

    -

    Disassembly analysis:

    +

    Disassembly analysis:

    -
    ./run-toolchain --arch aarch64 gdb -- -nh -batch -ex 'disas/rs busy_loop' "$(./getvar --arch aarch64 userland_build_dir)/gcc/busy_loop.out"
    +
    ./disas --arch aarch64 --userland userland/gcc/busy_loop.out busy_loop
    @@ -39103,7 +41203,7 @@ west build -b qemu_aarch64 samples/hello_world

    gem5 MinorCPU

  • -

    gem5 DerivO3CPU

    +

    gem5 DerivO3CPU

  • @@ -39126,10 +41226,35 @@ west build -b qemu_aarch64 samples/hello_world

    http://www.lighterra.com/papers/modernmicroprocessors/ explains it well.

    -

    You basically decode

    +

    You basically decode multiple instructions in one go, and run them at the same time if they can go in separate functional units and have no conflicts. Genius!

    -

    TODO in gem5? gem5 definitely has functional units explicitly modelled: gem5 functional units, so do gem5 MinorCPU or gem5 DerivO3CPU have it?

    +

    And so the concept of branch predictor must come in here: when a conditional branch is reached, you have to decide which side to execute before knowing for sure.

    +
    +
    +

    This is why it is called a type of Instruction level parallelism.

    +
    +
    +

    Although this is a microarchitectural feature, it is so important that it is publicly documented. For example:

    +
    +
    + +
    +
    +

    32.2.1. Execution unit

    +
    +

    https://en.wikipedia.org/wiki/Execution_unit

    +
    +
    +

    gem5 calls them "functional units".

    +
    +
    +

    gem5 has functional units explicitly modelled as shown at gem5 functional units, and those are used by both gem5 MinorCPU and gem5 DerivO3CPU.

    +
    @@ -39138,11 +41263,82 @@ west build -b qemu_aarch64 samples/hello_world

    https://en.wikipedia.org/wiki/Out-of-order_execution

    -

    gem5’s model is gem5 DerivO3CPU.

    +

    gem5’s model is gem5 DerivO3CPU.

    +
    +
    +

    Allows working around data dependencies: if the next instruction depends on the current one and has to wait for it, you can go ahead and execute the instruction after that one, provided that it has no such dependency.
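A toy model of that idea, not related to how gem5 DerivO3CPU is actually implemented: at most one instruction issues per cycle, each result becomes available `latency` cycles after issue, and WAR/WAW hazards are ignored (as if register renaming took care of them). All names are made up for this sketch.

```python
from typing import NamedTuple


class Insn(NamedTuple):
    text: str
    dst: str
    srcs: tuple
    latency: int


def simulate(program, out_of_order):
    """Return (issue order, cycle at which the last result is available)."""
    ready_at = {}  # register -> cycle at which its new value is available
    pending = list(range(len(program)))
    order = []
    cycle = 0
    finish = 0

    def is_ready(idx):
        insn = program[idx]
        # A source still to be produced by an older, unissued instruction
        # is a true (RAW) dependency: we must wait for that producer.
        for older in pending:
            if older < idx and program[older].dst in insn.srcs:
                return False
        return all(ready_at.get(src, 0) <= cycle for src in insn.srcs)

    while pending:
        # In-order: only the oldest pending instruction may issue.
        # Out-of-order: the oldest pending instruction that is ready may issue.
        candidates = list(pending) if out_of_order else pending[:1]
        for idx in candidates:
            if is_ready(idx):
                insn = program[idx]
                ready_at[insn.dst] = cycle + insn.latency
                finish = max(finish, cycle + insn.latency)
                order.append(idx)
                pending.remove(idx)
                break
        cycle += 1
    return order, finish


program = [
    Insn("ldr x1, [x0]", "x1", ("x0",), 3),    # slow load
    Insn("add x2, x1, 1", "x2", ("x1",), 1),   # depends on the load
    Insn("add x3, x4, 1", "x3", ("x4",), 1),   # independent
]
print(simulate(program, out_of_order=False))  # ([0, 1, 2], 5)
print(simulate(program, out_of_order=True))   # ([0, 2, 1], 4)
```

The out-of-order run issues the independent `add x3` while the dependent `add x2` is still waiting for the load, which is where the one cycle gain comes from.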

    +
    +
    +

    Likely used on basically all (?) 2020 non-power-constrained CPUs.

    +
    +
    +

    As mentioned at: https://stackoverflow.com/questions/10074831/what-is-general-difference-between-superscalar-and-ooo-execution it is in theory possible for an out-of-order CPU to not be a Superscalar processor, but the combination is so natural (since you can look ahead, you might as well run it!) that out-of-order CPUs which are not superscalar are not super common.

    +
    +
    +

    32.3.1. Speculative execution

    +
    +

    https://en.wikipedia.org/wiki/Speculative_execution

    +
    +
    +

    A gem5 example can be seen at: gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: speculative.

    +
    +
    +

    Bibliography:

    +
    +
    + +
    +
    +
    32.3.1.1. Branch predictor
    + +
    +

    Comes in for superscalar processors.

    +
    + +
    +
    +
    +
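To make the idea of guessing a branch outcome more concrete, here is a sketch of the classic textbook 2-bit saturating counter scheme. This is an assumption chosen for illustration, not necessarily what gem5 or any real CPU implements, and the class and function names are made up.

```python
class TwoBitPredictor:
    """Counter states: 0, 1 predict not taken; 2, 3 predict taken."""

    def __init__(self):
        self.counter = 1  # start at "weakly not taken"

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        # Move one step towards the actual outcome, saturating at 0 and 3.
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def count_mispredictions(outcomes):
    """Run one predictor over a sequence of actual branch outcomes."""
    predictor = TwoBitPredictor()
    mispredictions = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            mispredictions += 1
        predictor.update(taken)
    return mispredictions


# Backwards branch of a loop with 10 iterations: taken 9 times, then falls through.
loop_branch = [True] * 9 + [False]
print(count_mispredictions(loop_branch))  # 2
```

Only the first and the last iteration are mispredicted here, which is why even such a simple scheme pays off for loops.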

    32.3.2. Re-order buffer

    +
    +

    https://en.wikipedia.org/wiki/Re-order_buffer

    +
    +
    +
    +

    32.3.3. Register renaming

    +
    +

    https://en.wikipedia.org/wiki/Register_renaming

    +
    -

    32.4. Hardware threads

    +

    32.4. Instruction level parallelism

    +
    +

    https://en.wikipedia.org/wiki/Instruction-level_parallelism

    +
    +
    +

    Basically means decoding and then potentially executing a bunch of instructions in one go.

    +
    +
    +

    Important examples:

    +
    +
    + +
    +
    +
    +

    32.5. Hardware threads

    Intel name: "Hyperthreading"

    @@ -39192,7 +41388,7 @@ west build -b qemu_aarch64 samples/hello_world
    -

    32.5. Cache coherence

    +

    32.6. Cache coherence

    https://en.wikipedia.org/wiki/Cache_coherence

    @@ -39234,7 +41430,7 @@ west build -b qemu_aarch64 samples/hello_world

    Even if caches are coherent, this is still not enough to avoid data race conditions, because this does not enforce atomicity of read modify write sequences. This is for example shown at: Detailed gem5 analysis of how data races happen.
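A deterministic sketch of why read modify write sequences need atomicity: memory here is a single perfectly coherent integer, yet an unlucky interleaving of the load, add and store steps of two threads still loses an increment. The `run` function and the schedule format are made up for this illustration.

```python
def run(schedule):
    """Execute (thread id, operation) steps against one shared memory cell."""
    memory = 0
    regs = {}  # per-thread private register
    for thread, op in schedule:
        if op == "load":
            regs[thread] = memory
        elif op == "add":
            regs[thread] += 1
        elif op == "store":
            memory = regs[thread]
        else:
            raise ValueError("unknown operation: " + op)
    return memory


# Each thread completes its increment before the other one starts.
sequential = [
    (0, "load"), (0, "add"), (0, "store"),
    (1, "load"), (1, "add"), (1, "store"),
]
# Both threads load the old value before either one stores.
interleaved = [
    (0, "load"), (1, "load"),
    (0, "add"), (1, "add"),
    (0, "store"), (1, "store"),
]
print(run(sequential))   # 2
print(run(interleaved))  # 1
```

In the interleaved schedule thread 1 overwrites the store of thread 0 with a value computed from stale data, which is the same effect that makes the non-atomic counter in userland/c/atomic.c come out smaller than expected.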

    -

    32.5.1. Memory consistency

    +

    32.6.1. Memory consistency

    According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf "memory consistency" is about ordering requirements of different memory addresses.

    @@ -39242,14 +41438,14 @@ west build -b qemu_aarch64 samples/hello_world

    This is represented explicitly in C++ for example C++ std::memory_order.

    -
    32.5.1.1. Sequential Consistency
    +
    32.6.1.1. Sequential Consistency

    According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf, the strongest possible consistency, everything nicely ordered as you’d expect.

    -

    32.5.2. Can caches snoop data from other caches?

    +

    32.6.2. Can caches snoop data from other caches?

    Either they can snoop only control, or both control and data can be snooped.

    @@ -39264,7 +41460,7 @@ west build -b qemu_aarch64 samples/hello_world
    -

    32.5.3. VI cache coherence protocol

    +

    32.6.3. VI cache coherence protocol

    Mentioned at:

    @@ -39511,7 +41707,7 @@ west build -b qemu_aarch64 samples/hello_world
    -

    32.5.4. MSI cache coherence protocol

    +

    32.6.4. MSI cache coherence protocol

    https://en.wikipedia.org/wiki/MSI_protocol

    @@ -39823,7 +42019,7 @@ CACHE2 S nyy

    TODO gem5 concrete example.

    -
    32.5.4.1. MSI cache coherence protocol with transient states
    +
    32.6.4.1. MSI cache coherence protocol with transient states

    TODO understand well why those are needed.

    @@ -39843,7 +42039,7 @@ CACHE2 S nyy
    -

    32.5.5. MESI cache coherence protocol

    +

    32.6.5. MESI cache coherence protocol

    https://en.wikipedia.org/wiki/MESI_protocol

    @@ -39903,7 +42099,7 @@ CACHE2 S nyy
    -

    32.5.6. MOSI cache coherence protocol

    +

    32.6.6. MOSI cache coherence protocol

    https://en.wikipedia.org/wiki/MOSI_protocol The critical MSI vs MOSI section was a bit bogus though: https://en.wikipedia.org/w/index.php?title=MOSI_protocol&oldid=895443023 but I edited it :-)

    @@ -39963,7 +42159,7 @@ CACHE2 S nyy
    -

    32.5.7. MOESI cache coherence protocol

    +

    32.6.7. MOESI cache coherence protocol

    https://en.wikipedia.org/wiki/MOESI_protocol

    @@ -40495,7 +42691,7 @@ export CCACHE_MAXSIZE="20G"

    33.10.1. run-toolchain

    -

    While you could just manually find/learn the path to toolchain tools, e.g. in LKMC b15a0e455d691afa49f3b813ad9b09394dfb02b7 they are

    +

    While you could just manually find/learn the path to toolchain tools, e.g. in LKMC b15a0e455d691afa49f3b813ad9b09394dfb02b7 they are:

    @@ -40513,6 +42709,17 @@ export CCACHE_MAXSIZE="20G"
    +

    This plays nicely with getvar e.g. you could disassemble userland/c/hello.c with:

    +
    +
    +
    +
    ./run-toolchain --arch aarch64 objdump -- -D $(./getvar --arch aarch64 userland_build_dir)/c/hello.out
    +
    +
    +
    +

    however disassembly is such a common use case that we have a shortcut for it: disas.

    +
    +

    Alternatively, if you just need a variable to feed into your own Build system, you can also use getvar:

    @@ -40528,6 +42735,36 @@ export CCACHE_MAXSIZE="20G"
    /path/to/linux-kernel-module-cheat/out/buildroot/build/default/aarch64/host/usr/bin/aarch64-buildroot-linux-gnu
    +
    +
    33.10.1.1. disas
    +
    +

    Since disassembly of a single function of a LKMC executable with GDB is such a common use case for run-toolchain via https://stackoverflow.com/questions/22769246/how-to-disassemble-one-single-function-using-objdump, we have this shortcut for it.

    +
    +
    +

    For example to disassemble a function from a userland binary:

    +
    +
    +
    +
    ./disas --arch aarch64 --userland userland/c/hello.c main
    +
    +
    +
    +

    or to disassemble a function from the Linux kernel:

    +
    +
    +
    +
    ./disas --arch aarch64 start_kernel
    +
    +
    +
    +

    and a baremetal executable:

    +
    +
    +
    +
    ./disas --arch aarch64 --baremetal baremetal/arch/aarch64/no_bootloader/exit.S _start
    +
    +
    +
    @@ -40889,9 +43126,36 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
    -

    33.14. Directory structure

    +

    33.14. Optimization level of a build

    +
    +

    The --optimization-level option is available on all build scripts and sets the given GCC -O optimization level where it has been implemented for guest binaries.

    +
    +
    +

    The default optimization level is -O0 to improve guest visibility.

    +
    +
    +

    To keep things sane, you generally want to create a separate build variant for each optimization level, e.g. to create an -O3 build:

    +
    +
    +
    +
    ./build-userland --optimization-level 3 --userland-build-id o3
    +./run --userland userland/c/hello.c --userland-build-id o3
    +
    +
    +
    +

    Note that for some guest content, there are hard technical challenges that prevent us from forwarding -O, notably the Linux kernel: Disable kernel compiler optimizations.

    +
    +
    +

    Our emulators however are built with higher optimization levels by default, otherwise running anything would be unbearably slow.

    +
    +
    +

    Emulator builds are also controlled with other mechanisms instead of --optimization-level as explained at: Debug the emulator.

    +
    +
    +
    +

    33.15. Directory structure

    -

    33.14.1. lkmc directory

    +

    33.15.1. lkmc directory

    lkmc/ contains sources and headers that are shared across kernel modules, userland and baremetal examples.

    @@ -40902,7 +43166,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -

    Another option would have been to name it as includes/lkmc, but that would make paths longer, and we might want to store source code in that directory as well in the future.

    -
    33.14.1.1. Userland objects vs header-only
    +
    33.15.1.1. Userland objects vs header-only

    When factoring out functionality across userland examples, there are two main options:

    @@ -40961,7 +43225,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
    -

    33.14.2. buildroot_packages directory

    +

    33.15.2. buildroot_packages directory

    Source: buildroot_packages/.

    @@ -41010,7 +43274,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -

    A custom build script can give you more flexibility: e.g. the package can be made work with other root filesystems more easily, have better 9P support, and rebuild faster as it evades some Buildroot boilerplate.

    -
    33.14.2.1. kernel_modules buildroot package
    +
    33.15.2.1. kernel_modules buildroot package

    Source: buildroot_packages/kernel_modules/

    @@ -41057,9 +43321,9 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
    -

    33.14.3. patches directory

    +

    33.15.3. patches directory

    -
    33.14.3.1. patches/global directory
    +
    33.15.3.1. patches/global directory

    Has the following structure:

    @@ -41076,7 +43340,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
    -
    33.14.3.2. patches/manual directory
    +
    33.15.3.2. patches/manual directory

    Patches in this directory are never applied automatically: it is up to users to manually apply them before usage following the instructions in this documentation.

    @@ -41086,7 +43350,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
    -

    33.14.4. rootfs_overlay

    +

    33.15.4. rootfs_overlay

    Source: rootfs_overlay.

    @@ -41133,7 +43397,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -

    This way you can just hack away the scripts and try them out immediately without any further operations.

    -
    33.14.4.1. out_rootfs_overlay_dir
    +
    33.15.4.1. out_rootfs_overlay_dir

    This path can be found with:

    @@ -41167,7 +43431,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
    -

    33.14.5. lkmc.c

    +

    33.15.5. lkmc.c

    The files:

    @@ -41197,7 +43461,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
    -

    33.14.6. lkmc_home

    +

    33.15.6. lkmc_home

    lkmc_home refers to the target base directory in which we put all our custom built stuff, such as userland executables and kernel modules.

    @@ -41230,7 +43494,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
    -

    33.14.7. path_properties.py

    +

    33.15.7. path_properties.py

    In order to build and run each userland and baremetal example properly, we need per-file metadata such as compiler flags and required number of cores.

    @@ -41293,7 +43557,7 @@ baremetal=True
    -

    33.14.8. rand_check.out

    +

    33.15.8. rand_check.out

    Print out several parameters that normally change randomly from boot to boot:

    @@ -41321,9 +43585,9 @@ baremetal=True
    -

    33.15. Test this repo

    +

    33.16. Test this repo

    -

    33.15.1. Automated tests

    +

    33.16.1. Automated tests

    Run almost all tests:

    @@ -41379,7 +43643,7 @@ echo $?

    test does not run all possible tests, because there are too many possible variations and that would take forever. The rationale is the same as for ./build all and is explained in ./build --help.

    -
    33.15.1.1. Test arch and emulator selection
    +
    33.16.1.1. Test arch and emulator selection

    You can select multiple archs and emulators of interest, as for any other command, with:

    @@ -41412,7 +43676,7 @@ echo $?
    -
    33.15.1.2. Quit on fail
    +
    33.16.1.2. Quit on fail

    By default, the tests continue running even after the first failure happens, and they show a summary at the end.

    @@ -41426,7 +43690,7 @@ echo $?
    -
    33.15.1.3. Test userland in full system
    +
    33.16.1.3. Test userland in full system

    TODO: we really need a mechanism to generate the test list automatically, e.g. based on path_properties.py. Currently there are many tests missing, and we have to add everything manually, which is very annoying.

    @@ -41455,7 +43719,7 @@ echo $?
    -
    33.15.1.4. GDB tests
    +
    33.16.1.4. GDB tests

    We have some pexpect automated tests for GDB for both userland and baremetal programs!

    @@ -41528,7 +43792,7 @@ echo $?
    -
    33.15.1.5. Magic failure string
    +
    33.16.1.5. Magic failure string

    We do not know of any way to set the emulator exit status in QEMU arm full system.

    @@ -41631,9 +43895,9 @@ echo $?
    -

    33.15.2. Non-automated tests

    +

    33.16.2. Non-automated tests

    -
    33.15.2.1. Test GDB Linux kernel
    +
    33.16.2.1. Test GDB Linux kernel

    For the Linux kernel, do the following manual tests for now.

    @@ -41671,7 +43935,7 @@ echo $?
    -
    33.15.2.2. Test the Internet
    +
    33.16.2.2. Test the Internet

    You should also test that the Internet works:

    @@ -41682,7 +43946,7 @@ echo $?
    -
    33.15.2.3. CLI script tests
    +
    33.16.2.3. CLI script tests

    build-userland and test-executables have a wide variety of target selection modes, and it was hard to keep them all working without some tests:

    @@ -41700,7 +43964,7 @@ echo $?
    -

    33.16. Bisection

    +

    33.17. Bisection

    When updating the Linux kernel, QEMU and gem5, things sometimes break.

    @@ -41756,7 +44020,7 @@ git submodule update
    -

    33.17. Update a forked submodule

    +

    33.18. Update a forked submodule

    This is a template update procedure for submodules for which we have some patches on top of mainline.

    @@ -41785,9 +44049,9 @@ git commit -m "linux: update to ${next_mainline_revision}"
    -

    33.18. Release

    +

    33.19. Release

    -

    33.18.1. Release procedure

    +

    33.19.1. Release procedure

    Ensure that the Automated tests are passing on a clean build:

    @@ -41798,7 +44062,7 @@ git commit -m "linux: update to ${next_mainline_revision}"
    -

    The ./build-test command builds a superset of what will be downloaded which also tests other things we would like to be working on the release. For the minimal build to generate the files to be uploaded, see: Section 33.18.2, “release-zip”

    +

    The ./build-test command builds a superset of what will be downloaded which also tests other things we would like to be working on the release. For the minimal build to generate the files to be uploaded, see: Section 33.19.2, “release-zip”

    The clean build is necessary as it generates clean images since it is not possible to remove Buildroot packages

    @@ -41868,7 +44132,7 @@ git push --follow-tags
    -

    33.18.2. release-zip

    +

    33.19.2. release-zip

    Create a zip containing all files required for Prebuilt setup:

    @@ -41893,7 +44157,7 @@ git push --follow-tags
    -

    33.18.3. release-upload

    +

    33.19.3. release-upload

    After:

    @@ -41941,9 +44205,9 @@ git push --follow-tags
    -

    33.19. Design rationale

    +

    33.20. Design rationale

    -

    33.19.1. Design goals

    +

    33.20.1. Design goals

    This project was created to help me understand, modify and test low level system components by using system simulators.

    @@ -42019,7 +44283,7 @@ git push --follow-tags
    -

    33.19.2. Setup trade-offs

    +

    33.20.2. Setup trade-offs

    The trade-offs between the different setups are basically a balance between:

    @@ -42044,13 +44308,13 @@ git push --follow-tags

    compatibility: how likely is it that all the components will work well together: emulator, compiler, kernel, standard library, …

  • -

    guest software availability: how wide is your choice of easily installed guest software packages? See also: Section 33.19.4, “Linux distro choice”

    +

    guest software availability: how wide is your choice of easily installed guest software packages? See also: Section 33.20.4, “Linux distro choice”

  • -

    33.19.3. Resource tradeoff guidelines

    +

    33.20.3. Resource tradeoff guidelines

    Choosing which features go into our default builds means making tradeoffs, here are our guidelines:

    @@ -42095,7 +44359,7 @@ git push --follow-tags
    -

    33.19.4. Linux distro choice

    +

    33.20.4. Linux distro choice

    We haven’t found the ultimate distro yet, here is a summary table of trade-offs that we care about: Table 8, “Comparison of Linux distros for usage in this repository”.

    @@ -42198,9 +44462,9 @@ git push --follow-tags
    -

    33.20. Soft topics

    +

    33.21. Soft topics

    -

    33.20.1. Fairy tale

    +

    33.21.1. Fairy tale

    @@ -42237,7 +44501,7 @@ git push --follow-tags
    -

    33.20.2. Should you waste your life with systems programming?

    +

    33.21.2. Should you waste your life with systems programming?

    Being the hardcore person who fully understands an important complex system such as a computer, it does have a nice ring to it doesn’t it?

    @@ -42266,6 +44530,9 @@ git push --follow-tags

    In that sense, therefore, the kernel is not as open as one might want to believe.

    +
    +

    Of course, if there is some super useful and undocumented hardware that is just waiting there to be reverse engineered, then that’s a much juicier target :-)

    +
  • it is impossible to become rich with this knowledge.

    @@ -42314,7 +44581,7 @@ git push --follow-tags
  • -

    Are you fine with those points, and ready to continue wasting your life?

    +

    Are you fine with those points, and ready to continue wasting your life with this crap?

    Good. In that case, read on, and let’s have some fun together ;-)

    @@ -42322,7 +44589,7 @@ git push --follow-tags
    -

    33.21. Bibliography

    +

    33.22. Bibliography

    Runnable stuff:

    @@ -42366,6 +44633,9 @@ git push --follow-tags