diff --git a/index.html b/index.html
index 944201a..ad16af9 100644
--- a/index.html
+++ b/index.html
@@ -1026,73 +1026,74 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
18. QEMU
19. gem5
@@ -1140,6 +1141,7 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
19.5.4.1. gem5 fast forward
+19.5.5. gem5 checkpoint upgrader
19.6. Pass extra options to gem5
@@ -1224,7 +1226,13 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
19.16.1.2. gem5 MinorCPU
-19.16.1.3. gem5 DerivO3CPU
+19.16.1.3. gem5 DerivO3CPU
+
+
19.16.2. gem5 ARM RSK
@@ -1296,8 +1304,22 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
19.20.4.4.1. gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs and Ruby
-19.20.4.5. gem5 event queue MinorCPU syscall emulation freestanding example analysis
-19.20.4.6. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis
+19.20.4.5. gem5 event queue MinorCPU syscall emulation freestanding example analysis
+
+
+19.20.4.6. gem5 event queue DerivO3CPU syscall emulation freestanding example analysis
+
+
19.20.5. gem5 instruction definitions
@@ -1336,7 +1358,12 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
19.20.7.4. gem5 Process
-19.20.8. gem5 functional units
+19.20.8. gem5 functional units
+
+
19.20.9. gem5 code generation
- 19.20.9.1. gem5 THE_ISA
@@ -1352,6 +1379,7 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
+19.21. Gensim
20. Buildroot
@@ -1414,7 +1442,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
21.1.3. GCC C extensions
@@ -1508,14 +1540,15 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
@@ -1800,6 +1833,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
24.6. ARM SIMD
@@ -1866,13 +1904,16 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
24.9.2.1. ARMv7 architecture reference manual
24.9.2.2. ARMv8 architecture reference manual
24.9.2.3. ARMv8 architecture reference manual db
-24.9.2.4. Programmer’s Guide for ARMv8-A
-24.9.2.5. Arm A64 Instruction Set Architecture: Future Architecture Technologies in the A architecture profile Documentation
-24.9.2.6. ARM processor documentation
+24.9.2.4. ARMv8 architecture reference manual db
+24.9.2.5. Programmer’s Guide for ARMv8-A
+24.9.2.6. Arm A64 Instruction Set Architecture: Future Architecture Technologies in the A architecture profile Documentation
+24.9.2.7. ARM processor documentation
+24.9.2.8. Arm Cortex‑A77 Technical Reference Manual r1p1
+24.9.2.9. Arm Cortex‑A77 Software Optimization Guide r1p1
@@ -1886,7 +1927,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
27.1. Baremetal GDB step debug
27.2. Baremetal bootloaders
27.3. Baremetal linker script
-27.4. Baremetal command line arguments
+27.4. Baremetal command line arguments
+
+
27.5. Semihosting
-32.2. Superscalar processor
-32.3. Out-of-order execution
-32.4. Hardware threads
-32.5. Cache coherence
+32.2. Superscalar processor
@@ -2089,7 +2149,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
33.9. ccache
33.10. getvar
33.11. Rebuild Buildroot while running
@@ -2107,79 +2171,80 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
33.13.4. Buildroot build variants
-33.14. Directory structure
+33.14. Optimization level of a build
+33.15. Directory structure
-33.15. Test this repo
+33.16. Test this repo
-33.16. Bisection
-33.17. Update a forked submodule
-33.18. Release
+33.17. Bisection
+33.18. Update a forked submodule
+33.19. Release
-33.19. Design rationale
+33.20. Design rationale
-33.20. Soft topics
+33.21. Soft topics
-33.21. Bibliography
+33.22. Bibliography
@@ -2196,7 +2261,7 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
If you don’t know which one to go for, start with QEMU Buildroot setup getting started.
@@ -2613,10 +2678,10 @@ hello /root/.profile
If you really want to develop semiconductors, your only choice is to join a university or a semiconductor company that has the EDA licenses.
-
While hacking QEMU, you will likely want to GDB step its source. That is trivial since QEMU is just another userland program like any other, but our setup has a shortcut to make it even more convenient, see: Section 18.7, “Debug the emulator”.
+
While hacking QEMU, you will likely want to GDB step its source. That is trivial since QEMU is just another userland program like any other, but our setup has a shortcut to make it even more convenient, see: Section 18.8, “Debug the emulator”.
@@ -3752,7 +3817,7 @@ cd userland
Here we used --force-rebuild to force rebuild since the sources weren’t modified since the last build.
-
Some CLI options have more specialized flags, e.g. -O optimization level:
+
Some CLI options have more specialized flags, e.g. -O for the Optimization level of a build:
@@ -3801,7 +3866,7 @@ cd userland
-
-
put the host executables in a separate build-variant to avoid conflict with Buildroot builds.
+put the host executables in a separate build variant to avoid conflict with Buildroot builds.
-
ran with the --emulator native option to run the program natively
@@ -3817,7 +3882,7 @@ cd userland
@@ -4309,6 +4374,9 @@ continue
So get ready for some weird jumps, and <value optimized out> fun. Why, Linux, why.
+
@@ -6916,7 +6984,7 @@ cat f
which can be good for automated tests, as it ensures that you are using a pristine unmodified system image every time.
One downside of this method is that it has to put the entire filesystem into memory, and could lead to a panic:
@@ -7469,6 +7537,19 @@ sudo ./setup -y
+
+
@@ -8531,7 +8612,7 @@ Program aborted at tick 0
-
-
modules built with Buildroot, see: Section 33.14.2.1, “kernel_modules buildroot package”
+modules built with Buildroot, see: Section 33.15.2.1, “kernel_modules buildroot package”
-
modules built from the kernel tree itself, see: Section 15.12.2, “dummy-irq”
@@ -9438,7 +9519,7 @@ xeyes
-
We disable networking by default because it starts an userland process, and we want to keep the number of userland processes to a minimum to make the system more understandable as explained at: Section 33.19.3, “Resource tradeoff guidelines”
+
We disable networking by default because it starts a userland process, and we want to keep the number of userland processes to a minimum to make the system more understandable as explained at: Section 33.20.3, “Resource tradeoff guidelines”
To enable networking on Buildroot, simply run:
@@ -10287,15 +10368,15 @@ git log | grep -E ' Linux [0-9]+\.' | head
This also makes this repo the perfect setup to develop the Linux kernel.
-
In case something breaks while updating the Linux kernel, you can try to bisect it to understand the root cause, see: Section 33.16, “Bisection”.
+
In case something breaks while updating the Linux kernel, you can try to bisect it to understand the root cause, see: Section 33.17, “Bisection”.
-
Because the kernel is so central to this repository, almost all tests must be re-run, so basically just follow the full testing procedure described at: Section 33.15, “Test this repo”. The only tests that can be skipped are essentially the Baremetal tests.
+
Because the kernel is so central to this repository, almost all tests must be re-run, so basically just follow the full testing procedure described at: Section 33.16, “Test this repo”. The only tests that can be skipped are essentially the Baremetal tests.
Before committing, don’t forget to update:
@@ -12587,7 +12668,10 @@ echo $?
-
File operations are the main method of userland driver communication. struct file_operations determines what the kernel will do on filesystem system calls of Pseudo filesystems.
+
File operations are the main method of userland driver communication.
+
+
+
struct file_operations determines what the kernel will do on filesystem system calls of Pseudo filesystems.
This example illustrates the most basic system calls: open, read, write, close and lseek:
@@ -12741,15 +12825,7 @@ cd
-
The poll system call allows an user process to do a non-busy wait on a kernel event:
-
-
-
-
Outcome: jiffies gets printed to stdout every second from userland.
+
The poll system call allows a user process to do a non-busy wait on a kernel event.
Sources:
@@ -12765,6 +12841,70 @@ cd
+
+
+
Outcome: jiffies gets printed to stdout every second from userland, e.g.:
+
+
+
+
poll
+<6>[ 4.275305] poll
+<6>[ 4.275580] return POLLIN
+revents = 1
+POLLIN n=10 buf=4294893337
+poll
+<6>[ 4.276627] poll
+<6>[ 4.276911] return 0
+<6>[ 5.271193] wake_up
+<6>[ 5.272326] poll
+<6>[ 5.273207] return POLLIN
+revents = 1
+POLLIN n=10 buf=4294893588
+poll
+<6>[ 5.276367] poll
+<6>[ 5.276618] return 0
+<6>[ 6.275178] wake_up
+<6>[ 6.276370] poll
+<6>[ 6.277269] return POLLIN
+revents = 1
+POLLIN n=10 buf=4294893839
+
+
+
+
Force the poll file_operation to return 0 to see what happens more clearly:
+
+
+
+
+
+
poll
+<6>[ 85.674801] poll
+<6>[ 85.675788] return 0
+<6>[ 86.675182] wake_up
+<6>[ 86.676431] poll
+<6>[ 86.677373] return 0
+<6>[ 87.679198] wake_up
+<6>[ 87.680515] poll
+<6>[ 87.681564] return 0
+<6>[ 88.683198] wake_up
+
+
+
+
From this we see that control is not returned to userland: the kernel just keeps calling the poll file_operation again and again.
+
+
Typically, we are waiting for some hardware to make some piece of data available to the kernel.
@@ -12774,7 +12914,17 @@ cd
To simplify this example, we just fake the hardware interrupts with a kthread that sleeps for a second in an infinite loop.
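The same wait pattern can be tried entirely from userland. The following is a minimal sketch, not the kernel module from this section: it assumes a Linux host, replaces the kthread with a Python thread and the device file with a pipe, and the names `fake_interrupts` and `wait_for_events` are made up for illustration.

```python
import os
import select
import threading
import time


def fake_interrupts(write_fd, count, period_s):
    """Stand-in for the kthread: make one byte available every period_s seconds."""
    for i in range(count):
        time.sleep(period_s)
        os.write(write_fd, str(i).encode())


def wait_for_events(count=3, period_s=0.1):
    """Block in poll() until count single-byte events have arrived on a pipe."""
    read_fd, write_fd = os.pipe()
    producer = threading.Thread(
        target=fake_interrupts, args=(write_fd, count, period_s)
    )
    producer.start()
    poller = select.poll()
    poller.register(read_fd, select.POLLIN)
    received = []
    while len(received) < count:
        # poll() sleeps until the pipe is readable: no busy loop in userland.
        for fd, revents in poller.poll():
            if revents & select.POLLIN:
                # Read one byte per wake up; leftover bytes re-trigger POLLIN.
                received.append(os.read(fd, 1))
    producer.join()
    os.close(read_fd)
    os.close(write_fd)
    return received


if __name__ == "__main__":
    print(wait_for_events())  # [b'0', b'1', b'2']
```

The main thread spends its time sleeping inside `poll()` and only wakes up when the producer writes, which is the userland side of the behaviour shown in the logs above.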
+
@@ -16137,7 +16287,7 @@ ps
If you are familiar with VirtualBox, then QEMU basically does the same thing: it opens a "window" inside your desktop that can run an operating system inside your operating system.
The huge advantage of QEMU over VirtualBox is that it supports cross arch simulation, e.g. simulate an ARM guest on an x86 host.
@@ -16159,7 +16309,16 @@ ps
+
+
We disable disk persistency for both QEMU and gem5 by default, to prevent the emulator from putting the image in an unknown state.
@@ -16214,7 +16373,7 @@ ps
Disk persistency is useful to re-run shell commands from the history of a previous session with Ctrl-R, but we felt that the loss of determinism was not worth it.
-
+
TODO how to make gem5 disk writes persistent?
@@ -16244,7 +16403,7 @@ index 17498c42b..76b8b351d 100644
-
+
@@ -16253,7 +16412,7 @@ index 17498c42b..76b8b351d 100644
-
+
Snapshots are stored inside the .qcow2 images themselves.
@@ -16400,7 +16559,7 @@ Format specific information:
-
+
@@ -16445,12 +16604,12 @@ Format specific information:
-
+
-
+
PCI driver for our minimal pci_min.c QEMU fork device:
@@ -16520,7 +16679,7 @@ lkmc_pci_min mmio_write addr = 4 val = 0 size = 4
-
+
Small upstream educational PCI device:
@@ -16578,16 +16737,19 @@ lkmc_pci_min mmio_write addr = 4 val = 0 size = 4
-
+
In this section we will try to interact with PCI devices directly from userland without kernel modules.
@@ -16733,7 +16895,7 @@ devmem 0xfeb54000 w 0x12345678
-
+
There are two versions of setpci and lspci:
@@ -16749,7 +16911,7 @@ devmem 0xfeb54000 w 0x12345678
-
+
@@ -16809,7 +16971,7 @@ devmem 0xfeb54000 w 0x12345678
-
+
lspci -k shows something like:
@@ -16863,7 +17025,7 @@ devmem 0xfeb54000 w 0x12345678
-
+
@@ -16905,7 +17067,7 @@ pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &edu->mmio);
-
+
TODO: broken. Was working before we moved arm from -M versatilepb to -M virt around af210a76711b7fa4554dcc2abd0ddacfc810dfd4. Either make it work on -M virt if that is possible, or document precisely how to make it work with versatilepb, or hopefully vexpress which is newer.
@@ -16948,7 +17110,7 @@ pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &edu->mmio);
-
+
TODO: broken when arm moved to -M virt, same as GPIO.
@@ -17020,7 +17182,7 @@ echo 255 >brightness
-
+
Minimal platform device example coded into the -M versatilepb SoC of our QEMU fork.
@@ -17098,7 +17260,7 @@ insmod platform_device.ko
-
+
@@ -17108,7 +17270,7 @@ insmod platform_device.ko
-
+
@@ -17228,7 +17390,7 @@ insmod platform_device.ko
-
+
@@ -17245,7 +17407,7 @@ insmod platform_device.ko
-
+
When doing GDB step debug it is possible to send QEMU monitor commands through the GDB monitor command, which saves you the trouble of opening yet another shell.
@@ -17261,7 +17423,7 @@ monitor info qtree
-
+
When you start hacking QEMU or gem5, it is useful to see what is going on inside the emulators themselves.
@@ -17274,7 +17436,15 @@ monitor info qtree
-
Or for a faster development loop:
+
Or for a faster development loop you can pass -ex command as a semicolon separated list:
+
+
+
+
./run --debug-vm-ex 'break qemu_add_opts;run'
+
+
+
+
which is equivalent to the more verbose:
@@ -17282,6 +17452,9 @@ monitor info qtree
+
if you ever need anything besides -ex.
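To make the semicolon-separated form concrete, here is a minimal sketch of how such a string could be expanded into one -ex argument per command. It is only an illustration of the idea: `gdb_ex_args` is a hypothetical helper, not the code that ./run actually uses.

```python
def gdb_ex_args(commands):
    """Turn 'cmd1;cmd2' into ['-ex', 'cmd1', '-ex', 'cmd2']."""
    args = []
    for command in commands.split(";"):
        command = command.strip()
        if command:  # ignore empty entries such as a trailing ';'
            args.extend(["-ex", command])
    return args


print(gdb_ex_args("break qemu_add_opts;run"))
# ['-ex', 'break qemu_add_opts', '-ex', 'run']
```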
+
+
Or if things get really involved and you want a debug script:
@@ -17330,7 +17503,7 @@ run
You can still send key presses to QEMU however even without the mouse capture, just either click on the title bar, or alt tab to give it focus.
-
+
While step debugging any complex program, you always end up feeling the need to step in reverse to reach the last call to some function that was called before the failure point, in order to trace back the problem to the actual bug source.
@@ -17419,7 +17592,7 @@ reverse-next
-
+
Start pdb at the first instruction:
@@ -17453,7 +17626,7 @@ reverse-next
-
+
QEMU can log several different events.
@@ -17544,7 +17717,7 @@ Call Trace:
-
+
QEMU also has a second trace mechanism in addition to -trace, find out the events with:
@@ -17585,7 +17758,7 @@ IN:
-
+
@@ -17638,7 +17811,7 @@ of guest operations.
-
+
We can further use Binutils' addr2line to get the line that corresponds to each address:
@@ -17694,7 +17867,7 @@ less "$(./getvar --arch x86_64 run_dir)/trace-lines.txt"
-
+
QEMU runs, unlike gem5, are not deterministic by default, however it does support a record and replay mechanism that allows you to replay a previous run deterministically.
@@ -17801,7 +17974,7 @@ less "$(./getvar --arch x86_64 run_dir)/trace-lines.txt"
Solved on unmerged c42634d8e3428cfa60672c3ba89cabefc720cde9 from https://github.com/ispras/qemu/tree/rr-180725
-
+
@@ -17840,7 +18013,7 @@ reverse-continue
-
+
TODO: is there any way to distinguish which instruction runs on each core? Doing:
@@ -17855,13 +18028,13 @@ reverse-continue
-
+
@@ -17972,7 +18145,7 @@ less "$(./getvar --arch aarch64 run_dir)/trace-lines.txt"
TODO: 7452d399290c9c1fc6366cdad129ef442f323564 ./trace2line this is too slow and takes hours. QEMU’s processing of 170k events takes 7 seconds. gem5’s processing is analogous, but there are 140M events, so it should take 7000 seconds ~ 2 hours which seems consistent with what I observe, so maybe there is no way to speed this up… The workaround is to just use gem5’s ExecSymbol to get function granularity, and then GDB individually if line detail is needed?
-
+
gem5 traces are generated from DPRINTF(<trace-id> calls scattered throughout the code, except for ExecAll instruction traces, which use Debug::ExecEnable directly.
@@ -18009,7 +18182,7 @@ extern SimpleFlag ExecEnable;
-
+
This debug flag traces all instructions.
@@ -18064,7 +18237,7 @@ extern SimpleFlag ExecEnable;
@start_kernel: we are in the start_kernel function. Awesome feature! Implemented with libelf https://sourceforge.net/projects/elftoolchain/ copy pasted in-tree ext/libelf. To get raw addresses, remove the ExecSymbol, which is enabled by Exec. This can be done with Exec,-ExecSymbol.
-.1 as in @start_kernel.1: index of the microop
+.1 as in @start_kernel.1: index of the gem5 microops
stp: instruction disassembly. Note however that the disassembly of many instructions is very broken as of 2019q2, and you can’t just trust it blindly.
@@ -18092,7 +18265,7 @@ extern SimpleFlag ExecEnable;
-
+
@@ -18147,13 +18320,13 @@ add x1, x0, 2
-
+
As of gem5 16eeee5356585441a49d05c78abc328ef09f7ace the default tracer is ExeTracer. It is set at:
@@ -18226,7 +18399,7 @@ src/arch/x86/nativetrace.hh:41:class X86NativeTrace : public NativeTrace
-
Build the kernel with the gem5 arm Linux kernel patches, and then run:
+
With GICv3, tested at LKMC 224fae82e1a79d9551b941b19196c7e337663f22 gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772 on vanilla kernel:
+
+
+
+
./run \
+ --arch aarch64 \
+ --emulator gem5 \
+ --cpus 16 \
+ -- \
+ --machine-type VExpress_GEM5_V2 \
+;
+
+
+
+
boots to a shell and nproc shows 16.
+
+
@@ -19463,6 +19653,15 @@ Exiting @ tick 84500 because m5_exit instruction encountered
+
+
+
./run --arch aarch64 --emulator gem5 --baremetal userland/freestanding/gem5_checkpoint.S --trace-insts-stdout
+./run --arch aarch64 --emulator gem5 --userland userland/freestanding/gem5_checkpoint.S --trace-insts-stdout
+
+
+
@@ -19485,6 +19684,22 @@ Exiting @ tick 84500 because m5_exit instruction encountered
The -r N integer value is just pure fs.py sugar, the backend at m5.instantiate just takes the actual tracepoint directory path as input.
+
+
The file m5out/cpt.1000/m5.cpt contains almost everything in the checkpoint except memory.
+
+
+
+
+
[system.cpu.itb.walker.power_state]
+currState=0
+prvEvalTick=0
+
+
+
+
When a checkpoint is taken, each SimObject calls its overridden serialize method to generate the checkpoint, and when loading, unserialize is called.
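Since the excerpt above is INI-like, it can be inspected with a few lines of Python. This is a minimal sketch under the assumption that the part of m5.cpt you care about is accepted by the standard `configparser` module; `read_section` is a made-up helper name, and real checkpoints may contain constructs that need extra handling.

```python
import configparser

# The excerpt shown above, embedded so that the example is self-contained.
CHECKPOINT_EXCERPT = """\
[system.cpu.itb.walker.power_state]
currState=0
prvEvalTick=0
"""


def read_section(text, section):
    """Return the key/value pairs of one checkpoint section as a dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep camelCase keys instead of lowercasing them
    parser.read_string(text)
    return dict(parser[section])


state = read_section(CHECKPOINT_EXCERPT, "system.cpu.itb.walker.power_state")
print(state)  # {'currState': '0', 'prvEvalTick': '0'}
```

Each section name is the path of a SimObject, so this kind of lookup is a quick way to see what a given object serialized.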
+
@@ -19742,6 +19957,9 @@ cat "$(./getvar --arch aarch64 --emulator gem5 trace_txt_file)"
@@ -19872,6 +20090,41 @@ FullO3CPU: Ticking main, FullO3CPU.
+
+
+
+
The in-tree util/cpt_upgrader.py is a tool to upgrade checkpoints taken from an older version of gem5 to be compatible with the newest version, so you can update gem5 without having to re-run the simulation that generated the checkpoints.
+
+
+
+
Unfortunately, the process is not very automated (automatable?): the upgrader has to be manually patched every time a new breaking change is made, so as of 2020 it tends to break if you try to move many versions of gem5 ahead. This is evidenced in bug reports such as this one: https://gem5.atlassian.net/browse/GEM5-472
+
+
+
The script can be used as:
+
+
+
+
util/cpt_upgrader.py m5out/cpt.1000/m5.cpt
+
+
+
+
This updates the m5.cpt file in-place, and a m5out/cpt.1000/m5.cpt.bak is generated as a backup of the old file.
+
+
+
The upgrader determines which upgrades are needed by checking the version_tags entry of the checkpoint:
+
+
+
+
[Globals]
+version_tags=arm-ccregs arm-contextidr-el2 arm-gem5-gic-ext ...
+
+
+
+
Each of those tags corresponds to a Python file under util/cpt_upgraders/ e.g. util/cpt_upgraders/arm-ccregs.py.
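The tag to file correspondence can be written down as a minimal sketch. It only uses the three tags visible in the excerpt above (the elided ones are left out), and `upgrader_paths` is a hypothetical helper, not part of util/cpt_upgrader.py.

```python
import posixpath


def upgrader_paths(version_tags_value, upgraders_dir="util/cpt_upgraders"):
    """Map a space-separated version_tags value to one .py path per tag."""
    return [
        posixpath.join(upgraders_dir, tag + ".py")
        for tag in version_tags_value.split()
    ]


# Only the tags shown in the excerpt; the real entry is longer.
tags = "arm-ccregs arm-contextidr-el2 arm-gem5-gic-ext"
for path in upgrader_paths(tags):
    print(path)
# util/cpt_upgraders/arm-ccregs.py
# util/cpt_upgraders/arm-contextidr-el2.py
# util/cpt_upgraders/arm-gem5-gic-ext.py
```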
+
+
@@ -21161,6 +21414,9 @@ xdg-open "$(./getvar --arch arm --emulator gem5 m5out_dir)/config.dot.svg"
+
Sample run time: 87 minutes on P51 Ubuntu 20.04 gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1.
+
+
After the first run has downloaded the test binaries for you, you can speed up the process a little bit by skipping a useless SCons call:
@@ -21176,7 +21432,7 @@ xdg-open "$(./getvar --arch arm --emulator gem5 m5out_dir)/config.dot.svg"
-
./gem5-regression --arch aarch64 --cmd list
+
./gem5-regression --arch aarch64 --cmd list -- --length quick --length long
@@ -21333,13 +21589,13 @@ Exiting @ tick 18446744073709551615 because simulate() limit reached
If you build gem5 with scons build/ARM/gem5.debug, then that is a .debug build.
@@ -21533,7 +21789,7 @@ Indirect leak of 1346 byte(s) in 2 object(s) allocated from:
It is not possible to build more than one Ruby system into a single build, and this is a major pain point for testing Ruby: https://gem5.atlassian.net/browse/GEM5-467
@@ -21786,7 +22042,7 @@ class SystemXBar(CoherentXBar):
@@ -21853,13 +22109,13 @@ class SystemXBar(CoherentXBar):
-
Generic in-order core that does not model any specific CPU.
+
Generic in-order superscalar core.
Its C++ implementation can be parametrized to more closely match real cores.
-
Note that since gem5 is highly parametrizable, the parametrization could even change which instructions a CPU can execute by altering its available functional units, which are used to model performance.
+
Note that since gem5 is highly parametrizable, the parametrization could even change which instructions a CPU can execute by altering its available functional units, which are used to model performance.
For example, MinorCPU allows all implemented instructions, including ARM SVE instructions, but a derived class modelling, say, an ARM Cortex A7 core, might not, since SVE is a newer feature and the A7 core does not have SVE.
@@ -21917,17 +22173,40 @@ class SystemXBar(CoherentXBar):
-
+
+
Analogous to MinorCPU, but modelling an out of order core instead of in order.
+
+
+
[system.cpu]
+type=DerivO3CPU
+commitWidth=8
+decodeWidth=8
+dispatchWidth=8
+fetchWidth=8
+issueWidth=8
+renameWidth=8
+squashWidth=8
+wbWidth=8
+
+
+
+
Existing parametrizations:
@@ -21953,6 +22232,77 @@ class SystemXBar(CoherentXBar):
+
+
+
+
+-
+
fetch: besides obviously fetching the instruction, this is also where branch prediction runs. Presumably because you need to branch predict before deciding what to fetch next.
+
+-
+
retire: the instruction is completely and totally done with.
+
+
+
The ExecAll happens at this time as well, and therefore ExecAll does not happen for misspeculated instructions.
+
+
+
+
+
+
+
+
+
+
+
./run \
+ --arch aarch64 \
+ --emulator gem5 \
+ --userland userland/arch/aarch64/freestanding/linux/hello.S \
+ --trace O3PipeView \
+ --trace-stdout \
+ -- \
+ --cpu-type DerivO3CPU \
+ --caches \
+;
+"$(./getvar gem5_source_dir)/util/o3-pipeview.py" -c 500 -o o3pipeview.tmp.log --color "$(./getvar --arch aarch64 trace_txt_file)"
+less -R o3pipeview.tmp.log
+
+
+
+
+
+
"$(./getvar gem5_source_dir)/util/o3-pipeview.py" -c 500 -o o3pipeview.tmp.log "$(./getvar --arch aarch64 trace_txt_file)"
+less o3pipeview.tmp.log
+
+
+
+
+
+
+
+
+
+
+
Uses the same data source as util/o3-pipeview.py.
+
+
+
@@ -22660,7 +23010,7 @@ for source in PySource.all:
-
b Trace::OstreamLogger::logMessage()
+b Trace::OstreamLogger::logMessage
b EventManager::schedule
b EventFunctionWrapper::process
@@ -23777,7 +24127,7 @@ DRAMCtrl::Rank::startup(Tick ref_tick)
-
se we deduce that the vitual address 0x400078 maps to the physical address 0x78. But of course, let me log that for you byu adding --trace MMU:
+
so we deduce that the virtual address 0x400078 maps to the physical address 0x78. But of course, let me log that for you by adding --trace MMU:
@@ -23799,12 +24149,11 @@ DRAMCtrl::Rank::startup(Tick ref_tick)
Now that we are here, we might as well learn how to log the data that was fetched from DRAM.
-
Fist we determine the expected bytes from:
+
First we determine the expected bytes from the disassembly:
-
./run-toolchain --arch aarch64 objdump -- \
- -D "$(./getvar --arch aarch64 userland_build_dir)/arch/aarch64/freestanding/linux/hello.out"
+
./disas --arch aarch64 --userland userland/arch/aarch64/freestanding/linux/hello.S _start
@@ -23812,9 +24161,8 @@ DRAMCtrl::Rank::startup(Tick ref_tick)
-
0000000000400078 <_start>:
- 400078: d2800020 mov x0, #0x1 // #1
- 40007c: 100000e1 adr x1, 400098 <msg>
+
0x0000000000400078 <+0>: 20 00 80 d2 mov x0, #0x1 // #1
+ 0x000000000040007c <+4>: e1 00 00 10 adr x1, 0x400098 <msg>
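The byte columns above are in memory order, and aarch64 instructions are stored little endian, so 20 00 80 d2 is the 32-bit word 0xd2800020. As a minimal sketch, the following decodes just the fields relevant here from the standard A64 MOVZ and ADR encodings; the helper names are made up for illustration and this is not a general disassembler.

```python
def decode_word(hex_bytes):
    """Interpret bytes in memory order as a little endian 32-bit instruction."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "little")


def movz_fields(insn):
    """Extract the destination register and 16-bit immediate of a MOVZ."""
    return {"rd": insn & 0x1F, "imm16": (insn >> 5) & 0xFFFF}


def adr_target(insn, pc):
    """Compute the address produced by ADR: pc + sign_extend(immhi:immlo)."""
    immlo = (insn >> 29) & 0x3
    immhi = (insn >> 5) & 0x7FFFF
    offset = (immhi << 2) | immlo
    if offset & (1 << 20):  # the immediate is a 21-bit signed value
        offset -= 1 << 21
    return pc + offset


movz = decode_word("20 00 80 d2")
adr = decode_word("e1 00 00 10")
print(hex(movz), movz_fields(movz))      # 0xd2800020 {'rd': 0, 'imm16': 1}
print(hex(adr), hex(adr_target(adr, 0x40007C)))  # 0x100000e1 0x400098
```

These are the values to look for when comparing against the data fetched from DRAM.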
@@ -24599,6 +24947,26 @@ TimingSimpleCPU::IcachePort::ITickEvent::process
Contrast this with the non --cache version seen at TimingSimpleCPU analysis #5 in which DRAM only actually reads the 4 required bytes.
+
The only cryptic thing about the messages is the IF flag, but good computer architects would have guessed it correctly, and src/mem/packet.cc confirms:
+
+
+
+
void
+Packet::print(std::ostream &o, const int verbosity,
+ const std::string &prefix) const
+{
+ ccprintf(o, "%s%s [%x:%x]%s%s%s%s%s%s", prefix, cmdString(),
+ getAddr(), getAddr() + getSize() - 1,
+ req->isSecure() ? " (s)" : "",
+ req->isInstFetch() ? " IF" : "",
+ req->isUncacheable() ? " UC" : "",
+ isExpressSnoop() ? " ES" : "",
+ req->isToPOC() ? " PoC" : "",
+ req->isToPOU() ? " PoU" : "");
+}
+
+
+
Another interesting observation of running with --trace Cache,DRAM,XBar is that between the execution of both instructions, there is a Cache event, but no DRAM or XBar events:
-
@@ -24941,7 +25306,7 @@ non-atomic 19
--arch aarch64 \
--emulator gem5 \
--userland userland/arch/aarch64/freestanding/linux/hello.S \
- --trace Event \
+ --trace FmtFlag,Cache,Event,ExecAll,Minor \
--trace-stdout \
-- \
--cpu-type MinorCPU \
@@ -24949,11 +25314,898 @@ non-atomic 19
;
+
+
+
On LKMC ce3ea9faea95daf46dea80d4236a30a0891c3ca5 gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1 we see the following.
+
+
+
+
+
500: Cache: system.cpu.icache: access for ReadReq [40:7f] IF miss
+
+
+
+
The memory request comes back later on at:
+
+
+
+
77000: Cache: system.cpu.icache: recvTimingResp: Handling response ReadResp [40:7f] IF
+
+
+
+
and soon after the CPU also ifetches across the barrier:
+
+
+
+
79000: Cache: system.cpu.icache: access for ReadReq [80:bf] IF miss
+
+
+
+
TODO why? We have 0x78 and 0x7c, and those should be it since we are dual issue, right? Is this prefetching at work?
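To see which cache line each fetch address falls into, here is a minimal sketch of the alignment arithmetic. The 64 byte line size is an assumption inferred from the width of the [40:7f] range in the logs, and `cache_line_range` is a made-up helper name.

```python
LINE_SIZE = 64  # assumption: inferred from the [40:7f] range printed by the cache


def cache_line_range(addr, line_size=LINE_SIZE):
    """Return the first and last byte address of the line containing addr."""
    base = addr & ~(line_size - 1)
    return base, base + line_size - 1


for addr in (0x78, 0x7C, 0x80):
    first, last = cache_line_range(addr)
    print(f"{addr:#x} -> [{first:x}:{last:x}]")
# 0x78 -> [40:7f]
# 0x7c -> [40:7f]
# 0x80 -> [80:bf]
```

So the instructions at physical addresses 0x78 and 0x7c are the last two in the first line, and the very next instruction at 0x80 already lives in the [80:bf] line.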
+
+
+
Later on we see the first instruction, our MOVZ, was decoded:
+
+
+
+
80000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/1/1.1 pc: 0x400078 (movz) to FU: 0
+
+
+
+
and that issue succeeds, because the functional unit 0 (FU 0) is an IntAlu as shown at gem5 functional units:
+
+
+
+
80000: MinorExecute: system.cpu.execute: Issuing inst: 0/1.1/1/1.1 pc: 0x400078 (movz) into FU 0
+
+
+
+
At the very same tick, the second instruction is also decoded, our ADR:
+
+
+
+
80000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/1/2.2 pc: 0x40007c (adr) to FU: 0
+ 80000: MinorExecute: system.cpu.execute: Can't issue as FU: 0 is already busy
+ 80000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/1/2.2 pc: 0x40007c (adr) to FU: 1
+ 80000: MinorExecute: system.cpu.execute: Issuing inst: 0/1.1/1/2.2 pc: 0x40007c (adr) into FU 1
+
+
+
+
This is also an IntAlu instruction, and it can’t run on FU 0 because the first instruction is already running there. But to our luck, FU 1 is also an IntAlu unit, and so it runs there.
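The issue behaviour seen in those log lines can be modelled as a first-fit search over the functional units. This is a minimal sketch of that idea only, not gem5's actual MinorCPU code: the FU table contains just the two IntAlu units mentioned above, and `issue` is a hypothetical function.

```python
# Only the two units discussed above; the real CPU has more functional units.
FUS = [{"IntAlu"}, {"IntAlu"}]


def issue(op_class, fus, busy):
    """Try each FU in order; return (chosen FU index or None, log lines)."""
    log = []
    for index, capabilities in enumerate(fus):
        if index in busy:
            log.append(f"FU {index} is already busy")
        elif op_class not in capabilities:
            log.append(f"FU {index} isn't capable")
        else:
            busy.add(index)
            log.append(f"issued into FU {index}")
            return index, log
    return None, log


busy = set()
print(issue("IntAlu", FUS, busy))  # (0, ['issued into FU 0'])
print(issue("IntAlu", FUS, busy))  # (1, ['FU 0 is already busy', 'issued into FU 1'])
```

A third IntAlu operation in the same cycle would find both units busy and return None, which corresponds to the instruction having to wait.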
+
+
+
Crap, those Minor logs should say what OpClass each instruction is, that would make things clearer.
+
+
+
TODO what is that 0/1.1/1/1.1 notation that shows up everywhere? Must be important, let’s look at the source.
+
+
+
Soon after (1500 ticks later, which would be 3 cycles if the clock period is 500 ticks, so guessing due to opLat=3?), the execution appears to be over already since we see the ExecAll come through, which generally happens at the very end:
+
+
+
+
81500: MinorExecute: system.cpu.execute: Attempting to commit [tid:0]
+ 81500: MinorExecute: system.cpu.execute: Committing micro-ops for interrupt[tid:0]
+ 81500: MinorExecute: system.cpu.execute: Trying to commit canCommitInsts: 1
+ 81500: MinorExecute: system.cpu.execute: Trying to commit from FUs
+ 81500: MinorExecute: global: ExecContext setting PC: (0x400078=>0x40007c).(0=>1)
+ 81500: MinorExecute: system.cpu.execute: Committing inst: 0/1.1/1/1.1 pc: 0x400078 (movz)
+ 81500: MinorExecute: system.cpu.execute: Unstalling 0 for inst 0/1.1/1/1.1
+ 81500: MinorExecute: system.cpu.execute: Completed inst: 0/1.1/1/1.1 pc: 0x400078 (movz)
+ 81500: MinorScoreboard: system.cpu.execute.scoreboard0: Clearing inst: 0/1.1/1/1.1 pc: 0x400078 (movz) regIndex: 0 final numResults: 0
+ 81500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 FetchSeq=1 CPSeq=1 flags=(IsInteger)
+ 81500: MinorExecute: system.cpu.execute: Trying to commit canCommitInsts: 1
+ 81500: MinorExecute: system.cpu.execute: Trying to commit from FUs
+ 81500: MinorExecute: global: ExecContext setting PC: (0x40007c=>0x400080).(0=>1)
+ 81500: MinorExecute: system.cpu.execute: Committing inst: 0/1.1/1/2.2 pc: 0x40007c (adr)
+ 81500: MinorExecute: system.cpu.execute: Unstalling 1 for inst 0/1.1/1/2.2
+ 81500: MinorExecute: system.cpu.execute: Completed inst: 0/1.1/1/2.2 pc: 0x40007c (adr)
+ 81500: MinorScoreboard: system.cpu.execute.scoreboard0: Clearing inst: 0/1.1/1/2.2 pc: 0x40007c (adr) regIndex: 1 final numResults: 0
+ 81500: MinorExecute: system.cpu.execute: Reached inst commit limit
+ 81500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4 : adr x1, #28 : IntAlu : D=0x0000000000400098 FetchSeq=2 CPSeq=2 flags=(IsInteger)
+
+
+
+
The ifetch for the third instruction returns at:
+
+
+
+
129000: Cache: system.cpu.icache: recvTimingResp: Handling response ReadResp [80:bf] IF
+
+
+
+
so now we are ready to run the third and fourth instructions of the program:
+
+
+
,…
+ ldr x2, =len
+ mov x8, 64
+,…
+
+
+
The LDR goes all the way down to FU 6 which is the memory one:
+
+
+
+
132000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) to FU: 0
+ 132000: MinorExecute: system.cpu.execute: Can't issue as FU: 0 isn't capable
+ 132000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) to FU: 1
+ 132000: MinorExecute: system.cpu.execute: Can't issue as FU: 1 isn't capable
+ 132000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) to FU: 2
+ 132000: MinorExecute: system.cpu.execute: Can't issue as FU: 2 isn't capable
+ 132000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) to FU: 3
+ 132000: MinorExecute: system.cpu.execute: Can't issue as FU: 3 isn't capable
+ 132000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) to FU: 4
+ 132000: MinorExecute: system.cpu.execute: Can't issue as FU: 4 isn't capable
+ 132000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) to FU: 5
+ 132000: MinorExecute: system.cpu.execute: Can't issue as FU: 5 isn't capable
+ 132000: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) to FU: 6
+ 132000: MinorExecute: system.cpu.execute: Issuing inst: 0/1.1/2/3.3 pc: 0x400080 (ldr) into FU 6
+
+
+
+
and then the MOV issue follows soon afterwards (TODO why not at the same time like for the previous pair?):
+
+
+
+
132500: MinorExecute: system.cpu.execute: Trying to issue inst: 0/1.1/2/4.4 pc: 0x400084 (movz) to FU: 0
+ 132500: MinorExecute: system.cpu.execute: Issuing inst: 0/1.1/2/4.4 pc: 0x400084 (movz) into FU 0
+
+
+
+
+
+
+
./run \
+ --arch aarch64 \
+ --emulator gem5 \
+ --userland userland/arch/aarch64/freestanding/linux/hello.S \
+ --trace FmtFlag,Cache,Event,ExecAll,O3CPUAll \
+ --trace-stdout \
+ -- \
+ --cpu-type DerivO3CPU \
+ --caches \
+;
+
+
+
+
The output is huge and contains about 7 thousand lines!!!
+
+
+
This section and children are tested at LKMC 144a552cf926ea630ef9eadbb22b79fe2468c456.
+
+
+
+
+
+
+
+
// f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
+
+ timeline tick pc.upc disasm seq_num
+[.ic.r........................................................................fdn]-( 40000) 0x00400078.0 movz x0, #0, #0 [ 1]
+[.ic.r........................................................................fdn]-( 40000) 0x0040007c.0 movz x1, #1, #0 [ 2]
+[....................fdn.ic.r....................................................]-( 120000) 0x00400080.0 movz x2, #2, #0 [ 3]
+[....................fdn.ic.r....................................................]-( 120000) 0x00400084.0 movz x3, #3, #0 [ 4]
+[....................fdn.ic.r....................................................]-( 120000) 0x00400088.0 movz x4, #4, #0 [ 5]
+[....................fdn.ic.r....................................................]-( 120000) 0x0040008c.0 movz x5, #5, #0 [ 6]
+[....................fdn.ic.r....................................................]-( 120000) 0x00400090.0 movz x6, #6, #0 [ 7]
+[....................fdn.ic.r....................................................]-( 120000) 0x00400094.0 movz x7, #7, #0 [ 8]
+[....................fdn.pic.r...................................................]-( 120000) 0x00400098.0 movz x8, #8, #0 [ 9]
+[....................fdn.pic.r...................................................]-( 120000) 0x0040009c.0 movz x9, #9, #0 [ 10]
+[.....................fdn.ic.r...................................................]-( 120000) 0x004000a0.0 movz x10, #10, #0 [ 11]
+[.....................fdn.ic.r...................................................]-( 120000) 0x004000a4.0 movz x11, #11, #0 [ 12]
+[.....................fdn.ic.r...................................................]-( 120000) 0x004000a8.0 movz x12, #12, #0 [ 13]
+[.....................fdn.ic.r...................................................]-( 120000) 0x004000ac.0 movz x13, #13, #0 [ 14]
+[.....................fdn.pic.r..................................................]-( 120000) 0x004000b0.0 movz x14, #14, #0 [ 15]
+[.....................fdn.pic.r..................................................]-( 120000) 0x004000b4.0 movz x15, #15, #0 [ 16]
+[.....................fdn.pic.r..................................................]-( 120000) 0x004000b8.0 movz x16, #16, #0 [ 17]
+[.....................fdn.pic.r..................................................]-( 120000) 0x004000bc.0 movz x17, #17, #0 [ 18]
+[............................................fdn.ic.r............................]-( 160000) 0x004000c0.0 movz x18, #18, #0 [ 19]
+[............................................fdn.ic.r............................]-( 160000) 0x004000c4.0 movz x19, #19, #0 [ 20]
+[............................................fdn.ic.r............................]-( 160000) 0x004000c8.0 movz x20, #20, #0 [ 21]
+[............................................fdn.ic.r............................]-( 160000) 0x004000cc.0 movz x21, #21, #0 [ 22]
+[............................................fdn.ic.r............................]-( 160000) 0x004000d0.0 movz x22, #22, #0 [ 23]
+[............................................fdn.ic.r............................]-( 160000) 0x004000d4.0 movz x23, #23, #0 [ 24]
+[............................................fdn.pic.r...........................]-( 160000) 0x004000d8.0 movz x24, #24, #0 [ 25]
+[............................................fdn.pic.r...........................]-( 160000) 0x004000dc.0 movz x25, #25, #0 [ 26]
+[.............................................fdn.ic.r...........................]-( 160000) 0x004000e0.0 movz x26, #26, #0 [ 27]
+[.............................................fdn.ic.r...........................]-( 160000) 0x004000e4.0 movz x27, #27, #0 [ 28]
+[.............................................fdn.ic.r...........................]-( 160000) 0x004000e8.0 movz x28, #28, #0 [ 29]
+[.............................................fdn.ic.r...........................]-( 160000) 0x004000ec.0 movz x29, #29, #0 [ 30]
+[.............................................fdn.pic.r..........................]-( 160000) 0x004000f0.0 movz x0, #0, #0 [ 31]
+[.............................................fdn.pic.r..........................]-( 160000) 0x004000f4.0 movz x1, #1, #0 [ 32]
+[.............................................fdn.pic.r..........................]-( 160000) 0x004000f8.0 movz x2, #2, #0 [ 33]
+[.............................................fdn.pic.r..........................]-( 160000) 0x004000fc.0 movz x3, #3, #0 [ 34]
+
+
+
+
The first group has only two instructions because the first instruction is at address 0x400078, so only two instructions fit on that cache line, as the next cache line starts at 0x400080.
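The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration, assuming the 64-byte cache line mentioned in this section and fixed 4-byte aarch64 instructions. The function name is made up for the example and is not part of gem5 or LKMC.

```python
CACHE_LINE_BYTES = 64  # cache line size mentioned in this section
INSN_BYTES = 4         # aarch64 instructions have a fixed 4-byte size

def insns_left_in_line(pc, line_bytes=CACHE_LINE_BYTES, insn_bytes=INSN_BYTES):
    """Number of instructions from pc up to the end of its cache line."""
    next_line = (pc // line_bytes + 1) * line_bytes
    return (next_line - pc) // insn_bytes

print(insns_left_in_line(0x400078))  # 2: the next line starts at 0x400080
print(insns_left_in_line(0x400080))  # 16: a full 64-byte line
```

The same helper gives 4 for an entry point of 0x4000b0, since the next line then starts at 0x4000c0.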
+
+
+
The initial fdn on the first two lines looks bugged: did it wrap around? The rest makes sense.
+
+
+
From this, we clearly see that up to 8 instructions can be issued concurrently, which matches the default width values we had seen at gem5 DerivO3CPU.
+
+
+
For example, we can clearly see how:
+
+
+
+-
+
movz x2 through movz x9 start running at the exact same time. TODO why does movz x7 do fdn.ic.r while movz x8 does fdn.pic.r? How are they different?
+
+-
+
movz x10 through movz x17 then start running one step later. This second chunk is fully pipelined with the first instruction pack
+
+-
+
then comes a pause while the next fetch comes back. This group of 16 instructions took up the entire 64-byte cacheline that had been read
+
+
+
+
+
First we can have a look at ExecEnable to get an initial idea of how many instructions are run at one time:
+
+
+
+
78500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 FetchSeq=1 CPSeq=1 flags=(IsInteger)
+ 78500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+4 : movz x1, #1, #0 : IntAlu : D=0x0000000000000001 FetchSeq=2 CPSeq=2 flags=(IsInteger)
+
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8 : movz x2, #2, #0 : IntAlu : D=0x0000000000000002 FetchSeq=3 CPSeq=3 flags=(IsInteger)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+12 : movz x3, #3, #0 : IntAlu : D=0x0000000000000003 FetchSeq=4 CPSeq=4 flags=(IsInteger)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+16 : movz x4, #4, #0 : IntAlu : D=0x0000000000000004 FetchSeq=5 CPSeq=5 flags=(IsInteger)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+20 : movz x5, #5, #0 : IntAlu : D=0x0000000000000005 FetchSeq=6 CPSeq=6 flags=(IsInteger)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+24 : movz x6, #6, #0 : IntAlu : D=0x0000000000000006 FetchSeq=7 CPSeq=7 flags=(IsInteger)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+28 : movz x7, #7, #0 : IntAlu : D=0x0000000000000007 FetchSeq=8 CPSeq=8 flags=(IsInteger)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+32 : movz x8, #8, #0 : IntAlu : D=0x0000000000000008 FetchSeq=9 CPSeq=9 flags=(IsInteger)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+36 : movz x9, #9, #0 : IntAlu : D=0x0000000000000009 FetchSeq=10 CPSeq=10 flags=(IsInteger)
+
+ 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+40 : movz x10, #10, #0 : IntAlu : D=0x000000000000000a FetchSeq=11 CPSeq=11 flags=(IsInteger)
+ 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+44 : movz x11, #11, #0 : IntAlu : D=0x000000000000000b FetchSeq=12 CPSeq=12 flags=(IsInteger)
+ 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+48 : movz x12, #12, #0 : IntAlu : D=0x000000000000000c FetchSeq=13 CPSeq=13 flags=(IsInteger)
+ 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+52 : movz x13, #13, #0 : IntAlu : D=0x000000000000000d FetchSeq=14 CPSeq=14 flags=(IsInteger)
+ 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+56 : movz x14, #14, #0 : IntAlu : D=0x000000000000000e FetchSeq=15 CPSeq=15 flags=(IsInteger)
+ 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+60 : movz x15, #15, #0 : IntAlu : D=0x000000000000000f FetchSeq=16 CPSeq=16 flags=(IsInteger)
+ 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+64 : movz x16, #16, #0 : IntAlu : D=0x0000000000000010 FetchSeq=17 CPSeq=17 flags=(IsInteger)
+ 130500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+68 : movz x17, #17, #0 : IntAlu : D=0x0000000000000011 FetchSeq=18 CPSeq=18 flags=(IsInteger)
+
+
+
+
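This suggests 8, but remember that the tick shown on each ExecEnable line does not necessarily coincide with the commit time: in the logs of this section, movz x2 is labelled 130000, the tick at which it was fetched, while it only commits at 133500. As we saw in the pipeline viewer above, instructions 9 and 10 have one extra stage.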
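To count ExecEnable lines per tick without reading the log by eye, a small script along these lines could be used. It is a minimal sketch that assumes each trace line starts with the tick followed by ": ExecEnable: ", as in the excerpt above. The function name and the shortened sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the leading tick of an ExecEnable trace line.
EXEC_RE = re.compile(r"^\s*(\d+): ExecEnable: ")

def exec_lines_per_tick(lines):
    """Map each tick to the number of ExecEnable lines labelled with it."""
    counts = Counter()
    for line in lines:
        match = EXEC_RE.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return dict(sorted(counts.items()))

# Shortened sample lines, not a full trace.
sample = [
    "  78500: ExecEnable: system.cpu: A0 T0 : movz x0, #0, #0",
    "  78500: ExecEnable: system.cpu: A0 T0 : movz x1, #1, #0",
    " 130000: ExecEnable: system.cpu: A0 T0 : movz x2, #2, #0",
    " 130000: ExecEnable: system.cpu: A0 T0 : movz x3, #3, #0",
    " 130500: ExecEnable: system.cpu: A0 T0 : movz x10, #10, #0",
    " 133500: Commit: system.cpu.commit: Trying to commit instructions in the ROB.",
]
print(exec_lines_per_tick(sample))  # {78500: 2, 130000: 2, 130500: 1}
```

Lines from other debug flags, such as the Commit line in the sample, are simply ignored.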
+
+
+
After the initial two execs from the first cache line, the full commit log chunk around the first group of six `ExecEnable`s looks like:
+
+
+
+
133500: Commit: system.cpu.commit: Getting instructions from Rename stage.
+ 133500: Commit: system.cpu.commit: Trying to commit instructions in the ROB.
+
+ 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:3]
+ 133500: Commit: system.cpu.commit: [tid:0] [sn:3] Committing instruction with PC (0x400080=>0x400084).(0=>1)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+8 : movz x2, #2, #0 : IntAlu : D=0x0000000000000002 FetchSeq=3 CPSeq=3 flags=(IsInteger)
+ 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400080=>0x400084).(0=>1), [sn:3]
+ 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400080=>0x400084).(0=>1) [sn:3]
+
+ 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:4]
+ 133500: Commit: system.cpu.commit: [tid:0] [sn:4] Committing instruction with PC (0x400084=>0x400088).(0=>1)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+12 : movz x3, #3, #0 : IntAlu : D=0x0000000000000003 FetchSeq=4 CPSeq=4 flags=(IsInteger)
+ 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400084=>0x400088).(0=>1), [sn:4]
+ 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400084=>0x400088).(0=>1) [sn:4]
+
+ 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:5]
+ 133500: Commit: system.cpu.commit: [tid:0] [sn:5] Committing instruction with PC (0x400088=>0x40008c).(0=>1)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+16 : movz x4, #4, #0 : IntAlu : D=0x0000000000000004 FetchSeq=5 CPSeq=5 flags=(IsInteger)
+ 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400088=>0x40008c).(0=>1), [sn:5]
+ 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400088=>0x40008c).(0=>1) [sn:5]
+
+ 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:6]
+ 133500: Commit: system.cpu.commit: [tid:0] [sn:6] Committing instruction with PC (0x40008c=>0x400090).(0=>1)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+20 : movz x5, #5, #0 : IntAlu : D=0x0000000000000005 FetchSeq=6 CPSeq=6 flags=(IsInteger)
+ 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x40008c=>0x400090).(0=>1), [sn:6]
+ 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x40008c=>0x400090).(0=>1) [sn:6]
+
+ 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:7]
+ 133500: Commit: system.cpu.commit: [tid:0] [sn:7] Committing instruction with PC (0x400090=>0x400094).(0=>1)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+24 : movz x6, #6, #0 : IntAlu : D=0x0000000000000006 FetchSeq=7 CPSeq=7 flags=(IsInteger)
+ 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400090=>0x400094).(0=>1), [sn:7]
+ 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400090=>0x400094).(0=>1) [sn:7]
+
+ 133500: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:8]
+ 133500: Commit: system.cpu.commit: [tid:0] [sn:8] Committing instruction with PC (0x400094=>0x400098).(0=>1)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+28 : movz x7, #7, #0 : IntAlu : D=0x0000000000000007 FetchSeq=8 CPSeq=8 flags=(IsInteger)
+ 133500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x400094=>0x400098).(0=>1), [sn:8]
+ 133500: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x400094=>0x400098).(0=>1) [sn:8]
+
+ 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x400098=>0x40009c).(0=>1), [sn:9] ready within ROB.
+ 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x40009c=>0x4000a0).(0=>1), [sn:10] ready within ROB.
+ 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000a0=>0x4000a4).(0=>1), [sn:11] ready within ROB.
+ 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000a4=>0x4000a8).(0=>1), [sn:12] ready within ROB.
+ 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000a8=>0x4000ac).(0=>1), [sn:13] ready within ROB.
+ 133500: Commit: system.cpu.commit: [tid:0] Marking PC (0x4000ac=>0x4000b0).(0=>1), [sn:14] ready within ROB.
+ 133500: Commit: system.cpu.commit: [tid:0] Instruction [sn:9] PC (0x400098=>0x40009c).(0=>1) is head of ROB and ready to commit
+ 133500: Commit: system.cpu.commit: [tid:0] ROB has 10 insts & 182 free entries.
+
+
+
+
+
0x400080⇒0x400084 is the old/new PC pair of the first committed instruction.
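As a rough illustration, the sequence number and the PC pair can be pulled out of such a Commit line with a regular expression. The pattern below is only an assumption based on the log lines shown above, and parse_commit is a hypothetical helper name.

```python
import re

# Pattern inferred from the "Committing instruction with PC" lines above.
COMMIT_RE = re.compile(
    r"\[sn:(\d+)\] Committing instruction with PC "
    r"\((0x[0-9a-fA-F]+)=>(0x[0-9a-fA-F]+)\)"
)

def parse_commit(line):
    """Return (sequence number, old PC, new PC), or None if no match."""
    match = COMMIT_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2), 16), int(match.group(3), 16)

line = ("133500: Commit: system.cpu.commit: [tid:0] [sn:3] "
        "Committing instruction with PC (0x400080=>0x400084).(0=>1)")
sn, old_pc, new_pc = parse_commit(line)
print(sn, hex(old_pc), hex(new_pc), new_pc - old_pc)  # 3 0x400080 0x400084 4
```

The difference of 4 is the size of one aarch64 instruction, as expected for a non-branching movz.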
+
+
+
Another thing we can do is to try to follow one of the instructions as it goes through the pipeline. Searching for example for the address 0x400080, we find:
+
+
+
The first mention of the address happens when the fetch of the two initial instructions completes. TODO not sure why it doesn’t just also fetch the next cache line at the same time:
+
+
+
+
FullO3CPU: Ticking main, FullO3CPU.
+ 78500: Fetch: system.cpu.fetch: Running stage.
+ 78500: Fetch: system.cpu.fetch: Attempting to fetch from [tid:0]
+ 78500: Fetch: system.cpu.fetch: [tid:0] Icache miss is complete.
+ 78500: Fetch: system.cpu.fetch: [tid:0] Adding instructions to queue to decode.
+ 78500: DynInst: global: DynInst: [sn:1] Instruction created. Instcount for system.cpu = 1
+ 78500: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x400078 (0) created [sn:1].
+ 78500: Fetch: system.cpu.fetch: [tid:0] Instruction is: movz x0, #0, #0
+ 78500: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (1/32).
+ 78500: DynInst: global: DynInst: [sn:2] Instruction created. Instcount for system.cpu = 2
+ 78500: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x40007c (0) created [sn:2].
+ 78500: Fetch: system.cpu.fetch: [tid:0] Instruction is: movz x1, #1, #0
+ 78500: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (2/32).
+ 78500: Fetch: system.cpu.fetch: [tid:0] Issuing a pipelined I-cache access, starting at PC (0x400080=>0x400084).(0=>1).
+ 78500: Fetch: system.cpu.fetch: [tid:0] Fetching cache line 0x400080 for addr 0x400080
+
+
+
+
so we observe that the first two instructions arrived, and the CPU noticed that 0x400080 hasn’t been fetched yet.
+
+
+
Then for several cycles that follow, the fetch stage just says that it is blocked on data returning, e.g.:
+
+
+
+
FullO3CPU: Ticking main, FullO3CPU.
+ 79000: Fetch: system.cpu.fetch: Running stage.
+ 79000: Fetch: system.cpu.fetch: There are no more threads available to fetch from.
+ 79000: Fetch: system.cpu.fetch: [tid:0] Fetch is waiting cache response!
+
+
+
+
At the same time, the execution of the initial 2 instructions progresses through the pipeline.
+
+
+
These progress up until:
+
+
+
+
88000: O3CPU: system.cpu: Idle!
+
+
+
+
at which point there are no more events scheduled besides waiting for the second cache line to come back.
+
+
+
After this, some time passes without events, and the next tick happens when the fetch data returns:
+
+
+
+
FullO3CPU: Ticking main, FullO3CPU.
+ 130000: Fetch: system.cpu.fetch: Running stage.
+ 130000: Fetch: system.cpu.fetch: Attempting to fetch from [tid:0]
+ 130000: Fetch: system.cpu.fetch: [tid:0] Icache miss is complete.
+ 130000: Fetch: system.cpu.fetch: [tid:0] Adding instructions to queue to decode.
+ 130000: DynInst: global: DynInst: [sn:3] Instruction created. Instcount for system.cpu = 1
+ 130000: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x400080 (0) created [sn:3].
+ 130000: Fetch: system.cpu.fetch: [tid:0] Instruction is: movz x2, #2, #0
+ 130000: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (1/32).
+ 130000: DynInst: global: DynInst: [sn:4] Instruction created. Instcount for system.cpu = 2
+ 130000: Fetch: system.cpu.fetch: [tid:0] Instruction PC 0x400084 (0) created [sn:4].
+ 130000: Fetch: system.cpu.fetch: [tid:0] Instruction is: movz x3, #3, #0
+ 130000: Fetch: system.cpu.fetch: [tid:0] Fetch queue entry created (2/32).
+ 130000: DynInst: global: DynInst: [sn:5] Instruction created. Instcount for system.cpu = 3
+
+
+
+
+
+
+
+
+
// f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
+
+ timeline tick pc.upc disasm seq_num
+[.ic.r........................................................................fdn]-( 40000) 0x00400078.0 movz x0, #0, #0 [ 1]
+[.ic.r........................................................................fdn]-( 40000) 0x0040007c.0 movz x1, #1, #0 [ 2]
+[....................fdn.ic.r....................................................]-( 120000) 0x00400080.0 movz x2, #2, #0 [ 3]
+[....................fdn.pic.r...................................................]-( 120000) 0x00400084.0 add x3, x2, #1 [ 4]
+[....................fdn.ic..r...................................................]-( 120000) 0x00400088.0 movz x4, #4, #0 [ 5]
+[....................fdn.ic..r...................................................]-( 120000) 0x0040008c.0 movz x5, #5, #0 [ 6]
+[....................fdn.ic..r...................................................]-( 120000) 0x00400090.0 movz x6, #6, #0 [ 7]
+[....................fdn.ic..r...................................................]-( 120000) 0x00400094.0 movz x7, #7, #0 [ 8]
+[....................fdn.ic..r...................................................]-( 120000) 0x00400098.0 movz x8, #8, #0 [ 9]
+[....................fdn.pic.r...................................................]-( 120000) 0x0040009c.0 movz x9, #9, #0 [ 10]
+[.....................fdn.ic.r...................................................]-( 120000) 0x004000a0.0 movz x10, #10, #0 [ 11]
+[.....................fdn.ic..r..................................................]-( 120000) 0x004000a4.0 movz x11, #11, #0 [ 12]
+[.....................fdn.ic..r..................................................]-( 120000) 0x004000a8.0 movz x12, #12, #0 [ 13]
+[.....................fdn.ic..r..................................................]-( 120000) 0x004000ac.0 movz x13, #13, #0 [ 14]
+[.....................fdn.pic.r..................................................]-( 120000) 0x004000b0.0 movz x14, #14, #0 [ 15]
+[.....................fdn.pic.r..................................................]-( 120000) 0x004000b4.0 movz x15, #15, #0 [ 16]
+[.....................fdn.pic.r..................................................]-( 120000) 0x004000b8.0 movz x16, #16, #0 [ 17]
+[.....................fdn.pic.r..................................................]-( 120000) 0x004000bc.0 movz x17, #17, #0 [ 18]
+[............................................fdn.ic.r............................]-( 160000) 0x004000c0.0 movz x18, #18, #0 [ 19]
+[............................................fdn.ic.r............................]-( 160000) 0x004000c4.0 movz x19, #19, #0 [ 20]
+[............................................fdn.ic.r............................]-( 160000) 0x004000c8.0 movz x20, #20, #0 [ 21]
+[............................................fdn.ic.r............................]-( 160000) 0x004000cc.0 movz x21, #21, #0 [ 22]
+[............................................fdn.ic.r............................]-( 160000) 0x004000d0.0 movz x22, #22, #0 [ 23]
+[............................................fdn.ic.r............................]-( 160000) 0x004000d4.0 movz x23, #23, #0 [ 24]
+[............................................fdn.pic.r...........................]-( 160000) 0x004000d8.0 movz x24, #24, #0 [ 25]
+[............................................fdn.pic.r...........................]-( 160000) 0x004000dc.0 movz x25, #25, #0 [ 26]
+[.............................................fdn.ic.r...........................]-( 160000) 0x004000e0.0 movz x0, #0, #0 [ 27]
+[.............................................fdn.ic.r...........................]-( 160000) 0x004000e4.0 movz x8, #93, #0 [ 28]
+
+
+
+
TODO understand how the hazard happens in detail.
+
+
+
+
+
+
+
+
// f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
+
+ timeline tick pc.upc disasm seq_num
+[.ic.r........................................................................fdn]-( 40000) 0x00400078.0 movz x0, #0, #0 [ 1]
+[.ic.r........................................................................fdn]-( 40000) 0x0040007c.0 movz x1, #1, #0 [ 2]
+[....................fdn.ic.r....................................................]-( 120000) 0x00400080.0 movz x2, #2, #0 [ 3]
+[....................fdn.pic.r...................................................]-( 120000) 0x00400084.0 add x3, x2, #1 [ 4]
+[....................fdn.p.ic.r..................................................]-( 120000) 0x00400088.0 add x4, x3, #1 [ 5]
+[....................fdn.p..ic.r.................................................]-( 120000) 0x0040008c.0 add x5, x4, #1 [ 6]
+[....................fdn.p...ic.r................................................]-( 120000) 0x00400090.0 add x6, x5, #1 [ 7]
+[....................fdn.ic.....r................................................]-( 120000) 0x00400094.0 movz x7, #7, #0 [ 8]
+[....................fdn.ic.....r................................................]-( 120000) 0x00400098.0 movz x8, #8, #0 [ 9]
+[....................fdn.ic.....r................................................]-( 120000) 0x0040009c.0 movz x9, #9, #0 [ 10]
+[.....................fdn.ic....r................................................]-( 120000) 0x004000a0.0 movz x10, #10, #0 [ 11]
+[.....................fdn.ic....r................................................]-( 120000) 0x004000a4.0 movz x11, #11, #0 [ 12]
+[.....................fdn.ic....r................................................]-( 120000) 0x004000a8.0 movz x12, #12, #0 [ 13]
+[.....................fdn.ic....r................................................]-( 120000) 0x004000ac.0 movz x13, #13, #0 [ 14]
+[.....................fdn.ic.....r...............................................]-( 120000) 0x004000b0.0 movz x14, #14, #0 [ 15]
+[.....................fdn.pic....r...............................................]-( 120000) 0x004000b4.0 movz x15, #15, #0 [ 16]
+[.....................fdn.pic....r...............................................]-( 120000) 0x004000b8.0 movz x16, #16, #0 [ 17]
+[.....................fdn.pic....r...............................................]-( 120000) 0x004000bc.0 movz x17, #17, #0 [ 18]
+[............................................fdn.ic.r............................]-( 160000) 0x004000c0.0 movz x18, #18, #0 [ 19]
+[............................................fdn.ic.r............................]-( 160000) 0x004000c4.0 movz x19, #19, #0 [ 20]
+[............................................fdn.ic.r............................]-( 160000) 0x004000c8.0 movz x20, #20, #0 [ 21]
+[............................................fdn.ic.r............................]-( 160000) 0x004000cc.0 movz x21, #21, #0 [ 22]
+[............................................fdn.ic.r............................]-( 160000) 0x004000d0.0 movz x22, #22, #0 [ 23]
+[............................................fdn.ic.r............................]-( 160000) 0x004000d4.0 movz x23, #23, #0 [ 24]
+[............................................fdn.pic.r...........................]-( 160000) 0x004000d8.0 movz x24, #24, #0 [ 25]
+[............................................fdn.pic.r...........................]-( 160000) 0x004000dc.0 movz x25, #25, #0 [ 26]
+[.............................................fdn.ic.r...........................]-( 160000) 0x004000e0.0 movz x0, #0, #0 [ 27]
+[.............................................fdn.ic.r...........................]-( 160000) 0x004000e4.0 movz x8, #93, #0 [ 28]
+
+
+
+
+
+
+
+
+
+-
+
the addition of a data section entry changed our previous address setup a bit: the entry point is now 0x004000b0, which fits 4 instructions in the cacheline instead of 2
+
+-
+
the LDR happens to be the fourth instruction, so it takes a long time to retire. The time is about 40k ticks, which is about the same as the time taken by the instruction fetch, as expected.
+
+-
+
fetch does not continue past the LDR, and so nothing is gained in this particular example, since the next instructions haven’t been fetched from memory yet!
+
+
+
+
+
+
// f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
+
+ timeline tick pc.upc disasm seq_num
+[.ic.r........................................................................fdn]-( 40000) 0x004000b0.0 movz x0, #0, #0 [ 1]
+[.ic.r........................................................................fdn]-( 40000) 0x004000b4.0 movz x1, #1, #0 [ 2]
+[.ic.r........................................................................fdn]-( 40000) 0x004000b8.0 adr x2, #65780 [ 3]
+[.............................................................................fdn]-( 40000) 0x004000bc.0 ldr x3, [x2] [ 4]
+[.pic............................................................................]-( 80000) ...
+[................................r...............................................]-( 120000) ...
+[....................fdn.ic......r...............................................]-( 120000) 0x004000c0.0 movz x4, #4, #0 [ 5]
+[....................fdn.ic......r...............................................]-( 120000) 0x004000c4.0 movz x5, #5, #0 [ 6]
+[....................fdn.ic......r...............................................]-( 120000) 0x004000c8.0 movz x6, #6, #0 [ 7]
+[....................fdn.ic......r...............................................]-( 120000) 0x004000cc.0 movz x7, #7, #0 [ 8]
+[....................fdn.ic......r...............................................]-( 120000) 0x004000d0.0 movz x8, #8, #0 [ 9]
+[....................fdn.ic......r...............................................]-( 120000) 0x004000d4.0 movz x9, #9, #0 [ 10]
+[....................fdn.pic.....r...............................................]-( 120000) 0x004000d8.0 movz x10, #10, #0 [ 11]
+[....................fdn.pic......r..............................................]-( 120000) 0x004000dc.0 movz x11, #11, #0 [ 12]
+[.....................fdn.ic......r..............................................]-( 120000) 0x004000e0.0 movz x12, #12, #0 [ 13]
+[.....................fdn.ic......r..............................................]-( 120000) 0x004000e4.0 movz x13, #13, #0 [ 14]
+[.....................fdn.ic......r..............................................]-( 120000) 0x004000e8.0 movz x14, #14, #0 [ 15]
+[.....................fdn.ic......r..............................................]-( 120000) 0x004000ec.0 movz x15, #15, #0 [ 16]
+[.....................fdn.pic.....r..............................................]-( 120000) 0x004000f0.0 movz x16, #16, #0 [ 17]
+[.....................fdn.pic.....r..............................................]-( 120000) 0x004000f4.0 movz x17, #17, #0 [ 18]
+[.....................fdn.pic.....r..............................................]-( 120000) 0x004000f8.0 movz x18, #18, #0 [ 19]
+[.....................fdn.pic......r.............................................]-( 120000) 0x004000fc.0 movz x19, #19, #0 [ 20]
+
+
+
+
+
+
+
+
So in this case we see that there were actual potential gains, since the movz x11 started running immediately. We just stopped at movz x20 because a new ifetch was needed.
+
+
+
+
// f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
+
+ timeline tick pc.upc disasm seq_num
+[.ic.r........................................................................fdn]-( 40000) 0x004000b0.0 movz x0, #0, #0 [ 1]
+[.ic.r........................................................................fdn]-( 40000) 0x004000b4.0 movz x1, #1, #0 [ 2]
+[.ic.r........................................................................fdn]-( 40000) 0x004000b8.0 movz x2, #4, #0 [ 3]
+[.ic.r........................................................................fdn]-( 40000) 0x004000bc.0 movz x3, #5, #0 [ 4]
+[....................fdn.ic.r....................................................]-( 120000) 0x004000c0.0 adr x4, #65772 [ 5]
+[....................fdn.pic.....................................................]-( 120000) 0x004000c4.0 ldr x5, [x4] [ 6]
+[........................................................r.......................]-( 160000) ...
+[....................fdn.ic......................................................]-( 120000) 0x004000c8.0 movz x6, #6, #0 [ 7]
+[........................................................r.......................]-( 160000) ...
+[....................fdn.ic......................................................]-( 120000) 0x004000cc.0 movz x7, #7, #0 [ 8]
+[........................................................r.......................]-( 160000) ...
+[....................fdn.ic......................................................]-( 120000) 0x004000d0.0 movz x8, #8, #0 [ 9]
+[........................................................r.......................]-( 160000) ...
+[....................fdn.ic......................................................]-( 120000) 0x004000d4.0 movz x9, #9, #0 [ 10]
+[........................................................r.......................]-( 160000) ...
+[....................fdn.ic......................................................]-( 120000) 0x004000d8.0 movz x10, #10, #0 [ 11]
+[........................................................r.......................]-( 160000) ...
+[....................fdn.pic.....................................................]-( 120000) 0x004000dc.0 movz x11, #11, #0 [ 12]
+[........................................................r.......................]-( 160000) ...
+[.....................fdn.ic.....................................................]-( 120000) 0x004000e0.0 movz x12, #12, #0 [ 13]
+[........................................................r.......................]-( 160000) ...
+[.....................fdn.ic.....................................................]-( 120000) 0x004000e4.0 movz x13, #13, #0 [ 14]
+[.........................................................r......................]-( 160000) ...
+[.....................fdn.ic.....................................................]-( 120000) 0x004000e8.0 movz x14, #14, #0 [ 15]
+[.........................................................r......................]-( 160000) ...
+[.....................fdn.ic.....................................................]-( 120000) 0x004000ec.0 movz x15, #15, #0 [ 16]
+[.........................................................r......................]-( 160000) ...
+[.....................fdn.ic.....................................................]-( 120000) 0x004000f0.0 movz x16, #16, #0 [ 17]
+[.........................................................r......................]-( 160000) ...
+[.....................fdn.pic....................................................]-( 120000) 0x004000f4.0 movz x17, #17, #0 [ 18]
+[.........................................................r......................]-( 160000) ...
+[.....................fdn.pic....................................................]-( 120000) 0x004000f8.0 movz x18, #18, #0 [ 19]
+[.........................................................r......................]-( 160000) ...
+[.....................fdn.pic....................................................]-( 120000) 0x004000fc.0 movz x19, #19, #0 [ 20]
+[.........................................................r......................]-( 160000) ...
+[............................................fdn.ic.......r......................]-( 160000) 0x00400100.0 movz x20, #20, #0 [ 21]
+[............................................fdn.ic........r.....................]-( 160000) 0x00400104.0 movz x21, #21, #0 [ 22]
+[............................................fdn.ic........r.....................]-( 160000) 0x00400108.0 movz x22, #22, #0 [ 23]
+[............................................fdn.ic........r.....................]-( 160000) 0x0040010c.0 movz x23, #23, #0 [ 24]
+[............................................fdn.ic........r.....................]-( 160000) 0x00400110.0 movz x24, #24, #0 [ 25]
+[............................................fdn.ic........r.....................]-( 160000) 0x00400114.0 movz x25, #25, #0 [ 26]
+[............................................fdn.pic.......r.....................]-( 160000) 0x00400118.0 movz x26, #26, #0 [ 27]
+[............................................fdn.pic.......r.....................]-( 160000) 0x0040011c.0 movz x27, #27, #0 [ 28]
+[.............................................fdn.ic.......r.....................]-( 160000) 0x00400120.0 movz x28, #28, #0 [ 29]
+[.............................................fdn.ic........r....................]-( 160000) 0x00400124.0 movz x29, #29, #0 [ 30]
+[.............................................fdn.ic........r....................]-( 160000) 0x00400128.0 movz x0, #0, #0 [ 31]
+[.............................................fdn.ic........r....................]-( 160000) 0x0040012c.0 movz x1, #1, #0 [ 32]
+[.............................................fdn.pic.......r....................]-( 160000) 0x00400130.0 movz x2, #2, #0 [ 33]
+[.............................................fdn.pic.......r....................]-( 160000) 0x00400134.0 movz x3, #3, #0 [ 34]
+[.............................................fdn.pic.......r....................]-( 160000) 0x00400138.0 movz x4, #4, #0 [ 35]
+[.............................................fdn.pic.......r....................]-( 160000) 0x0040013c.0 movz x5, #5, #0 [ 36]
+
+
+
+
We now also understand the graph better from lines such as this:
+
+
+
+
[....................fdn.pic.....................................................]-( 120000) 0x004000c4.0 ldr x5, [x4] [ 6]
+[........................................................r.......................]-( 160000) ...
+[....................fdn.ic......................................................]-( 120000) 0x004000c8.0 movz x6, #6, #0 [ 7]
+[........................................................r.......................]-( 160000) ...
+
+
+
+
We see that extra lines are drawn (the 160000 … lines here) whenever something stalls for a period longer than the width of the visualisation.
+
+
+
Things are still relatively readable because the wrapping aligns them with events that actually happened on that line directly, e.g. 160000) 0x00400100.0 movz x20, #20, #0.
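The wrapping rule above can be sketched in a few lines of Python. This is only an illustration, not part of gem5's own pipeline viewer: it assumes that each timeline column is worth 500 ticks (which is what base ticks 40000 apart on an 80 column row suggest), and the helper names `row_event_ticks` and `instruction_event_ticks` are made up for this example. The two rows are rebuilt with string arithmetic from the `ldr x5, [x4]` entry, and the column offsets are my own reading of the listing, so treat the resulting numbers as approximate.

```python
# Assumption: one timeline column is 500 ticks, so an 80-column row spans 40000 ticks.
TICKS_PER_COLUMN = 500

# Legend taken from the header of the timeline output.
STAGE_NAMES = {
    "f": "fetch", "d": "decode", "n": "rename", "p": "dispatch",
    "i": "issue", "c": "complete", "r": "retire",
}


def row_event_ticks(timeline, base_tick):
    """Map every stage letter found in one timeline row to an absolute tick."""
    return {
        STAGE_NAMES[char]: base_tick + column * TICKS_PER_COLUMN
        for column, char in enumerate(timeline)
        if char in STAGE_NAMES
    }


def instruction_event_ticks(rows):
    """Merge the wrapped rows of one instruction; rows is [(timeline, base_tick), ...]."""
    events = {}
    for timeline, base_tick in rows:
        events.update(row_event_ticks(timeline, base_tick))
    return events


# Approximate reconstruction of the two rows of "ldr x5, [x4]":
# the first row starts at tick 120000, the continuation row at tick 160000.
ldr_rows = [
    ("." * 20 + "fdn.pic" + "." * 53, 120000),
    ("." * 56 + "r" + "." * 23, 160000),
]

for stage, tick in instruction_event_ticks(ldr_rows).items():
    print(f"{stage:>8}: {tick}")
```

The continuation row is handled exactly like the first one, only with its own base tick, which is why the extra `...` lines remain easy to interpret.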
+
+
+
+
+
+
+
+
So in this case the issue and complete (ic) stages of dependent instructions such as add x6, x5, #1 have to wait until the LDR is finished:
+
+
+
+
// f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
+
+ timeline tick pc.upc disasm seq_num
+[.ic.r........................................................................fdn]-( 40000) 0x004000b0.0 movz x0, #0, #0 [ 1]
+[.ic.r........................................................................fdn]-( 40000) 0x004000b4.0 movz x1, #1, #0 [ 2]
+[.ic.r........................................................................fdn]-( 40000) 0x004000b8.0 movz x2, #4, #0 [ 3]
+[.ic.r........................................................................fdn]-( 40000) 0x004000bc.0 movz x3, #5, #0 [ 4]
+[....................fdn.ic.r....................................................]-( 120000) 0x004000c0.0 adr x4, #65772 [ 5]
+[....................fdn.pic.....................................................]-( 120000) 0x004000c4.0 ldr x5, [x4] [ 6]
+[........................................................r.......................]-( 160000) ...
+[....................fdn.p.......................................................]-( 120000) 0x004000c8.0 add x6, x5, #1 [ 7]
+[......................................................ic.r......................]-( 160000) ...
+[....................fdn.p.......................................................]-( 120000) 0x004000cc.0 add x7, x6, #1 [ 8]
+[.......................................................ic.r.....................]-( 160000) ...
+[....................fdn.p.......................................................]-( 120000) 0x004000d0.0 add x8, x7, #1 [ 9]
+[........................................................ic.r....................]-( 160000) ...
+[....................fdn.p.......................................................]-( 120000) 0x004000d4.0 add x9, x8, #1 [ 10]
+[.........................................................ic.r...................]-( 160000) ...
+[....................fdn.ic......................................................]-( 120000) 0x004000d8.0 movz x10, #10, #0 [ 11]
+[............................................................r...................]-( 160000) ...
+[....................fdn.ic......................................................]-( 120000) 0x004000dc.0 movz x11, #11, #0 [ 12]
+[............................................................r...................]-( 160000) ...
+[.....................fdn.ic.....................................................]-( 120000) 0x004000e0.0 movz x12, #12, #0 [ 13]
+[............................................................r...................]-( 160000) ...
+[.....................fdn.ic.....................................................]-( 120000) 0x004000e4.0 movz x13, #13, #0 [ 14]
+[............................................................r...................]-( 160000) ...
+[.....................fdn.ic.....................................................]-( 120000) 0x004000e8.0 movz x14, #14, #0 [ 15]
+[............................................................r...................]-( 160000) ...
+[.....................fdn.ic.....................................................]-( 120000) 0x004000ec.0 movz x15, #15, #0 [ 16]
+[............................................................r...................]-( 160000) ...
+[.....................fdn.ic.....................................................]-( 120000) 0x004000f0.0 movz x16, #16, #0 [ 17]
+[............................................................r...................]-( 160000) ...
+[.....................fdn.ic.....................................................]-( 120000) 0x004000f4.0 movz x17, #17, #0 [ 18]
+[.............................................................r..................]-( 160000) ...
+[.....................fdn.pic....................................................]-( 120000) 0x004000f8.0 movz x18, #18, #0 [ 19]
+[.............................................................r..................]-( 160000) ...
+[.....................fdn.pic....................................................]-( 120000) 0x004000fc.0 movz x19, #19, #0 [ 20]
+[.............................................................r..................]-( 160000) ...
+[............................................fdn.ic...........r..................]-( 160000) 0x00400100.0 movz x20, #20, #0 [ 21]
+[............................................fdn.ic...........r..................]-( 160000) 0x00400104.0 movz x21, #21, #0 [ 22]
+[............................................fdn.ic...........r..................]-( 160000) 0x00400108.0 movz x22, #22, #0 [ 23]
+[............................................fdn.ic...........r..................]-( 160000) 0x0040010c.0 movz x23, #23, #0 [ 24]
+[............................................fdn.ic...........r..................]-( 160000) 0x00400110.0 movz x24, #24, #0 [ 25]
+[............................................fdn.ic............r.................]-( 160000) 0x00400114.0 movz x25, #25, #0 [ 26]
+[............................................fdn.pic...........r.................]-( 160000) 0x00400118.0 movz x26, #26, #0 [ 27]
+[............................................fdn.pic...........r.................]-( 160000) 0x0040011c.0 movz x27, #27, #0 [ 28]
+[.............................................fdn.ic...........r.................]-( 160000) 0x00400120.0 movz x28, #28, #0 [ 29]
+[.............................................fdn.ic...........r.................]-( 160000) 0x00400124.0 movz x29, #29, #0 [ 30]
+[.............................................fdn.ic...........r.................]-( 160000) 0x00400128.0 movz x0, #0, #0 [ 31]
+[.............................................fdn.ic...........r.................]-( 160000) 0x0040012c.0 movz x1, #1, #0 [ 32]
+[.............................................fdn.pic..........r.................]-( 160000) 0x00400130.0 movz x2, #2, #0 [ 33]
+[.............................................fdn.pic...........r................]-( 160000) 0x00400134.0 movz x3, #3, #0 [ 34]
+[.............................................fdn.pic...........r................]-( 160000) 0x00400138.0 movz x4, #4, #0 [ 35]
+[.............................................fdn.pic...........r................]-( 160000) 0x0040013c.0 movz x5, #5, #0 [ 36]
+
+
+
+
+
+
+
+
That program is set up such that the branch is not taken if an extra CLI argument is passed with --cli-args.
+
+
+
We purposefully set things up so that speculation will be running from the icache so we can see what is going on more clearly without ifetch stalls.
+
+
+
Without an extra CLI argument (the branch is taken):
+
+
+
+
// f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
+
+ timeline tick pc.upc disasm seq_num
+[.............................................................................fdn]-( 40000) 0x00400078.0 ldr x0, [sp] [ 1]
+[.ic.............................................................................]-( 80000) ...
+[................................r...............................................]-( 120000) ...
+[.............................................................................fdn]-( 40000) 0x0040007c.0 movz x1, #1, #0 [ 2]
+[.ic.............................................................................]-( 80000) ...
+[................................r...............................................]-( 120000) ...
+[....................fdn.ic......r...............................................]-( 120000) 0x00400080.0 movz x2, #2, #0 [ 3]
+[....................fdn.ic......r...............................................]-( 120000) 0x00400084.0 movz x3, #3, #0 [ 4]
+[....................fdn.ic......r...............................................]-( 120000) 0x00400088.0 movz x4, #4, #0 [ 5]
+[....................fdn.ic......r...............................................]-( 120000) 0x0040008c.0 movz x5, #5, #0 [ 6]
+[....................fdn.ic......r...............................................]-( 120000) 0x00400090.0 movz x6, #6, #0 [ 7]
+[....................fdn.p.....ic..r.............................................]-( 120000) 0x00400094.0 subs x0, #2 [ 8]
+[....................fdn.ic........r.............................................]-( 120000) 0x00400098.0 movz x0, #3, #0 [ 9]
+[....................fdn.p......ic.r.............................................]-( 120000) 0x0040009c.0 b.lt 0x400080 [ 10]
+[=====================fdn=ic=====================================================]-( 120000) 0x004000a0.0 -----movz x10, #10, #0 [ 11]
+[=====================fdn=ic=====================================================]-( 120000) 0x004000a4.0 -----movz x11, #11, #0 [ 12]
+[=====================fdn=ic=====================================================]-( 120000) 0x004000a8.0 -----movz x12, #12, #0 [ 13]
+[=====================fdn=ic=====================================================]-( 120000) 0x004000ac.0 -----movz x13, #13, #0 [ 14]
+[=====================fdn=ic=====================================================]-( 120000) 0x004000b0.0 -----movz x14, #14, #0 [ 15]
+[=====================fdn=ic=====================================================]-( 120000) 0x004000b4.0 -----movz x15, #15, #0 [ 16]
+[=====================fdn=pic====================================================]-( 120000) 0x004000b8.0 -----movz x16, #16, #0 [ 17]
+[=====================fdn=pic====================================================]-( 120000) 0x004000bc.0 -----movz x17, #17, #0 [ 18]
+[.....................................fdn.ic.r...................................]-( 120000) 0x00400080.0 movz x2, #2, #0 [ 19]
+[.....................................fdn.ic.r...................................]-( 120000) 0x00400084.0 movz x3, #3, #0 [ 20]
+[.....................................fdn.ic.r...................................]-( 120000) 0x00400088.0 movz x4, #4, #0 [ 21]
+[.....................................fdn.ic.r...................................]-( 120000) 0x0040008c.0 movz x5, #5, #0 [ 22]
+[.....................................fdn.ic.r...................................]-( 120000) 0x00400090.0 movz x6, #6, #0 [ 23]
+[.....................................fdn.pic.r..................................]-( 120000) 0x00400098.0 movz x0, #3, #0 [ 25]
+[.....................................fdn.pic.r..................................]-( 120000) 0x0040009c.0 b.lt 0x400080 [ 26]
+[......................................fdn.ic.r..................................]-( 120000) 0x004000a0.0 movz x10, #10, #0 [ 27]
+[......................................fdn.ic.r..................................]-( 120000) 0x004000a4.0 movz x11, #11, #0 [ 28]
+[......................................fdn.ic.r..................................]-( 120000) 0x004000a8.0 movz x12, #12, #0 [ 29]
+[......................................fdn.ic.r..................................]-( 120000) 0x004000ac.0 movz x13, #13, #0 [ 30]
+[......................................fdn.pic.r.................................]-( 120000) 0x004000b0.0 movz x14, #14, #0 [ 31]
+[......................................fdn.pic.r.................................]-( 120000) 0x004000b4.0 movz x15, #15, #0 [ 32]
+[......................................fdn.pic.r.................................]-( 120000) 0x004000b8.0 movz x16, #16, #0 [ 33]
+[......................................fdn.pic.r.................................]-( 120000) 0x004000bc.0 movz x17, #17, #0 [ 34]
+[.............................................fdn.ic.r...........................]-( 160000) 0x004000c0.0 movz x0, #0, #0 [ 35]
+[.............................................fdn.ic.r...........................]-( 160000) 0x004000c4.0 movz x8, #93, #0 [ 36]
+
+
+
+
So here we see that the CPU mispredicted! After the BLT instruction, the CPU continued to run movz x10, assuming that the branch would not be taken.
+
+
+
Then, at time 120000, the LDR data came back, after the wrong prediction had already been fully executed.
+
+
+
The CPU then noticed that it mispredicted, and so it started again from the correct branch target movz x2, and the instructions that were thrown away are marked as ===== in the timeline.
+
+
+
+
+
130000: Fetch: system.cpu.fetch: [tid:0] [sn:10] Branch at PC 0x40009c predicted to be not taken
+ 130000: Fetch: system.cpu.fetch: [tid:0] [sn:10] Branch at PC 0x40009c predicted to go to (0x4000a0=>0x4000a4).(0=>1)
+
+ 131500: Commit: system.cpu.commit: [tid:0] [sn:10] Inserting PC (0x40009c=>0x4000a0).(0=>1) into ROB.
+ 131500: ROB: system.cpu.rob: Adding inst PC (0x40009c=>0x4000a0).(0=>1) to the ROB.
+ 131500: ROB: system.cpu.rob: [tid:0] Now has 10 instructions.
+
+ 132000: IEW: system.cpu.iew: [tid:0] Issue: Adding PC (0x40009c=>0x4000a0).(0=>1) [sn:10] [tid:0] to IQ.
+ 132000: IQ: system.cpu.iq: Adding instruction [sn:10] PC (0x40009c=>0x4000a0).(0=>1) to the IQ.
+ 132000: IQ: system.cpu.iq: Instruction PC (0x40009c=>0x4000a0).(0=>1) has src reg 6 (CCRegClass) that is being added to the dependency chain.
+ 132000: IQ: system.cpu.iq: Instruction PC (0x40009c=>0x4000a0).(0=>1) has src reg 8 (CCRegClass) that is being added to the dependency chain.
+ 132000: IQ: system.cpu.iq: Instruction PC (0x40009c=>0x4000a0).(0=>1) has src reg 7 (CCRegClass) that is being added to the dependency chain.
+
+ 135500: IQ: system.cpu.iq: Waking up a dependent instruction, [sn:10] PC (0x40009c=>0x4000a0).(0=>1).
+ 135500: IQ: global: [sn:10] has 1 ready out of 3 sources. RTI 0)
+ 135500: IQ: system.cpu.iq: Waking any dependents on register 7 (CCRegClass).
+ 135500: IQ: system.cpu.iq: Waking up a dependent instruction, [sn:10] PC (0x40009c=>0x4000a0).(0=>1).
+ 135500: IQ: global: [sn:10] has 2 ready out of 3 sources. RTI 0)
+ 135500: IQ: system.cpu.iq: Waking any dependents on register 8 (CCRegClass).
+ 135500: IQ: system.cpu.iq: Waking up a dependent instruction, [sn:10] PC (0x40009c=>0x4000a0).(0=>1).
+ 135500: IQ: global: [sn:10] has 3 ready out of 3 sources. RTI 0)
+ 135500: IQ: system.cpu.iq: Instruction is ready to issue, putting it onto the ready list, PC (0x40009c=>0x4000a0).(0=>1) opclass:1 [sn:10].
+ 135500: IEW: system.cpu.iew: Setting Destination Register 6 (CCRegClass)
+ 135500: Scoreboard: system.cpu.scoreboard: Setting reg 6 (CCRegClass) as ready
+ 135500: IEW: system.cpu.iew: Setting Destination Register 7 (CCRegClass)
+ 135500: Scoreboard: system.cpu.scoreboard: Setting reg 7 (CCRegClass) as ready
+ 135500: IEW: system.cpu.iew: Setting Destination Register 8 (CCRegClass)
+ 135500: Scoreboard: system.cpu.scoreboard: Setting reg 8 (CCRegClass) as ready
+ 135500: IQ: system.cpu.iq: Attempting to schedule ready instructions from the IQ.
+ 135500: IQ: system.cpu.iq: Thread 0: Issuing instruction PC (0x40009c=>0x4000a0).(0=>1) [sn:10]
+
+ 136000: IEW: system.cpu.iew: Execute: Processing PC (0x40009c=>0x4000a0).(0=>1), [tid:0] [sn:10].
+ 136000: IEW: global: RegFile: Access to cc register 6, has data 0x2
+ 136000: IEW: global: RegFile: Access to cc register 8, has data 0
+ 136000: IEW: global: RegFile: Access to cc register 7, has data 0
+ 136000: IEW: system.cpu.iew: Current wb cycle: 0, width: 8, numInst: 0
+wbActual:0
+ 136000: IEW: system.cpu.iew: [tid:0] [sn:10] Execute: Branch mispredict detected.
+ 136000: IEW: system.cpu.iew: [tid:0] [sn:10] Predicted target was PC: (0x4000a0=>0x4000a4).(0=>1)
+ 136000: IEW: system.cpu.iew: [tid:0] [sn:10] Execute: Redirecting fetch to PC: (0x40009c=>0x400080).(0=>1)
+ 136000: IEW: system.cpu.iew: [tid:0] [sn:10] Squashing from a specific instruction, PC: (0x40009c=>0x400080).(0=>1)
+
+ 136500: Commit: system.cpu.commit: [tid:0] Squashing due to branch mispred PC:0x40009c [sn:10]
+ 136500: Commit: system.cpu.commit: [tid:0] Redirecting to PC 0x400084
+ 136500: ROB: system.cpu.rob: Starting to squash within the ROB.
+ 136500: ROB: system.cpu.rob: [tid:0] Squashing instructions until [sn:10].
+ 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000bc=>0x4000c0).(0=>1), seq num 18.
+ 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000b8=>0x4000bc).(0=>1), seq num 17.
+ 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000b4=>0x4000b8).(0=>1), seq num 16.
+ 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000b0=>0x4000b4).(0=>1), seq num 15.
+ 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000ac=>0x4000b0).(0=>1), seq num 14.
+ 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000a8=>0x4000ac).(0=>1), seq num 13.
+ 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000a4=>0x4000a8).(0=>1), seq num 12.
+ 136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000a0=>0x4000a4).(0=>1), seq num 11.
+ 136500: ROB: system.cpu.rob: [tid:0] Done squashing instructions.
+ 136500: Commit: system.cpu.commit: [tid:0] Marking PC (0x40009c=>0x400080).(0=>1), [sn:10] ready within ROB.
+
+ 137000: Commit: system.cpu.commit: [tid:0] [sn:10] Committing instruction with PC (0x40009c=>0x400080).(0=>1)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+36 : b.lt 0x400080 : IntAlu : FetchSeq=10 CPSeq=10 flags=(IsControl|IsDirectControl|IsCondControl)
+ 137000: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x40009c=>0x400080).(0=>1), [sn:10]
+ 137000: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x40009c=>0x400080).(0=>1) [sn:10]
+ 137000: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:11]
+ 137000: Commit: system.cpu.commit: Retiring squashed instruction from ROB.
+
+ 137000: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:10]
+ 137000: Commit: system.cpu.commit: [tid:0] [sn:10] Committing instruction with PC (0x40009c=>0x400080).(0=>1)
+ 130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+36 : b.lt 0x400080 : IntAlu : FetchSeq=10 CPSeq=10 flags=(IsControl|IsDirectControl|IsCondControl)
+
+ 138500: Fetch: system.cpu.fetch: [tid:0] [sn:26] Branch at PC 0x40009c predicted to be not taken
+ 138500: Fetch: system.cpu.fetch: [tid:0] [sn:26] Branch at PC 0x40009c predicted to go to (0x4000a0=>0x4000a4).(0=>1)
+
+ 142500: Commit: system.cpu.commit: [tid:0] [sn:26] Committing instruction with PC (0x40009c=>0x4000a0).(0=>1)
+ 138500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+36 : b.lt 0x400080 : IntAlu : FetchSeq=26 CPSeq=18 flags=(IsControl|IsDirectControl|IsCondControl)
+ 142500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x40009c=>0x4000a0).(0=>1), [sn:26]
+
+
+
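To pull the misprediction story out of a long O3 debug log without reading it line by line, a small filter can help. The following is a rough sketch, not a gem5 tool: the function name `find_squashes` is hypothetical, the regular expressions are written only against the message formats visible in the excerpt above, and other gem5 versions may word these messages differently.

```python
import re

# A few lines copied from the log excerpt above.
LOG = """\
136000: IEW: system.cpu.iew: [tid:0] [sn:10] Execute: Branch mispredict detected.
136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000bc=>0x4000c0).(0=>1), seq num 18.
136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000b8=>0x4000bc).(0=>1), seq num 17.
136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000a0=>0x4000a4).(0=>1), seq num 11.
136500: ROB: system.cpu.rob: [tid:0] Done squashing instructions.
"""

# Assumption: these patterns only cover the exact wording shown in the excerpt.
MISPREDICT_RE = re.compile(
    r"^\s*(\d+): IEW: \S+ \[tid:\d+\] \[sn:(\d+)\] Execute: Branch mispredict detected\."
)
SQUASH_RE = re.compile(
    r"^\s*(\d+): ROB: \S+ \[tid:\d+\] Squashing instruction PC "
    r"\((0x[0-9a-f]+)=>0x[0-9a-f]+\)\.\(\d+=>\d+\), seq num (\d+)\."
)


def find_squashes(log_text):
    """Return one record per detected mispredict, with the instructions squashed after it."""
    events = []
    for line in log_text.splitlines():
        match = MISPREDICT_RE.match(line)
        if match:
            events.append({
                "tick": int(match.group(1)),
                "branch_sn": int(match.group(2)),
                "squashed": [],
            })
            continue
        match = SQUASH_RE.match(line)
        if match and events:
            # Store (sequence number, PC) of each squashed instruction.
            events[-1]["squashed"].append((int(match.group(3)), match.group(2)))
    return events


for event in find_squashes(LOG):
    print(event)
```

The squashed sequence numbers found this way should correspond to the rows drawn with ===== in the timeline.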
+
With an extra CLI argument (the branch is not taken):
+
+
+
+
// f = fetch, d = decode, n = rename, p = dispatch, i = issue, c = complete, r = retire
+
+ timeline tick pc.upc disasm seq_num
+[.............................................................................fdn]-( 40000) 0x00400078.0 ldr x0, [sp] [ 1]
+[.ic.............................................................................]-( 80000) ...
+[................................r...............................................]-( 120000) ...
+[.............................................................................fdn]-( 40000) 0x0040007c.0 movz x1, #1, #0 [ 2]
+[.ic.............................................................................]-( 80000) ...
+[................................r...............................................]-( 120000) ...
+[....................fdn.ic......r...............................................]-( 120000) 0x00400080.0 movz x2, #2, #0 [ 3]
+[....................fdn.ic......r...............................................]-( 120000) 0x00400084.0 movz x3, #3, #0 [ 4]
+[....................fdn.ic......r...............................................]-( 120000) 0x00400088.0 movz x4, #4, #0 [ 5]
+[....................fdn.ic......r...............................................]-( 120000) 0x0040008c.0 movz x5, #5, #0 [ 6]
+[....................fdn.ic......r...............................................]-( 120000) 0x00400090.0 movz x6, #6, #0 [ 7]
+[....................fdn.ic.......r..............................................]-( 120000) 0x00400098.0 movz x0, #3, #0 [ 9]
+[....................fdn.p......ic.r.............................................]-( 120000) 0x0040009c.0 b.lt 0x400080 [ 10]
+[.....................fdn.ic.......r.............................................]-( 120000) 0x004000a0.0 movz x10, #10, #0 [ 11]
+[.....................fdn.ic.......r.............................................]-( 120000) 0x004000a4.0 movz x11, #11, #0 [ 12]
+[.....................fdn.ic.......r.............................................]-( 120000) 0x004000a8.0 movz x12, #12, #0 [ 13]
+[.....................fdn.ic.......r.............................................]-( 120000) 0x004000ac.0 movz x13, #13, #0 [ 14]
+[.....................fdn.ic.......r.............................................]-( 120000) 0x004000b0.0 movz x14, #14, #0 [ 15]
+[.....................fdn.ic.......r.............................................]-( 120000) 0x004000b4.0 movz x15, #15, #0 [ 16]
+[.....................fdn.pic......r.............................................]-( 120000) 0x004000b8.0 movz x16, #16, #0 [ 17]
+[.....................fdn.pic.......r............................................]-( 120000) 0x004000bc.0 movz x17, #17, #0 [ 18]
+[............................................fdn.ic.r............................]-( 160000) 0x004000c0.0 movz x0, #0, #0 [ 19]
+[............................................fdn.ic.r............................]-( 160000) 0x004000c4.0 movz x8, #93, #0 [ 20]
+
+
+
+
So this time the prediction was correct. Retire is delayed until the memory comes back, but we otherwise just kept running forward until hitting the next ifetch cache line.
+
@@ -25315,16 +26567,61 @@ namespace ArmISAInst {
-
Some gem5 instructions break down into multiple microops.
Microops are very similar to regular instructions, and show up in the gem5 ExecAll trace format since that flag implies ExecMicro.
-
On aarch64 for example, one of the simplest microoped instructions is STP, which does the relatively complex operation of storing two values to memory at once, and is therefore a good candidate for being broken down into microops.
+
On aarch64 for example, one of the simplest microoped instructions is STP, which does the relatively complex operation of storing two values to memory at once, and is therefore a good candidate for being broken down into microops. We can observe it when executing:
+
+
+
+
./run \
+ --arch aarch64 \
+ --emulator gem5 \
+ --trace-insts-stdout \
+ --userland userland/arch/aarch64/freestanding/linux/disassembly_test.S \
+;
+
+
+
+
which contains in gem5’s broken-ish disassembly that the input:
+
+
+
+
+
+
16500: system.cpu: A0 T0 : @_start+108 : stp
+ 16500: system.cpu: A0 T0 : @_start+108. 0 : addxi_uop ureg0, x0, #16 : IntAlu : D=0x0000000000420010 flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
+ 17000: system.cpu: A0 T0 : @_start+108. 1 : strxi_uop w1, [ureg0] : MemWrite : D=0x000000009abcdef0 A=0x420010 flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsDelayedCommit)
+ 17500: system.cpu: A0 T0 : @_start+108. 2 : strxi_uop w2, [ureg0, #8] : MemWrite : D=0x0000000000000002 A=0x420018 flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsLastMicroop)
+
+
+
+
Where @_start+108. 0, @_start+108. 1 and @_start+108. 2 all happen at the same PC, and are therefore microops of STP.
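The grouping described above can also be done mechanically. The sketch below is only an illustration under assumptions: the regular expression is fitted to the trace lines shown here (other trace flags may change the layout), the name `group_microops` is made up, and entries are keyed by the symbol plus offset string rather than by a numeric PC.

```python
import re

# Trace lines copied from the excerpt above.
TRACE = """\
16500: system.cpu: A0 T0 : @_start+108 : stp
16500: system.cpu: A0 T0 : @_start+108. 0 : addxi_uop ureg0, x0, #16 : IntAlu : D=0x0000000000420010 flags=(IsInteger|IsMicroop|IsDelayedCommit|IsFirstMicroop)
17000: system.cpu: A0 T0 : @_start+108. 1 : strxi_uop w1, [ureg0] : MemWrite : D=0x000000009abcdef0 A=0x420010 flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsDelayedCommit)
17500: system.cpu: A0 T0 : @_start+108. 2 : strxi_uop w2, [ureg0, #8] : MemWrite : D=0x0000000000000002 A=0x420018 flags=(IsInteger|IsMemRef|IsStore|IsMicroop|IsLastMicroop)
"""

# tick, location (symbol+offset), optional microop index, rest of the line.
TRACE_RE = re.compile(r"^\s*(\d+): \S+ A\d+ T\d+ : @(\S+?)(?:\.\s*(\d+))? : (.*)$")


def group_microops(trace_text):
    """Group trace lines by location; lines with a '. N' suffix are microops."""
    groups = {}
    for line in trace_text.splitlines():
        match = TRACE_RE.match(line)
        if not match:
            continue
        tick, location, upc, rest = match.groups()
        entry = groups.setdefault(location, {"macro": None, "microops": []})
        disasm = rest.split(" : ")[0].strip()
        if upc is None:
            # Line without a microop index: the instruction itself.
            entry["macro"] = disasm
        else:
            entry["microops"].append((int(upc), int(tick), disasm))
    return groups


for location, entry in group_microops(TRACE).items():
    print(location, entry["macro"], [disasm for _, _, disasm in entry["microops"]])
```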
+
+
+
+
+
From the gem5 source code, we see that STP is a class LdpStp : public PairMemOp, and then the constructor of PairMemOp sets up the microops depending on the exact type of LDP/STP:
@@ -25347,7 +26644,7 @@ namespace ArmISAInst {
functional: get the value magically, do not update caches, see also: gem5 functional requests
-atomic: get the value now without making a separate event, but do not update caches
+atomic: get the value now without making a separate event, but do not update caches. Cannot work in Ruby due to fundamental limitations, mentioned in passing at: https://gem5.atlassian.net/browse/GEM5-676
timing: get the value simulating delays and updating caches
@@ -25563,6 +26860,33 @@ TimingSimpleCPU::finishTranslation(WholeTranslationState *state)
Therefore, here it makes sense for gem5 syscall implementation, which does not actually have a real kernel running, to just make a functional request and be done with it, since the impact of cache changes done by this read would be insignificant compared to the cost of an actual full context switch that would happen on a real syscall.
+
+
It is generally hard to implement functional requests for Ruby runs, because packets are flying through the memory system in a transient state, and there is no simple way of finding exactly which ones might have the latest version of the memory. See for example:
+
+
+
+
The typical error message in that case is:
+
+
+
+
fatal: Ruby functional read failed for address
+
+
@@ -26393,11 +27717,10 @@ readFunc(SyscallDesc *desc, ThreadContext *tc,
-
-
Each instruction is marked with a class, and each class can execute in a given functional unit.
+
Each instruction is marked with a class, and each class can execute in a given functional unit.
+
+
@@ -26411,7 +27734,6 @@ readFunc(SyscallDesc *desc, ThreadContext *tc,
--emulator gem5 \
--userland userland/arch/aarch64/freestanding/linux/hello.S \
--trace-insts-stdout \
- -N1 \
-- \
--cpu-type MinorCPU \
--caches
@@ -26424,7 +27746,233 @@ readFunc(SyscallDesc *desc, ThreadContext *tc,
[system.cpu]
type=MinorCPU
-children=branchPred dcache dtb executeFuncUnits icache interrupts isa itb power_state tracer workload
+children=branchPred dcache dtb executeFuncUnits icache interrupts isa itb power_state tracer workload
+executeInputWidth=2
+executeIssueLimit=2
+
+
+
+
+
The executeFuncUnits child of system.cpu points to:
+
+
+
+
[system.cpu.executeFuncUnits]
+type=MinorFUPool
+children=funcUnits0 funcUnits1 funcUnits2 funcUnits3 funcUnits4 funcUnits5 funcUnits6 funcUnits7
+
+
+
+
and the two first units are in full:
+
+
+
+
[system.cpu.executeFuncUnits.funcUnits0]
+type=MinorFU
+children=opClasses timings
+opClasses=system.cpu.executeFuncUnits.funcUnits0.opClasses
+opLat=3
+
+[system.cpu.executeFuncUnits.funcUnits0.opClasses]
+type=MinorOpClassSet
+children=opClasses
+
+[system.cpu.executeFuncUnits.funcUnits0.opClasses.opClasses]
+type=MinorOpClass
+opClass=IntAlu
+
+
+
+
+
+
[system.cpu.executeFuncUnits.funcUnits1]
+type=MinorFU
+children=opClasses timings
+opLat=3
+
+[system.cpu.executeFuncUnits.funcUnits1.opClasses]
+type=MinorOpClassSet
+children=opClasses
+opClasses=system.cpu.executeFuncUnits.funcUnits1.opClasses.opClasses
+
+[system.cpu.executeFuncUnits.funcUnits1.opClasses.opClasses]
+type=MinorOpClass
+opClass=IntAlu
+
+
+
+
So we understand that:
+
+
+
+-
+
the first and second functional units are both IntAlu, so they perform integer arithmetic operations
+
+-
+
both have a latency of 3
+
+-
+
each functional unit can have a set of opClass with more than one type. Those first two units just happen to have a single type.
+
+
+
+
+
+
+-
+
0, 1: IntAlu, opLat=3
+
+-
+
2: IntMult, opLat=3
+
+-
+
3: IntDiv, opLat=9. So we see that a more complex operation such as division has higher latency.
+
+-
+
4: FloatAdd, FloatCmp, and a gazillion other floating point related things. opLat=6.
+
+-
+
5: SimdPredAlu: TODO SVE-related? opLat=3
+
+-
+
6: MemRead, MemWrite, FloatMemRead, FloatMemWrite. opLat=1
+
+-
+
7: IprAccess (TODO), InstPrefetch
+
+
+
+
+
These are of course all specified in the Python at src/cpu/minor/MinorCPU.py:
+
+
+
+
class MinorDefaultFUPool(MinorFUPool):
+ funcUnits = [MinorDefaultIntFU(), MinorDefaultIntFU(),
+ MinorDefaultIntMulFU(), MinorDefaultIntDivFU(),
+ MinorDefaultFloatSimdFU(), MinorDefaultPredFU(),
+ MinorDefaultMemFU(), MinorDefaultMiscFU()]
+
+
+
+
We then expect that each instruction has a certain opClass that determines on which unit it can run.
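As a rough mental model of that dispatch rule, the unit list above can be written down as a table and queried by opClass. This is not how gem5 implements it internally: the table is transcribed by hand from the list in this section, the name `units_for` is hypothetical, and `None` marks a latency that the text does not state.

```python
# Hand-transcribed from the MinorCPU unit list above: (opClasses, opLat) per unit index.
# None means the text does not give an opLat for that unit.
MINOR_FUNC_UNITS = [
    ({"IntAlu"}, 3),
    ({"IntAlu"}, 3),
    ({"IntMult"}, 3),
    ({"IntDiv"}, 9),
    ({"FloatAdd", "FloatCmp"}, 6),  # the text mentions many more FP classes here
    ({"SimdPredAlu"}, 3),
    ({"MemRead", "MemWrite", "FloatMemRead", "FloatMemWrite"}, 1),
    ({"IprAccess", "InstPrefetch"}, None),
]


def units_for(op_class):
    """Return [(unit_index, opLat), ...] for every unit that accepts op_class."""
    return [
        (index, latency)
        for index, (classes, latency) in enumerate(MINOR_FUNC_UNITS)
        if op_class in classes
    ]


print(units_for("IntAlu"))
print(units_for("IntDiv"))
print(units_for("MemRead"))
```

An IntAlu instruction has two candidate units while IntDiv has only one, which is the kind of structural constraint that the functional unit pool is meant to model.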
+
+
+
For example: class AddImm, which is what we get on a simple add x1, x2, 0, sets itself as an IntAluOp in the constructor as expected:
+
+
+
+
AddImm::AddImm(ExtMachInst machInst,
+ IntRegIndex _dest,
+ IntRegIndex _op1,
+ uint32_t _imm,
+ bool _rotC)
+ : DataImmOp("add", machInst, IntAluOp,
+ _dest, _op1, _imm, _rotC)
+
+
+
+
+
+
+
On gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772, after running:
+
+
+
+
./run \
+ --arch aarch64 \
+ --emulator gem5 \
+ --userland userland/arch/aarch64/freestanding/linux/hello.S \
+ --trace-insts-stdout \
+ -- \
+ --cpu-type DerivO3CPU \
+ --caches
+
+
+
+
+
+
[system.cpu]
+type=DerivO3CPU
+children=branchPred dcache dtb fuPool icache interrupts isa itb power_state tracer workload
+
+
+
+
+
+
[system.cpu.fuPool]
+type=FUPool
+children=FUList0 FUList1 FUList2 FUList3 FUList4 FUList5 FUList6 FUList7 FUList8 FUList9
+
+
+
+
so for example FUList0 is:
+
+
+
+
[system.cpu.fuPool.FUList0]
+type=FUDesc
+children=opList
+count=6
+eventq_index=0
+opList=system.cpu.fuPool.FUList0.opList
+
+[system.cpu.fuPool.FUList0.opList]
+type=OpDesc
+eventq_index=0
+opClass=IntAlu
+opLat=1
+pipelined=true
+
+
+
+
+
+
[system.cpu.fuPool.FUList1.opList0]
+type=OpDesc
+eventq_index=0
+opClass=IntMult
+opLat=3
+pipelined=true
+
+[system.cpu.fuPool.FUList1.opList1]
+type=OpDesc
+eventq_index=0
+opClass=IntDiv
+opLat=20
+pipelined=false
+
+
+
+
So summarizing all units we have:
+
+
+
+-
+
0: IntAlu with opLat=1
+
+-
+
1: IntMult with opLat=3 and IntDiv with opLat=20
+
+-
+
2: FloatAdd, FloatCmp, FloatCvt with opLat=2
+
+-
+
TODO lazy to finish the list :-)
+
+
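Since the list above is unfinished, one way to finish it without reading config.ini by hand is to parse it. The following is a minimal sketch using Python's standard configparser, under the assumption that the generated config.ini is plain INI as in the excerpts above. The function name `op_latencies` is made up, and the embedded text is only the three OpDesc sections quoted in this section, not a full config.ini.

```python
import configparser

# The OpDesc sections quoted above; a real config.ini contains many more sections.
CONFIG_INI = """\
[system.cpu.fuPool.FUList0.opList]
type=OpDesc
eventq_index=0
opClass=IntAlu
opLat=1
pipelined=true

[system.cpu.fuPool.FUList1.opList0]
type=OpDesc
eventq_index=0
opClass=IntMult
opLat=3
pipelined=true

[system.cpu.fuPool.FUList1.opList1]
type=OpDesc
eventq_index=0
opClass=IntDiv
opLat=20
pipelined=false
"""


def op_latencies(ini_text, prefix="system.cpu.fuPool."):
    """Map each opClass to (functional unit name, opLat, pipelined)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names such as opClass case-sensitive
    parser.read_string(ini_text)
    result = {}
    for section in parser.sections():
        if not section.startswith(prefix):
            continue
        if parser.get(section, "type", fallback="") != "OpDesc":
            continue
        unit_name = section[len(prefix):].split(".")[0]
        result[parser.get(section, "opClass")] = (
            unit_name,
            parser.getint(section, "opLat"),
            parser.getboolean(section, "pipelined"),
        )
    return result


for op_class, info in op_latencies(CONFIG_INI).items():
    print(op_class, info)
```

To run this on a real simulation, the text of the generated config.ini would be passed to `op_latencies` instead of the embedded excerpt.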
@@ -26679,6 +28227,107 @@ build/ARM/config/the_isa.hh
+
+
+
+
+
+
+
+
+
TODO: advantages over QEMU. Like the name implies, they seem to have a nice ISA description language. From a quick look at the internals, it seems to generate LLVM intermediate language, which sounds good.
+
+
+
Build on Ubuntu 20.04:
+
+
+
+
sudo apt install libantlr3c-dev
+cd submodules/gensim
+make
+
+
+
+
+
+
arm-none-eabi-gcc: error: unrecognized -march target: armv5
+
+
+
+
Let’s try just armv8, who cares about armv5!!!
+
+
+
+
mkdir build
+cd build
+cmake -DTESTING_ENABLED=FALSE -DCMAKE_BUILD_TYPE=DEBUGOPT ..
+make -j`nproc` model-armv8
+
+
+
+
+
+
terminate called after throwing an instance of 'std::logic_error'
+ what(): Unrecognised intrinsic: __builtin_abs64
+Aborted (core dumped)
+
+
+
+
Get the failing command with:
+
+
+
+
make VERBOSE=1 model-armv8
+
+
+
+
and we see some code generation step:
+
+
+
+
cd /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/models/armv8 && \
+ /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/build/dist/bin/gensim \
+ -a /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/models/armv8/aarch64.ac \
+ -s module,arch,decode,disasm,ee_interp,ee_blockjit,jumpinfo,function,makefile \
+ -o decode.GenerateDotGraph=1,makefile.libtrace_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/support/libtrace/inc,makefile.archsim_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/archsim/inc,makefile.llvm_path=,makefile.Optimise=2,makefile.Debug=1 \
+ -t /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/build/models/armv8/output-aarch64/
+
+
+
+
We can see an inclusion path:
+
+
+
+
gensim/models/armv8/aarch64.ac
+ ac_isa("isa.ac");
+gensim/models/armv8/isa.ac
+ ac_execute("execute.simd");
+
+
+
+
and where gensim/models/armv8/isa.ac contains __builtin_abs64 usages.
+
+
+
GDB on gensim shows that the error comes from a call to gci.GenerateExecuteBodyFor(body_str, *action);, so it looks like there are some missing cases in EmitFixedCode.
+
+
+
This is completely broken academic code! They must be using an out-of-tree version of part of the tool and forgot to commit it.
+
+
@@ -27017,7 +28666,7 @@ make menuconfig
If none of those methods are flexible enough for you, you can just fork or hack up the sample package at buildroot_packages/sample_package to do what you want.
@@ -27203,7 +28852,7 @@ git -C "$(./getvar qemu_source_dir)" checkout -
Then, you will also want to do a Bisection to pinpoint the exact commit to blame, and CC that developer.
For Buildroot problems, you should either provide the config you have:
@@ -27872,17 +29521,11 @@ echo 1 > /proc/sys/vm/overcommit_memory
Demonstrates atomic_int and thrd_create.
-
Disassembly with GDB at LKMC 619fef4b04bddc4a5a38aec5e207dd4d5a25d206 + 1:
+
Disassembly with GDB at LKMC 619fef4b04bddc4a5a38aec5e207dd4d5a25d206 + 1:
-
./run-toolchain \
- --arch aarch64 gdb \
- -- \
- -batch \
- -ex 'disas/rs my_thread_main' $(./getvar \
- --arch aarch64 userland_build_dir)/c/atomic.out \
-;
+
./disas --arch aarch64 --userland userland/c/atomic.c my_thread_main
@@ -27982,6 +29625,101 @@ echo 1 > /proc/sys/vm/overcommit_memory
strace shows that OpenMP makes clone() syscalls in Linux. TODO: does it actually call pthread_ functions, or does it make syscalls directly? Or in other words, can it work on Freestanding programs? A quick grep shows many references to pthreads.
+
+
+
+
+
Host build on Ubuntu 20.04:
+
+
+
+
git submodule update --init submodules/omp-validation
+cd submodules/omp-validation
+PERL5LIB="${PERL5LIB}:." make -j `nproc` ctest
+
+
+
+
This both builds and runs the tests. It took about 5 minutes on a P51, but some tests failed for some reason:
+
+
+
+
Summary:
+S Number of tested Open MP constructs: 62
+S Number of used tests: 123
+S Number of failed tests: 4
+S Number of successful tests: 119
+S + from this were verified: 115
+
+Normal tests:
+N Number of failed tests: 2
+N + from this fail compilation: 0
+N + from this timed out 0
+N Number of successful tests: 60
+N + from this were verified: 58
+
+Orphaned tests:
+O Number of failed tests: 2
+O + from this fail compilation: 0
+O + from this timed out 0
+O Number of successful tests: 59
+O + from this were verified: 57
+
+
+
+
The tests and run results are placed under bin/c/, e.g.:
+
+
+
+
test_omp_threadprivate
+test_omp_threadprivate.c
+test_omp_threadprivate.log
+test_omp_threadprivate.out
+test_omp_threadprivate_compile.log
+
+
+
+
C files are also present as some kind of code generation is used.
+
+
+
Build only and run one of them manually:
+
+
+
+
make -j`nproc` omp_my_sleep omp_testsuite
+PERL5LIB="${PERL5LIB}:." ./runtest.pl --lang=c --norun testlist-c.txt
+./bin/c/test_omp_barrier
+
+
+
+
The bin/c directory is hardcoded in the executable, so to run it you must ensure that it exists relative to CWD, e.g.:
+
+
+
+
cd bin/c
+mkdir -p bin/c
+./test_omp_barrier
+
+
+
+
+
+
PERL5LIB="${PERL5LIB}:." ./runtest.pl --lang=c --makeopts 'CC=aarch64-linux-gnu-gcc CFLAGS_EXTRA=-static' --norun testlist-c.txt
+./../../run --arch aarch64 --emulator gem5 --userland submodules/omp-validation/bin/c/test_omp_parallel_reduction --cpus 8 --memory 8G
+
+
+
+
+
+
make bin/c/test_omp_sections_reduction
+
+
+
@@ -28248,7 +29986,7 @@ global 12676
The actual value is much smaller, because the threads have often overwritten one another with older values.
-
With --optimization-level 3, the result almost always equals that of a single thread, e.g.:
+
With --optimization-level 3, the result almost always equals that of a single thread, e.g.:
@@ -28414,7 +30152,6 @@ non-atomic 19
-
./run -aA -eg -u userland/c/atomic.c --cli-args '2 200' --cpus 3 --userland-build-id o3 -N1 --trace ExecAll -- --caches --cpu-type TimingSimpleCPU
Note that the system is very minimal, and doesn’t even have caches, so I’m curious as to how this can happen at all.
@@ -29219,6 +30959,36 @@ There are no non-locking atomic types or atomic primitives in POSIX:
+
+
+
@@ -29779,7 +31549,125 @@ cblas_dgemm( CblasColMajor, CblasNoTrans, CblasTrans,3,3,2 ,1, A,3, B,
-
+
+
+
+
+
Feels old, guessing not representative anymore like Dhrystone. But hey, history!
+
+
+
Ubuntu 20.04 AMD64 native build and run:
+
+
+
+
git submodule update --init submodules/lmbench
+cd submodules/lmbench
+cd src
+make results
+
+
+
+
TODO it hangs for a long time at:
+
+
+
+
Hang on, we are calculating your cache line size.
+
+
+
+
+
If I kill it, the configuration process continues:
+
+
+
+
Killed
+OK, it looks like your cache line is bytes.
+
+
+
+
and continues with a few more interactive questions until finally:
+
+
+
+
Confguration done, thanks.
+
+
+
+
where it again hangs for at least 2 hours, so I lost patience and killed it.
+
+
+
TODO: how to do a non-interactive config? After the above procedure, bin/x86_64-linux-gnu/CONFIG.ciro-p51 contains:
+
+
+
+
DISKS=""
+DISK_DESC=""
+OUTPUT=/dev/null
+ENOUGH=50000
+FASTMEM="NO"
+FILE=/var/tmp/XXX
+FSDIR=/var/tmp
+INFO=INFO.ciro-p51
+LINE_SIZE=
+LOOP_O=0.00000000
+MAIL=no
+TOTAL_MEM=31903
+MB=22332
+MHZ="-1 System too busy"
+MOTHERBOARD=""
+NETWORKS=""
+OS="x86_64-linux-gnu"
+PROCESSORS="8"
+REMOTE=""
+SLOWFS="NO"
+SYNC_MAX="1"
+LMBENCH_SCHED="DEFAULT"
+TIMING_O=0
+RSH=rsh
+RCP=rcp
+VERSION=lmbench-3alpha4
+BENCHMARK_HARDWARE=YES
+BENCHMARK_OS=YES
+BENCHMARK_SYSCALL=
+BENCHMARK_SELECT=
+BENCHMARK_PROC=
+BENCHMARK_CTX=
+BENCHMARK_PAGEFAULT=
+BENCHMARK_FILE=
+BENCHMARK_MMAP=
+BENCHMARK_PIPE=
+BENCHMARK_UNIX=
+BENCHMARK_UDP=
+BENCHMARK_TCP=
+BENCHMARK_CONNECT=
+BENCHMARK_RPC=
+BENCHMARK_HTTP=
+BENCHMARK_BCOPY=
+BENCHMARK_MEM=
+BENCHMARK_OPS=
+
+
+
+
Native build only without running tests:
+
+
+
+
+
+
@@ -29853,7 +31741,7 @@ times[3 * ntimes + k] = mysecond() - times[3 * ntimes + k];
-
+
@@ -29871,7 +31759,7 @@ times[3 * ntimes + k] = mysecond() - times[3 * ntimes + k];
-
+
./build --arch arm --download-dependencies gem5-buildroot parsec-benchmark
@@ -29905,7 +31793,7 @@ times[3 * ntimes + k] = mysecond() - times[3 * ntimes + k];
-
+
Running a benchmark of a size different than test, e.g. simsmall, requires a rebuild with:
@@ -29969,7 +31857,7 @@ times[3 * ntimes + k] = mysecond() - times[3 * ntimes + k];
-
+
Most users won’t want to use this method because:
@@ -30032,7 +31920,7 @@ parsecmgmt -a run -p splash2x.fmm -i test
-
+
If you want to remove PARSEC later, Buildroot doesn’t provide an automated package removal mechanism as mentioned at: Section 20.6, “Remove Buildroot packages”, but the following procedure should be satisfactory:
@@ -30050,7 +31938,7 @@ parsecmgmt -a run -p splash2x.fmm -i test
@@ -31266,7 +33154,7 @@ zmmintrin.h AVX512
userland/arch/arm/freestanding/linux/hello.S
-userland/arch/arm/inline_asm/freestanding/linux/hello.c
+userland/arch/arm/inline_asm/freestanding/linux/hello.c: there are no register constraints in ARM, so register variables are the most efficient way of storing variables in specific general purpose registers: https://stackoverflow.com/questions/3929442/how-to-specify-an-individual-register-as-constraint-in-arm-gcc-inline-assembly/54845046#54845046
@@ -34587,6 +36475,158 @@ ldmia sp!, reglist
GNU GAS 2.29 does not have a mnemonic for it in A64 because it is very recent: it shows in ARMv8 architecture reference manual db but not ca.
+
+
+
+
Examples of using them can be found at: dump_regs
+
+
+
aarch64 only uses exactly 2 instructions:
+
+
+
+
aarch32 is a bit more messy due to older setups, we have both:
+
+
+
+
+
+
mrc p15, 0, r0, c0, c0, 1 /* aarch32 */
+mrs x0, ctr_el0 /* aarch64 */
+
+
+
+
Other functionality has moved away from coprocessors into actual instructions, e.g. cache invalidation:
+
+
+
+
/* aarch32: DCISW, Data Cache line Invalidate by Set/Way. */
+mcr p15, 0, r5, c7, c6, 2
+
+/* aarch64: moved to one of the DC instruction variants. */
+dc isw, x5
+
+
+
+
+
+
+
The interface to the System registers was originally defined as part of a generic coprocessor interface, that gave access to 16 coprocessors, CP0 - CP15. Of these, CP8 - CP15 were reserved for use by Arm, while CP0 - CP7 were available for IMPLEMENTATION DEFINED coprocessors.
+
+
+
+
+
and the actual coprocessor registers are specified in Chapter G7 "AArch32 System Register Encoding" at:
+
+
+
+-
+
CP14: Table G7-1 "Mapping of (coproc ==0b1110) MCR, MRC, and MRRC instruction arguments to System registers"
+
+-
+
CP15: Table G7-3 "VMSAv8-32 (coproc==0b1111) register summary, in MCR/MRC parameter order."
+
+
+
+
+
The actual MRC assembly does not exactly match the order of that table. This is how you can decode it, using a sample MCR:
+
+
+
+
mcr p15, 0, r5, c7, c6, 2
+
+
+
+
+
+
mcr p<coproc>, <opc1>, <src-dest-reg>, <CRn>, <CRm>, <opc2>
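The operand order above can be made concrete with a small parser. This is only an illustrative sketch: the function name decode_coproc_access and the dictionary keys are made up here, and it only handles the simple six-operand MCR/MRC form shown above with decimal immediates.

```python
import re

# mcr|mrc p<coproc>, <opc1>, <src-dest-reg>, <CRn>, <CRm>, <opc2>
COPROC_ACCESS_RE = re.compile(
    r"^\s*(mcr|mrc)\s+"
    r"p(\d+)\s*,\s*"      # coprocessor number
    r"(\d+)\s*,\s*"       # opc1
    r"(r\d+)\s*,\s*"      # source (MCR) or destination (MRC) register
    r"c(\d+)\s*,\s*"      # CRn
    r"c(\d+)\s*,\s*"      # CRm
    r"(\d+)\s*$",         # opc2
    re.IGNORECASE,
)


def decode_coproc_access(line):
    """Split an MCR/MRC assembly line into its named fields."""
    match = COPROC_ACCESS_RE.match(line)
    if match is None:
        raise ValueError(f"not a six-operand MCR/MRC: {line!r}")
    mnemonic, coproc, opc1, reg, crn, crm, opc2 = match.groups()
    return {
        "mnemonic": mnemonic.lower(),
        "coproc": int(coproc),
        "opc1": int(opc1),
        "reg": reg.lower(),
        "CRn": int(crn),
        "CRm": int(crm),
        "opc2": int(opc2),
    }


if __name__ == "__main__":
    print(decode_coproc_access("mcr p15, 0, r5, c7, c6, 2"))
```

The resulting CRn, opc1, CRm and opc2 values are the ones to look up in the CP15 table mentioned above.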
+
+
+
+
+
+
+
+-
+
op0
+
+-
+
op1
+
+-
+
CRn
+
+-
+
CRm
+
+-
+
op2
+
+
+
+
+
+
+
+
uint32_t id_isar6_el1;
+__asm__ ("mrs %0, s3_0_c0_c2_7" : "=r" (id_isar6_el1) : :);
+LKMC_DUMP_SYSTEM_REGS_PRINTF("ID_ISAR6_EL1 0x%" PRIX32 "\n", id_isar6_el1);
+
+
+
+
This can be useful to refer to new system registers which your older version of GNU GAS does not yet have a name for.
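The generic name used in the mrs above is just the five encoding fields joined together, so it can be generated from the values in the manual. A minimal sketch follows, where the helper name generic_sysreg_name is made up for illustration, and the range checks reflect my understanding of the field widths (op0 is 2 or 3 for this form, op1 and op2 are 3 bits, CRn and CRm are 4 bits):

```python
def generic_sysreg_name(op0, op1, crn, crm, op2):
    """Build the s<op0>_<op1>_c<CRn>_c<CRm>_<op2> name for a system register."""
    # Field ranges are assumptions based on the A64 MRS/MSR encoding.
    if op0 not in (2, 3):
        raise ValueError("op0 must be 2 or 3")
    if not 0 <= op1 <= 7 or not 0 <= op2 <= 7:
        raise ValueError("op1 and op2 must be in 0..7")
    if not 0 <= crn <= 15 or not 0 <= crm <= 15:
        raise ValueError("CRn and CRm must be in 0..15")
    return f"s{op0}_{op1}_c{crn}_c{crm}_{op2}"


if __name__ == "__main__":
    # Fields of ID_ISAR6_EL1 as used in the inline assembly above.
    print(generic_sysreg_name(3, 0, 0, 2, 7))
```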
+
+
+
+
@@ -35326,7 +37366,13 @@ AArch64, see Procedure Call Standard for the ARM 64-bit Architecture.
+
+
@@ -35341,7 +37387,7 @@ AArch64, see Procedure Call Standard for the ARM 64-bit Architecture.
-
+
@@ -35350,15 +37396,31 @@ AArch64, see Procedure Call Standard for the ARM 64-bit Architecture.
-
+
ARM also releases documentation specific to each given processor.
+
+
For every processor, there are basically two key documents:
+
+
-
+
@@ -35367,6 +37429,18 @@ AArch64, see Procedure Call Standard for the ARM 64-bit Architecture.
+
+
@@ -35634,6 +37708,15 @@ cc
It is worth noting that e.g. ARM has a Semihosting mechanism for loading CLI arguments through SYS_GET_CMDLINE, but our mechanism works in principle for any ISA.
+
+
+
+
+
I think we have to keep the CLI args below 32 GiB, otherwise argc cannot be correctly set up. But currently the gem5 text segment is exactly at 32 GiB, and we always place the CLI args higher in the Baremetal linker script.
+
+
@@ -37650,7 +39742,7 @@ ISB
In baremetal, we detect if tests failed by parsing logs for the Magic failure string.
@@ -38195,6 +40287,14 @@ instructions 124346081
+
+
On Ubuntu 20.04 gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772 this took 22 minutes 53 seconds:
+
+
+
+
./run -aa -eg --cpus 2 --tmux --quit-after-boot -- --cpu-type DerivO3CPU --caches
+
+
-
./run-toolchain --arch aarch64 gdb -- -nh -batch -ex 'disas/rs busy_loop' "$(./getvar --arch aarch64 userland_build_dir)/gcc/busy_loop.out"
+
./disas --arch aarch64 --userland userland/gcc/busy_loop.c busy_loop
@@ -39126,10 +41226,35 @@ west build -b qemu_aarch64 samples/hello_world
http://www.lighterra.com/papers/modernmicroprocessors/ explains it well.
-
You basically decode
+
You basically decode multiple instructions in one go, and run them at the same time if they can go in separate functional units and have no conflicts. Genius!
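As a toy illustration of that rule, the check below decides whether two consecutive instructions could be issued in the same cycle. Everything here is a simplifying assumption and not how gem5 or any real CPU models it: the Inst tuple and can_issue_together are made-up names, there is exactly one functional unit per operation class, and only read-after-write and write-after-write register conflicts are considered.

```python
from collections import namedtuple

# text: assembly, unit: functional unit class, dest: written register,
# srcs: tuple of read registers.
Inst = namedtuple("Inst", ["text", "unit", "dest", "srcs"])


def can_issue_together(first, second):
    """True if `second` can run in the same cycle as `first` in this toy model."""
    same_unit = first.unit == second.unit          # structural conflict
    raw_hazard = first.dest in second.srcs         # second reads first's result
    waw_hazard = first.dest == second.dest         # both write the same register
    return not (same_unit or raw_hazard or waw_hazard)


add = Inst("add x1, x2, x3", "IntAlu", "x1", ("x2", "x3"))
mul_independent = Inst("mul x4, x5, x6", "IntMult", "x4", ("x5", "x6"))
mul_dependent = Inst("mul x4, x1, x6", "IntMult", "x4", ("x1", "x6"))
add_other = Inst("add x7, x8, x9", "IntAlu", "x7", ("x8", "x9"))

if __name__ == "__main__":
    for second in (mul_independent, mul_dependent, add_other):
        print(f"{add.text} | {second.text}: {can_issue_together(add, second)}")
```

Real cores typically have several units of the same class, so the same-unit restriction is much weaker in practice.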
+
+
+
Although this is a microarchitectural feature, it is so important that it is publicly documented. For example:
+
+
+
+
+
+
+
gem5 calls them "functional units".
+
+
+
+
Allows working around data dependencies: if the next instruction depends on the current one, you can go ahead and execute the one after it.
+
+
+
Likely used on basically all (?) 2020 non-power-constrained CPUs.
+
+
+
+
+
-
+
+
+
+
Basically means decoding and then potentially executing a bunch of instructions in one go.
+
+
+
+
+
+
Intel name: "Hyperthreading"
@@ -39192,7 +41388,7 @@ west build -b qemu_aarch64 samples/hello_world
-
+
@@ -39234,7 +41430,7 @@ west build -b qemu_aarch64 samples/hello_world
Even if caches are coherent, this is still not enough to avoid data race conditions, because this does not enforce atomicity of read modify write sequences. This is for example shown at: Detailed gem5 analysis of how data races happen.
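The point can be shown with a deterministic toy model, in which memory is a single always-coherent dictionary, and yet an increment is still lost. This is a sketch under stated assumptions, not a model of real hardware: run_interleaving is a made-up name, each thread does a three-step load, add, store sequence on a shared counter, and the schedule lists which thread executes its next step.

```python
def run_interleaving(schedule):
    """Run two threads that each increment a shared counter once.

    `schedule` is a sequence of thread ids (0 or 1), each entry advancing
    that thread by one step of: load, add 1, store.
    """
    memory = {"counter": 0}      # single shared copy: trivially coherent
    registers = {0: 0, 1: 0}     # one private register per thread
    next_step = {0: 0, 1: 0}
    for tid in schedule:
        step = next_step[tid]
        if step == 0:
            registers[tid] = memory["counter"]      # read
        elif step == 1:
            registers[tid] += 1                     # modify
        elif step == 2:
            memory["counter"] = registers[tid]      # write
        next_step[tid] += 1
    return memory["counter"]


if __name__ == "__main__":
    print("one after the other:", run_interleaving([0, 0, 0, 1, 1, 1]))
    print("interleaved:", run_interleaving([0, 1, 0, 1, 0, 1]))
```

In the interleaved schedule both threads read the counter before either has written it back, so one of the two increments is overwritten.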
-
+
@@ -39242,14 +41438,14 @@ west build -b qemu_aarch64 samples/hello_world
This is represented explicitly in C++ for example C++ std::memory_order.
-
+
Either they can snoop only control, or both control and data can be snooped.
@@ -39264,7 +41460,7 @@ west build -b qemu_aarch64 samples/hello_world
-
+
@@ -39511,7 +41707,7 @@ west build -b qemu_aarch64 samples/hello_world
-
+
@@ -39823,7 +42019,7 @@ CACHE2 S nyy
TODO gem5 concrete example.
-
+
TODO understand well why those are needed.
@@ -39843,7 +42039,7 @@ CACHE2 S nyy
-
+
@@ -39903,7 +42099,7 @@ CACHE2 S nyy
-
+
@@ -39963,7 +42159,7 @@ CACHE2 S nyy
-
+
@@ -40495,7 +42691,7 @@ export CCACHE_MAXSIZE="20G"
-
While you could just manually find/learn the path to toolchain tools, e.g. in LKMC b15a0e455d691afa49f3b813ad9b09394dfb02b7 they are
+
While you could just manually find/learn the path to toolchain tools, e.g. in LKMC b15a0e455d691afa49f3b813ad9b09394dfb02b7 they are:
@@ -40513,6 +42709,17 @@ export CCACHE_MAXSIZE="20G"
+
+
+
./run-toolchain --arch aarch64 objdump -- -D $(./getvar --arch aarch64 userland_build_dir)/c/hello.out
+
+
+
+
however disassembly is such a common use case that we have a shortcut for it: disas.
+
+
Alternatively, if you just need a variable to feed into your own Build system, you can also use getvar:
@@ -40528,6 +42735,36 @@ export CCACHE_MAXSIZE="20G"
/path/to/linux-kernel-module-cheat/out/buildroot/build/default/aarch64/host/usr/bin/aarch64-buildroot-linux-gnu
+
+
+
+
+
+
+
./disas --arch aarch64 --userland userland/c/hello.c main
+
+
+
+
+
+
./disas --arch aarch64 start_kernel
+
+
+
+
+
+
./disas --arch aarch64 --baremetal baremetal/arch/aarch64/no_bootloader/exit.S _start
+
+
+
@@ -40889,9 +43126,36 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
-
+
+
+
The --optimization-level option is available on all build scripts and sets the given GCC -O optimization level where it has been implemented for guest binaries.
+
+
+
The default optimization level is -O0 to improve guest visibility.
+
+
+
To keep things sane, you generally want to create a separate build variant for each optimization level, e.g. to create an -O3 build:
+
+
+
+
./build-userland --optimization-level 3 --userland-build-id o3
+./run --userland userland/c/hello.c --userland-build-id o3
+
+
+
+
+
Our emulators however are built with higher optimization levels by default, otherwise running anything would be unbearably slow.
+
+
+
Emulator builds are also controlled with other mechanisms instead of --optimization-level as explained at: Debug the emulator.
+
+
+
+
-
+
lkmc/ contains sources and headers that are shared across kernel modules, userland and baremetal examples.
@@ -40902,7 +43166,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
Another option would have been to name it as includes/lkmc, but that would make paths longer, and we might want to store source code in that directory as well in the future.
-
+
When factoring out functionality across userland examples, there are two main options:
@@ -40961,7 +43225,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
-
+
@@ -41010,7 +43274,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
A custom build script can give you more flexibility: e.g. the package can be made work with other root filesystems more easily, have better 9P support, and rebuild faster as it evades some Buildroot boilerplate.
-
+
@@ -41057,9 +43321,9 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
-
+
-
+
Has the following structure:
@@ -41076,7 +43340,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
-
+
Patches in this directory are never applied automatically: it is up to users to manually apply them before usage following the instructions in this documentation.
@@ -41086,7 +43350,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
-
+
@@ -41133,7 +43397,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
This way you can just hack away the scripts and try them out immediately without any further operations.
-
+
This path can be found with:
@@ -41167,7 +43431,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
-
+
@@ -41197,7 +43461,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
-
+
@@ -41230,7 +43494,7 @@ git -C "$(./getvar buildroot_source_dir)" checkout -
-
+
In order to build and run each userland and baremetal example properly, we need per-file metadata such as compiler flags and required number of cores.
@@ -41293,7 +43557,7 @@ baremetal=True
-
+
Print out several parameters that normally change randomly from boot to boot:
@@ -41321,9 +43585,9 @@ baremetal=True
-
+
-
+
@@ -41379,7 +43643,7 @@ echo $?
test does not run all possible tests, because there are too many possible variations and that would take forever. The rationale is the same as for ./build all and is explained in ./build --help.
-
+
You can select multiple archs and emulators of interest, as for any other command, with:
@@ -41412,7 +43676,7 @@ echo $?
-
+
By default, tests continue running even after the first failure happens, and they show a summary at the end.
@@ -41426,7 +43690,7 @@ echo $?
-
+
TODO: we really need a mechanism to automatically generate the test list, e.g. based on path_properties.py. Currently there are many tests missing, and we have to add everything manually, which is very annoying.
@@ -41455,7 +43719,7 @@ echo $?
-
+
We have some pexpect automated tests for GDB for both userland and baremetal programs!
@@ -41528,7 +43792,7 @@ echo $?
-
+
We do not know of any way to set the emulator exit status in QEMU arm full system.
@@ -41631,9 +43895,9 @@ echo $?
-
+
-
+
For the Linux kernel, do the following manual tests for now.
@@ -41671,7 +43935,7 @@ echo $?
-
+
You should also test that the Internet works:
@@ -41682,7 +43946,7 @@ echo $?
-
+
build-userland and test-executables have a wide variety of target selection modes, and it was hard to keep them all working without some tests:
@@ -41700,7 +43964,7 @@ echo $?
-
+
When updating the Linux kernel, QEMU and gem5, things sometimes break.
@@ -41756,7 +44020,7 @@ git submodule update
-
+
This is a template update procedure for submodules for which we have some patches on top of mainline.
@@ -41785,9 +44049,9 @@ git commit -m "linux: update to ${next_mainline_revision}"
-
+
-
+
@@ -41798,7 +44062,7 @@ git commit -m "linux: update to ${next_mainline_revision}"
-
The ./build-test command builds a superset of what will be downloaded which also tests other things we would like to be working on the release. For the minimal build to generate the files to be uploaded, see: Section 33.18.2, “release-zip”
+
The ./build-test command builds a superset of what will be downloaded which also tests other things we would like to be working on the release. For the minimal build to generate the files to be uploaded, see: Section 33.19.2, “release-zip”
-
+
@@ -41893,7 +44157,7 @@ git push --follow-tags
-
+
@@ -41941,9 +44205,9 @@ git push --follow-tags
-
+
-
+
This project was created to help me understand, modify and test low level system components by using system simulators.
@@ -42019,7 +44283,7 @@ git push --follow-tags
-
+
The trade-offs between the different setups are basically a balance between:
@@ -42044,13 +44308,13 @@ git push --follow-tags
compatibility: how likely is it that all the components will work well together: emulator, compiler, kernel, standard library, …
-guest software availability: how wide is your choice of easily installed guest software packages? See also: Section 33.19.4, “Linux distro choice”
+guest software availability: how wide is your choice of easily installed guest software packages? See also: Section 33.20.4, “Linux distro choice”
-
+
Choosing which features go into our default builds means making tradeoffs, here are our guidelines:
@@ -42095,7 +44359,7 @@ git push --follow-tags
-
+
@@ -42198,9 +44462,9 @@ git push --follow-tags
-
+
-
+
@@ -42237,7 +44501,7 @@ git push --follow-tags
-
+
Being the hardcore person who fully understands an important complex system such as a computer, it does have a nice ring to it doesn’t it?
@@ -42266,6 +44530,9 @@ git push --follow-tags
In that sense, therefore, the kernel is not as open as one might want to believe.
+
it is impossible to become rich with this knowledge.
@@ -42314,7 +44581,7 @@ git push --follow-tags
-
Are you fine with those points, and ready to continue wasting your life?
+
Are you fine with those points, and ready to continue wasting your life with this crap?
Good. In that case, read on, and let’s have some fun together ;-)
@@ -42322,7 +44589,7 @@ git push --follow-tags
-
+
@@ -42366,6 +44633,9 @@ git push --follow-tags