functional units stub

2026-01-23 02:05:57 +01:00 · 2020-06-10 02:00:01 +00:00
parent 6a5b9673c7
commit 0a3ce2f41f
3 changed files with 127 additions and 4 deletions
--- a/README.adoc
+++ b/README.adoc
@@ -12643,6 +12643,56 @@ xdg-open "$(./getvar --arch arm --emulator gem5 m5out_dir)/config.dot.svg"
 An example of such file can be seen at: <<config-dot-svg-timingsimplecpu>>.
 On Ubuntu 20.04, you can also see the dot file "directly" with xdot:
 ....
 xdot "$(./getvar --arch arm --emulator gem5 m5out_dir)/config.dot"
 ....
 which is really cool because it lets you follow arrows by clicking on them.
 It is worth noting that if you are running a bunch of short simulations, dot/SVG/PDF generation could have a significant impact on simulation startup time, so it is something to watch out for. As per https://gem5-review.googlesource.com/c/public/gem5/+/29232 it can be turned off with:
 ....
 gem5.opt --dot-config=''
 ....
 or in LKMC:
 ....
 ./run --gem5-exe-args='--dot-config= --json-config= --dump-config='
 ....
 The time difference can be readily observed on minimal examples by running gem5 with `time`.
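A minimal Python sketch of such a measurement. It assumes you run it from the LKMC root and that the freestanding hello world used elsewhere in this README is a good enough minimal example. `time_command` is a hypothetical helper, not part of LKMC. It keeps the best of a few runs to reduce noise:

```python
import os
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs of argv."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


# Only attempt the gem5 comparison when run from the LKMC root (assumption).
if __name__ == "__main__" and os.path.isfile("run"):
    base = [
        "./run",
        "--arch", "aarch64",
        "--emulator", "gem5",
        "--userland", "userland/arch/aarch64/freestanding/linux/hello.S",
    ]
    no_dot = base + ["--gem5-exe-args=--dot-config= --json-config= --dump-config="]
    print("with config dumps:    %.2fs" % time_command(base))
    print("without config dumps: %.2fs" % time_command(no_dot))
```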
 Looking into gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1 `src/python/m5/util/dot_writer.py`, we can try to remove the SVG/PDF conversion to see if it dominates the runtime:
 ....
 def do_dot(root, outdir, dotFilename):
    if not pydot:
        warn("No dot file generated. " +
             "Please install pydot to generate the dot file and pdf.")
        return
    # * use ranksep > 1.0 for vertical separation between nodes
    # especially useful if you need to annotate edges using e.g. visio
    # which accepts svg format
    # * no need for horizontal separation as nothing moves horizontally
    callgraph = pydot.Dot(graph_type='digraph', ranksep='1.3')
    dot_create_nodes(root, callgraph)
    dot_create_edges(root, callgraph)
    dot_filename = os.path.join(outdir, dotFilename)
    callgraph.write(dot_filename)
    try:
        # dot crashes if the figure is extremely wide.
        # So avoid terminating simulation unnecessarily
        callgraph.write_svg(dot_filename + ".svg")
        callgraph.write_pdf(dot_filename + ".pdf")
    except:
        warn("failed to generate dot output from %s", dot_filename)
 ....
 but nope, they don't: `dot_create_nodes` and `dot_create_edges` are the culprits, so the only way to gain speed is to remove `.dot` generation altogether. It is tempting to do this by default in LKMC and add an option to enable dot generation when desired, so that we are a bit faster by default... but I'm too lazy to document the option right now. Maybe when it annoys me further :-)
 === m5term
 We use the `m5term` in-tree executable to connect to the terminal instead of a direct `telnet`.
@@ -13217,7 +13267,7 @@ Implementations:
 Useful to <<gem5-restore-checkpoint-with-a-different-cpu,boot Linux fast and then checkpoint and switch to a more detailed CPU>>.
-====== gem5 `TiminSimpleCPU`
+====== gem5 `TimingSimpleCPU`
 `TimingSimpleCPU`: memory accesses are realistic, but the CPU has no pipeline. The simulation is faster than detailed models, but slower than `AtomicSimpleCPU`.
@@ -13241,6 +13291,8 @@ The weird name "Minor" stands for "M (TODO what is M) IN ONder".
 Its 4 stage pipeline is described at the "MinorCPU" section of <<gem5-arm-rsk>>.
 A commented execution example can be seen at: <<gem5-event-queue-minorcpu-syscall-emulation-freestanding-example-analysis>>.
 There is also an in-tree doxygen at: https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/doc/inside-minor.doxygen[`src/doc/inside-minor.doxygen`] and rendered at: http://pages.cs.wisc.edu/~swilson/gem5-docs/minor.html
 As of 2019, in-order cores are mostly present in low power/cost contexts, for example the little cores of https://en.wikipedia.org/wiki/ARM_big.LITTLE[ARM big.LITTLE].
@@ -13267,10 +13319,12 @@ Implemented by Pierre-Yves Péneau from LIRMM, which is a research lab in Montpe
 ===== gem5 DerivO3CPU
-Generic out-of-order core. "O3" Stands for "Out Of Order"!
+Generic <<out-of-order-execution,out-of-order core>>. "O3" stands for "Out Of Order"!
 Analogous to <<gem5-minorcpu,MinorCPU>>, but modelling an out of order core instead of in order.
 A commented execution example can be seen at: <<gem5-event-queue-derivo3cpu-syscall-emulation-freestanding-example-analysis>>.
 Existing parametrizations:
 * `ex5_big`: the big core corresponding to `ex5_LITTLE`, by the same author at the same time. Its description reads:
@@ -15789,6 +15843,16 @@ Fault STXRX64::completeAcc(PacketPtr pkt, ExecContext *xc,
 From GDB on <<timingsimplecpu-analysis-ldr-stall>> we see that `completeAcc` gets called from `TimingSimpleCPU::completeDataAccess`.
 ===== gem5 microops
 TODO
 Some gem5 instructions break down into multiple microops.
 Microops are very similar to regular instructions, and show up in the <<gem5-execall-trace-format>>, since that flag implies `ExecMicro`.
 On aarch64 for example, one of the simplest microoped instructions is <<armv8-aarch64-ldp-and-stp-instructions,STP>>, which does the relatively complex operation of storing two values to memory at once, and is therefore a good candidate for being broken down into microops.
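To make the idea concrete, here is a toy Python model of cracking `STP Xt1, Xt2, [Xn, #imm]` into two independent 8-byte store microops. This is only an illustration of the concept. `StoreMicroop`, `crack_stp` and `execute` are hypothetical names, and gem5's actual microop sequence for STP may differ (for example, it may include a separate address-computation microop):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreMicroop:
    """Hypothetical microop: store register `src` to address regs[base] + offset."""
    src: str
    base: str
    offset: int


def crack_stp(rt1, rt2, rn, imm, size=8):
    """Toy decomposition of STP rt1, rt2, [rn, #imm] into two store microops."""
    return [
        StoreMicroop(rt1, rn, imm),
        StoreMicroop(rt2, rn, imm + size),  # second register goes right after the first
    ]


def execute(microops, regs, mem):
    """Run the microops one by one on a dict-based register file and memory."""
    for uop in microops:
        addr = regs[uop.base] + uop.offset
        mem[addr] = regs[uop.src]
    return mem


regs = {"x0": 1, "x1": 2, "sp": 0x1000}
mem = execute(crack_stp("x0", "x1", "sp", 16), regs, {})
print({hex(a): v for a, v in mem.items()})  # {'0x1010': 1, '0x1018': 2}
```

Each microop is as simple as a plain `STR`, which is why a complex instruction like STP is a natural candidate for cracking.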
 ==== gem5 port system
 The gem5 memory system is connected in a very flexible way through the port system.
@@ -16652,6 +16716,36 @@ BaseSimpleCPU::BaseSimpleCPU(BaseSimpleCPUParams *p)
    }
 ....
 ==== gem5 functional units
 TODO
 Each instruction is marked with a class, and each class can execute in a given functional unit.
 Which units are available is visible for example on the <<gem5-config-ini>> of a <<gem5-minorcpu>> run. Functional units are not present in simple CPUs like <<gem5-timingsimplecpu>>.
 For example, on gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1, the `config.ini` of a minor run:
 ....
 ./run   \
  --arch aarch64 \
  --emulator gem5 \
  --userland userland/arch/aarch64/freestanding/linux/hello.S \
  --trace-insts-stdout \
  -N1 \
  -- \
  --cpu-type MinorCPU \
  --caches
 ....
 contains:
 ....
 [system.cpu]
 type=MinorCPU
 children=branchPred dcache dtb executeFuncUnits icache interrupts isa itb power_state tracer workload
 ....
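A minimal sketch for checking this programmatically with Python's standard `configparser`. It assumes the file is the `config.ini` under the `m5out_dir` reported by `./getvar --arch aarch64 --emulator gem5 m5out_dir`. The helper name `cpu_children` is hypothetical. Presumably the individual functional units are listed further down as children of `[system.cpu.executeFuncUnits]`, and the same helper can read them by passing that section name. We have not double checked that exact layout here:

```python
import configparser


def cpu_children(config_text, section="system.cpu"):
    """Return the space-separated `children` entry of a gem5 config.ini section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep gem5's mixed-case keys as-is
    parser.read_string(config_text)
    return parser.get(section, "children", fallback="").split()


sample = """
[system.cpu]
type=MinorCPU
children=branchPred dcache dtb executeFuncUnits icache interrupts isa itb power_state tracer workload
"""

children = cpu_children(sample)
print("executeFuncUnits" in children)  # True
```

For a real run, replace `sample` with the contents of the `config.ini` file, e.g. `open(path).read()`.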
 ==== gem5 code generation
 gem5 uses a ton of code generation, which makes the project horrendous:
@@ -24159,6 +24253,35 @@ Oh my God, unoptimized code is so horrendously inefficient, even I can't stand a
 == Computer architecture
 === Instruction pipelining
 In gem5, can be seen on:
 * <<gem5-minorcpu>>
 * <<gem5-derivo3cpu>>
 ==== Classic RISC pipeline
 https://en.wikipedia.org/wiki/Classic_RISC_pipeline
 gem5's <<gem5-minorcpu>> implements a similar pipeline, but with 4 stages instead of the classic 5. TODO why didn't they go with the classic RISC pipeline instead?
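For reference, here is a toy Python rendering of the textbook 5-stage timing diagram. It assumes an ideal pipeline with no hazards or stalls. The instruction strings are just labels and are not decoded:

```python
# Textbook 5-stage classic RISC pipeline, assuming no hazards and no stalls.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return one text row per instruction plus the total cycle count.

    Instruction i enters IF at cycle i and advances one stage per cycle,
    so n instructions finish after n + len(STAGES) - 1 cycles instead of
    the n * len(STAGES) cycles an unpipelined CPU would need.
    """
    ncycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, insn in enumerate(instructions):
        cells = ["   "] * ncycles  # three-character blank per cycle column
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage.ljust(3)
        rows.append(f"{insn:<16}" + " ".join(cells))
    return rows, ncycles


rows, ncycles = pipeline_diagram(["add x1, x2, x3", "ldr x4, [x5]", "sub x6, x7, x8"])
for row in rows:
    print(row)
print(ncycles)  # 7
```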
 === Superscalar processor
 https://en.wikipedia.org/wiki/Superscalar_processor
 http://www.lighterra.com/papers/modernmicroprocessors/ explains it well.
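 You basically fetch and decode multiple instructions per cycle, and dispatch them in parallel to multiple functional units.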
 TODO in gem5? gem5 definitely has functional units explicitly modelled: <<gem5-functional-units>>, so do <<gem5-minorcpu>> or <<gem5-derivo3cpu>> have it?
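A toy Python model of in-order superscalar issue. It assumes independent instructions, with no data dependencies, and made-up per-class functional unit counts. The class names `int`, `mem` and `fp` are hypothetical and do not correspond to gem5's actual op class names:

```python
def issue_cycles(op_classes, width, fu_counts):
    """Return the cycle in which each instruction issues.

    Toy in-order superscalar model: instructions are independent, at most
    `width` of them issue per cycle, and at most fu_counts[c] of them may use
    functional-unit class c in the same cycle. Every count must be >= 1.
    """
    cycles = []
    cycle = 0
    issued = 0   # instructions issued in the current cycle
    used = {}    # functional-unit class -> units busy in the current cycle
    for op in op_classes:
        if issued == width or used.get(op, 0) == fu_counts[op]:
            # Issue width exhausted or no free unit of this class: next cycle.
            cycle += 1
            issued = 0
            used = {}
        cycles.append(cycle)
        used[op] = used.get(op, 0) + 1
        issued += 1
    return cycles


ops = ["int", "int", "mem", "int", "fp"]
fus = {"int": 1, "mem": 1, "fp": 1}
print(issue_cycles(ops, 1, fus))  # [0, 1, 2, 3, 4]
print(issue_cycles(ops, 2, fus))  # [0, 1, 1, 2, 2]
print(issue_cycles(ops, 2, {"int": 2, "mem": 1, "fp": 1}))  # [0, 0, 1, 1, 2]
```

With width 2, the second `int` still has to wait, because only one integer unit exists. Adding a second unit removes that stall. This interaction between issue width and functional unit counts is exactly why a superscalar model needs the functional units explicitly modelled.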
 === Out-of-order execution
 https://en.wikipedia.org/wiki/Out-of-order_execution
 gem5's model is <<gem5-derivo3cpu>>.
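A heavily idealized Python sketch of the core idea. It assumes single issue, an unlimited instruction window and perfect register renaming, so only true (read-after-write) dependencies matter. It is not how <<gem5-derivo3cpu>> is actually implemented. Each instruction is a hypothetical `(dest, srcs, latency)` tuple:

```python
def schedule(program, out_of_order):
    """Return the cycle at which each instruction starts executing.

    program: list of (dest, srcs, latency). One instruction issues per cycle.
    In order: the oldest unissued instruction blocks all younger ones.
    Out of order: each cycle, the oldest instruction whose inputs are ready issues.
    """
    # producers[i] = indices of the older instructions whose results i reads.
    last_writer = {}
    producers = []
    for i, (dest, srcs, _latency) in enumerate(program):
        producers.append([last_writer[r] for r in srcs if r in last_writer])
        last_writer[dest] = i
    start = [None] * len(program)
    cycle = 0
    while None in start:
        for i in range(len(program)):
            if start[i] is not None:
                continue
            ready = all(
                start[p] is not None and start[p] + program[p][2] <= cycle
                for p in producers[i]
            )
            if ready:
                start[i] = cycle
                break  # single issue: at most one instruction per cycle
            if not out_of_order:
                break  # in order: a waiting instruction stalls everything after it
        cycle += 1
    return start


program = [
    ("x1", ["x0"], 3),        # ldr x1, [x0]   (slow load)
    ("x2", ["x1"], 1),        # add x2, x1, 1  (depends on the load)
    ("x3", ["x4", "x5"], 1),  # mul x3, x4, x5 (independent)
    ("x6", ["x7", "x8"], 1),  # sub x6, x7, x8 (independent)
]
print(schedule(program, out_of_order=False))  # [0, 3, 4, 5]
print(schedule(program, out_of_order=True))   # [0, 3, 1, 2]
```

The out-of-order schedule fills the load latency with the two independent instructions, so everything is done by cycle 4 instead of cycle 6.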
 === Hardware threads
 Intel name: "Hyper-Threading"
--- a/userland/cpp/atomic/main.hpp
+++ b/userland/cpp/atomic/main.hpp
@@ -28,7 +28,7 @@ void threadMain(size_t niters) {
            "incq %0;"
            : "+g" (global),
              "+g" (i) // to prevent loop unrolling, and make results more comparable across methods,
-                       // see also: https://cirosantilli.com/linux-kernel-module-cheat#infinite-busy-loop
+                       // see also: https://cirosantilli.com/linux-kernel-module-cheat#c-busy-loop
            :
            :
        );
--- a/userland/gcc/busy_loop.c
+++ b/userland/gcc/busy_loop.c
@@ -1,5 +1,5 @@
 /* https://cirosantilli.com/linux-kernel-module-cheat#micro-benchmarks
- * https://cirosantilli.com/linux-kernel-module-cheat#infinite-busy-loop
+ * https://cirosantilli.com/linux-kernel-module-cheat#c-busy-loop
 * https://cirosantilli.com/linux-kernel-module-cheat#benchmark-emulators-on-userland-executables */
 #include <stdlib.h>