gensim start

Ciro Santilli 六四事件 法轮功
2020-07-17 01:00:00 +00:00
parent 988359440b
commit fa5333e60f

@@ -9481,7 +9481,7 @@ https://en.wikipedia.org/wiki/QEMU[QEMU] is a system simulator: it simulates a C
If you are familiar with https://en.wikipedia.org/wiki/VirtualBox[VirtualBox], then QEMU basically does the same thing: it opens a "window" inside your desktop that can run an operating system inside your operating system.
Also both can use very similar techniques: either https://en.wikipedia.org/wiki/Binary_translation[binary translation] or <<KVM>>. VirtualBox' binary translator is / was based on QEMU's it seems: https://en.wikipedia.org/wiki/VirtualBox#Software-based_virtualization
Also both can use very similar techniques: either <<binary-translation>> or <<KVM>>. VirtualBox's binary translator is, or was, based on QEMU's, it seems: https://en.wikipedia.org/wiki/VirtualBox#Software-based_virtualization
The huge advantage of QEMU over VirtualBox is that it supports cross-arch simulation, e.g. simulating an ARM guest on an x86 host.
@@ -9495,6 +9495,12 @@ QEMU is also supported by Buildroot in-tree, see e.g.: https://github.com/buildr
All of this makes QEMU the natural choice of reference system simulator for this repo.
=== Binary translation
https://en.wikipedia.org/wiki/Binary_translation
Used by <<qemu>> and <<gensim>>.
=== Disk persistency
We disable disk persistency for both QEMU and gem5 by default, to prevent the emulator from putting the image in an unknown state.
@@ -13490,6 +13496,16 @@ Not sure why it has v7a in the name, since I believe the CPUs are just the micro
+
The CLI option is named slightly differently as: `--cpu-type O3_ARM_v7a_3`.
====== gem5 `DerivO3CPU` pipeline stages
* fetch: besides obviously fetching the instruction, this is also where branch prediction runs. Presumably because you need to branch predict before deciding what to fetch next.
* retire: the instruction is completely and totally done with.
+
Misspeculated instructions never reach this stage as can be seen at: <<gem5-event-queue-derivo3cpu-syscall-emulation-freestanding-example-analysis-speculative>>.
+
The `ExecAll` happens at this time as well, and therefore `ExecAll` does not happen for misspeculated instructions.
====== gem5 util/o3-pipeview.py O3 pipeline viewer
[[gem5-util-o3-pipeview-py-o3-pipeline-viewer]]
@@ -16450,6 +16466,85 @@ Then, at time 120000, the LDR data came back, after the wrong prediction had alr
The CPU then noticed that it mispredicted, and so it started again from the correct branch target `movz x2`, and the instructions that were thrown away are marked as `=====` in the timeline.
We can also see some <<branch-predictor>> log lines in the `O3CPUAll` log:
....
130000: Fetch: system.cpu.fetch: [tid:0] [sn:10] Branch at PC 0x40009c predicted to be not taken
130000: Fetch: system.cpu.fetch: [tid:0] [sn:10] Branch at PC 0x40009c predicted to go to (0x4000a0=>0x4000a4).(0=>1)
131500: Commit: system.cpu.commit: [tid:0] [sn:10] Inserting PC (0x40009c=>0x4000a0).(0=>1) into ROB.
131500: ROB: system.cpu.rob: Adding inst PC (0x40009c=>0x4000a0).(0=>1) to the ROB.
131500: ROB: system.cpu.rob: [tid:0] Now has 10 instructions.
132000: IEW: system.cpu.iew: [tid:0] Issue: Adding PC (0x40009c=>0x4000a0).(0=>1) [sn:10] [tid:0] to IQ.
132000: IQ: system.cpu.iq: Adding instruction [sn:10] PC (0x40009c=>0x4000a0).(0=>1) to the IQ.
132000: IQ: system.cpu.iq: Instruction PC (0x40009c=>0x4000a0).(0=>1) has src reg 6 (CCRegClass) that is being added to the dependency chain.
132000: IQ: system.cpu.iq: Instruction PC (0x40009c=>0x4000a0).(0=>1) has src reg 8 (CCRegClass) that is being added to the dependency chain.
132000: IQ: system.cpu.iq: Instruction PC (0x40009c=>0x4000a0).(0=>1) has src reg 7 (CCRegClass) that is being added to the dependency chain.
135500: IQ: system.cpu.iq: Waking up a dependent instruction, [sn:10] PC (0x40009c=>0x4000a0).(0=>1).
135500: IQ: global: [sn:10] has 1 ready out of 3 sources. RTI 0)
135500: IQ: system.cpu.iq: Waking any dependents on register 7 (CCRegClass).
135500: IQ: system.cpu.iq: Waking up a dependent instruction, [sn:10] PC (0x40009c=>0x4000a0).(0=>1).
135500: IQ: global: [sn:10] has 2 ready out of 3 sources. RTI 0)
135500: IQ: system.cpu.iq: Waking any dependents on register 8 (CCRegClass).
135500: IQ: system.cpu.iq: Waking up a dependent instruction, [sn:10] PC (0x40009c=>0x4000a0).(0=>1).
135500: IQ: global: [sn:10] has 3 ready out of 3 sources. RTI 0)
135500: IQ: system.cpu.iq: Instruction is ready to issue, putting it onto the ready list, PC (0x40009c=>0x4000a0).(0=>1) opclass:1 [sn:10].
135500: IEW: system.cpu.iew: Setting Destination Register 6 (CCRegClass)
135500: Scoreboard: system.cpu.scoreboard: Setting reg 6 (CCRegClass) as ready
135500: IEW: system.cpu.iew: Setting Destination Register 7 (CCRegClass)
135500: Scoreboard: system.cpu.scoreboard: Setting reg 7 (CCRegClass) as ready
135500: IEW: system.cpu.iew: Setting Destination Register 8 (CCRegClass)
135500: Scoreboard: system.cpu.scoreboard: Setting reg 8 (CCRegClass) as ready
135500: IQ: system.cpu.iq: Attempting to schedule ready instructions from the IQ.
135500: IQ: system.cpu.iq: Thread 0: Issuing instruction PC (0x40009c=>0x4000a0).(0=>1) [sn:10]
136000: IEW: system.cpu.iew: Execute: Processing PC (0x40009c=>0x4000a0).(0=>1), [tid:0] [sn:10].
136000: IEW: global: RegFile: Access to cc register 6, has data 0x2
136000: IEW: global: RegFile: Access to cc register 8, has data 0
136000: IEW: global: RegFile: Access to cc register 7, has data 0
136000: IEW: system.cpu.iew: Current wb cycle: 0, width: 8, numInst: 0
wbActual:0
136000: IEW: system.cpu.iew: [tid:0] [sn:10] Execute: Branch mispredict detected.
136000: IEW: system.cpu.iew: [tid:0] [sn:10] Predicted target was PC: (0x4000a0=>0x4000a4).(0=>1)
136000: IEW: system.cpu.iew: [tid:0] [sn:10] Execute: Redirecting fetch to PC: (0x40009c=>0x400080).(0=>1)
136000: IEW: system.cpu.iew: [tid:0] [sn:10] Squashing from a specific instruction, PC: (0x40009c=>0x400080).(0=>1)
136500: Commit: system.cpu.commit: [tid:0] Squashing due to branch mispred PC:0x40009c [sn:10]
136500: Commit: system.cpu.commit: [tid:0] Redirecting to PC 0x400084
136500: ROB: system.cpu.rob: Starting to squash within the ROB.
136500: ROB: system.cpu.rob: [tid:0] Squashing instructions until [sn:10].
136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000bc=>0x4000c0).(0=>1), seq num 18.
136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000b8=>0x4000bc).(0=>1), seq num 17.
136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000b4=>0x4000b8).(0=>1), seq num 16.
136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000b0=>0x4000b4).(0=>1), seq num 15.
136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000ac=>0x4000b0).(0=>1), seq num 14.
136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000a8=>0x4000ac).(0=>1), seq num 13.
136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000a4=>0x4000a8).(0=>1), seq num 12.
136500: ROB: system.cpu.rob: [tid:0] Squashing instruction PC (0x4000a0=>0x4000a4).(0=>1), seq num 11.
136500: ROB: system.cpu.rob: [tid:0] Done squashing instructions.
136500: Commit: system.cpu.commit: [tid:0] Marking PC (0x40009c=>0x400080).(0=>1), [sn:10] ready within ROB.
137000: Commit: system.cpu.commit: [tid:0] [sn:10] Committing instruction with PC (0x40009c=>0x400080).(0=>1)
130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+36 : b.lt 0x400080 : IntAlu : FetchSeq=10 CPSeq=10 flags=(IsControl|IsDirectControl|IsCondControl)
137000: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x40009c=>0x400080).(0=>1), [sn:10]
137000: O3CPU: system.cpu: Removing committed instruction [tid:0] PC (0x40009c=>0x400080).(0=>1) [sn:10]
137000: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:11]
137000: Commit: system.cpu.commit: Retiring squashed instruction from ROB.
137000: Commit: system.cpu.commit: Trying to commit head instruction, [tid:0] [sn:10]
137000: Commit: system.cpu.commit: [tid:0] [sn:10] Committing instruction with PC (0x40009c=>0x400080).(0=>1)
130000: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+36 : b.lt 0x400080 : IntAlu : FetchSeq=10 CPSeq=10 flags=(IsControl|IsDirectControl|IsCondControl)
138500: Fetch: system.cpu.fetch: [tid:0] [sn:26] Branch at PC 0x40009c predicted to be not taken
138500: Fetch: system.cpu.fetch: [tid:0] [sn:26] Branch at PC 0x40009c predicted to go to (0x4000a0=>0x4000a4).(0=>1)
142500: Commit: system.cpu.commit: [tid:0] [sn:26] Committing instruction with PC (0x40009c=>0x4000a0).(0=>1)
138500: ExecEnable: system.cpu: A0 T0 : @asm_main_after_prologue+36 : b.lt 0x400080 : IntAlu : FetchSeq=26 CPSeq=18 flags=(IsControl|IsDirectControl|IsCondControl)
142500: ROB: system.cpu.rob: [tid:0] Retiring head instruction, instruction PC (0x40009c=>0x4000a0).(0=>1), [sn:26]
....
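Logs like the one above get long fast, so it helps to pull out only the lines for one instruction. Here is a minimal Python sketch that does that by sequence number. It assumes every interesting line has the shape `<tick>: <flag>: <object>: <message>` as seen above, and `events_for` is just a helper name made up for this example, not something from gem5.

```python
import re

# Assumed line shape, taken from the log above:
#   <tick>: <debug flag>: <sim object>: <message>
LINE_RE = re.compile(r"^(?P<tick>\d+): (?P<flag>\w+): (?P<obj>[\w.]+): (?P<msg>.*)$")


def events_for(lines, seq_num):
    """Return (tick, flag, message) for every line whose message contains [sn:<seq_num>]."""
    tag = f"[sn:{seq_num}]"
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        # Lines that do not follow the assumed shape (e.g. "wbActual:0") are skipped.
        if match and tag in match["msg"]:
            events.append((int(match["tick"]), match["flag"], match["msg"]))
    return events


sample = [
    "130000: Fetch: system.cpu.fetch: [tid:0] [sn:10] Branch at PC 0x40009c predicted to be not taken",
    "136000: IEW: system.cpu.iew: [tid:0] [sn:10] Execute: Branch mispredict detected.",
    "136500: ROB: system.cpu.rob: [tid:0] Done squashing instructions.",
    "138500: Fetch: system.cpu.fetch: [tid:0] [sn:26] Branch at PC 0x40009c predicted to be not taken",
]

for tick, flag, msg in events_for(sample, 10):
    print(tick, flag, msg)
```

This is a plain substring match, so lines that only mention the instruction in passing, such as `Squashing instructions until [sn:10].`, are also included. Squashed instructions that are logged as `seq num 18` rather than `[sn:18]` are not matched.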
With an extra CLI (the branch is not taken):
....
@@ -18005,6 +18100,83 @@ The horrendous downsides of this are:
* when <<debug-the-emulator,debugging the emulator>>, it shows you directories inside the build directory rather than in the source tree
* it is harder to separate which files are <<gem5-code-generation,generated>> and which are in-tree when grepping for code generated definitions
=== Gensim
https://gensim.org
https://bitbucket.org/gensim/gensim
MIT licensed <<binary-translation>> simulator, so a bit like an MIT <<qemu>>.
Video showing it boot Linux fast: https://www.youtube.com/watch?v=aZXx17oYumc
Its name is unfortunately completely and totally overshadowed by unrelated software with the same name: https://radimrehurek.com/gensim/
TODO: advantages over QEMU. As the name implies, they seem to have a nice ISA description language. From a quick look at the internals, it seems to generate LLVM intermediate representation, which sounds good.
Build on Ubuntu 20.04:
....
sudo apt install libantlr3c-dev
cd submodules/gensim
make
....
First fails with:
....
arm-none-eabi-gcc: error: unrecognized -march target: armv5
....
Let's try just armv8, who cares about armv5!!!
....
mkdir build
cd build
cmake -DTESTING_ENABLED=FALSE -DCMAKE_BUILD_TYPE=DEBUGOPT ..
make -j`nproc` model-armv8
....
Now fails as mentioned at https://bitbucket.org/gensim/gensim/issues/34/build-fails-with-unrecognised-intrinsic[]:
....
terminate called after throwing an instance of 'std::logic_error'
what(): Unrecognised intrinsic: __builtin_abs64
Aborted (core dumped)
....
Get the failing command with:
....
make VERBOSE=1 model-armv8
....
and we see some code generation step:
....
cd /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/models/armv8 && \
/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/build/dist/bin/gensim \
-a /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/models/armv8/aarch64.ac \
-s module,arch,decode,disasm,ee_interp,ee_blockjit,jumpinfo,function,makefile \
-o decode.GenerateDotGraph=1,makefile.libtrace_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/support/libtrace/inc,makefile.archsim_path=/home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/archsim/inc,makefile.llvm_path=,makefile.Optimise=2,makefile.Debug=1 \
-t /home/ciro/bak/git/linux-kernel-module-cheat/submodules/gensim/build/models/armv8/output-aarch64/
....
We can see an inclusion path:
....
gensim/models/armv8/aarch64.ac
ac_isa("isa.ac");
gensim/models/armv8/isa.ac
ac_execute("execute.simd");
....
and where `gensim/models/armv8/execute.simd` contains `__builtin_abs64` usages.
GDB on `gensim` shows that the error comes from a call to `gci.GenerateExecuteBodyFor(body_str, *action);`, so it looks like there are some missing cases in `EmitFixedCode`.
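The inclusion chain above was found by hand, but it can also be followed mechanically. The following Python sketch assumes that includes always look like `ac_isa("file");` or `ac_execute("file");` as in the two lines quoted above, and that included paths are relative to the including file. Both are guesses from this one example, not documented gensim behaviour, and `find_symbol` is a name made up here.

```python
import re
from pathlib import Path

# Assumption: only these two directives pull in other files, with the
# exact syntax seen in aarch64.ac and isa.ac. Anything else is ignored.
INCLUDE_RE = re.compile(r'ac_(?:isa|execute)\("([^"]+)"\)')


def find_symbol(start, symbol, seen=None):
    """Follow includes from `start` and return the files whose text contains `symbol`."""
    seen = set() if seen is None else seen
    path = Path(start)
    # Skip files already visited (include cycles) and paths that do not exist.
    if path in seen or not path.is_file():
        return []
    seen.add(path)
    text = path.read_text(errors="replace")
    hits = [path] if symbol in text else []
    for included in INCLUDE_RE.findall(text):
        # Assumption: included paths are relative to the including file.
        hits += find_symbol(path.parent / included, symbol, seen)
    return hits
```

It would be called from the gensim checkout as `find_symbol("models/armv8/aarch64.ac", "__builtin_abs64")`, and returns the files in the chain that mention the intrinsic.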
This is completely broken academic code! They must be using an off-tree version of part of the tool and forgot to commit it.
== Buildroot
=== Introduction to Buildroot
@@ -19908,6 +20080,97 @@ Canonical source at https://sourceforge.net/projects/lmbench/ but Intel has a fo
Feels old, guessing not representative anymore like <<dhrystone>>. But hey, history!
Ubuntu 20.04 AMD64 native build and run:
....
git submodule update --init submodules/lmbench
cd submodules/lmbench
cd src
make results
....
TODO it hangs for a long time at:
....
Hang on, we are calculating your cache line size.
....
Bug report: https://github.com/intel/lmbench/issues/15
If I kill it, the configuration process continues:
....
Killed
OK, it looks like your cache line is bytes.
....
and continues with a few more interactive questions until finally:
....
Confguration done, thanks.
....
where it again hangs for at least 2 hours, so I lost patience and killed it.
TODO: how to do a non-interactive config? After the above procedure, `bin/x86_64-linux-gnu/CONFIG.ciro-p51` contains:
....
DISKS=""
DISK_DESC=""
OUTPUT=/dev/null
ENOUGH=50000
FASTMEM="NO"
FILE=/var/tmp/XXX
FSDIR=/var/tmp
INFO=INFO.ciro-p51
LINE_SIZE=
LOOP_O=0.00000000
MAIL=no
TOTAL_MEM=31903
MB=22332
MHZ="-1 System too busy"
MOTHERBOARD=""
NETWORKS=""
OS="x86_64-linux-gnu"
PROCESSORS="8"
REMOTE=""
SLOWFS="NO"
SYNC_MAX="1"
LMBENCH_SCHED="DEFAULT"
TIMING_O=0
RSH=rsh
RCP=rcp
VERSION=lmbench-3alpha4
BENCHMARK_HARDWARE=YES
BENCHMARK_OS=YES
BENCHMARK_SYSCALL=
BENCHMARK_SELECT=
BENCHMARK_PROC=
BENCHMARK_CTX=
BENCHMARK_PAGEFAULT=
BENCHMARK_FILE=
BENCHMARK_MMAP=
BENCHMARK_PIPE=
BENCHMARK_UNIX=
BENCHMARK_UDP=
BENCHMARK_TCP=
BENCHMARK_CONNECT=
BENCHMARK_RPC=
BENCHMARK_HTTP=
BENCHMARK_BCOPY=
BENCHMARK_MEM=
BENCHMARK_OPS=
....
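To see quickly which values the interactive configuration left empty, the file can be read as plain shell-style `KEY=VALUE` lines. This is a minimal Python sketch under that assumption. `parse_config` is a helper made up for this example and it does not answer the TODO of how to generate the file non-interactively.

```python
import shlex


def parse_config(text):
    """Parse shell-style KEY=VALUE lines into a dict, stripping quotes."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex handles both bare values and quoted ones with spaces,
        # e.g. MHZ="-1 System too busy". Empty values become "".
        parts = shlex.split(raw)
        config[key] = parts[0] if parts else ""
    return config


sample = '''LINE_SIZE=
MHZ="-1 System too busy"
PROCESSORS="8"
BENCHMARK_OS=YES
BENCHMARK_SYSCALL=
'''

config = parse_config(sample)
unset = sorted(key for key, value in config.items() if value == "")
print("unset:", unset)
```

On the full file above this would flag `LINE_SIZE` along with most of the `BENCHMARK_*` entries, which matches the cache line detection having been killed.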
Native build only without running tests:
....
cd src
make
....
Interestingly, one of the creators of LMbench, Larry McVoy (https://www.linkedin.com/in/larrymcvoy/[], https://en.wikipedia.org/wiki/Larry_McVoy[]), is also a co-founder of https://en.wikipedia.org/wiki/BitKeeper[BitKeeper]. Their SCM must be blazingly fast!!! Also his LinkedIn says Intel uses it. But they will forever be remembered as "the closed source Git precursor that died N years ago", RIP.
==== STREAM benchmark
http://www.cs.virginia.edu/stream/ref.html
@@ -22422,20 +22685,49 @@ aarch32 is a bit more messy due to older setups, we have both:
* coprocessor accesses:
** MRC: reads a system register. C means coprocessor, which is what system registers were previously known as
** MCR: writes to a system register
** MRRC: like MRC, but used for the system registers that are marked as 64-bit, and reads to two general purpose regis
** MRRC: like MRC, but used for the system registers that are marked as 64-bit, and reads to two general purpose registers
** MCRR: write version of MRRC
TODO why both? For example, as mentioned at https://stackoverflow.com/questions/62920281/cross-compilng-c-program-for-armv8-a-in-linux-x86-64-system/62922677#62922677 a register that was accessed with MRC in armv7 can move to MRS in aarch64, as is the case for:
....
mrc p15, 0, r0, c0, c0, 1 /* aarch32: CTR */
mrs x0, ctr_el0 /* aarch64 */
....
Other functionality has moved away from coprocessors into actual instructions, e.g. cache invalidation:
....
/* aarch32: DCISW, Data Cache line Invalidate by Set/Way. */
mcr p15, 0, r5, c7, c6, 2
/* aarch64: moved to one of the DC instruction variants. */
dc isw, x5
....
<<armarm8-fa>> G1.19.4 "Background to the System register interface" says that only CP14 and CP15 are specified by the ISA:
____
The interface to the System registers was originally defined as part of a generic coprocessor interface, that gave access to 16 coprocessors, CP0 - CP15. Of these, CP8 - CP15 were reserved for use by Arm, while CP0 - CP7 were available for IMPLEMENTATION DEFINED coprocessors.
____
and the actual coprocessor registers are specified at:
and the actual coprocessor registers are specified in Chapter G7 "AArch32 System Register Encoding" at:
* CP14: Table G7-1 "Mapping of (coproc ==0b1110) MCR, MRC, and MRRC instruction arguments to System registers"
* CP15: Table G7-3 "VMSAv8-32 (coproc==0b1111) register summary, in MCR/MRC parameter order."
The actual MCR/MRC assembly does not exactly match the order of that table. This is how you can decode it, with a sample MCR:
....
mcr p15, 0, r5, c7, c6, 2
....
what each part means:
....
mcr p<coproc>, <opc1>, <src-dest-reg>, <CRn>, <CRm>, <opc2>
....
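The operand order above can be checked with a small script. This Python sketch splits an `mcr` or `mrc` line into the named parts and packs them into the 32-bit A32 instruction word. The field names (`rt` for `<src-dest-reg>` and so on) and the functions `parse` and `encode` are made up for this example, the condition is assumed to be "always" (`0xE`), and only the plain two-register-free `mcr`/`mrc` forms are handled.

```python
import re

# Operand order as written in assembly:
#   mcr p<coproc>, <opc1>, <src-dest-reg>, <CRn>, <CRm>, <opc2>
FIELDS = ("coproc", "opc1", "rt", "crn", "crm", "opc2")
OPERANDS_RE = re.compile(
    r"^\s*(mcr|mrc)\s+p(\d+),\s*(\d+),\s*r(\d+),\s*c(\d+),\s*c(\d+),\s*(\d+)\s*$",
    re.IGNORECASE,
)


def parse(asm):
    """Split one mcr/mrc line into a dict of named integer fields."""
    match = OPERANDS_RE.match(asm)
    if not match:
        raise ValueError(f"not a plain mcr/mrc line: {asm!r}")
    fields = dict(zip(FIELDS, map(int, match.groups()[1:])))
    fields["mnemonic"] = match.group(1).lower()
    return fields


def encode(fields, cond=0xE):
    """Pack the fields into the A32 instruction word (cond defaults to 'always')."""
    load = 1 if fields["mnemonic"] == "mrc" else 0  # bit 20: 1 reads, 0 writes
    return (
        cond << 28 | 0b1110 << 24 | fields["opc1"] << 21 | load << 20
        | fields["crn"] << 16 | fields["rt"] << 12 | fields["coproc"] << 8
        | fields["opc2"] << 5 | 1 << 4 | fields["crm"]
    )


fields = parse("mcr p15, 0, r5, c7, c6, 2")
print(fields)
print(hex(encode(fields)))
```

Note how `CRn` and the register sit next to each other in the instruction word while `CRm` comes last, which is one reason why neither the assembly order nor the table order maps one-to-one onto the bits.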
===== ARM system register encodings
Each aarch64 system register is specified in the encoding of <<arm-system-register-instructions>> by 5 integer numbers:
@@ -25570,6 +25862,14 @@ Bibliography:
* https://stackoverflow.com/questions/49601910/out-of-order-execution-vs-speculative-execution
===== Branch predictor
https://en.wikipedia.org/wiki/Branch_predictor
Comes in for <<superscalar-processor,superscalar processors>>.
A gem5 example can be seen at: <<gem5-event-queue-derivo3cpu-syscall-emulation-freestanding-example-analysis-speculative>>.
==== Re-order buffer
https://en.wikipedia.org/wiki/Re-order_buffer
@@ -25588,14 +25888,6 @@ Important examples:
* <<superscalar-processor>>
=== Branch predictor
https://en.wikipedia.org/wiki/Branch_predictor
Comes in for <<superscalar-processor,superscalar processors>>.
TODO analysis in gem5.
=== Hardware threads
Intel name: "Hyperthreading"