./run --arch aarch64 --emulator gem5 --userland userland/posix/getpid.c --cpus 2
+you won’t get the latest version of this repository. Our Travis attempt to automate builds failed, and storing a release for every commit would likely make GitHub mad at us anyway.
gem5 is not currently supported. The major blocking point is how to avoid distributing the kernel images twice: once for gem5 which uses vmlinux, and once for QEMU which uses arch/* images, see also:
so we see that two lines were added for each syscall, showing the syscall inputs and exit status, just like a mini strace!
This is not currently nicely exposed in LKMC, but gem5 syscall emulation does allow you to run multiple executables "at once".
--cmd takes a semicolon-separated list, so we could do:
./run --arch aarch64 --emulator gem5 --userland userland/posix/getpid.c --cpus 2
and then hack the produced command by replacing:
--cmd /home/ciro/bak/git/linux-kernel-module-cheat/out/userland/default/aarch64/posix/getpid.out \
--param 'system.cpu[0].workload[:].release = "5.4.3"' \
with:
--cmd '/home/ciro/bak/git/linux-kernel-module-cheat/out/userland/default/aarch64/posix/getpid.out;/home/ciro/bak/git/linux-kernel-module-cheat/out/userland/default/aarch64/posix/getpid.out' \
--param 'system.cpu[:].workload[:].release = "5.4.3"' \
The outcome of this is that we see two different pid messages printed to stdout:
pid=101
pid=100
since from gem5 Process we can see that se.py sets up a different PID for each executable, starting at 100:
    workloads = options.cmd.split(';')
    idx = 0
    for wrkld in workloads:
        process = Process(pid = 100 + idx)
This basically starts running one process per CPU, much as if they had been forked.
We can also see that these processes are running concurrently with gem5 tracing by hacking:
--debug-flags ExecAll \
--debug-file cout \
which starts with:
0: system.cpu1: A0 T0 : @__end__+274873647040 : add x0, sp, #0 : IntAlu : D=0x0000007ffffefde0 flags=(IsInteger)
0: system.cpu0: A0 T0 : @__end__+274873647040 : add x0, sp, #0 : IntAlu : D=0x0000007ffffefde0 flags=(IsInteger)
500: system.cpu0: A0 T0 : @__end__+274873647044 : bl <__end__+274873649648> : IntAlu : D=0x0000004000001008 flags=(IsInteger|IsControl|IsDirectControl|IsUncondControl|IsCall)
500: system.cpu1: A0 T0 : @__end__+274873647044 : bl <__end__+274873649648> : IntAlu : D=0x0000004000001008 flags=(IsInteger|IsControl|IsDirectControl|IsUncondControl|IsCall)
and therefore shows one instruction running on each CPU for each process at the same time.
Well, not that AOSP is that much better anyway.
gem5 user mode multithreading has been particularly flaky compared to QEMU’s, but work is being put into improving it.
In gem5 syscall simulation, the fork syscall checks if there is a free CPU, and if there is one, the new thread runs on that CPU.
Otherwise, the fork call, and therefore higher level interfaces to fork such as pthread_create, fails and returns a failure status in the guest.
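The failure is visible from the guest by checking the return value of pthread_create. A minimal hedged sketch of that check (not one of the LKMC examples) would be:

// Hedged sketch (not LKMC code): under gem5 syscall emulation,
// pthread_create fails when no free CPU is available, so check its return.
#include <cstdio>
#include <cstring>
#include <pthread.h>

static void *thread_main(void *) {
    std::puts("hello from the new thread");
    return nullptr;
}

int main() {
    pthread_t tid;
    int err = pthread_create(&tid, nullptr, thread_main, nullptr);
    if (err != 0) {
        // Expected e.g. with --cpus 1, since the only CPU is already taken.
        std::fprintf(stderr, "pthread_create failed: %s\n", std::strerror(err));
        return 1;
    }
    pthread_join(tid, nullptr);
    return 0;
}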
For example, if we use just one CPU for userland/posix/pthread_self.c which spawns one thread besides main:
outputs 1 and 2 respectively.
This can also be clearly seen by running sched_getcpu:
./run \
  --arch aarch64 \
  --cli-args 4 \
  --cpus 8 \
  --emulator gem5 \
  --userland userland/linux/sched_getcpu.c \
;
which necessarily produces an output containing the CPU numbers from 1 to 4 and no higher:
1
3
4
2
TODO why does the 2 come at the end here? Would be good to do a detailed assembly run analysis.
But let’s give it a try anyway with userland/freestanding/gem5_checkpoint.S, which was mentioned at gem5 checkpoint userland minimal example.
when you want to call the instructions from inside interest points of your benchmark. Otherwise you add the syscall overhead to the benchmark, which is more intrusive and might affect results.
Why not just hardcode some m5ops instructions as in our example instead, since you are going to modify the source of the benchmark anyway?
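For reference, annotating a region of interest might look roughly like the following sketch, assuming gem5's include/gem5/m5ops.h header and the libm5 library from util/m5 (not verbatim LKMC code):

// Hedged sketch (not LKMC code): reset and dump stats around the region of
// interest, assuming gem5's m5ops C interface is available and linked in.
#include <gem5/m5ops.h>

static void region_of_interest(void) {
    // Placeholder workload standing in for the interesting part of a benchmark.
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 1000000; i++)
        sum += i;
}

int main(void) {
    // Setup and warmup before this point are excluded from the stats.
    m5_reset_stats(0, 0);  // zero stats right at the start of the region
    region_of_interest();
    m5_dump_stats(0, 0);   // dump stats covering only the region of interest
    return 0;
}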
Crossbar, or XBar in the code, is the default CPU interconnect that gets used by fs.py if --ruby is not given.
One simple example of its operation can be seen at: Section 19.20.4.2, “gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis”.
See also: https://en.wikipedia.org/wiki/Crossbar_switch
Both of those can be checked with git log and git blame.
All CPU types inherit from the BaseCPU class, and looking at the class hierarchy in Eclipse gives a good overview of what we have:
BaseCPU
    BaseKvmCPU
    BaseSimpleCPU
        AtomicSimpleCPU
        TimingSimpleCPU
    MinorCPU
    BaseO3CPU
        FullO3CPU
BaseSimpleCPU
Simple abstract CPU without a pipeline.
Certain ISAs like ARM have bootloaders that are automatically run before the main image to setup basic system state.
We cross compile those bootloaders from source automatically during ./build-gem5.
As of gem5 bcf041f257623e5c9e77d35b7531bae59edc0423, the source code of the bootloaders can be found under:
system/arm/
and their selection can be seen under: src/dev/arm/RealView.py, e.g.:
def setupBootLoader(self, cur_sys, loc):
    if not cur_sys.boot_loader:
        cur_sys.boot_loader = [ loc('boot_emm.arm64'), loc('boot_emm.arm') ]
Internals under other sections:
In order to develop complex C++ software such as gem5, a good IDE setup is fundamental.
The best setup I’ve reached is with Eclipse. It is not perfect, and there is a learning curve, but it is worth it.
Notably, it is very hard to get perfect due to: Why are all C++ symlinked into the gem5 build dir?.
I recommend the following settings, tested in Eclipse 2019.09, Ubuntu 18.04:
use spaces instead of tabs: Window, Preferences, Code Style, C/C++, Formatter, New, Edit, Tab Policy, Spaces Only
either
create the project in the gem5 build directory! Files are moved around there and symlinked, and this gives the best chances of success
add to the include search path:
To run and GDB step debug the executable, just copy the full command line from the output of ./run, and configure it into Eclipse.
The interaction uses the Python C extension interface https://docs.python.org/2/extending/extending.html through the pybind11 helper library: https://github.com/pybind/pybind11
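For reference on the mechanism itself, a generic pybind11 hello world module (not gem5's actual bindings) looks like:

// Generic pybind11 example module (not gem5 code): exposes add() to Python.
#include <pybind11/pybind11.h>

namespace py = pybind11;

int add(int a, int b) {
    return a + b;
}

PYBIND11_MODULE(example, m) {
    m.doc() = "minimal pybind11 example";
    m.def("add", &add, "Add two integers");
}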
The main is at: src/sim/main.cc. It calls:
gem5 is an event based simulator, and as such the event queue is one of the crucial elements in the system.
Then, once we had that, the most perfect thing ever would be to make the full event graph containing which events schedule which events!
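To keep a mental model handy for the walkthroughs below, here is a hedged toy sketch of what a discrete event queue boils down to; gem5's real EventQueue is of course much more involved:

// Toy discrete event queue (not gem5 code): an event is a (tick, action)
// pair, and an action may schedule further events, which is exactly the
// "which event schedules which event" relation discussed above.
#include <cstdint>
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

struct Event {
    uint64_t tick;
    std::function<void()> action;
};

struct Later {
    bool operator()(const Event &a, const Event &b) const {
        return a.tick > b.tick;  // smallest tick first
    }
};

int main() {
    std::priority_queue<Event, std::vector<Event>, Later> queue;

    // Initial event at tick 0 schedules a follow-up event at tick 500,
    // loosely mimicking a fetch that later triggers an execute.
    queue.push({0, [&queue]() {
        std::cout << "fetch at tick 0\n";
        queue.push({500, []() { std::cout << "execute at tick 500\n"; }});
    }});

    // Main simulation loop: pop the earliest event and run its action,
    // which may push new events onto the queue.
    while (!queue.empty()) {
        Event e = queue.top();
        queue.pop();
        e.action();
    }
    return 0;
}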
Let’s now analyze every single event of a minimal gem5 syscall emulation run on the simplest CPU that we have:
Tested in gem5 12c917de54145d2d50260035ba7fa614e25317a3.
Let’s have a closer look at the initial magically scheduled events of the simulation.
Inside AtomicSimpleCPU::tick() we saw previously that the reschedule happens at:
It will be interesting to see on GDB how AtomicSimpleCPU makes memory accesses, and to compare that with TimingSimpleCPU.
Happens on EmulationPageTable, and seems to happen atomically without making any extra memory requests.
Now, let’s move on to TimingSimpleCPU, which is just like AtomicSimpleCPU internally, but now the memory requests don’t actually finish immediately: gem5 CPU types!
Schedules TimingSimpleCPU::fetch through:
Backtrace:
This is just the startup of the second rank, see: TimingSimpleCPU analysis #1.
From the timing we know what that one is: the end of time exit event, like for AtomicSimpleCPU.
Executes TimingSimpleCPU::fetch().
Schedules DRAMCtrl::processNextReqEvent through:
Schedules BaseXBar::Layer::releaseLayer through:
Executes DRAMCtrl::processNextReqEvent.
Schedules DRAMCtrl::Rank::processActivateEvent through:
Schedules DRAMCtrl::processRespondEvent through:
Schedules DRAMCtrl::processNextReqEvent through:
Executes DRAMCtrl::Rank::processActivateEvent.
Schedules DRAMCtrl::Rank::processPowerEvent through:
Executes DRAMCtrl::Rank::processPowerEvent.
Executes BaseXBar::Layer<SrcType, DstType>::releaseLayer.
Executes DRAMCtrl::processNextReqEvent().
Executes DRAMCtrl::processRespondEvent().
Schedules PacketQueue::processSendEvent() through:
Executes PacketQueue::processSendEvent().
Schedules PacketQueue::processSendEvent through:
Schedules BaseXBar::Layer<SrcType, DstType>::releaseLayer through:
Executes BaseXBar::Layer<SrcType, DstType>::releaseLayer.
Executes PacketQueue::processSendEvent.
Schedules TimingSimpleCPU::IcachePort::ITickEvent::process() through:
Executes TimingSimpleCPU::IcachePort::ITickEvent::process().
Schedules DRAMCtrl::processNextReqEvent through:
Schedules BaseXBar::Layer<SrcType, DstType>::releaseLayer through:
Executes DRAMCtrl::processNextReqEvent.
Schedules DRAMCtrl::processRespondEvent().
One important thing we want to check now is how memory reads are going to make the processor stall in the middle of an instruction.
Let’s just add --caches to see if things go any faster:
TODO is this the minimal setup that allows us to see the gem5 crossbar interconnect? Can we see anything in AtomicSimpleCPU?
The events for the Atomic CPU were pretty simple: basically just ticks.
TODO: like gem5 event queue MinorCPU syscall emulation freestanding example analysis but even more complex!
ThreadContext vs ThreadState vs ExecContext vs Process
These classes get used everywhere, and they have a somewhat convoluted relation with one another, so let’s figure out this mess.
None of those objects are SimObjects, so they must all belong to some higher SimObject.
This section and all children tested at gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.
ThreadContext
As we delve into more details below, we will reach the following conclusion: a ThreadContext represents one thread of a CPU with multiple Hardware threads.
We can therefore have multiple ThreadContexts for each BaseCPU.
ThreadContext is what gets passed in syscalls, e.g.:
src/sim/syscall_emul.hh
template <class OS>
SyscallReturn
readFunc(SyscallDesc *desc, ThreadContext *tc,
        int tgt_fd, Addr buf_ptr, int nbytes)
The class hierarchy for ThreadContext looks like:
ThreadContext
    O3ThreadContext
    SimpleThread
where the gem5 MinorCPU also uses SimpleThread:
/** Minor will use the SimpleThread state for now */
typedef SimpleThread MinorThread;
It is a bit confusing: things would be much clearer if SimpleThread were instead called SimpleThreadContext!
readIntReg and other register access methods are some notable methods implemented in descendants, e.g. SimpleThread::readIntReg.
Essentially all methods of the base ThreadContext are pure virtual.
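To summarize the pattern with a toy sketch (not gem5 code): the base class is just an interface, and a concrete descendant such as SimpleThread decides where the register data actually lives:

// Toy sketch of the ThreadContext pattern (not gem5 code): pure virtual
// interface, with register storage living in the concrete descendant.
#include <array>
#include <cstdint>
#include <iostream>

struct ToyThreadContext {
    virtual ~ToyThreadContext() = default;
    virtual uint64_t readIntReg(int idx) const = 0;
    virtual void setIntReg(int idx, uint64_t val) = 0;
};

// Analogous to SimpleThread: owns the register file directly.
struct ToySimpleThread : ToyThreadContext {
    std::array<uint64_t, 32> intRegs{};
    uint64_t readIntReg(int idx) const override { return intRegs[idx]; }
    void setIntReg(int idx, uint64_t val) override { intRegs[idx] = val; }
};

int main() {
    ToySimpleThread thread;
    ToyThreadContext &tc = thread;  // consumers only see the interface
    tc.setIntReg(0, 42);
    std::cout << tc.readIntReg(0) << "\n";  // 42
}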
SimpleThread
SimpleThread storage is defined on BaseSimpleCPU for simple CPUs like AtomicSimpleCPU:
    for (unsigned i = 0; i < numThreads; i++) {
        if (FullSystem) {
            thread = new SimpleThread(this, i, p->system,
                                      p->itb, p->dtb, p->isa[i]);
        } else {
            thread = new SimpleThread(this, i, p->system, p->workload[i],
                                      p->itb, p->dtb, p->isa[i]);
        }
        threadInfo.push_back(new SimpleExecContext(this, thread));
        ThreadContext *tc = thread->getTC();
        threadContexts.push_back(tc);
    }
and on MinorCPU for Minor:
MinorCPU::MinorCPU(MinorCPUParams *params) :
    BaseCPU(params),
    threadPolicy(params->threadPolicy)
{
    /* This is only written for one thread at the moment */
    Minor::MinorThread *thread;

    for (ThreadID i = 0; i < numThreads; i++) {
        if (FullSystem) {
            thread = new Minor::MinorThread(this, i, params->system,
                    params->itb, params->dtb, params->isa[i]);
            thread->setStatus(ThreadContext::Halted);
        } else {
            thread = new Minor::MinorThread(this, i, params->system,
                    params->workload[i], params->itb, params->dtb,
                    params->isa[i]);
        }

        threads.push_back(thread);
        ThreadContext *tc = thread->getTC();
        threadContexts.push_back(tc);
    }
Those are used from gem5 ExecContext.
From this we see that one CPU can have multiple threads, and that this is controlled from the Python:
BaseCPU::BaseCPU(Params *p, bool is_checker)
    : numThreads(p->numThreads)
and since SimpleThread contains its registers, this must represent Hardware threads.
If we analyse SimpleThread::readIntReg, we see that the actual register data is contained inside ThreadContext descendants, e.g. in SimpleThread:
    RegVal
    readIntReg(RegIndex reg_idx) const override
    {
        int flatIndex = isa->flattenIntIndex(reg_idx);
        assert(flatIndex < TheISA::NumIntRegs);
        uint64_t regVal(readIntRegFlat(flatIndex));
        DPRINTF(IntRegs, "Reading int reg %d (%d) as %#x.\n",
                reg_idx, flatIndex, regVal);
        return regVal;
    }

    RegVal readIntRegFlat(RegIndex idx) const override { return intRegs[idx]; }
    void
    setIntRegFlat(RegIndex idx, RegVal val) override
    {
        intRegs[idx] = val;
    }

    std::array<RegVal, TheISA::NumIntRegs> intRegs;
Another notable type of method contained in ThreadContext is the kind that forwards to gem5 ThreadState.
O3ThreadContext
Instantiation happens in the FullO3CPU constructor:
FullO3CPU<Impl>::FullO3CPU(DerivO3CPUParams *params)

    for (ThreadID tid = 0; tid < this->numThreads; ++tid) {
        if (FullSystem) {
            // SMT is not supported in FS mode yet.
            assert(this->numThreads == 1);
            this->thread[tid] = new Thread(this, 0, NULL);

            // Setup the TC that will serve as the interface to the threads/CPU.
            O3ThreadContext<Impl> *o3_tc = new O3ThreadContext<Impl>;
and the SimObject DerivO3CPU is just a FullO3CPU instantiation:
class DerivO3CPU : public FullO3CPU<O3CPUImpl>
O3ThreadContext is a template class:
template <class Impl>
class O3ThreadContext : public ThreadContext
The only Impl used appears to be O3CPUImpl? This is explicitly instantiated in the source:
template class O3ThreadContext<O3CPUImpl>;
Unlike in SimpleThread however, O3ThreadContext does not contain the register data itself, e.g. O3ThreadContext::readIntRegFlat instead forwards to cpu:
template <class Impl>
RegVal
O3ThreadContext<Impl>::readIntRegFlat(RegIndex reg_idx) const
{
    return cpu->readArchIntReg(reg_idx, thread->threadId());
}
where:
    typedef typename Impl::O3CPU O3CPU;

    /** Pointer to the CPU. */
    O3CPU *cpu;
and:
struct O3CPUImpl
{
    /** The O3CPU type to be used. */
    typedef FullO3CPU<O3CPUImpl> O3CPU;
and at long last FullO3CPU contains the register values:
template <class Impl>
RegVal
FullO3CPU<Impl>::readArchIntReg(int reg_idx, ThreadID tid)
{
    intRegfileReads++;
    PhysRegIdPtr phys_reg = commitRenameMap[tid].lookup(
            RegId(IntRegClass, reg_idx));

    return regFile.readIntReg(phys_reg);
}
So we guess that this difference from SimpleThread is due to the register renaming of the out-of-order implementation.
ThreadState
Owned one per ThreadContext.
Many ThreadContext methods simply forward to ThreadState implementations.
SimpleThread inherits from ThreadState, and forwards to it on several methods e.g.:
    int cpuId() const override { return ThreadState::cpuId(); }
    uint32_t socketId() const override { return ThreadState::socketId(); }
    int threadId() const override { return ThreadState::threadId(); }
    void setThreadId(int id) override { ThreadState::setThreadId(id); }
    ContextID contextId() const override { return ThreadState::contextId(); }
    void setContextId(ContextID id) override { ThreadState::setContextId(id); }
O3ThreadContext on the other hand contains an O3ThreadState:
template <class Impl>
struct O3ThreadState : public ThreadState
at:
template <class Impl>
class O3ThreadContext : public ThreadContext
{
    O3ThreadState<Impl> *thread

    ContextID contextId() const override { return thread->contextId(); }

    void setContextId(ContextID id) override { thread->setContextId(id); }
ExecContext
ExecContext gets used in instruction definitions, e.g.:
build/ARM/arch/arm/generated/exec-ns.cc.inc
Fault Mul::execute(
        ExecContext *xc, Trace::InstRecord *traceData) const
It contains methods to allow interacting with CPU state from inside instruction execution, notably reading and writing from/to registers.
For example, the ARM mul instruction uses ExecContext to read the input operands, multiply them, and write to the output:
Fault Mul::execute(
        ExecContext *xc, Trace::InstRecord *traceData) const
{
    Fault fault = NoFault;
    uint64_t resTemp = 0;
    resTemp = resTemp;
    uint32_t OptCondCodesNZ = 0;
    uint32_t OptCondCodesC = 0;
    uint32_t OptCondCodesV = 0;
    uint32_t Reg0 = 0;
    uint32_t Reg1 = 0;
    uint32_t Reg2 = 0;

    OptCondCodesNZ = xc->readCCRegOperand(this, 0);
    OptCondCodesC = xc->readCCRegOperand(this, 1);
    OptCondCodesV = xc->readCCRegOperand(this, 2);
    Reg1 =
        ((reg1 == PCReg) ? readPC(xc) : xc->readIntRegOperand(this, 3));
    Reg2 =
        ((reg2 == PCReg) ? readPC(xc) : xc->readIntRegOperand(this, 4));

    if (testPredicate(OptCondCodesNZ, OptCondCodesC, OptCondCodesV, condCode)/*auto*/)
    {
        Reg0 = resTemp = Reg1 * Reg2;;
        if (fault == NoFault) {
            {
                uint32_t final_val = Reg0;
                ((reg0 == PCReg) ? setNextPC(xc, Reg0) : xc->setIntRegOperand(this, 0, Reg0));
                if (traceData) { traceData->setData(final_val); }
            };
        }
    } else {
        xc->setPredicate(false);
    }

    return fault;
}
ExecContext is however basically just a wrapper that forwards to other classes that actually contain the data in a microarchitecture-neutral manner. For example, in SimpleExecContext:
    /** Reads an integer register. */
    RegVal
    readIntRegOperand(const StaticInst *si, int idx) override
    {
        numIntRegReads++;
        const RegId& reg = si->srcRegIdx(idx);
        assert(reg.isIntReg());
        return thread->readIntReg(reg.index());
    }
So we see that this just does some register position bookkeeping needed for instruction execution, but the actual data comes from SimpleThread::readIntReg, which is a specialization of gem5 ThreadContext.
ExecContext is a fully virtual class. The hierarchy is:
ExecContext
    SimpleExecContext
    Minor::MinorExecContext
    BaseDynInst
        BaseO3DynInst
If we follow SimpleExecContext creation for example, we see:
class BaseSimpleCPU : public BaseCPU
{
    std::vector<SimpleExecContext*> threadInfo;
and:
BaseSimpleCPU::BaseSimpleCPU(BaseSimpleCPUParams *p)
    : BaseCPU(p),
      curThread(0),
      branchPred(p->branchPred),
      traceData(NULL),
      inst(),
      _status(Idle)
{
    SimpleThread *thread;

    for (unsigned i = 0; i < numThreads; i++) {
        if (FullSystem) {
            thread = new SimpleThread(this, i, p->system,
                                      p->itb, p->dtb, p->isa[i]);
        } else {
            thread = new SimpleThread(this, i, p->system, p->workload[i],
                                      p->itb, p->dtb, p->isa[i]);
        }
        threadInfo.push_back(new SimpleExecContext(this, thread));
        ThreadContext *tc = thread->getTC();
        threadContexts.push_back(tc);
    }
therefore there is one ExecContext for each ThreadContext, and each ExecContext knows about its own ThreadContext.
This makes sense, since each ThreadContext represents one CPU register set, and therefore needs a separate ExecContext which allows instruction implementations to access those registers.
Process
The Process class is used only for gem5 syscall emulation mode, and it represents a process like a Linux userland process, in addition to any further gem5-specific data needed to represent the process.
The first thing most syscall implementations do is to actually pull Process out of gem5 ThreadContext, e.g.:
template <class OS>
SyscallReturn
readFunc(SyscallDesc *desc, ThreadContext *tc,
        int tgt_fd, Addr buf_ptr, int nbytes)
{
    auto p = tc->getProcessPtr();
For example, we can readily see from its interface that it contains several accessors for common process fields:
    inline uint64_t uid() { return _uid; }
    inline uint64_t euid() { return _euid; }
    inline uint64_t gid() { return _gid; }
    inline uint64_t egid() { return _egid; }
Process is a SimObject, and therefore produced directly in e.g. se.py.
se.py produces one Process per executable given:
    workloads = options.cmd.split(';')
    idx = 0
    for wrkld in workloads:
        process = Process(pid = 100 + idx)
and those are placed in the workload property:
    for i in range(np):
        if options.smt:
            system.cpu[i].workload = multiprocesses
        elif len(multiprocesses) == 1:
            system.cpu[i].workload = multiprocesses[0]
        else:
            system.cpu[i].workload = multiprocesses[i]
and finally each thread of a CPU gets assigned to a different such workload:
BaseSimpleCPU::BaseSimpleCPU(BaseSimpleCPUParams *p)
    : BaseCPU(p),
      curThread(0),
      branchPred(p->branchPred),
      traceData(NULL),
      inst(),
      _status(Idle)
{
    SimpleThread *thread;

    for (unsigned i = 0; i < numThreads; i++) {
        if (FullSystem) {
            thread = new SimpleThread(this, i, p->system,
                                      p->itb, p->dtb, p->isa[i]);
        } else {
            thread = new SimpleThread(this, i, p->system, p->workload[i],
                                      p->itb, p->dtb, p->isa[i]);
        }
        threadInfo.push_back(new SimpleExecContext(this, thread));
        ThreadContext *tc = thread->getTC();
        threadContexts.push_back(tc);
    }
gem5 uses a ton of code generation, which makes the project horrendous:
But it has been widely overused to insanity. It likely also exists partly because when the project started in 2003 C++ compilers weren’t that good, so you couldn’t rely on features like templates that much.
Generated code at: build/<ISA>/config/the_isa.hh which e.g. for ARM contains:
#ifndef __CONFIG_THE_ISA_HH__
#define __CONFIG_THE_ISA_HH__

#define ARM_ISA 1
#define MIPS_ISA 2
#define NULL_ISA 3
#define POWER_ISA 4
#define RISCV_ISA 5
#define SPARC_ISA 6
#define X86_ISA 7

enum class Arch {
    ArmISA = ARM_ISA,
    MipsISA = MIPS_ISA,
    NullISA = NULL_ISA,
    PowerISA = POWER_ISA,
    RiscvISA = RISCV_ISA,
    SparcISA = SPARC_ISA,
    X86ISA = X86_ISA
};

#define THE_ISA ARM_ISA
#define TheISA ArmISA
#define THE_ISA_STR "arm"

#endif // __CONFIG_THE_ISA_HH__
Generation code: src/SConscript at def makeTheISA.
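As a hedged illustration of how such a generated header might get consumed (the macro names come from the generated file above, but the usage site is made up):

// Illustrative only: dispatch at compile time on the generated THE_ISA macro.
// THE_ISA / ARM_ISA / X86_ISA / THE_ISA_STR come from config/the_isa.hh.
#include "config/the_isa.hh"

#include <iostream>

int main() {
#if THE_ISA == ARM_ISA
    std::cout << "built for ARM, TheISA expands to ArmISA\n";
#elif THE_ISA == X86_ISA
    std::cout << "built for x86\n";
#else
    std::cout << "built for " << THE_ISA_STR << "\n";
#endif
    return 0;
}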
Tested on gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.
Bibliography: https://www.mail-archive.com/gem5-users@gem5.org/msg16989.html
gem5 moves a bit slowly, and if your host compiler is very new, the gem5 build might be broken for it, e.g. this was the case for Ubuntu 19.10 with GCC 9 and gem5 62d75e7105fe172eb906d4f80f360ff8591d4178 from Dec 2019.
E.g. src/cpu/decode_cache.hh includes:
Upstream request: https://gem5.atlassian.net/browse/GEM5-469
Some scons madness.
it is basically impossible to set up an IDE properly with gem5: gem5 Eclipse configuration
It is likely preventing ccache hits when building to different output paths, because it makes the -I includes point to different paths. This is especially important for the gem5 Ruby build, which could have the exact same source files as the non-Ruby builds: https://stackoverflow.com/questions/60340271/can-ccache-handle-symlinks-to-the-same-input-source-file-as-hits
Read the source comments and understand everything that is going on!
The minimal interesting example is to use fork and observe different PIDs.
A more minimal test-like example without forking can be seen at: userland/posix/getpid.c.
This example can for example be used to play with: gem5 syscall emulation multiple executables.
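As a hedged sketch of what the fork variant boils down to (the actual LKMC source may differ):

// Hedged sketch (the actual LKMC example may differ): fork and print the
// PID of both parent and child, which will be different in the output.
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cstdio>

int main() {
    pid_t pid = fork();
    if (pid == -1) {
        std::perror("fork");
        return 1;
    }
    if (pid == 0) {
        std::printf("child  pid=%ld\n", (long)getpid());
    } else {
        std::printf("parent pid=%ld\n", (long)getpid());
        wait(nullptr);  // reap the child before exiting
    }
    return 0;
}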
getcpu system call and the sched_getaffinity glibc wrapper
Example: userland/linux/sched_getcpu.c
Returns the CPU that the process/thread is currently running on:
So when running a multicore program, we may see that each thread can be running on a different core.
The cores in which the process runs can be fixed with sched_setaffinity as shown at: userland/linux/sched_getaffinity.c.
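A hedged sketch combining both calls (the actual LKMC examples may differ): pin the calling thread to CPU 1 with sched_setaffinity, then confirm it with sched_getcpu:

// Hedged sketch (not the LKMC sources): restrict the affinity mask to CPU 1
// and then ask the kernel which CPU we are actually running on.
#define _GNU_SOURCE 1  // may be needed for sched_getcpu/CPU_SET with glibc
#include <sched.h>

#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);  // only allow CPU 1
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        std::perror("sched_setaffinity");
        return 1;
    }
    std::printf("running on CPU %d\n", sched_getcpu());  // expected: 1
    return 0;
}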
So when I run it with the main thread + 4 threads on a multicore host:
./userland/linux/sched_getcpu.out 4
I see random outputs like:
7
2
1
5
and:
5
0
2
1
Due to the way that gem5 syscall emulation multithreading works however, the output is more deterministic in that case; see that section for further details.
GCC time increased 1.5x with our bloat, but its percentage of the total was greatly reduced, due to new packages being introduced.
make graph-depends shows that most new dependencies come from QEMU and GDB, which we can’t get rid of anyway.
Intel name: "Hyperthreading"
gem5 appears to have attempted to implement hardware threads in gem5 syscall emulation mode: https://github.com/cirosantilli/linux-kernel-module-cheat/issues/104 when using gem5 syscall emulation multiple executables.
On fs.py it is not exposed in any in-tree config however, and as pointed out in the above issue, O3 FS has an assert that prevents it in src/cpu/o3/cpu.cc:
// SMT is not supported in FS mode yet.
assert(this->numThreads == 1);
TODO why only in fs.py? Is there much difference between fs and se from a hyperthreading point of view? Maybe the message is there because, as concluded in gem5 O3ThreadContext, registers for DerivO3CPU are stored in DerivO3CPU itself (FullO3CPU), and therefore there is no way to currently represent multiple register sets per CPU.
Other CPUs just appear to fail non-gracefully, e.g.:
./run --arch aarch64 --emulator gem5 -- --param 'system.cpu[0].numThreads = 2'
fails with:
fatal: fatal condition interrupts.size() != numThreads occurred: CPU system.cpu has 1 interrupt controllers, but is expecting one per thread (2)
Note that cache coherency only applies to memory read/write instructions that explicitly make coherency requirements.
In most ISAs, this tends to be the minority of instructions, and is only used when something is going to modify memory that is known to be shared across threads. For example, an x86 LOCK prefix would be used to increment atomic counters that get incremented across several threads. Outside of those cases, cache coherency is not guaranteed, and behaviour is undefined.
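For example, an atomic counter incremented across several threads, which is the kind of access that compiles down to such locked/exclusive instructions, could look like this generic sketch (not an LKMC example):

// Generic sketch of an atomic counter incremented from several threads;
// on x86 the fetch_add typically compiles to a LOCK-prefixed instruction.
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::atomic<long> counter{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; i++) {
        threads.emplace_back([&counter]() {
            for (int j = 0; j < 100000; j++)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto &t : threads)
        t.join();
    std::cout << counter.load() << "\n";  // always 400000
    return 0;
}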
Either they can snoop only control, or both control and data can be snooped.
Mentioned at:
TODO gem5 concrete example.
TODO understand well why those are needed.
https://en.wikipedia.org/wiki/MOSI_protocol The critical MSI vs MOSI section was a bit bogus though, we have to edit it: https://en.wikipedia.org/w/index.php?title=MOSI_protocol&oldid=895443023