start_kernel is a good definition of the Linux kernel entry point: https://stackoverflow.com/questions/18266063/does-kernel-have-main-function/33422401#33422401
In gem5 aarch64 Linux v4.18, experimentally the entry point of secondary CPUs seems to be secondary_holding_pen as shown at https://gist.github.com/cirosantilli2/34a7bc450fcb6c1c1a910369be1fdd90
This describes the internals of the gem5 m5out/stats.txt file.
Well, run minimal examples, and reverse engineer them up!
We can start with userland/arch/aarch64/freestanding/linux/hello.S on atomic with the gem5 ExecAll trace format.
./run \
  --arch aarch64 \
  --emulator gem5 \
  --userland userland/arch/aarch64/freestanding/linux/hello.S \
  --trace ExecAll \
  --trace-stdout \
;
which gives:
0: system.cpu: A0 T0 : @_start : movz x0, #1, #0 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
500: system.cpu: A0 T0 : @_start+4 : adr x1, #28 : IntAlu : D=0x0000000000400098 flags=(IsInteger)
1000: system.cpu: A0 T0 : @_start+8 : ldr w2, #4194464 : MemRead : D=0x0000000000000006 A=0x4000a0 flags=(IsInteger|IsMemRef|IsLoad)
1500: system.cpu: A0 T0 : @_start+12 : movz x8, #64, #0 : IntAlu : D=0x0000000000000040 flags=(IsInteger)
2000: system.cpu: A0 T0 : @_start+16 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
2500: system.cpu: A0 T0 : @_start+20 : movz x0, #0, #0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
3000: system.cpu: A0 T0 : @_start+24 : movz x8, #93, #0 : IntAlu : D=0x000000000000005d flags=(IsInteger)
3500: system.cpu: A0 T0 : @_start+28 : svc #0x0 : IntAlu : flags=(IsSerializeAfter|IsNonSpeculative|IsSyscall)
The most important stat of all is usually the cycle count, which is a direct measure of performance if you modelled your system well:
sim_ticks 3500 # Number of ticks simulated
Next, sim_insts and sim_ops are often critical:
sim_insts 6 # Number of instructions simulated
sim_ops 6 # Number of ops (including micro ops) simulated
sim_ops is like sim_insts but it also includes gem5 microops.
In gem5 syscall emulation mode, syscall instructions are magic, and therefore appear not to be counted, which is why we get 6 instructions instead of 8.
which is kind of really cool because it allows you to view graph arrows on hover. This can be very useful because the PDF and SVG often overlap so many arrows together that you just can’t know which one is coming from/going to where.
It is worth noting that if you are running a bunch of short simulations, dot/SVG/PDF generation can have a significant impact on simulation startup time, so it is something to watch out for. As per https://gem5-review.googlesource.com/c/public/gem5/+/29232 it can be turned off.
It presumably implements a crossbar switch along the lines of: https://en.wikipedia.org/wiki/Crossbar_switch
One simple example of its operation can be seen at: Section 19.21.4.2, “gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis”
This is the best introductory example analysis we have so far: gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs. It contains more or less the most minimal example in which something interesting can be observed: multiple cores fighting over a single data memory variable.
But arguably interesting effects can only be observed when we have more than 1 CPU as in gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs.
Long story short: the interconnect contains the snoop mechanism, and it forwards packets coming from the caches of one CPU to the caches of the other CPUs in which the block is present.
It is therefore the heart of the Cache coherence mechanism, as it informs other caches of bus transactions they need to know about.
TODO: describe it in more detail. It appears to be a very simple mechanism.
cur_sys.boot_loader = [ loc('boot_emm.arm64'), loc('boot_emm.arm') ]

The bootloader basically just sets up a bit of CPU state and jumps to the kernel entry point.
In aarch64 at least, CPUs other than CPU0 are also started up briefly, run some initialization, and are made to wait on a WFE. This can be seen easily by booting a multicore Linux kernel run with the gem5 ExecAll trace format.
Parent section: gem5 internals.
The gem5 memory system is connected in a very flexible way through the port system.
This system exists to allow seamlessly connecting any combination of CPU, caches, interconnects, DRAM and peripherals.
A Packet is the basic information unit that gets sent across ports.
gem5 memory requests can be classified in the following broad categories:
functional: get the value magically, do not update caches, see also: gem5 functional requests
atomic: get the value now without making a separate event, but do not update caches. Cannot work in Ruby due to fundamental limitations, mentioned in passing at: https://gem5.atlassian.net/browse/GEM5-676
timing: get the value simulating delays and updating caches
This trichotomy can be notably seen in the definition of the MasterPort class:

class MasterPort : public Port, public AtomicRequestProtocol,
                   public TimingRequestProtocol, public FunctionalRequestProtocol
and the base classes are defined under src/mem/protocol/.
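To see the three flavors side by side, here is a minimal sketch of how a CPU-side master port could issue each kind of access. This is illustrative only: the exact Request constructor signature varies between gem5 versions, and a real timing request would use a heap-allocated packet so that it can outlive the call, as discussed further below for TimingSimpleCPU.

// Sketch only: assumes a connected MasterPort `port` and a valid MasterID.
// The three send calls are the ones MasterPort inherits from the protocol
// base classes shown above.
void readThreeWays(MasterPort &port, Addr paddr, MasterID masterId)
{
    uint64_t buf = 0;
    auto req = std::make_shared<Request>(paddr, sizeof(buf), 0, masterId);
    Packet pkt(req, MemCmd::ReadReq);
    pkt.dataStatic(&buf);

    // Functional: magic debugger-style read, no events, no cache modelling.
    port.sendFunctional(&pkt);

    // Atomic: returns the latency immediately, still without scheduling events.
    Tick latency = port.sendAtomic(&pkt);
    (void)latency;

    // Timing: only *tries* to send; the reply arrives later through
    // recvTimingResp() on this port, typically many events later.
    bool accepted = port.sendTimingReq(&pkt);
    (void)accepted;
}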
Then, by reading the rest of the class, we see that the send methods are all boring, and just forward to some polymorphic receiver that does the actual interesting activity:
Tick
sendAtomicSnoop(PacketPtr pkt)
{
    return AtomicResponseProtocol::sendSnoop(_masterPort, pkt);
}

Tick
AtomicResponseProtocol::sendSnoop(AtomicRequestProtocol *peer, PacketPtr pkt)
{
    assert(pkt->isRequest());
    return peer->recvAtomicSnoop(pkt);
}
The receive methods are therefore the interesting ones, and must be overridden on derived classes if they ever expect to receive such requests:
Tick
recvAtomicSnoop(PacketPtr pkt) override
{
    panic("%s was not expecting an atomic snoop request\n", name());
    return 0;
}

void
recvFunctionalSnoop(PacketPtr pkt) override
{
    panic("%s was not expecting a functional snoop request\n", name());
}

void
recvTimingSnoopReq(PacketPtr pkt) override
{
    panic("%s was not expecting a timing snoop request.\n", name());
}
One question that comes up now is: but why do CPUs need to care about snoop requests?
And one big answer is: to be able to implement LLSC atomicity as mentioned at: ARM LDXR and STXR instructions, since when other cores update memory, they could invalidate the lock of the current core.
Then, as you might expect, we can see that for example AtomicSimpleCPU does not override recvTimingSnoopReq.
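To make the LLSC point a bit more concrete, here is a generic sketch of an exclusive monitor. This is not gem5's actual implementation (gem5 tracks locked addresses inside the memory system), just the logic that explains why a core must observe snooped writes: a store exclusive has to fail if another master wrote to the reserved address in between.

#include <cstdint>

// Illustrative only: a per-core exclusive monitor in the spirit of LDXR/STXR.
struct ExclusiveMonitor {
    bool valid = false;
    uint64_t addr = 0;

    // LDXR: take a reservation on the address.
    void loadExclusive(uint64_t a) { valid = true; addr = a; }

    // Called when a write from another core is snooped.
    void observeRemoteWrite(uint64_t a) {
        if (valid && a == addr)
            valid = false;              // reservation is lost
    }

    // STXR: succeeds only if the reservation survived.
    bool storeExclusive(uint64_t a) {
        bool ok = valid && a == addr;
        valid = false;
        return ok;
    }
};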
Now let's see which requests are generated by an ordinary ARM LDR instruction. We run:
./run \
  --arch aarch64 \
  --debug-vm \
  --emulator gem5 \
  --gem5-build-type debug \
  --userland userland/arch/aarch64/freestanding/linux/hello.S
and then break at the methods of the LDR class LDRXL64_LIT: gem5 execute vs initiateAcc vs completeAcc.
Before starting, we of course guess that:
AtomicSimpleCPU will be making atomic accesses from execute
TimingSimpleCPU will be making timing accesses from initiateAcc, which must generate the event which leads to completeAcc
so let’s confirm it.
We break on ArmISAInst::LDRXL64_LIT::execute which is what AtomicSimpleCPU uses, and that leads as expected to:

MasterPort::sendAtomic
AtomicSimpleCPU::sendPacket
AtomicSimpleCPU::readMem
SimpleExecContext::readMem
readMemAtomic<(ByteOrder)1, ExecContext, unsigned long>
readMemAtomicLE<ExecContext, unsigned long>
ArmISAInst::LDRXL64_LIT::execute
AtomicSimpleCPU::tick
Notably, AtomicSimpleCPU::readMem immediately translates the address, creates a packet, sends the atomic request, and gets the response back without any events.
And now if we do the same with --cpu-type TimingSimpleCPU and break at ArmISAInst::LDRXL64_LIT::initiateAcc, and then add another break for the next event schedule b EventManager::schedule (which we imagine is the memory read) we reach:
EventManager::schedule
DRAMCtrl::addToReadQueue
DRAMCtrl::recvTimingReq
DRAMCtrl::MemoryPort::recvTimingReq
TimingRequestProtocol::sendReq
MasterPort::sendTimingReq
CoherentXBar::recvTimingReq
CoherentXBar::CoherentXBarSlavePort::recvTimingReq
TimingRequestProtocol::sendReq
MasterPort::sendTimingReq
TimingSimpleCPU::handleReadPacket
TimingSimpleCPU::sendData
TimingSimpleCPU::finishTranslation
DataTranslation<TimingSimpleCPU*>::finish
ArmISA::TLB::translateComplete
ArmISA::TLB::translateTiming
ArmISA::TLB::translateTiming
TimingSimpleCPU::initiateMemRead
SimpleExecContext::initiateMemRead
initiateMemRead<ExecContext, unsigned long>
ArmISAInst::LDRXL64_LIT::initiateAcc
TimingSimpleCPU::completeIfetch
TimingSimpleCPU::IcachePort::ITickEvent::process
EventQueue::serviceOne
so as expected we have TimingRequestProtocol::sendReq.
Remember however that timing requests are a bit more complicated due to paging, since the page table walk can itself lead to further memory requests.
In this particular instance, the address being read with the ldr x2, =len ARM LDR pseudo-instruction is likely placed just after the text section, and therefore its translation is already in the TLB due to previous instruction fetches. This is why the translation finishes immediately, going straight through TimingSimpleCPU::finishTranslation; some key snippets are:
TLB::translateComplete(const RequestPtr &req, ThreadContext *tc,
    Translation *translation, Mode mode, TLB::ArmTranslationType tranType,
    bool callFromS2)
{
    bool delay = false;
    Fault fault;
    if (FullSystem)
        fault = translateFs(req, tc, mode, translation, delay, true, tranType);
    else
        fault = translateSe(req, tc, mode, translation, delay, true);
    if (!delay)
        translation->finish(fault, req, tc, mode);
    else
        translation->markDelayed();
and then translateSe does not use delay at all, so we learn that in syscall emulation, delay is always false and things progress immediately there. And then further down TimingSimpleCPU::finishTranslation does some more fault checking:
void
TimingSimpleCPU::finishTranslation(WholeTranslationState *state)
{
    if (state->getFault() != NoFault) {
        translationFault(state->getFault());
    } else {
        if (!state->isSplit) {
            sendData(state->mainReq, state->data, state->res,
                     state->mode == BaseTLB::Read);
Tested in gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.
As seen at gem5 functional vs atomic vs timing memory requests, functional requests are not used in common simulation, since the core must always go through caches.
Functional accesses are therefore only used for more magic simulation functionality.
One such functionality is the gem5 syscall emulation mode implementation of the futex system call, which is done at futexFunc in src/sim/syscall_emul.hh.
As seen from man futex, the Linux kernel reads the value from an address that is given as the first argument of the call.
Therefore, here it makes sense for the gem5 syscall implementation, which does not actually have a real kernel running, to just make a functional request and be done with it, since the impact of the cache changes done by this read would be insignificant compared to the cost of the actual full context switch that would happen on a real syscall.
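A hedged sketch of what such a functional read boils down to: the real futexFunc goes through the process' memory proxy rather than hand-building a packet, so the names below are illustrative, but the end result is a functional access like this one.

// Illustrative: read the 32-bit futex word without disturbing timing or caches.
uint32_t functionalReadWord(MasterPort &port, Addr paddr, MasterID masterId)
{
    uint32_t val = 0;
    auto req = std::make_shared<Request>(paddr, sizeof(val), 0, masterId);
    Packet pkt(req, MemCmd::ReadReq);
    pkt.dataStatic(&val);
    port.sendFunctional(&pkt);  // completes immediately, no events scheduled
    return val;
}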
It is generally hard to implement functional requests for Ruby runs, because packets are flying through the memory system in a transient state, and there is no simple way of finding exactly which ones might have the latest version of the memory. See for example:
The typical error message in that case is:

fatal: Ruby functional read failed for address
Packet vs Request

Packet

Packet is what goes through ports: a single packet is sent out to the memory system, gets modified when it hits valid data, and then returns with the reply.
Packet is what CPUs create and send to get memory values. E.g. on gem5 AtomicSimpleCPU:
void
AtomicSimpleCPU::tick()
{
    ...
    Packet ifetch_pkt = Packet(ifetch_req, MemCmd::ReadReq);
    ifetch_pkt.dataStatic(&inst);

    icache_latency = sendPacket(icachePort, &ifetch_pkt);

Tick
AtomicSimpleCPU::sendPacket(MasterPort &port, const PacketPtr &pkt)
{
    return port.sendAtomic(pkt);
}
On TimingSimpleCPU, we note that the packet is dynamically allocated, unlike for the AtomicSimpleCPU, since it must exist across multiple events which happen on separate function calls, whereas atomic memory access is done immediately in a single call:
void
TimingSimpleCPU::sendFetch(const Fault &fault, const RequestPtr &req,
                           ThreadContext *tc)
{
    if (fault == NoFault) {
        DPRINTF(SimpleCPU, "Sending fetch for addr %#x(pa: %#x)\n",
                req->getVaddr(), req->getPaddr());
        ifetch_pkt = new Packet(req, MemCmd::ReadReq);
        ifetch_pkt->dataStatic(&inst);
        DPRINTF(SimpleCPU, " -- pkt addr: %#x\n", ifetch_pkt->getAddr());

        if (!icachePort.sendTimingReq(ifetch_pkt)) {
It must later delete the reply packet that it gets back, e.g. for the ifetch:
TimingSimpleCPU::completeIfetch(PacketPtr pkt)
{
    if (pkt) {
        delete pkt;
    }
The most important properties of a Packet are:
PacketDataPtr data;: the data coming back from a reply packet or being sent via it
Addr addr;: the physical address of the data. TODO comment says could be virtual too, when?

    /// The address of the request. This address could be virtual or
    /// physical, depending on the system configuration.
    Addr addr;

Flags flags;: flags describing properties of the Packet
MemCmd cmd;: this is basically an enumeration of possible commands, stuff like:
enum Command
{
    InvalidCmd,
    ReadReq,
    ReadResp,
+Each command has a fixed number of attributes defined in the static array:
static const CommandInfo commandInfo[];
which gets initialized in the .cc file in the same order as the Command enum.
const MemCmd::CommandInfo
MemCmd::commandInfo[] =
{
    /* InvalidCmd */
    { 0, InvalidCmd, "InvalidCmd" },
    /* ReadReq - Read issued by a non-caching agent such as a CPU or
     * device, with no restrictions on alignment. */
    { SET3(IsRead, IsRequest, NeedsResponse), ReadResp, "ReadReq" },
    /* ReadResp */
    { SET3(IsRead, IsResponse, HasData), InvalidCmd, "ReadResp" },
From this we see for example that both ReadReq and ReadResp are marked with the IsRead attribute.
The second field of this array also specifies the corresponding reply of each request. E.g. the reply of a ReadReq is a ReadResp. InvalidCmd is just a placeholder for commands that are already replies.
struct CommandInfo
{
    /// Set of attribute flags.
    const std::bitset<NUM_COMMAND_ATTRIBUTES> attributes;
    /// Corresponding response for requests; InvalidCmd if no
    /// response is applicable.
    const Command response;
    /// String representation (for printing)
    const std::string str;
};
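This table is what lets a packet turn itself into the matching response. A small sketch of that usage; MemCmd::responseCommand() and Packet::makeResponse() do exist in the gem5 source, but treat the details here as an approximation:

// Sketch: how the commandInfo table drives request -> response conversion.
void turnIntoResponse(PacketPtr pkt)
{
    assert(pkt->needsResponse());              // e.g. ReadReq has NeedsResponse
    MemCmd resp = pkt->cmd.responseCommand();  // ReadReq -> ReadResp via the table
    (void)resp;
    pkt->makeResponse();                       // flips the packet into its response
}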
Request

One good way to think about Request vs Packet could be "it is what the instruction definitions see", a bit like ExecContext vs ThreadContext.
Request is passed to the constructor of Packet, and Packet keeps a reference to it:
Packet(const RequestPtr &_req, MemCmd _cmd)
    : cmd(_cmd), id((PacketId)_req.get()), req(_req),
      data(nullptr), addr(0), _isSecure(false), size(0),
      _qosValue(0), headerDelay(0), snoopDelay(0),
      payloadDelay(0), senderState(NULL)
{
    if (req->hasPaddr()) {
        addr = req->getPaddr();
        flags.set(VALID_ADDR);
        _isSecure = req->isSecure();
    }
    if (req->hasSize()) {
        size = req->getSize();
        flags.set(VALID_SIZE);
    }
}
where RequestPtr is defined as:

typedef std::shared_ptr<Request> RequestPtr;
so we see that shared pointers to requests are basically passed around.
Some key fields include:
_paddr:

    /**
     * The physical address of the request. Valid only if validPaddr
     * is set.
     */
    Addr _paddr = 0;

_vaddr:

    /** The virtual address of the request. */
    Addr _vaddr = MaxAddr;
Request in AtomicSimpleCPU

In AtomicSimpleCPU, a single request of each type is kept for the entire CPU, e.g.:

RequestPtr ifetch_req;
and it gets created at construction time:
AtomicSimpleCPU::AtomicSimpleCPU(AtomicSimpleCPUParams *p)
{
    ifetch_req = std::make_shared<Request>();

and then it gets modified for each request:

setupFetchRequest(ifetch_req);

which does:

req->setVirt(fetchPC, sizeof(MachInst), Request::INST_FETCH,
             instMasterId(), instAddr);

Virtual to physical address translation done by the CPU stores the physical address:

fault = thread->dtb->translateAtomic(req, thread->getTC(),
                                     BaseTLB::Read);
which eventually calls e.g. on fs with MMU enabled:
Fault
TLB::translateMmuOn(ThreadContext* tc, const RequestPtr &req, Mode mode,
                    Translation *translation, bool &delay, bool timing,
                    bool functional, Addr vaddr,
                    ArmFault::TranMethod tranMethod)
{
    req->setPaddr(pa);
Request in TimingSimpleCPU

In TimingSimpleCPU, the request gets created per memory read:
Fault
TimingSimpleCPU::initiateMemRead(Addr addr, unsigned size,
                                 Request::Flags flags,
                                 const std::vector<bool>& byte_enable)
{
    ...
    RequestPtr req = std::make_shared<Request>(
        addr, size, flags, dataMasterId(), pc, thread->contextId());
and from gem5 functional vs atomic vs timing memory requests we remember that initiateMemRead is actually started from the initiateAcc instruction definitions for timing:
Fault LDRWL64_LIT::initiateAcc(ExecContext *xc,
        Trace::InstRecord *traceData) const
{
    ...
    fault = initiateMemRead(xc, traceData, EA, Mem, memAccessFlags);
From this we see that initiateAcc memory instructions are basically extracting the required information for the request, notably the address EA and flags.

MSHR

Mentioned at: http://pages.cs.wisc.edu/~swilson/gem5-docs/gem5MemorySystem.html
Each cache object owns a MSHRQueue:
class BaseCache : public ClockedObject
{
    /** Miss status registers */
    MSHRQueue mshrQueue;

BaseCache is the base class of Cache and NoncoherentCache.
MSHRQueue is a Queue of MSHR:
class MSHRQueue : public Queue<MSHR>
and Queue is also a gem5 class under src/mem/cache/queue.hh.
CommMonitor

You can place this SimObject in between two ports to get extra statistics about the packets that are going through.
It only works on timing requests, and does not seem to dump any memory values, only add extra statistics.
For example, the patch patches/manual/gem5-commmonitor-se.patch hacks a CommMonitor between the CPU and the L1 cache on top of gem5 1c3662c9557c85f0d25490dc4fbde3f8ab0cb350:
patch -d "$(./getvar gem5_source_dir)" -p 1 < patches/manual/gem5-commmonitor-se.patch
That patch was done largely by copying what fs.py --memcheck does with a MemChecker object.
You can then run with:
./run \
  --arch aarch64 \
  --emulator gem5 \
  --userland userland/arch/aarch64/freestanding/linux/hello.S \
  -- \
  --caches \
  --cpu-type TimingSimpleCPU \
;
and now we have some new extra histogram statistics such as:
system.cpu.dcache_mon.readBurstLengthHist::samples 1
One neat thing about this is that it is agnostic to the memory object type, so you don’t have to recode those statistics for every new type of object that operates on memory packets.
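The statistics themselves are ordinary gem5 Stats objects. A hedged sketch of how a monitor-like object can sample a histogram as packets pass through it; the class and member names here are illustrative rather than the exact CommMonitor code:

// Illustrative pass-through observer sampling read packet sizes.
class PacketSizeMonitor
{
  public:
    void regStats()
    {
        readBurstLengthHist
            .init(8)                                  // number of buckets
            .name("readBurstLengthHist")
            .desc("Histogram of read burst lengths");
    }

    // Called on the path that forwards the packet from one port to the other.
    void observe(PacketPtr pkt)
    {
        if (pkt->isRead())
            readBurstLengthHist.sample(pkt->getSize());
    }

  private:
    Stats::Histogram readBurstLengthHist;
};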
The interaction uses the Python C extension interface https://docs.python.org/2/extending/extending.html through the pybind11 helper library: https://github.com/pybind/pybind11
The main is at: src/sim/main.cc. It calls:
Tested at gem5 b4879ae5b0b6644e6836b0881e4da05c64a6550d.
m5.objects module

All SimObjects seem to be automatically added to the m5.objects namespace, and this is done in a very convoluted way, so let's try to understand it a bit:
gem5 is an event based simulator, and as such the event queue is one of the crucial elements in the system.
Then, once we had that, the most perfect thing ever would be to make the full event graph containing which events schedule which events!
Let’s now analyze every single event on a minimal gem5 syscall emulation mode in the simplest CPU that we have:
Tested in gem5 12c917de54145d2d50260035ba7fa614e25317a3.
Let’s have a closer look at the initial magically scheduled events of the simulation.
Inside AtomicSimpleCPU::tick() we saw previously that the reschedule happens at:
It will be interesting to see how AtomicSimpleCPU makes memory access on GDB and to compare that with TimingSimpleCPU.
Happens on EmulationPageTable, and seems to happen atomically without making any extra memory requests.
Now, let’s move on to TimingSimpleCPU, which is just like AtomicSimpleCPU internally, but now the memory requests don’t actually finish immediately: gem5 CPU types!
Schedules TimingSimpleCPU::fetch through:
Backtrace:
This is just the startup of the second rank, see: TimingSimpleCPU analysis #1.
From the timing we know what that one is: the end of time exit event, like for AtomicSimpleCPU.
Executes TimingSimpleCPU::fetch().
Schedules DRAMCtrl::processNextReqEvent through:
Schedules BaseXBar::Layer::releaseLayer through:
Executes DRAMCtrl::processNextReqEvent.
Schedules DRAMCtrl::Rank::processActivateEvent through:
Schedules DRAMCtrl::processRespondEvent through:
Schedules DRAMCtrl::processNextReqEvent through:
Executes DRAMCtrl::Rank::processActivateEvent.
Schedules DRAMCtrl::Rank::processPowerEvent through:
Executes DRAMCtrl::Rank::processPowerEvent.
Executes BaseXBar::Layer<SrcType, DstType>::releaseLayer.
Executes DRAMCtrl::processNextReqEvent().
Executes DRAMCtrl::processRespondEvent().
Schedules PacketQueue::processSendEvent() through:
Executes PacketQueue::processSendEvent().
Schedules PacketQueue::processSendEvent through:
Schedules BaseXBar::Layer<SrcType, DstType>::releaseLayer through:
Executes BaseXBar::Layer<SrcType, DstType>::releaseLayer.
Executes PacketQueue::processSendEvent.
Schedules TimingSimpleCPU::IcachePort::ITickEvent::process() through:
Executes TimingSimpleCPU::IcachePort::ITickEvent::process().
Schedules DRAMCtrl::processNextReqEvent through:
Schedules BaseXBar::Layer<SrcType, DstType>::releaseLayer through:
Execute DRAMCtrl::processNextReqEvent.
Schedule DRAMCtrl::processRespondEvent().
One important thing we want to check now, is how the memory reads are going to make the processor stall in the middle of an instruction.
Let’s just add --caches to gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis to see if things go any faster, and add Cache to --trace as in:
Contrast this with the non --cache version seen at TimingSimpleCPU analysis #5 in which DRAM only actually reads the 4 required bytes.
The only cryptic thing about the messages is the IF flag, but good computer architects would have guessed correctly that it means "instruction fetch", and src/mem/packet.cc confirms:
At:
#7 0: Cache: system.cpu.icache: access for ReadReq [78:7b] IF miss
#8 0: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 scheduled @ 1000
#9 1000: Event: system.cpu.icache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 59 executed @ 1000
#10 1000: Cache: system.cpu.icache: sendMSHRQueuePacket: MSHR ReadReq [78:7b] IF
#12 1000: Cache: system.cpu.icache: createMissPacket: created ReadCleanReq [40:7f] IF from ReadReq [78:7b] IF
we can briefly see the gem5 MSHR doing its thing.
At time 0, the CPU icache wants to read, so it creates a packet that reads 4 bytes only ([78:7b]) for the instruction, and that goes into the MSHR, to be treated in a future event.
At 1000, the future event is executed, and so it reads the original packet from the MSHR, and uses that to create a new request [40:7f] which gets forwarded.
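The [40:7f] range is just cache-line alignment arithmetic. A minimal sketch, assuming 64-byte cache lines (which is what the 0x40..0x7f span implies):

#include <cassert>
#include <cstdint>

int main()
{
    const uint64_t blkSize = 64;                  // assumed cache line size
    uint64_t reqAddr = 0x78;                      // CPU asked for [78:7b], 4 bytes
    uint64_t blkAddr = reqAddr & ~(blkSize - 1);  // align down -> 0x40
    uint64_t blkEnd = blkAddr + blkSize - 1;      // -> 0x7f
    assert(blkAddr == 0x40 && blkEnd == 0x7f);    // the ReadCleanReq [40:7f]
    return 0;
}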
MOESI cache coherence protocol: https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/cache/cache_blk.hh#L352
The actual representation is done via separate state bits: https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/cache/cache_blk.hh#L66 and MOESI appears explicitly only on the pretty printing.
This pretty printing appears for example in the --trace Cache lines as shown at gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and with a few more transitions visible at Section 19.22.4.4, “gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs”.
It would be amazing to analyze a simple example with interconnect packets possibly invalidating caches of other CPUs.
Like gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs but with gem5 TimingSimpleCPU and userland/c/atomic/aarch64_add.c:
./build-userland --arch aarch64 --optimization-level 3 --userland-build-id o3
./run \
  --arch aarch64 \
  --cli-args '2 1000' \
  --cpus 3 \
  --emulator gem5 \
  --trace FmtFlag,CacheAll,DRAM,Event,ExecAll,SimpleCPU,XBar \
  --userland userland/c/atomic/aarch64_add.c \
  --userland-build-id o3 \
  -- \
  --caches \
  --cpu-type TimingSimpleCPU \
;
This is arguably the best experiment to study the gem5 crossbar interconnect.
We increase the loop count because a smaller count did not show memory conflicts. The output is:
expect 200
global 147
Let’s double check what it compiles to with disas:
./disas --arch aarch64 --userland userland/c/atomic/aarch64_add.c --userland-build-id o3 my_thread_main
which contains:
0x0000000000400a70 <+0>:  03 00 40 f9  ldr x3, [x0]
0x0000000000400a74 <+4>:  63 01 00 b4  cbz x3, 0x400aa0 <my_thread_main+48>
0x0000000000400a78 <+8>:  82 00 00 d0  adrp x2, 0x412000 <malloc@got.plt>
0x0000000000400a7c <+12>: 42 a0 01 91  add x2, x2, #0x68
0x0000000000400a80 <+16>: 00 00 80 d2  mov x0, #0x0 // #0
0x0000000000400a84 <+20>: 1f 20 03 d5  nop
0x0000000000400a88 <+24>: 41 00 40 f9  ldr x1, [x2]
0x0000000000400a8c <+28>: 21 04 00 91  add x1, x1, #0x1
0x0000000000400a90 <+32>: 41 00 00 f9  str x1, [x2]
0x0000000000400a94 <+36>: 00 04 00 91  add x0, x0, #0x1
0x0000000000400a98 <+40>: 7f 00 00 eb  cmp x3, x0
0x0000000000400a9c <+44>: 68 ff ff 54  b.hi 0x400a88 <my_thread_main+24> // b.pmore
0x0000000000400aa0 <+48>: 00 00 80 52  mov w0, #0x0 // #0
0x0000000000400aa4 <+52>: c0 03 5f d6  ret
Grepping the logs for my_thread_main+24 shows where the first non-atomic interleavings happen:
471039000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+24 : ldr x1, [x2] : MemRead : D=0x000000000000002f A=0x412068 flags=(IsInteger|IsMemRef|IsLoad)
471034000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+24 : ldr x1, [x2] : MemRead : D=0x000000000000002f A=0x412068 flags=(IsInteger|IsMemRef|IsLoad)
471059000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+44 : b.hi <my_thread_main+24> : IntAlu : flags=(IsControl|IsDirectControl|IsCondControl)
471070000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+44 : b.hi <my_thread_main+24> : IntAlu : flags=(IsControl|IsDirectControl|IsCondControl)
471071000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+24 : ldr x1, [x2] : MemRead : D=0x0000000000000030 A=0x412068 flags=(IsInteger|IsMemRef|IsLoad)
after a long string of cpu1 hits, since CPU1 was forked first and therefore had more time to run that operation.
From those and the surrounding logs we deduce that:
the shared address of interest is 0x412068
the physical address is 2068
it fits into the cache line for 2040:207f
With that guide, we look at the fuller logs around that region of interest. We start at the first ifetch that CPU2 does for our LDR of interest at 0x400a88:
471033000: SimpleCPU: system.cpu2: Fetch
471033000: SimpleCPU: system.cpu2: Translating address 0x400a88
Things get a bit interleaved with CPU1, but soon afterwards we see the miss forwarding via gem5 MSHR as in gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches:
471034000: Cache: system.cpu2.dcache: access for ReadReq [2068:206f] D=b0d989c328560000 ptr=0x5628c3d26f00 miss
471034000: CachePort: system.cpu2.dcache.mem_side: Scheduling send event at 471035000
471034000: Event: system.cpu2.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 140 scheduled @ 471035000
Before the request moves on, some CPU1 action happens: a CPU1 STR finished! It hit the cache, and now we know the cache state: M:
471034000: Cache: system.cpu1.dcache: access for WriteReq [2068:206f] D=2f00000000000000 ptr=0x5628c3d26c80 hit state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
471034000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+32 : str x1, [x2] : MemWrite : D=0x000000000000002f A=0x412068 flags=(IsInteger|IsMemRef|IsStore)
After this is done, CPU2 dcache finally decides that it is time to forward its request, and now we see the crux of this experiment happen.
First createMissPacket creates a new packet for the cache request, and then it sends that packet into CoherentXBar.
471035000: Event: system.cpu2.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 140 executed @ 471035000 +471035000: Cache: system.cpu2.dcache: sendMSHRQueuePacket: MSHR ReadReq [2068:206f] D=b0d989c328560000 ptr=0x5628c3d26f00 +471035000: Cache: system.cpu2.dcache: createMissPacket: created ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 from ReadReq [2068:206f] D=b0d989c328560000 ptr=0x5628c3d26f00 +471035000: CoherentXBar: system.membus: recvTimingReq: src system.membus.slave[10] packet ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80+
Now, the SnoopFilter which lives inside the crossbar decides if any other CPUs care about that address:
471035000: SnoopFilter: system.membus.snoop_filter: lookupRequest: src system.membus.slave[10] packet ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 +471035000: SnoopFilter: system.membus.snoop_filter: lookupRequest: SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000 +471035000: SnoopFilter: system.membus.snoop_filter: lookupRequest: new SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000 +471035000: CoherentXBar: system.membus: recvTimingReq: src system.membus.slave[10] packet ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 SF size: 1 lat: 1+
and the answer is yes: CPU1 does care about that address obviously! So the packet is forwarded as is to CPU1:
+471035000: CoherentXBar: system.membus: forwardTiming for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 +471035000: CacheVerbose: system.cpu1.dcache: recvTimingSnoopReq: for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 +471035000: CacheVerbose: system.cpu1.dcache: handleSnoop: for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 +471035000: Cache: system.cpu1.dcache: handleSnoop: snoop hit for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80, old state is state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0 +471035000: Cache: system.cpu1.dcache: new state is state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0 +471035000: Cache: system.cpu1.dcache: doTimingSupplyResponse: for ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 +471035000: CacheVerbose: system.cpu1.dcache: doTimingSupplyResponse: created response: ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 tick: 471044000 +471035000: Event: system.cpu1.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 94 scheduled @ 471044000 +471035000: CoherentXBar: system.membus: recvTimingReq: Not forwarding ReadSharedReq [2040:207f] D=c0ae37c4285600005b323036383a323036665d20443d62306439383963333238353630303030207074723d307835363238633364323666303000000000000000 ptr=0x5628c3d26e80 +471035000: Event: system.membus.reqLayer0.wrapped_function_event: EventFunctionWrapped 164 scheduled @ 471036000 +471035000: BaseXBar: system.membus.reqLayer0: The crossbar layer is now busy from tick 471035000 to 471036000+
and from this we see that this read request from CPU2 made the cache line in CPU1 go from M to O!
Then, the CPU1 dcache actually goes ahead and creates a response for CPU2, since it has the data. This response is sent back to the crossbar, which will forward it back to CPU2.
This also makes the crossbar not forward the original request to DRAM, as seen in the "Not forwarding" message.
This reply from CPU1 reaches the crossbar at:
+471044000: Event: system.cpu1.dcache.mem_side-MemSidePort.wrapped_function_event: EventFunctionWrapped 94 executed @ 471044000 +471044000: CoherentXBar: system.membus: recvTimingSnoopResp: src system.membus.slave[6] packet ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 +471044000: SnoopFilter: system.membus.snoop_filter: updateSnoopResponse: rsp system.membus.slave[6] req system.membus.slave[10] packet ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 +471044000: SnoopFilter: system.membus.snoop_filter: updateSnoopResponse: old SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000 +471044000: SnoopFilter: system.membus.snoop_filter: updateSnoopResponse: new SF value 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101000 +471044000: CoherentXBar: system.membus: recvTimingSnoopResp: src system.membus.slave[6] packet ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 FWD RESP +471044000: Event: system.membus.slave[10]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 186 scheduled @ 471046000 +471044000: Event: system.membus.respLayer10.wrapped_function_event: EventFunctionWrapped 187 scheduled @ 471049000 +471044000: BaseXBar: system.membus.respLayer10: The crossbar layer is now busy from tick 471044000 to 471049000+
and finally, at long last, CPU2 receives the snoop reply that was created in CPU1 and sent back through the crossbar, and the LDR completes:
+471046000: Event: system.membus.slave[10]-RespPacketQueue.wrapped_function_event: EventFunctionWrapped 186 executed @ 471046000 +471046000: Cache: system.cpu2.dcache: recvTimingResp: Handling response ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 +471046000: Cache: system.cpu2.dcache: Block for addr 0x2040 being updated in Cache +471046000: CacheRepl: system.cpu2.dcache: Replacement victim: state: 0 (I) valid: 0 writable: 0 readable: 0 dirty: 0 | tag: 0xffffffffffffffff set: 0x81 way: 0 +471046000: Cache: system.cpu2.dcache: Block addr 0x2040 (ns) moving from state 0 (I) to state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0 +471046000: Cache: system.cpu2.dcache: serviceMSHRTargets: updated cmd to ReadRespWithInvalidate [2068:206f] D=2f00000000000000 ptr=0x5628c3d26f00 +471046000: Event: system.cpu2.dcache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 138 scheduled @ 471047000 +471046000: Cache: system.cpu2.dcache: processing deferred snoop... +471046000: CacheVerbose: system.cpu2.dcache: handleSnoop: for UpgradeReq [2040:207f] D= ptr=0x5628c2d37b80 +471046000: Cache: system.cpu2.dcache: handleSnoop: snoop hit for UpgradeReq [2040:207f] D= ptr=0x5628c2d37b80, old state is state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0 +471046000: Cache: system.cpu2.dcache: new state is state: 0 (I) valid: 0 writable: 0 readable: 0 dirty: 0 | tag: 0xffffffffffffffff set: 0x81 way: 0 +471046000: CacheVerbose: system.cpu2.dcache: recvTimingResp: Leaving with ReadResp [2040:207f] D=700640000000000070064000000000000000000000000000000000000000000000000000000000002f0000000000000000000000000000000000000000000000 ptr=0x5628c3d27000 +471047000: Event: system.cpu2.dcache.cpu_side-CpuSidePort.wrapped_function_event: EventFunctionWrapped 138 executed @ 471047000 +471047000: SimpleCPU: system.cpu2.dcache_port: Received load/store response 0x2068 +471047000: Event: Event_136: Timing CPU dcache tick 136 scheduled @ 471047000 +471047000: Event: Event_136: Timing CPU dcache tick 136 executed @ 471047000 +471034000: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+24 : ldr x1, [x2] : MemRead : D=0x000000000000002f A=0x412068 flags=(IsInteger|IsMemRef|IsLoad)+
We note therefore that no DRAM access was involved: one cache services the other directly!
Tested on LKMC d429552cdeb0fc0a08cff8e627bf501eaffb068f + 1, gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772.
Now let's do the exact same thing we did for gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs, but with Ruby rather than the classic system, and with TimingSimpleCPU (atomic does not work with Ruby).
The events for the Atomic CPU were pretty simple: basically just ticks.
TODO like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard but with the hazard.
Like gem5 event queue MinorCPU syscall emulation freestanding example analysis but even more complex since for the gem5 DerivO3CPU!
This section and children are tested at LKMC 144a552cf926ea630ef9eadbb22b79fe2468c456.
Let’s have a look at the arguably simplest example userland/arch/aarch64/freestanding/linux/hazardless.S.
so we observe that the first two instructions arrived, and the CPU noticed that 0x400080 hasn’t been fetched yet.
Then for several cycles that follow, the fetch stage just says that it is blocked on data returning:
Now let’s do the same as in gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazardless but with a hazard: userland/arch/aarch64/freestanding/linux/hazard.S.
Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard but now with an LDR stall: userland/arch/aarch64/freestanding/linux/stall.S.
Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall but now with an LDR stall: userland/arch/aarch64/freestanding/linux/stall-gain.S.
Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall-gain but now with some dependencies after the LDR: userland/arch/aarch64/freestanding/linux/stall-hazard4.S.
Now let’s try to see some Speculative execution in action with userland/arch/aarch64/freestanding/linux/speculative.S.
This is one of the parts of gem5 that rely on semi-useless code generation inside the .isa sublanguage.
We also notice that the key argument passed to those instructions is of type ExecContext, which is discussed further at: Section 19.22.6.3, “gem5 ExecContext”.
The file is an include so that compilation can be split up into chunks by the autogenerated includers
Tested in gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.
execute vs initiateAcc vs completeAcc

These are the key methods defined in instruction definitions, so let's see when each one gets called and what they do, more or less.
This can be seen concretely in GDB from the analysis done at: TimingSimpleCPU analysis: LDR stall and for more memory details see gem5 functional vs atomic vs timing memory requests.
completeAcc

completeAcc is boring on most simple store memory instructions, e.g. a simple STR:
Some gem5 instructions break down into multiple microops.
ThreadContext vs ThreadState vs ExecContext vs Process

These classes get used everywhere, and they have a somewhat convoluted relation with one another, so let's figure out this mess.
This section and all children tested at gem5 b1623cb2087873f64197e503ab8894b5e4d4c7b4.
ThreadContext

As we delve into more details below, we will reach the following conclusion: a ThreadContext represents one thread of a CPU with multiple Hardware threads.
Essentially all methods of the base ThreadContext are pure virtual.
SimpleThread

SimpleThread storage is defined on BaseSimpleCPU for simple CPUs like AtomicSimpleCPU:
O3ThreadContext

Instantiation happens in the FullO3CPU constructor:
ThreadState

One is owned per ThreadContext.
ExecContext

ExecContext gets used in gem5 instruction definitions, e.g.:
This makes sense, since each ThreadContext represents one CPU register set, and therefore needs a separate ExecContext which allows instruction implementations to access those registers.
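As a rough sketch of what that looks like from the instruction side, here is a toy execute() in the style of the generated ArmISAInst code; the class name and operand indices are made up for illustration, and only the ExecContext accessors are the point:

Fault
MyAddInst::execute(ExecContext *xc, Trace::InstRecord *traceData) const
{
    // Operands are resolved through the ExecContext, never directly
    // against a concrete CPU model's register file.
    uint64_t op1 = xc->readIntRegOperand(this, 0);
    uint64_t op2 = xc->readIntRegOperand(this, 1);
    uint64_t dest = op1 + op2;
    xc->setIntRegOperand(this, 0, dest);
    if (traceData)
        traceData->setData(dest);
    return NoFault;
}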
ExecContext::readIntRegOperand register resolution

Let's have a look at how ExecContext::readIntRegOperand actually matches registers to decoded register IDs, since it is not obvious.
Process

The Process class is used only for gem5 syscall emulation mode, and it represents a process like a Linux userland process, in addition to any further gem5 specific data needed to represent the process.
Each instruction is marked with a class, and each class can execute in a given functional unit.
MinorCPU default functional units

Which units are available is visible for example on the gem5 config.ini of a gem5 MinorCPU run. Functional units are not present in simple CPUs like gem5 TimingSimpleCPU.
On gem5 3ca404da175a66e0b958165ad75eb5f54cb5e772, after running:
gem5 uses a ton of code generation, which makes the project horrendous:
But it has been widely overused to insanity. It likely also exists partly because when the project started in 2003 C++ compilers weren’t that good, so you couldn’t rely on features like templates that much.
Generated code at: build/<ISA>/config/the_isa.hh which e.g. for ARM contains:
gem5 moves a bit slowly, and if your host compiler is very new, the gem5 build might be broken for it, e.g. this was the case for Ubuntu 19.10 with GCC 9 and gem5 62d75e7105fe172eb906d4f80f360ff8591d4178 from Dec 2019.
E.g. src/cpu/decode_cache.hh includes:
Upstream request: https://gem5.atlassian.net/browse/GEM5-469
userland/c/atomic/: files in this directory use the same technique as atomic.cpp, i.e. with one special case per file.
Maybe userland/c/atomic.c should be deprecated in favor of those more minimal ones.
This was added because the C++ pre-main is too bloated, especially when we turn on a gazillion gem5 logs; it makes me want to cry.
And we want a single operation per test, rather than several as in atomic.c, because when using gem5 we want absolute control over the microbenchmark.
Demonstrates atomic_int and thrd_create.
C version at: atomic.c.
In this set of examples, we exemplify various synchronization mechanisms, including assembly specific ones, by using the convenience of C++ multithreading:
rootfs_overlay/lkmc/nodejs/http.js: http module to create a simple HTTP server: https://nodejs.org/api/http.html
https://courses.cs.washington.edu/courses/cse378/09wi/lectures/lec15.pdf contains some of the first pictures you should see.
In a direct-mapped cache architecture (every address has a single possible block), a memory address can be broken up into:

+-----+-------+--------------+
|     |       |              | full address
+-----+-------+--------------+
|     |       |              |
| tag | index | block offset |

where:
index: determines in which block the address will go. This is the "index/ID of the block" it will go into!
tag: allows us to differentiate between multiple addresses that have the same index
We really want the tag to be the higher bits, so that consecutive blocks can be present in the cache at once.
block offset: address within the cache line. Not used to find cache blocks at all! Only used to find the data within the cache line.
If the cache is set associative, we simply make the index smaller and add bits to the tag.
For example, for a 2-way associative cache, we remove one bit from the index and add it to the tag.
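A small sketch of the decomposition, assuming for concreteness 64-byte lines and 256 sets (example parameters only). With those numbers, the physical address 0x2068 from the earlier cache traces decomposes into tag 0, index 0x81 and offset 0x28, which happens to match the "tag: 0 set: 0x81" printed by the cache logs.

#include <cstdint>
#include <cstdio>

int main()
{
    const uint64_t blockBits = 6;   // 64-byte lines -> 6 offset bits
    const uint64_t indexBits = 8;   // 256 sets      -> 8 index bits (example)
    uint64_t addr = 0x2068;

    uint64_t offset = addr & ((1ULL << blockBits) - 1);
    uint64_t index = (addr >> blockBits) & ((1ULL << indexBits) - 1);
    uint64_t tag = addr >> (blockBits + indexBits);

    // For a 2-way set associative cache of the same size we would use
    // indexBits = 7 and the freed bit would become part of the tag.
    printf("tag=%#lx index=%#lx offset=%#lx\n", tag, index, offset);
    return 0;
}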
Even if caches are coherent, this is still not enough to avoid data race conditions, because coherence does not enforce atomicity of read-modify-write sequences. This is for example shown at: Detailed gem5 analysis of how data races happen.
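A minimal C++ illustration of that point, along the lines of what the userland/c/atomic examples do: with coherent caches both threads always see up-to-date values, yet the plain read-modify-write still loses updates, while the atomic one does not.

#include <atomic>
#include <cstdio>
#include <thread>

int plain = 0;                    // plain read-modify-write: races
std::atomic<int> atomic_cnt{0};   // atomic read-modify-write: does not

void worker(int n)
{
    for (int i = 0; i < n; ++i) {
        plain = plain + 1;        // load, add, store: updates can be lost
        atomic_cnt.fetch_add(1);  // single atomic RMW: never loses updates
    }
}

int main()
{
    const int n = 1000000;
    std::thread t1(worker, n), t2(worker, n);
    t1.join();
    t2.join();
    // plain typically ends up below 2 * n, atomic_cnt is exactly 2 * n.
    printf("expect %d plain %d atomic %d\n", 2 * n, plain, atomic_cnt.load());
    return 0;
}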
According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf "memory consistency" is about ordering requirements of different memory addresses.
This is represented explicitly in C++ for example C++ std::memory_order.
According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf, sequential consistency is the strongest possible consistency: everything is nicely ordered as you'd expect.
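In C++ the ordering requirement is chosen per atomic operation: memory_order_seq_cst is the default and corresponds to the sequential consistency just mentioned, while weaker orders such as acquire/release or relaxed allow more reordering. A minimal sketch:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data{0};
std::atomic<bool> ready{false};

void producer()
{
    data.store(42, std::memory_order_relaxed);
    // Release: everything written before this store becomes visible to an
    // acquire load that observes `true`.
    ready.store(true, std::memory_order_release);
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) {}
    // Guaranteed with acquire/release (or the default seq_cst). With
    // relaxed on both sides it could fail on a weakly ordered CPU like ARM.
    assert(data.load(std::memory_order_relaxed) == 42);
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}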
Either they can snoop only control, or both control and data can be snooped.
If data snoops are not possible, then data must always go to DRAM first.
Mentioned at:
why PrWr stays in invalid? Why do writes always go to memory? Why not wait until eviction?
TODO gem5 concrete example.
TODO understand well why those are needed.
gem5 12c917de54145d2d50260035ba7fa614e25317a3 has two Ruby MESI models implemented: MESI_Two_Level and MESI_Three_Level.
https://en.wikipedia.org/wiki/MOSI_protocol The critical MSI vs MOSI section was a bit bogus though: https://en.wikipedia.org/w/index.php?title=MOSI_protocol&oldid=895443023 but I edited it :-)
This therefore saves one memory write through and its bus traffic.
MESI cache coherence protocol + MOSI cache coherence protocol, not much else to it!
In gem5 9fc9c67b4242c03f165951775be5cd0812f2a705, MOESI is the default cache coherency protocol of the classic memory system as shown at Section 19.22.4.3.1, “What is the coherency protocol implemented by the classic cache system in gem5?”.
A good and simple example showing several MOESI transitions in the classic memory model can be seen at: Section 19.22.4.4, “gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs”.
gem5 12c917de54145d2d50260035ba7fa614e25317a3 has several Ruby MOESI models implemented: MOESI_AMD_Base, MOESI_CMP_directory, MOESI_CMP_token and MOESI_hammer.