From 0f1178c3aedf8239948724ad396b828ac63a5ef2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Ciro=20Santilli=20=E5=85=AD=E5=9B=9B=E4=BA=8B=E4=BB=B6=20?= =?UTF-8?q?=E6=B3=95=E8=BD=AE=E5=8A=9F?=
Date: Wed, 18 Dec 2019 00:00:00 +0000
Subject: [PATCH] 5e39c1b911aed6b0ab201763246fa8a204e5e7ae
---
 index.html | 452 +++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 425 insertions(+), 27 deletions(-)

diff --git a/index.html b/index.html
index a180df3..b332b94 100644
--- a/index.html
+++ b/index.html
@@ -1006,6 +1006,11 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b
  • 15.21.1. vmlinux vs bzImage vs zImage vs Image
  • +
  • 15.22. Kernel modules + +
  • 16. Xen
  • @@ -1385,7 +1390,15 @@ body.book #toc,body.book #preamble,body.book h1.sect0,body.book .sect1>h2{page-b -
  • 22.7. Linux system calls
  • +
  • 22.7. Linux system calls + +
  • 22.8. Linux calling conventions
  • -
  • 27.8.3. ARM multicore +
  • 27.8.3. ARM baremetal multicore
    -

    To debug these instructoins, you can see the register values in GDB with:

    +

    To debug these instructions, you can see the register values in GDB with:

    @@ -24940,13 +25070,75 @@ When instructions do not interpret this operand encoding as the zero register, u

    I’m not sure how much stdlib functionality is supposed to work without the pre-main stuff, but I guess we’ll just have to find out!

    +
    +

    I was going to ask the following question (written in Markdown), but noticed halfway through that:

    +
    +
    +
      +
    • +

      without -static, I see a bunch of dynamic loader instructions, so not much is gained

      +
    • +
    • +

      with -static, the program segfaults, including on the host, with this backtrace:

      +
      +
      +
      #0  0x0000000000429625 in _IO_cleanup ()
      +#1  0x0000000000400c72 in __run_exit_handlers ()
      +#2  0x0000000000400caa in exit ()
      +#3  0x0000000000400a01 in _start () at exit.S:4
      +
      +
      +
    • +
    +
    +
    +

    so I didn’t really have a good question.

    +
    +
    +

    The Markdown question that was almost asked:

    +
    +
    +
    +
    When working in emulators, I often want to keep my workloads as small as possible to more easily study instruction traces and reproduce bugs.
    +
    +One of the ways I often want to do that, especially when doing [user mode simulations](https://wiki.debian.org/QemuUserEmulation), is by not running [the code that normally runs before main](https://stackoverflow.com/questions/53570678/what-happens-before-main-in-c) so that I can start directly in the instructions of interest that I control myself, which can be achieved with the `gcc -nostartfiles` option and by starting the program directly at `_start`.
    +
    +Here is a tiny example that calls just `exit` from the C standard library:
    +
    +exit.S
    +
    +```
    +.global _start
    +_start:
    +    mov $0, %rdi
    +    call exit
    +
    +```
    +
    +Compile and run with:
    +
    +```
    +gcc -ggdb3 -nostartfiles -static -o exit.out exit.S
    +qemu-x86_64 -d in_asm exit.out
    +```
    +
    +However, for programming convenience, and to potentially keep my examples more OS portable, I would like to avoid making raw system calls, which would of course work, by using C standard library functions instead.
    +
    +But I'm afraid that some of those C standard library functions will fail in subtle ways because I have skipped required initialization steps that would normally happen before `main`.
    +
    +Is there an easy way to determine which functions I can and cannot use, in case there are any that I can't use?
    +
    +
    @@ -25344,6 +25536,81 @@ zmmintrin.h AVX512
  • +
    +

    22.7.1. futex system call

    +
    +

    This is how threads either:

    +
    +
    +
      +
    • +

      request the kernel to sleep until they are woken up by other threads

      +
    • +
    • +

      request the kernel to wake up other threads that are waiting on a given futex

      +
    • +
    +
    +
    +

    This syscall is rarely used on its own, and there isn’t even a glibc wrapper for it: you almost always just want to use the pthreads or C++ multithreading wrappers which use it for you to implement higher level constructs like mutexes.

    +
    +
    +

    Futexes are a bit complicated, because in order to achieve their efficiency, basically nothing is guaranteed: the wait might not wait, and the wakes might not wake. So you are basically forced to use atomic operations on the futex memory address in order to be sure of anything.
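
The "nothing is guaranteed" point can be observed even without threads. The following is a minimal single-threaded sketch, not taken from this repository: `futex_raw`, `wait_did_not_wait` and `wake_with_no_waiters` are hypothetical names introduced here. It assumes Linux with glibc, and relies on the documented behaviour that `FUTEX_WAIT` fails with `EAGAIN` if the futex word does not hold the expected value, and that `FUTEX_WAKE` returns the number of waiters it actually woke up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static_assert(sizeof(uint32_t) == 4, "futex words are 32 bits wide");

/* Hypothetical helper: glibc has no futex wrapper, so go through syscall(). */
static long futex_raw(uint32_t *uaddr, int op, uint32_t val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* "The wait might not wait": we ask to sleep while the word is 0,
 * but the word is already 1, so the kernel refuses to sleep.
 * Returns 1 if the call failed immediately with EAGAIN. */
int wait_did_not_wait(void) {
    uint32_t word = 1;
    errno = 0;
    long ret = futex_raw(&word, FUTEX_WAIT, 0);
    return ret == -1 && errno == EAGAIN;
}

/* "The wakes might not wake": nobody is waiting on this word,
 * so the wake is a no-op. Returns the number of woken waiters. */
long wake_with_no_waiters(void) {
    uint32_t word = 0;
    return futex_raw(&word, FUTEX_WAKE, 1);
}
```

This is why callers always re-check the futex word with an atomic load after every wait and before every sleep, instead of trusting the return of the syscall alone.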

    +
    +
    +

    Minimal examples:

    +
    +
    +
      +
    • +

      lkmc/futex.h: our futex wrapper

      +
    • +
    • +

      userland/linux/futex.c: minimal example, the main thread:

      +
      +
        +
      • +

        spawns a child

        +
      • +
      • +

        the child waits on a futex

        +
      • +
      • +

        the main thread sleeps for one second

        +
      • +
      • +

        the main thread wakes up the child

        +
      • +
      • +

        the child returns

        +
        +

        So what you see is:

        +
        +
        +
        +
        main start
        +child start
        +[wait 1s]
        +main after sleep
        +child end
        +
        +
        +
      • +
      +
      +
    • +
    +
    +
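
The sequence of steps listed above can be sketched roughly as follows. This is not the content of userland/linux/futex.c or lkmc/futex.h: the names `futex_raw`, `futex_word` and `futex_demo` are hypothetical, and the sketch assumes Linux, glibc, pthreads and the GCC/Clang `__atomic` builtins.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static_assert(sizeof(uint32_t) == 4, "futex words are 32 bits wide");

/* 0: the child must wait, 1: the child may continue. */
static uint32_t futex_word;

/* Hypothetical helper: glibc has no futex wrapper, so go through syscall(). */
static long futex_raw(uint32_t *uaddr, int op, uint32_t val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void *child(void *arg) {
    (void)arg;
    puts("child start");
    /* FUTEX_WAIT only sleeps if the word is still 0, and it may also
     * return spuriously, so the value is re-checked in a loop. */
    while (__atomic_load_n(&futex_word, __ATOMIC_ACQUIRE) == 0)
        futex_raw(&futex_word, FUTEX_WAIT, 0);
    puts("child end");
    return NULL;
}

/* Runs the whole scenario. Returns 0 on success, -1 if the thread
 * could not be created. */
int futex_demo(void) {
    pthread_t thread;
    __atomic_store_n(&futex_word, 0, __ATOMIC_RELAXED);
    puts("main start");
    if (pthread_create(&thread, NULL, child, NULL) != 0)
        return -1;
    sleep(1);
    puts("main after sleep");
    /* Publish the new value first, then wake up at most one waiter. */
    __atomic_store_n(&futex_word, 1, __ATOMIC_RELEASE);
    futex_raw(&futex_word, FUTEX_WAKE, 1);
    pthread_join(thread, NULL);
    return 0;
}
```

Calling `futex_demo()` from `main` and linking with `-pthread` should be enough to try it out. The store before the wake is what makes the child's loop terminate even if the wake arrives before the child has gone to sleep.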
    +
    22.7.1.1. Userland mutex implementation
    +
    +

    The best article to understand spinlocks is: https://eli.thegreenplace.net/2018/basics-of-futexes/

    +
    +
    +

    The example in man futex is also a must.
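
As a rough illustration of the idea, here is a sketch of the classic three-state futex mutex, in the style popularized by Ulrich Drepper's "Futexes Are Tricky". It is not glibc's implementation nor code from this repository: `toy_mutex`, `futex_raw` and `cmpxchg` are hypothetical names, and the sketch assumes Linux and the GCC/Clang `__atomic` builtins.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static_assert(sizeof(uint32_t) == 4, "futex words are 32 bits wide");

/* state: 0 = unlocked, 1 = locked without waiters,
 * 2 = locked, possibly with waiters. */
typedef struct { uint32_t state; } toy_mutex;

static long futex_raw(uint32_t *uaddr, int op, uint32_t val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Compare-and-swap that returns the value seen before the operation. */
static uint32_t cmpxchg(uint32_t *p, uint32_t expected, uint32_t desired) {
    __atomic_compare_exchange_n(p, &expected, desired, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}

void toy_mutex_lock(toy_mutex *m) {
    uint32_t c = cmpxchg(&m->state, 0, 1);
    if (c == 0)
        return; /* Fast path: uncontended, no system call at all. */
    /* Contended: advertise that there may be waiters, then sleep. */
    if (c != 2)
        c = __atomic_exchange_n(&m->state, 2, __ATOMIC_SEQ_CST);
    while (c != 0) {
        /* Sleeps only if the state is still 2. */
        futex_raw(&m->state, FUTEX_WAIT, 2);
        c = __atomic_exchange_n(&m->state, 2, __ATOMIC_SEQ_CST);
    }
}

void toy_mutex_unlock(toy_mutex *m) {
    /* If the old state was 1, nobody is waiting: no system call. */
    if (__atomic_fetch_sub(&m->state, 1, __ATOMIC_SEQ_CST) != 1) {
        __atomic_store_n(&m->state, 0, __ATOMIC_SEQ_CST);
        futex_raw(&m->state, FUTEX_WAKE, 1);
    }
}
```

The point of the design is that the uncontended lock and unlock paths are pure atomic operations in userland, and the kernel is only entered when there is actual contention.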

    +
    +
    +

    22.8. Linux calling conventions

    @@ -28992,7 +29259,7 @@ AArch64, see Procedure Call Standard for the ARM 64-bit Architecture.

    24.7.1.1. ARM Large System Extensions (LSE)
    -

    Parent section: ARM multicore.

    +

    Parent section: ARM baremetal multicore.

    ARMv8 architecture reference manual db "ARMv8.1-LSE, ARMv8.1 Large System Extensions"

    @@ -30381,7 +30648,7 @@ IN: main
    -

    27.8.3. ARM multicore

    +

    27.8.3. ARM baremetal multicore

    Examples:

    @@ -30465,6 +30732,44 @@ IN: main

    The WFE and SEV instructions are just hints: a compliant implementation can treat them as NOPs.

    +

    Concrete examples of the instruction can be seen at:

    +
    +
    + +
    +

    However, likely no implementation does (TODO confirm), since:

    @@ -30484,7 +30789,34 @@ IN: main

    SEV is not the only thing that can wake up a WFE: it is only an explicit software way to do it. Notably, global monitor operations on memory accesses of regions marked by LDAXR and STLXR instructions can also wake up a core sleeping in WFE. This allows a spinlock release to automatically wake up cores sleeping in WFE, without the need for an explicit SEV.

    -

    WFE and SEV are usable from userland, and are part of a efficient spinlock implementation.

    +

    WFE and SEV are usable from userland, and are part of an efficient spinlock implementation, although maybe userland should never do that and just stick to mutexes?
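
To make the idea concrete, here is a minimal sketch of a spinlock that uses WFE and SEV as hints. It is an illustration under assumptions, not code from this repository or from the kernel: the `toy_spin_*` names are hypothetical, the `dsb ish` before `sev` is one conservative choice, and on non-AArch64 targets the hints compile to nothing, so the lock degrades to a plain busy loop.

```c
#include <assert.h>
#include <stdatomic.h>

static_assert(ATOMIC_INT_LOCK_FREE == 2,
              "the spinlock assumes lock-free int atomics");

/* locked: 0 = free, 1 = taken. */
typedef struct { atomic_int locked; } toy_spinlock;

/* Hint: wait until some event arrives. May return at any time,
 * and a compliant implementation may treat it as a NOP. */
static inline void wait_for_event(void) {
#if defined(__aarch64__)
    __asm__ volatile("wfe" ::: "memory");
#endif
}

/* Hint: make the unlock visible, then signal an event to all cores. */
static inline void send_event(void) {
#if defined(__aarch64__)
    __asm__ volatile("dsb ish\n\tsev" ::: "memory");
#endif
}

/* Returns 1 if the lock was acquired, 0 if it was already taken. */
int toy_spin_trylock(toy_spinlock *l) {
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

void toy_spin_lock(toy_spinlock *l) {
    /* Correctness never depends on WFE actually sleeping:
     * the lock word is re-checked after every wake up. */
    while (!toy_spin_trylock(l))
        wait_for_event();
}

void toy_spin_unlock(toy_spinlock *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
    send_event();
}
```

Because the loop always re-tests the lock word, the sketch stays correct whether WFE sleeps, returns spuriously, or is implemented as a NOP. The hints only affect how much power and bus traffic the waiting core burns.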

    +
    +
    +

    There is a control bit SCTLR_EL1.nTWE that determines whether WFE is trapped or not: if that bit is clear, then WFE is trapped and EL0 execution raises an exception to EL1. Linux v5.2.1 does not seem to trap however, tested with --trace ExecAll and the dump_regs kernel module in a full system simulation. But then, how does the kernel prevent CPUs from going to sleep randomly, and reschedule other tasks instead? Does the kernel check if CPUs are in WFE when it wakes up on the timer, and only then reschedule? This would allow userland to implement fast spinlocks if the spinlock returns faster than the timer. The kernel seems to set nTWE at:

    +
    +
    +

    include/asm/sysreg.h

    +
    +
    +
    +
    #define SCTLR_EL1_SET	(SCTLR_ELx_M    | SCTLR_ELx_C    | SCTLR_ELx_SA   |\
    +             ...
    +			 SCTLR_EL1_NTWE | SCTLR_ELx_IESB | SCTLR_EL1_SPAN |\
    +
    +
    +
    +

    and:

    +
    +
    +

    mm/proc.S

    +
    +
    +
    +
    	/*
    +	 * Prepare SCTLR
    +	 */
    +	mov_q	x0, SCTLR_EL1_SET
    +

    Quotes for the above ARMv8 architecture reference manual db G1.18.1 "Wait For Event and Send Event":

    @@ -30575,7 +30907,55 @@ IN: main
    -

    The best article to understand spinlocks is: https://eli.thegreenplace.net/2018/basics-of-futexes/

    +

    For how userland spinlocks and mutexes are implemented see Userland mutex implementation.

    +
    +
    +
    27.8.3.1.1. gem5 ARM WFE
    +
    +

    gem5 390a74f59934b85d91489f8a563450d8321b602d does not sleep on the first WFE on either syscall emulation or full system, because the code does:

    +
    +
    +
    +
    Fault WfeInst::execute(
    +    ExecContext *xc, Trace::InstRecord *traceData) const
    +{
    +[...]
    +    if (SevMailbox == 1) {
    +        SevMailbox = 0;
    +        PseudoInst::quiesceSkip(tc);
    +    } else if (tc->getCpuPtr()->getInterruptController(
    +                tc->threadId())->checkInterrupts(tc)) {
    +        PseudoInst::quiesceSkip(tc);
    +    } else {
    +        fault = trapWFx(tc, cpsr, scr, true);
    +        if (fault == NoFault) {
    +            PseudoInst::quiesce(tc);
    +        } else {
    +            PseudoInst::quiesceSkip(tc);
    +        }
    +    }
    +
    +
    +
    +

    where "quiesce" means "sleep" for laymen like Ciro, and quiesceSkip means don’t sleep.

    +
    +
    +

    SevMailbox is read from MISCREG_SEV_MAILBOX which is initialized to 1 at:

    +
    +
    +
    +
    ISA::clear()
    +{
    +[...]
    +    miscRegs[MISCREG_SEV_MAILBOX] = 1;
    +
    +
    +
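
The effect of that initial value can be seen with a toy model of the decision logic quoted above. This is a simplified sketch and not gem5 code: `toy_cpu` and `toy_wfe_sleeps` are hypothetical names, and the `trapWFx` branch is deliberately left out.

```c
#include <stdbool.h>

/* Minimal stand-in for the state that the WFE logic looks at. */
struct toy_cpu {
    int sev_mailbox;        /* models MISCREG_SEV_MAILBOX */
    bool interrupt_pending; /* models checkInterrupts() */
};

/* Returns true if the core would go to sleep (quiesce),
 * false if the WFE falls through (quiesceSkip). */
bool toy_wfe_sleeps(struct toy_cpu *cpu) {
    if (cpu->sev_mailbox == 1) {
        /* A pending event is consumed and the core does not sleep. */
        cpu->sev_mailbox = 0;
        return false;
    }
    if (cpu->interrupt_pending)
        return false;
    return true;
}
```

With `sev_mailbox` starting at 1, as in `ISA::clear()`, the first call returns false and only the second call returns true, which matches the observation that the first WFE does not sleep.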
    +
    @@ -31791,8 +32171,8 @@ instructions 124346081
    LKMC
    -Benchmark
    -Emulator
    +Benchmark build
    +Emulator command
    Loops
    Time (s)
    Instruction count
    @@ -31802,7 +32182,7 @@ instructions 124346081

    a18f28e263c91362519ef550150b5c9d75fa3679 + 1

    -

    userland/gcc/busy_loop.c -O0

    +

    userland/gcc/busy_loop.c -O0

    qemu --arch aarch64

    10^10

    68

    @@ -31811,7 +32191,7 @@ instructions 124346081

    a18f28e263c91362519ef550150b5c9d75fa3679 + 1

    -

    userland/gcc/busy_loop.c -O0

    +

    userland/gcc/busy_loop.c -O0

    gem5 --arch aarch64

    10^7

    100

    @@ -31820,7 +32200,7 @@ instructions 124346081

    a18f28e263c91362519ef550150b5c9d75fa3679 + 1

    -

    userland/gcc/busy_loop.c -O0

    +

    userland/gcc/busy_loop.c -O0

    gem5 --arch aarch64 -- --cpu-type MinorCPU --caches

    10^6

    31

    @@ -31829,7 +32209,7 @@ instructions 124346081

    a18f28e263c91362519ef550150b5c9d75fa3679 + 1

    -

    userland/gcc/busy_loop.c -O0

    +

    userland/gcc/busy_loop.c -O0

    gem5 --arch aarch64 -- --cpu-type DerivO3CPU --caches

    10^6

    52

    @@ -31838,7 +32218,7 @@ instructions 124346081

    a18f28e263c91362519ef550150b5c9d75fa3679 + 1

    -

    userland/gcc/busy_loop.c -O0

    +

    userland/gcc/busy_loop.c -O0

    gem5 --arch aarch64 --gem5-build-id MOESI_CMP_directory -- --cpu-type DerivO3CPU --caches --ruby

    1 * 1000000 = 10^6

    63

    @@ -31847,7 +32227,7 @@ instructions 124346081

    605448f07e6380634b1aa7e9732d111759f69fd

    -

    Dhrystone -O3

    +

    Dhrystone -O3

    gem5 --arch aarch64

    4 * 10^5

    68

    @@ -31856,13 +32236,31 @@ instructions 124346081

    5d233f2664a78789f9907d27e2a40e86cefad595

    -

    STREAM benchmark -O3

    +

    STREAM benchmark -O3

    gem5 --arch aarch64 --userland-args 300000 2

    3 * 10^5 * 2

    64

    9.9674773 * 10^7

    1.6

    + +

    ab6f7331406b22f8ab6e2df5f8b8e464fb35b611

    +

    userland/c/m5ops.c -O0

    +

    gem5 --arch aarch64 --userland-args e

    +

    1

    +

    2

    +

    1.26479 * 10^5

    +

    0.05

    + + +

    ab6f7331406b22f8ab6e2df5f8b8e464fb35b611

    +

    userland/cpp/m5ops.cpp -O0

    +

    gem5 --arch aarch64 --userland-args e

    +

    1

    +

    2

    +

    2.385012 * 10^6

    +

    1

    +