From ef3b976932e7de200fd80bb76d58f779fbf506fe Mon Sep 17 00:00:00 2001 From: Bob Mottram Date: Fri, 15 Dec 2017 14:33:47 +0000 Subject: [PATCH] Some additional formats --- 4.12.12/LKMPG-4.12.12.md | 5656 +++++++++++++++++++++++++++++ 4.12.12/LKMPG-4.12.12.rst | 7263 +++++++++++++++++++++++++++++++++++++ 4.12.12/examples/Makefile | 2 +- 3 files changed, 12920 insertions(+), 1 deletion(-) create mode 100644 4.12.12/LKMPG-4.12.12.md create mode 100644 4.12.12/LKMPG-4.12.12.rst diff --git a/4.12.12/LKMPG-4.12.12.md b/4.12.12/LKMPG-4.12.12.md new file mode 100644 index 0000000..15c311b --- /dev/null +++ b/4.12.12/LKMPG-4.12.12.md @@ -0,0 +1,5656 @@ +::: {#preamble .status} +[]{#top} +::: + +::: {#content} +::: {#table-of-contents} +Table of Contents +----------------- + +::: {#text-table-of-contents} +- [Introduction](#org98c97cb) + - [Authorship](#org2782b14) + - [Versioning and Notes](#org0b6d633) + - [Acknowledgements](#orge57cf6b) + - [What Is A Kernel Module?](#org37341bc) + - [Kernel module package](#orge9612fa) + - [What Modules are in my Kernel?](#orgb6ce832) + - [Do I need to download and compile the kernel?](#orge1ec8b5) + - [Before We Begin](#org87661f2) +- [Headers](#org52fbd37) +- [Examples](#org628945f) +- [Hello World](#org0d455c0) + - [The Simplest Module](#orgba22fe1) + - [Hello and Goodbye](#org56fc79a) + - [The \_\_init and \_\_exit Macros](#org86bfdb6) + - [Licensing and Module Documentation](#org11aaf91) + - [Passing Command Line Arguments to a Module](#org9e1dd8d) + - [Modules Spanning Multiple Files](#orgcd10981) + - [Building modules for a precompiled kernel](#orga65faca) +- [Preliminaries](#orgdeef601) + - [How modules begin and end](#orgc8eceb0) + - [Functions available to modules](#org290f3df) + - [User Space vs Kernel Space](#orga7850df) + - [Name Space](#org4b4877b) + - [Code space](#org7e3a491) + - [Device Drivers](#org6c0b122) +- [Character Device drivers](#org016c39a) + - [The file\_operations Structure](#org31d952e) + - [The file 
structure](#org607b208) + - [Registering A Device](#orgf96ab85) + - [Unregistering A Device](#org452ea75) + - [chardev.c](#orgdd49880) + - [Writing Modules for Multiple Kernel Versions](#org903f5d5) +- [The /proc File System](#org6400501) + - [Read and Write a /proc File](#orga906618) + - [Manage /proc file with standard filesystem](#org561d817) + - [Manage /proc file with seq\_file](#org38ea52f) +- [sysfs: Interacting with your module](#org954957f) +- [Talking To Device Files](#org438f37b) +- [System Calls](#org8de5924) +- [Blocking Processes and threads](#org13e2c0e) + - [Sleep](#org9cbc7d3) + - [Completions](#org89cb410) +- [Avoiding Collisions and Deadlocks](#org949949f) + - [Mutex](#org10f05c2) + - [Spinlocks](#org5d633fc) + - [Read and write locks](#orgaa517c3) + - [Atomic operations](#orgadbf448) +- [Replacing Printks](#org7974c60) + - [Replacing printk](#org1c8b17b) + - [Flashing keyboard LEDs](#org418d823) +- [Scheduling Tasks](#orgf37d73f) + - [Tasklets](#org32525a8) + - [Work queues](#orge8a2d87) +- [Interrupt Handlers](#orgbc0cdf8) + - [Interrupt Handlers](#org93511bb) + - [Detecting button presses](#org77533ca) + - [Bottom Half](#orgdb452ba) +- [Crypto](#org627e987) + - [Hash functions](#org0d560c3) + - [Symmetric key encryption](#org4e331ef) +- [Standardising the interfaces: The Device Model](#org01d6493) +- [Optimisations](#org87293ce) + - [Likely and Unlikely conditions](#org87e8223) +- [Common Pitfalls](#org79dea20) + - [Using standard libraries](#org86275d7) + - [Disabling interrupts](#org8646229) + - [Sticking your head inside a large carnivore](#org58c8bc4) +- [Where To Go From Here?](#org2307e11) +::: +::: + +::: {#outline-container-org98c97cb .outline-2} +Introduction {#org98c97cb} +------------ + +::: {#text-org98c97cb .outline-text-2} +The Linux Kernel Module Programming Guide is a free book; you may +reproduce and/or modify it under the terms of the Open Software License, +version 3.0. 
+ +This book is distributed in the hope it will be useful, but without any +warranty, without even the implied warranty of merchantability or +fitness for a particular purpose. + +The author encourages wide distribution of this book for personal or +commercial use, provided the above copyright notice remains intact and +the method adheres to the provisions of the Open Software License. In +summary, you may copy and distribute this book free of charge or for a +profit. No explicit permission is required from the author for +reproduction of this book in any medium, physical or electronic. + +Derivative works and translations of this document must be placed under +the Open Software License, and the original copyright notice must remain +intact. If you have contributed new material to this book, you must make +the material and source code available for your revisions. Please make +revisions and updates available directly to the document maintainer, +Peter Jay Salzman \. This will allow for the merging of +updates and provide consistent revisions to the Linux community. + +If you publish or distribute this book commercially, donations, +royalties, and/or printed copies are greatly appreciated by the author +and the [Linux Documentation Project](http://www.tldp.org) (LDP). +Contributing in this way shows your support for free software and the +LDP. If you have questions or comments, please contact the address +above. +::: + +::: {#outline-container-org2782b14 .outline-3} +### Authorship {#org2782b14} + +::: {#text-org2782b14 .outline-text-3} +The Linux Kernel Module Programming Guide was originally written for the +2.2 kernels by Ori Pomerantz. Eventually, Ori no longer had time to +maintain the document. After all, the Linux kernel is a fast moving +target. Peter Jay Salzman took over maintenance and updated it for the +2.4 kernels. 
Eventually, Peter no longer had time to follow developments +with the 2.6 kernel, so Michael Burian became a co-maintainer to update +the document for the 2.6 kernels. Bob Mottram updated the examples for +3.8 and later kernels, added the sysfs chapter and modified or updated +other chapters. +::: +::: + +::: {#outline-container-org0b6d633 .outline-3} +### Versioning and Notes {#org0b6d633} + +::: {#text-org0b6d633 .outline-text-3} +The Linux kernel is a moving target. There has always been a question +whether the LKMPG should remove deprecated information or keep it around +for historical sake. Michael Burian and I decided to create a new branch +of the LKMPG for each new stable kernel version. So version LKMPG 4.12.x +will address Linux kernel 4.12.x and LKMPG 2.6.x will address Linux +kernel 2.6. No attempt will be made to archive historical information; a +person wishing this information should read the appropriately versioned +LKMPG. + +The source code and discussions should apply to most architectures, but +I can\'t promise anything. +::: +::: + +::: {#outline-container-orge57cf6b .outline-3} +### Acknowledgements {#orge57cf6b} + +::: {#text-orge57cf6b .outline-text-3} +The following people have contributed corrections or good suggestions: +Ignacio Martin, David Porter, Daniele Paolo Scarpazza, Dimo Velev, +Francois Audeon, Horst Schirmeier, Bob Mottram and Roman Lakeev. +::: +::: + +::: {#outline-container-org37341bc .outline-3} +### What Is A Kernel Module? {#org37341bc} + +::: {#text-org37341bc .outline-text-3} +So, you want to write a kernel module. You know C, you\'ve written a few +normal programs to run as processes, and now you want to get to where +the real action is, to where a single wild pointer can wipe out your +file system and a core dump means a reboot. + +What exactly is a kernel module? Modules are pieces of code that can be +loaded and unloaded into the kernel upon demand. 
They extend the +functionality of the kernel without the need to reboot the system. For +example, one type of module is the device driver, which allows the +kernel to access hardware connected to the system. Without modules, we +would have to build monolithic kernels and add new functionality +directly into the kernel image. Besides having larger kernels, this has +the disadvantage of requiring us to rebuild and reboot the kernel every +time we want new functionality. +::: +::: + +::: {#outline-container-orge9612fa .outline-3} +### Kernel module package {#orge9612fa} + +::: {#text-orge9612fa .outline-text-3} +Linux distros provide the commands *modprobe*, *insmod* and *depmod* +within a package. + +On Debian: + +::: {.org-src-container} + sudo apt-get install build-essential kmod +::: + +On Parabola: + +::: {.org-src-container} + sudo pacman -S gcc kmod +::: +::: +::: + +::: {#outline-container-orgb6ce832 .outline-3} +### What Modules are in my Kernel? {#orgb6ce832} + +::: {#text-orgb6ce832 .outline-text-3} +To discover what modules are already loaded within your current kernel +use the command **lsmod**. + +::: {.org-src-container} + sudo lsmod +::: + +Modules are stored within the file /proc/modules, so you can also see +them with: + +::: {.org-src-container} + sudo cat /proc/modules +::: + +This can be a long list, and you might prefer to search for something +particular. To search for the *fat* module: + +::: {.org-src-container} + sudo lsmod | grep fat +::: +::: +::: + +::: {#outline-container-orge1ec8b5 .outline-3} +### Do I need to download and compile the kernel? {#orge1ec8b5} + +::: {#text-orge1ec8b5 .outline-text-3} +For the purposes of following this guide you don\'t necessarily need to +do that. However, it would be wise to run the examples within a test +distro running on a virtual machine in order to avoid any possibility of +messing up your system. 
+::: +::: + +::: {#outline-container-org87661f2 .outline-3} +### Before We Begin {#org87661f2} + +::: {#text-org87661f2 .outline-text-3} +Before we delve into code, there are a few issues we need to cover. +Everyone\'s system is different and everyone has their own groove. +Getting your first \"hello world\" program to compile and load correctly +can sometimes be a trick. Rest assured, after you get over the initial +hurdle of doing it for the first time, it will be smooth sailing +thereafter. +::: + +- []{#org551d822}Modversioning\ + ::: {#text-org551d822 .outline-text-5} + A module compiled for one kernel won\'t load if you boot a different + kernel unless you enable CONFIG\_MODVERSIONS in the kernel. We + won\'t go into module versioning until later in this guide. Until we + cover modversions, the examples in the guide may not work if you\'re + running a kernel with modversioning turned on. However, most stock + Linux distro kernels come with it turned on. If you\'re having + trouble loading the modules because of versioning errors, compile a + kernel with modversioning turned off. + ::: + +- []{#orgaf2a17b}Using X\ + ::: {#text-orgaf2a17b .outline-text-5} + It is highly recommended that you type in, compile and load all the + examples this guide discusses. It\'s also highly recommended you do + this from a console. You should not be working on this stuff in X. + + Modules can\'t print to the screen like printf() can, but they can + log information and warnings, which ends up being printed on your + screen, but only on a console. If you insmod a module from an xterm, + the information and warnings will be logged, but only to your + systemd journal. You won\'t see it unless you look through your + journalctl. To have immediate access to this information, do all + your work from the console. 
+ ::: +::: +::: + +::: {#outline-container-org52fbd37 .outline-2} +Headers {#org52fbd37} +------- + +::: {#text-org52fbd37 .outline-text-2} +Before you can build anything you\'ll need to install the header files +for your kernel. On Parabola GNU/Linux: + +::: {.org-src-container} + sudo pacman -S linux-libre-headers +::: + +On Debian: + +::: {.org-src-container} + sudo apt-get update + apt-cache search linux-headers-$(uname -r) +::: + +This will tell you what kernel header files are available. Then for +example: + +::: {.org-src-container} + sudo apt-get install kmod linux-headers-4.12.12-1-amd64 +::: +::: +::: + +::: {#outline-container-org628945f .outline-2} +Examples {#org628945f} +-------- + +::: {#text-org628945f .outline-text-2} +All the examples from this document are available within the *examples* +subdirectory. To test that they compile: + +::: {.org-src-container} + cd examples + make +::: + +If there are any compile errors then you might have a more recent kernel +version or need to install the corresponding kernel header files. +::: +::: + +::: {#outline-container-org0d455c0 .outline-2} +Hello World {#org0d455c0} +----------- + +::: {#text-org0d455c0 .outline-text-2} +::: + +::: {#outline-container-orgba22fe1 .outline-3} +### The Simplest Module {#orgba22fe1} + +::: {#text-orgba22fe1 .outline-text-3} +Most people learning programming start out with some sort of \"*hello +world*\" example. I don\'t know what happens to people who break with +this tradition, but I think it\'s safer not to find out. We\'ll start +with a series of hello world programs that demonstrate the different +aspects of the basics of writing a kernel module. + +Here\'s the simplest module possible. + +Make a test directory: + +::: {.org-src-container} + mkdir -p ~/develop/kernel/hello-1 + cd ~/develop/kernel/hello-1 +::: + +Paste this into your favourite editor and save it as **hello-1.c**: + +::: {.org-src-container} + /* + * hello-1.c - The simplest kernel module. 
+ */ + #include <linux/module.h> /* Needed by all modules */ + #include <linux/kernel.h> /* Needed for KERN_INFO */ + + int init_module(void) + { + printk(KERN_INFO "Hello world 1.\n"); + + /* + * A non 0 return means init_module failed; module can't be loaded. + */ + return 0; + } + + void cleanup_module(void) + { + printk(KERN_INFO "Goodbye world 1.\n"); + } +::: + +Now you\'ll need a Makefile. If you copy and paste this, change the +indentation to use tabs, not spaces. + +::: {.org-src-container} + obj-m += hello-1.o + + all: + make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules + + clean: + make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean +::: + +And finally just: + +::: {.org-src-container} + make +::: + +If all goes smoothly you should then find that you have a compiled +**hello-1.ko** module. You can find info on it with the command: + +::: {.org-src-container} + sudo modinfo hello-1.ko +::: + +At this point the command: + +::: {.org-src-container} + sudo lsmod | grep hello +::: + +should return nothing. You can try loading your shiny new module with: + +::: {.org-src-container} + sudo insmod hello-1.ko +::: + +The dash character will get converted to an underscore, so when you +again try: + +::: {.org-src-container} + sudo lsmod | grep hello +::: + +you should now see your loaded module. It can be removed again with: + +::: {.org-src-container} + sudo rmmod hello_1 +::: + +Notice that the dash was replaced by an underscore. To see what just +happened in the logs: + +::: {.org-src-container} + journalctl --since "1 hour ago" | grep kernel +::: + +You now know the basics of creating, compiling, installing and removing +modules. Now for more of a description of how this module works. + +Kernel modules must have at least two functions: a \"start\" +(initialization) function called **init\_module()** which is called when +the module is insmoded into the kernel, and an \"end\" (cleanup) +function called **cleanup\_module()** which is called just before it is +rmmoded. 
Actually, things have changed starting with kernel 2.3.13. You +can now use whatever name you like for the start and end functions of a +module, and you\'ll learn how to do this in Section 2.3. In fact, the +new method is the preferred method. However, many people still use +init\_module() and cleanup\_module() for their start and end functions. + +Typically, init\_module() either registers a handler for something with +the kernel, or it replaces one of the kernel functions with its own code +(usually code to do something and then call the original function). The +cleanup\_module() function is supposed to undo whatever init\_module() +did, so the module can be unloaded safely. + +Lastly, every kernel module needs to include linux/module.h. We needed +to include **linux/kernel.h** only for the macro expansion for the +printk() log level, KERN\_ALERT, which you\'ll learn about in Section +2.1.1. +::: + +- []{#orgab018f5}A point about coding style\ + ::: {#text-orgab018f5 .outline-text-5} + Another thing which may not be immediately obvious to anyone getting + started with kernel programming is that indentation within your code + should be using **tabs** and **not spaces**. It\'s one of the coding + conventions of the kernel. You may not like it, but you\'ll need to + get used to it if you ever submit a patch upstream. + ::: + +- []{#org176ca3e}Introducing printk()\ + ::: {#text-org176ca3e .outline-text-5} + Despite what you might think, **printk()** was not meant to + communicate information to the user, even though we used it for + exactly this purpose in hello-1! It happens to be a logging + mechanism for the kernel, and is used to log information or give + warnings. Therefore, each printk() statement comes with a priority, + which is the \<1\> and KERN\_ALERT you see. There are 8 priorities + and the kernel has macros for them, so you don\'t have to use + cryptic numbers, and you can view them (and their meanings) in + **linux/kernel.h**. 
If you don\'t specify a priority level, the + default priority, DEFAULT\_MESSAGE\_LOGLEVEL, will be used. + + Take time to read through the priority macros. The header file also + describes what each priority means. In practice, don\'t use numbers, + like \<4\>. Always use the macro, like KERN\_WARNING. + + If the priority is less than int console\_loglevel, the message is + printed on your current terminal. If both syslogd and klogd are + running, then the message will also get appended to the systemd + journal, whether it got printed to the console or not. We use a high + priority, like KERN\_ALERT, to make sure the printk() messages get + printed to your console rather than just logged to the journal. When + you write real modules, you\'ll want to use priorities that are + meaningful for the situation at hand. + ::: + +- []{#orgc8049ab}About Compiling\ + ::: {#text-orgc8049ab .outline-text-5} + Kernel modules need to be compiled a bit differently from regular + userspace apps. Former kernel versions required us to care much + about these settings, which are usually stored in Makefiles. + Although hierarchically organized, many redundant settings + accumulated in sublevel Makefiles and made them large and rather + difficult to maintain. Fortunately, there is a new way of doing + these things, called kbuild, and the build process for external + loadable modules is now fully integrated into the standard kernel + build mechanism. To learn more on how to compile modules which are + not part of the official kernel (such as all the examples you\'ll + find in this guide), see file + **linux/Documentation/kbuild/modules.txt**. + + Additional details about Makefiles for kernel modules are available + in **linux/Documentation/kbuild/makefiles.txt**. Be sure to read + this and the related files before starting to hack Makefiles. It\'ll + probably save you lots of work. + + > Here\'s another exercise for the reader. 
See that comment above + > the return statement in init\_module()? Change the return value to + > something negative, recompile and load the module again. What + > happens? + ::: +::: + +::: {#outline-container-org56fc79a .outline-3} +### Hello and Goodbye {#org56fc79a} + +::: {#text-org56fc79a .outline-text-3} +In early kernel versions you had to use the **init\_module** and +**cleanup\_module** functions, as in the first hello world example, but +these days you can name those anything you want by using the +**module\_init** and **module\_exit** macros. These macros are defined +in **linux/init.h**. The only requirement is that your init and cleanup +functions must be defined before calling those macros, otherwise +you\'ll get compilation errors. Here\'s an example of this technique: + +::: {.org-src-container} + /* + * hello-2.c - Demonstrating the module_init() and module_exit() macros. + * This is preferred over using init_module() and cleanup_module(). + */ + #include <linux/module.h> /* Needed by all modules */ + #include <linux/kernel.h> /* Needed for KERN_INFO */ + #include <linux/init.h> /* Needed for the macros */ + + static int __init hello_2_init(void) + { + printk(KERN_INFO "Hello, world 2\n"); + return 0; + } + + static void __exit hello_2_exit(void) + { + printk(KERN_INFO "Goodbye, world 2\n"); + } + + module_init(hello_2_init); + module_exit(hello_2_exit); +::: + +So now we have two real kernel modules under our belt. Adding another +module is as simple as this: + +::: {.org-src-container} + obj-m += hello-1.o + obj-m += hello-2.o + + all: + make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules + + clean: + make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean +::: + +Now have a look at linux/drivers/char/Makefile for a real world example. +As you can see, some things get hardwired into the kernel (obj-y) but +where have all those obj-m gone? Those familiar with shell scripts will +easily be able to spot them. 
For those not, the obj-\$(CONFIG\_FOO) +entries you see everywhere expand into obj-y or obj-m, depending on +whether the CONFIG\_FOO variable has been set to y or m. While we are at +it, those were exactly the kind of variables that you have set in the +linux/.config file, the last time when you said make menuconfig or +something like that. +::: +::: + +::: {#outline-container-org86bfdb6 .outline-3} +### The \_\_init and \_\_exit Macros {#org86bfdb6} + +::: {#text-org86bfdb6 .outline-text-3} +This demonstrates a feature of kernel 2.2 and later. Notice the change +in the definitions of the init and cleanup functions. The **\_\_init** +macro causes the init function to be discarded and its memory freed once +the init function finishes for built-in drivers, but not loadable +modules. If you think about when the init function is invoked, this +makes perfect sense. + +There is also an **\_\_initdata** which works similarly to **\_\_init** +but for init variables rather than functions. + +The **\_\_exit** macro causes the omission of the function when the +module is built into the kernel, and like \_\_init, has no effect for +loadable modules. Again, if you consider when the cleanup function runs, +this makes complete sense; built-in drivers don\'t need a cleanup +function, while loadable modules do. + +These macros are defined in **linux/init.h** and serve to free up kernel +memory. When you boot your kernel and see something like Freeing unused +kernel memory: 236k freed, this is precisely what the kernel is freeing. + +::: {.org-src-container} + /* + * hello-3.c - Illustrating the __init, __initdata and __exit macros. 
+ */ + #include <linux/module.h> /* Needed by all modules */ + #include <linux/kernel.h> /* Needed for KERN_INFO */ + #include <linux/init.h> /* Needed for the macros */ + + static int hello3_data __initdata = 3; + + static int __init hello_3_init(void) + { + printk(KERN_INFO "Hello, world %d\n", hello3_data); + return 0; + } + + static void __exit hello_3_exit(void) + { + printk(KERN_INFO "Goodbye, world 3\n"); + } + + module_init(hello_3_init); + module_exit(hello_3_exit); +::: +::: +::: + +::: {#outline-container-org11aaf91 .outline-3} +### Licensing and Module Documentation {#org11aaf91} + +::: {#text-org11aaf91 .outline-text-3} +Honestly, who loads or even cares about proprietary modules? If you do +then you might have seen something like this: + +::: {.org-src-container} + # insmod xxxxxx.o + Warning: loading xxxxxx.ko will taint the kernel: no license + See http://www.tux.org/lkml/#export-tainted for information about tainted modules + Module xxxxxx loaded, with warnings +::: + +You can use a few macros to indicate the license for your module. Some +examples are \"GPL\", \"GPL v2\", \"GPL and additional rights\", \"Dual +BSD/GPL\", \"Dual MIT/GPL\", \"Dual MPL/GPL\" and \"Proprietary\". +They\'re defined within **linux/module.h**. + +To declare which license you\'re using, a macro called +**MODULE\_LICENSE** is available. This and a few other macros describing the module +are illustrated in the example below. + +::: {.org-src-container} + /* + * hello-4.c - Demonstrates module documentation. 
+ */ + #include <linux/module.h> /* Needed by all modules */ + #include <linux/kernel.h> /* Needed for KERN_INFO */ + #include <linux/init.h> /* Needed for the macros */ + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("A sample driver"); + MODULE_SUPPORTED_DEVICE("testdevice"); + + static int __init init_hello_4(void) + { + printk(KERN_INFO "Hello, world 4\n"); + return 0; + } + + static void __exit cleanup_hello_4(void) + { + printk(KERN_INFO "Goodbye, world 4\n"); + } + + module_init(init_hello_4); + module_exit(cleanup_hello_4); +::: +::: +::: + +::: {#outline-container-org9e1dd8d .outline-3} +### Passing Command Line Arguments to a Module {#org9e1dd8d} + +::: {#text-org9e1dd8d .outline-text-3} +Modules can take command line arguments, but not with the argc/argv you +might be used to. + +To allow arguments to be passed to your module, declare the variables +that will take the values of the command line arguments as global and +then use the module\_param() macro (defined in linux/moduleparam.h) to +set the mechanism up. At runtime, insmod will fill the variables with +any command line arguments that are given, like ./insmod mymodule.ko +myvariable=5. The variable declarations and macros should be placed at +the beginning of the module for clarity. The example code should clear +up my admittedly lousy explanation. + +The module\_param() macro takes 3 arguments: the name of the variable, +its type and permissions for the corresponding file in sysfs. Integer +types can be signed as usual or unsigned. If you\'d like to use arrays +of integers or strings see module\_param\_array() and +module\_param\_string(). + +::: {.org-src-container} + int myint = 3; + module_param(myint, int, 0); +::: + +Arrays are supported too, but things are a bit different now than they +were in the olden days. To keep track of the number of parameters you +need to pass a pointer to a count variable as third parameter. At your +option, you could also ignore the count and pass NULL instead. 
We show +both possibilities here: + +::: {.org-src-container} + int myintarray[2]; + module_param_array(myintarray, int, NULL, 0); /* not interested in count */ + + short myshortarray[4]; + int count; + module_param_array(myshortarray, short, &count, 0); /* put count into "count" variable */ +::: + +A good use for this is to have the module variable\'s default values +set, like a port or IO address. If the variables contain the default +values, then perform autodetection (explained elsewhere). Otherwise, +keep the current value. This will be made clear later on. + +Lastly, there\'s a macro function, **MODULE\_PARM\_DESC()**, that is +used to document arguments that the module can take. It takes two +parameters: a variable name and a free form string describing that +variable. + +::: {.org-src-container} + /* + * hello-5.c - Demonstrates command line argument passing to a module. + */ + #include <linux/module.h> + #include <linux/moduleparam.h> + #include <linux/kernel.h> + #include <linux/init.h> + #include <linux/stat.h> + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Peter Jay Salzman"); + + static short int myshort = 1; + static int myint = 420; + static long int mylong = 9999; + static char *mystring = "blah"; + static int myintArray[2] = { -1, -1 }; + static int arr_argc = 0; + + /* + * module_param(foo, int, 0000) + * The first param is the parameter's name + * The second param is its data type + * The final argument is the permissions bits, + * for exposing parameters in sysfs (if non-zero) at a later stage. 
+ */ + + module_param(myshort, short, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP); + MODULE_PARM_DESC(myshort, "A short integer"); + module_param(myint, int, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH); + MODULE_PARM_DESC(myint, "An integer"); + module_param(mylong, long, S_IRUSR); + MODULE_PARM_DESC(mylong, "A long integer"); + module_param(mystring, charp, 0000); + MODULE_PARM_DESC(mystring, "A character string"); + + /* + * module_param_array(name, type, num, perm); + * The first param is the parameter's (in this case the array's) name + * The second param is the data type of the elements of the array + * The third argument is a pointer to the variable that will store the number + * of elements of the array initialized by the user at module loading time + * The fourth argument is the permission bits + */ + module_param_array(myintArray, int, &arr_argc, 0000); + MODULE_PARM_DESC(myintArray, "An array of integers"); + + static int __init hello_5_init(void) + { + int i; + printk(KERN_INFO "Hello, world 5\n=============\n"); + printk(KERN_INFO "myshort is a short integer: %hd\n", myshort); + printk(KERN_INFO "myint is an integer: %d\n", myint); + printk(KERN_INFO "mylong is a long integer: %ld\n", mylong); + printk(KERN_INFO "mystring is a string: %s\n", mystring); + for (i = 0; i < (sizeof myintArray / sizeof (int)); i++) + { + printk(KERN_INFO "myintArray[%d] = %d\n", i, myintArray[i]); + } + printk(KERN_INFO "got %d arguments for myintArray.\n", arr_argc); + return 0; + } + + static void __exit hello_5_exit(void) + { + printk(KERN_INFO "Goodbye, world 5\n"); + } + + module_init(hello_5_init); + module_exit(hello_5_exit); +::: + +I would recommend playing around with this code: + +::: {.org-src-container} + # sudo insmod hello-5.ko mystring="bebop" mybyte=255 myintArray=-1 + mybyte is an 8 bit integer: 255 + myshort is a short integer: 1 + myint is an integer: 20 + mylong is a long integer: 9999 + mystring is a string: bebop + myintArray is -1 and 420 + + # rmmod hello-5 + 
Goodbye, world 5 + + # sudo insmod hello-5.ko mystring="supercalifragilisticexpialidocious" \ + > mybyte=256 myintArray=-1,-1 + mybyte is an 8 bit integer: 0 + myshort is a short integer: 1 + myint is an integer: 20 + mylong is a long integer: 9999 + mystring is a string: supercalifragilisticexpialidocious + myintArray is -1 and -1 + + # rmmod hello-5 + Goodbye, world 5 + + # sudo insmod hello-5.ko mylong=hello + hello-5.o: invalid argument syntax for mylong: 'h' +::: +::: +::: + +::: {#outline-container-orgcd10981 .outline-3} +### Modules Spanning Multiple Files {#orgcd10981} + +::: {#text-orgcd10981 .outline-text-3} +Sometimes it makes sense to divide a kernel module between several +source files. + +Here\'s an example of such a kernel module. + +::: {.org-src-container} + /* + * start.c - Illustration of multi filed modules + */ + + #include <linux/kernel.h> /* We're doing kernel work */ + #include <linux/module.h> /* Specifically, a module */ + + int init_module(void) + { + printk(KERN_INFO "Hello, world - this is the kernel speaking\n"); + return 0; + } +::: + +The next file: + +::: {.org-src-container} + /* + * stop.c - Illustration of multi filed modules + */ + + #include <linux/kernel.h> /* We're doing kernel work */ + #include <linux/module.h> /* Specifically, a module */ + + void cleanup_module(void) + { + printk(KERN_INFO "Short is the life of a kernel module\n"); + } +::: + +And finally, the makefile: + +::: {.org-src-container} + obj-m += hello-1.o + obj-m += hello-2.o + obj-m += hello-3.o + obj-m += hello-4.o + obj-m += hello-5.o + obj-m += startstop.o + startstop-objs := start.o stop.o + + all: + make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules + + clean: + make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean +::: + +This is the complete makefile for all the examples we\'ve seen so far. +The first five lines are nothing special, but for the last example +we\'ll need two lines. First we invent an object name for our combined +module, second we tell make what object files are part of that module. 
+::: +::: + +::: {#outline-container-orga65faca .outline-3} +### Building modules for a precompiled kernel {#orga65faca} + +::: {#text-orga65faca .outline-text-3} +Obviously, we strongly suggest that you recompile your kernel, so that you +can enable a number of useful debugging features, such as forced module +unloading (**MODULE\_FORCE\_UNLOAD**): when this option is enabled, you +can force the kernel to unload a module even when it believes it is +unsafe, via a **sudo rmmod -f module** command. This option can save you +a lot of time and a number of reboots during the development of a +module. If you don\'t want to recompile your kernel then you should +consider running the examples within a test distro on a virtual machine. +If you mess anything up then you can easily reboot or restore the VM. + +There are a number of cases in which you may want to load your module +into a precompiled running kernel, such as the ones shipped with common +Linux distributions, or a kernel you have compiled in the past. In +certain circumstances you may need to compile and insert a module +into a running kernel which you are not allowed to recompile, or on a +machine that you prefer not to reboot. If you can\'t think of a case +that will force you to use modules for a precompiled kernel you might +want to skip this and treat the rest of this chapter as a big footnote. 
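
Whether forced unloading is available depends on how the running kernel was configured. As a minimal sketch, assuming your distribution installs the kernel configuration as a plain text file (often **/boot/config-\`uname -r\`**, though this is not universal), a small shell helper can check whether an option is enabled. The function name **config\_enabled** is made up for this illustration:

```shell
# config_enabled FILE OPTION
# Hypothetical helper (not a standard tool): succeeds if OPTION is set
# to "y" in the kernel configuration FILE, and fails otherwise.
config_enabled() {
    # Anchor the match to the whole line so that a comment such as
    # "# CONFIG_FOO is not set" does not count as enabled.
    grep -q "^$2=y\$" "$1"
}
```

For example, **config\_enabled /boot/config-\`uname -r\` CONFIG\_MODULE\_FORCE\_UNLOAD && echo yes** should print yes only when forced unloading was compiled in. If it prints nothing, **rmmod -f** is unlikely to help and a virtual machine is the safer way to experiment.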
+ +Now, if you just install a kernel source tree, use it to compile your +kernel module and try to insert your module into the kernel, in most +cases you will get an error like this: + +::: {.org-src-container} + insmod: error inserting 'poet_atkm.ko': -1 Invalid module format +::: + +Less cryptic information is logged to the systemd journal: + +::: {.org-src-container} + Jun 4 22:07:54 localhost kernel: poet_atkm: version magic '2.6.5-1.358custom 686 + REGPARM 4KSTACKS gcc-3.3' should be '2.6.5-1.358 686 REGPARM 4KSTACKS gcc-3.3' +::: + +In other words, your kernel refuses to accept your module because the +version strings (more precisely, version magics) do not match. +Incidentally, version magics are stored in the module object in the form +of a static string, starting with vermagic:. Version data are inserted +in your module when it is linked against the **init/vermagic.o** file. +To inspect version magics and other strings stored in a given module, +issue the modinfo module.ko command: + +::: {.org-src-container} + # sudo modinfo hello-4.ko + license: GPL + author: Bob Mottram + description: A sample driver + vermagic: 4.12.12-1.358 amd64 REGPARM 4KSTACKS gcc-4.9.2 + depends: +::: + +To overcome this problem we could resort to the **--force-vermagic** +option, but this solution is potentially unsafe, and unquestionably +unacceptable in production modules. Consequently, we want to compile our +module in an environment identical to the one in which our +precompiled kernel was built. How to do this is the subject of the +remainder of this chapter. + +First of all, make sure that a kernel source tree is available, having +exactly the same version as your current kernel. Then, find the +configuration file which was used to compile your precompiled kernel. +Usually, this is available in your current **/boot** directory, under a name +like config-2.6.x.
You may just want to copy it to your kernel source +tree: **cp /boot/config-\`uname -r\` /usr/src/linux-\`uname +-r\`/.config**. + +Let\'s focus again on the previous error message: a closer look at the +version magic strings suggests that, even with two configuration files +which are exactly the same, a slight difference in the version magic +could be possible, and it is sufficient to prevent insertion of the +module into the kernel. That slight difference, namely the custom string +which appears in the module\'s version magic and not in the kernel\'s +one, is due to a modification with respect to the original, in the +makefile that some distributions include. Then, examine your +**/usr/src/linux/Makefile**, and make sure that the specified version +information matches exactly the one used for your current kernel. For +example, your makefile could start as follows: + +::: {.org-src-container} + VERSION = 4 + PATCHLEVEL = 7 + SUBLEVEL = 4 + EXTRAVERSION = -1.358custom +::: + +In this case, you need to restore the value of symbol **EXTRAVERSION** +to -1.358. We suggest keeping a backup copy of the makefile used to +compile your kernel available in **/lib/modules/4.12.12-1.358/build**. A +simple **cp /lib/modules/\`uname -r\`/build/Makefile +/usr/src/linux-\`uname -r\`** should suffice. Additionally, if you +already started a kernel build with the previous (wrong) Makefile, you +should also rerun make, or directly modify symbol UTS\_RELEASE in file +**/usr/src/linux-4.12.12/include/linux/version.h** according to contents +of file **/lib/modules/4.12.12/build/include/linux/version.h**, or +overwrite the former with the latter.
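To make the mismatch described above concrete, here is a minimal user-space sketch. It reads the running kernel's release string with uname(2) and checks whether a vermagic string starts with exactly that release. Both function names are hypothetical helpers invented for this illustration. The kernel itself compares the whole version magic string, so this prefix check shows only why an extra EXTRAVERSION suffix such as `custom` is enough to cause a rejection.

```c
#include <assert.h>
#include <string.h>
#include <sys/utsname.h>

/*
 * Copy the running kernel's release string (what `uname -r` prints)
 * into buf. Returns 0 on success, -1 if uname() fails.
 */
int running_release(char *buf, size_t len)
{
    struct utsname u;

    assert(buf != NULL && len > 0);
    if (uname(&u) != 0)
        return -1;
    strncpy(buf, u.release, len - 1);
    buf[len - 1] = '\0';
    return 0;
}

/*
 * Hypothetical helper: return 1 if the vermagic string begins with
 * exactly this release followed by a space (or the end of the string),
 * 0 otherwise. This checks the release prefix only; the kernel's real
 * test covers the entire vermagic string.
 */
int vermagic_has_release(const char *vermagic, const char *release)
{
    size_t n = strlen(release);

    if (strncmp(vermagic, release, n) != 0)
        return 0;
    return vermagic[n] == ' ' || vermagic[n] == '\0';
}
```

With the two strings from the journal message above, `vermagic_has_release()` accepts the kernel's own magic and rejects the module's, because the character following `2.6.5-1.358` in the module's magic is `c` rather than a space.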
+ +Now, please run make to update configuration and version headers and +objects: + +::: {.org-src-container} + # make + CHK include/linux/version.h + UPD include/linux/version.h + SYMLINK include/asm -> include/asm-i386 + SPLIT include/linux/autoconf.h -> include/config/* + HOSTCC scripts/basic/fixdep + HOSTCC scripts/basic/split-include + HOSTCC scripts/basic/docproc + HOSTCC scripts/conmakehash + HOSTCC scripts/kallsyms + CC scripts/empty.o +::: + +If you do not want to actually compile the kernel, you can interrupt +the build process (CTRL-C) just after the SPLIT line, because at that +time, the files you need will be ready. Now you can turn back to the +directory of your module and compile it: it will be built exactly +according to your current kernel settings, and it will load into it without +any errors. +::: +::: +::: + +::: {#outline-container-orgdeef601 .outline-2} +Preliminaries {#orgdeef601} +------------- + +::: {#text-orgdeef601 .outline-text-2} +::: + +::: {#outline-container-orgc8eceb0 .outline-3} +### How modules begin and end {#orgc8eceb0} + +::: {#text-orgc8eceb0 .outline-text-3} +A program usually begins with a **main()** function, executes a bunch of +instructions and terminates upon completion of those instructions. +Kernel modules work a bit differently. A module always begins with either +the init\_module or the function you specify with the module\_init call. +This is the entry function for modules; it tells the kernel what +functionality the module provides and sets up the kernel to run the +module\'s functions when they\'re needed. Once it does this, the entry +function returns and the module does nothing until the kernel wants to +do something with the code that the module provides. + +All modules end by calling either **cleanup\_module** or the function +you specify with the **module\_exit** call. This is the exit function +for modules; it undoes whatever the entry function did.
It unregisters the +functionality that the entry function registered. + +Every module must have an entry function and an exit function. Since +there\'s more than one way to specify entry and exit functions, I\'ll +try my best to use the terms \`entry function\' and \`exit function\', +but if I slip and simply refer to them as init\_module and +cleanup\_module, I think you\'ll know what I mean. +::: +::: + +::: {#outline-container-org290f3df .outline-3} +### Functions available to modules {#org290f3df} + +::: {#text-org290f3df .outline-text-3} +Programmers use functions they don\'t define all the time. A prime +example of this is **printf()**. You use these library functions which +are provided by the standard C library, libc. The definitions for these +functions don\'t actually enter your program until the linking stage, +which ensures that the code (for printf() for example) is available, and +fixes the call instruction to point to that code. + +Kernel modules are different here, too. In the hello world example, you +might have noticed that we used a function, **printk()**, but didn\'t +include a standard I/O library. That\'s because modules are object files +whose symbols get resolved upon insmod\'ing. The definition for the +symbols comes from the kernel itself; the only external functions you +can use are the ones provided by the kernel. If you\'re curious about +what symbols have been exported by your kernel, take a look at +**/proc/kallsyms**. + +One point to keep in mind is the difference between library functions +and system calls. Library functions are higher level, run completely in +user space and provide a more convenient interface for the programmer to +the functions that do the real work: system calls. System calls run +in kernel mode on the user\'s behalf and are provided by the kernel +itself.
The library function printf() may look like a very general +printing function, but all it really does is format the data into +strings and write the string data using the low-level system call +write(), which then sends the data to standard output. + +Would you like to see what system calls are made by printf()? It\'s +easy! Compile the following program: + +::: {.org-src-container} + #include <stdio.h> + + int main(void) + { + printf("hello"); + return 0; + } +::: + +with **gcc -Wall -o hello hello.c**. Run the executable with **strace +./hello**. Are you impressed? Every line you see corresponds to a system +call. [strace](https://strace.io/) is a handy program that gives you +details about what system calls a program is making, including which +call is made, what its arguments are and what it returns. It\'s an +invaluable tool for figuring out things like what files a program is +trying to access. Towards the end, you\'ll see a line which looks like +write (1, \"hello\", 5hello). There it is. The face behind the printf() +mask. You may not be familiar with write, since most people use library +functions for file I/O (like fopen, fputs, fclose). If that\'s the case, +try looking at man 2 write. The 2nd man section is devoted to system +calls (like kill() and read()). The 3rd man section is devoted to +library calls, which you would probably be more familiar with (like +cosh() and random()). + +You can even write modules to replace the kernel\'s system calls, which +we\'ll do shortly. Crackers often make use of this sort of thing for +backdoors or trojans, but you can write your own modules to do more +benign things, like have the kernel write Tee hee, that tickles! +every time someone tries to delete a file on your system.
+::: +::: + +::: {#outline-container-orga7850df .outline-3} +### User Space vs Kernel Space {#orga7850df} + +::: {#text-orga7850df .outline-text-3} +A kernel is all about access to resources, whether the resource in +question happens to be a video card, a hard drive or even memory. +Programs often compete for the same resource. As I just saved this +document, updatedb started updating the locate database. My vim session +and updatedb are both using the hard drive concurrently. The kernel +needs to keep things orderly, and not give users access to resources +whenever they feel like it. To this end, a CPU can run in different +modes. Each mode gives a different level of freedom to do what you want +on the system. The Intel 80386 architecture had 4 of these modes, which +were called rings. Unix uses only two rings; the highest ring (ring 0, +also known as \`supervisor mode\' where everything is allowed to happen) +and the lowest ring, which is called \`user mode\'. + +Recall the discussion about library functions vs system calls. +Typically, you use a library function in user mode. The library function +calls one or more system calls, and these system calls execute on the +library function\'s behalf, but do so in supervisor mode since they are +part of the kernel itself. Once the system call completes its task, it +returns and execution gets transferred back to user mode. +::: +::: + +::: {#outline-container-org4b4877b .outline-3} +### Name Space {#org4b4877b} + +::: {#text-org4b4877b .outline-text-3} +When you write a small C program, you use variables which are convenient +and make sense to the reader. If, on the other hand, you\'re writing +routines which will be part of a bigger program, any global variables +you have are part of a community of other people\'s global variables; +some of the variable names can clash. When a program has lots of global +variables which aren\'t meaningful enough to be distinguished, you get +namespace pollution.
In large projects, effort must be made to remember +reserved names, and to find ways to develop a scheme for naming unique +variable names and symbols. + +When writing kernel code, even the smallest module will be linked +against the entire kernel, so this is definitely an issue. The best way +to deal with this is to declare all your variables as static and to use +a well-defined prefix for your symbols. By convention, all kernel +prefixes are lowercase. If you don\'t want to declare everything as +static, another option is to declare a symbol table and register it with +the kernel. We\'ll get to this later. + +The file **/proc/kallsyms** holds all the symbols that the kernel knows +about and which are therefore accessible to your modules since they +share the kernel\'s codespace. +::: +::: + +::: {#outline-container-org7e3a491 .outline-3} +### Code space {#org7e3a491} + +::: {#text-org7e3a491 .outline-text-3} +Memory management is a very complicated subject and the majority of +O\'Reilly\'s \"*Understanding The Linux Kernel*\" exclusively covers +memory management! We\'re not setting out to be experts on memory +management, but we do need to know a couple of facts to even begin +worrying about writing real modules. + +If you haven\'t thought about what a segfault really means, you may be +surprised to hear that pointers don\'t actually point to memory +locations. Not real ones, anyway. When a process is created, the kernel +sets aside a portion of real physical memory and hands it to the process +to use for its executing code, variables, stack, heap and other things +which a computer scientist would know about. This memory begins with +0x00000000 and extends up to whatever it needs to be. Since the memory +spaces of any two processes don\'t overlap, every process that can +access a memory address, say 0xbffff978, would be accessing a different +location in real physical memory!
The processes would be accessing an +index named 0xbffff978 which points to some kind of offset into the +region of memory set aside for that particular process. For the most +part, a process like our Hello, World program can\'t access the space of +another process, although there are ways which we\'ll talk about later. + +The kernel has its own space of memory as well. Since a module is code +which can be dynamically inserted and removed in the kernel (as opposed +to a semi-autonomous object), it shares the kernel\'s codespace rather +than having its own. Therefore, if your module segfaults, the kernel +segfaults. And if you start writing over data because of an off-by-one +error, then you\'re trampling on kernel data (or code). This is even +worse than it sounds, so try your best to be careful. + +By the way, I would like to point out that the above discussion is true +for any operating system which uses a monolithic kernel. This isn\'t +quite the same thing as *\"building all your modules into the kernel\"*, +although the idea is the same. There are things called microkernels +which have modules which get their own codespace. The GNU Hurd and the +Magenta kernel of Google Fuchsia are two examples of a microkernel. +::: +::: + +::: {#outline-container-org6c0b122 .outline-3} +### Device Drivers {#org6c0b122} + +::: {#text-org6c0b122 .outline-text-3} +One class of module is the device driver, which provides functionality +for hardware like a serial port. On Unix, each piece of hardware is +represented by a file located in /dev named a device file which provides +the means to communicate with the hardware. The device driver provides +the communication on behalf of a user program. So the es1370.o sound +card device driver might connect the /dev/sound device file to the +Ensoniq ES1370 sound card. A userspace program like mp3blaster can use +/dev/sound without ever knowing what kind of sound card is installed.
+::: + +- []{#orga17bef9}Major and Minor Numbers\ + ::: {#text-orga17bef9 .outline-text-5} + Let\'s look at some device files. Here are device files which + represent the first three partitions on the primary master IDE hard + drive: + + ::: {.org-src-container} + # ls -l /dev/hda[1-3] + brw-rw---- 1 root disk 3, 1 Jul 5 2000 /dev/hda1 + brw-rw---- 1 root disk 3, 2 Jul 5 2000 /dev/hda2 + brw-rw---- 1 root disk 3, 3 Jul 5 2000 /dev/hda3 + ::: + + Notice the column of numbers separated by a comma? The first number + is called the device\'s major number. The second number is the minor + number. The major number tells you which driver is used to access + the hardware. Each driver is assigned a unique major number; all + device files with the same major number are controlled by the same + driver. All the above major numbers are 3, because they\'re all + controlled by the same driver. + + The minor number is used by the driver to distinguish between the + various hardware it controls. Returning to the example above, + although all three devices are handled by the same driver they have + unique minor numbers because the driver sees them as being different + pieces of hardware. + + Devices are divided into two types: character devices and block + devices. The difference is that block devices have a buffer for + requests, so they can choose the best order in which to respond to + the requests. This is important in the case of storage devices, + where it\'s faster to read or write sectors which are close to each + other, rather than those which are further apart. Another difference + is that block devices can only accept input and return output in + blocks (whose size can vary according to the device), whereas + character devices are allowed to use as many or as few bytes as they + like. Most devices in the world are character, because they don\'t + need this type of buffering, and they don\'t operate with a fixed + block size. 
You can tell whether a device file is for a block device + or a character device by looking at the first character in the + output of ls -l. If it\'s \`b\' then it\'s a block device, and if + it\'s \`c\' then it\'s a character device. The devices you see above + are block devices. Here are some character devices (the serial + ports): + + ::: {.org-src-container} + crw-rw---- 1 root dial 4, 64 Feb 18 23:34 /dev/ttyS0 + crw-r----- 1 root dial 4, 65 Nov 17 10:26 /dev/ttyS1 + crw-rw---- 1 root dial 4, 66 Jul 5 2000 /dev/ttyS2 + crw-rw---- 1 root dial 4, 67 Jul 5 2000 /dev/ttyS3 + ::: + + If you want to see which major numbers have been assigned, you can + look at /usr/src/linux/Documentation/devices.txt. + + When the system was installed, all of those device files were + created by the mknod command. To create a new char device named + \`coffee\' with major/minor number 12 and 2, simply do mknod + /dev/coffee c 12 2. You don\'t have to put your device files into + /dev, but it\'s done by convention. Linus put his device files in + /dev, and so should you. However, when creating a device file for + testing purposes, it\'s probably OK to place it in your working + directory where you compile the kernel module. Just be sure to put + it in the right place when you\'re done writing the device driver. + + I would like to make a few last points which are implicit from the + above discussion, but I\'d like to make them explicit just in case. + When a device file is accessed, the kernel uses the major number of + the file to determine which driver should be used to handle the + access. This means that the kernel doesn\'t really need to use or + even know about the minor number. The driver itself is the only + thing that cares about the minor number. It uses the minor number to + distinguish between different pieces of hardware. + + By the way, when I say \`hardware\', I mean something a bit more + abstract than a PCI card that you can hold in your hand. 
Look at + these two device files: + + ::: {.org-src-container} + % ls -l /dev/fd0 /dev/fd0u1680 + brwxrwxrwx 1 root floppy 2, 0 Jul 5 2000 /dev/fd0 + brw-rw---- 1 root floppy 2, 44 Jul 5 2000 /dev/fd0u1680 + ::: + + By now you can look at these two device files and know instantly + that they are block devices and are handled by the same driver (block + major 2). You might even be aware that these both represent your + floppy drive, even if you only have one floppy drive. Why two files? + One represents the floppy drive with 1.44 MB of storage. The other + is the same floppy drive with 1.68 MB of storage, and corresponds to + what some people call a \`superformatted\' disk, one that holds more + data than a standard formatted floppy. So here\'s a case where two + device files with different minor numbers actually represent the same + piece of physical hardware. So just be aware that the word + \`hardware\' in our discussion can mean something very abstract. + ::: +::: +::: + +::: {#outline-container-org016c39a .outline-2} +Character Device drivers {#org016c39a} +------------------------ + +::: {#text-org016c39a .outline-text-2} +::: + +::: {#outline-container-org31d952e .outline-3} +### The file\_operations Structure {#org31d952e} + +::: {#text-org31d952e .outline-text-3} +The file\_operations structure is defined in +**include/linux/fs.h** in the kernel source tree, and holds pointers to functions defined by +the driver that perform various operations on the device. Each field of +the structure corresponds to the address of some function defined by the +driver to handle a requested operation. + +For example, every character driver needs to define a function that +reads from the device. The file\_operations structure holds the address +of the module\'s function that performs that operation.
Here is what the +definition looks like for kernel 3.0: + +::: {.org-src-container} + struct file_operations { + struct module *owner; + loff_t (*llseek) (struct file *, loff_t, int); + ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); + ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + int (*iterate) (struct file *, struct dir_context *); + unsigned int (*poll) (struct file *, struct poll_table_struct *); + long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); + long (*compat_ioctl) (struct file *, unsigned int, unsigned long); + int (*mmap) (struct file *, struct vm_area_struct *); + int (*open) (struct inode *, struct file *); + int (*flush) (struct file *, fl_owner_t id); + int (*release) (struct inode *, struct file *); + int (*fsync) (struct file *, loff_t, loff_t, int datasync); + int (*aio_fsync) (struct kiocb *, int datasync); + int (*fasync) (int, struct file *, int); + int (*lock) (struct file *, int, struct file_lock *); + ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); + unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); + int (*check_flags)(int); + int (*flock) (struct file *, int, struct file_lock *); + ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); + ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); + int (*setlease)(struct file *, long, struct file_lock **); + long (*fallocate)(struct file *file, int mode, loff_t offset, + loff_t len); + int (*show_fdinfo)(struct seq_file *m, struct file *f); + }; +::: + +Some operations are not implemented by a driver. 
For example, a driver +that handles a video card won\'t need to read from a directory +structure. The corresponding entries in the file\_operations structure +should be set to NULL. + +There is a gcc extension that makes assigning to this structure more +convenient. You\'ll see it in modern drivers, and it may catch you by +surprise. This is what the new way of assigning to the structure looks +like: + +::: {.org-src-container} + struct file_operations fops = { + read: device_read, + write: device_write, + open: device_open, + release: device_release + }; +::: + +However, there\'s also a C99 way of assigning to elements of a +structure, and this is definitely preferred over using the GNU +extension. The version of gcc the author used when writing this, 2.95, +supports the new C99 syntax. You should use this syntax in case someone +wants to port your driver. It will help with compatibility: + +::: {.org-src-container} + struct file_operations fops = { + .read = device_read, + .write = device_write, + .open = device_open, + .release = device_release + }; +::: + +The meaning is clear, and you should be aware that any member of the +structure which you don\'t explicitly assign will be initialized to NULL +by gcc. + +An instance of struct file\_operations containing pointers to functions +that are used to implement read, write, open, ... syscalls is commonly +named fops. +::: +::: + +::: {#outline-container-org607b208 .outline-3} +### The file structure {#org607b208} + +::: {#text-org607b208 .outline-text-3} +Each device is represented in the kernel by a file structure, which is +defined in **linux/fs.h**. Be aware that a file is a kernel level +structure and never appears in a user space program. It\'s not the same +thing as a **FILE**, which is defined by glibc and would never appear in +a kernel space function. Also, its name is a bit misleading; it +represents an abstract open \`file\', not a file on a disk, which is +represented by a structure named inode.
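The distinction between an open \`file\' and an inode can be observed from user space, even though neither structure is visible there. The sketch below rests on standard POSIX behaviour: two separate open() calls on one path refer to the same inode but get independent file offsets, while a descriptor made with dup() shares its offset with the original. On Linux, an independent offset corresponds to a separate struct file in the kernel. The helper names are made up for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Return 1 if descriptors a and b refer to the same file on disk
 * (same device and inode number), 0 if not, -1 on error.
 */
int same_inode(int a, int b)
{
    struct stat sa, sb;

    if (fstat(a, &sa) != 0 || fstat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/*
 * Return 1 if descriptors a and b share one file offset, i.e. they
 * refer to the same open file, 0 if not, -1 on error.
 * Side effect: moves the offset of a.
 */
int share_offset(int a, int b)
{
    off_t before_b, after_b;

    before_b = lseek(b, 0, SEEK_CUR);
    if (before_b < 0)
        return -1;

    /* Move a; if b follows along, the two share one open file. */
    if (lseek(a, before_b + 1, SEEK_SET) < 0)
        return -1;

    after_b = lseek(b, 0, SEEK_CUR);
    if (after_b < 0)
        return -1;
    return after_b == before_b + 1;
}
```

Opening one regular file twice gives `same_inode() == 1` and `share_offset() == 0`. A descriptor and its dup() give `share_offset() == 1`.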
+ +An instance of struct file is commonly named filp. You\'ll also see it +referred to as struct file file. Resist the temptation. + +Go ahead and look at the definition of file. Most of the entries you +see, like struct dentry, aren\'t used by device drivers, and you can +ignore them. This is because drivers don\'t fill file directly; they +only use structures contained in file which are created elsewhere. +::: +::: + +::: {#outline-container-orgf96ab85 .outline-3} +### Registering A Device {#orgf96ab85} + +::: {#text-orgf96ab85 .outline-text-3} +As discussed earlier, char devices are accessed through device files, +usually located in /dev. This is by convention. When writing a driver, +it\'s OK to put the device file in your current directory. Just make +sure you place it in /dev for a production driver. The major number +tells you which driver handles which device file. The minor number is +used only by the driver itself to differentiate which device it\'s +operating on, just in case the driver handles more than one device. + +Adding a driver to your system means registering it with the kernel. +This is synonymous with assigning it a major number during the module\'s +initialization. You do this by using the register\_chrdev function, +defined in linux/fs.h. + +::: {.org-src-container} + int register_chrdev(unsigned int major, const char *name, struct file_operations *fops); +::: + +where unsigned int major is the major number you want to request, *const +char \*name* is the name of the device as it\'ll appear in +**/proc/devices** and *struct file\_operations \*fops* is a pointer to +the file\_operations table for your driver. A negative return value +means the registration failed. Note that we didn\'t pass the minor +number to register\_chrdev. That\'s because the kernel doesn\'t care +about the minor number; only our driver uses it. + +Now the question is, how do you get a major number without hijacking one +that\'s already in use?
The easiest way would be to look through +Documentation/devices.txt and pick an unused one. That\'s a bad way of +doing things because you\'ll never be sure if the number you picked will +be assigned later. The answer is that you can ask the kernel to assign +you a dynamic major number. + +If you pass a major number of 0 to register\_chrdev, the return value +will be the dynamically allocated major number. The downside is that you +can\'t make a device file in advance, since you don\'t know what the +major number will be. There are a few ways to deal with this. First, the +driver itself can print the newly assigned number and we can make the +device file by hand. Second, the newly registered device will have an +entry in **/proc/devices**, and we can either make the device file by +hand or write a shell script to read the file in and make the device +file. The third method is we can have our driver make the device +file using the mknod system call after a successful registration and rm +during the call to cleanup\_module. +::: +::: + +::: {#outline-container-org452ea75 .outline-3} +### Unregistering A Device {#org452ea75} + +::: {#text-org452ea75 .outline-text-3} +We can\'t allow the kernel module to be rmmod\'ed whenever root feels +like it. If the device file is opened by a process and then we remove +the kernel module, using the file would cause a call to the memory +location where the appropriate function (read/write) used to be. If +we\'re lucky, no other code was loaded there, and we\'ll get an ugly +error message. If we\'re unlucky, another kernel module was loaded into +the same location, which means a jump into the middle of another +function within the kernel. The results of this would be impossible to +predict, but they can\'t be very positive. + +Normally, when you don\'t want to allow something, you return an error +code (a negative number) from the function which is supposed to do it.
+With cleanup\_module that\'s impossible because it\'s a void function. +However, there\'s a counter which keeps track of how many processes are +using your module. You can see what its value is by looking at the 3rd +field of **/proc/modules**. If this number isn\'t zero, rmmod will fail. +Note that you don\'t have to check the counter from within +cleanup\_module because the check will be performed for you by the +system call sys\_delete\_module, defined in **kernel/module.c**. You +shouldn\'t use this counter directly, but there are functions defined in +**linux/module.h** which let you increase, decrease and display this +counter: + +- try\_module\_get(THIS\_MODULE): Increment the use count. +- module\_put(THIS\_MODULE): Decrement the use count. + +It\'s important to keep the counter accurate; if you ever do lose track +of the correct usage count, you\'ll never be able to unload the module; +it\'s now reboot time, boys and girls. This is bound to happen to you +sooner or later during a module\'s development. +::: +::: + +::: {#outline-container-orgdd49880 .outline-3} +### chardev.c {#orgdd49880} + +::: {#text-orgdd49880 .outline-text-3} +The next code sample creates a char driver named chardev. You can cat +its device file. + +::: {.org-src-container} + cat /proc/devices +::: + +(or open the file with a program) and the driver will put the number of +times the device file has been read from into the file. We don\'t +support writing to the file (like **echo \"hi\" \> /dev/hello**), but +catch these attempts and tell the user that the operation isn\'t +supported. Don\'t worry if you don\'t see what we do with the data we +read into the buffer; we don\'t do much with it. We simply read in the +data and print a message acknowledging that we received it.
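Because the driver below asks for a dynamically assigned major number, something in user space has to look that number up before the device file can be created. Here is a minimal sketch of the \`read /proc/devices\' approach. It scans text in the layout of **/proc/devices** and returns the major number listed for a given name in the character device section. The function name `find_char_major` and the number 245 used in the sample are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan text laid out like /proc/devices:
 *
 *   Character devices:
 *     1 mem
 *   245 chardev
 *
 *   Block devices:
 *     8 sd
 *
 * Return the major number listed for `name` in the character device
 * section, or -1 if it is not there.
 */
int find_char_major(const char *text, const char *name)
{
    int in_char_section = 0;

    assert(text != NULL && name != NULL);

    while (*text) {
        const char *eol = strchr(text, '\n');
        size_t len = eol ? (size_t)(eol - text) : strlen(text);
        char line[128];
        char dev[64];
        int number;

        /* Copy one line into a NUL-terminated scratch buffer. */
        if (len >= sizeof(line))
            len = sizeof(line) - 1;
        memcpy(line, text, len);
        line[len] = '\0';

        if (strncmp(line, "Character devices:", 18) == 0)
            in_char_section = 1;
        else if (strncmp(line, "Block devices:", 14) == 0)
            in_char_section = 0;
        else if (in_char_section &&
                 sscanf(line, "%d %63s", &number, dev) == 2 &&
                 strcmp(dev, name) == 0)
            return number;

        /* Advance to the next line, or to the end of the text. */
        text = eol ? eol + 1 : text + strlen(text);
    }
    return -1;
}
```

A small wrapper could read the real **/proc/devices** into a buffer, pass it to `find_char_major(buf, "chardev")`, and hand the result to mknod.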
+ +::: {.org-src-container} + /* + * chardev.c: Creates a read-only char device that says how many times + * you've read from the dev file + */ + + #include <linux/kernel.h> + #include <linux/module.h> + #include <linux/fs.h> + #include <linux/uaccess.h> /* for put_user */ + + /* + * Prototypes - this would normally go in a .h file + */ + int init_module(void); + void cleanup_module(void); + static int device_open(struct inode *, struct file *); + static int device_release(struct inode *, struct file *); + static ssize_t device_read(struct file *, char *, size_t, loff_t *); + static ssize_t device_write(struct file *, const char *, size_t, loff_t *); + + #define SUCCESS 0 + #define DEVICE_NAME "chardev" /* Dev name as it appears in /proc/devices */ + #define BUF_LEN 80 /* Max length of the message from the device */ + + /* + * Global variables are declared as static, so are global within the file. + */ + + static int Major; /* Major number assigned to our device driver */ + static int Device_Open = 0; /* Is device open? + * Used to prevent multiple access to device */ + static char msg[BUF_LEN]; /* The msg the device will give when asked */ + static char *msg_Ptr; + + static struct file_operations fops = { + .read = device_read, + .write = device_write, + .open = device_open, + .release = device_release + }; + + /* + * This function is called when the module is loaded + */ + int init_module(void) + { + Major = register_chrdev(0, DEVICE_NAME, &fops); + + if (Major < 0) { + printk(KERN_ALERT "Registering char device failed with %d\n", Major); + return Major; + } + + printk(KERN_INFO "I was assigned major number %d. To talk to\n", Major); + printk(KERN_INFO "the driver, create a dev file with\n"); + printk(KERN_INFO "'mknod /dev/%s c %d 0'.\n", DEVICE_NAME, Major); + printk(KERN_INFO "Try various minor numbers.
Try to cat and echo to\n"); + printk(KERN_INFO "the device file.\n"); + printk(KERN_INFO "Remove the device file and module when done.\n"); + + return SUCCESS; + } + + /* + * This function is called when the module is unloaded + */ + void cleanup_module(void) + { + /* + * Unregister the device + */ + unregister_chrdev(Major, DEVICE_NAME); + } + + /* + * Methods + */ + + /* + * Called when a process tries to open the device file, like + * "cat /dev/mycharfile" + */ + static int device_open(struct inode *inode, struct file *file) + { + static int counter = 0; + + if (Device_Open) + return -EBUSY; + + Device_Open++; + sprintf(msg, "I already told you %d times Hello world!\n", counter++); + msg_Ptr = msg; + try_module_get(THIS_MODULE); + + return SUCCESS; + } + + /* + * Called when a process closes the device file. + */ + static int device_release(struct inode *inode, struct file *file) + { + Device_Open--; /* We're now ready for our next caller */ + + /* + * Decrement the usage count, or else once you opened the file, you'll + * never get rid of the module. + */ + module_put(THIS_MODULE); + + return SUCCESS; + } + + /* + * Called when a process, which already opened the dev file, attempts to + * read from it. + */ + static ssize_t device_read(struct file *filp, /* see include/linux/fs.h */ + char *buffer, /* buffer to fill with data */ + size_t length, /* length of the buffer */ + loff_t * offset) + { + /* + * Number of bytes actually written to the buffer + */ + int bytes_read = 0; + + /* + * If we're at the end of the message, + * return 0 signifying end of file + */ + if (*msg_Ptr == 0) + return 0; + + /* + * Actually put the data into the buffer + */ + while (length && *msg_Ptr) { + + /* + * The buffer is in the user data segment, not the kernel + * segment so "*" assignment won't work. We have to use + * put_user which copies data from the kernel data segment to + * the user data segment.
+ */ + put_user(*(msg_Ptr++), buffer++); + + length--; + bytes_read++; + } + + /* + * Most read functions return the number of bytes put into the buffer + */ + return bytes_read; + } + + /* + * Called when a process writes to dev file: echo "hi" > /dev/hello + */ + static ssize_t device_write(struct file *filp, + const char *buff, + size_t len, + loff_t * off) + { + printk(KERN_ALERT "Sorry, this operation isn't supported.\n"); + return -EINVAL; + } +::: +::: +::: + +::: {#outline-container-org903f5d5 .outline-3} +### Writing Modules for Multiple Kernel Versions {#org903f5d5} + +::: {#text-org903f5d5 .outline-text-3} +The system calls, which are the major interface the kernel shows to the +processes, generally stay the same across versions. A new system call +may be added, but usually the old ones will behave exactly like they +used to. This is necessary for backward compatibility -- a new kernel +version is not supposed to break regular processes. In most cases, the +device files will also remain the same. On the other hand, the internal +interfaces within the kernel can and do change between versions. + +The Linux kernel versions are divided between the stable versions +(n.\<even number\>.m) and the development versions +(n.\<odd number\>.m). The development versions include all the +cool new ideas, including those which will be considered a mistake, or +reimplemented, in the next version. As a result, you can\'t trust the +interface to remain the same in those versions (which is why I don\'t +bother to support them in this book, it\'s too much work and it would +become dated too quickly). In the stable versions, on the other hand, we +can expect the interface to remain the same regardless of the bug fix +version (the m number). + +There are differences between different kernel versions, and if you want +to support multiple kernel versions, you\'ll find yourself having to +code conditional compilation directives. 
The way to do this is to compare +the macro LINUX\_VERSION\_CODE to the macro KERNEL\_VERSION. In version +a.b.c of the kernel, the value of this macro would be +$2^{16}a + 2^{8}b + c$. + +While previous versions of this guide showed how you can write backward +compatible code with such constructs in great detail, we decided to +break with this tradition for the better. People interested in doing +so can use an LKMPG version that matches their kernel. We +decided to version the LKMPG like the kernel, at least as far as major +and minor number are concerned. We use the patchlevel for our own +versioning, so use LKMPG version 2.4.x for kernels 2.4.x, use LKMPG +version 2.6.x for kernels 2.6.x and so on. Also make sure that you +always use current, up to date versions of both kernel and guide. + +You might already have noticed that recent kernels look different. In +case you haven\'t, they look like 2.6.x.y now. The meaning of the first +three items basically stays the same, but a subpatchlevel has been added +and will indicate security fixes till the next stable patchlevel is out. +So people can choose between a stable tree with security updates and use +the latest kernel as developer tree. Search the kernel mailing list +archives if you\'re interested in the full story. +::: +::: +::: + +::: {#outline-container-org6400501 .outline-2} +The /proc File System {#org6400501} +--------------------- + +::: {#text-org6400501 .outline-text-2} +In Linux, there is an additional mechanism for the kernel and kernel +modules to send information to processes, namely the **/proc** file system. +Originally designed to allow easy access to information about processes +(hence the name), it is now used by every bit of the kernel which has +something interesting to report, such as **/proc/modules** which +provides the list of modules and **/proc/meminfo** which gives memory +usage statistics. 
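
From user space, an entry under /proc is read like any ordinary file. As a minimal sketch (the helper name `count_lines` is ours and is not part of any kernel or libc API), the following user-space C function counts the lines of a text file; pointed at /proc/modules it should give roughly the number of loaded modules.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not a kernel or libc API): count the lines in a
 * text file such as /proc/modules. Returns -1 if the file cannot be opened.
 */
long count_lines(const char *path)
{
    FILE *fp = fopen(path, "r");
    long lines = 0;
    int c;

    if (fp == NULL)
        return -1;

    /* Every newline terminates one line of the file */
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n')
            lines++;
    }

    fclose(fp);
    return lines;
}
```

Nothing here is specific to /proc: the kernel generates the content on demand, but the process only sees the usual open, read and close calls.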
+ +The method to use the proc file system is very similar to the one used +with device drivers: a structure is created with all the information +needed for the **/proc** file, including pointers to any handler +functions (in our case there is only one, the one called when somebody +attempts to read from the **/proc** file). Then, init\_module registers +the structure with the kernel and cleanup\_module unregisters it. + +Normal file systems are located on a disk, rather than just in memory +(which is where **/proc** is), and in that case the inode number is a +pointer to a disk location where the file\'s index-node (inode for +short) is located. The inode contains information about the file, for +example the file\'s permissions, together with a pointer to the disk +location or locations where the file\'s data can be found. + +Because we don\'t get called when the file is opened or closed, there\'s +nowhere for us to put try\_module\_get and module\_put in this +module, and if the file is opened and then the module is removed, +there\'s no way to avoid the consequences. + +Here is a simple example showing how to use a **/proc** file. This is the +HelloWorld for the **/proc** filesystem. There are three parts: create +the file **/proc/helloworld** in the function init\_module, return a +value (and a buffer) when the file **/proc/helloworld** is read in the +callback function **procfile\_read**, and delete the file +**/proc/helloworld** in the function cleanup\_module. + +The **/proc/helloworld** file is created when the module is loaded with the +function **proc\_create**. The return value is a pointer to a **struct +proc\_dir\_entry**, and it will be used to configure the file +**/proc/helloworld** (for example, the owner of this file). A null +return value means that the creation has failed. + +Each time the file **/proc/helloworld** is read, the function +**procfile\_read** is called. 
Two parameters of this function are very +important: the buffer (the second parameter) and the offset (the fourth +one). The content of the buffer will be returned to the application +which read it (for example the cat command). The offset is the current +position in the file. If the return value of the function is nonzero, +the reader normally calls this function again. So be careful with this function: if +it never returns zero, the read function is called endlessly. + +::: {.org-src-container} + # cat /proc/helloworld + HelloWorld! +::: + +::: {.org-src-container} + /* + procfs1.c + */ + + #include <linux/module.h> + #include <linux/kernel.h> + #include <linux/proc_fs.h> + #include <linux/uaccess.h> /* for copy_to_user */ + + #define procfs_name "helloworld" + + struct proc_dir_entry *Our_Proc_File; + + + ssize_t procfile_read(struct file *filePointer, char __user *buffer, + size_t buffer_length, loff_t *offset) + { + static const char s[] = "HelloWorld!\n"; + const size_t len = sizeof(s) - 1; /* skip the trailing '\0' */ + + /* Message already delivered (or buffer too small): report end of file */ + if (*offset >= len || buffer_length < len) + return 0; + + /* The buffer is in user space, so it must be filled with copy_to_user */ + if (copy_to_user(buffer, s, len)) + return -EFAULT; + + printk(KERN_INFO "procfile read %s\n", filePointer->f_path.dentry->d_name.name); + *offset += len; + return len; + } + + static const struct file_operations proc_file_fops = { + .owner = THIS_MODULE, + .read = procfile_read, + }; + + int init_module(void) + { + Our_Proc_File = proc_create(procfs_name, 0644, NULL, &proc_file_fops); + if (NULL == Our_Proc_File) { + /* Nothing was created, so there is nothing to remove */ + printk(KERN_ALERT "Error:Could not initialize /proc/%s\n", procfs_name); + return -ENOMEM; + } + + printk(KERN_INFO "/proc/%s created\n", procfs_name); + return 0; + } + + void cleanup_module(void) + { + proc_remove(Our_Proc_File); + printk(KERN_INFO "/proc/%s removed\n", procfs_name); + } +::: +::: + +::: {#outline-container-orga906618 .outline-3} +### Read and Write a /proc File {#orga906618} + +::: {#text-orga906618 .outline-text-3} +We have seen a very simple example for a /proc file where we only read +the file /proc/helloworld. It\'s also possible to write to a /proc file. 
+It works the same way as read: a function is called when the /proc file +is written. There is one difference from read, though: the data comes from the +user, so you have to import it from user space into kernel space (with +copy\_from\_user or get\_user). + +The reason for copy\_from\_user or get\_user is that Linux memory (on +Intel architecture, it may be different under some other processors) is +segmented. This means that a pointer, by itself, does not reference a +unique location in memory, only a location in a memory segment, and you +need to know which memory segment it is to be able to use it. There is +one memory segment for the kernel, and one for each of the processes. + +The only memory segment accessible to a process is its own, so when +writing regular programs to run as processes, there\'s no need to worry +about segments. When you write a kernel module, normally you want to +access the kernel memory segment, which is handled automatically by the +system. However, when the content of a memory buffer needs to be passed +between the currently running process and the kernel, the kernel +function receives a pointer to the memory buffer which is in the process +segment. The put\_user and get\_user macros allow you to access that +memory. These macros handle only one character at a time; you can handle several +characters with copy\_to\_user and copy\_from\_user. Since the module\'s own buffer +is in kernel space, the write function needs +to import data because it comes from user space, whereas in the read +function the data is already in kernel space and only has to be copied out. 
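
The length handling a write handler needs can be tried out in user space. In the sketch below, `store_message` and `MSG_MAX_SIZE` are hypothetical names, and `memcpy` merely stands in for `copy_from_user`, which exists only inside the kernel. The point it illustrates is that the length is clamped to one byte less than the buffer size, so the terminating NUL always fits.

```c
#include <stddef.h>
#include <string.h>

#define MSG_MAX_SIZE 16 /* illustrative size, not taken from the guide */

static char msg_buffer[MSG_MAX_SIZE];

/*
 * Hypothetical user-space model of a /proc write handler.
 * memcpy() stands in for copy_from_user(); in a real module the source
 * pointer refers to user memory and must not be dereferenced directly.
 * Returns the number of bytes actually stored.
 */
size_t store_message(const char *src, size_t len)
{
    size_t n = len;

    /* Clamp so that the terminating NUL always fits in the buffer */
    if (n > MSG_MAX_SIZE - 1)
        n = MSG_MAX_SIZE - 1;

    memcpy(msg_buffer, src, n);
    msg_buffer[n] = '\0';
    return n;
}
```

A kernel version would replace `memcpy` with `copy_from_user` and return -EFAULT when that call reports bytes it could not copy.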
+ +::: {.org-src-container} + /** + * procfs2.c - create a "file" in /proc + * + */ + + #include <linux/module.h> /* Specifically, a module */ + #include <linux/kernel.h> /* We're doing kernel work */ + #include <linux/proc_fs.h> /* Necessary because we use the proc fs */ + #include <linux/uaccess.h> /* for copy_from_user */ + + #define PROCFS_MAX_SIZE 1024 + #define PROCFS_NAME "buffer1k" + + /** + * This structure holds information about the /proc file + * + */ + static struct proc_dir_entry *Our_Proc_File; + + /** + * The buffer used to store characters for this module + * + */ + static char procfs_buffer[PROCFS_MAX_SIZE]; + + /** + * The size of the buffer + * + */ + static unsigned long procfs_buffer_size = 0; + + /** + * This function is called when the /proc file is read + * + */ + ssize_t procfile_read(struct file *filePointer, char __user *buffer, + size_t buffer_length, loff_t *offset) + { + static const char s[] = "HelloWorld!\n"; + const size_t len = sizeof(s) - 1; /* skip the trailing '\0' */ + + /* Message already delivered (or buffer too small): report end of file */ + if (*offset >= len || buffer_length < len) + return 0; + + if (copy_to_user(buffer, s, len)) + return -EFAULT; + + printk(KERN_INFO "procfile read %s\n", filePointer->f_path.dentry->d_name.name); + *offset += len; + return len; + } + + + /** + * This function is called when the /proc file is written + * + */ + static ssize_t procfile_write(struct file *file, const char __user *buff, + size_t len, loff_t *off) + { + procfs_buffer_size = len; + /* Leave one byte free for the terminating '\0' */ + if (procfs_buffer_size > PROCFS_MAX_SIZE - 1) + procfs_buffer_size = PROCFS_MAX_SIZE - 1; + + if (copy_from_user(procfs_buffer, buff, procfs_buffer_size)) + return -EFAULT; + + procfs_buffer[procfs_buffer_size] = '\0'; + return procfs_buffer_size; + } + + static const struct file_operations proc_file_fops = { + .owner = THIS_MODULE, + .read = procfile_read, + .write = procfile_write, + }; + + /** + *This function is called when the module is loaded + * + */ + int init_module() + { + Our_Proc_File = proc_create(PROCFS_NAME,0644,NULL,&proc_file_fops); + if(NULL==Our_Proc_File) { + /* Nothing was created, so there is nothing to remove */ + printk(KERN_ALERT "Error:Could not initialize /proc/%s\n",PROCFS_NAME); + return -ENOMEM; + } + + 
printk(KERN_INFO "/proc/%s created\n", PROCFS_NAME); + return 0; + } + + /** + *This function is called when the module is unloaded + * + */ + void cleanup_module() + { + proc_remove(Our_Proc_File); + printk(KERN_INFO "/proc/%s removed\n", PROCFS_NAME); + } +::: +::: +::: + +::: {#outline-container-org561d817 .outline-3} +### Manage /proc file with standard filesystem {#org561d817} + +::: {#text-org561d817 .outline-text-3} +We have seen how to read and write a /proc file with the /proc +interface. But it\'s also possible to manage /proc file with inodes. The +main concern is to use advanced functions, like permissions. + +In Linux, there is a standard mechanism for file system registration. +Since every file system has to have its own functions to handle inode +and file operations, there is a special structure to hold pointers to +all those functions, struct **inode\_operations**, which includes a +pointer to struct file\_operations. + +The difference between file and inode operations is that file operations +deal with the file itself whereas inode operations deal with ways of +referencing the file, such as creating links to it. + +In /proc, whenever we register a new file, we\'re allowed to specify +which struct inode\_operations will be used to access to it. This is the +mechanism we use, a struct inode\_operations which includes a pointer to +a struct file\_operations which includes pointers to our procfs\_read +and procfs\_write functions. + +Another interesting point here is the module\_permission function. This +function is called whenever a process tries to do something with the +/proc file, and it can decide whether to allow access or not. 
Right now +it is only based on the operation and the uid of the current user (as +available in current, a pointer to a structure which includes +information on the currently running process), but it could be based on +anything we like, such as what other processes are doing with the same +file, the time of day, or the last input we received. + +It\'s important to note that the standard roles of read and write are +reversed in the kernel. Read functions are used for output, whereas +write functions are used for input. The reason for that is that read and +write refer to the user\'s point of view --- if a process reads +something from the kernel, then the kernel needs to output it, and if a +process writes something to the kernel, then the kernel receives it as +input. + +::: {.org-src-container} + /* + procfs3.c + */ + + #include <linux/kernel.h> + #include <linux/module.h> + #include <linux/proc_fs.h> + #include <linux/sched.h> + #include <linux/uaccess.h> + + #define PROCFS_MAX_SIZE 2048 + #define PROCFS_ENTRY_FILENAME "buffer2k" + + struct proc_dir_entry *Our_Proc_File; + static char procfs_buffer[PROCFS_MAX_SIZE]; + static unsigned long procfs_buffer_size = 0; + + static ssize_t procfs_read(struct file *filp, char *buffer, + size_t length, loff_t *offset) + { + static int finished = 0; + if(finished) + { + printk(KERN_DEBUG "procfs_read: END\n"); + finished = 0; + return 0; + } + finished = 1; + if(copy_to_user(buffer, procfs_buffer, procfs_buffer_size)) + return -EFAULT; + printk(KERN_DEBUG "procfs_read: read %lu bytes\n", procfs_buffer_size); + return procfs_buffer_size; + } + static ssize_t procfs_write(struct file *file, const char *buffer, + size_t len, loff_t *off) + { + if(len>PROCFS_MAX_SIZE) + procfs_buffer_size = PROCFS_MAX_SIZE; + else + procfs_buffer_size = len; + if(copy_from_user(procfs_buffer, buffer, procfs_buffer_size)) + return -EFAULT; + printk(KERN_DEBUG "procfs_write: write %lu bytes\n", procfs_buffer_size); + return procfs_buffer_size; + } + int procfs_open(struct inode *inode, struct file *file) + { + 
try_module_get(THIS_MODULE); + return 0; + } + int procfs_close(struct inode *inode, struct file *file) + { + module_put(THIS_MODULE); + return 0; + } + + static struct file_operations File_Ops_4_Our_Proc_File = { + .read = procfs_read, + .write = procfs_write, + .open = procfs_open, + .release = procfs_close, + }; + + int init_module() + { + Our_Proc_File = proc_create(PROCFS_ENTRY_FILENAME, 0644, NULL,&File_Ops_4_Our_Proc_File); + if(Our_Proc_File == NULL) + { + /* Nothing was created, so there is nothing to remove */ + printk(KERN_DEBUG "Error: Could not initialize /proc/%s\n", PROCFS_ENTRY_FILENAME); + return -ENOMEM; + } + proc_set_size(Our_Proc_File, 80); + proc_set_user(Our_Proc_File, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID); + + printk(KERN_DEBUG "/proc/%s created\n", PROCFS_ENTRY_FILENAME); + return 0; + } + void cleanup_module() + { + remove_proc_entry(PROCFS_ENTRY_FILENAME, NULL); + printk(KERN_DEBUG "/proc/%s removed\n", PROCFS_ENTRY_FILENAME); + } +::: + +Still hungry for procfs examples? Well, first of all keep in mind that there +are rumors that procfs is on its way out; consider +using sysfs instead. Second, if you really can\'t get enough, there\'s a +highly recommendable bonus level for procfs below +linux/Documentation/DocBook/ . Use make help in your toplevel kernel +directory for instructions about how to convert it into your favourite +format. Example: make htmldocs . Consider using this mechanism in case +you want to document something kernel related yourself. +::: +::: + +::: {#outline-container-org38ea52f .outline-3} +### Manage /proc file with seq\_file {#org38ea52f} + +::: {#text-org38ea52f .outline-text-3} +As we have seen, writing a /proc file may be quite \"complex\". So to +help people writing /proc files, there is an API named seq\_file that +helps with formatting a /proc file for output. It\'s based on a sequence, which +is composed of 3 functions: start(), next(), and stop(). 
The seq\_file +API starts a sequence when a user reads the /proc file. + +A sequence begins with the call of the function start(). If the return +is a non-NULL value, the function next() is called. This function is an +iterator; the goal is to go through all the data. Each time next() is +called, the function show() is also called. It writes data values in the +buffer read by the user. The function next() is called until it returns +NULL. The sequence ends when next() returns NULL, then the function +stop() is called. + +BE CAREFUL: when a sequence is finished, another one starts. That means +that at the end of function stop(), the function start() is called +again. This loop finishes when the function start() returns NULL. You +can see a scheme of this in the figure \"How seq\_file works\". + +::: {.figure} +![seq\_file.png](img/seq_file.png) +::: + +Seq\_file provides basic functions for file\_operations, such as seq\_read, +seq\_lseek, and some others, but nothing for writing to the /proc file. Of +course, you can still handle writes the same way as in the previous example. + +::: {.org-src-container} + /** + * procfs4.c - create a "file" in /proc + * This program uses the seq_file library to manage the /proc file. + * + */ + + #include <linux/kernel.h> /* We're doing kernel work */ + #include <linux/module.h> /* Specifically, a module */ + #include <linux/proc_fs.h> /* Necessary because we use proc fs */ + #include <linux/seq_file.h> /* for seq_file */ + + #define PROC_NAME "iter" + + MODULE_AUTHOR("Philippe Reynes"); + MODULE_LICENSE("GPL"); + + /** + * This function is called at the beginning of a sequence. + * ie, when: + * - the /proc file is read (first time) + * - after the function stop (end of sequence) + * + */ + static void *my_seq_start(struct seq_file *s, loff_t *pos) + { + static unsigned long counter = 0; + + /* beginning a new sequence ? 
*/ + if ( *pos == 0 ) { + /* yes => return a non null value to begin the sequence */ + return &counter; + } + else { + /* no => it's the end of the sequence, return end to stop reading */ + *pos = 0; + return NULL; + } + } + + /** + * This function is called after the beginning of a sequence. + * It's called until the return is NULL (this ends the sequence). + * + */ + static void *my_seq_next(struct seq_file *s, void *v, loff_t *pos) + { + unsigned long *tmp_v = (unsigned long *)v; + (*tmp_v)++; + (*pos)++; + return NULL; + } + + /** + * This function is called at the end of a sequence + * + */ + static void my_seq_stop(struct seq_file *s, void *v) + { + /* nothing to do, we use a static value in start() */ + } + + /** + * This function is called for each "step" of a sequence + * + */ + static int my_seq_show(struct seq_file *s, void *v) + { + /* v points to the unsigned long counter returned by start() */ + unsigned long *spos = (unsigned long *) v; + + seq_printf(s, "%lu\n", *spos); + return 0; + } + + /** + * This structure gathers the functions that manage the sequence + * + */ + static struct seq_operations my_seq_ops = { + .start = my_seq_start, + .next = my_seq_next, + .stop = my_seq_stop, + .show = my_seq_show + }; + + /** + * This function is called when the /proc file is opened. 
+ * + */ + static int my_open(struct inode *inode, struct file *file) + { + return seq_open(file, &my_seq_ops); + }; + + /** + * This structure gather "function" that manage the /proc file + * + */ + static struct file_operations my_file_ops = { + .owner = THIS_MODULE, + .open = my_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release + }; + + + /** + * This function is called when the module is loaded + * + */ + int init_module(void) + { + struct proc_dir_entry *entry; + + entry = proc_create(PROC_NAME, 0, NULL, &my_file_ops); + if(entry == NULL) + { + remove_proc_entry(PROC_NAME, NULL); + printk(KERN_DEBUG "Error: Could not initialize /proc/%s\n", PROC_NAME); + return -ENOMEM; + } + + return 0; + } + + /** + * This function is called when the module is unloaded. + * + */ + void cleanup_module(void) + { + remove_proc_entry(PROC_NAME, NULL); + printk(KERN_DEBUG "/proc/%s removed\n", PROC_NAME); + } +::: + +If you want more information, you can read this web page: + +- +- + +You can also read the code of fs/seq\_file.c in the linux kernel. +::: +::: +::: + +::: {#outline-container-org954957f .outline-2} +sysfs: Interacting with your module {#org954957f} +----------------------------------- + +::: {#text-org954957f .outline-text-2} +*sysfs* allows you to interact with the running kernel from userspace by +reading or setting variables inside of modules. This can be useful for +debugging purposes, or just as an interface for applications or scripts. +You can find sysfs directories and files under the *sys* directory on +your system. + +::: {.org-src-container} + ls -l /sys +::: + +An example of a hello world module which includes the creation of a +variable accessible via sysfs is given below. 
+ +::: {.org-src-container} + /* + * hello-sysfs.c sysfs example + */ + + #include <linux/module.h> + #include <linux/printk.h> + #include <linux/kobject.h> + #include <linux/sysfs.h> + #include <linux/init.h> + #include <linux/fs.h> + #include <linux/string.h> + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Bob Mottram"); + + static struct kobject *mymodule; + + /* the variable you want to be able to change */ + static int myvariable = 0; + + static ssize_t myvariable_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) + { + return sprintf(buf, "%d\n", myvariable); + } + + /* The store callback receives a const buffer, so no cast is needed below */ + static ssize_t myvariable_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) + { + sscanf(buf, "%d", &myvariable); + return count; + } + + + static struct kobj_attribute myvariable_attribute = + __ATTR(myvariable, 0660, myvariable_show, + myvariable_store); + + static int __init mymodule_init (void) + { + int error = 0; + + printk(KERN_INFO "mymodule: initialised\n"); + + mymodule = + kobject_create_and_add("mymodule", kernel_kobj); + if (!mymodule) + return -ENOMEM; + + error = sysfs_create_file(mymodule, &myvariable_attribute.attr); + if (error) { + printk(KERN_INFO "failed to create the myvariable file " \ + "in /sys/kernel/mymodule\n"); + } + + return error; + } + + static void __exit mymodule_exit (void) + { + printk(KERN_INFO "mymodule: Exit success\n"); + kobject_put(mymodule); + } + + module_init(mymodule_init); + module_exit(mymodule_exit); +::: + +Make and install the module: + +::: {.org-src-container} + make + sudo insmod hello-sysfs.ko +::: + +Check that it exists: + +::: {.org-src-container} + sudo lsmod | grep hello_sysfs +::: + +What is the current value of *myvariable* ? + +::: {.org-src-container} + cat /sys/kernel/mymodule/myvariable +::: + +Set the value of *myvariable* and check that it changed. 
+ +::: {.org-src-container} + echo "32" > /sys/kernel/mymodule/myvariable + cat /sys/kernel/mymodule/myvariable +::: + +Finally, remove the test module: + +::: {.org-src-container} + sudo rmmod hello_sysfs +::: +::: +::: + +::: {#outline-container-org438f37b .outline-2} +Talking To Device Files {#org438f37b} +----------------------- + +::: {#text-org438f37b .outline-text-2} +Device files are supposed to represent physical devices. Most physical +devices are used for output as well as input, so there has to be some +mechanism for device drivers in the kernel to get the output to send to +the device from processes. This is done by opening the device file for +output and writing to it, just like writing to a file. In the following +example, this is implemented by device\_write. + +This is not always enough. Imagine you had a serial port connected to a +modem (even if you have an internal modem, it is still implemented from +the CPU\'s perspective as a serial port connected to a modem, so you +don\'t have to tax your imagination too hard). The natural thing to do +would be to use the device file to write things to the modem (either +modem commands or data to be sent through the phone line) and read +things from the modem (either responses for commands or the data +received through the phone line). However, this leaves open the question +of what to do when you need to talk to the serial port itself, for +example to send the rate at which data is sent and received. + +The answer in Unix is to use a special function called **ioctl** (short +for Input Output ConTroL). Every device can have its own ioctl commands, +which can be read ioctl\'s (to send information from a process to the +kernel), write ioctl\'s (to return information to a process), both or +neither. Notice here the roles of read and write are reversed again, so +in ioctl\'s read is to send information to the kernel and write is to +receive information from the kernel. 
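
The direction of an ioctl is encoded in the command number itself. The macro names used to build such numbers follow the calling process's point of view: `_IOW` means the process writes data to the kernel, and `_IOR` means it reads data back. The user-space sketch below decodes two command numbers with the `_IOC_DIR` and `_IOC_NR` macros from `<linux/ioctl.h>`; the magic number 100, the two `EXAMPLE_` commands and the helper names are only illustrative assumptions.

```c
#include <linux/ioctl.h>

#define EXAMPLE_MAGIC 100 /* illustrative "type" byte, an assumption */

/* Two example commands built with the standard macros */
#define EXAMPLE_SET _IOW(EXAMPLE_MAGIC, 0, char *) /* process -> kernel */
#define EXAMPLE_GET _IOR(EXAMPLE_MAGIC, 1, char *) /* kernel -> process */

/* Hypothetical helper: does the process send data to the kernel? */
int cmd_sends_to_kernel(unsigned int cmd)
{
    return (_IOC_DIR(cmd) & _IOC_WRITE) != 0;
}

/* Hypothetical helper: does the process receive data from the kernel? */
int cmd_receives_from_kernel(unsigned int cmd)
{
    return (_IOC_DIR(cmd) & _IOC_READ) != 0;
}

/* Hypothetical helper: extract the per-driver command index */
unsigned int cmd_index(unsigned int cmd)
{
    return _IOC_NR(cmd);
}
```

A driver can apply the same decoding to the number it receives, for instance to reject commands whose type byte is not its own.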
+ +The ioctl function is called with three parameters: the file descriptor +of the appropriate device file, the ioctl number, and a parameter, which +is of type long so you can use a cast to use it to pass anything. You +won\'t be able to pass a structure this way, but you will be able to +pass a pointer to the structure. + +The ioctl number encodes the major device number, the type of the ioctl, +the command, and the type of the parameter. This ioctl number is usually +created by a macro call (\_IO, \_IOR, \_IOW or \_IOWR, depending on +the type) in a header file. This header file should then be included +both by the programs which will use ioctl (so they can generate the +appropriate ioctl\'s) and by the kernel module (so it can understand +them). In the example below, the header file is chardev.h and the program +which uses it is ioctl.c. + +If you want to use ioctls in your own kernel modules, it is best to +receive an official ioctl assignment, so if you accidentally get +somebody else\'s ioctls, or if they get yours, you\'ll know something is +wrong. For more information, consult the kernel source tree at +Documentation/ioctl/ioctl-number.txt. + +::: {.org-src-container} + /* + * chardev2.c - Create an input/output character device + */ + + #include <linux/kernel.h> /* We're doing kernel work */ + #include <linux/module.h> /* Specifically, a module */ + #include <linux/fs.h> + #include <linux/uaccess.h> /* for get_user and put_user */ + + #include "chardev.h" + #define SUCCESS 0 + #define DEVICE_NAME "char_dev" + #define BUF_LEN 80 + + /* + * Is the device open right now? Used to prevent + * concurrent access into the same device + */ + static int Device_Open = 0; + + /* + * The message the device will give when asked + */ + static char Message[BUF_LEN]; + + /* + * How far did the process reading the message get? + * Useful if the message is larger than the size of the + * buffer we get to fill in device_read. 
+ */ + static char *Message_Ptr; + + /* + * This is called whenever a process attempts to open the device file + */ + static int device_open(struct inode *inode, struct file *file) + { + #ifdef DEBUG + printk(KERN_INFO "device_open(%p)\n", file); + #endif + + /* + * We don't want to talk to two processes at the same time + */ + if (Device_Open) + return -EBUSY; + + Device_Open++; + /* + * Initialize the message + */ + Message_Ptr = Message; + try_module_get(THIS_MODULE); + return SUCCESS; + } + + static int device_release(struct inode *inode, struct file *file) + { + #ifdef DEBUG + printk(KERN_INFO "device_release(%p,%p)\n", inode, file); + #endif + + /* + * We're now ready for our next caller + */ + Device_Open--; + + module_put(THIS_MODULE); + return SUCCESS; + } + + /* + * This function is called whenever a process which has already opened the + * device file attempts to read from it. + */ + static ssize_t device_read(struct file *file, /* see include/linux/fs.h */ + char __user * buffer, /* buffer to be + * filled with data */ + size_t length, /* length of the buffer */ + loff_t * offset) + { + /* + * Number of bytes actually written to the buffer + */ + int bytes_read = 0; + + #ifdef DEBUG + printk(KERN_INFO "device_read(%p,%p,%d)\n", file, buffer, length); + #endif + + /* + * If we're at the end of the message, return 0 + * (which signifies end of file) + */ + if (*Message_Ptr == 0) + return 0; + + /* + * Actually put the data into the buffer + */ + while (length && *Message_Ptr) { + + /* + * Because the buffer is in the user data segment, + * not the kernel data segment, assignment wouldn't + * work. Instead, we have to use put_user which + * copies data from the kernel data segment to the + * user data segment. 
+ */ + put_user(*(Message_Ptr++), buffer++); + length--; + bytes_read++; + } + + #ifdef DEBUG + printk(KERN_INFO "Read %d bytes, %d left\n", bytes_read, length); + #endif + + /* + * Read functions are supposed to return the number + * of bytes actually inserted into the buffer + */ + return bytes_read; + } + + /* + * This function is called when somebody tries to + * write into our device file. + */ + static ssize_t + device_write(struct file *file, + const char __user * buffer, size_t length, loff_t * offset) + { + int i; + + #ifdef DEBUG + printk(KERN_INFO "device_write(%p,%s,%d)", file, buffer, length); + #endif + + for (i = 0; i < length && i < BUF_LEN; i++) + get_user(Message[i], buffer + i); + + Message_Ptr = Message; + + /* + * Again, return the number of input characters used + */ + return i; + } + + /* + * This function is called whenever a process tries to do an ioctl on our + * device file. We get two extra parameters (additional to the inode and file + * structures, which all device functions get): the number of the ioctl called + * and the parameter given to the ioctl function. + * + * If the ioctl is write or read/write (meaning output is returned to the + * calling process), the ioctl call returns the output of this function. + * + */ + long device_ioctl(struct file *file, /* ditto */ + unsigned int ioctl_num, /* number and param for ioctl */ + unsigned long ioctl_param) + { + int i; + char *temp; + char ch; + + /* + * Switch according to the ioctl called + */ + switch (ioctl_num) { + case IOCTL_SET_MSG: + /* + * Receive a pointer to a message (in user space) and set that + * to be the device's message. Get the parameter given to + * ioctl by the process. 
+ */ + temp = (char *)ioctl_param; + + /* + * Find the length of the message + */ + get_user(ch, temp); + for (i = 0; ch && i < BUF_LEN; i++, temp++) + get_user(ch, temp); + + device_write(file, (char *)ioctl_param, i, 0); + break; + + case IOCTL_GET_MSG: + /* + * Give the current message to the calling process - + * the parameter we got is a pointer, fill it. + */ + i = device_read(file, (char *)ioctl_param, 99, 0); + + /* + * Put a zero at the end of the buffer, so it will be + * properly terminated + */ + put_user('\0', (char *)ioctl_param + i); + break; + + case IOCTL_GET_NTH_BYTE: + /* + * This ioctl is both input (ioctl_param) and + * output (the return value of this function) + */ + if (ioctl_param >= BUF_LEN) /* stay inside Message[] */ + return -EINVAL; + return Message[ioctl_param]; + break; + } + + return SUCCESS; + } + + /* Module Declarations */ + + /* + * This structure will hold the functions to be called + * when a process does something to the device we + * created. Since a pointer to this structure is kept in + * the devices table, it can't be local to + * init_module. NULL is for unimplemented functions. + */ + struct file_operations Fops = { + .read = device_read, + .write = device_write, + .unlocked_ioctl = device_ioctl, + .open = device_open, + .release = device_release, /* a.k.a. close */ + }; + + /* + * Initialize the module - Register the character device + */ + int init_module() + { + int ret_val; + /* + * Register the character device (at least try) + */ + ret_val = register_chrdev(MAJOR_NUM, DEVICE_NAME, &Fops); + + /* + * Negative values signify an error + */ + if (ret_val < 0) { + printk(KERN_ALERT "%s failed with %d\n", + "Sorry, registering the character device ", ret_val); + return ret_val; + } + + printk(KERN_INFO "%s The major device number is %d.\n", + "Registration is a success", MAJOR_NUM); + printk(KERN_INFO "If you want to talk to the device driver,\n"); + printk(KERN_INFO "you'll have to create a device file. 
\n"); + printk(KERN_INFO "We suggest you use:\n"); + printk(KERN_INFO "mknod %s c %d 0\n", DEVICE_FILE_NAME, MAJOR_NUM); + printk(KERN_INFO "The device file name is important, because\n"); + printk(KERN_INFO "the ioctl program assumes that's the\n"); + printk(KERN_INFO "file you'll use.\n"); + + return 0; + } + + /* + * Cleanup - unregister the character device + */ + void cleanup_module() + { + /* + * Unregister the device + */ + unregister_chrdev(MAJOR_NUM, DEVICE_NAME); + } +::: + +::: {.org-src-container} + /* + * chardev.h - the header file with the ioctl definitions. + * + * The declarations here have to be in a header file, because + * they need to be known both to the kernel module + * (in chardev.c) and the process calling ioctl (ioctl.c) + */ + + #ifndef CHARDEV_H + #define CHARDEV_H + + #include <linux/ioctl.h> + + /* + * The major device number. We can't rely on dynamic + * registration any more, because ioctls need to know + * it. + */ + #define MAJOR_NUM 100 + + /* + * Set the message of the device driver + */ + #define IOCTL_SET_MSG _IOW(MAJOR_NUM, 0, char *) + /* + * _IOW means that we're creating an ioctl command + * number for passing information from a user process + * to the kernel module. + * + * The first argument, MAJOR_NUM, is the major device + * number we're using. + * + * The second argument is the number of the command + * (there could be several with different meanings). + * + * The third argument is the type we want to get from + * the process to the kernel. + */ + + /* + * Get the message of the device driver + */ + #define IOCTL_GET_MSG _IOR(MAJOR_NUM, 1, char *) + /* + * This IOCTL is used for output, to get the message + * of the device driver. However, we still need the + * buffer to place the message in to be input, + * as it is allocated by the process. + */ + + /* + * Get the n'th byte of the message + */ + #define IOCTL_GET_NTH_BYTE _IOWR(MAJOR_NUM, 2, int) + /* + * The IOCTL is used for both input and output.
It + * receives from the user a number, n, and returns + * Message[n]. + */ + + /* + * The name of the device file + */ + #define DEVICE_FILE_NAME "char_dev" + + #endif +::: + +::: {.org-src-container} + /* + * ioctl.c - the process to use ioctl's to control the kernel module + * + * Until now we could have used cat for input and output. But now + * we need to do ioctl's, which require writing our own process. + */ + + /* + * device specifics, such as ioctl numbers and the + * major device file. + */ + #include "../chardev.h" + + #include <stdio.h> /* printf */ + #include <stdlib.h> /* exit */ + #include <fcntl.h> /* open */ + #include <unistd.h> /* close */ + #include <sys/ioctl.h> /* ioctl */ + + /* + * Functions for the ioctl calls + */ + + int ioctl_set_msg(int file_desc, char *message) + { + int ret_val; + + ret_val = ioctl(file_desc, IOCTL_SET_MSG, message); + + if (ret_val < 0) { + printf("ioctl_set_msg failed:%d\n", ret_val); + exit(-1); + } + return 0; + } + + int ioctl_get_msg(int file_desc) + { + int ret_val; + char message[100]; + + /* + * Warning - this is dangerous because we don't tell + * the kernel how far it's allowed to write, so it + * might overflow the buffer.
In a real production + * program, we would have used two ioctls - one to tell + * the kernel the buffer length and another to give + * it the buffer to fill + */ + ret_val = ioctl(file_desc, IOCTL_GET_MSG, message); + + if (ret_val < 0) { + printf("ioctl_get_msg failed:%d\n", ret_val); + exit(-1); + } + + printf("get_msg message:%s\n", message); + return 0; + } + + int ioctl_get_nth_byte(int file_desc) + { + int i; + char c; + + printf("get_nth_byte message:"); + + i = 0; + do { + c = ioctl(file_desc, IOCTL_GET_NTH_BYTE, i++); + + if (c < 0) { + printf("ioctl_get_nth_byte failed at the %d'th byte:\n", + i); + exit(-1); + } + + putchar(c); + } while (c != 0); + putchar('\n'); + return 0; + } + + /* + * Main - Call the ioctl functions + */ + int main() + { + int file_desc, ret_val; + char *msg = "Message passed by ioctl\n"; + + file_desc = open(DEVICE_FILE_NAME, 0); + if (file_desc < 0) { + printf("Can't open device file: %s\n", DEVICE_FILE_NAME); + exit(-1); + } + + ioctl_get_nth_byte(file_desc); + ioctl_get_msg(file_desc); + ioctl_set_msg(file_desc, msg); + + close(file_desc); + return 0; + } +::: +::: +::: + +::: {#outline-container-org8de5924 .outline-2} +System Calls {#org8de5924} +------------ + +::: {#text-org8de5924 .outline-text-2} +So far, the only thing we\'ve done was to use well defined kernel +mechanisms to register **/proc** files and device handlers. This is fine +if you want to do something the kernel programmers thought you\'d want, +such as write a device driver. But what if you want to do something +unusual, to change the behavior of the system in some way? Then, you\'re +mostly on your own. + +If you\'re not being sensible and using a virtual machine then this is +where kernel programming can become hazardous. While writing the example +below, I killed the **open()** system call. This meant I couldn\'t open +any files, I couldn\'t run any programs, and I couldn\'t shutdown the +system. I had to restart the virtual machine. 
No important files got +annihilated, but if I had been doing this on some live mission critical system +then that could have been a possible outcome. To ensure you don\'t lose +any files, even within a test environment, please run **sync** right +before you do the **insmod** and the **rmmod**. + +Forget about **/proc** files, forget about device files. They\'re just +minor details. Minutiae in the vast expanse of the universe. The real +process to kernel communication mechanism, the one used by all +processes, is *system calls*. When a process requests a service from the +kernel (such as opening a file, forking to a new process, or requesting +more memory), this is the mechanism used. If you want to change the +behaviour of the kernel in interesting ways, this is the place to do it. +By the way, if you want to see which system calls a program uses, run +**strace \<arguments\>**. + +In general, a process is not supposed to be able to access the kernel. +It can\'t access kernel memory and it can\'t call kernel functions. The +hardware of the CPU enforces this (that\'s the reason why it\'s called +\'protected mode\' or \'page protection\'). + +System calls are an exception to this general rule. What happens is that +the process fills the registers with the appropriate values and then +calls a special instruction which jumps to a previously defined location +in the kernel (of course, that location is readable by user processes, +but it is not writable by them). On 32-bit Intel CPUs, this is traditionally done by means of +interrupt 0x80; 64-bit kernels normally use the dedicated **syscall** instruction instead. The hardware knows that once you jump to this location, +you are no longer running in restricted user mode, but as the operating +system kernel, and therefore you\'re allowed to do whatever you want. + +The location in the kernel a process can jump to is called system\_call. +The procedure at that location checks the system call number, which +tells the kernel what service the process requested.
Then, it looks at +the table of system calls (sys\_call\_table) to see the address of the +kernel function to call. Then it calls the function, and after it +returns, does a few system checks and then returns to the process +(or to a different process, if the process time ran out). If you want to +read this code, it\'s in the source file +arch/\<architecture\>/kernel/entry.S, after the line +ENTRY(system\_call). + +So, if we want to change the way a certain system call works, what we +need to do is to write our own function to implement it (usually by +adding a bit of our own code, and then calling the original function) +and then change the pointer at sys\_call\_table to point to our +function. Because we might be removed later and we don\'t want to leave +the system in an unstable state, it\'s important for cleanup\_module to +restore the table to its original state. + +The source code here is an example of such a kernel module. We want to +\"spy\" on a certain user, and to printk() a message whenever that user +opens a file. Towards this end, we replace the system call to open a +file with our own function, called **our\_sys\_open**. This function +checks the uid (user\'s id) of the current process, and if it\'s equal +to the uid we spy on, it calls printk() to display the name of the file +to be opened. Then, either way, it calls the original open() function +with the same parameters, to actually open the file. + +The **init\_module** function replaces the appropriate location in +**sys\_call\_table** and keeps the original pointer in a variable. The +cleanup\_module function uses that variable to restore everything back +to normal. This approach is dangerous, because of the possibility of two +kernel modules changing the same system call. Imagine we have two kernel +modules, A and B. A\'s open system call will be A\_open and B\'s will be +B\_open.
Now, when A is inserted into the kernel, the system call is +replaced with A\_open, which will call the original sys\_open when it\'s +done. Next, B is inserted into the kernel, which replaces the system +call with B\_open, which will call what it thinks is the original system +call, A\_open, when it\'s done. + +Now, if B is removed first, everything will be well: it will simply +restore the system call to A\_open, which calls the original. However, +if A is removed and then B is removed, the system will crash. A\'s +removal will restore the system call to the original, sys\_open, cutting +B out of the loop. Then, when B is removed, it will restore the system +call to what it thinks is the original, **A\_open**, which is no longer +in memory. At first glance, it appears we could solve this particular +problem by checking if the system call is equal to our open function and, +if it is not, leaving it unchanged (so that B won\'t change the system call +when it\'s removed), but that will cause an even worse problem. When A +is removed, it sees that the system call was changed to **B\_open** so +that it is no longer pointing to **A\_open**, so it won\'t restore it to +**sys\_open** before it is removed from memory. Unfortunately, +**B\_open** will still try to call **A\_open** which is no longer there, +so that even without removing B the system would crash. + +Note that all the related problems make syscall stealing infeasible for +production use. In order to keep people from doing potentially harmful +things **sys\_call\_table** is no longer exported. This means, if you +want to do something more than a mere dry run of this example, you will +have to patch your current kernel in order to have sys\_call\_table +exported. In the example directory you will find a README and the patch. +As you can imagine, such modifications are not to be taken lightly. Do +not try this on valuable systems (i.e. systems that you do not own or +cannot restore easily).
You\'ll need to get the complete source code of +this guide as a tarball in order to get the patch and the README. +Depending on your kernel version, you might even need to hand apply the +patch. Still here? Well, so is this chapter. If Wile E. Coyote were a +kernel hacker, this would be the first thing he\'d try. ;) + +::: {.org-src-container} + /* + * syscall.c + * + * System call "stealing" sample. + * + * Disables page protection at a processor level by + * changing the 16th bit in the cr0 register (could be Intel specific) + * + * Based on example by Peter Jay Salzman and + * https://bbs.archlinux.org/viewtopic.php?id=139406 + */ + + #include <linux/kernel.h> + #include <linux/module.h> + #include <linux/syscalls.h> /* sys_close */ + #include <linux/delay.h> /* msleep */ + #include <asm/special_insns.h> /* read_cr0, write_cr0 */ + #include <linux/moduleparam.h> /* which will have params */ + #include <linux/unistd.h> /* The list of system calls */ + + /* + * For the current (process) structure, we need + * this to know who the current user is. + */ + #include <linux/sched.h> + #include <linux/uaccess.h> + + unsigned long **sys_call_table; + unsigned long original_cr0; + + /* + * UID we want to spy on - will be filled from the + * command line + */ + static int uid; + module_param(uid, int, 0644); + + /* + * A pointer to the original system call. The reason + * we keep this, rather than call the original function + * (sys_open), is because somebody else might have + * replaced the system call before us. Note that this + * is not 100% safe, because if another module + * replaced sys_open before us, then when we're inserted + * we'll call the function in that module - and it + * might be removed before we are. + * + * Another reason for this is that we can't get sys_open. + * It's a static variable, so it is not exported. + */ + asmlinkage int (*original_call) (const char *, int, int); + + /* + * The function we'll replace sys_open (the function + * called when you call the open system call) with. To + * find the exact prototype, with the number and type + * of arguments, we find the original function first + * (it's at fs/open.c).
+ * + * In theory, this means that we're tied to the + * current version of the kernel. In practice, the + * system calls almost never change (it would wreak havoc + * and require programs to be recompiled, since the system + * calls are the interface between the kernel and the + * processes). + */ + asmlinkage int our_sys_open(const char *filename, int flags, int mode) + { + int i = 0; + char ch; + + /* + * Report the file, if relevant + */ + printk("Opened file by %d: ", uid); + do { + get_user(ch, filename + i); + i++; + printk("%c", ch); + } while (ch != 0); + printk("\n"); + + /* + * Call the original sys_open - otherwise, we lose + * the ability to open files + */ + return original_call(filename, flags, mode); + } + + static unsigned long **aquire_sys_call_table(void) + { + unsigned long int offset = PAGE_OFFSET; + unsigned long **sct; + + while (offset < ULLONG_MAX) { + sct = (unsigned long **)offset; + + if (sct[__NR_close] == (unsigned long *) sys_close) + return sct; + + offset += sizeof(void *); + } + + return NULL; + } + + static int __init syscall_start(void) + { + if(!(sys_call_table = aquire_sys_call_table())) + return -1; + + original_cr0 = read_cr0(); + + write_cr0(original_cr0 & ~0x00010000); + + /* keep track of the original open function */ + original_call = (void*)sys_call_table[__NR_open]; + + /* use our open function instead */ + sys_call_table[__NR_open] = (unsigned long *)our_sys_open; + + write_cr0(original_cr0); + + printk(KERN_INFO "Spying on UID:%d\n", uid); + + return 0; + } + + static void __exit syscall_end(void) + { + if(!sys_call_table) { + return; + } + + /* + * Return the system call back to normal + */ + if (sys_call_table[__NR_open] != (unsigned long *)our_sys_open) { + printk(KERN_ALERT "Somebody else also played with the "); + printk(KERN_ALERT "open system call\n"); + printk(KERN_ALERT "The system may be left in "); + printk(KERN_ALERT "an unstable state.\n"); + } + + write_cr0(original_cr0 & ~0x00010000); +
sys_call_table[__NR_open] = (unsigned long *)original_call; + write_cr0(original_cr0); + + msleep(2000); + } + + module_init(syscall_start); + module_exit(syscall_end); + + MODULE_LICENSE("GPL"); +::: +::: +::: + +::: {#outline-container-org13e2c0e .outline-2} +Blocking Processes and threads {#org13e2c0e} +------------------------------ + +::: {#text-org13e2c0e .outline-text-2} +::: + +::: {#outline-container-org9cbc7d3 .outline-3} +### Sleep {#org9cbc7d3} + +::: {#text-org9cbc7d3 .outline-text-3} +What do you do when somebody asks you for something you can\'t do right +away? If you\'re a human being and you\'re bothered by a human being, +the only thing you can say is: \"*Not right now, I\'m busy. Go away!*\". +But if you\'re a kernel module and you\'re bothered by a process, you +have another possibility. You can put the process to sleep until you can +service it. After all, processes are being put to sleep by the kernel +and woken up all the time (that\'s the way multiple processes appear to +run at the same time on a single CPU). + +This kernel module is an example of this. The file (called +**/proc/sleep**) can only be opened by a single process at a time. If +the file is already open, the kernel module calls +wait\_event\_interruptible. The easiest way to keep a file open is to +open it with: + +::: {.org-src-container} + tail -f /proc/sleep +::: + +wait\_event\_interruptible changes the status of the task (a task is the kernel data +structure which holds information about a process and the system call +it\'s in, if any) to **TASK\_INTERRUPTIBLE**, which means that the task +will not run until it is woken up somehow, and adds it to WaitQ, the +queue of tasks waiting to access the file. Then, the function calls the +scheduler to context switch to a different process, one which has some +use for the CPU. + +When a process is done with the file, it closes it, and module\_close is +called.
That function wakes up all the processes in the queue (none of the +waiters in this example is exclusive, so wake\_up wakes every one of them). It then returns and the +process which just closed the file can continue to run. In time, the +scheduler decides that that process has had enough and gives control of +the CPU to another process. Eventually, one of the processes which was +in the queue will be given control of the CPU by the scheduler. It +starts at the point right after the call to +**wait\_event\_interruptible**. + +This means that the process is still in kernel mode - as far as the +process is concerned, it issued the open system call and the system call +hasn\'t returned yet. The process doesn\'t know somebody else used the +CPU for most of the time between the moment it issued the call and the +moment it returned. + +It can then proceed to set a global variable to tell all the other +processes that the file is still open and go on with its life. When the +other processes get a piece of the CPU, they\'ll see that global +variable and go back to sleep. + +So we\'ll use **tail -f** to keep the file open in the background, while +trying to access it with another process (again in the background, so +that we need not switch to a different vt). As soon as the first +background process is killed with **kill %1**, the second is woken up, is +able to access the file and finally terminates. + +To make our life more interesting, **module\_close** doesn\'t have a +monopoly on waking up the processes which wait to access the file. A +signal, such as *Ctrl+c* (**SIGINT**) can also wake up a process. This +is because we used **wait\_event\_interruptible**. We could have +used **wait\_event** instead, but that would have resulted in +extremely angry users whose *Ctrl+c*\'s are ignored. + +When a signal wakes the process, we want to return with **-EINTR** immediately. This is +important so users can, for example, kill the process before it receives +the file. + +There is one more point to remember.
Sometimes processes don\'t want to +sleep; they want either to get what they want immediately, or to be told +it cannot be done. Such processes use the **O\_NONBLOCK** flag when +opening the file. The kernel is supposed to respond by returning with +the error code **-EAGAIN** from operations which would otherwise block, +such as opening the file in this example. The program cat\_noblock, +available in the source directory for this chapter, can be used to open +a file with **O\_NONBLOCK**. + +::: {.org-src-container} + hostname:~/lkmpg-examples/09-BlockingProcesses# insmod sleep.ko + hostname:~/lkmpg-examples/09-BlockingProcesses# cat_noblock /proc/sleep + Last input: + hostname:~/lkmpg-examples/09-BlockingProcesses# tail -f /proc/sleep & + Last input: + Last input: + Last input: + Last input: + Last input: + Last input: + Last input: + tail: /proc/sleep: file truncated + [1] 6540 + hostname:~/lkmpg-examples/09-BlockingProcesses# cat_noblock /proc/sleep + Open would block + hostname:~/lkmpg-examples/09-BlockingProcesses# kill %1 + [1]+ Terminated tail -f /proc/sleep + hostname:~/lkmpg-examples/09-BlockingProcesses# cat_noblock /proc/sleep + Last input: + hostname:~/lkmpg-examples/09-BlockingProcesses# +::: + +::: {.org-src-container} + /* + * sleep.c - create a /proc file, and if several processes try to open it at + * the same time, put all but one to sleep + */ + + #include <linux/kernel.h> /* We're doing kernel work */ + #include <linux/module.h> /* Specifically, a module */ + #include <linux/proc_fs.h> /* Necessary because we use proc fs */ + #include <linux/sched.h> /* For putting processes to sleep and + waking them up */ + #include <linux/uaccess.h> /* for get_user and put_user */ + + /* + * The module's file functions + */ + + /* + * Here we keep the last message received, to prove that we can process our + * input + */ + #define MESSAGE_LENGTH 80 + static char Message[MESSAGE_LENGTH]; + + static struct proc_dir_entry *Our_Proc_File; + #define PROC_ENTRY_FILENAME "sleep" + + /* + * Since we use the file operations struct, we can't use the
special proc + * output provisions - we have to use a standard read function, which is this + * function + */ + static ssize_t module_output(struct file *file, /* see include/linux/fs.h */ + char *buf, /* The buffer to put data to + (in the user segment) */ + size_t len, /* The length of the buffer */ + loff_t * offset) + { + static int finished = 0; + int i; + char message[MESSAGE_LENGTH + 30]; + + /* + * Return 0 to signify end of file - that we have nothing + * more to say at this point. + */ + if (finished) { + finished = 0; + return 0; + } + + /* + * If you don't understand this by now, you're hopeless as a kernel + * programmer. + */ + sprintf(message, "Last input:%s\n", Message); + for (i = 0; i < len && message[i]; i++) + put_user(message[i], buf + i); + + finished = 1; + return i; /* Return the number of bytes "read" */ + } + + /* + * This function receives input from the user when the user writes to the /proc + * file. + */ + static ssize_t module_input(struct file *file, /* The file itself */ + const char *buf, /* The buffer with input */ + size_t length, /* The buffer's length */ + loff_t * offset) /* offset to file - ignore */ + { + int i; + + /* + * Put the input into Message, where module_output will later be + * able to use it + */ + for (i = 0; i < MESSAGE_LENGTH - 1 && i < length; i++) + get_user(Message[i], buf + i); + /* + * we want a standard, zero terminated string + */ + Message[i] = '\0'; + + /* + * We need to return the number of input characters used + */ + return i; + } + + /* + * 1 if the file is currently open by somebody + */ + int Already_Open = 0; + + /* + * Queue of processes who want our file + */ + DECLARE_WAIT_QUEUE_HEAD(WaitQ); + /* + * Called when the /proc file is opened + */ + static int module_open(struct inode *inode, struct file *file) + { + /* + * If the file's flags include O_NONBLOCK, it means the process doesn't + * want to wait for the file. 
In this case, if the file is already + * open, we should fail with -EAGAIN, meaning "you'll have to try + * again", instead of blocking a process which would rather stay awake. + */ + if ((file->f_flags & O_NONBLOCK) && Already_Open) + return -EAGAIN; + + /* + * This is the correct place for try_module_get(THIS_MODULE) because + * if a process is in the loop, which is within the kernel module, + * the kernel module must not be removed. + */ + try_module_get(THIS_MODULE); + + /* + * If the file is already open, wait until it isn't + */ + + while (Already_Open) { + int i, is_sig = 0; + + /* + * This function puts the current process, including any system + * calls, such as us, to sleep. Execution will be resumed right + * after the function call, either because somebody called + * wake_up(&WaitQ) (only module_close does that, when the file + * is closed) or when a signal, such as Ctrl-C, is sent + * to the process + */ + wait_event_interruptible(WaitQ, !Already_Open); + + /* + * If we woke up because we got a signal we're not blocking, + * return -EINTR (fail the system call). This allows processes + * to be killed or stopped. + */ + + /* + * Emmanuel Papirakis: + * + * This is a little update to work with 2.2.*. Signals now are contained in + * two words (64 bits) and are stored in a structure that contains an array of + * two unsigned longs. We now have to make 2 checks in our if. + * + * Ori Pomerantz: + * + * Nobody promised me they'll never use more than 64 bits, or that this book + * won't be used for a version of Linux with a word size of 16 bits. This code + * would work in any case. + */ + for (i = 0; i < _NSIG_WORDS && !is_sig; i++) + is_sig = + current->pending.signal.sig[i] & ~current-> + blocked.sig[i]; + + if (is_sig) { + /* + * It's important to put module_put(THIS_MODULE) here, + * because for processes where the open is interrupted + * there will never be a corresponding close. 
If we + * don't decrement the usage count here, we will be + * left with a positive usage count which we'll have no + * way to bring down to zero, giving us an immortal + * module, which can only be killed by rebooting + * the machine. + */ + module_put(THIS_MODULE); + return -EINTR; + } + } + + /* + * If we got here, Already_Open must be zero + */ + + /* + * Open the file + */ + Already_Open = 1; + return 0; /* Allow the access */ + } + + /* + * Called when the /proc file is closed + */ + int module_close(struct inode *inode, struct file *file) + { + /* + * Set Already_Open to zero, so one of the processes in the WaitQ will + * be able to set Already_Open back to one and to open the file. All + * the other processes will be called when Already_Open is back to one, + * so they'll go back to sleep. + */ + Already_Open = 0; + + /* + * Wake up all the processes in WaitQ, so if anybody is waiting for the + * file, they can have it. + */ + wake_up(&WaitQ); + + module_put(THIS_MODULE); + + return 0; /* success */ + } + + /* + * Structures to register as the /proc file, with pointers to all the relevant + * functions. + */ + + /* + * File operations for our proc file. This is where we place pointers to all + * the functions called when somebody tries to do something to our file. NULL + * means we don't want to deal with something. 
+ */ + static struct file_operations File_Ops_4_Our_Proc_File = { + .read = module_output, /* "read" from the file */ + .write = module_input, /* "write" to the file */ + .open = module_open, /* called when the /proc file is opened */ + .release = module_close, /* called when it's closed */ + }; + + /* + * Module initialization and cleanup + */ + + /* + * Initialize the module - register the proc file + */ + + int init_module() + { + Our_Proc_File = proc_create(PROC_ENTRY_FILENAME, 0644, NULL, &File_Ops_4_Our_Proc_File); + if(Our_Proc_File == NULL) + { + /* Nothing was registered, so there is no entry to remove */ + printk(KERN_DEBUG "Error: Could not initialize /proc/%s\n", PROC_ENTRY_FILENAME); + return -ENOMEM; + } + proc_set_size(Our_Proc_File, 80); + proc_set_user(Our_Proc_File, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID); + + printk(KERN_INFO "/proc/%s created\n", PROC_ENTRY_FILENAME); + + return 0; + } + + /* + * Cleanup - unregister our file from /proc. This could get dangerous if + * there are still processes waiting in WaitQ, because they are inside our + * open function, which will get unloaded. I'll explain how to avoid removal + * of a kernel module in such a case in chapter 10.
+ */ + void cleanup_module() + { + remove_proc_entry(PROC_ENTRY_FILENAME, NULL); + printk(KERN_DEBUG "/proc/%s removed\n", PROC_ENTRY_FILENAME); + } +::: + +::: {.org-src-container} + /* cat_noblock.c - open a file and display its contents, but exit rather than + * wait for input */ + /* Copyright (C) 1998 by Ori Pomerantz */ + + #include <stdio.h> /* standard I/O */ + #include <fcntl.h> /* for open */ + #include <unistd.h> /* for read */ + #include <stdlib.h> /* for exit */ + #include <errno.h> /* for errno */ + + #define MAX_BYTES (1024 * 4) + + + int main(int argc, char *argv[]) + { + int fd; /* The file descriptor for the file to read */ + ssize_t bytes; /* The number of bytes read, or -1 on error */ + char buffer[MAX_BYTES]; /* The buffer for the bytes */ + + + /* Usage */ + if (argc != 2) { + printf("Usage: %s <filename>\n", argv[0]); + puts("Reads the content of a file, but doesn't wait for input"); + exit(-1); + } + + /* Open the file for reading in non blocking mode */ + fd = open(argv[1], O_RDONLY | O_NONBLOCK); + + /* If open failed */ + if (fd == -1) { + if (errno == EAGAIN) + puts("Open would block"); + else + puts("Open failed"); + exit(-1); + } + + /* Read the file and output its contents */ + do { + int i; + + /* Read characters from the file */ + bytes = read(fd, buffer, MAX_BYTES); + + /* If there's an error, report it and die */ + if (bytes == -1) { + if (errno == EAGAIN) + puts("Normally I'd block, but you told me not to"); + else + puts("Another read error"); + exit(-1); + } + + /* Print the characters */ + if (bytes > 0) { + for (i = 0; i < bytes; i++) + putchar(buffer[i]); + } + + /* While there are no errors and the file isn't over */ + } while (bytes > 0); + return 0; + } +::: +::: +::: + +::: {#outline-container-org89cb410 .outline-3} +### Completions {#org89cb410} + +::: {#text-org89cb410 .outline-text-3} +Sometimes one thing should happen before another within a module having +multiple threads. Rather than using a wait queue as in the **/proc/sleep** example, the kernel +offers completions, which also allow timeouts or interrupts to +happen. + +In the following example two threads are started, but one needs to start +before another.
+ +::: {.org-src-container} + #include <linux/init.h> + #include <linux/module.h> + #include <linux/kernel.h> + #include <linux/kthread.h> + #include <linux/completion.h> + + static struct { + struct completion crank_comp; + struct completion flywheel_comp; + } machine; + + static int machine_crank_thread(void* arg) + { + printk("Turn the crank\n"); + + complete_all(&machine.crank_comp); + complete_and_exit(&machine.crank_comp, 0); + } + + static int machine_flywheel_spinup_thread(void* arg) + { + wait_for_completion(&machine.crank_comp); + + printk("Flywheel spins up\n"); + + complete_all(&machine.flywheel_comp); + complete_and_exit(&machine.flywheel_comp, 0); + } + + static int completions_init(void) + { + struct task_struct* crank_thread; + struct task_struct* flywheel_thread; + + printk("completions example\n"); + + init_completion(&machine.crank_comp); + init_completion(&machine.flywheel_comp); + + crank_thread = + kthread_create(machine_crank_thread, + NULL, "KThread Crank"); + if (IS_ERR(crank_thread)) + goto ERROR_THREAD_1; + + flywheel_thread = + kthread_create(machine_flywheel_spinup_thread, + NULL, "KThread Flywheel"); + if (IS_ERR(flywheel_thread)) + goto ERROR_THREAD_2; + + wake_up_process(flywheel_thread); + wake_up_process(crank_thread); + + return 0; + + ERROR_THREAD_2: + kthread_stop(crank_thread); + ERROR_THREAD_1: + + return -1; + } + + void completions_exit(void) + { + wait_for_completion(&machine.crank_comp); + wait_for_completion(&machine.flywheel_comp); + + printk("completions exit\n"); + } + + module_init(completions_init); + module_exit(completions_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Completions example"); + MODULE_LICENSE("GPL"); +::: + +The *machine* structure stores the completion states for the two +threads. At the exit point of each thread the respective completion +state is updated, and *wait\_for\_completion* is used by the flywheel +thread to ensure that it doesn\'t begin prematurely.
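
If you want to experiment with the ordering idea without loading a module, the same pattern can be approximated in user space. The following is only a sketch under stated assumptions: it uses POSIX threads rather than the kernel API, and every name in it (`completion_sketch`, `sketch_complete_all`, `sketch_wait`, `run_machine`) is made up for illustration and is not part of the kernel or of this guide's examples.

```c
#include <pthread.h>

/* Hypothetical user space stand-in for the kernel's struct completion. */
struct completion_sketch {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int done;
};

static struct completion_sketch crank_done = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0
};

static int step;          /* counts events in the order they happen */
static int crank_step;    /* position at which the crank was turned */
static int flywheel_step; /* position at which the flywheel spun up */

/* Rough analogue of complete_all(): mark as done and wake every waiter. */
static void sketch_complete_all(struct completion_sketch *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = 1;
    pthread_cond_broadcast(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Rough analogue of wait_for_completion(): block until marked as done. */
static void sketch_wait(struct completion_sketch *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done)
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

static void *crank(void *arg)
{
    (void)arg;
    crank_step = ++step;
    sketch_complete_all(&crank_done);
    return NULL;
}

static void *flywheel(void *arg)
{
    (void)arg;
    sketch_wait(&crank_done); /* do not begin before the crank */
    flywheel_step = ++step;
    return NULL;
}

/* Starts the flywheel first; returns 1 if the crank still ran first. */
int run_machine(void)
{
    pthread_t f, c;

    step = 0;
    crank_done.done = 0;
    pthread_create(&f, NULL, flywheel, NULL);
    pthread_create(&c, NULL, crank, NULL);
    pthread_join(f, NULL);
    pthread_join(c, NULL);
    return crank_step == 1 && flywheel_step == 2;
}
```

Calling `run_machine()` from a small `main()` should return 1 on every run, because the flywheel thread cannot pass `sketch_wait` until the crank thread has signalled.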
+ +So even though *flywheel\_thread* is started first you should notice if +you load this module and run *dmesg* that turning the crank always +happens first because the flywheel thread waits for it to complete. + +There are other variations upon the *wait\_for\_completion* function, +which include timeouts or being interrupted, but this basic mechanism is +enough for many common situations without adding a lot of complexity. +::: +::: +::: + +::: {#outline-container-org949949f .outline-2} +Avoiding Collisions and Deadlocks {#org949949f} +--------------------------------- + +::: {#text-org949949f .outline-text-2} +If processes running on different CPUs or in different threads try to +access the same memory then it\'s possible that strange things can +happen or your system can lock up. To avoid this various types of mutual +exclusion kernel functions are available. These indicate if a section of +code is \"locked\" or \"unlocked\" so that simultaneous attempts to run +it can\'t happen. +::: + +::: {#outline-container-org10f05c2 .outline-3} +### Mutex {#org10f05c2} + +::: {#text-org10f05c2 .outline-text-3} +You can use kernel mutexes (mutual exclusions) in much the same manner +that you might deploy them in userland. This may be all that\'s needed +to avoid collisions in most cases. 
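
For comparison, here is a minimal userland sketch of the same idea using POSIX threads. It is not kernel code, and the names `counter_lock`, `worker`, `run_workers` and `try_once` are invented for this illustration.

```c
#include <pthread.h>

/* Hypothetical names for this sketch; none of them come from the guide. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

/* Each worker increments the shared counter while holding the lock. */
static void *worker(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < 100000; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs two workers concurrently and returns the final counter value. */
long run_workers(void)
{
    pthread_t a, b;

    counter = 0;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;
}

/* Tries to take the lock without blocking; returns 0 on success. */
int try_once(void)
{
    int rc = pthread_mutex_trylock(&counter_lock);

    if (rc == 0)
        pthread_mutex_unlock(&counter_lock);
    return rc;
}
```

One difference is worth keeping in mind when moving between the two worlds: `pthread_mutex_trylock` returns 0 when it obtains the lock, whereas the kernel's `mutex_trylock` returns 1 on success and 0 if the mutex is already held.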
+ +::: {.org-src-container} + #include <linux/kernel.h> + #include <linux/module.h> + #include <linux/init.h> + #include <linux/mutex.h> + + DEFINE_MUTEX(mymutex); + + static int example_mutex_init(void) + { + int ret; + + printk("example_mutex init\n"); + + ret = mutex_trylock(&mymutex); + if (ret != 0) { + printk("mutex is locked\n"); + + if (mutex_is_locked(&mymutex) == 0) + printk("The mutex failed to lock!\n"); + + mutex_unlock(&mymutex); + printk("mutex is unlocked\n"); + } + else + printk("Failed to lock\n"); + + return 0; + } + + static void example_mutex_exit(void) + { + printk("example_mutex exit\n"); + } + + module_init(example_mutex_init); + module_exit(example_mutex_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Mutex example"); + MODULE_LICENSE("GPL"); +::: +::: +::: + +::: {#outline-container-org5d633fc .outline-3} +### Spinlocks {#org5d633fc} + +::: {#text-org5d633fc .outline-text-3} +As the name suggests, spinlocks lock up the CPU that the code is running +on, taking 100% of its resources. Because of this you should only use +the spinlock mechanism around code which is likely to take no more than +a few milliseconds to run and so won\'t noticeably slow anything down +from the user\'s point of view. + +The example here is *\"irq safe\"* in that if interrupts happen during +the lock then they won\'t be forgotten and will activate when the unlock +happens, using the *flags* variable to retain their state. + +::: {.org-src-container} + #include <linux/kernel.h> + #include <linux/module.h> + #include <linux/init.h> + #include <linux/spinlock.h> + #include <linux/interrupt.h> + + DEFINE_SPINLOCK(sl_static); + spinlock_t sl_dynamic; + + static void example_spinlock_static(void) + { + unsigned long flags; + + spin_lock_irqsave(&sl_static, flags); + printk("Locked static spinlock\n"); + + /* Do something or other safely. 
+ Because this uses 100% CPU time this + code should take no more than a few + milliseconds to run */ + + spin_unlock_irqrestore(&sl_static, flags); + printk("Unlocked static spinlock\n"); + } + + static void example_spinlock_dynamic(void) + { + unsigned long flags; + + spin_lock_init(&sl_dynamic); + spin_lock_irqsave(&sl_dynamic, flags); + printk("Locked dynamic spinlock\n"); + + /* Do something or other safely. + Because this uses 100% CPU time this + code should take no more than a few + milliseconds to run */ + + spin_unlock_irqrestore(&sl_dynamic, flags); + printk("Unlocked dynamic spinlock\n"); + } + + static int example_spinlock_init(void) + { + printk("example spinlock started\n"); + + example_spinlock_static(); + example_spinlock_dynamic(); + + return 0; + } + + static void example_spinlock_exit(void) + { + printk("example spinlock exit\n"); + } + + module_init(example_spinlock_init); + module_exit(example_spinlock_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Spinlock example"); + MODULE_LICENSE("GPL"); +::: +::: +::: + +::: {#outline-container-orgaa517c3 .outline-3} +### Read and write locks {#orgaa517c3} + +::: {#text-orgaa517c3 .outline-text-3} +Read and write locks are specialised kinds of spinlocks so that you can +exclusively read from something or write to something. Like the earlier +spinlocks example the one below shows an \"irq safe\" situation in which +if other functions were triggered from irqs which might also read and +write to whatever you are concerned with then they wouldn\'t disrupt the +logic. As before it\'s a good idea to keep anything done within the lock +as short as possible so that it doesn\'t hang up the system and cause +users to start revolting against the tyranny of your module. 
+ +::: {.org-src-container} + #include + #include + #include + + DEFINE_RWLOCK(myrwlock); + + static void example_read_lock(void) + { + unsigned long flags; + + read_lock_irqsave(&myrwlock, flags); + printk("Read Locked\n"); + + /* Read from something */ + + read_unlock_irqrestore(&myrwlock, flags); + printk("Read Unlocked\n"); + } + + static void example_write_lock(void) + { + unsigned long flags; + + write_lock_irqsave(&myrwlock, flags); + printk("Write Locked\n"); + + /* Write to something */ + + write_unlock_irqrestore(&myrwlock, flags); + printk("Write Unlocked\n"); + } + + static int example_rwlock_init(void) + { + printk("example_rwlock started\n"); + + example_read_lock(); + example_write_lock(); + + return 0; + } + + static void example_rwlock_exit(void) + { + printk("example_rwlock exit\n"); + } + + module_init(example_rwlock_init); + module_exit(example_rwlock_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Read/Write locks example"); + MODULE_LICENSE("GPL"); +::: + +Of course if you know for sure that there are no functions triggered by +irqs which could possibly interfere with your logic then you can use the +simpler *read\_lock(&myrwlock)* and *read\_unlock(&myrwlock)* or the +corresponding write functions. +::: +::: + +::: {#outline-container-orgadbf448 .outline-3} +### Atomic operations {#orgadbf448} + +::: {#text-orgadbf448 .outline-text-3} +If you\'re doing simple arithmetic: adding, subtracting or bitwise +operations then there\'s another way in the multi-CPU and +multi-hyperthreaded world to stop other parts of the system from messing +with your mojo. By using atomic operations you can be confident that +your addition, subtraction or bit flip did actually happen and wasn\'t +overwritten by some other shenanigans. An example is shown below. + +::: {.org-src-container} + #include + #include + #include + + #define BYTE_TO_BINARY_PATTERN "%c%c%c%c%c%c%c%c" + #define BYTE_TO_BINARY(byte) \ + (byte & 0x80 ? 
'1' : '0'), \ + (byte & 0x40 ? '1' : '0'), \ + (byte & 0x20 ? '1' : '0'), \ + (byte & 0x10 ? '1' : '0'), \ + (byte & 0x08 ? '1' : '0'), \ + (byte & 0x04 ? '1' : '0'), \ + (byte & 0x02 ? '1' : '0'), \ + (byte & 0x01 ? '1' : '0') + + static void atomic_add_subtract(void) + { + atomic_t debbie; + atomic_t chris = ATOMIC_INIT(50); + + atomic_set(&debbie, 45); + + /* subtract one */ + atomic_dec(&debbie); + + atomic_add(7, &debbie); + + /* add one */ + atomic_inc(&debbie); + + printk("chris: %d, debbie: %d\n", + atomic_read(&chris), atomic_read(&debbie)); + } + + static void atomic_bitwise(void) + { + unsigned long word = 0; + + printk("Bits 0: "BYTE_TO_BINARY_PATTERN, BYTE_TO_BINARY(word)); + set_bit(3, &word); + set_bit(5, &word); + printk("Bits 1: "BYTE_TO_BINARY_PATTERN, BYTE_TO_BINARY(word)); + clear_bit(5, &word); + printk("Bits 2: "BYTE_TO_BINARY_PATTERN, BYTE_TO_BINARY(word)); + change_bit(3, &word); + + printk("Bits 3: "BYTE_TO_BINARY_PATTERN, BYTE_TO_BINARY(word)); + if (test_and_set_bit(3, &word)) + printk("wrong\n"); + printk("Bits 4: "BYTE_TO_BINARY_PATTERN, BYTE_TO_BINARY(word)); + + word = 255; + printk("Bits 5: "BYTE_TO_BINARY_PATTERN, BYTE_TO_BINARY(word)); + } + + static int example_atomic_init(void) + { + printk("example_atomic started\n"); + + atomic_add_subtract(); + atomic_bitwise(); + + return 0; + } + + static void example_atomic_exit(void) + { + printk("example_atomic exit\n"); + } + + module_init(example_atomic_init); + module_exit(example_atomic_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Atomic operations example"); + MODULE_LICENSE("GPL"); +::: +::: +::: +::: + +::: {#outline-container-org7974c60 .outline-2} +Replacing Printks {#org7974c60} +----------------- + +::: {#text-org7974c60 .outline-text-2} +::: + +::: {#outline-container-org1c8b17b .outline-3} +### Replacing printk {#org1c8b17b} + +::: {#text-org1c8b17b .outline-text-3} +In Section 1.2.1.2, I said that X and kernel module programming don\'t +mix. 
That\'s true for developing kernel modules, but in actual use, you +want to be able to send messages to whichever tty the command to load +the module came from. + +\"tty\" is an abbreviation of *teletype*: originally a combination +keyboard-printer used to communicate with a Unix system, and today an +abstraction for the text stream used for a Unix program, whether it\'s a +physical terminal, an xterm on an X display, a network connection used +with ssh, etc. + +The way this is done is by using current, a pointer to the currently +running task, to get the current task\'s tty structure. Then, we look +inside that tty structure to find a pointer to a string write function, +which we use to write a string to the tty. + +::: {.org-src-container} + /* + * print_string.c - Send output to the tty we're running on, regardless if it's + * through X11, telnet, etc. We do this by printing the string to the tty + * associated with the current task. + */ + #include <linux/kernel.h> + #include <linux/module.h> + #include <linux/init.h> + #include <linux/sched.h> /* For current */ + #include <linux/tty.h> /* For the tty declarations */ + #include <linux/version.h> /* For LINUX_VERSION_CODE */ + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Peter Jay Salzman"); + + static void print_string(char *str) + { + struct tty_struct *my_tty; + const struct tty_operations *ttyops; + + /* + * tty struct went into signal struct in 2.6.6 + */ + #if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,5) ) + /* + * The tty for the current task + */ + my_tty = current->tty; + #else + /* + * The tty for the current task, for 2.6.6+ kernels + */ + my_tty = get_current_tty(); + #endif + + /* + * If my_tty is NULL, the current task has no tty you can print to + * (ie, if it's a daemon). If so, there's nothing we can do. + */ + if (my_tty != NULL) { + /* Only dereference my_tty once we know it is not NULL */ + ttyops = my_tty->driver->ops; + + /* + * my_tty->driver is a struct which holds the tty's functions, + * one of which (write) is used to write strings to the tty. + * It can be used to take a string either from the user's or + * kernel's memory segment. 
+ * + * The function's 1st parameter is the tty to write to, + * because the same function would normally be used for all + * tty's of a certain type. The 2nd parameter controls + * whether the function receives a string from kernel + * memory (false, 0) or from user memory (true, non zero). + * BTW: this param has been removed in Kernels > 2.6.9 + * The (2nd) 3rd parameter is a pointer to a string. + * The (3rd) 4th parameter is the length of the string. + * + * As you will see below, sometimes it's necessary to use + * preprocessor stuff to create code that works for different + * kernel versions. The (naive) approach we've taken here + * does not scale well. The right way to deal with this + * is described in section 2 of + * linux/Documentation/SubmittingPatches + */ + (ttyops->write) (my_tty, /* The tty itself */ + #if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,9) ) + 0, /* Don't take the string + from user space */ + #endif + str, /* String */ + strlen(str)); /* Length */ + + /* + * ttys were originally hardware devices, which (usually) + * strictly followed the ASCII standard. In ASCII, to move to + * a new line you need two characters, a carriage return and a + * line feed. On Unix, the ASCII line feed is used for both + * purposes - so we can't just use \n, because it wouldn't have + * a carriage return and the next line will start at the + * column right after the line feed. + * + * This is why text files are different between Unix and + * MS Windows. In CP/M and derivatives, like MS-DOS and + * MS Windows, the ASCII standard was strictly adhered to, + * and therefore a newline requires both a LF and a CR. + */ + + #if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,9) ) + (ttyops->write) (my_tty, 0, "\015\012", 2); + #else + (ttyops->write) (my_tty, "\015\012", 2); + #endif + } + } + + static int __init print_string_init(void) + { + print_string("The module has been inserted. 
Hello world!"); + return 0; + } + + static void __exit print_string_exit(void) + { + print_string("The module has been removed. Farewell world!"); + } + + module_init(print_string_init); + module_exit(print_string_exit); +::: +::: +::: + +::: {#outline-container-org418d823 .outline-3} +### Flashing keyboard LEDs {#org418d823} + +::: {#text-org418d823 .outline-text-3} +In certain conditions, you may desire a simpler and more direct way to +communicate to the external world. Flashing keyboard LEDs can be such a +solution: It is an immediate way to attract attention or to display a +status condition. Keyboard LEDs are present on every hardware, they are +always visible, they do not need any setup, and their use is rather +simple and non-intrusive, compared to writing to a tty or a file. + +The following source code illustrates a minimal kernel module which, +when loaded, starts blinking the keyboard LEDs until it is unloaded. + +::: {.org-src-container} + /* + * kbleds.c - Blink keyboard leds until the module is unloaded. + */ + + #include + #include + #include /* for fg_console */ + #include /* For fg_console, MAX_NR_CONSOLES */ + #include /* For KDSETLED */ + #include + #include /* For vc_cons */ + + MODULE_DESCRIPTION("Example module illustrating the use of Keyboard LEDs."); + MODULE_AUTHOR("Daniele Paolo Scarpazza"); + MODULE_LICENSE("GPL"); + + struct timer_list my_timer; + struct tty_driver *my_driver; + char kbledstatus = 0; + + #define BLINK_DELAY HZ/5 + #define ALL_LEDS_ON 0x07 + #define RESTORE_LEDS 0xFF + + /* + * Function my_timer_func blinks the keyboard LEDs periodically by invoking + * command KDSETLED of ioctl() on the keyboard driver. To learn more on virtual + * terminal ioctl operations, please see file: + * /usr/src/linux/drivers/char/vt_ioctl.c, function vt_ioctl(). 
+ * + * The argument to KDSETLED is alternatively set to 7 (thus causing the led + * mode to be set to LED_SHOW_IOCTL, and all the leds are lit) and to 0xFF + * (any value above 7 switches back the led mode to LED_SHOW_FLAGS, thus + * the LEDs reflect the actual keyboard status). To learn more on this, + * please see file: + * /usr/src/linux/drivers/char/keyboard.c, function setledstate(). + * + */ + + static void my_timer_func(unsigned long ptr) + { + unsigned long *pstatus = (unsigned long *)ptr; + struct tty_struct* t = vc_cons[fg_console].d->port.tty; + + if (*pstatus == ALL_LEDS_ON) + *pstatus = RESTORE_LEDS; + else + *pstatus = ALL_LEDS_ON; + + (my_driver->ops->ioctl) (t, KDSETLED, *pstatus); + + my_timer.expires = jiffies + BLINK_DELAY; + add_timer(&my_timer); + } + + static int __init kbleds_init(void) + { + int i; + + printk(KERN_INFO "kbleds: loading\n"); + printk(KERN_INFO "kbleds: fgconsole is %x\n", fg_console); + for (i = 0; i < MAX_NR_CONSOLES; i++) { + if (!vc_cons[i].d) + break; + printk(KERN_INFO "poet_atkm: console[%i/%i] #%i, tty %lx\n", i, + MAX_NR_CONSOLES, vc_cons[i].d->vc_num, + (unsigned long)vc_cons[i].d->port.tty); + } + printk(KERN_INFO "kbleds: finished scanning consoles\n"); + + my_driver = vc_cons[fg_console].d->port.tty->driver; + printk(KERN_INFO "kbleds: tty driver magic %x\n", my_driver->magic); + + /* + * Set up the LED blink timer the first time + */ + init_timer(&my_timer); + my_timer.function = my_timer_func; + my_timer.data = (unsigned long)&kbledstatus; + my_timer.expires = jiffies + BLINK_DELAY; + add_timer(&my_timer); + + return 0; + } + + static void __exit kbleds_cleanup(void) + { + printk(KERN_INFO "kbleds: unloading...\n"); + del_timer(&my_timer); + (my_driver->ops->ioctl) (vc_cons[fg_console].d->port.tty, + KDSETLED, RESTORE_LEDS); + } + + module_init(kbleds_init); + module_exit(kbleds_cleanup); +::: + +If none of the examples in this chapter fit your debugging needs there +might yet be some other tricks to try. 
Ever wondered what +CONFIG\_LL\_DEBUG in make menuconfig is good for? If you activate that +you get low level access to the serial port. While this might not sound +very powerful by itself, you can patch kernel/printk.c or any other +essential syscall to use printascii, thus making it possible to trace +virtually everything that your code does over a serial line. If you find +yourself porting the kernel to some new and formerly unsupported +architecture this is usually amongst the first things that should be +implemented. Logging over a netconsole might also be worth a try. + +While you have seen lots of stuff that can be used to aid debugging +here, there are some things to be aware of. Debugging is almost always +intrusive. Adding debug code can change the situation enough to make the +bug seem to disappear. Thus you should try to keep debug code to a +minimum and make sure it does not show up in production code. +::: +::: +::: + +::: {#outline-container-orgf37d73f .outline-2} +Scheduling Tasks {#orgf37d73f} +---------------- + +::: {#text-orgf37d73f .outline-text-2} +There are two main ways of running tasks: tasklets and work queues. +Tasklets are a quick and easy way of scheduling a single function to be +run, for example when triggered from an interrupt, whereas work queues +are more complicated but also better suited to running multiple things +in a sequence. +::: + +::: {#outline-container-org32525a8 .outline-3} +### Tasklets {#org32525a8} + +::: {#text-org32525a8 .outline-text-3} +Here\'s an example tasklet module. The *tasklet\_fn* function runs for a +few seconds and in the meantime execution of the +*example\_tasklet\_init* function continues to the exit point. 
+ +::: {.org-src-container} + #include + #include + #include + #include + + static void tasklet_fn(unsigned long data) + { + printk("Example tasklet starts\n"); + mdelay(5000); + printk("Example tasklet ends\n"); + } + + DECLARE_TASKLET(mytask, tasklet_fn, 0L); + + static int example_tasklet_init(void) + { + printk("tasklet example init\n"); + tasklet_schedule(&mytask); + mdelay(200); + printk("Example tasklet init continues...\n"); + return 0; + } + + static void example_tasklet_exit(void) + { + printk("tasklet example exit\n"); + tasklet_kill(&mytask); + } + + module_init(example_tasklet_init); + module_exit(example_tasklet_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Tasklet example"); + MODULE_LICENSE("GPL"); +::: + +So with this example loaded *dmesg* should show: + +::: {.org-src-container} + tasklet example init + Example tasklet starts + Example tasklet init continues... + Example tasklet ends +::: +::: +::: + +::: {#outline-container-orge8a2d87 .outline-3} +### Work queues {#orge8a2d87} + +::: {#text-orge8a2d87 .outline-text-3} +Very often, we have \"housekeeping\" tasks which have to be done at a +certain time, or every so often. If the task is to be done by a process, +we do it by putting it in the crontab file. If the task is to be done by +a kernel module, we have two possibilities. The first is to put a +process in the crontab file which will wake up the module by a system +call when necessary, for example by opening a file. This is terribly +inefficient, however -- we run a new process off of crontab, read a new +executable to memory, and all this just to wake up a kernel module which +is in memory anyway. + +Instead of doing that, we can create a function that will be called once +for every timer interrupt. The way we do this is we create a task, held +in a workqueue\_struct structure, which will hold a pointer to the +function. 
Then, we use queue\_delayed\_work to put that task on a task +list called my\_workqueue , which is the list of tasks to be executed on +the next timer interrupt. Because we want the function to keep on being +executed, we need to put it back on my\_workqueue whenever it is called, +for the next timer interrupt. + +There\'s one more point we need to remember here. When a module is +removed by rmmod, first its reference count is checked. If it is zero, +module\_cleanup is called. Then, the module is removed from memory with +all its functions. Things need to be shut down properly, or bad things +will happen. See the code below how this can be done in a safe way. + +::: {.org-src-container} + /* + * sched.c - schedule a function to be called on every timer interrupt. + * + * Copyright (C) 2001 by Peter Jay Salzman + */ + + /* + * The necessary header files + */ + + /* + * Standard in kernel modules + */ + #include /* We're doing kernel work */ + #include /* Specifically, a module */ + #include /* Necessary because we use the proc fs */ + #include /* We schedule tasks here */ + #include /* We need to put ourselves to sleep + and wake up later */ + #include /* For __init and __exit */ + #include /* For irqreturn_t */ + + struct proc_dir_entry *Our_Proc_File; + #define PROC_ENTRY_FILENAME "sched" + #define MY_WORK_QUEUE_NAME "WQsched.c" + + /* + * some work_queue related functions + * are just available to GPL licensed Modules + */ + MODULE_LICENSE("GPL"); + + /* + * The number of times the timer interrupt has been called so far + */ + static int TimerIntrpt = 0; + + static void intrpt_routine(struct work_struct *work); + + static int die = 0; /* set this to 1 for shutdown */ + + /* + * The work queue structure for this task, from workqueue.h + */ + static struct workqueue_struct *my_workqueue; + + static struct delayed_work Task; + static DECLARE_DELAYED_WORK(Task, intrpt_routine); + + /* + * This function will be called on every timer interrupt. 
Notice the void* + * pointer - task functions can be used for more than one purpose, each time + * getting a different parameter. + */ + static void intrpt_routine(struct work_struct *work) + { + /* + * Increment the counter + */ + TimerIntrpt++; + + /* + * If cleanup wants us to die + */ + if (die == 0) + queue_delayed_work(my_workqueue, &Task, 100); + } + + /* + * Put data into the proc fs file. + */ + int + procfile_read(char *buffer, + char **buffer_location, + off_t offset, int buffer_length, int *eof, void *data) + { + int len; /* The number of bytes actually used */ + + /* + * It's static so it will still be in memory + * when we leave this function + */ + static char my_buffer[80]; + + /* + * We give all of our information in one go, so if anybody asks us + * if we have more information the answer should always be no. + */ + if (offset > 0) + return 0; + + /* + * Fill the buffer and get its length + */ + len = sprintf(my_buffer, "Timer called %d times so far\n", TimerIntrpt); + + /* + * Tell the function which called us where the buffer is + */ + *buffer_location = my_buffer; + + /* + * Return the length + */ + return len; + } + + /* + * Initialize the module - register the proc file + */ + int __init init_module() + { + /* + * Create our /proc file + */ + Our_Proc_File = proc_create(PROC_ENTRY_FILENAME, 0644, NULL, NULL); + + if (Our_Proc_File == NULL) { + remove_proc_entry(PROC_ENTRY_FILENAME, NULL); + printk(KERN_ALERT "Error: Could not initialize /proc/%s\n", + PROC_ENTRY_FILENAME); + return -ENOMEM; + } + proc_set_size(Our_Proc_File, 80); + proc_set_user(Our_Proc_File, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID); + + /* + * Put the task in the work_timer task queue, so it will be executed at + * next timer interrupt + */ + my_workqueue = create_workqueue(MY_WORK_QUEUE_NAME); + queue_delayed_work(my_workqueue, &Task, 100); + + printk(KERN_INFO "/proc/%s created\n", PROC_ENTRY_FILENAME); + + return 0; + } + + /* + * Cleanup + */ + void __exit cleanup_module() + { 
+ /* + * Unregister our /proc file + */ + remove_proc_entry(PROC_ENTRY_FILENAME, NULL); + printk(KERN_INFO "/proc/%s removed\n", PROC_ENTRY_FILENAME); + + die = 1; /* keep intrp_routine from queueing itself */ + cancel_delayed_work(&Task); /* no "new ones" */ + flush_workqueue(my_workqueue); /* wait till all "old ones" finished */ + destroy_workqueue(my_workqueue); + + /* + * Sleep until intrpt_routine is called one last time. This is + * necessary, because otherwise we'll deallocate the memory holding + * intrpt_routine and Task while work_timer still references them. + * Notice that here we don't allow signals to interrupt us. + * + * Since WaitQ is now not NULL, this automatically tells the interrupt + * routine it's time to die. + */ + } +::: +::: +::: +::: + +::: {#outline-container-orgbc0cdf8 .outline-2} +Interrupt Handlers {#orgbc0cdf8} +------------------ + +::: {#text-orgbc0cdf8 .outline-text-2} +::: + +::: {#outline-container-org93511bb .outline-3} +### Interrupt Handlers {#org93511bb} + +::: {#text-org93511bb .outline-text-3} +Except for the last chapter, everything we did in the kernel so far +we\'ve done as a response to a process asking for it, either by dealing +with a special file, sending an ioctl(), or issuing a system call. But +the job of the kernel isn\'t just to respond to process requests. +Another job, which is every bit as important, is to speak to the +hardware connected to the machine. + +There are two types of interaction between the CPU and the rest of the +computer\'s hardware. The first type is when the CPU gives orders to the +hardware, the other is when the hardware needs to tell the CPU +something. The second, called interrupts, is much harder to implement +because it has to be dealt with when convenient for the hardware, not +the CPU. Hardware devices typically have a very small amount of RAM, and +if you don\'t read their information when available, it is lost. 
+ +Under Linux, hardware interrupts are called IRQ\'s (Interrupt ReQuests). +There are two types of IRQ\'s, short and long. A short IRQ is one which +is expected to take a very short period of time, during which the rest +of the machine will be blocked and no other interrupts will be handled. +A long IRQ is one which can take longer, and during which other +interrupts may occur (but not interrupts from the same device). If at +all possible, it\'s better to declare an interrupt handler to be long. + +When the CPU receives an interrupt, it stops whatever it\'s doing +(unless it\'s processing a more important interrupt, in which case it +will deal with this one only when the more important one is done), saves +certain parameters on the stack and calls the interrupt handler. This +means that certain things are not allowed in the interrupt handler +itself, because the system is in an unknown state. The solution to this +problem is for the interrupt handler to do what needs to be done +immediately, usually read something from the hardware or send something +to the hardware, and then schedule the handling of the new information +at a later time (this is called the \"bottom half\") and return. The +kernel is then guaranteed to call the bottom half as soon as possible -- +and when it does, everything allowed in kernel modules will be allowed. + +The way to implement this is to call **request\_irq()** to get your +interrupt handler called when the relevant IRQ is received. + +In practice IRQ handling can be a bit more complex. Hardware is often +designed in a way that chains two interrupt controllers, so that all the +IRQs from interrupt controller B are cascaded to a certain IRQ from +interrupt controller A. Of course that requires that the kernel finds +out which IRQ it really was afterwards and that adds overhead. Other +architectures offer some special, very low overhead, so called \"fast +IRQ\" or FIQs. 
To take advantage of them requires handlers to be written +in assembler, so they do not really fit into the kernel. They can be +made to work similarly to the others, but after that procedure, they\'re +no longer any faster than \"common\" IRQs. SMP enabled kernels running +on systems with more than one processor need to solve another truckload +of problems. It\'s not enough to know that a certain IRQ has happened; +it\'s also important to know which CPU(s) it was for. People still interested +in more details might want to do a web search for \"APIC\" now ;) + +The request\_irq() function receives the IRQ number, the name of the +handler function, flags, a name for /proc/interrupts and a parameter to +pass to the interrupt handler. Usually there is a certain number of IRQs +available. How many IRQs there are is hardware-dependent. The flags can +include IRQF\_SHARED (called SA\_SHIRQ in old kernels) to +indicate you\'re willing to share the IRQ with other interrupt handlers +(usually because a number of hardware devices sit on the same IRQ). Old +kernels also accepted SA\_INTERRUPT to indicate a fast interrupt, but that +flag no longer exists in current kernels. This function will +only succeed if there isn\'t already a handler on this IRQ, or if +you\'re both willing to share. +::: +::: + +::: {#outline-container-org77533ca .outline-3} +### Detecting button presses {#org77533ca} + +::: {#text-org77533ca .outline-text-3} +Many popular single board computers, such as Raspberry Pis or +Beagleboards, have a bunch of GPIO pins. Attaching buttons to those and +then having a button press do something is a classic case in which you +might need to use interrupts: instead of having the CPU waste +time and battery power polling for a change in input state, it\'s better +for the input to trigger the CPU to run a particular handling +function. + +Here\'s an example where buttons are connected to GPIO numbers 17 and 18 +and an LED is connected to GPIO 4. You can change those numbers to +whatever is appropriate for your board. 
+ +::: {.org-src-container} + /* + * intrpt.c - Handling GPIO with interrupts + * + * Copyright (C) 2017 by Bob Mottram + * Based upon the Rpi example by Stefan Wendler (devnull@kaltpost.de) + * from: + * https://github.com/wendlers/rpi-kmod-samples + * + * Press one button to turn on a LED and another to turn it off + */ + + #include + #include + #include + #include + + static int button_irqs[] = { -1, -1 }; + + /* Define GPIOs for LEDs. + Change the numbers for the GPIO on your board. */ + static struct gpio leds[] = { + { 4, GPIOF_OUT_INIT_LOW, "LED 1" } + }; + + /* Define GPIOs for BUTTONS + Change the numbers for the GPIO on your board. */ + static struct gpio buttons[] = { + { 17, GPIOF_IN, "LED 1 ON BUTTON" }, + { 18, GPIOF_IN, "LED 1 OFF BUTTON" } + }; + + /* + * interrupt function triggered when a button is pressed + */ + static irqreturn_t button_isr(int irq, void *data) + { + /* first button */ + if (irq == button_irqs[0] && !gpio_get_value(leds[0].gpio)) + gpio_set_value(leds[0].gpio, 1); + /* second button */ + else if(irq == button_irqs[1] && gpio_get_value(leds[0].gpio)) + gpio_set_value(leds[0].gpio, 0); + + return IRQ_HANDLED; + } + + int init_module() + { + int ret = 0; + + printk(KERN_INFO "%s\n", __func__); + + /* register LED gpios */ + ret = gpio_request_array(leds, ARRAY_SIZE(leds)); + + if (ret) { + printk(KERN_ERR "Unable to request GPIOs for LEDs: %d\n", ret); + return ret; + } + + /* register BUTTON gpios */ + ret = gpio_request_array(buttons, ARRAY_SIZE(buttons)); + + if (ret) { + printk(KERN_ERR "Unable to request GPIOs for BUTTONs: %d\n", ret); + goto fail1; + } + + printk(KERN_INFO "Current button1 value: %d\n", + gpio_get_value(buttons[0].gpio)); + + ret = gpio_to_irq(buttons[0].gpio); + + if (ret < 0) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + goto fail2; + } + + button_irqs[0] = ret; + + printk(KERN_INFO "Successfully requested BUTTON1 IRQ # %d\n", + button_irqs[0]); + + ret = request_irq(button_irqs[0], button_isr, 
+ IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING, + "gpiomod#button1", NULL); + + if (ret) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + goto fail2; + } + + + ret = gpio_to_irq(buttons[1].gpio); + + if (ret < 0) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + /* the first IRQ is already registered, so release it too */ + goto fail3; + } + + button_irqs[1] = ret; + + printk(KERN_INFO "Successfully requested BUTTON2 IRQ # %d\n", + button_irqs[1]); + + ret = request_irq(button_irqs[1], button_isr, + IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING, + "gpiomod#button2", NULL); + + if (ret) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + goto fail3; + } + + return 0; + + /* cleanup what has been setup so far */ + fail3: + free_irq(button_irqs[0], NULL); + + fail2: + gpio_free_array(buttons, ARRAY_SIZE(buttons)); + + fail1: + gpio_free_array(leds, ARRAY_SIZE(leds)); + + return ret; + } + + void cleanup_module() + { + int i; + + printk(KERN_INFO "%s\n", __func__); + + /* free irqs */ + free_irq(button_irqs[0], NULL); + free_irq(button_irqs[1], NULL); + + /* turn all LEDs off */ + for (i = 0; i < ARRAY_SIZE(leds); i++) + gpio_set_value(leds[i].gpio, 0); + + /* unregister */ + gpio_free_array(leds, ARRAY_SIZE(leds)); + gpio_free_array(buttons, ARRAY_SIZE(buttons)); + } + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Handle some GPIO interrupts"); +::: +::: +::: + +::: {#outline-container-orgdb452ba .outline-3} +### Bottom Half {#orgdb452ba} + +::: {#text-orgdb452ba .outline-text-3} +Suppose you want to do a bunch of stuff inside of an interrupt routine. +A common way to do that without rendering the interrupt unavailable for +a significant duration is to combine it with a tasklet. This pushes the +bulk of the work off into the scheduler. + +The example below modifies the previous example to also run an +additional task when an interrupt is triggered. 
+ +::: {.org-src-container} + /* + * bottomhalf.c - Top and bottom half interrupt handling + * + * Copyright (C) 2017 by Bob Mottram + * Based upon the Rpi example by Stefan Wendler (devnull@kaltpost.de) + * from: + * https://github.com/wendlers/rpi-kmod-samples + * + * Press one button to turn on a LED and another to turn it off + */ + + #include <linux/module.h> + #include <linux/kernel.h> + #include <linux/gpio.h> + #include <linux/delay.h> + #include <linux/interrupt.h> + + static int button_irqs[] = { -1, -1 }; + + /* Define GPIOs for LEDs. + Change the numbers for the GPIO on your board. */ + static struct gpio leds[] = { + { 4, GPIOF_OUT_INIT_LOW, "LED 1" } + }; + + /* Define GPIOs for BUTTONS + Change the numbers for the GPIO on your board. */ + static struct gpio buttons[] = { + { 17, GPIOF_IN, "LED 1 ON BUTTON" }, + { 18, GPIOF_IN, "LED 1 OFF BUTTON" } + }; + + /* Tasklet containing some non-trivial amount of processing */ + static void bottomhalf_tasklet_fn(unsigned long data) + { + printk("Bottom half tasklet starts\n"); + /* do something which takes a while */ + mdelay(500); + printk("Bottom half tasklet ends\n"); + } + + DECLARE_TASKLET(buttontask, bottomhalf_tasklet_fn, 0L); + + /* + * interrupt function triggered when a button is pressed + */ + static irqreturn_t button_isr(int irq, void *data) + { + /* Do something quickly right now */ + if (irq == button_irqs[0] && !gpio_get_value(leds[0].gpio)) + gpio_set_value(leds[0].gpio, 1); + else if(irq == button_irqs[1] && gpio_get_value(leds[0].gpio)) + gpio_set_value(leds[0].gpio, 0); + + /* Do the rest at leisure via the scheduler */ + tasklet_schedule(&buttontask); + + return IRQ_HANDLED; + } + + int init_module() + { + int ret = 0; + + printk(KERN_INFO "%s\n", __func__); + + /* register LED gpios */ + ret = gpio_request_array(leds, ARRAY_SIZE(leds)); + + if (ret) { + printk(KERN_ERR "Unable to request GPIOs for LEDs: %d\n", ret); + return ret; + } + + /* register BUTTON gpios */ + ret = gpio_request_array(buttons, ARRAY_SIZE(buttons)); + + if (ret) { + printk(KERN_ERR "Unable to 
request GPIOs for BUTTONs: %d\n", ret); + goto fail1; + } + + printk(KERN_INFO "Current button1 value: %d\n", + gpio_get_value(buttons[0].gpio)); + + ret = gpio_to_irq(buttons[0].gpio); + + if (ret < 0) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + goto fail2; + } + + button_irqs[0] = ret; + + printk(KERN_INFO "Successfully requested BUTTON1 IRQ # %d\n", + button_irqs[0]); + + ret = request_irq(button_irqs[0], button_isr, + IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING, + "gpiomod#button1", NULL); + + if (ret) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + goto fail2; + } + + + ret = gpio_to_irq(buttons[1].gpio); + + if (ret < 0) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + /* the first IRQ is already held, so release it too */ + goto fail3; + } + + button_irqs[1] = ret; + + printk(KERN_INFO "Successfully requested BUTTON2 IRQ # %d\n", + button_irqs[1]); + + ret = request_irq(button_irqs[1], button_isr, + IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING, + "gpiomod#button2", NULL); + + if (ret) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + goto fail3; + } + + return 0; + + /* clean up what has been set up so far */ + fail3: + free_irq(button_irqs[0], NULL); + + fail2: + gpio_free_array(buttons, ARRAY_SIZE(buttons)); + + fail1: + gpio_free_array(leds, ARRAY_SIZE(leds)); + + return ret; + } + + void cleanup_module() + { + int i; + + printk(KERN_INFO "%s\n", __func__); + + /* free irqs */ + free_irq(button_irqs[0], NULL); + free_irq(button_irqs[1], NULL); + + /* wait for a scheduled bottom half to finish before unloading */ + tasklet_kill(&buttontask); + + /* turn all LEDs off */ + for (i = 0; i < ARRAY_SIZE(leds); i++) + gpio_set_value(leds[i].gpio, 0); + + /* unregister */ + gpio_free_array(leds, ARRAY_SIZE(leds)); + gpio_free_array(buttons, ARRAY_SIZE(buttons)); + } + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Interrupt with top and bottom half"); +::: +::: +::: +::: + +::: {#outline-container-org627e987 .outline-2} +Crypto {#org627e987} +------ + +::: {#text-org627e987 .outline-text-2} +At the dawn of the internet everybody trusted everybody 
completely...but +that didn\'t work out so well. When this guide was originally written it +was a more innocent era in which almost nobody actually gave a damn +about crypto - least of all kernel developers. That\'s certainly no +longer the case now. To handle crypto stuff the kernel has its own API +enabling common methods of encryption, decryption and your favourite +hash functions. +::: + +::: {#outline-container-org0d560c3 .outline-3} +### Hash functions {#org0d560c3} + +::: {#text-org0d560c3 .outline-text-3} +Calculating and checking the hashes of things is a common operation. +Here is a demonstration of how to calculate a sha256 hash within a +kernel module. + +::: {.org-src-container} + #include <linux/module.h> + #include <crypto/internal/hash.h> + + #define SHA256_LENGTH (256/8) + + static void show_hash_result(char * plaintext, char * hash_sha256) + { + int i; + char str[SHA256_LENGTH*2 + 1]; + + printk("sha256 test for string: \"%s\"\n", plaintext); + for (i = 0; i < SHA256_LENGTH ; i++) + sprintf(&str[i*2],"%02x", (unsigned char)hash_sha256[i]); + str[i*2] = 0; + printk("%s\n", str); + } + + int cryptosha256_init(void) + { + char * plaintext = "This is a test"; + char hash_sha256[SHA256_LENGTH]; + struct crypto_shash *sha256; + struct shash_desc *shash; + int ret; + + sha256 = crypto_alloc_shash("sha256", 0, 0); + if (IS_ERR(sha256)) + return PTR_ERR(sha256); + + shash = + kmalloc(sizeof(struct shash_desc) + crypto_shash_descsize(sha256), + GFP_KERNEL); + if (!shash) { + ret = -ENOMEM; + goto free_tfm; + } + + shash->tfm = sha256; + shash->flags = 0; + + ret = crypto_shash_init(shash); + if (ret) + goto free_desc; + + ret = crypto_shash_update(shash, plaintext, strlen(plaintext)); + if (ret) + goto free_desc; + + ret = crypto_shash_final(shash, hash_sha256); + if (ret) + goto free_desc; + + show_hash_result(plaintext, hash_sha256); + + /* release the descriptor and the transform on every path */ + free_desc: + kfree(shash); + free_tfm: + crypto_free_shash(sha256); + + return ret; + } + + void cryptosha256_exit(void) + { + } + + module_init(cryptosha256_init); + module_exit(cryptosha256_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("sha256 hash test"); + 
MODULE_LICENSE("GPL"); +::: + +Make and install the module: + +::: {.org-src-container} + make + sudo insmod cryptosha256.ko + dmesg +::: + +And you should see that the hash was calculated for the test string. + +Finally, remove the test module: + +::: {.org-src-container} + sudo rmmod cryptosha256 +::: +::: +::: + +::: {#outline-container-org4e331ef .outline-3} +### Symmetric key encryption {#org4e331ef} + +::: {#text-org4e331ef .outline-text-3} +Here is an example of symmetrically encrypting a string using the AES +algorithm and a password. + +::: {.org-src-container} + #include <crypto/internal/skcipher.h> + #include <linux/module.h> + #include <linux/crypto.h> + + #define SYMMETRIC_KEY_LENGTH 32 + #define CIPHER_BLOCK_SIZE 16 + + struct tcrypt_result { + struct completion completion; + int err; + }; + + struct skcipher_def { + struct scatterlist sg; + struct crypto_skcipher * tfm; + struct skcipher_request * req; + struct tcrypt_result result; + char * scratchpad; + char * ciphertext; + char * ivdata; + }; + + static struct skcipher_def sk; + + static void test_skcipher_finish(struct skcipher_def * sk) + { + if (sk->tfm) + crypto_free_skcipher(sk->tfm); + if (sk->req) + skcipher_request_free(sk->req); + if (sk->ivdata) + kfree(sk->ivdata); + if (sk->scratchpad) + kfree(sk->scratchpad); + if (sk->ciphertext) + kfree(sk->ciphertext); + } + + static int test_skcipher_result(struct skcipher_def * sk, int rc) + { + switch (rc) { + case 0: + break; + case -EINPROGRESS: + case -EBUSY: + rc = wait_for_completion_interruptible( + &sk->result.completion); + if (!rc && !sk->result.err) { + reinit_completion(&sk->result.completion); + break; + } + default: + printk("skcipher encrypt returned with %d result %d\n", + rc, sk->result.err); + break; + } + + init_completion(&sk->result.completion); + + return rc; + } + + static void test_skcipher_callback(struct crypto_async_request *req, int error) + { + struct tcrypt_result *result = req->data; + + if (error == -EINPROGRESS) + return; + + result->err = error; + 
complete(&result->completion); + printk("Encryption finished successfully\n"); + } + + static int test_skcipher_encrypt(char * plaintext, char * password, + struct skcipher_def * sk) + { + int ret = -EFAULT; + unsigned char key[SYMMETRIC_KEY_LENGTH]; + + if (!sk->tfm) { + sk->tfm = crypto_alloc_skcipher("cbc-aes-aesni", 0, 0); + if (IS_ERR(sk->tfm)) { + printk("could not allocate skcipher handle\n"); + return PTR_ERR(sk->tfm); + } + } + + if (!sk->req) { + sk->req = skcipher_request_alloc(sk->tfm, GFP_KERNEL); + if (!sk->req) { + printk("could not allocate skcipher request\n"); + ret = -ENOMEM; + goto out; + } + } + + skcipher_request_set_callback(sk->req, CRYPTO_TFM_REQ_MAY_BACKLOG, + test_skcipher_callback, + &sk->result); + + /* clear the key */ + memset((void*)key,'\0',SYMMETRIC_KEY_LENGTH); + + /* Use the world's favourite password */ + sprintf((char*)key,"%s",password); + + /* AES 256 with given symmetric key */ + if (crypto_skcipher_setkey(sk->tfm, key, SYMMETRIC_KEY_LENGTH)) { + printk("key could not be set\n"); + ret = -EAGAIN; + goto out; + } + printk("Symmetric key: %s\n", key); + printk("Plaintext: %s\n", plaintext); + + if (!sk->ivdata) { + /* see https://en.wikipedia.org/wiki/Initialization_vector */ + sk->ivdata = kmalloc(CIPHER_BLOCK_SIZE, GFP_KERNEL); + if (!sk->ivdata) { + printk("could not allocate ivdata\n"); + goto out; + } + get_random_bytes(sk->ivdata, CIPHER_BLOCK_SIZE); + } + + if (!sk->scratchpad) { + /* The text to be encrypted */ + sk->scratchpad = kmalloc(CIPHER_BLOCK_SIZE, GFP_KERNEL); + if (!sk->scratchpad) { + printk("could not allocate scratchpad\n"); + goto out; + } + } + sprintf((char*)sk->scratchpad,"%s",plaintext); + + sg_init_one(&sk->sg, sk->scratchpad, CIPHER_BLOCK_SIZE); + skcipher_request_set_crypt(sk->req, &sk->sg, &sk->sg, + CIPHER_BLOCK_SIZE, sk->ivdata); + init_completion(&sk->result.completion); + + /* encrypt data */ + ret = crypto_skcipher_encrypt(sk->req); + ret = test_skcipher_result(sk, ret); + if (ret) + goto 
out; + + printk("Encryption request successful\n"); + + out: + return ret; + } + + int cryptoapi_init(void) + { + /* The world's favourite password */ + char * password = "password123"; + + sk.tfm = NULL; + sk.req = NULL; + sk.scratchpad = NULL; + sk.ciphertext = NULL; + sk.ivdata = NULL; + + test_skcipher_encrypt("Testing", password, &sk); + return 0; + } + + void cryptoapi_exit(void) + { + test_skcipher_finish(&sk); + } + + module_init(cryptoapi_init); + module_exit(cryptoapi_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Symmetric key encryption example"); + MODULE_LICENSE("GPL"); +::: +::: +::: +::: + +::: {#outline-container-org01d6493 .outline-2} +Standardising the interfaces: The Device Model {#org01d6493} +---------------------------------------------- + +::: {#text-org01d6493 .outline-text-2} +Up to this point we\'ve seen all kinds of modules doing all kinds of +things, but there was no consistency in their interfaces with the rest +of the kernel. To impose some consistency, so that there is at minimum +a standardised way to start, suspend and resume a device, a device model +was added. An example is shown below, and you can use this as a template +to add your own suspend, resume or other interface functions. 
+ +::: {.org-src-container} + #include <linux/kernel.h> + #include <linux/module.h> + #include <linux/platform_device.h> + + struct devicemodel_data { + char *greeting; + int number; + }; + + static int devicemodel_probe(struct platform_device *dev) + { + struct devicemodel_data *pd = (struct devicemodel_data *)(dev->dev.platform_data); + + printk("devicemodel probe\n"); + printk("devicemodel greeting: %s; %d\n", pd->greeting, pd->number); + + /* Your device initialisation code */ + + return 0; + } + + static int devicemodel_remove(struct platform_device *dev) + { + printk("devicemodel example removed\n"); + + /* Your device removal code */ + + return 0; + } + + static int devicemodel_suspend(struct device *dev) + { + printk("devicemodel example suspend\n"); + + /* Your device suspend code */ + + return 0; + } + + static int devicemodel_resume(struct device *dev) + { + printk("devicemodel example resume\n"); + + /* Your device resume code */ + + return 0; + } + + static const struct dev_pm_ops devicemodel_pm_ops = + { + .suspend = devicemodel_suspend, + .resume = devicemodel_resume, + .poweroff = devicemodel_suspend, + .freeze = devicemodel_suspend, + .thaw = devicemodel_resume, + .restore = devicemodel_resume + }; + + static struct platform_driver devicemodel_driver = { + .driver = { + .name = "devicemodel_example", + .owner = THIS_MODULE, + .pm = &devicemodel_pm_ops, + }, + .probe = devicemodel_probe, + .remove = devicemodel_remove, + }; + + static int devicemodel_init(void) + { + int ret; + + printk("devicemodel init\n"); + + ret = platform_driver_register(&devicemodel_driver); + + if (ret) { + printk(KERN_ERR "Unable to register driver\n"); + return ret; + } + + return 0; + } + + static void devicemodel_exit(void) + { + printk("devicemodel exit\n"); + platform_driver_unregister(&devicemodel_driver); + } + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Linux Device Model example"); + + module_init(devicemodel_init); + module_exit(devicemodel_exit); +::: +::: +::: + +::: 
{#outline-container-org87293ce .outline-2} +Optimisations {#org87293ce} +------------- + +::: {#text-org87293ce .outline-text-2} +::: + +::: {#outline-container-org87e8223 .outline-3} +### Likely and Unlikely conditions {#org87e8223} + +::: {#text-org87e8223 .outline-text-3} +Sometimes you might want your code to run as quickly as possible, +especially if it\'s handling an interrupt or doing something which might +cause noticeable latency. If your code contains boolean conditions and +you know that a condition almost always evaluates to the same value, +either *true* or *false*, then you can allow the compiler to optimise +for this using the *likely* and *unlikely* macros. + +For example, when allocating memory you\'re almost always expecting this +to succeed. + +::: {.org-src-container} + bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx); + if (unlikely(!bvl)) { + mempool_free(bio, bio_pool); + bio = NULL; + goto out; + } +::: + +When the *unlikely* macro is used the compiler alters its machine +instruction output so that it continues along the false branch and only +jumps if the condition is true. That avoids flushing the processor +pipeline. The opposite happens if you use the *likely* macro. +::: +::: +::: + +::: {#outline-container-org79dea20 .outline-2} +Common Pitfalls {#org79dea20} +--------------- + +::: {#text-org79dea20 .outline-text-2} +Before I send you on your way to go out into the world and write kernel +modules, there are a few things I need to warn you about. If I fail to +warn you and something bad happens, please report the problem to me for +a full refund of the amount I was paid for your copy of the book. +::: + +::: {#outline-container-org86275d7 .outline-3} +### Using standard libraries {#org86275d7} + +::: {#text-org86275d7 .outline-text-3} +You can\'t do that. In a kernel module you can only use kernel +functions, which are the functions you can see in /proc/kallsyms. 
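As a rough illustration of what this means in practice, here is a minimal sketch that uses the kernel\'s own counterparts of a few familiar C library calls. The names nolibc_init and nolibc_exit are made up for this example, and it will only build against the kernel headers with a module Makefile, not as an ordinary user space program.

```c
/*
 * nolibc.c - sketch: kernel counterparts of common C library calls.
 * The module and function names are hypothetical.
 */
#include <linux/module.h>
#include <linux/kernel.h>   /* printk() instead of printf() */
#include <linux/slab.h>     /* kmalloc()/kfree() instead of malloc()/free() */
#include <linux/string.h>   /* the kernel's own strlen() and memcpy() */

static int __init nolibc_init(void)
{
    const char *msg = "no libc here";
    size_t len = strlen(msg) + 1;           /* include the terminating NUL */
    char *copy = kmalloc(len, GFP_KERNEL);  /* may sleep, fine in init code */

    if (!copy)
        return -ENOMEM;

    memcpy(copy, msg, len);
    printk(KERN_INFO "%s\n", copy);
    kfree(copy);

    return 0;
}

static void __exit nolibc_exit(void)
{
}

module_init(nolibc_init);
module_exit(nolibc_exit);

MODULE_LICENSE("GPL");
```

The headers here are kernel headers with familiar sounding names; none of them come from the C library.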
+::: +::: + +::: {#outline-container-org8646229 .outline-3} +### Disabling interrupts {#org8646229} + +::: {#text-org8646229 .outline-text-3} +You might need to do this for a short time and that is OK, but if you +don\'t enable them afterwards, your system will be stuck and you\'ll +have to power it off. +::: +::: + +::: {#outline-container-org58c8bc4 .outline-3} +### Sticking your head inside a large carnivore {#org58c8bc4} + +::: {#text-org58c8bc4 .outline-text-3} +I probably don\'t have to warn you about this, but I figured I will +anyway, just in case. +::: +::: +::: + +::: {#outline-container-org2307e11 .outline-2} +Where To Go From Here? {#org2307e11} +---------------------- + +::: {#text-org2307e11 .outline-text-2} +I could easily have squeezed a few more chapters into this book. I could +have added a chapter about creating new file systems, or about adding +new protocol stacks (as if there\'s a need for that -- you\'d have to +dig underground to find a protocol stack not supported by Linux). I +could have added explanations of the kernel mechanisms we haven\'t +touched upon, such as bootstrapping or the disk interface. + +However, I chose not to. My purpose in writing this book was to provide +initiation into the mysteries of kernel module programming and to teach +the common techniques for that purpose. For people seriously interested +in kernel programming, I recommend +[kernelnewbies.org](https://kernelnewbies.org) and the *Documentation* +subdirectory within the kernel source code which isn\'t always easy to +understand but can be a starting point for further investigation. Also, +as Linus said, the best way to learn the kernel is to read the source +code yourself. + +If you\'re interested in more examples of short kernel modules then +searching on sites such as Github and Gitlab is a good way to start, +although there is a lot of duplication of older LKMPG examples which may +not compile with newer kernel versions. 
You will also be able to find +examples of the use of kernel modules to attack or compromise systems or +exfiltrate data and those can be useful for thinking about how to defend +systems and learning about existing security mechanisms within the +kernel. + +I hope I have helped you in your quest to become a better programmer, or +at least to have fun through technology. And, if you do write useful +kernel modules, I hope you publish them under the GPL, so I can use them +too. + +If you\'d like to contribute to this guide, notice anything glaringly +wrong, or just want to add extra sarcastic remarks perhaps involving +monkeys or some other kind of animal then please file an issue or even +better submit a pull request at . +::: +::: +::: + +::: {#postamble .status} +::: {.back-to-top} +[Back to top](#top) \| [E-mail me](mailto:bob@freedombone.net) +::: +::: diff --git a/4.12.12/LKMPG-4.12.12.rst b/4.12.12/LKMPG-4.12.12.rst new file mode 100644 index 0000000..6f3cd1b --- /dev/null +++ b/4.12.12/LKMPG-4.12.12.rst @@ -0,0 +1,7263 @@ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Table of Contents + :name: table-of-contents + +.. raw:: html + +
+ +- `Introduction <#org98c97cb>`__ + + - `Authorship <#org2782b14>`__ + - `Versioning and Notes <#org0b6d633>`__ + - `Acknowledgements <#orge57cf6b>`__ + - `What Is A Kernel Module? <#org37341bc>`__ + - `Kernel module package <#orge9612fa>`__ + - `What Modules are in my Kernel? <#orgb6ce832>`__ + - `Do I need to download and compile the kernel? <#orge1ec8b5>`__ + - `Before We Begin <#org87661f2>`__ + +- `Headers <#org52fbd37>`__ +- `Examples <#org628945f>`__ +- `Hello World <#org0d455c0>`__ + + - `The Simplest Module <#orgba22fe1>`__ + - `Hello and Goodbye <#org56fc79a>`__ + - `The \__init and \__exit Macros <#org86bfdb6>`__ + - `Licensing and Module Documentation <#org11aaf91>`__ + - `Passing Command Line Arguments to a Module <#org9e1dd8d>`__ + - `Modules Spanning Multiple Files <#orgcd10981>`__ + - `Building modules for a precompiled kernel <#orga65faca>`__ + +- `Preliminaries <#orgdeef601>`__ + + - `How modules begin and end <#orgc8eceb0>`__ + - `Functions available to modules <#org290f3df>`__ + - `User Space vs Kernel Space <#orga7850df>`__ + - `Name Space <#org4b4877b>`__ + - `Code space <#org7e3a491>`__ + - `Device Drivers <#org6c0b122>`__ + +- `Character Device drivers <#org016c39a>`__ + + - `The file_operations Structure <#org31d952e>`__ + - `The file structure <#org607b208>`__ + - `Registering A Device <#orgf96ab85>`__ + - `Unregistering A Device <#org452ea75>`__ + - `chardev.c <#orgdd49880>`__ + - `Writing Modules for Multiple Kernel Versions <#org903f5d5>`__ + +- `The /proc File System <#org6400501>`__ + + - `Read and Write a /proc File <#orga906618>`__ + - `Manage /proc file with standard filesystem <#org561d817>`__ + - `Manage /proc file with seq_file <#org38ea52f>`__ + +- `sysfs: Interacting with your module <#org954957f>`__ +- `Talking To Device Files <#org438f37b>`__ +- `System Calls <#org8de5924>`__ +- `Blocking Processes and threads <#org13e2c0e>`__ + + - `Sleep <#org9cbc7d3>`__ + - `Completions <#org89cb410>`__ + +- `Avoiding Collisions and 
Deadlocks <#org949949f>`__ + + - `Mutex <#org10f05c2>`__ + - `Spinlocks <#org5d633fc>`__ + - `Read and write locks <#orgaa517c3>`__ + - `Atomic operations <#orgadbf448>`__ + +- `Replacing Printks <#org7974c60>`__ + + - `Replacing printk <#org1c8b17b>`__ + - `Flashing keyboard LEDs <#org418d823>`__ + +- `Scheduling Tasks <#orgf37d73f>`__ + + - `Tasklets <#org32525a8>`__ + - `Work queues <#orge8a2d87>`__ + +- `Interrupt Handlers <#orgbc0cdf8>`__ + + - `Interrupt Handlers <#org93511bb>`__ + - `Detecting button presses <#org77533ca>`__ + - `Bottom Half <#orgdb452ba>`__ + +- `Crypto <#org627e987>`__ + + - `Hash functions <#org0d560c3>`__ + - `Symmetric key encryption <#org4e331ef>`__ + +- `Standardising the interfaces: The Device Model <#org01d6493>`__ +- `Optimisations <#org87293ce>`__ + + - `Likely and Unlikely conditions <#org87e8223>`__ + +- `Common Pitfalls <#org79dea20>`__ + + - `Using standard libraries <#org86275d7>`__ + - `Disabling interrupts <#org8646229>`__ + - `Sticking your head inside a large carnivore <#org58c8bc4>`__ + +- `Where To Go From Here? <#org2307e11>`__ + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Introduction + :name: org98c97cb + +.. raw:: html + +
+ +The Linux Kernel Module Programming Guide is a free book; you may +reproduce and/or modify it under the terms of the Open Software License, +version 3.0. + +This book is distributed in the hope it will be useful, but without any +warranty, without even the implied warranty of merchantability or +fitness for a particular purpose. + +The author encourages wide distribution of this book for personal or +commercial use, provided the above copyright notice remains intact and +the method adheres to the provisions of the Open Software License. In +summary, you may copy and distribute this book free of charge or for a +profit. No explicit permission is required from the author for +reproduction of this book in any medium, physical or electronic. + +Derivative works and translations of this document must be placed under +the Open Software License, and the original copyright notice must remain +intact. If you have contributed new material to this book, you must make +the material and source code available for your revisions. Please make +revisions and updates available directly to the document maintainer, +Peter Jay Salzman . This will allow for the merging of +updates and provide consistent revisions to the Linux community. + +If you publish or distribute this book commercially, donations, +royalties, and/or printed copies are greatly appreciated by the author +and the `Linux Documentation Project `__ (LDP). +Contributing in this way shows your support for free software and the +LDP. If you have questions or comments, please contact the address +above. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Authorship + :name: org2782b14 + +.. raw:: html + +
+ +The Linux Kernel Module Programming Guide was originally written for the +2.2 kernels by Ori Pomerantz. Eventually, Ori no longer had time to +maintain the document. After all, the Linux kernel is a fast moving +target. Peter Jay Salzman took over maintenance and updated it for the +2.4 kernels. Eventually, Peter no longer had time to follow developments +with the 2.6 kernel, so Michael Burian became a co-maintainer to update +the document for the 2.6 kernels. Bob Mottram updated the examples for +3.8 and later kernels, added the sysfs chapter and modified or updated +other chapters. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Versioning and Notes + :name: org0b6d633 + +.. raw:: html + +
+ +The Linux kernel is a moving target. There has always been a question +whether the LKMPG should remove deprecated information or keep it around +for historical sake. Michael Burian and I decided to create a new branch +of the LKMPG for each new stable kernel version. So version LKMPG 4.12.x +will address Linux kernel 4.12.x and LKMPG 2.6.x will address Linux +kernel 2.6. No attempt will be made to archive historical information; a +person wishing this information should read the appropriately versioned +LKMPG. + +The source code and discussions should apply to most architectures, but +I can't promise anything. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Acknowledgements + :name: orge57cf6b + +.. raw:: html + +
+ +The following people have contributed corrections or good suggestions: +Ignacio Martin, David Porter, Daniele Paolo Scarpazza, Dimo Velev, +Francois Audeon, Horst Schirmeier, Bob Mottram and Roman Lakeev. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: What Is A Kernel Module? + :name: org37341bc + +.. raw:: html + +
+ +So, you want to write a kernel module. You know C, you've written a few +normal programs to run as processes, and now you want to get to where +the real action is, to where a single wild pointer can wipe out your +file system and a core dump means a reboot. + +What exactly is a kernel module? Modules are pieces of code that can be +loaded and unloaded into the kernel upon demand. They extend the +functionality of the kernel without the need to reboot the system. For +example, one type of module is the device driver, which allows the +kernel to access hardware connected to the system. Without modules, we +would have to build monolithic kernels and add new functionality +directly into the kernel image. Besides having larger kernels, this has +the disadvantage of requiring us to rebuild and reboot the kernel every +time we want new functionality. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Kernel module package + :name: orge9612fa + +.. raw:: html + +
+ +Linux distros provide the commands *modprobe*, *insmod* and *depmod* +within a package. + +On Debian: + +.. raw:: html + +
+ +:: + + sudo apt-get install build-essential kmod + +.. raw:: html + +
+ +On Parabola: + +.. raw:: html + +
+ +:: + + sudo pacman -S gcc kmod + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: What Modules are in my Kernel? + :name: orgb6ce832 + +.. raw:: html + +
+ +To discover what modules are already loaded within your current kernel +use the command **lsmod**. + +.. raw:: html + +
+ +:: + + sudo lsmod + +.. raw:: html + +
+ +The list of loaded modules is also exposed through the file +/proc/modules, so you can see them with: + +.. raw:: html + +
+ +:: + + sudo cat /proc/modules + +.. raw:: html + +
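Each line of /proc/modules describes one loaded module, starting with its name, its size in bytes and its use count. As a minimal sketch, assuming that common layout, a user space C helper could pick those three fields out of a line. The struct mod_entry type and the parse_modules_line() function are hypothetical names invented for this illustration; they are not part of any library.

```c
#include <stdio.h>

/* One parsed entry of /proc/modules (hypothetical helper type). */
struct mod_entry {
    char name[64];        /* module name, e.g. "fat" */
    unsigned long size;   /* memory used by the module, in bytes */
    int refcount;         /* number of current users of the module */
};

/*
 * Parse the first three fields of one /proc/modules line.
 * Returns 0 on success, -1 if the line does not have that shape.
 */
static int parse_modules_line(const char *line, struct mod_entry *out)
{
    if (sscanf(line, "%63s %lu %d",
               out->name, &out->size, &out->refcount) != 3)
        return -1;

    return 0;
}
```

Given a line such as `fat 86016 1 vfat, Live 0xffffffffc0a3e000`, the helper fills in the name fat, the size 86016 and the use count 1. The remaining fields (dependencies, state and load address) are ignored here.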
+ +This can be a long list, and you might prefer to search for something +particular. To search for the *fat* module: + +.. raw:: html + +
+ +:: + + sudo lsmod | grep fat + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Do I need to download and compile the kernel? + :name: orge1ec8b5 + +.. raw:: html + +
+ +For the purposes of following this guide you don't necessarily need to +do that. However, it would be wise to run the examples within a test +distro running on a virtual machine in order to avoid any possibility of +messing up your system. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Before We Begin + :name: org87661f2 + +.. raw:: html + +
+ +Before we delve into code, there are a few issues we need to cover. +Everyone's system is different and everyone has their own groove. +Getting your first "hello world" program to compile and load correctly +can sometimes be a trick. Rest assured, after you get over the initial +hurdle of doing it for the first time, it will be smooth sailing +thereafter. + +.. raw:: html + +
+ +- | Modversioning + | A module compiled for one + kernel won't load if you boot a different kernel unless you enable + CONFIG_MODVERSIONS in the kernel. We won't go into module + versioning until later in this guide. Until we cover modversions, + the examples in the guide may not work if you're running a kernel + with modversioning turned on. However, most stock Linux distro + kernels come with it turned on. If you're having trouble loading + the modules because of versioning errors, compile a kernel with + modversioning turned off. + +- | Using X + | It is highly recommended + that you type in, compile and load all the examples this guide + discusses. It's also highly recommended you do this from a console. + You should not be working on this stuff in X. + + Modules can't print to the screen like printf() can, but they can log + information and warnings, which end up being printed on your screen, + but only on a console. If you insmod a module from an xterm, the + information and warnings will be logged, but only to your systemd + journal. You won't see them unless you look through the journal with + journalctl. To have immediate access to this information, do all your + work from the console. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Headers + :name: org52fbd37 + +.. raw:: html + +
+ +Before you can build anything you'll need to install the header files +for your kernel. On Parabola GNU/Linux: + +.. raw:: html + +
+ +:: + + sudo pacman -S linux-libre-headers + +.. raw:: html + +
+ +On Debian: + +.. raw:: html + +
+ +:: + + sudo apt-get update + apt-cache search linux-headers-$(uname -r) + +.. raw:: html + +
+ +This will tell you what kernel header files are available. Then for +example: + +.. raw:: html + +
+ +:: + + sudo apt-get install kmod linux-headers-4.12.12-1-amd64 + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Examples + :name: org628945f + +.. raw:: html + +
+ +All the examples from this document are available within the *examples* +subdirectory. To test that they compile: + +.. raw:: html + +
+ +:: + + cd examples + make + +.. raw:: html + +
+ +If there are any compile errors then you might have a more recent kernel +version or need to install the corresponding kernel header files. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Hello World + :name: org0d455c0 + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: The Simplest Module + :name: orgba22fe1 + +.. raw:: html + +
+ +Most people learning programming start out with some sort of "*hello +world*" example. I don't know what happens to people who break with this +tradition, but I think it's safer not to find out. We'll start with a +series of hello world programs that demonstrate the different aspects of +the basics of writing a kernel module. + +Here's the simplest module possible. + +Make a test directory: + +.. raw:: html + +
+ +:: + + mkdir -p ~/develop/kernel/hello-1 + cd ~/develop/kernel/hello-1 + +.. raw:: html + +
+ +Paste this into your favourite editor and save it as **hello-1.c**: + +.. raw:: html + +
+ +:: + + /* + * hello-1.c - The simplest kernel module. + */ + #include <linux/module.h> /* Needed by all modules */ + #include <linux/kernel.h> /* Needed for KERN_INFO */ + + int init_module(void) + { + printk(KERN_INFO "Hello world 1.\n"); + + /* + * A non 0 return means init_module failed; module can't be loaded. + */ + return 0; + } + + void cleanup_module(void) + { + printk(KERN_INFO "Goodbye world 1.\n"); + } + +.. raw:: html + +
+ +Now you'll need a Makefile. If you copy and paste this, change the +indentation of the command lines to use tabs, not spaces. + +.. raw:: html + +
+ +:: + + obj-m += hello-1.o + + all: + make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules + + clean: + make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean + +.. raw:: html + +
+ +And finally just: + +.. raw:: html + +
+ +:: + + make + +.. raw:: html + +
+ +If all goes smoothly you should then find that you have a compiled +**hello-1.ko** module. You can find info on it with the command: + +.. raw:: html + +
+ +:: + + sudo modinfo hello-1.ko + +.. raw:: html + +
+ +At this point the command: + +.. raw:: html + +
+ +:: + + sudo lsmod | grep hello + +.. raw:: html + +
+ +should return nothing. You can try loading your shiny new module with: + +.. raw:: html + +
+ +:: + + sudo insmod hello-1.ko + +.. raw:: html + +
+ +The dash character will get converted to an underscore, so when you +again try: + +.. raw:: html + +
+ +:: + + sudo lsmod | grep hello + +.. raw:: html + +
+ +you should now see your loaded module. It can be removed again with: + +.. raw:: html + +
+ +:: + + sudo rmmod hello_1 + +.. raw:: html + +
+ +Notice that the dash was replaced by an underscore. To see what just +happened in the logs: + +.. raw:: html + +
+ +:: + + journalctl --since "1 hour ago" | grep kernel + +.. raw:: html + +
+ +You now know the basics of creating, compiling, installing and removing +modules. Now for more of a description of how this module works. + +Kernel modules must have at least two functions: a "start" +(initialization) function called **init_module()** which is called when +the module is insmoded into the kernel, and an "end" (cleanup) function +called **cleanup_module()** which is called just before it is rmmoded. +Actually, things have changed starting with kernel 2.3.13. You can now +use whatever name you like for the start and end functions of a module, +and you'll learn how to do this in Section 2.3. In fact, the new method +is the preferred method. However, many people still use init_module() +and cleanup_module() for their start and end functions. + +Typically, init_module() either registers a handler for something with +the kernel, or it replaces one of the kernel functions with its own code +(usually code to do something and then call the original function). The +cleanup_module() function is supposed to undo whatever init_module() +did, so the module can be unloaded safely. + +Lastly, every kernel module needs to include linux/module.h. We needed +to include **linux/kernel.h** only for the macro expansion for the +printk() log level, KERN_INFO, which you'll learn about in Section +2.1.1. + +.. raw:: html + +
+ +- | A point about coding style + | ::: {#text-orgab018f5 .outline-text-5} Another thing which may not + be immediately obvious to anyone getting started with kernel + programming is that indentation within your code should be using + **tabs** and **not spaces**. It's one of the coding conventions of + the kernel. You may not like it, but you'll need to get used to it + if you ever submit a patch upstream. + + ::: + +- | Introducing printk() + | ::: {#text-org176ca3e .outline-text-5} Despite what you might + think, **printk()** was not meant to communicate information to the + user, even though we used it for exactly this purpose in hello-1! + It happens to be a logging mechanism for the kernel, and is used to + log information or give warnings. Therefore, each printk() + statement comes with a priority, such as the KERN_INFO you saw in + hello-1. There are 8 priorities and the kernel has macros for them, + so you don't have to use cryptic numbers, and you can view them + (and their meanings) in **linux/kernel.h**. If you don't specify a + priority level, the default priority, DEFAULT_MESSAGE_LOGLEVEL, + will be used. + + Take time to read through the priority macros. The header file also + describes what each priority means. In practice, don't use numbers, + like <4>. Always use the macro, like KERN_WARNING. + + If the priority is less than int console_loglevel, the message is + printed on your current terminal. If both syslogd and klogd are + running, then the message will also get appended to the systemd + journal, whether it got printed to the console or not. A high + priority, like KERN_ALERT, makes sure the printk() messages get + printed to your console rather than just logged to the journal. When + you write real modules, you'll want to use priorities that are + meaningful for the situation at hand. 
+ + ::: + +- | About Compiling + | ::: {#text-orgc8049ab .outline-text-5} Kernel modules need to be + compiled a bit differently from regular userspace apps. Former + kernel versions required us to care much about these settings, + which are usually stored in Makefiles. Although hierarchically + organized, many redundant settings accumulated in sublevel + Makefiles and made them large and rather difficult to maintain. + Fortunately, there is a new way of doing these things, called + kbuild, and the build process for external loadable modules is now + fully integrated into the standard kernel build mechanism. To learn + more on how to compile modules which are not part of the official + kernel (such as all the examples you'll find in this guide), see + file **linux/Documentation/kbuild/modules.txt**. + + Additional details about Makefiles for kernel modules are available + in **linux/Documentation/kbuild/makefiles.txt**. Be sure to read this + and the related files before starting to hack Makefiles. It'll + probably save you lots of work. + + Here's another exercise for the reader. See that comment above + the return statement in init_module()? Change the return value to + something negative, recompile and load the module again. What + happens? + + ::: + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Hello and Goodbye + :name: org56fc79a + +.. raw:: html + +
+ +In early kernel versions you had to use the **init_module** and +**cleanup_module** functions, as in the first hello world example, but +these days you can name those anything you want by using the +**module_init** and **module_exit** macros. These macros are defined in +**linux/init.h**. The only requirement is that your init and cleanup +functions must be defined before calling those macros, otherwise +you'll get compilation errors. Here's an example of this technique: + +.. raw:: html + +
+ +:: + + /* + * hello-2.c - Demonstrating the module_init() and module_exit() macros. + * This is preferred over using init_module() and cleanup_module(). + */ + #include <linux/module.h> /* Needed by all modules */ + #include <linux/kernel.h> /* Needed for KERN_INFO */ + #include <linux/init.h> /* Needed for the macros */ + + static int __init hello_2_init(void) + { + printk(KERN_INFO "Hello, world 2\n"); + return 0; + } + + static void __exit hello_2_exit(void) + { + printk(KERN_INFO "Goodbye, world 2\n"); + } + + module_init(hello_2_init); + module_exit(hello_2_exit); + +.. raw:: html + +
+ +So now we have two real kernel modules under our belt. Adding another +module is as simple as this: + +.. raw:: html + +
+ +:: + + obj-m += hello-1.o + obj-m += hello-2.o + + all: + make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules + + clean: + make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean + +.. raw:: html + +
+ +Now have a look at linux/drivers/char/Makefile for a real world example. +As you can see, some things get hardwired into the kernel (obj-y) but +where have all those obj-m gone? Those familiar with shell scripts will +easily be able to spot them. For those not, the obj-$(CONFIG_FOO) +entries you see everywhere expand into obj-y or obj-m, depending on +whether the CONFIG_FOO variable has been set to y or m. While we are at +it, those are exactly the kind of variables that you set in the +linux/.config file the last time you ran make menuconfig or +something like that. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: The \__init and \__exit Macros + :name: org86bfdb6 + +.. raw:: html + +
+ +This demonstrates a feature of kernel 2.2 and later. Notice the change +in the definitions of the init and cleanup functions. The **\__init** +macro causes the init function to be discarded and its memory freed once +the init function finishes for built-in drivers, but not loadable +modules. If you think about when the init function is invoked, this +makes perfect sense. + +There is also an **\__initdata** which works similarly to **\__init** +but for init variables rather than functions. + +The **\__exit** macro causes the omission of the function when the +module is built into the kernel, and like \__init, has no effect for +loadable modules. Again, if you consider when the cleanup function runs, +this makes complete sense; built-in drivers don't need a cleanup +function, while loadable modules do. + +These macros are defined in **linux/init.h** and serve to free up kernel +memory. When you boot your kernel and see something like Freeing unused +kernel memory: 236k freed, this is precisely what the kernel is freeing. + +.. raw:: html + +
+ +:: + + /* + * hello-3.c - Illustrating the __init, __initdata and __exit macros. + */ + #include <linux/module.h> /* Needed by all modules */ + #include <linux/kernel.h> /* Needed for KERN_INFO */ + #include <linux/init.h> /* Needed for the macros */ + + static int hello3_data __initdata = 3; + + static int __init hello_3_init(void) + { + printk(KERN_INFO "Hello, world %d\n", hello3_data); + return 0; + } + + static void __exit hello_3_exit(void) + { + printk(KERN_INFO "Goodbye, world 3\n"); + } + + module_init(hello_3_init); + module_exit(hello_3_exit); + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Licensing and Module Documentation + :name: org11aaf91 + +.. raw:: html + +
+ +Honestly, who loads or even cares about proprietary modules? If you do +then you might have seen something like this: + +.. raw:: html + +
+ +:: + + # insmod xxxxxx.o + Warning: loading xxxxxx.ko will taint the kernel: no license + See http://www.tux.org/lkml/#export-tainted for information about tainted modules + Module xxxxxx loaded, with warnings + +.. raw:: html + +
+ +You can use a few macros to indicate the license for your module. Some +examples are "GPL", "GPL v2", "GPL and additional rights", "Dual +BSD/GPL", "Dual MIT/GPL", "Dual MPL/GPL" and "Proprietary". They're +defined within **linux/module.h**. + +To declare which license you're using, a macro called +**MODULE_LICENSE** is available. This and a few other macros describing +the module are illustrated in the example below. + +.. raw:: html + +
+ +:: + + /* + * hello-4.c - Demonstrates module documentation. + */ + #include <linux/module.h> /* Needed by all modules */ + #include <linux/kernel.h> /* Needed for KERN_INFO */ + #include <linux/init.h> /* Needed for the macros */ + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("A sample driver"); + MODULE_SUPPORTED_DEVICE("testdevice"); + + static int __init init_hello_4(void) + { + printk(KERN_INFO "Hello, world 4\n"); + return 0; + } + + static void __exit cleanup_hello_4(void) + { + printk(KERN_INFO "Goodbye, world 4\n"); + } + + module_init(init_hello_4); + module_exit(cleanup_hello_4); + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Passing Command Line Arguments to a Module + :name: org9e1dd8d + +.. raw:: html + +
+ +Modules can take command line arguments, but not with the argc/argv you +might be used to. + +To allow arguments to be passed to your module, declare the variables +that will take the values of the command line arguments as global and +then use the module_param() macro, (defined in linux/moduleparam.h) to +set the mechanism up. At runtime, insmod will fill the variables with +any command line arguments that are given, like ./insmod mymodule.ko +myvariable=5. The variable declarations and macros should be placed at +the beginning of the module for clarity. The example code should clear +up my admittedly lousy explanation. + +The module_param() macro takes 3 arguments: the name of the variable, +its type and permissions for the corresponding file in sysfs. Integer +types can be signed as usual or unsigned. If you'd like to use arrays of +integers or strings see module_param_array() and module_param_string(). + +.. raw:: html + +
+ +:: + + int myint = 3; + module_param(myint, int, 0); + +.. raw:: html + +
+ +Arrays are supported too, but things are a bit different now than they +were in the olden days. To keep track of the number of parameters you +need to pass a pointer to a count variable as third parameter. At your +option, you could also ignore the count and pass NULL instead. We show +both possibilities here: + +.. raw:: html + +
+ +:: + + int myintarray[2]; + module_param_array(myintarray, int, NULL, 0); /* not interested in count */ + + short myshortarray[4]; + int count; + module_param_array(myshortarray, short, &count, 0); /* put count into "count" variable */ + +.. raw:: html + +
+ +A good use for this is to give the module's variables default values, +like a port or I/O address. If the variables contain the default values, +then perform autodetection (explained elsewhere). Otherwise, keep the +current value. This will be made clear later on. + +Lastly, there's a macro function, **MODULE_PARM_DESC()**, that is used +to document arguments that the module can take. It takes two parameters: +a variable name and a free form string describing that variable. + +.. raw:: html + +
+ +:: + + /* + * hello-5.c - Demonstrates command line argument passing to a module. + */ + #include <linux/module.h> + #include <linux/moduleparam.h> + #include <linux/kernel.h> + #include <linux/init.h> + #include <linux/stat.h> + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Peter Jay Salzman"); + + static short int myshort = 1; + static int myint = 420; + static long int mylong = 9999; + static char *mystring = "blah"; + static int myintArray[2] = { -1, -1 }; + static int arr_argc = 0; + + /* + * module_param(foo, int, 0000) + * The first param is the parameter's name + * The second param is its data type + * The final argument is the permissions bits, + * for exposing parameters in sysfs (if non-zero) at a later stage. + */ + + module_param(myshort, short, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP); + MODULE_PARM_DESC(myshort, "A short integer"); + module_param(myint, int, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH); + MODULE_PARM_DESC(myint, "An integer"); + module_param(mylong, long, S_IRUSR); + MODULE_PARM_DESC(mylong, "A long integer"); + module_param(mystring, charp, 0000); + MODULE_PARM_DESC(mystring, "A character string"); + + /* + * module_param_array(name, type, num, perm); + * The first param is the parameter's (in this case the array's) name + * The second param is the data type of the elements of the array + * The third argument is a pointer to the variable that will store the number + * of elements of the array initialized by the user at module loading time + * The fourth argument is the permission bits + */ + module_param_array(myintArray, int, &arr_argc, 0000); + MODULE_PARM_DESC(myintArray, "An array of integers"); + + static int __init hello_5_init(void) + { + int i; + printk(KERN_INFO "Hello, world 5\n=============\n"); + printk(KERN_INFO "myshort is a short integer: %hd\n", myshort); + printk(KERN_INFO "myint is an integer: %d\n", myint); + printk(KERN_INFO "mylong is a long integer: %ld\n", mylong); + printk(KERN_INFO "mystring is a string: %s\n", mystring); + for (i = 0; i < (sizeof myintArray / sizeof (int)); i++) + { +
printk(KERN_INFO "myintArray[%d] = %d\n", i, myintArray[i]); + } + printk(KERN_INFO "got %d arguments for myintArray.\n", arr_argc); + return 0; + } + + static void __exit hello_5_exit(void) + { + printk(KERN_INFO "Goodbye, world 5\n"); + } + + module_init(hello_5_init); + module_exit(hello_5_exit); + +.. raw:: html + +
+ +I would recommend playing around with this code: + +.. raw:: html + +
+ +:: + + # sudo insmod hello-5.ko mystring="bebop" mybyte=255 myintArray=-1 + mybyte is an 8 bit integer: 255 + myshort is a short integer: 1 + myint is an integer: 20 + mylong is a long integer: 9999 + mystring is a string: bebop + myintArray is -1 and 420 + + # rmmod hello-5 + Goodbye, world 5 + + # sudo insmod hello-5.ko mystring="supercalifragilisticexpialidocious" \ + > mybyte=256 myintArray=-1,-1 + mybyte is an 8 bit integer: 0 + myshort is a short integer: 1 + myint is an integer: 20 + mylong is a long integer: 9999 + mystring is a string: supercalifragilisticexpialidocious + myintArray is -1 and -1 + + # rmmod hello-5 + Goodbye, world 5 + + # sudo insmod hello-5.ko mylong=hello + hello-5.o: invalid argument syntax for mylong: 'h' + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Modules Spanning Multiple Files + :name: orgcd10981 + +.. raw:: html + +
+ +Sometimes it makes sense to divide a kernel module between several +source files. + +Here's an example of such a kernel module. + +.. raw:: html + +
+ +:: + + /* + * start.c - Illustration of multi filed modules + */ + + #include <linux/kernel.h> /* We're doing kernel work */ + #include <linux/module.h> /* Specifically, a module */ + + int init_module(void) + { + printk(KERN_INFO "Hello, world - this is the kernel speaking\n"); + return 0; + } + +.. raw:: html + +
+ +The next file: + +.. raw:: html + +
+ +:: + + /* + * stop.c - Illustration of multi filed modules + */ + + #include <linux/kernel.h> /* We're doing kernel work */ + #include <linux/module.h> /* Specifically, a module */ + + void cleanup_module(void) + { + printk(KERN_INFO "Short is the life of a kernel module\n"); + } + +.. raw:: html + +
+ +And finally, the makefile: + +.. raw:: html + +
+ +:: + + obj-m += hello-1.o + obj-m += hello-2.o + obj-m += hello-3.o + obj-m += hello-4.o + obj-m += hello-5.o + obj-m += startstop.o + startstop-objs := start.o stop.o + + all: + make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules + + clean: + make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean + +.. raw:: html + +
+ +This is the complete makefile for all the examples we've seen so far. +The first five lines are nothing special, but for the last example we'll +need two lines. First we invent an object name for our combined module, +second we tell make what object files are part of that module. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Building modules for a precompiled kernel + :name: orga65faca + +.. raw:: html + +
+ +Obviously, we strongly suggest that you recompile your kernel, so that you +can enable a number of useful debugging features, such as forced module +unloading (**MODULE_FORCE_UNLOAD**): when this option is enabled, you +can force the kernel to unload a module even when it believes it is +unsafe, via a **sudo rmmod -f module** command. This option can save you +a lot of time and a number of reboots during the development of a +module. If you don't want to recompile your kernel then you should +consider running the examples within a test distro on a virtual machine. +If you mess anything up then you can easily reboot or restore the VM. + +There are a number of cases in which you may want to load your module +into a precompiled running kernel, such as the ones shipped with common +Linux distributions, or a kernel you have compiled in the past. In +certain circumstances you may need to compile and insert a module +into a running kernel which you are not allowed to recompile, or on a +machine that you prefer not to reboot. If you can't think of a case that +will force you to use modules for a precompiled kernel you might want to +skip this and treat the rest of this chapter as a big footnote. + +Now, if you just install a kernel source tree, use it to compile your +kernel module and you try to insert your module into the kernel, in most +cases you would obtain an error as follows: + +.. raw:: html + +
+ +:: + + insmod: error inserting 'poet_atkm.ko': -1 Invalid module format + +.. raw:: html + +
+ +Less cryptic information is logged to the systemd journal: + +.. raw:: html + +
+ +:: + + Jun 4 22:07:54 localhost kernel: poet_atkm: version magic '2.6.5-1.358custom 686 + REGPARM 4KSTACKS gcc-3.3' should be '2.6.5-1.358 686 REGPARM 4KSTACKS gcc-3.3' + +.. raw:: html + +
+ +In other words, your kernel refuses to accept your module because +version strings (more precisely, version magics) do not match. +Incidentally, version magics are stored in the module object in the form +of a static string, starting with vermagic:. Version data are inserted +in your module when it is linked against the **init/vermagic.o** file. +To inspect version magics and other strings stored in a given module, +issue the modinfo module.ko command: + +.. raw:: html + +
+ +:: + + # sudo modinfo hello-4.ko + license: GPL + author: Bob Mottram + description: A sample driver + vermagic: 4.12.12-1.358 amd64 REGPARM 4KSTACKS gcc-4.9.2 + depends: + +.. raw:: html + +
+ +To overcome this problem we could resort to the **--force-vermagic** +option, but this solution is potentially unsafe, and unquestionably +unacceptable in production modules. Consequently, we want to compile our +module in an environment which is identical to the one in which our +precompiled kernel was built. How to do this is the subject of the +remainder of this chapter. + +First of all, make sure that a kernel source tree is available, having +exactly the same version as your current kernel. Then, find the +configuration file which was used to compile your precompiled kernel. +Usually, this is available in your current /boot directory, under a name +like config-2.6.x. You may just want to copy it to your kernel source +tree: **cp /boot/config-\`uname -r\` /usr/src/linux-\`uname -r\`/.config**. + +Let's focus again on the previous error message: a closer look at the +version magic strings suggests that, even with two configuration files +which are exactly the same, a slight difference in the version magic +could be possible, and it is sufficient to prevent insertion of the +module into the kernel. That slight difference, namely the custom string +which appears in the module's version magic and not in the kernel's one, +is due to a modification with respect to the original, in the makefile +that some distributions include. Then, examine your +**/usr/src/linux/Makefile**, and make sure that the specified version +information matches exactly the one used for your current kernel. For +example, your makefile could start as follows: + +.. raw:: html + +
+ +:: + + VERSION = 4 + PATCHLEVEL = 7 + SUBLEVEL = 4 + EXTRAVERSION = -1.358custom + +.. raw:: html + +
+ +In this case, you need to restore the value of symbol **EXTRAVERSION** +to -1.358. We suggest keeping a backup copy of the makefile used to +compile your kernel available in **/lib/modules/4.12.12-1.358/build**. A +simple **cp /lib/modules/\`uname -r\`/build/Makefile /usr/src/linux-\`uname +-r\`** should suffice. Additionally, if you already started a kernel +build with the previous (wrong) Makefile, you should also rerun make, or +directly modify symbol UTS_RELEASE in file +**/usr/src/linux-4.12.12/include/linux/version.h** according to contents +of file **/lib/modules/4.12.12/build/include/linux/version.h**, or +overwrite the latter with the first. + +Now, please run make to update configuration and version headers and +objects: + +.. raw:: html + +
+ +:: + + # make + CHK include/linux/version.h + UPD include/linux/version.h + SYMLINK include/asm -> include/asm-i386 + SPLIT include/linux/autoconf.h -> include/config/* + HOSTCC scripts/basic/fixdep + HOSTCC scripts/basic/split-include + HOSTCC scripts/basic/docproc + HOSTCC scripts/conmakehash + HOSTCC scripts/kallsyms + CC scripts/empty.o + +.. raw:: html + +
+ +If you do not want to actually compile the kernel, you can interrupt +the build process (CTRL-C) just after the SPLIT line, because at that +time, the files you need will be ready. Now you can turn back to the +directory of your module and compile it: it will be built exactly +according to your current kernel settings, and it will load into it without +any errors. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Preliminaries + :name: orgdeef601 + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: How modules begin and end + :name: orgc8eceb0 + +.. raw:: html + +
+ +A program usually begins with a **main()** function, executes a bunch of +instructions and terminates upon completion of those instructions. +Kernel modules work a bit differently. A module always begins with either +the init_module function or the function you specify with the module_init +call. This is the entry function for modules; it tells the kernel what +functionality the module provides and sets up the kernel to run the +module's functions when they're needed. Once it does this, the entry +function returns and the module does nothing until the kernel wants to +do something with the code that the module provides. + +All modules end by calling either **cleanup_module** or the function you +specify with the **module_exit** call. This is the exit function for +modules; it undoes whatever the entry function did. It unregisters the +functionality that the entry function registered. + +Every module must have an entry function and an exit function. Since +there's more than one way to specify entry and exit functions, I'll try +my best to use the terms \`entry function' and \`exit function', but if +I slip and simply refer to them as init_module and cleanup_module, I +think you'll know what I mean. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Functions available to modules + :name: org290f3df + +.. raw:: html + +
+ +Programmers use functions they don't define all the time. A prime +example of this is **printf()**. You use these library functions which +are provided by the standard C library, libc. The definitions for these +functions don't actually enter your program until the linking stage, +which ensures that the code (for printf() for example) is available, and +fixes the call instruction to point to that code. + +Kernel modules are different here, too. In the hello world example, you +might have noticed that we used a function, **printk()** but didn't +include a standard I/O library. That's because modules are object files +whose symbols get resolved upon insmod'ing. The definition for the +symbols comes from the kernel itself; the only external functions you +can use are the ones provided by the kernel. If you're curious about +what symbols have been exported by your kernel, take a look at +**/proc/kallsyms**. + +One point to keep in mind is the difference between library functions +and system calls. Library functions are higher level, run completely in +user space and provide a more convenient interface for the programmer to +the functions that do the real work: system calls. System calls run in +kernel mode on the user's behalf and are provided by the kernel itself. +The library function printf() may look like a very general printing +function, but all it really does is format the data into strings and +write the string data using the low-level system call write(), which +then sends the data to standard output. + +Would you like to see what system calls are made by printf()? It's easy! +Compile the following program: + +.. raw:: html + +
+ +:: + + #include <stdio.h> + + int main(void) + { + printf("hello"); + return 0; + } + +.. raw:: html + +
+ +with **gcc -Wall -o hello hello.c**. Run the executable with **strace +./hello**. Are you impressed? Every line you see corresponds to a system +call. strace is a handy program that gives you +details about what system calls a program is making, including which +call is made, what its arguments are and what it returns. It's an +invaluable tool for figuring out things like what files a program is +trying to access. Towards the end, you'll see a line which looks like +write (1, "hello", 5hello). There it is. The face behind the printf() +mask. You may not be familiar with write, since most people use library +functions for file I/O (like fopen, fputs, fclose). If that's the case, +try looking at man 2 write. The 2nd man section is devoted to system +calls (like kill() and read()). The 3rd man section is devoted to +library calls, which you would probably be more familiar with (like +cosh() and random()). + +You can even write modules to replace the kernel's system calls, which +we'll do shortly. Crackers often make use of this sort of thing for +backdoors or trojans, but you can write your own modules to do more +benign things, like have the kernel write Tee hee, that tickles! +every time someone tries to delete a file on your system. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: User Space vs Kernel Space + :name: orga7850df + +.. raw:: html + +
+ +A kernel is all about access to resources, whether the resource in +question happens to be a video card, a hard drive or even memory. +Programs often compete for the same resource. As I just saved this +document, updatedb started updating the locate database. My vim session +and updatedb are both using the hard drive concurrently. The kernel +needs to keep things orderly, and not give users access to resources +whenever they feel like it. To this end, a CPU can run in different +modes. Each mode gives a different level of freedom to do what you want +on the system. The Intel 80386 architecture had 4 of these modes, which +were called rings. Unix uses only two rings; the highest ring (ring 0, +also known as \`supervisor mode' where everything is allowed to happen) +and the lowest ring, which is called \`user mode'. + +Recall the discussion about library functions vs system calls. +Typically, you use a library function in user mode. The library function +calls one or more system calls, and these system calls execute on the +library function's behalf, but do so in supervisor mode since they are +part of the kernel itself. Once the system call completes its task, it +returns and execution gets transferred back to user mode. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Name Space + :name: org4b4877b + +.. raw:: html + +
+ +When you write a small C program, you use variables which are convenient +and make sense to the reader. If, on the other hand, you're writing +routines which will be part of a bigger problem, any global variables +you have are part of a community of other people's global variables; +some of the variable names can clash. When a program has lots of global +variables which aren't meaningful enough to be distinguished, you get +namespace pollution. In large projects, effort must be made to remember +reserved names, and to find ways to develop a scheme for naming unique +variable names and symbols. + +When writing kernel code, even the smallest module will be linked +against the entire kernel, so this is definitely an issue. The best way +to deal with this is to declare all your variables as static and to use +a well-defined prefix for your symbols. By convention, all kernel +prefixes are lowercase. If you don't want to declare everything as +static, another option is to declare a symbol table and register it with +the kernel. We'll get to this later. + +The file **/proc/kallsyms** holds all the symbols that the kernel knows +about and which are therefore accessible to your modules since they +share the kernel's codespace. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Code space + :name: org7e3a491 + +.. raw:: html + +
+ +Memory management is a very complicated subject and the majority of +O'Reilly's "*Understanding The Linux Kernel*" exclusively covers memory +management! We're not setting out to be experts on memory management, +but we do need to know a couple of facts to even begin worrying about +writing real modules. + +If you haven't thought about what a segfault really means, you may be +surprised to hear that pointers don't actually point to memory +locations. Not real ones, anyway. When a process is created, the kernel +sets aside a portion of real physical memory and hands it to the process +to use for its executing code, variables, stack, heap and other things +which a computer scientist would know about. This memory begins with +0x00000000 and extends up to whatever it needs to be. Since the memory +spaces of any two processes don't overlap, every process that can access +a memory address, say 0xbffff978, would be accessing a different +location in real physical memory! The processes would be accessing an +index named 0xbffff978 which points to some kind of offset into the +region of memory set aside for that particular process. For the most +part, a process like our Hello, World program can't access the space of +another process, although there are ways which we'll talk about later. + +The kernel has its own space of memory as well. Since a module is code +which can be dynamically inserted and removed in the kernel (as opposed +to a semi-autonomous object), it shares the kernel's codespace rather +than having its own. Therefore, if your module segfaults, the kernel +segfaults. And if you start writing over data because of an off-by-one +error, then you're trampling on kernel data (or code). This is even +worse than it sounds, so try your best to be careful. + +By the way, I would like to point out that the above discussion is true +for any operating system which uses a monolithic kernel. 
This isn't +quite the same thing as *"building all your modules into the kernel"*, +although the idea is the same. There are things called microkernels +which have modules which get their own codespace. The GNU Hurd and the +Magenta kernel of Google Fuchsia are two examples of a microkernel. + +.. raw:: html + +
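The claim that the same pointer value names different physical memory in different processes can be observed from user space. The following is a minimal sketch, assuming a POSIX system with fork(); the function and variable names are made up for this illustration and are not part of the guide's examples. The child overwrites a variable, yet the parent, looking at the very same virtual address, still sees its own copy.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lives at one virtual address, which parent and child both use. */
static int shared_looking_value = 1;

/*
 * Fork a child that overwrites the variable, then report what the
 * parent still sees at the same virtual address.
 * Returns the parent's value, or -1 on error.
 */
int parent_value_after_child_write(void)
{
    pid_t pid;
    int status;

    shared_looking_value = 1;
    fflush(stdout);              /* avoid duplicated stdio buffers */

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* The child starts with a private copy of the parent's memory. */
        assert(shared_looking_value == 1);
        shared_looking_value = 42;   /* changes only the child's copy */
        _exit(shared_looking_value == 42 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* Same address, different process, different physical memory. */
    return shared_looking_value;
}
```

Calling parent_value_after_child_write() from a small main() and printing the result shows the parent's value unchanged. A kernel module gets no such isolation from the rest of the kernel, which is why a stray write there is so much more dangerous.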
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Device Drivers + :name: org6c0b122 + +.. raw:: html + +
+ +One class of module is the device driver, which provides functionality +for hardware like a serial port. On Unix, each piece of hardware is +represented by a file located in /dev named a device file which provides +the means to communicate with the hardware. The device driver provides +the communication on behalf of a user program. So the es1370.o sound +card device driver might connect the /dev/sound device file to the +Ensoniq ES1370 sound card. A userspace program like mp3blaster can use +/dev/sound without ever knowing what kind of sound card is installed. + +.. raw:: html + +
+ +- | Major and Minor Numbers + | ::: {#text-orga17bef9 .outline-text-5} Let's look at some device + files. Here are device files which represent the first three + partitions on the primary master IDE hard drive: + + .. raw:: html + +
+ + :: + + # ls -l /dev/hda[1-3] + brw-rw---- 1 root disk 3, 1 Jul 5 2000 /dev/hda1 + brw-rw---- 1 root disk 3, 2 Jul 5 2000 /dev/hda2 + brw-rw---- 1 root disk 3, 3 Jul 5 2000 /dev/hda3 + + .. raw:: html + +
+ + Notice the column of numbers separated by a comma? The first number + is called the device's major number. The second number is the minor + number. The major number tells you which driver is used to access the + hardware. Each driver is assigned a unique major number; all device + files with the same major number are controlled by the same driver. + All the above major numbers are 3, because they're all controlled by + the same driver. + + The minor number is used by the driver to distinguish between the + various hardware it controls. Returning to the example above, + although all three devices are handled by the same driver they have + unique minor numbers because the driver sees them as being different + pieces of hardware. + + Devices are divided into two types: character devices and block + devices. The difference is that block devices have a buffer for + requests, so they can choose the best order in which to respond to + the requests. This is important in the case of storage devices, where + it's faster to read or write sectors which are close to each other, + rather than those which are further apart. Another difference is that + block devices can only accept input and return output in blocks + (whose size can vary according to the device), whereas character + devices are allowed to use as many or as few bytes as they like. Most + devices in the world are character, because they don't need this type + of buffering, and they don't operate with a fixed block size. You can + tell whether a device file is for a block device or a character + device by looking at the first character in the output of ls -l. If + it's \`b' then it's a block device, and if it's \`c' then it's a + character device. The devices you see above are block devices. Here + are some character devices (the serial ports): + + .. raw:: html + +
+ + :: + + crw-rw---- 1 root dial 4, 64 Feb 18 23:34 /dev/ttyS0 + crw-r----- 1 root dial 4, 65 Nov 17 10:26 /dev/ttyS1 + crw-rw---- 1 root dial 4, 66 Jul 5 2000 /dev/ttyS2 + crw-rw---- 1 root dial 4, 67 Jul 5 2000 /dev/ttyS3 + + .. raw:: html + +
+ + If you want to see which major numbers have been assigned, you can + look at /usr/src/linux/Documentation/devices.txt. + + When the system was installed, all of those device files were created + by the mknod command. To create a new char device named \`coffee' + with major/minor number 12 and 2, simply do mknod /dev/coffee c 12 2. + You don't have to put your device files into /dev, but it's done by + convention. Linus put his device files in /dev, and so should you. + However, when creating a device file for testing purposes, it's + probably OK to place it in your working directory where you compile + the kernel module. Just be sure to put it in the right place when + you're done writing the device driver. + + I would like to make a few last points which are implicit from the + above discussion, but I'd like to make them explicit just in case. + When a device file is accessed, the kernel uses the major number of + the file to determine which driver should be used to handle the + access. This means that the kernel doesn't really need to use or even + know about the minor number. The driver itself is the only thing that + cares about the minor number. It uses the minor number to distinguish + between different pieces of hardware. + + By the way, when I say \`hardware', I mean something a bit more + abstract than a PCI card that you can hold in your hand. Look at + these two device files: + + .. raw:: html + +
+ + :: + + % ls -l /dev/fd0 /dev/fd0u1680 + brwxrwxrwx 1 root floppy 2, 0 Jul 5 2000 /dev/fd0 + brw-rw---- 1 root floppy 2, 44 Jul 5 2000 /dev/fd0u1680 + + .. raw:: html + +
+ + By now you can look at these two device files and know instantly that + they are block devices and are handled by the same driver (block major + 2). You might even be aware that these both represent your floppy + drive, even if you only have one floppy drive. Why two files? One + represents the floppy drive with 1.44 MB of storage. The other is the + same floppy drive with 1.68 MB of storage, and corresponds to what + some people call a \`superformatted' disk, one that holds more data + than a standard formatted floppy. So here's a case where two device + files with different minor numbers actually represent the same piece + of physical hardware. So just be aware that the word \`hardware' in + our discussion can mean something very abstract. + + ::: + +.. raw:: html + +
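The major and minor numbers shown by ls -l can also be read programmatically. The sketch below is a user-space illustration, assuming glibc, where major(), minor() and makedev() come from sys/sysmacros.h; the two helper function names are invented for this example. It reports the same type letter and number pair that appear in the listings above.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/*
 * Classify a file mode the way the first column of `ls -l` does:
 * 'b' for a block device, 'c' for a character device, '-' otherwise.
 */
char device_type_char(mode_t mode)
{
    if (S_ISBLK(mode))
        return 'b';
    if (S_ISCHR(mode))
        return 'c';
    return '-';
}

/*
 * Print "<type> <major>, <minor> <path>" for a device file.
 * Returns 0 on success, -1 if stat() fails or path is not a device file.
 */
int print_device_numbers(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISBLK(st.st_mode) && !S_ISCHR(st.st_mode))
        return -1;

    /* st_rdev holds the device number that the device file refers to. */
    printf("%c %u, %u %s\n", device_type_char(st.st_mode),
           major(st.st_rdev), minor(st.st_rdev), path);
    return 0;
}
```

For instance, print_device_numbers("/dev/ttyS0") would print the character-device letter followed by the major and minor pair for the first serial port, if that device file exists on the machine.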
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Character Device drivers + :name: org016c39a + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: The file_operations Structure + :name: org31d952e + +.. raw:: html + +
+ +The file_operations structure is defined in **include/linux/fs.h** in +the kernel source tree, and holds pointers to functions defined by the +driver that perform various operations on the device. Each field of the +structure corresponds to the address of some function defined by the +driver to handle a requested operation. + +For example, every character driver needs to define a function that +reads from the device. The file_operations structure holds the address +of the module's function that performs that operation. Here is what the +definition looks like for a 3.x kernel (the exact set of fields varies +between kernel releases): + +.. raw:: html + +
+ +:: + + struct file_operations { + struct module *owner; + loff_t (*llseek) (struct file *, loff_t, int); + ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); + ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + int (*iterate) (struct file *, struct dir_context *); + unsigned int (*poll) (struct file *, struct poll_table_struct *); + long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); + long (*compat_ioctl) (struct file *, unsigned int, unsigned long); + int (*mmap) (struct file *, struct vm_area_struct *); + int (*open) (struct inode *, struct file *); + int (*flush) (struct file *, fl_owner_t id); + int (*release) (struct inode *, struct file *); + int (*fsync) (struct file *, loff_t, loff_t, int datasync); + int (*aio_fsync) (struct kiocb *, int datasync); + int (*fasync) (int, struct file *, int); + int (*lock) (struct file *, int, struct file_lock *); + ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); + unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); + int (*check_flags)(int); + int (*flock) (struct file *, int, struct file_lock *); + ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); + ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); + int (*setlease)(struct file *, long, struct file_lock **); + long (*fallocate)(struct file *file, int mode, loff_t offset, + loff_t len); + int (*show_fdinfo)(struct seq_file *m, struct file *f); + }; + +.. raw:: html + +
+ +Some operations are not implemented by a driver. For example, a driver +that handles a video card won't need to read from a directory structure. +The corresponding entries in the file_operations structure should be set +to NULL. + +There is a gcc extension that makes assigning to this structure more +convenient. You'll see it in existing drivers, and it may catch you by +surprise. This is what that way of assigning to the structure looks +like: + +.. raw:: html + +
+ +:: + + struct file_operations fops = { + read: device_read, + write: device_write, + open: device_open, + release: device_release + }; + +.. raw:: html + +
+ +However, there's also a C99 way of assigning to elements of a structure, +and this is definitely preferred over using the GNU extension. The +version of gcc the author used when writing this, 2.95, supports the new +C99 syntax. You should use this syntax in case someone wants to port +your driver. It will help with compatibility: + +.. raw:: html + +
+ +:: + + struct file_operations fops = { + .read = device_read, + .write = device_write, + .open = device_open, + .release = device_release + }; + +.. raw:: html + +
+ +The meaning is clear, and you should be aware that any member of the +structure which you don't explicitly assign will be initialized to NULL +by gcc. + +An instance of struct file_operations containing pointers to functions +that are used to implement read, write, open, … syscalls is commonly +named fops. + +.. raw:: html + +
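The rule that unnamed members end up as NULL is ordinary C99 behaviour and can be checked outside the kernel. This is a minimal user-space sketch; struct demo_operations and the demo_* functions are hypothetical stand-ins for the kernel's struct file_operations and a driver's handlers, not real kernel types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the kernel's struct file_operations. */
struct demo_operations {
    int (*open)(void);
    int (*release)(void);
    long (*read)(char *buf, size_t len);
    long (*write)(const char *buf, size_t len);
};

static int demo_open(void)
{
    return 0;
}

static long demo_read(char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = 'x';
    return (long)len;
}

/* Only two members are named; the others are implicitly zero, i.e. NULL. */
static const struct demo_operations demo_ops = {
    .open = demo_open,
    .read = demo_read,
};

/* Dispatch the way a user of an ops table must: check for NULL first. */
long demo_call_write(const struct demo_operations *ops,
                     const char *buf, size_t len)
{
    assert(ops != NULL);
    if (ops->write == NULL)
        return -1;      /* operation not supported by this "driver" */
    return ops->write(buf, len);
}
```

Because the unset entries are reliably NULL, code that walks such a table can tell "not implemented" apart from "implemented" with a simple pointer test before calling through it.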
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: The file structure + :name: org607b208 + +.. raw:: html + +
+ +Each open device is represented in the kernel by a file structure, which is +defined in **linux/fs.h**. Be aware that a file is a kernel level +structure and never appears in a user space program. It's not the same +thing as a **FILE**, which is defined by glibc and would never appear in +a kernel space function. Also, its name is a bit misleading; it +represents an abstract open \`file', not a file on a disk, which is +represented by a structure named inode. + +An instance of struct file is commonly named filp. You'll also see it +referred to as struct file file. Resist the temptation. + +Go ahead and look at the definition of file. Most of the entries you +see, like struct dentry, aren't used by device drivers, and you can +ignore them. This is because drivers don't fill file directly; they only +use structures contained in file which are created elsewhere. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Registering A Device + :name: orgf96ab85 + +.. raw:: html + +
+ +As discussed earlier, char devices are accessed through device files, +usually located in /dev. This is by convention. When writing a driver, +it's OK to put the device file in your current directory. Just make sure +you place it in /dev for a production driver. The major number tells you +which driver handles which device file. The minor number is used only by +the driver itself to differentiate which device it's operating on, just +in case the driver handles more than one device. + +Adding a driver to your system means registering it with the kernel. +This is synonymous with assigning it a major number during the module's +initialization. You do this by using the register_chrdev function, +defined by linux/fs.h. + +.. raw:: html + +
+ +:: + + int register_chrdev(unsigned int major, const char *name, struct file_operations *fops); + +.. raw:: html + +
+ +where unsigned int major is the major number you want to request, *const +char \*name* is the name of the device as it'll appear in +**/proc/devices** and *struct file_operations \*fops* is a pointer to +the file_operations table for your driver. A negative return value means +the registration failed. Note that we didn't pass the minor number to +register_chrdev. That's because the kernel doesn't care about the minor +number; only our driver uses it. + +Now the question is, how do you get a major number without hijacking one +that's already in use? The easiest way would be to look through +Documentation/devices.txt and pick an unused one. That's a bad way of +doing things because you'll never be sure if the number you picked will +be assigned later. The answer is that you can ask the kernel to assign +you a dynamic major number. + +If you pass a major number of 0 to register_chrdev, the return value +will be the dynamically allocated major number. The downside is that you +can't make a device file in advance, since you don't know what the major +number will be. There are a few ways to deal with this. First, the driver +itself can print the newly assigned number and we can make the device +file by hand. Second, the newly registered device will have an entry in +**/proc/devices**, and we can either make the device file by hand or +write a shell script to read the file in and make the device file. The +third method is we can have our driver make the device file using +the mknod system call after a successful registration and remove it during the +call to cleanup_module. + +.. raw:: html + +
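The second approach, looking the dynamically assigned number up in **/proc/devices**, amounts to a small parsing job. The sketch below does it in user-space C on text that has already been read into memory; find_char_major is a hypothetical helper name, and the layout it expects (a "Character devices:" heading, lines of "number name", then a "Block devices:" heading) is an assumption based on how the file typically looks.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Search text in /proc/devices format for a character device called
 * `name`. Returns its major number, or -1 if it is not listed.
 */
int find_char_major(const char *text, const char *name)
{
    const char *line = text;
    int in_char_section = 0;

    assert(text != NULL && name != NULL);

    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);
        char buf[128];
        char dev[64];
        int major;

        /* Copy one line into a NUL-terminated scratch buffer. */
        if (len >= sizeof(buf))
            len = sizeof(buf) - 1;
        memcpy(buf, line, len);
        buf[len] = '\0';

        if (strncmp(buf, "Character devices:", 18) == 0)
            in_char_section = 1;
        else if (strncmp(buf, "Block devices:", 14) == 0)
            in_char_section = 0;
        else if (in_char_section &&
                 sscanf(buf, "%d %63s", &major, dev) == 2 &&
                 strcmp(dev, name) == 0)
            return major;

        if (end == NULL)
            break;
        line = end + 1;
    }
    return -1;
}
```

A wrapper could read **/proc/devices** into a buffer, call find_char_major(buffer, "chardev"), and pass the result to mknod, which is the job the shell script mentioned above would do.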
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Unregistering A Device + :name: org452ea75 + +.. raw:: html + +
+ +We can't allow the kernel module to be rmmod'ed whenever root feels like +it. If the device file is opened by a process and then we remove the +kernel module, using the file would cause a call to the memory location +where the appropriate function (read/write) used to be. If we're lucky, +no other code was loaded there, and we'll get an ugly error message. If +we're unlucky, another kernel module was loaded into the same location, +which means a jump into the middle of another function within the +kernel. The results of this would be impossible to predict, but they +can't be very positive. + +Normally, when you don't want to allow something, you return an error +code (a negative number) from the function which is supposed to do it. +With cleanup_module that's impossible because it's a void function. +However, there's a counter which keeps track of how many processes are +using your module. You can see what its value is by looking at the 3rd +field of **/proc/modules**. If this number isn't zero, rmmod will fail. +Note that you don't have to check the counter from within cleanup_module +because the check will be performed for you by the system call +sys_delete_module, defined in **kernel/module.c**. You shouldn't use this +counter directly, but there are functions defined in **linux/module.h** +which let you increase and decrease this counter: + +- try_module_get(THIS_MODULE): Increment the use count. +- module_put(THIS_MODULE): Decrement the use count. + +It's important to keep the counter accurate; if you ever do lose track +of the correct usage count, you'll never be able to unload the module; +it's now reboot time, boys and girls. This is bound to happen to you +sooner or later during a module's development. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: chardev.c + :name: orgdd49880 + +.. raw:: html + +
+ +The next code sample creates a char driver named chardev. You can cat +its device file. + +.. raw:: html + +
+ +:: + + cat /proc/devices + +.. raw:: html + +
+ +(or open the file with a program) and the driver will put the number of +times the device file has been read from into the file. We don't support +writing to the file (like **echo "hi" > /dev/hello**), but catch these +attempts and tell the user that the operation isn't supported. Don't +worry if you don't see what we do with the data we read into the buffer; +we don't do much with it. We simply read in the data and print a message +acknowledging that we received it. + +.. raw:: html + +
+ +:: + + /* + * chardev.c: Creates a read-only char device that says how many times + * you've read from the dev file + */ + + #include <linux/kernel.h> + #include <linux/module.h> + #include <linux/fs.h> + #include <linux/uaccess.h> /* for put_user */ + + /* + * Prototypes - this would normally go in a .h file + */ + int init_module(void); + void cleanup_module(void); + static int device_open(struct inode *, struct file *); + static int device_release(struct inode *, struct file *); + static ssize_t device_read(struct file *, char __user *, size_t, loff_t *); + static ssize_t device_write(struct file *, const char __user *, size_t, loff_t *); + + #define SUCCESS 0 + #define DEVICE_NAME "chardev" /* Dev name as it appears in /proc/devices */ + #define BUF_LEN 80 /* Max length of the message from the device */ + + /* + * Global variables are declared as static, so are global within the file. + */ + + static int Major; /* Major number assigned to our device driver */ + static int Device_Open = 0; /* Is device open? + * Used to prevent multiple access to device */ + static char msg[BUF_LEN]; /* The msg the device will give when asked */ + static char *msg_Ptr; + + static struct file_operations fops = { + .read = device_read, + .write = device_write, + .open = device_open, + .release = device_release + }; + + /* + * This function is called when the module is loaded + */ + int init_module(void) + { + Major = register_chrdev(0, DEVICE_NAME, &fops); + + if (Major < 0) { + printk(KERN_ALERT "Registering char device failed with %d\n", Major); + return Major; + } + + printk(KERN_INFO "I was assigned major number %d. To talk to\n", Major); + printk(KERN_INFO "the driver, create a dev file with\n"); + printk(KERN_INFO "'mknod /dev/%s c %d 0'.\n", DEVICE_NAME, Major); + printk(KERN_INFO "Try various minor numbers. 
Try to cat and echo to\n"); + printk(KERN_INFO "the device file.\n"); + printk(KERN_INFO "Remove the device file and module when done.\n"); + + return SUCCESS; + } + + /* + * This function is called when the module is unloaded + */ + void cleanup_module(void) + { + /* + * Unregister the device + */ + unregister_chrdev(Major, DEVICE_NAME); + } + + /* + * Methods + */ + + /* + * Called when a process tries to open the device file, like + * "cat /dev/mycharfile" + */ + static int device_open(struct inode *inode, struct file *file) + { + static int counter = 0; + + if (Device_Open) + return -EBUSY; + + Device_Open++; + sprintf(msg, "I already told you %d times Hello world!\n", counter++); + msg_Ptr = msg; + try_module_get(THIS_MODULE); + + return SUCCESS; + } + + /* + * Called when a process closes the device file. + */ + static int device_release(struct inode *inode, struct file *file) + { + Device_Open--; /* We're now ready for our next caller */ + + /* + * Decrement the usage count, or else once you opened the file, you'll + * never get rid of the module. + */ + module_put(THIS_MODULE); + + return SUCCESS; + } + + /* + * Called when a process, which already opened the dev file, attempts to + * read from it. + */ + static ssize_t device_read(struct file *filp, /* see include/linux/fs.h */ + char __user *buffer, /* buffer to fill with data */ + size_t length, /* length of the buffer */ + loff_t * offset) + { + /* + * Number of bytes actually written to the buffer + */ + int bytes_read = 0; + + /* + * If we're at the end of the message, + * return 0 signifying end of file + */ + if (*msg_Ptr == 0) + return 0; + + /* + * Actually put the data into the buffer + */ + while (length && *msg_Ptr) { + + /* + * The buffer is in the user data segment, not the kernel + * segment so "*" assignment won't work. We have to use + * put_user which copies data from the kernel data segment to + * the user data segment. 
+ */ + put_user(*(msg_Ptr++), buffer++); + + length--; + bytes_read++; + } + + /* + * Most read functions return the number of bytes put into the buffer + */ + return bytes_read; + } + + /* + * Called when a process writes to dev file: echo "hi" > /dev/hello + */ + static ssize_t device_write(struct file *filp, + const char __user *buff, + size_t len, + loff_t * off) + { + printk(KERN_ALERT "Sorry, this operation isn't supported.\n"); + return -EINVAL; + } + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Writing Modules for Multiple Kernel Versions + :name: org903f5d5 + +.. raw:: html + +
+ +The system calls, which are the major interface the kernel shows to the +processes, generally stay the same across versions. A new system call +may be added, but usually the old ones will behave exactly like they +used to. This is necessary for backward compatibility: a new kernel +version is not supposed to break regular processes. In most cases, the +device files will also remain the same. On the other hand, the internal +interfaces within the kernel can and do change between versions. + +The Linux kernel versions are divided between the stable versions +(n.<even number>.m) and the development versions (n.<odd +number>.m). The development versions include all the cool new ideas, +including those which will be considered a mistake, or reimplemented, in +the next version. As a result, you can't trust the interface to remain +the same in those versions (which is why I don't bother to support them +in this book; it's too much work and it would become dated too quickly). +In the stable versions, on the other hand, we can expect the interface +to remain the same regardless of the bug fix version (the m number). + +There are differences between different kernel versions, and if you want +to support multiple kernel versions, you'll find yourself having to code +conditional compilation directives. The way to do this is to compare the +macro LINUX_VERSION_CODE to the macro KERNEL_VERSION. In version a.b.c +of the kernel, the value of this macro would be :math:`2^{16}a+2^{8}b+c`. + +While previous versions of this guide showed how you can write backward +compatible code with such constructs in great detail, we decided to +break with this tradition for the better. People interested in doing +so can use the LKMPG version that matches their kernel. We +decided to version the LKMPG like the kernel, at least as far as major +and minor number are concerned. 
We use the patchlevel for our own +versioning so use LKMPG version 2.4.x for kernels 2.4.x, use LKMPG +version 2.6.x for kernels 2.6.x and so on. Also make sure that you +always use current, up to date versions of both kernel and guide. + +You might already have noticed that recent kernels look different. In +case you haven't, they look like 2.6.x.y now. The meaning of the first +three items basically stays the same, but a subpatchlevel has been added +and will indicate security fixes till the next stable patchlevel is out. +So people can choose between a stable tree with security updates and use +the latest kernel as developer tree. Search the kernel mailing list +archives if you're interested in the full story. + +.. raw:: html + +
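The version arithmetic and the conditional compilation it enables can be tried in user space. In this sketch DEMO_KERNEL_VERSION repeats the formula given in this section under a made-up name, DEMO_LINUX_VERSION_CODE is a hypothetical stand-in for the LINUX_VERSION_CODE value a real build would take from the kernel headers, and the 3.10.0 threshold and interface names are invented purely for illustration.

```c
#include <assert.h>
#include <string.h>

/* The same arithmetic as the formula in the text: 2^16*a + 2^8*b + c. */
#define DEMO_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical: pretend we are building against kernel 4.12.12. */
#define DEMO_LINUX_VERSION_CODE DEMO_KERNEL_VERSION(4, 12, 12)

/*
 * Pick a code path at compile time. The threshold is an invented
 * example, not a statement about any real kernel interface.
 */
const char *demo_interface_name(void)
{
#if DEMO_LINUX_VERSION_CODE >= DEMO_KERNEL_VERSION(3, 10, 0)
    return "new_interface";
#else
    return "old_interface";
#endif
}
```

For 4.12.12 the encoded value is 4 * 65536 + 12 * 256 + 12 = 265228, so the comparison against 3.10.0 (199168) selects the first branch. In a module, the same #if pattern would guard the code that differs between kernel releases.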
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: The /proc File System + :name: org6400501 + +.. raw:: html + +
+ +In Linux, there is an additional mechanism for the kernel and kernel +modules to send information to processes: the **/proc** file system. +Originally designed to allow easy access to information about processes +(hence the name), it is now used by every bit of the kernel which has +something interesting to report, such as **/proc/modules** which +provides the list of modules and **/proc/meminfo** which reports memory +usage statistics. + +The method to use the proc file system is very similar to the one used +with device drivers: a structure is created with all the information +needed for the **/proc** file, including pointers to any handler +functions (in our case there is only one, the one called when somebody +attempts to read from the **/proc** file). Then, init_module registers +the structure with the kernel and cleanup_module unregisters it. + +Normal file systems are located on a disk, rather than just in memory +(which is where **/proc** is), and in that case the inode number is a +pointer to a disk location where the file's index-node (inode for short) +is located. The inode contains information about the file, for example +the file's permissions, together with a pointer to the disk location or +locations where the file's data can be found. + +Because we don't get called when the file is opened or closed, there's +nowhere for us to put try_module_get and module_put in this module, +and if the file is opened and then the module is removed, there's no way +to avoid the consequences. + +Here is a simple example showing how to use a **/proc** file. This is the +HelloWorld for the **/proc** filesystem. There are three parts: create +the file **/proc/helloworld** in the function init_module, return a +value (and a buffer) when the file **/proc/helloworld** is read in the +callback function **procfile_read**, and delete the file +**/proc/helloworld** in the function cleanup_module. 
+ +The file **/proc/helloworld** is created when the module is loaded with the +function **proc_create**. The return value is a pointer to a **struct +proc_dir_entry**, and it can be used to configure the file +**/proc/helloworld** (for example, the owner of this file). A null +return value means that the creation has failed. + +Every time the file **/proc/helloworld** is read, the function +**procfile_read** is called. Two parameters of this function are very +important: the buffer (the second parameter) and the offset (the fourth +one). The content of the buffer will be returned to the application +which read it (for example the cat command). The offset is the current +position in the file. If the return value of the function isn't zero, +then this function is called again. So be careful with this function: if +it never returns zero, the read function is called endlessly. + +.. raw:: html + +
+ +:: + + # cat /proc/helloworld + HelloWorld! + +.. raw:: html + +
+ +.. raw:: html + +
+ +:: + + /* + procfs1.c + */ + + #include <linux/module.h> + #include <linux/kernel.h> + #include <linux/proc_fs.h> + #include <linux/uaccess.h> /* for copy_to_user */ + + #define procfs_name "helloworld" + + struct proc_dir_entry *Our_Proc_File; + + + ssize_t procfile_read(struct file *filePointer, char __user *buffer, + size_t buffer_length, loff_t *offset) + { + static const char msg[] = "HelloWorld!\n"; + size_t len = sizeof(msg) - 1; + + /* A non-zero offset means the message was already delivered: report EOF */ + if (*offset > 0 || buffer_length < len) + return 0; + + /* The buffer lives in user space, so it must not be touched directly */ + if (copy_to_user(buffer, msg, len)) + return -EFAULT; + + printk(KERN_INFO "procfile read %s\n", filePointer->f_path.dentry->d_name.name); + *offset += len; + return len; + } + + static const struct file_operations proc_file_fops = { + .owner = THIS_MODULE, + .read = procfile_read, + }; + + int init_module() + { + Our_Proc_File = proc_create(procfs_name,0644,NULL,&proc_file_fops); + if(NULL==Our_Proc_File) { + printk(KERN_ALERT "Error:Could not initialize /proc/%s\n",procfs_name); + return -ENOMEM; + } + + printk(KERN_INFO "/proc/%s created\n", procfs_name); + return 0; + } + + void cleanup_module() + { + proc_remove(Our_Proc_File); + printk(KERN_INFO "/proc/%s removed\n", procfs_name); + } + +.. raw:: html + +
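The rule that a read handler must eventually return zero is easier to see with the offset bookkeeping isolated from the kernel. The following is a user-space model, not kernel code: model_read and count_reads_until_eof are hypothetical names, and memcpy stands in for copy_to_user. The kernel provides a helper with similar semantics, simple_read_from_buffer, which a real module could use instead.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/*
 * Copy at most `count` bytes of `src` (holding `available` bytes),
 * starting at *pos, into dst. Advances *pos and returns the number of
 * bytes copied; returns 0 once *pos has reached the end of the data.
 */
ssize_t model_read(char *dst, size_t count, long *pos,
                   const char *src, size_t available)
{
    size_t remaining;

    assert(dst != NULL && pos != NULL && src != NULL);
    if (*pos < 0)
        return -1;
    if ((size_t)*pos >= available)
        return 0;                      /* end of file */

    remaining = available - (size_t)*pos;
    if (count > remaining)
        count = remaining;

    memcpy(dst, src + *pos, count);    /* copy_to_user() in a module */
    *pos += (long)count;
    return (ssize_t)count;
}

/* Drive the handler the way cat does: keep reading until it returns 0. */
int count_reads_until_eof(size_t chunk)
{
    static const char msg[] = "HelloWorld!\n";
    char buf[32];
    long pos = 0;
    int calls = 0;

    assert(chunk > 0 && chunk <= sizeof(buf));
    while (model_read(buf, chunk, &pos, msg, sizeof(msg) - 1) > 0)
        calls++;
    return calls;
}
```

With a 5-byte chunk the 12-byte message takes three productive reads (5, 5 and 2 bytes) before the terminating zero. A handler that ignored the offset would return the full message on every call and the loop would never end.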
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Read and Write a /proc File + :name: orga906618 + +.. raw:: html + +
+ +We have seen a very simple example for a /proc file where we only read +the file /proc/helloworld. It's also possible to write to a /proc file. +It works the same way as read: a function is called when the /proc file +is written. There is one difference from read, though: the data comes from +the user, so you have to import it from user space to kernel space (with +copy_from_user or get_user). + +The reason for copy_from_user or get_user is that Linux memory (on Intel +architecture, it may be different under some other processors) is +segmented. This means that a pointer, by itself, does not reference a +unique location in memory, only a location in a memory segment, and you +need to know which memory segment it is to be able to use it. There is +one memory segment for the kernel, and one for each of the processes. + +The only memory segment accessible to a process is its own, so when +writing regular programs to run as processes, there's no need to worry +about segments. When you write a kernel module, normally you want to +access the kernel memory segment, which is handled automatically by the +system. However, when the content of a memory buffer needs to be passed +between the currently running process and the kernel, the kernel +function receives a pointer to the memory buffer which is in the process +segment. The put_user and get_user macros allow you to access that +memory. These macros handle only one character; you can handle several +characters with copy_to_user and copy_from_user. Because the buffer passed +to the read or write function is in user space, the write function has to +import the data with copy_from_user before using it, and the read +function has to export its data with copy_to_user. + +.. raw:: html + +
+ +:: + + /** + * procfs2.c - create a "file" in /proc + * + */ + + #include <linux/module.h> /* Specifically, a module */ + #include <linux/kernel.h> /* We're doing kernel work */ + #include <linux/proc_fs.h> /* Necessary because we use the proc fs */ + #include <linux/uaccess.h> /* for copy_from_user */ + + #define PROCFS_MAX_SIZE 1024 + #define PROCFS_NAME "buffer1k" + + /** + * This structure holds information about the /proc file + * + */ + static struct proc_dir_entry *Our_Proc_File; + + /** + * The buffer used to store characters for this module + * + */ + static char procfs_buffer[PROCFS_MAX_SIZE]; + + /** + * The size of the buffer + * + */ + static unsigned long procfs_buffer_size = 0; + + /** + * This function is called when the /proc file is read + * + */ + ssize_t procfile_read(struct file *filePointer, char __user *buffer, + size_t buffer_length, loff_t *offset) + { + static const char msg[] = "HelloWorld!\n"; + size_t len = sizeof(msg) - 1; + + /* A non-zero offset means the message was already delivered: report EOF */ + if (*offset > 0 || buffer_length < len) + return 0; + + if (copy_to_user(buffer, msg, len)) + return -EFAULT; + + printk(KERN_INFO "procfile read %s\n", filePointer->f_path.dentry->d_name.name); + *offset += len; + return len; + } + + + /** + * This function is called when the /proc file is written + * + */ + static ssize_t procfile_write(struct file *file, const char __user *buff, + size_t len, loff_t *off) + { + /* Leave room for the terminating '\0' */ + procfs_buffer_size = len; + if (procfs_buffer_size > PROCFS_MAX_SIZE - 1) + procfs_buffer_size = PROCFS_MAX_SIZE - 1; + + if (copy_from_user(procfs_buffer, buff, procfs_buffer_size)) + return -EFAULT; + + procfs_buffer[procfs_buffer_size] = '\0'; + return procfs_buffer_size; + } + + static const struct file_operations proc_file_fops = { + .owner = THIS_MODULE, + .read = procfile_read, + .write = procfile_write, + }; + + /** + *This function is called when the module is loaded + * + */ + 
int init_module() + { + Our_Proc_File = proc_create(PROCFS_NAME,0644,NULL,&proc_file_fops); + if(NULL==Our_Proc_File) { + printk(KERN_ALERT "Error:Could not initialize /proc/%s\n",PROCFS_NAME); + return -ENOMEM; + } + + printk(KERN_INFO "/proc/%s 
created\n", PROCFS_NAME); + return 0; + } + + /** + *This function is called when the module is unloaded + * + */ + void cleanup_module() + { + proc_remove(Our_Proc_File); + printk(KERN_INFO "/proc/%s removed\n", PROCFS_NAME); + } + +.. raw:: html + +
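A write handler that stores its input as a NUL-terminated string has to reserve one byte of the buffer for the terminator, otherwise an input of exactly the buffer size makes the terminator land one byte past the end. Here is a small user-space sketch of the clamping logic; demo_store, demo_buffer and DEMO_MAX_SIZE are hypothetical names, and memcpy stands in for copy_from_user.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAX_SIZE 8   /* deliberately tiny so truncation is easy to see */

static char demo_buffer[DEMO_MAX_SIZE];

/*
 * Keep at most DEMO_MAX_SIZE - 1 bytes of src and NUL-terminate them.
 * Returns the number of bytes actually kept.
 */
size_t demo_store(const char *src, size_t len)
{
    size_t kept = len;

    assert(src != NULL);
    if (kept > DEMO_MAX_SIZE - 1)
        kept = DEMO_MAX_SIZE - 1;   /* reserve the last byte for '\0' */

    memcpy(demo_buffer, src, kept); /* copy_from_user() in a real module */
    demo_buffer[kept] = '\0';       /* always inside the array */
    return kept;
}
```

With an 8-byte buffer, a 10-byte input is cut down to 7 bytes plus the terminator. Returning the number of bytes kept tells the caller how much was consumed, which is also what a write handler reports back to user space.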
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Manage /proc file with standard filesystem + :name: org561d817 + +.. raw:: html + +
+ +We have seen how to read and write a /proc file with the /proc +interface. But it's also possible to manage a /proc file with inodes. The +main benefit is access to more advanced features, such as permissions. + +In Linux, there is a standard mechanism for file system registration. +Since every file system has to have its own functions to handle inode +and file operations, there is a special structure to hold pointers to +all those functions, struct **inode_operations**, which includes a +pointer to struct file_operations. + +The difference between file and inode operations is that file operations +deal with the file itself whereas inode operations deal with ways of +referencing the file, such as creating links to it. + +In /proc, whenever we register a new file, we're allowed to specify +which struct inode_operations will be used to access it. This is the +mechanism we use, a struct inode_operations which includes a pointer to +a struct file_operations which includes pointers to our procfs_read and +procfs_write functions. + +Another interesting point here is the module_permission function. This +function is called whenever a process tries to do something with the +/proc file, and it can decide whether to allow access or not. Right now +it is only based on the operation and the uid of the current user (as +available in current, a pointer to a structure which includes +information on the currently running process), but it could be based on +anything we like, such as what other processes are doing with the same +file, the time of day, or the last input we received. + +It's important to note that the standard roles of read and write are +reversed in the kernel. Read functions are used for output, whereas +write functions are used for input. 
The reason for that is that read and +write refer to the user's point of view — if a process reads something +from the kernel, then the kernel needs to output it, and if a process +writes something to the kernel, then the kernel receives it as input. + +.. raw:: html + +
+ +:: + + /* + procfs3.c + */ + + #include + #include + #include + #include + #include + + #define PROCFS_MAX_SIZE 2048 + #define PROCFS_ENTRY_FILENAME "buffer2k" + + struct proc_dir_entry *Our_Proc_File; + static char procfs_buffer[PROCFS_MAX_SIZE]; + static unsigned long procfs_buffer_size = 0; + + static ssize_t procfs_read(struct file *filp, char *buffer, + size_t length, loff_t *offset) + { + static int finished = 0; + if(finished) + { + printk(KERN_DEBUG "procfs_read: END\n"); + finished = 0; + return 0; + } + finished = 1; + if(copy_to_user(buffer, procfs_buffer, procfs_buffer_size)) + return -EFAULT; + printk(KERN_DEBUG "procfs_read: read %lu bytes\n", procfs_buffer_size); + return procfs_buffer_size; + } + static ssize_t procfs_write(struct file *file, const char *buffer, + size_t len, loff_t *off) + { + if(len>PROCFS_MAX_SIZE) + procfs_buffer_size = PROCFS_MAX_SIZE; + else + procfs_buffer_size = len; + if(copy_from_user(procfs_buffer, buffer, procfs_buffer_size)) + return -EFAULT; + printk(KERN_DEBUG "procfs_write: write %lu bytes\n", procfs_buffer_size); + return procfs_buffer_size; + } + int procfs_open(struct inode *inode, struct file *file) + { + try_module_get(THIS_MODULE); + return 0; + } + int procfs_close(struct inode *inode, struct file *file) + { + module_put(THIS_MODULE); + return 0; + } + + static struct file_operations File_Ops_4_Our_Proc_File = { + .read = procfs_read, + .write = procfs_write, + .open = procfs_open, + .release = procfs_close, + }; + + int init_module() + { + Our_Proc_File = proc_create(PROCFS_ENTRY_FILENAME, 0644, NULL,&File_Ops_4_Our_Proc_File); + if(Our_Proc_File == NULL) + { + remove_proc_entry(PROCFS_ENTRY_FILENAME, NULL); + printk(KERN_DEBUG "Error: Could not initialize /proc/%s\n", PROCFS_ENTRY_FILENAME); + return -ENOMEM; + } + proc_set_size(Our_Proc_File, 80); + proc_set_user(Our_Proc_File, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID); + + printk(KERN_DEBUG "/proc/%s created\n", PROCFS_ENTRY_FILENAME); + return 0; + } + void 
cleanup_module() + { + remove_proc_entry(PROCFS_ENTRY_FILENAME, NULL); + printk(KERN_DEBUG "/proc/%s removed\n", PROCFS_ENTRY_FILENAME); + } + +.. raw:: html + +
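The procfs_read function above signals end of file with a static flag and copies the whole buffer regardless of the length the caller asked for. A common alternative is to honour both the requested length and the file offset. The following is only a user-space sketch of that logic: sketch_read is a hypothetical name, memcpy() stands in for copy_to_user(), and a plain long stands in for loff_t. In a module, the kernel helper simple_read_from_buffer() offers similar behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/*
 * Hypothetical user-space stand-in for an offset-aware read handler.
 * memcpy() plays the role of copy_to_user(); in a module you would
 * check that call's return value and return -EFAULT on failure.
 */
static ssize_t sketch_read(char *dst, size_t length, long *offset,
                           const char *src, size_t available)
{
    size_t pos, n;

    assert(dst && offset && src);

    /* Reading at or past the end of the data means end of file */
    if (*offset < 0 || (size_t)*offset >= available)
        return 0;

    pos = (size_t)*offset;
    n = available - pos;
    if (n > length)
        n = length;          /* never copy more than the caller asked for */

    memcpy(dst, src + pos, n);
    *offset += (long)n;      /* remember where the next read continues */
    return (ssize_t)n;
}
```

Repeated calls with the same offset variable walk through the data in chunks and then return 0, which is the behaviour tools such as cat rely on to detect end of file.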
+ +Still hungry for procfs examples? First of all, keep in mind that there +are rumors that procfs is on its way out, so consider +using sysfs instead. Second, if you really can't get enough, there's a +highly recommended bonus level for procfs below +linux/Documentation/DocBook/. Use make help in your top-level kernel +directory for instructions on how to convert it into your favourite +format, for example make htmldocs. Consider using this mechanism if +you want to document something kernel-related yourself. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Manage /proc file with seq_file + :name: org38ea52f + +.. raw:: html + +
+ +As we have seen, writing a /proc file can be quite "complex". To help +people write /proc files, there is an API named seq_file that helps +format a /proc file for output. It is based on a sequence, which is +driven by three functions: start(), next(), and stop(). The seq_file API +starts a sequence when a user reads the /proc file. + +A sequence begins with a call to the function start(). If it returns +a non-NULL value, the function next() is called. This function is an +iterator whose goal is to go through all the data. Each time next() is +called, the function show() is also called. It writes data values into the +buffer read by the user. The function next() is called until it returns +NULL. The sequence ends when next() returns NULL, and then the function +stop() is called. + +BE CAREFUL: when a sequence is finished, another one starts. That means +that at the end of function stop(), the function start() is called +again. This loop finishes when the function start() returns NULL. You +can see a scheme of this in the figure "How seq_file works". + +.. raw:: html + +
+ +.. figure:: img/seq_file.png + :alt: seq_file.png + + seq_file.png + +.. raw:: html + +
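The start(), next(), show() and stop() protocol described above can be tried out in user space. The sketch below is only an illustration under simplifying assumptions: all names (sk_start, sk_next, sk_show, sk_stop, sk_run) are hypothetical, the data is a fixed array, and sk_run is a much simplified stand-in for the loop that the kernel's seq_read performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

static const int items[] = { 10, 20, 30 };
#define N_ITEMS (sizeof(items) / sizeof(items[0]))

/* start(): return the element at *pos, or NULL when past the end */
static const int *sk_start(long *pos)
{
    return (*pos >= 0 && (size_t)*pos < N_ITEMS) ? &items[*pos] : NULL;
}

/* next(): advance the position and return the following element */
static const int *sk_next(long *pos)
{
    ++*pos;
    return sk_start(pos);
}

/* show(): format one element into the output buffer */
static int sk_show(const int *v, char *out, size_t size)
{
    return snprintf(out, size, "%d\n", *v);
}

/* stop(): nothing to release in this sketch */
static void sk_stop(void)
{
}

/*
 * Simplified driver loop: start(), then show()/next() until NULL,
 * then stop(). Returns the number of bytes written to out.
 */
static size_t sk_run(char *out, size_t size)
{
    long pos = 0;
    size_t used = 0;
    const int *v;

    assert(out && size > 0);
    out[0] = '\0';
    for (v = sk_start(&pos); v != NULL; v = sk_next(&pos)) {
        int n = sk_show(v, out + used, size - used);

        if (n < 0 || (size_t)n >= size - used) {
            out[used] = '\0';   /* drop the partial record */
            break;              /* buffer full: a real reader resumes later */
        }
        used += (size_t)n;
    }
    sk_stop();
    return used;
}
```

Calling sk_run with a buffer of 32 bytes fills it with one line per array element. With a buffer that is too small, the loop stops early, which corresponds to the case where the kernel ends one sequence and begins another on the next read.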
+ +seq_file provides ready-made functions for file_operations, such as seq_read +and seq_lseek, among others. It provides nothing for writing to the /proc file, however. Of +course, you can still handle writes the same way as in the previous example. + +.. raw:: html + +
+ +:: + + /** + * procfs4.c - create a "file" in /proc + * This program uses the seq_file library to manage the /proc file. + * + */ + + #include /* We're doing kernel work */ + #include /* Specifically, a module */ + #include /* Necessary because we use proc fs */ + #include /* for seq_file */ + + #define PROC_NAME "iter" + + MODULE_AUTHOR("Philippe Reynes"); + MODULE_LICENSE("GPL"); + + /** + * This function is called at the beginning of a sequence. + * ie, when: + * - the /proc file is read (first time) + * - after the function stop (end of sequence) + * + */ + static void *my_seq_start(struct seq_file *s, loff_t *pos) + { + static unsigned long counter = 0; + + /* beginning a new sequence ? */ + if ( *pos == 0 ) { + /* yes => return a non null value to begin the sequence */ + return &counter; + } + else { + /* no => it's the end of the sequence, return end to stop reading */ + *pos = 0; + return NULL; + } + } + + /** + * This function is called after the beginning of a sequence. + * It's called untill the return is NULL (this ends the sequence). + * + */ + static void *my_seq_next(struct seq_file *s, void *v, loff_t *pos) + { + unsigned long *tmp_v = (unsigned long *)v; + (*tmp_v)++; + (*pos)++; + return NULL; + } + + /** + * This function is called at the end of a sequence + * + */ + static void my_seq_stop(struct seq_file *s, void *v) + { + /* nothing to do, we use a static value in start() */ + } + + /** + * This function is called for each "step" of a sequence + * + */ + static int my_seq_show(struct seq_file *s, void *v) + { + loff_t *spos = (loff_t *) v; + + seq_printf(s, "%Ld\n", *spos); + return 0; + } + + /** + * This structure gather "function" to manage the sequence + * + */ + static struct seq_operations my_seq_ops = { + .start = my_seq_start, + .next = my_seq_next, + .stop = my_seq_stop, + .show = my_seq_show + }; + + /** + * This function is called when the /proc file is open. 
+ * + */ + static int my_open(struct inode *inode, struct file *file) + { + return seq_open(file, &my_seq_ops); + } + + /** + * This structure gathers the functions that manage the /proc file + * + */ + static struct file_operations my_file_ops = { + .owner = THIS_MODULE, + .open = my_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release + }; + + + /** + * This function is called when the module is loaded + * + */ + int init_module(void) + { + struct proc_dir_entry *entry; + + entry = proc_create(PROC_NAME, 0, NULL, &my_file_ops); + if(entry == NULL) + { + /* nothing was created, so there is nothing to remove */ + printk(KERN_DEBUG "Error: Could not initialize /proc/%s\n", PROC_NAME); + return -ENOMEM; + } + + return 0; + } + + /** + * This function is called when the module is unloaded. + * + */ + void cleanup_module(void) + { + remove_proc_entry(PROC_NAME, NULL); + printk(KERN_DEBUG "/proc/%s removed\n", PROC_NAME); + } + +.. raw:: html + +
+ +If you want more information, you can read these web pages: + +- http://lwn.net/Articles/22355/ +- http://www.kernelnewbies.org/documents/seq_file_howto.txt + +You can also read the code of fs/seq_file.c in the Linux kernel. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: sysfs: Interacting with your module + :name: org954957f + +.. raw:: html + +
+ +*sysfs* allows you to interact with the running kernel from userspace by +reading or setting variables inside of modules. This can be useful for +debugging purposes, or just as an interface for applications or scripts. +You can find sysfs directories and files under the */sys* directory on +your system. + +.. raw:: html + +
+ +:: + + ls -l /sys + +.. raw:: html + +
+ +An example of a hello world module which includes the creation of a +variable accessible via sysfs is given below. + +.. raw:: html + +
+ +:: + + /* + * hello-sysfs.c sysfs example + */ + + #include + #include + #include + #include + #include + #include + #include + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Bob Mottram"); + + static struct kobject *mymodule; + + /* the variable you want to be able to change */ + static int myvariable = 0; + + /* called when the sysfs file is read */ + static ssize_t myvariable_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) + { + return sprintf(buf, "%d\n", myvariable); + } + + /* called when the sysfs file is written; buf is read-only */ + static ssize_t myvariable_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) + { + sscanf(buf, "%d", &myvariable); + return count; + } + + + static struct kobj_attribute myvariable_attribute = + __ATTR(myvariable, 0660, myvariable_show, + myvariable_store); + + static int __init mymodule_init (void) + { + int error = 0; + + printk(KERN_INFO "mymodule: initialised\n"); + + mymodule = + kobject_create_and_add("mymodule", kernel_kobj); + if (!mymodule) + return -ENOMEM; + + error = sysfs_create_file(mymodule, &myvariable_attribute.attr); + if (error) { + printk(KERN_INFO "failed to create the myvariable file " \ + "in /sys/kernel/mymodule\n"); + /* drop the kobject again so it is not leaked */ + kobject_put(mymodule); + } + + return error; + } + + static void __exit mymodule_exit (void) + { + printk(KERN_INFO "mymodule: Exit success\n"); + kobject_put(mymodule); + } + + module_init(mymodule_init); + module_exit(mymodule_exit); + +.. raw:: html + +
+ +Make and install the module: + +.. raw:: html + +
+ +:: + + make + sudo insmod hello-sysfs.ko + +.. raw:: html + +
+ +Check that it exists: + +.. raw:: html + +
+ +:: + + sudo lsmod | grep hello_sysfs + +.. raw:: html + +
+ +What is the current value of *myvariable* ? + +.. raw:: html + +
+ +:: + + cat /sys/kernel/mymodule/myvariable + +.. raw:: html + +
+ +Set the value of *myvariable* and check that it changed. + +.. raw:: html + +
+ +:: + + echo "32" > /sys/kernel/mymodule/myvariable + cat /sys/kernel/mymodule/myvariable + +.. raw:: html + +
+ +Finally, remove the test module: + +.. raw:: html + +
+ +:: + + sudo rmmod hello_sysfs + +.. raw:: html + +
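A store callback such as myvariable_store receives whatever text the user wrote, usually with a trailing newline as produced by echo. The following user-space sketch shows one way to parse such input strictly and reject anything that is not a plain decimal integer. The function name parse_attr_int is hypothetical and strtol() is used only because this runs outside the kernel; inside a module, the kernel's kstrtoint() serves a similar purpose.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a decimal integer from an attribute write.
 * Returns 0 and stores the value in *out on success,
 * or a negative errno value and leaves *out untouched on failure.
 */
static int parse_attr_int(const char *buf, int *out)
{
    char *end;
    long v;

    assert(buf && out);

    errno = 0;
    v = strtol(buf, &end, 10);
    if (end == buf)
        return -EINVAL;          /* no digits at all */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -ERANGE;          /* does not fit in an int */
    if (*end == '\n')
        end++;                   /* tolerate the newline added by echo */
    if (*end != '\0')
        return -EINVAL;          /* trailing garbage */

    *out = (int)v;
    return 0;
}
```

With this approach, writing "32" followed by a newline is accepted, while input such as "32u" or an empty string is rejected instead of being silently half-parsed.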
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Talking To Device Files + :name: org438f37b + +.. raw:: html + +
+ +Device files are supposed to represent physical devices. Most physical +devices are used for output as well as input, so there has to be some +mechanism for device drivers in the kernel to get the output to send to +the device from processes. This is done by opening the device file for +output and writing to it, just like writing to a file. In the following +example, this is implemented by device_write. + +This is not always enough. Imagine you had a serial port connected to a +modem (even if you have an internal modem, it is still implemented from +the CPU's perspective as a serial port connected to a modem, so you +don't have to tax your imagination too hard). The natural thing to do +would be to use the device file to write things to the modem (either +modem commands or data to be sent through the phone line) and read +things from the modem (either responses for commands or the data +received through the phone line). However, this leaves open the question +of what to do when you need to talk to the serial port itself, for +example to set the rate at which data is sent and received. + +The answer in Unix is to use a special function called **ioctl** (short +for Input Output ConTroL). Every device can have its own ioctl commands, +which can be read ioctl's (to send information from a process to the +kernel), write ioctl's (to return information to a process), both or +neither. Notice here the roles of read and write are reversed again, so +in ioctl's read is to send information to the kernel and write is to +receive information from the kernel. Be aware that the \_IOR and \_IOW +macros used below are named from the calling process's point of view: +\_IOW marks a command that passes data to the kernel, and \_IOR one that +returns data to the process. 
+ +The ioctl function is called with three parameters: the file descriptor +of the appropriate device file, the ioctl number, and a parameter, which +is of type long so you can use a cast to use it to pass anything. You +won't be able to pass a structure this way, but you will be able to pass +a pointer to the structure. 
+ +The ioctl number encodes the major device number, the type of the ioctl, +the command, and the type of the parameter. This ioctl number is usually +created by a macro call (\_IO, \_IOR, \_IOW or \_IOWR, depending on the +type) in a header file. This header file should then be included both by +the programs which will use ioctl (so they can generate the appropriate +ioctl's) and by the kernel module (so it can understand it). In the +example below, the header file is chardev.h and the program which uses +it is ioctl.c. + +If you want to use ioctls in your own kernel modules, it is best to +receive an official ioctl assignment, so if you accidentally get +somebody else's ioctls, or if they get yours, you'll know something is +wrong. For more information, consult the kernel source tree at +Documentation/ioctl/ioctl-number.txt. + +.. raw:: html + +
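To make the encoding described above more concrete, the following user-space sketch builds two command numbers with the same macros and takes them apart again using the \_IOC_DIR, \_IOC_TYPE, \_IOC_NR and \_IOC_SIZE helpers from the kernel's exported ioctl header. The names SKETCH_MAGIC, SKETCH_SET_MSG, SKETCH_GET_NTH, struct ioctl_fields and decode_ioctl are hypothetical and exist only for this illustration; the value 100 mirrors MAJOR_NUM from chardev.h.

```c
#include <assert.h>
#include <linux/ioctl.h>   /* _IOW, _IOWR, _IOC_DIR, _IOC_TYPE, ... */

#define SKETCH_MAGIC   100                              /* same value as MAJOR_NUM */
#define SKETCH_SET_MSG _IOW(SKETCH_MAGIC, 0, char *)    /* process -> kernel */
#define SKETCH_GET_NTH _IOWR(SKETCH_MAGIC, 2, int)      /* both directions */

/* The four fields packed into an ioctl command number */
struct ioctl_fields {
    unsigned int dir;    /* _IOC_NONE, _IOC_READ, _IOC_WRITE or both */
    unsigned int type;   /* the "magic" number, here the major number */
    unsigned int nr;     /* the command number within that type */
    unsigned int size;   /* sizeof() the parameter type */
};

/* Split a command number back into its fields */
static struct ioctl_fields decode_ioctl(unsigned long cmd)
{
    struct ioctl_fields f;

    f.dir  = _IOC_DIR(cmd);
    f.type = _IOC_TYPE(cmd);
    f.nr   = _IOC_NR(cmd);
    f.size = _IOC_SIZE(cmd);
    return f;
}
```

Decoding SKETCH_SET_MSG yields the write direction, type 100, command 0 and the size of a char pointer, which is why user programs and the module must share one header: any difference in these fields produces a different command number.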
+ +:: + + /* + * chardev2.c - Create an input/output character device + */ + + #include /* We're doing kernel work */ + #include /* Specifically, a module */ + #include + #include /* for get_user and put_user */ + + #include "chardev.h" + #define SUCCESS 0 + #define DEVICE_NAME "char_dev" + #define BUF_LEN 80 + + /* + * Is the device open right now? Used to prevent + * concurent access into the same device + */ + static int Device_Open = 0; + + /* + * The message the device will give when asked + */ + static char Message[BUF_LEN]; + + /* + * How far did the process reading the message get? + * Useful if the message is larger than the size of the + * buffer we get to fill in device_read. + */ + static char *Message_Ptr; + + /* + * This is called whenever a process attempts to open the device file + */ + static int device_open(struct inode *inode, struct file *file) + { + #ifdef DEBUG + printk(KERN_INFO "device_open(%p)\n", file); + #endif + + /* + * We don't want to talk to two processes at the same time + */ + if (Device_Open) + return -EBUSY; + + Device_Open++; + /* + * Initialize the message + */ + Message_Ptr = Message; + try_module_get(THIS_MODULE); + return SUCCESS; + } + + static int device_release(struct inode *inode, struct file *file) + { + #ifdef DEBUG + printk(KERN_INFO "device_release(%p,%p)\n", inode, file); + #endif + + /* + * We're now ready for our next caller + */ + Device_Open--; + + module_put(THIS_MODULE); + return SUCCESS; + } + + /* + * This function is called whenever a process which has already opened the + * device file attempts to read from it. 
+ */ + static ssize_t device_read(struct file *file, /* see include/linux/fs.h */ + char __user * buffer, /* buffer to be + * filled with data */ + size_t length, /* length of the buffer */ + loff_t * offset) + { + /* + * Number of bytes actually written to the buffer + */ + int bytes_read = 0; + + #ifdef DEBUG + printk(KERN_INFO "device_read(%p,%p,%d)\n", file, buffer, length); + #endif + + /* + * If we're at the end of the message, return 0 + * (which signifies end of file) + */ + if (*Message_Ptr == 0) + return 0; + + /* + * Actually put the data into the buffer + */ + while (length && *Message_Ptr) { + + /* + * Because the buffer is in the user data segment, + * not the kernel data segment, assignment wouldn't + * work. Instead, we have to use put_user which + * copies data from the kernel data segment to the + * user data segment. + */ + put_user(*(Message_Ptr++), buffer++); + length--; + bytes_read++; + } + + #ifdef DEBUG + printk(KERN_INFO "Read %d bytes, %d left\n", bytes_read, length); + #endif + + /* + * Read functions are supposed to return the number + * of bytes actually inserted into the buffer + */ + return bytes_read; + } + + /* + * This function is called when somebody tries to + * write into our device file. + */ + static ssize_t + device_write(struct file *file, + const char __user * buffer, size_t length, loff_t * offset) + { + int i; + + #ifdef DEBUG + printk(KERN_INFO "device_write(%p,%s,%d)", file, buffer, length); + #endif + + for (i = 0; i < length && i < BUF_LEN; i++) + get_user(Message[i], buffer + i); + + Message_Ptr = Message; + + /* + * Again, return the number of input characters used + */ + return i; + } + + /* + * This function is called whenever a process tries to do an ioctl on our + * device file. We get two extra parameters (additional to the inode and file + * structures, which all device functions get): the number of the ioctl called + * and the parameter given to the ioctl function. 
+ * + * If the ioctl is write or read/write (meaning output is returned to the + * calling process), the ioctl call returns the output of this function. + * + */ + long device_ioctl(struct file *file, /* ditto */ + unsigned int ioctl_num, /* number and param for ioctl */ + unsigned long ioctl_param) + { + int i; + char *temp; + char ch; + + /* + * Switch according to the ioctl called + */ + switch (ioctl_num) { + case IOCTL_SET_MSG: + /* + * Receive a pointer to a message (in user space) and set that + * to be the device's message. Get the parameter given to + * ioctl by the process. + */ + temp = (char *)ioctl_param; + + /* + * Find the length of the message + */ + get_user(ch, temp); + for (i = 0; ch && i < BUF_LEN; i++, temp++) + get_user(ch, temp); + + device_write(file, (char *)ioctl_param, i, 0); + break; + + case IOCTL_GET_MSG: + /* + * Give the current message to the calling process - + * the parameter we got is a pointer, fill it. + */ + i = device_read(file, (char *)ioctl_param, 99, 0); + + /* + * Put a zero at the end of the buffer, so it will be + * properly terminated + */ + put_user('\0', (char *)ioctl_param + i); + break; + + case IOCTL_GET_NTH_BYTE: + /* + * This ioctl is both input (ioctl_param) and + * output (the return value of this function). + * Reject indexes outside the message buffer. + */ + if (ioctl_param >= BUF_LEN) + return -EINVAL; + return Message[ioctl_param]; + } + + return SUCCESS; + } + + /* Module Declarations */ + + /* + * This structure will hold the functions to be called + * when a process does something to the device we + * created. Since a pointer to this structure is kept in + * the devices table, it can't be local to + * init_module. NULL is for unimplemented functions. + */ + struct file_operations Fops = { + .read = device_read, + .write = device_write, + .unlocked_ioctl = device_ioctl, + .open = device_open, + .release = device_release, /* a.k.a. 
close */ + }; + + /* + * Initialize the module - Register the character device + */ + int init_module() + { + int ret_val; + /* + * Register the character device (at least try) + */ + ret_val = register_chrdev(MAJOR_NUM, DEVICE_NAME, &Fops); + + /* + * Negative values signify an error + */ + if (ret_val < 0) { + printk(KERN_ALERT "%s failed with %d\n", + "Sorry, registering the character device ", ret_val); + return ret_val; + } + + printk(KERN_INFO "%s The major device number is %d.\n", + "Registration is a success.", MAJOR_NUM); + printk(KERN_INFO "If you want to talk to the device driver,\n"); + printk(KERN_INFO "you'll have to create a device file. \n"); + printk(KERN_INFO "We suggest you use:\n"); + printk(KERN_INFO "mknod %s c %d 0\n", DEVICE_FILE_NAME, MAJOR_NUM); + printk(KERN_INFO "The device file name is important, because\n"); + printk(KERN_INFO "the ioctl program assumes that's the\n"); + printk(KERN_INFO "file you'll use.\n"); + + return 0; + } + + /* + * Cleanup - unregister the character device + */ + void cleanup_module() + { + /* + * Unregister the device + */ + unregister_chrdev(MAJOR_NUM, DEVICE_NAME); + } + +.. raw:: html + +
+ +.. raw:: html + +
+ +:: + + /* + * chardev.h - the header file with the ioctl definitions. + * + * The declarations here have to be in a header file, because + * they need to be known both to the kernel module + * (in chardev.c) and the process calling ioctl (ioctl.c) + */ + + #ifndef CHARDEV_H + #define CHARDEV_H + + #include + + /* + * The major device number. We can't rely on dynamic + * registration any more, because ioctls need to know + * it. + */ + #define MAJOR_NUM 100 + + /* + * Set the message of the device driver + */ + #define IOCTL_SET_MSG _IOW(MAJOR_NUM, 0, char *) + /* + * _IOW means that we're creating an ioctl command + * number for passing information from a user process + * to the kernel module. + * + * The first arguments, MAJOR_NUM, is the major device + * number we're using. + * + * The second argument is the number of the command + * (there could be several with different meanings). + * + * The third argument is the type we want to get from + * the process to the kernel. + */ + + /* + * Get the message of the device driver + */ + #define IOCTL_GET_MSG _IOR(MAJOR_NUM, 1, char *) + /* + * This IOCTL is used for output, to get the message + * of the device driver. However, we still need the + * buffer to place the message in to be input, + * as it is allocated by the process. + */ + + /* + * Get the n'th byte of the message + */ + #define IOCTL_GET_NTH_BYTE _IOWR(MAJOR_NUM, 2, int) + /* + * The IOCTL is used for both input and output. It + * receives from the user a number, n, and returns + * Message[n]. + */ + + /* + * The name of the device file + */ + #define DEVICE_FILE_NAME "char_dev" + + #endif + +.. raw:: html + +
+ +.. raw:: html + +
+ +:: + + /* + * ioctl.c - the process to use ioctl's to control the kernel module + * + * Until now we could have used cat for input and output. But now + * we need to do ioctl's, which require writing our own process. + */ + + /* + * device specifics, such as ioctl numbers and the + * major device file. + */ + #include "../chardev.h" + + #include + #include + #include /* open */ + #include /* exit */ + #include /* ioctl */ + + /* + * Functions for the ioctl calls + */ + + int ioctl_set_msg(int file_desc, char *message) + { + int ret_val; + + ret_val = ioctl(file_desc, IOCTL_SET_MSG, message); + + if (ret_val < 0) { + printf("ioctl_set_msg failed:%d\n", ret_val); + exit(-1); + } + return 0; + } + + int ioctl_get_msg(int file_desc) + { + int ret_val; + char message[100]; + + /* + * Warning - this is dangerous because we don't tell + * the kernel how far it's allowed to write, so it + * might overflow the buffer. In a real production + * program, we would have used two ioctls - one to tell + * the kernel the buffer length and another to give + * it the buffer to fill + */ + ret_val = ioctl(file_desc, IOCTL_GET_MSG, message); + + if (ret_val < 0) { + printf("ioctl_get_msg failed:%d\n", ret_val); + exit(-1); + } + + printf("get_msg message:%s\n", message); + return 0; + } + + int ioctl_get_nth_byte(int file_desc) + { + int i; + char c; + + printf("get_nth_byte message:"); + + i = 0; + do { + c = ioctl(file_desc, IOCTL_GET_NTH_BYTE, i++); + + if (c < 0) { + printf("ioctl_get_nth_byte failed at the %d'th byte:\n", + i); + exit(-1); + } + + putchar(c); + } while (c != 0); + putchar('\n'); + return 0; + } + + /* + * Main - Call the ioctl functions + */ + int main() + { + int file_desc, ret_val; + char *msg = "Message passed by ioctl\n"; + + file_desc = open(DEVICE_FILE_NAME, 0); + if (file_desc < 0) { + printf("Can't open device file: %s\n", DEVICE_FILE_NAME); + exit(-1); + } + + ioctl_get_nth_byte(file_desc); + ioctl_get_msg(file_desc); + ioctl_set_msg(file_desc, msg); + 
+ close(file_desc); + return 0; + } + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: System Calls + :name: org8de5924 + +.. raw:: html + +
+ +So far, the only thing we've done was to use well defined kernel +mechanisms to register **/proc** files and device handlers. This is fine +if you want to do something the kernel programmers thought you'd want, +such as write a device driver. But what if you want to do something +unusual, to change the behavior of the system in some way? Then, you're +mostly on your own. + +Unless you are being sensible and using a virtual machine, this is +where kernel programming can become hazardous. While writing the example +below, I killed the **open()** system call. This meant I couldn't open +any files, I couldn't run any programs, and I couldn't shut down the +system. I had to restart the virtual machine. No important files got +annihilated, but had I been doing this on some live mission-critical system, +that could have been a possible outcome. To ensure you don't lose +any files, even within a test environment, please run **sync** right +before you do the **insmod** and the **rmmod**. + +Forget about **/proc** files, forget about device files. They're just +minor details. Minutiae in the vast expanse of the universe. The real +process to kernel communication mechanism, the one used by all +processes, is *system calls*. When a process requests a service from the +kernel (such as opening a file, forking to a new process, or requesting +more memory), this is the mechanism used. If you want to change the +behaviour of the kernel in interesting ways, this is the place to do it. +By the way, if you want to see which system calls a program uses, run +**strace** followed by the command you want to trace. + +In general, a process is not supposed to be able to access the kernel. +It can't access kernel memory and it can't call kernel functions. The +hardware of the CPU enforces this (that's the reason why it's called +\`protected mode' or 'page protection'). + +System calls are an exception to this general rule. 
What happens is that +the process fills the registers with the appropriate values and then +calls a special instruction which jumps to a previously defined location +in the kernel (of course, that location is readable by user processes, +but it is not writable by them). On 32-bit Intel CPUs, this has traditionally been done by means of +interrupt 0x80 (64-bit x86 kernels normally use the dedicated syscall instruction instead). The hardware knows that once you jump to this location, +you are no longer running in restricted user mode, but as the operating +system kernel, and therefore you're allowed to do whatever you want. + +The location in the kernel a process can jump to is called system_call. +The procedure at that location checks the system call number, which +tells the kernel what service the process requested. Then, it looks at +the table of system calls (sys_call_table) to see the address of the +kernel function to call. Then it calls the function, and after it +returns, does a few system checks and then returns to the process +(or to a different process, if the process time ran out). If you want to +read this code, it's at the source file +arch/<architecture>/kernel/entry.S, after the line +ENTRY(system_call). + +So, if we want to change the way a certain system call works, what we +need to do is to write our own function to implement it (usually by +adding a bit of our own code, and then calling the original function) +and then change the pointer at sys_call_table to point to our function. +Because we might be removed later and we don't want to leave the system +in an unstable state, it's important for cleanup_module to restore the +table to its original state. + +The source code here is an example of such a kernel module. We want to +"spy" on a certain user, and to printk() a message whenever that user +opens a file. Towards this end, we replace the system call to open a +file with our own function, called **our_sys_open**. 
This function +checks the uid (user's id) of the current process, and if it's equal to +the uid we spy on, it calls printk() to display the name of the file to +be opened. Then, either way, it calls the original open() function with +the same parameters, to actually open the file. + +The **init_module** function replaces the appropriate location in +**sys_call_table** and keeps the original pointer in a variable. The +cleanup_module function uses that variable to restore everything back to +normal. This approach is dangerous, because of the possibility of two +kernel modules changing the same system call. Imagine we have two kernel +modules, A and B. A's open system call will be A_open and B's will be +B_open. Now, when A is inserted into the kernel, the system call is +replaced with A_open, which will call the original sys_open when it's +done. Next, B is inserted into the kernel, which replaces the system +call with B_open, which will call what it thinks is the original system +call, A_open, when it's done. + +Now, if B is removed first, everything will be well — it will simply +restore the system call to A_open, which calls the original. However, if +A is removed and then B is removed, the system will crash. A's removal +will restore the system call to the original, sys_open, cutting B out of +the loop. Then, when B is removed, it will restore the system call to +what it thinks is the original, **A_open**, which is no longer in +memory. At first glance, it appears we could solve this particular +problem by checking if the system call is equal to our open function and +if so not changing it at all (so that B won't change the system call +when it's removed), but that will cause an even worse problem. When A is +removed, it sees that the system call was changed to **B_open** so that +it is no longer pointing to **A_open**, so it won't restore it to +**sys_open** before it is removed from memory. 
Unfortunately, **B_open** +will still try to call **A_open** which is no longer there, so that even +without removing B the system would crash. + +Note that all the related problems make syscall stealing infeasible for +production use. In order to keep people from doing potentially harmful +things, **sys_call_table** is no longer exported. This means that if you want +to do something more than a mere dry run of this example, you will have +to patch your current kernel in order to have sys_call_table exported. +In the example directory you will find a README and the patch. As you +can imagine, such modifications are not to be taken lightly. Do not try +this on valuable systems (i.e. systems that you do not own or cannot +restore easily). You'll need to get the complete source code of this +guide as a tarball in order to get the patch and the README. Depending +on your kernel version, you might even need to hand apply the patch. +Still here? Well, so is this chapter. If Wile E. Coyote was a kernel +hacker, this would be the first thing he'd try. ;) + +.. raw:: html + +
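The removal-order problem with modules A and B described above can be reproduced safely in user space with a single function pointer standing in for the sys_call_table entry. This is only a sketch under that assumption: real_open, table_open, struct hook, hook_install and hook_remove are hypothetical names, and nothing here touches the real system call table.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*open_fn)(const char *path);

struct hook {
    open_fn saved;   /* what the slot held when the hook was installed */
    open_fn mine;    /* the replacement function */
    int calls;       /* how often the replacement ran */
};

/* Stand-in for the real sys_open */
static int real_open(const char *path)
{
    (void)path;
    return 3;        /* pretend file descriptor */
}

/* Stand-in for sys_call_table[__NR_open] */
static open_fn table_open = real_open;

static int a_open(const char *path);
static int b_open(const char *path);

static struct hook hook_a = { NULL, a_open, 0 };
static struct hook hook_b = { NULL, b_open, 0 };

/* Each replacement does its own work, then calls what it saved */
static int a_open(const char *path)
{
    hook_a.calls++;
    return hook_a.saved(path);
}

static int b_open(const char *path)
{
    hook_b.calls++;
    return hook_b.saved(path);
}

/* What init_module does: remember the current entry, then replace it */
static void hook_install(struct hook *h)
{
    assert(h && h->mine);
    h->saved = table_open;
    table_open = h->mine;
}

/* What cleanup_module does: blindly restore the remembered entry */
static void hook_remove(struct hook *h)
{
    assert(h && h->saved);
    table_open = h->saved;
    h->saved = NULL;
}
```

Installing A then B and removing them in reverse order leaves table_open pointing at real_open again. Installing A then B and removing A first, then B, leaves table_open pointing at a_open even though A is gone. In this sketch that is merely a stale pointer to a function that still exists; in the kernel, the code behind it would have been unloaded.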
+ +:: + + /* + * syscall.c + * + * System call "stealing" sample. + * + * Disables page protection at a processor level by + * changing the 16th bit in the cr0 register (could be Intel specific) + * + * Based on example by Peter Jay Salzman and + * https://bbs.archlinux.org/viewtopic.php?id=139406 + */ + + #include + #include + #include + #include + #include + #include /* which will have params */ + #include /* The list of system calls */ + + /* + * For the current (process) structure, we need + * this to know who the current user is. + */ + #include + #include + + unsigned long **sys_call_table; + unsigned long original_cr0; + + /* + * UID we want to spy on - will be filled from the + * command line + */ + static int uid; + module_param(uid, int, 0644); + + /* + * A pointer to the original system call. The reason + * we keep this, rather than call the original function + * (sys_open), is because somebody else might have + * replaced the system call before us. Note that this + * is not 100% safe, because if another module + * replaced sys_open before us, then when we're inserted + * we'll call the function in that module - and it + * might be removed before we are. + * + * Another reason for this is that we can't get sys_open. + * It's a static variable, so it is not exported. + */ + asmlinkage int (*original_call) (const char *, int, int); + + /* + * The function we'll replace sys_open (the function + * called when you call the open system call) with. To + * find the exact prototype, with the number and type + * of arguments, we find the original function first + * (it's at fs/open.c). + * + * In theory, this means that we're tied to the + * current version of the kernel. In practice, the + * system calls almost never change (it would wreck havoc + * and require programs to be recompiled, since the system + * calls are the interface between the kernel and the + * processes). 
+ */ + asmlinkage int our_sys_open(const char *filename, int flags, int mode) + { + int i = 0; + char ch; + + /* + * Report the file, if relevant + */ + printk("Opened file by %d: ", uid); + do { + get_user(ch, filename + i); + i++; + printk("%c", ch); + } while (ch != 0); + printk("\n"); + + /* + * Call the original sys_open - otherwise, we lose + * the ability to open files + */ + return original_call(filename, flags, mode); + } + + static unsigned long **aquire_sys_call_table(void) + { + unsigned long int offset = PAGE_OFFSET; + unsigned long **sct; + + while (offset < ULLONG_MAX) { + sct = (unsigned long **)offset; + + if (sct[__NR_close] == (unsigned long *) sys_close) + return sct; + + offset += sizeof(void *); + } + + return NULL; + } + + static int __init syscall_start(void) + { + if(!(sys_call_table = aquire_sys_call_table())) + return -1; + + original_cr0 = read_cr0(); + + write_cr0(original_cr0 & ~0x00010000); + + /* keep track of the original open function */ + original_call = (void*)sys_call_table[__NR_open]; + + /* use our open function instead */ + sys_call_table[__NR_open] = (unsigned long *)our_sys_open; + + write_cr0(original_cr0); + + printk(KERN_INFO "Spying on UID:%d\n", uid); + + return 0; + } + + static void __exit syscall_end(void) + { + if(!sys_call_table) { + return; + } + + /* + * Return the system call back to normal + */ + if (sys_call_table[__NR_open] != (unsigned long *)our_sys_open) { + printk(KERN_ALERT "Somebody else also played with the "); + printk(KERN_ALERT "open system call\n"); + printk(KERN_ALERT "The system may be left in "); + printk(KERN_ALERT "an unstable state.\n"); + } + + write_cr0(original_cr0 & ~0x00010000); + sys_call_table[__NR_open] = (unsigned long *)original_call; + write_cr0(original_cr0); + + msleep(2000); + } + + module_init(syscall_start); + module_exit(syscall_end); + + MODULE_LICENSE("GPL"); + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Blocking Processes and threads + :name: org13e2c0e + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Sleep + :name: org9cbc7d3 + +.. raw:: html + +
+ +What do you do when somebody asks you for something you can't do right +away? If you're a human being and you're bothered by a human being, the +only thing you can say is: "*Not right now, I'm busy. Go away!*". But if +you're a kernel module and you're bothered by a process, you have +another possibility. You can put the process to sleep until you can +service it. After all, processes are being put to sleep by the kernel +and woken up all the time (that's the way multiple processes appear to +run at the same time on a single CPU). + +This kernel module is an example of this. The file (called +**/proc/sleep**) can only be opened by a single process at a time. If +the file is already open, the kernel module calls +wait_event_interruptible. The easiest way to keep a file open is to open +it with: + +.. raw:: html + +
+ +:: + + tail -f + +.. raw:: html + +
+ +wait_event_interruptible changes the status of the task (a task is the +kernel data structure which holds information about a process and the +system call it's in, if any) to **TASK_INTERRUPTIBLE**, which means that +the task will not run until it is woken up somehow, and adds it to WaitQ, +the queue of tasks waiting to access the file. Then, it calls the +scheduler to context switch to a different process, one which has some +use for the CPU. + +When a process is done with the file, it closes it, and module_close is +called. That function wakes up all the processes in the queue (they all +wait as ordinary, non-exclusive waiters, so a single wake_up reaches +every one of them). It then returns and the +process which just closed the file can continue to run. In time, the +scheduler decides that that process has had enough and gives control of +the CPU to another process. Eventually, one of the processes which was +in the queue will be given control of the CPU by the scheduler. It +starts at the point right after the call to +**wait_event_interruptible**. + +This means that the process is still in kernel mode - as far as the +process is concerned, it issued the open system call and the system call +hasn't returned yet. The process doesn't know somebody else used the CPU +for most of the time between the moment it issued the call and the +moment it returned. + +It can then proceed to set a global variable to tell all the other +processes that the file is still open and go on with its life. When the +other processes get a piece of the CPU, they'll see that global variable +and go back to sleep. + +So we'll use tail -f to keep the file open in the background, while +trying to access it with another process (again in the background, so +that we need not switch to a different vt). As soon as the first +background process is killed with kill %1, the second is woken up, is +able to access the file and finally terminates.
+ +To make our life more interesting, **module_close** doesn't have a +monopoly on waking up the processes which wait to access the file. A +signal, such as *Ctrl+c* (**SIGINT**) can also wake up a process. This +is because we used **wait_event_interruptible**. We could have used +**wait_event** instead, but that would have resulted in extremely +angry users whose *Ctrl+c*'s are ignored. + +In that case, we want to return with **-EINTR** immediately. This is +important so users can, for example, kill the process before it receives +the file. + +There is one more point to remember. Sometimes processes don't want to +sleep, they want either to get what they want immediately, or to be told +it cannot be done. Such processes use the **O_NONBLOCK** flag when +opening the file. The kernel is supposed to respond by returning with +the error code **-EAGAIN** from operations which would otherwise block, +such as opening the file in this example. The program cat_noblock, +available in the source directory for this chapter, can be used to open +a file with **O_NONBLOCK**. + +.. raw:: html + +
+ +:: + + hostname:~/lkmpg-examples/09-BlockingProcesses# insmod sleep.ko + hostname:~/lkmpg-examples/09-BlockingProcesses# cat_noblock /proc/sleep + Last input: + hostname:~/lkmpg-examples/09-BlockingProcesses# tail -f /proc/sleep & + Last input: + Last input: + Last input: + Last input: + Last input: + Last input: + Last input: + tail: /proc/sleep: file truncated + [1] 6540 + hostname:~/lkmpg-examples/09-BlockingProcesses# cat_noblock /proc/sleep + Open would block + hostname:~/lkmpg-examples/09-BlockingProcesses# kill %1 + [1]+ Terminated tail -f /proc/sleep + hostname:~/lkmpg-examples/09-BlockingProcesses# cat_noblock /proc/sleep + Last input: + hostname:~/lkmpg-examples/09-BlockingProcesses# + +.. raw:: html + +
+ +.. raw:: html + +
+ +:: + + /* + * sleep.c - create a /proc file, and if several processes try to open it at + * the same time, put all but one to sleep + */ + + #include /* We're doing kernel work */ + #include /* Specifically, a module */ + #include /* Necessary because we use proc fs */ + #include /* For putting processes to sleep and + waking them up */ + #include /* for get_user and put_user */ + + /* + * The module's file functions + */ + + /* + * Here we keep the last message received, to prove that we can process our + * input + */ + #define MESSAGE_LENGTH 80 + static char Message[MESSAGE_LENGTH]; + + static struct proc_dir_entry *Our_Proc_File; + #define PROC_ENTRY_FILENAME "sleep" + + /* + * Since we use the file operations struct, we can't use the special proc + * output provisions - we have to use a standard read function, which is this + * function + */ + static ssize_t module_output(struct file *file, /* see include/linux/fs.h */ + char *buf, /* The buffer to put data to + (in the user segment) */ + size_t len, /* The length of the buffer */ + loff_t * offset) + { + static int finished = 0; + int i; + char message[MESSAGE_LENGTH + 30]; + + /* + * Return 0 to signify end of file - that we have nothing + * more to say at this point. + */ + if (finished) { + finished = 0; + return 0; + } + + /* + * If you don't understand this by now, you're hopeless as a kernel + * programmer. + */ + sprintf(message, "Last input:%s\n", Message); + for (i = 0; i < len && message[i]; i++) + put_user(message[i], buf + i); + + finished = 1; + return i; /* Return the number of bytes "read" */ + } + + /* + * This function receives input from the user when the user writes to the /proc + * file. 
+ */ + static ssize_t module_input(struct file *file, /* The file itself */ + const char *buf, /* The buffer with input */ + size_t length, /* The buffer's length */ + loff_t * offset) /* offset to file - ignore */ + { + int i; + + /* + * Put the input into Message, where module_output will later be + * able to use it + */ + for (i = 0; i < MESSAGE_LENGTH - 1 && i < length; i++) + get_user(Message[i], buf + i); + /* + * we want a standard, zero terminated string + */ + Message[i] = '\0'; + + /* + * We need to return the number of input characters used + */ + return i; + } + + /* + * 1 if the file is currently open by somebody + */ + int Already_Open = 0; + + /* + * Queue of processes who want our file + */ + DECLARE_WAIT_QUEUE_HEAD(WaitQ); + /* + * Called when the /proc file is opened + */ + static int module_open(struct inode *inode, struct file *file) + { + /* + * If the file's flags include O_NONBLOCK, it means the process doesn't + * want to wait for the file. In this case, if the file is already + * open, we should fail with -EAGAIN, meaning "you'll have to try + * again", instead of blocking a process which would rather stay awake. + */ + if ((file->f_flags & O_NONBLOCK) && Already_Open) + return -EAGAIN; + + /* + * This is the correct place for try_module_get(THIS_MODULE) because + * if a process is in the loop, which is within the kernel module, + * the kernel module must not be removed. + */ + try_module_get(THIS_MODULE); + + /* + * If the file is already open, wait until it isn't + */ + + while (Already_Open) { + int i, is_sig = 0; + + /* + * This function puts the current process, including any system + * calls, such as us, to sleep. 
Execution will be resumed right + * after the function call, either because somebody called + * wake_up(&WaitQ) (only module_close does that, when the file + * is closed) or when a signal, such as Ctrl-C, is sent + * to the process + */ + wait_event_interruptible(WaitQ, !Already_Open); + + /* + * If we woke up because we got a signal we're not blocking, + * return -EINTR (fail the system call). This allows processes + * to be killed or stopped. + */ + + /* + * Emmanuel Papirakis: + * + * This is a little update to work with 2.2.*. Signals now are contained in + * two words (64 bits) and are stored in a structure that contains an array of + * two unsigned longs. We now have to make 2 checks in our if. + * + * Ori Pomerantz: + * + * Nobody promised me they'll never use more than 64 bits, or that this book + * won't be used for a version of Linux with a word size of 16 bits. This code + * would work in any case. + */ + for (i = 0; i < _NSIG_WORDS && !is_sig; i++) + is_sig = + current->pending.signal.sig[i] & ~current-> + blocked.sig[i]; + + if (is_sig) { + /* + * It's important to put module_put(THIS_MODULE) here, + * because for processes where the open is interrupted + * there will never be a corresponding close. If we + * don't decrement the usage count here, we will be + * left with a positive usage count which we'll have no + * way to bring down to zero, giving us an immortal + * module, which can only be killed by rebooting + * the machine. + */ + module_put(THIS_MODULE); + return -EINTR; + } + } + + /* + * If we got here, Already_Open must be zero + */ + + /* + * Open the file + */ + Already_Open = 1; + return 0; /* Allow the access */ + } + + /* + * Called when the /proc file is closed + */ + int module_close(struct inode *inode, struct file *file) + { + /* + * Set Already_Open to zero, so one of the processes in the WaitQ will + * be able to set Already_Open back to one and to open the file. 
All + * the other processes will be called when Already_Open is back to one, + * so they'll go back to sleep. + */ + Already_Open = 0; + + /* + * Wake up all the processes in WaitQ, so if anybody is waiting for the + * file, they can have it. + */ + wake_up(&WaitQ); + + module_put(THIS_MODULE); + + return 0; /* success */ + } + + /* + * Structures to register as the /proc file, with pointers to all the relevant + * functions. + */ + + /* + * File operations for our proc file. This is where we place pointers to all + * the functions called when somebody tries to do something to our file. NULL + * means we don't want to deal with something. + */ + static struct file_operations File_Ops_4_Our_Proc_File = { + .read = module_output, /* "read" from the file */ + .write = module_input, /* "write" to the file */ + .open = module_open, /* called when the /proc file is opened */ + .release = module_close, /* called when it's closed */ + }; + + /* + * Module initialization and cleanup + */ + + /* + * Initialize the module - register the proc file + */ + + int init_module() + { + Our_Proc_File = proc_create(PROC_ENTRY_FILENAME, 0644, NULL, &File_Ops_4_Our_Proc_File); + if(Our_Proc_File == NULL) + { + remove_proc_entry(PROC_ENTRY_FILENAME, NULL); + printk(KERN_DEBUG "Error: Could not initialize /proc/%s\n", PROC_ENTRY_FILENAME); + return -ENOMEM; + } + proc_set_size(Our_Proc_File, 80); + proc_set_user(Our_Proc_File, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID); + + printk(KERN_INFO "/proc/test created\n"); + + return 0; + } + + /* + * Cleanup - unregister our file from /proc. This could get dangerous if + * there are still processes waiting in WaitQ, because they are inside our + * open function, which will get unloaded. I'll explain how to avoid removal + * of a kernel module in such a case in chapter 10. + */ + void cleanup_module() + { + remove_proc_entry(PROC_ENTRY_FILENAME, NULL); + printk(KERN_DEBUG "/proc/%s removed\n", PROC_ENTRY_FILENAME); + } + +.. raw:: html + +
+ +.. raw:: html + +
+ +:: + + /* cat_noblock.c - open a file and display its contents, but exit rather than + * wait for input */ + /* Copyright (C) 1998 by Ori Pomerantz */ + + #include <stdio.h> /* standard I/O */ + #include <fcntl.h> /* for open */ + #include <unistd.h> /* for read */ + #include <stdlib.h> /* for exit */ + #include <errno.h> /* for errno */ + + #define MAX_BYTES 1024*4 + + + int main(int argc, char *argv[]) + { + int fd; /* The file descriptor for the file to read */ + ssize_t bytes; /* The number of bytes read, or -1 on error */ + char buffer[MAX_BYTES]; /* The buffer for the bytes */ + + + /* Usage */ + if (argc != 2) { + printf("Usage: %s <filename>\n", argv[0]); + puts("Reads the content of a file, but doesn't wait for input"); + exit(-1); + } + + /* Open the file for reading in non blocking mode */ + fd = open(argv[1], O_RDONLY | O_NONBLOCK); + + /* If open failed */ + if (fd == -1) { + if (errno == EAGAIN) + puts("Open would block"); + else + puts("Open failed"); + exit(-1); + } + + /* Read the file and output its contents */ + do { + int i; + + /* Read characters from the file */ + bytes = read(fd, buffer, MAX_BYTES); + + /* If there's an error, report it and die */ + if (bytes == -1) { + if (errno == EAGAIN) + puts("Normally I'd block, but you told me not to"); + else + puts("Another read error"); + exit(-1); + } + + /* Print the characters */ + if (bytes > 0) { + for (i = 0; i < bytes; i++) + putchar(buffer[i]); + } + + /* Keep going while read returns data */ + } while (bytes > 0); + return 0; + } + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Completions + :name: org89cb410 + +.. raw:: html + +
+ +Sometimes one thing should happen before another within a module having +multiple threads. Rather than inserting sleep calls to enforce the +ordering, the kernel provides completions, which also allow for +timeouts or interruption. + +In the following example two threads are started, but one needs to start +before another. + +.. raw:: html + +
+ +:: + + #include + #include + #include + #include + #include + + static struct { + struct completion crank_comp; + struct completion flywheel_comp; + } machine; + + static int machine_crank_thread(void* arg) + { + printk("Turn the crank\n"); + + complete_all(&machine.crank_comp); + complete_and_exit(&machine.crank_comp, 0); + } + + static int machine_flywheel_spinup_thread(void* arg) + { + wait_for_completion(&machine.crank_comp); + + printk("Flywheel spins up\n"); + + complete_all(&machine.flywheel_comp); + complete_and_exit(&machine.flywheel_comp, 0); + } + + static int completions_init(void) + { + struct task_struct* crank_thread; + struct task_struct* flywheel_thread; + + printk("completions example\n"); + + init_completion(&machine.crank_comp); + init_completion(&machine.flywheel_comp); + + crank_thread = + kthread_create(machine_crank_thread, + NULL, "KThread Crank"); + if (IS_ERR(crank_thread)) + goto ERROR_THREAD_1; + + flywheel_thread = + kthread_create(machine_flywheel_spinup_thread, + NULL, "KThread Flywheel"); + if (IS_ERR(flywheel_thread)) + goto ERROR_THREAD_2; + + wake_up_process(flywheel_thread); + wake_up_process(crank_thread); + + return 0; + + ERROR_THREAD_2: + kthread_stop(crank_thread); + ERROR_THREAD_1: + + return -1; + } + + void completions_exit(void) + { + wait_for_completion(&machine.crank_comp); + wait_for_completion(&machine.flywheel_comp); + + printk("completions exit\n"); + } + + module_init(completions_init); + module_exit(completions_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Completions example"); + MODULE_LICENSE("GPL"); + +.. raw:: html + +
+ +The *machine* structure stores the completion states for the two +threads. At the exit point of each thread the respective completion +state is updated, and *wait_for_completion* is used by the flywheel +thread to ensure that it doesn't begin prematurely. + +So even though *flywheel_thread* is started first you should notice if +you load this module and run *dmesg* that turning the crank always +happens first because the flywheel thread waits for it to complete. + +There are other variations upon the *wait_for_completion* function, +which include timeouts or being interrupted, but this basic mechanism is +enough for many common situations without adding a lot of complexity. + +.. raw:: html + +
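The ordering guarantee described above can be tried out in user space. What follows is only a sketch under stated assumptions: it is not kernel code, it uses POSIX threads instead of kthreads, and every `demo_` name is hypothetical, chosen to mirror the kernel functions used in the module.

```c
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical user-space stand-in for the kernel's struct completion. */
struct demo_completion {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool done;
};

static void demo_init_completion(struct demo_completion *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = false;
}

/* Mark the event as done and wake every waiter. */
static void demo_complete_all(struct demo_completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = true;
    pthread_cond_broadcast(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Block until demo_complete_all() has been called on c. */
static void demo_wait_for_completion(struct demo_completion *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done)
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

static struct demo_completion crank_comp;
static char event_log[64];

static void *crank_thread(void *arg)
{
    (void)arg;
    strcat(event_log, "crank;");
    demo_complete_all(&crank_comp);
    return NULL;
}

static void *flywheel_thread(void *arg)
{
    (void)arg;
    /* Sleeps here until the crank has been turned. */
    demo_wait_for_completion(&crank_comp);
    strcat(event_log, "flywheel;");
    return NULL;
}

/* Start the flywheel first, as the module does, and return the observed order. */
static const char *demo_run_machine(void)
{
    pthread_t crank, flywheel;

    event_log[0] = '\0';
    demo_init_completion(&crank_comp);
    pthread_create(&flywheel, NULL, flywheel_thread, NULL);
    pthread_create(&crank, NULL, crank_thread, NULL);
    pthread_join(crank, NULL);
    pthread_join(flywheel, NULL);
    return event_log;
}
```

Calling `demo_run_machine()` yields the crank entry before the flywheel entry even though the flywheel thread is created first, because the waiter cannot get past `demo_wait_for_completion` until the flag is set. The `done` flag checked in a loop is what makes the wait safe against the crank finishing before the flywheel starts waiting.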
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Avoiding Collisions and Deadlocks + :name: org949949f + +.. raw:: html + +
+ +If processes running on different CPUs or in different threads try to +access the same memory, then it's possible that strange things can happen +or your system can lock up. To avoid this, various types of mutual +exclusion kernel functions are available. These indicate if a section of +code is "locked" or "unlocked" so that simultaneous attempts to run it +can't happen. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Mutex + :name: org10f05c2 + +.. raw:: html + +
+ +You can use kernel mutexes (mutual exclusions) in much the same manner +that you might deploy them in userland. This may be all that's needed to +avoid collisions in most cases. + +.. raw:: html + +
+ +:: + + #include + #include + #include + #include + + DEFINE_MUTEX(mymutex); + + static int example_mutex_init(void) + { + int ret; + + printk("example_mutex init\n"); + + ret = mutex_trylock(&mymutex); + if (ret != 0) { + printk("mutex is locked\n"); + + if (mutex_is_locked(&mymutex) == 0) + printk("The mutex failed to lock!\n"); + + mutex_unlock(&mymutex); + printk("mutex is unlocked\n"); + } + else + printk("Failed to lock\n"); + + return 0; + } + + static void example_mutex_exit(void) + { + printk("example_mutex exit\n"); + } + + module_init(example_mutex_init); + module_exit(example_mutex_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Mutex example"); + MODULE_LICENSE("GPL"); + +.. raw:: html + +
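Since kernel mutexes are used much like their userland counterparts, the try-lock pattern from the module can be sketched with POSIX threads. This is a user-space analogy, not kernel code, and `demo_mutex`, `demo_counter` and `demo_try_increment` are hypothetical names. One difference is worth noting: the kernel's `mutex_trylock` returns 1 when the lock was acquired, whereas `pthread_mutex_trylock` returns 0 on success.

```c
#include <pthread.h>

static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static int demo_counter;

/*
 * Try to bump the counter without blocking.
 * Returns 1 if the lock was taken and the counter updated,
 * 0 if somebody else currently holds the lock.
 */
static int demo_try_increment(void)
{
    /* pthread_mutex_trylock() returns 0 on success, EBUSY if locked. */
    if (pthread_mutex_trylock(&demo_mutex) != 0)
        return 0;

    demo_counter++;                 /* the protected section */
    pthread_mutex_unlock(&demo_mutex);
    return 1;
}
```

While the mutex is held elsewhere the function returns 0 and leaves the counter untouched, which is the same "Failed to lock" branch the module prints.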
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Spinlocks + :name: org5d633fc + +.. raw:: html + +
+ +As the name suggests, a CPU waiting for a spinlock busy-waits ("spins") +until the lock is free, taking 100% of its resources. Because of this you +should only use the spinlock mechanism around code which is likely to take +no more than a few milliseconds to run and so won't noticeably slow +anything down from the user's point of view. + +The example here is *"irq safe"*: interrupts on the local CPU are disabled +while the lock is held, so any that arrive are not forgotten but are +handled once the unlock restores the previous interrupt state, which was +saved in the *flags* variable. + +.. raw:: html + +
+ +:: + + #include + #include + #include + #include + #include + + DEFINE_SPINLOCK(sl_static); + spinlock_t sl_dynamic; + + static void example_spinlock_static(void) + { + unsigned long flags; + + spin_lock_irqsave(&sl_static, flags); + printk("Locked static spinlock\n"); + + /* Do something or other safely. + Because this uses 100% CPU time this + code should take no more than a few + milliseconds to run */ + + spin_unlock_irqrestore(&sl_static, flags); + printk("Unlocked static spinlock\n"); + } + + static void example_spinlock_dynamic(void) + { + unsigned long flags; + + spin_lock_init(&sl_dynamic); + spin_lock_irqsave(&sl_dynamic, flags); + printk("Locked dynamic spinlock\n"); + + /* Do something or other safely. + Because this uses 100% CPU time this + code should take no more than a few + milliseconds to run */ + + spin_unlock_irqrestore(&sl_dynamic, flags); + printk("Unlocked dynamic spinlock\n"); + } + + static int example_spinlock_init(void) + { + printk("example spinlock started\n"); + + example_spinlock_static(); + example_spinlock_dynamic(); + + return 0; + } + + static void example_spinlock_exit(void) + { + printk("example spinlock exit\n"); + } + + module_init(example_spinlock_init); + module_exit(example_spinlock_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Spinlock example"); + MODULE_LICENSE("GPL"); + +.. raw:: html + +
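To see why a spinlock burns CPU time while waiting, here is a minimal user-space sketch built on a C11 `atomic_flag`. It is an illustration only: the `demo_` names are hypothetical, and unlike `spin_lock_irqsave` it does nothing about interrupts.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy spinlock; not the kernel's spinlock_t. */
struct demo_spinlock {
    atomic_flag held;
};

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* One attempt: returns true if we took the lock, false if it was held. */
static bool demo_spin_trylock(struct demo_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held,
                                              memory_order_acquire);
}

/* Busy-wait until the lock is ours; this loop is the "spinning". */
static void demo_spin_lock(struct demo_spinlock *l)
{
    while (!demo_spin_trylock(l))
        ; /* nothing useful happens on this CPU while we wait */
}

static void demo_spin_unlock(struct demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The empty loop body in `demo_spin_lock` is the reason the protected section has to stay short: every other contender sits in that loop for as long as the holder keeps the lock.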
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Read and write locks + :name: orgaa517c3 + +.. raw:: html + +
+ +Read and write locks are specialised kinds of spinlocks which allow any +number of readers to hold the lock at the same time, but give a writer +exclusive access. Like the earlier spinlocks example the one below shows +an "irq safe" situation, in which functions triggered from irqs that +also read or write whatever you are concerned with can't disrupt the +logic. As before it's a good idea to keep anything done within the lock +as short as possible so that it doesn't hang up the system and cause +users to start revolting against the tyranny of your module. + +.. raw:: html + +
+ +:: + + #include + #include + #include + + DEFINE_RWLOCK(myrwlock); + + static void example_read_lock(void) + { + unsigned long flags; + + read_lock_irqsave(&myrwlock, flags); + printk("Read Locked\n"); + + /* Read from something */ + + read_unlock_irqrestore(&myrwlock, flags); + printk("Read Unlocked\n"); + } + + static void example_write_lock(void) + { + unsigned long flags; + + write_lock_irqsave(&myrwlock, flags); + printk("Write Locked\n"); + + /* Write to something */ + + write_unlock_irqrestore(&myrwlock, flags); + printk("Write Unlocked\n"); + } + + static int example_rwlock_init(void) + { + printk("example_rwlock started\n"); + + example_read_lock(); + example_write_lock(); + + return 0; + } + + static void example_rwlock_exit(void) + { + printk("example_rwlock exit\n"); + } + + module_init(example_rwlock_init); + module_exit(example_rwlock_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Read/Write locks example"); + MODULE_LICENSE("GPL"); + +.. raw:: html + +
+ +Of course if you know for sure that there are no functions triggered by +irqs which could possibly interfere with your logic then you can use the +simpler *read_lock(&myrwlock)* and *read_unlock(&myrwlock)* or the +corresponding write functions. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Atomic operations + :name: orgadbf448 + +.. raw:: html + +
+ +If you're doing simple arithmetic: adding, subtracting or bitwise +operations then there's another way in the multi-CPU and +multi-hyperthreaded world to stop other parts of the system from messing +with your mojo. By using atomic operations you can be confident that +your addition, subtraction or bit flip did actually happen and wasn't +overwritten by some other shenanigans. An example is shown below. + +.. raw:: html + +
+ +:: + + #include + #include + #include + + #define BYTE_TO_BINARY_PATTERN "%c%c%c%c%c%c%c%c" + #define BYTE_TO_BINARY(byte) \ + (byte & 0x80 ? '1' : '0'), \ + (byte & 0x40 ? '1' : '0'), \ + (byte & 0x20 ? '1' : '0'), \ + (byte & 0x10 ? '1' : '0'), \ + (byte & 0x08 ? '1' : '0'), \ + (byte & 0x04 ? '1' : '0'), \ + (byte & 0x02 ? '1' : '0'), \ + (byte & 0x01 ? '1' : '0') + + static void atomic_add_subtract(void) + { + atomic_t debbie; + atomic_t chris = ATOMIC_INIT(50); + + atomic_set(&debbie, 45); + + /* subtract one */ + atomic_dec(&debbie); + + atomic_add(7, &debbie); + + /* add one */ + atomic_inc(&debbie); + + printk("chris: %d, debbie: %d\n", + atomic_read(&chris), atomic_read(&debbie)); + } + + static void atomic_bitwise(void) + { + unsigned long word = 0; + + printk("Bits 0: "BYTE_TO_BINARY_PATTERN, BYTE_TO_BINARY(word)); + set_bit(3, &word); + set_bit(5, &word); + printk("Bits 1: "BYTE_TO_BINARY_PATTERN, BYTE_TO_BINARY(word)); + clear_bit(5, &word); + printk("Bits 2: "BYTE_TO_BINARY_PATTERN, BYTE_TO_BINARY(word)); + change_bit(3, &word); + + printk("Bits 3: "BYTE_TO_BINARY_PATTERN, BYTE_TO_BINARY(word)); + if (test_and_set_bit(3, &word)) + printk("wrong\n"); + printk("Bits 4: "BYTE_TO_BINARY_PATTERN, BYTE_TO_BINARY(word)); + + word = 255; + printk("Bits 5: "BYTE_TO_BINARY_PATTERN, BYTE_TO_BINARY(word)); + } + + static int example_atomic_init(void) + { + printk("example_atomic started\n"); + + atomic_add_subtract(); + atomic_bitwise(); + + return 0; + } + + static void example_atomic_exit(void) + { + printk("example_atomic exit\n"); + } + + module_init(example_atomic_init); + module_exit(example_atomic_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Atomic operations example"); + MODULE_LICENSE("GPL"); + +.. raw:: html + +
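The arithmetic in the module can be checked in user space with C11 `<stdatomic.h>`. The sketch below is an analogy, not the kernel API: the `demo_` helpers are hypothetical names that imitate `set_bit`, `clear_bit`, `change_bit` and `test_and_set_bit`, and the sequence of operations follows the example above.

```c
#include <stdatomic.h>

/* Mirrors atomic_add_subtract(): 45 - 1 + 7 + 1. */
static int demo_add_subtract(void)
{
    atomic_int debbie;

    atomic_init(&debbie, 45);
    atomic_fetch_sub(&debbie, 1);   /* like atomic_dec()    -> 44 */
    atomic_fetch_add(&debbie, 7);   /* like atomic_add(7,)  -> 51 */
    atomic_fetch_add(&debbie, 1);   /* like atomic_inc()    -> 52 */
    return atomic_load(&debbie);
}

static void demo_set_bit(unsigned nr, atomic_ulong *w)
{
    atomic_fetch_or(w, 1UL << nr);
}

static void demo_clear_bit(unsigned nr, atomic_ulong *w)
{
    atomic_fetch_and(w, ~(1UL << nr));
}

static void demo_change_bit(unsigned nr, atomic_ulong *w)
{
    atomic_fetch_xor(w, 1UL << nr);
}

/* Sets the bit and returns its previous value (0 or 1) in one atomic step. */
static int demo_test_and_set_bit(unsigned nr, atomic_ulong *w)
{
    unsigned long old = atomic_fetch_or(w, 1UL << nr);

    return (old >> nr) & 1UL;
}

/* Mirrors atomic_bitwise() up to the "Bits 4" line; returns the word. */
static unsigned long demo_bitwise(void)
{
    atomic_ulong word;

    atomic_init(&word, 0);
    demo_set_bit(3, &word);         /* 0x08 */
    demo_set_bit(5, &word);         /* 0x28 */
    demo_clear_bit(5, &word);       /* 0x08 */
    demo_change_bit(3, &word);      /* 0x00 */
    if (demo_test_and_set_bit(3, &word))
        return ~0UL;                /* bit 3 was unexpectedly set */
    return atomic_load(&word);      /* 0x08 */
}
```

Each helper is a single read-modify-write operation, so no other thread can observe or overwrite a half-finished update. `demo_add_subtract()` returns 52 and `demo_bitwise()` returns 0x08, matching what the module's sequence of operations should leave in `debbie` and in the low byte of `word`.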
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Replacing Printks + :name: org7974c60 + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Replacing printk + :name: org1c8b17b + +.. raw:: html + +
+ +In Section 1.2.1.2, I said that X and kernel module programming don't +mix. That's true for developing kernel modules, but in actual use, you +want to be able to send messages to whichever tty the command to load +the module came from. + +"tty" is an abbreviation of *teletype*: originally a combination +keyboard-printer used to communicate with a Unix system, and today an +abstraction for the text stream used for a Unix program, whether it's a +physical terminal, an xterm on an X display, a network connection used +with ssh, etc. + +The way this is done is by using current, a pointer to the currently +running task, to get the current task's tty structure. Then, we look +inside that tty structure to find a pointer to a string write function, +which we use to write a string to the tty. + +.. raw:: html + +
+ +:: + + /* + * print_string.c - Send output to the tty we're running on, regardless if it's + * through X11, telnet, etc. We do this by printing the string to the tty + * associated with the current task. + */ + #include + #include + #include + #include /* For current */ + #include /* For the tty declarations */ + #include /* For LINUX_VERSION_CODE */ + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Peter Jay Salzman"); + + static void print_string(char *str) + { + struct tty_struct *my_tty; + const struct tty_operations *ttyops; + + /* + * tty struct went into signal struct in 2.6.6 + */ + #if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,5) ) + /* + * The tty for the current task + */ + my_tty = current->tty; + #else + /* + * The tty for the current task, for 2.6.6+ kernels + */ + my_tty = get_current_tty(); + #endif + ttyops = my_tty->driver->ops; + + /* + * If my_tty is NULL, the current task has no tty you can print to + * (ie, if it's a daemon). If so, there's nothing we can do. + */ + if (my_tty != NULL) { + + /* + * my_tty->driver is a struct which holds the tty's functions, + * one of which (write) is used to write strings to the tty. + * It can be used to take a string either from the user's or + * kernel's memory segment. + * + * The function's 1st parameter is the tty to write to, + * because the same function would normally be used for all + * tty's of a certain type. The 2nd parameter controls + * whether the function receives a string from kernel + * memory (false, 0) or from user memory (true, non zero). + * BTW: this param has been removed in Kernels > 2.6.9 + * The (2nd) 3rd parameter is a pointer to a string. + * The (3rd) 4th parameter is the length of the string. + * + * As you will see below, sometimes it's necessary to use + * preprocessor stuff to create code that works for different + * kernel versions. The (naive) approach we've taken here + * does not scale well. 
The right way to deal with this + * is described in section 2 of + * linux/Documentation/SubmittingPatches + */ + (ttyops->write) (my_tty, /* The tty itself */ + #if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,9) ) + 0, /* Don't take the string + from user space */ + #endif + str, /* String */ + strlen(str)); /* Length */ + + /* + * ttys were originally hardware devices, which (usually) + * strictly followed the ASCII standard. In ASCII, to move to + * a new line you need two characters, a carriage return and a + * line feed. On Unix, the ASCII line feed is used for both + * purposes - so we can't just use \n, because it wouldn't have + * a carriage return and the next line will start at the + * column right after the line feed. + * + * This is why text files are different between Unix and + * MS Windows. In CP/M and derivatives, like MS-DOS and + * MS Windows, the ASCII standard was strictly adhered to, + * and therefore a newline requires both a LF and a CR. + */ + + #if ( LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,9) ) + (ttyops->write) (my_tty, 0, "\015\012", 2); + #else + (ttyops->write) (my_tty, "\015\012", 2); + #endif + } + } + + static int __init print_string_init(void) + { + print_string("The module has been inserted. Hello world!"); + return 0; + } + + static void __exit print_string_exit(void) + { + print_string("The module has been removed. Farewell world!"); + } + + module_init(print_string_init); + module_exit(print_string_exit); + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Flashing keyboard LEDs + :name: org418d823 + +.. raw:: html + +
+ +In certain conditions, you may desire a simpler and more direct way to +communicate with the external world. Flashing keyboard LEDs can be such a +solution: it is an immediate way to attract attention or to display a +status condition. Keyboard LEDs are present on almost all hardware, they +are always visible, they do not need any setup, and their use is rather +simple and non-intrusive, compared to writing to a tty or a file. + +The following source code illustrates a minimal kernel module which, +when loaded, starts blinking the keyboard LEDs until it is unloaded. + +.. raw:: html + +
+ +:: + + /* + * kbleds.c - Blink keyboard leds until the module is unloaded. + */ + + #include + #include + #include /* for fg_console */ + #include /* For fg_console, MAX_NR_CONSOLES */ + #include /* For KDSETLED */ + #include + #include /* For vc_cons */ + + MODULE_DESCRIPTION("Example module illustrating the use of Keyboard LEDs."); + MODULE_AUTHOR("Daniele Paolo Scarpazza"); + MODULE_LICENSE("GPL"); + + struct timer_list my_timer; + struct tty_driver *my_driver; + char kbledstatus = 0; + + #define BLINK_DELAY HZ/5 + #define ALL_LEDS_ON 0x07 + #define RESTORE_LEDS 0xFF + + /* + * Function my_timer_func blinks the keyboard LEDs periodically by invoking + * command KDSETLED of ioctl() on the keyboard driver. To learn more on virtual + * terminal ioctl operations, please see file: + * /usr/src/linux/drivers/char/vt_ioctl.c, function vt_ioctl(). + * + * The argument to KDSETLED is alternatively set to 7 (thus causing the led + * mode to be set to LED_SHOW_IOCTL, and all the leds are lit) and to 0xFF + * (any value above 7 switches back the led mode to LED_SHOW_FLAGS, thus + * the LEDs reflect the actual keyboard status). To learn more on this, + * please see file: + * /usr/src/linux/drivers/char/keyboard.c, function setledstate(). 
+ * + */ + + static void my_timer_func(unsigned long ptr) + { + unsigned long *pstatus = (unsigned long *)ptr; + struct tty_struct* t = vc_cons[fg_console].d->port.tty; + + if (*pstatus == ALL_LEDS_ON) + *pstatus = RESTORE_LEDS; + else + *pstatus = ALL_LEDS_ON; + + (my_driver->ops->ioctl) (t, KDSETLED, *pstatus); + + my_timer.expires = jiffies + BLINK_DELAY; + add_timer(&my_timer); + } + + static int __init kbleds_init(void) + { + int i; + + printk(KERN_INFO "kbleds: loading\n"); + printk(KERN_INFO "kbleds: fgconsole is %x\n", fg_console); + for (i = 0; i < MAX_NR_CONSOLES; i++) { + if (!vc_cons[i].d) + break; + printk(KERN_INFO "kbleds: console[%i/%i] #%i, tty %lx\n", i, + MAX_NR_CONSOLES, vc_cons[i].d->vc_num, + (unsigned long)vc_cons[i].d->port.tty); + } + printk(KERN_INFO "kbleds: finished scanning consoles\n"); + + my_driver = vc_cons[fg_console].d->port.tty->driver; + printk(KERN_INFO "kbleds: tty driver magic %x\n", my_driver->magic); + + /* + * Set up the LED blink timer the first time + */ + init_timer(&my_timer); + my_timer.function = my_timer_func; + my_timer.data = (unsigned long)&kbledstatus; + my_timer.expires = jiffies + BLINK_DELAY; + add_timer(&my_timer); + + return 0; + } + + static void __exit kbleds_cleanup(void) + { + printk(KERN_INFO "kbleds: unloading...\n"); + del_timer(&my_timer); + (my_driver->ops->ioctl) (vc_cons[fg_console].d->port.tty, + KDSETLED, RESTORE_LEDS); + } + + module_init(kbleds_init); + module_exit(kbleds_cleanup); + +.. raw:: html + +
+ +If none of the examples in this chapter fit your debugging needs, there +might yet be some other tricks to try. Ever wondered what +CONFIG_DEBUG_LL in make menuconfig is good for? If you activate that you +get low level access to the serial port. While this might not sound very +powerful by itself, you can patch kernel/printk/printk.c or any other essential +syscall to use printascii, thus making it possible to trace virtually +everything your code does over a serial line. If you find yourself +porting the kernel to some new and formerly unsupported architecture, this +is usually amongst the first things that should be implemented. Logging +over a netconsole might also be worth a try. + +While you have seen lots of stuff that can be used to aid debugging +here, there are some things to be aware of. Debugging is almost always +intrusive. Adding debug code can change the situation enough to make the +bug seem to disappear. Thus you should try to keep debug code to a +minimum and make sure it does not show up in production code. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Scheduling Tasks + :name: orgf37d73f + +.. raw:: html + +
+ +There are two main ways of running tasks: tasklets and work queues. +Tasklets are a quick and easy way of scheduling a single function to be +run, for example when triggered from an interrupt, whereas work queues +are more complicated but also better suited to running multiple things +in a sequence. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Tasklets + :name: org32525a8 + +.. raw:: html + +
+ +Here's an example tasklet module. The *tasklet_fn* function runs for a +few seconds and in the meantime execution of the *example_tasklet_init* +function continues to the exit point. + +.. raw:: html + +
+ +:: + + #include <linux/kernel.h> + #include <linux/module.h> + #include <linux/delay.h> + #include <linux/interrupt.h> + + static void tasklet_fn(unsigned long data) + { + printk("Example tasklet starts\n"); + mdelay(5000); + printk("Example tasklet ends\n"); + } + + DECLARE_TASKLET(mytask, tasklet_fn, 0L); + + static int example_tasklet_init(void) + { + printk("tasklet example init\n"); + tasklet_schedule(&mytask); + mdelay(200); + printk("Example tasklet init continues...\n"); + return 0; + } + + static void example_tasklet_exit(void) + { + printk("tasklet example exit\n"); + tasklet_kill(&mytask); + } + + module_init(example_tasklet_init); + module_exit(example_tasklet_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Tasklet example"); + MODULE_LICENSE("GPL"); + +.. raw:: html + +
+ +So with this example loaded *dmesg* should show: + +.. raw:: html + +
+ +:: + + tasklet example init + Example tasklet starts + Example tasklet init continues... + Example tasklet ends + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Work queues + :name: orge8a2d87 + +.. raw:: html + +
+ +Very often, we have "housekeeping" tasks which have to be done at a +certain time, or every so often. If the task is to be done by a process, +we do it by putting it in the crontab file. If the task is to be done by +a kernel module, we have two possibilities. The first is to put a +process in the crontab file which will wake up the module by a system +call when necessary, for example by opening a file. This is terribly +inefficient, however: we run a new process from crontab, read a new +executable into memory, and all this just to wake up a kernel module which +is in memory anyway. + +Instead of doing that, we can create a function that will be called +at regular intervals. The way we do this is we create a task, held +in a delayed_work structure, which holds a pointer to the +function. Then, we use queue_delayed_work to put that task on a work +queue called my_workqueue, from which it is run once the requested +delay (100 jiffies in the code below) has elapsed. Because we want the function to keep on being +executed, it needs to put itself back on my_workqueue whenever it is called, +ready for the next run. + +There's one more point we need to remember here. When a module is +removed by rmmod, first its reference count is checked. If it is zero, +cleanup_module is called. Then, the module is removed from memory with +all its functions. Things need to be shut down properly, or bad things +will happen. The code below shows how this can be done in a safe way. + +.. raw:: html + +
+ +:: + + /* + * sched.c - schedule a function to be called on every timer interrupt. + * + * Copyright (C) 2001 by Peter Jay Salzman + */ + + /* + * The necessary header files + */ + + /* + * Standard in kernel modules + */ + #include /* We're doing kernel work */ + #include /* Specifically, a module */ + #include /* Necessary because we use the proc fs */ + #include /* We schedule tasks here */ + #include /* We need to put ourselves to sleep + and wake up later */ + #include /* For __init and __exit */ + #include /* For irqreturn_t */ + + struct proc_dir_entry *Our_Proc_File; + #define PROC_ENTRY_FILENAME "sched" + #define MY_WORK_QUEUE_NAME "WQsched.c" + + /* + * some work_queue related functions + * are just available to GPL licensed Modules + */ + MODULE_LICENSE("GPL"); + + /* + * The number of times the timer interrupt has been called so far + */ + static int TimerIntrpt = 0; + + static void intrpt_routine(struct work_struct *work); + + static int die = 0; /* set this to 1 for shutdown */ + + /* + * The work queue structure for this task, from workqueue.h + */ + static struct workqueue_struct *my_workqueue; + + static struct delayed_work Task; + static DECLARE_DELAYED_WORK(Task, intrpt_routine); + + /* + * This function will be called on every timer interrupt. Notice the void* + * pointer - task functions can be used for more than one purpose, each time + * getting a different parameter. + */ + static void intrpt_routine(struct work_struct *work) + { + /* + * Increment the counter + */ + TimerIntrpt++; + + /* + * If cleanup wants us to die + */ + if (die == 0) + queue_delayed_work(my_workqueue, &Task, 100); + } + + /* + * Put data into the proc fs file. 
+ */ + int + procfile_read(char *buffer, + char **buffer_location, + off_t offset, int buffer_length, int *eof, void *data) + { + int len; /* The number of bytes actually used */ + + /* + * It's static so it will still be in memory + * when we leave this function + */ + static char my_buffer[80]; + + /* + * We give all of our information in one go, so if anybody asks us + * if we have more information the answer should always be no. + */ + if (offset > 0) + return 0; + + /* + * Fill the buffer and get its length + */ + len = sprintf(my_buffer, "Timer called %d times so far\n", TimerIntrpt); + + /* + * Tell the function which called us where the buffer is + */ + *buffer_location = my_buffer; + + /* + * Return the length + */ + return len; + } + + /* + * Initialize the module - register the proc file + */ + int __init init_module() + { + /* + * Create our /proc file + */ + Our_Proc_File = proc_create(PROC_ENTRY_FILENAME, 0644, NULL, NULL); + + if (Our_Proc_File == NULL) { + remove_proc_entry(PROC_ENTRY_FILENAME, NULL); + printk(KERN_ALERT "Error: Could not initialize /proc/%s\n", + PROC_ENTRY_FILENAME); + return -ENOMEM; + } + proc_set_size(Our_Proc_File, 80); + proc_set_user(Our_Proc_File, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID); + + /* + * Put the task in the work_timer task queue, so it will be executed at + * next timer interrupt + */ + my_workqueue = create_workqueue(MY_WORK_QUEUE_NAME); + queue_delayed_work(my_workqueue, &Task, 100); + + printk(KERN_INFO "/proc/%s created\n", PROC_ENTRY_FILENAME); + + return 0; + } + + /* + * Cleanup + */ + void __exit cleanup_module() + { + /* + * Unregister our /proc file + */ + remove_proc_entry(PROC_ENTRY_FILENAME, NULL); + printk(KERN_INFO "/proc/%s removed\n", PROC_ENTRY_FILENAME); + + die = 1; /* keep intrp_routine from queueing itself */ + cancel_delayed_work(&Task); /* no "new ones" */ + flush_workqueue(my_workqueue); /* wait till all "old ones" finished */ + destroy_workqueue(my_workqueue); + + /* + * Sleep until 
intrpt_routine is called one last time. This is + * necessary, because otherwise we'll deallocate the memory holding + * intrpt_routine and Task while work_timer still references them. + * Notice that here we don't allow signals to interrupt us. + * + * Since WaitQ is now not NULL, this automatically tells the interrupt + * routine it's time to die. + */ + } + +.. raw:: html + +
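The interplay between the self-requeueing work function and the die flag in the listing above can be modelled in plain user-space C. This is only a toy model under stated assumptions: toy_queue, toy_work and toy_tick are hypothetical names, and a single function pointer stands in for the real work queue machinery.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-slot "work queue": at most one function is pending. */
struct toy_queue {
    void (*pending)(struct toy_queue *q); /* NULL when nothing is queued */
    int runs;                             /* how often the work has run */
    bool die;                             /* set to true to stop requeueing */
};

/* The work function: count the run, then put itself back on the queue
 * unless shutdown has been requested (compare intrpt_routine above). */
static void toy_work(struct toy_queue *q)
{
    q->runs++;
    if (!q->die)
        q->pending = toy_work;
}

/* One "timer expiry": take the pending work off the queue and run it. */
static void toy_tick(struct toy_queue *q)
{
    void (*fn)(struct toy_queue *q) = q->pending;

    q->pending = NULL;
    if (fn)
        fn(q);
}
```

Once die is set, the work runs at most one more time and then leaves the queue empty, which is the state cleanup needs to reach before the module's code can be unloaded safely.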
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Interrupt Handlers + :name: orgbc0cdf8 + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Interrupt Handlers + :name: org93511bb + +.. raw:: html + +
+ +Except for the last chapter, everything we did in the kernel so far +we've done as a response to a process asking for it, either by dealing +with a special file, sending an ioctl(), or issuing a system call. But +the job of the kernel isn't just to respond to process requests. Another +job, which is every bit as important, is to speak to the hardware +connected to the machine. + +There are two types of interaction between the CPU and the rest of the +computer's hardware. The first type is when the CPU gives orders to the +hardware, the other is when the hardware needs to tell the CPU +something. The second, called interrupts, is much harder to implement +because it has to be dealt with when convenient for the hardware, not +the CPU. Hardware devices typically have a very small amount of RAM, and +if you don't read their information when available, it is lost. + +Under Linux, hardware interrupts are called IRQs (Interrupt ReQuests). +There are two types of IRQs, short and long. A short IRQ is one which +is expected to take a very short period of time, during which the rest +of the machine will be blocked and no other interrupts will be handled. +A long IRQ is one which can take longer, and during which other +interrupts may occur (but not interrupts from the same device). If at +all possible, it's better to declare an interrupt handler to be long. + +When the CPU receives an interrupt, it stops whatever it's doing (unless +it's processing a more important interrupt, in which case it will deal +with this one only when the more important one is done), saves certain +parameters on the stack and calls the interrupt handler. This means that +certain things are not allowed in the interrupt handler itself, because +the system is in an unknown state. 
The solution to this problem is for +the interrupt handler to do what needs to be done immediately, usually +read something from the hardware or send something to the hardware, and +then schedule the handling of the new information at a later time (this +is called the "bottom half") and return. The kernel is then guaranteed +to call the bottom half as soon as possible, and when it does, +everything allowed in kernel modules will be allowed. + +The way to implement this is to call **request_irq()** to get your +interrupt handler called when the relevant IRQ is received. + +In practice IRQ handling can be a bit more complex. Hardware is often +designed in a way that chains two interrupt controllers, so that all the +IRQs from interrupt controller B are cascaded to a certain IRQ from +interrupt controller A. Of course that requires that the kernel finds +out which IRQ it really was afterwards and that adds overhead. Other +architectures offer some special, very low overhead, so called "fast +IRQs" or FIQs. Taking advantage of them requires handlers to be written +in assembler, so they do not really fit into the kernel. They can be +made to work similarly to the others, but after that procedure, they're no +longer any faster than "common" IRQs. SMP enabled kernels running on +systems with more than one processor need to solve another truckload of +problems. It's not enough to know that a certain IRQ has happened, it's +also important to know which CPU(s) it was for. People still interested in +more details might want to do a web search for "APIC" now ;) + +The request_irq() function receives the IRQ number, the name of the function, flags, +a name for /proc/interrupts and a parameter to pass to the interrupt +handler. Usually there is a certain number of IRQs available. How many +IRQs there are is hardware-dependent. 
The flags can include IRQF_SHARED (named SA_SHIRQ in old kernels) to +indicate you're willing to share the IRQ with other interrupt handlers +(usually because a number of hardware devices sit on the same IRQ). Old +kernels also had SA_INTERRUPT to indicate a fast interrupt, but that flag +no longer exists in the kernels this guide targets. This function will +only succeed if there isn't already a handler on this IRQ, or if you're +both willing to share. + +.. raw:: html + +
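The split between an urgent top half and a deferred bottom half can be illustrated without any hardware. The sketch below is a user-space model only: toy_ring, toy_top_half and toy_bottom_half are hypothetical names, a small ring buffer stands in for the data a device hands over, and nothing here uses the kernel's interrupt API.

```c
#include <stddef.h>

#define TOY_RING_SIZE 8

/* Hypothetical buffer shared between the two halves. One slot is always
 * left empty so that head == tail unambiguously means "empty". */
struct toy_ring {
    unsigned char slots[TOY_RING_SIZE];
    size_t head;                 /* next slot the top half writes */
    size_t tail;                 /* next slot the bottom half reads */
    unsigned long processed_sum; /* result of the "slow" processing */
    unsigned int dropped;        /* bytes lost because the ring was full */
};

/* "Top half": do the bare minimum. Store the byte read from the device
 * and return at once. If the ring is full, the byte is lost. */
static void toy_top_half(struct toy_ring *r, unsigned char hw_byte)
{
    size_t next = (r->head + 1) % TOY_RING_SIZE;

    if (next == r->tail) {
        r->dropped++;
        return;
    }
    r->slots[r->head] = hw_byte;
    r->head = next;
}

/* "Bottom half": runs later and drains everything that accumulated.
 * Returns the number of bytes it handled. */
static unsigned int toy_bottom_half(struct toy_ring *r)
{
    unsigned int handled = 0;

    while (r->tail != r->head) {
        r->processed_sum += r->slots[r->tail];
        r->tail = (r->tail + 1) % TOY_RING_SIZE;
        handled++;
    }
    return handled;
}
```

The dropped counter models the point made earlier about hardware having very little memory: if the bottom half does not run often enough, data that arrives in the meantime is lost.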
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Detecting button presses + :name: org77533ca + +.. raw:: html + +
+ +Many popular single board computers, such as Raspberry Pis or +Beagleboards, have a bunch of GPIO pins. Attaching buttons to those and +then having a button press do something is a classic case in which you +might need to use interrupts: instead of having the CPU waste +time and battery power polling for a change in input state, it's better +for the input to trigger the CPU to run a particular handling +function. + +Here's an example where buttons are connected to GPIO numbers 17 and 18 +and an LED is connected to GPIO 4. You can change those numbers to +whatever is appropriate for your board. + +.. raw:: html + +
+ +:: + + /* + * intrpt.c - Handling GPIO with interrupts + * + * Copyright (C) 2017 by Bob Mottram + * Based upon the Rpi example by Stefan Wendler (devnull@kaltpost.de) + * from: + * https://github.com/wendlers/rpi-kmod-samples + * + * Press one button to turn on a LED and another to turn it off + */ + + #include + #include + #include + #include + + static int button_irqs[] = { -1, -1 }; + + /* Define GPIOs for LEDs. + Change the numbers for the GPIO on your board. */ + static struct gpio leds[] = { + { 4, GPIOF_OUT_INIT_LOW, "LED 1" } + }; + + /* Define GPIOs for BUTTONS + Change the numbers for the GPIO on your board. */ + static struct gpio buttons[] = { + { 17, GPIOF_IN, "LED 1 ON BUTTON" }, + { 18, GPIOF_IN, "LED 1 OFF BUTTON" } + }; + + /* + * interrupt function triggered when a button is pressed + */ + static irqreturn_t button_isr(int irq, void *data) + { + /* first button */ + if (irq == button_irqs[0] && !gpio_get_value(leds[0].gpio)) + gpio_set_value(leds[0].gpio, 1); + /* second button */ + else if(irq == button_irqs[1] && gpio_get_value(leds[0].gpio)) + gpio_set_value(leds[0].gpio, 0); + + return IRQ_HANDLED; + } + + int init_module() + { + int ret = 0; + + printk(KERN_INFO "%s\n", __func__); + + /* register LED gpios */ + ret = gpio_request_array(leds, ARRAY_SIZE(leds)); + + if (ret) { + printk(KERN_ERR "Unable to request GPIOs for LEDs: %d\n", ret); + return ret; + } + + /* register BUTTON gpios */ + ret = gpio_request_array(buttons, ARRAY_SIZE(buttons)); + + if (ret) { + printk(KERN_ERR "Unable to request GPIOs for BUTTONs: %d\n", ret); + goto fail1; + } + + printk(KERN_INFO "Current button1 value: %d\n", + gpio_get_value(buttons[0].gpio)); + + ret = gpio_to_irq(buttons[0].gpio); + + if (ret < 0) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + goto fail2; + } + + button_irqs[0] = ret; + + printk(KERN_INFO "Successfully requested BUTTON1 IRQ # %d\n", + button_irqs[0]); + + ret = request_irq(button_irqs[0], button_isr, + 
IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING, + "gpiomod#button1", NULL); + + if (ret) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + goto fail2; + } + + + ret = gpio_to_irq(buttons[1].gpio); + + if (ret < 0) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + /* the first IRQ is already registered, so release it too */ + goto fail3; + } + + button_irqs[1] = ret; + + printk(KERN_INFO "Successfully requested BUTTON2 IRQ # %d\n", + button_irqs[1]); + + ret = request_irq(button_irqs[1], button_isr, + IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING, + "gpiomod#button2", NULL); + + if (ret) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + goto fail3; + } + + return 0; + + /* cleanup what has been setup so far */ + fail3: + free_irq(button_irqs[0], NULL); + + fail2: + gpio_free_array(buttons, ARRAY_SIZE(buttons)); + + fail1: + gpio_free_array(leds, ARRAY_SIZE(leds)); + + return ret; + } + + void cleanup_module() + { + int i; + + printk(KERN_INFO "%s\n", __func__); + + /* free irqs */ + free_irq(button_irqs[0], NULL); + free_irq(button_irqs[1], NULL); + + /* turn all LEDs off */ + for (i = 0; i < ARRAY_SIZE(leds); i++) + gpio_set_value(leds[i].gpio, 0); + + /* unregister */ + gpio_free_array(leds, ARRAY_SIZE(leds)); + gpio_free_array(buttons, ARRAY_SIZE(buttons)); + } + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Handle some GPIO interrupts"); + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Bottom Half + :name: orgdb452ba + +.. raw:: html + +
+ +Suppose you want to do a bunch of stuff inside of an interrupt routine. +A common way to do that without rendering the interrupt unavailable for +a significant duration is to combine it with a tasklet. This pushes the +bulk of the work off into the scheduler. + +The example below modifies the previous example to also run an +additional task when an interrupt is triggered. + +.. raw:: html + +
+ +:: + + /* + * bottomhalf.c - Top and bottom half interrupt handling + * + * Copyright (C) 2017 by Bob Mottram + * Based upon the Rpi example by Stefan Wendler (devnull@kaltpost.de) + * from: + * https://github.com/wendlers/rpi-kmod-samples + * + * Press one button to turn on a LED and another to turn it off + */ + + #include + #include + #include + #include + #include + + static int button_irqs[] = { -1, -1 }; + + /* Define GPIOs for LEDs. + Change the numbers for the GPIO on your board. */ + static struct gpio leds[] = { + { 4, GPIOF_OUT_INIT_LOW, "LED 1" } + }; + + /* Define GPIOs for BUTTONS + Change the numbers for the GPIO on your board. */ + static struct gpio buttons[] = { + { 17, GPIOF_IN, "LED 1 ON BUTTON" }, + { 18, GPIOF_IN, "LED 1 OFF BUTTON" } + }; + + /* Tasklet containing some non-trivial amount of processing */ + static void bottomhalf_tasklet_fn(unsigned long data) + { + printk("Bottom half tasklet starts\n"); + /* do something which takes a while */ + mdelay(500); + printk("Bottom half tasklet ends\n"); + } + + DECLARE_TASKLET(buttontask, bottomhalf_tasklet_fn, 0L); + + /* + * interrupt function triggered when a button is pressed + */ + static irqreturn_t button_isr(int irq, void *data) + { + /* Do something quickly right now */ + if (irq == button_irqs[0] && !gpio_get_value(leds[0].gpio)) + gpio_set_value(leds[0].gpio, 1); + else if(irq == button_irqs[1] && gpio_get_value(leds[0].gpio)) + gpio_set_value(leds[0].gpio, 0); + + /* Do the rest at leisure via the scheduler */ + tasklet_schedule(&buttontask); + + return IRQ_HANDLED; + } + + int init_module() + { + int ret = 0; + + printk(KERN_INFO "%s\n", __func__); + + /* register LED gpios */ + ret = gpio_request_array(leds, ARRAY_SIZE(leds)); + + if (ret) { + printk(KERN_ERR "Unable to request GPIOs for LEDs: %d\n", ret); + return ret; + } + + /* register BUTTON gpios */ + ret = gpio_request_array(buttons, ARRAY_SIZE(buttons)); + + if (ret) { + printk(KERN_ERR "Unable to request GPIOs for 
BUTTONs: %d\n", ret); + goto fail1; + } + + printk(KERN_INFO "Current button1 value: %d\n", + gpio_get_value(buttons[0].gpio)); + + ret = gpio_to_irq(buttons[0].gpio); + + if (ret < 0) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + goto fail2; + } + + button_irqs[0] = ret; + + printk(KERN_INFO "Successfully requested BUTTON1 IRQ # %d\n", + button_irqs[0]); + + ret = request_irq(button_irqs[0], button_isr, + IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING, + "gpiomod#button1", NULL); + + if (ret) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + goto fail2; + } + + + ret = gpio_to_irq(buttons[1].gpio); + + if (ret < 0) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + /* the first IRQ is already registered, so release it too */ + goto fail3; + } + + button_irqs[1] = ret; + + printk(KERN_INFO "Successfully requested BUTTON2 IRQ # %d\n", + button_irqs[1]); + + ret = request_irq(button_irqs[1], button_isr, + IRQF_TRIGGER_RISING | IRQF_TRIGGER_FALLING, + "gpiomod#button2", NULL); + + if (ret) { + printk(KERN_ERR "Unable to request IRQ: %d\n", ret); + goto fail3; + } + + return 0; + + /* cleanup what has been setup so far */ + fail3: + free_irq(button_irqs[0], NULL); + + fail2: + gpio_free_array(buttons, ARRAY_SIZE(buttons)); + + fail1: + gpio_free_array(leds, ARRAY_SIZE(leds)); + + return ret; + } + + void cleanup_module() + { + int i; + + printk(KERN_INFO "%s\n", __func__); + + /* free irqs */ + free_irq(button_irqs[0], NULL); + free_irq(button_irqs[1], NULL); + + /* make sure no bottom half is still pending or running */ + tasklet_kill(&buttontask); + + /* turn all LEDs off */ + for (i = 0; i < ARRAY_SIZE(leds); i++) + gpio_set_value(leds[i].gpio, 0); + + /* unregister */ + gpio_free_array(leds, ARRAY_SIZE(leds)); + gpio_free_array(buttons, ARRAY_SIZE(buttons)); + } + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Interrupt with top and bottom half"); + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Crypto + :name: org627e987 + +.. raw:: html + +
+ +At the dawn of the internet everybody trusted everybody completely…but +that didn't work out so well. When this guide was originally written it +was a more innocent era in which almost nobody actually gave a damn +about crypto - least of all kernel developers. That's certainly no +longer the case now. To handle crypto stuff the kernel has its own API +enabling common methods of encryption, decryption and your favourite +hash functions. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Hash functions + :name: org0d560c3 + +.. raw:: html + +
+ +Calculating and checking the hashes of things is a common operation. +Here is a demonstration of how to calculate a sha256 hash within a +kernel module. + +.. raw:: html + +
+ +:: + + #include <linux/module.h> + #include <crypto/hash.h> + + #define SHA256_LENGTH (256/8) + + static void show_hash_result(char * plaintext, char * hash_sha256) + { + int i; + char str[SHA256_LENGTH*2 + 1]; + + printk("sha256 test for string: \"%s\"\n", plaintext); + for (i = 0; i < SHA256_LENGTH ; i++) + sprintf(&str[i*2],"%02x", (unsigned char)hash_sha256[i]); + str[i*2] = 0; + printk("%s\n", str); + } + + int cryptosha256_init(void) + { + char * plaintext = "This is a test"; + char hash_sha256[SHA256_LENGTH]; + struct crypto_shash *sha256; + struct shash_desc *shash; + int ret = -1; + + sha256 = crypto_alloc_shash("sha256", 0, 0); + if (IS_ERR(sha256)) + return -1; + + shash = + kmalloc(sizeof(struct shash_desc) + crypto_shash_descsize(sha256), + GFP_KERNEL); + if (!shash) { + ret = -ENOMEM; + goto out_free_tfm; + } + + shash->tfm = sha256; + shash->flags = 0; + + if (crypto_shash_init(shash)) + goto out_free_desc; + + if (crypto_shash_update(shash, plaintext, strlen(plaintext))) + goto out_free_desc; + + if (crypto_shash_final(shash, hash_sha256)) + goto out_free_desc; + + show_hash_result(plaintext, hash_sha256); + ret = 0; + + /* release the descriptor and the transform on success and on error alike */ + out_free_desc: + kfree(shash); + out_free_tfm: + crypto_free_shash(sha256); + + return ret; + } + + void cryptosha256_exit(void) + { + } + + module_init(cryptosha256_init); + module_exit(cryptosha256_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("sha256 hash test"); + MODULE_LICENSE("GPL"); + +.. raw:: html + +
+ +Make and install the module: + +.. raw:: html + +
+ +:: + + make + sudo insmod cryptosha256.ko + dmesg + +.. raw:: html + +
+ +And you should see that the hash was calculated for the test string. + +Finally, remove the test module: + +.. raw:: html + +
+ +:: + + sudo rmmod cryptosha256 + +.. raw:: html + +
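The hex formatting step done by show_hash_result in the module above can be tried out in user space before loading anything into the kernel. A minimal sketch, assuming a hypothetical helper named toy_bytes_to_hex and a caller that provides an output buffer of at least 2 * len + 1 bytes:

```c
#include <stddef.h>
#include <stdio.h>

/* Convert len raw bytes into a lowercase hex string.
 * out must have room for 2 * len + 1 characters. */
static void toy_bytes_to_hex(const unsigned char *bytes, size_t len, char *out)
{
    size_t i;

    /* each byte becomes exactly two hex digits; snprintf is given room
     * for those two digits plus its terminating NUL */
    for (i = 0; i < len; i++)
        snprintf(&out[i * 2], 3, "%02x", bytes[i]);
    out[len * 2] = '\0';
}
```

Treating the digest as unsigned char matters here: with a plain char that happens to be signed, bytes above 0x7f would be sign-extended by %02x, which is why the module casts each byte before printing it.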
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Symmetric key encryption + :name: org4e331ef + +.. raw:: html + +
+ +Here is an example of symmetrically encrypting a string using the AES +algorithm and a password. + +.. raw:: html + +
+ +:: + + #include + #include + #include + + #define SYMMETRIC_KEY_LENGTH 32 + #define CIPHER_BLOCK_SIZE 16 + + struct tcrypt_result { + struct completion completion; + int err; + }; + + struct skcipher_def { + struct scatterlist sg; + struct crypto_skcipher * tfm; + struct skcipher_request * req; + struct tcrypt_result result; + char * scratchpad; + char * ciphertext; + char * ivdata; + }; + + static struct skcipher_def sk; + + static void test_skcipher_finish(struct skcipher_def * sk) + { + if (sk->tfm) + crypto_free_skcipher(sk->tfm); + if (sk->req) + skcipher_request_free(sk->req); + if (sk->ivdata) + kfree(sk->ivdata); + if (sk->scratchpad) + kfree(sk->scratchpad); + if (sk->ciphertext) + kfree(sk->ciphertext); + } + + static int test_skcipher_result(struct skcipher_def * sk, int rc) + { + switch (rc) { + case 0: + break; + case -EINPROGRESS: + case -EBUSY: + rc = wait_for_completion_interruptible( + &sk->result.completion); + if (!rc && !sk->result.err) { + reinit_completion(&sk->result.completion); + break; + } + default: + printk("skcipher encrypt returned with %d result %d\n", + rc, sk->result.err); + break; + } + + init_completion(&sk->result.completion); + + return rc; + } + + static void test_skcipher_callback(struct crypto_async_request *req, int error) + { + struct tcrypt_result *result = req->data; + int ret; + + if (error == -EINPROGRESS) + return; + + result->err = error; + complete(&result->completion); + printk("Encryption finished successfully\n"); + } + + static int test_skcipher_encrypt(char * plaintext, char * password, + struct skcipher_def * sk) + { + int ret = -EFAULT; + unsigned char key[SYMMETRIC_KEY_LENGTH]; + + if (!sk->tfm) { + sk->tfm = crypto_alloc_skcipher("cbc-aes-aesni", 0, 0); + if (IS_ERR(sk->tfm)) { + printk("could not allocate skcipher handle\n"); + return PTR_ERR(sk->tfm); + } + } + + if (!sk->req) { + sk->req = skcipher_request_alloc(sk->tfm, GFP_KERNEL); + if (!sk->req) { + printk("could not allocate skcipher request\n"); 
+ ret = -ENOMEM; + goto out; + } + } + + skcipher_request_set_callback(sk->req, CRYPTO_TFM_REQ_MAY_BACKLOG, + test_skcipher_callback, + &sk->result); + + /* clear the key */ + memset((void*)key,'\0',SYMMETRIC_KEY_LENGTH); + + /* Use the world's favourite password */ + sprintf((char*)key,"%s",password); + + /* AES 256 with given symmetric key */ + if (crypto_skcipher_setkey(sk->tfm, key, SYMMETRIC_KEY_LENGTH)) { + printk("key could not be set\n"); + ret = -EAGAIN; + goto out; + } + printk("Symmetric key: %s\n", key); + printk("Plaintext: %s\n", plaintext); + + if (!sk->ivdata) { + /* see https://en.wikipedia.org/wiki/Initialization_vector */ + sk->ivdata = kmalloc(CIPHER_BLOCK_SIZE, GFP_KERNEL); + if (!sk->ivdata) { + printk("could not allocate ivdata\n"); + goto out; + } + get_random_bytes(sk->ivdata, CIPHER_BLOCK_SIZE); + } + + if (!sk->scratchpad) { + /* The text to be encrypted */ + sk->scratchpad = kmalloc(CIPHER_BLOCK_SIZE, GFP_KERNEL); + if (!sk->scratchpad) { + printk("could not allocate scratchpad\n"); + goto out; + } + } + sprintf((char*)sk->scratchpad,"%s",plaintext); + + sg_init_one(&sk->sg, sk->scratchpad, CIPHER_BLOCK_SIZE); + skcipher_request_set_crypt(sk->req, &sk->sg, &sk->sg, + CIPHER_BLOCK_SIZE, sk->ivdata); + init_completion(&sk->result.completion); + + /* encrypt data */ + ret = crypto_skcipher_encrypt(sk->req); + ret = test_skcipher_result(sk, ret); + if (ret) + goto out; + + printk("Encryption request successful\n"); + + out: + return ret; + } + + int cryptoapi_init(void) + { + /* The world's favourite password */ + char * password = "password123"; + + sk.tfm = NULL; + sk.req = NULL; + sk.scratchpad = NULL; + sk.ciphertext = NULL; + sk.ivdata = NULL; + + test_skcipher_encrypt("Testing", password, &sk); + return 0; + } + + void cryptoapi_exit(void) + { + test_skcipher_finish(&sk); + } + + module_init(cryptoapi_init); + module_exit(cryptoapi_exit); + + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Symmetric key encryption example"); + 
MODULE_LICENSE("GPL"); + +.. raw:: html + +
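The module above turns the password into a 32 byte key by clearing the key buffer and copying the password into it. The user-space sketch below shows the same zero padding with an explicit length check, since an unbounded sprintf would write past the buffer for long passwords. The names toy_password_to_key and TOY_KEY_LENGTH are made up for illustration, and padding a password with zeros is only the simplification used by this example, not a proper key derivation.

```c
#include <stddef.h>
#include <string.h>

#define TOY_KEY_LENGTH 32

/* Build a fixed-size key from a password by zero padding.
 * Returns 0 on success, or -1 if the password does not fit. */
static int toy_password_to_key(const char *password,
                               unsigned char key[TOY_KEY_LENGTH])
{
    size_t len = strlen(password);

    if (len > TOY_KEY_LENGTH)
        return -1;

    memset(key, 0, TOY_KEY_LENGTH); /* clear the key */
    memcpy(key, password, len);     /* copy without a trailing NUL */
    return 0;
}
```

The same care applies to the scratchpad in the listing: it holds a single 16 byte cipher block, so any plaintext copied into it has to be checked against that size first.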
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Standardising the interfaces: The Device Model + :name: org01d6493 + +.. raw:: html + +
+ +Up to this point we've seen all kinds of modules doing all kinds of +things, but there was no consistency in their interfaces with the rest +of the kernel. To impose some consistency, such that there is at minimum +a standardised way to start, suspend and resume a device, a device model +was added. An example is shown below, and you can use this as a template +to add your own suspend, resume or other interface functions. + +.. raw:: html + +
+ +:: + + #include + #include + #include + + struct devicemodel_data { + char *greeting; + int number; + }; + + static int devicemodel_probe(struct platform_device *dev) + { + struct devicemodel_data *pd = (struct devicemodel_data *)(dev->dev.platform_data); + + printk("devicemodel probe\n"); + printk("devicemodel greeting: %s; %d\n", pd->greeting, pd->number); + + /* Your device initialisation code */ + + return 0; + } + + static int devicemodel_remove(struct platform_device *dev) + { + printk("devicemodel example removed\n"); + + /* Your device removal code */ + + return 0; + } + + static int devicemodel_suspend(struct device *dev) + { + printk("devicemodel example suspend\n"); + + /* Your device suspend code */ + + return 0; + } + + static int devicemodel_resume(struct device *dev) + { + printk("devicemodel example resume\n"); + + /* Your device resume code */ + + return 0; + } + + static const struct dev_pm_ops devicemodel_pm_ops = + { + .suspend = devicemodel_suspend, + .resume = devicemodel_resume, + .poweroff = devicemodel_suspend, + .freeze = devicemodel_suspend, + .thaw = devicemodel_resume, + .restore = devicemodel_resume + }; + + static struct platform_driver devicemodel_driver = { + .driver = { + .name = "devicemodel_example", + .owner = THIS_MODULE, + .pm = &devicemodel_pm_ops, + }, + .probe = devicemodel_probe, + .remove = devicemodel_remove, + }; + + static int devicemodel_init(void) + { + int ret; + + printk("devicemodel init\n"); + + ret = platform_driver_register(&devicemodel_driver); + + if (ret) { + printk(KERN_ERR "Unable to register driver\n"); + return ret; + } + + return 0; + } + + static void devicemodel_exit(void) + { + printk("devicemodel exit\n"); + platform_driver_unregister(&devicemodel_driver); + } + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Bob Mottram"); + MODULE_DESCRIPTION("Linux Device Model example"); + + module_init(devicemodel_init); + module_exit(devicemodel_exit); + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Optimisations + :name: org87293ce + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Likely and Unlikely conditions + :name: org87e8223 + +.. raw:: html + +
+
+Sometimes you might want your code to run as quickly as possible,
+especially if it's handling an interrupt or doing something which might
+cause noticeable latency. If your code contains boolean conditions and
+you know that a condition almost always evaluates to *true*, or almost
+always to *false*, then you can let the compiler optimise for the
+common case using the *likely* and *unlikely* macros.
+
+For example, when allocating memory you almost always expect the
+allocation to succeed.
+
+.. raw:: html
+
+
+ +:: + + bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx); + if (unlikely(!bvl)) { + mempool_free(bio, bio_pool); + bio = NULL; + goto out; + } + +.. raw:: html + +
+
+When the *unlikely* macro is used, the compiler arranges its machine
+instruction output so that execution continues straight along the false
+branch and only jumps if the condition is true. That avoids a taken
+branch, and the pipeline disruption it can cause, in the common case.
+The opposite happens if you use the *likely* macro.
+
+.. raw:: html
+
+
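As a rough illustration, the kernel builds these macros on the compiler's `__builtin_expect`. The user-space sketch below repeats that definition so it can be compiled outside the kernel with GCC or Clang; the helper `sum_valid` and its error convention are hypothetical, made up for this example.

```c
#include <stddef.h>

/* Same shape as the kernel's definitions in include/linux/compiler.h,
 * repeated here so the sketch builds in user space. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Hypothetical helper: sum the values in buf, treating a negative entry
 * as a rare error. Returns 0 on success and -1 on bad arguments or on the
 * first negative entry, in which case *total is left untouched.
 */
static int sum_valid(const int *buf, size_t len, long *total)
{
    long acc = 0;
    size_t i;

    if (unlikely(buf == NULL || total == NULL))
        return -1;

    for (i = 0; i < len; i++) {
        /* Error path: hinted as rare, so the compiler is free to move
         * it out of the straight-line loop body. */
        if (unlikely(buf[i] < 0))
            return -1;
        acc += buf[i];
    }

    *total = acc;
    return 0;
}
```

The hints only influence how the generated code is laid out; they never change the result. A wrong hint therefore costs speed, not correctness.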
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Common Pitfalls + :name: org79dea20 + +.. raw:: html + +
+ +Before I send you on your way to go out into the world and write kernel +modules, there are a few things I need to warn you about. If I fail to +warn you and something bad happens, please report the problem to me for +a full refund of the amount I was paid for your copy of the book. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Using standard libraries + :name: org86275d7 + +.. raw:: html + +
+ +You can't do that. In a kernel module you can only use kernel functions, +which are the functions you can see in /proc/kallsyms. + +.. raw:: html + +
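Each line of /proc/kallsyms has the form `address type name`, optionally followed by a module name in square brackets. As a small sketch, the user-space helper below splits one such line; the names `ksym` and `parse_kallsyms_line` are invented for this example and are not part of any kernel or library API.

```c
#include <stdio.h>
#include <string.h>

/* One parsed /proc/kallsyms entry (hypothetical struct for this sketch). */
struct ksym {
    unsigned long long addr; /* symbol address; often shown as 0 to unprivileged users */
    char type;               /* nm-style type letter, e.g. 'T' or 't' for code */
    char name[128];          /* symbol name */
    char module[64];         /* owning module, or "" for built-in symbols */
};

/*
 * Parse a line such as "ffffffffc0001000 t my_func\t[my_module]".
 * Returns 0 on success, -1 if the line does not match the format.
 */
static int parse_kallsyms_line(const char *line, struct ksym *out)
{
    int n;

    memset(out, 0, sizeof(*out));

    /* The module field is optional, so three matched fields are enough. */
    n = sscanf(line, "%llx %c %127s [%63[^]]]",
               &out->addr, &out->type, out->name, out->module);

    return (n >= 3) ? 0 : -1;
}
```

A program would typically read /proc/kallsyms line by line with `fgets`, pass each line to this helper, and compare `name` against the function it is looking for. Bear in mind that the file lists every symbol the kernel knows about, while a module can only link against those the kernel exports.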
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Disabling interrupts + :name: org8646229 + +.. raw:: html + +
+ +You might need to do this for a short time and that is OK, but if you +don't enable them afterwards, your system will be stuck and you'll have +to power it off. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Sticking your head inside a large carnivore + :name: org58c8bc4 + +.. raw:: html + +
+ +I probably don't have to warn you about this, but I figured I will +anyway, just in case. + +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. rubric:: Where To Go From Here? + :name: org2307e11 + +.. raw:: html + +
+
+I could easily have squeezed a few more chapters into this book. I could
+have added a chapter about creating new file systems, or about adding
+new protocol stacks (as if there's a need for that – you'd have to dig
+underground to find a protocol stack not supported by Linux). I could
+have added explanations of the kernel mechanisms we haven't touched
+upon, such as bootstrapping or the disk interface.
+
+However, I chose not to. My purpose in writing this book was to provide
+initiation into the mysteries of kernel module programming and to teach
+the common techniques for that purpose. For people seriously interested
+in kernel programming, I recommend
+`kernelnewbies.org <https://kernelnewbies.org>`__ and the
+*Documentation* subdirectory within the kernel source code, which isn't
+always easy to understand but can be a starting point for further
+investigation. Also, as Linus said, the best way to learn the kernel is
+to read the source code yourself.
+
+If you're interested in more examples of short kernel modules, then
+searching on sites such as GitHub and GitLab is a good way to start,
+although there is a lot of duplication of older LKMPG examples which may
+not compile with newer kernel versions. You will also find examples of
+kernel modules used to attack or compromise systems or to exfiltrate
+data. Those can be useful for thinking about how to defend systems and
+for learning about existing security mechanisms within the kernel.
+
+I hope I have helped you in your quest to become a better programmer, or
+at least to have fun through technology. And, if you do write useful
+kernel modules, I hope you publish them under the GPL, so I can use them
+too.
+
+If you'd like to contribute to this guide, notice anything glaringly
+wrong, or just want to add extra sarcastic remarks, perhaps involving
+monkeys or some other kind of animal, then please file an issue or, even
+better, submit a pull request at https://github.com/bashrc/LKMPG.
+
+.. raw:: html
+
+
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+ +.. raw:: html + +
+
+.. raw:: html
+
+
+ +.. raw:: html + +
diff --git a/4.12.12/examples/Makefile b/4.12.12/examples/Makefile
index 770a22c..fff6966 100644
--- a/4.12.12/examples/Makefile
+++ b/4.12.12/examples/Makefile
@@ -36,4 +36,4 @@ all:
 
 clean:
 	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
-	rm other/ioctl other/cat_noblock *.plist
+	rm -f other/ioctl other/cat_noblock *.plist