Communicating with the kernel

According to Wikipedia, an operating system usually segregates virtual memory into kernel space and user space. Kernel space is strictly reserved for running the kernel, device drivers and kernel extensions. In most operating systems, kernel memory is never swapped out to disk. User space is the memory area where all user mode applications work and this memory can be swapped out when necessary.

A user application cannot access kernel space directly, and kernel code cannot safely access user space without first checking whether the page is present in memory or has been swapped out. Although the two cannot access each other directly, user space and kernel space can communicate in a variety of ways. This is an attempt to summarize those available under Linux:

  • System calls
    As the syscalls(2) manual page explains, the system call is the fundamental interface between an application and the kernel.

    A system call is a request by a running task to the kernel to provide some sort of service on its behalf. In general, the kernel services invoked by system calls comprise an abstraction layer between hardware and user-space programs.

    How a system call is entered depends on the processor architecture. Usually the application triggers a software interrupt or a trap instruction, and the call carries a request telling the kernel which service to execute.

    Actual code for the system_call entry point can be found in /usr/src/linux/kernel/sys_call.S and the code for many of the system calls can be found in /usr/src/linux/kernel/sys.c. Code for the rest is distributed throughout the source files. Some system calls, like fork, have their own source file (e.g., kernel/fork.c).

    Pros: Well-documented, easy-to-use interface.

  • ioctl
    Even though ioctl is itself a system call, it is useful enough to deserve separate treatment. Sometimes referred to as a Swiss Army knife, ioctl stands for input/output control and is typically used to manipulate a character device via a file descriptor. It is a catch-all function that takes a file descriptor, a request code and an optional request-specific argument. The request codes and argument types are defined in header files.

    Pros: Gives programmers hardware-level access to devices.
    Cons: Devices and requests are often poorly documented and platform-specific.

  • procfs
    The /proc file system (procfs) is a special file system in the Linux kernel. It is a virtual file system: it is not backed by a block device and exists only in memory.

    The files in procfs give userland programs access to certain kernel information (such as per-process information in /proc/[0-9]+/) and also serve debugging purposes (such as /proc/ksyms, which became /proc/kallsyms in 2.6 kernels).

    Pros: Uniform interface for retrieving process information from the command line.
    Cons: The required information can be hard to find, because so much miscellaneous information is exposed through a single file system.

  • Sysfs
    Sysfs is a virtual file system introduced in the 2.6 Linux kernel. It exports information about devices and drivers from the kernel device model to userspace, and is also used for configuration.

    Sysfs is designed to export the information present in the device tree, so that it no longer clutters up procfs.

    For each object added in the driver model tree (drivers, devices including class devices) a directory in sysfs is created.

    The parent/child relationship is reflected with subdirectories under /sys/devices/ (reflecting the physical layout). The subdirectory /sys/bus/ is populated with symbolic links, reflecting how the devices belong to different busses. /sys/class/ shows devices grouped according to classes, like network, while /sys/block/ contains the block devices.

    Pros: Uniform interface for retrieving and updating device-specific information.

  • Netlink sockets
    A netlink socket is a special IPC mechanism for transferring information between the kernel and user-space processes. It provides a full-duplex communication link between the two, using the standard socket APIs on the user-space side and a special kernel API for kernel modules.

    Netlink sockets use the address family AF_NETLINK.

    The standard socket APIs (socket(), sendmsg(), recvmsg() and close()) can be used by user-space applications to access a netlink socket.

    Pros: Adding system calls, ioctls or proc files for new features is a nontrivial task that risks polluting the kernel and damaging the stability of the system. A netlink socket is simple by comparison: only a constant, the protocol type, needs to be added to netlink.h.

  • relay
    Relayfs is yet another virtual filesystem implemented by the kernel; it must be explicitly mounted by user space to be available. Kernel code can then create a relay with relay_open(), which shows up as a file under relayfs.

    User space can then open the relay and employ all of the usual file operations, including mmap() and poll(), to exchange data with the kernel.

    To an application, a relayfs file descriptor looks much like a Unix-domain socket, except that the other end is a piece of kernel code rather than another process.

    The interface on the kernel side is a bit more complex. The expected relay_read() and relay_write() functions exist and can be used to move data to and from user space.

    But relayfs also exposes much of the internal structure to kernel code that needs to know about it. So special-purpose code can obtain a pointer into the relayfs buffer and copy data there directly.

    Pros: Makes it easier to pass large amounts of data to and from kernel space.

  • debugfs
    debugfs is an in-kernel filesystem meant purely for debugging data that would otherwise end up in proc or sysfs, and it is easier to use than either of them.

    debugfs is meant for putting stuff that kernel developers need to see exported to userspace, yet don't always want hanging around.

    To create a file using debugfs the call is just:
    struct dentry *debugfs_create_file(const char *name, mode_t mode, struct dentry *parent, void *data, struct file_operations *fops);

    To export a single value to userspace:
    struct dentry *debugfs_create_u8(const char *name, mode_t mode, struct dentry *parent, u8 *value);

    That's it, one line of code and a variable can be read and written to from userspace.

    Pros: Shortens development time with debugging information easily available.

  • Firmware loading
    While most computer peripherals work right "out of the box," some will not function properly until the host system has downloaded a blob of binary firmware.

    The recommended way of dealing with devices that need a firmware download is to have a user-space process handle it.

    Under this scheme, a device driver needing firmware for a particular device makes a call to:
    int request_firmware(struct firmware **fw, const char *name, struct device *device);

    Here, name is the name of the relevant device, and device is its device model entry. This call will create a directory with the given name under /sys/class/firmware and populate it with two files called loading and data. A hotplug event is then generated which, presumably, will inspire user space to find some firmware to feed the device.

    The resulting user-space process starts by setting the loading sysfs attribute to a value of one. The actual firmware can then be written to the data file; when the process is complete, the loading file should be set back to zero. At that point, request_firmware() will return to the driver with fw pointing to the actual firmware data. The user-space process can choose to abort the firmware load by writing -1 to the loading attribute.

    When the driver has loaded the firmware into its device, it should free up the associated memory with:
    void release_firmware(struct firmware *fw);

    Although this mechanism is specific to firmware loading, it is sometimes used to pass large amounts of data into kernel space.

    Kernel mode HTML parser

    System calls are the means by which user-level processes communicate with the kernel. However, the Linux kernel also allows kernel code to invoke system calls.

    This is generally not considered a good idea in terms of debugging, maintaining and porting the code. But when performance or size is critical, moving an application into the kernel seems to offer large benefits.

    The performance gain comes from avoiding the costly user/kernel space transitions and the associated data passing.

    In order to measure the timing benefits, I implemented a rudimentary HTML parser in kernel space and a similar parser in userland.

    Code snippets from the kernel module, which reads an HTML file and removes the HTML tags, are as follows (complete source available here):

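    	/* Time an empty measurement to get the overhead of the timer itself */
    	best = ~0;
    	measure_time(0);
    	tsc = best;
    	printk(KERN_INFO "Time taken for no code: %ld\n", tsc);
    
    	/* Measure the time taken to read a file */
    	/* Widen the address limit so read() accepts a kernel buffer
    	   (older kernels only; set_fs() has been removed from newer ones) */
    	fs = get_fs();		/* save the previous value */
    	set_fs(get_ds());	/* use the kernel limit */
    	/* Open the file from kernel space */
    	fd = filp_open(FILE_NAME, O_RDONLY, 0600);
    
    	/* filp_open() returns an ERR_PTR on failure, so check before dereferencing */
    	if (!IS_ERR(fd) && fd->f_op && fd->f_op->read) {
    	    best = ~0;
    	    measure_time(fd->f_op->read(fd, html, 1000, &fd->f_pos));
    	    printk(KERN_INFO "Time taken by read: %ld\n", best - tsc);
    	    parse_html(html, text);	/* parse HTML to text */
    	    printk(KERN_INFO "Parsed text: %s", text);
    	}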
    

    This code parses the HTML by calling an ugly parser, parse_html (defined in common.h, available here), which strips out the HTML tags.

    Part of the equivalent userland code is as follows (complete source available here):

    	/* Time an empty measurement (rdtsc overhead only, i.e. no code) */
    	best = ~0;
    	measure_time(0);
    	tsc = best;
    	printf("Time taken for no code: %ld\n", tsc);
    	
    	/* Measure the time taken to read a file */
    	fd = open(FILE_NAME, O_RDONLY);
    	if (fd < 0) {	/* open() returns -1 on failure, not 0 */
    	    printf("Error opening file\n");
    	    exit(1);
    	}
    	best = ~0;
    	measure_time(read(fd, html, 1000));
    	printf("Time taken by read: %ld\n", best - tsc);
    	parse_html(html, text);	/* strip the HTML tags */
    	printf("Parsed text: %s\n", text);
    

    I collected the following read times for the first 5 runs:

    Clock ticks per run     1st run  2nd run  3rd run  4th run  5th run
    Kernel HTML parser          246      366      245      351      246
    Userland HTML parser        675      683      561      683      675

    Thus, the file read in the kernel is a little over twice as fast as in userland (roughly 2.3 times, whether comparing the averages or the best runs).

    There are a couple of interesting possibilities for porting applications that require high performance to kernel space. A few already exist, including a kernel-mode web server. Of course, a crash in a poorly tested kernel module can cost far more than a crash in its userland counterpart.

    ReiserFS: To be or not to be

    There seems to be quite a lot of debate on LKML over whether or not to include ReiserFS in the kernel.
    ReiserFS has run into problems over a couple of things. Firstly, the way it was pushed by Hans Reiser was not liked by many.

    Then there was talk about the reliability of the file system. As someone pointed out:
    "The fact that reiserfs uses a single B-tree to store all of its data means that very entertaining things can happen if you lose a sector containing a high-level node in the tree. It's even more entertaining if you have image files (like initrd files) in reiserfs format stored in reiserfs, and you run the recovery program on the filesystem."

    Another problem with ReiserFS is its quest to integrate everything within the filesystem. For example, it has plugins that can alter the semantics of files, such as turning files into directories inside which you can see meta-files like file/uid and file/size; these contain metadata and are accessible as normal files to all the Unix tools. You could get the effect of something like chown by just doing
    'echo root >file/owner'.

    Whether this is a good idea is quite debatable, as the Unix world has long believed in doing one thing well and keeping it simple. The next step in this direction could be to parse archives in kernel space to support a 'cd linux-2.6.17.tar.bz2' (or is that already implemented?), which does not sound like a good idea.
    Moreover, this may require a couple of changes in the VFS.

    I recently noticed that the readv system call is missing in ReiserFS when I tried calling it from kernel space.

    Someone wrote an article on why ReiserFS is not included in the kernel:
    http://wiki.kernelnewbies.org/WhyReiser4IsNotIn

    Still, there seems to be quite a lot of development going on in ReiserFS, and it is installed as part of SUSE distributions. Some of the ideas implemented by this file system appear to be unique and may be useful to other filesystems implemented in the future.