Archive for November, 2013

sysfs, procfs, sysctl, debugfs and other similar kernel interfaces

Posted on November 20, 2013. Filed under: Kernel

2. Procfs, Sysfs, and Similar Kernel Interfaces

2.1 Introduction

These file systems are optional to the Linux kernel, and may not be enabled on your system. The file “/lib/modules/`uname -r`/build/.config” will tell you how your kernel is configured.

In order to exchange data between user space and kernel space, the Linux kernel provides a number of RAM-based file systems. These interfaces are themselves based on files. Usually a file represents a single value, but it may also represent a set of values. User space can access these values by means of the standard read(2) and write(2) functions. For most of these file systems, a read or write results in a call to a callback function in the Linux kernel, which has access to the corresponding value.

Despite offering similar functionality, the different RAM based file systems are all designed for separate purposes. However it is easy to use these file systems for other purposes as well. Questions such as “Which file system should be used?” or “Why is there a need for the different file systems?” often arise on the Linux kernel mailing list. The arguments are controversial and each developer seems to have a unique view.

The benefit of using read and write, in comparison to socket-based approaches for example, is that user space has a lot of tools available to send data to kernel space (e.g. cat(1), echo(1)). These programs are well known to users and can be used in scripts.
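The same files can of course be read from a program as well as from cat(1). Here is a minimal user-space sketch, assuming a small single-value file that fits into one read(2) call; the function name read_value_file is made up for this example and is not part of any kernel or libc API.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read the contents of a single-value file (for example a file below
 * /proc or /sys) into buf and strip one trailing newline.
 * Returns the number of bytes stored in buf, or -1 on error.
 * read_value_file is a hypothetical helper name used for illustration.
 */
ssize_t read_value_file(const char *path, char *buf, size_t len)
{
    if (len == 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Assumption: the value is small, so a single read(2) returns all of it. */
    ssize_t n = read(fd, buf, len - 1);
    close(fd);
    if (n < 0)
        return -1;

    /* Drop a trailing newline, if any, and terminate the string. */
    if (n > 0 && buf[n - 1] == '\n')
        n--;
    buf[n] = '\0';
    return n;
}
```

Called with a path below /proc/sys and a small buffer, it should leave the value as a plain string in the buffer, much like cat(1) would print it.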

2.2 Procfs


The procfs, located in /proc, is the best known interface of this class. It was originally designed to export all kinds of process information, such as the current status of a process or all its open file descriptors, to user space. Despite its initial purpose, the procfs has been used for a lot of other things:

  • provide information about the running system such as cpu information, information about interrupts, about the available memory or the version of the kernel.
  • information about “ide devices”, “scsi devices” and “tty’s”.
  • networking information such as the arp table, network statistics or lists of used sockets

There is a special subdirectory: /proc/sys. It allows a lot of parameters of the running system to be configured. Usually each file consists of a single value. This value may correspond to:

  • a limit (e.g. maximum buffer size)
  • a switch that turns a given functionality on or off (for example routing)
  • some other kernel variable

None of the directories and files below /proc/sys/ are implemented with the procfs interface. Instead they use a mechanism called sysctl. See section 2.6 (Sysctl) for further details.

Note that, despite the wide use of the procfs, using it for anything other than process information is considered deprecated. It should only be used to export information related to a process itself.


In order to use the procfs, support for it has to be compiled into the Linux kernel. This is done by setting the configuration option CONFIG_PROC_FS=y. In most standard configurations this is enabled by default.

Procfs supports two different APIs for kernel modules:

  • The legacy procfs API: It is easy to use as long as the amount of data to be handled is small. In this context small means smaller than one page (PAGE_SIZE), which is 4096 bytes on i386 systems.
  • The seq_file API: seq_file was designed to facilitate the handling of read requests. It supports read requests for more than PAGE_SIZE bytes and it provides a mechanism to traverse a list, collect the elements of the list, and send all elements to user space.

1. Legacy procfs API

procfs.c: example module for the legacy procfs API

The legacy procfs API allows for the creation of files and directories. For each file you have to specify two callback functions: One which is executed when a user reads the file and the other when a user writes to the file. The use of this API is well described in the “Linux Kernel Procfs Guide” distributed with the Linux kernel source code. Therefore we give here only a very basic example: A module which creates a directory as well as a file. If your file provides more than PAGE_SIZE bytes of data it is easy to get things wrong. This is due to the API of the read function:
read(char *page, char **start, off_t off, int count, int *eof, void *data)
The first parameter of this function is a buffer with the size corresponding to one page. Hence, if there is more data, the read has to be split in multiple pieces.
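To see why the splitting is easy to get wrong, here is a user-space model of the bookkeeping involved. It is only a sketch and not the kernel API: serve_chunk and MODEL_PAGE_SIZE are names invented for this example, and the real callback additionally has to deal with the start pointer.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096 /* stand-in for the kernel's PAGE_SIZE on i386 */

/*
 * User-space model of a page-limited read callback (hypothetical helper):
 * copy the part of data[0..total) that starts at offset off into page,
 * but never more than one page and never more than count bytes.
 * *eof is set to 1 once the last byte has been delivered.
 * Returns the number of bytes copied.
 */
size_t serve_chunk(const char *data, size_t total, size_t off,
                   char *page, size_t count, int *eof)
{
    /* Nothing left to deliver: signal end of file. */
    if (off >= total) {
        *eof = 1;
        return 0;
    }

    /* Limit the piece to what remains, what was asked for, and one page. */
    size_t n = total - off;
    if (n > count)
        n = count;
    if (n > MODEL_PAGE_SIZE)
        n = MODEL_PAGE_SIZE;

    memcpy(page, data + off, n);
    *eof = (off + n == total);
    return n;
}
```

With 10000 bytes of data, a request for 8192 bytes at offset 0 only gets one page back, so the reader has to come back with a new offset until eof is set. Keeping off consistent across those calls is where mistakes tend to creep in.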

2. Seq_file API

The seq_file API is concerned with read requests solely – no writes. It hides the PAGE_SIZE boundary from the developer and it provides an API to step through a series of objects, collect the data from each of them and put all those data in the file. An example module can be found at

Further Reading and Resources

  • LWN article “Driver porting: The seq_file interface”
  • Example module that uses seq_file, accompanying the LWN article
  • Linux Magazine article “Manipulating “Seq” Files”, which covers the legacy API as well as seq_file
  • seq_file how-to on kernelnewbies (Documents/SeqFileHowTo)
  • “Linux Kernel Procfs Guide”, a description of the legacy procfs API, available in the Linux kernel source code as Documentation/DocBook/procfs-guide.tmpl
  • “The /proc Filesystem”, a description of the entries in /proc, available in the Linux kernel source code as Documentation/filesystems/proc.txt

2.3 Sysfs


Sysfs was designed to represent the whole device model as seen from the Linux kernel. It contains information about devices, drivers and buses and their interconnections. In order to represent the hierarchy and the interconnections, sysfs is heavily structured and contains a lot of links between the individual directories. As of kernel 2.6.23 it contains the following 9 top-level directories:

  • sys/block/ all known block devices such as hda/ ram/ sda/
  • sys/bus/ all registered buses. Each directory below bus/ holds by default two subdirectories:
    • devices/ for all devices attached to that bus
    • drivers/ for all drivers associated with that bus
  • sys/class/ for each device type there is a subdirectory: for example /printer or /sound
  • sys/devices/ all devices known by the kernel, organised by the bus they are connected to
  • sys/firmware/ files in this directory handle the firmware of some hardware devices
  • sys/fs/ files to control a file system, currently used by FUSE, a user space file system implementation
  • sys/kernel/ holds directories (mount points) for other filesystems such as debugfs, securityfs.
  • sys/module/ each kernel module loaded is represented with a directory.
  • sys/power/ files to handle the power state of some hardware


In order to use sysfs, support for it has to be compiled into the Linux kernel. This is done by setting the configuration option CONFIG_SYSFS=y.

The philosophy behind sysfs is to represent each value with a dedicated file. In addition each file has a maximum size of PAGE_SIZE bytes.

For a kernel module there are three ways to use a file below /sys:

  1. a module parameter
  2. registering a new subsystem
  3. debugfs, mounted at /sys/kernel/debug (see section 2.5)

Module Parameter API

Similar to command line arguments for applications, Linux kernel modules may accept a set of parameters. These parameters can be specified not only upon module insertion but also during module run time. A module parameter can be defined with the following macro:
module_param_named(name, value, type, perm)
This macro creates a parameter called "name" which corresponds to the variable with name "value" of type "type". There are many predefined types such as byte (for a single character), int (for an integer) or charp (for a string). It is also possible to add new types. The file include/linux/moduleparam.h lists the predefined types and explains how to define new ones.

The macro creates a file called /sys/module/module_name/parameters/name with the access rights specified by perm. Depending on the specified access rights, the file, and thereby the parameter value, can be read or written. If perm is set to 0 the file is not created, and therefore the parameter cannot be accessed during run time.

The module does not receive a notification when a user reads or writes a given parameter; the value is changed silently. Therefore the module cannot react when a parameter changes its value. This may be acceptable in some circumstances, such as changing a debug level, but in most cases the module wants to do additional work such as sanity checks or updating a data structure.

Standard Sysfs API

The standard sysfs API uses a dedicated terminology: A file is called an attribute, the function executed upon reading an attribute is called show and the one for writing an attribute store.

Before starting with the implementation of a module which uses sysfs you have to figure out which subdirectory it belongs to. If you deal with a bus, it belongs to bus/, with a file system it belongs to fs/ or with a block device it belongs to block/. The API to use depends on the given subdirectory. We first show an example which uses the low level sysfs functions to add a new directory to fs/, and in a second example we show how to add a new entry to the bus/ directory.

1. new fs/ entry

sysfs_ex.c creates the directory /sys/fs/myfs/ along with two files, first and second, each containing a single integer value.

The first step is to declare our subsystem. This can be done with the use of the decl_subsys macro (on top of the file). This macro creates a struct kset with the name myfs_subsys.

The module_init() function performs the proper registration of our subsystem: The macro kobj_set_kset_s initializes myfs_subsys so that it will be part of the fs_subsys. The field myfs_subsys.kobj.ktype points to a structure which holds all the attributes as well as the functions to read and write the attributes. And finally a call to register_subsystem() registers our subsystem.

Files are generally represented by a struct attribute. This struct holds the name as well as the access permission for the corresponding file, but no data. Therefore you have to create your own attribute type which consists of at least the struct attribute and the value corresponding to that file.

By design all attributes share the same show and store functions. Each time one of these two functions is invoked, it gets the corresponding struct attribute as an argument. Therefore, in the show and store functions you can obtain the value corresponding to the file being read or written and manipulate it accordingly. For this purpose you need the macro container_of(ptr, type, member): ptr is a pointer to the member of the struct, type is the type of the struct this member is embedded in, and member is the name of the member within the struct.
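The container_of trick can be tried out in user space. The sketch below uses a simplified version of the macro and a stand-in for struct attribute, so it is not the kernel's definition. The names struct my_attr, my_attr_value and my_attr_set are hypothetical and only serve this example.

```c
#include <stddef.h>

/* User-space stand-in for the kernel's struct attribute (name and mode only). */
struct attribute {
    const char *name;
    unsigned short mode;
};

/* Simplified container_of: step back from a member to the enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Hypothetical attribute type: the value plus the generic attribute.
 * attr is deliberately not the first member, so its offset is non-zero.
 */
struct my_attr {
    int value;
    struct attribute attr;
};

/* What a shared show function does first: recover the enclosing my_attr. */
int my_attr_value(struct attribute *attr)
{
    struct my_attr *a = container_of(attr, struct my_attr, attr);
    return a->value;
}

/* What a shared store function does: update the value behind attr. */
void my_attr_set(struct attribute *attr, int value)
{
    container_of(attr, struct my_attr, attr)->value = value;
}
```

Both helpers only ever see the struct attribute pointer, which is the situation the shared show and store functions are in. Two different my_attr objects therefore end up with two independent values.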

2. new bus/ entry

sysfs_ex2.c: use of sysfs in combination with a bus. It provides the possibility to read and write one value with the help of “my_pseudo_bus”.

First of all we define our bus my_pseudo_bus. Then we create our attribute with the help of the BUS_ATTR macro. In the init function we register our pseudo bus and create a file (attribute). If we wanted more than one attribute, we would have to use BUS_ATTR several times and provide each attribute with its own store and show function. This example is similar to the debugging facility of the scsi bus, which is implemented in drivers/scsi/scsi_debug.c.

2.4 Configfs


The configfs is in some ways the counterpart of sysfs. It can be seen as a filesystem-based manager of kernel objects. An important difference between configfs and sysfs is that in configfs all objects are created from user space with a call to mkdir(2). The kernel responds by creating the attributes (files), which can then be read and written by the user. When the user no longer needs the files, a call to rmdir(2) deletes everything. Therefore the life cycle of a configfs object is fully controlled by user space.

Each time mkdir is invoked a new “config_item” is created by the kernel implementation. This config_item represents the files (attributes), the show and store callback functions as well as the associated value. Therefore each mkdir creates a new directory along with new files which represent new values.

Configfs has the same limitations as sysfs: each file should represent only one value and it should be smaller than PAGE_SIZE bytes.


In order to use configfs, support for it has to be compiled into the Linux kernel. This is done by setting the configuration option CONFIG_CONFIGFS_FS=y.

In order to access configfs it has to be mounted with the following command:
mount -t configfs none /config

The Linux kernel documentation provides a good manual for configfs along with an example module. Therefore we do not describe the configfs implementation aspects.

Resources and Further Reading

  • Linux kernel source code: Documentation/filesystems/configfs

2.5 Debugfs


Debugfs is a simple-to-use RAM-based file system especially designed for debugging purposes. Developers are encouraged to use debugfs instead of procfs in order to obtain debugging information from their kernel code. Debugfs is quite flexible: it provides the possibility to set or get a single value with just one line of code, but the developer may also write custom read/write functions and can use the seq_file interface described in the procfs section.


In order to use debugfs, support for it has to be compiled into the Linux kernel. This is done by setting the configuration option CONFIG_DEBUG_FS=y.

Before debugfs can be accessed it has to be mounted with the following command:
mount -t debugfs none /sys/kernel/debug

debugfs.c: kernel module that implements the “one line” API for a variable of type u8 as well as the API with which you can specify your own read and write functions.

All the “one line” APIs start with debugfs_create_ and are listed in include/linux/debugfs.h

The API with which you can provide your own read and write functions is similar to the one of procfs. In contrast to sysfs, you may create directories and files without having to care about a given hierarchy.

2.6 Sysctl


The sysctl infrastructure is designed to configure kernel parameters at run time. The sysctl interface is heavily used by the Linux networking subsystem. It can be used to configure some core kernel parameters, which are represented as files in /proc/sys/*. The values can be accessed by using cat(1), echo(1) or the sysctl(8) command. If a value is set by the echo command it only persists as long as the kernel is running, and gets lost as soon as the machine is rebooted. In order to change the values permanently they have to be written to the file /etc/sysctl.conf. Upon restarting the machine all values specified in this file are written to the corresponding files in /proc/sys/.
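The names used by sysctl(8) and in /etc/sysctl.conf are the paths below /proc/sys/ with dots in place of slashes, for example net.ipv4.ip_forward. Here is a small sketch of that mapping. The helper name sysctl_name_to_path is an assumption made for this example, and the sketch does not handle names whose components themselves contain a dot (some network interface names do).

```c
#include <stdio.h>
#include <string.h>

/*
 * Translate a dotted sysctl name such as "net.ipv4.ip_forward" into the
 * corresponding path below /proc/sys (hypothetical helper).
 * Returns 0 on success, -1 if the result does not fit into out.
 */
int sysctl_name_to_path(const char *name, char *out, size_t len)
{
    int n = snprintf(out, len, "/proc/sys/%s", name);
    if (n < 0 || (size_t)n >= len)
        return -1;

    /* Skip the "/proc/sys/" prefix, then turn every '.' into '/'. */
    for (char *p = out + strlen("/proc/sys/"); *p != '\0'; p++) {
        if (*p == '.')
            *p = '/';
    }
    return 0;
}
```

The resulting path can then be opened and read or written like any other file, which is essentially what cat(1) and echo(1) do.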


sysctl.c sysctl example module: write an integer to /proc/sys/net/test/value1 and value2 respectively

Each entry in the /proc/sys directory is represented by an entry in a table maintained by the Linux kernel, arranged in a hierarchy. A directory is represented by an entry pointing to a subtable. A file is represented by an entry of type struct ctl_table. This entry consists of the data represented by this file along with some access rules.

New files and directories can be added by expanding one of the subtables. In this example we add a new directory called test below the /proc/sys/net/ directory. Our directory has two files: value1 and value2. Each of these files holds an integer variable which can have a value between 10 and 20. The user root is allowed to change the entries, whereas normal users are only allowed to read them.

Each file is represented with an entry in the test_table[] array:

static ctl_table test_table[] = {
        {
                .ctl_name       = CTL_UNNUMBERED,
                .procname       = "value1",
                .data           = &value1,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = &proc_dointvec_minmax,
                .strategy       = &sysctl_intvec,
                .extra1         = &min,
                .extra2         = &max
        },
        /* the entry for value2 is analogous */
        { .ctl_name = 0 }       /* end of table */
};

The struct ctl_table entries are:

  • .ctl_name: For new entries this has to be CTL_UNNUMBERED (according to Documentation/sysctl/ctl_unnumbered.txt).
  • .procname: The name of the file.
  • .data: A reference to the data we want to be shown in the file.
  • .maxlen: The size of the data.
  • .mode: Access permissions (read, write, execute for user, group, others).
  • .proc_handler: The routine which handles read and write requests. There is a set of default routines declared near the end of include/linux/sysctl.h.
  • .strategy: Some routine that enforces additional access control. In this example it checks that the value to be written is between min and max.

static ctl_table test_net_table[] = {
        {
                .ctl_name       = CTL_UNNUMBERED,
                .procname       = "test",
                .mode           = 0555,
                .child          = test_table
        },
        { .ctl_name = 0 }       /* end of table */
};

This table represents our test directory. The entry .child says that the elements below this directory are represented by the table test_table, discussed above.

static ctl_table test_root_table[] = {
        {
                .ctl_name       = CTL_UNNUMBERED,
                .procname       = "net",
                .mode           = 0555,
                .child          = test_net_table
        },
        { .ctl_name = 0 }       /* end of table */
};

This table represents the directory to which we want to attach our new directory. In this example, this is the net directory. In the module_init() function we have to register this root table with a call to

Resources and Further Reading

  • Linux kernel source code: net/core/sysctl_net_core.c
  • Linux kernel Documentation: Documentation/sysctl/ctl_unnumbered.txt

2.7 Character Devices


As the name suggests, this interface was designed for character device drivers, and is commonly used for communication between user and kernel space. (For example, users with sufficient privileges may write directly to virtual terminal 1 with echo "hi there" > /dev/tty1.)

Each module can register itself as a character device and provide some read and write functions which handle the data. Files representing character devices are located within the /dev directory (where you will also find block devices, but we will not be describing them further). Usually these files correspond to a hardware device.


cdev.c: kernel module that prints its majorNumber to the system log. The minorNumber can be chosen to be 0.

As with all file system based approaches, the module has to specify a read and a write callback function. Therefore, we have to register ourselves with the function
register_chrdev(unsigned int major, const char *name, struct file_operations *ops);
major is the major number of this device. We can set it to 0 to let the kernel choose an appropriate number. name is the name of this character device, as it will be shown in /proc/devices. ops is a pointer to the structure that holds the read and write functions.

In contrast to most file system based approaches seen so far, the user has to create the device file explicitly with a call to:
mknod /dev/arbitrary_name c majorNumber minorNumber


Useful kernel and driver performance tweaks for your Linux server

Posted on November 20, 2013. Filed under: Performance

This article is going to address some kernel and driver tweaks that are interesting and useful. We use several of these in production with excellent performance, but you should proceed with caution and do research prior to trying anything listed below.

Tickless System

The tickless kernel feature allows for on-demand timer interrupts. This means that during idle periods, fewer timer interrupts will fire, which should lead to power savings, cooler running systems, and fewer useless context switches.

Kernel option: CONFIG_NO_HZ=y

Timer Frequency

You can select the rate at which timer interrupts in the kernel will fire. When a timer interrupt fires on a CPU, the process running on that CPU is interrupted while the timer interrupt is handled. Reducing the rate at which the timer fires allows for fewer interruptions of your running processes. This option is particularly useful for servers with multiple CPUs where processes are not running interactively.

Kernel options: CONFIG_HZ_100=y and CONFIG_HZ=100


Connector

The connector module is a kernel module which reports process events such as fork, exec, and exit to userland. This is extremely useful for process monitoring. You can build a simple system (or use an existing one like god) to watch mission-critical processes. If the processes die due to a signal (like SIGSEGV or SIGBUS) or exit unexpectedly, you’ll get an asynchronous notification from the kernel. The processes can then be restarted by your monitor, keeping downtime to a minimum when unexpected events occur.


TCP segmentation offload (TSO)

A popular feature among newer NICs is TCP segmentation offload (TSO). This feature allows the kernel to offload the work of dividing large packets into smaller packets to the NIC. This frees up the CPU to do more useful work and reduces the amount of overhead that the CPU passes along the bus. If your NIC supports this feature, you can enable it with ethtool:

[joe@timetobleed]% sudo ethtool -K eth1 tso on

Let’s quickly verify that this worked:

[joe@timetobleed]% sudo ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on
large receive offload: off

[joe@timetobleed]% dmesg | tail -1
[892528.450378] 0000:04:00.1: eth1: TSO is Enabled

Intel I/OAT DMA Engine

This kernel option enables the Intel I/OAT DMA engine that is present in recent Xeon CPUs. This option increases network throughput as the DMA engine allows the kernel to offload network data copying from the CPU to the DMA engine. This frees up the CPU to do more useful work.

Check to see if it’s enabled:

[joe@timetobleed]% dmesg | grep ioat
ioatdma 0000:00:08.0: setting latency timer to 64
ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, device version 0x12, driver version 3.64
ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X

There’s also a sysfs interface where you can get some statistics about the DMA engine. Check the directories under /sys/class/dma/.


Direct Cache Access (DCA)

Intel’s I/OAT also includes a feature called Direct Cache Access (DCA). DCA allows a driver to warm a CPU cache. A few NIC drivers support DCA; the most popular (to my knowledge) is the Intel 10GbE driver (ixgbe). Refer to your NIC driver documentation to see if your NIC supports DCA. To enable DCA, a switch in the BIOS must be flipped. Some vendors supply machines that support DCA, but don’t expose a switch for DCA. If that is the case, see my last blog post for how to enable DCA manually.

You can check if DCA is enabled:

[joe@timetobleed]% dmesg | grep dca
dca service started, version 1.8

If DCA is possible on your system but disabled you’ll see:

ioatdma 0000:00:08.0: DCA is disabled in BIOS

Which means you’ll need to enable it in the BIOS or manually.

Kernel option: CONFIG_DCA=y


NAPI

The “New API” (NAPI) is a rework of the packet processing code in the kernel to improve performance for high speed networking. NAPI provides two major features [1]:

Interrupt mitigation: High-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.

Packet throttling: When the system is overwhelmed and must drop packets, it’s better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adaptor itself, before the kernel sees them at all.

Many recent NIC drivers automatically support NAPI, so you don’t need to do anything. Some drivers need you to explicitly specify NAPI in the kernel config or on the command line when compiling the driver. If you are unsure, check your driver documentation. A good place to look for docs is in your kernel source under Documentation. If you browse it on the web instead, be sure to select the correct kernel version first!

Older e1000 drivers (newer drivers, do nothing): make CFLAGS_EXTRA=-DE1000_NAPI install

Throttle NIC Interrupts

Some drivers allow the user to specify the rate at which the NIC will generate interrupts. The e1000e driver allows you to pass a command line option, InterruptThrottleRate, when loading the module with insmod. For the e1000e there are two dynamic interrupt throttle mechanisms, specified on the command line as 1 (dynamic) and 3 (dynamic conservative). The adaptive algorithm classifies traffic into different classes and adjusts the interrupt rate appropriately. The difference between dynamic and dynamic conservative is the rate for the “Lowest Latency” traffic class: dynamic (1) has a much more aggressive interrupt rate for this traffic class.

As always, check your driver documentation for more information.

With insmod: insmod e1000e.o InterruptThrottleRate=1

Process and IRQ affinity

Linux allows the user to specify which CPUs processes and interrupt handlers are bound to.

  • Processes: You can use taskset to specify which CPUs a process can run on.
  • Interrupt handlers: The interrupt map can be found in /proc/interrupts, and the affinity for each interrupt can be set in the file smp_affinity in the directory for each interrupt under /proc/irq/.

This is useful because you can pin the interrupt handlers for your NICs to specific CPUs so that when a shared resource is touched (a lock in the network stack) and loaded to a CPU cache, the next time the handler runs, it will be put on the same CPU avoiding costly cache invalidations that can occur if the handler is put on a different CPU.

However, improvements of up to 24% have been reported [2] when processes and the IRQs for the NICs the processes get data from are pinned to the same CPUs. Doing this ensures that the data loaded into the CPU cache by the interrupt handler can be used (without invalidation) by the process; extremely high cache locality is achieved.
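The smp_affinity files take a hexadecimal bitmask in which bit i stands for CPU i, and taskset accepts the same kind of mask. If you’d rather not do the bit twiddling in your head, here is a small sketch that builds such a mask from a list of CPU numbers. The helper names cpu_mask and format_mask are made up for this example, and the sketch only covers as many CPUs as fit into one unsigned long.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build an affinity bitmask: bit i is set when CPU i may be used.
 * CPU numbers that do not fit into an unsigned long are ignored in
 * this sketch (hypothetical helper, not a kernel or libc function).
 */
unsigned long cpu_mask(const int *cpus, size_t n)
{
    unsigned long mask = 0;

    for (size_t i = 0; i < n; i++) {
        if (cpus[i] >= 0 && (size_t)cpus[i] < sizeof(mask) * 8)
            mask |= 1UL << cpus[i];
    }
    return mask;
}

/*
 * Format the mask as the hexadecimal string that is written to
 * smp_affinity. Returns 0 on success, -1 if out is too small.
 */
int format_mask(unsigned long mask, char *out, size_t len)
{
    int n = snprintf(out, len, "%lx", mask);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For CPUs 0 and 2 this gives the mask 5. Writing that value to the smp_affinity file of the NIC’s IRQ and passing the same mask to taskset for the process should put both on the same pair of CPUs.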


oprofile

oprofile is a system-wide profiler that can profile both kernel and application level code. There is a kernel driver for oprofile which collects data from the x86’s Model Specific Registers (MSRs) to give very detailed information about the performance of running code. oprofile can also annotate source code with performance information to make fixing bottlenecks easy. See oprofile’s homepage for more information.



epoll

epoll(7) is useful for applications which must watch for events on large numbers of file descriptors. The epoll interface is designed to easily scale to large numbers of file descriptors. epoll is already enabled in most recent kernels, but some strange distributions (which will remain nameless) have this feature disabled.

Kernel option: CONFIG_EPOLL=y
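If you haven’t used epoll before, the basic cycle is: create an epoll instance, register the file descriptors you care about, then wait for events. Here is a minimal sketch for a single descriptor. The function name wait_readable is an assumption for this example, and a real server would keep one long-lived epoll instance with many descriptors registered instead of creating one per call.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Wait until fd becomes readable or timeout_ms expires (hypothetical
 * helper). Returns 1 if readable, 0 on timeout, -1 on error.
 */
int wait_readable(int fd, int timeout_ms)
{
    /* Create a fresh epoll instance (kept per call only for brevity). */
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    /* Register interest in "data available to read" on fd. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    /* Block until an event arrives or the timeout runs out. */
    struct epoll_event ready;
    int n = epoll_wait(epfd, &ready, 1, timeout_ms);
    close(epfd);

    if (n < 0)
        return -1;
    return (n == 1 && (ready.events & EPOLLIN)) ? 1 : 0;
}
```

The scaling benefit only shows up when many descriptors are registered once and epoll_wait is called in a loop, since the kernel then reports only the descriptors that are actually ready.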


Conclusion

  • There are a lot of useful levers that can be pulled when trying to squeeze every last bit of performance out of your system.
  • It is extremely important to read and understand your hardware documentation if you hope to achieve the maximum throughput your system can achieve.
  • You can find documentation for your kernel online at the Linux LXR. Make sure to select the correct kernel version because docs change as the source changes!
