Performance

What every programmer should know about memory

Posted on January 3, 2014. Filed under: C/C++, Performance | Tags: develop, memory, OS, program |

This is a long document about how to improve program performance, based on how memory hardware is implemented and on how software is designed and compiled. (The author is Ulrich Drepper, formerly of Red Hat.)

The link on lwn.net is http://lwn.net/Articles/250967/

The outline is:

Part 1 (Introduction)

Part 2 (CPU caches)
Part 3 (Virtual memory)
Part 4 (NUMA systems)
Part 5 (What programmers can do – cache optimization)
Part 6 (What programmers can do – multi-threaded optimizations)
Part 7 (Memory performance tools)
Part 8 (Future technologies)
Part 9 (Appendices and bibliography)


Useful kernel and driver performance tweaks for your Linux server

Posted on November 20, 2013. Filed under: Performance | Tags: driver, Linux, NAPI, network, performance, tune |

This article is going to address some kernel and driver tweaks that are interesting and useful. We use several of these in production with excellent performance, but you should proceed with caution and do research prior to trying anything listed below.

Tickless System

The tickless kernel feature allows for on-demand timer interrupts. This means that during idle periods, fewer timer interrupts will fire, which should lead to power savings, cooler running systems, and fewer useless context switches.

Kernel option: CONFIG_NO_HZ=y
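If you want to check whether a kernel was built with a given option, you can look it up in the kernel's config file. Here is a minimal sketch: the helper name kconfig_value is made up for illustration, and the config path /boot/config-$(uname -r) is an assumption that holds on many distributions but not all of them.

```shell
# Print the value of a kernel config option ("y", "m", a number, ...),
# or "unset" if the option is absent or commented out.
# $1: option name, e.g. CONFIG_NO_HZ
# $2: path to an uncompressed kernel config file (assumed location:
#     /boot/config-$(uname -r); this varies between distributions)
kconfig_value() {
    value=$(sed -n "s/^$1=//p" "$2" | head -n 1)
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    else
        printf 'unset\n'
    fi
}
```

For example, kconfig_value CONFIG_NO_HZ "/boot/config-$(uname -r)" should print y on a tickless kernel, provided the config file lives at that path.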

Timer Frequency

You can select the rate at which timer interrupts in the kernel will fire. When a timer interrupt fires on a CPU, the process running on that CPU is interrupted while the timer interrupt is handled. Reducing the rate at which the timer fires allows for fewer interruptions of your running processes. This option is particularly useful for servers with multiple CPUs where processes are not running interactively.

Kernel options: CONFIG_HZ_100=y and CONFIG_HZ=100
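To get a feel for what the timer frequency means in practice, the arithmetic is simple: the timer fires CONFIG_HZ times per second, so the gap between ticks is one second divided by that value. A tiny sketch (tick_period_us is a hypothetical helper name, not a standard tool):

```shell
# Print the interval between timer interrupts, in microseconds,
# for a given CONFIG_HZ value (timer interrupts per second).
tick_period_us() {
    echo $((1000000 / $1))
}
```

With CONFIG_HZ=100, tick_period_us 100 gives 10000, i.e. one interruption every 10 ms. A kernel built with CONFIG_HZ=1000 would interrupt every 1 ms instead, ten times as often.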

Connector

The connector module is a kernel module which reports process events such as fork, exec, and exit to userland. This is extremely useful for process monitoring. You can build a simple system (or use an existing one like god) to watch mission-critical processes. If the processes die due to a signal (like SIGSEGV or SIGBUS) or exit unexpectedly, you’ll get an asynchronous notification from the kernel. The processes can then be restarted by your monitor, keeping downtime to a minimum when unexpected events occur.

Kernel options: CONFIG_CONNECTOR=y and CONFIG_PROC_EVENTS=y

TCP segmentation offload (TSO)

A popular feature among newer NICs is TCP segmentation offload (TSO). This feature allows the kernel to offload the work of dividing large packets into smaller packets to the NIC. This frees up the CPU to do more useful work and reduces the amount of overhead that the CPU passes along the bus. If your NIC supports this feature, you can enable it with ethtool:

[joe@timetobleed]% sudo ethtool -K eth1 tso on

Let’s quickly verify that this worked:

[joe@timetobleed]% sudo ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on
large receive offload: off

[joe@timetobleed]% dmesg | tail -1
[892528.450378] 0000:04:00.1: eth1: TSO is Enabled
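If you want to do the same check from a script rather than by eye, you can pull a single feature out of the ethtool -k listing. This is only a sketch: offload_state is a made-up helper name, and it assumes the "label: state" layout shown above. Newer ethtool versions print hyphenated labels such as tcp-segmentation-offload, so pass the label exactly as your version prints it.

```shell
# Read `ethtool -k <iface>` output on stdin and print the state
# ("on" or "off") of the named offload feature; print nothing if the
# feature is not listed.
# $1: feature label exactly as ethtool prints it, e.g.
#     "tcp segmentation offload"
offload_state() {
    awk -F': ' -v feature="$1" '$1 == feature { print $2; exit }'
}
```

Usage would look like: sudo ethtool -k eth1 | offload_state "tcp segmentation offload"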

Intel I/OAT DMA Engine

This kernel option enables the Intel I/OAT DMA engine that is present in recent Xeon CPUs. This option increases network throughput as the DMA engine allows the kernel to offload network data copying from the CPU to the DMA engine. This frees up the CPU to do more useful work.

Check to see if it’s enabled:

[joe@timetobleed]% dmesg | grep ioat
ioatdma 0000:00:08.0: setting latency timer to 64
ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, device version 0x12, driver version 3.64
ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X

There’s also a sysfs interface where you can get some statistics about the DMA engine. Check the directories under /sys/class/dma/.
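A quick way to dump whatever those directories contain is to walk them and print each readable file. The sketch below assumes nothing about which attributes exist, since that depends on your kernel version; dma_channel_stats is a hypothetical helper name.

```shell
# Print every readable attribute of every DMA channel below a sysfs
# directory as "channel/attribute: value" (first line of each file only).
# $1: base directory (defaults to /sys/class/dma)
dma_channel_stats() {
    base=${1:-/sys/class/dma}
    for chan in "$base"/*; do
        [ -d "$chan" ] || continue
        for attr in "$chan"/*; do
            # Skip subdirectories and anything we are not allowed to read.
            [ -f "$attr" ] && [ -r "$attr" ] || continue
            printf '%s/%s: %s\n' "${chan##*/}" "${attr##*/}" "$(head -n 1 "$attr")"
        done
    done
}
```

Running dma_channel_stats with no arguments reads /sys/class/dma; passing another directory is mainly useful for trying the function out on a copy.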

Kernel options: CONFIG_DMADEVICES=y and CONFIG_INTEL_IOATDMA=y and CONFIG_DMA_ENGINE=y and CONFIG_NET_DMA=y and CONFIG_ASYNC_TX_DMA=y

Direct Cache Access (DCA)

Intel’s I/OAT also includes a feature called Direct Cache Access (DCA). DCA allows a driver to warm a CPU cache. A few NIC drivers support DCA; the most popular (to my knowledge) is the Intel 10GbE driver (ixgbe). Refer to your NIC driver documentation to see if your NIC supports DCA. To enable DCA, a switch in the BIOS must be flipped. Some vendors supply machines that support DCA, but don’t expose a switch for DCA. If that is the case, see my last blog post for how to enable DCA manually.

You can check if DCA is enabled:

[joe@timetobleed]% dmesg | grep dca
dca service started, version 1.8

If DCA is possible on your system but disabled you’ll see:

ioatdma 0000:00:08.0: DCA is disabled in BIOS

Which means you’ll need to enable it in the BIOS or manually.

Kernel option: CONFIG_DCA=y

NAPI

The “New API” (NAPI) is a rework of the packet processing code in the kernel to improve performance for high speed networking. NAPI provides two major features¹:

Interrupt mitigation: High-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.

Packet throttling: When the system is overwhelmed and must drop packets, it’s better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adaptor itself, before the kernel sees them at all.

Many recent NIC drivers automatically support NAPI, so you don’t need to do anything. Some drivers need you to explicitly specify NAPI in the kernel config or on the command line when compiling the driver. If you are unsure, check your driver documentation. A good place to look for docs is in your kernel source under Documentation, available on the web here: http://lxr.linux.no/linux+v2.6.30/Documentation/networking/ but be sure to select the correct kernel version first!

Older e1000 drivers (with newer drivers, do nothing): make CFLAGS_EXTRA=-DE1000_NAPI install

Throttle NIC Interrupts

Some drivers allow the user to specify the rate at which the NIC will generate interrupts. The e1000e driver allows you to pass a command line option, InterruptThrottleRate, when loading the module with insmod. For the e1000e there are two dynamic interrupt throttle mechanisms, specified on the command line as 1 (dynamic) and 3 (dynamic conservative). The adaptive algorithm classifies traffic into different classes and adjusts the interrupt rate appropriately. The difference between dynamic and dynamic conservative is the rate for the “Lowest Latency” traffic class: dynamic (1) has a much more aggressive interrupt rate for this traffic class.

As always, check your driver documentation for more information.

With insmod: insmod e1000e.o InterruptThrottleRate=1
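If you load the driver through modprobe instead, the same parameter can be kept in a modprobe configuration file so that it is applied on every load. The sketch below only prints the line you would put there; throttle_option_line is a made-up helper, the mode names "dynamic" and "conservative" are my own labels for the values 1 and 3 described above, and the file name /etc/modprobe.d/e1000e.conf is an assumed convention.

```shell
# Print a modprobe "options" line selecting an e1000e throttle mode.
# $1: "dynamic" (value 1) or "conservative" (value 3), the two
#     adaptive modes described in the text; anything else is an error.
throttle_option_line() {
    case "$1" in
        dynamic)      rate=1 ;;
        conservative) rate=3 ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
    printf 'options e1000e InterruptThrottleRate=%s\n' "$rate"
}
```

For example, throttle_option_line conservative prints options e1000e InterruptThrottleRate=3, which you could then place in the assumed config file as root.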

Process and IRQ affinity

Linux allows the user to specify which CPUs processes and interrupt handlers are bound to.

Processes: You can use taskset to specify which CPUs a process can run on.
Interrupt handlers: The interrupt map can be found in /proc/interrupts, and the affinity for each interrupt can be set in the file smp_affinity in the directory for each interrupt under /proc/irq/.
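Both taskset and smp_affinity accept a hexadecimal bitmask in which bit N stands for CPU N. Working that mask out by hand is error-prone, so here is a small sketch that builds it from a list of CPU numbers. The helper name cpus_to_mask is hypothetical, and the function is only meant for low-numbered CPUs (machines with more than 32 CPUs use a comma-separated mask format in smp_affinity that this does not produce).

```shell
# Convert a list of CPU numbers into the hexadecimal bitmask form
# accepted by taskset and /proc/irq/<irq>/smp_affinity
# (bit N set = CPU N allowed). Intended for CPUs 0-31 only.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$((mask | (1 << cpu)))
    done
    printf '%x\n' "$mask"
}
```

For CPUs 0 and 2 this prints 5. You could then run taskset -p 5 <pid> to rebind a running process, or, as root, write the same value into /proc/irq/<irq>/smp_affinity for the interrupt you want to pin.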

This is useful because you can pin the interrupt handlers for your NICs to specific CPUs. When a shared resource (such as a lock in the network stack) is touched and loaded into a CPU cache, the handler will run on the same CPU the next time, avoiding the costly cache invalidations that can occur if the handler is put on a different CPU.

In addition, improvements of up to 24% have been reported² when processes and the IRQs for the NICs the processes get data from are pinned to the same CPUs. Doing this ensures that the data loaded into the CPU cache by the interrupt handler can be used (without invalidation) by the process; extremely high cache locality is achieved.
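To pin a NIC's IRQs you first need to know which IRQ numbers belong to it. The sketch below scans an interrupts table for a device name. irqs_for_device is a hypothetical helper, and it assumes the device shows up either as a field of its own (eth1) or as the prefix of a per-queue name (eth1-TxRx-0); shared IRQ lines that list several devices separated by commas are not handled.

```shell
# Print the IRQ numbers whose /proc/interrupts entry mentions a device.
# $1: device name as shown in the trailing columns, e.g. eth1
# $2: interrupts table to read (defaults to /proc/interrupts)
irqs_for_device() {
    awk -v dev="$1" '
        # Numbered IRQ lines start with "<number>:". Match the device
        # name as a whole field or as the prefix of a queue name.
        $1 ~ /^[0-9]+:$/ {
            for (i = 2; i <= NF; i++) {
                if ($i == dev || index($i, dev "-") == 1) {
                    sub(/:$/, "", $1)
                    print $1
                    break
                }
            }
        }
    ' "${2:-/proc/interrupts}"
}
```

Each number printed by irqs_for_device eth1 corresponds to a directory under /proc/irq/ whose smp_affinity file you can set, and the consuming process can be bound to the same CPUs with taskset.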

oprofile

oprofile is a system-wide profiler that can profile both kernel and application level code. There is a kernel driver for oprofile which collects data from the x86’s Model Specific Registers (MSRs) to give very detailed information about the performance of running code. oprofile can also annotate source code with performance information to make fixing bottlenecks easy. See oprofile’s homepage for more information.

Kernel options: CONFIG_OPROFILE=y and CONFIG_HAVE_OPROFILE=y

epoll

epoll(7) is useful for applications which must watch for events on large numbers of file descriptors. The epoll interface is designed to easily scale to large numbers of file descriptors. epoll is already enabled in most recent kernels, but some strange distributions (which will remain nameless) have this feature disabled.

Kernel option: CONFIG_EPOLL=y

Conclusion

There are a lot of useful levers that can be pulled when trying to squeeze every last bit of performance out of your system.
It is extremely important to read and understand your hardware documentation if you hope to achieve the maximum throughput your system is capable of.
You can find documentation for your kernel online at the Linux LXR. Make sure to select the correct kernel version because docs change as the source changes!
