Linux

What every programmer should know about memory

Posted on January 3, 2014. Filed under: C/C++, Performance | Tags: , , , |

This is a big document about how to improve program based on memory hardware implementation and software design/compile. (Arthur is Ulrich Drepper who was from Redhat)

The link on lwn.net is http://lwn.net/Articles/250967/

The outline is:

  • Part 2 (CPU caches)
  • Part 3 (Virtual memory)
  • Part 4 (NUMA systems)
  • Part 5 (What programmers can do – cache optimization)
  • Part 6 (What programmers can do – multi-threaded optimizations)
  • Part 7 (Memory performance tools)
  • Part 8 (Future technologies)
  • Part 9 (Appendices and bibliography)
Read Full Post | Make a Comment ( None so far )

sysfs, procfs, sysctl, debugfs and other similar kernel interfaces

Posted on November 20, 2013. Filed under: Kernel | Tags: |

2. Procfs, Sysfs, and Similar Kernel Interfaces

2.1 Introduction

These file systems are optional to the Linux kernel, and may not be enabled on your system. The file “/lib/modules/`uname -r`/build/.config” will tell you how your kernel is configured.

In order to exchange data between user space and kernel space the Linux kernel provides a couple of RAM based file systems. These interfaces are, themselves, based on files. Usually a file represents a single value, but it may also represent a set of values. The user space can access these values by means of the standard read(2) and write(2) functions. For most file systems the read and write function results in a callback function in the Linux kernel which has access to the corresponding value.

Despite offering similar functionality, the different RAM based file systems are all designed for separate purposes. However it is easy to use these file systems for other purposes as well. Questions such as “Which file system should be used?” or “Why is there a need for the different file systems?” often arise on the Linux kernel mailing list. The arguments are controversial and each developer seems to have a unique view.

The benefit of using the read and write function in comparison to, for example, socket based approaches, is that the user space has a lot of tools available to send data to the kernel space (e.g. cat(1), echo (1)). These programs are well known to users and they can be used in scripts.

2.2 Procfs

Description

The procfs, located in /proc, is the best known interface of this class. It was originally designed to export all kind of process information such as the current status of the process, or all open file descriptors to the user space. Despite its initial purposes, the procfs has been used for a lot of other purposes:

  • provide information about the running system such as cpu information, information about interrupts, about the available memory or the version of the kernel.
  • information about “ide devices”, “scsi devices” and “tty’s”.
  • networking information such as the arp table, network statistics or lists of used sockets

There is a special subdirectory: /proc/sys.
It allows to configure a lot of parameters of the running system. Usually each file consists of a single value. This value may correspond to:

  • a limit (e.g. maximum buffer size)
  • turn on or off a given functionality (for example routing)
  • or represent some other kernel variable

All directories and files below /proc/sys/ are not implemented with the procfs interface. Instead they use a mechanism called sysctl. See section sysctl for further details about sysctl.

Note, despite the wide use of the procfs, it is deprecated and should only be used to export information related to a process itself.

Implementation

In order to use the procfs it needs to be compiled with the Linux kernel source code. This is done by setting the parameter CONFIG_PROC_FS=y. In most standard configurations this is enabled by default

Procfs supports two different APIs for kernel modules:
The legacy procfs API: It is easy to use as long as the amount of data to be handled is small. In this context small means smaller than one page size (PAGE_SIZE), which is in i386 systems 4096 bytes.
The seq_file API: Seq_file was designed to facilitate the handling of read requests. It supports read requests for more than PAGE_SIZE bytes and it provides mechanism to traverse a list, collect the elements of the list, and send all elements to user space.

1. Legacy procfs API

procfs.c legacy procfs API

The legacy procfs API allows for the creation of files and directories. For each file you have to specify two callback functions: One which is executed when a user reads the file and the other when a user writes to the file. The use of this API is well described in the “Linux Kernel Procfs Guide” distributed with the Linux kernel source code. Therefore we give here only a very basic example: A module which creates a directory as well as a file. If your file provides more than PAGE_SIZE bytes of data it is easy to get things wrong. This is due to the API of the read function:
read(char *page, char **start, off_t off, int count, int *eof, void *data)
The first parameter of this function is a buffer with the size corresponding to one page. Hence, if there is more data, the read has to be split in multiple pieces.

2. Seq_file API

The seq_file API is concerned with read requests solely – no writes. It hides the PAGE_SIZE boundary from the developer and it provides an API to step through a series of objects, collect the data from each of them and put all those data in the file. An example module can be found at http://lwn.net/Articles/22359/.

Further Reading and Resources

  • http://lwn.net/Articles/22355/ lwn article “Driver porting: The seq_file interface”.
  • http://lwn.net/Articles/22359/ Example module that uses seq_file in relation with the lwn article.
  • http://www.linux-mag.com/id/2739?r=s Linux magazine article “Manipulating “Seq” Files” covers legacy API as well as seq_file.
  • http://kernelnewbies.org/Documents/SeqFileHowTo Documents/SeqFileHowTo seqfile how-to on kernelnewbies
  • Linux Kernel Procfs Guide available in the Linux kernel source code Documentation/DocBook/procfs-guide.tmpl. Description of the legacy procfs API
  • T H E /proc F I L E S Y S T E M Description of entries in /proc, available in the Linux kernel source code Documentation/filesystems/proc.txt

2.3 Sysfs

Description

Sysfs was designed to represent the whole device model as seen from the Linux kernel. It contains information about devices, drivers and buses and their interconnections. In order to represent the hierarchy and the interconnections sysfs is heavily structured and contains a lot of links between the individual directories. As for kernel 2.6.23 it contains the following 9 top-level directories:

  • sys/block/ all known block devices such as hda/ ram/ sda/
  • sys/bus/ all registered buses. Each directory below bus/ holds by default two subdirectories:
    • device/ for all devices attached to that bus
    • driver/ for all drivers assigned with that bus.
  • sys/class/ for each device type there is a subdirectory: for example /printer or /sound
  • sys/device/ all devices known by the kernel, organised by the bus they are connected to
  • sys/firmware/ files in this directory handle the firmware of some hardware devices
  • sys/fs/ files to control a file system, currently used by FUSE, a user space file system implementation
  • sys/kernel/ holds directories (mount points) for other filesystems such as debugfs, securityfs.
  • sys/module/ each kernel module loaded is represented with a directory.
  • sys/power/ files to handle the power state of some hardware

Implementation

In order to use sysfs it needs to be compiled with the Linux kernel source code. This is done by setting the parameter CONFIG_SYSFS=y.

The philosophy behind sysfs is to represent each value with a dedicated file. In addition each file has a maximum size of PAGE_SIZE bytes.

For a kernel module there are three possibilities to use a file below /sys:

  1. module parameter
  2. register new subsystem
  3. debugfs: debugfs, mounted in /sys/kernel/debug. More information about debugfs.

Module Parameter API

Similar to command line arguments for applications, Linux kernel modules may allow a set of parameters. These parameters can not only be specified upon module insertion but also during module run time. A module parameter can be defined with the following macro:
module_param_named(name, value, type, perm)
This macro creates a parameter called "name" which corresponds to the variable with name "value" of type "type". There are many predefined types such as byte (for a single character), int (for an integer) or charp (for a string). It is also possible to add new types. The file include/linux/stat.h provides all predefined types as well as an introduction how to define new types.

The module_param macro creates a file called /sys/modules/module_name/name with the access rights specified by perm. Depending on the specified access rights, the file – and thereby the parameter value – can be read or written. If perm is set to 0 the file is not created, and therefore the parameter cannot be accessed during run time.

The module does not receive a notification when a user reads or writes a given parameter, but the value is silently changed. Therefore it is not possible to do some additional stuff when a parameter changes its value. This may be acceptable in some circumstances as for changing a debug level, but in most circumstances the module wants to do some additional stuff such as sanity checks or manipulating a data structure.

Standard Sysfs API

The standard sysfs API uses a dedicated terminology: A file is called an attribute, the function executed upon reading an attribute is called show and the one for writing an attribute store.

Before starting with the implementation of a module which uses sysfs you have to figure out which subdirectory it belongs to. If you deal with a bus, it belongs to bus/, with a file system it belongs to fs/ or with a block device it belongs to block/. The API to use depends on the given subdirectory. We first show an example which uses the low level sysfs functions to add a new directory to fs/, and in a second example we show how to add a new entry to the bus/ directory.

1. new fs/ entry

sysfs_ex.c creates the directory /sys/fs/myfs/ along with two files first and second. Both containing one single integer value.

The first step is to declare our subsystem. This can be done with the use of the decl_subsys macro (on top of the file). This macro creates a struct kset with the name myfs_subsys.

The module_init() function performs the proper registration of our subsystem: The macro kobj_set_kset_s initializes myfs_subsys so that it will be part of the fs_subsys. The field myfs_subsys.kobj.ktype points to a structure which holds all the attributes as well as the functions to read and write the attributes. And finally a call to register_subsystem() registers our subsystem.

Files are generally represented by a struct attribute. This struct holds the name as well as the access permission for the corresponding file, but no data. Therefore you have to create your own attribute type which consists of at least the struct attribute and the value corresponding to that file.

By design all attributes share the same show and store functions. Each time one of these two functions is invoked it gets the corresponding struct attribute as an argument. Therefore in the show and store functions you can obtain the value corresponding to the file being read/written and you can manipulate it accordingly. For this purpose you need the macro container_of(ptr, type, member)ptr is a pointer to the member of the struct.type is the type of the struct this member is emeded in and member is the name of the member within the struct.

2. new bus/ entry

sysfs_ex2.c use of sysfs in combination with a bus. It provides the possibility to read and write one value with the help of “my_pseude_bus”.

First of all we define our bus my_pseudo_bus. Then we create our attribute with the help of the BUS_ATTR macro. In the init function we register our pseudo bus and we create a file (attribute). If we would like more than one attribute we would have to use BUS_ATTR several times and provide for each attribute its own store and show function. This example is similar to the debugging facility of the scsi bus, which is implemented indrivers/scsi/scsi_debug.c

Resources and Further Reading

2.4 Configfs

Description

The configfs is somewhat the counterpart of the sysfs. It can be seen as a filesystem based manager of kernel objects. An important difference between configfs and sysfs is that in configfs all objects are created from user space with a call to mkdir(2). The kernel responds with creating the attributes (files) and then they can be read and written by the user. If the user no longer needs the files, he calls rmdir(2) and everything gets deleted. Therefore the life cycle of a configfs object is fully controlled by user space.

Each time mkdir is invoked a new “config_item” is created by the kernel implementation. This config_item represents the files (attributes), the show and store callback functions as well as the associated value. Therefore each mkdir creates a new directory along with new files which represent new values.

Configfs has the same limitations than sysfs: each file should represent only one value and it should be smaller than PAGE_SIZE bytes.

Implementation

In order to use configfs it needs to be compiled with the Linux kernel source code. This is done by setting the parameter CONFIG_CONFIGFS_FS=y.

In order to access configfs it has to be mounted with the following command:
mount -t configfs none /config

The Linux kernel documentation provides a good manual for configfs along with an example module. Therefore we do not describe the configfs implementation aspects.

Resources and Further Reading

  • Linux kernel source code: Documentation/filesystems/configfs

2.5 Debugfs

Description

Debugfs is a simple to use RAM based file system especially designed for debugging purposes. Developers are encouraged to use debugfs instead of procfs in order to obtain some debugging information from their kernel code. Debugfs is quite flexible: it provides the possibility to set or get a single value with the help of just one line of code but the developer is also allowed to write its own read/write functions, and he can use the seq_file interface described in the procfs section.

Implementation

In order to use debugfs it needs to be compiled with the Linux kernel source code. This is done by setting the parameter CONFIG_DEBUG_FS=y.

Before having access to the debugfs it has to be mounted with the following command.
mount -t debugfs none /sys/kernel/debug

debugfs.c kernel module that implements the “one line” API for a variable of type u8 as well as the API with which you can specify your own read and write functions.

All the “one line” APIs start with debugfs_create_ and are listed in include/linux/debugfs.h

The API with which you can provide your own read and write functions is similar to the one of procfs. In contrast to sysfs, you may create directories and files without having to care about a given hierarchy.

Resources and Further Reading

2.6 Sysctl

Description

The sysctl infrastructure is designed to configure kernel parameters at run time. The sysctl interface is heavily used by the Linux networking subsystem. It can be used to configure some core kernel parameters; represented as files in /proc/sys/*. The values can be accessed by using cat(1)echo(1) or the sysctl(8) commands. If a value is set by the echo command it only persists as long as the kernel is running, but gets lost as soon as the machine is rebooted. In order to change the values permanently they have to be written to the file /etc/sysctl.conf. Upon restarting the machine all values specified in this file are written to the corresponding files in/proc/sys/.

Implementation

sysctl.c sysctl example module: write an integer to /proc/sys/net/test/value1 and value2 respectively

Each entry in the /proc/sys directory is represented by an entry in a table maintained by the Linux kernel, arranged in a hierarchy. A directory is represented by an entry pointing to a subtable. A file is represented by an entry of type struct ctl_table. This entry consists of the data represented by this file along with some access rules.

New files and directories can be added by expanding one of the subtables. In this example we add a new directory called test below the /proc/sys/net/ directory. Our directory has got two files: value1 and value2. Each of these files hold an integer variable which can have a value between 10 and 20. The user root is allowed to change the entries whereas normal user are allowed to read the entries.

Each file is represented with an entry in the test_table[] array:

static ctl_table test_table[] = {
        {
        .ctl_name       = CTL_UNNUMBERED,
        .procname       = "value1",
        .data           = &value1,
        .maxlen         = sizeof(int),
        .mode           = 0644,
        .proc_handler   = &proc_dointvec_minmax,
        .strategy       = &sysctl_intvec,
        .extra1         = &min,
        .extra2         = &max
        },
        ...
}

The struct ctl_table entries are:

  • .ctl_name: For new entries this has to be CTL_UNNUMBERED (according to Documentation/sysctl/ctl_unnumbered.txt).
  • .procname: The name of the file.
  • .data: A reference to the data we want to be shown in the file.
  • maxlen: The size of the data.
  • mode: Access permissions (read, write, execute for user, group, others)
  • proc_handler: The routine which handles read and write requests. There is a set of default routines declared near the end of include/linux/sysctl.h
  • strategy: Some routine that enforces additional access control. In this example it checks that the value to be written is between min and max.
static ctl_table test_net_table[] = {
        {
                .ctl_name       = CTL_UNNUMBERED,
                .procname       = "test",
                .mode           = 0555,
                .child          = test_table
        },
        { .ctl_name = 0 }
};

This table represents our test directory. The entry .child says that the elements below this directory are represented by the table test_table, discussed above.

static ctl_table test_root_table[] = {
        {
                .ctl_name       = CTL_UNNUMBERED,
                .procname       = "net",
                .mode           = 0555,
                .child          = test_net_table
        },
        { .ctl_name = 0 }
};

This table represents the directory to which we want to attach our new directory. In this example, this is the net directory. In the module_init() function we have to register this root table with a call to
register_sysctl_table(test_root_table);

Resources and Further Reading

  • Linux kernel source code: net/core/sysctl_net_core.c
  • Linux kernel Documentation: Documentation/sysctl/ctl_unnumbered.txt

2.7 Character Devices

Description

As the name suggests, this interface was designed for character device drivers, and is commonly used for communication between uer and kernel space. (For example, users with sufficient privileges my write directly to the virtual terminal 1 with echo "hi there" > /dev/tty1).

Each module can register itself as a character device and provide some read and write functions which handle the data. Files representing character devices are located within the /dev directory (where you will also find block devices, but we will not be describing them further). Usually these files correspond to a hardware device.

Implementation

cdev.c kernel module that prints its majorNumber to the system log. The minorNumber can be chosen to be 0.

As with all file system based approaches the module has to specify a read and a write callback function. Therefore, we have to register ourself with the function register_chrdev(unsigned int major, const char *name, struct file_operations *ops);major is the major number of this device. We can set it to 0 to let the kernel choose an appropriate number. name is the name of this character device, as it will be shown below the/dev directory. ops is a pointer to the read and write functions.

In contrast to most file system based approaches seen so far, the user has to create the device file explicitly with a call to:
mknod /dev/arbitrary_name c majorNumber minorNumber

Resources and Further Reading

Links:

 

Read Full Post | Make a Comment ( 1 so far )

Useful kernel and driver performance tweaks for your Linux server

Posted on November 20, 2013. Filed under: Performance | Tags: , , , , , |

This article is going to address some kernel and driver tweaks that are interesting and useful. We use several of these in production with excellent performance, but you should proceed with caution and do researchprior to trying anything listed below.

Tickless System

The tickless kernel feature allows for on-demand timer interrupts. This means that during idle periods, fewer timer interrupts will fire, which should lead to power savings, cooler running systems, and fewer useless context switches.

Kernel option: CONFIG_NO_HZ=y

Timer Frequency

You can select the rate at which timer interrupts in the kernel will fire. When a timer interrupt fires on a CPU, the process running on that CPU is interrupted while the timer interrupt is handled. Reducing the rate at which the timer fires allows for fewer interruptions of your running processes. This option is particularly useful for servers with multiple CPUs where processes are not running interactively.

Kernel options: CONFIG_HZ_100=y and CONFIG_HZ=100

Connector

The connector module is a kernel module which reports process events such as forkexec, and exit to userland. This is extremely useful for process monitoring. You can build a simple system (or use an existing one like god) to watch mission-critical processes. If the processes die due to a signal (like SIGSEGV, orSIGBUS) or exit unexpectedly you’ll get an asynchronous notification from the kernel. The processes can then be restarted by your monitor keeping downtime to a minimum when unexpected events occur.

Kernel options: CONFIG_CONNECTOR=y and CONFIG_PROC_EVENTS=y

TCP segmentation offload (TSO)

A popular feature among newer NICs is TCP segmentation offload (TSO). This feature allows the kernel to offload the work of dividing large packets into smaller packets to the NIC. This frees up the CPU to do more useful work and reduces the amount of overhead that the CPU passes along the bus. If your NIC supports this feature, you can enable it with ethtool:

[joe@timetobleed]% sudo ethtool -K eth1 tso on

Let’s quickly verify that this worked:

[joe@timetobleed]% sudo ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on
large receive offload: off

[joe@timetobleed]% dmesg | tail -1
[892528.450378] 0000:04:00.1: eth1: TSO is Enabled

Intel I/OAT DMA Engine

This kernel option enables the Intel I/OAT DMA engine that is present in recent Xeon CPUs. This option increases network throughput as the DMA engine allows the kernel to offload network data copying from the CPU to the DMA engine. This frees up the CPU to do more useful work.

Check to see if it’s enabled:

[joe@timetobleed]% dmesg | grep ioat
ioatdma 0000:00:08.0: setting latency timer to 64
ioatdma 0000:00:08.0: Intel(R) I/OAT DMA Engine found, 4 channels, device version 0x12, driver version 3.64
ioatdma 0000:00:08.0: irq 56 for MSI/MSI-X

There’s also a sysfs interface where you can get some statistics about the DMA engine. Check the directories under /sys/class/dma/.

Kernel options: CONFIG_DMADEVICES=y and CONFIG_INTEL_IOATDMA=y and CONFIG_DMA_ENGINE=y and CONFIG_NET_DMA=y and CONFIG_ASYNC_TX_DMA=y

Direct Cache Access (DCA)

Intel’s I/OAT also includes a feature called Direct Cache Access (DCA). DCA allows a driver to warm a CPU cache. A few NICs support DCA, the most popular (to my knowledge) is the Intel 10GbE driver (ixgbe). Refer to your NIC driver documentation to see if your NIC supports DCA. To enable DCA, a switch in the BIOS must be flipped. Some vendors supply machines that support DCA, but don’t expose a switch for DCA. If that is the case, see my last blog post for how to enable DCA manually.

You can check if DCA is enabled:

[joe@timetobleed]% dmesg | grep dca
dca service started, version 1.8

If DCA is possible on your system but disabled you’ll see:

ioatdma 0000:00:08.0: DCA is disabled in BIOS

Which means you’ll need to enable it in the BIOS or manually.

Kernel option: CONFIG_DCA=y

NAPI

The “New API” (NAPI) is a rework of the packet processing code in the kernel to improve performance for high speed networking. NAPI provides two major features1:

Interrupt mitigation: High-speed networking can create thousands of interrupts per second, all of which tell the system something it already knew: it has lots of packets to process. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.

Packet throttling: When the system is overwhelmed and must drop packets, it’s better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adaptor itself, before the kernel sees them at all.

Many recent NIC drivers automatically support NAPI, so you don’t need to do anything. Some drivers need you to explicitly specify NAPI in the kernel config or on the command line when compiling the driver. If you are unsure, check your driver documentation. A good place to look for docs is in your kernel source under Documentation, available on the web here: http://lxr.linux.no/linux+v2.6.30/Documentation/networking/but be sure to select the correct kernel version, first!

Older e1000 drivers (newer drivers, do nothing)make CFLAGS_EXTRA=-DE1000_NAPI install

Throttle NIC Interrupts

Some drivers allow the user to specify the rate at which the NIC will generate interrupts. The e1000e driver allows you to pass a command line option InterruptThrottleRate

when loading the module with insmod. For the e1000e there are two dynamic interrupt throttle mechanisms, specified on the command line as 1 (dynamic) and 3 (dynamic conservative). The adaptive algorithm traffic into different classes and adjusts the interrupt rate appropriately. The difference between dynamic and dynamic conservative is the the rate for the “Lowest Latency” traffic class, dynamic (1) has a much more aggressive interrupt rate for this traffic class.

As always, check your driver documentation for more information.

With modprobe: insmod e1000e.o InterruptThrottleRate=1

Process and IRQ affinity

Linux allows the user to specify which CPUs processes and interrupt handlers are bound.

  • Processes You can use taskset to specify which CPUs a process can run on
  • Interrupt Handlers The interrupt map can be found in /proc/interrupts, and the affinity for each interrupt can be set in the file smp_affinity in the directory for each interrupt under /proc/irq/

This is useful because you can pin the interrupt handlers for your NICs to specific CPUs so that when a shared resource is touched (a lock in the network stack) and loaded to a CPU cache, the next time the handler runs, it will be put on the same CPU avoiding costly cache invalidations that can occur if the handler is put on a different CPU.

However, reports2 of up to a 24% improvement can be had if processes and the IRQs for the NICs the processes get data from are pinned to the same CPUs. Doing this ensures that the data loaded into the CPU cache by the interrupt handler can be used (without invalidation) by the process; extremely high cache locality is achieved.

oprofile

oprofile is a system wide profiler that can profile both kernel and application level code. There is a kernel driver for oprofile which generates collects data in the x86′s Model Specific Registers (MSRs) to give very detailed information about the performance of running code. oprofile can also annotate source code with performance information to make fixing bottlenecks easy. See oprofile’s homepage for more information.

Kernel options: CONFIG_OPROFILE=y and CONFIG_HAVE_OPROFILE=y

epoll

epoll(7) is useful for applications which must watch for events on large numbers of file descriptors. Theepoll interface is designed to easily scale to large numbers of file descriptors. epoll is already enabled in most recent kernels, but some strange distributions (which will remain nameless) have this feature disabled.

Kernel option: CONFIG_EPOLL=y

Conclusion

  • There are a lot of useful levers that can be pulled when trying to squeeze every last bit of performance out of your system
  • It is extremely important to read and understand your hardware documentation if you hope to achieve the maximum throughput your system can achieve
  • You can find documentation for your kernel online at the Linux LXRMake sure to select the correct kernel version because docs change as the source changes!

Thanks for reading and don’t forget to subscribe (via RSS or e-mail) and follow me on twitter.

References

  1. http://www.linuxfoundation.org/en/Net:NAPI
  2. http://software.intel.com/en-us/articles/improved-linux-smp-scaling-user-directed-processor-affinity/
  3. http://timetobleed.com/useful-kernel-and-driver-performance-tweaks-for-your-linux-server/  (Source of this file)
Read Full Post | Make a Comment ( None so far )

Make ipmitool working for ESXi 5.1

Posted on September 17, 2013. Filed under: Linux, Programming | Tags: , , |

ESXi 5.1 has no ipmitool command built-in, and has limited IPMI capability, we can build a static ipmitool binary for ESXi 5.1.

1. Login to ESXi 5.1, check it’s libc.so.6 version by:
~# /lib/libc.so.6

Compiled by GNU CC version 4.1.2 20070626 (Red Hat 4.1.2-14).
Compiled on a Linux 2.6.9 system on 2012-03-21.

This means the libc.so.6 is built on linux with kernel 2.6.9 and gcc 4.1.2, so we’d better setup a linux environment with the same linux kernel and gcc version, such as CentOS 4.8.

2. login to the linux environment, build static linked ipmitool binary
2.1 download the ipmitool source code from sourceforge
2.2 Compile. And adding “-static” linker option. The resulting binary I got back was not statically compiled, but worked. Omitting the -static flag caused ipmitool to throw compile errors, so I left it.
~/src/ipmitool-1.8.11$ ./configure CFLAGS=-m32 LDFLAGS=-static

~/src/ipmitool-1.8.11/src/$ ldd ipmitool
linux-gate.so.1 => (0x009f0000)
libm.so.6 => /lib/libm.so.6 (0x00c80000)
libc.so.6 => /lib/libc.so.6 (0x00ac4000)
/lib/ld-linux.so.2 (0x00a9e000)

2.3 copy the built binary to ESXi 5.1 server, it works fine

References:
http://ipmitool.sourceforge.net/
http://blog.rchapman.org/post/17480234232/configuring-bmc-drac-from-esxi-or-linux

Read Full Post | Make a Comment ( 6 so far )

Linux Page Faults

Posted on September 16, 2013. Filed under: Linux, Programming | Tags: , , , |

Page Fault

4 Linux Commands to View Page Faults Stats: ps, top, sar, time

  1. major fault occurs when disk access required. For example, start an app called Firefox. The Linux kernel will search in the physical memory and CPU cache. If data do not exist, the Linux issues a major page fault.
  2. A minor fault occurs due to page allocation.

 

Read Full Post | Make a Comment ( None so far )

Cross compiling environment setup for ARM Architecture pidora OS

Posted on September 2, 2013. Filed under: C/C++, Linux | Tags: , , , , , , , |

In this article, I’m using raspberry pi hardware, and using pedora 32-bit target OS, Scientific Linux 6.1 64-bit as host OS, crosstool-ng as cross compiling tool. I will give a brief introduction steps about how to build binaries and shared/static libraries, and in the process, some building depends on third party libraries

Question: Why we need setup the cross compiling environment?
The ARM architecture is not powerful enough to build the binary as quick as we want, so we need setup a toolchain build environment on a powerful host OS which is using Intel or AMD cpu to build binaries for target OS running on ARM cpu.

    1. After install fedora in raspberry pi, need following information from it:

kernel version by uname -a

gcc version by gcc --version

glibc version by run /lib/libc.so.6 directly

    1. download and install crosstool-ng to host OS

login as root to the host OS, and then:
wget http://crosstool-ng.org/download/crosstool-ng/crosstool-ng-1.18.0.tar.bz2
tar xjvf crosstool-ng-1.18.0.tar.bz2
yum install gperf.x86_64 texinfo.x86_64 gmp-devel.x86_64

./configure –prefix=/usr/local/
make
make install

unset LD_LIBRARY_PATH (or “export -n LD_LIBRARY_PATH”, this step is because of crosstool-ng may use this environment, so if we leave the host one, it may screw up the target compiling)
and add the destination address to PATH:
export PATH=/usr/local/x-tools/crosstool-ng/bin:$PATH (then you can use “ct-ng”)
mkdir ~/pi-staging
cd ~/pi-staging
ct-ng menuconfig

In the configuration, select related configuration of Host and Target OS, such as:
In target option, select target architecture as “ARM”, and set “Endianness set” as “Little endian” and “Bitness” as “32-bit”

      • enable “EXPERIMENTAL” features in “Paths and misc”, leave your prefix directory as “${HOME}/x-tools/${CT_TARGET}”
      • in “Operating System”, set “Target OS” as “linux”, and “Linux Kernel” to that ARM pedora’s kernel version
      • Set “binutils” version to “2.21.1a”
      • Set “C Compiler” version to what ARM pedora is using, and enable “C++” if needed
      • Set “glibc” or “eglibc” or “ulibc” version to what ARM pedora is using (get this information from /lib/libc.so.6)

Then:
ct-ng build

this will build the basic environment of target os, and it may take 1 to 2 hours, after this building, you can find the binaries under directory “~/x-tools/arm-unknown-linux-gnueabi/bin/”, and we need add this directory into PATH, and following binary provided:
arm-unknown-linux-gnueabi-gcc
arm-unknown-linux-gnueabi-arm

    1. After build crosstool-ng, there is one “sysroot” directory under “x-tools/arm-unknown-linux-gnueabi/arm-unknown-linux-gnueabi/sysroot”, this information can be found in cross tool-ng’s document “crosstool-ng/docs/”When we build a big project which has multiple libraries (static or dynamic) and binaries, the project may depend on more third party libraries, so we need make the build environment supports third party libraries deployment capability, the straightforward way is include all source code of third party libraries into the project, but that costs a lot of debugging and waste building time, the simple way is to setup a sysroot (like chroot) to “install”/”unpack” the third party libraries into the sysroot directory
        1. we’re not going to use built-in sysroot directory, we will setup a new sysroot directory that can benefit multiple projects use their own sysroot and screw one won’t affect other projects
        2. run following commands for a new sysroot

      cp -r ~/x-tools/arm-unknown-linux-gnueabi/arm-unknown-linux-gnueabi/sysroot ~/my-sysroot

        1. once you need a third party library to build a project, you can download the rpm from http://downloads.raspberrypi.org/pidora/releases/18/packages/armv6hl/os/Packages

      such as I need libcurl-devel for my project, I can download rpm and unpack the package to my-sysroot:
      cd ~/my-sysroot/
      rpm2cpio libcurl-devel-7.27.0-6.fc18.armv6hl.rpm | cpio -idmv

    2. start cross compiling your project

Lots of our projects are using autoconf or automake, so we may use “./configure” and “make” during the project compilation
Set CC to the cross-compiler for the target you configured the library for; it is important to use this same CC value when running configure, like this: ‘CC=target-gcc configure target’. Set BUILD_CC to the compiler to use for programs run on the build system as part of compiling the library. You may need to set AR to cross-compiling versions of ar if the native tools are not configured to work with object files for the target you configured for. such as:

CC=arm-unknown-linux-gnueabi-gcc AR=arm-unknown-linux-gnueabi-ar ./configure

4.1 once the Makefile is generated, go check the Makefile is using the right gcc and ar version as you want
and if there is architecture parameter in the Makefile, please check its using “armv6” instead of “i386” or “x86_64”

4.2 If there is header files needed, please check the CFLAGS that it’s pointing to the header file in your ~/my-sysroot directory, not host system header file

4.3 Some binary or library needs third party library to link, so add lib directory of your sysroot to CFLAGS and LDFLAGS as well, like:

CFLAGS+=-B/root/my-sysroot/usr/lib

Sometimes, the make process can not find some library even you give the path it exist, then you may need following adding to Makefile
LDFLAGS+=-Wl,-rpath,/root/my-sysroot/lib

  1. Troubleshootings
    5.1 if you want to see the process of buildadding “-v” to “CFLAGS“5.2 cannot find crti.o: No such file or directory

    Adding sysroot lib directory “-B/root/my-sysroot/usr/lib” to “CFLAGS

    5.3 arm-unknown-linux-gnueabi/bin/ld: cannot find /lib/libpthread.so.0
    arm-unknown-linux-gnueabi/bin/ld: skipping incompatible /usr/lib/libpthread_nonshared.a when searching for /usr/lib/libpthread_nonshared.a
    arm-unknown-linux-gnueabi/bin/ld: cannot find /usr/lib/libpthread_nonshared.a

    This need edit file in my-sysroot to delete absolute path:
    5.3.1 cd ~/my-sysroot/usr/lib/
    vi libpthread.so

    change “GROUP ( /lib/libpthread.so.0 /usr/lib/libpthread_nonshared.a )
    to:
    GROUP ( libpthread.so.0 libpthread_nonshared.a )

    5.3.2 this solution apply to libc.so file for cross-compiling as well

    5.4 arm-unknown-linux-gnueabi/bin/ld: cannot find -lboost_regex
    download and install boost-static from pedora to sysroot which includes static boost libraries
    And adding the sysroot lib directory to LDFLAGS such as:
    LDFLAGS+=-Wl,-rpath,/root/my-sysroot/lib

    This method applies to libpthread.so.0, librt.so.1, libresolv.so.2 not found issues

    5.5 arm-unknown-linux-gnueabi/bin/ld: warning: ld-linux.so.3, needed by my-sysroot//usr/lib/libz.so, not found (try using -rpath or -rpath-link)

    This is because of the ld-linux.so.3 in your sysroot points to the Host OS one, we need relink it to target sysroot lib file:
    make sure you unpack the pedora glibc library into your sysroot (rpm2cpio glibc-2.16-28.fc18.a6.armv6hl.rpm | cpio -idmv),
    then:
    cd ~/my-sysroot/lib/
    ls -l ld-linux.so.3
    rm ld-linux.so.3
    ln -s ld-linux-armhf.so.3 ld-linux.so.3

    5.6 wired link problems
    5.6.1 sometimes if you provide all static or dynamic libraries directories from your sysroot for the binary or library to link, it still fail, try to switch the third party libraries sequence in LDFLAGS, such as change “-lz -lcrypto” to “-lcrypto -lz
    5.6.2 In function `boost::iostreams::detail::bzip2_base::compress(int)': undefined reference to `BZ2_bzCompress'
    In this case, your Makefile may want to link static libraries into your binary, in this case you can try to link dynamic linked library instead of static linked library.

  • References:

http://crosstool-ng.org/
http://jecxjo.motd.org/code/blosxom.cgi/devel/cross_compiler_environment.html
http://fedoraproject.org/wiki/Raspberry_Pi
http://downloads.raspberrypi.org/pidora/releases/18/packages/armv6hl/os/Packages
http://linux.die.net/man/1/yumdownloader
http://smdaudhilbe.wordpress.com/2013/04/26/steps-to-create-cross-compiling-toolchain-using-crosstool-ng/
http://briolidz.wordpress.com/2012/02/07/building-embedded-arm-systems-with-crosstool-ng/
http://www.kitware.com/blog/home/post/426
http://stackoverflow.com/questions/6329887/compiling-problems-cannot-find-crt1-o

http://www.raspberrypi.org/phpBB3/viewtopic.php?f=33&t=37658

Read Full Post | Make a Comment ( None so far )

Linux performance related urls

Posted on August 28, 2013. Filed under: Linux | Tags: , |

1. red hat rhel6 Performance Tuning Guide

2. Xen Network Throughput and Performance Guide

Read Full Post | Make a Comment ( None so far )

How to create Patch on Liunx

Posted on June 7, 2012. Filed under: Linux, Programming |

In this post I will explain how to create a patch file and how to apply patch.
I only know the basic stuff. Therefore, if you want to know more then please use command man diff at command prompt to explore other possible options.

How to Create a Patch File?
Patch file is a readable text file which is created using diff command.
To create a patch of entire source folder including its sub directories, do as follows:

$ diff -rauB  old_dir   new_dir   >  patchfile_name

options -rauB mean:
r  -> recursive (multiple levels dir)
a ->  treat all files as text
u -> Output NUM (default 3) lines of unified context
B -> B is to ignore Blank Lines

Blank lines are usually useless in the patch file.  Therefore I used B option, however you are free to tune your patch file according to your choice.
old_dir is the original source code directory,  new_dir is the new version of source code and patchfile_name is the name of the patch file.  You can give any name to your patch file.

How to Apply Patch to your Source Code?
Suppose you have got a patch file and you want to apply that patch to the version of your source code.
It is not necessary but I recommend to do a dry-run test first.

To do a dry run:
$ patch –dry-run -p1 -i patchfile_name

i : Read the patch from patchfile.
pnum : Strip  the  smallest prefix containing num leading slashes from each
file name found in the patch file.  A sequence of one or more  adjacent
slashes  is counted as a single slash.  This controls how file
names found in the patch file are treated, in  case  you  keep  your
files  in  a  different  directory  than the person who sent out the
patch.

If there is no failure then you can apply patch to your source code:
$ patch -p1 -i patchfile_name

I hope this post will help you to generate patch from your source code and apply patch to your source code.

 

Reference:

http://kailaspatil.blogspot.com/2010/08/how-to-create-patch-on-liunx.html

 

Read Full Post | Make a Comment ( None so far )

Useful Linux/Redhat Admin commands:chkconfig, rpm, firstboot, killproc

Posted on January 26, 2011. Filed under: Linux, Services |

1. chkconfig

chkconfig provides a  simple  command-line  tool  for  maintaining  the
       /etc/rc[0-6].d  directory  hierarchy by relieving system administrators
       of the task of directly manipulating the  numerous  symbolic  links  in
       those directories.

examples:

chkconfig –add servicename

chkconfig –del servicename

chkconfig –list

Normally, we are using chkconfig to manage the initscripts of package or applications.

2. rpm

RPM Package Manager is a package management system. The name RPM refers to two things: software packaged in the .rpm file format, and the package manager itself. RPM was intended primarily for GNU/Linux distributions; the file format is the baseline package format of the Linux Standard Base.
Originally developed by Red Hat for Red Hat Linux, RPM is now used by many GNU/Linux distributions. It has also been ported to some other operating systems, such as Novell NetWare (as of version 6.5 SP3) and IBM’s AIX as of version 4.
Originally standing for “Red Hat Package Manager”, RPM now stands for “RPM Package Manager”, a recursive acronym.

–  How to View Installation / Uninstallation Script Inside The RPM File ?

To list the package specific scriptlet(s) that are used as part of the installation and uninstallation processes pass –scripts option to rpm command. You also need to specify the following options:

-q option : Query option
-p option : Query an (uninstalled) package PACKAGE_FILE. The PACKAGE_FILE may be specified as an ftp or http style URL, in which case the package header will be downloaded and querie).

General syntax for uninstalled foo.rpm file

$ rpm -qp --scripts foo.rpm
Find out Installation / Uninstallation scripts inside the rpm file called monit-4.9-2.el5.rf.x86_64.rpm:
$ rpm -qp --scripts monit-4.9-2.el5.rf.x86_64.rpm

General syntax for installed httpd package

$ rpm -q --scripts httpd

3. firstboot

firstboot is the program that runs on the first boot of a Fedora or Red Hat Enterprise Linux system that allows you to configure more things than the installer allows.

4. rsyslog

Rsyslog has become the de-facto standard on modern Linux operating systems. It’s high-performance log processing, database integration, modularity and support for multiple logging protocols make it the sysadmin’s logging daemon of choice. The project was started in 2004 and has since then evolved rapidly.

The SIGHUP signal is used to reload the rsyslogd, not to kill it.

 5. killproc

killproc sends signals to all processes that use the specified executable.  If no signal name is  specified,  the
      signal  SIGTERM  is  sent. If this program is not called with the name killproc then SIGHUP is used. Note that if
      SIGTERM is used and does not terminate a process the signal SIGKILL is send after a few  seconds  (default  is  5
      seconds,  see  option -t).  If a program has been terminated successfully and a verified pid file was found, this
      pid file will be removed if the terminated process didn't already do so.

References:

http://man-wiki.net/index.php/8:killproc

http://en.wikipedia.org/wiki/RPM_Package_Manager

http://blog.gerhards.net/2008/10/new-rsyslog-hup-processing.html

http://fedoraproject.org/wiki/FirstBoot

http://www.cyberciti.biz/faq/rhel-list-package-specific-scriptlets/

http://linuxcommand.org/man_pages/chkconfig8.html

http://www.linuxjournal.com/article/4445

http://www.techrepublic.com/article/using-chkconfig-to-control-initscripts/5033660

http://krnjevic.com/wp/?p=76

Read Full Post | Make a Comment ( None so far )

up and running with Cassandra

Posted on March 21, 2010. Filed under: Java, Linux, MySQL, PHP, Services | Tags: , , |

Cassandra is a hybrid non-relational database in the same class as Google’s BigTable. It is more featureful than a key/value store like Dynomite, but supports fewer query types than a document store like MongoDB.

Cassandra was started by Facebook and later transferred to the open-source community. It is an ideal runtime database for web-scale domains like social networks.

This post is both a tutorial and a “getting started” overview. You will learn about Cassandra’s features, data model, API, and operational requirements—everything you need to know to deploy a Cassandra-backed service.

Jan 8, 2010: post updated for Cassandra gem 0.7 and Cassandra version 0.5.

features

There are a number of reasons to choose Cassandra for your website. Compared to other databases, three big features stand out:

  • Flexible schema: with Cassandra, like a document store, you don’t have to decide what fields you need in your records ahead of time. You can add and remove arbitrary fields on the fly. This is an incredible productivity boost, especially in large deployments.
  • True scalability: Cassandra scales horizontally in the purest sense. To add more capacity to a cluster, turn on another machine. You don’t have restart any processes, change your application queries, or manually relocate any data.
  • Multi-datacenter awareness: you can adjust your node layout to ensure that if one datacenter burns in a fire, an alternative datacenter will have at least one full copy of every record.

Some other features that help put Cassandra above the competition :

  • Range queries: unlike most key/value stores, you can query for ordered ranges of keys.
  • List datastructures: super columns add a 5th dimension to the hybrid model, turning columns into lists. This is very handy for things like per-user indexes.
  • Distributed writes: you can read and write any data to anywhere in the cluster at any time. There is never any single point of failure.

installation

You need a Unix system. If you are using Mac OS 10.5, all you need is Git. Otherwise, you need to install Java 1.6, Git 1.6, Ruby, and Rubygems in some reasonable way.

Start a terminal and run:

sudo gem install cassandra

If you are using Mac OS, you need to export the following environment variables:

export JAVA_HOME="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home"
export PATH="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin:$PATH"

Now you can build and start a test server with cassandra_helper:

cassandra_helper cassandra

It runs!

live demo

The above script boots the server with a schema that we can interact with. Open another terminal window and start irb, the Ruby shell:

irb

In the irb prompt, require the library:

require 'rubygems'
require 'cassandra'
include Cassandra::Constants

Now instantiate a client object:

twitter = Cassandra.new('Twitter')

Let’s insert a few things:

user = {'screen_name' => 'buttonscat'}
twitter.insert(:Users, '5', user)  

tweet1 = {'text' => 'Nom nom nom nom nom.', 'user_id' => '5'}
twitter.insert(:Statuses, '1', tweet1)

tweet2 = {'text' => '@evan Zzzz....', 'user_id' => '5', 'reply_to_id' => '8'}
twitter.insert(:Statuses, '2', tweet2)

Notice that the two status records do not have all the same columns. Let’s go ahead and connect them to our user record:

twitter.insert(:UserRelationships, '5', {'user_timeline' => {UUID.new => '1'}})
twitter.insert(:UserRelationships, '5', {'user_timeline' => {UUID.new => '2'}})

The UUID.new call creates a collation key based on the current time; our tweet ids are stored in the values.

Now we can query our user’s tweets:

timeline = twitter.get(:UserRelationships, '5', 'user_timeline', :reversed => true)
timeline.map { |time, id| twitter.get(:Statuses, id, 'text') }
# => ["@evan Zzzz....", "Nom nom nom nom nom."]

Two tweet bodies, returned in recency order—not bad at all. In a similar fashion, each time a user tweets, we could loop through their followers and insert the status key into their follower’s home_timeline relationship, for handling general status delivery.

the data model

Cassandra is best thought of as a 4 or 5 dimensional hash. The usual way to refer to a piece of data is as follows: a keyspace, a column family, a key, an optional super column, and a column. At the end of that chain lies a single, lonely value.

Let’s break down what these layers mean.

  • Keyspace (also confusingly called “table”): the outer-most level of organization. This is usually the name of the application. For example, 'Twitter' and 'Wordpress' are both good keyspaces. Keyspaces must be defined at startup in the storage-conf.xml file.
  • Column family: a slice of data corresponding to a particular key. Each column family is stored in a separate file on disk, so it can be useful to put frequently accessed data in one column family, and rarely accessed data in another. Some good column family names might be :Posts, :Users and :UserAudits. Column families must be defined at startup.
  • Key: the permanent name of the record. You can query over ranges of keys in a column family, like :start => '10050', :finish => '10070'—this is the only index Cassandra provides for free. Keys are defined on the fly.

After the column family level, the organization can diverge—this is a feature unique to Cassandra. You can choose either:

  • A column: this is a tuple with a name and a value. Good columns might be 'screen_name' => 'lisa4718' or 'Google' => 'http://google.com'.It is common to not specify a particular column name when requesting a key; the response will then be an ordered hash of all columns. For example, querying for (:Users, '174927') might return:
    {'name' => 'Lisa Jones',
     'gender' => 'f',
     'screen_name' => 'lisa4718'}

    In this case, name, gender, and screen_name are all column names. Columns are defined on the fly, and different records can have different sets of column names, even in the same keyspace and column family. This lets you use the column name itself as either structure or data. Columns can be stored in recency order, or alphabetical by name, and all columns keep a timestamp.

  • A super column: this is a named list. It contains standard columns, stored in recency order.Say Lisa Jones has bookmarks in several categories. Querying (:UserBookmarks, '174927') might return:
    {'work' => {
        'Google' => 'http://google.com',
        'IBM' => 'http://ibm.com'},
     'todo': {...},
     'cooking': {...}}

    Here, work, todo, and cooking are all super column names. They are defined on the fly, and there can be any number of them per row. :UserBookmarks is the name of the super column family. Super columns are stored in alphabetical order, with their sub columns physically adjacent on the disk.

Super columns and standard columns cannot be mixed at the same (4th) level of dimensionality. You must define at startup which column families contain standard columns, and which contain super columns with standard columns inside them.

Super columns are a great way to store one-to-many indexes to other records: make the sub column names TimeUUIDs (or whatever you’d like to use to sort the index), and have the values be the foreign key. We saw an example of this strategy in the demo, above.

If this is confusing, don’t worry. We’ll now look at two example schemas in depth.

twitter schema

Here is the schema definition we used for the demo, above. It is based on Eric Florenzano’s Twissandra:

<Keyspace Name="Twitter">
  <ColumnFamily CompareWith="UTF8Type" Name="Statuses" />
  <ColumnFamily CompareWith="UTF8Type" Name="StatusAudits" />
  <ColumnFamily CompareWith="UTF8Type" Name="StatusRelationships"
    CompareSubcolumnsWith="TimeUUIDType" ColumnType="Super" />
  <ColumnFamily CompareWith="UTF8Type" Name="Users" />
  <ColumnFamily CompareWith="UTF8Type" Name="UserRelationships"
    CompareSubcolumnsWith="TimeUUIDType" ColumnType="Super" />
</Keyspace>

What could be in StatusRelationships? Maybe a list of users who favorited the tweet? Having a super column family for both record types lets us index each direction of whatever many-to-many relationships we come up with.

Here’s how the data is organized:

Click to enlarge

Cassandra lets you distribute the keys across the cluster either randomly, or in order, via the Partitioner option in the storage-conf.xml file.

For the Twitter application, if we were using the order-preserving partitioner, all recent statuses would be stored on the same node. This would cause hotspots. Instead, we should use the random partitioner.

Alternatively, we could preface the status keys with the user key, which has less temporal locality. If we used user_id:status_id as the status key, we could do range queries on the user fragment to get tweets-by-user, avoiding the need for a user_timeline super column.

multi-blog schema

Here’s a another schema, suggested to me by Jonathan Ellis, the primary Cassandra maintainer. It’s for a multi-tenancy blog platform:

<Keyspace Name="Multiblog">
  <ColumnFamily CompareWith="TimeUUIDType" Name="Blogs" />
  <ColumnFamily CompareWith="TimeUUIDType" Name="Comments"/>
</Keyspace>

Imagine we have a blog named ‘The Cutest Kittens’. We will insert a row when the first post is made as follows:

require 'rubygems'
require 'cassandra'
include Cassandra::Constants

multiblog = Cassandra.new('Multiblog')

multiblog.insert(:Blogs, 'The Cutest Kittens',
  { UUID.new =>
    '{"title":"Say Hello to Buttons Cat","body":"Buttons is a cute cat."}' })

UUID.new generates a unique, sortable column name, and the JSON hash contains the post details. Let’s insert another:

multiblog.insert(:Blogs, 'The Cutest Kittens',
  { UUID.new =>
    '{"title":"Introducing Commie Cat","body":"Commie is also a cute cat"}' })

Now we can find the latest post with the following query:

post = multiblog.get(:Blogs, 'The Cutest Kittens', :reversed => true).to_a.first

On our website, we can build links based on the readable representation of the UUID:

guid = post.first.to_guid
# => "b06e80b0-8c61-11de-8287-c1fa647fd821"

If the user clicks this string in a permalink, our app can find the post directly via:

multiblog.get(:Blogs, 'The Cutest Kittens', :start => UUID.new(guid), :count => 1)

For comments, we’ll use the post UUID as the outermost key:

multiblog.insert(:Comments, guid,
  {UUID.new => 'I like this cat. - Evan'})
multiblog.insert(:Comments, guid,
  {UUID.new => 'I am cuter. - Buttons'})

Now we can get all comments (oldest first) for a post by calling:

multiblog.get(:Comments, guid)

We could paginate them by passing :start with a UUID. See this presentation to learn more about token-based pagination.

We have sidestepped two problems with this data model: we don’t have to maintain separate indexes for any lookups, and the posts and comments are stored in separate files, where they don’t cause as much write contention. Note that we didn’t need to use any super columns, either.

storage layout and api comparison

The storage strategy for Cassandra’s standard model is the same as BigTable’s. Here’s a comparison chart:

multi-file per-file intra-file
Relational server database table* primary key column value
BigTable cluster table column family key column name column value
Cassandra, standard model cluster keyspace column family key column name column value
Cassandra, super column model cluster keyspace column family key super column name column name column value

* With fixed column names.

Column families are stored in column-major order, which is why people call BigTable a column-oriented database. This is not the same as a column-oriented OLAP database like Sybase IQ—it depends on what you use the column names for.

Click to enlarge

In row-orientation, the column names are the structure, and you think of the column families as containing keys. This is the convention in relational databases.

Click to enlarge

In column-orientation, the column names are the data, and the column families are the structure. You think of the key as containing the column family, which is the convention in BigTable. (In Cassandra, super columns are also stored in column-major order—all the sub columns are together.)

In Cassandra’s Ruby API, parameters are expressed in storage order, for clarity:

Relational SELECT `column` FROM `database`.`table` WHERE `id` = key;
BigTable table.get(key, "column_family:column")
Cassandra: standard model keyspace.get("column_family", key, "column")
Cassandra: super column model keyspace.get("column_family", key, "super_column", "column")

Note that Cassandra’s internal Thrift interface mimics BigTable in some ways, but this is being changed.

going to production

Cassandra is an alpha product and could, theoretically, lose your data. In particular, if you change the schema specified in the storage-conf.xml file, you must follow these instructions carefully, or corruption will occur (this is going to be fixed). Also, the on-disk storage format is expected to change in version 0.4.0. After that things will be a bit more stable.

The biggest deployment is at Facebook, where hundreds of terabytes of token indexes are kept in about a hundred Cassandra nodes. However, their use case allows the data to be rebuilt if something goes wrong. Currently there are no known deployments of non-transient data. Proceed carefully, keep a backup in an unrelated storage engine…and submit patches if things go wrong.

That aside, here is a guide for deploying a production cluster:

  • Hardware: get a handful of commodity Linux servers. 16GB memory is good; Cassandra likes a big filesystem buffer. You don’t need RAID. If you put the commitlog file and the data files on separate physical disks, things will go faster. Don’t use EC2 or friends except for testing; the virtualized I/O is too slow.
  • Configuration: in the storage-conf.xml schema file, set the replication factor to 3. List the IP address of one of the nodes as the seed. Set the listen address to the empty string, so the hosts will resolve their own IPs. Now, adjust the contents of cassandra.in.sh for your various paths and JVM options—for a 16GB node, set the JVM heap to 4GB.
  • Deployment: build a package of Cassandra itself and your configuration files, and deliver it to all your servers (I use Capistrano for this). Start the servers by setting CASSANDRA_INCLUDE in the environment to point to your cassandra.in.sh file, and run bin/cassandra. At this point, you should see join notices in the Cassandra logs:
    Cassandra starting up...
    Node 10.224.17.13:7001 has now joined.
    Node 10.224.17.14:7001 has now joined.

    Congratulations! You have a cluster. Don’t forget to turn off debug logging in the log4j.properties file.

  • Visibility: you can get a little more information about your cluster via the tool bin/nodeprobe, included:
    $ bin/nodeprobe --host 10.224.17.13 ring
    Token(124007023942663924846758258675932114665)  3 10.224.17.13  |<--|
    Token(106858063638814585506848525974047690568)  3 10.224.17.19  |   ^
    Token(141130545721235451315477340120224986045)  3 10.224.17.14  |-->|

    Cassandra also exposes various statistics over JMX.

Note that your client machines (not servers!) must have accurate clocks for Cassandra to resolve write conflicts properly. Use NTP.

conclusion

There is a misperception that if someone advocates a non-relational database, they either don’t understand SQL optimization, or they are generally a hater. This is not the case.

It is reasonable to seek a new tool for a new problem, and database problems have changed with the rise of web-scale distributed systems. This does not mean that SQL as a general-purpose runtime and reporting tool is going away. However, at web-scale, it is more flexible to separate the concerns. Runtime object lookups can be handled by a low-latency, strict, self-managed system like Cassandra. Asynchronous analytics and reporting can be handled by a high-latency, flexible, un-managed system like Hadoop. And in neither case does SQL lend itself to sharding.

I think that Cassandra is the most promising current implementation of a runtime distributed database, but much work remains to be done. We’re beginning to use Cassandra at Twitter, and here’s what I would like to happen real-soon-now:

  • Interface cleanup: the Thrift API for Cassandra is incomplete and inconsistent, which makes writing clients very irritating.
    Done!
  • Online migrations: restarting the cluster 3 times to add a column family is silly.
  • ActiveModel or DataMapper adapter: for interaction with business objects in Ruby.
    Done! Michael Koziarski on the Rails core team wrote an ActiveModel adapter.
  • Scala client: for interoperability with JVM middleware.

Go ahead and jump on any of those projects—it’s a chance to get in on the ground floor.

Cassandra has excellent performance. There some benchmark results for version 0.5 at the end of the Yahoo performance study.

further resources

Reference(cited):
Read Full Post | Make a Comment ( 1 so far )

NPAPI

Posted on February 25, 2010. Filed under: C/C++, Linux, Mac, Windows | Tags: , , |

Netscape Plugin Application Programming Interface (NPAPI) is a cross-platform plugin architecture used by many web browsers.

It was first developed for the Netscape family of browsers starting with Netscape Navigator 2.0 but has subsequently been implemented in other browsers including Mozilla Application Suite, Mozilla Firefox, Safari, Google Chrome, Opera, Konqueror, and some older versions of Microsoft Internet Explorer.

Its success can be partly attributed to its simplicity. A plugin declares that it handles certain content types (e.g. “audio/mp3”) through exposed file information. When the browser encounters such content type it loads the associated plugin, sets aside the space within the browser content for the plugin to render itself and then streams data to it. The plugin is then responsible for rendering the data as it sees fit, be it visual, audio or otherwise. So a plugin runs in-place within the page, as opposed to older browsers that had to launch an external application to handle unknown content types.

The API requires each plugin to implement and expose a comparatively small number of functions. There are approximately 15 functions in total for initializing, creating, destroying, and positioning plugins. The NPAPI also supports scripting, printing, full screen plugins, windowless plugins and content streaming.

Reference:

http://en.wikipedia.org/wiki/NPAPI

https://developer.mozilla.org/en/Plugins

http://colonelpanic.net/2009/03/building-a-firefox-plugin-part-one/

http://colonelpanic.net/2009/05/building-a-firefox-plugin-part-two/

http://colonelpanic.net/2009/08/building-a-firefox-plugin-%E2%80%93-part-three/

Read Full Post | Make a Comment ( 1 so far )

ubuntu livecd mount LVM2 fs

Posted on January 28, 2010. Filed under: Linux | Tags: , , , , , |

1. Startup ubuntu livecd

2. sudo apt-get install lvm2

do NOT reboot!!

3.if you mount as following:

sudo mount -t auto /dev/sda2 /mnt

you will get error: mount: unknown filesystem type ‘LVM2_member’

4. check the phisical disk

[ubuntu] $sudo pvs

PV          VG         Fmt  Attr PSize   PFree 

  /dev/sda1  VolGroup00 lvm2 a-   100M 40.00M

/dev/sda2 VolGroup01 lvm2 a- 200G 40.00M
5. check the volume

[ubuntu]$sudo vgs

 VG         #PV #LV #SN Attr   VSize   VFree 
  VolGroup01   1   4   0 wz--n- 200G 40.00M

6. check the logical volume

[ubuntu]lvdisplay

  --- Logical volume ---
  LV Name                /dev/VolGroup01/LogVol03
  VG Name                VolGroup01
  LV UUID                YhG8Fu-Zrek-zx8D-AzxC-Dz34-d32F-dfal23I
  LV Write Access        read/write
  LV Status              unenable
  # open                 1
  LV Size                200GB
  Current LE             781
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

7. active the logic vol

[ubuntu]vgchange -ay /dev/VolGroup01

8. Check the status again

[ubuntu]lvdisplay

  --- Logical volume ---
  LV Name                /dev/VolGroup01/LogVol03
  VG Name                VolGroup01
  LV UUID                YhG8Fu-Zrek-zx8D-AzxC-Dz34-d32F-dfal23I
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                200GB
  Current LE             781
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

9. mount it

mount /dev/VolGroup01/LogVol03 /mnt


Reference:

http://fedoraforum.org/forum/archive/index.php/t-64964.html

http://forums.opensuse.org/install-boot-login/396994-mount-lvm2-partition.html

http://www.howtoforge.com/linux_lvm

Read Full Post | Make a Comment ( 3 so far )

Major and Minor Numbers

Posted on January 18, 2010. Filed under: Linux, Services | Tags: , , , , |

One of the basic features of the Linux kernel is that it abstracts the handling of devices. All hardware devices look like regular files; they can be opened, closed, read and written using the same, standard, system calls that are used to manipulate files. Every device in the system is represented by a file. For block (disk) and character devices, these device files are created by the mknod command and they describe the device using major and minor device numbers. Network devices are also represented by device special files but they are created by Linux as it finds and initializes the network controllers in the system.

To UNIX, everything is a file. To write to the hard disk, you write to a file. To read from the keyboard is to read from a file. To store backups on a tape device is to write to a file. Even to read from memory is to read from a file. If the file from which you are trying to read or to which you are trying to write is a “normal” file, the process is fairly easy to understand: the file is opened and you read or write data. If, however, the device you want to access is a special device file (also referred to as a device node), a fair bit of work needs to be done before the read or write operation can begin.

One key aspect of understanding device files lies in the fact that different devices behave and react differently. There are no keys on a hard disk and no sectors on a keyboard, though you can read from both. The system, therefore, needs a mechanism whereby it can distinguish between the various types of devices and behave accordingly.

To access a device accordingly, the operating system must be told what to do. Obviously, the manner in which the kernel accesses a hard disk will be different from the way it accesses a terminal. Both can be read from and written to, but that’s about where the similarities end. To access each of these totally different kinds of devices, the kernel needs to know that they are, in fact, different.

Inside the kernel are functions for each of the devices the kernel is going to access. All the routines for a specific device are jointly referred to as the device driver. Each device on the system has its own device driver. Within each device driver are the functions that are used to access the device. For devices such as a hard disk or terminal, the system needs to be able to (among other things) open the device, write to the device, read from the device, and close the device. Therefore, the respective drivers will contain the routines needed to open, write to, read from, and close (among other things) those devices.

The kernel needs to be told how to access the device. Not only does the kernel need to be told what kind of device is being accessed but also any special information, such as the partition number if it’s a hard disk or density if it’s a floppy, for example. This is accomplished by the major number and minor number of that device.

All devices controlled by the same device driver have a common major device number. The minor device numbers are used to distinguish between different devices and their controllers. Linux maps the device special file passed in system calls (say to mount a file system on a block device) to the device’s device driver using the major device number and a number of system tables, for example the character device table, chrdevs. The major number is actually the offset into the kernel’s device driver table, which tells the kernel what kind of device it is (whether it is a hard disk or a serial terminal). The minor number tells the kernel special characteristics of the device to be accessed. For example, the second hard disk has a different minor number than the first. The COM1 port has a different minor number than the COM2 port, each partition on the primary IDE disk has a different minor device number, and so forth. So, for example, /dev/hda2, the second partition of the primary IDE disk has a major number of 3 and a minor number of 2.

It is through this table that the routines are accessed that, in turn, access the physical hardware. Once the kernel has determined what kind of device to which it is talking, it determines the specific device, the specific location, or other characteristics of the device by means of the minor number.

The major number for the hd (IDE) driver is hard-coded at 3. The minor numbers have the format

(<unit>*64)+<part>

where <unit> is the IDE drive number on the first controller, either 0 or 1, which is then multiplied by 64. That means that all hd devices on the first IDE drive have a minor number less than 64. <part> is the partition number, which can be anything from 1 to 20. Which minor numbers you will be able to access will depend on how many partitions you have and what kind they are (extended, logical, etc.). The minor number of the device node that represents the whole disk is 0. This has the node name hda, whereas the other device nodes have a name equal to their minor number (i.e., /dev/hda6 has a minor number 6).

If you were to have a second IDE on the first controller, the unit number would be 1. Therefore, all of the minor numbers would be 64 or greater. The minor number of the device node representing the whole disk is 1. This has the node name hdb, whereas the other device nodes have a name equal to their minor number plus 64 (i.e., /dev/hdb6 has a minor number 70).

If you have more than one IDE controller, the principle is the same. The only difference is that the major number is 22.

For SCSI devices, the scheme is a little different. When you have a SCSI host adapter, you can have up to seven hard disks. Therefore, we need a different way to refer to the partitions. In general, the format of the device nodes is

sd<drive><partition>

where sd refers to the SCSI disk driver, <drive> is a letter for the physical drive, and <partition> is the partition number. Like the hd devices, when a device refers to the entire disk, for example the device sda refers to the first disk.

The major number for all SCSI drives is 8. The minor number is based on the drive number, which is multiplied by 16 instead of 64, like the IDE drives. The partition number is then added to this number to give the minor.

The partition numbers are not as simple to figure out. Partition 0 is for the whole disk (i.e., sda). The four DOS primary partitions are numbered 14. Then the extended partitions are numbered 58. We then add 16 for each drive. For example:

brw-rw----   1 root     disk       8,  22 Sep 12  1994
/dev/sdb6

Because b is after the sd, we know that this is on the second drive. Subtracting 16 from the minor, we get 6, which matches the partition number. Because it is between 4 and 8, we know that this is on an extended partition. This is the second partition on the first extended partition.

The floppy devices have an even more peculiar way of assigning minor numbers. The major number is fairly easy its 2. Because the names are a little easier to figure out, lets start with them. As you might guess, the device names all begin with fd. The general format is

fd<drive><density><capacity>

where <drive> is the drive number (0 for A:, 1 for B:), <density> is the density (d-double, h-high), and <capacity> is the capacity of the drive (360Kb, 1440Kb). You can also tell the size of the drive by the density letter, lowercase letter indicates that it is a i5.25″ drive and an uppercase letter indicates that it is a 3.5″ driver. For example, a low-density 5.25″ drive with a capacity of 360Kb would look like

fd0d360

If your second drive was a high-density 3.5″ drive with a capacity of 1440Kb, the device would look like

fd1H1440

What the minor numbers represents is a fairly complicated process. In the fd(4) man-page there is an explanatory table, but it is not obvious from the table why specific minor numbers go with each device. The problem is that there is no logical progression as with the hard disks. For example, there was never a 3.5″ with a capacity of 1200Kb nor has there been a 5.25″ with a capacity of 1.44Mb. So you will never find a device with H1200 or h1440. So to figure out the device names, the best thing is to look at the man-page.

The terminal devices come in a few forms. The first is the system console, which are the devices tty0-tty?. You can have up to 64 virtual terminals on you system console, although most systems that I have seen are limited five or six. All console terminals have a major number of 4. As we discussed earlier, you can reach the low numbered ones with ALT-Fn, where n is the number of the function key. So ALT-F4 gets you to the fourth virtual console. Both the minor number and the tty number are based on the function key, so /dev/tty4 has a minor number 4 and you get to it with ALT-F4. (Check the console(4) man-page to see how to use and get to the other virtual terminals.)

Serial devices can also have terminals hooked up to them. These terminals normally use the devices /dev/ttySn, where n is the number of the serial port (0, 1, 2, etc.). These also have a minor number of 4, but the minor numbers all start at 64. (Thats why you can only have 63 virtual consoles.) The minor numbers are this base of 64, plus the serial port number (04). Therefore, the minor number of the third serial port would be 64+3=67.

Related to these devices are the modem control devices, which are used to access modems. These have the same minor numbers but have a major number of 5. The names are also based on the device number. Therefore, the modem device attached to the third serial port has a minor number of 64+3=67 and its name is cua3.

Another device with a major number of 5 is /dev/tty, which has a minor number of 0. This is a special device and is referred to as the “controlling terminal. ” This is the terminal device for the currently running process. Therefore, no matter where you are, no matter what you are running, sending something to /dev/tty will always appear on your screen.

The pseudo-terminals (those that you use with network connections or X) actually come in pairs. The “slave” is the device at which you type and has a name like ttyp?, where ? is the tty number. The device that the process sees is the “master” and has a name like ptyn, where n is the device number. These also have a major number 4. However, the master devices all have minor numbers based on 128. Therefore, pty0 has a minor number of 128 and pty9 has a minor number of 137 (128+9). The slave device has minor numbers based on 192, so the slave device ttyp0 has a minor number of 192. Note that the tty numbers do not increase numerically after 9 but use the letter af.

Other oddities with the device numbering and naming scheme are the memory devices. These have a major number of 1. For example, the device to access physical memory is /dev/mem with a minor number of 1. The kernel memory device is /dev/kmem, and it has a minor number of 2. The device used to access IO ports is /dev/port and it has a minor number of 4.

What about minor number 3? This is for device /dev/null, which is nothing. If you direct output to this device, it goes into nothing, or just disappears. Often the error output of a command is directed here, as the errors generated are uninteresting. If you redirect from /dev/null, you get nothing as well. Often I do something like this:

cat /dev/null > file_name

This device is also used a lot in shell scripts where you do not want to see any output or error messages. You can then redirect standard out or standard error to /dev/null and the messages disappear. For details on standard out and error, see the section on redirection.

If file_name doesn’t exist yet, it is created with a length of zero. If it does exist, the file is truncated to 0 bytes.

The device /dev/zero has a major number of 5 and its minor number is 5. This behaves similarly to /dev/null in that redirecting output to this device is the same as making it disappear. However, if you direct input from this device, you get an unending stream of zeroes. Not the number 0, which has an ASCII value of 48this is an ASCII 0.

Are those all the devices? Unfortunately not. However, I hope that this has given you a start on how device names and minor numbers are configured. The file <linux/major.h> contains a list of the currently used (at least well known) major numbers. Some nonstandard package might add a major number of its own. Up to this point, they have been fairly good about not stomping on the existing major numbers.

As far as the minor numbers go, check out the various man-pages. If there is a man-page for a specific device, the minor number will probably be under the name of the driver. This is in major.h or often the first letter of the device name. For example, the parallel (printer) devices are lp?, so check out man lp.

The best overview of all the major and minor numbers is in the /usr/src/linux/Documentation directory. The devices.txt is considered the “authoritative” source for this information.

References:

http://www.linux-tutorial.info/modules.php?name=MContent&pageid=94

http://www.linux-tutorial.info/modules.php?name=MContent&obj=glossary&term=man-page

http://www.kernel.org/pub/linux/docs/device-list/devices.txt

http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000ic/index.jsp?topic=/com.ibm.storage.ssic.help.doc/f2c_linuxdevnaming_2hsag8.html

http://docsrv.sco.com/HDK_concepts/ddT_majmin.html

http://burks.brighton.ac.uk/burks/linux/rute/node18.htm

http://www.makelinux.info/ldd3/chp-3-sect-2.shtml

http://www.thegeekstuff.com/2009/06/how-to-identify-major-and-minor-number-in-linux/

Read Full Post | Make a Comment ( None so far )

XenIntro

Posted on January 8, 2010. Filed under: C/C++, Linux, Services | Tags: , |

Xen Intro- version 1.0:

Contents

  1. Introduction
  2. Xen and IA32 Protection Modes
  3. The Xend daemon:
  4. The Xen Store:
  5. VT-x (virtual technology) processors – support in Xen
  6. Vmxloader
  7. VT-i (virtual technology) processors – support in Xen
  8. AMD SVM
  9. Xen On Solaris
  10. Step by step example of creating guest OS with Virtual Machine Manager in Fedora Core 6
  11. Physical Interrupts
  12. Backend Drivers:
  13. Migration and Live Migration:
  14. Creating of a domain – behind the scenes:
  15. HyperCalls Mapping to code Xen 3.0.2
  16. Virtualization and the Linux Kernel
  17. Pre-Virtualization
  18. Xen Storage
  19. kvm – Kernel-based Virtualization Driver
  20. Tip: How to build Xen with your own tar ball
  21. Xen in the Linux Kernel
  22. VMI : Virtual Machine Interface
  23. Links
  24. Adding new device and triggering the probe() functions
    1. deviceback.c
    2. xenbus.c
    3. common.h
    4. Makefile
  25. Adding a frontend device
    1. Makefile
    2. devicefront.c
  26. Discussion

Introduction

All of the following text refers to x86 platform of Xen-unstable, unless otherwise explicitly said. We will deal only with Xen on linux 2.6 ; We are not dealing at all with Xen on linux 2.4 (and as far as we know, in the future, domain 0 is intended to be based ONLY on 2.6). Moreover,currently the 2.4 linux tree is removed from Xen Tree (changeset 7263:f1abe953e401 from 8.10.05) but it can be that it will be back when some problems will be fixed.

This document deals only with Xen 3.0 version unless explictily said otherwise.

This is not intended to be a full and detailed documentation of the Xen project but we hope it will be a starting point to anyone who is interested in Xen and wants to learn more.

The Xen Project team is permitted to take part or all of this document and integrate it with the official Xen documentation or put it as a standalone document in the Xen Web Site if they wish, without any further notice.

This is not a complete detailed document nor a full walkthrough ,and many important issues are omitted. Any feedback is welcomed to : Rami Rosen , ramirose@gmail.com

Xen and IA32 Protection Modes

In the classical protection model of IA-32, there are 4 privilege levels; The highest ring is 0, where the kernel runs. (this level is also called SuperVisor Mode) The lowest is ring 3, where User applications run (this level is also called User Mode) Issuing some instructions , which are called “privileged instructions” , from ring which is NOT ring 0, will cause a General Protection Fault.

Ring 1,2 were not used through the years (except for in the case of OS/2). When running Xen, we run a Hypervisor in ring 0 and the guest OS in ring 1. The applications run unmodified at ring 3.

BTW, there are of course architectures which have a different privilege models; for example, in PPC both domain 0 and the Unprivileged domains run in supervisor mode. Diagram: Xen and IA32 Protection Modes

rings

The Xend daemon:

The Xend Daemon handles requests issued from Domain 0; requests can be, for example, creating a new domain (“xm create”) or listing the domains (“xm list”), shutting down a domain (“xm destroy”). Running “xm help” will show all possibilities.

You start the Xend daemon by running, after booting into domain0, “xend start”. “xend start” creates two daemons: xenstored and xenconsoled (see toos/misc/xend). It also creates an instance of a python SrvDaemon class and calls its start() method. (see tools/python/xen/xend/server/SrvDaemon.py).

The SrvDaemon start() method is in fact the xend main program.

In the past,the start() method of SrvDaemon eventually started an http socket (8000) on which it listened to http requests. Now it does not open an http socket on port 8000 anymore.

Note : There is an altenative to the management layer of Xen which is called libvirt; see http://libvir.org. This is a free API (LGPL)

The Xen Store:

The Xen Store Daemon provides a simple tree-like database to which we can read and write values. The Xen Store code is mainly under tools\xenstore.

It replaces the XCS, which was a daemon handling control messages.

The physical xenstore resides in one file: /var/lib/xenstored/tdb. (previously it was sacttered in some files; the change to using one file (named “tdb”) was probably to increase performance).

Both user space (“tools” in Xen terminology) and kernel code can write to the XenStore.The kernel code writes to the XenStore by using XenBus.

The python scripts (under tools/python) uses lowlevel/xs.c to read/write to the XenStore.

The Xen Store Daemon is started in xenstored_core.c. It creates a device file (“/dev/xen/evtchn”) in case such a device file does not exists and it opens it. (see : domain_init() ,file tools/xenstore/xenstored_domain.c).

It opens 2 TCP sockets (UNIX sockets). One of these sockets is a Read-Only socket, and it resides under /var/run/xenstored/socket_ro. The second is /var/run/xenstored/socket.

Connections on these sockets are represented by the connection struct.

A connection can be in one of three states:

        BLOCKED (blocked by a transaction)
        BUSY    (doing some action)
        OK      (completed it's transaction)

struct connection is declared in xenstore/xenstored_core.h; When a socket is ReadOnly,the “can_write” member of it is false.

Then we start an endless loop in which we can get input/output from three sources: the two sockets and the event channel, mentioned above.

Events, which are received in the event channel,are handled by handle_event() method (file xenstored_domain.c).

There are six executables under tools/xenstore, five of which are in fact made from the same module, which is xenstore_client.c, each time built with a different DEFINE passed. (See the Makefile). The sixth tool is built from xsls.c

These executables are : xenstore-exists, xenstore-list, xenstore-read, xenstore-rm, xenstore-write and xsls.

You can use these executable for accessing xenstore. For example: to view the list of fields of domain 0 which has a path “local/domain/0”, you run:

xenstore-list /local/domain/0

and a typical result can be the following list:

cpu
memory
name
console
vm
domid
backend

The xsls command is very useful and recursively shows the contents of a specified XenStore path. Essentially it does a xenstore-list and then a xenstore-read for each returned field, displaying the fields and their values and then repeating this recursively on each sub-path. For example: to view information about all VIFs backends hosted in domain 0 you may use the following command.

xsls /local/domain/0/backend/vif

and a typical result may be:

14 = ""
 0 = ""
  bridge = "xenbr0"
  domain = "vm1"
  handle = "0"
  script = "/etc/xen/scripts/vif-bridge"
  state = "4"
  frontend = "/local/domain/14/device/vif/0"
  mac = "aa:00:00:22:fe:9f"
  frontend-id = "14"
  hotplug-status = "connected"
15 = ""
 0 = ""
  mac = "aa:00:00:6e:d8:46"
  state = "4"
  handle = "0"
  script = "/etc/xen/scripts/vif-bridge"
  frontend-id = "15"
  domain = "vm2"
  frontend = "/local/domain/15/device/vif/0"
  hotplug-status = "connected"

(The xenstored must be running for these six executables to run; If xenstored is not running, then running theses executables will usually hang. The Xend daemon can be stopped).

An instance of struct node is the elementary unit of the XenStore. (struct node is defined in xenstored_core.h). The actual writing to the XenStore is done by write_node() method of xenstored_core.c.

xen_start_info structure has a member named :store_evtchn. (declared in public/xen.h as u16). This is the event channel for store communication.

VT-x (virtual technology) processors – support in Xen

Note: following text refers only to IA-32 unless explicitly said otherwise.

Intel had announced Pentium® 4 672 and 662 processors in November 2005 with virtualization support. (see, for example: http://www.physorg.com/news8160.html).

How does Xen support the Intel Virtualization Technology ?

The VT extensions support in Xen3 code is mostly in xen/arch/x86/hvm/vmx*.c.

  • and xen/include/asm-x86/vmx*.h and xen/arch/x86/x86_32/entry.S.

arch_vcpu structure (file xen/include/asm-x86/domain.h) contains a member which is called arch_vmx and is an instance of arch_vmx_struct. This member is also important to understand the VT-x mechanism.

But the most important structure for VT-x is the VMCS( vmcs_struct in the code) which represents the VMCS region.

The definition (file include/asm-x86/vmx_vmcs.h) is short:

struct vmcs_struct

  • { u32 vmcs_revision_id; unsigned char data [0]; /* vmcs size is read from MSR */ };

The VMCS region contains six logical regions; most relevant to our discussions are Guest-state area and Host-state area. We will also deal with the other four regions: VM-execution control fields,VM-exit control fields, VM-entry control fields and VM-exit information fields.

Intel added 10 new opcodes in VT-x to support Intel Virtualization Technology. They are detailed in the end of this section.

When using this technology, Xen runs in “VMX root operation mode” while the guest domains (which are unmodified OSs) run in “VMX non-root operation mode”. Since the guest domains run in “non-root operation” mode, it is more restricted,meaning that certain actions will cause “VM exit” to the VM.

Xen enters VMX operation in start_vmx() method. ( file xen/arch/x86/vmx.c)

This method is called from init_intel() method (file xen/arch/x86/cpu/intel.c.) (CONFIG_VMX should be defined).

First we check the X86_FEATURE_VMXE bit in ecx register to see if the cpuid shows that there is support for VMX in the processor. In IA-32 Intel added in the CR4 control register a bit specifying whether we want to enable VMX. So we must set this bit to enable VMX on the processor (by calling set_in_cr4(X86_CR4_VMXE)); This bit is bit 13 in CR4 (VMXE).

Then we call _vmxon to start VMX operation. If we will try to start VMX operation by _vmxon when the VMXE bit in CR4 is not set we will get exception (#UD , for undefined opcode)

In IA-64, things are a little different due to different architecture structure: Intel added a new bit in IA-64 in the Processor Status Register (PSR). This is bit 46 and it’s called VM. It should be set to 1 in guest OSs; and when it’s values is 1 , certain instructions cause virtualization fault.

VM exit:

Some instructions can cause unconditionally VM exit and some can cause VM exit under certain VM-execution control fields. (see the discussion about VMX-region above)

The following instructions will cause VM exit unconditionally: CPUID, INVD, MOV from CR3, RDMSR, WRMSR, and all the new VT-x instructions (which are listed below).

There are other instruction like HLT,INVPLG (Invalidate TLB Entry instruction) MWAIT and others which will cause VM exit if a corresponding VM-execution control was set.

Apart from VM-execution control fields, there are 2 bitmpas which are used for determining whether to perform VM exit: The first is the exception bitmap (see EXCEPTION_BITMAP in vmcs_field enum , file xen/include/asm-x86/vmx_vmcs.h). This bitmap is 32 bit field; when a bit is set in this bitmap, this causes a VM exit if a corresponding exception occurs; by default ,the entries which are set are EXCEPTION_BITMAP_PG (for page fault) and EXCEPTION_BITMAP_GP (for General Protection). see MONITOR_DEFAULT_EXCEPTION_BITMAP in vmx.h.

The second bitmap is the I/O bitmap (in fact, there are 2 I/O bitmaps,A and B, each is 4KB in size) which controls I/O instructions on ports. I/O bitmap A contains the ports in the range 0000-7FFF and I/O bitmap B contains the ports in the range 8000-FFFF. (one bit for each I/O port). see IO_BITMAP_A and IO_BITMAP_B in vmcs_field enum (VMCS Encordings).

When there is an “VM exit” we reach the vmx_vmexit_handler(struct cpu_user_regs regs) in vmx.c. We handle the VM exit according to the exit reason which we read from the VMCS region. We read the vmcs by calling vmread() ; The return value of vmread is 0 in case of success.

We sometimes also need to read some additional data (VM_EXIT_INTR_INFO) from the vmcs.

We get additional data by getting the “VM-exit interruption information” which is a 32 bit field and the “Exit qualification” (64 bit value).

For example, if the exception was NMI, we check if it is valid by checking bit 31 (valid bit) of the VM-exit interruption field. In case it is not valid we call _hvm_bug() to print some statistics and crash the domain.

Example of reading the “Exit qualification” field is in the case where the VMEXIT was caused by issuing INVPLG instruction.

When we work with vt-x, the guest OSs work in shadow mode, meaning they use shadow page tables; this is because the guest kernel in a VMX guest does not know that it’s being virtualized. There is no software visible bit which indicates that the processor is in VMX non-root operation. We set shadow mode by calling shadow_mode_enable() in vmx_final_setup_guest() method (file vmx.c).

There are 43 basic exit reasons – you can see part of them in vmx.h (fields starting with EXIT_REASON_ like EXIT_REASON_EXCEPTION_NMI, which is exit reason number 0, and so on).

In VT-x, Xen will probably use an emulated devices layer which will send virtual interrupts to the VMM. We can prevent the OS from receiving interrupts by setting the IF flag of EFLAGS.

The new ten opcodes which Intel added in Vt-x are detailed below:

1) VMCALL: (VMCALL_OPCODE in vmx.h)

  • This simply calls the VM monitor, causing vm exit.

2) VMCLEAR: (VMCLEAR_OPCODE in vmx.h)

  • copies VMCS data to memory in case it does not written there.
  • wrapper : _vmpclear (u64 addr) in vmx.h.

3) VMLAUNCH (VMLAUNCH_OPCODE in vmx.h)

  • launched a virtual machine; changes the launch state of the VMCS to
    • launched (if it is clear)

4) VMPTRLD (VMPTRLD_OPCODE file vmx.h)

  • loads a pointer to the VMCS.
    • wrapper : _vmptrld (u64 addr) (file vmx.h)

5) VMPTRST (VMPTRST_OPCODE in vmx.h)

  • stores a pointer to the VMCS.wrapper : _vmptrst (u64 addr) (file vmx.h.)

6) VMREAD (VMREAD_OPCODE in vmx.h)

  • read specified field from VMCS.
  • wrapper : _vmread(x, ptr) (file vmx.h)

7) VMRESUME (VMRESUME_OPCODE in vmx.h)

  • resumes a virtual machine ; in order it to resume the VM,
    • the launch state of the VMCS should be “clear.

8) VMWRITE (VMWRITE_OPCODE in vmx.h)

  • write specified field in VMCS. wrapper _vmwrite (field, value).

9) VMXOFF (VMXOFF_OPCODE in vmx.h)

  • terminates VMX operation.
    • wrapper : _vmxoff (void) (file vmx.h.)

10) VMXON (VMXON_OPCODE in vmx.h)

  • starts VMX operation.wrapper : _vmxon (u64 addr) (file vmx.h.)

QEMU and VT-D The io in Vt-x is performed by using QEMU. The QEMU code which Xen uses is under tools/ioemu. It is based on version 0.6.1 of QEMU. This version was patched accrording to Xen needs. Also AMD SVM uses QEUMU emulation.

The default network card which QEMU uses in Vt-x is AMD PCnet-PCI II Ethernet Controller. (file tools/ioemu/hw/pcnet.c). The reason to prefer this nic emulation to the other alternative, ne2000, is that pcnet uses DMA whereas ne2000 does not.

There is of course a performance cost for using QEMU, so there are chances that usage of QEMU will be replaced in the future with different soulutions which have lower performance costs.

Intel had annouced in March 2006 its VT-d Technology (Intel Virtualization Technology for Directred I/O). This technology enables to assign devices to virtual machines. It also enables DMA remapping, which can be configured for each device. There is a cache called IOTLB which improves performance.

Vmxloader

There are some restrictions on VMX operation. Guest OSes in VMX cannot operate in Real Mode. If bit PE (Protection Enabled) of CR0 is 0 or bit PG (“Enable Paging”) of CR0 is 0, then trying to start the VMX operation (VMXON instruction) fails.If after entering VMX operation you try to clear these bits, you get an exception (General Protection Exception). When using a linux loader, it starts in real mode. As a result, a vmxloader was written for vmx images. (file tools/firmware/vmxassist/vmxloader.c.)

(In order to build vmxloader you must have dev86 package installed; dev86 is a real mode 80×86 assembler and linker).

After installing Xen, vmxloader is under /usr/lib/xen/boot. In order to use it, you should specify kernel = “/usr/lib/xen/boot/vmxloader” in the config file (which is an input to your “xm create” command.)

The vmxloader loads ROMBIOS at 0xF0000, then VGABIOS at 0xC0000, and then VMXAssist at D000:0000.

What is VMXAssist? The VMXAssist is an emulator for real mode which uses the Virtual-8086 mode of IA32. After setting Virtual-8086 mode, it executes in a 16-bit environment.

There are certain instructions which are not recognized in virtual-8086 mode. For example, LIDT (Load Interrupt Register Table), or LGDT (Load Global DescriptorTable).

These instructions cause #GP(0) when trying to run them in protected mode.

So the VMXAssist assist checks the opcode of the instructions which are being executed, and handles them so that they will not cause General Protection Exception (as would have happened without its intervention).

VT-i (virtual technology) processors – support in Xen

Note : the files mentioned in this sections are from the unstable xen version).

In Vt-i extension for IA64 processors,intel added a bit to the PSR (process status register). This bit is bit 46 of the PSR and is called PSR.vm. When this bit is set, some instructions will cause a fault.

A new instruction called vmsw (Virtual Machine Switch) was added. This instruction sets the PSR.vm to 1 or 0. This instruction can be used to cause transition to or from a VM without causing an interruption.

Also a descriptor named VPD was added; this descriptor represents the resources of a virtual processor. It’s size is 64 K. (It must be 32 aligned).

A VPD stands for “Virtual Processor Descriptor”. A structure named vpd_t represents the VPD descriptor (file include/public/arch-ia64.h).

Two vectors were added to the ivt: One is the External Interrupt vector (0x3400) and the other is the Virtualization vector (0x6100).

The virtualization vector handler is called when an instruction which need virtualization was called. This handler cannot be raised by IA-32 instructions.

Also nine PAL services were added. PAL stands for Processor Abstraction Layer.

The services that were added are: PAL_VPS_RESUME_NORMAL,PAL_VPS_RESUME_HANDLER,PAL_VPS_SYNC_READ, PAL_VPS_SYNC_WRITE,PAL_VPS_SET_PENDING_INTERRUPT,PAL_VPS_THASH, PAL_VPS_TTAG, PAL_VPS_RESTORE and PAL_VPS_SAVE, (file include/asm-ia64/vmx_pal_vsa.h)

AMD SVM

AMD will hopefully release PACIFICA processors with virtualization support in Q2 2006. (probably on June 2006). The IOMMU virtualization support is to be out in 2007.

Them xen-unstable tree now includes both intel VT and SVM support, using a common API which is called HVM.

The inclusion of HVM in the unstable tree is since changeset 8708 from 31/1/06, which is a “Big merge the HVM full-virtualisation abstractions.”

You can download the code by: hg clone http://xenbits.xensource.com/xen-unstable.hg

The code for AMD SVM is mostly under xen/arch/x86/hvm/svm.

The code is developed by AMD team: Tom Woller, Mats Petersson, Travis Betak, Nagib Gulam, Leo Duran, Rosilmildo Dasilva and Wei Huang.

SVM stands for “Secure Virtual Machine”.

One major difference between Vt-x and AMD SVM is that the AMD SVM virtualization extensions include tagged TLB (whereas Intel virtualization extensions for IA-32 does not). The benefit of a tagged TLB is significantly reducing the number of TLB flushes ; this is achieved by using an ASID (Address Space Identifer) in the TLB. Using tagged TLB is common in RISC processors.

In AMD SVM, the most important struct (which is parallel to the VT-x vmcs_struct) is the vmcb_struct. (file xen/include/asm-x86/hvm/svm/vmcb.h). VMCB stands for Virtual Machine Control Block.

AMD added the following eight instructions to the SVM processor:

VMLOAD loads the processor state from the VMCB. VMMCALL enables the guest to communicate with the VMM. VMRUN starts the operation of a guest OS. VMSAVE store the processor state from the VMCB. CLGI clears the global interrupt flag (GIF) SLGI sets the global interrupt flag (GIF) INVPLGA invalidates the TLB mapping of a specified virtual page

  • and a specfied ASID.

SKINIT reinitilizes the CPU.

To issue these instructions SVM must be enabled. Enabling SVM is done by setting bit 12 of the EFER MSR register.

In VT-x, the vmx_vmexit_handler() method handles VM Entries. In AMD SVM, the svm_vmexit_handler() method is the one which handles VM exits. (file xen/arch/x86/hvm/svm/svm.c) When VM exit occurs, the processor saves the reason for this exit in the exit_code member of the VCMB. The svm_vmexit_handler() handles the VM EXIT according to the exit_reason of the VMCB.

Xen On Solaris

On 13 Feb 2006, Sun had released the Xen sources for Solaris x86. See : http://opensolaris.org/os/community/xen/opening-day.

This version currently supports 32 bit only ;it enables openSolaris to be a guest OS where dom0 is a modifed Linux kernel running Xen. Also this version is currently only for x86 (porting to SPARC processor is much more difficult). The members of the Solaris Xen project are Tim Marsland, John Levon, Mark Johnson, Stu Maybee, Joe Bonasera, Ryan Scott, Dave Edmondson and others. Todd Clayton is leading the 64-bit solaris Xen project. In order to boot the Solaris Xen guest many changes were done; can see more details in http://blogs.sun.com/roller/page/JoeBonasera.

You can download the Xen Solaris sources from : http://dlc.sun.com/osol/xen/downloads/osox-src-12-02-2006.tar.gz

Frontend net virtual device sources are in uts/common/io/xennet/xennetf.c. (xennet is the net front virtual driver.).

Frontend block virtual device sources are in uts/i86xen/io/xvbd (xvbd is the block front virtual driver.).

Currently the front block device does not work. There are many things which are similiar between Xen on Solaris and Xen on Linux.

In Xen Solaris Hypercall are also made by calling int 0x82 . (see #define TRAP_INSTR int $0x82 (file /uts/i86xen/ml/hypersubr.s)

Sun also released in february 2006 the specs for the T1 prcoessor, which supports virtualization: see : http://opensparc.sunsource.net/nonav/opensparct1.html

see: http://www.prnewswire.com/cgi-bin/stories.pl?ACCT=104&STORY=/www/story/02-14-2006/0004281587&EDATE=

Also the UltraSPARC T1 Hypervisor API Specification was released: http://opensparc.sunsource.net/specs/Hypervisor_api-26-v6.pdf

T1 virtualization:

The Hyperprivileged edition of the UltraSPARC Architecture 2005 Specification describes the Nonprivileged, Privileged, and Hyperprivileged (hypervisor/virtual machine firmware) spec.

The virtual processor on Sun supports three privilege modes:

2 bits determine the privilege mode of the processor: HPSTATE.hpriv and PSTATE.priv When both are 0 ,we are in nonprivileged mode When both are 1 ,we are in privileged mode When HPSTATE.hpriv is 1 , we are in Hyperprivileged mode (regardless of the value of PSTATE.priv). PSTATE is the Processor State register. HPSTATE is the Hyperprivileged State register HPSTATE.(64 bit). Each virtual processor has only one instance of the PSTATE and HPSTATE registers. The HPSTATE is one of the HPR state registers, and it is also called HPR 0. It can be read by the RDHPR instructions, and it can be written by the WRHPR instruction.

Step by step example of creating guest OS with Virtual Machine Manager in Fedora Core 6

This secrion describes a step by step example of creating guest OS based on FC6 i386 with Virtual Machine Manager in a Fedora Core 6 machine by installing from a WEB URL:

Go to : Application->System Tools->Virtual Machine Manager Choose : Local Xen Host Press New. Enter a name for the guest. You reach now the “Locating installation media” dialog. In “install media URL” you should enter a URL of Fedora Core 6 i386 download. For example, “http://download.fedora.redhat.com/pub/fedora/linux/core/6/i386/os/” then press forward. Choose simple file, and give a path to a non existing file in some existing folder. than: File size: choose 3.5 GB for example ; if you will assign less space, you will not be able to finish the installation, assuming it is a typical , non custom, installation.

Then press forward ; accept the defaults for memory/cpu and then press forward. Than press finish. That’s it! When insalling from web like this it can take 2-4 hours, depending on your bandwidth. You will get to the text mode installation of fedora core 6, and have to enter parameters for the installation.

After the installation is finished and you want to restart the guest OS , you do it by simply: “xm create /etc/xen/NameOfGuest“, where NameOfGuest is of course the name of guest you choose in the installation.

Physical Interrupts

In Xen, only the Hypervisor has an access to the hardware so that to achieve isolation (it is dangerous to share the hardware and let other domains access directly hardware devices simultaneously).

Let’s take a little walkthrough dealing with Xen interrupts:

Handling interrupts in Xen is done by using event channels. Each domain can hold up to 1024 events. An event channel can have 2 flags associated with it : pending and mask. The mask flag can be updated only by guests. The hypervisor cannot update it. These flags are not part of the event channel structure itself. (struct evtchn is defined in xen/include/xen/sched.h ). There are 2 arrays in struct shared_info which contains these flags: evtchn_pending[] and evtchn_mask[] ; each holds 32 elements. (file xen/include/public/xen.h)

(The shared_info is a member in domain struct; it is the domain shared data area).

TBD: add info about event selectors (evtchn_pending_sel in vcpu_info).

Registration (or binding) of irqs in guest domains:

The guest OS calls init_IRQ() when it boots (start_kernel() method calls init_IRQ() ; file init/main.c).

(init_IRQ() is in file sparse/arch/xen/kernel/evtchn.c)

There can be 256 physical irqs; so there is an array called irq_desc with 256 entries. (file sparse/include/linux/irq.h)

All elements in this array are initialized in init_IRQ() so that their status is disabled (IRQ_DISABLED).

Now, when a physical driver starts it usually calls request_irq().

This method eventually calls setup_irq() (both in sparse/kernel/irq/manage.c). which calls startup_pirq().

startup_pirq() send a hypercall to the hypervisor (HYPERVISOR_event_channel_op) in order to bind the physical irq (pirq) . The hypercall is of type EVTCHNOP_bind_pirq. See: startup_pirq() (file sparse/arch/xen/kernel/evtchn.c)

On the Hypervisor side, handling this hypervisor call is done in: evtchn_bind_pirq() method (file /common/event_channel.c) which calls pirq_guest_bind() (file arch/x86/irq.c). The pirq_guest_bind() changes the status of the corresponding irq_desc array element to be enabled (~IRQ_DISABLED). it also calls startup() method.

Now when an interrupts arrives from the controller (the APIC), we arrive at do_IRQ() method as is also in usual linux kernel (also in arch/x86/irq.c). The Hypervisor handles only timer and serial interrupts. Other interrupts are passed to the domains by calling _do_IRQ_guest() (In fact, the IRQ_GUEST flag is set for all interrupts except for timer and serial interrupts). _do_IRQ_guest() send the interrupt by calling send_guest_pirq() to all guests who are registered on this IRQ. The send_guest_pirq() creates an event channel (an instance of evtchn) and sets the pending flag of this event channel. (by calling evtchn_set_pending()) Then, asynchronously, Xen will notify this domain regarding this interrupt (unless it is masked).

TBD: shared interrupts; avoiding problems with shared interrupts when using PCI express.

Backend Drivers:

The Backend Drivers are started from domain 0. We will deal mainly with the network and block drivers. The network backend drivers reside under sparse/drives/xen/netback, and the block backend drivers reside under sparse/drives/xen/blkback.

There are many things in common between the netback and blkback device drivers. There are some differences, though. The blkback device drivers runs a kernel daemon thread (named :xenblkd) whereas the netback device driver does not run any kernel thread.

The netback and blkback register themselves with XenBus by calling xenbus_register_backend().

This method simply calls xenbus_register_driver_common(); both are in sparse/drivers/xen/xenbus/xenbus_probe.c.

(The xenbus_register_driver() method calls the generic kernel method for registering drivers, driver_register()).

Both netback (network backend driver) and blkback (block backend driver) has a module named xenbus.c. There are drivers which are not splitted to backend/frontend drivers;for example, the balloon driver.The balloon driver calls register_xenstore_notifier() in its initialization (balloon_init() method). The register_xenstore_notifier() uses a generic linux callback mechanism for passing status changes (notifier_block in include/linux/notifier.h).

The USB driver also has a backend and frontend drivers; currently it has no support to the xenbus/xenstore API so it does not have a module named xenbus.c but it will probably be adjusted in the future. As of writing of this document, the USB backend/frontend code was removed temporarily from the sparse tree.

Each of the backend drivers registers two watches: one for the backend and one for the frontend. The registration of the watches is done in the probe method:

* In netback it is in netback_probe() method (file netback/xenbus.c).

* In blkback it is in blkback_probe() method (file blkback/xenbus.c).

A registration of a watch is done by calling the xenbus_watch_path2() method. This method is implemented in sparse/drivers/xen/xenbus/xenbus_client.c. Evntually the watch registration is done by calling register_xenbus_watch(), which is implemented in sparse/drivers/xen/xenbus/xenbus_xs.c.

In both cases, netback and blkback, the callback for the backend watch is called backend_changed, and the callback for the forntend watch is called frontend_changed.

xenbus_watch is a simple struct consisting of 3 elements:

A reference to a list of watches (list_head)

A pointer to a node (char*)

A callback function pointer.

The xenbus.c in both netback and blkback defines a struct called backend_info; These structs have much in common: there are minor differences between them. One difference is that in the netback the communications channel is an instance of netif_t whereas in the blkback the communications channel is an instance of blkif_t; In the case of blkback, it includes also the major/minor numbers of the device and the mode (whereas these members don’t exist in the backend_info struct of the netback).

In the case of netback, there is also a XenbusState member. The state machine for XenBus includes seven states: Unknown, Initialising, InitWait (early initialisation was finished, and xenbus is waiting for information from the peer or hotplug scripts), Initialised (waiting for a connection from the peer), Connected, Closing (due to an error or an unplug event) and Closed.

One of the members of this struct (backend_info) is an instance of xenbus_device.(xenbus_device is declared in sparse/include/asm-xen/xenbus.h). The nodename looks like a directory path, for example, dev->nodename in the blkback case may look like:

backend/vbd/94e11324-7eb1-437f-86e6-3db0e145136e/771

and dev->nodename in the netback may look like:

backend/vif/94e11324-7eb1-437f-86e6-3db0e145136e/1

We create an event channel for communication between the two domains by calling a bind_interdomain Hypervisor call. (HYPERVISOR_event_channel_op).

For the networking,this is done in netif_map() in netback/interface.c. For the block device, this is done in blkif_map() in blkback/interface.c.

We use the grant tables to create shared memory between frontend and backend domain. In the case of network drivers,this is done by calling: gnttab_grant_foreign_transfer_ref(). (called in: network_alloc_rx_buffers(), file netfront.c)

gnttab_grant_foreign_transfer_ref() sets a bit named GTF_accept_transfer in the grant_entry.

In the case of block drivers,this is done by calling: gnttab_grant_foreign_access_ref() in blkif_queue_request() (file blkfront.c)

gnttab_grant_foreign_access_ref() sets a bit named GTF_permit_access in the grant entry. grant entry (grant_entry_t) represents a page frame which is shared between domains.

Diagram: Virtual Split Devices

virtualDevices

Migration and Live Migration:

Xend must be configured so that migration (which is also termed relocation) will be enabled. In /etc/xen/xend-config.sxp, there is the definition of the relocation-port, so the following line should be uncommented:

(xend-relocation-port 8002)

The line “(xend-address localhost)” prevents remote connections on the localhost,so this line must be commented.

Notice: if this line is commented in the side to which you want to migrate your domain, you will most likely get the following error after issuing the migrate command:

"Error: can't connect: Connection refused"

This error can be traced to domain_migrate() method in /tools/python/xen/xend/XendDomain.py which start a TCP connection on the relocation port (which is by default 8002)

...
...
def domain_migrate(self, domid, dst, live=False, resource=0):
        """Start domain migration."""
        dominfo = self.domain_lookup(domid)
        port = xroot.get_xend_relocation_port()
        try:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.connect((dst, port))
        except socket.error, err:
            raise XendError("can't connect: %s" % err[1])
...
...

See more details on the relocation protocol (implemented in relocate.py) below.

The line “(xend-relocation-server yes)” should be uncommented so the migration server will be running.

When we issue a migration command like “xm migrate #numOfDomain ipAddress” or , if we want to use live-migration , we add the –live flag thus: “xm migrate #numOfDomain –live ipAddress”, we call server.xend_domain_migrate() after validating that the arguments are valid. (file /tools/python/xen/xm/migrate.py)

We enter the domain_migrate() method of XendDomain, which first performs a domain lookup and then creates a TCP connection for the migration with the target machine; then it sends a “receive” packet (a packets which contains the string “receive”) on port 8002. The other sides gets this message in in the dataReceived() method (see of web/connection.py) and delegates it to dataReceived() of RelocationProtocol (file /tools/python/xen/xend/server/relocate.py). Eventually it calls the save() method of XendChekpoint to start the migration process.

We end up with a call to op_receive() which ultimately sends back a “ready receive” message (also on port 8002). See op_receive() method in relocation.py.

The save() method of XendChekpoint opens a thread which calls the xc_save executable (which is in /usr/lib/xen/bin/xc_save). (file /tools/python/xen/xend/XendCheckpoint.py).

The xc_save executable is build from tools/xcutils/xc_save.c.

The xc_save executable calls xc_linux_save() in tools/libxc/xc_linux_save, which in fact performs most of the migration process. (see xc_linux_save() in /tools/libxc/xc_linux_save.c)

The xc_linux_save() returns 0 on success, and 1 in case of failure.

Live migration is currently supported for 32-bit and 64-bit architectures. TBD: find out if there is support for live-migration for pae architectures. Jacob Gorm Hansen is doing an interesting work with migration in Xen; see http://www.diku.dk/~jacobg/self-migration.

He wrote code which performs Xen “self-migration” , which is a migration which is done without the hypervisor involvement. The migration is done by opening a User Space socket and reading from a special file (/dev/checkpoint).

His code works with linux-2.6 bases xen-unstable.

You can get it by: hg clone http://www.distlab.dk/hg/index.cgi/xen-gfx.hg

His work was inspired by his own ‘NomadBIOS’ for L4 microkernel, which also uses this approach of self-migration.

It might be that this new alternative to managed migration will be adopted in future versions of Xen (time will tell).

Migration of operating systems has much in common with migration of single processes. see Master’s thesis: “Nomadic Operating Systems” Jacob G. Hansen , Asger kahl Henriksen,2002 http://nomadbios.sunsite.dk/NomadicOS.ps

You can find there a discussion about migration of single processes in Mosix (Barak and La’adan).

Creating of a domain – behind the scenes:

We create a domain by:

xm create configFile -c

A simple config file may look like (using ttylinux,as in the user manual):

kernel = "/boot/vmlinuz-2.6.12-xenU"
memory = 64
name = "ttylinux"
nics = 1
ip = "10.0.0.3"
disk = ['file:/home/work/downloads/tmp/ttylinux-xen,hda3,w']
root = "/dev/hda3 ro"

The create() method of XendDomainInfo handles creation of domains. When a domain is created it is assigned a unique id (uuid ) which is cretaed using uuidgen command line utility of e2fsprogs.(file python/xen/xend/uuid.py).

If the memory paramter specified a too high memory which the hypervisor cannot allocate, we end up with the following message: “Error: Error creating domain: The privileged domain did not balloon!”

The devices of a domain are created using the createDevice() method which delegates the call to the createDevice() method of the Device Controller (see XendDomainInfo.py) The createDevice() in turn calls writeDetails() method (also in DevController). This writeDetails() method write the details in XenStore to trigger the creation of the device. The getDeviceDetails() is an abstract method which each subclass of DevController implements. Writing to the store is done by calling Write() method of xstransact. (file tools/pyhton/xen/xend/xenstore/xstransact.py) which returns the id of the newly created device.

By using transaction you can batch together some actions to perform against the xenstored (the common use is some read actions). You can create a domain also without Xend and without Python bindings; Jacob Gorm Hansen had demonstrated it in 2 little programs (businit.c and buscrate.c) (see http://lists.xensource.com/archives/html/xen-devel/2005-10/msg00432.html).

However, true to now these programs should be adjusted beacuse there were some API changes, especially that creation of interdomain event channel is done now with sending ioctl to event_channel (IOCTL_EVTCHN_BIND_INTERDOMAIN).

HyperCalls Mapping to code Xen 3.0.2

Mapping of HyperCalls to code :

Follwoing is the location of all hypercalls: The HyperCalls appear according to their order in xen.h.

The hypercall table itself is in xen/arch/x86/x86_32/entry.S (ENTRY(hypercall_table)).

HYPERVISOR_set_trap_table => do_set_trap_table() (file xen/arch/x86/traps.c)

HYPERVISOR_mmu_update => do_mmu_update() (file xen/arch/x86/mm.c)

HYPERVISOR_set_gdt => do_set_gdt() (file xen/arch/x86/mm.c)

HYPERVISOR_stack_switch => do_stack_switch() (file xen/arch/x86/x86_32/mm.c)

HYPERVISOR_set_callbacks => do_set_callbacks() (file xen/arch/x86/x86_32/traps.c)

HYPERVISOR_fpu_taskswitch => do_fpu_taskswitch(int set) (file xen/arch/x86/traps.c)

HYPERVISOR_sched_op_compat => do_sched_op_compat() (file xen/common/schedule.c)

HYPERVISOR_dom0_op => do_dom0_op() (file xen/common/dom0_ops.c)

HYPERVISOR_set_debugreg => do_set_debugreg() (file xen/arch/x86/traps.c)

HYPERVISOR_get_debugreg => do_get_debugreg() (file xen/arch/x86/traps.c)

HYPERVISOR_update_descriptor => do_update_descriptor() (file xen/arch/x86/mm.c)

HYPERVISOR_memory_op => do_memory_op() (file xen/common/memory.c)

HYPERVISOR_multicall => do_multicall() (file xen/common/multicall.c)

HYPERVISOR_update_va_mapping => do_update_va_mapping() (file /xen/arch/x86/mm.c)

HYPERVISOR_set_timer_op => do_set_timer_op() (file xen/common/schedule.c)

HYPERVISOR_event_channel_op => do_event_channel_op() (file xen/common/event_channel.c)

HYPERVISOR_xen_version => do_xen_version() (file xen/common/kernel.c)

HYPERVISOR_console_io => do_console_io() (file xen/drivers/char/console.c)

HYPERVISOR_physdev_op => do_physdev_op() (file xen/arch/x86/physdev.c)

HYPERVISOR_grant_table_op => do_grant_table_op() (file xen/common/grant_table.c)

HYPERVISOR_vm_assist => do_vm_assist() (file xen/common/kernel.c)

HYPERVISOR_update_va_mapping_otherdomain =>

  • do_update_va_mapping_otherdomain() (file xen/arch/x86/mm.c)

HYPERVISOR_iret => do_iret() (file xen/arch/x86/x86_32/traps.c) /* x86/32 only */

HYPERVISOR_vcpu_op => do_vcpu_op() (file xen/common/domain.c)

HYPERVISOR_set_segment_base => do_set_segment_base (file xen/arch/x86/x86_64/mm.c) /* x86/64 only */

HYPERVISOR_mmuext_op => do_mmuext_op() (file xen/arch/x86/mm.c)

HYPERVISOR_acm_op => do_acm_op() (file xen/common/acm_ops.c)

HYPERVISOR_nmi_op => do_nmi_op() (file xen/common/kernel.c)

HYPERVISOR_sched_op => do_sched_op() (file xen/common/schedule.c)

(Note: sometimes hypercalls are also called hcalls.)

Virtualization and the Linux Kernel

Virtualization in computer context can be thought of as extending the abilities of a computer beyond what a straight, non-virtual implelmentation allows.

In this category we can include also virtual memory, which allows a process to access 4GB virtual address space even though the physical RAM is usually much lower.

We can also think of the Linux IP Virtual Server (which is now a part of the linux kernel) as a kind of virtualization. By using the Linux IP Virtual Server you can configure a router to redirect service requests from a virtual server address to other machines (called real servers).

The IP Virtual Server is part of the kernel starting 2.6.10 (In the 2.4.* kernels it is also available as a patch; the code for 2.6.10 and above kernels is under net/ipv4/ipvs under the kernel tree ;there is still no implementation for ipv6).

The Linux Virtual Server (LVS) was started quite a time ago,in 1998; see http://www.linuxvirtualserver.org.

The idea of virtualization in the sense of enabling of running more than one operating system on a single platform is not new and was researched for many years. However, it seems that the Xen project is the first which produces performance benchmark metrics of such a feature which make this idea more practical and more attractive.

Origins of the Xen project: The Xen project is based on the Xenoservers project; It was originally built as part of the XenoServer project, see http://www.cl.cam.ac.uk/Research/SRG/netos/xeno.

Also the arsenic project has some ideas which were used in Xen. (see http://www.cl.cam.ac.uk/Research/SRG/netos/arsenic)

In the arsenic project, written by Ian Pratt and Keir Fraser, a big part of the Linux kernel TCP/IP stack was ported to user space. The arsenic project is based on Linux 2.3.29. After a short look at the Arsenic porject code you can find some data structures which can remind of parallel data structures in Xen, like the event rings. (for exmaple,the ring_control_block struct in arsenic-1.0/acenic-fw-12.3.10/nic/common/nic_api.h)

Meiosys is a French Company which was purchased by IBM. It deals with another different type of virtualization – Application Virtualization.

see http://www.virtual-strategy.com/article/articleview/680/1/2/ and http://www.infoworld.com/article/05/06/23/HNibmbuy_1.html

In context of the Meiosys project, it is worth to mention that a patch was sent recently to the Linux Kernel Mailing List from Serge E. Hallyn (IBM): see http://lwn.net/Articles/160015

This patch deals with process IDs. (the pid should stay the same after starting anew the application in Meiosys).

Another article on PID virtualization can be found in “PID virtualization: a wealth of choices” http://lwn.net/Articles/171017/?format=printable This article deals with PID virtualization in a context of a diffenet project (openVZ).

There is also the colinux open source project (see:http://colinux.sourceforge.net for more details) and the openvz project, which is based on Virtuozzo™. (Virtuozzo is a commercial solution).

The openvz offers server virtualization, linux-based solution: see http://openvz.org.

There are other projects which probably ispire virtualization; to name of few:

Denali Project uses (uses paravirtualization). http://denali.cs.washington.edu

A paper: Denali: Lightweight Virtual Machines for Distributed and Networked Applications By Andrew Whitaker et al. http://denali.cs.washington.edu/pubs/distpubs/papers/denali_usenix2002.pdf

Nemesis Operating System. http://www.cl.cam.ac.uk/Research/SRG/netos/old-projects/nemesis/index.html

Exokernel: see “Application Performance and Flexibility on Exokernel Systems” by M. Frans Kaashoek et al http://www.cl.cam.ac.uk/~smh22/docs/kaashoek97application.ps.gz

TBD: more details.

Pre-Virtualization

Another interesting virtulaization technique is Pre-Virtualization; in this method, we rewite sensitive instructions using the assembler files (whether generated by compiler, as is the usual case, or assembler files created manually). There is a problem in this method because there are instuctions which are sensitive only when they are performed in a certain context. A solution for this is to generate profiling data of a guest OS and then recompile the OS using the profiling data.

See:

http://l4ka.org/projects/virtualization/afterburn/

and an article: Pre-Virtualization: Slashing the Cost of Virtualization Joshua LeVasseur, Volkmar Uhlig, Matthew Chapman et al. http://l4ka.org/publications/2005/previrtualization-techreport.pdf

This technique is based on a paper by Hideki Eiraku and Yasushi Shinjo, “Running BSD Kernels as User Processes by Partial Emulation and Rewriting of Machine Instructions” http://www.usenix.org/publications/library/proceedings/bsdcon03/tech/eiraku/eiraku_html/index.html

Xen Storage

You can use iscsi for Xen Storage. The xen-tools package of OpenSuse has an example of using iscsi, called xmexample.iscsi. The disk entry for iscsi in the configuration file may look like: disk = [ ‘iscsi:iqn.2006-09.de.suse@0ac47ee2-216e-452a-a341-a12624cd0225,hda,w’ ]

TBD: more on iSCSI in Xen.

Solutions for using CoW in Xen: blktap (part of the xen project).

UnionFS: a stackable filesystem (used also in Knoppix Live-CD and other Live-CDs)

dm-userspace (A tool which uses device-mapper and a daemon called cowd; written by Dan Smith) You may download dm-userspace by:

To build as a module out-of-tree, copy dm-userspace.h to: /lib/modules/uname -r/build/include/linux and then run “make”.

Home of dm-userspace:

Copy-on-write NFS server: see http://www.russross.com/CoWNFS.html

kvm – Kernel-based Virtualization Driver

Kvm is as an open source virtualization project , written by Avi Kivity and Yaniv Kamay from qumranet. See : http://kvm.sourceforge.net.

It is included in the linux kerel tree since 2.6.20-rc1; see: http://lkml.org/lkml/2006/12/13/361 (“kvm driver for all those crazy virtualization people to play with”)

Currently it deals with Intel processors with the virtual extension (VT-X). and AMD SVM processors. You can know if your processor has these extensions by issuing from the command line: “egrep ‘^flags.*(vmx|svm)’ /proc/cpuinfo”

kvm.ko is a kernel module which handles userspace requests through ioctls. It works with a character device (/dev/kvm). The userspace part is built from patched quemu. One of KVM advantages is that it uses linux kernel mechanisms as they are without change (such as the linux scheduler). The Xen project, for example, made many changes to parts of the kernel to enable para-virtualization. Another advantage is the simplicty of the project: there is a kernel part and a userspace part. An advantage of KVM is that future versions of linux kernel will not entail changes in the kvm module code (and of course not in the user space part). The project currently support SMP hosts and will support SMP guests in the future.

Currently there is no support to live migration in KVM (but there is support for ordinary migration, when the migrated OS is stopped and than transfrerred to the target and than resumed).

In intel vt-x , VM-Exits are handled by the kvm module by kvm_handle_exit() method in kvm_main.c according to the reason which caused them (and which is specified and read from the VMCS). in AMD SVM , exit are handled by handle_exit() in svm.c.

There is an interesting usage of memory slots . There is already an rpm for openSUSE by Gerd Hoffman.

Tip: How to build Xen with your own tar ball

If you want to run “make world” without downloading the kernel (beacuse that you want to to your own tar ball which is a bit different from the original one because you made few changes inside the kernel), then do the following:

1) Let’s say that the kernel tar ball is named: my_linux-2.6.18.tar.bz2.

  • First, move my_linux-2.6.18.tar.bz2 to the folder from where you build Xen

2) Run from bash: XEN_LINUX_SOURCE=tarball make world

That’s it; it will use the my_linux-2.6.18.tar.bz2. tar ball that you copied to that folder.

Xen in the Linux Kernel

According to the following thread from xen-devel: http://lists.xensource.com/archives/html/xen-devel/2005-10/msg00436.html, there is a mercurial repository in which xen is a subarch of i386 and x86_64 of the linux kernel, and there is an intention to send releavant stuff to Andrew/Linus for the upcoming 2.6.15 kernel. In 22/3/2006 , a patchest of 35 parts was sent to the Linux Kernel Mailing List (lkml) for Xen i386 paravirtualization support in the linux kernel: see http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/2313.html

VMI : Virtual Machine Interface

On 13/3/06 , a patchset titled “VMI i386 Linux virtualization interface proposal” was sent to the LKML (Linux Kernel Mailing List) by Zachary Amsden and othes. (see http://lkml.org/lkml/2006/3/13/140) It suggests for a common interfcace which abstracts the specifics of each hypervisor and thus can be used by many hypervisors. According to the vmi_spec.txt of this patchset, when an OS is ported to a paravirtulizable x86 processor, it should access the hypervisor through the VMI layer.

The VMI layer interface:

The VMI is divided to the following 10 types of calls:

CORE INTERFACE CALLS (like VMI_Init)

PROCESSOR STATE CALLS (like VMI_DisableInterrupts, VMI_EnableInterrupts,VMI_GetInterruptMask)

DESCRIPTOR RELATED CALLS (like VMI_SetGDT,VMI_SetIDT)

CPU CONTROL CALLS (like VMI_WRMSR,VMI_RDMSR)

PROCESSOR INFORMATION CALLS (like VMI_CPUID)

STACK/PRIVILEGE TRANSITION CALLS (like VMI_UpdateKernelStack,VMI_IRET)

I/O CALLS (like VMI_INB,VMI_INW,VMI_INL)

APIC CALLS (like VMI_APICWrite,VMI_APICRead)

TIMER CALLS (VMI_GetWallclockTime)

MMU CALLS (like VMI_SetLinearMapping)

Links

1) Xen Project HomePage http://www.cl.cam.ac.uk/Research/SRG/netos/xen

2) Xen Mailing Lists Pge: http://lists.xensource.com

(don’t forget to read the XenUsersNetiquette before posting on the lists)

3) Atricle : Analysis of the Intel Pentium’s Ability to Support a Secure Virtual Machine Monitor http://www.usenix.org/publications/library/proceedings/sec2000/full_papers/robin/robin_html/index.html

4) Xen Summits: 2005: http://xen.xensource.com/xensummit/xensummit_2005.html 2006 fall: http://xen.xensource.com/xensummit/xensummit_fall_2006.html 2006 winter: http://xen.xensource.com/xensummit/xensummit_winter_2006.html 2007 spring: http://xen.xensource.com/xensummit/xensummit_spring_2007.html

5) Intel Virtualizatiuon technology: http://www.intel.com/technology/virtualization/index.htm

6) Article by Ryan Maueron Linux Journal in 2 parts:

6-1) Xen Virtualization and Linux Clustering, Part 1 http://www.linuxjournal.com/article/8812

6-2) Xen Virtualization and Linux Clustering, Part 2 http://www.linuxjournal.com/article/8816

Commercial Companies: 7)XenSource:

http://www.xensource.com/

8) Enomalism: http://www.enomalism.com/

9) Thoughtcrime is a brand new company specialising in opensource virtualisation solutions see http://debian.thoughtcrime.co.nz/ubuntu/README.txt

IA64:

10)IA64 Master Thesis HPC Virtualization with Xen on Itanium by Havard K. F. Bjerke http://openlab-mu-internal.web.cern.ch/openlab%2Dmu%2Dinternal/Documents/2_Technical_Documents/Master_thesis/Thesis_HarvardBjerke.pdf

11) vBlades: Optimized Paravirtualization for the Itanium Processor Family Daniel J. Magenheimer and Thomas W. Christian http://www.usenix.org/publications/library/proceedings/vm04/tech/full_papers/magenheimer/magenheimer_html/index.html

12) Now and Xen Feature Story Article Written by Andrew Warfield and Keir Fraser http://www.linux-mag.com/2004-10/xen_01.html

13) Self Migration: http://www.diku.dk/~jacobg/self-migration

14) online magazine: http://www.virtual-strategy.com

15) Denali Project http://denali.cs.washington.edu

general links about virtualization:

16) http://www.virtualization.info

17) http://www.kernelthread.com/publications/virtualization

18) A Survey on Virtualization Technologies Susanta Nanda Tzi-cker Chiueh http://www.ecsl.cs.sunysb.edu/tr/TR179.pdf

AMD:

19) AMD I/O virtualization technology (IOMMU) specification Rev 1.00 – February 03, 2006 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/34434.pdf

20) AMD64 Architecture Programmer’s Manual: Vol 2 System Programming : Revision 3.11 added chapter 15 on virtualization (“Secure Virtual Machine”).(december 2005) http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf

21) AMD64 Architecture Programmer’s Manual: Vol 3: General-Purpose and System Instructions Revision 3.11 added SVM instructions (december 2005) http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24594.pdf

22) AMD virtualization on the Xen Summit: http://www.xensource.com/files/xs0106_amd_virtualization.pdf

23) AMD Press Release: SUNNYVALE, CALIF. — May 23, 2006: availability of AMD processors with virtualization extensions: http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~108605,00.html

Open Solaris:

24) Open Solaris Xen Forum: http://www.opensolaris.org/jive/category.jspa?categoryID=32

25) Update to opensolaris Xen: adding OpenSolaris-based dom0 capabilities, as well as 32-bit and 64-bit MP guest. 14/07/2006 http://www.opensolaris.org/os/community/xen/announcements/?monthYear=July+2006

26) Open Sparc Hypervisor Spec: http://opensparc.sunsource.net/nonav/Hypervisor_api-26-v6.pdf

27) Open Sparc T1 page: http://opensparc.sunsource.net/nonav/opensparct1.html extension to the Solaris Zones: http://opensolaris.org/os/community/brandz

28) OSDL Development Wiki Homepage (Virtualization) http://www.osdl.org/cgi-bin/osdl_development_wiki.pl?Virtualization

29) fedora xen mailing list archive: https://www.redhat.com/archives/fedora-xen

30) Xen Quick Start for FC4 (Fedora Core 4). http://www.fedoraproject.org/wiki/FedoraXenQuickstart?highlight=%28xen%29

31) Xen Quick Start for FC5 (Fedora Core 5): http://www.fedoraproject.org/wiki/FedoraXenQuickstartFC5

32) Xen Quick Start for FC6 (Fedora Core 6): http://fedoraproject.org/wiki/FedoraXenQuickstartFC6

Fedora 7 quick start: http://fedoraproject.org/wiki/Docs/Fedora7VirtQuickStart Fedora 8 quick start: http://fedoraproject.org/wiki/Docs/Fedora8VirtQuickStart

33) The Xen repository is handled by the mercurial version system. mercurial download: http://www.selenic.com/mercurial/release

34) Measuring CPU Overhead for I/O Processing in the Xen Virtual Machine Monitor Cherkasova Ludmila and Gardner, Rob http://www.hpl.hp.com/techreports/2005/HPL-2005-62.html

35) XenMon: QoS Monitoring and Performance Profiling Tool Gupta Diwaker and Gardner Rob; Cherkasova, Ludmila http://www.hpl.hp.com/techreports/2005/HPL-2005-187.html

36) Potemkin VMM: A virtual machine based on xen-unstable ; used in a honeypot By Michael Vrable, Justin Ma, Jay Chen, David Moore, Erik Vandekieft, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage. http://www.cs.ucsd.edu/~savage/papers/Sosp05.pdf

37) Memory Resource Management in VMware ESX Server http://www.usenix.org/events/osdi02/tech/waldspurger.html

books:

38) Virtualization: From the Desktop to the Enterprise By Erick M. Halter , Chris Wolf Published: May 2005 http://www.apress.com/book/bookDisplay.html?bID=449

39) Virtualization with VMware ESX Server Publisher: Syngress; 2005 by Al Muller, Seburn Wilson, Don Happe, Gary J. Humphrey http://www.syngress.com/catalog/?pid=3315

40) VMware ESX Server: Advanced Technical Design Guide by Ron Oglesby, Scott Herold http://www.amazon.com/exec/obidos/ASIN/0971151067/virtualizatio-20/002-5634453-4543251?creative=327641&camp=14573&link_code=as1

41) PPC: Hollis Blanchard, IBM Linux Technology Center Jimi Xenidis, IBM Research http://wiki.xensource.com/xenwiki/Xen/PPC

42) http://about-virtualization.com/mambo

43) PID virtualization: a wealth of choices http://lwn.net/Articles/171017/?format=printable

44) The Xen Hypervisor and its IO Subsystem: http://www.mulix.org/lectures/xen-iommu/xen-io.pdf

45) G. J. Popek and R. P. Goldberg, Formal requirements for virtualizable third generation architectures, Commun. ACM, vol. 17, no. 7, pp. 412 421, 1974. http://www.cis.upenn.edu/~cis700-6/04f/papers/popek-goldberg-requirements.pdf

46) “Running multiple operating systems concurrently on an IA32 PC using virtualization techniques” by Kevin Lawton (1999). http://www.floobydust.com/virtualization/lawton_1999.txt

47) Automating Xen Virtual Machine Deployment (talks about integrating SystemImager with Xen and more) by Kris Buytaert http://howto.x-tend.be/AutomatingVirtualMachineDeployment/

48)Virtualizing servers with Xen Evaldo Gardenali VI International Conference of Unix at UNINET http://umeet.uninet.edu/umeet2005/talks/evaldoa/xen.pdf

49)Survey of System Virtualization Techniques Robert Rose March 8, 2004 http://www.robertwrose.com/vita/rose-virtualization.pdf

NetBSD

50)http://www.netbsd.org/Ports/xen/

51)Interview on Xen with NetBSD develope Manuel Bouyer http://ezine.daemonnews.org/200602/xen.html

52) netbsd xen mailing list: http://www.netbsd.org/MailingLists/#port-xen

53) NetBSD/xen Howto http://www.netbsd.org/Ports/xen/howto.html

54) “C” API for Xen (LGPL) By Daniel Veillard and others

http://libvir.org/

55) Fraser Campbell page: http://goxen.org/

56) Another page from Fraser Campbell : http://linuxvirtualization.com/

57) Virtualization blog

58)Hardware emulation with QEMU (article)

59) http://linuxemu.retrofaction.com

60) Linux Virtualization with Xen http://www.linuxdevcenter.com/pub/a/linux/2006/01/26/xen.html

61) The virtues of Xen by Alex Maier http://www.redhat.com/magazine/014dec05/features/xen/

62) Deploying VirtualMachines as Sandboxes for the Grid Sriya Santhanam, Pradheep Elango, Andrea Arpaci Dusseau, Miron Livny http://www.cs.wisc.edu/~pradheep/SandboxingWorlds05.pdf

63) article: Xen and the new processors:

Infiniband:

64) Infiniband (Smart IO) wiki page http://wiki.xensource.com/xenwiki/XenSmartIO hg repository: http://xenbits.xensource.com/ext/xen-smartio.hg

65) Novell Infiniband and virtualization, Patrick Mullaney , may 1, 2007: http://www.openfabrics.org/archives/spring2007sonoma/Monday%20April%2030/Novell%20xen-ib-presentation-sonoma.ppt

66) A Case for High Performance Computing with Virtual Machines Wei Huangy, Jiuxing Liuz, Bulent Abaliz et al. http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/huangwei-ics06.pdf

67) High Performance VMM-Bypass I/O in Virtual Machines Wei Huangy, Jiuxing Liuz, Bulent Abaliz et al. (usenix 06) http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/usenix06.pdf

68) User Mode Linux , a book By Jeff Dike. Bruce Perens’ Open Source Series. Published: Apr 12, 2006; http://www.phptr.com/title/0131865056

69) Xen 3.0.3 features,schedule : http://lists.xensource.com/archives/html/xen-devel/2006-06/msg00390.html

70) Practical Taint-Based Protection using Demand Emulation Alex Ho, Michael Fetterman, Christopher Clark et al. http://www.cs.kuleuven.ac.be/conference/EuroSys2006/papers/p29-ho.pdf

71) http://stateless.geek.nz/2006/09/11/current-virtualisation-hardware/ Current Virtualisation Hardware by Nicholas Lee

72) RAID: Installing Xen with RAID 1 on a Fedora Core 4 x86 64 SMP machine: http://www.freax.be/wiki/index.php/Installing_Xen_with_RAID_1_on_a_Fedora_Core_4_x86_64_SMP_machine

73) RAID 1 and Xen (dom0) : (On Debian) http://wiki.kartbuilding.net/index.php/RAID_1_and_Xen_(dom0)

74) OpenVZ Virtualization Software Available for Power Processors http://lwn.net/Articles/204275/

75) Kernel-based Virtual Machine patchset (Avi Kivity) adding /dev/kvm which exposes the virtualization capabilities to userspace. http://lwn.net/Articles/205580

76) Intel Technology Journal : Intel Virtulaization Technology: articles by Intel Staff (96 pages) http://download.intel.com/technology/itj/2006/v10i3/v10_iss03.pdf

77) kvm site: (Avi Kivity and others) Includes a howto and a white paper, download , faq sections. http://kvm.sourceforge.net/

78) kvm on debian: http://people.debian.org/~baruch/kvm http://baruch.ev-en.org/blog/Debian/kvm-in-debian

79) Linux Virtualization Wiki

80) ” New virtualisation system beats Xen to Linux kernel” (about kvm) http://www.techworld.com/opsys/news/index.cfm?newsID=7586&pagtype=all

81) article about kvm: http://linux.inet.hr/finally-user-friendly-virtualization-for-linux.html

82) Virtual Linux : An overview of virtualization methods, architectures, and implementations An article by M. Tim Jones (auhor of “GNU/Linux Application Programming”,”AI Application Programming”, and “BSD Sockets Programming from a Multilanguage Perspective”. http://www-128.ibm.com/developerworks/library/l-linuxvirt/index.html

83) Lguest: The Simple x86 Hypervisor by Rusty Russel (formerly lhype) http://lguest.ozlabs.org/

84) “Infrastructure virtualisation with Xen advisory” – a wiki atricle : using iscsi for Xen-clustering ; shared storage http://docs.solstice.nl/index.php/Infrastructure_virtualisation_with_Xen_advisory

85) Xen with DRBD, GNBD and OCFS2 HOWTO http://xenamo.sourceforge.net/

86) Virtualization with Xen(tm): Including Xenenterprise, Xenserver, and Xenexpress (Paperback) by David E Williams Syngress Publishing (April 1, 2007) # ISBN-10: 1597491675 # ISBN-13: 978-1597491679 (Author)http://www.amazon.com/Virtualization-Xen-Including-Xenenterprise-Xenexpress/dp/1597491675/ref=pd_bbs_sr_2/103-3574046-3264617?ie=UTF8&s=books&qid=1173899913&sr=8-2

Paperback: 512 pages

87) Professional XEN Virtualization (Paperback) by William von Hagen (Author)

# Paperback: 500 pages # Publisher: Wrox (August 27, 2007) # Language: English # ISBN-10: 0470138114 # ISBN-13: 978-0470138113

88) Xen and the Art of Consolidation Tom Eastep Linuxfest NW. April 29, 2007. http://www.shorewall.net/Linuxfest-2007.pdf

89) Optimizing Network Virtualization in Xen

By Willy Zwaenepoel, Alan L. Cox, Aravind Menon, usenix 2006 http://www.usenix.org/events/usenix06/tech/menon/menon_html/paper.html

90) Virtual Machine Checkpointing

Brendan Cully,University of British Columbia with Andrew Warfield, University of Cambridge

http://xcr.cenit.latech.edu/hapcw2006/program/papers/cr-xen-hapcw06-final.pdf

Adding new device and triggering the probe() functions

The following is a simple example which shows how to add a new device and trigger the probe() function of a backend driver using xenstore-write tool. This is relevant for Xen 3.1

Currently in Xen, triggering of the probe() method in a backend driver or a frontend driver is done by writing some values to the xenstore into directories where the xenbus poses watches. This writing to the xenstore is currently done in Xen from the python code, and it is wrapped deep inside the xend and/or xm commands. Eventually it is done in the writeDetails method of the DevController class. (And both blkif and netif use it).

For those who want who want to be able to trigger the probe() function without diving too deeply into the python code, this should suffice.

For the purposes of this little tutorial, let’s assume that you have built and installed Xen 3.1 from source and have used it to fire up a guest domain at least once. After you’ve done that, let’s say we want to add new device. We will add a device named “mydevice”. Let’s begin with the backend. For this purpose, we will add a directory named “deviceback” to linux-2.6-sparse/drivers/xen. This directory will store the backend portion of our driver.

First, create linux-2.6-sparse/drivers/xen/deviceback. Next, add the following three files to that directory: deviceback.c, xenbus.c, common.h and Makefile.

Here is a minimal skeleton implementation of these files:

deviceback.c

#include <linux/module.h>
#include "common.h"
static int __init deviceback_init(void)
{
        device_xenbus_init();
}
static void deviceback_cleanup()
{
}
module_init(deviceback_init);
module_exit(deviceback_cleanup);

xenbus.c

#include <xen/xenbus.h>
#include <linux/module.h>
#include <linux/slab.h>
struct backendinfo
{
        struct xenbus_device* dev;
        long int frontend_id;
        struct xenbus_watch backend_watch;
        struct xenbus_watch watch;
        char* frontpath;
};
//////////////////////////////////////////////////////////////////////////////
static int device_probe(struct xenbus_device* dev,
                        const struct xenbus_device_id* id)
{
        struct backendinfo* be;
        char* frontend;
        int err;
        be = kmalloc(sizeof(*be),GFP_KERNEL);
        memset(be,0,sizeof(*be));
        be->dev = dev;
        printk("Probe fired!\n");
        return 0;
}
//////////////////////////////////////////////////////////////////////////////
static int device_uevent(struct xenbus_device* xdev,
                          char** envp, int num_envp,
                          char* buffer, int buffer_size)
{
        return 0;
}
static int device_remove(struct xenbus_device* dev)
{
        return 0;
}
//////////////////////////////////////////////////////////////////////////////
static struct xenbus_device_id device_ids[] =
{
        { "mydevice" },
        { "" }
};
//////////////////////////////////////////////////////////////////////////////
static struct xenbus_driver deviceback =
{
        .name    = "mydevice",
        .owner   = THIS_MODULE,
        .ids     = device_ids,
        .probe   = device_probe,
        .remove  = device_remove,
        .uevent  = device_uevent,
};
//////////////////////////////////////////////////////////////////////////////
void device_xenbus_init()
{
        xenbus_register_backend(&deviceback);
}

common.h

#ifndef COMMON_H
#define COMMON_H
void device_xenbus_init(void);
#endif

Makefile

#Makefile
obj-y += xenbus.o deviceback.o

Next, we should add our new backend device to the Makefile in linux-2.6-sparse/drivers/xen/Makefile. Add the following line to the bottom of that file:

obj-y += deviceback/

This will make sure that it will be included in the build.

Next, we need to add symlinks from linux-2.6-sparse/drivers/xen/deviceback into linux-2.6.18-xen/drivers/xen/deviceback:

  1. Create linux-2.6.18-xen/drivers/xen/deviceback
  2. Change into that directory
  3. Add the symlinks: ‘ln -s ../../../../linux-2.6-xen-sparse/./drivers/xen/deviceback/./* .’

Now we should build the new drivers and reboot with the new Xen image. You can do this by going back to the root directory of the source tree (the place where you typed ‘make world’ when doing your normal build before) and do ‘make install-kernels’. (Note: This will overwrite the previous Xen kernel!) Finally, reboot.

After the machine boots back up, go ahead and start a guest domain and you should notice that device_probe() does not get executed. (Check /var/log/syslog on dom0 to look for the printk() to show up.)

How can we trigger the probe() function of our backend driver? We just need to write the correct key/value pairs into the xenstore.

The call to xenbus_register_backend() in xenbus.c causes xenbus to set a watch on local/domain/0/backend/mydevice in the xenstore. Specifically, anytime anything is written into that location in the store the watch fires and checks for a specific set of key/value pairs that indicate the probe should be fired.

So performing the following 4 calls using xenstore-write will trigger our probe() function. Change the X with the ID of a running guest domain. (Check ‘xm list’ for this. If you’ve only started one guest, this number is probably 1.)

xenstore-write /local/domain/X/device/mydevice/0/state 1
xenstore-write /local/domain/0/backend/mydevice/X/0/frontend-id X
xenstore-write /local/domain/0/backend/mydevice/X/0/frontend /local/domain/X/device/mydevice/0
xenstore-write /local/domain/0/backend/mydevice/X/0/state 1

You should see the probe message appear the Dom0’s /var/log/syslog. What happened here behind the scenes ,without going too deep, is that the xenbus_register_backend() put a watch on the xenback directory of /local/domain/0 in the xenstore. Once frontend, frontend-id, and state are all written to the watched location, the xenbus driver will gather all of that information, as well as the state of the frontend driver (written in that first line) and use it to setup the appropriate data structures. From there, the probe() function is finally fired.

Adding a frontend device

For this purpose,we will add a directory named “devicefront” to linux-2.6-sparse/drivers/xen.

We will create 2 files there: devicefront.c and Makefile.

We will also add directories and symlinks as we did in the deviceback case.

Makefile

obj-y := devicefront.o

devicefront.c

The devicefront.c will be (a minimalist implementation):

// devicefront.c
#include <xen/xenbus.h>
#include <linux/module.h>
#include <linux/list.h>
struct device_info
{
        struct list_head list;
        struct xenbus_device* xbdev;
};
//////////////////////////////////////////////////////////////////////////////
static int devicefront_probe(struct xenbus_device* dev,
                             const struct xenbus_device_id* id)
{
        printk("Frontend Probe Fired!\n");
        return 0;
}
//////////////////////////////////////////////////////////////////////////////
static struct xenbus_device_id devicefront_ids[] =
{
        {"mydevice"},
        {}
};
static struct xenbus_driver devicefront =
{
        .name  = "mydevice",
        .owner = THIS_MODULE,
        .ids   = devicefront_ids,
        .probe = devicefront_probe,
};
static int devicefront_init(void)
{
        printk("%s\n",__FUNCTION__);
        xenbus_register_frontend(&devicefront);
}
module_init(devicefront_init);

We should also remember to add the following to the Makefile under linux-2.6-sparse/drivers/xen:

obj-y   += devicefront/

Getting the frontend driver to fire is a bit more complicated, the following bash script should help you:

#!/bin/bash
if [ $# != 2 ]
then
        echo "Usage: $0 <device name> <frontend-id>"
else
        # Write backend information into the location the frontend will look
        # for it.
        xenstore-write /local/domain/${2}/device/${1}/0/backend-id 0
        xenstore-write /local/domain/${2}/device/${1}/0/backend \
                       /local/domain/0/backend/${1}/${2}/0
        # Write frontend information into the location the backend will look
        # for it.
        xenstore-write /local/domain/0/backend/${1}/${2}/0/frontend-id ${2}
        xenstore-write /local/domain/0/backend/${1}/${2}/0/frontend \
                       /local/domain/${2}/device/${1}/0
        # Set the permissions on the backend so that the frontend can
        # actually read it.
        xenstore-chmod /local/domain/0/backend/${1}/${2}/0 r
        # Write the states.  Note that the backend state must be written
        # last because it requires a valid frontend state to already be
        # written.
        xenstore-write /local/domain/${2}/device/${1}/0/state 1
        xenstore-write /local/domain/0/backend/${1}/${2}/0/state 1
fi

Here’s how to use it:

  • Startup a Xen guest that contains your frontend driver, and be sure dom0 contains the backend driver.
  • Figure out the frontend-id for the guest. This is the ID field when running xm list. Let’s say that number is 3.
  • Run the script as so: ./probetest.sh mydevice 3

That should fire both the frontend driver. (You’ll have to check /var/log/messages in the guest to verify that the probe was fired.)

Discussion

SimonKagstrom: Maybe it would be a good idea to split this document into several pages? It’s starting to be fairly long :) TimPost

TimPost: ACK, even if printed, this is hard to digest as an intro

Reference:

http://wiki.xensource.com/xenwiki/XenIntro

Read Full Post | Make a Comment ( 1 so far )

Strip attachments from an email message

Posted on December 7, 2009. Filed under: Linux, Python | Tags: , , , |

This recipe shows a simple approach to using the Python email package to strip out attachments and file types from an email message that might be considered dangerous. This is particularly relevant in Python 2.4, as the email Parser is now much more robust in handling mal-formed messages (which are typical for virus and worm emails)

Python
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
ReplaceString = """

This message contained an attachment that was stripped out. 

The original type was: %(content_type)s
The filename was: %(filename)s, 
(and it had additional parameters of:
%(params)s)

"""

import re
BAD_CONTENT_RE = re.compile('application/(msword|msexcel)', re.I)
BAD_FILEEXT_RE = re.compile(r'(\.exe|\.zip|\.pif|\.scr|\.ps)$')

def sanitise(msg):
    # Strip out all payloads of a particular type
    ct = msg.get_content_type()
    # We also want to check for bad filename extensions
    fn = msg.get_filename()
    # get_filename() returns None if there's no filename
    if BAD_CONTENT_RE.search(ct) or (fn and BAD_FILEEXT_RE.search(fn)):
        # Ok. This part of the message is bad, and we're going to stomp
        # on it. First, though, we pull out the information we're about to
        # destroy so we can tell the user about it.

        # This returns the parameters to the content-type. The first entry
        # is the content-type itself, which we already have.
        params = msg.get_params()[1:] 
        # The parameters are a list of (key, value) pairs - join the
        # key-value with '=', and the parameter list with ', '
        params = ', '.join([ '='.join(p) for p in params ])
        # Format up the replacement text, telling the user we ate their
        # email attachment.
        replace = ReplaceString % dict(content_type=ct, 
                                       filename=fn, 
                                       params=params)
        # Install the text body as the new payload.
        msg.set_payload(replace)
        # Now we manually strip away any paramaters to the content-type 
        # header. Again, we skip the first parameter, as it's the 
        # content-type itself, and we'll stomp that next.
        for k, v in msg.get_params()[1:]:
            msg.del_param(k)
        # And set the content-type appropriately.
        msg.set_type('text/plain')
        # Since we've just stomped the content-type, we also kill these
        # headers - they make no sense otherwise.
        del msg['Content-Transfer-Encoding']
        del msg['Content-Disposition']
    else:
        # Now we check for any sub-parts to the message
        if msg.is_multipart():
            # Call the sanitise routine on any subparts
            payload = [ sanitise(x) for x in msg.get_payload() ]
            # We replace the payload with our list of sanitised parts
            msg.set_payload(payload)
    # Return the sanitised message
    return msg

# And a simple driver to show how to use this
import email, sys
m = email.message_from_file(open(sys.argv[1]))
print sanitise(m)

Discussion

I’ve seen this come up a few times on comp.lang.python, so here’s a cookbook entry for it. This recipe shows how to read in an email message, strip out any dangerous or suspicious attachments, and replace them with a harmless text message informing the user of this.

This is particularly important if the end-users are using something like Outlook, which is targetted by unpleasant virus and worm messages on a daily basis.

The email parser in Python 2.4 has been completely rewritten to be robust first, correct second – prior to this, the parser was written for correctness first. This was a problem, because many virus/worm messages would send email messages that were broken and non-conformant – this made the old email parser choke and die. The new parser is designed to never actually break when reading a message – instead it tries it’s best to fix up whatever it can in the message. (If you have a message that causes the parser to crash, please let us know – that’s a bug, and we’ll fix it).

The code itself is heavily commented, and should be easy enough to follow. A mail message consists of one or more parts – these can each contain nested parts. We call the ‘sanitise()’ function on the top level Message object, and it calls itself recursively on the sub-objects. The sanitise() function checks the Content-Type of the part, and if there’s a filename, also checks that, against a known-to-be-bad list.

If the message part is bad, we replace the message itself with a short text description describing the now-removed part, and clean out the headers that are relevant. We set this message part’s Content-Type to ‘text/plain’, and remove other headers that related to the now-removed message.

Finally, we check if the message is a multipart message. This means it has sub-parts, so we recursively call the sanitise function on each of those. We then replace the payload with our list of sanitised sub-parts.

Extensions, further work, etc:

Instead of destroying the attachment, it would be a small amount of work to instead store the attachment away in a directory, and supply the user with a link to the file.

You could add other filters into the sanitise() code – for instance, checking other headers for known signs of worm or virus messages. Or removing all large powerpoint files sent to you by your marketing department, if that’s what you want to do.

Reference:

http://code.activestate.com/recipes/302086/

Read Full Post | Make a Comment ( None so far )

Linux Network Setup/Monitor commands

Posted on November 30, 2009. Filed under: Linux | Tags: , , , , |

1. check devices status:

lspci  (dmidecode, lsusb, lshw , lspci)

reference: http://ubuntuforums.org/showthread.php?t=320995

2.  Check the network status:

mii-tool  (you should be root)

ip link

References:

http://www.linuxforums.org/forum/redhat-fedora-linux-help/129167-how-detect-hardware-add-new-one.html

3. NIC card change sequence:

nameif

(hwinfo)

/lib/udev/rename_netiface <old> <new>

References:

http://www.centos.org/modules/newbb/viewtopic.php?topic_id=10434

http://www.cyberciti.biz/faq/rhel-fedora-centos-howto-swap-ethernet-aliases/

http://www.linuxforums.org/forum/suse-linux-help/92280-how-change-ethx-ethy.html

4. iftop

a good console bandwidth visualization tool that shows you active
connections, where they are going to/from and how much of your precious bandwidth
they are using.

iftop is a command-line system monitor tool that produces a frequently-updated list of network connections. By default, the connections are ordered by bandwidth usage, with only the “top” bandwidth consumers shown.

Reference:

http://en.wikipedia.org/wiki/Iftop

5. iptraf

IPTraf is a console-based network statistics utility for Linux. It gathers a variety of figures such as TCP connection packet and byte counts, interface statistics and activity indicators, TCP/UDP traffic breakdowns, and LAN station packet and byte counts.

http://iptraf.seul.org/

Read Full Post | Make a Comment ( 1 so far )

« Previous Entries

Liked it here?
Why not try sites on the blogroll...