Posted on January 8, 2010. Filed under: C/C++, Linux, Services

Xen Intro- version 1.0:


  1. Introduction
  2. Xen and IA32 Protection Modes
  3. The Xend daemon:
  4. The Xen Store:
  5. VT-x (Virtualization Technology) processors – support in Xen
  6. Vmxloader
  7. VT-i (Virtualization Technology) processors – support in Xen
  8. AMD SVM
  9. Xen On Solaris
  10. Step by step example of creating guest OS with Virtual Machine Manager in Fedora Core 6
  11. Physical Interrupts
  12. Backend Drivers:
  13. Migration and Live Migration:
  14. Creating of a domain – behind the scenes:
  15. HyperCalls Mapping to code Xen 3.0.2
  16. Virtualization and the Linux Kernel
  17. Pre-Virtualization
  18. Xen Storage
  19. kvm – Kernel-based Virtualization Driver
  20. Tip: How to build Xen with your own tar ball
  21. Xen in the Linux Kernel
  22. VMI : Virtual Machine Interface
  23. Links
  24. Adding new device and triggering the probe() functions
    1. deviceback.c
    2. xenbus.c
    3. common.h
    4. Makefile
  25. Adding a frontend device
    1. Makefile
    2. devicefront.c
  26. Discussion


All of the following text refers to the x86 platform of Xen-unstable, unless explicitly stated otherwise. We deal only with Xen on Linux 2.6; we do not deal at all with Xen on Linux 2.4 (as far as we know, domain 0 is intended to be based only on 2.6 in the future). Moreover, the Linux 2.4 tree has currently been removed from the Xen tree (changeset 7263:f1abe953e401 from 8.10.05), although it may return once some problems are fixed.

This document deals only with Xen 3.0 unless explicitly stated otherwise.

This is not intended to be full and detailed documentation of the Xen project, but we hope it will be a starting point for anyone who is interested in Xen and wants to learn more.

The Xen Project team is permitted to take part or all of this document and integrate it with the official Xen documentation or put it as a standalone document in the Xen Web Site if they wish, without any further notice.

This is not a complete, detailed document nor a full walkthrough, and many important issues are omitted. Any feedback is welcome; please send it to Rami Rosen.

Xen and IA32 Protection Modes

In the classical protection model of IA-32 there are four privilege levels. The most privileged is ring 0, where the kernel runs (this level is also called supervisor mode). The least privileged is ring 3, where user applications run (this level is also called user mode). Issuing certain instructions, called “privileged instructions”, from any ring other than ring 0 causes a General Protection Fault.

Rings 1 and 2 went unused through the years (except in the case of OS/2). When running Xen, the hypervisor runs in ring 0 and the guest OS in ring 1. The applications run unmodified in ring 3.

BTW, there are of course architectures with different privilege models; for example, on PPC both domain 0 and the unprivileged domains run in supervisor mode.

Diagram: Xen and IA32 Protection Modes


The Xend daemon:

The Xend daemon handles requests issued from domain 0. Requests can be, for example, creating a new domain (“xm create”), listing the domains (“xm list”) or terminating a domain (“xm destroy”). Running “xm help” shows all the possibilities.

You start the Xend daemon by running “xend start” after booting into domain 0. “xend start” creates two daemons, xenstored and xenconsoled (see tools/misc/xend). It also creates an instance of the Python SrvDaemon class and calls its start() method (see tools/python/xen/xend/server/).

The SrvDaemon start() method is in fact the xend main program.

In the past, the start() method of SrvDaemon eventually opened an HTTP socket on port 8000 and listened on it for HTTP requests. It no longer opens an HTTP socket on port 8000.

Note: there is an alternative to the management layer of Xen, called libvirt. It is a free API (LGPL).

The Xen Store:

The Xen Store daemon provides a simple tree-like database in which we can read and write values. The Xen Store code is mainly under tools/xenstore.

It replaces the XCS, which was a daemon handling control messages.

The physical xenstore resides in one file: /var/lib/xenstored/tdb. (Previously it was scattered over several files; the change to a single file named “tdb” was probably made to improve performance.)

Both user space (“tools” in Xen terminology) and kernel code can write to the XenStore. The kernel code writes to the XenStore by using XenBus.

The Python scripts (under tools/python) use lowlevel/xs.c to read from and write to the XenStore.

The Xen Store daemon is started in xenstored_core.c. It creates a device file (“/dev/xen/evtchn”) if that device file does not already exist, and opens it (see domain_init() in file tools/xenstore/xenstored_domain.c).

It opens two UNIX domain sockets. One of these sockets is read-only, and it resides at /var/run/xenstored/socket_ro. The second is /var/run/xenstored/socket.

Connections on these sockets are represented by the connection struct.

A connection can be in one of three states:

        BLOCKED (blocked by a transaction)
        BUSY    (doing some action)
        OK      (completed its transaction)

struct connection is declared in xenstore/xenstored_core.h. When a socket is read-only, the “can_write” member of its connection is false.

Then we start an endless loop in which we can get input/output from three sources: the two sockets and the event channel mentioned above.

Events which are received on the event channel are handled by the handle_event() method (file xenstored_domain.c).
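The shape of such a loop can be sketched in Python. This is only an illustration, not the xenstored code: ready_sources() is a hypothetical helper, and three local socket pairs stand in for the two UNIX sockets and the event channel device.

```python
import select
import socket


def ready_sources(sources, timeout=0.1):
    """Return the names of the sources that currently have input waiting."""
    readable, _, _ = select.select(list(sources), [], [], timeout)
    return sorted(sources[s] for s in readable)


# Three socket pairs stand in for the two UNIX sockets and the event channel.
rw, rw_peer = socket.socketpair()
ro, ro_peer = socket.socketpair()
evtchn, evtchn_peer = socket.socketpair()
sources = {rw: "socket", ro: "socket_ro", evtchn: "evtchn"}

# Simulate a client request on the read-write socket and an incoming event.
rw_peer.sendall(b"request")
evtchn_peer.sendall(b"event")

# One pass of the loop: find out which sources need servicing.
ready = ready_sources(sources)
print(ready)

for s in (rw, rw_peer, ro, ro_peer, evtchn, evtchn_peer):
    s.close()
```

A real daemon would repeat the select() call forever and dispatch each ready source to its own handler; here a single pass is enough to show that only the sources with pending input are reported.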

There are six executables under tools/xenstore, five of which are in fact made from the same module, which is xenstore_client.c, each time built with a different DEFINE passed. (See the Makefile). The sixth tool is built from xsls.c

These executables are : xenstore-exists, xenstore-list, xenstore-read, xenstore-rm, xenstore-write and xsls.

You can use these executables for accessing xenstore. For example, to view the list of fields of domain 0, which has the path “local/domain/0”, you run:

xenstore-list /local/domain/0

and a typical result can be the following list:


The xsls command is very useful: it recursively shows the contents of a specified XenStore path. Essentially it does a xenstore-list and then a xenstore-read for each returned field, displaying the fields and their values, and then repeats this recursively on each sub-path. For example, to view information about all VIF backends hosted in domain 0 you may use the following command:

xsls /local/domain/0/backend/vif

and a typical result may be:

14 = ""
 0 = ""
  bridge = "xenbr0"
  domain = "vm1"
  handle = "0"
  script = "/etc/xen/scripts/vif-bridge"
  state = "4"
  frontend = "/local/domain/14/device/vif/0"
  mac = "aa:00:00:22:fe:9f"
  frontend-id = "14"
  hotplug-status = "connected"
15 = ""
 0 = ""
  mac = "aa:00:00:6e:d8:46"
  state = "4"
  handle = "0"
  script = "/etc/xen/scripts/vif-bridge"
  frontend-id = "15"
  domain = "vm2"
  frontend = "/local/domain/15/device/vif/0"
  hotplug-status = "connected"

(xenstored must be running for these six executables to work; if xenstored is not running, these executables will usually hang. The Xend daemon itself can be stopped.)
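To make the list/read/recursive-listing behaviour concrete, here is a toy model in Python. It is a sketch only and not how xenstored is implemented: the ToyStore class and its method names are hypothetical, and the tree lives in nested dictionaries in memory rather than in the tdb file.

```python
class ToyStore:
    """In-memory stand-in for the XenStore tree (hypothetical, for illustration)."""

    def __init__(self):
        # Every node holds a value and a dictionary of child nodes.
        self.root = {"value": "", "children": {}}

    def _walk(self, path, create=False):
        node = self.root
        for part in [p for p in path.split("/") if p]:
            children = node["children"]
            if part not in children:
                if not create:
                    raise KeyError(path)
                children[part] = {"value": "", "children": {}}
            node = children[part]
        return node

    def write(self, path, value):
        self._walk(path, create=True)["value"] = value

    def read(self, path):
        return self._walk(path)["value"]

    def list(self, path):
        return sorted(self._walk(path)["children"])

    def ls(self, path, depth=0):
        """List each field, read its value, then recurse into the sub-path."""
        lines = []
        for name in self.list(path):
            child = path.rstrip("/") + "/" + name
            lines.append('%s%s = "%s"' % (" " * depth, name, self.read(child)))
            lines.extend(self.ls(child, depth + 1))
        return lines


store = ToyStore()
store.write("/local/domain/0/backend/vif/14/0/bridge", "xenbr0")
store.write("/local/domain/0/backend/vif/14/0/state", "4")
listing = store.ls("/local/domain/0/backend/vif")
print("\n".join(listing))
```

The ls() method mirrors the description of xsls: a list of the fields, a read of each one, and a recursive descent with one more space of indentation per level.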

An instance of struct node is the elementary unit of the XenStore (struct node is defined in xenstored_core.h). The actual writing to the XenStore is done by the write_node() method of xenstored_core.c.

The xen_start_info structure has a member named store_evtchn (declared in public/xen.h as u16). This is the event channel for store communication.

VT-x (Virtualization Technology) processors – support in Xen

Note: the following text refers only to IA-32 unless explicitly stated otherwise.

Intel announced the Pentium® 4 672 and 662 processors, which have virtualization support, in November 2005.

How does Xen support the Intel Virtualization Technology ?

The VT extensions support in the Xen 3 code is mostly in xen/arch/x86/hvm/vmx*.c, xen/include/asm-x86/vmx*.h and xen/arch/x86/x86_32/entry.S.

The arch_vcpu structure (file xen/include/asm-x86/domain.h) contains a member called arch_vmx, which is an instance of arch_vmx_struct. This member is also important for understanding the VT-x mechanism.

But the most important structure for VT-x is the VMCS (vmcs_struct in the code), which represents the VMCS region.

The definition (file include/asm-x86/vmx_vmcs.h) is short:

struct vmcs_struct {
    u32 vmcs_revision_id;
    unsigned char data [0]; /* vmcs size is read from MSR */
};

The VMCS region contains six logical regions; the most relevant to our discussion are the guest-state area and the host-state area. We will also deal with the other four regions: VM-execution control fields, VM-exit control fields, VM-entry control fields and VM-exit information fields.

Intel added 10 new opcodes in VT-x to support Intel Virtualization Technology. They are detailed at the end of this section.

When using this technology, Xen runs in “VMX root operation mode” while the guest domains (which are unmodified OSs) run in “VMX non-root operation mode”. Since the guest domains run in non-root operation mode, they are more restricted, meaning that certain actions will cause a “VM exit” to the VMM.

Xen enters VMX operation in the start_vmx() method (file xen/arch/x86/vmx.c).

This method is called from the init_intel() method (file xen/arch/x86/cpu/intel.c); CONFIG_VMX should be defined.

First we check the X86_FEATURE_VMXE bit in the ecx register to see whether cpuid reports support for VMX in the processor. In IA-32, Intel added to the CR4 control register a bit specifying whether we want to enable VMX, so we must set this bit to enable VMX on the processor (by calling set_in_cr4(X86_CR4_VMXE)). This is bit 13 of CR4 (VMXE).

Then we call _vmxon to start VMX operation. If we try to start VMX operation with _vmxon when the VMXE bit in CR4 is not set, we get an exception (#UD, for undefined opcode).
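The order of these checks can be modelled with plain integers in Python. This is a sketch under assumptions, not Xen code: the register values are made-up examples, the UndefinedOpcode exception and the vmxon() function are hypothetical stand-ins, and the position of the VMX feature flag (bit 5 of ECX for CPUID leaf 1) is taken from the Intel manuals rather than from the text above.

```python
CPUID_1_ECX_VMX = 1 << 5   # assumption: CPUID leaf 1, ECX bit 5 reports VMX
X86_CR4_VMXE = 1 << 13     # CR4 bit 13 (VMXE), as described above


class UndefinedOpcode(Exception):
    """Stands in for the #UD exception (hypothetical name)."""


def cpu_has_vmx(cpuid_1_ecx):
    return bool(cpuid_1_ecx & CPUID_1_ECX_VMX)


def set_in_cr4(cr4, mask):
    # Model only: returns the new CR4 value instead of writing the register.
    return cr4 | mask


def vmxon(cr4):
    # VMXON is only legal once CR4.VMXE has been set.
    if not cr4 & X86_CR4_VMXE:
        raise UndefinedOpcode("VMXON executed with CR4.VMXE clear")
    return "VMX root operation"


cr4 = 0x000006F0                  # example CR4 value with VMXE clear
if cpu_has_vmx(0x0000E43D):       # example ECX value with bit 5 set
    cr4 = set_in_cr4(cr4, X86_CR4_VMXE)
print(hex(cr4), vmxon(cr4))
```

Calling vmxon() on the original CR4 value would raise the stand-in for #UD, which is the failure mode the text describes.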

In IA-64, things are a little different due to the different architecture: Intel added a new bit to the IA-64 Processor Status Register (PSR). This is bit 46, and it is called VM. It should be set to 1 in guest OSs; when its value is 1, certain instructions cause a virtualization fault.

VM exit:

Some instructions cause a VM exit unconditionally, and some cause a VM exit only under certain VM-execution control fields (see the discussion of the VMCS region above).

The following instructions cause a VM exit unconditionally: CPUID, INVD, MOV from CR3, RDMSR, WRMSR, and all the new VT-x instructions (which are listed below).

Other instructions, such as HLT, INVLPG (the Invalidate TLB Entry instruction), MWAIT and others, cause a VM exit if the corresponding VM-execution control is set.

Apart from the VM-execution control fields, there are two bitmaps which are used to determine whether to perform a VM exit. The first is the exception bitmap (see EXCEPTION_BITMAP in the vmcs_field enum, file xen/include/asm-x86/vmx_vmcs.h). This bitmap is a 32-bit field; when a bit is set in it, a VM exit occurs if the corresponding exception occurs. By default, the entries which are set are EXCEPTION_BITMAP_PG (for page fault) and EXCEPTION_BITMAP_GP (for general protection); see MONITOR_DEFAULT_EXCEPTION_BITMAP in vmx.h.
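As a rough illustration of how such a bitmap is consulted, here is a small Python sketch. The constant names are borrowed from the text, but their values here are our assumption, derived from the standard x86 vector numbers (13 for #GP, 14 for #PF). The sketch also ignores the extra page-fault error-code filtering that the processor applies.

```python
GP_VECTOR = 13   # #GP, general protection fault
PF_VECTOR = 14   # #PF, page fault

# Assumed values: one bit per exception vector.
EXCEPTION_BITMAP_GP = 1 << GP_VECTOR
EXCEPTION_BITMAP_PG = 1 << PF_VECTOR
MONITOR_DEFAULT_EXCEPTION_BITMAP = EXCEPTION_BITMAP_PG | EXCEPTION_BITMAP_GP


def exception_causes_vm_exit(bitmap, vector):
    """True if the bit for this exception vector is set in the bitmap."""
    if not 0 <= vector < 32:
        raise ValueError("the exception bitmap covers vectors 0-31")
    return bool((bitmap >> vector) & 1)


print(hex(MONITOR_DEFAULT_EXCEPTION_BITMAP))
print(exception_causes_vm_exit(MONITOR_DEFAULT_EXCEPTION_BITMAP, PF_VECTOR))
print(exception_causes_vm_exit(MONITOR_DEFAULT_EXCEPTION_BITMAP, 0))
```

With only the page-fault and general-protection bits set, a divide error (vector 0) in the guest is delivered to the guest without a VM exit, while a page fault is intercepted.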

The second bitmap is the I/O bitmap (in fact there are two I/O bitmaps, A and B, each 4 KB in size), which controls I/O instructions on ports. I/O bitmap A covers the ports in the range 0000-7FFF and I/O bitmap B covers the ports in the range 8000-FFFF (one bit for each I/O port). See IO_BITMAP_A and IO_BITMAP_B in the vmcs_field enum (VMCS encodings).
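The mapping from a port number to a bit in one of the two bitmaps follows directly from these sizes (4 KB is 32768 bits, one per port). A minimal Python sketch, with hypothetical helper names and assuming the bitmaps are actually in use:

```python
BITMAP_BYTES = 4096                    # each I/O bitmap is 4 KB
PORTS_PER_BITMAP = BITMAP_BYTES * 8    # one bit per port: 32768 ports


def io_bitmap_position(port):
    """Return (bitmap name, byte offset, bit number) for an I/O port."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("I/O port out of range: %r" % port)
    name = "A" if port < PORTS_PER_BITMAP else "B"
    byte, bit = divmod(port % PORTS_PER_BITMAP, 8)
    return name, byte, bit


def port_causes_vm_exit(bitmap_a, bitmap_b, port):
    """True if the bit for this port is set in the relevant bitmap."""
    name, byte, bit = io_bitmap_position(port)
    bitmap = bitmap_a if name == "A" else bitmap_b
    return bool((bitmap[byte] >> bit) & 1)


bitmap_a = bytearray(BITMAP_BYTES)
bitmap_b = bytearray(BITMAP_BYTES)
bitmap_a[0x3F8 // 8] |= 1 << (0x3F8 % 8)   # intercept accesses to port 0x3F8

print(io_bitmap_position(0x3F8))
print(port_causes_vm_exit(bitmap_a, bitmap_b, 0x3F8))
```

Port 0x3F8 is used here only as a familiar example; any port below 0x8000 lands in bitmap A and any port from 0x8000 upwards lands in bitmap B.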

When there is a “VM exit” we reach vmx_vmexit_handler(struct cpu_user_regs regs) in vmx.c. We handle the VM exit according to the exit reason, which we read from the VMCS region. We read the VMCS by calling vmread(); the return value of vmread() is 0 on success.

We sometimes also need to read some additional data (VM_EXIT_INTR_INFO) from the VMCS.

We get additional data by reading the “VM-exit interruption information”, which is a 32-bit field, and the “Exit qualification” (a 64-bit value).

For example, if the exception was an NMI, we check whether it is valid by checking bit 31 (the valid bit) of the VM-exit interruption information field. If it is not valid, we call _hvm_bug() to print some statistics and crash the domain.
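Decoding that 32-bit field can be sketched as follows. Only the valid bit (bit 31) comes from the text above; the positions of the other fields (vector in bits 7:0, interruption type in bits 10:8, error-code-valid in bit 11) are our reading of the Intel manuals, and decode_intr_info() is a hypothetical helper, not a Xen function.

```python
INTR_INFO_VALID = 1 << 31           # bit 31: the valid bit
INTR_INFO_VECTOR_MASK = 0xFF        # assumption: bits 7:0 hold the vector
INTR_INFO_TYPE_SHIFT = 8            # assumption: bits 10:8 hold the type
INTR_INFO_ERROR_CODE_VALID = 1 << 11


def decode_intr_info(info):
    """Split a VM-exit interruption information value into its fields."""
    return {
        "valid": bool(info & INTR_INFO_VALID),
        "vector": info & INTR_INFO_VECTOR_MASK,
        "type": (info >> INTR_INFO_TYPE_SHIFT) & 0x7,
        "error_code_valid": bool(info & INTR_INFO_ERROR_CODE_VALID),
    }


# Example value: valid, hardware exception (type 3), vector 14, with error code.
fields = decode_intr_info(0x80000B0E)
print(fields)
if not fields["valid"]:
    print("invalid interruption information; the handler would treat this as a bug")
```

A handler would test the valid field first, as described above, and only then look at the vector and type.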

An example of reading the “Exit qualification” field is the case where the VM exit was caused by issuing the INVLPG instruction.

When we work with VT-x, the guest OSs work in shadow mode, meaning they use shadow page tables; this is because the guest kernel in a VMX guest does not know that it is being virtualized. There is no software-visible bit which indicates that the processor is in VMX non-root operation. We set shadow mode by calling shadow_mode_enable() in the vmx_final_setup_guest() method (file vmx.c).

There are 43 basic exit reasons; you can see some of them in vmx.h (fields starting with EXIT_REASON_, like EXIT_REASON_EXCEPTION_NMI, which is exit reason number 0, and so on).

In VT-x, Xen will probably use an emulated devices layer which will send virtual interrupts to the VMM. We can prevent the OS from receiving interrupts by clearing the IF flag of EFLAGS.

The ten new opcodes which Intel added in VT-x are detailed below:

1) VMCALL

  • simply calls the VM monitor, causing a VM exit.

2) VMCLEAR

  • copies VMCS data to memory in case it is not already written there.
  • wrapper: _vmpclear (u64 addr) in vmx.h.

3) VMLAUNCH

  • launches a virtual machine; changes the launch state of the VMCS to launched (if it is clear).

4) VMPTRLD

  • loads a pointer to the VMCS.
  • wrapper: _vmptrld (u64 addr) (file vmx.h).

5) VMPTRST

  • stores a pointer to the VMCS.
  • wrapper: _vmptrst (u64 addr) (file vmx.h).

6) VMREAD

  • reads a specified field from the VMCS.
  • wrapper: _vmread(x, ptr) (file vmx.h).

7) VMRESUME

  • resumes a virtual machine; in order to resume the VM, the launch state of the VMCS should be “launched”.

8) VMWRITE

  • writes a specified field in the VMCS.
  • wrapper: _vmwrite (field, value).

9) VMXOFF

  • terminates VMX operation.
  • wrapper: _vmxoff (void) (file vmx.h).

10) VMXON (VMXON_OPCODE in vmx.h)

  • starts VMX operation.
  • wrapper: _vmxon (u64 addr) (file vmx.h).

QEMU and VT-d

I/O in VT-x is performed by using QEMU. The QEMU code which Xen uses is under tools/ioemu. It is based on version 0.6.1 of QEMU, which was patched according to the needs of Xen. AMD SVM also uses QEMU emulation.

The default network card which QEMU emulates in VT-x is the AMD PCnet-PCI II Ethernet Controller (file tools/ioemu/hw/pcnet.c). The reason to prefer this NIC emulation over the alternative, ne2000, is that pcnet uses DMA whereas ne2000 does not.

There is of course a performance cost to using QEMU, so QEMU may be replaced in the future by different solutions with lower performance costs.

Intel announced its VT-d technology (Intel Virtualization Technology for Directed I/O) in March 2006. This technology makes it possible to assign devices to virtual machines. It also enables DMA remapping, which can be configured for each device. A cache called the IOTLB improves performance.


Vmxloader

There are some restrictions on VMX operation. Guest OSes in VMX cannot operate in real mode. If the PE (Protection Enabled) bit of CR0 is 0 or the PG (“Enable Paging”) bit of CR0 is 0, then trying to start VMX operation (the VMXON instruction) fails. If you try to clear these bits after entering VMX operation, you get an exception (General Protection Exception). A Linux loader, however, starts in real mode. As a result, a vmxloader was written for VMX images (file tools/firmware/vmxassist/vmxloader.c).

(In order to build vmxloader you must have the dev86 package installed; dev86 is a real mode 80x86 assembler and linker.)

After installing Xen, vmxloader is under /usr/lib/xen/boot. In order to use it, you should specify kernel = “/usr/lib/xen/boot/vmxloader” in the config file (which is the input to your “xm create” command).

The vmxloader loads ROMBIOS at 0xF0000, then VGABIOS at 0xC0000, and then VMXAssist at D000:0000.
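The last address is written in real-mode segment:offset notation, while the first two are linear addresses. The conversion is the usual real-mode rule (segment times 16 plus offset); the small Python sketch below, with a hypothetical helper name, puts the three load addresses on the same footing.

```python
def real_mode_address(segment, offset):
    """Convert a real-mode segment:offset pair to a linear address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset are 16-bit values")
    return (segment << 4) + offset


load_map = {
    "VGABIOS": 0xC0000,
    "VMXAssist": real_mode_address(0xD000, 0x0000),
    "ROMBIOS": 0xF0000,
}
for name, address in sorted(load_map.items(), key=lambda item: item[1]):
    print("%-10s 0x%05X" % (name, address))
```

So VMXAssist sits at linear address 0xD0000, between the VGA BIOS and the ROM BIOS.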

What is VMXAssist? VMXAssist is an emulator for real mode which uses the virtual-8086 mode of IA-32. After setting virtual-8086 mode, it executes in a 16-bit environment.

Certain instructions are not recognized in virtual-8086 mode, for example LIDT (Load Interrupt Descriptor Table Register) or LGDT (Load Global Descriptor Table Register).

These instructions cause #GP(0) when we try to run them in virtual-8086 mode.

So VMXAssist checks the opcodes of the instructions being executed, and handles them so that they do not cause a General Protection Exception (as would have happened without its intervention).

VT-i (Virtualization Technology) processors – support in Xen

(Note: the files mentioned in this section are from the unstable Xen version.)

In the VT-i extension for IA-64 processors, Intel added a bit to the PSR (Processor Status Register). This is bit 46 of the PSR, and it is called PSR.vm. When this bit is set, some instructions cause a fault.

A new instruction called vmsw (Virtual Machine Switch) was added. This instruction sets PSR.vm to 1 or 0. It can be used to cause a transition to or from a VM without causing an interruption.

A descriptor named VPD was also added; this descriptor represents the resources of a virtual processor. Its size is 64 K. (It must be 32 aligned).

VPD stands for “Virtual Processor Descriptor”. A structure named vpd_t represents the VPD (file include/public/arch-ia64.h).

Two vectors were added to the IVT: one is the External Interrupt vector (0x3400) and the other is the Virtualization vector (0x6100).

The virtualization vector handler is called when an instruction which needs virtualization is executed. This handler cannot be raised by IA-32 instructions.

Nine PAL services were also added. PAL stands for Processor Abstraction Layer.



AMD SVM

AMD will hopefully release PACIFICA processors with virtualization support in Q2 2006 (probably in June 2006). The IOMMU virtualization support is due out in 2007.

The xen-unstable tree now includes both Intel VT and AMD SVM support, using a common API which is called HVM.

HVM has been included in the unstable tree since changeset 8708 from 31/1/06, which is a “Big merge the HVM full-virtualisation abstractions.”

You can download the code by: hg clone

The code for AMD SVM is mostly under xen/arch/x86/hvm/svm.

The code is developed by AMD team: Tom Woller, Mats Petersson, Travis Betak, Nagib Gulam, Leo Duran, Rosilmildo Dasilva and Wei Huang.

SVM stands for “Secure Virtual Machine”.

One major difference between VT-x and AMD SVM is that the AMD SVM virtualization extensions include a tagged TLB (whereas the Intel virtualization extensions for IA-32 do not). The benefit of a tagged TLB is a significant reduction in the number of TLB flushes; this is achieved by using an ASID (Address Space Identifier) in the TLB. Tagged TLBs are common in RISC processors.

In AMD SVM, the most important struct (which is parallel to the VT-x vmcs_struct) is the vmcb_struct. (file xen/include/asm-x86/hvm/svm/vmcb.h). VMCB stands for Virtual Machine Control Block.

AMD added the following eight instructions to the SVM processor:

  • VMLOAD loads the processor state from the VMCB.
  • VMMCALL enables the guest to communicate with the VMM.
  • VMRUN starts the operation of a guest OS.
  • VMSAVE stores the processor state to the VMCB.
  • CLGI clears the global interrupt flag (GIF).
  • STGI sets the global interrupt flag (GIF).
  • INVLPGA invalidates the TLB mapping of a specified virtual page and a specified ASID.
  • SKINIT reinitializes the CPU.

To issue these instructions, SVM must be enabled. Enabling SVM is done by setting bit 12 of the EFER MSR register.

In VT-x, the vmx_vmexit_handler() method handles VM exits. In AMD SVM, the svm_vmexit_handler() method is the one which handles VM exits (file xen/arch/x86/hvm/svm/svm.c). When a VM exit occurs, the processor saves the reason for the exit in the exit_code member of the VMCB. The svm_vmexit_handler() handles the VM exit according to the exit_code of the VMCB.

Xen On Solaris

On 13 Feb 2006, Sun released the Xen sources for Solaris x86.

This version currently supports 32-bit only; it enables OpenSolaris to be a guest OS where dom0 is a modified Linux kernel running Xen. This version is also currently only for x86 (porting to the SPARC processor is much more difficult). The members of the Solaris Xen project are Tim Marsland, John Levon, Mark Johnson, Stu Maybee, Joe Bonasera, Ryan Scott, Dave Edmondson and others. Todd Clayton is leading the 64-bit Solaris Xen project. Many changes were made in order to boot the Solaris Xen guest.

You can download the Xen Solaris sources from :

The frontend net virtual device sources are in uts/common/io/xennet/xennetf.c (xennet is the net front virtual driver).

The frontend block virtual device sources are in uts/i86xen/io/xvbd (xvbd is the block front virtual driver).

Currently the front block device does not work. There are many similarities between Xen on Solaris and Xen on Linux.

In Xen on Solaris, hypercalls are also made by calling int 0x82 (see #define TRAP_INSTR int $0x82 in file /uts/i86xen/ml/hypersubr.s).

In February 2006 Sun also released the specs for the T1 processor, which supports virtualization.


Also the UltraSPARC T1 Hypervisor API Specification was released:

T1 virtualization:

The Hyperprivileged edition of the UltraSPARC Architecture 2005 Specification describes the nonprivileged, privileged, and hyperprivileged (hypervisor/virtual machine firmware) spec.

The virtual processor on Sun supports three privilege modes. Two bits determine the privilege mode of the processor, HPSTATE.hpriv and PSTATE.priv:

  • When both are 0, we are in nonprivileged mode.
  • When HPSTATE.hpriv is 0 and PSTATE.priv is 1, we are in privileged mode.
  • When HPSTATE.hpriv is 1, we are in hyperprivileged mode (regardless of the value of PSTATE.priv).

PSTATE is the Processor State register. HPSTATE is the Hyperprivileged State register (64 bit). Each virtual processor has only one instance of the PSTATE and HPSTATE registers. HPSTATE is one of the HPR state registers, and it is also called HPR 0. It can be read by the RDHPR instruction and written by the WRHPR instruction.
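The way the two bits combine into a privilege mode can be written as a tiny decision function. This is only a Python illustration of the rule; privilege_mode() is a hypothetical name, and hpriv is checked first because it overrides PSTATE.priv.

```python
def privilege_mode(hpriv, priv):
    """Map HPSTATE.hpriv and PSTATE.priv to the name of the privilege mode."""
    if hpriv:
        # HPSTATE.hpriv wins regardless of the value of PSTATE.priv.
        return "hyperprivileged"
    return "privileged" if priv else "nonprivileged"


for hpriv in (0, 1):
    for priv in (0, 1):
        print("hpriv=%d priv=%d -> %s" % (hpriv, priv, privilege_mode(hpriv, priv)))
```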

Step by step example of creating guest OS with Virtual Machine Manager in Fedora Core 6

This section describes a step by step example of creating a guest OS based on FC6 i386 with Virtual Machine Manager on a Fedora Core 6 machine, installing from a web URL:

  1. Go to Application->System Tools->Virtual Machine Manager.
  2. Choose Local Xen Host.
  3. Press New and enter a name for the guest.
  4. You now reach the “Locating installation media” dialog. In “install media URL”, enter the URL of a Fedora Core 6 i386 download, then press forward.
  5. Choose simple file, and give a path to a non-existing file in some existing folder.
  6. For the file size, choose 3.5 GB for example; if you assign less space, you will not be able to finish the installation, assuming it is a typical, non-custom installation.
  7. Press forward, accept the defaults for memory/cpu, and press forward again.
  8. Press finish.

That’s it! When installing from the web like this it can take 2-4 hours, depending on your bandwidth. You will get to the text mode installation of Fedora Core 6, and will have to enter parameters for the installation.

After the installation is finished, when you want to restart the guest OS, simply run “xm create /etc/xen/NameOfGuest”, where NameOfGuest is of course the name of the guest you chose during the installation.

Physical Interrupts

In Xen, only the hypervisor has access to the hardware, in order to achieve isolation (it is dangerous to share the hardware and let other domains access hardware devices directly and simultaneously).

Let’s take a little walkthrough dealing with Xen interrupts:

Handling interrupts in Xen is done by using event channels. Each domain can hold up to 1024 events. An event channel can have two flags associated with it: pending and mask. The mask flag can be updated only by guests; the hypervisor cannot update it. These flags are not part of the event channel structure itself (struct evtchn is defined in xen/include/xen/sched.h). There are two arrays in struct shared_info which contain these flags, evtchn_pending[] and evtchn_mask[]; each holds 32 elements (file xen/include/public/xen.h).

(The shared_info is a member of the domain struct; it is the shared data area of the domain.)
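The relationship between the 1024 events and the two 32-element arrays can be sketched as a pair of bitmaps. The Python model below is an illustration only: it assumes that each array element is a 32-bit word (32 x 32 = 1024), and SharedInfoSketch with its methods is a hypothetical stand-in for the real shared_info handling.

```python
WORD_BITS = 32      # assumption: each array element is a 32-bit word
NR_WORDS = 32       # evtchn_pending[] and evtchn_mask[] each hold 32 elements
NR_EVENT_CHANNELS = WORD_BITS * NR_WORDS   # 1024 events per domain


class SharedInfoSketch:
    """Toy model of the pending and mask bitmaps in shared_info."""

    def __init__(self):
        self.evtchn_pending = [0] * NR_WORDS
        self.evtchn_mask = [0] * NR_WORDS

    @staticmethod
    def _locate(port):
        if not 0 <= port < NR_EVENT_CHANNELS:
            raise ValueError("no such event channel: %d" % port)
        return divmod(port, WORD_BITS)   # (word index, bit index)

    def set_pending(self, port):
        word, bit = self._locate(port)
        self.evtchn_pending[word] |= 1 << bit

    def set_mask(self, port):
        word, bit = self._locate(port)
        self.evtchn_mask[word] |= 1 << bit

    def deliverable(self):
        """Ports that are pending and not masked."""
        ports = []
        for word in range(NR_WORDS):
            active = self.evtchn_pending[word] & ~self.evtchn_mask[word]
            for bit in range(WORD_BITS):
                if (active >> bit) & 1:
                    ports.append(word * WORD_BITS + bit)
        return ports


info = SharedInfoSketch()
info.set_pending(3)
info.set_pending(70)
info.set_mask(70)      # the guest masks port 70
print(info.deliverable())
```

Port 70 stays pending in the bitmap but is not reported, which is the effect of the mask flag described above.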

TBD: add info about event selectors (evtchn_pending_sel in vcpu_info).

Registration (or binding) of irqs in guest domains:

The guest OS calls init_IRQ() when it boots (the start_kernel() method calls init_IRQ(); file init/main.c).

(init_IRQ() is in file sparse/arch/xen/kernel/evtchn.c)

There can be 256 physical irqs, so there is an array called irq_desc with 256 entries (file sparse/include/linux/irq.h).

All elements in this array are initialized in init_IRQ() so that their status is disabled (IRQ_DISABLED).

Now, when a physical driver starts, it usually calls request_irq().

This method eventually calls setup_irq() (both are in sparse/kernel/irq/manage.c), which calls startup_pirq().

startup_pirq() sends a hypercall to the hypervisor (HYPERVISOR_event_channel_op) in order to bind the physical irq (pirq). The hypercall is of type EVTCHNOP_bind_pirq. See startup_pirq() (file sparse/arch/xen/kernel/evtchn.c).

On the hypervisor side, this hypercall is handled in the evtchn_bind_pirq() method (file /common/event_channel.c), which calls pirq_guest_bind() (file arch/x86/irq.c). pirq_guest_bind() changes the status of the corresponding irq_desc array element to enabled (~IRQ_DISABLED). It also calls the startup() method.

Now when an interrupt arrives from the controller (the APIC), we arrive at the do_IRQ() method (also in arch/x86/irq.c), as in the usual Linux kernel. The hypervisor handles only timer and serial interrupts. Other interrupts are passed to the domains by calling _do_IRQ_guest() (in fact, the IRQ_GUEST flag is set for all interrupts except timer and serial interrupts). _do_IRQ_guest() sends the interrupt, by calling send_guest_pirq(), to all guests which are registered on this IRQ. send_guest_pirq() creates an event channel (an instance of evtchn) and sets the pending flag of this event channel (by calling evtchn_set_pending()). Then, asynchronously, Xen notifies the domain of this interrupt (unless it is masked).

TBD: shared interrupts; avoiding problems with shared interrupts when using PCI express.

Backend Drivers:

The backend drivers are started from domain 0. We will deal mainly with the network and block drivers. The network backend drivers reside under sparse/drivers/xen/netback, and the block backend drivers reside under sparse/drivers/xen/blkback.

There are many things in common between the netback and blkback device drivers. There are some differences, though. The blkback device driver runs a kernel daemon thread (named xenblkd), whereas the netback device driver does not run any kernel thread.

The netback and blkback drivers register themselves with XenBus by calling xenbus_register_backend().

This method simply calls xenbus_register_driver_common(); both are in sparse/drivers/xen/xenbus/xenbus_probe.c.

(The xenbus_register_driver_common() method calls the generic kernel method for registering drivers, driver_register().)

Both netback (the network backend driver) and blkback (the block backend driver) have a module named xenbus.c. Some drivers are not split into backend/frontend drivers, for example the balloon driver. The balloon driver calls register_xenstore_notifier() in its initialization (the balloon_init() method). register_xenstore_notifier() uses a generic Linux callback mechanism for passing status changes (notifier_block in include/linux/notifier.h).

The USB driver also has backend and frontend drivers; currently it has no support for the xenbus/xenstore API, so it does not have a module named xenbus.c, but it will probably be adjusted in the future. As of the writing of this document, the USB backend/frontend code has been temporarily removed from the sparse tree.

Each of the backend drivers registers two watches: one for the backend and one for the frontend. The watches are registered in the probe method:

* In netback it is in netback_probe() method (file netback/xenbus.c).

* In blkback it is in blkback_probe() method (file blkback/xenbus.c).

A watch is registered by calling the xenbus_watch_path2() method, which is implemented in sparse/drivers/xen/xenbus/xenbus_client.c. Eventually the watch registration is done by calling register_xenbus_watch(), which is implemented in sparse/drivers/xen/xenbus/xenbus_xs.c.

In both cases, netback and blkback, the callback for the backend watch is called backend_changed, and the callback for the frontend watch is called frontend_changed.

xenbus_watch is a simple struct consisting of 3 elements:

* A reference to a list of watches (list_head)

* A pointer to a node (char*)

* A callback function pointer.

The xenbus.c in both netback and blkback defines a struct called backend_info. These structs have much in common, with minor differences between them. One difference is that in netback the communications channel is an instance of netif_t, whereas in blkback it is an instance of blkif_t. In the case of blkback, the struct also includes the major/minor numbers of the device and the mode (these members don’t exist in the backend_info struct of netback).

In the case of netback, there is also a XenbusState member. The state machine for XenBus includes seven states: Unknown, Initialising, InitWait (early initialisation has finished, and xenbus is waiting for information from the peer or hotplug scripts), Initialised (waiting for a connection from the peer), Connected, Closing (due to an error or an unplug event) and Closed.
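The seven states can be written down as an enumeration. The sketch below is in Python for illustration; the numeric values are our assumption that the states are numbered from 0 in the order listed, which would make the state = "4" entries in the xsls output shown earlier mean Connected. describe_state() is a hypothetical helper.

```python
from enum import IntEnum


class XenbusState(IntEnum):
    # Assumed numbering: consecutive values in the order the states are listed.
    Unknown = 0
    Initialising = 1
    InitWait = 2
    Initialised = 3
    Connected = 4
    Closing = 5
    Closed = 6


def describe_state(raw):
    """Translate a XenStore "state" string such as "4" into a state name."""
    return XenbusState(int(raw)).name


print(describe_state("4"))
```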

One of the members of this struct (backend_info) is an instance of xenbus_device (xenbus_device is declared in sparse/include/asm-xen/xenbus.h). The nodename of a xenbus_device looks like a directory path; for example, dev->nodename in the blkback case may look like:


and dev->nodename in the netback may look like:


We create an event channel for communication between the two domains by calling the bind_interdomain hypervisor call (HYPERVISOR_event_channel_op).

For networking, this is done in netif_map() in netback/interface.c. For the block device, this is done in blkif_map() in blkback/interface.c.

We use the grant tables to create shared memory between the frontend and backend domains. In the case of network drivers, this is done by calling gnttab_grant_foreign_transfer_ref() (called in network_alloc_rx_buffers(), file netfront.c).

gnttab_grant_foreign_transfer_ref() sets a bit named GTF_accept_transfer in the grant entry.

In the case of block drivers, this is done by calling gnttab_grant_foreign_access_ref() in blkif_queue_request() (file blkfront.c).

gnttab_grant_foreign_access_ref() sets a bit named GTF_permit_access in the grant entry. A grant entry (grant_entry_t) represents a page frame which is shared between domains.

Diagram: Virtual Split Devices


Migration and Live Migration:

Xend must be configured so that migration (which is also termed relocation) will be enabled. In /etc/xen/xend-config.sxp, there is the definition of the relocation-port, so the following line should be uncommented:

(xend-relocation-port 8002)

The line “(xend-address localhost)” prevents remote connections on the localhost,so this line must be commented.

Notice: if this line is commented in the side to which you want to migrate your domain, you will most likely get the following error after issuing the migrate command:

"Error: can't connect: Connection refused"

This error can be traced to the domain_migrate() method in /tools/python/xen/xend/ which starts a TCP connection on the relocation port (8002 by default):

def domain_migrate(self, domid, dst, live=False, resource=0):
        """Start domain migration."""
        dominfo = self.domain_lookup(domid)
        port = xroot.get_xend_relocation_port()
        try:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.connect((dst, port))
        except socket.error, err:
            raise XendError("can't connect: %s" % err[1])

See more details on the relocation protocol (implemented in below.

The line “(xend-relocation-server yes)” should be uncommented so the migration server will be running.
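
The three settings discussed above can be checked mechanically. The following Python sketch is not part of Xen; it assumes that each setting sits on its own line in the form "(name value)" and that commented lines start with "#". The function names are hypothetical.

```python
import re

# One "(name value)" setting per line; this is an assumption about the file.
_SETTING = re.compile(r"^\(\s*([\w-]+)\s+(.*?)\s*\)\s*$")


def read_settings(text: str) -> dict:
    """Collect the active (uncommented) settings of a xend-config.sxp text."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SETTING.match(line)
        if match:
            settings[match.group(1)] = match.group(2).strip("'\"")
    return settings


def relocation_problems(text: str) -> list:
    """List the reasons why migration to this host would probably fail."""
    settings = read_settings(text)
    problems = []
    if settings.get("xend-relocation-server") != "yes":
        problems.append("xend-relocation-server is not enabled")
    if "xend-relocation-port" not in settings:
        problems.append("xend-relocation-port is not set")
    if settings.get("xend-address") == "localhost":
        problems.append("xend-address localhost blocks remote connections")
    return problems
```

One would call relocation_problems(open("/etc/xen/xend-config.sxp").read()) on the target host; an empty list means that none of the three problems described above was found.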

When we issue a migration command like “xm migrate #numOfDomain ipAddress” or, if we want to use live migration, add the --live flag thus: “xm migrate #numOfDomain --live ipAddress”, the arguments are validated and then server.xend_domain_migrate() is called. (file /tools/python/xen/xm/

We enter the domain_migrate() method of XendDomain, which first performs a domain lookup and then creates a TCP connection to the target machine for the migration; then it sends a “receive” packet (a packet which contains the string “receive”) on port 8002. The other side gets this message in the dataReceived() method (see of web/ and delegates it to dataReceived() of RelocationProtocol (file /tools/python/xen/xend/server/ Eventually it calls the save() method of XendCheckpoint to start the migration process.

We end up with a call to op_receive() which ultimately sends back a “ready receive” message (also on port 8002). See op_receive() method in

The save() method of XendCheckpoint starts a thread which calls the xc_save executable (which is in /usr/lib/xen/bin/xc_save). (file /tools/python/xen/xend/

The xc_save executable is built from tools/xcutils/xc_save.c.

The xc_save executable calls xc_linux_save(), which in fact performs most of the migration process. (see xc_linux_save() in /tools/libxc/xc_linux_save.c)

The xc_linux_save() returns 0 on success, and 1 in case of failure.

Live migration is currently supported for 32-bit and 64-bit architectures. TBD: find out if there is support for live migration for PAE architectures. Jacob Gorm Hansen is doing interesting work with migration in Xen; see

He wrote code which performs Xen “self-migration”, which is a migration done without the hypervisor's involvement. The migration is done by opening a user space socket and reading from a special file (/dev/checkpoint).

His code works with the linux-2.6 based xen-unstable.

You can get it by: hg clone

His work was inspired by his own ‘NomadBIOS’ for L4 microkernel, which also uses this approach of self-migration.

It might be that this new alternative to managed migration will be adopted in future versions of Xen (time will tell).

Migration of operating systems has much in common with migration of single processes. See the Master’s thesis “Nomadic Operating Systems” by Jacob G. Hansen and Asger Kahl Henriksen, 2002.

There you can find a discussion of the migration of single processes in Mosix (Barak and La’adan).

Creating of a domain – behind the scenes:

We create a domain by:

xm create configFile -c

A simple config file may look like this (using ttylinux, as in the user manual):

kernel = "/boot/vmlinuz-2.6.12-xenU"
memory = 64
name = "ttylinux"
nics = 1
ip = ""
disk = ['file:/home/work/downloads/tmp/ttylinux-xen,hda3,w']
root = "/dev/hda3 ro"
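
The disk entry above packs three things into one string: where the data lives, which device the guest sees, and the access mode. The following Python sketch is not how xend parses the entry; it only illustrates the format under the assumption that the fields are separated by commas and that the first field has the form type:path. DiskSpec and parse_disk() are hypothetical names.

```python
from typing import NamedTuple


class DiskSpec(NamedTuple):
    backend_type: str   # e.g. "file" for a disk image file
    path: str           # where the backing data lives in domain 0
    device: str         # the device name the guest will see, e.g. "hda3"
    writable: bool      # True for mode "w"


def parse_disk(entry: str) -> DiskSpec:
    """Split one element of the 'disk' list of a domain config file."""
    uname, device, mode = entry.split(",")
    backend_type, path = uname.split(":", 1)
    return DiskSpec(backend_type, path, device, mode == "w")
```

For the config file above, parse_disk() yields a file-backed image that the guest sees as hda3, opened writable.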

The create() method of XendDomainInfo handles creation of domains. When a domain is created it is assigned a unique id (uuid) which is created using the uuidgen command line utility of e2fsprogs. (file python/xen/xend/

If the memory parameter specifies more memory than the hypervisor can allocate, we end up with the following message: “Error: Error creating domain: The privileged domain did not balloon!”

The devices of a domain are created using the createDevice() method, which delegates the call to the createDevice() method of the Device Controller (see The createDevice() in turn calls the writeDetails() method (also in DevController). This writeDetails() method writes the details to the XenStore to trigger the creation of the device. getDeviceDetails() is an abstract method which each subclass of DevController implements. Writing to the store is done by calling the Write() method of xstransact. (file tools/python/xen/xend/xenstore/ which returns the id of the newly created device.
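
The division of labour described above (a generic controller which writes to the store, and subclasses which only supply the details) can be sketched as follows. This is a much simplified, hypothetical Python model and not xend's code: a plain dict plays the role of the XenStore, the method names are adapted, the device ids are just a counter, and the store paths follow the layout used in the xenstore-write examples later in this document.

```python
from abc import ABC, abstractmethod


class DevController(ABC):
    """Hypothetical, simplified stand-in for xend's DevController."""

    device_class = "device"

    def __init__(self, store: dict, domid: int):
        self.store = store          # plays the role of the XenStore
        self.domid = domid
        self._next_devid = 0

    @abstractmethod
    def get_device_details(self, config: dict):
        """Return (frontend_details, backend_details) for this device type."""

    def create_device(self, config: dict) -> int:
        """Allocate a device id, write its details, and return the id."""
        devid = self._next_devid
        self._next_devid += 1
        front, back = self.get_device_details(config)
        self.write_details(devid, front, back)
        return devid

    def write_details(self, devid: int, front: dict, back: dict) -> None:
        """Write the key/value pairs which announce the device in the store."""
        front_path = f"/local/domain/{self.domid}/device/{self.device_class}/{devid}"
        back_path = f"/local/domain/0/backend/{self.device_class}/{self.domid}/{devid}"
        for key, value in front.items():
            self.store[f"{front_path}/{key}"] = value
        for key, value in back.items():
            self.store[f"{back_path}/{key}"] = value


class MyDeviceController(DevController):
    """Example subclass: only the device-specific details live here."""

    device_class = "mydevice"

    def get_device_details(self, config: dict):
        return {"state": "1"}, {"frontend-id": str(self.domid), "state": "1"}
```

The point of the design is that adding a new device type only requires a new subclass; the code which talks to the store stays in one place.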

By using a transaction you can batch together several actions to perform against xenstored (the common use is several read actions). You can also create a domain without Xend and without the Python bindings; Jacob Gorm Hansen demonstrated this in 2 little programs (businit.c and buscrate.c) (see

However, as of now these programs need to be adjusted because there were some API changes; in particular, creation of an interdomain event channel is now done by sending an ioctl to event_channel (IOCTL_EVTCHN_BIND_INTERDOMAIN).

HyperCalls Mapping to code Xen 3.0.2

Mapping of HyperCalls to code :

Following is the location of all hypercalls. The hypercalls appear according to their order in xen.h.

The hypercall table itself is in xen/arch/x86/x86_32/entry.S (ENTRY(hypercall_table)).

HYPERVISOR_set_trap_table => do_set_trap_table() (file xen/arch/x86/traps.c)

HYPERVISOR_mmu_update => do_mmu_update() (file xen/arch/x86/mm.c)

HYPERVISOR_set_gdt => do_set_gdt() (file xen/arch/x86/mm.c)

HYPERVISOR_stack_switch => do_stack_switch() (file xen/arch/x86/x86_32/mm.c)

HYPERVISOR_set_callbacks => do_set_callbacks() (file xen/arch/x86/x86_32/traps.c)

HYPERVISOR_fpu_taskswitch => do_fpu_taskswitch(int set) (file xen/arch/x86/traps.c)

HYPERVISOR_sched_op_compat => do_sched_op_compat() (file xen/common/schedule.c)

HYPERVISOR_dom0_op => do_dom0_op() (file xen/common/dom0_ops.c)

HYPERVISOR_set_debugreg => do_set_debugreg() (file xen/arch/x86/traps.c)

HYPERVISOR_get_debugreg => do_get_debugreg() (file xen/arch/x86/traps.c)

HYPERVISOR_update_descriptor => do_update_descriptor() (file xen/arch/x86/mm.c)

HYPERVISOR_memory_op => do_memory_op() (file xen/common/memory.c)

HYPERVISOR_multicall => do_multicall() (file xen/common/multicall.c)

HYPERVISOR_update_va_mapping => do_update_va_mapping() (file xen/arch/x86/mm.c)

HYPERVISOR_set_timer_op => do_set_timer_op() (file xen/common/schedule.c)

HYPERVISOR_event_channel_op => do_event_channel_op() (file xen/common/event_channel.c)

HYPERVISOR_xen_version => do_xen_version() (file xen/common/kernel.c)

HYPERVISOR_console_io => do_console_io() (file xen/drivers/char/console.c)

HYPERVISOR_physdev_op => do_physdev_op() (file xen/arch/x86/physdev.c)

HYPERVISOR_grant_table_op => do_grant_table_op() (file xen/common/grant_table.c)

HYPERVISOR_vm_assist => do_vm_assist() (file xen/common/kernel.c)

HYPERVISOR_update_va_mapping_otherdomain => do_update_va_mapping_otherdomain() (file xen/arch/x86/mm.c)

HYPERVISOR_iret => do_iret() (file xen/arch/x86/x86_32/traps.c) /* x86/32 only */

HYPERVISOR_vcpu_op => do_vcpu_op() (file xen/common/domain.c)

HYPERVISOR_set_segment_base => do_set_segment_base() (file xen/arch/x86/x86_64/mm.c) /* x86/64 only */

HYPERVISOR_mmuext_op => do_mmuext_op() (file xen/arch/x86/mm.c)

HYPERVISOR_acm_op => do_acm_op() (file xen/common/acm_ops.c)

HYPERVISOR_nmi_op => do_nmi_op() (file xen/common/kernel.c)

HYPERVISOR_sched_op => do_sched_op() (file xen/common/schedule.c)

(Note: sometimes hypercalls are also called hcalls.)
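
The idea of a hypercall table, a number which indexes a table of handler functions, can be modelled in a few lines. The following Python sketch is only an illustration: the handlers are empty placeholders for the C functions listed above, the numbers 0 to 2 follow the order of the list (please verify them against xen.h), and returning -ENOSYS for an unknown number is our assumption about how an unimplemented hypercall is reported.

```python
import errno


# Placeholder handlers; the real ones are the C functions listed above.
def do_set_trap_table(traps):
    return 0


def do_mmu_update(requests):
    return 0


def do_set_gdt(frames):
    return 0


# Hypercall number -> handler, in the spirit of ENTRY(hypercall_table).
HYPERCALL_TABLE = {
    0: do_set_trap_table,
    1: do_mmu_update,
    2: do_set_gdt,
}


def dispatch(nr, *args):
    """Look up the handler for hypercall number nr and call it."""
    handler = HYPERCALL_TABLE.get(nr)
    if handler is None:
        # Assumed behaviour for a number with no handler.
        return -errno.ENOSYS
    return handler(*args)
```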

Virtualization and the Linux Kernel

Virtualization in a computer context can be thought of as extending the abilities of a computer beyond what a straight, non-virtual implementation allows.

In this category we can also include virtual memory, which allows a process on a 32-bit machine to access a 4GB virtual address space even though the physical RAM is usually much smaller.

We can also think of the Linux IP Virtual Server (which is now a part of the linux kernel) as a kind of virtualization. By using the Linux IP Virtual Server you can configure a router to redirect service requests from a virtual server address to other machines (called real servers).

The IP Virtual Server has been part of the kernel since 2.6.10 (for the 2.4.* kernels it is also available as a patch; the code for 2.6.10 and above is under net/ipv4/ipvs in the kernel tree; there is still no implementation for IPv6).

The Linux Virtual Server (LVS) was started quite some time ago, in 1998; see

The idea of virtualization in the sense of enabling of running more than one operating system on a single platform is not new and was researched for many years. However, it seems that the Xen project is the first which produces performance benchmark metrics of such a feature which make this idea more practical and more attractive.

Origins of the Xen project: Xen was originally built as part of the XenoServer project; see

Some ideas from the Arsenic project were also used in Xen. (see

In the Arsenic project, written by Ian Pratt and Keir Fraser, a big part of the Linux kernel TCP/IP stack was ported to user space. The Arsenic project is based on Linux 2.3.29. After a short look at the Arsenic project code you can find some data structures which are reminiscent of parallel data structures in Xen, like the event rings (for example, the ring_control_block struct in arsenic-1.0/acenic-fw-12.3.10/nic/common/nic_api.h).

Meiosys is a French Company which was purchased by IBM. It deals with another different type of virtualization – Application Virtualization.

see and

In the context of the Meiosys project, it is worth mentioning that a patch was recently sent to the Linux Kernel Mailing List by Serge E. Hallyn (IBM): see

This patch deals with process IDs (the pid should stay the same after the application is restarted in Meiosys).

Another article on PID virtualization is “PID virtualization: a wealth of choices”. This article deals with PID virtualization in the context of a different project (OpenVZ).

There is also the coLinux open source project (see: for more details) and the OpenVZ project, which is based on Virtuozzo™ (Virtuozzo is a commercial solution).

OpenVZ offers a Linux-based server virtualization solution: see

There are other projects which have probably inspired virtualization work; to name a few:

Denali Project (uses paravirtualization).

A paper: Denali: Lightweight Virtual Machines for Distributed and Networked Applications By Andrew Whitaker et al.

Nemesis Operating System.

Exokernel: see “Application Performance and Flexibility on Exokernel Systems” by M. Frans Kaashoek et al

TBD: more details.


Pre-Virtualization

Another interesting virtualization technique is Pre-Virtualization; in this method, we rewrite sensitive instructions using the assembler files (whether generated by the compiler, as is the usual case, or created manually). There is a problem with this method because there are instructions which are sensitive only when they are performed in a certain context. A solution for this is to generate profiling data of a guest OS and then recompile the OS using the profiling data.


and an article: Pre-Virtualization: Slashing the Cost of Virtualization Joshua LeVasseur, Volkmar Uhlig, Matthew Chapman et al.

This technique is based on a paper by Hideki Eiraku and Yasushi Shinjo, “Running BSD Kernels as User Processes by Partial Emulation and Rewriting of Machine Instructions”

Xen Storage

You can use iSCSI for Xen storage. The xen-tools package of openSUSE has an example of using iSCSI, called xmexample.iscsi. The disk entry for iSCSI in the configuration file may look like: disk = [ ‘,hda,w’ ]

TBD: more on iSCSI in Xen.

Solutions for using CoW in Xen: blktap (part of the xen project).

UnionFS: a stackable filesystem (used also in Knoppix Live-CD and other Live-CDs)

dm-userspace (A tool which uses device-mapper and a daemon called cowd; written by Dan Smith) You may download dm-userspace by:

To build as a module out-of-tree, copy dm-userspace.h to /lib/modules/`uname -r`/build/include/linux and then run “make”.

Home of dm-userspace:

Copy-on-write NFS server: see

kvm – Kernel-based Virtualization Driver

KVM is an open source virtualization project, written by Avi Kivity and Yaniv Kamay from Qumranet. See:

It has been included in the Linux kernel tree since 2.6.20-rc1; see: (“kvm driver for all those crazy virtualization people to play with”)

Currently it deals with Intel processors with the virtualization extension (VT-x) and with AMD SVM processors. You can find out whether your processor has these extensions by issuing from the command line: “egrep '^flags.*(vmx|svm)' /proc/cpuinfo”
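
The same check can be done from a script. The following Python sketch mirrors the egrep command above: it looks at the lines of /proc/cpuinfo which start with "flags" and reports which of vmx and svm appear there. The function name is our own.

```python
def virtualization_flags(cpuinfo_text: str) -> set:
    """Return the subset of {'vmx', 'svm'} found in the cpuinfo flags lines."""
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Everything after the colon is a space-separated list of flags.
            _, _, flags = line.partition(":")
            found |= {"vmx", "svm"} & set(flags.split())
    return found
```

On a Linux host one would call virtualization_flags(open("/proc/cpuinfo").read()); an empty set means that neither extension is reported.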

kvm.ko is a kernel module which handles userspace requests through ioctls. It works with a character device (/dev/kvm). The userspace part is built from a patched QEMU. One of KVM's advantages is that it uses Linux kernel mechanisms as they are, without change (such as the Linux scheduler). The Xen project, for example, made many changes to parts of the kernel to enable para-virtualization. Another advantage is the simplicity of the project: there is a kernel part and a userspace part. A further advantage of KVM is that future versions of the Linux kernel will not entail changes in the kvm module code (and of course not in the user space part). The project currently supports SMP hosts and will support SMP guests in the future.

Currently there is no support for live migration in KVM (but there is support for ordinary migration, where the migrated OS is stopped, then transferred to the target, and then resumed).

With Intel VT-x, VM exits are handled in the kvm module by the kvm_handle_exit() method in kvm_main.c according to the reason which caused them (which is specified in, and read from, the VMCS). With AMD SVM, exits are handled by handle_exit() in svm.c.

There is an interesting usage of memory slots . There is already an rpm for openSUSE by Gerd Hoffman.

Tip: How to build Xen with your own tar ball

If you want to run “make world” without downloading the kernel (because you want to use your own tar ball, which differs a bit from the original one because you made a few changes inside the kernel), then do the following:

1) Let’s say that the kernel tar ball is named my_linux-2.6.18.tar.bz2. First, move my_linux-2.6.18.tar.bz2 to the folder from which you build Xen.

2) Run from bash: XEN_LINUX_SOURCE=tarball make world

That’s it; it will use the my_linux-2.6.18.tar.bz2 tar ball that you copied to that folder.

Xen in the Linux Kernel

According to the following thread from xen-devel:, there is a mercurial repository in which xen is a subarch of i386 and x86_64 of the linux kernel, and there is an intention to send relevant stuff to Andrew/Linus for the upcoming 2.6.15 kernel. On 22/3/2006, a patchset of 35 parts was sent to the Linux Kernel Mailing List (lkml) for Xen i386 paravirtualization support in the linux kernel: see

VMI : Virtual Machine Interface

On 13/3/06, a patchset titled “VMI i386 Linux virtualization interface proposal” was sent to the LKML (Linux Kernel Mailing List) by Zachary Amsden and others. (see It suggests a common interface which abstracts the specifics of each hypervisor and thus can be used by many hypervisors. According to the vmi_spec.txt of this patchset, when an OS is ported to a paravirtualizable x86 processor, it should access the hypervisor through the VMI layer.

The VMI layer interface:

The VMI is divided into the following 10 types of calls:


PROCESSOR STATE CALLS (like VMI_DisableInterrupts, VMI_EnableInterrupts,VMI_GetInterruptMask)







TIMER CALLS (VMI_GetWallclockTime)

MMU CALLS (like VMI_SetLinearMapping)


Links

1) Xen Project HomePage

2) Xen Mailing Lists Page:

(don’t forget to read the XenUsersNetiquette before posting on the lists)

3) Article: Analysis of the Intel Pentium’s Ability to Support a Secure Virtual Machine Monitor

4) Xen Summits: 2005: 2006 fall: 2006 winter: 2007 spring:

5) Intel Virtualization technology:

6) Article by Ryan Mauer in Linux Journal in 2 parts:

6-1) Xen Virtualization and Linux Clustering, Part 1

6-2) Xen Virtualization and Linux Clustering, Part 2

Commercial Companies:

7) XenSource:

8) Enomalism:

9) Thoughtcrime is a brand new company specialising in opensource virtualisation solutions see


10)IA64 Master Thesis HPC Virtualization with Xen on Itanium by Havard K. F. Bjerke

11) vBlades: Optimized Paravirtualization for the Itanium Processor Family Daniel J. Magenheimer and Thomas W. Christian

12) Now and Xen Feature Story Article Written by Andrew Warfield and Keir Fraser

13) Self Migration:

14) online magazine:

15) Denali Project

general links about virtualization:



18) A Survey on Virtualization Technologies Susanta Nanda Tzi-cker Chiueh


19) AMD I/O virtualization technology (IOMMU) specification Rev 1.00 – February 03, 2006

20) AMD64 Architecture Programmer’s Manual: Vol 2 System Programming : Revision 3.11 added chapter 15 on virtualization (“Secure Virtual Machine”).(december 2005)

21) AMD64 Architecture Programmer’s Manual: Vol 3: General-Purpose and System Instructions Revision 3.11 added SVM instructions (december 2005)

22) AMD virtualization on the Xen Summit:

23) AMD Press Release: SUNNYVALE, CALIF. — May 23, 2006: availability of AMD processors with virtualization extensions:,,51_104_543~108605,00.html

Open Solaris:

24) Open Solaris Xen Forum:

25) Update to opensolaris Xen: adding OpenSolaris-based dom0 capabilities, as well as 32-bit and 64-bit MP guest. 14/07/2006

26) Open Sparc Hypervisor Spec:

27) Open Sparc T1 page: extension to the Solaris Zones:

28) OSDL Development Wiki Homepage (Virtualization)

29) fedora xen mailing list archive:

30) Xen Quick Start for FC4 (Fedora Core 4).

31) Xen Quick Start for FC5 (Fedora Core 5):

32) Xen Quick Start for FC6 (Fedora Core 6):

Fedora 7 quick start: Fedora 8 quick start:

33) The Xen repository is handled by the mercurial version system. mercurial download:

34) Measuring CPU Overhead for I/O Processing in the Xen Virtual Machine Monitor Cherkasova Ludmila and Gardner, Rob

35) XenMon: QoS Monitoring and Performance Profiling Tool Gupta Diwaker and Gardner Rob; Cherkasova, Ludmila

36) Potemkin VMM: A virtual machine based on xen-unstable ; used in a honeypot By Michael Vrable, Justin Ma, Jay Chen, David Moore, Erik Vandekieft, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage.

37) Memory Resource Management in VMware ESX Server


38) Virtualization: From the Desktop to the Enterprise By Erick M. Halter , Chris Wolf Published: May 2005

39) Virtualization with VMware ESX Server Publisher: Syngress; 2005 by Al Muller, Seburn Wilson, Don Happe, Gary J. Humphrey

40) VMware ESX Server: Advanced Technical Design Guide by Ron Oglesby, Scott Herold

41) PPC: Hollis Blanchard, IBM Linux Technology Center Jimi Xenidis, IBM Research


43) PID virtualization: a wealth of choices

44) The Xen Hypervisor and its IO Subsystem:

45) G. J. Popek and R. P. Goldberg, Formal requirements for virtualizable third generation architectures, Commun. ACM, vol. 17, no. 7, pp. 412 421, 1974.

46) “Running multiple operating systems concurrently on an IA32 PC using virtualization techniques” by Kevin Lawton (1999).

47) Automating Xen Virtual Machine Deployment (talks about integrating SystemImager with Xen and more) by Kris Buytaert

48)Virtualizing servers with Xen Evaldo Gardenali VI International Conference of Unix at UNINET

49)Survey of System Virtualization Techniques Robert Rose March 8, 2004



51) Interview on Xen with NetBSD developer Manuel Bouyer

52) netbsd xen mailing list:

53) NetBSD/xen Howto

54) “C” API for Xen (LGPL) By Daniel Veillard and others

55) Fraser Campbell page:

56) Another page from Fraser Campbell :

57) Virtualization blog

58)Hardware emulation with QEMU (article)


60) Linux Virtualization with Xen

61) The virtues of Xen by Alex Maier

62) Deploying VirtualMachines as Sandboxes for the Grid Sriya Santhanam, Pradheep Elango, Andrea Arpaci Dusseau, Miron Livny

63) article: Xen and the new processors:


64) Infiniband (Smart IO) wiki page hg repository:

65) Novell Infiniband and virtualization, Patrick Mullaney , may 1, 2007:

66) A Case for High Performance Computing with Virtual Machines Wei Huangy, Jiuxing Liuz, Bulent Abaliz et al.

67) High Performance VMM-Bypass I/O in Virtual Machines Wei Huangy, Jiuxing Liuz, Bulent Abaliz et al. (usenix 06)

68) User Mode Linux , a book By Jeff Dike. Bruce Perens’ Open Source Series. Published: Apr 12, 2006;

69) Xen 3.0.3 features,schedule :

70) Practical Taint-Based Protection using Demand Emulation Alex Ho, Michael Fetterman, Christopher Clark et al.

71) Current Virtualisation Hardware by Nicholas Lee

72) RAID: Installing Xen with RAID 1 on a Fedora Core 4 x86 64 SMP machine:

73) RAID 1 and Xen (dom0) : (On Debian)

74) OpenVZ Virtualization Software Available for Power Processors

75) Kernel-based Virtual Machine patchset (Avi Kivity) adding /dev/kvm which exposes the virtualization capabilities to userspace.

76) Intel Technology Journal: Intel Virtualization Technology: articles by Intel Staff (96 pages)

77) kvm site: (Avi Kivity and others) Includes a howto and a white paper, download , faq sections.

78) kvm on debian:

79) Linux Virtualization Wiki

80) ” New virtualisation system beats Xen to Linux kernel” (about kvm)

81) article about kvm:

82) Virtual Linux: An overview of virtualization methods, architectures, and implementations. An article by M. Tim Jones (author of “GNU/Linux Application Programming”, “AI Application Programming”, and “BSD Sockets Programming from a Multilanguage Perspective”).

83) Lguest: The Simple x86 Hypervisor by Rusty Russell (formerly lhype)

84) “Infrastructure virtualisation with Xen advisory”, a wiki article: using iSCSI for Xen clustering; shared storage

85) Xen with DRBD, GNBD and OCFS2 HOWTO

86) Virtualization with Xen(tm): Including Xenenterprise, Xenserver, and Xenexpress (Paperback) by David E Williams Syngress Publishing (April 1, 2007) # ISBN-10: 1597491675 # ISBN-13: 978-1597491679 (Author)

Paperback: 512 pages

87) Professional XEN Virtualization (Paperback) by William von Hagen (Author)

# Paperback: 500 pages # Publisher: Wrox (August 27, 2007) # Language: English # ISBN-10: 0470138114 # ISBN-13: 978-0470138113

88) Xen and the Art of Consolidation Tom Eastep Linuxfest NW. April 29, 2007.

89) Optimizing Network Virtualization in Xen

By Willy Zwaenepoel, Alan L. Cox, Aravind Menon, usenix 2006

90) Virtual Machine Checkpointing

Brendan Cully,University of British Columbia with Andrew Warfield, University of Cambridge

Adding new device and triggering the probe() functions

The following is a simple example which shows how to add a new device and trigger the probe() function of a backend driver using xenstore-write tool. This is relevant for Xen 3.1

Currently in Xen, the probe() method of a backend or frontend driver is triggered by writing some values to the xenstore, into directories on which xenbus has set watches. This writing to the xenstore is currently done from the Python code, and it is wrapped deep inside the xend and/or xm commands. Eventually it is done in the writeDetails method of the DevController class (both blkif and netif use it).

For those who want to be able to trigger the probe() function without diving too deeply into the Python code, this should suffice.

For the purposes of this little tutorial, let’s assume that you have built and installed Xen 3.1 from source and have used it to fire up a guest domain at least once. After you’ve done that, let’s say we want to add a new device. We will add a device named “mydevice”. Let’s begin with the backend. For this purpose, we will add a directory named “deviceback” to linux-2.6-xen-sparse/drivers/xen. This directory will store the backend portion of our driver.

First, create linux-2.6-xen-sparse/drivers/xen/deviceback. Next, add the following four files to that directory: deviceback.c, xenbus.c, common.h and Makefile.

Here is a minimal skeleton implementation of these files:


deviceback.c

#include <linux/module.h>
#include "common.h"

static int __init deviceback_init(void)
{
        /* Register the backend with xenbus (see xenbus.c). */
        device_xenbus_init();
        return 0;
}

static void __exit deviceback_cleanup(void)
{
}

module_init(deviceback_init);
module_exit(deviceback_cleanup);

xenbus.c

#include <xen/xenbus.h>
#include <linux/module.h>
#include <linux/slab.h>

struct backendinfo
{
        struct xenbus_device* dev;
        long int frontend_id;
        struct xenbus_watch backend_watch;
        struct xenbus_watch watch;
        char* frontpath;
};

static int device_probe(struct xenbus_device* dev,
                        const struct xenbus_device_id* id)
{
        struct backendinfo* be;
        char* frontend;
        int err;

        be = kmalloc(sizeof(*be),GFP_KERNEL);
        if (!be)
                return -ENOMEM;
        be->dev = dev;
        printk("Probe fired!\n");
        return 0;
}

static int device_uevent(struct xenbus_device* xdev,
                          char** envp, int num_envp,
                          char* buffer, int buffer_size)
{
        return 0;
}

static int device_remove(struct xenbus_device* dev)
{
        return 0;
}

static struct xenbus_device_id device_ids[] =
{
        { "mydevice" },
        { "" }
};

static struct xenbus_driver deviceback =
{
        .name    = "mydevice",
        .owner   = THIS_MODULE,
        .ids     = device_ids,
        .probe   = device_probe,
        .remove  = device_remove,
        .uevent  = device_uevent,
};

void device_xenbus_init(void)
{
        /* This sets the watch which will later trigger device_probe(). */
        xenbus_register_backend(&deviceback);
}

common.h

#ifndef COMMON_H
#define COMMON_H
void device_xenbus_init(void);
#endif /* COMMON_H */

Makefile

obj-y += xenbus.o deviceback.o

Next, we should add our new backend device to the Makefile in linux-2.6-xen-sparse/drivers/xen/Makefile. Add the following line to the bottom of that file:

obj-y += deviceback/

This will make sure that it will be included in the build.

Next, we need to add symlinks from linux-2.6-xen-sparse/drivers/xen/deviceback into linux-2.6.18-xen/drivers/xen/deviceback:

  1. Create linux-2.6.18-xen/drivers/xen/deviceback
  2. Change into that directory
  3. Add the symlinks: ‘ln -s ../../../../linux-2.6-xen-sparse/./drivers/xen/deviceback/./* .’

Now we should build the new drivers and reboot with the new Xen image. You can do this by going back to the root directory of the source tree (the place where you typed ‘make world’ when doing your normal build before) and do ‘make install-kernels’. (Note: This will overwrite the previous Xen kernel!) Finally, reboot.

After the machine boots back up, go ahead and start a guest domain and you should notice that device_probe() does not get executed. (Check /var/log/syslog on dom0 to look for the printk() to show up.)

How can we trigger the probe() function of our backend driver? We just need to write the correct key/value pairs into the xenstore.

The call to xenbus_register_backend() in xenbus.c causes xenbus to set a watch on local/domain/0/backend/mydevice in the xenstore. Specifically, anytime anything is written into that location in the store the watch fires and checks for a specific set of key/value pairs that indicate the probe should be fired.

So performing the following 4 calls using xenstore-write will trigger our probe() function. Replace X with the ID of a running guest domain. (Check ‘xm list’ for this. If you’ve only started one guest, this number is probably 1.)

xenstore-write /local/domain/X/device/mydevice/0/state 1
xenstore-write /local/domain/0/backend/mydevice/X/0/frontend-id X
xenstore-write /local/domain/0/backend/mydevice/X/0/frontend /local/domain/X/device/mydevice/0
xenstore-write /local/domain/0/backend/mydevice/X/0/state 1

You should see the probe message appear in Dom0’s /var/log/syslog. What happened here behind the scenes, without going too deep, is that xenbus_register_backend() put a watch on the backend directory of /local/domain/0 in the xenstore. Once frontend, frontend-id, and state are all written to the watched location, the xenbus driver gathers all of that information, as well as the state of the frontend driver (written in that first line), and uses it to set up the appropriate data structures. From there, the probe() function is finally fired.
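
If you find yourself typing these four commands often, they can be generated. The following Python sketch only builds the argument lists for the same four xenstore-write calls shown above, in the same order; the function name and its parameters are our own, and the device index is assumed to be 0 as in the example.

```python
def backend_probe_commands(device: str, domid: int, index: int = 0) -> list:
    """Build the four xenstore-write calls which trigger the backend probe()."""
    front = f"/local/domain/{domid}/device/{device}/{index}"
    back = f"/local/domain/0/backend/{device}/{domid}/{index}"
    return [
        ["xenstore-write", f"{front}/state", "1"],
        ["xenstore-write", f"{back}/frontend-id", str(domid)],
        ["xenstore-write", f"{back}/frontend", front],
        # The backend state goes last, after the frontend state is in place.
        ["xenstore-write", f"{back}/state", "1"],
    ]
```

On dom0 each list could then be passed to subprocess.run(cmd, check=True); printing " ".join(cmd) first is a cheap way to compare the result with the commands above.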

Adding a frontend device

For this purpose, we will add a directory named “devicefront” to linux-2.6-xen-sparse/drivers/xen.

We will create 2 files there: devicefront.c and Makefile.

We will also add directories and symlinks as we did in the deviceback case.


Makefile

obj-y := devicefront.o


The devicefront.c will be (a minimalist implementation):

// devicefront.c
#include <xen/xenbus.h>
#include <linux/module.h>
#include <linux/list.h>

struct device_info
{
        struct list_head list;
        struct xenbus_device* xbdev;
};

static int devicefront_probe(struct xenbus_device* dev,
                             const struct xenbus_device_id* id)
{
        printk("Frontend Probe Fired!\n");
        return 0;
}

static struct xenbus_device_id devicefront_ids[] =
{
        { "mydevice" },
        { "" }
};

static struct xenbus_driver devicefront =
{
        .name  = "mydevice",
        .owner = THIS_MODULE,
        .ids   = devicefront_ids,
        .probe = devicefront_probe,
};

static int devicefront_init(void)
{
        /* Register the frontend with xenbus so that probe() can fire. */
        return xenbus_register_frontend(&devicefront);
}

module_init(devicefront_init);

We should also remember to add the following to the Makefile under linux-2.6-xen-sparse/drivers/xen:

obj-y   += devicefront/

Getting the frontend driver to fire is a bit more complicated; the following bash script should help you:

#!/bin/bash

if [ $# != 2 ]
then
        echo "Usage: $0 <device name> <frontend-id>"
else
        # Write backend information into the location the frontend will look
        # for it.
        xenstore-write /local/domain/${2}/device/${1}/0/backend-id 0
        xenstore-write /local/domain/${2}/device/${1}/0/backend \
                       /local/domain/0/backend/${1}/${2}/0

        # Write frontend information into the location the backend will look
        # for it.
        xenstore-write /local/domain/0/backend/${1}/${2}/0/frontend-id ${2}
        xenstore-write /local/domain/0/backend/${1}/${2}/0/frontend \
                       /local/domain/${2}/device/${1}/0

        # Set the permissions on the backend so that the frontend can
        # actually read it.
        xenstore-chmod /local/domain/0/backend/${1}/${2}/0 r

        # Write the states.  Note that the backend state must be written
        # last because it requires a valid frontend state to already be
        # written.
        xenstore-write /local/domain/${2}/device/${1}/0/state 1
        xenstore-write /local/domain/0/backend/${1}/${2}/0/state 1
fi

Here’s how to use it:

  • Startup a Xen guest that contains your frontend driver, and be sure dom0 contains the backend driver.
  • Figure out the frontend-id for the guest. This is the ID field when running xm list. Let’s say that number is 3.
  • Run the script as so: ./ mydevice 3

That should fire the probe() functions of both the backend and the frontend driver. (You’ll have to check /var/log/messages in the guest to verify that the frontend probe was fired.)


Discussion

SimonKagstrom: Maybe it would be a good idea to split this document into several pages? It’s starting to be fairly long :) TimPost

TimPost: ACK, even if printed, this is hard to digest as an intro



Strip attachments from an email message

Posted on December 7, 2009. Filed under: Linux, Python | Tags: , , , |

This recipe shows a simple approach to using the Python email package to strip out attachments and file types from an email message that might be considered dangerous. This is particularly relevant in Python 2.4, as the email Parser is now much more robust in handling malformed messages (which are typical for virus and worm emails).

ReplaceString = """

This message contained an attachment that was stripped out. 

The original type was: %(content_type)s
The filename was: %(filename)s, 
(and it had additional parameters of:
%(params)s)

"""

import re
BAD_CONTENT_RE = re.compile('application/(msword|msexcel)', re.I)
BAD_FILEEXT_RE = re.compile(r'(\.exe|\.zip|\.pif|\.scr|\.ps)$')

def sanitise(msg):
    # Strip out all payloads of a particular type
    ct = msg.get_content_type()
    # We also want to check for bad filename extensions
    fn = msg.get_filename()
    # get_filename() returns None if there's no filename
    if or (fn and
        # Ok. This part of the message is bad, and we're going to stomp
        # on it. First, though, we pull out the information we're about to
        # destroy so we can tell the user about it.

        # This returns the parameters to the content-type. The first entry
        # is the content-type itself, which we already have.
        params = msg.get_params()[1:] 
        # The parameters are a list of (key, value) pairs - join the
        # key-value with '=', and the parameter list with ', '
        params = ', '.join([ '='.join(p) for p in params ])
        # Format up the replacement text, telling the user we ate their
        # email attachment.
        replace = ReplaceString % dict(content_type=ct, 
        # Install the text body as the new payload.
        # Now we manually strip away any paramaters to the content-type 
        # header. Again, we skip the first parameter, as it's the 
        # content-type itself, and we'll stomp that next.
        for k, v in msg.get_params()[1:]:
        # And set the content-type appropriately.
        # Since we've just stomped the content-type, we also kill these
        # headers - they make no sense otherwise.
        del msg['Content-Transfer-Encoding']
        del msg['Content-Disposition']
        # Now we check for any sub-parts to the message
        if msg.is_multipart():
            # Call the sanitise routine on any subparts
            payload = [ sanitise(x) for x in msg.get_payload() ]
            # We replace the payload with our list of sanitised parts
    # Return the sanitised message
    return msg

# And a simple driver to show how to use this
import email, sys
m = email.message_from_file(open(sys.argv[1]))
print sanitise(m)


I’ve seen this come up a few times on comp.lang.python, so here’s a cookbook entry for it. This recipe shows how to read in an email message, strip out any dangerous or suspicious attachments, and replace them with a harmless text message informing the user of this.

This is particularly important if the end-users are using something like Outlook, which is targeted by unpleasant virus and worm messages on a daily basis.

The email parser in Python 2.4 has been completely rewritten to be robust first, correct second. Prior to this, the parser was written for correctness first. This was a problem, because many viruses and worms send email messages that are broken and non-conformant, which made the old email parser choke and die. The new parser is designed to never actually break when reading a message; instead it tries its best to fix up whatever it can in the message. (If you have a message that causes the parser to crash, please let us know. That’s a bug, and we’ll fix it.)

The code itself is heavily commented, and should be easy enough to follow. A mail message consists of one or more parts – these can each contain nested parts. We call the ‘sanitise()’ function on the top level Message object, and it calls itself recursively on the sub-objects. The sanitise() function checks the Content-Type of the part, and if there’s a filename, also checks that, against a known-to-be-bad list.

If the message part is bad, we replace its payload with a short text description of the now-removed part, and clean out the headers that are relevant. We set this message part’s Content-Type to ‘text/plain’, and remove the other headers that relate to the now-removed content.

Finally, we check if the message is a multipart message. This means it has sub-parts, so we recursively call the sanitise function on each of those. We then replace the payload with our list of sanitised sub-parts.
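To make the nesting concrete, here is a minimal Python 3 sketch (not part of the original recipe) that builds a small two-part message and lists every part with its depth, content type and filename. The helper names describe() and build_sample(), and the sample attachment, are made up for this illustration.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def describe(msg, depth=0):
    """Return one (depth, content_type, filename) tuple per part, outermost first."""
    rows = [(depth, msg.get_content_type(), msg.get_filename())]
    if msg.is_multipart():
        # For a multipart message, get_payload() is a list of sub-messages.
        for part in msg.get_payload():
            rows.extend(describe(part, depth + 1))
    return rows


def build_sample():
    """Build a hypothetical message: a text body plus an .exe attachment."""
    outer = MIMEMultipart()
    outer.attach(MIMEText('See attached.'))
    attachment = MIMEApplication(b'not really a program')
    attachment.add_header('Content-Disposition', 'attachment', filename='setup.exe')
    outer.attach(attachment)
    return outer


if __name__ == '__main__':
    for row in describe(build_sample()):
        print(row)
```

The recursion has the same shape as sanitise(): look at the current part, then descend into the sub-parts if there are any.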

Extensions, further work, etc:

Instead of destroying the attachment, it would be a small amount of work to instead store the attachment away in a directory, and supply the user with a link to the file.
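A rough Python 3 sketch of that idea follows. It is an assumption-laden variant, not the recipe author's code: the function name quarantine(), the directory argument and the wording of the replacement note are all invented here, and it only checks the filename extension.

```python
import os
import re
import tempfile
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

BAD_FILEEXT_RE = re.compile(r'(\.exe|\.zip|\.pif|\.scr|\.ps)$', re.I)


def quarantine(msg, directory):
    """Save suspicious attachments under `directory` and leave a note in their place."""
    if msg.is_multipart():
        for part in msg.get_payload():
            quarantine(part, directory)
        return msg
    fn = msg.get_filename()
    if fn and BAD_FILEEXT_RE.search(fn):
        # basename() guards against path components smuggled into the filename.
        path = os.path.join(directory, os.path.basename(fn))
        with open(path, 'wb') as out:
            # decode=True undoes the base64/quoted-printable transfer encoding.
            out.write(msg.get_payload(decode=True))
        for key, _value in (msg.get_params() or [])[1:]:
            msg.del_param(key)
        del msg['Content-Transfer-Encoding']
        del msg['Content-Disposition']
        msg.set_payload('The attachment %s was moved to %s\n' % (fn, path))
        msg.set_type('text/plain')
    return msg


def build_sample():
    """Build a hypothetical message: a text body plus an .exe attachment."""
    outer = MIMEMultipart()
    outer.attach(MIMEText('See attached.'))
    bad = MIMEApplication(b'fake program bytes')
    bad.add_header('Content-Disposition', 'attachment', filename='setup.exe')
    outer.attach(bad)
    return outer


if __name__ == '__main__':
    print(quarantine(build_sample(), tempfile.mkdtemp()))
```

A real deployment would also need unique file names and a way to turn the saved path into a link the user can reach; both are left out of this sketch.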

You could add other filters into the sanitise() code – for instance, checking other headers for known signs of worm or virus messages. Or removing all large powerpoint files sent to you by your marketing department, if that’s what you want to do.



BASH Shell: For Loop File Names With Spaces

Posted on November 20, 2009. Filed under: Linux, Mac, Shell | Tags: , , , , , , |

The BASH for loop works nicely under UNIX / Linux / Windows and OS X when working on a set of files. However, if you try to loop over file names with spaces in them, you are going to have problems. The for loop uses the $IFS variable to determine what the field separators are. By default $IFS is set to space, tab and newline, so a file name that contains a space is split into several words. There are multiple solutions to this problem.

Set $IFS variable

Try it as follows:

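#!/bin/bash
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
for f in *
do
  echo "$f"
done
IFS=$SAVEIFS

Or, with the list of files held in a variable:

#!/bin/bash
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
# set me (placeholder path)
FILES=/path/to/files/*
for f in $FILES
do
  echo "$f"
done
# restore $IFS
IFS=$SAVEIFS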

More examples using $IFS and while loop

Now you know that if the field delimiters are not whitespace, you can set IFS. For example, a while loop can be used to get all fields from the /etc/passwd file:

while IFS=: read -r userName passWord userID groupID geCos homeDir userShell
do
      echo "$userName -> $homeDir"
done < /etc/passwd

Using the good old find command to process file names

To process the output of find with a command, try as follows:

find . -print0 | while IFS= read -r -d '' file
do
  echo "$file"
done

Try to copy files with spaces in their names to /tmp using the find command and a shell pipe:

find . -type f -print0 | while IFS= read -r -d '' file; do cp -v "$file" /tmp; done

Processing filenames using an array

Sometimes you need to read a list of files into an array, one array element per line. The following script reads file names into an array so that you can process each file using a for loop. This is useful for complex tasks:

#!/bin/bash
DIR="$1"

# failsafe - fall back to current directory
[ "$DIR" == "" ] && DIR="." 

# save and change IFS
OLDIFS=$IFS
IFS=$'\n'

# read all file name into an array
fileArray=($(find "$DIR" -type f))

# restore it
IFS=$OLDIFS

# get length of an array
tLen=${#fileArray[@]}

# use for loop read all filenames
for (( i=0; i<${tLen}; i++ ));
do
  echo "${fileArray[$i]}"
done

Playing mp3s with spaces in file names

Place following code in your ~/.bashrc file:

mp3(){
	local o=$IFS
	IFS=$(echo -en "\n\b")
	# unquoted on purpose: with IFS set to newline, each line becomes one argument
	/usr/bin/beep-media-player $(cat "$@") &
	IFS=$o
}

Keep list of all mp3s in a text file such as follows (~/eng.mp3.txt):

/nas/english/Adriano Celentano - Susanna.mp3
/nas/english/Nick Cave & Kylie Minogue - Where The Wild Roses Grow.mp3
/nas/english/Roberta Flack - Kiling Me Softly With This Song.mp3
/nas/english/The Beatles - Girl.mp3
/nas/english/John Lennon - Stand By Me.mp3
/nas/english/The Seatbelts, Cowboy Bebop - 01-Tank.mp3

To play just type:
$ mp3 eng.mp3.txt

Another example about IFS:


# split on newlines only, so each line becomes one positional parameter
IFS=$'\n'
set -- $(cat my.file)

# Now the lines are stored in $1, $2, $3, …

echo "$1"
echo "$2"
echo "$3"
echo "$4"



Bash small tricks

Posted on November 2, 2009. Filed under: Linux, Shell |

1. How to get the output of the command? (Not the return/exit status of the command)

ret=$(echo 123)

or, with the older backtick syntax:

ret=`echo 123`


Useful Linux commands: Screen, ttyload, mytop, mtop

Posted on October 23, 2009. Filed under: Linux, MySQL, Shell | Tags: , , , , , |

1. Screen

Screen is a full-screen window manager that multiplexes a physical terminal between several processes (typically interactive shells). The same way tabbed browsing revolutionized the web experience, GNU Screen can do the same for your experience in the command line. Instead of opening up several terminal instances on your desktop or using those ugly GNOME/KDE-based tabs, Screen can do it better and simpler. Not only that, with GNU Screen, you can share sessions with others and detach/attach terminal sessions. It is a great tool for people who have to share working environments between work and home.


2. ttyload

ttyload is a little *NIX utility I wrote which is meant to give a color-coded graph of load averages over time.


3. mtop/mkill

mtop (MySQL top) monitors a MySQL server showing the queries which are taking the most amount of time to complete. Features include ‘zooming’ in on a process to show the complete query, ‘explaining’ the query optimizer information for a query and ‘killing’ queries. In addition, server performance statistics, configuration information, and tuning tips are provided.

mkill (MySQL kill) monitors a MySQL server for long running queries and kills them after a specified time interval. Queries can be selected based on regexes on the user, host, command, database, state and query.


4. mytop

mytop is a console-based (non-gui) tool for monitoring the threads and overall performance of a MySQL 3.22.x, 3.23.x, and 4.x server. It runs on most Unix systems (including Mac OS X) which have Perl, DBI, and Term::ReadKey installed. And with Term::ANSIColor installed you even get color.



Useful Linux tools: sipsak,Balance,eAccelerator,ploticus

Posted on October 20, 2009. Filed under: Linux, Shell | Tags: , , , , , |

1. sipsak

sipsak is a small command line tool for developers and administrators of Session Initiation Protocol (SIP) applications. It can be used for some simple tests on SIP applications and devices.


2. Balance

Balance is our surprisingly successful load balancing solution being a simple but powerful generic tcp proxy with round robin load balancing and failover mechanisms. Its behaviour can be controlled at runtime using a simple command line syntax.


3. eAccelerator

eAccelerator is a free open-source PHP accelerator & optimizer. It increases the performance of PHP scripts by caching them in their compiled state, so that the overhead of compiling is almost completely eliminated. It also optimizes scripts to speed up their execution. eAccelerator typically reduces server load and increases the speed of your PHP code by 1-10 times.


4. ploticus

A free, GPL, non-interactive software package for producing plots, charts, and graphics from data. It was developed in a Unix/C environment and runs on various Unix, Linux, and win32 systems. ploticus is good for automated or just-in-time graph generation, handles date and time data nicely, and has basic statistical capabilities. It allows significant user control over colors, styles, options and details.



when “import twitter” report “ImportError: No module named simplejson”

Posted on August 27, 2009. Filed under: Python |

When I use python-twitter with Python 2.6 on Windows:

“import twitter” fails with the error “ImportError: No module named simplejson”

The simplest fix:

1. Check that the json package is available: import json

2. Find the python-twitter module under “Python26\Lib\site-packages\” and change the line near the top from “import simplejson” to “import json as simplejson”

3. Run “import twitter” again; the error should be gone
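A slightly more tolerant variant of step 2 is to try simplejson first and fall back to the standard library only when it is missing. The sketch below shows the pattern; the roundtrip() helper is made up just to exercise whichever module was loaded.

```python
try:
    import simplejson            # third-party package, used if it is installed
except ImportError:
    import json as simplejson    # standard library module, available since Python 2.6


def roundtrip(obj):
    """Encode obj to JSON text and decode it again with whichever module loaded."""
    return simplejson.loads(simplejson.dumps(obj))


if __name__ == '__main__':
    print(roundtrip({"status": "ok", "ids": [1, 2, 3]}))
```

With this form the same source file works whether or not simplejson is installed.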



Skype Auto Answer launch!

Posted on August 18, 2009. Filed under: Programming, Python |


W3C Geolocation API standard, according to Google

Posted on July 9, 2009. Filed under: Programming |


Python: ImportError: No module named _md5

Posted on June 21, 2009. Filed under: Python |

Python 2.5.1 (r251:54863, Sep 3 2007, 17:35:15)
[GCC 3.3.3 20040412 (Red Hat Linux 3.3.3-7)] on linux2
Type “help”, “copyright”, “credits” or “license” for more information.
>>> import md5
Traceback (most recent call last):
File “”, line 1, in
File “/usr/lib/python2.5/”, line 6, in
from hashlib import md5
File “/usr/lib/python2.5/”, line 133, in
md5 = __get_builtin_constructor(‘md5’)
File “/usr/lib/python2.5/”, line 60, in __get_builtin_constructor
import _md5
ImportError: No module named _md5

Searching the Internet suggests that the error is caused by an incompatibility between Python 2.5.1 and openssl-0.9.8a: Python 2.5.1 looks for the OpenSSL shared libraries (through symbolic links) under names that openssl-0.9.8a does not provide in /lib/. The solution is to create the missing symbolic links:
1. Log in as user “root”
2. cd /lib/
3. ln -s
4. ln -s
5. Check in Python: start python, then enter “import md5”. If there is no output, the bug is fixed.
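The traceback shows that md5 simply delegates to hashlib, so the check in step 5 can also be written against hashlib directly. This is a small sketch of such a check; the helper names are invented here.

```python
import hashlib


def md5_available():
    """Return True if hashlib can construct an MD5 object in this interpreter."""
    try:
        hashlib.md5(b"abc")
    except (ValueError, ImportError):
        return False
    return True


def md5_hex(data):
    """Return the hexadecimal MD5 digest of a bytes value."""
    return hashlib.md5(data).hexdigest()


if __name__ == '__main__':
    print(md5_available(), md5_hex(b"abc"))
```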


Howto run a sub thread/process to monitor the system status in Python?

Posted on May 22, 2009. Filed under: Linux, Python |


 1. popen  or popen2

 This runs the command through /bin/sh, so the pid you get back may be the pid of /bin/sh rather than the pid of the command itself.


1.1   without log

>>> import popen2, os
>>> cur = popen2.Popen4("vmstat -n 100")
>>> cur.pid
123

ps aux|grep vmstat

root       123    0.0  0.1   1808   564 pts/9    S+   10:57   0:00 vmstat -n 100


1.2    With log

>>> import popen2, os
>>> cur = popen2.Popen4("vmstat -n 100 > vmstat.log")
>>> cur.pid
124

ps aux|grep vmstat

root       124  0.1  0.2   4488   996 pts/9    S+   11:01   0:00 /bin/sh -c vmstat -n 100 > vmstat.log

root       125  0.0  0.1   1804   560 pts/9    S+   11:01   0:00 vmstat -n 100

>>> os.kill(cur.pid, 9)
>>> os.waitpid(cur.pid, 0)
(124, 9)

ps aux|grep vmstat

root       125  0.0  0.1   1804   560 pts/9    S+   11:01   0:00 vmstat -n 100


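From these examples you can see that if you want the child process to write the command’s output to a log, you had better not use the popen2 library: killing the pid it reports only kills /bin/sh and leaves the command itself running. You can use the following method instead.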


2. subprocess


2.1 Start the subprocess:

>>> import subprocess
>>> outlog = open(outputlogname, "w")
>>> errlog = open(errlogname, "w")
>>> try:
...     process = subprocess.Popen(["vmstat", "-n", "100"], stdout=outlog, stderr=errlog)
... except OSError, e:
...     print "The OSError is:", e
...     print "Maybe this command does not exist"
... except Exception, e2:
...     print e2
...
>>> pid = process.pid

ps aux|grep vmstat

root       134  0.0  0.1   1804   560 pts/9    S+   11:01   0:00 vmstat -n 100


You can see here that there is no “/bin/sh” process. Popen takes a “shell=True/False” parameter:

On Unix, with shell=False (default): In this case, the Popen class uses os.execvp() to execute the child program. args should normally be a sequence. A string will be treated as a sequence with the string as the only item (the program to execute).

On Unix, with shell=True: If args is a string, it specifies the command string to execute through the shell. If args is a sequence, the first item specifies the command string, and any additional items will be treated as additional shell arguments.


2.2 Kill the sub process

>>> os.kill(pid, 9)
>>> os.waitpid(pid, 0)
(134, 9)
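For reference, the same start/kill cycle can be sketched in Python 3 syntax. This is only an illustration under a few assumptions: the helper names start_monitor() and stop_monitor() are invented, stderr is merged into the same log file, and the demo launches a sleeping Python child instead of vmstat so that it runs on machines without vmstat.

```python
import os
import signal
import subprocess
import sys
import tempfile


def start_monitor(args, log_path):
    """Start args (a list, so no /bin/sh is involved) with output sent to log_path."""
    log = open(log_path, "w")
    try:
        process = subprocess.Popen(args, stdout=log, stderr=subprocess.STDOUT)
    except OSError:
        # Typically raised when the command does not exist.
        log.close()
        raise
    return process, log


def stop_monitor(process, log):
    """Send SIGKILL to the child, reap it, and close the log file."""
    os.kill(process.pid, signal.SIGKILL)
    process.wait()
    log.close()
    return process.returncode


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "monitor.log")
    child, logfile = start_monitor(
        [sys.executable, "-c", "import time; time.sleep(60)"], path)
    print("child pid:", child.pid)
    print("return code:", stop_monitor(child, logfile))
```

On POSIX systems a child killed by a signal reports a negative return code equal to the signal number, which matches the 9 in the waitpid() result above.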








Python http upload script

Posted on May 15, 2009. Filed under: Python |


 The httplib module and some useful methods

 Note   The httplib module has been renamed to http.client in Python 3.0. The 2to3 tool will automatically adapt imports when converting your sources to 3.0.

class httplib.HTTPConnection(host[, port[, strict[, timeout]]])

 HTTPConnection.request(method, url[, body[, headers]])

HTTPConnection.getresponse()

Note: You must have read the whole response before you can send a new request to the server.

HTTPConnection.set_debuglevel(level)

Set the debugging level (the amount of debugging output printed). The default debug level is 0, meaning no debugging output is printed.

HTTPConnection.connect()

Connect to the server specified when the object was created.

HTTPConnection.close()

Close the connection to the server.

As an alternative to using the request() method described above, you can also send your request step by step, by using the four functions below.

HTTPConnection.putrequest(request, selector[, skip_host[, skip_accept_encoding]])

This should be the first call after the connection to the server has been made. It sends a line to the server consisting of the request string, the selector string, and the HTTP version (HTTP/1.1). To disable automatic sending of Host: or Accept-Encoding: headers (for example to accept additional content encodings), specify skip_host or skip_accept_encoding with non-False values.

Changed in version 2.4: skip_accept_encoding argument added.

HTTPConnection.putheader(header, argument[, ...])

Send an RFC 822-style header to the server. It sends a line to the server consisting of the header, a colon and a space, and the first argument. If more arguments are given, continuation lines are sent, each consisting of a tab and an argument.

HTTPConnection.endheaders()

Send a blank line to the server, signalling the end of the headers.

HTTPConnection.send(data)

Send data to the server. This should be used directly only after the endheaders() method has been called and before getresponse() is called.


 Example of upload file by PUT

 import httplib

conn = httplib.HTTPConnection("")

# The request line has to be sent first. The path here is a placeholder;
# point it at your own upload.php.
conn.putrequest("PUT", "/upload.php")

conn.putheader("Content-Length", "32")

conn.endheaders()

conn.send("Hello world, I am uploading 32 bytes, if the length is less than 32 bytes, the script will halt here, if more than 32 bytes, the upload.php will only read the first 32 bytes")

resps = conn.getresponse()

data = resps.read()

print "response is the webpage which upload.php shows after PUT upload:", data
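The same step-by-step PUT can be tried without a PHP server. The following Python 3 sketch uses http.client (the renamed httplib) and starts a tiny local server in a background thread. The PutHandler class is a hypothetical stand-in for upload.php, the names put_bytes() and demo() are invented, and the Content-Length is computed from the body rather than hard-coded.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class PutHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for upload.php: reports how many bytes arrived."""

    def do_PUT(self):
        length = int(self.headers["Content-Length"])
        received = self.rfile.read(length)
        reply = ("received %d bytes" % len(received)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def put_bytes(host, port, path, body):
    """Upload body with PUT using the step-by-step calls; return (status, data)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    conn.putrequest("PUT", path)
    conn.putheader("Content-Length", str(len(body)))
    conn.endheaders()
    conn.send(body)
    response = conn.getresponse()
    data = response.read()
    conn.close()
    return response.status, data


def demo():
    """Run one upload against a throwaway server on a port chosen by the OS."""
    server = HTTPServer(("127.0.0.1", 0), PutHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return put_bytes("127.0.0.1", server.server_port, "/upload", b"Hello world")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

Deriving Content-Length from len(body) avoids the mismatch described in the send() string above, where a header value that is too large makes the server wait for bytes that never arrive.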



httplib — HTTP protocol client

 urllib — Open arbitrary resources by URL

Http client to POST using multipart/form-data

Big File Upload

httplib HTTPConnection request problem

PyCURL interface – Uploading large binary files


HTTP Upload — An Overview

HTTP Upload using a Proxy Server


Howto upload file by PHP

Posted on May 9, 2009. Filed under: PHP |



  1. Upload by POST


  2. Upload multi-files by POST


  3. Upload by PUT


  4. Other references


The print_r() function prints out the whole structure of a variable.


How To Use Linux epoll with Python

Posted on May 5, 2009. Filed under: Linux, Python | Tags: , , , , |



As of version 2.6, Python includes an API for accessing the Linux epoll library. This article uses Python 3 examples to briefly demonstrate the API. Questions and feedback are welcome.

Blocking Socket Programming Examples

Example 1 is a simple Python 3.0 server that listens on port 8080 for an HTTP request message, prints it to the console, and sends an HTTP response message back to the client.

  • Line 9: Create the server socket.
  • Line 10: Permits the bind() in line 11 even if another program was recently listening on the same port. Otherwise this program could not run until a minute or two after the previous program using that port had finished.
  • Line 11: Bind the server socket to port 8080 of all available IPv4 addresses on this machine.
  • Line 12: Tell the server socket to start accepting incoming connections from clients.
  • Line 14: The program will stop here until a connection is received. When this happens, the server socket will create a new socket on this machine that is used to talk to the client. This new socket is represented by the connectiontoclient object returned from the accept() call. The address object indicates the IP address and port number at the other end of the connection.
  • Lines 15-17: Assemble the data being transmitted by the client until a complete HTTP request has been transmitted. The HTTP protocol is described at HTTP Made Easy.
  • Line 18: Print the request to the console, in order to verify correct operation.
  • Line 19: Send the response to the client.
  • Lines 20-22: Close the connection to the client as well as the listening server socket.

The official HOWTO has a more detailed description of socket programming with Python.

Example 1 (All examples use Python 3)

 1  import socket
 3  EOL1 = b'\n\n'
 4  EOL2 = b'\n\r\n'
 5  response  = b'HTTP/1.0 200 OK\r\nDate: Mon, 1 Jan 1996 01:01:01 GMT\r\n'
 6  response += b'Content-Type: text/plain\r\nContent-Length: 13\r\n\r\n'
 7  response += b'Hello, world!'
 9  serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
10  serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
11  serversocket.bind(('', 8080))
12  serversocket.listen(1)
14  connectiontoclient, address = serversocket.accept()
15  request = b''
16  while EOL1 not in request and EOL2 not in request:
17     request += connectiontoclient.recv(1024)
18  print(request.decode())
19  connectiontoclient.send(response)
20  connectiontoclient.close()
22  serversocket.close()

Example 2 adds a loop in line 15 to repeatedly process client connections until interrupted by the user (e.g. with a keyboard interrupt). This illustrates more clearly that the server socket is never used to exchange data with the client. Rather, it accepts a connection from a client, and then creates a new socket on the server machine that is used to communicate with the client.

The finally statement block in lines 23-24 ensures that the listening server socket is always closed, even if an exception occurs.

Example 2

 1  import socket
 3  EOL1 = b'\n\n'
 4  EOL2 = b'\n\r\n'
 5  response  = b'HTTP/1.0 200 OK\r\nDate: Mon, 1 Jan 1996 01:01:01 GMT\r\n'
 6  response += b'Content-Type: text/plain\r\nContent-Length: 13\r\n\r\n'
 7  response += b'Hello, world!'
 9  serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
10  serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
11  serversocket.bind(('', 8080))
12  serversocket.listen(1)
14  try:
15     while True:
16        connectiontoclient, address = serversocket.accept()
17        request = b''
18        while EOL1 not in request and EOL2 not in request:
19            request += connectiontoclient.recv(1024)
20        print('-'*40 + '\n' + request.decode()[:-2])
21        connectiontoclient.send(response)
22        connectiontoclient.close()
23  finally:
24     serversocket.close()

Benefits of Asynchronous Sockets and Linux epoll

The sockets shown in Example 2 are called blocking sockets, because the Python program stops running until an event occurs. The accept() call in line 16 blocks until a connection has been received from a client. The recv() call in line 19 blocks until data has been received from the client (or until there is no more data to receive). The send() call in line 21 blocks until all of the data being returned to the client has been queued by Linux in preparation for transmission.

When a program uses blocking sockets it often uses one thread (or even a dedicated process) to carry out the communication on each of those sockets. The main program thread will contain the listening server socket which accepts incoming connections from clients. It will accept these connections one at a time, passing the newly created socket off to a separate thread which will then interact with the client. Because each of these threads only communicates with one client, it is ok if it is blocked from proceeding at certain points. This blockage does not prohibit any of the other threads from carrying out their respective tasks.

The use of blocking sockets with multiple threads results in straightforward code, but comes with a number of drawbacks. It can be difficult to ensure the threads cooperate appropriately when sharing resources. And this style of programming can be less efficient on computers with only one CPU.

The C10K Problem discusses some of the alternatives for handling multiple concurrent sockets. One is the use of asynchronous sockets. These sockets don’t block until some event occurs. Instead, the program performs an action on an asynchronous socket and is immediately notified as to whether that action succeeded or failed. This information allows the program to decide how to proceed. Since asynchronous sockets are non-blocking, there is no need for multiple threads of execution. All work may be done in a single thread. This single-threaded approach comes with its own challenges, but can be a good choice for many programs. It can also be combined with the multi-threaded approach: asynchronous sockets using a single thread can be used for the networking component of a server, and threads can be used to access other blocking resources, e.g. databases.
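A small sketch of what "immediately notified" means in Python 3: a recv() on a non-blocking socket with nothing to read raises BlockingIOError instead of waiting. The helper names are invented, and a local socket pair stands in for a real client connection.

```python
import socket


def recv_if_ready(sock):
    """Return received bytes, or None if reading would have blocked."""
    try:
        return sock.recv(1024)
    except BlockingIOError:
        return None


def demo():
    """Read from a non-blocking socket before and after data is sent to it."""
    left, right = socket.socketpair()   # two connected local sockets
    right.setblocking(False)
    try:
        before = recv_if_ready(right)   # nothing has been sent yet
        left.sendall(b'hello')
        after = recv_if_ready(right)    # the data is now waiting
        return before, after
    finally:
        left.close()
        right.close()


if __name__ == '__main__':
    print(demo())
```

A program built this way needs some mechanism to learn when it is worth calling recv() again, which is the job of select, poll and epoll.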

Linux 2.6 has a number of mechanisms for managing asynchronous sockets, three of which are exposed by Python’s select module: select, poll and epoll. epoll and poll are better than select because the Python program does not have to inspect each socket for events of interest. Instead it can rely on the operating system to tell it which sockets may have these events. And epoll is better than poll because it does not require the operating system to inspect all sockets for events of interest each time it is queried by the Python program. Rather, Linux tracks these events as they occur, and returns a list when queried by Python. So epoll is a more efficient and scalable mechanism for large numbers (thousands) of concurrent socket connections, as shown in these graphs.

Asynchronous Socket Programming Examples with epoll

Programs using epoll often perform actions in this sequence:

  1. Create an epoll object
  2. Tell the epoll object to monitor specific events on specific sockets
  3. Ask the epoll object which sockets may have had the specified event since the last query
  4. Perform some action on those sockets
  5. Tell the epoll object to modify the list of sockets and/or events to monitor
  6. Repeat steps 3 through 5 until finished
  7. Destroy the epoll object
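Before the full server, the seven steps can be seen in miniature. This sketch (Linux only, Python 3) runs them once on a local socket pair instead of a network connection; the function name epoll_roundtrip() is made up, and step 5 is done once after the loop rather than on every pass.

```python
import select
import socket


def epoll_roundtrip(message):
    """Send message over a local socket pair and collect it with epoll."""
    left, right = socket.socketpair()
    right.setblocking(False)
    epoll = select.epoll()                               # step 1: create
    try:
        epoll.register(right.fileno(), select.EPOLLIN)   # step 2: monitor reads
        left.sendall(message)
        received = b''
        while len(received) < len(message):              # step 6: repeat
            for fileno, event in epoll.poll(1):          # step 3: query
                if event & select.EPOLLIN:
                    received += right.recv(1024)         # step 4: act
        epoll.modify(right.fileno(), 0)                  # step 5: change interest
        return received
    finally:
        epoll.close()                                    # step 7: destroy
        left.close()
        right.close()


if __name__ == '__main__':
    print(epoll_roundtrip(b'ping'))
```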

Example 3 duplicates the functionality of Example 2 while using asynchronous sockets. The program is more complex because a single thread is interleaving the communication with multiple clients.

  • Line 1: The select module contains the epoll functionality.
  • Line 13: Since sockets are blocking by default, this is necessary to use non-blocking (asynchronous) mode.
  • Line 15: Create an epoll object.
  • Line 16: Register interest in read events on the server socket. A read event will occur any time the server socket accepts a socket connection.
  • Line 19: The connection dictionary maps file descriptors (integers) to their corresponding network connection objects.
  • Line 21: Query the epoll object to find out if any events of interest may have occurred. The parameter “1” signifies that we are willing to wait up to one second for such an event to occur. If any events of interest occurred prior to this query, the query will return immediately with a list of those events.
  • Line 22: Events are returned as a sequence of (fileno, event code) tuples. fileno is a synonym for file descriptor and is always an integer.
  • Line 23: If a read event occurred on the server socket, then a new socket connection may have been created.
  • Line 25: Set new socket to non-blocking mode.
  • Line 26: Register interest in read (EPOLLIN) events for the new socket.
  • Line 31: If a read event occurred then read new data sent from the client.
  • Line 33: Once the complete request has been received, then unregister interest in read events and register interest in write (EPOLLOUT) events. Write events will occur when it is possible to send response data back to the client.
  • Line 34: Print the complete request, demonstrating that although communication with clients is interleaved this data can be assembled and processed as a whole message.
  • Line 35: If a write event occurred on a client socket, it’s able to accept new data to send to the client.
  • Lines 36-38: Send the response data a bit at a time until the complete response has been delivered to the operating system for transmission.
  • Line 39: Once the complete response has been sent, disable interest in further read or write events.
  • Line 40: A socket shutdown is optional if a connection is closed explicitly. This example program uses it in order to cause the client to shut down first. The shutdown call informs the client socket that no more data should be sent or received and will cause a well-behaved client to close the socket connection from its end.
  • Line 41: The HUP (hang-up) event indicates that the client socket has been disconnected (i.e. closed), so this end is closed as well. There is no need to register interest in HUP events. They are always indicated on sockets that are registered with the epoll object.
  • Line 42: Unregister interest in this socket connection.
  • Line 43: Close the socket connection.
  • Lines 18-45: The try-finally block is included because the example program will most likely be interrupted by a KeyboardInterrupt exception.
  • Lines 46-48: Open socket connections don’t need to be closed since Python will close them when the program terminates. They’re included as a matter of good form.

Example 3

 1  import socket, select
 3  EOL1 = b'\n\n'
 4  EOL2 = b'\n\r\n'
 5  response  = b'HTTP/1.0 200 OK\r\nDate: Mon, 1 Jan 1996 01:01:01 GMT\r\n'
 6  response += b'Content-Type: text/plain\r\nContent-Length: 13\r\n\r\n'
 7  response += b'Hello, world!'
 9  serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
10  serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
11  serversocket.bind(('', 8080))
12  serversocket.listen(1)
13  serversocket.setblocking(0)
15  epoll = select.epoll()
16  epoll.register(serversocket.fileno(), select.EPOLLIN)
18  try:
19     connections = {}; requests = {}; responses = {}
20     while True:
21        events = epoll.poll(1)
22        for fileno, event in events:
23           if fileno == serversocket.fileno():
24              connection, address = serversocket.accept()
25              connection.setblocking(0)
26              epoll.register(connection.fileno(), select.EPOLLIN)
27              connections[connection.fileno()] = connection
28              requests[connection.fileno()] = b''
29              responses[connection.fileno()] = response
30           elif event & select.EPOLLIN:
31              requests[fileno] += connections[fileno].recv(1024)
32              if EOL1 in requests[fileno] or EOL2 in requests[fileno]:
33                 epoll.modify(fileno, select.EPOLLOUT)
34                 print('-'*40 + '\n' + requests[fileno].decode()[:-2])
35           elif event & select.EPOLLOUT:
36              byteswritten = connections[fileno].send(responses[fileno])
37              responses[fileno] = responses[fileno][byteswritten:]
38              if len(responses[fileno]) == 0:
39                 epoll.modify(fileno, 0)
40                 connections[fileno].shutdown(socket.SHUT_RDWR)
41           elif event & select.EPOLLHUP:
42              epoll.unregister(fileno)
43              connections[fileno].close()
44              del connections[fileno]
45  finally:
46     epoll.unregister(serversocket.fileno())
47     epoll.close()
48     serversocket.close()

epoll has two modes of operation, called edge-triggered and level-triggered. In the edge-triggered mode of operation a call to epoll.poll() will return an event on a socket only once after the read or write event occurred on that socket. The calling program must process all of the data associated with that event without further notifications on subsequent calls to epoll.poll(). When the data from a particular event is exhausted, additional attempts to operate on the socket will cause an exception. Conversely, in the level-triggered mode of operation, repeated calls to epoll.poll() will result in repeated notifications of the event of interest, until all data associated with that event has been processed. No exceptions normally occur in level-triggered mode.

For example, suppose a server socket has been registered with an epoll object for read events. In edge-triggered mode the program would need to accept() new socket connections until a socket.error exception occurs. Whereas in the level-triggered mode of operation a single accept() call can be made and then the epoll object can be queried again for new events on the server socket indicating that additional calls to accept() should be made.
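The difference is easy to observe on a local socket pair. In this Linux-only Python 3 sketch, one byte is sent and deliberately never read, and the function (whose name is invented here) counts how many of three successive polls report it.

```python
import select
import socket


def count_notifications(edge_triggered, polls=3):
    """Send one byte, never read it, and count how many polls report a read event."""
    left, right = socket.socketpair()
    epoll = select.epoll()
    try:
        mask = select.EPOLLIN
        if edge_triggered:
            mask |= select.EPOLLET
        epoll.register(right.fileno(), mask)
        left.sendall(b'x')
        # Each poll waits at most 0.1 seconds for an event.
        return sum(1 for _ in range(polls) if epoll.poll(0.1))
    finally:
        epoll.close()
        left.close()
        right.close()


if __name__ == '__main__':
    print('level-triggered:', count_notifications(False))
    print('edge-triggered: ', count_notifications(True))
```

In level-triggered mode every poll reports the unread byte; in edge-triggered mode only the first one does, which is why edge-triggered code has to drain the socket in a loop.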

Example 3 used level-triggered mode, which is the default mode of operation. Example 4 demonstrates how to use edge-triggered mode. In Example 4, lines 25, 36 and 45 introduce loops that run until an exception occurs (or all data is otherwise known to be handled). Lines 32, 38 and 48 catch the expected socket exceptions. Finally, lines 16, 28, 41 and 51 add the EPOLLET mask which is used to set edge-triggered mode.

Example 4

 1  import socket, select
 3  EOL1 = b'\n\n'
 4  EOL2 = b'\n\r\n'
 5  response  = b'HTTP/1.0 200 OK\r\nDate: Mon, 1 Jan 1996 01:01:01 GMT\r\n'
 6  response += b'Content-Type: text/plain\r\nContent-Length: 13\r\n\r\n'
 7  response += b'Hello, world!'
 9  serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
10  serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
11  serversocket.bind(('', 8080))
12  serversocket.listen(1)
13  serversocket.setblocking(0)
15  epoll = select.epoll()
16  epoll.register(serversocket.fileno(), select.EPOLLIN | select.EPOLLET)
18  try:
19     connections = {}; requests = {}; responses = {}
20     while True:
21        events = epoll.poll(1)
22        for fileno, event in events:
23           if fileno == serversocket.fileno():
24              try:
25                 while True:
26                    connection, address = serversocket.accept()
27                    connection.setblocking(0)
28                    epoll.register(connection.fileno(), select.EPOLLIN | select.EPOLLET)
29                    connections[connection.fileno()] = connection
30                    requests[connection.fileno()] = b''
31                    responses[connection.fileno()] = response
32              except socket.error:
33                 pass
34           elif event & select.EPOLLIN:
35              try:
36                 while True:
37                    requests[fileno] += connections[fileno].recv(1024)
38              except socket.error:
39                 pass
40              if EOL1 in requests[fileno] or EOL2 in requests[fileno]:
41                 epoll.modify(fileno, select.EPOLLOUT | select.EPOLLET)
42                 print('-'*40 + '\n' + requests[fileno].decode()[:-2])
43           elif event & select.EPOLLOUT:
44              try:
45                 while len(responses[fileno]) > 0:
46                    byteswritten = connections[fileno].send(responses[fileno])
47                    responses[fileno] = responses[fileno][byteswritten:]
48              except socket.error:
49                 pass
50              if len(responses[fileno]) == 0:
51                 epoll.modify(fileno, select.EPOLLET)
52                 connections[fileno].shutdown(socket.SHUT_RDWR)
53           elif event & select.EPOLLHUP:
54              epoll.unregister(fileno)
55              connections[fileno].close()
56              del connections[fileno]
57  finally:
58     epoll.unregister(serversocket.fileno())
59     epoll.close()
60     serversocket.close()
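
The receive loop in Example 4 ends only when recv() raises an exception. If the peer closes its end of the connection, recv() returns an empty bytes object instead of raising, so a drain loop may also want to stop on that condition. The following is a hedged sketch of such a helper; the name drain and its return convention are assumptions made for illustration and are not part of Example 4.

```python
import socket

def drain(sock, bufsize=1024):
    """Read from a non-blocking socket until it would block or the peer
    closes. Returns (data, peer_closed)."""
    chunks = []
    while True:
        try:
            chunk = sock.recv(bufsize)
        except BlockingIOError:
            # Nothing more to read for now: the edge has been consumed.
            return b''.join(chunks), False
        if not chunk:
            # Empty read: the peer closed its end of the connection.
            return b''.join(chunks), True
        chunks.append(chunk)

# Small demonstration on a local socket pair.
a, b = socket.socketpair()
a.setblocking(False)
b.sendall(b'x' * 3000)
data, closed = drain(a)           # everything sent so far, peer still open
b.close()
tail, closed_after = drain(a)     # no more data, peer has closed
a.close()
```

A handler built on this could close and unregister the connection when peer_closed is true, instead of looping forever on empty reads.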

Because level-triggered mode behaves much like the select and poll mechanisms, it is often used when porting an application that relied on them, while edge-triggered mode may be used when the programmer doesn’t need or want as much assistance from the operating system in managing event state.

In addition to these two modes of operation, sockets may also be registered with the epoll object using the EPOLLONESHOT event mask. When this option is used, the registered event is reported by only one call to epoll.poll(), after which the socket is disabled: it stays registered but produces no further events until it is re-armed with epoll.modify().
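
A minimal sketch of one-shot registration follows, assuming Linux and using a local socket pair rather than a network connection (both are illustrative choices). It polls twice after a single write, then re-arms the descriptor with epoll.modify() and polls once more.

```python
import select
import socket

reader, writer = socket.socketpair()
epoll = select.epoll()
epoll.register(reader.fileno(), select.EPOLLIN | select.EPOLLONESHOT)

writer.send(b'ping')
first = epoll.poll(0.1)     # the event is delivered once
second = epoll.poll(0.1)    # nothing reported: the descriptor is disarmed

# Re-arm the same descriptor; the data is still unread, so it is
# reported again on the next poll.
epoll.modify(reader.fileno(), select.EPOLLIN | select.EPOLLONESHOT)
third = epoll.poll(0.1)

epoll.close()
reader.close()
writer.close()
```

This pattern is sometimes used when several threads share one epoll object, since it keeps a second thread from being notified about a socket that another thread is still handling.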

Performance Considerations

Listen Backlog Queue Size

In Examples 1-4, line 12 shows a call to the serversocket.listen() method. The parameter for this method is the listen backlog queue size. It tells the operating system how many TCP/IP connections to accept and place on the backlog queue before they are accepted by the Python program. Each time the Python program calls accept() on the server socket, one of the connections is removed from the queue and that slot can be used for another incoming connection. If the queue is full, new incoming connections are silently ignored, causing unnecessary delays on the client side of the network connection. A production server usually handles tens or hundreds of simultaneous connections, so a value of 1 will usually be inadequate. For example, when using ab to perform load testing against these sample programs with 100 concurrent HTTP 1.0 clients, any backlog value less than 50 would often produce performance degradation.
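
As a sketch of how the backlog might be made configurable, the helper below wraps the socket setup from lines 9-13 of the examples. The function name make_listener and the default of 128 are assumptions chosen for illustration; the right value depends on the workload, and on Linux the kernel may cap it (see net.core.somaxconn).

```python
import socket

def make_listener(port, backlog=128, host=''):
    """Create a non-blocking listening socket with a configurable backlog.
    The default backlog of 128 is an illustrative value, not a recommendation."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, port))
    s.listen(backlog)       # queue up to `backlog` pending connections
    s.setblocking(0)
    return s

# Port 0 asks the operating system for any free port.
listener = make_listener(0, host='127.0.0.1')
listening = listener.getsockopt(socket.SOL_SOCKET, socket.SO_ACCEPTCONN)
bound_port = listener.getsockname()[1]
listener.close()
```

In the examples on this page, the equivalent change would be to replace the argument 1 on line 12 with a larger value.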

TCP Options

The TCP_CORK option can be used to “bottle up” messages until they are ready to send. This option, illustrated in lines 34 and 40 of Example 5, might be a good option to use for an HTTP server using HTTP/1.1 pipelining.

Example 5

 1  import socket, select
 3  EOL1 = b'\n\n'
 4  EOL2 = b'\n\r\n'
 5  response  = b'HTTP/1.0 200 OK\r\nDate: Mon, 1 Jan 1996 01:01:01 GMT\r\n'
 6  response += b'Content-Type: text/plain\r\nContent-Length: 13\r\n\r\n'
 7  response += b'Hello, world!'
 9  serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
10  serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
11  serversocket.bind(('', 8080))
12  serversocket.listen(1)
13  serversocket.setblocking(0)
15  epoll = select.epoll()
16  epoll.register(serversocket.fileno(), select.EPOLLIN)
18  try:
19     connections = {}; requests = {}; responses = {}
20     while True:
21        events = epoll.poll(1)
22        for fileno, event in events:
23           if fileno == serversocket.fileno():
24              connection, address = serversocket.accept()
25              connection.setblocking(0)
26              epoll.register(connection.fileno(), select.EPOLLIN)
27              connections[connection.fileno()] = connection
28              requests[connection.fileno()] = b''
29              responses[connection.fileno()] = response
30           elif event & select.EPOLLIN:
31              requests[fileno] += connections[fileno].recv(1024)
32              if EOL1 in requests[fileno] or EOL2 in requests[fileno]:
33                 epoll.modify(fileno, select.EPOLLOUT)
34                 connections[fileno].setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
35                 print('-'*40 + '\n' + requests[fileno].decode()[:-2])
36           elif event & select.EPOLLOUT:
37              byteswritten = connections[fileno].send(responses[fileno])
38              responses[fileno] = responses[fileno][byteswritten:]
39              if len(responses[fileno]) == 0:
40                 connections[fileno].setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)
41                 epoll.modify(fileno, 0)
42                 connections[fileno].shutdown(socket.SHUT_RDWR)
43           elif event & select.EPOLLHUP:
44              epoll.unregister(fileno)
45              connections[fileno].close()
46              del connections[fileno]
47  finally:
48     epoll.unregister(serversocket.fileno())
49     epoll.close()
50     serversocket.close()
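
The set-then-clear pattern on lines 34 and 40 of Example 5 can also be wrapped so that the cork is always released, even if sending fails. The following is a hedged sketch, assuming Linux (TCP_CORK is Linux-specific); the context manager name corked and the loopback connection used to exercise it are illustrative and not part of Example 5.

```python
import contextlib
import socket

@contextlib.contextmanager
def corked(sock):
    """Hold back partial TCP segments on `sock` until the block exits."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
    try:
        yield sock
    finally:
        # Removing the cork lets any queued data go out.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)

# Demonstration over a loopback TCP connection.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))
listener.listen(1)
client = socket.create_connection(listener.getsockname())
server_side, _ = listener.accept()

with corked(server_side):
    cork_inside = server_side.getsockopt(socket.IPPROTO_TCP, socket.TCP_CORK)
    server_side.send(b'HTTP/1.0 200 OK\r\n')      # header and body are
    server_side.send(b'\r\nHello, world!')        # queued together
cork_after = server_side.getsockopt(socket.IPPROTO_TCP, socket.TCP_CORK)

for s in (client, server_side, listener):
    s.close()
```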

On the other hand, the TCP_NODELAY option can be used to tell the operating system that any data passed to socket.send() should immediately be sent to the client without being buffered by the operating system. This option, illustrated in line 14 of Example 6, might be a good option to use for an SSH client or other “real-time” application.

Example 6

 1  import socket, select
 3  EOL1 = b'\n\n'
 4  EOL2 = b'\n\r\n'
 5  response  = b'HTTP/1.0 200 OK\r\nDate: Mon, 1 Jan 1996 01:01:01 GMT\r\n'
 6  response += b'Content-Type: text/plain\r\nContent-Length: 13\r\n\r\n'
 7  response += b'Hello, world!'
 9  serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
10  serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
11  serversocket.bind(('', 8080))
12  serversocket.listen(1)
13  serversocket.setblocking(0)
14  serversocket.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
16  epoll = select.epoll()
17  epoll.register(serversocket.fileno(), select.EPOLLIN)
19  try:
20     connections = {}; requests = {}; responses = {}
21     while True:
22        events = epoll.poll(1)
23        for fileno, event in events:
24           if fileno == serversocket.fileno():
25              connection, address = serversocket.accept()
26              connection.setblocking(0)
27              epoll.register(connection.fileno(), select.EPOLLIN)
28              connections[connection.fileno()] = connection
29              requests[connection.fileno()] = b''
30              responses[connection.fileno()] = response
31           elif event & select.EPOLLIN:
32              requests[fileno] += connections[fileno].recv(1024)
33              if EOL1 in requests[fileno] or EOL2 in requests[fileno]:
34                 epoll.modify(fileno, select.EPOLLOUT)
35                 print('-'*40 + '\n' + requests[fileno].decode()[:-2])
36           elif event & select.EPOLLOUT:
37              byteswritten = connections[fileno].send(responses[fileno])
38              responses[fileno] = responses[fileno][byteswritten:]
39              if len(responses[fileno]) == 0:
40                 epoll.modify(fileno, 0)
41                 connections[fileno].shutdown(socket.SHUT_RDWR)
42           elif event & select.EPOLLHUP:
43              epoll.unregister(fileno)
44              connections[fileno].close()
45              del connections[fileno]
46  finally:
47     epoll.unregister(serversocket.fileno())
48     epoll.close()
49     serversocket.close()
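
Example 6 sets TCP_NODELAY on the listening socket. Whether accepted connections pick up the option from the listening socket may depend on the platform, so a more explicit variant sets it on each accepted connection. The sketch below shows that variant; the helper name accept_nodelay is an assumption for illustration and does not appear in Example 6.

```python
import socket

def accept_nodelay(listener):
    """Accept one connection and disable Nagle's algorithm on it."""
    connection, address = listener.accept()
    connection.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    connection.setblocking(0)
    return connection, address

# Demonstration over a loopback TCP connection.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))
listener.listen(1)
client = socket.create_connection(listener.getsockname())

connection, address = accept_nodelay(listener)
nodelay_on = connection.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)

for s in (client, connection, listener):
    s.close()
```

In the event loop of Example 6, this would take the place of the accept() and setblocking() calls on lines 25 and 26.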

Source Code

The examples on this page are in the public domain and available for download.



Memory Leak Detection in C++

Posted on April 10, 2009. Filed under: C/C++, Programming |

Memory Leak Detection in C++ under linux
dmalloc, ccmalloc, NJAMD, YAMD, Valgrind, mpatrol, Insure ++

C/C++ Memory Corruption And Memory Leaks
This tutorial will discuss examples of memory leaks and code constructs which lead to memory corruption

Memory Leak Detection and Isolation in Windows


Java memory leaks

Posted on April 10, 2009. Filed under: Java, Programming |

Garbage collection in the Java™ programming language simplifies memory management and eliminates typical memory problems. However, contrary to popular belief, garbage collection can not take care of all memory problems. One such problem is of Java memory leaks, which are harder to detect because they usually result from design and implementation errors (for example, a reference to an object kept beyond its useful life). This article demystifies Java memory leaks, provides an overview of the Java leak detection technology in IBM® Rational® Application Developer (IRAD), and shares the IBM JVM L3 support group’s experiences using the technology. It proposes exploiting this technology in the Development and Testing phases, and suggests how to design test cases for detecting Java memory leaks.

Java memory leaks — Catch me if you can
Detecting Java leaks using IBM Rational Application Developer 6.0

Finding Memory Leaks in Java Apps
Here is a small HOWTO on how to find memory leaks with Java SE.
I’ve written it while trying to find memory leaks in our testing tools: JTHarness and ME Framework, and then wanted to share the HOWTO with the world, but I didn’t have my blog at that time, so I posted this info as a comment to a relevant entry in the excellent Adam Bien’s blog.

Handling memory leaks in Java programs
Find out when memory leaks are a concern and how to prevent them

How to Fix Memory Leaks in Java

