Archive for January, 2010

ubuntu livecd mount LVM2 fs

Posted on January 28, 2010. Filed under: Linux | Tags: , , , , , |

1. Startup ubuntu livecd

2. sudo apt-get install lvm2

do NOT reboot!!

3.if you mount as following:

sudo mount -t auto /dev/sda2 /mnt

you will get error: mount: unknown filesystem type ‘LVM2_member’

4. check the phisical disk

[ubuntu] $sudo pvs

PV          VG         Fmt  Attr PSize   PFree 

  /dev/sda1  VolGroup00 lvm2 a-   100M 40.00M

/dev/sda2 VolGroup01 lvm2 a- 200G 40.00M
5. check the volume

[ubuntu]$sudo vgs

 VG         #PV #LV #SN Attr   VSize   VFree 
  VolGroup01   1   4   0 wz--n- 200G 40.00M

6. check the logical volume

[ubuntu]lvdisplay

  --- Logical volume ---
  LV Name                /dev/VolGroup01/LogVol03
  VG Name                VolGroup01
  LV UUID                YhG8Fu-Zrek-zx8D-AzxC-Dz34-d32F-dfal23I
  LV Write Access        read/write
  LV Status              unenable
  # open                 1
  LV Size                200GB
  Current LE             781
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

7. active the logic vol

[ubuntu]vgchange -ay /dev/VolGroup01

8. Check the status again

[ubuntu]lvdisplay

  --- Logical volume ---
  LV Name                /dev/VolGroup01/LogVol03
  VG Name                VolGroup01
  LV UUID                YhG8Fu-Zrek-zx8D-AzxC-Dz34-d32F-dfal23I
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                200GB
  Current LE             781
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

9. mount it

mount /dev/VolGroup01/LogVol03 /mnt


Reference:

http://fedoraforum.org/forum/archive/index.php/t-64964.html

http://forums.opensuse.org/install-boot-login/396994-mount-lvm2-partition.html

http://www.howtoforge.com/linux_lvm

Read Full Post | Make a Comment ( 3 so far )

staticmethod vs classmethod in Python

Posted on January 19, 2010. Filed under: Python | Tags: , , |

Python’s static methods have a similar implementation as Java & C++. Static methods were not introduced into Python until version 2.2

Example of version 2.2 and higher implementation:


>>> class Foo:
...     def bar(arg):
...         Foo.arg = arg
...     bar = staticmethod(bar)
...
>>> Foo.bar('Hello World')
>>> Foo.arg
'Hello World'
>>> Foo().bar('Hello')
>>> Foo.arg
'Hello'

Static methods can be called either on the class (such as Foo.bar()) or on an instance (such as Foo().bar()). The instance is ignored except for its class.

In version 2.4, function decorator syntax was added, which allows another way to define a static method. If you are using 2.4 or above, this is the recommended way of creating a static method.

Example of version 2.4 and higher implementation:


>>> class Foo:
...     @staticmethod
...     def bar(arg):
...         Foo.arg = arg
...
>>> Foo.bar('Hello World')
>>> Foo.arg
'Hello World'
>>> Foo().bar('Hello')
>>> Foo.arg
'Hello'

If you are looking to do more advanced static methods, look into using classmethod instead of staticmethod. One of the differences between the two is that class method receives the class as implicit first argument, just like an instance method receives the instance. For further reading on class method, refer to the built in functions page.

———————————————————-

classmethod(function)
Return a class method for function.

A class method receives the class as implicit first argument, just like an instance method receives the instance. To declare a class method, use this idiom:

class C:
    @classmethod
    def f(cls, arg1, arg2, ...): ...

The @classmethod form is a function decorator – see the description of function definitions in Function definitions for details.

It can be called either on the class (such as C.f()) or on an instance (such as C().f()). The instance is ignored except for its class. If a class method is called for a derived class, the derived class object is passed as the implied first argument.

Class methods are different than C++ or Java static methods. If you want those, see staticmethod() in this section.

For more information on class methods, consult the documentation on the standard type hierarchy in The standard type hierarchy.

New in version 2.2.

Changed in version 2.4: Function decorator syntax added.

staticmethod(function)
Return a static method for function.

A static method does not receive an implicit first argument. To declare a static method, use this idiom:

class C:
    @staticmethod
    def f(arg1, arg2, ...): ...

The @staticmethod form is a function decorator – see the description of function definitions in Function definitions for details.

It can be called either on the class (such as C.f()) or on an instance (such as C().f()). The instance is ignored except for its class.

Static methods in Python are similar to those found in Java or C++. For a more advanced concept, see classmethod() in this section.

For more information on static methods, consult the documentation on the standard type hierarchy in The standard type hierarchy.

New in version 2.2.

Changed in version 2.4: Function decorator syntax added.

—————————————————————————

Being educated under Java background, static method and class method are the same thing.

But not so in Python, there is subtle difference:

Say function a() is defined in Parent Class, while Sub Class extends Parent Class

If function a() has @staticmethod decorator, Sub.a() still refers to definition inside Parent Class. Whereas,

If function a() has @classmethod decorator, Sub.a() will points definition inside Sub Class.

Let’s talk about some definitions here:

@staticmethod function is nothing more than a function defined inside a class. It is callable without instantiating the class first. It’s definition is immutable via inheritance.

@classmethod function also callable without instantiating the class, but its definition follows Sub class, not Parent class, via inheritance. That’s because the first argument for @classmethod function must always be cls (class).

————————————————-

Usually, static methods would be used for Singleton or Factory Patterns (http://en.wikipedia.org/wiki/Factory_method_pattern), but if you review this page, you will see Python doesn’t require a static method to be used to implement the Factory pattern.

Coming up with an example for static methods can be difficult. I usually use static methods in python for organization purposes. For instance, if I made a DB abstraction layer and had a few functions that were related to the DB layer, but weren’t necessary used in the class, I would added them as static methods. Keeps those functions organized in once place, plus I find it easier to remember.

For example:

>>> data = db_convertAsciiToHtml(”Foo & Bar”)

OR

>>> data = DB.convertAsciiToHtml(”Foo & Bar”)

References:

http://www.techexperiment.com/2008/08/21/creating-static-methods-in-python/

http://docs.python.org/library/functions.html

http://rapd.wordpress.com/2008/07/02/python-staticmethod-vs-classmethod/

Read Full Post | Make a Comment ( 1 so far )

Major and Minor Numbers

Posted on January 18, 2010. Filed under: Linux, Services | Tags: , , , , |

One of the basic features of the Linux kernel is that it abstracts the handling of devices. All hardware devices look like regular files; they can be opened, closed, read and written using the same, standard, system calls that are used to manipulate files. Every device in the system is represented by a file. For block (disk) and character devices, these device files are created by the mknod command and they describe the device using major and minor device numbers. Network devices are also represented by device special files but they are created by Linux as it finds and initializes the network controllers in the system.

To UNIX, everything is a file. To write to the hard disk, you write to a file. To read from the keyboard is to read from a file. To store backups on a tape device is to write to a file. Even to read from memory is to read from a file. If the file from which you are trying to read or to which you are trying to write is a “normal” file, the process is fairly easy to understand: the file is opened and you read or write data. If, however, the device you want to access is a special device file (also referred to as a device node), a fair bit of work needs to be done before the read or write operation can begin.

One key aspect of understanding device files lies in the fact that different devices behave and react differently. There are no keys on a hard disk and no sectors on a keyboard, though you can read from both. The system, therefore, needs a mechanism whereby it can distinguish between the various types of devices and behave accordingly.

To access a device accordingly, the operating system must be told what to do. Obviously, the manner in which the kernel accesses a hard disk will be different from the way it accesses a terminal. Both can be read from and written to, but that’s about where the similarities end. To access each of these totally different kinds of devices, the kernel needs to know that they are, in fact, different.

Inside the kernel are functions for each of the devices the kernel is going to access. All the routines for a specific device are jointly referred to as the device driver. Each device on the system has its own device driver. Within each device driver are the functions that are used to access the device. For devices such as a hard disk or terminal, the system needs to be able to (among other things) open the device, write to the device, read from the device, and close the device. Therefore, the respective drivers will contain the routines needed to open, write to, read from, and close (among other things) those devices.

The kernel needs to be told how to access the device. Not only does the kernel need to be told what kind of device is being accessed but also any special information, such as the partition number if it’s a hard disk or density if it’s a floppy, for example. This is accomplished by the major number and minor number of that device.

All devices controlled by the same device driver have a common major device number. The minor device numbers are used to distinguish between different devices and their controllers. Linux maps the device special file passed in system calls (say to mount a file system on a block device) to the device’s device driver using the major device number and a number of system tables, for example the character device table, chrdevs. The major number is actually the offset into the kernel’s device driver table, which tells the kernel what kind of device it is (whether it is a hard disk or a serial terminal). The minor number tells the kernel special characteristics of the device to be accessed. For example, the second hard disk has a different minor number than the first. The COM1 port has a different minor number than the COM2 port, each partition on the primary IDE disk has a different minor device number, and so forth. So, for example, /dev/hda2, the second partition of the primary IDE disk has a major number of 3 and a minor number of 2.

It is through this table that the routines are accessed that, in turn, access the physical hardware. Once the kernel has determined what kind of device to which it is talking, it determines the specific device, the specific location, or other characteristics of the device by means of the minor number.

The major number for the hd (IDE) driver is hard-coded at 3. The minor numbers have the format

(<unit>*64)+<part>

where <unit> is the IDE drive number on the first controller, either 0 or 1, which is then multiplied by 64. That means that all hd devices on the first IDE drive have a minor number less than 64. <part> is the partition number, which can be anything from 1 to 20. Which minor numbers you will be able to access will depend on how many partitions you have and what kind they are (extended, logical, etc.). The minor number of the device node that represents the whole disk is 0. This has the node name hda, whereas the other device nodes have a name equal to their minor number (i.e., /dev/hda6 has a minor number 6).

If you were to have a second IDE on the first controller, the unit number would be 1. Therefore, all of the minor numbers would be 64 or greater. The minor number of the device node representing the whole disk is 1. This has the node name hdb, whereas the other device nodes have a name equal to their minor number plus 64 (i.e., /dev/hdb6 has a minor number 70).

If you have more than one IDE controller, the principle is the same. The only difference is that the major number is 22.

For SCSI devices, the scheme is a little different. When you have a SCSI host adapter, you can have up to seven hard disks. Therefore, we need a different way to refer to the partitions. In general, the format of the device nodes is

sd<drive><partition>

where sd refers to the SCSI disk driver, <drive> is a letter for the physical drive, and <partition> is the partition number. Like the hd devices, when a device refers to the entire disk, for example the device sda refers to the first disk.

The major number for all SCSI drives is 8. The minor number is based on the drive number, which is multiplied by 16 instead of 64, like the IDE drives. The partition number is then added to this number to give the minor.

The partition numbers are not as simple to figure out. Partition 0 is for the whole disk (i.e., sda). The four DOS primary partitions are numbered 14. Then the extended partitions are numbered 58. We then add 16 for each drive. For example:

brw-rw----   1 root     disk       8,  22 Sep 12  1994
/dev/sdb6

Because b is after the sd, we know that this is on the second drive. Subtracting 16 from the minor, we get 6, which matches the partition number. Because it is between 4 and 8, we know that this is on an extended partition. This is the second partition on the first extended partition.

The floppy devices have an even more peculiar way of assigning minor numbers. The major number is fairly easy its 2. Because the names are a little easier to figure out, lets start with them. As you might guess, the device names all begin with fd. The general format is

fd<drive><density><capacity>

where <drive> is the drive number (0 for A:, 1 for B:), <density> is the density (d-double, h-high), and <capacity> is the capacity of the drive (360Kb, 1440Kb). You can also tell the size of the drive by the density letter, lowercase letter indicates that it is a i5.25″ drive and an uppercase letter indicates that it is a 3.5″ driver. For example, a low-density 5.25″ drive with a capacity of 360Kb would look like

fd0d360

If your second drive was a high-density 3.5″ drive with a capacity of 1440Kb, the device would look like

fd1H1440

What the minor numbers represents is a fairly complicated process. In the fd(4) man-page there is an explanatory table, but it is not obvious from the table why specific minor numbers go with each device. The problem is that there is no logical progression as with the hard disks. For example, there was never a 3.5″ with a capacity of 1200Kb nor has there been a 5.25″ with a capacity of 1.44Mb. So you will never find a device with H1200 or h1440. So to figure out the device names, the best thing is to look at the man-page.

The terminal devices come in a few forms. The first is the system console, which are the devices tty0-tty?. You can have up to 64 virtual terminals on you system console, although most systems that I have seen are limited five or six. All console terminals have a major number of 4. As we discussed earlier, you can reach the low numbered ones with ALT-Fn, where n is the number of the function key. So ALT-F4 gets you to the fourth virtual console. Both the minor number and the tty number are based on the function key, so /dev/tty4 has a minor number 4 and you get to it with ALT-F4. (Check the console(4) man-page to see how to use and get to the other virtual terminals.)

Serial devices can also have terminals hooked up to them. These terminals normally use the devices /dev/ttySn, where n is the number of the serial port (0, 1, 2, etc.). These also have a minor number of 4, but the minor numbers all start at 64. (Thats why you can only have 63 virtual consoles.) The minor numbers are this base of 64, plus the serial port number (04). Therefore, the minor number of the third serial port would be 64+3=67.

Related to these devices are the modem control devices, which are used to access modems. These have the same minor numbers but have a major number of 5. The names are also based on the device number. Therefore, the modem device attached to the third serial port has a minor number of 64+3=67 and its name is cua3.

Another device with a major number of 5 is /dev/tty, which has a minor number of 0. This is a special device and is referred to as the “controlling terminal. ” This is the terminal device for the currently running process. Therefore, no matter where you are, no matter what you are running, sending something to /dev/tty will always appear on your screen.

The pseudo-terminals (those that you use with network connections or X) actually come in pairs. The “slave” is the device at which you type and has a name like ttyp?, where ? is the tty number. The device that the process sees is the “master” and has a name like ptyn, where n is the device number. These also have a major number 4. However, the master devices all have minor numbers based on 128. Therefore, pty0 has a minor number of 128 and pty9 has a minor number of 137 (128+9). The slave device has minor numbers based on 192, so the slave device ttyp0 has a minor number of 192. Note that the tty numbers do not increase numerically after 9 but use the letter af.

Other oddities with the device numbering and naming scheme are the memory devices. These have a major number of 1. For example, the device to access physical memory is /dev/mem with a minor number of 1. The kernel memory device is /dev/kmem, and it has a minor number of 2. The device used to access IO ports is /dev/port and it has a minor number of 4.

What about minor number 3? This is for device /dev/null, which is nothing. If you direct output to this device, it goes into nothing, or just disappears. Often the error output of a command is directed here, as the errors generated are uninteresting. If you redirect from /dev/null, you get nothing as well. Often I do something like this:

cat /dev/null > file_name

This device is also used a lot in shell scripts where you do not want to see any output or error messages. You can then redirect standard out or standard error to /dev/null and the messages disappear. For details on standard out and error, see the section on redirection.

If file_name doesn’t exist yet, it is created with a length of zero. If it does exist, the file is truncated to 0 bytes.

The device /dev/zero has a major number of 5 and its minor number is 5. This behaves similarly to /dev/null in that redirecting output to this device is the same as making it disappear. However, if you direct input from this device, you get an unending stream of zeroes. Not the number 0, which has an ASCII value of 48this is an ASCII 0.

Are those all the devices? Unfortunately not. However, I hope that this has given you a start on how device names and minor numbers are configured. The file <linux/major.h> contains a list of the currently used (at least well known) major numbers. Some nonstandard package might add a major number of its own. Up to this point, they have been fairly good about not stomping on the existing major numbers.

As far as the minor numbers go, check out the various man-pages. If there is a man-page for a specific device, the minor number will probably be under the name of the driver. This is in major.h or often the first letter of the device name. For example, the parallel (printer) devices are lp?, so check out man lp.

The best overview of all the major and minor numbers is in the /usr/src/linux/Documentation directory. The devices.txt is considered the “authoritative” source for this information.

References:

http://www.linux-tutorial.info/modules.php?name=MContent&pageid=94

http://www.linux-tutorial.info/modules.php?name=MContent&obj=glossary&term=man-page

http://www.kernel.org/pub/linux/docs/device-list/devices.txt

http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000ic/index.jsp?topic=/com.ibm.storage.ssic.help.doc/f2c_linuxdevnaming_2hsag8.html

http://docsrv.sco.com/HDK_concepts/ddT_majmin.html

http://burks.brighton.ac.uk/burks/linux/rute/node18.htm

http://www.makelinux.info/ldd3/chp-3-sect-2.shtml

http://www.thegeekstuff.com/2009/06/how-to-identify-major-and-minor-number-in-linux/

Read Full Post | Make a Comment ( None so far )

XenIntro

Posted on January 8, 2010. Filed under: C/C++, Linux, Services | Tags: , |

Xen Intro- version 1.0:

Contents

  1. Introduction
  2. Xen and IA32 Protection Modes
  3. The Xend daemon:
  4. The Xen Store:
  5. VT-x (virtual technology) processors – support in Xen
  6. Vmxloader
  7. VT-i (virtual technology) processors – support in Xen
  8. AMD SVM
  9. Xen On Solaris
  10. Step by step example of creating guest OS with Virtual Machine Manager in Fedora Core 6
  11. Physical Interrupts
  12. Backend Drivers:
  13. Migration and Live Migration:
  14. Creating of a domain – behind the scenes:
  15. HyperCalls Mapping to code Xen 3.0.2
  16. Virtualization and the Linux Kernel
  17. Pre-Virtualization
  18. Xen Storage
  19. kvm – Kernel-based Virtualization Driver
  20. Tip: How to build Xen with your own tar ball
  21. Xen in the Linux Kernel
  22. VMI : Virtual Machine Interface
  23. Links
  24. Adding new device and triggering the probe() functions
    1. deviceback.c
    2. xenbus.c
    3. common.h
    4. Makefile
  25. Adding a frontend device
    1. Makefile
    2. devicefront.c
  26. Discussion

Introduction

All of the following text refers to x86 platform of Xen-unstable, unless otherwise explicitly said. We will deal only with Xen on linux 2.6 ; We are not dealing at all with Xen on linux 2.4 (and as far as we know, in the future, domain 0 is intended to be based ONLY on 2.6). Moreover,currently the 2.4 linux tree is removed from Xen Tree (changeset 7263:f1abe953e401 from 8.10.05) but it can be that it will be back when some problems will be fixed.

This document deals only with Xen 3.0 version unless explictily said otherwise.

This is not intended to be a full and detailed documentation of the Xen project but we hope it will be a starting point to anyone who is interested in Xen and wants to learn more.

The Xen Project team is permitted to take part or all of this document and integrate it with the official Xen documentation or put it as a standalone document in the Xen Web Site if they wish, without any further notice.

This is not a complete detailed document nor a full walkthrough ,and many important issues are omitted. Any feedback is welcomed to : Rami Rosen , ramirose@gmail.com

Xen and IA32 Protection Modes

In the classical protection model of IA-32, there are 4 privilege levels; The highest ring is 0, where the kernel runs. (this level is also called SuperVisor Mode) The lowest is ring 3, where User applications run (this level is also called User Mode) Issuing some instructions , which are called “privileged instructions” , from ring which is NOT ring 0, will cause a General Protection Fault.

Ring 1,2 were not used through the years (except for in the case of OS/2). When running Xen, we run a Hypervisor in ring 0 and the guest OS in ring 1. The applications run unmodified at ring 3.

BTW, there are of course architectures which have a different privilege models; for example, in PPC both domain 0 and the Unprivileged domains run in supervisor mode. Diagram: Xen and IA32 Protection Modes

rings

The Xend daemon:

The Xend Daemon handles requests issued from Domain 0; requests can be, for example, creating a new domain (“xm create”) or listing the domains (“xm list”), shutting down a domain (“xm destroy”). Running “xm help” will show all possibilities.

You start the Xend daemon by running, after booting into domain0, “xend start”. “xend start” creates two daemons: xenstored and xenconsoled (see toos/misc/xend). It also creates an instance of a python SrvDaemon class and calls its start() method. (see tools/python/xen/xend/server/SrvDaemon.py).

The SrvDaemon start() method is in fact the xend main program.

In the past,the start() method of SrvDaemon eventually started an http socket (8000) on which it listened to http requests. Now it does not open an http socket on port 8000 anymore.

Note : There is an altenative to the management layer of Xen which is called libvirt; see http://libvir.org. This is a free API (LGPL)

The Xen Store:

The Xen Store Daemon provides a simple tree-like database to which we can read and write values. The Xen Store code is mainly under tools\xenstore.

It replaces the XCS, which was a daemon handling control messages.

The physical xenstore resides in one file: /var/lib/xenstored/tdb. (previously it was sacttered in some files; the change to using one file (named “tdb”) was probably to increase performance).

Both user space (“tools” in Xen terminology) and kernel code can write to the XenStore.The kernel code writes to the XenStore by using XenBus.

The python scripts (under tools/python) uses lowlevel/xs.c to read/write to the XenStore.

The Xen Store Daemon is started in xenstored_core.c. It creates a device file (“/dev/xen/evtchn”) in case such a device file does not exists and it opens it. (see : domain_init() ,file tools/xenstore/xenstored_domain.c).

It opens 2 TCP sockets (UNIX sockets). One of these sockets is a Read-Only socket, and it resides under /var/run/xenstored/socket_ro. The second is /var/run/xenstored/socket.

Connections on these sockets are represented by the connection struct.

A connection can be in one of three states:

        BLOCKED (blocked by a transaction)
        BUSY    (doing some action)
        OK      (completed it's transaction)

struct connection is declared in xenstore/xenstored_core.h; When a socket is ReadOnly,the “can_write” member of it is false.

Then we start an endless loop in which we can get input/output from three sources: the two sockets and the event channel, mentioned above.

Events, which are received in the event channel,are handled by handle_event() method (file xenstored_domain.c).

There are six executables under tools/xenstore, five of which are in fact made from the same module, which is xenstore_client.c, each time built with a different DEFINE passed. (See the Makefile). The sixth tool is built from xsls.c

These executables are : xenstore-exists, xenstore-list, xenstore-read, xenstore-rm, xenstore-write and xsls.

You can use these executable for accessing xenstore. For example: to view the list of fields of domain 0 which has a path “local/domain/0”, you run:

xenstore-list /local/domain/0

and a typical result can be the following list:

cpu
memory
name
console
vm
domid
backend

The xsls command is very useful and recursively shows the contents of a specified XenStore path. Essentially it does a xenstore-list and then a xenstore-read for each returned field, displaying the fields and their values and then repeating this recursively on each sub-path. For example: to view information about all VIFs backends hosted in domain 0 you may use the following command.

xsls /local/domain/0/backend/vif

and a typical result may be:

14 = ""
 0 = ""
  bridge = "xenbr0"
  domain = "vm1"
  handle = "0"
  script = "/etc/xen/scripts/vif-bridge"
  state = "4"
  frontend = "/local/domain/14/device/vif/0"
  mac = "aa:00:00:22:fe:9f"
  frontend-id = "14"
  hotplug-status = "connected"
15 = ""
 0 = ""
  mac = "aa:00:00:6e:d8:46"
  state = "4"
  handle = "0"
  script = "/etc/xen/scripts/vif-bridge"
  frontend-id = "15"
  domain = "vm2"
  frontend = "/local/domain/15/device/vif/0"
  hotplug-status = "connected"

(The xenstored must be running for these six executables to run; If xenstored is not running, then running theses executables will usually hang. The Xend daemon can be stopped).

An instance of struct node is the elementary unit of the XenStore. (struct node is defined in xenstored_core.h). The actual writing to the XenStore is done by write_node() method of xenstored_core.c.

xen_start_info structure has a member named :store_evtchn. (declared in public/xen.h as u16). This is the event channel for store communication.

VT-x (virtual technology) processors – support in Xen

Note: following text refers only to IA-32 unless explicitly said otherwise.

Intel had announced Pentium® 4 672 and 662 processors in November 2005 with virtualization support. (see, for example: http://www.physorg.com/news8160.html).

How does Xen support the Intel Virtualization Technology ?

The VT extensions support in Xen3 code is mostly in xen/arch/x86/hvm/vmx*.c.

  • and xen/include/asm-x86/vmx*.h and xen/arch/x86/x86_32/entry.S.

arch_vcpu structure (file xen/include/asm-x86/domain.h) contains a member which is called arch_vmx and is an instance of arch_vmx_struct. This member is also important to understand the VT-x mechanism.

But the most important structure for VT-x is the VMCS( vmcs_struct in the code) which represents the VMCS region.

The definition (file include/asm-x86/vmx_vmcs.h) is short:

struct vmcs_struct

  • { u32 vmcs_revision_id; unsigned char data [0]; /* vmcs size is read from MSR */ };

The VMCS region contains six logical regions; most relevant to our discussions are Guest-state area and Host-state area. We will also deal with the other four regions: VM-execution control fields,VM-exit control fields, VM-entry control fields and VM-exit information fields.

Intel added 10 new opcodes in VT-x to support Intel Virtualization Technology. They are detailed in the end of this section.

When using this technology, Xen runs in “VMX root operation mode” while the guest domains (which are unmodified OSs) run in “VMX non-root operation mode”. Since the guest domains run in “non-root operation” mode, it is more restricted,meaning that certain actions will cause “VM exit” to the VM.

Xen enters VMX operation in start_vmx() method. ( file xen/arch/x86/vmx.c)

This method is called from init_intel() method (file xen/arch/x86/cpu/intel.c.) (CONFIG_VMX should be defined).

First we check the X86_FEATURE_VMXE bit in ecx register to see if the cpuid shows that there is support for VMX in the processor. In IA-32 Intel added in the CR4 control register a bit specifying whether we want to enable VMX. So we must set this bit to enable VMX on the processor (by calling set_in_cr4(X86_CR4_VMXE)); This bit is bit 13 in CR4 (VMXE).

Then we call _vmxon to start VMX operation. If we will try to start VMX operation by _vmxon when the VMXE bit in CR4 is not set we will get exception (#UD , for undefined opcode)

In IA-64, things are a little different due to different architecture structure: Intel added a new bit in IA-64 in the Processor Status Register (PSR). This is bit 46 and it’s called VM. It should be set to 1 in guest OSs; and when it’s values is 1 , certain instructions cause virtualization fault.

VM exit:

Some instructions can cause unconditionally VM exit and some can cause VM exit under certain VM-execution control fields. (see the discussion about VMX-region above)

The following instructions will cause VM exit unconditionally: CPUID, INVD, MOV from CR3, RDMSR, WRMSR, and all the new VT-x instructions (which are listed below).

There are other instruction like HLT,INVPLG (Invalidate TLB Entry instruction) MWAIT and others which will cause VM exit if a corresponding VM-execution control was set.

Apart from VM-execution control fields, there are 2 bitmpas which are used for determining whether to perform VM exit: The first is the exception bitmap (see EXCEPTION_BITMAP in vmcs_field enum , file xen/include/asm-x86/vmx_vmcs.h). This bitmap is 32 bit field; when a bit is set in this bitmap, this causes a VM exit if a corresponding exception occurs; by default ,the entries which are set are EXCEPTION_BITMAP_PG (for page fault) and EXCEPTION_BITMAP_GP (for General Protection). see MONITOR_DEFAULT_EXCEPTION_BITMAP in vmx.h.

The second bitmap is the I/O bitmap (in fact, there are 2 I/O bitmaps,A and B, each is 4KB in size) which controls I/O instructions on ports. I/O bitmap A contains the ports in the range 0000-7FFF and I/O bitmap B contains the ports in the range 8000-FFFF. (one bit for each I/O port). see IO_BITMAP_A and IO_BITMAP_B in vmcs_field enum (VMCS Encordings).

When there is an “VM exit” we reach the vmx_vmexit_handler(struct cpu_user_regs regs) in vmx.c. We handle the VM exit according to the exit reason which we read from the VMCS region. We read the vmcs by calling vmread() ; The return value of vmread is 0 in case of success.

We sometimes also need to read some additional data (VM_EXIT_INTR_INFO) from the vmcs.

We get additional data by getting the “VM-exit interruption information” which is a 32 bit field and the “Exit qualification” (64 bit value).

For example, if the exception was NMI, we check if it is valid by checking bit 31 (valid bit) of the VM-exit interruption field. In case it is not valid we call _hvm_bug() to print some statistics and crash the domain.

Example of reading the “Exit qualification” field is in the case where the VMEXIT was caused by issuing INVPLG instruction.

When we work with vt-x, the guest OSs work in shadow mode, meaning they use shadow page tables; this is because the guest kernel in a VMX guest does not know that it’s being virtualized. There is no software visible bit which indicates that the processor is in VMX non-root operation. We set shadow mode by calling shadow_mode_enable() in vmx_final_setup_guest() method (file vmx.c).

There are 43 basic exit reasons – you can see part of them in vmx.h (fields starting with EXIT_REASON_ like EXIT_REASON_EXCEPTION_NMI, which is exit reason number 0, and so on).

In VT-x, Xen will probably use an emulated devices layer which will send virtual interrupts to the VMM. We can prevent the OS from receiving interrupts by setting the IF flag of EFLAGS.

The new ten opcodes which Intel added in Vt-x are detailed below:

1) VMCALL: (VMCALL_OPCODE in vmx.h)

  • This simply calls the VM monitor, causing vm exit.

2) VMCLEAR: (VMCLEAR_OPCODE in vmx.h)

  • copies VMCS data to memory in case it does not written there.
  • wrapper : _vmpclear (u64 addr) in vmx.h.

3) VMLAUNCH (VMLAUNCH_OPCODE in vmx.h)

  • launched a virtual machine; changes the launch state of the VMCS to
    • launched (if it is clear)

4) VMPTRLD (VMPTRLD_OPCODE file vmx.h)

  • loads a pointer to the VMCS.
    • wrapper : _vmptrld (u64 addr) (file vmx.h)

5) VMPTRST (VMPTRST_OPCODE in vmx.h)

  • stores a pointer to the VMCS.wrapper : _vmptrst (u64 addr) (file vmx.h.)

6) VMREAD (VMREAD_OPCODE in vmx.h)

  • read specified field from VMCS.
  • wrapper : _vmread(x, ptr) (file vmx.h)

7) VMRESUME (VMRESUME_OPCODE in vmx.h)

  • resumes a virtual machine ; in order it to resume the VM,
    • the launch state of the VMCS should be “clear.

8) VMWRITE (VMWRITE_OPCODE in vmx.h)

  • write specified field in VMCS. wrapper _vmwrite (field, value).

9) VMXOFF (VMXOFF_OPCODE in vmx.h)

  • terminates VMX operation.
    • wrapper : _vmxoff (void) (file vmx.h.)

10) VMXON (VMXON_OPCODE in vmx.h)

  • starts VMX operation.wrapper : _vmxon (u64 addr) (file vmx.h.)

QEMU and VT-D The io in Vt-x is performed by using QEMU. The QEMU code which Xen uses is under tools/ioemu. It is based on version 0.6.1 of QEMU. This version was patched accrording to Xen needs. Also AMD SVM uses QEUMU emulation.

The default network card which QEMU uses in Vt-x is AMD PCnet-PCI II Ethernet Controller. (file tools/ioemu/hw/pcnet.c). The reason to prefer this nic emulation to the other alternative, ne2000, is that pcnet uses DMA whereas ne2000 does not.

There is of course a performance cost for using QEMU, so there are chances that usage of QEMU will be replaced in the future with different soulutions which have lower performance costs.

Intel had annouced in March 2006 its VT-d Technology (Intel Virtualization Technology for Directred I/O). This technology enables to assign devices to virtual machines. It also enables DMA remapping, which can be configured for each device. There is a cache called IOTLB which improves performance.

Vmxloader

There are some restrictions on VMX operation. Guest OSes in VMX cannot operate in Real Mode. If bit PE (Protection Enabled) of CR0 is 0 or bit PG (“Enable Paging”) of CR0 is 0, then trying to start the VMX operation (VMXON instruction) fails.If after entering VMX operation you try to clear these bits, you get an exception (General Protection Exception). When using a linux loader, it starts in real mode. As a result, a vmxloader was written for vmx images. (file tools/firmware/vmxassist/vmxloader.c.)

(In order to build vmxloader you must have dev86 package installed; dev86 is a real mode 80×86 assembler and linker).

After installing Xen, vmxloader is under /usr/lib/xen/boot. In order to use it, you should specify kernel = “/usr/lib/xen/boot/vmxloader” in the config file (which is an input to your “xm create” command.)

The vmxloader loads ROMBIOS at 0xF0000, then VGABIOS at 0xC0000, and then VMXAssist at D000:0000.

What is VMXAssist? The VMXAssist is an emulator for real mode which uses the Virtual-8086 mode of IA32. After setting Virtual-8086 mode, it executes in a 16-bit environment.

There are certain instructions which are not recognized in virtual-8086 mode. For example, LIDT (Load Interrupt Register Table), or LGDT (Load Global DescriptorTable).

These instructions cause #GP(0) when trying to run them in protected mode.

So the VMXAssist assist checks the opcode of the instructions which are being executed, and handles them so that they will not cause General Protection Exception (as would have happened without its intervention).

VT-i (virtual technology) processors – support in Xen

Note : the files mentioned in this sections are from the unstable xen version).

In Vt-i extension for IA64 processors,intel added a bit to the PSR (process status register). This bit is bit 46 of the PSR and is called PSR.vm. When this bit is set, some instructions will cause a fault.

A new instruction called vmsw (Virtual Machine Switch) was added. This instruction sets the PSR.vm to 1 or 0. This instruction can be used to cause transition to or from a VM without causing an interruption.

Also a descriptor named VPD was added; this descriptor represents the resources of a virtual processor. It’s size is 64 K. (It must be 32 aligned).

A VPD stands for “Virtual Processor Descriptor”. A structure named vpd_t represents the VPD descriptor (file include/public/arch-ia64.h).

Two vectors were added to the ivt: One is the External Interrupt vector (0x3400) and the other is the Virtualization vector (0x6100).

The virtualization vector handler is called when an instruction which need virtualization was called. This handler cannot be raised by IA-32 instructions.

Also nine PAL services were added. PAL stands for Processor Abstraction Layer.

The services that were added are: PAL_VPS_RESUME_NORMAL,PAL_VPS_RESUME_HANDLER,PAL_VPS_SYNC_READ, PAL_VPS_SYNC_WRITE,PAL_VPS_SET_PENDING_INTERRUPT,PAL_VPS_THASH, PAL_VPS_TTAG, PAL_VPS_RESTORE and PAL_VPS_SAVE, (file include/asm-ia64/vmx_pal_vsa.h)

AMD SVM

AMD will hopefully release PACIFICA processors with virtualization support in Q2 2006. (probably on June 2006). The IOMMU virtualization support is to be out in 2007.

Them xen-unstable tree now includes both intel VT and SVM support, using a common API which is called HVM.

The inclusion of HVM in the unstable tree is since changeset 8708 from 31/1/06, which is a “Big merge the HVM full-virtualisation abstractions.”

You can download the code by: hg clone http://xenbits.xensource.com/xen-unstable.hg

The code for AMD SVM is mostly under xen/arch/x86/hvm/svm.

The code is developed by AMD team: Tom Woller, Mats Petersson, Travis Betak, Nagib Gulam, Leo Duran, Rosilmildo Dasilva and Wei Huang.

SVM stands for “Secure Virtual Machine”.

One major difference between Vt-x and AMD SVM is that the AMD SVM virtualization extensions include tagged TLB (whereas Intel virtualization extensions for IA-32 does not). The benefit of a tagged TLB is significantly reducing the number of TLB flushes ; this is achieved by using an ASID (Address Space Identifer) in the TLB. Using tagged TLB is common in RISC processors.

In AMD SVM, the most important struct (which is parallel to the VT-x vmcs_struct) is the vmcb_struct. (file xen/include/asm-x86/hvm/svm/vmcb.h). VMCB stands for Virtual Machine Control Block.

AMD added the following eight instructions to the SVM processor:

VMLOAD loads the processor state from the VMCB. VMMCALL enables the guest to communicate with the VMM. VMRUN starts the operation of a guest OS. VMSAVE store the processor state from the VMCB. CLGI clears the global interrupt flag (GIF) SLGI sets the global interrupt flag (GIF) INVPLGA invalidates the TLB mapping of a specified virtual page

  • and a specfied ASID.

SKINIT reinitilizes the CPU.

To issue these instructions SVM must be enabled. Enabling SVM is done by setting bit 12 of the EFER MSR register.

In VT-x, the vmx_vmexit_handler() method handles VM Entries. In AMD SVM, the svm_vmexit_handler() method is the one which handles VM exits. (file xen/arch/x86/hvm/svm/svm.c) When VM exit occurs, the processor saves the reason for this exit in the exit_code member of the VCMB. The svm_vmexit_handler() handles the VM EXIT according to the exit_reason of the VMCB.

Xen On Solaris

On 13 Feb 2006, Sun had released the Xen sources for Solaris x86. See : http://opensolaris.org/os/community/xen/opening-day.

This version currently supports 32 bit only ;it enables openSolaris to be a guest OS where dom0 is a modifed Linux kernel running Xen. Also this version is currently only for x86 (porting to SPARC processor is much more difficult). The members of the Solaris Xen project are Tim Marsland, John Levon, Mark Johnson, Stu Maybee, Joe Bonasera, Ryan Scott, Dave Edmondson and others. Todd Clayton is leading the 64-bit solaris Xen project. In order to boot the Solaris Xen guest many changes were done; can see more details in http://blogs.sun.com/roller/page/JoeBonasera.

You can download the Xen Solaris sources from : http://dlc.sun.com/osol/xen/downloads/osox-src-12-02-2006.tar.gz

Frontend net virtual device sources are in uts/common/io/xennet/xennetf.c. (xennet is the net front virtual driver.).

Frontend block virtual device sources are in uts/i86xen/io/xvbd (xvbd is the block front virtual driver.).

Currently the front block device does not work. There are many things which are similiar between Xen on Solaris and Xen on Linux.

In Xen Solaris Hypercall are also made by calling int 0x82 . (see #define TRAP_INSTR int $0x82 (file /uts/i86xen/ml/hypersubr.s)

Sun also released in february 2006 the specs for the T1 prcoessor, which supports virtualization: see : http://opensparc.sunsource.net/nonav/opensparct1.html

see: http://www.prnewswire.com/cgi-bin/stories.pl?ACCT=104&STORY=/www/story/02-14-2006/0004281587&EDATE=

Also the UltraSPARC T1 Hypervisor API Specification was released: http://opensparc.sunsource.net/specs/Hypervisor_api-26-v6.pdf

T1 virtualization:

The Hyperprivileged edition of the UltraSPARC Architecture 2005 Specification describes the Nonprivileged, Privileged, and Hyperprivileged (hypervisor/virtual machine firmware) spec.

The virtual processor on Sun supports three privilege modes:

2 bits determine the privilege mode of the processor: HPSTATE.hpriv and PSTATE.priv When both are 0 ,we are in nonprivileged mode When both are 1 ,we are in privileged mode When HPSTATE.hpriv is 1 , we are in Hyperprivileged mode (regardless of the value of PSTATE.priv). PSTATE is the Processor State register. HPSTATE is the Hyperprivileged State register HPSTATE.(64 bit). Each virtual processor has only one instance of the PSTATE and HPSTATE registers. The HPSTATE is one of the HPR state registers, and it is also called HPR 0. It can be read by the RDHPR instructions, and it can be written by the WRHPR instruction.

Step by step example of creating guest OS with Virtual Machine Manager in Fedora Core 6

This secrion describes a step by step example of creating guest OS based on FC6 i386 with Virtual Machine Manager in a Fedora Core 6 machine by installing from a WEB URL:

Go to : Application->System Tools->Virtual Machine Manager Choose : Local Xen Host Press New. Enter a name for the guest. You reach now the “Locating installation media” dialog. In “install media URL” you should enter a URL of Fedora Core 6 i386 download. For example, “http://download.fedora.redhat.com/pub/fedora/linux/core/6/i386/os/” then press forward. Choose simple file, and give a path to a non existing file in some existing folder. than: File size: choose 3.5 GB for example ; if you will assign less space, you will not be able to finish the installation, assuming it is a typical , non custom, installation.

Then press forward ; accept the defaults for memory/cpu and then press forward. Than press finish. That’s it! When insalling from web like this it can take 2-4 hours, depending on your bandwidth. You will get to the text mode installation of fedora core 6, and have to enter parameters for the installation.

After the installation is finished and you want to restart the guest OS , you do it by simply: “xm create /etc/xen/NameOfGuest“, where NameOfGuest is of course the name of guest you choose in the installation.

Physical Interrupts

In Xen, only the Hypervisor has an access to the hardware so that to achieve isolation (it is dangerous to share the hardware and let other domains access directly hardware devices simultaneously).

Let’s take a little walkthrough dealing with Xen interrupts:

Handling interrupts in Xen is done by using event channels. Each domain can hold up to 1024 events. An event channel can have 2 flags associated with it : pending and mask. The mask flag can be updated only by guests. The hypervisor cannot update it. These flags are not part of the event channel structure itself. (struct evtchn is defined in xen/include/xen/sched.h ). There are 2 arrays in struct shared_info which contains these flags: evtchn_pending[] and evtchn_mask[] ; each holds 32 elements. (file xen/include/public/xen.h)

(The shared_info is a member in domain struct; it is the domain shared data area).

TBD: add info about event selectors (evtchn_pending_sel in vcpu_info).

Registration (or binding) of irqs in guest domains:

The guest OS calls init_IRQ() when it boots (start_kernel() method calls init_IRQ() ; file init/main.c).

(init_IRQ() is in file sparse/arch/xen/kernel/evtchn.c)

There can be 256 physical irqs; so there is an array called irq_desc with 256 entries. (file sparse/include/linux/irq.h)

All elements in this array are initialized in init_IRQ() so that their status is disabled (IRQ_DISABLED).

Now, when a physical driver starts it usually calls request_irq().

This method eventually calls setup_irq() (both in sparse/kernel/irq/manage.c). which calls startup_pirq().

startup_pirq() send a hypercall to the hypervisor (HYPERVISOR_event_channel_op) in order to bind the physical irq (pirq) . The hypercall is of type EVTCHNOP_bind_pirq. See: startup_pirq() (file sparse/arch/xen/kernel/evtchn.c)

On the Hypervisor side, handling this hypervisor call is done in: evtchn_bind_pirq() method (file /common/event_channel.c) which calls pirq_guest_bind() (file arch/x86/irq.c). The pirq_guest_bind() changes the status of the corresponding irq_desc array element to be enabled (~IRQ_DISABLED). it also calls startup() method.

Now when an interrupts arrives from the controller (the APIC), we arrive at do_IRQ() method as is also in usual linux kernel (also in arch/x86/irq.c). The Hypervisor handles only timer and serial interrupts. Other interrupts are passed to the domains by calling _do_IRQ_guest() (In fact, the IRQ_GUEST flag is set for all interrupts except for timer and serial interrupts). _do_IRQ_guest() send the interrupt by calling send_guest_pirq() to all guests who are registered on this IRQ. The send_guest_pirq() creates an event channel (an instance of evtchn) and sets the pending flag of this event channel. (by calling evtchn_set_pending()) Then, asynchronously, Xen will notify this domain regarding this interrupt (unless it is masked).

TBD: shared interrupts; avoiding problems with shared interrupts when using PCI express.

Backend Drivers:

The Backend Drivers are started from domain 0. We will deal mainly with the network and block drivers. The network backend drivers reside under sparse/drives/xen/netback, and the block backend drivers reside under sparse/drives/xen/blkback.

There are many things in common between the netback and blkback device drivers. There are some differences, though. The blkback device drivers runs a kernel daemon thread (named :xenblkd) whereas the netback device driver does not run any kernel thread.

The netback and blkback register themselves with XenBus by calling xenbus_register_backend().

This method simply calls xenbus_register_driver_common(); both are in sparse/drivers/xen/xenbus/xenbus_probe.c.

(The xenbus_register_driver() method calls the generic kernel method for registering drivers, driver_register()).

Both netback (network backend driver) and blkback (block backend driver) has a module named xenbus.c. There are drivers which are not splitted to backend/frontend drivers;for example, the balloon driver.The balloon driver calls register_xenstore_notifier() in its initialization (balloon_init() method). The register_xenstore_notifier() uses a generic linux callback mechanism for passing status changes (notifier_block in include/linux/notifier.h).

The USB driver also has a backend and frontend drivers; currently it has no support to the xenbus/xenstore API so it does not have a module named xenbus.c but it will probably be adjusted in the future. As of writing of this document, the USB backend/frontend code was removed temporarily from the sparse tree.

Each of the backend drivers registers two watches: one for the backend and one for the frontend. The registration of the watches is done in the probe method:

* In netback it is in netback_probe() method (file netback/xenbus.c).

* In blkback it is in blkback_probe() method (file blkback/xenbus.c).

A registration of a watch is done by calling the xenbus_watch_path2() method. This method is implemented in sparse/drivers/xen/xenbus/xenbus_client.c. Evntually the watch registration is done by calling register_xenbus_watch(), which is implemented in sparse/drivers/xen/xenbus/xenbus_xs.c.

In both cases, netback and blkback, the callback for the backend watch is called backend_changed, and the callback for the forntend watch is called frontend_changed.

xenbus_watch is a simple struct consisting of 3 elements:

A reference to a list of watches (list_head)

A pointer to a node (char*)

A callback function pointer.

The xenbus.c in both netback and blkback defines a struct called backend_info; These structs have much in common: there are minor differences between them. One difference is that in the netback the communications channel is an instance of netif_t whereas in the blkback the communications channel is an instance of blkif_t; In the case of blkback, it includes also the major/minor numbers of the device and the mode (whereas these members don’t exist in the backend_info struct of the netback).

In the case of netback, there is also a XenbusState member. The state machine for XenBus includes seven states: Unknown, Initialising, InitWait (early initialisation was finished, and xenbus is waiting for information from the peer or hotplug scripts), Initialised (waiting for a connection from the peer), Connected, Closing (due to an error or an unplug event) and Closed.

One of the members of this struct (backend_info) is an instance of xenbus_device.(xenbus_device is declared in sparse/include/asm-xen/xenbus.h). The nodename looks like a directory path, for example, dev->nodename in the blkback case may look like:

backend/vbd/94e11324-7eb1-437f-86e6-3db0e145136e/771

and dev->nodename in the netback may look like:

backend/vif/94e11324-7eb1-437f-86e6-3db0e145136e/1

We create an event channel for communication between the two domains by calling a bind_interdomain Hypervisor call. (HYPERVISOR_event_channel_op).

For the networking,this is done in netif_map() in netback/interface.c. For the block device, this is done in blkif_map() in blkback/interface.c.

We use the grant tables to create shared memory between frontend and backend domain. In the case of network drivers,this is done by calling: gnttab_grant_foreign_transfer_ref(). (called in: network_alloc_rx_buffers(), file netfront.c)

gnttab_grant_foreign_transfer_ref() sets a bit named GTF_accept_transfer in the grant_entry.

In the case of block drivers,this is done by calling: gnttab_grant_foreign_access_ref() in blkif_queue_request() (file blkfront.c)

gnttab_grant_foreign_access_ref() sets a bit named GTF_permit_access in the grant entry. grant entry (grant_entry_t) represents a page frame which is shared between domains.

Diagram: Virtual Split Devices

virtualDevices

Migration and Live Migration:

Xend must be configured so that migration (which is also termed relocation) will be enabled. In /etc/xen/xend-config.sxp, there is the definition of the relocation-port, so the following line should be uncommented:

(xend-relocation-port 8002)

The line “(xend-address localhost)” prevents remote connections on the localhost,so this line must be commented.

Notice: if this line is commented in the side to which you want to migrate your domain, you will most likely get the following error after issuing the migrate command:

"Error: can't connect: Connection refused"

This error can be traced to domain_migrate() method in /tools/python/xen/xend/XendDomain.py which start a TCP connection on the relocation port (which is by default 8002)

...
...
def domain_migrate(self, domid, dst, live=False, resource=0):
        """Start domain migration."""
        dominfo = self.domain_lookup(domid)
        port = xroot.get_xend_relocation_port()
        try:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.connect((dst, port))
        except socket.error, err:
            raise XendError("can't connect: %s" % err[1])
...
...

See more details on the relocation protocol (implemented in relocate.py) below.

The line “(xend-relocation-server yes)” should be uncommented so the migration server will be running.

When we issue a migration command like “xm migrate #numOfDomain ipAddress” or , if we want to use live-migration , we add the –live flag thus: “xm migrate #numOfDomain –live ipAddress”, we call server.xend_domain_migrate() after validating that the arguments are valid. (file /tools/python/xen/xm/migrate.py)

We enter the domain_migrate() method of XendDomain, which first performs a domain lookup and then creates a TCP connection for the migration with the target machine; then it sends a “receive” packet (a packets which contains the string “receive”) on port 8002. The other sides gets this message in in the dataReceived() method (see of web/connection.py) and delegates it to dataReceived() of RelocationProtocol (file /tools/python/xen/xend/server/relocate.py). Eventually it calls the save() method of XendChekpoint to start the migration process.

We end up with a call to op_receive() which ultimately sends back a “ready receive” message (also on port 8002). See op_receive() method in relocation.py.

The save() method of XendChekpoint opens a thread which calls the xc_save executable (which is in /usr/lib/xen/bin/xc_save). (file /tools/python/xen/xend/XendCheckpoint.py).

The xc_save executable is build from tools/xcutils/xc_save.c.

The xc_save executable calls xc_linux_save() in tools/libxc/xc_linux_save, which in fact performs most of the migration process. (see xc_linux_save() in /tools/libxc/xc_linux_save.c)

The xc_linux_save() returns 0 on success, and 1 in case of failure.

Live migration is currently supported for 32-bit and 64-bit architectures. TBD: find out if there is support for live-migration for pae architectures. Jacob Gorm Hansen is doing an interesting work with migration in Xen; see http://www.diku.dk/~jacobg/self-migration.

He wrote code which performs Xen “self-migration” , which is a migration which is done without the hypervisor involvement. The migration is done by opening a User Space socket and reading from a special file (/dev/checkpoint).

His code works with linux-2.6 bases xen-unstable.

You can get it by: hg clone http://www.distlab.dk/hg/index.cgi/xen-gfx.hg

His work was inspired by his own ‘NomadBIOS’ for L4 microkernel, which also uses this approach of self-migration.

It might be that this new alternative to managed migration will be adopted in future versions of Xen (time will tell).

Migration of operating systems has much in common with migration of single processes. see Master’s thesis: “Nomadic Operating Systems” Jacob G. Hansen , Asger kahl Henriksen,2002 http://nomadbios.sunsite.dk/NomadicOS.ps

You can find there a discussion about migration of single processes in Mosix (Barak and La’adan).

Creating of a domain – behind the scenes:

We create a domain by:

xm create configFile -c

A simple config file may look like (using ttylinux,as in the user manual):

kernel = "/boot/vmlinuz-2.6.12-xenU"
memory = 64
name = "ttylinux"
nics = 1
ip = "10.0.0.3"
disk = ['file:/home/work/downloads/tmp/ttylinux-xen,hda3,w']
root = "/dev/hda3 ro"

The create() method of XendDomainInfo handles creation of domains. When a domain is created it is assigned a unique id (uuid ) which is cretaed using uuidgen command line utility of e2fsprogs.(file python/xen/xend/uuid.py).

If the memory paramter specified a too high memory which the hypervisor cannot allocate, we end up with the following message: “Error: Error creating domain: The privileged domain did not balloon!”

The devices of a domain are created using the createDevice() method which delegates the call to the createDevice() method of the Device Controller (see XendDomainInfo.py) The createDevice() in turn calls writeDetails() method (also in DevController). This writeDetails() method write the details in XenStore to trigger the creation of the device. The getDeviceDetails() is an abstract method which each subclass of DevController implements. Writing to the store is done by calling Write() method of xstransact. (file tools/pyhton/xen/xend/xenstore/xstransact.py) which returns the id of the newly created device.

By using transaction you can batch together some actions to perform against the xenstored (the common use is some read actions). You can create a domain also without Xend and without Python bindings; Jacob Gorm Hansen had demonstrated it in 2 little programs (businit.c and buscrate.c) (see http://lists.xensource.com/archives/html/xen-devel/2005-10/msg00432.html).

However, true to now these programs should be adjusted beacuse there were some API changes, especially that creation of interdomain event channel is done now with sending ioctl to event_channel (IOCTL_EVTCHN_BIND_INTERDOMAIN).

HyperCalls Mapping to code Xen 3.0.2

Mapping of HyperCalls to code :

Follwoing is the location of all hypercalls: The HyperCalls appear according to their order in xen.h.

The hypercall table itself is in xen/arch/x86/x86_32/entry.S (ENTRY(hypercall_table)).

HYPERVISOR_set_trap_table => do_set_trap_table() (file xen/arch/x86/traps.c)

HYPERVISOR_mmu_update => do_mmu_update() (file xen/arch/x86/mm.c)

HYPERVISOR_set_gdt => do_set_gdt() (file xen/arch/x86/mm.c)

HYPERVISOR_stack_switch => do_stack_switch() (file xen/arch/x86/x86_32/mm.c)

HYPERVISOR_set_callbacks => do_set_callbacks() (file xen/arch/x86/x86_32/traps.c)

HYPERVISOR_fpu_taskswitch => do_fpu_taskswitch(int set) (file xen/arch/x86/traps.c)

HYPERVISOR_sched_op_compat => do_sched_op_compat() (file xen/common/schedule.c)

HYPERVISOR_dom0_op => do_dom0_op() (file xen/common/dom0_ops.c)

HYPERVISOR_set_debugreg => do_set_debugreg() (file xen/arch/x86/traps.c)

HYPERVISOR_get_debugreg => do_get_debugreg() (file xen/arch/x86/traps.c)

HYPERVISOR_update_descriptor => do_update_descriptor() (file xen/arch/x86/mm.c)

HYPERVISOR_memory_op => do_memory_op() (file xen/common/memory.c)

HYPERVISOR_multicall => do_multicall() (file xen/common/multicall.c)

HYPERVISOR_update_va_mapping => do_update_va_mapping() (file /xen/arch/x86/mm.c)

HYPERVISOR_set_timer_op => do_set_timer_op() (file xen/common/schedule.c)

HYPERVISOR_event_channel_op => do_event_channel_op() (file xen/common/event_channel.c)

HYPERVISOR_xen_version => do_xen_version() (file xen/common/kernel.c)

HYPERVISOR_console_io => do_console_io() (file xen/drivers/char/console.c)

HYPERVISOR_physdev_op => do_physdev_op() (file xen/arch/x86/physdev.c)

HYPERVISOR_grant_table_op => do_grant_table_op() (file xen/common/grant_table.c)

HYPERVISOR_vm_assist => do_vm_assist() (file xen/common/kernel.c)

HYPERVISOR_update_va_mapping_otherdomain =>

  • do_update_va_mapping_otherdomain() (file xen/arch/x86/mm.c)

HYPERVISOR_iret => do_iret() (file xen/arch/x86/x86_32/traps.c) /* x86/32 only */

HYPERVISOR_vcpu_op => do_vcpu_op() (file xen/common/domain.c)

HYPERVISOR_set_segment_base => do_set_segment_base (file xen/arch/x86/x86_64/mm.c) /* x86/64 only */

HYPERVISOR_mmuext_op => do_mmuext_op() (file xen/arch/x86/mm.c)

HYPERVISOR_acm_op => do_acm_op() (file xen/common/acm_ops.c)

HYPERVISOR_nmi_op => do_nmi_op() (file xen/common/kernel.c)

HYPERVISOR_sched_op => do_sched_op() (file xen/common/schedule.c)

(Note: sometimes hypercalls are also called hcalls.)

Virtualization and the Linux Kernel

Virtualization in computer context can be thought of as extending the abilities of a computer beyond what a straight, non-virtual implelmentation allows.

In this category we can include also virtual memory, which allows a process to access 4GB virtual address space even though the physical RAM is usually much lower.

We can also think of the Linux IP Virtual Server (which is now a part of the linux kernel) as a kind of virtualization. By using the Linux IP Virtual Server you can configure a router to redirect service requests from a virtual server address to other machines (called real servers).

The IP Virtual Server is part of the kernel starting 2.6.10 (In the 2.4.* kernels it is also available as a patch; the code for 2.6.10 and above kernels is under net/ipv4/ipvs under the kernel tree ;there is still no implementation for ipv6).

The Linux Virtual Server (LVS) was started quite a time ago,in 1998; see http://www.linuxvirtualserver.org.

The idea of virtualization in the sense of enabling of running more than one operating system on a single platform is not new and was researched for many years. However, it seems that the Xen project is the first which produces performance benchmark metrics of such a feature which make this idea more practical and more attractive.

Origins of the Xen project: The Xen project is based on the Xenoservers project; It was originally built as part of the XenoServer project, see http://www.cl.cam.ac.uk/Research/SRG/netos/xeno.

Also the arsenic project has some ideas which were used in Xen. (see http://www.cl.cam.ac.uk/Research/SRG/netos/arsenic)

In the arsenic project, written by Ian Pratt and Keir Fraser, a big part of the Linux kernel TCP/IP stack was ported to user space. The arsenic project is based on Linux 2.3.29. After a short look at the Arsenic porject code you can find some data structures which can remind of parallel data structures in Xen, like the event rings. (for exmaple,the ring_control_block struct in arsenic-1.0/acenic-fw-12.3.10/nic/common/nic_api.h)

Meiosys is a French Company which was purchased by IBM. It deals with another different type of virtualization – Application Virtualization.

see http://www.virtual-strategy.com/article/articleview/680/1/2/ and http://www.infoworld.com/article/05/06/23/HNibmbuy_1.html

In context of the Meiosys project, it is worth to mention that a patch was sent recently to the Linux Kernel Mailing List from Serge E. Hallyn (IBM): see http://lwn.net/Articles/160015

This patch deals with process IDs. (the pid should stay the same after starting anew the application in Meiosys).

Another article on PID virtualization can be found in “PID virtualization: a wealth of choices” http://lwn.net/Articles/171017/?format=printable This article deals with PID virtualization in a context of a diffenet project (openVZ).

There is also the colinux open source project (see:http://colinux.sourceforge.net for more details) and the openvz project, which is based on Virtuozzo™. (Virtuozzo is a commercial solution).

The openvz offers server virtualization, linux-based solution: see http://openvz.org.

There are other projects which probably ispire virtualization; to name of few:

Denali Project uses (uses paravirtualization). http://denali.cs.washington.edu

A paper: Denali: Lightweight Virtual Machines for Distributed and Networked Applications By Andrew Whitaker et al. http://denali.cs.washington.edu/pubs/distpubs/papers/denali_usenix2002.pdf

Nemesis Operating System. http://www.cl.cam.ac.uk/Research/SRG/netos/old-projects/nemesis/index.html

Exokernel: see “Application Performance and Flexibility on Exokernel Systems” by M. Frans Kaashoek et al http://www.cl.cam.ac.uk/~smh22/docs/kaashoek97application.ps.gz

TBD: more details.

Pre-Virtualization

Another interesting virtulaization technique is Pre-Virtualization; in this method, we rewite sensitive instructions using the assembler files (whether generated by compiler, as is the usual case, or assembler files created manually). There is a problem in this method because there are instuctions which are sensitive only when they are performed in a certain context. A solution for this is to generate profiling data of a guest OS and then recompile the OS using the profiling data.

See:

http://l4ka.org/projects/virtualization/afterburn/

and an article: Pre-Virtualization: Slashing the Cost of Virtualization Joshua LeVasseur, Volkmar Uhlig, Matthew Chapman et al. http://l4ka.org/publications/2005/previrtualization-techreport.pdf

This technique is based on a paper by Hideki Eiraku and Yasushi Shinjo, “Running BSD Kernels as User Processes by Partial Emulation and Rewriting of Machine Instructions” http://www.usenix.org/publications/library/proceedings/bsdcon03/tech/eiraku/eiraku_html/index.html

Xen Storage

You can use iscsi for Xen Storage. The xen-tools package of OpenSuse has an example of using iscsi, called xmexample.iscsi. The disk entry for iscsi in the configuration file may look like: disk = [ ‘iscsi:iqn.2006-09.de.suse@0ac47ee2-216e-452a-a341-a12624cd0225,hda,w’ ]

TBD: more on iSCSI in Xen.

Solutions for using CoW in Xen: blktap (part of the xen project).

UnionFS: a stackable filesystem (used also in Knoppix Live-CD and other Live-CDs)

dm-userspace (A tool which uses device-mapper and a daemon called cowd; written by Dan Smith) You may download dm-userspace by:

To build as a module out-of-tree, copy dm-userspace.h to: /lib/modules/uname -r/build/include/linux and then run “make”.

Home of dm-userspace:

Copy-on-write NFS server: see http://www.russross.com/CoWNFS.html

kvm – Kernel-based Virtualization Driver

Kvm is as an open source virtualization project , written by Avi Kivity and Yaniv Kamay from qumranet. See : http://kvm.sourceforge.net.

It is included in the linux kerel tree since 2.6.20-rc1; see: http://lkml.org/lkml/2006/12/13/361 (“kvm driver for all those crazy virtualization people to play with”)

Currently it deals with Intel processors with the virtual extension (VT-X). and AMD SVM processors. You can know if your processor has these extensions by issuing from the command line: “egrep ‘^flags.*(vmx|svm)’ /proc/cpuinfo”

kvm.ko is a kernel module which handles userspace requests through ioctls. It works with a character device (/dev/kvm). The userspace part is built from patched quemu. One of KVM advantages is that it uses linux kernel mechanisms as they are without change (such as the linux scheduler). The Xen project, for example, made many changes to parts of the kernel to enable para-virtualization. Another advantage is the simplicty of the project: there is a kernel part and a userspace part. An advantage of KVM is that future versions of linux kernel will not entail changes in the kvm module code (and of course not in the user space part). The project currently support SMP hosts and will support SMP guests in the future.

Currently there is no support to live migration in KVM (but there is support for ordinary migration, when the migrated OS is stopped and than transfrerred to the target and than resumed).

In intel vt-x , VM-Exits are handled by the kvm module by kvm_handle_exit() method in kvm_main.c according to the reason which caused them (and which is specified and read from the VMCS). in AMD SVM , exit are handled by handle_exit() in svm.c.

There is an interesting usage of memory slots . There is already an rpm for openSUSE by Gerd Hoffman.

Tip: How to build Xen with your own tar ball

If you want to run “make world” without downloading the kernel (beacuse that you want to to your own tar ball which is a bit different from the original one because you made few changes inside the kernel), then do the following:

1) Let’s say that the kernel tar ball is named: my_linux-2.6.18.tar.bz2.

  • First, move my_linux-2.6.18.tar.bz2 to the folder from where you build Xen

2) Run from bash: XEN_LINUX_SOURCE=tarball make world

That’s it; it will use the my_linux-2.6.18.tar.bz2. tar ball that you copied to that folder.

Xen in the Linux Kernel

According to the following thread from xen-devel: http://lists.xensource.com/archives/html/xen-devel/2005-10/msg00436.html, there is a mercurial repository in which xen is a subarch of i386 and x86_64 of the linux kernel, and there is an intention to send releavant stuff to Andrew/Linus for the upcoming 2.6.15 kernel. In 22/3/2006 , a patchest of 35 parts was sent to the Linux Kernel Mailing List (lkml) for Xen i386 paravirtualization support in the linux kernel: see http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/2313.html

VMI : Virtual Machine Interface

On 13/3/06 , a patchset titled “VMI i386 Linux virtualization interface proposal” was sent to the LKML (Linux Kernel Mailing List) by Zachary Amsden and othes. (see http://lkml.org/lkml/2006/3/13/140) It suggests for a common interfcace which abstracts the specifics of each hypervisor and thus can be used by many hypervisors. According to the vmi_spec.txt of this patchset, when an OS is ported to a paravirtulizable x86 processor, it should access the hypervisor through the VMI layer.

The VMI layer interface:

The VMI is divided to the following 10 types of calls:

CORE INTERFACE CALLS (like VMI_Init)

PROCESSOR STATE CALLS (like VMI_DisableInterrupts, VMI_EnableInterrupts,VMI_GetInterruptMask)

DESCRIPTOR RELATED CALLS (like VMI_SetGDT,VMI_SetIDT)

CPU CONTROL CALLS (like VMI_WRMSR,VMI_RDMSR)

PROCESSOR INFORMATION CALLS (like VMI_CPUID)

STACK/PRIVILEGE TRANSITION CALLS (like VMI_UpdateKernelStack,VMI_IRET)

I/O CALLS (like VMI_INB,VMI_INW,VMI_INL)

APIC CALLS (like VMI_APICWrite,VMI_APICRead)

TIMER CALLS (VMI_GetWallclockTime)

MMU CALLS (like VMI_SetLinearMapping)

Links

1) Xen Project HomePage http://www.cl.cam.ac.uk/Research/SRG/netos/xen

2) Xen Mailing Lists Pge: http://lists.xensource.com

(don’t forget to read the XenUsersNetiquette before posting on the lists)

3) Atricle : Analysis of the Intel Pentium’s Ability to Support a Secure Virtual Machine Monitor http://www.usenix.org/publications/library/proceedings/sec2000/full_papers/robin/robin_html/index.html

4) Xen Summits: 2005: http://xen.xensource.com/xensummit/xensummit_2005.html 2006 fall: http://xen.xensource.com/xensummit/xensummit_fall_2006.html 2006 winter: http://xen.xensource.com/xensummit/xensummit_winter_2006.html 2007 spring: http://xen.xensource.com/xensummit/xensummit_spring_2007.html

5) Intel Virtualizatiuon technology: http://www.intel.com/technology/virtualization/index.htm

6) Article by Ryan Maueron Linux Journal in 2 parts:

6-1) Xen Virtualization and Linux Clustering, Part 1 http://www.linuxjournal.com/article/8812

6-2) Xen Virtualization and Linux Clustering, Part 2 http://www.linuxjournal.com/article/8816

Commercial Companies: 7)XenSource:

http://www.xensource.com/

8) Enomalism: http://www.enomalism.com/

9) Thoughtcrime is a brand new company specialising in opensource virtualisation solutions see http://debian.thoughtcrime.co.nz/ubuntu/README.txt

IA64:

10)IA64 Master Thesis HPC Virtualization with Xen on Itanium by Havard K. F. Bjerke http://openlab-mu-internal.web.cern.ch/openlab%2Dmu%2Dinternal/Documents/2_Technical_Documents/Master_thesis/Thesis_HarvardBjerke.pdf

11) vBlades: Optimized Paravirtualization for the Itanium Processor Family Daniel J. Magenheimer and Thomas W. Christian http://www.usenix.org/publications/library/proceedings/vm04/tech/full_papers/magenheimer/magenheimer_html/index.html

12) Now and Xen Feature Story Article Written by Andrew Warfield and Keir Fraser http://www.linux-mag.com/2004-10/xen_01.html

13) Self Migration: http://www.diku.dk/~jacobg/self-migration

14) online magazine: http://www.virtual-strategy.com

15) Denali Project http://denali.cs.washington.edu

general links about virtualization:

16) http://www.virtualization.info

17) http://www.kernelthread.com/publications/virtualization

18) A Survey on Virtualization Technologies Susanta Nanda Tzi-cker Chiueh http://www.ecsl.cs.sunysb.edu/tr/TR179.pdf

AMD:

19) AMD I/O virtualization technology (IOMMU) specification Rev 1.00 – February 03, 2006 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/34434.pdf

20) AMD64 Architecture Programmer’s Manual: Vol 2 System Programming : Revision 3.11 added chapter 15 on virtualization (“Secure Virtual Machine”).(december 2005) http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf

21) AMD64 Architecture Programmer’s Manual: Vol 3: General-Purpose and System Instructions Revision 3.11 added SVM instructions (december 2005) http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24594.pdf

22) AMD virtualization on the Xen Summit: http://www.xensource.com/files/xs0106_amd_virtualization.pdf

23) AMD Press Release: SUNNYVALE, CALIF. — May 23, 2006: availability of AMD processors with virtualization extensions: http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~108605,00.html

Open Solaris:

24) Open Solaris Xen Forum: http://www.opensolaris.org/jive/category.jspa?categoryID=32

25) Update to opensolaris Xen: adding OpenSolaris-based dom0 capabilities, as well as 32-bit and 64-bit MP guest. 14/07/2006 http://www.opensolaris.org/os/community/xen/announcements/?monthYear=July+2006

26) Open Sparc Hypervisor Spec: http://opensparc.sunsource.net/nonav/Hypervisor_api-26-v6.pdf

27) Open Sparc T1 page: http://opensparc.sunsource.net/nonav/opensparct1.html extension to the Solaris Zones: http://opensolaris.org/os/community/brandz

28) OSDL Development Wiki Homepage (Virtualization) http://www.osdl.org/cgi-bin/osdl_development_wiki.pl?Virtualization

29) fedora xen mailing list archive: https://www.redhat.com/archives/fedora-xen

30) Xen Quick Start for FC4 (Fedora Core 4). http://www.fedoraproject.org/wiki/FedoraXenQuickstart?highlight=%28xen%29

31) Xen Quick Start for FC5 (Fedora Core 5): http://www.fedoraproject.org/wiki/FedoraXenQuickstartFC5

32) Xen Quick Start for FC6 (Fedora Core 6): http://fedoraproject.org/wiki/FedoraXenQuickstartFC6

Fedora 7 quick start: http://fedoraproject.org/wiki/Docs/Fedora7VirtQuickStart Fedora 8 quick start: http://fedoraproject.org/wiki/Docs/Fedora8VirtQuickStart

33) The Xen repository is handled by the mercurial version system. mercurial download: http://www.selenic.com/mercurial/release

34) Measuring CPU Overhead for I/O Processing in the Xen Virtual Machine Monitor Cherkasova Ludmila and Gardner, Rob http://www.hpl.hp.com/techreports/2005/HPL-2005-62.html

35) XenMon: QoS Monitoring and Performance Profiling Tool Gupta Diwaker and Gardner Rob; Cherkasova, Ludmila http://www.hpl.hp.com/techreports/2005/HPL-2005-187.html

36) Potemkin VMM: A virtual machine based on xen-unstable ; used in a honeypot By Michael Vrable, Justin Ma, Jay Chen, David Moore, Erik Vandekieft, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage. http://www.cs.ucsd.edu/~savage/papers/Sosp05.pdf

37) Memory Resource Management in VMware ESX Server http://www.usenix.org/events/osdi02/tech/waldspurger.html

books:

38) Virtualization: From the Desktop to the Enterprise By Erick M. Halter , Chris Wolf Published: May 2005 http://www.apress.com/book/bookDisplay.html?bID=449

39) Virtualization with VMware ESX Server Publisher: Syngress; 2005 by Al Muller, Seburn Wilson, Don Happe, Gary J. Humphrey http://www.syngress.com/catalog/?pid=3315

40) VMware ESX Server: Advanced Technical Design Guide by Ron Oglesby, Scott Herold http://www.amazon.com/exec/obidos/ASIN/0971151067/virtualizatio-20/002-5634453-4543251?creative=327641&camp=14573&link_code=as1

41) PPC: Hollis Blanchard, IBM Linux Technology Center Jimi Xenidis, IBM Research http://wiki.xensource.com/xenwiki/Xen/PPC

42) http://about-virtualization.com/mambo

43) PID virtualization: a wealth of choices http://lwn.net/Articles/171017/?format=printable

44) The Xen Hypervisor and its IO Subsystem: http://www.mulix.org/lectures/xen-iommu/xen-io.pdf

45) G. J. Popek and R. P. Goldberg, Formal requirements for virtualizable third generation architectures, Commun. ACM, vol. 17, no. 7, pp. 412 421, 1974. http://www.cis.upenn.edu/~cis700-6/04f/papers/popek-goldberg-requirements.pdf

46) “Running multiple operating systems concurrently on an IA32 PC using virtualization techniques” by Kevin Lawton (1999). http://www.floobydust.com/virtualization/lawton_1999.txt

47) Automating Xen Virtual Machine Deployment (talks about integrating SystemImager with Xen and more) by Kris Buytaert http://howto.x-tend.be/AutomatingVirtualMachineDeployment/

48)Virtualizing servers with Xen Evaldo Gardenali VI International Conference of Unix at UNINET http://umeet.uninet.edu/umeet2005/talks/evaldoa/xen.pdf

49)Survey of System Virtualization Techniques Robert Rose March 8, 2004 http://www.robertwrose.com/vita/rose-virtualization.pdf

NetBSD

50)http://www.netbsd.org/Ports/xen/

51)Interview on Xen with NetBSD develope Manuel Bouyer http://ezine.daemonnews.org/200602/xen.html

52) netbsd xen mailing list: http://www.netbsd.org/MailingLists/#port-xen

53) NetBSD/xen Howto http://www.netbsd.org/Ports/xen/howto.html

54) “C” API for Xen (LGPL) By Daniel Veillard and others

http://libvir.org/

55) Fraser Campbell page: http://goxen.org/

56) Another page from Fraser Campbell : http://linuxvirtualization.com/

57) Virtualization blog

58)Hardware emulation with QEMU (article)

59) http://linuxemu.retrofaction.com

60) Linux Virtualization with Xen http://www.linuxdevcenter.com/pub/a/linux/2006/01/26/xen.html

61) The virtues of Xen by Alex Maier http://www.redhat.com/magazine/014dec05/features/xen/

62) Deploying VirtualMachines as Sandboxes for the Grid Sriya Santhanam, Pradheep Elango, Andrea Arpaci Dusseau, Miron Livny http://www.cs.wisc.edu/~pradheep/SandboxingWorlds05.pdf

63) article: Xen and the new processors:

Infiniband:

64) Infiniband (Smart IO) wiki page http://wiki.xensource.com/xenwiki/XenSmartIO hg repository: http://xenbits.xensource.com/ext/xen-smartio.hg

65) Novell Infiniband and virtualization, Patrick Mullaney , may 1, 2007: http://www.openfabrics.org/archives/spring2007sonoma/Monday%20April%2030/Novell%20xen-ib-presentation-sonoma.ppt

66) A Case for High Performance Computing with Virtual Machines Wei Huangy, Jiuxing Liuz, Bulent Abaliz et al. http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/huangwei-ics06.pdf

67) High Performance VMM-Bypass I/O in Virtual Machines Wei Huangy, Jiuxing Liuz, Bulent Abaliz et al. (usenix 06) http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/usenix06.pdf

68) User Mode Linux , a book By Jeff Dike. Bruce Perens’ Open Source Series. Published: Apr 12, 2006; http://www.phptr.com/title/0131865056

69) Xen 3.0.3 features,schedule : http://lists.xensource.com/archives/html/xen-devel/2006-06/msg00390.html

70) Practical Taint-Based Protection using Demand Emulation Alex Ho, Michael Fetterman, Christopher Clark et al. http://www.cs.kuleuven.ac.be/conference/EuroSys2006/papers/p29-ho.pdf

71) http://stateless.geek.nz/2006/09/11/current-virtualisation-hardware/ Current Virtualisation Hardware by Nicholas Lee

72) RAID: Installing Xen with RAID 1 on a Fedora Core 4 x86 64 SMP machine: http://www.freax.be/wiki/index.php/Installing_Xen_with_RAID_1_on_a_Fedora_Core_4_x86_64_SMP_machine

73) RAID 1 and Xen (dom0) : (On Debian) http://wiki.kartbuilding.net/index.php/RAID_1_and_Xen_(dom0)

74) OpenVZ Virtualization Software Available for Power Processors http://lwn.net/Articles/204275/

75) Kernel-based Virtual Machine patchset (Avi Kivity) adding /dev/kvm which exposes the virtualization capabilities to userspace. http://lwn.net/Articles/205580

76) Intel Technology Journal : Intel Virtulaization Technology: articles by Intel Staff (96 pages) http://download.intel.com/technology/itj/2006/v10i3/v10_iss03.pdf

77) kvm site: (Avi Kivity and others) Includes a howto and a white paper, download , faq sections. http://kvm.sourceforge.net/

78) kvm on debian: http://people.debian.org/~baruch/kvm http://baruch.ev-en.org/blog/Debian/kvm-in-debian

79) Linux Virtualization Wiki

80) ” New virtualisation system beats Xen to Linux kernel” (about kvm) http://www.techworld.com/opsys/news/index.cfm?newsID=7586&pagtype=all

81) article about kvm: http://linux.inet.hr/finally-user-friendly-virtualization-for-linux.html

82) Virtual Linux : An overview of virtualization methods, architectures, and implementations An article by M. Tim Jones (auhor of “GNU/Linux Application Programming”,”AI Application Programming”, and “BSD Sockets Programming from a Multilanguage Perspective”. http://www-128.ibm.com/developerworks/library/l-linuxvirt/index.html

83) Lguest: The Simple x86 Hypervisor by Rusty Russel (formerly lhype) http://lguest.ozlabs.org/

84) “Infrastructure virtualisation with Xen advisory” – a wiki atricle : using iscsi for Xen-clustering ; shared storage http://docs.solstice.nl/index.php/Infrastructure_virtualisation_with_Xen_advisory

85) Xen with DRBD, GNBD and OCFS2 HOWTO http://xenamo.sourceforge.net/

86) Virtualization with Xen(tm): Including Xenenterprise, Xenserver, and Xenexpress (Paperback) by David E Williams Syngress Publishing (April 1, 2007) # ISBN-10: 1597491675 # ISBN-13: 978-1597491679 (Author)http://www.amazon.com/Virtualization-Xen-Including-Xenenterprise-Xenexpress/dp/1597491675/ref=pd_bbs_sr_2/103-3574046-3264617?ie=UTF8&s=books&qid=1173899913&sr=8-2

Paperback: 512 pages

87) Professional XEN Virtualization (Paperback) by William von Hagen (Author)

# Paperback: 500 pages # Publisher: Wrox (August 27, 2007) # Language: English # ISBN-10: 0470138114 # ISBN-13: 978-0470138113

88) Xen and the Art of Consolidation Tom Eastep Linuxfest NW. April 29, 2007. http://www.shorewall.net/Linuxfest-2007.pdf

89) Optimizing Network Virtualization in Xen

By Willy Zwaenepoel, Alan L. Cox, Aravind Menon, usenix 2006 http://www.usenix.org/events/usenix06/tech/menon/menon_html/paper.html

90) Virtual Machine Checkpointing

Brendan Cully,University of British Columbia with Andrew Warfield, University of Cambridge

http://xcr.cenit.latech.edu/hapcw2006/program/papers/cr-xen-hapcw06-final.pdf

Adding new device and triggering the probe() functions

The following is a simple example which shows how to add a new device and trigger the probe() function of a backend driver using xenstore-write tool. This is relevant for Xen 3.1

Currently in Xen, triggering of the probe() method in a backend driver or a frontend driver is done by writing some values to the xenstore into directories where the xenbus poses watches. This writing to the xenstore is currently done in Xen from the python code, and it is wrapped deep inside the xend and/or xm commands. Eventually it is done in the writeDetails method of the DevController class. (And both blkif and netif use it).

For those who want who want to be able to trigger the probe() function without diving too deeply into the python code, this should suffice.

For the purposes of this little tutorial, let’s assume that you have built and installed Xen 3.1 from source and have used it to fire up a guest domain at least once. After you’ve done that, let’s say we want to add new device. We will add a device named “mydevice”. Let’s begin with the backend. For this purpose, we will add a directory named “deviceback” to linux-2.6-sparse/drivers/xen. This directory will store the backend portion of our driver.

First, create linux-2.6-sparse/drivers/xen/deviceback. Next, add the following three files to that directory: deviceback.c, xenbus.c, common.h and Makefile.

Here is a minimal skeleton implementation of these files:

deviceback.c

#include <linux/module.h>
#include "common.h"
static int __init deviceback_init(void)
{
        device_xenbus_init();
}
static void deviceback_cleanup()
{
}
module_init(deviceback_init);
module_exit(deviceback_cleanup);

xenbus.c

#include <xen/xenbus.h>
#include <linux/module.h>
#include <linux/slab.h>
struct backendinfo
{
        struct xenbus_device* dev;
        long int frontend_id;
        struct xenbus_watch backend_watch;
        struct xenbus_watch watch;
        char* frontpath;
};
//////////////////////////////////////////////////////////////////////////////
static int device_probe(struct xenbus_device* dev,
                        const struct xenbus_device_id* id)
{
        struct backendinfo* be;
        char* frontend;
        int err;
        be = kmalloc(sizeof(*be),GFP_KERNEL);
        memset(be,0,sizeof(*be));
        be->dev = dev;
        printk("Probe fired!\n");
        return 0;
}
//////////////////////////////////////////////////////////////////////////////
static int device_uevent(struct xenbus_device* xdev,
                          char** envp, int num_envp,
                          char* buffer, int buffer_size)
{
        return 0;
}
static int device_remove(struct xenbus_device* dev)
{
        return 0;
}
//////////////////////////////////////////////////////////////////////////////
static struct xenbus_device_id device_ids[] =
{
        { "mydevice" },
        { "" }
};
//////////////////////////////////////////////////////////////////////////////
static struct xenbus_driver deviceback =
{
        .name    = "mydevice",
        .owner   = THIS_MODULE,
        .ids     = device_ids,
        .probe   = device_probe,
        .remove  = device_remove,
        .uevent  = device_uevent,
};
//////////////////////////////////////////////////////////////////////////////
void device_xenbus_init()
{
        xenbus_register_backend(&deviceback);
}

common.h

#ifndef COMMON_H
#define COMMON_H
void device_xenbus_init(void);
#endif

Makefile

#Makefile
obj-y += xenbus.o deviceback.o

Next, we should add our new backend device to the Makefile in linux-2.6-sparse/drivers/xen/Makefile. Add the following line to the bottom of that file:

obj-y += deviceback/

This will make sure that it will be included in the build.

Next, we need to add symlinks from linux-2.6-sparse/drivers/xen/deviceback into linux-2.6.18-xen/drivers/xen/deviceback:

  1. Create linux-2.6.18-xen/drivers/xen/deviceback
  2. Change into that directory
  3. Add the symlinks: ‘ln -s ../../../../linux-2.6-xen-sparse/./drivers/xen/deviceback/./* .’

Now we should build the new drivers and reboot with the new Xen image. You can do this by going back to the root directory of the source tree (the place where you typed ‘make world’ when doing your normal build before) and do ‘make install-kernels’. (Note: This will overwrite the previous Xen kernel!) Finally, reboot.

After the machine boots back up, go ahead and start a guest domain and you should notice that device_probe() does not get executed. (Check /var/log/syslog on dom0 to look for the printk() to show up.)

How can we trigger the probe() function of our backend driver? We just need to write the correct key/value pairs into the xenstore.

The call to xenbus_register_backend() in xenbus.c causes xenbus to set a watch on local/domain/0/backend/mydevice in the xenstore. Specifically, anytime anything is written into that location in the store the watch fires and checks for a specific set of key/value pairs that indicate the probe should be fired.

So performing the following 4 calls using xenstore-write will trigger our probe() function. Change the X with the ID of a running guest domain. (Check ‘xm list’ for this. If you’ve only started one guest, this number is probably 1.)

xenstore-write /local/domain/X/device/mydevice/0/state 1
xenstore-write /local/domain/0/backend/mydevice/X/0/frontend-id X
xenstore-write /local/domain/0/backend/mydevice/X/0/frontend /local/domain/X/device/mydevice/0
xenstore-write /local/domain/0/backend/mydevice/X/0/state 1

You should see the probe message appear the Dom0’s /var/log/syslog. What happened here behind the scenes ,without going too deep, is that the xenbus_register_backend() put a watch on the xenback directory of /local/domain/0 in the xenstore. Once frontend, frontend-id, and state are all written to the watched location, the xenbus driver will gather all of that information, as well as the state of the frontend driver (written in that first line) and use it to setup the appropriate data structures. From there, the probe() function is finally fired.

Adding a frontend device

For this purpose,we will add a directory named “devicefront” to linux-2.6-sparse/drivers/xen.

We will create 2 files there: devicefront.c and Makefile.

We will also add directories and symlinks as we did in the deviceback case.

Makefile

obj-y := devicefront.o

devicefront.c

The devicefront.c will be (a minimalist implementation):

// devicefront.c
#include <xen/xenbus.h>
#include <linux/module.h>
#include <linux/list.h>
struct device_info
{
        struct list_head list;
        struct xenbus_device* xbdev;
};
//////////////////////////////////////////////////////////////////////////////
static int devicefront_probe(struct xenbus_device* dev,
                             const struct xenbus_device_id* id)
{
        printk("Frontend Probe Fired!\n");
        return 0;
}
//////////////////////////////////////////////////////////////////////////////
static struct xenbus_device_id devicefront_ids[] =
{
        {"mydevice"},
        {}
};
static struct xenbus_driver devicefront =
{
        .name  = "mydevice",
        .owner = THIS_MODULE,
        .ids   = devicefront_ids,
        .probe = devicefront_probe,
};
static int devicefront_init(void)
{
        printk("%s\n",__FUNCTION__);
        xenbus_register_frontend(&devicefront);
}
module_init(devicefront_init);

We should also remember to add the following to the Makefile under linux-2.6-sparse/drivers/xen:

obj-y   += devicefront/

Getting the frontend driver to fire is a bit more complicated, the following bash script should help you:

#!/bin/bash
if [ $# != 2 ]
then
        echo "Usage: $0 <device name> <frontend-id>"
else
        # Write backend information into the location the frontend will look
        # for it.
        xenstore-write /local/domain/${2}/device/${1}/0/backend-id 0
        xenstore-write /local/domain/${2}/device/${1}/0/backend \
                       /local/domain/0/backend/${1}/${2}/0
        # Write frontend information into the location the backend will look
        # for it.
        xenstore-write /local/domain/0/backend/${1}/${2}/0/frontend-id ${2}
        xenstore-write /local/domain/0/backend/${1}/${2}/0/frontend \
                       /local/domain/${2}/device/${1}/0
        # Set the permissions on the backend so that the frontend can
        # actually read it.
        xenstore-chmod /local/domain/0/backend/${1}/${2}/0 r
        # Write the states.  Note that the backend state must be written
        # last because it requires a valid frontend state to already be
        # written.
        xenstore-write /local/domain/${2}/device/${1}/0/state 1
        xenstore-write /local/domain/0/backend/${1}/${2}/0/state 1
fi

Here’s how to use it:

  • Startup a Xen guest that contains your frontend driver, and be sure dom0 contains the backend driver.
  • Figure out the frontend-id for the guest. This is the ID field when running xm list. Let’s say that number is 3.
  • Run the script as so: ./probetest.sh mydevice 3

That should fire both the frontend driver. (You’ll have to check /var/log/messages in the guest to verify that the probe was fired.)

Discussion

SimonKagstrom: Maybe it would be a good idea to split this document into several pages? It’s starting to be fairly long :) TimPost

TimPost: ACK, even if printed, this is hard to digest as an intro

Reference:

http://wiki.xensource.com/xenwiki/XenIntro

Read Full Post | Make a Comment ( 1 so far )

Liked it here?
Why not try sites on the blogroll...