Services

Useful Linux/Red Hat admin commands: chkconfig, rpm, firstboot, rsyslog, killproc

Posted on January 26, 2011. Filed under: Linux, Services

1. chkconfig

chkconfig provides a simple command-line tool for maintaining the /etc/rc[0-6].d directory hierarchy, relieving system administrators of the task of directly manipulating the numerous symbolic links in those directories.

examples:

chkconfig --add servicename

chkconfig --del servicename

chkconfig --list

Normally, chkconfig is used to manage the init scripts installed by packages or applications.
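What chkconfig --add actually manages is the set of S/K symlinks in the runlevel directories. Here is a hand-made sketch of that layout in a temporary directory; the real paths are /etc/rc[0-6].d, and runlevel 3 with priority 85 below is a hypothetical example of values that would normally come from a "# chkconfig: 345 85 15" header in the init script:

```shell
# Recreate the symlink layout chkconfig maintains (sketch in a temp dir).
rcroot=$(mktemp -d)
mkdir -p "$rcroot/init.d" "$rcroot/rc3.d"
touch "$rcroot/init.d/httpd"                     # the init script itself
ln -s ../init.d/httpd "$rcroot/rc3.d/S85httpd"   # "start at priority 85 in runlevel 3"
ls "$rcroot/rc3.d"                               # prints: S85httpd
rm -rf "$rcroot"
```

At boot, the rc scripts simply run the S* links in order for the entered runlevel, which is why maintaining these links by hand is so error-prone.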

2. rpm

RPM Package Manager is a package management system. The name RPM refers to two things: software packaged in the .rpm file format, and the package manager itself. RPM was intended primarily for GNU/Linux distributions; the file format is the baseline package format of the Linux Standard Base.
Originally developed by Red Hat for Red Hat Linux, RPM is now used by many GNU/Linux distributions. It has also been ported to some other operating systems, such as Novell NetWare (as of version 6.5 SP3) and IBM’s AIX as of version 4.
Originally standing for “Red Hat Package Manager”, RPM now stands for “RPM Package Manager”, a recursive acronym.

– How to view the installation/uninstallation scripts inside an RPM file?

To list the package-specific scriptlet(s) that are used as part of the installation and uninstallation processes, pass the --scripts option to the rpm command. You also need to specify the following options:

-q option : query option
-p option : query an (uninstalled) package PACKAGE_FILE. The PACKAGE_FILE may be specified as an ftp- or http-style URL, in which case the package header will be downloaded and queried.

General syntax for an uninstalled foo.rpm file:

$ rpm -qp --scripts foo.rpm
To find the installation/uninstallation scripts inside the rpm file called monit-4.9-2.el5.rf.x86_64.rpm:
$ rpm -qp --scripts monit-4.9-2.el5.rf.x86_64.rpm

General syntax for an installed package, such as httpd:

$ rpm -q --scripts httpd

3. firstboot

firstboot is the program that runs on the first boot of a Fedora or Red Hat Enterprise Linux system and lets you configure more than the installer does.

4. rsyslog

Rsyslog has become the de facto standard on modern Linux operating systems. Its high-performance log processing, database integration, modularity, and support for multiple logging protocols make it the sysadmin's logging daemon of choice. The project was started in 2004 and has evolved rapidly since.

The SIGHUP signal is used to reload rsyslogd, not to kill it.
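The reload-on-HUP behavior can be sketched with a shell trap: the process catches SIGHUP and re-reads its configuration instead of exiting. rsyslogd does the real thing; in this sketch the "reload" is just an echo, and the pidfile path in the comment is the conventional location, which may vary by distribution:

```shell
# Catch SIGHUP and "reload" instead of terminating.
trap 'echo "reloading configuration"' HUP
kill -HUP $$    # in real life: kill -HUP "$(cat /var/run/rsyslogd.pid)"
```

Contrast this with SIGTERM, which (untrapped) terminates the process; that distinction is exactly what the killproc discussion below relies on.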

5. killproc

killproc sends signals to all processes that use the specified executable. If no signal name is specified, SIGTERM is sent. If this program is not called with the name killproc, then SIGHUP is used. Note that if SIGTERM is used and does not terminate a process, the signal SIGKILL is sent after a few seconds (the default is 5 seconds; see option -t). If a program has been terminated successfully and a verified pid file was found, this pid file will be removed if the terminated process didn't already do so.
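The TERM-then-KILL escalation described above can be sketched in plain shell. The grace period here is shortened to 1 second (killproc defaults to 5; see option -t), and the child deliberately ignores SIGTERM so the SIGKILL path is actually exercised:

```shell
# Sketch of killproc's escalation: SIGTERM first, SIGKILL if still alive.
sh -c 'trap "" TERM; sleep 300' &
pid=$!
sleep 1                                 # let the child install its trap
kill -TERM "$pid"                       # polite request; this child ignores it
sleep 1                                 # grace period
if kill -0 "$pid" 2>/dev/null; then     # still running?
    kill -KILL "$pid"                   # force-terminate, as killproc would
fi
wait "$pid" 2>/dev/null || true
echo "escalated to SIGKILL"
```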

References:

http://man-wiki.net/index.php/8:killproc

http://en.wikipedia.org/wiki/RPM_Package_Manager

http://blog.gerhards.net/2008/10/new-rsyslog-hup-processing.html

http://fedoraproject.org/wiki/FirstBoot

http://www.cyberciti.biz/faq/rhel-list-package-specific-scriptlets/

http://linuxcommand.org/man_pages/chkconfig8.html

http://www.linuxjournal.com/article/4445

http://www.techrepublic.com/article/using-chkconfig-to-control-initscripts/5033660

http://krnjevic.com/wp/?p=76


up and running with Cassandra

Posted on March 21, 2010. Filed under: Java, Linux, MySQL, PHP, Services

Cassandra is a hybrid non-relational database in the same class as Google’s BigTable. It is more featureful than a key/value store like Dynomite, but supports fewer query types than a document store like MongoDB.

Cassandra was started by Facebook and later transferred to the open-source community. It is an ideal runtime database for web-scale domains like social networks.

This post is both a tutorial and a “getting started” overview. You will learn about Cassandra’s features, data model, API, and operational requirements—everything you need to know to deploy a Cassandra-backed service.

Jan 8, 2010: post updated for Cassandra gem 0.7 and Cassandra version 0.5.

features

There are a number of reasons to choose Cassandra for your website. Compared to other databases, three big features stand out:

  • Flexible schema: with Cassandra, like a document store, you don’t have to decide what fields you need in your records ahead of time. You can add and remove arbitrary fields on the fly. This is an incredible productivity boost, especially in large deployments.
  • True scalability: Cassandra scales horizontally in the purest sense. To add more capacity to a cluster, turn on another machine. You don’t have to restart any processes, change your application queries, or manually relocate any data.
  • Multi-datacenter awareness: you can adjust your node layout to ensure that if one datacenter burns in a fire, an alternative datacenter will have at least one full copy of every record.

Some other features that help put Cassandra above the competition:

  • Range queries: unlike most key/value stores, you can query for ordered ranges of keys.
  • List datastructures: super columns add a 5th dimension to the hybrid model, turning columns into lists. This is very handy for things like per-user indexes.
  • Distributed writes: you can read and write any data to anywhere in the cluster at any time. There is never any single point of failure.

installation

You need a Unix system. If you are using Mac OS 10.5, all you need is Git. Otherwise, you need to install Java 1.6, Git 1.6, Ruby, and Rubygems in some reasonable way.

Start a terminal and run:

sudo gem install cassandra

If you are using Mac OS, you need to export the following environment variables:

export JAVA_HOME="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home"
export PATH="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin:$PATH"

Now you can build and start a test server with cassandra_helper:

cassandra_helper cassandra

It runs!

live demo

The above script boots the server with a schema that we can interact with. Open another terminal window and start irb, the Ruby shell:

irb

In the irb prompt, require the library:

require 'rubygems'
require 'cassandra'
include Cassandra::Constants

Now instantiate a client object:

twitter = Cassandra.new('Twitter')

Let’s insert a few things:

user = {'screen_name' => 'buttonscat'}
twitter.insert(:Users, '5', user)  

tweet1 = {'text' => 'Nom nom nom nom nom.', 'user_id' => '5'}
twitter.insert(:Statuses, '1', tweet1)

tweet2 = {'text' => '@evan Zzzz....', 'user_id' => '5', 'reply_to_id' => '8'}
twitter.insert(:Statuses, '2', tweet2)

Notice that the two status records do not have all the same columns. Let’s go ahead and connect them to our user record:

twitter.insert(:UserRelationships, '5', {'user_timeline' => {UUID.new => '1'}})
twitter.insert(:UserRelationships, '5', {'user_timeline' => {UUID.new => '2'}})

The UUID.new call creates a collation key based on the current time; our tweet ids are stored in the values.

Now we can query our user’s tweets:

timeline = twitter.get(:UserRelationships, '5', 'user_timeline', :reversed => true)
timeline.map { |time, id| twitter.get(:Statuses, id, 'text') }
# => ["@evan Zzzz....", "Nom nom nom nom nom."]

Two tweet bodies, returned in recency order—not bad at all. In a similar fashion, each time a user tweets, we could loop through their followers and insert the status key into their follower’s home_timeline relationship, for handling general status delivery.

the data model

Cassandra is best thought of as a 4 or 5 dimensional hash. The usual way to refer to a piece of data is as follows: a keyspace, a column family, a key, an optional super column, and a column. At the end of that chain lies a single, lonely value.

Let’s break down what these layers mean.

  • Keyspace (also confusingly called “table”): the outer-most level of organization. This is usually the name of the application. For example, 'Twitter' and 'Wordpress' are both good keyspaces. Keyspaces must be defined at startup in the storage-conf.xml file.
  • Column family: a slice of data corresponding to a particular key. Each column family is stored in a separate file on disk, so it can be useful to put frequently accessed data in one column family, and rarely accessed data in another. Some good column family names might be :Posts, :Users and :UserAudits. Column families must be defined at startup.
  • Key: the permanent name of the record. You can query over ranges of keys in a column family, like :start => '10050', :finish => '10070'—this is the only index Cassandra provides for free. Keys are defined on the fly.

After the column family level, the organization can diverge—this is a feature unique to Cassandra. You can choose either:

  • A column: this is a tuple with a name and a value. Good columns might be 'screen_name' => 'lisa4718' or 'Google' => 'http://google.com'. It is common to not specify a particular column name when requesting a key; the response will then be an ordered hash of all columns. For example, querying for (:Users, '174927') might return:
    {'name' => 'Lisa Jones',
     'gender' => 'f',
     'screen_name' => 'lisa4718'}

    In this case, name, gender, and screen_name are all column names. Columns are defined on the fly, and different records can have different sets of column names, even in the same keyspace and column family. This lets you use the column name itself as either structure or data. Columns can be stored in recency order, or alphabetical by name, and all columns keep a timestamp.

  • A super column: this is a named list. It contains standard columns, stored in recency order. Say Lisa Jones has bookmarks in several categories. Querying (:UserBookmarks, '174927') might return:
    {'work' => {
        'Google' => 'http://google.com',
        'IBM' => 'http://ibm.com'},
     'todo' => {...},
     'cooking' => {...}}

    Here, work, todo, and cooking are all super column names. They are defined on the fly, and there can be any number of them per row. :UserBookmarks is the name of the super column family. Super columns are stored in alphabetical order, with their sub columns physically adjacent on the disk.

Super columns and standard columns cannot be mixed at the same (4th) level of dimensionality. You must define at startup which column families contain standard columns, and which contain super columns with standard columns inside them.

Super columns are a great way to store one-to-many indexes to other records: make the sub column names TimeUUIDs (or whatever you’d like to use to sort the index), and have the values be the foreign key. We saw an example of this strategy in the demo, above.

If this is confusing, don’t worry. We’ll now look at two example schemas in depth.

twitter schema

Here is the schema definition we used for the demo, above. It is based on Eric Florenzano’s Twissandra:

<Keyspace Name="Twitter">
  <ColumnFamily CompareWith="UTF8Type" Name="Statuses" />
  <ColumnFamily CompareWith="UTF8Type" Name="StatusAudits" />
  <ColumnFamily CompareWith="UTF8Type" Name="StatusRelationships"
    CompareSubcolumnsWith="TimeUUIDType" ColumnType="Super" />
  <ColumnFamily CompareWith="UTF8Type" Name="Users" />
  <ColumnFamily CompareWith="UTF8Type" Name="UserRelationships"
    CompareSubcolumnsWith="TimeUUIDType" ColumnType="Super" />
</Keyspace>

What could be in StatusRelationships? Maybe a list of users who favorited the tweet? Having a super column family for both record types lets us index each direction of whatever many-to-many relationships we come up with.

Here’s how the data is organized:


Cassandra lets you distribute the keys across the cluster either randomly, or in order, via the Partitioner option in the storage-conf.xml file.

For the Twitter application, if we were using the order-preserving partitioner, all recent statuses would be stored on the same node. This would cause hotspots. Instead, we should use the random partitioner.

Alternatively, we could preface the status keys with the user key, which has less temporal locality. If we used user_id:status_id as the status key, we could do range queries on the user fragment to get tweets-by-user, avoiding the need for a user_timeline super column.

multi-blog schema

Here’s another schema, suggested to me by Jonathan Ellis, the primary Cassandra maintainer. It’s for a multi-tenancy blog platform:

<Keyspace Name="Multiblog">
  <ColumnFamily CompareWith="TimeUUIDType" Name="Blogs" />
  <ColumnFamily CompareWith="TimeUUIDType" Name="Comments"/>
</Keyspace>

Imagine we have a blog named ‘The Cutest Kittens’. We will insert a row when the first post is made as follows:

require 'rubygems'
require 'cassandra'
include Cassandra::Constants

multiblog = Cassandra.new('Multiblog')

multiblog.insert(:Blogs, 'The Cutest Kittens',
  { UUID.new =>
    '{"title":"Say Hello to Buttons Cat","body":"Buttons is a cute cat."}' })

UUID.new generates a unique, sortable column name, and the JSON hash contains the post details. Let’s insert another:

multiblog.insert(:Blogs, 'The Cutest Kittens',
  { UUID.new =>
    '{"title":"Introducing Commie Cat","body":"Commie is also a cute cat"}' })

Now we can find the latest post with the following query:

post = multiblog.get(:Blogs, 'The Cutest Kittens', :reversed => true).to_a.first

On our website, we can build links based on the readable representation of the UUID:

guid = post.first.to_guid
# => "b06e80b0-8c61-11de-8287-c1fa647fd821"

If the user clicks this string in a permalink, our app can find the post directly via:

multiblog.get(:Blogs, 'The Cutest Kittens', :start => UUID.new(guid), :count => 1)

For comments, we’ll use the post UUID as the outermost key:

multiblog.insert(:Comments, guid,
  {UUID.new => 'I like this cat. - Evan'})
multiblog.insert(:Comments, guid,
  {UUID.new => 'I am cuter. - Buttons'})

Now we can get all comments (oldest first) for a post by calling:

multiblog.get(:Comments, guid)

We could paginate them by passing :start with a UUID. See this presentation to learn more about token-based pagination.

We have sidestepped two problems with this data model: we don’t have to maintain separate indexes for any lookups, and the posts and comments are stored in separate files, where they don’t cause as much write contention. Note that we didn’t need to use any super columns, either.

storage layout and api comparison

The storage strategy for Cassandra’s standard model is the same as BigTable’s. Here’s a comparison chart:

                                 multi-file          |  per-file       |  intra-file
  Relational                     server, database    |  table*         |  primary key, column value
  BigTable                       cluster, table      |  column family  |  key, column name, column value
  Cassandra, standard model      cluster, keyspace   |  column family  |  key, column name, column value
  Cassandra, super column model  cluster, keyspace   |  column family  |  key, super column name, column name, column value

  * With fixed column names.

Column families are stored in column-major order, which is why people call BigTable a column-oriented database. This is not the same as a column-oriented OLAP database like Sybase IQ—it depends on what you use the column names for.


In row-orientation, the column names are the structure, and you think of the column families as containing keys. This is the convention in relational databases.


In column-orientation, the column names are the data, and the column families are the structure. You think of the key as containing the column family, which is the convention in BigTable. (In Cassandra, super columns are also stored in column-major order—all the sub columns are together.)

In Cassandra’s Ruby API, parameters are expressed in storage order, for clarity:

  Relational                     SELECT `column` FROM `database`.`table` WHERE `id` = key;
  BigTable                       table.get(key, "column_family:column")
  Cassandra, standard model      keyspace.get("column_family", key, "column")
  Cassandra, super column model  keyspace.get("column_family", key, "super_column", "column")

Note that Cassandra’s internal Thrift interface mimics BigTable in some ways, but this is being changed.

going to production

Cassandra is an alpha product and could, theoretically, lose your data. In particular, if you change the schema specified in the storage-conf.xml file, you must follow these instructions carefully, or corruption will occur (this is going to be fixed). Also, the on-disk storage format is expected to change in version 0.4.0. After that things will be a bit more stable.

The biggest deployment is at Facebook, where hundreds of terabytes of token indexes are kept in about a hundred Cassandra nodes. However, their use case allows the data to be rebuilt if something goes wrong. Currently there are no known deployments of non-transient data. Proceed carefully, keep a backup in an unrelated storage engine…and submit patches if things go wrong.

That aside, here is a guide for deploying a production cluster:

  • Hardware: get a handful of commodity Linux servers. 16GB memory is good; Cassandra likes a big filesystem buffer. You don’t need RAID. If you put the commitlog file and the data files on separate physical disks, things will go faster. Don’t use EC2 or friends except for testing; the virtualized I/O is too slow.
  • Configuration: in the storage-conf.xml schema file, set the replication factor to 3. List the IP address of one of the nodes as the seed. Set the listen address to the empty string, so the hosts will resolve their own IPs. Now, adjust the contents of cassandra.in.sh for your various paths and JVM options—for a 16GB node, set the JVM heap to 4GB.
  • Deployment: build a package of Cassandra itself and your configuration files, and deliver it to all your servers (I use Capistrano for this). Start the servers by setting CASSANDRA_INCLUDE in the environment to point to your cassandra.in.sh file, and run bin/cassandra. At this point, you should see join notices in the Cassandra logs:
    Cassandra starting up...
    Node 10.224.17.13:7001 has now joined.
    Node 10.224.17.14:7001 has now joined.

    Congratulations! You have a cluster. Don’t forget to turn off debug logging in the log4j.properties file.

  • Visibility: you can get a little more information about your cluster via the included bin/nodeprobe tool:
    $ bin/nodeprobe --host 10.224.17.13 ring
    Token(124007023942663924846758258675932114665)  3 10.224.17.13  |<--|
    Token(106858063638814585506848525974047690568)  3 10.224.17.19  |   ^
    Token(141130545721235451315477340120224986045)  3 10.224.17.14  |-->|

    Cassandra also exposes various statistics over JMX.

Note that your client machines (not servers!) must have accurate clocks for Cassandra to resolve write conflicts properly. Use NTP.

conclusion

There is a misperception that if someone advocates a non-relational database, they either don’t understand SQL optimization, or they are generally a hater. This is not the case.

It is reasonable to seek a new tool for a new problem, and database problems have changed with the rise of web-scale distributed systems. This does not mean that SQL as a general-purpose runtime and reporting tool is going away. However, at web-scale, it is more flexible to separate the concerns. Runtime object lookups can be handled by a low-latency, strict, self-managed system like Cassandra. Asynchronous analytics and reporting can be handled by a high-latency, flexible, un-managed system like Hadoop. And in neither case does SQL lend itself to sharding.

I think that Cassandra is the most promising current implementation of a runtime distributed database, but much work remains to be done. We’re beginning to use Cassandra at Twitter, and here’s what I would like to happen real-soon-now:

  • Interface cleanup: the Thrift API for Cassandra is incomplete and inconsistent, which makes writing clients very irritating.
    Done!
  • Online migrations: restarting the cluster 3 times to add a column family is silly.
  • ActiveModel or DataMapper adapter: for interaction with business objects in Ruby.
    Done! Michael Koziarski on the Rails core team wrote an ActiveModel adapter.
  • Scala client: for interoperability with JVM middleware.

Go ahead and jump on any of those projects—it’s a chance to get in on the ground floor.

Cassandra has excellent performance. There are some benchmark results for version 0.5 at the end of the Yahoo performance study.


Google Protocol Buffers and other data interchange formats

Posted on March 20, 2010. Filed under: Computer Science, Programming, Services

We’ve been planning on moving to a new messaging protocol for a while. We’ve looked at a lot of different solutions, but we had enough issues with every proposed solution that we haven’t made a decision. JR Boyens pointed us to Google’s announcement Protocol Buffers: Google’s Data Interchange Format in July. I glanced at it, but then it got lost in the everyday noise. Recent work on a project brought it back to our attention. I like what I see.

As part of a new offering we decided to add in our new messaging direction. We’re processing realtime voice conversations. Some of our major considerations are:

  1. Latency and Performance – Latency matters to us. A LOT. I’m including not only network transport but also memory and CPU: the total time it takes for a message to get from its native format in the sender to its native format in the receiver. We’re dealing with real-time voice communications; too much latency and, in the best case, the callers’ experience suffers. Our labor model is also sensitive to even small changes in latency. The smaller the latency, the more efficient we are, the happier our clients’ customers are, and the more money we make. As greedy capitalists we see that as a good thing.
  2. Versioning – Our current system has no versioning. Yeah, short-sighted on my part. We have to fix it, so versioning is required for any new message protocol. Protobuf fits our needs on this nicely. Different versions have to coexist and interoperate. We could do this on a different layer than the messaging, but it makes sense to me to keep it at this level.
  3. Java and C++ – Language independence is cool and all, but in practice, if the protocol supports Java and C++ we’re good to go. Maybe I’m being a bit myopic, but my feeling is that the likelihood that whatever we choose will expand to support more languages in the future is very high if it supports several today.
  4. Internal – We control the endpoints. I don’t care if the schema is external to the data package. In fact, for our use case that’s a plus. For any external services we’ll still expose those using the usual standards. Internally, our applications will be using PB as their messaging format.

In short, we’re all about high-volume, low-latency messages.

Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the “old” format.

Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers:

  • are simpler
  • are 3 to 10 times smaller
  • are 20 to 100 times faster
  • are less ambiguous
  • generate data access classes that are easier to use programmatically
Seems like a decent fit. OK, actually an awesome fit. One of our developers has been doing some testing. It’s impressive.
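To make the versioning point concrete, here is a minimal, hypothetical message definition in proto2 syntax. The message and field names are invented; the property that matters is that field numbers, not names, go on the wire, so a newer writer can add optional fields without breaking older readers:

```proto
// Hypothetical sketch; names and fields are invented for illustration.
message CallEvent {
  required string call_id  = 1;  // the wire format carries the field number, not the name
  required int64  start_ms = 2;
  optional string agent_id = 3;  // added in a later version; old parsers skip unknown fields
}
```

Running protoc over a definition like this generates the Java and C++ data access classes mentioned above.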

Alternatives

JSON

To me, protobuf feels like compiled JSON. They are very similar. The main difference is that JSON sends data over the wire in text format versus protobuf’s binary format. The latter has the advantage of a smaller size and being faster for a computer to parse.

ASN.1

Why not ASN.1? It seems like one of the best choices: well understood and widely used. Sure, the full ASN.1 specification is complex, but we’d only need a small subset. I’m still struggling with this one a bit. Tool support seems a bit better in protobuf, and it’s definitely simpler.

Thrift

Facebook’s Thrift is very similar to protobuf. Not surprising since the main author interned at Google. It’s a strong offering and recently became an Apache project. Nice stuff. Stuart Sierra has a nice comparison on his blog, Thrift vs. Protocol Buffers. Another worthy contender but not a big enough advantage to stop the internal momentum protobuf already has.

HDF5

The HDF wiki has an entry Google Protocol Buffers and HDF5 that concludes:

In summary, Protocol Buffers and HDF5 were designed to serve different kinds of data intensive applications: a network based transient message system, and a high performance data storage system for very large datasets such as multi-dimensional images, respectively. That said, both 1) offer open source technologies that can reduce data management headaches for individual developers and projects, 2) increase the ability to share data through the use of well-defined binary formats and supporting libraries that run on a variety of platforms, and 3) provide the ability to access data stored with “older” versions of the data structures.

Different design goals. HDF5 doesn’t fit our needs as well.

Hessian

The Hessian protocol has the following design goals:

  • It must not require external IDL or schema definitions, i.e. the protocol should be invisible to application code.
  • It must be language-independent.
  • It must be simple so it can be effectively tested and implemented.
  • It must be as fast as possible.
  • It must be as compact as possible.
  • It must support Unicode strings.
  • It must support 8-bit binary data (i.e. without encoding or using attachments.)
  • It must support encryption, compression, signature, and transaction context envelopes.

I still haven’t figured out how (or if) you can version your messages. Can you add and remove fields and still have compatibility, backwards and forwards? Cool effort, but it still feels very rough in places. Another worthy effort to consider.

ZeroC ICE

At its core is:

The Ice core library. Among many other features, the Ice core library manages all the communication tasks using a highly efficient protocol (including protocol compression and support for both TCP and UDP), provides a flexible thread pool for multi-threaded servers, and offers additional functionality that supports extreme scalability with potentially millions of Ice objects.

ICE is a comprehensive middleware system. It can even use PB as its messaging layer, but its own messaging layer doesn’t handle adding or removing fields as well as PB does. We don’t need the RPC side of ICE. Just not a good fit for us.

SDO

Service Data Objects provides a rather ambitious messaging architecture. Its concerns aren’t speed and efficiency. The SDO V2.1 White Paper states:

SDO is intended to create a uniform data access layer that provides a data access solution for heterogeneous data sources in an easy-to-use manner that is amenable to tooling and frameworks.

Interesting, not a fit.

Cisco Etch

Its primary focus is an RPC implementation, not a messaging protocol. Steve Vinoski summarized it nicely in Just What We Need: Another RPC Package. In fairness, Steve also had some negative thoughts on PB in Protocol Buffers: Leaky RPC. However, his concerns are around the undefined RPC features Google put in PB, not the IDL-type aspects of PB.

Some other XML based protocol

Yeah, I know: the problem with XML isn’t XML, it’s the parsers. Cute argument. Getting my message from native format on one system to native format on another as fast as possible is what matters to me, so oddly enough, parsers are part of the equation. Yeah, JAXB is fast, but just how fast?

Remember, we’re all about high-volume, low-latency messages. It’s not a focus for XML. Yep, no one will take issue with that statement!

Binary XML

Enough said. Next.

Corba/IIOP

A well-defined IDL, a bit complicated (because it addresses a wide range of issues). Built into the JDK! Not designed for speed or efficiency. Bad fit.

Conclusion

Several good choices. I’m sure there are others I missed. We’re going with protobuf. Early tests by our developers have been very impressive. Google fanbois can rejoice and the Google haters can gripe. In the meantime, we’ve got a job to do.

Reference:

http://dewpoint.snagdata.com/2008/10/21/google-protocol-buffers/ (Original)


CKEditor

Posted on February 26, 2010. Filed under: Javascript, Services

CKEditor (formerly FCKeditor) is an open source WYSIWYG text editor from CKSource that can be used in web pages. It aims to be lightweight and requires no client-side installation.

Its core code is written in JavaScript, with server-side interfaces for Active-FoxPro, ASP, ASP.NET, ColdFusion, Java, JavaScript, Lasso, Perl, PHP, and Python.

CKEditor is compatible with most Internet browsers, including Internet Explorer 6.0+ (Windows), Firefox 2.0+, Safari 3.0+, Google Chrome (Windows), Opera 9.50+, and Camino 1.0+ (Apple).

Reference:

http://en.wikipedia.org/wiki/CKEditor

http://ckeditor.com/


Major and Minor Numbers

Posted on January 18, 2010. Filed under: Linux, Services

One of the basic features of the Linux kernel is that it abstracts the handling of devices. All hardware devices look like regular files; they can be opened, closed, read and written using the same, standard, system calls that are used to manipulate files. Every device in the system is represented by a file. For block (disk) and character devices, these device files are created by the mknod command and they describe the device using major and minor device numbers. Network devices are also represented by device special files but they are created by Linux as it finds and initializes the network controllers in the system.

To UNIX, everything is a file. To write to the hard disk, you write to a file. To read from the keyboard is to read from a file. To store backups on a tape device is to write to a file. Even to read from memory is to read from a file. If the file from which you are trying to read or to which you are trying to write is a “normal” file, the process is fairly easy to understand: the file is opened and you read or write data. If, however, the device you want to access is a special device file (also referred to as a device node), a fair bit of work needs to be done before the read or write operation can begin.

One key aspect of understanding device files lies in the fact that different devices behave and react differently. There are no keys on a hard disk and no sectors on a keyboard, though you can read from both. The system, therefore, needs a mechanism whereby it can distinguish between the various types of devices and behave accordingly.

To access a device accordingly, the operating system must be told what to do. Obviously, the manner in which the kernel accesses a hard disk will be different from the way it accesses a terminal. Both can be read from and written to, but that’s about where the similarities end. To access each of these totally different kinds of devices, the kernel needs to know that they are, in fact, different.

Inside the kernel are functions for each of the devices the kernel is going to access. All the routines for a specific device are jointly referred to as the device driver. Each device on the system has its own device driver. Within each device driver are the functions that are used to access the device. For devices such as a hard disk or terminal, the system needs to be able to (among other things) open the device, write to the device, read from the device, and close the device. Therefore, the respective drivers will contain the routines needed to open, write to, read from, and close (among other things) those devices.

The kernel needs to be told how to access the device. Not only does the kernel need to be told what kind of device is being accessed but also any special information, such as the partition number if it’s a hard disk or density if it’s a floppy, for example. This is accomplished by the major number and minor number of that device.

All devices controlled by the same device driver have a common major device number. The minor device numbers are used to distinguish between different devices and their controllers. Linux maps the device special file passed in system calls (say to mount a file system on a block device) to the device’s device driver using the major device number and a number of system tables, for example the character device table, chrdevs. The major number is actually the offset into the kernel’s device driver table, which tells the kernel what kind of device it is (whether it is a hard disk or a serial terminal). The minor number tells the kernel special characteristics of the device to be accessed. For example, the second hard disk has a different minor number than the first. The COM1 port has a different minor number than the COM2 port, each partition on the primary IDE disk has a different minor device number, and so forth. So, for example, /dev/hda2, the second partition of the primary IDE disk has a major number of 3 and a minor number of 2.
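
The packing of a major/minor pair into a single device number can be seen from Python's os module, which provides helpers for combining and splitting device numbers (a sketch; the exact bit layout of st_rdev is kernel-specific, which is why the helpers should be used rather than manual shifting):

```python
import os

# Pack the major/minor pair for /dev/hda2 (major 3, minor 2) into a
# single device number, as stored in an inode's st_rdev field.
dev = os.makedev(3, 2)

# Unpack it again: the kernel-specific encoding is hidden behind the API.
assert os.major(dev) == 3
assert os.minor(dev) == 2
```

On a real system, os.stat("/dev/hda2").st_rdev would yield the same packed number for that device node.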

It is through this table that the kernel reaches the routines that, in turn, access the physical hardware. Once the kernel has determined what kind of device it is talking to, it determines the specific device, the specific location, or other characteristics of the device by means of the minor number.

The major number for the hd (IDE) driver is hard-coded at 3. The minor numbers have the format

(<unit>*64)+<part>

where <unit> is the IDE drive number on the first controller, either 0 or 1, which is then multiplied by 64. That means that all hd devices on the first IDE drive have a minor number less than 64. <part> is the partition number, which can be anything from 1 to 20. Which minor numbers you will be able to access will depend on how many partitions you have and what kind they are (extended, logical, etc.). The minor number of the device node that represents the whole disk is 0. This has the node name hda, whereas the other device nodes have a name equal to their minor number (i.e., /dev/hda6 has a minor number 6).

If you were to have a second IDE on the first controller, the unit number would be 1. Therefore, all of the minor numbers would be 64 or greater. The minor number of the device node representing the whole disk is 1. This has the node name hdb, whereas the other device nodes have a name equal to their minor number plus 64 (i.e., /dev/hdb6 has a minor number 70).

If you have more than one IDE controller, the principle is the same. The only difference is that the major number is 22.
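
The IDE minor-number rule above can be captured in a few lines of Python (a sketch based solely on the (<unit>*64)+<part> formula given here, with the standard hda/hdb naming):

```python
def ide_minor(unit, part):
    """Minor number for partition <part> on IDE drive <unit> (0 or 1)."""
    return unit * 64 + part

def ide_name(unit, part):
    """Device node name: hda/hdb for the whole disk, hdaN/hdbN otherwise."""
    name = "hd" + "ab"[unit]
    return name if part == 0 else name + str(part)

# /dev/hda2 has minor 2; /dev/hdb6 has minor 64 + 6 = 70, as described above
assert ide_minor(0, 2) == 2 and ide_name(0, 2) == "hda2"
assert ide_minor(1, 6) == 70 and ide_name(1, 6) == "hdb6"
```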

For SCSI devices, the scheme is a little different. When you have a SCSI host adapter, you can have up to seven hard disks. Therefore, we need a different way to refer to the partitions. In general, the format of the device nodes is

sd<drive><partition>

where sd refers to the SCSI disk driver, <drive> is a letter for the physical drive, and <partition> is the partition number. As with the hd devices, a node without a partition number refers to the entire disk; for example, the device sda refers to the whole first disk.

The major number for all SCSI drives is 8. The minor number is based on the drive number, which is multiplied by 16 instead of 64, like the IDE drives. The partition number is then added to this number to give the minor.

The partition numbers are not as simple to figure out. Partition 0 is for the whole disk (i.e., sda). The four DOS primary partitions are numbered 1-4. Then the extended partitions are numbered 5-8. We then add 16 for each drive. For example:

brw-rw----   1 root     disk       8,  22 Sep 12  1994
/dev/sdb6

Because b is after the sd, we know that this is on the second drive. Subtracting 16 from the minor, we get 6, which matches the partition number. Because it is between 5 and 8, we know that this is on an extended partition. This is the second partition on the first extended partition.
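
Reversing the SCSI scheme, a minor number can be decoded back into a drive name and partition (a sketch assuming the 16-minors-per-drive layout described above):

```python
def scsi_decode(minor):
    """Split a SCSI disk minor into (device name, partition number)."""
    drive, part = divmod(minor, 16)   # 16 minors per drive
    return "sd" + "abcdefg"[drive], part  # up to seven disks per adapter

# The ls -l example above: major 8, minor 22 is the second drive, partition 6
assert scsi_decode(22) == ("sdb", 6)
# Partition 0 means the whole disk, so minor 0 is sda itself
assert scsi_decode(0) == ("sda", 0)
```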

The floppy devices have an even more peculiar way of assigning minor numbers. The major number is fairly easy: it's 2. Because the names are a little easier to figure out, let's start with them. As you might guess, the device names all begin with fd. The general format is

fd<drive><density><capacity>

where <drive> is the drive number (0 for A:, 1 for B:), <density> is the density (d-double, h-high), and <capacity> is the capacity of the drive (360Kb, 1440Kb). You can also tell the size of the drive by the density letter: a lowercase letter indicates a 5.25″ drive and an uppercase letter indicates a 3.5″ drive. For example, a low-density 5.25″ drive with a capacity of 360Kb would look like

fd0d360

If your second drive was a high-density 3.5″ drive with a capacity of 1440Kb, the device would look like

fd1H1440

What the minor numbers represent is a fairly complicated matter. In the fd(4) man-page there is an explanatory table, but it is not obvious from the table why specific minor numbers go with each device. The problem is that there is no logical progression as with the hard disks. For example, there was never a 3.5″ drive with a capacity of 1200Kb, nor has there been a 5.25″ drive with a capacity of 1.44Mb. So you will never find a device with H1200 or h1440. To figure out the device names, the best thing is to look at the man-page.

The terminal devices come in a few forms. The first is the system console, which comprises the devices tty0-tty?. You can have up to 64 virtual terminals on your system console, although most systems that I have seen are limited to five or six. All console terminals have a major number of 4. As we discussed earlier, you can reach the low-numbered ones with ALT-Fn, where n is the number of the function key. So ALT-F4 gets you to the fourth virtual console. Both the minor number and the tty number are based on the function key, so /dev/tty4 has a minor number of 4 and you get to it with ALT-F4. (Check the console(4) man-page to see how to use and get to the other virtual terminals.)

Serial devices can also have terminals hooked up to them. These terminals normally use the devices /dev/ttySn, where n is the number of the serial port (0, 1, 2, etc.). These also have a major number of 4, but their minor numbers start at 64. (That's why the virtual consoles are limited to minors below 64.) The minor number is this base of 64 plus the serial port number. Therefore, the minor number of the third serial port would be 64+3=67.

Related to these devices are the modem control devices, which are used to access modems. These have the same minor numbers but have a major number of 5. The names are also based on the device number. Therefore, the modem device attached to the third serial port has a minor number of 64+3=67 and its name is cua3.
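
The console, serial, and modem-control numbering described above can be summarized in one small lookup (a sketch covering only the cases in this section):

```python
def char_dev_name(major, minor):
    """Name a terminal-family device from its major/minor pair (sketch)."""
    if major == 4 and minor < 64:
        return "tty%d" % minor          # virtual consoles, minors 0-63
    if major == 4 and 64 <= minor < 128:
        return "ttyS%d" % (minor - 64)  # serial ports, base 64
    if major == 5 and minor >= 64:
        return "cua%d" % (minor - 64)   # modem control devices
    raise ValueError("case not covered by this sketch")

assert char_dev_name(4, 4) == "tty4"    # the console reached with ALT-F4
assert char_dev_name(4, 67) == "ttyS3"  # serial device with minor 64 + 3
assert char_dev_name(5, 67) == "cua3"   # modem device on the same port
```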

Another device with a major number of 5 is /dev/tty, which has a minor number of 0. This is a special device and is referred to as the “controlling terminal”. This is the terminal device for the currently running process. Therefore, no matter where you are and no matter what you are running, sending something to /dev/tty will always appear on your screen.

The pseudo-terminals (those that you use with network connections or X) actually come in pairs. The “slave” is the device at which you type and has a name like ttyp?, where ? is the tty number. The device that the process sees is the “master” and has a name like ptyn, where n is the device number. These also have a major number of 4. However, the master devices all have minor numbers based on 128; therefore, pty0 has a minor number of 128 and pty9 has a minor number of 137 (128+9). The slave devices have minor numbers based on 192, so the slave device ttyp0 has a minor number of 192. Note that the tty numbers do not increase numerically after 9 but use the letters a-f.
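
The master/slave pty numbering, including the hex-style suffix after 9, can be sketched as follows (assuming the 0-9 then a-f suffix scheme described above):

```python
SUFFIX = "0123456789abcdef"

def pty_pair(n):
    """Return ((master name, master minor), (slave name, slave minor))."""
    s = SUFFIX[n]
    return ("pty" + s, 128 + n), ("ttyp" + s, 192 + n)

# pty9 has minor 128 + 9 = 137; its slave ttyp9 has minor 192 + 9 = 201
assert pty_pair(9) == (("pty9", 137), ("ttyp9", 201))
# after 9 the name suffix continues with the letters a-f
assert pty_pair(10)[0] == ("ptya", 138)
```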

Other oddities with the device numbering and naming scheme are the memory devices. These have a major number of 1. For example, the device to access physical memory is /dev/mem with a minor number of 1. The kernel memory device is /dev/kmem, and it has a minor number of 2. The device used to access IO ports is /dev/port and it has a minor number of 4.

What about minor number 3? This is the device /dev/null, which is the null device. If you direct output to this device, it goes into nothing, or just disappears. Often the error output of a command is directed here, as the errors generated are uninteresting. If you redirect input from /dev/null, you get nothing as well. Often I do something like this:

cat /dev/null > file_name

This device is also used a lot in shell scripts where you do not want to see any output or error messages. You can then redirect standard out or standard error to /dev/null and the messages disappear. For details on standard out and error, see the section on redirection.

If file_name doesn’t exist yet, it is created with a length of zero. If it does exist, the file is truncated to 0 bytes.

The device /dev/zero has a major number of 1 and a minor number of 5. It behaves similarly to /dev/null in that redirecting output to this device makes it disappear. However, if you redirect input from this device, you get an unending stream of zeros. Not the character “0”, which has an ASCII value of 48, but the NUL byte, ASCII 0.
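
On a Linux system both behaviours are easy to verify from Python (a sketch; it touches the real /dev/null and /dev/zero nodes, so it assumes they exist):

```python
# Writing to /dev/null: the data simply disappears.
with open("/dev/null", "wb") as sink:
    sink.write(b"these bytes go nowhere")

# Reading from /dev/null yields end-of-file immediately ...
with open("/dev/null", "rb") as f:
    assert f.read() == b""

# ... while /dev/zero yields an endless stream of NUL (ASCII 0) bytes.
with open("/dev/zero", "rb") as f:
    assert f.read(4) == b"\x00\x00\x00\x00"
```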

Are those all the devices? Unfortunately not. However, I hope this has given you a start on how device names and minor numbers are organized. The file <linux/major.h> contains a list of the currently used (at least well-known) major numbers. Some nonstandard packages might add major numbers of their own; up to this point, they have been fairly good about not stomping on the existing major numbers.

As far as the minor numbers go, check out the various man-pages. If there is a man-page for a specific device, it will probably be under the name of the driver, which appears in major.h and is often the first letters of the device name. For example, the parallel (printer) devices are lp?, so check out man lp.

The best overview of all the major and minor numbers is in the /usr/src/linux/Documentation directory. The devices.txt is considered the “authoritative” source for this information.

References:

http://www.linux-tutorial.info/modules.php?name=MContent&pageid=94

http://www.linux-tutorial.info/modules.php?name=MContent&obj=glossary&term=man-page

http://www.kernel.org/pub/linux/docs/device-list/devices.txt

http://publib.boulder.ibm.com/infocenter/dsichelp/ds8000ic/index.jsp?topic=/com.ibm.storage.ssic.help.doc/f2c_linuxdevnaming_2hsag8.html

http://docsrv.sco.com/HDK_concepts/ddT_majmin.html

http://burks.brighton.ac.uk/burks/linux/rute/node18.htm

http://www.makelinux.info/ldd3/chp-3-sect-2.shtml

http://www.thegeekstuff.com/2009/06/how-to-identify-major-and-minor-number-in-linux/


XenIntro

Posted on January 8, 2010. Filed under: C/C++, Linux, Services |

Xen Intro - version 1.0

Contents

  1. Introduction
  2. Xen and IA32 Protection Modes
  3. The Xend daemon:
  4. The Xen Store:
  5. VT-x (virtual technology) processors – support in Xen
  6. Vmxloader
  7. VT-i (virtual technology) processors – support in Xen
  8. AMD SVM
  9. Xen On Solaris
  10. Step by step example of creating guest OS with Virtual Machine Manager in Fedora Core 6
  11. Physical Interrupts
  12. Backend Drivers:
  13. Migration and Live Migration:
  14. Creating of a domain – behind the scenes:
  15. HyperCalls Mapping to code Xen 3.0.2
  16. Virtualization and the Linux Kernel
  17. Pre-Virtualization
  18. Xen Storage
  19. kvm – Kernel-based Virtualization Driver
  20. Tip: How to build Xen with your own tar ball
  21. Xen in the Linux Kernel
  22. VMI : Virtual Machine Interface
  23. Links
  24. Adding new device and triggering the probe() functions
    1. deviceback.c
    2. xenbus.c
    3. common.h
    4. Makefile
  25. Adding a frontend device
    1. Makefile
    2. devicefront.c
  26. Discussion

Introduction

All of the following text refers to the x86 platform of xen-unstable, unless explicitly said otherwise. We deal only with Xen on Linux 2.6; we are not dealing at all with Xen on Linux 2.4 (and as far as we know, domain 0 is intended in the future to be based ONLY on 2.6). Moreover, the 2.4 Linux tree is currently removed from the Xen tree (changeset 7263:f1abe953e401 from 8.10.05), but it may come back when some problems are fixed.

This document deals only with the Xen 3.0 version unless explicitly said otherwise.

This is not intended to be a full and detailed documentation of the Xen project but we hope it will be a starting point to anyone who is interested in Xen and wants to learn more.

The Xen Project team is permitted to take part or all of this document and integrate it with the official Xen documentation or put it as a standalone document in the Xen Web Site if they wish, without any further notice.

This is not a complete, detailed document, nor a full walkthrough, and many important issues are omitted. Any feedback is welcome: Rami Rosen, ramirose@gmail.com

Xen and IA32 Protection Modes

In the classical protection model of IA-32, there are four privilege levels. The highest ring is 0, where the kernel runs (this level is also called Supervisor Mode). The lowest is ring 3, where user applications run (this level is also called User Mode). Issuing some instructions, which are called “privileged instructions”, from a ring which is NOT ring 0 will cause a General Protection Fault.

Rings 1 and 2 were not used over the years (except in the case of OS/2). When running Xen, the hypervisor runs in ring 0 and the guest OS in ring 1. The applications run unmodified at ring 3.

BTW, there are of course architectures which have different privilege models; for example, in PPC both domain 0 and the unprivileged domains run in supervisor mode. Diagram: Xen and IA32 Protection Modes


The Xend daemon:

The Xend Daemon handles requests issued from Domain 0; requests can be, for example, creating a new domain (“xm create”) or listing the domains (“xm list”), shutting down a domain (“xm destroy”). Running “xm help” will show all possibilities.

You start the Xend daemon by running “xend start” after booting into domain 0. “xend start” creates two daemons: xenstored and xenconsoled (see tools/misc/xend). It also creates an instance of a Python SrvDaemon class and calls its start() method (see tools/python/xen/xend/server/SrvDaemon.py).

The SrvDaemon start() method is in fact the xend main program.

In the past, the start() method of SrvDaemon eventually started an HTTP socket (port 8000) on which it listened to HTTP requests. It no longer opens an HTTP socket on port 8000.

Note: there is an alternative to the management layer of Xen which is called libvirt; see http://libvir.org. This is a free (LGPL) API.

The Xen Store:

The Xen Store Daemon provides a simple tree-like database to which we can read and write values. The Xen Store code is mainly under tools/xenstore.

It replaces the XCS, which was a daemon handling control messages.

The physical xenstore resides in one file: /var/lib/xenstored/tdb. (Previously it was scattered across several files; the change to using one file, named “tdb”, was probably made to increase performance.)

Both user space (“tools” in Xen terminology) and kernel code can write to the XenStore. The kernel code writes to the XenStore by using XenBus.

The Python scripts (under tools/python) use lowlevel/xs.c to read from and write to the XenStore.

The Xen Store Daemon is started in xenstored_core.c. It creates a device file (“/dev/xen/evtchn”) in case such a device file does not exist, and it opens it (see domain_init(), file tools/xenstore/xenstored_domain.c).

It opens two UNIX domain sockets. One of these is a read-only socket, which resides at /var/run/xenstored/socket_ro. The second is /var/run/xenstored/socket.

Connections on these sockets are represented by the connection struct.

A connection can be in one of three states:

        BLOCKED (blocked by a transaction)
        BUSY    (doing some action)
        OK      (completed it's transaction)

struct connection is declared in xenstore/xenstored_core.h; when a socket is read-only, its “can_write” member is false.

Then we start an endless loop in which we can get input/output from three sources: the two sockets and the event channel, mentioned above.

Events received on the event channel are handled by the handle_event() method (file xenstored_domain.c).

There are six executables under tools/xenstore, five of which are in fact built from the same module, xenstore_client.c, each time with a different DEFINE passed (see the Makefile). The sixth tool is built from xsls.c.

These executables are : xenstore-exists, xenstore-list, xenstore-read, xenstore-rm, xenstore-write and xsls.

You can use these executables for accessing the xenstore. For example, to view the list of fields of domain 0, which has the path /local/domain/0, you run:

xenstore-list /local/domain/0

and a typical result can be the following list:

cpu
memory
name
console
vm
domid
backend

The xsls command is very useful and recursively shows the contents of a specified XenStore path. Essentially it does a xenstore-list and then a xenstore-read for each returned field, displaying the fields and their values and then repeating this recursively on each sub-path. For example, to view information about all VIF backends hosted in domain 0 you may use the following command.

xsls /local/domain/0/backend/vif

and a typical result may be:

14 = ""
 0 = ""
  bridge = "xenbr0"
  domain = "vm1"
  handle = "0"
  script = "/etc/xen/scripts/vif-bridge"
  state = "4"
  frontend = "/local/domain/14/device/vif/0"
  mac = "aa:00:00:22:fe:9f"
  frontend-id = "14"
  hotplug-status = "connected"
15 = ""
 0 = ""
  mac = "aa:00:00:6e:d8:46"
  state = "4"
  handle = "0"
  script = "/etc/xen/scripts/vif-bridge"
  frontend-id = "15"
  domain = "vm2"
  frontend = "/local/domain/15/device/vif/0"
  hotplug-status = "connected"

(The xenstored daemon must be running for these six executables to work; if xenstored is not running, then running these executables will usually hang. The Xend daemon, however, can be stopped.)
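
The recursive walk that xsls performs can be sketched without a running xenstored by modelling the store as a nested dictionary (everything here is illustrative; the real tool talks to xenstored through the xenstore library rather than a dict):

```python
def xsls(store, indent=0):
    """Recursively render fields and their values, xsls-style (toy model)."""
    lines = []
    for key, value in store.items():
        if isinstance(value, dict):
            # a sub-path: list it, then recurse one level deeper
            lines.append(" " * indent + key)
            lines.extend(xsls(value, indent + 1))
        else:
            # a leaf field: read and display its value
            lines.append(" " * indent + '%s = "%s"' % (key, value))
    return lines

# A toy fragment shaped like the VIF backend output shown above.
toy = {"vif": {"14": {"0": {"bridge": "xenbr0", "state": "4"}}}}
for line in xsls(toy):
    print(line)
```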

An instance of struct node is the elementary unit of the XenStore. (struct node is defined in xenstored_core.h). The actual writing to the XenStore is done by write_node() method of xenstored_core.c.

The xen_start_info structure has a member named store_evtchn (declared in public/xen.h as u16). This is the event channel for store communication.

VT-x (virtual technology) processors – support in Xen

Note: following text refers only to IA-32 unless explicitly said otherwise.

Intel announced Pentium® 4 672 and 662 processors in November 2005 with virtualization support (see, for example: http://www.physorg.com/news8160.html).

How does Xen support the Intel Virtualization Technology ?

The VT extensions support in Xen3 code is mostly in xen/arch/x86/hvm/vmx*.c, xen/include/asm-x86/vmx*.h and xen/arch/x86/x86_32/entry.S.

The arch_vcpu structure (file xen/include/asm-x86/domain.h) contains a member called arch_vmx, which is an instance of arch_vmx_struct. This member is also important for understanding the VT-x mechanism.

But the most important structure for VT-x is the VMCS (vmcs_struct in the code), which represents the VMCS region.

The definition (file include/asm-x86/vmx_vmcs.h) is short:

struct vmcs_struct {
    u32 vmcs_revision_id;
    unsigned char data[0]; /* vmcs size is read from MSR */
};

The VMCS region contains six logical regions; most relevant to our discussion are the Guest-state area and the Host-state area. We will also deal with the other four regions: VM-execution control fields, VM-exit control fields, VM-entry control fields and VM-exit information fields.

Intel added ten new opcodes in VT-x to support Intel Virtualization Technology. They are detailed at the end of this section.

When using this technology, Xen runs in “VMX root operation mode” while the guest domains (which are unmodified OSes) run in “VMX non-root operation mode”. Since the guest domains run in non-root operation mode, they are more restricted, meaning that certain actions will cause a “VM exit”.

Xen enters VMX operation in the start_vmx() method (file xen/arch/x86/vmx.c). This method is called from the init_intel() method (file xen/arch/x86/cpu/intel.c); CONFIG_VMX should be defined.

First we check the X86_FEATURE_VMXE bit in the ecx register to see whether cpuid reports that the processor supports VMX. In IA-32, Intel added a bit in the CR4 control register specifying whether we want to enable VMX, so we must set this bit to enable VMX on the processor (by calling set_in_cr4(X86_CR4_VMXE)); this is bit 13 of CR4 (VMXE).

Then we call _vmxon to start VMX operation. If we try to start VMX operation with _vmxon while the VMXE bit in CR4 is not set, we get an exception (#UD, for undefined opcode).

In IA-64, things are a little different due to the different architecture: Intel added a new bit to the Processor Status Register (PSR). This is bit 46 and it is called VM. It should be set to 1 in guest OSes; when its value is 1, certain instructions cause a virtualization fault.

VM exit:

Some instructions cause a VM exit unconditionally, and some cause a VM exit only under certain VM-execution control fields (see the discussion about the VMCS region above).

The following instructions cause a VM exit unconditionally: CPUID, INVD, MOV from CR3, RDMSR, WRMSR, and all the new VT-x instructions (which are listed below).

There are other instructions, like HLT, INVLPG (Invalidate TLB Entry) and MWAIT, which cause a VM exit only if a corresponding VM-execution control is set.

Apart from the VM-execution control fields, there are two bitmaps which are used for determining whether to perform a VM exit. The first is the exception bitmap (see EXCEPTION_BITMAP in the vmcs_field enum, file xen/include/asm-x86/vmx_vmcs.h). This bitmap is a 32-bit field; when a bit is set in this bitmap, a VM exit occurs if the corresponding exception occurs. By default, the entries which are set are EXCEPTION_BITMAP_PG (for page fault) and EXCEPTION_BITMAP_GP (for General Protection); see MONITOR_DEFAULT_EXCEPTION_BITMAP in vmx.h.

The second bitmap is the I/O bitmap (in fact, there are two I/O bitmaps, A and B, each 4KB in size), which controls I/O instructions on ports. I/O bitmap A contains the ports in the range 0000-7FFF and I/O bitmap B contains the ports in the range 8000-FFFF (one bit for each I/O port). See IO_BITMAP_A and IO_BITMAP_B in the vmcs_field enum (VMCS Encodings).
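
Locating the bit for a given port follows directly from the layout above (a sketch: bitmap A covers ports 0x0000-0x7FFF, bitmap B covers 0x8000-0xFFFF, one bit per port):

```python
def io_bitmap_slot(port):
    """Return (bitmap, byte offset, bit index) controlling an I/O port."""
    assert 0 <= port <= 0xFFFF
    bitmap = "A" if port < 0x8000 else "B"
    offset = port % 0x8000          # position within the 4KB bitmap
    return bitmap, offset // 8, offset % 8

# Port 0x80 lives in bitmap A, byte 16, bit 0
assert io_bitmap_slot(0x80) == ("A", 16, 0)
# Port 0x8003 lives in bitmap B, byte 0, bit 3
assert io_bitmap_slot(0x8003) == ("B", 0, 3)
```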

When there is a VM exit, we reach vmx_vmexit_handler(struct cpu_user_regs regs) in vmx.c. We handle the VM exit according to the exit reason, which we read from the VMCS region. We read the VMCS by calling vmread(); the return value of vmread() is 0 in case of success.

We sometimes also need to read some additional data (VM_EXIT_INTR_INFO) from the vmcs.

We get this additional data from the “VM-exit interruption information” field (a 32-bit field) and the “Exit qualification” field (a 64-bit value).

For example, if the exception was an NMI, we check whether it is valid by checking bit 31 (the valid bit) of the VM-exit interruption field. In case it is not valid, we call _hvm_bug() to print some statistics and crash the domain.

An example of reading the “Exit qualification” field is the case where the VM exit was caused by issuing an INVLPG instruction.

When we work with VT-x, the guest OSes work in shadow mode, meaning they use shadow page tables; this is because the guest kernel in a VMX guest does not know that it is being virtualized: there is no software-visible bit which indicates that the processor is in VMX non-root operation. We set shadow mode by calling shadow_mode_enable() in the vmx_final_setup_guest() method (file vmx.c).

There are 43 basic exit reasons; you can see part of them in vmx.h (fields starting with EXIT_REASON_, like EXIT_REASON_EXCEPTION_NMI, which is exit reason number 0, and so on).

In VT-x, Xen will probably use an emulated devices layer which will send virtual interrupts to the VMM. We can prevent the OS from receiving interrupts by clearing the IF flag of EFLAGS.

The ten new opcodes which Intel added in VT-x are detailed below:

1) VMCALL (VMCALL_OPCODE in vmx.h): simply calls the VM monitor, causing a VM exit.

2) VMCLEAR (VMCLEAR_OPCODE in vmx.h): copies VMCS data to memory in case it is not already written there. Wrapper: _vmpclear(u64 addr) in vmx.h.

3) VMLAUNCH (VMLAUNCH_OPCODE in vmx.h): launches a virtual machine; changes the launch state of the VMCS to “launched” (if it is clear).

4) VMPTRLD (VMPTRLD_OPCODE in vmx.h): loads a pointer to the VMCS. Wrapper: _vmptrld(u64 addr) (file vmx.h).

5) VMPTRST (VMPTRST_OPCODE in vmx.h): stores a pointer to the VMCS. Wrapper: _vmptrst(u64 addr) (file vmx.h).

6) VMREAD (VMREAD_OPCODE in vmx.h): reads a specified field from the VMCS. Wrapper: _vmread(x, ptr) (file vmx.h).

7) VMRESUME (VMRESUME_OPCODE in vmx.h): resumes a virtual machine; in order for it to resume the VM, the launch state of the VMCS should be “clear”.

8) VMWRITE (VMWRITE_OPCODE in vmx.h): writes a specified field in the VMCS. Wrapper: _vmwrite(field, value).

9) VMXOFF (VMXOFF_OPCODE in vmx.h): terminates VMX operation. Wrapper: _vmxoff(void) (file vmx.h).

10) VMXON (VMXON_OPCODE in vmx.h): starts VMX operation. Wrapper: _vmxon(u64 addr) (file vmx.h).

QEMU and VT-d

The I/O in VT-x is performed by using QEMU. The QEMU code which Xen uses is under tools/ioemu. It is based on version 0.6.1 of QEMU; this version was patched according to Xen's needs. AMD SVM also uses the QEMU emulation.

The default network card which QEMU emulates in VT-x is the AMD PCnet-PCI II Ethernet Controller (file tools/ioemu/hw/pcnet.c). The reason to prefer this NIC emulation to the alternative, ne2000, is that pcnet uses DMA whereas ne2000 does not.

There is of course a performance cost for using QEMU, so chances are that the usage of QEMU will be replaced in the future with different solutions which have lower performance costs.

Intel announced in March 2006 its VT-d technology (Intel Virtualization Technology for Directed I/O). This technology makes it possible to assign devices to virtual machines. It also enables DMA remapping, which can be configured for each device. There is a cache called the IOTLB which improves performance.

Vmxloader

There are some restrictions on VMX operation. Guest OSes in VMX cannot operate in real mode. If bit PE (Protection Enabled) of CR0 is 0 or bit PG (“Enable Paging”) of CR0 is 0, then trying to start VMX operation (the VMXON instruction) fails. If after entering VMX operation you try to clear these bits, you get a General Protection Exception. A Linux loader starts in real mode, so a vmxloader was written for vmx images (file tools/firmware/vmxassist/vmxloader.c).
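
The PE/PG restriction can be expressed as a tiny predicate (a sketch; the section above only states that both bits must be set for VMXON to succeed, and PE is bit 0 and PG is bit 31 of CR0):

```python
CR0_PE = 1 << 0    # Protection Enabled
CR0_PG = 1 << 31   # Enable Paging

def vmxon_allowed(cr0):
    """VMXON fails unless both PE and PG are set in CR0 (sketch)."""
    return bool(cr0 & CR0_PE) and bool(cr0 & CR0_PG)

assert not vmxon_allowed(0)               # real mode: PE = PG = 0
assert vmxon_allowed(CR0_PE | CR0_PG)     # protected mode with paging
assert not vmxon_allowed(CR0_PE)          # paging disabled: VMXON fails
```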

(In order to build vmxloader you must have the dev86 package installed; dev86 is a real-mode 80x86 assembler and linker.)

After installing Xen, vmxloader is under /usr/lib/xen/boot. In order to use it, you should specify kernel = “/usr/lib/xen/boot/vmxloader” in the config file (which is an input to your “xm create” command.)

The vmxloader loads ROMBIOS at 0xF0000, then VGABIOS at 0xC0000, and then VMXAssist at D000:0000.

What is VMXAssist? VMXAssist is an emulator for real mode which uses the virtual-8086 mode of IA32. After setting virtual-8086 mode, it executes in a 16-bit environment.

There are certain instructions which are not recognized in virtual-8086 mode, for example LIDT (Load Interrupt Descriptor Table) or LGDT (Load Global Descriptor Table).

These instructions cause #GP(0) when one tries to run them in virtual-8086 mode.

So VMXAssist checks the opcodes of the instructions being executed and handles them so that they will not cause a General Protection Exception (as would have happened without its intervention).

VT-i (virtual technology) processors – support in Xen

(Note: the files mentioned in this section are from the unstable Xen version.)

In the VT-i extension for IA-64 processors, Intel added a bit to the PSR (Processor Status Register). This bit is bit 46 of the PSR and is called PSR.vm. When this bit is set, some instructions will cause a fault.

A new instruction called vmsw (Virtual Machine Switch) was added. This instruction sets PSR.vm to 1 or 0 and can be used to cause a transition to or from a VM without causing an interruption.

Also, a descriptor named VPD was added; this descriptor represents the resources of a virtual processor. Its size is 64K, and it must be 32K-aligned.

VPD stands for “Virtual Processor Descriptor”. A structure named vpd_t represents the VPD descriptor (file include/public/arch-ia64.h).

Two vectors were added to the IVT: one is the External Interrupt vector (0x3400) and the other is the Virtualization vector (0x6100).

The Virtualization vector handler is called when an instruction which needs virtualization is executed. This handler cannot be raised by IA-32 instructions.

Also nine PAL services were added. PAL stands for Processor Abstraction Layer.

The services that were added are: PAL_VPS_RESUME_NORMAL, PAL_VPS_RESUME_HANDLER, PAL_VPS_SYNC_READ, PAL_VPS_SYNC_WRITE, PAL_VPS_SET_PENDING_INTERRUPT, PAL_VPS_THASH, PAL_VPS_TTAG, PAL_VPS_RESTORE and PAL_VPS_SAVE (file include/asm-ia64/vmx_pal_vsa.h).

AMD SVM

AMD is expected to release Pacifica processors with virtualization support in Q2 2006 (probably in June 2006). The IOMMU virtualization support is to be out in 2007.

The xen-unstable tree now includes both Intel VT and AMD SVM support, using a common API which is called HVM.

HVM has been included in the unstable tree since changeset 8708 from 31/1/06 (“Big merge the HVM full-virtualisation abstractions”).

You can download the code by: hg clone http://xenbits.xensource.com/xen-unstable.hg

The code for AMD SVM is mostly under xen/arch/x86/hvm/svm.

The code is developed by AMD team: Tom Woller, Mats Petersson, Travis Betak, Nagib Gulam, Leo Duran, Rosilmildo Dasilva and Wei Huang.

SVM stands for “Secure Virtual Machine”.

One major difference between VT-x and AMD SVM is that the AMD SVM virtualization extensions include a tagged TLB (whereas the Intel virtualization extensions for IA-32 do not). The benefit of a tagged TLB is a significant reduction in the number of TLB flushes; this is achieved by using an ASID (Address Space Identifier) in the TLB. Using a tagged TLB is common in RISC processors.

In AMD SVM, the most important struct (parallel to the VT-x vmcs_struct) is the vmcb_struct (file xen/include/asm-x86/hvm/svm/vmcb.h). VMCB stands for Virtual Machine Control Block.

AMD added the following eight instructions to the SVM processor:

VMLOAD: loads the processor state from the VMCB.

VMMCALL: enables the guest to communicate with the VMM.

VMRUN: starts the operation of a guest OS.

VMSAVE: saves the processor state to the VMCB.

CLGI: clears the global interrupt flag (GIF).

STGI: sets the global interrupt flag (GIF).

INVLPGA: invalidates the TLB mapping of a specified virtual page and a specified ASID.

SKINIT: reinitializes the CPU.

To issue these instructions, SVM must be enabled. Enabling SVM is done by setting bit 12 of the EFER MSR.
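
Enabling SVM is just setting one MSR bit, which can be sketched as follows (bit 12 of EFER, as stated above; the starting EFER values are illustrative):

```python
EFER_SVME = 1 << 12   # SVM-enable bit in the EFER MSR

def enable_svm(efer):
    """Return the EFER value with the SVME bit set (sketch)."""
    return efer | EFER_SVME

assert enable_svm(0) == 0x1000
assert enable_svm(0x1000) == 0x1000   # already enabled: unchanged
```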

In VT-x, the vmx_vmexit_handler() method handles VM exits. In AMD SVM, the svm_vmexit_handler() method is the one which handles VM exits (file xen/arch/x86/hvm/svm/svm.c). When a VM exit occurs, the processor saves the reason for the exit in the exit_code member of the VMCB, and svm_vmexit_handler() handles the VM exit according to this exit code.

Xen On Solaris

On 13 Feb 2006, Sun released the Xen sources for Solaris x86. See: http://opensolaris.org/os/community/xen/opening-day.

This version currently supports 32-bit only; it enables OpenSolaris to run as a guest OS where dom0 is a modified Linux kernel running Xen. This version is also currently only for x86 (porting to the SPARC processor is much more difficult). The members of the Solaris Xen project are Tim Marsland, John Levon, Mark Johnson, Stu Maybee, Joe Bonasera, Ryan Scott, Dave Edmondson and others. Todd Clayton is leading the 64-bit Solaris Xen project. Many changes were needed in order to boot the Solaris Xen guest; you can see more details at http://blogs.sun.com/roller/page/JoeBonasera.

You can download the Xen Solaris sources from : http://dlc.sun.com/osol/xen/downloads/osox-src-12-02-2006.tar.gz

Frontend net virtual device sources are in uts/common/io/xennet/xennetf.c. (xennet is the net front virtual driver.).

Frontend block virtual device sources are in uts/i86xen/io/xvbd (xvbd is the block front virtual driver.).

Currently the frontend block device does not work. Many things are similar between Xen on Solaris and Xen on Linux.

In Solaris Xen, hypercalls are also made by calling int 0x82 (see #define TRAP_INSTR int $0x82 in file uts/i86xen/ml/hypersubr.s).

In February 2006 Sun also released the specs for the T1 processor, which supports virtualization: see http://opensparc.sunsource.net/nonav/opensparct1.html

see: http://www.prnewswire.com/cgi-bin/stories.pl?ACCT=104&STORY=/www/story/02-14-2006/0004281587&EDATE=

Also the UltraSPARC T1 Hypervisor API Specification was released: http://opensparc.sunsource.net/specs/Hypervisor_api-26-v6.pdf

T1 virtualization:

The Hyperprivileged edition of the UltraSPARC Architecture 2005 Specification describes the Nonprivileged, Privileged, and Hyperprivileged (hypervisor/virtual machine firmware) spec.

The virtual processor on Sun supports three privilege modes:

Two bits determine the privilege mode of the processor: HPSTATE.hpriv and PSTATE.priv. When both are 0, we are in nonprivileged mode. When HPSTATE.hpriv is 0 and PSTATE.priv is 1, we are in privileged mode. When HPSTATE.hpriv is 1, we are in hyperprivileged mode (regardless of the value of PSTATE.priv). PSTATE is the Processor State register; HPSTATE is the Hyperprivileged State register (64 bit). Each virtual processor has only one instance of the PSTATE and HPSTATE registers. HPSTATE is one of the HPR state registers, and it is also called HPR 0. It can be read by the RDHPR instruction, and it can be written by the WRHPR instruction.
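The mode decision above can be captured in a few lines (a sketch of the rule as stated, not Sun code):

```python
# Privilege mode as a function of the two bits described above:
# HPSTATE.hpriv wins regardless of PSTATE.priv.
def privilege_mode(hpstate_hpriv, pstate_priv):
    if hpstate_hpriv:
        return "hyperprivileged"
    return "privileged" if pstate_priv else "nonprivileged"
```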

Step by step example of creating guest OS with Virtual Machine Manager in Fedora Core 6

This section describes a step-by-step example of creating a guest OS based on FC6 i386 with Virtual Machine Manager on a Fedora Core 6 machine, installing from a web URL:

Go to: Applications->System Tools->Virtual Machine Manager. Choose: Local Xen Host. Press New and enter a name for the guest. You now reach the “Locating installation media” dialog. In “install media URL” you should enter the URL of a Fedora Core 6 i386 download, for example “http://download.fedora.redhat.com/pub/fedora/linux/core/6/i386/os/”, then press Forward. Choose Simple File, and give a path to a nonexistent file in some existing folder. Then, for File Size, choose 3.5 GB, for example; if you assign less space, you will not be able to finish the installation, assuming it is a typical, non-custom installation.

Then press Forward; accept the defaults for memory/CPU and press Forward again. Then press Finish. That’s it! When installing from the web like this, it can take 2-4 hours, depending on your bandwidth. You will get to the text-mode installation of Fedora Core 6 and have to enter parameters for the installation.

After the installation is finished, you can restart the guest OS simply by: “xm create /etc/xen/NameOfGuest”, where NameOfGuest is of course the name you chose for the guest during installation.

Physical Interrupts

In Xen, only the hypervisor has access to the hardware, in order to achieve isolation (it is dangerous to share the hardware and let domains access hardware devices directly and simultaneously).

Let’s take a little walkthrough dealing with Xen interrupts:

Handling interrupts in Xen is done using event channels. Each domain can hold up to 1024 events. An event channel can have two flags associated with it: pending and mask. The mask flag can be updated only by guests; the hypervisor cannot update it. These flags are not part of the event channel structure itself (struct evtchn is defined in xen/include/xen/sched.h). There are two arrays in struct shared_info which contain these flags: evtchn_pending[] and evtchn_mask[]; each holds 32 elements (file xen/include/public/xen.h).

(shared_info is a member of the domain struct; it is the domain's shared data area.)
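A rough model of these flag arrays (simplified, not the actual shared_info layout; 32 words of 32 bits each give the 1024 per-domain event flags):

```python
# Toy model of evtchn_pending[] / evtchn_mask[]: each is 32 words of 32 bits,
# covering 1024 event-channel ports. An event is deliverable when its pending
# bit is set and its mask bit is clear.
class EventFlags:
    NR_WORDS = 32
    BITS_PER_WORD = 32

    def __init__(self):
        self.pending = [0] * self.NR_WORDS
        self.mask = [0] * self.NR_WORDS

    def set_pending(self, port):
        self.pending[port // 32] |= 1 << (port % 32)

    def set_mask(self, port):
        self.mask[port // 32] |= 1 << (port % 32)

    def deliverable(self, port):
        word, bit = port // 32, 1 << (port % 32)
        return bool(self.pending[word] & bit) and not (self.mask[word] & bit)
```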

TBD: add info about event selectors (evtchn_pending_sel in vcpu_info).

Registration (or binding) of irqs in guest domains:

The guest OS calls init_IRQ() when it boots (start_kernel() method calls init_IRQ() ; file init/main.c).

(init_IRQ() is in file sparse/arch/xen/kernel/evtchn.c)

There can be 256 physical irqs; so there is an array called irq_desc with 256 entries. (file sparse/include/linux/irq.h)

All elements in this array are initialized in init_IRQ() so that their status is disabled (IRQ_DISABLED).

Now, when a physical driver starts it usually calls request_irq().

This method eventually calls setup_irq() (both are in sparse/kernel/irq/manage.c), which calls startup_pirq().

startup_pirq() sends a hypercall to the hypervisor (HYPERVISOR_event_channel_op) in order to bind the physical irq (pirq). The hypercall is of type EVTCHNOP_bind_pirq. See startup_pirq() (file sparse/arch/xen/kernel/evtchn.c).

On the hypervisor side, this hypercall is handled by the evtchn_bind_pirq() method (file common/event_channel.c), which calls pirq_guest_bind() (file arch/x86/irq.c). pirq_guest_bind() changes the status of the corresponding irq_desc array element to enabled (~IRQ_DISABLED); it also calls the startup() method.

Now, when an interrupt arrives from the controller (the APIC), we reach the do_IRQ() method, as in the usual Linux kernel (also in arch/x86/irq.c). The hypervisor itself handles only timer and serial interrupts. Other interrupts are passed to the domains by calling _do_IRQ_guest() (in fact, the IRQ_GUEST flag is set for all interrupts except the timer and serial interrupts). _do_IRQ_guest() sends the interrupt, by calling send_guest_pirq(), to all guests which are registered on this IRQ. send_guest_pirq() creates an event channel (an instance of evtchn) and sets the pending flag of this event channel (by calling evtchn_set_pending()). Then, asynchronously, Xen will notify the domain regarding this interrupt (unless it is masked).
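The delivery step can be sketched like this (a simplified stand-in for the real C code; the function and parameter names here are made up):

```python
# Sketch of the _do_IRQ_guest() idea: an incoming physical IRQ is forwarded
# to every guest bound to it by marking its event channel pending.
def deliver_pirq(irq, bindings, set_pending):
    """bindings maps irq -> list of (domain, port); set_pending is a callback
    standing in for send_guest_pirq()/evtchn_set_pending()."""
    targets = bindings.get(irq, [])
    for dom, port in targets:
        set_pending(dom, port)
    return len(targets)
```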

TBD: shared interrupts; avoiding problems with shared interrupts when using PCI express.

Backend Drivers:

The backend drivers are started from domain 0. We will deal mainly with the network and block drivers. The network backend drivers reside under sparse/drivers/xen/netback, and the block backend drivers reside under sparse/drivers/xen/blkback.

There are many things in common between the netback and blkback device drivers, with some differences: the blkback device driver runs a kernel daemon thread (named xenblkd), whereas the netback device driver does not run any kernel thread.

The netback and blkback register themselves with XenBus by calling xenbus_register_backend().

This method simply calls xenbus_register_driver_common(); both are in sparse/drivers/xen/xenbus/xenbus_probe.c.

(The xenbus_register_driver() method calls the generic kernel method for registering drivers, driver_register()).

Both netback (the network backend driver) and blkback (the block backend driver) have a module named xenbus.c. There are drivers which are not split into backend/frontend drivers; for example, the balloon driver. The balloon driver calls register_xenstore_notifier() in its initialization (the balloon_init() method). register_xenstore_notifier() uses a generic Linux callback mechanism for passing status changes (notifier_block in include/linux/notifier.h).

The USB driver also has backend and frontend drivers; currently it has no support for the xenbus/xenstore API, so it does not have a module named xenbus.c, but it will probably be adjusted in the future. As of the writing of this document, the USB backend/frontend code was temporarily removed from the sparse tree.

Each of the backend drivers registers two watches: one for the backend and one for the frontend. The registration of the watches is done in the probe method:

* In netback it is in netback_probe() method (file netback/xenbus.c).

* In blkback it is in blkback_probe() method (file blkback/xenbus.c).

A watch is registered by calling the xenbus_watch_path2() method, implemented in sparse/drivers/xen/xenbus/xenbus_client.c. Eventually the watch registration is done by calling register_xenbus_watch(), which is implemented in sparse/drivers/xen/xenbus/xenbus_xs.c.

In both cases, netback and blkback, the callback for the backend watch is called backend_changed, and the callback for the frontend watch is called frontend_changed.

xenbus_watch is a simple struct consisting of 3 elements:

A reference to a list of watches (list_head)

A pointer to a node (char*)

A callback function pointer.
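A sketch of this struct in Python (the kernel's list_head linkage is replaced here by membership in a plain list, and the names are illustrative):

```python
from dataclasses import dataclass
from typing import Callable

# Sketch of the xenbus_watch struct described above: a watched node and a
# callback. The list_head element becomes membership in registered_watches.
@dataclass
class XenbusWatch:
    node: str                          # the xenstore path being watched
    callback: Callable[[str], None]    # invoked when that path changes

registered_watches = []

def register_xenbus_watch(watch):
    """Stand-in for register_xenbus_watch(): remember the watch."""
    registered_watches.append(watch)
```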

The xenbus.c in both netback and blkback defines a struct called backend_info; the two structs have much in common, with minor differences. One difference is that in netback the communications channel is an instance of netif_t, whereas in blkback it is an instance of blkif_t. In the blkback case, the struct also includes the major/minor numbers of the device and the mode (these members don’t exist in the backend_info struct of netback).

In the netback case, there is also a XenbusState member. The state machine for XenBus includes seven states: Unknown, Initialising, InitWait (early initialisation was finished, and xenbus is waiting for information from the peer or hotplug scripts), Initialised (waiting for a connection from the peer), Connected, Closing (due to an error or an unplug event), and Closed.
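The seven states can be written out as an enum (the numeric ordering here simply follows the list above):

```python
from enum import Enum

# The XenBus state machine's seven states, in the order listed above.
class XenbusState(Enum):
    Unknown = 0
    Initialising = 1
    InitWait = 2       # early init done; waiting for peer/hotplug info
    Initialised = 3    # waiting for a connection from the peer
    Connected = 4
    Closing = 5        # due to an error or an unplug event
    Closed = 6
```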

One of the members of this struct (backend_info) is an instance of xenbus_device (declared in sparse/include/asm-xen/xenbus.h). The nodename looks like a directory path; for example, dev->nodename in the blkback case may look like:

backend/vbd/94e11324-7eb1-437f-86e6-3db0e145136e/771

and dev->nodename in the netback may look like:

backend/vif/94e11324-7eb1-437f-86e6-3db0e145136e/1

We create an event channel for communication between the two domains by calling a bind_interdomain Hypervisor call. (HYPERVISOR_event_channel_op).

For networking, this is done in netif_map() in netback/interface.c. For the block device, this is done in blkif_map() in blkback/interface.c.

We use grant tables to create shared memory between the frontend and backend domains. In the case of the network drivers, this is done by calling gnttab_grant_foreign_transfer_ref() (called in network_alloc_rx_buffers(), file netfront.c).

gnttab_grant_foreign_transfer_ref() sets a bit named GTF_accept_transfer in the grant_entry.

In the case of the block drivers, this is done by calling gnttab_grant_foreign_access_ref() in blkif_queue_request() (file blkfront.c).

gnttab_grant_foreign_access_ref() sets a bit named GTF_permit_access in the grant entry. A grant entry (grant_entry_t) represents a page frame which is shared between domains.
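A toy model of grant entries (the two flag names mirror the grant types named above; the values and everything else are simplified for illustration):

```python
# Toy grant table: each entry records which foreign domain may use which
# page frame, and in what way. The two flag constants stand in for the
# GTF_permit_access / GTF_accept_transfer types mentioned above.
GTF_permit_access = 1     # shared-memory access (the block-driver case)
GTF_accept_transfer = 2   # page transfer (the network RX-buffer case)

def grant_foreign_access(table, ref, domid, frame):
    """Sketch of gnttab_grant_foreign_access_ref(): share a frame."""
    table[ref] = {"domid": domid, "frame": frame, "flags": GTF_permit_access}

def grant_foreign_transfer(table, ref, domid):
    """Sketch of gnttab_grant_foreign_transfer_ref(): accept a transfer."""
    table[ref] = {"domid": domid, "frame": 0, "flags": GTF_accept_transfer}
```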

Diagram: Virtual Split Devices


Migration and Live Migration:

Xend must be configured so that migration (which is also termed relocation) is enabled. In /etc/xen/xend-config.sxp there is the definition of the relocation port, so the following line should be uncommented:

(xend-relocation-port 8002)

The line “(xend-address localhost)” restricts connections to localhost and so prevents remote connections; this line must be commented out.

Notice: if this line is not commented out on the side to which you want to migrate your domain, you will most likely get the following error after issuing the migrate command:

"Error: can't connect: Connection refused"

This error can be traced to the domain_migrate() method in tools/python/xen/xend/XendDomain.py, which starts a TCP connection on the relocation port (by default 8002):

...
...
def domain_migrate(self, domid, dst, live=False, resource=0):
        """Start domain migration."""
        dominfo = self.domain_lookup(domid)
        port = xroot.get_xend_relocation_port()
        try:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.connect((dst, port))
        except socket.error, err:
            raise XendError("can't connect: %s" % err[1])
...
...

See more details on the relocation protocol (implemented in relocate.py) below.

The line “(xend-relocation-server yes)” should be uncommented so the migration server will be running.

When we issue a migration command like “xm migrate #numOfDomain ipAddress” (or, if we want live migration, add the --live flag: “xm migrate #numOfDomain --live ipAddress”), xm calls server.xend_domain_migrate() after validating that the arguments are valid (file tools/python/xen/xm/migrate.py).

We enter the domain_migrate() method of XendDomain, which first performs a domain lookup and then creates a TCP connection for the migration with the target machine; it then sends a “receive” packet (a packet which contains the string “receive”) on port 8002. The other side gets this message in the dataReceived() method (see web/connection.py) and delegates it to dataReceived() of RelocationProtocol (file tools/python/xen/xend/server/relocate.py). Eventually it calls the save() method of XendCheckpoint to start the migration process.

We end up with a call to op_receive(), which ultimately sends back a “ready receive” message (also on port 8002). See the op_receive() method in relocate.py.

The save() method of XendCheckpoint opens a thread which calls the xc_save executable (which is in /usr/lib/xen/bin/xc_save) (file tools/python/xen/xend/XendCheckpoint.py).

The xc_save executable is built from tools/xcutils/xc_save.c.

The xc_save executable calls xc_linux_save() (in tools/libxc/xc_linux_save.c), which in fact performs most of the migration process.

The xc_linux_save() returns 0 on success, and 1 in case of failure.

Live migration is currently supported for 32-bit and 64-bit architectures. TBD: find out whether there is support for live migration on PAE architectures. Jacob Gorm Hansen is doing interesting work on migration in Xen; see http://www.diku.dk/~jacobg/self-migration.

He wrote code which performs Xen “self-migration”, a migration done without hypervisor involvement. The migration is done by opening a user-space socket and reading from a special file (/dev/checkpoint).

His code works with linux-2.6-based xen-unstable.

You can get it by: hg clone http://www.distlab.dk/hg/index.cgi/xen-gfx.hg

His work was inspired by his own ‘NomadBIOS’ for L4 microkernel, which also uses this approach of self-migration.

It might be that this new alternative to managed migration will be adopted in future versions of Xen (time will tell).

Migration of operating systems has much in common with migration of single processes. See the Master’s thesis “Nomadic Operating Systems”, Jacob G. Hansen and Asger Kahl Henriksen, 2002: http://nomadbios.sunsite.dk/NomadicOS.ps

You can find there a discussion about migration of single processes in Mosix (Barak and La’adan).

Creating a domain – behind the scenes:

We create a domain by:

xm create configFile -c

A simple config file may look like this (using ttylinux, as in the user manual):

kernel = "/boot/vmlinuz-2.6.12-xenU"
memory = 64
name = "ttylinux"
nics = 1
ip = "10.0.0.3"
disk = ['file:/home/work/downloads/tmp/ttylinux-xen,hda3,w']
root = "/dev/hda3 ro"

The create() method of XendDomainInfo handles creation of domains. When a domain is created, it is assigned a unique id (uuid), which is created using the uuidgen command-line utility of e2fsprogs (file python/xen/xend/uuid.py).

If the memory parameter specifies more memory than the hypervisor can allocate, we end up with the following message: “Error: Error creating domain: The privileged domain did not balloon!”

The devices of a domain are created using the createDevice() method, which delegates the call to the createDevice() method of the device controller (see XendDomainInfo.py). createDevice() in turn calls the writeDetails() method (also in DevController). This writeDetails() method writes the details into XenStore to trigger the creation of the device. getDeviceDetails() is an abstract method which each subclass of DevController implements. Writing to the store is done by calling the Write() method of xstransact (file tools/python/xen/xend/xenstore/xstransact.py), which returns the id of the newly created device.
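Conceptually, writeDetails() batches key/value writes under the device's store path. A hypothetical sketch (a dict stands in for xenstored here, and the path layout and function name are simplified inventions, not the real xend API):

```python
# Hypothetical sketch of writing device details into the store: each detail
# becomes a key under the device's path, and the path identifies the device.
def write_device_details(store, devpath, details):
    """store: dict standing in for xenstored; returns the device path."""
    for key, value in details.items():
        store["%s/%s" % (devpath, key)] = str(value)
    return devpath
```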

By using transactions you can batch together several actions to perform against xenstored (the common use is several read actions). You can also create a domain without Xend and without Python bindings; Jacob Gorm Hansen demonstrated this in two little programs (businit.c and buscrate.c) (see http://lists.xensource.com/archives/html/xen-devel/2005-10/msg00432.html).

However, by now these programs would need to be adjusted because there have been some API changes; in particular, creation of an interdomain event channel is now done by sending an ioctl to event_channel (IOCTL_EVTCHN_BIND_INTERDOMAIN).

HyperCalls Mapping to code Xen 3.0.2

Mapping of HyperCalls to code :

Following are the locations of all the hypercalls; the hypercalls appear according to their order in xen.h.

The hypercall table itself is in xen/arch/x86/x86_32/entry.S (ENTRY(hypercall_table)).

HYPERVISOR_set_trap_table => do_set_trap_table() (file xen/arch/x86/traps.c)

HYPERVISOR_mmu_update => do_mmu_update() (file xen/arch/x86/mm.c)

HYPERVISOR_set_gdt => do_set_gdt() (file xen/arch/x86/mm.c)

HYPERVISOR_stack_switch => do_stack_switch() (file xen/arch/x86/x86_32/mm.c)

HYPERVISOR_set_callbacks => do_set_callbacks() (file xen/arch/x86/x86_32/traps.c)

HYPERVISOR_fpu_taskswitch => do_fpu_taskswitch(int set) (file xen/arch/x86/traps.c)

HYPERVISOR_sched_op_compat => do_sched_op_compat() (file xen/common/schedule.c)

HYPERVISOR_dom0_op => do_dom0_op() (file xen/common/dom0_ops.c)

HYPERVISOR_set_debugreg => do_set_debugreg() (file xen/arch/x86/traps.c)

HYPERVISOR_get_debugreg => do_get_debugreg() (file xen/arch/x86/traps.c)

HYPERVISOR_update_descriptor => do_update_descriptor() (file xen/arch/x86/mm.c)

HYPERVISOR_memory_op => do_memory_op() (file xen/common/memory.c)

HYPERVISOR_multicall => do_multicall() (file xen/common/multicall.c)

HYPERVISOR_update_va_mapping => do_update_va_mapping() (file /xen/arch/x86/mm.c)

HYPERVISOR_set_timer_op => do_set_timer_op() (file xen/common/schedule.c)

HYPERVISOR_event_channel_op => do_event_channel_op() (file xen/common/event_channel.c)

HYPERVISOR_xen_version => do_xen_version() (file xen/common/kernel.c)

HYPERVISOR_console_io => do_console_io() (file xen/drivers/char/console.c)

HYPERVISOR_physdev_op => do_physdev_op() (file xen/arch/x86/physdev.c)

HYPERVISOR_grant_table_op => do_grant_table_op() (file xen/common/grant_table.c)

HYPERVISOR_vm_assist => do_vm_assist() (file xen/common/kernel.c)

HYPERVISOR_update_va_mapping_otherdomain => do_update_va_mapping_otherdomain() (file xen/arch/x86/mm.c)

HYPERVISOR_iret => do_iret() (file xen/arch/x86/x86_32/traps.c) /* x86/32 only */

HYPERVISOR_vcpu_op => do_vcpu_op() (file xen/common/domain.c)

HYPERVISOR_set_segment_base => do_set_segment_base (file xen/arch/x86/x86_64/mm.c) /* x86/64 only */

HYPERVISOR_mmuext_op => do_mmuext_op() (file xen/arch/x86/mm.c)

HYPERVISOR_acm_op => do_acm_op() (file xen/common/acm_ops.c)

HYPERVISOR_nmi_op => do_nmi_op() (file xen/common/kernel.c)

HYPERVISOR_sched_op => do_sched_op() (file xen/common/schedule.c)

(Note: sometimes hypercalls are also called hcalls.)

Virtualization and the Linux Kernel

Virtualization in a computer context can be thought of as extending the abilities of a computer beyond what a straight, non-virtual implementation allows.

In this category we can also include virtual memory, which allows a process to access a 4GB virtual address space even though the physical RAM is usually much smaller.

We can also think of the Linux IP Virtual Server (which is now a part of the linux kernel) as a kind of virtualization. By using the Linux IP Virtual Server you can configure a router to redirect service requests from a virtual server address to other machines (called real servers).

The IP Virtual Server has been part of the kernel since 2.6.10 (for the 2.4.* kernels it is also available as a patch); the code for 2.6.10 and later kernels is under net/ipv4/ipvs in the kernel tree. There is still no implementation for IPv6.

The Linux Virtual Server (LVS) project was started quite some time ago, in 1998; see http://www.linuxvirtualserver.org.

The idea of virtualization, in the sense of running more than one operating system on a single platform, is not new and has been researched for many years. However, it seems that the Xen project is the first to produce performance benchmarks for such a feature which make the idea practical and attractive.

Origins of the Xen project: Xen was originally built as part of the XenoServer project; see http://www.cl.cam.ac.uk/Research/SRG/netos/xeno.

The Arsenic project also contributed some ideas used in Xen (see http://www.cl.cam.ac.uk/Research/SRG/netos/arsenic).

In the Arsenic project, written by Ian Pratt and Keir Fraser, a big part of the Linux kernel TCP/IP stack was ported to user space. The Arsenic project is based on Linux 2.3.29. After a short look at the Arsenic project code you can find some data structures reminiscent of parallel data structures in Xen, like the event rings (for example, the ring_control_block struct in arsenic-1.0/acenic-fw-12.3.10/nic/common/nic_api.h).

Meiosys is a French company which was purchased by IBM. It deals with another, different type of virtualization: application virtualization.

see http://www.virtual-strategy.com/article/articleview/680/1/2/ and http://www.infoworld.com/article/05/06/23/HNibmbuy_1.html

In the context of the Meiosys project, it is worth mentioning that a patch was recently sent to the Linux Kernel Mailing List by Serge E. Hallyn (IBM): see http://lwn.net/Articles/160015

This patch deals with process IDs (the pid should stay the same after restarting the application in Meiosys).

Another article on PID virtualization can be found in “PID virtualization: a wealth of choices” (http://lwn.net/Articles/171017/?format=printable); it deals with PID virtualization in the context of a different project (OpenVZ).

There is also the coLinux open source project (see http://colinux.sourceforge.net for more details) and the OpenVZ project, which is based on Virtuozzo™ (Virtuozzo is a commercial solution).

OpenVZ offers a Linux-based server virtualization solution; see http://openvz.org.

There are other projects which probably inspired this virtualization work; to name a few:

The Denali project (uses paravirtualization): http://denali.cs.washington.edu

A paper: Denali: Lightweight Virtual Machines for Distributed and Networked Applications By Andrew Whitaker et al. http://denali.cs.washington.edu/pubs/distpubs/papers/denali_usenix2002.pdf

Nemesis Operating System. http://www.cl.cam.ac.uk/Research/SRG/netos/old-projects/nemesis/index.html

Exokernel: see “Application Performance and Flexibility on Exokernel Systems” by M. Frans Kaashoek et al http://www.cl.cam.ac.uk/~smh22/docs/kaashoek97application.ps.gz

TBD: more details.

Pre-Virtualization

Another interesting virtualization technique is pre-virtualization; in this method, we rewrite sensitive instructions in the assembler files (whether generated by the compiler, as is the usual case, or created manually). There is a problem with this method because some instructions are sensitive only when they are performed in a certain context. A solution for this is to generate profiling data for a guest OS and then recompile the OS using the profiling data.

See:

http://l4ka.org/projects/virtualization/afterburn/

and an article: Pre-Virtualization: Slashing the Cost of Virtualization Joshua LeVasseur, Volkmar Uhlig, Matthew Chapman et al. http://l4ka.org/publications/2005/previrtualization-techreport.pdf

This technique is based on a paper by Hideki Eiraku and Yasushi Shinjo, “Running BSD Kernels as User Processes by Partial Emulation and Rewriting of Machine Instructions” http://www.usenix.org/publications/library/proceedings/bsdcon03/tech/eiraku/eiraku_html/index.html

Xen Storage

You can use iSCSI for Xen storage. The xen-tools package of openSUSE has an example of using iSCSI, called xmexample.iscsi. The disk entry for iSCSI in the configuration file may look like: disk = [ 'iscsi:iqn.2006-09.de.suse@0ac47ee2-216e-452a-a341-a12624cd0225,hda,w' ]

TBD: more on iSCSI in Xen.

Solutions for using CoW in Xen: blktap (part of the xen project).

UnionFS: a stackable filesystem (used also in Knoppix Live-CD and other Live-CDs)

dm-userspace (A tool which uses device-mapper and a daemon called cowd; written by Dan Smith) You may download dm-userspace by:

To build it as a module out of tree, copy dm-userspace.h to /lib/modules/$(uname -r)/build/include/linux and then run “make”.

Home of dm-userspace:

Copy-on-write NFS server: see http://www.russross.com/CoWNFS.html

kvm – Kernel-based Virtualization Driver

KVM is an open source virtualization project, written by Avi Kivity and Yaniv Kamay from Qumranet. See: http://kvm.sourceforge.net.

It has been included in the Linux kernel tree since 2.6.20-rc1; see http://lkml.org/lkml/2006/12/13/361 (“kvm driver for all those crazy virtualization people to play with”).

Currently it supports Intel processors with the virtualization extensions (VT-x) and AMD SVM processors. You can tell whether your processor has these extensions by issuing the following from the command line: “egrep '^flags.*(vmx|svm)' /proc/cpuinfo”.
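The same check can be done in Python, reading the flags lines of /proc/cpuinfo (the function name here is made up):

```python
import re

# Look for vmx (Intel VT-x) or svm (AMD SVM) on a flags line of /proc/cpuinfo,
# mirroring the egrep command above.
def has_hw_virt(cpuinfo_text):
    return bool(re.search(r"^flags\b.*\b(vmx|svm)\b", cpuinfo_text, re.M))
```

On a real system you would call it as has_hw_virt(open("/proc/cpuinfo").read()).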

kvm.ko is a kernel module which handles userspace requests through ioctls; it works with a character device (/dev/kvm). The userspace part is built from a patched qemu. One of KVM's advantages is that it uses Linux kernel mechanisms unchanged (such as the Linux scheduler); the Xen project, for example, made many changes to parts of the kernel to enable paravirtualization. Another advantage is the simplicity of the project: there is a kernel part and a userspace part. Also, future versions of the Linux kernel should not entail changes in the kvm module code (and of course not in the userspace part). The project currently supports SMP hosts and will support SMP guests in the future.

Currently there is no support for live migration in KVM (but there is support for ordinary migration, where the migrated OS is stopped, transferred to the target, and then resumed).

In Intel VT-x, VM exits are handled by the kvm module in the kvm_handle_exit() method in kvm_main.c, according to the reason which caused them (which is specified in, and read from, the VMCS). In AMD SVM, exits are handled by handle_exit() in svm.c.

There is an interesting usage of memory slots. There is already an RPM for openSUSE, by Gerd Hoffmann.

Tip: How to build Xen with your own tar ball

If you want to run “make world” without downloading the kernel (because you want to use your own tarball, which is a bit different from the original one because you made a few changes inside the kernel), then do the following:

1) Let’s say that the kernel tarball is named my_linux-2.6.18.tar.bz2. First, move my_linux-2.6.18.tar.bz2 to the folder from which you build Xen.

2) Run from bash: XEN_LINUX_SOURCE=tarball make world

That’s it; the build will use the my_linux-2.6.18.tar.bz2 tarball that you copied to that folder.

Xen in the Linux Kernel

According to the following thread from xen-devel, http://lists.xensource.com/archives/html/xen-devel/2005-10/msg00436.html, there is a mercurial repository in which Xen is a subarch of i386 and x86_64 in the Linux kernel, and there is an intention to send the relevant stuff to Andrew/Linus for the upcoming 2.6.15 kernel. On 22/3/2006, a patchset of 35 parts was sent to the Linux Kernel Mailing List (lkml) for Xen i386 paravirtualization support in the Linux kernel: see http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/2313.html

VMI : Virtual Machine Interface

On 13/3/06, a patchset titled “VMI i386 Linux virtualization interface proposal” was sent to the LKML (Linux Kernel Mailing List) by Zachary Amsden and others (see http://lkml.org/lkml/2006/3/13/140). It proposes a common interface which abstracts the specifics of each hypervisor and thus can be used by many hypervisors. According to the vmi_spec.txt of this patchset, when an OS is ported to a paravirtualizable x86 processor, it should access the hypervisor through the VMI layer.

The VMI layer interface:

The VMI is divided into the following 10 types of calls:

CORE INTERFACE CALLS (like VMI_Init)

PROCESSOR STATE CALLS (like VMI_DisableInterrupts, VMI_EnableInterrupts,VMI_GetInterruptMask)

DESCRIPTOR RELATED CALLS (like VMI_SetGDT,VMI_SetIDT)

CPU CONTROL CALLS (like VMI_WRMSR,VMI_RDMSR)

PROCESSOR INFORMATION CALLS (like VMI_CPUID)

STACK/PRIVILEGE TRANSITION CALLS (like VMI_UpdateKernelStack,VMI_IRET)

I/O CALLS (like VMI_INB,VMI_INW,VMI_INL)

APIC CALLS (like VMI_APICWrite,VMI_APICRead)

TIMER CALLS (VMI_GetWallclockTime)

MMU CALLS (like VMI_SetLinearMapping)

Links

1) Xen Project HomePage http://www.cl.cam.ac.uk/Research/SRG/netos/xen

2) Xen Mailing Lists Page: http://lists.xensource.com

(don’t forget to read the XenUsersNetiquette before posting on the lists)

3) Article: Analysis of the Intel Pentium’s Ability to Support a Secure Virtual Machine Monitor http://www.usenix.org/publications/library/proceedings/sec2000/full_papers/robin/robin_html/index.html

4) Xen Summits: 2005: http://xen.xensource.com/xensummit/xensummit_2005.html 2006 fall: http://xen.xensource.com/xensummit/xensummit_fall_2006.html 2006 winter: http://xen.xensource.com/xensummit/xensummit_winter_2006.html 2007 spring: http://xen.xensource.com/xensummit/xensummit_spring_2007.html

5) Intel Virtualization Technology: http://www.intel.com/technology/virtualization/index.htm

6) Article by Ryan Mauer on Linux Journal, in 2 parts:

6-1) Xen Virtualization and Linux Clustering, Part 1 http://www.linuxjournal.com/article/8812

6-2) Xen Virtualization and Linux Clustering, Part 2 http://www.linuxjournal.com/article/8816

Commercial Companies:

7) XenSource:

http://www.xensource.com/

8) Enomalism: http://www.enomalism.com/

9) Thoughtcrime is a brand new company specialising in open source virtualisation solutions; see http://debian.thoughtcrime.co.nz/ubuntu/README.txt

IA64:

10) IA64 Master's thesis: HPC Virtualization with Xen on Itanium, by Havard K. F. Bjerke http://openlab-mu-internal.web.cern.ch/openlab%2Dmu%2Dinternal/Documents/2_Technical_Documents/Master_thesis/Thesis_HarvardBjerke.pdf

11) vBlades: Optimized Paravirtualization for the Itanium Processor Family Daniel J. Magenheimer and Thomas W. Christian http://www.usenix.org/publications/library/proceedings/vm04/tech/full_papers/magenheimer/magenheimer_html/index.html

12) Now and Xen Feature Story Article Written by Andrew Warfield and Keir Fraser http://www.linux-mag.com/2004-10/xen_01.html

13) Self Migration: http://www.diku.dk/~jacobg/self-migration

14) online magazine: http://www.virtual-strategy.com

15) Denali Project http://denali.cs.washington.edu

general links about virtualization:

16) http://www.virtualization.info

17) http://www.kernelthread.com/publications/virtualization

18) A Survey on Virtualization Technologies Susanta Nanda Tzi-cker Chiueh http://www.ecsl.cs.sunysb.edu/tr/TR179.pdf

AMD:

19) AMD I/O virtualization technology (IOMMU) specification Rev 1.00 – February 03, 2006 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/34434.pdf

20) AMD64 Architecture Programmer’s Manual, Vol 2: System Programming, Revision 3.11: added chapter 15 on virtualization (“Secure Virtual Machine”) (December 2005) http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf

21) AMD64 Architecture Programmer’s Manual, Vol 3: General-Purpose and System Instructions, Revision 3.11: added SVM instructions (December 2005) http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24594.pdf

22) AMD virtualization on the Xen Summit: http://www.xensource.com/files/xs0106_amd_virtualization.pdf

23) AMD Press Release: SUNNYVALE, CALIF. — May 23, 2006: availability of AMD processors with virtualization extensions: http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~108605,00.html

Open Solaris:

24) Open Solaris Xen Forum: http://www.opensolaris.org/jive/category.jspa?categoryID=32

25) Update to opensolaris Xen: adding OpenSolaris-based dom0 capabilities, as well as 32-bit and 64-bit MP guest. 14/07/2006 http://www.opensolaris.org/os/community/xen/announcements/?monthYear=July+2006

26) Open Sparc Hypervisor Spec: http://opensparc.sunsource.net/nonav/Hypervisor_api-26-v6.pdf

27) Open Sparc T1 page: http://opensparc.sunsource.net/nonav/opensparct1.html extension to the Solaris Zones: http://opensolaris.org/os/community/brandz

28) OSDL Development Wiki Homepage (Virtualization) http://www.osdl.org/cgi-bin/osdl_development_wiki.pl?Virtualization

29) fedora xen mailing list archive: https://www.redhat.com/archives/fedora-xen

30) Xen Quick Start for FC4 (Fedora Core 4). http://www.fedoraproject.org/wiki/FedoraXenQuickstart?highlight=%28xen%29

31) Xen Quick Start for FC5 (Fedora Core 5): http://www.fedoraproject.org/wiki/FedoraXenQuickstartFC5

32) Xen Quick Start for FC6 (Fedora Core 6): http://fedoraproject.org/wiki/FedoraXenQuickstartFC6

Fedora 7 quick start: http://fedoraproject.org/wiki/Docs/Fedora7VirtQuickStart Fedora 8 quick start: http://fedoraproject.org/wiki/Docs/Fedora8VirtQuickStart

33) The Xen repository is managed by the Mercurial version control system. Mercurial download: http://www.selenic.com/mercurial/release

34) Measuring CPU Overhead for I/O Processing in the Xen Virtual Machine Monitor, Ludmila Cherkasova and Rob Gardner http://www.hpl.hp.com/techreports/2005/HPL-2005-62.html

35) XenMon: QoS Monitoring and Performance Profiling Tool, Diwaker Gupta, Rob Gardner and Ludmila Cherkasova http://www.hpl.hp.com/techreports/2005/HPL-2005-187.html

36) Potemkin VMM: A virtual machine based on xen-unstable ; used in a honeypot By Michael Vrable, Justin Ma, Jay Chen, David Moore, Erik Vandekieft, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage. http://www.cs.ucsd.edu/~savage/papers/Sosp05.pdf

37) Memory Resource Management in VMware ESX Server http://www.usenix.org/events/osdi02/tech/waldspurger.html

books:

38) Virtualization: From the Desktop to the Enterprise By Erick M. Halter , Chris Wolf Published: May 2005 http://www.apress.com/book/bookDisplay.html?bID=449

39) Virtualization with VMware ESX Server Publisher: Syngress; 2005 by Al Muller, Seburn Wilson, Don Happe, Gary J. Humphrey http://www.syngress.com/catalog/?pid=3315

40) VMware ESX Server: Advanced Technical Design Guide by Ron Oglesby, Scott Herold http://www.amazon.com/exec/obidos/ASIN/0971151067/virtualizatio-20/002-5634453-4543251?creative=327641&camp=14573&link_code=as1

41) PPC: Hollis Blanchard, IBM Linux Technology Center Jimi Xenidis, IBM Research http://wiki.xensource.com/xenwiki/Xen/PPC

42) http://about-virtualization.com/mambo

43) PID virtualization: a wealth of choices http://lwn.net/Articles/171017/?format=printable

44) The Xen Hypervisor and its IO Subsystem: http://www.mulix.org/lectures/xen-iommu/xen-io.pdf

45) G. J. Popek and R. P. Goldberg, Formal requirements for virtualizable third generation architectures, Commun. ACM, vol. 17, no. 7, pp. 412-421, 1974. http://www.cis.upenn.edu/~cis700-6/04f/papers/popek-goldberg-requirements.pdf

46) “Running multiple operating systems concurrently on an IA32 PC using virtualization techniques” by Kevin Lawton (1999). http://www.floobydust.com/virtualization/lawton_1999.txt

47) Automating Xen Virtual Machine Deployment (talks about integrating SystemImager with Xen and more) by Kris Buytaert http://howto.x-tend.be/AutomatingVirtualMachineDeployment/

48) Virtualizing servers with Xen, Evaldo Gardenali, VI International Conference of Unix at UNINET http://umeet.uninet.edu/umeet2005/talks/evaldoa/xen.pdf

49) Survey of System Virtualization Techniques, Robert Rose, March 8, 2004 http://www.robertwrose.com/vita/rose-virtualization.pdf

NetBSD

50) http://www.netbsd.org/Ports/xen/

51) Interview on Xen with NetBSD developer Manuel Bouyer http://ezine.daemonnews.org/200602/xen.html

52) netbsd xen mailing list: http://www.netbsd.org/MailingLists/#port-xen

53) NetBSD/xen Howto http://www.netbsd.org/Ports/xen/howto.html

54) “C” API for Xen (LGPL) By Daniel Veillard and others

http://libvir.org/

55) Fraser Campbell page: http://goxen.org/

56) Another page from Fraser Campbell : http://linuxvirtualization.com/

57) Virtualization blog

58)Hardware emulation with QEMU (article)

59) http://linuxemu.retrofaction.com

60) Linux Virtualization with Xen http://www.linuxdevcenter.com/pub/a/linux/2006/01/26/xen.html

61) The virtues of Xen by Alex Maier http://www.redhat.com/magazine/014dec05/features/xen/

62) Deploying VirtualMachines as Sandboxes for the Grid Sriya Santhanam, Pradheep Elango, Andrea Arpaci Dusseau, Miron Livny http://www.cs.wisc.edu/~pradheep/SandboxingWorlds05.pdf

63) article: Xen and the new processors:

Infiniband:

64) Infiniband (Smart IO) wiki page http://wiki.xensource.com/xenwiki/XenSmartIO hg repository: http://xenbits.xensource.com/ext/xen-smartio.hg

65) Novell Infiniband and virtualization, Patrick Mullaney , may 1, 2007: http://www.openfabrics.org/archives/spring2007sonoma/Monday%20April%2030/Novell%20xen-ib-presentation-sonoma.ppt

66) A Case for High Performance Computing with Virtual Machines, Wei Huang, Jiuxing Liu, Bulent Abali et al. http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/huangwei-ics06.pdf

67) High Performance VMM-Bypass I/O in Virtual Machines, Wei Huang, Jiuxing Liu, Bulent Abali et al. (USENIX 06) http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/usenix06.pdf

68) User Mode Linux , a book By Jeff Dike. Bruce Perens’ Open Source Series. Published: Apr 12, 2006; http://www.phptr.com/title/0131865056

69) Xen 3.0.3 features,schedule : http://lists.xensource.com/archives/html/xen-devel/2006-06/msg00390.html

70) Practical Taint-Based Protection using Demand Emulation Alex Ho, Michael Fetterman, Christopher Clark et al. http://www.cs.kuleuven.ac.be/conference/EuroSys2006/papers/p29-ho.pdf

71) http://stateless.geek.nz/2006/09/11/current-virtualisation-hardware/ Current Virtualisation Hardware by Nicholas Lee

72) RAID: Installing Xen with RAID 1 on a Fedora Core 4 x86 64 SMP machine: http://www.freax.be/wiki/index.php/Installing_Xen_with_RAID_1_on_a_Fedora_Core_4_x86_64_SMP_machine

73) RAID 1 and Xen (dom0) : (On Debian) http://wiki.kartbuilding.net/index.php/RAID_1_and_Xen_(dom0)

74) OpenVZ Virtualization Software Available for Power Processors http://lwn.net/Articles/204275/

75) Kernel-based Virtual Machine patchset (Avi Kivity) adding /dev/kvm which exposes the virtualization capabilities to userspace. http://lwn.net/Articles/205580

76) Intel Technology Journal: Intel Virtualization Technology: articles by Intel staff (96 pages) http://download.intel.com/technology/itj/2006/v10i3/v10_iss03.pdf

77) kvm site: (Avi Kivity and others) Includes a howto and a white paper, download , faq sections. http://kvm.sourceforge.net/

78) kvm on debian: http://people.debian.org/~baruch/kvm http://baruch.ev-en.org/blog/Debian/kvm-in-debian

79) Linux Virtualization Wiki

80) “New virtualisation system beats Xen to Linux kernel” (about kvm) http://www.techworld.com/opsys/news/index.cfm?newsID=7586&pagtype=all

81) article about kvm: http://linux.inet.hr/finally-user-friendly-virtualization-for-linux.html

82) Virtual Linux: An overview of virtualization methods, architectures, and implementations. An article by M. Tim Jones (author of “GNU/Linux Application Programming”, “AI Application Programming”, and “BSD Sockets Programming from a Multilanguage Perspective”). http://www-128.ibm.com/developerworks/library/l-linuxvirt/index.html

83) Lguest: The Simple x86 Hypervisor by Rusty Russell (formerly lhype) http://lguest.ozlabs.org/

84) “Infrastructure virtualisation with Xen advisory” – a wiki article: using iSCSI for Xen clustering; shared storage http://docs.solstice.nl/index.php/Infrastructure_virtualisation_with_Xen_advisory

85) Xen with DRBD, GNBD and OCFS2 HOWTO http://xenamo.sourceforge.net/

86) Virtualization with Xen(tm): Including XenEnterprise, XenServer, and XenExpress (Paperback) by David E Williams, Syngress Publishing (April 1, 2007). ISBN-10: 1597491675; ISBN-13: 978-1597491679. http://www.amazon.com/Virtualization-Xen-Including-Xenenterprise-Xenexpress/dp/1597491675/ref=pd_bbs_sr_2/103-3574046-3264617?ie=UTF8&s=books&qid=1173899913&sr=8-2

Paperback: 512 pages

87) Professional XEN Virtualization (Paperback) by William von Hagen (Author)

Paperback: 500 pages. Publisher: Wrox (August 27, 2007). Language: English. ISBN-10: 0470138114; ISBN-13: 978-0470138113.

88) Xen and the Art of Consolidation Tom Eastep Linuxfest NW. April 29, 2007. http://www.shorewall.net/Linuxfest-2007.pdf

89) Optimizing Network Virtualization in Xen

By Willy Zwaenepoel, Alan L. Cox, Aravind Menon, usenix 2006 http://www.usenix.org/events/usenix06/tech/menon/menon_html/paper.html

90) Virtual Machine Checkpointing

Brendan Cully,University of British Columbia with Andrew Warfield, University of Cambridge

http://xcr.cenit.latech.edu/hapcw2006/program/papers/cr-xen-hapcw06-final.pdf

Adding new device and triggering the probe() functions

The following is a simple example which shows how to add a new device and trigger the probe() function of a backend driver using the xenstore-write tool. This is relevant for Xen 3.1.

Currently in Xen, triggering of the probe() method in a backend driver or a frontend driver is done by writing some values to the xenstore into directories where the xenbus poses watches. This writing to the xenstore is currently done in Xen from the python code, and it is wrapped deep inside the xend and/or xm commands. Eventually it is done in the writeDetails method of the DevController class. (And both blkif and netif use it).

For those who want to be able to trigger the probe() function without diving too deeply into the python code, this should suffice.

For the purposes of this little tutorial, let’s assume that you have built and installed Xen 3.1 from source and have used it to fire up a guest domain at least once. After you’ve done that, let’s say we want to add a new device. We will add a device named “mydevice”. Let’s begin with the backend. For this purpose, we will add a directory named “deviceback” to linux-2.6-xen-sparse/drivers/xen. This directory will store the backend portion of our driver.

First, create linux-2.6-xen-sparse/drivers/xen/deviceback. Next, add the following four files to that directory: deviceback.c, xenbus.c, common.h and Makefile.

Here is a minimal skeleton implementation of these files:

deviceback.c

#include <linux/module.h>
#include "common.h"
static int __init deviceback_init(void)
{
        device_xenbus_init();
        return 0;
}
static void __exit deviceback_cleanup(void)
{
}
module_init(deviceback_init);
module_exit(deviceback_cleanup);

xenbus.c

#include <xen/xenbus.h>
#include <linux/module.h>
#include <linux/slab.h>
struct backendinfo
{
        struct xenbus_device* dev;
        long int frontend_id;
        struct xenbus_watch backend_watch;
        struct xenbus_watch watch;
        char* frontpath;
};
//////////////////////////////////////////////////////////////////////////////
static int device_probe(struct xenbus_device* dev,
                        const struct xenbus_device_id* id)
{
        struct backendinfo* be;
        be = kmalloc(sizeof(*be), GFP_KERNEL);
        if (!be)
                return -ENOMEM;
        memset(be, 0, sizeof(*be));
        be->dev = dev;
        printk("Probe fired!\n");
        return 0;
}
//////////////////////////////////////////////////////////////////////////////
static int device_uevent(struct xenbus_device* xdev,
                          char** envp, int num_envp,
                          char* buffer, int buffer_size)
{
        return 0;
}
static int device_remove(struct xenbus_device* dev)
{
        return 0;
}
//////////////////////////////////////////////////////////////////////////////
static struct xenbus_device_id device_ids[] =
{
        { "mydevice" },
        { "" }
};
//////////////////////////////////////////////////////////////////////////////
static struct xenbus_driver deviceback =
{
        .name    = "mydevice",
        .owner   = THIS_MODULE,
        .ids     = device_ids,
        .probe   = device_probe,
        .remove  = device_remove,
        .uevent  = device_uevent,
};
//////////////////////////////////////////////////////////////////////////////
void device_xenbus_init()
{
        xenbus_register_backend(&deviceback);
}

common.h

#ifndef COMMON_H
#define COMMON_H
void device_xenbus_init(void);
#endif

Makefile

#Makefile
obj-y += xenbus.o deviceback.o

Next, we should add our new backend device to the Makefile in linux-2.6-xen-sparse/drivers/xen/Makefile. Add the following line to the bottom of that file:

obj-y += deviceback/

This will make sure that it will be included in the build.

Next, we need to add symlinks from linux-2.6-xen-sparse/drivers/xen/deviceback into linux-2.6.18-xen/drivers/xen/deviceback:

  1. Create linux-2.6.18-xen/drivers/xen/deviceback
  2. Change into that directory
  3. Add the symlinks: ‘ln -s ../../../../linux-2.6-xen-sparse/drivers/xen/deviceback/* .’
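
The three steps above can be sketched as a small script. It builds its own scratch copy of the two trees so it can be run anywhere; for real use, run the mkdir/cd/ln sequence at the root of your actual Xen source checkout (the directory names follow this tutorial and are assumptions):

#!/bin/sh
set -e
# Demo in a scratch directory; the layout mirrors the tutorial's:
# linux-2.6-xen-sparse holds the sources, linux-2.6.18-xen is the build tree.
top=$(mktemp -d)
cd "$top"
mkdir -p linux-2.6-xen-sparse/drivers/xen/deviceback
touch linux-2.6-xen-sparse/drivers/xen/deviceback/deviceback.c \
      linux-2.6-xen-sparse/drivers/xen/deviceback/xenbus.c
# 1. Create the build-tree directory
mkdir -p linux-2.6.18-xen/drivers/xen/deviceback
# 2. Change into that directory
cd linux-2.6.18-xen/drivers/xen/deviceback
# 3. Add the symlinks (the ../../../../ path is relative to this directory)
ln -s ../../../../linux-2.6-xen-sparse/drivers/xen/deviceback/* .
ls -l deviceback.c xenbus.c   # both should be symlinks into the sparse tree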

Now we should build the new drivers and reboot with the new Xen image. You can do this by going back to the root directory of the source tree (the place where you typed ‘make world’ when doing your normal build before) and do ‘make install-kernels’. (Note: This will overwrite the previous Xen kernel!) Finally, reboot.

After the machine boots back up, go ahead and start a guest domain and you should notice that device_probe() does not get executed. (Check /var/log/syslog on dom0 to look for the printk() to show up.)

How can we trigger the probe() function of our backend driver? We just need to write the correct key/value pairs into the xenstore.

The call to xenbus_register_backend() in xenbus.c causes xenbus to set a watch on local/domain/0/backend/mydevice in the xenstore. Specifically, anytime anything is written into that location in the store the watch fires and checks for a specific set of key/value pairs that indicate the probe should be fired.

So performing the following 4 calls using xenstore-write will trigger our probe() function. Replace X with the ID of a running guest domain. (Check ‘xm list’ for this. If you’ve only started one guest, this number is probably 1.)

xenstore-write /local/domain/X/device/mydevice/0/state 1
xenstore-write /local/domain/0/backend/mydevice/X/0/frontend-id X
xenstore-write /local/domain/0/backend/mydevice/X/0/frontend /local/domain/X/device/mydevice/0
xenstore-write /local/domain/0/backend/mydevice/X/0/state 1

You should see the probe message appear in dom0’s /var/log/syslog. What happened here behind the scenes, without going too deep, is that xenbus_register_backend() put a watch on the backend directory of /local/domain/0 in the xenstore. Once frontend, frontend-id, and state are all written to the watched location, the xenbus driver will gather all of that information, as well as the state of the frontend driver (written in that first line), and use it to set up the appropriate data structures. From there, the probe() function is finally fired.
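
The four xenstore-write calls can be wrapped in a small helper so you don’t have to retype the paths; this is just a sketch around the commands above (the function name is ours, and the /local/domain paths assume the “mydevice” layout used in this tutorial):

#!/bin/bash
# Trigger the backend probe() for a device by writing the key/value
# pairs the xenbus watch is looking for.
probe_backend() {
        local dev=$1 domid=$2
        if [ $# -ne 2 ]; then
                echo "Usage: probe_backend <device name> <guest domain id>"
                return 1
        fi
        xenstore-write /local/domain/${domid}/device/${dev}/0/state 1
        xenstore-write /local/domain/0/backend/${dev}/${domid}/0/frontend-id ${domid}
        xenstore-write /local/domain/0/backend/${dev}/${domid}/0/frontend \
                       /local/domain/${domid}/device/${dev}/0
        xenstore-write /local/domain/0/backend/${dev}/${domid}/0/state 1
}
# e.g.: probe_backend mydevice 1
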

Adding a frontend device

For this purpose, we will add a directory named “devicefront” to linux-2.6-xen-sparse/drivers/xen.

We will create 2 files there: devicefront.c and Makefile.

We will also add directories and symlinks as we did in the deviceback case.

Makefile

obj-y := devicefront.o

devicefront.c

The devicefront.c will be (a minimalist implementation):

// devicefront.c
#include <xen/xenbus.h>
#include <linux/module.h>
#include <linux/list.h>
struct device_info
{
        struct list_head list;
        struct xenbus_device* xbdev;
};
//////////////////////////////////////////////////////////////////////////////
static int devicefront_probe(struct xenbus_device* dev,
                             const struct xenbus_device_id* id)
{
        printk("Frontend Probe Fired!\n");
        return 0;
}
//////////////////////////////////////////////////////////////////////////////
static struct xenbus_device_id devicefront_ids[] =
{
        {"mydevice"},
        {}
};
static struct xenbus_driver devicefront =
{
        .name  = "mydevice",
        .owner = THIS_MODULE,
        .ids   = devicefront_ids,
        .probe = devicefront_probe,
};
static int __init devicefront_init(void)
{
        printk("%s\n", __FUNCTION__);
        xenbus_register_frontend(&devicefront);
        return 0;
}
module_init(devicefront_init);

We should also remember to add the following to the Makefile under linux-2.6-xen-sparse/drivers/xen:

obj-y   += devicefront/

Getting the frontend driver to fire is a bit more complicated; the following bash script should help you:

#!/bin/bash
if [ $# != 2 ]
then
        echo "Usage: $0 <device name> <frontend-id>"
else
        # Write backend information into the location the frontend will look
        # for it.
        xenstore-write /local/domain/${2}/device/${1}/0/backend-id 0
        xenstore-write /local/domain/${2}/device/${1}/0/backend \
                       /local/domain/0/backend/${1}/${2}/0
        # Write frontend information into the location the backend will look
        # for it.
        xenstore-write /local/domain/0/backend/${1}/${2}/0/frontend-id ${2}
        xenstore-write /local/domain/0/backend/${1}/${2}/0/frontend \
                       /local/domain/${2}/device/${1}/0
        # Set the permissions on the backend so that the frontend can
        # actually read it.
        xenstore-chmod /local/domain/0/backend/${1}/${2}/0 r
        # Write the states.  Note that the backend state must be written
        # last because it requires a valid frontend state to already be
        # written.
        xenstore-write /local/domain/${2}/device/${1}/0/state 1
        xenstore-write /local/domain/0/backend/${1}/${2}/0/state 1
fi

Here’s how to use it:

  • Startup a Xen guest that contains your frontend driver, and be sure dom0 contains the backend driver.
  • Figure out the frontend-id for the guest. This is the ID field when running xm list. Let’s say that number is 3.
  • Run the script as so: ./probetest.sh mydevice 3

That should fire both the frontend and backend probes. (You’ll have to check /var/log/messages in the guest to verify that the frontend probe was fired.)

Discussion

SimonKagstrom: Maybe it would be a good idea to split this document into several pages? It’s starting to be fairly long :)

TimPost: ACK, even if printed, this is hard to digest as an intro

Reference:

http://wiki.xensource.com/xenwiki/XenIntro

Read Full Post | Make a Comment ( 1 so far )

Asterisk + Twitter = Call Monitor — Good idea

Posted on October 2, 2009. Filed under: Linux, Services |

Copied from

http://www.uk-experience.com/2008/06/13/asterisk-twitter-call-monitor/

It’s a good idea.

 

Why? Why not. While wondering how else I could make use of Twitter, I thought, “How could I use Twitter with my Asterisk PABX?” Some Google searching showed I wasn’t the first to think of this (as if that would have happened). There are a few people doing various things with Asterisk and Twitter.

I didn’t really like much of what I found, or found it wasn’t really what I wanted to do with it .. so I decided to use some of that info and came up with something of my own.

The short story is this: when someone calls my Canadian number, my US toll-free number, or any number that ultimately rings the group of phones in the house, I will now get a direct message in Twitter from a special account I set up for this purpose, which also sends an SMS to my phone (device notifications in Twitter).

Be warned! Thar be scripts and technical stuff in them thar parts.
Ok, so let’s see what’s under the bonnet of this beast now.

First, you’ll need an AGI script. Look, here’s one I prepared earlier – twitter.agi

#!/bin/bash
# Background the curl process in case Twitter doesn't respond;
# otherwise it would hang the dialplan.
curl -u username:PASSWORD -d text="$1" \
-d user="recipient" \
http://twitter.com/direct_messages/new.xml &

Put the script in /var/lib/asterisk/agi-bin. Don’t forget to chmod +x the script as well.

Pretty simple eh? The only things you’ll have to change in the script are

  • username = your twitter login name
  • PASSWORD = your twitter password
  • recipient = Who will get your direct message

Ok, and to make it work in Asterisk I put this in the extensions.conf file for the exten that I wanted to monitor:

[home_phone]
exten => s,1,NoOp("This is my home phone context")
exten => s,n,Set(_DIDNUM="PUT YOUR DID NUM HERE")
exten => s,n,LookupCIDName
exten => s,n,AGI(twitter.agi|PBX: ${CALLERID(all)} just called on ${EXTEN})

I’ve only shown the first bits of this context; after this comes your normal context, such as Dial or whatever else you might need. I actually build in a few seconds of delay to allow the text message a chance to get to my phone before the phone rings. A sort of early warning system.

The only thing here you really need to change is the value of _DIDNUM. (Notice the underscore in front of the variable name; this isn’t strictly needed, but if you wish to use that variable elsewhere in the context or exten, it’s a good idea.) Obviously you can change the message after the “pipe” in the AGI line to customize the message you are sent.

The one thing I have left to work out is which Asterisk variable to use to actually get the DID number used to call into the PBX. If I can get that, this would be so much easier and more flexible. If you know what that is, please drop me a comment.

Well, that’s really about it. It’s not rocket science by any means, heck, it’s nothing at all but it works.

Read Full Post | Make a Comment ( 1 so far )

Sendmail configuration links

Posted on July 13, 2009. Filed under: Linux, Services |

Sendmail: How To Queue Mails based Server Load Average

http://techgurulive.com/2008/09/15/sendmail-how-to-queue-mails-based-server-load-average/?92febc20

 

Sendmail Config

http://www.acme.com/mail_filtering/sendmail_config.html

 

Quick-Tip: Configuring Sendmail with m4 and the sendmail.mc file

http://www.revsys.com/writings/quicktips/sendmail-mc.html

Read Full Post | Make a Comment ( None so far )

Howto disable ssh host key checking

Posted on March 12, 2009. Filed under: Linux, Services, Windows |

Remote login using the SSH protocol is a frequent activity in today’s internet world. With the SSH protocol, the onus is on the SSH client to verify the identity of the host to which it is connecting. The host identity is established by its SSH host key. Typically, the host key is auto-created during initial SSH installation setup.

By default, the SSH client verifies the host key against a local file containing known, trustworthy machines. This provides protection against possible Man-In-The-Middle attacks. However, there are situations in which you want to bypass this verification step. This article explains how to disable host key checking using OpenSSH, a popular Free and Open-Source implementation of SSH.

When you login to a remote host for the first time, the remote host’s host key is most likely unknown to the SSH client. The default behavior is to ask the user to confirm the fingerprint of the host key.

$ ssh peter@192.168.0.100
The authenticity of host '192.168.0.100 (192.168.0.100)' can't be established.
RSA key fingerprint is 3f:1b:f4:bd:c5:aa:c1:1f:bf:4e:2e:cf:53:fa:d8:59.
Are you sure you want to continue connecting (yes/no)?

If your answer is yes, the SSH client continues login, and stores the host key locally in the file ~/.ssh/known_hosts. You only need to validate the host key the first time around: in subsequent logins, you will not be prompted to confirm it again.

Yet, from time to time, when you try to remote login to the same host from the same origin, you may be refused with the following warning message:

$ ssh peter@192.168.0.100
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
3f:1b:f4:bd:c5:aa:c1:1f:bf:4e:2e:cf:53:fa:d8:59.
Please contact your system administrator.
Add correct host key in /home/peter/.ssh/known_hosts to get rid of this message.
Offending key in /home/peter/.ssh/known_hosts:3
RSA host key for 192.168.0.100 has changed and you have requested strict checking.
Host key verification failed.$

There are multiple possible reasons why the remote host key changed. A Man-in-the-Middle attack is only one possible reason. Other possible reasons include:

  • OpenSSH was re-installed on the remote host but, for whatever reason, the original host key was not restored.
  • The remote host was replaced legitimately by another machine.

If you are sure that this is harmless, you can use either of the two methods below to trick OpenSSH into letting you log in. But be warned that you have become vulnerable to man-in-the-middle attacks.

The first method is to remove the remote host from the ~/.ssh/known_hosts file. Note that the warning message already tells you the line number in the known_hosts file that corresponds to the target remote host. The offending line in the above example is line 3 (“Offending key in /home/peter/.ssh/known_hosts:3”).

You can use the following one liner to remove that one line (line 3) from the file.

$ sed -i 3d ~/.ssh/known_hosts

Note that with the above method, you will be prompted to confirm the host key fingerprint when you run ssh to login.
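
Deleting by line number is fragile if the file has changed since the warning was printed; matching on the hostname itself is safer. A small sketch, demonstrated on a scratch copy of known_hosts (the key lines are placeholders; point the variable at ~/.ssh/known_hosts for real use, and note that recent OpenSSH can do the same thing with ssh-keygen -R hostname):

#!/bin/sh
# Remove a known_hosts entry by hostname/IP rather than by line number.
KNOWN_HOSTS=$(mktemp)
cat > "$KNOWN_HOSTS" <<'EOF'
192.168.0.50 ssh-rsa AAAAB3...hostA
192.168.0.100 ssh-rsa AAAAB3...hostB
192.168.0.101 ssh-rsa AAAAB3...hostC
EOF
# Delete every line whose first host field matches (hosts may be
# comma-separated, hence the [ ,] after the address):
sed -i '/^192\.168\.0\.100[ ,]/d' "$KNOWN_HOSTS"
cat "$KNOWN_HOSTS"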

The second method uses two OpenSSH parameters:

  • StrictHostKeyChecking, and
  • UserKnownHostsFile.

This method tricks SSH by configuring it to use an empty known_hosts file, and NOT to ask you to confirm the remote host identity key.

$ ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no peter@192.168.0.100
Warning: Permanently added '192.168.0.100' (RSA) to the list of known hosts.
peter@192.168.0.100's password:

The UserKnownHostsFile parameter specifies the database file to use for storing the user host keys (default is ~/.ssh/known_hosts).

The /dev/null file is a special system device file that discards anything and everything written to it, and when used as the input file, returns End Of File immediately.

By configuring the null device file as the host key database, SSH is fooled into thinking that the SSH client has never connected to any SSH server before, and so will never run into a mismatched host key.

The parameter StrictHostKeyChecking specifies whether SSH will automatically add new host keys to the host key database file. By setting it to no, the host key is automatically added, without user confirmation, for every first-time connection. Because of the null key database file, every connection is treated as a first-time connection to any SSH server host, so the host key is always added without user confirmation. Writing the key to the /dev/null file discards the key and reports success.

Please refer to this excellent article about host keys and key checking.

By specifying the above 2 SSH options on the command line, you can bypass host key checking for that particular SSH login. If you want to bypass host key checking on a permanent basis, you need to specify those same options in the SSH configuration file.

You can edit the global SSH configuration file (/etc/ssh/ssh_config) if you want to make the changes permanent for all users.

If you want to target a particular user, modify the user-specific SSH configuration file (~/.ssh/config). The instructions below apply to both files.

Suppose you want to bypass key checking for a particular subnet (192.168.0.0/24).

Add the following lines to the beginning of the SSH configuration file.

Host 192.168.0.*
   StrictHostKeyChecking no
   UserKnownHostsFile=/dev/null

Note that the configuration file should have a line like Host * followed by one or more parameter-value pairs. Host * means that it will match any host. Essentially, the parameters following Host * are the general defaults. Because the first matched value for each SSH parameter is used, you want to add the host-specific or subnet-specific parameters to the beginning of the file.

As a final word of caution, unless you know what you are doing, it is probably best to bypass key checking on a case by case basis, rather than making blanket permanent changes to the SSH configuration files.

SSH Automatical login

Of course this is not the right phrase for it. It should be something like “key-based authorization with SSH”. Or simply “publickey authorization”. Or “unattended ssh login”. But I guess you know what I mean.

Here are the steps:

  1. Create a public ssh key, if you haven’t one already.
    Look at ~/.ssh. If you see a file named id_dsa.pub then you obviously already have a public key. If not, simply create one. ssh-keygen -t dsa should do the trick.
    Please note that there are other types of keys, e.g. RSA instead of DSA. I simply recommend DSA, but keep that in mind if you run into errors.
  2. Make sure your .ssh dir is 700:
    chmod 700 ~/.ssh
  3. Get your public ssh key on the server you want to login automatically.
    A simple scp ~/.ssh/id_dsa.pub remoteuser@remoteserver.com: is ok.
  4. Append the contents of your public key to the ~/.ssh/authorized_keys and remove it.
    Important: This must be done on the server you just copied your public key to. Otherwise you wouldn’t have had to copy it on your server.
    Simply issue something like cat id_dsa.pub >> .ssh/authorized_keys while at your home directory.
  5. Instead of steps 3 and 4, you can issue something like this:
    cat ~/.ssh/id_dsa.pub | ssh -l remoteuser remoteserver.com 'cat >> ~/.ssh/authorized_keys'
  6. Remove your public key from the home directory on the server.
  7. Done!
    You can now login:
    ssh -l remoteuser remoteserver.com or ssh remoteuser@remoteserver.com without getting asked for a password.

That’s all you need to do.
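
The append-and-permissions steps can be rehearsed locally before touching a real server. The sketch below uses a scratch directory standing in for the remote account’s home, with a placeholder key line; most distributions also ship ssh-copy-id, which automates the copy/append steps:

#!/bin/sh
set -e
# Scratch directory standing in for the remote user's home.
remote_home=$(mktemp -d)
echo "ssh-dss AAAAB3...== user@laptop" > "$remote_home/id_dsa.pub"

# Step 2 equivalent: the .ssh dir must be 700
mkdir -p "$remote_home/.ssh"
chmod 700 "$remote_home/.ssh"

# Step 4: append the public key, then remove the copied file
cat "$remote_home/id_dsa.pub" >> "$remote_home/.ssh/authorized_keys"
chmod 600 "$remote_home/.ssh/authorized_keys"   # sshd is picky about this too
rm "$remote_home/id_dsa.pub"
cat "$remote_home/.ssh/authorized_keys"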

Reference:

http://linuxcommando.blogspot.com/2008/10/how-to-disable-ssh-host-key-checking.html

http://wp.uberdose.com/2006/10/16/ssh-automatic-login/

Read Full Post | Make a Comment ( None so far )

DNS host setting

Posted on June 22, 2007. Filed under: Linux, Mac, Services, Windows |

File locations:

1. Windows

Windows 95/98/Me c:\windows\hosts

Windows NT/2000/XP Pro c:\winnt\system32\drivers\etc\hosts

Windows XP Home c:\windows\system32\drivers\etc\hosts

2. Linux/Unix/Mac OS X

/etc/hosts

 

Format (IP address, whitespace, hostname, optional aliases):

127.0.0.1    example.domain.com


Unix Daemon Server Programming

Posted on November 2, 2006. Filed under: C/C++, Linux, Services |

Introduction

Unix processes run either in the foreground or in the background. A process running in the foreground interacts with the user at the terminal (performs I/O), whereas a background process runs by itself; the user can check its status but does not (need to) know what it is doing. The term ‘daemon’ is used for processes that perform a service in the background. A server is a process that begins execution at startup (though not necessarily), runs forever, usually does not die or get restarted, operates in the background, waits for requests to arrive, responds to them, and frequently spawns other processes to handle these requests.

Readers are assumed to know Unix fundamentals and the C language. For a further description of any topic, use the “man” command (useful keywords are given in brackets); it has always been very useful, trust me :)) Keep in mind that this document does not contain everything; it is just a guide.

1) Daemonizing (programming to operate in background) [fork]

First the fork() system call is used to create a copy of our process (the child); then the parent exits. The orphaned child becomes a child of the init process (the initial system process, in other words the parent of all processes). As a result our process is completely detached from its parent and starts operating in the background.

	i=fork();
	if (i<0) exit(1); /* fork error */
	if (i>0) exit(0); /* parent exits */
	/* child (daemon) continues */

2) Process Independency [setsid]

A process receives signals from the terminal that it is connected to, and each process inherits its parent’s controlling tty. A server should not receive signals from the process that started it, so it must detach itself from its controlling tty.

In Unix systems, processes operate within a process group, so that all processes within a group are treated as a single entity. The process group (or session) is also inherited. A server should operate independently of other processes.

	setsid(); /* obtain a new process group */

This call will place the server in a new process group and session and detach its controlling terminal. (setpgrp() is an alternative for this)

3) Inherited Descriptors and Standard I/O Descriptors [getdtablesize,fork,open,close,dup,stdio.h]

Open descriptors are inherited by the child process, which may cause resources to be used unnecessarily. Unnecessary descriptors should be closed before the fork() system call (so that they are not inherited), or all open descriptors should be closed as soon as the child process starts running.

	for (i=getdtablesize();i>=0;--i) close(i); /* close all descriptors */

There are three standard I/O descriptors: standard input ‘stdin’ (0), standard output ‘stdout’ (1) and standard error ‘stderr’ (2). A standard library routine may read or write standard I/O, which may be connected to a terminal or a file. For safety, these descriptors should be opened and connected to a harmless I/O device (such as /dev/null).

	i=open("/dev/null",O_RDWR); /* open stdin */
	dup(i); /* stdout */
	dup(i); /* stderr */

As Unix assigns descriptors sequentially, the open() call will reopen descriptor 0 (stdin) and the dup() calls will provide copies for stdout and stderr.

4) File Creation Mask [umask]

Most servers run as super-user; for security reasons they should protect the files that they create. Setting the user file creation mask will prevent insecure permissions on newly created files.

	umask(027);

This will restrict file creation mode to 750 (complement of 027).

5) Running Directory [chdir]

A server should run in a known directory. There are many advantages; in fact the opposite has many disadvantages: suppose our server is started in a user’s home directory, then it may not be able to find some of its input and output files.

	chdir("/servers/");

The root “/” directory may not be appropriate for every server; it should be chosen carefully depending on the type of the server.

6) Mutual Exclusion and Running a Single Copy [open,lockf,getpid]

Most services require running only one copy of a server at a time. The file locking method is a good solution for mutual exclusion: the first instance of the server locks a file so that other instances know an instance is already running. If the server terminates, the lock is automatically released so that a new instance can run. Recording the pid of the running instance in the lock file is a good idea; it is surely more convenient to run ‘cat mydaemon.lock’ than ‘ps -ef|grep mydaemon’.

	lfp=open("exampled.lock",O_RDWR|O_CREAT,0640);
	if (lfp<0) exit(1); /* can not open */
	if (lockf(lfp,F_TLOCK,0)<0) exit(0); /* can not lock */
	/* only first instance continues */

	sprintf(str,"%d\n",getpid());
	write(lfp,str,strlen(str)); /* record pid to lockfile */

7) Catching Signals [signal,sys/signal.h]

A process may receive signals from a user or another process; it’s best to catch those signals and behave accordingly. Child processes send the SIGCHLD signal when they terminate; the server process must either ignore or handle these signals. Some servers also use the hang-up signal to restart or rehash the server, which is a good idea. Note that the ‘kill’ command sends SIGTERM (15) by default and that the SIGKILL (9) signal cannot be caught.

	signal(SIGCHLD,SIG_IGN); /* ignore child terminate signal */

The above code ignores the child terminate signal (on BSD systems the parent should wait for its children, so this signal should be caught to avoid zombie processes), and the code below demonstrates how to catch signals.

	void Signal_Handler(sig) /* signal handler function */
	int sig;
	{
		switch(sig){
			case SIGHUP:
				/* rehash the server */
				break;
			case SIGTERM:
				/* finalize the server */
				exit(0);
				break;
		}
	}

	signal(SIGHUP,Signal_Handler); /* hangup signal */
	signal(SIGTERM,Signal_Handler); /* software termination signal from kill */

First we construct a signal handling function and then tie up signals to that function.

8) Logging [syslogd,syslog.conf,openlog,syslog,closelog]

A running server creates messages, and naturally some are important and should be logged: a programmer wants to see debug messages, a system operator wants to see error messages. There are several ways to handle these messages.

Redirecting all output to standard I/O: this is what ancient servers do; they use stdout and stderr so that messages are written to the console, a terminal, a file, or are printed on paper. I/O is redirected when starting the server (to change the destination, the server must be restarted). In fact this kind of server is a program running in the foreground, not a daemon.

	# mydaemon 2> error.log

In this example the program prints output (stdout) messages to the console and error (stderr) messages to a file named “error.log”. Note that this is not a daemon but a normal foreground program.

Log file method : All messages are logged to files (to different files as needed). There is a sample logging function below.

	void log_message(filename,message)
	char *filename;
	char *message;
	{
	FILE *logfile;
		logfile=fopen(filename,"a");
		if(!logfile) return;
		fprintf(logfile,"%s\n",message);
		fclose(logfile);
	}

	log_message("conn.log","connection accepted");
	log_message("error.log","can not open file");

Log server method: A more flexible logging technique is to use a log server. Unix distributions have a system log daemon named “syslogd”. This daemon groups messages into classes (known as facilities), and these classes can be redirected to different places. Syslog uses a configuration file (/etc/syslog.conf) in which those redirection rules reside.

	openlog("mydaemon",LOG_PID,LOG_DAEMON);
	syslog(LOG_INFO, "Connection from host %d", callinghostname);
	syslog(LOG_ALERT, "Database Error !");
	closelog();

In the openlog call, “mydaemon” is a string that identifies our daemon, LOG_PID makes syslogd log the process id with each message, and LOG_DAEMON is the message class. In the syslog call the first parameter is the priority and the rest works like printf/sprintf. There are several message classes (or facility names), log options and priority levels. Here are some examples:

Message classes : LOG_USER, LOG_DAEMON, LOG_LOCAL0 to LOG_LOCAL7
Log options : LOG_PID, LOG_CONS, LOG_PERROR
Priority levels : LOG_EMERG, LOG_ALERT, LOG_ERR, LOG_WARNING, LOG_INFO

About

This text was written by Levent Karakas <levent at mektup dot at>. Several books, sources and manual pages were used. The text includes a sample daemon program (it compiles on Linux 2.4.2, OpenBSD 2.7, SunOS 5.8, SCO-Unix 3.2 and probably on your flavor of Unix). You can also download the plain source file: exampled.c. Hope you find this document useful. We do love Unix.

/*
UNIX Daemon Server Programming Sample Program
Levent Karakas <levent at mektup dot at> May 2001

To compile:	cc -o exampled exampled.c
To run:		./exampled
To test daemon:	ps -ef|grep exampled (or ps -aux on BSD systems)
To test log:	tail -f /tmp/exampled.log
To test signal:	kill -HUP `cat /tmp/exampled.lock`
To terminate:	kill `cat /tmp/exampled.lock`
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>
#include <sys/stat.h>

#define RUNNING_DIR	"/tmp"
#define LOCK_FILE	"exampled.lock"
#define LOG_FILE	"exampled.log"

void log_message(filename,message)
char *filename;
char *message;
{
FILE *logfile;
	logfile=fopen(filename,"a");
	if(!logfile) return;
	fprintf(logfile,"%s\n",message);
	fclose(logfile);
}

void signal_handler(sig)
int sig;
{
	switch(sig) {
	case SIGHUP:
		log_message(LOG_FILE,"hangup signal catched");
		break;
	case SIGTERM:
		log_message(LOG_FILE,"terminate signal catched");
		exit(0);
		break;
	}
}

void daemonize()
{
int i,lfp;
char str[10];
	if(getppid()==1) return; /* already a daemon */
	i=fork();
	if (i<0) exit(1); /* fork error */
	if (i>0) exit(0); /* parent exits */
	/* child (daemon) continues */
	setsid(); /* obtain a new process group */
	for (i=getdtablesize();i>=0;--i) close(i); /* close all descriptors */
	i=open("/dev/null",O_RDWR); dup(i); dup(i); /* handle standard I/O */
	umask(027); /* set newly created file permissions */
	chdir(RUNNING_DIR); /* change running directory */
	lfp=open(LOCK_FILE,O_RDWR|O_CREAT,0640);
	if (lfp<0) exit(1); /* can not open */
	if (lockf(lfp,F_TLOCK,0)<0) exit(0); /* can not lock */
	/* first instance continues */
	sprintf(str,"%d\n",getpid());
	write(lfp,str,strlen(str)); /* record pid to lockfile */
	signal(SIGCHLD,SIG_IGN); /* ignore child */
	signal(SIGTSTP,SIG_IGN); /* ignore tty signals */
	signal(SIGTTOU,SIG_IGN);
	signal(SIGTTIN,SIG_IGN);
	signal(SIGHUP,signal_handler); /* catch hangup signal */
	signal(SIGTERM,signal_handler); /* catch kill signal */
}

main()
{
	daemonize();
	while(1) sleep(1); /* run */
}

/* EOF */
1. Compile:
cc -o exampled exampled.c
2. Run:
./exampled
3. Test the daemon process:
ps -ef|grep exampled (or ps -aux on BSD systems)
4. Watch the log:
tail -f /tmp/exampled.log
5. Test the signal handler:
kill -HUP `cat /tmp/exampled.lock`
6. Terminate:
kill `cat /tmp/exampled.lock`

CVS server on Ubuntu 6.06

Posted on October 8, 2006. Filed under: Linux, Programming, Services |

1. Edit /etc/apt/sources.list
sudo vi /etc/apt/sources.list 
 
uncomment the following two lines:
deb http://cn.archive.ubuntu.com/ubuntu/ dapper universe
deb-src http://cn.archive.ubuntu.com/ubuntu/ dapper universe
 
update the package lists:
sudo apt-get update
2. Install CVS
install CVS files:
sudo apt-get install cvs 
 
install CVS server:
sudo apt-get install cvsd
When prompted in the cvsd installation process for Repository, type in “/cvsroot”.
3. Configure CVS
go to /var/lib/cvsd and build the chroot directory structure for cvsd:
sudo cvsd-buildroot /var/lib/cvsd 
 
create the folder cvsroot:
sudo mkdir cvsroot

and then initialize the repository:
sudo cvs -d /var/lib/cvsd/cvsroot init
sudo chown -R cvsd:cvsd cvsroot
create a user and password:
sudo cvsd-passwd /var/lib/cvsd/cvsroot +justin
and then change the AUTH type:
sudo vi /var/lib/cvsd/cvsroot/CVSROOT/config
uncomment the “SystemAuth=no” line.
4. Test it
cvs -d :pserver:justin@localhost:/cvsroot login
cvs -d :pserver:justin@localhost:/cvsroot checkout .
5. Done


Netfilter arch in IPv4

Posted on March 6, 2003. Filed under: Linux, Services |

Netfilter architecture in IPv4

A Packet Traversing the Netfilter System:

   --->[1]--->[ROUTE]--->[3]--->[4]--->
                 |            ^
                 |            |
                 |         [ROUTE]
                 v            |
                [2]          [5]
                 |            ^
                 |            |
                 v            |


Packets come in from the left. After verification of the IP checksum, the packets hit the NF_IP_PRE_ROUTING [1] hook.

Next they enter the routing code, which decides if the packets are local or have to be passed to another interface.

If the packets are considered to be local, they traverse the NF_IP_LOCAL_IN [2] hook and are passed to the local process (if any) afterwards.

If the packets are routed to another interface, they pass the NF_IP_FORWARD [3] hook.

The packets pass a final netfilter hook, NF_IP_POST_ROUTING [4], before they are transmitted on the outgoing interface.

The NF_IP_LOCAL_OUT [5] hook is called for locally generated packets. Here you can see that routing occurs after this hook is called: in fact, the routing code is called first (to figure out the source IP address and some IP options) and is called again if the packet is altered.

Locally generated packets hit NF_IP_POST_ROUTING [4], too.

 

Kernel modules can register a callback function for each of these hooks. The callback function is called for each packet traversing the hook. The module is free to alter the packet, and it has to return one of these netfilter constants:

 

  • NF_ACCEPT continue traversal as normal
  • NF_DROP drop the packet; do not continue traversal
  • NF_STOLEN I’ve taken over the packet; do not continue traversal
  • NF_QUEUE queue the packet (usually for userspace handling)
  • NF_REPEAT call this hook again

Active FTP vs Passive FTP

Posted on January 21, 2003. Filed under: Computer Science, Perl, Services |

In active mode FTP the client connects from a random unprivileged port (N > 1023) to the FTP server’s command port, port 21. Then, the client starts listening to port N+1 and sends the FTP command PORT N+1 to the FTP server. The server will then connect back to the client’s specified data port from its local data port, which is port 20.

From the server-side firewall’s standpoint, to support active mode FTP the following communication channels need to be opened:

  • FTP server’s port 21 from anywhere (Client initiates connection)
  • FTP server’s port 21 to ports > 1023 (Server responds to client’s control port)
  • FTP server’s port 20 to ports > 1023 (Server initiates data connection to client’s data port)
  • FTP server’s port 20 from ports > 1023 (Client sends ACKs to server’s data port)

When drawn out, the connection appears as follows:

[Diagram: active mode FTP connection]


In step 1, the client’s command port contacts the server’s command port and sends the command PORT 1027. The server then sends an ACK back to the client’s command port in step 2. In step 3 the server initiates a connection on its local data port to the data port the client specified earlier. Finally, the client sends an ACK back as shown in step 4.

The main problem with active mode FTP actually falls on the client side. The FTP client doesn’t make the actual connection to the data port of the server; it simply tells the server what port it is listening on, and the server connects back to the specified port on the client. From the client-side firewall’s standpoint this appears to be an outside system initiating a connection to an internal client, something that is usually blocked.

 

In order to resolve the issue of the server initiating the connection to the client a different method for FTP connections was developed. This was known as passive mode, or PASV, after the command used by the client to tell the server it is in passive mode.

In passive mode FTP the client initiates both connections to the server, solving the problem of firewalls filtering the incoming data port connection to the client from the server. When opening an FTP connection, the client opens two random unprivileged ports locally (N > 1023 and N+1). The first port contacts the server on port 21, but instead of then issuing a PORT command and allowing the server to connect back to its data port, the client will issue the PASV command. The result of this is that the server then opens a random unprivileged port (P > 1023) and sends the PORT P command back to the client. The client then initiates the connection from port N+1 to port P on the server to transfer data.

From the server-side firewall’s standpoint, to support passive mode FTP the following communication channels need to be opened:

  • FTP server’s port 21 from anywhere (Client initiates connection)
  • FTP server’s port 21 to ports > 1023 (Server responds to client’s control port)
  • FTP server’s ports > 1023 from anywhere (Client initiates data connection to random port specified by server)
  • FTP server’s ports > 1023 to remote ports > 1023 (Server sends ACKs (and data) to client’s data port)

When drawn, a passive mode FTP connection looks like this:


[Diagram: passive mode FTP connection]

In step 1, the client contacts the server on the command port and issues the PASV command. The server then replies in step 2 with PORT 2024, telling the client which port it is listening to for the data connection. In step 3 the client then initiates the data connection from its data port to the specified server data port. Finally, the server sends back an ACK in step 4 to the client’s data port.

While passive mode FTP solves many of the problems from the client side, it opens up a whole range of problems on the server side. The biggest issue is the need to allow any remote connection to high numbered ports on the server. Fortunately, many FTP daemons, including the popular WU-FTPD allow the administrator to specify a range of ports which the FTP server will use. See Appendix 1 for more information.

The second issue involves supporting and troubleshooting clients that do (or do not) support passive mode. For example, the command line FTP utility provided with Solaris does not support passive mode, necessitating a third-party FTP client such as ncftp.

With the massive popularity of the World Wide Web, many people prefer to use their web browser as an FTP client. Most browsers only support passive mode when accessing ftp:// URLs. This can either be good or bad depending on what the servers and firewalls are configured to support.


The following chart should help admins remember how each FTP mode works:

Active FTP :

command : client >1023 -> server 21

data : client >1023 <- server 20

 

Passive FTP :

command : client >1023 -> server 21

data : client >1023 -> server >1023

A quick summary of the pros and cons of active vs. passive FTP is also in order:

Active FTP is beneficial to the FTP server admin, but detrimental to the client side admin. The FTP server attempts to make connections to random high ports on the client, which would almost certainly be blocked by a firewall on the client side. Passive FTP is beneficial to the client, but detrimental to the FTP server admin. The client will make both connections to the server, but one of them will be to a random high port, which would almost certainly be blocked by a firewall on the server side.

Luckily, there is somewhat of a compromise. Since admins running FTP servers will need to make their servers accessible to the greatest number of clients, they will almost certainly need to support passive FTP. The exposure of high level ports on the server can be minimized by specifying a limited port range for the FTP server to use. Thus, everything except for this range of ports can be firewalled on the server side. While this doesn’t eliminate all risk to the server, it decreases it tremendously.

 

Other references:

FTP: File Transfer Using Perl

http://www.foo.be/docs/tpj/issues/vol1_3/tpj0103-0007.html

http://aplawrence.com/Unixart/perlnetftp.html

http://perldoc.perl.org/Net/FTP.html


iptables setting to the NAT

Posted on January 18, 2003. Filed under: Linux, Services |

/sbin/iptables -F
/sbin/iptables -t nat -F
/sbin/iptables -P INPUT ACCEPT
/sbin/iptables -P OUTPUT ACCEPT
/sbin/iptables -P FORWARD ACCEPT
/sbin/iptables -t nat -A POSTROUTING -s 192.168.0.0/24 -o eth1 -j MASQUERADE
echo 1 > /proc/sys/net/ipv4/ip_forward

