Computer Science

Thrift vs. Protocol Buffers

Posted on March 21, 2010. Filed under: Computer Science, Programming |

Google recently released its Protocol Buffers as open source. About a year ago, Facebook released a similar product called Thrift. I’ve been comparing them; here’s what I’ve found:

Feature | Thrift | Protocol Buffers
Backers | Facebook, Apache (accepted for incubation) | Google
Bindings | C++, Java, Python, PHP, XSD, Ruby, C#, Perl, Objective C, Erlang, Smalltalk, OCaml, and Haskell | C++, Java, Python (Perl, Ruby, and C# under discussion)
Output Formats | Binary, JSON | Binary
Primitive Types | bool; 16/32/64-bit integers; double; byte sequence | bool; 32/64-bit integers; byte sequence; “repeated” properties act like lists
Enumerations | Yes | Yes
Constants | Yes | No
Composite Type | struct | message
Exception Type | Yes | No
Documentation | So-so | Good
License | Apache | BSD-style
Compiler Language | C++ | C++
RPC Interfaces | Yes | Yes
RPC Implementation | Yes | No
Composite Type Extensions | No | Yes

Overall, I think Thrift wins on features and Protocol Buffers win on
documentation. Implementation-wise, they’re quite similar. Both use
integer tags to identify fields, so you can add and remove fields
without breaking existing code. Protocol Buffers support
variable-width encoding of integers, which saves a few bytes. (Thrift
has an experimental output format with variable-width ints.)
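
To make the variable-width idea concrete, here is a minimal Python sketch of the base-128 "varint" scheme that Protocol Buffers use for integers: seven payload bits per byte, least significant group first, with the high bit set on every byte except the last. The function names are my own and not part of either library, and the sketch handles non-negative integers only.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest seven bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: set the high bit
        else:
            out.append(low7)         # last byte: high bit clear
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> tuple:
    """Return (value, next_position) for the varint starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # high bit clear marks the final byte
            return result, pos
        shift += 7
```

Small values such as 1 take a single byte, while a fixed-width 32-bit field would always take four; that is where the "few bytes" of savings come from.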

The major difference is that Thrift provides a full client/server RPC
implementation, whereas Protocol Buffers only generate stubs to use in
your own RPC system.

Update July 12, 2008: I haven’t tested for speed, but from a cursory examination it seems that, at the binary level, Thrift and Protocol Buffers are very similar. I think Thrift will develop a more coherent community now that it’s under Apache incubation. It just moved to a new web site and mailing list, and the issue tracker is active.

Reference: (Original Site)


Google Protocol Buffers and other data interchange formats

Posted on March 20, 2010. Filed under: Computer Science, Programming, Services |

We’ve been planning on moving to a new messaging protocol for a while. We’ve looked at a lot of different solutions but had enough issues with every one proposed to date that we haven’t made a decision. JR Boyens pointed us to Google’s announcement, “Protocol Buffers: Google’s Data Interchange Format,” in July. I glanced at it, but then it got lost in the everyday noise. Recent work on a project brought it more attention. I like what I see.

As part of a new offering we decided to add in our new messaging direction. We’re processing realtime voice conversations. Some of our major considerations are:

  1. Latency and Performance – Latency matters to us. A LOT. I’m including not only network transport but also memory and CPU: the total time it takes for a message to get from its native format in the sender to its native format in the receiver. We’re dealing with real-time voice communications; with too much latency, the best case is that the caller’s experience suffers. Our labor model is also sensitive to even small changes in latency. The smaller the latency, the more efficient we are, the happier our client’s customers are, and the more money we make. As greedy capitalists, we see that as a good thing.
  2. Versioning – Our current system has no versioning. Yeah, shortsighted on my part. We have to fix it, so it’s required for any new message protocol. Protobuf fits our needs on this nicely. Different versions have to coexist and interoperate. We could do this on a different layer than the messaging, but it makes sense to me to keep it at this level.
  3. Java and C++ – Language independence is cool and all, but in practice if the protocol supports Java and C++ we’re good to go. Maybe I’m being a bit myopic, but my feeling is that the likelihood that whatever we choose will expand to support more languages in the future is very high if it supports several today.
  4. Internal – We control the end points. I don’t care if the schema is external to the data package. In fact, for our use case that’s a plus. For any external services we’ll still expose those using the usual standards. Internally our applications will be using PB for their messaging format.
In short, we’re all about high-volume, low-latency messages.

Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the “old” format.

Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers:

  • are simpler
  • are 3 to 10 times smaller
  • are 20 to 100 times faster
  • are less ambiguous
  • generate data access classes that are easier to use programmatically
Seems like a decent fit. OK, actually an awesome fit. One of our developers has been doing some testing. It’s impressive.



To me protobuf feels like compiled JSON. They are very similar. The main difference is that JSON sends data over the wire in text format, whereas protobuf uses a binary format. The latter has the advantage of being smaller and faster for a computer to parse.
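
A rough way to see the size difference is to encode the same record as JSON text and as packed binary. The Python sketch below uses the standard `struct` module as a stand-in for a binary format; it is not the protobuf wire format, and the record and its field names are hypothetical.

```python
import json
import struct

# Hypothetical message: a call ID, a sequence number, and a latency reading.
record = {"call_id": 123456, "seq": 42, "latency_ms": 17.5}

# Text encoding: field names and digits all travel as characters.
as_json = json.dumps(record).encode("utf-8")

# Binary encoding: both sides agree on the layout ahead of time
# (little-endian uint32, uint32, float64), so no field names are sent.
LAYOUT = "<IId"
as_binary = struct.pack(
    LAYOUT, record["call_id"], record["seq"], record["latency_ms"]
)

# The receiver unpacks with the same layout to get native values back.
call_id, seq, latency_ms = struct.unpack(LAYOUT, as_binary)

print(len(as_json), "bytes as JSON,", len(as_binary), "bytes as binary")
```

The binary form is smaller mainly because the schema lives outside the message, which is the same trade-off protobuf makes.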


Why not ASN.1? Seems like one of the best choices. Well understood and widely used. Sure the full ASN.1 specification is complex but we’d only need a small subset. I’m still struggling with this one a bit. Tool support seems a bit better in protobuf and it’s definitely simpler.


Facebook’s Thrift is very similar to protobuf. Not surprising since the main author interned at Google. It’s a strong offering and recently became an Apache project. Nice stuff. Stuart Sierra has a nice comparison on his blog, Thrift vs. Protocol Buffers. Another worthy contender but not a big enough advantage to stop the internal momentum protobuf already has.


The HDF wiki has an entry Google Protocol Buffers and HDF5 that concludes:

In summary, Protocol Buffers and HDF5 were designed to serve different kinds of data intensive applications: a network based transient message system, and a high performance data storage system for very large datasets such as multi-dimensional images, respectively. That said, both 1) offer open source technologies that can reduce data management headaches for individual developers and projects, 2) increase the ability to share data through the use of well-defined binary formats and supporting libraries that run on a variety of platforms, and 3) provide the ability to access data stored with “older” versions of the data structures.

Different design goals. HDF5 doesn’t fit our needs as well.


The Hessian protocol has the following design goals:

  • It must not require external IDL or schema definitions, i.e. the protocol should be invisible to application code.
  • It must be language-independent.
  • It must be simple so it can be effectively tested and implemented.
  • It must be as fast as possible.
  • It must be as compact as possible.
  • It must support Unicode strings.
  • It must support 8-bit binary data (i.e. without encoding or using attachments.)
  • It must support encryption, compression, signature, and transaction context envelopes.

I still haven’t figured out how/if you can version your messages. Can you add and remove fields and still have compatibility (backwards and forwards)?  Cool effort, still feels very rough in places. Another worthy effort to consider.


At its core it’s

The Ice core library. Among many other features, the Ice core library manages all the communication tasks using a highly efficient protocol (including protocol compression and support for both TCP and UDP), provides a flexible thread pool for multi-threaded servers, and offers additional functionality that supports extreme scalability with potentially millions of Ice objects.

ICE is a comprehensive middleware system. It can even use PB as its messaging layer. Its own messaging layer doesn’t handle adding or removing fields as well as PB. We don’t need the RPC side of ICE. Just not a good fit for us.


Service Data Objects provides a rather ambitious messaging architecture. Its concerns aren’t speed and efficiency. The SDO V2.1 White Paper states:

SDO is intended to create a uniform data access layer that provides a data access solution for heterogeneous data sources in an easy-to-use manner that is amenable to tooling and frameworks.

Interesting, not a fit.

Cisco Etch

Primary focus is an RPC implementation, not a messaging protocol. Steve Vinoski summarized it nicely in Just What We Need: Another RPC Package. In fairness Steve had some negative thoughts on PB also in Protocol Buffers: Leaky RPC. However, his concerns are around the undefined RPC features Google put in PB, not the IDL type aspects of PB.

Some other XML based protocol

Yeah, I know: the problem with XML isn’t XML, it’s the parsers. Cute argument. Getting my message from native format on one system to native format on another as fast as possible is what matters to me. So oddly enough, parsers are part of the equation. Yeah, JAXB is fast, but just how fast?

Remember, we’re all about high volume low latency messages. It’s not a focus for XML. Yep, no one will take issue with that statement!

Binary XML

Enough said. Next.


CORBA

Well-defined IDL, a bit complicated (because it addresses a wide range of issues). Built into the JDK! Not designed for speed or efficiency. Bad fit.


Several good choices. I’m sure there are others I missed. We’re going with protobuf. Early tests by our developers have been very impressive. Google fan bois can rejoice and the Google haters can gripe. In the meantime we’ve got a job to do.

Reference: (Original)


GNU Midnight Commander

Posted on June 6, 2008. Filed under: Computer Science, Linux, Programming |

GNU Midnight Commander is a file manager for free operating systems. Like
other GNU software, it is written in a portable manner and should compile
and run on a wide range of operating systems with some degree of UNIX


Common Criteria

Posted on November 24, 2007. Filed under: Computer Science, Programming |

The Common Criteria is the result of the integration of information technology and computer security criteria. In 1983 the US issued the Trusted Computer System Evaluation Criteria (TCSEC), which became a standard in 1985. Criteria developments in Canada and the European ITSEC countries followed the original US TCSEC work. The US Federal Criteria development was an early attempt to combine these other criteria with the TCSEC, and eventually led to the current pooling of resources towards production of the Common Criteria.

Version 1.0 of the CC was published for comment in January 1996. Version 2.0 took account of extensive review and trials during the next two years and was published in May 1998. Version 2.0 was adopted by the International Organization for Standardization (ISO) as an International Standard (ISO 15408) in 1999.

In 2005, the interpretations that had been made to date were incorporated into an update, version 2.3. This was published as ISO/IEC 15408-1:2005, 15408-2:2005, and 15408-3:2005; the corresponding update of the CEM was published as ISO/IEC 18045:2005. In September 2006, CC Version 3.1 was published. The new version provided a major change to the Security Assurance Requirements and incorporated all approved Interpretations. In September 2007, minor changes/corrections were incorporated into Version 3.1 and Revision 2 became official.

The Common Criteria is composed of three parts: the Introduction and General Model (Part 1), the Security Functional Requirements (Part 2), and the Security Assurance Requirements (Part 3). While Part 3 specifies the actions that must be performed to gain assurance, it does not specify how those actions are to be conducted; to address this, the Common Evaluation Methodology (CEM) was created for the lower levels of assurance.

This common methodology is the basis upon which the member nations have agreed to recognize the evaluation results of one another, as specified in the “Arrangement on the Recognition of Common Criteria Certificates in the field of Information Technology Security”. This was first signed in 2000 and additional member nations continue to join this agreement.

The CC and CEM continue to evolve as their use spreads. This evolution is propagated through the use of Interpretations, which are formal changes periodically made to the CC/CEM that have been mutually agreed by the participating producing nations.

The following links are to the CC, CEM, and their interpretations, as well as to other informative documents.


A Perl script that can monitor Windows

Posted on June 18, 2007. Filed under: Computer Science, Programming |

A Perl script that monitors a Windows host through WMI (physical and available memory, CPU load, and free disk space):

#!/usr/bin/perl -w
use strict;
use Win32;                      # provides Win32::FormatMessage
use Win32::OLE qw[in];

# Host to monitor; '.' means the local machine.
my $host = $ARGV[0] || '.';
my $wmi = Win32::OLE->GetObject( "winmgmts://$host/root/cimv2" )
        or die Win32::FormatMessage( Win32::OLE->LastError() );

# WMI class name => subroutine that formats its instances.
my %instances = (
        Win32_PhysicalMemory => \&get_pmem,
        Win32_PerfRawData_PerfOS_Memory => \&get_amem,
        Win32_Processor => \&get_load,
        Win32_LogicalDisk => \&get_disk,
);

while(1) {
        my $out = get_perf_data();
        print $out;
        print "\n";
        sleep 5;        # polling interval is an assumption; the original shows no delay
}

# collect one timestamped report from every WMI class
sub get_perf_data {
        my($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);
        $year = $year + 1900;
        $mon  = $mon + 1;
        my $str = sprintf "%4.4d-%2.2d-%2.2d",$year,$mon,$mday;
        my $timestr = sprintf "%2.2d:%2.2d:%2.2d",$hour,$min,$sec;
        my $mem = "";
        foreach ( keys %instances ) {
                my $class = $wmi->InstancesOf( $_ );
                $mem .= $instances{ $_ }->( $class );
        }
        my $out = "##\nCollect Time: ".$str." ".$timestr."\n".$mem."%%\r\n";
        return $out;
}

# get cpu loadavg
sub get_load {
        my $class = shift;
        my $total = "";
        my $i = 0;
        foreach my $cpu ( in($class) ) {
                $i++;
                $total .= "CPU No. $i: ".$cpu->{LoadPercentage}."%\n";
        }
        return $total;
}

# get total memory size
sub get_pmem {
        my $class = shift;
        my $total = 0;
        $total += $_->{Capacity} foreach in($class);
        return "Physical Memory: $total Bytes\n";
}

# get available memory size
sub get_amem {
        my $class = shift;
        my $amem = join ' ', map { $_->{AvailableBytes} } in($class);
        return "Available Memory: $amem Bytes\n";
}

# get free disk sizes
sub get_disk {
        my $class = shift;
        my $total = "";
        $total .= "DISK ".$_->{DeviceID}." Free: ".$_->{FreeSpace}." Bytes\n" foreach in($class);
        return $total;
}


Wavelet video compress

Posted on May 7, 2005. Filed under: Computer Science |

Here are some links on wavelet video compression:

Motion Wavelets Video

Wavelets for Motion and Video Coding


Fractal encoding & compression

Posted on December 19, 2004. Filed under: Computer Science, Programming |

Fractal compression is a lossy image compression method using fractals to achieve high levels of compression. The method is best suited for photographs of natural scenes (trees, mountains, ferns, clouds).

Here are some links:
Yuval Fisher’s Fractal Links
Geoff Davis’s homepage
On Fractal Compression
Iterated function systems and compression
Fractal Image Compression
Algorithm for Fast Fractal Image Compression
IFS – Fractal Image Compression


Wavelets for Image Compression

Posted on November 12, 2004. Filed under: Computer Science |

Wavelets are commonly used for image compression. Here are some related links:

Image Compression with Set Partitioning in Hierarchical Trees

Data Compression Bibliography

Introduction to Wavelets for Image Compression

Wavelet Filter Evaluation for Image Compression

EZW encoding

An Implementation of EZW

Wavelets and Signal Processing

VcDemo – Image and Video Compression Learning Tool

Eduard Kriegler’s EZW Encoder

GWIC – GNU Wavelet Image Codec

The Wavelet Tutorial

Wavelet Compression Example

A brief guide to wavelet sources


Data Compression Bibliography

Posted on November 11, 2004. Filed under: Computer Science |

The University of Washington has a nice bibliography here, with pointers to books on Data Compression, VQ, Wavelets, and Information Theory.


Data Compression Researchers

Posted on November 10, 2004. Filed under: Computer Science |

The page from the Google directory.


Compression via Arithmetic Coding in Java

Posted on November 10, 2004. Filed under: Computer Science, Java |

Bob Carpenter has created a nice Java package that implements a PPM/arithmetic coding compression system. This page includes links to the source code, javadocs, and a fair amount of tutorial material. Very complete!


Run Length Encoding

Posted on November 9, 2004. Filed under: Computer Science |

Run Length Encoding (RLE) is a very simple form of data compression. Its principle is that every run in the stream (a run is a sequence of repeated data values) is replaced with a count and a single value. This works best on data in which runs of repeated values are common; RLE is usually applied to files that contain a large number of consecutive occurrences of the same byte pattern.
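
The count-and-value idea fits in a few lines of Python. This is a minimal sketch, with function names of my own choosing, that stores each run as a `(count, value)` pair rather than in any particular file format.

```python
from itertools import groupby


def rle_encode(data: bytes) -> list:
    """Collapse each run of identical bytes into a (count, value) pair."""
    return [(len(list(run)), value) for value, run in groupby(data)]


def rle_decode(pairs) -> bytes:
    """Expand (count, value) pairs back into the original byte string."""
    return b"".join(bytes([value]) * count for count, value in pairs)
```

For example, `rle_encode(b"AAAABBBCCD")` yields four pairs for ten input bytes. On data with no runs, every byte becomes its own pair, so the "compressed" form is larger than the input.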

Some related links:
RLE – Run Length Encoding

A Run Length Encoding Scheme For Block Sort Transformed Data


Arithmetic Coding

Posted on November 8, 2004. Filed under: Computer Science |

Arithmetic coding (AC) is a special kind of entropy coding. Unlike Huffman coding, arithmetic coding doesn’t use a whole number of bits for each symbol it compresses. For every source it comes close to the optimum compression in the sense of Shannon’s theorem, and it is well suited to adaptive models. The biggest drawback of arithmetic coding is its low speed, because several multiplications and divisions are needed for each symbol. The main idea behind arithmetic coding is to assign an interval to each symbol. Starting with the interval [0..1), each interval is divided into several subintervals whose sizes are proportional to the current probabilities of the corresponding symbols of the alphabet. The subinterval of the coded symbol is then taken as the interval for the next symbol. The output is the interval of the last symbol. Implementations write the bits of this interval as soon as they are certain.
A fast variant of arithmetic coding, which uses fewer multiplications and divisions, is the range coder, which works byte by byte. The compression rate of a range coder is only slightly lower than that of pure arithmetic coding, and in many real implementations the difference is not noticeable.
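
The interval narrowing can be sketched in Python with exact fractions. This is a teaching sketch under simplifying assumptions: a fixed (non-adaptive) probability table, a known message length, and unbounded-precision arithmetic. Real coders use fixed-width integers and emit bits as they go. All names are my own.

```python
from fractions import Fraction


def build_intervals(probabilities):
    """Map each symbol to its [low, high) slice of [0, 1)."""
    intervals = {}
    low = Fraction(0)
    for symbol, p in probabilities.items():
        intervals[symbol] = (low, low + p)
        low += p
    return intervals


def encode(message, probabilities):
    """Narrow [0, 1) once per symbol and return the final interval."""
    intervals = build_intervals(probabilities)
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        width = high - low
        sym_low, sym_high = intervals[symbol]
        # The symbol's subinterval becomes the interval for the next symbol.
        low, high = low + width * sym_low, low + width * sym_high
    return low, high


def decode(value, length, probabilities):
    """Recover `length` symbols from any value inside the final interval."""
    intervals = build_intervals(probabilities)
    out = []
    for _ in range(length):
        for symbol, (sym_low, sym_high) in intervals.items():
            if sym_low <= value < sym_high:
                out.append(symbol)
                # Rescale so the chosen subinterval becomes [0, 1) again.
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(out)


# Hypothetical three-symbol source.
PROBS = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
low, high = encode("abac", PROBS)
```

The width of the final interval equals the product of the symbol probabilities, so likely messages end in wide intervals that take few bits to pin down.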

Some useful links:

Arithmetic Coding + Statistical Modeling = Data Compression

The Arithmetic Coding Page


Huffman Coding

Posted on November 8, 2004. Filed under: Computer Science |

In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol. It was developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper “A Method for the Construction of Minimum-Redundancy Codes”.
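
One common way to derive such a table is to merge the two least frequent subtrees repeatedly, using a priority queue. The Python sketch below does this with symbol counts taken from the text itself; the function names are my own, and codes are kept as strings of "0" and "1" for readability rather than packed into bits.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict:
    """Build a prefix-code table from the symbol frequencies in `text`."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(text: str, codes: dict) -> str:
    return "".join(codes[ch] for ch in text)


def decode(bits: str, codes: dict) -> str:
    reverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:       # prefix-free, so the first match is right
            out.append(reverse[current])
            current = ""
    return "".join(out)
```

Frequent symbols end up near the root of the tree and get short codes; rare ones get long codes.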

Lossless Compression Algorithms (Entropy Encoding)

Huffman Coding: A CS2 Assignment

The Huffman Compression Algorithm

Dynamic Huffman Coder
Canonical Huffman Coder Construction
C Library to search over compressed texts
Michael Dipperstein’s Huffman Code Page
libhuffman – Huffman encoder/decoder library
Compression and Encryption Sources
Huffman Coding Class

Adaptive Huffman coding modifies the table as characters are encoded, which allows the encoder to adapt to changing conditions in the input data. Adaptive decoders don’t need a copy of the table when decoding; they start with a fixed decoding table and update it as characters are read in.
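
The principle can be shown with a deliberately naive Python sketch: encoder and decoder start from the same fixed counts and rebuild the whole table after every symbol. This is not the FGK or Vitter algorithm, which update the tree incrementally and are far faster. The alphabet and all names here are hypothetical.

```python
import heapq

ALPHABET = "abcdr"  # hypothetical fixed alphabet known to both sides


def build_codes(counts):
    """Build a Huffman code table from a {symbol: count} mapping."""
    # Sorting makes tie-breaking identical in the encoder and the decoder.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def adaptive_encode(text):
    counts = {s: 1 for s in ALPHABET}    # fixed starting table
    bits = []
    for ch in text:
        bits.append(build_codes(counts)[ch])
        counts[ch] += 1                  # update only after emitting the code
    return "".join(bits)


def adaptive_decode(bits):
    counts = {s: 1 for s in ALPHABET}    # same starting table as the encoder
    out, pos = [], 0
    while pos < len(bits):
        reverse = {c: s for s, c in build_codes(counts).items()}
        current = ""
        while current not in reverse:
            current += bits[pos]
            pos += 1
        out.append(reverse[current])
        counts[out[-1]] += 1             # mirror the encoder's update
    return "".join(out)
```

Because both sides apply the same updates in the same order, no table is ever transmitted, and a symbol that keeps recurring is assigned shorter codes as the stream goes on.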

Design and Analysis of Dynamic Huffman Codes
Adaptive Huffman Encoding


Network Simulator (NS2)

Posted on April 3, 2003. Filed under: Computer Science |

Ns-2 is a widely used tool to simulate the behavior of wired and wireless networks. Useful general information can be found at


What is ad hoc wireless network?

Posted on February 16, 2003. Filed under: Computer Science |

Most installed wireless LANs today utilize “infrastructure” mode that requires the use of one or more access points. With this configuration, the access point provides an interface to a distribution system (e.g., Ethernet), which enables wireless users to utilize corporate servers and Internet applications.

As an optional feature, however, the 802.11 standard specifies “ad hoc” mode, which allows the radio network interface card (NIC) to operate in what the standard refers to as an independent basic service set (IBSS) network configuration. With an IBSS, there are no access points. User devices communicate directly with each other in a peer-to-peer manner.

In ad hoc networks, nodes do not start out familiar with the topology of their networks; instead, they have to discover it. The basic idea is that a new node may announce its presence and should listen for announcements broadcast by its neighbours. Each node learns about nodes nearby and how to reach them, and may announce that it, too, can reach them.
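
The announce-and-listen idea can be illustrated with a toy Python simulation. It is not the 802.11 IBSS mechanism or any specific routing protocol: each round, every node broadcasts the destinations it can reach, and each listener records the announcing neighbour as its next hop. The topology and names are hypothetical.

```python
def discover(links, rounds):
    """Simulate nodes learning routes from their neighbours' announcements.

    `links` maps each node to the set of nodes within radio range.
    Returns {node: {destination: next_hop}}.
    """
    # Initially each node knows only itself.
    tables = {node: {node: node} for node in links}
    for _ in range(rounds):
        # Every node broadcasts the destinations it can currently reach.
        announcements = {node: set(table) for node, table in tables.items()}
        for node, neighbours in links.items():
            for neighbour in neighbours:
                for dest in announcements[neighbour]:
                    # Reach a newly heard destination via the announcer.
                    tables[node].setdefault(dest, neighbour)
    return tables


# Hypothetical chain topology: A - B - C - D
CHAIN = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
routes = discover(CHAIN, rounds=3)
```

After one round A knows only B; it takes three rounds for news of D to travel down the chain, at which point A routes to D via B.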

Pros and cons to consider

Before making the decision to use ad hoc mode, you should consider the following:

  • Cost savings. Without the need to purchase or install access points, you’ll save a considerable amount of money when deploying ad hoc wireless LANs. Of course this makes the bean counters happy, but be sure you think about all of the pros and cons before making a final decision on which way to go.
  • Rapid setup time. Ad hoc mode only requires the installation of radio NICs in the user devices. As a result, setting up the wireless LAN takes much less time than installing an infrastructure wireless LAN. Obviously, this time saving only applies if the facility where you plan to support wireless LAN connectivity doesn’t already have a wireless LAN installed.
  • Better performance possible. The question of performance with ad hoc mode is certainly debatable. For example, performance can be higher with ad hoc mode because of no need for packets to travel through an access point. This assumes a relatively small number of users, however. If you have lots of users, then you’ll likely have better performance by using multiple access points to separate users onto non-overlapping channels to reduce medium access contention and collisions. Also because of a need for sleeping stations to wake up during each beacon interval, performance can be lower with ad hoc mode due to additional packet transmissions if you implement power management.
  • Limited network access. Because there is no distribution system with ad hoc wireless LANs, users don’t have effective access to the Internet and other wired network services. Of course you could set up a PC with a radio NIC and configure the PC with a shared connection to the Internet. This won’t satisfy a larger group of users very well, though. As a result, ad hoc is not a good way to go for larger enterprise wireless LANs where there’s a strong need to access applications and servers on a wired network.
  • Difficult network management. Network management becomes a headache with ad hoc networks because of the fluidity of the network topology and lack of a centralized device. Without an access point, network managers can’t easily monitor performance, perform security audits, etc. Effective network management with ad hoc wireless LANs requires network management at the user device level, which requires a significant amount of overhead packet transmission over the wireless LAN. This again leans ad hoc mode away from larger, enterprise wireless LAN applications.