Google Protocol Buffers and other data interchange formats

Posted on March 20, 2010. Filed under: Computer Science, Programming, Services | Tags: , , , , , |

We’ve been planning on moving to a new messaging protocol for a while. We’ve looked at a lot of different solutions but had enough issues with every proposed solution to date that we haven’t made a decision. JR Boyens pointed us to Google’s announcement Protocol Buffers: Google’s Data Interchange Format in July. Glanced at it but then it got lost in the everyday noise. Recent work on a project caused it to get more attention. I like what I see.

As part of a new offering we decided to add in our new messaging direction. We’re processing realtime voice conversations. Some of our major considerations are:

  1. Latency and Performance – Latency matters to us. A LOT. I’m including not only network transport but also memory and CPU. The total time it takes for a message to get from it’s native format in the sender to it’s native format in the receiver. We’re dealing with real time voice communications, too much latency and best case is the callers experience suffers. Our labor model is also sensitive to even small changes in latency. The smaller the latency the more efficient we are, the happier our client’s customers are and the more money we make. As greedy capitalist we see that as a good thing.
  2. Versioning – Our current system has no versioning. Yeah, short sighted on my part. We have to fix it so it’s required for any new message protocol. Protobuf fits our needs on this nicely. Different versions have to coexist and interoperate. We could do this on a different layer than the messaging but it makes sense to me to keep it at this level.
  3. Java and C++ – Language independence is cool and all but in practice if the protocol support Java and C++ we’re good to go. Maybe I’m being a bit myopic but my feeling is the likely hood that whatever we choose will expand to support more languages in the future is very high if it supports several today.
  4. Internal – We control the end points. I don’t care if the schema is external to the data package. In fact, for our use case that’s a plus. For any external services we’ll still expose those using the usual standards. Internally our applications will be using PB for their messaging format.
In short, we’re all about high volume low latency messages.

Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the “old” format.

Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers:

  • are simpler
  • are 3 to 10 times smaller
  • are 20 to 100 times faster
  • are less ambiguous
  • generate data access classes that are easier to use programmatically
Seems like a decent fit. OK, actually an awesome fit. One of our developers has been doing some testing. It’s impressive.



To me protobuf feels like compiled JSON. They are very similar.  The main difference being JSON sends data over the wire in text format verses protobuf’s binary format. The latter has the advantage of a smaller size and being faster for a computer to parse.


Why not ASN.1? Seems like one of the best choices. Well understood and widely used. Sure the full ASN.1 specification is complex but we’d only need a small subset. I’m still struggling with this one a bit. Tool support seems a bit better in protobuf and it’s definitely simpler.


Facebook’s Thrift is very similar to protobuf. Not surprising since the main author interned at Google. It’s a strong offering and recently became an Apache project. Nice stuff. Stuart Sierra has a nice comparison on his blog, Thrift vs. Protocol Buffers. Another worthy contender but not a big enough advantage to stop the internal momentum protobuf already has.


The HDF wiki has an entry Google Protocol Buffers and HDF5 that concludes:

In summary, Protocol Buffers and HDF5 were designed to serve different kinds of data intensive applications: a network based transient message system, and a high performance data storage system for very large datasets such as multi-dimensional images, respectively. That said, both 1) offer open source technologies that can reduce data management headaches for individual developers and projects, 2) increase the ability to share data through the use of well-defined binary formats and supporting libraries that run on a variety of platforms, and 3) provide the ability to access data stored with “older” versions of the data structures.

Different design goals. HDF5 doesn’t fit our needs as well.


The Hessian protocol has the following design goals:

  • It must not require external IDL or schema definitions, i.e. the protocol should be invisible to application code.
  • It must be language-independent.
  • It must be simple so it can be effectively tested and implemented.
  • It must be as fast as possible.
  • It must be as compact as possible.
  • It must support Unicode strings.
  • It must support 8-bit binary data (i.e. without encoding or using attachments.)
  • It must support encryption, compression, signature, and transaction context envelopes.

I still haven’t figured out how/if you can version your messages. Can you add and remove fields and still have compatibility (backwards and forwards)?  Cool effort, still feels very rough in places. Another worthy effort to consider.


At it’s core it’s

The Ice core library. Among many other features, the Ice core library manages all the communication tasks using a highly efficient protocol (including protocol compression and support for both TCP and UDP), provides a flexible thread pool for multi-threaded servers, and offers additional functionality that supports extreme scalability with potentially millions of Ice objects.

ICE is a comprehensive middleware system. It can even use PB as it’s messaging layer. It’s messaging layer doesn’t handle adding or removing fields as well as PB. We don’t need the RPC side of ICE. Just not a good fit for us.


Service Data Objects provides a rather ambitious messaging architecture. It’s concerns aren’t speed and efficiency. The SDO V2.1 White Paper states

SDO is intended to create a uniform data access layer that provides a data access solution for heterogeneous data sources in an easy-to-use manner that is amenable to tooling and frameworks.

Interesting, not a fit.

Cisco Etch

Primary focus is an RPC implementation, not a messaging protocol. Steve Vinoski summarized it nicely in Just What We Need: Another RPC Package. In fairness Steve had some negative thoughts on PB also in Protocol Buffers: Leaky RPC. However, his concerns are around the undefined RPC features Google put in PB, not the IDL type aspects of PB.

Some other XML based protocol

Yeah, I know the problem with XML isn’t XML it’s with the parsers. Cute argument. Getting my message from native format on one system to native format on another as fast as possible is what matters to me. So oddly enough parsers are part of the equation. Yeah, jaxb is fast but just how fast?

Remember, we’re all about high volume low latency messages. It’s not a focus for XML. Yep, no one will take issue with that statement!

Binary XML

Enough said. Next.


Well defined IDL, a bit complicated (because it addresses a wide range of issues).  Built in to the JDK! Not designed for speed or efficiency. Bad fit.


Several good choices. I’m sure there’s others I missed. We’re going with protobuf. Early tests by our developers have been very impressive. Google fan bois can rejoice and the Google haters gripe. In the meantime we’ve got a job to do.

Reference: (Original)

Read Full Post | Make a Comment ( 1 so far )

Liked it here?
Why not try sites on the blogroll...