With ever-increasing volumes of data being transferred across networks around the world, the data formats used to pass information around have been a heavily discussed topic. The choice usually comes down to three main factors: how compact the data is, how fast it is to encode and decode, and how easy it is to read and debug. In this post I am going to talk about some of the most commonly used message formats, their advantages and disadvantages (including my opinions), and link to some other blog posts with examples of using them with C++.
XML is made up of a hierarchy of tags and values. Each tag may also have its own attributes. XML is a very mature format, and as such it has a few well-defined schema languages that can be used to make sure XML documents conform to a certain structure, the most popular being XML Schema Definition (XSD). An example of an XSD is below; it is used in one of our examples, which generates code from the XSD to encode to and decode from XML.
An example of an XML message conforming to the above schema is:
Then we have non-human-readable messaging. These formats use a binary representation when going across the wire, so messages need to be encoded and decoded at either end. They are often loosely referred to as “protocol buffers”, although strictly speaking that is the name of Google’s format. The two most widely used are Google’s protobuf and msgpack. Their advantage is that the payloads you need to send are a lot smaller than with a human-readable format. In exchange you lose the ability to understand (without computer interaction) what is being sent across the wire. You also need to spend some time at each end encoding and decoding, but both libraries are heavily optimised and this should not take long. Both formats have bindings in many programming languages, allowing you to communicate between whatever systems you need. There seems to be a lot of debate currently about which of the two is faster; I suggest you profile both for your specific needs and decide which to use based on your own measurements and the fundamental differences between the two libraries.
Protobuf requires a schema to keep data consistent and, at least in C++, uses this schema to generate optimised code. This gives you methods, depending on the data types defined in the schema, to get and set any value you defined. You cannot add a value to a protobuf message that is not defined in the schema, nor one that has an incorrect type. An example of a schema taken from the protobuf documentation is:
This defines an address book which stores a list of people and their phone numbers, specifying what type the number is.
Protobuf Binary Format
Protobuf makes heavy use of varints, which are a way to represent integers using a varying number of bytes. The most significant bit of each byte is set to 1 if there are further bytes to come. The other 7 bits of each byte store the number in groups of 7 bits, least significant group first, i.e. you need to reverse the groups of 7 bits to read the value. The result is an unsigned number: a negative int32 or int64 is sign-extended to its 64-bit two’s complement form and so always takes ten bytes, while the sint32 and sint64 types use ZigZag encoding to keep small negative numbers short. Below is an example of working from the binary representation of a two-byte varint to a decimal number:
1011 1000 0011 1100
First take off the first bit from each byte.
011 1000 011 1100
Then reverse the order of the 7-bit groups, as the protocol defines that the least significant group comes first.
011 1100 011 1000
Then finally convert the binary number into decimal: 7736.
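The decoding steps above can be sketched in C++ roughly as follows. This is a minimal illustration, not protobuf's own implementation, and the function names `decode_varint` and `encode_varint` are made up for this post. It treats the varint as an unsigned 64-bit value; any signed interpretation is a separate step applied afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode one base-128 varint starting at buf[pos] (hypothetical helper).
// On success, stores the value in `out`, advances `pos` past the varint and
// returns true. Returns false if the input ends mid-varint or the varint is
// longer than ten bytes (the maximum for a 64-bit value).
bool decode_varint(const std::vector<std::uint8_t>& buf, std::size_t& pos,
                   std::uint64_t& out) {
    assert(pos <= buf.size());
    std::uint64_t value = 0;
    for (int shift = 0; shift < 64 && pos < buf.size(); shift += 7) {
        const std::uint8_t byte = buf[pos++];
        // The low 7 bits are payload; the least significant group comes first.
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        // A clear top bit marks the last byte of the varint.
        if ((byte & 0x80) == 0) {
            out = value;
            return true;
        }
    }
    return false;
}

// Encode an unsigned value as a varint (the inverse of decode_varint).
std::vector<std::uint8_t> encode_varint(std::uint64_t value) {
    std::vector<std::uint8_t> out;
    while (value >= 0x80) {
        // Emit the low 7 bits with the continuation bit set.
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
    return out;
}
```

Feeding `decode_varint` the two bytes from the example, 0xB8 0x3C, yields 7736, and `encode_varint(7736)` produces the same two bytes again.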
As with most messaging formats, a protocol buffer message is a series of key-value pairs. In the binary format the key is made up of the field’s number from the .proto file and a number defining the field type. Using the field number from the schema saves quite a lot of space, but it means that each end needs a .proto file defining that field. The field type is needed so that protobuf can work out the length of the value and skip over any field it does not recognise, i.e. to allow backwards compatibility of .proto files.
Each key in the binary message is a varint with the value:
(field_number << 3) | type
i.e. the lowest three bits of the key specify the type. These type values are hard coded and defined in the protobuf documentation.
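As a small sketch of the key formula in C++: the `WireType` values below are the ones listed in the protobuf encoding documentation, while the names `FieldKey`, `make_key` and `split_key` are hypothetical helpers invented for this post, not part of the protobuf library.

```cpp
#include <cassert>
#include <cstdint>

// Wire types as listed in the protobuf encoding documentation.
enum class WireType : std::uint8_t {
    Varint = 0,
    Fixed64 = 1,
    LengthDelimited = 2,
    StartGroup = 3,  // deprecated
    EndGroup = 4,    // deprecated
    Fixed32 = 5
};

struct FieldKey {
    std::uint32_t field_number;
    WireType type;
};

// Build the key value: (field_number << 3) | type.
// The result is what then gets written to the wire as a varint.
inline std::uint64_t make_key(std::uint32_t field_number, WireType type) {
    // Protobuf field numbers run from 1 to 2^29 - 1.
    assert(field_number >= 1 && field_number < (1u << 29));
    return (static_cast<std::uint64_t>(field_number) << 3) |
           static_cast<std::uint64_t>(type);
}

// Split a decoded key back into its field number and wire type.
inline FieldKey split_key(std::uint64_t key) {
    return FieldKey{static_cast<std::uint32_t>(key >> 3),
                    static_cast<WireType>(key & 0x7)};
}
```

For instance, field 1 stored as a varint gets the key 0x08, and field 2 stored as a length-delimited value gets the key 0x12.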
The possible value types are pretty much just a combination of varints, fixed-length values and length-delimited values. With length-delimited values (string, bytes, etc.) the key is followed by a varint giving the length of the value in bytes, and then the value itself.
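To make the length-delimited layout concrete, here is a rough C++ sketch that reads one such field from a buffer: key, then length, then payload. The helpers `read_varint` and `read_length_delimited` are hypothetical names for this post, and the code only handles wire type 2; it is an illustration under those assumptions rather than how the protobuf library is written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Minimal unsigned varint reader (hypothetical helper for this sketch).
bool read_varint(const std::vector<std::uint8_t>& buf, std::size_t& pos,
                 std::uint64_t& out) {
    std::uint64_t value = 0;
    for (int shift = 0; shift < 64 && pos < buf.size(); shift += 7) {
        const std::uint8_t byte = buf[pos++];
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) {
            out = value;
            return true;
        }
    }
    return false;
}

// Read one length-delimited field: key varint, length varint, then the bytes.
// Returns false if the field is not length-delimited or the buffer is too short.
bool read_length_delimited(const std::vector<std::uint8_t>& buf, std::size_t& pos,
                           std::uint32_t& field_number, std::string& payload) {
    assert(pos <= buf.size());
    std::uint64_t key = 0;
    std::uint64_t length = 0;
    if (!read_varint(buf, pos, key)) return false;
    if ((key & 0x7) != 2) return false;  // wire type 2 = length-delimited
    if (!read_varint(buf, pos, length)) return false;
    if (length > buf.size() - pos) return false;  // payload would run off the end
    field_number = static_cast<std::uint32_t>(key >> 3);
    const auto first = buf.begin() + static_cast<std::ptrdiff_t>(pos);
    payload.assign(first, first + static_cast<std::ptrdiff_t>(length));
    pos += static_cast<std::size_t>(length);
    return true;
}
```

Given the bytes 0x12 0x07 followed by the ASCII text "testing", this reads field number 2 with a 7-byte payload. A reader that does not know field 2 could use the same length to skip straight past it.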
Msgpack is very dynamic and bases itself on JSON, except that at either end it encodes to and decodes from a binary protocol. Even when you print a message from inside your code, it is displayed as if it were encoded in JSON. Since it allows any data to be added and removed, it generally uses the language’s own containers (lists and maps) as the decoded format. The problem with using this library from statically typed languages is that it requires hacks and quite a lot of boilerplate code to get around the static limitations that appear when the message structure becomes more complicated.
Msgpack Binary Format
Msgpack contains all type information inside the binary message, so a message can always be decoded without a schema, which helps with backwards compatibility. Each value is stored in a type-data or type-length-data style: quite a few well-defined types have a hard-coded fixed length, while others, such as raw bytes and containers, carry the length of their values.
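The two layouts can be illustrated with a hand-rolled C++ sketch that packs unsigned integers (type-data) and short strings (type-length-data). The type bytes are taken from the MessagePack specification, but `pack_uint` and `pack_str` are hypothetical helpers written for this post and are not the msgpack C++ library's API; strings longer than 255 bytes are deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// "type-data": an unsigned integer gets the smallest encoding that fits.
void pack_uint(Bytes& out, std::uint32_t v) {
    if (v <= 0x7F) {
        // positive fixint: the type byte is the value itself
        out.push_back(static_cast<std::uint8_t>(v));
        return;
    }
    const int width = (v <= 0xFF) ? 1 : (v <= 0xFFFF) ? 2 : 4;
    // uint 8 = 0xCC, uint 16 = 0xCD, uint 32 = 0xCE
    out.push_back(static_cast<std::uint8_t>(width == 1 ? 0xCC : width == 2 ? 0xCD : 0xCE));
    for (int i = width - 1; i >= 0; --i) {
        // the payload is big-endian
        out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
    }
}

// "type-length-data": a string carries its length before its bytes.
void pack_str(Bytes& out, const std::string& s) {
    assert(s.size() <= 0xFF && "str 16 and str 32 are left out of this sketch");
    if (s.size() <= 31) {
        // fixstr: the length lives in the low 5 bits of the type byte
        out.push_back(static_cast<std::uint8_t>(0xA0 | s.size()));
    } else {
        out.push_back(0xD9);  // str 8: one length byte follows
        out.push_back(static_cast<std::uint8_t>(s.size()));
    }
    out.insert(out.end(), s.begin(), s.end());
}
```

Packing the number 200 gives the two bytes 0xCC 0xC8, and packing the string "hi" gives 0xA2 followed by the two characters. Because every value starts with its type byte, a receiver can walk the message without any schema.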
I think that, with modern responsive websites mostly using RESTful web APIs, JSON is here to stay and XML will slowly fade away. Even XML configuration files (a very popular use of XML) are slowly moving to JSON equivalents.
Which binary representation to use depends heavily on the particular use case. But as my examples in C++ show, I strongly feel that using msgpack in a statically typed language greatly reduces the appeal of msgpack’s dynamic features.