It is widely known that JSON is a fat-free alternative to XML. However, as with any alternative, there is a price to pay. Most people don’t even consider the limitations of the JSON format when they design their JSON-based protocols. The limitations exist nonetheless, and they usually surface later, when there is a need to scale out.
First, I want to quickly summarize this article: JSON vs XML: confessions of a JSON advocate.
1. Absence of comments. I really can’t comment on this.
2. Absence of attributes. In short, there are always some properties you want to read first. XML ensures that these properties are available early by putting them in attributes. No such luck in JSON.
3. Unescaped linefeeds are not allowed in JSON strings.
4. No ordering except in arrays. See my comment on #2, but this one is not about attributes but rather about xs:sequence.
As I said earlier, most web service writers and consumers don’t even notice these limitations. If you send out a stock price ticker in JSON, there’s no need for comments, attributes or sequential containers; you just parse the entire thing into a DOM-like structure and use it.
However, there are people who are forced to overcome such problems. In this post, I’ll try to show the best JSON data container structure that can help most people writing JSON-based protocols for web data exchange.
Let’s start with an example of the most common issue – large data and important metadata. My example is a spatial data container. It has a seemingly problem-free syntax: http://help.arcgis.com/en/arcgisserver/10.0/apis/rest/fsquery.html
{
"importantProperty" : "<importantPropertyValue>",
"features" : [
<feature1>, <feature2>
]
}
This REST operation returns a JSON featureset, which is a simple JSON object. The bulk of the data in this object is the ‘features’ array, which contains (obviously) features. All properties before ‘features’ are metadata. These properties are needed to parse the ‘features’ array. The ‘features’ array usually takes 99.9% of the total size of the container.
Now.. what if your response is big? Or even huge?
The most obvious solution is to amend the protocol rules and introduce a limit on the number of features. But there are cases where a feature limit cannot be used. For the sake of an example, let’s suppose that a pure, innocent soul decided to use this operation to replicate an entire database and turned the feature limit off. Or worse, the protocol has no continuation clause, which means that search-engine-style pagination is not available out of the box. (Yes, there are tricks to emulate pagination by requesting all feature IDs first… but I am not talking about these right now.)
While it is possible to read the entire featureset into memory, it makes much more sense to use a sequential reader. SAX, not DOM.
Now that I have said the word SAX, I should apologize. I didn’t really mean SAX; I meant a simple sequential one-pass reader, like Microsoft .NET’s XmlReader. Sequential reading has the simple advantage of not keeping the entire DOM-like data tree in memory. But it requires a sequential data container. Oh xs:sequence, where art thou when we need thee? Microsoft had to get around this even for XML: they introduced XmlBookmarkReader. Basically, it allows you to reset the reader state to any arbitrary position. It’s not cost-free, because every time you go back, you waste a pass. While skipping tokens in a string is less costly in terms of memory (vs the DOM model), it still takes CPU cycles to perform.
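To make the problem concrete, here is a minimal Python sketch (Python is my choice for illustration; the payloads and the "meters" value are made up). It shows that two texts with different member order are the same JSON object, so nothing stops a producer from emitting ‘features’ before the metadata a one-pass reader needs.

```python
import json

# Hypothetical payloads: the same featureset, members in a different order.
METADATA_FIRST = '{"importantProperty": "meters", "features": [[0, 0], [3, 4]]}'
METADATA_LAST = '{"features": [[0, 0], [3, 4]], "importantProperty": "meters"}'


def member_order(text):
    """Return top-level member names in the order they appear in the text."""
    # json.loads builds a dict, and dicts keep insertion order (Python 3.7+),
    # so the key order mirrors what a sequential reader would encounter.
    return list(json.loads(text))


# Both texts describe the same object as far as JSON is concerned...
same_object = json.loads(METADATA_FIRST) == json.loads(METADATA_LAST)
print(same_object)                   # True

# ...but a sequential reader meets the members in very different orders.
print(member_order(METADATA_FIRST))  # ['importantProperty', 'features']
print(member_order(METADATA_LAST))   # ['features', 'importantProperty']
```

With the second payload, a one-pass reader has to either buffer every feature or go back for a second pass once it finally reaches importantProperty.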
Can we avoid double-pass parsing? Can we use JSON syntax to create a container that provides sequential access to the data?
Yes, JSON actually provides an anonymous sequential container: the JSON array.
The updated data container syntax will look like this:
[
{
"importantProperty" : "<importantPropertyValue>"
},
[ //anonymous array that used to be ‘features’.
<feature1>, <feature2>
]
]
See? We enforced metadata to come before data without violating JSON syntax rules.
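Here is a rough Python sketch of how a reader could consume such a container in one pass. The function name read_container and the sample payload are hypothetical, and for brevity the code walks an in-memory string; a real reader would pull from a stream, but the order of operations would be the same. It relies on json.JSONDecoder.raw_decode, which decodes one value at a given offset and reports where that value ended.

```python
import json

_decoder = json.JSONDecoder()
_WS = " \t\r\n"


def _skip(text, pos, extra=""):
    """Advance past whitespace (and any characters listed in `extra`)."""
    while pos < len(text) and text[pos] in _WS + extra:
        pos += 1
    return pos


def read_container(text):
    """Return (metadata, feature_iterator) for '[ {metadata}, [features...] ]'."""
    pos = _skip(text, 0)
    if text[pos] != "[":
        raise ValueError("container must be a JSON array")
    # Element 0: the metadata object, decoded before any feature is touched.
    metadata, pos = _decoder.raw_decode(text, _skip(text, pos + 1))
    # Lenient separator handling: commas are skipped rather than validated.
    pos = _skip(text, pos, ",")
    if text[pos] != "[":
        raise ValueError("second element must be the feature array")

    def features(pos=pos + 1):
        # Decode one feature per iteration; the position only moves forward.
        while True:
            pos = _skip(text, pos, ",")
            if text[pos] == "]":
                return
            feature, pos = _decoder.raw_decode(text, pos)
            yield feature

    return metadata, features()


doc = '[ {"importantProperty": "meters"}, [ {"id": 1}, {"id": 2} ] ]'
metadata, feats = read_container(doc)
print(metadata["importantProperty"])  # meters
for feature in feats:
    print(feature["id"])              # 1, then 2
```

The metadata is in hand before the first feature is decoded, and each feature can be processed and dropped before the next one is read.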
Now there’s a bigger problem: types.
Unless you are a JSON novice like me, you have probably noticed that most serializers that work with JSON introduce a special __type property that holds a type hint for the deserializer. (Examples: OpenLayers, Microsoft WCF JSON deserializer.) Apparently, duck typing is not always an option when you want to (painlessly) introduce automatic serialization and code generation. And versioning support for good measure.
But where do you insert __type in arrays? It only works for objects.
A naive solution would be to wrap the sequential container in an object with a __type property. However, the __type property could very well come last. We are back to square one.
When in doubt, look at the giants. How did Microsoft get around this? By introducing the DataContract attribute language, which is essentially a generalized XML-like schema. In essence, you provide a data contract and expect the server to conform to it. The typical workflow is to make a first request to the web service asking for the proper version, then use the matching DataContract-attributed value type (class) to deserialize the service response.
Aha! This is DOM-like model again! Looks like there’s no way around it…
Unless we introduce a type-and-version string as the very first mandatory element of the array (use null for the default (lowest) version).
Sequential data container, version 3:
["sequential.data.container.3", //type and version
{
"importantProperty" : "<importantPropertyValue>"
},
[ //anonymous array that holds the rest of the data.
//Use importantProperty and type information to read it properly
<feature1>, <feature2>
]
]
Yes, this looks ugly, but this is the best we can do without violating JSON syntax.
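The producing side can be sequential too. Below is a minimal Python sketch of a writer for this layout; write_container and the sample values are hypothetical names of mine, and the output is plain JSON without the explanatory comments used in the listing. It emits the type string, then the metadata, then the features one at a time, so the writer never has to hold the whole feature list in memory either.

```python
import json

CONTAINER_TYPE = "sequential.data.container.3"  # type string from the listing


def write_container(metadata, features, container_type=CONTAINER_TYPE):
    """Yield the container as text chunks: type, metadata, then features."""
    # json.dumps(None) produces "null", the marker for the default version.
    yield "[" + json.dumps(container_type) + ","
    yield json.dumps(metadata) + ",["
    first = True
    for feature in features:  # any iterable works, e.g. a database cursor
        yield ("" if first else ",") + json.dumps(feature)
        first = False
    yield "]]"


# Hypothetical usage: three generated features, joined only for the demo.
chunks = write_container({"importantProperty": "meters"},
                         ({"id": i} for i in range(3)))
text = "".join(chunks)
print(json.loads(text)[0])  # sequential.data.container.3
```

In a real service each chunk would go straight to the response stream instead of being joined into one string.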
Now let’s remember why we did all of this. We actually wanted a data structure that we can read and deserialize sequentially in one pass without introducing either intermediate storage (DOM-like model: read stream into an intermediate storage, then read intermediate storage in a specific order) or bookmarks (read stream once to read important parts, read stream once to read the rest).
Thankfully, most of the objects that people work with do not require such exotic treatment because their size is pretty much fixed.
Still, I am not sure if the end result was worth it. Basically, we just got rid of the property names, and the result looks confusing. Well, if not confusing, then it certainly is not self-describing like XML and (most of the time) JSON are.
To actually solve this problem we need to look higher, beyond the (DOM-like) tree we just tried to climb, and see the forest behind it. The answer is simple: JSON and XML are designed not for efficient bulk data transfer but for interoperability. If you want to transfer large amounts of data, you should really use a binary protocol. Binary data transmission is a well-researched area (think video compression and streaming), and you shouldn’t have any problems choosing the right protocol for you. We can look into this in the next article, if there is a need.
Exercise for the reader: name a couple of reasons why the sequential container designed in this article is bad.
Also, feel free to tell me where I was wrong. I am usually wrong, so there will be no hurt feelings.