Introducing TJSON, a stricter, typed form of JSON
NOTE: TJSON syntax has been revised since this post was originally published. Please visit https://www.tjson.org/ for the latest syntax.
I’d like to announce a project I’ve been working on with Ben Laurie called TJSON (Tagged JSON).
TJSON is syntax-compatible with JSON, but adds mandatory type annotations. Its primary intended use is in cryptographic authentication contexts, particularly ones where JSON is used as a human-friendly alternative representation of data in a system which otherwise works natively in a binary format.
Before I go further describing TJSON, I’d like to give some background.
JSON is a bit of a mess. You may have seen Parsing JSON is a Minefield recently, which did a fantastic job of illustrating that while JSON’s “simplicity is a virtue” approach led to widespread adoption, underspecification has led to a proliferation of interoperability problems and ambiguities. From a strictly software engineering perspective these ambiguities can lead to annoying bugs and reliability problems, but in a security context such as JOSE they can be fodder for attackers to exploit. It really feels like JSON could use a well-defined “strict mode”.
JSON’s problems don’t end there: a lack of commonly used types has led to a litany of non-interoperable hacks to jam various non-string data types into JSON strings. With no native binary type, storing such data in JSON typically involves encoding it in a format like hex or Base64, and having ad hoc serialization logic that knows which fields should be encoded or decoded this way. The lack of full-precision 64-bit integers is also an issue: when Twitter moved to its new Snowflake ID generation algorithm, it had to add a new id_str field, which serializes 64-bit integers as strings, to replace the old status id field and avoid integers being silently munged by JSON parsers.
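The munging is easy to reproduce. Any parser that maps JSON numbers onto IEEE-754 doubles (as JavaScript’s JSON.parse does) silently corrupts integers beyond 2^53; here is a small sketch that mimics that behavior in Python by forcing float parsing (the ID value is made up for illustration):

```python
import json

snowflake_id = 1234567890123456789  # a 64-bit ID, well beyond 2**53

# A parser that maps every number to an IEEE-754 double (as JavaScript's
# JSON.parse does) silently loses the low-order digits:
munged = json.loads('{"id": %d}' % snowflake_id, parse_int=float)
assert int(munged["id"]) != snowflake_id  # precision silently lost

# Serializing the ID as a string (the id_str workaround) survives intact:
safe = json.loads('{"id_str": "%d"}' % snowflake_id)
assert int(safe["id_str"]) == snowflake_id
```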
Jamming encoded data into JSON strings this way is problematic: the format loses its self-describing nature, and custom logic must be added to serialize and deserialize each such field into its intended type. This is a perpetual problem when using JSON in any sort of security context, because cryptography operates exclusively on blobs of binary data, so anyone trying to encode such blobs as JSON has to go through an artisanal selection process of choosing which particular binary encoding format to use for each field.
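To make the problem concrete, here is a sketch of the kind of ad hoc glue this forces (the field names and encodings are hypothetical): nothing in the document itself says which fields are Base64url and which are hex, so every consumer must hard-code that out-of-band knowledge:

```python
import base64
import json

# The document gives no hint that "signature" is Base64url while "nonce"
# is hex; that knowledge lives out-of-band, in every consumer's code:
doc = json.loads('{"signature": "aGVsbG8", "nonce": "deadbeef"}')

# Ad hoc, per-application serialization knowledge:
BINARY_FIELDS = {
    "signature": lambda s: base64.urlsafe_b64decode(s + "=" * (-len(s) % 4)),
    "nonce": bytes.fromhex,
}

decoded = {k: BINARY_FIELDS.get(k, lambda v: v)(v) for k, v in doc.items()}
```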
Extending JSON with tags
REMINDER: This syntax is obsolete! Please visit https://www.tjson.org/ for the latest syntax.
TJSON specifies a syntax for annotating JSON strings with a tag indicating a data type, which is mandatory for all TJSON strings.
TJSON’s only additions to JSON, beyond imposing a set of strictness requirements, are adding the following tags:
b16: binary data serialized as hexadecimal
b32: binary data serialized as Base32
b64: binary data serialized as Base64url
i: signed integer (base 10, 64-bit range)
u: unsigned integer (base 10, 64-bit range)
That’s it! An example TJSON document looks like this:
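(The example document from the original post has not survived here; the following is a reconstruction in the spirit of the draft tags listed above, with made-up field names and values — consult tjson.org for the current syntax. The minimal Python sketch below parses the tagged string values:)

```python
import base64
import json

# Hypothetical document using the (now-obsolete) draft tag syntax:
# every string value carries a "tag:" prefix naming its type.
TJSON_DOC = """{
  "message": "b64:aGVsbG8sIHdvcmxkIQ",
  "nonce": "b16:deadbeef",
  "id": "u:1234567890123456789",
  "offset": "i:-42"
}"""

def decode_value(s):
    tag, _, rest = s.partition(":")
    if tag == "b16":
        return bytes.fromhex(rest)
    if tag == "b32":
        return base64.b32decode(rest.upper() + "=" * (-len(rest) % 8))
    if tag == "b64":
        return base64.urlsafe_b64decode(rest + "=" * (-len(rest) % 4))
    if tag in ("i", "u"):
        n = int(rest)
        # Enforce the 64-bit ranges the tags promise.
        lo, hi = (-2**63, 2**63 - 1) if tag == "i" else (0, 2**64 - 1)
        if not lo <= n <= hi:
            raise ValueError("out of range for tag %r" % tag)
        return n
    raise ValueError("unknown tag: %r" % tag)

doc = {k: decode_value(v) for k, v in json.loads(TJSON_DOC).items()}
```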
The TJSON specification describes the format in considerably more detail, along with additional strictness requirements. There’s also a set of test cases covering all types and strictness requirements which can be used to ensure interoperability between implementations and that strictness requirements are being honored.
The format is still flagged as being in a draft state pending the resolution of a number of open issues, but hopefully it won’t see substantial changes and is ready to implement in its current form in any language you would like to use it in.
What good are these type annotations for? Well, besides simply letting us express a richer set of types as serialized in JSON documents, they also help enable a powerful way of authenticating data which can work across a variety of serialization formats.
Content-Aware Hashing
In cryptographic use cases we may also want to compute a content hash of some data to authenticate it. One approach to this is serializing JSON in some sort of “canonical” manner. Unfortunately there is no canonical JSON canonicalization format (i.e. in the form of an RFC we can point to), only a fractured ecosystem of half-hearted attempts, so we’re stuck with the xkcd standards problem there. But beyond that, canonicalization is limiting: it encodes aspects of JSON’s structure into the content hash, so we wind up with a content hash that only identifies a JSON encoding of a particular object/resource, not the underlying data fields of the object itself. I feel there are many cases where it would be useful to reuse content hashes and digital signatures between JSON and some other binary format.
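The fragility is easy to see: two byte-wise different but semantically identical encodings hash differently, so a hash over the serialized bytes identifies an encoding, not the underlying data:

```python
import hashlib
import json

a = '{"a": 1, "b": 2}'
b = '{"b":2,"a":1}'  # same object, different bytes

# Semantically identical...
assert json.loads(a) == json.loads(b)

# ...but the serialized-bytes hashes disagree:
assert hashlib.sha256(a.encode()).digest() != hashlib.sha256(b.encode()).digest()
```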
One of these cases is JSON Web Tokens (JWT). In a browser environment, it may be nicer to work with JSON than a binary format. However, in embedded environments, a compact binary representation is preferable to JSON. There exists a binary analogue of JWT called CWT which is based on the Compact Binary Object Representation (CBOR) standard. Unfortunately you can’t convert JWTs to CWTs without the original issuer re-issuing them and re-signing them in the new format. (This isn’t a problem with a more advanced token format called Macaroons, but that’s a matter for another blog post.)
This is a pervasive problem for anyone who would like to store authenticated/signed data natively in a binary format (e.g. Protobufs, Thrift, capnp, MessagePack, BSON, or CBOR), but also permit clients to work natively with a JSON API without necessarily being aware of a full and evolving schema. Perhaps a client just wants to check the latest state of a particular object, cryptographically authenticate it, extract one or two fields, and move on. Binary formats may offer a canonical JSON representation, but in the absence of a “content aware” hashing scheme, authenticating JSON that has been serialized from e.g. protos requires effectively converting the JSON back into the binary format, which can only be done with knowledge of the schema. This works against the grain of JSON’s schema-free, self-describing nature.
For added bonus points, a typed “content aware” hashing scheme can be used to authenticate any subset of an object graph against a digital signature without having to publish its entire contents. This can be done by redacting portions of the original object graph and replacing them with a placeholder type for redacted data that contains the original content hash. It’s a feature I’m sure you wish were available in git if you’ve ever dealt with someone checking in a large binary. TJSON does not yet specify this, but it’s likely the r: tag will be used to facilitate storing content hashes of redacted portions of documents in place of their original contents.
To support a secure “content-aware” hashing scheme that works on both binary encodings and a JSON-based format, we need a way to embed type information into JSON which is as rich as what’s available in the binary formats.
TJSON is designed to work with objecthash, a content-aware hashing format designed by Ben Laurie.
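To give a flavor of the idea, here is a simplified illustration of such a scheme — this is in the spirit of objecthash, but the type tags and encodings below are my own; the real algorithm is defined by its spec. Each value is hashed together with a type tag, dict hashes are built from sorted (key-hash, value-hash) pairs, and a redacted subtree can stand in via its precomputed hash:

```python
import hashlib

class Redacted:
    """Placeholder carrying the content hash of a redacted subtree."""
    def __init__(self, digest):
        self.digest = digest

def h(tag, payload):
    return hashlib.sha256(tag + payload).digest()

def content_hash(value):
    # Simplified, illustrative scheme (not objecthash's exact tags):
    # the type participates in the hash, so "42" and 42 hash differently.
    if isinstance(value, Redacted):
        return value.digest  # accept the precomputed subtree hash
    if isinstance(value, bytes):
        return h(b"b", value)
    if isinstance(value, str):
        return h(b"u", value.encode("utf-8"))
    if isinstance(value, int):
        return h(b"i", str(value).encode())
    if isinstance(value, list):
        return h(b"l", b"".join(content_hash(v) for v in value))
    if isinstance(value, dict):
        pairs = sorted(content_hash(k) + content_hash(v)
                       for k, v in value.items())
        return h(b"d", b"".join(pairs))
    raise TypeError("unsupported type: %r" % type(value))

doc = {"user": "alice", "key": b"\xde\xad\xbe\xef", "logins": [1, 2, 3]}
root = content_hash(doc)

# Redact the secret key: publish only its hash, and the root still verifies
# without the verifier ever seeing the redacted bytes.
redacted = dict(doc, key=Redacted(content_hash(doc["key"])))
assert content_hash(redacted) == root
```

Because the hash is computed over types and values rather than serialized bytes, a binary encoding and a JSON encoding of the same object can yield the same root hash, which is exactly the property the content-aware approach is after.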