~cryo/cj - A minimal token based JSON parser. - sourcehut git

CJ is a minimal, low-level, token-based JSON parser.

#Features

  • Portable ANSI C code (C89) under BSD license.
  • No external dependencies, no use of libc or any headers.
  • No dynamic memory allocations or usage of malloc and free.
  • Small footprint of 3.5 kilobytes, suitable for microcontrollers.
  • Easy to embed by adding cj.c and cj.h to your project.
  • Low-level, token-based interface to access the JSON structure.
  • The input JSON data is checked to be valid UTF-8 and compliant with RFC 4627.

The basic API is deliberately minimal: all values are treated as strings. The application is responsible for converting these strings to unescaped strings, numbers and booleans. To help with that, extra utility functions are provided, which you may use as is or take as inspiration for your own.
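
Because every value arrives as a string, a numeric token has to be converted by the application. The following is a minimal sketch of such a conversion without libc calls. The name demo_slice_to_long is hypothetical and this is not CJ's own cj_ref_to_long implementation; it only assumes the token text is available as a pointer plus length, and it uses the freestanding header limits.h for the range check.

```c
#include <limits.h> /* LONG_MAX, LONG_MIN: available even in freestanding C */

/* Hypothetical helper (not part of CJ): converts a decimal integer slice
 * such as "-42" to a long. Returns 1 on success, 0 on error.
 * Assumes a two's complement long. */
int demo_slice_to_long(const char *str, unsigned long len, long *result)
{
    unsigned long i = 0;
    unsigned long acc = 0;
    unsigned long limit;
    int negative = 0;

    if (len > 0 && str[0] == '-') {
        negative = 1;
        i = 1;
    }
    if (i == len)
        return 0; /* empty input or a lone minus sign */

    /* Largest magnitude that fits: LONG_MAX, or LONG_MAX + 1 when negative. */
    limit = negative ? (unsigned long)LONG_MAX + 1UL : (unsigned long)LONG_MAX;

    for (; i < len; i++) {
        unsigned long digit;

        if (str[i] < '0' || str[i] > '9')
            return 0; /* fractions and exponents are not plain integers */
        digit = (unsigned long)(str[i] - '0');
        if (acc > (limit - digit) / 10UL)
            return 0; /* the next digit would overflow a long */
        acc = acc * 10UL + digit;
    }

    if (negative)
        *result = (acc == limit) ? LONG_MIN : -(long)acc;
    else
        *result = (long)acc;
    return 1;
}
```

Called on the slice -42 with length 3 this stores -42 and returns 1, while a slice such as 1.5 is rejected so the caller can fall back to a floating point conversion.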

#API documentation

#Basic interface

This interface is the core of CJ, provided by the cj.c and cj.h files. The cj.h header documents all public functions in more detail.

/** Initialize the parse context. */
void cj_parse_init(cj_ctx *ctx, const char *json, cj_size len,
                   cj_token *tokens, cj_size tokens_size);
/** Parses the formerly initialized context. */
void cj_parse(cj_ctx *ctx);
/** Get the token reference of a key in an object. */
cj_token_ref cj_value_ref(cj_ctx *ctx, cj_token_ref obj, const char *key);
/** Copy the value of an object key as string into a buffer. */
int cj_copy_value(cj_ctx *ctx, char *buf, unsigned size, cj_token_ref obj, const char *key);
/** Copy the value of a token as string into a buffer. */
int cj_copy_ref(cj_ctx *ctx, char *buf, unsigned size, cj_token_ref ref);

#Support for JSON data greater than 4 GB

The cj_size type is an unsigned 32-bit type by default, suitable for JSON data up to 4 GB. For larger data sets it can be changed to 64-bit at compile time by defining CJ_USE_64BIT_SIZE_T. Note that in 64-bit mode the memory required for tokens doubles.

gcc -DCJ_USE_64BIT_SIZE_T my_app.c cj.c
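
To get a feel for the memory cost, the sketch below compares two hypothetical token layouts that mirror the cj_token struct with 32-bit and 64-bit fields. The names demo_token32, demo_token64 and demo_tokens_bytes are illustrative only. It assumes that the token reference type has the same width as cj_size and that the type field is stored as an int, neither of which is confirmed here, and it uses stdint.h (C99) purely for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of a token with 32-bit size fields. */
typedef struct demo_token32 {
    int type;
    uint32_t pos;
    uint32_t len;
    uint32_t parent; /* assumption: references are as wide as the size type */
} demo_token32;

/* Hypothetical mirror of a token with 64-bit size fields. */
typedef struct demo_token64 {
    int type;
    uint64_t pos;
    uint64_t len;
    uint64_t parent;
} demo_token64;

/* Bytes needed for a token array of the given length. */
size_t demo_tokens_bytes(size_t count, int use_64bit)
{
    return count * (use_64bit ? sizeof(demo_token64) : sizeof(demo_token32));
}
```

On a typical 64-bit desktop ABI these two structs would occupy 16 and 32 bytes, which matches the doubling described above; padding rules may differ on other targets.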

#Extra utility functions

The extra/ directory contains optional code to parse doubles, booleans, null and escaped \uXXXX Unicode sequences. These functions can be used as is or taken as inspiration for your own. For usage examples, refer to the tests/ directory.

/** Copy the value of a token as unescaped UTF-8 string into a buffer. */
int cj_copy_ref_utf8(cj_ctx *ctx, char *buf, unsigned size, cj_token_ref ref);
/** Converts a JSON token reference to boolean. */
int cj_ref_to_boolean(cj_ctx *ctx, int *result, cj_token_ref ref);
/** Converts a JSON token reference to double. */
int cj_ref_to_double(cj_ctx *ctx, double *result, cj_token_ref ref);
/** Converts a JSON token reference to signed long. */
int cj_ref_to_long(cj_ctx *ctx, long *result, cj_token_ref ref);
/** Tests if JSON token reference is null. */
int cj_ref_to_null(cj_ctx *ctx, cj_token_ref ref);

Like the parser itself, all extra functions are implemented without libc or any system headers.

#Unicode support

With the basic interface, CJ only validates that the input JSON data is UTF-8 encoded. The extra/cj_copy_ref_utf8.c file contains embeddable code to unescape Unicode sequences like "\uXXXX", including surrogate pairs, to UTF-8. Refer to tests/05_extra_parse_utf8.c for an example.
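
For readers unfamiliar with surrogate pairs, here is a small sketch of the two steps involved in unescaping: combining a pair of \uXXXX values into one code point and encoding that code point as UTF-8. The functions demo_combine_surrogates and demo_encode_utf8 are hypothetical names written for this illustration; they are not taken from extra/cj_copy_ref_utf8.c and may differ from how CJ does it.

```c
/* Combine a UTF-16 surrogate pair into a code point.
 * Returns 0 if the values do not form a valid high/low pair. */
unsigned long demo_combine_surrogates(unsigned hi, unsigned lo)
{
    if (hi < 0xD800u || hi > 0xDBFFu || lo < 0xDC00u || lo > 0xDFFFu)
        return 0;
    return 0x10000UL
         + ((unsigned long)(hi - 0xD800u) << 10)
         + (unsigned long)(lo - 0xDC00u);
}

/* Encode a code point as UTF-8 into buf.
 * Returns the number of bytes written (1 to 4), or 0 if the code point
 * is invalid or the buffer is too small. No terminator is written. */
int demo_encode_utf8(unsigned long cp, char *buf, unsigned size)
{
    if (cp < 0x80UL) {
        if (size < 1) return 0;
        buf[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        if (size < 2) return 0;
        buf[0] = (char)(0xC0 | (cp >> 6));
        buf[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0; /* lone surrogate */
        if (size < 3) return 0;
        buf[0] = (char)(0xE0 | (cp >> 12));
        buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {
        if (size < 4) return 0;
        buf[0] = (char)(0xF0 | (cp >> 18));
        buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0; /* beyond the Unicode range */
}
```

As a worked case, the escaped pair \uD83D\uDE00 combines to U+1F600, which encodes to the four bytes F0 9F 98 80.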

#Minimal example

More examples can be found in the tests/ directory.

/* tests/01_object_keys.c */
#include <stdio.h>
#include <string.h>

#include "../cj.h"
#include "../cj.c"

#define MAX_TOKENS   8

int main(int argc, char **argv)
{
	cj_ctx cj;
	cj_token_ref parent_ref;
	cj_token tokens[MAX_TOKENS];
	char val[32];
	const char *json = "{\"hello\": \"world\"}";

	cj_parse_init(&cj, json, (cj_size)strlen(json), &tokens[0], MAX_TOKENS);
	cj_parse(&cj);

	if (cj.status == CJ_OK)
	{
		parent_ref = 0; /* root object */
		if (cj_copy_value(&cj, &val[0], sizeof(val), parent_ref, "hello"))
		{
			printf("hello -> %s\n", &val[0]);
		}
	}

	return 0;
}

#Token breakdown

The parser fills the token array cj.tokens[] with tokens that reference positions in the JSON string. Each token has one of the following types:

typedef enum cj_token_type
{
    CJ_TOKEN_INVALID    = 'i',
    CJ_TOKEN_STRING     = 'S',
    CJ_TOKEN_PRIMITIVE  = 'P', /* numbers, true, false, null */
    CJ_TOKEN_ARRAY_BEG  = '[',
    CJ_TOKEN_ARRAY_END  = ']',
    CJ_TOKEN_OBJECT_BEG = '{',
    CJ_TOKEN_OBJECT_END = '}',
    CJ_TOKEN_ITEM_SEP   = ',',
    CJ_TOKEN_NAME_SEP   = ':'
} cj_token_type;

typedef struct cj_token
{
    cj_token_type type;
    cj_size pos;
    cj_size len;
    cj_token_ref parent;
} cj_token;

Important: During parsing, a CJ_TOKEN_PRIMITIVE token is checked to be a valid JSON number or keyword according to RFC 4627, but its string representation isn't converted automatically. Conversion can be done with the extra utility functions.

The JSON object {"hello": "world"} is parsed into the following tokens:

0  CJ_TOKEN_OBJECT_BEG  {
1  CJ_TOKEN_STRING      hello
2  CJ_TOKEN_NAME_SEP    :
3  CJ_TOKEN_STRING      world
4  CJ_TOKEN_OBJECT_END  }
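
To illustrate how such a token list can be used, the sketch below walks a hand-built copy of the five tokens above and looks up the value that belongs to a key. Everything prefixed with demo_ is hypothetical and merely mirrors the structures shown earlier; it is not how cj_value_ref is implemented. It assumes that pos and len of a string token exclude the surrounding quotes, that the root token's parent is 0, and that sizes and references fit in an unsigned long.

```c
/* Hypothetical mirrors of the token types and struct shown above. */
typedef enum demo_token_type {
    DEMO_TOKEN_STRING     = 'S',
    DEMO_TOKEN_OBJECT_BEG = '{',
    DEMO_TOKEN_OBJECT_END = '}',
    DEMO_TOKEN_NAME_SEP   = ':'
} demo_token_type;

typedef struct demo_token {
    demo_token_type type;
    unsigned long pos;    /* offset into the JSON text */
    unsigned long len;    /* length of the referenced slice */
    unsigned long parent; /* index of the enclosing token */
} demo_token;

static const char demo_json[] = "{\"hello\": \"world\"}";

/* Hand-built tokens for demo_json; offsets assume quotes are excluded. */
static const demo_token demo_tokens[5] = {
    { DEMO_TOKEN_OBJECT_BEG,  0, 1, 0 },
    { DEMO_TOKEN_STRING,      2, 5, 0 }, /* hello */
    { DEMO_TOKEN_NAME_SEP,    8, 1, 0 },
    { DEMO_TOKEN_STRING,     11, 5, 0 }, /* world */
    { DEMO_TOKEN_OBJECT_END, 17, 1, 0 }
};

/* Returns the index of the value token for key inside object obj,
 * or -1 if the key is not found. Keys are compared as raw bytes,
 * so escaped keys are not unescaped first. */
long demo_find_value(const char *json, const demo_token *tokens,
                     unsigned long count, unsigned long obj, const char *key)
{
    unsigned long i, k;

    for (i = obj + 1; i + 2 < count; i++) {
        const demo_token *t = &tokens[i];

        if (t->type != DEMO_TOKEN_STRING || t->parent != obj)
            continue;
        if (tokens[i + 1].type != DEMO_TOKEN_NAME_SEP)
            continue; /* a string value, not a key */

        for (k = 0; k < t->len && key[k] != '\0'
                    && key[k] == json[t->pos + k]; k++)
            ;
        if (k == t->len && key[k] == '\0')
            return (long)(i + 2); /* key, separator, then the value */
    }
    return -1;
}
```

With the tokens above, looking up the key hello in object 0 yields index 3, the token that references world.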

#Running tests

CJ is tested through libFuzzer and CMake's CTest based tests. The test cases are built and executed with the following commands:

mkdir build-tests
cd build-tests
cmake ../tests
cmake --build .
ctest

#References

The CJ code is inspired by the JSMN library.