#CXC Format

A structured data interchange format for UTF-8 encoded text.

#Features

  • Defines only the list structure, not the content.
  • No numbers, booleans or other data types.
  • All data is UTF-8 encoded text.
  • No escape characters or special handling of newlines.
  • Text data is not modified in any way.
  • Machine readable only, not a config format.
  • Released into the public domain under CC0.

#Specification

The structure is defined by three code points from the C0 control code block, each of which is encoded as a single byte in UTF-8.

  • U+0001 group begin
  • U+0002 separator
  • U+0003 group end

All other UTF-8 encoded codepoints are the content.
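
As an illustration of the three delimiters, here is a minimal encoder sketch in Python. The names `BEG`, `SEP`, `END` and `encode` are hypothetical and not part of the format. It assumes a document is modelled as nested Python lists of `str`, and it rejects text that contains a delimiter, since the format has no escaping.

```python
BEG, SEP, END = "\x01", "\x02", "\x03"  # U+0001, U+0002, U+0003

def encode(node):
    """Encode a str or a nested list of str as CXC text (illustrative only)."""
    if isinstance(node, str):
        # There is no escape mechanism, so delimiters cannot appear in text.
        if any(c in node for c in (BEG, SEP, END)):
            raise ValueError("text must not contain U+0001, U+0002 or U+0003")
        return node
    if not node:
        # Assumption of this sketch: an empty list is rejected, because
        # BEG immediately followed by END reads as one empty text element.
        raise ValueError("a group needs at least one element")
    return BEG + SEP.join(encode(child) for child in node) + END

doc = encode(["a", ["b", "c"], ""])
wire = doc.encode("utf-8")  # each delimiter becomes exactly one byte
```

Here `wire` is `b"\x01a\x02\x01b\x02c\x03\x02\x03"`: the outer group holds the text `a`, a nested group, and an empty text.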

#Rules

The structure is a list which can be arbitrarily nested.

CXC ABNF Grammar

#ABNF

document = group
group = beg elements end
elements = element / element sep elements
element = group / text
text = *unichar
unichar = %x00 / %x04-10FFFF
beg = %x01
sep = %x02
end = %x03

https://datatracker.ietf.org/doc/html/rfc5234
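
The grammar can be read as a small recursive-descent parser. The following Python sketch is one possible reading, not a reference implementation. The function name `parse` is hypothetical, the input is assumed to be already decoded from UTF-8 into a `str`, and groups are returned as nested lists.

```python
BEG, SEP, END = "\x01", "\x02", "\x03"

def parse(doc: str):
    """Parse a CXC document into nested lists of str (illustrative only)."""

    def group(pos):
        # group = beg elements end
        if pos >= len(doc) or doc[pos] != BEG:
            raise ValueError(f"expected group begin at offset {pos}")
        pos += 1
        items = []
        while True:
            # element = group / text
            if pos < len(doc) and doc[pos] == BEG:
                item, pos = group(pos)
            else:
                start = pos
                while pos < len(doc) and doc[pos] not in (BEG, SEP, END):
                    pos += 1
                item = doc[start:pos]  # text may be empty
            items.append(item)
            if pos >= len(doc):
                raise ValueError("unexpected end of input inside a group")
            if doc[pos] == END:
                return items, pos + 1
            if doc[pos] != SEP:
                raise ValueError(f"expected separator or group end at offset {pos}")
            pos += 1  # skip the separator and read the next element

    result, pos = group(0)
    if pos != len(doc):
        raise ValueError("trailing data after the document group")
    return result

tree = parse("\x01a\x02\x01b\x02c\x03\x02\x03")
```

With this input, `tree` is `["a", ["b", "c"], ""]`. Because `text` may be empty, `beg end` parses as a group with a single empty text element. The sketch recurses once per nesting level, so very deeply nested input would hit Python's recursion limit.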

#Rationale

The format was created as a base for application-specific formats that specify constraints and more precise data types. The Unicode code points for beg, sep and end were chosen to prevent CXC from being abused as a config file format. Since they don't appear in normal text data, escaping them is a non-issue.

There is no special whitespace handling; all whitespace is part of the actual text data. This allows text documents such as source code, JSON, YAML or Markdown to be transferred unchanged within a data structure, without the need for Base64 encoding.
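
A small Python sketch of this property, using made-up sample data: a pretty-printed JSON document and a source snippet with tabs, quotes and trailing whitespace are placed in one flat group and recovered byte for byte. Splitting on the separator is a shortcut that only works for flat groups without nesting.

```python
import json

BEG, SEP, END = "\x01", "\x02", "\x03"

# Hypothetical payloads: multi-line JSON and a code snippet with
# tabs, quotes, a comma and trailing spaces.
payload = json.dumps({"name": "demo", "tags": ["a", "b"]}, indent=2)
source = "def f():\n\treturn 'x, y'  \n"

doc = BEG + SEP.join([payload, source]) + END
wire = doc.encode("utf-8")

# Flat group only: strip beg/end, then split on the separator.
fields = wire.decode("utf-8")[1:-1].split(SEP)

assert fields == [payload, source]            # text came back unmodified
assert json.loads(fields[0])["tags"] == ["a", "b"]
```

No quoting, escaping or Base64 step is involved; the only requirement is that the embedded text contains none of the three delimiter code points.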

The initial use of CXC was as a message format for Remote Procedure Calls (RPC) and pattern matching, along the lines of:

( /obj/abc , a_method , ( params ... ) )
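
To make the notation above concrete, this Python sketch builds the bytes for such a message by hand. The two parameters are hypothetical, since the example elides them, and `show` is a made-up debugging helper that renders the delimiters as `(`, `,` and `)`.

```python
BEG, SEP, END = b"\x01", b"\x02", b"\x03"

# Hypothetical parameters; the notation above leaves them unspecified.
params = [b"42", b"hello world"]

message = (
    BEG + b"/obj/abc"
    + SEP + b"a_method"
    + SEP + BEG + SEP.join(params) + END
    + END
)

def show(data: bytes) -> str:
    """Render the delimiters in a printable form, for debugging only."""
    table = {1: "( ", 2: " , ", 3: " )"}
    return data.decode("utf-8").translate(table)

print(show(message))
```

The message is `b"\x01/obj/abc\x02a_method\x02\x0142\x02hello world\x03\x03"`, and `show` prints `( /obj/abc , a_method , ( 42 , hello world ) )`.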

#What CXC is not

CXC is not a replacement for more strictly defined formats such as JSON or YAML. Any stricter definition must be added at a higher level, either by the application or by a format built on top of CXC.