C-Minus Preprocessor


If you are reading this on github (a read-only mirror): most of the links in this doc, and various formatting, will not work on github because this page is written for the Fossil SCM repository hosted at this project's canonical home: https://fossil.wanderinghorse.net/r/c-pp

These are the docs for the "trunk" version of c-pp. See the "lite" branch for the lighter-weight fork referenced by the SQLite JS/WASM docs (which continues to be maintained for that purpose).

The C-minus Preprocessor (a.k.a. c-pp or cmpp) is a minimalistic C-preprocessor-like application. Why? Because C preprocessors can process non-C code but generally make quite a mess of it1. The purpose of this library is to provide an embeddable, customizable preprocessor akin to a C preprocessor. It was first deployed in builds of JavaScript but is generic and configurable enough to be used with essentially arbitrary UTF-8 text (including C code). It does not support any non-UTF-8 encodings.

Like a C preprocessor, this tool reads input from text sources and conditionally filters out parts. Its content expansion options differ significantly, but not fundamentally, from CPP's, but it provides a much richer, and client-extensible, set of directives.

(Diagram: Input → MAGIC! → Output)

Features of potential interest:

  • Can perform moderately sophisticated filtering of text inputs in a fashion similar, but not identical, to a C preprocessor.

  • Well documented, in the form of this file and libcmpp.h.

  • Can stream its input from any source via an input stream abstraction. It provides implementations for FILE and file-descriptor sources, and creating custom implementations is usually trivial.

  • Can send its output anywhere via an output stream abstraction. It includes implementations for FILE and file-descriptor destinations, as well as to strings which it dynamically allocates on demand to buffer the output.

  • Can process multiple distinct inputs and outputs in a single invocation, allowing it to be automated in interesting ways. The test script demonstrates how that can be useful.

  • Supports registering custom stateful directive handlers, either linked in or loaded dynamically from DLLs. This capability is, to the very best of my fallible knowledge, a world's first in a generic preprocessor. (IBM's COBOL preprocessor is reportedly extensible but is limited to COBOL input.)

  • Supports savepointing within scripts, to limit the scope of any given #define or to temporarily override it, reverting to its old value when the savepoint is rolled back. i.e. it supports "local variables".

  • Its #pipe directive allows it to run external programs to filter input. That is: an HTML template could embed markdown- or pikchr-formatted code directly and preprocess it using an external converter. This also allows it to wrap a C preprocessor, should one ever really want to. (Pikchr is also supported by an optional directive.)

  • WASM-friendly. Though it is as-yet untested in WASM/WASI builds, its API is designed to be friendly to those. Still TODO is to optionally eliminate all dependencies on C-level I/O APIs in such builds, to improve WASM portability. (Such I/O routines are currently only used for debug output and as a default output channel. They are not a core component or requirement of the library.)

  • Strictly single-threaded and synchronous, if only to provide evidence that not everything needs to be made async.

  • Distributed as a single source file of portable C99, making it easy and portable to copy around. It builds as both a library and a standalone CLI application. These docs cover the high-level features of the library, and the app is a very thin wrapper around that. The API docs are in libcmpp.h. (A good Perl hacker could probably implement most or all of this library in about 100 lines of Perl. This implementation is in C, so is significantly larger than that.)

See c-pp --help for usage details of the application (as opposed to the library interface, which is in libcmpp.h), in particular the fact that it processes its arguments and flags in the order they're provided, which allows chaining of multiple input and output files in a single invocation.

Design note: this tool makes use of SQLite. Though not strictly needed in order to implement it, it was specifically created for use with the sqlite3 project's own JavaScript code in order to facilitate creation of different builds, so there's no reason not to make use of sqlite3 to do some of the heavy lifting (it does much of that lifting). c-pp does not require any cutting-edge sqlite3 features and should be usable with any post-2020 version.

Formalities

Project home: https://fossil.wanderinghorse.net/r/c-pp

License: the SQLite Blessing

Author: Stephan Beal https://wanderinghorse.net/home/stephan/

Dependencies: A C99-capable C compiler, SQLite, and the target system's libc. It includes a copy of SQLite in the source tree but can use any relatively recent version.

Contributors are welcomed - please get in touch via the link above or post to this project's forum.

Building It

Grab a copy of the source code from /download or by cloning the repository using fossil:

$ fossil clone https://fossil.wanderinghorse.net/r/c-pp

Then, from its top-most directory:

$ ./configure --prefix=$HOME
$ make
# optionally:
$ make test
$ make install
$ make uninstall

See also: configuring its build

Somewhat ironically, the primary generated deliverables, libcmpp.[ch], are not installed by that process. They are intended to be dropped as-is into client-side trees.

Markup

c-pp is, like CPP, a line-oriented preprocessor. It looks for lines which start with its current delimiter (see below) and processes them. Other lines are normally passed through unmodified, but enabling @token@ parsing will cause non-preprocessor lines to be filtered. Similarly, specific directives may treat the content of their own block differently than other content (e.g. #define heredocs).

The general syntax for a c-pp directive line is:

DELIMITER DIRECTIVE ...args

Where DELIMITER is the symbolic # described below and DIRECTIVE is one of the operations supported by the preprocessor.

The delimiter "#" used in these docs is symbolic only. The delimiter is configurable and defaults to ##2. Define CMPP_DEFAULT_DELIM to a string when compiling to set the default at build-time. The delimiter may also be modified via the --delimiter=... command-line flag. This documentation, for brevity and clarity, exclusively uses # unless it's specifically demonstrating changing the delimiter.

See #directives for examples and more syntax details.

Token Types

c-pp directive arguments must each follow one of the following forms:

  • word: this is a near-arbitrary token with no spaces. Most of the time, word tokens resolve as define keys. Sometimes a directive will instead treat them as literal values, such that the word foo is interpreted as foo instead of whatever value foo is defined to (if any).
  • int: if it looks like an integer, with an optional +/- sign, it's tagged as such.
  • "string" or 'string': this token starts out with quotation marks around it, but they're not part of its value. c-pp does not support backslash-escaping within a string. That is, all backslashes are retained as-is and there is no way to escape the outer quote character within the string.
  • @"..." or @'...': is a string which gets passed through @token@ parsing when it's evaluated.
  • Group constructs:
    • (...): is currently only used in subexpressions.
    • [...]: context-dependent, but the convention is to use this for lists of other tokens. See #query for an example. See also: the "call" syntax.
    • {...}: is for context-specific free-form content or, sometimes, used like a quoted string. See #query for an example.
    • Syntactic quirks and limitations:
      • No group may contain an unbalanced closing character.
      • There is no mechanism for escaping a group opening or closing character. i.e. all openers must be balanced by a closer.
      • Their contents do not require backslash-escaped newlines. If newlines are escaped then the backslashes are stripped from them but the newlines are retained. It is not currently possible to double-backslash newlines to force them to remain backslash-escaped after parsing.
        A potential TODO is to transform the escaped newlines to spaces (like we used to). The main problem with that is that it would affect how all directive lines are parsed, not just grouping tokens, and side effects need to be ruled out.
      • Leading and trailing space characters, up to and including the first and last newline, respectively, are trimmed, but the content is otherwise left as-is because it may contain text intended for external parsing, e.g. via #pipe or #query. Hard tabs are not considered spaces in this specific context so that they may be used in custom content.

Aside from the backslash-escaped newline case mentioned above, c-pp does not support backslash escaping of anything. That is: it treats all other backslashes, in all other contexts (unless explicitly noted otherwise), just as any other character. It does this primarily to give directives like #pipe flexibility in passing on arguments. It is, however, admittedly sometimes a problem and it may eventually need to be solved (i.e. changed to unescape certain sequences, perhaps opt-in on a case-by-case basis or via the addition of an as-yet-hypothetical #unescape function).

"Define" Keys (a.k.a. "Macros")

These docs frequently refer to "define keys". That's this project's term (for lack of a better one; "macro" doesn't really fit here) for the names managed via #define, #undef, and the -D.../-U.../-F... CLI flags.

Define key naming rules are:

  • Alphanumeric characters are allowed, but a key must not start with a number. Control characters, spaces, and most punctuation are disallowed.
  • Any of -./:_ are allowed (but a key may not start with -).
  • Any characters with a high bit set are assumed to be UTF-8 and are permitted as well.
  • Its length is limited, rather arbitrarily, to 64 bytes.
  • Names with the prefix cmpp are reserved for use by the library. It does not loudly impose this rule, but it handles its internal defines such that attempts to override them will silently have no effect.
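
To illustrate those rules, a small sketch, assuming the default ## delimiter and using the #// comment form shown later in these docs (the key name is invented for demonstration):

```
##// Legal: starts with a letter and uses only alphanumerics and -./:_
##define my.key:v-1 "ok"
##assert defined my.key:v-1
##// Illegal as keys: "1key" (leading digit), "-key" (leading dash)
```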

See #undefined-policy for how the library deals with references to undefined values.

Directives

c-pp directives both look and function a good deal like C preprocessor (CPP) directives do. They begin with a delimiter, followed by a directive, followed by any directive-dependent arguments.

A fundamental difference from a CPP is that c-pp's delimiter is configurable, rather than being hard-coded to #. These docs use # for brevity, but they always mean "the currently-configured delimiter" (which can be changed while processing inputs).

Another fundamental difference is that each c-pp instance starts off with no directives installed. When it finds a directive in an input stream it checks its internal list of candidates and registers them on-demand. If it cannot find one, it falls back to the client-registered auto-loader and, if that isn't set or doesn't yield results, it will try to load them dynamically from a DLL. Client applications are free to register any they like in advance, and that's normally simpler than setting up an auto-loader fallback.

Example:

#if a
 That #if is a directive, but the one on this line is not
 because it has non-space content before it.
#/if

Spaces and tabs before and after the delimiter, and between arguments, are ignored, so the following #if is equivalent to the previous one:

  #  if  a
...
 #  /if

A directive may span lines by backslash-escaping each end-of-line character:

#if this is unusually \
 long                 \
 "so we'll wrap it"
...
#/if

No spaces may follow such a backslash. As an exception, the bodies of (...), {...}, and [...] may span lines without requiring backslash-escaped newlines:

##assert (
  1
) and defined x and \
(
  x=3
)

Backslashes are optional within the confines of each group.

So-called "block" directives, like #if, have both an opening and a closing line. The closing line is always in the form #/DIRECTIVE, e.g. #/if, #/query, or #/pipe. The closing tags ignore any arguments so that they can be decorated with informative comments by document maintainers:

#if defined foo
... 1000 imaginary lines of text ...
#/if defined foo

Non-block directives have one-time effects which take place when they are parsed. Their effects may change the behavior of further parsing.

The following subsections cover each directive in alphabetical order.

#@ Controls Expansion and Delimiters of @tokens@

This directive manages the state regarding so-called @tokens@, a.k.a. at-tokens.

By default c-pp does no expansion of content beyond the filtering of content blocks using #if. If passed the -@ flag or a @token@ policy is used in a script (using this directive), then it will perform a restricted type of expansion on content blocks: tokens in the form @TOKEN@ are processed as described below.

This directive manages:

  • Whether to process @tokens@ at all. The default is not to.
  • How to handle @tokens@ which refer to undefined values.
  • The delimiters which surround @tokens@. These docs use @ for brevity and because it's the default, but both the opening and closing delimiters can be configured with this directive or the cmpp_atdelim_set() C API.

Usages:

#@ ?push? policy NAME ?<<?
#@ ?push? delimiter OPEN CLOSE ?<<?
#@ ?push? policy NAME delimiter OPEN CLOSE ?<<?
#@ pop policy
#@ pop delimiter
#@ pop policy delimiter
#@ pop both

A trailing << argument means that (A) "push" is implied and (B) the given changes apply only until the closing #/@ directive. If << is used then a #/@ directive is required, otherwise no #/@ is expected (and will trigger an error if used).
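
For example, a sketch of the << form, assuming the default ## delimiter and that x is undefined:

```
##@ policy elide <<
a@x@c
##/@
a@x@c
```

Inside the block the unresolved @x@ should be elided, emitting ac; once #/@ is reached the previous policy (off, by default) is restored, so the final line passes through untouched.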

It manages two independent stacks, which can make it potentially confusing to use properly. These docs will not show all legal usage combinations but will demonstrate those patterns which are least likely to cause confusion.

The various usages are broken down below.

Usage: ?push? policy NAME

policy takes a policy name argument which describes how to deal with @tokens@ in the input:

  • off (the default): no processing of @tokens@ is performed.
  • error: fail if an undefined X is referenced when parsing @X@.
  • retain: emit any unresolved @X@ tokens as-is to the output stream.
  • elide: omit unresolved @X@ from the output, as if their values were empty.

The push option tells it to set the policy and remember the previous policy, otherwise it will replace whatever the current value is.

Usage: ?push? delimiter OPEN CLOSE

Sets the current @token@ open/closing delimiters to the given values (be sure to quote them).

The push option tells it to set the policy and remember the previous policy, otherwise it will replace whatever the current value is.

Usage: ?push? policy NAME delimiter OPEN CLOSE

The policy and delimiter may be combined into a single call. When push is used here, the push applies to both the policy and the delimiter. i.e. both stacks get amended. pop can be used to independently pop each of those but it is not recommended because it can be confusing to keep track of which stack is in which state.

Usage: pop ?policy? ?delimiter? ?both?

This pops one level from the stack for the expansion policy and/or the delimiter. It requires at least one argument, and passing both policy and delimiter is equivalent to passing the single argument both.

It is an error to pop an empty stack.
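
As a sketch, the explicit push/pop equivalent of a temporary policy change (again assuming that x is undefined):

```
##@ push policy elide
a@x@c
##@ pop policy
a@x@c
```

The first content line should emit ac; after the pop, the default off policy applies again and the second content line is emitted verbatim.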

BUG:

  • In one external tree #/@ is failing to recognize that it's part of a block for reasons not yet understood. The workaround is to use an explicit push and pop instead of the << convenience form. (2026-01-04: is that still a thing?)

@token@ Behavior and Limitations

  • @token@ expansion generally happens only in "content" parts - not preprocessor lines. That is, #if foo=@bar@ won't try to expand @bar@ (just use foo=bar for that). at-strings can be used in some contexts to perform @token@ expansion on directive arguments.

  • The X part of @X@ is treated as a define key. If no match is found, then the current policy specifies how to deal with it. If a match is found then @X@ gets replaced by X's value.

  • It will not cross line boundaries looking for a closing @. i.e. the X part of @X@ may not contain newlines. The expanded value may contain newlines.

The --no-@ CLI flag or policy off both disable expansion until either a subsequent -@ or policy change re-enables it.

A demonstration of the "@" policy:

$ ./c-pp -e '##@ policy off' -e 'a@x@c'
a@x@c
$ ./c-pp -Dy=Y -e '##@ policy retain' -e 'a@x@@y@c'
a@x@Yc
$ ./c-pp -e '##@ policy elide' -e 'a@x@c'
ac
$ ./c-pp -e '##@ policy error' -e 'a@x@c'
a
./c-pp: @<stdin>:1: Undefined @key@: x

#arg

#arg is intended for use as a function. It expands its argument and emits it. "The plan" is to add flags to this to perform meta-operations on arguments, e.g. fetching their type or raw value instead of their expanded value.

Usage:

#arg ?flags? one-argument

Flags:

  • -trim-left|-trim-right|-trim
    Trim the given side(s) of space and newlines.
  • -raw
    Do not expand the value before emitting it. This would strip the outer quotes from a string, for example, but not process the contents of an at-string or a possible function call.

It's currently difficult to envision a usage for this outside of testing this library.

#assert

This works like #expr except that (A) it emits no output and (B) it fails if its expression is false. #assert is essentially syntactic sugar for:

#if not foo
#error ...
#/if

Which can be shortened to:

#assert foo
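
A slightly larger sketch, combining #assert with the expression forms documented under Expression Rules (the names x and y are arbitrary):

```
##define x 3
##assert defined x and (x=3)
##assert not defined y
```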

#attach or #detach a Database File

This directive is a thin proxy for SQLite's ATTACH command, which "attaches" a database to the current db connection:

#attach "/path/to/my.db" as "foo"

On its own it's not of much use, but it's intended to be paired with #query.

It won't create a new db without a URL-style db name like file://foo.db?mode=rwc (assuming the linked-in SQLite has that feature enabled (most builds do)). We don't really want to create or administer arbitrary dbs from c-pp (there are much, much better ways to do that). It is, however, useful as a basic templating system, e.g.:

#attach "my.db" as "foo"
#query {select a, b, c from foo.t order by a}
a=@a@, b=@b@, c=@c@
#/query
#detach "foo"

#define: Set Preprocessor Symbols

This directive "defines" values, in the same sense that a C preprocessor does, the main difference being that defines in c-pp behave more like variables, in that they can be freely overwritten without first having to undefine them.

Usages:

#define foo
#// the same as
#define foo 1
#define foo "this is foo"
#define bar foo
#assert bar="this is foo"

Prior to 2025-09-27, the equal sign was "just another identifier letter", but it is now no longer permitted by this directive. In the context of expressions, = is a comparison operation.

If a define is given no value, it has an implicit value of 1.

If the first argument is a question mark then the define is only applied if the value is not already defined:

#define ? foo "hi there"

Is syntactic sugar for:

#if not defined foo
#define foo "hi there"
#/if

Multiple defines can be set at once with:

#define {
  x -> 2
  y -> 3
}
#assert (x=2) and (y=3)

The ? modifier is applied to each key separately.

This form requires a value for each key - there is no default. Each key is interpreted literally and each value is interpreted in the usual ways:

#define {a -> "hi there" b -> a}

Will define both a and b to hi there.

In {...}, values in the form (...) are interpreted as expressions.
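
For instance, a sketch of expression-valued entries, assuming (per the boolean rules under Expression Rules) that (not 0) evaluates to the integer 1:

```
##define {
  a -> (1)
  b -> (not 0)
}
##assert (a=1) and (b=1)
```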

Design note: after some experimentation, the -> is required (A) because it's easier [for me] to read that way than {k v k2 v2...} is and (B) to avoid over-complicating the parsing by optionally allowing -> or =. My eyes find = to be less legible in that context.

To define a variable to the contents of a file, use the function call syntax:

#define x [include -raw the-file]

The -a SEPARATOR or -append SEPARATOR flags will append the value to any existing value and inject SEPARATOR between them if the value was already set to a non-empty value and the new value is not empty.

Potential TODOs:

  • Flag(s?) to change how define interprets its value.

#define "Heredocs"

#define can also assign a value from a content block using a heredoc-like syntax:

#define foo <<
content goes here
#/define

Notes and limitations:

  • It must end with #/define on a line of its own.
  • Its content may contain other #directives but they must be completely contained, not interwoven.
  • The final newline in the content is included but that can be suppressed with the -chomp flag (see below).
  • It is currently parsed for @tokens@ when it is read (if the current policy is not "off"). The thinking is that it would normally be more useful to delay that, but currently there is no straightforward way to expand it (or to know whether to expand it) after-the-fact.

#define accepts the following flags immediately before the <<:

  • -chomp: will remove one trailing newline from (a.k.a. "chomp") the block before assigning it. -chomp can be given any number of times to chomp that many newlines. Chomping has no effect if the content does not end on a newline, but content blocks will always, because of how c-pp's syntax works, have at least one trailing newline unless they are completely empty. Tip: <<< is syntactic sugar for -chomp <<.

  • Potential TODO: a -@policy specific to this define. For that to be useful, e.g. delaying @token@ expansion until the define's value is later read, we first need a good way to be able to tell c-pp to expand later (e.g. by tagging the value as an @-string and recognizing that when fetching the value). Options are being explored for that but the most obvious ones would affect the lowest-level routines and i'm not sure this feature belongs there. Maybe it does. Who knows?
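
Putting the pieces together, a sketch of the <<< sugar (one -chomp plus <<):

```
##define greeting <<<
hi there
##/define
##assert greeting="hi there"
```

Without the chomp, greeting's value would end with the block's trailing newline and the assertion would presumably fail.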

Predefined Symbols

The following symbols are predefined:

  • __FILE__
    The current input file's name. (How this is implemented is actually pretty interesting, if one is into that kind of thing.)
    Sidebar: its conventional __LINE__ counterpart is not implemented because setting that everywhere it needs to be set, or special-casing it in the lowest-level pieces which fetch defines, would be onerous and too computationally expensive to justify its addition. Also, doing so might reveal bugs in the line counting ;). __FILE__, in contrast, has an unobtrusive and inexpensive solution.

  • c-pp::argv
    All arguments passed to the c-pp binary, including the binary's name. This has proven useful in testing and content validation.

  • cmpp::version
    libcmpp version information in an unspecified single-line format.

#delimiter: Change the Directive Delimiter

This directive changes the directive delimiter. The delimiter is managed as a stack, the same way as @ policy. The stack always starts out with the library's compile-time-defined delimiter on top.

Usages:

  • #delimiter DELIM
    Changes the current delimiter to DELIM.
  • #delimiter push DELIM
    Pushes DELIM as a new delimiter on the stack, making it the current delim.
  • #delimiter pop
    Pops the most-recently pushed delimiter. It is illegal to invoke this unless one has invoked a corresponding push.

A DELIM argument of default, predictably enough, uses the default delimiter (set when the library is compiled).

A final argument of << indicates that the new delimiter remains in place only until a following #/delimiter directive, noting that the closing directive has to be delimited by the newly-pushed delimiter. In this form, it is an error if EOF is encountered before the closing tag is found.

When used as a function call, (A) << is not permitted and (B) if given no arguments it will emit the current delimiter.

Example:

##delimiter push @@
@@delimiter push !! <<
!!expr 1
!!/delimiter
@@expr 2
@@delimiter pop
##expr 3

Results in the output 1\n2\n3.

PS: don't do that.

#error Breaks Things

Immediately stops processing with an error.

#error the rest of the line is an error message

As a special case, if the line both starts and ends with the same quote character (" or ') then those quotes are stripped from the result.

#expr Evaluates Things

This directive evaluates an expression (described in the next subsection) and emits its result (typically an integer).

#expr expression...

There are no known practical uses for this directive beyond in testing c-pp itself, but see #assert and #if for more practical uses of expressions.
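
That said, a minimal sketch (each #expr line should emit its integer result, presumably 1 for each of these):

```
##define x 3
##expr x=3
##expr not (x>5)
```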

Expression Rules

An expression, in this context, is a series of operators and operands which evaluate to either true or false. Expressions are used by several directives, most significantly #if.

The general syntax is:

[not] [defined] value [COMPARISON-OPERATOR value] [and ...] [or ...] [glob ...]
  • X COMPARISON-OP Y compares the define of X against Y. Y may be an integer, a quoted string, a define name, or a (...) subexpression. The following comparison operators are supported, and spaces between them and their operands are optional: =, !=, <, >, <=, >=. Value comparisons are, for the most part, internally against strings, but expressions evaluate to an integer value.

  • X, with no comparison operator, performs a boolean check: empty values and those with a value of 0 (zero) are false. All other values are true. (This is a string comparison, so 000 is true!)

  • "..." or '...' are strings. Strings do not currently support any form of backslash-(un)escaping, so a string may not contain its own quote characters. All backslash characters in strings are retained as-is.

  • @"..." or @'...' are "at-strings". They work like strings but, in most contexts, get the same expansion handling as @tokens@.

  • (...) is a subexpression, the contents of which may be any legal expression. These may be nested. They currently always evaluate to an integer. "The hope" is to also support string expressions at some point, but the addition of function calls may make that unnecessary.

The following unary operators are supported:

  • not negates the result of the expression. not may optionally be written as !. It may also be used multiple times in a row, each pair of which cancels each other out.

  • defined changes the expression such that if the argument refers to a defined value, regardless of its value, the expression evaluates to true. The operand must be a word token. It does not accept strings, subexpressions, or other operators as its operand. Tip: the string #if is technically a word-type token, so it qualifies here, and defined #DIRECTIVE-NAME evaluates to true if a given directive exists. (As an exception to this documentation's conventions: it expects a literal single #, not the current directive delimiter!) This can be used to test for whether a given custom directive has been installed.
    Sidebar: defined very specifically does not trigger a search for a dynamically-loadable directive. It may trigger an autoloader, however, and that may trigger a DLL search. Hmm. (Removing the autoloader from that search causes tests to fail and also fails to give me the semantics i'd prefer. Maybe defined needs a flag to specify whether or not to search the various sources for directives: registered ones, autoloadable ones, and DLL-loadable ones, noting that an autoloader may do whatever it likes to load a directive.)

The unary operators bind tightly to their RHS argument, without consideration for whether it is the beginning of a longer expression. That is, (not a=3) will parse as ((not a)=3). The workaround is to use a subexpression: not (a=3) or, even simpler, a!=3. (Patches to fix that, even if it means rewriting the beast of an expression engine, would be very thoughtfully considered!)
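
A sketch combining the defined-a-directive tip with the binding workaround (the directive name #no-such-directive is, of course, invented):

```
##assert defined #if
##assert not defined #no-such-directive
##define a 4
##// (not a=3) would parse as ((not a)=3), so instead:
##assert not (a=3)
##assert a != 3
```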

The following binary operators are supported:

  • The comparison operators listed above.
  • X and Y
  • X or Y
  • X glob Y
    Evaluates to true if X matches glob pattern Y, else false. X is currently restricted to a quoted string or a define name. Y is required to be a quoted string or an at-string. (X not glob Y) is syntactic sugar for (not (X glob Y)).

All of the binary operators are evaluated strictly left-to-right, with equal precedence for each.

Sidebar: there is currently no short-circuiting of and and or because the evaluation and parsing are closely tied together. Since the addition of function calls, this can hypothetically lead to undesired side effects in the should-have-been-short-circuited parts of an expression, but a genuine problematic case has yet to show up in practice.
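
For illustration, a sketch of glob (the define name and the patterns are arbitrary):

```
##define name "foo.c"
##assert name glob "*.c"
##assert name not glob "*.h"
```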

Expression limits:

  • Most glaring is that chains of binary operators may need subexpressions: (a=b and b=3) does not parse the way it looks like it should, and currently needs to be written as (a=b and (b=3)).
  • FIXME: call syntax needs to be permitted for operands.

#if, #else, #elif, #/if

#if and friends cause blocks of the input to be emitted or elided depending on the result of an expression. The expression syntax differs from that of a C preprocessor but the end result is the same. This family of directives includes #elif, #else, and #/if.

#if's arguments must make up an expression. #/if ignores all of its arguments - it's commonly useful to add a note there saying which block is being ended.

Example:

#if foo=1
...
#elif foo < 2 or foo > 5
...
#elif bar or baz or not defined charlie
...
#else
...
#/if foo=1

(Any text after /if on that last line will be ignored, which is useful for annotating the line with the purpose of the block it's closing.)

#include External Files

This directive emits the contents of other files into the output:

#include ?-raw? filename...

The filename arguments may optionally be quoted, and must be if they contain any quote or space characters. They may also be function calls, e.g. [join -s '/' a b c.d].

The -raw flag specifies that each file's contents are to be passed through to the current output channel with no interpretation, otherwise each file is filtered through the preprocessor as if it were part of the current file.

The filenames are searched for in the so-called "include path", which works just like a C/C++ include path.

As of 2025-12-31, the #include search path is amended automatically to contain the directory of the current input file, such that:

#include "foo.bar"

Will resolve foo.bar from the input file's own directory before it will resolve it from other directories.

Trivia: prior to that addition, the workaround for the lack of it was to adjust makefile rules to use the equivalent of -I$(dir $@) in all invocations of c-pp.

The -Idirname flag can be used any number of times to specify search directories and they will be searched in the order provided, with the caveat that the aforementioned directory part always has a higher search priority.

There is currently no mechanism for modifying the #include path from within a script, and no current plans to add that capability.

#join Arguments Together

The #join directive concatenates its arguments and emits the result to the output stream.

It accepts the following flags:

  • -s SEPARATOR: sets the separator which gets emitted between the following arguments. Its value will be resolved in the usual ways, so it may need to be quoted. It may be used multiple times to change separators for the following arguments. The default separator is a single space.
  • -nonl: When running in non-call mode, do not emit a newline. The default is to emit a newline. (In call mode newlines are trimmed automatically by a higher level.)

$ ./c-pp -Db=2 -e '##join 1  b    3'
1 2 3

$ ./c-pp -e '##join 1 2 3 [join -s "X" 4 5 6]'
1 2 3 4X5X6X

TODO: unescaping of the separator to allow newlines and tabs. This needs to be done at a different level of the API, though.

In practice #join is most commonly used to glue directory and filename parts for use with #include.
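For example, assuming a define named dir holds a directory name (both the define and the filename here are hypothetical), the two can be combined like so:

```
#define dir path/to/parts
#include [join -s '/' dir file.txt]
```

The unquoted dir argument resolves to its define value, per the usual argument resolution rules.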

#module Loads Directives from DLLs

The #module directive can load new directives from DLLs. In safe mode it is neither registered nor permitted to run. This support currently only works on Unix-esque platforms: those with either dlopen() or lt_dlopen(). Patches to add support for other platforms would be welcomed.

#module "dllName" "directive-name"

It tries to open the given DLL, find an entry point with the given name, which it assumes to be of type cmpp_loadable_module*, and it invokes the module's callback. The intent is that such callbacks register new directives. The DLL name argument may include the platform's conventional DLL extension (".so" on most platforms), but that's optional - the search includes checking the name both as-provided and with the DLL extension added to it.

The directive name can be left off if the module in question is specifically built and registered as the sole module in that DLL (in which case it uses a pre-defined entry point name). Whether that's the case depends on how it is built and which module registration macro(s) it uses.

Registration of modules is handled via macros named CMPP_MODULE_... in libcmpp.h.

This directive performs no filename transformation beyond the path lookup and automatic DLL extension.

Example:

./c-pp -Ddll=libcmpp.so \
  -e '##assert not defined #dyno' \
  -e '##module dll dyno' \
  -e '##dyno hi there' \
  -e '##assert defined #dyno'
cmpp_dx_f_dyno() arg: cmpp_TT_Word hi
cmpp_dx_f_dyno() arg: cmpp_TT_Word there

When the DLL is built with a singleton module registration the entry point name is not required, as the singleton uses a well-defined name:

$ ./c-pp -e '##module "libcmpp.so"' -e '##dyno hi there'
cmpp_dx_f_dyno() arg: cmpp_TT_Word hi
cmpp_dx_f_dyno() arg: cmpp_TT_Word there

Example module: /file/mod/dyno/d-dyno.c

Rather than explicitly load DLLs, they can be set up to be loaded on demand if the DLL's name matches its directive, as described in the following section.

Directives in DLLs

If built with DLL support and it's not running in safe mode then the library will, when encountering an unknown directive, search for a matching DLL. For purposes of this search, the DLL is expected to be named libcmpp-d-NAME.so. The module search path defaults to the $CMPP_MODULE_PATH environment variable, but it can also be set with the -L flag to c-pp or the C API's cmpp_module_dir_add().

Any given DLL may install any number of directives, and none of them have to match its DLL name, but this specific automated search for a DLL requires a naming convention.

If it finds a matching DLL, it is opened, and, if it contains a loadable module, that module's registration function is called. If that call registers the being-sought directive, the library continues processing. If not, it fails with an "unknown directive" error.

The C API also offers an "autoloader" API which clients can install to load their own statically-linked directives on demand or to implement their own DLL search. That's independent of the library's automatic DLL search (which is, in terms of search priority, last on the list).
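A hypothetical illustration of that naming convention (the directive name and directory are invented): a DLL named libcmpp-d-greet.so placed under ./mods would let an otherwise-unknown #greet directive be loaded on demand:

```
$ ./c-pp -L ./mods -e '##greet hello'
```

If the DLL's registered module does not actually install #greet, this fails with an "unknown directive" error, as described above.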

@policy Controls Expansion of @tokens@

Replaced by #@ 2025-11-18.

#pipe Filters Content through External Processes

This directive is not currently available on Windows builds (patches to improve that would be thoughtfully considered!).

#pipe runs an external command, optionally feeds it input from the script, and emits the output from that command:

#pipe -- /usr/bin/sed -e 's/this/that/'
this content
#/pipe

Will pipe this content\n into sed and get that content\n back.

Similarly:

#define cmd "echo"
#pipe -no-input -chomp-output -- cmd this is from echo

Will emit this is from echo and chomp the trailing newline from the output.

Arguments and flags:

  • -chomp: each time this flag is used, it causes one newline to be removed from the directive's input block.

  • -chomp-output: each time this flag is used, it causes one newline to be removed from the external command's output.

  • -no-input: tells this directive to not consume the following content looking for a #/pipe directive. The external command is sent no input from this directive.

  • -exec-direct: normally the external command and its arguments are passed to the OS as a suffix of /bin/sh -c. This flag tells it to run that command directly, without the intermediary shell. This can only work if the command has no arguments, otherwise the arguments will be treated as part of the command name. (We could optionally implicitly set this if the command has no arguments.)

  • -path: tells it to search the $PATH when looking for the command, as documented for execlp(3) and execvp(3). More specifically, it uses execlp(3) or execvp(3), depending on the form of the command (see below), instead of execl(3) or execv(3).

  • -debug: emit the post-processed command to stderr before running it.

  • --: must immediately precede the command name. This tells the directive that we are switching from c-pp's token parsing to near-arbitrary input.

The final argument must be the command and its arguments in one of two forms:

  • command-name ...args: if the command name is not quoted then it is treated as a define key unless it contains any / or \ or . or - characters. In those cases it is assumed to be a filename or command switch and is not subject to any further interpretation. At-strings, as well as define names which do not match the aforementioned patterns, will be expanded appropriately.
    Only the first token of the command string is parsed so that command names may be runtime-configurable via defines. The remaining arguments, because they may be essentially free-form, are not parsed as arguments by c-pp, but are passed on almost as-is to the command. The only interpretation they go through is (A) determining where this directive's line ends and (B) eliding any backslash-escaped newlines in the arguments entirely, as if they were not there.

  • [command-name args...]: in this form, each argument in the given list is treated like a normal directive argument. Each may be a string, at-string, number, function call, or word. The one exception to their normal processing is the same one described for the previous command name, but in this form that rule applies to all unquoted word tokens. The -- flag is optional for this call form because the [...] group unambiguously tells us that it's the command.

The external command gets piped, via its stdin, the contents of the directive's block unless -no-input is used. The command's stdout output is collected and emitted in its place. The output is not currently post-processed in any way except as per the -chomp-output flag, but should the need arise we can easily add optional at-token parsing to the output via a flag.

Stupid #pipe trick: run a C preprocessor through it:

##pipe -path -- 'cpp' -E
#include <stdio.h>
##/pipe

That requires using a directive delimiter other than # to avoid a conflict with cpp's #.

TODOs:

  • BUG: it will hang, waiting on I/O, in some constructs, e.g. the one marked BUG in this file.
  • A build option and CLI flag to disable both this and #include, to make it safer for use with potentially untrusted inputs.
  • Flag(s) to control whether or not to @-parse the command arguments.
  • A -define X flag which sets X to the piped output instead of emitting it.
  • Figure out how to report when the underlying exec() call fails due to an invalid command name. "The problem" is that the command is run as an argument to /bin/sh -c, and exec() succeeds in calling that, but then /bin/sh fails to find the command. That happens in the child process, so we can't directly report it to the parent. Currently this situation results in empty output (and maybe a cryptic message from /bin/sh on stderr) but no error. (Maybe we should close the child's stderr? Or maybe capture it separately and error if stderr produces any output? How do we do that?)

#pragma Is for Debugging

This directive is undocumented. It changes at the whim of the library's developer, primarily to support testing and debugging.

#query Renders Data from a Database

This directive runs SQL queries. c-pp internally uses only one (private) database, so #query isn't much use on its own except for in testing c-pp, but #attach can be used to attach arbitrary databases (and was added to support #query).

This directive has two forms:

First, it can run an SQL query, set scope-local defines for each result column, and filter its block's contents for @tokens@ using the current @token@ policy:

Your list of foo:
#query {select name AS name, price AS price from foo order by name}
@price@ @name@
#query:no-rows
This part is optional and is emitted if the query has no results.
#/query

This form requires a terminating #/query directive but the #query:no-rows sub-directive is optional (and may not appear more than once). As of 2025-12-21, the query body may contain other directives, which gives it much more expressive capability:

#query {select a from b}
#if a
...
#else
...
#/if
#/query

For this form the body of the query block is @token@-expanded to the output stream one time for each result row. Before each row is expanded, defines are set matching the names of the result columns. The defines are set within the context of a local savepoint so that after the query is processed the defines are either unset or reverted to their previous values. If no rows are found, the (optional) #query:no-rows block is emitted. If that block is not set, no output is emitted for queries which have no result rows.

Formatting of the results, if needed, can be done using SQLite's format() function. It is exceedingly unlikely that c-pp will ever be extended to include formatting-related features. (However, function calls bring that capability within easy reach.)

Secondly, it can define one or more symbols from the first row of an SQL query:

#query define {select a, b from c order by a}

This form does not use a terminating #/query directive.

The "define" form sets corresponding defines for the first row of the result set and does not use a savepoint. If no result rows are found it sets each define to an empty value. (Potential TODO: add a flag to error out in that case, or maybe provide default values.)

Sidebar: remember that the only guaranteed reliable way to get a result column's name is to set it oneself using SELECT x AS x (with the "AS" being optional).

Lastly, it can run batch queries which neither produce output nor consume from the input stream:

#query -batch {
  create temp table foo(a);
  insert into foo(a) values(1),(2),(3);
}

Neither define nor bind values (see below) may be used with -batch.
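Continuing the -batch example above, the loop form can then render the rows from that temp table (exact expansion depends on the current @token@ policy):

```
#query {select a AS a from foo order by a}
Row @a@
#/query
```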

Potential TODOs:

  • Maybe make the @token@ policy for the content part configurable for this call, rather than using the current policy. It seems that a mode of "error" is the best fit for this use and it's difficult to imagine wanting any other mode here. However, there's an internal reason which enforces that we use the current policy here, and that still needs to be resolved.

Binding Query Parameters

Query parameters can be bound either by name or index, but not a mix of both, by adding a bind argument:

  • #query {select :a a, $b b} bind {:a -> 1 $b -> 2}
  • #query {select ?1 a, ?2 b} bind ["one" "two"]

Sidebar: SQLite supports a prefix of @ in addition to : and $ but it's not supported here because of syntactic confusion with at-strings.

Bind values may be any of:

  • Quoted string (the quotes are not part of the bound value).
  • A {...} is treated like a quoted string, supported here solely for the outlier case where a value has to contain both single- and double-quotes.
  • A define name gets expanded to its value.
  • An at-string gets expanded.
  • An integer.
  • An expression enclosed in (...).
  • [...] is a function call.

Built-in DB Functions

The library installs the following non-standard SQL functions:

  • cmpp_file_exists(name): evaluates to 1 if the given filename is stat(3)-able and is a regular file, else 0.
    This function is used by the internals to implement path lookups and is exposed to clients for whatever amuses them.
  • generate_series(): https://sqlite.org/src/file/ext/misc/series.c
    Has proven especially helpful in generating repetitive CSS code.
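As a small sketch of cmpp_file_exists() in action (the filename is hypothetical), the "define" form of #query can turn a file's existence into a flag usable by #if:

```
#query define {select cmpp_file_exists('config.mk') AS hasConfig}
#if hasConfig
#include "config.mk"
#/if
```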

#query as a Loop Construct

One of the most useful, but possibly not most obvious, capabilities of #query is that it effectively provides a nestable for-each loop with a richness approaching Tcl's:

Tcl:

foreach {k v} {k1 v1 k2 v2...} {
  ...
}

Vs.

#query {select k, v from somewhere order by k, v}
...
#/query

Noting that, in practice, the second example is usually something more elaborate like a CTE, as in this real-world example which generates CSS code:

//#query {
  WITH w(which) AS (
    SELECT 'grayscale' AS which
    UNION ALL select 'invert'
    UNION ALL select 'sepia'
    UNION ALL select 'saturate'
  )
  select which which, value num from w, generate_series(0,100,5)
}
.filter-${which}-${num} {filter:${which}(${num}%)}
//#/query

//#query {select value num from generate_series(0,355,5)}
.filter-hue-rotate-${num} {filter:hue-rotate(${num}deg)}
//#/query

Alas, we do not yet have support for break and continue, and it's unclear whether they really could be supported here.

#savepoint: Scoped Defines

Savepoints are like nestable transactions. In c-pp they let us define/undefine values in a scoped manner. That is, a symbol defined in a savepoint will become undefined, or revert to its pre-savepoint value, if that savepoint is rolled back. It might be interesting to someday explore how savepoints might be used for content blocks as well, but the internals are not currently set up to do such a thing (we'd need to buffer all output to the db or memory, rather than sending it directly to the output channel).

#savepoint requires a single argument:

  • begin starts a new savepoint
  • commit saves all changes and closes the savepoint
  • rollback discards all changes made since the start of the most recent savepoint and closes that savepoint.

If a savepoint is neither committed nor rolled back by the end of its script file, it will automatically be rolled back. It is an error to try to end a savepoint when none is currently open.
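A minimal sketch of the scoping effect (assuming the current @token@ policy expands these tokens):

```
#define x 1
#savepoint begin
#define x 2
x is now @x@
#savepoint rollback
x is back to @x@
```

After the rollback, x reverts to its pre-savepoint value of 1.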

Why was #savepoint added? An idle thought of "wouldn't it be interesting to automatically undefine these vars at the end of the file which defined them?" led to "oh, savepoints can do that". Then it was actually really easy to add.

#stderr

Emits the remainder of line to stderr.

#stderr This goes to stderr along with file location info.

#undef

Undefines one or more defines:

#undef foo bar baz

TODO:

  • Treat quoted strings as glob patterns.

#undefined-policy

Specifies how c-pp should react to references made to undefined keys:

#undefined-policy ?push? error|null
#undefined-policy pop

The policy values are:

  • null (the default): treat undefined keys as falsy.
  • error: trigger an error if resolving an expression would require using an undefined key. This should probably be the default. The defined expression operator specifically does not trigger such errors.

push and pop work exactly as described for #@.
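For example, to make expression references to undefined keys fatal within just one region of a script (mustBeDefined is a hypothetical key which must be defined for this region to process without error):

```
#undefined-policy push error
#if mustBeDefined
...
#/if
#undefined-policy pop
```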

Infrequently useful, but...

#// This is a c-pp comment.

There must be a space after the // because that // is, despite appearances, parsed as a directive name.

Multi-line comments are not supported but #if can be used for the same effect:

#if defined nope
...
#/if

Add-on Directives

This section describes directives which are not part of the core library but which are in this source tree, available for copy/paste reuse. They may require third-party software. They may or may not also be pre-built into the library or CLI app.

The directives are listed in alphabetical order.

#c-code

This proof of concept directive filters input into C code formats.

Source file: d-c-code.c

#c-code -mode byte-array \
  -getter get_mah_bytes {
this is content
}

Emits something like:

unsigned char const * get_mah_bytes_get(unsigned * pLen){
  static unsigned char const _a[] = {
    10,116,104,105,115,32,105,115,32,99,111,110,116,101,110,116
  };
  if(pLen) *pLen=sizeof(_a);
  return _a;
}

And:

#c-code -mode byte-array -hex -name mah_bytes
...content goes here...
#/c-code

Emits:

unsigned char const mah_bytes[] = {
    0x23,0x69,<big snip>...
    0x0a
};

The block content may contain other directives, which is especially useful here with #include -raw.

-mode cstr has it emit the content as a string literal. The -comma flag tells it to add a comma at the end of each line unless that newline is at the end of the input.

#pikchr

This directive reads pikchr input and emits SVG-format image output.

Source file: d-pikchr.c

Usages:

#pikchr ...flags
... pikchr markup...
#/pikchr

Or:

#pikchr ...flags {
  ...pikchr markup...
}

Those differ in the following ways:

  • The block form may contain other directives, whereas {...} may not.

  • The block form's output is implicitly @token@-parsed using the current @token@ policy. The {...} form is not @token@-parsed by default but see the -@ flag.

It emits an SVG-format image or an error message. In the case of a pikchr() error, this directive emits the full pikchr result to the output stream before setting the error state to something less verbose than pikchr()'s error dump.

Flags:

  • -@
    Tells the {...markup...} form to @token@-parse the {...} block using the current policy. This flag is illegal in the block form (which is implicitly @token@-parsed according to the current policy).

  • -dark
    Tells pikchr to prefer a "dark-mode" color scheme.

  • -css-class STRING
    Adds the given CSS class(es) to the generated SVG image.

  • -unchomp
    Forces an additional newline on the output. May be used multiple times.

  • -chomp
    Removes one trailing newline from the output. May be used multiple times.

Regarding newlines: it's not specified whether pikchr output always includes a trailing newline. If -unchomp and -chomp are used together, results may be unpredictable.

"Function Calls"

As of 2025-11-11 c-pp supports a limited form of "function call" in the form [D ...args] where D is the name of a directive. This only works for directives which can function (as it were) without a closing directive (even if they do so only conditionally, e.g. #query, in which case only the closing-directive-less forms are legal here).

Calls work by doing the following:

  1. Copy the X part of [X], prepending the current delimiter to it. We have to copy it because of $REASONS.
  2. Redirect the current output stream to a buffer.
  3. Process the buffer from #1 as an input document.
  4. Restore the output stream to its previous state.
  5. Any output from that document is now in the buffer from #2, which becomes the result of the call. A single trailing newline is unconditionally trimmed from the result.

It's still being determined where this syntax should be legal, but here are some examples of where it currently is:

  • Expression tokens
  • #query bind values
  • #define values
  • During @token@ parsing, @[...]@ is experimentally a form of call and the ... part may span lines like [...] may.
  • The [sum ...args] directive was created as a demonstration of this feature, simply adding all integer-looking arguments together.

Some of the functions currently available: #arg, #join.
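For example, since #define values may themselves be calls, a computed value can be captured for later expansion (the define name is arbitrary; the second line assumes the current @token@ policy expands @three@):

```
#define three [sum 1 2]
1 + 2 = @three@
```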

The Library API

This section demonstrates how to use the library API from client C code. It is not an exhaustive guide (that's what the API docs are for) but is enough to get started with the library.

The first step is getting a preprocessor instance:

#include "libcmpp.h"
...
cmpp * pp = 0;
int rc = cmpp_ctor(&pp, 0/*optional flags*/);
if( rc ){
  // error
  if( pp ){
    // cmpp_err_get() will get the error info.
    cmpp_free(pp);
  }
  return;
}
... use pp ...

(Initialization will only fail if an allocation fails or if optional custom initialization code fails. In the former case, pp will always be NULL. In the latter case, the pp's error state holds info about the failure.)

Next, we set up an output channel:

cmpp_outputer out = cmpp_outputer_FILE;
out.state = stdout;
cmpp_outputer_set(pp, &out, "<stdout>");

Any output destination which can be wrapped in the cmpp_output_f() interface is suitable. Implementations are provided for FILE*, file descriptors, and cmpp_b (basically a dynamic string buffer), and adding one's own is normally trivial, e.g. sending output directly to a UI widget.

Then we feed it some input:

unsigned char const *input = ...a script full of input...;
int rc = cmpp_process_string(pp, "my-input.txt", input, -1);
if( 0==rc ) { ... success ... }

In essence it can take input from anywhere, but it requires that the input be completely available when parsing starts. The lowest level of feeding it input is cmpp_process_string(), where each call equates to a new input source. cmpp_process_file() and cmpp_process_stream() are both thin proxies around cmpp_process_string().

On success, all of the output will show up in the provided output channel. On error, the output may have been partially generated and must not be trusted as being complete or usable. Most errors cannot be recovered from without cleaning up all state, and practice shows that in this context there's little or no reason to attempt it.

When we're done we need to clean up:

cmpp_dtor(pp);

For the most part, that's really all there is to it.

The library can be extended with custom directives and several are demonstrated in d-demo.c and d-pikchr.c. Custom directives can perform essentially any jobs the builtin directives do, the notable exception being flow-control changes (like #if does). More properly, they can implement flow control but must provide the infrastructure needed for nesting such constructs and ensuring that they're closed properly. The internal infrastructure for doing so is probably not well-suited to general-purpose flow control, e.g. adding a hypothetical #while or #foreach loop. Similarly, the expression-evaluation API is not yet in the public API, and it's still being determined whether to make it so (because it's rather primitive).

Library Build Options

The library, for client-side use, is distributed in two files, libcmpp.[ch], which are products of its build process.

The following CPP defines influence how libcmpp.c is built. They have no effect on client code.

  • Any of the following can be used to automatically #include an external config header file:
    • -DCMPP_HAVE_AUTOCONFIG_H includes "libcmpp-autoconfig.h"
    • -DHAVE_AUTOCONFIG_H includes "autoconfig.h"
    • -DHAVE_CONFIG_H includes "config.h"
  • -DCMPP_CTOR_INSTANCE_INIT=function_name
    If set, the given function must have a signature of int f(cmpp*). It will be called as part of cmpp_ctor() so that any custom directives or autoloader can be added to each new instance. It will be called before the preprocessor installs any of its built-in directives, so custom directives may override builtin ones.
  • -DCMPP_MAIN
    Includes the main() impl for the c-pp binary.
    • -DCMPP_MAIN_INIT=func
      Works just like CMPP_CTOR_INSTANCE_INIT (see above) but applies only to the instance which main() uses. This is used, e.g., for plugging in custom/non-core/demo directives.
    • -DCMPP_MAIN_AUTOLOADER=func
      If defined then func must have the signature of cmpp_d_autoload_f(). It is installed as the main instance's directive autoloader.
  • -DCMPP_OMIT_...
    The following features are optional because they give scripts ways to access near-arbitrary content and may, in some uses, be security-relevant:
    • CMPP_OMIT_D_DB: omit #query, #attach, and #detach. The library still internally uses a database but does not directly expose it to input scripts.
    • CMPP_OMIT_D_INCLUDE: omit #include
    • CMPP_OMIT_D_MODULE: omit #module
    • CMPP_OMIT_D_PIPE: omit #pipe
    • CMPP_OMIT_ALL_UNSAFE: sets all of the above OMIT flags and will also cover any future directives which access the filesystem, invoke external processes, or similar. This flag currently only affects directives, not other library-level APIs, but an eventual goal is to be able to make all filesystem-specific parts optional. ("Unsafe" is too strongly-worded here but its heart is in the right place.)

Real-world Examples

Examples taken from real-life source trees. It's frequently the case that libcmpp features are added specifically to support stuff like this...

Embedding CSS files in JavaScript:

const styleSheet = new CSSStyleSheet;
const css = [
  //#c-code -mode cstr -comma
  //#include "../css/common.c-pp.css"
  //#/c-code
].join('');
styleSheet.replaceSync(css);

Generating Repetitive CSS

//#query {
  select 'us-standard' k, '63.5mm' w, '88mm' h
  UNION ALL select 'us-mini', '41mm', '63.5mm'
  UNION ALL select 'euro-mini', '45mm', '68mm'
  UNION ALL select 'euro-standard', '59mm', '92mm'
  UNION ALL select '80x80', '80mm', '80mm'
}
.ahct.card[data-card-size="${k}"] {
  min-width: ${w};
  max-width: ${w};
  min-height: ${h};
  max-height: ${h};
}
.ahct.card[data-card-size="${k}"].landscape {
  min-width: ${h};
  max-width: ${h};
  min-height: ${w};
  max-height: ${w};
}
//#/query

And

/* Generate various .filter-X-Y rules... */
//#query {
  WITH w(which) AS (
    SELECT 'grayscale' AS which
    UNION ALL select 'invert'
    UNION ALL select 'sepia'
    UNION ALL select 'saturate'
  )
  select which which, value num from w, generate_series(0,100,5)
}
.filter-${which}-${num} {filter:${which}(${num}%)}
//#/query

//#query {select value num from generate_series(0,355,5)}
.filter-hue-rotate-${num} {filter:hue-rotate(${num}deg)}
//#/query

(SQLite's generate_series() is not part of its standard library, but libcmpp embeds a copy because of this very use case. The alternative for such number generation is using a CTE, which works fine but is much more verbose.)

Background: Why Create c-pp?

In mid-2022 the SQLite project started work on its JS/WASM bindings. It was initially written for "vanilla" JS for the simple reason of personal preference of the guy writing the code, but it was clear we would eventually need to support ESM (ES6 modules) because that's what the modern-day JS ecosystem uses. Vanilla JS and ESM are 99.9% identical but each has tiny context-specific syntactic differences. Most differences in JS can be resolved via runtime introspection but syntactic differences make code outright illegal in one or other of the modes.

We had several options for dealing with this:

  • Ignore it. It might go away. This was tried, but pressure eventually mounted and my proverbial white flag had to be raised. (Tip: having a support contract with SQLite greatly increases the odds of one's own specific variety of pressure bearing fruit!)
  • Switch to ESM only. That wasn't going to happen (A) for the aforementioned reason about the one doing the coding and (B) because, at the time, some browsers could not yet launch ESM modules as Workers. Since the "killer feature" of the project's JS bindings was expected to be its integration with persistent client-side storage via OPFS, and OPFS is only available in JS Workers, point (B) held significant weight.
  • Maintain two copies with slight differences. No way. No. way. Nope.
  • Construct the sources dynamically. This could easily turn into a huge mess of scripts but... it still sounded like the best of the available options.

A notable restriction: one rule of the SQLite project is that we cannot simply import random code into it, so any tooling was going to have to be hand-rolled by members of the project. Spoiler alert: only one team member needed this tool, so it was up to them to implement it (double-spoiler alert: 🙋‍♂️).

First we tried a C preprocessor, as that's precisely the type of thing we needed, but it didn't take more than 15 minutes to determine that it was unsuitable for the job. Summary: C preprocessors make a mess of non-C code by injecting it with C-isms like #line markers or, in the case of GCC, a GNU license header. If gcc's preprocessor could have been taught to emit only its filtered inputs, without irrelevant other content, the story would have ended there and much subsequent effort could have been spared.

The SQLite project has a strong culture of "keep it simple" and "don't be shy about writing your own tools", instilled the hard way over 2.5 decades, and that culture has seeped into me in my time there. My built-in tendency, however, is to over-engineer everything, even otherwise simple shell scripts, a fault at odds with The SQLite Way. Even so... we needed a preprocessor, or something like it.

For logistical reasons, the choices had to come down to Tcl, dependency-free C, or the core Unix tools like sed, awk, and sh. A large handful of Tcl scripts already generate the core of SQLite, some much like a very-specific-purpose preprocessor. At the time, my Tcl-fu was not strong enough for me to confidently pull off my envisioned tool in Tcl. Maintaining JS code using shell scripts was, and remains, simply unappealing. So C became the implementation route of choice.

Writing dependency-free C code can be somewhat tedious, as one invariably ends up re-inventing the same set of utility code, like a memory buffer class and a function to read in a file's whole contents at once (possibly into one of those buffers). In this case we'd also need a hashtable early on and, sigh, it would have to be written3.

It turns out, though, that we could use sqlite3.h and still be effectively dependency-free because this tool would be embedded in SQLite's own tree. How convenient! Long story short: being able to use an in-memory db as a hashtable was a huge time-saver and had further downstream benefits.

So work began on the preprocessor with the self-imposed restriction that it do only what we need, and not (contrary to my core nature!) be designed as a generic, client-agnostic, tool. That meant, for example, that it would use only global state, read only from a single file handle, and write only to one file handle. (Whereas my natural tendency would be to abstract the I/O channel into a client-extensible interface, taking up more code, more time, and adding a feature we ultimately wouldn't use. Sigh.)

And thus c-pp was born.

c-pp has proven invaluable for its initial role. SQLite has, as of late 2025, some 8 or 10 different JS builds, all from the same core source files, and that would have been nigh impossible for our tiny team to reliably manage without some sort of source-filtering tool.

At some point my natural urge to over-engineer got the best of me and c-pp was refactored from a single-purpose monolithic app into a client-agnostic library, quickly more than tripling in code and docs. It would be difficult to justify adding that sort of complexity and code bloat to the SQLite tree, given that that tree needs exactly none of it, so the original/"lite" version is maintained over in the lite branch, tweaked only insofar as necessary for SQLite-side JS maintenance.

The trunk branch, contrariwise, is where my over-engineering gets to run rampant without risk to the SQLite JS builds. Some remnants of c-pp's original monolithic-app shape are still visible in its interface and code, but the trunk version has become a significantly different thing than its predecessor.

But why? Why do we need an over-engineered, client-extensible preprocessor?

We don't. Spoiler alert: i don't, either! The world has lots of problems and the ones this project ostensibly solves aren't among them. It is done because it interests me to do, and for no other reason.

Potential TODOs

  • Add the ability to persist the db? "The problem" with that is that the schema would then be "public", so couldn't be modified without some hassle. This would allow us to build up a db of values before processing, e.g. via a configure script. What could we really do with it, anyway?

  • Maybe #/* and #*/ as comment blocks. #if 0 works fine, though.

  • Maybe a #db directive with operations like:

    • #db open filename as dbName
    • #db trace dbName ?expanded? to filename (would be especially helpful)
    • #db trace dbName off
    • #db query dbName ... (as for #query)
    • #db close dbName

Reminders to self...

Currently none.


  1. ^ C preprocessors, when running in comment-retention mode, tend to inject # characters all over the place and may do silly things like automatically include compiler-specific headers and emit the comments from those. e.g. using gcc -E -CC will include a gcc-internal header and emit a GPL license header in the output. e.g. try:
    $ echo 'extern int x;' > y.c; gcc -E -CC y.c
  2. ^ We do not use a default of # because some source files this tool was initially designed to handle have lines which start with that (JavaScript class private members). In that particular tree we use a delimiter of //#. Even so, the docs use # because it's easier on the eyes than the real default is.
  3. ^ Writing hashtables is one of those things which becomes tedious the fourth or fifth time around.