~cryo/bitmatch - Binary pattern matching on bits and bytes - sourcehut git

Binary pattern matching on bit-level.

#Features

Match binary data with bit patterns, not restricted to 8-bit boundaries.
Extract data via capture groups.
Portable ANSI C code (C89) under BSD license.
No external dependencies, no use of libc.
No dynamic memory allocations or usage of malloc and free.
Small footprint of 5 kilo bytes, suitable for micro-controllers.
Easy to embed by adding bitmatch.c and bitmatch.h to your project.

Matching bit patterns usually requires code to be written to match specific values. For example a protocol frame starts with a command byte followed by a 16-bit length and certain flags which can be one or more bits longs, like a 3-bit enum.

#Pattern syntax

Patterns are described as a ASCII string similar to Regular Expressions. Fields are extracted via capture groups. The application code only needs to work on the matched data, without worrying about matching or data extraction.

Token	Description
^	Pattern must match at begin.
$	Pattern must match at end.
L	Switch to little-endian (default)
B	Switch to big-endian.
( id	Begin capture group with id > 0.
(-	Begin non-capture group.
)	End of a group.
\|	Logical or for alternates.
m mod	Current bit position modulus mod must be 0.
# bits	Bit literal → #11.1010 (dot separator is optional)
x hex : count	Hex literal with bit count → x5:3
d dec : count	Decimal literal with bit count → d1001:16
_ count	Match any count of bits.

The bit, hex and dec literals are limited to 32-bits.

#Example: Match multi-byte UTF-8 codepoints

This example only demonstrates what is possible, it's not a particular useful real world use case.

Background: A UTF-8 encoded codepoint has one of the following forms:

110xxxxx 10xxxxxx                    // 2-bytes
1110xxxx 10xxxxxx 10xxxxxx           // 3-bytes
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  // 4-bytes

The pattern

The following pattern matches 2, 3 and 4-byte UTF-8 encoded codepoints:

m8 (1 _5 #110 _6 #10) |
   (2 _4 #1110 _6 #10 _6 #10) |
   (3 _3 #11110 _6 #10 _6 #10 _6 #10)

A closer look

Lets see what the first line "m8 (1 _5 #110 _6 #10) |" does:

Token	Description
`m8`	the pattern must be on a 8-bit boundary (bit stream position mod 8 == 0) (This prevents false positives on ASCII data in the bit stream)
`(1`	start a capture group with id 1
`_5`	take any 5-bits, this is the `xxxxx` in `110xxxxx`
`#110`	match 3-bits `110`
`_6`	take any 6-bits, this is the `xxxxxx` in `10xxxxxx`
`#10`	match 2-bits `10`
`)`	end of group 1, everything matched so far becomes the match (id: 1) value
\|	if the pattern fails at current position, try the following one

Note the order in which the fields are specified. We start at the least significant bit (LSB) and advance towards the most significant bit (MSB) of the data. Therefore iterate from right to left, because in memory UTF-8 data is stored as:

  byte 1  |   byte 0
----------+----------
5         |
1      8  |  7      0    bit position
10xxxxxx  |  110xxxxx

The code

We test the pattern on the string "lambda λ!" containing the 2-byte UTF-8 encoded Greek Small Lambda "λ" (U+03BB).

Full example in tests/03_utf8_multibyte.c

#include <stdio.h>
#include <string.h>
#include "bitmatch.h"

#define MAX_MATCHES 1
#define MAX_MEM     256

int main(int argc, char **argv)
{
    int match;
    bm_context bm;
    bm_match *m;
    bm_match matches[MAX_MATCHES];
    unsigned char mem[MAX_MEM];

    const char *pattern = "m8 (1 _5 #110 _6 #10) | "
                             "(2 _4 #1110 _6 #10 _6 #10) | "
                             "(3 _3 #11110 _6 #10 _6 #10 _6 #10)";

    /* lambda as 2-byte UTF-8: 0xCE 0xBB */
    unsigned char bin[] = { 'l', 'a', 'm', 'b', 'd', 'a', ' ', 0xCE, 0xBB, '!' }; 

    bm_init(&bm, &mem[0], sizeof(mem));
    bm_compile(&bm, pattern, strlen(pattern));

    match = bm_exec(&bm, &bin[0], sizeof(bin), &matches[0], MAX_MATCHES);

    if (match == 1)
    {
        m = &matches[0];
        printf("match id: %u pos: %lu, bit_count: %u, value: 0x%lX\n",
               m->id, m->bit_pos, m->bit_count, m->value);
    }

    return 0;
}

Executing this code prints:

match id: 1 pos: 56, bit_count: 16, value: 0xBBCE

#API documentation

Excerpt from the bitmatch.h header:

/** Entry for a bitmatch match.
 *
 * Needs to be allocated by the user on the stack
 * as array of expected matches.
 */
typedef struct
{
    unsigned long value;   /* value of a capture group */
    unsigned long bit_pos; /* position in the bit stream */
    unsigned bit_count;    /* count of bits in value */
    unsigned id;           /* id of the respective capture group */
} bm_match;

/** The bitmatch context.
 *
 * Needs to be allocated by the application on the stack.
 */
typedef struct
{
    unsigned instr_pos;
    unsigned instr_size;
    bm_word *instr;

    bm_status status;
} bm_context;

/** Initialize bitmatch context.
 *
 * The memory buffer needs to be supplied by the application,
 * when a pattern is compiled via bm_compile() the instructions
 * are stored in there. For simple patterns 128-256 bytes are
 * enough.
 *
 * \param bm the bitmatch context.
 * \param mem pointer to a memory buffer.
 * \param mem_size size of the mem buffer.
 */
void bm_init(bm_context *bm, void *mem, unsigned long mem_size);

/** Compile a pattern.
 *
 * \param bm the bitmatch context.
 * \param pattern the pattern.
 * \param size size of the pattern aka strlen(pattern).
 * \return 1 on success
 *         0 on failure with bm->status set to an error code
 */
int bm_compile(bm_context *bm, const char *pattern, unsigned size);

/** Execute a compiled binary pattern.
 *
 * \param bm the bitmatch context.
 * \param bin data to match against.
 * \param bin_size size of bin in bytes.
 * \param matches a buffer for keeping results.
 * \param matches_size number of maximum matches to return.
 * \return >0 on success, this is the number of matches
 *          0 on failure with bm->status set to an error code
 */
int bm_exec(bm_context *bm, const void *bin, unsigned long bin_size,
            bm_match *matches, unsigned matches_size);

#Build tests

cmake -S tests -B build
cmake --build build
cd build
ctest --output-on-failure

#Why bitmatch?

I needed a way to match arbitrary protocol frames and pass the data to a Javascript engine with handler code. The Javascript code shouldn't be concerned with matching, because that is one step too late. And the Javascript code shouldn't do cumbersome and slow data extraction. With bitmatch the Javascript code is kept minimal and does only run after a match.

Turns out this can be useful for all kinds of applications and so this little library came to be.