HarfBuzz Study: PackTab

4 min read Original article ↗

HarfBuzz Study: PackTab

A static integer table packer

behdad
July 29, 2022

Introduction

In HarfBuzz we use a lot of generated C code, and PackTab is one of the tools we use for that. In this writeup I will expand on what it is, what it does, where it comes from, how it does it, and where else it might be useful.

What it is

PackTab, as the name might suggest, packs a table of numbers into a compact form. Since HarfBuzz deals with Unicode data, we have many data tables that need to be stored in an efficient manner, and PackTab is one of the tools we use to generate such tables.

What it does

PackTab, in its current incarnation, is a small Python module. The main entry point, pack_table(), takes a list of integers, a default value, and a compression level, and returns a solution. You can then write out the solution as C code:

    solution = pack_table(data, default, compression=compression)

    code = Code('data')

    expr = solution.genCode(code, 'get')

    code.print_c()

Let’s test this with:

    data = [-2, -2, 4, -2]

    default = 0

    compression = 1

This will generate the following output:

static const uint8_t

data_u8[1] =

{

  4,

};

static inline unsigned

data_b1 (const uint8_t* a, unsigned i)

{

  return (a[i>>3]>>((i&7u)<<0))&1u;

}

int_fast8_t

data_get (unsigned u)

{

  return u<4u?-2+(int_fast8_t) 6*data_b1(data_u8,u):0;

}

The entry-point of interest is the data_get() function, which returns the data array we passed in when invoked with the index.  For example, data_get(2) will return 4, while data_get(3) will return -2 and data_get(4) will return the default value of 0.

You might notice that the generated code is not optimal in its use of a data array of size 1 byte where the data could fit an integer. That is a limitation of packtab currently. With larger data arrays this is not a practical limitation however.

There is also a module main entry point to play with this. The same test can be done by invoking:

$ python -m packTab -2 -2 4 -2

Where it comes from

I do not know of a generic tool similar to packtab. Many projects include similar homegrown tools. For example, the CPython project as well as the ICU project both include tools to compact their Unicode data tables.

I first developed packtab in C in 2001 for the FriBidi project. That code was so unreadable and hard to extend that I then rewrote an improved version in Python in 2019 for use with HarfBuzz. Unfortunately, and this will remain a stain in my resume, the rewritten version is equally unreadable.

How it does it

Both versions explore the same solution space, but the C version does that by backtracking, whereas the Python version does that by dynamic-programming.

The main idea is to split the data array into chunks of a power-of-two size, see how many unique such chunks are there, and try to store them efficiently recursively.

The Python version also has extra smartness built in to reduce storage needs by dividing data entries by their greatest-common-divisor as well as storing array members as sub-byte entries, as 1, 2, or 4 bit per entry.  Both of these features are demonstrated in the example above, where the values -2 and 4 are produced by adding -2 to the result of multiplying 6 by the value of a 1-bit array.

Other features

Mnemonic names in the data array are supported. In that case, all data list entries must be strings, and their values are assumed to fit in an unsigned byte. They are verbatim printed in the generated C code, and should be resolved to C macro or enum values at compile time.

Results

We use packtab in four table-generators in HarfBuzz. The main consumer being the Unicode Character Database table provider. You can check the code for these at:

And the respective generated table code:

The compression level allows for trading off speed versus memory. For example, with the UCD table, the compression level 1 consumes 56kb of data, while level 3 consumes 36kb.

Porting the USE-table from a hand-written generator to packtab made it go from 13kb of data down to 4kb.

Where else it might be useful

Packtab is not specific to HarfBuzz, nor to Unicode data. It is useful anywhere a highly-redundant / sparse static integer data array needs to be stored compactly.

It would be interesting to see it adopted by more projects. Sometimes, the data arrays need some massaging before passing to packtab, to make it more palatable, by making them sparse.