COUNTLESS 3D: Vectorized 2x Downsampling of Labeled Volume Images Using Python and Numpy

A visualization of a 2x2x2 downsample of a labeled dataset. This is what COUNTLESS 3D does.

Previously, I demonstrated a fully vectorized algorithm, COUNTLESS, that downsampled labeled images by finding the mode of 2×2 patches without counting pixel value frequencies. Here, I generalize COUNTLESS to 3D images using a 2x2x2 patch and to arbitrary downsample factors. COUNTLESS 2D worked by finding a majority pixel without computing its total frequency; this was determined by detecting the presence or absence of two matching pixels within a 2×2 image patch. COUNTLESS 3D extends this logic by searching among eight voxels (three-dimensional pixels) for matches of four pixels, then of three, and finally of two. COUNTLESS N generalizes this technique to any input size, though it rapidly becomes impractical. COUNTLESS 2D and COUNTLESS 3D can be seen as application-specific implementations of COUNTLESS 4 and COUNTLESS 8 respectively. COUNTLESS variants are implemented using only vectorized instructions, which makes them practical in programming languages like Python that have fast linear algebra libraries like Numpy but slow looping and branching statements. While COUNTLESS 2D can outperform naive algorithms, using COUNTLESS 3D for high throughput applications is not advisable, as it is much slower than the standard approach. However, in Python it can be implemented without resorting to extra compilation steps and still provides reasonable performance.

Motivation

Figure 1. Volumetric images used in connectomics research, a field where we attempt to find the wiring diagrams of brains. Left: A 3D image formed from a stack of electron microscope scans of brain tissue. Right: A machine labeling of neurons in the left image by Kisuk Lee. This article concerns the type of image on the right.

While not commonly encountered by the average person, 3D images (volumetric images) are heavily used in biomedical imaging. A volumetric image can be constructed from a stack of 2D images acquired at regularly deepening intervals. MRI machines use magnets to non-invasively acquire images of brain slices, and cell biologists often use laser powered microscopes to scan samples at different depths. The acquired images are arranged in a stack in sorted order to form the final image.

My tissue of interest is a brain, and the method for acquiring those images is to slice it very finely using a machine akin to a deli slicer called an ultramicrotome. The resulting slices are then imaged with an electron microscope and assembled into a stack to generate a volumetric image. The slices are large, ranging between tens of thousands and hundreds of thousands of pixels per side. By contrast, a high end 4K resolution television or computer monitor displays images only three to four thousand pixels per side, and often the other side is half as large.

In order to display these images using consumer hardware, it is common practice to downsample them, that is, to create a series of smaller summary images that fairly represent the underlying high resolution image. For instance, on Google Maps, the world is displayed as a single stitched image (or a small number of them), but as you zoom in, higher resolution images of only the region of interest are fetched and displayed. With microscope images, we can downsample them by averaging 2x2x2 patches to create an eight times smaller image, and do so iteratively until the image is sufficiently small. However, once we generate labels to describe which voxel belongs to which neuron, averaging is not an option, as the labels are not analog signals but discrete identifiers. The most suitable way to summarize them is to choose the most frequent value within a patch.
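
For the grayscale microscope images, this kind of averaging reduction is only a few lines of numpy. Here is a minimal sketch, assuming a 3D array whose dimensions are divisible by two (the function name is mine, not from the original listing):

import numpy as np

def downsample_with_averaging_2x2x2(img):
    """Average each 2x2x2 block of a 3D image (assumes even dimensions)."""
    x, y, z = img.shape
    # Split each axis into (blocks, 2) so every 2x2x2 block gets its own
    # trio of length-2 axes, then average those axes away.
    blocks = img.reshape(x // 2, 2, y // 2, 2, z // 2, 2)
    return blocks.mean(axis=(1, 3, 5)).astype(img.dtype)

# Example: a 64^3 volume shrinks to 32^3.
volume = np.random.randint(0, 255, size=(64, 64, 64), dtype=np.uint8)
smaller = downsample_with_averaging_2x2x2(volume)  # shape (32, 32, 32)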

The storage of downsamples incurs additional cost, though it exponentially declines with each additional mip level. For a recursive 2×2 reduction, the storage and transmission cost of a stack of downsamples is 33% over the cost of storing the full resolution layer.

LISTING 1: Size of an Infinite Stack of Downsamples
Let S = The size of all the downsamples
Let r = The reduction factor of a downsample (e.g. 4 for 2x2)
S = 1 + 1/r + 1/r^2 + 1/r^3 + ...
S/r = 1/r + 1/r^2 + 1/r^3 + 1/r^4 + ...
S - S/r = 1
S = r / (r-1)

Therefore, the storage cost of a 2×2 downsample stack is at most 4/3, or 133% of the cost of the original image alone. The storage cost of a 2x2x2 downsample stack is at most 8/7, or 114.3% of the full resolution. For some use cases, this reduction can be enticing. For those so enticed, the question turns to how it might be achieved without counting.

COUNTLESS 5 – A Small Extension of the 2D Problem

At its root, COUNTLESS 2D relied on the idea that if two labels in a set of four match, it was not necessary to consider any additional information to declare the mode of that patch. It’s natural to ask if a similar principle can be stated for a set of five labels, the smallest possible extension of this problem.

Figure 2. The seven cases of COUNTLESS 5. (a) all labels match (b) all but one label matches (c) three labels match versus two labels (d) three match but the other two are both different (e) two pairs and one different (f) one pair and three different pixels (g) all different

In the case of four labels, we found that if any two match, they immediately form either a majority or a tie, in which case a member of either of the tied groups can be selected arbitrarily. In the case of five labels, a match of two labels is no longer sufficient to declare a winner, but a match of three is. If three labels match, they always form a majority with no ties possible. If there are no matches of three, then the mode will consist of matches of two. If no two labels match, then a label can be chosen arbitrarily.

Therefore, we must by some mechanism look for matches of three; if none exist, then matches of two; and if there are no matches of two, then, having thrown up our hands, select one convenient pixel. Searching for matches of three means checking every combination of three labels among the five, and likewise for matches of two.

LISTING 2: The Relevant Combinations of Five Labels
Let the five integer labels be denoted A, B, C, D, E
Combinations of three: 5C3 = 5! / 3!2! = 10 combinations
    ABC, ABD, ABE, 
         ACD, ACE, 
              ADE,
         BCD, BCE, 
              BDE,
              CDE
Combinations of two: 5C2 = 5! / 2!3! = 10 combinations
    AB, AC, AD, AE
        BC, BD, BE
            CD, CE
                DE
Combinations of one: 5C1 = 5 combinations
    A, B, C, D, E

To evaluate each combination, we can generalize the PICK(A,B) operator from COUNTLESS 2D to accept any number of arguments:

PICK(A,B,C, ..., N) := A if A == B == C == ... == N else 0    EQN. 1

This can be implemented in Python/numpy pseudocode ("&" is the bitwise AND operator; the parentheses matter, since "&" binds more tightly than "==" in Python):

PICK(A,B,C, ..., M,N) := A * ((A == B) & (B == C) & ... & (M == N)) EQN. 2
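
One possible vectorized implementation of this generalized PICK in numpy, as a sketch (the variadic signature is my own choice, not necessarily the article's exact code):

import numpy as np

def pick(*arrays):
    """PICK(A, B, ..., N): elementwise A where all arguments match, else 0."""
    a = arrays[0]
    match = np.ones(a.shape, dtype=bool)
    # AND together the chained tests (A == B) & (B == C) & ... & (M == N).
    for x, y in zip(arrays[:-1], arrays[1:]):
        match &= (x == y)
    return a * match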

Each application of PICK to a given combination will yield a label or zero. If any of our combinations of three yields non-zero, it qualifies as a candidate mode. If there is more than one match, for instance if there are actually four matching labels, we can pick any of the candidate modes arbitrarily. The short circuiting logical or operator has the right semantics to select a label if one exists.

MODE(A,B,C,D,E) := PICK(A,B,C) || PICK(A,B,D) || PICK(A,B,E)
    || PICK(A,C,D) || PICK(A,C,E) || PICK(A,D,E) || PICK(B,C,D)
    || PICK(B,C,E) || PICK(B,D,E) || PICK(C,D,E) || PICK(A,B)
    || PICK(A,C) || PICK(A,D) || PICK(A,E) || PICK(B,C)
    || PICK(B,D) || PICK(B,E) || PICK(C,D) || PICK(C,E)
    || PICK(D,E) || E                                         EQN. 3

As described previously, the || operator can be implemented like so:

LOGICAL_OR(X,Y) := X + (X == 0) * Y                           EQN. 4
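
In numpy this is a one-liner; a sketch matching EQN. 4:

def logical_or(x, y):
    """Keep x where it is nonzero, otherwise take y."""
    return x + (x == 0) * y

With pick and logical_or in hand, Equation 5 becomes a fold (reduce) over the candidate PICKs, as sketched below.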

Equation 3 has twenty-one elements that need to be calculated and logically ORed together. Is there some way to reduce the number of elements needed? It’s not possible to substantially reduce the time complexity of the general algorithm, but there are some marginal savings that make a difference for small numbers of labels. Notice that for the case of matches of two we can extend the trick used in COUNTLESS 2D to avoid computing matches of the last element. Using the notation PQ to mean PICK(P,Q), if neither AB, AC, AD, BC, BD, nor CD is a match, then we would be forced to choose one of AE, BE, CE, DE, or E, all of which are identical to E. Therefore we can omit AE, BE, CE, and DE, leaving seventeen elements to be calculated, a substantial savings.

MODE(A,B,C,D,E) := PICK(A,B,C) || PICK(A,B,D) || PICK(A,B,E)
    || PICK(A,C,D) || PICK(A,C,E) || PICK(A,D,E) || PICK(B,C,D)
    || PICK(B,C,E) || PICK(B,D,E) || PICK(C,D,E)
    || PICK(A,B) || PICK(A,C) || PICK(A,D) || PICK(B,C)
    || PICK(B,D) || PICK(C,D) || E                            EQN. 5

This trick works for the last label no matter how large the set of labels under consideration. That means for the case of N choose two, we can always reduce it to N-1 choose two combinations, since the N-1 pairs containing the final element are redundant: for example, 5C2 - 4 = 6 = 4C2.

Putting this all together yields some working Python code for COUNTLESS 5. Note that this is just a stepping stone to COUNTLESS 3D, perhaps better called COUNTLESS 8 as it is solving the mode of 2 x 2 x 2 voxels.
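
The article's full listing is embedded from the repository; a compact sketch in the same spirit, reusing the hypothetical pick and logical_or helpers above and assuming zero is not a valid label, might read:

from functools import reduce
from itertools import chain, combinations

def countless5(a, b, c, d, e):
    """Elementwise mode of five label arrays per EQN. 5.

    Assumes zero is not a valid label (shift the data up by one if it is).
    """
    labels = (a, b, c, d, e)
    # All ten matches of three, then the six abbreviated matches of two
    # (pairs containing E are redundant with the final fallback), then E.
    candidates = chain(
        (pick(*combo) for combo in combinations(labels, 3)),
        (pick(*combo) for combo in combinations(labels[:-1], 2)),
        (e,),
    )
    return reduce(logical_or, candidates)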

COUNTLESS 3D (aka COUNTLESS 8)

COUNTLESS 3D, the practical application of this concept, is now within reach. Downsampling a 2x2x2 volumetric image patch is equivalent to finding the mode of eight integers. The bare requirement for a candidate majority or tie is four matching labels. If there is a tie of 4–4, we can pick one of the matches arbitrarily. To evaluate COUNTLESS 8, we must consider matches of length four, then three, then two.

Equation 6 below shows the number of PICK calls that must be made. Note that the 7C2 term below comes from our simplification of 8C2 using the method shown in the COUNTLESS 5 section.

Total PICKs = 8C4 + 8C3 + 7C2
            =  70 +  56 +  21 
            = 147                                             EQN. 6

Note for now that PICK(A,B,C,D) is 1.5x more expensive than PICK(A,B,C) (six operations versus four) and 3x more expensive than PICK(A,B) (six operations versus two). This fact will be important later.

Below is the implementation of COUNTLESS 8. Note that creating all 147 PICK results at once would swell the required memory if we are not careful. Using Python 3 generator expressions, we can reduce the memory requirement of this program substantially by only creating a few additional derived images at a time.

Similarly to COUNTLESS 2D, the output of the PICK operator is nonsensical if the matching labels are zero (it returns 0 whether they match or not), so we shift the data up by one to accommodate zero labels and shift back down at the end.
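
The real implementation lives in the repository; a low memory sketch under the same assumptions (the helpers above, dimensions divisible by two) might look like:

from functools import reduce
from itertools import chain, combinations

def countless3d(data):
    """2x2x2 mode-downsample of a 3D label volume (low memory sketch)."""
    # The eight corner views of every 2x2x2 block.
    sections = [
        data[x::2, y::2, z::2]
        for z in range(2) for y in range(2) for x in range(2)
    ]
    # Shift labels up by one so a zero from PICK always means "no match".
    # (Assumes the dtype has headroom; production code upcasts if needed.)
    sections = [s + 1 for s in sections]
    # Matches of four, then three, then the abbreviated matches of two,
    # generated lazily so only a few derived images exist at once.
    candidates = chain(
        (pick(*c) for c in combinations(sections, 4)),
        (pick(*c) for c in combinations(sections, 3)),
        (pick(*c) for c in combinations(sections[:-1], 2)),
        (sections[-1],),
    )
    # Shift back down to recover the original label values.
    return reduce(logical_or, candidates) - 1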

Dynamic Programming COUNTLESS 3D

So far so good, but we could speed this up a bit more. As shown in Equation 6, the performance of COUNTLESS 3D is roughly proportional to the number of PICK operations that must be computed, which is in turn proportional to the number of basic operations they require. PICK(A,B,C,D) requires six operations, PICK(A,B,C) requires four, and PICK(A,B) requires two. We also need three operations per logical or.

LISTING 3: Number of Operations Required for COUNTLESS 3D
# of COUNTLESS_3D operations 
    = 8C4 PICK(A,B,C,D) + 8C3 PICK(A,B,C) + 7C2 PICK(A,B)
      + (8C4 + 8C3 + 7C2 + 1 - 1) logical ors
    = 8C4 x 6 + 8C3 x 4 + 7C2 x 2
      + (8C4 + 8C3 + 7C2) x 3
    = 686 + 441     // picks + logical or
    = 1,127 operations

Or more generally:

Eqn. 8: Establishing a lower time complexity bound for COUNTLESS N, where N ≥ 2. N represents the number of labels being evaluated; (N-1)C2 + 1 represents the abbreviated cost of the matches of two case. The left hand term refers to the cost of PICK and the right hand term refers to the cost of logical or.

So far, because the problem’s solution is governed by larger matches before smaller matches, we’ve been computing the largest matches first. However, the larger combinations can be built from the smaller ones using dynamic programming methods. At the expense of additional memory, we can treat the matches of two as subproblems of the matches of three, and matches of three as subproblems of matches of four. Since each new layer builds on top of the old ones by adding one element, this technique will enable us to drop the (r-1) coefficient in the PICK term by eliminating duplicate work.

Eqn. 9: Dynamic programming drops the coefficient (r-1) from the summation.

By pulling out the factor of two, we can simplify this equation to:

Eqn. 10: Simplified cost of the dynamic programming solution for COUNTLESS N.

However, this is somewhat annoying to calculate and communicate, so let’s simplify it with some Landau notation. The sum of all combinations NCr from r = 0 to N is equal to 2^N. Since combinations are symmetric and we are summing only to N/2, skipping a small amount of calculation, this works out to an exponential time complexity of O(2^(N-1)).

Eqn. 11: Exponential big-O time complexity.

Dynamic programming saves solved subproblems for use by the next iteration of the algorithm. In this case, that means saving the results of matches of length two and using them to bootstrap matches of three. This is very memory expensive, but there’s a trick to make it slightly more palatable.

LISTING 4: Not All Subproblems are Used
Let A, B, C, D, E represent five labels.
Let the notation AB := PICK(A,B) and ABC := PICK(A,B,C) and so on.
Combinations of Length Two: 
    AB, AC, AD, AE
        BC, BD, BE
            CD, CE
                DE
Combinations of Length Three:
    ABC, ABD, ABE, ACD, ACE, ADE
                   BCD, BCE, BDE
                             CDE

Notice that all length three combinations are prefixed by length two subproblems. For example, PICK(A,B,C) can be written as PICK(PICK(A,B), C). However, certain length two subproblems are never used as prefixes: AE, BE, CE, and DE cannot appear as a prefix because E always occurs at the end. Therefore, substantial memory can be saved by not memoizing subproblems that include the terminal element.

Putting this all together yields a dynamic programming version of COUNTLESS 3D that is substantially faster though more memory hungry. It requires storing the results of the previous iteration. This results in a peak memory consumption of:

Fig. 11: Combinatoric space complexity.
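
A sketch of the dynamic programming variant, again reusing the hypothetical helpers above: the PICK of each length-two combination is memoized by index tuple, and each longer combination extends a memoized prefix by a single element at a cost of two operations (one == and one *).

from functools import reduce
from itertools import chain, combinations

def dynamic_countless3d(data):
    """2x2x2 mode-downsample via dynamic programming (memory hungry sketch)."""
    sections = [
        data[x::2, y::2, z::2] + 1  # shift so zero means "no match"
        for z in range(2) for y in range(2) for x in range(2)
    ]
    # Matches of two, computed directly. Per LISTING 4, pairs containing
    # the terminal element are never needed as prefixes, so skip them.
    memo = {
        (i, j): pick(sections[i], sections[j])
        for i, j in combinations(range(7), 2)
    }
    # Matches of three, then four: PICK(prefix + X) is the memoized
    # PICK(prefix) masked by one new equality test.
    for r in (3, 4):
        for combo in combinations(range(8), r):
            prefix, last = combo[:-1], combo[-1]
            memo[combo] = memo[prefix] * (sections[prefix[-1]] == sections[last])
    # Logically OR matches of four, then three, then the abbreviated twos.
    candidates = chain(
        (memo[c] for c in combinations(range(8), 4)),
        (memo[c] for c in combinations(range(8), 3)),
        (memo[c] for c in combinations(range(7), 2)),
        (sections[-1],),
    )
    return reduce(logical_or, candidates) - 1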

How much faster should we expect the dynamic version of COUNTLESS 3D to be? Using equations 8 and 10:

LISTING 5: Comparing the Ratio of Eqn. 10 and Eqn. 8
Non-Dynamic = 1,127 operations ; from LISTING 3
Dynamic     = 5 * ( 8C4 + 8C3 + 7C2 ) ; from Eqn 10
            = 735 operations
Dynamic / Non-Dynamic = 0.65 or 1.53x faster

As the idealized dynamic programming solution requires only 65% of the work, we should roughly expect the dynamic programming solution to be about 1.5x faster. In practice, I find it to be between 1.3x to 2.3x faster, probably due to differences in operator speeds.

COUNTLESS N – Generalized COUNTLESS

It’s simple to generalize what we learned in formulating COUNTLESS 3D. COUNTLESS N is a very memory and compute hungry algorithm. I don’t expect it to be performant outside of a narrow range of parameters. However, the below algorithm, when given a 2x2x2 reduction factor, matches COUNTLESS 3D in performance.
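
The main new machinery in the generalization is carving an arbitrary reduction factor into its constituent strided views; a sketch of that step, with a hypothetical helper name, looks like:

from itertools import product

def block_views(data, factor):
    """Return the fx*fy*fz strided views covering each block of data.

    Assumes each dimension of data is divisible by its factor.
    """
    fx, fy, fz = factor
    return [
        data[x::fx, y::fy, z::fz]
        for x, y, z in product(range(fx), range(fy), range(fz))
    ]

With factor (2, 2, 2) this yields the eight sections used by COUNTLESS 3D; the same matches-of-r logic then runs with N = fx * fy * fz labels, from r = ceil(N/2) down to r = 2.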

Performance

COUNTLESS 2D’s performance metrics focused on sheer execution speed. However, COUNTLESS 3D and COUNTLESS N, because of the number of combinations involved, require measurements of memory usage as well.

In order to ascertain how fast this algorithm is, a comparison suite was developed and run on a dual core 2.8 GHz i7 Macbook Pro (circa 2014) with 256 kB of L2 cache and 4 MB of L3 cache. Max pooling, averaging, and striding were included for speed comparisons even though they are unsuitable for this task.

Python 3.6.2 with numpy-1.13.3 and clang-900.0.39.2 were used for the following experiments.

The tested algorithms:

  • striding: Pick every other pixel.
  • countless3d: The low memory implementation of COUNTLESS 3D.
  • dynamic_countless3d: The dynamic programming implementation of COUNTLESS 3D.
  • countless3d_generalized: The low memory implementation of COUNTLESS N with downsample factor (2,2,2) to match COUNTLESS 3D.
  • countless3d_dynamic_generalized: The dynamic programming version of COUNTLESS N with downsample factor (2,2,2) to match COUNTLESS 3D.

These algorithms were also tested even though they are inappropriate for handling segmentation to provide a point of comparison for other image processing algorithms:

  • downsample_with_averaging: Average the pixels in each 2x2x2 window.
  • downsample_with_max_pooling: Pick the highest valued pixel in each 2x2x2 window.

The code used to test the algorithms can be found here. The below numbers were tested on revision 15c0077.

Throughput

Table 1. Python 3 Downsample Performance in Megavoxels/sec on a 512x512x512 Numpy Array by Array Data Type, 5 Trials

The following numbers come from running the algorithms specified above five times each against a 512 x 512 x 512 voxel cube when the cube is allocated using uint8, uint16, uint32 and uint64. The results in Table 1 and Fig. 3 are specified in Megavoxels per second (MVx/sec), as usually one wishes to process a volume of a particular "physical" size. However, Table 2, Fig. 4, and Fig. 5 are specified in Megabytes per second (MB/sec) since the differences between runs are principally in the size of memory.

The four variants of COUNTLESS 3D consist of a 3D and N-dimensional implementation of the base algorithm along with their dynamic programming counterparts. The specific implementations’ performance is very similar to the generalized implementations in both base and dynamic algorithms. All such comparisons are within 5% of each other. Processing this 128 MVx cube takes about five seconds using the dynamic programming algorithm.

Figure 3. Python 3 Downsample Performance in Megavoxels/sec on a 512x512x512 Numpy Array by Array Data Type, 5 Trials showing only COUNTLESS 3D variants

In terms of MVx/sec, for uint8 arrays, dynamic programming produces about a 2.2x improvement in speed. For uint16 and uint32, it is closer to 1.5x. For uint64, it is about 1.27x. There is thus an apparently disappointing declining trend in performance as the data width increases; however, this is belied somewhat by the performance measured in MB/sec.

Table 2. Python 3 Downsample Performance in Megabytes/sec on a 512x512x512 Numpy Array by Array Data Type, 5 Trials

Figure 4 makes clear that performance actually increases slightly between uint8 and uint64, though uint64 dips slightly compared to uint32 in the dynamic implementation. The uint16 results seem to be an outlier: they are comparable to uint8 performance, and then performance jumps nearly twice as high for uint32. I suspect that either in library internals or on my chip, uint16s are being handled as uint32s. The improvement due to dynamic programming is substantial, but in the next section we will see that it comes at a cost.

Figure 5 is provided for context. Simple implementations of averaging and max pooling demonstrate far faster execution and a clear rising trend nearly commensurate with the increase in data type width.

Figure 4. Python 3 Downsample Performance in Megabytes/sec on a 512x512x512 Numpy Array by Array Data Type, 5 Trials showing only COUNTLESS 3D variants
Figure 5. Python 3 Downsample Performance in Megabytes/sec on a 512x512x512 Numpy Array by Array Data Type, 5 Trials including COUNTLESS 3D variants, averaging, and max pooling for context.

Memory Usage

The memory required to store the subproblems in the dynamic programming implementation of COUNTLESS 3D is obviously substantial. Below, I measured the memory requirement for processing the 128 MB 512x512x512 cube five times. Figure 6 shows the base algorithm in operation, while Figure 7 shows the dynamic programming algorithm. The following graphs were generated using Python 3.4 on Ubuntu 14.04 Linux using the mprof tool.

The base algorithm fairly consistently uses about 350MB of RAM or about 2.7x the memory required to store the cube. The dynamic programming algorithm requires about 1150MB, or about 9x the cube’s memory footprint.

Figure 6. Memory profile for ordinary COUNTLESS 3D over five trials.
Figure 7. Memory profile for dynamic programming COUNTLESS 3D over five trials.

Discussion

COUNTLESS 3D is a "slow" algorithm. The faster dynamic programming (DP) variant has an exponential time complexity of O(2^(N-1)), while the base variant is somewhat slower than that. The DP variant also has combinatorially growing space complexity. However, out of extraordinary good fortune, the special cases of 2×2 (N=4) and 2x2x2 (N=8) are commonly used and small enough to be practical, even under these horrendous conditions. The 2×2 case was shown to be capable, in some cases, of outperforming the standard counting algorithm, which was implemented as O(N²), though it could be reduced to O(N) through the use of hashmaps (though odds are a hashmap would be slower for small N due to increased constant overhead).

The 1.5x to 2.2x higher speed of the DP algorithm comes at a steep memory price: the base variant of COUNTLESS 3D consumes less than a third as much memory as the DP variant. Pay the piper 3x the memory for 2x the speed. Luckily, even in the uint64 case (~8–10 GB), it is still within a reasonable range for consumer grade setups for a single process.

It is possible to create C or Cython code that runs many times faster than COUNTLESS 3D while using nearly no additional space. Nonetheless, COUNTLESS 3D can serve a peculiar niche: running 3D downsampling on modestly sized datasets using vanilla Python/numpy installations without requiring additional compilation. This 3D algorithm can aid researchers and programmers that desire a library that will reliably deploy wherever Python/numpy does.

One use case where such researchers can get maximum performance is when downsampling binary masks of regions of interest. Certain kinds of semantic segmentation can make use of small numbers of labels. In that case, uint8 arrays provide more than enough labels and benefit from ~20 MVx/sec processing speeds. In connectomics, more typical requirements are uint16 for small volumes, and uint32 or uint64 for substantial projects. For those situations, I recommend compiled code.

A big reason why this article exists is that I had fun researching a full generalization of COUNTLESS 2D. Thanks for reading. 😇

Future Work

I plan to show comparisons with C implementations in the future. My expectation is that they will clock in at over 100 MVx/sec. Some C code exists on Github, but I am not yet convinced that it is correct.

Acknowledgements

Several people contributed helpful advice and assistance in developing COUNTLESS 3D and in the writing of this article. Kisuk Lee provided the visualization of 3D data shown in Figures 1 and 2. Chris Jordan provided the seed of the C implementation of counting and countless. Dr. George Nagy suggested generalizing COUNTLESS 2D. Jeremy Maitin-Shepard at Google originally developed the Python code for striding and downsample_with_averaging for use with neuroglancer. Special thanks to Seung Lab for providing neural segmentation labels.

Code

The code used for testing this pipeline can be found on Github. Contributions in additional languages and feedback are welcomed and will be credited.

March 11, 2019: Now available as part of the tinybrain PyPI package.


This article is a followup to "COUNTLESS – High Performance 2x Downsampling of Labeled Images Using Python and Numpy" (2017).