Suppose that you have a long string and you want to insert line breaks every 72 characters. You might need to do this when writing a public cryptographic key to a text file.
A simple C function ought to suffice. I use the letter K to indicate the length of the lines, and I copy from an input buffer to an output buffer. The output buffer must have room for `length + length/K` bytes, since a line feed is appended after every full block of K characters.
```cpp
void insert_line_feed(const char *buffer, size_t length, int K,
                      char *output) {
  if (K == 0) {
    memcpy(output, buffer, length);
    return;
  }
  size_t input_pos = 0;
  size_t next_line_feed = K;
  while (input_pos < length) {
    output[0] = buffer[input_pos];
    output++;
    input_pos++;
    next_line_feed--;
    if (next_line_feed == 0) {
      output[0] = '\n';
      output++;
      next_line_feed = K;
    }
  }
}
```
This character-by-character process might be inefficient. To go faster, we might call memcpy to copy blocks of data.
```cpp
void insert_line_feed_memcpy(const char *buffer, size_t length, int K,
                             char *output) {
  if (K == 0) {
    memcpy(output, buffer, length);
    return;
  }
  size_t input_pos = 0;
  // use <= so that an input whose length is a multiple of K gets its
  // trailing line feed, matching the character-by-character version
  while (input_pos + K <= length) {
    memcpy(output, buffer + input_pos, K);
    output += K;
    input_pos += K;
    output[0] = '\n';
    output++;
  }
  memcpy(output, buffer + input_pos, length - input_pos);
}
```
The memcpy function is likely to be turned into just a few instructions. For example, if you compile for a recent AMD processor (Zen 5), it might generate only two instructions (two vmovups) when the length of the lines (K) is 64.
Can we do better?
In general, I expect that you cannot do much better than using the memcpy function. Compilers are simply great at optimizing it.
Yet it might be interesting to explore whether a deliberate use of SIMD instructions could speed up this code. SIMD (Single Instruction, Multiple Data) instructions process multiple data elements simultaneously with a single instruction: the memcpy function already uses them under the hood. We can access SIMD instructions through intrinsic functions, which are compiler-supplied interfaces giving direct access to processor-specific instructions while keeping the code readable and high level.
Let me focus on AVX2, an instruction set supported by effectively all x64 (Intel and AMD) processors. We can load 32-byte registers and write 32-byte registers. Thus we need a function that takes a 32-byte register and inserts a line-feed character at some position N in it. When N is 16 or greater, the function directly uses a shuffle mask from a precomputed shuffle_masks array to reorder the input bytes and insert the newline: a comparison with `0x80` identifies the insertion point, and a blend with a vector of newline characters writes the line feed. When N is less than 16, the function first shifts the input vector right by one byte to align the data correctly, using _mm256_alignr_epi8 and _mm256_blend_epi32, before applying the shuffle mask and inserting the newline.
```cpp
inline __m256i insert_line_feed32(__m256i input, int N) {
  __m256i line_feed_vector = _mm256_set1_epi8('\n');
  __m128i identity =
      _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
  if (N >= 16) {
    __m128i maskhi = _mm_loadu_si128(shuffle_masks[N - 16]);
    __m256i mask = _mm256_set_m128i(maskhi, identity);
    __m256i lf_pos = _mm256_cmpeq_epi8(mask, _mm256_set1_epi8(0x80));
    __m256i shuffled = _mm256_shuffle_epi8(input, mask);
    __m256i result = _mm256_blendv_epi8(shuffled, line_feed_vector, lf_pos);
    return result;
  }
  // Shift input right by 1 byte
  __m256i shift = _mm256_alignr_epi8(
      input, _mm256_permute2x128_si256(input, input, 0x21), 15);
  input = _mm256_blend_epi32(input, shift, 0xF0);
  __m128i masklo = _mm_loadu_si128(shuffle_masks[N]);
  __m256i mask = _mm256_set_m128i(identity, masklo);
  __m256i lf_pos = _mm256_cmpeq_epi8(mask, _mm256_set1_epi8(0x80));
  __m256i shuffled = _mm256_shuffle_epi8(input, mask);
  __m256i result = _mm256_blendv_epi8(shuffled, line_feed_vector, lf_pos);
  return result;
}
```
Can we go faster by using such a fancy function? Let us test it out. I wrote a benchmark using a large input string, on an Intel Ice Lake processor with GCC 12.
| method | speed | instructions per byte |
|---|---|---|
| character-by-character | 1.0 GB/s | 8.0 ins/byte |
| memcpy | 11 GB/s | 0.46 ins/byte |
| AVX2 | 16 GB/s | 0.52 ins/byte |
In my tests, the handcrafted AVX2 approach is faster than the memcpy approach despite using more instructions per byte. However, the handcrafted AVX2 approach stores the data to memory using fewer store instructions.