An experiment in reducing vector size with dimensionality reduction, lower-precision storage, and quantization while watching what it does to retrieval quality.
The Problem
Embedding-heavy systems hit a simple limit: vectors cost money to store, memory to move, and compute to score. A 512-dimensional embedding in float32 takes 2048 bytes. That is fine at small scale, but once the corpus grows, storage, cache pressure, and memory bandwidth start to matter.
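To make that concrete with an illustrative corpus size (the 10M figure below is hypothetical, not from the experiment):

dims, bytes_per_float32 = 512, 4
per_vector = dims * bytes_per_float32       # 2048 bytes per vector
corpus_gb = 10_000_000 * per_vector / 1e9   # ~20 GB of raw vectors at 10M documents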
The hard part is that the obvious savings can hurt search quality. If a smaller representation changes nearest neighbors, top-k ranking, or retrieval scores too much, the system gets cheaper and worse at the same time. The goal is to reduce size without meaningfully changing retrieval behavior.
The Main Compression Options
The main options we looked at were:
- Dimensionality reduction: replace the original vector with a smaller one, such as 1536 to 512 or 512 to 128
- Lower-precision storage: keep the same dimensions but store values in fp16 instead of float32
- Quantization: keep the same dimensions but store each value with fewer bits
These approaches make different tradeoffs. Dimensionality reduction removes coordinates, and with them information, from the representation. Lower precision keeps the same shape and loses only numeric precision. Quantization also keeps the same dimensionality, but stores a much coarser approximation of each value.
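In per-vector terms, the options compared below work out to:

dims = 512
float32_bytes = dims * 4   # 2048 bytes: the uncompressed baseline
fp16_bytes = dims * 2      # 1024 bytes: same dimensions, half the bits
int4_bytes = dims // 2     # 256 bytes: two 4-bit codes packed per byte
reduced_bytes = 128 * 4    # 512 bytes: a 512 -> 128 reduction kept in float32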
Option 1: Dimensionality Reduction
Dimensionality reduction is the most direct way to shrink embeddings because it reduces the number of coordinates outright. You can get there by training a smaller embedding model or by applying a projection step after the fact. PCA is the standard example.
In our case, the results depended a lot on how far we pushed it. Going from 1536 dimensions to 512 held up fairly well. Going from 512 to 128 did not. A naive 512 to 128 reduction dropped nearest-centroid agreement to about 0.7520, which was bad enough to rule it out quickly.
That was the main lesson from this path: the question is not whether fewer dimensions can work at all. The question is how far you can reduce them before retrieval quality falls apart.
A future dimensionality-reduction experiment would be to try a projection fitted to the data, such as PCA, instead of naive truncation.
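For reference, a minimal sketch of both paths, assuming "naive" means truncating to the first k coordinates, with an SVD-based PCA fit for comparison (the function names here are illustrative, not from the original experiment):

import numpy as np

def truncate(x: np.ndarray, k: int) -> np.ndarray:
    # Naive reduction: keep the first k coordinates, drop the rest.
    return x[..., :k]

def fit_pca(train: np.ndarray, k: int):
    # Principal directions from an SVD of the centered training sample.
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:k].T  # shapes (d,) and (d, k)

def pca_project(x: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    return (x - mean) @ components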
Option 2: fp16
fp16 is the simplest lower-precision option. The dimensionality stays the same, but each value drops from 32 bits to 16 bits. That cuts storage roughly in half.
Operationally, it is straightforward. There is no projection step, and no change to the basic structure of the vectors. It is just a smaller numeric format.
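In code, the whole transform is a single cast; a minimal sketch:

import numpy as np

embedding_f32 = np.asarray(embedding, dtype=np.float32)
embedding_f16 = embedding_f32.astype(np.float16)  # half the bytes; cast back to float32 before scoring if needed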
For us, fp16 preserved behavior extremely well. Nearest-centroid agreement was about 0.9998, which made it an easy win.
Option 3: Quantization
Quantization goes further by storing each value with fewer bits. In our case, we rotate the embedding, quantize each rotated dimension to 4 bits using per-dimension offsets and scales, and then pack the result into bytes.
At search time, those packed values are unpacked, dequantized, and rotated back into approximate float vectors before scoring. So the quality question is how much error that process introduces, and how much it changes retrieval behavior.
For us, 4-bit quantization held up reasonably well. Nearest-centroid agreement was about 0.9644. That was meaningfully worse than fp16, but still much better than naive 512 to 128 reduction.
For our rotated 4-bit quantization setup, fitting means choosing the rotation and estimating the per-dimension offsets and scales that map rotated values into 4-bit bins. Those parameters become the reusable artifact used at write time and search time.
Fitting the quantizer
import numpy as np

# Fit on a representative sample of stored embeddings.
train = np.asarray(sample_embeddings, dtype=np.float32)
# random_orthogonal_matrix is a helper; one possible construction is sketched below.
rotation = random_orthogonal_matrix(train.shape[1], seed=seed)
rotated = train @ rotation
# Per-dimension offsets and scales mapping rotated values onto the 16 bins.
mins = rotated.min(axis=0)
scales = (rotated.max(axis=0) - mins) / 15.0
scales = np.maximum(scales, 1e-8)  # guard against zero-range dimensions
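The rotation helper is not shown in the snippet above. A minimal sketch of one standard construction, assuming a Haar-uniform rotation built from the QR decomposition of a random Gaussian matrix:

def random_orthogonal_matrix(dim: int, seed: int) -> np.ndarray:
    # QR of a random Gaussian matrix yields an orthogonal Q; fixing column
    # signs by the diagonal of R makes the result uniformly distributed.
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))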
The important distinction is between fitting and transforming. Fitting learns the reusable compression parameters. Transforming applies them to each embedding you store.
Compressing an embedding
x = np.asarray(embedding, dtype=np.float32)
z = x @ rotation
# Map each rotated value to an integer code in [0, 15].
codes = np.rint((z - mins) / scales).clip(0, 15).astype(np.uint8)
# Pack two 4-bit codes per byte: even indices in the low nibble, odd in the high.
packed = codes[0::2] | (codes[1::2] << 4)
Reconstructing an embedding
# Unpack two 4-bit codes from each byte, restoring the 512 per-dimension codes.
codes = np.empty(512, dtype=np.uint8)
codes[0::2], codes[1::2] = packed & 0x0F, (packed >> 4) & 0x0F
# Dequantize, then undo the rotation; for an orthogonal matrix the inverse is the transpose.
z = mins + codes.astype(np.float32) * scales
approx = z @ rotation.T
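A quick way to sanity-check the round trip is to compare a reconstruction to its original; a sketch, reusing x from the compression step:

# Cosine similarity between the original vector and its reconstruction.
cos = float((x @ approx) / (np.linalg.norm(x) * np.linalg.norm(approx)))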
Transforming Vectors at Write Time
Once the chosen representation is fit, indexing is straightforward:
embedding_f32 = embed(document)                   # original float32 embedding
compressed = compressor.transform(embedding_f32)  # apply the fitted parameters
store(compressed)                                 # persist the compressed form
For dimensionality reduction, transform applies the projection. For fp16, it casts to half precision. For quantization, it rotates the embedding, quantizes each dimension into 4-bit bins, and packs the result into bytes.
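As one concrete shape for that interface, a minimal sketch of a hypothetical Int4Compressor that bundles the fitted artifact from the quantizer section:

import numpy as np

class Int4Compressor:
    # Bundles the fitted artifact: rotation, per-dimension mins and scales.
    def __init__(self, rotation: np.ndarray, mins: np.ndarray, scales: np.ndarray):
        self.rotation, self.mins, self.scales = rotation, mins, scales

    def transform(self, embedding_f32: np.ndarray) -> bytes:
        z = embedding_f32 @ self.rotation
        codes = np.rint((z - self.mins) / self.scales).clip(0, 15).astype(np.uint8)
        return (codes[0::2] | (codes[1::2] << 4)).tobytes()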
What to Measure
The evaluation loop should not stop at reconstruction error. For retrieval systems, the metrics that mattered for us were the ones tied to actual search behavior:
- nearest-centroid agreement
- nearest-neighbor overlap with the baseline
- top-k overlap on real queries
- rank or score changes on candidate sets
- latency, memory use, and index size so the quality tradeoff is tied to actual system gain
A smaller representation can look fine numerically and still move the wrong results around.
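As an example of measuring one of these directly, a minimal sketch of top-k overlap, assuming dot-product scoring and that the compressed side has already been reconstructed back to approximate float vectors:

import numpy as np

def topk(query: np.ndarray, vectors: np.ndarray, k: int = 10) -> set:
    # Indices of the k highest dot-product scores.
    return set(np.argsort(-(vectors @ query))[:k])

def topk_overlap(query, baseline_vectors, approx_vectors, k: int = 10) -> float:
    # Fraction of the baseline top-k that survives compression.
    return len(topk(query, baseline_vectors, k) & topk(query, approx_vectors, k)) / k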
What We Took From It
The takeaway from our experience was pretty simple. As a lower-precision baseline, fp16 was almost free from a quality perspective. 4-bit quantization introduced some loss, but held up well enough to be interesting as an actual compression path. Naive dimensionality reduction was much riskier once pushed too far.
The numbers made that pretty clear:
- fp16: nearest-centroid agreement about 0.9998
- 4-bit quantization: nearest-centroid agreement about 0.9644
- naive 512 to 128 reduction: nearest-centroid agreement about 0.7520
So for us, this was not about picking a universally best method. It was about measuring what survived contact with the real retrieval path. fp16 was an easy win. 4-bit quantization looked usable. Naive 512 to 128 reduction was bad enough to discard.
Closing
Storage reduction and compression were both worth exploring because the savings were real, but the methods were not interchangeable. In our setup, fp16 barely changed behavior at all. 4-bit quantization caused some degradation, but stayed within range. Naive dimensionality reduction from 512 to 128 was where things broke down.
It is also worth being clear about the system shape here. We are using Postgres with online, continuous updates rather than a dedicated vector stack, so we do not get a lot of the index structures and compression machinery that systems like FAISS can provide out of the box. For many teams, that is probably the first place to look. Our work here was about what made sense inside the constraints of a Postgres-based retrieval path that has to stay current as listings keep changing.
That was the useful lesson: the right representation is not the one that looks best in the abstract. It is the one that still behaves well enough in the actual retrieval pipeline.