Introducing MyVector Plugin — Vector Storage & Similarity Search in MySQL


Shankar Iyer

With all the buzz around vectors, Gen AI and databases, I experimented with adding vector support to MySQL over the last couple of days. In contrast to PostgreSQL, MySQL does not come with extensibility interfaces for adding new data types, new index types or new access methods (think of the IndexAmRoutine interface in PostgreSQL or the Oracle Data Cartridge Interface). MySQL is extensible only in terms of adding a completely new storage engine. Between the choice of a specialized vector database and that of a new storage engine, I wanted to build a fully functional vector storage and search feature within existing MySQL + InnoDB.

In theory, there are two paths: 1) Fork MySQL and add vector support in the source code itself. This would involve touching the SQL layer and potentially InnoDB as well. 2) Design and build a fully functional feature using the MySQL Plugin architecture & UDFs (zero code changes inside MySQL/InnoDB).

Option #1 would require touching a wide range of code locations inside MySQL. That volume of code change comes with the cost of maintaining critical components (essentially a fork of community MySQL). Also, retrofitting a new datatype and a new index type into an existing code base is usually unclean and fraught with missed edge cases.

A plugin, in contrast, is seamless to build and deploy on any MySQL variant because it relies only on the stable, public plugin interfaces.

I am publishing what I have ready with the MySQL Plugin architecture from my experiments over the last couple of days.

Background to the design :- I found this paper succinct and rich; it contains examples of how UDFs are a great tool for extending a database to handle vector embeddings :

https://wwwdb.inf.tu-dresden.de/wp-content/uploads/48e25148570cf8f1b6685d3289cf298ebf96.pdf

Also, before trying things out, this post discussing vectors and vector indexes was insightful and practical : https://news.ycombinator.com/item?id=37750555

The high level design is to implement the vector index as an external, domain index in MySQL. An excellent conceptual read on extensible indexing is this publication (for Oracle Database) :-

https://homes.cs.aau.dk/~simas/teaching/oracle_spatial/Oracle8ExtensibleIndexing.pdf

In the first cut, the feature supports batched or bulk loading of the vector index. I found that approach somewhat aligns with :-

Bulk load and incremental maintenance of indexes : Check how SQL Server manages columnstore indexes : https://learn.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-data-loading-guidance?view=sql-server-ver16

[Edit] Demo : Check here : https://youtu.be/vkWIX1Ti_D8

Vector Storage : How to store vectors inside MySQL?

An embedding vector from OpenAI looks like this :

From https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

"embedding": [
  -0.006929283495992422,
  -0.005336422007530928,
  … (omitted for spacing)
  -4.547132266452536e-05,
  -0.024047505110502243
],

A vector is thus a sequence of floating point values, and the dimension of a vector is the number of values in it. An OpenAI vector can have 1536 dimensions or even 3072 dimensions (recently announced). Storing 1536 or more floating point values for a single entity attribute or table column raises interesting storage questions.

Storing the vector as-is (as a human readable string) in a VARCHAR column has two negatives : i) Space usage : if the floating point values are printed with a precision of 20 digits, each dimension consumes 23–24 characters/bytes, so a vector of 1536 dimensions takes around 36KB in total! ii) A vector in string representation is not usable in the arithmetic operations that are inherent to vector indexing & search.

Going back a bit, MySQL does not have an array datatype. The JSON datatype seems to suggest itself as a choice to store vectors. JSON datatype is flexible and an array of string values can certainly be persisted in a JSON column. But fundamentally, the individual values are stored as strings in the JSON column, same as in the VARCHAR column. InnoDB table level compression (via zlib) can be used to compress the JSON (or even VARCHAR) column and should result in good space savings. By basic theory, each digit in the floating point value string can be represented by 4 bits (0–9 and ‘.’, ‘-’, ‘ ‘). Thus table compression may bring down the storage of a vector column to 18KB or a bit less. But the key issue still remains — for arithmetic during vector indexing & search, the string representation is not helpful.

A floating point value in a C/C++ (or Java, Python, etc.) application is represented using a float or a double data type, consuming 4 or 8 bytes of memory/disk. The internal bit layout is of course the IEEE754 standard for representing floating point numbers and performing arithmetic. Speculating a little, the human readable OpenAI embedding vector was also most likely produced using floating point arithmetic in IEEE754 representation (using numpy?). Thus the intuitively optimal storage for a 1536-dimension vector is to represent the individual values as 4-byte native floats (via the under-appreciated atof() function!). We now need 1536 × 4 = 6KB to store a single vector. Such a vector of native floats can be used directly in arithmetic operations with another vector.
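To make the packing concrete, here is a minimal standalone C++ sketch of the string-to-native-floats conversion described above (parse_vector and to_bytes are illustrative names, not the plugin's functions):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parse a human-readable embedding like "[ -0.5, 1.25 ]" into native
// 4-byte floats (the atof()/strtof() idea from the text).
std::vector<float> parse_vector(std::string text) {
    for (char &c : text)
        if (c == '[' || c == ']' || c == ',') c = ' ';
    std::vector<float> out;
    std::istringstream ss(text);
    float v;
    while (ss >> v) out.push_back(v);
    return out;
}

// Serialize to the raw byte layout a VARBINARY column would hold:
// dimension * sizeof(float) bytes, no per-digit string overhead.
std::string to_bytes(const std::vector<float> &vec) {
    return std::string(reinterpret_cast<const char *>(vec.data()),
                       vec.size() * sizeof(float));
}
```

With this layout, a 1536-dimension vector serializes to exactly 1536 × 4 = 6144 bytes.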

Back to MySQL datatype selection. Since we are not adding a new storage engine, the right option is to use ordinary InnoDB tables for storing vector data as well. Our goal is now to store a column value of 6KB (or 12KB if the vector dimension is 3072) whose content is just bytes. Note that MySQL itself will in any case be unable to help with vector arithmetic or other search operations typical to vectors. Among MySQL's binary data types, VARBINARY is perfectly suited to hold this binary data. A VARBINARY column can hold a maximum of 64KB of data, which is sufficient to store a vector of up to 16K dimensions. If a single vector ever needs 100s of KBs or more, a larger LOB type is an option.

Note : In MySQL, a JSON datatype column is also ultimately a LOB datatype underneath.

MYVECTOR Column type Annotation

A MYVECTOR annotation is provided by the MyVector Plugin to declare vector columns. The plugin recognizes the MYVECTOR annotation in CREATE TABLE and ALTER TABLE and creates the appropriate VARBINARY column type (and something more!)

CREATE TABLE products (
  productid BIGINT NOT NULL PRIMARY KEY,
  productdesc VARCHAR(4096),
  pvector MYVECTOR(type=hnsw, dim=50, size=10000000, M=64, ef=100),
  pprice DECIMAL
);

Part 2 of this article series will go into the specifics of whether the MYVECTOR column should be defined in the critical operational table or in a secondary table (like 2NF).

Vector Input & Output

An embedding vector received from OpenAI as a human readable string cannot simply be saved into the above MYVECTOR() VARBINARY column :-

INSERT INTO products VALUES (1, 'Laptop', '[ -0.029252680018544197, -0.07079032808542252, -0.0169233288615942]')

MySQL will execute the above SQL successfully, but we end up with meaningless string data in a VARBINARY column that was intended to store native floats.

So the next step is clear — there has to be conversion from a human readable vector string to a byte sequence of 4-byte native floats. One option is to ask applications to perform this conversion. A C/C++ application could do the following :

float pvector[1536];
/* populate the above array with the embedding values */

const char *insertsql = "INSERT INTO products VALUES (1, 'Laptop', ?)";
MYSQL_STMT *stmt = mysql_stmt_init(&mysql);
mysql_stmt_prepare(stmt, insertsql, strlen(insertsql));

MYSQL_BIND bind[1];
memset(bind, 0, sizeof(bind));
bind[0].buffer_type   = MYSQL_TYPE_BLOB;  /* raw bytes for the VARBINARY column */
bind[0].buffer        = pvector;
bind[0].buffer_length = 1536 * sizeof(float);

mysql_stmt_bind_param(stmt, bind);
mysql_stmt_execute(stmt);

The above is also possible in Java/PHP/Go. A complementary conversion from the binary representation back to a human readable string is also required when fetching the vector column. (Off-topic: the vector itself is never needed for human reading or interpretation, but the vector as a string may be needed when interfacing with other APIs or micro-services.)

MySQL does have UDFs : SQL functions that can be implemented in C/C++ and dynamically linked into the mysqld process. Thus the MyVector plugin provides two functions for writing to and reading from a MYVECTOR column :-

MYVECTOR_CONSTRUCT(string value) :- returns bytes that can be inserted/updated into a VARBINARY column.

MYVECTOR_DISPLAY(varbinary value) :- returns a human readable string of a vector stored natively in a VARBINARY column.

Note : Conversion functions for working with datatypes are an old and well-tested technique in databases. (I was doing such conversions long ago in the Oracle C++ Call Interface when the binary float & double datatypes were introduced in Oracle 10g.)

With conversion functions, future improvements like compression or vector quantization before serialization are seamless.
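For instance, a future MYVECTOR_CONSTRUCT() could quantize to int8 before serializing, cutting storage by 4x. A hedged sketch, assuming simple max-absolute-value scalar quantization (QuantizedVec, quantize and dequantize_at are hypothetical names, not plugin APIs):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical int8 quantization: scale each float into [-127, 127]
// by the vector's max absolute value, shrinking 4 bytes per dimension
// to 1 byte (plus one stored scale factor per vector).
struct QuantizedVec {
    float scale;              // multiply back by this to dequantize
    std::vector<int8_t> data;
};

QuantizedVec quantize(const std::vector<float> &v) {
    float maxabs = 0.0f;
    for (float x : v) maxabs = std::max(maxabs, std::fabs(x));
    float scale = (maxabs > 0.0f) ? maxabs / 127.0f : 1.0f;
    QuantizedVec q{scale, {}};
    q.data.reserve(v.size());
    for (float x : v)
        q.data.push_back(static_cast<int8_t>(std::lround(x / scale)));
    return q;
}

float dequantize_at(const QuantizedVec &q, size_t i) {
    return q.data[i] * q.scale;  // lossy: small rounding error remains
}
```

Because the application only ever sees the string form via the conversion functions, such a change would be invisible to existing SQL.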

Vector Distance Function

Continuing the theme of UDFs, the plugin implements a UDF to compute the distance between two vectors :-

MYVECTOR_DISTANCE(vector1, vector2, distancetype) :- returns REAL

distancetype = "L2" | "EUCLIDEAN" | "Cosine" | "IP"

vector1 and vector2 are column names of type MYVECTOR, or any expression (like MYVECTOR_CONSTRUCT()) that produces a native vector of floats.
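The arithmetic behind such a distance UDF can be sketched in plain C++; these are the textbook formulas, not the plugin's internal implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean (L2) distance: sqrt of the sum of squared differences.
float l2_distance(const std::vector<float> &a, const std::vector<float> &b) {
    float sum = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Inner product ("IP"): larger means more similar.
float inner_product(const std::vector<float> &a, const std::vector<float> &b) {
    float sum = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Cosine distance = 1 - cos(theta): smaller means more similar.
float cosine_distance(const std::vector<float> &a, const std::vector<float> &b) {
    float na = std::sqrt(inner_product(a, a));
    float nb = std::sqrt(inner_product(b, b));
    return 1.0f - inner_product(a, b) / (na * nb);
}
```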

Constructing the Vector Index

A couple of vector index types are supported in v1 : in-memory HNSW and in-memory KNN. Both index types load the entire set of vectors into main memory.

A simple stored procedure to administer the vector search indexes is part of the MyVector plugin :-

call MYVECTOR_SEARCH_ADMIN(action, vectorcolumn, idcolumn, extra)

action : "build" | "rebuild" | "load" | "save" | "delete"

Once an HNSW index has been constructed in-memory, it can be saved to disk and reloaded into memory from disk after a MySQL restart.

The KNN in-memory index is just a simple in-memory structure that implements an O(N) brute-force neighbour search using std::priority_queue<>. This index type demonstrates the extensibility of the MyVector Plugin design.
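The brute-force idea can be sketched as a self-contained C++ function; knn_search is an illustrative name and the plugin's actual structures may differ:

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// O(N) brute-force nearest-neighbour scan that keeps the k nearest IDs
// in a max-heap keyed by distance, as a std::priority_queue<> permits.
std::vector<uint64_t> knn_search(
        const std::vector<std::pair<uint64_t, std::vector<float>>> &index,
        const std::vector<float> &query, size_t k) {
    // max-heap: top() is the farthest of the current k candidates
    std::priority_queue<std::pair<float, uint64_t>> heap;
    for (const auto &entry : index) {
        float dist = 0.0f;
        for (size_t i = 0; i < query.size(); ++i) {
            float d = entry.second[i] - query[i];
            dist += d * d;                  // squared L2 is enough for ranking
        }
        heap.push({dist, entry.first});
        if (heap.size() > k) heap.pop();    // evict the current farthest
    }
    size_t n = heap.size();
    std::vector<uint64_t> ids(n);
    while (n-- > 0) {                       // unload so the nearest ID comes first
        ids[n] = heap.top().second;
        heap.pop();
    }
    return ids;
}
```

Keeping only k candidates in the heap makes the scan O(N log k) rather than sorting all N distances.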

Vector Similarity Search — KNN and ANN

KNN refers to exact, brute-force neighbour search. KNN is easy using the MYVECTOR_DISTANCE() UDF documented earlier and plain old SQL :-

SELECT productid, productdesc
FROM products
ORDER BY myvector_distance(pvector, myvector_construct('[<searchvec>]'))
LIMIT 10;

KNN via ORDER BY is an O(N log N) algorithm : the distance from the search vector to each of the million(s) of vectors in the data table is computed, and a subsequent sort ranks the distances from nearest to farthest. Note that the vector itself is not copied around in the ORDER BY sorting implementation; only the distance and any projected scalar base table columns are part of the sort runs (I need to study the MySQL sorting code to understand whether selected columns are materialized before or after the sort).

Since KNN via ORDER BY is only practical for small-to-medium numbers of vectors, approximate nearest neighbour (ANN) search via specialized data structures or indexes is now the norm in vector search.

ANN Search in MyVector Plugin

The MyVector plugin provides a MYVECTOR_IS_ANN() annotation for users to express ANN search via a vector index in a SELECT statement.

SELECT productid, productdesc
FROM products
WHERE MYVECTOR_IS_ANN('db1.products.pvector', 'productid', '<search vector>')

The plugin recognizes the MYVECTOR_IS_ANN() predicate in the SELECT and executes the query via the following access path :-

  1. Lookup the vector index in MyVector and fetch the IDs of the nearest neighbours to the search vector
  2. Use the set of retrieved primary key IDs and efficiently fetch the nearest neighbour rows from the base table.
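Step 2 above can be pictured as generating an IN-list fetch from the index's answer; a hypothetical sketch using the products table from earlier (build_fetch_sql is not a plugin function, and the real rewrite happens inside MySQL's query rewrite machinery):

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Given the primary key IDs returned by the vector index lookup,
// build the SQL that fetches the matching base-table rows.
std::string build_fetch_sql(const std::vector<uint64_t> &ids) {
    std::ostringstream sql;
    sql << "SELECT productid, productdesc FROM products WHERE productid IN (";
    for (size_t i = 0; i < ids.size(); ++i)
        sql << (i ? "," : "") << ids[i];    // comma-separate the ID list
    sql << ")";
    return sql.str();
}
```

Fetching by primary key this way uses the clustered InnoDB index, so the expensive part of the query stays inside the vector index.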

There are multiple syntax options working now; I will decide based on feedback. A BigQuery-influenced syntax is shown below :-

SELECT articleid, title
FROM MYVECTOR_SEARCH['wikipedia', 'wikivecs.wvector', 'queryt'];

In the above, the base table is "wikipedia", the vectors are stored out-of-line in another table "wikivecs" (2NF), and the query vector is generated and temporarily inserted into a single-row table "queryt". The demo shows both options in action. The pgvector ORDER BY style for ANN is also in the works for the plugin.

Listing MYVECTOR Columns

A simple view, myvector_columns, is added so users can see all vector columns & search indexes in their database :-


myvector_columns is just a view on top of information_schema.columns!

Query Rewrite

Query rewrite has been a pragmatic and enduring technique in databases for engineering creative solutions! A couple of examples of typical query rewrites : 1) a database supporting materialized views rewrites a query on base table(s) to replace a base table with pre-computed view(s); 2) rewrite is similarly common in most table partitioning implementations; 3) replacing a correlated subquery with a join.

The MyVector Plugin uses query rewrite to complement the UDFs, tying together the end-to-end solution.

Demo of MyVector Plugin v1

Check here : https://youtu.be/vkWIX1Ti_D8

Please leave your feedback/questions/inputs on the above video and I will work on them!

Recap of the MyVector Plugin Solution

MYVECTOR() Column type Annotation

MYVECTOR_CONSTRUCT(), MYVECTOR_DISPLAY(), MYVECTOR_DISTANCE() functions

MYVECTOR_SEARCH_ADMIN() function for vector search index management

Multiple options for intuitive ANN search predicate :-

1. SELECT … FROM <table> WHERE MYVECTOR_IS_ANN(…)

2. SELECT … FROM MYVECTOR_SEARCH[…] ORDER BY

3. SELECT … FROM <table> ORDER BY MYVECTOR_DISTANCE_ANN(…) LIMIT n

That’s a small learning curve!

Features in v2 & beyond

Parallel HNSW index build

Incremental, batched index updates (inspired by Logstash)

Online index updates via binlog CDC, including UPDATE/DELETE

DiskANN or variant implementation

Support for additional search attributes in the vector index itself

The vector index in v1 is not updated online : it only offers full reconstruction (a simple incremental index update inspired by Logstash will be published shortly). A user thus needs to refresh the index manually at regular intervals. Online or in-transaction vector index updates are an interesting problem. In the MyVector Plugin approach, in-transaction update of the vector index is technically impossible because the MySQL and InnoDB source code is not touched. Binlog replication hooks could be an option.

What are the challenges of updating an index in every transaction? The answer is of course implementing commit, rollback and MVCC semantics. Updating a new, external index type within an InnoDB transaction will require large scale code changes in InnoDB and/or MySQL.

Doing MVCC for a new table or index technique is tough! Without MVCC, the external vector index could support online updates and simply document READ UNCOMMITTED (dirty read) semantics. A documented READ COMMITTED semantics, in contrast, is the pragmatic option for an external index. Hence in v2, the plugin will be enhanced to update the HNSW index via binlog post-commit change data capture. The vector index in v2 will support READ COMMITTED semantics, well explained here :-

https://asktom.oracle.com/Misc/oramag/on-transaction-isolation-levels.html

READ COMMITTED in Oracle is also well described here : https://docs.oracle.com/cd/B13789_01/server.101/b10743/consist.htm

Of course, there is an eventual-consistency implication, since the index is updated post-commit with the added latency of binlog capture/apply.

Part 2 of the article coming in a couple of days, thanks for reading!

Edit : Part 2 here https://medium.com/@shiyer22/introducing-myvector-plugin-vector-storage-similarity-search-in-mysql-part-2-74a7195226e2