In this article, I describe my journey towards creating a color search engine. I start by formalizing the problem, then explain two approaches to tackling it: a color-distance-based approach and a deep-learning-based approach. Both approaches rely on K-nearest-neighbors search to execute queries; they differ in how they represent images and queries. Spoiler alert: the deep-learning-based approach works better.
What is a color search engine?
A color search engine is a search engine that takes one or more query colors as input and retrieves a collection of images containing these colors as dominant colors.
A very good example of a color search engine is Multicolr. Go ahead and check it out, it’s pretty cool!
A nice feature of Multicolr is that it lets you play around with the ratios of your query colors and retrieves images with a color distribution similar to the one you asked for. It also claims to be the best color search engine in the world. Let’s see if we can create one that rivals it ;)
Problem definition
Given N input RGB colors and N floating-point numbers between 0 and 1 that form a discrete probability distribution (they sum to 1), retrieve from a collection of images the top K images containing the given colors as dominant colors, such that the distribution of these colors in each image is similar to the given distribution.
Example query
Colors [[142, 220, 20], [10, 30, 255]]. Distribution [0.6, 0.4]
K-nearest-neighbors
One way to formulate this problem is as a K-nearest-neighbors problem. In this model, the dataset is represented as a collection of vectors, one vector describing each image. The dataset vectors are stored in a data structure that facilitates fast distance queries, such as a K-D Tree or a Ball Tree. Then, the query is represented as a vector and the K images associated with the K vectors in the dataset that are closest to the query vector are retrieved.
In order for this formulation to work, we need to find a way to represent both images and queries as vectors in the same space, and a way to compute distances between these vectors (a distance metric).
First representation strategy: dominant colors in an image and their distribution
One way to represent an image as a vector is by finding the N most dominant colors in the image and using them as a representation. A popular method for doing so is applying a clustering algorithm on the RGB color vectors of the image, then taking the cluster means as dominant colors.
Next, the distribution of these dominant colors can be computed simply by normalizing the pixel count of each cluster by the total number of pixels. The vector representing the image is then the concatenation of the RGB color vectors of the N most dominant colors and the probability distribution, sorted in descending order of pixel count, so the vector has (3 * N) + N dimensions. Representing the query as a vector in the same way is straightforward.
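As a concrete illustration, here is a minimal sketch of how such a vector could be built from the clustering output (the function and variable names are mine, chosen for illustration):

```python
import numpy as np

def image_to_vector(dominant_colors, pixel_counts):
    """Concatenate the dominant RGB colors and their distribution,
    sorted by descending pixel count, into a (3 * N) + N vector.

    dominant_colors: array of shape (N, 3), one RGB triplet per cluster center.
    pixel_counts:    array of shape (N,), number of pixels assigned to each cluster.
    """
    dominant_colors = np.asarray(dominant_colors, dtype=float)
    pixel_counts = np.asarray(pixel_counts, dtype=float)
    order = np.argsort(-pixel_counts)                 # most dominant color first
    distribution = pixel_counts[order] / pixel_counts.sum()
    return np.concatenate([dominant_colors[order].ravel(), distribution])
```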
Next, we need to define a function that computes the distance between a query vector and an image vector. Distance query data structures such as the K-D Tree and the Ball Tree require the distance function to be a true metric: it has to be non-negative and symmetric, and it has to satisfy the identity of indiscernibles and the triangle inequality. For more about the math behind distance metrics, I refer you to this Wikipedia article.
Color distance
How to compute a distance between two colors that makes sense perceptually is not an easy question to answer. An in-depth discussion about this topic can be found here. The author of that article proposes a true metric for measuring color distance in a way that’s meaningful to the human eye. I put my trust in them and I’ll be using their metric to measure color distance between the colors of an image and those of a query.
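For illustration, assuming the metric in question is the weighted-Euclidean "redmean" approximation (that is my assumption; the linked article remains the reference), it can be written as:

```python
import math

def color_distance(c1, c2):
    """Weighted Euclidean ("redmean") distance between two RGB triplets.
    Assumed to approximate the perceptual metric from the linked article."""
    r_mean = (c1[0] + c2[0]) / 2.0
    dr, dg, db = c1[0] - c2[0], c1[1] - c2[1], c1[2] - c2[2]
    return math.sqrt(
        (2 + r_mean / 256) * dr ** 2
        + 4 * dg ** 2
        + (2 + (255 - r_mean) / 256) * db ** 2
    )
```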
Distribution distance
There are many ways of measuring the distance between probability distributions. A popular one that comes to mind is the Kullback–Leibler divergence. However, it is not a true metric, as it is asymmetric. Another probability distribution distance is the Hellinger distance, which happens to be a true metric, so I’ll be using it to compute the dissimilarity between the color distributions of an image and a query.
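The Hellinger distance is straightforward to implement; a minimal version for two discrete distributions p and q:

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)
```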
Combined distance
Now that we have a way to compute the distance between two colors and a way to compute the distance between two color distributions, we’re ready to define a combined distance metric.
Our vector representation is a concatenation of N colors and a probability distribution. Therefore, given a pair of these vectors, we can compute the distances between the pairs of corresponding colors and the distance between the probability distributions and combine these distances linearly. It can be proven that the resulting distance function is indeed a true metric.
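Putting the pieces together, here is a sketch of the combined metric over the concatenated vectors defined earlier, plugged into scikit-learn’s nearest-neighbors machinery. The number of colors and the weight alpha are placeholders I chose for illustration, not values from the original system:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

N_COLORS = 5   # number of dominant colors per vector (assumed)
ALPHA = 0.5    # weight of the distribution distance vs. the color distances (assumed)

def combined_distance(u, v):
    """Linear combination of per-color distances and the distribution distance."""
    colors_u, dist_u = u[:3 * N_COLORS].reshape(-1, 3), u[3 * N_COLORS:]
    colors_v, dist_v = v[:3 * N_COLORS].reshape(-1, 3), v[3 * N_COLORS:]
    color_term = sum(color_distance(cu, cv) for cu, cv in zip(colors_u, colors_v))
    return (1 - ALPHA) * color_term + ALPHA * hellinger_distance(dist_u, dist_v)

# Usage sketch: `vectors` is an (n_images, 4 * N_COLORS) array built with image_to_vector.
# index = NearestNeighbors(n_neighbors=10, algorithm="ball_tree", metric=combined_distance)
# index.fit(vectors)
# distances, indices = index.kneighbors(query_vector.reshape(1, -1))
```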
Limitations of the combined color and distribution distance model
1. Order of colors matters
Credit goes to my colleague Hugo Chargois for pointing this one out. Recall that when creating the vector representation of an image, the dominant colors are sorted by their corresponding pixel counts. This causes a problem that may not be so obvious to spot.
To illustrate, I will present a pathological example. Suppose for simplicity that we are only computing the 2 most dominant colors for every image. Now let’s assume that we have an entry in our database that looks as follows:
image = [0, 0, 255, 255, 0, 0, 0.51, 0.49]
This vector says that the two most dominant colors in the image are blue (0, 0, 255) and red (255, 0, 0), with normalized pixel counts 0.51 and 0.49 respectively.
Now suppose we get a query that asks for blue (0, 0, 255) and red (255, 0, 0), but with a slightly different distribution 0.49 and 0.51 respectively. The query will be sorted by color distribution and its vector will look like this:
query = [255, 0, 0, 0, 0, 255, 0.51, 0.49]
We know that the image we have in our database is very similar to the query and should be retrieved by the system. However, this will not happen, because when comparing the corresponding colors of the query and image vectors, the system will find that there is a large color distance between the two, and will consequently not retrieve the image.
2. Difficult to tune the trade-off between color distance and distribution distance
The combined distance function we defined is a linear combination of color distances and a distribution distance. It is not obvious whether this is the best way of combining the two. Furthermore, it is not easy to decide on the optimal balance between encouraging distribution similarity and color similarity, i.e. how much weight should be given to the distribution distance.
A better representation strategy: deep neural embedding
Imagine we can train a neural network to represent any image with a D-dimensional vector that lies in a vector space where Euclidean distances correspond to color and color distribution distances combined. Imagine that we are also able to convert a query to an image, allowing us to obtain a representation for it in the same space.
This would allow us to overcome the limitations of our first model: we would no longer have to worry about color order or about explicitly modeling color distribution similarity. The question that remains is: how do we come up with these representations for images and queries?
Metric learning using the triplet loss
I got the inspiration from this paper, in which the authors describe how they trained a deep neural network with the triplet loss to predict representations of images and sketches in the same space, such that the representation of a sketch and the representations of its corresponding images lie close together, enabling accurate sketch-based image retrieval.
What if we could do the same, but instead of sketches, we used color queries? Would it work? Let’s find out!
Dataset
Fortunately, our problem does not need any labeled data. All we need is a large collection of diverse images. I used 60k Instagram images downloaded from an internal database we have at Synthesio. You can use any other general image dataset.
For every image in the dataset, I extracted its 5 most dominant colors and their corresponding pixel counts by applying K-means on the pixels and taking cluster centers as the dominant colors. Here’s the Python function I wrote to extract dominant colors and their distribution from an image:
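A simplified sketch of that function, assuming scikit-learn’s KMeans and a NumPy image array (the original implementation may differ in its details):

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_dominant_colors(image, n_colors=5):
    """Return the n_colors dominant RGB colors of `image` and their distribution.

    image: NumPy array of shape (height, width, 3) with values in 0..255.
    """
    pixels = image.reshape(-1, 3).astype(float)
    kmeans = KMeans(n_clusters=n_colors, n_init=10).fit(pixels)
    counts = np.bincount(kmeans.labels_, minlength=n_colors)
    order = np.argsort(-counts)                            # most dominant color first
    colors = kmeans.cluster_centers_[order].astype(int)
    distribution = counts[order] / counts.sum()
    return colors, distribution
```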
Now, from this information, I generate a sketch image: an image that contains randomly placed pixels of the 5 most dominant colors, in roughly the same proportions as in the original image. Here’s the Python function I wrote for generating a sketch image:
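Again as a simplified sketch, with the details assumed rather than copied from the original:

```python
import numpy as np

def make_sketch(colors, distribution, size=(100, 100)):
    """Generate a sketch image: randomly placed pixels of the dominant colors,
    sampled according to their distribution in the original image."""
    height, width = size
    # Pick a color index for every pixel according to the distribution.
    indices = np.random.choice(len(colors), size=height * width, p=distribution)
    sketch = np.asarray(colors)[indices].reshape(height, width, 3)
    return sketch.astype(np.uint8)
```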
Here’s an example of an image from Instagram and its corresponding sketch image:
Triplet mining
Next, a neural network is trained with the triplet loss. The network is trained on batches of triplets of images. Each triplet comprises an anchor, a positive, and a negative image. To construct a triplet, an image is selected randomly from the dataset; this will be our positive image. Next, the sketch, which will be the anchor, is constructed from the positive image. Finally, another randomly selected image from the dataset serves as the negative image.
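A minimal triplet generator along these lines, reusing the helper functions sketched above (and ignoring the unlikely case where the negative happens to equal the positive):

```python
import random

def generate_triplet(images, n_colors=5):
    """Build one (anchor, positive, negative) triplet from a list of images."""
    positive = random.choice(images)
    colors, distribution = extract_dominant_colors(positive, n_colors)
    anchor = make_sketch(colors, distribution, size=positive.shape[:2])
    negative = random.choice(images)
    return anchor, positive, negative
```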
Triplet loss
The network will predict an embedding (basically a D-dimensional vector) for each of its 3 input images (check figure). Our objective is to make Euclidean distances in the embedding space correspond to some combination of color and color distribution distance.
[Figure: the triplet network predicts one embedding for each of the three inputs: the anchor sketch, the positive image, and the negative image.]
This is achieved by minimizing what is known as the triplet loss. The triplet loss for a single triplet of representation vectors (a, p, n) looks like this:
loss(a, p, n) = max(‖a − p‖² − ‖a − n‖² + m, 0)
As the training progresses, the loss will decrease, which in turn means that the distances between the anchor and positive embeddings will gradually become smaller than the distances between the anchor and negative embeddings by at least m, the margin hyper-parameter (check figure).
I did not experiment much with the margin; I tried only two different values: m=0 and m=1. I found that if m=0, the model collapses to a state in which it outputs the same representation vector for any input image, as this makes the loss equal to 0. This is not desirable as the embedding space is no longer useful in that case. Therefore, I used m=1.
For a more detailed explanation of how the triplet loss works, I refer you to this excellent blog post by Olivier Moindrot.
Network architecture
There are no restrictions on the network architecture we can use to train with the triplet loss, as long as it outputs a D-dimensional vector. Popular CNN architectures such as ResNet or Inception can be used as a backbone. However, I am hoping to solve the problem with a model that allows fast inference without requiring a GPU.
I have therefore experimented with a few model architectures, including MobileNetV2, a vanilla multi-layer perceptron, and a classic convolution + feed-forward model. The model that worked best in terms of performance and inference time consists of a single convolutional layer followed by average pooling and, finally, a dense layer that projects to the size of the embedding space.
Here is the Python code describing the triplet model using the Keras functional API:
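In outline, it looks roughly like this; the embedding dimension, the number of filters, and the use of global average pooling are assumptions on my part rather than the exact original hyper-parameters:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

EMBEDDING_DIM = 64   # D; the exact value is an assumption
MARGIN = 1.0         # m

def build_embedding_model(input_shape=(100, 100, 3)):
    """Single convolutional layer -> average pooling -> dense projection."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(64, kernel_size=3, activation="relu")(inputs)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(EMBEDDING_DIM)(x)
    return Model(inputs, outputs, name="embedding")

def triplet_loss(_, y_pred):
    """Triplet loss on stacked (anchor, positive, negative) embeddings."""
    a, p, n = y_pred[:, 0], y_pred[:, 1], y_pred[:, 2]
    d_ap = tf.reduce_sum(tf.square(a - p), axis=-1)
    d_an = tf.reduce_sum(tf.square(a - n), axis=-1)
    return tf.maximum(d_ap - d_an + MARGIN, 0.0)

embedding = build_embedding_model()
anchor = layers.Input(shape=(100, 100, 3), name="anchor")
positive = layers.Input(shape=(100, 100, 3), name="positive")
negative = layers.Input(shape=(100, 100, 3), name="negative")
# The same embedding network is shared between the three inputs.
stacked = layers.Lambda(lambda t: tf.stack(t, axis=1))(
    [embedding(anchor), embedding(positive), embedding(negative)]
)
triplet_model = Model([anchor, positive, negative], stacked)
triplet_model.compile(optimizer="adam", loss=triplet_loss)
```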
Training details
I trained the model described above on triplets generated from the Instagram dataset. The data was not split into training and test sets. Overfitting was not a concern as I was using extreme data augmentation: shuffling the pixels of all 3 input images. A batch size of 64 was used. All images were resized to (100, 100). The model was trained for 123 epochs with 100 batches each.
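Wired up to the triplet generator, training then amounts to something like the following; the batching code is my reconstruction and assumes the images have already been resized to 100 × 100:

```python
import numpy as np

def shuffle_pixels(image):
    """Data augmentation described above: randomly shuffle an image's pixels."""
    flat = image.reshape(-1, 3)
    return flat[np.random.permutation(len(flat))].reshape(image.shape)

def triplet_batches(images, batch_size=64):
    """Yield batches of triplets plus a dummy target (the custom loss ignores it)."""
    while True:
        triplets = [generate_triplet(images) for _ in range(batch_size)]
        anchors, positives, negatives = (
            np.array([shuffle_pixels(img) for img in part])
            for part in zip(*triplets)
        )
        dummy = np.zeros((batch_size, 3, EMBEDDING_DIM))
        yield [anchors, positives, negatives], dummy

triplet_model.fit(triplet_batches(images), steps_per_epoch=100, epochs=123)
```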
Results
To qualitatively evaluate the model’s ability to generalize, I used the flowers dataset, which is a dataset of 8k images of flowers. I passed all the images in the flowers dataset through the network to obtain their embeddings, then I indexed the embedding vectors using scikit-learn’s nearest neighbors module.
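Roughly, and reusing the `embedding` network and `make_sketch` helper from above (the number of neighbors is arbitrary here):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# `flower_images`: array of shape (n_images, 100, 100, 3); loading and resizing omitted.
embeddings = embedding.predict(flower_images.astype("float32"))

index = NearestNeighbors(n_neighbors=10)
index.fit(embeddings)

# Color query: render the colors and distribution as a sketch, embed it, look it up.
sketch = make_sketch([[255, 0, 0], [255, 255, 0]], [0.5, 0.5])
query_embedding = embedding.predict(sketch[np.newaxis].astype("float32"))
distances, indices = index.kneighbors(query_embedding)

# Image query: embed the query image directly instead of a sketch.
```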
One way to assess how well the model has learned color representations is by applying a clustering algorithm on the image embedding vectors and looking at images that belong to the same cluster. Ideally, we should see images with similar colors in the same clusters. Indeed, here are a few images taken from one of the clusters produced by applying K-means with K=100 on the embedding vectors:
Very nice huh? Perhaps you’re still not convinced. Let’s try some queries. A cool side-effect of using this model is that we can now not only query by colors, but also query by image. That is, given a query image, retrieve images that have similar colors.
I present the output of 3 queries. The first two are color queries with the same colors (red and yellow) but different distributions; notice how increasing the ratio of yellow in the query retrieves images that still contain both red and yellow, but with more yellow than red. The third query is an image query. The image is taken from the flowers dataset, so it is expected to appear among the results, as it has zero distance from itself.
Query 1
Colors [[255, 0, 0], [255, 255, 0]]. Distribution [0.5, 0.5]
Query 2
Colors [[255, 0, 0], [255, 255, 0]]. Distribution [0.3, 0.7]