Building a “People who bought this item also bought” recommendation engine.

We recently had a client request to build a “People who bought this item also bought” recommendation engine to show on the product page of a B2B eCommerce site. Below is our approach, and why we chose it over the alternatives so as not to over-engineer the solution.

Initial High Level Design

The chosen approach was to calculate and preload the ‘most commonly purchased with’ items for all active products (~20K rows) using the invoice lines of the past 12 months (~1M rows), and then cache the resulting stock codes in Redis. As a fallback, the recommendations are calculated on the fly on page load in case the cache entry is missing, whether because the pre-load failed or for any other reason.

In the future we may weight some “signals” more heavily than others, such as favouring purchase links from other customers in the same industry. As noted above, it’s a balancing act: keep the recommendations as relevant as possible without overcomplicating the feature, and avoid recreating what would effectively be Matrix Factorisation using Supervised Machine Learning (ML).

We had also considered restricting the recommendations to products in the same top-level category, so that if you were purchasing a pen, it would recommend other types of pens or pencils. However, we quickly realised that there are many circumstances where this is a bad idea. If you are purchasing a whiteboard, the best recommendations are whiteboard markers and whiteboard erasers, which would generally sit in a different top-level category from the whiteboard itself.

Rough process flow of how the recommended products are generated.

There are 3 other development approaches we had considered:

Matrix Factorisation: It is common to build a recommendation system using Matrix Factorisation, as Netflix does. This was a tough decision, as we could have gone down the path of personalised recommendations, for example excluding certain categories of products based on what each user tends to purchase, in which case Matrix Factorisation would have been our pick. We chose not to go down this path, at least for the first iteration, as it would significantly complicate and lengthen the feature development, and we were happy with a simplified recommendation engine providing only the most common purchase links based on historical invoices. This simple model is hard to beat because it relies on popular behaviour from a single dataset. Matrix Factorisation, by contrast, involves ML, and you still need a default dataset for users with little to no purchase history. You would also need to learn which categories each user favours, based on what they view and purchase, which becomes problematic if the client changes category names. That is a lot of potential knock-on effects for an eCommerce site.
Bayesian Model: This method would be more suited to determining the future purchase frequency of a customer, rather than this feature, so we quickly dismissed this as an option.
Preloaded relational database table: A ‘purchased with’ table with columns item_a (main item), item_b (purchased with), and purchase_count (number of times the two were purchased together), queried on page load. Certainly an option, however we would end up with a 1:1 row count with the invoice lines, and we’d ideally like to avoid having another table with 1M rows.

Machine learning and graph databases are exciting, but it’s possible to get great results with simple algorithms in a standard tech stack. Neo4j and AWS Neptune are great examples of graph databases, however it is a big commitment to bring something like that into our tech stack.

Workflow

As the title describes, every approach relies on purchase history to recommend the products most commonly purchased alongside a given product. Since we have direct access to historical invoice lines, this is a perfect way to find which items were invoiced together (counting only lines without an attached credit note), and then sort them in descending order by how often they were purchased with the current product. What a mouthful.

The cache pre-load is a command that runs at 4:30am each morning, after the previous day’s invoice lines have been synced. It calculates the top recommendations for every product and stores them in Redis as stock codes, one key per product, with a key such as commonly_purchased_with_ABC123. On page load, if the key exists, the cached data is used. If it doesn’t, the recommendations are calculated on the spot (which should never happen in theory, since they are preloaded) and then pushed into the cache with a TTL of 24h.
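
The read-through and pre-load logic described here can be sketched roughly as follows. This is a minimal illustration in Python rather than our Laravel code: InMemoryCache is a hypothetical stand-in for Redis so the example runs on its own, and get_recommendations, preload_all and the calculate callback are names invented for the sketch.

```python
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24h TTL, as described in the text
KEY_PREFIX = "commonly_purchased_with_"


class InMemoryCache:
    """Hypothetical stand-in for Redis: stores values with an expiry time."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # an expired entry behaves like a cache miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


def get_recommendations(cache, stock_code, calculate):
    """Page-load path: use the cache, fall back to calculating on a miss."""
    key = KEY_PREFIX + stock_code
    cached = cache.get(key)
    if cached is not None:
        return cached
    recommendations = calculate(stock_code)  # slow path, normally avoided
    cache.set(key, recommendations, CACHE_TTL_SECONDS)
    return recommendations


def preload_all(cache, active_stock_codes, calculate):
    """Nightly job: recalculate and store recommendations for each product."""
    for stock_code in active_stock_codes:
        cache.set(KEY_PREFIX + stock_code, calculate(stock_code), CACHE_TTL_SECONDS)


# Usage with a made-up calculation function that records when it is called.
calls = []


def fake_calculate(stock_code):
    calls.append(stock_code)
    return ["XYZ789", "DEF456"]


cache = InMemoryCache()
first = get_recommendations(cache, "ABC123", fake_calculate)   # miss: calculates
second = get_recommendations(cache, "ABC123", fake_calculate)  # hit: from cache
```

The second call never reaches fake_calculate, which is the behaviour we rely on to keep the slow query off the product page.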

At the risk of being code-judged by fellow web developers, below is the raw query of how the stock codes are calculated against the invoice lines, and sorted by how often they’ve been purchased together. The query is built using Laravel’s Eloquent. It’s a simple but effective query as it utilises existing customer purchasing behaviours.
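
For readers who want something runnable, here is a minimal sketch of this kind of co-purchase query. It is not our Eloquent query: it uses Python’s built-in sqlite3 module, and the invoice_lines table with invoice_id and stock_code columns is an assumed, simplified schema. The 12-month window and the credit note filter are left out to keep it short.

```python
import sqlite3


def commonly_purchased_with(conn, stock_code, limit=5):
    """Return the stock codes most often invoiced alongside `stock_code`."""
    rows = conn.execute(
        """
        SELECT other.stock_code,
               COUNT(DISTINCT other.invoice_id) AS purchase_count
        FROM invoice_lines AS other
        WHERE other.invoice_id IN (
            -- sub-query: every invoice that contains the current product
            SELECT invoice_id FROM invoice_lines WHERE stock_code = ?
        )
          AND other.stock_code != ?
        GROUP BY other.stock_code
        ORDER BY purchase_count DESC, other.stock_code ASC
        LIMIT ?
        """,
        (stock_code, stock_code, limit),
    ).fetchall()
    return [code for code, _count in rows]


# Made-up sample data: four invoices, three of which contain a whiteboard.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice_lines (invoice_id INTEGER, stock_code TEXT)")
conn.executemany(
    "INSERT INTO invoice_lines VALUES (?, ?)",
    [
        (1, "WHITEBOARD"), (1, "MARKER"), (1, "ERASER"),
        (2, "WHITEBOARD"), (2, "MARKER"),
        (3, "WHITEBOARD"), (3, "MARKER"), (3, "PEN"),
        (4, "PEN"), (4, "PENCIL"),
    ],
)
recommendations = commonly_purchased_with(conn, "WHITEBOARD", limit=3)
print(recommendations)  # ['MARKER', 'ERASER', 'PEN']
```

Ties on purchase_count are broken alphabetically here purely so the output is deterministic; a real implementation might prefer a different tie-breaker.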

This whole feature took the better part of a day to develop and we’re happy with the recommendations shown to the user: they appear relevant, the page loads quickly, and we have already seen an increase in average order value (AOV) within the first 24 hours, although it is too early to draw conclusions from that data. We do have some thoughts about how to scale it in the future and improve the dataset for the products, however it’s always a fine line between the perfect solution and over-engineering the solution for the client, and Matrix Factorisation is a good example of this.

Optimisations & Scaling

Fast page loads are a prerequisite for any site these days, and even more so for eCommerce sites. If your page takes longer than 1–2 seconds to load, you risk losing a sale to a competitor. With page speed factored into Google’s ranking algorithms, your organic search rankings may suffer if your site is considered too slow. This is a major consideration when building any feature that appears on 20K product pages.

Prior to preloading the cache, we noticed that for the most commonly purchased item, which accounts for 10% of all invoice lines over that 12-month period, it took 4s to calculate its recommendations list, due to the sub-query that finds the invoices containing that stock code. For less commonly purchased items, the query took roughly 0.005s, which would be acceptable, but 4s certainly is not.

Caching with a daily pre-load into Redis was the solution: it keeps the recommendations as fresh as the invoice data while ensuring the page loads fast. We follow the same pattern elsewhere on the site. Anything that requires heavy processing and is the same for most users, such as big queries and analytics reports, is cached and pre-loaded.

Measuring Success

Success for this feature is measured by whether users actually use it, whether it achieves its intended purpose of increasing sales, and, on a deeper level, how useful the recommendations are.

This will be measured through:

Increased average order value (AOV). If users add the recommended products to the cart, this metric should increase.
Increased number of items per cart. As above, if users add the recommended products to the cart, this metric should increase.
Relevant & useful product recommendations. This is difficult to measure, so the metrics above will be used as a decider, along with a splash of common sense and user feedback. Netflix A/B tests clickthrough rates by rotating various movie images, measuring which one is clicked the most and using that as the default. This is also a great way to measure what is effective and useful to show.

These metrics will be measured via Google Analytics over the next 3 months.
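
For clarity on how the first two metrics are defined, here is a small sketch of the arithmetic. The order_metrics function and the order figures are made up for illustration and are not real results from the site.

```python
def order_metrics(orders):
    """Return (average order value, average items per order)."""
    if not orders:
        return 0.0, 0.0
    total_value = sum(order["total"] for order in orders)
    total_items = sum(order["item_count"] for order in orders)
    return total_value / len(orders), total_items / len(orders)


# Illustrative numbers only: two orders before and two after the feature.
before = [{"total": 100.0, "item_count": 2}, {"total": 50.0, "item_count": 1}]
after = [{"total": 120.0, "item_count": 3}, {"total": 60.0, "item_count": 2}]

aov_before, items_before = order_metrics(before)  # 75.0 and 1.5
aov_after, items_after = order_metrics(after)     # 90.0 and 2.5
```

Comparing the two periods over a long enough window is what tells us whether the recommendations are pulling their weight.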

If you have further questions, please feel free to share your comments, or contact us via our website: digitalbird.com.au.