I was talking to some cockroachDB developers at an event lately and discovered that many of them were, like me, using ElasticSearch for full text and complex querying in the cases where SQL was not appropriate. One of the topics that came up a lot was the difficulty in keeping the elastic indices up to date with changes that were happening in the database. A database with hundreds of tables and lots of cross-referencing is very difficult to synchronize without simply rebuilding the indices from time to time. In this post, I’m not going to share a lot of code, partly because it’s not public, but mainly because there’s a lot of it that is very specific to my project. Instead, I’ll mainly be discussing the technique to solve the problem

GraphQL

Luckily, I use GraphQL in front of my cockroachDB database, so this abstraction allows the enhancement of mutations to include requests to rebuild the affected index. It’s still pretty complex to figure out what indices a particular change will affect, but the extra abstraction that GraphQL gives is a great advantage, especially if you have a generic function for creating mutations shared by them all. First some background.

Searching

This search for beer returns films which feature beer in some way. I can hover over the reasons for selection and find out why – for example, the highlighted film was selected because it is about the product ‘beer’. Others were selected because the film metadata mentions it, there was someone associated with the film called Beer, it won an award for beer, because an automatic transcription from voice revealed the word beer being mentioned, or even because a video analysis of the film revealed that there was a beer associated image somewhere in the film. That’s a whole bunch of reasons for a film to be selected, all driven from a selection of Elastic Search indices.

In fact, here’s a selection of some of the reasons that a search might pick up a film. The same kind of selection criteria applies to find people, brands and so on.

 

So if assigning someone to a film for example, would modify the film findability, we really need a way to rebuild the film index to take account of that change more quickly than some kind of batch rebuild. What’s more we also need to rebuild the people index that features that person, along with anything else that person is associated with.

The database

Behind the scenes in the cockroachDB, there are as you’d imagine, tables for products, films, people, brands and so on – but there are also link tables to assign personToBrands, filmToPerson etc. Indeed there are several hundred tables and link tables in the project

Mutations

All updates to all tables pass through a GraphQL resolver – here’s a little bit of resolver for creating rows. I won’t bother with all the code – just the bit that requests an index rebuild
The piece of interest here is the elReindexing function. Given a tableName and the result of a mutation its purpose is to figure out if any Elastic indices need rebuilding, and if so which ones and which document ids are affected.

Reindexing

The reindexing function itself is provoked – it doesn’t run as part of the GraphQL API request which needs to be getting its update response back as soon as possible. Instead, it publishes a pubsub message, which in turn wakes up a Google Cloud Run function that does the selected reindexing.

Exactly what needs to be reindexed is driven by a definition file that is related to the ElasticSearch mappings. Here’s a snippet for what happens if a given brand document needs to be updated. From this, I can know that in addition to the brand document index, I also need to reindex the product, company and film indices.

However, we want to selectively update those indices – only those documents directly related to the original brand update. There are 2 parts.

 

Identify all elastic documents in any index (identified by the type property) that refer to the brandID being updated and remove them. The idField points to the Elastic mapping path at which they can be found

  • Find all data in the database related to the brand just updated. The gQuery property refers to a graphQL query to find them and the gPluck fields describe how to pull out their ids. These documents will be re-indexed

In principle, these 2 lists of ids should be the same, but it’s possible that the elastic documents being deleted might not exactly match the current state of the database. Here’s a snip of the code to do this.

I won’t go into the details of the helper functions as they are specific to this project, but in summary

  • getDocIds gets the document ids of elastic documents that match elastic queries built from the mapping reference file. These are combined into a bulk delete.
  • chunkBuild executes graphQL queries to get related items from the database, again built from the same mapping reference file, then to rebuild the affected indices.

Log file

Here’s what happens during a trial local run of a request to update brand id ‘1’. Notice it detected that other documents in other indices needed to be rebuilt too.

 

Back to the mutation

The exact same reference file can be used to identify what needs to happen when a given table is updated.  Since the table names are constructed in a standard way, it’s possible to deduce an affected table and link table name, and thus whether or not any re-indexing is required. For example, if a mutation involves the brandToFilm table a simple search of the mapping defintion file will reveal whether that provokes an update, and the same definition will reveal where to the find the id of the document that needs to be redone. The reindexing task has the responsibility of discovering everything else affected so there’s no need to worry about that in the API. Like this

Provoke reindex

The mutation provokes a pubsub message so that cloud run can get on and do the work.

This is actually a mutation resolver being reused – the makes it even possible to provoke a rebuild through GraphIQL if you need some repairs or custom rebuilding.

 

Summary

For as long as you keep table names standard (mine almost are, but they needed a little tweaking with minor upper and lower camel case issues), all indices will automatically be kept up to date. The API does a one-off preparation to fix those issues and make a faster lookup for my mapping definition file.

ElasticSearch, Google Cloud Run, Pubsub and GraphQL make a great team!

You can find an Apps Script version of this here.
Since G+ is closed, you can now star and follow post announcements and discussions on github, here