Caching Object Graphs Is Terrifying

Posted on Sun 07 July 2019 in technology

All systems start out with a webserver and a database. This arrangement promptly keels over at the first sign of any sustained load, at which point the bright-eyed engineer who really should know better says "let me just add some caching". Two weeks later, the engineer's daughter is asking why Daddy "has a lot of wizard hairs", and the engineer has a thousand-yard stare.

Caching, as the old saying goes, is hard. Caching graphs of objects is even harder. Read on to find out why.

Example Object Chain

We will use the following graph of interdependent objects to guide our discussion; objects further up the graph depend on the ones lower down.

     F(E)   G(E)
       \    / \
        \  /   \           
       E(B,C)  D(C)     upstream
        /  \   /           
       /    \ /
     B(A)    C
      |
      |
      A

Assumptions

Our first assumption is that all downstream nodes already exist in the database before we attempt to persist the parent.

In the above chain, A would need to exist in the database before B(A) could be created, and B(A) and C would in turn have to exist before E(B,C) or D(C) could be constituted.

I like to call the second assumption in this system The Guarantee: whenever cache items are created, updated, or touched, the touches are ordered so that a node's downstream dependencies are always touched before the node itself.

The reasoning behind this is driven by the LRU (least recently used) eviction algorithm used by memcache: since we guarantee that dependencies are going to be touched before their parents, the nightmare scenario of, say, E(B,C) being evicted and leaving no way for an invalidation signal to travel from A to F(E) or G(E) can never come to pass. At most, a leaf such as A can be evicted; this is okay, because it will be recreated in a consistent manner when it is next accessed (see Search below for details).
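To see why this ordering protects the graph, here is a toy LRU store touched in Guarantee order. `TinyLRU` is purely illustrative, a stand-in for memcache's eviction queue, not a real client:

```python
from collections import OrderedDict

class TinyLRU(object):
    # Minimal LRU store, standing in for memcache's eviction queue.
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def touch(self, key):
        self.items[key] = True
        self.items.move_to_end(key)         # mark as most recently used
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used

lru = TinyLRU(capacity=3)
# Guarantee order: each node's dependencies before the node itself
# (A before B; B and C before E).
for key in ["A", "B", "C", "E"]:
    lru.touch(key)
# Only the leaf "A" is evicted; the interior node E survives, so the
# invalidation chain through E remains intact.
```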

Caching Primitive

Keep it simple: for each node persisted, use a cryptographic hash (I use SHA-256; feel free to choose your own poison) of the node as a key, and persist a pickled instance of this object as the value:

class CacheObject(object):

    def __init__(self, payload=None):
        # A direct copy of what's coming out of the database
        self.payload = payload

        # Keys of the *upstream* nodes that need to be notified of
        # changes to this node (the "dependency array")
        self.dependencies = set()
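To make the keying and pickling concrete, here is a minimal sketch. The plain dict standing in for memcache and the helper names (`make_key`, `cache_node`) are assumptions for illustration, not a real client API:

```python
import hashlib
import pickle

# Plain dict standing in for a memcache-like store.
cache = {}

def make_key(node_id):
    # SHA-256 of the node's identifier; any stable, canonical
    # representation of the node works as hash input.
    return hashlib.sha256(node_id.encode("utf-8")).hexdigest()

def cache_node(node_id, payload, dependencies=()):
    # Persist a pickled record mirroring CacheObject's two fields.
    record = {"payload": payload, "dependencies": set(dependencies)}
    cache[make_key(node_id)] = pickle.dumps(record)

cache_node("A", {"row": "a-data"}, dependencies={"B"})
restored = pickle.loads(cache[make_key("A")])
# restored["payload"] == {"row": "a-data"}
```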

Database Operations

Cache operations are married to database operations and transparent to the caller; cache actions for each possible database operation are listed below.

Create

If a database entry is not available, return an empty cache object. The client can figure out what to do with it.

Begin by creating an entry in cache.

If the node has downstream entries (e.g. E(B,C) or D(C)), traverse the downstream nodes depth-first...

  • Creating cache entries for each and...
  • Adding the parent's key to each child's dependency array. This will form the basis of the cache update notification/invalidation chain later on.

Note that the cache write for any given node happens after its children have been traversed (post-order), meaning that The Guarantee is honored: each child is updated before its parent.
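The create path above can be sketched as follows. The graph-as-dict, the plain-dict cache, and the `write_order` log are all illustrative assumptions:

```python
# Downstream edges of the example graph: node -> children it depends on.
GRAPH = {"F": ["E"], "G": ["E", "D"], "E": ["B", "C"],
         "D": ["C"], "B": ["A"], "A": [], "C": []}

cache = {}        # key -> {"payload": ..., "dependencies": set()}
write_order = []  # records the LRU touch order, for illustration

def create(node, parent=None):
    # DFS into children first, so every child is written (and thus
    # LRU-touched) before its parent: The Guarantee.
    for child in GRAPH[node]:
        create(child, parent=node)
    entry = cache.setdefault(node, {"payload": "db:" + node,
                                    "dependencies": set()})
    if parent is not None:
        # Register the parent in the child's dependency array, so
        # invalidations can travel upstream later.
        entry["dependencies"].add(parent)
    write_order.append(node)

create("E")
# write_order == ["A", "B", "C", "E"]: children before parents.
```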

Update

Updates are the same for nodes with or without children. Once the node's database and cache entries are updated, upstream cache invalidation is carried out:

  • Using the dependency array on the cache object, recursively clear the payload on all upstream cache entries, keeping in mind that...
  • Touches on cache items must happen in pre-order (each node before its upstream dependents) to preserve The Guarantee. If, for example, we update B(A), the following nodes must also be touched, in order: E(B,C), F(E), G(E).
  • If retrieval of a key in any dependency array fails, it refers to an evicted cache item. This is nothing to worry about: just prune the key from the dependency array to prevent unnecessary future lookups.
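A sketch of the update/invalidation pass under the same assumptions, with a plain dict for the cache and an evicted G(E) simulated by simply omitting its key:

```python
# Each entry mirrors CacheObject; "G" is deliberately absent,
# simulating an evicted upstream item.
cache = {
    "B": {"payload": "db:B", "dependencies": {"E"}},
    "E": {"payload": "db:E", "dependencies": {"F", "G"}},
    "F": {"payload": "db:F", "dependencies": set()},
}

def invalidate_upstream(node):
    entry = cache.get(node)
    if entry is None:
        return
    # sorted() gives a stable snapshot, so we can prune while iterating.
    for dep in sorted(entry["dependencies"]):
        if dep not in cache:
            # Evicted dependent: prune the stale key and move on.
            entry["dependencies"].discard(dep)
            continue
        # Pre-order: touch the nearer node before recursing further
        # upstream, preserving The Guarantee.
        cache[dep]["payload"] = None
        invalidate_upstream(dep)

def update(node, new_payload):
    cache[node]["payload"] = new_payload  # matching DB write assumed
    invalidate_upstream(node)

update("B", "db:B-v2")
# E and F now hold empty payloads; "G" was pruned from E's array.
```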

Delete

Deletes have properties in common with creates and updates. The sequence of events that follows a delete is complicated enough that an example is illuminating.

  • Delete node B(A) in the database and in cache.
  • Delete all upstream cache items E(B,C), F(E), G(E). Yes, delete: with the deletion of B(A), the assumptions undergirding all upstream nodes are invalid and it doesn't make sense to maintain cache entries for them.
  • Notice that there are still stale references to B(A) and its deleted, upstream dependents in unrelated downstream nodes such as A (whose dependency array still references B(A)) and D(C) (which still references G(E)). This is ok! Dependencies only matter for delete- and update-related invalidations. Remember: if a node cannot be found for invalidation, it is pruned from the dependency array during the invalidation process.
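The delete cascade for the B(A) example can be sketched like so, again with a plain dict standing in for memcache:

```python
# Cache pre-populated with the full example graph; the dependency
# arrays point upstream.
cache = {
    "A": {"payload": "db:A", "dependencies": {"B"}},
    "B": {"payload": "db:B", "dependencies": {"E"}},
    "C": {"payload": "db:C", "dependencies": {"E", "D"}},
    "D": {"payload": "db:D", "dependencies": {"G"}},
    "E": {"payload": "db:E", "dependencies": {"F", "G"}},
    "F": {"payload": "db:F", "dependencies": set()},
    "G": {"payload": "db:G", "dependencies": set()},
}

def delete(node):
    # The matching database delete is assumed to happen alongside this.
    entry = cache.pop(node, None)
    if entry is None:
        return
    # Cascade upstream: every dependent was built on assumptions that
    # just became invalid.
    for dep in entry["dependencies"]:
        delete(dep)

delete("B")
# Only A, C and D survive; A's stale reference to "B" and D's stale
# reference to "G" are pruned lazily during later invalidations.
```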

Search

When a request to retrieve information from the database is received, first check the cache store for it.

  • If a cache entry is available: Return it after checking it for validity. In this context, we define a valid parent node as one whose downstream children contain a reference to the parent in their dependency arrays.
    • If the child node does not exist, it has been evicted. Create the child, adding the parent to its dependency list, and then execute an Update on the parent to maintain time-based ordering of the graph.
    • If the child node exists but does not contain a reference to the parent, execute an Update on the child after adding the parent to the child's dependency array.
    • If the child node exists and contains a reference to the parent, you're good to go.
  • If a cache entry is not available: Create a cache entry from the database (see Create above) and return it to the caller.
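The search path can be put together as follows. The plain-dict cache and database and the `create`/`search` helpers are illustrative assumptions:

```python
# Plain dicts standing in for memcache and the database; GRAPH maps
# each node to its downstream children.
cache = {}
GRAPH = {"B": ["A"], "A": []}
DB = {"A": "db:A", "B": "db:B"}

def create(node, parent=None):
    # Post-order, as in the Create section: children first.
    for child in GRAPH[node]:
        create(child, parent=node)
    entry = cache.setdefault(node, {"payload": DB[node],
                                    "dependencies": set()})
    if parent is not None:
        entry["dependencies"].add(parent)

def search(node):
    entry = cache.get(node)
    if entry is None:
        # Cache miss: build the node (and its subtree) from the database.
        create(node)
        return cache[node]
    # Validity check: every child must exist and reference this node.
    for child in GRAPH[node]:
        if child not in cache:
            # Evicted child: recreate it with the back-reference.
            # (A real implementation would also re-touch this parent,
            # per the Update section, to restore the LRU ordering.)
            create(child, parent=node)
        elif node not in cache[child]["dependencies"]:
            cache[child]["dependencies"].add(node)
    return entry

search("B")     # miss: creates A, then B
del cache["A"]  # simulate eviction of the leaf
search("B")     # validity check recreates A with "B" registered
```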

Conclusion

In this mind-numbing blog post, I have outlined a (probably) highly inefficient way of consistently caching the database representation of a graph of interdependent nodes using a caching store with memcache-like semantics.

It occurs to me that my eviction handling scheme breaks down completely once you start running multiple independent memcache servers (each with separate LRU queues) and consistently hashing graph nodes across them. If this is an issue for you, you are probably in much better operational (i.e. you're richer than me) and technical (i.e. you're smarter than me) shape to solve the issue than I am... my problems are tiny enough that simply upgrading the amount of RAM on my caching server is usually enough.

Anyhow: thanks for accompanying me on this terrifying journey. I'm going to go dye some of these wizard hairs now, and possibly teach my four-year-old some better manners.