Like most high-traffic websites, SurveyMonkey has at its core a highly tuned database where we store and protect your survey data. However, serving surveys and storing your responses quickly, reliably, and 24×7 depends on making sure we don't pound on that database too hard. If you send out a survey to 100,000 people, it would be kind of silly if SurveyMonkey had to go to the database every single time to figure out which questions were in the survey. So we use a caching layer, which is basically just a bunch of memory where we can temporarily store things, like a very important World Series survey, so that we don't hammer the database with the same requests over and over and over. Caches also make our website much more responsive, because getting your survey from memory is much faster than rooting around on a disk.
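To make the "check memory first, only go to the database when you have to" idea concrete, here is a minimal sketch of the common cache-aside pattern in Python. It assumes a toy in-memory cache with a time-to-live; `SimpleCache`, `get_survey`, the `survey:<id>` key format, and the fake database function are all hypothetical illustrations, not SurveyMonkey's actual code.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class SimpleCache:
    """Toy in-memory cache with per-entry expiry (a stand-in for a real cache server)."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Entry has expired: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float = 300.0) -> None:
        self._store[key] = (time.monotonic() + ttl_seconds, value)

    def delete(self, key: str) -> None:
        self._store.pop(key, None)


def get_survey(survey_id: int, cache: SimpleCache,
               load_from_db: Callable[[int], dict]) -> dict:
    """Cache-aside read: try the cache first, fall back to the database on a miss."""
    key = f"survey:{survey_id}"  # hypothetical key scheme
    survey = cache.get(key)
    if survey is None:
        survey = load_from_db(survey_id)  # only reached on a cache miss
        cache.set(key, survey)
    return survey


# Usage: 100,000 respondents open the same survey.
db_calls = []

def fake_db(survey_id: int) -> dict:
    db_calls.append(survey_id)
    return {"id": survey_id, "title": "World Series"}

cache = SimpleCache()
for _ in range(100_000):
    get_survey(123, cache, fake_db)
print(len(db_calls))  # 1: only the first request reaches the database
```

Every request after the first is served from memory until the entry expires or is explicitly deleted, which is exactly the load the database is spared.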
David, one of our awesome new engineers, was recently asked to look at our caching solution and figure out how to make it work with our changing technology stack. He’s the first to author a post in our new engineering blog series: the Code-Monkey Corner. Stay tuned for future posts in this series, but in case you haven’t figured it out yet, this is about to get very tech-y. Proceed with caution!
A Tale of Two Caches
It was the best of caching solutions; unfortunately, it was also the worst of development bottlenecks for a rapidly expanding codebase.
SurveyMonkey is in the middle of transitioning its codebase from .NET to the Python Pyramid framework. Despite serving valiantly and keeping SurveyMonkey running briskly for many years, the .NET caching layer, ScaleOut StateServer (SOSS) from ScaleOut Software, was incompatible with Python. Consequently, a new caching layer backed by Redis was implemented in the Python code at the beginning of the transition. As a result, SurveyMonkey had two active caching layers: SOSS for its .NET codebase and Redis for its Python codebase.
Having two cache layers allowed Python development to move forward steadily, but as more and more of the codebase moved to Python, keeping SOSS and Redis consistent with each other took increasing effort. For instance, if the Python code updated the title of the Survey with id = 123 and cached that Survey in Redis, the corresponding Survey object with id = 123 in SOSS would be out of date. As a temporary solution, decaching calls were embedded in the program logic to tell SOSS or Redis when to remove items from their caches. In the previous example, after the Python code updated the title of Survey 123, it would send a decache message to SOSS telling it to remove Survey 123. However, as the Python codebase grew, more and more decaching calls had to be written to keep the caches consistent. The needed solution was clear: a single caching layer shared by both .NET and Python.
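The decaching workaround described above can be sketched roughly as follows. This is an illustration under stated assumptions: `DictCache` is a hypothetical dictionary-backed stand-in for both the Redis and SOSS clients (it does not reproduce either product's API), and `update_survey_title` and the `survey:<id>` key format are invented for the example.

```python
from typing import Any, Dict, Optional


class DictCache:
    """Toy in-memory stand-in for a cache client (not the real SOSS or Redis API)."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}

    def get(self, key: str) -> Optional[Any]:
        return self.data.get(key)

    def set(self, key: str, value: Any) -> None:
        self.data[key] = value

    def delete(self, key: str) -> None:
        self.data.pop(key, None)


def update_survey_title(db: Dict[int, dict], python_cache: DictCache,
                        dotnet_cache: DictCache, survey_id: int, title: str) -> None:
    """Hypothetical Python write path during the two-cache period."""
    key = f"survey:{survey_id}"
    db[survey_id]["title"] = title              # 1. write to the database
    python_cache.set(key, dict(db[survey_id]))  # 2. refresh the Python-side cache (Redis)
    dotnet_cache.delete(key)                    # 3. decache on the .NET side (SOSS) so it cannot serve stale data


# Usage: the .NET side had already cached the old title before Python changed it.
db = {123: {"id": 123, "title": "Old title"}}
redis_cache, soss_cache = DictCache(), DictCache()
soss_cache.set("survey:123", dict(db[123]))
update_survey_title(db, redis_cache, soss_cache, 123, "New title")
```

Step 3 is the extra bookkeeping: every Python write path that touches a shared object needs its own matching decache call, and forgetting one leaves the other stack serving stale data. A single shared cache removes that whole class of calls.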
Redis vs Memcached
Since SOSS is incompatible with Python, the shared cache for .NET and Python had to be either Redis or some other caching system. We chose Memcached for two reasons: 1) It was designed from the start as a distributed caching solution. We wanted a cache that could easily scale horizontally, by adding more servers, and provide fault tolerance. A distributed version of Redis, called Redis Cluster, is in development but has not yet been released. 2) Because Memcached is older, we felt its .NET client libraries were more stable. We chose the Enyim client library.
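Memcached servers don't talk to each other; horizontal scaling typically comes from the client deciding which server owns each key, often with consistent hashing so that adding or losing a server only remaps a fraction of the keys. The sketch below shows that general technique in Python. It is not Enyim's implementation, and the server names, replica count, and MD5-based hash are assumptions chosen for illustration (11211 is Memcached's default port).

```python
import bisect
import hashlib
from typing import Dict, List


class HashRing:
    """Consistent-hash ring mapping cache keys to server names."""

    def __init__(self, servers: List[str], replicas: int = 100) -> None:
        self._replicas = replicas           # virtual nodes per server, for smoother balance
        self._ring: List[int] = []          # sorted hash positions on the ring
        self._owners: Dict[int, str] = {}   # hash position -> server name
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(value: str) -> int:
        # First 8 bytes of MD5 as an integer; any well-mixed hash would do here.
        return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

    def add(self, server: str) -> None:
        for i in range(self._replicas):
            h = self._hash(f"{server}#{i}")
            self._owners[h] = server
            bisect.insort(self._ring, h)

    def server_for(self, key: str) -> str:
        # Walk clockwise to the first ring position after the key's hash, wrapping around.
        h = self._hash(key)
        idx = bisect.bisect(self._ring, h) % len(self._ring)
        return self._owners[self._ring[idx]]


# Usage: add a fourth server and see which keys move.
ring = HashRing(["cache1:11211", "cache2:11211", "cache3:11211"])
keys = [f"survey:{i}" for i in range(10_000)]
before = {k: ring.server_for(k) for k in keys}
ring.add("cache4:11211")
after = {k: ring.server_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved) / len(keys))
```

With four servers the printed fraction should come out at roughly a quarter, and every key that moves lands on the new server. A naive `hash(key) % number_of_servers` scheme would instead remap most keys whenever the server count changed, flushing most of the cache at once.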
One challenge in writing the shared caching layer was deciding how to store objects in the cache so that both stacks could retrieve and unpack them meaningfully. We decided to serialize each object to JSON, a language-neutral format that both .NET and Python can parse, before placing it in the cache.
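Here is a minimal sketch of what the Python half of that JSON round trip might look like. The `Survey` fields, the `cache_key` helper, and the `survey:<id>` key format are hypothetical; the point is that both stacks must agree on the key scheme and the field names, and that the value is stored as UTF-8 bytes so the .NET side can decode it with whatever JSON library it uses.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Survey:
    id: int
    title: str
    question_ids: List[int]


def cache_key(survey_id: int) -> str:
    # Hypothetical key scheme: .NET and Python must build keys identically.
    return f"survey:{survey_id}"


def to_cache_value(survey: Survey) -> bytes:
    # sort_keys and compact separators give a stable, language-neutral encoding.
    return json.dumps(asdict(survey), sort_keys=True, separators=(",", ":")).encode("utf-8")


def from_cache_value(raw: bytes) -> Survey:
    return Survey(**json.loads(raw.decode("utf-8")))


survey = Survey(id=123, title="World Series", question_ids=[1, 2, 3])
raw = to_cache_value(survey)
print(raw)  # b'{"id":123,"question_ids":[1,2,3],"title":"World Series"}'
```

Because the cached value is plain JSON rather than a language-specific binary format (such as .NET binary serialization or Python's pickle), an object written by either stack can be read back by the other.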
Because caching exists to keep an application running quickly, it was important for us to measure the performance of the shared caching layer. We wrote several load tests designed to induce cache hits and misses for a number of key objects. We then ran those load tests and logged the application's performance, first with SOSS as the cache and then with Memcached. SOSS served as the benchmark, since it had been the tried-and-true solution for the past decade. The logs were then processed into data visualizations; some example graphs are shown below.
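The post doesn't describe the load-test tooling, so the following is only a single-process sketch of the underlying measurement: time each lookup and bucket the latency by hit or miss, so the same summary can be computed once with SOSS behind the lookup function and once with Memcached. `time_lookups` and `summarize` are hypothetical helpers, and the sketch assumes a miss is reported as `None`; a real load test would also drive the application concurrently.

```python
import statistics
import time
from typing import Any, Callable, Dict, Iterable, List, Optional


def time_lookups(get: Callable[[str], Optional[Any]],
                 keys: Iterable[str]) -> Dict[str, List[float]]:
    """Time each cache lookup (in ms) and bucket the latency by hit or miss."""
    results: Dict[str, List[float]] = {"hit": [], "miss": []}
    for key in keys:
        start = time.perf_counter()
        value = get(key)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        results["hit" if value is not None else "miss"].append(elapsed_ms)
    return results


def summarize(samples: List[float]) -> Dict[str, float]:
    """Count, median, and a simple nearest-rank-style 95th percentile."""
    if not samples:
        return {"count": 0}
    ordered = sorted(samples)
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    return {"count": len(ordered), "median_ms": statistics.median(ordered), "p95_ms": p95}


# Usage with a plain dict standing in for the cache: the first 500 keys hit, the rest miss.
store = {f"survey:{i}": {"id": i} for i in range(500)}
keys = [f"survey:{i}" for i in range(1000)]
results = time_lookups(store.get, keys)
print(summarize(results["hit"])["count"], summarize(results["miss"])["count"])  # 500 500
```

Reporting a tail percentile alongside the median matters for a cache: a cache that is fast on average but occasionally slow can still make pages feel sluggish.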
We are still in the process of writing the shared caching layer. So far, though, all tests have shown Memcached to be a viable shared cache for .NET and Python. We look forward to sharing more of our findings as we progress. Stay tuned!