Back when you were in school, most math problems had one correct answer, and you could prove if you were right or wrong. But many real-world problems are messy and don’t have one right answer, or an obvious best answer. We often run into complex systems with lots of knobs and levers to adjust — too many to possibly express as an equation. In these situations, you can often apply the principles of evolution to fine-tune a complex system to yield good (but not necessarily the best) solutions.
Our problem: email click rates
I manage engineering for SurveyMonkey Audience, and we recently ran into one of these kinds of problems. Our product works by connecting our customers with the survey respondents they’re looking for (100 male nurses in California, for example) to take their survey.
Here’s the problem: how many people will open their email and click to start the survey? We need to know this to figure out how many emails to send in order to get a particular number of responses. Surprisingly, the email click rate varies a lot.
We don’t want to send too many emails and drastically overshoot our target number of responses, because then we would be needlessly sending our panelists emails, which could hurt long-term participation. But we also don’t want to under-shoot and not get enough responses.
We can make a reasonable rough guess based on the demographics of the group being sampled (e.g. women tend to be a bit more responsive than men, and people aged 18-24 are less responsive to email than people aged 35-50). But even the most sophisticated guess still has a large margin of error.
First solution: test and wait
Our first solution was to send a small test batch of emails on day #1; wait 24 hours; and check on day #2 to see what the test batch click rate was. This approach worked OK, but it had two problems:
- We had to wait a full day for testing
- Some people respond to emails after more than 24 hours.
There was a strong desire to be faster and more accurate. But how?
The ideal solution: continually adjust
The best possible solution would be a system that sends a constant trickle of emails and constantly adjusts the flow based on the performance so far. For example, suppose we saw the following:
- 9 hours ago, we sent 250 emails. Of these, 45 (20%) have been clicked; AND
- 5 hours ago, we sent 500 emails. Of these, 60 (12%) have been clicked; AND
- 3 hours ago, we sent 1,000 emails. Of these, 80 (8%) have been clicked
Based on this, some sort of algorithm could predict that our project’s overall click rate, once the dust has settled a few days from now, would be something like 22%. The more data you feed this algorithm in terms of larger sample sizes per unit of time and more units of time, the more accurate it should become.
Our new method
We want a model that takes a bunch of historical data about emails sent so far as the input, and which produces an output which is the predicted eventual click rate for all emails for the survey.
We eyeballed our historical data, and decided that we could probably make a decent prediction if we had sent at least 250 emails more than 1 hour ago for a given SurveyMonkey Audience survey.
Our approach is to group data on all the emails we have sent so far based on time. For the emails sent between 1-2 hours ago, we determine how many emails were sent; and how many of these were clicked. We do this for the following 8 time buckets:
- 1-2 hours ago
- 2-3 hours ago
- 3-5 hours ago
- 5-8 hours ago
- 8-24 hours ago
- 24-48 hours ago
- 48-72 hours ago
- More than 72 hours ago
For example, an input to our model may be:
- 2-3 hours ago: 1000 emails were sent, 80 of these were clicked
- 8-24 hours ago: 250 emails were sent, 45 of these were clicked
Given all these inputs, how should we compute one single estimate for the eventual click rate for this project? For each of these buckets, we calculate the click rate so far within that bucket. We then multiply this observed click rate by a per-bucket multiplier to give the predicted final click rate based on just that bucket’s data alone. For example, if we see a 20% click rate so far for emails sent 8-24 hours ago, we may expect e.g. 12% more to trickle in, giving us a predicted eventual 22.4% click rate for that group of emails.
We then need to combine all our email time buckets together. A bucket with a larger number of total emails sent should be more significant than a bucket with a smaller number of emails; and a bucket that has had longer to “bake off” is likely to be more reliable than a bucket that we just emailed recently. We do this by averaging the buckets together, with a weighting factor that is based on the number of emails in that bucket multiplied by a constant per-bucket weighting factor.
For example, if I sent 2000 emails 72+ hours ago (weighting factor = 1.98), and another batch of 1000 emails 2-3 hours ago (weighting factor = 1.20), than the relative importance of each group would be:
72+ hour bucket weight = 2000 * 1.98 = 3960
2-3 hour bucket weight = 1000 * 1.20 = 1200
This weighting value doesn’t mean anything on its own, but it describes the relative importance of each bucket. In the above case, the 72+ hour bucket data is 3.3 times more significant than the 2-3 hour bucket (3960 / 1200 = 3.3), so when we average the bucket’s estimated response rates together, the 72+ hour will have 3.3 times the importance of the 2-3 hour bucket.
The hard part: picking constants
We’ve now described an algorithm that takes as input the progress of a survey so far across 8 time buckets, produces an estimate based on each time bucket alone, and then combines all these estimates. The secret sauce is that for each of the buckets, we will need to pick constant values for:
- Rate multiplier
- Weighting factor
Since we have 8 buckets, that’s 16 constants. A change in one constant could affect another constant.
How do we pick these 16 values? We’ve described our approach, but don’t have a solution yet.
Evolution as a tool for smarter survey intelligence
We have modeled our problem as an algorithm that takes the data for 8 time buckets as inputs, and produces a single final overall click rate as an output. Our model will depend on picking 16 specific constants to tune how good (or bad) our answer will be.
We have gobs of historical data about the email click behavior for past SurveyMonkey Audience surveys, and we know the eventual exact click rate for a variety of very different survey projects. So we can easily take a given model with its 16 constants and test it across thousands of past situations to see how accurate the predictions are. But we don’t know how to pick those 16 constants to give the best possible solution.
This sounds like a job for… evolution by natural selection! The essential idea is:
- Individuals can have different traits which can be passed to their offspring.
- Traits determine the likelihood of surviving and reproducing (its fitness).
- There is a population of diverse individuals.
- In each generation, only the most fit individuals produce offspring for the next generation; the majority of individuals will die off.
- When traits are passed on to the next generation, there is some swapping of traits between two fit individuals, and some mutation.
- Over time, each generation will show improved overall fitness.
We can apply these same principles to our computer problem. This is called a genetic algorithm.
In our genetic algorithm setup, individuals represent different solutions to our problem. Each individual has 16 genes (which are our 16 constants). For example, a particular individual could be represented as [1.2, 1.4, 1.06, …].
We have a fitness function that calculates how good or bad an individual is. This fitness function runs the individual’s gene through thousands of historical test cases, and gives a higher fitness value the more times we produced an email click rate prediction that was close to the real value, and a lower fitness value the more times we produced an email click rate prediction that was far off.
With all that in place, here’s what we do:
- Initially, pick 16 random numbers for each gene; this is an individual.
- We will create a population of 20 individuals.
- For each individual, run through test data to calculate its fitness.
- We will take the top 5 best individuals, and delete the rest.
- We will create a new population of 20 individuals by taking the 5 best from the previous generation. We will mutate the new genes randomly, e.g. take each value and jiggle it by +/- 0.1; and we pick two parents and swap chunks of their genes.
- Repeat for 10 generations.
What’s amazing is that this kind of genetic algorithm converges on very decent solutions shockingly fast; in our case, within only 3-5 generations we were able to have a solution that could produce estimates that were usually accurate to +/- 10%, and a few generations later, to +/- 5%.
The act of chugging through thousands and thousands of possible answers, evaluating them, and throwing out the bad ones only takes a few minutes. We quickly found answers that worked, and which worked very well.
Most of our results made intuitive sense. As expected, if you only have data on emails sent 1-2 hours ago, you will expect that a lot more people will click on them later (245% more), versus emails that were sent 24-48 hours ago (only 5% more will click).
Our genetic algorithm did make some unexpected discoveries. For example, it told us that the batch of emails sent 5-8 hours ago is less relatively important than emails sent 3-5 hours ago OR 8-24 hours ago. This is surprising since we’d expect older batches to always be more significant; my hunch is that people tend to either deal with emails immediately or much later, so the 5-8 hour bucket isn’t as useful for making predictions.
The importance of fitness
We we wrote our genetic algorithm, we realized that that evolution is, in a word, lazy. The system optimizes on the fitness function we gave it, and a poorly written fitness function led to bad answers.
When we first wrote our fitness function, we would run the solution through a test battery of thousands of real examples, with known actual click rate estimates. At first, fitness was simply a measure of the sum of how good the answer was:
- +4 points: 0-5% off
- +3 points: 5-10% off
- +2 points: 10-15% off
- +1 point: 15-20% off
The evolutionary mechanism quickly “figured out” that it was better to have a few answers that were spot-on than to have many answers that were reasonable; and so the answers were grossly skewed towards accuracy only in cases where we had plenty of email data that was e.g. 24+ hours old, and was highly inaccurate for situations with only recent email data.
After some revisions and experimentation, we realized that what we really wanted was consistently reliable estimates which were never very inaccurate. We wanted to heavily penalize grossly inaccurate answers, but only mildly reward accurate ones. Our eventual scoring was changed to:
- +5 points: 0-5% off
- +3 points: 5-10% off
- +1 point: 10-15% off
- -1 point: 15-20% off
- -10 points: 20-25% off
- -50 points: more than 25% off
This seemed to do the trick. Across thousands of test cases, most answers had about a 2-3% margin of error, but very, very few answers had more than a 10% margin of error.
Are you interested in these kinds of engineering problems?
Are you a programmer who is interested in machine learning, natural language processing, or other challenges? You should apply for a job at SurveyMonkey! We’re always looking for sharp, curious minds to join the Monkey.
Chuck Groom is the head of engineering in Seattle, WA.