As an engineer here at SurveyMonkey, I have a cool job. I get to build features such as the suggested questions autocomplete, and I thought I’d share how we built this new feature directly into our Question Builder to make it even easier to use Question Bank.
One of the main goals of autocomplete is suggesting highly relevant questions to our customers as they are designing their surveys, which is a lot harder than it may seem! To aid us in this, we have our great Question Bank from which we can draw questions. So where do we start?
Step 1: Stemming your Question
As a user types, we receive the text and split it into what we consider distinct words. Once we have a set of words, though, we hit our first problem: say you typed the word ‘companies’. It really means the same thing as ‘company’, but they aren’t the same word. How do we determine that they mean the same thing?
We first solved this using a concept called stemming, which, simply put, takes in a word, such as ‘companies’, and reduces it to a base form, such as ‘compani’, for easy comparison. Since we build everything in Python here at SurveyMonkey, we use an open source stemming library called PyStemmer. This library works in a multitude of different languages, which is great! It allows us to support English and Dutch, the two languages we currently support in Question Bank.
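To make the idea concrete, here is a minimal sketch of splitting text into words and stemming each one. It is not our production code: the `tokenize` and `stem_words` helpers and the regular expression are made up for illustration, and the small fallback stemmer exists only so the snippet still runs if PyStemmer is not installed.

```python
import re

try:
    import Stemmer  # PyStemmer (pip install PyStemmer)
    _stem = Stemmer.Stemmer("english").stemWord
except ImportError:
    # Crude stand-in so the sketch runs without PyStemmer.
    # It only covers the "-ies" / "-y" case discussed in this post.
    def _stem(word):
        if word.endswith("ies") and len(word) > 4:
            return word[:-3] + "i"
        if word.endswith("y") and len(word) > 2:
            return word[:-1] + "i"
        return word


def tokenize(text):
    """Split raw input into lowercase words (illustrative rule, not the real one)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def stem_words(text):
    """Return the stem of every word in the text, in order."""
    return [_stem(word) for word in tokenize(text)]


plural = stem_words("Companies")
singular = stem_words("company")
print(plural, singular)
```

Both inputs reduce to the same stem, so a later lookup can treat them as the same word.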
Step 2: Checking Question Bank
Once we have stemmed all the given words, we then determine which Question Bank questions these words appear in, and create some sort of ‘rank’ or ‘importance.’ We used a form of inverse frequency matching to create our rankings for each word. Here’s a good Wikipedia article for a detailed explanation of inverse frequency matching. In short, it gives higher relevance to words that appear less often, on the assumption that they’re more important. For example, the stemmed version of the word “company” might appear in 10 of 100 documents, whereas “the” might appear in 90 of 100 documents. The word “company” should therefore always have a higher “score” than the word “the”. By using this algorithm we compute a score for every word in our Question Bank.
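As a rough illustration of the 10-of-100 versus 90-of-100 example, the sketch below uses the textbook inverse document frequency, log(N / number of documents containing the word). The exact formula we use isn't spelled out in this post, so treat the formula, the `inverse_frequencies` name, and the toy documents as assumptions.

```python
import math
from collections import Counter


def inverse_frequencies(documents):
    """Map each word to log(N / number of documents that contain it)."""
    doc_count = len(documents)
    containing = Counter()
    for words in documents:
        containing.update(set(words))  # count each word once per document
    return {word: math.log(doc_count / n) for word, n in containing.items()}


# Toy corpus of 100 already-stemmed "questions":
# "the" appears in 90 of them, "compani" in 10.
documents = [["the", "compani"]] * 10 + [["the"]] * 80 + [["survey"]] * 10
scores = inverse_frequencies(documents)
print(scores["compani"] > scores["the"])
```

Because the scores depend only on the Question Bank itself, they can be computed ahead of time rather than on every keystroke.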
Step 3: Checking Your Question History
Another feature we added to make your life easier is incorporating your own question history into the autocomplete. When you type a question, we look at a significant portion of your most recently added questions and look for matches against those. We also try to keep this as up to date as possible, so any changes you make, such as editing existing questions or adding new ones, automatically refresh your question history so we can continue to give you accurate question suggestions.
As you type words into the Question Builder, we stem all the words you’ve typed. For each normalized word, we compute a score based on the pre-computed inverse frequency for that word. The tally of the scores is what determines the order in which we show you Question Bank questions. Then, we use your input text to find suggested questions from your question history. We do all of this in a blazingly fast average of 10 milliseconds!
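The tallying step can be sketched as follows, assuming the per-word scores already exist. The `rank_questions` function, the question ids, and the score values are all hypothetical, and this simplified version only sums the scores of typed words that a question contains.

```python
def rank_questions(query_words, questions, scores):
    """Order question ids by the summed score of the typed words they contain."""
    totals = {}
    for question_id, words in questions.items():
        matched = set(query_words) & set(words)
        total = sum(scores.get(word, 0.0) for word in matched)
        if total > 0:  # skip questions that share no scored word
            totals[question_id] = total
    return sorted(totals, key=totals.get, reverse=True)


# Hypothetical pre-computed scores and stemmed Question Bank entries.
scores = {"compani": 2.3, "size": 1.5, "the": 0.1}
questions = {
    "q1": ["what", "is", "the", "size", "of", "your", "compani"],
    "q2": ["how", "is", "the", "weather"],
    "q3": ["favorit", "color"],
}
ranking = rank_questions(["the", "compani"], questions, scores)
print(ranking)
```

Here the question mentioning “compani” comes first, because the rare word contributes far more to the tally than “the” does.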
Did you get lost on pre-computed inverse frequency? Don’t worry. We’ve done all the hard work for you! Your job is to enjoy the new feature and tell us what you think!