Last fall, as part of our first SurveyMonkey Hackathon, I worked on a data-mining project to evaluate whether there are algorithmic ways to distinguish good survey questions from bad ones. Specifically, I wanted to see if we could identify certain words that bias respondents toward answering one way or another. For example, I expected that if a True/False question contained the word “never” or “always,” it would significantly affect how people answered. I was curious to see which other words might affect the distribution of answers. So, here’s what I did:
I grabbed about a month’s worth of anonymous survey data and looked for the simplest multiple choice questions: those that had answers like Yes/No, True/False, or Agree/Disagree. I found nearly a million questions that met that criterion, with nearly 15 million responses to those questions.
In order to establish a baseline, I looked to see how these questions were answered across the board. Turns out, survey takers are an agreeable bunch!
People love to say yes. Across all the Yes/No questions in our sample, people answered “Yes” 57% of the time.
People also really love to say that things are true. They gave a “True” answer 63% of the time across our set of True/False questions.
What people love the most, though, is agreeing. In our sample of Agree/Disagree questions, folks answered “Agree” 79% of the time. This phenomenon of people giving a positive answer more often than a negative answer is a concept in survey methodology known as Acquiescence Bias. I wanted to see if the presence of any particular words affected this bias in either direction.
So, I found all the words that appeared in many questions and created a set of questions and answers for each of those frequently appearing words. For each set, I compared the rates of positive and negative answers to the overall averages above. Finally, I sorted the sets to easily see which words created the greatest bias.
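The word-bucketing step above can be sketched in a few lines of Python. This is a toy illustration with made-up responses and an assumed data shape (question text paired with its answer) — the real analysis ran over millions of anonymized answers and kept only words that appeared in many questions:

```python
from collections import defaultdict

# Hypothetical toy data: (question_text, answer) pairs.
responses = [
    ("You must file the form", "True"),
    ("You must file the form", "True"),
    ("Is the tool easy to use?", "Yes"),
    ("Is the tool easy to use?", "No"),
    ("Would you recommend us?", "Agree"),
]

POSITIVE = {"Yes", "True", "Agree"}

def positive_rate(pairs):
    """Share of answers that are positive (the acquiescence rate)."""
    positives = sum(1 for _, answer in pairs if answer in POSITIVE)
    return positives / len(pairs)

baseline = positive_rate(responses)

# Bucket each response under every distinct word in its question text.
by_word = defaultdict(list)
for question, answer in responses:
    for word in set(question.lower().split()):
        by_word[word].append((question, answer))

# Bias = per-word positive rate minus the overall baseline,
# sorted so the most positively biasing words come first.
bias = sorted(
    ((word, positive_rate(pairs) - baseline) for word, pairs in by_word.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

In this toy sample, “must” floats to the top because both of its questions were answered “True,” mirroring the pattern described below.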
For example, the set of True/False questions that contained the word “must” got answered with “True” 82% of the time. That’s significantly higher than the “True” rate of 63% I saw across the entire set of True/False questions. I guess those questions must be true.
Similarly, among Yes/No questions that included the word “easy”, the “Yes” rate was 85%. My hypothesis to explain this is that no one wanted to answer “No” for something that was supposed to be easy.
The top biasing word among Agree/Disagree questions was “recommend,” with questions including it clocking in at an agree rate of 90%. Most of those questions were along the lines of “Would you recommend our product?” It appears that people like to recommend whatever they’re being asked about, which is presumably something they’ve already bought or used at least once.
On the flip side, when a Yes/No question contained the word “so,” the “Yes” rate suddenly dropped from the original 57% down to 37%. This surprised me, but when I looked at the actual questions being asked, these “so” questions appeared to be long and wordy. I’d guess that the dearth of “Yes” responses probably related to question fatigue more than actually being repelled by that tiny word.
After the Hackathon, I realized that although I wasn’t sure these words were actually creating bias, it would be really interesting to see if the results changed when broken down by the gender of the respondent. So, I reduced my dataset to those questions that appeared on surveys where gender was also being asked. This way I could run the same analysis again and see if some words affected one gender more than another. That’s blog post gold, right there!
Step one was to confirm that I still saw the same general split of positive answers vs. negative answers in this smaller dataset, which was about 10% of the original size, but still big (millions of responses). Surprisingly, I did not find the same acquiescence bias. I found much less!
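The subsetting step can be sketched like this. The record layout and the way I detect a gender question here are assumptions for illustration only — the real pipeline worked against our internal survey schema:

```python
# Hypothetical records: (survey_id, question_text, answer) triples.
records = [
    (1, "Is support easy to reach?", "Yes"),
    (1, "What is your gender?", "Female"),
    (2, "Did the product work?", "No"),
    (2, "Did the product work?", "Yes"),
]

POSITIVE = {"Yes", "True", "Agree"}

# Surveys that also asked a gender question (crude keyword check).
gender_surveys = {sid for sid, q, _ in records if "gender" in q.lower()}

# Keep only the substantive answers from those surveys, then
# recompute the acquiescence baseline on this smaller dataset.
subset = [(q, a) for sid, q, a in records
          if sid in gender_surveys and "gender" not in q.lower()]

positives = sum(1 for _, a in subset if a in POSITIVE)
subset_rate = positives / len(subset)
```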
[Table: positive-answer rates, Original Dataset vs. Dataset with Known Gender]
Why was I seeing so much less acquiescence bias in the dataset where I knew the gender of the respondent? I ran this question past our methodology department, and Dr. Phil had a pretty good theory. He suggested that if a survey includes a gender question, it’s likely that the person creating the survey does not personally know the respondents. These surveys tend to be market research, academic studies, and the like. However, if a survey does not ask about demographic data like gender, it’s more likely to be one where the survey creator already has a personal relationship with the respondents, like a parent-teacher survey or an employee satisfaction survey.
Phil’s theory was that acquiescence bias is greatest when the respondent personally knows the person who sent the survey. These respondents would be more inclined to give positive answers based on their relationship to the survey creator.
This seemed like a pretty reasonable explanation for the data I was seeing. The next SurveyMonkey Hackathon is just two weeks away, and I’m sure we’ll uncover more interesting results to write about afterwards. Stay tuned!