The primary question asked of us at conferences or at other forums is: “Are TrueSample-excluded survey respondents really ‘bad’ respondents, or are they just different?”
TrueSample excludes online survey respondents who are “not real,” “not unique,” and “not engaged.” The first and second have to do with the respondents being verifiable and not duplicates. The third has to do with their performance in surveys – do they speed through the survey relative to other respondents, or do they straight-line their responses? In all three situations, the online survey respondents are different from others in characteristics that are separate from their survey responses. In other words, they are outliers, but they are classified as such not because of how they answer the survey questions, but because of other characteristics that they exhibit.
Therefore, the obvious question that we are asked is: why would we assume their data is not of high quality and therefore discard what could potentially be valuable information?
To answer that question, we need to consider the underlying problem: we start with the belief that there are, in fact, respondents who are intent on gaming the system and therefore provide less-than-truthful responses, thereby compromising online research data quality.
Starting with that assumption, the next step is to work out how to identify them. We are definitely in unsupervised modeling land here. There are no labels telling us what a “bad” respondent is that we could use to train a supervised model. Supervised modeling is out of the question for this type of quality control – there is no cost-effective way to identify a set of “bad” online survey respondents for model training.
So we do what we feel is the next best thing: we identify a set of undesirable characteristics, such as not providing verifiable information like name and address (the only survey-agnostic information asked on a survey that we can verify), or speeding or straight-lining through an interview.
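As a rough illustration of the engagement checks described above, the sketch below flags respondents who complete a survey much faster than their peers, or who give the same answer to every item in a grid question. It is not TrueSample's actual implementation: the function name, the record keys ("id", "seconds", "grid_answers"), and the 5 percent cutoff are all assumptions made for the example.

```python
def flag_unengaged(respondents, fastest_percent=5):
    """Return {id: {"speeder": bool, "straight_liner": bool}}.

    respondents: list of dicts with hypothetical keys
      "id"           - respondent identifier
      "seconds"      - total completion time for the survey
      "grid_answers" - answers to one grid (matrix) question
    fastest_percent is an assumed cutoff, not a published threshold.
    """
    # Speeders: the fastest `fastest_percent` percent of completions,
    # judged relative to the other respondents in the same survey.
    n_speeders = len(respondents) * fastest_percent // 100
    by_time = sorted(respondents, key=lambda r: r["seconds"])
    speeder_ids = {r["id"] for r in by_time[:n_speeders]}

    flags = {}
    for r in respondents:
        answers = r["grid_answers"]
        # Straight-liners: every item in the grid got the same answer.
        straight = len(answers) > 1 and len(set(answers)) == 1
        flags[r["id"]] = {
            "speeder": r["id"] in speeder_ids,
            "straight_liner": straight,
        }
    return flags


# Made-up data: 20 respondents, one very fast, one straight-lining.
sample = [
    {"id": i, "seconds": 300 + 10 * i, "grid_answers": [1, 2, 3, 4, 5]}
    for i in range(20)
]
sample[0]["seconds"] = 45
sample[3]["grid_answers"] = [3, 3, 3, 3, 3]
flags = flag_unengaged(sample)
```

Note that neither flag looks at what a respondent actually said about the survey topic. Both are based only on how the respondent moved through the questionnaire, which is the sense in which these respondents are outliers on characteristics separate from their answers.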
We feel strongly that online survey respondents who exhibit these undesirable characteristics are more likely to give data of poorer quality. And since our research (see the white paper on “What Impact do Bad Respondents Have on Business Results”) consistently shows that they provide data that is biased compared to the data provided by individuals who do not exhibit these characteristics, we feel that the decision to exclude these respondents is the correct one.
There is a very valid argument that the percentage of respondents that we call “bad” is not small in some cases and in certain demographics. We agree that there are “good” respondents in the discarded pile who may have been excluded, for example, because their names and addresses are not verifiable for legitimate reasons, or because they think and move so quickly that they fall in the fastest few percentiles across surveys and have been identified as “speeders.”
But do these people make up the majority of excluded respondents? If so, wouldn’t the data of the excluded people be closer to the data of the “good” ones? Is there something about these legitimately unverifiable people that causes their data to be in the same cluster as the gamers? We do realize that we are likely removing some respondents who are good – i.e., we are committing Type I errors. Having recognized this likelihood, we are continuing to conduct research to reduce these errors.
The questions that are asked of us are valid. If you truly believe that all online survey respondents are above-board, then in your view, such quality control measures are unnecessary. However, if you believe that there is a percentage of respondents who game the system and provide questionable data, then we at TrueSample believe that we have a defensible quality control methodology. We also recognize that there is always room to improve a method and reduce errors. To this end, we are continuing to conduct research in this area and will continue to share our findings as we learn more.