A couple of weeks ago, I wrote about TrueSample and how it is used to improve survey data quality. Last week I had the opportunity to present some findings on data quality at the CASRO Online Conference in Las Vegas. My colleague, Phil Garland, and I were on a data quality panel and both of us gave talks on the subject. You can read about Phil’s experience on his blog post.
What I presented in Las Vegas was TrueSample’s RealCheck Postal and RealCheck Social. RealCheck Postal validates a respondent’s identity through their name and postal address, while RealCheck Social validates it through their email address. The two solutions complement each other and together form a waterfall approach to validation.
First, we check respondents against a database to see if their name matches the names on file at the physical address they provide. If this match fails, we check them against a secondary database that uses their email address as the key for validation. If they fail both checks, we say the respondent is not valid.
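The waterfall just described can be sketched in a few lines of Python. This is only an illustration of the control flow: the two dictionaries here are hypothetical stand-ins for what, in practice, are lookups against proprietary name/address and email databases, and the function names are mine, not TrueSample's.

```python
# Stand-in "databases" for illustration only (assumed, not real data sources).
POSTAL_DB = {("Jane Doe", "12 Oak St"): True}   # name/address records
SOCIAL_DB = {"sam@example.com": True}           # email records

def validate_respondent(name, address, email):
    """Waterfall validation: try name/address first, fall back to email."""
    if POSTAL_DB.get((name, address), False):
        return "valid (postal)"
    if SOCIAL_DB.get(email, False):
        return "valid (social)"
    return "not valid"

print(validate_respondent("Jane Doe", "12 Oak St", "jane@example.com"))  # valid (postal)
print(validate_respondent("Sam Lee", "9 Elm Ave", "sam@example.com"))    # valid (social)
print(validate_respondent("Pat Kim", "1 Pine Rd", "pat@example.com"))    # not valid
```

The key point is the ordering: the secondary (email) check is only consulted for respondents the primary (postal) check could not validate.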
You can see why we need both solutions to work together. Obviously, not everyone is going to be found in any one database. People in certain demographics, such as 18-24 year-olds, are less likely to appear in a name/address database, so even passing them through a second name/address database would only give an incremental lift. You need a different way of sourcing this data – hence the email address database.
So how does the validation work? Well, that is basically what I presented at CASRO along with Susan Frede from Lightspeed Research. We found that the name/address validation works reasonably well, validating about 89% of the people on the Lightspeed panel (this number varies from panel to panel, and typically, we see about 80% of the people being validated through this process).
We saw that this validation percentage was skewed lower in the 18-24, Hispanic and African-American demographics. But when we ran the non-validated people against the secondary database, we were consistently able to validate over 50% of them in the secondary database, across all demographics. This took the overall validation percentage from 89% to 95% (on other panels, this typically takes the validation percentage from 80% to 90%).
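The arithmetic behind that lift is straightforward to check. In the sketch below, the 55% secondary pass rate is an illustrative assumption standing in for the "over 50%" figure reported above; the function name is mine.

```python
def overall_rate(postal_rate, secondary_rate):
    """Share of respondents validated after the two-step waterfall:
    those who pass the postal check, plus the fraction of the
    remainder who pass the secondary (email) check."""
    return postal_rate + (1 - postal_rate) * secondary_rate

# Lightspeed panel: 89% postal, assume ~55% of the rest pass email validation
print(round(overall_rate(0.89, 0.55), 2))  # -> 0.95

# Typical panel: 80% postal, 50% of the rest pass email validation
print(round(overall_rate(0.80, 0.50), 2))  # -> 0.9
```

Both reported before/after pairs (89% to 95%, and 80% to 90%) are consistent with a secondary database that validates roughly half of the respondents the first check missed.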
How about the data? We couldn’t just take a bunch of people and call them valid without checking the data they provided, so we ran a survey. The survey was sent to all of these respondents – those who passed the name/address validation, those who failed the name/address check but passed email validation, and lastly, those who failed both. For the data quality experiments, we compared the responses to the survey questions across these three categories of respondents to see how they looked against each other.
Before we get into that, let me give you an analogy that might help make this clearer. Let us say you have your eyes closed, and you are asked to separate a bunch of shorter people from a bunch of taller people in a room, where most of the people are shorter. All you have to guide you is the angle from which the voice comes when they speak to you.
You can probably do a decent job of that, and when you are done, the group of shorter people is mostly shorter with a few taller people mixed in, and the group of taller people has some shorter people in it as well. When you measure the average heights of the two groups, there is probably a difference. Now, you close your eyes again and are asked to reclassify people from the taller group based on some other characteristic – say, shoe size. You move the people you now consider shorter from the taller group into the shorter group. If you do a good job of this, then when you measure the average heights of the two groups, they should be even further apart, right?
Well, that is exactly what we tried to see in the surveys. In the analogy above, replace the heights with the answers to the questions, the voice-angle method with RealCheck Postal, and the shoe-size method with RealCheck Social. Replace the shorter people with the validated people and the taller people with the non-validated ones – no offense meant to either group.
For starters, we found that the name/address validated (RealCheck Postal) people answered differently when compared to the name/address non-validated people. We have seen this consistently in previous experiments as well. If you look at the picture above, each dot represents a question. The value is the average answer on a scale of 1-5 to the question, with the x-axis representing the respondents that passed validation and the y-axis representing those that failed.
If the two sets of respondents were no different from each other, you would expect to see the dots spread on both sides of the 45-degree line, meaning there is no bias between the two. However, the dots fall more consistently on one side of the line, indicating that there is a bias and that the two groups differ in their responses.
Next, we look at the email validation, i.e. RealCheck Social. What this does is move a bunch of people from the y-axis (non-valid) to the x-axis (valid). If we just moved people over at random and called them good, the bias should decrease, and the dots would spread more evenly across the 45-degree line. However, if we move mostly the right people over, the dots should, at a minimum, stay where they are or, better yet, move further away from the 45-degree line. That would mean the two-step validation process does a good job of separating valid from non-valid people.
This is exactly what we see in the picture on the right, telling us that RealCheck Postal and RealCheck Social do work in improving your data quality.
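The diagonal check described above can be sketched with made-up data. Each question contributes one dot: (mean answer among validated respondents, mean answer among non-validated respondents). The simulated offset here is an assumption purely for illustration; it is not the actual survey data from the experiment.

```python
import random

random.seed(0)
n_questions = 10
points = []
for _ in range(n_questions):
    valid_mean = random.uniform(2.0, 4.0)          # mean answer on a 1-5 scale, validated group
    # Assumed for illustration: non-validated respondents answer systematically higher
    invalid_mean = valid_mean + random.uniform(0.1, 0.5)
    points.append((valid_mean, invalid_mean))

# Dots above the 45-degree line (y > x) indicate non-validated
# respondents answering higher than validated ones on that question.
above = sum(1 for x, y in points if y > x)
print(f"{above} of {n_questions} questions fall above the diagonal")
```

With a systematic offset like this, every dot lands on the same side of the diagonal; with no real difference between the groups, you would expect dots scattered roughly evenly on both sides.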