Getting comfortable with data sampling in the growing data world
“Big data” is today’s buzzword, just the latest of many. However, I think analytics professionals will agree: “big data” is not necessarily better than “small data” unless you use it to make better decisions.
At a recent conference, I heard it proposed that “Real data is better than a representative sample.” With all due respect, I disagree. That kind of logic assumes that a “representative sample” is not, in fact, representative.
If a so-called “representative” sample does not accurately reflect the complete data set, and relying on it would lead to different conclusions, then using the “real” data is absolutely better. However, that is not because “real” data is somehow superior, but because the sample is not serving its intended purpose.
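To make the distinction concrete, here is a minimal sketch in Python. Everything in it is assumed for illustration: the synthetic order values, the mobile and desktop segments, and the sample sizes are hypothetical, not data from any real site. It compares a simple random sample with a convenience sample that only ever sees one segment.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical complete data set: (device, order value) for 100,000 visits,
# where desktop orders tend to be larger than mobile orders.
full_data = [("mobile", rng.gauss(30.0, 5.0)) for _ in range(50_000)]
full_data += [("desktop", rng.gauss(70.0, 5.0)) for _ in range(50_000)]


def mean_value(records):
    """Average order value over a list of (device, value) records."""
    return statistics.fmean([value for _, value in records])


# A simple random sample draws from every kind of visit.
random_sample = rng.sample(full_data, 2_000)

# A convenience sample that only ever sees desktop visits.
desktop_only = [record for record in full_data if record[0] == "desktop"]
biased_sample = rng.sample(desktop_only, 2_000)

print(f"complete data:      {mean_value(full_data):.1f}")
print(f"random sample:      {mean_value(random_sample):.1f}")
print(f"desktop-only sample: {mean_value(biased_sample):.1f}")
```

In this made-up setup, the random sample should land close to the complete data, while the desktop-only sample overstates the average. The problem is how the second sample was drawn, not sampling as such.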
On the flip side, let’s assume the representative sample does actually represent the complete data set, and would reveal the same results and lead to the same decisions. In this case, what are the benefits of leveraging the sample?
- Speed – sampling is typically used to speed up the process, since the analytics process doesn’t need to evaluate every collected record.
- Accuracy – if the sample is representative (the assumption we are making here), using the full or the sampled data set should make no practical difference. Results will be accurate enough to support the same decisions.
- Cost savings – a smaller, sampled data set requires less effort to clean and analyse than the entire data set.
- Agility – by gaining time and freeing resources, digital analytics teams can become more agile and responsive, acting wisely on small (sampled) data.
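The accuracy point can be put in rough numbers. The sketch below uses the standard normal-approximation formula for the margin of error of a proportion; the 3% conversion rate and the sample sizes are hypothetical values chosen for illustration, and the formula assumes a simple random sample drawn from a much larger data set.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated
    from n randomly sampled records (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)


rate = 0.03  # hypothetical conversion rate

for n in (1_000, 10_000, 100_000, 1_000_000):
    points = margin_of_error(rate, n) * 100
    print(f"n={n:>9,}: {rate * 100:.2f}% +/- {points:.2f} points")
```

Under these assumptions, the margin depends on the size of the sample rather than on the size of the complete data set, and it shrinks only with the square root of the sample size. Past a certain point, processing more records buys very little extra precision.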
There is no doubt that technology continues to develop rapidly. Storage and computing power that used to require floors of space now fit into my iPhone 5. However, the volume of data we leverage is growing at the same rate (or faster!). The bigger data gets, and the quicker we demand answers, the more sampling will become an accepted best practice. After all, statisticians and researchers in the scientific community have been satisfied with sampling for decades. Digital, too, will reach this level of comfort in time, and will focus on decisions instead of data volume.
What do you think?