The term "Big Data" has been around for a while. For most of my career I have dealt with problems around traditional data (relational data), new data structures, non-traditional data, semi-structured data, in-memory data, on-disk data, hard-to-find data, lost data, object-oriented data, object-relational data, old data, new data, archive data, wrong data, corrupted data, THE right data, and all kinds of other data.
Almost daily I receive emails or letters where my name is misprinted, the address is incorrect, or the message was not even intended for me (long story). A few days ago I listened to a longer discussion about problems with various kinds of applications at several companies, all non-IBM-related. One key problem was that, due to failing components in an overly complex environment, data became outdated, went missing, or was incorrect. The consequences hit the "regular guys", ranging from missing or incorrect parts on an assembly line, to timesheets that were not updated, to employees not being able to "badge in". When I talk with my students at the Cooperative State University (who all have a job in industry), many of them have seen or had to deal with incorrect or outdated data and the "information" based on it.
The issues remind me of one of the first lessons in physics at school: "Garbage in, garbage out". For Big Data, the amount of data is in a different dimension, but the same principles apply as with any other data. It is important to know that there are, or could be, issues with the input data. How is that dealt with? Does it impact the algorithms and the result? Is the derived "information" really information? How is the data interpreted? Is it OK to base decisions on it, or is it taken as just one of many indicators? Do those dealing with information derived from Big Data know the source, and is everything fully disclosed? There are many more questions, many of them non-technical. One important question is what questions to ask of the data. Is it to "prove" some speculation and guesswork, or to really find out something new that may be validated by further work? And then we are back to a problem we have faced for years with existing, "small" data.
Now on to a small problem and statistics:
Over the weekend we will have big fireworks here over Lake Constance. My kids want to watch them for sure. I could point out that only 20% of the persons I considered in my statistical sample will attend. My kids could counter that 70% will attend. They chose people aged 8-80 years; I chose those aged 0-8 years...
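The mechanism behind this little dispute can be sketched in a few lines of Python. The population, the attendance rule, and all the numbers below are made up for illustration (they do not reproduce the 20%/70% figures above); the point is only that the choice of sample determines the rate you report.

```python
# Illustrative sketch with invented data: the same population yields
# very different "findings" depending on who you include in the sample.

def attendance_rate(sample):
    """Fraction of people in the sample who plan to attend."""
    return sum(1 for person in sample if person["attends"]) / len(sample)

# A toy population of ages 0-80. Assumption (made up): all young kids
# want to attend, while only some of the older people do.
population = (
    [{"age": a, "attends": True} for a in range(0, 9)]           # ages 0-8
    + [{"age": a, "attends": a % 5 == 0} for a in range(9, 81)]  # ages 9-80
)

kids_sample = [p for p in population if 0 <= p["age"] <= 8]
adult_sample = [p for p in population if 8 <= p["age"] <= 80]

print(f"sample aged 0-8:  {attendance_rate(kids_sample):.0%} attend")
print(f"sample aged 8-80: {attendance_rate(adult_sample):.0%} attend")
```

Both statements printed at the end are "true" for their respective samples; neither tells you much about the whole population unless the sampling is disclosed.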
Context counts, and so does asking the right questions, regardless of whether the data is small or big.