Big Data: Does Size Really Matter?
If you google the term Big Data, you will get 94,300,000 results in about 1.14 seconds. (No lie, I just tried it. Talk about size, man, they must have some huge indexes!) If you search for big data definitions, you get about 3,510,000 results in less than one second. Three and a half million results on the definition alone! From talking to business intelligence and analytics professionals all over the globe, I would say that number is a little low. There are a lot of definitions out there!
It seems that getting a straight answer on what big data is rivals the search for the elusive “single version of the truth”. Some will say it’s anything over a petabyte; others will argue that it’s not the size but the type and breadth of data, e.g. social media feeds, IoT information, etc. For the purpose of this article we will just think of big data as stuff that’s too large to fit in an Excel spreadsheet... easily. I say easily because the current limits of an Excel worksheet are 1,048,576 rows and 16,384 columns, but who would ever want to fill that?
The truth is, whether you are dealing with 7 petabytes of data like my man Rajeev Guliani at Netflix or you just have a 1-million-row sales transaction table, if it’s big to you, then it’s big.
Does Size Matter?
The answer to this question in just about any context is fairly consistent: it depends. When it comes to dealing with large data sets, the larger the dataset, the more specialized the tools, planning and skills needed from a technical perspective. For example, you can’t approach an ETL routine that hits a SQL Server database on the same server and extracts 100 records a day from an HR management system the same way you would approach ingesting real-time FAA feeds and relating them to all the tweets and Facebook posts about flying, including customer service, safety and sales. There are different things to consider, including storage, retention policies, speed, indexes, etc. It could be argued that you will need this sort of planning regardless of the data size, but one must admit two things:
- Scale increases complexity
- Smaller scale is more forgiving of architecture mistakes (long-running queries, missing indexes, etc.)
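To make the missing-index point a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns and the `idx_sales_customer` index are made up for illustration, and SQLite is just a convenient stand-in for whatever database you actually run. At 10,000 rows you would never notice the full scan; at a few billion rows you would.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the human-readable plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the detail text

# Hypothetical sales table, small enough that a full scan goes unnoticed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (customer_id, amount) VALUES (?, ?)",
    ((i % 1000, float(i)) for i in range(10_000)),
)

lookup = "SELECT SUM(amount) FROM sales WHERE customer_id = ?"

# Without an index, the only option is to read every row.
plan_without_index = query_plan(conn, lookup, (42,))

# With an index on the filter column, SQLite can seek to the matching rows.
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")
plan_with_index = query_plan(conn, lookup, (42,))

print(plan_without_index)
print(plan_with_index)
```

The first plan should report a scan of the whole table and the second a search that uses the index. The exact wording varies between SQLite versions, so treat the printed text as a hint and not as a contract.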
As it relates to the technical process needed to navigate a big data set, yes, size matters. It matters in the planning and processing of the data. It matters in the tool selection and the database choice.
BUT (And there’s always a "but")
When it comes down to the end game, the customer, the business: does size matter? I would say emphatically: NO. At the end of the day the user is trying to get a question answered, usually a small one. As I write this article I’m sitting in the Admirals Club lounge in DFW Airport’s lovely international Terminal D (lovely place to visit, BTW). There is a veritable plethora of data around me. There are signs for other lounge areas, the business center and the print/copy place. There is a bar with 100 drink options. The concierge is announcing over the loudspeaker which flight is coming up. There is a tablet next to me that can check the status of any flight. That’s a lot of data. However, there are only two pieces of data I care about right now: where is the bathroom, and is MY flight on time? My TripCase app will let me know the latter and the sign behind me informs me of the former. Do I really care about all of this other data? What’s the most important piece of data? The one you need right now.
When it comes to data, no matter the size, there is nothing like good planning to make things go well. The larger and more complex the dataset, the more planning will be needed. When it comes to the customer, the smaller the better. Learn how to use massive amounts of data to answer small questions. In short, when it comes to data, think big. When it comes to customers, think small.
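Here is one way "massive data, small question" might look in practice, as a rough sketch and nothing more. The `flight_feed` generator is a hypothetical stand-in for a big feed: it invents a million flight records on the fly, and the flight numbers and delays are made up. The point is that the customer-facing part, `delay_for`, only ever hands back one small answer.

```python
from typing import Iterable, Iterator, Optional

def flight_feed(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for a large feed: lazily yields n made-up records."""
    for i in range(n):
        yield {"flight": f"XX{i}", "delay_minutes": (i * 7) % 45}

def delay_for(records: Iterable[dict], flight: str) -> Optional[int]:
    """Scan the stream and return the delay for one flight, or None if absent."""
    for record in records:
        if record["flight"] == flight:
            return record["delay_minutes"]  # stop as soon as the question is answered
    return None

# A million records go in; one small answer comes out.
my_delay = delay_for(flight_feed(1_000_000), "XX1234")
print("on time" if my_delay == 0 else f"delayed {my_delay} min")
```

Because the feed is a generator, nothing is loaded into memory all at once, and the scan stops as soon as my flight turns up. A real system would more likely use an index or a keyed lookup, but the shape is the same: think big on the way in, think small on the way out.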
Enjoy, don’t worry, B.I.