Turing's Man Blog
- Last Updated on Tuesday, 19 August 2014 20:38
- Published on Sunday, 17 August 2014 21:42
- Written by Pawel Wawrzyniak
There is an interesting blog post by Gil Allouche, "Big Data 101", which was recommended in the CITM group on LinkedIn. To be honest, at first impression Big Data seems to be yet another buzzword, much like the previously promoted term "cloud computing". It is definitely not a new, ground-breaking technology; rather, it is an approach to gathering and processing huge amounts of data. So, what is it about, in fact? Why all the buzz around Big Data?
A very short introduction to Big Data
Before going into any discussion, it's not a bad idea to watch the nice and easy introductory video "Explaining Big Data" by Explaining Computers:
Explaining Big Data by Explaining Computers
However, according to Gil's point of view (and I like it), Big Data is:
Put simply, big data is the overarching term for the process of gathering, storing and analyzing extremely large amounts of data (instead of megabyte and gigabyte, we're talking petabyte (the "byte" after terabyte) and potentially exabyte).
From kilobytes to infinities
As we can see, we're no longer talking about kilobytes, megabytes or gigabytes. That's funny, because most of us remember when, in the 80s, 640 kilobytes seemed to be "enough for everyone". Well, maybe not exactly, but at least according to a widespread (and false) IT urban legend. In the IT world, we used to believe that these words about RAM capacity were said by Bill Gates; the exact phrase is: "640K ought to be enough for anybody". For some unknown reason, this myth persists no matter how often it is denied by Bill Gates himself; he has made many official statements trying to convince people that he never limited us to 640 kilobytes, at least not in the given context. There is an official response from Bill Gates cited in the article "The '640K' quote won't go away -- but did Gates really say it?" by Eric Lai, published in Computerworld online:
I've said some stupid things and some wrong things, but not that. No one involved in computers would ever say that a certain amount of memory is enough for all time.
Of course, the legendary "640K ought to be enough for anybody" sentence refers to operational (volatile) memory, which is used for quick data processing, and not to storage (persistent) memory, which is used for long-term data gathering, but anyway… We know that once upon a time kilobytes were the measure of important data and megabytes seemed almost like infinity. Today we keep moving our requirements forward, and the amounts of data are growing at a ridiculous pace. I believe we should look at the data metrics right now to get the proper perspective for further discussion.
|Multiples of bytes|
|Decimal value|SI name|Binary value|Customary name|IEC name|
|1000|kB kilobyte|1024|kilobyte|KiB kibibyte|
|1000^2|MB megabyte|1024^2|megabyte|MiB mebibyte|
|1000^3|GB gigabyte|1024^3|gigabyte|GiB gibibyte|
|1000^4|TB terabyte|1024^4|terabyte|TiB tebibyte|
|1000^5|PB petabyte|1024^5|petabyte|PiB pebibyte|
|1000^6|EB exabyte|1024^6|exabyte|EiB exbibyte|
|1000^7|ZB zettabyte|1024^7|zettabyte|ZiB zebibyte|
|1000^8|YB yottabyte|1024^8|yottabyte|YiB yobibyte|
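At these scales, the gap between decimal (SI) and binary (IEC) prefixes becomes quite visible. A minimal Python sketch (the helper name `human_size` is my own, not a standard-library function) that formats a byte count both ways:

```python
# Format a byte count using decimal (SI) and binary (IEC) prefixes.
SI_UNITS  = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
IEC_UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"]

def human_size(n_bytes, binary=False):
    base = 1024 if binary else 1000
    units = IEC_UNITS if binary else SI_UNITS
    value = float(n_bytes)
    for unit in units:
        if value < base or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= base

# One decimal petabyte falls short of a full pebibyte:
print(human_size(10**15))               # 1.00 PB
print(human_size(10**15, binary=True))  # 909.49 TiB
```

The same 10^15 bytes read as a round "1.00 PB" in SI prefixes but only about 909 TiB in binary prefixes, which is exactly why the kibibyte/mebibyte naming was introduced.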
Any and all data make up Big Data
So, as of today, we're at the level of petabytes and exabytes. Great, but why is such an increase happening? The answer is simple: there are more new sources of data today. According to Gil:
It can span the spectrum of internet browsing information, social media posts and mobile app usage to sales data, income reports and millions of movements tracked during a basketball game. Any and all data make up big data. Really, it's whatever data a company or enterprise decides to gather.
In the above citation we have a great definition of what sources of data should be considered when we talk about Big Data (the introductory video also presents such sources very well). Hence, to simplify things, we can assume that all possible sources of data we have today, and all those we will develop tomorrow, are in the domain of Big Data. There is, however, a much more important idea presented here: "whatever data a company or enterprise decides to gather". Here I can see some serious risks.
How much data is enough for Big Data?
Do we really need all this data? Definitely not! The problem is deciding what we actually need. The easy part is collecting as much as we can (there are new possibilities today, with growing storage capacities, processing power, the promises of quantum computing, and cloud computing concepts), but we should remember: data is not equal to information. Going back to the basics of information theory, let's recall that:
Data (/ˈdeɪtə/ DAY-tə or /ˈdætə/ DA-tə) is a set of values of qualitative or quantitative variables; restated, data are individual pieces of information. Data in computing (or data processing) are represented in a structure that is often tabular (represented by rows and columns), a tree (a set of nodes with parent-children relationship), or a graph (a set of connected nodes). Data is typically the result of measurements and can be visualized using graphs or images.
Data as an abstract concept can be viewed as the lowest level of abstraction, from which information and then knowledge are derived.
To complement the above definition, we should also consider how exactly "data" is understood in the field of computing:
Data (/ˈdeɪtə/ DAY-tə or /ˈdætə/) is the quantities, characters, or symbols on which operations are performed by a computer, being stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
Moreover, according to Cisco's estimates, global Internet traffic will reach 4.8 zettabytes a year by 2015, and we're talking only about the data flowing through the network, not the data that is constantly stored! Therefore, the problem of Big Data is not to gather as much as we can from anywhere, but how to select useful information out of the collected data. Useful information? So let's try to answer: what is information? According to a popular definition:
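To get a feel for the scale of that figure, a quick back-of-the-envelope calculation (using the 4.8 ZB/year estimate quoted above) converts it into a per-second rate:

```python
# Convert an estimated 4.8 zettabytes of traffic per year into bytes per second.
ZETTA = 10**21
SECONDS_PER_YEAR = 365 * 24 * 3600   # ignoring leap years for a rough estimate

traffic_per_year = 4.8 * ZETTA                      # bytes per year
traffic_per_sec = traffic_per_year / SECONDS_PER_YEAR

print(f"{traffic_per_sec / 10**12:.0f} TB per second")  # ~152 TB per second
```

Roughly 152 terabytes flowing through the global network every single second, on average, which makes the "select the useful part" problem very concrete.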
Information (shortened as info or info.) is that which informs, i.e. that from which data can be derived. Information is conveyed either as the content of a message or through direct or indirect observation of some thing. That which is perceived can be construed as a message in its own right, and in that sense, information is always conveyed as the content of a message. Information can be encoded into various forms for transmission and interpretation. For example, information may be encoded into signs, and transmitted via signals.
And just to have the full picture (we are, after all, at the level of zettabytes), the basic, atomic chunk of information is:
The bit is a typical unit of information.
So, there is a difference between data and the actual information that can be retrieved from it. We can feel it, and we can see it in the definitions. Yet when we want to measure information itself, we use bits, which in turn are used to measure amounts of… data, up to the level of zettabytes and beyond. That seems a little incoherent, doesn't it? From this perspective, the border between "data" and "information" looks rather blurry. However, the most important question remains: how do we retrieve valuable information from all the data we have?
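One classic way to make the data/information distinction precise is Shannon's entropy: the average number of bits of information carried per symbol of data. A minimal sketch (the function name is mine):

```python
import math
from collections import Counter

def entropy_bits(data):
    """Shannon entropy of a sequence, in bits per symbol."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# The same amount of data (8 symbols each) can carry very different
# amounts of information:
entropy_bits("AAAAAAAA")  # 0 bits/symbol: pure redundancy, no information
entropy_bits("ABCDEFGH")  # 3 bits/symbol: every symbol is informative
```

Eight bytes of repeated "A" are data but carry essentially no information, while eight distinct symbols need the full 3 bits each; the "blurriness" dissolves once we measure information content rather than raw volume.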
Knowledge. The real sign of organizational status
Unfortunately, given the typical corporate mindset, we can assume that in most cases as much data and information as possible will be stored, simply because the option exists. According to the findings of famous Business Intelligence veterans (Martinet and Marti), information itself is commonly treated as an external sign of organizational status, much like representative HQ buildings, huge data centers and so on. Such an approach means that it is always better to have the information: the more we have, the better we feel. However, there is no rational requirement to collect all possible information. We should remember that information is useful only when it adds something extra to what we have already gained… Knowledge.
Here we touch on the next important thing: knowledge. What is it?
Knowledge is a familiarity, awareness or understanding of someone or something, such as facts, information, descriptions, or skills, which is acquired through experience or education by perceiving, discovering, or learning. Knowledge can refer to a theoretical or practical understanding of a subject.
That's what information usability is about. Usable information helps us gain what is really important for every enterprise: knowledge!
Collecting data means costs
The next important factor to consider when we talk about Big Data is cost. What is the ROI (Return on Investment) when we invest in our Big Data solutions, and what is the TCO (Total Cost of Ownership) when we run our own Big Data infrastructure or use solutions provided by cloud operators? The value of information lies not only in its usability, but also in the costs required to gather and process all that data in order to produce useful information and thus gain more knowledge of a given subject (a better understanding of the issue).
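The relationship between the two cost factors can be sketched in a few lines; note that every monetary figure below is a made-up illustration, not a real quote or benchmark:

```python
# Illustrative only: toy TCO / ROI arithmetic for a Big Data deployment.
# All monetary values are invented assumptions for the sake of the example.
capex = 500_000            # one-off: hardware, licenses
opex_per_year = 120_000    # recurring: power, storage, staff
years = 3

tco = capex + opex_per_year * years     # total cost of ownership over the period
estimated_gain = 1_200_000              # assumed value of the insights gained

roi = (estimated_gain - tco) / tco      # return on investment as a ratio

print(f"TCO over {years} years: ${tco:,}")  # TCO over 3 years: $860,000
print(f"ROI: {roi:.0%}")                    # ROI: 40%
```

The point of the sketch is the structure, not the numbers: if the estimated value of the extracted information does not comfortably exceed the TCO, collecting the data is status signaling rather than investment.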
The importance and the purpose of Big Data
Coming back to Gil's article: why is Big Data so important?
Big data is important because with modern technology like the internet, social media and mobile apps on smart phones and tablets, there is more information available and more information being created than ever before. However, without big data technology there would be no way to gather, store or analyze all this information and what could be used would be slow and inefficient. Big data is taking this jumble of information and turning it into a gold mine — a place where companies can go to find solutions to some of their most difficult questions.
Therefore, what is the main purpose of Big Data? It is not to collect data simply because we can, and it is not about processing data to get some (any) information. We want only useful (valuable) information, because that creates the real thing: knowledge! In fact, "knowledge" means all those answers to the most difficult questions mentioned by Gil.
To summarize, Big Data is about gaining more and more knowledge of the business environment in which our company has to stay competitive. This idea is beautifully captured in the words of the famous Greek businessman Aristotle Onassis:
The secret of business is to know something that nobody else knows.
That's the purpose of Big Data! We should only remember that no matter which period of business and entrepreneurship history we look at, Business Intelligence (or, more generally, "intelligence") has always been the most important way of gaining a market advantage over competitors. Nothing has changed here except the volumes of stored and processed data, the available data sources, and the techniques for collecting usable information and producing enterprise knowledge: a motor of further development and a guarantee of staying competitive.
Are we ready to talk about Data Pollution?
However, there is one major risk behind all these idyllic concepts. Seeing the growing amounts of stored and processed data and the actual problems Big Data faces, I would propose seriously considering another term: Data Pollution! Maybe we produce too much data? How many resources (electrical energy in the first place) do we have to spend each year to retrieve usable information from zettabytes of data… to get this Holy Grail, the Knowledge!
Should we also look at Big Data from this perspective? I believe yes!