When Google first shared its internal secret sauce for massive data processing, back in December 2004, the term ‘Big Data’ carried none of the fuss we have witnessed since.
As a matter of fact, the term had been used several times before – but as far as I’m concerned, Google’s document marks the moment the data era started. That was 10 years ago.
A few months later, Doug Cutting made history by developing Hadoop.
Heavily influenced by Google’s document (and his son’s favorite toy), Hadoop rapidly changed the technology ecosystem, and amazing new tools leveraging common concepts started to pop up like mushrooms after rain.
Happy birthday Big Data, you’re 10 years old!
Oh, sweet child; Where do we go now?
I believe that the coming 10 years are going to be just as interesting for Big Data:
New technologies relying on Hadoop as infrastructure, the rise of data science and the ease of data analytics, sophisticated computer vision, trends like the Internet of Things and wearable devices – all of these and more are going to take us in new, interesting, sometimes unpredictable directions.
Nonetheless, I’ve decided to take a bold step and try to predict what the coming 10 years will look like for big data.
In the coming posts, I’m going to lay out my forecasts for the future of big data, 10 years from now, focusing on 3 categories – technology, data analysis and science, and data products and privacy. This post focuses on technology.
Big Data Technology – Where do we go now?
First things first, let’s see how technology is going to evolve in the coming 10 years.
I believe that data platforms rest on 5 pillars: Data Repositories, Data Transformation, Data Retrieval, Data Visualization and Data Science.
These are my 6 predictions for the coming 10 years, related to the above pillars.
1. Data Repositories: Hadoop will become THE platform for any data driven technologies.
Hadoop as a platform has come a long way since it was first developed as a distributed file system and MapReduce enabler.
Ten years later, we are witnessing many new technologies built on top of the Hadoop Distributed File System (HDFS), leveraging the parallelism, high availability and robustness of the system.
It’s no longer ‘just’ a batch-oriented system – you can now run interactive queries using Impala and Spark, build search engines using Solr and Elasticsearch, and run real-time event processing using Spark Streaming, all governed by YARN.
But we’re kind of stuck.
At its very core, HDFS is a distributed file system; as a matter of fact, it’s a pretty lame one. It supports only immutable objects, can’t handle many small files very well, is quite slow for direct access, and has other limitations.
These limitations are now holding Hadoop back from its true destiny – becoming the infrastructure for any data-driven technology, riding on its scalability and popularity.
That has to change at the level of its core architecture: HDFS needs all the basic features of any other respectable filesystem. MapR has already made significant progress in this direction.
Once HDFS becomes a true filesystem, with features on par with the ext* family and NTFS, we’ll see the true adoption of data infrastructures that really utilize its potential.
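To put a rough number on the small-files complaint, here is a back-of-the-envelope sketch. It assumes the HDFS NameNode keeps one in-memory metadata object per file and one per block, and it uses two figures that are my assumptions rather than anything measured here: a 128 MB block size and roughly 150 bytes per object, a commonly quoted rule of thumb. The function names are made up for illustration.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # assumed block size in bytes
BYTES_PER_OBJECT = 150           # assumed rule-of-thumb metadata cost per file or block


def metadata_objects(file_count, file_size):
    """Count metadata entries: one per file plus one per block of each file."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return file_count * (1 + blocks_per_file)


def metadata_bytes(file_count, file_size):
    """Approximate metadata memory under the assumptions above."""
    return metadata_objects(file_count, file_size) * BYTES_PER_OBJECT


# The same 1 TiB of data, stored two ways.
TOTAL = 1024 ** 4
small = metadata_objects(TOTAL // (100 * 1024), 100 * 1024)   # 100 KiB files
large = metadata_objects(TOTAL // (1024 ** 3), 1024 ** 3)     # 1 GiB files

print(small, large)
```

Under these assumptions the same amount of data costs thousands of times more metadata entries when it is chopped into small files, which is the heart of the problem.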
At first, we’ll see the ‘new age’ data infrastructures migrating onto HDFS – Cassandra, Elasticsearch, MongoDB and others – but then the giants will have to follow as well:
- Do you need a filesystem to host your millions of images? Use Hadoop.
- Oracle running on HDFS? You bet! It’s just a matter of time.
- Looking for a storage solution with easy search capabilities? Try Lucene on HDFS.
- Real-time processing of data? Spark Streaming.
2. Data Transformation: ETL = CEP = Spark
ETL stands for Extract, Transform and Load, or in plain English – pull the data, change it, and stick it back somewhere else. Hadoop’s MapReduce makes the ultimate ETL infrastructure, especially for the ‘T’ part – it lets you easily manipulate any amount of data in a very effective way.
CEP stands for Complex Event Processing, or in plain English – get data in real time, manipulate it (usually by joining it with other data sources and/or aggregating it over a window of time) and take an action upon it or stick it somewhere else. Real-time data manipulation is today’s promise, using tools like Storm, Spark Streaming or any of the commercial CEP solutions.
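Here is the pull, change and stick-it-back cycle as a minimal sketch. It is plain Python rather than MapReduce, and the sample data, the function names and the dictionary standing in for a target store are all made up for illustration.

```python
import csv
import io

RAW = "user,amount\nalice,10\nbob,5\nalice,7\n"   # hypothetical source data


def extract(text):
    """Pull the data: parse CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Change it: total the amount per user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0) + int(row["amount"])
    return totals


def load(totals, target):
    """Stick it somewhere else: a plain dictionary stands in for the target store."""
    target.update(totals)
    return target


warehouse = load(transform(extract(RAW)), {})
print(warehouse)
```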
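The aggregate-over-a-window part can be sketched in a few lines. This is not Storm or Spark Streaming code, just plain Python over a list of made-up events; the function names, the 10-second window and the alert threshold are assumptions for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed windows and count them per key."""
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)


def alerts(counts, threshold):
    """Take an action: report every (window, key) pair at or above the threshold."""
    return sorted(pair for pair, n in counts.items() if n >= threshold)


events = [(0, "login_failed"), (3, "login_failed"), (8, "login_failed"),
          (12, "login_failed"), (14, "purchase")]
counts = tumbling_window_counts(events, window_seconds=10)
print(alerts(counts, threshold=3))
```

A real CEP engine would do the same bookkeeping incrementally as events arrive, instead of over a finished list.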
At the end of the day, both concepts are very similar. The main difference is that ETLs are usually considered to be batch processing jobs while CEPs are more real time creatures, but at their core, they do the same job – manipulate data and move it forward.
When you come to develop an ETL+CEP environment, you often ask yourself how to tie these tasks together – I want real-time signals and also 3 years of data, both based on the same logic.
Nathan Marz, the ‘father’ of Storm, addresses this question by proposing the Lambda architecture, designed to answer any data question with any freshness from 100 ms to 100 years. I highly recommend reading his writings and books, but if you haven’t had the chance yet – in a nutshell, what he suggests is to stream the data through a real-time aggregation process (CEP) and also through a batch aggregation process (ETL), and to join the two at query time in order to get accurate data with no latency and no limits. While this is a nice and elegant solution, the concept has an inherent flaw – if you develop two data manipulation processes, you end up with two unsynchronized results.
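The join-at-query-time idea can be sketched as follows. This is my own toy reading of it, not code from Marz’s writings: the function names, the event tuples and the cutoff are made up, and both views are computed from the same in-memory list for brevity.

```python
def batch_view(events, cutoff):
    """Batch process (ETL): totals for everything up to and including the cutoff."""
    totals = {}
    for timestamp, key, value in events:
        if timestamp <= cutoff:
            totals[key] = totals.get(key, 0) + value
    return totals


def realtime_view(events, cutoff):
    """Real-time process (CEP): totals for everything newer than the cutoff."""
    totals = {}
    for timestamp, key, value in events:
        if timestamp > cutoff:
            totals[key] = totals.get(key, 0) + value
    return totals


def query(key, batch, realtime):
    """Join the two views at query time."""
    return batch.get(key, 0) + realtime.get(key, 0)


events = [(1, "clicks", 5), (2, "clicks", 3), (9, "clicks", 2), (9, "views", 7)]
CUTOFF = 5
batch = batch_view(events, CUTOFF)
speed = realtime_view(events, CUTOFF)
print(query("clicks", batch, speed))
```

Note that the aggregation logic is written twice, once per view. Change one copy without the other and the merged answer quietly drifts, which is exactly the flaw described above.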
What’s the solution?
The new kid on the block: Spark. Apart from its main quality as a MapReduce replacement and a fast data processing engine, it is also a unified mechanism for developing any kind of data manipulation job, both batch and real-time oriented.
In the coming years, we’ll be witnessing Spark getting a larger and larger footprint, and in 10 years (together with 3rd party tools leveraging its code) it will dominate the data transformation world for both batch and real-time jobs.
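What “unified” buys you is that the logic is written once. The sketch below shows the idea in plain Python; it does not use Spark’s API, and the function names, the record layout and the small-batch loop standing in for a streaming engine are all assumptions for illustration.

```python
def parse_and_filter(records):
    """The single piece of logic: keep purchases and convert cents to dollars."""
    return [(user, cents / 100) for user, kind, cents in records if kind == "purchase"]


def run_batch(all_records):
    """Batch job: apply the logic to the full data set at once."""
    return parse_and_filter(all_records)


def run_streaming(record_source, batch_size):
    """Streaming job: apply the very same logic to small batches as they arrive."""
    results = []
    chunk = []
    for record in record_source:
        chunk.append(record)
        if len(chunk) == batch_size:
            results.extend(parse_and_filter(chunk))
            chunk = []
    if chunk:   # flush whatever is left at the end of the stream
        results.extend(parse_and_filter(chunk))
    return results


records = [("ann", "purchase", 1250), ("bob", "view", 0), ("cat", "purchase", 300)]
print(run_batch(records) == run_streaming(iter(records), batch_size=2))
```

Because both jobs call the same function, there is only one place where the logic can change, so the two results cannot drift apart.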
3. Data Retrieval #1: The return of SQL – Structured Standard Query Language
NoSQL is a pretty successful paradigm – the tech companies behind MongoDB, Cassandra, Couchbase and others have gained huge traction and success in the past 5 years. There are two main reasons for that –
- SQL databases were not ready for the data boom – even today, scaling a MySQL cluster is not a trivial task (though possible). Oracle and SQL Server are still behind.
- Modeling data in a two-dimensional structure feels unnatural (good luck modeling a user’s multiple purchases, with multiple products in each purchase, in a SQL database).
Nevertheless, having an easy method to query data is crucial – you should not have to be a developer to run queries.
Almost every company behind a NoSQL database has created a proprietary query language, often one that strangely resembles SQL. But as of today, there’s no single standard for querying NoSQL DBs.
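To make the second point concrete, here is the purchases example as one nested document, followed by what it takes to squeeze it into two-dimensional tables. The field names and the three-table layout are my own illustration, not a recommended schema.

```python
order_document = {
    "user": "dana",
    "purchases": [
        {"purchase_id": 1, "products": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
        {"purchase_id": 2, "products": [{"sku": "C3", "qty": 5}]},
    ],
}


def flatten(doc):
    """Turn one nested document into the rows of three two-dimensional tables."""
    users = [(doc["user"],)]
    purchases = []
    purchase_items = []
    for purchase in doc["purchases"]:
        purchases.append((purchase["purchase_id"], doc["user"]))
        for product in purchase["products"]:
            purchase_items.append((purchase["purchase_id"], product["sku"], product["qty"]))
    return users, purchases, purchase_items


users, purchases, purchase_items = flatten(order_document)
print(purchase_items)
```

One document becomes three tables that have to be joined back together on every read.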
Standards are crucial for the economy!
From the customers’ point of view – it lets them easily replace one piece of technology with another. As long as both parts ‘speak’ the same language, the goal is to make this transition nearly seamless (unfortunately, it never actually works like that).
From the vendors’ side – it lets them develop solutions that can easily scale, since they don’t require expensive integration projects.
We’re lacking a Standard Query Language (SQL?) that will be adopted by everyone, including the NoSQL players, to get the real movement going. This language, which should probably be based on extensions of good old SQL, should support 2D tables and also complex documents, including records, arrays and maps. Surprisingly enough, the leading technology in the current SQL world that is already doing so is PostgreSQL.
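As a toy illustration of what select-and-filter over records, arrays and maps could feel like, here is a sketch in plain Python. The helpers get_path and select and the dotted-path notation are hypothetical; they are not PostgreSQL syntax or any vendor’s query language.

```python
def get_path(document, path):
    """Follow a dotted path such as 'address.city' or 'tags.0' through maps and arrays."""
    current = document
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]   # numeric parts index into arrays
        else:
            current = current[part]        # other parts are map keys
    return current


def select(documents, fields, where):
    """Return the requested fields of every document for which where(document) is true."""
    return [tuple(get_path(d, f) for f in fields) for d in documents if where(d)]


documents = [
    {"name": "ann", "address": {"city": "Oslo"}, "tags": ["vip", "beta"]},
    {"name": "bob", "address": {"city": "Rome"}, "tags": ["beta"]},
    {"name": "cy", "address": {"city": "Lima"}, "tags": ["vip"]},
]
rows = select(documents, ["name", "address.city", "tags.0"],
              where=lambda d: "beta" in d["tags"])
print(rows)
```

The result is an ordinary two-dimensional list of rows, even though the source documents are nested.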
4. Data Retrieval #2: Big Data Analytics – select anything from anywhere where whatever
In the old days (5 years ago), when a manager wanted a new report, a request was fired off to the R&D department to develop this new capability.
R&D had to take into consideration the report UI, data modeling, ETLs, aggregations/cubes, scheduling, backups and so forth. It could take a few months until that request was satisfied.
The main reason was that queries in traditional databases were just not fast enough – querying even just a few million records in Oracle can take minutes, which is pretty lame compared to the slick experience you’re used to when running Google Analytics.
In order to increase query performance, you had to pre-calculate the results and serve them once crunched and ready. If the manager then asked to filter by State – another 3 months could pass, because the aggregation that was created did not maintain this hierarchy.
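The filter-by-State trap is easy to reproduce in a few lines. The sketch below pre-calculates totals for a chosen set of dimensions; the sample records and the function name build_aggregate are made up for illustration.

```python
from collections import defaultdict

# Hypothetical raw records: (month, state, amount)
sales = [("2014-01", "NY", 100), ("2014-01", "CA", 50), ("2014-02", "NY", 70)]

COLUMN_INDEX = {"month": 0, "state": 1}


def build_aggregate(records, dimensions):
    """Pre-calculate totals keyed only by the chosen dimensions."""
    totals = defaultdict(int)
    for record in records:
        key = tuple(record[COLUMN_INDEX[d]] for d in dimensions)
        totals[key] += record[2]
    return dict(totals)


by_month = build_aggregate(sales, ["month"])
by_month_state = build_aggregate(sales, ["month", "state"])
print(by_month)
print(by_month_state)
```

Once only by_month has been built, the state is gone from the result: answering the new question means going back to the raw data and building a second aggregate.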
Prof. Michael Stonebraker, the mind behind C-Store and Vertica, showed the world the simple fact that some data formats are meant for writing (row oriented) while others are meant for reading (column oriented). On this basis he founded Vertica, which was later acquired by HP.
Thanks to Mr. Stonebraker (conception-breaker), we now understand that it is possible to query anything on any filter without pre-calculating.
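The row versus column distinction can be shown with two in-memory layouts of the same data. This is only a sketch of the idea, not how Vertica or any other engine stores data; the records and function names are made up.

```python
# Row oriented: each record is kept together, which makes appending a record cheap.
rows = [
    {"id": 1, "state": "NY", "amount": 100},
    {"id": 2, "state": "CA", "amount": 50},
    {"id": 3, "state": "NY", "amount": 70},
]


def to_columns(records):
    """Column oriented: each field is kept together, which makes scanning a field cheap."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_for_state(cols, state):
    """An ad hoc filter that touches only the two columns the query needs."""
    return sum(a for s, a in zip(cols["state"], cols["amount"]) if s == state)


columns = to_columns(rows)
print(total_amount_for_state(columns, "NY"))
```

The query never reads the id column at all, and with wide tables that skipped work is where a column layout gains its read speed.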
While it works very well for technologies such as Vertica, Greenplum, Amazon Redshift, Sybase IQ and some others – we’re still lacking a serious breakthrough in the Hadoop ecosystem.
True, the well-publicized war between Impala and Hive and Spark, and between the file formats Parquet and ORC, looks very interesting, and one side will have the upper hand eventually. But it is clear that both are 3-4 years behind the non-Hadoop vendors.
5. Data Visualization: beyond bars and gauges – visualization of documents
It is really funny how the BI vendors of the world have built their tools driven mostly by the limitations of the underlying repositories and their two-dimensional models, and not by how humans actually think. It’s even funnier that we’ve got used to it.
Every kid can take a table in Excel and make different charts out of it, but how would you visualize a JSON file?
The NoSQL movement caught the BI players off guard: they were just not built to visualize complex objects, only tables. It looks like they were just hoping it would go away somehow.
But the world had a different plan – it needed a solution.
Three guys picked up the gauntlet and developed D3.js – a data-driven visualization library aiming to go beyond the simple art of visualizing 2D tables. Using D3 (and other similar libraries) you can now chart graphs, tree maps, word clouds, chord diagrams and more. Really amazing stuff.
Since then, many of the BI players have also started to adopt these concepts, embedding them into their tools as well.
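One way to think about charting a document is to first turn it into a tree of named nodes, the kind of shape that tree and tree map layouts work from. The sketch below does that for any parsed JSON value. The function names and the node layout (a "name" plus either "children" or a "value") are my assumptions for illustration, not a required input format of any particular library.

```python
import json


def to_hierarchy(name, value):
    """Convert parsed JSON into nested nodes; leaf nodes carry the original value."""
    if isinstance(value, dict):
        return {"name": name, "children": [to_hierarchy(k, v) for k, v in value.items()]}
    if isinstance(value, list):
        return {"name": name, "children": [to_hierarchy(str(i), v) for i, v in enumerate(value)]}
    return {"name": name, "value": value}


def count_leaves(node):
    """Number of leaf nodes, a simple stand-in for how much there is to draw."""
    if "children" not in node:
        return 1
    return sum(count_leaves(child) for child in node["children"])


document = json.loads(
    '{"user": "ann", "purchases": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}'
)
tree = to_hierarchy("root", document)
print(count_leaves(tree))
```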
While today data visualization is 99% 2d tables driven and 1% document driven – I think that in 10 years it will be 50-50.
6. Data Science: Commoditizing data science
Today, when a developer wants to develop a market basket analysis or a real-time prediction system, or to find exceptions in financial data – he needs to be able to fluently discuss the entropy of continuous variables, the binning of non-discrete data, standard errors of proportions and the R-squared distance from a line.
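For a taste of what hides behind that vocabulary, here are three of those quantities as a minimal sketch with the standard textbook formulas. The entropy function works on a discrete distribution, which is what you have once a continuous variable has been binned; the function names are my own.

```python
import math


def entropy(probabilities):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def standard_error_of_proportion(p, n):
    """Standard error of a sample proportion p observed over n trials."""
    return math.sqrt(p * (1 - p) / n)


def r_squared(xs, ys):
    """Coefficient of determination of a least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


print(entropy([0.5, 0.5]))
print(standard_error_of_proportion(0.5, 100))
print(r_squared([1, 2, 3, 4], [2, 4, 5, 9]))
```

None of this is hard to type. Knowing when each number means something is the part that still takes a mathematician.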
Data science is a sexy job, but it’s pretty darn complicated. Sex shouldn’t be complicated.
The data science problem has been in the hands of mathematicians for too long, and too little in the hands of developers.
I know data, but I’m not really sure how the electricity to my house really works. It just works. Black boxing is the ultimate answer to complexity.
I think that data science in 10 years will be the BI of today – you’ve gotta have it in your stack, and all you’ve got to do is to hook the wires correctly and it works like a black box charm.
It is the 10-year anniversary of Big Data, counting from Google’s document about MapReduce back in December 2004. These are my 6 predictions for the coming 10 years:
- Oracle will run on Hadoop
- ETLs and CEP will be done together with Spark.
- SQL will support document data and will be adopted by the NoSQL DBs.
- We’ll be charting a lot more complex documents and fewer 2D tables.
- Developers will be able to run predictive models without a Ph.D in Mathematics.
Next posts will focus on data analysis/science and data products.