Hadoop has revolutionised data management since 2006 with the promise of storing enormous amounts of data economically in distributed environments and processing it as simply as possible. The framework took quite a hit over the past year, unsettling vendors and users alike. Still, Hadoop will stay with us for a while.
With acquisitions valued at around $18 billion, 2019 saw true tectonic shifts in the big data industry, including the acquisitions of Tableau by Salesforce, Looker by Google, and Hedvig by CommVault. This wave of consolidation undoubtedly signals a sea change in Hadoop’s outlook. But even after the recent rollercoaster ride of Cloudera, MapR, and other Hadoop players, it is too early to say exactly what that means for the platform. Hadoop’s former superstar status has faded, but its existence is not in question. To put this into perspective, it helps to look back first and then ahead to what comes next.
Hadoop is an open-source, Java-based framework maintained by the Apache Software Foundation. It is designed to store and process huge data sets across clusters of commodity hardware using simple programming models, and to scale from a single server to thousands of machines. It relies on software rather than hardware for high availability: the framework detects and handles failures at the application layer on its own. Hadoop has two core components: the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN).
HDFS is Hadoop’s primary storage system. It uses a NameNode/DataNode architecture to provide high-throughput access to data spread across highly scalable Hadoop clusters. YARN, originally dubbed “MapReduce 2” – the next generation of the hugely popular “MapReduce” – schedules jobs and manages resources for all applications running on the cluster. Hadoop developers commonly build on these two components to create applications that work with extremely large datasets.
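To make “simple programming models” concrete, here is a minimal sketch of a MapReduce map task in Java – HDFS supplies the input splits, and YARN schedules the task containers that run it. The class name and tokenisation logic are illustrative assumptions for this article, not code from any particular Hadoop distribution.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count mapper: HDFS delivers lines of input, YARN runs the task.
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (token, 1) for every whitespace-separated token in the input line;
        // the framework groups the pairs by key and hands them to a reducer.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

A matching reducer would simply sum the counts per token; the point is that the developer writes only these small functions, while Hadoop handles distribution, scheduling, and fault tolerance.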
Hadoop’s origins date back to 2002, when Apache Nutch developers Doug Cutting and Mike Cafarella set out to find a more cost-effective architecture that could meet Nutch’s goal of indexing a billion web pages. Cutting joined Yahoo in 2006 and was given a dedicated team and resources to help develop Hadoop into a web-scale system. In 2008, Yahoo released Hadoop to the Apache Software Foundation as an open-source project, and it was successfully tested on a 4,000-node cluster.
The following year, 2009, Hadoop was successfully tested at petabyte scale for the first time, handling billions of searches and indexing millions of pages in just 17 hours – almost unthinkable at the time. Later that year, Doug Cutting left Yahoo for Cloudera, the first Hadoop-specific company, with the declared goal of bringing Hadoop to other industries. MapR followed in 2009 and Hortonworks in 2011, and Hadoop quickly won favour with Fortune 500 companies, which identified big data as a rapidly evolving and high-value field.
The term “big data” means very different things to different people; perhaps it is best summed up as “much more data with much greater impact”. At some point, companies realised that the data generated by their web and social media presences was either being lost or piling up unused in cheap storage. They recognised that this data could be used to create a better, more personalised user experience that would increase satisfaction and sales alike. They simply lacked the tools to do it cheaply and at scale.
Enter Hadoop! The new technology promised economical large-scale storage and streamlined processing of multi-petabyte data volumes. Thus the idea of company-owned “data lakes” was born – and the glorious era of processing large amounts of data effectively began.
When Hadoop was born and gained popularity, it was the proverbial idea whose time had come (and, as Victor Hugo reminds us, nothing in the world is more powerful than that). Finally, there was a way to store petabytes of data at a fraction of the cost of traditional data warehousing. But companies soon realised that storing data and processing it are two completely different challenges. Their data lakes turned into “data swamps”: data kept pouring in, but little use was made of what was stored there.
Data architects began rethinking their massive data lakes despite promises from Cloudera, MapR, and other vendors to deliver cloud-like flexibility on Hadoop. Cloudera and other Hadoop providers responded to the growing interest in cloud-native solutions with hybrid and multi-cloud offerings such as the Cloudera Data Platform (CDP), which finally launched last March. However, these offerings were largely based on clunky “lift and shift” approaches whose effectiveness and efficiency are still in question.
It was too little, too late. Essentially, the Hadoop vendors were trying to create their own version of lock-in. Instead, they created a new market. By attempting to rein in innovation, they propelled big data organisations straight into the open arms of specialised cloud services for storing, processing, and analysing big data, such as AWS, Azure, and Google Cloud. The people responsible in these organisations have grown accustomed to the freedom, power, and flexibility of cloud-based solutions. There is no turning back.
Hadoop’s free fall over the past year exemplifies the industry’s ongoing transition away from the technology of an earlier era. We are moving away from storing data locally and running billions of batch-based queries, and towards analysing massive datasets in real time in the cloud. Still, Hadoop isn’t going away anytime soon. For now, and for some time to come, companies need to manage the transition while gradually exploring their options in a post-Hadoop world.
Meanwhile, Hadoop-based data lakes will live on for years in industries where time-sensitive analytics matter less and cost trumps efficiency. Hadoop will keep its rightful place in the big data ecosystem. But in dynamic, fast-moving business landscapes, data management will undoubtedly happen in the cloud, and companies need to plan for that transition today. It is time.
Data lakes are a thing of the past because data is not a static, closed construct. We should think of data as a river that cannot be dammed, not as a lake. Data flows are constantly changing, and businesses cannot stand still for migrations, upgrades, or downtime. The data context evolves by the minute, and ensuring data consistency and availability is the real challenge for the data steward – not just filling a reservoir.
And so Hadoop will gradually fade away, as all monolithic technology models inevitably do, giving way to their more dynamic descendants. Users have sided with the freedom inherent in the cloud paradigm. Data should not be drowned in lakes; it must be able to flow freely.