Enormous amounts of data and the demand for rapid data availability are a technical challenge for data-driven companies. The larger the volume of data, the more expensive the storage and the slower the processing. Hadoop promises to solve this Big Data problem: compute-intensive processes on huge amounts of data are distributed across computer clusters to enable faster and more stable operation.
Data distribution with the Hadoop Distributed File System (HDFS) enables high data throughput
Yet Another Resource Negotiator (YARN) enables job scheduling and cluster resource management
MapReduce enables parallel processing by splitting large amounts of data and distributing them across the cluster for processing
Hadoop is a free framework from Apache. It offers companies a solution for processing huge amounts of data quickly. Douglas Cutting originally set out to develop an open-source search engine based on the MapReduce algorithm that could be made available to everyone. Hadoop addresses the challenges posed by the "three Vs" of Big Data: Variety, Volume and Velocity.
The open-source framework Hadoop can be programmed to fit individual needs. Essentially, it groups numerous computers into clusters in order to store enormous amounts of data and process them at high speed. The speed comes from distributing the work across the cluster: each node is assigned a share of the task according to its capacity. In the basic version, Hadoop consists of four components: Hadoop Common, the Hadoop Distributed File System (HDFS), MapReduce and YARN. The core of the framework is the MapReduce algorithm. It splits a computing task into subtasks, which are sent to different nodes for processing. In an intermediate step, the partial results are grouped by key, and in the reduce phase they are combined into the final result.
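The map, shuffle and reduce steps described above can be illustrated with a classic word-count example. This is a minimal pure-Python simulation of the pattern, not Hadoop's actual implementation (which runs the phases on distributed nodes, typically via Java or Hadoop Streaming); the function names and sample documents are chosen for illustration.

```python
from collections import defaultdict

def map_phase(document):
    # Map step: emit one (word, 1) pair per word in the input document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    # Intermediate step: group all emitted values by key,
    # as Hadoop's shuffle does before reduction.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce step: combine the grouped values into one result per key.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data needs big clusters", "hadoop distributes big data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts["big"])  # prints 3
```

In a real cluster, each mapper processes its own data block locally and only the grouped intermediate results travel across the network, which is what makes the approach scale.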
Data can easily be stored and distributed in large amounts across clusters
Data processing is simplified and accelerated through distribution of data across large clusters.
Thanks to the parallel processing of data, companies save a lot of time in evaluations.
The combination of the MapReduce algorithm and YARN enables data processing at the petabyte level (one petabyte is more than one million gigabytes).
Compressing data into suitable formats makes optimal use of storage space and resources.
Hadoop supports structured and unstructured file formats, such as text formats (.csv or .json) and columnar table formats (ORC and Parquet).
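The difference between these format families can be sketched with Python's standard library. The sample records below are hypothetical; reading ORC or Parquet would require extra libraries (such as pyarrow), so this sketch sticks to the two text formats.

```python
import csv
import io
import json

# Structured, tabular text: CSV with a header row (sample data for illustration).
csv_text = "id,product,amount\n1,disk,99.5\n2,cpu,240.0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured text: one JSON object per line (JSON Lines style).
json_line = '{"id": 3, "product": "ram", "amount": 55.0}'
record = json.loads(json_line)

print(rows[0]["product"], record["product"])  # prints: disk ram
```

In a Hadoop setting, such files would be stored in HDFS and parsed inside the map phase; columnar formats like ORC and Parquet additionally store schema and compression metadata, which is why they use storage more efficiently than plain text.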