Big Data and Hadoop beginner short notes | Part-1.
- Huge volume of data that cannot be stored or processed using traditional methods.
- There is no strict boundary on the size of the data, but it is generally in the petabyte range.
This is far too much for a traditional system to handle. The challenge of systematically storing and analyzing such huge amounts of data led to the development of big data techniques. Big data is commonly characterized by the three V's:
- Volume (the amount of data)
- Velocity (the speed at which this data is generated)
- Variety (the different kinds of data being generated)
Hadoop is an open-source framework that "allows for the distributed processing of large data sets across clusters of computers using simple programming models"
The two main components of the Hadoop framework are HDFS and MapReduce.
HDFS (the Hadoop Distributed File System) takes care of storing and managing data within the Hadoop cluster.
MapReduce takes care of processing and computing the data already present in HDFS.
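To get a feel for the MapReduce model, here is a minimal word-count sketch in plain Python. This is only the idea, not the actual Hadoop API: in a real cluster, many mappers and reducers run in parallel on different nodes, and the framework handles the shuffle for you. All function names here are illustrative.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for each word in a line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key (word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

lines = ["big data and hadoop", "hadoop stores big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'and': 1, 'hadoop': 2, 'stores': 1}
```

The key point is that `map_phase` and `reduce_phase` only see small, independent pieces of the data, which is what makes it possible to spread the work across many machines.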
Hadoop follows a Master-Slave architecture.
The Master Node is responsible for running the NameNode and JobTracker daemons.
The Slave Nodes, on the other hand, are responsible for running the DataNode and TaskTracker daemons. There are generally multiple Slave Nodes.
Node = term generally used for computers and machines
Daemon = term used to describe a process running on a Linux machine
- NameNode and DataNode are responsible for storing and managing data
- JobTracker and TaskTracker are used for computing and processing data
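The layout above can be sketched as a tiny Python model. The node names here are made up for illustration; a real cluster is set up through Hadoop's configuration files, not a dictionary.

```python
# Toy model of the Master-Slave layout: one master, several slaves.
# Node names are illustrative, not real Hadoop configuration.
cluster = {
    "master":  ["NameNode", "JobTracker"],    # metadata + job scheduling
    "slave-1": ["DataNode", "TaskTracker"],   # stores blocks + runs tasks
    "slave-2": ["DataNode", "TaskTracker"],
    "slave-3": ["DataNode", "TaskTracker"],
}

# Every slave runs both a storage daemon and a compute daemon.
slaves = [node for node in cluster if node != "master"]
assert all(cluster[s] == ["DataNode", "TaskTracker"] for s in slaves)
print(f"{len(slaves)} slave nodes, 1 master node")
```

Notice that each slave pairs a storage daemon (DataNode) with a compute daemon (TaskTracker) on the same machine; this pairing is what makes the data locality optimization below possible.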
Some Features of Hadoop
- Supports a large number of Nodes (More storage and computing power)
- Parallel processing of data
- Distribution and replication of data
- Data Locality Optimization
This last one is a very interesting feature, in my opinion!
Data Locality Optimization Example
Let's say our system consists of three parts ->
- A database DB (which holds a list of numbers)
- A server S (which counts all the numbers that fit some business logic B)
- A client C (which asks the server for results)
In a typical system, the server S requests all the data from the DB (pre-filtered using some queries) and then counts the numbers satisfying B.
This works perfectly fine for small amounts of data, say a few MBs, but if the data is very large this becomes a long process, because receiving tens of GBs of data from the **DB** takes a lot of time.
The problem with the above implementation is the time and resources required to transfer huge amounts of data from the DB and then process it.
Hadoop solves this by transferring the code (which is much smaller) from the server S to the DB. This code now runs on the DB node itself, where the data is present, and instead of transferring tons of data, only the end result, i.e., the count, is transferred.
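The two approaches can be contrasted in a small sketch, assuming the business logic B is "count the even numbers". The names (`db_node`, `business_logic`, `run_on_db`) are made up for illustration; the point is only what crosses the "network" in each case.

```python
# The "DB": a node holding a list of one million numbers.
db_node = list(range(1_000_000))

def business_logic(n):
    """Some business logic B: here, keep even numbers (an assumption)."""
    return n % 2 == 0

# Traditional approach: ship ALL the data to the server, then count.
transferred = list(db_node)  # simulates a huge network transfer
count_traditional = sum(1 for n in transferred if business_logic(n))

# Data-locality approach: ship the (tiny) code to the DB node,
# run it where the data lives, and send back only the count.
def run_on_db(predicate):
    return sum(1 for n in db_node if predicate(n))

count_local = run_on_db(business_logic)

print(count_traditional, count_local)  # 500000 500000
```

Both approaches compute the same answer, but the first moves a million numbers across the network while the second moves one function and one integer, which is exactly the trade Hadoop makes.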