HADOOP: A new way to store and analyze data




Hadoop, Why?
ØNeed to process 100 TB data sets.
ØNeed an efficient, reliable, and usable framework.

What Is Hadoop?
ØHadoop was created by Doug Cutting (Douglas Reed Cutting) to support the Lucene and Nutch search engine projects; he named Hadoop after his child’s stuffed elephant.
ØHadoop is a software framework for distributed processing of large datasets across large clusters of computers.
ØCore Hadoop has two main systems:
    – Hadoop Distributed File System: self-healing, high-bandwidth clustered storage.
    – MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
Hadoop Architecture:-
Core Hadoop has two main systems:-
Hadoop Distributed File System (HDFS)
§A distributed file system that provides high-throughput access to application data.
MapReduce
§A software framework for distributed processing of large data sets on compute clusters.

HDFS (Hadoop Distributed File System)
ØHadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
ØHDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
[Figure: HDFS Architecture Diagram]
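
The replication idea on this slide can be illustrated with a toy model. This is a minimal sketch, not HDFS code: the function names, the node names, and the round-robin placement are assumptions made for illustration (real HDFS uses a rack-aware placement policy), and the 128 MB block size and replication factor of 3 are used only as typical example values.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # example block size in bytes (assumption)
REPLICATION = 3                 # example replication factor (assumption)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes.

    Round-robin placement is a simplification chosen for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def survives_failure(placement, failed_node):
    """True if every block still has a replica on some node other than failed_node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
    placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
    for block, replicas in placement.items():
        print(block, replicas)
    print("readable after node2 fails:", survives_failure(placement, "node2"))
```

Because each block lives on several nodes, losing one node leaves every block readable, and a computation can be scheduled on whichever node already holds a copy of the block.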


Hadoop MapReduce
ØIn MapReduce, records are processed in isolation by tasks called Mappers.
ØThe output from the Mappers is then brought together into a second set of tasks called Reducers.
MapReduce Implementation
1. Input files split (M splits)
2. Assign Master & Workers
3. Map tasks
4. Writing intermediate data to disk (regions)
5. Intermediate data read & sort
6. Reduce tasks
7. Return
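
The steps above can be sketched as a single-process word count. This is an illustrative sketch, not the Hadoop API: `map_fn`, `reduce_fn`, `partition`, and `run_job` are hypothetical names, in-memory lists stand in for the on-disk regions, and `run_job` itself plays the role of the master that hands work to workers.

```python
from collections import defaultdict


def map_fn(record):
    """Mapper: process one record in isolation, emit (word, 1) pairs."""
    for word in record.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reducer: combine all values that were emitted for one key."""
    return key, sum(values)


def partition(key, num_regions):
    """Pick the region for a key (simple deterministic hash, an assumption)."""
    return sum(key.encode("utf-8")) % num_regions


def run_job(records, num_splits=3, num_regions=2):
    # Step 1: split the input into M splits.
    splits = [records[i::num_splits] for i in range(num_splits)]

    # Steps 3-4: run the map task on each split and write the
    # intermediate pairs into regions (lists here, disk in a real cluster).
    regions = [[] for _ in range(num_regions)]
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                regions[partition(key, num_regions)].append((key, value))

    # Steps 5-6: read each region, sort by key, group, and reduce.
    result = {}
    for region in regions:
        grouped = defaultdict(list)
        for key, value in sorted(region):
            grouped[key].append(value)
        for key, values in grouped.items():
            out_key, out_value = reduce_fn(key, values)
            result[out_key] = out_value

    # Step 7: return the combined output.
    return result


if __name__ == "__main__":
    print(run_job(["the cat sat", "the dog sat", "the end"]))
```

Since every key is routed to exactly one region, each reducer sees all values for its keys, which is what lets the reduce tasks run independently of each other.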


Benefits of MapReduce
ØCapable of processing vast amounts of data
ØScales linearly
ØIdeally, the same data problem is processed 10x faster on a 10x larger cluster
ØIndividual failures have minimal impact
ØA failure during a job causes only a small portion of the job to be re-executed
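
The last point can be made concrete with a small sketch of per-task retries. It is an illustration under stated assumptions, not Hadoop's scheduler: `run_with_retries` and the `max_attempts` parameter are hypothetical, and a `RuntimeError` stands in for a node failure.

```python
def run_with_retries(tasks, max_attempts=3):
    """Run independent tasks; re-execute only the ones that fail.

    `tasks` maps a task id to a callable that takes the attempt number.
    Returns (results, attempts), where attempts records how many tries
    each task needed.
    """
    results = {}
    attempts = {}
    for task_id, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                results[task_id] = task(attempt)
                break  # this task is done; other tasks are unaffected
            except RuntimeError:
                continue  # simulated failure: retry this task only
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results, attempts


def steady_task(attempt):
    """A task that always succeeds."""
    return 10


def flaky_task(attempt):
    """A task whose first attempt fails, simulating a lost node."""
    if attempt == 1:
        raise RuntimeError("simulated node failure")
    return 20


if __name__ == "__main__":
    results, attempts = run_with_retries(
        {"t0": steady_task, "t1": flaky_task, "t2": steady_task}
    )
    print(results, attempts)
```

Only the failed task runs a second time; the work already completed by the other tasks is kept.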

Drawbacks of MapReduce
ØJob setup takes time (e.g., several seconds)
ØMapReduce is not for real-time interaction
ØRequires deep understanding of the MapReduce paradigm
ØNot all problems are easily expressed in MapReduce
Advantages:-
ØHadoop is designed to run on cheap commodity hardware
ØIt automatically handles data replication and node failure
ØIt does the hard work – you can focus on processing data
ØIt offers cost savings along with efficient, reliable data processing

Conclusion:-
ØHadoop is a data storage and analysis platform for large volumes of data.
ØHadoop will sit alongside, not replace, your existing RDBMS.
ØHadoop has many tools to ease data analysis.

