A new way to store and analyze data
Why Hadoop?
• Need to process data sets on the order of 100 TB.
• Need an efficient, reliable, and usable framework.
What Is Hadoop?
• Hadoop was created by Douglas Reed Cutting, who named it after his child's stuffed elephant, to support the Lucene and Nutch search engine projects.
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
• Core Hadoop has two main systems:
– Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage.
– MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.
Hadoop Architecture:-
Core Hadoop has two main systems:
Hadoop Distributed File System (HDFS)
• A distributed file system that provides high-throughput access to application data.
MapReduce
• A software framework for distributed processing of large data sets on compute clusters.
HDFS (Hadoop Distributed File System)
• The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
• HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
HDFS Architecture Diagram
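To make this concrete, here is a minimal sketch of how an application writes and reads a file through HDFS using Hadoop's Java FileSystem API; the cluster URI and file paths below are placeholders chosen for illustration, not values from this post.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; this URI is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);

        // Write a small file. HDFS splits it into blocks and replicates
        // each block across data nodes per the configured replication factor.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }

        fs.close();
    }
}

Note that block replication is handled transparently by HDFS; the client code never deals with individual replicas.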
Hadoop MapReduce
• In MapReduce, records are processed in isolation by tasks called Mappers.
• The output from the Mappers is then brought together and fed into a second set of tasks called Reducers.
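To illustrate that split of responsibilities, here is a minimal word-count sketch using Hadoop's Java MapReduce API; the class names (WordCountMapper, WordCountReducer) are made up for this example.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: each record (one line of text) is processed in isolation.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}

// Reducer: all values emitted for the same key are brought together here.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));  // emit (word, total count)
    }
}

Each map() call sees a single input record in isolation; the framework then groups every value emitted for the same word and hands them to a single reduce() call.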
MapReduce Implementation
1. Split the input files into M splits
2. Assign master and worker processes
3. Run the map tasks
4. Write intermediate data to local disk, partitioned into R regions
5. Read and sort the intermediate data
6. Run the reduce tasks
7. Return the results to the user program
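The steps above are what the framework performs internally. From the programmer's point of view, the whole pipeline is configured and submitted as a single job; the driver sketch below assumes the hypothetical WordCountMapper and WordCountReducer classes from the earlier sketch and uses placeholder input/output paths.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // map tasks (step 3)
        job.setReducerClass(WordCountReducer.class);  // reduce tasks (step 6)
        job.setNumReduceTasks(2);                     // R reduce partitions (step 4)

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The input is split into M pieces (step 1); paths here are placeholders.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Setting the number of reduce tasks fixes R, while the number of map tasks M is derived from the input splits computed in step 1.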
Benefits of MapReduce
• Capable of processing vast amounts of data
• Scales linearly
– The same job will run roughly 10x faster on a 10x larger cluster
• Individual failures have minimal impact
– Failures during a job cause only a small portion of it to be re-executed
Drawbacks of MapReduce
• Job setup takes time (e.g., several seconds)
• MapReduce is not suited to real-time, interactive work
• Requires a deep understanding of the MapReduce paradigm
• Not all problems are easily expressed in MapReduce.
Advantages:-
• Hadoop is designed to run on cheap commodity hardware
• It automatically handles data replication and node failure
• It does the hard work, so you can focus on processing data
• Cost-saving, efficient, and reliable data processing
Conclusion:-
• Hadoop is a data storage and analysis platform for large volumes of data.
• Hadoop will sit alongside, not replace, your existing RDBMS.
• Hadoop has many tools to ease data analysis.