HADOOP A new way to store and analyze data

A new way to store and analyze data

Hadoop, Why?

ØNeed to process 100TB data sets.

ØNeed Efficient, Reliable and Usable framework.

What Is Hadoop ??

ØHadoop was created by Douglas Reed Cutting, who named haddop after his child’s stuffed elephant to support Lucene and Nutch search engine projects.

ØHadoop is a software framework for distributed processing of large datasets across large clusters of computers

ØCore Hadoop has two main systems:

– Hadoop Distributed File System: self-healing

high-bandwidth clustered storage.

– MapReduce: distributed fault-tolerant resource

management and scheduling coupled with a

scalable data programming abstraction.

Hdoop Architecture:-

The core Hadoop has two main systems:-

Hadoop Distributed File System(HDFS)

§A distributed file system that provides high throughput access to application data.

Map Reduce

§A software framework for distributed processing of large data sets on compute clusters.

HDFS (Hadoop Distributed FileSystem)

ØHadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.

ØHDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

HDFS Architecture Diagram

Hadoop Map Reduce

Ø In Map Reduce, records are processed in isolation by tasks called Mappers.

Ø The output from the Mappers is then brought together into a second set of tasks called Reducers .

Map Reduce Implementation

1.Input files split (M splits)

2.Assign Master & Workers

3.Map tasks

4.Writing intermediate data to disk (R regions)

5.Intermediate data read & sort

6.Reduce tasks

7.Return

Benefits of MapReduce

ØCapable of processing vast amounts of data

Scales linearly

Ø Same data problem will process 10x faster on 10x larger cluster

ØIndividual failures have minimal impact

ØFailures during a job cause only a small portion of the job to re-executed

Drawbacks of MapReduce

ØJob setup takes time (e.g., several seconds)

Ø Map Reduce is not for real-time interaction

ØRequires deep understanding of the MapReduce paradigm

ØNot all problems are easily expressed in MapReduce.

Advantages:-

ØHadoop is designed to run on cheap commodity hardware

ØIt automatically handles data replication and node failure

ØIt does the hard work – you can focus on processing data

ØCost Saving and efficient and reliable data processing

Conclusion:-

ØHadoop is data storage and analysis platform for large volumes of data.

ØHadoop will sit along side, not replace your existing RDBMS.

ØHadoop has many tools to ease data analysis.

Informationgossip

Search This Blog

HADOOP A new way to store and analyze data

Comments

Post a Comment

Popular posts from this blog

5 thing you must know about Sunil Chhetri

What is Nipah virus (NiV)? | Nipah virus wiki | Symptoms |

What is 5G Network and How it Works