The Hadoop Distributed File System (HDFS) is a filesystem designed for large-scale distributed data processing under frameworks such as MapReduce.
Hadoop works more effectively with a single large file than with many small ones. It mainly uses four input formats: FileInputFormat, KeyValueTextInputFormat, TextInputFormat, and NLineInputFormat. MapReduce is a data processing model built from two data processing primitives, the Mapper and the Reducer. Hadoop supports chaining MapReduce programs together to form a bigger job, and it offers various joining techniques for processing multiple datasets simultaneously. Many complex tasks need to be broken down into simpler subtasks, each accomplished by an individual MapReduce job.
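To make the Mapper and Reducer primitives concrete, here is a minimal sketch in Java using the MapReduce API. It assumes an input file read by TextInputFormat with one "citing,cited" record per line, as in the patent citation data set discussed next; the class names are illustrative, not something defined in this article.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CitationCount {

    // Mapper: for each "citing,cited" line (read by TextInputFormat), emit (cited patent, 1).
    public static class CiteMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text cited = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length == 2) {
                cited.set(fields[1].trim());
                context.write(cited, ONE);
            }
        }
    }

    // Reducer: sum the ones to get the total number of citations per patent.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text patent, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            total.set(sum);
            context.write(patent, total);
        }
    }
}
```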
From the citation data set, you may be interested in finding the ten most cited patents. A sequence of two MapReduce jobs can do this.
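One way to sketch that chain is a driver that runs the two jobs back to back, feeding the output directory of the first job into the second. The code below is illustrative only: it reuses the CitationCount classes from the previous sketch for job 1, and job 2 uses a single reducer with a TreeMap to keep the ten largest counts (ignoring ties in the count, for simplicity); the paths and class names are assumptions for the example.

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopCitedPatentsDriver {

    // Single reducer that keeps only the ten largest citation counts it has seen.
    // (TreeMap keyed by count, so equal counts overwrite each other -- acceptable for a sketch.)
    public static class TopTenReducer extends Reducer<Text, Text, Text, IntWritable> {
        private final TreeMap<Integer, String> top = new TreeMap<>();

        @Override
        protected void reduce(Text patent, Iterable<Text> counts, Context context) {
            for (Text count : counts) {
                top.put(Integer.parseInt(count.toString()), patent.toString());
                if (top.size() > 10) {
                    top.remove(top.firstKey());   // evict the smallest count
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            for (Map.Entry<Integer, String> e : top.descendingMap().entrySet()) {
                context.write(new Text(e.getValue()), new IntWritable(e.getKey()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);     // raw "citing,cited" records
        Path counts = new Path(args[1]);    // intermediate output of job 1
        Path topTen = new Path(args[2]);    // final output of job 2

        // Job 1: count citations per patent, using the classes from the previous sketch.
        Job countJob = Job.getInstance(conf, "citation count");
        countJob.setJarByClass(TopCitedPatentsDriver.class);
        countJob.setMapperClass(CitationCount.CiteMapper.class);
        countJob.setCombinerClass(CitationCount.SumReducer.class);
        countJob.setReducerClass(CitationCount.SumReducer.class);
        countJob.setOutputKeyClass(Text.class);
        countJob.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(countJob, input);
        FileOutputFormat.setOutputPath(countJob, counts);
        if (!countJob.waitForCompletion(true)) {
            System.exit(1);                 // stop the chain if job 1 fails
        }

        // Job 2: read job 1's "patent<TAB>count" output and keep the ten most cited patents.
        Job topTenJob = Job.getInstance(conf, "top ten cited patents");
        topTenJob.setJarByClass(TopCitedPatentsDriver.class);
        topTenJob.setInputFormatClass(KeyValueTextInputFormat.class); // key = patent, value = count
        topTenJob.setMapOutputKeyClass(Text.class);                   // default identity mapper
        topTenJob.setMapOutputValueClass(Text.class);
        topTenJob.setReducerClass(TopTenReducer.class);
        topTenJob.setNumReduceTasks(1);                               // one reducer sees every patent
        topTenJob.setOutputKeyClass(Text.class);
        topTenJob.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(topTenJob, counts);
        FileOutputFormat.setOutputPath(topTenJob, topTen);
        System.exit(topTenJob.waitForCompletion(true) ? 0 : 1);
    }
}
```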
Hadoop clusters support an ecosystem that includes HDFS, MapReduce, Sqoop, Hive, Pig, HBase, Oozie, ZooKeeper, Mahout, NoSQL stores, Lucene/Solr, Avro, Flume, Spark, and Ambari. Hadoop is designed for offline processing and analysis of large-scale data.
Hadoop is best used as a write-once, read-many-times type of data store.
In Hadoop, a large dataset is divided into smaller blocks (64 or 128 MB) that are spread across many machines in the cluster via the Hadoop Distributed File System.
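As a small, hedged illustration of this from the client side, the sketch below uses the HDFS FileSystem API to print a file's block size and the hosts storing each block. The file path and class name are hypothetical, and the configuration is assumed to point at an existing cluster via core-site.xml/hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and the HDFS settings from core-site.xml / hdfs-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/cite75_99.txt");   // hypothetical file already stored in HDFS
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        System.out.println("Block size: " + status.getBlockSize() + " bytes");
        for (BlockLocation block : blocks) {
            // Each block is stored on several machines (one per replica).
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```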
The key qualities of Hadoop are:
1) Accessible - Hadoop runs on large clusters of commodity hardware.
2) Robust - Because it is intended to run on clusters of commodity hardware, Hadoop is architected with the assumption of frequent hardware failures, and it can gracefully handle most of them.
3) Scalable - Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
4) Simple - Hadoop allows users to quickly write efficient parallel code.
What is Hadoop Development?
There are mainly two teams when it comes to Big Data Hadoop. One team consists of Hadoop Administrators and the other of Hadoop Developers. So the common question that comes to mind is: what are their roles and responsibilities?
To understand their roles and responsibilities, we first need to understand what Big Data Hadoop is. With the evolution of the internet, the growth of the smartphone industry, and easy access to the internet, the amount of data generated on a daily basis has also increased. This data can be anything: your daily online transactions, your feed activity on social media sites, the amount of time you spend on a particular app, and so on. Data can be generated from anywhere, in the form of logs.
Now, with this amount of data generated daily, we cannot rely on a traditional RDBMS to process it, because the turnaround time (SLA) of a traditional RDBMS is too high. Moreover, old data that sits in archives cannot be processed in real time at all.
Hadoop provides a solution to all of these problems. You can put all your data in the Hadoop Distributed File System and access and process it in real time; whether the data was generated today or is 10 years old does not matter, you can process it easily.
Let me explain the above situation with an example. Suppose you have been a customer of XYZ telecom company for the past 10 years, so every call record is stored in the form of logs. Now that telecom company wants to introduce new plans for customers in a particular age group, and for that it wants to access the logs of every customer who falls in that age group. The main problem is that this data is stored in a traditional RDBMS: only 40% of the data can be processed in real time, while the remaining 60% is stored in archives and cannot be processed in real time, and the company cannot wait too long to retrieve the data from the archives and then process it.
Only 40% of the data is available for processing in real time, and if the company takes a decision on that 40%, then the success rate of that decision will be only 40%, a risk the company cannot take. If all of this data is stored in the Hadoop Distributed File System, then 100% of the data is accessible in real time and 100% of it can be processed.
The above example should clear up why Big Data Hadoop is required in industry and why it is in such demand. Now we will discuss the two teams that make Big Data Hadoop work: one is the Hadoop Admin team and the other is the Hadoop Development team.
Hadoop Administrator Team:
This team is responsible for the maintenance of the Cluster in which the data is stored.
This team is responsible for the authentication of the users that are going to work on the cluster.
This team is responsible for the authorization of the users that are going to work on the cluster.
This team is responsible for troubleshooting; that is, if the cluster goes down, it is their job to bring it back to a running state.
This team deploys, configures and manages the services present in the cluster.
Basically, the Hadoop Admin team looks after the cluster: it is responsible for the health of the cluster, its security, and the management of the data. But what should be done with the data? A company does not want to spend this amount of money just storing it.
Now comes the Hadoop Development team. You may remember the real-time access we discussed in the example above; this real-time access to the data is what allows the Hadoop Development team to process it. Now,
What is data processing?
The data that comes into the cluster is raw data. Raw data can be structured, unstructured, semi-structured, or binary. We need to filter out the data that is of use and process it to generate insights so that business decisions can be made. All of this work, filtering the data and then processing it, falls to the Hadoop Development team.
Hadoop Development Team:
This team is responsible for ETL, that is, extract, transform, and load.
This team performs analysis of data sets and generates insights.
This team performs high-speed querying.
This team reviews and manages Hadoop log files.
This team defines Hadoop job flows.
As a Hadoop Developer, you need to know the basic architecture and workings of the following services:
Apache Flume
Apache Pig
Apache Sqoop
Apache Hive
Apache Impala
Spark
Scala
HBase
Apache Flume and Apache Sqoop are ETL tools; they are the basic tools used to get data into the cluster and into HDFS. Apache Hive is a data warehouse used to run queries on the data set with HiveQL, and Impala is also used for querying. Spark is used for high-speed processing of data sets. HBase is a database. These points are a brief introduction to the services and their uses in a Hadoop cluster.
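To give a flavour of the high-speed processing Spark is used for, here is a small, illustrative Java sketch that redoes the citation count and picks the ten most cited patents with Spark's Java API. The input path, application name, and local master setting are assumptions for the example, not part of this article.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkCitationCount {
    public static void main(String[] args) {
        // local[*] is only for trying the sketch locally; on a cluster the master
        // is normally supplied by spark-submit instead.
        SparkConf conf = new SparkConf()
                .setAppName("spark-citation-count")
                .setMaster("local[*]");

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical HDFS path with one "citing,cited" record per line.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/cite75_99.txt");

            // (cited patent, 1) pairs, then sum per patent.
            JavaPairRDD<String, Integer> counts = lines
                    .map(line -> line.split(","))
                    .filter(fields -> fields.length == 2)
                    .mapToPair(fields -> new Tuple2<>(fields[1].trim(), 1))
                    .reduceByKey(Integer::sum);

            // Swap to (count, patent), sort descending, and print the top ten.
            counts.mapToPair(Tuple2::swap)
                  .sortByKey(false)
                  .take(10)
                  .forEach(t -> System.out.println(t._2() + "\t" + t._1()));
        }
    }
}
```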