Course Description

Hadoop Syllabus

Course Content for Hadoop and Spark

Introduction to Big Data and Hadoop

• What is Big Data?
• What is Hadoop?
• Relation between Big Data and Hadoop
• Why do we need Hadoop?
• Scenarios for adopting Hadoop technology in REAL TIME Projects
• Challenges with Big Data
      o Storage
      o Processing
• How Hadoop addresses Big Data Challenges
• Comparison with Other Technologies
      o RDBMS
      o Data Warehouse
      o Teradata
• Different Components of the Hadoop Ecosystem
      o Storage Components
      o Processing Components
• Importance of Hadoop Ecosystem Components
• Other solutions for Big Data
      o Introduction to NoSQL

HDFS (Hadoop Distributed File System)

• What is a Cluster Environment?
• Cluster Vs Hadoop Cluster
• Significance of HDFS in Hadoop
• Features of HDFS
• Storage aspects of HDFS
      o Block
      o How to Configure the Block Size
      o Default Vs Configurable Block Size
      o Why is the HDFS Block Size so large?
      o Design Principles of Block Size

HDFS Architecture - 5 Daemons of Hadoop

• NameNode and its functionality
• DataNode and its functionality
• JobTracker and its functionality
• TaskTracker and its functionality
• Secondary NameNode and its functionality

Replication in Hadoop – Fail Over Mechanism

• Data Storage in Data Nodes
• Fail Over Mechanism in Hadoop – Replication
• Replication Configuration
• Custom Replication
• Design Constraints with Replication Factor
• Can we change the replication factor in Hadoop?
• Can we change the block size for a file or directory in Hadoop?
Accessing HDFS
• CLI (Command Line Interface) and HDFS Commands
• Java Based Approach
• Hadoop Archives
• Configuration files in a Hadoop Installation and their Purpose
• How & Where to Configure Hadoop Daemons in a Hadoop Cluster?
• Difference between Hadoop 1.X.X and Hadoop 2.X.X versions
      o Name Node HA (High Availability in Hadoop 2.X.X)
• Why is Map Reduce essential in Hadoop?
• Processing Daemons of Hadoop
• Job Tracker
      o Roles of Job Tracker
      o Drawbacks of Job Tracker failure in a Hadoop Cluster
      o How to configure Job Tracker in a Hadoop Cluster
• Task Tracker
      o Roles of Task Tracker
      o Drawbacks of Task Tracker failure in a Hadoop Cluster
Input Split
• InputSplit
• Need of Input Split in Map Reduce
• InputSplit Size
• InputSplit Size Vs Block Size
• InputSplit Vs Mappers
Map Reduce Life Cycle
• Communication Mechanism of Job Tracker & Task Tracker
• Input Format Class
• Record Reader Class
• Success Case Scenarios
• Failure Case Scenarios
• Retry Mechanism in Map Reduce
MapReduce Programming Model
• Different phases of the Map Reduce Algorithm
• Different Data types in Map Reduce
      o Primitive Data types Vs Map Reduce Data types
How to write a basic Map Reduce Program
• Driver Code
• Mapper Code
• Reducer Code
Driver Code
• Importance of Driver Code in a Map Reduce program
• How to Identify the Driver Code in a Map Reduce program
• Different sections of Driver code
Mapper Code
• Importance of the Mapper Phase in Map Reduce
• How to Write a Mapper Class?
• Methods in Mapper Class
Reducer Code
• Importance of the Reduce Phase in Map Reduce
• How to Write a Reducer Class?
• Methods in Reducer Class
Input Formats in Map Reduce
• TextInputFormat
• KeyValueTextInputFormat
• NLineInputFormat
• DBInputFormat
• SequenceFileInputFormat
• How to use a specific Input Format in Map Reduce
• How to write a Custom Input Format Class and Custom Record Reader
Output Formats in Map Reduce
• TextOutputFormat
• KeyValueTextOutputFormat
• NLineOutputFormat
• DBOutputFormat
• SequenceFileOutputFormat
• How to use a specific Output Format in Map Reduce
• How to write a Custom Output Format Class and Custom Record Writer
Map Reduce API (Application Programming Interface)
      o New API
      o Deprecated API
• Combiner in Map Reduce
      o Is a Combiner mandatory in Map Reduce?
      o How to use the Combiner class in Map Reduce
      o Performance trade-offs of the Combiner
      o Real Time Use Cases
      o Where to Use & Where Not to Use a Combiner
• Partitioner in Map Reduce
      o Importance of the Partitioner class in Map Reduce
      o How to use the Partitioner class in Map Reduce
      o Different types of Partitioners in Map Reduce
      o Importance of HashPartitioner
      o How to write a custom Partitioner
      o Real Time Use Cases
• Compression Techniques in Map Reduce
      o Importance of Compression in Map Reduce
      o What is a CODEC?
      o Compression Types
      o GzipCodec
      o BZip2Codec
      o LzoCodec
      o SnappyCodec
      o Configuring Compression Techniques
      o How to customize Compression for a single job Vs all jobs
• Map Reduce Job Chaining
      o What is Map Reduce Job Chaining?
      o Use of MR Chaining in Real Time Hadoop Projects
      o Real Time Use Case
      o Performance trade-offs of using MR Chaining
• Joins in Map Reduce
      o Map Side Join
      o Reduce Side Join
      o Performance Trade-Off
      o Distributed Cache
• How to debug MapReduce Jobs in Local and Pseudo Cluster Mode
• Introduction to MapReduce Streaming
• Data locality in Map Reduce
• Secondary Sorting Using Map Reduce
Apache PIG
• Introduction to Apache Pig
• Map Reduce Vs Apache Pig
• SQL Vs Apache Pig
• Different data types in Pig
• Where to Use Map Reduce and PIG in REAL TIME Hadoop Projects
• Modes of Execution in Pig
      o Local Mode
      o Map Reduce OR Distributed Mode
• Execution Mechanism
      o Grunt Shell
      o Script
      o Embedded
• Transformations in Pig
• How to write a simple Pig script
• Parameter substitution in PIG Scripts
• How to develop a Complex Pig Script
• Bags, Tuples and Fields in PIG
• UDFs in Pig
      o Need of using UDFs in PIG
      o How to use UDFs
      o REGISTER Keyword in PIG
• Techniques to improve the performance and efficiency of Pig Latin
• Hive Introduction
• Need of Apache HIVE in Hadoop
• When to choose PIG & HIVE in a REAL TIME Project
• Hive Architecture
      o Driver
      o Compiler
      o Executor (Semantic Analyzer)
• Meta Store in Hive
      o Importance of Hive Meta Store
      o Embedded metastore configuration
      o External metastore configuration
      o Communication mechanism with Metastore
• Hive Integration with Hadoop
• Hive Query Language (Hive QL)
• Configuring Hive with MySQL MetaStore
• SQL Vs Hive QL
• Data Slicing Mechanisms
      o Partitions in Hive
      o Buckets in Hive
      o Partitioning Vs Bucketing
      o Real Time Use Cases
• Collection Data Types in HIVE
      o Array
      o Struct
      o Map
      o Real Time Use Cases
• User Defined Functions (UDFs) in HIVE
      o UDFs
      o UDAFs
      o UDTFs
      o Need of UDFs in HIVE
• Hive Serializer/Deserializer – SerDe
• Semi Structured Data Processing Using Hive
• HIVE – HBASE Integration
• Introduction to Sqoop
• MySQL Client and Server Installation
• How to connect to a Relational Database using Sqoop
• Different Sqoop Commands
      o Different flavors of Imports
      o Export
      o Hive-Imports
HBase
• HBase Introduction
• HDFS Vs HBase
• HBase Vs RDBMS
• HBase Vs NoSQL
• HBase Use Cases
• HBase Data Modeling Elements
      o Column Families
      o Column Qualifier Name
      o Row Key
• HBase Architecture
• Clients
      o REST
      o Thrift
      o Java Based
      o Avro
• Map Reduce Integration
• Map Reduce over HBase
• HBase Admin
      o Schema Definition
      o Basic CRUD Operations
      o Client Side Buffering in HBase
• Flume Introduction
• Flume Architecture
• Flume Master, Flume Collector and Flume Agent
• Flume Configurations
• Real Time Use Case using Apache Flume
• Oozie Introduction
• Oozie Architecture
• Oozie Configuration Files
• Oozie Job Submission
      o workflow.xml
      o coordinator.xml
      o Passing parameters in workflow.xml
YARN (Yet Another Resource Negotiator) – Next Gen. MapReduce
• What is YARN?
• Difference between Map Reduce & YARN
• YARN Architecture
      o Resource Manager
      o Application Master
      o Node Manager
• When should we go ahead with YARN?
• YARN Process Flow
• Different Configuration Files for YARN
• Examples on YARN
• What is Impala?
• How can we use Impala for Query Processing?
• When should we go ahead with Impala?
• HIVE Vs Impala
• REAL TIME Use Cases with Impala
MongoDB (as part of NoSQL Databases)
• Need of NoSQL Databases
• Relational Vs Non-Relational Databases
• Introduction to MongoDB
• Features of MongoDB
• Installation of MongoDB
• MongoDB Basic Operations
• REAL TIME Use Cases on Hadoop & MongoDB
Apache Cassandra
• Introduction to Cassandra
• MongoDB Vs Cassandra
• Basic Operations using Cassandra
Apache Kafka (A Distributed Message Queuing System)
• Introduction to Kafka
• Installation of Kafka
• Difference between MQ and Kafka
• Basic Operations using Kafka
Mahout (as part of Big Data Analytics)
• Introduction to Machine Learning (ML) Languages
• Types of Machine Learning
• Introduction to Apache MAHOUT
• Categories of Mahout Algorithms
• Real Time Use Case using the Classifier Algorithm of Mahout – Naive Bayes
SCALA (Object Oriented and Functional Programming)
• Getting started with Scala.
• Scala Background, Scala Vs Java and Basics.
• Interactive Scala – REPL, data types, variables, expressions, simple
• Running the program with the Scala Compiler.
• Explore the type lattice and use type inference.
• Define Methods and Pattern Matching.
Scala Environment Set up.
• Scala set up on Windows.
• Scala set up on UNIX.
Functional Programming.
• What is Functional Programming?
• Differences between OOP and FP.
Collections (Very Important for Spark)
• Iterating, mapping, filtering and counting
• Regular expressions and matching with them.
• Maps, Sets, groupBy, Options, flatten, flatMap
• Word count, IO operations, file access, flatMap
Object Oriented Programming.
• Classes and Properties.
• Objects, Packaging and Imports.
• Traits.
• Objects, classes, inheritance, Lists with multiple related types, apply
• What is SBT?
• Integration of Scala in Eclipse IDE.
• Integration of SBT with Eclipse.
• Batch versus real-time data processing
• Introduction to Spark, Spark versus Hadoop
• Architecture of Spark.
• Coding Spark jobs in Scala
• Exploring the Spark shell -> Creating a Spark Context.
• RDD Programming
• Operations on RDDs.
• Transformations
• Actions
• Loading Data and Saving Data.
• Key Value Pair RDDs.
• Broadcast variables.
• Configuring and running the Spark cluster.
• Exploring a Multi-Node Spark Cluster.
• Cluster management
• Submitting Spark jobs and running in cluster mode.
• Developing Spark applications in Eclipse
• Tuning and Debugging Spark.
• Learning Cassandra
• Getting started with architecture
• Installing Cassandra.
• Communicating with Cassandra.
• Creating a database.
• Creating a table
• Inserting Data
• Modelling Data.
• Creating an Application with Web.
• Updating and Deleting Data.
• Introduction to Spark and Cassandra Connectors.
• Spark with Cassandra -> Set up.
• Creating a Spark Context to connect to Cassandra.
• Creating a Spark RDD on the Cassandra database.
• Performing Transformations and Actions on the Cassandra RDD.
• Running a Spark Application in Eclipse to access the data in Cassandra.
• Introduction to Amazon Web Services.
• Building a 4-Node Spark Cluster in Amazon Web Services.
• Deploying in Production with Mesos and YARN.

• Introduction to Spark Streaming.
• Architecture of Spark Streaming
• Processing Distributed Log Files in Real Time
• Discretized Streams (DStreams) and RDDs.
• Applying Transformations and Actions on Streaming Data
• Integration with Flume and Kafka.
• Integration with Cassandra
• Monitoring streaming jobs.
• Introduction to Apache Spark SQL
• The SQL context
• Importing and saving data
• Processing Text, JSON and Parquet Files
• DataFrames
• User-defined functions
• Using Hive
• Local Hive Metastore server
• Introduction to Machine Learning
• Types of Machine Learning.
• Introduction to Apache Spark MLlib Algorithms.
• Machine Learning Data Types and working with MLlib.
• Regression and Classification Algorithms.
• Decision Trees in depth.
• Classification with SVM, Naive Bayes
• Clustering with K-Means
• Building the Spark server
What are we offering as part of this Course?
• 3 REAL TIME Hadoop Projects: End-to-End Explanation with architecture.
• Mock Interviews will be conducted on a one-to-one basis after the
course ends.
• Hard Copy & Soft Copy Materials for all the Components.
• Detailed Assistance in RESUME Preparation on a one-to-one basis with
Real Time Projects based on your technical background.
• All the Real Time interview questions and answers will be provided.
• Discussing new developments in Hadoop
• Discussing Interview Questions on a daily basis
• Discussing Certification (CCA 175 – Spark and Hadoop Certification)
related topics on a daily basis.
• Proof of Concept using complex architectures to give a real time idea

Pune, Maharashtra, India

