How to Install Hadoop on Ubuntu 16.04


Many questions come up online about how to install Hadoop on Ubuntu, other Linux distributions, Windows 10/8.1/8/7 and macOS. This guide shows how to install the Hadoop big data framework on Ubuntu 16.04 in a few simple steps.

What is Big Data?
According to Wikipedia, big data refers to data sets that are so voluminous and complex that traditional data-processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy. Big data is often described along three dimensions: Volume, Variety and Velocity.

Then what is Hadoop?
According to the Apache Hadoop project, the Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Installing Hadoop Big Data on Ubuntu
The main goal of this tutorial is to simplify the installation of Hadoop on Ubuntu with correct and accurate commands, so that you can get started with Hadoop quickly.
Note: The following tutorial can also be used to install the latest Hadoop release.

This tutorial has been tested on:
Ubuntu 16.04
Hadoop 2.9.0 [ hadoop-2.9.0.tar.gz, ~350 MB ]
Prerequisites
JAVA JDK
A Java JDK is required for Hadoop to work. I recommend installing OpenJDK 8 (Hadoop 2.x needs Java 7 or newer).

Following are the commands for installing OpenJDK 8 on Ubuntu:
# Open a terminal and give the following commands
sudo apt-get update
sudo apt-get install openjdk-8-jre-headless
sudo apt-get install openjdk-8-jdk
# To check which Java version is installed on your system
readlink -f /usr/bin/javac
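
If you want the JVM installation directory itself (handy for the JAVA_HOME setting later), the readlink output can be trimmed; a small optional helper:
# print the JVM directory by stripping /bin/javac from the resolved path
readlink -f /usr/bin/javac | sed 's:/bin/javac::'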

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it.
# Hadoop requires SSH access to manage its nodes
sudo apt-get install ssh
sudo apt-get install rsync
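
To make sure the SSH server is actually running before continuing (a quick optional check):
# check the status of the ssh service
sudo service ssh status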


Download Hadoop
In this tutorial I am installing hadoop-2.9.0.tar.gz (about 350 MB), which is the stable Hadoop release available from http://www.eu.apache.org

# Download hadoop from : http://www.eu.apache.org/dist/hadoop/common/stable/
# copy and extract hadoop-2.9.0.tar.gz in home folder
# rename the name of the extracted folder from hadoop-2.9.0 to hadoop
# find whether ubuntu is 32 bit (i686) or 64 bit (x86_64)
uname -i
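
If you prefer to do the download and extraction from the terminal, the commands below are one way to do it (the exact archive name under stable/ may differ if a newer release has become stable since):
# download the archive into the home folder
cd ~
wget http://www.eu.apache.org/dist/hadoop/common/stable/hadoop-2.9.0.tar.gz
# extract the archive and rename the extracted folder to "hadoop"
tar -xzf hadoop-2.9.0.tar.gz
mv hadoop-2.9.0 hadoop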

The command below opens the file hadoop-env.sh in gedit:
gedit ~/hadoop/etc/hadoop/hadoop-env.sh

Add the line below at the end of the file [ hadoop-env.sh ], depending on whether your system is 32 bit or 64 bit.

FOR 32 bit:
# add the following line at the end of the file
# for 32 bit ubuntu
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-i386
# save and exit the file

FOR 64 bit:
# add the following line at the end of the file
# for 64 bit ubuntu
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# save and exit the file

Save and exit the file.
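
If you are not sure which JVM directory exists on your machine, list the installed JVMs first and use the matching path for JAVA_HOME:
# list the installed JVM directories; pick the one matching your architecture
ls /usr/lib/jvm/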

# display the usage documentation for the hadoop script
~/hadoop/bin/hadoop

1. Standalone Mode
# 1. standalone mode

mkdir input

cp ~/hadoop/etc/hadoop/*.xml input

# the next instruction is a single command, even though it wraps over two lines

# make sure the jar file name matches your Hadoop version, in my case 2.9.0
~/hadoop/bin/hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar grep input output 'us[a-z.]+'

cat output/*

# Our task is done, so remove input and output folders

rm -r input output

2. Pseudo-Distributed mode

Find out your user name using the following command and remember it, as we are going to use it in the next step:

whoami

Open the core-site.xml file using the following command:
gedit ~/hadoop/etc/hadoop/core-site.xml

Replace the existing <configuration> ... </configuration> block with the code below:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:1234</value>
  </property>
</configuration>
Save file and exit.

Open the hdfs-site.xml file using the following command:
gedit ~/hadoop/etc/hadoop/hdfs-site.xml

Replace the existing <configuration> ... </configuration> block with the code below, substituting your actual user name (from whoami) for your_user_name:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/your_user_name/hadoop/name_dir</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/your_user_name/hadoop/data_dir</value>
  </property>
</configuration>
Save file and exit.

Setup passphraseless/passwordless ssh
# generate an RSA key (DSA keys are disabled by default in Ubuntu 16.04's OpenSSH)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
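
If ssh localhost still asks for a password afterwards, it is usually worth restricting permissions on the authorized_keys file (a common fix):
chmod 0600 ~/.ssh/authorized_keys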

export HADOOP_PREFIX=/home/your_user_name/hadoop

ssh localhost

# type exit in the terminal to close the ssh connection (very important)

exit

The following instructions are to run a MapReduce job locally.
# The following instructions are to run a MapReduce job locally.

#Format the filesystem (do it only once):
~/hadoop/bin/hdfs namenode -format

#Start NameNode daemon and DataNode daemon:
~/hadoop/sbin/start-dfs.sh

# check which daemons are running with:

jps
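
If everything started correctly, jps will typically list NameNode, DataNode and SecondaryNameNode (plus Jps itself). If one of them is missing, the log files are the first place to look; by default they end up inside the hadoop folder:
# list the daemon log files
ls ~/hadoop/logs/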

DONE!!! As the final step, open your browser and go to the following URL (Hadoop's NameNode web interface listens on port 50070):
#Browse the web interface for the NameNode; by default it is available at:

http://localhost:50070/
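
For a quick check from the terminal that the web interface is responding (optional; install curl with sudo apt-get install curl if it is not present):
# an HTTP 200 response means the NameNode web interface is reachable
curl -I http://localhost:50070/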

ADDITIONAL COMMANDS
#Make the HDFS directories required to execute MapReduce jobs:
~/hadoop/bin/hdfs dfs -mkdir /user
~/hadoop/bin/hdfs dfs -mkdir /user/your_user_name

#Copy the sample files (from ~/hadoop/etc/hadoop) into the distributed filesystem folder (input)
~/hadoop/bin/hdfs dfs -put ~/hadoop/etc/hadoop input
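
Optionally, confirm the upload before running the job by listing the input directory in HDFS:
# list the files just copied into the distributed filesystem
~/hadoop/bin/hdfs dfs -ls input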

#Run the example map-reduce job (again, make sure the jar version matches your Hadoop version, here 2.9.0)
~/hadoop/bin/hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar grep input output 'us[a-z.]+'

#View the output files on the distributed filesystem
~/hadoop/bin/hdfs dfs -cat output/*

#Copy the output files from the distributed filesystem to the local filesystem and examine them:
~/hadoop/bin/hdfs dfs -get output output

#ignore warnings (if any)
cat output/*

# remove local output folder

rm -r output

# remove distributed folders (input & output)

~/hadoop/bin/hdfs dfs -rm -r input output

#When you’re done, stop the daemons with
~/hadoop/sbin/stop-dfs.sh

jps

THANKS FOR READING THE TUTORIAL. I HOPE HADOOP IS NOW INSTALLED AND RUNNING ON YOUR SYSTEM.

Check out: HOW TO INSTALL GOOGLE CHROME IN UBUNTU 16.04 LTS
