Summary
This is a tutorial on how to set up Hadoop and run applications locally. The instructions are for the Linux operating system.
Installation
Install prerequisites
Before proceeding you should have the following installed:
- A version of Java
- SSH
- PDSH
For Java you can use the following tutorial.
For SSH and PDSH you can use the following commands (assuming a Debian-based OS):
sudo apt-get install ssh
sudo apt-get install pdsh
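To confirm that the prerequisites are in place before continuing, you can check whether each command is on your PATH. The following is a minimal sketch; the helper name missing_tools is made up for this example, and it only checks that the commands exist, not which versions they are.

```shell
#!/bin/sh
# Print the name of every command given as an argument that is not on PATH.
# (missing_tools is a hypothetical helper, not part of Hadoop.)
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Report which of the three prerequisites still need to be installed.
missing=$(missing_tools java ssh pdsh)
if [ -n "$missing" ]; then
    echo "Missing prerequisites:"
    echo "$missing"
else
    echo "All prerequisites found."
fi
```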
Install Hadoop
Hadoop can be installed by downloading the tar.gz file, extracting it, and configuring the JAVA_HOME path. You don’t need a dedicated installer.
Download the tar.gz from the official mirror site. The location might change in the future, so look for the official release page (at the time of this tutorial the link is this one).
For example, for version 3.3.6, you can use the following command:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Download the SHA file:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
Verify the hash:
gpg --print-md SHA512 hadoop-3.3.6.tar.gz
cat hadoop-3.3.6.tar.gz.sha512
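The two hashes should be the same. The two commands may format the hash differently (letter case, spacing and line breaks), so compare only the hexadecimal digits.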
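If you prefer not to compare the hashes by eye, the comparison can be scripted. This is a sketch under two assumptions: that sha512sum (GNU coreutils) is available, and that the .sha512 file stores the hash as one unbroken run of 128 hexadecimal digits. The function name verify_sha512 is made up for this example.

```shell
#!/bin/sh
# Compare the SHA-512 of a file with the hash stored in a checksum file.
# Assumption: the checksum file holds the hash as 128 contiguous hex digits.
verify_sha512() {
    file=$1
    sumfile=$2
    # Hash computed from the downloaded file (first field of sha512sum output).
    actual=$(sha512sum "$file" | cut -d ' ' -f 1)
    # First 128-digit hex string found in the checksum file, lowercased.
    expected=$(grep -oiE '[0-9a-f]{128}' "$sumfile" | head -n 1 | tr 'A-F' 'a-f')
    [ -n "$expected" ] && [ "$actual" = "$expected" ]
}

# Run the check only if both downloaded files are present.
if [ -f hadoop-3.3.6.tar.gz ] && [ -f hadoop-3.3.6.tar.gz.sha512 ]; then
    if verify_sha512 hadoop-3.3.6.tar.gz hadoop-3.3.6.tar.gz.sha512; then
        echo "Checksum OK"
    else
        echo "Checksum MISMATCH"
    fi
fi
```

If the script reports a mismatch, download the archive again before extracting it.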
Extract the file:
tar xvfz hadoop-3.3.6.tar.gz
Configuring JAVA_HOME for Hadoop
Edit the file etc/hadoop/hadoop-env.sh (inside the extracted hadoop-3.3.6 directory) and set the line for JAVA_HOME:
export JAVA_HOME=/usr/java/latest
To find the proper path you can use the following command:
sudo update-alternatives --config java
The path you should use is one of the paths displayed by the previous command, after you remove the /bin/java suffix.
For example, if update-alternatives displays the path:
/usr/lib/jvm/java-17-openjdk-amd64/bin/java
Then the line you should add in etc/hadoop/hadoop-env.sh would be:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
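The suffix removal described above can also be done in the shell. The sketch below assumes the java command on your PATH is the installation you want Hadoop to use, and that readlink -f is available (it is on typical Linux systems). The function name java_home_from_binary is made up for this example.

```shell
#!/bin/sh
# Strip the trailing /bin/java from a path to obtain a JAVA_HOME candidate.
# (java_home_from_binary is a hypothetical helper, not part of Hadoop.)
java_home_from_binary() {
    printf '%s\n' "${1%/bin/java}"
}

# Resolve the real location of the java binary on PATH, if one is installed,
# and print the line to put in etc/hadoop/hadoop-env.sh.
if command -v java >/dev/null 2>&1; then
    java_bin=$(readlink -f "$(command -v java)")
    echo "export JAVA_HOME=$(java_home_from_binary "$java_bin")"
fi
```

Check the printed path before using it; on some installations the java binary sits in a nested directory, and the result may not be the directory you want.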
Creating required directories
To execute an application you need to manually create the input directory:
mkdir input
Then you can run a demo:
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar grep input output 'dfs[a-z.]+'
cat output/*
Before running another application you may need to completely delete the output directory. Hadoop will display an error if the output directory exists already.
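One way to avoid that error is to remove the old output directory right before each run. The following is a minimal sketch, assuming you run it from the Hadoop directory and that nothing in the output directory needs to be kept. The function name clean_output is made up for this example.

```shell
#!/bin/sh
# Remove a previous job's output directory, if present,
# so that the next job can create it again.
# (clean_output is a hypothetical helper, not part of Hadoop.)
clean_output() {
    dir=${1:-output}
    if [ -d "$dir" ]; then
        rm -r -- "$dir"
    fi
}

# Rerun the grep demo with a fresh output directory
# (only when executed from the Hadoop directory).
if [ -x bin/hadoop ]; then
    clean_output output
    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar grep input output 'dfs[a-z.]+'
    cat output/*
fi
```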
You may optionally configure the HDFS filesystem, but it is not necessary (and not recommended if you are a beginner who simply wants to experiment a little with Hadoop).