Hadoop 2.8.2 Cluster Configuration

I recently started playing with Hadoop, as reflected in my recent posts. In the process, I spent a significant amount of time setting up a Hadoop cluster made of three physical servers I have screaming in my bathroom. It took a lot of trial and error, plus reaching out for help in Hadoop’s IRC channel. To minimize the suffering for those of you doing the same thing, I decided to write this post specifically for 2.8.2. Many of the posts I found were outdated or simply did not address a possibly common problem I was experiencing.

Why Hadoop 2.8.2?

At the time of this writing, there are three stable releases in the Hadoop 2.x line: 2.6.5, 2.7.4, and 2.8.2. When I first approached this problem, I went for version 2.7.4. This works fine for a single-node setup, but a serious problem arises with a multi-node setup: in some instances, YARN’s ResourceManager can’t detect your machine’s resources and thinks the machine has insufficient resources, which causes it to abort. After a few retries, I switched to 2.8.2, in which this problem didn’t show up.

So if you are following this post, I assume that you have Hadoop 2.8.2 downloaded. Let’s begin!

Single Node Configuration

Before we go ahead with the cluster configuration, let us first test Hadoop on one of our machines.

    1. I recommend creating a hadoop user in your cluster machines. So let us create one:
      sudo adduser hadoop
      # Enter password and so on.
      
      sudo usermod -aG sudo hadoop  # Add hadoop user to sudo group.
      
      # Logout and log back in as hadoop user.
      
    2. Create a directory for Hadoop installation. I’ll pick /opt/hadoop:
      sudo mkdir /opt/hadoop
      
      # Change the owner of the /opt/hadoop so hadoop user can modify it.
      sudo chown hadoop:hadoop /opt/hadoop
      
    3. Install the Java 8 JDK. I’ll demonstrate with Oracle Java, but you can also use OpenJDK:
      sudo add-apt-repository ppa:webupd8team/java
      sudo apt-get update
      sudo apt install oracle-java8-installer
      sudo apt install oracle-java8-set-default  # Makes oracle java 8 your default, if it is not already.
      
    4. Let us now download Hadoop 2.8.2. Find a download link on the Hadoop download page and copy it:
      cd /opt/hadoop
      wget <copied download link>
      
      tar -zxf hadoop-2.8.2.tar.gz
      
    5. Change etc/hadoop/hadoop-env.sh to point to the appropriate JAVA_HOME value:
      # Edit with your preferred editor.
      emacs /opt/hadoop/hadoop-2.8.2/etc/hadoop/hadoop-env.sh
      

      Change the line containing export JAVA_HOME= to:

      # Set to java installation directory.
      # This is the case for oracle java 8, 
      # Change for your own java installation.
      export JAVA_HOME=/usr/lib/jvm/java-8-oracle
      
    6. Set up a script to initialize the much-needed environment variables:
      sudo emacs /etc/profile.d/hadoop.sh  # Emacs or whatever editor you want.
      

      In the hadoop.sh file, paste:

      # Required hadoop environment variables.
      export HADOOP_HOME=/opt/hadoop/hadoop-2.8.2
      export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
      
    7. Log out and log back in.
    8. Try executing hadoop in your terminal. You should see the help text; if you don't, something went wrong with the installation.
    9. To test our single-node setup, let us use a pseudo-distributed configuration, which is set in two files:
      etc/hadoop/core-site.xml:

      etc/hadoop/hdfs-site.xml:

    10. Set up passphraseless SSH. Run the following in your terminal:
      ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
      cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      chmod 0600 ~/.ssh/authorized_keys
      

      You should now be able to ssh to localhost without being asked for credentials:

      ssh localhost
      
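    11. Finally, let us run some sanity tests:
      cd $HADOOP_HOME  # The relative paths below assume this directory.
      hdfs namenode -format  # Initialize the NameNode's storage.
      start-dfs.sh
      hdfs dfs -mkdir -p /user/hadoop  # HDFS home directory for the hadoop user.
      hdfs dfs -put etc/hadoop input  # Upload the config files as sample input.
      hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.2.jar grep input output 'dfs[a-z.]+'
      hdfs dfs -get output output  # Copy the job output back to the local disk.
      cat output/*  # View output.
      stop-dfs.sh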
      

    If things went well after the last step, your single node installation and configuration should be good. Let us now proceed to the cluster installation and configuration.
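The contents of the two files named in step 9 are not shown above. As a hedged sketch, the script below writes the minimal pseudo-distributed settings used in the Apache single-node setup guide: `fs.defaultFS` pointing at `hdfs://localhost:9000` and a replication factor of 1. The `CONF_DIR` variable and the `write_pseudo_conf` function are illustrative names of my own, and the script writes to a scratch directory by default so you can review the files before copying them into `$HADOOP_HOME/etc/hadoop`.

```shell
#!/bin/sh
# Sketch: generate minimal pseudo-distributed configs for review.
# CONF_DIR is a hypothetical staging directory, not a Hadoop variable.
CONF_DIR="${CONF_DIR:-./hadoop-conf-staging}"

write_pseudo_conf() {
  mkdir -p "$CONF_DIR"

  # core-site.xml: the default filesystem is HDFS on this machine.
  cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

  # hdfs-site.xml: a single node can only hold one replica of each block.
  cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}

write_pseudo_conf
echo "Wrote core-site.xml and hdfs-site.xml to $CONF_DIR"
```

After reviewing the output, copying both files over the ones in `/opt/hadoop/hadoop-2.8.2/etc/hadoop` should be enough for the sanity tests in step 11.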

    Multi-Node Configuration

    In this section, we’ll continue our work from our single node configuration. I will use my Hadoop network layout as an example, but you can certainly use your own.

    Before we get started, note that I will be basing my multi-node configuration on the following machine layout:

    linux-01 and linux-02 are the hostnames of the Linux machines I plan to have in the Hadoop cluster. linux-01 acts as both the master and a slave node, so it carries the YARN ResourceManager as well as a slave node. linux-02 acts exclusively as a slave node under linux-01.

    I also assume that you have assigned a static IP to each of your machines. I used Ubuntu’s official documentation to do that.

    To know the host name of your machine, simply run hostname in your terminal.
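As step 1 below explains, Hadoop expects each hostname in the cluster to resolve to the machine's static LAN address rather than to a loopback address. A quick check could look like the sketch below. `is_loopback` and `check_host` are hypothetical helper names, and the script assumes `getent` is available, as it is on typical Ubuntu installs.

```shell
#!/bin/sh
# Return 0 if the address is an IPv4 loopback (127.0.0.0/8) or IPv6 ::1.
is_loopback() {
  case "$1" in
    127.*|::1) return 0 ;;
    *) return 1 ;;
  esac
}

# Resolve a hostname through the system resolver (which honours /etc/hosts)
# and report whether the result is usable for a Hadoop cluster.
check_host() {
  addr=$(getent hosts "$1" | awk 'NR==1 {print $1}')
  if [ -z "$addr" ]; then
    echo "$1: does not resolve"
    return 1
  elif is_loopback "$addr"; then
    echo "$1: resolves to loopback address $addr"
    return 1
  fi
  echo "$1: resolves to $addr"
}

# Check this machine's own hostname; repeat for the other cluster hostnames.
check_host "$(hostname)" || true
```

Running `check_host linux-02` from linux-01 (and the reverse) is a cheap way to catch a bad /etc/hosts entry before starting any daemons.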

    We begin by setting up just the master node, which in my case is the linux-01 machine.

    1. Ensure that the hostnames of your machines resolve to their correct local IP addresses. To do this, we need to edit /etc/hosts on your server. The first time you open /etc/hosts, you should see:

      Hadoop recommends against this. What caused me the biggest headache is that Hadoop expects your hostname, in my case linux-01, and the hostnames of the other machines in your cluster to resolve to their statically assigned local IPs. So, in the case of my two machines:

      These changes should take effect immediately, so no need to restart or log off.

    2. Let us now edit our Hadoop configuration files again, this time for the cluster setup:
      Change your core-site.xml to:

      Note how the HDFS address points to the master node’s hostname, linux-01. Change it to the hostname of your own master node.

      Change your hdfs-site.xml to:

      Create the directories needed by HDFS:

      # Shouldn't need sudo since /opt/hadoop is owned by hadoop user.
      mkdir /opt/hadoop/{namenode,datanode}
      

      Change mapred-site.xml to:

      Change yarn-site.xml to:

    3. Create (or edit) the slaves file in the etc/hadoop directory. Since I want both of my machines to be slaves, with linux-01 acting as both slave and master, my slaves file is:
      linux-01
      linux-02
      
    4. Repeat these steps on your slave machines. You can simply copy the configs over.
    5. Format your HDFS from the master node:
      hdfs namenode -format
      
    6. Start your cluster:
      start-all.sh
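The XML for the four files edited in step 2 is not shown above, so here is a hedged sketch of what a cluster configuration for this layout could look like. The property names are standard Hadoop 2.x settings. The port 9000 and the replication factor of 2 are my assumptions; the master hostname and the namenode/datanode paths come from this post. `MASTER`, `CONF_DIR`, `prop`, and `write_conf` are illustrative names, and the files go to a staging directory for review instead of straight into the Hadoop install.

```shell
#!/bin/sh
# Sketch: write cluster configs to a staging directory for review.
# MASTER and CONF_DIR are illustrative variables, not Hadoop settings.
MASTER="${MASTER:-linux-01}"
CONF_DIR="${CONF_DIR:-./hadoop-cluster-conf}"
mkdir -p "$CONF_DIR"

# Hypothetical helper: emit one <property> element.
prop() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Hypothetical helper: wrap the properties read from stdin in <configuration>.
write_conf() {
  { echo '<configuration>'; cat; echo '</configuration>'; } > "$CONF_DIR/$1"
}

# core-site.xml: every node reaches HDFS through the master's NameNode.
# Port 9000 is an assumption carried over from the single-node setup.
prop fs.defaultFS "hdfs://$MASTER:9000" | write_conf core-site.xml

# hdfs-site.xml: storage directories created earlier under /opt/hadoop.
# A replication factor of 2 is an assumption matching the two DataNodes.
{
  prop dfs.replication 2
  prop dfs.namenode.name.dir file:///opt/hadoop/namenode
  prop dfs.datanode.data.dir file:///opt/hadoop/datanode
} | write_conf hdfs-site.xml

# mapred-site.xml: run MapReduce jobs on YARN.
prop mapreduce.framework.name yarn | write_conf mapred-site.xml

# yarn-site.xml: point NodeManagers at the ResourceManager on the master.
{
  prop yarn.resourcemanager.hostname "$MASTER"
  prop yarn.nodemanager.aux-services mapreduce_shuffle
} | write_conf yarn-site.xml

# slaves: one worker hostname per line.
printf '%s\n' linux-01 linux-02 > "$CONF_DIR/slaves"

echo "Review the files in $CONF_DIR before copying them to etc/hadoop on every node."
```

Setting `MASTER` to your own master's hostname and editing the slaves list should be the only changes needed for a similar small layout.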
      

    Conclusion

    That’s it! You can observe your cluster from the various web interfaces provided by Hadoop. Although tools like Ambari or Chef are available to make the pain of setting up disappear, there is something to be gained by learning the underlying configuration. I hope you got your Hadoop cluster configuration figured out!
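Besides the web interfaces, a quick way to confirm that the daemons came up is to compare the output of `jps` (shipped with the JDK) against the daemons you expect on each node. The sketch below uses a hypothetical helper, `missing_daemons`, which reads `jps` output on stdin and prints any expected daemon that is not listed.

```shell
#!/bin/sh
# Print each expected daemon name that is absent from the jps output on stdin.
# missing_daemons is a hypothetical helper, not part of Hadoop.
missing_daemons() {
  # jps prints "<pid> <name>" per line; keep only the names.
  running=$(awk '{print $2}')
  for d in "$@"; do
    # -x matches the whole line, so NameNode does not match SecondaryNameNode.
    echo "$running" | grep -qx "$d" || echo "$d"
  done
}

# Example for a slave node: expect a DataNode and a NodeManager.
if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons DataNode NodeManager
fi
```

No output means every expected daemon is running. On the master in this layout you would also pass NameNode and ResourceManager. With default Hadoop 2.x settings, the NameNode web interface listens on port 50070 and the ResourceManager on port 8088.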
