Installing a Hadoop Cluster

I have four VMs in dev and I want to configure my own Hadoop cluster to use as a tool for analysis.

I’m going to follow the general process outlined in Hadoop’s instructions and Yahoo’s help, here and here.

This is what the final setup will look like



I found that Hadoop has default ports that need to be opened between the servers before it will work.
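A rough way to see which of those ports are reachable from one machine to another is sketched below. The helper name check_port is hypothetical, and the port list assumes the stock defaults of the Hadoop 0.20/0.21 line; the NameNode and JobTracker RPC ports are whatever you set in the site configuration, so they are not listed. A port only shows as open once the matching daemon is running, so a "closed" result before startup is expected.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a TCP connection to host:port opens within 2 seconds.
check_port() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Stock default ports (assumed unchanged in your configuration):
# 50070 NameNode web UI, 50010 DataNode data transfer, 50020 DataNode IPC,
# 50075 DataNode web UI, 50030 JobTracker web UI, 50060 TaskTracker web UI.
PORTS="50070 50010 50020 50075 50030 50060"

# Hosts come from the command line, e.g.: ./check-ports.sh master slave
for host in "$@"; do
  for port in $PORTS; do
    if check_port "$host" "$port"; then
      echo "$host:$port open"
    else
      echo "$host:$port closed or filtered"
    fi
  done
done
```

Run it from the master against each slave, and from a slave against the master, since the firewall rules may differ per direction.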

# Add local ssh support on each machine (RSA key, matching the id_rsa.pub copied below)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# Configure master to talk to each slave.
a.brlamore@master:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub a.brlamore@slave 

# Configure each slave to talk to the master.
a.brlamore@slave:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub a.brlamore@master

Important Directories

Directory          Description                                        Suggested location
HADOOP_LOG_DIR     Output location for log files from daemons         /var/log/hadoop
hadoop.tmp.dir     A base for other temporary directories             /tmp/hadoop
dfs.name.dir       Where the NameNode metadata should be stored       /home/hadoop/dfs/name
dfs.data.dir       Where DataNodes store their blocks                 /home/hadoop/dfs/data
mapred.system.dir  The in-HDFS path to shared MapReduce system files  /hadoop/mapred/system
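The local directories in the table can be created ahead of time. This is a minimal sketch using the suggested locations; the function name make_hadoop_dirs and its optional prefix argument are my own additions so the layout can be tried in a scratch directory before touching the real paths, which may need root.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the local directories from the table under an
# optional prefix (empty prefix = the real locations, which may need root).
make_hadoop_dirs() {
  prefix="${1:-}"
  for dir in /var/log/hadoop /tmp/hadoop /home/hadoop/dfs/name /home/hadoop/dfs/data; do
    mkdir -p "$prefix$dir" || return 1
  done
}

# mapred.system.dir lives inside HDFS, so it is not created on the local disk.
# Trial run into a scratch directory:
#   make_hadoop_dirs /tmp/hadoop-layout-test
# Then for real:
#   make_hadoop_dirs
```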


Hadoop Home

  • HADOOP_HOME is set to /u01/accts/a.brlamore/tmp/hadoop-0.21.0
  • I’ve put in a request for root access so I can change this to /opt/hadoop

Edit Slaves file

vi conf/slaves
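The slaves file is a plain list of worker hostnames, one per line; these are the machines that run the DataNode and TaskTracker daemons. As a sketch, it can also be written without an editor. The function write_slaves and the hostnames slave1 to slave3 are placeholders, not the names of my actual machines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write one worker hostname per line to <conf_dir>/slaves.
# Overwrites any existing slaves file.
write_slaves() {
  conf_dir="$1"; shift
  mkdir -p "$conf_dir" || return 1
  printf '%s\n' "$@" > "$conf_dir/slaves"
}

# Example with placeholder hostnames (replace with your own worker nodes):
#   write_slaves "$HADOOP_HOME/conf" slave1 slave2 slave3
```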


Site Configuration

  • Set JAVA_HOME in conf/hadoop-env.sh: export JAVA_HOME=/usr/java/default
  • Set values in conf/core-site.xml
    <description>A base for other temporary directories. Default location /tmp/hadoop-${user.name}. Suggested Location /tmp/hadoop</description>
    <description>The name of the default file system. This specifies the NameNode</description>
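For reference, a complete core-site.xml built around those two descriptions might look like the sketch below. I am assuming the properties are hadoop.tmp.dir and fs.default.name, which is what the descriptions correspond to in the stock documentation. The host master and port 9000 are placeholders, not values taken from my setup.

```shell
#!/usr/bin/env bash
# Assumption: run from the Hadoop install directory, or set CONF_DIR yourself.
CONF_DIR="${CONF_DIR:-./conf}"
mkdir -p "$CONF_DIR"

# Overwrites any existing core-site.xml. The quoted 'EOF' keeps the shell from
# trying to expand ${user.name}. Host "master" and port 9000 are placeholders.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
    <description>A base for other temporary directories. Default location /tmp/hadoop-${user.name}. Suggested Location /tmp/hadoop</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
    <description>The name of the default file system. This specifies the NameNode</description>
  </property>
</configuration>
EOF
```

The same file needs to be present on every node so that the slaves know where the NameNode is.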

Set values in conf/hdfs-site.xml

    <description>Where the NameNode metadata should be stored. Default location is ${hadoop.tmp.dir}/dfs/name. Suggested location /home/hadoop/dfs/name</description>
    <description>Where DataNodes store their blocks. Default location ${hadoop.tmp.dir}/dfs/data. Suggested location /home/hadoop/dfs/data</description>
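A full hdfs-site.xml using the suggested locations could be generated as sketched below. The property names dfs.name.dir and dfs.data.dir come from the directory table; everything else in the wrapper (the CONF_DIR variable, the heredoc approach) is just one way to write the file.

```shell
#!/usr/bin/env bash
# Assumption: run from the Hadoop install directory, or set CONF_DIR yourself.
CONF_DIR="${CONF_DIR:-./conf}"
mkdir -p "$CONF_DIR"

# Overwrites any existing hdfs-site.xml. The quoted 'EOF' keeps the shell from
# trying to expand ${hadoop.tmp.dir}.
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/dfs/name</value>
    <description>Where the NameNode metadata should be stored. Default location is ${hadoop.tmp.dir}/dfs/name. Suggested location /home/hadoop/dfs/name</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/dfs/data</value>
    <description>Where DataNodes store their blocks. Default location ${hadoop.tmp.dir}/dfs/data. Suggested location /home/hadoop/dfs/data</description>
  </property>
</configuration>
EOF
```

Moving these out of hadoop.tmp.dir matters because anything under /tmp can be cleaned up by the system, which would take the NameNode metadata with it.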

Set values in conf/mapred-site.xml

    <description>Host or IP and port of JobTracker</description>
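A matching mapred-site.xml might look like the sketch below. I am assuming the property is mapred.job.tracker, which is what that description corresponds to in the stock documentation; the value master:9001 is a placeholder host and port, not the one from my cluster.

```shell
#!/usr/bin/env bash
# Assumption: run from the Hadoop install directory, or set CONF_DIR yourself.
CONF_DIR="${CONF_DIR:-./conf}"
mkdir -p "$CONF_DIR"

# Overwrites any existing mapred-site.xml. Host "master" and port 9001 are placeholders.
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
    <description>Host or IP and port of JobTracker</description>
  </property>
</configuration>
EOF
```

Note that the value is a bare host:port pair, with no hdfs:// scheme in front of it.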

Hadoop Startup

# Format the filesystem
bin/hadoop namenode -format

# Start the HDFS on the NameNode

# Start Map-Reduce on the TrackerNode
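The two start steps are usually done with the start-dfs.sh and start-mapred.sh scripts that ship in the bin directory of this Hadoop line. The sketch below assumes those scripts and wraps them in a hypothetical run helper that only prints each command unless RUN=1 is set, so it is safe to try before anything is actually launched.

```shell
#!/usr/bin/env bash
# Assumption: HADOOP_HOME points at the unpacked Hadoop directory.
HADOOP_HOME="${HADOOP_HOME:-$PWD}"

# Hypothetical helper: print each command; execute it only when RUN=1.
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# On the NameNode host: starts the NameNode plus the DataNodes listed in conf/slaves
run "$HADOOP_HOME/bin/start-dfs.sh"

# On the JobTracker host: starts the JobTracker plus the TaskTrackers
run "$HADOOP_HOME/bin/start-mapred.sh"

# List the running Java daemons on this machine to confirm they came up
run jps
```

Invoke it as RUN=1 ./start-cluster.sh (the script name is arbitrary) once the printed commands look right.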