Hadoop

Hadoop is a framework for distributed processing of large data sets. It includes the HDFS distributed file system and MapReduce.

We're using Cloudera's distribution of Hadoop (CDH4) because it's pretty simple to use, and all the pieces have been integrated and tested together. CDH4 provides an APT repository specifically for Debian Squeeze.

We've decided to use MapReduce 2.0 (also known as YARN, or MRv2).

Requirements

We'll need to have Java installed on the servers.
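A quick way to check the Java requirement on a server is sketched below. The helper name have_java is made up for this example, and the package name in the hint (openjdk-6-jdk) is an assumption chosen to match the java-6-openjdk path used later on this page; substitute whichever JDK you standardize on.

```shell
# Succeed if a Java runtime is on the PATH.
# (have_java is a hypothetical helper name.)
have_java() {
    command -v java >/dev/null 2>&1
}

if have_java; then
    # java -version prints to stderr, so redirect it to show the first line.
    java -version 2>&1 | head -n 1
else
    # Assumed package name; adjust to the JDK you actually want.
    echo "Java not found; try: sudo apt-get install openjdk-6-jdk"
fi
```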

Installation

# Add Cloudera's GPG key, to verify downloaded packages.
curl -s http://archive.cloudera.com/cdh4/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -
 
# Add the APT repository for CDH4.
sudo sh -c 'cat > /etc/apt/sources.list.d/cloudera.list' <<EOF
deb http://archive.cloudera.com/cdh4/debian/squeeze/amd64/cdh squeeze-cdh4 contrib
deb-src http://archive.cloudera.com/cdh4/debian/squeeze/amd64/cdh squeeze-cdh4 contrib
EOF
 
# Tell APT to load the new repo.
sudo apt-get update
 
# Install the client tools. (This includes hadoop-hdfs, hadoop-mapreduce, and hadoop-yarn.)
sudo apt-get install hadoop-client
 
# Install the ResourceManager.
sudo apt-get install hadoop-yarn-resourcemanager
 
# Install NameNode.
sudo apt-get install hadoop-hdfs-namenode
 
# Install the NodeManager and DataNode.
sudo apt-get install hadoop-yarn-nodemanager hadoop-hdfs-datanode
 
# Install HBase.
sudo apt-get install hbase
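The commands above install every role on one machine. On a multi-server cluster you would probably install only the packages for each server's role. The following is a minimal sketch of that idea; the role labels (client, resourcemanager, namenode, worker, hbase) and the function name packages_for_role are hypothetical, while the package names are the ones used above.

```shell
# Print the packages for one role.
# Role labels are hypothetical; package names come from the steps above.
packages_for_role() {
    case "$1" in
        client)          echo "hadoop-client" ;;
        resourcemanager) echo "hadoop-yarn-resourcemanager" ;;
        namenode)        echo "hadoop-hdfs-namenode" ;;
        worker)          echo "hadoop-yarn-nodemanager hadoop-hdfs-datanode" ;;
        hbase)           echo "hbase" ;;
        *)               echo "unknown role: $1" >&2; return 1 ;;
    esac
}
```

For example, on a worker node: sudo apt-get install $(packages_for_role worker)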

Configuration

Make sure conf/core-site.xml and conf/yarn-site.xml have the hostnames of the NameNode, the ResourceManager, and the ResourceManager Scheduler.
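As a rough illustration of what those entries might look like, the sketch below writes sample files to a scratch directory rather than touching the live configuration. The hostnames are placeholders, OUT_DIR is a hypothetical variable, and the ports (8020, 8032, 8030) are commonly used defaults that should be treated as assumptions; check them against your own setup before copying anything into place.

```shell
# Placeholder hostnames; replace with your own.
NAMENODE_HOST="namenode.example.com"
RM_HOST="resourcemanager.example.com"

# Write samples to a scratch directory so nothing live is overwritten.
OUT_DIR="${OUT_DIR:-${TMPDIR:-/tmp}/hadoop-conf-sample}"
mkdir -p "$OUT_DIR"

# core-site.xml: where clients find the NameNode (port is an assumption).
cat > "$OUT_DIR/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$NAMENODE_HOST:8020</value>
  </property>
</configuration>
EOF

# yarn-site.xml: ResourceManager and its scheduler (ports are assumptions).
cat > "$OUT_DIR/yarn-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>$RM_HOST:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>$RM_HOST:8030</value>
  </property>
</configuration>
EOF

echo "Sample files written to $OUT_DIR"
```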

Set JAVA_HOME for the Hadoop services.

sudo sh -c 'cat >> /etc/default/hadoop' <<EOF
export JAVA_HOME='/usr/lib/jvm/java-6-openjdk'
EOF
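Before starting the services, it may be worth confirming that the JAVA_HOME path really contains a JVM. This is a small sketch; valid_java_home and JAVA_HOME_CANDIDATE are names invented for the example.

```shell
# Succeed only if the given directory contains an executable bin/java.
# (valid_java_home is a hypothetical helper name.)
valid_java_home() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Same path as the JAVA_HOME setting above; adjust if your JDK lives elsewhere.
JAVA_HOME_CANDIDATE='/usr/lib/jvm/java-6-openjdk'
if valid_java_home "$JAVA_HOME_CANDIDATE"; then
    echo "OK: $JAVA_HOME_CANDIDATE"
else
    echo "No bin/java under $JAVA_HOME_CANDIDATE" >&2
fi
```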

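Set the services to start on boot. (On Debian, chkconfig comes from the separate chkconfig package; update-rc.d is the native tool for the same job.)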

sudo chkconfig hadoop-hdfs-namenode on
sudo chkconfig hadoop-hdfs-datanode on
sudo chkconfig hadoop-yarn-resourcemanager on
sudo chkconfig hadoop-yarn-nodemanager on

Start the services.

sudo service hadoop-hdfs-namenode start
sudo service hadoop-hdfs-datanode start
sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
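Since the same four service names recur for every action, a small loop can save typing. This is only a sketch: the function name hadoop_services and the SERVICE_CMD override are hypothetical, and it assumes the init scripts accept the usual start, stop, and status actions.

```shell
# Service names as installed above, in start order.
HADOOP_SERVICES="hadoop-hdfs-namenode hadoop-hdfs-datanode hadoop-yarn-resourcemanager hadoop-yarn-nodemanager"

# Run one init action (start, stop, status, ...) against every service.
# SERVICE_CMD is a hypothetical override; set it to "echo" for a dry run.
hadoop_services() {
    action="$1"
    rc=0
    for svc in $HADOOP_SERVICES; do
        # Keep going after a failure so every service gets reported.
        ${SERVICE_CMD:-sudo service} "$svc" "$action" || rc=1
    done
    return $rc
}
```

For example, hadoop_services status asks each service in turn for its state, and returns non-zero if any of them reports a problem.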

TODO

Install Impala. Not sure if there's a Debian package yet. Requires Hive and PostgreSQL connector.

Document ports we need to open in the firewall.

build/hadoop.txt · Last modified: 2012/12/27 00:22 by 99.100.133.164