Hadoop is a framework for distributed processing of large data sets. It includes the HDFS distributed file system and MapReduce.
We're using Cloudera's distribution of Hadoop (CDH4) because it's straightforward to use, and all the pieces have been integrated and tested together. CDH4 provides an APT repository specifically for Debian Squeeze.
We've decided to use MapReduce 2.0 (also known as YARN, or MRv2).
We'll need to have Java installed on the servers.
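A sketch of one option, assuming the stock OpenJDK 6 packages from the Squeeze repositories (this matches the JAVA_HOME path used further down):

# Install OpenJDK 6 from the Debian repositories.
sudo apt-get install openjdk-6-jdk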
# Add Cloudera's GPG key, to verify downloaded packages.
curl -s http://archive.cloudera.com/cdh4/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -

# Add the APT repository for CDH4.
sudo sh -c 'cat > /etc/apt/sources.list.d/cloudera.list' <<EOF
deb http://archive.cloudera.com/cdh4/debian/squeeze/amd64/cdh squeeze-cdh4 contrib
deb-src http://archive.cloudera.com/cdh4/debian/squeeze/amd64/cdh squeeze-cdh4 contrib
EOF

# Tell APT to load the new repo.
sudo apt-get update

# Install the client tools. (This includes hadoop-hdfs, hadoop-mapreduce, and hadoop-yarn.)
sudo apt-get install hadoop-client

# Install the ResourceManager.
sudo apt-get install hadoop-yarn-resourcemanager

# Install the NameNode.
sudo apt-get install hadoop-hdfs-namenode

# Install the DataNode.
sudo apt-get install hadoop-yarn-nodemanager hadoop-hdfs-datanode

# Install HBase.
sudo apt-get install hbase
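As a quick sanity check, the client tools should be on the PATH and report a CDH4 build:

# Confirm the Hadoop client installed correctly.
hadoop version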
Make sure conf/core-site.xml and conf/yarn-site.xml have the hostnames of the NameNode, the ResourceManager, and the ResourceManager Scheduler.
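A minimal sketch of the relevant properties, assuming placeholder hostnames (namenode.example.com, resourcemanager.example.com) and the stock default ports; substitute our actual hosts:

conf/core-site.xml (where clients find the NameNode):
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>

conf/yarn-site.xml (where clients and NodeManagers find the ResourceManager and its scheduler):
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager.example.com:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>resourcemanager.example.com:8030</value>
  </property>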
Set JAVA_HOME for the Hadoop services.
sudo sh -c 'cat >> /etc/default/hadoop' <<EOF
export JAVA_HOME='/usr/lib/jvm/java-6-openjdk'
EOF
Set the services to start on boot.
sudo chkconfig hadoop-hdfs-namenode on
sudo chkconfig hadoop-hdfs-datanode on
sudo chkconfig hadoop-yarn-resourcemanager on
sudo chkconfig hadoop-yarn-nodemanager on
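On a brand-new cluster, HDFS likely needs to be formatted once before the NameNode will start; a sketch, assuming the hdfs user created by the CDH packages:

# One-time step on the NameNode host: format the HDFS metadata directories.
sudo -u hdfs hdfs namenode -format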
Start the services.
sudo service hadoop-hdfs-namenode start
sudo service hadoop-hdfs-datanode start
sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
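A couple of quick checks to confirm everything came up (a sketch; output will vary):

# HDFS should answer and list the root directory.
sudo -u hdfs hadoop fs -ls /
# YARN should list the registered NodeManagers.
yarn node -list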
Install Impala. Not sure if there's a Debian package yet. It requires Hive and the PostgreSQL connector.
Document ports we need to open in the firewall.
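As a starting point, assuming the stock default ports (to be verified against our configs before writing firewall rules):

NameNode:        8020 (RPC), 50070 (web UI)
DataNode:        50010 (data transfer), 50020 (IPC), 50075 (web UI)
ResourceManager: 8032 (clients), 8030 (scheduler), 8031 (NodeManagers), 8088 (web UI)
NodeManager:     8042 (web UI)
HBase:           60000/60010 (Master RPC / web UI), 60020/60030 (RegionServer RPC / web UI)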