You’ll need to activate the “partner” repositories in /etc/apt/sources.list by uncommenting the lines below:
deb http://archive.canonical.com/ubuntu maverick partner
deb-src http://archive.canonical.com/ubuntu maverick partner
Replace maverick with the codename of your Ubuntu release. These sources are required for the Java package.
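If you are unsure of your release codename, the following sketch prints the two partner lines for the running system. It assumes the lsb_release utility is installed; the helper name partner_lines is only illustrative and is not part of Seal or Ubuntu.

```shell
# Print the two "partner" repository lines for a given Ubuntu codename.
# partner_lines is a hypothetical helper name used for illustration.
partner_lines() {
    codename="$1"
    printf 'deb http://archive.canonical.com/ubuntu %s partner\n' "$codename"
    printf 'deb-src http://archive.canonical.com/ubuntu %s partner\n' "$codename"
}

# lsb_release -cs prints the short codename of the running release.
# The call is skipped when lsb_release is not available.
if command -v lsb_release >/dev/null 2>&1; then
    partner_lines "$(lsb_release -cs)"
fi
```

Compare the printed lines with the commented-out entries in /etc/apt/sources.list before changing the file.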
Now update your package list and install all the required packages:
sudo apt-get update
sudo apt-get install sun-java6-jdk python protobuf-compiler \
libprotobuf6 libprotoc6 python-protobuf ant ant-optional g++ \
libboost-python-dev
The run-time dependencies need to be installed on all cluster nodes, so you will need to either install the software on every node or install it on a shared volume. The build-time dependencies [1], on the other hand, only need to be installed on the node you use to build Seal. JUnit4 is needed only to run the unit tests, and only if you're using a version of Hadoop older than 0.20.203.
If you haven't done so already, set up your Hadoop cluster. Please refer to the Hadoop documentation for your chosen distribution.
Seal has been developed with Apache Hadoop 0.20, but we have also tested it with Cloudera CDH3.
If you installed Hadoop from the Apache tarball, download the latest version of Pydoop from http://sourceforge.net/projects/pydoop/files/. Set the HADOOP_HOME environment variable so that it points to the directory where the Hadoop tarball was extracted:
export HADOOP_HOME=<path to Hadoop directory>
Then, in the same shell:
tar xzf pydoop-0.4.0_rc2.tar.gz
cd pydoop-0.4.0_rc2
python setup.py build
If you installed Hadoop from the Cloudera CDH3 packages, download the latest version of Pydoop from http://sourceforge.net/projects/pydoop/files/. We assume the Cloudera package repository is already in your sources (see Installing CDH3 on Ubuntu Systems). You'll need to install the Hadoop source code and libhdfs:
sudo apt-get install hadoop-source libhdfs0 libhdfs0-dev
Now extract and build Pydoop:
tar xzf pydoop-0.4.0_rc2.tar.gz
cd pydoop-0.4.0_rc2
python setup.py build
You need to decide where to install Pydoop. Remember that it needs to be accessible by all the cluster nodes running Seal tasks. We recommend installing to a shared volume, except for medium-large clusters (more than 100 nodes) where local installation may be necessary.
If your user’s home directory is accessible on all cluster nodes, then installing it there may be a good idea:
python setup.py install --user
Otherwise, to install to a specific path:
python setup.py install --home <path>
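With --home, the standard distutils "home" layout places Python modules under <path>/lib/python, which is not on Python's default search path. A minimal sketch for making such an installation importable follows; the function name add_to_pythonpath is hypothetical, and the same setting would have to be in effect on every node that runs Seal tasks.

```shell
# Prepend <prefix>/lib/python to PYTHONPATH, keeping any existing entries.
# add_to_pythonpath is a hypothetical helper name, not part of Pydoop.
add_to_pythonpath() {
    prefix="$1"
    # ${PYTHONPATH:+:$PYTHONPATH} appends the old value only if it is non-empty,
    # which avoids a trailing colon (an empty entry means the current directory).
    PYTHONPATH="$prefix/lib/python${PYTHONPATH:+:$PYTHONPATH}"
    export PYTHONPATH
}
```

For example, after python setup.py install --home /shared/seal (an illustrative path), calling add_to_pythonpath /shared/seal in your shell profile would let python -c "import pydoop" find the package.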
For a system-wide (local) installation:
sudo python setup.py install --skip-build
Note: If you had to export HADOOP_HOME to build Pydoop, make sure the variable is also set when you call setup.py install. The Pydoop documentation has more details regarding its installation.
Seal needs the Hadoop jars to compile. Tell the build script where to find them by setting the HADOOP_HOME environment variable.
If you installed Hadoop from a tarball, set HADOOP_HOME to point to the extracted copy of the archive. For instance:
export HADOOP_HOME=/home/me/hadoop-0.20
If you installed Hadoop from the Cloudera packages, set HADOOP_HOME like this:
export HADOOP_HOME=/usr/lib/hadoop
The build process expects to find the Hadoop jars in the ${HADOOP_HOME} and ${HADOOP_HOME}/lib directories.
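Before running the build, it can help to confirm that HADOOP_HOME really points at a directory containing jars. The sketch below counts the jar files in the two locations named above; count_hadoop_jars is a hypothetical helper and is not part of the Seal build script.

```shell
# Count the *.jar files directly inside <dir> and <dir>/lib.
# count_hadoop_jars is a hypothetical helper name used for illustration.
count_hadoop_jars() {
    dir="$1"
    n=0
    for f in "$dir"/*.jar "$dir"/lib/*.jar; do
        # An unmatched glob stays as a literal pattern, so test that the file exists.
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    printf '%s\n' "$n"
}
```

Running count_hadoop_jars "$HADOOP_HOME" and getting 0 suggests the variable is unset or points to the wrong directory.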
Seal includes Java, Python and C components that need to be built. A Makefile is provided that builds all components. Simply go into the root Seal source directory and run:
make
This will create the archive build/seal-<release>.tar.gz containing all Seal components. Go to the section on Deploying to see what to do with it.
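Since the release string in the archive name varies, a small helper can locate the file after make finishes. This is only a sketch that assumes the build/seal-<release>.tar.gz naming described above; find_seal_archive is a hypothetical name.

```shell
# Print the path of the first seal-*.tar.gz found in the given directory
# (default: build) and return non-zero if there is none.
# find_seal_archive is a hypothetical helper name used for illustration.
find_seal_archive() {
    build_dir="${1:-build}"
    for f in "$build_dir"/seal-*.tar.gz; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1
}
```

From the root Seal source directory, tar tzf "$(find_seal_archive)" would then list the contents of the archive without extracting it.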
You can find the documentation for Seal at http://biodoop-seal.sourceforge.net/.
If, however, you want to build a local copy yourself, you can do so in three steps:
You'll find the HTML documentation at docs/_build/html/index.html.
[1] The following packages should only be required at build time: protobuf-compiler, libprotoc6, ant, ant-optional, g++.