I am going to try to start posting regularly again, including proper blog posts and not just linkfests.

Weekend Linkfest 1.12.2013

R Package of the Week

Weekend Linkfest 9.11.2013

I successfully defended my master's thesis this week, so I was too busy to post linkfests in the meantime. But now I am back! I have also released my first package on CRAN: parboost. Expect more on parboost in another post.

Weekend Linkfest 20.10.2013


Weekend Linkfest 12.10.2013


Weekend Linkfest 5.10.2013

Weekend Linkfest 28.9.2013


Weekend Linkfest 15.9.2013


Weekend Linkfest 8.9.2013


StarCluster and R

StarCluster is a utility for creating and managing distributed
computing clusters hosted on Amazon's Elastic Compute Cloud (EC2). It
uses the EC2 web service to create and destroy clusters of Linux
virtual machines on demand.

StarCluster provides a convenient way to quickly set up a cluster of machines to run some data parallel jobs using a distributed memory framework.

Install StarCluster using

$ sudo easy_install StarCluster

and then create a configuration file using

$ starcluster help

Add your AWS credentials to the config file and follow the instructions in the StarCluster Quick-Start guide. Once the configuration is in place, starcluster start mycluster launches a cluster.
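For orientation, a minimal configuration file looks roughly like the sketch below. Every value here is a placeholder (credentials, key name, cluster name, instance type) that you would substitute with your own; check the Quick-Start guide for the full set of options. CLUSTER_SIZE = 11 gives you a master plus ten worker nodes.

```ini
[global]
DEFAULT_TEMPLATE = mycluster

[aws info]
AWS_ACCESS_KEY_ID = your-access-key-here
AWS_SECRET_ACCESS_KEY = your-secret-key-here
AWS_USER_ID = your-user-id-here

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster mycluster]
KEYNAME = mykey
CLUSTER_SIZE = 11
NODE_INSTANCE_TYPE = m1.small
```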

Once you have StarCluster up and running, you need to install R on all the cluster nodes and any packages you require. I wrote a shell script to automate the process:


#!/usr/bin/env zsh
# Usage: ./setup.zsh <cluster-name>
starcluster put $1 starcluster.setup.zsh /home/starcluster.setup.zsh
starcluster put $1 Rpkgs.R /home/Rpkgs.R

numNodes=`starcluster listclusters | grep "Total nodes" | cut -d' ' -f3`
nodes=(`eval echo $(seq -f node%03g 1 $(($numNodes-1)))`)

# Run the installation on the master node
starcluster sshmaster $1 "source /home/starcluster.setup.zsh >& /home/install.log.master" &

# Run the installation on every worker node in parallel
for node in $nodes; do
    cmd="source /home/starcluster.setup.zsh >& /home/install.log.$node"
    starcluster sshmaster $1 "ssh $node $cmd" &
done
wait

The script takes the name of your cluster as a parameter and pushes the two helper files to the cluster. It then runs the installation on the master and every node. It assumes you are running an Ubuntu Server based StarCluster AMI, which is the default. The first helper script, starcluster.setup.zsh, installs the basic software required:


# Add the CRAN Ubuntu repository (substitute any CRAN mirror you prefer)
echo "deb http://cran.r-project.org/bin/linux/ubuntu precise/" >> /etc/apt/sources.list
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
apt-get update
apt-get install -y r-base r-base-dev
echo "DONE with Ubuntu package installation on $(hostname -s)."
R CMD BATCH --no-save /home/Rpkgs.R /home/install.Rpkgs.log
echo "DONE with R package installation on $(hostname -s)."

The second script, Rpkgs.R, is just an R script containing the packages you want installed:

install.packages(c("randomForest", "caret", "mboost", "plyr", "glmnet"),
                 repos = "http://cran.r-project.org")  # any CRAN mirror works
print(paste0("DONE with R package installation on ",
             system("hostname -s", intern = TRUE), "."))

Once you have everything installed, you can ssh into your master node and start up R as usual:

$ starcluster sshmaster mycluster
$ R

Since StarCluster has set up all the networking nicely, you can use parLapply from the parallel package to run a task on your cluster without further configuration. Running a data parallel task on a cluster with 10 nodes is now as easy as this (parLapply is just like lapply, except it distributes the tasks over the cluster):

library(parallel)
cluster_names <- sprintf("node%03d", 1:10)  # "node001" ... "node010"
cluster <- makePSOCKcluster(names = cluster_names)
output <- parLapply(cluster, some_input, some_function)
stopCluster(cluster)
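As a self-contained toy illustration of the same pattern (hypothetical data and function, not from the post above), the snippet below bootstraps a mean in parallel. It uses a local two-worker PSOCK cluster so you can try it without EC2; on StarCluster you would pass the vector of node names instead of a worker count.

```r
library(parallel)

x <- rnorm(1000)                       # toy data
# A local 2-worker PSOCK cluster stands in for the EC2 nodes here;
# on StarCluster, pass the node names instead of 2.
cluster <- makePSOCKcluster(2)
clusterExport(cluster, "x")            # ship the data to every worker
boot_means <- parLapply(cluster, 1:100,
                        function(i) mean(sample(x, replace = TRUE)))
stopCluster(cluster)
```

clusterExport is needed because the worker processes start with empty workspaces; any object your function uses must be shipped over explicitly.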

Now you can watch 10 machines working for you. Like!