Setting up an Apache Spark Cluster
Installing Spark/pySpark
We will not deal with the actual setup of Spark and IPython on each machine in this post. To get that done, you can follow the instructions in this blog post.
Configuring the Spark Cluster Manager
The first thing we will need is for the manager machine to be able to SSH into the worker machines. To do this we need to create an SSH-keypair.
1. Create SSH Keypair
Create an SSH-keypair using the following commands:
# Move into the .ssh folder
$ cd ~/.ssh
# Create a keypair.
$ ssh-keygen -t rsa
When asked for a file name, pick something descriptive, like: sparkManagerKey
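If you prefer a non-interactive invocation, ssh-keygen can take the file name and passphrase as flags; a minimal sketch, assuming the sparkManagerKey name from above and an empty passphrase for convenience:
# -f sets the key file name, -N "" sets an empty passphrase
# (fine for a tutorial setup; consider a passphrase in production).
$ ssh-keygen -t rsa -f ~/.ssh/sparkManagerKey -N ""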
2. Worker Shortcuts
Open the SSH config file using nano ~/.ssh/config
and paste in the following configuration settings for the workers:
Host worker.one
HostName XXX.XXX.XXX.XXX
Port 22
User <USERNAME HERE>
IdentityFile ~/.ssh/sparkManagerKey
Host worker.two
HostName XXX.XXX.XXX.XXX
Port 22
User <USERNAME HERE>
IdentityFile ~/.ssh/sparkManagerKey
The benefit of doing this is that you can now refer to the worker machines by their given names. So if you want to SSH into one of the machines from your manager machine, all you need to do is ssh worker.one. This will not work just yet, though, as the worker machines do not accept incoming SSH connections from the manager at this point. We’ll deal with that in a moment; first we’ll finish setting up the manager machine.
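One small gotcha: SSH refuses to read a config file that is writable by other users ("Bad owner or permissions"), so it doesn't hurt to tighten the permissions:
# Restrict the config file to your own user:
$ chmod 600 ~/.ssh/config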
3. Configure the Spark Manager
Now we will need to configure the manager machine to know about its workers. Navigate into the Spark installation folder and then go to the conf directory (cd $SPARK_HOME/conf). There are a number of files present here with the extension .template. To look at the contents of the directory:
$ ls -la
drwxr-xr-x 2 spark.manager spark.manager 4096 Jun 19 17:06 .
drwxr-xr-x 13 spark.manager spark.manager 4096 Jun 19 17:06 ..
-rw-r--r-- 1 spark.manager spark.manager 202 Jun 3 02:30 docker.properties.template
-rw-r--r-- 1 spark.manager spark.manager 303 Jun 3 02:30 fairscheduler.xml.template
-rw-r--r-- 1 spark.manager spark.manager 632 Jun 3 02:30 log4j.properties.template
-rw-r--r-- 1 spark.manager spark.manager 5565 Jun 3 02:30 metrics.properties.template
-rw-r--r-- 1 spark.manager spark.manager 80 Jun 3 02:30 slaves.template
-rw-r--r-- 1 spark.manager spark.manager 507 Jun 3 02:30 spark-defaults.conf.template
-rwxr-xr-x 1 spark.manager spark.manager 3318 Jun 3 02:30 spark-env.sh.template
We’ll copy the one named slaves.template
and configure the workers there.
$ cp slaves.template slaves
$ nano slaves
When it opens, paste in the following configuration:
# Slaves file.
worker.one
worker.two
Here you can see how nice it is to have ‘shortcut’ names for the workers: if we ever want to change something about their configuration, we only need to edit the SSH config file and Spark will pick up the change automatically.
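Once the workers accept the manager’s key (we set this up in the worker section below), a quick sanity check is to loop over the slaves file and SSH into every host listed there:
# Print each worker's hostname over SSH; a password prompt or an
# error means that worker is not configured yet.
$ for w in $(grep -v '^#' $SPARK_HOME/conf/slaves); do ssh $w hostname; done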
Next, copy the spark-env.sh
template and open it using
$ cp spark-env.sh.template spark-env.sh
$ nano spark-env.sh
In this file there are a lot of configuration parameters that can be set; we only need a few for this setup. Set the following parameters and make sure to replace <MANAGER_IP_ADDRESS>
with the IP-address of your Spark manager machine:
SPARK_LOCAL_IP=<MANAGER_IP_ADDRESS>
SPARK_MASTER_IP=<MANAGER_IP_ADDRESS>
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=1000m
Note: The IP-address used here is the IP-address of the manager (or master) machine.
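For reference, here is the same block with a short note on what each parameter controls; 192.168.33.10 is a made-up example address, so substitute your manager’s IP:
# IP address this machine binds to:
SPARK_LOCAL_IP=192.168.33.10
# IP address the workers use to reach the master:
SPARK_MASTER_IP=192.168.33.10
# Number of cores each worker may use for Spark executors:
SPARK_WORKER_CORES=1
# Amount of memory each worker may give to executors:
SPARK_WORKER_MEMORY=1000m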
The last thing to do on the manager machine is to adjust the IPython startup command. When starting IPython/Spark as the master/manager, the command is slightly different, as we have to tell it that we are running it as a manager and not as a standalone instance. Open up the .bash_profile:
$ nano ~/.bash_profile
and add the following line to it:
alias IPYSPARKMASTER='MASTER=spark://<MANAGER_IP_ADDRESS>:7077 PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --profile=pyspark --ip=0.0.0.0" $SPARK_HOME/bin/pyspark'
Note: here again we are using the IP address of the master/manager where it states <MANAGER_IP_ADDRESS>.
So rather than typing IPYSPARK
(the standalone shortcut in the bash profile), we can later start IPython in Spark-cluster mode using IPYSPARKMASTER.
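For the new alias to be available in the current shell session, reload the profile:
# Reload the profile so the alias takes effect immediately:
$ source ~/.bash_profile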
These were all the configurations that need to be set up for the master/manager. We now need to configure the workers so they know what their role is and who is managing them.
Configuring Spark Workers
The following instructions deal with setting up the worker machines. Note that it’s best to complete the above instructions before setting up the workers, as you will need bits and pieces from them here.
1. Add the Manager’s SSH-Keypair
For the manager machine to be able to SSH into the worker machine, the worker first needs to have the manager’s public key listed as an authorized key.
On the manager machine print the contents of the public key using
$ cat ~/.ssh/sparkManagerKey.pub
Copy the contents and, on the worker machine, open the authorized_keys
file in a text editor
$ nano ~/.ssh/authorized_keys
Paste the contents of the public key you just copied onto a new line, then save and close the file.
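sshd is strict about permissions here: with the default StrictModes setting it will ignore an authorized_keys file that other users can write to, so make sure the file is locked down:
# Restrict the file to your own user:
$ chmod 600 ~/.ssh/authorized_keys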
Now your worker machine will accept SSH connections from the manager machine. You can test this by connecting from the manager machine using ssh worker.one.
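As an alternative to copying the key by hand, ssh-copy-id can append it for you, assuming password-based SSH to the worker is still enabled:
# Run on the manager; appends the public key to the worker's authorized_keys.
$ ssh-copy-id -i ~/.ssh/sparkManagerKey.pub worker.one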
2. Create a Variable for the Manager Machine
Add the IP address of the manager machine to the .bash_profile
file
$ nano ~/.bash_profile
and paste in the following:
# The manager machine's IP address, exported as an environment variable:
export SPARK_MANAGER_IP=<MANAGER_IP_ADDRESS>
Note: here again we are using the IP address of the master/manager where it states <MANAGER_IP_ADDRESS>.
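To confirm the variable is set, reload the profile and echo it:
# Reload the profile and check the variable:
$ source ~/.bash_profile
$ echo $SPARK_MANAGER_IP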
3. Configuring the Spark Worker
On the worker machine, navigate into the Spark configuration folder (cd $SPARK_HOME/conf).
Just like on the manager machine, we need to configure the Spark environment for this worker. Copy the spark-env.sh
template and open it in a text editor
$ cp spark-env.sh.template spark-env.sh
$ nano spark-env.sh
Now we need to add some configuration to this file so the worker machine knows who to report to. Add the following lines:
SPARK_MASTER_IP=$SPARK_MANAGER_IP
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=1000m
Note: the SPARK_MANAGER_IP
variable was exported earlier in the ~/.bash_profile
. If it turns out not to be visible when Spark’s startup scripts run (they may be launched over a non-interactive SSH session that does not read .bash_profile
), put the manager’s IP address in spark-env.sh
directly.
4. Spark Worker Path
Since we are working with Vagrant virtual machines, this is where we run into a complication. The workers have a Spark installation under a path like /home/<MACHINE_NAME>/spark_installation
. However, the manager machine is SSH-ing in as vagrant (or some other username) and is therefore looking for /home/vagrant/spark_installation
. To work around this, we will create a symbolic link inside each worker machine that reflects the manager machine’s login and points to the actual installation path:
$ sudo mkdir /home/<MANAGER_NAME>
$ sudo ln -s /home/<WORKER_NAME>/spark-1.4.0-bin-hadoop2.6/ /home/<MANAGER_NAME>/spark-1.4.0-bin-hadoop2.6
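To check that the link resolves to the real installation, readlink -f follows it to its target:
# Should print /home/<WORKER_NAME>/spark-1.4.0-bin-hadoop2.6
$ readlink -f /home/<MANAGER_NAME>/spark-1.4.0-bin-hadoop2.6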
Note: in a production environment, make sure the same user is configured on all machines; the username mismatch is the root of the problem here.
Start the Cluster and Use IPython
To start the cluster, we need to go to the manager machine and run the following command from within the Spark folder ($SPARK_HOME):
# Start the master and worker...
$ ./sbin/start-all.sh
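Assuming a JDK is installed, jps lists the running JVM processes and is a quick way to confirm the daemons came up: you should see a Master process on the manager and a Worker process on each worker.
# List running JVM processes (run on both manager and workers):
$ jps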
Now, to start IPython, navigate to the folder we want to start it from.
# Navigate to the HOME folder...
$ cd $HOME
# Start IPython.
$ IPYSPARKMASTER
Once everything is running, you can visit the following pages:
- http://<master_ip_address or 127.0.0.1>:8001/tree for the IPython notebooks
- http://<master_ip_address or 127.0.0.1>:8080 for the cluster overview GUI (starts working after start-all.sh has run)
- http://<master_ip_address or 127.0.0.1>:4040 to see jobs (starts working after the Python kernel has started up)
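If you would rather verify the cluster before opening a notebook, one option is to submit the Pi example that ships with the Spark binary distribution (the path below assumes the standard 1.4.0 layout):
# Estimate Pi on the cluster using 10 partitions.
$ $SPARK_HOME/bin/spark-submit --master spark://<MANAGER_IP_ADDRESS>:7077 \
    $SPARK_HOME/examples/src/main/python/pi.py 10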
When finished, quit the notebook and don’t forget to stop all the services using:
$ cd $SPARK_HOME
$ ./sbin/stop-all.sh