Docker 101

Setup spark cluster in docker with 4 commands

Posted by Shubham Chaudhary on July 22, 2017

When working in multi node environment like Spark/Hadoop clusters, docker diminishes the barrier to entry. By barrier to entry, I mean the need to have a constantly running EMR cluster, when you are still in development phase. With Docker, you can quickly setup a 4-5 node cluster on a single machine and start coding your spark job. You can understand what Docker is and why you would use Docker on these links.

Benefits

  • You can very easily version control your environment
  • Barrier to entry for working with clusters (Spark/Hadoop) etc. reduces a lot. You no longer need EMR access to run a cluster which will have a cost associated with it.

Installing Docker

Follow this official guide

Manual

Quickly for ubuntu steps are:

sudo apt-get update
sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
sudo apt-get update
sudo apt-get install docker-ce
sudo docker run hello-world

With Ansible

You can use following command to use ansible to install docker for you

sudo python2.7 -m pip install ansible && sudo ansible-galaxy install --force angstwad.docker_ubuntu && echo '- hosts: all
  roles:
      - angstwad.docker_ubuntu
      ' > /tmp/docker_ubuntu.yml && sudo ansible-playbook /tmp/docker_ubuntu.yml -c local -i 'localhost,'

Setting up a cluster

Follow this post.

You will be able to run a local spark cluster with 4 commands.

Quick overview:

mkdir spark_cluster; cd spark_cluster

echo 'version: "2"

services:
  master:
    image: singularities/spark
    command: start-spark master
    hostname: master
    ports:
      - "6066:6066"
      - "7070:7070"
      - "8080:8080"
      - "50070:50070"
  worker:
    image: singularities/spark
    command: start-spark worker master
    environment:
      SPARK_WORKER_CORES: 1
      SPARK_WORKER_MEMORY: 2g
    links:
      - master
' > docker-compose.yml

sudo docker-compose up -d

# sudo docker-compose scale worker=2

Extending other images

With docker you can build on top of someone else’s image. For example, here I will extend singularities/spark image, make my custom spark configuration changes, and push the final version to my own docker hub repo.

Pushing your changes to Docker hub

To create a fork from some base repo (singularities/spark), these are the steps

sudo docker run -it singularities/spark  # Run base repo. This will open a shell
# Make your changes to the image in this container
sudo docker login --username=chaudhary --password=lol
sudo docker commit <container ID from docker ps> chaudhary/my-repo-name  # Commit changes
sudo docker tag <image ID from docker images> chaudhary/my-repo-name  # Tag for pull to work properly
sudo docker push chaudhary/my-repo-name

Now that you have pushed this image, you can start a new container from this image as shown below:

sudo docker run -it chaudhary/my-repo-name

Resources

For more information read the official getting started guide.