When working in a multi-node environment like a Spark/Hadoop cluster, Docker lowers the barrier to entry. By barrier to entry, I mean the need to keep an EMR cluster running constantly while you are still in the development phase. With Docker, you can quickly set up a 4-5 node cluster on a single machine and start coding your Spark job. You can understand what Docker is and why you would use it from these links.
Benefits
- You can version control your environment very easily (see the sketch after this list)
- The barrier to entry for working with clusters (Spark/Hadoop, etc.) drops significantly. You no longer need access to EMR to run a cluster, which would have a cost associated with it.
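As a minimal sketch of the first point: the docker-compose.yml created later in this post describes your whole cluster, so it can be committed like any other code (the commit message here is only an example):
git init
git add docker-compose.yml
git commit -m "Version the Spark cluster definition"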
Installing Docker
Follow this official guide.
Manual
For Ubuntu, the quick steps are:
sudo apt-get update
sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
    "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
    $(lsb_release -cs) \
    stable"
sudo apt-get update
sudo apt-get install docker-ce
sudo docker run hello-world
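If the install worked, hello-world prints a greeting and exits. Optionally, add your user to the docker group so you can drop sudo from the commands below (you need to log out and back in for this to take effect):
sudo usermod -aG docker $USER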
With Ansible
You can use the following command to have Ansible install Docker for you:
sudo python2.7 -m pip install ansible \
&& sudo ansible-galaxy install --force angstwad.docker_ubuntu \
&& echo '- hosts: all
  roles:
    - angstwad.docker_ubuntu
' > /tmp/docker_ubuntu.yml \
&& sudo ansible-playbook /tmp/docker_ubuntu.yml -c local -i 'localhost,'
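Once the playbook finishes, you can verify the installation the same way as in the manual route:
sudo docker run hello-world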
Setting up a cluster
Follow this post.
You will be able to run a local Spark cluster with four commands.
Quick overview:
mkdir spark_cluster; cd spark_cluster
echo 'version: "2"

services:
  master:
    image: singularities/spark
    command: start-spark master
    hostname: master
    ports:
      - "6066:6066"
      - "7070:7070"
      - "8080:8080"
      - "50070:50070"
  worker:
    image: singularities/spark
    command: start-spark worker master
    environment:
      SPARK_WORKER_CORES: 1
      SPARK_WORKER_MEMORY: 2g
    links:
      - master
' > docker-compose.yml
sudo docker-compose up -d
# sudo docker-compose scale worker=2
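Once the containers are up, a quick sanity check is to list the services and open an interactive Spark shell on the master. This sketch assumes the singularities/spark defaults, i.e. that spark-shell is on the PATH inside the container and the master listens at spark://master:7077:
sudo docker-compose ps
sudo docker exec -it $(sudo docker-compose ps -q master) spark-shell --master spark://master:7077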
Extending other images
With Docker you can build on top of someone else's image. For example, here I will extend the singularities/spark image, make my custom Spark configuration changes, and push the final version to my own Docker Hub repo.
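The interactive commit workflow below is the quickest route. Alternatively, the same extension can be written as a Dockerfile, which keeps the change rebuildable; this is only a sketch, and spark-defaults.conf plus the config path inside the image are assumptions:
# Dockerfile
FROM singularities/spark
# hypothetical: bake a custom Spark configuration into the image
COPY spark-defaults.conf /usr/local/spark/conf/spark-defaults.conf

# then build it under your own name
sudo docker build -t chaudhary/my-repo-name .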
Pushing your changes to Docker Hub
To create a fork of a base image (singularities/spark), these are the steps:
sudo docker run -it singularities/spark # Run the base image. This will open a shell
# Make your changes to the image in this container
sudo docker login --username=chaudhary --password=lol
sudo docker commit <container ID from docker ps> chaudhary/my-repo-name # Commit changes
sudo docker tag <image ID from docker images> chaudhary/my-repo-name # Tag for pull to work properly
sudo docker push chaudhary/my-repo-name
Now that you have pushed this image, you can start a new container from it as shown below:
sudo docker run -it chaudhary/my-repo-name
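Anyone with access to the repo (including you, on another machine) can now pull it straight from Docker Hub:
sudo docker pull chaudhary/my-repo-name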
Resources
For more information, read the official getting started guide.