When working in a multi-node environment such as a Spark/Hadoop cluster, Docker lowers the barrier to entry. By barrier to entry, I mean the need to keep an EMR cluster running constantly while you are still in the development phase. With Docker, you can quickly set up a 4-5 node cluster on a single machine and start coding your Spark job. You can read about what Docker is and why you would use it at these links.
- You can very easily version control your environment.
- The barrier to entry for working with clusters (Spark/Hadoop etc.) drops considerably. You no longer need EMR access, which has a cost associated with it, to run a cluster.
To install Docker, follow this official guide.
In short, the steps for Ubuntu are:
sudo apt-get update
sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
sudo apt-get update
sudo apt-get install docker-ce
sudo docker run hello-world
Alternatively, you can have Ansible install Docker for you with the following command:
sudo python2.7 -m pip install ansible \
 && sudo ansible-galaxy install --force angstwad.docker_ubuntu \
 && echo '- hosts: all
  roles:
  - angstwad.docker_ubuntu
' > /tmp/docker_ubuntu.yml \
 && sudo ansible-playbook /tmp/docker_ubuntu.yml -c local -i 'localhost,'
Setting up a cluster
Follow this post.
You will be able to run a local Spark cluster with four commands:
mkdir spark_cluster; cd spark_cluster
echo 'version: "2"

services:
  master:
    image: singularities/spark
    command: start-spark master
    hostname: master
    ports:
      - "6066:6066"
      - "7070:7070"
      - "8080:8080"
      - "50070:50070"
  worker:
    image: singularities/spark
    command: start-spark worker master
    environment:
      SPARK_WORKER_CORES: 1
      SPARK_WORKER_MEMORY: 2g
    links:
      - master
' > docker-compose.yml
sudo docker-compose up -d
# sudo docker-compose scale worker=2
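Because docker-compose up -d returns before the services inside the containers have finished starting, it can help to wait for the master before submitting anything. The following is a minimal sketch of a generic retry helper; the name wait_for and its argument order are my own choices for illustration and are not part of Docker or the singularities/spark image.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for <attempts> <delay_seconds> <command> [args...]
# (wait_for is a hypothetical helper name, not a Docker or Spark tool.)
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Run the command quietly; stop as soon as it succeeds.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, wait_for 30 2 curl -fsS http://localhost:8080 would poll port 8080, which the compose file publishes, for up to about a minute. This assumes the master's web UI is what listens on 8080, which is the Spark standalone default.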
Extending other images
With Docker you can build on top of someone else’s image. For example, here I will extend the singularities/spark image, make my custom Spark configuration changes, and push the final version to my own Docker Hub repo.
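One repeatable way to extend an image is a Dockerfile that starts FROM the base image. This is a different technique from the interactive commit flow used in this post, shown here only as a sketch. The helper name write_spark_dockerfile, the sample setting, and especially the destination path inside the image are assumptions; check where the image actually keeps its Spark configuration before relying on it.

```shell
# Write a Dockerfile (plus a sample config file) that layers a custom
# Spark config on top of the base image.
# Usage: write_spark_dockerfile <build_dir>
# (write_spark_dockerfile is a hypothetical helper name.)
write_spark_dockerfile() {
  build_dir="$1"
  mkdir -p "$build_dir"
  # Hypothetical setting, purely for illustration.
  printf 'spark.executor.memory 1g\n' > "$build_dir/spark-defaults.conf"
  cat > "$build_dir/Dockerfile" <<'EOF'
FROM singularities/spark
# ASSUMPTION: this destination path is a guess, not taken from the image docs.
COPY spark-defaults.conf /usr/local/spark/conf/spark-defaults.conf
EOF
}
```

After write_spark_dockerfile ./my-spark, running sudo docker build -t chaudhary/my-repo-name ./my-spark would build the extended image under that name. The advantage over editing a running container is that the Dockerfile can live in version control next to your job code.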
Pushing your changes to Docker Hub
To create a fork of a base repo (singularities/spark), follow these steps:
sudo docker run -it singularities/spark  # Run base repo. This will open a shell
# Make your changes to the image in this container
sudo docker login --username=chaudhary --password=lol
sudo docker commit <container ID from docker ps> chaudhary/my-repo-name  # Commit changes
sudo docker tag <image ID from docker images> chaudhary/my-repo-name  # Tag for pull to work properly
sudo docker push chaudhary/my-repo-name
Now that you have pushed this image, you can start a new container from this image as shown below:
sudo docker run -it chaudhary/my-repo-name
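If you repeat the commit-and-push steps often, they can be wrapped in a small function. The sketch below is an illustration rather than a required workflow: publish_container and the DOCKER override variable are names I made up, and the function assumes you have already logged in. Setting DOCKER=echo prints the commands instead of running them, which is handy for a dry run.

```shell
# Command used to invoke Docker; override it (e.g. DOCKER=echo) to preview
# the commands without running them. DOCKER is a hypothetical variable
# introduced for this sketch.
DOCKER="${DOCKER:-sudo docker}"

# Commit a container as an image under <repo>, then push that image.
# Usage: publish_container <container_id> <repo>
# (publish_container is a hypothetical helper name.)
publish_container() {
  container_id="$1"
  repo="$2"
  # $DOCKER is left unquoted on purpose so "sudo docker" splits into two words.
  $DOCKER commit "$container_id" "$repo" || return 1
  $DOCKER push "$repo"
}
```

A call would look like publish_container <container ID from docker ps> chaudhary/my-repo-name. For the login step, docker login also accepts --password-stdin, which reads the password from standard input and so keeps it out of your shell history.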
For more information, read the official getting started guide.