In this assignment, we will conduct a collaborative project testing certain theoretical hypotheses in Deep Learning. In particular, each of you will build your own personal SLURM cluster on Google Compute Engine (GCE) using elasticluster, and then run massive computational experiments using clusterjob. We will then collect and analyse all the results you generate and document our observations. Please follow the steps below to set up your cluster and run experiments. This document only covers setting up your cluster and testing that it works properly with GPUs. Once these steps are completed, you should conduct your experiments as assigned to you on Canvas. The details of the experiments will only be available via the Stanford Canvas website to students who are taking this course for credit.
- We would like to thank the Google Cloud Platform Education Grants Team for their generosity and kindness in providing the Stats285 course with a cloud computing grant.
- We would like to thank the ElastiCluster team, especially Dr. Riccardo Murri, for their help and collaboration on this project.
Please visit the frequently asked questions before you submit a question on our Google group.
Building your cluster on Google Cloud Platform
To create your own cluster on Google Compute Engine, you should take the following 4 steps:
- Setup Google Compute Engine
- Install Docker
- Create your cluster using dockerized ElastiCluster
- Test your cluster with ClusterJob
Part-1: Setup Google Compute Engine
- Claim your $200 Google Compute credit. Please note that you received two tickets ($50 + $150) from Google Cloud; check the comment section of the Canvas link for the $150 ticket. You will also get $300 of free credit from Google Cloud as a first-time user by setting up your Billing Account. However, this $300 allows only very limited CPU computation.
- Create a Google Project by visiting Manage resources (this may take some time; be patient). You can find your project ID there, which will be needed later.
- Visit the Google Credentials page and create your credentials:
  - select Create credentials
  - select OAuth client ID
  - select Configure consent screen
  - choose your project name and save
  - if prompted for Application Type, choose Other
  - choose a name for your application
- Once successful, the interface will show your client_id and client_secret. These values appear in the Credentials tab, and you may retrieve them at a later time by clicking on your application name (step 4).
- Enable Google Compute for your project by visiting Enable Compute Engine
- Enable Billing for your project by visiting Enable Billing
- Go to Metadata and add the contents of your `~/.ssh/id_rsa.pub` to SSH Keys on Google.
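If you do not already have an SSH key pair at `~/.ssh/id_rsa`, a minimal way to generate one (using the same file names that the elasticluster config later expects) is:

```shell
# Generate a 4096-bit RSA key pair with an empty passphrase
# (drop -N "" if you prefer to be prompted for a passphrase)
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ""

# Print the public key so you can paste it into the SSH Keys box on Google
cat ~/.ssh/id_rsa.pub
```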
If you fail to complete the Compute Engine, Billing, and SSH key steps above, your instances will not start and you will get errors. Make sure you enable these.
- Go to the quota page, choose your project, then click EDIT QUOTAS and request 8 NVIDIA K80 GPUs in `us-west1`. You will need this to use GPU accelerators; the default GPU quota is zero. For "justification", write "stats285".
If you are unable to choose the GPU service, then the billing account associated with your project is incorrect. In this case, change the billing account to one of the STATS285 credits you have received as shown in FAQ item 4.
**DO NOT REQUEST MORE THAN 8**; otherwise you will have to pay a $1500 deposit in advance.
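Once your request is approved, you can sanity-check the GPU quota from the command line, assuming you have the `gcloud` CLI installed and authenticated (the exact quota metric name shown in the output may vary):

```shell
# Describe the us-west1 region; the output includes a quotas list
# with per-metric limits, including the GPU quota
gcloud compute regions describe us-west1
```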
For more info on obtaining your Google credentials, you may visit googlegenomics
Part-2: Install Docker
Docker containers provide an easy way for us to use elasticluster. In fact, we have already dockerized elasticluster for Stats285, so we will use this Docker image, which comes with elasticluster installed. To use this image on your personal computer, follow these steps:
- Visit Docker Website and install it for your operating system
- Once the Docker installation is complete, check it by searching Docker repositories for `stats285`:

```
$ docker search stats285
NAME                         DESCRIPTION                                   STARS   OFFICIAL   AUTOMATED
stats285/elasticluster       Dockerized elasticluster for Stanford cour... 0
stats285/elasticluster-gpu   Dockerized elasticluster with GPU function... 0                  [OK]
```
We will be using `stats285/elasticluster-gpu`, which is the GPU-enabled version of elasticluster and can be found on Docker Hub.
Go ahead and pull the `stats285/elasticluster-gpu` image to your local machine (laptop); we will be using it in Part-3.
$ docker image pull stats285/elasticluster-gpu
You should now be able to see the image downloaded to your machine:
```
$ docker images
REPOSITORY                   TAG      IMAGE ID       CREATED          SIZE
stats285/elasticluster-gpu   latest   39e63c2b22d2   8 minutes ago    551MB
```
- For more Docker commands, visit the Docker tutorial.
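As a quick sanity check that the Docker daemon itself is working, you can run Docker's standard `hello-world` test image:

```shell
# Pull and run Docker's minimal test image;
# it prints a greeting message and exits
docker run --rm hello-world
```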
Part-3: Create your cluster using ElastiCluster
In this part, you will make a container out of the image you pulled in Part-2. This container comes with elasticluster installed for easy use. Follow the steps below to launch your own cluster.
- Create a Docker container from the `stats285/elasticluster-gpu` image, mounting your SSH keys into it:

```
docker run -v ~/.ssh:/root/.ssh -P -it stats285/elasticluster-gpu
```
- Change the contents of the elasticluster config file `~/.elasticluster/config` to reflect your own credentials and choice of resources:
  - retrieve your `project_id` as explained above
  - retrieve your `client_id` and `client_secret` as explained above
  - update `image_user` with your Gmail ID; do not include @gmail.com (e.g., email@example.com becomes `monajemi`)
```
# Elasticluster Configuration Template
# ====================================
# Author: Hatef Monajemi (July 18)
# Stats285 Stanford

# Create a cloud provider (call it "google-cloud")
[cloud/google]
provider=google
noauth_local_webserver=True
gce_client_id=<CLIENT>
gce_client_secret=<SECRET>
gce_project_id=<PROJECT>
zone=us-west1-b

[login/google]
# Do not include @gmail (example: email@example.com -> monajemi)
image_user=<GMAIL_ID>
image_user_sudo=root
image_sudo=True
user_key_name=elasticluster
user_key_private=~/.ssh/id_rsa
user_key_public=~/.ssh/id_rsa.pub

[setup/ansible-slurm]
provider=ansible
frontend_groups=slurm_master
compute_groups=slurm_worker

[cluster/gce]
cloud=google
login=google
setup=ansible-slurm
security_group=default
image_id=ubuntu-1604-xenial-v20171107b
frontend_nodes=1
compute_nodes=2
ssh_to=frontend
# Ask for 500G of disk
boot_disk_type=pd-standard
boot_disk_size=500

[cluster/gce/frontend]
flavor=n1-standard-8

# add 1x GPUs (NVidia Tesla K80) to the compute nodes
# note that as of Nov. 2017, GPU-enabled VMs are available only in few zones
# use `gcloud compute accelerator-types list` to see what is available
[cluster/gce/compute]
flavor=n1-standard-8
accelerator_count=1
accelerator_type=nvidia-tesla-k80
```
`gcloud` provides useful commands to see the available options. For example,

```
gcloud compute machine-types list --zones us-west1-a
```

lists all the machine types that are available in zone us-west1-a (this information can also be found online on Google), and

```
gcloud compute images list
```

lists all the available images.
- Start your cluster (this step takes 10-60 minutes, depending on the number of nodes you request):
elasticluster -vvvv start gce
If you run into an error and are asked to run the setup again, please do so using:
elasticluster -vvvv setup gce
You can also monitor the progress on the Google Cloud Console.
If everything goes well, you will see `your cluster is ready!`. This is perhaps the moment you should shout Yay! and congratulate yourself: you now have your own cluster!
- Get the IP addresses of your cluster nodes:
elasticluster list-nodes gce
- Login to your cluster to test it
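elasticluster provides an `ssh` subcommand that logs you into the frontend node; once logged in, the standard SLURM command `sinfo` should list your compute nodes. A minimal check might look like:

```shell
# SSH into the frontend node of the cluster named "gce"
elasticluster ssh gce

# ...then, on the frontend node, verify that SLURM sees the compute nodes:
sinfo
```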
- To destroy your cluster:
elasticluster stop gce
Note that this command will destroy your cluster and you lose all the data on it. Make sure you get your data to a safe storage place before you destroy your cluster.
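One way to copy results off the frontend node before destroying the cluster is elasticluster's `sftp` subcommand (the directory names below are illustrative; adjust them to where your data actually lives):

```shell
# Open an SFTP session to the frontend node and fetch a results directory
elasticluster sftp gce
# sftp> get -r CJRepo_Remote local_backup
```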
Part-4: Test your cluster with ClusterJob
After you have launched your cluster successfully, it is time to test it by running a small job using ClusterJob on it. Follow the instructions below to test your cluster:
Add your cluster info to `~/CJ_install/ssh_config`. Here is an example:
```
[gce]
Host       126.96.36.199
User       hatefmonajemi
Bqs        SLURM
Repo       /home/hatefmonajemi/CJRepo_Remote
MAT        ""
MATlib     ""
Python     python3.4
Pythonlib  pytorch:torchvision:cuda80:pandas:matplotlib:-c soumith
[gce]
```
`Host` is the IP address of your frontend node (e.g., `126.96.36.199` in the example above).
- Run `simpleExample.py` on your cluster:
```
# update cj
cj update

# install conda
cj install miniconda gce

# test CJ
cj run simpleExample.py gce -m "Python on CPU test"
cj state
cj ls
```
- Run `mnist.py` on your cluster using a GPU:
```
cj run mnist.py gce -alloc "--gres=gpu:1" -m "Pytorch on GPU test"
cj state

# get a summary of all jobs on your cluster
cj summary gce
```
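If you want to verify directly that a GPU is visible on a compute node, you can log in to the frontend (`elasticluster ssh gce`) and submit a one-off SLURM job. The PyTorch check below assumes the `pytorch` package installed in the step above:

```shell
# Run nvidia-smi on a compute node with one GPU allocated;
# it should list a Tesla K80
srun --gres=gpu:1 nvidia-smi

# Check that PyTorch can see the GPU (prints True if CUDA is available)
srun --gres=gpu:1 python -c "import torch; print(torch.cuda.is_available())"
```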
If everything makes sense, move on to running your assigned Deep Learning experiments.