
Using the Alpha Pattern, Machine Learning, and ClusterJob: A Tutorial

Current research projects use the Alpha pattern described by Vardan Papyan in Lecture 03, "Prevalence of neural collapse during the terminal phase of deep learning training". We will use a simplified version of that pattern in this tutorial.

Overview:

We are going to do the following:

- Use ElastiCluster to build a small SLURM cluster on Google Compute Engine, choosing a GPU-capable machine type so a GPU can be attached later.
- Point ClusterJob (CJ) at the new cluster, with a software stack trimmed to Pandas, PyTorch, and TorchVision.
- Launch the training run with cj parrun, then reduce and retrieve the results.

ElastiCluster for CPU and ML:

First, our friends at Google do not grant GPU access to free-tier accounts; you have to request a quota increase. You should also read a bit about how GPUs work on GCE and how to create VMs with attached GPUs. Because this adds complexity, we want to start with a simple problem and then move up the complexity ladder: we will first run Alpha on the CPU alone, and then modify the configuration to run on the GPU. Because these are image-manipulation tasks, we need a large amount of memory, so we will use the GPU-capable n1-highmem-4 machine type with 4 vCPUs and 6.5 GB of memory per vCPU. We will be able to attach a GPU later.
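Before building anything, it is worth confirming that your project actually has GPU quota in the region you plan to use. A minimal check from the gcloud CLI (assuming the us-central1 region; adjust to your own):

# List the region's quotas; GPU entries have metrics like NVIDIA_K80_GPUS.
gcloud compute regions describe us-central1 | grep -B 1 -A 1 GPUS

If the limit is zero, request an increase through the Cloud Console quotas page before proceeding. With quota in hand, add the following cluster template to your ElastiCluster configuration (by default ~/.elasticluster/config):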

[cluster/gce-gpu]
cloud=google
login=google
setup=ansible-slurm
security_group=default
image_id=ubuntu-1804-bionic-v20210315a
flavor=n1-standard-2
frontend_nodes=1
compute_nodes=2
ssh_to=frontend
boot_disk_type=pd-standard
boot_disk_size=100

[cluster/gce-gpu/compute]
flavor=n1-highmem-4
boot_disk_size=50
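With the template in place, the cluster is created with ElastiCluster; a minimal sketch, assuming the template above was saved under the name gce-gpu:

# Provision one frontend and two compute nodes on GCE.
elasticluster start gce-gpu

# Verify that all three nodes came up.
elasticluster list-nodes gce-gpu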

After creating the cluster, do not forget to capture the current SSH keys with the gcloud compute config-ssh command.
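This refreshes the per-instance Host aliases in your local ~/.ssh/config:

# Write Host aliases of the form <instance>.<zone>.<project>
# for every running instance in the project.
gcloud compute config-ssh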

ClusterJob for CPU and ML:

Because we are going to use a GPU, CJ needs to know about the new machine, and its software stack must be updated to include PyTorch for the GPU. Add the following to your ~/CJ_install/ssh_config file, remembering to replace the strings <REPLACE_WITH_YOUR_GCE_PROJECT_ID> and <REPLACE_WITH_YOUR_USERNAME> with the appropriate values from your other configurations. Because some systems, such as Stanford's Sherlock, impose size limits per node, we have also trimmed the installed software to just what the experiment needs: Pandas, PyTorch, and TorchVision. If, as is likely, you change zones to find a GPU at the right price, be sure to set the host name accordingly; for us, us-central1-a became us-central1-c.

[gce-gpu]
Host        gce-gpu-frontend001.us-central1-a.<REPLACE_WITH_YOUR_GCE_PROJECT_ID>
User        <REPLACE_WITH_YOUR_USERNAME>
Bqs         SLURM
Repo        /home/<REPLACE_WITH_YOUR_USERNAME>/CJRepo_Remote
MAT         matlab/R2019a
MATlib      ~/cvx:~/mosek/9.2/toolbox/r2015a:~/yalmip:/share/software/modules/math/gurobi
Python      python/3.8.8
Pythonlib   pandas:requests:pytorch:torchvision:matplotlib
R           R
Rlib        ggplot2
Alloc       --time UNLIMITED
[gce-gpu]
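Before handing the machine to CJ, it is worth confirming that the alias written by gcloud compute config-ssh resolves; a quick check, with the placeholder filled in and the zone adjusted to wherever your cluster actually lives:

# Should print the frontend's hostname without a password prompt.
ssh gce-gpu-frontend001.us-central1-c.<REPLACE_WITH_YOUR_GCE_PROJECT_ID> hostname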

Launching the training run on the cluster uses this command:

cj parrun train.py gce-gpu -dep . -m "Training"
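While the job runs, its progress can be followed on the frontend; one simple check is to query the SLURM queue over the same SSH alias (hostname and zone as configured above):

# Show pending and running jobs on the cluster.
ssh gce-gpu-frontend001.us-central1-c.<REPLACE_WITH_YOUR_GCE_PROJECT_ID> squeue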

When the runs are done, we will reduce the results and then retrieve them:

cj state     # check whether the parallel runs have finished
cj reduce    # collapse the per-run outputs into a single result
cj get       # download the reduced results to the local machine