Configuring new VM on Google Cloud TPU
This is my guide for creating a new VM on the Google Cloud, connecting via SSH, and attaching a persistent disk.
Google TPU Research Cloud
If you’re starting your journey with cloud computing on Google Cloud, I strongly recommend applying to the Google TPU Research Cloud. Not only will you get free access to powerful machines, but you’ll also receive free credits to get started. Personally, I was granted 1,316.12 PLN (PLN - Polish zloty, which is equivalent to more than $300) for 70 days.
Moreover, I got unlimited access for 30 days to the following machines:
- 100 preemptible Cloud TPU v2-8 device(s) in zone us-central1-f
- 5 on-demand Cloud TPU v2-8 device(s) in zone us-central1-f
- 5 on-demand Cloud TPU v3-8 device(s) in zone europe-west4-a
You can apply here
If Google accepts your request, we can get started with cloud configuration.
To start, let’s create a new, clean conda environment named gcloud
and specify we want Python version 3.10.
conda create -n gcloud python=3.10
conda activate gcloud
Instructions
1. Install gcloud
- Documentation
- Download the zip file listed in the documentation to your local machine
- Extract the downloaded zip file, navigate to its directory, then run the script to lauch interactive installer
./google-cloud-sdk/install.sh
. It will ask you to:- Log in to your Google account
- Select your project name
- Set up your region (mine is
europe-west4-a
)
2. Set up environment:
- Documentation
- Make sure to enable the Cloud TPU API
- Enable TPU service account
- Documentation
- Create your account with role
- TPU Admin
- Storage Admin: Needed for accessing Cloud Storage
- Logs Writer: Needed for writing logs with the Logging API
- Monitoring Metric Writer: Needed for writing metrics to Cloud Monitoring
3. Create a new TPU
- Option 1. via
gcloud CLI
- Use
gcloud compute tpus tpu-vm
- To create new TPU VM v3-8 in europe-west4-a
gcloud compute tpus tpu-vm create your-machine-name --zone=europe-west4-a --accelerator-type=v3-8 --version=tpu-vm-pt-2.0
- To create v2-8 TPU in us-central1-f
gcloud compute tpus tpu-vm create your-maine-name --zone=us-central1-f --accelerator-type=v2-8 --version=tpu-vm-pt-2.0
- Keep in mind that creating a new TPU might require some patience during peak hours on the cloud. While creating my VM, I frequently encountered errors like this:
ERROR: (gcloud.compute.tpus.tpu-vm.create) { "code": 8, "message": "There is no more capacity in the zone \"europe-west4-a\"; you can try in another zone where Cloud TPU Nodes are offered (see https://cloud.google.com/tpu/docs/regions) [EID: 0x1a50fbb229537bb]" }
- Use
- Option 2. via web platform using
cloud.google.com
- I don’t recommend it as I was getting all the time “unknown error” without any meaningful information
- Instructions
- Fill the form:
- Name: your-machine-name
- Zone: europe-west4-a
- TPU settings: TPU vm architecture
- TPU type: v3-8
- TPU software version: tpu-vm-pt-2.0 (for pytorch 2.0)
- You can read more about software versions here
4. Connect over SSH:
- If creation was successful, you can connect to your machine via SSH
gcloud compute tpus tpu-vm ssh your-machine-name --zone=europe-west4-a
5. Create persistent disks
- Documentation
- By default, VMs come with only 100 GB disk space, so you’ll probably want to extend it. You can do it by creating persistent disk
gcloud compute disks create your-disk-name --size 200 --zone europe-west4-a --type pd-balanced
- This command will create new disk in your Google Cloud project named
sourceDisk: projects/yourprojectname/zones/europe-west4-a/disks/your-disk-name
- Be careful; creating a new disk is not free operation, but if you’re a new Google Cloud user, you should have your free credits. You can read more about pricing here: disks-image-pricing
- This command will create new disk in your Google Cloud project named
6. Attach disk to VM
- Again, using the well-known
gcloud
CLIgcloud alpha compute tpus tpu-vm attach-disk pawai-eu-1 --zone=europe-west4-a --disk=pawai-eu-disk-1 --mode=read-write
7. Mount disk
- Documentation
- After attaching disk image to our machine we can log in over SSH to our machine (step 4.) and mount it so we can use it
- Enter these commands on your VM
sudo mount -o discard,defaults /dev/sdb /mnt/disks/persist sudo chmod a+w /mnt/disks/persist
- If done right, you should see the message.
MOUNTED TO: /mnt/disks/persist
- After that you can happily navigate to your storage
cd /mnt/disks/persist/
Useful commands on fresh VM
After following the steps above, we end up with a fresh Linux OS, but to feel really at home, we need to configure a little bit this naked system. Below I will share my cheatsheet with command that I’m executing on a fresh VM.
Conda
- Installing Conda and creating new virtual environment
- Documentation: https://docs.conda.io/projects/miniconda/en/latest/
mkdir -p ~/miniconda3 wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3 rm -rf ~/miniconda3/miniconda.sh ~/miniconda3/bin/conda init bash conda create --name tpu python=3.10 conda activate tpu
Diffusers
- Install Hugging Face Diffusers
- Documentation
pip install diffusers["torch"] transformers conda install -c conda-forge diffusers
I had an error, but i resolved it with installing exact version of
huggingface_hub
`pip install huggingface_hub==0.18
PyTorch
- Install PyTorch for Google TPU
- Documentation
pip install torch~=2.1.0 torch_xla[tpu]~=2.1.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
- Ensure that the PyTorch/XLA runtime uses the TPU.
export PJRT_DEVICE=TPU
- Note: you can also add this line at the end of
~/.bashrc
, otherwise this export will work only for single session
Jax
- Install JAX support
- Documentation
pip install jax[tpu] -f https://storage.googleapis.com/jax-releases/libtpu_releases.html pip install flax
Git
- Setup SSH key
- Documentation
Remember to replace ‘your@mail.com’ with your actual github email address when setting up SSH keys.
ssh-keygen -t ed25519 -C "your@mail.com" eval "$(ssh-agent -s)" ssh-add ~/.ssh/id_ed25519 cat ~/.ssh/id_ed25519.pub
Hugging Face
- Install Large File Support for Git and clone Stable Diffusion XL model
- Documentation
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash sudo apt-get install git-lfs git clone https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
If you are gettting all the time “There is no more capacity in the zone” during TPU creation, then you can add queued-resources with this command
gcloud alpha compute tpus queued-resources create yout-queued-resource \
--node-id your-tpu-name \
--project your-project-name \
--zone europe-west4-a \
--accelerator-type v3-8 \
--runtime-version tpu-vm-pt-2.0
Then you can check the status with
gcloud alpha compute tpus queued-resources list --project your-project-name --zone europe-west4-a