This is my guide to creating a new VM on Google Cloud, connecting to it via SSH, and attaching a persistent disk.

Google TPU Research Cloud

If you’re starting your journey with cloud computing on Google Cloud, I strongly recommend applying to the Google TPU Research Cloud. Not only will you get free access to powerful machines, but you’ll also receive free credits to get started. Personally, I was granted 1,316.12 PLN (PLN is the Polish zloty; that’s more than $300) for 70 days.

Moreover, I got unlimited access for 30 days to the following machines:

  • 100 preemptible Cloud TPU v2-8 device(s) in zone us-central1-f
  • 5 on-demand Cloud TPU v2-8 device(s) in zone us-central1-f
  • 5 on-demand Cloud TPU v3-8 device(s) in zone europe-west4-a

You can apply here

Once Google accepts your request, you can get started with the cloud configuration.

To start, let’s create a new, clean conda environment named gcloud and specify we want Python version 3.10.

conda create -n gcloud python=3.10
conda activate gcloud

Instructions

1. Install gcloud

  • Documentation
  • Download the zip file listed in the documentation to your local machine
  • Extract the downloaded archive, navigate to its directory, then run ./google-cloud-sdk/install.sh to launch the interactive installer (see the sketch after this list). It will ask you to:
    • Log in to your Google account
    • Select your project name
    • Set up your region (mine is europe-west4-a)
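  • For reference, the whole install on Linux looks roughly like this. This is only a sketch: the archive name/URL may change, so always take the current link from the documentation above; gcloud init is the step that asks about your account, project, and region
      curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
      tar -xf google-cloud-cli-linux-x86_64.tar.gz
      ./google-cloud-sdk/install.sh
      ./google-cloud-sdk/bin/gcloud init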

2. Set up environment:

  • Documentation
  • Make sure to enable the Cloud TPU API
  • Enable TPU service account
    • Documentation
    • Create your service account with the following roles (see the sketch after this list):
      • TPU Admin
      • Storage Admin: Needed for accessing Cloud Storage
      • Logs Writer: Needed for writing logs with the Logging API
      • Monitoring Metric Writer: Needed for writing metrics to Cloud Monitoring
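    • For reference, the API can also be enabled and the roles granted from the CLI. A sketch only: your-tpu-sa and your-project-name are placeholders, and the role IDs correspond to the roles listed above
        gcloud services enable tpu.googleapis.com
        gcloud iam service-accounts create your-tpu-sa
        # repeat the binding for roles/tpu.admin, roles/storage.admin, roles/logging.logWriter, roles/monitoring.metricWriter
        gcloud projects add-iam-policy-binding your-project-name \
            --member="serviceAccount:your-tpu-sa@your-project-name.iam.gserviceaccount.com" \
            --role="roles/tpu.admin"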

3. Create a new TPU

  • Option 1. via gcloud CLI
    • Use gcloud compute tpus tpu-vm
    • To create a new v3-8 TPU VM in europe-west4-a
        gcloud compute tpus tpu-vm create your-machine-name --zone=europe-west4-a --accelerator-type=v3-8 --version=tpu-vm-pt-2.0
      
    • To create a v2-8 TPU in us-central1-f
        gcloud compute tpus tpu-vm create your-machine-name --zone=us-central1-f --accelerator-type=v2-8 --version=tpu-vm-pt-2.0
      
    • Keep in mind that creating a new TPU might require some patience during peak hours on the cloud. While creating my VM, I frequently encountered errors like this:
        ERROR: (gcloud.compute.tpus.tpu-vm.create) {
            "code": 8,
            "message": "There is no more capacity in the zone \"europe-west4-a\"; you can try in another zone where Cloud TPU Nodes are offered (see https://cloud.google.com/tpu/docs/regions) [EID: 0x1a50fbb229537bb]"
        }
      
  • Option 2. via web platform using cloud.google.com
    • I don’t recommend it, as I kept getting an “unknown error” without any meaningful information
    • Instructions
    • Fill in the form:
      • Name: your-machine-name
      • Zone: europe-west4-a
      • TPU settings: TPU VM architecture
      • TPU type: v3-8
      • TPU software version: tpu-vm-pt-2.0 (for PyTorch 2.0)
        • You can read more about software versions here
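  • Whichever option you used, you can confirm the TPU exists and check its state (assuming the name and zone used above)
      gcloud compute tpus tpu-vm list --zone=europe-west4-a
      gcloud compute tpus tpu-vm describe your-machine-name --zone=europe-west4-a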

4. Connect over SSH:

  • If creation was successful, you can connect to your machine via SSH
      gcloud compute tpus tpu-vm ssh your-machine-name --zone=europe-west4-a
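  • You can copy files to and from the VM in a similar way; a small example (train.py is just a placeholder for any local file)
      gcloud compute tpus tpu-vm scp train.py your-machine-name:~/ --zone=europe-west4-a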
    

5. Create persistent disks

  • Documentation
  • By default, VMs come with only 100 GB of disk space, so you’ll probably want to extend it. You can do that by creating a persistent disk
      gcloud compute disks create your-disk-name --size 200 --zone europe-west4-a --type pd-balanced
    
    • This command will create a new disk in your Google Cloud project, referenced as sourceDisk: projects/yourprojectname/zones/europe-west4-a/disks/your-disk-name
    • Be careful; creating a new disk is not a free operation, but if you’re a new Google Cloud user, you should have your free credits. You can read more about pricing here: disks-image-pricing
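    • You can confirm the disk was created by listing the disks in your project
        gcloud compute disks list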

6. Attach disk to VM

  • Again, using the well-known gcloud CLI (here pawai-eu-1 is my VM name and pawai-eu-disk-1 is the disk created in the previous step; replace them with your own names)
      gcloud alpha compute tpus tpu-vm attach-disk pawai-eu-1 --zone=europe-west4-a --disk=pawai-eu-disk-1 --mode=read-write
    

7. Mount disk

  • Documentation
  • After attaching the disk to our machine, we can log in over SSH (step 4) and mount it so we can use it
  • Enter these commands on your VM (the mount point has to exist before mounting)
    sudo mkdir -p /mnt/disks/persist
    sudo mount -o discard,defaults /dev/sdb /mnt/disks/persist
    sudo chmod a+w /mnt/disks/persist
    
  • If done right, the disk should now be mounted at /mnt/disks/persist
  • After that you can happily navigate to your storage
    cd /mnt/disks/persist/
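  • Note: a brand-new persistent disk has no filesystem yet, so the mount command above will fail until the disk is formatted. A sketch, assuming the disk shows up as /dev/sdb (check the device name with lsblk first; formatting erases everything on the disk)
    sudo lsblk
    sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/sdb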
    

Useful commands on fresh VM

After following the steps above, we end up with a fresh Linux OS, but to feel really at home, we need to configure this bare system a little. Below I share my cheat sheet of commands that I run on a fresh VM.

Conda

  • Install Conda and create a new virtual environment
  • Documentation: https://docs.conda.io/projects/miniconda/en/latest/
    mkdir -p ~/miniconda3
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
    bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
    rm -rf ~/miniconda3/miniconda.sh
    ~/miniconda3/bin/conda init bash
    source ~/.bashrc
    conda create --name tpu python=3.10
    conda activate tpu
    

Diffusers

  • Install Hugging Face Diffusers
  • Documentation
    pip install diffusers["torch"] transformers
    conda install -c conda-forge diffusers
    

    I had an error, but I resolved it by installing an exact version of huggingface_hub:

    pip install huggingface_hub==0.18
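  • A quick optional check that the imports work (it just prints the installed versions)
    python -c "import diffusers, transformers; print(diffusers.__version__, transformers.__version__)"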
    

PyTorch

  • Install PyTorch for Google TPU
  • Documentation
    pip install torch~=2.1.0 torch_xla[tpu]~=2.1.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
    
  • Ensure that the PyTorch/XLA runtime uses the TPU.
    export PJRT_DEVICE=TPU
    
  • Note: you can also add this line at the end of ~/.bashrc; otherwise the export only applies to the current session
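  • A quick optional sanity check that PyTorch/XLA can see the TPU; with PJRT_DEVICE=TPU set, it should print an XLA device such as xla:0
    python -c "import torch_xla.core.xla_model as xm; print(xm.xla_device())"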

JAX

  • Install JAX support
  • Documentation
    pip install jax[tpu] -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
    pip install flax
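  • A quick optional check that JAX sees the TPU cores; on a v3-8 it should report 8 devices
    python -c "import jax; print(jax.device_count()); print(jax.devices())"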
    

Git

  • Set up an SSH key
  • Documentation. Remember to replace ‘your@mail.com’ with your actual GitHub email address when setting up SSH keys.
    ssh-keygen -t ed25519 -C "your@mail.com"
    eval "$(ssh-agent -s)"
    ssh-add ~/.ssh/id_ed25519
    cat ~/.ssh/id_ed25519.pub
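  • After adding the printed public key to your GitHub account, you can optionally test the connection; GitHub should greet you by username
    ssh -T git@github.com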
    

Hugging Face

  • Install Large File Support for Git and clone the Stable Diffusion XL model
  • Documentation
    curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
    sudo apt-get install git-lfs
    git lfs install
    git clone https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
    

If you keep getting “There is no more capacity in the zone” errors during TPU creation, you can instead queue a request with queued-resources using this command

gcloud alpha compute tpus queued-resources create your-queued-resource \
--node-id your-tpu-name \
--project your-project-name \
--zone europe-west4-a \
--accelerator-type v3-8 \
--runtime-version tpu-vm-pt-2.0

Then you can check the status with

gcloud alpha compute tpus queued-resources list --project your-project-name --zone europe-west4-a
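
Once the queued resource has been provisioned (or you no longer need it), you can clean it up; a sketch assuming the same placeholder names as above

gcloud alpha compute tpus queued-resources delete your-queued-resource --project your-project-name --zone europe-west4-a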