Creating a Jupyter notebook server on the cloud in 2022

Utkarsh Goel
Dec 21, 2021

Data scientists like you have probably already heard of Jupyter. It's simple to use, doesn't take long to learn, and significantly improves your working efficiency. Nowadays, you can easily find online services that provide Jupyter notebooks in your browser for data science and machine learning projects; Google Colab is a popular example.

However, you might be one of those developers or data scientists for whom Google Colab is not enough, for whatever reason. In that case, you have to turn to cloud services or other platforms that can do the same job as Google Colab with more capability: faster GPUs, more memory, or more CPU.

Now, if you’re one of those who like to create their own VM instead of a managed service, this is probably going to be a treat to read. If not, read anyway! You could possibly learn to do this yourself, and it might just be fun. If you don’t wanna do this yourself, at the end of this article, I will give you a much easier method, using an online platform(that I have built) called Deploifai. More about it later!

Choosing a cloud provider

There are multiple options: AWS, Azure, GCP, etc. They all provide GPU-based virtual machines that can run your Jupyter notebook server. We are going to use AWS for demo purposes; the process on the other providers is broadly similar. Let's look into it.

Create a VM

We start by creating an EC2 instance. Search for EC2 > Create instance.

AMI

The first step to creating a VM is to choose an image (AMI) that has the operating system and perhaps some pre-installed software.

We will just choose an Ubuntu Server 20.04 AMI.
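If you prefer the command line, you could also look up the latest official Ubuntu 20.04 image with the AWS CLI. This is just a sketch, assuming you have the CLI configured; 099720109477 is Canonical's publisher account ID on AWS.

# find the newest official Ubuntu 20.04 AMI in the current region
aws ec2 describe-images \
  --owners 099720109477 \
  --filters "Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*" \
  --query "sort_by(Images, &CreationDate)[-1].ImageId" \
  --output text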

Instance type

Next, we need to select an instance type. This throws a lot of people off. To choose an instance type, you should have your objective in mind. We want to train machine learning models in Jupyter notebooks, so I will use a P3 instance. P4 instances are much more powerful, but I probably don't need that much compute, while a P2 might offer too little performance for my needs.

Configure your instance

If you have never configured networking before, this step might be confusing, so I will try to keep things simple. If you don't want to bother, you can leave everything at the defaults, but I will go through a few options in detail.

Starting with the Network section, we have something called a VPC, or Virtual Private Cloud. This is like setting up a personal space for your resources to live in. You can decide which resources in this space can talk to the internet and which cannot, choose how the internet can reach specific resources, and more. You can read more about VPCs in the AWS documentation: https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html. I am just going to leave it at the default.

Next up is subnets. Subnets are like smaller sections within this personal space. The analogy doesn't hold completely, but it helps in visualising the concept: you can group resources in your VPC into publicly and privately accessible sections. You can find more information about subnets here: https://www.cloudflare.com/en-gb/learning/network-layer/what-is-a-subnet/. I am going to leave this at the default too.

Finally, we should auto-assign a public IP since we want to access the Jupyter notebook on the server. Choose “Enable” to do so.

Add storage

We need to add storage to our server if we have large datasets. Where else will we store all the images, videos, and other data that we need to train a model on?

This storage costs money. One of the scariest parts of AWS is the cost that gets hidden in all of these additional add-ons.

Configure Security Groups

A security group does what its name says: it provides your server with security. You can limit what kinds of connections are allowed into the server using the rules in a security group: port 22 for SSH, port 80 for HTTP, and so on. This is a great added layer of security that helps keep any resource safe.

By default, this is set up to work with SSH. We will need it to access the Jupyter notebook using an SSH tunnel. (Don’t know what SSH tunnelling is? Find it out later in an upcoming post).

We see a warning that essentially says the server is open to the whole internet. It is possible to restrict access to specific IP addresses, but that is quite painful to manage. For ease of use, we can leave the defaults for now.
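For the curious, the default SSH rule the console sets up corresponds roughly to this AWS CLI call; the group ID is a placeholder for your own, and the 0.0.0.0/0 CIDR is exactly what makes the server reachable from anywhere.

# allow inbound SSH (port 22) from any IPv4 address
aws ec2 authorize-security-group-ingress \
  --group-id <security-group-id> \
  --protocol tcp \
  --port 22 \
  --cidr 0.0.0.0/0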

Review and Launch

Finally, we are ready to launch! We will need an SSH key to connect to this server. This key helps us authenticate an SSH connection without needing a password.

Create a new key and download it. Once it's downloaded, we make sure only our own user can read it. You can do that using chmod 400 <key>.pem.

Finally, we are good to go! Launch the server and connect to it via SSH so that we can get the real work started.
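As a quick sketch, assuming the key file is called <key>.pem and the EC2 dashboard shows your instance's public IP, the first connection looks like this (ubuntu is the default user on Ubuntu AMIs):

# lock the key down so ssh will accept it
chmod 400 <key>.pem
# connect to the instance
ssh -i <key>.pem ubuntu@<public-ip>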

Setting up JupyterHub

This is rather simple to do. There are plenty of tutorials for setting up Jupyter notebooks on an Ubuntu machine, and I probably couldn't contribute anything new in that regard. But I can show you a trick I have up my sleeve for making things easy.

Learn to install JupyterHub here:

First install pre-requisites: https://jupyterhub.readthedocs.io/en/stable/quickstart.html#prerequisites

Then the installation: https://jupyterhub.readthedocs.io/en/stable/quickstart.html#installation
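Condensed from those two pages, the installation looks roughly like this. Treat it as a sketch and defer to the official docs if anything differs:

# prerequisites: Python 3 with pip, plus nodejs/npm for the proxy
sudo apt update
sudo apt install -y python3-pip nodejs npm
# JupyterHub itself, and a notebook server for it to spawn
python3 -m pip install jupyterhub jupyterlab notebook
# the proxy that routes traffic to the notebook servers
sudo npm install -g configurable-http-proxy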

Once we have JupyterHub installed, we can run it as a service in the background, or as a terminal process. For this tutorial, we will use tmux to run it as a terminal process. The tmux session keeps running after we detach, hence the Jupyter server will keep running in the background. Here are the steps:

  • Install tmux using: sudo apt install tmux.
  • Start a tmux shell using tmux.
  • Run jupyterhub inside this shell: jupyterhub.

Then detach from the tmux session using the key sequence: Ctrl + B, then D. The session will keep running in the background.
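If you want to check on the server later, just reattach to the session:

# reattach to the most recent tmux session
tmux attach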

Access the Jupyter notebook

We can do so using an SSH tunnel. I cannot show screenshots of any of this without giving away personal information, so you will have to take my word for it.

To connect to a server and tunnel a specific port through the SSH connection, we do:

ssh -L <port>:localhost:<port> -i <key> ubuntu@<ip>

Fill in the values relevant to your setup. We will use port 8000, since that is the default for JupyterHub.

ssh -L 8000:localhost:8000 -i <key> ubuntu@<ip>

That will log us into a terminal, and our tunnel should be ready. Go to http://localhost:8000 to reach JupyterHub.
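If you would rather not keep that terminal open, standard OpenSSH flags let you push the tunnel itself into the background; this is just a variant of the command above:

# -f backgrounds ssh after authentication, -N skips running a remote command
ssh -fN -L 8000:localhost:8000 -i <key> ubuntu@<ip>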

Now the simple way

You can create a server that is ready to train machine learning models in under 10 minutes, without doing any of that setup. How, you ask? Try Deploifai: https://deploif.ai

It lets us create a machine learning project and a server that can train our machine learning model in 4 simple steps:

  1. Log in to Deploifai, using just your GitHub account.
  2. Connect your AWS account (or Azure).
  3. Create a machine learning project.
  4. Create a training server.


Full disclosure: I am a co-founder at Deploifai. It is a modern cloud platform that lets developers build data and AI solutions faster and better by automating infrastructure. The team is working on streamlining the experience for developers to make it the best platform for modern cloud tasks.

In the future, Deploifai will have CI/CD integrations so that code from GitHub can directly be used to train and deploy machine learning models. Join the community on Discord to stay up to date!

Thanks for reading!
