IT Enables Research: JupyterHub on Kubernetes

17 September, 2020

JupyterHub brings the power of notebooks to groups of users. It gives users access to computational environments and resources without burdening them with installation and maintenance tasks. Students, researchers, data scientists, and professors teaching classes can get their work done in their own workspaces on shared resources, which IT Research Computing manages efficiently on its production Kubernetes cluster.

JupyterHub makes it possible to serve a pre-configured data science environment to any user on KAUST campus. It is customizable, scalable, and suitable for small and large teams, academic courses, and large-scale infrastructure. 

JUPYTERHUB KEY FEATURES

Customizable - JupyterHub serves a variety of environments. It supports dozens of kernels through the Jupyter server and serves a variety of user interfaces, including the classic Jupyter Notebook, JupyterLab, and RStudio, with kernels for languages such as Python, R, and Julia.
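Curious which kernels our deployment exposes? You can check from inside any notebook session; here is a minimal sketch using the standard jupyter_client API (the kernels you actually see depend on the image we deploy):

    # List the kernels available in the current session.
    # The exact set depends on the image IT Research Computing deploys.
    from jupyter_client.kernelspec import KernelSpecManager

    specs = KernelSpecManager().get_all_specs()
    for name, info in specs.items():
        print(f"{name:12s} -> {info['spec']['display_name']}")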

Flexible - JupyterHub has KAUST Active Directory authentication enabled, so you can sign in with your KAUST Connect credentials.
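For readers who wonder how such an integration is wired up: JupyterHub authentication is pluggable and configured in an ordinary Python file (jupyterhub_config.py). The snippet below is only an illustrative sketch built on the community ldapauthenticator plugin; it is not our production configuration, and the server address and bind template are placeholders.

    # jupyterhub_config.py -- illustrative sketch only, not our production setup.
    # JupyterHub delegates login to a pluggable authenticator class; an LDAP /
    # Active Directory integration typically looks something like this.
    c.JupyterHub.authenticator_class = "ldapauthenticator.LDAPAuthenticator"
    c.LDAPAuthenticator.server_address = "ad.example.kaust.edu.sa"   # placeholder
    c.LDAPAuthenticator.bind_dn_template = [
        "uid={username},ou=people,dc=example,dc=sa",                 # placeholder
    ]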

Scalable - JupyterHub is container-friendly, making it possible to deploy it with modern container technology. IT Research Computing runs it on its production Kubernetes cluster so it can scale to hundreds of users.

Portable - JupyterHub is entirely open-source and designed to run on a variety of infrastructure, including commercial cloud providers, virtual machines, and even your own laptop.

IT RESEARCH COMPUTING JUPYTER NOTEBOOK OVERVIEW

Oh, man! You have read this far and still no link to JupyterHub. OK, we feel you; here is the link for the impatient: https://jupyter.kaust.edu.sa. This instance runs on IT Research Computing’s production Kubernetes cluster. Each user gets two cores and four GB of RAM. All Jupyter Notebooks have access to the following storage systems:

  • Noor Home, which gives you 200 GB of backed-up storage

  • DataWaha, which gives your group petabytes of backed-up storage

  • The Shaheen Lustre filesystem (available soon; we are working with Canonical to fix a bug)

  • 10 GB of persistent storage, where packages you installed in previous sessions remain available

Having all those storage systems accessible from a Jupyter Notebook is a cool thing. You can access your files without having to move them around campus; just process them where they lie. All of IT Research Computing’s infrastructure nodes are connected to these storage systems via 10G links (thanks to the formidable IT Networks Team). Massive amounts of bandwidth are available for reading and writing data thanks to these fast and reliable connections.
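In practice, working with these filesystems from a notebook is just ordinary file I/O. Here is a minimal sketch that reads a CSV with pandas and writes a processed copy back in place; the paths are placeholders, so check the file browser in your own session for the real mount points.

    # Process data where it lives: plain file I/O against the mounted filesystems.
    # The paths below are placeholders -- the real mount points are visible in
    # the file browser of your session.
    import os
    import pandas as pd

    data_dir = os.path.expanduser("~/my_project")    # e.g. somewhere in your Noor Home
    df = pd.read_csv(os.path.join(data_dir, "measurements.csv"))

    print(df.describe())    # quick summary, no copying data around campus
    df.to_hdf(os.path.join(data_dir, "measurements.h5"), key="measurements")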

IT RESEARCH COMPUTING JUPYTER NOTEBOOK FEATURES

IT Research Computing’s Jupyter Notebook image includes libraries for data analysis from the Julia, Python, and R communities, as well as TensorFlow; a short example using some of them follows the list.

  • The Julia compiler and base environment 

  • IJulia to support Julia code in Jupyter notebooks 

  • HDF5, Gadfly, and RDatasets packages 

  • dask, pandas, numexpr, matplotlib, scipy, seaborn, scikit-learn, scikit-image, sympy, cython, patsy, statsmodels, cloudpickle, dill, numba, bokeh, sqlalchemy, hdf5, vincent, beautifulsoup, protobuf, xlrd, bottleneck, and pytables packages

  • ipywidgets and ipympl for interactive visualizations and plots in Python notebooks 

  • Facets for visualizing machine learning datasets 

  • The R interpreter and base environment 

  • IRKernel to support R code in Jupyter notebooks

  • tidyverse packages, including ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, lubridate, and broom from conda-forge 

  • devtools, shiny, rmarkdown, forecast, rsqlite, nycflights13, caret, tidymodels, rcurl, and randomforest packages from conda-forge 

  • TeX Live for notebook document conversion 

  • git, emacs-nox, vim-tiny, jed, nano, tzdata, and unzip 

  • TensorFlow and Keras machine learning libraries 
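To give a feel for what works out of the box, here is a small end-to-end example that uses only packages from the list above (scikit-learn and matplotlib); nothing needs to be installed first.

    # Load a bundled dataset, fit a model, and plot the result -- using only
    # packages that ship in the image.
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression

    X, y = load_diabetes(return_X_y=True)
    model = LinearRegression().fit(X, y)

    plt.scatter(y, model.predict(X), s=10)
    plt.xlabel("observed")
    plt.ylabel("predicted")
    plt.title("Linear regression on the diabetes dataset")
    plt.show()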

ROADMAP

IT Research Computing is continuously looking for ways to improve its products and services. This means that we already have in mind what comes next, or at least where we would like to be.

Here is a non-exhaustive list of things we will be working on to improve JupyterHub for the KAUST community:

  • Allow users and/or research groups to spin up a Jupyter instance based on their own image. BinderHub will help in that regard. Stay tuned for the launch of this service.

  • Add GPU support for all Jupyter Notebooks. Two things are holding this back today:

      • Money (duh!), since we must buy more GPU cards to satisfy the demand.

      • The need to reconfigure our on-premises cloud to use virtual GPU technology from NVIDIA.

  • Add more Kubernetes workers dedicated to JupyterHub. 

  • Create an interface to KAUST clusters, i.e. run Jupyter workers on HPC nodes.

  • Add Kubeflow, a platform for data scientists who want to build and experiment with ML pipelines. Kubeflow is also for ML engineers and operational teams who want to deploy ML systems to various environments for development, testing, and production-level serving.

HOW DOES THIS BENEFIT MY RESEARCH?

Reading this far, you might have wondered what the bottom line is for you: how does JupyterHub help your research? Let us take a look.

All in one place – The Jupyter Notebook is a web-based interactive environment. It combines code, rich text, images, mathematical equations, plots, maps (and much more!) into one document. 
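As a tiny illustration, a single cell can carry computation, typeset mathematics, and a plot at once (numpy ships alongside the packages listed earlier):

    # One notebook cell mixing typeset math, computation, and a plot.
    import numpy as np
    import matplotlib.pyplot as plt
    from IPython.display import Math, display

    display(Math(r"f(x) = e^{-x^2/2}"))    # rendered LaTeX in the output area

    x = np.linspace(-4, 4, 200)
    plt.plot(x, np.exp(-x**2 / 2))
    plt.title("A Gaussian, right below its equation")
    plt.show()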

Easy to share & convert – You can share Jupyter Notebooks as JSON files, a structured text format. You can also use built-in tools to export notebooks as PDF or HTML. 
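Conversion is handled by nbconvert, which is part of the environment (the TeX Live installation listed earlier covers PDF output). Here is a sketch of exporting a notebook to HTML from Python; the filename is a placeholder.

    # Export a notebook to HTML with nbconvert's Python API.
    # "analysis.ipynb" is a placeholder -- use one of your own notebooks.
    import nbformat
    from nbconvert import HTMLExporter

    nb = nbformat.read("analysis.ipynb", as_version=4)
    body, _resources = HTMLExporter().from_notebook_node(nb)

    with open("analysis.html", "w", encoding="utf-8") as f:
        f.write(body)

The same conversion is also available from the notebook’s File menu and the jupyter nbconvert command line.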

Language independent – Jupyter was built with this concept in mind. The client runs in your browser and connects to kernels running any supported language.

Stress-free reproducible experiments – Jupyter Notebooks help you conduct efficient and reproducible interactive computing experiments and keep a detailed record of your work. The notebook’s ease of use makes good habits easy to adopt: do all your interactive work in notebooks, put them under version control, and commit regularly. Do not forget to refactor your code into independent, reusable components.
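That last step is simpler than it sounds; here is a sketch of the pattern, with hypothetical file names.

    # analysis_utils.py -- a hypothetical module kept next to your notebooks.
    # Code that has stabilised in a notebook moves here, where it can be
    # version-controlled, tested, and reused across experiments.
    import pandas as pd

    def load_clean(path: str) -> pd.DataFrame:
        """Load a CSV and drop incomplete rows."""
        return pd.read_csv(path).dropna()

    # The notebook cell then shrinks to:
    #   from analysis_utils import load_clean
    #   df = load_clean("measurements.csv")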

Effective teaching/learning tool – The Jupyter Notebook is not only a tool for scientific research and data analysis but also a great tool for teaching. You can share a GitLab repo with your students. You can interactively experiment during class. Infinite possibilities! 

GitOps@KAUST

IT Research Computing manages almost all of its products and services using GitOps principles, and JupyterHub is no different. We have automated GitLab pipelines that push changes to all Kubernetes workers. We also have alerting set up to monitor both Kubernetes and JupyterHub; our goal is to detect issues before users do. We try to avoid downtime, and that is one of the reasons JupyterHub runs on Kubernetes.
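As an illustration of the kind of external check such alerting can perform (this is not our actual monitoring code), JupyterHub exposes a public REST endpoint that reports its version whenever the hub is healthy:

    # A minimal external health probe -- an illustration only, not our
    # monitoring code. JupyterHub answers on /hub/api with its version
    # when the hub is up.
    import requests

    resp = requests.get("https://jupyter.kaust.edu.sa/hub/api", timeout=5)
    resp.raise_for_status()
    print("Hub is up, version", resp.json().get("version"))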

Run your Jupyter Notebooks at https://jupyter.kaust.edu.sa today!

KAUST Information Technology Department 

We make IT happen!