
Demystifying Databricks

Databricks is a cloud-based data platform - built on top of Apache Spark - designed to simplify and accelerate data engineering, data science, and machine learning workflows. It was founded by the creators of Apache Spark and offers a unified analytics platform on top of it. MLflow was also originally developed by Databricks and is now an open-source project under the Linux Foundation.

Databricks has two types of nodes:

  • Driver/master node: Runs the main application (your notebook or main() function), holds the SparkContext/SparkSession, constructs the DAG (Directed Acyclic Graph), schedules tasks, and coordinates the cluster
  • Worker node: Executes the actual tasks assigned by the driver. Each worker runs one or more executor processes that perform computations, store data partitions, handle shuffles, and communicate back to the driver (see the sketch below)
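
A minimal sketch of where these pieces live (the app name and dataset are illustrative, not from the original post):

from pyspark.sql import SparkSession

# Runs on the driver: creates the SparkSession, the entry point for building execution plans
spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.range(1000000)            # builds a plan on the driver; no work happens yet
df.selectExpr("sum(id)").collect()   # action: the driver schedules tasks on worker executors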


Lazy Execution

Lazy evaluation in Apache Spark is a core optimization technique where transformations on data (such as map, filter, or join) are not immediately executed. Instead, Spark builds a Directed Acyclic Graph (DAG) representing the sequence of operations. The actual computation occurs only when an action (like collect(), count(), or save()) is invoked.
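
For example, assuming df is an existing DataFrame (the column names here are made up):

filtered = df.filter(df.age > 21)            # transformation: nothing is executed yet
projected = filtered.select("name", "age")   # transformation: the DAG simply grows
projected.count()                            # action: the whole DAG is executed now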

To prevent Spark from re-executing the same transformations multiple times, you can cache or persist the DataFrame. This stores the intermediate results in memory or on disk, allowing subsequent actions to reuse the cached data without re-computing the transformations.


Python Code

    
df = df.cache()   # Caches the DataFrame in memory
df.count()        # First action, triggers computation and caching
df.collect()      # Second action, uses cached data
df.unpersist()    # Removes the DataFrame from cache and frees up memory
    
The cache() method is equivalent to calling persist() with its default storage level, which for DataFrames is MEMORY_AND_DISK. This means Spark stores the data in memory as much as possible and spills to disk if necessary.
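
If a different trade-off is needed, persist() accepts an explicit storage level, e.g.:

from pyspark import StorageLevel

df = df.persist(StorageLevel.MEMORY_ONLY)   # keep data in memory only; recompute evicted partitions
df.count()                                  # action triggers computation and persistence
df.unpersist()                              # release the storage when done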


Considerations

Temporary AWS credentials are only valid for 6 hours. If we are using AWS SSO (Single Sign-On), e.g., with Okta for login, a running notebook might break due to expired credentials.

After logging in with SSO again in the web terminal of the Databricks cluster, we can use boto3 - the official AWS SDK for Python - to pick up the refreshed credentials without re-attaching notebooks to the cluster.


Python Code

    
import boto3
from s3fs import S3FileSystem  # assuming S3FileSystem here comes from the s3fs package

session = boto3.Session(profile_name='your-profile-name')
credentials = session.get_credentials().get_frozen_credentials()

fs = S3FileSystem(
    key=credentials.access_key,
    secret=credentials.secret_key,
    token=credentials.token  # Include if using temporary credentials (like SSO/SAML)
)
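
# With the refreshed credentials, subsequent S3 calls work again,
# e.g. (bucket and path are hypothetical):
files = fs.ls('s3://your-bucket/some/path')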
    

By default, Databricks clusters auto-terminate after 2 hours of inactivity.



Jupyter notebook download as PDF: nbconvert failed

When downloading a notebook as PDF, nbconvert can fail with errors such as:

nbconvert failed: Pandoc wasn't found.
nbconvert failed: No suitable chromium executable found on the system. Please use '--allow-chromium-download' to allow downloading one.



Install Pandoc

After that you need to have XeTeX installed on your machine:
  • Linux: TeX Live
  • Mac: MacTeX
  • Windows: MiKTeX

Install nbconvert and pyppeteer

pip install nbconvert
pip install pyppeteer

Now we can convert an ipynb file to PDF via HTML or via LaTeX.

jupyter nbconvert --to webpdf --allow-chromium-download filename.ipynb

or

jupyter nbconvert --to pdf filename.ipynb

where filename.ipynb is the name of the notebook to be converted to a PDF file.

How to check docstring or method details of a method in Jupyter notebook?

When using a method in a Jupyter notebook, it is always convenient to check the docstring or details of that method, e.g., which parameters the method has. Here we have two options to check those.

The first option is to add "?" at the end of the method you would like to check and run the cell, which will print the method details.
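
For example (using pandas purely as an illustration):

import pandas as pd

pd.DataFrame.merge?   # prints the signature and docstring of merge

Outside a notebook, the built-in help(pd.DataFrame.merge) gives similar output.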




The second option is to press "Shift+Tab" with the cursor inside the method's parentheses, which will show the method details in a tooltip.



Jupyter Notebooks not displaying progress bars

The progress bar does not show up; instead, the output is HBox(children=(FloatProgress(value=0.0, max=1.0), Label(value='In progress...')))


$ jupyter nbextension enable --py widgetsnbextension

# assuming you have already installed nodejs and jupyterlab
# otherwise you need to install them first (e.g., pip install jupyterlab)
$ jupyter labextension install @jupyter-widgets/jupyterlab-manager
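
After enabling the extensions (and restarting Jupyter), a quick way to verify the fix is to render a widget-based progress bar; a small sketch assuming the bar comes from tqdm:

from tqdm.notebook import tqdm
import time

for _ in tqdm(range(10), desc="In progress..."):
    time.sleep(0.1)   # the bar should now render as a widget instead of raw HBox text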

How to Execute a Python Script in Jupyter Notebook Using a Specific Virtual Environment, and How to List and Remove Kernels from Jupyter Notebook


# Suppose you are in an activated virtualenv named py2env

(py2env)$ 

# Then install jupyter within the active virtualenv
(py2env)$ pip install jupyter

# jupyter comes with ipykernel, but if you run into an ipykernel-related error, the package can be installed explicitly:
(py2env)$ pip install ipykernel

# set up the kernel
(py2env)$ python -m ipykernel install --user --name py2env --display-name "Python2 (py2env)"

# Now we can start jupyter notebook
(py2env)$ jupyter notebook




How to list kernels of Jupyter Notebook

If we want to check the list of kernels in our Jupyter Notebook (and possibly remove some afterwards), we can list them first.

$ jupyter kernelspec list
This will give you a list of kernels as follows:
Available kernels:
  hbnn            /home/parklize/.local/share/jupyter/kernels/hbnn
  iswctest        /home/parklize/.local/share/jupyter/kernels/iswctest
  milvus          /home/parklize/.local/share/jupyter/kernels/milvus
  py3.6           /home/parklize/.local/share/jupyter/kernels/py3.6
  py3.7tf2.3.0    /home/parklize/.local/share/jupyter/kernels/py3.7tf2.3.0
  py3.7tf2.4.1    /home/parklize/.local/share/jupyter/kernels/py3.7tf2.4.1
  pymc            /home/parklize/.local/share/jupyter/kernels/pymc
  xai             /home/parklize/.local/share/jupyter/kernels/xai
  python3         /home/parklize/anaconda3/envs/pymc_env/share/jupyter/kernels/python3

How to remove a kernel of Jupyter Notebook

If we want to remove one of them, for example, the following command removes the hbnn kernel.

$ jupyter kernelspec remove hbnn

How to use Jupyter notebook remotely

This is for the case when you want to access a Jupyter notebook running on a remote computer/server from your local/home computer.

Make sure you have installed Jupyter notebook in both remote and local computers.

1. On the remote computer, start Jupyter notebook in the corresponding directory in the terminal:

jupyter notebook --no-browser --port=8888

2. In your local computer's terminal, set up an SSH tunnel:
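# -N: do not execute a remote command, -f: run in the background,
# -L: forward local port 8888 to port 8888 on the remote host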
ssh -N -f -L localhost:8888:localhost:8888 [username]@[your_remote_host_name]

3. You can now access the remote Jupyter notebook by typing
localhost:8888
in your browser! (If prompted for a token, copy it from the output of step 1.)