Demystifying Databricks

Databricks is a cloud-based data platform built on top of Apache Spark, designed to simplify and accelerate data engineering, data science, and machine learning workflows. It was founded by the creators of Apache Spark and offers a unified analytics platform on top of it. MLflow was also originally developed by Databricks and is now an open-source project under the Linux Foundation.

A Databricks cluster has two types of nodes:

  • Driver (master) node: Runs the main application (your notebook or main() function), holds the SparkContext/SparkSession, builds the DAG (Directed Acyclic Graph) of operations, schedules tasks, and coordinates the cluster
  • Worker nodes: Execute the tasks assigned by the driver. Each worker runs one or more executor processes that perform computations, store data partitions, handle shuffles, and report results back to the driver, as illustrated in the sketch below
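
The following is a minimal sketch of how the two roles interact: the driver holds the SparkSession and builds the execution plan, while the filtering and counting run as tasks on the workers' executors. The application name is arbitrary, and on Databricks a SparkSession is already provided as spark, so building one explicitly is shown only for illustration.

Python Code

from pyspark.sql import SparkSession

# The SparkSession (and underlying SparkContext) lives on the driver node.
# On Databricks it is pre-created as `spark`; this is only illustrative.
spark = SparkSession.builder.appName("demo").getOrCreate()

# The driver only builds the plan here; the actual filtering and counting
# run as tasks on the worker nodes' executors.
df = spark.range(1_000_000)                     # distributed range of ids 0..999999
even_count = df.filter("id % 2 = 0").count()    # executed across the workers
print(even_count)                               # result is returned to the driver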


Lazy Execution

Lazy evaluation in Apache Spark is a core optimization technique where transformations on data (such as map, filter, or join) are not immediately executed. Instead, Spark builds a Directed Acyclic Graph (DAG) representing the sequence of operations. The actual computation occurs only when an action (like collect(), count(), or save()) is invoked.
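
As a small illustration (the input path and column names below are hypothetical), the transformations only build up the plan without touching the data; the final count() is what triggers execution.

Python Code

from pyspark.sql import functions as F

df = spark.read.parquet("/path/to/data")            # hypothetical input path

# Transformations: nothing runs yet, Spark only extends the DAG.
filtered = df.filter(F.col("status") == "active")   # hypothetical column
projected = filtered.select("id", "status")

# Action: only now does Spark execute the whole pipeline and return a result.
print(projected.count())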

To prevent Spark from re-executing the same transformations multiple times, you can cache or persist the DataFrame. This stores the intermediate results in memory or on disk, allowing subsequent actions to reuse the cached data without re-computing the transformations.


Python Code

df = df.cache()   # Mark the DataFrame for caching (lazy; nothing is stored yet)
df.count()        # First action: triggers computation and populates the cache
df.collect()      # Second action: reuses the cached data
df.unpersist()    # Remove the DataFrame from the cache and free up memory

The cache() method is equivalent to calling persist() with the default storage level, which for DataFrames is MEMORY_AND_DISK. This means Spark keeps the data in memory as much as possible and spills to disk if necessary.
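
The equivalent persist() call with an explicit storage level looks like the following sketch; unpersist() releases the cached partitions just as with cache().

Python Code

from pyspark import StorageLevel

# Equivalent to df.cache() for DataFrames: keep partitions in memory and
# spill to disk when they do not fit.
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()        # action that materializes the cache
df.unpersist()    # free the cached partitions when no longer needed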


Considerations

Temporary AWS credentials are only valid for 6 hours. If we are using AWS SSO (Single Sign-On), e.g., Okta for login, a long-running notebook might break because its credentials have expired.

After logging in with SSO again in the web terminal of the Databricks cluster, we can pick up the refreshed credentials without re-attaching notebooks to the cluster by using boto3, the official AWS SDK for Python.


Python Code

import boto3
from s3fs import S3FileSystem   # assuming s3fs; pyarrow's S3FileSystem uses different argument names

# Read the refreshed credentials from the named SSO profile.
session = boto3.Session(profile_name='your-profile-name')
credentials = session.get_credentials().get_frozen_credentials()

fs = S3FileSystem(
    key=credentials.access_key,
    secret=credentials.secret_key,
    token=credentials.token,  # include when using temporary credentials (SSO/SAML)
)

By default, a Databricks cluster automatically terminates after 2 hours of inactivity; this auto-termination timeout is configurable per cluster.
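
The timeout can be set when creating a cluster, for example through the Clusters REST API's autotermination_minutes field. The sketch below is a rough illustration only; the workspace URL, token handling, runtime version, and node type are placeholder assumptions to adapt to your environment.

Python Code

import requests

host = "https://<your-workspace>.cloud.databricks.com"   # placeholder workspace URL
token = "<personal-access-token>"                        # placeholder token

cluster_spec = {
    "cluster_name": "short-lived-dev-cluster",
    "spark_version": "13.3.x-scala2.12",      # example runtime, adjust as needed
    "node_type_id": "i3.xlarge",              # example node type, adjust as needed
    "num_workers": 2,
    "autotermination_minutes": 60,            # shut down after 1 hour of inactivity
}

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
print(resp.json())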

