Databricks is a cloud-based data platform built on top of Apache Spark, designed to simplify and accelerate data engineering, data science, and machine learning workflows. It was founded by the creators of Apache Spark and offers a unified analytics platform around it. MLflow was also originally developed by Databricks and is now an open-source project under the Linux Foundation.
A Databricks cluster has two types of nodes:
- Driver/master node: Runs the main application (your notebook or main() function), holds the SparkContext/SparkSession, constructs the DAG (Directed Acyclic Graph), schedules tasks, and coordinates the cluster
- Worker node: Executes the actual tasks assigned by the driver. Each worker runs one or more executor processes that perform computations, store data partitions, handle shuffles, and communicate back to the driver (a short PySpark sketch of this split follows the list)
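To make the split concrete, here is a minimal sketch. It only assumes a running cluster; on Databricks a SparkSession named spark already exists, so building one explicitly is just to keep the example self-contained. The plan is assembled on the driver, while the per-partition work runs as tasks in the executors on the worker nodes.
Python Code
from pyspark.sql import SparkSession

# On Databricks a SparkSession named `spark` is pre-created; building one
# here only makes the sketch self-contained.
spark = SparkSession.builder.appName("driver-worker-demo").getOrCreate()

# The driver only records the plan for this DataFrame...
df = spark.range(0, 1_000_000)

# ...the per-partition summation runs as tasks inside executor processes on
# the worker nodes, and only the small result is sent back to the driver.
print(df.selectExpr("sum(id) AS total").collect())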
Lazy Execution
Lazy evaluation in Apache Spark is a core optimization technique where transformations on data (such as map, filter, or join) are not immediately executed. Instead, Spark builds a Directed Acyclic Graph (DAG) representing the sequence of operations. The actual computation occurs only when an action (like collect(), count(), or save()) is invoked.
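As a minimal sketch of this behavior (the DataFrame df and the amount column are hypothetical, purely for illustration): the two transformations below only extend the DAG, and nothing is computed until count() is called.
Python Code
from pyspark.sql import functions as F

# Transformations: these only describe the plan; no data is read or processed yet.
filtered = df.filter(F.col("amount") > 100)
enriched = filtered.withColumn("amount_eur", F.col("amount") * 0.9)

# Action: only now does Spark execute the DAG and return a result to the driver.
n_rows = enriched.count()
Caching builds on the same mechanism, as the next snippet shows: the first action materializes and caches the DataFrame, and later actions reuse the cached data.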
Python Code
df = df.cache()    # Caches the DataFrame in memory
df.count()         # First action, triggers computation and caching
df.collect()       # Second action, uses cached data
df.unpersist()     # Remove the DataFrame from cache and free up memory
Considerations
Temporary AWS credentials are only valid for 6 hours. If we use AWS SSO (Single Sign-On), e.g., logging in through Okta, a running notebook can break because its credentials have expired.
After logging in with SSO again in the web terminal of the Databricks cluster, we can use boto3, the official AWS SDK for Python, to pick up the refreshed credentials without re-attaching notebooks to the cluster.
Python Code
import boto3
from s3fs import S3FileSystem  # assumed import: the key/secret/token parameters below match the s3fs API

session = boto3.Session(profile_name='your-profile-name')
credentials = session.get_credentials().get_frozen_credentials()
fs = S3FileSystem(
    key=credentials.access_key,
    secret=credentials.secret_key,
    token=credentials.token  # Include if using temporary credentials (like SSO/SAML)
)
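Once fs has been rebuilt with the fresh credentials, it can be used as before. A small usage sketch (the bucket and key are hypothetical):
Python Code
import pandas as pd

# Hypothetical bucket and key, purely for illustration.
with fs.open("my-bucket/path/to/data.csv", "rb") as f:
    df = pd.read_csv(f)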
By default, a Databricks cluster shuts down after 2 hours of inactivity.