Demystifying Databricks

Databricks is a cloud-based data platform, built on top of Apache Spark, designed to simplify and accelerate data engineering, data science, and machine learning workflows. The company was founded by the creators of Apache Spark. MLflow was also originally developed by Databricks and is now an open-source project under the Linux Foundation.

A Databricks cluster has two types of nodes:

  • Driver/master node: Runs the main application (your notebook or main() function), holds the SparkContext/SparkSession, constructs the DAG (Directed Acyclic Graph), schedules tasks, and coordinates the cluster
  • Worker nodes: Execute the actual tasks assigned by the driver. Each worker runs one or more executor processes that perform computations, store data partitions, handle shuffles, and communicate back to the driver


Lazy Execution

Lazy evaluation in Apache Spark is a core optimization technique where transformations on data (such as map, filter, or join) are not immediately executed. Instead, Spark builds a Directed Acyclic Graph (DAG) representing the sequence of operations. The actual computation occurs only when an action (like collect(), count(), or save()) is invoked.
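
To make the idea concrete, here is a toy model of lazy evaluation in plain Python. It is not Spark code: LazyPipeline, double, and calls are hypothetical names made up for this sketch, and a real Spark plan is far more sophisticated. It only illustrates that transformations are recorded, that nothing runs until an action is called, and that a second action re-runs the same steps.

```python
# Toy model of lazy evaluation in plain Python (NOT Spark code).
# LazyPipeline is a hypothetical class used only for illustration.

class LazyPipeline:
    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable, e.g. a list
        self._steps = tuple(steps)   # recorded transformations, in order

    # Transformations: only record the step and return a new pipeline.
    def map(self, fn):
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps over the source.
        # Inside a method, map/filter refer to the Python builtins,
        # which are themselves lazy iterators.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: these trigger the actual computation.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every time the map function actually runs


def double(x):
    calls.append(x)
    return x * 2


pipeline = LazyPipeline([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: nothing has run yet
result = pipeline.collect()       # first action: runs the recorded steps
total = pipeline.count()          # second action: runs the same steps again
```

In this sketch the map function runs four times per action, so eight times across collect() and count(). That repeated work is what caching avoids.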

To prevent Spark from re-executing the same transformations multiple times, you can cache or persist the DataFrame. This stores the intermediate results in memory or on disk, allowing subsequent actions to reuse the cached data without re-computing the transformations.


Python Code

    
df = df.cache()   # Marks the DataFrame for caching (lazy, nothing is computed yet)
df.count()        # First action, triggers computation and caching
df.collect()      # Second action, uses cached data
df.unpersist()    # Remove the DataFrame from cache and free up memory
    
The cache() method is equivalent to calling persist() with its default parameters, including the default storage level, which for DataFrames is MEMORY_AND_DISK. This means Spark keeps as much of the data in memory as possible and spills the rest to disk if necessary.


Considerations

Temporary AWS credentials expire after a fixed period (6 hours in this setup). If we are using AWS SSO (Single Sign-On), e.g., logging in through Okta, a running notebook might break once its credentials are outdated.

After logging in with SSO again in the web terminal of the Databricks cluster, we can use boto3, the official AWS SDK for Python, to pick up the refreshed credentials without re-attaching notebooks to the cluster.


Python Code

    
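import boto3
from s3fs import S3FileSystem

# Load the refreshed credentials from the named AWS profile
session = boto3.Session(profile_name='your-profile-name')
credentials = session.get_credentials().get_frozen_credentials()

# Pass them to s3fs explicitly so stale credentials are not reused
fs = S3FileSystem(
    key=credentials.access_key,
    secret=credentials.secret_key,
    token=credentials.token  # Include if using temporary credentials (like SSO/SAML)
)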
    

By default, a Databricks cluster shuts down after 2 hours of inactivity.


org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task xxx was xxx bytes, which exceeds max allowed: spark.rpc.message.maxSize (xxx bytes)

Context:

This happens when calling spark.createDataFrame() in Databricks.


Possible cause:

The likely cause is that the data passed into createDataFrame() is too large to serialize and send to the executors, so it exceeds spark.rpc.message.maxSize.


Solutions:

One solution is to increase the spark.rpc.message.maxSize setting (in MiB, default 128) in the cluster's Spark configuration and restart the cluster:

  • Cluster => configuration => Spark => spark.rpc.message.maxSize 512

Another solution might be to split the data into smaller batches, call spark.createDataFrame() on each batch, and combine the results.
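
The batching idea could look roughly like the sketch below. The helper names chunked and create_dataframe_in_batches, and the default batch_size of 10,000, are assumptions made for illustration and are not part of Spark. Only spark.createDataFrame() and DataFrame.unionByName() are real PySpark calls. Whether this avoids the error depends on how large each batch still is, so treat it as a starting point rather than a guaranteed fix.

```python
from functools import reduce
from itertools import islice


def chunked(rows, batch_size):
    """Yield successive lists of at most batch_size rows (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def create_dataframe_in_batches(spark, rows, schema, batch_size=10_000):
    """Build one DataFrame from several smaller createDataFrame() calls.

    spark is an existing SparkSession. The batch_size default is an arbitrary
    assumption and should be tuned to the size of the rows.
    """
    parts = [
        spark.createDataFrame(batch, schema=schema)
        for batch in chunked(rows, batch_size)
    ]
    if not parts:
        raise ValueError("rows is empty")
    # Combine the per-batch DataFrames into a single DataFrame.
    return reduce(lambda left, right: left.unionByName(right), parts)
```

In a notebook, where a SparkSession named spark already exists, this would be called as create_dataframe_in_batches(spark, rows, schema), with rows and schema being whatever was previously passed to spark.createDataFrame().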


If you have hit a similar problem and resolved it in a different way, leave a comment :)

10 years in Ireland

Leaving South Korea

I worked at the CTO office of Hyundai Mobis - part of the Hyundai Motor Group - after completing my master's degree at Yonsei University. As one of six foreign employees among 250 new hires (which was also featured in the news), I was responsible for in-house IT systems that supported over 10,000 overseas employees as well as manufacturing lines. Although I didn’t realize it at the time, we were essentially “on call” all the time, as we had to resolve issues for overseas subsidiaries as quickly as possible.

Despite the decent pay, the lifestyle did not suit my personality or the way I envisioned living in the long term. I felt the need to seek new challenges and pursue something I truly like to do.

I had always dreamed of going abroad, to a country where my native languages are not spoken, and with nothing to lose, I began reaching out to potential supervisors and research labs to explore PhD opportunities. Fortunately, I ended up with two great opportunities in the field of Knowledge Graphs (or the Semantic Web, as it was known at the time). Both Prof. Amit Sheth and John Breslin were without doubt outstanding supervisors for me, and the hard part was choosing between the U.S. and Ireland. After living in densely populated countries, my instinct ultimately led me to the University of Galway, Ireland.


PhD Life in Galway

I still vividly remember my PhD advisor picking me up and giving me a ride to my accommodation on the first day, bringing a handful of groceries to help me get started. After living in Seoul for four years, Galway felt small but peaceful - an ideal place for research, in my view.

The first two years of my PhD were personally challenging, as I had to shift my mindset from "doing well what I am told to do" to "figuring out what I want to research". Taking ownership of a four-year research project was a major transition. However, it was undoubtedly worth it, as becoming an independent researcher with deep expertise in a very specific domain is the ultimate goal of a PhD. Overall, the "advisor" rather than "supervisor" relationship worked well after those two years, as I continuously changed and refined what I wanted to do and focus on during my PhD. During this period, I was also fortunate to participate in some W3C groups and had a chance to meet Tim Berners-Lee during one of the face-to-face meetings in Paris.

Observing how my PhD advisor worked - especially his mindset in dealing with rejections, whether of papers or of proposals - was beneficial at the time and still is today. While rejection is always difficult, I learned the value of focusing on constructive feedback and using it to improve for the next attempt. It was also inspiring to work in the same lab as Sebastian Ruder, seeing how he persistently targeted top-tier conferences like ACL and EMNLP again and again until his papers started to be accepted, growing into one of the most influential researchers in NLP through persistent and consistent effort.

One set of qualities I wish I had embraced more fully is managing time effectively, working with greater efficiency, and staying open to new opportunities with a positive mindset. I believe these qualities are part of why my PhD advisor is not only a well-established professor but also a co-founder of several companies and impactful initiatives, an advisor to start-ups, and an author of best-selling books.

Looking back, I feel truly fortunate to have spent four years at DERI and later the Insight Centre at the University of Galway, working alongside so many talented people and my advisor.


Bell Labs

As I was finishing my PhD, as with any major transition, I had to decide what to do next.

After submitting numerous job applications focused on research roles, I was fortunate to receive a few offers, including postdocs and a research scientist role at a research lab. As a long-time admirer of IBM - particularly for its achievements with Deep Blue and Watson - I explored several industrial research labs, including IBM Research. Life doesn't always give us what we want - but sometimes, it gives us something closer. Interestingly and somewhat unexpectedly, I ended up at Bell Labs.

At the Dublin office, the entrance was decorated with 10+ replica Nobel Prize medals. I'm not sure if every Bell Labs office had the same setup, but seeing them every day was incredibly motivating.

It was a rewarding experience to work alongside both systems researchers and algorithms researchers - two very different research areas - collaborating to solve real-world business problems. Although the workload was demanding, one thing my tech lead said has stuck with me: "It is way better to be busy than to feel underutilized".

I also had a moment of personal reflection when I noticed him meticulously preparing notes and rehearsing for presentations, while I had been making excuses for myself, blaming my struggles on not being a native English speaker. That moment taught me a valuable lesson: success does not depend on background alone - it requires consistent effort. If we want to do something well, we all need to put in the work. Practice really does make perfect!

At Bell Labs, I worked on AIOps, a completely different area from my PhD research. I once heard someone say that it's good to change the research focus at least once after completing a PhD. This experience gave me valuable perspective and helped me understand how different research communities work differently - for example, in terms of publication styles and the peer-review process.

Two years later, the Bell Labs site in Dublin shut down. Once again, it was time to make another life decision.


Maynooth University 

Many researchers who have completed a PhD wonder what academic life looks like. At least for me, it was something on my bucket list. In recent years, many universities in Ireland have started collaborating with universities in China to establish international colleges, aiming to bring EU pedagogy abroad. This opportunity at Maynooth was ideal for me, as I also wanted to visit my sick parents in China as often as possible during the Covid-19 pandemic.

It was a valuable life experience and truly joyful to see students learn, grow, and graduate with significant achievements. Many of them went on to pursue a wide range of paths, with much brighter futures than mine, both in China and internationally - at institutions like CMU and Cambridge.

It was also a great opportunity to understand how the academic system works. Beyond research, I learned how much time is spent on writing funding proposals, managing admin duties, and handling teaching responsibilities, including lectures, tutorials, and labs. After experiencing the difficulty of writing proposals and facing numerous rejections, I became even more grateful for the PhD funding I got from my advisor, SFI, and SAP :)

After three years, both my parents had passed away, and with my kid growing, spending half a year in China no longer suited our family. I needed to find a more stable job, and I decided to return to industry. It is funny, though, how underappreciated one can feel when moving between the public and private sectors. Despite some challenges in switching sectors, I’m still glad I gave academic life a try, just like I gave a PhD a try. If I had not, the open loop would still be lingering in my mind today.

Dell Technologies 

Another open loop in my mind during my academic career was gaining experience with EU-funded projects such as FP7 and Horizon Europe. I was fortunate to land a role at Dell Research - something I hadn't even known existed in Ireland. It is a small, specialised team focused on developing EU proposals as well as other national research proposals, and executing funded projects. The role felt like a perfect fit for me: although it was in industry, it naturally required many of the skills I had developed in academia.

It turned out to be a great learning experience. I had a high degree of autonomy to explore research topics and collaborate with a wide range of partners, from big companies like SAP to well-established academic institutions such as ETH Zurich, as well as SMEs and organizations like W3C.

During this time, I was fortunate to work with top researchers from ETH Zurich, and gained firsthand experience in how EU projects are structured, managed, and evaluated. More importantly, I also acquired valuable skills in software development and deployment, particularly around building and deploying microservices and exposing them as REST APIs - skills that are essential in the tech industry today.

The freedom and ownership I had over both research and development provided countless opportunities for personal and professional growth. I deepened my understanding of LLM-based agents and co-authored four patents with my mentor, Dr. Said Tabet, who was also heavily involved in the Semantic Web research community in its early days, for example in SWRL and the SEMANTiCS conference.

One phrase from my project lead that has stayed with me is: "The best time to learn something is when we don't need it." That really resonated with me. If there is something you hope to do in the future, the best time to start learning and preparing for it is NOW.

Moving forward

Ten years have passed quickly. It feels as if life has gently guided me, ticking off items on my bucket list one after another. Or perhaps, deep down, I had an unspoken desire - a drive to move forward because each of these steps represented an open loop I wanted to close. Either way, I am deeply grateful to God for allowing me to go through these experiences and for the chance to work with so many talented people. Beyond the technical knowledge I've gained from them, what's been even more meaningful are the small but powerful phrases they've shared - moments of clarity that have changed my perspective and attitude - and the way they approach life's ups and downs with resilience and grace.

During these 10 years - and really, throughout my life - there have been many challenges, as there are for anyone: securing a PhD offer, getting a job time and again, having a kid during my PhD, juggling work and parenting through Covid, and coping with the loss of my parents. There have been countless trials and failures, moments of doubt and struggle. But time and again, things have eventually worked out. It turns out that in the end, everything is going to be OK; if it's not OK, it's not the end yet.

One of my favorite books - and currently my bedside companion - is "How to Stop Worrying and Start Living" by Dale Carnegie. One particular passage from it has become a daily prayer for me:
"God grant me the serenity to accept the things I cannot change,
the courage to change the things I can,
and the wisdom to know the difference."

This prayer reminds me to find peace in uncertainty, strength in action, and clarity in decision-making. And perhaps that’s the essence of what this journey has taught me: to embrace the path, however unpredictable, and keep moving forward with faith, gratitude, and intention.