October 2024
We've all been asked a question something like: "If you were stranded on a desert island, what one thing would you bring?"
I'm going to go out on a limb and guess that you have never been asked "...which data analytics query would you bring?"
I’ve been working a lot with Microsoft Fabric lately and, from my perspective, that one query to bring would be the Delta “merge” query which looks a bit like this...
from delta.tables import DeltaTable  # Delta Lake Python API

deltaTable.alias("target").merge(
        updatesDF.alias("updates"),
        "target.id = updates.id"
    ) \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
Let me explain the query. It shines when you have a dataframe of new data to merge into an existing Lakehouse table but you don't know whether individual rows need to be inserted or updated (the classic "upsert"). Specifically, "deltaTable" is the target Lakehouse table and "updatesDF" is the dataframe holding the data you want to add or update. The merge command requires a join condition so it knows how to match rows in the updates dataframe against the target table; in this case, the key is "target.id" on the target table and "updates.id" on the source dataframe.
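The upsert semantics can be sketched in plain Python (a toy model on dicts, not the Delta API): matched keys are updated, unmatched keys are inserted.

```python
def merge_upsert(target, updates):
    """Toy model of a Delta merge on a key. Both arguments are dicts
    mapping id -> row. Ids present in both are updated
    (whenMatchedUpdateAll); ids only in updates are inserted
    (whenNotMatchedInsertAll)."""
    for row_id, row in updates.items():
        target[row_id] = row  # update if present, insert if not
    return target

table = {1: {"name": "old"}, 2: {"name": "keep"}}
updates = {1: {"name": "new"}, 3: {"name": "added"}}
merge_upsert(table, updates)
print(table)
# {1: {'name': 'new'}, 2: {'name': 'keep'}, 3: {'name': 'added'}}
```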
That’s it! This magical query can be used in many different ways. For example, let’s say you need to ingest data from an Application Programming Interface (API) and you’re bringing in 500 rows per API call. You simply drop the 500 rows into a dataframe and then call the Delta merge query. You don’t have to worry about inserting duplicates and I have found the performance to be very good.
I should mention that I wasn't initially aware of the Delta merge query, so I was working directly with the underlying Parquet files (an Apache standard). This was tiresome, frustrating, low-level coding. The whole purpose of the Delta layer is to provide a higher-level programming interface, so you stay out of the weeds. As an added bonus, you get the extra capabilities Delta provides, such as "time travel" querying, where you can look at your data as it existed at a point in the past without setting up any special logging.
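Conceptually, time travel works because every commit produces a new immutable snapshot while older snapshots remain readable. A toy illustration in plain Python (the `ToyDeltaTable` class is invented here; Delta itself exposes this through reader options keyed by version or timestamp):

```python
class ToyDeltaTable:
    """Toy model: each merge commits a new snapshot; old snapshots
    stay readable by version number ("time travel")."""

    def __init__(self):
        self._versions = [{}]  # version 0: empty table

    def merge(self, updates):
        """Upsert a batch of {id: row} and commit a new version."""
        snapshot = dict(self._versions[-1])
        snapshot.update(updates)
        self._versions.append(snapshot)

    def read(self, version_as_of=None):
        """Read the latest snapshot, or an older one by version."""
        if version_as_of is None:
            version_as_of = len(self._versions) - 1
        return self._versions[version_as_of]

table = ToyDeltaTable()
table.merge({1: "alice"})             # commits version 1
table.merge({1: "alicia", 2: "bob"})  # commits version 2
print(table.read())                   # {1: 'alicia', 2: 'bob'}
print(table.read(version_as_of=1))    # {1: 'alice'}
```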
Now yes, I will be rather hungry and lonely on the desert island, but I will be well prepared for any and all analytics challenges I may face.
Mark Clancy is a senior data engineer based in Vancouver, Canada.