January 26, 2024

Querying the databricks api

Exploring databricks SQL usage At my company, we adopted databricks SQL for most of our users. Some users have developed applications that use the JDBC connector, some users have built their dashboards, and some users write plain ad-hoc queries. We wanted to know what they queried, so we tried to use Unity Catalog’s insights, but it wasn’t enough for our case. We work with IOT and we are interested in what filters they apply within our tables. Read more

October 27, 2023

Tweaking Spark Kafka

Well, I’m facing a huge interesting case. I’m working at Wallbox where we need to deal with billions of rows every day. Now we need to use Spark for some Kafka filtering and publish the results into different topics according to some rules. I won’t dig deep into the logic except for performance-related stuff, let’s try to increase the processing speed. When reading from Kafka you usually get 1 task per partition, so if you have 6 partitions and 48 cores you are not using 87.5 percent of your cluster. That could be adjusted with the following property **minPartitions.** Read more

October 21, 2023

KSQL, a horror tale

After spending several weeks working on a ksql solution to filter billions of events and determine their destination topic, I was disappointed to find that it did not live up to my expectations. I had hoped for a more robust product that would align with our needs. Previously, we utilized a similar filter in Spark, incurring traffic costs for both Confluent and AWS. With kSQL, the advantage was that we could avoid paying for AWS traffic. Read more

October 2, 2023

Repairing metadata unity catalog

I’ve been subscribed to https://www.dataengineeringweekly.com/p/data-engineering-weekly-148 for years. This last number included several on-call posts on Medium. I found these quite useful. Today, I got an alert from Metaplane that a cost monitor dashboard was out of date. I checked the processes, and everything was fine. I ran a query to check the freshness of the data and it was ok too. Metaplane checks our delta table freshness by querying the table information available in the Unity Catalog. For some unknown reason that metadata didn’t receive any update. I ran an optimization operation (the table tiny) and the metadata didn’t update either. Read more

July 28, 2023

Adding extra params on DatabricksRunNowOperator

With the new Databricks jobs API 2.1 you have different parameters depending on the kind of tasks you have in your workflow. Like: jar_params, sql_params, python_params, notebook_params… And not always the airflow operator is ready to handle all of the. If we check the current release of the DatabricksRunNowOperator, we can see that there is only support for: notebook_params python_params python_named_parameters jar_params spark_submit_params And not the query_params mentioned earlier. But there is a way of combining both, there is a param called jsob that allows you to write the payload of a databricksrunnow and it will also merge the content of the JSON with your named_params! Read more

May 23, 2023

Enabling Unity Catalog

I’ve spent the last few weeks setting up the unity catalog for my company. It’s been an extremely tiring process. And there are several concepts to bring here. My main point is to have a clear view of the requirements. Disclaimer: as of today with https://github.com/databricks/terraform-provider-databricks release 1.17.0, some steps should be done in an “awkward way” that is, the account API does not expose the catalog’s endpoint and should be done through a workspace. Read more

March 20, 2023

Duplicates with delta, how can it be?

Long time without writing! On highlights: I left my job at Schwarz It in December last year, and now I’m a full-time employee at Wallbox! I’m really happy with my new job, and I’ve experienced interesting stuff. This one was just one of these strange cases where you start doubting the compiler. Context One of my main tables represents sensor measures from our chargers with millisecond precision. The numbers are quite high, we are talking over 2 billion rows per day. So the analytic model doesn’t handle that level of granularity. The analyst created a table that will make a window of 5 minutes, select some specific sensors and write there those values as a column. To keep the data consistent they were generating fake rows between sessions, so if a value was missing a synthetic value would be put in place. Read more

August 12, 2022

Testing Databricks Photon

I was a bit skeptical about photon since I realized that it cost about double the amount of DBU, required specifically optimized machines and did not support UDFs (it was my main target). From the Databricks Official Docs: Limitations Does not support Spark Structured Streaming. Does not support UDFs. Does not support RDD APIs. Not expected to improve short-running queries (<2 seconds), for example, queries against small amounts of data. Photon runtime Read more

2017-2024 Adrián Abreu powered by Hugo and Kiss Theme