#Spark | #Databricks | #Python

January 26, 2024

Querying the databricks api

Exploring databricks SQL usage At my company, we adopted databricks SQL for most of our users. Some users have developed applications that use the JDBC connector, some users have built their dashboards, and some users write plain ad-hoc queries. We wanted to know what they queried, so we tried to use Unity Catalog’s insights, but it wasn’t enough for our case. We work with IOT and we are interested in what filters they apply within our tables. Read more

#Spark | #Databricks | #Structured Streaming

October 27, 2023

Tweaking Spark Kafka

Well, I’m facing a huge interesting case. I’m working at Wallbox where we need to deal with billions of rows every day. Now we need to use Spark for some Kafka filtering and publish the results into different topics according to some rules. I won’t dig deep into the logic except for performance-related stuff, let’s try to increase the processing speed. When reading from Kafka you usually get 1 task per partition, so if you have 6 partitions and 48 cores you are not using 87.5 percent of your cluster. That could be adjusted with the following property **minPartitions.** Read more

#Databricks | #Delta | #Unity Catalog

October 2, 2023

Repairing metadata unity catalog

I’ve been subscribed to https://www.dataengineeringweekly.com/p/data-engineering-weekly-148 for years. This last number included several on-call posts on Medium. I found these quite useful. Today, I got an alert from Metaplane that a cost monitor dashboard was out of date. I checked the processes, and everything was fine. I ran a query to check the freshness of the data and it was ok too. Metaplane checks our delta table freshness by querying the table information available in the Unity Catalog. For some unknown reason that metadata didn’t receive any update. I ran an optimization operation (the table tiny) and the metadata didn’t update either. Read more

#Databricks | #Workflows | #Airflow

July 28, 2023

Adding extra params on DatabricksRunNowOperator

With the new Databricks jobs API 2.1 you have different parameters depending on the kind of tasks you have in your workflow. Like: jar_params, sql_params, python_params, notebook_params… And not always the airflow operator is ready to handle all of the. If we check the current release of the DatabricksRunNowOperator, we can see that there is only support for: notebook_params python_params python_named_parameters jar_params spark_submit_params And not the query_params mentioned earlier. But there is a way of combining both, there is a param called jsob that allows you to write the payload of a databricksrunnow and it will also merge the content of the JSON with your named_params! Read more

#Databricks | #Terraform | #Unity Catalog

May 23, 2023

Enabling Unity Catalog

I’ve spent the last few weeks setting up the unity catalog for my company. It’s been an extremely tiring process. And there are several concepts to bring here. My main point is to have a clear view of the requirements. Disclaimer: as of today with https://github.com/databricks/terraform-provider-databricks release 1.17.0, some steps should be done in an “awkward way” that is, the account API does not expose the catalog’s endpoint and should be done through a workspace. Read more

#Spark | #Databricks | #Photon | #Data Engineer

August 12, 2022

Testing Databricks Photon

I was a bit skeptical about photon since I realized that it cost about double the amount of DBU, required specifically optimized machines and did not support UDFs (it was my main target). From the Databricks Official Docs: Limitations Does not support Spark Structured Streaming. Does not support UDFs. Does not support RDD APIs. Not expected to improve short-running queries (<2 seconds), for example, queries against small amounts of data. Photon runtime Read more

#Spark | #DataBricks | #Data Engineer

July 30, 2022

Databricks Cluster Management

For the last few months, I’ve been into ETL optimization. Most of the changes were as dramatic as moving tables from ORC to delta revamping the partition strategy to some as simple as upgrading the runtime version to 10.4 so the ETL starts using low-shuffle merge. But at my job, we have a lot of jobs. Each ETL can be easily launched at *30 with different parameters so I wanted to dig into the most effective strategy for it. Read more

#Spark | #DataBricks | #Firebase | #BigQuery

July 1, 2022

Reading firebase data

Firebase is a common component nowadays for most mobile apps. And it can provide some useful insights, for example in my previous company we use it to detect where the people left at the initial app wizard. (We could measure it). It is quite simple to export your data to BigQuery: https://firebase.google.com/docs/projects/bigquery-export But maybe your lake is in AWS or Azure. In the next lines, I will try to explain how to load the data in your lake and some improvements we have applied. Read more

2017-2024 Adrián Abreu powered by Hugo and Kiss Theme