#Python

August 13, 2024

Finding pet projects

As my company undergoes layoffs, I’m back on the job hunt. While I’m in the field of data, I often find myself missing the hands-on experience that comes from personal projects. I realized that I’m not practicing all the skills I need. During a recent interview, I was asked about my experience with sending reports via email—something I hadn’t done in a few years. That got me thinking: could I turn this into a pet project? Read more

#Spark | #Scala | #WSL

March 22, 2024

Developing on Windows

Over the years, I’ve been using macOS at work and Ubuntu at home for my development tasks. However, my Lenovo P1 Gen 3 laptop didn’t work well with Linux, leading to frequent issues with the camera and graphics (screen flickering, I’m looking at you, and it hurts). I had tried Windows Subsystem for Linux (WSL) before, but it was quite bad, to be honest. Having heard about WSL2 and WSLg, I decided to give it another shot. Read more

#Spark | #Databricks | #Python

January 26, 2024

Querying the Databricks API

Exploring Databricks SQL usage. At my company, we adopted Databricks SQL for most of our users. Some have developed applications that use the JDBC connector, some have built their own dashboards, and some write plain ad-hoc queries. We wanted to know what they queried, so we tried Unity Catalog’s insights, but they weren’t enough for our case. We work with IoT and we are interested in which filters users apply to our tables. Read more

#Spark | #Databricks | #Structured Streaming

October 27, 2023

Tweaking Spark Kafka

Well, I’m facing a hugely interesting case. I’m working at Wallbox, where we need to deal with billions of rows every day. Now we need to use Spark for some Kafka filtering and publish the results into different topics according to some rules. I won’t dig deep into the logic except for performance-related stuff, so let’s try to increase the processing speed. When reading from Kafka you usually get 1 task per partition, so if you have 6 partitions and 48 cores, 87.5 percent of your cluster sits idle. That can be adjusted with the following property: **minPartitions**. Read more
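As a rough illustration of the arithmetic in this excerpt, here is a minimal Python sketch. The helper `idle_core_fraction`, the `kafka_read_options` dict and the topic name are hypothetical and only there for illustration; `minPartitions` is the property named in the text, and setting it to the core count is an assumption rather than a recommendation from the post.

```python
def idle_core_fraction(kafka_partitions: int, cores: int) -> float:
    """Fraction of cores left idle when Spark runs one task per Kafka partition."""
    busy = min(kafka_partitions, cores)
    return (cores - busy) / cores


# Hypothetical reader options: asking Spark for at least as many input
# partitions as there are cores, so every core can receive a task.
kafka_read_options = {
    "subscribe": "events",      # hypothetical topic name
    "minPartitions": str(48),   # assumption: one partition per core
}

# 6 Kafka partitions on a 48-core cluster
print(f"{idle_core_fraction(6, 48):.1%}")
```

In a real job these options would be passed to the Kafka source when building the stream; the right value for `minPartitions` depends on the workload.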

#Confluent | #Kafka | #Ksql

October 21, 2023

KSQL, a horror tale

After spending several weeks working on a KSQL solution to filter billions of events and determine their destination topic, I was disappointed to find that it did not live up to my expectations. I had hoped for a more robust product that would align with our needs. Previously, we utilized a similar filter in Spark, incurring traffic costs for both Confluent and AWS. With KSQL, the advantage was that we could avoid paying for AWS traffic. Read more

#Databricks | #Delta | #Unity Catalog

October 2, 2023

Repairing Unity Catalog metadata

I’ve been subscribed to https://www.dataengineeringweekly.com/p/data-engineering-weekly-148 for years. The latest issue included several on-call posts on Medium, which I found quite useful. Today, I got an alert from Metaplane that a cost monitor dashboard was out of date. I checked the processes, and everything was fine. I ran a query to check the freshness of the data and it was OK too. Metaplane checks our Delta table freshness by querying the table information available in Unity Catalog. For some unknown reason, that metadata hadn’t received any update. I ran an optimize operation (the table is tiny) and the metadata didn’t update either. Read more

#Databricks | #Workflows | #Airflow

July 28, 2023

Adding extra params on DatabricksRunNowOperator

With the new Databricks Jobs API 2.1 you have different parameters depending on the kind of tasks you have in your workflow, such as jar_params, sql_params, python_params, notebook_params… And the Airflow operator is not always ready to handle all of them. If we check the current release of the DatabricksRunNowOperator, we can see that it only supports notebook_params, python_params, python_named_parameters, jar_params and spark_submit_params, and not the sql_params mentioned earlier. But there is a way of combining both: there is a param called json that lets you write the payload of a Databricks run-now request, and it will also merge the content of the JSON with your named_params! Read more
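The merge described in this excerpt can be pictured with a plain-Python sketch. This is not the operator's actual code: `merge_run_now_payload` is a hypothetical helper, the job id and parameter values are made up, and the rule that named parameters win on conflicting top-level keys is an assumption about how the merge behaves.

```python
from typing import Any, Optional


def merge_run_now_payload(json: Optional[dict] = None, **named_params: Any) -> dict:
    """Combine a raw run-now payload with named parameters (illustrative only)."""
    payload = dict(json or {})  # copy so the caller's dict is not mutated
    for key, value in named_params.items():
        if value is not None:
            # Assumption: named parameters override top-level JSON keys.
            payload[key] = value
    return payload


# Parameters without a dedicated argument travel in the raw JSON,
# while the supported ones are still passed by name.
payload = merge_run_now_payload(
    json={"job_id": 42, "sql_params": {"country": "ES"}},
    notebook_params={"run_date": "2023-07-28"},
)
print(payload)
```

The point of the sketch is only that both sources end up in one request body, so a parameter the operator does not expose by name can still reach the job.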

#Databricks | #Terraform | #Unity Catalog

May 23, 2023

Enabling Unity Catalog

I’ve spent the last few weeks setting up Unity Catalog for my company. It’s been an extremely tiring process, and there are several concepts to cover here. My main point is to have a clear view of the requirements. Disclaimer: as of today, with https://github.com/databricks/terraform-provider-databricks release 1.17.0, some steps have to be done in an “awkward way”; that is, the account API does not expose the catalog’s endpoint, so those steps have to be done through a workspace. Read more

2017-2024 Adrián Abreu powered by Hugo and Kiss Theme