March 20, 2023

Duplicates with delta, how can it be?

Long time without writing! Some highlights: I left my job at Schwarz IT in December last year, and now I'm a full-time employee at Wallbox! I'm really happy with my new job, and I've run into some interesting stuff. This was just one of those strange cases where you start doubting the compiler. Context: one of our main tables stores sensor measurements from our chargers with millisecond precision. The numbers are quite high: we are talking over 2 billion rows per day, so the analytic model doesn't handle that level of granularity. The analyst created a table that buckets the data into 5-minute windows, selects some specific sensors, and writes their values as columns. To keep the data consistent, they generated fake rows between sessions, so if a value was missing a synthetic value would be put in its place. Read more
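The bucketing described above can be sketched in plain Python (a minimal sketch only; the actual pipeline runs on Spark/Delta, and the sensor names and fill value here are hypothetical):

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000  # 5-minute buckets over millisecond timestamps


def bucketize(rows, sensors, fill=None):
    """Group (ts_ms, sensor, value) rows into 5-minute windows,
    one column per selected sensor; a missing sensor in a window
    gets a synthetic fill value."""
    windows = defaultdict(dict)
    for ts_ms, sensor, value in rows:
        if sensor in sensors:
            window_start = ts_ms - ts_ms % WINDOW_MS
            windows[window_start][sensor] = value  # keep last value seen in the window
    return [
        {"window": w, **{s: vals.get(s, fill) for s in sensors}}
        for w, vals in sorted(windows.items())
    ]


rows = [
    (1656520076123, "temp", 21.5),
    (1656520076500, "voltage", 230.0),
    (1656520401000, "temp", 22.0),  # next window; voltage is missing here
]
out = bucketize(rows, sensors=["temp", "voltage"], fill=-1.0)
# out[1]["voltage"] is the synthetic -1.0 fill
```

In Spark the same idea maps to `window(col, "5 minutes")` plus a pivot, but the gap-filling of missing windows is what forces the synthetic rows the post talks about.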

August 12, 2022

Testing Databricks Photon

I was a bit skeptical about Photon since I realized that it costs about double the amount of DBUs, requires specifically optimized machines, and does not support UDFs (which were my main target). From the Databricks official docs:

Limitations:
- Does not support Spark Structured Streaming.
- Does not support UDFs.
- Does not support RDD APIs.
- Not expected to improve short-running queries (<2 seconds), for example, queries against small amounts of data.

Photon runtime Read more

July 30, 2022

Databricks Cluster Management

For the last few months, I've been working on ETL optimization. Some of the changes were as dramatic as moving tables from ORC to Delta and revamping the partition strategy; others were as simple as upgrading the runtime version to 10.4 so the ETL starts using low-shuffle merge. But at my job, we have a lot of jobs: each ETL can easily be launched at *30 with different parameters, so I wanted to dig into the most effective strategy for it. Read more

July 25, 2022

Pushing data to tinybird for free

So my Azure subscription expired and I ended up losing the function I was using to feed my real-time analytics data (part of the Transportes Insulares de Tenerife SA analysis I was making). After some struggle, I decided to move it to a GitHub Action. Why? Because the free minutes per month were more than enough, and because I just needed a script to run on a cron, and that script just makes a request and a POST. So it was quite straightforward. Read more
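The "a request and a POST" script can be sketched like this (a minimal sketch, not the post's actual code: the source URL, the token, and the `events` datasource name are placeholders I made up; Tinybird's events endpoint ingests newline-delimited JSON):

```python
import json
import urllib.request

# Hypothetical placeholders -- not the real values used in the post.
SOURCE_URL = "https://example.com/realtime-feed.json"
TB_URL = "https://api.tinybird.co/v0/events?name=events"
TB_TOKEN = "***"


def build_request(records):
    """Build the POST to Tinybird: one JSON object per line in the body."""
    body = "\n".join(json.dumps(r) for r in records).encode()
    return urllib.request.Request(
        TB_URL,
        data=body,
        headers={"Authorization": f"Bearer {TB_TOKEN}"},
        method="POST",
    )


def run():
    # The "request": fetch the live feed.
    with urllib.request.urlopen(SOURCE_URL) as resp:
        records = json.load(resp)
    # The "POST": push it to Tinybird.
    urllib.request.urlopen(build_request(records))
```

A GitHub Actions workflow with a `schedule: cron` trigger then just calls `run()` every few minutes, which is what replaces the expired Azure Function.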

July 21, 2022

Associate Spark Developer Certification

Yesterday I took (and passed with more than 90%, yay!) the Associate Spark Developer Certification. Before I forget, I want to share my experience.

In general:
- First of all, I needed to install Windows, as there was no Linux support for the proctoring software used during the exam.
- Secondly, you need to disable both the antivirus and the firewall before joining. I didn't disable the antivirus, and the technician contacted me because there was a problem with the webcam, even though I could see myself.
- The exam runs in a controlled window started by the software, not a browser page (I could zoom in nicely on the example docs they provide, but not on the software window).
- You can mark questions to review them later.

About the exam: Read more

July 1, 2022

Reading firebase data

Firebase is a common component nowadays in most mobile apps, and it can provide some useful insights; for example, at my previous company we used it to detect (and measure) where people dropped off in the initial app wizard. It is quite simple to export your data to BigQuery: https://firebase.google.com/docs/projects/bigquery-export. But maybe your lake is in AWS or Azure. In the next lines, I will try to explain how to load the data into your lake, along with some improvements we have applied. Read more

June 30, 2022

Qbeast

A few days ago I ran into Qbeast, an open-source project on top of Delta Lake that I needed to dig into. This introductory post explains it quite well: https://qbeast.io/qbeast-format-enhanced-data-lakehouse/. The project is quite good, and it seems helpful if you need to write your own custom data source, as everything is documented. And as I'm in love with note-taking, I want to dig into the following three topics: Read more

June 29, 2022

Spark Dates

I can perfectly describe this as the scariest part of the exam. I'm used to working with dates, but I'm especially used to suffering from the typical UTC / non-UTC / summer-time hour differences. I will try to make some simple exercises for this. The idea: we have some sales data, and God knows how business people love their dashboards on Databricks SQL to refresh super fast. So we decided to aggregate the same KPI, sales per store, at different levels. Consider some data such as:

data = [
    (1656520076, 1001, 10),
    (1656520321, 1001, 8),
    (1656509025, 1002, 5),
    (1656510826, 1002, 3),
    (1656510056, 1001, 5),
    (1656514076, 1001, 8),
]
ts = "ts"
store_id = "store_id"
amount = "amount"
df = spark.createDataFrame(data, [ts, store_id, amount])

Read more
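As a plain-Python preview of that aggregation (a sketch only; the post itself does this with Spark date functions), here is the same data summed per store per UTC hour. Staying in UTC is exactly what sidesteps the summer-time pitfalls the post warns about:

```python
from collections import defaultdict
from datetime import datetime, timezone

data = [
    (1656520076, 1001, 10),
    (1656520321, 1001, 8),
    (1656509025, 1002, 5),
    (1656510826, 1002, 3),
    (1656510056, 1001, 5),
    (1656514076, 1001, 8),
]

# Sum the sales KPI per (store, UTC hour); lower granularities
# (day, week) would just use a coarser format string.
hourly = defaultdict(int)
for ts, store, amount in data:
    hour = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:00")
    hourly[(store, hour)] += amount
```

The Spark equivalent would be `from_unixtime` / `date_trunc` plus a `groupBy`; the trap is that `from_unixtime` renders in the session time zone, which is why the exam (and this post) makes such a fuss about it.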

2017-2024 Adrián Abreu powered by Hugo and Kiss Theme