July 30, 2022

Databricks Cluster Management

For the last few months, I’ve been working on ETL optimization. The changes ranged from dramatic ones, such as moving tables from ORC to Delta and revamping the partition strategy, to simple ones, such as upgrading the runtime version to 10.4 so the ETL starts using low-shuffle merge. But at my job, we have a lot of jobs. Each ETL can easily be launched 30 times with different parameters, so I wanted to dig into the most effective strategy for it. Read more

July 25, 2022

Pushing data to Tinybird for free

So my Azure subscription expired and I ended up losing the function I was using to feed my real-time analytics data (part of the Transportes Insulares de Tenerife SA analysis I was making). After some struggle, I decided to move it to a GitHub Action. Why? Because the free minutes per month were more than enough, and because I just needed a script that runs on a cron schedule and only makes a request and a post. So, it was quite straightforward. Read more

July 1, 2022

Reading Firebase data

Firebase is a common component of most mobile apps nowadays, and it can provide some useful insights. For example, at my previous company we used it to detect where people dropped out of the initial app wizard. Exporting your data to BigQuery is quite simple (https://firebase.google.com/docs/projects/bigquery-export), but maybe your lake is in AWS or Azure. In the next lines, I will try to explain how to load the data into your lake and some improvements we have applied. Read more

June 30, 2022

Qbeast

A few days ago I ran into Qbeast, an open-source project built on top of Delta Lake that I needed to dig into. This introductory post explains it quite well: https://qbeast.io/qbeast-format-enhanced-data-lakehouse/

The project is quite good, and since everything is documented it seems helpful if you need to write your own custom data source. And, as I’m in love with note-taking, I want to dig into the following three topics: explaining how the format works (including optimizations), describing how the sampling pushdown is implemented, and understanding the table tolerance.

1. Qbeast format

This would be better explained with diagrams. Remember Delta Lake? We had a _delta_log folder with files pointing to files. Now Qbeast has extended this _delta_log and has added some new properties. Read more

May 31, 2022

Faker with PySpark

I’m preparing a small blog post about some tweaks I’ve made to a Delta table, but I want to dig into the Spark UI differences before that. As this was done as part of my work, I’m reproducing the problem with some generated data. I didn’t know about Faker and, boy, it is really simple and easy to use. In this case, I want to generate a small dataset for a product dimension table including its id, category and price. Read more

March 21, 2022

Git 101

From time to time I end up in the same place, telling people about git, what it solves and some basic usage. Since I’ve done it a lot recently, I wanted to write a post about it and enjoy doing so. What is git? Git is a gift from the gods for the following use cases: My laptop is broken! I need the data, there is a whole month of work on it! Read more

February 7, 2022

Sbt tests

Lately at work I’ve been using Delta a lot for some dimension tables, and these tables perform partial row updates to replicate the business logic. This leads to several tests that reproduce a table state and apply the relevant updates to check every flow, and therefore to an execution overhead for this kind of test that ends up being exhausting. One of the proposed solutions was to add a build parameter that skips the test execution step. That is legitimate but, at least to me, somewhat arbitrary. Looking for another consensus, we arrived at this: pull requests will run all the tests, and the remaining builds (manual or automatic branch builds) will exclude these tests, so that we don’t accumulate time on already validated tests while trying things out or integrating branches. Read more

November 11, 2021

Multiplying rows in Spark

Earlier this week I reviewed a pull request that had bothered me since the first time I saw it. Let’s say we work for a bank and we are going to give cash to our clients if they get some people to join our bank. And we have an advertising campaign definition like this:

| campaign_id | inviter_cash | receiver_cash |
| FakeBank001 | 50 | 30 |
| FakeBank002 | 40 | 20 |
| FakeBank003 | 30 | 20 |

And then our BI team defines the schema they want for their dashboards. Read more

2017-2024 Adrián Abreu powered by Hugo and Kiss Theme