June 19, 2022

Spark User Defined Functions

Sometimes we need to execute arbitrary Scala code on Spark, for example to call an external library. For that we have the UDF (user-defined function), which accepts one or more columns and returns a value for each row. Once we have a function, we need to register it with Spark so it can be used on the worker machines. If you are using Scala or Java, the UDF runs inside the Java Virtual Machine, so there is little extra penalty. From Python, however, the cost is higher: Spark needs to start a Python process on the worker, serialize the data from the JVM to Python, run the function, and then serialize the result back to the JVM. Read more
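As a rough illustration of the registration step from Python, here is a minimal sketch. The function `price_with_tax`, its `rate` parameter and the registered SQL name are hypothetical names chosen for this example; the PySpark calls used are `pyspark.sql.functions.udf` and `spark.udf.register`, and `spark` is assumed to be an existing SparkSession.

```python
def price_with_tax(price, rate=0.21):
    """Plain Python logic. Spark passes SQL nulls as None, so handle them."""
    if price is None:
        return None
    return round(price * (1 + rate), 2)


def register_udfs(spark):
    """Wrap the function as a UDF and register it for SQL use."""
    # Imported here so the plain function above can be tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Column-expression form, for DataFrame code.
    price_with_tax_udf = udf(price_with_tax, DoubleType())
    # SQL form, registered under a hypothetical name.
    spark.udf.register("price_with_tax", price_with_tax, DoubleType())
    return price_with_tax_udf
```

With a DataFrame `df` that has a `price` column, the returned object could be used as `df.withColumn("price_with_tax", price_with_tax_udf(df["price"]))`. Keeping the logic in a plain function makes it easy to unit test without a cluster, but every row still pays the JVM-to-Python round trip described above.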

June 11, 2022

Spark DataSources

As stated in the Structured API section, Spark supports many sources, each with many options. The only goal of this post is to clarify how the most common ones work and how they are converted to DataFrames. First, all the supported sources are listed here: https://spark.apache.org/docs/latest/sql-data-sources.html We can focus on the typical ones: JSON, CSV and Parquet (as those are the typical formats for open-source data). Read more
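To make the three formats concrete, here is a small sketch of reading them through the generic `spark.read.format(...).options(...).load(...)` chain. The helper `read_source` and the `READ_OPTIONS` table are hypothetical names for this example, and the options shown (`header`, `inferSchema`, `multiLine`) are only a small selection of what the readers accept; `spark` is assumed to be an existing SparkSession.

```python
# Default reader options per format (a small, non-exhaustive selection).
READ_OPTIONS = {
    "json": {"multiLine": "false"},  # one JSON object per line
    "csv": {"header": "true", "inferSchema": "true"},
    "parquet": {},  # the schema is stored in the file itself
}


def read_source(spark, fmt, path, **overrides):
    """Return a DataFrame for `path`, merging default and caller options."""
    if fmt not in READ_OPTIONS:
        raise ValueError(f"unsupported format: {fmt}")
    options = {**READ_OPTIONS[fmt], **overrides}
    return spark.read.format(fmt).options(**options).load(path)
```

For instance, `read_source(spark, "csv", "products.csv", sep=";")` would read a semicolon-separated file with a header row (the file name is made up for the example).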

June 10, 2022

Spark Dataframes

Spark was initially released for dealing with a particular type of data called RDD. Nowadays we work with abstract structures on top of it, and the following table summarizes them.

| Type | Description | Advantages |
| --- | --- | --- |
| Datasets | Structured data composed of a list of objects of a type for which you can specify your custom class (only Scala) | Type-safe operations, support for operations that cannot be expressed otherwise |
| Dataframes | Datasets of type Row (a generic Spark type) | Allow optimizations and are more flexible |
| SQL tables and views | Same as Dataframes but in the scope of databases instead of programming languages | |

Let’s dig into the Dataframes. They are a data abstraction for interacting with named columns; those names are defined in a schema. Read more

June 8, 2022

Spark Execution

Spark provides an API and an engine; the engine is responsible for analyzing the code and performing several optimizations. But how does this work? We can do two kinds of operations with Spark: transformations and actions. Transformations are operations that describe a modification of the data but do not yield a result directly, because they are all lazily evaluated. You can add new columns, filter rows, or perform some computations, and none of it will be executed immediately. Read more
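The split between transformations and actions can be illustrated with a toy model in plain Python. This is not Spark's API or its engine, only a sketch of the idea: the class `LazyRows` and its methods are invented for this example, loosely mirroring `filter`, `withColumn` and `collect`.

```python
class LazyRows:
    """Toy stand-in for a DataFrame: records steps, runs them on an action."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of pending (kind, function) steps

    # Transformations: return a new object with a longer plan; nothing runs.
    def filter(self, predicate):
        return LazyRows(self._rows, self._plan + (("filter", predicate),))

    def with_column(self, name, fn):
        def add(row):
            return {**row, name: fn(row)}
        return LazyRows(self._rows, self._plan + (("map", add),))

    # Action: only here is the recorded plan executed.
    def collect(self):
        rows = iter(self._rows)
        for kind, fn in self._plan:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)


calls = []  # records which rows the "expensive" function actually touched


def expensive(row):
    calls.append(row["id"])
    return row["price"] * 2


rows = LazyRows([{"id": 1, "price": 5}, {"id": 2, "price": 20}])
pending = rows.filter(lambda r: r["price"] > 10).with_column("double", expensive)
# At this point `calls` is still empty: no transformation has run.
result = pending.collect()  # the action triggers the work
```

Because the whole plan is known before anything runs, an engine can reorder or combine steps. In this toy version the only visible benefit is that `expensive` is called just for the row that survives the filter.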

June 7, 2022

Spark Architecture

Spark works on top of a cluster supervised by a cluster manager. The latter is responsible for: Tracking resource allocation across all applications running on the cluster. Monitoring the health of all the nodes. Inside each node there is a node manager, which is responsible for tracking that node's health and resources and reporting them to the cluster manager.

[Diagram: a Cluster Manager connected to three Node Managers]

When we run a Spark application, we generate processes inside the cluster where one node will act as the Driver and the rest will be Workers. Here there are two main points: Read more

May 31, 2022

Faker with PySpark

I’m preparing a small blog post about some tweaks I’ve made to a Delta table, but I want to dig into the Spark UI differences before that. As this was done as part of my work, I’m reproducing the problem with generated data. I didn’t know about Faker and, boy, it is really simple and easy. In this case, I want to generate a small dataset for a product dimension table including its id, category and price. Read more

March 21, 2022

Git 101

From time to time I end up in the same place, telling people about git, what it solves and some basic usage. Since I’ve done it a lot recently, I wanted to write it down in a post and enjoy it. What is git? Git is a gift from the gods for the following use cases: My laptop is broken! I need the data, there is a whole month of work on it! Read more

February 7, 2022

Sbt tests

Lately at work I have been using Delta a lot for some dimension tables, and these tables perform partial row updates to replicate the business logic. This leads to several tests that reproduce a table state and apply the relevant updates to check every flow, and therefore to an execution overhead for this kind of test that ends up being exhausting. One of the proposed solutions was to add a build parameter to skip the test execution step. That is legitimate but, at least to me, somewhat arbitrary. Looking for another consensus, we arrived at this: pull requests will run all the tests, and the remaining builds (manual or automatic branch builds) will exclude these tests, so that while experimenting or integrating branches we do not accumulate time on tests that have already been validated. Read more

2017-2024 Adrián Abreu powered by Hugo and Kiss Theme