November 11, 2021

Multiplying rows in Spark

Earlier this week I checked on a Pull Request that bothered me since I saw it from the first time. Let’s say we work for a bank and we are going to give cash to our clients if they get some people to join our bank.

And we have an advertising campaign definition like this:

campaign_id inviter_cash receiver_cash
FakeBank001 50 30
FakeBank002 40 20
FakeBank003 30 20

And then our BI teams defines the schema they want for their dashboards.

campaign_id type cash
FakeBank001 inviter 50
FakeBank001 receiver 30
FakeBank002 inviter 40
FakeBank002 receiver 20
FakeBank003 inviter 30
FakeBank003 receiver 20

And well I saw a code that solved the problem.

campaignDf
.drop("receiver_cash")
.withColumn("type", lit("inviter"))
.withColumn("cash", col("inviter_cash"))
.drop("inviter_cash")
.union(
campaignDf
.drop("inviter_cash")
.withColumn("type", lit("receiver"))
.withColumn("cash", col("receiver_cash"))
.drop("receiver_cash")
)

Well it does what we want but if we don’t cache this, Spark will compute the same dataframe twice and it will dupllicate efforts and perform poorly. And well, we’re not doing anything quite special with any of them, so… Let’s rewrite it.

We want to multiply our current rows, there is an operation that does it and that’s the explode. So what we need is to add a column with and array thas has both values and explode it.

So now we have duplicated our rows and have something like this:

campaign_id type inviter_cash receiver_cash
FakeBank001 inviter 50 30
FakeBank001 receiver 40 30
FakeBank002 inviter 40 20
FakeBank002 receiver 40 20
FakeBank003 inviter 30 20
FakeBank003 receiver 30 20

Now what remains is quite straight forward, we just need to pick one column or another based on the the type we have and we will get with the desired result.

campaignDf
.withColumn("cash", 
    when( col("type") === "Inviter", col("inviter_cash"))
	.when( col("type") === "Receiver", col("receiver_cash") )
).drop("inviter_cash", "receiver_cash")

And we end with the desired result, computed in one single operation without need for caching or anything like that.

2017-2024 Adrián Abreu powered by Hugo and Kiss Theme