I can honestly describe this as the scariest part of the exam. I’m used to working with dates, but I’m especially used to suffering from the typical UTC / non-UTC / daylight-saving-time differences.
I will try to build some simple exercises around this; the idea would be:
- We have some sales data, and god knows how much the business people love to refresh their dashboards on Databricks SQL as fast as possible. So we decided to aggregate the same KPI, our sales per store, at different levels. Consider some data such as:
data = [
(1656520076, 1001, 10),
(1656520321, 1001, 8),
(1656509025, 1002, 5),
(1656510826, 1002, 3),
(1656510056, 1001, 5),
(1656514076, 1001, 8),
]
ts = "ts"
store_id = "store_id"
amount = "amount"
df = spark.createDataFrame(data, [ts, store_id, amount])
We need to parse that data into a readable date, as the first number is an epoch or unix time. Using the function from_unixtime this is quite simple:
from pyspark.sql.functions import from_unixtime, col

parsed_date = "date_parsed"
parsed_df = df.withColumn(parsed_date, from_unixtime(col(ts)))
display(parsed_df)
Now we need to perform different aggregations; we want the sum of the amount per store on:
- Quarter of the year
- Month
- Week of the year
- Day of the year
All those aggregations can be done with Spark, and not only that: Spark ships first-class functions for each of them, making the transformations direct:
from pyspark.sql.functions import quarter, month, weekofyear, dayofyear, sum, col
from functools import reduce

functions = [quarter, month, weekofyear, dayofyear]
# one aggregation per granularity, aliased to "period" so the union lines up
dfs = [parsed_df.groupBy(f(col(parsed_date)).alias("period"), col(store_id)).agg(sum(col(amount))) for f in functions]
display(reduce(lambda x, y: x.union(y), dfs))
period | store_id | sum(amount) |
---|---|---|
2 | 1001 | 31 |
2 | 1002 | 8 |
6 | 1001 | 31 |
6 | 1002 | 8 |
26 | 1001 | 31 |
26 | 1002 | 8 |
180 | 1001 | 31 |
180 | 1002 | 8 |
As we can see, this part is straightforward, but there is a dark, rarely mentioned part: formatting timestamps and parsing those formats.
Timestamps come in a wide variety of formats; the standard UTC (ISO 8601) representation looks like:
2022-06-29T19:22:55Z
In Spark’s datetime pattern syntax this translates to:
yyyy-MM-dd'T'HH:mm:ss[.SSS]'Z'
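Just to connect that pattern with an actual call, here is a minimal sketch (my own illustration, not part of the original exercise) using to_timestamp with an explicit pattern; the DataFrame and the column name raw_ts are made up for the example:

from pyspark.sql.functions import to_timestamp, col

df_raw = spark.createDataFrame([("2022-06-29T19:22:55Z",)], ["raw_ts"])
# The trailing Z is matched as a literal here, so the value is interpreted
# in the session time zone rather than being offset-adjusted.
display(df_raw.withColumn("parsed", to_timestamp(col("raw_ts"), "yyyy-MM-dd'T'HH:mm:ss'Z'")))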
If we want to handle that kind of ts with Spark, we also have the function from_utc_timestamp, which accepts the column and a time zone (the time zone could also be another column); the string itself is cast to a timestamp under the hood.
Spark accepts the zone ID "Z" as shorthand for UTC, so I can avoid writing anything longer and just pass "Z" as the time zone.
from pyspark.sql.functions import from_utc_timestamp, col
from pyspark.sql.types import StructType, StructField, StringType

data_ts = [
    ("2022-06-29T19:22:55Z",),
    ("2022-06-29T19:22:55Z",),
]
df_ts = spark.createDataFrame(data_ts, StructType([
    StructField("ts", StringType())
]))
# "Z" is the zone ID for UTC, so the rendered timestamp keeps the same instant
display(df_ts.withColumn("parsed", from_utc_timestamp(col(ts), "Z")))
ts | parsed |
---|---|
2022-06-29T19:22:55Z | 2022-06-29T19:22:55.000+0000 |
2022-06-29T19:22:55Z | 2022-06-29T19:22:55.000+0000 |
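For the formatting direction (turning a timestamp back into one of those string shapes), date_format does the opposite job. Here is a minimal sketch reusing df_ts from above; the column name formatted is just my own choice:

from pyspark.sql.functions import date_format, to_timestamp, col

# Parse the string, then render it back into the ISO-like shape shown above.
# Note: the literal 'Z' is only strictly correct when the session time zone is UTC.
display(
    df_ts.withColumn("parsed", to_timestamp(col("ts")))
         .withColumn("formatted", date_format(col("parsed"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
)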