November 5, 2021

Regex 101

-You will spent your whole life relearning regex, there is a beginning, but never and end.

Last year I participated in some small code problems and practised some regex. I got used to it and feel quite good at it.

And today I had to use it again. I had the following dataframe:

product attributes
1 (SIZE-36)
2 (COLOR-RED)
3 (SIZE-38, COLOR-BLUE)
4 (COLOR-GREEN, SIZE-39)

A wonderful set of string merged with properties that could vary. And we wanted one column for each:

product attributes size color
1 (SIZE-36) 36
2 (COLOR-RED) RED
3 (SIZE-38, COLOR-BLUE) 38 BLUE
4 (COLOR-GREEN, SIZE-39) 39 GREEN

So… the solution seemed quite straightforward. I went to regex101 and start typing… I want everything behind my keyword and the ‘-’ until I found a comma or a closing )

And I ended up with this regexp for color: COLOR\-(.*)(\,\s|\))

And for these examples… It almost worked

(SIZE-36)
(COLOR-RED)
(SIZE-38, COLOR-BLUE)
(COLOR-GREEN, SIZE-39)

Try it on regex101! https://regex101.com/

Regex 101 showed the following capture groups:

RED
BLUE
GREEN, SIZE-39

Why? The , is there it should have stopped at it. since it matched.

And I didn’t get it until I searched for the key: regular expressions are greedy. That means they try to match as much as they can, so the or was valid but it didn’t work.

How do you change it? It just need to add a ? to the *.

COLOR\-(.*?)(\,\s|\))

That regexp works perfectly fine.

Just for the future this was the code I neded up using:

def regex(key: str):
	return fr'{key}\|(.*?)(\,\s|\))'

exploded_variants
.withColumn("COLOR", regexp_extract(col("attributes"), regex('COLOR'), 1))
.withColumn("SIZE", regexp_extract(col("attributes"), regex('SIZE'), 1))

2017-2024 Adrián Abreu powered by Hugo and Kiss Theme