UUIDs in Pyspark
Apr 7, 2022
I often need UUIDs in Pyspark and Databricks.
I use both deterministic (repeatable) IDs and random IDs.
Random UUIDs
For random IDs, I used to use a UDF:
import uuid
from pyspark.sql.functions import udf

@udf
def create_random_id():
    # Random (version 4) UUID, returned as a string
    return str(uuid.uuid4())
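Applying it is just a withColumn call; df here stands in for whatever DataFrame you are adding the column to:

# df is a placeholder for your DataFrame
df = df.withColumn("uuid", create_random_id())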
But as of Spark 3.0.0 there is a built-in Spark SQL function for random UUIDs, so now I use this:
from pyspark.sql import functions as F

df.withColumn("uuid", F.expr("uuid()"))
This is nicer and much faster, since it uses native Spark SQL instead of a UDF (which has to run Python).
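Because uuid() is a Spark SQL function, the same thing works in a plain SQL query too; the table name below is just a placeholder:

# "some_table" is a placeholder for a registered table or view
df = spark.sql("SELECT uuid() AS uuid, * FROM some_table")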
Deterministic UUIDs
If I want to create a repeatable UUID based on one or more columns, this is the UDF I use:
import uuid
from pyspark.sql.functions import udf

@udf
def create_deterministic_uuid(some_string):
    # uuid5 hashes the namespace plus the name string, so the same
    # input always produces the same UUID
    return str(
        uuid.uuid5(
            uuid.NAMESPACE_OID,
            f"something:{some_string}"
        )
    )
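To apply it, pass in the column the ID should be keyed on, or concatenate several columns into one string first. The column names below are just placeholders:

from pyspark.sql import functions as F

# Keyed on a single column (column names here are placeholders)
df = df.withColumn("row_uuid", create_deterministic_uuid(F.col("customer_id")))

# Keyed on several columns, joined into a single string first
df = df.withColumn(
    "row_uuid",
    create_deterministic_uuid(
        F.concat_ws("|", F.col("customer_id"), F.col("order_date"))
    )
)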
If you want a good reference for Pyspark, check out my website PysparkIsRad.com