Searching the desert for a deterministic id. Photo by Marvin Meyer on Unsplash

UUIDs in Pyspark

Python Is Rad
Apr 7, 2022

I often need UUIDs in Pyspark and Databricks.

I use both deterministic (repeatable) ids and random ids.

Random UUIDs

For random ids, I used to use a UDF:

import uuid
from pyspark.sql.functions import udf

@udf
def create_random_id():
    # Generate a random (version 4) UUID and return it as a string
    return str(uuid.uuid4())

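For completeness, here is a quick sketch of how that UDF gets applied (the df and the "uuid" column name are just placeholders):

# Apply the UDF as a column expression to add a random uuid to every row
df = df.withColumn("uuid", create_random_id())
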
But as of Spark 3.0.0, there is a built-in Spark SQL function for random uuids, so now I use this:

from pyspark.sql import functions as F

df.withColumn("uuid", F.expr("uuid()"))

This is nicer and much faster, since it uses native Spark SQL instead of a UDF (which has to run Python).

Deterministic UUIDs

If I want to create a repeatable uuid based on one or more columns, this is the UDF I use:

import uuid
from pyspark.sql.functions import udf

@udf
def create_deterministic_uuid(some_string):
    # uuid5 is deterministic: the same namespace and input string
    # always produce the same UUID
    return str(
        uuid.uuid5(
            uuid.NAMESPACE_OID,
            f'something:{some_string}'
        )
    )

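As a quick sketch of how I apply it (the column names here are just examples), you can pass a single column, or join several columns into one string first:

from pyspark.sql import functions as F

# Repeatable uuid keyed on a single column
df = df.withColumn("id", create_deterministic_uuid(F.col("customer_email")))

# Repeatable uuid keyed on several columns, joined into one string
df = df.withColumn(
    "id",
    create_deterministic_uuid(F.concat_ws("|", "first_name", "last_name")),
)

Because uuid5 hashes its input, rerunning this over the same data yields the same ids, which is what makes the result repeatable.
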
If you want a good reference for Pyspark, check out my website PysparkIsRad.com

Python Is Rad. I’m a software engineer with an addiction to building things with code. https://twitter.com/PythonIsRad