Searching the desert for a deterministic id. Photo by Marvin Meyer on Unsplash

UUIDs in Pyspark

Python Is Rad
Apr 7, 2022

I often need UUIDs in Pyspark and Databricks.

I use both deterministic (repeatable) ids and random ids.

Random UUIDs

For random ids, I used to use a UDF:

import uuid
from pyspark.sql.functions import udf

@udf
def create_random_id():
    # Generate a random (version 4) UUID and return it as a string
    return str(uuid.uuid4())

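For completeness, here is a quick sketch of how that UDF gets applied (the df and the "uuid" column name are just placeholders):

# Apply the UDF as a column expression to add a random uuid to every row
df = df.withColumn("uuid", create_random_id())
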
But as of Spark 3.0.0, there is a built-in Spark SQL function for random uuids, so now I use this:

from pyspark.sql import functions as F

df.withColumn("uuid", F.expr("uuid()"))

This is nicer and much faster, since it uses native Spark SQL instead of a UDF (which has to run Python).

Deterministic UUIDs

If I want to create a repeatable uuid based on one or more columns, this is the UDF I use:

import uuid
from pyspark.sql.functions import udf

@udf
def create_deterministic_uuid(some_string):
    # uuid5 is deterministic: the same namespace and input string
    # always produce the same UUID
    return str(
        uuid.uuid5(
            uuid.NAMESPACE_OID,
            f'something:{some_string}'
        )
    )

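As a quick sketch of how I apply it (the column names here are just examples), you can pass a single column, or join several columns into one string first:

from pyspark.sql import functions as F

# Repeatable uuid keyed on a single column
df = df.withColumn("id", create_deterministic_uuid(F.col("customer_email")))

# Repeatable uuid keyed on several columns, joined into one string
df = df.withColumn(
    "id",
    create_deterministic_uuid(F.concat_ws("|", "first_name", "last_name")),
)

Because uuid5 hashes its input, rerunning this over the same data yields the same ids, which is what makes the result repeatable.
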
If you want a good reference for Pyspark, check out my website PysparkIsRad.com

Python Is Rad. I’m a software engineer with an addiction to building things with code. https://twitter.com/PythonIsRad