I often need UUIDs in PySpark and Databricks.
I use both deterministic (repeatable) IDs and random IDs.
For random IDs, I used to use a UDF:
But as of Spark 3.0.0 there is a native Spark SQL function, `uuid()`, for random UUIDs. So now I use this:

```python
from pyspark.sql import functions as F
```
This is nicer and much faster, since it uses a native Spark SQL function instead of a UDF (which has to round-trip every row through Python).
If I want to create a repeatable UUID based on one or more columns, this is the UDF:
If you want a good PySpark reference, check out my website, PysparkIsRad.com.