User Defined Functions (UDFs) in PySpark

UDFs allow us to apply custom Python logic to columns in a PySpark DataFrame.
Unlike built-in PySpark functions, UDFs let you perform complex or non-standard transformations, like:

  • Categorizing numeric values

  • Combining multiple columns

  • Custom date formatting

  • Conditional logic not available in PySpark functions

UDFs are powerful but slower than built-in functions, so use them only when built-ins can't express the logic you need.



Example Dataset

We will use the Airline Employment Data By WARN Database.

To follow along:

  1. Download airline-employment-data-sample.csv

  2. Place it in your project’s data folder

If you need help downloading datasets, see the tutorial section:

How to Get Dewey Data Sample Sets

carrier_group  carrier_name    full_time  month_date  month_date_parsed  part_time    total
Major          Alaska             19,273  Oct 2025    2025-10-01             2,322   21,595
Major          Allegiant Air       5,364  Oct 2025    2025-10-01               958    6,322
Major          American           96,547  Oct 2025    2025-10-01            12,016  108,563
Major          Atlas Air           4,281  Oct 2025    2025-10-01                53    4,334
Major          Delta             103,018  Oct 2025    2025-10-01             1,468  104,486
Major          Envoy Air          12,971  Oct 2025    2025-10-01             6,835   19,806

This displays the first few rows, letting us see column names and understand the dataset.



Loading the Data in PySpark

Before you start, you will need to load the dataset into a Spark DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("UDF Tutorial").getOrCreate()

airline = spark.read.csv(
    "data/airline-employment-data-sample.csv",
    header=True,
    inferSchema=True
)

display(airline)

The display() function (available in Databricks and similar notebook environments) lets us visually inspect the dataset; in a plain PySpark session, use airline.show() instead.



Creating a Basic UDF

Let’s categorize airlines by total employees:

  • "Small" if total < 5000

  • "Medium" if 5000 <= total < 50000

  • "Large" if total >= 50000

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def categorize_size(total):
    # Spark passes nulls through to the UDF, so guard before comparing.
    if total is None:
        return None
    if total < 5000:
        return "Small"
    elif total < 50000:
        return "Medium"
    else:
        return "Large"

size_udf = udf(categorize_size, StringType())

airline_size = airline.withColumn("work_status", size_udf(col("total")))

display(airline_size.select("carrier_name", "total", "work_status"))

Explanation:

  • udf(categorize_size, StringType()) registers the Python function as a PySpark UDF.

  • withColumn("work_status", size_udf(col("total"))) creates a new column based on the UDF.



UDFs with Multiple Columns

You can use more than one column in a UDF.

Example: classify airlines by full-time to total ratio:

def ft_ratio_category(full_time, total):
    if total == 0:
        return "No Data"
    ratio = full_time / total
    if ratio >= 0.9:
        return "Mostly Full-time"
    elif ratio >= 0.5:
        return "Balanced"
    else:
        return "Mostly Part-time"

ft_udf = udf(ft_ratio_category, StringType())

airline_ratio = airline.withColumn(
    "ft_category",
    ft_udf(col("full_time"), col("total"))
)

display(airline_ratio.select("carrier_name", "full_time", "total", "ft_category"))

Explanation:

  • ft_udf uses two columns to calculate a ratio.

  • Useful for custom conditional logic.



UDFs for Dates

Custom date logic is common.

Example: extract the month name from month_date_parsed:

from datetime import datetime

def get_month_name(date_str):
    try:
        # The sample stores dates as YYYY-MM-DD (e.g. "2025-10-01").
        return datetime.strptime(date_str, "%Y-%m-%d").strftime("%B")
    except (ValueError, TypeError):
        return None

month_udf = udf(get_month_name, StringType())

airline_month = airline.withColumn("month_name", month_udf(col("month_date_parsed")))

display(airline_month.select("carrier_name", "month_date_parsed", "month_name"))

Explanation:

  • Converts "2025-10-01" → "October" (the pattern matches the column's YYYY-MM-DD format).

  • Handles invalid values gracefully with try/except; if Spark inferred the column as a date rather than a string, cast it to string first (or format the date object directly).



Example: Part-Time Impact

Create a new column showing part-time employment impact:

  • < 10% → "Low"

  • 10–30% → "Medium"

  • > 30% → "High"

def part_time_impact(part_time, total):
    # Guard against missing or zero totals before dividing.
    if not total:
        return "No Data"
    pct = part_time / total * 100
    if pct < 10:
        return "Low"
    elif pct <= 30:
        return "Medium"
    else:
        return "High"

part_udf = udf(part_time_impact, StringType())

airline_pt = airline.withColumn(
    "part_time_impact",
    part_udf(col("part_time"), col("total"))
)

display(airline_pt.select("carrier_name", "part_time", "total", "part_time_impact"))

Explanation:

  • Calculates % part-time

  • Categorizes as Low, Medium, High

  • Useful to quickly identify airlines with high part-time dependency.



Summary

UDFs in PySpark let you apply custom logic to DataFrame columns:

  • Can use one or multiple columns

  • Can handle numeric calculations, strings, or dates

  • Must define the return type

  • Slower than built-ins, so use for custom or complex operations only

With UDFs, you can categorize, transform, and enrich your data in ways not possible with standard PySpark functions.