In PySpark, operations on DataFrames fall into two main categories: transformations and actions. Understanding the difference between them helps explain how Spark processes data efficiently.
A transformation describes how Spark should change a dataset, but Spark does not immediately perform the work. Instead, Spark builds a plan of transformations to execute later. This behavior is called lazy evaluation, and it allows Spark to optimize the computation before running it.
An action tells Spark to actually execute the transformations and return a result. When an action is called, Spark processes all the queued transformations and produces a concrete outcome, such as displayed rows, a row count, or data written to storage.
Transformations create a new DataFrame from an existing one without immediately running the computation.
Common transformation functions include:
- select()
- filter()
- withColumn()
- groupBy()

The following example filters a DataFrame to only include rows where the age column is greater than 25.
Because filter() is a transformation, Spark does not run the computation yet.
```python
# Filter rows where age is greater than 25
filtered_df = df.filter(df.age > 25)

# You can also write it like this
filtered_df = df.filter(df["age"] > 25)
```

Spark records this transformation but waits until an action is called before actually running the computation.
Actions trigger Spark to run the computation. When an action is executed, Spark processes all previously defined transformations and returns a result.
Common actions include:
- show()
- count()
- collect()
- write()

Here, the show() function displays the contents of the DataFrame.
Calling show() forces Spark to execute all the transformations that were previously defined.
```python
# Trigger the computation and display results
filtered_df.show()

# Another common action example
row_count = filtered_df.count()
print(row_count)
```

Once this action runs, Spark performs the filtering transformation and returns the resulting rows.