The “add columns” transformation in PySpark allows
us to create new columns or modify existing columns in
a dataset.
We can calculate new values based on other columns, apply mathematical
operations, or categorize data for analysis.
Adding columns is useful when you want to:

- Flag or categorize records (for example, marking major carriers)
- Compute ratios or other derived values from existing numeric columns
- Compare values across rows, such as month-over-month changes
In the following examples we will use a sample dataset from Dewey Data.
To follow along:

1. Go to Dewey Data
2. Download the Airline Employment Data By WARN Database
3. Place the file in your project's data folder

If you need help downloading datasets, see the tutorial section: How to Get Dewey Data Sample Sets.
| carrier_group | carrier_name | full_time | month_date | month_date_parsed | part_time | total |
|---|---|---|---|---|---|---|
| Major | Alaska | 19,273 | Oct 2025 | 2025-10-01 | 2,322 | 21,595 |
| Major | Allegiant Air | 5,364 | Oct 2025 | 2025-10-01 | 958 | 6,322 |
| Major | American | 96,547 | Oct 2025 | 2025-10-01 | 12,016 | 108,563 |
| Major | Atlas Air | 4,281 | Oct 2025 | 2025-10-01 | 53 | 4,334 |
| Major | Delta | 103,018 | Oct 2025 | 2025-10-01 | 1,468 | 104,486 |
| Major | Envoy Air | 12,971 | Oct 2025 | 2025-10-01 | 6,835 | 19,806 |
This displays the first few rows, letting us see column names and understand the dataset structure.
Before adding columns, we first load the dataset into a Spark DataFrame.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

airline = spark.read.csv("data/airline-employment-data-sample.csv", header=True, inferSchema=True)
display(airline)
```

The `display()` function allows us to visually inspect the dataset.
Sometimes we want to convert text categories into
numbers for easier analysis.
For example, we can create a column that flags airlines with
Major carriers as 1, others as 0.
```python
from pyspark.sql.functions import col, when

airline_flag = airline.withColumn(
    "is_major",
    when(col("carrier_group") == "Major", 1).otherwise(0)
)

display(airline_flag.select("carrier_group", "carrier_name", "is_major"))
```

Explanation:
- `withColumn` creates a new column called `is_major`.
- `when(...).otherwise(...)` works like an IF statement: if `carrier_group` is `"Major"`, the value is 1; otherwise, the value is 0.
This column can now be used for grouping, filtering, or analysis of major carriers.
We can also create columns based on calculations
from existing numeric columns.
For example, if we want to calculate the ratio of part-time
employees to total employees:
```python
from pyspark.sql.functions import expr

airline_ratio = airline.withColumn(
    "part_time_ratio",
    expr("part_time / total")
)

display(airline_ratio.select("carrier_name", "part_time", "total", "part_time_ratio"))
```

Explanation:
- `expr("part_time / total")` performs the division for each row.
- `withColumn` stores the result in a new column called `part_time_ratio`.

This column shows the proportion of part-time employees for each airline, which can be used for visualization or further analysis.
Suppose we have multiple months in our dataset and want to compute the change in full-time employees from one month to the next. We can subtract the previous month's `full_time` from the current month's `full_time`.
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lag

# Order each airline's rows by month so lag() sees the previous month
windowSpec = Window.partitionBy("carrier_name").orderBy("month_date_parsed")

airline_change = airline.withColumn(
    "prev_full_time",
    lag("full_time").over(windowSpec)
).withColumn(
    "full_time_change",
    col("full_time") - col("prev_full_time")
)

display(airline_change.select("carrier_name", "month_date_parsed", "full_time", "prev_full_time", "full_time_change"))
```

Explanation:
- `Window.partitionBy(...).orderBy(...)` defines the order of rows for each airline.
- `lag("full_time")` gets the previous month's full-time count for comparison.
- `full_time_change` is the difference between the current and previous month.

This is useful for tracking growth or reductions in employment over time.
Adding columns in PySpark lets you enhance your dataset with new insights:

- Categorical transformations (`when`, `otherwise`) for flags and indicators
- Mathematical calculations (`expr`, `col`) for ratios or sums
- Window functions for comparing rows over time
With these tools, you can prepare your dataset for analysis, visualization, or modeling.