Adding Columns in PySpark

The “add columns” transformation in PySpark allows us to create new columns or modify existing columns in a dataset.
We can calculate new values based on other columns, apply mathematical operations, or categorize data for analysis.

Adding columns is useful when you want to:

  • Perform new calculations based on existing data
  • Convert categorical text into numbers for modeling
  • Create flags or indicators for filtering or grouping


Example Dataset

In the following examples we will use a sample dataset from Dewey Data.

To follow along:

  1. Go to Dewey Data

  2. Download Airline Employment Data By WARN Database

  3. Place the file in your project’s data folder

If you need help downloading datasets, see the tutorial section:

How to Get Dewey Data Sample Sets

carrier_group  carrier_name    full_time  month_date  month_date_parsed  part_time    total
Major          Alaska             19,273  Oct 2025    2025-10-01             2,322   21,595
Major          Allegiant Air       5,364  Oct 2025    2025-10-01               958    6,322
Major          American           96,547  Oct 2025    2025-10-01            12,016  108,563
Major          Atlas Air           4,281  Oct 2025    2025-10-01                53    4,334
Major          Delta             103,018  Oct 2025    2025-10-01             1,468  104,486
Major          Envoy Air          12,971  Oct 2025    2025-10-01             6,835   19,806

This displays the first few rows, letting us see column names and understand the dataset structure.



Loading the Data in PySpark

Before adding any columns, we first load the dataset into a Spark DataFrame.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

airline = spark.read.csv("data/airline-employment-data-sample.csv", header=True, inferSchema=True)

display(airline)

The display() function (available in Databricks notebooks) lets us visually inspect the dataset; in a plain PySpark session, calling .show() on the DataFrame prints the first rows instead.



Create a New Column with Categorical Values

Sometimes we want to convert text categories into numbers for easier analysis.
For example, we can create a column that flags Major carriers as 1 and all other carrier groups as 0.

from pyspark.sql.functions import col, when

airline_flag = airline.withColumn(
    "is_major",
    when(col("carrier_group") == "Major", 1)
    .otherwise(0)
)

display(airline_flag.select("carrier_group", "carrier_name", "is_major"))

Explanation:

  • withColumn creates a new column called is_major.

  • when(...).otherwise(...) works like an IF statement:

    • If carrier_group is "Major", the value is 1

    • Otherwise, the value is 0

  • This column can now be used for grouping, filtering, or analysis of major carriers.



Create a New Column Using Mathematical Expressions

We can also create columns based on calculations from existing numeric columns.
For example, if we want to calculate the ratio of part-time employees to total employees:

from pyspark.sql.functions import expr

airline_ratio = airline.withColumn(
    "part_time_ratio",
    expr("part_time / total")
)

display(airline_ratio.select("carrier_name", "part_time", "total", "part_time_ratio"))

Explanation:

  • expr("part_time / total") performs the division for each row

  • withColumn stores the result in a new column called part_time_ratio

  • This column shows the proportion of part-time employees for each airline, which can be used for visualization or further analysis



Create a Column Representing Employee Change Between Months

Suppose we have multiple months in our dataset and want to compute the change in full-time employees from one month to the next.
We can subtract the previous month’s full_time from the current month’s full_time.

from pyspark.sql.window import Window
from pyspark.sql.functions import col, lag

# Rows within each carrier are ordered by month_date_parsed
windowSpec = Window.partitionBy("carrier_name").orderBy("month_date_parsed")

airline_change = airline.withColumn(
    "prev_full_time",
    lag("full_time").over(windowSpec)
).withColumn(
    "full_time_change",
    col("full_time") - col("prev_full_time")
)

display(airline_change.select("carrier_name", "month_date_parsed", "full_time", "prev_full_time", "full_time_change"))

Explanation:

  • Window.partitionBy(...).orderBy(...) groups the rows by airline and orders each group by month

  • lag("full_time") gets the previous month’s full-time count for comparison

  • full_time_change calculates the difference between current and previous month

  • This is useful for tracking growth or reductions in employment over time



Summary

Adding columns in PySpark lets you enhance your dataset with new insights:

  • Categorical transformations (when, otherwise) for flags and indicators

  • Mathematical calculations (expr, col) for ratios or sums

  • Window functions for comparing rows over time

With these tools, you can prepare your dataset for analysis, visualization, or modeling.