In this challenge you will practice filtering, sorting, and creating new columns in PySpark.
Use the WARN layoff dataset from the tutorial.
Tasks:
Load the dataset warn-layoff-data-sample.csv into a Spark DataFrame.
Filter the dataset to include only rows where state = "California".
From the filtered data, keep only layoffs where number_of_workers ≥ 50.
Sort the results by number_of_workers (largest first).
Create a new column called workers_double that multiplies number_of_workers by 2.
Display the final DataFrame.
You should use the following functions:
filter()
orderBy()
withColumn()
One possible solution is shown below. This example walks through each step needed to complete the challenge: loading the data, filtering rows, sorting the results, and creating a new column.
from pyspark.sql.functions import col
# Load the dataset
warn = spark.read.csv(
    "data/warn-layoff-data-sample.csv",
    header=True,
    inferSchema=True
)
# Step 1: Filter rows where the state is California
ca_warn = warn.filter(col("state") == "California")
# Step 2: Keep only layoffs with 50 or more workers
large_ca_warn = ca_warn.filter(col("number_of_workers") >= 50)
# Step 3: Sort by number_of_workers from largest to smallest
sorted_warn = large_ca_warn.orderBy(col("number_of_workers").desc())
# Step 4: Create a new column that doubles the number of workers
final_warn = sorted_warn.withColumn(
    "workers_double",
    col("number_of_workers") * 2
)
# Display the result
display(final_warn)
Explanation
spark.read.csv() loads the dataset into a Spark DataFrame.
filter() removes rows that do not meet the specified condition.
orderBy() sorts the rows based on a column.
withColumn() creates a new column derived from an existing column.
display() shows the final DataFrame so we can inspect the results. Note that display() is provided by notebook environments such as Databricks; in a standalone PySpark script, use final_warn.show() instead.