In this challenge you will practice grouping data and calculating summary statistics using PySpark. Use the daily-weather-data-sample.csv dataset.
Tasks:

- Load daily-weather-data-sample.csv into a Spark DataFrame.
- Group the data by the city_location_identifier__up_to_9_alphanumeric_characters_ column.
- Calculate summary statistics for temperature.

You should use the following functions: groupBy(), agg(), avg(), max(), min(), orderBy().

One possible solution is shown below. This example groups the dataset by city and calculates several summary statistics for temperature.
from pyspark.sql.functions import col, avg, max, min

# Load the dataset
weather = spark.read.csv(
    "data/daily-weather-data-sample.csv",
    header=True,
    inferSchema=True
)

# Step 1: Group the data by city
city_weather = weather.groupBy(
    "city_location_identifier__up_to_9_alphanumeric_characters_"
)

# Step 2: Calculate aggregation statistics
weather_summary = city_weather.agg(
    avg("average_temperature_c___float_value_to_nearest_hundredths_place").alias("avg_temp"),
    max("maximum_temperature_c___float_value_to_nearest_hundredths_place").alias("max_temp"),
    min("minimum_temperature_c___float_value_to_nearest_hundredths_place").alias("min_temp")
)

# Step 3: Sort by average temperature (highest first)
sorted_weather = weather_summary.orderBy(col("avg_temp").desc())

# Display the result
display(sorted_weather)

Explanation
- groupBy() groups rows that share the same value (in this case, the same city).
- agg() allows multiple aggregation calculations to be performed at once.
- avg(), max(), and min() compute summary statistics for each group.
- alias() renames the resulting columns so they are easier to read.
- orderBy() sorts the aggregated results; calling .desc() on the column sorts from highest to lowest.
- display() shows the final DataFrame.
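To see what the groupBy()/agg()/orderBy() pipeline is doing without needing a Spark session or the CSV file, here is a rough plain-Python sketch of the same group-then-aggregate-then-sort logic. The city names and temperature values are hypothetical sample data, not taken from the actual dataset.

```python
from collections import defaultdict

# Hypothetical per-city temperature readings (not from the real dataset)
rows = [
    ("AUSTIN", 30.5), ("AUSTIN", 28.1), ("AUSTIN", 33.2),
    ("BOSTON", 12.4), ("BOSTON", 15.9),
]

# Like groupBy(): collect the values belonging to each key
groups = defaultdict(list)
for city, temp in rows:
    groups[city].append(temp)

# Like agg() with avg(), max(), min(): compute several statistics per group
summary = {
    city: {
        "avg_temp": round(sum(temps) / len(temps), 2),
        "max_temp": max(temps),
        "min_temp": min(temps),
    }
    for city, temps in groups.items()
}

# Like orderBy(col("avg_temp").desc()): sort groups by average, highest first
sorted_cities = sorted(summary, key=lambda c: summary[c]["avg_temp"], reverse=True)

for city in sorted_cities:
    print(city, summary[city])
```

The difference is that Spark performs these steps in parallel across partitions of the data, while this sketch runs on one in-memory list; the shape of the computation is the same.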