PySpark provides several core components that work together to process large datasets efficiently. Understanding these components clarifies how Spark applications are structured and how data flows through a PySpark program.



Core PySpark Components

PySpark is built around several important components that allow users to read data, transform it, and perform large-scale computations.

Some of the most important components include:

  • SparkSession
  • DataFrames
  • RDDs
  • Spark SQL

Each of these plays a different role in how PySpark processes data and performs computations.



SparkSession

The SparkSession is the entry point for working with PySpark. It allows you to create DataFrames, read data from files, and execute Spark SQL queries.

Before performing most operations in PySpark, a SparkSession must be created. Once created, it manages the connection between your Python program and the Spark engine.



Example: Creating a SparkSession

This example shows how a SparkSession is created. Most PySpark programs begin by initializing a SparkSession so that Spark operations can be performed.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Example") \
    .getOrCreate()

Once the SparkSession is created, you can begin loading and processing data.



DataFrames

A DataFrame is the most commonly used data structure in PySpark. It represents data organized into rows and columns, similar to a table in a database or a pandas DataFrame.

DataFrames support many built-in operations such as filtering rows, selecting columns, grouping data, and performing aggregations.

Because DataFrames are optimized by Spark’s execution engine, they are usually the preferred way to work with data in PySpark.



Example: Creating a DataFrame

This example shows how a DataFrame can be created by reading a dataset.

df = spark.read.csv("data.csv", header=True, inferSchema=True)

df.show()

Once the DataFrame is created, Spark can apply transformations and actions to process the data.



RDDs

RDDs (Resilient Distributed Datasets) are the original data structure used in Spark. An RDD is a distributed collection of data that can be processed in parallel across a cluster.

RDDs provide more control over how data is distributed and processed, but they are lower-level than DataFrames and require more manual work.

Because DataFrames provide better performance optimizations and easier syntax, they are typically preferred over RDDs for most tasks.



Example: Creating an RDD

This example shows how an RDD can be created from a simple Python collection.

data = [1, 2, 3, 4, 5]

rdd = spark.sparkContext.parallelize(data)

print(rdd.collect())

RDDs are still useful in some advanced situations but are used less frequently in modern PySpark workflows.



Spark SQL

Spark SQL allows you to run SQL queries on DataFrames. This makes it possible to analyze large datasets using familiar SQL syntax.

Spark SQL integrates directly with DataFrames, allowing you to create temporary views and run queries against them.

This feature is especially helpful for users who are already comfortable working with SQL.



Example: Using Spark SQL

This example shows how a DataFrame can be registered as a temporary view and queried using SQL.

df.createOrReplaceTempView("people")

result = spark.sql("SELECT * FROM people WHERE age > 25")

result.show()

Spark will execute the SQL query using its distributed execution engine.