Scientific Computing: Dask vs. PySpark
Dask and PySpark - Powerful tools for big data processing
Ever tackled a dataset that goes beyond the limits of memory and computation on a single machine?
Well, today, on day 8 of our AI/ML Journey, we are studying a couple of tools designed for distributed and parallel computing: Dask and PySpark. These two powerful libraries are essential for handling big data and scaling computations efficiently.
What Are Dask and PySpark?
Both Dask and PySpark are designed for big data and parallel computing, but they cater to different use cases depending on a few factors we shall go over.
“Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time” ~ Wikipedia
“Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing application software” ~ Wikipedia
Dask is a parallel computing library in Python that scales familiar Python libraries like Pandas and NumPy to larger datasets. It’s best suited for users who are already comfortable with these libraries and want to handle bigger data or distribute computations.
PySpark, on the other hand, is the Python API for Apache Spark, a distributed computing engine built to process large datasets in parallel across clusters. It’s ideal for handling complex computations and large-scale data across many machines.
Now, we shall explore both tools, their key features, and how to use them.
Dask - A Flexible Parallel Computing Library
First question, what is Dask?
Dask is a flexible parallel computing library that allows you to…
Scale Pandas/NumPy workflows to handle datasets that don’t fit into memory.
Parallelize computations on all CPU cores or even across clusters of machines.
Work with larger datasets seamlessly, distributing them into manageable partitions.
Some Key Components of Dask Include…
Dask DataFrame: A scalable version of the Pandas DataFrame, designed for larger-than-memory datasets.
Dask Array: A parallel version of NumPy arrays for handling large numerical datasets.
Setting Up Dask
To install Dask, we’ll need to use the following command (just like we did with pandas):
pip install "dask[complete]"
Dask DataFrame (Scalable Pandas)
Dask DataFrames look and feel like Pandas DataFrames, but they are partitioned across multiple cores or machines, which is the parallel computing we referenced earlier.
import dask.dataframe as dd
# Read a large CSV file in parallel
df = dd.read_csv('large_file.csv')
# Perform operations like Pandas
df.head() # Fetch first few rows
df.groupby('column').sum().compute() # Perform groupby and aggregate
Do you notice the `.compute()` call? Dask defers computation until you explicitly request it.
When you write Dask code, it doesn’t execute immediately. Instead, Dask builds a task graph. This graph represents the sequence of operations needed to compute the final result.
Before running the computations, Dask analyzes the task graph to find the most efficient way to execute it. In most cases, this involves parallelizing tasks (across multiple CPUs), minimizing data movement, and reusing intermediate results. Don’t stress too much about the technicalities; understanding the purpose is key.
Think of the task graph as a blueprint outlining all the tasks (operations) and their dependencies. This allows Dask to optimize the order and method of execution, potentially combining or reordering tasks to run as efficiently as possible.
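To make this concrete, here is a minimal sketch of that lazy behaviour (the tiny frame and the variable names pdf, ddf, and lazy_sum are just for illustration):
import dask.dataframe as dd
import pandas as pd
# A tiny in-memory frame, split into two partitions for illustration
pdf = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)
# Nothing runs yet -- this only adds nodes to the task graph
lazy_sum = ddf.groupby("group")["value"].sum()
print(type(lazy_sum))   # a lazy Dask Series, no numbers computed yet
# .compute() executes the task graph and returns a concrete Pandas result
print(lazy_sum.compute())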
Dask Array (Scalable NumPy)
A Dask Array splits a NumPy array into chunks (smaller bits), allowing parallel computation.
import dask.array as da
import numpy as np
# Create a Dask array from a NumPy array
x = da.from_array(np.random.random((10000, 10000)), chunks=(1000, 1000))
# Perform operations and compute results
y = x.mean(axis=0).compute()
When to Use Dask
When your data no longer fits in memory.
To parallelize computations on a single machine or across clusters.
For interactive or real-time data processing.
Dask vs. Pandas
Pandas is great for smaller datasets.
Dask extends Pandas to larger datasets by breaking them into manageable chunks and processing them in parallel.
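As a quick sketch of that chunking (the column name and partition count below are arbitrary), you can inspect how many partitions a Dask DataFrame holds and how the rows are split across them:
import dask.dataframe as dd
import pandas as pd
# Split a Pandas DataFrame into 4 partitions that Dask can process in parallel
pdf = pd.DataFrame({"x": range(1000)})
ddf = dd.from_pandas(pdf, npartitions=4)
print(ddf.npartitions)                     # 4 chunks
print(ddf.map_partitions(len).compute())   # rows in each chunk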
PySpark - Python API for Apache Spark
What is PySpark?
PySpark provides an interface for Apache Spark, a distributed computing framework that handles massive datasets in parallel across multiple machines. It’s ideal for big data applications, offering fault tolerance, in-memory computation, and high-speed processing.
Key Features of PySpark
Handles Large Datasets: PySpark is designed to work with big data, distributing it across clusters.
Parallel Processing: Operations can be executed in parallel across many machines.
In-Memory Computation: Speeds up data processing by keeping intermediate results in memory.
Setting Up PySpark
To install PySpark, run the following in your project terminal (as usual):
pip install pyspark
Then, create a Spark session to start working, like this:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("MyPySparkApp").getOrCreate()
print(spark.version)
Key Components of PySpark
RDD (Resilient Distributed Datasets): The fundamental data structure in Spark, designed for fault-tolerant parallel processing.
DataFrames: A higher-level abstraction that makes PySpark more user-friendly and similar to Pandas. DataFrames in PySpark are optimized for performance.
# Sample data
data = [("John", 28), ("Anna", 24), ("Peter", 35)]
# Create a PySpark DataFrame
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
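The snippet above covers DataFrames; for completeness, here is a tiny RDD sketch as well (the numbers are made up, purely to illustrate a map transformation):
# Distribute a Python list as an RDD and double each element
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
doubled = rdd.map(lambda x: x * 2)
print(doubled.collect())   # [2, 4, 6, 8, 10]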
Basic Operations in PySpark
Reading Data
You can load data from CSV, JSON, and other sources:
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
df.show()
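Reading JSON follows the same pattern (the path below is a placeholder):
df_json = spark.read.json("path/to/file.json")
df_json.show()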
Filtering Data
Filtering rows in PySpark is similar to SQL:
df.filter(df.Age > 30).show()
GroupBy and Aggregation
You can group and aggregate data in PySpark:
df.groupBy("Name").agg({"Age": "mean"}).show()
Joining DataFrames
PySpark allows easy joins between DataFrames, just like SQL:
# Inner join
df1.join(df2, "Name").show()
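Since `df1` and `df2` are not defined earlier, here is a small self-contained version of that join (the names and cities are made up):
df1 = spark.createDataFrame([("John", 28), ("Anna", 24)], ["Name", "Age"])
df2 = spark.createDataFrame([("John", "New York"), ("Anna", "Oslo")], ["Name", "City"])
# Inner join on the shared "Name" column
df1.join(df2, "Name").show()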
Transformations vs. Actions
Transformations: Operations like `filter()` and `select()` are lazily evaluated and only executed when an action is triggered.
Actions: Operations like `show()`, `collect()`, or `count()` that trigger execution.
# Lazy transformation
transformation = df.filter(df.Age > 30)
# Action that triggers execution
transformation.show()
Running SQL Queries in PySpark
You can register a DataFrame as a temporary SQL table:
df.createOrReplaceTempView("people")
# Run SQL queries
result = spark.sql("SELECT * FROM people WHERE Age > 30")
result.show()
Saving Data
You can save DataFrames in multiple formats such as CSV or Parquet:
df.write.csv("path/to/output.csv")
df.write.parquet("path/to/output.parquet")
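Worth noting: Spark writes each of these paths as a directory of part files (one per partition) rather than a single file. Reading the data back uses the matching reader:
df_back = spark.read.parquet("path/to/output.parquet")
df_back.show()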
Comparison: Dask vs. PySpark
Both Dask and PySpark are excellent tools for scaling data science workflows to big data. Dask is more suited for users transitioning from Pandas or NumPy, while PySpark is better for those working in distributed computing environments, handling large-scale datasets.
Key Take-Away!
Dask scales your Pandas and NumPy workflows with minimal changes, perfect for parallel computing on single machines or small clusters.
PySpark is ideal for large datasets and distributed computing across multiple machines, making it great for heavy data processing and large-scale analytics.
Familiarity with these tools will enable us to dive into more advanced topics like Dask’s distributed scheduler or PySpark's MLlib for machine learning.
But until then, let’s keep learning.
Well done! Today, on Day 8 of our AI/ML Journey, we further explored scientific computing, with a focus on the capabilities of Dask and PySpark.
Join us again on day 9 as we continue our research on scientific computing. NumPy and SciPy up next!
Remember, if you're interested in joining our upcoming coding session, you should join our Discord channel and stay tuned to Raven-R for more updates and cool stuff like this.