PySpark | RDD, DataFrame, Dataset

In Apache Spark, RDD, DataFrame & Dataset are three different APIs provided to work with distributed data. Users can select the API that best fits their requirements.

Similarities

  1. Fault Tolerant
  2. Distributed in Nature
  3. In-Memory parallel processing
  4. Immutable
  5. Lazy evaluation

Differences

| RDD | DataFrame | Dataset |
|---|---|---|
| Fundamental data structure of Spark | Distributed collection of data organized into named columns | Combines the best of RDD and DataFrame |
| Need to program how & what to do | Need to program only what to do | Need to program only what to do |
| Strong type safety | Less type safety | Strong type safety |
| No built-in optimization | Catalyst optimizer | Catalyst optimizer |
| Supports Java, Scala, Python & R | Supports Java, Scala, Python & R | Supports Scala & Java |
| No schema | Schema-structured | Schema-structured |
| Suitable for low-level operations & fine-grained control over data processing | Suitable for structured data processing, SQL queries & data exploration | Suitable when type safety & performance optimizations are required |
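
A minimal PySpark sketch contrasting the RDD and DataFrame APIs on the same data (the Dataset API is only available in Scala/Java, so it is not shown); names and values are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

data = [("alice", 34), ("bob", 45), ("carol", 29)]

# RDD API: low level, no schema, you spell out *how* to process each record
rdd = spark.sparkContext.parallelize(data)
adults_rdd = rdd.filter(lambda row: row[1] > 30).map(lambda row: row[0])
print(adults_rdd.collect())          # ['alice', 'bob']

# DataFrame API: named columns + schema, Catalyst optimizes *what* you declare
df = spark.createDataFrame(data, ["name", "age"])
adults_df = df.filter(df.age > 30).select("name")
adults_df.show()
```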

PySpark | Transformation & Action

Transformation: Transforms data from one form into another and is used to define the sequence of operations to be performed. Each transformation produces a new RDD/DataFrame (the original is immutable). Ex: filter, union, groupBy

  • Narrow Transformation: Each partition of an RDD contributes to generating only one partition in the resulting RDD. Common narrow transformations include map(), filter(), and union().

  • Wide Transformation: Each input partition of an RDD can contribute to multiple output partitions of the resulting RDD, which requires shuffling data across the cluster. Ex: groupByKey(), reduceByKey(), and sortByKey(). A sketch contrasting both kinds follows this list.
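
A small sketch, using illustrative data, contrasting a narrow transformation (mapValues, no shuffle) with a wide one (reduceByKey, which shuffles records so that equal keys land in the same partition):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("narrow-vs-wide").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], numSlices=4)

# Narrow: each input partition maps to exactly one output partition, no shuffle
doubled = pairs.mapValues(lambda v: v * 2)

# Wide: records with the same key may live in different input partitions,
# so Spark must shuffle them before it can sum per key
totals = pairs.reduceByKey(lambda x, y: x + y)

print(doubled.collect())  # [('a', 2), ('b', 4), ('a', 6), ('b', 8)]
print(totals.collect())   # [('a', 4), ('b', 6)] (order may vary)
```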

Action: Actions are operations that trigger the execution of the accumulated transformations and return the result to the driver program (or write it out to storage). Ex: count, collect, save

Lazy Evaluation: Transformations are not executed until an action is called.
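
A short illustration of lazy evaluation (column names and values are made up): the filter and groupBy below only build up a plan; nothing is computed until the count() action runs.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lazy-eval").getOrCreate()

df = spark.createDataFrame(
    [("books", 12.0), ("books", 7.5), ("toys", 3.2)],
    ["category", "amount"],
)

# Transformations: nothing executes yet, Spark only records the plan
filtered = df.filter(F.col("amount") > 5)
per_category = filtered.groupBy("category").agg(F.sum("amount").alias("total"))

# Action: triggers the whole pipeline and returns results to the driver
print(per_category.count())     # 1  (only the 'books' group survives the filter)
print(per_category.collect())   # [Row(category='books', total=19.5)]
```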

DAG: Spark records the chain of transformations as a Directed Acyclic Graph and uses it to plan stages and tasks when an action is triggered. The DAG is a directed graph because the transformations are executed in a specific order, and it is acyclic because there are no cycles or loops in the graph.
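
One way to peek at the graph Spark has built is the RDD's debug string, which prints its lineage of transformations (a rough sketch; the exact text depends on the Spark version):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-inspect").getOrCreate()
sc = spark.sparkContext

rdd = (sc.parallelize(range(10))
         .map(lambda x: (x % 2, x))
         .reduceByKey(lambda a, b: a + b))

# Prints the lineage (the DAG of transformations) Spark will execute
# once an action is eventually called
print(rdd.toDebugString().decode("utf-8"))
```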

PySpark | On-Heap vs Off-Heap

On-Heap Memory: Memory on the worker node that is controlled by the executor's JVM process. It is divided into four parts (a worked example follows the list):

  • Reserved Memory: Reserved by Spark for internal purposes.

  • User Memory: For storing the data-structures created & managed by the user’s code.

  • Execution Memory: JVM heap space used by data structures during shuffle operations (joins & aggregations).

  • Storage Memory: JVM heap space reserved for cached/persisted data.
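
A rough sketch of how these regions are carved out under Spark's unified memory manager, assuming a 4 GB executor heap and the default fractions (spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5); the 300 MB reserved memory is a fixed constant in current Spark versions, and the exact accounting can vary by version:

```python
heap = 4 * 1024          # spark.executor.memory, in MB
reserved = 300           # reserved memory, fixed by Spark (MB)

usable = heap - reserved            # 3796 MB available after the reserve
unified = usable * 0.6              # spark.memory.fraction -> execution + storage
user = usable - unified             # user memory for user-defined data structures
storage = unified * 0.5             # spark.memory.storageFraction -> cached data
execution = unified - storage       # shuffle buffers for joins & aggregations

# Note: execution and storage can borrow from each other at runtime
print(f"user={user:.0f} MB, execution={execution:.0f} MB, storage={storage:.0f} MB")
```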

Off-Heap Memory: Memory on the worker node that is not managed by the executor's JVM; it is allocated outside the JVM heap and managed by the operating system, so it avoids garbage-collection overhead.
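
Off-heap memory is disabled by default; a minimal sketch of turning it on (the application name and size are illustrative):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("off-heap-demo")
         # allow Spark to allocate memory outside the JVM heap
         .config("spark.memory.offHeap.enabled", "true")
         # amount of off-heap memory per executor (accepts size suffixes like 1g)
         .config("spark.memory.offHeap.size", "1g")
         .getOrCreate())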

Databricks | Cluster Deployment

Cluster Types

There are two types of clusters in Databricks.

  • All-Purpose / Interactive Clusters: Mainly used to analyze data interactively using Databricks notebooks. They can be manually stopped and restarted, and multiple users can share them for collaborative interactive analysis.

  • Job Clusters: The Databricks job scheduler creates these clusters automatically when a job is run on a new job cluster and terminates them once the job ends. They are intended for fast, robust automated tasks and cannot be restarted.

Databricks supports three cluster modes, based on how the cluster is used.

  • Standard clusters: Suitable for a single user.
  • High Concurrency clusters: Designed to be shared by multiple users running concurrent workloads.
  • Single node clusters: Run the job only on the driver node; no worker nodes are provisioned (see the sketch below).
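
A hedged sketch of creating a single-node all-purpose cluster through the Databricks Clusters REST API; the workspace URL, token, runtime version, and node type are placeholders you would replace with values from your own workspace:

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder workspace URL
TOKEN = "<personal-access-token>"                         # placeholder token

cluster_spec = {
    "cluster_name": "single-node-demo",
    "spark_version": "13.3.x-scala2.12",   # example runtime; pick one from your workspace
    "node_type_id": "i3.xlarge",           # example node type; cloud-specific
    "num_workers": 0,                      # single node: driver only, no workers
    "autotermination_minutes": 30,
    "spark_conf": {
        # these settings mark the cluster as single-node in Databricks
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())   # returns the new cluster_id on success
```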