Shahin's Blog

Tip: Initialize PySpark session with Delta support

And how to use it with AWS Glue

Nov 17, 2021 · 1 min read

Quick Start

Delta's documentation on how to enable it with Python is relatively straightforward: you install the delta-spark package using pip, add the Delta-related configuration, wrap the PySpark builder with a call to configure_spark_with_delta_pip, and then .getOrCreate() your session.
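
A minimal sketch following the quick-start guide (the app name is arbitrary):

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-quickstart")
    # Delta-related configuration from the quick-start guide
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

# configure_spark_with_delta_pip adds the delta-core jar to the session config
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```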

Looking at its code, you'll find that all it does is add a spark.jars.packages entry to your session's configuration, which consequently puts the required Java module on your classpath.
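
In other words, the helper is roughly equivalent to setting the Maven coordinate yourself; the version below is only an example and should match your installed delta-spark release:

```python
# Roughly what configure_spark_with_delta_pip does under the hood: it resolves
# the Maven coordinate for your installed delta-spark version and appends it
# to spark.jars.packages, so Spark fetches the jar at startup.
builder = builder.config(
    "spark.jars.packages",
    "io.delta:delta-core_2.12:1.0.0",  # example coordinate; use your version
)
```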

AWS Glue

This installation approach works on a typical setup; however, when I tried to use it in a script on AWS Glue, I realized the package was not being placed on the classpath, causing a ClassNotFoundException. To make it work, I had to download the desired delta-core jar file from the Maven repository, upload it to S3, and pass its path to the Glue job as the Dependent jars path.
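
For reference, here's a sketch of what the session setup can look like inside the Glue script once the jar is supplied through the job configuration (the S3 path is a placeholder for wherever you uploaded the jar):

```python
# Sketch for a Glue job: the delta-core jar is supplied via the job's
# "Dependent jars path" (e.g. s3://your-bucket/jars/delta-core_2.12-1.0.0.jar),
# so configure_spark_with_delta_pip isn't needed here.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)
```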

PySpark version constraints

At the time of this writing, the Delta package works with PySpark < 3.2. If you try to run it with a newer version, it'll raise the following exception:

java.lang.ClassNotFoundException: org.apache.spark.sql.catalyst.SQLConfHelper

Overall, it's good to make sure your Spark and PySpark versions match, and that both are compatible with the Delta version you're using.
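
One quick sanity check you can drop at the top of a script (the < 3.2 bound reflects the constraint above; adjust it for the Delta release you use):

```python
# Fail fast if the installed PySpark is too new for this Delta release.
import pyspark

major, minor = map(int, pyspark.__version__.split(".")[:2])
assert (major, minor) < (3, 2), (
    f"PySpark {pyspark.__version__} is too new; "
    "this Delta release requires PySpark < 3.2"
)
```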
