Iceberg on Glue: The table name confusion!

Lately, our team has embarked on a journey to enhance our data storage infrastructure. We invested in Iceberg as our storage layer for data product implementation with transactional capabilities. The intent was clear: to run our applications on AWS Glue, and naturally, we wanted to leverage Glue Catalog for table registration and representation, especially in an unmanaged mode.

In this endeavor, we came across a rather unexpected pitfall compared to other solutions we've dealt with, such as non-transactional table formats or Databricks' Delta Lake.

The Naming Convention Hiccup

Conventionally, the table naming follows the database.table format. While this works seamlessly with many systems, with Iceberg on Glue, this throws an error:

Exception in User Class: org.apache.spark.sql.AnalysisException : org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table mytable. StorageDescriptor#InputFormat cannot be null for table: mytable (Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)

So, how do you solve this? You need to prepend the catalog name and adopt the catalog.database.table format. Hence, the correct query to run would be:

spark.read.table("catalog.database.table")
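To keep the three-part naming consistent across a codebase, a tiny helper can build the qualified name. This is just a sketch: qualify_table and the catalog/database/table names below are illustrative, not part of any Iceberg or Spark API.

```python
def qualify_table(catalog: str, database: str, table: str) -> str:
    """Build the three-part name Iceberg on Glue expects: catalog.database.table."""
    return f"{catalog}.{database}.{table}"

# With a Spark session configured for Iceberg, you would then read via:
# spark.read.table(qualify_table("glue_catalog", "sales", "orders"))
```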

Defining the Catalog Name

The next pertinent question is how to define the catalog name.

Following the official AWS documentation on how to use Glue with Iceberg (read it here), you'll note that you have to feed the following configuration to your Glue job:

spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions 
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog 
--conf spark.sql.catalog.glue_catalog.warehouse=s3://<your-warehouse-dir>/ 
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog 
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO

Here, the Iceberg-capable catalog is named glue_catalog. But it's important to note that this name is not set in stone: it's customizable. You can pick any name that fits your project, or even repeat this setup to define multiple catalogs.
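Because the catalog name recurs in every configuration key, it can help to generate the --conf value programmatically rather than hand-edit five strings. The helper below is a sketch (the function name and the warehouse path are placeholders); it produces the same chained format the AWS documentation shows, where the Glue job takes a single --conf parameter whose value strings the remaining settings together.

```python
def iceberg_conf(catalog_name: str, warehouse: str) -> str:
    """Build the value for a Glue job's --conf parameter, registering an
    Iceberg catalog under a custom name (sketch; paths are placeholders)."""
    settings = [
        "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"spark.sql.catalog.{catalog_name}=org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog_name}.warehouse={warehouse}",
        f"spark.sql.catalog.{catalog_name}.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog",
        f"spark.sql.catalog.{catalog_name}.io-impl=org.apache.iceberg.aws.s3.S3FileIO",
    ]
    # Glue expects one --conf job parameter whose value chains the rest.
    return " --conf ".join(settings)
```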

But here's a peculiarity: this catalog name, although it integrates flawlessly with Iceberg, doesn’t manifest itself when you try:

spark.sql("show catalogs").show()

This query returns an empty result, which can be bewildering: Spark loads configured catalogs lazily, so a catalog defined purely through configuration may not show up until it has actually been used.
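A pragmatic way to verify which catalogs are configured is to inspect the Spark configuration itself for spark.sql.catalog.* keys. The helper below is a sketch that works on any iterable of key/value pairs; in a real job you would feed it spark.sparkContext.getConf().getAll().

```python
def configured_catalogs(conf_pairs) -> set:
    """Extract catalog names from spark.sql.catalog.<name>[...] config keys."""
    prefix = "spark.sql.catalog."
    names = set()
    for key, _value in conf_pairs:
        if key.startswith(prefix):
            # Keep only the catalog name, dropping any trailing sub-keys
            # such as ".warehouse" or ".io-impl".
            names.add(key[len(prefix):].split(".")[0])
    return names
```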

Potential Lake Formation Error

For those who've integrated Lake Formation into their AWS ecosystem, there's another error to watch out for:

pyspark.sql.utils.AnalysisException: spark_catalog requires a single-part namespace, but got...

This message looks like a Spark namespace complaint, but don't be misled: the error typically means your Glue job's role lacks the Lake Formation permissions needed to access the Glue table.
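One way to address it is to grant the job's IAM role access through Lake Formation. The command below is a configuration sketch, not something copied from our setup: the role ARN, account ID, database, and table names are all placeholders you would replace with your own.

```shell
# Sketch: grant the Glue job's IAM role SELECT/DESCRIBE on the table
# via Lake Formation (ARN, database, and table names are placeholders).
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/my-glue-job-role \
  --permissions SELECT DESCRIBE \
  --resource '{"Table": {"DatabaseName": "database", "Name": "mytable"}}'
```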

Closing Thoughts

As with any technology combination, knowing the details up front helps avoid problems. Iceberg on AWS Glue surprised us, mainly around naming conventions, but working through each issue made our system more robust and deepened our understanding.

For those walking a similar path, I hope this insight provides clarity and aids in a smoother integration of these powerful technologies. Happy coding!