Last Updated: 2024-10-16

Background

The Apache Iceberg table format coupled with an open catalog such as Snowflake's Polaris Catalog allows users to leverage different query engines (such as Apache Spark and Trino) in addition to Snowflake. Datavolo is able to load data into an Iceberg table cataloged by Polaris.

Scope of the tutorial

In this tutorial, you will leverage the PutIcebergTable processor, and appropriate controller services, to convert data into Apache Parquet files and add their records to an Iceberg table cataloged by Polaris.

Learning objectives

Once you've completed this tutorial, you will be able to:

Prerequisites

Tutorial video

As an option, you can watch a video version of this tutorial.

Review the use case

For this tutorial, you will use the publicly available Bluebikes - Hubway dataset. You can read more about Blue Bikes Boston, a bicycle-sharing program that has operated in Boston since 2011.

Data files

The System Data page provides details about their various file formats, including trip data, and links to the publicly available downloads page. This tutorial focuses on the trip data starting in 2024, which has the following attributes.

Pipeline requirements

The dataflow you will construct performs the following steps.

Initialize the canvas

Access a Runtime

Log into Datavolo Cloud and access (create if necessary) a Runtime.

Import the template

Download Tutorial__Iceberg_Polaris.json and import this flow definition file onto the top-level NiFi Flow canvas.

After importing the template, you should see a process group (PG) named Tutorial__Iceberg_Polaris on your canvas. Enter the PG. If you zoom all the way out, the contents of this new PG will look similar to the following.

The processors on the right address the requirements of downloading the file and uncompressing it to retrieve the CSV file of bike trips.

The QueryRecord processor in the upper left of the canvas reduces the record count from more than 160K to a manageable number, making the example data easier to understand and validate.
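For context, QueryRecord works by applying a SQL statement (Apache Calcite syntax) to the incoming records, which it exposes as a table named FLOWFILE. A minimal sketch of the style of query used to trim the dataset follows; the exact statement and limit configured in the template may differ.

SELECT *
FROM FLOWFILE
LIMIT 1000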

To help you focus on the task at hand, the template includes an appropriate Parameter Context and the necessary Controller Services. To initialize them, ensure you are inside the Tutorial__Iceberg_Polaris PG.

Set parameters

Right click on an empty area on the canvas and select Parameters from the contextual menu. You should see the Edit Parameter Context window.

Clicking the vertical ellipsis and then Edit allows you to set your own values for the parameters shown above. The remainder of this step discusses each parameter in a logical grouping, not in the alphabetical order in which the parameters are listed.

iceberg.s3.* parameters

As detailed in Create external cloud storage for a catalog, catalogs created in Polaris persist their Iceberg metadata and data files in external cloud object storage. For this tutorial, please use an AWS S3 bucket and set the iceberg.s3.access-key-id and iceberg.s3.secret-access-key values appropriately.

iceberg.polaris.warehouse

The warehouse name is identified as a catalog name in the Polaris Catalog UI (refer to the Prerequisites section of the Tutorial overview step for access options). If your catalog uses a name other than the default value of runtime-iceberg, update this parameter accordingly.

Refer to Create a catalog if needed.

iceberg.client.region

The Create a catalog documentation states that the "S3 storage bucket (needs to be) in the same region that hosts your Snowflake account". Change the default value of us-east-1 if needed.

iceberg.polaris.client-* parameters

As detailed in Configure a service connection, iceberg.polaris.client-id and iceberg.polaris.client-secret should be set to the client ID and client secret values generated when you configured that connection.

iceberg.namespace

The namespace name can be found underneath the catalog name in the Polaris Catalog UI. If your namespace uses a name other than the default value of development, update this parameter accordingly.

Refer to Organize catalog content if needed.

iceberg.table

Leave the default value of init_test. The table does not need to be created in advance; the processor you will use later will create it if it does not exist.

iceberg.polaris.*.uri parameters

For iceberg.polaris.authorization.server.uri and iceberg.polaris.catalog.uri, replace the SNOWFLAKE-ACCOUNT-IDENTIFIER portion of the default values with your Snowflake account identifier.

Refer to Account identifiers if needed.
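As a purely illustrative example, assuming the template defaults use the standard account-identifier.snowflakecomputing.com host format and using a hypothetical account identifier of myorg-myaccount (the existing path portion of each URI, represented by "...", stays as-is):

https://SNOWFLAKE-ACCOUNT-IDENTIFIER.snowflakecomputing.com/...
becomes
https://myorg-myaccount.snowflakecomputing.com/...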

After updating these values, click on Apply and then Close. Click Back to Process Group to return to the canvas inside the Tutorial__Iceberg_Polaris PG. These parameters will be referenced in future steps of this tutorial.

Start controller services

Right click on an empty area of the canvas and select Controller Services from the contextual menu to display the list of Controller Services. For each of the four services present, click the vertical ellipsis and then Enable.

In each Enable Controller Service window, click Enable and then Close. Verify that all four report a STATE of Enabled, then click Back to Process Group.

These controller services will be referenced in future steps of this tutorial.

Ensure your navigation breadcrumbs indicate you are inside the Tutorial__Iceberg_Polaris PG.

Fetch the file

Right click on a blank area of the canvas and select Start to start the processors on the right.

Once you see a single FlowFile in the success queue near the top of the flow, Stop the processor that fetched the file (you could also have used Run Once rather than Start) and verify it is no longer running.

See a subset of records

The QueryRecord processor should now be reporting an error about its limited-set relationship not being connected to anything.

Add a PutIcebergTable processor on the canvas and connect the limited-set relationship to it. Notice the QueryRecord processor starts running and a FlowFile moves into the new connection.

Run List Queue on the new queue, then View Content on the single FlowFile available to see the records that you will load into an Iceberg table in the coming steps.

Configure the processor

Notice the initial configuration problems that need to be addressed for the PutIcebergTable processor.

Go to the Properties tab of Edit Processor and use the pull-down menus to select the only options available for the first three properties. These are some of the Controller Services you already enabled.

Use the Parameter Context parameter names you set earlier for the final two properties.
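For reference, NiFi property values reference parameters using the #{...} syntax. A hedged sketch of what the final two properties might look like follows, assuming the processor exposes namespace and table name properties; the exact property labels in PutIcebergTable may differ.

Table Namespace : #{iceberg.namespace}
Table Name      : #{iceberg.table}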

Add two new funnels and connect the success and failure relationships to them so that your flow resembles the following. Verify that your processor shows a red (stopped) square next to its name, indicating it is now valid and available to be started.

Run the processor

Let's do it! Start the PutIcebergTable and verify that a FlowFile is now present in the queue for the success relationship.

Optionally, verify file on S3 bucket

If you have access to view the underlying bucket your Polaris Catalog is using for this table, you should find something similar to this.

The boxed path information comes from Snowflake's POLARIS CATALOG web page.

As you can see, the S3 bucket is identified as the Default base location in Polaris.

The next two "directories" ultimately come from two of the parameter setting activities you did at the beginning of this tutorial.

An Apache Parquet file was created because the PutIcebergTable processor's Iceberg Writer property was set to the ParquetIcebergWriter controller service you Enabled at the beginning of this tutorial.
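For orientation, an Iceberg table typically separates data files from metadata files under the table's base location. A hedged sketch of the kind of layout you might see in the bucket (the bucket name and file names are hypothetical, and the exact prefixes depend on the catalog's Default base location):

s3://your-iceberg-bucket/development/init_test/
    data/        <- Parquet data files written by PutIcebergTable
    metadata/    <- Iceberg table metadata, manifest lists, and manifests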

As the Polaris Catalog™ overview in the Snowflake Documentation identifies, one of the biggest benefits of using an open table format like Apache Iceberg with an open catalog such as Snowflake's Polaris Catalog is that you can BYOQE (Bring Your Own Query Engine), such as Apache Spark, Trino, and, of course, Snowflake.
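For instance, a Spark or Trino session configured against the Polaris REST endpoint (connection configuration omitted here) could read the same table with ordinary SQL. A hypothetical sketch, assuming the catalog has been registered in that engine under the alias polaris:

SELECT * FROM polaris.development.init_test LIMIT 10;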

You cannot run a query against Polaris alone: it is only a catalog, not a query engine. Snowflake itself, of course, is a query engine, and running a query on your Iceberg table through Snowflake is the focus of this step.

Create an external table

Here are the instructions to Query a table in Polaris Catalog using Snowflake. The tutorial will not recreate those steps, but it should be noted that the biggest effort is in Step 1 of those instructions.

Step 3 of the instructions has you execute a CREATE ICEBERG TABLE statement in one of your Snowflake schemas, which essentially creates an external table that masks the fact that the table is not a native Snowflake one.

As an example, using those instructions, I ran commands such as the following for Steps 1 - 3. They are highly abbreviated and shown only as illustrative examples.

Step 1

CREATE OR REPLACE EXTERNAL VOLUME dev_iceberg_external_volume ...

Step 2

CREATE OR REPLACE CATALOG INTEGRATION lm_polaris_int ...

Step 3

CREATE ICEBERG TABLE lm_init_table
  CATALOG = 'lm_polaris_int'
  EXTERNAL_VOLUME = 'dev_iceberg_external_volume'
  CATALOG_TABLE_NAME = 'init_test';
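For orientation only, here is a hedged sketch of what fuller versions of Steps 1 and 2 might look like. The bucket, role ARN, account identifier, and credentials below are hypothetical placeholders, and the linked Snowflake instructions remain the authoritative reference for the exact, current syntax.

-- Step 1 (hypothetical sketch): external volume pointing at your S3 bucket
CREATE OR REPLACE EXTERNAL VOLUME dev_iceberg_external_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'dev-iceberg-s3'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://your-iceberg-bucket/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/your-snowflake-role'
    )
  );

-- Step 2 (hypothetical sketch): catalog integration pointing at the Polaris REST API
CREATE OR REPLACE CATALOG INTEGRATION lm_polaris_int
  CATALOG_SOURCE = POLARIS
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'development'
  REST_CONFIG = (
    CATALOG_URI = 'https://myorg-myaccount.snowflakecomputing.com/polaris/api/catalog'
    WAREHOUSE = 'runtime-iceberg'
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = 'your-client-id'
    OAUTH_CLIENT_SECRET = 'your-client-secret'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
  )
  ENABLED = TRUE;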

Run a query

And just like that... I am able to query my Apache Iceberg table cataloged by Polaris from the standard Snowflake query editor!
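For example, a simple query against the external Iceberg table created in Step 3 (table name as defined above) should return the loaded trip records:

SELECT * FROM lm_init_table LIMIT 10;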

Congratulations, you've completed the Populate Iceberg table using Polaris catalog tutorial!

What you learned

What's next?

Check out some of these codelabs...

Further reading

Reference docs