Last Updated: 2024–10-29

Background

AI data engineers creating pipelines targeted at Snowflake should have an easy solution to store the docs they ingest in their initial form. Snowflake Stages are an option that makes sense due to its tight integration with the rest of the Snowflake platform.

Scope of the tutorial

In this tutorial, you will learn how to persist documents of any file format into Snowflake Stages using the PutSnowflakeInternalStageFile processor.

Learning objectives

Once you've completed this tutorial, you will be able to:

Prerequisites

Tutorial video

As an option, you can watch a video version of this tutorial.

Review the use case

After importing a flow definition file that retrieves PDF documents from an AWS S3 bucket, you will configure your flow to persist these same documents in a Snowflake Stage.

Initialize the canvas

Access a Runtime

Log into Datavolo Cloud and access (create if necessary) a Runtime. Alternatively, leverage a Datavolo Server Runtime that you have access to.

Import the template

Download Snowflake_Stages_Tutorial.json and import this flow definition file onto the top-level NiFi Flow canvas.

After importing the template, you should see a process group (PG) named Snowflake_Stages_Tutorial on your canvas. The contents of this new PG will look similar to the following.

Enable controller service

Right click on an empty area of the canvas and select Enable All Controller Services to start up the controller service needed for AWS S3 access.

The ListS3 processor should now show a red square indicating it is available to be started.

There are multiple ways to create a Snowflake Stage, including the CREATE STAGE command. This tutorial will use the Snowflake web UI to accomplish this. You can skip this step if you will be using an existing Stage to persist documents to.

Log into Snowflake

Log into the web UI and ensure you are on the Home page.

Identify schema

Choose Data > Databases from the left navigation menu and then select (create if necessary) the database and schema you will be creating a Stage within. The example below is using LM_DB.LM_SCHEMA.

Create the stage

The Create button in the upper right will surface a pull down menu. Select Stage > Snowflake Managed.

In the Create Stage pop up, set the Stage Name to TUTORIAL, ensure Directory table is enabled, choose Server-side encryption, and then press Create.

Verify the newly created stage is empty.

Admin activities

Navigate to Admin > Warehouses and validate the existence of, or create, the Warehouse you will be using connecting to from Datavolo.

Navigate to Admin > Users & Roles and validate the existence of, or create, the User you will be using to connect with from Datavolo.

The step will guide you through creating & configuring the needed controller services for utilizing Snowflake Stages.

After right clicking on an empty area of the canvas and selecting Controller Services, you will see a page similar to this one.

Private key service

Click the + icon in the screenshot above and then type standardpri in Filter types. Ensure StandardPrivateKeyService is highlighted and click Add.

Click on the vertical ellipsis on the far right of the new controller service list item then choose Edit.

For the Key property, paste in the entire contents of the user's private key (including the BEGIN and END lines) file.

For the Key Password property, type the name of the password you used when creating the private key. Both properties should be identified as Sensitive value set. Click on the Verification check mark and ensure you receive the Component Validation passed message before clicking on Add.

Use the vertical ellipsis to Enable the new controller service.

Snowflake connection service

Click the + icon in the screenshot above and then type snowflakeconn in Filter types. Ensure SnowflakeConnectionService is highlighted.

After pressing Add you will see this new controller service added to the list. You can also see what configurations are missing before the controller service can be enabled.

Use the vertical ellipsis to Edit the controller service's Properties. Set the first few Properties as described below.

Key

Value

Authentication Strategy

Choose Key Pair from the pulldown

Account

Enter your Snowflake account ID

User

Enter the name of the user using key-pair authentication

Private Key Service

Select the controller service previously created from the pulldown

They should appear similar to this now.

The Warehouse, Database Name, and Schema properties are not required as the User may have appropriate defaults, but for this tutorial it is a good idea to set them specifically. Your values will likely be different to those shown below.

Before clicking on Apply, ensure the Verification check reports your properties are valid.

Enable this additional Controller Service and verify that all are reporting Enabled before returning Back to Process Group.

Ensure the Snowflake_Stages_Tutorial PG canvas is visible.

Add processor

Create a new PutSnowflakeInternalStageFile processor on canvas and then:

Configure processor

Double click on the PutSnowflakeInternalStageFile processor and update the following Properties.

Key

Value

Snowflake Connection Service

Select the controller service previously created from the pulldown

Internal Stage Type

Select Named from the pulldown

Stage

Type the name of the Stage you created

Your Properties should appear similar to the following.

Let's see it all work!

Retrieve files

Start the ListS3 and FetchS3Object processors and verify 4 FlowFiles are in the connection going into PutSnowflakeInternalStageFile.

List Queue to see the FILENAME and FILE SIZE of all 4 of these FlowFiles.

Write to Stage

Start the PutSnowflakeInternalStageFile processor and verify that all 4 FlowFiles were routed to the success connection.

Go ahead and Empty Queue as they are not needed anymore in the flow.

Verify Stage writes

On the Snowflake web UI, validate that all 4 files were persisted.

You did it!

Congratulations, you've completed the Persist docs in Snowflake Stages tutorial!

What you learned

What's next?

Check out some of these codelabs...

Further reading

Reference docs