Last Updated: 2024-10-29
AI data engineers building pipelines that target Snowflake need a straightforward way to store ingested documents in their original form. Snowflake Stages are a sensible choice thanks to their tight integration with the rest of the Snowflake platform.
In this tutorial, you will learn how to persist documents of any file format into Snowflake Stages using the PutSnowflakeInternalStageFile processor.
Once you've completed this tutorial, you will be able to:
As an option, you can watch a video version of this tutorial.
After importing a flow definition file that retrieves PDF documents from an AWS S3 bucket, you will configure your flow to persist these same documents in a Snowflake Stage.
Log in to Datavolo Cloud and access (create if necessary) a Runtime. Alternatively, use a Datavolo Server Runtime that you have access to.
Download Snowflake_Stages_Tutorial.json and import this flow definition file onto the top-level NiFi Flow canvas.
After importing the template, you should see a process group (PG) named Snowflake_Stages_Tutorial on your canvas. The contents of this new PG will look similar to the following.
Right-click on an empty area of the canvas and select Enable All Controller Services to start the controller service needed for AWS S3 access.
The ListS3 processor should now show a red square indicating it is available to be started.
There are multiple ways to create a Snowflake Stage, including the CREATE STAGE command. This tutorial uses the Snowflake web UI. You can skip this step if you will be persisting documents to an existing Stage.
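If you prefer SQL, the equivalent of the web UI steps can be run in a Snowflake worksheet. This is a sketch that follows the LM_DB.LM_SCHEMA and TUTORIAL names used as examples in this tutorial:

```sql
-- Run in a Snowflake worksheet; uses the tutorial's example names.
USE SCHEMA LM_DB.LM_SCHEMA;

-- Internal (Snowflake-managed) stage with a directory table
-- and server-side encryption enabled.
CREATE STAGE TUTORIAL
  DIRECTORY  = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');
```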
Log into the web UI and ensure you are on the Home page.
Choose Data > Databases from the left navigation menu and then select (create if necessary) the database and schema you will be creating a Stage within. The example below uses LM_DB.LM_SCHEMA.
The Create button in the upper right surfaces a pull-down menu. Select Stage > Snowflake Managed.
In the Create Stage pop-up, set the Stage Name to TUTORIAL, ensure Directory table is enabled, choose Server-side encryption, and then press Create.
Verify the newly created stage is empty.
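You can also confirm the stage is empty from a worksheet, assuming the TUTORIAL stage in LM_DB.LM_SCHEMA from the example above:

```sql
-- Should return zero rows for a freshly created stage.
LIST @LM_DB.LM_SCHEMA.TUTORIAL;
```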
Navigate to Admin > Warehouses and validate the existence of, or create, the Warehouse you will be connecting to from Datavolo.
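A minimal warehouse can also be created in SQL. LM_WH is an assumed name here; the size and timeout values are illustrative:

```sql
CREATE WAREHOUSE IF NOT EXISTS LM_WH
  WAREHOUSE_SIZE = 'XSMALL'  -- smallest size is plenty for this tutorial
  AUTO_SUSPEND   = 60        -- suspend after 60 s idle to save credits
  AUTO_RESUME    = TRUE;     -- wake automatically when Datavolo connects
```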
Navigate to Admin > Users & Roles and validate the existence of, or create, the User you will be using to connect with from Datavolo.
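The user will authenticate from Datavolo with key-pair authentication, which requires an RSA public key registered on the Snowflake user. A sketch, assuming a hypothetical LM_USER and an already-generated key pair (paste the public key body without its BEGIN/END lines):

```sql
CREATE USER IF NOT EXISTS LM_USER
  DEFAULT_WAREHOUSE = LM_WH              -- assumed warehouse name
  DEFAULT_NAMESPACE = LM_DB.LM_SCHEMA;   -- example database.schema

-- Register the public half of the key pair; the private half (and its
-- password) goes into the StandardPrivateKeyService later in this tutorial.
ALTER USER LM_USER SET RSA_PUBLIC_KEY = 'MIIBIjANBgkq...';  -- placeholder
```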
This step will guide you through creating and configuring the controller services needed to use Snowflake Stages.
After right clicking on an empty area of the canvas and selecting Controller Services, you will see a page similar to this one.
Click the + icon in the screenshot above and then type standardpri in Filter types. Ensure StandardPrivateKeyService is highlighted and click Add.
Click on the vertical ellipsis on the far right of the new controller service list item then choose Edit.
For the Key property, paste in the entire contents of the user's private key file (including the BEGIN and END lines).
For the Key Password property, enter the password you used when creating the private key. Both properties should now be identified as Sensitive value set. Click the Verification check mark and ensure you receive the Component Validation passed message before clicking Add.
Use the vertical ellipsis to Enable the new controller service.
Click the + icon in the screenshot above and then type snowflakeconn in Filter types. Ensure SnowflakeConnectionService is highlighted.
After pressing Add you will see this new controller service added to the list. You can also see what configurations are missing before the controller service can be enabled.
Use the vertical ellipsis to Edit the controller service's Properties. Set the first few Properties as described below.
| Key | Value |
| --- | --- |
| Authentication Strategy | Choose Key Pair from the pull-down |
| Account | Enter your Snowflake account ID |
| User | Enter the name of the user using key-pair authentication |
| Private Key Service | Select the previously created controller service from the pull-down |
They should appear similar to this now.
The Warehouse, Database Name, and Schema properties are not required, as the User may have appropriate defaults, but for this tutorial it is a good idea to set them explicitly. Your values will likely differ from those shown below.
Before clicking on Apply, ensure the Verification check reports your properties are valid.
Enable this additional Controller Service and verify that all are reporting Enabled before returning Back to Process Group.
Ensure the Snowflake_Stages_Tutorial PG canvas is visible.
Create a new PutSnowflakeInternalStageFile processor on the canvas and then:
Double click on the PutSnowflakeInternalStageFile processor and update the following Properties.
| Key | Value |
| --- | --- |
| Snowflake Connection Service | Select the previously created controller service from the pull-down |
| Internal Stage Type | Select Named from the pull-down |
| Stage | Type the name of the Stage you created |
Your Properties should appear similar to the following.
Let's see it all work!
Start the ListS3 and FetchS3Object processors and verify 4 FlowFiles are in the connection going into PutSnowflakeInternalStageFile.
List Queue to see the FILENAME and FILE SIZE of all 4 of these FlowFiles.
Start the PutSnowflakeInternalStageFile processor and verify that all 4 FlowFiles were routed to the success connection.
Go ahead and Empty Queue, as the FlowFiles are no longer needed in the flow.
On the Snowflake web UI, validate that all 4 files were persisted.
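Because the stage was created with a directory table, you can also validate this from a worksheet (stage and schema names follow the earlier example):

```sql
-- Sync the directory table with the stage contents, then list the files.
ALTER STAGE LM_DB.LM_SCHEMA.TUTORIAL REFRESH;

SELECT relative_path, size, last_modified
FROM DIRECTORY(@LM_DB.LM_SCHEMA.TUTORIAL);  -- expect one row per file
```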
You did it!
Congratulations, you've completed the Persist docs in Snowflake Stages tutorial!
Check out some of these codelabs...