How to Set Up a Databricks and Snowflake Connection in Python for Efficient Data Transfer: A Step-by-Step Guide

Dhruv Singhal
4 min read · May 6, 2023


Databricks and Snowflake are two of the most popular cloud-based platforms for big data analytics and storage. While Databricks provides a collaborative, cloud-based environment for data engineering, data science, and machine learning, Snowflake offers a scalable, cloud-based data warehouse for fast and secure data storage and retrieval. Combining the power of these two platforms can provide an efficient and effective solution for storing and processing large amounts of data.

In this step-by-step guide, you will learn how to set up a Databricks and Snowflake connection in Python so that you can transfer data seamlessly between the two platforms. Just follow the steps below.

Step 1: Set Up Databricks and Snowflake Accounts

Before you can start using Databricks and Snowflake together, you need to set up accounts for both platforms. If you haven’t already done so, follow the instructions on the Databricks and Snowflake websites to create an account.

Step 2: Install the Snowflake Connector for Databricks

Once you have set up accounts for both platforms, you need to install the Snowflake connector for Databricks. To do this, follow these steps (a quick notebook-based alternative is sketched after the list):

  1. Log in to your Databricks account.
  2. Navigate to the “Workspace” tab.
  3. Click on the “Libraries” button.
  4. Click on the “Install New” button.
  5. Select “PyPI” as the “Source” of the library.
  6. Enter “snowflake-connector-python” as the “Package” name.
  7. Click on the “Install Library” button.
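
If you prefer to manage the dependency from a notebook instead of the Libraries UI, Databricks notebooks also support the %pip magic. A minimal sketch (this installs the package for the current notebook session only):

# Install the Snowflake Python connector for this notebook session
%pip install snowflake-connector-python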

Step 3: Connect Databricks to Snowflake

Once you have installed the Snowflake connector for Databricks, you need to connect Databricks to Snowflake. To do this, follow these steps:

  1. In the Databricks workspace, click on the “Clusters” button.
  2. Click on the “Create Cluster” button to create a new cluster, or select an existing cluster that you want to use.
  3. Under the “Advanced Options” tab, click on the “Init Scripts” tab.
  4. Click on the “Add” button to add a new script.
  5. Enter the following code in the script editor:
%python
import snowflake.connector

# Create a connection object
conn = snowflake.connector.connect(
    user='<your_snowflake_username>',
    password='<your_snowflake_password>',
    account='<your_snowflake_account>',
    warehouse='<your_snowflake_warehouse>',
    database='<your_snowflake_database>',
    schema='<your_snowflake_schema>'
)

# Point the default Spark data source at the Snowflake Spark connector
spark.conf.set('spark.sql.sources.default',
               'net.snowflake.spark.snowflake')

# Connection details for the Spark Snowflake connector
options = dict(sfURL='<your_snowflake_url>',
               sfUser='<your_snowflake_username>',
               sfPassword='<your_snowflake_password>',
               sfDatabase='<your_snowflake_database>',
               sfSchema='<your_snowflake_schema>',
               sfWarehouse='<your_snowflake_warehouse>',
               sfRole='<your_snowflake_role>')

# Store the connection options in the Databricks cluster environment
spark.conf.set('spark.datasource.options', str(options))

6. Replace the placeholders in the code above with your Snowflake account details, such as your username, password, account, warehouse, database, schema, URL, and role.

7. Click on the “Create” button to save the script.
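
Before moving on, you can optionally sanity-check the Python connector itself from a notebook cell. The sketch below assumes the same placeholder credentials used in the script above and simply runs a trivial query:

# Open a connection and run a trivial query to confirm the credentials work
import snowflake.connector

conn = snowflake.connector.connect(
    user='<your_snowflake_username>',
    password='<your_snowflake_password>',
    account='<your_snowflake_account>',
    warehouse='<your_snowflake_warehouse>',
    database='<your_snowflake_database>',
    schema='<your_snowflake_schema>'
)

cur = conn.cursor()
try:
    cur.execute("SELECT CURRENT_VERSION()")
    print(cur.fetchone())  # prints the Snowflake version as a one-element tuple
finally:
    cur.close()
    conn.close()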

Step 4: Verify the Connection

To verify that you have successfully set up the connection between Databricks and Snowflake, you can create a simple test notebook in Databricks that connects to Snowflake and reads some data. Here are the steps to do this:

-> In your Databricks workspace, create a new notebook by clicking on the “Create a blank notebook” button on the home screen.

-> Give your notebook a name and select the cluster you created earlier as the target cluster.

-> In the first cell of the notebook, enter the following code to import the necessary libraries and configure the Snowflake connector:

from pyspark.sql import SparkSession

# On Databricks, a SparkSession named `spark` already exists; getOrCreate() simply returns it
spark = SparkSession.builder \
    .appName("Snowflake Test") \
    .config("spark.driver.extraClassPath", "/mnt/snowflake_jdbc/*:/mnt/snowflake_spark/*") \
    .config("spark.executor.extraClassPath", "/mnt/snowflake_jdbc/*:/mnt/snowflake_spark/*") \
    .getOrCreate()

# Set Snowflake credentials
sfOptions = {
    "sfURL": "<snowflake_url>",
    "sfUser": "<snowflake_user>",
    "sfPassword": "<snowflake_password>",
    "sfDatabase": "<snowflake_database>",
    "sfSchema": "<snowflake_schema>",
    "sfWarehouse": "<snowflake_warehouse>",
    "sfRole": "<snowflake_role>"
}

Make sure to replace <snowflake_url>, <snowflake_user>, <snowflake_password>, <snowflake_database>, <snowflake_schema>, <snowflake_warehouse>, and <snowflake_role> with your actual Snowflake credentials.
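
Hard-coding credentials is fine for a quick test, but for anything shared you may prefer Databricks secrets. A minimal sketch, assuming a secret scope (hypothetically named "snowflake-creds") has already been created with matching keys:

# Pull sensitive values from a Databricks secret scope instead of hard-coding them.
# The scope name "snowflake-creds" and the key names are placeholders for illustration.
sfOptions = {
    "sfURL": dbutils.secrets.get(scope="snowflake-creds", key="url"),
    "sfUser": dbutils.secrets.get(scope="snowflake-creds", key="user"),
    "sfPassword": dbutils.secrets.get(scope="snowflake-creds", key="password"),
    "sfDatabase": "<snowflake_database>",
    "sfSchema": "<snowflake_schema>",
    "sfWarehouse": "<snowflake_warehouse>",
    "sfRole": "<snowflake_role>"
}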

-> In the next cell of the notebook, enter the following code to create a DataFrame from a Snowflake table:

# Create a DataFrame from a Snowflake table
df = spark.read \
    .format("snowflake") \
    .options(**sfOptions) \
    .option("dbtable", "<snowflake_table>") \
    .load()

Make sure to replace <snowflake_table> with the name of a table in your Snowflake database.
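
If you only need a subset of the data, the Snowflake Spark connector also accepts a query option in place of dbtable. A minimal sketch (the query is a placeholder; point it at a table in your database):

# Read the result of a SQL query instead of an entire table
df = spark.read \
    .format("snowflake") \
    .options(**sfOptions) \
    .option("query", "SELECT * FROM <snowflake_table> LIMIT 100") \
    .load()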

-> Finally, in the last cell of the notebook, enter the following code to display the first 10 rows of the DataFrame:

# Display first 10 rows of DataFrame
display(df.limit(10))
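
Note that display() is specific to Databricks notebooks; if you run the same code outside Databricks, plain PySpark works just as well:

# Equivalent check using plain PySpark instead of the Databricks display() helper
df.show(10, truncate=False)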

-> Run the notebook by clicking on the “Run All” button at the top of the notebook. If everything is set up correctly, you should see the first 10 rows of the Snowflake table displayed in the output.

Congratulations! You have successfully set up a connection between Databricks and Snowflake and read data from Snowflake using PySpark. You can now use this connection to perform more advanced data engineering and analysis tasks using Databricks and Snowflake.
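
As a next step, the same connector can also write data back to Snowflake. A minimal sketch, reusing the sfOptions dictionary from the test notebook and a hypothetical target table name:

# Write the DataFrame back to a Snowflake table, appending to it if it already exists
df.write \
    .format("snowflake") \
    .options(**sfOptions) \
    .option("dbtable", "<target_snowflake_table>") \
    .mode("append") \
    .save()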

Any suggestions and comments would be greatly appreciated. If you found this article helpful, please like and share it with others. Don’t forget to follow me to stay up-to-date on my latest articles. Thank you for taking the time to read this!
