Databricks with Private Endpoints

Intro

Setting up Databricks with Private Endpoints can be a daunting experience, so I am here to remove some of the ambiguity around the subject and help you get started with securing your Databricks Workspaces.

As organisations continue to adopt cloud technologies, Private Endpoints are becoming an essential requirement to meet security regulations.

However, the implementation of Private Endpoints can vary significantly across different services, and Databricks is no exception.

In this blog I will cover, step by step, how to configure Databricks with Private Endpoints and how to ensure that Databricks can connect to other services that sit behind Private Endpoints, e.g. a Data Lake.

Let’s get into it

The Approaches….

When configuring Private Endpoints with Databricks, you can choose between two approaches: the Standard approach (recommended) and the Simplified approach. Before we proceed, it’s important to understand a few key terms:

  1. Front-end Private Link – Also known as “User to Workspace”, this is the Private Endpoint that users connect through to reach the front end (web application) of the Databricks Workspace.
  2. Back-end Private Link – Also known as “compute plane to control plane”, this allows the Databricks Clusters (Azure VMs) to connect privately to the core services of the Databricks control plane.
  3. Browser Authentication Private Link – This endpoint supports the Front-end Private Endpoint and enables Single Sign-On (SSO) if you are using Microsoft Entra ID to log in to your workspace (which you most likely are).
  4. VNET Injection – This deploys the virtual machines that are created as part of a cluster into your own virtual network, so that they are subject to your network configuration, e.g. traffic routing and firewall rules. This is also essential to ensure Databricks can communicate with services behind Private Endpoints, such as a Data Lake.
  5. Routing Rules – When using VNET Injection, you will need to configure routing rules to allow Databricks to communicate with certain services. These rules can be found here: https://learn.microsoft.com/en-us/azure/databricks/security/network/classic/udr

As mentioned, there are two approaches, Standard and Simplified, so let’s start with the recommended one…

Standard Approach (Recommended)

The architecture of the Standard approach is shown below. It does look complicated, but we are going to break it down by resource group so we can fully grasp what is happening.

VNET HUB

The Hub VNET, also known as the Transit VNET, is the central component in a Hub and Spoke architecture. It serves as the main connectivity point for your Enterprise Scale Landing Zone and should be configured with the following subnets:

  1. Two /26 Subnets delegated to the Databricks Workspace service (Public and Private Subnets) with their own dedicated NSGs and Route Table, using the routes from the Routing Rules link earlier in this blog.
  2. 1 Subnet to host the Browser Authentication Private Endpoint
  3. 1 Subnet to host the Front End Private Endpoint
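To make the subnet layout a little more concrete, here is a minimal sketch that carves the four hub subnets out of one address space using Python's standard ipaddress module. The 10.0.0.0/24 range, the subnet names and the /28 size for the two Private Endpoint subnets are all my own assumptions for illustration; only the two /26 Databricks subnets come from the list above.

```python
import ipaddress

# Hypothetical hub address space; substitute your own range.
HUB_ADDRESS_SPACE = ipaddress.ip_network("10.0.0.0/24")


def plan_hub_subnets(address_space):
    """Carve the four hub subnets described above out of one address space."""
    # Split the space into /26 blocks; the first two go to Databricks.
    blocks = list(address_space.subnets(new_prefix=26))
    if len(blocks) < 3:
        raise ValueError("address space is too small for this layout")
    # Split the third /26 into /28 blocks for the Private Endpoint subnets.
    # The /28 size is an assumption, not a requirement from this blog.
    endpoint_blocks = list(blocks[2].subnets(new_prefix=28))
    return {
        "snet-dbw-public": blocks[0],
        "snet-dbw-private": blocks[1],
        "snet-pe-browser-auth": endpoint_blocks[0],
        "snet-pe-front-end": endpoint_blocks[1],
    }


plan = plan_hub_subnets(HUB_ADDRESS_SPACE)
for name, subnet in plan.items():
    print(f"{name}: {subnet} ({subnet.num_addresses} addresses)")
```

Planning the ranges up front like this makes it easy to confirm nothing overlaps before you create the subnets, NSGs and Route Table in Azure.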

DBW-Hub

This Resource Group will reside in your connectivity subscription (following the Enterprise Scale Architecture) and will host the Databricks Workspace used exclusively for the browser_authentication Private Endpoint.

If you are implementing a multi-region setup, you will need one of these workspaces per region.

However, this is a “dumb” workspace that serves no purpose other than to host the browser_authentication endpoint, so nobody needs access to it. It should be configured with the following properties:

  1. No PIP Enabled – Meaning that the clusters are not deployed with a Public IP Address
  2. VNET Injection into the two /26 subnets created in the VNET
  3. No Databricks NSG Rules
  4. Public Network Access Disabled – Meaning no public connections to the Databricks Workspace front end will be allowed.
  5. 1 Private Endpoint with the target sub-resource browser_authentication, deployed into the Hub (transit) VNET / subnet configured above. This is the Browser Authentication Private Endpoint.

The image above shows an example configuration for your Databricks workspace (this will be similar for all workspaces).

It will need to be configured with public network access disabled and No PIP (no public IP addresses on the cluster Virtual Machines).
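If you deploy with templates rather than the portal, the settings above can be sketched as an ARM-style properties fragment. This is a rough sketch, not a definitive template: the property names (publicNetworkAccess, requiredNsgRules, enableNoPublicIp and the custom VNET parameters) reflect my understanding of the Microsoft.Databricks/workspaces schema, so please check them against the current reference, and the helper functions, VNET ID placeholder and subnet names are hypothetical.

```python
import json


def build_workspace_properties(vnet_id, public_subnet, private_subnet):
    """Return an ARM-style properties fragment for a locked-down workspace."""
    return {
        "publicNetworkAccess": "Disabled",             # Public Network Access Disabled
        "requiredNsgRules": "NoAzureDatabricksRules",  # No Databricks NSG Rules
        "parameters": {
            "enableNoPublicIp": {"value": True},       # No PIP
            # VNET Injection into the two delegated /26 subnets
            "customVirtualNetworkId": {"value": vnet_id},
            "customPublicSubnetName": {"value": public_subnet},
            "customPrivateSubnetName": {"value": private_subnet},
        },
    }


def find_misconfigurations(properties):
    """List the settings that differ from the locked-down configuration."""
    problems = []
    if properties.get("publicNetworkAccess") != "Disabled":
        problems.append("Public Network Access must be Disabled")
    if properties.get("requiredNsgRules") != "NoAzureDatabricksRules":
        problems.append("Required NSG rules must be NoAzureDatabricksRules")
    parameters = properties.get("parameters", {})
    if parameters.get("enableNoPublicIp", {}).get("value") is not True:
        problems.append("No PIP (enableNoPublicIp) must be true")
    for key in ("customVirtualNetworkId", "customPublicSubnetName",
                "customPrivateSubnetName"):
        if not parameters.get(key, {}).get("value"):
            problems.append(f"VNET Injection parameter {key} is missing")
    return problems


# Hypothetical placeholder values; replace with your own resource ID and names.
workspace = build_workspace_properties(
    vnet_id="<hub-vnet-resource-id>",
    public_subnet="snet-dbw-public",
    private_subnet="snet-dbw-private",
)
print(json.dumps(workspace, indent=2))
print(find_misconfigurations(workspace))
```

The same fragment applies to every workspace in this blog, which is why a small check function like this can be handy when you are reviewing several environments.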

Simple so far? Let’s get into configuring our spokes…..

VNET-DATA-DEV-SPOKE

This resource group will host the Virtual Network configured for the Spoke, which needs a subnet setup similar to the Hub Network:

  1. Two /26 Subnets delegated to the Databricks Workspace service (Public and Private Subnets) with their own dedicated NSGs and Route Table, using the routes from the Routing Rules link earlier in this blog.
  2. 1 Subnet to host the Back End Private Endpoint

DBW-DATA-DEV-SPOKE

This resource group is going to host the Databricks workspace that will be used as the main Data Engineering workspace (one per environment typically), and will need to be configured with the following properties:

  1. No PIP Enabled – Meaning that the clusters are not deployed with a Public IP Address
  2. VNET Injection into the two /26 subnets created in the VNET
  3. No Databricks NSG Rules
  4. Public Network Access Disabled – Meaning no public connections to the Databricks Workspace front end will be allowed.
  5. 1 Private Endpoint with the target sub-resource databricks_ui_api, deployed into the Hub (transit) VNET / subnet configured earlier. This is the Front End Private Endpoint.
  6. 1 Private Endpoint with the target sub-resource databricks_ui_api, deployed into the Spoke VNET / subnet configured above. This is the Back End Private Endpoint.

Now, I don’t want to delve too deeply into DNS as I have a separate blog on this, but all of the Private Endpoints must be linked to a single Private DNS Zone for Databricks (privatelink.azuredatabricks.net).
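To pull the Standard approach together, here is a small sketch that models the three Private Endpoints and checks that they all point at the one Private DNS Zone. The endpoint, workspace and VNET names are hypothetical labels I made up for illustration; the sub-resources, the VNET each endpoint lands in and the zone name come from the sections above.

```python
from dataclasses import dataclass

DATABRICKS_PRIVATE_DNS_ZONE = "privatelink.azuredatabricks.net"


@dataclass(frozen=True)
class PrivateEndpoint:
    name: str          # hypothetical resource name
    workspace: str     # workspace the endpoint targets
    sub_resource: str  # target sub-resource
    vnet: str          # VNET that hosts the endpoint
    dns_zone: str      # Private DNS Zone the endpoint is linked to


STANDARD_ENDPOINTS = [
    PrivateEndpoint("pe-browser-auth", "dbw-hub", "browser_authentication",
                    "vnet-hub", DATABRICKS_PRIVATE_DNS_ZONE),
    PrivateEndpoint("pe-front-end", "dbw-data-dev", "databricks_ui_api",
                    "vnet-hub", DATABRICKS_PRIVATE_DNS_ZONE),
    PrivateEndpoint("pe-back-end", "dbw-data-dev", "databricks_ui_api",
                    "vnet-data-dev-spoke", DATABRICKS_PRIVATE_DNS_ZONE),
]


def endpoints_outside_zone(endpoints, zone=DATABRICKS_PRIVATE_DNS_ZONE):
    """Return the names of endpoints that are not linked to the shared zone."""
    return [ep.name for ep in endpoints if ep.dns_zone != zone]


for ep in STANDARD_ENDPOINTS:
    print(f"{ep.name}: {ep.sub_resource} on {ep.workspace} in {ep.vnet}")
print("Endpoints outside the shared zone:", endpoints_outside_zone(STANDARD_ENDPOINTS))
```

Notice that the front-end and back-end endpoints use the same sub-resource on the same workspace; the only thing that distinguishes them is which VNET they are deployed into.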

And that’s it….. once you have this configuration set up, you will now be able to access your Databricks workspace solely through your Private Network!

Simplified Approach

If you are not using a Hub and Spoke style architecture, a simplified approach might be more appropriate. I also recommend this approach if the customer’s environment uses a V-WAN Hub, where you don’t have the ability to deploy a Databricks Workspace.

This approach isn’t vastly different, but there are some slight changes to be aware of…

The main difference is that it assumes you do not have a hub network, so every Private Endpoint lives in the spoke. This approach can be replicated across the three environments (Dev, Test and Prod).

VNET-DATA-DEV-SPOKE

This resource group will host the Virtual Network configured for the Spoke, which is set up in much the same way as in the Standard approach:

  1. Two /26 Subnets delegated to the Databricks Workspace service (Public and Private Subnets) with their own dedicated NSGs and Route Table, using the routes from the Routing Rules link earlier in this blog.
  2. 1 Subnet to host the Front/Back End Private Endpoint and the Browser Authentication Private Endpoint

DBW-DATA-DEV-SPOKE

This resource group is going to host the Databricks workspace that will be used as the main Data Engineering workspace (one per environment typically), and will need to be configured with the following properties:

  1. No PIP Enabled – Meaning that the clusters are not deployed with a Public IP Address
  2. VNET Injection into the two /26 subnets created in the VNET
  3. No Databricks NSG Rules
  4. Public Network Access Disabled – Meaning no public connections to the Databricks Workspace front end will be allowed.
  5. 1 Private Endpoint with the target sub-resource databricks_ui_api, deployed into the Spoke VNET / subnet configured above. This single endpoint acts as both the Front End and the Back End Private Endpoint.
  6. 1 Private Endpoint with the target sub-resource browser_authentication, deployed into the Spoke VNET / subnet configured above.

As mentioned before, I don’t want to delve too deeply into DNS as I have a separate blog on this, but all of the Private Endpoints must be linked to a single Private DNS Zone for Databricks (privatelink.azuredatabricks.net).

And that’s it….. once you have this configuration set up, you will now be able to access your Databricks workspace solely through your Private Network!
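If you want a quick sanity check once everything is deployed, one option is to confirm that your workspace hostname resolves to a private IP address from inside your network. This is only a rough sketch using Python's standard library; the function names are my own, and it says nothing about whether the endpoint behind the address is healthy.

```python
import ipaddress
import socket


def is_private_address(address):
    """Return True if the IP address falls in a private (non-public) range."""
    return ipaddress.ip_address(address).is_private


def resolve_addresses(hostname):
    """Resolve a hostname to its unique IPv4 addresses using the local DNS."""
    infos = socket.getaddrinfo(hostname, 443, family=socket.AF_INET,
                               proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def resolves_privately(hostname):
    """Return True only if every resolved address is private."""
    addresses = resolve_addresses(hostname)
    return bool(addresses) and all(is_private_address(a) for a in addresses)


# Offline demonstration with literal addresses (no DNS lookup involved).
print(is_private_address("10.0.0.4"))
print(is_private_address("8.8.8.8"))
```

To use it for real, call resolves_privately with your own workspace URL hostname from a machine connected to your private network. If it returns False there, the Private DNS Zone links are usually the first thing I would check.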

Wrap Up

To wrap up, we’ve covered the two different approaches you can take to set up Private Link for your Databricks workspace.

I hope you found this blog helpful! Feel free to reach out to me on LinkedIn if you have any questions or problems getting this set up. I’m always happy to help!

JD
