One of the most important choices when creating a Synapse workspace is the network configuration.
Azure lets you use Synapse Analytics as a PaaS service that is accessible over the internet and controlled through firewall and access rules, or as a PaaS service that is no longer reachable over the internet but only from a private environment, which gives you additional security.
As of this writing, creating Synapse in a private environment involves the use of a managed virtual network, a virtual network managed directly by Azure Synapse, with managed private endpoints that provide secure connectivity to the workspace resources.
Only by enabling the managed virtual network is it possible to disable public access and manage the resource privately.
When you create a Synapse workspace with these settings, the following components are created:
1x Azure AutoResolveIntegrationRuntime with sub-type Managed Virtual Network
3x Managed Private Endpoints: one for the integrated datalake, one for the dedicated pool, and one for the on-demand (or serverless) pool.
These Managed Private Endpoints are, in fact, managed.
If we try to open the private endpoint resources that Azure creates on the workspace-integrated resources, access fails.
We have no control over the virtual network that Azure creates and manages behind the scenes.
2x default, non-editable Linked Services that use the Managed Private Endpoints to connect to the resources via the AutoResolveIntegrationRuntime (Managed Virtual Network).
One for datalake storage:
and one for the Synapse dedicated pool:
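Conceptually, these default linked services look roughly like the sketch below. This is an illustrative example, not the exact JSON the workspace generates (the name, URL and authentication details are placeholders); the relevant part is the connectVia reference to the AutoResolveIntegrationRuntime, which routes traffic through the managed virtual network:

```json
{
    "name": "WorkspaceDefaultStorage",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<datalake-account>.dfs.core.windows.net"
        },
        "connectVia": {
            "referenceName": "AutoResolveIntegrationRuntime",
            "type": "IntegrationRuntimeReference"
        }
    }
}
```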
Testing with AutoResolveIntegrationRuntime and Managed Private Endpoints
Now let's create a pipeline with 3 identical copy activities in sequence.
Each copy activity copies a CSV file (sample.csv) placed on the datalake storage integrated in Synapse to a table in a dedicated SQL pool called "demopool".
For the copy we will use PolyBase, a high-throughput technology for moving large volumes of data into Synapse, with staging directly in the default Azure Synapse datalake.
In this case the file is very small, but we deliberately use PolyBase because, by splitting the copy into two phases, it is the scenario that suffers most from queue times.
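For reference, the core of a copy activity configured this way might look roughly like the following sketch; the dataset and linked service names (DS_SampleCsv, DS_DemopoolTable, WorkspaceDefaultStorage) and the staging path are hypothetical, while allowPolyBase and enableStaging/stagingSettings are the settings that produce the two-phase copy:

```json
{
    "name": "Copy sample csv to demopool",
    "type": "Copy",
    "inputs": [ { "referenceName": "DS_SampleCsv", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "DS_DemopoolTable", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": {
            "type": "SqlDWSink",
            "allowPolyBase": true
        },
        "enableStaging": true,
        "stagingSettings": {
            "linkedServiceName": {
                "referenceName": "WorkspaceDefaultStorage",
                "type": "LinkedServiceReference"
            },
            "path": "staging"
        }
    }
}
```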
Let's run the pipeline, observing both the progress and the temporary staging container.
As you can see, the pipeline starts, but after 1 minute the first copy activity is still in the "Queue" state:
The status changes to "In Progress" after approximately 1:20 min and Synapse copies the temporary files into the temporary staging container.
After the copy to staging (3 seconds of transfer), the second phase of the copy begins, from staging to the Synapse pool table, and we have another queue of more than 1 minute:
After 1:34 min in queue the transfer is instantaneous and the copy activity completes with a duration of 3:00 min, but...
At 4:02 min the pipeline has not yet moved on to the next step; the temporary staging files are still in the container.
The pipeline is waiting for the deletion of the temporary files.
Finally, after deleting the temporary files, the pipeline moves on to the next copy activity:
where, however, the same queues start again:
In conclusion, the pipeline ends with a duration of about 12:00 min, but the copy activities pay a heavy price in queue times.
Note how the actual Transfer phase takes only a few seconds; everything else is time spent in queue.
Why does this happen?
According to the Microsoft documentation, when using a managed virtual network integration runtime, Microsoft does not reserve a compute node for each service instance, so there is a warm-up time every time a copy activity runs (https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-performance-troubleshooting#activity-execution-time-varies-using-azure-ir-vs-azure-vnet-ir).
And this warm-up doesn't just occur between one copy activity and the next, but also within a single activity if you use a staged copy.
This translates into minutes of queue time for copy tasks that would last only a few seconds.
Can you reduce these queue times while keeping the Synapse service private?
Yes: by creating a new Azure IR and changing the Time To Live option, or by using a self-hosted integration runtime instead of the AutoResolveIntegrationRuntime and replacing the Managed Private Endpoints with Private Endpoints configured directly by the user.
Let's see how to do it and the results obtained.
We leave the considerations on the pros and cons of each approach to the conclusions.
Time-to-Live settings and custom Azure IR
If you take a look at the default AutoResolveIntegrationRuntime, you will notice that many of the options under the Advanced panel in the Virtual Network tab are greyed out.
This means that with this default integration runtime you cannot customize the properties available for Copy compute scale (Preview) and Pipeline and external compute scale (Preview).
Just to be clear (this is a bit hard to find in the documentation, but it is available from the pricing calculator):
- Copy compute scale: the IR settings (DIU and TTL) that apply to Copy activities only
- Pipeline and external compute scale: the IR settings (TTL only, at the time of this post) that apply to pipeline activities (Lookup, Get Metadata, Delete, and schema operations during authoring, such as test connection, browse folder list and table list, get schema, and preview data) and to external activities (Databricks, stored procedure, HDInsight activities, Data Lake Analytics U-SQL activity, Azure Synapse Notebook activity, and Custom activity)
The option that interests us is the TTL under Copy compute scale.
If you create a new Azure Managed VNet Integration Runtime, you can change these settings: enable the Time To Live option and choose a period between 5 and 30 minutes.
By doing this, you are asking Azure to keep the compute nodes (for the copy activities only) up and running, so that after the first warm-up period the subsequent copy activities execute without waiting for the warm-up time.
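In JSON terms, such an integration runtime might look roughly like the sketch below; the name and the 15-minute TTL are example values, and copyComputeScaleProperties is, as far as I can tell from the public integration runtime schema, where the Copy compute scale TTL is stored (the exact property names may vary between API versions):

```json
{
    "name": "ManagedVnetIrWithTtl",
    "properties": {
        "type": "Managed",
        "managedVirtualNetwork": {
            "type": "ManagedVirtualNetworkReference",
            "referenceName": "default"
        },
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "copyComputeScaleProperties": {
                    "timeToLive": 15
                }
            }
        }
    }
}
```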
Let's see this in action in our example, associating the new Azure IR to the linked services.
Pipeline starts and first copy activity waits for the compute nodes.
Copy starts:
The first copy completes, but take a look at the second step with staging: the queue time is around 3 seconds now.
And the other copy activities also complete in much less time than before.
This is the first way you can reduce queue times: adjusting the TTL according to your pipeline structure.
Testing with Self-Hosted Runtime and Private Endpoint
First, let's delete the managed private endpoints from the Azure Synapse Studio interface:
Let's create a virtual network:
And an Azure Synapse Analytics private link hub to connect to the Synapse Studio portal.
The key point is to place the private endpoints in the virtual network (or a peered network) where the machine hosting the self-hosted integration runtime resides.
Create 3 more private endpoints, one for each Microsoft.Synapse/workspaces sub-resource target (Sql, SqlOnDemand, and Dev):
The result is 3 private endpoints below the Synapse workspace, each integrated into the virtual network with its own private DNS zone.
Create the private endpoint for the storage account in the same way, using the privatelink.dfs.core.windows.net private DNS zone.
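If you prefer scripting to the portal, an ARM resource for one of these private endpoints might look roughly like this sketch (the endpoint name, virtual network, subnet and workspace names are placeholders); repeat it with groupIds Sql, SqlOnDemand and Dev against the workspace, and with groupId dfs against the storage account:

```json
{
    "type": "Microsoft.Network/privateEndpoints",
    "apiVersion": "2021-05-01",
    "name": "pe-synapse-sql",
    "location": "[resourceGroup().location]",
    "properties": {
        "subnet": {
            "id": "[resourceId('Microsoft.Network/virtualNetworks/subnets', '<vnet-name>', '<subnet-name>')]"
        },
        "privateLinkServiceConnections": [
            {
                "name": "pe-synapse-sql",
                "properties": {
                    "privateLinkServiceId": "[resourceId('Microsoft.Synapse/workspaces', '<workspace-name>')]",
                    "groupIds": [ "Sql" ]
                }
            }
        ]
    }
}
```

The association with the private DNS zones (privatelink.sql.azuresynapse.net for Sql and SqlOnDemand, privatelink.dev.azuresynapse.net for Dev, privatelink.dfs.core.windows.net for the storage) is configured separately, in the portal or with a privateDnsZoneGroups sub-resource.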
At this point, we can create the VM in the same virtual network and install the self-hosted integration runtime:
Since the default linked services cannot be changed, you need two custom linked services that use the self-hosted integration runtime to connect to both the storage and the Synapse pool.
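A minimal sketch of the storage linked service, assuming the self-hosted integration runtime is registered with the hypothetical name SelfHostedIR (the linked service name, URL and authentication details are also placeholders):

```json
{
    "name": "LS_DataLake_SelfHosted",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<datalake-account>.dfs.core.windows.net"
        },
        "connectVia": {
            "referenceName": "SelfHostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}
```

The linked service for the dedicated pool is analogous: type AzureSqlDW with its own connection details, again with connectVia pointing to the self-hosted integration runtime.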
Point the datasets to the new Linked Services:
and do the same in the copy activity's staging settings:
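In JSON, this is just a matter of swapping the linkedServiceName references; for example, the source dataset (reusing the hypothetical names above) might become:

```json
{
    "name": "DS_SampleCsv",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "LS_DataLake_SelfHosted",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "<container>",
                "fileName": "sample.csv"
            }
        }
    }
}
```

The stagingSettings block of the copy activity changes in the same way, referencing LS_DataLake_SelfHosted instead of the default workspace storage linked service.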
Let's launch the pipeline and see the result.
The pipeline completes in about a minute and a half.
And copy activities have much more reasonable queue times.
Conclusions
In this post we have seen how to optimize queue times for copy activities performed in Azure Synapse with public access disabled.
A first consideration concerns the actual need to change the default settings.
If you leave the management of the underlying virtual network, the integration runtime machines, and the private links to Synapse, the architecture is certainly simpler.
If pipelines run once a week, or maybe at night, I might not even need to optimize queue times.
The downside of this option is that the integration runtime under a managed VNet costs much more than the self-hosted IR for data movement and for activities that use the integration runtime, such as Lookup, Get Metadata, Delete, preview data, etc.
And the time-to-live would also have to be factored in if we used data flows.
Here is a one-month price simulation of the two Integration Runtimes, with 2 data integration units, the minimum value.
To the self-hosted runtime you must add the compute cost of the machine while it is up and running.
The self-hosted integration runtime with private endpoints can, however, be the only possible solution when you want to keep Synapse private and you need the fastest possible load times.
By creating your own virtual network and private endpoints, you have complete control over the underlying infrastructure and can integrate Synapse into your Azure private or on-premises network more easily and cost-effectively for ETL activities.