In this deep-dive article from one of our Principal Data Engineers, Stephen Connell, we get under the skin of how to make the most of data connections in a Microsoft Azure Synapse Analytics workspace. Stephen uses screenshots to explain his approach and outline why minimising the number of Integration Datasets is important when developing data solutions.
In a Synapse Analytics workspace, connections to data are made using two components: a Linked Service, which acts much like a connection string for a data store, and one or more Integration Datasets, which connect to data within the Linked Service as named views of that data.
Like most things in Azure Synapse Analytics workspaces, creating Integration Datasets is incredibly easy, intuitive, and quick. They can be created on the fly while you are building pipelines or Mapping Data Flows. However, this very ease and speed of creation can add complexity to your workspace environment, and what seemed like a fast fix can end up slowing down development and making your solution less manageable.
In this blog, I explore why minimising the number of Integration Datasets is a good idea and what patterns we can adopt to minimise their number.
Figure 1 A Linked Service to an on-premises SQL Server
Linked Services connect to a specific store, for example a specific database on a server. In the case of SQL Server, a specific database must be provided as part of the Linked Service. Each Integration Dataset built on that Linked Service then points to a specific table or view within that store. This can lead to a scenario where every table and view in the database is accessed through its own Integration Dataset.
Consider a database of the complexity of AdventureWorks2017, with 80+ tables and a couple of dozen views. That could require a vast number of different Integration Datasets, which could lead to a high degree of complexity for data engineers in Synapse.
Figure 2 Selecting the Dataset for a copy task.
In this example, I have 19 Datasets. This can make finding the correct one to use difficult, as the filter options available only go so far.
Figure 3 Filter on Source Dataset
When considering how we might simplify the approach to Integration Datasets, it helps to consider how these are used within Synapse Analytics and how we might be able to utilise the functionality to our advantage.
Integration Datasets are used in Pipelines and Data Flows. In both cases, there are capabilities that allow us to modify the specified source or destination for our data. These vary depending upon the nature of the data store. I will consider two approaches: an RDBMS (in this example, SQL Server) and Data Lake Storage. In the latter case, there is a complication: the Integration Dataset stores not only the location but also the type of file that will be read or written. This is stored within the code of the connection, e.g. "type": "DelimitedText" for CSV files or "type": "Parquet" for Parquet files.
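To make this concrete, a minimal Parquet Integration Dataset definition looks roughly like the following JSON (dataset and Linked Service names here are hypothetical); note how the file format is baked into the "type" property alongside the storage location:

```json
{
    "name": "DS_Parquet_Raw",
    "properties": {
        "linkedServiceName": {
            "referenceName": "WorkspaceDefaultStorage",
            "type": "LinkedServiceReference"
        },
        "type": "Parquet",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "raw"
            }
        }
    }
}
```

Because the format lives in the dataset rather than the activity, a CSV source and a Parquet source can never share one dataset, even against the same storage account.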
Connecting to source data within Azure Storage, e.g. the workspace's default Azure Data Lake Storage Gen2 account ([WorkspaceName] – WorkspaceDefaultStorage), allows for Integration Datasets that point to specific folders and files, while also permitting the use of wildcards and overrides. Let us consider a Parquet-based Integration Dataset that points only to high-level storage.
Figure 4 Parquet Connection with only Container
In the example shown, this Integration Dataset has no Directory or File specified. When it is used in a Copy Data Activity in a pipeline, the activity's source settings allow you to specify the location to load data from by using Wildcard details.
Figure 5 Source Wildcard file path details
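In the underlying pipeline JSON, those source settings correspond to wildcard properties on the copy activity's store settings, roughly as below (the paths shown are illustrative):

```json
"source": {
    "type": "ParquetSource",
    "storeSettings": {
        "type": "AzureBlobFSReadSettings",
        "recursive": true,
        "wildcardFolderPath": "sales/2023/*",
        "wildcardFileName": "*.parquet"
    }
}
```

The dataset itself stays pointed at the container; the activity decides which folders and files are actually read.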
When it is used as a Data Flow source, Wildcard paths can similarly be used to specify the location.
Figure 6 Wildcard Paths in Data Flow Source Transformation
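In the data flow script behind the source transformation, this appears as a wildcardPaths option; a sketch of what the generated script can look like (transformation name and path are illustrative):

```
source(allowSchemaDrift: true,
    validateSchema: false,
    format: 'parquet',
    wildcardPaths:['sales/2023/*.parquet']) ~> SalesSource
```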
For Sink activity settings or Sink Transformations, the scope for Wildcard paths is not quite as straightforward. In both Copy Data Activities and Data Flow Sinks, the destination can be set from the pipeline, but only where a parameter is used. Consider an example: we create a Parquet Integration Dataset, but in this instance, instead of relying upon wildcards, we add a parameter; see below.
Figure 7 a Parquet dataset with a Parameter
Figure 8 Using a Parameter for a File path
We can use this parameter to form part of the path of the file location, and when this parameter is present within the Integration Dataset, it will be presented for completion whenever the dataset is used in Copy Data Activities or Mapping Data Flows.
Figure 9 Copy Parameterized Sink
Figure 10 Data Flow Parameterized Sink
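Expressed as JSON, a parameterised Parquet dataset of this kind might look as follows (names and container are hypothetical); the parameter is declared on the dataset and consumed as an expression in the folder path:

```json
{
    "name": "DS_Parquet_Param",
    "properties": {
        "linkedServiceName": {
            "referenceName": "WorkspaceDefaultStorage",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "FolderPath": { "type": "string" }
        },
        "type": "Parquet",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "raw",
                "folderPath": {
                    "value": "@dataset().FolderPath",
                    "type": "Expression"
                }
            }
        }
    }
}
```

A Copy Data Activity that writes through this dataset then supplies the value in its dataset reference, for example:

```json
"outputs": [
    {
        "referenceName": "DS_Parquet_Param",
        "type": "DatasetReference",
        "parameters": { "FolderPath": "curated/sales" }
    }
]
```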
The same is true of sources used in this manner. Supplying a default value is allowed, and there are pros and cons to doing so. It means that when using this approach for a source alongside Wildcards, there is no need to populate the parameter; however, because the parameter is then always populated, it is possible to miss occasions when it has not been changed to point to the correct location.
Database datasets offer other options. The first thing to note is that the Wildcard Path option is not present; however, this does not rule out the ability to parameterise our Integration Datasets. Indeed, there is more scope to do so: we can separately specify values for the schema and the table name. This approach is available in both Pipelines and Data Flows, although it is important to note that on-premises datasets using Self-Hosted Integration Runtime connections are not available in Data Flows.
Figure 11 RDBMS Parameterized Dataset
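A parameterised SQL Server dataset of this shape can be sketched in JSON as below (dataset, Linked Service, and parameter names are hypothetical); schema and table are each driven by their own dataset parameter:

```json
{
    "name": "DS_Sql_Param",
    "properties": {
        "linkedServiceName": {
            "referenceName": "LS_AdventureWorks",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "SchemaName": { "type": "string" },
            "TableName": { "type": "string" }
        },
        "type": "SqlServerTable",
        "typeProperties": {
            "schema": { "value": "@dataset().SchemaName", "type": "Expression" },
            "table": { "value": "@dataset().TableName", "type": "Expression" }
        }
    }
}
```

One such dataset can then stand in for every table in the database, with the caller supplying "SchemaName" and "TableName" per activity.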
In addition to using parameters, it is possible with database connections to supply a Query or the name of a Stored Procedure. In the case of supplying a query, it is possible in some cases to query data from another database on the same server!
Figure 12 AdventureWorksDW query on an AdventureWorks Connected Dataset
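In the copy activity's source, this takes the form of a reader query that uses a three-part name to reach the other database (database, schema, and column names here are illustrative):

```json
"source": {
    "type": "SqlSource",
    "sqlReaderQuery": "SELECT DateKey, FullDateAlternateKey FROM AdventureWorksDW2017.dbo.DimDate"
}
```

The dataset remains connected to the AdventureWorks database; only the query text crosses over, so this relies on the Linked Service's credentials having permissions on both databases.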
When using Stored Procedures, you can supply parameter values within the Copy Data Activity and Data Flows. See here for more details from Microsoft.
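In a Copy Data Activity's source, a stored procedure call with parameters can be sketched as follows (the procedure and parameter names are hypothetical):

```json
"source": {
    "type": "SqlSource",
    "sqlReaderStoredProcedureName": "dbo.uspGetSalesOrders",
    "storedProcedureParameters": {
        "OrderYear": { "type": "Int32", "value": "2023" }
    }
}
```

The parameter values themselves can be pipeline expressions, which makes a stored procedure another way to reuse a single dataset across many extracts.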
Because we have different ways to specify the inbound and outbound locations of our data, the case for a new Integration Dataset for each table and each storage folder becomes much weaker.
Given what we know above, I would recommend a pattern where:
Consider the data you are processing and how you will be working with your data stores. If you are only reading from on-premises SQL Server, for example, you might need just one dataset, i.e., a non-parameterized connection.
Work with your team. Make sure that you are all on the same page and that no unnecessary connections are created. Reducing the number of Integration Datasets will simplify your design and make sure you are connecting to the right place more often.
Stephen Connell is a Principal Data Engineer at Sword based in Glasgow. Stephen has over 20 years’ experience developing solutions for Data and BI Reporting for a range of clients in business, government and the third sector. Sword specialises in data, IT and digital support to several industries including Energy. For further information or to get in contact, please email email@example.com or call directly on +44 (0)131 300 0709.