Data Connections in a Microsoft Azure Synapse Analytics Workspace

In this deep-dive article from one of our Principal Data Engineers, Stephen Connell, we get under the skin of how to make the most of data connections in Microsoft Azure Synapse Analytics Workspace. Stephen uses screenshots to explain his approach and outline why minimising the number of integration datasets is important when developing data solutions.

Making Connections to your data

In a Synapse Analytics workspace, connections to data are made using two components: a Linked Service, which acts much like a connection string for a data store, and one or more Integration Datasets, each of which connects to data within the Linked Service as a named view of that data.
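Behind the Synapse Studio UI, a Linked Service is stored as a JSON definition. As a rough sketch only (the server, database, credential, and runtime names here are placeholders, not taken from this article), a SQL Server Linked Service using a Self-Hosted Integration Runtime might look like:

```json
{
  "name": "LS_AdventureWorks",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Integrated Security=False;Data Source=myserver;Initial Catalog=AdventureWorks2017;User ID=synapse_user",
      "password": {
        "type": "AzureKeyVaultSecret",
        "store": { "referenceName": "LS_KeyVault", "type": "LinkedServiceReference" },
        "secretName": "sql-password"
      }
    },
    "connectVia": {
      "referenceName": "SelfHostedIR",
      "type": "IntegrationRuntimeReference"
    }
  }
}
```

Note that the database (Initial Catalog) is baked into the connection string, which is why every Integration Dataset built on this Linked Service is scoped to that one database.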

Like most things in Azure Synapse Analytics workspaces, creating Integration Datasets is incredibly easy, intuitive, and quick. They can be created on the fly while you are building pipelines or Mapping Data Flows. That very ease and speed of creation can add complexity to your workspace environment, and what seemed like a fast fix can end up slowing down development and making your solution less manageable.

In this blog, I explore why minimising the number of Integration Datasets is a good idea and what patterns we can adopt to minimise their number.

How Connections are configured


Figure 1 A Linked Service to an on-premises SQL Server

A Linked Service connects to a specific store, for example a specific database on a server; in the case of SQL Server, a database name must be provided as part of the Linked Service. Each Integration Dataset built on that Linked Service is then typically generated for a specific table or view within that store. This can lead to a scenario where every table and view you need is accessed through its own Integration Dataset.

Consider a database of the complexity of AdventureWorks2017, with 80+ tables and a couple of dozen views. That could require a vast number of Integration Datasets, which could lead to a high degree of complexity for data engineers in Synapse.


Figure 2 Selecting the Dataset for a copy task.

In this example, I have 19 Datasets. This can make finding the correct one to use difficult, as the available filter options only go so far.


Figure 3 Filter on Source Dataset

A Pattern for Standardising Datasets

How we use Integration Datasets to connect

When considering how we might simplify the approach to Integration Datasets, it helps to consider how these are used within Synapse Analytics and how we might be able to utilise the functionality to our advantage.

Integration Datasets are used in Pipelines and Data Flows. In both cases, there are capabilities that allow us to modify the specified source or destination for our data, and these vary with the nature of the data store. I will consider two approaches: an RDBMS (in this example, SQL Server) and Data Lake Storage. In the latter case, there is a complication: the Integration Dataset stores not only the location but also the type of document that will be created. This is stored within the code of the connection, e.g., "type": "DelimitedText" for CSV files or "type": "Parquet" for Parquet files.
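To illustrate where that type lives, here is a trimmed sketch of a delimited-text dataset's JSON (the dataset and container names are placeholders). A Parquet dataset over the same location would be a separate definition, differing in its "type" and type-specific properties, which is why each file format needs its own Integration Dataset:

```json
{
  "name": "DelimitedTextMycontainer",
  "properties": {
    "linkedServiceName": {
      "referenceName": "WorkspaceDefaultStorage",
      "type": "LinkedServiceReference"
    },
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "mycontainer"
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```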

Azure Storage Datasets

Connecting to source data within Azure Storage, e.g., the workspace's default Azure Data Lake Storage Gen2 account named [WorkspaceName] – WorkspaceDefaultStorage, allows Integration Datasets to point to specific folders and files, while also permitting the use of wildcards and overrides. Let us consider a Parquet-based Integration Dataset that points only at high-level storage.


Figure 4 Parquet Connection with only Container

In the example shown, this Integration Dataset has no Directory or File specified. When this is used in a Copy Data Activity in a pipeline, the activity source settings allow you to specify the location to load data from by using Wildcard details.
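In the pipeline's JSON, those wildcard details appear as store settings on the Copy activity's source. A rough sketch, with illustrative paths that are not from the article:

```json
"source": {
  "type": "ParquetSource",
  "storeSettings": {
    "type": "AzureBlobFSReadSettings",
    "recursive": true,
    "wildcardFolderPath": "landing/sales/*",
    "wildcardFileName": "*.parquet"
  }
}
```

Because the dataset itself only names the container, the same dataset can serve any folder or file that the wildcard resolves to.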


Figure 5 Source Wildcard file path details

When it is used as a Data Flow source, Wildcard paths can similarly be used to specify the location.

Graphical user interface, text, application

Description automatically generated

Figure 6 Wildcard Paths in Data Flow Source Transformation

For Sink activity settings or Sink transformations, the scope for Wildcard paths is not quite as straightforward. In both Copy Data Activities and Data Flow sinks, the destination can be set in the pipeline, but only where a parameter is used. Consider an example: we create a Parquet Integration Dataset, but in this instance, instead of relying upon wildcards, we add a parameter; see below.


Figure 7 A Parquet dataset with a Parameter

Figure 8 Using a Parameter for a File path

We can use this parameter to form part of the path of the file locations, and when this parameter is present within the Integration Dataset, it will be presented for completion whenever the dataset is used in Copy Data Activities or Mapping Data Flows.
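As a sketch of how this fits together in JSON (dataset, container, and parameter names are placeholders), the parameter is declared on the dataset and referenced in the folder path with a dataset expression:

```json
{
  "name": "ParquetMycontainerParam",
  "properties": {
    "linkedServiceName": {
      "referenceName": "WorkspaceDefaultStorage",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "FolderName": { "type": "string" }
    },
    "type": "Parquet",
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "mycontainer",
        "folderPath": { "value": "@dataset().FolderName", "type": "Expression" }
      }
    }
  }
}
```

Wherever the dataset is used, the caller then supplies a value through the dataset reference, e.g. a Copy activity sink might carry `"parameters": { "FolderName": "curated/sales" }` on its output's DatasetReference.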

Figure 9 Copy Parameterized Sink

Figure 10 Data Flow Parameterized Sink

The same would be true of sources used in this manner. Supplying a default value is allowed, and there are pros and cons to doing so. It means that when using this approach for a source alongside wildcards, there is no need to populate the parameter; however, it also means the parameter is always populated, making it possible to miss occasions when it has not been changed to point to the correct location.

Database Datasets

Database datasets offer other options. The first thing to note is that the Wildcard path option is not present; however, this does not rule out parameterising our Integration Datasets. Indeed, there is more scope to do so: we can separately specify values for the schema and the table name. This approach is available for both Pipelines and Data Flows; however, it is important to note that on-premises datasets using Self-Hosted Integration Runtime connections are not available for Data Flows.
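A sketch of such a parameterized database dataset, with both schema and table supplied as parameters (the names here are illustrative, and newer connector versions use separate "schema"/"table" properties rather than a single "tableName"):

```json
{
  "name": "AWSQLParam",
  "properties": {
    "linkedServiceName": {
      "referenceName": "LS_AdventureWorks",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "SchemaName": { "type": "string" },
      "TableName": { "type": "string" }
    },
    "type": "SqlServerTable",
    "typeProperties": {
      "schema": { "value": "@dataset().SchemaName", "type": "Expression" },
      "table": { "value": "@dataset().TableName", "type": "Expression" }
    }
  }
}
```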


Figure 11 RDBMS Parameterized Dataset

In addition to using parameters, database connections let us supply a query or the name of a Stored Procedure. When supplying a query, it is even possible in some cases to read data from another database contained within the same server!
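As a sketch, the Copy activity source can override the dataset's table with a query, which, with three-part names and appropriate permissions, can reach a sibling database on the same server (the database and table names below are illustrative):

```json
"source": {
  "type": "SqlSource",
  "sqlReaderQuery": "SELECT DateKey, FullDateAlternateKey FROM AdventureWorksDW2017.dbo.DimDate"
}
```

The Integration Dataset still connects via the AdventureWorks Linked Service; only the query hops across to the data warehouse database.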


Figure 12 AdventureWorksDW query on an AdventureWorks Connected Dataset

When using Stored Procedures, you can supply parameter values within the Copy Data Activity and Data Flows. See Microsoft's documentation for more details.

An Emerging Pattern

Because we have different ways to specify the inbound and outbound locations of our data, the case for creating a new Integration Dataset for each table and each storage folder becomes weaker. Reviewing the considerations, we can see:

  • Parameters that are present must be included.
    • Default values can be included.
  • Azure Storage requires a different Integration Dataset for each file type (Delimited Text, Parquet, JSON…).
  • Azure Storage datasets will read files within subfolders, so some consideration of the folder structure is required.
  • SQL Server connections can be used to read from, but not write to, other databases on the same server (caveat: security allowing, writing could be accomplished via stored procedures, but those cases are out of scope for this discussion).
  • Azure SQL databases will always require their own Linked Service and Integration Dataset.

Given what we know above, I would recommend a pattern where:

  1. No more than two Integration Datasets are used for any given connection:
    a. a parameterized Dataset for use with sinks, and
    b. a Dataset without parameters for use as a source.
  2. A naming convention should make searching for the right Dataset simple:
    a. File datasets can detail the type of file, the location, and whether the dataset has parameters, e.g., ParquetMycontainerParam.
    b. Database datasets can detail the name of the source, the type of system, and whether the dataset has parameters, e.g., AWSQLNoParam.

Consider the data you are processing and how you will be working with your data stores. If you are only reading from on-premises SQL for example, you might just need one dataset, i.e., a non-parameterized connection. 

Work with your team. Make sure that you are on the same page as each other and that no unnecessary connections are created. Reducing the number of Integration Datasets will simplify your design and make sure you are connecting to the right place more often.

Stephen Connell is a Principal Data Engineer at Sword based in Glasgow.  Stephen has over 20 years’ experience developing solutions for Data and BI Reporting for a range of clients in business, government and the third sector. Sword specialises in data, IT and digital support to several industries including Energy. For further information or to get in contact, please email or call directly on +44 (0)131 300 0709.
