WTH is Azure Data Share?
The last couple of weeks have seen a flurry of announcements about entirely new Azure services. This is great, but I have found the launch material that goes along with them has made it very hard to understand what these services are actually for! The launch announcements are full of marketing language about empowering business and maintaining control, etc., and very little about what the thing does and how it works. So, to try and bring some clarity, I'm starting this "WTH is" series. I'll provide a summary of what these services are, how they work, and why you might want to use them.
In this first instalment, we are going to look at Azure Data Share, which was announced in preview this week (11th June 2019).
What is Azure Data Share?
Azure Data Share is a Platform as a Service (PaaS) tool that allows you to share data sets using Azure services. It requires no infrastructure deployment and works with other PaaS services in Azure (like Azure Storage). The focus of this service seems to be the big data world, where there is a need to pass around very large sets of data, but the tool will work with any data you can upload. Users you share with receive a copy of the shared data set(s), which they can then use as they wish.
Data Share allows you to share data with specific users, either in your organization or in other organizations, so long as they are using Azure. You can share data as it stands at a particular point in time and then, if you wish, push regular updates to those who have subscribed to it.
Data Share relies on other Azure services to provide the backend for storing the data. At present, in the preview, this is limited to Azure Storage and Azure Data Lake Storage (Gen1 and Gen2).
Data Share is a push/subscriber-based data sharing tool. As you will see below, when you share data, you are pushing a copy of the data to the user you are sharing with; they are not accessing your data directly.
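To make the copy semantics concrete, here is a toy Python model. It is not the Data Share API: `take_snapshot` and both dictionaries are made up for illustration, with plain dicts standing in for the provider's and the recipient's storage accounts.

```python
import copy

def take_snapshot(provider_store: dict) -> dict:
    """Return an independent copy of the provider's data (hypothetical helper)."""
    return copy.deepcopy(provider_store)

# Plain dicts stand in for storage accounts in this sketch.
provider_store = {"sales.csv": ["row1", "row2"]}
consumer_store = take_snapshot(provider_store)

# The provider keeps changing its own data after the snapshot...
provider_store["sales.csv"].append("row3")

# ...but the consumer's copy is untouched until the next snapshot.
print(consumer_store["sales.csv"])   # ['row1', 'row2']

consumer_store = take_snapshot(provider_store)
print(consumer_store["sales.csv"])   # ['row1', 'row2', 'row3']
```

The point of the sketch is simply that the recipient works on their own copy, and only sees the provider's changes when a new snapshot is taken.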
How does Azure Data Share work?
As a creator of data, if you wish to share this using Data Share, you would first create a Data Share account, and then create a Data Share inside that account. Data Shares are the things that you would share with the recipients of the data, and you can create multiple Data Shares and share them with different users.
Once you have a Data Share created, you then create one or more Datasets. A Dataset describes the actual data you wish to share and points to the Azure Storage or Azure Data Lake Storage location where the files reside. More sources will be added in the future. You can select specific containers or files in these locations to be part of your Dataset.
Once you have created your Dataset, you are ready to share it. To do this, you use the Recipients option of the Data Share to add the email addresses of the users you wish to share the data with. At this point, you can also enable a schedule if you would like to refresh the data at specific intervals.
Once you do this, an email will be sent to the user(s) you want to share with, inviting them to access this Data Share. To accept, the user must have access to an Azure subscription and a storage account. When the user accepts the invitation, a wizard will guide them through the process of creating a Data Share account of their own. Once the account is created, they designate a storage account where the received data will be stored. If you set up a schedule on the Data Share, then at this point the user can also elect to use that schedule.
At this point, the data has not yet been copied. The user goes to their Data Share and uses the "trigger snapshot" option to pull in the data from the source. Once that completes, the data is in the user's storage account.
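The flow described above can be sketched as a small in-memory model. This is only an illustration of the sequence of steps, assuming nothing about the real service's API: the `Share` and `Subscription` classes, the `accept` function, and the example email address are all hypothetical, and a dict stands in for the recipient's storage account.

```python
from dataclasses import dataclass, field

@dataclass
class Share:
    """Provider side: a named share holding datasets and invitations (hypothetical)."""
    name: str
    datasets: dict = field(default_factory=dict)      # dataset name -> list of files
    invitations: dict = field(default_factory=dict)   # email -> "pending" / "accepted"

    def add_dataset(self, name, files):
        self.datasets[name] = list(files)

    def invite(self, email):
        self.invitations[email] = "pending"

@dataclass
class Subscription:
    """Recipient side: links an accepted share to a target store (hypothetical)."""
    share: Share
    email: str
    target_store: dict = field(default_factory=dict)  # stands in for a storage account

    def trigger_snapshot(self):
        # Copy every dataset from the share into the recipient's own store.
        for name, files in self.share.datasets.items():
            self.target_store[name] = list(files)

def accept(share, email):
    """Accept a pending invitation and return the recipient's subscription."""
    if share.invitations.get(email) != "pending":
        raise ValueError(f"no pending invitation for {email}")
    share.invitations[email] = "accepted"
    return Subscription(share, email)

# Provider: create the share, add a dataset, invite a recipient.
share = Share("q2-figures")
share.add_dataset("sales", ["sales.csv"])
share.invite("analyst@example.com")

# Recipient: accept the invitation; nothing has been copied yet.
sub = accept(share, "analyst@example.com")
print(sub.target_store)   # {}

# Only triggering a snapshot brings the data across.
sub.trigger_snapshot()
print(sub.target_store)   # {'sales': ['sales.csv']}
```

Note how accepting the invitation and receiving the data are two separate steps, which mirrors the "trigger snapshot" stage in the walkthrough.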
Why would I want to use Azure Data Share?
Data Share provides a way to simplify the process of sharing large data sets. Previously, you would have had to use something like FTP to upload the files, or a file-sharing tool like OneDrive or Dropbox. This can be problematic when you are using very large data sets, and it often goes against company security policies. Using these tools also makes refreshing the data painful. By using Data Share, you can make data available quickly to other Azure users, without needing to copy your data anywhere yourself.
With Azure Data Share, you can share the data with a few clicks, so long as the user you are trying to share with has access to an Azure subscription and storage account. The copying and updating of the data is handled for you and transits the Microsoft backbone for best performance. Using Data Share also means that information is encrypted in transit and at rest (assuming you keep the default to enable encryption on storage).
Because you share the data from where it already lives, you keep a single source of truth for your datasets. There is no need to move them around or create staging copies on your side, assuming your data already resides in one of the supported services.
From a security perspective, Data Share allows you to share data with specific people by email address, each of which must belong to a valid Azure account. The monitoring aspects of Data Share also enable you to keep track of who you have shared the data with. This includes seeing whether they have accepted your invitation, whether they have synced the data, and whether they have updated recently.
What issues does Azure Data Share have?
First off, Data Share is in preview, so the usual preview caveats apply. This also means that the services it supports are limited at this time (Azure Storage and Azure Data Lake Storage), but expect this to change as time goes on.
The second big issue is that the recipient needs access to an Azure subscription, and must either have permission to create the Data Share account and storage account or be able to get someone to do this for them. This might not be an issue for those already using Azure, but if you need to share data with people using other cloud providers or working solely on-premises, then this won't do what you want.
While you can see who you have shared your data with and whether they have accepted and downloaded it, you do not have the option to rescind a share and remove that data. You can stop sending updates and delete the share, but the data the user has already downloaded will not be affected. It's debatable how useful a rescind option would be anyway, as there is nothing to stop a user copying the data off to another storage account or downloading it to their machine. Removing the data from the share would never guarantee the user didn't have the data any more. If you need this sort of protection for your data, then you would need to look at other services like Azure Information Protection.
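As a rough illustration of that limitation, the toy model below shows what deleting a share can and cannot do. Everything here is hypothetical (the `Share` class, its `delete` method, and the `sync` function are invented for this sketch and are not the service's API); a dict again stands in for the recipient's storage account.

```python
import copy

class Share:
    """Hypothetical stand-in for a provider's share; not the real service API."""
    def __init__(self, data):
        self.data = data
        self.active = True

    def delete(self):
        # Deleting the share stops future syncs; it cannot reach into
        # the recipient's own storage.
        self.active = False

def sync(share, consumer_store):
    """Copy the share's current data into the consumer's store.

    Returns False (and copies nothing) if the share has been deleted.
    """
    if not share.active:
        return False
    consumer_store.update(copy.deepcopy(share.data))
    return True

share = Share({"report.csv": "v1"})
consumer_store = {}
sync(share, consumer_store)          # consumer now holds v1

share.delete()
share.data["report.csv"] = "v2"      # provider-side change after deletion

print(sync(share, consumer_store))   # False
print(consumer_store)                # {'report.csv': 'v1'}
```

The recipient stops receiving updates, but the copy they already hold stays exactly where it is.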
Data Share is intended as a tool for sharing datasets in one direction, from provider to consumer. If you need to work on and share the data in both directions interactively, then this service will not do what you require.
If you are interested in trying out Data Share, you can follow the tutorials here: