Basic concepts
DataSHIELD is an R-based software solution for federated analysis, that is, the remote analysis of multiple data sources. It allows sophisticated analyses to be run without the user being able to view or copy individual-level data. Instead, only non-disclosive summary statistics are returned. This makes it an effective solution for secure data science collaborations.
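To make the idea of a non-disclosive summary concrete, here is a toy sketch in base R. It is not DataSHIELD's own code: the function name `safe_mean` and the threshold `min_n` are hypothetical, chosen only to show how a summary might be withheld when too few observations are available.

```r
# Illustrative only: a toy disclosure check, not DataSHIELD's implementation.
# min_n is a hypothetical threshold chosen for this sketch.
safe_mean <- function(x, min_n = 5) {
  x <- x[!is.na(x)]
  if (length(x) < min_n) {
    stop("Too few observations to return a summary safely")
  }
  # Only aggregates leave the function; individual values are never returned
  list(mean = mean(x), n = length(x))
}

ages <- c(34, 41, 29, 52, 47, 38)
safe_mean(ages)  # returns the mean and the number of valid observations
```

The caller receives an aggregate and a count, and a request over a very small group fails rather than returning something that could identify an individual.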
To use DataSHIELD, additional software is required to store data and manage user interaction. There are currently two solutions for this, Armadillo and Opal, which are compatible and can be used together within the same network. Below is an example of a simple setup:
Example setup 1
A user writes analysis commands in R, using client-side DataSHIELD R packages. The commands are sent to the biobank servers. For Armadillo, the communication between the client and the server is handled by the R packages DSI and DSMolgenisArmadillo, whilst data storage and the execution of commands on the server are handled by ArmadilloService. The non-disclosive summary statistics are then returned to the user.
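The flow described above might look roughly like the following on the client side. This is a sketch under assumptions: the wrapper `run_remote_mean`, the server label `cohort1`, and the URL, table path and variable name in the usage comment are all placeholders, and the DSI, DSMolgenisArmadillo and dsBaseClient packages are assumed to be installed.

```r
# Sketch only: URL, table path and variable name are placeholders.
run_remote_mean <- function(url, table, variable) {
  # Attaching DSMolgenisArmadillo makes the "ArmadilloDriver" class available to DSI
  library(DSMolgenisArmadillo)

  # Obtain an access token from the Armadillo server (interactive login)
  token <- armadillo.get_token(url)

  # Describe the server to connect to ("cohort1" is an arbitrary label)
  builder <- DSI::newDSLoginBuilder()
  builder$append(
    server = "cohort1",
    url = url,
    token = token,
    table = table,
    driver = "ArmadilloDriver"
  )

  # Log in and assign the table to the symbol "D" on the server
  conns <- DSI::datashield.login(
    logins = builder$build(),
    assign = TRUE,
    symbol = "D"
  )
  # Always close the connections, even if the analysis fails
  on.exit(DSI::datashield.logout(conns), add = TRUE)

  # Only the summary statistic travels back to the client
  dsBaseClient::ds.mean(paste0("D$", variable), datasources = conns)
}

# Hypothetical usage:
# run_remote_mean("https://armadillo.example.org", "project/folder/table", "age")
```

Several servers can be queried in one session by calling `builder$append()` once per server before building the login object.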
Example setup 2
An alternative setup involves the user first connecting to a Central Analysis Server (CAS), which is an online RStudio environment:
Once logged in to the CAS, users write their code as if they were running RStudio locally. The advantage of this setup is that biobank servers can be placed behind a firewall that blocks them off from the rest of the internet, so that they can only be accessed from the CAS. This provides an additional layer of data protection. It also benefits users, as all required DataSHIELD R packages can be pre-installed, removing the need for users to set up their own R environment.
Resources
An additional, optional feature of DataSHIELD is the ability to host files elsewhere (e.g. on computer clusters) and link them to the data held on Armadillo or Opal servers. This is implemented using the resourcer package. External resources can be used alongside data stored in Armadillo itself, and resources may be hosted in different locations and formats.
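As a rough illustration of how a resource could be used from the client, the sketch below assigns a resource on the server and coerces it to a data frame. It assumes an existing set of DataSHIELD connections (`conns`), a resource that resolves to tabular data, and that the server-side function `as.resource.data.frame` is available. The wrapper name `load_resource_as_table` and the resource path are hypothetical.

```r
# Sketch only: "conns" is an existing set of DataSHIELD connections and
# the resource path passed in is a placeholder.
load_resource_as_table <- function(conns, resource_path) {
  # Assign the resource reference to the symbol "res" on each server
  DSI::datashield.assign.resource(
    conns,
    symbol = "res",
    resource = resource_path
  )

  # Coerce the resource to a data frame on the server, stored as "D"
  DSI::datashield.assign.expr(
    conns,
    symbol = "D",
    expr = quote(as.resource.data.frame(res))
  )

  invisible(conns)
}

# Hypothetical usage:
# load_resource_as_table(conns, "project/folder/resource")
```

After this step, "D" can be analysed with the same client-side functions as a table stored on the server itself.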
DataSHIELD packages and their use
Finally, here is a brief summary of the core Armadillo and DataSHIELD packages described in this documentation.