Generating Salted Hashes in Power Bi

Table of Contents

Challenge

Earlier this month I was working on another report in Power BI. This time I needed to undertake trend analysis using personally identifying information (PII).

Access to the dataset that is the basis for this report is restricted. A very small number of users are authorised to access it. In contrast, the report is available to a wider number of users. Including the PII in the report is a significant breach of privacy.

I needed to create a way to de-identify the data in such a way that trend analysis is possible. While at the same time maintaining the confidentiality of the data.

Using a hashing function was a good start. The same information would generate the same hash value. It is also impossible to use the hashed value to derive the information passed into it.

Using a hash function wasn’t enough. As this would make the dataset vulnerable to analysis using a rainbow table.

I needed to use a hash function with a salt. I needed a salt that was a long and complex random string. This salt also could not become part of report. As a salt needs to be long, complex, random and secret.

Solution

Power BI has the ability to use scripts written in R software environment. More information on the capability is available here.

Below is the script that I developed. At a high level the script:

  1. Generates a random string.
  2. Uses the random string as a salt to hash the values in a column.
  3. Return the data with the column of hashed values added.

To use this script as a step in processing your datasource you will need the:

# 'dataset' holds the input data for this script

# Load some additional libraries.
# These need to be added to the R installation prior to using this script.
library( digest )
library( stringi )

# Get the length of the dataset.
len <- length( dataset[[1]] )

# Get a 128 character random string to use as the salt for hashing.
rsalt <- stri_rand_strings( 1, 128, pattern = "[A-Za-z0-9!@#$%]" )

# Init an new empty vector to store the generated hash values.
hashes <- c()

# Identify which column has the plain text
plainTextCol <- 1

# Process each of the records.
for ( i in 1:len ) {
    hashes <- append( hashes, digest( paste( dataset[ i, plainTextCol ], rsalt, sep=" " ), "sha256", serialize = FALSE ) )
}

# Add the hashes to the dataset
dataset$hashed_values <- hashes

# Return the processed dataset
result <- dataset

Removal of the column with the PII from the dataset is the next step in the import process. The new column of hashed values supports trend analysis and privacy is maintained.

Benefits

The benefits of this approach are:

  1. The report supports trend analysis by using the hashed values as keys.
  2. Removal of the PII maintains the privacy of the people.
  3. Salted hash values do not support reverse engineering the PII.
  4. The salted hash is random on each refresh, and is not stored in the report.

Final Note

The Power BI service supports the use of R scripts. There is one important caveat. This support is contingent on the use of a personal data gateway. My use case requires scheduled refreshes of the dataset. It cannot rely on a users computer being available. I needed to find an alternate solution.