Obscure Real Data for Testing

Very often we have a good set of production data but it cannot be used for training or testing because of data confidentiality. And the solution sometimes is just to tweak pieces of the data so it cannot be linked to the original data. Here is one method for obscure data in SQL Server 2005. It is based on a couple techniques:

– Using table with numbers (here generated on the fly by cross joining common table expressions)

– Using the table with numbers to slice each value into individual characters

– Generating random values in the normal character range that will be used to replace the real characters

– Using FOR XML PATH with empty element to concatenate back the obscured value.

CREATE TABLE Patients (

 pname VARCHAR(35),

 pname_masked VARCHAR(35));

 

INSERT INTO Patients VALUES('John Doe', NULL);

INSERT INTO Patients VALUES('Jeff Smith', NULL);

 

WITH

N1 (n) AS (SELECT 1 UNION ALL SELECT 1),

N2 (n) AS (SELECT 1 FROM N1 AS X, N1 AS Y),

N3 (n) AS (SELECT 1 FROM N2 AS X, N2 AS Y),

N4 (n) AS (SELECT ROW_NUMBER()

           OVER(ORDER BY X.n) FROM N3 AS X, N3 AS Y)

UPDATE Patients

SET pname_masked =

    CAST(

    CAST((SELECT CASE

                  WHEN SUBSTRING(pname, n, 1) = ' '

                  THEN ' '

                  ELSE CHAR(x + ASCII(SUBSTRING(pname, n, 1)))

                END

          FROM N4 AS Nums

          CROSS APPLY

          (SELECT TOP(1) T.n AS x

           FROM N4 AS T

           WHERE T.n + ASCII(UPPER(SUBSTRING(pname, Nums.n, 1))) BETWEEN 65 AND 90

           ORDER BY CHECKSUM(NEWID())) AS N

          WHERE Nums.n <= LEN(pname)

          FOR XML PATH('')) AS XML) AS VARCHAR(35));

 

SELECT pname, pname_masked

FROM Patients;

 

-- Results

 

pname        pname_masked

------------ -------------

John Doe    Xwlv Lyi

Jeff Smith   Qmol Yuyvs

1 reply

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply

Your email address will not be published. Required fields are marked *