GB2488024A

GB2488024A - Pseudonymisation of data values corresponding to user-selected attributes of the data

Info

Publication number: GB2488024A
Application number: GB1201857.8A
Authority: GB
Inventors: David Hill
Original assignee: OKA BI Ltd
Current assignee: OKA BI Ltd
Priority date: 2011-02-02
Filing date: 2012-02-02
Publication date: 2012-08-15
Also published as: GB201101805D0; GB201201857D0

Abstract

In a computer-implemented system, a database comprises a plurality of entries/values and a metadata file describes attributes of those entries. A user interface is configured to display the metadata and to enable user selection of one or more attributes. A first set of selected attributes corresponds to entries/values for which repeated occurrences of an entry/value are to be given different, preferably unrelated, pseudonyms. A second set of selected attributes corresponds to entries/values for which repeated occurrences of an entry/value are to be given the same pseudonym or traceably-related pseudonyms. A code generator generates respective first and second sets of code which, when executed and applied to the database entries, achieves the corresponding pseudonymisation. The code may comprise SQL (Structured Query Language). In an embodiment, the pseudonymised data may be de-pseudonymised to an appropriately authenticated and authorised user.

Description

Intellectual Property Office. Application No. GB 1201857.8. Date: 30 May 2012.

The following term is a registered trademark and should be read as such wherever it occurs in this document: "Microsoft".

Pseudonymisation/de-pseudonymisation of data

Field of the Invention

This invention relates to a computer system or apparatus and method for allowing effective and efficient pseudonymisation and/or de-pseudonymisation of data stored in a data structure (e.g. database). In particular the invention relates to a method for the pseudonymisation of such data and the subsequent presentation in pseudonymised form to one or more related computer systems or system end users.

Background

In the modern world, identifiable data poses a high risk to organizations that have legal and operational duties to protect the confidentiality of personal data in order to mitigate the risk of data misuse and identity theft. Examples of this can be found in local and national government agencies such as the UK National Health Service (NHS). The NHS has experienced a series of data breaches in which personally identifiable data has been compromised. Losses of personally identifiable information of this type are often in breach of the Data Protection Act 1998 (UK).

One method of preventing data loss is to implement a technical measure called pseudonymisation: a reversible process which de-identifies a person within a database system, with only selected individuals in an organisation being able to reverse the pseudonymisation operation (that is, to "de-pseudonymise") in line with the roles and responsibilities of their position within the organisation.

For example, with data such as patient data, parties such as statisticians, information managers etc. will often perform analysis of the data. Various aspects of the data (for example name and address) may need to be kept confidential. The amount of confidential data a person can access may vary according to the person and their need to access data. For example, a general practitioner may be allowed to view all data relating to a patient's file, including the patient's name etc., whereas a statistician who is performing statistical analysis on the dataset would only be allowed to view certain aspects of the data, e.g. gender.

Protection of personal data can be performed by conventional encryption procedures, but this is found to be computationally intensive and could require the entire encrypted data entry to be decrypted before it can be viewed. Furthermore, this becomes even more complicated when two or more persons have different access rights to the data, resulting in further inefficiencies in the data processing.

The present invention provides a computer system or method which produces executable computer code to create a computerised database system which implements pseudonymisation and de-pseudonymisation for any data source which contains personally identifiable data. Its function can be to produce SQL (Structured Query Language) code for the Microsoft SQL Server database platform, for the purpose of data pseudonymisation and de-pseudonymisation. The apparatus or method requires a single input, a "metadata file", which describes the content of the data file which is to be subject to the pseudonymisation and de-pseudonymisation routines of the computer system. This metadata file is structured to define the names of all stored data items in the data source (e.g. a structured file containing system data and column headers).

To at least mitigate one or more of the above problems, according to an aspect of the invention there is provided a computer implemented system for enabling the pseudonymisation of a plurality of database entries, the database entries comprising attributes, the system comprising: a metadata file describing the attributes of the database entries; a user interface configured to display to a user the metadata and enabling the user to select one or more attributes from the metadata; and further configured to enable the user to select a first set of attributes from the metadata of which it is desired to be pseudonymised in a manner so that repeated occurrences of a particular entry/value are given different (preferably unrelated) pseudonyms and to select a second set of attributes from the metadata of which it is desired to be pseudonymised in a manner so that repeated occurrences of a particular entry are given the same pseudonym or traceably related pseudonyms; and a code generator for generating a first set of code, the first set of code when executed and applied to database entries in a data source matching the metadata, pseudonymises the first set of data attributes in the database entries such that repeated occurrences of a particular entry/value are given different (preferably unrelated) pseudonyms and generating a second set of code, the second set of code when executed pseudonymises the second set of data attributes in the database entries such that repeated occurrences of a particular entry are given the same pseudonym or traceably related pseudonyms.

Other aspects of the invention will become apparent from the appended claim set.

Brief description of the figures

Embodiments of the invention are now described, by way of example only, with reference to the accompanying drawings in which:

Figure 1 is a flow chart of the process of pseudonymisation;

Figure 2 is an example of some data which has been pseudonymised using the process of Figure 1;

Figure 3 is a flow chart of the process of depseudonymisation;

Figure 4 is an example of some data which has been depseudonymised using the process of Figure 3;

Figure 5 is a diagrammatic representation of the features which can be created by executing the SQL Scripts produced by the code generator; and

Figure 6 is an illustration of a pseudonymisation code generation process in accordance with the invention.

Detailed description

Figure 1 shows a flow chart of the process of pseudonymisation of data. In the following description the example is given of patient data comprising confidential information being uploaded to a data warehouse and pseudonymised such that it can be accessed by various people e.g. general practitioners, statisticians etc. at a later date.

There is shown the step of the data being sent to a central organisation (in an example a primary care trust or PCT) at step S102; the data loaded into a server at step S104; the data loaded into a data warehouse at step S106; repeat pseudonyms created at step S108; one or more data fields being overwritten with repeat pseudonym data at step S110; one or more different data fields being overwritten with single (unique) pseudonyms at step S114; and the pseudonymised data loaded into the data warehouse at step S116.

Figure 2 shows an example of a dataset which has been pseudonymised according to the process of Figure 1.

There is shown the original dataset 10 referred to in steps S102, S104 and S106 of Figure 1. The original dataset 10 comprises six columns: patient number 12; patient name 14; date of birth 16; gender 18; ethnicity 20; and attendance date 22. There is also shown the second dataset 30 referred to in step S108 of Figure 1, with the additional features of an organisation column 32 and repeat patient pseudonymised number 34. There is shown the third dataset 40, in which the repeat patient pseudonymised number 34 has been overwritten on the NHS number. There is shown the final fourth dataset 50 with pseudonymised name 52, date of birth 54 and gender 56 data.

In further examples, fewer or more columns are used in the initial dataset, and the data contained in the columns may change according to user requirements.

Pseudonymisation

The pseudonymisation system that is created by the invention enables at least the following functionality. Information, supplied in any computerised format and containing personally identifiable data such as patient number 12, name 14, address (not shown), gender 18, ethnicity 20 etc., is loaded into a "pseudonymisation" process at step S106 with the ultimate goal of creating a pseudonymised dataset. The data in the example is patient data as shown in the first dataset 10.

At step S104, the patient data is loaded from the central authority (PCT) onto a server. This occurs using known data protocols. Steps S106 to S114 describe the process of the data being loaded from the PCT server to a data warehouse. The data which has been loaded onto the server at steps S102 and S104 are "clear" data in that none of the data has been encrypted or pseudonymised. Therefore, in order to reduce the risk of loss of personalised data (for example, if the data was subsequently "hacked" from the data warehouse), one or more of the fields from the first dataset 10 are pseudonymised in order to reduce the impact of the loss of data, in particular if the data which is pseudonymised is personal data, e.g. name, address, date of birth etc. The present invention utilises two different types of pseudonymisation: repeat pseudonymisation and unique pseudonymisation. The repeat pseudonym is used in instances where the original data may be unique, for example the patient number data 12, and it is desirable to collate all instances of the unique data.

For example, a patient number 12 will be assigned to all instances of a data entry relating to an individual. By way of example, in the first dataset 10 the first three entries have the same NHS number 12 and name 14, relating to the three different attendance dates 1st, 2nd and 3rd of January 2010. In such instances, it may be desirable to collate all instances of the same person within a dataset and identify them with the same repeating identifiers. In an embodiment it is possible to collate all instances of the same data in the pseudonymised data by depseudonymising the data to its original form, identifying matches, and repseudonymising the data once the matches have been found. This is computationally expensive and accordingly a sub-optimal embodiment.

A more efficient embodiment is to ensure that all instances of the repeated unique entry are pseudonymised in the same manner in order to identify all instances of said person. This is achieved by using the repeat pseudonymisation feature. Advantageously, this also allows for a greater understanding of the pseudonymised data. For example, it allows for the rapid identification of multiple instances of data from the same person. Therefore, at step S108 repeat pseudonyms are identified, thereby allowing the re-use of a given pseudonym multiple times in a dataset. Once a repeat pseudonym has been used, it is stored along with the original data to which it refers in a format similar to the second dataset. Therefore, when further instances of the uniquely identifying data occur (e.g. further patient numbers 12 are entered), these further instances can be checked against the existing assigned pseudonyms to determine whether a repeat pseudonym can be used.
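
The check-then-reuse behaviour described above can be sketched as follows. This is a minimal illustration only, not the generated SQL of the invention: the class name `RepeatPseudonymTable`, the in-memory dictionary standing in for the stored look up data, and the use of random hexadecimal tokens as pseudonyms are all assumptions made for the example.

```python
import secrets


class RepeatPseudonymTable:
    """Hypothetical in-memory stand-in for the stored repeat pseudonyms."""

    def __init__(self):
        self._by_original = {}  # original value -> assigned pseudonym
        self._used = set()      # pseudonyms already handed out

    def pseudonym_for(self, original):
        # Reuse the stored pseudonym if this value has been seen before.
        if original in self._by_original:
            return self._by_original[original]
        # Otherwise assign a fresh token (an assumption; the patent does
        # not say how new repeat pseudonyms are derived).
        pseudonym = secrets.token_hex(6)
        while pseudonym in self._used:
            pseudonym = secrets.token_hex(6)
        self._by_original[original] = pseudonym
        self._used.add(pseudonym)
        return pseudonym


table = RepeatPseudonymTable()
# Three attendances by one patient, then one by a different patient.
attendances = ["1234987534", "1234987534", "1234987534", "5550001111"]
pseudonyms = [table.pseudonym_for(number) for number in attendances]
```

In this sketch the first three entries of `pseudonyms` are identical and the fourth differs, which is the property that lets all records for one person be collated without de-pseudonymising them.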

As shown in Figure 2, the second dataset 30 comprises a column of repeat pseudonyms. The first entry in the first dataset 10 (Mr R Hood) has been assigned a repeat pseudonym of IB2C3B for his unique NHS number 12. This instance of the repeat pseudonym 34 is used for all instances of his unique NHS number 12, as shown in the third dataset 40. In the third dataset 40, the NHS number has been pseudonymised as shown in column 34. The first three entries of the third dataset in column 34 are identical, indicating that the data relates to the same person.

Advantageously, this allows for all instances of the same person (e.g. Mr R Hood) to be easily identified by searching for all instances of the repeat pseudonym in a known manner. In an embodiment, the repeat pseudonymisation occurs via logical joining of the repeat pseudonym dataset with the large dataset which is the subject of the pseudonymisation operation, or via transformation with known join and logic statements in SQL. At step S110, the repeat pseudonym data overwrites the clear data, resulting in the third dataset 40.
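
The logical join can be illustrated with a small, self-contained sketch. It uses Python's built-in `sqlite3` module purely so that the example runs anywhere; the patent targets Microsoft SQL Server, and the table names `attendance` and `repeat_lookup`, the column names and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attendance (patient_number TEXT, attendance_date TEXT);
CREATE TABLE repeat_lookup (original TEXT PRIMARY KEY, pseudonym TEXT);
INSERT INTO attendance VALUES
    ('1234987534', '2010-01-01'),
    ('1234987534', '2010-01-02'),
    ('5550001111', '2010-01-03');
INSERT INTO repeat_lookup VALUES
    ('1234987534', 'a2s3g5jh399'),
    ('5550001111', 'k9d8f7g6h5');
""")

# Join the clear data to the repeat pseudonym table so that the pseudonym
# takes the place of the clear patient number in the result.
rows = conn.execute("""
    SELECT l.pseudonym, a.attendance_date
    FROM attendance AS a
    JOIN repeat_lookup AS l ON l.original = a.patient_number
    ORDER BY a.attendance_date
""").fetchall()
conn.close()
```

Here `rows` holds the attendance records with the clear number replaced by its repeat pseudonym, so the first two rows share the same value in the first column.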

The other type of pseudonym is the single pseudonym, which is used to pseudonymise other forms of data in a unique manner. In the fourth dataset 50, the name 52, date of birth 54 and gender 56 columns have been pseudonymised using a unique pseudonym.

The unique pseudonym is one which is not repeated in the dataset. As shown in Figure 2, each entry in the name column 14 has been assigned a unique pseudonym, resulting in different pseudonyms for the same data. For example, in the fourth dataset 50, the name data 52 relates to the same entry for the first three entries (Mr R Hood); however, this has resulted in three different pseudonyms being applied to the same data. This ensures data safety and security in the event that such data is lost, as it is not easy to reconcile, for example, the name data 52 with the pseudonyms shown.
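
A minimal sketch of this behaviour follows. Note the substitution: the description relies on encryption to produce reversible unique pseudonyms, whereas this example, to stay within the Python standard library, uses random tokens plus a protected reverse map (the hypothetical `vault` dictionary). The function name `single_pseudonymise` is likewise an assumption.

```python
import secrets


def single_pseudonymise(values, vault):
    """Give every occurrence its own pseudonym, even when values repeat.

    `vault` maps pseudonym -> original value so that the operation stays
    reversible for authorised users; it stands in for decryption here.
    """
    pseudonyms = []
    for value in values:
        token = secrets.token_hex(8)
        while token in vault:  # regenerate on the unlikely collision
            token = secrets.token_hex(8)
        vault[token] = value
        pseudonyms.append(token)
    return pseudonyms


vault = {}
names = ["Mr R Hood", "Mr R Hood", "Mr R Hood"]
pseudonyms = single_pseudonymise(names, vault)
recovered = [vault[p] for p in pseudonyms]
```

The three pseudonyms are all different although the underlying name is the same, and `recovered` equals the original list when the reverse map is available.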

In step S112, the single pseudonyms are applied to the third dataset 40, resulting in the fourth dataset 50. As shown in Figure 2, this results in significant portions of the data being pseudonymised, and accordingly such data, if obtained in the pseudonymised form, would essentially be meaningless to a third party who has no knowledge of the pseudonymisation techniques used and no way of determining the original data.

At step S114, the pseudonymised data is loaded into the data warehouse, whereupon it may be accessed by third parties with specific rights to view the data.

Accordingly, the invention provides a system for pseudonymising one or more data items of a dataset, in particular allowing for the use of repeat pseudonyms where the dataset contains unique data. The advantage of such repeat pseudonyms is that several entries which may be related can be easily identified within the pseudonymised dataset.

De-Pseudonymisation

The invention also provides a method for de-pseudonymisation of the dataset. In particular, once the data has been pseudonymised, it may be distributed to one or more relevant persons. Each person may be set different permissions, the permissions defining the person's ability to view de-pseudonymised (i.e. original) data. Figure 3 shows the flowchart of the process of a user requesting identifiable data from the pseudonymised data stored in the data warehouse (i.e. as at step S114).

The invention also creates the de-pseudonymisation process which exists in the pseudonymisation system. This process reverses the pseudonymisation process described in the previous section. Data is requested by an end user of the pseudonymisation system, and the security subsystem (which is not part of the claimed invention, as it is usually administered by the organisation which owns the personal data) responds to the request, granting the end user a gateway to the pseudonymised data, along with a ticket which, when presented to the database, allows that user to de-pseudonymise the personal data. During the de-pseudonymisation process, repeat pseudonyms are looked up in order to recover the identifiable versions of these pseudonyms. Data which has been uniquely pseudonymised is loaded into the decryption engine, which is part of the pseudonymisation system. In an example this is part of commercially available Microsoft software. The invention provides the end user with the ability to select one of a selection of encryption/decryption algorithms.

After the pseudonymised data has been decrypted, it is then sent to the end user in order to be displayed, on screen or maybe on paper. If the user has no permissions to de-pseudonymise the data, either no data or pseudonymised data is sent back to the user, again depending on permissions granted by administrators.

Figure 3 shows the flowchart of the process of de-pseudonymisation. There is shown the step of the data being present in a pseudonymised form in the data warehouse at step S202; any user requesting identifiable data at step S204; the user being identified and credentials issued at step S206; determining if the user possesses credentials to view the data in a de-pseudonymised format at step S208; in the event that the user does not have the required credentials, requesting the pseudonymised data from the data warehouse at step S210; and the pseudonymised data returned to the user via a report or spreadsheet at step S212. In the event that the user does have sufficient credentials at step S208, the process proceeds to step S214, where the request for de-pseudonymised data from the warehouse is sent, and to step S216, where the de-pseudonymised data is returned to the user.

The data is stored in the data warehouse at step S202, which is equivalent to step S114 in Figure 1.

An end user (such as a general practitioner, statistician, nurse etc.) can send requests for information to the data warehouse via a user terminal (not shown). The user terminal in a preferred embodiment is a user interface in which the user inputs their log in credentials, such as user name and password, in a known manner in order to identify themselves to the data handler at the data warehouse. The user's credentials therefore identify the user in a preferably unique manner.

The user log in system may be a thin-shell client or an API, a webpage, or any other suitable form of logging in system.

At step S206, the security subsystem which handles the request from a user for identifiable data verifies the user's log in credentials and determines their access credentials to access de-pseudonymised data. At step S208, the user's credentials are compared to the request for data as inputted at step S204. In such a comparison step, the system determines whether the user possesses sufficient credentials to view the de-pseudonymised data. For example, if a general practitioner were to request data regarding the name, date of birth and patient number of a particular patient(s) then, in the present example, the system at step S208 would identify the general practitioner as having sufficient credentials to view such information. Similarly, if an unauthorised person (such as a statistician) were to ask for de-pseudonymised data regarding, for example, the names and patient IDs from the dataset, they would have insufficient credentials to view such data. The assignment of a user's credentials and their ability to view data may be determined according to the needs of the system.

If at step S208 the system determines that the user does not possess sufficient credentials to view the de-pseudonymised data, the process continues to step S210.

At step S210, the pseudonymised data (i.e. the data as formed and stored at steps S112 and S114) is requested, and it is returned to the user at step S212 in the form of a report or spreadsheet. In other embodiments, the report may be sent to the user in any suitable format.

If the system determines at step S208 that the user possesses sufficient credentials to view the data in the de-pseudonymised format, the process goes to step S214 in which the de-pseudonymised data is requested from the data warehouse. The relevant data is then de-pseudonymised (see Figure 4 for further details) and presented to the user in a suitable format at step S216. The suitable format is in the form of a report or spreadsheet though in further embodiments other formats may be used.

Figure 4 shows an example of the data which has been de-pseudonymised according to the process of Figure 3. The dataset and reference numerals are as for Figure 2.

As shown in the fourth dataset 50, the name 52, date of birth 54 and gender 56 columns have been pseudonymised. A user has requested to view the entirety of the data in a de-pseudonymised form (step S204). The user has been identified at step S208 as having sufficient user credentials to view such data, and accordingly the pseudonymised data has been de-pseudonymised and reverts to the original format. Thus the user (for example, the general practitioner) is enabled to view the entirety of the dataset in a "clear" format.

An important aspect of the invention is the reversibility of the pseudonymisation process. It is particularly useful for the entitled persons to be able to view all forms of the pseudonymised data. Therefore, by using the user credentials at step S204, the system enables the relevant data to be presented to the end user.

In a further example in Figure 4, one or more of the name, date of birth or gender columns would not be de-pseudonymised, based on the user's credentials. For example, a statistician may be allowed to view data regarding date of birth and gender (54, 56) but would not be able to view data regarding name 52. In such examples, the date of birth 54 and gender 56 columns are de-pseudonymised to clear columns of date of birth 16 and gender 18, but the name column 52 remains in a pseudonymised format.
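
This column-by-column behaviour can be sketched as below. It is an illustration under stated assumptions rather than the invention's SQL views: the function `render_row`, the dictionary `clear_lookup` (standing in for whatever look up or decryption step recovers the clear value), the column names and every sample value are invented for the example.

```python
def render_row(row, clear_lookup, permitted_columns):
    """Return a copy of `row` with only the permitted columns in clear form."""
    result = {}
    for column, value in row.items():
        if column in permitted_columns and value in clear_lookup:
            result[column] = clear_lookup[value]  # de-pseudonymise
        else:
            result[column] = value                # leave pseudonymised
    return result


# Invented pseudonyms and clear values, for illustration only.
row = {"name": "p-name-01", "date_of_birth": "p-dob-01", "gender": "p-gen-01"}
clear_lookup = {
    "p-name-01": "Mr R Hood",
    "p-dob-01": "1970-01-01",
    "p-gen-01": "M",
}

statistician_view = render_row(row, clear_lookup, {"date_of_birth", "gender"})
gp_view = render_row(row, clear_lookup, {"name", "date_of_birth", "gender"})
```

With these inputs the statistician's view keeps the name pseudonymised while the general practitioner's view is fully clear.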

Figure 5 is a diagrammatic representation of the features which can be created by executing the SQL Scripts produced by the code generator.

Particularly, the invention provides a method to enable a user to pseudonymise large datasets in an efficient and systematic manner. In particular, as such datasets are typically held in a non-standard manner, it allows the user to easily select which entries are to be pseudonymised.

The invention uses metadata to describe the data to be pseudonymised. The metadata is supplied with the data and describes the data entry and/or fields/columns of data.

The metadata is processed into the memory cache of a computer or server, and this is used to populate a feature which lists all data items or columns in a metadata file (such as a "List Box"). A list of all the items and/or columns of data is displayed to an end user via an interface configured to allow the end user to select which items are to be stored in the database. The selections made are used to create an SQL Script which, when executed, creates a storage area (usually called "database tables" or "database" by data professionals) relating to the selected items or columns. Therefore, the invention provides a simple interface from which the end user can make an initial selection of the data to be uploaded and optionally pseudonymised. For example, the data presented to a user may contain a number of fields, many of which are irrelevant to that user. Therefore, the user can select the data of interest.
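
A minimal sketch of generating a table-creation script from metadata alone is given below. Several details are assumptions, since the description does not fix them: the metadata is taken to be a single comma-separated header line, every column is given the type `NVARCHAR(255)`, and the helper names `read_column_names`, `quote_identifier` and `create_table_script` are hypothetical.

```python
import csv
import io


def read_column_names(metadata_text):
    """Read column names from a one-line, comma-separated header (assumed format)."""
    reader = csv.reader(io.StringIO(metadata_text))
    return [name.strip() for name in next(reader)]


def quote_identifier(name):
    """Bracket-quote an identifier, doubling any closing bracket inside it."""
    return "[" + name.replace("]", "]]") + "]"


def create_table_script(table_name, selected_columns):
    """Build a CREATE TABLE statement for the columns the user selected."""
    column_lines = ",\n".join(
        f"    {quote_identifier(column)} NVARCHAR(255) NULL"
        for column in selected_columns
    )
    return f"CREATE TABLE {quote_identifier(table_name)} (\n{column_lines}\n);"


metadata_text = "PatientNumber,PatientName,DateOfBirth,Gender,Ethnicity,AttendanceDate\n"
available = read_column_names(metadata_text)
# Simulate the user leaving one irrelevant column unselected in the list box.
selected = [column for column in available if column != "Ethnicity"]
script = create_table_script("PatientAttendance", selected)
```

Only the column names are read; no row of the underlying data is needed to produce `script`.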

The interface is configured so that, after making a selection, the user can click a "Next" button and be presented with a new listing feature which lists only the selected items. The interface allows the user to select from this (possibly reduced) list the data items/columns which the user wishes to be subject to a process called "repeat pseudonymisation", which means that each distinct identifiable value loaded into that data item/column will, when pseudonymised, always be given the same pseudonym. For example, an NHS Number may have an original identifiable value of 1234987534, but if this data is pseudonymised it could have a value of a2s3g5jh399. This de-identifies the patient: whilst look up tables may be readily available which equate 1234987534 with a person's name, address etc., the pseudonym a2s3g5jh399 will not be usable to identify the patient without the decryption key/keys and/or processes required to de-pseudonymise the entry. As a "repeat pseudonym", this pseudonym a2s3g5jh399 is reused in the pseudonymised storage area/database for all occurrences of 1234987534 in the original data source.

This aids identifiability without disclosing the data which could be used to identify the patient to individuals who would seek to use this data for criminal purposes, should the database be misappropriated. For example, it allows an individual with access rights to de-pseudonymised medical data relating to a single patient to use the repeated pseudonym to find the relevant records and decrypt them to make them readable. If the pseudonym were not repeated, the entire database might need to be decrypted/de-pseudonymised in order to ensure that all records relating to that patient have been found. This is computationally intensive and allows that person to see other patients' records unnecessarily.

The interface is configured so that after the user selects the "repeat pseudonym" data items, (s)he can click the "Next" button again, and all remaining, non-selected items will be displayed in a further listing feature/list box. This further list box allows the end user to select those columns which are to be pseudonymised but not be subject to the rules of repeat pseudonyms. That is to say, if an "NHS number" column was selected for normal/single pseudonymisation rather than for "repeat" pseudonymisation then, given the NHS Number 1234987534, this would be pseudonymised to a different unique value each time the pseudonymisation routines in the stored procedures which generate pseudonyms encounter a new instance of that value. For example, the aforementioned NHS Number could be present in a data source file three times, and when pseudonymised, the original number would be stored as three separate and distinct values such as "aj3478fg3jjq", "x07hy436ku9" and "sd432k8j4jyy4". This makes those entries less "usable" to users with access rights but also makes it even harder for any hackers to gain access to patient files.

After the user has selected the data items or columns which would be pseudonymised in this way, some final details are entered into the application, such as the name of the data source file which is to be pseudonymised, the name of the data storage table which will be used to collect pseudonymised data, and the name of the folder on the server that will be the target destination for processing. The processing in an embodiment occurs via SQL Scripts that are generated by a Code Generator that forms part of the invention. When the end user has entered all of these details, a "Generate" button may be clicked in order to prompt the generation of SQL Script files, which, when executed in the SQL Server database, will create the following database components for the data source set that has been processed by the code generator.
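
The outcome of the three selection steps (columns to store, columns for repeat pseudonymisation, columns for single pseudonymisation) can be captured in a small structure before any code is generated. The sketch below is one possible shape for that step, not the invention's own implementation; the function `build_plan`, the treatment labels and the column names are assumptions.

```python
def build_plan(stored, repeat, single):
    """Map each stored column to 'repeat', 'single' or 'clear'.

    Raises ValueError if a column is chosen for both treatments or was
    never selected for storage in the first place.
    """
    repeat, single = set(repeat), set(single)
    unknown = (repeat | single) - set(stored)
    if unknown:
        raise ValueError(f"not selected for storage: {sorted(unknown)}")
    overlap = repeat & single
    if overlap:
        raise ValueError(f"selected twice: {sorted(overlap)}")
    plan = {}
    for column in stored:
        if column in repeat:
            plan[column] = "repeat"
        elif column in single:
            plan[column] = "single"
        else:
            plan[column] = "clear"
    return plan


plan = build_plan(
    stored=["PatientNumber", "PatientName", "DateOfBirth", "Gender", "AttendanceDate"],
    repeat=["PatientNumber"],
    single=["PatientName", "DateOfBirth", "Gender"],
)
```

Because the second list box only offers columns not already chosen for repeat pseudonymisation, the overlap check mirrors a constraint that the interface enforces by construction.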

1. A computerised storage file containing SQL code to create the database table which will hold the final dataset (including non pseudonymised data, repeat pseudonymised data and standard pseudonymised data).

2. A computerised storage file containing SQL Code to create SQL stored procedures (these are executable code procedures), which can process data into the three states mentioned in point 1. All pseudonyms created are encrypted variants which are unique and alphanumeric in appearance. As an alternative, the data may be processed into only two states, "standard pseudonymised data" and "other", where the additional work required for repeat pseudonymised data and other non-pseudonymised (clear) data is carried out by the wrapper described below under point 5 and need not be distinguished by the SQL stored procedures.

3. A computerised storage file containing SQL Code to create SQL Views (these are executable code procedures), which will de-pseudonymise data into its original form so as to render personal details identifiable.

4. If repeat pseudonyms are selected from the 2nd list box, 2 code procedures will be created for each selected multiple pseudonym data item or column:

5a. A computerised storage file containing SQL Script to create a look up table to hold repeated pseudonyms and related data (that is, the original identifiable value, the pseudonymised value and a user friendly, truncated value for organizational reporting purposes) so that those with access rights can use this table to search the populated final dataset using the original identifiable value. The look up table can also be used by the wrapper described below (or in an alternative the stored procedures referenced in point 2) to repeat pseudonymise data that has been pseudonymised previously.

5b. A computerised storage file containing SQL Script to create a SQL Stored Procedure which will be used to populate the database table described in point 5a.

6. A computerised storage file containing SQL Code to create a "wrapper" program. The generated code is designed to be embedded into an "ETL" (Extract, Transform and Load) program, typically built using a software tool called "SQL Server Integration Services". This "wrapper" program is the interface which allows communication and transmission of data between the workflow program and the pseudonymisation stored procedures (see 2 and 4b). These stored procedures are called upon in the SQL Server Integration Services (SSIS) workflow program in order to store, transform and pseudonymise data. The SSIS workflow program can be viewed as a "master" program and the stored procedures can be viewed as "slave" programs, controlled and executed by the master. The SQL views mentioned in point 3 can also be only shown on other computers or only held in volatile memory/RAM and not on disk or other non-volatile memory.

Referring to Figure 5, there is shown a diagrammatic representation of the features which can be created by executing the SQL Scripts produced by the code generator.

Figure 6 is an illustration of a pseudonymisation code generation process in accordance with the invention. This process can generate the code described above in point 2.

The scripts shown in Figure 6 for the generation of code or procedure code to provide the stored procedures relevant to pseudonymisation contain all of the functions which are used to create the processes of storage and of de-pseudonymisation. It is to be noted that the scripts generated do not require the data, simply the metadata file which describes the data and which columns are to be pseudonymised (see above with reference to Figure 5).

In a preferred form, the code contained in the target file produced at the end of the process of Figure 5 is applied to an intermediary dataset which has already been subjected to a repeated pseudonymisation procedure, rather than to the original dataset. That is, code for dealing with repeat pseudonymisation (generated by a similar but different code generation process by the code generator) is first executed on the original data, resulting in an intermediary set in which all of the original clear data entries in columns selected to be "repeat" pseudonymised have been swapped for their repeated pseudonyms. These may have been pseudonymised for the first time or, if previously pseudonymised, looked up in the look up table created by the code in point 5a.

The procedure code generated by the process of Figure 6 can then be run row by row on this intermediate dataset. For repeat pseudonym columns the SQL selection statement can then move the data directly from the columns of the intermediate dataset (parameters) into corresponding variables and the SQL insert statement inserts the values from these variables into the columns of the final database. Consequently the SQL selection statement in alternative embodiments may not distinguish between past pseudonymised columns and non-pseudonymised columns.

For the standard pseudonym columns the selection statement specifies the encryption logic using conventional encryption tools such as those provided by SQL Server version 2008. When the generated code is executed on rows of intermediate data the entries under standard pseudonym columns will then be encrypted and stored in the corresponding variable before being inserted from that variable into the corresponding column in the final database.
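The effect of standard pseudonymisation, where repeated occurrences of a value receive different pseudonyms that an authorised key holder can still reverse, can be sketched as below. This is a toy construction assumed for illustration only: it stands in for the database's own encryption tools, uses a keystream derived with HMAC-SHA256 and a random nonce, provides no integrity protection, and should not be used to protect real data. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets

NONCE_LEN = 16  # bytes of fresh randomness per encrypted value


def _keystream(key, nonce, length):
    """Derive `length` pseudo-random bytes from the key and nonce."""
    out = b""
    counter = 0
    while len(out) < length:
        block = nonce + counter.to_bytes(4, "big")
        out += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return out[:length]


def standard_pseudonymise(value, key):
    """Encrypt one entry; a fresh nonce makes every pseudonym different."""
    nonce = secrets.token_bytes(NONCE_LEN)
    data = value.encode("utf-8")
    stream = _keystream(key, nonce, len(data))
    cipher = bytes(a ^ b for a, b in zip(data, stream))
    return (nonce + cipher).hex()


def de_pseudonymise(pseudonym, key):
    """Recover the original entry from a pseudonym, given the key."""
    raw = bytes.fromhex(pseudonym)
    nonce, cipher = raw[:NONCE_LEN], raw[NONCE_LEN:]
    stream = _keystream(key, nonce, len(cipher))
    return bytes(a ^ b for a, b in zip(cipher, stream)).decode("utf-8")
```

Two calls to `standard_pseudonymise` with the same value and key give different strings, while `de_pseudonymise` with that key returns the original value from either.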

Notably the code generation/procedure generation processes such as the one in Figure 6 only need access to the metadata file and not the data content to generate the code.

The generated codes can then be applied to the data content to produce the final pseudonymised database. If the metadata is changed or a new file created the processes can simply be repeated and the newly generated code executed on the updated data source file to update the final database.


Claims

1. A computer implemented system for enabling the pseudonymisation of a plurality of database entries, the database entries comprising attributes, the system comprising: a metadata file describing the attributes of the database entries; a user interface configured to display to a user the metadata and enabling the user to select one or more attributes from the metadata; and further configured to enable the user to select a first set of attributes from the metadata of which it is desired to be pseudonymised in a manner so that repeated occurrences of a particular entry/value are given different (preferably unrelated) pseudonyms and to select a second set of attributes from the metadata of which it is desired to be pseudonymised in a manner so that repeated occurrences of a particular entry are given the same pseudonym or traceably related pseudonyms; and a code generator for generating a first set of code, the first set of code when executed and applied to database entries in a data source matching the metadata, pseudonymises the first set of data attributes in the database entries such that repeated occurrences of a particular entry/value are given different (preferably unrelated) pseudonyms and generating a second set of code, the second set of code when executed pseudonymises the second set of data attributes in the database entries such that repeated occurrences of a particular entry are given the same pseudonym or traceably related pseudonyms.
2. A computer system according to claim 1 configured to generate third code which when executed and applied populates a look up table with at least some of the pseudonyms created by the first code for entries under the attributes selected to be in the second set alongside their values/entries in the data source.
3. A computer system according to claim 2 wherein the at least some of the pseudonyms include any pseudonyms which are used for repeated occurrences.
4. A computer system according to claim 1 wherein the system is configured to generate a storage file containing procedural code to generate a wrapper program to interface between the first and third code procedures, the wrapper preferably comprising the third code procedure.
5. A computer system according to any preceding claim configured to generate the first code procedure so that when it is applied the pseudonymisation of the second set is applied to (preferably all) entries/values in the data source and an intermediate dataset created containing the created pseudonyms for the columns/attributes of the second set and the original values from the data source in the other columns/attributes and pseudonymisation of the second set is applied to the intermediate dataset.
6. A computer system according to any preceding claim wherein the first and/or second set comprises one or more columns/attributes.
7. A computer system according to any preceding claim wherein the generation of the first code procedure involves the generation of code to create a variable/entry field for the columns/attributes of at least the first and second sets, instructions to store created pseudonyms in the variables/entry fields and to be used by the second code procedure to populate the database.
8. A computer system according to any preceding claim further comprising a memory on which the database is stored and a processor on which the generated code procedure is executed to create a pseudonymised dataset.
9. A computer implemented method for enabling the pseudonymisation of a plurality of database entries, the database entries comprising attributes, the method comprising: describing the attributes of the database entries to create metadata regarding the database entries; providing to a user the metadata and enabling the user to select one or more attributes from the metadata; enabling the user to select a first set of attributes from the metadata of which it is desired to be pseudonymised in a manner so that repeated occurrences of a particular entry/value are given different (preferably unrelated) pseudonyms and to select a second set of attributes from the metadata of which it is desired to be pseudonymised in a manner so that repeated occurrences of a particular entry are given the same pseudonym or traceably related pseudonyms; and generating a first set (or procedure) of code, the first set of code when executed and applied to database entries in a data source matching the metadata, pseudonymises the first set of data attributes in the database entries such that repeated occurrences of a particular entry/value are given different (preferably unrelated) pseudonyms and generating a second set of procedure code, the second set of code when executed pseudonymises the second set of data attributes in the database entries such that repeated occurrences of a particular entry are given the same pseudonym or traceably related pseudonyms.
10. A computer implemented method for accessing and enabling the pseudonymisation and de-pseudonymisation of a plurality of database entries, the database entries comprising one or more attributes, the method comprising: receiving a selection of a first set of one or more attributes to be pseudonymised in a manner so that repeated occurrences of a particular entry/value are given different (preferably unrelated) pseudonyms and to select a second set of one or more attributes to be pseudonymised in a manner so that repeated occurrences of a particular entry/value are given the same pseudonym or traceably related pseudonyms; pseudonymising the first and second set of attributes according to the user selection to create a pseudonymised data set; receiving a request from a user to access the pseudonymised data set, and to view one or more pseudonymised attributes in original de-pseudonymised form, the request containing user authentication data; determining using the user authentication data what data attributes, if any, the user is permitted to view in original de-pseudonymised form; and de-pseudonymising the pseudonymised data attributes which the user is permitted to view.
11. The method of claim 10 further comprising the steps of presenting the partially or wholly depseudonymised dataset to the user.
12. The method of claims 10 or 11 wherein the repeat pseudonyms are assigned by the logical joining of the repeat pseudonym data set with the large data set which is subject of the pseudonymisation operation, or transformation.
13. The method of any of claims 10 to 12 wherein the data is presented to the user via a spreadsheet or report.
14. A computer system for enabling pseudonymisation and de-pseudonymisation of database entries, the system comprising a processor and a memory, the system configured to: receive metadata containing names of columns/attributes of data in a data source; provide a user with a list of at least some of the names of columns/attributes from the metadata; allow a user to select a first set of columns/attributes from the list the entries/values of which it is desired to be pseudonymised in a manner so that repeated occurrences of a particular entry/value are given different (preferably unrelated) pseudonyms and to select a second set of columns/attributes from the list the entries/values of which it is desired to be pseudonymised in a manner so that repeated occurrences of a particular entry/value are given the same pseudonym or traceably related pseudonyms; and to use the selections to generate a first code procedure, which code when executed and applied to data in a data source matching the metadata will pseudonymise entries/values under the columns/attributes selected to be in the first set such that repeated occurrences of a particular entry/value are given different (preferably unrelated) pseudonyms, and will pseudonymise entries/values under the columns/attributes selected to be in the second set such that repeated occurrences of a particular entry/value are given the same pseudonym or traceably related pseudonyms and generate a second code procedure which when executed and applied to the results of the application of the first code procedure and/or the data in the data source populates a database for access by users with pseudonyms corresponding to columns/attributes selected as part of the first or second set and preferably also with entries/values from the data source corresponding to at least one column/attribute not selected by the user as part of the first or second set.
15. A computer program product when executed on a computer enables the computer to perform each of the steps of any preceding method claim.