DESCRIPTION
Title
Method and System for the Storage of Data
Field of the Invention
The invention relates to a method and system for the storage of data, particularly of sensitive data.
Background to the Invention
The storage of sensitive data, such as medical data, is subject to a number of rules and restrictions. For example, in Europe the release of and access to such data are governed by data protection laws which often require that access to the sensitive data be restricted to those having a need to know and in some cases require that the data be encrypted. Access is generally given by the input of a password into the computer. If the correct password is entered then access is granted. More recently smart cards with or without password input have been used to grant access to data.
The requirements for the storage of such sensitive data (or private information) are discussed in Canadian Patent Application CA-A-2 303 724 in connection with property insurance policies and claims verification. The teachings of this patent application are said also to be relevant to financial and medical records, to online photo documentation for sales and auctions, and to the online storage and retrieval of personal photos, wills and probates. All of this can be seen to be sensitive data.
A number of encryption systems for data are available on the marketplace. One of the most commonly used systems is the so-called PGP system. This is described in more detail on the website http://www.pgpi.org/doc/pqpintro/ (accessed on 26 March 2004). This system uses both a public key and a private key. The public key can be freely transmitted. The private key is kept secure by its owner. The sender of a document encrypts it using the public key of the receiver. The receiver of the encrypted document uses his or her private key to decrypt the document.
A number of patent documents are known in which sensitive data, such as a patient's personal health-related information, are stored and/or transmitted using encryption methods. For example, PCT Application No. WO-A-03/025798 (Califano - assigned to First Genetic Trust Inc.) teaches a system and method for maintaining an individual's privacy such that only he or she can authorise the use of his or her genotype data. The system includes a safe in which the individual's medical information is stored. The safe's encryption mechanisms and certificates only allow designated parties to access the data. The encryption mechanisms and certificates restrict the use of the data in studies through software that is certified to be able to analyse the data without releasing it in any form that would violate the individual's identity.
A related international patent application No. WO-A-03/019159 (Califano et al, assigned to First Genetic Trust) teaches the concept of a virtual private identity (VPI) which comprises a random number or some other type of identifier used in a database to store genetic data. The other type of identifier lacks any information that can be used to determine identity information. The data stored in association with a respective VPI may be encrypted with an encryption key generated from the VPI. The VPI is not generated from the genetic data.
Similarly US-A-2003/074564 (Peterson) teaches another method of storing personal medical information records without jeopardizing the privacy of an individual. The medical information records are portable and accessible throughout the world via the web. A "key" known only to the individual allows access only to the individual for authorised use. The medical information is stored in an encrypted format. There is no linkage at server level between the individual identifying data and the medical information, except by access by the individual. The system allows real-time altering and updating of information using a personal identifier plus a password selected by the individual. The identifier may be printed on a card or otherwise carried on the individual's person. The individual further chooses a second unique identifier for use when the card is not available.
In a similar vein, US-A-2002/124177 (Harper and Stout) teaches a method and system for encrypting and decrypting electronic files using an essentially symmetric cipher or key system. This system is described as being adapted to electronically store medical records.
US Patent No. 6,463,417 (Schoenberg, assigned to CareKey.com, Inc.) teaches a method and system for distributing health information which is categorised into a variety of privacy levels. A requestor is assigned an access security code to allow access to the health information. The degree of access, i.e. the degree of privacy, depends on the access security code.
US Patent Application US-A-2001/051881 (Filler) also describes a system and method for managing a medical services network. This patent application includes diagnostic data which is obtained by a diagnostic service and which is sent over a network to a diagnostic interpreter. Subsequently, the interpretation and/or the diagnostic data may be transmitted to a display via a network.
Transmission of confidential medical data between a programming device and a remote data system is known in French Patent Application FR-A-2800481. The teachings of this patent include the use of a source of keys to deliver an encryption key to the programming device and a decryption key to the remote data system.
Another method for transmitting medical information over a network is taught in WO-A-02/082347 (Copper, assigned to Inner Vision Imaging LLC). The network of this patent application preferably includes two channels. Encrypted patient identifiable data is transmitted over one channel whilst unencrypted patient medical condition or treatment data is transmitted over the other channel.
Problems occur when an unauthorised person obtains the password or smart card. The unauthorised person can access the data. If the identity of a subject is stored, then the unauthorised person can misuse the data. For example, the unauthorised person might ascertain that a certain person is suffering from a disease and pass this information onto an employer. It would therefore be desirable to store the data anonymously.
These issues have been addressed in US Patent Application No. US-A-2003/0217037 (Pfeiffer and Bicker) in which data can be stored anonymously and no link made with the person who supplied the data.
In the '037 patent application, the data is stored together with an identification number. If new data needs to be entered either the user must supply the identification number - which means that he or she can be identified - or the data must be stored under a new identification number. In the latter case there is no possibility of correlation between the various items of data.
It is known that the so-called genetic fingerprint of a person is unique to that person. This has been exploited in the past to allow access to data. For example, US Patent Application No. US-A-2002/0059521 (Tasler, assigned to Siemens) describes a method and system in which a user is identified using biometric data. This biometric data is stored in a central server and if a user is to be identified, the biometric data is captured from the user and compared with the stored data. The biometric data includes a genetic fingerprint. Whilst this allows identification of a user, it does not provide for anonymous storage of any data.
Similarly PCT Patent Application No. WO 01/11577 (Precise Biometrics) also teaches a method and system for allowing access to sensitive data using biometric data. In this case the sensitive data is provided on a portable data carrier, such as a smart card. The biometric data is fingerprint data.
Japanese Patent Publication Number 2002-175280 (Dai Nippon) shows a gene information utilisation system which utilises the gene information to generate a cipher based on the gene information. This patent publication fails, however, to disclose the method by which the cipher is generated.
Summary of the Invention
It is therefore an object of the invention to improve the encryption methods for data storage.
This is achieved by providing a method for the encryption of data which has a first step of generating an encryption cipher from items of data derived from a genetic fingerprint based upon structural polymorphisms and a second step of encrypting the data using the encryption cipher. The method has several advantages. It is based upon internationally recognised technology for analysis and processing of genetic information. It allows complex genetic information to be analysed and processed in a simplified form. It utilises structural genetic elements which are unique to an individual, thereby allowing unique encryption ciphers to be generated. In the event that the cipher is lost, then re-analysis of the DNA can be carried out and, as long as the original method for generating the encryption cipher is known, the encryption cipher can be regenerated.
In order to ensure confidentiality, the individual (or test subject) sends a biological sample to a trust centre which carries out the DNA analysis and generates the cipher. Only the trust centre knows the algorithm by which the encryption cipher is generated and only the trust centre can re-generate the encryption cipher. The encryption cipher is stored in a cipher repository, such as a database, for use by authorised users.
Similarly, the trust centre can generate a decryption cipher from the genetic fingerprint. The decryption cipher will be needed by another group of users to decrypt the encrypted data.
The object of the invention is also solved by providing a system for the storage of data which has a data repository connected to data entry means, a biological sample analyser and an encryption cipher generator. In this embodiment of the invention, the biological sample analyser is independent of the data repository, so that it is not possible for an unauthorised person to gain access to both the biological sample analyser and the data repository and therefore access to the encrypted data within the data repository. The encryption cipher generator uses items of data derived from the genetic fingerprint and based upon structural polymorphisms.
The biological sample analyser is used to generate the genetic fingerprint from a biological sample provided by an individual or test subject whose sensitive data is to be encrypted.
Not only does the system include an encryption cipher generator, it further includes a decryption cipher generator which generates a different cipher to allow decryption of the data.
It is also an object of the invention to store sensitive data, such as medical data, in an anonymous and secure manner.
This object of the invention is solved by providing a method for the anonymous storage of sensitive data from a test subject. The method has a first step of generating one or more tags from a genetic fingerprint and a second step of annotating and storing the sensitive data with the one or more tags.
In this context "sensitive data" includes but is not limited to data such as the medical history or medical results of a test subject. The sensitive data could include purchasing details, the criminal record or credit record of the test subject. In brief, any data which needs to be kept confidential and restricted only to a certain group of people can be stored using this method.
The genetic fingerprint of every individual is unique and therefore using the genetic fingerprint allows a tag to be generated which is unique. It is not necessary to use all of the genetic data in the genetic fingerprint to generate the tag. Only a selected portion of the genetic data needs to be used, as long as the selected portion of genetic data is selected to allow a unique identification to be made. The advantage of using the genetic fingerprint to generate the tag is that the tag can be regenerated if it is lost by re-analysing the genetic fingerprint.
The best method to obtain the genetic fingerprint is from DNA analysis as this is highly repeatable and easily done. Even small amounts of DNA can be used as the material can be amplified using polymerase chain reaction (PCR) methods. PCR-based methods form the basis of many advances in genetic fingerprinting and the subsequent analysis of data. One such analytical process demonstrates the presence of short tandem repeats, or STRs, in an individual's DNA and such technologies can be easily applied to enable generation of the tag.
A trust centre is supplied with a biological sample in order to produce a genetic fingerprint and then generate the tag.
In a preferred embodiment of the invention, both a private tag and a public tag are generated. The private tag and public tag are distinct from each other but each comprises part of the unique genetic fingerprint. This stipulation applies equally to situations where more than one private tag and/or more than one public tag are generated. The public tag can be generally disclosed and can be initially attached to the sensitive data to uniquely identify the sensitive data. For long term storage of the sensitive data - and public analysis of the sensitive data - the private tag is attached to the sensitive data to uniquely identify the sensitive data. This private tag is mapped to the public tag, but does not allow identification of the test subject from which the data is obtained. Finally in a further embodiment of the invention, the sensitive data can also be encrypted using an encryption key. The encryption key can also be generated from the genetic fingerprint.
The objects of the invention can also be solved by providing a system for the storage of sensitive data which has a data repository connected to data entry means and a reference table, such as a look-up table, having a private tag and a public tag. As explained above, the sensitive data is supplied by the data entry means with the public tag and stored in the data repository with the private tag.
Description of the Drawings
Fig. 1 shows an overview of the system of the invention. Fig. 2 shows a flow diagram for the generation of a tag.
Detailed Description of the Invention
Fig. 1 shows an overview of a system 10 for the storage of sensitive data from a test subject 70 in accordance with this invention. The sensitive data can include, but is not limited to, address data, purchasing data, medical data, and any other type of data which is personal to the test subject 70 and the release of which could be detrimental to the test subject 70.
The system 10 comprises a database 20 having a plurality of records 30 stored therewithin. Each of the records 30 has an identifier 30a, an item of information 30b, such as medical information, and a tag 30c. The record 30 is only one example of a record that can be stored in the database 20 and other types of records can be stored. The database 20 could be a database such as the UK Biobank (see for example www.ukbiobank.co.uk - accessed on 23 March 2004) or one of the databases of the US National Institutes of Health. The database 20 could also be a database of other confidential data, the access of which has to be limited because of data protection laws or similar requirements.
The identifier 30a is a public identifier which is given to the item of information 30b. The identifier 30a could refer to the test subject 70 (such as a particular patient) or it could be an entirely random number. Typically the identifier 30a comprises the name of the patient and further identifiers such as the date of birth of the patient. It is, however, not unknown for two patients in the same hospital or surgery to have identical names and dates of birth and therefore further identifiers must be added to the identifier 30a to distinguish the two patients.
The item of information 30b could be an item of sensitive data, such as medical data or other confidential data. The item of information 30b could be, for example, digitalised data from an X-ray examination, a blood test, tissue probe or genetic information. The item of information 30b could furthermore be the name and address of a client or it could relate to the purchases made by the test subject.
The tag 30c refers to the test subject 70 and its generation will be explained later. The tag 30c is a so-called "public" tag. The public tag 30c is provided by the test subject 70 to the hospital, doctor, etc. to allow a single unique identification of the items of information 30b. The number of possible public tags 30c is many times the population of the world and thus any possible confusion between any two test subjects 70 should be considered to be negligible.
The database 20 is connected to a further data repository 40. The data repository 40 contains data records with items of information 30b and a private tag 30d. The private tag 30d is generated as described below. Mapping between the public tag 30c and the private tag 30d is carried out in a tag repository 85. The tag repository 85 can be implemented as a look-up table 85. The look-up table 85 is generally not on-line to avoid possible access of the information stored therein by hackers. When data is transferred from the database 20 to the data repository 40, the look-up table is temporarily accessed and the result of the mapping operation returned. This could be done by sending a message to the look-up table 85 and receiving an answer or it could be done by temporarily establishing a secure connection to the look-up table 85 and receiving the results of the mapping operation.
The items of information 30b stored in the data repository 40 are stored completely anonymously. There is no possibility of correlating the items of information 30b with, for example, the test subject 70 from whom the items of information 30b were obtained. Transfer of items of information 30b from the database 20 to the data repository 40 can be carried out automatically by removing the identifier 30a or the test subject 70 can review the item of information 30b before authorising its storage in the data repository 40.
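The transfer described above can be sketched as follows. This is a minimal illustration only: the class names, field names and tag values are hypothetical and are not taken from the specification, and the look-up table 85 is modelled as an in-memory dictionary although the description keeps it off-line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Record 30 in the database 20: identifier 30a, item 30b, public tag 30c."""
    identifier: str
    item: str
    public_tag: str


@dataclass(frozen=True)
class AnonymousRecord:
    """Record in the data repository 40: item 30b and private tag 30d only."""
    item: str
    private_tag: str


def transfer(record: Record, lookup_table: dict[str, str]) -> AnonymousRecord:
    """Drop the identifier 30a and replace the public tag by the private tag."""
    # The mapping is the only information taken from the look-up table 85.
    private_tag = lookup_table[record.public_tag]
    return AnonymousRecord(item=record.item, private_tag=private_tag)


# Hypothetical values, for illustration only.
lookup_table_85 = {"PUB-0001": "PRV-9f3a"}
record_30 = Record(
    identifier="Jane Doe, 1970-01-01",
    item="blood test result",
    public_tag="PUB-0001",
)
anonymous = transfer(record_30, lookup_table_85)
print(anonymous)
```

The anonymous record carries neither the identifier 30a nor the public tag 30c, so a reader of the data repository 40 would need the look-up table 85 to relate the item back to the database 20.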
An interface 50 is connected to the data repository 40. The interface 50 is connected, for example, to a computer, a data server or the Internet to allow access to the items of information 30b in the data repository 40. Since the items of information 30b are stored anonymously, there are few restrictions under data protection laws to prevent access to the items of information 30b. The only identifier attached to the items of information 30b is the private tag 30d. There is no reference either to the public tag 30c or to the identifier 30a and thus the items of information 30b are not traceable to the test subject 70.
Generation of the public tag 30c and the private tag 30d is carried out from a genetic fingerprint supplied by the test subject 70 by means of a trust centre 80 as will be described later. The trust centre 80 stores personal data, such as details of the identity of the test subject 70, and generates the public tag 30c and the private tag 30d. However, the trust centre 80 is completely isolated from the data repository 40. In this context "completely isolated" means that there is no permanent direct connection through a network between the data repository 40 and the trust centre 80. There is no possibility of relating the items of information 30b stored in the data repository 40 to the personal data in the trust centre 80. The trust centre 80 stores both the public tag 30c and the private tag 30d in the look-up table 85. The look-up table 85 can be either part of the trust centre 80 or it could be separate from the trust centre 80. In either case access to the look-up table 85 is restricted to authorised users only and security measures are in place to ensure that hacking into the look-up table 85 is impossible.
The generation of the public tag 30c and the private tag 30d will now be described with respect to Fig. 2. In a first step 200, biological material is obtained or extracted from the test subject 70. This biological material could be a mucus sample, a blood sample, or any other sample containing genetic material. Using methods known to the person skilled in the art, DNA is extracted from the biological material in step 210. In step 220, amplification of defined regions of the extracted DNA is carried out using standard PCR-based methods and using primers which are complementary to the conserved regions of the test subject's 70 DNA. Of course, should sufficient DNA be available from the biological material, PCR amplification does not need to be carried out.
In step 230, the amplified DNA is fractionated using one or more standard biochemical separation techniques and in step 240 the information on a resulting genetic profile of the test subject 70 is stored in either digitised or non-digitised form. Finally in step 250 the public tag 30c and the private tag 30d are generated using algorithms. Although the public tag 30c and the private tag 30d are generated from the full genetic profile, it is not possible to use the public tag 30c and the private tag 30d to trace back and subsequently identify the test subject 70. It is also not possible to derive the private tag 30d from the public tag 30c. This is achieved by choosing appropriate algorithms.
One example of the output of a genetic profiling technique is the detection of polymorphisms in the DNA. Polymorphisms are short variations in the DNA sequence between individuals which occur even between related members of the same family. As a result, polymorphisms are commonly used for paternity testing and forensic cases.
One class of polymorphisms are short tandem repeat (STR) segments in the DNA. The use of STRs is known in the art and commercial kits are available to carry out an analysis, such as the Profiler Plus machine supplied by Applied Biosystems. STRs are short sequences of DNA, normally of 2-5 base pairs, and are repeated numerous times in a head-tail or tandem manner. The STR segments are amplified using PCR primers that bind in the conserved regions of DNA flanking each of the repeat sections. As the number of repeats within an STR locus is highly variable, the amplified STRs vary in length.
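The relationship between the number of repeats and the length of the amplified fragment can be illustrated with a short sketch. The sequence, the flanking regions and the motif "gata" are hypothetical example values, and the function names are introduced here for illustration only. In practice the length is measured by a separation technique and not read from a sequence string.

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of motif in sequence."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence.lower())
    return max((len(run) // len(motif) for run in runs), default=0)


def amplicon_length(flank_left: int, flank_right: int, motif: str, repeats: int) -> int:
    """Length of the amplified fragment: both flanking regions plus the repeat block."""
    return flank_left + flank_right + len(motif) * repeats


# Hypothetical sequence: two 7-base flanking stretches around eight copies of "gata".
sequence = "ccttagc" + "gata" * 8 + "ttgcaac"
repeats = longest_tandem_run(sequence, "gata")
length = amplicon_length(7, 7, "gata", repeats)
print(repeats, length)
```

Two alleles that differ by one repeat differ in fragment length by the length of the motif, which is what allows the alleles to be told apart after fractionation.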
STRs have been studied extensively and are well-recognised as a system for the structural analysis of DNA. As a result, there are a number of known and documented STR loci. For example, the US Federal Bureau of Investigation uses the CODIS system to identify the perpetrators of crime. The CODIS system uses thirteen different STR loci. One of these loci is the D7S820 locus which is found on the human chromosome 7. The tetrameric repeat sequence of D7S820 is "gata". Different alleles of this locus have from 6 to 15 tandem repeats of the "gata" sequence. Others of the loci include vWA and FGA. Using the results of the analysis of the STRs a numerical result is obtained. To take one example, the CODIS system supplies the genotype of the test subject for the D3S1358 STR as a pair of numbers, e.g. 15 and 18. A pair of numbers is generated as one number relates to the paternal allele and the other to the maternal allele. Similarly for the vWA locus the genotype is a second pair of numbers 16, 16. The number of possible variations is substantially greater than the population of the planet and, as a result, a reliable identification system can be established based on the CODIS STR system.
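The claim about the number of possible variations can be checked with a rough count. The sketch below assumes, purely for illustration, that every one of the thirteen loci has the same ten alleles as the "gata" locus (6 to 15 repeats); real loci differ in their number of alleles, and alleles are not equally frequent, so the result is an upper bound on distinct genotypes and not a probability of uniqueness. The world population figure is likewise a rough assumption.

```python
from math import comb


def genotype_count(alleles: int) -> int:
    """Unordered pairs of alleles at one locus, homozygous pairs included."""
    return comb(alleles, 2) + alleles


# Alleles with 6 to 15 repeats give 10 distinct alleles at the "gata" locus.
per_locus = genotype_count(15 - 6 + 1)

# Assumption for illustration: all thirteen loci have the same number of alleles.
LOCI = 13
total = per_locus ** LOCI

WORLD_POPULATION = 8_000_000_000  # rough figure, assumption
print(per_locus, total > WORLD_POPULATION)
```

Under these assumptions a single locus yields 55 genotypes, and the count across thirteen loci exceeds the assumed population by many orders of magnitude.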
These numerical results can be combined by a mathematical method to generate both the public tag 30c and the private tag 30d. In the simplest method all of the digits could be concatenated to give - in this example - one of the public tag 30c or the private tag 30d having the value 15181616 (i.e. the numbers 15, 18, 16 and 16 written one after another).
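The simplest method can be written out in a few lines. The function names are hypothetical. The second function is an added assumption and not part of the description: it pads every number to two digits, because plain joining is ambiguous once a locus has a single-digit allele (6 and 15 would give the same digits as 61 and 5).

```python
def concatenate_genotype(genotype: list[tuple[int, int]]) -> str:
    """Join the allele numbers of all loci into one digit string."""
    return "".join(f"{first}{second}" for first, second in genotype)


def concatenate_padded(genotype: list[tuple[int, int]]) -> str:
    """Variant with two digits per allele, so the string can be split again."""
    return "".join(f"{first:02d}{second:02d}" for first, second in genotype)


# Genotype from the example: D3S1358 = 15, 18 and vWA = 16, 16.
genotype = [(15, 18), (16, 16)]
tag = concatenate_genotype(genotype)
print(tag)  # 15181616

print(concatenate_padded([(6, 15), (16, 16)]))  # 06151616
```

A tag formed in this way exposes the genotype directly, which is one reason why the description goes on to other mathematical operations.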
Generation of the public tag 30c and the private tag 30d can be carried out using two separate and unrelated mathematical operations. This ensures that the private tag 30d cannot be obtained from the public tag 30c.
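The description does not name the two mathematical operations. As one possible sketch, and only as an assumption, two keyed hashes (HMAC with SHA-256) with independent secret keys held by the trust centre 80 could be used. The keys, the profile encoding and the tag length below are all hypothetical.

```python
import hashlib
import hmac


def derive_tag(profile: str, secret_key: bytes, length: int = 16) -> str:
    """Keyed hash of the genetic profile, truncated to a fixed-length hex tag."""
    digest = hmac.new(secret_key, profile.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical keys; in this sketch they would never leave the trust centre 80.
PUBLIC_TAG_KEY = b"hypothetical-key-for-public-tags"
PRIVATE_TAG_KEY = b"hypothetical-key-for-private-tags"

# Hypothetical text encoding of the genotype from the example.
profile = "D3S1358=15,18;vWA=16,16"

public_tag = derive_tag(profile, PUBLIC_TAG_KEY)
private_tag = derive_tag(profile, PRIVATE_TAG_KEY)
print(public_tag, private_tag)
```

With this choice, the same profile always regenerates the same pair of tags, while a party holding only the public tag 30c and neither key has no practical way of computing the private tag 30d. A two-locus profile is used here only to keep the example short; a real profile would need enough loci to be unique.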
In another embodiment of the invention, the CODIS STR analysis is divided into maternal and paternal components. The paternal component is kept within the trust centre 80 and is not used to generate the private tag 30d. The maternal component is used to generate the public tag 30c.
Submission of the biological material to the trust centre 80 is carried out in accordance with the methods disclosed broadly in the afore-mentioned US Patent Application No. US-A-2003/0217037, the details of which are incorporated into this application. The biological material is submitted to the trust centre 80 using a unique random identification number to identify the test subject 70. The biological material is not identified with any personal details of the test subject 70; in particular, the name of the test subject 70 is not submitted with the biological material. Only the test subject 70 submitting the biological sample knows the random identification number. In the event that the random identification number is lost, a new random identification number is generated. The results of the genetic analysis are stored in the trust centre 80 and are related to the random identification number. The test subject 70 is sent the public tag 30c after it has been calculated as described above.
Input of any items of information 30b into the database 20 can be carried out in the following manner. The item of information 30b can be directly stored with the public tag 30c if the public tag 30c is known to the test subject 70 - for example it might be stored on the test subject's health card. Alternatively, the public tag 30c can be calculated when the item of information 30b is obtained. This would of course mean that it is necessary for the algorithm from which the public tag 30c is generated to be publicly known, which may not be desirable.
The invention can also be used to generate an encryption cipher based on the genetic fingerprint. This encryption cipher can be used to store the data in the data repository 40 in an encrypted manner. The relevant encryption cipher is stored in a further database (either incorporated into the look-up table 85 or as a separate database for security reasons) together with the public tag 30c. When the item of information 30b is transferred from the database 20 to the data repository 40, then it is encrypted. This is done by fetching the encryption cipher from the further database and encrypting the item of information 30b.
As is described in the introduction, numerous encryption methods are known which can be used for this purpose.
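By way of a hedged illustration, a key could be derived from the genetic profile and applied to an item of information 30b as sketched below. The description does not prescribe a key derivation or a cipher. PBKDF2 with SHA-256 and the HMAC-based keystream used here are assumptions chosen so that the sketch runs with the Python standard library alone; the salt, the profile encoding and the function names are hypothetical. The keystream construction provides no integrity protection and stands in for a vetted cipher, and it is symmetric, whereas the description also contemplates a separate decryption cipher.

```python
import hashlib
import hmac
import secrets


def derive_key(profile: str, salt: bytes) -> bytes:
    """Derive a 32-byte key from the genetic profile (assumed: PBKDF2, SHA-256)."""
    return hashlib.pbkdf2_hmac("sha256", profile.encode("utf-8"), salt, 100_000)


def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with an HMAC-SHA-256 keystream; the same call encrypts and decrypts."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hmac.new(key, nonce + counter, hashlib.sha256).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


# Hypothetical inputs, for illustration only.
salt = b"hypothetical-trust-centre-salt"
key = derive_key("D3S1358=15,18;vWA=16,16", salt)
nonce = secrets.token_bytes(16)  # fresh per item, stored next to the ciphertext

item_30b = "blood test: normal".encode("utf-8")
ciphertext = keystream_xor(key, nonce, item_30b)
recovered = keystream_xor(key, nonce, ciphertext)
print(recovered.decode("utf-8"))
```

Because the key depends only on the profile and the salt, it can be regenerated by re-analysing the DNA, which mirrors the recovery property stated for the encryption cipher.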
The foregoing is considered illustrative of the principles of the invention and, since numerous modifications will occur to those skilled in the art, it is not intended to limit the invention to the exact construction and operation described. All suitable modifications and equivalents fall within the scope of the claims.