CN108334577B - Safe multiparty numerical record matching method - Google Patents

Publication number
CN108334577B
CN108334577B CN201810067980.5A CN201810067980A CN108334577B CN 108334577 B CN108334577 B CN 108334577B CN 201810067980 A CN201810067980 A CN 201810067980A CN 108334577 B CN108334577 B CN 108334577B
Authority
CN
China
Prior art keywords
numerical
record
attribute
records
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810067980.5A
Other languages
Chinese (zh)
Other versions
CN108334577A (en)
Inventor
申德荣
韩姝敏
聂铁铮
寇月
于戈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201810067980.5A
Publication of CN108334577A
Application granted
Publication of CN108334577B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries

Abstract

The invention discloses a secure and efficient multiparty numerical record matching method, which belongs to the field of data quality and data integration. The data sources first agree on common parameters and generate keys, and the method then proceeds in three steps: (1) each data source encrypts its numerical records using a quasi-modulo operation (smod); (2) the maximum and minimum values of each numerical attribute are queried securely, and the similarity between them is computed and taken as the similarity of the numerical records on that attribute; and (3) whether the records match is decided from their similarity on each attribute. With this method, duplicate data objects can be identified more securely and in less time. It is proved that if the similarity between the maximum and minimum values of an attribute is greater than the threshold, then the similarity between any two values of that attribute is also greater than the threshold. Only the maximum and minimum values of each attribute therefore need to be found securely and quickly to decide whether the records match, which ensures efficiency.

Description

Safe multiparty numerical record matching method
Technical Field
The invention belongs to the field of data integration and data security, and mainly relates to a safe and effective multiparty numerical record matching method.
Background
With the continuous progress of technology, data is growing and accumulating rapidly. Reducing data redundancy and enabling data sharing have become primary tasks in the big data era. Record linkage, also known as entity identification, entity resolution, entity matching, record joining, duplicate detection, record deduplication, or reference disambiguation, refers to identifying records from one or more data sources that represent the same real-world entity. Record linkage is widely applied, for example in enterprise customer information management, fraud prevention, healthcare, directory integration, and the identification of satellite and remote sensing data. However, when records contain private or sensitive information, the privacy of that information must be protected. Privacy-preserving record linkage (PPRL) has therefore attracted growing research interest at home and abroad in recent years. PPRL ensures that only the final matching result is shared among the data sources during record linkage and that no information about unmatched records is leaked. For example, in a decentralized medical system, a person's medical information may be distributed among multiple hospitals. Finding the same person's diagnostic information at different hospitals allows a more accurate analysis of that person's condition, but because of patient privacy, no hospital wishes to expose its patients' medical information. PPRL can find the medical information of a given patient at each hospital while ensuring that the medical information of other patients is not leaked. PPRL therefore has not only theoretical research value but also important and urgent practical value.
PPRL mainly comprises three steps: secure data blocking, secure calculation of data object similarity, and the matching decision for data object pairs. First, secure blocking securely reduces the search space, avoids useless comparisons between data objects, and speeds up identification; it is an optional step. Second, securely calculating the similarity between data objects is the key step of PPRL. The similarity of an encrypted data object pair must stay close to the similarity of the original pair, so that a higher similarity means a higher likelihood that the pair matches; a similarity function is used for this calculation. Finally, once the similarity has been obtained, it is used to decide whether the data objects match (are duplicates), and various decision methods currently exist.
Existing PPRL methods have two shortcomings. 1) They are applicable to only two data sources, and there has been little research on PPRL among three or more parties. This is because it is not easy to find a method that measures the similarity of multiple records securely and reasonably, and most similarity measures designed for two data sources do not carry over to multiple data sources. 2) Existing privacy protection methods are suited only to string attributes, and privacy protection for numerical attributes has received less study. If a method designed for strings is applied to numerical attributes, the similarity between the processed values tends to differ from the similarity between the original values, so a privacy protection method suited to numerical attributes is needed. Multiple parties and numerical attributes are both common in real applications, so a secure and efficient multiparty numerical record matching method is of great practical significance.
Disclosure of Invention
To address the shortcomings of existing secure multiparty record matching methods, such as being limited to character data, complex encryption, and high time cost, the invention provides an efficient secure multiparty record matching method suitable for numerical records.
A secure multiparty numerical record matching method, comprising the following steps:
Step 1, encrypting the numerical records of the multiparty data sources. Given P participants, the numerical records are encrypted using a quasi-modulo operation (smod), where the P participants agree on the common matching attributes $A = \{a_n \mid 1 \le n \le d\}$.
Definition of a numerical record: 1) if the values of all attributes in a record are numerical, the record is a numerical record; 2) if the values of only some attributes in a record are numerical, all or part of those numerical attributes can be extracted and regarded as a numerical record.
Step 1-1, generating the encryption keys for the numerical records. Participant $P_1$ generates P keys $K_i$ ($1 \le i \le P$) and distributes them to the P participants. Each key contains d subkeys, $K_i = \{k_{in} \mid 1 \le n \le d\}$, one for each of the numerical attributes $\{a_{i1}, a_{i2}, \ldots, a_{id}\}$. Using a different encryption key for each numerical attribute enhances data security.
Step 1-2, encrypting the numerical records. Given a numerical record $r_i$ with matching attributes $\{a_{i1}, a_{i2}, \ldots, a_{id}\}$ and encryption key $K_i = \{k_{i1}, k_{i2}, \ldots, k_{id}\}$, the record is encrypted using the quasi-modulo operation. The encryption and the quasi-modulo operation are as follows:
$\mathrm{Enc}(V(a_{id})) = \mathrm{smod}\{(V(a_{id}) + k_{id} \cdot p),\ p \cdot q\}$ (1)
where $V(a_{id})$ denotes the value of attribute $a_d$ in record $r_i$, m denotes the plaintext, and p and q are both prime numbers.
Each participant encrypts its records with its own key, and the Cartesian product of the participants' records is then taken to generate the candidate record pairs.
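Step 1-2 can be illustrated with a minimal Python sketch. It rests on stated assumptions: the definition of smod is not reproduced in this text, so ordinary modulo is used as a stand-in for non-negative operands (which agrees with the numbers in the worked example later in the description), attribute values are taken to be integers, and the function names and the age subkey 40 are hypothetical.

```python
from typing import Sequence


def smod(x: int, n: int) -> int:
    """Stand-in for the quasi-modulo operation.

    Assumption: for non-negative x it behaves like the ordinary modulo.
    """
    return x % n


def encrypt_value(value: int, subkey: int, p: int, q: int) -> int:
    """Formula (1): Enc(v) = smod(v + k * p, p * q)."""
    return smod(value + subkey * p, p * q)


def encrypt_record(values: Sequence[int], subkeys: Sequence[int],
                   p: int, q: int) -> list[int]:
    """Encrypt each numerical attribute of one record with its own subkey."""
    if len(values) != len(subkeys):
        raise ValueError("one subkey is needed per matching attribute")
    return [encrypt_value(v, k, p, q) for v, k in zip(values, subkeys)]


# Primes and the blood-pressure subkeys 23 and 236 come from the worked example.
p, q = 181, 71
print(encrypt_value(66, 23, p, q))                # 4229
print(encrypt_record([69, 31], [236, 40], p, q))  # [4232, 7271]
```

Keeping one subkey per attribute mirrors step 1-1, where each numerical attribute is encrypted under a different key.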
Step 2, processing the candidate record pairs by iterative optimization. The candidate record pairs are processed iteratively, and successfully matched candidate record pairs are output progressively, as follows:
Step 2-1, securely querying the maximum and minimum values of each numerical attribute. Given P records from the P participants, the maximum and minimum values of each numerical attribute across the P records are to be found. For a numerical attribute $a_n$, the encrypted value of each record under that attribute, $C_{in} = \mathrm{Enc}(r_i(a_n))$ ($1 \le i \le P$), is known. If the encrypted values preserve the order of the original values, the maximum and minimum can be searched for among the ciphertexts and decrypted to give the maximum and minimum of the original values. For the encryption to satisfy both that $r_1(a_n) \ge r_2(a_n)$ implies $\mathrm{Enc}(r_1(a_n)) \ge \mathrm{Enc}(r_2(a_n))$ and that $\mathrm{Enc}(r_1(a_n)) \ge \mathrm{Enc}(r_2(a_n))$ implies $r_1(a_n) \ge r_2(a_n)$, it can be derived that the keys $k_1, k_2$ of records $r_1, r_2$ must satisfy:
$k_2 = k_1 + hq$ (3)
where h is an integer. The ciphertexts of all participants are then sent to a matching unit. Because the quasi-modulo operation is homomorphic with respect to subtraction, the matching unit performs secure subtraction on the ciphertexts and finds the maximum and minimum among them.
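The effect of relation (3) can be checked with a short Python sketch. Several points are assumptions rather than statements from the text: ordinary modulo stands in for smod, decryption is taken to be reduction modulo p (which agrees with the worked example but is not spelled out here), the plaintext values are hypothetical, and the values are small enough that no ciphertext wraps around p*q.

```python
def encrypt(value: int, key: int, p: int, q: int) -> int:
    # Formula (1), with % standing in for smod on non-negative operands.
    return (value + key * p) % (p * q)


def decrypt(cipher: int, p: int) -> int:
    # Assumption: Dec(C) = C mod p, valid while the plaintext is below p.
    return cipher % p


def derive_keys(k1: int, q: int, multipliers: list[int]) -> list[int]:
    # Relation (3): every key equals k1 plus an integer multiple of q.
    return [k1 + h * q for h in multipliers]


p, q = 181, 71
keys = derive_keys(23, q, [0, 1, 3])   # [23, 94, 236], as in the worked example
values = [120, 95, 101]                # hypothetical plaintexts, one per party
ciphertexts = [encrypt(v, k, p, q) for v, k in zip(values, keys)]

# Because k_i * p is congruent to k_1 * p modulo p*q, every ciphertext is the
# plaintext plus the same offset, so ciphertext order equals plaintext order.
c_min, c_max = min(ciphertexts), max(ciphertexts)
print(decrypt(c_min, p), decrypt(c_max, p))  # 95 120

# Homomorphic subtraction: the ciphertext difference decrypts to the
# plaintext difference without decrypting either operand first.
print(decrypt(c_max - c_min, p))             # 25
```

The sketch also shows why the matching unit only needs the two extreme ciphertexts per attribute: their difference is all that the similarity calculation consumes.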
Step 2-2, optimized calculation of the similarity of the candidate record pairs. To calculate the similarity of the P records in a candidate record pair, the maximum and minimum ciphertexts of the P numerical records under each attribute are obtained using step 2-1, and the similarity of the P records under each attribute is calculated by secure subtraction. If the similarity is greater than or equal to the threshold, the P records match; otherwise the matching fails, as expressed by formula (4).
In formula (4), $r_1, r_2, \ldots, r_i, \ldots, r_P$ denote the records from the P participants, $a_{n\max}$ and $a_{n\min}$ denote the maximum and minimum values of attribute n, and $\theta_n$ denotes the similarity threshold for attribute n. Comparing only the extremes suffices because, if the similarity between the maximum and minimum values is greater than the threshold, then the similarity between any two values of the P records under that attribute is also greater than the threshold, which is proved as follows:
To prove: if $\mathrm{sim}(a_{n\min}, a_{n\max}) > \theta_n$, then $\mathrm{sim}(a, b) > \theta_n$ for all $a_{n\min} \le a, b \le a_{n\max}$.
If $a > b$, then $\mathrm{sim}(a, b) = 1 - (a - b)/d_{\max}$ by formula (5).
Since $a - b \le a_{n\max} - a_{n\min}$, $\mathrm{sim}(a, b)$ takes its minimum value when $a = a_{n\max}$ and $b = a_{n\min}$,
and that minimum equals $\mathrm{sim}(a_{n\min}, a_{n\max})$.
Thus $\mathrm{sim}(a, b) \ge \mathrm{sim}(a_{n\min}, a_{n\max}) > \theta_n$.
In the same way, when $a < b$ or $a = b$, $\mathrm{sim}(a, b) > \theta_n$.
The similarity of two values $n_1, n_2$ is calculated as follows:
$\mathrm{sim}(n_1, n_2) = 1 - \dfrac{|n_1 - n_2|}{d_{\max}}$ (5)
where $d_{\max}$ is the maximum difference between the two values.
Finally, the matched duplicate data object pairs are output.
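Formula (4) itself is not reproduced in this text. The block below gives one reading that is consistent with the surrounding description, followed by the inequality chain behind the proof. The function name Match and the "for all n" quantifier are assumptions, not the patent's own notation.

```latex
% One possible reading of formula (4) (assumed notation):
\mathrm{Match}(r_1, \ldots, r_P) =
\begin{cases}
  1, & \text{if } \mathrm{sim}(a_{n\min}, a_{n\max}) \ge \theta_n
       \text{ for every } 1 \le n \le d, \\
  0, & \text{otherwise.}
\end{cases}

% Inequality chain behind the proof, for a_{n\min} \le b \le a \le a_{n\max}:
a - b \le a_{n\max} - a_{n\min}
\;\Longrightarrow\;
\mathrm{sim}(a, b) = 1 - \frac{a - b}{d_{\max}}
\ge 1 - \frac{a_{n\max} - a_{n\min}}{d_{\max}}
= \mathrm{sim}(a_{n\min}, a_{n\max}) > \theta_n .
```

Under this reading, one similarity evaluation per attribute replaces the pairwise comparisons among the P values, which is where the claimed saving in time comes from.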
The advantages of the invention are as follows: with the secure multiparty numerical record matching method, quasi-modulo encryption and homomorphic operations ensure high recall and precision of the matching results; and with the optimized similarity calculation, more duplicate data objects can be matched within a short time budget.
Drawings
FIG. 1 is a general flow diagram of the present invention.
Fig. 2 shows the relationship between the participants and the transmission process of data.
FIG. 3 is a graph comparing the runtime of the present invention with two other methods.
FIG. 4 is a graph comparing the matching quality of the present invention with two other existing methods.
Detailed Description
The invention is described in further detail below with reference to figures 1-4 of the drawings and the examples of its implementation.
As shown in Table 1, 4 records from the patient information base were selected as a sample data set, and the sample data were obtained from all. The true matching result in this data set is {P96, P26, P37}. As an example, the similarity of the 3 records {P96, P80, P26} is calculated below to determine whether they match.
Table 1. Sample data set containing 4 patient records, with the attributes blood pressure, 2-hour insulin amount, diabetes coefficient, and age
ID     Blood pressure   2-hour insulin amount   Diabetes coefficient   Age
P96    69               0                       0.351                  31
P80    66               543                     0.158                  53
P26    69               0                       0.347                  31
P37    69               0                       0.357                  31
1. First, the key {p = 181, q = 71, rand_1 = 23, rand_2 = 94, rand_3 = 236} is generated, and the values of the common attribute blood pressure in the three records (66, 69, and 69 in Table 1) are encrypted using the quasi-modulo operation, giving the following results:
C1 = Enc(66) = smod{(66 + 23*181), 181*71} = 4229,
C2 = Enc(69) = smod{(69 + 94*181), 181*71} = 4232,
C3 = Enc(69) = smod{(69 + 236*181), 181*71} = 4232.
2. Then C_min = 4229 and C_max = 4232 are obtained, so C_sub = C_max - C_min = 3. Using the similarity formula (5) with d_max = 10, Sim(P96, P80, P26) = 1 - Dec(C_sub)/10 = 0.7, so the similarity of the three records P96, P80, and P26 on the common attribute blood pressure is 0.7.
3. Keys are then generated for each of the other attributes of the three records, and steps 1 and 2 are repeated to obtain the similarity of the three records on each attribute. The similarity calculated for each attribute is compared with the set threshold: if the similarity on each attribute is greater than the threshold, the three records match; otherwise the matching fails. Different attributes use different keys, which protects the attribute values of the records.
4. The iterative processing stage is then entered. One record is selected from each of the three participants to form a candidate pair, steps 1, 2, and 3 are repeated, and whether the candidate pair matches is output.
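The blood-pressure calculation in steps 1 and 2 can be reproduced with a few lines of Python. This is a sketch under assumptions: the plaintexts are the Table 1 blood-pressure values of P96, P80, and P26, the assignment of the keys 23, 94, and 236 to particular records is hypothetical, ordinary modulo stands in for smod, and Dec is taken to be reduction modulo p.

```python
P_PRIME, Q_PRIME = 181, 71   # p and q from the worked example
D_MAX = 10                   # divisor used in Sim = 1 - Dec(C_sub)/10

# Which key belongs to which record is an assumption; since the keys differ
# by multiples of q, the ciphertexts do not depend on that choice.
keys = {"P96": 23, "P80": 94, "P26": 236}
blood_pressure = {"P96": 69, "P80": 66, "P26": 69}   # Table 1

cipher = {
    rid: (blood_pressure[rid] + keys[rid] * P_PRIME) % (P_PRIME * Q_PRIME)
    for rid in keys
}
c_min, c_max = min(cipher.values()), max(cipher.values())
c_sub = c_max - c_min                        # secure subtraction on ciphertexts
similarity = 1 - (c_sub % P_PRIME) / D_MAX   # Dec assumed to be "mod p"

print(cipher)                # {'P96': 4232, 'P80': 4229, 'P26': 4232}
print(c_min, c_max, c_sub)   # 4229 4232 3
print(round(similarity, 2))  # 0.7
```

The result agrees with C_min = 4229, C_max = 4232, and the blood-pressure similarity of 0.7 stated in step 2.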
Table 2. Similarity of the three records {P96, P80, P26} and {P96, P26, P37} on each attribute
PatientID      Blood pressure   2-hour insulin amount   Diabetes coefficient   Age    OverallSimilarity
P96,P80,P26    0.7              0                       0.62                   0.74   0
P96,P26,P37    1                1                       0.98                   1      1
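The iteration over candidate tuples and the per-attribute threshold test can be sketched as follows. Much of the setup is hypothetical and chosen only for illustration: the split of the Table 1 records over three parties, the age subkeys (40, 111, 182), the d_max values, and the thresholds. Only the two integer attributes (blood pressure and age) are used, ordinary modulo stands in for smod, Dec is taken to be reduction modulo p, and clipping the similarity at 0 is an assumption suggested by the zero entries in Table 2.

```python
from itertools import product

P_PRIME, Q_PRIME = 181, 71  # primes from the worked example


def encrypt(value: int, key: int) -> int:
    # Formula (1); % stands in for smod on non-negative operands (assumption).
    return (value + key * P_PRIME) % (P_PRIME * Q_PRIME)


def attribute_similarity(column: list[int], d_max: float) -> float:
    # Secure subtraction on the extreme ciphertexts, then formula (5).
    diff = (max(column) - min(column)) % P_PRIME  # Dec assumed to be "mod p"
    return max(0.0, 1.0 - diff / d_max)           # clipping at 0 is assumed


def is_match(candidate, d_maxes, thresholds) -> bool:
    # candidate: one (record_id, [ciphertext per attribute]) entry per party.
    for n, (d_max, theta) in enumerate(zip(d_maxes, thresholds)):
        column = [ciphers[n] for _, ciphers in candidate]
        if attribute_similarity(column, d_max) < theta:
            return False
    return True


def find_matches(parties, d_maxes, thresholds):
    # Cartesian product: one record from each party per candidate tuple.
    return [tuple(rid for rid, _ in cand)
            for cand in product(*parties)
            if is_match(cand, d_maxes, thresholds)]


# (blood pressure, age) from Table 1; per-party subkeys follow k_i = k_1 + h*q.
plain = {"P96": (69, 31), "P80": (66, 53), "P26": (69, 31), "P37": (69, 31)}
party_keys = [(23, 40), (94, 111), (236, 182)]
party_ids = [["P96"], ["P80", "P26"], ["P37"]]
parties = [
    [(rid, [encrypt(v, k) for v, k in zip(plain[rid], keys)]) for rid in ids]
    for ids, keys in zip(party_ids, party_keys)
]

matches = find_matches(parties, d_maxes=(10, 100), thresholds=(0.9, 0.9))
print(matches)  # [('P96', 'P26', 'P37')]
```

With these illustrative settings the only surviving tuple is {P96, P26, P37}, which agrees with the true matching result stated for the sample data set.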

Claims (1)

1. A secure multiparty numerical record matching method, characterized in that the method comprises the following steps:
step 1, encrypting the numerical records of the multiparty data sources; given P participants, the numerical records are encrypted using a quasi-modulo operation smod, wherein the P participants agree on the common matching attributes $A = \{a_n \mid 1 \le n \le d\}$;
definition of a numerical record: 1) if the values of all attributes in a record are numerical, the record is a numerical record; 2) if the values of only some attributes in a record are numerical, all or part of those numerical attributes are extracted and regarded as a numerical record;
step 1-1, generating the encryption keys for the numerical records; participant $P_1$ generates P keys $K_i$ and distributes them to the P participants, wherein $1 \le i \le P$; each key comprises d subkeys $K_i = \{k_{in} \mid 1 \le n \le d\}$, one for each of the numerical attributes $\{a_{i1}, a_{i2}, \ldots, a_{id}\}$, and the encryption key of each numerical attribute is different, so that data security is enhanced;
step 1-2, encrypting the numerical records; given a numerical record $r_i$ with matching attributes $\{a_{i1}, a_{i2}, \ldots, a_{id}\}$ and encryption key $K_i = \{k_{i1}, k_{i2}, \ldots, k_{id}\}$, the record is encrypted using the quasi-modulo operation, and the encryption and the quasi-modulo operation are as follows:
$\mathrm{Enc}(V(a_{id})) = \mathrm{smod}\{(V(a_{id}) + k_{id} \cdot p),\ p \cdot q\}$ (1)
Figure FDA0002273013630000011
wherein $V(a_{id})$ denotes the value of attribute $a_d$ in record $r_i$, m denotes the plaintext, and p and q are both prime numbers;
each participant encrypts its records with its own key, and the Cartesian product of the participants' records is then taken to generate the candidate record pairs;
step 2, processing the candidate record pairs by iterative optimization; the candidate record pairs are processed iteratively, and successfully matched candidate record pairs are output progressively, as follows:
step 2-1, securely querying the maximum and minimum values of each numerical attribute; given P records from the P participants, the maximum and minimum values of each numerical attribute across the P records are to be found; for a numerical attribute $a_n$, the encrypted value of each record under that attribute, $C_{in} = \mathrm{Enc}(r_i(a_n))$, $1 \le i \le P$, is known; if the encrypted values preserve the order of the original values, the maximum and minimum are searched for among the ciphertexts and decrypted to give the maximum and minimum of the original values; for the encryption to satisfy both that $r_1(a_n) \ge r_2(a_n)$ implies $\mathrm{Enc}(r_1(a_n)) \ge \mathrm{Enc}(r_2(a_n))$ and that $\mathrm{Enc}(r_1(a_n)) \ge \mathrm{Enc}(r_2(a_n))$ implies $r_1(a_n) \ge r_2(a_n)$, it is derived that the keys $k_1, k_2$ of records $r_1, r_2$ must satisfy:
$k_2 = k_1 + hq$ (3)
wherein h is an integer; the ciphertexts of all participants are then sent to a matching unit, and because the quasi-modulo operation is homomorphic with respect to subtraction, the matching unit performs secure subtraction on the ciphertexts and finds the maximum and minimum among them;
step 2-2, optimized calculation of the similarity of the candidate record pairs; to calculate the similarity of the P records in a candidate record pair, the maximum and minimum ciphertexts of the P numerical records under each attribute are obtained using step 2-1, and the similarity of the P records under each attribute is calculated by secure subtraction; if the similarity is greater than or equal to the threshold, the P records match; otherwise the matching fails, as expressed by formula (4);
in formula (4), $r_1, r_2, \ldots, r_i, \ldots, r_P$ denote the records from the P participants, $a_{n\max}$ and $a_{n\min}$ denote the maximum and minimum values of attribute n, and $\theta_n$ denotes the similarity threshold for attribute n; this is because, if the similarity between the maximum and minimum values is greater than the threshold, then the similarity between any two values of the P records under that attribute is also greater than the threshold, which is proved as follows:
if $\mathrm{sim}(a_{n\min}, a_{n\max}) > \theta_n$, then $\mathrm{sim}(a, b) > \theta_n$ for all $a_{n\min} \le a, b \le a_{n\max}$;
if $a > b$, then $\mathrm{sim}(a, b) = 1 - (a - b)/d_{\max}$, as can be seen from formula (5);
when $a = a_{n\max}$ and $b = a_{n\min}$,
$\mathrm{sim}(a, b)$ takes its minimum value, which equals $\mathrm{sim}(a_{n\min}, a_{n\max})$;
thus $\mathrm{sim}(a, b) > \theta_n$;
in the same way, when $a < b$ or $a = b$, $\mathrm{sim}(a, b) > \theta_n$;
the similarity of two values $n_1, n_2$ is calculated as follows:
$\mathrm{sim}(n_1, n_2) = 1 - \dfrac{|n_1 - n_2|}{d_{\max}}$ (5)
wherein $d_{\max}$ is the maximum difference between the two values;
and finally, the matched duplicate data object pairs are output.
CN201810067980.5A 2018-01-24 2018-01-24 Safe multiparty numerical record matching method Active CN108334577B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810067980.5A CN108334577B (en) 2018-01-24 2018-01-24 Safe multiparty numerical record matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810067980.5A CN108334577B (en) 2018-01-24 2018-01-24 Safe multiparty numerical record matching method

Publications (2)

Publication Number Publication Date
CN108334577A 2018-07-27
CN108334577B 2020-02-07

Family

ID=62926306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810067980.5A Active CN108334577B (en) 2018-01-24 2018-01-24 Safe multiparty numerical record matching method

Country Status (1)

Country Link
CN (1) CN108334577B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032839B (en) * 2021-05-25 2021-08-10 华控清交信息科技(北京)有限公司 Data processing method and device and data processing device
CN113408001B (en) * 2021-08-18 2021-11-09 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for determining most value safely by multiple parties

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020073138A1 (en) * 2000-12-08 2002-06-13 Gilbert Eric S. De-identification and linkage of data records
CN101937464B (en) * 2010-09-13 2012-01-25 武汉达梦数据库有限公司 Ciphertext search method based on word-for-word indexing
EP3364316B1 (en) * 2012-08-15 2019-10-02 Visa International Service Association Searchable encrypted data
US9705850B2 (en) * 2013-03-15 2017-07-11 Arizona Board Of Regents On Behalf Of Arizona State University Enabling comparable data access control for lightweight mobile devices in clouds

Also Published As

Publication number Publication date
CN108334577A (en) 2018-07-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant