CN109308423A - Two-stage blocking method for privacy-preserving record linkage - Google Patents

Two-stage blocking method for privacy-preserving record linkage

Info

Publication number
CN109308423A
Authority
CN
China
Prior art keywords
blocking
lsh
suffix
record
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201811101295.6A
Other languages
Chinese (zh)
Inventor
申德荣
彤丹妮
聂铁铮
寇月
于戈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201811101295.6A
Publication of CN109308423A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6227 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a two-stage blocking method for privacy-preserving record linkage (PPRL), in the fields of data integration and data privacy. Each data source encodes its records as Bloom filters and then carries out two steps: (1) two-stage blocking that combines LSH with suffixes, with a block dispersion measure introduced to tune the two-stage blocking; (2) multi-party block merging based on a sliding window, which improves the fault tolerance of linkage. The PPRL blocking method of the invention retains the high recall of the LSH method and its ability to partition large data sets quickly, while effectively improving precision.

Description

Two-stage blocking method for privacy-preserving record linkage
Technical field
The invention belongs to the fields of data integration and data privacy and relates mainly to a two-stage blocking method applied in privacy-preserving record linkage.
Background
With the arrival of the big-data era, there is growing demand to mine valuable information by analysing data sets that contain millions of records, and analysing large data sets usually requires integrating data from multiple sources. At the same time, regulations and laws lead many organisations not to let other organisations share their data sets. Privacy-preserving record linkage (PPRL) has been proposed for this purpose: it identifies records in different data sets that represent the same entity without revealing the entities' private information. In a PPRL project, if a record of one data party is judged to match records of the other data parties, that party agrees to disclose some attribute information of the record to the other participants or to an external researcher. Because many organisations need to improve data quality or enrich data for further analysis, the application fields of PPRL keep expanding and include health care, government services, crime detection and business applications. For example, medical researchers investigating adverse reactions to a new drug need to integrate data from different hospitals, clinics and pharmacies; in such a study the researchers obtain only the symptoms that patients show after taking the new drug and need not learn any other information about the patients. PPRL therefore has important and urgent practical value in addition to its theoretical research value.
Research on PPRL mainly covers privacy techniques, blocking methods and matching methods. Privacy techniques must, depending on the number of participants and the attack model, prevent attacks from leaking private information, by encrypting records, embedding records in a reference-set space, adding noise and similar means. Blocking and matching methods must also take the additional privacy requirements into account. A blocking method places similar records in the same block, which reduces the number of 'candidate record groups' generated and improves the efficiency of PPRL; an ideal blocking method balances recall, precision and privacy. A matching method applies a similarity function to each candidate record group and classifies the group as matching or non-matching; the difficulty is to compute the similarity of a candidate record group without knowing the content of the other data parties' records.
Existing PPRL techniques have two shortcomings. 1) Most existing PPRL techniques are suited to linking records between two data parties, and there is little research on PPRL among multiple data parties; a multi-party PPRL method must ensure that increasing the number of data parties does not seriously degrade recall and precision. 2) Existing blocking methods cannot give PPRL high recall and high precision at the same time, for two reasons: on the one hand, blocking still produces too many candidate record groups that are in truth non-matching, which causes extra computation; on the other hand, blocking loses some truly matching record groups, which are then never compared. In practice, data integration among multiple data parties is very common, so studying blocking methods for multi-party PPRL that achieve high recall and high precision simultaneously is of practical importance.
Summary of the invention
The invention provides a two-stage blocking method for privacy-preserving record linkage.
The technical solution adopted by the invention is as follows.
A two-stage blocking method for privacy-preserving record linkage, comprising the following steps:
Step 1. Two-stage blocking that combines locality-sensitive hashing (LSH) with suffixes. Suffix blocking is performed on top of LSH blocking, and the suffix length used for suffix blocking is set according to the block dispersion measured after LSH blocking, so that two-stage blocking improves precision while preserving the high recall of PPRL;
Step 1-1. Bloom filter encoding. A blocking attribute is selected from the attributes shared by the n data parties. Each data party uses the same functions to map the bigrams of its blocking-attribute string into a Bloom filter, producing bf; for example, Pete contains the bigrams _P, Pe, et, te, e_;
Step 1-2. LSH blocking. The n data parties agree on J groups of hash functions H_j, j = 1, ..., J, where each group H_j consists of K hash functions h_k(bf) = bf[l], k = 1, ..., K, and bf[l] is the value of the l-th bit of bf. Each group H_j is applied to the bf of a record to obtain a vector of length K, the hash key, and records with the same vector value are assigned to the same LSH block. Multiple groups H_j are used to improve the fault tolerance of the two-stage blocking method;
Step 1-3. Determining the suffix length from the LSH block dispersion. Each data party randomly selects N of its blocks whose size exceeds X (N is far smaller than the number of existing blocks), and randomly selects q records from each of these N blocks (q is far smaller than the total number of records in a block). m positions of bf that no LSH blocking function acts on are chosen at random; the values of each record's bf at these m positions are concatenated into a sequence, and the probability with which each distinct sequence occurs within a block is counted. The n data parties must agree on the values of X, N and q and on the m positions. The single-block dispersion is given by formula (1):
$$H_j = -\sum_{i=1}^{I} P_i \ln P_i \qquad (1)$$
where j denotes the j-th of the nN blocks, j = 1, ..., nN, and P_i is the probability with which the i-th of the I distinct sequences occurs in this block, i = 1, ..., I;
The overall dispersion is computed from the single-block dispersions and is used to assess the combined block dispersion of the n data parties. The overall block dispersion is given by formula (2); in the embodiment, with n = 3 and N = 10, it is the sum of the 30 single-block dispersions divided by 3 ln 10:
$$H_s = \frac{\sum_{j=1}^{nN} H_j}{n \ln N} \qquad (2)$$
By increasing the number of hash functions per group in LSH, the value of H_s at which LSH blocking alone just reaches an acceptable precision is determined and taken as the threshold θ. When H_s of the LSH blocks is less than or equal to θ, the LSH method has already separated low-similarity records into different blocks, so suffix blocking is unnecessary and the subsequent block merging step only needs to merge the parties' blocks that share the same LSH-key. If H_s of the preliminary blocking is greater than θ, the minimum suffix length l_min is chosen as in formula (3):
$$l_{min} = l_t \, \frac{H_s - \theta}{H_t - \theta} \qquad (3)$$
where H_t is the overall dispersion measured experimentally after LSH blocking on data of the same quality as the data sets to be linked, and l_t is the best minimum suffix length corresponding to H_t;
Step 1-4. Suffix blocking. For the LSH blocks produced by each group H_j, suffixes of x different lengths (l_min, l_min+1, ..., l_min+x-1) are used to perform x rounds of suffix blocking on every LSH block; within each LSH block, records whose bf suffix values are identical are assigned to the same block. After two-stage blocking, each record therefore appears in Jx blocks;
Step 1-5. Block signature generation. After two-stage blocking, each block is represented by a signature <id, LSH-key, suffix>, where id is the block number, LSH-key is the block's vector value from LSH blocking, and suffix is the block's suffix value from suffix blocking;
Step 2. Multi-party block merging based on a sliding window. An additional participant P_{n+1} uses a sliding window to merge the blocks of the n data parties into final blocks, which improves the fault tolerance of linkage and further secures the high recall of PPRL;
Step 2-1. Sorting block signatures. P_{n+1} collects the block signatures received from the n parties and, for signatures with the same LSH-key and the same suffix length, sorts them by the binary value of the suffix to form a signature list;
Step 2-2. Generating final blocks within the sliding window. A sliding window of size w is slid over each signature list; only if a window contains blocks from all n data parties are all blocks in that window merged into one final block. Within a final block, every n records that come from n different data parties form a candidate record group.
Advantages of the invention:
Multi-party block merging based on a sliding window improves the fault tolerance of linkage. The PPRL blocking method of the invention retains the high recall of the LSH method and its ability to partition large data sets quickly, while effectively improving precision.
Description of the drawings
Fig. 1 is the overall flow chart of the invention.
Fig. 2 illustrates the Bloom filters of the example records.
Fig. 3 illustrates block merging based on the sliding window.
Detailed description of the embodiments
The following is a specific implementation example of the invention.
A two-stage blocking method for multi-party PPRL, comprising the following steps:
Step 1. Two-stage blocking that combines locality-sensitive hashing (Locality-Sensitive Hashing, LSH) with suffixes. Suffix blocking is performed on top of LSH blocking, and the suffix length used for suffix blocking is set according to the block dispersion measured after LSH blocking, so that two-stage blocking improves precision while preserving the high recall of PPRL.
Step 1-1. Bloom filter encoding. A blocking attribute is selected from the attributes shared by the n data parties. Each data party uses the same functions to map the bigrams of its blocking-attribute string into a Bloom filter, producing bf (for example, Pete contains the bigrams _P, Pe, et, te, e_).
Step 1-2. LSH blocking. The n data parties agree on J groups of hash functions H_j, j = 1, ..., J, where each group H_j consists of K hash functions h_k(bf) = bf[l], k = 1, ..., K, and bf[l] is the value of the l-th bit of bf. Each group H_j is applied to the bf of a record to obtain a vector of length K, the hash key, and records with the same vector value are assigned to the same LSH block. Multiple groups H_j are used to improve the fault tolerance of the two-stage blocking method.
Step 1-3. Determining the suffix length from the LSH block dispersion. Each data party randomly selects N of its blocks whose size exceeds X (N is far smaller than the number of existing blocks), and randomly selects q records from each of these N blocks (q is far smaller than the total number of records in a block). m positions of bf that no LSH blocking function acts on are chosen at random; the values of each record's bf at these m positions are concatenated into a sequence, and the probability with which each distinct sequence occurs within a block is counted. The n data parties must agree on the values of X, N and q and on the m positions. The single-block dispersion is given by formula (1):
$$H_j = -\sum_{i=1}^{I} P_i \ln P_i \qquad (1)$$
where j denotes the j-th of the nN blocks, j = 1, ..., nN, and P_i is the probability with which the i-th of the I distinct sequences occurs in this block, i = 1, ..., I.
The overall dispersion is computed from the single-block dispersions and is used to assess the combined block dispersion of the n data parties. The overall block dispersion is given by formula (2); in the example below, with n = 3 and N = 10, it is the sum of the 30 single-block dispersions divided by 3 ln 10:
$$H_s = \frac{\sum_{j=1}^{nN} H_j}{n \ln N} \qquad (2)$$
By increasing the number of hash functions per group in LSH, the value of H_s at which LSH blocking alone just reaches an acceptable precision is determined and taken as the threshold θ. When H_s of the LSH blocks is less than or equal to θ, the LSH method has already separated low-similarity records into different blocks, so suffix blocking is unnecessary and the subsequent block merging step only needs to merge the parties' blocks that share the same LSH-key. If H_s of the preliminary blocking is greater than θ, the minimum suffix length l_min is chosen as in formula (3):
$$l_{min} = l_t \, \frac{H_s - \theta}{H_t - \theta} \qquad (3)$$
where H_t is the overall dispersion measured experimentally after LSH blocking on data of the same quality as the data sets to be linked, and l_t is the best minimum suffix length corresponding to H_t.
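The dispersion computation of step 1-3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the function names are hypothetical, the single-block dispersion is taken to be the information entropy with natural logarithm that the example section describes, and the normalisation by n ln N is inferred from the example, which divides the sum of 30 single-block values by 3 ln 10.

```python
import math
from collections import Counter


def sample_sequence(bf, positions):
    """Concatenate the bits of one record's bf at the m agreed positions."""
    return "".join(str(bf[p]) for p in positions)


def block_dispersion(sequences):
    """Entropy (natural log) of the sequences sampled from one block."""
    counts = Counter(sequences)
    total = len(sequences)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def overall_dispersion(block_values, n_parties, blocks_per_party):
    """Assumed normalisation: sum of single-block values over n * ln(N)."""
    return sum(block_values) / (n_parties * math.log(blocks_per_party))
```

A block whose sampled records all share one sequence has dispersion 0; the more evenly the sampled sequences spread over distinct values, the larger the dispersion, which signals that LSH blocking alone has grouped dissimilar records together.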
Step 1-4. Suffix blocking. For the LSH blocks produced by each group H_j, suffixes of x different lengths (l_min, l_min+1, ..., l_min+x-1) are used to perform x rounds of suffix blocking on every LSH block; within each LSH block, records whose bf suffix values are identical are assigned to the same block. After two-stage blocking, each record therefore appears in Jx blocks.
Step 1-5. Block signature generation. After two-stage blocking, each block is represented by a signature <id, LSH-key, suffix>, where id is the block number, LSH-key is the block's vector value from LSH blocking, and suffix is the block's suffix value from suffix blocking.
Step 2. Multi-party block merging based on a sliding window. An additional participant P_{n+1} uses a sliding window to merge the blocks of the n data parties into final blocks, which improves the fault tolerance of linkage and further secures the high recall of PPRL.
Step 2-1. Sorting block signatures. P_{n+1} collects the block signatures received from the n parties and, for signatures with the same LSH-key and the same suffix length, sorts them by the binary value of the suffix to form a signature list.
Step 2-2. Generating final blocks within the sliding window. A sliding window of size w is slid over each signature list; only if a window contains blocks from all n data parties are all blocks in that window merged into one final block. Within a final block, every n records that come from n different data parties form a candidate record group.
Example
P1, P2 and P3 are, respectively, a data set of residents' pharmacy drug purchases, a data set of residents' genetic data and a data set of hospital visit information. The purpose of PPRL is to identify the record triples across the 3 data sets that represent the same user; the resident name attribute shared by the 3 data sets serves as the blocking attribute. In this use case the proposed two-stage blocking method is applied to the tens of thousands of records in P1, P2 and P3 so that similar records of the 3 parties are placed in the same final block.
To demonstrate the effect of the proposed blocking method, the example analyses 9 records, 3 from each of P1, P2 and P3, which refer to 4 specific users. Table 1 lists these 9 records and their name attribute values. Record r_ij is the j-th of the 3 records of P_i. Records r11, r21, r31 represent user 1 and records r13, r23, r33 represent user 4; users 2 and 3 are not referred to by records in all 3 data sets. A suitable blocking method should place the 3 records of user 1 in one final block and the 3 records of user 4 in one final block.
Table 1. Relationship between records and users
The implementation steps of the two-stage blocking method in privacy-preserving record linkage follow, together with an analysis of how the 9 records selected from the case data sets pass through each step:
Step 1. P1, P2 and P3 first each perform LSH blocking. The 3 parties then jointly compute the overall dispersion of the LSH blocks and from it determine a suffix length common to all 3 parties, and each party then performs suffix blocking on top of its own LSH blocks.
Step 1-1. P1, P2 and P3 encode the name attribute value of each record as a Bloom filter. All bits of the Bloom filter are initially 0; each bigram of the attribute string is hashed to two bit positions of the Bloom filter, and these two positions are set to 1. The resulting Bloom filter of length 100 for each record is denoted bf.
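The encoding in step 1-1 can be illustrated with a short Python sketch. It is only a sketch under assumptions: the patent does not name the hash functions, so seeded SHA-256 is used here as a stand-in, and the names bigrams and bloom_filter are hypothetical. The filter length of 100 and the two bit positions per bigram come from the text.

```python
import hashlib


def bigrams(value):
    """Pad the string with '_' on both sides and split it into bigrams."""
    padded = f"_{value}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def bloom_filter(value, length=100, num_hashes=2):
    """Map every bigram to num_hashes bit positions and set them to 1.

    Seeded SHA-256 is an assumed stand-in for the shared hash functions.
    """
    bf = [0] * length
    for gram in bigrams(value):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bf[int.from_bytes(digest[:8], "big") % length] = 1
    return bf
```

Because all parties use the same functions, identical names yield identical bit vectors, and names that share most bigrams (such as Pete and Peter) yield vectors that differ in only a few positions.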
Step 1-2. P1, P2 and P3 use 2 groups of hash functions, H1 and H2, to perform LSH blocking on their records. H1 and H2 each contain 10 functions h_k(bf) = bf[l], k = 1, ..., 10, where the l for each h_k is chosen at random from 0 to 99. The values of a record's bf at the 10 positions of H1 (or H2) are concatenated into a binary sequence, which is the hash key of the record under that group of functions. Within each data set, records with the same hash key under the same group (H1 or H2) are assigned to the same LSH block.
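The bit-sampling form of LSH described in step 1-2 can be sketched as below. The function names and the toy 8-bit vectors are hypothetical and chosen only to keep the illustration small; the example in the text uses 100-bit filters and 10 positions per group.

```python
from collections import defaultdict


def lsh_key(bf, positions):
    """Concatenate bf[l] for every sampled position l into the hash key."""
    return "".join(str(bf[l]) for l in positions)


def lsh_blocks(records, positions):
    """Group record ids by hash key; records maps record id -> bit list."""
    blocks = defaultdict(list)
    for rec_id, bf in records.items():
        blocks[lsh_key(bf, positions)].append(rec_id)
    return dict(blocks)


# Toy data: three 8-bit filters and one group sampling positions 0, 3 and 5.
toy_records = {
    "a": [1, 0, 0, 1, 0, 0, 1, 1],
    "b": [1, 1, 0, 1, 0, 0, 0, 1],
    "c": [0, 0, 0, 1, 0, 1, 1, 1],
}
toy_blocks = lsh_blocks(toy_records, [0, 3, 5])
```

Records a and b differ at positions 1 and 6, which the group does not sample, so they receive the same key and land in the same block. Running several groups with different positions gives records that differ in a sampled bit under one group another chance to meet under a different group.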
Step 1-3. Determining the suffix length from the LSH block dispersion. Each of the 3 data parties randomly selects 10 of its LSH blocks whose size exceeds 200 and randomly selects 30 records from each of these 10 blocks. 5 positions of bf that neither H1 nor H2 acts on are selected, and the values of each record's bf at these 5 positions are concatenated into a sequence. The probability P_i with which each distinct sequence occurs in a block is counted, and the single-block dispersion of each block is computed as the information entropy H_j = -Σ P_i ln P_i, which expresses how much the sequences differ. For example, only the two sequences 01000 and 01010 occur in one block of P1, with probabilities 7/10 and 3/10 respectively, and the single-block dispersion of this block is H1 = -7/100*ln(7/100) - 3/100*ln(3/100) = 0.291. The single-block dispersions of the 30 blocks chosen by the 3 parties lie between 0.278 and 0.115. The overall dispersion is the ratio of the sum of these 30 single-block dispersions to 3 ln 10, which gives H_s = 1.08. When H_s is sufficiently small, the LSH method has already separated low-similarity records into different blocks and suffix blocking is unnecessary. In this use case H_s exceeds the threshold θ = 0.5, so suffix blocking is required. H_t = 1.2 is the overall dispersion measured experimentally after LSH blocking on data of the same quality as the 3 case data sets, and l_t = 12 is the minimum suffix length obtained experimentally for H_t. Using the reference values H_t and l_t, the optimal minimum suffix length for H_s = 1.08 is l_min = l_t(H_s - θ)/(H_t - θ) = 12*(1.08 - 0.5)/(1.2 - 0.5) = 9.94, which is approximately 10. Because the case data are only lightly perturbed, suffixes of two lengths are used, so suffix blocking uses suffixes of 10 bits and 11 bits.
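The suffix-length arithmetic at the end of step 1-3 can be checked with a small Python sketch. The function names are hypothetical, and rounding l_min to the nearest integer is an assumption based on the text treating 9.94 as approximately 10.

```python
def min_suffix_length(h_s, theta, h_t, l_t):
    """Formula (3): scale the reference length l_t by (H_s - θ) / (H_t - θ)."""
    return l_t * (h_s - theta) / (h_t - theta)


def suffix_lengths(h_s, theta, h_t, l_t, x):
    """Return the x suffix lengths to use, or [] when H_s <= θ.

    Rounding to the nearest integer (at least 1) is an assumption.
    """
    if h_s <= theta:
        return []  # LSH blocking alone is precise enough
    l_min = max(1, round(min_suffix_length(h_s, theta, h_t, l_t)))
    return [l_min + i for i in range(x)]


# Values from the example: H_s = 1.08, θ = 0.5, H_t = 1.2, l_t = 12, x = 2.
example_lengths = suffix_lengths(1.08, 0.5, 1.2, 12, 2)  # [10, 11]
```

With the example values the unrounded result is about 9.94, so the two suffix lengths are 10 and 11, matching the text.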
Step 1-4. Suffix blocking is performed with bf suffixes of length 10 and of length 11; within each LSH block, records with the same suffix value are assigned to the same block. After two-stage blocking, each record appears in 4 blocks.
Step 1-5. P1, P2 and P3 each generate their block signatures. Each signature has three items: the first is the block number B_ij, denoting the j-th block of P_i; the second is the LSH-key shared by the records in the block; the third is the suffix value shared by the records in the block.
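Steps 1-4 and 1-5 can be sketched together in Python. This is an illustrative sketch only: the function names, the representation of bf as a bit string and the block numbering scheme are assumptions, and the toy inputs are 4 bits long rather than 100.

```python
from collections import defaultdict


def suffix_blocks(lsh_blocks, bfs, lengths):
    """Split every LSH block once per suffix length.

    lsh_blocks maps LSH-key -> record ids; bfs maps record id -> bit string.
    The result maps (LSH-key, suffix value) -> record ids.
    """
    blocks = defaultdict(list)
    for key, rec_ids in lsh_blocks.items():
        for length in lengths:
            for rec_id in rec_ids:
                blocks[(key, bfs[rec_id][-length:])].append(rec_id)
    return dict(blocks)


def signatures(party, blocks):
    """Build <id, LSH-key, suffix> signatures; the id format is assumed."""
    return [(f"B{party}_{j}", key, suffix)
            for j, (key, suffix) in enumerate(sorted(blocks), start=1)]


toy_bfs = {"r1": "0110", "r2": "1110", "r3": "0111"}
toy_blocks = suffix_blocks({"01": ["r1", "r2", "r3"]}, toy_bfs, [2, 3])
toy_signatures = signatures(1, toy_blocks)
```

Only the signatures, not the records, need to be sent to the additional participant for merging. In the toy run each record appears in two blocks (one group times two suffix lengths), which mirrors the Jx count in the text.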
The following analyses how the 9 example records fare in step 1.
Fig. 2 shows the bf sequences of the example records and how H1 acts on the bf of the 9 example records; the h_k in H1 correspond to the 10 bf positions 1, 9, 23, 24, 42, 58, 71, 72, 85 and 94. It also shows the suffix values of the two lengths for the 9 example records. For example, for record r11 the hash key under H1 is 0100100001, the suffix value of length 10 is 0001001000, and the suffix value of length 11 is 00001001000.
The LSH-key values and suffix values of the 9 records after step 1 are analysed, and on that basis what may happen to them in the subsequent merging process is explained. Table 2 shows the records of the 3 parties that share the same LSH-key value and the same suffix value of length 11. Block signatures with the same LSH-key value appear in the same list, and the blocks they represent may be merged. As Table 2 shows, because r11, r21 and r31 all describe user 1 with the identical attribute value John, the 3 records have the same LSH-key under both H1 and H2, and the suffix values of their blocks are identical for suffix length 10 or 11, so the merging stage is certain to place these 3 records in the same final block. r12, r22 and r32 do not represent the same user, but their name attribute values Jones and Stone share several bigrams. Their LSH-keys under H1 are identical but their suffix values differ, so whether they end up in the same final block depends on how the blocks are merged after signature sorting; their LSH-keys under H2 differ, so two-stage blocking with H2 will not place these 3 records in the same final block. The LSH-keys of r13, r23 and r33 under H1 differ: although the records represent user 4, their name attribute values are not identical, and this situation arises when at least one of the 6 Bloom filter positions corresponding to e_, er and r_ is among the 10 positions l of H1. Two-stage blocking with H1 therefore cannot recognise that r13, r23 and r33 represent the same user. After LSH blocking with H2, r13, r23 and r33 have the same LSH-key, because none of the 6 positions corresponding to e_, er and r_ is among the 10 positions l of H2, so two-stage blocking with H2 can recognise that r13, r23 and r33 represent the same user.
Table 2. Two-stage blocking information
Step 2. The additional participant P4 uses a sliding window to merge the blocks of P1, P2 and P3 into final blocks.
Step 2-1. For block signatures with the same LSH-key value and the same suffix length, P4 sorts them in ascending order of the binary suffix value. Blocking the case data of P1, P2 and P3 yields 35 distinct LSH-key values under H1 and 37 under H2, that is, 72 LSH-key values in total for the two groups of functions; each LSH-key value has one list for each of the two suffix lengths, giving 144 signature lists in all.
Step 2-2. P4 uses a sliding window of size 5. The window slides down each signature list from the top, one row at a time, until it reaches the bottom. If a window contains block signatures from all 3 parties, the 5 blocks represented by the signatures in the window are merged into a final block. In the subsequent matching stage of PPRL, candidate record groups are generated within the final blocks, and matching computations decide whether a group of records represents the same user.
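The sliding-window merge of steps 2-1 and 2-2 can be sketched as follows for a single signature list, that is, for signatures that already share one LSH-key and one suffix length. The function name, the tuple layout (party, block id, suffix) and the toy data are assumptions made for illustration; duplicate final blocks from overlapping windows are dropped, which the text does not specify.

```python
def merge_with_sliding_window(signature_list, n_parties, w):
    """Sort signatures by binary suffix value and merge each window of
    w consecutive blocks that contains blocks from all n parties."""
    ordered = sorted(signature_list, key=lambda s: int(s[2], 2))
    final_blocks = []
    for start in range(max(1, len(ordered) - w + 1)):
        window = ordered[start:start + w]
        parties = {party for party, _, _ in window}
        if len(parties) == n_parties:
            merged = frozenset(block_id for _, block_id, _ in window)
            if merged not in final_blocks:
                final_blocks.append(merged)
    return final_blocks


toy_list = [
    (1, "B1_1", "0001"), (2, "B2_1", "0010"), (3, "B3_1", "0011"),
    (1, "B1_2", "1000"), (1, "B1_3", "1001"),
]
toy_final = merge_with_sliding_window(toy_list, n_parties=3, w=3)
```

In the toy run the first two windows each contain all three parties and are merged, while the last window contains blocks from only two parties and is skipped. A larger w tolerates more low-order bit differences between suffixes at the cost of larger final blocks.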
Four lists that show how the blocks containing the example records are merged are analysed, as shown in the 4 tables in Fig. 3. Each table holds all blocks that obtain the same LSH-key value under the same group of functions. Only the signatures of blocks that contain example records are shown explicitly; the remaining block signatures are replaced by asterisks. The first column of each list gives the block number and the example records the block contains, and the second column gives the block's suffix value. Fig. 3(1) and Fig. 3(2) show, under H1, the merging in the two lists of suffix length 11 that contain r11, r21, r31 and r12, r22, r32: the blocks containing r11, r21, r31 are merged, whereas the block containing r32 is far from the blocks containing r12 and r22 in the list and is not merged with them. Fig. 3(3) and Fig. 3(4) show, under H2, the merging in the two block lists of suffix lengths 11 and 10 that contain r13, r23, r33. In Fig. 3(3), with suffix length 11, the suffix value for Peter is 11000000100 and that for Pete is 01000000100; because they differ in the highest bit, the blocks containing r13 and r23 are far from the block containing r33 and are not merged. In Fig. 3(4), with suffix length 10, the suffix values of the blocks containing r13, r23, r33 are identical, and the blocks containing the three records are merged into the same final block. More generally, if the suffixes of Peter and Pete differed at other bf positions, then the lower the differing bit, the closer the values and the closer the positions after sorting, and the more likely the blocks containing the three records are to be merged.
Among the example records, r11, r21, r31 of user 1 are in the same final block, r13, r23, r33 of user 4 are in the same final block, and the similar records r12, r22, r32 are not in the same final block. The subsequent stage of PPRL performs matching computations on the generated candidate record groups (r11, r21, r31) and (r13, r23, r33) to decide whether the records in each group really represent the same user.

Claims (1)

1. the secondary method of partition in a kind of secret protection record link, it is characterised in that: the following steps are included:
The secondary method of partition of step 1. local sensitivity Hash LSH combination suffix;On the basis of local sensitivity Hash LSH piecemeal Suffix piecemeal is carried out, according to the suffix lengths of the piecemeal dispersion degree setting suffix piecemeal after local sensitivity Hash LSH piecemeal, makes two Secondary piecemeal improves precision ratio under the premise of guaranteeing PPRL high recall ratio;
Step 1-1.Bloom Filter coding;Piecemeal attribute, each data are selected from the jointly owned attribute in n data side Side respectively carries out Bloom Filter mapping to the binary group of its piecemeal attribute value character string with identical function and generates bf, such as Pete includes binary group _ P, Pe, et, te, e_;
Step 1-2.LSH piecemeal;N data side determines the consistent hash function H of J groupj, j=1 ..., J, every group of HjBy K Hash FunctionIt constitutes, k=1 ..., K;Bf [l] is the value of first of position of bf;For every group of Hj, with it to note The bf of record carries out Hash mapping, obtains the vector i.e. hash key that length is K, and the identical record of vector value is assigned to same LSH In piecemeal;Select multiple groups HjIt is the serious forgiveness in order to improve secondary method of partition;
Step 1-3. is determined based on the suffix lengths of LSH piecemeal dispersion degree;Each data side its size greater than X piecemeal in Machine selection is N number of, is much smaller than existing piecemeal quantity, q item record is randomly selected out of this N number of piecemeal respectively, much smaller than in piecemeal Record sum;The position that m LSH piecemeal function never acts on is randomly choosed on bf, connects every record this m position bf Value forms a sequence, counts the probability that different sequences occur in a piecemeal;N data side should determine that consistent x, N, q take Value and m position;Shown in monolithic dispersion degree such as formula (1):
where j denotes the j-th of the nN blocks, j = 1, ..., nN, and P_i is the probability with which the i-th distinct sequence occurs in this block, i = 1, ..., I;
The overall dispersion is calculated from the single-block dispersions and is used to assess the combined block dispersion of the n data parties; the overall block dispersion is given by formula (2):
By increasing the number of hash functions in each LSH group, the value of H_s at which LSH blocking alone just reaches an acceptable precision is determined and taken as the threshold θ; when H_s of the LSH blocks is less than or equal to θ, the LSH method has already separated records of low similarity into different blocks, so no suffix blocking is needed, and the subsequent block merging step only needs to merge the blocks of the parties that have the same LSH-key; if H_s of the preliminary blocking is greater than θ, the minimum suffix length l_min is chosen as shown in formula (3):
where H_t is the overall dispersion obtained in a test by applying LSH blocking to data of the same quality as the data sets to be linked, and l_t is the best minimum suffix length corresponding to H_t;
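The sampling and thresholding logic of step 1-3 can be sketched as follows. Formulas (1) to (3) are not reproduced in the text above, so the example substitutes stand-ins and should not be read as the claimed formulas: Shannon entropy over the probabilities P_i for the single-block dispersion, and the arithmetic mean over the sampled blocks for the overall dispersion H_s. The computation of l_min from H_t and l_t is deliberately left out. All function names are hypothetical.

```python
import math
from collections import Counter


def sequence_probabilities(sampled_bfs, positions):
    """Probabilities P_i of the distinct m-bit sequences in one sampled block."""
    sequences = [tuple(bf[p] for p in positions) for bf in sampled_bfs]
    counts = Counter(sequences)
    total = len(sequences)
    return [count / total for count in counts.values()]


def block_dispersion(probabilities):
    """Assumed stand-in for formula (1): Shannon entropy of the P_i."""
    return -sum(p * math.log2(p) for p in probabilities)


def overall_dispersion(block_values):
    """Assumed stand-in for formula (2): mean of the single-block values."""
    return sum(block_values) / len(block_values)


def needs_suffix_blocking(h_s, theta):
    """Suffix blocking is applied only when H_s exceeds the threshold theta."""
    return h_s > theta
```

Under these stand-ins, a block whose sampled records all share one sequence has dispersion 0, and the value grows as the sampled records spread over more distinct sequences.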
Step 1-4. Suffix blocking: for each group of LSH blocks generated by H_j, suffixes of x different lengths (l_min, l_min+1, ..., l_min+x-1) are used to perform x suffix blockings on each LSH block; records within an LSH block that have the same bf suffix value are assigned to the same block, so that after secondary blocking one record appears in Jx blocks;
Step 1-5. Block signature generation: after secondary blocking, each block is represented by a signature <id, LSH-key, suffix>, where id is the block number, LSH-key is the corresponding vector value from the LSH blocking process, and suffix is the corresponding suffix value from the suffix blocking process;
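Steps 1-4 and 1-5 can be sketched together for a single LSH block. The representation of a suffix as a bit string and the way block ids are numbered are assumptions made for the example; the function names are hypothetical.

```python
from collections import defaultdict


def suffix_blocks(lsh_block, l_min, x):
    """Split one LSH block (id -> bf) by bf suffixes of x different lengths.

    Suffixes of different lengths are different strings, so each record
    appears in exactly x of the resulting blocks.
    """
    blocks = defaultdict(list)
    for length in range(l_min, l_min + x):
        for rec_id, bf in lsh_block.items():
            suffix = "".join(str(bit) for bit in bf[-length:])
            blocks[suffix].append(rec_id)
    return dict(blocks)


def block_signatures(lsh_key, blocks):
    """Build one <id, LSH-key, suffix> signature per secondary block.

    Assumption: ids are assigned by enumerating the suffixes in sorted order.
    """
    return [(block_id, lsh_key, suffix)
            for block_id, suffix in enumerate(sorted(blocks))]
```

Applied to each of the J LSH blocks that contain a record, this yields the Jx secondary blocks per record described in step 1-4. Only the signatures, not the records, need to be sent on for merging.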
Step 2. Multi-party block merging based on a sliding window: an additional participant P_{n+1} uses a sliding window to merge the respective blocks of the n data parties into final blocks, which improves the fault tolerance of the linkage and further ensures the high recall of PPRL;
Step 2-1. Sorting of block signatures: P_{n+1} collects the block signatures received from the n parties and, for the block signatures that have the same LSH-key and the same suffix length, sorts them by the binary magnitude of the suffix value to form a signature list;
Step 2-2. Generating final blocks in the sliding window: a sliding window of size w is slid over each signature list; only if the same window contains blocks from all n data parties are all blocks in that window merged into one final block; in a final block, every n records from different data parties form a candidate record group.
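Steps 2-1 and 2-2 can be sketched for one signature list, that is, for the signatures that share an LSH-key and a suffix length. The dictionary layout of a signature entry (`party`, `suffix`, `records`) and the function names are hypothetical; the sketch also carries record identifiers along with each signature for simplicity, and it does not deduplicate final blocks produced by overlapping windows.

```python
from itertools import product


def merge_with_sliding_window(signature_list, n, w):
    """Merge blocks whose sorted suffixes fall into a window covering all n parties.

    Each entry is a dict with keys 'party', 'suffix' (bit string), 'records'.
    Returns a list of final blocks, each mapping party -> list of records.
    """
    ordered = sorted(signature_list, key=lambda s: int(s["suffix"], 2))
    final_blocks = []
    for start in range(max(1, len(ordered) - w + 1)):
        window = ordered[start:start + w]
        parties = {s["party"] for s in window}
        if len(parties) == n:  # the window holds blocks from every data party
            merged = {}
            for s in window:
                merged.setdefault(s["party"], []).extend(s["records"])
            final_blocks.append(merged)
    return final_blocks


def candidate_groups(final_block):
    """Every combination of one record per party is a candidate record group."""
    parties = sorted(final_block)
    return list(product(*(final_block[p] for p in parties)))
```

Sorting by binary value places blocks with numerically close suffixes next to each other, so a window of size w > 1 can bring together blocks whose suffixes differ slightly, which is the source of the fault tolerance claimed for step 2.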
CN201811101295.6A 2018-09-20 2018-09-20 Secondary method of partition in secret protection record link Withdrawn CN109308423A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811101295.6A CN109308423A (en) 2018-09-20 2018-09-20 Secondary method of partition in secret protection record link

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811101295.6A CN109308423A (en) 2018-09-20 2018-09-20 Secondary method of partition in secret protection record link

Publications (1)

Publication Number Publication Date
CN109308423A true CN109308423A (en) 2019-02-05

Family

ID=65225030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811101295.6A Withdrawn CN109308423A (en) 2018-09-20 2018-09-20 Secondary method of partition in secret protection record link

Country Status (1)

Country Link
CN (1) CN109308423A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866283A (en) * 2019-11-25 2020-03-06 浙江工商大学 Multi-party verifiable data record linking method based on block chain and partial homomorphic encryption
CN111246431A (en) * 2020-04-26 2020-06-05 北京全路通信信号研究设计院集团有限公司 Analysis and evaluation method and system for multi-source data of railway train control equipment
CN114282255A (en) * 2022-03-04 2022-04-05 支付宝(杭州)信息技术有限公司 Sorting sequence merging method and system based on secret sharing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
佟丹妮等: "多方强隐私保护记录链接方法", 《计算机科学与探索》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866283A (en) * 2019-11-25 2020-03-06 浙江工商大学 Multi-party verifiable data record linking method based on block chain and partial homomorphic encryption
CN110866283B (en) * 2019-11-25 2021-09-21 浙江工商大学 Multi-party verifiable data record linking method based on block chain and partial homomorphic encryption
CN111246431A (en) * 2020-04-26 2020-06-05 北京全路通信信号研究设计院集团有限公司 Analysis and evaluation method and system for multi-source data of railway train control equipment
CN111246431B (en) * 2020-04-26 2020-09-08 北京全路通信信号研究设计院集团有限公司 Analysis and evaluation method and system for multi-source data of railway train control equipment
CN114282255A (en) * 2022-03-04 2022-04-05 支付宝(杭州)信息技术有限公司 Sorting sequence merging method and system based on secret sharing

Similar Documents

Publication Publication Date Title
US11977541B2 (en) Systems and methods for rapid data analysis
Zhang et al. A privacy leakage upper bound constraint-based approach for cost-effective privacy preserving of intermediate data sets in cloud
Liu et al. Multi-constrained graph pattern matching in large-scale contextual social graphs
Baralis et al. Generalized association rule mining with constraints
CN109308423A (en) Secondary method of partition in secret protection record link
CN109117669B (en) Privacy protection method and system for MapReduce similar connection query
US11132360B2 (en) Accessing datasets
CN109710789A (en) Search method, device, electronic equipment and the computer storage medium of image data
CN104077723A (en) Social network recommending system and social network recommending method
Teng et al. An Efficient and Secure Cipher-Text Retrieval Scheme Based on Mixed Homomorphic Encryption and Multi-Attribute Sorting Method.
CN108197491A (en) A kind of subgraph search method based on ciphertext
Mueller et al. SoK: Differential privacy on graph-structured data
US7363320B2 (en) Method and system for correlating data from multiple sources without compromising confidentiality requirements
CN107070932B (en) Anonymous method for preventing label neighbor attack in social network dynamic release
Amir et al. A brief review of conditions, circumstances and applicability of sampling techniques in computer science domain
CN105787800B (en) Intelligent social platform potential relationship retrieval device, system and method
Shaham et al. Machine learning aided anonymization of spatiotemporal trajectory datasets
EP4182827A1 (en) Method and system for secure distributed software-service
Dhanalakshmi et al. Privacy preserving data mining techniques-survey
Nia et al. Leveraging social interactions to suggest friends
Hongde et al. Differential privacy data aggregation optimizing method and application to data visualization
Kuijpers et al. Analyzing trajectories using uncertainty and background information
Hamidi et al. Secure Two-party Agglomerative Hierarchical Clustering Construction.
Patsakis et al. Privacy-aware genome mining: Server-assisted protocols for private set intersection and pattern matching
Janakiraman et al. How are you related? Predicting the type of a social relationship using call graph data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190205