CN109308423A - Secondary blocking method in privacy-preserving record linkage - Google Patents
- Publication number
- CN109308423A (application CN201811101295.6A)
- Authority
- CN
- China
- Prior art keywords
- blocking
- lsh
- suffix
- record
- value
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6227—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Bioethics (AREA)
- General Health & Medical Sciences (AREA)
- Computer Hardware Design (AREA)
- Databases & Information Systems (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Medical Informatics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a secondary blocking method for privacy-preserving record linkage (PPRL), in the field of data integration and data privacy. Each data source encodes its records as Bloom filters and then performs two steps: (1) secondary blocking that combines LSH with suffixes, in which block dispersion is introduced to tune the secondary blocking; (2) multi-party block merging based on a sliding window, which improves the fault tolerance of the linkage. The PPRL blocking method of the invention retains the high recall of the LSH method and its ability to partition large data sets quickly, while effectively improving precision.
Description
Technical field
The invention belongs to the field of data integration and data privacy and relates in particular to a secondary blocking method applied in privacy-preserving record linkage.
Background art
With the arrival of the big-data era, there is growing demand for mining valuable information from data sets containing millions of records, and analyzing large data sets usually requires integrating data from multiple sources. At the same time, many organizations, for regulatory and legal reasons, do not allow other organizations to share their data sets. Privacy-preserving record linkage (PPRL) was proposed for this purpose: it identifies records in different data sets that refer to the same entity without revealing the entities' private information. In a PPRL project, if a record of one data party is judged to match records of the other data parties, that party agrees to disclose part of the attribute information in the record to the other participants or to an external researcher. Because many organizations need to improve data quality or enrich their data for further analysis, the application areas of PPRL keep expanding and include health care, government services, crime detection, and business applications. For example, medical researchers investigating adverse reactions to a new drug need to integrate data from different hospitals, clinics, and pharmacies; in such a study the researchers obtain only the symptoms patients show after using the new drug and do not need to learn any other information about the patients. PPRL therefore has important and urgent practical value in addition to its theoretical research value.
The main research topics in PPRL are privacy techniques, blocking methods, and matching methods. Privacy techniques must, according to the number of participating parties and the attack model, prevent attacks from leaking private information by encrypting records, embedding records in a reference-set space, adding noise, and similar means. Blocking and matching methods must also take these additional privacy requirements into account. A blocking method assigns similar records to the same block, which reduces the number of "candidate record groups" generated and improves PPRL efficiency; an ideal blocking method balances recall, precision, and privacy. A matching method applies a similarity function to each candidate record group and classifies it as a matching or a non-matching record group; the difficulty lies in computing the similarity of a candidate record group without knowing the content of the other data parties' records.
Existing PPRL techniques have two shortcomings. 1) Most existing PPRL techniques are designed for record linkage between two data parties, and PPRL among multiple data parties has been studied little; a multi-party PPRL method must ensure that an increase in the number of data parties does not seriously degrade recall and precision. 2) Existing blocking methods cannot make PPRL achieve high recall and high precision at the same time, for two reasons: on the one hand, blocking can still produce too many candidate record groups that do not truly match, which incurs extra computation cost; on the other hand, truly matching record groups are lost during blocking and never reach the matching computation. In practice, data integration among multiple data parties is very common, so studying blocking methods for multi-party PPRL that achieve high recall and high precision at the same time is of real practical significance.
Summary of the invention
The present invention provides a secondary blocking method for privacy-preserving record linkage.
The technical solution adopted by the invention is as follows:
A secondary blocking method for privacy-preserving record linkage, comprising the following steps:
Step 1. Secondary blocking that combines locality-sensitive hashing (LSH) with suffixes; suffix blocking is performed on top of LSH blocking, and the suffix length used for suffix blocking is set according to the block dispersion measured after LSH blocking, so that secondary blocking improves precision while preserving the high recall of PPRL;
Step 1-1. Bloom filter encoding; a blocking attribute is selected from the attributes common to the n data parties, and each data party maps the bigrams of its blocking-attribute value strings into a Bloom filter using the same hash functions, producing bf; for example, Pete contains the bigrams _P, Pe, et, te, e_;
Step 1-2. LSH blocking; the n data parties agree on J groups of hash functions H_j, j = 1, ..., J; each group H_j consists of K hash functions h_k(bf) = bf[l], k = 1, ..., K, where bf[l] is the value of the l-th bit of bf; each group H_j is applied to the bf of a record to obtain a vector of length K, the hash key, and records with the same vector value are assigned to the same LSH block; several groups H_j are used in order to improve the fault tolerance of the secondary blocking method;
Step 1-3. Determining the suffix length from the LSH block dispersion; each data party randomly selects N of its blocks whose size exceeds X (N being much smaller than the number of existing blocks) and randomly selects q records from each of these N blocks (q being much smaller than the number of records in a block); m bit positions on which no LSH blocking function acts are chosen at random on bf, the values of these m positions are concatenated for each record's bf to form a sequence, and the probability with which each distinct sequence occurs within a block is counted; the n data parties must agree on the values of x, N, and q and on the m positions; the single-block dispersion is given by formula (1):
$$H_j = -\sum_{i=1}^{I} P_i \ln P_i \tag{1}$$
where j denotes the j-th of the nN blocks, j = 1, ..., nN, and P_i is the probability with which the i-th of the I distinct sequences occurs in that block, i = 1, ..., I;
The overall dispersion is computed from the single-block dispersions and is used to assess the block dispersion across all n data parties; it is given by formula (2):
$$H_s = \frac{\sum_{j=1}^{nN} H_j}{n \ln N} \tag{2}$$
By increasing the number of hash functions in each LSH group, the value of H_s at which LSH blocking alone just reaches an acceptable precision is determined and taken as the threshold θ; when H_s of the LSH blocks is less than or equal to θ, the LSH method has already placed records of low similarity in different blocks, so no suffix blocking is needed and the subsequent block-merging step only has to merge the blocks of the parties that share the same LSH-key; if H_s of the preliminary blocking is greater than θ, the minimum suffix length l_min is chosen according to formula (3):
$$l_{min} = \frac{l_t (H_s - \theta)}{H_t - \theta} \tag{3}$$
where H_t is the overall dispersion measured experimentally after LSH blocking on data of the same quality as the data sets to be linked, and l_t is the best minimum suffix length corresponding to H_t;
Step 1-4. Suffix blocking; for the LSH blocks generated by each group H_j, x suffix blockings are applied to every LSH block using suffixes of x different lengths (l_min, l_min + 1, ..., l_min + x - 1); within each LSH block, records whose bf suffix values are identical are assigned to the same block, so that after secondary blocking a record appears in Jx blocks;
Step 1-5. Block signature generation; after secondary blocking, each block is represented by a signature <id, LSH-key, suffix>, where id is the block number, LSH-key is the corresponding vector value from LSH blocking, and suffix is the corresponding suffix value from suffix blocking;
Step 2. Multi-party block merging based on a sliding window; an additional participant P_{n+1} uses a sliding window to merge the blocks of the n data parties into final blocks, which improves the fault tolerance of the linkage and further secures the high recall of PPRL;
Step 2-1. Sorting block signatures; P_{n+1} collects the block signatures received from the n parties and sorts the signatures that have the same LSH-key and the same suffix length by the binary value of the suffix, forming a signature list;
Step 2-2. Generating final blocks within the sliding window; a sliding window of size w is slid over each signature list, and only if the blocks within a window come from all n data parties are all blocks in that window merged into one final block; within a final block, every n records from different data parties form a candidate record group.
The advantages of the invention are as follows:
Multi-party block merging based on a sliding window improves the fault tolerance of the linkage. The PPRL blocking method of the invention retains the high recall of the LSH method and its ability to partition large data sets quickly, while effectively improving precision.
Brief description of the drawings
Fig. 1 is the overall flow chart of the invention.
Fig. 2 illustrates the Bloom filters of the example records.
Fig. 3 illustrates block merging based on a sliding window.
Detailed description of the embodiments
The following is an example of a specific implementation of the invention.
A secondary blocking method in multi-party PPRL, characterized by comprising the following steps:
Step 1. Secondary blocking that combines locality-sensitive hashing (Locality-Sensitive Hashing, LSH) with suffixes. Suffix blocking is performed on top of LSH blocking, and the suffix length used for suffix blocking is set according to the block dispersion measured after LSH blocking, so that secondary blocking improves precision while preserving the high recall of PPRL.
Step 1-1. Bloom filter encoding. A blocking attribute is selected from the attributes common to the n data parties, and each data party maps the bigrams of its blocking-attribute value strings into a Bloom filter using the same hash functions, producing bf (for example, Pete contains the bigrams _P, Pe, et, te, e_).
Step 1-2. LSH blocking. The n data parties agree on J groups of hash functions H_j, j = 1, ..., J. Each group H_j consists of K hash functions h_k(bf) = bf[l], k = 1, ..., K, where bf[l] is the value of the l-th bit of bf. Each group H_j is applied to the bf of a record to obtain a vector of length K, the hash key, and records with the same vector value are assigned to the same LSH block. Several groups H_j are used in order to improve the fault tolerance of the secondary blocking method.
Step 1-3. Determining the suffix length from the LSH block dispersion. Each data party randomly selects N of its blocks whose size exceeds X (N being much smaller than the number of existing blocks) and randomly selects q records from each of these N blocks (q being much smaller than the number of records in a block). m bit positions on which no LSH blocking function acts are chosen at random on bf, the values of these m positions are concatenated for each record's bf to form a sequence, and the probability with which each distinct sequence occurs within a block is counted. The n data parties must agree on the values of x, N, and q and on the m positions. The single-block dispersion is given by formula (1):
$$H_j = -\sum_{i=1}^{I} P_i \ln P_i \tag{1}$$
where j denotes the j-th of the nN blocks, j = 1, ..., nN, and P_i is the probability with which the i-th of the I distinct sequences occurs in that block, i = 1, ..., I.
The overall dispersion is computed from the single-block dispersions and is used to assess the block dispersion across all n data parties; it is given by formula (2):
$$H_s = \frac{\sum_{j=1}^{nN} H_j}{n \ln N} \tag{2}$$
By increasing the number of hash functions in each LSH group, the value of H_s at which LSH blocking alone just reaches an acceptable precision is determined and taken as the threshold θ. When H_s of the LSH blocks is less than or equal to θ, the LSH method has already placed records of low similarity in different blocks, so no suffix blocking is needed and the subsequent block-merging step only has to merge the blocks of the parties that share the same LSH-key. If H_s of the preliminary blocking is greater than θ, the minimum suffix length l_min is chosen according to formula (3):
$$l_{min} = \frac{l_t (H_s - \theta)}{H_t - \theta} \tag{3}$$
where H_t is the overall dispersion measured experimentally after LSH blocking on data of the same quality as the data sets to be linked, and l_t is the best minimum suffix length corresponding to H_t.
Step 1-4. Suffix blocking. For the LSH blocks generated by each group H_j, x suffix blockings are applied to every LSH block using suffixes of x different lengths (l_min, l_min + 1, ..., l_min + x - 1); within each LSH block, records whose bf suffix values are identical are assigned to the same block. After secondary blocking, a record therefore appears in Jx blocks.
Step 1-5. Block signature generation. After secondary blocking, each block is represented by a signature <id, LSH-key, suffix>, where id is the block number, LSH-key is the corresponding vector value from LSH blocking, and suffix is the corresponding suffix value from suffix blocking.
Step 2. Multi-party block merging based on a sliding window. An additional participant P_{n+1} uses a sliding window to merge the blocks of the n data parties into final blocks, which improves the fault tolerance of the linkage and further secures the high recall of PPRL.
Step 2-1. Sorting block signatures. P_{n+1} collects the block signatures received from the n parties and sorts the signatures that have the same LSH-key and the same suffix length by the binary value of the suffix, forming a signature list.
Step 2-2. Generating final blocks within the sliding window. A sliding window of size w is slid over each signature list; only if the blocks within a window come from all n data parties are all blocks in that window merged into one final block. Within a final block, every n records from different data parties form a candidate record group.
Embodiment
P1, P2, P3 are, respectively, a data set of citizens' pharmacy purchases, a data set of residents' genetic data, and a data set of hospital visit information. The purpose of PPRL is to identify the record triples across the 3 data sets that represent the same user; the resident name attribute shared by the 3 data sets is used as the blocking attribute. In the use case, the secondary blocking method proposed by the invention is applied to the tens of thousands of records in P1, P2, P3, and similar records of the 3 parties are placed in the same final block.
The analysis example takes 3 records from each of P1, P2, P3, 9 records in total referring to 4 specific users, to demonstrate the effect of the proposed blocking method. Table 1 lists these 9 records and their name attribute values; record r_ij is the j-th of the 3 records of P_i. Records r11, r21, r31 represent user 1 and records r13, r23, r33 represent user 4; user 2 and user 3 are not referred to by records in all 3 data sets. A suitable blocking method should place the 3 records of user 1 in one final block and the 3 records of user 4 in one final block.
Table 1. Records and the users they refer to
The implementation steps of the secondary blocking method in privacy-preserving record linkage follow, together with an analysis of how the 9 records selected from the case data sets pass through them:
Step 1. P1, P2, P3 each first perform LSH blocking; the 3 parties then jointly compute the overall dispersion of the LSH blocks and from it determine a suffix length common to the 3 parties; each party then performs suffix blocking on top of its own LSH blocks.
Step 1-1. P1, P2, P3 encode the name attribute value of each record as a Bloom filter. All bits of the Bloom filter are initially 0; each bigram of the attribute value string is hashed to two bit positions in the Bloom filter, and these two positions are set to 1. The Bloom filter of length 100 corresponding to each record is denoted bf.
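The encoding in Step 1-1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the text does not name the hash functions, so the use of seeded SHA-256 and the helper names `bigrams` and `bloom_encode` are assumptions; the length of 100 bits and the two positions per bigram come from the step above.

```python
import hashlib

def bigrams(value):
    """Split a string into bigrams after padding both ends with '_'."""
    padded = f"_{value}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def bloom_encode(value, length=100, num_hashes=2):
    """Map every bigram to num_hashes bit positions and set them to 1.

    Seeded SHA-256 stands in for the unspecified hash functions
    (an assumption); all parties would have to use the same ones.
    """
    bits = [0] * length
    for gram in bigrams(value):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bits[int.from_bytes(digest[:8], "big") % length] = 1
    return bits

print(bigrams("Pete"))
bf = bloom_encode("Pete")
print(len(bf), sum(bf))
```

Because Pete and Peter share the bigrams _P, Pe, et, and te, every bit set by those four bigrams is set in both filters, which is what lets similar names land in the same block later.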
Step 1-2. P1, P2, P3 use 2 groups of hash functions, H1 and H2, to perform LSH blocking on their records. H1 and H2 each contain 10 functions h_k(bf) = bf[l], k = 1, ..., 10, where the l of each h_k is chosen at random from 0 to 99. The values of a record's bf at the 10 positions of H1 or H2 are concatenated into a binary sequence, which is the hash key to which that group of functions maps the record. Under H1 or H2, records in each data set whose hash keys under the same group of functions are identical are assigned to the same LSH block.
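The bit-sampling LSH of Step 1-2 can be sketched as follows. The function names and the three bit vectors are hypothetical; the ten positions are the ones this example reports for H1, treated here as 0-based list indices, which is an assumption about the indexing convention.

```python
from collections import defaultdict

# Positions sampled by H1 in this example, used as list indices (assumption).
H1_POSITIONS = [1, 9, 23, 24, 42, 58, 71, 72, 85, 94]

def hash_key(bf, positions):
    """Concatenate the bits h_k(bf) = bf[l] for every sampled position l."""
    return "".join(str(bf[l]) for l in positions)

def lsh_blocks(records, positions):
    """Group record ids by hash key; records maps id -> list of bits."""
    blocks = defaultdict(list)
    for rec_id, bf in records.items():
        blocks[hash_key(bf, positions)].append(rec_id)
    return dict(blocks)

# Hypothetical bit vectors: bf_b differs from bf_a only at an unsampled
# position, bf_c differs at sampled position 1.
bf_a = [0] * 100
for l in (9, 42, 94):
    bf_a[l] = 1
bf_b = list(bf_a)
bf_b[50] = 1
bf_c = list(bf_a)
bf_c[1] = 1

demo_blocks = lsh_blocks({"a": bf_a, "b": bf_b, "c": bf_c}, H1_POSITIONS)
print(demo_blocks)
```

A difference at an unsampled position leaves the hash key unchanged, while a difference at any sampled position splits the records into separate LSH blocks; using several groups of positions gives such records a second chance to meet.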
Step 1-3. Determining the suffix length from the LSH block dispersion. Each of the 3 data parties randomly selects 10 of its LSH blocks whose size exceeds 200 and randomly selects 30 records from each of these 10 blocks. 5 positions on which neither H1 nor H2 acts are selected on bf, and the values of these 5 positions are concatenated for each record's bf to form a sequence. The probability P_i with which each distinct sequence occurs in each block is counted, and the single-block dispersion of each block is computed on the information-entropy principle, which expresses how much the sequences differ:
$$H_j = -\sum_{i=1}^{I} P_i \ln P_i$$
For example, only the two sequences 01000 and 01010 occur in one block of P1, with probabilities 7/10 and 3/10 respectively, and the single-block dispersion of this block is given as H_1 = -7/100*ln(7/100) - 3/100*ln(3/100) = 0.291. The single-block dispersions of the 30 blocks selected by the 3 parties range from 0.115 to 0.278. The overall dispersion is the ratio of the sum of these 30 single-block dispersions to 3 ln 10, which gives H_s = 1.08. When H_s is sufficiently small, the LSH method has already placed records of low similarity in different blocks, and no suffix blocking is needed. In this use case H_s exceeds the threshold θ = 0.5, so suffix blocking is required. H_t = 1.2 is the overall dispersion obtained experimentally after LSH blocking on data of the same quality as the 3 case data sets, and l_t = 12 is the minimum suffix length found experimentally for H_t. Using the reference values H_t and l_t, the best minimum suffix length for H_s = 1.08 is l_min = l_t(H_s - θ)/(H_t - θ) = 12*(1.08 - 0.5)/(1.2 - 0.5) = 9.94, which is rounded to 10. Since the case data are only weakly perturbed, suffixes of two lengths are used, so suffix blocking uses suffixes of 10 bits and 11 bits.
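The arithmetic of Step 1-3 can be checked with a short sketch. The function names are hypothetical, and the denominator n ln N in `overall_dispersion` is inferred from the "3 ln 10" of this example (n = 3 parties, N = 10 blocks each), so it should be read as an assumption. Note that the printed expression for the sample block uses 7/100 and 3/100 rather than the stated probabilities 7/10 and 3/10; the sketch evaluates both so the difference is visible.

```python
import math

def single_block_dispersion(probs):
    """Entropy of the sequence distribution in one block (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def overall_dispersion(block_dispersions, n, N):
    """Sum of single-block dispersions over n*ln(N) (inferred normalizer)."""
    return sum(block_dispersions) / (n * math.log(N))

def min_suffix_length(h_s, h_t, l_t, theta):
    """Scale the reference suffix length l_t linearly, as in the text."""
    return l_t * (h_s - theta) / (h_t - theta)

l_min = min_suffix_length(h_s=1.08, h_t=1.2, l_t=12, theta=0.5)
print(round(l_min, 2))  # 9.94

# The expression exactly as printed in the example.
printed = -7 / 100 * math.log(7 / 100) - 3 / 100 * math.log(3 / 100)
print(round(printed, 3))  # 0.291

# Entropy of the stated probabilities 7/10 and 3/10, for comparison.
print(round(single_block_dispersion([0.7, 0.3]), 3))  # 0.611
```

The suffix lengths actually used in the example, 10 and 11 bits, follow from rounding `l_min` and taking two consecutive lengths.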
Step 1-4. Suffix blocking is performed with bf suffixes of length 10 and of length 11; within each LSH block, records with the same suffix value are assigned to the same block. After secondary blocking, a record appears in 4 blocks.
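A minimal sketch of the suffix blocking in Step 1-4 follows, assuming that each Bloom filter is held as a bit string and that the suffix is simply its last bits. The helper name `suffix_blocks` and the three sample bit strings are hypothetical.

```python
from collections import defaultdict

def suffix_blocks(lsh_block, suffix_lengths=(10, 11)):
    """Split one LSH block by bf suffix, once per suffix length.

    lsh_block maps record id -> bf as a string of '0'/'1' characters.
    Returns {suffix_length: {suffix_value: [record ids]}}.
    """
    result = {}
    for length in suffix_lengths:
        groups = defaultdict(list)
        for rec_id, bf in lsh_block.items():
            groups[bf[-length:]].append(rec_id)
        result[length] = dict(groups)
    return result

# Hypothetical 100-bit filters: a and b agree on the last 10 bits
# but differ on the 11th bit from the end.
demo_block = {
    "a": "0" * 89 + "0" + "1" * 10,
    "b": "0" * 89 + "1" + "1" * 10,
    "c": "0" * 100,
}
demo_result = suffix_blocks(demo_block)
print(demo_result[10])
print(demo_result[11])
```

With 2 hash groups and 2 suffix lengths, each record is placed once per combination, which accounts for the 4 blocks per record stated in the step.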
Step 1-5. P1, P2, P3 each generate their block signatures. Each signature has three items: the first is the block number B_ij, denoting the j-th block of P_i; the second is the LSH-key shared by the records in the block; the third is the suffix value shared by the records in the block.
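One way to produce the three-item signatures of Step 1-5 is sketched below for a single hash group and a single suffix length. The `Signature` type, the `make_signatures` helper, the block-id format, and the sample data are all hypothetical; the sketch only illustrates that a signature carries the block number, the LSH-key, and the suffix value, while the record ids stay in a separate local mapping.

```python
from typing import NamedTuple

class Signature(NamedTuple):
    block_id: str
    lsh_key: str
    suffix: str

def make_signatures(party, records, positions, suffix_length):
    """Build block signatures for one party.

    records maps record id -> bf bit string. Returns the list of
    signatures and a local mapping block_id -> record ids.
    """
    members = {}
    for rec_id, bf in records.items():
        key = "".join(bf[l] for l in positions)
        members.setdefault((key, bf[-suffix_length:]), []).append(rec_id)
    signatures, contents = [], {}
    for j, ((key, suffix), recs) in enumerate(sorted(members.items()), start=1):
        block_id = f"B{party}_{j}"
        signatures.append(Signature(block_id, key, suffix))
        contents[block_id] = recs
    return signatures, contents

demo_records = {"r1": "0" * 100, "r2": "0" * 100, "r3": "0" * 99 + "1"}
demo_sigs, demo_contents = make_signatures(1, demo_records, [1, 9, 23], 10)
print(demo_sigs)
```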
The following analyzes what happens to the 9 example records during Step 1.
Fig. 2 shows the bf sequences of the example records and how H1 acts on the bf of the 9 example records; the h_k of H1 correspond to the 10 bf positions 1, 9, 23, 24, 42, 58, 71, 72, 85, and 94. The figure also shows the suffix values of the two lengths for the 9 example records. For example, the hash key of record r11 under H1 is 0100100001, its suffix value of length 10 is 0001001000, and its suffix value of length 11 is 00001001000.
After Step 1 is executed, the LSH-key values and suffix values of the 9 records are analyzed to illustrate what may happen to them in the subsequent merging process. Table 2 shows the records of the 3 parties that share the same LSH-key value and the same suffix value of length 11. Block signatures with the same LSH-key value appear in the same list, and the blocks they represent may be merged. As Table 2 shows, because r11, r21, r31 all describe user 1 with the same attribute value John, their LSH-keys under H1 or H2 are identical, and with suffix length 10 or 11 the suffix values of their blocks are identical, so in the merging stage these 3 records are certain to end up in the same final block. r12, r22, r32 do not represent the same user, but their name attribute values Jones and Stone share several bigrams. Their LSH-keys under H1 are identical but their suffix values differ, so whether they end up in the same final block depends on how the blocks are merged after signature sorting; their LSH-keys under H2 differ, so secondary blocking with H2 does not place these 3 records in the same final block. The LSH-keys of r13, r23, r33 under H1 differ: although they represent user 4, their name attribute values are not identical, and this happens when at least one of the 6 Bloom filter positions corresponding to e_, er, r_ is among the 10 positions l of H1. Secondary blocking with H1 therefore cannot recognize that r13, r23, r33 represent the same user. After LSH blocking with H2, the LSH-keys of r13, r23, r33 are identical, because none of the 6 positions corresponding to e_, er, r_ is among the 10 positions l of H2; secondary blocking with H2 can therefore recognize that r13, r23, r33 represent the same user.
Table 2. Secondary blocking information
Step 2. The additional participant P4 uses a sliding window to merge the blocks of P1, P2, P3 into final blocks.
Step 2-1. P4 sorts the signatures of blocks with the same LSH-key value and the same suffix length in ascending order of the binary suffix value. Blocking the use-case data of P1, P2, P3 yields 35 distinct LSH-key values under H1 and 37 under H2, hence 72 LSH-key values in total for the two groups of functions; each LSH-key value corresponds to lists for the two suffix lengths, giving 144 signature lists in total.
Step 2-2. P4 chooses a sliding window of size 5. The window slides down each signature list from the top, one row at a time, until it reaches the bottom. If signatures of blocks from all 3 parties are present in the window, the 5 blocks represented by the signatures in the window are merged into a final block. The subsequent matching stage of PPRL generates candidate record groups within the final blocks and uses matching computation to decide whether a group of records represents the same user.
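The sliding-window merge of Steps 2-1 and 2-2 can be sketched as below. The text does not say how overlapping windows that each qualify are combined, so this sketch simply emits one final block per qualifying window; that choice, the function name, and the sample signatures are assumptions. Each signature is reduced to a (party, block id, suffix) tuple for one signature list, that is, for one LSH-key value and one suffix length.

```python
def merge_in_window(signature_list, parties, w=5):
    """Slide a window of w rows over signatures sorted by suffix value.

    A window whose rows cover every party yields one final block
    (the list of block ids in that window).
    """
    ordered = sorted(signature_list, key=lambda s: int(s[2], 2))
    final_blocks = []
    for start in range(max(1, len(ordered) - w + 1)):
        window = ordered[start:start + w]
        if {party for party, _, _ in window} == set(parties):
            final_blocks.append([block_id for _, block_id, _ in window])
    return final_blocks

# Hypothetical signature list; a window of 3 keeps the demo small.
demo_list = [
    ("P1", "B1_1", "0000000100"),
    ("P2", "B2_1", "0000000100"),
    ("P3", "B3_1", "0000000101"),
    ("P1", "B1_2", "1000000000"),
    ("P1", "B1_3", "1000000001"),
    ("P1", "B1_4", "1000000010"),
    ("P2", "B2_2", "1000000011"),
]
demo_final = merge_in_window(demo_list, parties=("P1", "P2", "P3"), w=3)
print(demo_final)
```

Because the window is defined over rows rather than over numeric distance, a block with a distant suffix value can still be merged when it happens to sit next to blocks from the other parties in the sorted list; the window size w controls how tolerant the merge is.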
Four lists are analyzed to show how the blocks containing the example records are merged. As the 4 tables in Fig. 3 show, each table stores all blocks that obtain the same LSH-key value under the same group of functions. Only the signatures of blocks containing example records are shown explicitly; the remaining block signatures are replaced by asterisks. The first column of each list gives the block number and the example records the block contains, and the second column gives the suffix value of the block. Fig. 3 (1) and Fig. 3 (2) describe, under H1, the merging in the two lists of suffix length 11 that contain r11, r21, r31 and r12, r22, r32: the blocks containing r11, r21, r31 can be merged, whereas the block containing r32 is far from the blocks containing r12, r22 in the list and is not merged with them. Fig. 3 (3) and Fig. 3 (4) describe, under H2, the merging in the two block lists of suffix lengths 11 and 10 that contain r13, r23, r33. In Fig. 3 (3), with suffix length 11, the suffix value for Peter is 11000000100 and that for Pete is 01000000100; because they differ in the most significant bit, the blocks containing r13, r23 are far from the block containing r33 and are not merged. In Fig. 3 (4), with suffix length 10, the suffix values of the blocks containing r13, r23, r33 are identical, and the blocks containing the three records are merged into the same final block. Even supposing that the suffixes of Peter and Pete differed at other bf positions, the lower the differing bit, the closer the values and the closer the positions after sorting, so the blocks containing the three records would still be likely to be merged.
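The effect of the differing most significant bit can be verified directly from the two suffix values quoted above. The variable names are illustrative; the bit strings are the ones given in the text.

```python
peter_11 = "11000000100"  # 11-bit suffix quoted for Peter
pete_11 = "01000000100"   # 11-bit suffix quoted for Pete

# Sorted by binary value, the two suffixes are 1024 apart.
gap_11 = abs(int(peter_11, 2) - int(pete_11, 2))
print(int(peter_11, 2), int(pete_11, 2), gap_11)  # 1540 516 1024

# Dropping the leading bit gives the 10-bit suffixes, which coincide.
same_10 = peter_11[-10:] == pete_11[-10:]
print(peter_11[-10:], same_10)  # 1000000100 True
```

This is why using several suffix lengths helps: a difference in a high-order bit separates the blocks in one list, but a shorter suffix that excludes that bit brings them together in another.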
Among the example records, r11, r21, r31, which represent user 1, are in the same final block; r13, r23, r33, which represent user 4, are in the same final block; and the similar records r12, r22, r32 are not in the same final block. The subsequent stage of PPRL performs matching computation on the generated candidate record groups (r11, r21, r31) and (r13, r23, r33) to decide whether the records in each candidate record group truly represent the same user.
Claims (1)
1. A secondary blocking method for privacy-preserving record linkage, characterized by comprising the following steps:
Step 1. Secondary blocking that combines locality-sensitive hashing (LSH) with suffixes; suffix blocking is performed on top of LSH blocking, and the suffix length used for suffix blocking is set according to the block dispersion measured after LSH blocking, so that secondary blocking improves precision while preserving the high recall of PPRL;
Step 1-1. Bloom filter encoding; a blocking attribute is selected from the attributes common to the n data parties, and each data party maps the bigrams of its blocking-attribute value strings into a Bloom filter using the same hash functions, producing bf; for example, Pete contains the bigrams _P, Pe, et, te, e_;
Step 1-2. LSH blocking; the n data parties agree on J groups of hash functions H_j, j = 1, ..., J; each group H_j consists of K hash functions h_k(bf) = bf[l], k = 1, ..., K, where bf[l] is the value of the l-th bit of bf; each group H_j is applied to the bf of a record to obtain a vector of length K, the hash key, and records with the same vector value are assigned to the same LSH block; several groups H_j are used in order to improve the fault tolerance of the secondary blocking method;
Step 1-3. Determining the suffix length from the LSH block dispersion; each data party randomly selects N of its blocks whose size exceeds X (N being much smaller than the number of existing blocks) and randomly selects q records from each of these N blocks (q being much smaller than the number of records in a block); m bit positions on which no LSH blocking function acts are chosen at random on bf, the values of these m positions are concatenated for each record's bf to form a sequence, and the probability with which each distinct sequence occurs within a block is counted; the n data parties must agree on the values of x, N, and q and on the m positions; the single-block dispersion is given by formula (1):
$$H_j = -\sum_{i=1}^{I} P_i \ln P_i \tag{1}$$
where j denotes the j-th of the nN blocks, j = 1, ..., nN, and P_i is the probability with which the i-th of the I distinct sequences occurs in that block, i = 1, ..., I;
The overall dispersion is computed from the single-block dispersions and is used to assess the block dispersion across all n data parties; it is given by formula (2):
$$H_s = \frac{\sum_{j=1}^{nN} H_j}{n \ln N} \tag{2}$$
By increasing the number of hash functions in each LSH group, the value of H_s at which LSH blocking alone just reaches an acceptable precision is determined and taken as the threshold θ; when H_s of the LSH blocks is less than or equal to θ, the LSH method has already placed records of low similarity in different blocks, so no suffix blocking is needed and the subsequent block-merging step only has to merge the blocks of the parties that share the same LSH-key; if H_s of the preliminary blocking is greater than θ, the minimum suffix length l_min is chosen according to formula (3):
$$l_{min} = \frac{l_t (H_s - \theta)}{H_t - \theta} \tag{3}$$
where H_t is the overall dispersion measured experimentally after LSH blocking on data of the same quality as the data sets to be linked, and l_t is the best minimum suffix length corresponding to H_t;
Step 1-4. Suffix blocking: for the LSH blocks generated by each group $H_j$, every LSH block is suffix-blocked x times, using suffixes of x different lengths ($l_{min}$, $l_{min}+1$, ..., $l_{min}+x-1$). Within each LSH block, records whose bf suffix values are identical are assigned to the same block, so that after secondary blocking each record appears in Jx blocks;
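Suffix blocking of a single LSH block can be sketched as below. The sketch assumes that the "suffix" of a record is the trailing bits of its bf and that $l_{min}$ is at least 1; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def suffix_blocks(lsh_block, l_min, x):
    """Split one LSH block into sub-blocks by bf suffix.

    lsh_block: dict mapping record id -> bf (list of 0/1 values).
    Uses the x suffix lengths l_min, l_min + 1, ..., l_min + x - 1 (l_min >= 1).
    Returns a dict mapping (suffix length, suffix bit string) -> record ids.
    """
    blocks = defaultdict(list)
    for length in range(l_min, l_min + x):
        for rec_id, bf in lsh_block.items():
            suffix = "".join(str(bit) for bit in bf[-length:])
            blocks[(length, suffix)].append(rec_id)
    return dict(blocks)


block = {"a": [1, 0, 1, 1], "b": [0, 0, 1, 1], "c": [1, 1, 0, 1]}
print(suffix_blocks(block, l_min=2, x=2))
# {(2, '11'): ['a', 'b'], (2, '01'): ['c'], (3, '011'): ['a', 'b'], (3, '101'): ['c']}
```

Applied to every LSH block of all J groups, each record ends up in J times x sub-blocks, which matches the count stated in Step 1-4.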
Step 1-5. Block signature generation: after secondary blocking, each block is represented by a signature <id, LSH-key, suffix>, where id is the block number, LSH-key is the corresponding vector value from the LSH blocking process, and suffix is the corresponding suffix value from the suffix blocking process;
Step 2. Multi-party block merging based on a sliding window: an additional participant $P_{n+1}$ uses a sliding window to merge the blocks of the n data parties into final blocks, which improves the fault tolerance of the linkage and further ensures a high recall for PPRL;
Step 2-1. Sorting the block signatures: $P_{n+1}$ collects the block signatures received from the n parties and, for block signatures with the same LSH-key and the same suffix length, sorts them by the binary value of their suffix to form a signature list;
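The grouping and sorting in Step 2-1 can be sketched as follows. The `Signature` type is hypothetical: the `party` field is an assumption (the signature in the text carries only id, LSH-key and suffix, and the sketch assumes $P_{n+1}$ knows which party sent each one), and the suffix is assumed to be a bit string.

```python
from collections import defaultdict
from typing import NamedTuple


class Signature(NamedTuple):
    party: int       # assumed: index of the data party that sent the signature
    block_id: int    # "id" in <id, LSH-key, suffix>
    lsh_key: tuple   # "LSH-key": the hash key vector
    suffix: str      # "suffix": bit string such as "0110"


def build_signature_lists(signatures):
    """Group signatures by (LSH-key, suffix length) and sort each group
    by the integer value of the binary suffix."""
    lists = defaultdict(list)
    for sig in signatures:
        lists[(sig.lsh_key, len(sig.suffix))].append(sig)
    for group in lists.values():
        group.sort(key=lambda s: int(s.suffix, 2))
    return dict(lists)
```

Sorting by suffix value places blocks with numerically close suffixes next to each other in the list, which is what the sliding window in the next step relies on.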
Step 2-2. Generating final blocks within the sliding window: a sliding window of size w is slid over each signature list. Only when the blocks inside a window come from all n data parties are all blocks in that window merged into one final block. Within a final block, every n records that come from different data parties form a candidate record group.
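A minimal sketch of Step 2-2 follows. It reads "blocks from the n data parties" as "every one of the n parties is represented in the window", and it moves the window one position at a time; both are assumptions, as are the function names and the (party, block id) representation.

```python
from itertools import product


def merge_with_sliding_window(signature_list, w, n):
    """Slide a window of size w over one sorted signature list.

    signature_list: list of (party, block_id) pairs, already sorted by suffix.
    A window becomes a final block only if all n parties appear in it.
    """
    final_blocks = []
    for start in range(max(len(signature_list) - w + 1, 1)):
        window = signature_list[start:start + w]
        parties = {party for party, _ in window}
        if len(parties) == n:
            final_blocks.append(window)
    return final_blocks


def candidate_groups(records_by_party):
    """Form candidate record groups: one record from each of the n parties.

    records_by_party: list of n lists holding each party's records
    within one final block.
    """
    return list(product(*records_by_party))


sorted_sigs = [(0, "a1"), (1, "b1"), (0, "a2"), (2, "c1"), (1, "b2")]
print(merge_with_sliding_window(sorted_sigs, w=3, n=3))
# [[(1, 'b1'), (0, 'a2'), (2, 'c1')], [(0, 'a2'), (2, 'c1'), (1, 'b2')]]
```

In this sketch overlapping windows can yield overlapping final blocks, so a block may contribute to more than one final block.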
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811101295.6A CN109308423A (en) | 2018-09-20 | 2018-09-20 | Secondary method of partition in secret protection record link |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109308423A (en) | 2019-02-05 |
Family
ID=65225030
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811101295.6A Withdrawn CN109308423A (en) | 2018-09-20 | 2018-09-20 | Secondary method of partition in secret protection record link |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109308423A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110866283A (en) * | 2019-11-25 | 2020-03-06 | 浙江工商大学 | Multi-party verifiable data record linking method based on block chain and partial homomorphic encryption |
CN111246431A (en) * | 2020-04-26 | 2020-06-05 | 北京全路通信信号研究设计院集团有限公司 | Analysis and evaluation method and system for multi-source data of railway train control equipment |
CN114282255A (en) * | 2022-03-04 | 2022-04-05 | 支付宝(杭州)信息技术有限公司 | Sorting sequence merging method and system based on secret sharing |
Non-Patent Citations (1)
Title |
---|
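Tong Danni et al.: "Multi-party strong privacy-preserving record linkage method" (多方强隐私保护记录链接方法), Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》) * |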
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110866283A (en) * | 2019-11-25 | 2020-03-06 | 浙江工商大学 | Multi-party verifiable data record linking method based on block chain and partial homomorphic encryption |
CN110866283B (en) * | 2019-11-25 | 2021-09-21 | 浙江工商大学 | Multi-party verifiable data record linking method based on block chain and partial homomorphic encryption |
CN111246431A (en) * | 2020-04-26 | 2020-06-05 | 北京全路通信信号研究设计院集团有限公司 | Analysis and evaluation method and system for multi-source data of railway train control equipment |
CN111246431B (en) * | 2020-04-26 | 2020-09-08 | 北京全路通信信号研究设计院集团有限公司 | Analysis and evaluation method and system for multi-source data of railway train control equipment |
CN114282255A (en) * | 2022-03-04 | 2022-04-05 | 支付宝(杭州)信息技术有限公司 | Sorting sequence merging method and system based on secret sharing |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11977541B2 (en) | Systems and methods for rapid data analysis | |
Zhang et al. | A privacy leakage upper bound constraint-based approach for cost-effective privacy preserving of intermediate data sets in cloud | |
Liu et al. | Multi-constrained graph pattern matching in large-scale contextual social graphs | |
Baralis et al. | Generalized association rule mining with constraints | |
CN109308423A (en) | Secondary method of partition in secret protection record link | |
CN109117669B (en) | Privacy protection method and system for MapReduce similar connection query | |
US11132360B2 (en) | Accessing datasets | |
CN109710789A (en) | Search method, device, electronic equipment and the computer storage medium of image data | |
CN104077723A (en) | Social network recommending system and social network recommending method | |
Teng et al. | An Efficient and Secure Cipher-Text Retrieval Scheme Based on Mixed Homomorphic Encryption and Multi-Attribute Sorting Method. | |
CN108197491A (en) | A kind of subgraph search method based on ciphertext | |
Mueller et al. | SoK: Differential privacy on graph-structured data | |
US7363320B2 (en) | Method and system for correlating data from multiple sources without compromising confidentiality requirements | |
CN107070932B (en) | Anonymous method for preventing label neighbor attack in social network dynamic release | |
Amir et al. | A brief review of conditions, circumstances and applicability of sampling techniques in computer science domain | |
CN105787800B (en) | Intelligent social platform potential relationship retrieval device, system and method | |
Shaham et al. | Machine learning aided anonymization of spatiotemporal trajectory datasets | |
EP4182827A1 (en) | Method and system for secure distributed software-service | |
Dhanalakshmi et al. | Privacy preserving data mining techniques-survey | |
Nia et al. | Leveraging social interactions to suggest friends | |
Hongde et al. | Differential privacy data aggregation optimizing method and application to data visualization | |
Kuijpers et al. | Analyzing trajectories using uncertainty and background information | |
Hamidi et al. | Secure Two-party Agglomerative Hierarchical Clustering Construction. | |
Patsakis et al. | Privacy-aware genome mining: Server-assisted protocols for private set intersection and pattern matching | |
Janakiraman et al. | How are you related? Predicting the type of a social relationship using call graph data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20190205 |