Privacy-preserving data record integration method, system, and computer-readable storage medium
Technical Field
The invention belongs to the technical field of privacy-preserving record linkage, and particularly relates to a privacy-preserving data record integration method, a system, and a computer-readable storage medium.
Background
With growing demand and the growing construction scale of information systems, information systems are usually operated independently, data in the various application programs are difficult to share, and the data island phenomenon is becoming increasingly obvious. To resolve this conflict, it is necessary to design a privacy-preserving data record integration method, system, and computer-readable storage medium that can effectively solve the data island problem of information systems.
For example, Chinese patent application No. CN201811069639.X describes a big data acquisition and transaction system based on a blockchain and a trusted computing platform, which includes an address verification module, a data acquisition module, a data uploading module, a data credibility verification module, and a data reward payment module on a user chain. That system alleviates the shortage of data sources by fusing large-scale personal data islands; it implements all-round supervision and protection of data acquisition, storage, packaging, and uploading operations, so that whole-link credibility of the data is achieved; it protects user privacy by using direct anonymous attestation when the data acquisition company authenticates the validity of public key addresses on the user chain; and it keeps data reward payment open and transparent. It thereby reconciles, to a certain degree, the contradiction between personal privacy protection and big data acquisition, guarantees the credibility of the data source, and is practical and easy to implement. However, it cannot effectively support cross-domain application fusion, the scope and strength of data integration are limited, and the difficulty of data record integration is not effectively reduced.
Disclosure of Invention
To solve the problems in the prior art that information systems are usually operated independently, that data in the various application programs are difficult to share, and that the data island phenomenon is increasingly obvious, the invention provides a privacy-preserving data record integration method, a system, and a computer-readable storage medium that can effectively solve the data island problem of information systems.
In order to achieve the purpose, the invention adopts the following technical scheme:
The privacy-preserving data record integration method comprises the following steps:
S1, creating a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
S2, setting the privacy parameters of the data record integration system and the management authority of the nodes;
S3, based on the data record structure, encoding and blocking the data of the data record integration system using the management authority of the nodes and the privacy parameters, so as to generate a plurality of blocks containing candidate record pairs;
S4, performing similarity comparison on the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity.
Preferably, step S1 includes the steps of:
S11, preprocessing the data records in the data record structure, wherein the preprocessing comprises data cleaning and standardization, and the data records are corrected by, in sequence, filling in missing data, removing redundant values, and unifying data formats;
S12, selecting quasi-identifiers: traversing the attribute fields and, according to the particularities of the scenario, selecting a combination of several attributes that can uniquely identify real-world entity information as the quasi-identifier attributes;
S13, record-level Bloom filter encoding: splitting the selected quasi-identifier attributes into n-grams, passing them through n Bloom filters, and mapping them into a record-level Bloom filter (RBF).
Preferably, the management authority of the nodes in step S2 includes a data viewing and distribution authority and a data link authority;
the node with the data viewing and distribution authority is used by the data holder to view the data in the information system while the data record integration system operates normally, and to distribute the encoded data to the link node;
the node with the data link authority is used to link the data of the candidate record pairs in the same block when the data undergo an integration operation.
Preferably, step S3 includes the steps of:
screening candidate record pairs by blocking: deduplicating the data of the data record integration system to obtain candidate record data, selecting a public attribute other than the quasi-identifiers and hash-encoding it, and taking the result as the blocking key; data records with the same blocking key are placed in the same block, yielding a set of blocks $B = \{b_1, b_2, \ldots, b_i\}$, where $b_i$ represents the i-th block.
Preferably, with a node having the data viewing and distribution authority serving as a data node and a node having the data link authority serving as a link node, step S4 includes the following step:
after the data nodes send the data of the data record integration system to the link node, a twin (Siamese) neural network is used at the link node to compare the similarity of the candidate record pairs, and the linked results are integrated and output.
Preferably, step S4 includes the steps of:
S41, extracting features at the link node using a bidirectional long short-term memory network (BiLSTM): for the input data $x_1, x_2$, the attribute data are first divided into m batches of size batch_size by the embedding layer of the BiLSTM, and n × m dimensional conversion vectors are generated, whose entries are random numbers in the interval (-1, 1);
wherein n represents the number of Bloom filters, m represents the number of batches, and batch_size represents the number of samples used in one training step;
S42, converting vectors: samples $x_i$ and $x_j$ pass through the embedding layer, and the generated vectors are denoted $y_i, y_j \in R^{n \times m}$; the resulting $R^{n \times m}$-dimensional feature vectors are used as new features and passed through a fully connected layer to output the cosine similarity of samples $x_i$ and $x_j$; the result, as a measure of similarity, is calculated by the following equation:
$$\cos\big(f_w(x_1), f_w(x_2)\big) = \frac{\langle f_w(x_1), f_w(x_2) \rangle}{\lVert f_w(x_1) \rVert \, \lVert f_w(x_2) \rVert}$$
wherein $f_w(x_i)$ represents the mapping of the input sample $x_i$ at the embedding layer, $\langle f_w(x_1), f_w(x_2) \rangle$ denotes the dot product of $f_w(x_1)$ and $f_w(x_2)$, and $\lVert f_w(x_i) \rVert$ is the norm of $f_w(x_i)$;
S43, calculating a loss function from the similarity obtained in step S42, continuously adjusting the parameters of the twin neural network model by gradient descent, and repeating steps S41 and S42 in sequence to optimize the performance of the twin neural network model;
for input samples $a_n$ and $b_n$, the loss function between the two is calculated as follows:
$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y\, d^2 + (1 - y)\, \max(\mathrm{margin} - d,\ 0)^2 \right]$$
wherein $d = \lVert a_n - b_n \rVert_2$ represents the Euclidean distance between the features of the two samples, N is the number of samples, y is a label marking whether the two samples match (y = 1 means the two samples are similar or matched, y = 0 means they are not matched), and margin is a set threshold;
S44, constructing positive and negative samples for training the twin neural network model;
S45, using the trained twin neural network model to predict on the test set to obtain the set of data records belonging to the same entity.
Preferably, the constructing of the positive and negative samples in step S44 includes the following steps:
S441, inputting a sample S, and finding another identical or similar sample S' in the input data set according to the sample features to form a tuple (S, S', 1), wherein 1 represents a match;
S442, repeating the process of step S441 to generate the positive samples;
S443, inputting a sample S, and randomly selecting a sample S' from the complement of the samples matching S according to the sample features to form a tuple (S, S', 0), wherein 0 represents a non-match;
S444, repeating the process of step S443 to generate the negative samples.
The invention also provides a privacy-preserving data record integration system, which comprises:
a data record structure creation module, used to create a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
a system authority setting module, used to set the management authority of the nodes of the data record integration system;
a record linking module, which, based on the data record structure of the data record integration system, compares the similarity of the candidate record pairs using the management authority of the nodes and outputs the data record information belonging to the same entity.
The present invention also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the privacy-preserving data record integration method of any one of the above embodiments.
Compared with the prior art, the invention has the following beneficial effects: (1) the invention applies a deep learning method to the privacy-preserving record linkage task, so that the technique can adapt to more scenarios; unlike existing machine learning methods, the twin neural network can automatically extract scenario features according to how attributes are represented in the scenario and build a model, so that linkage accuracy and the time efficiency of the whole process are improved while the credibility of the integration process is ensured; (2) the encoded data record structure anonymizes the original data, so that the traceability of the data records and the credibility of the integration process are guaranteed and personal private data in the information system are not revealed, which is necessary in industrial applications with high requirements on data security and data traceability; (3) in view of the heterogeneous storage of data, the invention automatically extracts the features of the scenario data and normalizes the data format, thereby greatly reducing the difficulty of data record integration; (4) by linking the data in the data integration system, the invention can integrate data across different information systems while ensuring the credibility of the process, thereby effectively supporting cross-domain application fusion and expanding the scope and strength of data integration.
Drawings
FIG. 1 is a flow chart of a privacy preserving data record integration method according to an embodiment of the present invention;
FIG. 2 is a diagram of a data structure for record level bloom filter encoded records according to an embodiment of the present invention;
FIG. 3 is a block diagram of an architecture of an integrated system for privacy preserving data records according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the similarity comparison process according to an embodiment of the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
Example 1:
As shown in FIG. 1, the privacy-preserving data record integration method includes the following steps:
S1, creating a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
Step S1 specifically includes the following steps:
S11, preprocessing the data records in the data record structure, wherein the preprocessing comprises data cleaning and standardization, and the data records are corrected by, in sequence, filling in missing data, removing redundant values, and unifying data formats;
S12, selecting quasi-identifiers: traversing the attribute fields and, according to the particularities of the scenario, selecting a combination of several attributes that can uniquely identify real-world entity information as the quasi-identifier attributes;
S13, record-level Bloom filter encoding: splitting the selected quasi-identifier attributes into n-grams, passing them through n Bloom filters, and mapping them into a record-level Bloom filter (RBF).
Specifically, the first name and the surname are selected as quasi-identifier attributes and encoded with Bloom filters to form the quasi-identifier record data structure. For example, the first name John and the surname Smith are split into 2-grams: Jo, oh, hn, Sm, mi, it, th; these are passed through n Bloom filters and mapped into the record-level Bloom filter RBF, as shown in FIG. 2.
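The encoding in step S13 can be illustrated with a minimal Python sketch. It is not the patented implementation: the filter length of 64 bits, the two salted SHA-256 hash functions, the use of one Bloom filter per attribute, and the function names `qgrams`, `bloom_encode` and `record_level_bloom_filter` are all assumptions made for illustration. A deployed system would more likely use a keyed hash with a secret shared by the data nodes.

```python
import hashlib

def qgrams(value, q=2):
    """Split a string into overlapping q-grams, e.g. 'John' -> ['Jo', 'oh', 'hn']."""
    return [value[i:i + q] for i in range(len(value) - q + 1)]

def bloom_encode(grams, length=64, num_hashes=2):
    """Set bits in a fixed-length Bloom filter for every q-gram.

    Hash functions are simulated by salting SHA-256 with a seed (assumption).
    """
    bits = [0] * length
    for gram in grams:
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bits[int.from_bytes(digest[:8], "big") % length] = 1
    return bits

def record_level_bloom_filter(attributes, length=64, num_hashes=2):
    """Concatenate one Bloom filter per quasi-identifier attribute into an RBF."""
    rbf = []
    for value in attributes:
        rbf.extend(bloom_encode(qgrams(value), length, num_hashes))
    return rbf

# Encode the example record from the text: first name John, surname Smith.
rbf = record_level_bloom_filter(["John", "Smith"])
```

Because the encoding is deterministic, two data nodes that hold the same name values produce the same bit vector, which is what allows the link node to compare records without seeing the original strings.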
S2, setting the privacy parameters of the data record integration system and the management authority of the nodes;
The management authority of the nodes in step S2 includes a data viewing and distribution authority and a data link authority;
the node with the data viewing and distribution authority is used by the data holder to view the data in the information system while the data record integration system operates normally, and to distribute the encoded data to the link node;
the node with the data link authority is used to link the data of the candidate record pairs in the same block when the data undergo an integration operation.
A node with the data viewing and distribution authority is taken as a data node, and a node with the data link authority is taken as a link node. In the present invention, each data node represents one information system.
As shown in FIG. 3, in a data-isolated environment, data nodes 1 to n have the data viewing and distribution authority and the link node has the data link authority.
S3, based on the data record structure, encoding and blocking the data of the data record integration system using the management authority of the nodes and the privacy parameters, so as to generate a plurality of blocks containing candidate record pairs;
Step S3 specifically includes the following step:
screening candidate record pairs by blocking: deduplicating the data of the data record integration system to obtain candidate record data, selecting a public attribute other than the quasi-identifiers and hash-encoding it, and taking the result as the blocking key; data records with the same blocking key are placed in the same block, yielding a set of blocks $B = \{b_1, b_2, \ldots, b_i\}$, where $b_i$ represents the i-th block.
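The blocking step can be sketched as follows. This is a simplified illustration under stated assumptions: the record layout (dictionaries with `id` and `city` fields), the choice of `city` as the public attribute, the salt value, and the function names are hypothetical and do not come from the specification.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def blocking_key(value, salt="shared-secret"):
    """Hash a public (non-quasi-identifier) attribute into a blocking key."""
    normalized = value.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()

def build_blocks(records, attribute):
    """Group record ids by the hash of the chosen public attribute."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record[attribute])].append(record["id"])
    return dict(blocks)

def candidate_pairs(blocks):
    """Only records that share a block are compared with each other."""
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs

# Hypothetical records held by two information systems (a* and b*).
records = [
    {"id": "a1", "city": "Hangzhou"},
    {"id": "b1", "city": " hangzhou"},
    {"id": "a2", "city": "Ningbo"},
]
blocks = build_blocks(records, "city")
pairs = candidate_pairs(blocks)
```

Here only the pair (a1, b1) reaches the similarity comparison, which shows how blocking reduces the number of comparisons relative to comparing every record with every other record.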
S4, performing similarity comparison on the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity.
Step S4 includes the following step:
after the data nodes send the data of the data record integration system to the link node, a twin (Siamese) neural network is used at the link node to compare the similarity of the candidate record pairs, and the linked results are integrated and output.
As shown in fig. 4, step S4 specifically includes the following steps:
S41, extracting features at the link node using a bidirectional long short-term memory network (BiLSTM): for the input data $x_1, x_2$, the attribute data are first divided into m batches of size batch_size by the embedding layer of the BiLSTM, and n × m dimensional conversion vectors are generated, whose entries are random numbers in the interval (-1, 1);
wherein n represents the number of Bloom filters, m represents the number of batches, and batch_size represents the number of samples used in one training step;
S42, converting vectors: samples $x_i$ and $x_j$ pass through the embedding layer, and the generated vectors are denoted $y_i, y_j \in R^{n \times m}$; the resulting $R^{n \times m}$-dimensional feature vectors are used as new features and passed through a fully connected layer to output the cosine similarity of samples $x_i$ and $x_j$; the result, as a measure of similarity, is calculated by the following equation:
$$\cos\big(f_w(x_1), f_w(x_2)\big) = \frac{\langle f_w(x_1), f_w(x_2) \rangle}{\lVert f_w(x_1) \rVert \, \lVert f_w(x_2) \rVert}$$
wherein $f_w(x_i)$ represents the mapping of the input sample $x_i$ at the embedding layer, $\langle f_w(x_1), f_w(x_2) \rangle$ denotes the dot product of $f_w(x_1)$ and $f_w(x_2)$, and $\lVert f_w(x_i) \rVert$ is the norm of $f_w(x_i)$;
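The cosine similarity used in step S42 as the similarity measure can be sketched in plain Python. The function below operates on two already computed feature vectors; the network that produces them is not modeled, and the function name and the zero-vector convention are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

# Two feature vectors pointing in the same direction are maximally similar.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The value lies between -1 and 1 and depends only on the direction of the feature vectors, not on their length.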
S43, calculating a loss function from the similarity obtained in step S42, continuously adjusting the parameters of the twin neural network model by gradient descent, and repeating steps S41 and S42 in sequence to optimize the performance of the twin neural network model;
for input samples $a_n$ and $b_n$, the loss function between the two is calculated as follows:
$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y\, d^2 + (1 - y)\, \max(\mathrm{margin} - d,\ 0)^2 \right]$$
wherein $d = \lVert a_n - b_n \rVert_2$ represents the Euclidean distance between the features of the two samples, N is the number of samples, y is a label marking whether the two samples match (y = 1 means the two samples are similar or matched, y = 0 means they are not matched), and margin is a set threshold;
For the label y in L, when y = 1 the loss function reduces to
$$L = \frac{1}{2N} \sum_{n=1}^{N} d^2$$
That is, if samples that are actually similar have a large Euclidean distance in the feature space, the loss is large and the model needs to be adjusted. When y = 0, the loss function is
$$L = \frac{1}{2N} \sum_{n=1}^{N} \max(\mathrm{margin} - d,\ 0)^2$$
That is, when the samples are not similar, the smaller their Euclidean distance in the feature space, the larger the loss value.
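The behaviour of the loss in the two cases can be checked numerically. The sketch below assumes the loss takes the widely used contrastive form, which is consistent with the definitions of d, N, y and margin given above; the function names and the default margin of 1.0 are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(pairs, margin=1.0):
    """Average contrastive loss over labelled pairs (a, b, y).

    y = 1 penalises large distances between matching samples;
    y = 0 penalises distances smaller than the margin between non-matches.
    """
    total = 0.0
    for a, b, y in pairs:
        d = euclidean_distance(a, b)
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * len(pairs))

# A matching pair that lies far apart produces a large loss ...
far_match = contrastive_loss([([0.0, 0.0], [3.0, 4.0], 1)])
# ... while a non-matching pair at the same distance produces none.
far_non_match = contrastive_loss([([0.0, 0.0], [3.0, 4.0], 0)])
# A non-matching pair with identical features is penalised by margin^2 / 2.
close_non_match = contrastive_loss([([0.0, 0.0], [0.0, 0.0], 0)])
```

Gradient descent on such a loss pulls matching pairs together and pushes non-matching pairs apart until they are at least margin away from each other.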
S44, constructing positive and negative samples for training the twin neural network model;
The construction of the positive and negative samples comprises the following steps:
S441, inputting a sample S, and finding another identical or similar sample S' in the input data set according to the sample features to form a tuple (S, S', 1), wherein 1 represents a match;
S442, repeating the process of step S441 to generate the positive samples;
S443, inputting a sample S, and randomly selecting a sample S' from the complement of the samples matching S according to the sample features to form a tuple (S, S', 0), wherein 0 represents a non-match;
S444, repeating the process of step S443 to generate the negative samples.
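Steps S441 to S444 can be sketched as follows. The sketch assumes that ground-truth entity labels are available for the training data (the hypothetical `entity_of` mapping) and uses them in place of the feature-based search for similar samples; the function name and the sample identifiers are likewise illustrative only.

```python
import random

def build_training_pairs(samples, entity_of, seed=0):
    """Build (S, S', label) tuples: label 1 for matches, 0 for non-matches."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pairs = []
    for s in samples:
        # S441/S442: another sample of the same entity forms a positive pair.
        matches = [t for t in samples if t != s and entity_of[t] == entity_of[s]]
        if matches:
            pairs.append((s, rng.choice(matches), 1))
        # S443/S444: a random sample from the complement forms a negative pair.
        non_matches = [t for t in samples if entity_of[t] != entity_of[s]]
        if non_matches:
            pairs.append((s, rng.choice(non_matches), 0))
    return pairs

# Hypothetical training data: r1 and r2 describe the same entity, r3 another.
entity_of = {"r1": "e1", "r2": "e1", "r3": "e2"}
pairs = build_training_pairs(["r1", "r2", "r3"], entity_of)
```

Generating one positive and one negative pair per sample keeps the two classes roughly balanced, which is a design choice of this sketch rather than a requirement stated in the text.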
And S45, using the trained twin neural network model to predict the test set to obtain a data record set belonging to the same entity.
Based on embodiment 1, the present invention also provides a privacy-preserving data record integration system, including:
a data record structure creation module, used to create a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
a system authority setting module, used to set the management authority of the nodes of the data record integration system;
a record linking module, which, based on the data record structure of the data record integration system, compares the similarity of the candidate record pairs using the management authority of the nodes and outputs the data record information belonging to the same entity.
During the record linkage process, the data always remain in a data-isolated environment, which ensures that the data of each information system in the system are trustworthy during transmission.
Based on embodiment 1, the present invention further provides a computer-readable storage medium, which includes computer-executable instructions, and when the computer-executable instructions are executed by one or more processors, the processors execute the steps of the privacy-preserving data recording integration method according to any one of the above embodiments.
The invention applies a deep learning method to the privacy-preserving record linkage task, so that the technique can adapt to more scenarios; unlike existing machine learning methods, the twin neural network can automatically extract scenario features according to how attributes are represented in the scenario and build a model, so that linkage accuracy and the time efficiency of the whole process are improved while the credibility of the integration process is ensured; the encoded data record structure anonymizes the original data, so that the traceability of the data records and the credibility of the integration process are guaranteed and personal private data in the information system are not revealed, which is necessary in industrial applications with high requirements on data security and data traceability; in view of the heterogeneous storage of data, the invention automatically extracts the features of the scenario data and normalizes the data format, thereby greatly reducing the difficulty of data record integration; by linking the data in the data integration system, the invention can integrate data across different information systems while ensuring the credibility of the process, thereby effectively supporting cross-domain application fusion and expanding the scope and strength of data integration.
The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.