CN113946871A

CN113946871A - Privacy preserving data record integration method, system and computer readable storage medium

Info

Publication number: CN113946871A
Application number: CN202111383157.3A
Authority: CN
Inventors: 袁理锋; 姚思雨; 殷为峰; 李成煜; 任一支; 王冬; 王烨茹
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2021-11-22
Filing date: 2021-11-22
Publication date: 2022-01-18

Abstract

The invention belongs to the technical field of privacy-preserving record linkage, and in particular relates to a privacy-preserving data record integration method, system and computer-readable storage medium. The method includes the following steps: S1, creating a data record structure; S2, setting privacy parameters and node management authority; S3, using the node management authority and privacy parameters to encode and block the data of the data record integration system, generating multiple blocks that contain candidate record pairs; S4, comparing the similarity of the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity. The invention keeps records traceable without affecting data integration operations across information systems, supports data integration tasks in business and management scenarios that involve confidentiality (or personal privacy), and facilitates applications across multiple domains.

Description

Privacy preserving data record integration method, system and computer readable storage medium

Technical Field

The invention belongs to the technical field of privacy protection record linkage, and particularly relates to a privacy protection data record integration method, a system and a computer readable storage medium.

Background

With growing demand and the expanding scale of information system construction, information systems are usually operated independently, data in different applications are difficult to share, and the data island phenomenon becomes increasingly obvious. To resolve this problem, it is necessary to design a privacy-preserving data record integration method, system and computer-readable storage medium that can effectively solve the data island problem of information systems.

For example, Chinese patent application No. CN201811069639.X describes a big data acquisition and transaction system based on a blockchain and a trusted computing platform, which includes a user on-chain address verification module, a data acquisition module, a data uploading module, a data credibility verification module and a data reward payment module. That system alleviates the shortage of data sources by fusing large-scale personal data islands; supervises and protects data acquisition, storage, packaging and uploading, so that the data is credible across the whole link; protects user privacy by using direct anonymous attestation when the data acquisition company verifies the validity of public key addresses on the user chain; and keeps data reward payment open and transparent. It thereby reconciles, to a certain degree, the contradiction between personal privacy protection and big data acquisition, guarantees the credibility of the data source, and is practical and easy to implement. Its drawbacks are that it cannot effectively support cross-domain application fusion, that the scope and depth of data integration are limited, and that the difficulty of data record integration is not effectively reduced.

Disclosure of Invention

To solve the problems in the prior art that information systems are usually operated independently, data in different applications are difficult to share, and the data island phenomenon is increasingly obvious, the invention provides a privacy-preserving data record integration method, system and computer-readable storage medium that can effectively solve the data island problem of information systems.

In order to achieve the purpose, the invention adopts the following technical scheme:

The privacy-preserving data record integration method comprises the following steps:

S1, creating a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;

S2, setting the privacy parameters of the data record integration system and the management authority of the nodes;

S3, based on the data record structure, using the management authority of the nodes and the privacy parameters to encode and block the data of the data record integration system, so as to generate a plurality of blocks containing candidate record pairs;

S4, performing similarity comparison on the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity.

Preferably, step S1 includes the steps of:

S11, preprocessing the data records in the data record structure, wherein the preprocessing comprises data cleaning and standardization, and the data records are corrected by filling missing data, removing redundant values and unifying data formats in sequence;

S12, selecting quasi-identifiers: traversing the attribute fields and, according to the particularity of the scene, selecting a plurality of concatenated attributes that together uniquely identify a real-world entity as the quasi-identifier attributes;

S13, record-level Bloom filter encoding: splitting the selected quasi-identifier attributes into n-grams, connecting them through n Bloom filters, and mapping them into a record-level Bloom filter (RBF).
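The preprocessing and quasi-identifier selection of steps S11 and S12 can be illustrated with a minimal sketch. The field names (`first_name`, `last_name`, `city`), the `|` separator and the helper names are hypothetical, and the sketch removes duplicates after format unification (rather than before) so that formatting differences do not hide redundant rows; the source does not prescribe any of these details.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def normalize_value(value: Optional[str], default: str = "") -> str:
    """Fill a missing value and unify the format (trimmed, lower case, single spaces)."""
    if value is None:
        return default
    return " ".join(value.split()).lower()


def preprocess(records: List[Record], fields: List[str]) -> List[Dict[str, str]]:
    """S11 (sketch): fill missing data, unify formats, drop redundant rows."""
    seen = set()
    cleaned = []
    for record in records:
        row = {f: normalize_value(record.get(f)) for f in fields}
        key = tuple(row[f] for f in fields)
        if key not in seen:  # keep only the first copy of identical rows
            seen.add(key)
            cleaned.append(row)
    return cleaned


def quasi_identifier(row: Dict[str, str], qid_fields: List[str]) -> str:
    """S12 (sketch): concatenate the chosen attributes into one quasi-identifier."""
    return "|".join(row[f] for f in qid_fields)


# Hypothetical records from one information system.
records = [
    {"first_name": " John ", "last_name": "SMITH", "city": "Hangzhou"},
    {"first_name": "john", "last_name": "Smith", "city": "hangzhou"},
    {"first_name": "Mary", "last_name": None, "city": "Ningbo"},
]
cleaned = preprocess(records, ["first_name", "last_name", "city"])
qids = [quasi_identifier(r, ["first_name", "last_name"]) for r in cleaned]
```

In this toy input the first two rows collapse into one after normalization, which is the effect the cleaning step is meant to have before any encoding takes place.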

Preferably, the management authority of the nodes in step S2 includes a data viewing and distribution authority and a data link authority;

a node with the data viewing and distribution authority is used by the data holder to view the data in the information system while the data record system operates normally, and to distribute the encoded data to the link node;

a node with the data link authority is used to link the data of the candidate record pairs in the same block when a data integration operation is performed.
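The two authorities can be pictured as permission flags checked before encoded data moves between nodes. The sketch below is only an illustration: the `Authority`, `Node` and `distribute` names and the node names are hypothetical, and the source does not describe how authority is enforced.

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import List


class Authority(Flag):
    """Hypothetical permission flags mirroring the two authorities of step S2."""
    NONE = 0
    VIEW_DISTRIBUTE = auto()  # data viewing and distribution authority
    LINK = auto()             # data link authority


@dataclass
class Node:
    name: str
    authority: Authority

    def can(self, needed: Authority) -> bool:
        return (self.authority & needed) == needed


def distribute(sender: Node, receiver: Node, encoded_records: List[list]) -> List[list]:
    """Pass encoded records from a data node to a link node, checking both roles."""
    if not sender.can(Authority.VIEW_DISTRIBUTE):
        raise PermissionError(f"{sender.name} may not distribute data")
    if not receiver.can(Authority.LINK):
        raise PermissionError(f"{receiver.name} may not link data")
    return list(encoded_records)  # only encoded data leaves the data node


data_node = Node("information_system_1", Authority.VIEW_DISTRIBUTE)
link_node = Node("link_node", Authority.LINK)
received = distribute(data_node, link_node, [[1, 0, 1, 1]])
```

Keeping the check in one function makes the intended separation explicit: a data node never links, and a link node never sees anything but encoded records.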

Preferably, step S3 includes the steps of:

screening candidate record pairs by blocking: performing deduplication screening on the data of the data record integration system to obtain the candidate record pair data, selecting a public attribute other than the quasi-identifier for hash coding, and taking the result as the blocking key value; data records with the same blocking key value are placed in the same block, giving a set of blocks $B = \{b_1, b_2, \ldots, b_i\}$, where $b_i$ denotes the i-th block.
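A minimal sketch of this blocking step follows. It assumes a keyed hash (HMAC-SHA256) with a secret shared by the data nodes as the hash coding, and `birth_year` as the public attribute; both are illustrative choices, since the source only says that a public attribute is hash coded.

```python
import hashlib
import hmac
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def blocking_key(value: str, secret: bytes) -> str:
    """Hash-code a public attribute value (keyed hash is an assumption)."""
    message = value.strip().lower().encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def build_blocks(records: List[dict], attribute: str, secret: bytes) -> Dict[str, List[dict]]:
    """Put records with the same blocking key value into the same block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record[attribute], secret)].append(record)
    return dict(blocks)


def candidate_pairs(blocks: Dict[str, List[dict]]) -> List[Tuple[str, str]]:
    """Only records inside the same block become candidate record pairs."""
    pairs = []
    for members in blocks.values():
        for left, right in combinations(members, 2):
            pairs.append((left["id"], right["id"]))
    return pairs


# Hypothetical records; "birth_year" stands in for the public attribute.
records = [
    {"id": "a1", "birth_year": "1990"},
    {"id": "b7", "birth_year": "1990 "},
    {"id": "a2", "birth_year": "1985"},
]
blocks = build_blocks(records, "birth_year", secret=b"shared-secret")
pairs = candidate_pairs(blocks)
```

With three records and two distinct key values, only one pair is compared instead of three, which is the point of blocking.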

Preferably, a node having the data viewing and distribution authority is used as a data node and a node having the data link authority is used as a link node, and step S4 includes the following steps:

when the data of the data record integration system is sent from the data node to the link node, a twin (Siamese) neural network is used at the link node to compare the similarity of the candidate record pairs, and the linked results are integrated and output.

Preferably, step S4 includes the steps of:

S41, extracting features at the link node using a bidirectional recurrent neural network (BiLSTM): the input data sets $x_1, x_2$ first pass through the embedding layer of the BiLSTM, which divides the attribute data into m batch_size groups and generates n×m-dimensional conversion vectors whose entries are random numbers in (-1, 1);

where n is the number of Bloom filters, m is the number of batch_size groups, and batch_size is the number of samples used in one training pass;

S42, converting the vectors: samples $x_i$ and $x_j$ pass through the embedding layer, yielding vector representations $y_i, y_j \in \mathbb{R}^{n \times m}$; the resulting $\mathbb{R}^{n \times m}$-dimensional feature vectors serve as new features and pass through a fully connected layer, which outputs the cosine similarity of samples $x_i$ and $x_j$; the result, used as the similarity measure, is calculated by the following equation:

$$\cos\bigl(f_w(x_1), f_w(x_2)\bigr) = \frac{\langle f_w(x_1), f_w(x_2) \rangle}{\lVert f_w(x_1) \rVert \, \lVert f_w(x_2) \rVert}$$

where $f_w(x_i)$ denotes the mapping of input sample $x_i$ at the embedding layer, $\langle f_w(x_1), f_w(x_2) \rangle$ denotes the dot product of $f_w(x_1)$ and $f_w(x_2)$, and $\lVert f_w(x_i) \rVert$ is the norm (modulus) of $f_w(x_i)$;

S43, calculating a loss function from the similarity obtained in step S42, continuously adjusting the parameters of the twin neural network model by gradient descent, and repeating steps S41 and S42 in turn to optimize the twin neural network model;

for input samples $a_n$ and $b_n$, the loss function between the two is calculated as follows:

$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y \, d^2 + (1 - y) \, \max(\mathrm{margin} - d,\, 0)^2 \right]$$

where $d = \lVert a_n - b_n \rVert_2$ is the Euclidean distance between the features of the two samples, N is the number of samples, y is a label marking whether the two samples match (y = 1 means the two samples are similar or matched, y = 0 means they do not match), and margin is a set threshold;

S44, constructing positive and negative samples for training the twin neural network model;

S45, using the trained twin neural network model to predict on the test set, obtaining the set of data records belonging to the same entity.
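To make the end of this pipeline concrete, the sketch below scores candidate pairs with cosine similarity (the measure named in S42) and groups the records judged to belong to the same entity (the output of S45). The feature vectors, the 0.9 decision threshold and the union-find grouping are assumptions for illustration; the source does not state how matched pairs are merged into entities.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_entities(
    features: Dict[str, Sequence[float]],
    pairs: List[Tuple[str, str]],
    threshold: float = 0.9,  # assumed decision threshold
) -> List[List[str]]:
    """Merge records whose candidate pair is similar enough (union-find)."""
    parent = {rid: rid for rid in features}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right in pairs:
        if cosine_similarity(features[left], features[right]) >= threshold:
            parent[find(left)] = find(right)

    groups: Dict[str, List[str]] = {}
    for rid in features:
        groups.setdefault(find(rid), []).append(rid)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical feature vectors, standing in for the model output.
features = {"a1": [1.0, 0.0, 1.0], "b7": [0.9, 0.1, 1.0], "a2": [0.0, 1.0, 0.0]}
pairs = [("a1", "b7"), ("a1", "a2")]
entities = group_entities(features, pairs)
```

Here `a1` and `b7` end up in one group and `a2` stays alone, since the second pair has orthogonal feature vectors.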

Preferably, the constructing of the positive and negative samples in step S44 includes the following steps:

S441, inputting a sample s and, according to the sample features, finding another identical or similar sample s' in the input data set to form a value pair (s, s', 1), where 1 indicates a match;

S442, repeating step S441 to generate the positive samples;

S443, inputting a sample s and, according to the sample features, randomly selecting a sample s' from the complement set of sample s to form a value pair (s, s', 0), where 0 indicates a mismatch;

S444, repeating step S443 to generate the negative samples.
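Steps S441 to S444 can be sketched as follows. The sketch assumes that each training record carries a known entity label, which stands in for "same or similar according to the sample features"; the record ids, labels and the fixed random seed are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str, int]


def build_pairs(labels: Dict[str, str], seed: int = 0) -> Tuple[List[Pair], List[Pair]]:
    """Build (s, s', 1) positive pairs and (s, s', 0) negative pairs.

    `labels` maps a record id to its known entity label (assumed ground truth).
    """
    rng = random.Random(seed)
    ids = sorted(labels)
    positives: List[Pair] = []
    negatives: List[Pair] = []
    for s in ids:
        same = [t for t in ids if t != s and labels[t] == labels[s]]
        other = [t for t in ids if labels[t] != labels[s]]  # complement set of s
        if same:   # S441/S442: pair s with a matching sample
            positives.append((s, rng.choice(same), 1))
        if other:  # S443/S444: pair s with a randomly chosen non-matching sample
            negatives.append((s, rng.choice(other), 0))
    return positives, negatives


labels = {"a1": "e1", "b7": "e1", "a2": "e2", "b3": "e2", "c9": "e3"}
positives, negatives = build_pairs(labels)
```

A record with no matching partner (`c9` here) contributes only a negative pair, so the two sets need not be the same size.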

The invention also provides a privacy-preserving data record integration system, which comprises:

a data record structure creation module, used to create the data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;

a system authority setting module, used to set the management authority of the nodes of the data record integration system;

and a record linking module which, based on the data record structure of the data record integration system, uses the management authority of the nodes to compare the similarity of the candidate record pairs and outputs the data record information belonging to the same entity.

The present invention also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the privacy-preserving data record integration method of any one of the above embodiments.

Compared with the prior art, the invention has the following beneficial effects: (1) the invention applies deep learning to the privacy-preserving record linkage task, so that the technique can adapt to more scenarios; unlike existing machine learning methods, the twin neural network automatically extracts scene features according to how the attributes of the scene are represented and builds a model from them, which improves linkage accuracy and the time efficiency of the whole process while keeping the integration process credible; (2) the encoded data record structure anonymizes the original data, which keeps data records traceable and the integration process credible without revealing personal private data held in the information systems, as required in industrial applications with high demands on data security and data traceability; (3) to handle heterogeneous data storage, the invention automatically extracts features of the scene data and normalizes the data format, which greatly reduces the difficulty of data record integration; (4) by linking the data in the data integration system, the invention can integrate data across different information systems while keeping the process credible, which effectively supports cross-domain application fusion and expands the scope and depth of data integration.

Drawings

FIG. 1 is a flow chart of a privacy preserving data record integration method according to an embodiment of the present invention;

FIG. 2 is a diagram of a data structure for record level bloom filter encoded records according to an embodiment of the present invention;

FIG. 3 is a block diagram of an architecture of an integrated system for privacy preserving data records according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the similarity comparison process according to an embodiment of the present invention.

Detailed Description

In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.

Example 1:

As shown in FIG. 1, the privacy-preserving data record integration method includes the following steps:

Step S1 specifically includes the following steps:

Specifically, the surname and the first name are selected as quasi-identifier attributes and Bloom-filter encoded to form the quasi-identifier record data structure. For example, the surname John and the first name Smith are split into 2-grams (Jo, oh, hn, Sm, mi, it, th), which are connected through n Bloom filters and mapped into the record-level Bloom filter RBF, as shown in FIG. 2.
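The John/Smith example can be reproduced with a short sketch of 2-gram splitting and record-level Bloom filter encoding. The filter size (64 bits per attribute), the two hash functions derived from SHA-256, and the simple concatenation of per-attribute filters are illustrative assumptions; the source does not give these parameters, and a deployed system would more plausibly use keyed hashes.

```python
import hashlib
from typing import List


def qgrams(text: str, q: int = 2) -> List[str]:
    """Split a string into overlapping q-grams (2-grams by default)."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def bloom_filter(grams: List[str], size: int = 64, num_hashes: int = 2) -> List[int]:
    """Set `num_hashes` bit positions per gram in a bit array of length `size`."""
    bits = [0] * size
    for gram in grams:
        for k in range(num_hashes):
            digest = hashlib.sha256(f"{k}:{gram}".encode("utf-8")).digest()
            bits[int.from_bytes(digest[:8], "big") % size] = 1
    return bits


def record_level_bloom_filter(attributes: List[str], size: int = 64, num_hashes: int = 2) -> List[int]:
    """Encode each attribute in its own Bloom filter and concatenate them (RBF sketch)."""
    rbf: List[int] = []
    for value in attributes:
        rbf.extend(bloom_filter(qgrams(value), size, num_hashes))
    return rbf


grams = qgrams("John") + qgrams("Smith")
rbf = record_level_bloom_filter(["John", "Smith"])
```

Because similar strings share most of their 2-grams, their filters share most of their set bits, which is what lets the link node compare records without seeing the original names.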

The management authority of the nodes in step S2 includes a data viewing and distribution authority and a data link authority;

And taking the node with the data viewing and distributing authority as a data node, and taking the node with the data link authority as a link node. In the present invention, the data nodes each represent an information system.

As shown in FIG. 3, in a data isolation environment, data nodes 1 to n have data view distribution rights and the link node has data link rights.

Step S3 specifically includes the following steps:

screening candidate record pairs by blocking: performing deduplication screening on the data of the data record integration system to obtain the candidate record pair data, selecting a public attribute other than the quasi-identifier for hash coding, and taking the result as the blocking key value; data records with the same blocking key value are placed in the same block, giving a set of blocks $B = \{b_1, b_2, \ldots, b_i\}$, where $b_i$ denotes the i-th block.

Step S4 includes the following steps:

As shown in FIG. 4, step S4 specifically includes the following steps:

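S42, converting the vectors: samples $x_i$ and $x_j$ pass through the embedding layer, yielding vector representations $y_i, y_j \in \mathbb{R}^{n \times m}$; the resulting $\mathbb{R}^{n \times m}$-dimensional feature vectors serve as new features and pass through a fully connected layer, which outputs the cosine similarity of samples $x_i$ and $x_j$; the result, used as the similarity measure, is calculated by the following equation:

$$\cos\bigl(f_w(x_1), f_w(x_2)\bigr) = \frac{\langle f_w(x_1), f_w(x_2) \rangle}{\lVert f_w(x_1) \rVert \, \lVert f_w(x_2) \rVert}$$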

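For the label y in the loss function L: when y = 1, only the first term of the loss function remains:

$$L = \frac{1}{2N} \sum_{n=1}^{N} y \, d^2$$

That is, for samples that are actually similar, a large Euclidean distance in the feature space yields a large loss, so the model needs to be adjusted. When y = 0, the loss function is

$$L = \frac{1}{2N} \sum_{n=1}^{N} (1 - y) \, \max(\mathrm{margin} - d,\, 0)^2$$

That is, when the samples are not similar, the smaller the Euclidean distance in the feature space, the larger the loss value.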
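The two cases can be checked numerically with a small sketch. It assumes the standard contrastive loss form, L = 1/(2N) Σ [y·d² + (1 − y)·max(margin − d, 0)²] with d the Euclidean distance between the two feature vectors, which matches the symbols defined in the text; the feature values and margin = 1.0 are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple

LabeledPair = Tuple[Sequence[float], Sequence[float], int]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance d between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(pairs: List[LabeledPair], margin: float = 1.0) -> float:
    """Assumed form: 1/(2N) * sum(y*d^2 + (1-y)*max(margin-d, 0)^2)."""
    total = 0.0
    for a, b, y in pairs:
        d = euclidean(a, b)
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * len(pairs))


# y = 1: matching samples are penalized when they lie far apart.
loss_match_far = contrastive_loss([([0.0, 0.0], [3.0, 4.0], 1)])
loss_match_near = contrastive_loss([([0.0, 0.0], [0.0, 0.0], 1)])

# y = 0: non-matching samples are penalized when they lie closer than the margin.
loss_nonmatch_near = contrastive_loss([([0.0, 0.0], [0.0, 0.0], 0)])
loss_nonmatch_far = contrastive_loss([([0.0, 0.0], [3.0, 4.0], 0)])
```

A matching pair at distance 5 gives 25/2 = 12.5, while a non-matching pair at distance 0 gives margin²/2 = 0.5 and one beyond the margin gives 0, in line with the behavior described above.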

The construction of the positive and negative samples comprises the following steps:

S442, repeating step S441 to generate the positive samples;

S444, repeating step S443 to generate the negative samples.

Based on embodiment 1, the present invention also provides a privacy-preserving data recording integration system, including:

In the recording process, the data is always in a data isolation environment, so that the data of each information system in the system is guaranteed to be credible in the transmission process.

Based on embodiment 1, the present invention further provides a computer-readable storage medium, which includes computer-executable instructions, and when the computer-executable instructions are executed by one or more processors, the processors execute the steps of the privacy-preserving data recording integration method according to any one of the above embodiments.

The invention applies deep learning to the privacy-preserving record linkage task, so that the technique can adapt to more scenarios; unlike existing machine learning methods, the twin neural network automatically extracts scene features according to how the attributes of the scene are represented and builds a model from them, which improves linkage accuracy and the time efficiency of the whole process while keeping the integration process credible. The encoded data record structure anonymizes the original data, which keeps data records traceable and the integration process credible without revealing personal private data held in the information systems, as required in industrial applications with high demands on data security and data traceability. To handle heterogeneous data storage, the invention automatically extracts features of the scene data and normalizes the data format, which greatly reduces the difficulty of data record integration. By linking the data in the data integration system, the invention can integrate data across different information systems while keeping the process credible, which effectively supports cross-domain application fusion and expands the scope and depth of data integration.

The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.

Claims

1. A privacy-preserving data record integration method, characterized by comprising the following steps:

S1, creating a data record structure of a data record integration system, the data record structure including a quasi-identifier data structure and a key-value block data structure;

S2, setting the privacy parameters of the data record integration system and the management authority of the nodes;

S3, based on the data record structure, using the management authority of the nodes and the privacy parameters to encode and block the data of the data record integration system, so as to generate multiple blocks containing candidate record pairs;

S4, comparing the similarity of the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity.

2. The privacy protection data record integration method according to claim 1, wherein step S1 comprises the following steps:

S11, preprocessing the data records in the data record structure, the preprocessing including data cleaning and standardization, the data records being corrected by filling missing data, removing redundant values and unifying the data format in turn;

S12, selecting quasi-identifiers: traversing the attribute fields and, according to the particularity of the scene, selecting multiple concatenated attributes that together uniquely identify a real-world entity as the quasi-identifier attributes;

S13, record-level Bloom filter encoding: splitting the selected quasi-identifier attributes into n-grams, connecting them through n Bloom filters, and mapping them into a record-level Bloom filter (RBF).

3. The privacy-preserving data record integration method according to claim 2, wherein the management authority of the nodes in step S2 includes a data viewing and distribution authority and a data link authority;

a node with the data viewing and distribution authority is used by the data holder to view the data in the information system while the data record system operates normally, and to distribute the encoded data to the link node;

a node with the data link authority is used to link the data of the candidate record pairs in the same block when a data integration operation is performed.

4. The privacy protection data record integration method according to claim 1, wherein step S3 comprises the following steps:

screening candidate record pairs by blocking: performing deduplication screening on the data of the data record integration system to obtain the candidate record pair data, selecting a public attribute other than the quasi-identifier for hash coding, and taking the result as the blocking key value; data records with the same blocking key value are placed in the same block, giving a set of blocks $B = \{b_1, b_2, \ldots, b_i\}$, where $b_i$ denotes the i-th block.

5. The privacy-preserving data record integration method according to claim 3, wherein a node with the data viewing and distribution authority is used as a data node and a node with the data link authority is used as a link node, and step S4 comprises the following steps:

when the data of the data record integration system is sent from the data node to the link node, the twin neural network is used at the link node to compare the similarity of the candidate record pairs, and the linked results are integrated and output.

6. The privacy protection data record integration method according to claim 5, wherein step S4 comprises the following steps:

S41, extracting features at the link node using a bidirectional recurrent neural network (BiLSTM): the input data sets $x_1, x_2$ first pass through the embedding layer of the BiLSTM, which divides the attribute data into m batch_size groups and generates n×m-dimensional conversion vectors whose entries are random numbers in (-1, 1);

where n is the number of Bloom filters, m is the number of batch_size groups, and batch_size is the number of samples used in one training pass;

S42, converting the vectors: samples $x_i$ and $x_j$ pass through the embedding layer, yielding vector representations $y_i, y_j \in \mathbb{R}^{n \times m}$; the resulting $\mathbb{R}^{n \times m}$-dimensional feature vectors serve as new features and pass through a fully connected layer, which outputs the cosine similarity of samples $x_i$ and $x_j$; the result calculated by the following formula is used as the similarity measure:

$$\cos\bigl(f_w(x_1), f_w(x_2)\bigr) = \frac{\langle f_w(x_1), f_w(x_2) \rangle}{\lVert f_w(x_1) \rVert \, \lVert f_w(x_2) \rVert}$$

where $f_w(x_i)$ denotes the mapping of input sample $x_i$ at the embedding layer, $\langle f_w(x_1), f_w(x_2) \rangle$ denotes the dot product of $f_w(x_1)$ and $f_w(x_2)$, and $\lVert f_w(x_i) \rVert$ is the norm (modulus) of $f_w(x_i)$;

S43, calculating a loss function from the similarity obtained in step S42, continuously adjusting the parameters of the twin neural network model by gradient descent, and repeating steps S41 and S42 in turn to optimize the twin neural network model;

for input samples $a_n$ and $b_n$, the loss function between them is calculated as follows:

$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y \, d^2 + (1 - y) \, \max(\mathrm{margin} - d,\, 0)^2 \right]$$

where $d = \lVert a_n - b_n \rVert_2$ is the Euclidean distance between the features of the two samples, N is the number of samples, y is a label marking whether the two samples match (y = 1 means the two samples are similar or matched, y = 0 means they do not match), and margin is a set threshold;

S44, constructing positive and negative samples for training the twin neural network model;

S45, using the trained twin neural network model to predict on the test set, obtaining the set of data records belonging to the same entity.

7. The privacy-preserving data record integration method according to claim 6, wherein the constructing positive and negative samples described in step S44 comprises the following steps:

S441, input a sample s, find another same or similar sample s' from the input data set according to the sample characteristics, and form a value pair (s, s', 1), where 1 represents a match;

S442, repeat the process of step S441, and generate a positive sample;

S443, input a sample s, randomly select a sample s' from the complement set of the sample s according to the characteristics of the sample, and form a value pair (s, s', 0), where 0 indicates a mismatch;

S444, the process of step S443 is repeated, and a negative sample is generated.

8. An integrated system for privacy protection data recording, characterized in that it includes:

a data record structure creation module for creating a data record structure of the data record integration system, the data record structure including a quasi-identifier data structure and a key-value block data structure;

The system authority setting module is used to set the management authority of the nodes of the data recording integrated system;

The record linking module, based on the data record structure of the data record integration system, is configured to compare the similarity of the candidate record pairs by utilizing the management authority of the node, and output the data record information belonging to the same entity.

9. A computer-readable storage medium, characterized by comprising computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the privacy-preserving data record integration method of any one of claims 1-7.