CN113946871A - Privacy preserving data record integration method, system and computer readable storage medium - Google Patents


Publication number
CN113946871A
Authority
CN
China
Prior art keywords
data
record
privacy
data record
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111383157.3A
Other languages
Chinese (zh)
Inventor
袁理锋
姚思雨
殷为峰
李成煜
任一支
王冬
王烨茹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202111383157.3A
Publication of CN113946871A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/604 Tools and structures for managing or administering access control systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2141 Access rights, e.g. capability lists, access control lists, access tables, access matrices

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioethics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Automation & Control Theory (AREA)
  • Storage Device Security (AREA)

Abstract

The invention belongs to the technical field of privacy-preserving record linkage, and particularly relates to a privacy-preserving data record integration method, a system and a computer-readable storage medium. The method comprises the following steps: S1, creating a data record structure; S2, setting privacy parameters and the management authority of the nodes; S3, encoding and blocking the data of the data record integration system by using the management authority and the privacy parameters of the nodes to generate a plurality of blocks containing candidate record pairs; and S4, comparing the similarity of the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity. The invention keeps records traceable without affecting data integration across information systems, supports data integration tasks when the data of a service scenario or a management scenario are confidential (or involve personal privacy), and makes it easier to apply the data record integration system in multiple fields.

Description

Privacy preserving data record integration method, system and computer readable storage medium
Technical Field
The invention belongs to the technical field of privacy-preserving record linkage, and particularly relates to a privacy-preserving data record integration method, a system and a computer-readable storage medium.
Background
With growing demand and the ever larger scale of information system construction, information systems are usually operated independently, data in different application programs are difficult to share, and the data island (data silo) phenomenon is becoming increasingly obvious. To resolve this conflict, it is necessary to design a privacy-preserving data record integration method, system and computer-readable storage medium that can effectively solve the data island problem of information systems.
For example, Chinese patent application No. CN201811069639.X describes a big data acquisition and transaction system based on a blockchain and a trusted computing platform, which includes a user on-chain address verification module, a data acquisition module, a data uploading module, a data credibility verification module, and a data reward payment module. That system alleviates the shortage of data sources by fusing large-scale personal data islands; it supervises and protects data acquisition, storage, packaging, and uploading in an all-round manner, making the data trustworthy across the whole chain; it protects user privacy when a data acquisition company verifies the validity of a user's on-chain public key address by means of direct anonymous attestation; and it keeps data reward payments open and transparent. It thereby reconciles, to a certain degree, the conflict between personal privacy protection and big data acquisition, guarantees the credibility of the data source, and is practical and easy to implement. Its drawbacks are that it cannot effectively support cross-domain application fusion, the scope and depth of data integration are limited, and the difficulty of data record integration is not effectively reduced.
Disclosure of Invention
In the prior art, information systems are usually operated independently, data in different application programs are difficult to share, and the data island phenomenon is increasingly obvious. To solve these problems, the invention provides a privacy-preserving data record integration method, a system and a computer-readable storage medium that can effectively solve the data island problem of information systems.
In order to achieve this purpose, the invention adopts the following technical scheme:
The privacy-preserving data record integration method comprises the following steps:
S1, creating a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
S2, setting the privacy parameters of the data record integration system and the management authority of the nodes;
S3, based on the data record structure, encoding and blocking the data of the data record integration system by using the management authority and the privacy parameters of the nodes, so as to generate a plurality of blocks containing candidate record pairs;
and S4, comparing the similarity of the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity.
Preferably, step S1 includes the following steps:
S11, preprocessing the data records in the data record structure, wherein the preprocessing comprises data cleaning and standardization, and the data records are corrected by, in sequence, filling in missing data, removing redundant values and unifying data formats;
S12, selecting quasi-identifiers: traversing the attribute fields and, according to the particularities of the scenario, selecting a plurality of concatenated attributes that can uniquely identify a real-world entity as the quasi-identifier attributes;
S13, record-level Bloom filter encoding: splitting the selected quasi-identifier attributes into n-grams, combining them through n Bloom filters, and mapping them into a record-level Bloom filter (RBF).
Preferably, the management authority of the nodes in step S2 includes a data viewing and distribution authority and a data link authority;
a node with the data viewing and distribution authority is used by the data holder to view the data in the information system while the data record system is operating normally, and to distribute the encoded data to the link node;
and a node with the data link authority is used to link the data of the candidate record pairs in the same block when the data integration operation is performed.
Preferably, step S3 includes the following steps:
screening candidate record pairs by blocking: the data of the data record integration system are deduplicated to obtain candidate record data, a public attribute other than the quasi-identifier is selected and hash-encoded, and the result is used as the blocking key value; data records with the same blocking key value are placed in the same block, yielding a set of blocks $B = \{b_1, b_2, \ldots, b_i\}$, where $b_i$ denotes the i-th block.
Preferably, a node with the data viewing and distribution authority serves as a data node, a node with the data link authority serves as a link node, and step S4 includes the following step:
when the data nodes send the data of the data record integration system to the link node, a twin (Siamese) neural network is used at the link node to compare the similarity of the candidate record pairs, and the linked results are integrated and output.
Preferably, step S4 includes the following steps:
S41, extracting features at the link node by using a bidirectional recurrent neural network (BiLSTM): for the input data set $x_1, x_2$, the attribute data are first divided into m batches through the embedding layer of the BiLSTM, generating n × m dimensional conversion vectors whose entries are random numbers in (-1, 1);
where n represents the number of Bloom filters, m represents the number of batches, and batch_size represents the number of samples in one training pass;
S42, converting vectors: after samples $x_i$ and $x_j$ pass through the embedding layer, the generated vectors are denoted $y_i, y_j \in R^{n \times m}$; the resulting $R^{n \times m}$-dimensional feature vectors are used as new features, and a fully connected layer outputs the cosine similarity of samples $x_i$ and $x_j$; the result, used as the similarity measure, is calculated by the following equation:
$$\operatorname{sim}(x_1, x_2) = \frac{\langle f_w(x_1), f_w(x_2) \rangle}{\lVert f_w(x_1) \rVert \, \lVert f_w(x_2) \rVert}$$
where $f_w(x_i)$ denotes the mapping of input sample $x_i$ at the embedding layer, $\langle f_w(x_1), f_w(x_2) \rangle$ denotes the dot product of $f_w(x_1)$ and $f_w(x_2)$, and $\lVert f_w(x_i) \rVert$ is the norm (modulus) of $f_w(x_i)$;
S43, calculating a loss function from the similarity obtained in step S42, continuously adjusting the parameters of the twin neural network model by gradient descent, and repeating step S41 and step S42 in sequence to improve the performance of the twin neural network model;
for input samples $a_n$ and $b_n$, the loss function between the two is calculated as follows:
$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y\, d^2 + (1 - y) \max(\mathrm{margin} - d,\ 0)^2 \right]$$
where $d = \lVert a_n - b_n \rVert_2$ represents the Euclidean distance between the features of the two samples, N is the number of samples, y is a label marking whether the two samples match (y = 1 means the two samples are similar or matched, y = 0 means they do not match), and margin is a set threshold;
S44, constructing positive and negative samples for training the twin neural network model;
and S45, using the trained twin neural network model to predict on the test set to obtain the set of data records belonging to the same entity.
Preferably, the construction of the positive and negative samples in step S44 includes the following steps:
S441, inputting a sample S, and finding another identical or similar sample S' in the input data set according to the sample features to form a value pair (S, S', 1), where 1 indicates a match;
S442, repeating the process of step S441 to generate the positive samples;
S443, inputting a sample S, and randomly selecting a sample S' from the complement of sample S according to the sample features to form a value pair (S, S', 0), where 0 indicates a mismatch;
S444, repeating the process of step S443 to generate the negative samples.
The invention also provides a privacy-preserving data record integration system, which comprises:
a data record structure creation module, used to create a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
a system authority setting module, used to set the management authority of the nodes of the data record integration system;
and a record linking module, which, based on the data record structure of the data record integration system, uses the management authority of the nodes to compare the similarity of the candidate record pairs and outputs the data record information belonging to the same entity.
The present invention also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the privacy-preserving data record integration method of any one of the above embodiments.
Compared with the prior art, the invention has the following beneficial effects: (1) the invention applies deep learning to the privacy-preserving record linkage task, so that the technique can adapt to more scenarios; unlike existing machine learning methods, the twin neural network automatically extracts scenario features from the attribute representation of the scenario and builds a model, which improves linkage accuracy and the time efficiency of the whole process while keeping the integration process trustworthy; (2) the encoded data record structure anonymizes the original data, which preserves the traceability of data records, keeps the integration process trustworthy, and prevents personal privacy data in the information system from being disclosed; this is necessary in industrial applications with strict requirements on data security and data traceability; (3) to address heterogeneous data storage, the invention automatically extracts the features of scenario data and normalizes the data format, which greatly reduces the difficulty of data record integration; (4) by linking the data in the data integration system, the invention can integrate data across different information systems while keeping the process trustworthy, which effectively supports cross-domain application fusion and expands the scope and depth of data integration.
Drawings
FIG. 1 is a flow chart of a privacy preserving data record integration method according to an embodiment of the present invention;
FIG. 2 is a diagram of a data structure for record level bloom filter encoded records according to an embodiment of the present invention;
FIG. 3 is a block diagram of an architecture of an integrated system for privacy preserving data records according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the similarity comparison process according to an embodiment of the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
Example 1:
As shown in FIG. 1, the privacy-preserving data record integration method includes the following steps:
S1, creating a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
Step S1 specifically includes the following steps:
S11, preprocessing the data records in the data record structure, wherein the preprocessing comprises data cleaning and standardization, and the data records are corrected by, in sequence, filling in missing data, removing redundant values and unifying data formats;
S12, selecting quasi-identifiers: traversing the attribute fields and, according to the particularities of the scenario, selecting a plurality of concatenated attributes that can uniquely identify a real-world entity as the quasi-identifier attributes;
S13, record-level Bloom filter encoding: splitting the selected quasi-identifier attributes into n-grams, combining them through n Bloom filters, and mapping them into a record-level Bloom filter (RBF).
Specifically, the first name and the surname are selected as quasi-identifier attributes and encoded with Bloom filters to form the quasi-identifier record data structure. For example, the first name John and the surname Smith are split into 2-grams: Jo, oh, hn, Sm, mi, it, th; these are combined through n Bloom filters and mapped into a record-level Bloom filter (RBF), as shown in FIG. 2.
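The encoding in step S13 can be illustrated with a minimal Python sketch. It rests on several assumptions that the text does not specify: each quasi-identifier attribute gets its own Bloom filter, the per-attribute filters are simply concatenated to form the RBF, and the filter size (64 bits), the number of hash functions (2) and the use of salted SHA-256 are hypothetical choices made only for illustration.

```python
import hashlib

def ngrams(text, n=2):
    """Split a string into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def bloom_encode(grams, size=64, num_hashes=2):
    """Set one bit per (n-gram, hash index) using salted SHA-256 (assumed hash)."""
    bits = [0] * size
    for gram in grams:
        for k in range(num_hashes):
            digest = hashlib.sha256(f"{k}:{gram}".encode("utf-8")).digest()
            bits[int.from_bytes(digest[:8], "big") % size] = 1
    return bits

def record_level_bloom_filter(attributes, n=2, size=64, num_hashes=2):
    """Encode each attribute separately and concatenate the filters (assumed layout)."""
    rbf = []
    for value in attributes:
        rbf.extend(bloom_encode(ngrams(value, n), size, num_hashes))
    return rbf

# The 2-grams of the example attributes: Jo, oh, hn, Sm, mi, it, th
grams = ngrams("John") + ngrams("Smith")
rbf = record_level_bloom_filter(["John", "Smith"])
```

Because the encoding is deterministic, two data nodes that agree on the parameters produce identical bit vectors for identical attribute values, while the link node only ever sees the bit vectors.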
S2, setting the privacy parameters of the data record integration system and the management authority of the nodes;
The management authority of the nodes in step S2 includes a data viewing and distribution authority and a data link authority;
a node with the data viewing and distribution authority is used by the data holder to view the data in the information system while the data record system is operating normally, and to distribute the encoded data to the link node;
and a node with the data link authority is used to link the data of the candidate record pairs in the same block when the data integration operation is performed.
A node with the data viewing and distribution authority serves as a data node, and a node with the data link authority serves as a link node. In the present invention, each data node represents one information system.
As shown in FIG. 3, in a data-isolated environment, data nodes 1 to n have the data viewing and distribution authority, and the link node has the data link authority.
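The division of authority between data nodes and the link node might be modeled as in the following Python sketch. The class names, the `distribute` helper and the message layout are hypothetical and serve only to illustrate the two kinds of authority described above; the patent does not prescribe an implementation.

```python
from enum import Flag, auto

class Authority(Flag):
    VIEW_DISTRIBUTE = auto()  # data viewing and distribution authority
    LINK = auto()             # data link authority

class Node:
    def __init__(self, name, authority):
        self.name = name
        self.authority = authority

    def can(self, required):
        """Return True if this node holds the required authority."""
        return bool(self.authority & required)

def distribute(sender, receiver, encoded_records):
    """Send encoded records from a data node to a link node, checking authority."""
    if not sender.can(Authority.VIEW_DISTRIBUTE):
        raise PermissionError(f"{sender.name} may not distribute data")
    if not receiver.can(Authority.LINK):
        raise PermissionError(f"{receiver.name} may not link data")
    return {"from": sender.name, "to": receiver.name, "records": encoded_records}

# Data nodes 1..3 each stand for one information system (count is illustrative).
data_nodes = [Node(f"data_node_{i}", Authority.VIEW_DISTRIBUTE) for i in range(1, 4)]
link_node = Node("link_node", Authority.LINK)
message = distribute(data_nodes[0], link_node, [[0, 1, 1, 0]])
```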
S3, based on the data record structure, encoding and blocking the data of the data record integration system by using the management authority and the privacy parameters of the nodes, so as to generate a plurality of blocks containing candidate record pairs;
Step S3 specifically includes the following steps:
screening candidate record pairs by blocking: the data of the data record integration system are deduplicated to obtain candidate record data, a public attribute other than the quasi-identifier is selected and hash-encoded, and the result is used as the blocking key value; data records with the same blocking key value are placed in the same block, yielding a set of blocks $B = \{b_1, b_2, \ldots, b_i\}$, where $b_i$ denotes the i-th block.
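A minimal Python sketch of the blocking step follows. The attribute name `city`, the record identifiers, the normalization rule and the shared salt are all hypothetical; the text only states that a public attribute is hash-encoded and used as the blocking key value.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def blocking_key(value, salt="shared-secret"):
    """Hash a normalized public attribute value into a blocking key (assumed scheme)."""
    normalized = value.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()

def build_blocks(records, attribute):
    """Group record ids by the blocking key of the chosen public attribute."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record[attribute])].append(record["id"])
    return dict(blocks)

def candidate_pairs(blocks):
    """Only records that share a block are compared with each other."""
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs

records = [
    {"id": "A1", "city": "Hangzhou"},
    {"id": "B7", "city": "hangzhou "},
    {"id": "B9", "city": "Shanghai"},
]
blocks = build_blocks(records, "city")
pairs = candidate_pairs(blocks)
```

Here the three records fall into two blocks, so only the pair (A1, B7) is passed on for similarity comparison instead of all three possible pairs.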
S4, comparing the similarity of the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity.
Step S4 includes the following step:
when the data nodes send the data of the data record integration system to the link node, a twin (Siamese) neural network is used at the link node to compare the similarity of the candidate record pairs, and the linked results are integrated and output.
As shown in FIG. 4, step S4 specifically includes the following steps:
S41, extracting features at the link node by using a bidirectional recurrent neural network (BiLSTM): for the input data set $x_1, x_2$, the attribute data are first divided into m batches through the embedding layer of the BiLSTM, generating n × m dimensional conversion vectors whose entries are random numbers in (-1, 1);
where n represents the number of Bloom filters, m represents the number of batches, and batch_size represents the number of samples in one training pass;
S42, converting vectors: after samples $x_i$ and $x_j$ pass through the embedding layer, the generated vectors are denoted $y_i, y_j \in R^{n \times m}$; the resulting $R^{n \times m}$-dimensional feature vectors are used as new features, and a fully connected layer outputs the cosine similarity of samples $x_i$ and $x_j$; the result, used as the similarity measure, is calculated by the following equation:
$$\operatorname{sim}(x_1, x_2) = \frac{\langle f_w(x_1), f_w(x_2) \rangle}{\lVert f_w(x_1) \rVert \, \lVert f_w(x_2) \rVert}$$
where $f_w(x_i)$ denotes the mapping of input sample $x_i$ at the embedding layer, $\langle f_w(x_1), f_w(x_2) \rangle$ denotes the dot product of $f_w(x_1)$ and $f_w(x_2)$, and $\lVert f_w(x_i) \rVert$ is the norm (modulus) of $f_w(x_i)$;
S43, calculating a loss function from the similarity obtained in step S42, continuously adjusting the parameters of the twin neural network model by gradient descent, and repeating step S41 and step S42 in sequence to improve the performance of the twin neural network model;
for input samples $a_n$ and $b_n$, the loss function between the two is calculated as follows:
$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y\, d^2 + (1 - y) \max(\mathrm{margin} - d,\ 0)^2 \right]$$
where $d = \lVert a_n - b_n \rVert_2$ represents the Euclidean distance between the features of the two samples, N is the number of samples, y is a label marking whether the two samples match (y = 1 means the two samples are similar or matched, y = 0 means they do not match), and margin is a set threshold;
For the label y in L: when y = 1, the loss function reduces to
$$L = \frac{1}{2N} \sum_{n=1}^{N} d^2$$
That is, when originally similar samples have a large Euclidean distance in the feature space, the loss is large and the model parameters need to be adjusted. When y = 0, the loss function reduces to
$$L = \frac{1}{2N} \sum_{n=1}^{N} \max(\mathrm{margin} - d,\ 0)^2$$
That is, when the samples are not similar, the smaller their Euclidean distance in the feature space, the larger the loss value.
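The behaviour of the loss in the two cases can be checked numerically with a short Python sketch. It assumes the loss takes the standard contrastive form, which is consistent with the definitions of d, N, y and margin given above; the feature vectors and the value margin = 1.0 are made up for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance d between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(pairs, margin=1.0):
    """pairs: list of (a_n, b_n, y) with y = 1 for a match and y = 0 otherwise."""
    total = 0.0
    for a, b, y in pairs:
        d = euclidean(a, b)
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * len(pairs))

# Matching pair that is far apart (d = 1.0): penalized.
loss_match_far = contrastive_loss([([0.0, 0.0], [0.6, 0.8], 1)])
# Non-matching pair that is close (d = 0.5): penalized.
loss_nonmatch_near = contrastive_loss([([0.0, 0.0], [0.3, 0.4], 0)])
# Non-matching pair beyond the margin (d = 5.0): no penalty.
loss_nonmatch_far = contrastive_loss([([0.0, 0.0], [3.0, 4.0], 0)])
```

Under these assumptions the three losses come out to about 0.5, 0.125 and 0, matching the two case analyses above: matched pairs are pulled together, and unmatched pairs are pushed apart only until they are separated by the margin.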
S44, constructing positive and negative samples for training the twin neural network model;
The construction of the positive and negative samples comprises the following steps:
S441, inputting a sample S, and finding another identical or similar sample S' in the input data set according to the sample features to form a value pair (S, S', 1), where 1 indicates a match;
S442, repeating the process of step S441 to generate the positive samples;
S443, inputting a sample S, and randomly selecting a sample S' from the complement of sample S according to the sample features to form a value pair (S, S', 0), where 0 indicates a mismatch;
S444, repeating the process of step S443 to generate the negative samples.
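Steps S441 to S444 can be sketched in Python as follows. The sketch assumes that ground-truth entity labels are available for the training data and uses them to decide which samples count as "same or similar"; the text instead refers to sample features, so this is a simplification. The record ids, entity labels and the fixed random seed are hypothetical.

```python
import random

def build_pairs(samples, entity_of, seed=0):
    """Return (positives, negatives) as lists of (S, S', label) triples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    positives, negatives = [], []
    for s in samples:
        # S441/S442: another sample of the same entity gives a pair labeled 1.
        same = [t for t in samples if t != s and entity_of[t] == entity_of[s]]
        # S443/S444: a random sample from the complement gives a pair labeled 0.
        other = [t for t in samples if entity_of[t] != entity_of[s]]
        if same:
            positives.append((s, rng.choice(same), 1))
        if other:
            negatives.append((s, rng.choice(other), 0))
    return positives, negatives

entity_of = {"r1": "e1", "r2": "e1", "r3": "e2"}
positives, negatives = build_pairs(list(entity_of), entity_of)
```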
S45, using the trained twin neural network model to predict on the test set to obtain the set of data records belonging to the same entity.
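The prediction in step S45 can be pictured with the following Python sketch, which scores candidate pairs by the cosine similarity of their feature vectors. The feature vectors here are hard-coded stand-ins for the output of the trained model, and the decision threshold of 0.9 is a hypothetical value; the text does not state how the similarity is turned into a match decision.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def link_records(candidate_pairs, embeddings, threshold=0.9):
    """Keep the candidate pairs whose similarity reaches the (assumed) threshold."""
    return [
        (i, j)
        for i, j in candidate_pairs
        if cosine_similarity(embeddings[i], embeddings[j]) >= threshold
    ]

# Stand-in feature vectors; in the method they would come from the trained network.
embeddings = {"A1": [1.0, 0.0], "B7": [0.9, 0.1], "B9": [0.0, 1.0]}
matches = link_records([("A1", "B7"), ("A1", "B9")], embeddings)
```

With these stand-in vectors only the pair (A1, B7) is linked, since the vectors for A1 and B9 are orthogonal.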
Based on embodiment 1, the present invention also provides a privacy-preserving data record integration system, including:
a data record structure creation module, used to create a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
a system authority setting module, used to set the management authority of the nodes of the data record integration system;
and a record linking module, which, based on the data record structure of the data record integration system, uses the management authority of the nodes to compare the similarity of the candidate record pairs and outputs the data record information belonging to the same entity.
Throughout the record linkage process, the data always remain in a data-isolated environment, which keeps the data of each information system in the system trustworthy during transmission.
Based on embodiment 1, the present invention further provides a computer-readable storage medium, which includes computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the privacy-preserving data record integration method according to any one of the above embodiments.
The invention applies deep learning to the privacy-preserving record linkage task, so that the technique can adapt to more scenarios; unlike existing machine learning methods, the twin neural network automatically extracts scenario features from the attribute representation of the scenario and builds a model, which improves linkage accuracy and the time efficiency of the whole process while keeping the integration process trustworthy. The encoded data record structure anonymizes the original data, which preserves the traceability of data records, keeps the integration process trustworthy, and prevents personal privacy data in the information system from being disclosed; this is necessary in industrial applications with strict requirements on data security and data traceability. To address heterogeneous data storage, the invention automatically extracts the features of scenario data and normalizes the data format, which greatly reduces the difficulty of data record integration. By linking the data in the data integration system, the invention can integrate data across different information systems while keeping the process trustworthy, which effectively supports cross-domain application fusion and expands the scope and depth of data integration.
The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.

Claims (9)

1. A privacy-preserving data record integration method, characterized by comprising the following steps:
S1, creating a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
S2, setting the privacy parameters of the data record integration system and the management authority of the nodes;
S3, based on the data record structure, encoding and blocking the data of the data record integration system by using the management authority and the privacy parameters of the nodes, so as to generate a plurality of blocks containing candidate record pairs;
and S4, comparing the similarity of the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity.
2. The privacy-preserving data record integration method as claimed in claim 1, wherein step S1 includes the following steps:
S11, preprocessing the data records in the data record structure, wherein the preprocessing comprises data cleaning and standardization, and the data records are corrected by, in sequence, filling in missing data, removing redundant values and unifying data formats;
S12, selecting quasi-identifiers: traversing the attribute fields and, according to the particularities of the scenario, selecting a plurality of concatenated attributes that can uniquely identify a real-world entity as the quasi-identifier attributes;
S13, record-level Bloom filter encoding: splitting the selected quasi-identifier attributes into n-grams, combining them through n Bloom filters, and mapping them into a record-level Bloom filter (RBF).
3. The privacy-preserving data record integration method as claimed in claim 2, wherein the management authority of the nodes in step S2 includes a data viewing and distribution authority and a data link authority;
a node with the data viewing and distribution authority is used by the data holder, while the data record system operates normally, to view the data in the information system and to distribute the encoded data to the link node;
and a node with the data link authority is used for linking the data of candidate record pairs in the same block when the data undergoes an integration operation.
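One possible way to model the two authorities of claim 3 is a small permission flag attached to each node. This is a minimal sketch only: the names `Permission`, `Node` and `can`, and the example node names, are hypothetical and do not come from the claims.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Permission(Flag):
    VIEW_AND_DISTRIBUTE = auto()  # data node: view local data, send encoded data
    LINK = auto()                 # link node: link candidate pairs within a block


@dataclass(frozen=True)
class Node:
    name: str
    permissions: Permission

    def can(self, permission: Permission) -> bool:
        """Return True if this node holds the requested authority."""
        return (self.permissions & permission) == permission


# Hypothetical nodes: one data node and one link node.
data_node = Node("hospital_a", Permission.VIEW_AND_DISTRIBUTE)
link_node = Node("linkage_unit", Permission.LINK)
```

Keeping the two authorities on separate nodes reflects the separation described in the claim: the data node can view and distribute data but does not perform linking, while the link node performs linking without holding the viewing authority.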
4. The privacy-preserving data record integration method as claimed in claim 1, wherein the step S3 includes the steps of:
screening candidate record pairs by blocking: performing deduplication screening on the data of the data record integration system to obtain candidate record pair data, selecting a public attribute other than the quasi-identifiers, hash-encoding it, and taking the result as the blocking key value; data records with the same blocking key value are assigned to the same block, yielding a set of blocks $B = \{b_1, b_2, \ldots, b_i\}$, where $b_i$ represents the i-th block.
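The blocking step of claim 4 can be sketched as follows, assuming SHA-256 as the hash encoding and a `city` field as the public attribute. Both choices, and all function names, are illustrative assumptions; the deduplication screening mentioned in the claim is not shown.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def blocking_key(value: str) -> str:
    """Hash a normalized public (non-quasi-identifier) attribute value."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


def build_blocks(records: list[dict], attribute: str) -> dict[str, list[dict]]:
    """Put records whose blocking key values are equal into the same block."""
    blocks: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        blocks[blocking_key(record[attribute])].append(record)
    return dict(blocks)


def candidate_pairs(blocks: dict[str, list[dict]]):
    """Yield every pair of records that share a block."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Hypothetical records; only records 1 and 2 share a blocking key.
records = [
    {"id": 1, "city": "Hangzhou"},
    {"id": 2, "city": "hangzhou "},
    {"id": 3, "city": "Shanghai"},
]
blocks = build_blocks(records, "city")
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
```

Only pairs inside the same block are compared later, so the number of comparisons drops from all record pairs to the pairs within each block.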
5. The privacy-preserving data record integration method as claimed in claim 3, wherein the node with the data viewing and distribution authority serves as a data node, the node with the data link authority serves as a link node, and the step S4 includes the following steps:
after the data of the data record integration system is sent by the data node to the link node, a twin (Siamese) neural network is used at the link node to compare the similarity of the candidate record pairs, and the linked results are integrated and output.
6. The privacy-preserving data record integration method as claimed in claim 5, wherein the step S4 includes the steps of:
S41, extracting features at the link node by using a bidirectional recurrent neural network (BiLSTM): for the input data $x_1, x_2$, the attribute data is first divided into batches according to the batch_size m by the embedding layer of the BiLSTM, and n × m dimensional conversion vectors are generated whose entries are random numbers in (-1, 1);
where n represents the number of Bloom filters and m represents the batch_size, i.e. the number of samples in one training pass.
S42, converting vectors: samples $x_i$ and $x_j$ pass through the embedding layer, and the generated vectors are denoted $y_i, y_j \in R^{n \times m}$; the resulting $R^{n \times m}$-dimensional feature vectors are used as new features and passed through a fully connected layer, which outputs the cosine similarity of samples $x_i$ and $x_j$; this result, used as the measure of similarity, is calculated by the following equation:
$$\cos(x_1, x_2) = \frac{\langle f_w(x_1), f_w(x_2) \rangle}{\|f_w(x_1)\| \, \|f_w(x_2)\|}$$
wherein $f_w(x_i)$ represents the mapping of input sample $x_i$ at the embedding layer, $\langle f_w(x_1), f_w(x_2) \rangle$ denotes the dot product of $f_w(x_1)$ and $f_w(x_2)$, and $\|f_w(x_i)\|$ is the norm of $f_w(x_i)$;
S43, calculating a loss function from the similarity obtained in step S42, continuously adjusting the parameters of the twin neural network model by gradient descent, and repeating step S41 and step S42 in sequence to optimize the performance of the twin neural network model;
for input samples $a_n$ and $b_n$, the loss function between the two is calculated as follows:
$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y\, d^2 + (1 - y) \max(\mathrm{margin} - d,\, 0)^2 \right]$$
wherein $d = \|a_n - b_n\|_2$ represents the Euclidean distance between the features of the two samples, N is the number of samples, y is a label marking whether the two samples match (y = 1 means the two samples are similar or matched, y = 0 means they do not match), and margin is a set threshold;
S44, constructing positive and negative samples for training the twin neural network model;
and S45, using the trained twin neural network model to make predictions on the test set, obtaining the set of data records belonging to the same entity.
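The similarity measure of step S42 and the loss of step S43 can be illustrated with plain Python on already-embedded feature vectors. This is a sketch under stated assumptions: the loss is written in the standard contrastive-loss form, which is consistent with the symbols d, N, y and margin defined in step S43 but is assumed rather than taken from the claim, and the BiLSTM embedding itself is not modeled. All function names are hypothetical.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dot product of u and v divided by the product of their norms.

    Assumes neither vector is all zeros (otherwise the norm product is 0).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u: list[float], v: list[float]) -> float:
    """Euclidean distance d between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(pairs, labels, margin: float = 1.0) -> float:
    """Assumed contrastive loss averaged over N labeled pairs.

    Matching pairs (y = 1) are penalized by d^2; non-matching pairs (y = 0)
    are penalized only when they are closer than the margin.
    """
    total = 0.0
    for (a, b), y in zip(pairs, labels):
        d = euclidean_distance(a, b)
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * len(labels))
```

For example, a matching pair at distance 5 contributes 25 to the sum, while a non-matching pair at the same distance contributes nothing when the margin is 1, so gradient descent pulls matching records together and pushes non-matching records apart only until they clear the margin.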
7. The privacy-preserving data record integration method as claimed in claim 6, wherein constructing positive and negative samples in step S44 includes the following steps:
S441, inputting a sample S and, according to the sample features, finding another identical or similar sample S' in the input data set to form a tuple (S, S', 1), wherein 1 represents a match;
S442, repeating the process of step S441 to generate the positive samples;
S443, inputting a sample S and, according to the sample features, randomly selecting a sample S' from the complement of sample S to form a tuple (S, S', 0), wherein 0 represents a mismatch;
S444, repeating the process of step S443 to generate the negative samples.
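Steps S441 to S444 can be sketched as below. The sketch assumes that a known entity label per training sample stands in for the "identical or similar" judgment, which the claim bases on sample features without specifying how; the function name, the fixed random seed and the sample data are likewise hypothetical.

```python
import random


def build_training_pairs(samples, entity_ids, seed=0):
    """Build (S, S', 1) positive and (S, S', 0) negative training tuples.

    entity_ids[i] is an assumed ground-truth entity label for samples[i].
    """
    rng = random.Random(seed)
    indices = range(len(samples))
    pairs = []
    for i, sample in enumerate(samples):
        same = [j for j in indices if j != i and entity_ids[j] == entity_ids[i]]
        other = [j for j in indices if entity_ids[j] != entity_ids[i]]
        if same:   # S441/S442: pair with a sample of the same entity
            pairs.append((sample, samples[rng.choice(same)], 1))
        if other:  # S443/S444: pair with a random sample from the complement
            pairs.append((sample, samples[rng.choice(other)], 0))
    return pairs


# Hypothetical encoded samples: a1 and a2 belong to entity A, b1 to entity B.
samples = ["a1", "a2", "b1"]
entity_ids = ["A", "A", "B"]
pairs = build_training_pairs(samples, entity_ids)
```

Each sample contributes at most one positive and one negative tuple here, which keeps the two classes roughly balanced; other sampling ratios are equally compatible with the claim.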
8. A privacy-preserving data record integration system, comprising:
a data record structure creation module, used for creating a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
a system authority setting module, used for setting the management authority of the nodes of the data record integration system;
and a record linking module, used for comparing, based on the data record structure of the data record integration system and using the management authority of the nodes, the similarity of the candidate record pairs and outputting the data record information belonging to the same entity.
9. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the privacy-preserving data record integration method of any one of claims 1-7.
CN202111383157.3A 2021-11-22 2021-11-22 Privacy preserving data record integration method, system and computer readable storage medium Pending CN113946871A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111383157.3A CN113946871A (en) 2021-11-22 2021-11-22 Privacy preserving data record integration method, system and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111383157.3A CN113946871A (en) 2021-11-22 2021-11-22 Privacy preserving data record integration method, system and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113946871A true CN113946871A (en) 2022-01-18

Family

ID=79338719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111383157.3A Pending CN113946871A (en) 2021-11-22 2021-11-22 Privacy preserving data record integration method, system and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113946871A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341023A (en) * 2023-05-24 2023-06-27 北京百度网讯科技有限公司 Block chain-based service address verification method, device, equipment and storage medium
CN116361859A (en) * 2023-06-02 2023-06-30 之江实验室 Cross-mechanism patient record linking method and system based on depth privacy encoder

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341023A (en) * 2023-05-24 2023-06-27 北京百度网讯科技有限公司 Block chain-based service address verification method, device, equipment and storage medium
CN116341023B (en) * 2023-05-24 2023-08-29 北京百度网讯科技有限公司 Block chain-based service address verification method, device, equipment and storage medium
CN116361859A (en) * 2023-06-02 2023-06-30 之江实验室 Cross-mechanism patient record linking method and system based on depth privacy encoder
CN116361859B (en) * 2023-06-02 2023-08-25 之江实验室 Cross-mechanism patient record linking method and system based on depth privacy encoder

Similar Documents

Publication Publication Date Title
US11727053B2 (en) Entity recognition from an image
US10552471B1 (en) Determining identities of multiple people in a digital image
US20240256704A1 (en) Efficient statistical techniques for detecting sensitive data
CN106203333A (en) Face identification method and system
CN113946871A (en) Privacy preserving data record integration method, system and computer readable storage medium
CN103136189A (en) Confidential information identifying method, information processing apparatus, and program
AU2019100349A4 (en) Face - Password Certification Based on Convolutional Neural Network
CN107046534A (en) A kind of network safety situation model training method, recognition methods and identifying device
CN111191041A (en) Characteristic data acquisition method, data storage method, device, equipment and medium
Dobbs et al. On art authentication and the Rijksmuseum challenge: A residual neural network approach
CN116933075A (en) Question-answering model training method, intelligent question-answering method and device in network security field
CN118278048A (en) Cloud computing-based data asset security monitoring system and method
CN113268506B (en) Query method and device of cache database, electronic equipment and readable storage medium
CN116881687B (en) Power grid sensitive data identification method and device based on feature extraction
CN113378723A (en) Automatic safety identification system for hidden danger of power transmission and transformation line based on depth residual error network
CN116824676A (en) Digital identity information generation method, application method, device, system and equipment
CN114998809B (en) ALBERT and multi-mode cyclic fusion-based false news detection method and system
Wurzenberger et al. Discovering insider threats from log data with high-performance bioinformatics tools
CN115599345A (en) Application security requirement analysis recommendation method based on knowledge graph
Liu et al. Subverting privacy-preserving gans: Hiding secrets in sanitized images
CN111061695B (en) File sharing method and system based on block chain
CN117197816B (en) User material identification method and system
CN116112264B (en) Method and device for controlling access to strategy hidden big data based on blockchain
CN117633753B (en) Operating system and method based on solid state disk array
CN116361859B (en) Cross-mechanism patient record linking method and system based on depth privacy encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination