CN113946871A - Privacy preserving data record integration method, system and computer readable storage medium - Google Patents
- Publication number
- CN113946871A (application number CN202111383157.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- record
- privacy
- data record
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6254—Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/604—Tools and structures for managing or administering access control systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6227—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/21—Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/2141—Access rights, e.g. capability lists, access control lists, access tables, access matrices
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioethics (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Computer Security & Cryptography (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Hardware Design (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Automation & Control Theory (AREA)
- Storage Device Security (AREA)
Abstract
The invention belongs to the technical field of privacy-preserving record linkage, and in particular relates to a privacy-preserving data record integration method, a system and a computer-readable storage medium. The method comprises the following steps: S1, creating a data record structure; S2, setting privacy parameters and the management authority of the nodes; S3, encoding and blocking the data of the data record integration system by using the management authority and the privacy parameters of the nodes, so as to generate a plurality of blocks containing candidate record pairs; and S4, performing similarity comparison on the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity. The invention keeps records traceable without affecting data integration across information systems, supports data integration tasks when the data of a service scenario or management scenario is confidential (or involves personal privacy), and facilitates the application of the data record integration system in multiple fields.
Description
Technical Field
The invention belongs to the technical field of privacy-preserving record linkage, and in particular relates to a privacy-preserving data record integration method, a system and a computer-readable storage medium.
Background
As demand grows and information systems are built at ever larger scale, these systems are usually operated independently, the data in their various application programs is difficult to share, and the data island phenomenon becomes increasingly obvious. To resolve this problem, it is necessary to design a privacy-preserving data record integration method, system and computer-readable storage medium that can effectively solve the data island problem of information systems.
For example, Chinese patent application No. CN201811069639.X describes a big data acquisition and transaction system based on a blockchain and a trusted computing platform, which includes an address verification module, a data acquisition module, a data uploading module, a data credibility verification module and a data reward payment module on a user chain. That system alleviates the shortage of data sources by fusing large-scale personal data islands; it supervises and protects data acquisition, storage, packaging and uploading so that the data is credible along the whole link; it protects user privacy by using direct anonymous attestation when the data acquisition company verifies the validity of public key addresses on the user chain; and it keeps data reward payments open and transparent. It thereby reconciles, to a certain degree, the conflict between personal privacy protection and big data acquisition, guarantees the credibility of the data source, and is practical and easy to implement. Its drawbacks, however, are that it cannot effectively support cross-field application fusion, that the application range and application strength of data integration are limited, and that the difficulty of data record integration is not effectively reduced.
Disclosure of Invention
In the prior art, information systems are usually operated independently, the data in their various application programs is difficult to share, and the data island phenomenon is increasingly obvious. To solve these problems, the invention provides a privacy-preserving data record integration method, a system and a computer-readable storage medium that can effectively solve the data island problem of information systems.
In order to achieve this purpose, the invention adopts the following technical solution:
The privacy-preserving data record integration method comprises the following steps:
S1, creating a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
S2, setting privacy parameters of the data record integration system and the management authority of the nodes;
S3, based on the data record structure, encoding and blocking the data of the data record integration system by using the management authority and privacy parameters of the nodes, so as to generate a plurality of blocks containing candidate record pairs;
S4, performing similarity comparison on the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity.
Preferably, step S1 includes the steps of:
S11, preprocessing the data records in the data record structure, wherein the preprocessing comprises data cleaning and standardization, and the data records are corrected by, in sequence, filling in missing data, removing redundant values and unifying data formats;
S12, selecting quasi-identifiers: traversing the attribute fields and, according to the particularity of the scenario, selecting a plurality of concatenated attributes capable of uniquely identifying a real-world entity as the quasi-identifier attributes;
S13, performing record-level Bloom filter encoding: splitting the selected quasi-identifier attributes into n-grams, connecting them through n Bloom filters, and mapping them into a record-level Bloom filter (RBF).
Preferably, the management authority of the nodes in step S2 includes a data viewing and distribution authority and a data linking authority;
a node with the data viewing and distribution authority is used by the data holder to view the data in its information system while the data record system operates normally, and to distribute the encoded data to the link node;
a node with the data linking authority is used to link the data of the candidate record pairs within the same block when the data integration operation is performed.
Preferably, step S3 includes the steps of:
screening candidate record pairs by blocking: performing de-duplication screening on the data of the data record integration system to obtain candidate record pair data, selecting a common attribute other than the quasi-identifier for hash encoding, and taking the result as the blocking key value; data records with the same blocking key value are placed in the same block, yielding a set of blocks $B = \{b_1, b_2, \ldots, b_i\}$, where $b_i$ represents the i-th block.
Preferably, a node having the data viewing and distribution authority is used as a data node, a node having the data linking authority is used as a link node, and step S4 includes the following step:
after the data nodes send the data of the data record integration system to the link node, a twin (Siamese) neural network is used at the link node to compare the similarity of the candidate record pairs, and the linked results are integrated and output.
Preferably, step S4 includes the steps of:
S41, extracting features at the link node by using a bidirectional recurrent neural network (BiLSTM): for the input data set $x_1, x_2$, the attribute data is first divided into m batches of batch_size samples by the embedding layer of the BiLSTM, and $n \times m$-dimensional transformation vectors are generated whose entries are random numbers in (-1, 1);
wherein n represents the number of Bloom filters, m represents the number of batches, and batch_size represents the number of samples used in one training pass;
S42, transforming the vectors: samples $x_i$ and $x_j$ pass through the embedding layer, the generated vectors are denoted $y_i, y_j \in R^{n \times m}$, and the resulting $R^{n \times m}$-dimensional feature vectors are used as new features and passed through a fully connected layer to output the cosine similarity of samples $x_i$ and $x_j$; the result, which serves as the similarity measure, is calculated by the following equation:
$$\cos\big(f_w(x_1), f_w(x_2)\big) = \frac{\langle f_w(x_1), f_w(x_2) \rangle}{\|f_w(x_1)\| \, \|f_w(x_2)\|}$$
wherein $f_w(x_i)$ represents the mapping of input sample $x_i$ at the embedding layer, $\langle f_w(x_1), f_w(x_2) \rangle$ denotes the dot product of $f_w(x_1)$ and $f_w(x_2)$, and $\|f_w(x_i)\|$ is the norm (magnitude) of $f_w(x_i)$;
S43, calculating a loss function from the similarity obtained in step S42, continuously adjusting the parameters of the twin neural network model by gradient descent, and repeating steps S41 and S42 in sequence to optimize the twin neural network model;
for input samples $a_n$ and $b_n$, the loss function between the two is calculated as follows:
$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y\, d^2 + (1 - y)\, \max(\text{margin} - d,\ 0)^2 \right]$$
wherein $d = \|a_n - b_n\|_2$ represents the Euclidean distance between the features of the two samples, N is the number of samples, y is a label marking whether the two samples match (y = 1 means the two samples are similar or matched, y = 0 means they do not match), and margin is a set threshold;
S44, constructing positive and negative samples for training the twin neural network model;
S45, using the trained twin neural network model to predict on the test set, so as to obtain the set of data records belonging to the same entity.
Preferably, constructing the positive and negative samples in step S44 includes the following steps:
S441, inputting a sample S and, according to the sample features, finding another identical or similar sample S' in the input data set to form a value pair (S, S', 1), wherein 1 represents a match;
S442, repeating step S441 to generate the positive samples;
S443, inputting a sample S and, according to the sample features, randomly selecting a sample S' from the complement of sample S to form a value pair (S, S', 0), wherein 0 represents a mismatch;
S444, repeating step S443 to generate the negative samples.
The invention also provides a privacy-preserving data record integration system, which comprises:
a data record structure creation module, used to create the data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
a system authority setting module, used to set the management authority of the nodes of the data record integration system;
a record linking module, used to compare the similarity of the candidate record pairs, based on the data record structure of the data record integration system and by using the management authority of the nodes, and to output the data record information belonging to the same entity.
The present invention also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the privacy-preserving data record integration method of any one of the above embodiments.
Compared with the prior art, the invention has the following beneficial effects: (1) the invention applies deep learning to the privacy-preserving record linkage task, so that the technique can adapt to more scenarios; unlike existing machine learning methods, the twin neural network automatically extracts scenario features according to how the attributes of the scenario are represented and constructs a model, which improves linkage accuracy and the time efficiency of the whole process while keeping the integration process credible; (2) the encoded data record structure anonymizes the original data, so that data records remain traceable, the integration process remains credible, and personal private data in the information system is not revealed, which is necessary in industrial application engineering with high requirements on data security and data traceability; (3) to address heterogeneous data storage, the invention automatically extracts the features of the scenario data and normalizes the data format, which greatly reduces the difficulty of data record integration; (4) by linking the data in the data integration system, the invention can integrate data across different information systems while keeping the process credible, thereby effectively supporting cross-field application fusion and expanding the application range and application strength of data integration.
Drawings
FIG. 1 is a flow chart of a privacy preserving data record integration method according to an embodiment of the present invention;
FIG. 2 is a diagram of a data structure for record level bloom filter encoded records according to an embodiment of the present invention;
FIG. 3 is a block diagram of an architecture of an integrated system for privacy preserving data records according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the similarity comparison process according to an embodiment of the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
Example 1:
As shown in FIG. 1, the privacy-preserving data record integration method includes the following steps:
S1, creating a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
Step S1 specifically includes the following steps:
S11, preprocessing the data records in the data record structure, wherein the preprocessing comprises data cleaning and standardization, and the data records are corrected by, in sequence, filling in missing data, removing redundant values and unifying data formats;
S12, selecting quasi-identifiers: traversing the attribute fields and, according to the particularity of the scenario, selecting a plurality of concatenated attributes capable of uniquely identifying a real-world entity as the quasi-identifier attributes;
S13, performing record-level Bloom filter encoding: splitting the selected quasi-identifier attributes into n-grams, connecting them through n Bloom filters, and mapping them into a record-level Bloom filter (RBF).
Specifically, the first name and the surname are selected as the quasi-identifier attributes, and Bloom filter encoding is performed on them to form the quasi-identifier record data structure. For example, the first name John and the surname Smith are split into 2-grams (jo, oh, hn, Sm, mi, it, th), which are connected through n Bloom filters and mapped into a record-level Bloom filter (RBF), as shown in FIG. 2.
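The encoding in step S13 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the filter length of 64 bits, the two salted SHA-256 hash functions, the lower-casing of input, and the choice of one Bloom filter per quasi-identifier attribute concatenated into the RBF are all assumptions made for illustration.

```python
import hashlib

def ngrams(text, n=2):
    """Split a string into overlapping character n-grams (2-grams by default)."""
    text = text.lower()  # assumption: case is normalized before encoding
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def bloom_encode(grams, size=64, num_hashes=2):
    """Map n-grams into a fixed-length bit list using salted SHA-256 hashes."""
    bits = [0] * size
    for gram in grams:
        for k in range(num_hashes):
            digest = hashlib.sha256(f"{k}:{gram}".encode("utf-8")).hexdigest()
            bits[int(digest, 16) % size] = 1
    return bits

def record_level_bloom_filter(attributes, size=64, num_hashes=2):
    """Concatenate one Bloom filter per quasi-identifier attribute (assumed layout)."""
    rbf = []
    for value in attributes:
        rbf.extend(bloom_encode(ngrams(value), size, num_hashes))
    return rbf

rbf = record_level_bloom_filter(["John", "Smith"])
print(ngrams("John"), ngrams("Smith"))
print(len(rbf), sum(rbf))
```

Because only the bit positions leave the data node, the link node can compare encoded records without seeing the original names, while names that share many 2-grams still set many of the same bits.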
S2, setting privacy parameters of the data record integration system and the management authority of the nodes;
The management authority of the nodes in step S2 includes a data viewing and distribution authority and a data linking authority;
a node with the data viewing and distribution authority is used by the data holder to view the data in its information system while the data record system operates normally, and to distribute the encoded data to the link node;
a node with the data linking authority is used to link the data of the candidate record pairs within the same block when the data integration operation is performed.
A node with the data viewing and distribution authority is taken as a data node, and a node with the data linking authority is taken as a link node. In the present invention, each data node represents an information system.
As shown in FIG. 3, in a data isolation environment, data nodes 1 to n have the data viewing and distribution authority and the link node has the data linking authority.
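One way to picture the two kinds of management authority is as permission flags attached to nodes. The following Python sketch is purely illustrative: the class names `Authority` and `Node`, the `distribute` helper and the node names are hypothetical and do not come from the patent text.

```python
from enum import Flag, auto

class Authority(Flag):
    VIEW_DISTRIBUTE = auto()  # view local data and distribute encoded data
    LINK = auto()             # link candidate record pairs within a block

class Node:
    def __init__(self, name, authority):
        self.name = name
        self.authority = authority

    def can(self, authority):
        """Return True if this node holds the given authority."""
        return bool(self.authority & authority)

def distribute(sender, receiver, encoded_records):
    """Hand encoded records from a data node to a link node, checking authority."""
    if not sender.can(Authority.VIEW_DISTRIBUTE):
        raise PermissionError(f"{sender.name} may not distribute data")
    if not receiver.can(Authority.LINK):
        raise PermissionError(f"{receiver.name} may not link data")
    return list(encoded_records)

# Data nodes 1..n each stand for one information system; one node does the linking.
data_nodes = [Node(f"data_node_{i}", Authority.VIEW_DISTRIBUTE) for i in range(1, 4)]
link_node = Node("link_node", Authority.LINK)
received = distribute(data_nodes[0], link_node, [[1, 0, 1]])
print(received)
```

Keeping the check at the hand-over point means a link node never gains the right to view raw data, which matches the separation of roles described above.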
S3, based on the data record structure, encoding and blocking the data of the data record integration system by using the management authority and privacy parameters of the nodes, so as to generate a plurality of blocks containing candidate record pairs;
Step S3 specifically includes the following steps:
screening candidate record pairs by blocking: performing de-duplication screening on the data of the data record integration system to obtain candidate record pair data, selecting a common attribute other than the quasi-identifier for hash encoding, and taking the result as the blocking key value; data records with the same blocking key value are placed in the same block, yielding a set of blocks $B = \{b_1, b_2, \ldots, b_i\}$, where $b_i$ represents the i-th block.
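The blocking step can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the attribute name `city`, the record layout, the use of SHA-256 as the hash, the whitespace and case normalization, and the choice to pair every two records inside a block are hypothetical and are not specified by the source.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def blocking_key(value):
    """Hash a normalized attribute value to obtain the blocking key value."""
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()

def build_blocks(records, attribute):
    """Group records that share the same blocking key value into one block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record[attribute])].append(record)
    return dict(blocks)

def candidate_pairs(blocks):
    """Only records inside the same block are paired for later comparison."""
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs

records = [
    {"id": "a1", "city": "Hangzhou"},
    {"id": "b7", "city": "hangzhou "},
    {"id": "c3", "city": "Ningbo"},
]
blocks = build_blocks(records, "city")
pairs = candidate_pairs(blocks)
print([(p["id"], q["id"]) for p, q in pairs])
```

Blocking reduces the number of comparisons: instead of comparing all three records with each other, only the two records that fall into the same block form a candidate pair.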
S4, performing similarity comparison on the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity.
Step S4 includes the following step:
after the data nodes send the data of the data record integration system to the link node, a twin (Siamese) neural network is used at the link node to compare the similarity of the candidate record pairs, and the linked results are integrated and output.
As shown in FIG. 4, step S4 specifically includes the following steps:
S41, extracting features at the link node by using a bidirectional recurrent neural network (BiLSTM): for the input data set $x_1, x_2$, the attribute data is first divided into m batches of batch_size samples by the embedding layer of the BiLSTM, and $n \times m$-dimensional transformation vectors are generated whose entries are random numbers in (-1, 1);
wherein n represents the number of Bloom filters, m represents the number of batches, and batch_size represents the number of samples used in one training pass;
S42, transforming the vectors: samples $x_i$ and $x_j$ pass through the embedding layer, the generated vectors are denoted $y_i, y_j \in R^{n \times m}$, and the resulting $R^{n \times m}$-dimensional feature vectors are used as new features and passed through a fully connected layer to output the cosine similarity of samples $x_i$ and $x_j$; the result, which serves as the similarity measure, is calculated by the following equation:
$$\cos\big(f_w(x_1), f_w(x_2)\big) = \frac{\langle f_w(x_1), f_w(x_2) \rangle}{\|f_w(x_1)\| \, \|f_w(x_2)\|}$$
wherein $f_w(x_i)$ represents the mapping of input sample $x_i$ at the embedding layer, $\langle f_w(x_1), f_w(x_2) \rangle$ denotes the dot product of $f_w(x_1)$ and $f_w(x_2)$, and $\|f_w(x_i)\|$ is the norm (magnitude) of $f_w(x_i)$;
S43, calculating a loss function from the similarity obtained in step S42, continuously adjusting the parameters of the twin neural network model by gradient descent, and repeating steps S41 and S42 in sequence to optimize the twin neural network model;
for input samples $a_n$ and $b_n$, the loss function between the two is calculated as follows:
$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y\, d^2 + (1 - y)\, \max(\text{margin} - d,\ 0)^2 \right]$$
wherein $d = \|a_n - b_n\|_2$ represents the Euclidean distance between the features of the two samples, N is the number of samples, y is a label marking whether the two samples match (y = 1 means the two samples are similar or matched, y = 0 means they do not match), and margin is a set threshold;
For the label y in L: when y = 1, only the term $\frac{1}{2N} \sum_{n=1}^{N} d^2$ remains in the loss function; that is, if samples that are actually similar have a large Euclidean distance in the feature space, the model needs to be adjusted. When y = 0, the loss function is $\frac{1}{2N} \sum_{n=1}^{N} \max(\text{margin} - d,\ 0)^2$; that is, when the samples are not similar, the smaller their Euclidean distance in the feature space, the larger the loss value.
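The similarity measure of step S42 and the loss of step S43 can be checked numerically with a small Python sketch. It assumes the widely used contrastive loss form, $L = \frac{1}{2N}\sum [y d^2 + (1 - y)\max(\text{margin} - d, 0)^2]$, which is consistent with the symbol definitions given above; the function names and the toy feature vectors are hypothetical, and the feature vectors stand in for the outputs of the twin network rather than being produced by a trained BiLSTM.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """d = ||u - v||_2 between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(pairs, labels, margin=1.0):
    """Assumed contrastive loss: y=1 penalizes distance, y=0 penalizes closeness."""
    total = 0.0
    for (a, b), y in zip(pairs, labels):
        d = euclidean_distance(a, b)
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * len(pairs))

# Toy feature vectors: the first pair is identical, the second is orthogonal.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
good = contrastive_loss(pairs, labels=[1, 0], margin=2.0)
bad = contrastive_loss(pairs, labels=[0, 1], margin=2.0)
print(cosine_similarity(*pairs[0]), cosine_similarity(*pairs[1]))
print(good, bad)
```

With correct labels the loss is small, because the matching pair has zero distance and the non-matching pair is already far apart; swapping the labels makes both terms contribute and the loss grows, which is the signal gradient descent uses to adjust the model.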
S44, constructing positive and negative samples for training the twin neural network model;
Constructing the positive and negative samples includes the following steps:
S441, inputting a sample S and, according to the sample features, finding another identical or similar sample S' in the input data set to form a value pair (S, S', 1), wherein 1 represents a match;
S442, repeating step S441 to generate the positive samples;
S443, inputting a sample S and, according to the sample features, randomly selecting a sample S' from the complement of sample S to form a value pair (S, S', 0), wherein 0 represents a mismatch;
S444, repeating step S443 to generate the negative samples.
S45, using the trained twin neural network model to predict on the test set, so as to obtain the set of data records belonging to the same entity.
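The pair construction of steps S441 to S444 can be sketched in Python. The sketch makes one simplifying assumption that the source does not state: "same or similar" is decided by a known entity label for each training record (the hypothetical `entity_of` mapping), and the complement of a sample is taken to be all records of other entities. The function name, record identifiers and fixed random seed are likewise illustrative.

```python
import random

def build_pairs(samples, entity_of, seed=0):
    """Return (positives, negatives) as lists of (S, S', label) triples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    positives, negatives = [], []
    for s in samples:
        same = [t for t in samples if t != s and entity_of[t] == entity_of[s]]
        other = [t for t in samples if entity_of[t] != entity_of[s]]
        if same:   # S441/S442: pair S with a matching sample, label 1
            positives.append((s, rng.choice(same), 1))
        if other:  # S443/S444: pair S with a random non-matching sample, label 0
            negatives.append((s, rng.choice(other), 0))
    return positives, negatives

entity_of = {"r1": "e1", "r2": "e1", "r3": "e2", "r4": "e2"}
positives, negatives = build_pairs(list(entity_of), entity_of)
print(positives)
print(negatives)
```

Each record contributes at most one positive and one negative pair here, which keeps the training set balanced; other sampling ratios are equally possible.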
Based on embodiment 1, the present invention also provides a privacy-preserving data record integration system, including:
a data record structure creation module, used to create the data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
a system authority setting module, used to set the management authority of the nodes of the data record integration system;
a record linking module, used to compare the similarity of the candidate record pairs, based on the data record structure of the data record integration system and by using the management authority of the nodes, and to output the data record information belonging to the same entity.
Throughout the recording process, the data always remains in a data isolation environment, which guarantees that the data of each information system in the system remains credible during transmission.
Based on embodiment 1, the present invention further provides a computer-readable storage medium, which includes computer-executable instructions; when the computer-executable instructions are executed by one or more processors, the processors execute the steps of the privacy-preserving data record integration method according to any one of the above embodiments.
The invention applies deep learning to the privacy-preserving record linkage task, so that the technique can adapt to more scenarios; unlike existing machine learning methods, the twin neural network automatically extracts scenario features according to how the attributes of the scenario are represented and constructs a model, which improves linkage accuracy and the time efficiency of the whole process while keeping the integration process credible; the encoded data record structure anonymizes the original data, so that data records remain traceable, the integration process remains credible, and personal private data in the information system is not revealed, which is necessary in industrial application engineering with high requirements on data security and data traceability; to address heterogeneous data storage, the invention automatically extracts the features of the scenario data and normalizes the data format, which greatly reduces the difficulty of data record integration; by linking the data in the data integration system, the invention can integrate data across different information systems while keeping the process credible, thereby effectively supporting cross-field application fusion and expanding the application range and application strength of data integration.
The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.
Claims (9)
1. A privacy-preserving data record integration method, characterized by comprising the following steps:
S1, creating a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
S2, setting privacy parameters of the data record integration system and the management authority of the nodes;
S3, based on the data record structure, encoding and blocking the data of the data record integration system by using the management authority and privacy parameters of the nodes, so as to generate a plurality of blocks containing candidate record pairs;
S4, performing similarity comparison on the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity.
2. The privacy-preserving data record integration method as claimed in claim 1, wherein step S1 includes the steps of:
S11, preprocessing the data records in the data record structure, wherein the preprocessing comprises data cleaning and standardization, and the data records are corrected by, in sequence, filling in missing data, removing redundant values and unifying data formats;
S12, selecting quasi-identifiers: traversing the attribute fields and, according to the particularity of the scenario, selecting a plurality of concatenated attributes capable of uniquely identifying a real-world entity as the quasi-identifier attributes;
S13, performing record-level Bloom filter encoding: splitting the selected quasi-identifier attributes into n-grams, connecting them through n Bloom filters, and mapping them into a record-level Bloom filter (RBF).
3. The privacy-preserving data record integration method as claimed in claim 2, wherein the management authority of the nodes in step S2 includes a data viewing and distribution authority and a data linking authority;
a node with the data viewing and distribution authority is used by the data holder to view the data in its information system while the data record system operates normally, and to distribute the encoded data to the link node;
a node with the data linking authority is used to link the data of the candidate record pairs within the same block when the data integration operation is performed.
4. The privacy-preserving data record integration method as claimed in claim 1, wherein step S3 includes the steps of:
screening candidate record pairs by blocking: performing de-duplication screening on the data of the data record integration system to obtain candidate record pair data, selecting a common attribute other than the quasi-identifier for hash encoding, and taking the result as the blocking key value; data records with the same blocking key value are placed in the same block, yielding a set of blocks $B = \{b_1, b_2, \ldots, b_i\}$, where $b_i$ represents the i-th block.
5. The privacy-preserving data record integration method as claimed in claim 3, wherein a node with the data viewing and distribution authority is used as a data node, a node with the data linking authority is used as a link node, and step S4 includes the following step:
after the data nodes send the data of the data record integration system to the link node, a twin (Siamese) neural network is used at the link node to compare the similarity of the candidate record pairs, and the linked results are integrated and output.
6. The privacy-preserving data record integration method as claimed in claim 5, wherein the step S4 includes the steps of:
S41, extracting features at the link node using a bidirectional recurrent neural network (BiLSTM): for the input data set $x_1, x_2$, first dividing the attribute data into m batches (batch_size) through the embedding layer of the BiLSTM, and generating n×m-dimensional conversion vectors whose entries are random numbers in (-1, 1);
where n denotes the number of Bloom filters and m denotes the batch_size, that is, the number of samples in one training pass.
S42, converting vectors: samples $x_i$ and $x_j$ pass through the embedding layer, and the generated vectors are denoted $y_i, y_j \in R^{n \times m}$; the resulting $R^{n \times m}$-dimensional feature vectors are used as new features and passed through a fully connected layer to output the cosine similarity of samples $x_i$ and $x_j$; the result, used as the measure of similarity, is calculated by the following equation:
$$\cos\bigl(f_w(x_1), f_w(x_2)\bigr) = \frac{\langle f_w(x_1), f_w(x_2) \rangle}{\lVert f_w(x_1) \rVert \, \lVert f_w(x_2) \rVert}$$
where $f_w(x_i)$ denotes the mapping of input sample $x_i$ at the embedding layer, $\langle f_w(x_1), f_w(x_2) \rangle$ denotes the dot product of $f_w(x_1)$ and $f_w(x_2)$, and $\lVert f_w(x_i) \rVert$ is the norm of $f_w(x_i)$;
S43, calculating a loss function from the similarity obtained in step S42, continuously adjusting the parameters of the twin neural network model by gradient descent, and repeating steps S41 and S42 in turn to optimize the twin neural network model;
for input samples $a_n$ and $b_n$, the loss function between the two is calculated as follows:
$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y\, d^2 + (1 - y) \max(\mathrm{margin} - d,\ 0)^2 \right]$$
where $d = \lVert a_n - b_n \rVert_2$ denotes the Euclidean distance between the features of the two samples, N is the number of samples, y is a label marking whether the two samples match (y = 1 indicates that the two samples are similar or matched, y = 0 indicates that they do not match), and margin is a set threshold;
S44, constructing positive and negative samples for training the twin neural network model;
and S45, using the trained twin neural network model to predict on the test set to obtain the set of data records belonging to the same entity.
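The similarity and loss computations of steps S42 and S43 can be sketched in plain Python. The sketch assumes the standard contrastive loss, L = 1/(2N) Σ [y·d² + (1 − y)·max(margin − d, 0)²], which is consistent with the symbols d, N, y and margin defined in the claim. The function names are hypothetical, and the embedding vectors are taken as given rather than produced by a BiLSTM.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean_distance(u, v):
    """d = ||u - v||_2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(pairs, margin=1.0):
    """Average contrastive loss over (a_n, b_n, y) triples.

    Assumption: this is the standard contrastive loss. Matching pairs (y = 1)
    are penalized by their squared distance; non-matching pairs (y = 0) are
    penalized only when they are closer than the margin.
    """
    total = 0.0
    for a, b, y in pairs:
        d = euclidean_distance(a, b)
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * len(pairs))


# Hypothetical embedded pairs: one matching, one non-matching.
pairs = [
    ([0.9, 0.1], [0.8, 0.2], 1),
    ([0.9, 0.1], [0.1, 0.9], 0),
]
print(cosine_similarity(pairs[0][0], pairs[0][1]))
print(contrastive_loss(pairs, margin=1.0))
```

In a training loop, gradient descent would lower this loss by pulling the embeddings of matching records together and pushing non-matching ones at least `margin` apart.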
7. The privacy-preserving data record integration method as claimed in claim 6, wherein constructing the positive and negative samples in step S44 includes the following steps:
S441, inputting a sample S, and finding another identical or similar sample S' in the input data set according to the sample features to form a tuple (S, S', 1), where 1 denotes a match;
S442, repeating step S441 to generate the positive samples;
S443, inputting a sample S, and randomly selecting a sample S' from the complement of sample S according to the sample features to form a tuple (S, S', 0), where 0 denotes a mismatch;
S444, repeating step S443 to generate the negative samples.
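Steps S441 to S444 can be sketched as follows. The sketch assumes that a predicate `is_match` is available to decide whether two samples are the same or similar (here, a shared ground-truth entity label); that predicate, the function name `build_pairs` and the number of negatives drawn per sample are all hypothetical.

```python
import random


def build_pairs(samples, is_match, negatives_per_sample=1, seed=0):
    """Build (S, S', 1) positive and (S, S', 0) negative training tuples."""
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    pairs = []
    for i, s in enumerate(samples):
        others = [t for j, t in enumerate(samples) if j != i]
        matches = [t for t in others if is_match(s, t)]
        complement = [t for t in others if not is_match(s, t)]
        # S441/S442: every identical or similar sample yields a positive tuple.
        for t in matches:
            pairs.append((s, t, 1))
        # S443/S444: negatives are drawn at random from the complement of S.
        if complement:
            for _ in range(negatives_per_sample):
                pairs.append((s, rng.choice(complement), 0))
    return pairs


# Hypothetical samples: (encoded record, entity label).
samples = [("rec_a", 1), ("rec_a_variant", 1), ("rec_b", 2)]
same_entity = lambda s, t: s[1] == t[1]
for s, t, label in build_pairs(samples, same_entity):
    print(s[0], t[0], label)
```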
8. A privacy-preserving data record integration system, comprising:
a data record structure creation module for creating a data record structure of a data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
a system authority setting module for setting the management authority of the nodes of the data record integration system;
and a record linking module for comparing, based on the data record structure of the data record integration system and using the management authority of the nodes, the similarity of the candidate record pairs, and outputting the data record information belonging to the same entity.
9. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the privacy-preserving data record integration method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111383157.3A CN113946871A (en) | 2021-11-22 | 2021-11-22 | Privacy preserving data record integration method, system and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111383157.3A CN113946871A (en) | 2021-11-22 | 2021-11-22 | Privacy preserving data record integration method, system and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113946871A (en) | 2022-01-18 |
Family
ID=79338719
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111383157.3A Pending CN113946871A (en) | 2021-11-22 | 2021-11-22 | Privacy preserving data record integration method, system and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113946871A (en) |
-
2021
- 2021-11-22 CN CN202111383157.3A patent/CN113946871A/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116341023A (en) * | 2023-05-24 | 2023-06-27 | 北京百度网讯科技有限公司 | Block chain-based service address verification method, device, equipment and storage medium |
CN116341023B (en) * | 2023-05-24 | 2023-08-29 | 北京百度网讯科技有限公司 | Block chain-based service address verification method, device, equipment and storage medium |
CN116361859A (en) * | 2023-06-02 | 2023-06-30 | 之江实验室 | Cross-mechanism patient record linking method and system based on depth privacy encoder |
CN116361859B (en) * | 2023-06-02 | 2023-08-25 | 之江实验室 | Cross-mechanism patient record linking method and system based on depth privacy encoder |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||