CN113946871A - Privacy preserving data record integration method, system and computer readable storage medium - Google Patents

Privacy preserving data record integration method, system and computer readable storage medium

Info

Publication number
CN113946871A
Authority
CN
China
Prior art keywords
data
record
data record
node
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111383157.3A
Other languages
Chinese (zh)
Inventor
袁理锋
姚思雨
殷为峰
李成煜
任一支
王冬
王烨茹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202111383157.3A
Publication of CN113946871A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/604 Tools and structures for managing or administering access control systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2141 Access rights, e.g. capability lists, access control lists, access tables, access matrices

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioethics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Automation & Control Theory (AREA)
  • Storage Device Security (AREA)

Abstract

The invention belongs to the technical field of privacy-preserving record linkage, and in particular relates to a privacy-preserving data record integration method, a system and a computer-readable storage medium. The method includes the following steps: S1, creating a data record structure; S2, setting privacy parameters and the management authority of nodes; S3, using the management authority of the nodes and the privacy parameters to encode and block the data of the data record integration system, so as to generate a plurality of blocks containing candidate record pairs; S4, comparing the similarity of the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity. The invention keeps records traceable without affecting data integration across information systems, supports data integration tasks in business and management scenarios where the data is confidential (or involves personal privacy), and facilitates multi-domain application of the data record integration system.

Figure 202111383157

Description

Privacy preserving data record integration method, system and computer readable storage medium
Technical Field
The invention belongs to the technical field of privacy protection record linkage, and particularly relates to a privacy protection data record integration method, a system and a computer readable storage medium.
Background
With the growing demand for information systems and their growing scale, these systems are usually operated independently, the data in different applications is difficult to share, and the data island phenomenon is becoming increasingly obvious. To resolve this problem, it is necessary to design a privacy-preserving data record integration method, system and computer-readable storage medium that can effectively solve the data island problem of information systems.
For example, Chinese patent application No. CN201811069639.X describes a big data acquisition and transaction system based on a blockchain and a trusted computing platform, which includes an address verification module, a data acquisition module, a data uploading module, a data credibility verification module and a data reward payment module on a user chain. That system alleviates the shortage of data sources by fusing large-scale personal data islands; it supervises and protects data acquisition, storage, packaging and uploading so that the data is credible along the whole link; it protects user privacy by using direct anonymous attestation when the data acquisition company verifies the validity of public key addresses on the user chain; and it keeps data reward payment open and transparent. It thereby reconciles, to a certain degree, the conflict between personal privacy protection and big data acquisition, guarantees the credibility of the data source, and is practical and easy to implement. Its drawbacks are that it cannot effectively support cross-domain application fusion, that it limits the scope and strength of data integration, and that it does not effectively reduce the difficulty of data record integration.
Disclosure of Invention
To address the problems in the prior art that information systems are usually operated independently, that data in different applications is difficult to share, and that the data island phenomenon is increasingly obvious, the invention provides a privacy-preserving data record integration method, system and computer-readable storage medium that can effectively solve the data island problem of information systems.
In order to achieve the purpose, the invention adopts the following technical scheme:
The privacy-preserving data record integration method comprises the following steps:
S1, creating a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
S2, setting privacy parameters of the data record integration system and the management authority of the nodes;
S3, based on the data record structure, using the management authority of the nodes and the privacy parameters to encode and block the data of the data record integration system, so as to generate a plurality of blocks containing candidate record pairs;
S4, performing similarity comparison on the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity.
Preferably, step S1 includes the steps of:
S11, preprocessing the data records in the data record structure, wherein the preprocessing comprises data cleaning and standardization, and the data records are corrected by, in sequence, filling missing data, removing redundant values and unifying data formats;
S12, selecting quasi-identifiers: traversing the attribute fields and, according to the particularity of the scenario, selecting a plurality of concatenated attributes that can uniquely identify a real-world entity as the quasi-identifier attributes;
S13, record-level Bloom filter encoding: splitting the selected quasi-identifier attributes into n-grams, and connecting them through n Bloom filters to map them into a record-level Bloom filter (RBF).
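The preprocessing of step S11 can be illustrated with a minimal Python sketch. The function name `clean_records`, the empty-string fill value and the lowercase/whitespace normalization are assumptions made for illustration; the source does not specify how missing values are filled or which format is the unified one.

```python
from typing import Optional


def clean_records(records: list[dict[str, Optional[str]]],
                  fill: str = "") -> list[dict[str, str]]:
    """Fill missing values, unify the text format, and drop redundant records."""
    seen = set()
    cleaned = []
    for rec in records:
        # Fill missing data (None), then unify the format:
        # trim, collapse internal whitespace, lowercase.
        norm = {
            key: " ".join((value if value is not None else fill).split()).lower()
            for key, value in rec.items()
        }
        # Remove redundant values: keep only the first copy of each record.
        fingerprint = tuple(sorted(norm.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(norm)
    return cleaned


rows = [
    {"first": " John ", "last": "SMITH"},
    {"first": "john", "last": "Smith"},
    {"first": None, "last": "Lee"},
]
print(clean_records(rows))
```

In this sketch the format is unified before duplicates are dropped, so that differently formatted copies of the same record collapse into one; the text lists format unification as the last of the three steps, so this ordering is a design choice of the sketch.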
Preferably, the management authority of the nodes in step S2 includes a data viewing and distribution authority and a data linking authority;
a node with the data viewing and distribution authority is used by the data holder, while the data record system is operating normally, to view the data in the information system and to distribute the encoded data to the link node;
a node with the data linking authority is used to link the data of the candidate record pairs in the same block when a data integration operation takes place.
Preferably, step S3 includes the steps of:
screening candidate record pairs by blocking: the data of the data record integration system is deduplicated to obtain candidate record pair data, a common attribute other than the quasi-identifier is selected and hash-encoded, and the result is used as the blocking key; data records with the same blocking key are placed in the same block, giving a set of blocks $B = \{b_1, b_2, \ldots, b_i\}$, where $b_i$ denotes the i-th block.
Preferably, taking a node with the data viewing and distribution authority as a data node and a node with the data linking authority as a link node, step S4 includes the following step:
when the data of the data record integration system is sent by the data node to the link node, a twin (Siamese) neural network is used at the link node to compare the similarity of the candidate record pairs, and the linked results are integrated and output.
Preferably, step S4 includes the steps of:
S41, extracting features at the link node with a bidirectional recurrent neural network (BiLSTM): for the input data sets $x_1, x_2$, the attribute data first passes through the embedding layer of the BiLSTM, where it is split into m batches of size batch_size, generating n*m-dimensional transformation vectors whose entries are random numbers in (-1, 1);
wherein n represents the number of Bloom filters, m represents the number of batches of size batch_size, and batch_size represents the number of samples in one training step;
S42, transforming the vectors: samples $x_i$ and $x_j$ pass through the embedding layer, producing the vector representations $y_i, y_j \in R^{n*m}$; the generated n*m-dimensional feature vectors are used as new features and passed through a fully connected layer to output the cosine similarity of samples $x_i$ and $x_j$; the result, used as the measure of similarity, is calculated by the following equation:
$$\cos\big(f_w(x_1), f_w(x_2)\big) = \frac{\langle f_w(x_1), f_w(x_2) \rangle}{\|f_w(x_1)\| \, \|f_w(x_2)\|}$$
wherein $f_w(x_i)$ represents the mapping of input sample $x_i$ at the embedding layer, $\langle f_w(x_1), f_w(x_2) \rangle$ denotes the dot product of $f_w(x_1)$ and $f_w(x_2)$, and $\|f_w(x_i)\|$ is the norm of $f_w(x_i)$;
S43, calculating a loss function from the similarity obtained in step S42, continuously adjusting the parameters of the twin neural network model by gradient descent, and repeating step S41 and step S42 in sequence to optimize the twin neural network model;
for input samples $a_n$ and $b_n$, the loss function between the two is calculated as follows:
$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y \, d^2 + (1 - y) \, \max(\mathrm{margin} - d,\ 0)^2 \right]$$
wherein $d = \|a_n - b_n\|_2$ represents the Euclidean distance between the features of the two samples, N is the number of samples, y is a label marking whether the two samples match (y = 1 means the two samples are similar or matched, y = 0 means they are not matched), and margin is a set threshold;
S44, constructing positive and negative samples for training the twin neural network model;
S45, using the trained twin neural network model to predict on the test set, obtaining the set of data records belonging to the same entity.
Preferably, the construction of the positive and negative samples in step S44 includes the following steps:
S441, inputting a sample S, and finding another identical or similar sample S' in the input data set according to the sample features to form a value pair (S, S', 1), wherein 1 represents a match;
S442, repeating step S441 to generate the positive samples;
S443, inputting a sample S, and randomly selecting a sample S' from the complement of sample S according to the sample features to form a value pair (S, S', 0), wherein 0 represents a mismatch;
S444, repeating step S443 to generate the negative samples.
The invention also provides a privacy-preserving data record integration system, which comprises:
a data record structure creation module, used to create a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
a system authority setting module, used to set the management authority of the nodes of the data record integration system;
a record linking module, used to compare the similarity of the candidate record pairs, based on the data record structure of the data record integration system and using the management authority of the nodes, and to output the data record information belonging to the same entity.
The present invention also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the privacy-preserving data record integration method of any one of the above embodiments.
Compared with the prior art, the invention has the following beneficial effects: (1) the invention applies deep learning to the privacy-preserving record linkage task, so that the technique can adapt to more scenarios; unlike existing machine learning methods, the twin neural network automatically extracts scenario features according to how the attributes of the scenario are represented and builds a model from them, which improves linkage accuracy and the time efficiency of the whole process while keeping the integration process credible; (2) the encoded data record structure anonymizes the original data, which guarantees the traceability of the data records and the credibility of the integration process without revealing personal private data in the information system, and is necessary in industrial applications with high requirements on data security and data traceability; (3) to handle heterogeneous data storage, the invention automatically extracts the features of the scenario data and normalizes the data format, greatly reducing the difficulty of data record integration; (4) by linking the data in the data integration system, the invention can integrate data between different information systems while keeping the process credible, thereby effectively supporting cross-domain application fusion and expanding the scope and strength of data integration.
Drawings
FIG. 1 is a flow chart of a privacy preserving data record integration method according to an embodiment of the present invention;
FIG. 2 is a diagram of a data structure for record level bloom filter encoded records according to an embodiment of the present invention;
FIG. 3 is a block diagram of the architecture of a privacy-preserving data record integration system according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the similarity comparison process according to an embodiment of the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
Example 1:
As shown in FIG. 1, the privacy-preserving data record integration method includes the following steps:
S1, creating a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
Step S1 specifically includes the following steps:
S11, preprocessing the data records in the data record structure, wherein the preprocessing comprises data cleaning and standardization, and the data records are corrected by, in sequence, filling missing data, removing redundant values and unifying data formats;
S12, selecting quasi-identifiers: traversing the attribute fields and, according to the particularity of the scenario, selecting a plurality of concatenated attributes that can uniquely identify a real-world entity as the quasi-identifier attributes;
S13, record-level Bloom filter encoding: splitting the selected quasi-identifier attributes into n-grams, and connecting them through n Bloom filters to map them into a record-level Bloom filter (RBF).
Specifically, the first name and the surname are selected as the quasi-identifier attributes and Bloom-filter encoded to form the quasi-identifier record data structure. For example, the first name John and the surname Smith are split into 2-grams: Jo, oh, hn, Sm, mi, it, th; these are connected through n Bloom filters and mapped into a record-level Bloom filter (RBF), as shown in FIG. 2.
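The John/Smith example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the encoding of the invention: the filter length of 64 bits, the two hash functions per n-gram, the use of salted SHA-256 as the hash family, and the function names are all hypothetical, and the record-level filter is assumed to be the concatenation of one Bloom filter per attribute.

```python
import hashlib


def ngrams(value: str, n: int = 2) -> list[str]:
    """Split a string into overlapping n-grams, e.g. 'John' -> Jo, oh, hn."""
    return [value[i:i + n] for i in range(len(value) - n + 1)]


def bloom_filter(grams: list[str], size: int = 64, num_hashes: int = 2) -> list[int]:
    """Map n-grams into a bit list using num_hashes salted SHA-256 hashes."""
    bits = [0] * size
    for gram in grams:
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bits[int.from_bytes(digest[:8], "big") % size] = 1
    return bits


def record_level_bloom_filter(attributes: list[str], n: int = 2,
                              size: int = 64, num_hashes: int = 2) -> list[int]:
    """Concatenate one Bloom filter per quasi-identifier attribute (assumed RBF layout)."""
    rbf: list[int] = []
    for value in attributes:
        rbf.extend(bloom_filter(ngrams(value, n), size, num_hashes))
    return rbf


print(ngrams("John"), ngrams("Smith"))
rbf = record_level_bloom_filter(["John", "Smith"])
print(len(rbf), sum(rbf))
```

Because similar strings share most of their n-grams, their filters share most of their set bits, which is what allows similarity to be compared on the encoded records rather than on the original names.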
S2, setting privacy parameters of the data record integration system and the management authority of the nodes;
The management authority of the nodes in step S2 includes a data viewing and distribution authority and a data linking authority;
a node with the data viewing and distribution authority is used by the data holder, while the data record system is operating normally, to view the data in the information system and to distribute the encoded data to the link node;
a node with the data linking authority is used to link the data of the candidate record pairs in the same block when a data integration operation takes place.
A node with the data viewing and distribution authority is taken as a data node, and a node with the data linking authority is taken as a link node. In the present invention, each data node represents one information system.
As shown in FIG. 3, in a data isolation environment, data nodes 1 to n have the data viewing and distribution authority and the link node has the data linking authority.
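The two node roles can be modelled with a small Python sketch. The `Permission` flag, the `Node` class and the `distribute` function are hypothetical names introduced for illustration; the source defines only the two authorities and which kind of node holds each.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Permission(Flag):
    VIEW_DISTRIBUTE = auto()  # data viewing and distribution authority
    LINK = auto()             # data linking authority


@dataclass(frozen=True)
class Node:
    name: str
    permissions: Permission


def distribute(sender: Node, receiver: Node, encoded_records: list) -> list:
    """Hand encoded records from a data node to a link node, checking both roles."""
    if not sender.permissions & Permission.VIEW_DISTRIBUTE:
        raise PermissionError(f"{sender.name} may not distribute data")
    if not receiver.permissions & Permission.LINK:
        raise PermissionError(f"{receiver.name} may not link data")
    return list(encoded_records)


data_node = Node("data-node-1", Permission.VIEW_DISTRIBUTE)
link_node = Node("link-node", Permission.LINK)
print(distribute(data_node, link_node, [[0, 1, 1, 0]]))
```

Only encoded records cross from a data node to the link node in this sketch, which mirrors the separation described above: data nodes see their own raw data, while the link node sees only encodings.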
S3, based on the data record structure, using the management authority of the nodes and the privacy parameters to encode and block the data of the data record integration system, so as to generate a plurality of blocks containing candidate record pairs;
Step S3 specifically includes the following steps:
screening candidate record pairs by blocking: the data of the data record integration system is deduplicated to obtain candidate record pair data, a common attribute other than the quasi-identifier is selected and hash-encoded, and the result is used as the blocking key; data records with the same blocking key are placed in the same block, giving a set of blocks $B = \{b_1, b_2, \ldots, b_i\}$, where $b_i$ denotes the i-th block.
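The blocking step can be sketched in Python as follows. The choice of SHA-256 as the hash, the attribute name `city` and the function names are assumptions for illustration; the source only states that a common attribute other than the quasi-identifier is hash-encoded and used as the blocking key.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def blocking_key(value: str) -> str:
    """Hash-encode the blocking attribute so the raw value is not exposed."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_blocks(records: list[dict], attribute: str) -> dict[str, list[dict]]:
    """Group records that share the same hashed blocking key into one block."""
    blocks: dict[str, list[dict]] = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec[attribute])].append(rec)
    return dict(blocks)


def candidate_pairs(blocks: dict[str, list[dict]]) -> list[tuple[dict, dict]]:
    """Only records inside the same block are paired for later comparison."""
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


records = [
    {"id": 1, "city": "hangzhou"},
    {"id": 2, "city": "hangzhou"},
    {"id": 3, "city": "beijing"},
]
blocks = build_blocks(records, "city")
print([(a["id"], b["id"]) for a, b in candidate_pairs(blocks)])
```

Blocking reduces the number of comparisons from all pairs of records to pairs within each block. The sketch pairs every two records in a block; restricting pairs to records that come from different data nodes would be a natural refinement that is not shown here.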
S4, performing similarity comparison on the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity.
Step S4 includes the following step:
when the data of the data record integration system is sent by the data node to the link node, a twin (Siamese) neural network is used at the link node to compare the similarity of the candidate record pairs, and the linked results are integrated and output.
As shown in FIG. 4, step S4 specifically includes the following steps:
S41, extracting features at the link node with a bidirectional recurrent neural network (BiLSTM): for the input data sets $x_1, x_2$, the attribute data first passes through the embedding layer of the BiLSTM, where it is split into m batches of size batch_size, generating n*m-dimensional transformation vectors whose entries are random numbers in (-1, 1);
wherein n represents the number of Bloom filters, m represents the number of batches of size batch_size, and batch_size represents the number of samples in one training step;
S42, transforming the vectors: samples $x_i$ and $x_j$ pass through the embedding layer, producing the vector representations $y_i, y_j \in R^{n*m}$; the generated n*m-dimensional feature vectors are used as new features and passed through a fully connected layer to output the cosine similarity of samples $x_i$ and $x_j$; the result, used as the measure of similarity, is calculated by the following equation:
$$\cos\big(f_w(x_1), f_w(x_2)\big) = \frac{\langle f_w(x_1), f_w(x_2) \rangle}{\|f_w(x_1)\| \, \|f_w(x_2)\|}$$
wherein $f_w(x_i)$ represents the mapping of input sample $x_i$ at the embedding layer, $\langle f_w(x_1), f_w(x_2) \rangle$ denotes the dot product of $f_w(x_1)$ and $f_w(x_2)$, and $\|f_w(x_i)\|$ is the norm of $f_w(x_i)$;
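The similarity measure described here, a dot product divided by the product of the two norms, can be written as a short Python function. The vectors below are made-up stand-ins for the embedded features $f_w(x_1)$ and $f_w(x_2)$, and returning 0.0 for a zero vector is a convention chosen for this sketch.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical embedded features of two candidate records.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

The value lies in [-1, 1]: vectors pointing in the same direction score close to 1, orthogonal vectors score 0.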
S43, calculating a loss function from the similarity obtained in step S42, continuously adjusting the parameters of the twin neural network model by gradient descent, and repeating step S41 and step S42 in sequence to optimize the twin neural network model;
for input samples $a_n$ and $b_n$, the loss function between the two is calculated as follows:
$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y \, d^2 + (1 - y) \, \max(\mathrm{margin} - d,\ 0)^2 \right]$$
wherein $d = \|a_n - b_n\|_2$ represents the Euclidean distance between the features of the two samples, N is the number of samples, y is a label marking whether the two samples match (y = 1 means the two samples are similar or matched, y = 0 means they are not matched), and margin is a set threshold;
When y = 1 in L, the loss function reduces to
$$L = \frac{1}{2N} \sum_{n=1}^{N} d^2$$
That is, for samples that are actually similar, a large Euclidean distance in the feature space gives a large loss, and the model needs to be adjusted. When y = 0, the loss function is
$$L = \frac{1}{2N} \sum_{n=1}^{N} \max(\mathrm{margin} - d,\ 0)^2$$
That is, when the samples are not similar, the smaller the Euclidean distance in the feature space, the larger the loss value.
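The behaviour described for y = 1 and y = 0 matches the standard contrastive loss, $L = \frac{1}{2N}\sum_n [y d^2 + (1 - y)\max(\mathrm{margin} - d, 0)^2]$. The Python sketch below assumes that standard form, with one label per pair; the function name and the sample vectors are hypothetical.

```python
import math


def contrastive_loss(pairs: list[tuple[list[float], list[float]]],
                     labels: list[int], margin: float = 1.0) -> float:
    """Standard contrastive loss (assumed form), averaged over N pairs with a 1/2 factor."""
    total = 0.0
    for (a, b), y in zip(pairs, labels):
        d = math.dist(a, b)  # Euclidean distance between the two feature vectors
        # y = 1: penalize distance; y = 0: penalize closeness inside the margin.
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * len(pairs))


far_pair = [([0.0, 0.0], [3.0, 4.0])]   # distance 5
same_point = [([0.0, 0.0], [0.0, 0.0])]  # distance 0
print(contrastive_loss(far_pair, [1]))                  # matched but far apart: large loss
print(contrastive_loss(far_pair, [0]))                  # unmatched and far apart: no loss
print(contrastive_loss(same_point, [0], margin=2.0))    # unmatched but identical: loss from margin
```

Matched pairs are pulled together without limit, while unmatched pairs are pushed apart only until their distance reaches the margin, after which they contribute nothing.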
S44, constructing positive and negative samples for training the twin neural network model;
The construction of the positive and negative samples comprises the following steps:
S441, inputting a sample S, and finding another identical or similar sample S' in the input data set according to the sample features to form a value pair (S, S', 1), wherein 1 represents a match;
S442, repeating step S441 to generate the positive samples;
S443, inputting a sample S, and randomly selecting a sample S' from the complement of sample S according to the sample features to form a value pair (S, S', 0), wherein 0 represents a mismatch;
S444, repeating step S443 to generate the negative samples.
S45, using the trained twin neural network model to predict on the test set, obtaining the set of data records belonging to the same entity.
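One way to turn pairwise predictions into sets of records belonging to the same entity is to treat every pair scored above a threshold as a link and merge linked records with a union-find structure. This grouping strategy, the threshold of 0.8 and the function name are assumptions of this sketch; the source does not say how predicted matches are combined.

```python
def group_entities(num_records: int,
                   scored_pairs: list[tuple[int, int, float]],
                   threshold: float = 0.8) -> list[list[int]]:
    """Merge records connected by pairs whose similarity score reaches the threshold."""
    parent = list(range(num_records))

    def find(x: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, score in scored_pairs:
        if score >= threshold:
            parent[find(i)] = find(j)

    groups: dict[int, list[int]] = {}
    for record in range(num_records):
        groups.setdefault(find(record), []).append(record)
    return sorted(groups.values())


# Hypothetical similarity scores for candidate pairs (record index, record index, score).
print(group_entities(4, [(0, 1, 0.95), (1, 2, 0.40), (2, 3, 0.90)]))
```

Merging is transitive here: if A matches B and B matches C, all three end up in one group even when A and C were never compared, which may or may not be desirable depending on the scenario.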
Based on embodiment 1, the present invention also provides a privacy-preserving data record integration system, including:
a data record structure creation module, used to create a data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
a system authority setting module, used to set the management authority of the nodes of the data record integration system;
a record linking module, used to compare the similarity of the candidate record pairs, based on the data record structure of the data record integration system and using the management authority of the nodes, and to output the data record information belonging to the same entity.
Throughout the recording process, the data always stays in a data isolation environment, which guarantees that the data of each information system in the system remains credible during transmission.
Based on embodiment 1, the present invention further provides a computer-readable storage medium, which includes computer-executable instructions that, when executed by one or more processors, cause the processors to execute the steps of the privacy-preserving data record integration method according to any one of the above embodiments.
The invention applies deep learning to the privacy-preserving record linkage task, so that the technique can adapt to more scenarios; unlike existing machine learning methods, the twin neural network automatically extracts scenario features according to how the attributes of the scenario are represented and builds a model from them, which improves linkage accuracy and the time efficiency of the whole process while keeping the integration process credible. The encoded data record structure anonymizes the original data, which guarantees the traceability of the data records and the credibility of the integration process without revealing personal private data in the information system, and is necessary in industrial applications with high requirements on data security and data traceability. To handle heterogeneous data storage, the invention automatically extracts the features of the scenario data and normalizes the data format, greatly reducing the difficulty of data record integration. By linking the data in the data integration system, the invention can integrate data between different information systems while keeping the process credible, thereby effectively supporting cross-domain application fusion and expanding the scope and strength of data integration.
The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.

Claims (9)

1. A privacy-preserving data record integration method, characterized by comprising the following steps:
S1, creating a data record structure of a data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
S2, setting privacy parameters of the data record integration system and the management authority of the nodes;
S3, based on the data record structure, using the management authority of the nodes and the privacy parameters to encode and block the data of the data record integration system, so as to generate a plurality of blocks containing candidate record pairs;
S4, performing similarity comparison on the generated candidate record pairs, and integrating and outputting the data record information belonging to the same entity.

2. The privacy-preserving data record integration method according to claim 1, characterized in that step S1 comprises the following steps:
S11, preprocessing the data records in the data record structure, wherein the preprocessing comprises data cleaning and standardization, and the data records are corrected by, in sequence, filling missing data, removing redundant values and unifying data formats;
S12, selecting quasi-identifiers: traversing the attribute fields and, according to the particularity of the scenario, selecting a plurality of concatenated attributes that can uniquely identify a real-world entity as the quasi-identifier attributes;
S13, record-level Bloom filter encoding: splitting the selected quasi-identifier attributes into n-grams, and connecting them through n Bloom filters to map them into a record-level Bloom filter (RBF).

3. The privacy-preserving data record integration method according to claim 2, characterized in that the management authority of the nodes in step S2 includes a data viewing and distribution authority and a data linking authority;
a node with the data viewing and distribution authority is used by the data holder, while the data record system is operating normally, to view the data in the information system and to distribute the encoded data to the link node;
a node with the data linking authority is used to link the data of the candidate record pairs in the same block when a data integration operation takes place.

4. The privacy-preserving data record integration method according to claim 1, characterized in that step S3 comprises the following steps:
screening candidate record pairs by blocking: the data of the data record integration system is deduplicated to obtain candidate record pair data, a common attribute other than the quasi-identifier is selected and hash-encoded, and the result is used as the blocking key; data records with the same blocking key are placed in the same block, giving a set of blocks $B = \{b_1, b_2, \ldots, b_i\}$, where $b_i$ denotes the i-th block.

5. The privacy-preserving data record integration method according to claim 3, characterized in that a node with the data viewing and distribution authority is taken as a data node, a node with the data linking authority is taken as a link node, and step S4 comprises the following step:
when the data of the data record integration system is sent by the data node to the link node, a twin neural network is used at the link node to compare the similarity of the candidate record pairs, and the linked results are integrated and output.

6. The privacy-preserving data record integration method according to claim 5, characterized in that step S4 comprises the following steps:
S41, extracting features at the link node with a bidirectional recurrent neural network (BiLSTM): for the input data sets $x_1, x_2$, the attribute data first passes through the embedding layer of the BiLSTM, where it is split into m batches of size batch_size, generating n*m-dimensional transformation vectors whose entries are random numbers in (-1, 1);
wherein n represents the number of Bloom filters, m represents the number of batches of size batch_size, and batch_size represents the number of samples in one training step;
S42, transforming the vectors: samples $x_i$ and $x_j$ pass through the embedding layer, producing the vector representations $y_i, y_j \in R^{n*m}$; the generated n*m-dimensional feature vectors are used as new features and passed through a fully connected layer to output the cosine similarity of samples $x_i$ and $x_j$; the result calculated by the following formula is used as the measure of similarity:
$$\cos\bigl(f_w(x_1), f_w(x_2)\bigr) = \frac{\langle f_w(x_1),\, f_w(x_2)\rangle}{\|f_w(x_1)\|\,\|f_w(x_2)\|}$$
where $f_w(x_i)$ denotes the mapping of input sample $x_i$ by the embedding layer, $\langle f_w(x_1), f_w(x_2)\rangle$ denotes the dot product of $f_w(x_1)$ and $f_w(x_2)$, and $\|f_w(x_i)\|$ is the norm (magnitude) of $f_w(x_i)$;
S43, computing the loss function from the similarity obtained in step S42, continuously adjusting the parameters of the Siamese neural network model by gradient descent, and repeating steps S41 and S42 in turn to optimize the model;
for input samples $a_n$ and $b_n$, the loss function between them is calculated as follows:
$$L = \frac{1}{2N}\sum_{n=1}^{N}\Bigl[\, y\,d^{2} + (1-y)\,\max(\mathrm{margin} - d,\ 0)^{2} \Bigr]$$
where $d = \|a_n - b_n\|_2$ is the Euclidean distance between the features of the two samples, N is the number of samples, y is the label indicating whether the two samples match (y = 1 means the two samples are similar or matching, y = 0 means they do not match), and margin is a preset threshold;
S44, constructing positive and negative samples for training the Siamese neural network model;
S45, using the trained Siamese neural network model to make predictions on the test set, obtaining the sets of data records that belong to the same entity.
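
The similarity measure of step S42 and the loss of step S43 can be illustrated with a minimal Python sketch. It assumes that the embeddings are already available as plain lists of floats, and that the loss takes the common contrastive-loss form suggested by the definitions of d, N, y and margin in claim 6. The function names, the default margin of 1.0, and the handling of zero vectors are illustrative assumptions and are not taken from the patent.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity <u, v> / (||u|| * ||v||) of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(a, b):
    """d = ||a - b||_2 between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(pairs, labels, margin=1.0):
    """Contrastive loss averaged over N pairs.

    pairs  : list of (a_n, b_n) feature-vector tuples
    labels : list of y values, 1 = match, 0 = no match
    margin : assumed threshold; non-matching pairs farther apart
             than this contribute nothing to the loss
    """
    total = 0.0
    for (a, b), y in zip(pairs, labels):
        d = euclidean_distance(a, b)
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * len(pairs))
```

In this form, matching pairs (y = 1) are penalized by their squared distance, while non-matching pairs (y = 0) are penalized only when they lie closer together than the margin.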
7. The privacy-preserving data record integration method according to claim 6, wherein constructing the positive and negative samples in step S44 comprises the following steps:
S441, inputting a sample s and, according to the sample features, finding another identical or similar sample s′ in the input data set to form a value pair (s, s′, 1), where 1 denotes a match;
S442, repeating step S441 to generate the positive samples;
S443, inputting a sample s and, according to the sample features, randomly selecting a sample s′ from the complement of sample s to form a value pair (s, s′, 0), where 0 denotes a mismatch;
S444, repeating step S443 to generate the negative samples.

8. A privacy-preserving data record integration system, characterized by comprising:
a data record structure creation module, for creating the data record structure of the data record integration system, the data record structure comprising a quasi-identifier data structure and a key-value block data structure;
a system authority setting module, for setting the management authority of the nodes of the data record integration system;
a record linking module, which, based on the data record structure of the data record integration system, uses the management authority of the nodes to compare the similarity of the candidate record pairs and outputs the data record information belonging to the same entity.

9. A computer-readable storage medium, characterized by comprising computer-executable instructions which, when executed by one or more processors, cause the processors to perform the steps of the privacy-preserving data record integration method according to any one of claims 1-7.
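
The encoding and blocking steps recited in claims 2 and 4 (S13 and S3) can be sketched in Python as follows. This is only an illustration under stated assumptions: the claims do not fix the n-gram length, the Bloom filter size, the number of hash functions, or the hash algorithm, so the bigrams, 64-bit filters, two hash functions, and SHA-256 used here are hypothetical choices, as are all function and field names. The sketch also assumes one Bloom filter per quasi-identifier attribute, concatenated into the record-level filter.

```python
import hashlib
from collections import defaultdict


def ngrams(text, n=2):
    """Split a padded, lower-cased string into overlapping n-grams."""
    padded = f"_{text.lower()}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def bloom_encode(value, size=64, num_hashes=2):
    """Map the n-grams of one attribute value into a bit list of length `size`."""
    bits = [0] * size
    for gram in ngrams(value):
        for k in range(num_hashes):
            # Assumed scheme: salt the n-gram with the hash index k.
            digest = hashlib.sha256(f"{k}:{gram}".encode("utf-8")).hexdigest()
            bits[int(digest, 16) % size] = 1
    return bits


def record_bloom_filter(record, quasi_identifiers):
    """Concatenate the per-attribute Bloom filters into a record-level filter."""
    rbf = []
    for attr in quasi_identifiers:
        rbf.extend(bloom_encode(str(record[attr])))
    return rbf


def blocking_key(record, block_attr):
    """Hash a common, non-quasi-identifier attribute to obtain the blocking key."""
    return hashlib.sha256(str(record[block_attr]).encode("utf-8")).hexdigest()


def build_blocks(records, block_attr):
    """Group records that share a blocking key into the same block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec, block_attr)].append(rec)
    return dict(blocks)
```

For example, with hypothetical records carrying `name`, `dob` and `city` fields, `record_bloom_filter(rec, ["name", "dob"])` would produce the encoded quasi-identifiers, and `build_blocks(records, "city")` would place records from the same city in one block, so that only pairs within a block need to be compared.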
CN202111383157.3A 2021-11-22 2021-11-22 Privacy preserving data record integration method, system and computer readable storage medium Pending CN113946871A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111383157.3A CN113946871A (en) 2021-11-22 2021-11-22 Privacy preserving data record integration method, system and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111383157.3A CN113946871A (en) 2021-11-22 2021-11-22 Privacy preserving data record integration method, system and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113946871A true CN113946871A (en) 2022-01-18

Family

ID=79338719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111383157.3A Pending CN113946871A (en) 2021-11-22 2021-11-22 Privacy preserving data record integration method, system and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113946871A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341023A (en) * 2023-05-24 2023-06-27 北京百度网讯科技有限公司 Block chain-based service address verification method, device, equipment and storage medium
CN116361859A (en) * 2023-06-02 2023-06-30 之江实验室 Inter-institutional Patient Record Linking Method and System Based on Deep Privacy Encoder

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341023A (en) * 2023-05-24 2023-06-27 北京百度网讯科技有限公司 Block chain-based service address verification method, device, equipment and storage medium
CN116341023B (en) * 2023-05-24 2023-08-29 北京百度网讯科技有限公司 Block chain-based service address verification method, device, equipment and storage medium
CN116361859A (en) * 2023-06-02 2023-06-30 之江实验室 Inter-institutional Patient Record Linking Method and System Based on Deep Privacy Encoder
CN116361859B (en) * 2023-06-02 2023-08-25 之江实验室 Cross-mechanism patient record linking method and system based on depth privacy encoder

Similar Documents

Publication Publication Date Title
US20240078253A1 (en) Column data anonymization based on privacy category classification
US11983297B2 (en) Efficient statistical techniques for detecting sensitive data
CN113961759B (en) Abnormality detection method based on attribute map representation learning
US11797705B1 (en) Generative adversarial network for named entity recognition
US11500876B2 (en) Method for duplicate determination in a graph
Feng et al. Privacy-preserving tucker train decomposition over blockchain-based encrypted industrial IoT data
US11514321B1 (en) Artificial intelligence system using unsupervised transfer learning for intra-cluster analysis
US20210089667A1 (en) System and method for implementing attribute classification for pii data
Vatsalan et al. Efficient two-party private blocking based on sorted nearest neighborhood clustering
JP7408626B2 (en) Tenant identifier replacement
CN113946871A (en) Privacy preserving data record integration method, system and computer readable storage medium
EP3591561A1 (en) An anonymized data processing method and computer programs thereof
Zhang et al. A Robust k‐Means Clustering Algorithm Based on Observation Point Mechanism
CN118626811A (en) Industrial chain analysis method and system based on knowledge graph
Dobbs et al. On art authentication and the Rijksmuseum challenge: A residual neural network approach
Goswami et al. A survey on big data & privacy preserving publishing techniques
CN116933075A (en) Question-answering model training method, intelligent question-answering method and device in network security field
CN115686868B (en) Cross-node-oriented multi-mode retrieval method based on federated hash learning
Salem et al. Blockchain-based biometric identity management
JP2023517518A (en) Vector embedding model for relational tables with null or equivalent values
Wurzenberger et al. Discovering insider threats from log data with high-performance bioinformatics tools
CN116150185A (en) Data standard extraction method, device, equipment and medium based on artificial intelligence
Yang et al. GEUKE: A geographic entities uniformly explicit knowledge embedding model
Li Correlation temporal feature extraction network via residual network for English relation extraction.
CN117197816B (en) User material identification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination