CN110610098A

CN110610098A - Data set generation method and device

Info

Publication number: CN110610098A
Application number: CN201810615202.5A
Authority: CN
Inventors: 牛家浩; 申山宏; 王德政; 程祥; 苏森; 唐朋; 邵华西
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2018-06-14
Filing date: 2018-06-14
Publication date: 2019-12-24
Anticipated expiration: 2038-06-14
Also published as: WO2019237840A1; CN110610098B

Abstract

The invention provides a data set generation method and device. The method comprises the following steps: each data owner of multi-party vertically partitioned data calculates the mutual information of explicit variable pairs in its local original data set to generate leaf-level hidden variables; each data owner builds a tree index locally, and the data owners are combined pairwise into data owner pairs to match the tree indexes and calculate the mutual information between leaf-level hidden variables; each data owner performs hidden tree structure learning and hidden tree parameter learning and generates a hidden tree locally; and each data owner generates a target data set from top to bottom according to the learned hidden tree structure and hidden tree parameters.

Description

Data set generation method and device

Technical Field

The invention relates to the field of data security, in particular to a data set generation method and device.

Background

With the rapid development of digital technologies such as smart cities, smart grids and smart healthcare, and with the wide adoption of mobile terminal devices, information about people's clothing, food, housing, health and medical care is being digitized. Massive amounts of data are generated every day, ushering in the era of big data. These data are often held by different data owners; for example, hospitals and financial institutions hold medical data and financial data, respectively. When data distributed among multiple parties share the same IDs but contain different attributes, the data are called multi-party vertically partitioned data. Publishing multi-party vertically partitioned data allows data analysts to fully analyze and mine the potential value of the data. However, vertically partitioned data often contain a large amount of sensitive personal information, and publishing such data directly would inevitably disclose individuals' private information.

The differential privacy protection model offers a feasible solution to the problem of privacy-preserving data publishing. Unlike anonymity-based privacy protection models, it provides a strict and quantifiable privacy guarantee whose strength is independent of the background knowledge held by the attacker.

At present, in the single-party scenario, the problem of publishing data under differential privacy is addressed by Private Data Release via Bayesian Networks (PrivBayes): first, a Bayesian network is constructed from the original data; then, noise is added to the constructed Bayesian network so that it satisfies differential privacy; finally, the noisy Bayesian network is used to generate new data for release. However, because the algorithm is designed for single-party data, PrivBayes cannot be applied in a multi-party scenario.

In the multi-party scenario, the existing differentially private publishing method for vertically partitioned data (DistDiffGen) can only publish the statistical information required to construct a decision tree classifier, so it is a publishing method bound to one specific data analysis task. In other words, the differentially private publishing methods for vertically partitioned data currently in practical use apply only to decision-tree-based classification tasks and are not applicable to other data analysis and mining tasks, such as other types of classification, clustering and statistical analysis.

Disclosure of Invention

The embodiment of the invention provides a data set generation method and device, which are used for at least solving the privacy protection problem of data in the related technology.

According to an aspect of the present invention, there is provided a data set generation method, comprising: each data owner obtains the mutual information of explicit variable pairs in its local original data set to generate leaf-level hidden variables; each data owner builds a tree index locally, and the data owners are combined pairwise into data owner pairs to match the tree indexes and calculate the mutual information between leaf-level hidden variables; each data owner performs hidden tree structure learning and hidden tree parameter learning and generates a hidden tree locally; and each data owner generates a target data set from top to bottom according to the learned hidden tree structure and hidden tree parameters.

According to another aspect of the present invention, there is provided a data set generating apparatus comprising: the hidden variable generation module is used for acquiring mutual information of a pair of explicit variables in the local original data set for each data owner to generate leaf-level hidden variables; the mutual information calculation module is used for establishing a tree index for each data owner locally, combining the data owners in pairs to form a data owner pair, and performing matching of the tree index and calculation of mutual information between hidden variables of the leaf layer; the hidden tree generation module is used for executing hidden tree structure learning and hidden tree parameter learning for each data owner and respectively generating a hidden tree locally; and the data set generation module is used for generating a target data set for each data owner from top to bottom according to the learned hidden tree structure and hidden tree parameters.

According to still another aspect of the present invention, there is also provided a data set generating system, which includes a plurality of data set generating apparatuses in the foregoing embodiments, wherein each data set generating apparatus corresponds to data processing of one data owner, and all the data set generating apparatuses are connected through a network.

According to yet another aspect of the present invention, there is also provided a storage medium having a computer-readable program stored therein, wherein the program is run to perform the method steps in the previous embodiments.

In the embodiment of the invention, the hidden tree model is adopted to model the data set distribution vertically divided among a plurality of data owners, and the data set containing noise is jointly issued according to the learned hidden tree model, so that the noise adding amount is reduced, the requirement on the differential privacy of the issued data set is met in the issuing process of multi-party vertically divided data, and meanwhile, the issued overall data can support a plurality of data analysis tasks.

Drawings

FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention;

FIG. 2 is a flow diagram of a data set generation method according to an embodiment of the invention;

FIG. 3 is a flow diagram of a method for data distribution with multi-party vertical partitioning according to an embodiment of the present invention;

FIG. 4 is a block diagram of the structure of a data set generating apparatus according to an embodiment of the present invention;

FIG. 5 is a block diagram of a multi-party vertically partitioned data distribution apparatus according to an embodiment of the present invention;

FIG. 6 is a flow chart of a method according to a first embodiment of the present invention;

FIG. 7 is a flowchart of a method according to a second embodiment of the invention;

FIG. 8 is a flow chart of a method according to a third embodiment of the present invention;

FIG. 9 is a flowchart of a method according to a fourth embodiment of the invention.

Detailed Description

The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

Through the following embodiments, the invention provides a method for publishing multi-party vertically partitioned data that is independent of any specific data analysis task and satisfies differential privacy. In a big data environment, the published data set meets the differential privacy requirement throughout the publishing of multi-party vertically partitioned data, and the published data as a whole can support a variety of data analysis tasks. Therefore, on the premise of protecting individual privacy, data analysts can fully analyze and mine the value in the data, providing a stronger basis for decision support and scientific research.

It should be noted that, in the embodiments of the present invention, a data owner does not refer to a specific person but to each party holding multi-party vertically partitioned data, which may be any data processing apparatus that processes such data, such as a database, a big data platform or a server. Each data owner has its own data (i.e., data stored in a data warehouse or database).

FIG. 1 shows a system architecture of an embodiment of the present invention. As shown in FIG. 1, multi-party vertically partitioned data (e.g., medical data or financial data) having the same IDs but containing different attributes are distributed across three different data owners. In this embodiment, a data owner may be a server, so server 1, server 2 and server 3 each represent a different data owner. Server 1, server 2 and server 3 are connected through a wired or wireless network. This embodiment does not limit the form or topology of the network connecting the servers, which depends mainly on the geographical distribution of the data owners and on actual needs; for example, it may be a local area network, the Internet or another private network. Over this network, the servers can send heartbeat information to one another, register as participants in multi-party vertically partitioned data publishing, publish the generated multi-party vertically partitioned data set, and so on.

By running the technical scheme provided by the embodiment of the invention on the system architecture shown in FIG. 1, the requirement on the differential privacy of the published data set can be met in the publishing process of the multi-party vertical partition data, and the published overall data can support various data analysis tasks.

In this embodiment, a data set generation method is provided, which may be implemented on the system architecture of the above embodiment. FIG. 2 is a flow chart of a data set generation method according to an embodiment of the present invention. This embodiment involves a plurality of data owners. As shown in FIG. 2, the process includes the following steps:

Step S202, each data owner calculates the mutual information of explicit variable pairs in its local original data set to generate leaf-level hidden variables.

Step S204, each data owner builds a tree index locally, and the data owners are combined pairwise into data owner pairs to match the tree indexes and calculate the mutual information between leaf-level hidden variables.

Step S206, each data owner performs hidden tree structure learning and hidden tree parameter learning and generates a hidden tree locally.

Step S208, each data owner generates a target data set from top to bottom according to the learned hidden tree structure and hidden tree parameters.

In the above embodiment, the hidden tree model is used to model the data set distribution vertically divided among multiple data owners, and the data sets containing noise are jointly published according to the learned hidden tree model, so that the noise addition amount is reduced to the greatest extent under the condition that the published data meets the differential privacy.

The present invention further provides another embodiment: a method for publishing multi-party vertically partitioned data that is independent of any specific data analysis task and satisfies differential privacy. As shown in FIG. 3, the method includes the following steps:

Step S301, the original data set is uniformly encoded, missing data are filled, and the data are discretized and binarized to obtain a regular explicit variable data set.

Step S302, the explicit variables are combined pairwise to form a set of explicit variable pairs, and the data are accessed to calculate the mutual information between each pair of explicit variables.

Step S303, leaf-level hidden variables are generated under differential privacy protection by using the differential privacy exponential mechanism.

Step S304, each data owner combines its leaf-level hidden variables pairwise to form hidden variable pairs and accesses the hidden variable data to calculate the mutual information between each pair of hidden variables.

Step S305, based on the calculated mutual information between leaf-level hidden variables, the leaf-level hidden variables are grouped under differential privacy protection to generate upper-layer hidden variables.

Step S306, the step of generating upper-layer hidden variables from lower-layer hidden variables is repeated from bottom to top until the upper layer contains only one hidden variable node, which is marked as the root node; the hidden variable nodes, together with the edges connecting parent and child nodes, form a tree index, which is stored locally at the data owner.

Step S307, the data owners are combined pairwise to form data owner pairs, and the data owners then communicate to negotiate related parameters, including but not limited to the pairing of data owners, the order in which the data owners execute subsequent calculations, and the maximum number of other data owners with which a single data owner can communicate at the same time.

Step S308, each pair of data owners jointly runs a secure multi-party computation protocol and, under the encryption of that protocol, runs the two-party hidden variable mutual information calculation method based on tree index matching. Each data owner may perform this calculation while communicating with several other data owners at the same time. The data owners then jointly run the secure protocol and, under the encryption of the secure multi-party computation protocol, broadcast the previously calculated mutual information between hidden variable pairs to all other data owners, until every data owner locally stores the same, complete set of association strengths between leaf-level hidden variable pairs.

Step S309, the data owners each independently run the maximum spanning tree construction method locally: the leaf-level hidden variables and the explicit variables are taken as nodes, the calculated association strength between variables is taken as the weight of the corresponding edge, and an acyclic connected graph with the minimum weight sum is constructed. A root node is then selected at random for the acyclic connected graph, and the parent-child relationship of the two nodes joined by each edge is determined by their path-length distance from the root node, yielding the hidden tree structure.

Step S310, according to the generated hidden tree structure, for each pair of connected parent and child nodes from top to bottom, the conditional probability between the parent and child nodes is calculated under differential privacy protection by applying the Laplace mechanism.

Step S311, the probability distribution of the root node in the original data set is calculated, and generated data for the root node are sampled from this distribution; then, layer by layer from top to bottom, the joint distribution probability of each node and its parent is calculated from the parent node's currently generated data, and noisy data are generated for each node by random sampling according to that joint distribution probability. The result is a generated data set that contains noise and satisfies differential privacy protection.
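As an illustration of step S310, the following Python sketch adds Laplace noise to the joint counts of a binary parent-child pair and normalizes each row into a conditional probability table. It is a minimal sketch under assumptions not stated in the source: a count sensitivity of 1, clamping of negative noisy counts to zero, and a uniform fallback when a noisy row is empty. The function names `laplace_noise` and `noisy_conditional` are hypothetical.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_conditional(parent, child, epsilon, rng=None, sensitivity=1.0):
    """Return {parent_value: [P(child=0 | parent), P(child=1 | parent)]} with Laplace noise.

    Assumption: both variables are binary and one record changes one count by
    at most `sensitivity`.
    """
    rng = rng or random.Random()
    counts = Counter(zip(parent, child))
    scale = sensitivity / epsilon
    table = {}
    for p in (0, 1):
        # Perturb each joint count, then clamp so that no count is negative.
        noisy = [max(counts[(p, c)] + laplace_noise(scale, rng), 0.0) for c in (0, 1)]
        total = sum(noisy)
        # Assumed fallback: a uniform row when all noisy counts were clamped to zero.
        table[p] = [v / total for v in noisy] if total > 0 else [0.5, 0.5]
    return table


if __name__ == "__main__":
    parent = [0, 0, 1, 1] * 50
    child = [0, 1, 1, 1] * 50
    print(noisy_conditional(parent, child, epsilon=1.0, rng=random.Random(0)))
```

A smaller `epsilon` gives a larger noise scale, so the rows drift further from the empirical conditional frequencies.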

The embodiments of the present invention have the following effects. On the one hand, the differential privacy model, the leading model in the data privacy field, is used to protect the data jointly generated by the data owners during multi-party joint publishing, ensuring that data privacy is strictly protected. On the other hand, a hidden tree model is used to model the distribution of a data set vertically partitioned among multiple data owners, and a noisy data set is jointly published according to the learned hidden tree model, so that the amount of added noise is minimized while the published data satisfy differential privacy; this improves the utility of the published data and safeguards the quality of the overall data service. In addition, by adopting the tree-index-based hidden variable mutual information calculation method, the embodiments skip the calculation of mutual information between hidden variables with low association strength, which reduces the number of times noise is added and lowers communication overhead while still drawing on the data of all parties to provide a high-quality service.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (which may be a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

In this embodiment, a data set generating apparatus is further provided. The apparatus is used to implement the foregoing embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware or in a combination of software and hardware is also conceivable.

Fig. 4 is a block diagram of a data set generating apparatus according to an embodiment of the present invention, and as shown in fig. 4, the apparatus includes a hidden variable generating module 10, a mutual information calculating module 20, a hidden tree generating module 30, and a data set generating module 40.

The hidden variable generation module 10 is configured to calculate, for each data owner, mutual information of a pair of explicit variables in the local original data set, and generate leaf-level hidden variables. The mutual information calculation module 20 is configured to locally establish a tree index for each data owner, combine every two data owners to form a data owner pair, and perform matching of the tree index and calculation of mutual information between hidden variables in the leaf layer. The hidden tree generation module 30 is configured to perform hidden tree structure learning and hidden tree parameter learning for each data owner, and generate a hidden tree locally. The data set generating module 40 is configured to generate a target data set for each data owner from top to bottom according to the learned hidden tree structure and hidden tree parameters.

In the above embodiment, there may be a plurality of data set generating apparatuses, each of which handles the data of one data owner, and all the data owners may be connected through a network.

The embodiment of the invention also provides another data set generating apparatus, which may be a data processing apparatus such as a database, a big data platform or a server. Although it can likewise be used to publish vertically partitioned data under differential privacy, the names and functional division of its modules differ from those in the above embodiment. As shown in FIG. 5, the functions implemented by the modules in this embodiment are as follows:

The data preprocessing module 50 mainly performs unified encoding, data cleaning, missing data filling, discretization and binarization on the original data. The data preprocessing module 50 may further include a data encoding sub-module 51, a data filling sub-module 52, a discretization sub-module 53 and a binarization sub-module 54.

Specifically, the data encoding sub-module 51 performs unified encoding across multiple data sources. The data filling sub-module 52 cleans the raw data and fills in missing values. The discretization sub-module 53 maps continuous data or discrete categorical values to discrete numbers. The binarization sub-module 54 converts discrete variables into binary variables taking the values 0 and 1.
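The preprocessing chain of sub-modules 52 to 54 can be sketched in Python as follows. The source does not specify the filling policy, the binning scheme or the binary encoding, so the choices here (fill with the most frequent value, equal-width bins, fixed-width bit expansion) are assumptions made only for illustration, and all function names are hypothetical.

```python
from collections import Counter


def fill_missing(values):
    """Replace None with the most frequent observed value (an assumed policy)."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def discretize(values, bins):
    """Map numeric values to bin indices 0..bins-1 using equal-width bins (assumed)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would land in bin `bins`, so it is clipped into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def binarize(codes, bits):
    """Expand each discrete code into `bits` binary variables, most significant bit first."""
    return [[(c >> (bits - 1 - i)) & 1 for i in range(bits)] for c in codes]


if __name__ == "__main__":
    ages = [20, None, 40, 60, 20]
    filled = fill_missing(ages)
    codes = discretize(filled, bins=4)
    print(filled, codes, binarize(codes, bits=2))
```

With four bins, each discrete code fits in two binary variables, which keeps every downstream variable two-valued as the binarization sub-module requires.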

The associated information collecting module 60 is configured to generate hidden variables for each data owner, construct a tree index, and perform tree index matching. The associated information collecting module 60 may further specifically include a hidden variable generating sub-module 61, a tree index constructing sub-module 62, and a tree index matching sub-module 63.

Specifically, the hidden variable generation sub-module 61 groups the explicit variables under differential privacy and generates the corresponding leaf-level hidden variables. The tree index construction sub-module 62 builds the tree index structure from bottom to top under differential privacy protection. The tree index matching sub-module 63 matches the tree indexes from top to bottom to prune the hidden variable pairs whose mutual information needs to be calculated, reducing the communication overhead of calculating the association strength between hidden variables.
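The differentially private grouping relies on a private selection step; the mechanism named for it in step S303 is the differential privacy exponential mechanism. The Python sketch below shows the generic form of that mechanism, in which a candidate is chosen with probability proportional to exp(epsilon * utility / (2 * sensitivity)). How the source defines its candidates, utility function and sensitivity is not stated, so using mutual information as the utility and the name `exponential_mechanism` are assumptions.

```python
import math
import random


def exponential_mechanism(candidates, utility, epsilon, sensitivity, rng=None):
    """Pick one candidate with probability proportional to exp(eps * u / (2 * sensitivity))."""
    rng = rng or random.Random()
    scores = [utility(c) for c in candidates]
    top = max(scores)
    # Subtracting the best score leaves the probabilities unchanged and avoids overflow.
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Hypothetical utilities, e.g. mutual information of candidate variable pairs.
    pair_utility = {("age", "income"): 0.42, ("age", "zip"): 0.05, ("income", "zip"): 0.31}
    chosen = exponential_mechanism(
        list(pair_utility), pair_utility.get, epsilon=1.0, sensitivity=1.0
    )
    print(chosen)
```

The sensitivity passed in must bound how much one record can change the utility; that bound has to be derived for the chosen utility and is not provided by this sketch.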

The model learning and construction module 70 is configured to learn the hidden tree model from the mutual information between hidden variables. It may further include an acyclic connected graph construction sub-module 71, a hidden tree structure learning sub-module 72 and a hidden tree parameter learning sub-module 73.

Specifically, the acyclic connected graph construction sub-module 71 takes the leaf-level hidden variables as nodes and the mutual information between variables as edge weights, and constructs an acyclic connected graph with the minimum weight sum. The hidden tree structure learning sub-module 72 builds the hidden tree structure. The hidden tree parameter learning sub-module 73 learns the conditional distribution parameters between parent and child nodes in the hidden tree structure.
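The structure-learning step can be sketched with Kruskal's algorithm followed by a breadth-first orientation from a chosen root. This is a minimal sketch rather than the method of the embodiments: the text names a maximum spanning tree construction method while describing the resulting graph as having the minimum weight sum, so the direction is exposed here as the `maximize` parameter, and the function names are hypothetical.

```python
from collections import deque


def spanning_tree(nodes, weighted_edges, maximize=True):
    """Kruskal's algorithm over (weight, u, v) edges; returns the chosen (u, v) edges."""
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for _, u, v in sorted(weighted_edges, key=lambda e: e[0], reverse=maximize):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding this edge does not create a cycle
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


def orient(tree_edges, root):
    """Assign each node its parent by breadth-first distance from the root."""
    adjacency = {}
    for u, v in tree_edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    parent_of = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in parent_of:
                parent_of[neighbor] = node
                queue.append(neighbor)
    return parent_of


if __name__ == "__main__":
    edges = [(0.9, "A", "B"), (0.5, "B", "C"), (0.1, "A", "C")]
    tree = spanning_tree(["A", "B", "C"], edges)
    print(tree, orient(tree, root="B"))
```

Because every data owner holds the same association strengths, running the same deterministic construction locally with an agreed root would give each owner the same tree; how the random root is agreed on is outside this sketch.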

The data publishing module 80 generates each record of the synthetic data set from the root node downward, according to the learned hidden tree structure and the conditional distribution parameters between parent and child nodes.
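Top-down record generation can be sketched in Python as follows, assuming binary variables, a node order in which every parent precedes its children, and already-noised probabilities. The parameter layout (`root_prob` as P(root = 1) and `cond_prob[node][parent_value]` as P(node = 1 | parent)) and the function name `sample_records` are hypothetical.

```python
import random


def sample_records(order, parent_of, root_prob, cond_prob, n, rng=None):
    """Sample n records by visiting nodes so that each parent is drawn before its children."""
    rng = rng or random.Random()
    records = []
    for _ in range(n):
        record = {}
        for node in order:
            parent = parent_of[node]
            if parent is None:
                p_one = root_prob  # the root is drawn from its own distribution
            else:
                p_one = cond_prob[node][record[parent]]  # conditioned on the parent's drawn value
            record[node] = 1 if rng.random() < p_one else 0
        records.append(record)
    return records


if __name__ == "__main__":
    parent_of = {"H": None, "X": "H", "Y": "H"}
    cond_prob = {"X": {0: 0.2, 1: 0.9}, "Y": {0: 0.7, 1: 0.1}}
    for row in sample_records(["H", "X", "Y"], parent_of, 0.5, cond_prob, n=3):
        print(row)
```

In a released data set, a caller would presumably keep only the explicit-variable columns and drop the hidden ones; that filtering is an assumption and is left out here.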

It should be noted that the above modules may be implemented in software or in hardware. In the latter case, this may be achieved in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are located in a plurality of processors.

Example one

In this embodiment, a detailed description is given using the example of K different specialized hospitals (K ≥ 2) jointly publishing medical data from their respective devices:

Implementation environment: the K different specialized hospitals act as K data owners, each deploys the data set generating apparatus (e.g., a database, a big data platform or a server), and the apparatus is used to generate a data set that satisfies differential privacy protection.

As shown in FIG. 6, this embodiment includes the following steps:

Step 601, the K hospitals each independently perform unified encoding, missing data filling, discretization and binarization on their local data sets, and store the mapping from the original data set to the corresponding values of the binary data set.

Step 602, each of the K hospitals combines its explicit variables pairwise to form a set of explicit variable pairs and accesses the data to calculate the mutual information between each pair of explicit variables. The mutual information is calculated as follows:

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$

where p(x) and p(y) are the probabilities that the variables X and Y take the values X = x and Y = y, respectively, and p(x, y) is the joint probability that the variable X takes the value X = x and the variable Y takes the value Y = y.

Step 603, the local explicit variables are grouped under differential privacy protection by using the differential privacy exponential mechanism, wherein the number of explicit variables in each group reaches the preset maximum number for hidden variables but does not exceed the maximum number of the groups.

Step 604, a hidden variable is generated for each local explicit variable group to obtain the leaf-level hidden variable data set.

Step 605, each hospital combines its leaf-level hidden variables pairwise to form hidden variable pairs and accesses the hidden variable data to calculate the mutual information between each pair of hidden variables; for the calculation method, see step 602.

Step 606, based on the calculated mutual information between leaf-level hidden variables, the leaf-level hidden variables are grouped under differential privacy protection to generate upper-layer hidden variables.

Step 607, the step of generating upper-layer hidden variables from lower-layer hidden variables is repeated from bottom to top until the upper layer contains only one hidden variable node, which is marked as the root node; the hidden variable nodes, together with the edges connecting parent and child nodes, form a tree index, which is stored locally at the data owner.

Step 608, the hospitals are combined pairwise to form data owner pairs, and the data owners then communicate to negotiate related parameters, including but not limited to the pairing of data owners, the order in which the data owners execute subsequent calculations, and the maximum number of other data owners with which a single data owner can communicate at the same time.

Step 609, each pair of hospitals jointly runs a secure multi-party computation protocol and, under the encryption of that protocol, runs the two-party hidden variable mutual information calculation method based on tree index matching. Each hospital may perform this calculation while communicating with several other hospitals at the same time.

Step 610, the hospitals jointly run the secure protocol and, under the encryption of the secure multi-party computation protocol, send the previously calculated mutual information between hidden variable pairs to the other hospitals, until every hospital locally stores the same, complete set of association strengths between leaf-level hidden variable pairs.

Step 611, the hospitals each independently run the maximum spanning tree construction method locally: the leaf-level hidden variables and the explicit variables are taken as nodes, the calculated association strength between variables is taken as the weight of the corresponding edge, and an acyclic connected graph with the minimum weight sum is constructed.

Step 612, a root node is selected at random for the acyclic connected graph, and the parent-child relationship of the two nodes joined by each edge is determined by their path-length distance from the root node, yielding the hidden tree structure.

Step 613, according to the generated hidden tree structure, for each pair of connected parent and child nodes from top to bottom, the conditional probability between the parent and child nodes is calculated under differential privacy protection by applying the Laplace mechanism.

Step 614, the probability distribution of the root node in the original data set is calculated, and generated data for the root node are sampled from this distribution; then, layer by layer from top to bottom, the joint distribution probability of each node and its parent is calculated from the parent node's currently generated data, and noisy data are generated for each node by random sampling according to that joint distribution probability. The result is a generated data set that contains noise and satisfies differential privacy protection.
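The mutual information used in steps 602 and 605 can be estimated from a data set as in the following Python sketch. It assumes that empirical frequencies stand in for the probabilities and that the natural logarithm is used (the source does not state the base); the function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information, in nats, between two equal-length value sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each observed (x, y) combination
    count_x = Counter(xs)
    count_y = Counter(ys)
    total = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) * p(y)) written in counts to avoid extra divisions.
        total += p_xy * math.log(c * n / (count_x[x] * count_y[y]))
    return total


if __name__ == "__main__":
    smoker = [0, 1, 0, 1, 1, 0, 1, 0]
    cough = [0, 1, 0, 1, 0, 0, 1, 1]
    print(mutual_information(smoker, cough))
```

Only value combinations that actually occur contribute to the sum, which matches the usual convention that terms with p(x, y) = 0 are treated as zero.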

Example two

In this embodiment, a detailed description is given using the example of K financial institutions (K ≥ 2) jointly publishing, under differential privacy, the financial information of a group of users:

Implementation environment: the K financial institutions hold different types of financial information about users, for example banks holding deposit information and securities institutions holding stock information. Each of the K financial institutions deploys the data set generating apparatus (which may be a database, a big data platform, a server or the like) and uses it directly to publish data under differential privacy protection. As shown in FIG. 7, this embodiment includes the following steps:

step 701, the host of each financial institution sends heartbeat information to other institutions through the network, and registers as a data publishing participant.

Step 702, each financial institution independently and randomly generates non-repetitive integer numbers as unique IDs, each financial institution issues the number of the source machines and the IDs of the local machines of the heartbeats received by the local machine to the next financial institution, and the data issuing process starts under the condition that all the financial institutions do not detect contradictory information.

Step 703, the financial institutions communicate with one another to negotiate the parameter configuration, such as the uniform coding standard, the privacy parameter setting, the maximum number of variable groups and the discretization method, and broadcast the negotiated parameter configuration to all data publishing participants.

Step 704, each financial institution takes the original data set out of its local data warehouse and runs the uniform coding submodule, the data filling submodule, the discretization submodule and the binarization submodule in sequence to obtain a regular binary explicit-variable data set.

Step 705, each financial institution runs the variable grouping submodule, combines the explicit variables pairwise to form a set of explicit variable pairs, and accesses the data warehouse to calculate the mutual information between each pair of explicit variables; the explicit variables are then grouped, while satisfying differential privacy protection, by using the exponential mechanism of differential privacy.
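The pairwise mutual information calculation in this step can be illustrated with a minimal sketch. It assumes the explicit variables are already available as equal-length lists of 0/1 values; the column names and the helper names `mutual_information` and `pairwise_mi` are hypothetical and do not come from the embodiment. The sketch computes plain empirical mutual information and does not itself add any privacy protection.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

def pairwise_mi(columns):
    """Mutual information for every unordered pair of named columns."""
    return {(a, b): mutual_information(columns[a], columns[b])
            for a, b in combinations(sorted(columns), 2)}

# Hypothetical binarized explicit variables for four users.
columns = {
    "deposit_high": [0, 0, 1, 1],
    "loan_overdue": [0, 0, 1, 1],
    "holds_stock":  [0, 1, 0, 1],
}
scores = pairwise_mi(columns)
```

In this toy data the first two columns are identical, so their mutual information equals log 2, while `deposit_high` and `holds_stock` are independent and score 0.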

Step 706, each financial institution independently runs the hidden variable generation submodule, optimizes the maximum likelihood estimate of the explicit variable distribution by means of Lagrange multipliers, and iterates until the distribution of the generated hidden variables is stable overall.

Step 707, each financial institution independently runs the tree index construction submodule to generate a hidden variable index tree.

Step 708, after completing the above operations, each financial institution sends heartbeat information to the other financial institutions to announce that it has finished constructing its local tree index, and waits for the other institutions to finish constructing their tree indexes.

Step 709, after all tree indexes have been constructed, the financial institutions are combined pairwise to form financial institution pairs; the calculation order of the financial institution pairs and the maximum number of other data owners with which a single data owner can communicate are broadcast, and after all financial institutions have confirmed the order and parameters over the network, the subsequent calculation is performed in the negotiated calculation order.
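One way to picture the pairing and ordering described in this step is the following sketch. It assumes a simple greedy schedule in which every round lets each owner take part in at most `max_links` pairs; the function name `schedule_pairs`, the parameter `max_links` and the example IDs are hypothetical, and the embodiment does not specify how the order is actually chosen.

```python
from itertools import combinations

def schedule_pairs(owner_ids, max_links):
    """Group all unordered owner pairs into rounds.

    Within a round, no owner appears in more than max_links pairs.
    """
    if max_links < 1:
        raise ValueError("max_links must be at least 1")
    pending = list(combinations(sorted(owner_ids), 2))
    rounds = []
    while pending:
        load = {owner: 0 for owner in owner_ids}
        this_round, later = [], []
        for a, b in pending:
            if load[a] < max_links and load[b] < max_links:
                this_round.append((a, b))
                load[a] += 1
                load[b] += 1
            else:
                later.append((a, b))  # postponed to a later round
        rounds.append(this_round)
        pending = later
    return rounds

# Four institutions identified by hypothetical random integer IDs.
rounds = schedule_pairs([17, 4, 42, 8], max_links=1)
```

Because the schedule is derived only from the sorted IDs and the negotiated limit, every participant that runs the same function obtains the same order, which would make the result easy to confirm over the network.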

Step 710, each pair of financial institutions jointly runs a secure multi-party computation protocol, runs the tree index matching submodule under the encryption of that protocol, and calculates the association strength between hidden variables while satisfying privacy protection.

Step 711, after a financial institution has finished all tree index matching in which it takes part, it sends heartbeat information to the other financial institutions to announce that it has completed tree index matching, and waits for the other institutions to complete tree index matching.

Step 712, the financial institutions jointly run the secure protocol and, under the encryption of the secure multi-party computation protocol, broadcast the previously calculated mutual information between hidden variable pairs to all other data owners, until each data owner locally stores the same, complete association strength information between leaf-level hidden variable pairs.

Step 713, each financial institution independently runs the hidden tree structure learning submodule: using the maximum spanning tree construction method, with the leaf-level hidden variables and the explicit variables as nodes and the calculated association strength between variables as the weight of the corresponding connecting edge, it constructs the acyclic connected graph with the minimum weight sum. A root node is then randomly selected for the acyclic connected graph, and a parent-child relationship is determined for the node pair joined by each connecting edge according to the path-length distance between the root node and each node, thereby obtaining the hidden tree structure.
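The structure learning in this step can be sketched as a spanning tree construction followed by rooting. The sketch below uses Kruskal's algorithm, which the embodiment does not name, and follows the maximum-spanning-tree reading in which the strongest associations are kept; if edge weights were instead defined so that smaller is better, the sort order would be reversed. Node names, weights and the function names are hypothetical.

```python
from collections import deque

def maximum_spanning_tree(nodes, weights):
    """Kruskal's algorithm over edges sorted by descending weight."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = []
    for (u, v), _w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it closes no cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree

def orient(tree_edges, root):
    """Return {node: parent}, assigning parents by breadth-first distance from root."""
    adjacency = {}
    for u, v in tree_edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    parent_of = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, []):
            if v not in parent_of:
                parent_of[v] = u  # v is one step farther from the root than u
                queue.append(v)
    return parent_of

nodes = ["h1", "h2", "h3", "x1"]
weights = {("h1", "h2"): 0.9, ("h2", "h3"): 0.7, ("h1", "h3"): 0.2,
           ("h3", "x1"): 0.5, ("h1", "x1"): 0.1}
tree = maximum_spanning_tree(nodes, weights)
parents = orient(tree, root="h2")
```

Here the root is fixed to `h2` for reproducibility; a random choice, as the step describes, could be made with `random.choice(nodes)`.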

Step 714, each financial institution independently runs the hidden tree parameter learning submodule and, following the generated hidden tree structure from top to bottom, applies the Laplace mechanism to each pair of connected parent and child nodes to calculate the conditional probability between the parent and child nodes while satisfying differential privacy protection.
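A minimal sketch of the Laplace mechanism applied to one parent-child pair follows. It assumes that both nodes are available as 0/1 columns, that noise is added to the four joint counts, and that the count sensitivity is 1; none of these details is fixed by the embodiment, and the function names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_conditional(parent_col, child_col, epsilon, rng, sensitivity=1.0):
    """Estimate P(child | parent) for binary columns from noisy joint counts.

    Returns cpt[parent_value][child_value].
    """
    counts = {(p, c): 0 for p in (0, 1) for c in (0, 1)}
    for p, c in zip(parent_col, child_col):
        counts[(p, c)] += 1
    scale = sensitivity / epsilon  # assumed sensitivity; see the lead-in
    # Clamp at zero so that the noisy counts remain valid weights.
    noisy = {k: max(v + laplace_noise(scale, rng), 0.0) for k, v in counts.items()}
    cpt = {}
    for p in (0, 1):
        total = noisy[(p, 0)] + noisy[(p, 1)]
        if total > 0:
            cpt[p] = {c: noisy[(p, c)] / total for c in (0, 1)}
        else:
            cpt[p] = {0: 0.5, 1: 0.5}  # fall back to uniform if nothing survives
    return cpt

rng = random.Random(0)
cpt = noisy_conditional([0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0], epsilon=1.0, rng=rng)
```

Normalizing after the noise is added keeps each row a valid probability distribution; the normalization is post-processing and so does not consume additional privacy budget.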

Step 715, each financial institution independently runs the data generation submodule, calculates the probability distribution of the root node in the original data set, and samples a generated data set corresponding to the root node according to that probability distribution; then, layer by layer from top to bottom, it calculates for each node the joint distribution probability of the parent node and the child node according to the data currently generated for the parent node, and generates noise-containing data for each node according to the joint distribution probability and a random distribution, thereby obtaining a generated data set that contains noise and satisfies differential privacy protection.
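The top-down generation in this step can be sketched as ancestral sampling over the hidden tree. The sketch assumes binary nodes and works with the conditional probability P(child = 1 | parent value), which is the joint probability divided by the parent marginal; the names `sample_records`, `p_one_given` and the toy tree are hypothetical.

```python
import random

def sample_records(order, parent_of, root_prob, p_one_given, n, rng):
    """Generate n records by sampling each node after its parent.

    order        nodes listed so that every parent precedes its children
    parent_of    {node: parent or None}
    root_prob    P(root = 1)
    p_one_given  {node: {parent_value: P(node = 1 | parent_value)}}
    """
    records = []
    for _ in range(n):
        row = {}
        for node in order:
            parent = parent_of[node]
            if parent is None:
                p_one = root_prob
            else:
                p_one = p_one_given[node][row[parent]]
            row[node] = 1 if rng.random() < p_one else 0
        records.append(row)
    return records

# Toy tree: hidden root h with two explicit children.
order = ["h", "x1", "x2"]
parent_of = {"h": None, "x1": "h", "x2": "h"}
p_one_given = {"x1": {0: 0.0, 1: 1.0},   # x1 always copies h
               "x2": {0: 1.0, 1: 0.0}}   # x2 always negates h
rows = sample_records(order, parent_of, 0.5, p_one_given, n=20, rng=random.Random(1))
```

In a published data set only the explicit columns would normally be kept; the hidden column `h` is shown here only to make the dependency visible.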

Step 716, after finishing data generation, each financial institution sends heartbeat information to the other financial institutions and waits for all financial institutions to finish data generation; the generated data set is broadcast to any financial institution that cannot complete data generation.

Example three

In this embodiment, a large enterprise deploys the devices among multiple internal departments in order to publish data satisfying differential privacy protection to operation-and-maintenance personnel and outsourced personnel.

Description of the implementation environment: different departments in the enterprise hold data on different aspects of the same group of individuals; the data resides in the data warehouse on each department's server and is isolated from the external network and from the networks of other departments. In this implementation, the data set generating devices (which may be databases, big data platforms, servers, etc.) communicate with each other via an intranet. As shown in fig. 8, the present embodiment includes the following steps:

Step 801, the data set generation device deployed in each department opens a port to listen for desensitization requests from outside; each data source is bound to a data set generation device.

Step 802, the enterprise operation-and-maintenance personnel and outsourced personnel submit a request to the enterprise data service providing department, asking for data that comes from the same group of users but belongs to different departments.

Step 803, the data service providing department authenticates the applicant, determines the data sources to which the requested data belongs, and sends a desensitization request to the data set generation device bound to each such data source.

Step 804, the data set generation devices involved communicate with one another to negotiate the parameter configuration, such as the uniform coding standard, the privacy parameter setting, the maximum number of variable groups and the discretization method, and broadcast the negotiated parameter configuration to all data publishing participants.

Step 805, each device calls an embedded UDF (user defined function) provided by the local data warehouse to perform code conversion, missing data filling, discretization and binarization on the original data, and stores the result in a temporary table.
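The transformations named in this step can be illustrated outside the data warehouse with a short Python sketch. It is only a stand-in for the embedded UDFs, whose interface the embodiment does not describe; the default fill value, the cut points and the one-hot form of binarization are assumptions, and all function names are hypothetical.

```python
import bisect

def fill_missing(values, default):
    """Replace missing entries (None) with a default value."""
    return [default if v is None else v for v in values]

def discretize(values, cut_points):
    """Map each number to a bin index; cut_points must be ascending.

    A value equal to a cut point falls into the upper bin.
    """
    return [bisect.bisect_right(cut_points, v) for v in values]

def binarize(codes, n_levels):
    """Expand integer codes into one 0/1 column per level (one-hot encoding)."""
    return [[1 if code == level else 0 for code in codes]
            for level in range(n_levels)]

# Hypothetical account balances with one missing value.
balances = [120.0, None, 9800.0, 43000.0]
filled = fill_missing(balances, default=0.0)
codes = discretize(filled, cut_points=[1000.0, 10000.0])
binary_columns = binarize(codes, n_levels=3)
```

Each resulting 0/1 column plays the role of one binary explicit variable in the later grouping and mutual information steps.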

Step 806, each device runs the variable grouping submodule, combines the explicit variables pairwise to form a set of explicit variable pairs, and accesses the data warehouse to calculate the mutual information between each pair of explicit variables; the explicit variables are then grouped, while satisfying differential privacy protection, by using the exponential mechanism of differential privacy.
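The differentially private selection used for grouping can be sketched with the standard exponential mechanism, which picks a candidate with probability proportional to exp(ε · score / (2 · Δ)). The sketch assumes the candidates are explicit variable pairs scored by mutual information and that the score sensitivity Δ is supplied by the caller; the embodiment does not state the sensitivity or the exact grouping rule, and the names below are hypothetical.

```python
import math
import random

def exponential_mechanism(candidates, score, epsilon, sensitivity, rng):
    """Pick one candidate with probability proportional to exp(eps*score/(2*sens))."""
    scores = [score(c) for c in candidates]
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical mutual information scores for three explicit variable pairs.
pair_scores = {("x1", "x2"): 0.62, ("x1", "x3"): 0.05, ("x2", "x3"): 0.11}
chosen = exponential_mechanism(list(pair_scores), pair_scores.get,
                               epsilon=0.5, sensitivity=1.0,
                               rng=random.Random(7))
```

A small ε makes the choice close to uniform, while a large ε makes the highest-scoring pair almost certain to be chosen, which is the usual trade-off between privacy and utility.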

Step 807, each device independently runs the hidden variable generation submodule, optimizes the maximum likelihood estimate of the explicit variable distribution by means of Lagrange multipliers, and iterates until the distribution of the generated hidden variables is stable overall.

Step 808, each device independently runs the tree index construction submodule to generate a hidden variable index tree.

Step 809, after completing the above steps, each device sends heartbeat information to the other departments to announce that its department has finished constructing the local tree index, and waits for the other departments to finish constructing their tree indexes.

Step 810, after all departments have completed tree index construction, the departments are combined pairwise to form department pairs; parameters such as the calculation order of the department pairs and the maximum number of other data owners with which a single data owner can communicate are broadcast, and after all departments have communicated to confirm the order and parameters, the subsequent calculation is performed in the negotiated calculation order.

Step 811, each department pair jointly runs a secure multi-party computation protocol, runs the tree index matching submodule under the encryption of that protocol, and calculates the association strength between hidden variables while satisfying privacy protection.

Step 812, after a device has finished all tree index matching in which it takes part, it sends heartbeat information to the other devices to announce that it has completed tree index matching, and waits for the other devices to complete tree index matching.

Step 813, the devices jointly run the secure protocol and, under the encryption of the secure multi-party computation protocol, broadcast the previously calculated mutual information between hidden variable pairs to all other devices, until each device locally stores the same, complete association strength information between leaf-level hidden variable pairs.

Step 814, each device independently runs the hidden tree structure learning submodule: using the maximum spanning tree construction method, with the leaf-level hidden variables and the explicit variables as nodes and the calculated association strength between variables as the weight of the corresponding connecting edge, it constructs the acyclic connected graph with the minimum weight sum. A root node is then randomly selected for the acyclic connected graph, and a parent-child relationship is determined for the node pair joined by each connecting edge according to the path-length distance between the root node and each node, thereby obtaining the hidden tree structure.

Step 815, each device independently runs the hidden tree parameter learning submodule and, following the generated hidden tree structure from top to bottom, applies the Laplace mechanism to each pair of connected parent and child nodes to calculate the conditional probability between the parent and child nodes while satisfying differential privacy protection.

Step 816, each device independently runs the data generation submodule, calculates the probability distribution of the root node in the original data set, and samples a generated data set corresponding to the root node according to that probability distribution; then, layer by layer from top to bottom, it calculates for each node the joint distribution probability of the parent node and the child node according to the data currently generated for the parent node, and generates noise-containing data for each node according to the joint distribution probability and a random distribution, thereby obtaining a generated data set that contains noise and satisfies differential privacy protection.

Step 817, after completing the data generation, each device sends a heartbeat message to other devices, waits for all devices to complete the data generation, and broadcasts the generated data set to the devices that cannot complete the data generation.

Step 818, the generated data set is cached locally for the next query; optionally, the generated data is transmitted to the data service providing department.

Step 819, after the data service providing department verifies the data, the data is sent to the data requester.

Example four

In this embodiment, after an enterprise collects and purchases data of different aspects of the same group of users from multiple channels, the data is desensitized before being stored in a data warehouse, so as to avoid legal disputes that may be brought by data buying and selling and data collection.

Description of the implementation environment: data of the same group of users from multiple channels enters, in the form of data streams, the data set generation device (which may be a database, a big data platform, a server, etc.) described in the foregoing embodiments; after the device desensitizes the data, the data set satisfying differential privacy protection is stored in the data warehouse. As shown in fig. 9, the present embodiment includes the following steps:

Step 901, each data source is bound to a data publishing device, and the devices exchange heartbeat information to confirm that every device is running properly and that the data flow is normal.

Step 902, each data publishing device reads a segment of its data stream, of the same length on every device, into memory, and then temporarily suspends the bound data stream.

Step 903, each device checks the integrity and validity of the data it has read; if abnormal data is found, the current batch is discarded and the next batch of stream data is read in.
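The batch reading and checking described in the two preceding steps can be sketched as follows. The sketch assumes the stream is a Python iterator of dictionaries and treats a batch as valid when it has the full length and every required field is present; the actual integrity and validity rules are not given in the embodiment, and the function and field names are hypothetical.

```python
from itertools import islice

def read_batch(stream, size):
    """Read up to `size` records from an iterator, leaving the rest unread."""
    return list(islice(stream, size))

def is_valid(batch, size, required_fields):
    """A batch is valid if it is full and no required field is missing."""
    return len(batch) == size and all(
        record.get(field) is not None
        for record in batch for field in required_fields)

def next_valid_batch(stream, size, required_fields):
    """Discard abnormal batches until a valid one is found or the stream ends."""
    while True:
        batch = read_batch(stream, size)
        if not batch:
            return None  # stream exhausted
        if is_valid(batch, size, required_fields):
            return batch

records = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": None, "amount": 5.0},   # abnormal record: ID missing
    {"user_id": 3, "amount": 7.5},
    {"user_id": 4, "amount": 2.0},
]
batch = next_valid_batch(iter(records), size=2, required_fields=["user_id", "amount"])
```

In this example the first batch contains the record with a missing ID and is dropped as a whole, so the function returns the second batch.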

Step 904, the data sources communicate with one another and determine the size of the privacy parameter and its allocation strategy according to the size of the data stream.

Step 905, each device runs the uniform coding submodule, the data filling submodule, the discretization submodule and the binarization submodule in sequence to preprocess the stream data held in memory.

Step 906, each device runs the variable grouping submodule, combines the explicit variables pairwise to form a set of explicit variable pairs, and accesses the data warehouse to calculate the mutual information between each pair of explicit variables; the explicit variables are then grouped, while satisfying differential privacy protection, by using the exponential mechanism of differential privacy.

Step 907, each device independently runs the hidden variable generation submodule, optimizes the maximum likelihood estimate of the explicit variable distribution by means of Lagrange multipliers, and iterates until the distribution of the generated hidden variables is stable overall.

Step 908, each device independently runs the tree index construction submodule to generate a hidden variable index tree.

Step 909, after completing the above steps, each device sends heartbeat information to the other devices to announce that it has finished constructing its local tree index, and waits for the other devices to finish constructing their tree indexes.

Step 910, after all devices have completed tree index construction, the devices are combined pairwise to form device pairs; the calculation order of the device pairs and the maximum number of other devices with which a single device can communicate are broadcast, and after all devices have confirmed the order and parameters, the subsequent calculation is performed in the negotiated calculation order.

Step 911, each pair of devices jointly runs a secure multi-party computation protocol, runs the tree index matching submodule under the encryption of that protocol, and calculates the association strength between hidden variables while satisfying privacy protection.

Step 912, after a device has finished all tree index matching in which it takes part, it sends heartbeat information to the other devices to announce that it has completed tree index matching, and waits for the other devices to complete tree index matching.

Step 913, all devices jointly run the secure protocol and, under the encryption of the secure multi-party computation protocol, broadcast the previously calculated mutual information between hidden variable pairs to all other devices, until each device locally stores the same, complete association strength information between leaf-level hidden variable pairs.

Step 914, each device independently runs the hidden tree structure learning submodule: using the maximum spanning tree construction method, with the leaf-level hidden variables and the explicit variables as nodes and the calculated association strength between variables as the weight of the corresponding connecting edge, it constructs the acyclic connected graph with the minimum weight sum. A root node is then randomly selected for the acyclic connected graph, and a parent-child relationship is determined for the node pair joined by each connecting edge according to the path-length distance between the root node and each node, thereby obtaining the hidden tree structure.

Step 915, each device independently runs the hidden tree parameter learning submodule and, following the generated hidden tree structure from top to bottom, applies the Laplace mechanism to each pair of connected parent and child nodes to calculate the conditional probability between the parent and child nodes while satisfying differential privacy protection.

Step 916, each device independently runs the data generation submodule, calculates the probability distribution of the root node in the original data set, and samples a generated data set corresponding to the root node according to that probability distribution; then, layer by layer from top to bottom, it calculates for each node the joint distribution probability of the parent node and the child node according to the data currently generated for the parent node, and generates noise-containing data for each node according to the joint distribution probability and a random distribution, thereby obtaining a generated data set that contains noise and satisfies differential privacy protection.

Step 917, after completing data generation, a device sends a message to the other devices announcing that it has completed data generation; on receiving this message, the other devices immediately stop their data publishing process, so as to minimize the time consumed.

Step 918, the device that first completes data generation verifies the data and then stores it in the data warehouse.

Step 919, each device reads the next batch from its data stream and repeats the above steps to desensitize the data before it is stored.

The embodiment of the invention also provides a storage medium. In the present embodiment, the storage medium may be configured to store program codes for performing the steps of the foregoing embodiments. In the present embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk, which can store program codes.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of generating a data set, comprising:

each data owner of multi-party vertically split data obtains the mutual information of pairs of explicit variables in its local original data set to generate leaf-level hidden variables;

each data owner establishes a tree index locally, and the data owners are combined pairwise to form a data owner pair to perform matching of the tree index and calculation of mutual information between hidden variables of leaf layers;

each data owner executes hidden tree structure learning and hidden tree parameter learning, and respectively generates a hidden tree locally;

and each data owner generates a target data set from top to bottom according to the learned hidden tree structure and hidden tree parameters.

2. The data set generation method of claim 1, wherein each data owner obtains mutual information of a pair of explicit variables in a local original data set to generate leaf-level implicit variables, and the method comprises the following steps:

preprocessing the original data set to obtain a regular explicit-variable data set, wherein the preprocessing comprises at least one of the following: unified coding, missing data filling, discretization and binarization;

combining the explicit variables in the explicit-variable data set pairwise to form explicit variable pairs, calculating the mutual information between each pair of explicit variables, and generating leaf-level hidden variables while satisfying differential privacy protection.

3. The method of claim 2, wherein each data owner locally builds a tree index, and the data owners combine to form a pair of data owners, and performing matching of the tree index and calculation of mutual information between hidden variables at leaf level comprises:

for each data owner, combining the leaf-level hidden variables pairwise to form hidden variable pairs, and calculating mutual information between the leaf-level hidden variables of each pair;

based on mutual information among the leaf-layer hidden variables, grouping the leaf-layer hidden variables under the condition of meeting differential privacy protection to generate upper-layer hidden variables, and repeating the step of generating the upper-layer hidden variables by the leaf-layer hidden variables from bottom to top until the upper-layer hidden variables only have one hidden variable node;

taking the hidden variable nodes as root nodes, forming a tree index by the root nodes, connecting edges among parent and child nodes and all the hidden variable nodes together, and storing the tree index in the local of a data owner;

the data owners are combined pairwise to form a data owner pair, and negotiation parameters are transmitted between the data owners, wherein the parameters comprise at least one of the following parameters: data owner combination pairing, execution order of subsequent calculations by the data owner, maximum number of data owners a single data owner can communicate with other data owners simultaneously;

each pair of data owners runs hidden variable mutual information calculation based on tree index matching under the condition of encryption of a secure multiparty calculation protocol;

and under the condition that a plurality of data owners encrypt in a multi-party security computing protocol, broadcasting mutual information between the calculated hidden variable pairs to all other data owners until each data owner locally stores the same and complete association strength between the hidden variable pairs.

4. The data set generation method of claim 3, wherein each data owner performs hidden tree structure learning and hidden tree parameter learning, each locally generating a hidden tree, comprising:

each data owner independently runs the maximum spanning tree construction method locally, taking the leaf-level hidden variables and the explicit variables as nodes and the association strength between variables as the weight of the corresponding connecting edge, to construct the acyclic connected graph with the minimum weight sum;

and selecting a root node for the acyclic connected graph, and determining a parent-child relationship for the node pair joined by each connecting edge according to the path length to the root node, to obtain a hidden tree structure.

5. The data set generation method of claim 4, wherein each data owner generates a target data set from the learned hidden tree structure and parameters from top to bottom, comprising:

under the condition that the differential privacy protection is met, each data owner calculates the conditional probability between each pair of mutually connected parent-child nodes from top to bottom according to the generated hidden tree structure;

and each data owner calculates the probability distribution of the root node in an original data set, extracts a generated data set corresponding to the root node according to the probability distribution, then calculates the joint distribution probability of the parent-child nodes for each node layer by layer from top to bottom, generates data containing noise for each node according to the joint distribution probability and random distribution, and generates a target data set.

6. The data set generation method of claim 1, further comprising, after the each data owner generates the target data set from the learned hidden tree structure and parameters from top to bottom:

and the data owner completing the generation of the target data set sends a message to other data owners, waits for all the data owners to complete the generation of the target data set, and broadcasts the generated target data set to the data owners incapable of completing the generation of the target data set.

7. A data set generation apparatus, comprising:

the hidden variable generation module is used for each data owner of multi-party vertically split data to obtain the mutual information of pairs of explicit variables in its local original data set and generate leaf-level hidden variables;

the mutual information calculation module is used for establishing a tree index for each data owner locally, combining the data owners in pairs to form a data owner pair, and performing matching of the tree index and calculation of mutual information between hidden variables;

the hidden tree generation module is used for executing hidden tree structure learning and hidden tree parameter learning for each data owner and respectively generating a hidden tree locally;

and the data set generation module is used for generating a target data set for each data owner from top to bottom according to the learned hidden tree structure and hidden tree parameters.

8. The data set generation apparatus of claim 7, wherein the hidden variable generation module comprises:

the data preprocessing submodule is used for preprocessing the original data set to obtain a regular explicit-variable data set, wherein the preprocessing comprises at least one of the following: unified coding, missing data filling, discretization and binarization;

and the hidden variable generation submodule is used for combining the explicit variables in the explicit-variable data set pairwise to form explicit variable pairs, calculating the mutual information between each pair of explicit variables, and generating leaf-level hidden variables while satisfying differential privacy protection.

9. The data set generation apparatus of claim 8, wherein the mutual information calculation module comprises:

the hidden variable pair calculation submodule is used for combining the hidden variables of the leaf layers pairwise to form a hidden variable pair for each data owner and calculating the mutual information between each pair of hidden variables;

the upper-layer hidden variable generation submodule is used for grouping the leaf-layer hidden variables under the condition of meeting the differential privacy protection based on the mutual information among the leaf-layer hidden variables to generate upper-layer hidden variables, and repeating the step of generating the upper-layer hidden variables from the leaf-layer hidden variables from bottom to top until the upper-layer hidden variables only have one hidden variable node;

the tree index construction module is used for taking the hidden variable nodes as root nodes, forming tree indexes by the root nodes, connecting edges among parent and child nodes and all the hidden variable nodes together, and storing the tree indexes in the local of a data owner;

a negotiation submodule, configured to combine the data owners two by two to form a data owner pair, where each data owner transmits a negotiation parameter, where the parameter includes at least one of: data owner combination pairing, execution order of subsequent calculations by the data owner, maximum number of data owners a single data owner can communicate with other data owners simultaneously;

the computing module is used for operating hidden variable mutual information computation based on tree index matching under the condition of encrypting a secure multiparty computing protocol for each pair of data owners;

and the broadcasting module is used for broadcasting the mutual information between the calculated hidden variable pairs to all other data owners by a plurality of data owners under the condition of encrypting the multi-party security calculation protocol until each data owner locally stores the same and complete correlation strength between the hidden variable pairs.

10. The data set generation apparatus of claim 9, wherein the hidden tree generation module comprises:

the acyclic connected graph construction submodule is used for each data owner to independently run the maximum spanning tree construction method locally, taking the leaf-level hidden variables and the explicit variables as nodes and the association strength between variables as the weight of the corresponding connecting edge, to construct the acyclic connected graph with the minimum weight sum;

and the hidden tree structure obtaining submodule is used for selecting a root node for the acyclic connected graph, and determining a parent-child relationship for the node pair joined by each connecting edge according to the path length to the root node, to obtain a hidden tree structure.

11. The data set generation apparatus of claim 7, wherein the data set generation module comprises:

the probability calculation module is used for calculating the conditional probability between each pair of mutually connected parent-child nodes from top to bottom for each data owner according to the generated hidden tree structure under the condition that the differential privacy protection is met;

and the data set generation submodule is used for calculating the probability distribution of the root node in the original data set for each data owner, extracting a generated data set corresponding to the root node according to the probability distribution, calculating the joint distribution probability of the parent-child nodes for each node layer by layer from top to bottom according to the current generated data of the parent node, generating data containing noise for each node according to the joint distribution probability and random distribution, and generating a target data set.

12. The data set generation apparatus of claim 7, further comprising:

and the publishing module is used for sending messages to other data owners for the data owners who finish the generation of the target data set, waiting for all the data owners to finish the generation of the target data set, and broadcasting the generated target data set to the data owners who cannot finish the generation of the target data set.

13. A data set generating system comprising a plurality of data set generating apparatuses according to any one of claims 7 to 12, wherein each data set generating apparatus corresponds to data processing of a data owner, and all the data set generating apparatuses are connected via a network.

14. A storage medium, comprising a stored program, wherein the program when executed performs the method of any one of claims 1 to 6.

15. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 6.