CN110610098B - Data set generation method and device


Info

Publication number
CN110610098B
Authority
CN
China
Prior art keywords
data
hidden
data set
tree
variables
Prior art date
Legal status
Active
Application number
CN201810615202.5A
Other languages
Chinese (zh)
Other versions
CN110610098A (en)
Inventor
牛家浩
申山宏
王德政
程祥
苏森
唐朋
邵华西
Current Assignee
ZTE Corp
Original Assignee
ZTE Corp
Priority date
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN201810615202.5A priority Critical patent/CN110610098B/en
Priority to PCT/CN2019/084345 priority patent/WO2019237840A1/en
Publication of CN110610098A publication Critical patent/CN110610098A/en
Application granted granted Critical
Publication of CN110610098B publication Critical patent/CN110610098B/en

Classifications

    • G06F21/6227 - Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database, where protection concerns the structure of data, e.g. records, types, queries
    • G06F21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6245 - Protecting personal data, e.g. for financial or medical purposes
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a data set generation method and device, wherein the method comprises the following steps: each data owner of the multiparty vertically partitioned data calculates the mutual information of the apparent variable pairs of its local original data set to generate leaf layer hidden variables; each data owner locally builds a tree index, the data owners are combined two by two to form data owner pairs, and matching of the tree indexes and calculation of the mutual information between leaf layer hidden variables are carried out; each data owner performs hidden tree structure learning and hidden tree parameter learning, each generating a hidden tree locally; each data owner generates a target data set from top to bottom based on the learned hidden tree structure and hidden tree parameters.

Description

Data set generation method and device
Technical Field
The invention relates to the field of data security, in particular to a data set generation method and device.
Background
With the rapid development of digital technologies such as smart cities, smart grids and smart healthcare, and the wide adoption of mobile terminal devices, information about people's daily life and medical care is being digitized, massive amounts of data are generated every day, and the era of big data has arrived. This data is often held by different data owners; for example, hospitals and financial institutions respectively hold medical data and financial data for the same individuals. When data distributed among multiple parties shares the same record IDs but contains different attributes, it is called multiparty vertically partitioned data (also referred to as vertically divided or vertically split data). Publishing such multiparty vertically partitioned data allows a data analyst to fully analyze and mine the potential value in the data. However, vertically partitioned data often contains a large amount of sensitive personal information, and publishing it directly inevitably leaks individual privacy.
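For illustration only (the record IDs, attribute names and values below are hypothetical and not part of the original disclosure), a minimal sketch of what multiparty vertically partitioned data looks like:

```python
# Illustrative only: two parties hold different attributes for the same IDs.

# Party A: a hospital holding medical attributes for a set of individuals.
hospital_data = {
    "id":        [1001, 1002, 1003],
    "diagnosis": ["flu", "diabetes", "flu"],
    "age_group": ["30-40", "50-60", "20-30"],
}

# Party B: a bank holding financial attributes for the *same* individuals.
bank_data = {
    "id":      [1001, 1002, 1003],
    "income":  ["medium", "high", "low"],
    "default": [0, 0, 1],
}

# Joining the two tables on "id" would reconstruct the full record of each
# individual, which is why publishing either table directly (or their join)
# risks leaking sensitive personal information.
```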
The differential privacy protection model provides a feasible scheme for privacy-preserving data publication. Unlike anonymity-based privacy protection models, the differential privacy protection model provides a strict, quantifiable privacy guarantee, and the strength of the privacy protection it provides is independent of the background knowledge held by an attacker.
Currently, in the single-party scenario, the technique of publishing private data through a Bayesian network (Private Data Release via Bayesian Networks, PrivBayes) solves the problem of publishing data under differential privacy: a Bayesian network is first constructed from the original data, noise is then added to the constructed Bayesian network so that it satisfies the differential privacy protection requirement, and finally the noisy Bayesian network is used to generate new data for publication. However, PrivBayes cannot be used in multiparty scenarios, because the algorithm itself is designed for single-party data.
In the multiparty scenario, the existing method for publishing vertically partitioned data under differential privacy protection (DistDiffGen) can only publish the statistical information required to construct a decision tree classifier, so it is a data publication method bound to a specific data analysis task. At present, the vertically partitioned data publication methods satisfying differential privacy that are available in practice can only be applied to decision-tree-based classification tasks, and are not applicable to other types of classification tasks, clustering tasks, statistical analysis tasks, or other data analysis and mining tasks.
Disclosure of Invention
The embodiment of the invention provides a data set generation method and device, which at least solve the problem of privacy protection of data in the related technology.
According to an aspect of the present invention, there is provided a data set generation method, including: each data owner obtains the mutual information of the apparent variable pairs of its local original data set and generates leaf layer hidden variables; each data owner locally establishes a tree index, the data owners are combined two by two to form data owner pairs, and matching of the tree indexes and calculation of the mutual information between leaf layer hidden variables are carried out; each data owner performs hidden tree structure learning and hidden tree parameter learning, and hidden trees are generated locally; each data owner generates the target data set from top to bottom according to the learned hidden tree structure and hidden tree parameters.
According to another aspect of the present invention, there is provided a data set generating apparatus, comprising: a hidden variable generation module, configured to obtain, for each data owner, the mutual information of the apparent variable pairs of the local original data set and to generate leaf layer hidden variables; a mutual information calculation module, configured to locally establish a tree index for each data owner, combine the data owners two by two to form data owner pairs, and carry out matching of the tree indexes and calculation of the mutual information between leaf layer hidden variables; a hidden tree generating module, configured to perform hidden tree structure learning and hidden tree parameter learning for each data owner, each generating a hidden tree locally; and a data set generation module, configured to generate, for each data owner, a target data set from top to bottom according to the learned hidden tree structure and hidden tree parameters.
According to yet another aspect of the present invention, there is also provided a data set generating system, which includes a plurality of data set generating devices in the foregoing embodiments, wherein each data set generating device corresponds to data processing of one data owner, and all the data set generating devices are connected through a network.
According to a further aspect of the present invention, there is also provided a storage medium having a computer readable program stored therein, wherein the program when run performs the method steps of the previous embodiments.
In the embodiments of the invention, a hidden tree model is adopted to model the distribution of the data sets vertically partitioned among a plurality of data owners, and the noise-containing data sets are jointly published according to the learned hidden tree model. This reduces the amount of added noise, satisfies the differential privacy requirement on the published data set during the publication of multiparty vertically partitioned data, and at the same time allows the published data as a whole to support a variety of data analysis tasks.
Drawings
FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the invention;
FIG. 2 is a flow chart of a data set generation method according to an embodiment of the invention;
FIG. 3 is a flow chart of a method of data distribution for multiparty vertical partitioning in accordance with an embodiment of the present invention;
FIG. 4 is a block diagram of a data set generating apparatus according to an embodiment of the present invention;
FIG. 5 is a block diagram of a data distribution device for multiparty vertical partitioning according to an embodiment of the present invention;
FIG. 6 is a flow chart of a method according to a first embodiment of the invention;
FIG. 7 is a flow chart of a method according to a second embodiment of the invention;
FIG. 8 is a flow chart of a method according to a third embodiment of the invention;
FIG. 9 is a flow chart of a method according to a fourth embodiment of the invention.
Detailed Description
The present invention will be described in detail hereinafter with reference to the accompanying drawings in combination with examples. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
Through the following embodiments, the invention provides a multiparty vertically partitioned data publication method that satisfies differential privacy protection and is independent of any specific data analysis task. In a big data environment, during the publication of multiparty vertically partitioned data, the method satisfies the differential privacy requirement on the published data set while the published data as a whole can support a variety of data analysis tasks. A data analyst can therefore fully analyze and mine the value in the data while individual privacy is protected, providing a stronger basis for decision support and scientific research.
In the embodiments of the present invention, a data owner does not refer to a specific person, but to any of the parties holding the multiparty vertically partitioned data, and may be any data processing apparatus that processes such data, such as a database, a big data platform, or a server, each holding its own data (i.e., data stored in a data warehouse or database).
Fig. 1 shows a system architecture according to an embodiment of the present invention. As shown in fig. 1, multiparty vertically partitioned data (e.g., medical data or financial data) having the same IDs but containing different attributes is distributed over three different data owners. In this embodiment, the data owners may be servers, so server 1, server 2 and server 3 each represent a different data owner. Server 1, server 2 and server 3 are connected through a wired or wireless network. The form and topology of the network connecting server 1, server 2 and server 3 are not limited in this embodiment; they depend mainly on the geographical distribution of the data owners and on actual needs, and the network may be, for example, a local area network, the Internet, or another private network. Over the connected network, the servers can send heartbeat information to one another, register as participants in multiparty vertically partitioned data publication, and publish the generated multiparty vertically partitioned data sets.
By operating the technical scheme provided by the embodiment of the invention on the system architecture shown in fig. 1, the requirement of differential privacy of the issued data set is met in the process of issuing the multiparty vertical split data, and meanwhile, the issued whole data can support various data analysis tasks.
In this embodiment, a data set generating method is provided, which may be implemented based on the system architecture of the foregoing embodiment. Fig. 2 is a flowchart of a data set generation method according to an embodiment of the present invention. In this embodiment, a plurality of data owners are included, and as shown in fig. 2, the process includes the following steps:
In step S202, each data owner calculates the mutual information of the apparent variable pairs of its local original data set to generate leaf layer hidden variables.
In step S204, each data owner locally builds a tree index, the data owners are combined two by two to form data owner pairs, and matching of the tree indexes and calculation of the mutual information between leaf layer hidden variables are carried out.
Step S206, each data owner performs hidden tree structure learning and hidden tree parameter learning, and each generates a hidden tree locally.
Step S208, each data owner generates a target data set from top to bottom according to the learned hidden tree structure and hidden tree parameters.
In the above-described embodiments, the distribution of the data sets vertically partitioned among the plurality of data owners is modeled using the hidden tree model, and the noisy data sets are jointly published according to the learned hidden tree model, so that the amount of added noise is reduced to the greatest extent while the published data satisfies differential privacy.
The invention also provides another embodiment of a multiparty vertically partitioned data publication method that satisfies differential privacy protection and is independent of any specific data analysis task. As shown in fig. 3, the method comprises the following steps:
step S301, carrying out unified coding, missing data filling, discretization and binarization on the original data set to obtain a regular variable data set.
Step S302, combining the display variables in pairs to form a group of display variable pairs, and accessing data to calculate mutual information between each pair of display variables.
Step S303, generating leaf layer hidden variables under the condition of meeting differential privacy protection by using the differential privacy exponential mechanism.
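The patent does not give step S303 in code form; the following is a minimal Python sketch of the standard differential-privacy exponential mechanism that such a grouping step can rely on. The utility function, the epsilon value and the grouping loop hinted at in the comment are illustrative assumptions.

```python
import math
import random

def exponential_mechanism(candidates, utility, epsilon, sensitivity):
    """Standard differential-privacy exponential mechanism: select one
    candidate with probability proportional to
    exp(epsilon * utility(candidate) / (2 * sensitivity))."""
    weights = [math.exp(epsilon * utility(c) / (2.0 * sensitivity))
               for c in candidates]
    total = sum(weights)
    r = random.uniform(0.0, total)
    acc = 0.0
    for candidate, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return candidate
    return candidates[-1]

# Hypothetical use for step S303: when growing a group of apparent variables,
# noisily pick the next variable to add, scoring each remaining variable by
# its mutual information with the variables already in the group.
```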
Step S304, for each data owner, combining leaf layer hidden variables in pairs to form hidden variable pairs, and accessing hidden variable data to calculate mutual information between each pair of hidden variables.
Step S305, based on the calculated mutual information between the leaf layer hidden variables, grouping the leaf layer hidden variables under the condition of meeting differential privacy protection, and generating an upper layer hidden variable.
Step S306, repeating the step of generating the upper-layer hidden variables from the lower-layer hidden variables from bottom to top until the upper layer has only one hidden variable node, which is recorded as the root node; the hidden variable nodes, together with the edges connecting parent and child nodes, form a tree index that is stored locally at the data owner.
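A minimal sketch of the bottom-up tree index construction of steps S304 to S306, under simplifying assumptions: the grouping here is purely positional rather than driven by noisy mutual information, and no differential privacy noise is added, so the sketch only illustrates the data structure, not the privacy mechanism.

```python
class IndexNode:
    """One node of the tree index that a data owner keeps locally."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []   # edges to child nodes

def build_tree_index(leaf_names, group_size=2):
    """Repeatedly merge groups of the current layer into new parent nodes,
    bottom-up, until a single root remains.  Here the grouping is purely
    positional; the patented method instead groups by (noisy) mutual
    information under differential privacy protection."""
    layer = [IndexNode(name) for name in leaf_names]
    level = 0
    while len(layer) > 1:
        level += 1
        layer = [IndexNode(f"H{level}_{i // group_size}", layer[i:i + group_size])
                 for i in range(0, len(layer), group_size)]
    return layer[0]                      # the root node of the tree index

root = build_tree_index(["Y1", "Y2", "Y3", "Y4", "Y5"])
```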
In step S307, the data owners are combined two by two to form data owner pairs, and each data owner then transmits negotiated parameters including, but not limited to, the pairing status of the data owners, the execution order of the data owners for subsequent calculation, the maximum number of other data owners that a single data owner can communicate with at the same time, and so on.
Step S308, for each pair of data owners, running a secure multiparty computation protocol together, and running the two-party hidden-variable mutual information calculation method based on tree index matching under the encryption of the secure multiparty computation protocol. Each data owner may communicate with a plurality of other data owners simultaneously and perform the above calculations in parallel. The plurality of data owners jointly run a security protocol, and the mutual information between hidden variable pairs calculated before is broadcast to all other data owners under the encryption of the multiparty secure computation protocol, until each data owner locally stores the same, complete association strengths between pairs of leaf layer hidden variables.
In step S309, the multiple data owners independently run the maximum spanning tree construction method locally, and construct a loop-free connected graph of maximum total weight by using the leaf layer hidden variables and the apparent variables as nodes and the calculated association strength between the variables as the weight of the corresponding connecting edges. A root node is randomly selected for the loop-free connected graph, and a parent-child relationship is determined for the node pair connected by each connecting edge according to the path-length distance from the root node, so as to obtain the hidden tree structure.
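A minimal local sketch of step S309, assuming the pairwise association strengths have already been collected: a maximum spanning tree is built with Kruskal's algorithm and then oriented from a randomly chosen root. The variable names, toy weights and union-find details are illustrative.

```python
import random
from collections import defaultdict, deque

def maximum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm over edges sorted by descending weight, so the
    selected acyclic connected graph has maximal total weight."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    tree_edges = []
    for u, v, w in sorted(weighted_edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree_edges.append((u, v))
    return tree_edges

def orient_tree(nodes, tree_edges):
    """Pick a random root and assign parent/child roles by breadth-first
    traversal, i.e. by path-length distance from the root."""
    adjacency = defaultdict(list)
    for u, v in tree_edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    root = random.choice(list(nodes))
    parent_of, visited, queue = {root: None}, {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in visited:
                visited.add(v)
                parent_of[v] = u
                queue.append(v)
    return root, parent_of

# Example with hypothetical association strengths between variables:
variables = ["H1", "H2", "X1", "X2"]
edges = [("H1", "H2", 0.9), ("H1", "X1", 0.4), ("H2", "X2", 0.7), ("X1", "X2", 0.1)]
root, parent_of = orient_tree(variables, maximum_spanning_tree(variables, edges))
```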
Step S310, according to the generated hidden tree structure, calculating, from top to bottom for each pair of interconnected parent-child nodes, the conditional probability between the parent and child nodes under the condition of meeting differential privacy protection by applying the Laplace mechanism.
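A minimal sketch of the Laplace step, assuming the parent-child contingency counts are available locally; the sensitivity value, the epsilon value and the toy counts are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def noisy_conditional_distribution(joint_counts, epsilon, sensitivity=1.0):
    """Add Laplace(sensitivity / epsilon) noise to the parent-child
    contingency counts, clip negative entries to zero, and renormalise each
    parent column so it is again a conditional distribution P(child | parent)."""
    noisy = joint_counts + np.random.laplace(0.0, sensitivity / epsilon,
                                             size=joint_counts.shape)
    noisy = np.clip(noisy, 0.0, None)
    column_sums = noisy.sum(axis=0, keepdims=True)
    column_sums[column_sums == 0.0] = 1.0          # guard against empty columns
    return noisy / column_sums                      # column j: P(child | parent=j)

# counts[i, j] = number of records with child = i and parent = j (toy values)
counts = np.array([[40.0, 5.0],
                   [10.0, 45.0]])
cpt = noisy_conditional_distribution(counts, epsilon=0.5)
```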
Step S311, calculating the probability distribution of the root node in the original data set, and drawing the generated data corresponding to the root node from this probability distribution; then, from top to bottom, calculating for each node the parent-child distribution according to the currently generated data of its parent node, and generating noise-containing data for each node by random sampling from this distribution, so as to obtain a generated data set containing noise that satisfies differential privacy protection.
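A minimal sketch of the top-down generation of step S311, assuming the tree orientation (`parent_of`), the noisy root marginal and the noisy conditional probability tables from the previous steps are already available; the function and variable names are hypothetical.

```python
import numpy as np

def node_depth(node, parent_of):
    """Path length from the node up to the root."""
    depth = 0
    while parent_of[node] is not None:
        node, depth = parent_of[node], depth + 1
    return depth

def generate_top_down(n_records, root, parent_of, root_dist, cpts, rng=None):
    """Draw root values from the (noisy) root marginal, then walk the tree
    top-down and draw every other node from P(node | parent) conditioned on
    the value already generated for its parent.
    cpts[node][:, parent_value] is the noisy conditional distribution."""
    rng = rng or np.random.default_rng()
    data = {root: rng.choice(len(root_dist), size=n_records, p=root_dist)}
    for node in sorted(parent_of, key=lambda n: node_depth(n, parent_of)):
        if node == root:
            continue
        parent_values = data[parent_of[node]]
        cpt = cpts[node]             # shape: (node_cardinality, parent_cardinality)
        data[node] = np.array([rng.choice(cpt.shape[0], p=cpt[:, pv])
                               for pv in parent_values])
    return data                      # one generated column per node
```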
In the above embodiments of the invention, on the one hand, the differential privacy model, the leading model in the data privacy field, is used to provide differential privacy protection for the data jointly generated by the data owners in the multiparty joint publication process, ensuring that data privacy is strictly protected. On the other hand, the hidden tree model is adopted to model the distribution of the data sets vertically partitioned among the plurality of data owners, and the noisy data sets are jointly published according to the learned hidden tree model, so that the amount of added noise is reduced to the greatest extent while the published data satisfies differential privacy, improving the utility of the published data and ensuring the quality of the overall data service. In addition, in these embodiments, the tree-index-based hidden variable mutual information calculation method is used to skip the calculation of mutual information between hidden variables with weak association strength, thereby reducing the number of noise additions and the communication overhead while comprehensively using all parties' data to provide a high-quality service.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The embodiment also provides a data set generating device, which is used for implementing the foregoing embodiments and preferred embodiments, and is not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementations in hardware, or a combination of software and hardware, are also contemplated.
Fig. 4 is a block diagram of a data set generating apparatus according to an embodiment of the present invention, and as shown in fig. 4, the apparatus includes an hidden variable generating module 10, a mutual information calculating module 20, a hidden tree generating module 30, and a data set generating module 40.
The hidden variable generation module 10 is configured to calculate mutual information of a pair of apparent variables in the local original dataset for each data owner, and generate leaf layer hidden variables. The mutual information calculation module 20 is configured to locally establish a tree index for each data owner, combine the data owners two by two to form a data owner pair, and perform matching of the tree index and calculation of mutual information between leaf layer hidden variables. The hidden tree generating module 30 is configured to perform hidden tree structure learning and hidden tree parameter learning for each of the data owners, each of which generates a hidden tree locally. The data set generation module 40 is configured to generate a target data set for each data owner from top to bottom based on the learned hidden tree structure and hidden tree parameters.
In the above embodiment, the data set generating means may be plural, each data set generating means corresponds to processing of data of one data owner, and all the data owners may be connected via a network.
The embodiment of the invention also provides another data set generating device. The device may be a data processing apparatus such as a database, a big data platform, or a server. Note that, although it is equally applicable to publishing vertically partitioned data under differential privacy, in this embodiment the names and functional division of the modules differ from those of the above-described embodiment. As shown in fig. 5, the functions implemented by each module in this embodiment are as follows:
the data preprocessing module 50 mainly performs operations such as coding statistics, data cleaning, missing data filling, discretization, binarization and the like on the original data. The data preprocessing module 50 may further specifically include a data encoding sub-module 51, a data populating sub-module 52, a discretizing sub-module 53, and a binarizing sub-module 54.
Specifically, the data encoding submodule 51 is configured to perform unified encoding of a plurality of data source encodings. The data populating sub-module 52 is used to complete the cleansing and population of the original data. The discretization submodule 53 is used to map continuous data or discrete type values into discrete numbers. The binarization sub-module 54 is used for converting the discrete variable into binary variables with values 0, 1.
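As a rough illustration of the discretization and binarization sub-modules (the bin count, the equal-width binning rule and the one-hot style of binarization are assumptions; the patent does not fix a particular scheme):

```python
import numpy as np

def discretize(values, n_bins=4):
    """Map continuous values to bin indices 0 .. n_bins-1 using equal-width bins."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

def binarize(discrete_values, n_categories):
    """Expand one discrete column into 0/1 indicator columns (one-hot coding)."""
    discrete_values = np.asarray(discrete_values, dtype=int)
    indicators = np.zeros((len(discrete_values), n_categories), dtype=int)
    indicators[np.arange(len(discrete_values)), discrete_values] = 1
    return indicators

ages = [23, 37, 55, 61, 44]              # toy continuous attribute
age_bins = discretize(ages, n_bins=4)    # discrete bin index per record
age_bits = binarize(age_bins, 4)         # 5 x 4 matrix of 0/1 values
```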
The association information collection module 60 is used for generating hidden variables for each data owner, and constructing tree indexes for tree index matching. The association information acquisition module 60 may further specifically include an hidden variable generation sub-module 61, a tree index construction sub-module 62, and a tree index matching sub-module 63.
Specifically, the hidden variable generating sub-module 61 is configured to group the apparent variables under differential privacy and correspondingly generate leaf layer hidden variables. The tree index building sub-module 62 is configured to complete the bottom-up construction of the tree index structure under differential privacy protection. The tree index matching sub-module 63 performs top-down matching on the tree indexes, pruning the hidden-variable pairs whose mutual information does not need to be calculated, which reduces the communication overhead in the process of calculating the association strength of the hidden variables.
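A hypothetical sketch of the top-down matching idea, reusing the node shape of the earlier tree index sketch (nodes expose a `children` list). The `association` callable stands in for the association strength that would in practice be computed under the secure multiparty computation protocol, and the single-threshold pruning rule is an assumption.

```python
def match_tree_indexes(root_a, root_b, association, threshold):
    """Descend the two tree indexes top-down in parallel; only when the
    association between two nodes exceeds the threshold are their children
    explored, so leaf pairs under weakly associated subtrees are pruned and
    their mutual information is never requested."""
    candidate_leaf_pairs = []
    stack = [(root_a, root_b)]
    while stack:
        u, v = stack.pop()
        if association(u, v) < threshold:
            continue                                  # prune this subtree pair
        if not u.children and not v.children:
            candidate_leaf_pairs.append((u, v))       # both are leaf hidden variables
        else:
            for cu in (u.children or [u]):
                for cv in (v.children or [v]):
                    if cu is not u or cv is not v:    # avoid re-pushing (u, v)
                        stack.append((cu, cv))
    return candidate_leaf_pairs
```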
The model learning and construction module 70 is used for learning the hidden tree model according to mutual information between hidden variables. The model learning and construction module 70 may further include a loop-free connected graph construction sub-module 71, a hidden tree structure learning sub-module 72, and a hidden tree parameter learning sub-module 73.
Specifically, the acyclic connected graph construction submodule 71 uses the leaf layer hidden variables as nodes and the mutual information between the variables as the weights of the connecting edges to construct an acyclic connected graph of maximum total weight. The hidden tree structure learning sub-module 72 mainly completes the construction of the hidden tree structure. The hidden tree parameter learning submodule 73 mainly completes the learning of the conditional distribution parameters between the parent and child nodes in the hidden tree structure.
The data distribution module 80 generates each record in the composite dataset from top to bottom starting from the root node according to the learned conditional distribution parameters between the hidden tree structure and the parent-child nodes.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; alternatively, the modules are located in a plurality of processors, respectively.
Example 1
In this embodiment, an example in which K different specialty hospitals (K >= 2) jointly publish medical data on their respective devices is described in detail:
description of the implementation environment: the K different special hospitals serve as K data owners, the data set generating devices (a database, a large data platform, a server and the like can be respectively deployed), and the data set generating devices are used for generating the data sets meeting the differential privacy protection.
As shown in fig. 6, this embodiment includes the steps of:
step 601, independently carrying out unified coding, missing data filling, discretization and binarization on a local data set by K hospitals; the conversion relationship of the original data set to the corresponding digits of the binary data set is stored.
Step 602, combining the display variables in pairs to form a group of display variable pairs, and accessing data to calculate mutual information between each pair of display variables; the method for calculating the mutual information comprises the following steps:
I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
where p(x) and p(y) are the marginal probabilities that X = x and Y = y, respectively, and p(x, y) is the joint probability that X = x and Y = y.
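A minimal Python sketch of this mutual information calculation for two discrete columns held by one data owner (the toy values are illustrative):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) of two discrete columns,
    computed directly from the formula above."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Toy example: two binarized apparent variables held by one hospital.
X = [0, 0, 1, 1, 1, 0, 1, 0]
Y = [0, 0, 1, 1, 0, 0, 1, 1]
print(mutual_information(X, Y))
```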
Step 603, grouping the local apparent variables under the condition of meeting differential privacy protection by using the differential privacy exponential mechanism, wherein each group contains as many variables as possible but does not exceed the preset maximum group size.
Step 604, for each local apparent variable group, generating a hidden variable for the group, so as to obtain a leaf layer hidden variable data set.
Step 605, for each hospital, combining the leaf layer hidden variables in pairs to form hidden variable pairs, and accessing the hidden variable data to calculate the mutual information between each pair of hidden variables; the calculation method is as in step 602.
Step 606, grouping the leaf layer hidden variables under the condition of meeting differential privacy protection based on the calculated mutual information between the leaf layer hidden variables, and generating an upper layer hidden variable.
Step 607, repeating the step of generating the upper-layer hidden variables from the lower-layer hidden variables from bottom to top until the upper layer has only one hidden variable node, which is recorded as the root node; the hidden variable nodes, together with the edges connecting parent and child nodes, form a tree index that is stored locally at the data owner.
Step 608, the hospitals are combined two by two to form data owner pairs, and each data owner then communicates negotiated parameters, including, but not limited to, the pairing of the data owners, the order of execution of subsequent computations by the data owners, the maximum number of other data owners that a single data owner can communicate with at the same time, and so on.
Step 609, for each pair of hospitals, running a secure multiparty computing protocol together, and running a two-party hidden variable mutual information computing method based on tree index matching under the encryption condition of the secure multiparty computing protocol. For each hospital, it is possible to communicate with a plurality of other hospitals simultaneously, with the above calculations being performed simultaneously.
Step 610, running a security protocol together by the plurality of hospitals, and sending the mutual information between hidden variable pairs calculated before to the other hospitals under the encryption of the multiparty secure computation protocol, until each hospital locally stores the same, complete association strengths between pairs of leaf layer hidden variables.
In step 611, the multiple hospitals independently run the maximum spanning tree construction method locally, and construct a loop-free connected graph of maximum total weight by taking the leaf layer hidden variables and apparent variables as nodes and the calculated association strength between the variables as the weight of the corresponding connecting edges.
Step 612, randomly selecting a root node for the loop-free connected graph, and determining a parent-child relationship for the node pair connected by each connecting edge according to the path length distance between the root node and the node pair, thereby obtaining the hidden tree structure.
Step 613, calculating the conditional probability between the parent node and the child node under the condition of meeting the differential privacy protection by applying a laplace mechanism for each pair of the interconnected parent-child nodes from top to bottom according to the generated hidden tree structure.
Step 614, calculating probability distribution of the root node in the original data set, and extracting a generated data set corresponding to the root node according to the probability distribution; and then calculating the joint distribution probability of the parent-child nodes for each node from top to bottom according to the current generated data of the parent node, and generating noise-containing data for each node according to the joint distribution probability and the random distribution. A generated dataset is obtained that contains noise that satisfies differential privacy protection.
Example two
In this embodiment, a detailed description is given by taking as an example K financial institutions (K >= 2) jointly publishing a group of users' financial information under differential privacy:
description of the implementation environment: the K financial institutions are institutions with different types of financial information of users, such as banks with deposit information, stock institutions with stock information and the like. The K financial institutions deploy the data set generating devices (such as databases, large data platforms, servers and the like) respectively, and directly conduct data release work meeting differential privacy protection by means of the devices. As shown in fig. 7, this embodiment includes the steps of:
In step 701, the host computer of each financial institution sends heartbeat information to other institutions through the network, and registers as data release participants.
In step 702, each financial institution independently generates a random, non-repeating integer as its unique ID; each financial institution sends its own ID, together with the IDs of the sources of the heartbeats it has received, to the next financial institution, and when no contradictory information is detected by any financial institution, the data publication process starts.
Step 703, each financial institution communicates, negotiates the parameter configuration such as unified coding standard, privacy parameter setting, maximum number of variable groups, discretization method, and the like, and broadcasts the negotiated parameter configuration to all data release participants.
Step 704, each financial institution takes out the original data set from the local data warehouse, and sequentially operates the unified coding sub-module, the data filling sub-module, the discretization sub-module and the binarization sub-module to obtain a regular binary apparent variable data set.
Step 705, each financial institution runs the variable grouping submodule, combines the apparent variables two by two to form a group of apparent variable pairs, accesses the data warehouse to calculate the mutual information between each pair of apparent variables, and groups the apparent variables under the condition of meeting differential privacy protection by using the differential privacy exponential mechanism.
Step 706, each financial institution independently runs the hidden variable generation submodule, optimizing the maximum likelihood estimate of the apparent variable distribution by using Lagrange multipliers and iterating several times until the overall distribution of the generated hidden variables is stable.
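The patent does not spell out the update equations of step 706. One common way to realise "maximum likelihood optimized with Lagrange multipliers, iterated until the hidden-variable distribution is stable" is an expectation-maximization loop for a latent class variable; the sketch below makes that assumption (one binary hidden variable over binary apparent variables) and uses the closed-form M-step that the Lagrange-multiplier normalization constraints yield. No differential privacy noise is included here.

```python
import numpy as np

def em_latent_class(X, n_iter=50, tol=1e-6, seed=0):
    """EM for one binary hidden variable H with binary apparent variables
    X[:, j].  The M-step uses the closed-form maximisers obtained when the
    likelihood is maximised subject to normalisation constraints (Lagrange
    multipliers); iteration stops when P(H = 1) stabilises."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = 0.5                                      # P(H = 1)
    theta = rng.uniform(0.3, 0.7, size=(d, 2))    # theta[j, h] = P(X_j = 1 | H = h)
    prev_pi = None
    for _ in range(n_iter):
        # E-step: responsibility r[i] = P(H = 1 | x_i)
        log1 = np.log(pi) + (X * np.log(theta[:, 1]) +
                             (1 - X) * np.log(1 - theta[:, 1])).sum(axis=1)
        log0 = np.log(1 - pi) + (X * np.log(theta[:, 0]) +
                                 (1 - X) * np.log(1 - theta[:, 0])).sum(axis=1)
        r = 1.0 / (1.0 + np.exp(log0 - log1))
        # M-step: closed-form updates of pi and theta
        pi = float(np.clip(r.mean(), 1e-6, 1 - 1e-6))
        theta[:, 1] = (r @ X) / max(r.sum(), 1e-12)
        theta[:, 0] = ((1 - r) @ X) / max((1 - r).sum(), 1e-12)
        theta = np.clip(theta, 1e-6, 1 - 1e-6)
        if prev_pi is not None and abs(pi - prev_pi) < tol:
            break
        prev_pi = pi
    return pi, theta, (r > 0.5).astype(int)       # hidden-variable column for the data
```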
Step 707, each financial institution independently runs the tree index building sub-module to generate the hidden variable index tree.
Step 708, after finishing the above steps, each financial institution sends heartbeat information to the other financial institutions, declaring that it has completed the construction of its local tree index, and waits for the other institutions to complete the construction of their tree indexes.
Step 709, after all the tree indexes are built, the financial institutions are combined in pairs to form a financial institution pair, the calculation orders of the financial institution pairs and the maximum number of other data owners which can be communicated simultaneously by a single data owner are mutually broadcast, and after all the financial institutions confirm the related orders and parameters through a network, the subsequent calculation is performed according to the negotiated calculation orders.
Step 710, for each pair of financial institutions, running a secure multiparty computing protocol together, running a tree index matching sub-module under encryption of the secure multiparty computing protocol, and calculating association strength between hidden variables under the condition of meeting privacy protection.
Step 711, after completing the matching of all tree indexes involving itself, each financial institution sends heartbeat information to the other financial institutions, declaring that it has completed the tree index matching, and waits for the other institutions to finish their tree index matching.
Step 712, the financial institutions run a security protocol together, and the mutual information between hidden variable pairs calculated before is broadcast to all other data owners under the encryption of the multiparty secure computation protocol, until each data owner locally stores the same, complete association strength information between pairs of leaf layer hidden variables.
Step 713, each financial institution independently runs the hidden tree structure learning submodule and, using the maximum spanning tree construction method with the leaf layer hidden variables and apparent variables as nodes and the calculated association strength between the variables as the weight of the corresponding connecting edges, constructs a loop-free connected graph of maximum total weight. A root node is randomly selected for the loop-free connected graph, and a parent-child relationship is determined for the node pair connected by each connecting edge according to the path-length distance from the root node, so as to obtain the hidden tree structure.
Step 714, each financial institution independently runs the hidden tree parameter learning submodule and, according to the generated hidden tree structure, calculates, from top to bottom for each pair of interconnected parent-child nodes, the conditional probability between the parent and child nodes under the condition of meeting differential privacy protection by using the Laplace mechanism.
Step 715, each financial institution independently operates the data generation submodule, calculates probability distribution of the root node in the original data set, and extracts the generated data set corresponding to the root node according to the probability distribution; and then calculating the joint distribution probability of the parent-child nodes for each node from top to bottom according to the current generated data of the parent node, and generating noise-containing data for each node according to the joint distribution probability and the random distribution. A generated dataset is obtained that contains noise that satisfies differential privacy protection.
Step 716, each financial institution sends heartbeat information to the other financial institutions after completing data generation, waits for all financial institutions to complete data generation, and broadcasts the generated data set to any financial institution that was unable to complete data generation.
Example III
In this embodiment, a large enterprise deploys the device across multiple internal departments to publish data satisfying differential privacy protection to operations and maintenance staff and outsourced personnel.
Description of the implementation environment: different departments within the enterprise have different levels of data for the same batch of individuals, and the data is located on a data warehouse of an internal server of each department and isolated from the external network and other department networks. In this embodiment, the data set generating devices (which may be databases, large data platforms, servers, etc.) communicate with each other via an intranet. As shown in fig. 8, the present embodiment includes the steps of:
Step 801, the data set generating device deployed in each department opens a port and listens for desensitization requests from the outside; each data source is bound to a data set generating device.
Step 802, enterprise operation and maintenance personnel and outsourcers submit requests to enterprise data service providing departments, requesting data belonging to different departments, originating from the same group of users.
Step 803, the data service providing department authenticates the applicant, analyzes the data source to which the requested data belongs, and sends a desensitization request to the data set generating device bound with the data source.
Step 804, the related multiple data set generating devices communicate, negotiate parameter configurations such as unified coding standard, privacy parameter setting, maximum number of variable groups, discretization method and the like, and broadcast the negotiated parameter configurations to all data release participants.
Step 805, each device invokes an embedded UDF (user-defined function) provided by the local data repository to transcode, fill in missing data, discretize, and binarize the original data, and stores the result in the temporary table.
Step 806, each device runs the variable grouping submodule, combines the apparent variables two by two to form a group of apparent variable pairs, accesses the data warehouse to calculate the mutual information between each pair of apparent variables, and groups the apparent variables under the condition of meeting differential privacy protection by using the differential privacy exponential mechanism.
And step 807, each device independently operates the hidden variable generation submodule, and the Lagrange multiplier is utilized to optimize the maximum likelihood estimation of the apparent variable distribution, and the iteration is repeated for a plurality of times until the generated hidden variable overall distribution is stable.
Step 808, each device independently runs the tree index building sub-module to generate the hidden variable index tree.
And step 809, after each device runs the steps, sending heartbeat information to other departments, declaring that the departments have completed the construction of the local tree-like index, and waiting for the other departments to complete the construction of the tree-like index.
Step 810, after all departments complete the tree index construction, the departments are combined in pairs to form department pairs, parameters such as the calculation orders of the multiple department pairs and the maximum number of other data owners which can be communicated simultaneously by a single data owner are mutually broadcast, and after all departments confirm related orders and parameters, subsequent calculation is performed according to the negotiated calculation orders.
And 811, running a secure multiparty computing protocol together for each pair of departments, running a tree index matching sub-module under the encryption of the secure multiparty computing protocol, and calculating the association strength between hidden variables under the condition of meeting privacy protection.
Step 812, after each device runs all the tree indexes related to itself, sending heartbeat information to other devices, declaring that the device has completed the tree index matching, and waiting for the other devices to complete the tree index matching.
Step 813, the devices run a security protocol together, and the mutual information between hidden variable pairs calculated before is broadcast to all other devices under the encryption of the multiparty secure computation protocol, until each device locally stores the same, complete association strength information between pairs of leaf layer hidden variables.
Step 814, each device independently runs the hidden tree structure learning submodule and, using the maximum spanning tree construction method with the leaf layer hidden variables and apparent variables as nodes and the calculated association strength between the variables as the weight of the corresponding connecting edges, constructs a loop-free connected graph of maximum total weight. A root node is randomly selected for the loop-free connected graph, and a parent-child relationship is determined for the node pair connected by each connecting edge according to the path-length distance from the root node, so as to obtain the hidden tree structure.
Step 815, each device independently runs the hidden tree parameter learning submodule and, according to the generated hidden tree structure, calculates, from top to bottom for each pair of interconnected parent-child nodes, the conditional probability between the parent and child nodes under the condition of meeting differential privacy protection by using the Laplace mechanism.
Step 816, each device independently operates a data generation sub-module, calculates probability distribution of the root node in the original data set, and extracts the generated data set corresponding to the root node according to the probability distribution; and then calculating the joint distribution probability of the father-son nodes for each node layer by layer from top to bottom according to the current generated data of the father node, and generating noise-containing data for each node according to the joint distribution probability and the random distribution to obtain a noise-containing generated data set meeting the differential privacy protection.
Step 817, each device sends a heartbeat message to other devices after completing data generation, waits for all devices to complete data generation, and broadcasts the generated data set to devices that cannot complete data generation.
Step 818, the generated data set is cached locally for the next query, and optionally transmitted by the device to the data service providing department.
And step 819, after the data service providing department verifies the data, the data is sent to the data requester.
Example IV
In this embodiment, after the enterprise collects and purchases data of different aspects of the same batch of users from multiple channels, the data is desensitized before being stored in the data warehouse, so as to avoid legal disputes that may be caused by data buying and selling and data collection.
Description of the implementation environment: data from the same batch of users of multiple channels enters the data set generating device (which can be a database, a large data platform, a server and the like) in the previous embodiment in the form of data streams, and after the data set is desensitized by the device, the data set meeting differential privacy protection is stored in a data warehouse. As shown in fig. 9, the present embodiment includes the steps of:
step 901, binding each data source with a data release device, and mutually transmitting heartbeat information among the devices to confirm that the running state of the device is good and the data flow is normal.
Step 902, each data issuing device reads the data stream with the same length into a memory; the data read-in of the bound data stream is then temporarily blocked.
Step 903, each device performs integrity and validity check on the data stream; if abnormal data occurs, the batch data is abandoned, and the next batch data is read continuously.
Step 904, each data source communicates with each other, and the size of the privacy parameter and the allocation policy are determined according to the size of the data stream.
In step 905, each device sequentially runs a unified coding sub-module, a data filling sub-module, a discretization sub-module and a binarization sub-module, and preprocesses stream data stored in a memory.
Step 906, each device runs the variable grouping submodule, combines the apparent variables two by two to form a group of apparent variable pairs, accesses the data warehouse to calculate the mutual information between each pair of apparent variables, and groups the apparent variables under the condition of meeting differential privacy protection by using the differential privacy exponential mechanism.
And step 907, each device independently operates the hidden variable generation submodule, and the Lagrange multiplier is utilized to optimize the maximum likelihood estimation of the apparent variable distribution, and the iteration is repeated for a plurality of times until the generated hidden variable overall distribution is stable.
Step 908, each device independently runs the tree index building sub-module to generate the hidden variable index tree.
Step 909, after each device runs out of the steps, sending heartbeat information to other devices, declaring that the device has completed the local tree index construction, and waiting for the other devices to complete the construction of the tree index.
Step 910, after all devices complete tree index construction, the devices are combined in pairs to form a device pair, parameters such as calculation orders of the device pairs and the maximum number of other devices that can simultaneously communicate with the single device are mutually broadcast, and after all devices confirm related orders and parameters, subsequent calculation is performed according to negotiated operation orders.
Step 911, for each pair of devices, running a secure multiparty computing protocol together, running a tree index matching sub-module under the encryption of the secure multiparty computing protocol, and calculating the association strength between hidden variables under the condition of meeting privacy protection.
Step 912, after each device runs all the tree indexes related to itself, sending heartbeat information to other devices, declaring that the device has completed matching the tree indexes, and waiting for the other devices to complete matching the tree indexes.
Step 913, the devices run a security protocol together, and the mutual information between hidden variable pairs calculated before is broadcast to all other devices under the encryption of the multiparty secure computation protocol, until each device locally stores the same, complete association strength information between pairs of leaf layer hidden variables.
Step 914, each device independently runs the hidden tree structure learning submodule and, using the maximum spanning tree construction method with the leaf layer hidden variables and apparent variables as nodes and the calculated association strength between the variables as the weight of the corresponding connecting edges, constructs a loop-free connected graph of maximum total weight. A root node is randomly selected for the loop-free connected graph, and a parent-child relationship is determined for the node pair connected by each connecting edge according to the path-length distance from the root node, so as to obtain the hidden tree structure.
In step 915, each device independently runs the hidden tree parameter learning submodule, and according to the generated hidden tree structure, the top-down is for each pair of interconnected father-son nodes, and the laplace mechanism is used to calculate the conditional probability between the father-son nodes under the condition of meeting the differential privacy protection.
Step 916, each device independently operates a data generation sub-module, calculates probability distribution of the root node in the original data set, and extracts the generated data set corresponding to the root node according to the probability distribution; and then calculating the joint distribution probability of the parent-child nodes for each node from top to bottom according to the current generated data of the parent node, and generating noise-containing data for each node according to the joint distribution probability and the random distribution. A generated dataset is obtained that contains noise that satisfies differential privacy protection.
Step 917, each device sends a message to the other devices after completing data generation, declaring that it has completed data generation; once a device has completed its own data generation and has received such messages, it immediately ends the data publication process, so as to reduce the time consumed as much as possible.
Step 918, the device that first completes data generation performs data verification on the data, and then stores the data in a data warehouse.
Step 919, each device continues to read the next batch of data stream from the data stream, and continues to desensitize the data stream before warehousing according to the steps described above.
The embodiment of the invention also provides a storage medium. In this embodiment, the storage medium may be configured to store program code for performing the steps of the previous embodiments. In this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium capable of storing program code.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a memory device for execution by computing devices, and in some cases, steps shown or described may be performed in a different order than those illustrated herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module for implementation. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (13)

1. A data set generation method, comprising:
each data owner of multiparty vertically partitioned data acquires mutual information of apparent variable pairs in its local original data set and generates leaf-layer hidden variables;
each data owner locally establishes a tree index, the data owners are combined in pairs to form data owner pairs, and tree index matching and calculation of mutual information between leaf-layer hidden variables are carried out;
each data owner performs hidden tree structure learning and hidden tree parameter learning, and each generates a hidden tree locally;
each data owner generates a target data set from top to bottom according to the learned hidden tree structure and hidden tree parameters;
wherein each data owner locally establishing a tree index, the data owners being combined in pairs to form data owner pairs, and tree index matching and calculation of mutual information between leaf-layer hidden variables being carried out comprises: for each data owner, combining the leaf-layer hidden variables in pairs to form hidden variable pairs, and calculating mutual information between each pair of leaf-layer hidden variables; grouping the leaf-layer hidden variables, based on the mutual information among them and under the condition of satisfying differential privacy protection, to generate upper-layer hidden variables, and repeating this bottom-up generation of upper-layer hidden variables until only one hidden variable node remains at the upper layer; taking that hidden variable node as a root node, forming a tree index from the root node, the parent-child nodes, and the connecting edges among the hidden variable nodes, and storing the tree index locally at the data owner; combining the data owners in pairs to form data owner pairs and transmitting negotiation parameters among the data owners, wherein the parameters comprise at least one of the following: the pairing combinations of the data owners, the order in which the data owners execute subsequent computations, and the maximum number of other data owners with which a single data owner can communicate simultaneously; each pair of data owners running mutual information calculation over hidden variables based on tree index matching, under encryption by a secure multiparty computation protocol; and a plurality of the data owners broadcasting the calculated mutual information between hidden variable pairs to all other data owners under encryption by the secure multiparty computation protocol, until each data owner locally stores the same and complete association strengths between the hidden variable pairs.
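For context, the association strength used throughout claim 1 is the mutual information between pairs of discrete variables. The sketch below is a plain local plug-in estimate of pairwise mutual information; it deliberately omits the secure multiparty computation and differential privacy machinery recited in the claim, and every identifier in it is illustrative rather than taken from the patent.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats for two discrete variable columns."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1
    joint /= n                                   # joint distribution estimate
    px = joint.sum(axis=1, keepdims=True)        # marginal of X
    py = joint.sum(axis=0, keepdims=True)        # marginal of Y
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log(joint / (px * py))
    return float(np.nansum(terms))               # empty cells contribute nothing

# Association strength for every pair of locally held variables
columns = {"age_band": np.random.randint(0, 4, 500),
           "income_band": np.random.randint(0, 3, 500)}
names = list(columns)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(names[i], names[j],
              mutual_information(columns[names[i]], columns[names[j]]))
```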
2. The data set generating method according to claim 1, wherein each data owner acquiring mutual information of apparent variable pairs in its local original data set and generating leaf-layer hidden variables comprises:
preprocessing the original data set to obtain a regularized apparent-variable data set, wherein the preprocessing comprises at least one of the following: unified coding, missing data filling, discretization, and binarization;
and combining the apparent variables in the apparent-variable data set in pairs to form apparent variable pairs, calculating mutual information between each pair of apparent variables, and generating leaf-layer hidden variables under the condition of satisfying differential privacy protection.
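A hedged illustration of the preprocessing operations listed in claim 2 (unified coding, missing data filling, discretization, binarization), written with pandas; the column names, fill strategy, and bin count below are assumptions rather than values taken from the patent.

```python
import pandas as pd

def preprocess(df, numeric_cols, categorical_cols, n_bins=4):
    """Regularize a raw table: fill missing values, discretize numeric columns,
    and one-hot (binarize) categorical columns after unified integer coding."""
    out = df.copy()
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())           # missing data filling
        out[col] = pd.cut(out[col], bins=n_bins, labels=False)  # discretization
    for col in categorical_cols:
        out[col] = out[col].fillna("unknown").astype("category").cat.codes  # unified coding
    dummies = pd.get_dummies(out[categorical_cols].astype(str),
                             prefix=categorical_cols)            # binarization
    return pd.concat([out[numeric_cols], dummies], axis=1)

raw = pd.DataFrame({"age": [23, 41, None, 35], "city": ["A", None, "B", "A"]})
print(preprocess(raw, numeric_cols=["age"], categorical_cols=["city"]))
```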
3. The data set generating method according to claim 1, wherein each data owner performing hidden tree structure learning and hidden tree parameter learning and generating a hidden tree locally comprises:
each data owner independently runs a maximum spanning tree construction method locally, taking the leaf-layer hidden variables and apparent variables as nodes and the association strength between the variables as the weight of the corresponding connecting edge, to construct a loop-free connected graph with a minimum sum of edge weights;
and selecting a root node for the loop-free connected graph, and determining a parent-child relationship for the node pair connected by each connecting edge according to the length of the path to the root node, to obtain the hidden tree structure.
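The spanning-tree construction and rooting of claim 3 could look roughly like the following sketch: a Kruskal-style spanning tree over association-strength weights, followed by a breadth-first rooting that assigns parent-child directions by path length to the chosen root. Whether the weight sum is maximized or minimized depends on how the association strength is encoded, so the sketch exposes this as a flag; all identifiers are illustrative.

```python
from collections import defaultdict, deque

def spanning_tree(nodes, weighted_edges, maximize=True):
    """Kruskal-style spanning tree over (u, v, weight) edges using union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    tree = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2], reverse=maximize):
        ru, rv = find(u), find(v)
        if ru != rv:                        # adding this edge creates no cycle
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

def root_tree(tree_edges, root):
    """Turn the undirected spanning tree into parent -> children relations by
    breadth-first search from the root, i.e. by path length to the root."""
    adj = defaultdict(list)
    for u, v, _ in tree_edges:
        adj[u].append(v)
        adj[v].append(u)
    children, seen, queue = defaultdict(list), {root}, deque([root])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                children[node].append(nb)
                queue.append(nb)
    return dict(children)

edges = [("H1", "H2", 0.42), ("H1", "A", 0.30), ("H2", "B", 0.25), ("A", "B", 0.05)]
tree = spanning_tree({"H1", "H2", "A", "B"}, edges)
print(root_tree(tree, root="H1"))
```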
4. The data set generating method according to claim 3, wherein each data owner generating a target data set from top to bottom according to the learned hidden tree structure and parameters comprises:
under the condition of satisfying differential privacy protection, each data owner calculates, from top to bottom according to the generated hidden tree structure, the conditional probability between the parent and child nodes for each pair of interconnected parent-child nodes;
and each data owner calculates the probability distribution of the root node in the original data set, samples the generated data corresponding to the root node according to that probability distribution, calculates the joint distribution probability of the parent-child nodes for each node layer by layer from top to bottom, and generates noise-containing data for each node according to the joint distribution probability and a random distribution, thereby generating the target data set.
5. The data set generating method according to claim 1, further comprising, after each data owner generates the target data set from top to bottom according to the learned hidden tree structure and parameters:
a data owner that completes generation of the target data set sends a message to the other data owners, waits for all data owners to complete generation of the target data set, and broadcasts its generated target data set to any data owner that cannot complete the generation of the target data set.
6. A data set generating apparatus, comprising:
a hidden variable generation module, configured to acquire, for each data owner of multiparty vertically partitioned data, mutual information of apparent variable pairs in the local original data set and generate leaf-layer hidden variables;
a mutual information calculation module, configured to locally establish a tree index for each data owner, combine the data owners in pairs to form data owner pairs, and carry out tree index matching and calculation of mutual information between hidden variables;
a hidden tree generation module, configured to perform hidden tree structure learning and hidden tree parameter learning for each data owner and generate a hidden tree locally;
a data set generation module, configured to generate, for each data owner, a target data set from top to bottom according to the learned hidden tree structure and hidden tree parameters;
wherein the mutual information calculation module comprises: a hidden variable pair calculation submodule, configured to combine, for each data owner, the leaf-layer hidden variables in pairs to form hidden variable pairs and calculate mutual information between each pair of hidden variables; an upper-layer hidden variable generation submodule, configured to group the leaf-layer hidden variables, based on the mutual information among them and under the condition of satisfying differential privacy protection, to generate upper-layer hidden variables, and to repeat this bottom-up generation of upper-layer hidden variables until only one hidden variable node remains at the upper layer; a tree index construction module, configured to take that hidden variable node as a root node, form a tree index from the root node, the parent-child nodes, and the connecting edges among the hidden variable nodes, and store the tree index locally at the data owner; a negotiation submodule, configured to combine the data owners in pairs to form data owner pairs and transmit negotiation parameters among the data owners, wherein the parameters comprise at least one of the following: the pairing combinations of the data owners, the order in which the data owners execute subsequent computations, and the maximum number of other data owners with which a single data owner can communicate simultaneously; a computation module, configured to run, for each pair of data owners, mutual information calculation over hidden variables based on tree index matching, under encryption by a secure multiparty computation protocol; and a broadcasting module, configured to broadcast the calculated mutual information between hidden variable pairs to all other data owners under encryption by the secure multiparty computation protocol, until each data owner locally stores the same and complete association strengths between the hidden variable pairs.
7. The data set generating apparatus according to claim 6, wherein the hidden variable generating module comprises:
a data preprocessing submodule, configured to preprocess the original data set to obtain a regularized apparent-variable data set, wherein the preprocessing comprises at least one of the following: unified coding, missing data filling, discretization, and binarization;
and a hidden variable generation submodule, configured to combine the apparent variables in the apparent-variable data set in pairs to form apparent variable pairs, calculate mutual information between each pair of apparent variables, and generate leaf-layer hidden variables under the condition of satisfying differential privacy protection.
8. The data set generating apparatus of claim 6, wherein the hidden tree generating module comprises:
a loop-free connected graph construction submodule, configured to independently run, for each data owner, a maximum spanning tree construction method locally, taking the leaf-layer hidden variables and apparent variables as nodes and the association strength between the variables as the weight of the corresponding connecting edge, to construct a loop-free connected graph with a minimum sum of edge weights;
and a hidden tree structure acquisition submodule, configured to select a root node for the loop-free connected graph and determine a parent-child relationship for the node pair connected by each connecting edge according to the length of the path to the root node, to obtain the hidden tree structure.
9. The data set generating apparatus according to claim 6, wherein the data set generating module comprises:
a probability calculation module, configured to calculate, for each data owner and under the condition of satisfying differential privacy protection, the conditional probability between the parent and child nodes for each pair of interconnected parent-child nodes from top to bottom according to the generated hidden tree structure;
and a data set generation submodule, configured to calculate, for each data owner, the probability distribution of the root node in the original data set, sample the generated data corresponding to the root node according to that probability distribution, then calculate, layer by layer from top to bottom, the joint distribution probability of the parent-child nodes for each node based on the data already generated for the parent node, and generate noise-containing data for each node according to the joint distribution probability and a random distribution, to generate the target data set.
10. The data set generating apparatus according to claim 6, wherein the apparatus further comprises:
and a publishing module, configured to send a message to the other data owners upon completing generation of the target data set, wait for all data owners to complete generation of the target data set, and broadcast the generated target data set to any data owner that cannot complete the generation of the target data set.
11. A data set generating system, comprising a plurality of data set generating apparatuses according to any one of claims 6 to 10, wherein each data set generating apparatus handles the data processing of one data owner, and all data set generating apparatuses are connected via a network.
12. A storage medium comprising a stored program, wherein the program when run performs the method of any one of claims 1 to 5.
13. A processor for running a program, wherein the program when run performs the method of any one of claims 1 to 5.
CN201810615202.5A 2018-06-14 2018-06-14 Data set generation method and device Active CN110610098B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810615202.5A CN110610098B (en) 2018-06-14 2018-06-14 Data set generation method and device
PCT/CN2019/084345 WO2019237840A1 (en) 2018-06-14 2019-04-25 Data set generating method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810615202.5A CN110610098B (en) 2018-06-14 2018-06-14 Data set generation method and device

Publications (2)

Publication Number Publication Date
CN110610098A CN110610098A (en) 2019-12-24
CN110610098B true CN110610098B (en) 2023-05-30

Family

ID=68841920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810615202.5A Active CN110610098B (en) 2018-06-14 2018-06-14 Data set generation method and device

Country Status (2)

Country Link
CN (1) CN110610098B (en)
WO (1) WO2019237840A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765664B (en) * 2021-01-26 2022-12-27 河南师范大学 Safe multi-party k-means clustering method with differential privacy
CN113112090B (en) * 2021-04-29 2023-12-19 内蒙古电力(集团)有限责任公司内蒙古电力经济技术研究院分公司 Space load prediction method based on principal component analysis of comprehensive mutual informativity
CN114218602A (en) * 2021-12-10 2022-03-22 南京航空航天大学 Differential privacy heterogeneous multi-attribute data publishing method based on vertical segmentation
CN117371036B (en) * 2023-10-19 2024-04-30 湖南工商大学 Gray code differential privacy protection method and device for multi-mode traffic flow query

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512247A (en) * 2015-11-30 2016-04-20 上海交通大学 Non-interactive difference privacy issue model optimization method based on consistency characteristic
CN108009437A (en) * 2016-10-27 2018-05-08 中兴通讯股份有限公司 Data publication method and apparatus and terminal

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156858A (en) * 2015-03-31 2016-11-23 日本电气株式会社 Up model generation system and the method for generation
CN105528681B (en) * 2015-12-21 2019-05-14 大连理工大学 A kind of smelter by-product energy resource system method of real-time adjustment based on hidden tree-model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512247A (en) * 2015-11-30 2016-04-20 上海交通大学 Non-interactive difference privacy issue model optimization method based on consistency characteristic
CN108009437A (en) * 2016-10-27 2018-05-08 中兴通讯股份有限公司 Data publication method and apparatus and terminal

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A High-Dimensional Data Publishing Algorithm Satisfying Differential Privacy Based on a Latent Tree Model (一种基于隐树模型的满足差分隐私的高维数据发布算法); Su Weihang et al.; Journal of Chinese Computer Systems (《小型微型计算机系统》); 2018-04-15; Vol. 39, No. 04; pp. 681-685 *

Also Published As

Publication number Publication date
WO2019237840A1 (en) 2019-12-19
CN110610098A (en) 2019-12-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant