WO2019237840A1 - Data set generation method and device - Google Patents

Data set generation method and device (数据集生成方法及装置)

Info

Publication number
WO2019237840A1
WO2019237840A1 · PCT/CN2019/084345 · CN2019084345W
Authority
WO
WIPO (PCT)
Prior art keywords
data
hidden
data set
tree
variables
Prior art date
Application number
PCT/CN2019/084345
Other languages
English (en)
French (fr)
Inventor
牛家浩
申山宏
王德政
程祥
苏森
唐朋
邵华西
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2019237840A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Definitions

  • the invention relates to the field of data security, and in particular, to a method and a device for generating a data set.
  • the proposed differential privacy protection model provides a feasible solution to the problem of data release that meets privacy protection. Unlike the privacy protection model based on anonymity, the differential privacy protection model provides a strict and quantifiable privacy protection method, and the strength of the privacy protection provided does not depend on the background knowledge possessed by the attacker.
  • In the multi-party setting, the existing vertically partitioned data publishing method that satisfies differential privacy can only be used to publish the statistics needed to build a decision-tree classifier; it is therefore a data publishing method bound to a specific data analysis task.
  • In practical applications, vertically partitioned data publishing methods that satisfy differential privacy can only be applied to decision-tree-based classification tasks and are unavailable for other data analysis and mining tasks such as other types of classification, clustering, and statistical analysis.
  • the embodiments of the present invention provide a method and a device for generating a data set, so as to at least solve the problem of data privacy protection in related technologies.
  • a data set generation method includes: each data owner obtains the mutual information of pairs of explicit variables in its local original data set and generates leaf-layer hidden variables; each data owner builds a tree index locally, and the data owners are combined in pairs to form data-owner pairs, which perform tree-index matching and compute the mutual information between leaf-layer hidden variables; each data owner performs hidden-tree structure learning and hidden-tree parameter learning, each generating a hidden tree locally; and each data owner generates a target data set top-down according to the learned hidden-tree structure and hidden-tree parameters.
  • a data set generation device includes: a hidden variable generation module configured to obtain, for each data owner, the mutual information of pairs of explicit variables in the local original data set and generate leaf-layer hidden variables;
  • a mutual information calculation module configured to build a tree index locally for each data owner, combine the data owners in pairs to form data-owner pairs, and perform tree-index matching and the calculation of mutual information between leaf-layer hidden variables;
  • a hidden tree generation module configured to perform hidden-tree structure learning and hidden-tree parameter learning for each data owner, each generating a hidden tree locally;
  • a data set generation module configured to generate, for each data owner, a target data set top-down according to the learned hidden-tree structure and hidden-tree parameters.
  • a data set generation system includes a plurality of data set generation devices in the foregoing embodiments, wherein each data set generation device corresponds to data processing of a data owner, All data set generating devices are connected via a network.
  • a storage medium stores a computer-readable program, and when the program runs, the method steps in the foregoing embodiments are executed.
  • a hidden tree model is used to model the distribution of the data set vertically partitioned among multiple data owners, and the noisy data set is jointly published according to the learned hidden tree model; this reduces the amount of added noise, ensures that the differential privacy requirement for the published data set is met during the publishing of multi-party vertically partitioned data, and allows the published overall data to support a variety of data analysis tasks.
  • FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention.
  • FIG. 2 is a flowchart of a data set generation method according to an embodiment of the present invention.
  • FIG. 3 is a flowchart of a multi-party vertical segmentation data publishing method according to an embodiment of the present invention.
  • FIG. 4 is a structural block diagram of a data set generating device according to an embodiment of the present invention.
  • FIG. 5 is a structural block diagram of a multi-party vertical segmentation data publishing apparatus according to an embodiment of the present invention.
  • FIG. 6 is a flowchart of a method according to a first embodiment of the present invention.
  • FIG. 7 is a flowchart of a method according to a second embodiment of the present invention.
  • FIG. 8 is a flowchart of a method according to a third embodiment of the present invention.
  • FIG. 9 is a flowchart of a method according to a fourth embodiment of the present invention.
  • the present invention provides a method for publishing multi-party vertical segmentation data that is independent of specific data analysis tasks and meets differential privacy protection through the following embodiments.
  • in the big data environment, during the publishing of multi-party vertically partitioned data, the method both meets the differential privacy requirement for the published data set and enables the released overall data to support a variety of data analysis tasks; therefore, under the premise of protecting individual privacy, data analysts can fully analyze and mine the value in the data, providing more basis for decision support and scientific research.
  • the data owner does not refer to a specific person but to each of the parties holding the multi-party vertically partitioned data; it may be any of various data processing devices that process those data, such as databases, big data platforms, or servers, and each data owner has its own data (that is, data stored in a data warehouse or database).
  • FIG. 1 shows the system architecture of an embodiment of the present invention, in which multi-party vertically partitioned data (for example, medical data or financial data) that share the same IDs but contain different attributes are distributed over three different data owners.
  • the data owner may be a server; therefore, server 1, server 2, and server 3 represent different data owners, respectively.
  • the server 1, the server 2 and the server 3 are connected through a wired or wireless network.
  • the network form and topology of the server 1, the server 2, and the server 3 are not limited. It mainly depends on the geographical distribution and actual needs among the data owners. For example, it can be a local area network, the Internet, or other private networks.
  • each server can send heartbeat information, register as a multi-party vertical segmentation data publishing participant, and publish the generated multi-party vertical segmentation data set.
  • FIG. 2 is a flowchart of a data set generation method according to an embodiment of the present invention. In this embodiment, multiple data owners are included. As shown in FIG. 2, the process includes the following steps:
  • each data owner calculates the mutual information of the explicit variable pairs in the local original data set to generate the leaf layer hidden variables.
  • each data owner establishes a tree index locally, and the data owners are combined in pairs to form a data owner pair, which performs tree index matching and calculation of mutual information between leaf-level hidden variables.
  • step S206 each data owner performs hidden tree structure learning and hidden tree parameter learning, and generates hidden trees locally.
  • each data owner generates a target data set from top to bottom according to the learned hidden tree structure and hidden tree parameters.
  • a hidden tree model is used to model the distribution of the data set vertically partitioned among multiple data owners, and the noisy data set is jointly published according to the learned hidden tree model, so that the amount of added noise is minimized while the published data satisfies differential privacy.
  • the present invention also provides another embodiment of a data publishing method for multi-party vertical segmentation that meets differential privacy protection and has nothing to do with specific data analysis tasks. As shown in FIG. 3, the method includes the following process:
  • step S301 the original data set is uniformly encoded, missing data is filled, discretized, and binarized to obtain a regular explicit variable data set.
  • step S302 the explicit variables are combined in pairs to form a set of explicit variable pairs, and the mutual information between each pair of explicit variables is calculated by accessing the data.
  • Step S303: the differential privacy exponential mechanism is used to generate the leaf-layer hidden variables under the condition that differential privacy protection is satisfied.
  • step S304 for each data owner, the hidden variables of the leaf layer are combined in pairs to form a hidden variable pair, and the hidden variable data is accessed to calculate the mutual information between each pair of hidden variables.
  • step S305 based on the calculated mutual information between the hidden variables of the leaf layer, the hidden variables of the leaf layer are grouped under the condition that the differential privacy protection is satisfied to generate an upper hidden variable.
  • step S306: the above step of generating upper-layer hidden variables from lower-layer hidden variables is repeated bottom-up until the upper layer has only one hidden-variable node, which is recorded as the root node; together with the connecting edges between parent and child nodes, the hidden-variable nodes form a tree index stored locally at the data owner.
  • step S307: the data owners are combined in pairs to form data-owner pairs, and the data owners then exchange and negotiate relevant parameters, including but not limited to the pairing of data owners, the order in which the data owners perform subsequent calculations, and the maximum number of other data owners with which a single data owner may communicate at the same time.
  • step S308: each pair of data owners jointly runs a secure multi-party computation protocol and, under its encryption, runs a two-party hidden-variable mutual information computation method based on tree-index matching.
  • the multiple data owners run the secure protocol together and, under the encryption of the multi-party secure computation protocol, broadcast the previously calculated mutual information between hidden-variable pairs to all other data owners, until each data owner locally stores the same, complete association strengths between leaf-layer hidden-variable pairs.
  • step S309: the multiple data owners independently run the maximum spanning tree construction method locally, using the leaf-layer hidden variables and explicit variables as nodes and the calculated association strengths between variables as the weights of the corresponding connecting edges, to construct an acyclic connected graph of maximum total weight.
  • a root node is randomly selected for the acyclic connected graph, and a parent-child relationship is determined for the pair of nodes connected by each edge according to the path-length distance from the root node, to obtain the hidden tree structure.
  • Step S310 According to the generated hidden tree structure, from top to bottom, each pair of interconnected parent and child nodes is applied, and the Laplacian mechanism is used to calculate the conditional probability between the parent and child nodes under the condition that the differential privacy protection is met.
  • Step S311 Calculate the probability distribution of the root node in the original data set, and draw the generated data set corresponding to the root node according to that distribution; then, layer by layer from the top down, calculate for each node the joint distribution probability of the parent and child nodes from the parent node's currently generated data, and generate noisy data for each node according to the joint distribution probability and a random draw, obtaining a noisy generated data set that meets differential privacy protection.
  • the differential privacy model, a leading model in the field of data privacy, is used in the multi-party joint publishing process to provide differential privacy protection for the data jointly generated by the data owners, ensuring that data privacy is strictly protected.
  • a hidden tree model is also used to model the distribution of the data sets vertically partitioned among multiple data owners; the noisy data sets are jointly published according to the learned hidden tree model, so that the amount of added noise is minimized while the published data satisfies ε-differential privacy, the utility of the published data is improved, and the quality of the overall data service is guaranteed.
  • a tree-index-based method for computing hidden-variable mutual information is also used to discard the computation of mutual information between weakly associated hidden variables, reducing the number of times noise must be added; this provides high-quality services that make comprehensive use of all parties' data while reducing communication overhead.
  • the method according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and certainly also by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present invention, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product that is stored in a storage medium (such as ROM/RAM, a magnetic disk, or a CD-ROM) and includes several instructions for causing a terminal device (which may be a computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present invention.
  • a data set generating device is also provided in this embodiment, and the device is used to implement the foregoing embodiments and preferred implementation manners, and the descriptions will not be repeated.
  • the term "module" may be a combination of software and/or hardware that implements a predetermined function.
  • the devices described in the following embodiments are preferably implemented in software, but implementations in hardware, or in a combination of software and hardware, may also be conceived.
  • FIG. 4 is a structural block diagram of a data set generation device according to an embodiment of the present invention. As shown in FIG. 4, the device includes a hidden variable generation module 10, a mutual information calculation module 20, a hidden tree generation module 30, and a data set generation module 40.
  • the hidden variable generating module 10 is configured to calculate, for each data owner, the mutual information of the explicit variable pairs in the local original data set to generate the leaf layer hidden variables.
  • the mutual information calculation module 20 is configured to establish a tree index locally for each data owner, combine the data owners in pairs to form data-owner pairs, and perform tree-index matching and the calculation of mutual information between leaf-layer hidden variables.
  • the hidden tree generation module 30 is configured to perform hidden tree structure learning and hidden tree parameter learning for each data owner, and generate a hidden tree locally respectively.
  • the data set generation module 40 is configured to generate a target data set from the top to the bottom for each data owner according to the learned hidden tree structure and hidden tree parameters.
  • each data set generating device corresponds to data processing of one data owner, and all data owners may be connected through a network.
  • An embodiment of the present invention also provides another apparatus for generating a data set.
  • the device may be a data processing device such as a database, a big data platform, or a server.
  • the names and functional divisions of its modules are different from those in the above embodiments. As shown in FIG. 5, specifically, the functions implemented by the modules in this embodiment are as follows:
  • the data pre-processing module 50 mainly performs operations such as coding statistics of raw data, data cleaning, filling of missing data, discretization, and binarization.
  • the data pre-processing module 50 may further specifically include a data encoding sub-module 51, a data filling sub-module 52, a discretization sub-module 53 and a binarization sub-module 54.
  • the data encoding sub-module 51 is configured to complete unified encoding for encoding multiple data sources.
  • the data filling sub-module 52 is configured to complete the cleaning and filling of the original data.
  • the discretization sub-module 53 is configured to map continuous data or discrete type values into discrete numbers.
  • the binarization sub-module 54 is configured to convert a discrete variable into a binary variable with values 0 and 1.
  • the association information collection module 60 is configured to generate hidden variables for each data owner, and construct a tree index to perform tree index matching.
  • the association information collection module 60 may further specifically include a hidden variable generation sub-module 61, a tree-like index construction sub-module 62, and a tree-like index matching sub-module 63.
  • the hidden variable generating sub-module 61 is configured to group explicit variables under the condition of satisfying differential privacy, and generate leaf layer hidden variables correspondingly.
  • the tree-like index constructing sub-module 62 is configured to finish constructing the tree-like index structure from the bottom up under the condition of satisfying the differential privacy protection.
  • the tree-like index matching sub-module 63 matches the tree-like indexes from top to bottom to prune the hidden-variable pairs to be calculated, reducing the communication overhead in the process of calculating the association strengths of the hidden variables.
  • the model learning and construction module 70 is configured to learn a hidden tree model based on mutual information between hidden variables.
  • the model learning and construction module 70 may further include an acyclic connected graph construction sub-module 71, a hidden tree structure learning sub-module 72, and a hidden tree parameter learning sub-module 73.
  • the acyclic connected graph construction sub-module 71 uses the leaf-layer hidden variables as nodes and the magnitude of the mutual information between variables as the weights of the connecting edges to construct an acyclic connected graph of maximum total weight.
  • the hidden tree structure learning sub-module 72 mainly completes the construction of the hidden tree structure.
  • the hidden tree parameter learning sub-module 73 mainly completes the learning of conditional distribution parameters between parent and child nodes in the hidden tree structure.
  • the data publishing module 80 generates each record of the synthetic data set from the top down, starting from the root node, according to the learned hidden tree structure and the conditional distribution parameters between parent and child nodes.
  • the above modules can be implemented by software or hardware; for the latter, this can be done in the following ways, but is not limited thereto: the above modules are all located in the same processor, or the above modules are respectively located in multiple processors.
  • the K different specialized hospitals, as the K data owners, each deploy the data set generating device (which can be a database, a big data platform, a server, etc.) and use the device to generate a data set that meets differential privacy protection.
  • this embodiment includes the following steps:
  • Step 601 K hospitals independently encode local data sets, fill in missing data, discretize, and binarize; store the conversion relationship between the corresponding digits of the original data set and the binary data set.
  • Step 602 The K hospitals combine the explicit variables in pairs to form a set of explicit variable pairs, and access the data to calculate the mutual information between each pair of explicit variables.
  • the mutual information is calculated as I(X; Y) = Σ_x Σ_y p(x, y) · log( p(x, y) / ( p(x) · p(y) ) ), where p(x) and p(y) are the probability distributions of X and Y taking the values X = x and Y = y, and p(x, y) is their joint probability.
  • Step 603 Use the differential privacy exponential mechanism to group the local explicit variables under the condition that differential privacy protection is satisfied; each group contains as many explicit variables as possible without exceeding the preset maximum group size for a hidden variable.
  • Step 604 Group each local explicit variable, generate a hidden variable for it, and obtain a leaf-level hidden variable data set.
  • step 605 for each hospital, the hidden variables of the leaf layer are combined in pairs to form hidden-variable pairs, and the hidden-variable data are accessed to calculate the mutual information between each pair of hidden variables; the calculation follows step 602.
  • Step 606 Based on the calculated mutual information between the hidden variables of the leaf layer, group the hidden variables of the leaf layer under the condition that the differential privacy protection is satisfied to generate an upper hidden variable.
  • Step 607 Repeat the above step of generating upper-layer hidden variables from lower-layer hidden variables from bottom to top until the upper layer has only one hidden-variable node, and record that hidden variable as the root node; together with the connecting edges between parent and child nodes, the hidden-variable nodes form a tree index stored locally at the data owner.
  • Step 608 The hospitals are combined in pairs to form data-owner pairs, and the data owners then exchange and negotiate relevant parameters, including but not limited to the pairing of data owners, the order in which the data owners perform subsequent calculations, and the maximum number of other data owners with which a single data owner can communicate at the same time.
  • Step 609 For each pair of hospitals, a secure multiparty computing protocol is jointly run, and a two-party hidden variable mutual information computing method based on tree index matching is run under the condition that the secure multiparty computing protocol is encrypted. For each hospital, you can communicate with multiple other hospitals at the same time and perform the above calculations simultaneously.
  • Step 610 The hospitals jointly run the security protocol and, under the encryption of the multi-party secure computation protocol, send the previously calculated mutual information between hidden-variable pairs to the other hospitals, until each hospital locally stores the same, complete association strengths between leaf-layer hidden-variable pairs.
  • Step 611 The hospitals independently run the maximum spanning tree construction method locally, using the leaf-layer hidden variables and explicit variables as nodes and the calculated association strengths between variables as the weights of the corresponding connecting edges, to construct an acyclic connected graph of maximum total weight.
  • Step 612 Randomly select a root node for the acyclic connected graph, determine a parent-child relationship for each pair of nodes connected by each connecting edge according to a path length distance from the root node, and obtain a hidden tree structure.
  • Step 613 According to the generated hidden tree structure, from top to bottom, for each pair of interconnected parent and child nodes, the Laplacian mechanism is used to calculate the conditional probability between the parent and child nodes under the condition that the differential privacy protection is met.
  • Step 614 Calculate the probability distribution of the root node in the original data set, and extract the generated data set corresponding to the root node according to the probability distribution; then, from top to bottom, calculate the parent-child node for each node based on the current generated data of the parent node. According to the joint distribution probability and random distribution, noisy data is generated for each node. A noisy generated data set that meets differential privacy protection is obtained.
  • the K financial institutions mentioned are institutions with different types of financial information for users, such as banks with deposit information and securities institutions with stock information.
  • the K financial institutions each deploy the data set generating device (which can be a database, a big data platform, a server, etc.), and directly use the device to perform data publishing tasks that satisfy differential privacy protection. As shown in FIG. 7, this embodiment includes the following steps:
  • Step 701 The host of each financial institution sends heartbeat information to other institutions through the network, and registers as a data publishing participant.
  • Step 702 Each financial institution independently and randomly generates a unique integer number as a unique ID. Each financial institution publishes the number of source machines and the local ID of the heartbeat received by the financial institution to the next financial institution. When no contradictory information is detected, the data release process begins.
  • Step 703 Each financial institution communicates, negotiates parameter settings such as a unified coding standard, privacy parameter settings, maximum number of variable groups, and discretization methods, and broadcasts the negotiated parameter configurations to all data publishing participants.
  • Step 704 Each financial institution takes the original data set from the local data warehouse, and runs the unified coding submodule, data filling submodule, discretization submodule, and binarization submodule in order to obtain a regular binary explicit variable data set.
  • Step 705 Each financial institution runs the variable grouping sub-module, combines the explicit variables in pairs to form a set of explicit-variable pairs, accesses the data warehouse to calculate the mutual information between each pair of explicit variables, and uses the differential privacy exponential mechanism to group the explicit variables under the condition that differential privacy protection is satisfied.
  • each financial institution independently runs a hidden variable generation sub-module, and uses a Lagrangian multiplier to optimize the maximum likelihood estimation of the distribution of the explicit variable, and iterates repeatedly until the generated hidden variable is stable as a whole.
  • Step 707 Each financial institution independently runs a tree index construction sub-module to generate a hidden variable index tree.
  • Step 708 After running the above steps, each financial institution sends heartbeat information to the other financial institutions, announcing that the institution has completed the construction of its local tree index, and waits for the other institutions to complete the construction of their tree indexes.
  • Step 709 After all the tree indexes have been constructed, the financial institutions are combined in pairs to form financial-institution pairs and broadcast to one another parameters such as the calculation order of the financial-institution pairs and the maximum number of other data owners with which a single data owner can communicate at the same time; after all financial institutions confirm the relevant order and parameters over the network, subsequent calculations are performed in the negotiated order.
  • Step 710 For each pair of financial institutions, jointly run a secure multi-party computation protocol, run the tree index matching sub-module under the encryption of the secure multi-party computation protocol, and calculate the association strengths between hidden variables while privacy protection is satisfied.
  • Step 711 After each financial institution runs all the tree-like index matches, it sends heartbeat information to other financial institutions, announces that the institution has completed the tree-like index matching, and waits for other institutions to complete the tree-like index matching.
  • Step 712 The financial institutions jointly run the security protocol and, under the encryption of the multi-party secure computation protocol, broadcast the previously calculated mutual information between hidden-variable pairs to all other data owners, until each data owner locally stores the same, complete association-strength information between leaf-layer hidden-variable pairs.
  • Step 713 Each financial institution independently runs the hidden tree structure learning sub-module, adopting the maximum spanning tree construction method, using the leaf-layer hidden variables and explicit variables as nodes and the calculated association strengths between variables as the weights of the corresponding edges, to construct an acyclic connected graph of maximum total weight. A root node is randomly selected for the acyclic connected graph, and a parent-child relationship is determined for the pair of nodes connected by each edge according to the path-length distance from the root node, to obtain the hidden tree structure.
  • Step 714 Each financial institution independently runs the hidden tree parameter learning sub-module and, according to the generated hidden tree structure, for each pair of connected parent and child nodes from the top down, applies the Laplace mechanism to calculate the conditional probabilities between parent and child nodes under the condition that differential privacy protection is satisfied.
  • Step 715 Each financial institution independently runs the data generation sub-module, calculates the probability distribution of the root node in the original data set, and draws the generated data set corresponding to the root node according to that distribution; then, layer by layer from the top down, the joint distribution probability of the parent and child nodes is calculated for each node from the parent node's currently generated data, and noisy data are generated for each node according to the joint distribution probability and a random draw, yielding a noisy generated data set that meets differential privacy protection.
  • Step 716 After completing the data generation, each financial institution sends heartbeat information to other financial institutions, and waits for all financial institutions to complete the data generation. For financial institutions that cannot complete the data generation, the generated data set is broadcasted to them.
  • a large enterprise deploys the device among multiple internal departments to publish data that satisfies differential privacy protection to operation and maintenance and outsourcing personnel.
  • each data set generating device (which may be a database, a big data platform, a server, etc.) communicates with each other through an enterprise intranet. As shown in FIG. 8, this embodiment includes the following steps:
  • Step 801 The data set generating device deployed in each department opens a port and listens to desensitization requests from the outside; each data source is bound to a data set generating device.
  • Step 802 The enterprise operation and maintenance personnel and the outsourced personnel submit a request to the enterprise data service providing department, requesting to obtain data from the same batch of users and belonging to different departments.
  • Step 803 The data service providing department authenticates the applicant, analyzes the data source to which the requested data belongs, and sends a desensitization request to the data set generating device bound to the data source.
  • Step 804 The multiple data set generating devices involved communicate to negotiate parameter configurations such as a unified coding standard, privacy parameter settings, the maximum number of variables per group, and the discretization method, and broadcast the negotiated parameter configuration to all data publishing participants.
  • Step 805 Each device invokes an embedded UDF (user-defined function) provided by the local data warehouse to encode, transform, fill, discretize, and binarize the original data, and store the result in a temporary table.
  • Step 806 Each device runs the variable grouping sub-module, combines the explicit variables in pairs to form a set of explicit-variable pairs, accesses the data warehouse to calculate the mutual information between each pair of explicit variables, and uses the differential privacy exponential mechanism to group the explicit variables under the condition that differential privacy protection is satisfied.
  • each device independently runs a hidden variable generation sub-module, optimizes the maximum likelihood estimation of the distribution of the explicit variable by using Lagrange multipliers, and iterates repeatedly until the generated hidden variable distribution as a whole is stable.
  • Step 808 Each device independently runs a tree-like index construction sub-module to generate a hidden variable index tree.
  • Step 809 After running the above steps, each device sends heartbeat information to the other departments, announcing that the department has completed the construction of its local tree index, and waits for the other departments to complete the construction of their tree indexes.
  • Step 810 After all departments have completed tree index construction, the departments are combined in pairs to form department pairs and broadcast to one another parameters such as the calculation order of the department pairs and the maximum number of other data owners with which a single data owner can communicate at the same time; after all departments have communicated to confirm the relevant order and parameters, subsequent calculations are performed in the negotiated order.
  • Step 811 For each pair of department pairs, jointly run a secure multi-party computing protocol, run a tree index matching sub-module under the encryption of the secure multi-party computing protocol, and calculate the strength of association between hidden variables if privacy protection is met.
  • Step 812 After each device runs all the tree-like index matches, it sends heartbeat information to other devices, announces that the device has completed the tree-like index matching, and waits for other devices to complete the tree-like index matching.
  • Step 813 Each device runs the security protocol together with the others and, under the encryption of the multi-party secure computation protocol, broadcasts the previously calculated mutual information between hidden-variable pairs to all other devices, until each device locally stores the same, complete association-strength information between leaf-layer hidden-variable pairs.
  • Step 814 Each device independently runs the hidden tree structure learning sub-module, adopting the maximum spanning tree construction method, using the leaf-layer hidden variables and explicit variables as nodes and the calculated association strengths between variables as the weights of the corresponding edges, to construct an acyclic connected graph of maximum total weight. A root node is randomly selected for the acyclic connected graph, and a parent-child relationship is determined for the pair of nodes connected by each edge according to the path-length distance from the root node, to obtain the hidden tree structure.
  • Step 815 Each device independently runs the hidden tree parameter learning sub-module and, according to the generated hidden tree structure, for each pair of connected parent and child nodes from the top down, applies the Laplace mechanism to calculate the conditional probabilities between parent and child nodes under the condition that differential privacy protection is satisfied.
  • Step 816 Each device independently runs the data generation sub-module, calculates the probability distribution of the root node in the original data set, and draws the generated data set corresponding to the root node according to that distribution; then, layer by layer from the top down, the joint distribution probability of the parent and child nodes is calculated for each node from the parent node's currently generated data, and noisy data are generated for each node according to the joint distribution probability and a random draw, obtaining a noisy generated data set that meets differential privacy protection.
  • Step 817 After completing the data generation, each device sends a heartbeat message to other devices, and waits for all devices to complete the data generation, and broadcasts the generated data set to the devices that cannot complete the data generation.
  • Step 818 The generated data set is cached locally for the next query; any device is used to send the generated data to the data service providing department.
  • Step 819 After verifying the data, the data service providing department sends the data to the data requester.
  • in this embodiment, the company desensitizes the data before storing them in the data warehouse, to avoid possible legal disputes caused by data sales and data collection.
  • Implementation environment description: data about the same batch of users from multiple channels enter the data set generating device (which can be a database, a big data platform, a server, etc.) described in the previous embodiments in the form of data streams; after the device desensitizes the data sets, the data sets that satisfy differential privacy protection are stored in the data warehouse. As shown in FIG. 9, this embodiment includes the following steps:
  • Step 901 Each data source is bound to a data publishing device, and each device sends heartbeat information to each other to confirm that the device is running well and the data flow is normal.
  • Step 902 Each data publishing device reads a data stream of the same length into memory; reading from the bound data stream is then temporarily suspended.
  • Step 903 Each device checks the integrity and legality of the data stream; if abnormal data occurs, the batch of data is discarded, and the next batch of stream data is read in.
  • Step 904 Each data source communicates with each other, and determines the size of the privacy parameter and the allocation strategy according to the size of the data stream.
  • Step 905 Each device sequentially runs the unified encoding sub-module, the data filling sub-module, the discretization sub-module, and the binarization sub-module, and pre-processes the stream data held in memory.
  • Step 906 Each device runs the variable grouping sub-module, combines the explicit variables in pairs to form a set of explicit-variable pairs, accesses the data warehouse to calculate the mutual information between each pair of explicit variables, and uses the differential privacy exponential mechanism to group the explicit variables under the condition that differential privacy protection is satisfied.
  • Step 907 Each device independently runs a hidden variable generation sub-module, optimizes the maximum likelihood estimation of the distribution of the explicit variable by using Lagrangian multipliers, and iterates repeatedly until the generated hidden variable overall distribution is stable.
  • Step 908 Each device independently runs a tree-like index building sub-module to generate a hidden variable index tree.
  • Step 909 After running the above steps, each device sends heartbeat information to the other devices, announcing that the device has completed the construction of its local tree index, and waits for the other devices to complete the construction of their tree indexes.
  • Step 910 After all devices have completed tree index construction, the devices are combined in pairs to form device pairs and broadcast to one another parameters such as the calculation order of the device pairs and the maximum number of other devices with which a single device can communicate at the same time; after all devices have communicated to confirm the relevant order and parameters, subsequent calculations are performed in the negotiated order.
  • Step 911 For each pair of device pairs, jointly run a secure multi-party computing protocol, run a tree index matching sub-module under the encryption of the secure multi-party computing protocol, and calculate the strength of association between hidden variables if privacy protection is met.
  • step 912 after each device has run all the tree index matches related to itself, it sends heartbeat information to other devices, announces that the device has completed the tree index matching, and waits for other devices to complete the tree index matching.
  • Step 913 Each device runs the security protocol together with the others and, under the encryption of the multi-party secure computation protocol, broadcasts the previously calculated mutual information between hidden-variable pairs to all other devices, until each device locally stores the same, complete association-strength information between leaf-layer hidden-variable pairs.
  • Step 914 Each device independently runs the hidden tree structure learning sub-module, adopting the maximum spanning tree construction method, using the leaf-layer hidden variables and explicit variables as nodes and the calculated association strengths between variables as the weights of the corresponding edges, to construct an acyclic connected graph of maximum total weight. A root node is randomly selected for the acyclic connected graph, and a parent-child relationship is determined for the pair of nodes connected by each edge according to the path-length distance from the root node, to obtain the hidden tree structure.
  • Step 915 Each device independently runs a hidden tree parameter learning sub-module, and according to the generated hidden tree structure, each pair of interconnected parent-child nodes is applied from top to bottom, and the Laplacian mechanism is used to satisfy differential privacy protection. The conditional probability between parent and child nodes is calculated under the conditions.
  • Step 916 Each device independently runs the data generation sub-module, calculates the probability distribution of the root node in the original data set, and draws the generated data set corresponding to the root node according to that distribution; then, layer by layer from the top down, the joint distribution probability of the parent and child nodes is calculated for each node from the parent node's currently generated data, and noisy data are generated for each node according to the joint distribution probability and a random draw, obtaining a noisy generated data set that meets differential privacy protection.
  • each device sends a message to other devices, announcing that the device has completed the data generation, and upon receiving the message that the data generation is completed, the other devices immediately stop the data publishing process to minimize time consumption.
  • Step 918 The device that completes the data generation first performs data verification on the data, and then stores the data into the data warehouse.
  • Step 919 Each device continues to read the next batch of data streams from the data stream, and continues to perform desensitization before the data is stored in the database according to the above steps.
  • An embodiment of the present invention also provides a storage medium.
  • the above-mentioned storage medium may be configured to store program code for performing the steps of the embodiments in the foregoing.
  • the foregoing storage medium may include, but is not limited to, a U disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a mobile hard disk, a magnetic disk, or an optical disk.
  • modules or steps of the present invention may be implemented by a general-purpose computing device, and they may be concentrated on a single computing device or distributed on a network composed of multiple computing devices.
  • they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in a different order than described here, or they may be made into individual integrated circuit modules separately, or multiple modules or steps among them may be made into a single integrated circuit module for implementation.
  • the invention is not limited to any particular combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides a data set generation method and device. The method includes: each data owner of multi-party vertically partitioned data computes the mutual information of pairs of explicit variables in its local original data set and generates leaf-layer hidden variables; each data owner builds a tree index locally, and the data owners are combined in pairs to form data-owner pairs, which perform tree-index matching and compute the mutual information between leaf-layer hidden variables; each data owner performs hidden-tree structure learning and hidden-tree parameter learning, each generating a hidden tree locally; and each data owner generates a target data set top-down according to the learned hidden-tree structure and hidden-tree parameters.

Description

Data set generation method and device
Technical field
The present invention relates to the field of data security and, in particular, to a data set generation method and device.
Background
With the rapid development of digital technologies such as smart cities, smart grids, and smart healthcare, and the wide adoption of mobile terminal devices, information about people's daily lives and health care is digitized, producing massive amounts of data every day and ushering in the era of big data. Large volumes of data are often held by different data owners; for example, hospitals and financial institutions respectively hold sets of medical data and financial data. When data distributed among multiple parties share the same IDs but contain different attributes, they are called multi-party vertically partitioned data. Publishing multi-party vertically partitioned data helps data analysts fully analyze and mine the potential value in the data. However, vertically partitioned data usually contain a large amount of sensitive personal information, and publishing such data directly would inevitably leak individuals' private information.
The differential privacy protection model provides a feasible solution to the problem of privacy-preserving data publishing. Unlike anonymity-based privacy protection models, the differential privacy model provides a strict and quantifiable means of privacy protection, and the strength of the protection it provides does not depend on the background knowledge possessed by an attacker.
At present, in the single-party setting, the PrivBayes technique (Private Data Release via Bayesian Networks) solves the problem of data publishing under differential privacy: it first builds a Bayesian network from the original data, then injects noise into the constructed network so that it satisfies the differential privacy requirement, and finally uses the noisy Bayesian network to generate a new data set for publication. However, because the algorithm is designed for single-party data, PrivBayes is not usable in the multi-party setting.
In the multi-party setting, the existing vertically partitioned data publishing method that satisfies differential privacy (DistDiffGen) can only be used to publish the statistics needed to build a decision-tree classifier; it is therefore a data publishing method bound to a specific data analysis task. At present, vertically partitioned data publishing methods that satisfy differential privacy in practical applications can only be applied to decision-tree-based classification tasks and are unavailable for other data analysis and mining tasks such as other types of classification, clustering, and statistical analysis.
Summary
Embodiments of the present invention provide a data set generation method and device, so as to at least solve the problem of data privacy protection in the related art.
According to one aspect of the present invention, a data set generation method is provided, including: each data owner obtains the mutual information of pairs of explicit variables in its local original data set and generates leaf-layer hidden variables; each data owner builds a tree index locally, and the data owners are combined in pairs to form data-owner pairs, which perform tree-index matching and compute the mutual information between leaf-layer hidden variables; each data owner performs hidden-tree structure learning and hidden-tree parameter learning, each generating a hidden tree locally; and each data owner generates a target data set top-down according to the learned hidden-tree structure and hidden-tree parameters.
According to another aspect of the present invention, a data set generation device is provided, including: a hidden variable generation module configured to obtain, for each data owner, the mutual information of pairs of explicit variables in the local original data set and generate leaf-layer hidden variables; a mutual information calculation module configured to build a tree index locally for each data owner, combine the data owners in pairs to form data-owner pairs, and perform tree-index matching and the calculation of mutual information between leaf-layer hidden variables; a hidden tree generation module configured to perform hidden-tree structure learning and hidden-tree parameter learning for each data owner, each generating a hidden tree locally; and a data set generation module configured to generate, for each data owner, a target data set top-down according to the learned hidden-tree structure and hidden-tree parameters.
According to yet another aspect of the present invention, a data set generation system is provided, including a plurality of the data set generation devices of the foregoing embodiments, where each data set generation device handles the data processing of one data owner and all the data set generation devices are connected through a network.
According to still another aspect of the present invention, a storage medium is provided, storing a computer-readable program which, when run, executes the method steps of the foregoing embodiments.
In the above embodiments of the present invention, a hidden tree model is used to model the distribution of the data set vertically partitioned among multiple data owners, and a noisy data set is jointly published according to the learned hidden tree model. This reduces the amount of added noise, ensures that the differential privacy requirement for the published data set is met during the publishing of multi-party vertically partitioned data, and allows the published overall data to support a variety of data analysis tasks.
Brief description of the drawings
FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention;
FIG. 2 is a flowchart of a data set generation method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a multi-party vertically partitioned data publishing method according to an embodiment of the present invention;
FIG. 4 is a structural block diagram of a data set generation device according to an embodiment of the present invention;
FIG. 5 is a structural block diagram of a multi-party vertically partitioned data publishing device according to an embodiment of the present invention;
FIG. 6 is a flowchart of a method according to the first embodiment of the present invention;
FIG. 7 is a flowchart of a method according to the second embodiment of the present invention;
FIG. 8 is a flowchart of a method according to the third embodiment of the present invention;
FIG. 9 is a flowchart of a method according to the fourth embodiment of the present invention.
Detailed description
The present invention is described in detail below with reference to the accompanying drawings and in combination with the embodiments. It should be noted that, provided there is no conflict, the embodiments in this application and the features in the embodiments may be combined with one another.
Through the following embodiments, the present invention provides a multi-party vertically partitioned data publishing method that is independent of any specific data analysis task and satisfies differential privacy protection. In a big data environment, during the publishing of multi-party vertically partitioned data, the method both satisfies the differential privacy requirement for the published data set and allows the published overall data to support a variety of data analysis tasks. Thus, on the premise of protecting individual privacy, data analysts can fully analyze and mine the value in the data, providing more support for decision making and scientific research.
It should be noted that, in the embodiments of the present invention, a data owner does not refer to a specific person but to each of the parties holding the multi-party vertically partitioned data; it may be any of various data processing devices that process the multi-party vertically partitioned data, for example a database, a big data platform, or a server. Each data owner has its own data (i.e., data stored in a data warehouse or database).
FIG. 1 shows the system architecture of an embodiment of the present invention. As shown in FIG. 1, multi-party vertically partitioned data (for example, medical data or financial data) that share the same IDs but contain different attributes are distributed over three different data owners. In this embodiment, a data owner may be a server; therefore, server 1, server 2, and server 3 represent different data owners. Server 1, server 2, and server 3 are connected through a wired or wireless network. The form and topology of the network connecting them are not limited; they mainly depend on the geographical distribution of the data owners and on actual needs, and the network may be, for example, a local area network, the Internet, or another private network. Through the connected network, the servers can send heartbeat messages to one another, register as participants in multi-party vertically partitioned data publishing, and publish the generated multi-party vertically partitioned data set.
By running the technical solutions provided by the embodiments of the present invention on the system architecture shown in FIG. 1, it can be ensured that, during the publishing of multi-party vertically partitioned data, the differential privacy requirement for the published data set is satisfied while the published overall data can support a variety of data analysis tasks.
This embodiment provides a data set generation method, which may be implemented on the system architecture of the above embodiment. FIG. 2 is a flowchart of a data set generation method according to an embodiment of the present invention. This embodiment involves multiple data owners. As shown in FIG. 2, the process includes the following steps:
Step S202: each data owner computes the mutual information of pairs of explicit variables in its local original data set and generates leaf-layer hidden variables.
Step S204: each data owner builds a tree index locally, and the data owners are combined in pairs to form data-owner pairs, which perform tree-index matching and compute the mutual information between leaf-layer hidden variables.
Step S206: each data owner performs hidden-tree structure learning and hidden-tree parameter learning, each generating a hidden tree locally.
Step S208: each data owner generates a target data set top-down according to the learned hidden-tree structure and hidden-tree parameters.
In the above embodiment, a hidden tree model is used to model the distribution of the data set vertically partitioned among multiple data owners, and a noisy data set is jointly published according to the learned hidden tree model, so that the amount of added noise is minimized while the published data satisfies differential privacy.
The present invention further provides another embodiment of a multi-party vertically partitioned data publishing method that satisfies differential privacy and is independent of any specific data analysis task. As shown in FIG. 3, the method includes the following flow:
Step S301: the original data set is uniformly encoded, missing data are filled in, and the data are discretized and binarized, yielding a regularized explicit-variable data set.
Step S302: the explicit variables are combined in pairs to form a set of explicit-variable pairs, and the data are accessed to compute the mutual information between each pair of explicit variables.
Step S303: the differential privacy exponential mechanism is used to generate the leaf-layer hidden variables under the condition that differential privacy protection is satisfied.
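The patent does not give code for step S303; purely as an illustration, the following sketch shows how the exponential mechanism could pick one candidate grouping of explicit variables with probability proportional to exp(ε·score/(2Δ)). The scoring function (total within-group mutual information) and the sensitivity value are assumptions made for the example, not taken from the patent.

```python
import math
import random

def exponential_mechanism(candidates, score_fn, epsilon, sensitivity):
    """Sample one candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    scores = [score_fn(c) for c in candidates]
    max_s = max(scores)  # shift scores for numerical stability
    weights = [math.exp(epsilon * (s - max_s) / (2.0 * sensitivity)) for s in scores]
    r = random.uniform(0.0, sum(weights))
    acc = 0.0
    for cand, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return cand
    return candidates[-1]

def grouping_score(grouping, mi):
    """Assumed utility of a grouping: total pairwise mutual information
    inside the groups; `mi` maps frozenset({a, b}) to their MI."""
    return sum(mi[frozenset((a, b))]
               for group in grouping
               for i, a in enumerate(group) for b in group[i + 1:])
```

A grouping chosen this way favors placing strongly associated explicit variables under the same leaf-layer hidden variable while remaining differentially private.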
Step S304: for each data owner, the leaf-layer hidden variables are combined in pairs to form hidden-variable pairs, and the hidden-variable data are accessed to compute the mutual information between each pair of hidden variables.
Step S305: based on the computed mutual information between leaf-layer hidden variables, the leaf-layer hidden variables are grouped, under the condition that differential privacy protection is satisfied, to generate upper-layer hidden variables.
Step S306: the above step of generating upper-layer hidden variables from lower-layer hidden variables is repeated bottom-up until the upper layer contains only one hidden-variable node, which is recorded as the root node. Together with the connecting edges between parent and child nodes, the hidden-variable nodes form a tree index, which is stored locally at the data owner.
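A minimal sketch of the bottom-up construction in steps S305 and S306 is given below. It assumes, purely for illustration, that each layer is grouped by greedily pairing nodes with the largest association strength; the patent actually performs this grouping under the exponential mechanism and a maximum group size, both omitted here.

```python
class IndexNode:
    """One node of the tree index: a hidden variable and its children."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

def build_tree_index(leaf_nodes, pair_mi):
    """Group the current layer repeatedly (greedy pairing by the strength
    returned by pair_mi(a, b)) until a single root node remains."""
    layer = list(leaf_nodes)
    level = 0
    while len(layer) > 1:
        level += 1
        used, next_layer = set(), []
        pairs = sorted(((pair_mi(a, b), a, b)
                        for i, a in enumerate(layer) for b in layer[i + 1:]),
                       key=lambda t: -t[0])
        for _, a, b in pairs:
            if a.name in used or b.name in used:
                continue
            used.update((a.name, b.name))
            next_layer.append(IndexNode(f"H{level}_{len(next_layer)}", [a, b]))
        next_layer.extend(n for n in layer if n.name not in used)  # carry over an odd node
        layer = next_layer
    return layer[0]  # root of the tree index
```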
Step S307: the data owners are combined in pairs to form data-owner pairs, and the data owners then exchange and negotiate the relevant parameters, including but not limited to the pairing of data owners, the order in which the data-owner pairs perform the subsequent computations, and the maximum number of other data owners with which a single data owner may communicate at the same time.
Step S308: each pair of data owners jointly runs a secure multi-party computation protocol and, under its encryption, runs a two-party hidden-variable mutual information computation method based on tree-index matching. Each data owner may communicate with several other data owners at the same time and carry out the above computation in parallel. The data owners jointly run the secure protocol and, under the encryption of the secure multi-party computation protocol, broadcast the previously computed mutual information between hidden-variable pairs to all other data owners, until every data owner locally stores the same, complete association strengths between leaf-layer hidden-variable pairs.
Step S309: the data owners independently run a maximum spanning tree construction method locally, taking the leaf-layer hidden variables and the explicit variables as nodes and the computed association strengths between variables as the weights of the corresponding connecting edges, to construct an acyclic connected graph of maximum total weight. A root node is chosen at random for the acyclic connected graph, and a parent-child relationship is determined for the pair of nodes connected by each edge according to the path-length distance from the root node, yielding the hidden tree structure.
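Step S309 can be pictured with the following sketch: the maximum spanning tree over the mutual-information weights is obtained with Kruskal's algorithm applied to the edges in descending weight order, and parent-child directions are then fixed by a breadth-first traversal from a randomly chosen root. The data structures and names are illustrative, not taken from the patent.

```python
import random
from collections import defaultdict, deque

def maximum_spanning_tree(nodes, weighted_edges):
    """weighted_edges: iterable of (weight, u, v).  Kruskal over descending
    weights yields the spanning tree with maximum total weight."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    tree = []
    for w, u, v in sorted(weighted_edges, key=lambda e: -e[0]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree

def orient_from_random_root(nodes, tree_edges):
    """Pick a root at random and direct every edge away from it, which
    fixes the parent-child relations of the hidden tree."""
    adj = defaultdict(list)
    for u, v in tree_edges:
        adj[u].append(v)
        adj[v].append(u)
    root = random.choice(list(nodes))
    parent_of, queue = {root: None}, deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent_of:
                parent_of[v] = u
                queue.append(v)
    return root, parent_of
```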
Step S310: according to the generated hidden tree structure, for each pair of connected parent and child nodes, from the top down, the Laplace mechanism is applied to compute the conditional probabilities between parent and child nodes under the condition that differential privacy protection is satisfied.
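A minimal sketch of the Laplace mechanism used in step S310: the joint counts of a parent-child pair are perturbed with Laplace noise of scale sensitivity/ε and then normalized into a noisy conditional probability table. The sensitivity value and the per-step privacy budget are illustrative assumptions.

```python
import numpy as np

def noisy_conditional_table(child_vals, parent_vals, epsilon, sensitivity=1.0):
    """Return (child_states, parent_states, table) where table[i, j] is a
    noisy estimate of P(child = child_states[i] | parent = parent_states[j])."""
    child_states = sorted(set(child_vals))
    parent_states = sorted(set(parent_vals))
    c_idx = {v: i for i, v in enumerate(child_states)}
    p_idx = {v: j for j, v in enumerate(parent_states)}
    counts = np.zeros((len(child_states), len(parent_states)))
    for c, p in zip(child_vals, parent_vals):
        counts[c_idx[c], p_idx[p]] += 1
    noisy = counts + np.random.laplace(0.0, sensitivity / epsilon, counts.shape)
    noisy = np.clip(noisy, 0.0, None)          # counts cannot be negative
    col_sums = noisy.sum(axis=0)
    col_sums[col_sums == 0] = 1.0              # avoid division by zero
    return child_states, parent_states, noisy / col_sums
```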
Step S311: the probability distribution of the root node in the original data set is computed, and the generated data corresponding to the root node are drawn according to this distribution; then, layer by layer from the top down, the joint distribution probability of the parent and child nodes is computed for each node from the parent node's currently generated data, and noisy data are generated for each node according to the joint distribution probability and a random draw. A noisy generated data set that satisfies differential privacy protection is obtained.
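Step S311 amounts to ancestral sampling in the learned hidden tree: the root is drawn from its noisy marginal and every other node from its noisy conditional table given the value already generated for its parent. The sketch below assumes dictionaries root_dist and cond[child][parent_value] holding those distributions; the names are assumptions for the example.

```python
import random

def sample_record(root, children_of, root_dist, cond):
    """Generate one synthetic record by top-down (ancestral) sampling.

    root_dist: {value: prob} for the root node.
    cond[node]: {parent_value: {value: prob}} for every non-root node.
    children_of: {node: [child, ...]} from the learned hidden tree.
    """
    def draw(dist):
        values, probs = zip(*dist.items())
        return random.choices(values, weights=probs, k=1)[0]

    record = {root: draw(root_dist)}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children_of.get(node, []):
            record[child] = draw(cond[child][record[node]])
            stack.append(child)
    return record

# A whole noisy data set is then just repeated sampling:
# synthetic = [sample_record(root, children_of, root_dist, cond) for _ in range(n)]
```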
In the above embodiments of the present invention, on the one hand, the differential privacy model, the leading model in the field of data privacy, is applied during multi-party joint data publishing to provide differential privacy protection for the data jointly generated by the data owners, ensuring that data privacy is strictly protected. On the other hand, a hidden tree model is used to model the distribution of the data set vertically partitioned among multiple data owners, and the noisy data set is jointly published according to the learned hidden tree model, so that the amount of added noise is minimized while the published data satisfies ε-differential privacy; the utility of the published data is improved and the quality of the overall data service is guaranteed. In addition, this embodiment uses a tree-index-based method for computing hidden-variable mutual information, which discards the computation of mutual information between weakly associated hidden variables and thereby reduces the number of times noise must be added; communication overhead is reduced while high-quality services are provided on the basis of all parties' data.
Through the description of the above implementations, those skilled in the art can clearly understand that the method according to the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, and certainly also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and includes a number of instructions for causing a terminal device (which may be a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
This embodiment also provides a data set generation device, which is used to implement the above embodiments and preferred implementations; descriptions already given are not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also conceivable.
FIG. 4 is a structural block diagram of a data set generation device according to an embodiment of the present invention. As shown in FIG. 4, the device includes a hidden variable generation module 10, a mutual information calculation module 20, a hidden tree generation module 30, and a data set generation module 40.
The hidden variable generation module 10 is configured to compute, for each data owner, the mutual information of pairs of explicit variables in the local original data set and generate the leaf-layer hidden variables. The mutual information calculation module 20 is configured to build a tree index locally for each data owner, combine the data owners in pairs to form data-owner pairs, and perform tree-index matching and the calculation of mutual information between leaf-layer hidden variables. The hidden tree generation module 30 is configured to perform hidden-tree structure learning and hidden-tree parameter learning for each data owner, each generating a hidden tree locally. The data set generation module 40 is configured to generate, for each data owner, a target data set top-down according to the learned hidden-tree structure and hidden-tree parameters.
In the above embodiment, there may be multiple data set generation devices; each data set generation device handles the data processing of one data owner, and all the data owners may be connected through a network.
An embodiment of the present invention further provides another data set generation device. The device may be a data processing device such as a database, a big data platform, or a server. It should be noted that, although it can likewise be used to implement vertically partitioned data publishing that satisfies differential privacy, in this embodiment the names and functional division of its modules differ from those of the above embodiment. As shown in FIG. 5, the functions implemented by the modules in this embodiment are specifically as follows:
The data pre-processing module 50 mainly performs operations such as encoding statistics of the raw data, data cleaning, filling of missing data, discretization, and binarization. The data pre-processing module 50 may further include a data encoding sub-module 51, a data filling sub-module 52, a discretization sub-module 53, and a binarization sub-module 54.
Specifically, the data encoding sub-module 51 is configured to perform unified encoding of multiple data sources. The data filling sub-module 52 is configured to clean the original data and fill in missing values. The discretization sub-module 53 is configured to map continuous data or discrete categorical values to discrete numbers. The binarization sub-module 54 is configured to convert discrete variables into binary variables taking the values 0 and 1.
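As an illustration of what the discretization and binarization sub-modules 53 and 54 do, the sketch below maps a continuous column to equal-width bins and then expands a discrete variable into 0/1 indicator variables. The number of bins and the encoding scheme are assumptions made for the example.

```python
def discretize(values, num_bins=8):
    """Map continuous values to bin indices 0..num_bins-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0   # guard against a constant column
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

def binarize(values):
    """Expand one discrete variable into several 0/1 indicator variables,
    one per observed category."""
    categories = sorted(set(values))
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in categories}

# Example: binarize(discretize([1.2, 3.4, 9.9, 0.5], num_bins=4))
```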
The association information collection module 60 is configured to generate hidden variables for each data owner, build the tree index, and perform tree-index matching. The association information collection module 60 may further include a hidden variable generation sub-module 61, a tree index construction sub-module 62, and a tree index matching sub-module 63.
Specifically, the hidden variable generation sub-module 61 is configured to group the explicit variables under the condition that differential privacy is satisfied and to generate the corresponding leaf-layer hidden variables. The tree index construction sub-module 62 is configured to build the tree index structure bottom-up under the condition that differential privacy protection is satisfied. The tree index matching sub-module 63 matches the tree indexes top-down to prune the hidden-variable pairs whose mutual information is to be computed, reducing the communication overhead in the process of computing the association strengths of the hidden variables.
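The patent does not spell out the matching rule used by the tree index matching sub-module 63. Purely as an illustration of top-down pruning, the sketch below descends two tree indexes (nodes of the IndexNode type sketched earlier) in parallel and only recurses into child pairs whose parents' association strength exceeds a threshold, so that weakly associated leaf-layer hidden-variable pairs are never evaluated; the pruning criterion is an assumption, not the patented rule.

```python
def match_tree_indexes(root_a, root_b, strength, threshold):
    """Return the leaf-level node pairs worth a full mutual-information
    computation.  strength(a, b) is the (securely computed) association
    strength of two index nodes."""
    selected = []
    stack = [(root_a, root_b)]
    while stack:
        a, b = stack.pop()
        if strength(a, b) < threshold:
            continue                      # prune this whole subtree pair
        if not a.children and not b.children:
            selected.append((a, b))       # leaf-layer pair is kept
            continue
        for ca in (a.children if a.children else [a]):
            for cb in (b.children if b.children else [b]):
                stack.append((ca, cb))
    return selected
```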
The model learning and construction module 70 is configured to learn the hidden tree model from the mutual information between hidden variables. The model learning and construction module 70 may further include an acyclic connected graph construction sub-module 71, a hidden tree structure learning sub-module 72, and a hidden tree parameter learning sub-module 73.
Specifically, the acyclic connected graph construction sub-module 71 takes the leaf-layer hidden variables as nodes and the magnitude of the mutual information between variables as the weights of the connecting edges to construct an acyclic connected graph of maximum total weight. The hidden tree structure learning sub-module 72 mainly completes the construction of the hidden tree structure. The hidden tree parameter learning sub-module 73 mainly completes the learning of the conditional distribution parameters between parent and child nodes in the hidden tree structure.
The data publishing module 80 generates each record of the synthetic data set from the top down, starting from the root node, according to the learned hidden tree structure and the conditional distribution parameters between parent and child nodes.
It should be noted that the above modules may be implemented by software or by hardware. For the latter, this may be done in the following ways, but is not limited thereto: the above modules are all located in the same processor, or the above modules are respectively located in multiple processors.
Embodiment 1
In this embodiment, K different specialized hospitals (K>=2) jointly publishing medical data on their respective devices is taken as an example for detailed description.
Description of the implementation environment: the K different specialized hospitals, as the K data owners, each deploy the data set generation device (which may be a database, a big data platform, a server, etc.) and use the device to generate a data set that satisfies differential privacy protection.
As shown in FIG. 6, this embodiment includes the following steps:
Step 601: the K hospitals independently apply unified encoding, missing-data filling, discretization, and binarization to their local data sets, and store the mapping between the corresponding digits of the original data set and the binary data set.
Step 602: the K hospitals combine the explicit variables in pairs to form a set of explicit-variable pairs and access the data to compute the mutual information between each pair of explicit variables. The mutual information is computed as
I(X; Y) = Σ_x Σ_y p(x, y) · log( p(x, y) / ( p(x) · p(y) ) )
where p(x) and p(y) are the probability distributions of the variables X and Y taking the values X = x and Y = y respectively, and p(x, y) is the joint probability that X = x and Y = y.
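A direct translation of this formula into code, for two equally long discrete columns, is the empirical estimate below (shown only as a sketch, without smoothing):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(X; Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) * p(y)) )
    for two discrete columns of equal length."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi

# e.g. mutual_information([0, 0, 1, 1], [0, 1, 0, 1]) == 0.0 (independent columns)
```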
Step 603: use the exponential mechanism of differential privacy to group the local observed variables under the condition that differential privacy protection is satisfied, with each group containing as many observed variables as possible without exceeding the preset maximum group size for a latent variable.
Step 604: for each local group of observed variables, generate a latent variable for the group, obtaining the leaf-layer latent-variable data set.
Step 605: each hospital combines the leaf-layer latent variables pairwise into latent-variable pairs and accesses the latent-variable data to compute the mutual information between each pair of latent variables; the computation follows step 602.
Step 606: based on the computed mutual information between the leaf-layer latent variables, group the leaf-layer latent variables under the condition that differential privacy protection is satisfied, and generate upper-layer latent variables.
Step 607: repeat, from bottom to top, the above step of generating upper-layer latent variables from lower-layer latent variables until only one latent-variable node remains in the upper layer, and record this latent variable as the root node; the latent-variable nodes, together with the connecting edges between parent and child nodes, form a tree index, which is stored locally at the data owner.
Step 608: the hospitals are combined pairwise to form data-owner pairs, and the data owners then exchange and negotiate relevant parameters, including but not limited to the pairing of the data owners, the execution order of the data-owner pairs in subsequent computations, and the maximum number of other data owners with which a single data owner may communicate simultaneously.
Step 609: each pair of hospitals jointly runs a secure multi-party computation protocol and executes, under the encryption of the secure multi-party computation protocol, the two-party latent-variable mutual information computation method based on tree-index matching. Each hospital may communicate with multiple other hospitals simultaneously and perform the above computation in parallel.
Step 610: the hospitals jointly run the secure protocol and, under the encryption of the multi-party secure computation protocol, send the previously computed mutual information between latent-variable pairs to the other hospitals, until every hospital locally stores identical and complete association strengths between the leaf-layer latent-variable pairs.
Step 611: the hospitals independently run the maximum spanning tree construction method locally, taking the leaf-layer latent variables and the observed variables as nodes and the computed association strengths between variables as the weights of the corresponding connecting edges, to construct an acyclic connected graph with the minimum sum of weights.
Step 612: choose a root node for the acyclic connected graph at random, and determine the parent-child relation of the node pair connected by each edge according to the path length from the root node, obtaining the latent tree structure.
Step 613: according to the generated latent tree structure, for each pair of connected parent and child nodes, from top to bottom, apply the Laplace mechanism to compute the conditional probability between the parent and child nodes under the condition that differential privacy protection is satisfied.
Step 614: compute the probability distribution of the root node over the original data set and sample the generated data set corresponding to the root node according to this distribution; then, from top to bottom and layer by layer, compute for each node the joint probability distribution of the parent and child nodes based on the parent node's currently generated data, and generate noisy data for each node according to the joint distribution and a random distribution, obtaining a noisy generated data set that satisfies differential privacy protection.
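Steps 611 and 612 run a maximum spanning tree construction over the mutual-information weights and then orient the resulting tree from a randomly chosen root. A Prim-style sketch is given below; the weight convention, tie-breaking and data structures are assumptions of this illustration and not the procedure as claimed.

```python
import heapq
from itertools import count

def maximum_spanning_tree(nodes, weight):
    # Prim-style construction: repeatedly attach the node reached by the
    # heaviest remaining edge, where weight(u, v) is the (noisy) mutual
    # information between variables u and v.
    tie = count()                     # tie-breaker so heap never compares nodes
    start = nodes[0]
    in_tree = {start}
    edges = []
    heap = [(-weight(start, v), next(tie), start, v) for v in nodes if v != start]
    heapq.heapify(heap)
    while heap and len(in_tree) < len(nodes):
        _, _, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue
        in_tree.add(v)
        edges.append((u, v))
        for t in nodes:
            if t not in in_tree:
                heapq.heappush(heap, (-weight(v, t), next(tie), v, t))
    return edges

def orient_from_root(edges, root):
    # Turn the undirected spanning tree into parent -> child relations by
    # walking outward from a (randomly chosen) root, as in step 612.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    parent, seen, frontier = {}, {root}, [root]
    while frontier:
        node = frontier.pop()
        for nb in adj.get(node, []):
            if nb not in seen:
                parent[nb] = node
                seen.add(nb)
                frontier.append(nb)
    return parent  # mapping: child -> parent
```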
Embodiment 2
In this embodiment, a detailed description is given by taking K financial institutions (K>=2) jointly publishing a batch of users' financial information that satisfies differential privacy as an example.
Description of the implementation environment: the K financial institutions are institutions holding different types of financial information about the users, such as a bank holding deposit information and a securities institution holding stock information. Each of the K financial institutions deploys the data set generation apparatus (which may be a database, a big data platform, a server or the like) and uses the apparatus directly to publish data that satisfies differential privacy protection. As shown in Fig. 7, this embodiment includes the following steps.
Step 701: the host of each financial institution sends heartbeat messages to the other institutions over the network and registers as a data publishing participant.
Step 702: each financial institution independently and randomly generates a non-repeating integer as its unique ID, and each institution publishes to the next institution the number of heartbeat sources it has received together with its own ID; if no institution detects contradictory information, the data publishing process starts.
Step 703: the financial institutions communicate to negotiate parameter configurations such as the unified encoding standard, the privacy parameter settings, the maximum number of variables per group and the discretization method, and broadcast the negotiated configuration to all data publishing participants.
Step 704: each financial institution retrieves the original data set from its local data warehouse and runs the unified encoding submodule, the data filling submodule, the discretization submodule and the binarization submodule in turn, obtaining a regularized binary observed-variable data set.
Step 705: each financial institution runs the variable grouping submodule, combines the observed variables pairwise into observed-variable pairs, accesses the data warehouse to compute the mutual information between each pair of observed variables, and groups the observed variables under the condition that differential privacy protection is satisfied using the exponential mechanism of differential privacy.
Step 706: each financial institution independently runs the latent variable generation submodule, using Lagrange multipliers to optimize the maximum likelihood estimation of the observed-variable distribution and iterating until the overall distribution of the generated latent variables is stable.
Step 707: each financial institution independently runs the tree index construction submodule to generate its latent-variable index tree.
Step 708: after completing the above steps, each financial institution sends heartbeat messages to the other institutions to announce that it has finished building its local tree index, and waits for the other institutions to finish building their tree indexes.
Step 709: after all institutions have finished building their tree indexes, the financial institutions are combined pairwise into institution pairs, and they broadcast to each other parameters such as the computation order of the institution pairs and the maximum number of other data owners with which a single data owner may communicate simultaneously; once all institutions have confirmed the order and parameters over the network, the subsequent computations proceed in the negotiated order.
Step 710: each pair of financial institutions jointly runs a secure multi-party computation protocol and, under its encryption, runs the tree index matching submodule to compute the association strengths between latent variables while privacy protection is satisfied.
Step 711: after completing all tree-index matching involving itself, each financial institution sends heartbeat messages to the other institutions to announce that it has finished tree-index matching, and waits for the other institutions to finish their matching.
Step 712: the financial institutions jointly run the secure protocol and, under the encryption of the multi-party secure computation protocol, broadcast the previously computed mutual information between latent-variable pairs to all other data owners, until every data owner locally stores identical and complete association strength information between the leaf-layer latent-variable pairs.
Step 713: each financial institution independently runs the latent tree structure learning submodule, using the maximum spanning tree construction method with the leaf-layer latent variables and observed variables as nodes and the computed association strengths between variables as the weights of the corresponding edges, to construct an acyclic connected graph with the minimum sum of weights. A root node is chosen at random for the acyclic connected graph, and the parent-child relation of the node pair connected by each edge is determined according to the path length from the root node, yielding the latent tree structure.
Step 714: each financial institution independently runs the latent tree parameter learning submodule and, according to the generated latent tree structure, applies the Laplace mechanism from top to bottom to each pair of connected parent and child nodes to compute the conditional probabilities between parent and child nodes under the condition that differential privacy protection is satisfied.
Step 715: each financial institution independently runs the data generation submodule, computes the probability distribution of the root node over the original data set, samples the root node's generated data set according to this distribution, and then, from top to bottom and layer by layer, computes for each node the joint probability distribution of the parent and child nodes based on the parent node's currently generated data, generating noisy data for each node according to the joint distribution and a random distribution and obtaining a noisy generated data set that satisfies differential privacy protection.
Step 716: after completing data generation, each financial institution sends heartbeat messages to the other institutions and waits for all institutions to finish data generation; for any institution that cannot complete data generation, the generated data set is broadcast to it.
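Steps 701, 702 and 716 rely on a simple heartbeat and ID handshake before and after the computation. The following sketch illustrates the registration part only, with the network transport abstracted away; the class and field names are hypothetical and not taken from the embodiment.

```python
import random

class Participant:
    # Each party picks a random integer ID, counts the heartbeats it has seen,
    # and reports both to the next party in the ring (steps 701-702).
    def __init__(self, name):
        self.name = name
        self.uid = random.getrandbits(63)   # assumed unique with high probability
        self.heartbeats_seen = set()

    def receive_heartbeat(self, sender_name):
        self.heartbeats_seen.add(sender_name)

    def report(self):
        return {"id": self.uid, "sources": len(self.heartbeats_seen)}

def consistent(reports, expected_parties):
    # The run starts only if no party observes contradictory information.
    ids = [r["id"] for r in reports]
    return (len(set(ids)) == len(ids)
            and all(r["sources"] == expected_parties - 1 for r in reports))

# usage sketch: three banks that have all heard each other
parties = [Participant(f"bank{i}") for i in range(3)]
for p in parties:
    for q in parties:
        if p is not q:
            q.receive_heartbeat(p.name)
assert consistent([p.report() for p in parties], expected_parties=3)
```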
Embodiment 3
In this embodiment, a large enterprise deploys the apparatus across multiple internal departments to publish data that satisfies differential privacy protection for use by operations staff and outsourced personnel.
Description of the implementation environment: different departments within the enterprise hold different levels of data about the same group of individuals; the data resides in the data warehouses on each department's internal servers, isolated from the external network and from the networks of other departments. In this implementation, the data set generation apparatuses (which may be databases, big data platforms, servers or the like) communicate with each other through the enterprise intranet. As shown in Fig. 8, this embodiment includes the following steps.
Step 801: the data set generation apparatus deployed in each department opens a port and listens for desensitization requests from outside; each data source is bound to one data set generation apparatus.
Step 802: enterprise operations staff and outsourced personnel submit requests to the enterprise's data service department, requesting data about the same group of users that belongs to different departments.
Step 803: the data service department authenticates the applicant, resolves the data sources to which the requested data belongs, and sends desensitization requests to the data set generation apparatuses bound to those data sources.
Step 804: the data set generation apparatuses involved communicate to negotiate parameter configurations such as the unified encoding standard, the privacy parameter settings, the maximum number of variables per group and the discretization method, and broadcast the negotiated configuration to all data publishing participants.
Step 805: each apparatus calls the built-in UDFs (user-defined functions) provided by its local data warehouse to perform encoding conversion, missing-data filling, discretization and binarization on the original data, and stores the result in a temporary table.
Step 806: each apparatus runs the variable grouping submodule, combines the observed variables pairwise into observed-variable pairs, accesses the data warehouse to compute the mutual information between each pair of observed variables, and groups the observed variables under the condition that differential privacy protection is satisfied using the exponential mechanism of differential privacy.
Step 807: each apparatus independently runs the latent variable generation submodule, using Lagrange multipliers to optimize the maximum likelihood estimation of the observed-variable distribution and iterating until the overall distribution of the generated latent variables is stable.
Step 808: each apparatus independently runs the tree index construction submodule to generate its latent-variable index tree.
Step 809: after completing the above steps, each apparatus sends heartbeat messages to the other departments to announce that its department has finished building the local tree index, and waits for the other departments to finish building their tree indexes.
Step 810: after all departments have finished building their tree indexes, the departments are combined pairwise into department pairs, and they broadcast to each other parameters such as the computation order of the department pairs and the maximum number of other data owners with which a single data owner may communicate simultaneously; once all departments have confirmed the order and parameters, the subsequent computations proceed in the negotiated order.
Step 811: each pair of departments jointly runs a secure multi-party computation protocol and, under its encryption, runs the tree index matching submodule to compute the association strengths between latent variables while privacy protection is satisfied.
Step 812: after completing all tree-index matching involving itself, each apparatus sends heartbeat messages to the other apparatuses to announce that it has finished tree-index matching, and waits for the other apparatuses to finish their matching.
Step 813: the apparatuses jointly run the secure protocol and, under the encryption of the multi-party secure computation protocol, broadcast the previously computed mutual information between latent-variable pairs to all other apparatuses, until every apparatus locally stores identical and complete association strength information between the leaf-layer latent-variable pairs.
Step 814: each apparatus independently runs the latent tree structure learning submodule, using the maximum spanning tree construction method with the leaf-layer latent variables and observed variables as nodes and the computed association strengths between variables as the weights of the corresponding edges, to construct an acyclic connected graph with the minimum sum of weights. A root node is chosen at random for the acyclic connected graph, and the parent-child relation of the node pair connected by each edge is determined according to the path length from the root node, yielding the latent tree structure.
Step 815: each apparatus independently runs the latent tree parameter learning submodule and, according to the generated latent tree structure, applies the Laplace mechanism from top to bottom to each pair of connected parent and child nodes to compute the conditional probabilities between parent and child nodes under the condition that differential privacy protection is satisfied.
Step 816: each apparatus independently runs the data generation submodule, computes the probability distribution of the root node over the original data set, samples the root node's generated data set according to this distribution, and then, from top to bottom and layer by layer, computes for each node the joint probability distribution of the parent and child nodes based on the parent node's currently generated data, generating noisy data for each node according to the joint distribution and a random distribution and obtaining a noisy generated data set that satisfies differential privacy protection.
Step 817: after completing data generation, each apparatus sends heartbeat messages to the other apparatuses and waits for all apparatuses to finish data generation; for any apparatus that cannot complete data generation, the generated data set is broadcast to it.
Step 818: the generated data set is cached locally for future queries, and one apparatus is chosen to send the generated data to the data service department.
Step 819: after verifying the data, the data service department sends the data to the data requester.
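Several steps above (for example step 806 here and step 705 in Embodiment 2) group observed variables "using the exponential mechanism of differential privacy". A generic exponential-mechanism selector is sketched below as an illustration; the candidate groupings, the scoring function (for example total within-group mutual information) and the sensitivity value are assumptions supplied by the caller, not details taken from the embodiment.

```python
import math
import random

def exponential_mechanism(candidates, score, epsilon, sensitivity):
    # Pick one candidate with probability proportional to
    # exp(epsilon * score / (2 * sensitivity)).
    scores = [score(c) for c in candidates]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(epsilon * (s - m) / (2.0 * sensitivity)) for s in scores]
    r = random.uniform(0, sum(weights))
    acc = 0.0
    for cand, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return cand
    return candidates[-1]
```

A caller could, for instance, enumerate candidate partitions of the observed variables that respect the maximum group size and pass the sum of within-group mutual information as the score.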
Embodiment 4
In this embodiment, after collecting and purchasing data about different aspects of the same group of users from multiple channels, an enterprise desensitizes the data before storing it in its data warehouse, in order to avoid legal disputes that data trading and data collection may bring.
Description of the implementation environment: data about the same group of users, originating from multiple channels, enters the data set generation apparatus described in the foregoing embodiments (which may be a database, a big data platform, a server or the like) in the form of data streams; the apparatus desensitizes the data set and then stores the data set that satisfies differential privacy protection in the data warehouse. As shown in Fig. 9, this embodiment includes the following steps.
Step 901: each data source is bound to a data publishing apparatus, and the apparatuses send heartbeat messages to each other to confirm that the apparatuses are running normally and that the data streams are normal.
Step 902: each data publishing apparatus reads a data stream segment of the same length into memory, and then temporarily suspends data reading from its bound stream.
Step 903: each apparatus checks the integrity and validity of the data stream; if abnormal data appears, the batch is discarded and the next batch of stream data is read.
Step 904: the data sources communicate with each other and determine the size of the privacy parameter and the allocation strategy according to the size of the data stream.
Step 905: each apparatus runs the unified encoding submodule, the data filling submodule, the discretization submodule and the binarization submodule in turn to preprocess the stream data held in memory.
Step 906: each apparatus runs the variable grouping submodule, combines the observed variables pairwise into observed-variable pairs, accesses the data warehouse to compute the mutual information between each pair of observed variables, and groups the observed variables under the condition that differential privacy protection is satisfied using the exponential mechanism of differential privacy.
Step 907: each apparatus independently runs the latent variable generation submodule, using Lagrange multipliers to optimize the maximum likelihood estimation of the observed-variable distribution and iterating until the overall distribution of the generated latent variables is stable.
Step 908: each apparatus independently runs the tree index construction submodule to generate its latent-variable index tree.
Step 909: after completing the above steps, each apparatus sends heartbeat messages to the other apparatuses to announce that it has finished building its local tree index, and waits for the other apparatuses to finish building their tree indexes.
Step 910: after all apparatuses have finished building their tree indexes, the apparatuses are combined pairwise into apparatus pairs, and they broadcast to each other parameters such as the computation order of the apparatus pairs and the maximum number of other apparatuses with which a single apparatus may communicate simultaneously; once all apparatuses have confirmed the order and parameters, the subsequent computations proceed in the negotiated order.
Step 911: each pair of apparatuses jointly runs a secure multi-party computation protocol and, under its encryption, runs the tree index matching submodule to compute the association strengths between latent variables while privacy protection is satisfied.
Step 912: after completing all tree-index matching involving itself, each apparatus sends heartbeat messages to the other apparatuses to announce that it has finished tree-index matching, and waits for the other apparatuses to finish their matching.
Step 913: the apparatuses jointly run the secure protocol and, under the encryption of the multi-party secure computation protocol, broadcast the previously computed mutual information between latent-variable pairs to all other apparatuses, until every apparatus locally stores identical and complete association strength information between the leaf-layer latent-variable pairs.
Step 914: each apparatus independently runs the latent tree structure learning submodule, using the maximum spanning tree construction method with the leaf-layer latent variables and observed variables as nodes and the computed association strengths between variables as the weights of the corresponding edges, to construct an acyclic connected graph with the minimum sum of weights. A root node is chosen at random for the acyclic connected graph, and the parent-child relation of the node pair connected by each edge is determined according to the path length from the root node, yielding the latent tree structure.
Step 915: each apparatus independently runs the latent tree parameter learning submodule and, according to the generated latent tree structure, applies the Laplace mechanism from top to bottom to each pair of connected parent and child nodes to compute the conditional probabilities between parent and child nodes under the condition that differential privacy protection is satisfied.
Step 916: each apparatus independently runs the data generation submodule, computes the probability distribution of the root node over the original data set, samples the root node's generated data set according to this distribution, and then, from top to bottom and layer by layer, computes for each node the joint probability distribution of the parent and child nodes based on the parent node's currently generated data, generating noisy data for each node according to the joint distribution and a random distribution and obtaining a noisy generated data set that satisfies differential privacy protection.
Step 917: after completing data generation, each apparatus sends a message to the other apparatuses announcing that it has finished data generation; upon receiving the message, the other apparatuses immediately stop their data publishing process, so as to minimize time consumption.
Step 918: the apparatus that finishes data generation first verifies the data and then stores it in the data warehouse.
Step 919: each apparatus continues to read the next batch of data from its stream and continues, following the above steps, to desensitize the data before it is warehoused.
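Steps 902, 903 and 919 describe a per-batch loop around the generation pipeline: read a fixed-length batch, validate it, discard it on anomalies, and otherwise desensitize it before warehousing. A compact sketch of that loop is given below, with the generation steps collapsed into a `publish` callback; `stream` is assumed to be an iterator of records, and `validate` and `publish` are callables supplied by the deployment rather than parts of the embodiment.

```python
def desensitize_stream(stream, batch_size, validate, publish):
    # Per-batch loop sketched from steps 902-919.
    while True:
        batch = [rec for rec in (next(stream, None) for _ in range(batch_size))
                 if rec is not None]
        if not batch:
            break                      # stream exhausted
        if not all(validate(rec) for rec in batch):
            continue                   # discard the anomalous batch, read the next one
        publish(batch)                 # stands in for the generation steps 904-918
```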
An embodiment of the present invention further provides a storage medium. In this embodiment, the storage medium may be configured to store program code for executing the steps of the foregoing embodiments. In this embodiment, the storage medium may include, but is not limited to, various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
Obviously, those skilled in the art should understand that the above modules or steps of the present invention can be implemented by a general-purpose computing apparatus; they can be concentrated on a single computing apparatus or distributed over a network formed by multiple computing apparatuses; optionally, they can be implemented by program code executable by a computing apparatus, so that they can be stored in a storage apparatus and executed by the computing apparatus; in some cases, the steps shown or described can be executed in an order different from that given here, or they can be made into individual integrated circuit modules, or multiple of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (15)

  1. A data set generation method, comprising:
    each data owner of multi-party vertically partitioned data obtaining the mutual information of observed-variable pairs in a local original data set and generating leaf-layer latent variables;
    each data owner building a tree index locally, the data owners being combined pairwise into data-owner pairs, and tree-index matching and computation of mutual information between leaf-layer latent variables being performed;
    each data owner performing latent tree structure learning and latent tree parameter learning, each generating a latent tree locally; and
    each data owner generating a target data set from top to bottom according to the learned latent tree structure and latent tree parameters.
  2. The data set generation method according to claim 1, wherein each data owner obtaining the mutual information of observed-variable pairs in the local original data set and generating leaf-layer latent variables comprises:
    preprocessing the original data set to obtain a regularized observed-variable data set, wherein the preprocessing comprises at least one of: unified encoding, missing-data filling, discretization, binarization; and
    combining the observed variables in the observed-variable data set pairwise into observed-variable pairs, computing the mutual information between each pair of observed variables, and generating the leaf-layer latent variables under the condition that differential privacy protection is satisfied.
  3. The data set generation method according to claim 2, wherein each data owner building a tree index locally, the data owners being combined pairwise into data-owner pairs, and tree-index matching and computation of mutual information between leaf-layer latent variables being performed comprises:
    for each data owner, combining the leaf-layer latent variables pairwise into latent-variable pairs and computing the mutual information between each pair of leaf-layer latent variables;
    based on the mutual information between the leaf-layer latent variables, grouping the leaf-layer latent variables under the condition that differential privacy protection is satisfied to generate upper-layer latent variables, and repeating, from bottom to top, the step of generating upper-layer latent variables from the leaf-layer latent variables until only one latent-variable node remains in the upper layer;
    taking the latent-variable node as a root node, forming a tree index from the root node, the connecting edges between parent and child nodes, and the latent-variable nodes, and storing the tree index locally at the data owner;
    combining the data owners pairwise into data-owner pairs, the data owners exchanging negotiated parameters, wherein the parameters comprise at least one of: the pairing of the data owners, the execution order of the data-owner pairs in subsequent computations, the maximum number of other data owners with which a single data owner can communicate simultaneously;
    each pair of data owners running, under the encryption of a secure multi-party computation protocol, the latent-variable mutual information computation based on tree-index matching; and
    the data owners broadcasting, under the encryption of the multi-party secure computation protocol, the computed mutual information between latent-variable pairs to all other data owners, until every data owner locally stores identical and complete association strengths between latent-variable pairs.
  4. The data set generation method according to claim 3, wherein each data owner performing latent tree structure learning and latent tree parameter learning and each generating a latent tree locally comprises:
    each data owner independently running a maximum spanning tree construction method locally, taking the leaf-layer latent variables and the observed variables as nodes and the association strengths between variables as the weights of the corresponding connecting edges, to construct an acyclic connected graph with the minimum sum of weights; and
    selecting a root node for the acyclic connected graph, and determining the parent-child relation of the node pair connected by each connecting edge according to the path length from the root node, to obtain the latent tree structure.
  5. The data set generation method according to claim 4, wherein each data owner generating the target data set from top to bottom according to the learned latent tree structure and parameters comprises:
    each data owner computing, from top to bottom and under the condition that differential privacy protection is satisfied, the conditional probability between each pair of connected parent and child nodes according to the generated latent tree structure; and
    each data owner computing the probability distribution of the root node over the original data set, sampling the generated data set corresponding to the root node according to the probability distribution, then computing the joint probability distribution of the parent and child nodes for each node from top to bottom and layer by layer, and generating noisy data for each node according to the joint distribution probability and a random distribution, to generate the target data set.
  6. The data set generation method according to claim 1, wherein after each data owner generates the target data set from top to bottom according to the learned latent tree structure and parameters, the method further comprises:
    a data owner that has completed generation of the target data set sending a message to the other data owners, waiting for all data owners to complete generation of the target data set, and broadcasting its generated target data set to any data owner that cannot complete generation of the target data set.
  7. A data set generation apparatus, comprising:
    a latent variable generation module, configured to obtain, for each data owner of multi-party vertically partitioned data, the mutual information of observed-variable pairs in a local original data set and generate leaf-layer latent variables;
    a mutual information computation module, configured to build a tree index locally for each data owner, combine the data owners pairwise into data-owner pairs, and perform tree-index matching and computation of mutual information between latent variables;
    a latent tree generation module, configured to perform latent tree structure learning and latent tree parameter learning for each data owner, each generating a latent tree locally; and
    a data set generation module, configured to generate, for each data owner, a target data set from top to bottom according to the learned latent tree structure and latent tree parameters.
  8. The data set generation apparatus according to claim 7, wherein the latent variable generation module comprises:
    a data preprocessing submodule, configured to preprocess the original data set to obtain a regularized observed-variable data set, wherein the preprocessing comprises at least one of: unified encoding, missing-data filling, discretization, binarization; and
    a latent variable generation submodule, configured to combine the observed variables in the observed-variable data set pairwise into observed-variable pairs, compute the mutual information between each pair of observed variables, and generate the leaf-layer latent variables under the condition that differential privacy protection is satisfied.
  9. The data set generation apparatus according to claim 8, wherein the mutual information computation module comprises:
    a latent-variable pair computation submodule, configured to, for each data owner, combine the leaf-layer latent variables pairwise into latent-variable pairs and compute the mutual information between each pair of latent variables;
    an upper-layer latent variable generation submodule, configured to, based on the mutual information between the leaf-layer latent variables, group the leaf-layer latent variables under the condition that differential privacy protection is satisfied to generate upper-layer latent variables, and repeat, from bottom to top, the step of generating upper-layer latent variables from the leaf-layer latent variables until only one latent-variable node remains in the upper layer;
    a tree index construction module, configured to take the latent-variable node as a root node, form a tree index from the root node, the connecting edges between parent and child nodes, and the latent-variable nodes, and store the tree index locally at the data owner;
    a negotiation submodule, configured to combine the data owners pairwise into data-owner pairs, the data owners exchanging negotiated parameters, wherein the parameters comprise at least one of: the pairing of the data owners, the execution order of the data-owner pairs in subsequent computations, the maximum number of other data owners with which a single data owner can communicate simultaneously;
    a computation module, configured to, for each pair of data owners, run the latent-variable mutual information computation based on tree-index matching under the encryption of a secure multi-party computation protocol; and
    a broadcast module, configured for the data owners to broadcast, under the encryption of the multi-party secure computation protocol, the computed mutual information between latent-variable pairs to all other data owners, until every data owner locally stores identical and complete association strengths between latent-variable pairs.
  10. The data set generation apparatus according to claim 9, wherein the latent tree generation module comprises:
    an acyclic connected graph construction submodule, configured for each data owner to independently run a maximum spanning tree construction method locally, taking the leaf-layer latent variables and the observed variables as nodes and the association strengths between variables as the weights of the corresponding connecting edges, to construct an acyclic connected graph with the minimum sum of weights; and
    a latent tree structure acquisition submodule, configured to select a root node for the acyclic connected graph and determine the parent-child relation of the node pair connected by each connecting edge according to the path length from the root node, to obtain the latent tree structure.
  11. The data set generation apparatus according to claim 7, wherein the data set generation module comprises:
    a probability computation module, configured for each data owner to compute, from top to bottom and under the condition that differential privacy protection is satisfied, the conditional probability between each pair of connected parent and child nodes according to the generated latent tree structure; and
    a data set generation submodule, configured for each data owner to compute the probability distribution of the root node over the original data set, sample the generated data set corresponding to the root node according to the probability distribution, then compute, from top to bottom and layer by layer, the joint probability distribution of the parent and child nodes for each node based on the parent node's currently generated data, and generate noisy data for each node according to the joint distribution probability and a random distribution, to generate the target data set.
  12. The data set generation apparatus according to claim 7, wherein the apparatus further comprises:
    a publishing module, configured for a data owner that has completed generation of the target data set to send a message to the other data owners, wait for all data owners to complete generation of the target data set, and broadcast its generated target data set to any data owner that cannot complete generation of the target data set.
  13. A data set generation system, comprising a plurality of data set generation apparatuses according to any one of claims 7 to 12, wherein each data set generation apparatus corresponds to the processing of one data owner's data, and all the data set generation apparatuses are connected through a network.
  14. A storage medium, comprising a stored program, wherein the program, when run, executes the method according to any one of claims 1 to 6.
  15. A processor, configured to run a program, wherein the program, when run, executes the method according to any one of claims 1 to 6.
PCT/CN2019/084345 2018-06-14 2019-04-25 数据集生成方法及装置 WO2019237840A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810615202.5A CN110610098B (zh) 2018-06-14 2018-06-14 数据集生成方法及装置
CN201810615202.5 2018-06-14

Publications (1)

Publication Number Publication Date
WO2019237840A1 true WO2019237840A1 (zh) 2019-12-19

Family

ID=68841920

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/084345 WO2019237840A1 (zh) 2018-06-14 2019-04-25 数据集生成方法及装置

Country Status (2)

Country Link
CN (1) CN110610098B (zh)
WO (1) WO2019237840A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765664B (zh) * 2021-01-26 2022-12-27 河南师范大学 一种具有差分隐私的安全多方k-means聚类方法

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512247A (zh) * 2015-11-30 2016-04-20 上海交通大学 基于一致性特征的非交互式差分隐私发布模型的优化方法
CN105528681A (zh) * 2015-12-21 2016-04-27 大连理工大学 一种基于隐树模型的冶金企业副产能源系统实时调整方法
CN106156858A (zh) * 2015-03-31 2016-11-23 日本电气株式会社 分片线性模型生成系统和生成方法

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009437B (zh) * 2016-10-27 2022-11-22 中兴通讯股份有限公司 数据发布方法和装置及终端

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113112090A (zh) * 2021-04-29 2021-07-13 内蒙古电力(集团)有限责任公司内蒙古电力经济技术研究院分公司 基于综合互信息度的主成分分析的空间负荷预测方法
CN113112090B (zh) * 2021-04-29 2023-12-19 内蒙古电力(集团)有限责任公司内蒙古电力经济技术研究院分公司 基于综合互信息度的主成分分析的空间负荷预测方法
CN114218602A (zh) * 2021-12-10 2022-03-22 南京航空航天大学 一种基于垂直分割的差分隐私异构多属性数据发布方法
CN114218602B (zh) * 2021-12-10 2024-06-07 南京航空航天大学 一种基于垂直分割的差分隐私异构多属性数据发布方法
CN117371036A (zh) * 2023-10-19 2024-01-09 湖南工商大学 多模态交通流查询的格雷码差分隐私保护方法及装置
CN117371036B (zh) * 2023-10-19 2024-04-30 湖南工商大学 多模态交通流查询的格雷码差分隐私保护方法及装置

Also Published As

Publication number Publication date
CN110610098B (zh) 2023-05-30
CN110610098A (zh) 2019-12-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19819719; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12-05-2021))
122 Ep: pct application non-entry in european phase (Ref document number: 19819719; Country of ref document: EP; Kind code of ref document: A1)