CN112765653A - Multi-source data fusion privacy protection method based on multi-privacy policy combination optimization - Google Patents


Info

Publication number: CN112765653A (application CN202110014817.4A; granted as CN112765653B)
Authority: CN (China)
Prior art keywords: data, privacy, node, attribute, value
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN112765653B
Inventors: 周志刚, 白增亮, 王宇, 梁子恺, 吴天生
Current assignee: Shancai Hi Tech Shanxi Co ltd
Original assignee: Shancai Hi Tech Shanxi Co ltd
Events: application filed by Shancai Hi Tech Shanxi Co ltd (priority to CN202110014817.4A); publication of CN112765653A; application granted; publication of CN112765653B

Classifications

    • G06F 21/6245 — Protecting personal data, e.g. for financial or medical purposes (under G06F 21/62: protecting access to data via a platform, e.g. using keys or access control rules; G06F 21/60: protecting data; G06F 21/00: security arrangements)
    • G06F 18/251 — Fusion techniques of input or preprocessed data (under G06F 18/25: fusion techniques; G06F 18/20: analysing; G06F 18/00: pattern recognition)
    • G06F 18/29 — Graphical models, e.g. Bayesian networks (under G06F 18/20: analysing; G06F 18/00: pattern recognition)


Abstract

The invention belongs to the field of data release, and particularly relates to a multi-source data fusion privacy protection method based on multi-privacy policy combination optimization. A multi-party data fusion architecture based on re-anonymity is provided, preventing the privacy of the fused data from being leaked. Further, the practical significance of data fusion is to provide a more comprehensive data basis on which users can conduct extensive knowledge mining. Therefore, a combined optimization scheme for multiple privacy protection strategies is designed, which maximizes the availability of the fused data while meeting the privacy constraints of all parties. The strategy maps the multi-source, multi-privacy-constraint data fusion into a hypergraph; using heuristic rules, hyper-edges are selected, solved, and eliminated on the hypergraph one by one, so the process of resolving the hypergraph is also the process of realizing the privacy constraints one by one, and the data fusion scheme is formulated accordingly.

Description

Multi-source data fusion privacy protection method based on multi-privacy policy combination optimization
Technical Field
The invention belongs to the field of data release, and particularly relates to a multi-source data fusion privacy protection method based on multi-privacy policy combination optimization.
Background
Multi-source cross-platform data and cross-domain data applications are the most prominent characteristics of big data. In the big data era, owing to the explosive growth of data in different application fields, a single type of data (such as location data, social data, Cookie logs, shopping-website transaction flows, and the like) can hardly satisfy people's requirements for complex upper-layer application services. For example, if Bob needs an App to find nearby friends who like to play basketball, fulfilling this need requires the organic fusion of location data with social data. Such cross-domain data fusion needs of individuals, as well as the real demands and applications of cross-domain data fusion between different departments within an enterprise, between enterprises of different natures, and even between enterprises and government departments — such as accurate advertisement push, ride-hailing optimization and management, and smart-city subway line planning — require the data owners of the platforms in different fields to cooperate deeply at the level of their own data. However, the data of each platform often has great use value and may include sensitive/private information such as the identity information, behavior information, financial information, and even disease information of users, and directly publishing the original data will necessarily lead to the disclosure of user privacy.
To prevent user privacy from being revealed, before big data are fused and released, the data sets of the respective platforms need to undergo desensitization processing (such as perturbation, noise addition, generalization, and the like). Most conventional anonymous privacy protection methods only protect the data of a single data source and cannot effectively address the non-explicit disclosure of private information brought about by deep association analysis over big data; moreover, a single privacy protection method cannot meet the personalized privacy requirements of data users, just as the local privacy protection of data from various sources cannot avoid the risk of global privacy disclosure after fusion. For example, suppose Alice purchases an air ticket to Munich at ticketing website A and browses tourist attractions of Munich on a webpage of travel company B. A and B each disclose information for outsourcing: company A adopts an information generalization technique based on 3-anonymity, generalizing "air ticket to Munich" into "air ticket to Europe", while company B adopts a 3-diversity technique, publishing as one group the browsing behavior of Alice and two other users of its website, {2017-07-11 9:30: {Munich: Neuschwanstein, Japan: Mount Fuji, USA: Massachusetts Institute of Technology}}. Assuming the adversary knows that Alice has a travel plan to go abroad and learns from a stolen internet log that she has logged into the webpages of companies A and B, then by correlating the information published by the two companies the adversary can accurately deduce when Alice will travel the Munich–Neuschwanstein route. This is also the most essential problem facing privacy protection in big data release: privacy disclosure caused by an attacker constructing data association analysis after the distributed, multi-source big data are fused. One naive approach is to combine the privacy protection methods over the naturally joined fused data at method-level granularity. However, combination at method-level granularity may result in "over-protection" of the private information and thereby severely reduce the availability of the data, as shown in FIG. 1: in a two-party data fusion, scheme I (first 5-anonymity, then 3-diversity) requires 29 pieces of noise to be added, while scheme II (first 3-diversity, then 5-anonymity) requires 20. Hence the fine-grained combined optimization of multiple privacy protection methods that maximizes data availability remains an open problem in the field of privacy-protected big data fusion and release.
In the field of privacy protection for data release, traditional privacy protection algorithms include differential privacy, k-anonymity, l-diversity anonymity, t-closeness anonymity, and the like, and some scholars' improvements on the traditional algorithms also have milestone significance. For example, by means of a semantic hierarchy tree, Wang et al. semantically generalize the records whose number falls short of the anonymity requirement so that they achieve k-anonymity under broader semantics; however, the record generalization technique causes irreversible information loss, and applying the k-anonymity criterion to high-dimensional sparse data greatly reduces data availability. Brijesh B. et al. propose a method that improves l-diversity anonymity with a significant improvement in running time and, thanks to the close arrangement of records in the initial equivalence classes, less information loss than existing methods while providing the same level of privacy. In general, these conventional privacy protection models are usually only applicable to static data release in specific scenarios. The risk faced by big data release, however, lies in the dynamic nature of the release process and its multi-source cross-platform character, so an attacker must be prevented from performing association analysis on the data after multi-source fusion and thereby breaking the anonymity of the data.
In the aspect of privacy protection for data fusion, H. Patel et al. propose a secure bottom-up method for fusing two parties' data, but the premise of the model is that a trusted third party fuses all the data into a complete original data table and then anonymizes that table; since a trusted third party does not exist in most cases, the method is of limited practical value. Jiang et al. propose the DkA secure fusion model for two parties' data under the semi-honest model; the algorithm uses a commutative encryption strategy to hide the original information during communication and judges whether the anonymity threshold k is met by constructing a complete anonymous table, thereby realizing privacy protection during data fusion, but the resource consumption of the method is too large and it is not suitable for fusing large data sets. Clifton et al. developed secure multi-party data integration tools for four typical operations on relational data: counting, union, intersection, and Cartesian product. Yeom et al. studied the indirect privacy disclosure caused by insufficient model generalization capability, and subsequently Mohammed et al. realized data privacy protection for each party of the data integration using a data generalization technique based on a classification tree structure, but the information loss of the integrated data is high, the specific degree of loss depending on the data set. These schemes all assume that the multiple parties participating in the data fusion adopt the same privacy protection strategy; however, facing the different privacy protection requirements of big data, different platforms may adopt personalized privacy protection strategies according to their own application requirements before fusion, and the existing schemes are then difficult to apply.
Disclosure of Invention
The invention provides a multi-source data fusion privacy protection method based on multi-privacy policy combination optimization. Specifically, this patent first proposes a multi-party data fusion architecture based on re-anonymity, in which the inner-layer data anonymity exists before data fusion and is implemented by the respective local data owners to give the data initial protection, while the outer-layer data anonymity occurs during data fusion and is implemented by the multiple parties participating in the fusion according to an agreed multi-party privacy protection protocol (for simplicity of description, this anonymity is regarded as simultaneously satisfying the privacy constraints of all parties), preventing the privacy of the fused data from being leaked. Further, the practical significance of data fusion is to provide a more comprehensive data basis on which users can conduct extensive knowledge mining. Therefore, this patent designs a combined optimization scheme for multiple privacy protection strategies, which maximizes the availability of the fused data while meeting the privacy constraints of all parties. The strategy maps the multi-source, multi-privacy-constraint data fusion into a hypergraph; using heuristic rules, hyper-edges are selected, solved, and eliminated on the hypergraph one by one, so the process of resolving the hypergraph is also the process of realizing the privacy constraints one by one, and the data fusion scheme is formulated accordingly.
In order to achieve the above technical purpose and effect, the invention is realized by the following technical scheme:
step 1, constructing a data multi-source fusion system model:
As shown in FIG. 2, first, in the system model, the data owners collect data from each party, and to prevent privacy disclosure each party performs a data anonymization operation. Second, since the data volume of some entities is huge, the data must be stored in a public cloud; the data fusion performed by the public cloud organically integrates the multi-source cross-platform data, aiming to mine useful information better by fusing the complete data sets of all parties, but simply fusing the data cannot eliminate the concern that the public cloud snoops on privacy after the fusion, so the public cloud must also perform a re-anonymity operation. In addition, users enjoy the convenience of big data by customizing the services they need, yet unknown attackers may hide among the users, so it is assumed here that users are also "curious", i.e., users are treated as a suspected privacy-mining group with the same attack capability as the cloud service provider.
Step 2, designing a multiple data fusion anonymity framework:
aiming at frequent cross-platform communication and sharing of big data information, the patent provides a re-anonymity framework based on multi-party data fusion, which comprises an initial state, a handshake process, data synchronization and secondary anonymity. Firstly, carrying out corresponding anonymization operation on respective data by each party in an initial state according to respective privacy protection requirements; secondly, the handshake process carries out multiparty communication, and each party issues respective data privacy protection requirements; data synchronization, namely, in the process, the distribution of public attribute values of multi-party data needs to be consistent, and meanwhile, the requirement of privacy protection of multiple parties is met; the method comprises the steps of firstly establishing a hierarchical structure diagram, secondly establishing anonymity, and finally establishing a probabilistic reasoning problem.
Step 3, realize the privacy protection policy:
Given the privacy protection policy, the Bayesian network G is finally formed through the two processes above, and operations on the attributes X_1, …, X_d are then required so that the privacy node X_s meets the policy requirements. Specifically, for a privacy protection policy F = (G, IA, SA, OP, V), this patent defines the unit privacy protection operation: the privacy budget is divided into d equal parts, and each round applies privacy protection to the probability distribution of only one selected attribute node. For the attribute X_i on which a privacy operation is to be performed, four privacy protection operations are realized: k-anonymity, l-diversity, t-closeness, and attribute-value generalization.
Step 4, map the data fusion with multiple privacy constraints into a hypergraph, design corresponding heuristic rules, and reduce the data fusion process to a hypergraph resolution process, so that the availability of the data is improved while the privacy constraints are met.
Further, in step 1, a data fusion model and a privacy attack model are constructed under an adversarial learning architecture. The specific steps are as follows:
Step 1.1, construct the data fusion model:
Data fusion organically integrates data belonging to multiple sources, aiming to mine useful information better through a more complete data set than before, so as to provide high-quality services for users. For ease of discussion, a formal description of the data is first given: a data set can be represented as a quadruple D(X, A, F, V), where X = {x_1, x_2, …, x_n} is the set of data records and each record x_i is uniquely associated with one dedicated user u_i; A is the attribute set; further, the attributes are divided according to their sensitivity into an information attribute set IA and a sensitive attribute set SA, with IA ∪ SA = A and IA ∩ SA = ∅; F is the set of relations between X and A; and V = ∪_{a_k ∈ A} V_{a_k}, where V_{a_k} is the value range of the attribute a_k.
Definition 1 (equivalence class): given a data set D(X, A, F, V), for any A′ ⊆ A, if there exist t records {x_1, x_2, …, x_t} (t ≥ 1) such that F(x_i, a) = F(x_j, a) for every a ∈ A′ and all 1 ≤ i, j ≤ t, then {x_1, x_2, …, x_t} is an equivalence class on D with respect to A′, denoted [x_i]_{A′}; accordingly, the set E_{A′} of all equivalence classes formed by the attribute set A′ constitutes a partition of D, denoted D/E_{A′}. In particular, if A′ ⊆ IA, the corresponding equivalence class is called an information equivalence class.
Definition 2 (data fusion): given m data sets {D_1, …, D_m}, the fused data set D(X, {IA, SA}, F, V) satisfies:

X = ∪_{i=1..m} X_i,  IA = ∪_{i=1..m} IA_i,  SA = ∪_{i=1..m} SA_i,  F = ∪_{i=1..m} F_i,  V = ∪_{i=1..m} V_i

In particular, if two data sets to be fused, D_i and D_j, satisfy IA_i Δ IA_j ≠ ∅ (Δ denotes the symmetric-difference operator), the fusion is called information-increment fusion; if there exists a record x_k ∈ X_i ∩ X_j whose information attribute values in D_i and D_j are given at different granularities of the same attributes, the fusion is called information-refinement fusion; if every record x_k ∈ X_i ∩ X_j satisfies F_i(x_k, SA_i) = F_j(x_k, SA_j) (where SA_i = SA_j), the fusion is called harmonious fusion. The research scope of this patent is harmonious information-increment and information-refinement fusion.
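As a concrete illustration of Definition 1, the following Python sketch partitions a toy table into equivalence classes; the dict-based record layout and the sample values are our assumptions, not patent data.

```python
from collections import defaultdict

def equivalence_classes(records, attrs):
    """Partition `records` into equivalence classes over the attribute
    subset attrs (the partition D/E_A' in the patent's notation)."""
    classes = defaultdict(list)
    for x in records:
        key = tuple(x[a] for a in attrs)   # the values F(x, a) for a in A'
        classes[key].append(x)
    return list(classes.values())

D = [
    {"zip": "476**", "age": "2*", "disease": "flu"},
    {"zip": "476**", "age": "2*", "disease": "cancer"},
    {"zip": "479**", "age": "3*", "disease": "flu"},
]
# information equivalence classes: A' = {zip, age} is a subset of IA
for c in equivalence_classes(D, ["zip", "age"]):
    print(len(c), c)
```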
Step 1.2, construct the privacy and privacy attack models:
Privacy here refers to the injection (one-to-one mapping) from a user to the corresponding sensitive attribute value; if this injective relationship is revealed, the user's privacy is revealed. According to the data model, users and data records are in one-to-one correspondence, and from the data aspect each user corresponds to a group of information attribute values, i.e., there is an injection between the user and the information attribute values, and the information equivalence class containing this group of information attribute values in turn corresponds to a set of privacy attribute values. By the transitivity of injections, if the group of information attribute values and the corresponding privacy attribute value set form an injection (that is, the privacy attribute value set contains only one element), then releasing the data set discloses that user's privacy. More generally:
Definition 3 (data privacy disclosure): given a data set D(X, {IA, SA}, F, V), for any x_i ∈ X let [x_i]_{IA} be the information equivalence class to which it belongs and, for the sensitive attribute set SA, denote the corresponding privacy attribute value set by V_{SA}([x_i]_{IA}). If |V_{SA}([x_i]_{IA})| = 1, the data privacy is said to be disclosed.
Definition 4 (knowledge-based attack): suppose the adversary knows the target user u_i's information attribute values F(x_i, IA) and knows that the user's data record x_i is in the data set D(X, {IA, SA}, F, V) to be published. When the data is published, the adversary can build the following chain of relationships:

u_i → F(x_i, IA) → [x_i]_{IA} → V_{SA}([x_i]_{IA})

and from it form a privacy inference probability, i.e., for any v_j ∈ V_{SA}([x_i]_{IA}), the probability that user u_i takes the value v_j on SA is

P(u_i, v_j) = C(x ∈ [x_i]_{IA} : F(x, SA) = v_j) / C([x_i]_{IA})

(where C(·) is a counting statistical function over the indicated domain of discourse).
Definition 5 (multi-version attack under incremental data release): given a first published data set D(X, {IA, SA}, F, V) and the corresponding updated data set D′(X′, {IA′, SA′}, F′, V′) published later, suppose the adversary compares the records x_i and x_i′ of a dedicated user u_i in the two versions; the adversary can construct the following relationship:

V_{SA}([x_i]_{IA}) ∩ V_{SA′}([x_i′]_{IA′})

and form a privacy inference probability, i.e., for any v_j in the intersection of the two privacy value sets, the adversary can infer the user's privacy with probability

P(u_i, v_j) = C(SEL(v_j)) / C(V_{SA}([x_i]_{IA}) ∩ V_{SA′}([x_i′]_{IA′}))

(where SEL(·) is a selection function).
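Definitions 3 and 4 can be checked mechanically on top of the `equivalence_classes` sketch above (reusing its toy table `D`); the helper names and the "disease" sensitive attribute are illustrative assumptions.

```python
def privacy_values(eq_class, sa="disease"):
    """V_SA([x]_IA): the sensitive values appearing inside one class."""
    return {x[sa] for x in eq_class}

def discloses_privacy(eq_class, sa="disease"):
    """Definition 3: disclosure when the sensitive-value set is a singleton."""
    return len(privacy_values(eq_class, sa)) == 1

def inference_probability(eq_class, value, sa="disease"):
    """Definition 4: P(SA = value) = C(value) / C(equivalence class)."""
    return sum(1 for x in eq_class if x[sa] == value) / len(eq_class)

for c in equivalence_classes(D, ["zip", "age"]):
    probs = {v: inference_probability(c, v) for v in privacy_values(c)}
    print(discloses_privacy(c), probs)
```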
Further, the vertical encoding in step 2 comprises two stages: a Bayesian network structure learning stage and a network encoding stage.
The specific steps of the Bayesian network structure learning stage are as follows:
Step 2.1: consider learning a Bayesian network structure from a data set D = {D_1, …, D_n} over a set of m random variables X = {X_1, …, X_m}. It is assumed that the variables are categorical (i.e., the number of states of each variable is finite) and that the data set is complete. The goal of the Bayesian network construction algorithm is, by defining a parent set Π_1, …, Π_m for each variable over the node set X, to find the highest-scoring directed acyclic graph (DAG) G. Assuming the Markov condition, a joint probability distribution is induced in which each variable is conditionally independent of its non-descendant variables given its parents.
Step 2.2: different scoring functions can be used to evaluate the quality of the generated DAG; in this patent we use the Bayesian Information Criterion (BIC) score, which is proportional to the posterior probability of the DAG. BIC is decomposable, consisting of the sum of the scores of each variable and its parent set:

BIC(G) = Σ_{i=1..m} ( LL(X_i | Π_i) − Pen(X_i | Π_i) )

where LL(X_i | Π_i) is the log-likelihood function of X_i with its parent set Π_i:

LL(X_i | Π_i) = Σ_{π ∈ |Π_i|} Σ_{x ∈ |X_i|} N_{x,π} · log θ̂_{x|π}

and Pen(X_i | Π_i) is the complexity penalty function of X_i with its parent set Π_i:

Pen(X_i | Π_i) = (log N / 2) · |Π_i| · (|X_i| − 1)

Here θ̂_{x|π} is the maximum-likelihood estimate of the conditional probability P(X_i = x | Π_i = π), N_{x,π} denotes the number of times (X_i = x, Π_i = π) occurs in the data set, and |·| denotes the size of the Cartesian product of the value spaces of the given variables.
This patent uses the hill-climbing method to generate the Bayesian network of the corresponding data; the main steps are shown in Algorithm 1.
Algorithm 1: Bayesian network structure generation based on the hill-climbing method
[Algorithm 1 is rendered as an image in the original publication.]
It should be noted that the "flip edge" operation cannot simply be regarded as the sequence "delete an edge, then add the edge in the opposite direction". Because the algorithm adopts a greedy strategy, the edge-deletion step alone may reduce the BIC score of the Bayesian network and terminate the program early, so the corresponding addition of the reversed edge would never be carried out.
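Since Algorithm 1 itself is only available as an image, the following is a minimal hill-climbing sketch over the `bic` score above, using the usual add/delete/flip neighbourhood; the acyclicity check and move ordering are simplifying assumptions, not the patent's exact Algorithm 1.

```python
import itertools

def is_acyclic(structure):
    """structure: dict node -> set of parents. DFS-based cycle detection."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in structure}
    def visit(v):
        color[v] = GRAY
        for p in structure[v]:
            if color[p] == GRAY or (color[p] == WHITE and not visit(p)):
                return False
        color[v] = BLACK
        return True
    return all(visit(v) for v in structure if color[v] == WHITE)

def hill_climb(data, variables, score):
    structure = {v: set() for v in variables}
    best = score(data, structure)
    improved = True
    while improved:
        improved = False
        for u, v in itertools.permutations(variables, 2):
            for move in ("add", "delete", "flip"):
                trial = {k: set(ps) for k, ps in structure.items()}
                if move == "add" and u not in trial[v]:
                    trial[v].add(u)
                elif move == "delete" and u in trial[v]:
                    trial[v].remove(u)
                elif move == "flip" and u in trial[v]:
                    trial[v].remove(u); trial[u].add(v)
                else:
                    continue
                if is_acyclic(trial):
                    s = score(data, trial)
                    if s > best:                 # greedy: keep any improving move
                        structure, best, improved = trial, s, True
    return structure
```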
The specific steps of the Bayesian network encoding stage are as follows:
step 2.3, a hierarchical structure diagram is constructed by longitudinally encoding a Bayesian network, wherein the hierarchical structure diagram comprises two stages: a bottom-up encoding phase and a top-down modification phase. Specifically, given a bayesian network, it can be converted into a hierarchical structure diagram by encoding, and algorithm 2 is a bayesian network encoding process:
algorithm 2 Bayesian network vertical coding
Figure BDA0002886363110000072
Figure BDA0002886363110000081
1) Bottom-up encoding phase. First, the level of every node is initialized to zero; the algorithm then keeps marking from the leaf nodes and progressively traces the corresponding parent nodes. In each round, when the level of a child node is q, the level of its parent node is marked q + 1. For non-leaf nodes, only the current maximum code is recorded: if a node's code is not 0, the new code is compared with the existing one and the larger is kept; if the new code equals the existing one, the upward backtracking from that node stops; the backtracking also stops once the leaf-node queue is empty. The next leaf node is then extracted for marking, until the leaf-node sequence is empty.
2) Top-down correction phase. All nodes are first sorted by level from largest to smallest, and all node codes are initialized as unmarked. The algorithm then extracts the unmarked node with the largest level (i.e., the largest code) in the node sequence and, taking it as the starting point, traverses the graph downward breadth-first, level by level. In each round, when the level of the parent node is q, the level of the child node is marked q − 1. Denoting by q_old the current level of a node and by q_new the newly derived level for it, two cases are considered: (a) when q_old < q_new, the algorithm sets the node's level to q_new and marks the node; (b) when q_old = q_new and the node is already marked, the downward traversal from that node terminates early. The next unmarked node is then extracted, until no unmarked nodes remain in the sequence.
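A hedged Python sketch of the two phases just described; the queue handling and tie-breaking details of Algorithm 2, which is an image in the original, are our assumptions.

```python
from collections import deque

def vertical_encode(parents):
    """parents: dict node -> set of parent nodes in the Bayesian network."""
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)
    level = {v: 0 for v in parents}

    # 1) bottom-up encoding: trace from each leaf toward the roots
    leaves = deque(v for v, cs in children.items() if not cs)
    while leaves:
        queue = deque([leaves.popleft()])
        while queue:
            v = queue.popleft()
            for p in parents[v]:
                if level[v] + 1 > level[p]:      # keep the larger code
                    level[p] = level[v] + 1
                    queue.append(p)              # otherwise backtracking stops

    # 2) top-down correction: start from the highest unmarked node
    marked = set()
    for start in sorted(parents, key=lambda v: -level[v]):
        if start in marked:
            continue
        queue = deque([start])
        marked.add(start)
        while queue:
            v = queue.popleft()
            for c in children[v]:
                q_new = level[v] - 1
                if level[c] < q_new:             # case (a): raise and mark
                    level[c] = q_new
                    marked.add(c)
                    queue.append(c)
                elif level[c] == q_new and c in marked:
                    continue                     # case (b): terminate early
                else:
                    marked.add(c)
                    queue.append(c)
    return level
```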
Further, step 3 realizes the four privacy protection operations k-anonymity, l-diversity, t-closeness, and attribute-value generalization; the specific steps are as follows:
Step 3.1, realize k-anonymity: according to the value-range setting of the attribute X_i given by a domain expert or the data owner, extend the value-domain space of X_i in the Bayesian network so that the number of distinct values in its value-domain space is greater than or equal to k. During correction, following the correction principle of information-entropy maximization, the parent node of the privacy node assigns to the privacy node the attribute value with the largest probability-distribution value among its child nodes, so that the privacy node satisfies the k requirement;
Step 3.2, realize l-diversity: likewise, according to the data owner's value-range setting for the attribute X_i, extend the value-domain space of X_i in the Bayesian network so that the number of distinct values in its value-domain space is greater than or equal to l. The attribute X_i is corrected according to the principle of information-entropy maximization: in each round of correction, only the single value with the largest probability distribution is selected as the target to be corrected, and the probability mass above the mean is evenly distributed to the newly added attribute values;
Step 3.3, realize t-closeness: define the value distribution that maximizes the information entropy over the value-domain space of the attribute X_i as the theoretical standard, measure with the variance, and correct the probability distribution of each value of X_i so that the variance between each value's occurrence probability and the theoretical standard is not higher than t;
Step 3.4, realize attribute-value generalization: according to the attribute-value hierarchy tree set by a domain expert or the data owner, fuse the probability distributions of similar values in the value domain of the attribute X_i: the attribute leaf nodes to be anonymously protected and all of their sibling leaf nodes are aggregated into one attribute node and replaced by their immediate parent node, whose attribute-value probability distribution is inherited from all the original leaf nodes participating in the aggregation.
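As one example of a unit privacy protection operation, the following sketch applies the step-3.2 l-diversity correction to a single attribute node's probability distribution; the numeric details (round count, placeholder value names) are our reading of the description, not patent text.

```python
def l_diversity_correct(dist, l, rounds=10):
    """dist: dict value -> probability over one attribute's value domain."""
    dist = dict(dist)
    fresh = 0
    while len(dist) < l:                     # extend the value-domain space
        dist[f"v_new{fresh}"] = 0.0
        fresh += 1
    for _ in range(rounds):
        mean = 1.0 / len(dist)
        target = max(dist, key=dist.get)     # one target value per round
        excess = dist[target] - mean
        if excess <= 1e-9:                   # near-uniform: entropy is maximal
            break
        receivers = [v for v in dist if v != target and str(v).startswith("v_new")] \
                    or [v for v in dist if v != target]
        dist[target] = mean
        for v in receivers:                  # spread the above-mean mass evenly
            dist[v] += excess / len(receivers)
    return dist

print(l_diversity_correct({"flu": 0.7, "cancer": 0.3}, l=3))
```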
By converting the data sets under different privacy protection strategies into Bayesian networks and encoding them, the secondary anonymization process is realized and the re-anonymity framework for multi-party data fusion is completed. However, in the multi-source data fusion process, the combination scheme of the multiple privacy protection policies still needs to be optimized, so that the availability of the fused data is maximized while the privacy constraints of all parties are met.
Further, in step 4, heuristic rules are adopted to turn the data fusion process into a hypergraph resolution process. For a hyper-edge N that intersects other hyper-edges, PROG(HG) is:

FOR each hyper-edge M intersecting hyper-edge N DO
    eliminate the probability-independent tuples in R(M) bottom-up
ENDFOR;
PROG(HG_1), PROG(HG_2), …, PROG(HG_k);
RESULT(HG) := RESULT(HG_1) × RESULT(HG_2) × … × RESULT(HG_k)

The hypergraph resolution algorithm recursively calls the three heuristic rules (rules 1–3, detailed in the specific embodiment below); hyper-edges are selected, solved, and eliminated from HG one by one, and the program PROG(HG) producing RESULT(HG) is constructed, so the process of resolving the hypergraph is also the process of realizing the privacy constraints one by one. The hypergraph resolution heuristic is Algorithm 3:
algorithm 3 hypercritical resolution heuristic algorithm
Figure BDA0002886363110000101
Taking the two aforementioned privacy-protection-policy operations F_1(A, B, D) and F_2(D, E, G, H) and their connected hypergraph as an example, we show how to construct the program PROG(HG) with the heuristic algorithm and produce the result RESULT(HG):

(1) Resolve the hyper-edges {A, B, D} and {D, E, G, H}; the result hypergraph is HG_1 (i.e., {B, D}, {D, G}, {A}, {E}, {H}), and according to resolution rule 3 a PROG(HG) program is obtained:

PROG(HG_1);
RESULT(HG) := RESULT(HG_1)

(2) Let HG_2 = ({A}, {E}, {H}) and HG_3 = ({B, D}, {D, G}); according to resolution rule 2, the PROG(HG_1) program is obtained:

PROG(HG_2), PROG(HG_3);
RESULT(HG_1) := RESULT(HG_2) × RESULT(HG_3)

Because HG_2 comprises three mutually independent hyper-edges, PROG(HG_2) is

RESULT(HG_2) := R({A}, {E}, {H})

(3) Continue computing PROG(HG_3) for HG_3: resolving the hyper-edges {B, D} and {D, G} gives the result hypergraph HG_4, and according to resolution rule 3 the PROG(HG_3) program is generated:

PROG(HG_4);
RESULT(HG_3) := RESULT(HG_4)

(4) Since HG_4 contains only one hyper-edge, it follows from rule 1 that PROG(HG_4) is

RESULT(HG_4) := R({D, G})

The final program can therefore be written as:

RESULT(HG_4) := R({D, G});
RESULT(HG_3) := RESULT(HG_4);
RESULT(HG_2) := R({A}, {E}, {H});
RESULT(HG_1) := RESULT(HG_2) × RESULT(HG_3);
RESULT(HG) := RESULT(HG_1)
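The final program can be exercised end to end with toy stand-in relations, as in the straight-line Python rendering below; the relation contents are invented for illustration, and a real implementation would eliminate probability-independent tuples bottom-up rather than enumerate Cartesian products.

```python
from itertools import product

def R(*tuples):
    """Stand-in for the relation attached to a hyper-edge: a set of tuples."""
    return set(tuples)

def cross(*results):
    """The Cartesian-product composition used by rule 2 (RESULT x RESULT)."""
    return {sum(t, ()) for t in product(*results)}

RESULT_HG4 = R(("d1", "g1"), ("d2", "g2"))                 # R({D, G})
RESULT_HG3 = RESULT_HG4                                    # rule 3: {B, D} resolved away
RESULT_HG2 = cross(R(("a1",)), R(("e1",)), R(("h1",)))     # R({A}, {E}, {H})
RESULT_HG1 = cross(RESULT_HG2, RESULT_HG3)                 # rule 2
RESULT_HG  = RESULT_HG1                                    # rule 3 on {A,B,D}, {D,E,G,H}
print(sorted(RESULT_HG))
```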
of course, it is not necessary for any one product that embodies the invention to achieve all of the above advantages simultaneously.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a comparison case analysis of results of execution sequences of different privacy protection policies;
FIG. 2 is a system model for multi-source data fusion;
FIG. 3 is a hypergraph HG;
FIG. 4 shows the comparison results with and without re-anonymization;
FIG. 5 is a comparison of a naive algorithm and an optimized algorithm;
FIG. 6 is a graph of privacy attribute probabilities for discriminators and generators in different equivalence classes.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A multi-source data fusion privacy protection method based on multi-privacy policy combination optimization comprises the following steps:
step 1, constructing a data multi-source fusion system model: firstly, data from each party is collected by a data owner in a system model, and in order to prevent privacy disclosure, data anonymization operation is carried out by each party; secondly, as the data volume of some entities is huge, the data must be stored in a public cloud, the data fusion of the public cloud is to organically integrate the multisource cross-platform data, and aims to better mine useful information by fusing complete data sets of all parties, and if the data fusion is simply carried out, the privacy snooping concern of the public cloud after the data fusion cannot be eliminated, so that the public cloud also needs to carry out heavy anonymity operation; in addition, the user can enjoy the convenience of big data by customizing the required services, however, unknown attackers may also be hidden in the user, so it is assumed here that the user is also "curious", i.e., the user is treated as a suspected privacy-mining group with the same attack capability as the cloud service provider. The method comprises the following specific steps:
and 1.1, constructing a data fusion model. The data fusion is to organically integrate data belonging to multiple sources, and aims to better mine useful information through a more complete data set than before so as to provide high-quality service for users. The data set may be represented as a quadruple D (X, a, F, V), where X ═ X1,x2,…,xnIs a set of data records, each item of data xiAre all exclusively associated with one dedicated user ui(ii) a A is an attribute set; further, the attributes are divided into an information attribute set IA and a sensitive attribute set SA according to their sensitivity, and IA ═ SA ═ a,
Figure BDA0002886363110000121
f is a set of relationships between X and A
Figure BDA0002886363110000122
Figure BDA0002886363110000123
Is attribute akThe value range of (2).
Step 1.2, construct the privacy and privacy attack models. Privacy here refers to the injection (one-to-one mapping) from a user to the corresponding sensitive attribute value; if this injective relationship is revealed, the user's privacy is revealed. According to the data model, users and data records are in one-to-one correspondence, and from the data aspect each user corresponds to a group of information attribute values, i.e., there is an injection between the user and the information attribute values, and the information equivalence class containing this group of information attribute values in turn corresponds to a set of privacy attribute values. By the transitivity of injections, if the group of information attribute values and the corresponding privacy attribute value set form an injection (that is, the privacy attribute value set contains only one element), then releasing the data set discloses that user's privacy.
Step 2, design the multiple data fusion anonymity framework: aiming at the frequent cross-platform communication and sharing of big data information, this patent proposes a re-anonymity framework based on multi-party data fusion, comprising an initial state, a handshake process, data synchronization, and secondary anonymity. First, in the initial state, each party performs the corresponding anonymization operation on its own data according to its own privacy protection requirements; second, the handshake process carries out multi-party communication, in which each party publishes its data privacy protection requirements; third, during data synchronization, the distributions of the public attribute values of the multi-party data need to be made consistent while the privacy protection requirements of all parties are still met; finally, secondary anonymity is the key step of the re-anonymity framework: the data set is converted into a Bayesian network, a hierarchical structure diagram is constructed by encoding the Bayesian network, and the privacy protection problem is finally converted into a probabilistic reasoning problem, the specific process comprising network structure learning and network encoding.
Step 2.1: consider learning a Bayesian network structure from a data set D = {D_1, …, D_n} over a set of m random variables X = {X_1, …, X_m}. It is assumed that the variables are categorical (i.e., the number of states of each variable is finite) and that the data set is complete. The goal of the Bayesian network construction algorithm is, by defining a parent set Π_1, …, Π_m for each variable over the node set X, to find the highest-scoring directed acyclic graph (DAG) G. Assuming the Markov condition, a joint probability distribution is induced in which each variable is conditionally independent of its non-descendant variables given its parents.
Step 2.2: different scoring functions can be used to evaluate the quality of the generated DAG; in this patent we use the Bayesian Information Criterion (BIC) score, which is proportional to the posterior probability of the DAG. BIC is decomposable, consisting of the sum of the scores of each variable and its parent set:

BIC(G) = Σ_{i=1..m} ( LL(X_i | Π_i) − Pen(X_i | Π_i) )

where LL(X_i | Π_i) is the log-likelihood function of X_i with its parent set Π_i:

LL(X_i | Π_i) = Σ_{π ∈ |Π_i|} Σ_{x ∈ |X_i|} N_{x,π} · log θ̂_{x|π}

and Pen(X_i | Π_i) is the complexity penalty function of X_i with its parent set Π_i:

Pen(X_i | Π_i) = (log N / 2) · |Π_i| · (|X_i| − 1)

Here θ̂_{x|π} is the maximum-likelihood estimate of the conditional probability P(X_i = x | Π_i = π), N_{x,π} denotes the number of times (X_i = x, Π_i = π) occurs in the data set, and |·| denotes the size of the Cartesian product of the value spaces of the given variables.
The Bayesian network encoding constructs a hierarchical structure diagram by vertically encoding the Bayesian network, in two phases: a bottom-up encoding phase and a top-down correction phase. In particular, given a Bayesian network, it can be converted into a hierarchical structure diagram by encoding.
Step 2.3, bottom-up encoding phase. First, the level of every node is initialized to zero; the algorithm then keeps marking from the leaf nodes and progressively traces the corresponding parent nodes. In each round, when the level of a child node is q, the level of its parent node is marked q + 1. For non-leaf nodes, only the current maximum code is recorded: if a node's code is not 0, the new code is compared with the existing one and the larger is kept; if the new code equals the existing one, the upward backtracking from that node stops; the backtracking also stops once the leaf-node queue is empty. The next leaf node is then extracted for marking, until the leaf-node sequence is empty.
Step 2.4, top-down correction phase. All nodes are first sorted by level from largest to smallest, and all node codes are initialized as unmarked. The algorithm then extracts the unmarked node with the largest level (i.e., the largest code) in the node sequence and, taking it as the starting point, traverses the graph downward breadth-first, level by level. In each round, when the level of the parent node is q, the level of the child node is marked q − 1. Denoting by q_old the current level of a node and by q_new the newly derived level for it, two cases are considered: (a) when q_old < q_new, the algorithm sets the node's level to q_new and marks the node; (b) when q_old = q_new and the node is already marked, the downward traversal from that node terminates early. The next unmarked node is then extracted, until no unmarked nodes remain in the sequence.
Step 3, realize the privacy protection policy: given the privacy protection policy, the Bayesian network G is finally formed through the two processes above, and operations on the attributes X_1, …, X_d are then required so that the privacy node X_s meets the policy requirements. Specifically, for a privacy protection policy F = (G, IA, SA, OP, V), the unit privacy protection operation is defined: the privacy budget is divided into d equal parts, and each round applies privacy protection to the probability distribution of only one selected attribute node. For the attribute X_i on which a privacy operation is to be performed, four privacy protection operations are realized: k-anonymity, l-diversity, t-closeness, and attribute-value generalization.
Step 3.1, realize k-anonymity: according to the value-range setting of the attribute X_i given by a domain expert or the data owner, extend the value-domain space of X_i in the Bayesian network so that the number of distinct values in its value-domain space is greater than or equal to k. During correction, following the correction principle of information-entropy maximization, the parent node of the privacy node assigns to the privacy node the attribute value with the largest probability-distribution value among its child nodes, so that the privacy node satisfies the k requirement;
Step 3.2, realize l-diversity: likewise, according to the data owner's value-range setting for the attribute X_i, extend the value-domain space of X_i in the Bayesian network so that the number of distinct values in its value-domain space is greater than or equal to l. The attribute X_i is corrected according to the principle of information-entropy maximization: in each round of correction, only the single value with the largest probability distribution is selected as the target to be corrected, and the probability mass above the mean is evenly distributed to the newly added attribute values;
Step 3.3, realize t-closeness: define the value distribution that maximizes the information entropy over the value-domain space of the attribute X_i as the theoretical standard, measure with the variance, and correct the probability distribution of each value of X_i so that the variance between each value's occurrence probability and the theoretical standard is not higher than t;
Step 3.4, realize attribute-value generalization: according to the attribute-value hierarchy tree set by a domain expert or the data owner, fuse the probability distributions of similar values in the value domain of the attribute X_i: the attribute leaf nodes to be anonymously protected and all of their sibling leaf nodes are aggregated into one attribute node and replaced by their immediate parent node, whose attribute-value probability distribution is inherited from all the original leaf nodes participating in the aggregation.
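As a companion example for step 3.4, the following sketch performs attribute-value generalization against a small hierarchy tree; the tree shape and place names are illustrative assumptions.

```python
hierarchy = {
    "Europe": ["Munich", "Paris"],      # parent node -> child leaves
    "Asia":   ["Tokyo", "Seoul"],
}

def generalize(dist, leaf):
    """dist: dict leaf value -> probability. Replace `leaf` and all of its
    sibling leaves by their immediate parent, which inherits their mass."""
    parent = next(p for p, kids in hierarchy.items() if leaf in kids)
    siblings = hierarchy[parent]
    merged = {v: p for v, p in dist.items() if v not in siblings}
    merged[parent] = sum(dist.get(v, 0.0) for v in siblings)
    return merged

dist = {"Munich": 0.4, "Paris": 0.1, "Tokyo": 0.3, "Seoul": 0.2}
print(generalize(dist, "Munich"))   # {'Tokyo': 0.3, 'Seoul': 0.2, 'Europe': 0.5}
```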
By converting the data sets under different privacy protection strategies into Bayesian networks and encoding them, the secondary anonymization process is realized and the re-anonymity framework for multi-party data fusion is completed. However, in the multi-source data fusion process, the combination scheme of the multiple privacy protection policies still needs to be optimized, so that the availability of the fused data is maximized while the privacy constraints of all parties are met.
Step 4, map the data fusion with multiple privacy constraints into a hypergraph, design corresponding heuristic rules, and reduce the data fusion process to a hypergraph resolution process, so that the availability of the data is improved while the privacy constraints are met. The specific steps are as follows:
step 4.1, formally defining the privacy protection policy as a five-tuple F ═ (G, IA, SA, OP, V), wherein G represents a bayesian network converted from the data set; IA denotes an information attribute node, IA ═ a1,a2,…,am),a1,a2,…,amAre not independent of each other, and have a probability dependence relationship; SA denotes a privacy node, OP denotes a certain operation step, and OP ═ OP (OP)1,OP2,…,OPm). V denotes a value range after the operation OP,
Figure BDA0002886363110000158
judging the execution sequence of different privacy protection strategies from the data layer surface and the structural layer surface:
1) if amCan be formed by1,a2,…,anIs shown to be
Figure BDA0002886363110000159
That is OPmPost-execution and vice versa;
from the structural level:
2) starting from the privacy node of the Bayesian network, the Bayesian network is encoded through a bottom-up encoding stage and a top-down modification stage, and the operation on the privacy attribute is compared through modifying the privacy node SA within the maximum modification threshold value
Figure BDA0002886363110000161
And
Figure BDA0002886363110000162
achieving the required privacy protection if OPiLess influence on the data structure, then OPiTo achieve the required performance ratio OPjHigh, i.e.
Figure BDA00028863631100001611
OPiComparison OPjFirst, and vice versaVice versa;
3) if multiple operations on information attributes are involved
Figure BDA0002886363110000163
Then the following two cases are distinguished: firstly, if
Figure BDA00028863631100001612
Calculating value range of each operation respectively through probabilistic reasoning relationship between IA
Figure BDA0002886363110000164
If IAj
Figure BDA0002886363110000165
Then OPiComparison OPj、OPkIs executed first if
Figure BDA0002886363110000166
Then OPkComparison OPj、OPiFirstly, executing; second when
Figure BDA00028863631100001613
Figure BDA00028863631100001614
Then OPkComparison OPj、OPiIs performed first, for OPj、OPiSequence if in operation
Figure BDA0002886363110000167
And
Figure BDA0002886363110000168
in (1),
Figure BDA0002886363110000169
then
Figure BDA00028863631100001610
Will affect OPjThen, then
Figure BDA00028863631100001615
OPiComparison OPjFirst, and vice versa.
For example, let the two privacy-protection-policy operations be decomposed as F_1(A, B, D) and F_2(D, E, G, H), represented by the hyper-edges {A, B, D} and {D, E, G, H} respectively, where B and D are two steps within one of F_1's operations, represented by the conditional hyper-edge {B, D}, and D and G are two steps within one of F_2's operations, represented by the conditional hyper-edge {D, G}; since these are not mutually independent, an intersection exists between them. A, E, and H are three independent operations, represented by the hyper-edges {A}, {E}, and {H} respectively. From these hyper-edge relations, the connected hypergraph HG can be obtained, as shown in FIG. 3:
step 4.2, by judging the execution sequence of different privacy protection strategies F, generating the following heuristic rules of ultra-edge resolution and PROG (HG):
rule 1. if hypergraph HG contains only one hyper-edge N, which can be resolved directly, prog (HG) contains only result (HG): r (N);
rule 2. if the hypergraph HG is k disjoint hypergraphs HG1、HG2……HGkIf they can be executed in parallel, prog (hg) is:
PROG(HG1),PROG(HG2),……,PROG(HGk);
RESULT(HG):=RESULT(HG1)×RESULT(HG2)×……×RESULT(HGk)
rule 3. given a privacy node SA and its vertical code X, known from the properties of the Bayesian networkSAL, in all the chain set Links taking the privacy nodes as chain tail nodes, let XiAnd XjIs any two nodes in Links that are not SA, if Xi.L<XjL, then X is corrected at the same privacy preserving granularityiThe probability distribution has less influence on the availability of the global data, so a lower proximity principle is formed, namely the closer the modified attribute nodes are to the privacy attribute, the more targeted the modification is. In other words, if the hypergraph HG is composed of k connected components HG1、HG2……HGkAnd if the HG is in the set, judging the probability dependence of each super edge on the privacy nodeiCompared with HGjFurther down to the privacy node, then
Figure BDA0002886363110000172
I.e. HGiDigestion is performed first and vice versa.
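A minimal sketch of rule 3's lower-proximity ordering, assuming the vertical codes X.L from step 2 are available as a dict; the distance measure is our simplification, not the patent's exact criterion.

```python
def resolution_order(components, level, sa):
    """components: list of sets of attribute nodes; level: node -> X.L code."""
    def proximity(comp):
        # smallest level gap between a component node and the privacy node
        return min(abs(level[x] - level[sa]) for x in comp)
    return sorted(components, key=proximity)   # closest to SA is resolved first

level = {"SA": 0, "A": 3, "B": 1, "D": 1, "E": 2, "G": 2, "H": 3}
print(resolution_order([{"A", "B"}, {"D"}, {"E", "G", "H"}], level, "SA"))
```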
In this embodiment, the correctness and effectiveness of the privacy protection model proposed in this patent are verified through experimental simulation. The architecture is implemented in the Python language; the hardware environment is an Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz (1.19GHz) processor with 16 GB of memory, and the operating system is Windows 10.
In the first group of experiments, in order to highlight the superiority of the re-anonymity handshake protocol, our experiments compare using the re-anonymity handshake protocol against not using it. First, this patent generates a data set by means of a Bayesian network, and when the data set is generated all parties are anonymized for the first time. Second, we observed experimentally that the amount of data has a certain influence on the privacy disclosure probability, so the data amount is used as the independent variable of the test and the privacy disclosure probability as the dependent variable; the experimental results are shown in FIG. 4. As can be seen from the figure, with an extremely small amount of data the privacy disclosure probabilities with and without re-anonymization are basically similar; as the amount of experimental data increases, the re-anonymization method obviously reduces the privacy disclosure probability, which falls below 20% when the data amount reaches 100,000. On the contrary, without the re-anonymization method, the privacy disclosure probability increases with the amount of experimental data, reaching as high as 80% at 100,000 records.
In order to verify that the optimization algorithm proposed in this patent can greatly improve data availability, a comparison experiment is designed comparing a naive algorithm with the optimization algorithm of this patent. Data availability is denoted by Q in this experiment and is given by the following formula:

[The formula for Q is rendered as an image in the original publication.]

where a represents the raw data and b represents the noisy data; it can be observed from the formula that the more noise is added, the worse the data availability. The data amount is again used as the independent variable, taking the values 5,000, 10,000, 20,000, 40,000, 60,000, 80,000, and 100,000, and the resulting data availability is observed; the results are shown in FIG. 5. As can be seen from FIG. 5, with an extremely small amount of data, the naive algorithm and the optimization algorithm affect data availability roughly equally; when the data amount reaches 40,000, the data availability of the optimization algorithm is about 30% higher than that of the naive algorithm; and when the data amount reaches 100,000, the data availability of the naive algorithm is about 40% lower than that of the optimization algorithm. From this analysis, the data availability of the fusion algorithm optimized by this patent is far higher than that of the naive fusion algorithm.
In order to verify the usability of the method in incremental data fusion, this experiment borrows the idea of a generative adversarial network. First, a data set is generated from the constructed Bayesian network, and a discriminator and a generator sample it at different proportions: the discriminator samples 30% and the generator samples 15%. Each sampled data set is turned into a Bayesian network by the hill-climbing method, and each generated Bayesian network then regenerates a data set of the same size, 40,000 records. The KL divergence is used to measure the difference between the distributions of the privacy attribute within given equivalence classes of the two data sets; the calculation formula is:

KL(P‖Q) = Σ_x P(x) · log( P(x) / Q(x) )

The closer the KL divergence is to 0, the smaller the difference between the discriminator and the generator, and the better the experimental effect.
The privacy attribute probabilities of the discriminator and the generator in three different equivalence classes are selected for calculation; the probability distributions are shown in FIG. 6. The KL divergences are then calculated respectively, giving KL1 = 0.0042, KL2 = 0.0043, and KL3 = 0.0053. All three KL divergences are close to 0, so the difference between the discriminator and the generator is very small and the experimental effect is very good.
Through the analysis of the above three simulation experiments, the method proposed in this patent not only greatly improves the privacy protection effect of multi-source data fusion, but also greatly improves the availability of the data.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (5)

1. A multi-source data fusion privacy protection method based on multi-privacy policy combination optimization is characterized by comprising the following steps:
step 1, constructing a data multi-source fusion system model: firstly, data from each party is collected by a data owner in the system model, and to prevent privacy disclosure each party performs a data anonymization operation; secondly, since the data volume of some entities is huge, the data must be stored in a public cloud; data fusion in the public cloud organically integrates the multi-source, cross-platform data, with the aim of better mining useful information from the fused, complete data sets of all parties, but fusion alone cannot eliminate the concern that the public cloud will snoop on privacy after fusion, so the public cloud must also perform a re-anonymization operation; in addition, users enjoy the convenience of big data by customizing the services they need, yet unknown attackers may hide among them, so it is assumed here that the user is also "curious", i.e., the user and the cloud service provider are regarded as a suspected privacy-mining group with the same attack capability;
step 2, designing a multiple data fusion anonymity framework: aiming at the frequent cross-platform communication and sharing of big-data information, this patent provides a re-anonymization framework based on multi-party data fusion, comprising an initial state, a handshake process, data synchronization and secondary anonymization. Firstly, in the initial state each party performs the corresponding anonymization operation on its own data according to its own privacy protection requirements; secondly, the handshake process carries out multi-party communication, in which each party publishes its data privacy protection requirements; in the data synchronization process, the distributions of the public attribute values of the multi-party data are made consistent while the privacy protection requirements of all parties are met; secondary anonymization is the most critical step of the re-anonymization framework: the data set is converted into a Bayesian network, a hierarchical structure diagram is constructed by encoding the Bayesian network, and the privacy protection problem is finally converted into a probabilistic reasoning problem, the specific process comprising network structure learning and network encoding;
and step 3, realizing the privacy protection strategy: given the privacy protection policy, the Bayesian network G is finally formed through the two processes above, and the attribute to be protected must then be operated on so that the privacy node Xs meets the policy requirements; specifically, for a given privacy protection policy, this patent defines the unit privacy protection operation: the privacy budget is partitioned into d parts, and each round applies privacy protection only to the probability distribution of one selected attribute node; for the attribute awaiting the privacy operation, four privacy protection operations are realized: k-anonymity, l-diversity, t-closeness and attribute value generalization;
and step 4, mapping the fusion of data under multiple privacy constraints onto a hypergraph and designing corresponding heuristic rules, so that the data fusion process is normalized into a hypergraph resolution process and the availability of the data is improved while the privacy constraints are met.
2. The multi-source data fusion privacy protection method based on multi-privacy policy combination optimization according to claim 1, wherein a data fusion model and a privacy attack model are constructed in step 1 by adopting an adversarial learning architecture, with the following specific steps:
step 1.1, constructing a data fusion model:
the data fusion is to organically integrate data belonging to multiple sources, and aims to better mine useful information through a more complete data set than before so as to provide high-quality service for users. For ease of discussion, a formal description of the data is first given: a data set may be represented as a quadruple D (X, a, F, y), where X ═ X1,x2,...,xnIs a set of data records, each item of data xiAre all exclusively associated with one dedicated user ui(ii) a A is an attribute set; further, the attributes are divided into an information attribute set IA and a sensitive attribute set SA according to their sensitivity, and IA ═ SA ═ a,
Figure FDA0002886363100000021
f is a set of relationships between X and A
Figure FDA0002886363100000022
Figure FDA0002886363100000023
Is attribute akThe value range of (2).
Definition 1 (equivalence class): given a data set D(X, A, F, V) and any attribute subset A' ⊆ A, if there are t records {x1, x2, ..., xt} (t ≥ 1) that agree on every attribute in A', then {x1, x2, ..., xt} is called an equivalence class of A' on D, denoted [xi]A'; accordingly, the set EA' of all equivalence classes formed by the attribute set A' constitutes a partition of D, denoted D/EA'. In particular, if A' ⊆ IA, the corresponding equivalence class is called an information equivalence class.
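As an illustrative sketch of Definition 1 (not the patent's implementation), the following partitions a toy record set into equivalence classes over an attribute subset A'; the field names age, zip and disease are hypothetical:

```python
from collections import defaultdict

def partition(records, attrs):
    """Group records (dicts) into equivalence classes over the
    attribute subset A': records agreeing on every attribute in A'
    fall into the same class, yielding the partition D / E_A'."""
    classes = defaultdict(list)
    for rec in records:
        classes[tuple(rec[a] for a in attrs)].append(rec)
    return list(classes.values())

D = [
    {"age": "30-39", "zip": "0300*", "disease": "flu"},
    {"age": "30-39", "zip": "0300*", "disease": "cancer"},
    {"age": "40-49", "zip": "0301*", "disease": "flu"},
]
for eq in partition(D, ["age", "zip"]):  # information equivalence classes
    print(eq)
```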
Definition 2 (data fusion): given m data sets {D1, ..., Dm}, the fused data set D(X, {IA, SA}, F, V) satisfies: [fusion condition rendered as an equation image in the source]. In particular, if two data sets Di, Dj to be fused satisfy Xi △ Xj ≠ ∅ (△ representing the symmetric difference operator), the fusion is called information increment fusion; if there is a record xk ∈ Xi ∩ Xj whose information attribute values are refined between the two data sets, it is called information refinement fusion; and if every record xk ∈ Xi ∩ Xj takes the same sensitive attribute values in both data sets (where SAi = SAj), it is called coordinated (harmonious) fusion, and otherwise it is not coordinated. The research scope of this patent is coordinated information increment and refinement fusion;
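Under the reading above (an assumption, since the source renders the precise conditions as images), a small sketch that tests whether the fusion of two record sets is coordinated; the id field identifying shared records is hypothetical:

```python
def is_coordinated(d1, d2, sensitive_attr, key="id"):
    """Assumed reading of Definition 2: the fusion of d1 and d2 is
    coordinated when every shared record carries identical sensitive
    attribute values in both data sets."""
    by_key1 = {r[key]: r for r in d1}
    by_key2 = {r[key]: r for r in d2}
    shared = set(by_key1) & set(by_key2)
    return all(by_key1[k][sensitive_attr] == by_key2[k][sensitive_attr]
               for k in shared)

d1 = [{"id": 1, "disease": "flu"}, {"id": 2, "disease": "cancer"}]
d2 = [{"id": 2, "disease": "cancer"}, {"id": 3, "disease": "flu"}]
print(is_coordinated(d1, d2, "disease"))  # True: the shared record agrees
```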
step 1.2, constructing the privacy and privacy attack models:
Privacy here refers to the injective mapping between a user and the corresponding sensitive attribute value of the data; if this mapping is revealed, the user's privacy is leaked. According to the data model, users and data records are in one-to-one correspondence, and on the data side each user corresponds to a group of information attribute values, i.e., the user and this group of information attribute values form an injection, while the equivalence class containing the group of information attribute values in turn corresponds to a set of privacy attribute values. By the transitivity of injections, if the group of information attribute values and the corresponding privacy attribute value set also form an injection (that is, the privacy attribute value set contains only one element), then publishing the data set leaks the privacy of that user. More generally:
Definition 3 (data privacy disclosure): given a data set D(X, {IA, SA}, F, V), let [xi]IA denote the information equivalence class of a record xi, and let SA([xi]IA) denote its corresponding set of privacy attribute values; if |SA([xi]IA)| = 1, the data privacy is said to be disclosed.
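A sketch of the Definition 3 check, in the same toy setting as the partition sketch above (field names remain hypothetical); privacy leaks when an information equivalence class maps to exactly one sensitive value:

```python
from collections import defaultdict

def privacy_disclosed(records, info_attrs, sensitive_attr):
    """Definition 3: privacy leaks if some information equivalence
    class maps to exactly one sensitive value, |SA([x]_IA)| == 1."""
    classes = defaultdict(list)
    for rec in records:
        classes[tuple(rec[a] for a in info_attrs)].append(rec)
    return [eq for eq in classes.values()
            if len({rec[sensitive_attr] for rec in eq}) == 1]

# The lone 40-49/0301* record forms a singleton class, so its
# "flu" value would be disclosed on publication.
D = [
    {"age": "30-39", "zip": "0300*", "disease": "flu"},
    {"age": "30-39", "zip": "0300*", "disease": "cancer"},
    {"age": "40-49", "zip": "0301*", "disease": "flu"},
]
print(privacy_disclosed(D, ["age", "zip"], "disease"))
```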
Definition 4 (knowledge-based attack): suppose the adversary knows the information attribute values of the target user ui and knows that the user's data record xi is in the data set D(X, {IA, SA}, F, V) to be published; when the data is published, the adversary can build the chain of relations from ui, through its information attribute values, to the information equivalence class [xi]IA and its privacy attribute values, and from this form a privacy inference probability: for any value vj in the value range of SA, the probability that user ui takes the value vj on SA is C(x ∈ [xi]IA: x.SA = vj) / C([xi]IA) (where C(*) is a counting statistical function and * a domain-of-discourse qualifier).
Definition 5 (multi-version attack under incremental data publication): given a first published data set D(X, {IA, SA}, F, V) and a corresponding updated data set D'(X', {IA', SA'}, F', V') published later by the same publisher, suppose the adversary compares the records xi ∈ X and x'i ∈ X' of a dedicated user ui; the adversary can then relate the two information equivalence classes of ui across the versions, intersect the candidate privacy attribute value sets SEL([xi]IA) and SEL([x'i]IA'), and infer the privacy of ui with a correspondingly higher probability (SEL(·) is a selection function extracting the privacy attribute values of an equivalence class).
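An illustrative sketch of Definitions 4 and 5; the intersection-based reading of the multi-version attack is an assumption, since the source renders the exact inference formula as an image:

```python
def infer_probabilities(records, sensitive_attr, known_ia):
    """Definition 4: match the victim's known information attribute
    values to their equivalence class; each sensitive value's inference
    probability is its frequency inside that class, C(v) / C(class)."""
    eq = [r for r in records
          if all(r[a] == v for a, v in known_ia.items())]
    vals = [r[sensitive_attr] for r in eq]
    return {v: vals.count(v) / len(vals) for v in set(vals)}

def multi_version_attack(ver1, ver2, sensitive_attr, known_ia):
    """Definition 5 (assumed reading): intersect the candidate value
    sets of the two releases (the SEL step) and renormalize."""
    c1 = infer_probabilities(ver1, sensitive_attr, known_ia)
    c2 = infer_probabilities(ver2, sensitive_attr, known_ia)
    common = set(c1) & set(c2)
    total = sum(c1[v] for v in common)
    return {v: c1[v] / total for v in common}

v1 = [{"age": "30-39", "disease": d} for d in ("flu", "cancer", "flu")]
v2 = [{"age": "30-39", "disease": d} for d in ("flu", "flu")]
print(multi_version_attack(v1, v2, "disease", {"age": "30-39"}))
# {'flu': 1.0}: the update eliminated "cancer" from the candidate set
```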
3. The multi-source data fusion privacy protection method based on multi-privacy policy combination optimization according to claim 1, wherein the vertical encoding in step 2 includes two stages: a Bayesian network structure learning stage and a network coding stage;
the specific steps of the Bayesian network structure learning stage are as follows:
step 2.1, consider learning a Bayesian network structure from the data set D = {D1, ..., Dn} over the set of m random variables X = {X1, ..., Xm}. It is assumed that the variables are categorical (i.e., the number of states of each variable is finite) and that the data set is complete. The goal of the Bayesian network construction algorithm is, by defining a parent set Πi for each variable, to find the highest-scoring directed acyclic graph (DAG) G over the node set X. By assuming the Markov condition, the DAG induces the joint probability distribution P(X1, ..., Xm) = ∏i P(Xi | Πi), each variable being conditionally independent of its non-descendant variables given its parents.
step 2.2, different scoring functions can be used to evaluate the quality of the generated DAG; in this patent we use the Bayesian Information Criterion (BIC) score, which is proportional to the posterior probability of the DAG. BIC is decomposable, consisting of the sum of the scores of each variable and its parent set:
BIC(G) = Σi BIC(Xi, Πi) = Σi (LL(Xi | Πi) - Pen(Xi | Πi))
wherein LL(Xi | Πi) represents the log-likelihood of Xi given its parent set Πi:
LL(Xi | Πi) = Σx,π Nx,π log θ̂x|π
and Pen(Xi | Πi) represents the complexity penalty of Xi and its parent set Πi:
Pen(Xi | Πi) = (log N / 2) · (|Xi| - 1) · |Πi|
where θ̂x|π is the maximum-likelihood estimate of the conditional probability P(Xi = x | Πi = π), Nx,π denotes the number of times (Xi = x, Πi = π) occurs in the data set, N is the number of records, and |·| represents the size of the Cartesian product space of the value domains of the given variables.
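A minimal sketch of the decomposable BIC component for one variable under the formulas above; treating data as a list of dict records and domains as a map from each variable to its value list are assumptions of this sketch:

```python
import math
from collections import Counter

def bic_component(data, child, parents, domains):
    """BIC(X_i, Pi_i) = LL(X_i | Pi_i) - Pen(X_i | Pi_i), with
    LL = sum over (pi, x) of N_{x,pi} * log(theta_hat_{x|pi}) and
    Pen = (log N / 2) * (|X_i| - 1) * |parent-domain product space|."""
    n = len(data)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    parent_counts = Counter(tuple(r[p] for p in parents) for r in data)
    ll = 0.0
    for (pi, _x), n_xpi in joint.items():
        theta = n_xpi / parent_counts[pi]  # maximum-likelihood estimate
        ll += n_xpi * math.log(theta)
    pi_space = math.prod(len(domains[p]) for p in parents)
    pen = 0.5 * math.log(n) * (len(domains[child]) - 1) * pi_space
    return ll - pen

data = [{"A": a, "B": b}
        for a, b in [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]]
doms = {"A": [0, 1], "B": [0, 1]}
print(round(bic_component(data, "B", ["A"], doms), 3))  # LL minus penalty
```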
The Bayesian network coding constructs a hierarchical structure diagram by vertically encoding the Bayesian network, and comprises two stages: a bottom-up encoding stage and a top-down modification stage. The specific steps are as follows:
step 2.3, the bottom-up encoding stage: first, the hierarchy codes of all nodes are initialized to zero; the algorithm then repeatedly starts from a leaf node and gradually backtracks through the corresponding parent nodes. In each round, when the hierarchy code of a child node is q, its parent node is marked q + 1. For a non-leaf node only the current maximum code is recorded, i.e., if the node's code is not 0, the new code is compared with the existing one and the larger is kept; if the two codes are equal, the upward backtracking from this node stops. The algorithm then judges whether the leaf node queue is empty and stops if it is; otherwise the next leaf node is extracted for marking, until the leaf node sequence is empty;
step 2.4, the top-down correction stage: all nodes are first sorted from large to small by hierarchy code, and all nodes are initialized as unmarked. The algorithm then extracts the unmarked node with the largest hierarchy code in the node sequence and takes it as the starting point of a breadth-first downward traversal of the graph. In each round, when the hierarchy code of a parent node is q, its child node is to be marked q - 1. Here, denoting the current hierarchy code of the node q_old and the newly derived code q_new, two cases are considered: (a) when q_old < q_new, the algorithm sets the hierarchy code of the node to q_new and sets the node as marked; (b) when q_old = q_new and the node is already marked, the downward traversal from this node terminates early. Next, the extraction of the next unmarked node continues until there are no unmarked nodes in the sequence.
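A sketch of the two-phase vertical encoding of steps 2.3 and 2.4 on a small DAG; the dict-based graph representation and the exact handling of already-marked nodes are assumptions of this sketch:

```python
from collections import deque

def vertical_encode(parents, children):
    """Two-phase vertical coding of a DAG (steps 2.3 and 2.4).
    parents / children: dicts mapping each node to its parent / child
    lists. Returns the hierarchy code of every node."""
    level = {n: 0 for n in parents}  # all codes start at zero

    # Bottom-up encoding: a parent of a level-q node becomes q + 1;
    # keep only the current maximum, otherwise stop backtracking.
    for leaf in (n for n in parents if not children[n]):
        queue = deque([leaf])
        while queue:
            node = queue.popleft()
            for p in parents[node]:
                if level[node] + 1 > level[p]:
                    level[p] = level[node] + 1
                    queue.append(p)

    # Top-down correction: from the largest-coded unmarked node, walk
    # breadth-first downward; a child of a level-q node tends to q - 1.
    marked = set()
    for start in sorted(level, key=level.get, reverse=True):
        if start in marked:
            continue
        marked.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for c in children[node]:
                q_new = level[node] - 1
                if level[c] < q_new:             # case (a): raise and mark
                    level[c] = q_new
                    marked.add(c)
                    queue.append(c)
                elif level[c] == q_new and c in marked:
                    continue                     # case (b): stop early
                else:
                    queue.append(c)              # keep traversing downward
    return level

# Hypothetical diamond DAG: A -> B, A -> C, B -> D, C -> D.
children = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
print(vertical_encode(parents, children))  # {'A': 2, 'B': 1, 'C': 1, 'D': 0}
```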
4. The multi-source data fusion privacy protection method based on multi-privacy policy combination optimization according to claim 1, wherein the four privacy protection operations of k-anonymity, l-diversity, t-closeness and attribute value generalization are implemented in step 3, with the following specific steps:
step 3.1, realizing k-anonymity: the value range of the attribute to be protected is set by a domain expert or the data owner, after which the value-domain space of the attribute in the Bayesian network is extended so that the number of distinct values in its value-domain space is greater than or equal to k. During correction, following the correction principle of information entropy maximization, the parent node of the privacy node assigns the attribute value with the largest probability-distribution value among its child nodes to the privacy node, so that the privacy node meets the k requirement;
step 3.2, realizing l-diversity: in the same way, according to the data owner's setting of the value range of the attribute, the value-domain space of the attribute in the Bayesian network is expanded so that the number of distinct values in the value-domain space is greater than or equal to l. The corrected attribute, following the correction principle of information entropy maximization, in each round of correction selects only the one value with the largest probability distribution as the target object to be corrected, and evenly distributes the probability mass above the mean to the newly added attribute values;
step 3.3, realizing t-closeness: the value distribution that maximizes the information entropy over the value-domain space of the attribute is defined as the theoretical standard, and the deviation is measured by the variance; the probability distribution of each value of the attribute is corrected so that the variance between the occurrence probability of each value and the theoretical standard is not higher than t (see the sketch after step 3.4);
step 3.4, realizing attribute value generalization: according to the attribute value hierarchy tree set by a domain expert or the data owner, the probability distributions of similar values in the value domain of the attribute are fused: the attribute leaf nodes to be protected anonymously in the value domain and all of their sibling leaf nodes are aggregated into one attribute node, which is replaced by their direct parent node, and the attribute value probability distribution corresponding to this node is inherited from all of the original leaf nodes participating in the aggregation.
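An illustrative sketch of steps 3.3 and 3.4, assuming that the entropy-maximizing standard over a finite value domain is the uniform distribution, that "variance" means the mean squared deviation from it, and that the hierarchy tree is given as a one-level leaf-to-parent map; the disease names are hypothetical:

```python
import numpy as np

def t_close_ok(dist: np.ndarray, t: float) -> bool:
    """Step 3.3: the entropy-maximizing (uniform) distribution over
    the value domain is the theoretical standard; require the variance
    of the observed probabilities around it to be at most t."""
    dist = dist / dist.sum()
    uniform = 1.0 / len(dist)
    return float(np.mean((dist - uniform) ** 2)) <= t

def generalize(dist, hierarchy, target_leaf):
    """Step 3.4: aggregate the target leaf and all of its sibling
    leaves into their direct parent; the parent inherits the summed
    probability mass of the aggregated leaves."""
    parent = hierarchy[target_leaf]
    siblings = [l for l, p in hierarchy.items() if p == parent]
    merged = {v: p for v, p in dist.items() if v not in siblings}
    merged[parent] = sum(dist[l] for l in siblings if l in dist)
    return merged

print(t_close_ok(np.array([0.26, 0.24, 0.25, 0.25]), t=1e-3))  # True
dist = {"flu": 0.4, "pneumonia": 0.2, "gastritis": 0.4}
tree = {"flu": "respiratory", "pneumonia": "respiratory",
        "gastritis": "digestive"}
print(generalize(dist, tree, "flu"))  # {'gastritis': 0.4, 'respiratory': 0.6}
```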
By converting the data sets under different privacy protection strategies into Bayesian networks and encoding them, the secondary anonymization process, and with it the re-anonymization framework for multi-party data fusion, is realized. In the multi-source data fusion process, however, the combination scheme of the multiple privacy protection policies still needs to be optimized, so that the availability of the fused data is improved to the maximum extent while the privacy constraints of all parties are met.
5. The multi-source data fusion privacy protection method based on multi-privacy policy combination optimization according to claim 1, wherein heuristic rules are adopted in step 4 to turn the data fusion process into a hypergraph resolution process, with the following specific steps:
step 4.1, the privacy protection policy is formally defined as a five-tuple F = (G, IA, SA, OP, V), wherein G represents the Bayesian network converted from the data set; IA denotes the information attribute nodes, IA = (a1, a2, ..., am), where a1, a2, ..., am are not mutually independent but have probabilistic dependence relations; SA denotes the privacy node; OP denotes the operation steps, OP = (OP1, OP2, ..., OPm); and V denotes the value ranges after the operations OP, V = (V1, V2, ..., Vm).
The execution order of the different privacy protection strategies is judged from the data level and from the structural level:
1) from the data level: if am can be represented by a1, a2, ..., an, i.e., am is functionally determined by them, then OPm is executed later, and vice versa;
from the structural level:
2) starting from the privacy node of the Bayesian network, the network is encoded through the bottom-up encoding stage and the top-down correction stage, and the operations OPi and OPj on the privacy attributes are compared by modifying the privacy node SA within the maximum modification threshold so as to achieve the required privacy protection; if OPi has less influence on the data structure, then the efficiency with which OPi achieves the requirement is higher than that of OPj, and OPi is executed before OPj, and vice versa;
3) if multiple operations OPi, OPj, OPk on information attributes IAi, IAj, IAk are involved, the following two cases are distinguished: first, if the information attributes are related by probabilistic reasoning, the value range Vi, Vj, Vk of each operation is calculated through the probabilistic reasoning relations between the IAs; if the value range of OPi covers those of OPj and OPk, then OPi is executed before OPj and OPk, and if instead the value range of OPk covers the others, then OPk is executed before OPj and OPi; second, when the value range of OPk is independent of those of OPj and OPi, OPk is executed before OPj and OPi, and for the order of OPj and OPi, if the value ranges Vi and Vj of the operations OPi and OPj intersect, then OPi will affect OPj, so OPi is executed before OPj, and vice versa.
step 4.2, by judging the execution order of the different privacy protection strategies F, the following heuristic rules for hyperedge resolution and the resolution program PROG(HG) are generated:
rule 1: if the hypergraph HG contains only one hyperedge N, it can be resolved directly, and PROG(HG) contains only RESULT(HG) := R(N);
rule 2: if the hypergraph HG consists of k disjoint hypergraphs HG1, HG2, ..., HGk, they can be executed in parallel, and PROG(HG) is:
PROG(HG1), PROG(HG2), ..., PROG(HGk);
RESULT(HG) := RESULT(HG1) × RESULT(HG2) × ... × RESULT(HGk);
rule 3: given a privacy node SA and its vertical code X_SA.L known from the properties of the Bayesian network, consider the set Links of all chains whose tail node is the privacy node, and let Xi and Xj be any two nodes in Links other than SA; if Xi.L < Xj.L, then at the same privacy-preserving granularity correcting the probability distribution of Xi has less influence on the availability of the global data, which yields the lower-proximity principle: the closer the modified attribute node is to the privacy attribute, the more targeted the modification. In other words, if the hypergraph HG consists of k connected components HG1, HG2, ..., HGk, the probabilistic dependence of each hyperedge on the privacy node is judged; if HGi in this set is closer to the privacy node than HGj, then HGi is resolved first, and vice versa.
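A small sketch of rule 3's lower-proximity ordering, assuming each connected component is ranked by the minimum vertical code among its nodes (lower codes sit closer to the privacy node, whose code is lowest); the node names and codes are hypothetical:

```python
def resolution_order(components, level):
    """Rule 3 (lower-proximity principle): resolve first the component
    whose nodes sit closest to the privacy node, i.e. order the
    components by their minimum vertical code."""
    return sorted(components, key=lambda comp: min(level[n] for n in comp))

# Hypothetical vertical codes, e.g. from the encoding sketch above:
level = {"A": 3, "B": 2, "C": 1, "SA": 0}
print(resolution_order([["A", "B"], ["C"]], level))  # [['C'], ['A', 'B']]
```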