CN112765653B - Multi-source data fusion privacy protection method based on multi-privacy policy combination optimization - Google Patents


Info

Publication number
CN112765653B
Authority
CN
China
Prior art keywords
data
privacy
node
attribute
value
Prior art date
Legal status
Active
Application number
CN202110014817.4A
Other languages
Chinese (zh)
Other versions
CN112765653A (en)
Inventor
周志刚
白增亮
王宇
梁子恺
吴天生
Current Assignee
Shancai Hi Tech Shanxi Co ltd
Original Assignee
Shancai Hi Tech Shanxi Co ltd
Priority date
Filing date
Publication date
Application filed by Shancai Hi Tech Shanxi Co ltd
Priority to CN202110014817.4A
Publication of CN112765653A
Application granted
Publication of CN112765653B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/251Fusion techniques of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of data publishing, and specifically relates to a multi-source data fusion privacy protection method based on combinatorial optimization of multiple privacy policies. A multi-party data fusion architecture based on re-anonymization is provided to prevent privacy leakage from the fused data. Further, the practical significance of data fusion is to provide a more comprehensive data basis on which users can conduct extensive knowledge mining. Therefore, a combinatorial optimization scheme for multiple privacy protection strategies is designed, which maximizes the usability of the fused data while satisfying the privacy constraints of all parties. The strategy maps the multi-source, multi-privacy-constraint data fusion onto a hypergraph; using heuristic rules, hyperedges are selected, solved, and eliminated from the hypergraph one by one. The process of resolving the hypergraph is also the process of realizing the privacy constraints one by one, and the data fusion scheme is formulated accordingly.

Description

Multi-source data fusion privacy protection method based on multi-privacy policy combination optimization
Technical Field
The invention belongs to the field of data publishing, and specifically relates to a multi-source data fusion privacy protection method based on combinatorial optimization of multiple privacy policies.
Background
Multi-source cross-platform data and cross-domain data applications are the most prominent characteristics of big data. In the big data era, owing to the explosive growth of data in different application fields, a single type of data (such as location data, social data, cookie logs, or shopping-website transaction streams) can hardly satisfy the demand for complex upper-layer application services. For example, if Bob needs an app to find nearby friends who like to play basketball, meeting this need requires an organic fusion of location data with social data. Beyond the cross-domain fusion needs of individuals, the real demands and applications of cross-domain data fusion between departments within an enterprise, between heterogeneous enterprises, and even between enterprises and government departments (such as precise advertisement push, ride-hailing dispatch optimization, and smart-city subway line planning) require the data owners of different platforms to cooperate deeply at the data level. However, the data of each platform often carry great use value and may include sensitive/private information such as users' identity, behavior, financial, and even disease information; directly publishing the original data would inevitably disclose user privacy.
To prevent user privacy from being disclosed, the data sets of the respective platforms must undergo desensitization (e.g., perturbation, noise addition, generalization) before fused big data are published. Most conventional anonymity-based privacy protection methods only protect the data of a single source and cannot effectively counter the disclosure of non-explicit private information brought about by deep association analysis over big data. Moreover, a single privacy protection method can no longer meet the personalized privacy requirements of data users, just as local privacy protection of each source does not avoid the risk of global privacy disclosure after fusion. For example, Alice purchases a ticket to Munich on ticketing site A and browses tourist attractions of Munich on the web page of travel company B. Companies A and B publish their information separately: company A employs a 3-anonymity-based generalization technique, i.e., "ticket to Munich" is generalized to "ticket to Europe", and company B employs a 3-diversity technique, i.e., the browsing behaviors of two other users who browsed the company website at the same time as Alice are published together as a set {2017-07-11: {Germany: Neuschwanstein Castle, Japan: Mount Fuji, USA: Massachusetts Institute of Technology}}. Suppose an adversary knows that Alice has an outbound travel plan, obtains her Internet access records, and learns that she registered with both companies A and B; by associating the two separately published data sets, the adversary can accurately infer that Alice will travel to Munich. This is also the most essential problem facing privacy protection in big data publishing: privacy disclosure caused by an attacker's association analysis over distributed big data after multi-source fusion. One naive approach is to combine the privacy protection methods over the naturally joined fused data at method-level granularity. However, such method-level combination may result in "over-protection" of private information and thereby severely reduce data availability, as shown in FIG. 1: in a two-party data fusion, scheme I (5-anonymity first, then 3-diversity) needs to add 29 pieces of noise, whereas scheme II (3-diversity first, then 5-anonymity) needs to add 20. Hence, fine-grained combinatorial optimization of multiple privacy protection methods that maximizes data availability remains an open problem in privacy-preserving big data fusion and publication.
In the field of privacy protection for data publishing, traditional privacy protection algorithms include differential privacy, k-anonymity, l-diversity, t-closeness, and the like, and some scholars' improvements on the traditional algorithms have milestone significance. For example, with the help of a semantic hierarchy tree, Wang et al. semantically generalize the records that are too few to meet the anonymity requirement, so that these records achieve k-anonymity under broader semantics; however, record generalization causes irreversible information loss, and applying the k-anonymity criterion to high-dimensional sparse data greatly reduces data availability. Brijesh B. et al. propose a method that improves l-diversity anonymity with a significant improvement in running time and, owing to the close arrangement of records in the initial equivalence classes, less information loss than existing methods while providing the same level of privacy. In general, these conventional privacy protection models are only applicable to static data publication in specific scenarios. The risk faced by big data publishing, however, lies in the dynamics of the publishing process and its multi-source, cross-platform character, so attackers must be prevented from performing association analysis on the data after multi-source fusion and thereby breaking the anonymity of the data.
Regarding privacy protection in data fusion, H. Patel et al. propose a secure bottom-up method for fusing two parties' data, but the model presumes a trusted third party that fuses all data into a complete original data table before anonymizing it; since a trusted third party does not exist in most cases, the method has low practical value. Jiang et al. propose the DkA secure fusion model for two parties' data under the semi-honest model; the algorithm uses a commutative encryption strategy to hide the original information during communication and judges whether the anonymity threshold k is met by constructing a complete anonymity table, realizing privacy protection during data fusion, but its resource consumption is too large for fusing large data sets. Clifton et al. develop secure multi-party data integration tools for four typical operations on relational data: count, union, intersection, and Cartesian product. Yeom et al. study the indirect privacy disclosure caused by insufficient model generalization ability, and Mohammed et al. subsequently realize privacy protection for each party of a data integration using a data generalization technique based on a classification tree structure, but the information loss of the integrated data is high, its degree depending on the data set. All of these schemes assume that the multiple parties participating in data fusion adopt the same privacy protection strategy; however, facing the diverse privacy protection requirements of big data, different platforms may adopt personalized privacy protection strategies according to their own application requirements before fusion, and the existing schemes are then difficult to apply.
Disclosure of Invention
The invention provides a multi-source data fusion privacy protection method based on combinatorial optimization of multiple privacy policies. Specifically, this patent first proposes a multi-party data fusion architecture based on re-anonymization: the inner-layer data anonymization takes place before data fusion and is implemented by the owners of the respective local data to give the data initial protection; the outer-layer data anonymization takes place during data fusion and is implemented by the parties participating in the fusion according to an agreed multi-party privacy protection protocol (for simplicity of description, this anonymization is regarded as simultaneously satisfying the privacy constraints of all parties), preventing privacy leakage from the fused data. Further, the practical significance of data fusion is to provide a more comprehensive data basis on which users can conduct extensive knowledge mining. Therefore, this patent designs a combinatorial optimization scheme for multiple privacy protection strategies, which maximizes the usability of the fused data while satisfying the privacy constraints of all parties. The strategy maps the multi-source, multi-privacy-constraint data fusion onto a hypergraph; using heuristic rules, hyperedges are selected, solved, and eliminated one by one, the process of resolving the hypergraph being also the process of realizing the privacy constraints one by one, and the data fusion scheme is formulated accordingly.
To achieve the above technical purpose and effect, the invention is realized by the following technical scheme:
Step 1, constructing a data multi-source fusion system model:
As shown in FIG. 2, in the system model the data owners first collect the data of each party, and to prevent privacy disclosure each party performs its own data anonymization. Second, because the data volume of some entities is huge, the data must be stored in a public cloud; data fusion in the public cloud organically integrates the multi-source, cross-platform data, aiming to mine useful information better from the fused, complete data sets of all parties. If the data were fused naively, the concern that the public cloud snoops on privacy after fusion could not be dispelled, so the public cloud must also perform a re-anonymization operation. Furthermore, users enjoy the convenience of big data by customizing required services; however, unknown attackers may hide among the users, so the user is assumed here to be "curious" as well, i.e., users and the cloud service provider are treated as a suspected privacy-mining group with the same attack capability.
Step 2, designing a multiple data fusion anonymity framework:
Aiming at the frequent cross-platform exchange and sharing of big data information, this patent provides a re-anonymization framework based on multi-party data fusion, comprising four processes: initial state, handshake, data synchronization, and secondary anonymization. First, in the initial state, each party performs the corresponding anonymization on its own data according to its own privacy protection requirements; second, multi-party communication takes place in the handshake process, where each party publishes its data privacy protection requirements; in data synchronization, the distributions of the public attribute values of the multi-party data are made consistent while the privacy protection requirements of all parties are satisfied; in the secondary anonymization, the data sets are converted into Bayesian networks and hierarchical structure diagrams are constructed by encoding the Bayesian networks, so that the anonymization is finally cast as a probabilistic inference problem.
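For illustration, the four processes can be sketched as the minimal Python skeleton below. This is a sketch under assumptions, not the patent's implementation: the names Party, initial_anonymize, handshake, synchronize, and secondary_anonymize are all hypothetical, and the anonymization bodies are placeholders.

from dataclasses import dataclass

@dataclass
class Party:
    name: str
    records: list   # locally held data records
    policy: dict    # published privacy requirement, e.g. {"k": 5} or {"l": 3}

def initial_anonymize(party):
    # Phase 1 (initial state): each owner applies its own policy locally.
    print(f"{party.name}: local anonymization under {party.policy}")
    return party

def handshake(parties):
    # Phase 2 (handshake): every party publishes its privacy requirement.
    return [p.policy for p in parties]

def synchronize(parties):
    # Phase 3 (data synchronization): align the distributions of the
    # public attribute values across parties (placeholder).
    print("aligning public-attribute distributions across parties")
    return parties

def secondary_anonymize(parties, constraints):
    # Phase 4 (secondary anonymization): fuse, then re-anonymize so that
    # all published constraints hold simultaneously.
    fused = [r for p in parties for r in p.records]
    print(f"fused {len(fused)} records under joint constraints {constraints}")
    return fused

parties = [Party("A", [{"zip": "0300*"}], {"k": 5}),
           Party("B", [{"zip": "0301*"}], {"l": 3})]
constraints = handshake([initial_anonymize(p) for p in parties])
fused = secondary_anonymize(synchronize(parties), constraints)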
Step 3, realizing the privacy protection strategy:
Given the privacy protection policy, the Bayesian network G is finally formed through the above two processes; the relevant attribute nodes must then be operated on so that the privacy node X_s meets the policy requirements. Specifically, for a privacy protection policy F, this patent defines the unit privacy protection operation: the privacy budget is partitioned into d equal parts, and in each round only the probability distribution of one selected attribute node is privacy-protected. For the attribute node awaiting the privacy operation, four privacy protection operations are realized: k-anonymity, l-diversity, t-closeness, and attribute value generalization.
Step 4, by mapping the multi-party, multi-privacy-constraint data fusion onto a hypergraph and designing corresponding heuristic rules, the data fusion process is evolved into a hypergraph resolution process, improving data availability while satisfying the privacy constraints.
Further, in step 1, a data fusion model and a privacy attack model are constructed under an adversarial-learning architecture; the specific steps are as follows:
Step 1.1, constructing a data fusion model:
Data fusion organically integrates data belonging to multiple sources, aiming to mine useful information better through a more complete data set than before, so as to provide high-quality services for users. For ease of discussion, a formal description of the data is first given: a data set is represented as a quadruple D(X, A, F, V), where X = {x_1, x_2, …, x_n} is the set of data records and each record x_i is exclusively associated with one dedicated user u_i; A is the attribute set, divided by sensitivity into an information attribute set IA and a sensitive attribute set SA, with IA ∪ SA = A and IA ∩ SA = ∅; F is the set of relations between X and A; V = ∪_{a_k ∈ A} V_{a_k}, where V_{a_k} is the value range of attribute a_k.
Definition 1 (equivalence class): given a data set D(X, A, F, V) and any attribute subset A′ ⊆ A, if there are t records {x_1, x_2, …, x_t} (t ≥ 1) that agree on every attribute in A′, then {x_1, x_2, …, x_t} is an equivalence class of D with respect to A′, denoted [x_i]_{A′}; conversely, the set E_{A′} of all equivalence classes formed by attribute set A′ constitutes a partition of D, denoted D/E_{A′}. In particular, if A′ ⊆ IA, the corresponding equivalence class is called an information equivalence class.
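As an illustration of the quadruple representation and Definition 1, the following Python sketch groups records into equivalence classes over an attribute subset; the data values and attribute names are invented for the example.

from collections import defaultdict

X = [  # data records, one per user
    {"age": "30-39", "zip": "0300*", "disease": "flu"},
    {"age": "30-39", "zip": "0300*", "disease": "cancer"},
    {"age": "40-49", "zip": "0301*", "disease": "flu"},
]
IA = ["age", "zip"]   # information attributes
SA = ["disease"]      # sensitive attributes
A = IA + SA           # IA ∪ SA = A, IA ∩ SA = ∅

def equivalence_classes(records, attrs):
    """Group records whose projections onto `attrs` coincide (Definition 1)."""
    classes = defaultdict(list)
    for x in records:
        key = tuple(x[a] for a in attrs)   # f(x, A'): projection onto A'
        classes[key].append(x)
    return classes                          # the partition D / E_{A'}

for key, cls in equivalence_classes(X, IA).items():
    print(key, "->", len(cls), "records")   # information equivalence classes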
Definition 2 (data fusion): given m data sets {D_1, …, D_m}, the fused data set D(X, {IA, SA}, F, V) is formed by the component-wise union of the records, attributes, relations, and value ranges of the m data sets. In particular, if two data sets to be fused D_i, D_j satisfy IA_i Δ IA_j ≠ ∅ (Δ denotes the symmetric difference operator), the fusion is called information incremental fusion; if there is a record x_k ∈ X_i ∩ X_j whose attribute values in one data set strictly refine its attribute values in the other, it is called information refinement fusion; and if every record x_k ∈ X_i ∩ X_j takes consistent sensitive attribute values in both data sets (where SA_i = SA_j), the fusion is called harmonious. The research scope of this patent is harmonious information incremental and refinement fusion.
Step 1.2, constructing privacy and privacy attack models:
Privacy here refers to the injective mapping between a user and the corresponding sensitive attribute value of the data; if this mapping is revealed, the user's privacy is disclosed. According to the data model, users and data records are in one-to-one correspondence, and on the data side each user corresponds to a group of information attribute values, i.e., there is an injection between the user and the information attribute values; the equivalence class containing this group of information attribute values in turn corresponds to a set of privacy attribute values. By the transitivity of injections, if the group of information attribute values and the corresponding privacy attribute value set form an injection (that is, the privacy attribute value set contains only one element), publishing the data set discloses the user's data privacy.
Definition 3 (data privacy disclosure): given a data set D(X, {IA, SA}, F, V), for any record x_i, let [x_i]_{IA} be the information equivalence class to which it belongs, and denote its corresponding set of privacy attribute values by V_SA([x_i]_{IA}); if |V_SA([x_i]_{IA})| = 1, the data privacy is said to be disclosed.
Definition 4 (knowledge-based attack): suppose the adversary knows the target user u_i's information attribute values f(x_i, IA) and knows that the user's data record x_i is in the data set D(X, {IA, SA}, F, V) to be published. When the data are published, the adversary constructs the chain from u_i through its information attribute values to the information equivalence class [x_i]_{IA} and its privacy attribute value set V_SA, and forms the privacy inference probability: for any v_j ∈ V_SA, the probability that user u_i takes the value v_j on SA is C(v_j)/C(V_SA) (where C(*) is a counting statistics function and * is a domain qualifier).
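A hedged sketch of the counting attack in Definition 4, with invented data: the adversary restricts the published table to the information equivalence class matching the target's known attribute values and estimates the sensitive-value probabilities by counting.

from collections import Counter

published = [
    {"age": "30-39", "zip": "0300*", "disease": "flu"},
    {"age": "30-39", "zip": "0300*", "disease": "flu"},
    {"age": "30-39", "zip": "0300*", "disease": "cancer"},
]

def inference_probabilities(records, known_ia, sa):
    # [x_i]_IA: records matching everything the adversary knows about u_i
    cls = [x for x in records if all(x[a] == v for a, v in known_ia.items())]
    counts = Counter(x[sa] for x in cls)   # C(v_j) for each sensitive value
    total = sum(counts.values())           # C(V_SA)
    return {v: c / total for v, c in counts.items()}

# The adversary knows the target is 30-39 in zip 0300*:
print(inference_probabilities(published, {"age": "30-39", "zip": "0300*"}, "disease"))
# flu ≈ 2/3, cancer ≈ 1/3; privacy leaks when some value has probability 1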
Definition 5 (multi-version attack under incremental publishing): given a first published data set D(X, {IA, SA}, F, V) and the corresponding updated data set D′(X′, {IA′, SA′}, F′, V′) later published by the same publisher, suppose the adversary compares the records of a dedicated user u_i in X′ and X, constructs the corresponding relation between the two versions, and forms the privacy inference probability over the two data sets from the selected privacy attribute value sets, where SEL is the selection function (the precise formula is rendered as an image in the original).
Further, the vertical encoding in step 2 comprises two stages: a Bayesian network structure learning stage and a network encoding stage.
The specific steps of the Bayesian network structure learning stage are as follows:
Step 2.1, consider learning a Bayesian network structure from a data set D = {D_1, …, D_n} over a set of m random variables X = {X_1, …, X_m}. It is assumed that the variables are categorical and that the data set is complete. The goal of the Bayesian network construction algorithm is, by determining a parent set Π_1, …, Π_m for each variable, to find the highest-scoring directed acyclic graph (DAG) G on the node set X. Under the Markov condition, this induces a joint probability distribution in which each variable is conditionally independent of its non-descendants given its parents.
Step 2.2, different scoring functions can be used to evaluate the quality of the generated DAG; here the Bayesian Information Criterion (BIC) score is used, which is proportional to the posterior probability of the DAG. BIC is decomposable, consisting of the sum of the scores of each variable with its parent set:

BIC(G) = Σ_i BIC(X_i | Π_i) = Σ_i [ LL(X_i | Π_i) − Pen(X_i | Π_i) ]

where LL(X_i | Π_i) is the log-likelihood of X_i given its parent set Π_i:

LL(X_i | Π_i) = Σ_{x, π} N_{x,π} log θ̂_{x|π}

and Pen(X_i | Π_i) is the complexity penalty of X_i with its parent set Π_i:

Pen(X_i | Π_i) = (log N / 2) · (|V(X_i)| − 1) · |V(Π_i)|

Here θ̂_{x|π} is the maximum likelihood estimate of the conditional probability P(X_i = x | Π_i = π), N_{x,π} denotes the number of occurrences of (X_i = x, Π_i = π) in the data set, and |V(·)| denotes the size of the Cartesian product of the value ranges of the given variables.
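The decomposable BIC score above can be sketched as follows for categorical records stored as dicts; bic_term and the toy data are illustrative, not from the patent.

import math
from collections import Counter

def bic_term(data, child, parents):
    """BIC(X_i | Pi_i) = LL(X_i | Pi_i) - Pen(X_i | Pi_i)."""
    n = len(data)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    parent_counts = Counter(tuple(r[p] for p in parents) for r in data)
    # log-likelihood with ML estimates theta = N_{x,pi} / N_pi
    ll = sum(n_xp * math.log(n_xp / parent_counts[pi])
             for (pi, _), n_xp in joint.items())
    # penalty: (log N / 2) * number of free parameters
    child_states = len({r[child] for r in data})
    parent_configs = max(len(parent_counts), 1)
    pen = (math.log(n) / 2) * (child_states - 1) * parent_configs
    return ll - pen

data = [{"A": a, "B": b} for a, b in
        [("0", "0")] * 40 + [("0", "1")] * 10 + [("1", "1")] * 50]
print(bic_term(data, "B", ["A"]))   # score of B with parent {A}
print(bic_term(data, "B", []))      # score of B with no parents (lower here)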
This patent uses the hill-climbing method to generate the Bayesian network of the corresponding data; the main steps are given in Algorithm 1.
Algorithm 1: Bayesian network structure generation based on the hill-climbing method (listing rendered as an image in the original).
It should be noted that the "flip edge" operation cannot simply be treated as the sequence "delete an edge, then add the same edge in the opposite direction". Because the algorithm adopts a greedy strategy, the edge deletion alone may lower the BIC score of the Bayesian network and terminate the procedure early, so that the corresponding edge addition is never applied.
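Continuing the previous sketch, a minimal hill climb in the spirit of Algorithm 1 (whose listing is an image in the original) might look as follows. The move set here is only add/delete, with the flip caveat noted above, and it reuses bic_term() and data from the BIC sketch; all names are illustrative.

import itertools

def total_bic(data, variables, parents):
    return sum(bic_term(data, v, sorted(parents[v])) for v in variables)

def creates_cycle(parents, child, parent):
    # Adding parent -> child cycles iff child is already an ancestor of parent.
    seen, stack = set(), [parent]
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def hill_climb(data, variables):
    parents = {v: set() for v in variables}
    best = total_bic(data, variables, parents)
    improved = True
    while improved:
        improved = False
        for u, v in itertools.permutations(variables, 2):
            if u in parents[v]:
                parents[v].discard(u)        # candidate move: delete u -> v
            elif not creates_cycle(parents, v, u):
                parents[v].add(u)            # candidate move: add u -> v
            else:
                continue
            score = total_bic(data, variables, parents)
            if score > best:
                best, improved = score, True # keep the move
            else:
                parents[v].symmetric_difference_update({u})  # undo the move
        # (a flip would be delete + add scored jointly as one move, since
        #  applying them greedily one at a time can stall, as noted above)
    return parents, best

print(hill_climb(data, ["A", "B"]))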
The specific steps of the Bayesian network encoding stage are as follows:
step 2.3, a hierarchical structure diagram is constructed by longitudinally encoding a Bayesian network, wherein the hierarchical structure diagram comprises two stages: a bottom-up encoding phase and a top-down modification phase. Specifically, given a bayesian network, which is converted into a hierarchical structure diagram by encoding, algorithm 2 is a bayesian network encoding process:
algorithm 2 Bayesian network vertical coding
Figure GDA0003836312730000072
Figure GDA0003836312730000081
1) Bottom-up encoding stage. First, the level of every node is initialized to zero; the algorithm then marks levels starting from the leaf nodes and progressively traces the corresponding parent nodes. In each round, when the level of a child node is q, the level of its parent node is marked q+1. For non-leaf nodes only the current maximum code is recorded: if a node's code is not 0, the new code is compared with the existing one and the larger is kept; if the new code equals the existing one, upward backtracking from that node stops. Whether the leaf node queue is empty is then checked, and if so, backtracking stops; otherwise the next leaf node is extracted for marking, until the leaf node sequence is empty.
2) Top-down correction stage. All nodes are first sorted by level from large to small, and all node codes are initialized as unmarked. The algorithm then extracts the unmarked node with the largest level (i.e., the largest code) in the node sequence, takes it as the starting point of a breadth-first traversal of the graph, and traverses downward level by level. In each round, when the level of a parent node is q, the level of its child node is marked q−1. Let q_old be the node's current level and q_new the newly derived one; two cases are considered: (a) when q_old < q_new, the algorithm sets the node's level to q_new and sets the node as marked; (b) when q_old = q_new and the node is already marked, the downward traversal from this node terminates early. The next unmarked node is then extracted, until no unmarked node remains in the sequence.
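A compact sketch of the two-phase vertical encoding, assuming the Bayesian network is given as a child-to-parents mapping; the function and variable names are invented, and the correction phase is simplified to the (a)/(b) cases described above.

from collections import deque

def encode_levels(parents):
    children = {v: [] for v in parents}
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    level = {v: 0 for v in parents}

    # Phase 1, bottom-up: a parent is labelled child's level + 1, keeping
    # only the maximum label; backtracking stops when the label cannot grow.
    queue = deque(v for v in parents if not children[v])   # leaf nodes
    while queue:
        node = queue.popleft()
        for p in parents[node]:
            if level[node] + 1 > level[p]:
                level[p] = level[node] + 1
                queue.append(p)            # label grew: keep tracing upward

    # Phase 2, top-down correction: from the highest unmarked node, traverse
    # breadth-first; a child is raised to parent's level - 1 when needed.
    marked = set()
    for start in sorted(parents, key=lambda v: -level[v]):
        if start in marked:
            continue
        queue = deque([start])
        marked.add(start)
        while queue:
            node = queue.popleft()
            for c in children[node]:
                q_new = level[node] - 1
                if level[c] < q_new or c not in marked:     # case (a)
                    level[c] = max(level[c], q_new)
                    marked.add(c)
                    queue.append(c)
                # case (b): q_old == q_new and already marked -> stop early
    return level

dag = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "S": ["D"]}
print(encode_levels(dag))   # e.g. {'A': 3, 'B': 2, 'C': 2, 'D': 1, 'S': 0}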
Further, step 3 realizes the four privacy protection operations of k-anonymity, l-diversity, t-closeness, and attribute value generalization; the specific steps are as follows:
Step 3.1, realizing k-anonymity: according to the value range set for the attribute by a domain expert or the data owner, the value-range space of the attribute in the Bayesian network is extended so that the number of distinct values in it is greater than or equal to k. During correction, following the correction principle of information-entropy maximization, the parent node of the privacy node assigns the attribute value with the largest probability mass among its child nodes to the privacy node, so that the privacy node satisfies the k requirement;
Step 3.2, realizing l-diversity: according to the data party's setting of the attribute's value range, the value-range space of the attribute in the Bayesian network is extended so that the number of distinct values in it is greater than or equal to l. The attribute is corrected following the correction principle of information-entropy maximization: in each correction round, only the value with the largest probability mass is selected as the target object to be corrected, and the probability mass above the mean is distributed evenly over the newly added attribute values;
Step 3.3, realizing t-closeness: the value distribution that maximizes the information entropy over the attribute's value-range space is defined as the theoretical standard, measured by variance, and the probability distribution of each value of the attribute is corrected so that the variance between each value's occurrence probability and the theoretical standard is not higher than t;
Step 3.4, realizing attribute value generalization: according to the attribute-value hierarchy tree set for the attribute by a domain expert or the data owner, the probability distributions of similar values in the value range of attribute C are fused. The attribute leaf node to be protected anonymously and all of its sibling leaf nodes in the value range of attribute C are aggregated into one attribute node and replaced by their direct parent node; the attribute-value probability distribution of this node is inherited from all the original leaf nodes participating in the aggregation.
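As an illustration of one unit privacy-protection operation, the sketch below applies a step 3.2 style l-diversity correction to a single node's probability distribution; the value-domain extension and the redistribution rule are simplified assumptions, not the patent's exact procedure.

def enforce_l_diversity(dist, l, max_rounds=20):
    """Extend the value domain to >= l values, then repeatedly clip the most
    probable value to the mean and share its excess among below-mean values,
    in the spirit of the entropy-maximization correction principle."""
    dist = dict(dist)
    i = 0
    while len(dist) < l:                 # extend the value-domain space
        dist[f"new_{i}"] = 0.0           # added values are illustrative
        i += 1
    for _ in range(max_rounds):
        mean = 1.0 / len(dist)
        target = max(dist, key=dist.get) # one correction target per round
        excess = dist[target] - mean
        below = [v for v in dist if dist[v] < mean and v != target]
        if excess < 1e-9 or not below:
            break
        dist[target] = mean
        for v in below:                  # spread the excess evenly
            dist[v] += excess / len(below)
    return dist

print(enforce_l_diversity({"flu": 0.9, "cancer": 0.1}, l=3))
# converges toward a near-uniform distribution over >= 3 values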
By converting the data sets under different privacy protection strategies into Bayesian networks and encoding them, the secondary anonymization process is realized, completing the re-anonymization framework for multi-party data fusion. In the multi-source data fusion process, however, the combination scheme of multiple privacy protection policies still needs to be optimized, so that the usability of the fused data is maximized while the privacy constraints of all parties are satisfied.
Further, in step 4, heuristic rules are adopted to evolve the data fusion process into a hypergraph resolution process. The specific steps are as follows. For intersecting hyperedges, PROG(HG) is:
FOR each hyperedge M intersecting hyperedge N DO
  eliminate bottom-up the probabilistically independent tuples in R(M)
ENDFOR;
and for a hypergraph split into sub-hypergraphs HG_1, …, HG_k:
PROG(HG_1), PROG(HG_2), …, PROG(HG_k);
RESULT(HG) := RESULT(HG_1) × RESULT(HG_2) × … × RESULT(HG_k)
The hypergraph resolution algorithm recursively applies the three rules (given in step 4.2 below); hyperedges are selected, solved, and eliminated from HG one by one, constructing the program PROG(HG) whose result is RESULT(HG), and the process of resolving the hypergraph is also the process of realizing the privacy constraints one by one. The hypergraph resolution heuristic algorithm is as follows:
Algorithm 3: hypergraph resolution heuristic algorithm (listing rendered as an image in the original).
Taking the two privacy protection policy operations F_1 = (A, B, D) and F_2 = (D, E, G, H) mentioned above as an example, we show how the heuristic algorithm constructs the program PROG(HG) and generates the result RESULT(HG):
(1) Resolving the hyperedges {A, B, D} and {D, E, G, H} yields the result hypergraph HG_1 = ({B, D}, {D, G}, {A}, {E}, {H}); by resolution rule 3, the PROG(HG) program is
PROG(HG_1);
RESULT(HG) := RESULT(HG_1).
(2) Let HG_2 = ({A}, {E}, {H}) and HG_3 = ({B, D}, {D, G}); by resolution rule 2, the PROG(HG_1) program is
PROG(HG_2), PROG(HG_3);
RESULT(HG_1) := RESULT(HG_2) × RESULT(HG_3).
Since HG_2 consists of three mutually independent hyperedges, PROG(HG_2) is
RESULT(HG_2) := R({A}, {E}, {H}).
(3) Computing PROG(HG_3): resolving the hyperedges {B, D} and {D, G} yields the result hypergraph HG_4; by resolution rule 3, the PROG(HG_3) program is
PROG(HG_4);
RESULT(HG_3) := RESULT(HG_4).
(4) Since HG_4 contains only one hyperedge, rule 1 gives PROG(HG_4) as
RESULT(HG_4) := R({D, G}).
The final program can be written as the composition of the above steps (the full listing is rendered as an image in the original).
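The recursion of steps (1)-(4) can be sketched in Python as below, starting from the result hypergraph HG_1 of step (1). Rules 1 and 2 follow the text; the handling of an intersecting pair is a simplification in which the surviving edge is fixed by hand rather than chosen by the rule-3 proximity codes, and all names are illustrative.

def components(edges):
    """Split hyperedges into connected components (rule 2 preprocessing)."""
    comps = []
    for e in map(frozenset, edges):
        hit = [c for c in comps if any(e & f for f in c)]
        comps = [c for c in comps if c not in hit] + [sum(hit, []) + [e]]
    return comps

def prog(edges, name="HG1"):
    comps = components(edges)
    if len(comps) > 1:                      # rule 2: disjoint sub-hypergraphs
        parts = [f"{name}_{i + 1}" for i in range(len(comps))]
        print(f"RESULT({name}) := " + " x ".join(f"RESULT({p})" for p in parts))
        for p, c in zip(parts, comps):
            prog(c, p)
    elif len(edges) == 1:                   # rule 1: a single hyperedge
        print(f"RESULT({name}) := R({set(edges[0])})")
    else:                                   # intersecting pair of hyperedges
        a, b = map(frozenset, edges[:2])
        print(f"RESULT({name}) := resolve({set(a)}, {set(b)})")
        # rule 3 would keep the edge whose nodes lie closer to the privacy
        # node; lacking the vertical codes here, we keep the second edge:
        prog([b] + list(edges[2:]), name + "'")

HG1 = [{"B", "D"}, {"D", "G"}, {"A"}, {"E"}, {"H"}]
prog(HG1)   # prints a schematic PROG ending in RESULT := R({'D', 'G'})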
of course, it is not necessary for any product to achieve all of the above advantages at the same time in the practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a comparative case analysis of results of execution sequences of different privacy preserving policies;
FIG. 2 is a system model for multi-source data fusion;
FIG. 3 is a hypergraph HG;
FIG. 4 shows the comparison results with and without re-anonymization;
FIG. 5 is a comparison of a naive algorithm and an optimized algorithm;
FIG. 6 is a graph of privacy attribute probabilities for discriminators and generators in different equivalence classes.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A multi-source data fusion privacy protection method based on multi-privacy policy combination optimization comprises the following steps:
Step 1, constructing a data multi-source fusion system model: first, in the system model the data owners collect the data of each party, and to prevent privacy disclosure each party performs its own data anonymization; second, because the data volume of some entities is huge, the data must be stored in a public cloud; data fusion in the public cloud organically integrates the multi-source, cross-platform data, aiming to mine useful information better from the fused, complete data sets of all parties, and since naive fusion cannot dispel the concern that the public cloud snoops on privacy after fusion, the public cloud must also perform a re-anonymization operation; furthermore, users enjoy the convenience of big data by customizing required services, yet unknown attackers may hide among them, so the user is assumed here to be "curious" as well, i.e., users and the cloud service provider are treated as a suspected privacy-mining group with the same attack capability. The specific steps are as follows:
Step 1.1, constructing a data fusion model. Data fusion organically integrates data belonging to multiple sources, aiming to mine useful information better through a more complete data set than before, so as to provide high-quality services for users. A data set is represented as a quadruple D(X, A, F, V), where X = {x_1, x_2, …, x_n} is the set of data records and each record x_i is exclusively associated with one dedicated user u_i; A is the attribute set, divided by sensitivity into an information attribute set IA and a sensitive attribute set SA, with IA ∪ SA = A and IA ∩ SA = ∅; F is the set of relations between X and A; V = ∪_{a_k ∈ A} V_{a_k}, where V_{a_k} is the value range of attribute a_k.
Step 1.2, constructing privacy and privacy attack models. Privacy here refers to the injective mapping between a user and the corresponding sensitive attribute value; if this mapping is revealed, the user's privacy is disclosed. According to the data model, users and data records are in one-to-one correspondence, and on the data side each user corresponds to a group of information attribute values, i.e., there is an injection between the user and the information attribute values; the equivalence class containing this group of information attribute values in turn corresponds to a set of privacy attribute values. By the transitivity of injections, if the group of information attribute values and the corresponding privacy attribute value set form an injection (that is, the privacy attribute value set contains only one element), publishing the data set discloses the user's data privacy.
Step 2, designing a multiple data fusion anonymity framework: aiming at the frequent cross-platform exchange and sharing of big data information, this patent provides a re-anonymization framework based on multi-party data fusion, comprising an initial state, a handshake process, data synchronization, and secondary anonymization. First, in the initial state, each party performs the corresponding anonymization on its own data according to its own privacy protection requirements; second, multi-party communication takes place in the handshake process, where each party publishes its data privacy protection requirements; in data synchronization, the distributions of the public attribute values of the multi-party data are made consistent while the privacy protection requirements of all parties are satisfied; in the secondary anonymization, the data sets are converted into Bayesian networks and hierarchical structure diagrams are constructed by encoding them, so that the anonymization is finally cast as a probabilistic inference problem.
Step 2.1, consider learning a Bayesian network structure from a data set D = {D_1, …, D_n} over a set of m random variables X = {X_1, …, X_m}. It is assumed that the variables are categorical (i.e., each variable has a finite number of states) and that the data set is complete. The goal of the Bayesian network construction algorithm is, by determining a parent set Π_1, …, Π_m for each variable, to find the highest-scoring directed acyclic graph (DAG) G on the node set X. Under the Markov condition, this induces a joint probability distribution in which each variable is conditionally independent of its non-descendants given its parents.
Step 2.2, different scoring functions can be used to evaluate the quality of the generated DAG; in this patent we use the Bayesian Information Criterion (BIC) score, which is proportional to the posterior probability of the DAG. BIC is decomposable, consisting of the sum of the scores of each variable with its parent set:

BIC(G) = Σ_i BIC(X_i | Π_i) = Σ_i [ LL(X_i | Π_i) − Pen(X_i | Π_i) ]

where LL(X_i | Π_i) is the log-likelihood of X_i given its parent set Π_i:

LL(X_i | Π_i) = Σ_{x, π} N_{x,π} log θ̂_{x|π}

and Pen(X_i | Π_i) is the complexity penalty of X_i with its parent set Π_i:

Pen(X_i | Π_i) = (log N / 2) · (|V(X_i)| − 1) · |V(Π_i)|

Here θ̂_{x|π} is the maximum likelihood estimate of the conditional probability P(X_i = x | Π_i = π), N_{x,π} denotes the number of occurrences of (X_i = x, Π_i = π) in the data set, and |V(·)| denotes the size of the Cartesian product of the value ranges of the given variables.
The Bayesian network encoding constructs a hierarchical structure diagram by vertically encoding the Bayesian network, in two stages: a bottom-up encoding stage and a top-down correction stage. Specifically, given a Bayesian network, it is converted into a hierarchical structure diagram by encoding.
Step 2.3, bottom-up encoding stage. First, the level of every node is initialized to zero; the algorithm then marks levels starting from the leaf nodes and progressively traces the corresponding parent nodes. In each round, when the level of a child node is q, the level of its parent node is marked q+1. For non-leaf nodes only the current maximum code is recorded: if a node's code is not 0, the new code is compared with the existing one and the larger is kept; if the new code equals the existing one, upward backtracking from that node stops. Whether the leaf node queue is empty is then checked, and if so, backtracking stops; otherwise the next leaf node is extracted for marking, until the leaf node sequence is empty.
Step 2.4, top-down correction stage. All nodes are first sorted by level from large to small, and all node codes are initialized as unmarked. The algorithm then extracts the unmarked node with the largest level (i.e., the largest code) in the node sequence, takes it as the starting point of a breadth-first traversal of the graph, and traverses downward level by level. In each round, when the level of a parent node is q, the level of its child node is marked q−1. Here, the value q_old of the node's current level is compared with the newly derived value q_new, and two cases are considered: (a) when q_old < q_new, the algorithm sets the node's level to q_new and sets the node as marked; (b) when q_old = q_new and the node is already marked, the downward traversal from this node terminates early. The next unmarked node is then extracted, until no unmarked node remains in the sequence.
Step 3, realizing the privacy protection strategy: given the privacy protection policy, the Bayesian network G is finally formed through steps 1-2; the relevant attribute nodes must then be operated on so that the privacy node X_s meets the policy requirements. Specifically, for a privacy protection policy F, the unit privacy protection operation is defined: the privacy budget is divided into d equal parts, and in each round only the probability distribution of one selected attribute node is privacy-protected. For the attribute awaiting the privacy operation, four privacy protection operations are realized: k-anonymity, l-diversity, t-closeness, and attribute value generalization.
Step 3.1, realizing k-anonymity: according to the value range set for the attribute by a domain expert or the data owner, the value-range space of the attribute in the Bayesian network is extended so that the number of distinct values in it is greater than or equal to k. During correction, following the correction principle of information-entropy maximization, the parent node of the privacy node assigns the attribute value with the largest probability mass among its child nodes to the privacy node, so that the privacy node satisfies the k requirement;
step 3.2, realizing l-diversity: as above, according to data party pair attributes
Figure GDA0003836312730000151
Setting of range of value, for attribute
Figure GDA0003836312730000152
And expanding the value range space in the Bayesian network so that the number of different values in the value range space is greater than or equal to l. Corrected attribute
Figure GDA0003836312730000153
According to the correction principle of information entropy maximization, in each correction process, only selecting one value with the maximum probability distribution as a target object to be corrected, and averagely distributing the probability distribution value higher than the mean value to a newly-added attribute value;
Step 3.3, realizing t-closeness: the value distribution that maximizes the information entropy over the attribute's value-range space is defined as the theoretical standard, measured by variance, and the probability distribution of each value of the attribute is corrected so that the variance between each value's occurrence probability and the theoretical standard is not higher than t;
Step 3.4, realizing attribute value generalization: according to the attribute-value hierarchy tree set for the attribute by a domain expert or the data owner, the probability distributions of similar values in the attribute's value range are fused. The attribute leaf node to be protected anonymously and all of its sibling leaf nodes in the attribute's value range are aggregated into one attribute node and replaced by their direct parent node; the attribute-value probability distribution of this node is inherited from all the original leaf nodes participating in the aggregation.
By converting the data sets under different privacy protection strategies into Bayesian networks and encoding them, the secondary anonymization process is realized, completing the re-anonymization framework for multi-party data fusion. In the multi-source data fusion process, however, the combination scheme of multiple privacy protection policies still needs to be optimized, so that the usability of the fused data is maximized while the privacy constraints of all parties are satisfied.
Step 4, by mapping the multi-party, multi-privacy-constraint data fusion onto a hypergraph and designing corresponding heuristic rules, the data fusion process is cast as a hypergraph resolution process, improving data availability while satisfying the privacy constraints. The specific steps are as follows:
Step 4.1, the privacy protection policy is formally defined as a five-tuple F = (G, IA, SA, OP, V), where G denotes the Bayesian network converted from the data set; IA denotes the information attribute nodes, IA = (a_1, a_2, …, a_m), where a_1, a_2, …, a_m are not mutually independent but have probabilistic dependence relations; SA denotes the privacy node; OP denotes the operation steps, OP = (OP_1, OP_2, …, OP_m); and V denotes the value ranges after the operations OP, V = (V_{a_1}, V_{a_2}, …, V_{a_m}).
The execution order of the different privacy protection strategies is judged from the data level and the structural level.

From the data level:
1) If a_m can be represented by a_1, a_2, …, a_n, then OP_m < OP_n, i.e., OP_m is executed later, and vice versa.

From the structural level:
2) Starting from the privacy node of the Bayesian network, the network is encoded through the bottom-up encoding stage and the top-down correction stage, and the privacy node SA is corrected within the maximum correction threshold. Comparing two operations on the privacy attribute, OP_i(SA_i, V_ai) and OP_j(SA_k, V_aj), that both achieve the required privacy protection: if OP_i affects the data structure less, the cost-effectiveness of OP_i in achieving the required protection is higher than that of OP_j, and OP_i is executed before OP_j, and vice versa.

3) If multiple operations OP_i, OP_j, OP_k on information attributes are involved, two cases are distinguished (the precise conditions are formulas rendered as images in the original). First, the value range of each operation is computed separately through the probabilistic inference relations between the information attributes; if the condition on IA_j holds, OP_i is executed before OP_j and OP_k, and if the opposite condition holds, OP_k is executed before OP_j and OP_i. Second, when the corresponding conditions hold, OP_k is executed before OP_j and OP_i; as for the order of OP_j and OP_i, if the value range produced by one operation OP_o affects OP_j, then OP_o is executed before OP_j, and vice versa.
For example, let two privacy protection policy operations be decomposed as F_1 = (A, B, D) and F_2 = (D, E, G, H), represented by the hyperedges {A, B, D} and {D, E, G, H} respectively, where B and D are two steps of one operation in F_1, represented by the conditional hyperedge {B, D}, and D and G are two steps of one operation in F_2, represented by the conditional hyperedge {D, G}; since these two operations are not mutually independent, an intersection exists between them, while A, E, and H are three independent operations, represented by the hyperedges {A}, {E}, and {H} respectively. The connected hypergraph HG obtained from the hyperedge relations is shown in FIG. 3:
Step 4.2, by judging the execution order of the different privacy protection strategies F, the following heuristic rules for hyperedge resolution and for generating PROG(HG) are obtained:

Rule 1. If the hypergraph HG contains only one hyperedge N, it is resolved directly, and PROG(HG) contains only RESULT(HG) := R(N);

Rule 2. If the hypergraph HG consists of k disjoint hypergraphs HG_1, HG_2, …, HG_k, then PROG(HG) is:
PROG(HG_1), PROG(HG_2), …, PROG(HG_k);
RESULT(HG) := RESULT(HG_1) × RESULT(HG_2) × … × RESULT(HG_k)

Rule 3. Given a privacy node SA and its vertical code SA.L, it follows from the properties of the Bayesian network that, among all chain sets Links having the privacy node as chain-tail node, if X_i and X_j are any two nodes in Links other than SA and X_i.L < X_j.L, then at the same privacy protection granularity, correcting the probability distribution of X_i has less influence on the availability of the global data. This yields the lower-proximity principle: the closer the corrected attribute node is to the privacy attribute, the more targeted the correction. In other words, if the hypergraph HG consists of k connected components HG_1, HG_2, …, HG_k, the probabilistic dependence of each hyperedge on the privacy node is judged; if HG_i is closer to the privacy node than HG_j, then HG_i is resolved before HG_j, and vice versa.
In this embodiment, the correctness and effectiveness of the privacy protection model proposed by this patent are verified through experimental simulation. The proposed architecture is implemented in Python; the hardware environment is an Intel(R) Core(TM) i5-1035G1 CPU @ 1.00 GHz (1.19 GHz) with 16 GB of memory, and the operating system is Windows 10.
In the first group of experiments, to highlight the superiority of the re-anonymization handshake protocol, our experiments compare using the protocol against not using it. First, this patent generates a data set by means of a Bayesian network, with every party anonymized once at generation time; second, through experiments we observe that the data volume has a certain influence on the privacy disclosure probability, so the data volume is used as the independent variable and the privacy disclosure probability as the dependent variable. The experimental results are shown in FIG. 4. It can be seen that with an extremely small data volume the privacy disclosure probabilities with and without re-anonymization are basically similar; as the experimental data volume increases, the re-anonymization method clearly reduces the disclosure probability, which drops below 20% when the data volume reaches 100,000; conversely, without re-anonymization the disclosure probability grows with the data volume, reaching 80% at 100,000 records.
In the second group of experiments, to verify that the optimization algorithm proposed by this patent can greatly improve data availability, a comparison experiment is designed between a naive algorithm and the optimization algorithm of this patent. Data availability is denoted by Q in this experiment and is given by a formula (rendered as an image in the original) in terms of the original data a and the noisy data b; as can be observed from the formula, the more noise is added, the worse the data availability. The data volume is again used as the independent variable, taking the values 5,000, 10,000, 20,000, 40,000, 60,000, 80,000, and 100,000, and the data availability results are observed, as shown in FIG. 5. As can be seen from FIG. 5, with an extremely small data volume the naive and optimization algorithms hardly differ in their effect on data availability; when the data volume reaches 40,000, the data availability of the optimization algorithm is about 30% higher than that of the naive algorithm; when it reaches 100,000, the data availability of the naive algorithm is about 40% lower than that of the optimization algorithm. This analysis shows that the data availability of the fusion algorithm optimized by this patent is far higher than that of the naive fusion algorithm.
In the third group of experiments, to verify the usability of the method of this patent in incremental data fusion, the concept of a generative adversarial network is used. First a data set is generated; a discriminator and a generator sample it at different proportions, 30% for the discriminator and 15% for the generator; the sampled data sets are each turned into Bayesian networks by the hill-climbing method, and each network then generates a data set of the same size, 40,000 records. The KL divergence is used to measure the difference between the distributions of the privacy attribute within given equivalence classes of the two data sets, computed as

D_KL(P ‖ Q) = Σ_x P(x) log ( P(x) / Q(x) )

The closer the KL divergence is to 0, the smaller the difference between the discriminator and the generator, and the better the experimental effect.
The privacy attribute probabilities of the discriminator and the generator in three different equivalence classes are selected for calculation; the probability distributions are shown in FIG. 6. The KL divergences are computed as KL1 = 0.0042, KL2 = 0.0043, and KL3 = 0.0053; all three are close to 0, so the difference between the discriminator and the generator is small and the experimental effect is good.
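For reference, the KL computation used in this experiment can be sketched as follows; the distributions are invented for illustration and the function is an ordinary discrete KL divergence, not code from the patent.

import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); closer to 0 is better."""
    return sum(pv * math.log(pv / max(q.get(x, 0.0), eps))
               for x, pv in p.items() if pv > 0)

discriminator = {"flu": 0.52, "cancer": 0.30, "gastritis": 0.18}
generator     = {"flu": 0.50, "cancer": 0.31, "gastritis": 0.19}
print(round(kl_divergence(discriminator, generator), 4))   # ≈ 0.0008, near zero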
The analysis of the three simulation experiments above shows that the method proposed by this patent not only greatly improves the privacy protection effect of multi-source data fusion but also greatly improves the usability of the data.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (4)

1. A multi-source data fusion privacy protection method based on multi-privacy policy combination optimization is characterized by comprising the following steps:
step 1, constructing a data multi-source fusion system model: firstly, the data owner in the system model collects data from all parties, each of which performs its own data anonymization operation; secondly, the data are stored in a public cloud, which performs a further anonymization operation; in addition, users can enjoy the convenience brought by big data by customizing the services they require;
step 2, designing a multiple data fusion anonymity framework comprising four processes: initial state, handshake, data synchronization and secondary anonymization; firstly, in the initial state each party performs the anonymization operation corresponding to its own privacy protection requirements on its own data; secondly, the parties communicate during the handshake process and each publishes its data privacy protection requirements; in data synchronization, the distributions of the public attribute values of the multi-party data are made consistent while the privacy protection requirements of all parties are satisfied; in secondary anonymization, the data set is converted into a Bayesian network and a hierarchical structure diagram is constructed by encoding the network, the specific process comprising network structure learning and network encoding;
step 3, realizing the privacy protection strategy: given the privacy protection policy, a Bayesian network G is formed through steps 1-2, and the information attributes referenced by the policy (written only as formula images in the original) are then operated on so that the privacy node X_s satisfies the policy requirements; specifically, for a privacy protection policy, a unit privacy protection operation is defined, namely the privacy budget is divided equally into d parts and, in each round, privacy protection is applied to the probability distribution of the selected attribute node; for the attributes to be operated on, four privacy protection operations are realized: k-anonymity, l-diversity, t-closeness and attribute value generalization;
step 4, by mapping the multi-party, multi-privacy-constrained data to be fused onto a hypergraph and designing corresponding heuristic rules, the data fusion process evolves into a hypergraph resolution process; the specific steps are as follows:
step 4.1, formally defining the privacy protection policy as a five-tuple F = (G, IA, SA, OP, V), wherein G represents the Bayesian network converted from the data set; IA = (a_1, a_2, …, a_m) denotes the information attribute nodes, where a_1, a_2, …, a_m are not mutually independent but stand in probabilistic dependence relationships; SA denotes the privacy node; OP = (OP_1, OP_2, …, OP_m) denotes the operation steps; and V denotes the value range after applying the operations OP (its exact expression survives only as a formula image in the original);
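As a minimal sketch (not the patent's implementation), the five-tuple F = (G, IA, SA, OP, V) can be carried by a small container; all field values below are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

@dataclass
class PrivacyPolicy:
    """Hypothetical container mirroring F = (G, IA, SA, OP, V)."""
    G: Any                # Bayesian network converted from the data set
    IA: List[str]         # information attribute nodes a_1..a_m (dependent)
    SA: str               # privacy (sensitive) node
    OP: List[str]         # operation steps OP_1..OP_m
    V: Dict[str, Tuple]   # value range of each attribute after its OP

policy = PrivacyPolicy(
    G=None,
    IA=["age", "zipcode", "gender"],
    SA="disease",
    OP=["generalize", "k-anonymity", "l-diversity"],
    V={"age": ("[20-30]", "[30-40]"), "zipcode": ("471*",), "gender": ("*",)},
)
```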
The execution order of different privacy protection policies is judged at the data level and at the structural level; from the data level:
1) if a_m can be represented by a_1, a_2, …, a_n, then OP_m < OP_n, i.e. OP_m is executed later, and vice versa;
from the structural level:
2) starting from the privacy node of the Bayesian network, the network is encoded through a bottom-up encoding stage and a top-down modification stage; by modifying the privacy node SA within the maximum modification threshold, the candidate operations on the privacy attribute (denoted only by formula images in the original) are compared for achieving the required privacy protection: if OP_i has less influence on the data structure, then OP_i achieves the required protection at a lower cost than OP_j, i.e. OP_j < OP_i, and OP_i is executed before OP_j, and vice versa;
3) if multiple operations OP_i, OP_j, OP_k on information attributes are involved (the attribute expressions survive only as formula images in the original), the following two cases are distinguished: firstly, if the ordering is cyclic, OP_i < OP_j < OP_k < OP_i, the value range of each operation is calculated through the probabilistic reasoning relationships among the information attributes; if the value-range condition on IA_j holds (the precise condition survives only as a formula image), then OP_i is executed before OP_j and OP_k, while if the opposite condition holds, OP_k is executed before OP_j and OP_i; secondly, when OP_i < OP_j < OP_k and OP_i < OP_k, OP_k is executed before OP_j and OP_i, and for the order of OP_j and OP_i, if the value ranges involved in the two operations are such that the result of the operation on OP_i's attribute will affect OP_j, then OP_j < OP_i and OP_i is executed before OP_j, and vice versa;
step 4.2, through judging the execution order of the different privacy protection policies F, the following heuristic rules for hyperedge resolution, PROG(HG), are generated:
rule 1. If the hypergraph HG contains only one hyperedge N, it is resolved directly, and PROG(HG) contains only RESULT(HG) := R(N);
rule 2. If the hypergraph HG consists of k disjoint hypergraphs HG_1, HG_2, …, HG_k, then PROG(HG) is:
PROG(HG_1), PROG(HG_2), …, PROG(HG_k);
RESULT(HG) := RESULT(HG_1) × RESULT(HG_2) × … × RESULT(HG_k);
rule 3. Given a privacy node SA and its vertical code SA.L, it is known from the properties of the Bayesian network that, in the set Links of all chains whose tail node is the privacy node, letting X_i and X_j be any two nodes in Links other than SA, if X_i.L < X_j.L then, at the same privacy-preserving granularity, modifying X_i has less influence on global data availability; this forms a closeness principle, i.e. the closer the modified attribute node is to the privacy attribute, the more targeted the modification; in other words, if the hypergraph HG is composed of k connected components HG_1, HG_2, …, HG_k, the probabilistic dependence of each hyperedge on the privacy node is judged; if HG_i is closer to the privacy node than HG_j, then HG_j < HG_i, i.e. HG_i is resolved first, and vice versa.
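A minimal sketch of the closeness principle of rule 3 follows; the node levels and hyperedges are assumed toy values, and the `closeness` function stands in for the patent's probabilistic-dependence judgment.

```python
# Hyperedges whose nodes sit closer to the privacy node (smaller vertical
# code X.L) are resolved first. Levels and edges are illustrative.
levels = {"SA": 0, "X1": 1, "X2": 2, "X3": 3}   # vertical codes X.L

hyperedges = {
    "HG_i": ["X1", "SA"],      # closer to the privacy node
    "HG_j": ["X2", "X3"],      # farther from the privacy node
}

def closeness(edge_nodes):
    """Distance of a hyperedge to the privacy node = min level of its nodes."""
    return min(levels[n] for n in edge_nodes if n != "SA")

# Resolve in ascending closeness: HG_i before HG_j.
order = sorted(hyperedges, key=lambda e: closeness(hyperedges[e]))
print(order)  # ['HG_i', 'HG_j']
```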
2. The multi-source data fusion privacy protection method based on multi-privacy policy combination optimization according to claim 1, wherein in step 1 a data fusion model and a privacy attack model are constructed using an adversarial learning architecture; the specific steps are as follows:
step 1.1, constructing a data fusion model:
the data fusion is to organically integrate data belonging to multiple sources, and one data set is represented as a quadruple D (X, A, F, V), wherein X = { X = { (X) } 1 ,x 2 ,…,x n Is a set of data records, each item of data x i Are all exclusively associated with one dedicated user u i (ii) a A is an attribute set; according to the sensitivity of attributes, it is divided into an information attribute set IA and a sensitive attribute set SA, and IA ≦ SA = a,
Figure FDA0003848609140000031
f is a set of relationships between X and A
Figure FDA0003848609140000032
Figure FDA0003848609140000033
Is attribute a k A range of values of;
definition 1 equivalence classes: given a data set D(X, A, F, V), for any attribute subset A′ ⊆ A and any record x_i ∈ X, if there exist t records {x_1, x_2, …, x_t} (t ≥ 1) that take equal values on every attribute in A′, then {x_1, x_2, …, x_t} is an equivalence class on D with respect to A′, denoted [x_i]_{A′}; correspondingly, the set E_{A′} of all equivalence classes formed by the attribute set A′ forms a partition of D, denoted D/E_{A′}; if A′ ⊆ IA, the corresponding equivalence class is called an information equivalence class;
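For illustration, the partition D/E_{A′} of definition 1 can be computed by grouping records on the chosen attribute subset; the records and attribute names below are assumptions.

```python
from collections import defaultdict

def equivalence_classes(records, attrs):
    """Partition D/E_A': group records that agree on every attribute in attrs."""
    classes = defaultdict(list)
    for rec in records:
        key = tuple(rec[a] for a in attrs)
        classes[key].append(rec)
    return dict(classes)

# Grouping on the information attributes yields the information
# equivalence classes of definition 1 (illustrative records).
D = [
    {"age": "2*", "zip": "471*", "disease": "flu"},
    {"age": "2*", "zip": "471*", "disease": "cancer"},
    {"age": "3*", "zip": "472*", "disease": "flu"},
]
for key, cls in equivalence_classes(D, ["age", "zip"]).items():
    print(key, len(cls))
```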
definition 2 data fusion: given m data sets {D_1, …, D_m}, the fused data set D(X, {IA, SA}, F, V) satisfies a consistency condition (given only as a formula image in the original); if two data sets to be fused, D_i and D_j, satisfy X_i Δ X_j ≠ ∅, where Δ represents the symmetric difference operator, the fusion is called information increment fusion; if there exists a record x_k ∈ X_i ∩ X_j satisfying the refinement conditions (these survive only as formula images), it is called information refinement fusion; if every record x_k ∈ X_i ∩ X_j satisfies the corresponding consistency conditions, wherein SA_i = SA_j, it is called coordination fusion; the research scope of this method is coordinated information increment and refinement fusion;
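A small sketch of the set algebra behind definition 2: the symmetric difference of the two parties' record-id sets signals increment fusion, while the overlap is where the refinement and coordination conditions apply. The id sets are illustrative assumptions, not the patent's data.

```python
# Record-id sets held by two parties (illustrative).
X_i = {1, 2, 3, 4}
X_j = {3, 4, 5, 6}

sym_diff = X_i ^ X_j   # symmetric difference: records only one party has
overlap  = X_i & X_j   # shared records, candidates for refinement/coordination

if sym_diff:
    print("information increment fusion contributes records:", sym_diff)
if overlap:
    print("refinement/coordination checks apply to records:", overlap)
```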
step 1.2, constructing a privacy and privacy attack model:
Privacy here refers to the injective mapping between a user and the corresponding sensitive attribute value; if this mapping is leaked, the user's privacy is leaked. From the data model it is known that users and data records are in one-to-one correspondence, and at the data level each user corresponds to a group of information attribute values, i.e. there is an injection from users to information attribute values; the equivalence class containing this group of information attribute values in turn corresponds to a set of privacy attribute values; by the transitivity of injections, if the group of information attribute values and the corresponding privacy attribute value set form an injection, publishing the data set leaks the user's data privacy;
definition 3 data privacy leakage: given a data set D(X, {IA, SA}, F, V), for a record x_i ∈ X, let [x_i]_IA be the information equivalence class to which it belongs and denote its corresponding set of privacy attribute values SA([x_i]_IA) (the original notation survives only as formula images); if the group of information attribute values determines a unique privacy attribute value, i.e. the mapping from [x_i]_IA to SA([x_i]_IA) is an injection, data privacy is said to be leaked;
definition 4 knowledge inference attack: suppose the adversary knows the target user u_i's value v_j on an information attribute a_j ∈ IA and knows that the user's data record x_i is in a data set D(X, {IA, SA}, F, V) to be published; when the data are published, the adversary builds the correspondence between the known information attribute value and the matching records (the original expression survives only as a formula image) and from it forms a privacy inference probability, i.e. the probability that user u_i takes a given value on SA is the ratio of the count of records matching both the information attribute value and that privacy value to the count of records matching the information attribute value, wherein C is the counting statistical function and * is a qualifier over the domain of discourse;
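The counting-ratio inference of definition 4 can be illustrated as follows; the records and attribute names are assumptions, and the function simply estimates P(SA = s | IA = v) by counting, as the definition describes.

```python
def inference_probability(records, ia_attr, ia_value, sa_value):
    """Adversary's inference P(SA = sa_value | ia_attr = ia_value),
    estimated as a ratio of counts over the published records."""
    matching = [r for r in records if r[ia_attr] == ia_value]
    if not matching:
        return 0.0
    hits = sum(1 for r in matching if r["SA"] == sa_value)
    return hits / len(matching)

D = [
    {"zip": "471*", "SA": "flu"},
    {"zip": "471*", "SA": "flu"},
    {"zip": "471*", "SA": "cancer"},
    {"zip": "472*", "SA": "flu"},
]
# Knowing the target lives in "471*", the adversary infers "flu" with p = 2/3.
print(inference_probability(D, "zip", "471*", "flu"))
```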
definition 5 multi-version attack under incremental data publishing: given a first published data set D(X, {IA, SA}, F, V) and the corresponding updated data set D′(X′, {IA′, SA′}, F′, V′) released by the same publisher, suppose the adversary compares a target user u_i's records across X and X′; the adversary constructs the correspondence between the two versions (the original expression survives only as a formula image) and forms a privacy inference probability over the two data sets, i.e. the probability with which the adversary infers the user's privacy, expressed through the selection function SEL.
3. The multi-source data fusion privacy protection method based on multi-privacy policy combination optimization according to claim 1, wherein the vertical encoding in the step 2 includes two stages: a Bayesian network structure learning stage and a network coding stage;
the specific steps of the Bayesian network structure learning stage are as follows:
step 2.1, consider learning a Bayesian network structure from a data set D = {D_1, …, D_n} containing a set of m random variables X_1, …, X_m; assuming that the variables are categorical and the data set is complete, the goal of the Bayesian network construction algorithm is to define a parent set Π_1, …, Π_m for each variable and to find, over the node set, the highest-scoring Directed Acyclic Graph (DAG) G; under the Markov condition, G induces a joint probability distribution in which each variable is conditionally independent of its non-descendant variables given its parents;
step 2.2, to evaluate the quality of the generated DAG, different scoring functions can be used; the Bayesian Information Criterion (BIC) score is adopted here, which is proportional to the posterior probability of the DAG; the BIC is decomposable, consisting of the sum of the scores of each variable with its parent node set:

BIC(G) = Σ_{i=1..m} [ LL(X_i | Π_i) − Pen(X_i | Π_i) ]

wherein LL(X_i | Π_i) represents the log-likelihood of X_i with its parent node set Π_i:

LL(X_i | Π_i) = Σ_{x,π} N_{x,π} · log θ̂_{x|π}

and Pen(X_i | Π_i) represents the complexity penalty of X_i with its parent node set Π_i:

Pen(X_i | Π_i) = (log N / 2) · |Π_i| · (|X_i| − 1)

wherein θ̂_{x|π} = N_{x,π} / N_π is the maximum likelihood estimate of the conditional probability P(X_i = x | Π_i = π), N_{x,π} denotes the number of occurrences of (X_i = x, Π_i = π) in the data set, and |·| denotes the size of the Cartesian product space of the value domains of the given variables;
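For illustration, the decomposable BIC term for one variable can be computed directly from counts; the toy data set below is an assumption, and the code follows the reconstructed LL and Pen expressions above.

```python
import math
from collections import Counter

def bic_term(data, i, parents):
    """BIC score of variable i given its parents:
    LL - (log N / 2) * |Pi| * (|Xi| - 1), with data as a list of tuples."""
    N = len(data)
    n_xpi = Counter((row[i], tuple(row[p] for p in parents)) for row in data)
    n_pi  = Counter(tuple(row[p] for p in parents) for row in data)

    # Log-likelihood with maximum-likelihood estimates theta = N_xpi / N_pi.
    ll = sum(n * math.log(n / n_pi[pi]) for (x, pi), n in n_xpi.items())

    x_card  = len({row[i] for row in data})
    pi_card = max(1, math.prod(len({row[p] for row in data}) for p in parents))
    pen = (math.log(N) / 2) * pi_card * (x_card - 1)
    return ll - pen

data = [(0, 0), (0, 0), (1, 0), (1, 1), (1, 1), (0, 1)]
print(bic_term(data, i=1, parents=[0]))   # score of X_1 with parent X_0
```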
the Bayesian network coding builds a hierarchical structure diagram by longitudinally coding the Bayesian network, wherein the hierarchical structure diagram comprises two stages: the encoding stage from bottom to top and the correction stage from top to bottom specifically comprise the following steps:
step 2.3, in the bottom-up encoding stage: first, the hierarchy codes of all nodes are initialized to zero; the algorithm starts marking from the leaf nodes and traces the corresponding parent nodes step by step; in each round, when the hierarchy code of a child node is q, the hierarchy code of its parent node is marked q + 1; for a non-leaf node only the current maximum code is kept, i.e. if the node's code is not 0 the new code is compared with the existing one and the larger is retained, and if the two codes are equal the upward backtracking from that node stops; whether the leaf node queue is empty is then checked, and backtracking stops when it is; next, the next leaf node is extracted and marked, until the leaf node sequence is empty;
step 2.4, in the top-down correction stage: first, all nodes are sorted by hierarchy code in descending order and all nodes are initialized as unmarked; the algorithm then extracts the unmarked node with the largest hierarchy code in the sequence and, taking it as the starting point, performs a breadth-first traversal level by level; in each round, when the hierarchy code of a parent node is q, the hierarchy code of its child node is marked q − 1; comparing the node's current code q_old with the newly derived code q_new, two cases are considered: (a) when q_old < q_new, the algorithm sets the node's hierarchy code to q_new and marks the node; (b) when q_old = q_new and the node is already marked, the downward traversal from that node terminates early; next, the next unmarked node is extracted, until no unmarked node remains in the sequence.
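A condensed sketch of the two-stage vertical encoding on a toy DAG follows; the graph itself, and the simplified marking bookkeeping, are assumptions made for brevity.

```python
from collections import deque

# Toy DAG (edges point parent -> child); illustrative structure.
edges   = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
parents = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
nodes   = ["A", "B", "C", "D"]

# Stage 1, bottom-up: leaves start at code 0; a parent keeps the maximum
# child code + 1, and backtracking stops when the code does not grow.
level = {n: 0 for n in nodes}
queue = [n for n in nodes if not edges.get(n)]        # leaf nodes
while queue:
    child = queue.pop()
    for p in parents.get(child, []):
        if level[child] + 1 > level[p]:
            level[p] = level[child] + 1
            queue.append(p)

# Stage 2, top-down correction: breadth-first from the highest-coded node,
# re-marking a child as q - 1 when its parent's code q implies a larger value.
dq = deque([max(nodes, key=lambda n: level[n])])
while dq:
    p = dq.popleft()
    for c in edges.get(p, []):
        if level[p] - 1 > level[c]:
            level[c] = level[p] - 1
            dq.append(c)

print(level)   # {'A': 2, 'B': 1, 'C': 1, 'D': 0}
```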
4. The multi-source data fusion privacy protection method based on multi-privacy policy combination optimization according to claim 1, wherein the four privacy protection operations of k-anonymity, l-diversity, t-closeness and attribute value generalization are implemented in step 3; the specific steps are as follows:
step 3.1, realizing k-anonymity: the value range of the attribute to be protected (written as a formula image in the original) is set by a domain expert or the data owner; the attribute's value domain space is then expanded in the Bayesian network so that the number of distinct values in the domain is greater than or equal to k; during correction, following the information-entropy-maximization correction principle, the parent node of the privacy node assigns the attribute value with the largest probability distribution value among its child nodes to the privacy node, so that the privacy node satisfies the k requirement;
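Illustratively, the value-domain expansion of step 3.1 can be sketched as padding the attribute's domain until it holds at least k distinct values; the placeholder value names and probabilities are assumptions.

```python
def expand_domain(domain_probs: dict, k: int) -> dict:
    """Pad the value domain with fresh placeholder values (probability 0)
    until it contains at least k distinct values."""
    out = dict(domain_probs)
    i = 0
    while len(out) < k:
        out[f"synthetic_{i}"] = 0.0
        i += 1
    return out

probs = {"flu": 0.6, "cancer": 0.4}
print(expand_domain(probs, k=4))
# During correction, the parent node then assigns its highest-probability
# child value to the privacy node (entropy-maximization principle).
```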
step 3.2, realizing l-diversity: according to the data party's setting of the attribute's value range, the attribute's value domain space is expanded in the Bayesian network so that the number of distinct values in the domain is greater than or equal to l; when correcting the attribute, following the information-entropy-maximization correction principle, only the single value with the largest probability distribution is selected as the correction target in each round, and the part of its probability mass above the mean is distributed evenly over the newly added attribute values;
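The per-round l-diversity correction can be sketched as follows; the disease names are illustrative, and the redistribution rule follows the entropy-maximization principle described above under the stated assumptions.

```python
def l_diversity_round(probs: dict, new_values: list) -> dict:
    """One correction round: shave the single max-probability value down to
    the domain mean and spread the excess evenly over the new values."""
    out = dict(probs)
    for v in new_values:
        out.setdefault(v, 0.0)
    mean = 1.0 / len(out)
    top = max(out, key=out.get)              # single max-probability target
    excess = max(0.0, out[top] - mean)
    out[top] -= excess
    for v in new_values:                     # distribute the excess evenly
        out[v] += excess / len(new_values)
    return out

print(l_diversity_round({"flu": 0.7, "cancer": 0.3}, ["hepatitis", "other"]))
# {'flu': 0.25, 'cancer': 0.3, 'hepatitis': 0.225, 'other': 0.225}
```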
step 3.3, realizing t-closeness: the value distribution that maximizes the information entropy over the attribute's value domain space is defined as the theoretical standard, measured by variance; the probability distribution of each value of the attribute is corrected so that the variance between each value's occurrence probability and the theoretical standard is not higher than t;
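A minimal sketch of the t-closeness test of step 3.3, taking the uniform (entropy-maximizing) distribution as the theoretical standard and variance as the distance; the probabilities and threshold are illustrative.

```python
def t_close_ok(probs: list, t: float) -> bool:
    """Check that the variance of the observed probabilities around the
    uniform (entropy-maximising) standard does not exceed t."""
    uniform = 1.0 / len(probs)
    variance = sum((p - uniform) ** 2 for p in probs) / len(probs)
    return variance <= t

print(t_close_ok([0.30, 0.35, 0.35], t=0.01))   # True: near-uniform
print(t_close_ok([0.80, 0.10, 0.10], t=0.01))   # False: needs correction
```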
step 3.4, realizing attribute value generalization: according to an attribute value hierarchy tree set by a domain expert or the data owner, the probability distributions of similar values in the attribute's value domain are fused; the attribute leaf node to be anonymously protected and all of its sibling leaf nodes in the value domain are aggregated into one attribute node and replaced by their direct parent node, whose attribute value probability distribution inherits from all the original leaf nodes participating in the aggregation;
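The leaf-aggregation of step 3.4 can be sketched with a toy value hierarchy; the tree and probabilities are assumptions, and the parent node inherits the summed probability of the aggregated leaves as described.

```python
# Toy value-hierarchy tree: parent value -> its leaf values (illustrative).
hierarchy = {"respiratory": ["flu", "pneumonia"], "oncology": ["cancer"]}
probs = {"flu": 0.5, "pneumonia": 0.2, "cancer": 0.3}

def generalize(probs, hierarchy, protected_leaf):
    """Replace the protected leaf and its siblings by their direct parent,
    whose probability is the sum over the aggregated leaves."""
    parent = next(p for p, leaves in hierarchy.items()
                  if protected_leaf in leaves)
    merged = {v: q for v, q in probs.items() if v not in hierarchy[parent]}
    merged[parent] = sum(probs[leaf] for leaf in hierarchy[parent])
    return merged

print(generalize(probs, hierarchy, "flu"))
# {'cancer': 0.3, 'respiratory': 0.7}
```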
the data sets with different privacy protection strategies are converted into the Bayesian network, and the Bayesian network is encoded, so that a secondary anonymization process is realized, and a re-anonymization framework with multi-party data fusion is realized.
CN202110014817.4A 2021-01-06 2021-01-06 Multi-source data fusion privacy protection method based on multi-privacy policy combination optimization Active CN112765653B (en)
