CN112765653B - Multi-source data fusion privacy protection method based on multi-privacy policy combination optimization - Google Patents


Info

Publication number
CN112765653B
Authority
CN
China
Prior art keywords
data
privacy
node
attribute
value
Prior art date
Legal status
Active
Application number
CN202110014817.4A
Other languages
Chinese (zh)
Other versions
CN112765653A (en)
Inventor
周志刚
白增亮
王宇
梁子恺
吴天生
Current Assignee
Shancai Hi Tech Shanxi Co ltd
Original Assignee
Shancai Hi Tech Shanxi Co ltd
Priority date
Filing date
Publication date
Application filed by Shancai Hi Tech Shanxi Co ltd
Priority to CN202110014817.4A
Publication of CN112765653A
Application granted
Publication of CN112765653B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/251Fusion techniques of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of data publishing, and specifically relates to a multi-source data fusion privacy protection method based on combinatorial optimization of multiple privacy policies. A multi-party data fusion architecture based on re-anonymization is provided to prevent privacy leakage from the fused data. Further, the practical significance of data fusion is to provide a more comprehensive data basis on which users can conduct extensive knowledge mining. Therefore, a combinatorial optimization scheme for multiple privacy protection strategies is designed, which maximizes the usability of the fused data while satisfying the privacy constraints of all parties. The strategy maps the multi-source, multi-privacy-constraint data fusion onto a hypergraph; using heuristic rules, hyperedges are selected, solved, and eliminated from the hypergraph one by one. The process of resolving the hypergraph is also the process of realizing the privacy constraints one by one, and the data fusion scheme is formulated accordingly.

Description

Multi-source data fusion privacy protection method based on multi-privacy policy combination optimization
Technical Field
The invention belongs to the field of data publishing, and specifically relates to a multi-source data fusion privacy protection method based on combinatorial optimization of multiple privacy policies.
Background
Multi-source cross-platform data and cross-domain data applications are the most prominent characteristics of big data. In the big data era, owing to the explosive growth of data in different application fields, a single type of data (such as location data, social data, cookie logs, or shopping-website transaction streams) can hardly satisfy the demand for complex upper-layer application services. For example, if Bob needs an app to find nearby friends who like to play basketball, meeting this need requires an organic fusion of location data with social data. Beyond the cross-domain fusion needs of individuals, the real demands and applications of cross-domain data fusion between departments within an enterprise, between heterogeneous enterprises, and even between enterprises and government departments (such as precise advertisement push, ride-hailing dispatch optimization, and smart-city subway line planning) require the data owners of different platforms to cooperate deeply at the data level. However, the data of each platform often carry great use value and may include sensitive/private information such as users' identity, behavior, financial, and even disease information; directly publishing the original data would inevitably disclose user privacy.
To prevent user privacy from being disclosed, the data sets of the respective platforms must undergo desensitization (e.g., perturbation, noise addition, generalization) before fused big data are published. Most conventional anonymity-based privacy protection methods only protect the data of a single source and cannot effectively counter the disclosure of non-explicit private information brought about by deep association analysis over big data. Moreover, a single privacy protection method can no longer meet the personalized privacy requirements of data users, just as local privacy protection of each source does not avoid the risk of global privacy disclosure after fusion. For example, Alice purchases a ticket to Munich on ticketing site A and browses tourist attractions of Munich on the web page of travel company B. Companies A and B publish their information separately: company A employs a 3-anonymity-based generalization technique, i.e., "ticket to Munich" is generalized to "ticket to Europe", and company B employs a 3-diversity technique, i.e., the browsing behaviors of two other users who browsed the company website at the same time as Alice are published together as a set {2017-07-11: {Germany: Neuschwanstein Castle, Japan: Mount Fuji, USA: Massachusetts Institute of Technology}}. Suppose an adversary knows that Alice has an outbound travel plan, obtains her Internet access records, and learns that she registered with both companies A and B; by associating the two separately published data sets, the adversary can accurately infer that Alice will travel to Munich. This is also the most essential problem facing privacy protection in big data publishing: privacy disclosure caused by an attacker's association analysis over distributed big data after multi-source fusion. One naive approach is to combine the privacy protection methods over the naturally joined fused data at method-level granularity. However, such method-level combination may result in "over-protection" of private information and thereby severely reduce data availability, as shown in FIG. 1: in a two-party data fusion, scheme I (5-anonymity first, then 3-diversity) needs to add 29 pieces of noise, whereas scheme II (3-diversity first, then 5-anonymity) needs to add 20. Hence, fine-grained combinatorial optimization of multiple privacy protection methods that maximizes data availability remains an open problem in privacy-preserving big data fusion and publication.
In the field of privacy protection for data publishing, traditional privacy protection algorithms include differential privacy, k-anonymity, l-diversity, t-closeness, and the like, and some scholars' improvements on the traditional algorithms have milestone significance. For example, with the help of a semantic hierarchy tree, Wang et al. semantically generalize the records that are too few to meet the anonymity requirement, so that these records achieve k-anonymity under broader semantics; however, record generalization causes irreversible information loss, and applying the k-anonymity criterion to high-dimensional sparse data greatly reduces data availability. Brijesh B. et al. propose a method that improves l-diversity anonymity with a significant improvement in running time and, owing to the close arrangement of records in the initial equivalence classes, less information loss than existing methods while providing the same level of privacy. In general, these conventional privacy protection models are only applicable to static data publication in specific scenarios. The risk faced by big data publishing, however, lies in the dynamics of the publishing process and its multi-source, cross-platform character, so attackers must be prevented from performing association analysis on the data after multi-source fusion and thereby breaking the anonymity of the data.
Regarding privacy protection in data fusion, H. Patel et al. propose a secure bottom-up method for fusing two parties' data, but the model presumes a trusted third party that fuses all data into a complete original data table before anonymizing it; since a trusted third party does not exist in most cases, the method has low practical value. Jiang et al. propose the DkA secure fusion model for two parties' data under the semi-honest model; the algorithm uses a commutative encryption strategy to hide the original information during communication and judges whether the anonymity threshold k is met by constructing a complete anonymity table, realizing privacy protection during data fusion, but its resource consumption is too large for fusing large data sets. Clifton et al. develop secure multi-party data integration tools for four typical operations on relational data: count, union, intersection, and Cartesian product. Yeom et al. study the indirect privacy disclosure caused by insufficient model generalization ability, and Mohammed et al. subsequently realize privacy protection for each party of a data integration using a data generalization technique based on a classification tree structure, but the information loss of the integrated data is high, its degree depending on the data set. All of these schemes assume that the multiple parties participating in data fusion adopt the same privacy protection strategy; however, facing the diverse privacy protection requirements of big data, different platforms may adopt personalized privacy protection strategies according to their own application requirements before fusion, and the existing schemes are then difficult to apply.
Disclosure of Invention
The invention provides a multi-source data fusion privacy protection method based on combinatorial optimization of multiple privacy policies. Specifically, this patent first proposes a multi-party data fusion architecture based on re-anonymization: the inner-layer data anonymization takes place before data fusion and is implemented by the owners of the respective local data to give the data initial protection; the outer-layer data anonymization takes place during data fusion and is implemented by the parties participating in the fusion according to an agreed multi-party privacy protection protocol (for simplicity of description, this anonymization is regarded as simultaneously satisfying the privacy constraints of all parties), preventing privacy leakage from the fused data. Further, the practical significance of data fusion is to provide a more comprehensive data basis on which users can conduct extensive knowledge mining. Therefore, this patent designs a combinatorial optimization scheme for multiple privacy protection strategies, which maximizes the usability of the fused data while satisfying the privacy constraints of all parties. The strategy maps the multi-source, multi-privacy-constraint data fusion onto a hypergraph; using heuristic rules, hyperedges are selected, solved, and eliminated one by one, the process of resolving the hypergraph being also the process of realizing the privacy constraints one by one, and the data fusion scheme is formulated accordingly.
To achieve the above technical purpose and effect, the invention is realized by the following technical scheme:
Step 1, constructing a data multi-source fusion system model:
As shown in FIG. 2, in the system model the data owners first collect the data of each party, and to prevent privacy disclosure each party performs its own data anonymization. Second, because the data volume of some entities is huge, the data must be stored in a public cloud; data fusion in the public cloud organically integrates the multi-source, cross-platform data, aiming to mine useful information better from the fused, complete data sets of all parties. If the data were fused naively, the concern that the public cloud snoops on privacy after fusion could not be dispelled, so the public cloud must also perform a re-anonymization operation. Furthermore, users enjoy the convenience of big data by customizing required services; however, unknown attackers may hide among the users, so the user is assumed here to be "curious" as well, i.e., users and the cloud service provider are treated as a suspected privacy-mining group with the same attack capability.
Step 2, designing a multiple data fusion anonymity framework:
Aiming at the frequent cross-platform exchange and sharing of big data information, this patent provides a re-anonymization framework based on multi-party data fusion, comprising four processes: initial state, handshake, data synchronization, and secondary anonymization. First, in the initial state, each party performs the corresponding anonymization on its own data according to its own privacy protection requirements; second, multi-party communication takes place in the handshake process, where each party publishes its data privacy protection requirements; in data synchronization, the distributions of the public attribute values of the multi-party data are made consistent while the privacy protection requirements of all parties are satisfied; in the secondary anonymization, the data sets are converted into Bayesian networks and hierarchical structure diagrams are constructed by encoding the Bayesian networks, so that the anonymization is finally cast as a probabilistic inference problem.
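For illustration, the four processes can be sketched as the minimal Python skeleton below. This is a sketch under assumptions, not the patent's implementation: the names Party, initial_anonymize, handshake, synchronize, and secondary_anonymize are all hypothetical, and the anonymization bodies are placeholders.

from dataclasses import dataclass

@dataclass
class Party:
    name: str
    records: list   # locally held data records
    policy: dict    # published privacy requirement, e.g. {"k": 5} or {"l": 3}

def initial_anonymize(party):
    # Phase 1 (initial state): each owner applies its own policy locally.
    print(f"{party.name}: local anonymization under {party.policy}")
    return party

def handshake(parties):
    # Phase 2 (handshake): every party publishes its privacy requirement.
    return [p.policy for p in parties]

def synchronize(parties):
    # Phase 3 (data synchronization): align the distributions of the
    # public attribute values across parties (placeholder).
    print("aligning public-attribute distributions across parties")
    return parties

def secondary_anonymize(parties, constraints):
    # Phase 4 (secondary anonymization): fuse, then re-anonymize so that
    # all published constraints hold simultaneously.
    fused = [r for p in parties for r in p.records]
    print(f"fused {len(fused)} records under joint constraints {constraints}")
    return fused

parties = [Party("A", [{"zip": "0300*"}], {"k": 5}),
           Party("B", [{"zip": "0301*"}], {"l": 3})]
constraints = handshake([initial_anonymize(p) for p in parties])
fused = secondary_anonymize(synchronize(parties), constraints)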
Step 3, realizing the privacy protection strategy:
Given the privacy protection policy, the Bayesian network G is finally formed through the above two processes; the relevant attribute nodes must then be operated on so that the privacy node X_s meets the policy requirements. Specifically, for a privacy protection policy F, this patent defines the unit privacy protection operation: the privacy budget is partitioned into d equal parts, and in each round only the probability distribution of one selected attribute node is privacy-protected. For the attribute node awaiting the privacy operation, four privacy protection operations are realized: k-anonymity, l-diversity, t-closeness, and attribute value generalization.
Step 4, by mapping the multi-party, multi-privacy-constraint data fusion onto a hypergraph and designing corresponding heuristic rules, the data fusion process is evolved into a hypergraph resolution process, improving data availability while satisfying the privacy constraints.
Further, in step 1, a data fusion model and a privacy attack model are constructed under an adversarial-learning architecture; the specific steps are as follows:
Step 1.1, constructing a data fusion model:
Data fusion organically integrates data belonging to multiple sources, aiming to mine useful information better through a more complete data set than before, so as to provide high-quality services for users. For ease of discussion, a formal description of the data is first given: a data set is represented as a quadruple D(X, A, F, V), where X = {x_1, x_2, …, x_n} is the set of data records and each record x_i is exclusively associated with one dedicated user u_i; A is the attribute set, divided by sensitivity into an information attribute set IA and a sensitive attribute set SA, with IA ∪ SA = A and IA ∩ SA = ∅; F is the set of relations between X and A; V = ∪_{a_k ∈ A} V_{a_k}, where V_{a_k} is the value range of attribute a_k.
Definition 1 (equivalence class): given a data set D(X, A, F, V) and any attribute subset A′ ⊆ A, if there are t records {x_1, x_2, …, x_t} (t ≥ 1) that agree on every attribute in A′, then {x_1, x_2, …, x_t} is an equivalence class of D with respect to A′, denoted [x_i]_{A′}; conversely, the set E_{A′} of all equivalence classes formed by attribute set A′ constitutes a partition of D, denoted D/E_{A′}. In particular, if A′ ⊆ IA, the corresponding equivalence class is called an information equivalence class.
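As an illustration of the quadruple representation and Definition 1, the following Python sketch groups records into equivalence classes over an attribute subset; the data values and attribute names are invented for the example.

from collections import defaultdict

X = [  # data records, one per user
    {"age": "30-39", "zip": "0300*", "disease": "flu"},
    {"age": "30-39", "zip": "0300*", "disease": "cancer"},
    {"age": "40-49", "zip": "0301*", "disease": "flu"},
]
IA = ["age", "zip"]   # information attributes
SA = ["disease"]      # sensitive attributes
A = IA + SA           # IA ∪ SA = A, IA ∩ SA = ∅

def equivalence_classes(records, attrs):
    """Group records whose projections onto `attrs` coincide (Definition 1)."""
    classes = defaultdict(list)
    for x in records:
        key = tuple(x[a] for a in attrs)   # f(x, A'): projection onto A'
        classes[key].append(x)
    return classes                          # the partition D / E_{A'}

for key, cls in equivalence_classes(X, IA).items():
    print(key, "->", len(cls), "records")   # information equivalence classes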
Definition 2 (data fusion): given m data sets {D_1, …, D_m}, the fused data set D(X, {IA, SA}, F, V) is formed by the component-wise union of the records, attributes, relations, and value ranges of the m data sets. In particular, if two data sets to be fused D_i, D_j satisfy IA_i Δ IA_j ≠ ∅ (Δ denotes the symmetric difference operator), the fusion is called information incremental fusion; if there is a record x_k ∈ X_i ∩ X_j whose attribute values in one data set strictly refine its attribute values in the other, it is called information refinement fusion; and if every record x_k ∈ X_i ∩ X_j takes consistent sensitive attribute values in both data sets (where SA_i = SA_j), the fusion is called harmonious. The research scope of this patent is harmonious information incremental and refinement fusion.
Step 1.2, constructing privacy and privacy attack models:
Privacy here refers to the injective mapping between a user and the corresponding sensitive attribute value of the data; if this mapping is revealed, the user's privacy is disclosed. According to the data model, users and data records are in one-to-one correspondence, and on the data side each user corresponds to a group of information attribute values, i.e., there is an injection between the user and the information attribute values; the equivalence class containing this group of information attribute values in turn corresponds to a set of privacy attribute values. By the transitivity of injections, if the group of information attribute values and the corresponding privacy attribute value set form an injection (that is, the privacy attribute value set contains only one element), publishing the data set discloses the user's data privacy.
Definition 3 (data privacy disclosure): given a data set D(X, {IA, SA}, F, V), for any record x_i, let [x_i]_{IA} be the information equivalence class to which it belongs, and denote its corresponding set of privacy attribute values by V_SA([x_i]_{IA}); if |V_SA([x_i]_{IA})| = 1, the data privacy is said to be disclosed.
Definition 4 (knowledge-based attack): suppose the adversary knows the target user u_i's information attribute values f(x_i, IA) and knows that the user's data record x_i is in the data set D(X, {IA, SA}, F, V) to be published. When the data are published, the adversary constructs the chain from u_i through its information attribute values to the information equivalence class [x_i]_{IA} and its privacy attribute value set V_SA, and forms the privacy inference probability: for any v_j ∈ V_SA, the probability that user u_i takes the value v_j on SA is C(v_j)/C(V_SA) (where C(*) is a counting statistics function and * is a domain qualifier).
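A hedged sketch of the counting attack in Definition 4, with invented data: the adversary restricts the published table to the information equivalence class matching the target's known attribute values and estimates the sensitive-value probabilities by counting.

from collections import Counter

published = [
    {"age": "30-39", "zip": "0300*", "disease": "flu"},
    {"age": "30-39", "zip": "0300*", "disease": "flu"},
    {"age": "30-39", "zip": "0300*", "disease": "cancer"},
]

def inference_probabilities(records, known_ia, sa):
    # [x_i]_IA: records matching everything the adversary knows about u_i
    cls = [x for x in records if all(x[a] == v for a, v in known_ia.items())]
    counts = Counter(x[sa] for x in cls)   # C(v_j) for each sensitive value
    total = sum(counts.values())           # C(V_SA)
    return {v: c / total for v, c in counts.items()}

# The adversary knows the target is 30-39 in zip 0300*:
print(inference_probabilities(published, {"age": "30-39", "zip": "0300*"}, "disease"))
# flu ≈ 2/3, cancer ≈ 1/3; privacy leaks when some value has probability 1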
Definition 5 (multi-version attack under incremental publishing): given a first published data set D(X, {IA, SA}, F, V) and the corresponding updated data set D′(X′, {IA′, SA′}, F′, V′) later published by the same publisher, suppose the adversary compares the records of a dedicated user u_i in X′ and X, constructs the corresponding relation between the two versions, and forms the privacy inference probability over the two data sets from the selected privacy attribute value sets, where SEL is the selection function (the precise formula is rendered as an image in the original).
Further, the vertical encoding in step 2 comprises two stages: a Bayesian network structure learning stage and a network encoding stage.
The specific steps of the Bayesian network structure learning stage are as follows:
Step 2.1, consider learning a Bayesian network structure from a data set D = {D_1, …, D_n} over a set of m random variables X = {X_1, …, X_m}. It is assumed that the variables are categorical and that the data set is complete. The goal of the Bayesian network construction algorithm is, by determining a parent set Π_1, …, Π_m for each variable, to find the highest-scoring directed acyclic graph (DAG) G on the node set X. Under the Markov condition, this induces a joint probability distribution in which each variable is conditionally independent of its non-descendants given its parents.
Step 2.2, different scoring functions can be used to evaluate the quality of the generated DAG; here the Bayesian Information Criterion (BIC) score is used, which is proportional to the posterior probability of the DAG. BIC is decomposable, consisting of the sum of the scores of each variable with its parent set:

BIC(G) = Σ_i BIC(X_i | Π_i) = Σ_i [ LL(X_i | Π_i) − Pen(X_i | Π_i) ]

where LL(X_i | Π_i) is the log-likelihood of X_i given its parent set Π_i:

LL(X_i | Π_i) = Σ_{x, π} N_{x,π} log θ̂_{x|π}

and Pen(X_i | Π_i) is the complexity penalty of X_i with its parent set Π_i:

Pen(X_i | Π_i) = (log N / 2) · (|V(X_i)| − 1) · |V(Π_i)|

Here θ̂_{x|π} is the maximum likelihood estimate of the conditional probability P(X_i = x | Π_i = π), N_{x,π} denotes the number of occurrences of (X_i = x, Π_i = π) in the data set, and |V(·)| denotes the size of the Cartesian product of the value ranges of the given variables.
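The decomposable BIC score above can be sketched as follows for categorical records stored as dicts; bic_term and the toy data are illustrative, not from the patent.

import math
from collections import Counter

def bic_term(data, child, parents):
    """BIC(X_i | Pi_i) = LL(X_i | Pi_i) - Pen(X_i | Pi_i)."""
    n = len(data)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    parent_counts = Counter(tuple(r[p] for p in parents) for r in data)
    # log-likelihood with ML estimates theta = N_{x,pi} / N_pi
    ll = sum(n_xp * math.log(n_xp / parent_counts[pi])
             for (pi, _), n_xp in joint.items())
    # penalty: (log N / 2) * number of free parameters
    child_states = len({r[child] for r in data})
    parent_configs = max(len(parent_counts), 1)
    pen = (math.log(n) / 2) * (child_states - 1) * parent_configs
    return ll - pen

data = [{"A": a, "B": b} for a, b in
        [("0", "0")] * 40 + [("0", "1")] * 10 + [("1", "1")] * 50]
print(bic_term(data, "B", ["A"]))   # score of B with parent {A}
print(bic_term(data, "B", []))      # score of B with no parents (lower here)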
This patent uses the hill-climbing method to generate the Bayesian network of the corresponding data; the main steps are given in Algorithm 1.
Algorithm 1: Bayesian network structure generation based on the hill-climbing method (listing rendered as an image in the original).
It should be noted that the "flip edge" operation cannot simply be treated as the sequence "delete an edge, then add the same edge in the opposite direction". Because the algorithm adopts a greedy strategy, the edge deletion alone may lower the BIC score of the Bayesian network and terminate the procedure early, so that the corresponding edge addition is never applied.
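Continuing the previous sketch, a minimal hill climb in the spirit of Algorithm 1 (whose listing is an image in the original) might look as follows. The move set here is only add/delete, with the flip caveat noted above, and it reuses bic_term() and data from the BIC sketch; all names are illustrative.

import itertools

def total_bic(data, variables, parents):
    return sum(bic_term(data, v, sorted(parents[v])) for v in variables)

def creates_cycle(parents, child, parent):
    # Adding parent -> child cycles iff child is already an ancestor of parent.
    seen, stack = set(), [parent]
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def hill_climb(data, variables):
    parents = {v: set() for v in variables}
    best = total_bic(data, variables, parents)
    improved = True
    while improved:
        improved = False
        for u, v in itertools.permutations(variables, 2):
            if u in parents[v]:
                parents[v].discard(u)        # candidate move: delete u -> v
            elif not creates_cycle(parents, v, u):
                parents[v].add(u)            # candidate move: add u -> v
            else:
                continue
            score = total_bic(data, variables, parents)
            if score > best:
                best, improved = score, True # keep the move
            else:
                parents[v].symmetric_difference_update({u})  # undo the move
        # (a flip would be delete + add scored jointly as one move, since
        #  applying them greedily one at a time can stall, as noted above)
    return parents, best

print(hill_climb(data, ["A", "B"]))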
The specific steps of the Bayesian network encoding stage are as follows:
step 2.3, a hierarchical structure diagram is constructed by longitudinally encoding a Bayesian network, wherein the hierarchical structure diagram comprises two stages: a bottom-up encoding phase and a top-down modification phase. Specifically, given a bayesian network, which is converted into a hierarchical structure diagram by encoding, algorithm 2 is a bayesian network encoding process:
algorithm 2 Bayesian network vertical coding
Figure GDA0003836312730000072
Figure GDA0003836312730000081
1) Bottom-up encoding stage. First, the level of every node is initialized to zero; the algorithm then marks levels starting from the leaf nodes and progressively traces the corresponding parent nodes. In each round, when the level of a child node is q, the level of its parent node is marked q+1. For non-leaf nodes only the current maximum code is recorded: if a node's code is not 0, the new code is compared with the existing one and the larger is kept; if the new code equals the existing one, upward backtracking from that node stops. Whether the leaf node queue is empty is then checked, and if so, backtracking stops; otherwise the next leaf node is extracted for marking, until the leaf node sequence is empty.
2) Top-down correction stage. All nodes are first sorted by level from large to small, and all node codes are initialized as unmarked. The algorithm then extracts the unmarked node with the largest level (i.e., the largest code) in the node sequence, takes it as the starting point of a breadth-first traversal of the graph, and traverses downward level by level. In each round, when the level of a parent node is q, the level of its child node is marked q−1. Let q_old be the node's current level and q_new the newly derived one; two cases are considered: (a) when q_old < q_new, the algorithm sets the node's level to q_new and sets the node as marked; (b) when q_old = q_new and the node is already marked, the downward traversal from this node terminates early. The next unmarked node is then extracted, until no unmarked node remains in the sequence.
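A compact sketch of the two-phase vertical encoding, assuming the Bayesian network is given as a child-to-parents mapping; the function and variable names are invented, and the correction phase is simplified to the (a)/(b) cases described above.

from collections import deque

def encode_levels(parents):
    children = {v: [] for v in parents}
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    level = {v: 0 for v in parents}

    # Phase 1, bottom-up: a parent is labelled child's level + 1, keeping
    # only the maximum label; backtracking stops when the label cannot grow.
    queue = deque(v for v in parents if not children[v])   # leaf nodes
    while queue:
        node = queue.popleft()
        for p in parents[node]:
            if level[node] + 1 > level[p]:
                level[p] = level[node] + 1
                queue.append(p)            # label grew: keep tracing upward

    # Phase 2, top-down correction: from the highest unmarked node, traverse
    # breadth-first; a child is raised to parent's level - 1 when needed.
    marked = set()
    for start in sorted(parents, key=lambda v: -level[v]):
        if start in marked:
            continue
        queue = deque([start])
        marked.add(start)
        while queue:
            node = queue.popleft()
            for c in children[node]:
                q_new = level[node] - 1
                if level[c] < q_new or c not in marked:     # case (a)
                    level[c] = max(level[c], q_new)
                    marked.add(c)
                    queue.append(c)
                # case (b): q_old == q_new and already marked -> stop early
    return level

dag = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "S": ["D"]}
print(encode_levels(dag))   # e.g. {'A': 3, 'B': 2, 'C': 2, 'D': 1, 'S': 0}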
Further, step 3 realizes the four privacy protection operations of k-anonymity, l-diversity, t-closeness, and attribute value generalization; the specific steps are as follows:
Step 3.1, realizing k-anonymity: according to the value range set for the attribute by a domain expert or the data owner, the value-range space of the attribute in the Bayesian network is extended so that the number of distinct values in it is greater than or equal to k. During correction, following the correction principle of information-entropy maximization, the parent node of the privacy node assigns the attribute value with the largest probability mass among its child nodes to the privacy node, so that the privacy node satisfies the k requirement;
Step 3.2, realizing l-diversity: according to the data party's setting of the attribute's value range, the value-range space of the attribute in the Bayesian network is extended so that the number of distinct values in it is greater than or equal to l. The attribute is corrected following the correction principle of information-entropy maximization: in each correction round, only the value with the largest probability mass is selected as the target object to be corrected, and the probability mass above the mean is distributed evenly over the newly added attribute values;
Step 3.3, realizing t-closeness: the value distribution that maximizes the information entropy over the attribute's value-range space is defined as the theoretical standard, measured by variance, and the probability distribution of each value of the attribute is corrected so that the variance between each value's occurrence probability and the theoretical standard is not higher than t;
Step 3.4, realizing attribute value generalization: according to the attribute-value hierarchy tree set for the attribute by a domain expert or the data owner, the probability distributions of similar values in the value range of attribute C are fused. The attribute leaf node to be protected anonymously and all of its sibling leaf nodes in the value range of attribute C are aggregated into one attribute node and replaced by their direct parent node; the attribute-value probability distribution of this node is inherited from all the original leaf nodes participating in the aggregation.
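As an illustration of one unit privacy-protection operation, the sketch below applies a step 3.2 style l-diversity correction to a single node's probability distribution; the value-domain extension and the redistribution rule are simplified assumptions, not the patent's exact procedure.

def enforce_l_diversity(dist, l, max_rounds=20):
    """Extend the value domain to >= l values, then repeatedly clip the most
    probable value to the mean and share its excess among below-mean values,
    in the spirit of the entropy-maximization correction principle."""
    dist = dict(dist)
    i = 0
    while len(dist) < l:                 # extend the value-domain space
        dist[f"new_{i}"] = 0.0           # added values are illustrative
        i += 1
    for _ in range(max_rounds):
        mean = 1.0 / len(dist)
        target = max(dist, key=dist.get) # one correction target per round
        excess = dist[target] - mean
        below = [v for v in dist if dist[v] < mean and v != target]
        if excess < 1e-9 or not below:
            break
        dist[target] = mean
        for v in below:                  # spread the excess evenly
            dist[v] += excess / len(below)
    return dist

print(enforce_l_diversity({"flu": 0.9, "cancer": 0.1}, l=3))
# converges toward a near-uniform distribution over >= 3 values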
By converting the data sets under different privacy protection strategies into Bayesian networks and encoding them, the secondary anonymization process is realized, completing the re-anonymization framework for multi-party data fusion. In the multi-source data fusion process, however, the combination scheme of multiple privacy protection policies still needs to be optimized, so that the usability of the fused data is maximized while the privacy constraints of all parties are satisfied.
Further, in step 4, heuristic rules are adopted to evolve the data fusion process into a hypergraph resolution process. The specific steps are as follows. For intersecting hyperedges, PROG(HG) is:
FOR each hyperedge M intersecting hyperedge N DO
  eliminate bottom-up the probabilistically independent tuples in R(M)
ENDFOR;
and for a hypergraph split into sub-hypergraphs HG_1, …, HG_k:
PROG(HG_1), PROG(HG_2), …, PROG(HG_k);
RESULT(HG) := RESULT(HG_1) × RESULT(HG_2) × … × RESULT(HG_k)
The hypergraph resolution algorithm recursively applies the three rules (given in step 4.2 below); hyperedges are selected, solved, and eliminated from HG one by one, constructing the program PROG(HG) whose result is RESULT(HG), and the process of resolving the hypergraph is also the process of realizing the privacy constraints one by one. The hypergraph resolution heuristic algorithm is as follows:
Algorithm 3: hypergraph resolution heuristic algorithm (listing rendered as an image in the original).
Taking the two privacy protection policy operations F_1 = (A, B, D) and F_2 = (D, E, G, H) mentioned above as an example, we show how the heuristic algorithm constructs the program PROG(HG) and generates the result RESULT(HG):
(1) Resolving the hyperedges {A, B, D} and {D, E, G, H} yields the result hypergraph HG_1 = ({B, D}, {D, G}, {A}, {E}, {H}); by resolution rule 3, the PROG(HG) program is
PROG(HG_1);
RESULT(HG) := RESULT(HG_1).
(2) Let HG_2 = ({A}, {E}, {H}) and HG_3 = ({B, D}, {D, G}); by resolution rule 2, the PROG(HG_1) program is
PROG(HG_2), PROG(HG_3);
RESULT(HG_1) := RESULT(HG_2) × RESULT(HG_3).
Since HG_2 consists of three mutually independent hyperedges, PROG(HG_2) is
RESULT(HG_2) := R({A}, {E}, {H}).
(3) Computing PROG(HG_3): resolving the hyperedges {B, D} and {D, G} yields the result hypergraph HG_4; by resolution rule 3, the PROG(HG_3) program is
PROG(HG_4);
RESULT(HG_3) := RESULT(HG_4).
(4) Since HG_4 contains only one hyperedge, rule 1 gives PROG(HG_4) as
RESULT(HG_4) := R({D, G}).
The final program can be written as the composition of the above steps (the full listing is rendered as an image in the original).
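The recursion of steps (1)-(4) can be sketched in Python as below, starting from the result hypergraph HG_1 of step (1). Rules 1 and 2 follow the text; the handling of an intersecting pair is a simplification in which the surviving edge is fixed by hand rather than chosen by the rule-3 proximity codes, and all names are illustrative.

def components(edges):
    """Split hyperedges into connected components (rule 2 preprocessing)."""
    comps = []
    for e in map(frozenset, edges):
        hit = [c for c in comps if any(e & f for f in c)]
        comps = [c for c in comps if c not in hit] + [sum(hit, []) + [e]]
    return comps

def prog(edges, name="HG1"):
    comps = components(edges)
    if len(comps) > 1:                      # rule 2: disjoint sub-hypergraphs
        parts = [f"{name}_{i + 1}" for i in range(len(comps))]
        print(f"RESULT({name}) := " + " x ".join(f"RESULT({p})" for p in parts))
        for p, c in zip(parts, comps):
            prog(c, p)
    elif len(edges) == 1:                   # rule 1: a single hyperedge
        print(f"RESULT({name}) := R({set(edges[0])})")
    else:                                   # intersecting pair of hyperedges
        a, b = map(frozenset, edges[:2])
        print(f"RESULT({name}) := resolve({set(a)}, {set(b)})")
        # rule 3 would keep the edge whose nodes lie closer to the privacy
        # node; lacking the vertical codes here, we keep the second edge:
        prog([b] + list(edges[2:]), name + "'")

HG1 = [{"B", "D"}, {"D", "G"}, {"A"}, {"E"}, {"H"}]
prog(HG1)   # prints a schematic PROG ending in RESULT := R({'D', 'G'})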
of course, it is not necessary for any product to achieve all of the above advantages at the same time in the practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a comparative case analysis of results of execution sequences of different privacy preserving policies;
FIG. 2 is a system model for multi-source data fusion;
FIG. 3 is a hypergraph HG;
FIG. 4 shows the comparison results with and without re-anonymization;
FIG. 5 is a comparison of a naive algorithm and an optimized algorithm;
FIG. 6 is a graph of privacy attribute probabilities for discriminators and generators in different equivalence classes.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A multi-source data fusion privacy protection method based on multi-privacy policy combination optimization comprises the following steps:
Step 1, constructing a data multi-source fusion system model: first, in the system model the data owners collect the data of each party, and to prevent privacy disclosure each party performs its own data anonymization; second, because the data volume of some entities is huge, the data must be stored in a public cloud; data fusion in the public cloud organically integrates the multi-source, cross-platform data, aiming to mine useful information better from the fused, complete data sets of all parties, and since naive fusion cannot dispel the concern that the public cloud snoops on privacy after fusion, the public cloud must also perform a re-anonymization operation; furthermore, users enjoy the convenience of big data by customizing required services, yet unknown attackers may hide among them, so the user is assumed here to be "curious" as well, i.e., users and the cloud service provider are treated as a suspected privacy-mining group with the same attack capability. The specific steps are as follows:
Step 1.1, constructing a data fusion model. Data fusion organically integrates data belonging to multiple sources, aiming to mine useful information better through a more complete data set than before, so as to provide high-quality services for users. A data set is represented as a quadruple D(X, A, F, V), where X = {x_1, x_2, …, x_n} is the set of data records and each record x_i is exclusively associated with one dedicated user u_i; A is the attribute set, divided by sensitivity into an information attribute set IA and a sensitive attribute set SA, with IA ∪ SA = A and IA ∩ SA = ∅; F is the set of relations between X and A; V = ∪_{a_k ∈ A} V_{a_k}, where V_{a_k} is the value range of attribute a_k.
Step 1.2, constructing privacy and privacy attack models. Privacy here refers to the injective mapping between a user and the corresponding sensitive attribute value; if this mapping is revealed, the user's privacy is disclosed. According to the data model, users and data records are in one-to-one correspondence, and on the data side each user corresponds to a group of information attribute values, i.e., there is an injection between the user and the information attribute values; the equivalence class containing this group of information attribute values in turn corresponds to a set of privacy attribute values. By the transitivity of injections, if the group of information attribute values and the corresponding privacy attribute value set form an injection (that is, the privacy attribute value set contains only one element), publishing the data set discloses the user's data privacy.
Step 2, designing a multiple data fusion anonymity framework: aiming at the frequent cross-platform exchange and sharing of big data information, this patent provides a re-anonymization framework based on multi-party data fusion, comprising an initial state, a handshake process, data synchronization, and secondary anonymization. First, in the initial state, each party performs the corresponding anonymization on its own data according to its own privacy protection requirements; second, multi-party communication takes place in the handshake process, where each party publishes its data privacy protection requirements; in data synchronization, the distributions of the public attribute values of the multi-party data are made consistent while the privacy protection requirements of all parties are satisfied; in the secondary anonymization, the data sets are converted into Bayesian networks and hierarchical structure diagrams are constructed by encoding them, so that the anonymization is finally cast as a probabilistic inference problem.
Step 2.1, consider learning a Bayesian network structure from a data set D = {D_1, …, D_n} over a set of m random variables X = {X_1, …, X_m}. It is assumed that the variables are categorical (i.e., each variable has a finite number of states) and that the data set is complete. The goal of the Bayesian network construction algorithm is, by determining a parent set Π_1, …, Π_m for each variable, to find the highest-scoring directed acyclic graph (DAG) G on the node set X. Under the Markov condition, this induces a joint probability distribution in which each variable is conditionally independent of its non-descendants given its parents.
Step 2.2, different scoring functions can be used to evaluate the quality of the generated DAG; in this patent we use the Bayesian Information Criterion (BIC) score, which is proportional to the posterior probability of the DAG. BIC is decomposable, consisting of the sum of the scores of each variable with its parent set:

BIC(G) = Σ_i BIC(X_i | Π_i) = Σ_i [ LL(X_i | Π_i) − Pen(X_i | Π_i) ]

where LL(X_i | Π_i) is the log-likelihood of X_i given its parent set Π_i:

LL(X_i | Π_i) = Σ_{x, π} N_{x,π} log θ̂_{x|π}

and Pen(X_i | Π_i) is the complexity penalty of X_i with its parent set Π_i:

Pen(X_i | Π_i) = (log N / 2) · (|V(X_i)| − 1) · |V(Π_i)|

Here θ̂_{x|π} is the maximum likelihood estimate of the conditional probability P(X_i = x | Π_i = π), N_{x,π} denotes the number of occurrences of (X_i = x, Π_i = π) in the data set, and |V(·)| denotes the size of the Cartesian product of the value ranges of the given variables.
The Bayesian network encoding constructs a hierarchical structure diagram by vertically encoding the Bayesian network, in two stages: a bottom-up encoding stage and a top-down correction stage. Specifically, given a Bayesian network, it is converted into a hierarchical structure diagram by encoding.
Step 2.3, bottom-up encoding stage. First, the level of every node is initialized to zero; the algorithm then marks levels starting from the leaf nodes and progressively traces the corresponding parent nodes. In each round, when the level of a child node is q, the level of its parent node is marked q+1. For non-leaf nodes only the current maximum code is recorded: if a node's code is not 0, the new code is compared with the existing one and the larger is kept; if the new code equals the existing one, upward backtracking from that node stops. Whether the leaf node queue is empty is then checked, and if so, backtracking stops; otherwise the next leaf node is extracted for marking, until the leaf node sequence is empty.
Step 2.4, top-down correction stage. All nodes are first sorted by level from large to small, and all node codes are initialized as unmarked. The algorithm then extracts the unmarked node with the largest level (i.e., the largest code) in the node sequence, takes it as the starting point of a breadth-first traversal of the graph, and traverses downward level by level. In each round, when the level of a parent node is q, the level of its child node is marked q−1. Here, the value q_old of the node's current level is compared with the newly derived value q_new, and two cases are considered: (a) when q_old < q_new, the algorithm sets the node's level to q_new and sets the node as marked; (b) when q_old = q_new and the node is already marked, the downward traversal from this node terminates early. The next unmarked node is then extracted, until no unmarked node remains in the sequence.
Step 3, realizing the privacy protection strategy: given the privacy protection policy, the Bayesian network G is finally formed through steps 1-2; the relevant attribute nodes must then be operated on so that the privacy node X_s meets the policy requirements. Specifically, for a privacy protection policy F, the unit privacy protection operation is defined: the privacy budget is divided into d equal parts, and in each round only the probability distribution of one selected attribute node is privacy-protected. For the attribute awaiting the privacy operation, four privacy protection operations are realized: k-anonymity, l-diversity, t-closeness, and attribute value generalization.
Step 3.1, realizing k-anonymity: according to the value range set for the attribute by a domain expert or the data owner, the value-range space of the attribute in the Bayesian network is extended so that the number of distinct values in it is greater than or equal to k. During correction, following the correction principle of information-entropy maximization, the parent node of the privacy node assigns the attribute value with the largest probability mass among its child nodes to the privacy node, so that the privacy node satisfies the k requirement;
step 3.2, realizing l-diversity: as above, according to data party pair attributes
Figure GDA0003836312730000151
Setting of range of value, for attribute
Figure GDA0003836312730000152
And expanding the value range space in the Bayesian network so that the number of different values in the value range space is greater than or equal to l. Corrected attribute
Figure GDA0003836312730000153
According to the correction principle of information entropy maximization, in each correction process, only selecting one value with the maximum probability distribution as a target object to be corrected, and averagely distributing the probability distribution value higher than the mean value to a newly-added attribute value;
Step 3.3, realizing t-closeness: the value distribution that maximizes the information entropy over the attribute's value-range space is defined as the theoretical standard, measured by variance, and the probability distribution of each value of the attribute is corrected so that the variance between each value's occurrence probability and the theoretical standard is not higher than t;
Step 3.4, realizing attribute value generalization: according to the attribute-value hierarchy tree set for the attribute by a domain expert or the data owner, the probability distributions of similar values in the attribute's value range are fused. The attribute leaf node to be protected anonymously and all of its sibling leaf nodes in the attribute's value range are aggregated into one attribute node and replaced by their direct parent node; the attribute-value probability distribution of this node is inherited from all the original leaf nodes participating in the aggregation.
By converting the data sets under different privacy protection strategies into Bayesian networks and encoding them, the secondary anonymization process is realized, completing the re-anonymization framework for multi-party data fusion. In the multi-source data fusion process, however, the combination scheme of multiple privacy protection policies still needs to be optimized, so that the usability of the fused data is maximized while the privacy constraints of all parties are satisfied.
Step 4, by mapping the multi-party, multi-privacy-constraint data fusion onto a hypergraph and designing corresponding heuristic rules, the data fusion process is cast as a hypergraph resolution process, improving data availability while satisfying the privacy constraints. The specific steps are as follows:
Step 4.1, the privacy protection policy is formally defined as a five-tuple F = (G, IA, SA, OP, V), where G denotes the Bayesian network converted from the data set; IA denotes the information attribute nodes, IA = (a_1, a_2, …, a_m), where a_1, a_2, …, a_m are not mutually independent but have probabilistic dependence relations; SA denotes the privacy node; OP denotes the operation steps, OP = (OP_1, OP_2, …, OP_m); and V denotes the value ranges after the operations OP, V = (V_{a_1}, V_{a_2}, …, V_{a_m}).
The execution order of the different privacy protection strategies is judged from the data level and the structural level.

From the data level:
1) If a_m can be represented by a_1, a_2, …, a_n, then OP_m < OP_n, i.e., OP_m is executed later, and vice versa.

From the structural level:
2) Starting from the privacy node of the Bayesian network, the network is encoded through the bottom-up encoding stage and the top-down correction stage, and the privacy node SA is corrected within the maximum correction threshold. Comparing two operations on the privacy attribute, OP_i(SA_i, V_ai) and OP_j(SA_k, V_aj), that both achieve the required privacy protection: if OP_i affects the data structure less, the cost-effectiveness of OP_i in achieving the required protection is higher than that of OP_j, and OP_i is executed before OP_j, and vice versa.

3) If multiple operations OP_i, OP_j, OP_k on information attributes are involved, two cases are distinguished (the precise conditions are formulas rendered as images in the original). First, the value range of each operation is computed separately through the probabilistic inference relations between the information attributes; if the condition on IA_j holds, OP_i is executed before OP_j and OP_k, and if the opposite condition holds, OP_k is executed before OP_j and OP_i. Second, when the corresponding conditions hold, OP_k is executed before OP_j and OP_i; as for the order of OP_j and OP_i, if the value range produced by one operation OP_o affects OP_j, then OP_o is executed before OP_j, and vice versa.
For example, let two privacy protection policy operations be decomposed as F_1 = (A, B, D) and F_2 = (D, E, G, H), represented by the hyperedges {A, B, D} and {D, E, G, H} respectively, where B and D are two steps of one operation in F_1, represented by the conditional hyperedge {B, D}, and D and G are two steps of one operation in F_2, represented by the conditional hyperedge {D, G}; since these two operations are not mutually independent, an intersection exists between them, while A, E, and H are three independent operations, represented by the hyperedges {A}, {E}, and {H} respectively. The connected hypergraph HG obtained from the hyperedge relations is shown in FIG. 3:
Step 4.2, by judging the execution order of the different privacy protection strategies F, the following heuristic rules for hyperedge resolution and for generating PROG(HG) are obtained:

Rule 1. If the hypergraph HG contains only one hyperedge N, it is resolved directly, and PROG(HG) contains only RESULT(HG) := R(N);

Rule 2. If the hypergraph HG consists of k disjoint hypergraphs HG_1, HG_2, …, HG_k, then PROG(HG) is:
PROG(HG_1), PROG(HG_2), …, PROG(HG_k);
RESULT(HG) := RESULT(HG_1) × RESULT(HG_2) × … × RESULT(HG_k)

Rule 3. Given a privacy node SA and its vertical code SA.L, it follows from the properties of the Bayesian network that, among all chain sets Links having the privacy node as chain-tail node, if X_i and X_j are any two nodes in Links other than SA and X_i.L < X_j.L, then at the same privacy protection granularity, correcting the probability distribution of X_i has less influence on the availability of the global data. This yields the lower-proximity principle: the closer the corrected attribute node is to the privacy attribute, the more targeted the correction. In other words, if the hypergraph HG consists of k connected components HG_1, HG_2, …, HG_k, the probabilistic dependence of each hyperedge on the privacy node is judged; if HG_i is closer to the privacy node than HG_j, then HG_i is resolved before HG_j, and vice versa.
In this embodiment, the correctness and effectiveness of the privacy protection model proposed by this patent are verified through experimental simulation. The proposed architecture is implemented in Python; the hardware environment is an Intel(R) Core(TM) i5-1035G1 CPU @ 1.00 GHz (1.19 GHz) with 16 GB of memory, and the operating system is Windows 10.
In the first group of experiments, to highlight the superiority of the re-anonymization handshake protocol, our experiments compare using the protocol against not using it. First, this patent generates a data set by means of a Bayesian network, with every party anonymized once at generation time; second, through experiments we observe that the data volume has a certain influence on the privacy disclosure probability, so the data volume is used as the independent variable and the privacy disclosure probability as the dependent variable. The experimental results are shown in FIG. 4. It can be seen that with an extremely small data volume the privacy disclosure probabilities with and without re-anonymization are basically similar; as the experimental data volume increases, the re-anonymization method clearly reduces the disclosure probability, which drops below 20% when the data volume reaches 100,000; conversely, without re-anonymization the disclosure probability grows with the data volume, reaching 80% at 100,000 records.
In the second group of experiments, to verify that the optimization algorithm proposed by this patent can greatly improve data availability, a comparison experiment is designed between a naive algorithm and the optimization algorithm of this patent. Data availability is denoted by Q in this experiment and is given by a formula (rendered as an image in the original) in terms of the original data a and the noisy data b; as can be observed from the formula, the more noise is added, the worse the data availability. The data volume is again used as the independent variable, taking the values 5,000, 10,000, 20,000, 40,000, 60,000, 80,000, and 100,000, and the data availability results are observed, as shown in FIG. 5. As can be seen from FIG. 5, with an extremely small data volume the naive and optimization algorithms hardly differ in their effect on data availability; when the data volume reaches 40,000, the data availability of the optimization algorithm is about 30% higher than that of the naive algorithm; when it reaches 100,000, the data availability of the naive algorithm is about 40% lower than that of the optimization algorithm. This analysis shows that the data availability of the fusion algorithm optimized by this patent is far higher than that of the naive fusion algorithm.
In the third group of experiments, to verify the usability of the method of this patent in incremental data fusion, the concept of a generative adversarial network is used. First a data set is generated; a discriminator and a generator sample it at different proportions, 30% for the discriminator and 15% for the generator; the sampled data sets are each turned into Bayesian networks by the hill-climbing method, and each network then generates a data set of the same size, 40,000 records. The KL divergence is used to measure the difference between the distributions of the privacy attribute within given equivalence classes of the two data sets, computed as

D_KL(P ‖ Q) = Σ_x P(x) log ( P(x) / Q(x) )

The closer the KL divergence is to 0, the smaller the difference between the discriminator and the generator, and the better the experimental effect.
The privacy attribute probabilities of the discriminator and the generator in three different equivalence classes are selected for calculation; the probability distributions are shown in FIG. 6. The KL divergences are computed as KL1 = 0.0042, KL2 = 0.0043, and KL3 = 0.0053; all three are close to 0, so the difference between the discriminator and the generator is small and the experimental effect is good.
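For reference, the KL computation used in this experiment can be sketched as follows; the distributions are invented for illustration and the function is an ordinary discrete KL divergence, not code from the patent.

import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); closer to 0 is better."""
    return sum(pv * math.log(pv / max(q.get(x, 0.0), eps))
               for x, pv in p.items() if pv > 0)

discriminator = {"flu": 0.52, "cancer": 0.30, "gastritis": 0.18}
generator     = {"flu": 0.50, "cancer": 0.31, "gastritis": 0.19}
print(round(kl_divergence(discriminator, generator), 4))   # ≈ 0.0008, near zero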
The analysis of the three simulation experiments above shows that the method proposed by this patent not only greatly improves the privacy protection effect of multi-source data fusion but also greatly improves the usability of the data.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (4)

1. A multi-source data fusion privacy protection method based on multi-privacy policy combination optimization is characterized by comprising the following steps:
step 1, constructing a data multi-source fusion system model: firstly, the data owner in the system model collects data from all parties, each of which performs its own data anonymization operation; secondly, the data are stored in a public cloud, which performs a further anonymization operation; in addition, users can enjoy the convenience brought by big data by customizing the services they require;
step 2, designing a multiple data fusion anonymity framework comprising four processes: initial state, handshake, data synchronization and secondary anonymization; firstly, in the initial state each party performs the anonymization operation corresponding to its own privacy protection requirements on its own data; secondly, the parties communicate during the handshake process and each publishes its data privacy protection requirements; in data synchronization, the distributions of the public attribute values of the multi-party data are made consistent while the privacy protection requirements of all parties are satisfied; in secondary anonymization, the data set is converted into a Bayesian network and a hierarchical structure diagram is constructed by encoding the network, the specific process comprising network structure learning and network encoding;
step 3, realizing the privacy protection strategy: given the privacy protection policy, a Bayesian network G is formed through steps 1-2, and the information attributes referenced by the policy (written only as formula images in the original) are then operated on so that the privacy node X_s satisfies the policy requirements; specifically, for a privacy protection policy, a unit privacy protection operation is defined, namely the privacy budget is divided equally into d parts and, in each round, privacy protection is applied to the probability distribution of the selected attribute node; for the attributes to be operated on, four privacy protection operations are realized: k-anonymity, l-diversity, t-closeness and attribute value generalization;
step 4, by mapping the multi-party, multi-privacy-constrained data to be fused onto a hypergraph and designing corresponding heuristic rules, the data fusion process evolves into a hypergraph resolution process; the specific steps are as follows:
step 4.1, formally defining the privacy protection policy as a five-tuple F = (G, IA, SA, OP, V), wherein G represents the Bayesian network converted from the data set; IA = (a_1, a_2, …, a_m) denotes the information attribute nodes, where a_1, a_2, …, a_m are not mutually independent but stand in probabilistic dependence relationships; SA denotes the privacy node; OP = (OP_1, OP_2, …, OP_m) denotes the operation steps; and V denotes the value range after applying the operations OP (its exact expression survives only as a formula image in the original);
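As a minimal sketch (not the patent's implementation), the five-tuple F = (G, IA, SA, OP, V) can be carried by a small container; all field values below are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

@dataclass
class PrivacyPolicy:
    """Hypothetical container mirroring F = (G, IA, SA, OP, V)."""
    G: Any                # Bayesian network converted from the data set
    IA: List[str]         # information attribute nodes a_1..a_m (dependent)
    SA: str               # privacy (sensitive) node
    OP: List[str]         # operation steps OP_1..OP_m
    V: Dict[str, Tuple]   # value range of each attribute after its OP

policy = PrivacyPolicy(
    G=None,
    IA=["age", "zipcode", "gender"],
    SA="disease",
    OP=["generalize", "k-anonymity", "l-diversity"],
    V={"age": ("[20-30]", "[30-40]"), "zipcode": ("471*",), "gender": ("*",)},
)
```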
The execution order of different privacy protection policies is judged at the data level and at the structural level; from the data level:
1) if a_m can be represented by a_1, a_2, …, a_n, then OP_m < OP_n, i.e. OP_m is executed later, and vice versa;
from the structural level:
2) starting from the privacy node of the Bayesian network, the network is encoded through a bottom-up encoding stage and a top-down modification stage; by modifying the privacy node SA within the maximum modification threshold, the candidate operations on the privacy attribute (denoted only by formula images in the original) are compared for achieving the required privacy protection: if OP_i has less influence on the data structure, then OP_i achieves the required protection at a lower cost than OP_j, i.e. OP_j < OP_i, and OP_i is executed before OP_j, and vice versa;
3) if multiple operations OP_i, OP_j, OP_k on information attributes are involved (the attribute expressions survive only as formula images in the original), the following two cases are distinguished: firstly, if the ordering is cyclic, OP_i < OP_j < OP_k < OP_i, the value range of each operation is calculated through the probabilistic reasoning relationships among the information attributes; if the value-range condition on IA_j holds (the precise condition survives only as a formula image), then OP_i is executed before OP_j and OP_k, while if the opposite condition holds, OP_k is executed before OP_j and OP_i; secondly, when OP_i < OP_j < OP_k and OP_i < OP_k, OP_k is executed before OP_j and OP_i, and for the order of OP_j and OP_i, if the value ranges involved in the two operations are such that the result of the operation on OP_i's attribute will affect OP_j, then OP_j < OP_i and OP_i is executed before OP_j, and vice versa;
step 4.2, through judging the execution order of the different privacy protection policies F, the following heuristic rules for hyperedge resolution, PROG(HG), are generated:
rule 1. If the hypergraph HG contains only one hyperedge N, it is resolved directly, and PROG(HG) contains only RESULT(HG) := R(N);
rule 2. If the hypergraph HG consists of k disjoint hypergraphs HG_1, HG_2, …, HG_k, then PROG(HG) is:
PROG(HG_1), PROG(HG_2), …, PROG(HG_k);
RESULT(HG) := RESULT(HG_1) × RESULT(HG_2) × … × RESULT(HG_k);
rule 3. Given a privacy node SA and its vertical code SA.L, it is known from the properties of the Bayesian network that, in the set Links of all chains whose tail node is the privacy node, letting X_i and X_j be any two nodes in Links other than SA, if X_i.L < X_j.L then, at the same privacy-preserving granularity, modifying X_i has less influence on global data availability; this forms a closeness principle, i.e. the closer the modified attribute node is to the privacy attribute, the more targeted the modification; in other words, if the hypergraph HG is composed of k connected components HG_1, HG_2, …, HG_k, the probabilistic dependence of each hyperedge on the privacy node is judged; if HG_i is closer to the privacy node than HG_j, then HG_j < HG_i, i.e. HG_i is resolved first, and vice versa.
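A minimal sketch of the closeness principle of rule 3 follows; the node levels and hyperedges are assumed toy values, and the `closeness` function stands in for the patent's probabilistic-dependence judgment.

```python
# Hyperedges whose nodes sit closer to the privacy node (smaller vertical
# code X.L) are resolved first. Levels and edges are illustrative.
levels = {"SA": 0, "X1": 1, "X2": 2, "X3": 3}   # vertical codes X.L

hyperedges = {
    "HG_i": ["X1", "SA"],      # closer to the privacy node
    "HG_j": ["X2", "X3"],      # farther from the privacy node
}

def closeness(edge_nodes):
    """Distance of a hyperedge to the privacy node = min level of its nodes."""
    return min(levels[n] for n in edge_nodes if n != "SA")

# Resolve in ascending closeness: HG_i before HG_j.
order = sorted(hyperedges, key=lambda e: closeness(hyperedges[e]))
print(order)  # ['HG_i', 'HG_j']
```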
2. The multi-source data fusion privacy protection method based on multi-privacy policy combination optimization according to claim 1, wherein in step 1 a data fusion model and a privacy attack model are constructed using an adversarial learning architecture; the specific steps are as follows:
step 1.1, constructing a data fusion model:
the data fusion is to organically integrate data belonging to multiple sources, and one data set is represented as a quadruple D (X, A, F, V), wherein X = { X = { (X) } 1 ,x 2 ,…,x n Is a set of data records, each item of data x i Are all exclusively associated with one dedicated user u i (ii) a A is an attribute set; according to the sensitivity of attributes, it is divided into an information attribute set IA and a sensitive attribute set SA, and IA ≦ SA = a,
Figure FDA0003848609140000031
f is a set of relationships between X and A
Figure FDA0003848609140000032
Figure FDA0003848609140000033
Is attribute a k A range of values of;
definition 1 equivalence classes: given a data set D(X, A, F, V), for any attribute subset A′ ⊆ A and any record x_i ∈ X, if there exist t records {x_1, x_2, …, x_t} (t ≥ 1) that take equal values on every attribute in A′, then {x_1, x_2, …, x_t} is an equivalence class on D with respect to A′, denoted [x_i]_{A′}; correspondingly, the set E_{A′} of all equivalence classes formed by the attribute set A′ forms a partition of D, denoted D/E_{A′}; if A′ ⊆ IA, the corresponding equivalence class is called an information equivalence class;
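For illustration, the partition D/E_{A′} of definition 1 can be computed by grouping records on the chosen attribute subset; the records and attribute names below are assumptions.

```python
from collections import defaultdict

def equivalence_classes(records, attrs):
    """Partition D/E_A': group records that agree on every attribute in attrs."""
    classes = defaultdict(list)
    for rec in records:
        key = tuple(rec[a] for a in attrs)
        classes[key].append(rec)
    return dict(classes)

# Grouping on the information attributes yields the information
# equivalence classes of definition 1 (illustrative records).
D = [
    {"age": "2*", "zip": "471*", "disease": "flu"},
    {"age": "2*", "zip": "471*", "disease": "cancer"},
    {"age": "3*", "zip": "472*", "disease": "flu"},
]
for key, cls in equivalence_classes(D, ["age", "zip"]).items():
    print(key, len(cls))
```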
definition 2 data fusion: given m data sets {D_1, …, D_m}, the fused data set D(X, {IA, SA}, F, V) satisfies a consistency condition (given only as a formula image in the original); if two data sets to be fused, D_i and D_j, satisfy X_i Δ X_j ≠ ∅, where Δ represents the symmetric difference operator, the fusion is called information increment fusion; if there exists a record x_k ∈ X_i ∩ X_j satisfying the refinement conditions (these survive only as formula images), it is called information refinement fusion; if every record x_k ∈ X_i ∩ X_j satisfies the corresponding consistency conditions, wherein SA_i = SA_j, it is called coordination fusion; the research scope of this method is coordinated information increment and refinement fusion;
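A small sketch of the set algebra behind definition 2: the symmetric difference of the two parties' record-id sets signals increment fusion, while the overlap is where the refinement and coordination conditions apply. The id sets are illustrative assumptions, not the patent's data.

```python
# Record-id sets held by two parties (illustrative).
X_i = {1, 2, 3, 4}
X_j = {3, 4, 5, 6}

sym_diff = X_i ^ X_j   # symmetric difference: records only one party has
overlap  = X_i & X_j   # shared records, candidates for refinement/coordination

if sym_diff:
    print("information increment fusion contributes records:", sym_diff)
if overlap:
    print("refinement/coordination checks apply to records:", overlap)
```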
step 1.2, constructing a privacy and privacy attack model:
Privacy here refers to the injective mapping between a user and the corresponding sensitive attribute value; if this mapping is leaked, the user's privacy is leaked. From the data model it is known that users and data records are in one-to-one correspondence, and at the data level each user corresponds to a group of information attribute values, i.e. there is an injection from users to information attribute values; the equivalence class containing this group of information attribute values in turn corresponds to a set of privacy attribute values; by the transitivity of injections, if the group of information attribute values and the corresponding privacy attribute value set form an injection, publishing the data set leaks the user's data privacy;
definition 3 data privacy leakage: given a data set D(X, {IA, SA}, F, V), for a record x_i ∈ X, let [x_i]_IA be the information equivalence class to which it belongs and denote its corresponding set of privacy attribute values SA([x_i]_IA) (the original notation survives only as formula images); if the group of information attribute values determines a unique privacy attribute value, i.e. the mapping from [x_i]_IA to SA([x_i]_IA) is an injection, data privacy is said to be leaked;
definition 4 knowledge inference attack: suppose the adversary knows the target user u_i's value v_j on an information attribute a_j ∈ IA and knows that the user's data record x_i is in a data set D(X, {IA, SA}, F, V) to be published; when the data are published, the adversary builds the correspondence between the known information attribute value and the matching records (the original expression survives only as a formula image) and from it forms a privacy inference probability, i.e. the probability that user u_i takes a given value on SA is the ratio of the count of records matching both the information attribute value and that privacy value to the count of records matching the information attribute value, wherein C is the counting statistical function and * is a qualifier over the domain of discourse;
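The counting-ratio inference of definition 4 can be illustrated as follows; the records and attribute names are assumptions, and the function simply estimates P(SA = s | IA = v) by counting, as the definition describes.

```python
def inference_probability(records, ia_attr, ia_value, sa_value):
    """Adversary's inference P(SA = sa_value | ia_attr = ia_value),
    estimated as a ratio of counts over the published records."""
    matching = [r for r in records if r[ia_attr] == ia_value]
    if not matching:
        return 0.0
    hits = sum(1 for r in matching if r["SA"] == sa_value)
    return hits / len(matching)

D = [
    {"zip": "471*", "SA": "flu"},
    {"zip": "471*", "SA": "flu"},
    {"zip": "471*", "SA": "cancer"},
    {"zip": "472*", "SA": "flu"},
]
# Knowing the target lives in "471*", the adversary infers "flu" with p = 2/3.
print(inference_probability(D, "zip", "471*", "flu"))
```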
definition 5 multi-version attack under incremental data publishing: given a first published data set D(X, {IA, SA}, F, V) and the corresponding updated data set D′(X′, {IA′, SA′}, F′, V′) released by the same publisher, suppose the adversary compares a target user u_i's records across X and X′; the adversary constructs the correspondence between the two versions (the original expression survives only as a formula image) and forms a privacy inference probability over the two data sets, i.e. the probability with which the adversary infers the user's privacy, expressed through the selection function SEL.
3. The multi-source data fusion privacy protection method based on multi-privacy policy combination optimization according to claim 1, wherein the vertical encoding in the step 2 includes two stages: a Bayesian network structure learning stage and a network coding stage;
the specific steps of the Bayesian network structure learning stage are as follows:
step 2.1, consider learning a Bayesian network structure from a data set D = {D_1, …, D_n} containing a set of m random variables X_1, …, X_m; assuming that the variables are categorical and the data set is complete, the goal of the Bayesian network construction algorithm is to define a parent set Π_1, …, Π_m for each variable and to find, over the node set, the highest-scoring Directed Acyclic Graph (DAG) G; under the Markov condition, G induces a joint probability distribution in which each variable is conditionally independent of its non-descendant variables given its parents;
step 2.2, to evaluate the quality of the generated DAG, different scoring functions can be used; the Bayesian Information Criterion (BIC) score is adopted here, which is proportional to the posterior probability of the DAG; the BIC is decomposable, consisting of the sum of the scores of each variable with its parent node set:

BIC(G) = Σ_{i=1..m} [ LL(X_i | Π_i) − Pen(X_i | Π_i) ]

wherein LL(X_i | Π_i) represents the log-likelihood of X_i with its parent node set Π_i:

LL(X_i | Π_i) = Σ_{x,π} N_{x,π} · log θ̂_{x|π}

and Pen(X_i | Π_i) represents the complexity penalty of X_i with its parent node set Π_i:

Pen(X_i | Π_i) = (log N / 2) · |Π_i| · (|X_i| − 1)

wherein θ̂_{x|π} = N_{x,π} / N_π is the maximum likelihood estimate of the conditional probability P(X_i = x | Π_i = π), N_{x,π} denotes the number of occurrences of (X_i = x, Π_i = π) in the data set, and |·| denotes the size of the Cartesian product space of the value domains of the given variables;
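For illustration, the decomposable BIC term for one variable can be computed directly from counts; the toy data set below is an assumption, and the code follows the reconstructed LL and Pen expressions above.

```python
import math
from collections import Counter

def bic_term(data, i, parents):
    """BIC score of variable i given its parents:
    LL - (log N / 2) * |Pi| * (|Xi| - 1), with data as a list of tuples."""
    N = len(data)
    n_xpi = Counter((row[i], tuple(row[p] for p in parents)) for row in data)
    n_pi  = Counter(tuple(row[p] for p in parents) for row in data)

    # Log-likelihood with maximum-likelihood estimates theta = N_xpi / N_pi.
    ll = sum(n * math.log(n / n_pi[pi]) for (x, pi), n in n_xpi.items())

    x_card  = len({row[i] for row in data})
    pi_card = max(1, math.prod(len({row[p] for row in data}) for p in parents))
    pen = (math.log(N) / 2) * pi_card * (x_card - 1)
    return ll - pen

data = [(0, 0), (0, 0), (1, 0), (1, 1), (1, 1), (0, 1)]
print(bic_term(data, i=1, parents=[0]))   # score of X_1 with parent X_0
```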
the Bayesian network coding builds a hierarchical structure diagram by longitudinally coding the Bayesian network, wherein the hierarchical structure diagram comprises two stages: the encoding stage from bottom to top and the correction stage from top to bottom specifically comprise the following steps:
step 2.3, in the bottom-up encoding stage: first, the hierarchy codes of all nodes are initialized to zero; the algorithm starts marking from the leaf nodes and traces the corresponding parent nodes step by step; in each round, when the hierarchy code of a child node is q, the hierarchy code of its parent node is marked q + 1; for a non-leaf node only the current maximum code is kept, i.e. if the node's code is not 0 the new code is compared with the existing one and the larger is retained, and if the two codes are equal the upward backtracking from that node stops; whether the leaf node queue is empty is then checked, and backtracking stops when it is; next, the next leaf node is extracted and marked, until the leaf node sequence is empty;
step 2.4, in the top-down correction stage: first, all nodes are sorted by hierarchy code in descending order and all nodes are initialized as unmarked; the algorithm then extracts the unmarked node with the largest hierarchy code in the sequence and, taking it as the starting point, performs a breadth-first traversal level by level; in each round, when the hierarchy code of a parent node is q, the hierarchy code of its child node is marked q − 1; comparing the node's current code q_old with the newly derived code q_new, two cases are considered: (a) when q_old < q_new, the algorithm sets the node's hierarchy code to q_new and marks the node; (b) when q_old = q_new and the node is already marked, the downward traversal from that node terminates early; next, the next unmarked node is extracted, until no unmarked node remains in the sequence.
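A condensed sketch of the two-stage vertical encoding on a toy DAG follows; the graph itself, and the simplified marking bookkeeping, are assumptions made for brevity.

```python
from collections import deque

# Toy DAG (edges point parent -> child); illustrative structure.
edges   = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
parents = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
nodes   = ["A", "B", "C", "D"]

# Stage 1, bottom-up: leaves start at code 0; a parent keeps the maximum
# child code + 1, and backtracking stops when the code does not grow.
level = {n: 0 for n in nodes}
queue = [n for n in nodes if not edges.get(n)]        # leaf nodes
while queue:
    child = queue.pop()
    for p in parents.get(child, []):
        if level[child] + 1 > level[p]:
            level[p] = level[child] + 1
            queue.append(p)

# Stage 2, top-down correction: breadth-first from the highest-coded node,
# re-marking a child as q - 1 when its parent's code q implies a larger value.
dq = deque([max(nodes, key=lambda n: level[n])])
while dq:
    p = dq.popleft()
    for c in edges.get(p, []):
        if level[p] - 1 > level[c]:
            level[c] = level[p] - 1
            dq.append(c)

print(level)   # {'A': 2, 'B': 1, 'C': 1, 'D': 0}
```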
4. The multi-source data fusion privacy protection method based on multi-privacy policy combination optimization according to claim 1, wherein the four privacy protection operations of k-anonymity, l-diversity, t-closeness and attribute value generalization are implemented in step 3; the specific steps are as follows:
step 3.1, realizing k-anonymity: the value range of the attribute to be protected (written as a formula image in the original) is set by a domain expert or the data owner; the attribute's value domain space is then expanded in the Bayesian network so that the number of distinct values in the domain is greater than or equal to k; during correction, following the information-entropy-maximization correction principle, the parent node of the privacy node assigns the attribute value with the largest probability distribution value among its child nodes to the privacy node, so that the privacy node satisfies the k requirement;
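Illustratively, the value-domain expansion of step 3.1 can be sketched as padding the attribute's domain until it holds at least k distinct values; the placeholder value names and probabilities are assumptions.

```python
def expand_domain(domain_probs: dict, k: int) -> dict:
    """Pad the value domain with fresh placeholder values (probability 0)
    until it contains at least k distinct values."""
    out = dict(domain_probs)
    i = 0
    while len(out) < k:
        out[f"synthetic_{i}"] = 0.0
        i += 1
    return out

probs = {"flu": 0.6, "cancer": 0.4}
print(expand_domain(probs, k=4))
# During correction, the parent node then assigns its highest-probability
# child value to the privacy node (entropy-maximization principle).
```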
step 3.2, realizing l-diversity: according to the data party's setting of the attribute's value range, the attribute's value domain space is expanded in the Bayesian network so that the number of distinct values in the domain is greater than or equal to l; when correcting the attribute, following the information-entropy-maximization correction principle, only the single value with the largest probability distribution is selected as the correction target in each round, and the part of its probability mass above the mean is distributed evenly over the newly added attribute values;
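The per-round l-diversity correction can be sketched as follows; the disease names are illustrative, and the redistribution rule follows the entropy-maximization principle described above under the stated assumptions.

```python
def l_diversity_round(probs: dict, new_values: list) -> dict:
    """One correction round: shave the single max-probability value down to
    the domain mean and spread the excess evenly over the new values."""
    out = dict(probs)
    for v in new_values:
        out.setdefault(v, 0.0)
    mean = 1.0 / len(out)
    top = max(out, key=out.get)              # single max-probability target
    excess = max(0.0, out[top] - mean)
    out[top] -= excess
    for v in new_values:                     # distribute the excess evenly
        out[v] += excess / len(new_values)
    return out

print(l_diversity_round({"flu": 0.7, "cancer": 0.3}, ["hepatitis", "other"]))
# {'flu': 0.25, 'cancer': 0.3, 'hepatitis': 0.225, 'other': 0.225}
```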
step 3.3, realizing t-closeness: the value distribution that maximizes the information entropy over the attribute's value domain space is defined as the theoretical standard, measured by variance; the probability distribution of each value of the attribute is corrected so that the variance between each value's occurrence probability and the theoretical standard is not higher than t;
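A minimal sketch of the t-closeness test of step 3.3, taking the uniform (entropy-maximizing) distribution as the theoretical standard and variance as the distance; the probabilities and threshold are illustrative.

```python
def t_close_ok(probs: list, t: float) -> bool:
    """Check that the variance of the observed probabilities around the
    uniform (entropy-maximising) standard does not exceed t."""
    uniform = 1.0 / len(probs)
    variance = sum((p - uniform) ** 2 for p in probs) / len(probs)
    return variance <= t

print(t_close_ok([0.30, 0.35, 0.35], t=0.01))   # True: near-uniform
print(t_close_ok([0.80, 0.10, 0.10], t=0.01))   # False: needs correction
```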
step 3.4, realizing attribute value generalization: according to an attribute value hierarchy tree set by a domain expert or the data owner, the probability distributions of similar values in the attribute's value domain are fused; the attribute leaf node to be anonymously protected and all of its sibling leaf nodes in the value domain are aggregated into one attribute node and replaced by their direct parent node, whose attribute value probability distribution inherits from all the original leaf nodes participating in the aggregation;
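The leaf-aggregation of step 3.4 can be sketched with a toy value hierarchy; the tree and probabilities are assumptions, and the parent node inherits the summed probability of the aggregated leaves as described.

```python
# Toy value-hierarchy tree: parent value -> its leaf values (illustrative).
hierarchy = {"respiratory": ["flu", "pneumonia"], "oncology": ["cancer"]}
probs = {"flu": 0.5, "pneumonia": 0.2, "cancer": 0.3}

def generalize(probs, hierarchy, protected_leaf):
    """Replace the protected leaf and its siblings by their direct parent,
    whose probability is the sum over the aggregated leaves."""
    parent = next(p for p, leaves in hierarchy.items()
                  if protected_leaf in leaves)
    merged = {v: q for v, q in probs.items() if v not in hierarchy[parent]}
    merged[parent] = sum(probs[leaf] for leaf in hierarchy[parent])
    return merged

print(generalize(probs, hierarchy, "flu"))
# {'cancer': 0.3, 'respiratory': 0.7}
```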
the data sets with different privacy protection strategies are converted into the Bayesian network, and the Bayesian network is encoded, so that a secondary anonymization process is realized, and a re-anonymization framework with multi-party data fusion is realized.
CN202110014817.4A 2021-01-06 2021-01-06 Multi-source data fusion privacy protection method based on multi-privacy policy combination optimization Active CN112765653B (en)
