CN112765653A - Multi-source data fusion privacy protection method based on multi-privacy policy combination optimization - Google Patents


Info

Publication number: CN112765653A (application CN202110014817.4A; granted as CN112765653B)
Authority: CN (China)
Prior art keywords: data, privacy, node, attribute, value
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN112765653B
Inventors: 周志刚, 白增亮, 王宇, 梁子恺, 吴天生
Current assignee: Shancai Hi Tech Shanxi Co ltd
Original assignee: Shancai Hi Tech Shanxi Co ltd
Events: application filed by Shancai Hi Tech Shanxi Co ltd (priority to CN202110014817.4A); publication of CN112765653A; application granted; publication of CN112765653B

Classifications

    • G06F 21/6245 — Protecting personal data, e.g. for financial or medical purposes (under G06F 21/62: protecting access to data via a platform, e.g. using keys or access control rules; G06F 21/60: protecting data; G06F 21/00: security arrangements)
    • G06F 18/251 — Fusion techniques of input or preprocessed data (under G06F 18/25: fusion techniques; G06F 18/20: analysing; G06F 18/00: pattern recognition)
    • G06F 18/29 — Graphical models, e.g. Bayesian networks (under G06F 18/20: analysing; G06F 18/00: pattern recognition)


Abstract

The invention belongs to the field of data release, and particularly relates to a multi-source data fusion privacy protection method based on multi-privacy policy combination optimization. A multi-party data fusion architecture based on re-anonymity is provided, preventing the privacy of the fused data from being leaked. Further, the practical significance of data fusion is to provide a more comprehensive data basis on which users can conduct extensive knowledge mining. Therefore, a combined optimization scheme for multiple privacy protection strategies is designed, which maximizes the availability of the fused data while meeting the privacy constraints of all parties. The strategy maps the multi-source, multi-privacy-constraint data fusion into a hypergraph; using heuristic rules, hyper-edges are selected, solved, and eliminated on the hypergraph one by one, so the process of resolving the hypergraph is also the process of realizing the privacy constraints one by one, and the data fusion scheme is formulated accordingly.

Description

Multi-source data fusion privacy protection method based on multi-privacy policy combination optimization
Technical Field
The invention belongs to the field of data release, and particularly relates to a multi-source data fusion privacy protection method based on multi-privacy policy combination optimization.
Background
Multi-source cross-platform data and cross-domain data applications are the most prominent characteristics of big data. In the big data era, owing to the explosive growth of data in different application fields, a single type of data (such as location data, social data, Cookie logs, shopping-website transaction flows, and the like) can hardly satisfy people's requirements for complex upper-layer application services. For example, if Bob needs an App to find nearby friends who like to play basketball, fulfilling this need requires the organic fusion of location data with social data. Such cross-domain data fusion needs of individuals, as well as the real demands and applications of cross-domain data fusion between different departments within an enterprise, between enterprises of different natures, and even between enterprises and government departments — such as accurate advertisement push, ride-hailing optimization and management, and smart-city subway line planning — require the data owners of the platforms in different fields to cooperate deeply at the level of their own data. However, the data of each platform often has great use value and may include sensitive/private information such as the identity information, behavior information, financial information, and even disease information of users, and directly publishing the original data will necessarily lead to the disclosure of user privacy.
To prevent user privacy from being revealed, before big data are fused and released, the data sets of the respective platforms need to undergo desensitization processing (such as perturbation, noise addition, generalization, and the like). Most conventional anonymous privacy protection methods only protect the data of a single data source and cannot effectively address the non-explicit disclosure of private information brought about by deep association analysis over big data; moreover, a single privacy protection method cannot meet the personalized privacy requirements of data users, just as the local privacy protection of data from various sources cannot avoid the risk of global privacy disclosure after fusion. For example, suppose Alice purchases an air ticket to Munich at ticketing website A and browses tourist attractions of Munich on a webpage of travel company B. A and B each disclose information for outsourcing: company A adopts an information generalization technique based on 3-anonymity, generalizing "air ticket to Munich" into "air ticket to Europe", while company B adopts a 3-diversity technique, publishing as one group the browsing behavior of Alice and two other users of its website, {2017-07-11 9:30: {Munich: Neuschwanstein, Japan: Mount Fuji, USA: Massachusetts Institute of Technology}}. Assuming the adversary knows that Alice has a travel plan to go abroad and learns from a stolen internet log that she has logged into the webpages of companies A and B, then by correlating the information published by the two companies the adversary can accurately deduce when Alice will travel the Munich–Neuschwanstein route. This is also the most essential problem facing privacy protection in big data release: privacy disclosure caused by an attacker constructing data association analysis after the distributed, multi-source big data are fused. One naive approach is to combine the privacy protection methods over the naturally joined fused data at method-level granularity. However, combination at method-level granularity may result in "over-protection" of the private information and thereby severely reduce the availability of the data, as shown in FIG. 1: in a two-party data fusion, scheme I (first 5-anonymity, then 3-diversity) requires 29 pieces of noise to be added, while scheme II (first 3-diversity, then 5-anonymity) requires 20. Hence the fine-grained combined optimization of multiple privacy protection methods that maximizes data availability remains an open problem in the field of privacy-protected big data fusion and release.
In the field of privacy protection for data release, traditional privacy protection algorithms include differential privacy, k-anonymity, l-diversity anonymity, t-closeness anonymity, and the like, and some scholars' improvements on the traditional algorithms also have milestone significance. For example, by means of a semantic hierarchy tree, Wang et al. semantically generalize the records whose number falls short of the anonymity requirement so that they achieve k-anonymity under broader semantics; however, the record generalization technique causes irreversible information loss, and applying the k-anonymity criterion to high-dimensional sparse data greatly reduces data availability. Brijesh B. et al. propose a method that improves l-diversity anonymity with a significant improvement in running time and, thanks to the close arrangement of records in the initial equivalence classes, less information loss than existing methods while providing the same level of privacy. In general, these conventional privacy protection models are usually only applicable to static data release in specific scenarios. The risk faced by big data release, however, lies in the dynamic nature of the release process and its multi-source cross-platform character, so an attacker must be prevented from performing association analysis on the data after multi-source fusion and thereby breaking the anonymity of the data.
In the aspect of privacy protection for data fusion, H. Patel et al. propose a secure bottom-up method for fusing two parties' data, but the premise of the model is that a trusted third party fuses all the data into a complete original data table and then anonymizes that table; since a trusted third party does not exist in most cases, the method is of limited practical value. Jiang et al. propose the DkA secure fusion model for two parties' data under the semi-honest model; the algorithm uses a commutative encryption strategy to hide the original information during communication and judges whether the anonymity threshold k is met by constructing a complete anonymous table, thereby realizing privacy protection during data fusion, but the resource consumption of the method is too large and it is not suitable for fusing large data sets. Clifton et al. developed secure multi-party data integration tools for four typical operations on relational data: counting, union, intersection, and Cartesian product. Yeom et al. studied the indirect privacy disclosure caused by insufficient model generalization capability, and subsequently Mohammed et al. realized data privacy protection for each party of the data integration using a data generalization technique based on a classification tree structure, but the information loss of the integrated data is high, the specific degree of loss depending on the data set. These schemes all assume that the multiple parties participating in the data fusion adopt the same privacy protection strategy; however, facing the different privacy protection requirements of big data, different platforms may adopt personalized privacy protection strategies according to their own application requirements before fusion, and the existing schemes are then difficult to apply.
Disclosure of Invention
The invention provides a multi-source data fusion privacy protection method based on multi-privacy policy combination optimization. Specifically, this patent first proposes a multi-party data fusion architecture based on re-anonymity, in which the inner-layer data anonymity exists before data fusion and is implemented by the respective local data owners to give the data initial protection, while the outer-layer data anonymity occurs during data fusion and is implemented by the multiple parties participating in the fusion according to an agreed multi-party privacy protection protocol (for simplicity of description, this anonymity is regarded as simultaneously satisfying the privacy constraints of all parties), preventing the privacy of the fused data from being leaked. Further, the practical significance of data fusion is to provide a more comprehensive data basis on which users can conduct extensive knowledge mining. Therefore, this patent designs a combined optimization scheme for multiple privacy protection strategies, which maximizes the availability of the fused data while meeting the privacy constraints of all parties. The strategy maps the multi-source, multi-privacy-constraint data fusion into a hypergraph; using heuristic rules, hyper-edges are selected, solved, and eliminated on the hypergraph one by one, so the process of resolving the hypergraph is also the process of realizing the privacy constraints one by one, and the data fusion scheme is formulated accordingly.
In order to achieve the above technical purpose and effect, the invention is realized by the following technical scheme:
step 1, constructing a data multi-source fusion system model:
As shown in FIG. 2, first, in the system model, the data owners collect data from each party, and to prevent privacy disclosure each party performs a data anonymization operation. Second, since the data volume of some entities is huge, the data must be stored in a public cloud; the data fusion performed by the public cloud organically integrates the multi-source cross-platform data, aiming to mine useful information better by fusing the complete data sets of all parties, but simply fusing the data cannot eliminate the concern that the public cloud snoops on privacy after the fusion, so the public cloud must also perform a re-anonymity operation. In addition, users enjoy the convenience of big data by customizing the services they need, yet unknown attackers may hide among the users, so it is assumed here that users are also "curious", i.e., users are treated as a suspected privacy-mining group with the same attack capability as the cloud service provider.
Step 2, designing a multiple data fusion anonymity framework:
aiming at frequent cross-platform communication and sharing of big data information, the patent provides a re-anonymity framework based on multi-party data fusion, which comprises an initial state, a handshake process, data synchronization and secondary anonymity. Firstly, carrying out corresponding anonymization operation on respective data by each party in an initial state according to respective privacy protection requirements; secondly, the handshake process carries out multiparty communication, and each party issues respective data privacy protection requirements; data synchronization, namely, in the process, the distribution of public attribute values of multi-party data needs to be consistent, and meanwhile, the requirement of privacy protection of multiple parties is met; the method comprises the steps of firstly establishing a hierarchical structure diagram, secondly establishing anonymity, and finally establishing a probabilistic reasoning problem.
Step 3, realize the privacy protection policy:
Given the privacy protection policy, the Bayesian network G is finally formed through the two processes above, and operations on the attributes X_1, …, X_d are then required so that the privacy node X_s meets the policy requirements. Specifically, for a privacy protection policy F = (G, IA, SA, OP, V), this patent defines the unit privacy protection operation: the privacy budget is divided into d equal parts, and each round applies privacy protection to the probability distribution of only one selected attribute node. For the attribute X_i on which a privacy operation is to be performed, four privacy protection operations are realized: k-anonymity, l-diversity, t-closeness, and attribute-value generalization.
Step 4, map the data fusion with multiple privacy constraints into a hypergraph, design corresponding heuristic rules, and reduce the data fusion process to a hypergraph resolution process, so that the availability of the data is improved while the privacy constraints are met.
Further, in step 1, a data fusion model and a privacy attack model are constructed under an adversarial learning architecture. The specific steps are as follows:
Step 1.1, construct the data fusion model:
Data fusion organically integrates data belonging to multiple sources, aiming to mine useful information better through a more complete data set than before, so as to provide high-quality services for users. For ease of discussion, a formal description of the data is first given: a data set can be represented as a quadruple D(X, A, F, V), where X = {x_1, x_2, …, x_n} is the set of data records and each record x_i is uniquely associated with one dedicated user u_i; A is the attribute set; further, the attributes are divided according to their sensitivity into an information attribute set IA and a sensitive attribute set SA, with IA ∪ SA = A and IA ∩ SA = ∅; F is the set of relations between X and A; and V = ∪_{a_k ∈ A} V_{a_k}, where V_{a_k} is the value range of the attribute a_k.
Definition 1 (equivalence class): given a data set D(X, A, F, V), for any A′ ⊆ A, if there exist t records {x_1, x_2, …, x_t} (t ≥ 1) such that F(x_i, a) = F(x_j, a) for every a ∈ A′ and all 1 ≤ i, j ≤ t, then {x_1, x_2, …, x_t} is an equivalence class on D with respect to A′, denoted [x_i]_{A′}; accordingly, the set E_{A′} of all equivalence classes formed by the attribute set A′ constitutes a partition of D, denoted D/E_{A′}. In particular, if A′ ⊆ IA, the corresponding equivalence class is called an information equivalence class.
Definition 2 (data fusion): given m data sets {D_1, …, D_m}, the fused data set D(X, {IA, SA}, F, V) satisfies:

X = ∪_{i=1..m} X_i,  IA = ∪_{i=1..m} IA_i,  SA = ∪_{i=1..m} SA_i,  F = ∪_{i=1..m} F_i,  V = ∪_{i=1..m} V_i

In particular, if two data sets to be fused, D_i and D_j, satisfy IA_i Δ IA_j ≠ ∅ (Δ denotes the symmetric-difference operator), the fusion is called information-increment fusion; if there exists a record x_k ∈ X_i ∩ X_j whose information attribute values in D_i and D_j are given at different granularities of the same attributes, the fusion is called information-refinement fusion; if every record x_k ∈ X_i ∩ X_j satisfies F_i(x_k, SA_i) = F_j(x_k, SA_j) (where SA_i = SA_j), the fusion is called harmonious fusion. The research scope of this patent is harmonious information-increment and information-refinement fusion.
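As a concrete illustration of Definition 1, the following Python sketch partitions a toy table into equivalence classes; the dict-based record layout and the sample values are our assumptions, not patent data.

```python
from collections import defaultdict

def equivalence_classes(records, attrs):
    """Partition `records` into equivalence classes over the attribute
    subset attrs (the partition D/E_A' in the patent's notation)."""
    classes = defaultdict(list)
    for x in records:
        key = tuple(x[a] for a in attrs)   # the values F(x, a) for a in A'
        classes[key].append(x)
    return list(classes.values())

D = [
    {"zip": "476**", "age": "2*", "disease": "flu"},
    {"zip": "476**", "age": "2*", "disease": "cancer"},
    {"zip": "479**", "age": "3*", "disease": "flu"},
]
# information equivalence classes: A' = {zip, age} is a subset of IA
for c in equivalence_classes(D, ["zip", "age"]):
    print(len(c), c)
```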
Step 1.2, construct the privacy and privacy attack models:
Privacy here refers to the injection (one-to-one mapping) from a user to the corresponding sensitive attribute value; if this injective relationship is revealed, the user's privacy is revealed. According to the data model, users and data records are in one-to-one correspondence, and from the data aspect each user corresponds to a group of information attribute values, i.e., there is an injection between the user and the information attribute values, and the information equivalence class containing this group of information attribute values in turn corresponds to a set of privacy attribute values. By the transitivity of injections, if the group of information attribute values and the corresponding privacy attribute value set form an injection (that is, the privacy attribute value set contains only one element), then releasing the data set discloses that user's privacy. More generally:
Definition 3 (data privacy disclosure): given a data set D(X, {IA, SA}, F, V), for any x_i ∈ X let [x_i]_{IA} be the information equivalence class to which it belongs and, for the sensitive attribute set SA, denote the corresponding privacy attribute value set by V_{SA}([x_i]_{IA}). If |V_{SA}([x_i]_{IA})| = 1, the data privacy is said to be disclosed.
Definition 4 (knowledge-based attack): suppose the adversary knows the target user u_i's information attribute values F(x_i, IA) and knows that the user's data record x_i is in the data set D(X, {IA, SA}, F, V) to be published. When the data is published, the adversary can build the following chain of relationships:

u_i → F(x_i, IA) → [x_i]_{IA} → V_{SA}([x_i]_{IA})

and from it form a privacy inference probability, i.e., for any v_j ∈ V_{SA}([x_i]_{IA}), the probability that user u_i takes the value v_j on SA is

P(u_i, v_j) = C(x ∈ [x_i]_{IA} : F(x, SA) = v_j) / C([x_i]_{IA})

(where C(·) is a counting statistical function over the indicated domain of discourse).
Definition 5 (multi-version attack under incremental data release): given a first published data set D(X, {IA, SA}, F, V) and the corresponding updated data set D′(X′, {IA′, SA′}, F′, V′) published later, suppose the adversary compares the records x_i and x_i′ of a dedicated user u_i in the two versions; the adversary can construct the following relationship:

V_{SA}([x_i]_{IA}) ∩ V_{SA′}([x_i′]_{IA′})

and form a privacy inference probability, i.e., for any v_j in the intersection of the two privacy value sets, the adversary can infer the user's privacy with probability

P(u_i, v_j) = C(SEL(v_j)) / C(V_{SA}([x_i]_{IA}) ∩ V_{SA′}([x_i′]_{IA′}))

(where SEL(·) is a selection function).
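Definitions 3 and 4 can be checked mechanically on top of the `equivalence_classes` sketch above (reusing its toy table `D`); the helper names and the "disease" sensitive attribute are illustrative assumptions.

```python
def privacy_values(eq_class, sa="disease"):
    """V_SA([x]_IA): the sensitive values appearing inside one class."""
    return {x[sa] for x in eq_class}

def discloses_privacy(eq_class, sa="disease"):
    """Definition 3: disclosure when the sensitive-value set is a singleton."""
    return len(privacy_values(eq_class, sa)) == 1

def inference_probability(eq_class, value, sa="disease"):
    """Definition 4: P(SA = value) = C(value) / C(equivalence class)."""
    return sum(1 for x in eq_class if x[sa] == value) / len(eq_class)

for c in equivalence_classes(D, ["zip", "age"]):
    probs = {v: inference_probability(c, v) for v in privacy_values(c)}
    print(discloses_privacy(c), probs)
```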
Further, the vertical encoding in step 2 comprises two stages: a Bayesian network structure learning stage and a network encoding stage.
The specific steps of the Bayesian network structure learning stage are as follows:
Step 2.1: consider learning a Bayesian network structure from a data set D = {D_1, …, D_n} over a set of m random variables X = {X_1, …, X_m}. It is assumed that the variables are categorical (i.e., the number of states of each variable is finite) and that the data set is complete. The goal of the Bayesian network construction algorithm is, by defining a parent set Π_1, …, Π_m for each variable over the node set X, to find the highest-scoring directed acyclic graph (DAG) G. Assuming the Markov condition, a joint probability distribution is induced in which each variable is conditionally independent of its non-descendant variables given its parents.
Step 2.2: different scoring functions can be used to evaluate the quality of the generated DAG; in this patent we use the Bayesian Information Criterion (BIC) score, which is proportional to the posterior probability of the DAG. BIC is decomposable, consisting of the sum of the scores of each variable and its parent set:

BIC(G) = Σ_{i=1..m} ( LL(X_i | Π_i) − Pen(X_i | Π_i) )

where LL(X_i | Π_i) is the log-likelihood function of X_i with its parent set Π_i:

LL(X_i | Π_i) = Σ_{π ∈ |Π_i|} Σ_{x ∈ |X_i|} N_{x,π} · log θ̂_{x|π}

and Pen(X_i | Π_i) is the complexity penalty function of X_i with its parent set Π_i:

Pen(X_i | Π_i) = (log N / 2) · |Π_i| · (|X_i| − 1)

Here θ̂_{x|π} is the maximum-likelihood estimate of the conditional probability P(X_i = x | Π_i = π), N_{x,π} denotes the number of times (X_i = x, Π_i = π) occurs in the data set, and |·| denotes the size of the Cartesian product of the value spaces of the given variables.
This patent uses the hill-climbing method to generate the Bayesian network of the corresponding data; the main steps are shown in Algorithm 1.
Algorithm 1: Bayesian network structure generation based on the hill-climbing method
[Algorithm 1 is rendered as an image in the original publication.]
It should be noted that the "flip edge" operation cannot simply be regarded as the sequence "delete an edge, then add the edge in the opposite direction". Because the algorithm adopts a greedy strategy, the edge-deletion step alone may reduce the BIC score of the Bayesian network and terminate the program early, so the corresponding addition of the reversed edge would never be carried out.
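Since Algorithm 1 itself is only available as an image, the following is a minimal hill-climbing sketch over the `bic` score above, using the usual add/delete/flip neighbourhood; the acyclicity check and move ordering are simplifying assumptions, not the patent's exact Algorithm 1.

```python
import itertools

def is_acyclic(structure):
    """structure: dict node -> set of parents. DFS-based cycle detection."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in structure}
    def visit(v):
        color[v] = GRAY
        for p in structure[v]:
            if color[p] == GRAY or (color[p] == WHITE and not visit(p)):
                return False
        color[v] = BLACK
        return True
    return all(visit(v) for v in structure if color[v] == WHITE)

def hill_climb(data, variables, score):
    structure = {v: set() for v in variables}
    best = score(data, structure)
    improved = True
    while improved:
        improved = False
        for u, v in itertools.permutations(variables, 2):
            for move in ("add", "delete", "flip"):
                trial = {k: set(ps) for k, ps in structure.items()}
                if move == "add" and u not in trial[v]:
                    trial[v].add(u)
                elif move == "delete" and u in trial[v]:
                    trial[v].remove(u)
                elif move == "flip" and u in trial[v]:
                    trial[v].remove(u); trial[u].add(v)
                else:
                    continue
                if is_acyclic(trial):
                    s = score(data, trial)
                    if s > best:                 # greedy: keep any improving move
                        structure, best, improved = trial, s, True
    return structure
```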
The specific steps of the Bayesian network encoding stage are as follows:
step 2.3, a hierarchical structure diagram is constructed by longitudinally encoding a Bayesian network, wherein the hierarchical structure diagram comprises two stages: a bottom-up encoding phase and a top-down modification phase. Specifically, given a bayesian network, it can be converted into a hierarchical structure diagram by encoding, and algorithm 2 is a bayesian network encoding process:
algorithm 2 Bayesian network vertical coding
Figure BDA0002886363110000072
Figure BDA0002886363110000081
1) Bottom-up encoding phase. First, the level of every node is initialized to zero; the algorithm then keeps marking from the leaf nodes and progressively traces the corresponding parent nodes. In each round, when the level of a child node is q, the level of its parent node is marked q + 1. For non-leaf nodes, only the current maximum code is recorded: if a node's code is not 0, the new code is compared with the existing one and the larger is kept; if the new code equals the existing one, the upward backtracking from that node stops; the backtracking also stops once the leaf-node queue is empty. The next leaf node is then extracted for marking, until the leaf-node sequence is empty.
2) Top-down correction phase. All nodes are first sorted by level from largest to smallest, and all node codes are initialized as unmarked. The algorithm then extracts the unmarked node with the largest level (i.e., the largest code) in the node sequence and, taking it as the starting point, traverses the graph downward breadth-first, level by level. In each round, when the level of the parent node is q, the level of the child node is marked q − 1. Denoting by q_old the current level of a node and by q_new the newly derived level for it, two cases are considered: (a) when q_old < q_new, the algorithm sets the node's level to q_new and marks the node; (b) when q_old = q_new and the node is already marked, the downward traversal from that node terminates early. The next unmarked node is then extracted, until no unmarked nodes remain in the sequence.
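A hedged Python sketch of the two phases just described; the queue handling and tie-breaking details of Algorithm 2, which is an image in the original, are our assumptions.

```python
from collections import deque

def vertical_encode(parents):
    """parents: dict node -> set of parent nodes in the Bayesian network."""
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)
    level = {v: 0 for v in parents}

    # 1) bottom-up encoding: trace from each leaf toward the roots
    leaves = deque(v for v, cs in children.items() if not cs)
    while leaves:
        queue = deque([leaves.popleft()])
        while queue:
            v = queue.popleft()
            for p in parents[v]:
                if level[v] + 1 > level[p]:      # keep the larger code
                    level[p] = level[v] + 1
                    queue.append(p)              # otherwise backtracking stops

    # 2) top-down correction: start from the highest unmarked node
    marked = set()
    for start in sorted(parents, key=lambda v: -level[v]):
        if start in marked:
            continue
        queue = deque([start])
        marked.add(start)
        while queue:
            v = queue.popleft()
            for c in children[v]:
                q_new = level[v] - 1
                if level[c] < q_new:             # case (a): raise and mark
                    level[c] = q_new
                    marked.add(c)
                    queue.append(c)
                elif level[c] == q_new and c in marked:
                    continue                     # case (b): terminate early
                else:
                    marked.add(c)
                    queue.append(c)
    return level
```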
Further, step 3 realizes the four privacy protection operations k-anonymity, l-diversity, t-closeness, and attribute-value generalization; the specific steps are as follows:
Step 3.1, realize k-anonymity: according to the value-range setting of the attribute X_i given by a domain expert or the data owner, extend the value-domain space of X_i in the Bayesian network so that the number of distinct values in its value-domain space is greater than or equal to k. During correction, following the correction principle of information-entropy maximization, the parent node of the privacy node assigns to the privacy node the attribute value with the largest probability-distribution value among its child nodes, so that the privacy node satisfies the k requirement;
Step 3.2, realize l-diversity: likewise, according to the data owner's value-range setting for the attribute X_i, extend the value-domain space of X_i in the Bayesian network so that the number of distinct values in its value-domain space is greater than or equal to l. The attribute X_i is corrected according to the principle of information-entropy maximization: in each round of correction, only the single value with the largest probability distribution is selected as the target to be corrected, and the probability mass above the mean is evenly distributed to the newly added attribute values;
Step 3.3, realize t-closeness: define the value distribution that maximizes the information entropy over the value-domain space of the attribute X_i as the theoretical standard, measure with the variance, and correct the probability distribution of each value of X_i so that the variance between each value's occurrence probability and the theoretical standard is not higher than t;
Step 3.4, realize attribute-value generalization: according to the attribute-value hierarchy tree set by a domain expert or the data owner, fuse the probability distributions of similar values in the value domain of the attribute X_i: the attribute leaf nodes to be anonymously protected and all of their sibling leaf nodes are aggregated into one attribute node and replaced by their immediate parent node, whose attribute-value probability distribution is inherited from all the original leaf nodes participating in the aggregation.
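As one example of a unit privacy protection operation, the following sketch applies the step-3.2 l-diversity correction to a single attribute node's probability distribution; the numeric details (round count, placeholder value names) are our reading of the description, not patent text.

```python
def l_diversity_correct(dist, l, rounds=10):
    """dist: dict value -> probability over one attribute's value domain."""
    dist = dict(dist)
    fresh = 0
    while len(dist) < l:                     # extend the value-domain space
        dist[f"v_new{fresh}"] = 0.0
        fresh += 1
    for _ in range(rounds):
        mean = 1.0 / len(dist)
        target = max(dist, key=dist.get)     # one target value per round
        excess = dist[target] - mean
        if excess <= 1e-9:                   # near-uniform: entropy is maximal
            break
        receivers = [v for v in dist if v != target and str(v).startswith("v_new")] \
                    or [v for v in dist if v != target]
        dist[target] = mean
        for v in receivers:                  # spread the above-mean mass evenly
            dist[v] += excess / len(receivers)
    return dist

print(l_diversity_correct({"flu": 0.7, "cancer": 0.3}, l=3))
```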
By converting the data sets under different privacy protection strategies into Bayesian networks and encoding them, the secondary anonymization process is realized and the re-anonymity framework for multi-party data fusion is completed. However, in the multi-source data fusion process, the combination scheme of the multiple privacy protection policies still needs to be optimized, so that the availability of the fused data is maximized while the privacy constraints of all parties are met.
Further, in step 4, heuristic rules are adopted to turn the data fusion process into a hypergraph resolution process. For a hyper-edge N that intersects other hyper-edges, PROG(HG) is:

FOR each hyper-edge M intersecting hyper-edge N DO
    eliminate the probability-independent tuples in R(M) bottom-up
ENDFOR;
PROG(HG_1), PROG(HG_2), …, PROG(HG_k);
RESULT(HG) := RESULT(HG_1) × RESULT(HG_2) × … × RESULT(HG_k)

The hypergraph resolution algorithm recursively calls the three heuristic rules (rules 1–3, detailed in the specific embodiment below); hyper-edges are selected, solved, and eliminated from HG one by one, and the program PROG(HG) producing RESULT(HG) is constructed, so the process of resolving the hypergraph is also the process of realizing the privacy constraints one by one. The hypergraph resolution heuristic is Algorithm 3:
algorithm 3 hypercritical resolution heuristic algorithm
Figure BDA0002886363110000101
Taking the two aforementioned privacy-protection-policy operations F_1(A, B, D) and F_2(D, E, G, H) and their connected hypergraph as an example, we show how to construct the program PROG(HG) with the heuristic algorithm and produce the result RESULT(HG):

(1) Resolve the hyper-edges {A, B, D} and {D, E, G, H}; the result hypergraph is HG_1 (i.e., {B, D}, {D, G}, {A}, {E}, {H}), and according to resolution rule 3 a PROG(HG) program is obtained:

PROG(HG_1);
RESULT(HG) := RESULT(HG_1)

(2) Let HG_2 = ({A}, {E}, {H}) and HG_3 = ({B, D}, {D, G}); according to resolution rule 2, the PROG(HG_1) program is obtained:

PROG(HG_2), PROG(HG_3);
RESULT(HG_1) := RESULT(HG_2) × RESULT(HG_3)

Because HG_2 comprises three mutually independent hyper-edges, PROG(HG_2) is

RESULT(HG_2) := R({A}, {E}, {H})

(3) Continue computing PROG(HG_3) for HG_3: resolving the hyper-edges {B, D} and {D, G} gives the result hypergraph HG_4, and according to resolution rule 3 the PROG(HG_3) program is generated:

PROG(HG_4);
RESULT(HG_3) := RESULT(HG_4)

(4) Since HG_4 contains only one hyper-edge, it follows from rule 1 that PROG(HG_4) is

RESULT(HG_4) := R({D, G})

The final program can therefore be written as:

RESULT(HG_4) := R({D, G});
RESULT(HG_3) := RESULT(HG_4);
RESULT(HG_2) := R({A}, {E}, {H});
RESULT(HG_1) := RESULT(HG_2) × RESULT(HG_3);
RESULT(HG) := RESULT(HG_1)
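The final program can be exercised end to end with toy stand-in relations, as in the straight-line Python rendering below; the relation contents are invented for illustration, and a real implementation would eliminate probability-independent tuples bottom-up rather than enumerate Cartesian products.

```python
from itertools import product

def R(*tuples):
    """Stand-in for the relation attached to a hyper-edge: a set of tuples."""
    return set(tuples)

def cross(*results):
    """The Cartesian-product composition used by rule 2 (RESULT x RESULT)."""
    return {sum(t, ()) for t in product(*results)}

RESULT_HG4 = R(("d1", "g1"), ("d2", "g2"))                 # R({D, G})
RESULT_HG3 = RESULT_HG4                                    # rule 3: {B, D} resolved away
RESULT_HG2 = cross(R(("a1",)), R(("e1",)), R(("h1",)))     # R({A}, {E}, {H})
RESULT_HG1 = cross(RESULT_HG2, RESULT_HG3)                 # rule 2
RESULT_HG  = RESULT_HG1                                    # rule 3 on {A,B,D}, {D,E,G,H}
print(sorted(RESULT_HG))
```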
of course, it is not necessary for any one product that embodies the invention to achieve all of the above advantages simultaneously.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a comparison case analysis of results of execution sequences of different privacy protection policies;
FIG. 2 is a system model for multi-source data fusion;
FIG. 3 is a hypergraph HG;
FIG. 4 shows the comparison results with and without re-anonymization;
FIG. 5 is a comparison of a naive algorithm and an optimized algorithm;
FIG. 6 is a graph of privacy attribute probabilities for discriminators and generators in different equivalence classes.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A multi-source data fusion privacy protection method based on multi-privacy policy combination optimization comprises the following steps:
step 1, constructing a data multi-source fusion system model: firstly, data from each party is collected by a data owner in a system model, and in order to prevent privacy disclosure, data anonymization operation is carried out by each party; secondly, as the data volume of some entities is huge, the data must be stored in a public cloud, the data fusion of the public cloud is to organically integrate the multisource cross-platform data, and aims to better mine useful information by fusing complete data sets of all parties, and if the data fusion is simply carried out, the privacy snooping concern of the public cloud after the data fusion cannot be eliminated, so that the public cloud also needs to carry out heavy anonymity operation; in addition, the user can enjoy the convenience of big data by customizing the required services, however, unknown attackers may also be hidden in the user, so it is assumed here that the user is also "curious", i.e., the user is treated as a suspected privacy-mining group with the same attack capability as the cloud service provider. The method comprises the following specific steps:
and 1.1, constructing a data fusion model. The data fusion is to organically integrate data belonging to multiple sources, and aims to better mine useful information through a more complete data set than before so as to provide high-quality service for users. The data set may be represented as a quadruple D (X, a, F, V), where X ═ X1,x2,…,xnIs a set of data records, each item of data xiAre all exclusively associated with one dedicated user ui(ii) a A is an attribute set; further, the attributes are divided into an information attribute set IA and a sensitive attribute set SA according to their sensitivity, and IA ═ SA ═ a,
Figure BDA0002886363110000121
f is a set of relationships between X and A
Figure BDA0002886363110000122
Figure BDA0002886363110000123
Is attribute akThe value range of (2).
Step 1.2, construct the privacy and privacy attack models. Privacy here refers to the injection (one-to-one mapping) from a user to the corresponding sensitive attribute value; if this injective relationship is revealed, the user's privacy is revealed. According to the data model, users and data records are in one-to-one correspondence, and from the data aspect each user corresponds to a group of information attribute values, i.e., there is an injection between the user and the information attribute values, and the information equivalence class containing this group of information attribute values in turn corresponds to a set of privacy attribute values. By the transitivity of injections, if the group of information attribute values and the corresponding privacy attribute value set form an injection (that is, the privacy attribute value set contains only one element), then releasing the data set discloses that user's privacy.
Step 2, design the multiple data fusion anonymity framework: aiming at the frequent cross-platform communication and sharing of big data information, this patent proposes a re-anonymity framework based on multi-party data fusion, comprising an initial state, a handshake process, data synchronization, and secondary anonymity. First, in the initial state, each party performs the corresponding anonymization operation on its own data according to its own privacy protection requirements; second, the handshake process carries out multi-party communication, in which each party publishes its data privacy protection requirements; third, during data synchronization, the distributions of the public attribute values of the multi-party data need to be made consistent while the privacy protection requirements of all parties are still met; finally, secondary anonymity is the key step of the re-anonymity framework: the data set is converted into a Bayesian network, a hierarchical structure diagram is constructed by encoding the Bayesian network, and the privacy protection problem is finally converted into a probabilistic reasoning problem, the specific process comprising network structure learning and network encoding.
Step 2.1: consider learning a Bayesian network structure from a data set D = {D_1, …, D_n} over a set of m random variables X = {X_1, …, X_m}. It is assumed that the variables are categorical (i.e., the number of states of each variable is finite) and that the data set is complete. The goal of the Bayesian network construction algorithm is, by defining a parent set Π_1, …, Π_m for each variable over the node set X, to find the highest-scoring directed acyclic graph (DAG) G. Assuming the Markov condition, a joint probability distribution is induced in which each variable is conditionally independent of its non-descendant variables given its parents.
Step 2.2: different scoring functions can be used to evaluate the quality of the generated DAG; in this patent we use the Bayesian Information Criterion (BIC) score, which is proportional to the posterior probability of the DAG. BIC is decomposable, consisting of the sum of the scores of each variable and its parent set:

BIC(G) = Σ_{i=1..m} ( LL(X_i | Π_i) − Pen(X_i | Π_i) )

where LL(X_i | Π_i) is the log-likelihood function of X_i with its parent set Π_i:

LL(X_i | Π_i) = Σ_{π ∈ |Π_i|} Σ_{x ∈ |X_i|} N_{x,π} · log θ̂_{x|π}

and Pen(X_i | Π_i) is the complexity penalty function of X_i with its parent set Π_i:

Pen(X_i | Π_i) = (log N / 2) · |Π_i| · (|X_i| − 1)

Here θ̂_{x|π} is the maximum-likelihood estimate of the conditional probability P(X_i = x | Π_i = π), N_{x,π} denotes the number of times (X_i = x, Π_i = π) occurs in the data set, and |·| denotes the size of the Cartesian product of the value spaces of the given variables.
The Bayesian network encoding constructs a hierarchical structure diagram by vertically encoding the Bayesian network, in two phases: a bottom-up encoding phase and a top-down correction phase. In particular, given a Bayesian network, it can be converted into a hierarchical structure diagram by encoding.
Step 2.3, bottom-up encoding phase. First, the level of every node is initialized to zero; the algorithm then keeps marking from the leaf nodes and progressively traces the corresponding parent nodes. In each round, when the level of a child node is q, the level of its parent node is marked q + 1. For non-leaf nodes, only the current maximum code is recorded: if a node's code is not 0, the new code is compared with the existing one and the larger is kept; if the new code equals the existing one, the upward backtracking from that node stops; the backtracking also stops once the leaf-node queue is empty. The next leaf node is then extracted for marking, until the leaf-node sequence is empty.
Step 2.4, top-down correction phase. All nodes are first sorted by level from largest to smallest, and all node codes are initialized as unmarked. The algorithm then extracts the unmarked node with the largest level (i.e., the largest code) in the node sequence and, taking it as the starting point, traverses the graph downward breadth-first, level by level. In each round, when the level of the parent node is q, the level of the child node is marked q − 1. Denoting by q_old the current level of a node and by q_new the newly derived level for it, two cases are considered: (a) when q_old < q_new, the algorithm sets the node's level to q_new and marks the node; (b) when q_old = q_new and the node is already marked, the downward traversal from that node terminates early. The next unmarked node is then extracted, until no unmarked nodes remain in the sequence.
Step 3, realize the privacy protection policy: given the privacy protection policy, the Bayesian network G is finally formed through the two processes above, and operations on the attributes X_1, …, X_d are then required so that the privacy node X_s meets the policy requirements. Specifically, for a privacy protection policy F = (G, IA, SA, OP, V), the unit privacy protection operation is defined: the privacy budget is divided into d equal parts, and each round applies privacy protection to the probability distribution of only one selected attribute node. For the attribute X_i on which a privacy operation is to be performed, four privacy protection operations are realized: k-anonymity, l-diversity, t-closeness, and attribute-value generalization.
Step 3.1, realize k-anonymity: according to the value-range setting of the attribute X_i given by a domain expert or the data owner, extend the value-domain space of X_i in the Bayesian network so that the number of distinct values in its value-domain space is greater than or equal to k. During correction, following the correction principle of information-entropy maximization, the parent node of the privacy node assigns to the privacy node the attribute value with the largest probability-distribution value among its child nodes, so that the privacy node satisfies the k requirement;
Step 3.2, realize l-diversity: likewise, according to the data owner's value-range setting for the attribute X_i, extend the value-domain space of X_i in the Bayesian network so that the number of distinct values in its value-domain space is greater than or equal to l. The attribute X_i is corrected according to the principle of information-entropy maximization: in each round of correction, only the single value with the largest probability distribution is selected as the target to be corrected, and the probability mass above the mean is evenly distributed to the newly added attribute values;
Step 3.3, realize t-closeness: define the value distribution that maximizes the information entropy over the value-domain space of the attribute X_i as the theoretical standard, measure with the variance, and correct the probability distribution of each value of X_i so that the variance between each value's occurrence probability and the theoretical standard is not higher than t;
Step 3.4, realize attribute-value generalization: according to the attribute-value hierarchy tree set by a domain expert or the data owner, fuse the probability distributions of similar values in the value domain of the attribute X_i: the attribute leaf nodes to be anonymously protected and all of their sibling leaf nodes are aggregated into one attribute node and replaced by their immediate parent node, whose attribute-value probability distribution is inherited from all the original leaf nodes participating in the aggregation.
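As a companion example for step 3.4, the following sketch performs attribute-value generalization against a small hierarchy tree; the tree shape and place names are illustrative assumptions.

```python
hierarchy = {
    "Europe": ["Munich", "Paris"],      # parent node -> child leaves
    "Asia":   ["Tokyo", "Seoul"],
}

def generalize(dist, leaf):
    """dist: dict leaf value -> probability. Replace `leaf` and all of its
    sibling leaves by their immediate parent, which inherits their mass."""
    parent = next(p for p, kids in hierarchy.items() if leaf in kids)
    siblings = hierarchy[parent]
    merged = {v: p for v, p in dist.items() if v not in siblings}
    merged[parent] = sum(dist.get(v, 0.0) for v in siblings)
    return merged

dist = {"Munich": 0.4, "Paris": 0.1, "Tokyo": 0.3, "Seoul": 0.2}
print(generalize(dist, "Munich"))   # {'Tokyo': 0.3, 'Seoul': 0.2, 'Europe': 0.5}
```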
By converting the data sets under different privacy protection strategies into Bayesian networks and encoding them, the secondary anonymization process is realized and the re-anonymity framework for multi-party data fusion is completed. However, in the multi-source data fusion process, the combination scheme of the multiple privacy protection policies still needs to be optimized, so that the availability of the fused data is maximized while the privacy constraints of all parties are met.
Step 4, map the data fusion with multiple privacy constraints into a hypergraph, design corresponding heuristic rules, and reduce the data fusion process to a hypergraph resolution process, so that the availability of the data is improved while the privacy constraints are met. The specific steps are as follows:
step 4.1, formally defining the privacy protection policy as a five-tuple F ═ (G, IA, SA, OP, V), wherein G represents a bayesian network converted from the data set; IA denotes an information attribute node, IA ═ a1,a2,…,am),a1,a2,…,amAre not independent of each other, and have a probability dependence relationship; SA denotes a privacy node, OP denotes a certain operation step, and OP ═ OP (OP)1,OP2,…,OPm). V denotes a value range after the operation OP,
Figure BDA0002886363110000158
judging the execution sequence of different privacy protection strategies from the data layer surface and the structural layer surface:
1) if amCan be formed by1,a2,…,anIs shown to be
Figure BDA0002886363110000159
That is OPmPost-execution and vice versa;
from the structural level:
2) starting from the privacy node of the Bayesian network, the Bayesian network is encoded through a bottom-up encoding stage and a top-down modification stage, and the operation on the privacy attribute is compared through modifying the privacy node SA within the maximum modification threshold value
Figure BDA0002886363110000161
And
Figure BDA0002886363110000162
achieving the required privacy protection if OPiLess influence on the data structure, then OPiTo achieve the required performance ratio OPjHigh, i.e.
Figure BDA00028863631100001611
OPiComparison OPjFirst, and vice versaVice versa;
3) if multiple operations on information attributes are involved
Figure BDA0002886363110000163
Then the following two cases are distinguished: firstly, if
Figure BDA00028863631100001612
Calculating value range of each operation respectively through probabilistic reasoning relationship between IA
Figure BDA0002886363110000164
If IAj
Figure BDA0002886363110000165
Then OPiComparison OPj、OPkIs executed first if
Figure BDA0002886363110000166
Then OPkComparison OPj、OPiFirstly, executing; second when
Figure BDA00028863631100001613
Figure BDA00028863631100001614
Then OPkComparison OPj、OPiIs performed first, for OPj、OPiSequence if in operation
Figure BDA0002886363110000167
And
Figure BDA0002886363110000168
in (1),
Figure BDA0002886363110000169
then
Figure BDA00028863631100001610
Will affect OPjThen, then
Figure BDA00028863631100001615
OPiComparison OPjFirst, and vice versa.
For example, let the two privacy-protection-policy operations be decomposed as F_1(A, B, D) and F_2(D, E, G, H), represented by the hyper-edges {A, B, D} and {D, E, G, H} respectively, where B and D are two steps within one of F_1's operations, represented by the conditional hyper-edge {B, D}, and D and G are two steps within one of F_2's operations, represented by the conditional hyper-edge {D, G}; since these are not mutually independent, an intersection exists between them. A, E, and H are three independent operations, represented by the hyper-edges {A}, {E}, and {H} respectively. From these hyper-edge relations, the connected hypergraph HG can be obtained, as shown in FIG. 3:
step 4.2, by judging the execution sequence of different privacy protection strategies F, generating the following heuristic rules of ultra-edge resolution and PROG (HG):
rule 1. if hypergraph HG contains only one hyper-edge N, which can be resolved directly, prog (HG) contains only result (HG): r (N);
rule 2. if the hypergraph HG is k disjoint hypergraphs HG1、HG2……HGkIf they can be executed in parallel, prog (hg) is:
PROG(HG1),PROG(HG2),……,PROG(HGk);
RESULT(HG):=RESULT(HG1)×RESULT(HG2)×……×RESULT(HGk)
rule 3. given a privacy node SA and its vertical code X, known from the properties of the Bayesian networkSAL, in all the chain set Links taking the privacy nodes as chain tail nodes, let XiAnd XjIs any two nodes in Links that are not SA, if Xi.L<XjL, then X is corrected at the same privacy preserving granularityiThe probability distribution has less influence on the availability of the global data, so a lower proximity principle is formed, namely the closer the modified attribute nodes are to the privacy attribute, the more targeted the modification is. In other words, if the hypergraph HG is composed of k connected components HG1、HG2……HGkAnd if the HG is in the set, judging the probability dependence of each super edge on the privacy nodeiCompared with HGjFurther down to the privacy node, then
Figure BDA0002886363110000172
I.e. HGiDigestion is performed first and vice versa.
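A minimal sketch of rule 3's lower-proximity ordering, assuming the vertical codes X.L from step 2 are available as a dict; the distance measure is our simplification, not the patent's exact criterion.

```python
def resolution_order(components, level, sa):
    """components: list of sets of attribute nodes; level: node -> X.L code."""
    def proximity(comp):
        # smallest level gap between a component node and the privacy node
        return min(abs(level[x] - level[sa]) for x in comp)
    return sorted(components, key=proximity)   # closest to SA is resolved first

level = {"SA": 0, "A": 3, "B": 1, "D": 1, "E": 2, "G": 2, "H": 3}
print(resolution_order([{"A", "B"}, {"D"}, {"E", "G", "H"}], level, "SA"))
```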
In this embodiment, the correctness and effectiveness of the privacy protection model proposed in this patent are verified through experimental simulation. The architecture is implemented in the Python language; the hardware environment is an Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz (1.19GHz) processor with 16 GB of memory, and the operating system is Windows 10.
In the first group of experiments, in order to highlight the superiority of the re-anonymity handshake protocol, our experiments compare using the re-anonymity handshake protocol against not using it. First, this patent generates a data set by means of a Bayesian network, and when the data set is generated all parties are anonymized for the first time. Second, we observed experimentally that the amount of data has a certain influence on the privacy disclosure probability, so the data amount is used as the independent variable of the test and the privacy disclosure probability as the dependent variable; the experimental results are shown in FIG. 4. As can be seen from the figure, with an extremely small amount of data the privacy disclosure probabilities with and without re-anonymization are basically similar; as the amount of experimental data increases, the re-anonymization method obviously reduces the privacy disclosure probability, which falls below 20% when the data amount reaches 100,000. On the contrary, without the re-anonymization method, the privacy disclosure probability increases with the amount of experimental data, reaching as high as 80% at 100,000 records.
In order to verify that the optimization algorithm proposed in this patent can greatly improve data availability, a comparison experiment is designed comparing a naive algorithm with the optimization algorithm of this patent. Data availability is denoted by Q in this experiment and is given by the following formula:

[The formula for Q is rendered as an image in the original publication.]

where a represents the raw data and b represents the noisy data; it can be observed from the formula that the more noise is added, the worse the data availability. The data amount is again used as the independent variable, taking the values 5,000, 10,000, 20,000, 40,000, 60,000, 80,000, and 100,000, and the resulting data availability is observed; the results are shown in FIG. 5. As can be seen from FIG. 5, with an extremely small amount of data, the naive algorithm and the optimization algorithm affect data availability roughly equally; when the data amount reaches 40,000, the data availability of the optimization algorithm is about 30% higher than that of the naive algorithm; and when the data amount reaches 100,000, the data availability of the naive algorithm is about 40% lower than that of the optimization algorithm. From this analysis, the data availability of the fusion algorithm optimized by this patent is far higher than that of the naive fusion algorithm.
In order to verify the usability of the method in incremental data fusion, this experiment borrows the idea of a generative adversarial network. First, a data set is generated from the constructed Bayesian network, and a discriminator and a generator sample it at different proportions: the discriminator samples 30% and the generator samples 15%. Each sampled data set is turned into a Bayesian network by the hill-climbing method, and each generated Bayesian network then regenerates a data set of the same size, 40,000 records. The KL divergence is used to measure the difference between the distributions of the privacy attribute within given equivalence classes of the two data sets; the calculation formula is:

KL(P‖Q) = Σ_x P(x) · log( P(x) / Q(x) )

The closer the KL divergence is to 0, the smaller the difference between the discriminator and the generator, and the better the experimental effect.
The privacy attribute probabilities of the discriminator and the generator in three different equivalence classes are selected for calculation; the probability distributions are shown in FIG. 6. The KL divergences are then calculated respectively, giving KL1 = 0.0042, KL2 = 0.0043, and KL3 = 0.0053. All three KL divergences are close to 0, so the difference between the discriminator and the generator is very small and the experimental effect is very good.
Through the analysis of the above three simulation experiments, the method proposed in this patent not only greatly improves the privacy protection effect of multi-source data fusion, but also greatly improves the availability of the data.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (5)

1. A multi-source data fusion privacy protection method based on multi-privacy policy combination optimization is characterized by comprising the following steps:
step 1, constructing a data multi-source fusion system model: firstly, data from each party is collected by a data owner in the system model, and to prevent privacy disclosure each party performs a data anonymization operation; secondly, since the data volume of some entities is huge, the data must be stored in a public cloud; data fusion in the public cloud organically integrates the multi-source, cross-platform data, with the aim of better mining useful information from the fused, complete data sets of all parties, but fusion alone cannot eliminate the concern that the public cloud will snoop on privacy after fusion, so the public cloud must also perform a re-anonymization operation; in addition, users enjoy the convenience of big data by customizing the services they need, yet unknown attackers may hide among them, so it is assumed here that the user is also "curious", i.e., the user and the cloud service provider are regarded as a suspected privacy-mining group with the same attack capability;
step 2, designing a multiple data fusion anonymity framework: aiming at the frequent cross-platform communication and sharing of big-data information, this patent provides a re-anonymization framework based on multi-party data fusion, comprising an initial state, a handshake process, data synchronization and secondary anonymization. Firstly, in the initial state each party performs the corresponding anonymization operation on its own data according to its own privacy protection requirements; secondly, the handshake process carries out multi-party communication, in which each party publishes its data privacy protection requirements; in the data synchronization process, the distributions of the public attribute values of the multi-party data are made consistent while the privacy protection requirements of all parties are met; secondary anonymization is the most critical step of the re-anonymization framework: the data set is converted into a Bayesian network, a hierarchical structure diagram is constructed by encoding the Bayesian network, and the privacy protection problem is finally converted into a probabilistic reasoning problem, the specific process comprising network structure learning and network encoding;
and step 3, realizing the privacy protection strategy: given the privacy protection policy, the Bayesian network G is finally formed through the two processes above, and the attribute to be protected must then be operated on so that the privacy node Xs meets the policy requirements; specifically, for a given privacy protection policy, this patent defines the unit privacy protection operation: the privacy budget is partitioned into d parts, and each round applies privacy protection only to the probability distribution of one selected attribute node; for the attribute awaiting the privacy operation, four privacy protection operations are realized: k-anonymity, l-diversity, t-closeness and attribute value generalization;
and step 4, mapping the fusion of data under multiple privacy constraints onto a hypergraph and designing corresponding heuristic rules, so that the data fusion process is normalized into a hypergraph resolution process and the availability of the data is improved while the privacy constraints are met.
2. The multi-source data fusion privacy protection method based on multi-privacy policy combination optimization according to claim 1, wherein a data fusion model and a privacy attack model are constructed in step 1 by adopting an adversarial learning architecture, with the following specific steps:
step 1.1, constructing a data fusion model:
the data fusion is to organically integrate data belonging to multiple sources, and aims to better mine useful information through a more complete data set than before so as to provide high-quality service for users. For ease of discussion, a formal description of the data is first given: a data set may be represented as a quadruple D (X, a, F, y), where X ═ X1,x2,...,xnIs a set of data records, each item of data xiAre all exclusively associated with one dedicated user ui(ii) a A is an attribute set; further, the attributes are divided into an information attribute set IA and a sensitive attribute set SA according to their sensitivity, and IA ═ SA ═ a,
Figure FDA0002886363100000021
f is a set of relationships between X and A
Figure FDA0002886363100000022
Figure FDA0002886363100000023
Is attribute akThe value range of (2).
Definition 1 (equivalence class): given a data set D(X, A, F, V) and any attribute subset A' ⊆ A, if there are t records {x1, x2, ..., xt} (t ≥ 1) that agree on every attribute in A', then {x1, x2, ..., xt} is called an equivalence class of A' on D, denoted [xi]A'; accordingly, the set EA' of all equivalence classes formed by the attribute set A' constitutes a partition of D, denoted D/EA'. In particular, if A' ⊆ IA, the corresponding equivalence class is called an information equivalence class.
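As an illustrative sketch of Definition 1 (not the patent's implementation), the following partitions a toy record set into equivalence classes over an attribute subset A'; the field names age, zip and disease are hypothetical:

```python
from collections import defaultdict

def partition(records, attrs):
    """Group records (dicts) into equivalence classes over the
    attribute subset A': records agreeing on every attribute in A'
    fall into the same class, yielding the partition D / E_A'."""
    classes = defaultdict(list)
    for rec in records:
        classes[tuple(rec[a] for a in attrs)].append(rec)
    return list(classes.values())

D = [
    {"age": "30-39", "zip": "0300*", "disease": "flu"},
    {"age": "30-39", "zip": "0300*", "disease": "cancer"},
    {"age": "40-49", "zip": "0301*", "disease": "flu"},
]
for eq in partition(D, ["age", "zip"]):  # information equivalence classes
    print(eq)
```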
Definition 2 (data fusion): given m data sets {D1, ..., Dm}, the fused data set D(X, {IA, SA}, F, V) satisfies: [fusion condition rendered as an equation image in the source]. In particular, if two data sets Di, Dj to be fused satisfy Xi △ Xj ≠ ∅ (△ representing the symmetric difference operator), the fusion is called information increment fusion; if there is a record xk ∈ Xi ∩ Xj whose information attribute values are refined between the two data sets, it is called information refinement fusion; and if every record xk ∈ Xi ∩ Xj takes the same sensitive attribute values in both data sets (where SAi = SAj), it is called coordinated (harmonious) fusion, and otherwise it is not coordinated. The research scope of this patent is coordinated information increment and refinement fusion;
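Under the reading above (an assumption, since the source renders the precise conditions as images), a small sketch that tests whether the fusion of two record sets is coordinated; the id field identifying shared records is hypothetical:

```python
def is_coordinated(d1, d2, sensitive_attr, key="id"):
    """Assumed reading of Definition 2: the fusion of d1 and d2 is
    coordinated when every shared record carries identical sensitive
    attribute values in both data sets."""
    by_key1 = {r[key]: r for r in d1}
    by_key2 = {r[key]: r for r in d2}
    shared = set(by_key1) & set(by_key2)
    return all(by_key1[k][sensitive_attr] == by_key2[k][sensitive_attr]
               for k in shared)

d1 = [{"id": 1, "disease": "flu"}, {"id": 2, "disease": "cancer"}]
d2 = [{"id": 2, "disease": "cancer"}, {"id": 3, "disease": "flu"}]
print(is_coordinated(d1, d2, "disease"))  # True: the shared record agrees
```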
step 1.2, constructing the privacy and privacy attack models:
Privacy here refers to the injective mapping between a user and the corresponding sensitive attribute value of the data; if this mapping is revealed, the user's privacy is leaked. According to the data model, users and data records are in one-to-one correspondence, and on the data side each user corresponds to a group of information attribute values, i.e., the user and this group of information attribute values form an injection, while the equivalence class containing the group of information attribute values in turn corresponds to a set of privacy attribute values. By the transitivity of injections, if the group of information attribute values and the corresponding privacy attribute value set also form an injection (that is, the privacy attribute value set contains only one element), then publishing the data set leaks the privacy of that user. More generally:
Definition 3 (data privacy disclosure): given a data set D(X, {IA, SA}, F, V), let [xi]IA denote the information equivalence class of a record xi, and let SA([xi]IA) denote its corresponding set of privacy attribute values; if |SA([xi]IA)| = 1, the data privacy is said to be disclosed.
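A sketch of the Definition 3 check, in the same toy setting as the partition sketch above (field names remain hypothetical); privacy leaks when an information equivalence class maps to exactly one sensitive value:

```python
from collections import defaultdict

def privacy_disclosed(records, info_attrs, sensitive_attr):
    """Definition 3: privacy leaks if some information equivalence
    class maps to exactly one sensitive value, |SA([x]_IA)| == 1."""
    classes = defaultdict(list)
    for rec in records:
        classes[tuple(rec[a] for a in info_attrs)].append(rec)
    return [eq for eq in classes.values()
            if len({rec[sensitive_attr] for rec in eq}) == 1]

# The lone 40-49/0301* record forms a singleton class, so its
# "flu" value would be disclosed on publication.
D = [
    {"age": "30-39", "zip": "0300*", "disease": "flu"},
    {"age": "30-39", "zip": "0300*", "disease": "cancer"},
    {"age": "40-49", "zip": "0301*", "disease": "flu"},
]
print(privacy_disclosed(D, ["age", "zip"], "disease"))
```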
Definition 4 (knowledge-based attack): suppose the adversary knows the information attribute values of the target user ui and knows that the user's data record xi is in the data set D(X, {IA, SA}, F, V) to be published; when the data is published, the adversary can build the chain of relations from ui, through its information attribute values, to the information equivalence class [xi]IA and its privacy attribute values, and from this form a privacy inference probability: for any value vj in the value range of SA, the probability that user ui takes the value vj on SA is C(x ∈ [xi]IA: x.SA = vj) / C([xi]IA) (where C(*) is a counting statistical function and * a domain-of-discourse qualifier).
Definition 5 (multi-version attack under incremental data publication): given a first published data set D(X, {IA, SA}, F, V) and a corresponding updated data set D'(X', {IA', SA'}, F', V') published later by the same publisher, suppose the adversary compares the records xi ∈ X and x'i ∈ X' of a dedicated user ui; the adversary can then relate the two information equivalence classes of ui across the versions, intersect the candidate privacy attribute value sets SEL([xi]IA) and SEL([x'i]IA'), and infer the privacy of ui with a correspondingly higher probability (SEL(·) is a selection function extracting the privacy attribute values of an equivalence class).
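An illustrative sketch of Definitions 4 and 5; the intersection-based reading of the multi-version attack is an assumption, since the source renders the exact inference formula as an image:

```python
def infer_probabilities(records, sensitive_attr, known_ia):
    """Definition 4: match the victim's known information attribute
    values to their equivalence class; each sensitive value's inference
    probability is its frequency inside that class, C(v) / C(class)."""
    eq = [r for r in records
          if all(r[a] == v for a, v in known_ia.items())]
    vals = [r[sensitive_attr] for r in eq]
    return {v: vals.count(v) / len(vals) for v in set(vals)}

def multi_version_attack(ver1, ver2, sensitive_attr, known_ia):
    """Definition 5 (assumed reading): intersect the candidate value
    sets of the two releases (the SEL step) and renormalize."""
    c1 = infer_probabilities(ver1, sensitive_attr, known_ia)
    c2 = infer_probabilities(ver2, sensitive_attr, known_ia)
    common = set(c1) & set(c2)
    total = sum(c1[v] for v in common)
    return {v: c1[v] / total for v in common}

v1 = [{"age": "30-39", "disease": d} for d in ("flu", "cancer", "flu")]
v2 = [{"age": "30-39", "disease": d} for d in ("flu", "flu")]
print(multi_version_attack(v1, v2, "disease", {"age": "30-39"}))
# {'flu': 1.0}: the update eliminated "cancer" from the candidate set
```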
3. The multi-source data fusion privacy protection method based on multi-privacy policy combination optimization according to claim 1, wherein the vertical encoding in step 2 includes two stages: a Bayesian network structure learning stage and a network coding stage;
the specific steps of the Bayesian network structure learning stage are as follows:
step 2.1, consider learning a Bayesian network structure from the data set D = {D1, ..., Dn} over the set of m random variables X = {X1, ..., Xm}. It is assumed that the variables are categorical (i.e., the number of states of each variable is finite) and that the data set is complete. The goal of the Bayesian network construction algorithm is, by defining a parent set Πi for each variable, to find the highest-scoring directed acyclic graph (DAG) G over the node set X. By assuming the Markov condition, the DAG induces the joint probability distribution P(X1, ..., Xm) = ∏i P(Xi | Πi), each variable being conditionally independent of its non-descendant variables given its parents.
step 2.2, different scoring functions can be used to evaluate the quality of the generated DAG; in this patent we use the Bayesian Information Criterion (BIC) score, which is proportional to the posterior probability of the DAG. BIC is decomposable, consisting of the sum of the scores of each variable and its parent set:
BIC(G) = Σi BIC(Xi, Πi) = Σi (LL(Xi | Πi) - Pen(Xi | Πi))
wherein LL(Xi | Πi) represents the log-likelihood of Xi given its parent set Πi:
LL(Xi | Πi) = Σx,π Nx,π log θ̂x|π
and Pen(Xi | Πi) represents the complexity penalty of Xi and its parent set Πi:
Pen(Xi | Πi) = (log N / 2) · (|Xi| - 1) · |Πi|
where θ̂x|π is the maximum-likelihood estimate of the conditional probability P(Xi = x | Πi = π), Nx,π denotes the number of times (Xi = x, Πi = π) occurs in the data set, N is the number of records, and |·| represents the size of the Cartesian product space of the value domains of the given variables.
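A minimal sketch of the decomposable BIC component for one variable under the formulas above; treating data as a list of dict records and domains as a map from each variable to its value list are assumptions of this sketch:

```python
import math
from collections import Counter

def bic_component(data, child, parents, domains):
    """BIC(X_i, Pi_i) = LL(X_i | Pi_i) - Pen(X_i | Pi_i), with
    LL = sum over (pi, x) of N_{x,pi} * log(theta_hat_{x|pi}) and
    Pen = (log N / 2) * (|X_i| - 1) * |parent-domain product space|."""
    n = len(data)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    parent_counts = Counter(tuple(r[p] for p in parents) for r in data)
    ll = 0.0
    for (pi, _x), n_xpi in joint.items():
        theta = n_xpi / parent_counts[pi]  # maximum-likelihood estimate
        ll += n_xpi * math.log(theta)
    pi_space = math.prod(len(domains[p]) for p in parents)
    pen = 0.5 * math.log(n) * (len(domains[child]) - 1) * pi_space
    return ll - pen

data = [{"A": a, "B": b}
        for a, b in [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]]
doms = {"A": [0, 1], "B": [0, 1]}
print(round(bic_component(data, "B", ["A"], doms), 3))  # LL minus penalty
```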
The Bayesian network coding constructs a hierarchical structure diagram by vertically encoding the Bayesian network, and comprises two stages: a bottom-up encoding stage and a top-down modification stage. The specific steps are as follows:
step 2.3, the bottom-up encoding stage: first, the hierarchy codes of all nodes are initialized to zero; the algorithm then repeatedly starts from a leaf node and gradually backtracks through the corresponding parent nodes. In each round, when the hierarchy code of a child node is q, its parent node is marked q + 1. For a non-leaf node only the current maximum code is recorded, i.e., if the node's code is not 0, the new code is compared with the existing one and the larger is kept; if the two codes are equal, the upward backtracking from this node stops. The algorithm then judges whether the leaf node queue is empty and stops if it is; otherwise the next leaf node is extracted for marking, until the leaf node sequence is empty;
step 2.4, the top-down correction stage: all nodes are first sorted from large to small by hierarchy code, and all nodes are initialized as unmarked. The algorithm then extracts the unmarked node with the largest hierarchy code in the node sequence and takes it as the starting point of a breadth-first downward traversal of the graph. In each round, when the hierarchy code of a parent node is q, its child node is to be marked q - 1. Here, denoting the current hierarchy code of the node q_old and the newly derived code q_new, two cases are considered: (a) when q_old < q_new, the algorithm sets the hierarchy code of the node to q_new and sets the node as marked; (b) when q_old = q_new and the node is already marked, the downward traversal from this node terminates early. Next, the extraction of the next unmarked node continues until there are no unmarked nodes in the sequence.
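A sketch of the two-phase vertical encoding of steps 2.3 and 2.4 on a small DAG; the dict-based graph representation and the exact handling of already-marked nodes are assumptions of this sketch:

```python
from collections import deque

def vertical_encode(parents, children):
    """Two-phase vertical coding of a DAG (steps 2.3 and 2.4).
    parents / children: dicts mapping each node to its parent / child
    lists. Returns the hierarchy code of every node."""
    level = {n: 0 for n in parents}  # all codes start at zero

    # Bottom-up encoding: a parent of a level-q node becomes q + 1;
    # keep only the current maximum, otherwise stop backtracking.
    for leaf in (n for n in parents if not children[n]):
        queue = deque([leaf])
        while queue:
            node = queue.popleft()
            for p in parents[node]:
                if level[node] + 1 > level[p]:
                    level[p] = level[node] + 1
                    queue.append(p)

    # Top-down correction: from the largest-coded unmarked node, walk
    # breadth-first downward; a child of a level-q node tends to q - 1.
    marked = set()
    for start in sorted(level, key=level.get, reverse=True):
        if start in marked:
            continue
        marked.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for c in children[node]:
                q_new = level[node] - 1
                if level[c] < q_new:             # case (a): raise and mark
                    level[c] = q_new
                    marked.add(c)
                    queue.append(c)
                elif level[c] == q_new and c in marked:
                    continue                     # case (b): stop early
                else:
                    queue.append(c)              # keep traversing downward
    return level

# Hypothetical diamond DAG: A -> B, A -> C, B -> D, C -> D.
children = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
print(vertical_encode(parents, children))  # {'A': 2, 'B': 1, 'C': 1, 'D': 0}
```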
4. The multi-source data fusion privacy protection method based on multi-privacy policy combination optimization according to claim 1, wherein the four privacy protection operations of k-anonymity, l-diversity, t-closeness and attribute value generalization are implemented in step 3, with the following specific steps:
step 3.1, realizing k-anonymity: the value range of the attribute to be protected is set by a domain expert or the data owner, after which the value-domain space of the attribute in the Bayesian network is extended so that the number of distinct values in its value-domain space is greater than or equal to k. During correction, following the correction principle of information entropy maximization, the parent node of the privacy node assigns the attribute value with the largest probability-distribution value among its child nodes to the privacy node, so that the privacy node meets the k requirement;
step 3.2, realizing l-diversity: in the same way, according to the data owner's setting of the value range of the attribute, the value-domain space of the attribute in the Bayesian network is expanded so that the number of distinct values in the value-domain space is greater than or equal to l. The corrected attribute, following the correction principle of information entropy maximization, in each round of correction selects only the one value with the largest probability distribution as the target object to be corrected, and evenly distributes the probability mass above the mean to the newly added attribute values;
step 3.3, realizing t-closeness: the value distribution that maximizes the information entropy over the value-domain space of the attribute is defined as the theoretical standard, and the deviation is measured by the variance; the probability distribution of each value of the attribute is corrected so that the variance between the occurrence probability of each value and the theoretical standard is not higher than t (see the sketch after step 3.4);
step 3.4, realizing attribute value generalization: according to the attribute value hierarchy tree set by a domain expert or the data owner, the probability distributions of similar values in the value domain of the attribute are fused: the attribute leaf nodes to be protected anonymously in the value domain and all of their sibling leaf nodes are aggregated into one attribute node, which is replaced by their direct parent node, and the attribute value probability distribution corresponding to this node is inherited from all of the original leaf nodes participating in the aggregation.
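An illustrative sketch of steps 3.3 and 3.4, assuming that the entropy-maximizing standard over a finite value domain is the uniform distribution, that "variance" means the mean squared deviation from it, and that the hierarchy tree is given as a one-level leaf-to-parent map; the disease names are hypothetical:

```python
import numpy as np

def t_close_ok(dist: np.ndarray, t: float) -> bool:
    """Step 3.3: the entropy-maximizing (uniform) distribution over
    the value domain is the theoretical standard; require the variance
    of the observed probabilities around it to be at most t."""
    dist = dist / dist.sum()
    uniform = 1.0 / len(dist)
    return float(np.mean((dist - uniform) ** 2)) <= t

def generalize(dist, hierarchy, target_leaf):
    """Step 3.4: aggregate the target leaf and all of its sibling
    leaves into their direct parent; the parent inherits the summed
    probability mass of the aggregated leaves."""
    parent = hierarchy[target_leaf]
    siblings = [l for l, p in hierarchy.items() if p == parent]
    merged = {v: p for v, p in dist.items() if v not in siblings}
    merged[parent] = sum(dist[l] for l in siblings if l in dist)
    return merged

print(t_close_ok(np.array([0.26, 0.24, 0.25, 0.25]), t=1e-3))  # True
dist = {"flu": 0.4, "pneumonia": 0.2, "gastritis": 0.4}
tree = {"flu": "respiratory", "pneumonia": "respiratory",
        "gastritis": "digestive"}
print(generalize(dist, tree, "flu"))  # {'gastritis': 0.4, 'respiratory': 0.6}
```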
By converting the data sets under different privacy protection strategies into Bayesian networks and encoding them, the secondary anonymization process, and with it the re-anonymization framework for multi-party data fusion, is realized. In the multi-source data fusion process, however, the combination scheme of the multiple privacy protection policies still needs to be optimized, so that the availability of the fused data is improved to the maximum extent while the privacy constraints of all parties are met.
5. The multi-source data fusion privacy protection method based on multi-privacy policy combination optimization according to claim 1, wherein heuristic rules are adopted in step 4 to turn the data fusion process into a hypergraph resolution process, with the following specific steps:
step 4.1, the privacy protection policy is formally defined as a five-tuple F = (G, IA, SA, OP, V), wherein G represents the Bayesian network converted from the data set; IA denotes the information attribute nodes, IA = (a1, a2, ..., am), where a1, a2, ..., am are not mutually independent but have probabilistic dependence relations; SA denotes the privacy node; OP denotes the operation steps, OP = (OP1, OP2, ..., OPm); and V denotes the value ranges after the operations OP, V = (V1, V2, ..., Vm).
The execution order of the different privacy protection strategies is judged from the data level and from the structural level:
1) from the data level: if am can be represented by a1, a2, ..., an, i.e., am is functionally determined by them, then OPm is executed later, and vice versa;
from the structural level:
2) starting from the privacy node of the Bayesian network, the network is encoded through the bottom-up encoding stage and the top-down correction stage, and the operations OPi and OPj on the privacy attributes are compared by modifying the privacy node SA within the maximum modification threshold so as to achieve the required privacy protection; if OPi has less influence on the data structure, then the efficiency with which OPi achieves the requirement is higher than that of OPj, and OPi is executed before OPj, and vice versa;
3) if multiple operations OPi, OPj, OPk on information attributes IAi, IAj, IAk are involved, the following two cases are distinguished: first, if the information attributes are related by probabilistic reasoning, the value range Vi, Vj, Vk of each operation is calculated through the probabilistic reasoning relations between the IAs; if the value range of OPi covers those of OPj and OPk, then OPi is executed before OPj and OPk, and if instead the value range of OPk covers the others, then OPk is executed before OPj and OPi; second, when the value range of OPk is independent of those of OPj and OPi, OPk is executed before OPj and OPi, and for the order of OPj and OPi, if the value ranges Vi and Vj of the operations OPi and OPj intersect, then OPi will affect OPj, so OPi is executed before OPj, and vice versa.
step 4.2, by judging the execution order of the different privacy protection strategies F, the following heuristic rules for hyperedge resolution and the resolution program PROG(HG) are generated:
rule 1: if the hypergraph HG contains only one hyperedge N, it can be resolved directly, and PROG(HG) contains only RESULT(HG) := R(N);
rule 2: if the hypergraph HG consists of k disjoint hypergraphs HG1, HG2, ..., HGk, they can be executed in parallel, and PROG(HG) is:
PROG(HG1), PROG(HG2), ..., PROG(HGk);
RESULT(HG) := RESULT(HG1) × RESULT(HG2) × ... × RESULT(HGk);
rule 3: given a privacy node SA and its vertical code X_SA.L known from the properties of the Bayesian network, consider the set Links of all chains whose tail node is the privacy node, and let Xi and Xj be any two nodes in Links other than SA; if Xi.L < Xj.L, then at the same privacy-preserving granularity correcting the probability distribution of Xi has less influence on the availability of the global data, which yields the lower-proximity principle: the closer the modified attribute node is to the privacy attribute, the more targeted the modification. In other words, if the hypergraph HG consists of k connected components HG1, HG2, ..., HGk, the probabilistic dependence of each hyperedge on the privacy node is judged; if HGi in this set is closer to the privacy node than HGj, then HGi is resolved first, and vice versa.
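A small sketch of rule 3's lower-proximity ordering, assuming each connected component is ranked by the minimum vertical code among its nodes (lower codes sit closer to the privacy node, whose code is lowest); the node names and codes are hypothetical:

```python
def resolution_order(components, level):
    """Rule 3 (lower-proximity principle): resolve first the component
    whose nodes sit closest to the privacy node, i.e. order the
    components by their minimum vertical code."""
    return sorted(components, key=lambda comp: min(level[n] for n in comp))

# Hypothetical vertical codes, e.g. from the encoding sketch above:
level = {"A": 3, "B": 2, "C": 1, "SA": 0}
print(resolution_order([["A", "B"], ["C"]], level))  # [['C'], ['A', 'B']]
```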