CN112528316A - Privacy protection lineage workflow publishing method based on Bayesian network
- Publication number
- CN112528316A (application CN202010984734.3A)
- Authority
- CN
- China
- Prior art keywords
- module
- workflow
- input
- output
- lineage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6227—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Bioethics (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computational Mathematics (AREA)
- Mathematical Physics (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Optimization (AREA)
- Mathematical Analysis (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Medical Informatics (AREA)
- Artificial Intelligence (AREA)
- Algebra (AREA)
- Probability & Statistics with Applications (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Computer And Data Communications (AREA)
Abstract
The invention discloses a privacy-preserving lineage workflow publishing method based on a Bayesian network, comprising the following steps: training a Bayesian network to measure the degree of dependence among modules in the lineage workflow, and thereby evaluating the differing importance of the modules in traceability queries; dividing the modules in the workflow into strongly and weakly associated modules; and designing a customized hiding scheme for each module type that balances privacy and usability. For a strongly associated module, the lineage paths that originally passed through the module are still preserved after the hiding operation; for a weakly associated module, the weak dependency is sacrificed to ensure privacy. The invention combines a minimum-bisection module splitting method with a dependency deletion method for privacy modules, so that the usability of traceability queries is effectively maintained while the privacy of lineage workflow modules is protected from disclosure.
Description
Technical Field
The invention relates to a privacy-preserving data publishing method for lineage workflows, which protects privacy module information from disclosure while maintaining the usability of traceability queries.
Background
Data lineage (data provenance), also called data tracing, describes the source, generation, and evolution of data. By application target, its uses fall roughly into the following categories: data quality evaluation, data recovery, data verification, and data reference. The lineage workflow (provenance workflow) is the main representation of data lineage. The concept of workflow originated in the 1980s, as heterogeneous distributed execution environments gradually replaced centralized information processing, and workflow technology is now widely applied in process-interaction scenarios. As a descriptive model of a process, a workflow contains the generation and evolution information of data and is therefore an important representation of data lineage.
Function modules are the main constituent elements of a lineage workflow. The relationship between a module's input data and output data can be abstracted mathematically as a mapping, so the function of a module can be represented by a mapping. According to whether they contain private information, the function modules in a lineage workflow are divided into public modules and privacy modules. A privacy module generally means that the functional mapping of the module is private: the workflow owner is unwilling to publish or share the specific function of the module. Module privacy protection strategies for lineage workflows mainly comprise adding or removing edges and aggregating or splitting modules, which prevent an attacker from inferring the functional mapping of a module from its input and output data.
An important aspect of lineage workflow usability is traceability information query, whose results support decision-making. A traceability query is a query over the data evolution process in a workflow execution; its result must contain correct data description information and should avoid irrelevant, redundant information. Traceability queries typically include the following types: querying the historical source data of known data; querying the evolution path of data within a defined range; and querying the overlapping lineage of multiple data items, i.e., their common modules and common historical data. However, the lineage workflow itself may contain private or sensitive information, and publishing it directly may lead to privacy disclosure. Existing lineage workflow module privacy protection methods have the following defects:
(1) When a privacy module is processed, whether the module lies on a common path of the lineage workflow is not considered, and the importance of the module in traceability queries is neglected. Modules of different importance are hidden with the same strategy, so the availability of important path information after hiding cannot be guaranteed.
(2) Module aggregation is used as the main hiding strategy, but selection criteria and range control for the aggregated modules are lacking, so traceability query usability is largely lost when the hiding granularity is coarse.
Disclosure of Invention
To address these problems, the invention discloses a privacy-preserving lineage workflow publishing method based on a Bayesian network. It follows the idea of applying multiple privacy protection strengths to balance privacy and usability, and it effectively maintains the usability of traceability queries while protecting lineage workflow module privacy from disclosure.
To this end, the technical scheme adopted by the invention is a privacy-preserving lineage workflow sharing and publishing method comprising the following steps:
Step (1): based on the original workflow WF, execute the workflow independently and repeatedly and collect the execution information; for each execution, record whether each data flow is present as a sample $s$, forming the sample set $S$;
Step (2): train on the sample set obtained in step (1) to obtain the structure and parameters of a Bayesian network BN;
Step (3): based on the BN from step (2), evaluate the differing importance of the modules in traceability queries, and divide the privacy modules in the workflow into strongly associated modules and weakly associated modules;
Step (4): divide the modules into four types according to their in-degree and out-degree: single-input single-output, single-input multi-output, multi-input single-output, and multi-input multi-output; subdivide the privacy modules of each type into strongly and weakly associated modules; and combine the module splitting method with the dependency deletion method to formulate a hiding strategy for each type of privacy module;
Step (5): given a privacy module set PrIMs for the original workflow WF, perform hiding according to step (4) to obtain the published workflow $WF^*$.
For the convenience of the subsequent description, the following formal definitions are given:
Function module (Module): a function module in a workflow is represented as a quadruple $M = (I_M, O_M, F_M, P_M)$, where:
(1) $I_M = \{in_M^1, in_M^2, \ldots, in_M^u\}$ is the set of input ports of module $M$, $O_M = \{out_M^1, out_M^2, \ldots, out_M^v\}$ is the set of output ports of module $M$, and $I_M \cap O_M = \emptyset$; that is, no port is both an input port and an output port of the same module;
(2) $F_M = \{f_1, f_2, \ldots, f_v\}$, where $f_i: out_M^i = f_i(I_M)$; each output port $out_M^i$ of the module corresponds to the dependent variable of the mapping $f_i$, and the input port set $I_M$ corresponds to the independent variables of $f_i$;
(3) $P_M = \{p_M^1, p_M^2, \ldots, p_M^r\}$ is the set of $r$ optional parameters of module $M$.
Lineage workflow (Workflow): a lineage workflow is represented as a quadruple $WF = (T, I, O, D)$, where:
(1) $T = \{M_1, M_2, \ldots, M_n\}$ is the set of processing modules of the lineage workflow WF;
(2) $I = \{i_1, i_2, \ldots, i_s\}$ is the global input data set of WF (including the parameter inputs of the modules), $O = \{o_1, o_2, \ldots, o_t\}$ is the global output data set of WF, and $I \cap O = \emptyset$; that is, no data flow in the lineage workflow is both global output data and global input data;
(3) $D = \{d_1, d_2, \ldots, d_k\}$ is the set of data flows in the lineage workflow WF;
(4) there is no module $M \in T$ and data flow sequence $seq(d_i) \subseteq D$ such that data flowing out of $out_M$ passes through $seq(d_i)$ and then flows into $in_M$; that is, WF is a directed acyclic graph.
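The two quadruples can be mirrored in a small data structure. The following is a minimal Python sketch (Python 3.9 or later, for `graphlib`), assuming that each port is identified by the name of the data flow attached to it. The names `Module` and `is_acyclic` are illustrative and do not come from the patent. The sketch checks the two structural constraints stated above: the input and output port sets of a module are disjoint, and the workflow is a directed acyclic graph.

```python
from dataclasses import dataclass
from graphlib import CycleError, TopologicalSorter


@dataclass(frozen=True)
class Module:
    """Hypothetical container for the port sets I_M and O_M of a module."""
    name: str
    inputs: frozenset   # I_M, ports named after the data flows they receive
    outputs: frozenset  # O_M, ports named after the data flows they emit

    def __post_init__(self):
        # Constraint (1): no port is both an input and an output of M.
        if self.inputs & self.outputs:
            raise ValueError(f"{self.name}: I_M and O_M must be disjoint")


def is_acyclic(modules):
    """Return True if no data flow can reach itself through the modules."""
    # Each output data flow depends on every input of the module producing it.
    predecessors = {}
    for m in modules:
        for out in m.outputs:
            predecessors.setdefault(out, set()).update(m.inputs)
    try:
        list(TopologicalSorter(predecessors).static_order())
        return True
    except CycleError:
        return False


m1 = Module("M1", frozenset({"d1"}), frozenset({"d2"}))
m2 = Module("M2", frozenset({"d2"}), frozenset({"d3"}))
loop = Module("M3", frozenset({"d3"}), frozenset({"d1"}))  # closes a cycle
```

Here `is_acyclic([m1, m2])` holds, while adding `loop` feeds `d3` back into `d1` and violates constraint (4).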
The sample set in step (1) is generated as follows: during one execution of the workflow WF, record whether each element of the data flow set $D = \{d_1, d_2, \ldots, d_k\}$ participates in the execution, marking it T if it participates and F otherwise, to form a sample $s = [d_1^{T/F}, d_2^{T/F}, \ldots, d_k^{T/F}]$. The experiment is repeated independently and randomly $n$ times to obtain the total sample set $S = \{s_i \mid 1 \le i \le n\}$. The $n$ independent workflow executions thus yield $n$ samples, which are used in the subsequent probability calculations.
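As an illustration of this sampling step, the sketch below records one T/F entry per data flow for each of $n$ runs. It is only a sketch under stated assumptions: `collect_samples` and `toy_run` are hypothetical names, and `toy_run` stands in for a real workflow execution with made-up branch probabilities.

```python
import random


def collect_samples(data_flows, run_once, n, seed=0):
    """Run the workflow n times and record T/F presence of every data flow."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        present = run_once(rng)  # set of data flows that took part in this run
        samples.append({d: d in present for d in data_flows})
    return samples


def toy_run(rng):
    """Stand-in for a real workflow execution (assumed branch probabilities)."""
    present = {"d1"}          # global input, always present
    if rng.random() < 0.9:
        present.add("d2")     # frequently taken branch
    if rng.random() < 0.2:
        present.add("d3")     # rarely taken branch
    return present


S = collect_samples(["d1", "d2", "d3"], toy_run, n=1000)
```

Each element of `S` is a dictionary mapping a data flow name to `True` (T) or `False` (F), which corresponds to one sample $s$.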
The method for constructing the single-condition Bayesian network in the step (2) comprises the following steps:
1) Determine the set of variables that describes the problem domain, and determine the states and value range of each variable. The data flow set $D = \{d_1, d_2, \ldots, d_k\}$ of the workflow WF is used as the network variable set, i.e., the network node set; each variable takes the value T or F, representing whether the data flow is present;
2) Determine the network structure by connecting cause variables to effect variables according to the probabilistic or prior dependency relationships among the nodes. Based on the structural information of the workflow $WF = (T, I, O, D)$, the network nodes from step 1) are connected with directed edges to form a directed acyclic graph $G = (V, E)$. The network structure is constructed as follows:
(b) for each module $M_k = (I_{M_k}, O_{M_k}, F_{M_k}, P_{M_k}) \in T$, traverse $in_{M_k} \in I_{M_k}$ and $out_{M_k} \in O_{M_k}$, and look up in $V$ the nodes $v$ and $u$ corresponding to $in_{M_k}$ and $out_{M_k}$; if both nodes are found, create the directed edge $v \to u$ and add it to $E$;
(c) for the module $M_k$ in (b), repeat the traversal over every pair of $in_{M_k} \in I_{M_k}$ and $out_{M_k} \in O_{M_k}$, creating the directed edge $v \to u$ and adding it to $E$ whenever the corresponding nodes $v$ and $u$ are found in $V$;
(d) the Bayesian network structure $G = (V, E)$ is obtained after the above steps.
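The structure construction can be sketched as follows. This is a hedged illustration rather than the patented procedure itself: the function name `build_structure` and the dictionary layout of the modules are assumptions, and the node set is assumed to be created with one node per data flow, as stated in step 1).

```python
def build_structure(data_flows, modules):
    """Build G = (V, E) with one node per data flow.

    modules maps a module name to a pair (input flows, output flows).
    """
    V = set(data_flows)  # assumed node creation: one node per d_i in D
    E = set()
    for inputs, outputs in modules.values():
        for v in inputs:
            for u in outputs:
                # Add the edge v -> u only when both nodes are found in V.
                if v in V and u in V:
                    E.add((v, u))
    return V, E


D = ["d1", "d2", "d3", "d4"]
T = {
    "M1": ({"d1"}, {"d2", "d3"}),  # single input, two outputs
    "M2": ({"d2", "d3"}, {"d4"}),  # two inputs, single output
}
V, E = build_structure(D, T)
```

For this toy workflow the edge set contains `d1 -> d2`, `d1 -> d3`, `d2 -> d4`, and `d3 -> d4`, one edge per input/output pair of each module.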
3) Because the training sample set $S$ contains no missing data, parameter learning is performed on complete data, and learning the conditional probabilities by maximum likelihood estimation (MLE) reduces to frequency counting. Once the parameters (the conditional probability table, CPT) have been learned, construction of the Bayesian network is complete. The parameter learning algorithm for the single-condition Bayesian network is as follows:
(a) for each edge $e = \langle v, u \rangle \in E$ of $G$, set the counters $cnt\_x_v = 0$ and $cnt\_x_{vu} = 0$;
(b) for each record $s \in S$: if $x_v$ is present in sample $s$, increment $cnt\_x_v$ by 1; if both $x_v$ and $x_u$ are present in sample $s$, increment $cnt\_x_{vu}$ by 1;
(c) add the conditional probability corresponding to edge $e$, $P(x_u \mid x_v) = cnt\_x_{vu} / cnt\_x_v$, to the CPT; return to step (a) until the probabilities of all edges in $E$ have been calculated. In this way, a dependency probability is computed for every edge $e$ in the edge set $E$ from the presence or absence of its data flows in the sample set, giving the degree of dependence between the nodes at the two ends of the edge.
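The counting procedure in steps (a) to (c) can be sketched in Python as below. The function name `learn_cpt` and the sample layout (one dictionary of booleans per execution) are assumptions made for illustration; returning 0.0 when $x_v$ never occurs is also an assumption, since the source does not cover that case.

```python
def learn_cpt(edges, samples):
    """Estimate P(x_u | x_v) for every edge (v, u) by frequency counting."""
    cpt = {}
    for v, u in edges:
        cnt_v = cnt_vu = 0
        for s in samples:
            if s[v]:
                cnt_v += 1          # x_v present in this sample
                if s[u]:
                    cnt_vu += 1     # x_v and x_u both present
        # Assumption: report 0.0 when x_v never occurs in the sample set.
        cpt[(v, u)] = cnt_vu / cnt_v if cnt_v else 0.0
    return cpt


samples = [
    {"d1": True, "d2": True},
    {"d1": True, "d2": True},
    {"d1": True, "d2": False},
    {"d1": True, "d2": True},
    {"d1": False, "d2": False},
]
cpt = learn_cpt([("d1", "d2")], samples)
```

In this toy sample set `d1` is present in four executions and `d2` accompanies it in three of them, so the estimate for the edge `d1 -> d2` is 3/4.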
The method for dividing the strong and weak association modules in the step (3) is as follows:
For a module $M = (I_M, O_M, F_M, P_M)$: if $P(out_M \mid in_M) \ge \alpha$ holds for all $in_M \in I_M$ and all $out_M \in O_M$, then $M$ is a strongly associated module; otherwise, $M$ is a weakly associated module. Here $\alpha$ is the privacy probability threshold, and $P(out_M \mid in_M)$ is the conditional probability that the output $out_M$ of module $M$ is present given that the input $in_M$ is present. Using this rule, every element of the module set $T = \{M_1, M_2, \ldots, M_n\}$ of the workflow WF is classified as a strongly or weakly associated module.
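The division rule amounts to a single check over all input/output pairs of a module. A minimal sketch follows, assuming the CPT is stored as a dictionary keyed by (input flow, output flow); the function name and the threshold value 0.8 are illustrative only.

```python
def is_strongly_associated(inputs, outputs, cpt, alpha):
    """True if P(out | in) >= alpha for every input/output pair of the module."""
    return all(cpt[(i, o)] >= alpha for i in inputs for o in outputs)


# Illustrative conditional probabilities and threshold (not from the patent).
cpt = {("d1", "d2"): 0.95, ("d1", "d3"): 0.30}
alpha = 0.8

strong = is_strongly_associated(["d1"], ["d2"], cpt, alpha)
weak = not is_strongly_associated(["d1"], ["d2", "d3"], cpt, alpha)
```

A module with the single dependency `d1 -> d2` is strongly associated here, while a module that also emits `d3` is weakly associated because one pair falls below $\alpha$.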
The privacy protection policy in the step (4) is specifically as follows:
1) Single-input single-output module
For a single-input single-output module, the output data must be present whenever the input data is present, so the module is a strongly associated module. As shown in FIG. 2, when module $M$ participates in a workflow execution, i.e., when the input data $d_x$ is present, the output data $d_y$ must be present, so $P(d_y \mid d_x) = 1$. If the owner of the lineage workflow identifies such a module as a privacy module, the entire module is replaced in the published graph by a single data flow $d_{xy}$, which preserves the original $d_x \to d_y$ path and does not affect traceability path queries.
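The replacement of a single-input single-output privacy module by one data flow can be sketched as a small graph rewrite. The representation below, a list of (producer, consumer, data flow) triples, and the name `hide_siso` are assumptions for illustration; the label of the merged flow is likewise a made-up convention standing in for $d_{xy}$.

```python
def hide_siso(edges, module):
    """Replace a single-input single-output module by one merged data flow.

    edges is a list of (producer, consumer, data_flow) triples.
    """
    incoming = [e for e in edges if e[1] == module]
    outgoing = [e for e in edges if e[0] == module]
    if len(incoming) != 1 or len(outgoing) != 1:
        raise ValueError("module is not single-input single-output")
    (src, _, d_x), (_, dst, d_y) = incoming[0], outgoing[0]
    kept = [e for e in edges if module not in (e[0], e[1])]
    # The merged flow stands in for d_xy and keeps the src -> dst path.
    return kept + [(src, dst, f"{d_x}_{d_y}")]


wf = [("A", "M", "dx"), ("M", "B", "dy"), ("B", "C", "dz")]
published = hide_siso(wf, "M")
```

After the rewrite, `M` no longer appears in the published edge list, but a path from `A` to `B` (and on to `C`) is still present.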
2) Single-input multi-output module
For a single-input multi-output module, as shown in FIG. 3(a):
(a) if module $M$ is a strongly associated module, the multiple outputs of $M$ are separated by splitting the module, while ensuring that the association between the input and the multiple outputs still exists. To minimize interference information, $M$ is split into two submodules $M_1$ and $M_2$, as shown in FIG. 3(b);
(b) if module $M$ is a weakly associated module, its weakest association is deleted, i.e., the dependency $d_x \to d_y$ with the minimum conditional probability among those related to the module in the Bayesian network; in the published graph, module $M$ hides the port corresponding to the output $d_y$. To preserve the connectivity of the workflow graph, two cases must be considered:
① if the module $N$ that takes $d_y$ as input data is a multi-input module, deleting the port of $M$ corresponding to the output $d_y$ does not destroy the connectivity of the workflow graph; the input port of $N$ for $d_y$ is deleted, and the original $d_y$ is represented as an added input parameter of $N$;
② if the module $N$ that takes $d_y$ as input data is a single-input module, i.e., $d_y$ is its only input, applying the hiding scheme in ① would leave $N$ with no input port, which violates the workflow definition. In this case, while deleting its output port for $d_y$, $M$ traces back to find its parent module MP, and an output port is added to MP and connected to the input port of $N$, preserving the structural integrity of the workflow.
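The weak-association branch for a single-input multi-output module combines two decisions: which dependency is the weakest, and which of the two connectivity cases applies to the consumer $N$. The sketch below illustrates only this decision logic; the function name, the `consumer_inputs` mapping, and the returned action strings are hypothetical.

```python
def plan_weak_simo_hiding(d_x, outputs, cpt, consumer_inputs):
    """Pick the weakest dependency d_x -> d_y and the case that applies to it.

    consumer_inputs maps each output flow to the set of input flows of the
    module N that consumes it.
    """
    # Weakest association: minimum conditional probability P(d_y | d_x).
    d_y = min(outputs, key=lambda o: cpt[(d_x, o)])
    if len(consumer_inputs[d_y]) > 1:
        # Case 1: N keeps other inputs, so d_y can become a parameter of N.
        action = "turn d_y into an input parameter of N"
    else:
        # Case 2: N would lose its only input, so reconnect it to the parent MP.
        action = "connect N to a new output port of the parent MP"
    return d_y, action


cpt = {("dx", "dy1"): 0.9, ("dx", "dy2"): 0.2}
case1 = plan_weak_simo_hiding(
    "dx", ["dy1", "dy2"], cpt, {"dy1": {"dy1"}, "dy2": {"dy2", "dw"}})
case2 = plan_weak_simo_hiding(
    "dx", ["dy1", "dy2"], cpt, {"dy1": {"dy1"}, "dy2": {"dy2"}})
```

In both calls the dependency on `dy2` is the weakest; the two results differ only in how the consumer of `dy2` is kept connected.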
3) Multi-input single-output module
For a multi-input single-output module:
(a) if module $M$ is a strongly associated module, the multiple inputs of $M$ are separated by splitting the module, while ensuring that the association between the multiple inputs and the output still exists. To minimize interference information, $M$ is split minimally into two submodules $M_1$ and $M_2$;
(b) if module $M$ is a weakly associated module, its weakest association is deleted, i.e., the dependency $d_x \to d_z$ with the minimum conditional probability among those related to the module in the Bayesian network; in the published graph, module $M$ hides the port corresponding to the input $d_x$. To preserve the connectivity of the workflow graph, two cases must be considered:
① if the module $N$ that produces $d_x$ as output data is a multi-output module, deleting the port of $M$ corresponding to the input $d_x$ does not destroy the connectivity of the workflow graph; the input port of $M$ for $d_x$ is deleted, and the original $d_x$ is represented as an added input parameter of $M$;
② if the module $N$ that produces $d_x$ as output data is a single-output module, i.e., $d_x$ is its only output, applying the hiding scheme in ① would leave $N$ with no output port, which violates the workflow definition. In this case, while deleting its input port for $d_x$, $M$ looks downstream to find its successor module MC; an input port is added to MC, and the output port of $N$ is redirected to this input port of MC, preserving the structural integrity of the workflow.
4) Multiple input multiple output type module
A multi-input multi-output module can be regarded as a combination of a single-input multi-output module and a multi-input single-output module.
(a) If module $M$ is a strongly associated module, the multiple inputs or outputs of $M$ are separated by splitting the module, while ensuring that the association between the multiple inputs and the multiple outputs still exists. To minimize interference information, $M$ is split into two submodules $M_1$ and $M_2$. If the input data is separated, the input ports of $M$ are divided between $M_1$ and $M_2$, and the output ports of $M$ become the output ports of $M_2$; if the output data is separated, the output ports of $M$ are divided between $M_1$ and $M_2$, and the input ports of $M$ become the input ports of $M_1$.
(b) If module $M$ is a weakly associated module:
① if $\exists\, out_M^j \in O_M$ such that $\forall\, in_M^i \in I_M$, $P(out_M^j \mid in_M^i) < \alpha$, i.e., the dependency probabilities between some output port $out_M^j$ and all input ports are below the privacy probability threshold $\alpha$, the output port $out_M^j$ is hidden in the published graph. Hiding an output port does not depend on the number of input ports of the module, so the weakly associated module hiding strategy in 2) also applies to hiding $out_M^j$. For example, if $P(d_e \mid d_a) < \alpha \wedge P(d_e \mid d_b) < \alpha \wedge P(d_e \mid d_c) < \alpha$, the output port of $M$ corresponding to $d_e$ is hidden.
② if $\exists\, in_M^i \in I_M$ such that $\forall\, out_M^j \in O_M$, $P(out_M^j \mid in_M^i) < \alpha$, i.e., the dependency probabilities between all output ports and some input port $in_M^i$ are below the privacy probability threshold $\alpha$, the input port $in_M^i$ is hidden in the published graph. Hiding an input port does not depend on the number of output ports of the module, so the weakly associated module hiding strategy in 3) also applies to hiding $in_M^i$. For example, if $P(d_d \mid d_a) < \alpha \wedge P(d_e \mid d_a) < \alpha \wedge P(d_f \mid d_a) < \alpha$, the input port of $M$ corresponding to $d_a$ is hidden.
③ In all other cases, the module is split and hidden using the strongly associated module strategy. For a multi-input multi-output module, when neither ① nor ② is satisfied, privacy cannot be protected by hiding a single input edge or a single output edge; therefore the weaker hiding strategy, module splitting, is adopted as the privacy measure.
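The three cases for a weakly associated multi-input multi-output module can be sketched as a decision function. This is an illustration under assumptions: the name `plan_weak_mimo_hiding` and the returned tuples are hypothetical, and checking case ① before case ② is an assumed priority that the source does not specify.

```python
def plan_weak_mimo_hiding(inputs, outputs, cpt, alpha):
    """Decide how to hide a weakly associated multi-input multi-output module."""
    # Case 1: some output depends weakly on every input -> hide that output.
    for o in outputs:
        if all(cpt[(i, o)] < alpha for i in inputs):
            return ("hide_output", o)
    # Case 2: every output depends weakly on some input -> hide that input.
    for i in inputs:
        if all(cpt[(i, o)] < alpha for o in outputs):
            return ("hide_input", i)
    # Case 3: no single port isolates the weak association -> split the module.
    return ("split", None)


ins, outs, alpha = ["da", "db", "dc"], ["dd", "de", "df"], 0.5

# d_e depends weakly on all of d_a, d_b, d_c (illustrative probabilities).
cpt1 = {(i, o): (0.1 if o == "de" else 0.9) for i in ins for o in outs}
# All outputs depend weakly on d_a.
cpt2 = {(i, o): (0.1 if i == "da" else 0.9) for i in ins for o in outs}
# Only one pair is weak, so neither case 1 nor case 2 applies.
cpt3 = {(i, o): 0.9 for i in ins for o in outs}
cpt3[("da", "dd")] = 0.1
```

With these tables, `cpt1` leads to hiding the output port for `de`, `cpt2` to hiding the input port for `da`, and `cpt3` falls through to splitting.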
The lineage workflow privacy-preserving publishing method in step (5) is as follows: given a privacy module set PrIMs for the original workflow WF, determine the specific type of each module $M$ in PrIMs according to step (4) and perform the corresponding hiding operation to obtain the published workflow $WF^*$.
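Step (5) can be read as a dispatch over the four module types and the strong/weak classification. The sketch below summarizes which strategy from step (4) applies in each case. The function names and strategy strings are illustrative, and the sketch assumes every privacy module has at least one input and one output.

```python
def module_type(n_inputs, n_outputs):
    """Classify a module by its in-degree and out-degree (both assumed >= 1)."""
    kind_in = "single-input" if n_inputs == 1 else "multi-input"
    kind_out = "single-output" if n_outputs == 1 else "multi-output"
    return f"{kind_in} {kind_out}"


def choose_strategy(n_inputs, n_outputs, strong):
    """Map a privacy module to the hiding strategy described in step (4)."""
    kind = module_type(n_inputs, n_outputs)
    if kind == "single-input single-output":
        # Always strongly associated: P(d_y | d_x) = 1.
        return "replace module with a single data flow"
    if strong:
        return "split module into M1 and M2"
    if kind == "multi-input multi-output":
        return "hide a weakly associated port, otherwise split"
    return "delete the weakest dependency"


plans = [
    choose_strategy(1, 1, strong=True),
    choose_strategy(1, 3, strong=False),
    choose_strategy(2, 1, strong=True),
    choose_strategy(3, 3, strong=False),
]
```

Applying `choose_strategy` to every module in PrIMs gives the per-module hiding plan from which the published workflow $WF^*$ is derived.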
Compared with the prior art, the invention has the following advantages. Existing module privacy protection methods separate the module from the workflow structure and ignore the importance of the module in the data evolution process, so the published lineage workflow has poor traceability query usability; the technical scheme addresses these problems with a privacy-preserving lineage workflow publishing method based on a Bayesian network. By collecting a large number of module participation samples from random workflow executions, a Bayesian network model is constructed and the degree of dependence among related modules in the workflow is measured, which determines the role of each privacy module in workflow traceability queries. A personalized module privacy protection method is provided: based on the constructed Bayesian network, the modules of the workflow are divided into strongly and weakly associated modules, and different hiding strategies are designed for them. Hiding of the privacy modules is confined to a local part of the workflow, which reduces modification of the original workflow structure and maintains traceability query usability.
Drawings
FIG. 1 is an overall flow chart of the method of the present invention;
FIG. 2 is a diagram of a single input single output type module privacy protection strategy;
FIG. 3 is a diagram of a single input multiple output type module privacy protection strategy;
FIG. 4 is a diagram of a multiple input single output type module privacy protection strategy;
FIG. 5 is a diagram of a multiple input multiple output type module privacy protection strategy;
FIG. 6 is the original lineage workflow WF;
FIG. 7 is the Bayesian network structure corresponding to WF;
FIG. 8 is the published lineage workflow WF*.
Detailed Description
To promote an understanding of the invention, it is described in detail below with reference to the accompanying drawings and specific embodiments.
Example 1: the technical scheme adopted by the invention is a privacy-preserving lineage workflow sharing and publishing method whose overall flow is shown in FIG. 1. Steps (1) to (5), the formal definitions of the function module and the lineage workflow, the sample set generation in step (1), the Bayesian network construction in step (2), and the division into strongly and weakly associated modules in step (3) are as given in the Disclosure of Invention.
The privacy protection strategy in step (4), with the corresponding figures, is as follows:
1) Single-input single-output module
For a single-input single-output module, the output data must be present whenever the input data is present, so the module is a strongly associated module. As shown in FIG. 2, when module $M$ participates in a workflow execution, i.e., when the input data $d_x$ is present, the output data $d_y$ must be present, so $P(d_y \mid d_x) = 1$. If the owner of the lineage workflow identifies such a module as a privacy module, the entire module is replaced in the published graph by a single data flow $d_{xy}$, which preserves the original $d_x \to d_y$ path and does not affect traceability path queries.
6) Single-input multi-output type module
For a single-input multi-output module, as shown in FIG. 3(a):
(a) If module M is a strongly associated module, separate the multiple outputs of M by splitting the module, while ensuring that the association between the input and the multiple outputs is preserved. To keep the interference information to a minimum, M is split into the minimum of two submodules $M_1$ and $M_2$, as shown in FIG. 3(b);
(b) If module M is a weakly associated module, delete its weakest association, that is, the dependency $d_x \to d_y$ with the smallest conditional probability among those related to the module in the Bayesian network. In the release graph, module M hides the port corresponding to the output $d_y$. To keep the workflow graph connected, the following two cases must be considered:
① If the module N that takes $d_y$ as input data is a multi-input module, deleting the port of M corresponding to the output $d_y$ does not break the connectivity of the workflow graph. The input port of N for $d_y$ is deleted, and the original $d_y$ is added to N as an input parameter, as shown in FIG. 3(c);
② If the module N that takes $d_y$ as input data is a single-input module, that is, $d_y$ is the only input of N, the hiding scheme in ① would leave N without an input port, which violates the workflow definition. Instead, while M deletes the output port corresponding to $d_y$, trace back to find the parent module MP of M and add to MP an output port connected to the input port of N, which preserves the structural integrity of the workflow, as shown in FIG. 3(d).
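The weak-association branch for a single-input multi-output module can be sketched as two small functions: one picks the dependency with the smallest conditional probability, the other decides which of the two rewiring cases applies. The names `weakest_dependency`, `simo_weak_plan`, the `consumer_in_degree` map, and the returned labels are hypothetical, and the probabilities are made up.

```python
from typing import Dict, List, Tuple

CPT = Dict[Tuple[str, str], float]  # (input flow, output flow) -> P(out | in)

def weakest_dependency(inputs: List[str], outputs: List[str],
                       cpt: CPT) -> Tuple[str, str]:
    """Return the (input, output) pair with the smallest conditional probability."""
    pairs = [(i, o) for i in inputs for o in outputs]
    return min(pairs, key=lambda pair: cpt[pair])

def simo_weak_plan(d_x: str, outputs: List[str], cpt: CPT,
                   consumer_in_degree: Dict[str, int]) -> Tuple[str, str]:
    """Pick the output flow to hide and the rewiring case for its consumer N."""
    _, d_y = weakest_dependency([d_x], outputs, cpt)
    if consumer_in_degree[d_y] > 1:
        # Case 1: N keeps other inputs, so d_y can become a parameter of N.
        return d_y, "parameterize_consumer"
    # Case 2: d_y is N's only input, so the parent MP must feed N instead.
    return d_y, "reroute_from_parent"

cpt = {("dx", "dy1"): 0.7, ("dx", "dy2"): 0.2}
print(simo_weak_plan("dx", ["dy1", "dy2"], cpt, {"dy1": 1, "dy2": 3}))
print(simo_weak_plan("dx", ["dy1", "dy2"], cpt, {"dy1": 1, "dy2": 1}))
```

The same `weakest_dependency` helper would serve a multi-input single-output module by passing several inputs and one output.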
7) Multi-input single-output type module
For a multi-input single-output module, as shown in FIG. 4(a):
(a) If module M is a strongly associated module, separate the multiple inputs of M by splitting the module, while ensuring that the association between the multiple inputs and the output is preserved. To keep the interference information to a minimum, M is split into the minimum of two submodules $M_1$ and $M_2$, as shown in FIG. 4(b);
(b) If module M is a weakly associated module, delete its weakest association, that is, the dependency $d_x \to d_z$ with the smallest conditional probability among those related to the module in the Bayesian network. In the release graph, module M hides the port corresponding to the input $d_x$. To keep the workflow graph connected, the following two cases must be considered:
① If the module N that produces $d_x$ as output data is a multi-output module, deleting the port of M corresponding to the input $d_x$ does not break the connectivity of the workflow graph. The input port of M for $d_x$ is deleted, and the original $d_x$ is added to M as an input parameter, as shown in FIG. 4(c);
② If the module N that produces $d_x$ as output data is a single-output module, that is, $d_x$ is the only output of N, the hiding scheme in ① would leave N without an output port, which violates the workflow definition. Instead, while M deletes the input port corresponding to $d_x$, search downstream to find the successor module MC of M, add an input port to MC, and redirect the output port of N to that input port of MC, which preserves the structural integrity of the workflow, as shown in FIG. 4(d).
8) Multiple input multiple output type module
For a multi-input multi-output module, as shown in FIG. 5(a), the module can be regarded as a combination of the single-input multi-output and multi-input single-output types.
(a) If module M is a strongly associated module, separate either the multiple inputs or the multiple outputs of M by splitting the module, while ensuring that the association between the multiple inputs and the multiple outputs is preserved. To keep the interference information to a minimum, M is split into the minimum of two submodules $M_1$ and $M_2$. If the inputs are separated, the input ports of M are distributed between $M_1$ and $M_2$, and the output ports of M become the output ports of $M_2$, as shown in FIG. 5(b); if the outputs are separated, the output ports of M are distributed between $M_1$ and $M_2$, and the input ports of M become the input ports of $M_1$, as shown in FIG. 5(c).
(b) If module M is a weakly associated module:
① If $\exists\, out_M^j \in O_M$ such that $\forall\, in_M^i \in I_M$, $P(out_M^j \mid in_M^i) < \alpha$, that is, the association dependency probabilities between some output port $out_M^j$ and all input ports are below the privacy probability threshold $\alpha$, hide the output port $out_M^j$ in the release graph. Because hiding an output port does not depend on the number of input ports of the module, the weak-association hiding strategy for single-input multi-output modules also applies to hiding $out_M^j$. For example, if $P(d_e \mid d_a) < \alpha \wedge P(d_e \mid d_b) < \alpha \wedge P(d_e \mid d_c) < \alpha$, hide the output port of M corresponding to $d_e$, as shown in FIG. 5(d).
② If $\exists\, in_M^i \in I_M$ such that $\forall\, out_M^j \in O_M$, $P(out_M^j \mid in_M^i) < \alpha$, that is, the association dependency probabilities between all output ports and some input port $in_M^i$ are below the privacy probability threshold $\alpha$, hide the input port $in_M^i$ in the release graph. Because hiding an input port does not depend on the number of output ports of the module, the weak-association hiding strategy for multi-input single-output modules also applies to hiding $in_M^i$. For example, if $P(d_d \mid d_a) < \alpha \wedge P(d_e \mid d_a) < \alpha \wedge P(d_f \mid d_a) < \alpha$, hide the input port of M corresponding to $d_a$, as shown in FIG. 5(e).
③ In all other cases, split and hide the module using the strategy for strongly associated modules.
The privacy-protecting release of the lineage workflow in step (5) proceeds as follows: given a privacy module set PrIMs for the original workflow WF, determine the specific type of each module M in PrIMs and apply the corresponding hiding operation according to step (4), which yields the published workflow $WF^*$.
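One possible way to organize this step is a dispatcher that looks at the in-degree, out-degree, and association strength of each privacy module and returns the strategy family that applies. This is a sketch under stated assumptions: the function names, the strategy labels, the module names `Mx` and `My`, their ports, and the probabilities are all hypothetical.

```python
from typing import Dict, List, Tuple

CPT = Dict[Tuple[str, str], float]  # (input flow, output flow) -> P(out | in)

def module_kind(inputs: List[str], outputs: List[str]) -> str:
    """Name the module type from its in-degree and out-degree."""
    n_in = "single" if len(inputs) == 1 else "multi"
    n_out = "single" if len(outputs) == 1 else "multi"
    return f"{n_in}-input {n_out}-output"

def hiding_strategy(inputs: List[str], outputs: List[str],
                    cpt: CPT, alpha: float) -> Tuple[str, str]:
    """Return (module type, strategy label) for one privacy module."""
    kind = module_kind(inputs, outputs)
    if kind == "single-input single-output":
        return kind, "replace module with one data flow"
    strong = all(cpt[(i, o)] >= alpha for i in inputs for o in outputs)
    if strong:
        return kind, "split into two submodules"
    if kind == "multi-input multi-output":
        return kind, "hide a qualifying port, otherwise split"
    return kind, "delete weakest dependency and hide its port"

def publish_plan(prims: Dict[str, Tuple[List[str], List[str]]],
                 cpt: CPT, alpha: float) -> Dict[str, Tuple[str, str]]:
    """Map every module in the privacy set to its hiding strategy."""
    return {name: hiding_strategy(ins, outs, cpt, alpha)
            for name, (ins, outs) in prims.items()}

# Hypothetical privacy set and probabilities.
prims = {"Mx": (["a", "b"], ["c"]), "My": (["e", "f"], ["g", "h"])}
cpt = {("a", "c"): 0.9, ("b", "c"): 0.8,
       ("e", "g"): 0.1, ("e", "h"): 0.2, ("f", "g"): 0.9, ("f", "h"): 0.9}
plan = publish_plan(prims, cpt, alpha=0.5)
for name, choice in plan.items():
    print(name, choice)
```

The dispatcher only selects a strategy; actually rewriting the graph for each strategy is left out of this sketch.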
The application example is as follows:
FIG. 6 shows a lineage workflow WF with $T = \{M_1, M_2, \dots, M_7, M_8\}$, $I = \{i_1, p_2, p_3, p_4\}$, $O = \{o_1, o_2\}$, and $D = \{d_1, d_2, d_3, \dots, d_{12}, d_{13}\}$. The lineage workflow WF is executed independently 30 times, and the existence of each data flow is recorded in each run, which yields the following sample set:
S={[0,1,0,0,0,0,1,0,1,1,1,0,0],
[1,1,0,0,1,0,0,1,1,0,1,1,0],
[0,1,0,0,0,0,0,1,1,1,1,1,0],
[0,0,0,1,1,0,1,1,0,0,0,0,1],
……
[1,1,0,1,1,1,1,0,0,0,0,0,1]}
Applying the Construct SC-BN algorithm of steps (2) and (3) of the invention yields the single-condition Bayesian network structure shown in FIG. 7. Applying the Parameter Learning in SC-BN algorithm of steps (2) and (3) yields the conditional probability tables of the network:
$i_1 \to d_1$:
$d_2 \to d_4$:
……
$d_{13} \to o_2$:
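Because the samples contain no missing values, each table entry can be estimated by counting. The sketch below counts, for one edge $v \to u$, the samples in which $x_v$ is present and those in which both $x_v$ and $x_u$ are present, then takes their ratio as the estimate of $P(x_u = T \mid x_v = T)$. The function name `learn_cpt`, the column mapping, and the handling of a zero count are assumptions. The five rows are the ones listed in the sample set above, so the printed value reflects only those five rows and not the 30-run tables of the patent.

```python
from typing import Dict, List, Optional, Tuple

FLOWS = [f"d{k}" for k in range(1, 14)]  # assumed column order: d1 .. d13

ROWS = [
    [0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1],
]

def learn_cpt(samples: List[List[int]],
              edges: List[Tuple[str, str]]) -> Dict[Tuple[str, str], Optional[float]]:
    """Estimate P(x_u = T | x_v = T) for each edge v -> u by frequency counting."""
    col = {name: idx for idx, name in enumerate(FLOWS)}
    cpt: Dict[Tuple[str, str], Optional[float]] = {}
    for v, u in edges:
        cnt_v = cnt_vu = 0
        for s in samples:
            if s[col[v]]:
                cnt_v += 1          # x_v present in this sample
                if s[col[u]]:
                    cnt_vu += 1     # x_v and x_u both present
        # Undefined when x_v never occurs; None is an assumption of this sketch.
        cpt[(v, u)] = cnt_vu / cnt_v if cnt_v else None
    return cpt

print(learn_cpt(ROWS, [("d2", "d4")]))  # {('d2', 'd4'): 0.25}
```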
Based on the conditional probabilities $\Pr(d_j = T \mid d_i = T)$ in the $d_i \to d_j$ conditional probability tables above, and on the definition of strongly and weakly associated modules in step (3), the types of the modules $M_2$ and $M_5$ in the given privacy module set PrIMs $= \{M_2, M_5\}$ are determined. According to the conditional probability information, if the conditional probabilities of all input-output dependencies of $M_2$ are at least $\alpha$, then $M_2$ is judged to be a strongly associated module. Since $M_2$ is a multi-input single-output module, it is split into the minimum of two parts according to the strategy of FIG. 4(b). For $M_5$, $P(d_{10} \mid d_7) < \alpha$ and $P(d_{11} \mid d_7) < \alpha$, so $M_5$ is judged to be a weakly associated module. Since $M_5$ is a multi-input multi-output module and the association dependency probabilities between all of its output ports ($d_{10}$, $d_{11}$) and the input port $d_7$ are below the privacy probability threshold $\alpha$, the input port $d_7$ is hidden in the release graph. After this hiding process, the released workflow graph $WF^*$ is obtained, as shown in FIG. 8.
The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the present invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (6)
1. A privacy-protecting lineage workflow publishing method based on a Bayesian network, characterized by comprising the following steps:
step (1): execute the original workflow WF repeatedly, independently and at random, and collect the workflow execution information; record whether each data flow exists in one execution as a sample s, and form a sample set S;
step (2): train the structure and parameters of the Bayesian network BN from the sample set obtained in step (1);
step (3): based on the BN of step (2), evaluate the differing importance of the modules in provenance-tracing queries, and divide the privacy modules in the workflow into strongly associated modules and weakly associated modules;
step (4): divide the modules into four types according to their in-degree and out-degree, namely single-input single-output, single-input multi-output, multi-input single-output, and multi-input multi-output; subdivide the privacy modules of each type into strongly associated and weakly associated modules; and combine the module splitting method with the dependency deletion method to formulate a hiding strategy for each type of privacy module;
step (5): given a privacy module set PrIMs for the original workflow WF, perform hiding according to step (4) to obtain the published workflow $WF^*$;
For the convenience of the subsequent description, the following formal definitions are given:
Function module (Module): a function module in the workflow is represented as a quadruple $M = (I_M, O_M, F_M, P_M)$, where: (1) $I_M = \{in_M^1, in_M^2, \dots, in_M^u\}$ is the set of input ports of module M, $O_M = \{out_M^1, out_M^2, \dots, out_M^v\}$ is the set of output ports of module M, and $I_M \cap O_M = \emptyset$, that is, no port of a module is both an input port and an output port;
(2) $F_M = \{f_1, f_2, \dots, f_v\}$, where $f_i: out_M^i = f_i(I_M)$; each output port $out_M^i$ of the module is the dependent variable of the mapping $f_i$, and the input port set $I_M$ is its independent variable;
(3) $P_M = \{p_M^1, p_M^2, \dots, p_M^r\}$ is the set of the r optional parameters of module M;
Lineage workflow (Workflow): the lineage workflow is represented as a quadruple $WF = (T, I, O, D)$, where: (1) $T = \{M_1, M_2, \dots, M_n\}$ is the set of processing modules of the lineage workflow WF;
(2) $I = \{i_1, i_2, \dots, i_s\}$ is the global input data set of the lineage workflow WF (including the parameter inputs of each module), $O = \{o_1, o_2, \dots, o_t\}$ is the global output data set of the lineage workflow WF, and $I \cap O = \emptyset$, that is, no data flow in the lineage workflow is both global output data and global input data;
(3) $D = \{d_1, d_2, \dots, d_k\}$ is the set of data flows in the lineage workflow WF;
2. The Bayesian-network-based privacy-protecting lineage workflow publishing method according to claim 1, wherein the sample set in step (1) is generated as follows: during one execution of the workflow WF, record whether each element of the data flow set $D = \{d_1, d_2, \dots, d_k\}$ participates in the execution, marking it T if it participates and F otherwise, which forms a sample $s = [d_1^{T/F}, d_2^{T/F}, \dots, d_k^{T/F}]$; repeat the experiment n times, independently and at random, to obtain the sample set $S = \{s_i \mid 1 \le i \le n\}$.
3. The Bayesian-network-based privacy-protecting lineage workflow publishing method according to claim 1, wherein the single-condition Bayesian network in step (2) is constructed as follows:
a single-condition Bayesian network (SC-BN) $G = (V, E)$ is represented as a directed acyclic graph (DAG), where V is the set of all nodes in the graph and E is the set of directed edges in the graph; let $x_v$ be the random variable represented by a node $v \in V$ of G; $e = \langle v, u \rangle \in E$ denotes the edge $v \to u$, and the weight of e corresponds to $P(x_u \mid x_v)$, the conditional probability that $x_u$ is present or absent given that $x_v$ is present or absent;
(1) determine the set of variables that describes the problem domain, together with the states and value range of each variable: take the data flow set $D = \{d_1, d_2, \dots, d_k\}$ of the workflow WF as the set of network variables, that is, the set of network nodes, with each variable taking the value T or F to indicate whether the data flow exists;
(2) determine the links from cause variables to effect variables according to the probabilistic or prior dependencies between the nodes, and thereby determine the network structure. Based on the structural information of the workflow $WF = (T, I, O, D)$, connect the network nodes of step (1) with directed edges to form a directed acyclic graph $G = (V, E)$; the network structure involved in steps (1) and (2) is constructed as follows:
(b) for each module $M_k = (I_{M_k}, O_{M_k}, F_{M_k}, P_{M_k}) \in T$, traverse each $in_{M_k} \in I_{M_k}$ and each $out_{M_k} \in O_{M_k}$;
(c) for the module $M_k$ in (b), look up in V the nodes v and u corresponding to $in_{M_k}$ and $out_{M_k}$; if both are found, create a directed edge $v \to u$ and add it to E;
(d) the above steps yield the Bayesian network structure $G = (V, E)$;
(3) because the training sample set S has no missing data, parameter learning is the complete-data case, and learning the conditional probabilities by maximum likelihood estimation (MLE) reduces to frequency counting; once the parameter information (the conditional probability table, CPT) has been learned, the construction of the Bayesian network is complete; the single-condition Bayesian network parameter learning method is as follows:
(a) for each edge $e = \langle v, u \rangle \in E$ in G, set the counters $cnt\_x_v = 0$ and $cnt\_x_{vu} = 0$;
(b) for each record $s \in S$: if $x_v$ is present in the sample s, increment $cnt\_x_v$ by 1; if both $x_v$ and $x_u$ are present in the sample s, increment $cnt\_x_{vu}$ by 1;
4. The Bayesian-network-based privacy-protecting lineage workflow publishing method according to claim 1, wherein the strongly and weakly associated modules in step (3) are divided as follows:
strongly/weakly associated module: for a module $M = (I_M, O_M, F_M, P_M)$, if $P(out_M \mid in_M) \ge \alpha$ holds for every $in_M \in I_M$ and every $out_M \in O_M$, then M is a strongly associated module; otherwise M is a weakly associated module, where $\alpha$ is the privacy probability threshold and $P(out_M \mid in_M)$ is the conditional probability that the output $out_M$ of module M exists given that the input $in_M$ exists;
based on the above definition, each element of the module set $T = \{M_1, M_2, \dots, M_n\}$ of the workflow WF is classified as a strongly or weakly associated module.
5. The Bayesian-network-based privacy-protecting lineage workflow publishing method according to claim 1, wherein the privacy protection strategies in step (4) are as follows:
(1) single-input single-output module:
for a single-input single-output module, the output data must exist whenever the input data exists, so the module is a strongly associated module; when module M takes part in a workflow execution, that is, when the input data $d_x$ exists, the output data $d_y$ must also exist, so $P(d_y \mid d_x) = 1$; if the owner of the lineage workflow marks a module of this type as a privacy module, the release graph replaces the entire module with a single data flow $d_{xy}$, which preserves the original $d_x \to d_y$ path and does not affect queries on the tracing path;
(2) single-input multi-output module:
for a single-input multi-output module,
(a) if module M is a strongly associated module, separate the multiple outputs of M by splitting the module, while ensuring that the association between the input and the multiple outputs is preserved; to keep the interference information to a minimum, M is split into the minimum of two submodules $M_1$ and $M_2$,
(b) if module M is a weakly associated module, delete its weakest association, that is, the dependency $d_x \to d_y$ with the smallest conditional probability among those related to the module in the Bayesian network; in the release graph, module M hides the port corresponding to the output $d_y$; to keep the workflow graph connected, the following two cases must be considered:
① if the module N that takes $d_y$ as input data is a multi-input module, deleting the port of M corresponding to the output $d_y$ does not break the connectivity of the workflow graph; the input port of N for $d_y$ is deleted, and the original $d_y$ is added to N as an input parameter,
② if the module N that takes $d_y$ as input data is a single-input module, that is, $d_y$ is the only input of N, hiding according to the scheme in ① would leave N without an input port, which violates the workflow definition; instead, while M deletes the output port corresponding to $d_y$, trace back to find the parent module MP of M and add to MP an output port connected to the input port of N, which preserves the structural integrity of the workflow.
(3) multi-input single-output module:
for a multi-input single-output module,
(a) if module M is a strongly associated module, separate the multiple inputs of M by splitting the module, while ensuring that the association between the multiple inputs and the output is preserved; to keep the interference information to a minimum, M is split into the minimum of two submodules $M_1$ and $M_2$,
(b) if module M is a weakly associated module, delete its weakest association, that is, the dependency $d_x \to d_z$ with the smallest conditional probability among those related to the module in the Bayesian network; in the release graph, module M hides the port corresponding to the input $d_x$; to keep the workflow graph connected, the following two cases must be considered:
① if the module N that produces $d_x$ as output data is a multi-output module, deleting the port of M corresponding to the input $d_x$ does not break the connectivity of the workflow graph; the input port of M for $d_x$ is deleted, and the original $d_x$ is added to M as an input parameter,
② if the module N that produces $d_x$ as output data is a single-output module, that is, $d_x$ is the only output of N, hiding according to the scheme in ① would leave N without an output port, which violates the workflow definition; instead, while M deletes the input port corresponding to $d_x$, search downstream to find the successor module MC of M, add an input port to MC, and redirect the output port of N to that input port of MC, which preserves the structural integrity of the workflow.
(4) multi-input multi-output module:
for a multi-input multi-output module, the module can be regarded as a combination of the single-input multi-output and multi-input single-output types;
(a) if module M is a strongly associated module, separate either the multiple inputs or the multiple outputs of M by splitting the module, while ensuring that the association between the multiple inputs and the multiple outputs is preserved; to keep the interference information to a minimum, M is split into the minimum of two submodules $M_1$ and $M_2$; if the inputs are separated, the input ports of M are distributed between $M_1$ and $M_2$, and the output ports of M become the output ports of $M_2$; if the outputs are separated, the output ports of M are distributed between $M_1$ and $M_2$, and the input ports of M become the input ports of $M_1$,
(b) if module M is a weakly associated module,
① if $\exists\, out_M^j \in O_M$ such that $\forall\, in_M^i \in I_M$, $P(out_M^j \mid in_M^i) < \alpha$, that is, the association dependency probabilities between some output port $out_M^j$ and all input ports are below the privacy probability threshold $\alpha$, hide the output port $out_M^j$ in the release graph; because hiding an output port does not depend on the number of input ports of the module, the weak-association hiding strategy in (2) also applies to hiding $out_M^j$; for example, if $P(d_e \mid d_a) < \alpha \wedge P(d_e \mid d_b) < \alpha \wedge P(d_e \mid d_c) < \alpha$, hide the output port of M corresponding to $d_e$, as shown in FIG. 5(d),
② if $\exists\, in_M^i \in I_M$ such that $\forall\, out_M^j \in O_M$, $P(out_M^j \mid in_M^i) < \alpha$, that is, the association dependency probabilities between all output ports and some input port $in_M^i$ are below the privacy probability threshold $\alpha$, hide the input port $in_M^i$ in the release graph; because hiding an input port does not depend on the number of output ports of the module, the weak-association hiding strategy in (3) also applies to hiding $in_M^i$; for example, if $P(d_d \mid d_a) < \alpha \wedge P(d_e \mid d_a) < \alpha \wedge P(d_f \mid d_a) < \alpha$, hide the input port of M corresponding to $d_a$,
③ in all other cases, split and hide the module using the strategy for strongly associated modules.
6. The Bayesian-network-based privacy-protecting lineage workflow publishing method according to claim 1, wherein the privacy-protecting release of the lineage workflow in step (5) proceeds as follows: given a privacy module set PrIMs for the original workflow WF, determine the specific type of each module M in PrIMs and apply the corresponding hiding operation according to step (4), which yields the published workflow $WF^*$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010984734.3A CN112528316B (en) | 2020-09-18 | 2020-09-18 | Privacy protection lineage workflow publishing method based on Bayesian network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112528316A (en) | 2021-03-19 |
CN112528316B (en) | 2022-07-15 |
Family
ID=74978843
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010984734.3A Active CN112528316B (en) | 2020-09-18 | 2020-09-18 | Privacy protection lineage workflow publishing method based on Bayesian network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112528316B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117786739A (en) * | 2023-12-19 | 2024-03-29 | 国网青海省电力公司信息通信公司 | Data processing method, server and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107103000A (en) * | 2016-02-23 | 2017-08-29 | 广州启法信息科技有限公司 | It is a kind of based on correlation rule and the integrated recommended technology of Bayesian network |
CN107871087A (en) * | 2017-11-08 | 2018-04-03 | 广西师范大学 | The personalized difference method for secret protection that high dimensional data is issued under distributed environment |
CN107910009A (en) * | 2017-11-02 | 2018-04-13 | 中国科学院声学研究所 | A kind of symbol based on Bayesian inference rewrites Information Hiding & Detecting method and system |
Also Published As
Publication number | Publication date |
---|---|
CN112528316B (en) | 2022-07-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||