CN117349896B - Data collection method, analysis method and analysis system based on sensitivity classification - Google Patents

Publication number
CN117349896B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311649483.3A
Other languages
Chinese (zh)
Other versions
CN117349896A (en)
Inventor
周礼亮
陈亚青
张陆游
李涛
熊蓉玲
冉华明
叶宇桐
张敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
CETC 10 Research Institute
Original Assignee
Institute of Software of CAS
CETC 10 Research Institute
Application filed by Institute of Software of CAS, CETC 10 Research Institute
Priority to CN202311649483.3A
Publication of CN117349896A
Application granted
Publication of CN117349896B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Abstract

The invention relates to the technical field of information security, and discloses a data collection method, an analysis method and an analysis system based on sensitivity classification. The invention addresses problems of the prior art such as the limited statistical-analysis accuracy of perturbed data.

Description

Data collection method, analysis method and analysis system based on sensitivity classification
Technical Field
The invention relates to the technical field of information security, and in particular to a data collection method, an analysis method and an analysis system based on sensitivity classification.
Background
Local differential privacy is a privacy model with a rigorous mathematical definition. It lets users perturb their real data locally and submit the perturbed data, rather than the real data, to an untrusted analyst, who then performs statistical analysis on the perturbed data. Specifically, a local perturbation algorithm that satisfies local differential privacy randomizes a user's private data so that the randomized output distributions produced by different inputs are guaranteed to be similar. The degree of similarity is measured by the privacy budget: the smaller the privacy budget, the more similar the output distributions of different inputs, the harder it is to tell the inputs apart, and the stronger the privacy protection. The local differential privacy model does not require users to trust the data collector, and its perturbation algorithms have low computational cost; however, the accuracy of statistical analysis on perturbed data is limited by the size of the data set and by the privacy budget.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a data collection method, an analysis method and an analysis system based on sensitivity classification, which address problems of the prior art such as the limited statistical-analysis accuracy of perturbed data.
The invention solves the problems by adopting the following technical scheme:
A data collection method based on sensitivity classification classifies the real data by degree of sensitivity, so that the client applies a different perturbation strategy depending on the sensitivity class of the data when perturbing the real data.
As a preferred technical scheme, the method comprises the following steps:
S1, the server divides the value range of the data to be collected into two sensitivity classes, a high-sensitivity class $X_H$ and a low-sensitivity class $X_L$, determines the value of the privacy budget $\varepsilon$, and sends the high/low sensitivity division $(X_H, X_L)$ and the privacy budget $\varepsilon$ to each client;
S2, after receiving the high/low sensitivity division $(X_H, X_L)$ and the privacy budget $\varepsilon$ sent by the server, each client calls the sensitivity-classified perturbation algorithm to output a perturbation result for its real data, and then sends the perturbation result to the server.
In step S2, if the real data is high-sensitivity, the probability distribution of the perturbation result of the sensitivity-classified perturbation algorithm is the same as that of a perturbation algorithm satisfying local differential privacy; if the real data is low-sensitivity, the probability distribution of the perturbation result satisfies the following: the probability that the perturbation result is a low-sensitivity value other than the real data is 0, and the probability that the perturbation result is a given non-low-sensitivity value is the same as under the perturbation algorithm satisfying local differential privacy.
As a preferred technical solution, the perturbation algorithm satisfying local differential privacy is: for continuous data, the piecewise mechanism (PM); for discrete data, the general random response (GRR).
As a preferred technical solution, if the real data $v$ is continuous data, the sensitivity-classified perturbation algorithm is the sensitivity-classified piecewise mechanism, specifically:
if $v \in X_H$, then $v$ is perturbed and the probability distribution of the randomized output $y$ is:
$$\Pr[y \mid v] = \begin{cases} p, & y \in [l(v), r(v)] \\ p/e^{\varepsilon}, & y \in [-C, l(v)) \cup (r(v), C] \end{cases}$$
where $C = \dfrac{e^{\varepsilon/2}+1}{e^{\varepsilon/2}-1}$, $p = \dfrac{e^{\varepsilon}-e^{\varepsilon/2}}{2e^{\varepsilon/2}+2}$, $l(v) = \dfrac{C+1}{2}\,v - \dfrac{C-1}{2}$, $r(v) = l(v) + C - 1$;
if $v \in X_L$, assume $X_L = [a, b]$; the probability distribution of the randomized output $y$ is:
$$\Pr[y \mid v] = \begin{cases} q, & y = v \\ 0, & y \in [a, b] \setminus \{v\} \\ p/e^{\varepsilon}, & y \in [-C, a) \cup (b, C] \end{cases}$$
where $C = \dfrac{e^{\varepsilon/2}+1}{e^{\varepsilon/2}-1}$, $p = \dfrac{e^{\varepsilon}-e^{\varepsilon/2}}{2e^{\varepsilon/2}+2}$, $q = 1 - \dfrac{p}{e^{\varepsilon}}\bigl(2C - (b - a)\bigr)$;
wherein $v$ represents the real data; $y$ represents the perturbation result; $\Pr[y \mid v]$ represents the probability distribution of the perturbation result $y$ when the real data is $v$; $(X_H, X_L)$ represents the high/low sensitivity division; $X_H$ represents the high-sensitivity value interval; $X_L$ represents the low-sensitivity value interval; $\varepsilon$ represents the privacy budget; $C$ and $-C$ represent the maximum and minimum of the perturbation result for both the piecewise mechanism and the sensitivity-classified piecewise mechanism; $[l(v), r(v)]$ represents the interval of the piecewise mechanism's output range that contains the real data, into which the piecewise mechanism perturbs the real data with higher probability, $l(v)$ being its left endpoint and $r(v)$ its right endpoint; $p$ represents the probability that the piecewise mechanism outputs any given value in $[l(v), r(v)]$; $q$ represents the probability that the sensitivity-classified piecewise mechanism perturbs low-sensitivity data to the original real data; $p/e^{\varepsilon}$ represents the probability that the piecewise mechanism outputs any given value outside $[l(v), r(v)]$, and is also the probability that the sensitivity-classified piecewise mechanism perturbs low-sensitivity data to any given value outside $X_L$; $a$ and $b$ represent the left and right endpoints of the low-sensitivity value interval $X_L$.
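The continuous-data mechanism described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's reference implementation: the input domain is taken to be $[-1, 1]$ and the constants follow the standard piecewise mechanism; the low-sensitivity region is a single interval `low = (a, b)`; and a low-sensitivity input is mapped outside `[a, b]` with the PM's lower density everywhere. All function names are hypothetical.

```python
import math
import random


def pm_params(eps):
    """Return (C, p): output bound and high density of the standard PM (assumed form)."""
    s = math.exp(eps / 2)
    C = (s + 1) / (s - 1)
    p = (math.exp(eps) - s) / (2 * s + 2)
    return C, p


def sample_outside(lo, hi, C, rng):
    """Sample uniformly from [-C, lo) U (hi, C]."""
    left = lo + C                      # length of the left piece
    u = rng.random() * (left + (C - hi))
    return -C + u if u < left else hi + (u - left)


def pm_perturb(v, eps, rng=random):
    """Standard piecewise mechanism for v in [-1, 1] (assumed input domain)."""
    C, p = pm_params(eps)
    l = (C + 1) / 2 * v - (C - 1) / 2
    r = l + C - 1
    if rng.random() < p * (C - 1):     # total mass of the high-density interval [l, r]
        return rng.uniform(l, r)
    return sample_outside(l, r, C, rng)


def sdpm_keep_probability(eps, low):
    """Probability that a low-sensitivity value is reported unchanged (assumed form)."""
    a, b = low
    C, p = pm_params(eps)
    return 1.0 - (p / math.exp(eps)) * (2 * C - (b - a))


def sdpm_perturb(v, eps, low, rng=random):
    """Sensitivity-classified PM: plain PM for high-sensitivity v, point mass plus tails otherwise."""
    a, b = low
    if not (a <= v <= b):              # high-sensitivity input: fall back to PM
        return pm_perturb(v, eps, rng)
    if rng.random() < sdpm_keep_probability(eps, low):
        return v                       # report the true low-sensitivity value
    C, _ = pm_params(eps)
    return sample_outside(a, b, C, rng)  # never another low-sensitivity value
```

A call such as `sdpm_perturb(0.2, 1.0, (0.0, 0.5))` returns either `0.2` itself or a value outside `[0.0, 0.5]`, which is the property the text requires for low-sensitivity data.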
As a preferred technical solution, if the real data $v$ is discrete data, the sensitivity-classified perturbation algorithm is the sensitivity-classified general random response, specifically:
if $v \in X_H$, then $v$ is perturbed and the probability distribution of the randomized output $y$ is:
$$\Pr[y \mid v] = \begin{cases} p, & y = v \\ q, & y \neq v \end{cases}$$
where $p = \dfrac{e^{\varepsilon}}{e^{\varepsilon} + d - 1}$, $q = \dfrac{1}{e^{\varepsilon} + d - 1}$, with $d = |X_H| + |X_L|$ the number of possible values;
if $v \in X_L$, the probability distribution of the randomized output $y$ is:
$$\Pr[y \mid v] = \begin{cases} p', & y = v \\ 0, & y \in X_L \setminus \{v\} \\ q, & y \in X_H \end{cases}$$
where $q = \dfrac{1}{e^{\varepsilon} + d - 1}$, $p' = 1 - |X_H|\,q$;
wherein $v$ represents the real data; $y$ represents the perturbation result; $\Pr[y \mid v]$ represents the probability distribution of the perturbation result $y$ when the real data is $v$; $(X_H, X_L)$ represents the high/low sensitivity division; $X_H$ represents the high-sensitivity value set; $X_L$ represents the low-sensitivity value set; $\varepsilon$ represents the privacy budget; $p$ represents the probability that the general random response perturbs the real data to the original real data; $q$ represents the probability that the general random response perturbs the real data to a given value other than the real data, and is also the probability that the sensitivity-classified general random response perturbs low-sensitivity data to any given value outside $X_L$; $p'$ represents the probability that the sensitivity-classified general random response perturbs low-sensitivity data to the original real data.
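The discrete-data mechanism described above can be illustrated with a short Python sketch. It is a hedged illustration only: the function names, the `domain` / `high` arguments and the use of `random.choices` are assumptions of this example, and the probability of keeping a low-sensitivity value is obtained by normalization from the description (zero mass on other low-sensitivity values, the GRR off-diagonal probability on each high-sensitivity value).

```python
import math
import random


def sdgrr_probabilities(v, domain, high, eps):
    """Return {output: probability} for true value v under the sensitivity-classified GRR."""
    d = len(domain)
    p = math.exp(eps) / (math.exp(eps) + d - 1)   # GRR: report the true value
    q = 1.0 / (math.exp(eps) + d - 1)             # GRR: report one particular other value
    if v in high:
        # High-sensitivity input: plain GRR over the whole domain.
        return {y: (p if y == v else q) for y in domain}
    # Low-sensitivity input: mass q on each high-sensitivity value,
    # zero on the other low-sensitivity values, the remainder on v itself.
    n_high = sum(1 for y in domain if y in high)
    keep = 1.0 - n_high * q
    return {y: (keep if y == v else (q if y in high else 0.0)) for y in domain}


def sdgrr_perturb(v, domain, high, eps, rng=random):
    """Sample one perturbed report for the true value v."""
    probs = sdgrr_probabilities(v, domain, high, eps)
    outputs = list(probs)
    return rng.choices(outputs, weights=[probs[y] for y in outputs], k=1)[0]
```

For instance, with `domain = ["AIDS", "fever", "diarrhea"]`, `high = {"AIDS"}` and `eps = math.log(2)`, a user whose true value is `"fever"` never reports `"diarrhea"`.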
A data analysis method based on sensitivity classification comprises the data collection method based on sensitivity classification described above, and further comprises the following step:
S3, after receiving the perturbation results of the clients, the server applies an expectation-maximization (EM) estimation algorithm to the perturbation results to complete the analysis task.
As a preferred technical solution, step S3 includes the following steps:
S31, from the high/low sensitivity division $(X_H, X_L)$, the privacy budget $\varepsilon$ and the sensitivity-classified perturbation algorithm, compute the transition probability matrix $M$;
S32, the expectation-maximization estimation algorithm obtains the frequency distribution $F$ from the clients' perturbation results $y$ and, combined with the transition probability matrix $M$, iteratively updates the estimate $\hat{P}$ of the probability distribution of the real data as follows:
$$\hat{P}_i^{(t+1)} = \sum_j F_j \, \frac{M_{ij}\,\hat{P}_i^{(t)}}{\sum_k M_{kj}\,\hat{P}_k^{(t)}}$$
wherein $i$ is the index of each possible value of the real data; $j$ is the index of each possible value of the perturbation result; $k$ is the index of each possible value of the real data; $M_{ij}$ is the element in row $i$, column $j$ of the probability matrix $M$, i.e. the probability that the $i$-th real data value is perturbed to the $j$-th perturbation result; $t$ is the number of update rounds; $\hat{P}_i^{(t+1)}$ is the $i$-th component of $\hat{P}$ in update round $t+1$, i.e. the probability estimate of the $i$-th real data value in round $t+1$; $\hat{P}_i^{(t)}$ is the $i$-th component of $\hat{P}$ in update round $t$, i.e. the probability estimate of the $i$-th real data value in round $t$; $\hat{P}_k^{(t)}$ is the $k$-th component of $\hat{P}$ in update round $t$, i.e. the probability estimate of the $k$-th real data value in round $t$; $F_j$ is the $j$-th component of $F$, i.e. the frequency of the $j$-th value among the perturbation results $y$;
S33, the EM algorithm finally converges to the estimate $\hat{P}$ of the probability distribution of the real data; the analysis task is completed according to $\hat{P}$.
As a preferred technical solution, in step S31, for continuous data, after the input and output intervals are discretized, $M_{ij}$ is the integral of the probability that the $i$-th subinterval of the real data is perturbed to the $j$-th subinterval of the perturbation result; for discrete data, $M_{ij}$ is the probability that the $i$-th real data value is perturbed to the $j$-th perturbation result value;
in step S33, for continuous data, completing the analysis task means completing the mean estimation task, and the mean estimate is the expectation of $\hat{P}$; for discrete data, completing the analysis task means completing the frequency distribution estimation task, and the frequency distribution estimate is: the frequency of the real data value $v$ is $\hat{P}_i$; wherein $v$ represents the value of the real data and $i$ is the index of $v$ in the value space of the real data.
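The EM estimation of steps S31 to S33 can be sketched in plain Python as below. This is an illustrative sketch, not the patent's implementation: the names `em_estimate`, `rounds` and `tol`, the uniform starting point and the stopping rule are assumptions of this example. `M[i][j]` is the probability that true value `i` is reported as output `j`, and `F[j]` is the normalized frequency of output `j` among the reports.

```python
def em_estimate(M, F, rounds=5000, tol=1e-10):
    """Estimate the distribution of true values from report frequencies F and matrix M."""
    n = len(M)
    P = [1.0 / n] * n                  # uniform initial estimate (an assumption)
    for _ in range(rounds):
        new_P = [0.0] * n
        for j, f_j in enumerate(F):
            if f_j == 0.0:
                continue               # unobserved outputs contribute nothing
            denom = sum(M[k][j] * P[k] for k in range(n))
            if denom == 0.0:
                continue               # output j is impossible under the current estimate
            for i in range(n):
                # Posterior weight of true value i given report j, scaled by frequency.
                new_P[i] += f_j * M[i][j] * P[i] / denom
        converged = max(abs(x - y) for x, y in zip(new_P, P)) < tol
        P = new_P
        if converged:
            break
    return P
```

As a usage example, with `M = [[0.75, 0.25], [0.25, 0.75]]` and report frequencies `F = [0.35, 0.65]`, which are the exact expected frequencies for a true distribution of `[0.2, 0.8]`, the estimate converges towards `[0.2, 0.8]`. Because every update is a convex combination of posterior weights, each iterate stays a valid probability vector.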
A data analysis system based on sensitivity classification is used to implement the data analysis method based on sensitivity classification, and comprises a server and one or more clients, the server being communicatively connected to each client;
the client comprises:
the data storage module is used for storing the real data of the user;
the sensitivity grading disturbance module is used for storing a sensitivity grading disturbance algorithm and applying the sensitivity grading disturbance algorithm to the real data to obtain a disturbance result;
the client communication module is used for transmitting the disturbance result to the server;
the server side comprises:
the preset module is used for presetting parameters of a client disturbance algorithm, wherein the parameters of the disturbance algorithm comprise a high-low sensitivity division mode and privacy budget;
the communication module is used for sending the parameters of the preset disturbance algorithm to the client and receiving the disturbance result transmitted by the client;
and the data aggregation module is used for running the expectation-maximization estimation algorithm on the perturbation results reported by the clients and completing the analysis task.
Compared with the prior art, the invention has the following beneficial effects:
(1) The data collection method of the invention protects the privacy of sensitive data and ensures that the disturbance result has high availability; specifically, the data collection method considers the sensitivity grading characteristic of real data, when a client locally perturbs the real data, different perturbation strategies are adopted according to the sensitivity degree of the data, so that the high-sensitivity data is ensured to meet the local differential privacy protection requirement, the privacy protection effect of the low-sensitivity data is reasonably reduced, and the usability of a perturbation result is effectively improved;
(2) The data analysis method has higher analysis accuracy; it uses the EM algorithm to process the perturbation results of all clients and fits the probability distribution of the real data iteratively, which avoids the unreasonable probability distributions that an estimation method based on an analytical expression may produce, and effectively improves the accuracy of the analysis result;
(3) In the system, the complexity of the disturbance algorithm of the client is O (1), the communication consumption with the server is also O (1), the operation is efficient, and the system is convenient to deploy in mobile equipment with limited computing power and bandwidth; in addition, the system comprises a data storage module, a disturbance processing module and a communication module, and has clear and definite functions and easy realization.
Drawings
FIG. 1 is a schematic diagram of a data analysis system based on sensitivity classification according to the present invention;
FIG. 2 is a schematic diagram of disturbance result probability distribution of continuous high-sensitivity data disturbance by adopting an SDPM algorithm;
FIG. 3 is a schematic diagram of disturbance result probability distribution of continuous low-sensitivity data disturbance by adopting an SDPM algorithm;
FIG. 4 is a schematic diagram of probability distribution of disturbance results of disturbance discrete type high-sensitivity data by adopting an SDGRR algorithm;
FIG. 5 is a schematic diagram of the probability distribution of the perturbation results when discrete low-sensitivity data is perturbed with the SDGRR algorithm.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Example 1
As shown in FIG. 1 to FIG. 5, one feasible way to improve the accuracy of statistical analysis when the data set is small or the privacy budget is small is to relax the strict definition of the local differential privacy model according to the needs of the specific application scenario. In some practical scenarios, the data to be collected differ in sensitivity. For example, when a mobile map app collects users' geographic locations to measure traffic flow, locations such as a home address or workplace, which relate directly to a user's private information, are significantly more sensitive than public places such as coffee shops. For such common scenarios with graded data sensitivity, a more suitable variant of the differential privacy model should be adopted, one that applies different degrees of random perturbation to data of different sensitivity, to achieve a better balance between data privacy and the accuracy of analysis results.
For private data whose sensitivity can be divided into high and low, the invention provides a data collection method, an analysis method and an analysis system based on sensitivity classification, which strictly protect the privacy of the original data throughout collection and analysis; the supported analysis tasks include frequency distribution estimation for discrete data and mean estimation for continuous data. It can be shown that the method preserves high utility of the analysis results while protecting data privacy.
The invention aims to exploit the sensitivity classification of data in scenarios that collect and analyze discrete or continuous private data, so as to achieve higher analysis accuracy.
The technical scheme adopted by the invention is as follows:
a privacy collection and analysis method for sensitive hierarchical data comprises the following steps:
S1, the server divides the value range of the data to be collected into two classes, high-sensitivity and low-sensitivity, determines the value of the privacy budget, and sends the high/low sensitivity division of the data and the privacy budget to each client.
S2, after each client receives the data of the server, a disturbance algorithm of the sensitivity classification is called, namely different strategies are selected to disturb the real data according to the sensitivity category of the real data of the user, and finally disturbance results are sent to the server.
S3, after the server receives the disturbance results of the clients, an expectation maximization estimation (Expectation Maximization, abbreviated as EM) algorithm is applied to the data, and an analysis task is completed.
Further, the dividing the value range of the data to be collected into two types, namely high sensitivity type and low sensitivity type, refers to designating each value as high sensitivity type or low sensitivity type according to the sensitivity degree of the general user to the value of the data.
Further, the completion of the analysis task means: and (3) completing a frequency distribution estimation task for discrete data and completing a mean value estimation task for continuous data.
Further, in S2, the perturbation algorithm of the sensitivity classification refers to: for highly sensitive data, the randomized output distribution of the highly sensitive data subjected to disturbance is equal to the randomized output distribution processed by a local disturbance algorithm meeting local differential privacy; for low-sensitivity data, the probability that the disturbed randomized output is distributed on the low-sensitivity data is concentrated on real data, the probability on other low-sensitivity data except the real data is zero, and the probability on other outputs is equivalent to the output probability processed by a local disturbance algorithm meeting the local differential privacy.
Further, the local perturbation algorithm satisfying local differential privacy refers to: for discrete data, the general random response (General Random Response, abbreviated as GRR); for continuous data, the piecewise mechanism (Piecewise Mechanism, abbreviated as PM).
Further, the sensitivity-classified perturbation algorithm for discrete data is named the sensitivity-classified general random response (Sensitivity Discriminant General Random Response, abbreviated as SDGRR; the SDGRR algorithm calls the GRR algorithm internally), and the sensitivity-classified perturbation algorithm for continuous data is named the sensitivity-classified piecewise mechanism (Sensitivity Discriminant Piecewise Mechanism, abbreviated as SDPM; the SDPM algorithm calls the PM algorithm internally). Further, in S3, the analysis task refers to: a frequency distribution estimation task for discrete data and a mean estimation task for continuous data.
Further, a client, comprising:
and the data storage module is responsible for storing the real sensitive data of the user.
The sensitivity grading disturbance module is responsible for storing disturbance algorithms of sensitivity grading, comprising a disturbance algorithm SDGRR of sensitivity grading aiming at discrete data and a disturbance algorithm SDPM of sensitivity grading aiming at continuous data, and applying the disturbance algorithm of sensitivity grading to real sensitive data to obtain a randomized disturbance result.
And the client communication module is responsible for transmitting the disturbance result of the disturbance algorithm of the sensitivity classification to the server.
Further, a server includes:
the preset module is responsible for determining parameters of a disturbance algorithm of the client, namely a high-low sensitive dividing mode and privacy budget of data to be collected.
The server communication module is responsible for sending the parameters of the preset disturbance algorithm to the client and receiving the disturbance result of the client.
And the data aggregation module is responsible for running an EM algorithm on the disturbance result reported by the client to complete frequency distribution estimation of discrete data or average estimation of continuous data.
A data analysis system based on sensitive grading, comprising the client and the server described above.
Compared with the prior art, the invention has the advantages that:
(1) The data collection method of the invention protects the privacy of sensitive data and ensures that the disturbance result has high availability; specifically, the data collection method considers the sensitivity grading characteristic of real data, when a client locally perturbs the real data, different perturbation strategies are adopted according to the sensitivity degree of the data, so that the high-sensitivity data is ensured to meet the local differential privacy protection requirement, the privacy protection effect of the low-sensitivity data is reasonably reduced, and the usability of a perturbation result is effectively improved;
(2) The data analysis method has higher analysis accuracy; it uses the EM algorithm to process the perturbation results of all clients and fits the probability distribution of the real data iteratively, which avoids the unreasonable probability distributions that an estimation method based on an analytical expression may produce, and effectively improves the accuracy of the analysis result;
(3) In the system, the complexity of the disturbance algorithm of the client is O (1), the communication consumption with the server is also O (1), the operation is efficient, and the system is convenient to deploy in mobile equipment with limited computing power and bandwidth; in addition, the system comprises a data storage module, a disturbance processing module and a communication module, and has clear and definite functions and easy realization.
Example 2
As further optimization of embodiment 1, as shown in fig. 1 to 5, this embodiment further includes the following technical features on the basis of embodiment 1:
the data analysis system based on sensitive classification comprises a client and a server:
1. client side:
the client is deployed on the user equipment and stores sensitive data of the user. When the server side needs to analyze the sensitive data of the user, the client side calls a disturbance algorithm of the sensitive classification, and the real sensitive data is disturbed according to two parameters of a high-low sensitive division mode and a privacy budget of the sensitive data, and a disturbance result is sent to the server side.
The technical scheme of the client is shown in fig. 1, and mainly comprises a data storage module, a sensitive hierarchical disturbance module and a communication module. The data storage module stores the true sensitive data of the user. The sensitivity classification perturbation module stores perturbation algorithms of two sensitivity classifications respectively aiming at discrete data and continuous data, and the perturbation process can be expressed by using the randomized output distribution shown in fig. 2 to 5. The communication module is responsible for sending the disturbed data to the server.
2. The server side:
The server is controlled by the data collector, whose goal is to analyze the frequency of the users' discrete data or the mean of their continuous data by statistically processing the perturbation results reported by the clients. The accuracy of the analysis result depends on the size of the privacy budget, the high/low sensitivity division of the sensitive data, and the number of users.
The structure of the technical scheme of the server side is shown in fig. 1, and mainly comprises a preset module, a communication module and a data aggregation module. The preset module is responsible for determining a high-low sensitive dividing mode and privacy budget of data to be collected. The communication module is responsible for transmitting the division mode and the privacy budget to the client and receiving the disturbance result reported by the client. And the data aggregation module is responsible for running an EM algorithm on the disturbance result reported by the client, estimating the distribution of the user data, and completing the task of frequency distribution estimation or mean value estimation aiming at different types of user data.
Specific implementations of the key technology modules described in this summary will be exemplarily explained below, but the scope of the invention is not limited by such explanation.
1. The main flow of the technology is described:
1.1 The server presets algorithm parameters:
In the preset module, the server divides the set $X$ of all possible values of the data to be collected into two classes, a high-sensitivity value set $X_H$ and a low-sensitivity value set $X_L$, and sets the clients' privacy budget $\varepsilon$. The division designates each value as high-sensitivity or low-sensitivity according to how sensitive a typical user is to that value. For example, in the task of counting disease frequencies in a population, a specific disease such as "AIDS" should be more sensitive than common conditions such as "fever" or "diarrhea", so "AIDS" is assigned to $X_H$, while "fever" and "diarrhea" are assigned to $X_L$. The privacy budget quantifies the maximum difference between the randomized output distributions of different inputs and is a key parameter of the definition of local differential privacy. Specifically, a local differential privacy perturbation algorithm $\mathcal{M}$ with privacy budget $\varepsilon$ satisfies: for any two different inputs $v$ and $v'$ and any randomized output $y$, $\Pr[\mathcal{M}(v) = y] \le e^{\varepsilon} \cdot \Pr[\mathcal{M}(v') = y]$ holds. The high/low sensitivity division $(X_H, X_L)$ and the privacy budget $\varepsilon$ are sent to each client, which is step (1) of FIG. 1.
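The local differential privacy condition above can be checked numerically for a concrete mechanism. The sketch below builds the output-probability matrix of the general random response (GRR) and computes the largest ratio between the probabilities that two different inputs produce the same output; for GRR that ratio equals $e^{\varepsilon}$. The function names and the domain size `d` are assumptions of this example.

```python
import math
from itertools import permutations


def grr_matrix(d, eps):
    """Row i gives the output distribution of GRR for true value i over d values."""
    p = math.exp(eps) / (math.exp(eps) + d - 1)   # report the true value
    q = 1.0 / (math.exp(eps) + d - 1)             # report one particular other value
    return [[p if i == j else q for j in range(d)] for i in range(d)]


def max_output_ratio(M):
    """Largest Pr[output j | input i] / Pr[output j | input k] over i != k and all j."""
    n_out = len(M[0])
    return max(
        M[i][j] / M[k][j]
        for i, k in permutations(range(len(M)), 2)
        for j in range(n_out)
    )
```

For example, `max_output_ratio(grr_matrix(5, 1.0))` evaluates to approximately `math.e`, so GRR meets the bound with equality.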
1.2 Client perturbs the real data:
After receiving the high/low sensitivity division $(X_H, X_L)$ and the privacy budget $\varepsilon$, the client calls the sensitivity-classified perturbation algorithm: depending on whether the real data $v$ belongs to the high-sensitivity value set $X_H$ or the low-sensitivity value set $X_L$, it selects a different perturbation strategy for the real data according to its sensitivity class. The inputs of the sensitivity-classified perturbation algorithm are the user's real data $v$, the high/low sensitivity division $(X_H, X_L)$ and the privacy budget $\varepsilon$; its output is the perturbation result $y$.
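If $v$ is continuous data, the SDPM algorithm is called to perturb $v$. Specifically, if $v \in X_H$, the perturbation algorithm satisfying local differential privacy, i.e. the PM algorithm, is used to perturb $v$; the probability distribution of its randomized output $y$ is shown in FIG. 2, namely:
$$\Pr[y \mid v] = \begin{cases} p, & y \in [l(v), r(v)] \\ p/e^{\varepsilon}, & y \in [-C, l(v)) \cup (r(v), C] \end{cases}$$
where $C = \dfrac{e^{\varepsilon/2}+1}{e^{\varepsilon/2}-1}$, $p = \dfrac{e^{\varepsilon}-e^{\varepsilon/2}}{2e^{\varepsilon/2}+2}$, $l(v) = \dfrac{C+1}{2}\,v - \dfrac{C-1}{2}$, $r(v) = l(v) + C - 1$. If $v \in X_L$, a perturbation algorithm is used to perturb $v$ such that the probability of the randomized output on $X_L$ is entirely concentrated at $v$, the probability on $X_L \setminus \{v\}$ is zero, and the probability on the other outputs equals the output probability of the PM algorithm. Let $X_L = [a, b]$; the probability distribution of the randomized output $y$ is shown in FIG. 3, namely:
$$\Pr[y \mid v] = \begin{cases} q, & y = v \\ 0, & y \in [a, b] \setminus \{v\} \\ p/e^{\varepsilon}, & y \in [-C, a) \cup (b, C] \end{cases}$$
where $C = \dfrac{e^{\varepsilon/2}+1}{e^{\varepsilon/2}-1}$, $p = \dfrac{e^{\varepsilon}-e^{\varepsilon/2}}{2e^{\varepsilon/2}+2}$, $q = 1 - \dfrac{p}{e^{\varepsilon}}\bigl(2C - (b - a)\bigr)$.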
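As a sanity check on the continuous case, the following worked derivation verifies that the output distributions integrate to one. It assumes the standard form of the piecewise mechanism (input in $[-1, 1]$, output in $[-C, C]$) and writes $s = e^{\varepsilon/2}$; the symbols $C$, $p$, $q$, $a$, $b$ are notation chosen for this note, and the low-sensitivity case additionally assumes that every output outside $[a, b]$ has the lower PM density $p/e^{\varepsilon}$.

```latex
\begin{align*}
C - 1 &= \frac{2}{s-1}, \qquad C + 1 = \frac{2s}{s-1}, \qquad
p = \frac{s(s-1)}{2(s+1)}, \qquad \frac{p}{e^{\varepsilon}} = \frac{s-1}{2s(s+1)},\\[4pt]
\underbrace{p\,(C-1)}_{\text{mass on } [l(v),\,r(v)]}
  + \underbrace{\frac{p}{e^{\varepsilon}}\,(C+1)}_{\text{mass elsewhere}}
  &= \frac{s}{s+1} + \frac{1}{s+1} = 1,\\[4pt]
\underbrace{q}_{\text{point mass at } v}
  + \underbrace{\frac{p}{e^{\varepsilon}}\bigl(2C - (b-a)\bigr)}_{\text{mass outside } [a,\,b]}
  &= 1 \quad\Longrightarrow\quad q = 1 - \frac{p}{e^{\varepsilon}}\bigl(2C - (b-a)\bigr).
\end{align*}
```

Since $\frac{p}{e^{\varepsilon}} \cdot 2C = 1/s < 1$, the point mass $q$ is strictly positive for every $\varepsilon > 0$ under these assumptions.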
If $x$ is discrete data, the SDGRR algorithm is used to perturb $x$. Specifically, if $x \in X_H$, a perturbation algorithm satisfying local differential privacy, namely the GRR algorithm, is used to perturb $x$. The probability distribution of its randomized output $y$ is shown in fig. 4, namely:

$$\Pr[y \mid x] = \begin{cases} p, & y = x \\ q, & y \ne x \end{cases}$$

where $p = \frac{e^{\varepsilon}}{e^{\varepsilon} + d - 1}$, $q = \frac{1}{e^{\varepsilon} + d - 1}$, and $d = |X_H| + |X_L|$ is the number of possible values.

If $x \in X_L$, $x$ is perturbed with a privacy perturbation algorithm whose randomized output concentrates all of its probability within $X_L$ at $x$ itself and has zero probability on $X_L \setminus \{x\}$, while the probability of each other output equals the output probability of GRR. The probability distribution of its randomized output $y$ is shown in fig. 5, namely:

$$\Pr[y \mid x] = \begin{cases} p', & y = x \\ q, & y \in X_H \\ 0, & y \in X_L \setminus \{x\} \end{cases}$$

where $q = \frac{1}{e^{\varepsilon} + d - 1}$, $p' = 1 - |X_H|\,q$.
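The discrete case can be sketched in Python as follows. This is a minimal illustration of the behaviour described above, not the patent's implementation. It assumes that GRR keeps the true value with probability e^eps / (e^eps + d - 1) and reports each other value with probability 1 / (e^eps + d - 1), where d is the total number of values. The function names `sdgrr_distribution` and `sdgrr_perturb` are hypothetical.

```python
import math
import random


def sdgrr_distribution(x, high, low, eps):
    """Return Pr[y | x] of the sketched SDGRR as a dict keyed by output value."""
    domain = list(high) + list(low)
    d = len(domain)
    p = math.exp(eps) / (math.exp(eps) + d - 1)  # GRR: report the true value
    q = 1.0 / (math.exp(eps) + d - 1)            # GRR: report one other value
    if x in high:
        # High-sensitivity input: plain GRR over the whole domain.
        return {y: (p if y == x else q) for y in domain}
    if x in low:
        # Low-sensitivity input: never report another low-sensitivity value.
        dist = {y: q for y in high}
        dist.update({y: 0.0 for y in low})
        dist[x] = 1.0 - len(high) * q
        return dist
    raise ValueError(f"{x!r} is not in the value domain")


def sdgrr_perturb(x, high, low, eps, rng=random):
    """Sample one perturbation result from the SDGRR output distribution."""
    dist = sdgrr_distribution(x, high, low, eps)
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values], k=1)[0]


high = ["AIDS"]
low = ["fever", "diarrhea"]
print(sdgrr_distribution("fever", high, low, eps=1.0))
```

For outputs in the high-sensitivity set, the ratio between the largest and smallest output probability over all inputs is p / q = e^eps in this sketch, so a high-sensitivity report is protected to the same degree as under plain GRR.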
Perturbing the real data means sampling from the corresponding randomized output distribution. Finally, the client uses its communication module to send the perturbation result $y$ to the server, which is step (2) of fig. 1.
1.3 Server estimates the distribution:
Through its communication module, the server receives the perturbation results $Y$ sent by all clients and then processes them in the data aggregation module to complete the analysis task. Specifically, the data aggregation module stores an EM algorithm whose inputs are the clients' perturbation results $Y$, the high/low sensitivity partition $S$ and the privacy budget $\varepsilon$, and whose output is the estimated distribution $\hat{\theta}$ of the users' real data. The EM algorithm first computes a transition probability matrix $M$ from the partition $S$, the privacy budget $\varepsilon$ and the sensitivity-graded perturbation algorithm, where the matrix element $M_{ij}$ is the probability that the $i$-th input value is perturbed into the $j$-th output value. For discrete data, this probability is the probability of a single point of the randomized output distribution; for continuous data, it is the integral of the randomized output distribution over a subinterval after the input/output space has been discretized. The EM algorithm then obtains the distribution $\tilde{f}$ of the perturbed data from the clients' perturbation results $Y$ and, in combination with the transition probability matrix $M$, iteratively updates the estimate $\hat{\theta}$ of the probability distribution of the real data, namely:

$$\hat{\theta}_i^{(t+1)} = \hat{\theta}_i^{(t)} \sum_j \tilde{f}_j \, \frac{M_{ij}}{\sum_k \hat{\theta}_k^{(t)} M_{kj}}$$

where $\hat{\theta}_i^{(t)}$ denotes the $i$-th component of the estimate $\hat{\theta}$ in the $t$-th round of updating, i.e. the probability estimate of the $i$-th possible value, and $\tilde{f}_j$ is the $j$-th component of the perturbed-data distribution $\tilde{f}$, i.e. the frequency of the $j$-th possible value. The EM algorithm eventually converges to the maximum likelihood estimate of the real data distribution. Finally, the frequency distribution estimation task or the mean estimation task is completed from the estimate $\hat{\theta}$ output by the EM algorithm: for discrete data, the frequency of a value $v$ is $\hat{\theta}_{i_v}$, where $i_v$ is the index of $v$ in the value space of the user data; for continuous data, the mean is the expectation $\mathbb{E}[\hat{\theta}]$ of $\hat{\theta}$.
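The server-side estimation can be sketched in Python as follows. This is an illustrative implementation of the standard EM update for a known transition matrix, not the patent's code. It assumes that `M[i][j]` is the probability that input value i is reported as output value j and that `f[j]` is the observed frequency of output j. The helper names `sdgrr_matrix`, `em_estimate` and `bucket_mean` are hypothetical, and `bucket_mean` additionally assumes equal-width buckets represented by their midpoints.

```python
import math


def sdgrr_matrix(n_high, n_low, eps):
    """Transition matrix of the sketched SDGRR; high-sensitivity values first."""
    d = n_high + n_low
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    q = 1.0 / (math.exp(eps) + d - 1)
    p_keep = 1.0 - n_high * q
    M = []
    for i in range(d):
        if i < n_high:   # high-sensitivity input: plain GRR row
            row = [p if j == i else q for j in range(d)]
        else:            # low-sensitivity input: no mass on other low values
            row = [q if j < n_high else (p_keep if j == i else 0.0)
                   for j in range(d)]
        M.append(row)
    return M


def em_estimate(M, f, rounds=1000, tol=1e-10):
    """Iterate theta_i <- theta_i * sum_j f_j * M_ij / sum_k theta_k * M_kj."""
    n_in, n_out = len(M), len(f)
    theta = [1.0 / n_in] * n_in            # start from the uniform distribution
    for _ in range(rounds):
        # Probability of each output under the current estimate.
        mix = [sum(theta[k] * M[k][j] for k in range(n_in)) for j in range(n_out)]
        new = [theta[i] * sum(f[j] * M[i][j] / mix[j]
                              for j in range(n_out) if mix[j] > 0)
               for i in range(n_in)]
        change = sum(abs(u - v) for u, v in zip(new, theta))
        theta = new
        if change < tol:
            break
    return theta


def bucket_mean(theta, lo, hi):
    """Mean of a histogram estimate over equal-width buckets on [lo, hi]."""
    width = (hi - lo) / len(theta)
    return sum(t * (lo + (i + 0.5) * width) for i, t in enumerate(theta))


M = sdgrr_matrix(1, 2, eps=1.0)
truth = [0.2, 0.5, 0.3]
# Noise-free output frequencies implied by `truth`, used only to show the fit.
f = [sum(truth[i] * M[i][j] for i in range(3)) for j in range(3)]
print(em_estimate(M, f))
```

In practice `f` would be the empirical frequency of the reports received from clients. The noise-free `f` used here only illustrates that the update recovers the input distribution when the transition matrix is invertible.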
To demonstrate how the invention improves the accuracy of data analysis, experimental comparisons between the invention and state-of-the-art schemes based on local differential privacy are presented below in Tables 1 and 2. The two tables report the mean square error (MSE) of the SDGRR/SDPM schemes of the invention and of the state-of-the-art local differential privacy schemes GRR/PM under different privacy budgets, for two analysis tasks. The first task is frequency estimation, using all data under the education attribute of the 1990 U.S. census data set provided by the UCI KDD website (UCI Knowledge Discovery in Databases Archive) to simulate users' discrete private data. The second task is mean estimation, using the synthetic human height data provided by SOCR (Statistics Online Computational Resource) to simulate users' continuous private data.
Table 1. Mean square error of frequency estimation results under different privacy budgets
Table 2. Mean square error of mean estimation results under different privacy budgets
The experimental results show that, for every privacy budget tested, the mean square error of the estimates produced by the SDGRR/SDPM schemes is lower than that of GRR/PM. Especially when the privacy budget is small, the mean square error of SDGRR is about one order of magnitude lower than that of GRR; for SDPM, the reduction relative to PM reaches two orders of magnitude at one budget setting.
As described above, the present invention can be preferably implemented.
All features disclosed in the embodiments of this specification, and all steps of any method or process implicitly disclosed, may be combined and/or extended and substituted in any way, except for mutually exclusive features and/or steps.
The foregoing description of the preferred embodiment of the invention is not intended to limit the invention in any way, but rather to cover all modifications, equivalents, improvements and alternatives falling within the spirit and principles of the invention.

Claims (8)

1. A data collection method based on sensitivity grading, characterized in that real data are graded according to their degree of sensitivity, so that a client adopts different perturbation strategies according to the sensitivity grade of the data when perturbing the real data;
the method comprises the following steps:
S1, the server divides the value range of the data to be collected into two sensitivity classes, high-sensitivity $X_H$ and low-sensitivity $X_L$, determines the value of the privacy budget $\varepsilon$, and sends the high/low sensitivity partition $S$ and the privacy budget $\varepsilon$ to each client;
S2, after receiving the high/low sensitivity partition $S$ and the privacy budget $\varepsilon$ sent by the server, each client invokes the sensitivity-graded perturbation algorithm to output a perturbation result of the real data, and sends the perturbation result to the server;
in step S2, if the real data is high-sensitivity, the probability distribution of the perturbation result of the sensitivity-graded perturbation algorithm is the same as that of a perturbation algorithm satisfying local differential privacy; if the real data is low-sensitivity, the probability distribution of the perturbation result of the sensitivity-graded perturbation algorithm satisfies the following: the probability that the perturbation result is low-sensitivity data other than the real data is 0, and the probability distribution over perturbation results that are not low-sensitivity data is the same as that of the perturbation algorithm satisfying local differential privacy.
2. The data collection method based on sensitivity grading according to claim 1, wherein the perturbation algorithm satisfying local differential privacy is: for continuous data, the piecewise mechanism (PM); for discrete data, generalized randomized response (GRR).
3. The data collection method based on sensitivity grading according to claim 1, wherein if the real data $x$ is continuous data, the sensitivity-graded perturbation algorithm is a sensitivity-graded piecewise mechanism, specifically:
if $x \in X_H$, a perturbation algorithm satisfying local differential privacy is used to perturb $x$, and the probability distribution of the randomized output $y$ is:

$$\mathrm{pdf}(y \mid x) = \begin{cases} p, & y \in [l(x), r(x)] \\ q, & y \in [-C, C] \setminus [l(x), r(x)] \end{cases}$$

where $C = \frac{e^{\varepsilon/2} + 1}{e^{\varepsilon/2} - 1}$, $p = \frac{e^{\varepsilon} - e^{\varepsilon/2}}{2e^{\varepsilon/2} + 2}$, $q = p / e^{\varepsilon}$, $l(x) = \frac{C + 1}{2}x - \frac{C - 1}{2}$, $r(x) = l(x) + C - 1$;
if $x \in X_L$, assuming $X_L = [a, b]$, the probability distribution of the randomized output $y$ is:

$$\Pr[y = x \mid x] = p', \qquad \mathrm{pdf}(y \mid x) = \begin{cases} q, & y \in [-C, C] \setminus [a, b] \\ 0, & y \in [a, b] \setminus \{x\} \end{cases}$$

where $C$ and $q$ are as defined above and $p' = 1 - q\,(2C - (b - a))$;
wherein $x$ denotes the real data, $y$ denotes the perturbation result, $\mathrm{pdf}(y \mid x)$ denotes the probability distribution of the perturbation result $y$ when the real data is $x$, $S$ denotes the high/low sensitivity partition, $X_H$ denotes the high-sensitivity data value interval, $X_L$ denotes the low-sensitivity data value interval, $\varepsilon$ denotes the privacy budget, $C$ denotes the maximum value of the perturbation result of the piecewise mechanism and of the sensitivity-graded piecewise mechanism, $-C$ denotes the minimum value of the perturbation result of the piecewise mechanism and of the sensitivity-graded piecewise mechanism, $[l(x), r(x)]$ denotes the interval of the piecewise mechanism's perturbation results that contains the real data, into which the piecewise mechanism perturbs the real data with higher probability, $l(x)$ is the left endpoint of this interval, $r(x)$ is the right endpoint of this interval, $p$ denotes the probability that the perturbation result of the piecewise mechanism is any given data in $[l(x), r(x)]$, $p'$ denotes the probability that the sensitivity-graded piecewise mechanism perturbs low-sensitivity data into the original real data, $q$ denotes the probability that the perturbation result of the piecewise mechanism is any given data outside $[l(x), r(x)]$ and is also the probability that the sensitivity-graded piecewise mechanism perturbs low-sensitivity data into any given data outside $X_L$, $a$ denotes the left endpoint of the low-sensitivity data value interval $X_L$, and $b$ denotes the right endpoint of the low-sensitivity data value interval $X_L$.
4. The data collection method based on sensitivity grading according to claim 1, wherein if the real data $x$ is discrete data, the sensitivity-graded perturbation algorithm is a sensitivity-graded generalized randomized response, specifically:
if $x \in X_H$, $x$ is perturbed and the probability distribution of the randomized output $y$ is:

$$\Pr[y \mid x] = \begin{cases} p, & y = x \\ q, & y \ne x \end{cases}$$

where $p = \frac{e^{\varepsilon}}{e^{\varepsilon} + d - 1}$, $q = \frac{1}{e^{\varepsilon} + d - 1}$, with $d = |X_H| + |X_L|$ the number of possible values;
if $x \in X_L$, the probability distribution of the randomized output $y$ is:

$$\Pr[y \mid x] = \begin{cases} p', & y = x \\ q, & y \in X_H \\ 0, & y \in X_L \setminus \{x\} \end{cases}$$

where $q = \frac{1}{e^{\varepsilon} + d - 1}$, $p' = 1 - |X_H|\,q$;
wherein $x$ denotes the real data, $y$ denotes the perturbation result, $\Pr[y \mid x]$ denotes the probability distribution of the perturbation result $y$ when the real data is $x$, $S$ denotes the high/low sensitivity partition, $X_H$ denotes the set of high-sensitivity data values, $X_L$ denotes the set of low-sensitivity data values, $\varepsilon$ denotes the privacy budget, $p$ denotes the probability that generalized randomized response perturbs the real data into the original real data, $q$ denotes the probability that generalized randomized response perturbs the real data into a given value other than the real data and is also the probability that the sensitivity-graded generalized randomized response perturbs low-sensitivity data into any given data outside $X_L$, and $p'$ denotes the probability that the sensitivity-graded generalized randomized response perturbs low-sensitivity data into the original real data.
5. A data analysis method based on sensitivity grading, characterized by comprising the data collection method based on sensitivity grading according to any one of claims 1 to 4, and further comprising the following step:
S3, after receiving the perturbation results of the clients, the server applies an expectation-maximization (EM) estimation algorithm to the perturbation results to complete the analysis task.
6. The method of claim 5, wherein step S3 comprises the following steps:
S31, computing a transition probability matrix $M$ from the high/low sensitivity partition $S$, the privacy budget $\varepsilon$ and the sensitivity-graded perturbation algorithm;
S32, the expectation-maximization estimation algorithm obtains the frequency distribution $\tilde{f}$ from the clients' perturbation results $Y$ and, in combination with the transition probability matrix $M$, iteratively updates the estimate $\hat{\theta}$ of the probability distribution of the real data as follows:

$$\hat{\theta}_i^{(t+1)} = \hat{\theta}_i^{(t)} \sum_j \tilde{f}_j \, \frac{M_{ij}}{\sum_k \hat{\theta}_k^{(t)} M_{kj}}$$

wherein $i$ and $k$ denote the indices of two real data values, $j$ denotes the index of a perturbation result, $M_{ij}$ denotes the matrix element in row $i$, column $j$ of the probability matrix $M$, i.e. the probability that the $i$-th real data value is perturbed into the $j$-th perturbation result, $t$ denotes the number of update rounds, $\hat{\theta}_i^{(t+1)}$ denotes the $i$-th component of $\hat{\theta}$ in the $(t+1)$-th round of updating, i.e. the probability estimate of the $i$-th real data value in round $t+1$, $\hat{\theta}_i^{(t)}$ denotes the $i$-th component of $\hat{\theta}$ in the $t$-th round of updating, i.e. the probability estimate of the $i$-th real data value in round $t$, $\hat{\theta}_k^{(t)}$ denotes the $k$-th component of $\hat{\theta}$ in the $t$-th round of updating, i.e. the probability estimate of the $k$-th real data value in round $t$, and $\tilde{f}_j$ denotes the $j$-th component of $\tilde{f}$, i.e. the frequency of the $j$-th value among the perturbation results $Y$;
S33, the EM algorithm finally converges to the estimate $\hat{\theta}$ of the probability distribution of the real data; the analysis task is completed according to $\hat{\theta}$.
7. The method of claim 6, wherein in step S31, for continuous data, after the input and output intervals have been discretized, $M_{ij}$ is the integral of the probability with which the $i$-th subinterval of the real data is perturbed into the $j$-th subinterval of the perturbation results; for discrete data, $M_{ij}$ is the probability that the $i$-th real data value is perturbed into the $j$-th perturbation result value;
in step S33, for continuous data, completing the analysis task means completing the mean estimation task, and the mean estimation result is the expectation $\mathbb{E}[\hat{\theta}]$ of $\hat{\theta}$; for discrete data, completing the analysis task means completing the frequency distribution estimation task, and the frequency distribution estimation result is: the frequency of the real data value $v$ is $\hat{\theta}_{i_v}$; wherein $v$ denotes a value of the real data and $i_v$ is the index of $v$ in the value space of the real data.
8. A data analysis system based on sensitivity grading, characterized in that the system is used to implement the data analysis method based on sensitivity grading according to any one of claims 5 to 7, and comprises a server and one or more clients, the server being communicatively connected to each client;
the client comprises:
a data storage module, configured to store the user's real data;
a sensitivity-graded perturbation module, configured to store the sensitivity-graded perturbation algorithm and apply it to the real data to obtain a perturbation result;
a client communication module, configured to transmit the perturbation result to the server;
the server comprises:
a preset module, configured to preset the parameters of the client perturbation algorithm, the parameters comprising the high/low sensitivity partition and the privacy budget;
a communication module, configured to send the preset perturbation algorithm parameters to the clients and to receive the perturbation results transmitted by the clients;
and a data aggregation module, configured to run the expectation-maximization estimation algorithm on the perturbation results reported by the clients and complete the analysis task.
CN202311649483.3A 2023-12-05 2023-12-05 Data collection method, analysis method and analysis system based on sensitivity classification Active CN117349896B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311649483.3A CN117349896B (en) 2023-12-05 2023-12-05 Data collection method, analysis method and analysis system based on sensitivity classification

Publications (2)

Publication Number Publication Date
CN117349896A CN117349896A (en) 2024-01-05
CN117349896B true CN117349896B (en) 2024-02-06

Family

ID=89357948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311649483.3A Active CN117349896B (en) 2023-12-05 2023-12-05 Data collection method, analysis method and analysis system based on sensitivity classification

Country Status (1)

Country Link
CN (1) CN117349896B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant