CN114741726A - Data processing method and device based on privacy protection and electronic equipment - Google Patents

Data processing method and device based on privacy protection and electronic equipment Download PDF

Info

Publication number
CN114741726A
Authority
CN
China
Prior art keywords
data
noise
value
chi
original
Prior art date
Legal status
Pending
Application number
CN202210385699.2A
Other languages
Chinese (zh)
Inventor
韩紫微
宋启威
杨妍
Current Assignee
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date
Filing date
Publication date
Application filed by Agricultural Bank of China filed Critical Agricultural Bank of China
Priority to CN202210385699.2A priority Critical patent/CN114741726A/en
Publication of CN114741726A publication Critical patent/CN114741726A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the present application discloses a data processing method and apparatus based on privacy protection, and an electronic device. The method comprises: first, acquiring original data, where the original data is categorical data; after determining the chi-square value of the original data, determining the value range of the chi-square value of the noise data, where the noise data is obtained by adding noise to the original data, the magnitude relationship between the chi-square value of the original data and the critical value is the same as that between the chi-square value of the noise data and the critical value, and the chi-square value of the noise data is smaller than the chi-square value of the original data; determining the value range of the first noise frequency according to the chi-square value of the original data and the value range of the chi-square value of the noise data; determining the value of the first noise frequency according to its value range and the difference between the original data and the noise data; and adding noise to the original data based on the value of the first noise frequency to obtain the noise data, thereby improving the usability of privacy-protected data.

Description

Data processing method and device based on privacy protection and electronic equipment
Technical Field
The present invention relates to the field of computers, and in particular, to a data processing method and apparatus based on privacy protection, and an electronic device.
Background
With the development of scientific technology, research on data is receiving attention. For example, researchers have studied influence relationships between variables using data on a data distribution platform for the purpose of scientific research. In the data distribution scenario, there may be a problem of data privacy disclosure.
Data is usually processed for privacy protection to reduce data privacy disclosure. However, current privacy-protection processing greatly reduces the value of the data to scientific research; that is, the usability of the privacy-protected data is low.
Disclosure of Invention
In view of this, the present application provides a data processing method and apparatus based on privacy protection, and an electronic device, so as to improve the usability of data subjected to privacy protection.
In the present application, data is acquired and processed only with the knowledge and consent of the data owner.
In a first aspect, the present application provides a data processing method based on privacy protection, including:
acquiring original data, where the values of the original data comprise only a first value and a second value, and the original data comprises original data of a case group and original data of a control group;
determining a chi-square value of the original data;
determining the value range of the chi-square value of the noise data according to the chi-square value of the original data; the noise data comprise noise data of the case group and noise data of the control group, where the noise data of the case group are obtained by adding noise to the original data of the case group, and the noise data of the control group are obtained by adding noise to the original data of the control group; the magnitude relationship between the chi-square value of the original data and the critical value is the same as that between the chi-square value of the noise data and the critical value; and the chi-square value of the noise data is smaller than the chi-square value of the original data;
determining the value range of the first noise frequency according to the chi-square value of the original data and the value range of the chi-square value of the noise data, where the first noise frequency is the number of occurrences of the first value in the noise data of the case group;
determining the value of the first noise frequency according to the value range of the first noise frequency and the difference between the original data and the noise data;
and performing noise-adding processing on the original data of the case group and the original data of the control group, respectively, based on the value of the first noise frequency, to obtain the noise data of the case group and the noise data of the control group.
In a possible implementation, determining the chi-square value of the original data specifically includes:
determining the chi-square value of the original data according to the first original frequency, the second original frequency, the first number, and the second number; where the first original frequency is the number of occurrences of the first value in the original data of the case group, the second original frequency is the number of occurrences of the second value in the original data of the case group, the first number is the sum of the numbers of occurrences of the first value in the original data of the case group and of the control group, and the second number is the sum of the numbers of occurrences of the second value in the original data of the case group and of the control group.
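The frequencies described above are the cells and marginals of a 2x2 contingency table, for which the chi-square statistic has a standard closed form. A minimal sketch, where the function name and the example counts are illustrative rather than taken from the patent:

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table.

    a: count of the first value in the case group (first original frequency)
    b: count of the second value in the case group (second original frequency)
    c: count of the first value in the control group
    d: count of the second value in the control group
    """
    n = a + b + c + d
    # Marginal totals: row sums are the group sizes; column sums are
    # the "first number" (a + c) and "second number" (b + d) above.
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("degenerate table: a marginal total is zero")
    return n * (a * d - b * c) ** 2 / denom

# Example: 10/20 split in the case group, 30/5 split in the control group
print(round(chi_square_2x2(10, 20, 30, 5), 4))  # → 18.7262
```

A perfectly balanced table (all cells equal) yields a statistic of zero, as expected for independent variables.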
In one possible implementation, the critical value is obtained by looking up a chi-square distribution critical-value table.
In a possible implementation manner, determining a value of the first noise frequency according to a value range of the first noise frequency and a difference between original data and noise data specifically includes:
the value of the first noise frequency is determined within a range of values of the first noise frequency so as to maximize a difference between the original data and the noise data.
In one possible implementation, the difference between the raw data and the noise data specifically includes an expected estimation error between the raw data and the noise data.
In a possible implementation, after performing noise-adding processing on the original data of the case group and of the control group, respectively, based on the value of the first noise frequency to obtain the noise data of the case group and of the control group, the method further includes:
post-processing the noise data of the case group and of the control group, the post-processing comprising at least one of: rounding the data to integers, or rounding the data according to the precision requirement.
In one possible implementation, the method further includes:
acquiring identity data, where the identity data is used to identify the identity of an individual;
and performing obfuscation processing on the identity data, the obfuscation processing comprising: de-identifying the identity data, or generalizing the identity data, where the generalization replaces differing characters in a plurality of identity data with preset characters.
In a second aspect, the present application provides a data processing apparatus based on privacy protection, the apparatus comprising:
a data acquisition unit for acquiring original data, where the values of the original data comprise only a first value and a second value, and the original data comprises original data of a case group and original data of a control group;
a first determining unit for determining the chi-square value of the original data;
a second determining unit for determining the value range of the chi-square value of the noise data according to the chi-square value of the original data; the noise data comprise noise data of the case group and noise data of the control group, where the noise data of the case group are obtained by adding noise to the original data of the case group, and the noise data of the control group are obtained by adding noise to the original data of the control group; the magnitude relationship between the chi-square value of the original data and the critical value is the same as that between the chi-square value of the noise data and the critical value; and the chi-square value of the noise data is smaller than the chi-square value of the original data;
a third determining unit for determining the value range of the first noise frequency according to the chi-square value of the original data and the value range of the chi-square value of the noise data, where the first noise frequency is the number of occurrences of the first value in the noise data of the case group;
a fourth determining unit for determining the value of the first noise frequency according to the value range of the first noise frequency and the difference between the original data and the noise data;
and a noise-adding processing unit for performing noise-adding processing on the original data of the case group and the original data of the control group, respectively, based on the value of the first noise frequency, to obtain the noise data of the case group and the noise data of the control group.
In a third aspect, the present application provides an electronic device comprising a processor and a memory, where the memory stores code and the processor is configured to call the code stored in the memory to perform any one of the methods described above.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program for performing any one of the methods described above.
Drawings
Fig. 1 is a flowchart of a data processing method based on privacy protection according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a data processing apparatus based on privacy protection according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to facilitate understanding of technical solutions provided in the embodiments of the present application, a data processing method and apparatus based on privacy protection and an electronic device provided in the embodiments of the present application are described below with reference to the accompanying drawings.
While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Other embodiments, which can be derived by those skilled in the art from the embodiments given herein without inventive effort, also fall within the scope of the present application.
In the claims and specification of the present application and in the drawings accompanying the description, the terms "comprise" and "have" and any variations thereof, are intended to cover non-exclusive inclusions.
Data distribution platforms for medical and demographic data used in scientific research generally publish survey data or provide a data query interface, allowing researchers to perform statistical difference analysis (for example, studying influence relationships between categorical or quantitative variables) based on queried statistics or downloaded raw data. However, such a data distribution scenario raises data privacy problems: an attacker may combine statistical data with relevant background knowledge to mount a differential attack and determine whether a certain individual exists in the database; an attacker may combine the original data with other databases to mount inference attacks, attribute attacks, and the like; and even databases that publish only a statistics query interface remain vulnerable to crawlers, which can scrape users' raw data.
In such a data distribution scenario, researchers want to acquire data while data providers must weigh the degree of privacy protection. Here cryptography cannot exert its advantages, and access control raises the threshold for acquiring data for research. By contrast, differential privacy protects data privacy by applying an appropriate perturbation to the data, adding noise that follows the Laplace distribution to the original data within a certain privacy tolerance, and is better suited to protecting data in a distribution scenario.
Data is usually processed for privacy protection to reduce data privacy disclosure. However, current privacy-protection processing greatly reduces the value of the data to scientific research; that is, the usability of the privacy-protected data is low.
Based on this, in the embodiment of the present application, original data is first acquired; the values of the original data comprise only a first value and a second value, and the original data comprises original data of a case group and original data of a control group, i.e., the original data is categorical data. After the chi-square value of the original data is determined, the value range of the chi-square value of the noise data is determined according to the chi-square value of the original data. The noise data comprise noise data of the case group, obtained by adding noise to the original data of the case group, and noise data of the control group, obtained by adding noise to the original data of the control group. The magnitude relationship between the chi-square value of the original data and the critical value is the same as that between the chi-square value of the noise data and the critical value, and the chi-square value of the noise data is smaller than the chi-square value of the original data. Next, the value range of the first noise frequency is determined according to the chi-square value of the original data and the value range of the chi-square value of the noise data, where the first noise frequency is the number of occurrences of the first value in the noise data of the case group. The value of the first noise frequency is then determined according to its value range and the difference between the original data and the noise data. Finally, based on the value of the first noise frequency, noise-adding processing is performed on the original data of the case group and of the control group, respectively, to obtain the noise data of the case group and of the control group.
With the technical solution of the embodiment of the present application, the original data (binary-valued data) is noised to obtain the noise data, realizing privacy protection while taking data usability into account during the privacy-protection processing. Specifically, the statistical significance of the noise data is kept consistent with that of the original data by constraining the value range of the chi-square value of the noise data, and the degree of privacy protection, represented by the difference between the original data and the noise data, is taken into account when determining the value of the first noise frequency, so that the resulting noise data achieves both privacy protection and high data usability.
In the embodiment of the present application, data is acquired and processed only with the knowledge and consent of the data owner.
The application provides a data processing method based on privacy protection.
Referring to fig. 1, fig. 1 is a flowchart of a data processing method based on privacy protection according to an embodiment of the present disclosure.
In the embodiment of the present application, data is acquired and processed only with the knowledge and consent of the data owner.
As shown in fig. 1, the data processing method based on privacy protection in the embodiment of the present application includes S101 to S106.
S101, acquiring original data, where the values of the original data comprise only a first value and a second value, and the original data comprises original data of a case group and original data of a control group.
The original data are binary-valued data; the values of the original data comprise only two kinds.
For example, consider a binary data set in which each of two variables takes two values. Suppose the influence of variable A on variable B is to be studied through a chi-square test, where variable A takes the value 0 or 1 and variable B takes the value yes or no; a sample set containing variable A and variable B is a binary data set.
As for the case group and the control group: for example, when studying whether age has an effect (or a significant effect) on the occurrence of a certain disease, the case group and the control group represent the diseased and normal populations, respectively. The data of the case group correspond to diseased individuals, and the data of the control group correspond to normal individuals.
S102, determining a chi-square value of the original data.
The chi-square value is a statistic used in non-parametric tests; it is mainly used in non-parametric statistical analysis and is the main test index in the chi-square test.
The chi-square test is a widely used hypothesis-testing method for count data. It belongs to the category of non-parametric tests and is mainly used to compare two or more sample rates (composition ratios) and to analyze the correlation between two categorical variables. Its basic idea is to compare the degree of agreement, or goodness of fit, between theoretical frequencies and actual frequencies. The chi-square test is typically applied to categorical variables.
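As a concrete illustration of the test decision: for a 2x2 table (one degree of freedom) at the common significance level alpha = 0.05, the chi-square distribution critical value is approximately 3.841, and the association is judged statistically significant when the statistic exceeds it. A minimal sketch, where the helper name is illustrative:

```python
# Chi-square critical value for df=1, alpha=0.05 (from a standard
# chi-square distribution critical-value table).
CRITICAL_DF1_ALPHA_005 = 3.841

def is_significant(chi_square_value, critical=CRITICAL_DF1_ALPHA_005):
    """True if the chi-square statistic exceeds the critical value,
    i.e. the association is statistically significant at this level."""
    return chi_square_value > critical

print(is_significant(18.73))  # → True
print(is_significant(1.20))   # → False
```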
S103, determining the value range of the chi-square value of the noise data according to the chi-square value of the original data, where the noise data is obtained by adding noise to the original data.
The noise data includes only two values, a first value and a second value, and the noise data includes noise data of the case group and noise data of the control group.
The noise data are obtained by adding noise to the original data; that is, the noise data of the case group are obtained by adding noise to the original data of the case group, and the noise data of the control group are obtained by adding noise to the original data of the control group.
Adding noise to the original data applies a disturbance, thereby realizing privacy protection of the original data.
The magnitude relation between the chi-square value of the original data and the critical value is the same as the magnitude relation between the chi-square value of the noise data and the critical value; the chi-squared value of the noise data is less than the chi-squared value of the original data.
The magnitude relationship between the chi-square value of the original data and the critical value is the same as that between the chi-square value of the noise data and the critical value, so that the chi-square value of the noise data remains within the statistical-significance range of the original data, reducing the degradation of the utility of the noise data.
The chi-square value of the noise data does not exceed the original chi-square value, to avoid excessive noise disturbance.
By constraining the relationship between the chi-square value of the noise data and the original chi-square value, the reduction in the utility of the noise data can be limited.
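The two constraints above confine the chi-square value of the noise data to an interval. A sketch of that interval under the patent's two conditions (same side of the critical value, strictly smaller than the original value); the exact handling of the interval endpoints is an assumption:

```python
def noisy_chi_square_range(chi_orig, critical):
    """Admissible (low, high) interval for the chi-square value of the
    noise data, given:
      * it stays on the same side of the critical value as the original
      * it is strictly smaller than the original chi-square value
    """
    if chi_orig > critical:
        # A significant result must remain significant after noising.
        return (critical, chi_orig)
    # A non-significant result must remain non-significant.
    return (0.0, chi_orig)

print(noisy_chi_square_range(18.73, 3.841))  # → (3.841, 18.73)
print(noisy_chi_square_range(1.2, 3.841))    # → (0.0, 1.2)
```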
S104, determining the value range of the first noise frequency according to the chi-square value of the original data and the value range of the chi-square value of the noise data, where the first noise frequency is the number of occurrences of the first value in the noise data of the case group.
The number of occurrences of the first value in the original data of the case group can be obtained by statistical analysis of the original data; after noise disturbance is applied to the original data, this number changes to the first noise frequency.
And S105, determining the value of the first noise frequency according to the value range of the first noise frequency and the difference between the original data and the noise data.
The difference between the raw data and the noisy data is used to measure the degree of privacy protection for the raw data.
The greater the difference between the original data and the noise data, the greater the noise perturbation applied to the original data, and the greater the degree of data privacy protection.
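Steps S104 and S105 can be illustrated with a brute-force search: among candidate first noise frequencies whose resulting chi-square value satisfies the constraints of S103, pick the one farthest from the original frequency, since a larger difference means stronger privacy protection. This sketch holds the control-group counts and the case-group size fixed for simplicity, which is an assumption; the patent does not spell out its exact optimization procedure here:

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom

def pick_first_noise_frequency(a, b, c, d, critical=3.841):
    """Pick a' (first-value count in the noisy case group) so that the
    noisy chi-square stays on the same side of the critical value as the
    original, does not exceed the original, and |a' - a| is maximal.
    The case-group size a + b and the control-group counts are held
    fixed in this illustration."""
    chi_orig = chi_square_2x2(a, b, c, d)
    size = a + b
    best = a  # fall back to no perturbation if nothing qualifies
    for a_prime in range(0, size + 1):
        chi_new = chi_square_2x2(a_prime, size - a_prime, c, d)
        same_side = (chi_new > critical) == (chi_orig > critical)
        if same_side and chi_new < chi_orig and abs(a_prime - a) > abs(best - a):
            best = a_prime
    return best

print(pick_first_noise_frequency(10, 20, 30, 5))  # → 30
```

In this toy example the original statistic is about 18.73; moving the case-group split all the way to 30/0 still leaves the statistic significant (about 4.64, above 3.841) while maximizing the perturbation.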
S106, based on the value of the first noise frequency, performing noise-adding processing on the original data of the case group and the original data of the control group, respectively, to obtain the noise data of the case group and the noise data of the control group.
Performing noise-adding processing on the original data of the case group and of the control group to obtain the corresponding noise data is simply performing noise-adding processing on the original data to obtain the noise data.
After the noise-adding processing, the value of the first noise frequency will have changed. Once the value of the first noise frequency is determined, a way of adding noise to the original data (for example, the amount of noise to add) can be determined; noise data is then obtained by adding noise in that way, and the value of the first noise frequency can be recovered by statistical analysis of the noise data (counting the occurrences of the first value in the noise data of the case group).
With the technical solution of the embodiment of the present application, the original data (binary-valued data) is noised to obtain the noise data, realizing privacy protection while taking data usability into account during the privacy-protection processing. Specifically, the statistical significance of the noise data is kept consistent with that of the original data by constraining the value range of the chi-square value of the noise data, and the degree of privacy protection, represented by the difference between the original data and the noise data, is taken into account when determining the value of the first noise frequency, so that the resulting noise data achieves both privacy protection and high data usability.
The following description is made with reference to specific implementations.
The embodiment of the application also provides another data privacy protection method.
The privacy protection method for data in the embodiment of the application comprises S201-S203.
In the embodiment of the present application, data is acquired and processed only with the knowledge and consent of the data owner.
S201, acquiring data, and dividing the data into identity metadata and classification data.
The identity metadata may include one or more types and the classification data may include one or more types.
The values of the classification data include only two.
Data requiring privacy protection is generally divided into two types: identity metadata and classification data.
Identity metadata is used to identify the identity of an individual. For example, an individual's identity metadata may include multiple types, such as the individual's name, identification number, address, educational background, gender, age, and telephone number. Generally, individual identity metadata refers mainly to data that can uniquely identify an individual, or from which an individual's identity can be inferred in combination.
The classification data are data used in statistical studies; for example, an individual's classification data may include the individual's income, disease status, etc.
Classification data include only two cases; for example, individual income includes "high" and "low", and disease status includes "diseased" and "not diseased".
The identity metadata and the classification data may have a causal relationship, for example, between age and disease status.
The classification data may be the data used in statistical studies; identity metadata, while generally not within the scope of data research, may be subject to inference attacks.
An inference attack refers to an attacker identifying or inferring the identity of an individual through combined queries across multiple databases, in combination with a large amount of background knowledge.
In order to more clearly describe the data privacy protection method in the embodiment of the present application, a specific application scenario in the embodiment of the present application is described below.
It can be understood that the application scenario is only used for explaining the data privacy protection method in the embodiment of the present application, and the data privacy protection method in the embodiment of the present application may also be used in any other scenario.
Privacy protection is needed for data used to investigate whether age has a significant impact on a certain disease.
The acquired data may be analyzed by a chi-square test.
The acquired data specifically comprises data of a case group and data of a control group, wherein the data of the case group is data of an individual with a disease, and the data of the control group is data of an individual without a disease.
The acquired data are divided into identity metadata and classification data, and into data of the case group and data of the control group.
Specifically, the table of acquired data may be split to obtain a portion containing the identity metadata and a portion containing the classification data.
For example, categories of identity metadata may include: name, gender, birthday, zip code, the category of the classification data may include age.
It is to be understood that the above categories of the identity metadata and the categories of the classification data are examples, and others may be included.
"Age" is classification data in the present embodiment, and thus "age" takes two values, "0" and "1". Specifically, a fixed age value may be set: when an individual's age is greater than or equal to this value, the classification data "age" takes "1"; when the individual's age is less than this value, "age" takes "0".
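The binarization of "age" described above can be sketched as follows; the threshold of 60 is an arbitrary illustration, since the patent only states that some fixed age value is set:

```python
AGE_THRESHOLD = 60  # illustrative fixed age value, not from the patent

def binarize_age(age, threshold=AGE_THRESHOLD):
    """Map an age to the binary classification value: 1 if the age is
    greater than or equal to the fixed value, else 0."""
    return 1 if age >= threshold else 0

ages = [45, 60, 72, 59]
print([binarize_age(x) for x in ages])  # → [0, 1, 1, 0]
```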
S202, anonymous protection processing is carried out on the identity metadata.
The identity metadata may include one or more types of identity metadata.
When there are multiple types of identity metadata, the method of anonymous protection processing may be the same or different for each type.
The following is an example of a method for performing anonymous protection processing provided in this embodiment of the present application, and in some possible cases, other manners may also be used to perform anonymous protection processing on identity metadata.
In one possible implementation, the identity metadata is de-identified.
For example, for individual names, the real names are hidden by de-identification and uniformly abstracted into auto-incremented ID numbers.
For example, the individual names include a first name, a second name, and a third name, each consisting of a real surname and a real given name.
De-identification processing abstracts the names uniformly into auto-incremented ID numbers: the first name is processed into 1, the second name into 2, and the third name into 3.
That is, the identity metadata comprising the first, second, and third names is de-identified to obtain IDs 1, 2, and 3.
In one possible implementation, the identity metadata is subjected to data generalization.
For example, a data generalization method is adopted for attributes such as birthday, zip code, and residential address.
Specifically, the digits or characters shared by the records are retained, and each differing character is replaced with a special character, which increases the difficulty of inference attacks on the data.
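A possible sketch of this generalization, assuming each record of the attribute is a string of equal length and "X" is the chosen special character (the function name and sample zip codes are illustrative):

```python
def generalize_column(values, mask_char="X"):
    """Keep character positions on which all records agree; replace the
    positions where records differ with a mask character."""
    length = min(len(v) for v in values)   # assume equal-length strings
    out = []
    for i in range(length):
        chars = {v[i] for v in values}
        out.append(values[0][i] if len(chars) == 1 else mask_char)
    return "".join(out)

zips = ["100081", "100085", "100089"]
print(generalize_column(zips))  # 10008X
```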
Please refer to table 1 and table 2, where table 1 is data before anonymous protection processing is performed on the identity metadata of the case group, and table 2 is data before anonymous protection processing is performed on the identity metadata of the control group.
TABLE 1 data sheet for case group
TABLE 2 data Table of control group
As shown in tables 1 and 2, the data in the case group is divided into identity metadata and classification data; the identity metadata includes name, birthday, zip code, etc., and the classification data includes age, etc. The data in the control group are organized in the same way.
Please refer to tables 3 and 4, where table 3 is data after anonymous protection processing of the identity metadata of the case group, and table 4 is data after anonymous protection processing of the identity metadata of the control group.
TABLE 3 data sheet for case group
TABLE 4 data Table of control group
The "ID" identity metadata is obtained by de-identifying the "name" identity metadata.
In one possible implementation, portions of each identity metadata are discarded.
For example, each identity metadata consists of three parts: A. b and C, and the anonymous protection processing of the data is realized by discarding the part C.
In some possible implementation manners, multiple anonymous protection processing manners can be comprehensively utilized to process data.
As shown in tables 1 to 4, for identity metadata of the type "birthday", the raw data includes a specific year, month and date; the month and date in the original data are discarded, only the year is retained, and the last digit of the year (regarded as a differing character) is replaced with the special character "X", resulting in the data in tables 3 and 4.
S203, differential privacy protection processing is carried out on the binary data.
Values of the binary classification data include 0 and 1, such as "age" in tables 1 to 4.
In one possible implementation, privacy protection of the binary data is achieved by adding noise to the binary data.
For example, to protect the "age" data, the "age" data in the case group and the control group needs to be noisy.
When the binary data of a plurality of classes are included, the noise adding process may be performed on the binary data of each class, respectively.
The following specifically describes the noise addition to the binary data, including S301 to S307.
S301, obtaining the classification data of the case group and the classification data of the control group.
Here, the binary data of the case group and the binary data of the control group can be regarded as raw data, that is, data before processing.
The raw data includes raw data for the case group and raw data for the control group.
The values of the classified data include two values, namely a first value and a second value. For example, as shown in table 1, the first value and the second value are 0 and 1, respectively.
Referring to table 5, table 5 is the binary data contingency table before adding noise; specifically, table 5 is the frequency table of the chi-square test.
TABLE 5 Binary data contingency table before adding noise
Age     Case group    Control group    Total
0       a             c                m
1       b             d                n
Total   N/2           N/2              N
The binary data can be regarded as independent variables, and the frequency of the independent variables refers to the number of actual occurrences of the independent variables, for example, the frequency of age "0" in case group is a, and the frequency of age "1" in control group is d.
Adding noise to the data yields the noisy binary data contingency table; because noise is added, the frequencies of the independent variable change, as shown in table 6.
Referring to tables 5 and 6, for age "0" in the case group, the frequency of the argument before adding noise was a, and the frequency of the argument after adding noise was changed to a'.
The totals after noise addition are kept unchanged, i.e., the statistics m, n and N in table 5 are preserved; therefore, when a changes to a', the values b, c and d change accordingly.
With a as the tracking target, the range of the noisy chi-squared value (denoted the chi-squared perturbation range) is determined first; a suitable a' is then selected according to this range, and noise is added so that the frequency of the independent variable 0 becomes a' (i.e., the number of individuals with age "0" in the case group is a'), which is the final required noisy result.
TABLE 6 Binary data contingency table after noise addition
Age     Case group        Control group     Total
0       a'                c' = m - a'       m
1       b' = N/2 - a'     d' = N/2 - c'     n
Total   N/2               N/2               N
S302, determining an original chi-square value.
The chi-square value is a statistic in the non-parametric test, is mainly used in the non-parametric statistical analysis, and is a main test index in the chi-square test.
The original chi-squared value is calculated by the following equation (1):

χ² = N(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)] = 4(ad − bc)² / (mnN)    (1)

The meanings of the individual parameters in the above formula are indicated in table 5.
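Equation (1) can be sketched directly from the cells of table 5 (the cell counts below are illustrative):

```python
def chi_square_2x2(a, b, c, d):
    """Chi-squared statistic of the 2x2 table in Table 5:
    a, b are the value-0 / value-1 counts in the case group,
    c, d are the value-0 / value-1 counts in the control group."""
    m, n = a + c, b + d        # row totals (value 0, value 1)
    N = m + n                  # grand total; each group holds N/2 records
    return N * (a * d - b * c) ** 2 / ((a + b) * (c + d) * m * n)

# Illustrative counts: 30 zeros / 20 ones in the case group,
# 15 zeros / 35 ones in the control group.
print(round(chi_square_2x2(30, 20, 15, 35), 2))  # 9.09
```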
And S303, determining the value range of the chi-square value of the noise data.
The noise data is noise-perturbed data, including noise data for the case group and noise data for the control group.
The noise data is obtained by adding noise to the original data: the noise data of the case group is obtained by adding noise to the original data of the case group, and the noise data of the control group is obtained by adding noise to the original data of the control group.
After the original chi-square value is obtained, the value range of the chi-square value of the noise data is determined.
The value range of the chi-squared value of the noise data can be regarded as a chi-squared value disturbing range.
In the embodiment of the application, the loss of data utility caused by noise addition is reduced by keeping the chi-squared test statistic of the noise data within the statistical significance range of the original data.
For example, noise data is obtained by adding noise to the raw (binary) data. If statistical analysis of the original data shows that the independent variable has a significant influence on the dependent variable (i.e., the original chi-squared value is greater than the critical chi-squared value), then statistical analysis of the noise data obtained after adding noise still shows a significant influence (i.e., the chi-squared value of the noise data remains greater than the critical chi-squared value).
In addition, the chi-squared value of the noise data must not exceed the original chi-squared value, so that the noise perturbation is not excessive.
Based on this, the chi-squared value of the noise data needs to satisfy the following constraint inequality:

χ²(α,v) ≤ χ′² ≤ χ²    (2)

where χ² is the original chi-squared value, χ′² is the chi-squared value of the noise data, and χ²(α,v) is the critical chi-squared value.
For the 2 × 2 contingency table, the degree of freedom is v = 1; given the significance level α = 0.05, the critical chi-squared value obtained by looking up the chi-squared distribution table according to v and α is 3.84.
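The table lookup can be reproduced with the standard library: for one degree of freedom, a chi-squared variable is the square of a standard normal, so the critical value is the square of the normal quantile at 1 − α/2 (the function name is illustrative):

```python
from statistics import NormalDist

def chi2_critical_df1(alpha):
    """Critical chi-squared value for v = 1: the square of the standard-normal
    quantile at 1 - alpha/2, since chi-squared with one degree of freedom is
    the square of a standard normal variable."""
    return NormalDist().inv_cdf(1 - alpha / 2) ** 2

print(round(chi2_critical_df1(0.05), 2))  # 3.84
```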
S304, determining the value range of a′
a′ is the number of age "0" records in the case group after noise addition, i.e., the frequency of the independent variable after noise addition.
And determining the value range of a' according to the obtained original chi-square value and the value range of the chi-square value of the noise data.
That is, the value range of a' is determined according to the above equations (1) and (2).
The value range of a′ can be obtained through calculation:

a′ ∈ (a1, a2) ∪ (a3, a4)

where ∪ is the union operator, and (a1, a2) ∪ (a3, a4) is the union of (a1, a2) and (a3, a4), with

a1 = [m − √(m² − C1)] / 2
a2 = [m − √(m² − C2)] / 2
a3 = [m + √(m² − C2)] / 2
a4 = [m + √(m² − C1)] / 2
C1 = m² − nmχ² / N
C2 = m² − nmχ²(α,v) / N
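Assuming the four endpoints are assigned as above, with a1, a4 set by the original chi-squared value and a2, a3 by the critical value, the computation can be sketched as follows (the numeric inputs are illustrative):

```python
import math

def a_prime_intervals(m, n, N, chi2_orig, chi2_crit):
    """Endpoints (a1, a2) and (a3, a4) of the admissible range of the noisy
    frequency a', using C1 and C2 as defined in the text."""
    C1 = m * m - n * m * chi2_orig / N
    C2 = m * m - n * m * chi2_crit / N
    r1 = math.sqrt(m * m - C1)   # half-width set by the original chi-squared value
    r2 = math.sqrt(m * m - C2)   # half-width set by the critical chi-squared value
    return ((m - r1) / 2, (m - r2) / 2), ((m + r2) / 2, (m + r1) / 2)

# Illustrative inputs: m = 45, n = 55, N = 100, original chi-squared 9.0909,
# critical value 3.8415 (alpha = 0.05, v = 1).
(a1, a2), (a3, a4) = a_prime_intervals(45, 55, 100, 9.0909, 3.8415)
print(round(a1, 2), round(a2, 2), round(a3, 2), round(a4, 2))
```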
S304, determining the value of a′
After the value range of a 'is determined, the proper value of a' is determined by balancing data utility and privacy protection.
The relationship between the chi-squared value of the noisy data and the original chi-squared value characterizes how the usability of the data is affected after noise is added.
In the embodiment of the application, the utility of the data is measured by using the chi-square value.
Imposing the above constraint on the relationship between the chi-squared value of the noisy data and the original chi-squared value reduces the loss of utility of the noisy data.
The degree of privacy protection for the raw data can be measured by the Expected Estimation Error (EEE) from the raw data and the noisy data.
The expected estimation error is the expected error between the data before and after noise addition, and is obtained according to the following formula:

EEE = (1/|D|) · Σᵢ P(Yᵢ) · |Dᵢ − D′ᵢ|

where Y is the noise added to the original data, P(Yᵢ) is the probability that the ith noise is added to the ith record of the database, D is the original database, |D| is the size of the database, D′ is the perturbed database after noise addition, and |Dᵢ − D′ᵢ| is the element difference of the corresponding tuples.
EEE measures the error between D and D′ and intuitively reflects the degree of perturbation caused by the noise: the larger the EEE, the greater the noise perturbation and the higher the degree of privacy protection of the data.
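A minimal sketch of the EEE computation, assuming the database is a list of binary record values and the per-record noise probabilities P(Yᵢ) are known (all names and numbers are illustrative):

```python
def expected_estimation_error(original, perturbed, probs):
    """EEE = (1/|D|) * sum_i P(Y_i) * |D_i - D'_i| over record-wise values."""
    assert len(original) == len(perturbed) == len(probs)
    total = sum(p * abs(x - y) for p, x, y in zip(probs, original, perturbed))
    return total / len(original)

# Illustrative 4-record binary database, its perturbed copy, and noise probabilities.
print(expected_estimation_error([0, 1, 1, 0], [1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5]))  # 0.25
```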
The following is a description of equalizing data utility and privacy protection.
By carrying out noise processing on the data, privacy protection of the data can be realized. By ensuring that the noisy chi-squared value stays within the original statistical significance range, the utility loss of the noisy data is reduced. For example, if the chi-squared value of the data before noise addition is less than the critical chi-squared value, the chi-squared value after noise addition is also less than the critical value; and likewise when both exceed it.
The Expected Estimation Error (EEE) is the probability-weighted sum of the absolute differences between the data before and after noise addition; the larger the expected estimation error, the greater the perturbation of the data by the noise and the higher the degree of data privacy protection.
In summary, when the chi-squared value of the noise data is within the original statistical significance range and the expected estimation error is the largest, it can be considered that the data utility and privacy protection are balanced.
Specifically, when the noise adding quantity changes, the noise adding quantity with the chi-squared value of the noise data within the original statistical significance range and the expected estimation error maximum is determined as the finally selected noise adding quantity.
The larger the difference between a′ and a, the greater the perturbation of the data by the added noise; and the larger the expected estimation error, the higher the degree of privacy protection of the data. Therefore, the value farthest from a is selected as a′.
Meanwhile, since the chi-squared value is a quadratic function of a′, the closer a′ is to an outer boundary value, the closer the noisy chi-squared value is to the original chi-squared value, i.e., the statistical significance of the chi-squared value hardly changes. Therefore, when the original frequency a is farther from (a1, a2), a′ = a1 + 1 is chosen; when a is farther from (a3, a4), a′ = a4 − 1 is chosen.
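The boundary-selection rule above can be sketched as follows, assuming the two admissible intervals (a1, a2) and (a3, a4) are already known from the previous step (endpoints illustrative):

```python
import math

def choose_a_prime(a, intervals):
    """Pick an integer a' inside (a1, a2) or (a3, a4) that is farthest from
    the original frequency a while hugging an outer boundary, so the noisy
    chi-squared value stays close to the original one."""
    (a1, a2), (a3, a4) = intervals
    low = math.floor(a1) + 1    # integer just inside (a1, a2); equals a1 + 1 when a1 is whole
    high = math.ceil(a4) - 1    # integer just inside (a3, a4); equals a4 - 1 when a4 is whole
    return low if abs(low - a) >= abs(high - a) else high

# Illustrative intervals; a = 30 lies at the outer boundary a4, so the
# farthest admissible integer is a1 + 1 = 16.
print(choose_a_prime(30, ((15.0, 17.63), (27.37, 30.0))))  # 16
```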
At present, noise addition typically adds Laplacian noise to the data; however, randomly added noise causes greater loss of data utility. The embodiment of the application determines a reasonable noise amount by computing the noise balance point, reducing the loss of data utility.
S305, generating noise based on the value of a′
After the noise amount is determined, a corresponding amount of Laplace or discrete Laplace noise is generated so that the statistical frequency after noise addition is a′.
The probability distribution density of the noise is related to the differential privacy budget. In a possible implementation manner, the privacy budget is determined to be zero, so that the disturbance range of the noise is larger, and a better disturbance effect is obtained.
The size of the privacy budget may also be determined as other values, and the size of the privacy budget is not specifically limited in the embodiments of the present application.
The size of the privacy budget usually does not affect the final noise amount or the statistical meaning of the final data, so the privacy budget can be adjusted according to actual requirements.
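The noise-generation step itself (Laplace or discrete Laplace draws tuned so that the noisy frequency equals a′) is not spelled out in detail here. As a heavily simplified stand-in, one can flip randomly chosen records until the case-group frequency of value 0 reaches a′ (`perturb_to_target` and its inputs are illustrative, not the patent's mechanism):

```python
import random

def perturb_to_target(records, a_prime, rng=random):
    """Flip randomly chosen binary records until the number of zeros equals
    the target frequency a'. Simplified stand-in for the Laplace-noise step."""
    out = list(records)
    zeros = [i for i, v in enumerate(out) if v == 0]
    ones = [i for i, v in enumerate(out) if v == 1]
    a = len(zeros)
    if a > a_prime:                       # too many zeros: flip some to 1
        for i in rng.sample(zeros, a - a_prime):
            out[i] = 1
    elif a < a_prime:                     # too few zeros: flip some to 0
        for i in rng.sample(ones, a_prime - a):
            out[i] = 0
    return out

case_group = [0] * 30 + [1] * 20   # illustrative: a = 30
noisy = perturb_to_target(case_group, 16)
print(noisy.count(0))  # 16
```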
At present, Laplacian noise is added to data to realize differential privacy protection, and randomly added noise usually causes greater loss of data utility. The present technical scheme determines the noise balance point and finds a reasonable noise amount, thereby improving data utility.
S306, based on the generated noise, noise addition processing is performed on the binary data of the case group and the binary data of the control group respectively.
After the noise is generated, noise addition is performed on the binary data of the case group and of the control group respectively, yielding the noise data of the case group and the noise data of the control group.
S307, post-processing is carried out on the noisy binary data
Post-processing can be performed according to actual requirements. For example, some data may be restricted to a single interval while the generated noise is unbounded, so the noisy binary data can be post-processed.
The post-processing may include performing modulo remainder operations, rounding operations, data rounding processing, etc. on the noisy binary data.
The rounding operation is a precision-preserving counting method that keeps the difference between the actual value and the retained portion within one half of the last retained order of magnitude.
For example, if the range of the original binary data is (m, n), a modulo-(n − m) remainder operation is performed on the noisy binary data so that the range of the noisy binary data is also (m, n).
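A sketch of this modulo-remainder post-processing, with illustrative range endpoints (named low and high here to avoid clashing with the totals m and n above):

```python
def clamp_to_range(value, low, high):
    """Map a noisy value back into the original range (low, high) using a
    modulo-(high - low) remainder, as described for post-processing."""
    return low + (value - low) % (high - low)

print(clamp_to_range(7.5, 0, 5))  # 2.5
```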
If there is a rounding requirement for the denoised binary data, rounding processing may be performed on the denoised binary data.
If the data has the requirement of precision, the denoised binary data can be rounded to meet the requirement of precision.
In some possible cases, the post-processing operation performed after the noise addition may further include other operations, which are not specifically limited in this embodiment of the present application.
In some possible cases, post-processing operations may also not be performed after the binomial data is denoised.
Through the above processing, the anonymously protected identity metadata and the noisy binary data are obtained.
In some possible implementation manners, the two types of data can be published in combination, so that the data privacy degree is improved, the protected data cannot lose the statistical significance of the protected data, and the data utility and privacy protection can be balanced.
In statistical-analysis data release scenarios there are potential attack risks such as differential attacks and inference attacks. The technical scheme of the embodiment of the application provides a data processing method for the large number of individual identity data and binary data sets involved in chi-square tests in difference-analysis research, maintaining the statistical analysis significance of the data sets while protecting privacy.
The technical scheme of the embodiment of the application mainly targets data protection for the chi-square test. For example, when an open platform publishes data suitable for chi-square testing, the data are first protected with the present scheme: the native individual identity information is generalized and blurred to a certain extent, and the chi-squared value constraint ensures that important sample data retain statistical significance after reasonable perturbation, achieving a balance between data privacy protection and data availability.
The embodiment of the application also provides a data processing device based on privacy protection.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a data processing apparatus based on privacy protection according to an embodiment of the present disclosure.
As shown in fig. 2, a data processing apparatus 200 based on privacy protection according to an embodiment of the present application includes the following units:
the data acquiring unit 201 is configured to acquire original data, where values of the original data only include a first value and a second value, and the original data includes original data of a case group and original data of a control group.
A first determining unit 202, configured to determine a chi-squared value of the original data.
A second determining unit 203, configured to determine a value range of a chi-squared value of the noise data according to the chi-squared value of the original data; the noise data includes noise data of the case group and noise data of the control group, where the noise data of the case group is obtained by adding noise to the original data of the case group, and the noise data of the control group is obtained by adding noise to the original data of the control group; the magnitude relation between the chi-squared value of the original data and the critical value is the same as that between the chi-squared value of the noise data and the critical value; and the chi-squared value of the noise data is less than the chi-squared value of the original data.
A third determining unit 204, configured to determine a value range of the first noise frequency according to the chi-squared value of the original data and the value range of the chi-squared value of the noise data, where the first noise frequency is the number of the first values in the noise data of the case group.
A fourth determining unit 205, configured to determine a value of the first noise frequency according to the value range of the first noise frequency and the difference between the original data and the noise data.
And the noise adding processing unit 206, configured to perform noise addition on the original data of the case group and the original data of the control group respectively based on the value of the first noise frequency, to obtain the noise data of the case group and the noise data of the control group.
The units included in the data processing apparatus 200 based on privacy protection can achieve the same technical effects as the data processing method based on privacy protection in the above embodiments, and are not described herein again to avoid repetition.
The embodiment of the application also provides the electronic equipment.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
As shown in fig. 3, an electronic device 300 according to an embodiment of the present application includes a processor 301 and a memory 302, where the memory 302 stores codes, and the processor 301 is configured to call the codes stored in the memory 302 to execute any one of the above data processing methods based on privacy protection.
The units included in the electronic device 300 can achieve the same technical effects as those of the data processing method based on privacy protection in the above embodiments, and are not described herein again to avoid repetition.
In an embodiment of the present application, a computer-readable storage medium is further provided, where the computer-readable storage medium is used to store a computer program, and the computer program is used to execute the data processing method based on privacy protection, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A data processing method based on privacy protection, the method comprising:
acquiring original data; the values of the original data only comprise a first value and a second value, and the original data comprises original data of a case group and original data of a control group;
determining a chi-squared value of the raw data;
determining the value range of the chi-square value of the noise data according to the chi-square value of the original data; the noise data only comprises the first value and the second value, the noise data comprises noise data of the case group and noise data of the control group, the noise data of the case group is obtained by adding noise to the original data of the case group, and the noise data of the control group is obtained by adding noise to the original data of the control group; the magnitude relation between the chi-square value of the original data and the critical value is the same as the magnitude relation between the chi-square value of the noise data and the critical value; the chi-square value of the noise data is smaller than the chi-square value of the original data;
determining a value range of the first noise frequency according to the value ranges of the chi-square value of the original data and the chi-square value of the noise data; wherein the first noise frequency is the number of the first value in the noise data of the case group;
determining the value of the first noise frequency according to the value range of the first noise frequency and the difference between the original data and the noise data;
and respectively carrying out noise adding treatment on the original data of the case group and the original data of the control group based on the value of the first noise frequency to obtain the noise data of the case group and the noise data of the control group.
2. The method according to claim 1, wherein the determining the chi-squared value of the raw data specifically comprises:
determining the chi-square value of the original data according to a first original frequency, a second original frequency, a first number and a second number; wherein the first original frequency is the number of the first value in the original data of the case group, the second original frequency is the number of the second value in the original data of the case group, the first number is the sum of the numbers of the first value in the original data of the case group and in the original data of the control group, and the second number is the sum of the numbers of the second value in the original data of the case group and in the original data of the control group.
3. The method of claim 1 or 2, wherein the critical value is obtained by looking up a chi-squared distribution critical value table.
4. The method according to claim 1, wherein the determining a value of the first noise frequency according to a value range of the first noise frequency and a difference between the original data and the noise data includes:
determining a value of the first noise frequency within a range of values of the first noise frequency to maximize a difference between the original data and the noise data.
5. The method of claim 1, wherein the difference between the raw data and the noise data comprises in particular an expected estimation error between the raw data and the noise data.
6. The method of claim 1, wherein after the denoising the raw data of the case group and the raw data of the control group based on the value of the first noise frequency to obtain the noise data of the case group and the noise data of the control group, the method further comprises:
post-processing the noise data of the case group and the noise data of the control group, the post-processing including at least one of: rounding the data to integers, or rounding the data according to the precision requirement.
7. The method of claim 1, further comprising:
acquiring identity data, wherein the identity data is used for identifying the identity of an individual;
obfuscating the identity data, wherein the obfuscating comprises: performing de-identification processing on the identity data, or performing generalization processing on the identity data, wherein the generalization processing replaces a plurality of differing characters in the identity data with preset characters.
8. A data processing apparatus based on privacy protection, the apparatus comprising:
the data acquisition unit is used for acquiring original data, wherein the values of the original data only comprise a first value and a second value, and the original data comprises original data of a case group and original data of a control group;
a first determination unit for determining a chi-squared value of the original data;
the second determining unit is used for determining the value range of the chi-square value of the noise data according to the chi-square value of the original data; the noise data only comprises the first value and the second value, the noise data comprises noise data of the case group and noise data of the control group, the noise data of the case group is obtained by adding noise to the original data of the case group, and the noise data of the control group is obtained by adding noise to the original data of the control group; the magnitude relation between the chi-square value of the original data and the critical value is the same as the magnitude relation between the chi-square value of the noise data and the critical value; the chi-square value of the noise data is smaller than the chi-square value of the original data;
a third determining unit, configured to determine a value range of a first noise frequency according to a chi-square value of the original data and a value range of a chi-square value of the noise data, where the first noise frequency is a number of noise data of the case group whose value is the first value;
a fourth determining unit, configured to determine a value of the first noise frequency according to a value range of the first noise frequency and a difference between the original data and the noise data;
and the noise adding processing unit is used for respectively adding noise to the original data of the case group and the original data of the control group based on the value of the first noise frequency to obtain the noise data of the case group and the noise data of the control group.
9. An electronic device comprising a processor and a memory, wherein the memory stores code and the processor is configured to invoke the code stored in the memory to perform the method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is adapted to store a computer program for performing the method of any of claims 1 to 7.
CN202210385699.2A 2022-04-13 2022-04-13 Data processing method and device based on privacy protection and electronic equipment Pending CN114741726A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210385699.2A CN114741726A (en) 2022-04-13 2022-04-13 Data processing method and device based on privacy protection and electronic equipment


Publications (1)

Publication Number Publication Date
CN114741726A true CN114741726A (en) 2022-07-12

Family

ID=82280784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210385699.2A Pending CN114741726A (en) 2022-04-13 2022-04-13 Data processing method and device based on privacy protection and electronic equipment

Country Status (1)

Country Link
CN (1) CN114741726A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination