CN116611030A - Localized differential privacy protection logistic regression method based on compression - Google Patents

Localized differential privacy protection logistic regression method based on compression

Publication number: CN116611030A
Application number: CN202310576399.7A
Authority: CN (China)
Legal status: Pending
Original language: Chinese (zh)
Prior art keywords: user, vector, logistic regression, model, users
Inventors: 王慧婷 (Wang Huiting), 陈燕俐 (Chen Yanli), 杨庚 (Yang Geng), 王周生 (Wang Zhousheng)
Assignee (original and current): Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications; priority to CN202310576399.7A
Classifications

    • G06F18/27 Regression, e.g. linear or logistic regression
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes


Abstract

The application discloses a logistic regression method with localized differential privacy protection based on compression. In the method, each user computes a gradient vector from the model parameters issued by the server and encodes it into an input vector; the input vector is then perturbed by a randomized response mechanism into an output vector, which realizes privacy protection. The server aggregates and corrects the output vectors to obtain an unbiased mean that is used to update the model parameters, and issues the updated parameters to the users participating in the next round of training. After multiple iterations a logistic regression model is trained. Finally, the logistic regression model is used to classify and predict user data with unknown class labels. The application introduces a compressed localized differential privacy model, which improves the utility and estimation precision of data statistics while protecting user privacy, balances privacy protection against data availability, and provides classification prediction for users; an attacker cannot infer individual records in the training data, and the method achieves high classification accuracy.

Description

Localized differential privacy protection logistic regression method based on compression
Technical Field
The application belongs to the field of information security, and particularly relates to a logistic regression method for localized differential privacy protection based on compression.
Background
The explosive growth of data has driven the development of data mining techniques. Massive amounts of data are generated every day by devices connected to the Internet, and analyzing and mining these data can effectively improve user experience and service quality. However, devices that hold private, sensitive information risk leaking user privacy while accessing the Internet or giving other devices access to their data, which can cause immeasurable loss to the user. Personal data therefore need privacy protection while they are mined and analyzed.
Differential privacy (Differential Privacy, DP) is a privacy-preserving model with a rigorous mathematical foundation whose guarantees can be formally proven. Differential privacy is now widely applied in machine learning, deep neural networks, federated learning, and related fields. The conventional differential privacy model, also called the centralized differential privacy model, relies on a third-party data collector that gathers user data and publishes the data set or related statistics after privacy processing with differential privacy techniques. However, centralized differential privacy depends on a trusted data collector; when the collector is untrusted, the privacy of the participants cannot be guaranteed. Localized differential privacy (Local Differential Privacy, LDP) inherits DP's ability to quantify the degree of privacy protection while moving the privacy processing from the third-party collector to the user's own device, eliminating the threat of privacy disclosure posed by an untrusted third-party server. Because perturbation mechanisms satisfying LDP typically assume large data sets with user populations on the order of millions, condensed (compressed) local differential privacy (Condensed Local Differential Privacy, CLDP) was proposed; it introduces a distance metric to establish indistinguishability between data, scaling the privacy guarantee by the distance (i.e., degree of similarity) between data items. Compared with the LDP model, the CLDP model can provide better statistical utility on smaller data sets or with more data dimensions.
At present, logistic regression algorithms based on localized differential privacy protection are usually realized by combining a mean estimation algorithm for multi-dimensional numerical data that satisfies LDP with gradient descent. However, existing LDP mean estimation algorithms for multi-dimensional numerical data generally either split the privacy budget across the dimensions or randomly sample a single dimension instead of reporting all dimensions, which leads to low statistical utility. In addition, such mean estimation algorithms typically assume large data sets with user populations on the order of millions, and applying them to small data sets yields low estimation accuracy.
Disclosure of Invention
Purpose of the application: the application aims to provide a logistic regression method with localized differential privacy protection based on compression. Addressing the defects of existing logistic regression methods based on localized differential privacy protection, the application introduces the condensed local differential privacy (CLDP) model, which offers higher statistical utility when facing more data dimensions and small data sets. At the same time, the application does not need to sample when processing the multidimensional gradient vector, which effectively avoids the estimation error caused by sampling.
The technical scheme is as follows: in the logistic regression method based on compressed localized differential privacy protection, the following steps are executed: each user perturbs its gradient vector to achieve privacy protection; the server aggregates the perturbed output vectors to recover the mean of all user input vectors; the mean is substituted into the iteration formula of logistic regression to update the model parameters; and a user predicts its label with the decision boundary of the hypothesis function based on the trained model.
S1: in the training stage, the server initializes the logistic regression model parameters, sets the privacy budget value α, and discloses the initial model parameters and the privacy budget to the users.
The server is an untrusted entity responsible for aggregating user gradients and computing model parameters.
The users, n in total, hold the training data required to train the logistic regression model.
S2: User_i computes a (d+1)-dimensional numerical gradient vector ∇^(i) from the model parameters issued by the server;
S3: User_i encodes its gradient vector ∇^(i) on the user side into a (d+1)-dimensional input vector s^(i);
S4: User_i perturbs the input vector s^(i) with the perturbation mechanism M_CLDP-ME into the output vector t^(i), so that the mechanism satisfies α-compressed local differential privacy;
S5: User_i sends the perturbed output vector t^(i) to the server;
S6: the server performs statistical analysis on the output vectors t^(i) of all users to obtain the mean of the n input vectors;
S7: the server substitutes the obtained mean into the iteration formula of logistic regression to update the model parameter θ; steps S2–S7 are repeated until the model converges and iteration ends;
S8: finally, the model parameter θ, i.e., the logistic regression classifier model, is obtained. For a user with an unknown class label, θ and the user's attributes are substituted into the hypothesis function of logistic regression, and the user can predict its label with the decision boundary of the hypothesis function.
Further, in step S1, the server initializes the logistic regression model parameters θ, where θ ∈ R^(d+1) is a (d+1)-dimensional real vector with every element set to 0. The n users are {User_1, ..., User_i, ..., User_n}; each User_i holds a private record User^(i) = (x^(i), y^(i)), which contains a d-dimensional numerical attribute vector x^(i) = (x_1^(i), ..., x_d^(i)) and a class label y^(i) ∈ {0, 1}. The server sends the initial logistic regression model parameters θ and the privacy budget α to the n users {User_1, ..., User_i, ..., User_n}.
Further, step S2 specifically follows the gradient formula of logistic regression ∇_j^(i) = (h_θ(x^(i)) − y^(i)) · x_j^(i), where the hypothesis function is h_θ(x^(i)) = 1/(1 + e^(−θ^T · x^(i))) and x_0^(i) = 1. Each User_i thus obtains on the user side a (d+1)-dimensional numerical gradient vector ∇^(i) = (∇_0^(i), ∇_1^(i), ..., ∇_d^(i)), where i ∈ [1, n], j ∈ [0, d].
Further, step S3 specifically includes: normalizing ∇^(i) so that each component lies in [−1, 1], and then discretizing the normalized vector ∇̄^(i) to obtain the input vector s^(i), so that every component s_j^(i) ∈ {−1, 1}. The discretization is:

s_j^(i) = 1 with probability (1 + ∇̄_j^(i))/2, and s_j^(i) = −1 with probability (1 − ∇̄_j^(i))/2,

where ∇̄_j^(i) denotes the value of the j-th component of User_i's normalized gradient vector ∇̄^(i).
Further, step S4 specifically includes the following steps:
S41: for a vector of dimension d+1, the total sample space is divided into d+2 groups of sample subspaces, where the sample subspace with similarity k has size C(d+1, k); therefore Z = Σ_{k=0..d+1} C(d+1, k) · e^(αk/2) can be computed as a normalization factor;
S42: simplifying the normalization factor by the binomial theorem gives: Z = (1 + e^(α/2))^(d+1);
S43: the perturbation mechanism M_CLDP-ME is defined so that, for any input vector s^(i), the probability of obtaining the output vector t^(i) ∈ {−1, 1}^(d+1) through M_CLDP-ME is:

Pr[M_CLDP-ME(s^(i)) = t^(i)] = e^(α · u(s^(i), t^(i)) / 2) / (1 + e^(α/2))^(d+1),

where Pr[·] denotes a probability, α is the privacy budget under the CLDP model, and u(s^(i), t^(i)) = d + 1 − d(s^(i), t^(i)) is the utility function, which quantifies the similarity between the input vector s^(i) and the output vector t^(i), with u(s^(i), t^(i)) ∈ [0, d+1]. The distance function d(s^(i), t^(i)) is the Hamming distance, i.e., d(s^(i), t^(i)) = Σ_{j=0..d} (s_j^(i) ⊕ t_j^(i)), where ⊕ is the exclusive-or operation; it reflects the degree of dissimilarity between the two vectors, provides a basis for their degree of similarity, and satisfies the non-negativity, identity, symmetry, and triangle-inequality properties of a distance function. The higher the similarity between the two vectors, the greater the utility value, meaning that the output vector t^(i) is close to the input vector s^(i) with higher probability.
S44: by the binomial theorem, the perturbation mechanism factorizes over the components: each component is kept, t_j^(i) = s_j^(i), with probability e^(α/2)/(1 + e^(α/2)), and flipped, t_j^(i) = −s_j^(i), with probability 1/(1 + e^(α/2)).
The user side passes the input vector s^(i) through the perturbation mechanism M_CLDP-ME to obtain the output vector t^(i).
Further, step S6 specifically includes the following steps:
S61: the server collects the perturbed output vectors {t^(1), ..., t^(n)} uploaded by the n users;
S62: the server initializes the mean estimate vector z̃;
S63: each component z̃_j of z̃ is computed as:

z̃_j = ((e^(α/2) + 1)/(e^(α/2) − 1)) · (1/n) · Σ_{i=1..n} t_j^(i).

After every component of z̃ has been computed by the above formula, z̃ is the final mean estimate vector.
Further, step S7 specifically includes the following steps:
S71: the unbiased mean estimate z̃ of the current round's output vectors obtained in step S6 is substituted into the model update; the specific update formula is:

θ_j ← θ_j − η · z̃_j, j ∈ [0, d],

where η is the learning rate.
S72: the updated model parameters θ are sent to the n users {User_1, ..., User_i, ..., User_n}.
Further, the specific steps of step S8 are as follows:
S81: the server distributes the logistic regression classifier model parameters θ to the users to be classified;
S82: a user User_k with an unknown class label holds a private record User^(k) = (x^(k)) containing only d-dimensional attribute data; defining x^(k)′ = [1, x^(k)] yields the (d+1)-dimensional attribute vector x^(k)′ = (1, x_1^(k), ..., x_d^(k));
S83: the attribute vector x^(k)′ is normalized;
S84: θ and x^(k)′ are substituted into the hypothesis function of logistic regression: h_θ(x^(k)′) = 1/(1 + e^(−θ^T · x^(k)′));
S85: the user can predict its label with the hypothesis function h_θ(x^(k)′): when h_θ(x^(k)′) ≥ 0.5, the result is classified as class 1; when h_θ(x^(k)′) < 0.5, the result is classified as class 0.
Beneficial effects: compared with the prior art, the application has the following notable advantages:
1. The application designs a logistic regression method with localized differential privacy protection based on compression. In this method, a perturbation mechanism M_CLDP-ME is defined and the compressed localized differential privacy model is introduced, so that the output value approaches the original value with higher probability; this protects user privacy while improving the statistical utility on multidimensional data sets. The perturbation mechanism M_CLDP-ME also exhibits higher estimation accuracy on small data sets.
2. Under the protection of compressed localized differential privacy, user privacy data are effectively protected even if an attacker has all background knowledge except the target private information.
3. Under the same degree of privacy protection, the logistic regression method based on compressed localized differential privacy protection achieves higher classification accuracy than existing methods and thus has better practical value.
Drawings
Fig. 1 is a schematic diagram of the logistic regression method based on compressed localized differential privacy protection of the present application.
FIG. 2 is a schematic comparison of the performance of the present application (MSE versus privacy budget).
FIG. 3 is a schematic comparison of the performance of the present application (MSE versus number of users).
FIG. 4 is a schematic comparison of the performance of the present application (MSE versus attribute dimension).
FIG. 5 is a schematic comparison of the performance of the present application (classification accuracy versus privacy budget).
Detailed Description
The technical scheme of the application is further described below with reference to the accompanying drawings.
As shown in fig. 1, the logistic regression method based on compressed localized differential privacy protection of this embodiment generally includes the following implementation steps:
S1: in the training stage, the server initializes the logistic regression model parameters, sets the privacy budget value α, and discloses the initial model parameters and the privacy budget to the users.
The server is an untrusted entity responsible for aggregating user gradients and computing model parameters.
The users, n in total, hold the training data required to train the logistic regression model.
S2: User_i computes a (d+1)-dimensional numerical gradient vector ∇^(i) from the model parameters issued by the server.
S3: User_i encodes its gradient vector ∇^(i) on the user side into a (d+1)-dimensional input vector s^(i).
S4: User_i perturbs the input vector s^(i) with the perturbation mechanism M_CLDP-ME into the output vector t^(i), so that the mechanism satisfies α-compressed local differential privacy.
S5: the user sends the perturbed output vector t^(i) to the server.
S6: the server performs statistical analysis on the perturbed data sent by all users and recovers the mean of the n input vectors as accurately as possible.
S7: the server substitutes the obtained mean into the iteration formula of logistic regression to update the model parameter θ.
Steps S2–S7 are repeated until the model converges and iteration ends.
S8: finally, the model parameter θ, i.e., the logistic regression classifier model, is obtained. For a user with an unknown class label, θ and the user's attributes are substituted into the hypothesis function of logistic regression, and the user can predict its label with the decision boundary of the hypothesis function.
In step (S1) of the method, the server's initialization of the logistic regression model parameters includes the following:
the server initializes the logistic regression model parameters θ, where θ ∈ R^(d+1) is a (d+1)-dimensional real vector with every element set to 0. The n users are {User_1, ..., User_i, ..., User_n}; each User_i holds a private record User^(i) = (x^(i), y^(i)), which contains a d-dimensional numerical attribute vector x^(i) = (x_1^(i), ..., x_d^(i)) and a class label y^(i) ∈ {0, 1}. The server sends the initial logistic regression model parameters θ and the privacy budget α to the n users {User_1, ..., User_i, ..., User_n}.
In step (S2) of the method, the user side calculates the gradient vector from the original data as follows:
according to the gradient formula of logistic regression, ∇_j^(i) = (h_θ(x^(i)) − y^(i)) · x_j^(i), where the hypothesis function is h_θ(x^(i)) = 1/(1 + e^(−θ^T · x^(i))) and x_0^(i) = 1. Each User_i can obtain on the user side a (d+1)-dimensional numerical gradient vector ∇^(i) = (∇_0^(i), ∇_1^(i), ..., ∇_d^(i)), where i ∈ [1, n], j ∈ [0, d].
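The per-user gradient computation of step S2 can be sketched as follows (a minimal illustration; the function name is invented, and the hypothesis function is the standard logistic sigmoid given above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def user_gradient(theta, x, y):
    """S2: gradient of the logistic loss for one user,
    grad_j = (h_theta(x') - y) * x'_j, where x' = [1, x] so x'_0 = 1."""
    x_ext = np.concatenate(([1.0], x))
    return (sigmoid(theta @ x_ext) - y) * x_ext

theta = np.zeros(3)                               # d = 2 attributes
g = user_gradient(theta, np.array([0.4, -0.2]), 1)
# with theta = 0, h_theta = 0.5, so g = -0.5 * [1, 0.4, -0.2]
```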
In step (S3) of the method, the encoding of the gradient vector at the user side includes the following process:
S31: ∇^(i) is normalized so that each component lies in [−1, 1]. For any dimension j, the normalization is:

∇̄_j^(i) = 2 · (∇_j^(i) − ∇_j^min) / (∇_j^max − ∇_j^min) − 1,

where ∇_j^max denotes the maximum value appearing in the j-th component of the original gradient vectors of all users participating in the training, and ∇_j^min denotes the minimum value appearing in the j-th component of the original gradient vectors of all users participating in the training.
S32: the normalized vector ∇̄^(i) is discretized to obtain the input vector s^(i), so that every component s_j^(i) ∈ {−1, 1}. The discretization is:

s_j^(i) = 1 with probability (1 + ∇̄_j^(i))/2, and s_j^(i) = −1 with probability (1 − ∇̄_j^(i))/2,

where ∇̄_j^(i) denotes the value of the j-th component of User_i's normalized gradient vector ∇̄^(i).
Since E[s_j^(i)] = ∇̄_j^(i), the discretization step preserves the unbiasedness of the data.
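Steps S31–S32 (normalize each component to [−1, 1], then randomly round to {−1, +1} so the expectation equals the normalized value) can be sketched as follows. The per-dimension bounds are assumed given; function and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def encode(grad, g_min, g_max):
    """S31 + S32: linearly map each component into [-1, 1], then round it
    randomly to {-1, +1} so that E[s_j] equals the normalised value:
    E[s_j] = 1*(1+v)/2 + (-1)*(1-v)/2 = v, i.e. the encoding is unbiased."""
    norm = 2 * (grad - g_min) / (g_max - g_min) - 1
    return np.where(rng.random(np.shape(grad)) < (1 + norm) / 2, 1, -1)
```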
In step (S4) of the method, the specific steps of the perturbation mechanism M_CLDP-ME are as follows:
S41: the total sample space can be divided into d+2 groups of sample subspaces, where the sample subspace with similarity k has size C(d+1, k); therefore Z = Σ_{k=0..d+1} C(d+1, k) · e^(αk/2) can be computed as a normalization factor.
S42: the normalization factor can be simplified by the binomial theorem, giving: Z = (1 + e^(α/2))^(d+1).
S43: the perturbation mechanism M_CLDP-ME is defined so that, for any input vector s^(i), the probability of obtaining the output vector t^(i) ∈ {−1, 1}^(d+1) through M_CLDP-ME is:

Pr[M_CLDP-ME(s^(i)) = t^(i)] = e^(α · u(s^(i), t^(i)) / 2) / (1 + e^(α/2))^(d+1),

where Pr[·] denotes a probability and α is the privacy budget under the CLDP model. u(s^(i), t^(i)) = d + 1 − d(s^(i), t^(i)) is the utility function, which quantifies the similarity between the input vector s^(i) and the output vector t^(i), with u(s^(i), t^(i)) ∈ [0, d+1]. The distance function d(s^(i), t^(i)) is the Hamming distance, i.e., d(s^(i), t^(i)) = Σ_{j=0..d} (s_j^(i) ⊕ t_j^(i)), where ⊕ is the exclusive-or operation; it reflects the degree of dissimilarity between the two vectors, provides a basis for their degree of similarity, and satisfies the non-negativity, identity, symmetry, and triangle-inequality properties of a distance function. The higher the similarity between the two vectors, the greater the utility value, meaning that the output vector t^(i) is close to the input vector s^(i) with higher probability.
S44: by the binomial theorem, the perturbation mechanism M_CLDP-ME can be further factorized over the components, so that each component is kept with probability e^(α/2)/(1 + e^(α/2)) and flipped with probability 1/(1 + e^(α/2)).
The user side passes the input vector s^(i) through the perturbation mechanism M_CLDP-ME to obtain the output vector t^(i); the specific steps are:
1) initialize the output vector t^(i);
2) for each component j, generate a uniform random variable r ∈ [0.0, 1.0);
3) if r < e^(α/2)/(1 + e^(α/2)), set the j-th component t_j^(i) = s_j^(i); otherwise set t_j^(i) = −s_j^(i).
After every component has been computed by the above rule, t^(i) is the final output vector.
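The component-wise sampling procedure above (keep each bit with probability e^(α/2)/(1+e^(α/2)), otherwise flip its sign) can be sketched as follows; names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

def perturb(s, alpha):
    """S4 / S44: apply M_CLDP-ME component-wise to an input vector in
    {-1, +1}^(d+1): keep s_j with probability e^{alpha/2} / (1 + e^{alpha/2}),
    otherwise flip its sign."""
    p_keep = np.exp(alpha / 2) / (1 + np.exp(alpha / 2))
    return np.where(rng.random(np.shape(s)) < p_keep, s, -s)
```

The product of the per-component probabilities equals the vector-level probability e^(α·u/2)/(1+e^(α/2))^(d+1), which is exactly the binomial factorization stated in S44.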
In step (S6) of the method, the specific steps of the server-side mean estimation are as follows:
S61: the server collects the perturbed output vectors {t^(1), ..., t^(n)} uploaded by the n users.
S62: the server initializes the mean estimate vector z̃.
S63: each component z̃_j of z̃ is calculated as:

z̃_j = ((e^(α/2) + 1)/(e^(α/2) − 1)) · (1/n) · Σ_{i=1..n} t_j^(i).

After every component of z̃ has been calculated by the above formula, z̃ is the final mean estimate vector.
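Steps S61–S63 can be sketched as follows (illustrative names; T is assumed to be the n perturbed vectors stacked as rows):

```python
import numpy as np

def estimate_mean(T, alpha):
    """S61-S63: average the n perturbed output vectors (rows of T) and
    multiply by the correction factor (e^{alpha/2}+1)/(e^{alpha/2}-1)
    so that the estimate is unbiased."""
    c = (np.exp(alpha / 2) + 1) / (np.exp(alpha / 2) - 1)
    return c * np.asarray(T, dtype=float).mean(axis=0)
```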
In step (S7) of the method, the specific steps of updating the model parameters at the server side are as follows:
S71: the unbiased mean estimate z̃ of the current round's output vectors obtained in step (S6) is substituted into the model update; the specific update formula is:

θ_j ← θ_j − η · z̃_j, j ∈ [0, d],

where η is the learning rate.
S72: the updated model parameters θ are sent to the n users {User_1, ..., User_i, ..., User_n}.
In step (S8) of the method, the specific steps by which the user side classifies and predicts user attribute data with unknown class labels are as follows:
S81: the server distributes the logistic regression classifier model parameters θ to the users to be classified.
S82: a user User_k with an unknown class label holds a private record User^(k) = (x^(k)) containing only d-dimensional attribute data; defining x^(k)′ = [1, x^(k)] yields the (d+1)-dimensional attribute vector x^(k)′ = (1, x_1^(k), ..., x_d^(k)).
S83: the attribute vector x^(k)′ is normalized by the same method as in step (S31).
S84: θ and x^(k)′ are substituted into the hypothesis function of logistic regression: h_θ(x^(k)′) = 1/(1 + e^(−θ^T · x^(k)′)).
S85: the user can predict its label with the hypothesis function h_θ(x^(k)′): when h_θ(x^(k)′) ≥ 0.5, the result is classified as class 1; when h_θ(x^(k)′) < 0.5, the result is classified as class 0.
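The prediction steps S81–S85 can be sketched as follows (names are illustrative; the normalization of S83 is assumed to have been applied to x already):

```python
import numpy as np

def predict(theta, x):
    """S82-S85: prepend x_0 = 1, evaluate the hypothesis function
    h_theta(x') = 1 / (1 + e^{-theta^T x'}), and threshold at 0.5."""
    x_ext = np.concatenate(([1.0], x))
    h = 1.0 / (1.0 + np.exp(-(theta @ x_ext)))
    return 1 if h >= 0.5 else 0
```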
The logistic regression method based on compressed localized differential privacy protection is now analysed. To show that the perturbation mechanism M_CLDP-ME defined in the application satisfies compressed local differential privacy, a theoretical proof is given below.
1. The perturbation mechanism M_CLDP-ME satisfies α-compressed local differential privacy.
Proof: for any two input vectors s^(i) and s′^(i) of length d+1, the ratio of the probabilities of outputting t^(i) through the perturbation mechanism M_CLDP-ME is:

Pr[M_CLDP-ME(s^(i)) = t^(i)] / Pr[M_CLDP-ME(s′^(i)) = t^(i)] = e^(α · (u(s^(i), t^(i)) − u(s′^(i), t^(i))) / 2) = e^(α · (d(s′^(i), t^(i)) − d(s^(i), t^(i))) / 2).

Because the distance function d(·, ·) satisfies the triangle inequality:

d(s′^(i), t^(i)) − d(s^(i), t^(i)) ≤ d(s′^(i), s^(i)).

Thus:

Pr[M_CLDP-ME(s^(i)) = t^(i)] ≤ e^(α · d(s′^(i), s^(i)) / 2) · Pr[M_CLDP-ME(s′^(i)) = t^(i)] ≤ e^(α · d(s′^(i), s^(i))) · Pr[M_CLDP-ME(s′^(i)) = t^(i)].

By the definition of α-compressed local differential privacy, the theorem holds. Q.E.D.
To show the validity of the mean estimation under the perturbation mechanism M_CLDP-ME, its unbiasedness and estimation error are demonstrated below.
2. Let the true mean vector and the estimated mean vector be z = [z_0, ..., z_j, ..., z_d] and z̃ = [z̃_0, ..., z̃_j, ..., z̃_d] respectively, where for any j ∈ [0, d], z_j is the true mean of the j-th component and z̃_j is the estimated mean of the j-th component. Then the unbiasedness E[z̃_j] = z_j holds.
Proof: according to the perturbation mechanism M_CLDP-ME, each component satisfies:

E[t_j^(i)] = s_j^(i) · (e^(α/2)/(1 + e^(α/2))) − s_j^(i) · (1/(1 + e^(α/2))) = s_j^(i) · (e^(α/2) − 1)/(e^(α/2) + 1).

The expectation of the estimated mean is therefore:

E[z̃_j] = ((e^(α/2) + 1)/(e^(α/2) − 1)) · (1/n) · Σ_{i=1..n} E[t_j^(i)] = (1/n) · Σ_{i=1..n} s_j^(i) = z_j.

Q.E.D.
3. For any j ∈ [0, d], let z̃_j be the unbiased mean estimate and z_j the true mean, i.e., E[z̃_j] = z_j. The estimation error of the algorithm satisfies, with probability at least 1 − β:

|z̃_j − z_j| ≤ ((e^(α/2) + 1)/(e^(α/2) − 1)) · sqrt(2·ln(2/β)/n),

where j ∈ [0, d], n is the number of users, and α is the privacy budget.
Proof: the Hoeffding inequality gives an upper bound on the probability that the mean of bounded independent random variables deviates from its expectation:

Pr(|S − E[S]| ≥ λ) ≤ 2·exp(−2·n²·λ² / Σ_{i=1..n} (b_i − a_i)²),

where S is the mean of n independent random variables x_i with x_i ∈ [a_i, b_i]. Since (1/n)·Σ_i t_j^(i) is the mean of n independent random variables taking the value 1 or −1, we have b_i = 1 and a_i = −1, so:

Pr(|(1/n)·Σ_i t_j^(i) − E[(1/n)·Σ_i t_j^(i)]| ≥ λ) ≤ 2·exp(−n·λ²/2).

Because z̃_j = ((e^(α/2) + 1)/(e^(α/2) − 1)) · (1/n)·Σ_i t_j^(i) and E[z̃_j] = z_j, it follows that:

Pr(|z̃_j − z_j| ≥ ((e^(α/2) + 1)/(e^(α/2) − 1)) · λ) ≤ 2·exp(−n·λ²/2).

Setting β = 2·exp(−n·λ²/2), i.e., λ = sqrt(2·ln(2/β)/n), and substituting into the above gives:

|z̃_j − z_j| ≤ ((e^(α/2) + 1)/(e^(α/2) − 1)) · sqrt(2·ln(2/β)/n)

with probability at least 1 − β. Q.E.D.
The following are experimental results for the perturbation mechanism M_CLDP-ME defined in the logistic regression method based on compressed localized differential privacy protection. The experimental environment is an Intel(R) Core(TM) i7-4770HQ at 2.20 GHz with 16 GB of memory, running the Windows 10 operating system. The programming language is Python.
To verify the utility of the perturbation mechanism, the mean square error (MSE, Mean Square Error) is used to measure the mean estimation accuracy of the mechanism against the currently most representative mean estimation algorithms Harmony, PM, and three-output. The mean square error is MSE = (1/T) · Σ_{t=1..T} (z̃_t − z)², where T is the number of runs, z denotes the true mean, and z̃_t denotes the estimated mean of run t. The larger the MSE, the more noise has been introduced and the lower the availability of the data. To eliminate the effect of randomness in the algorithms, each algorithm is run 50 times on each data set and the MSE values are averaged.
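The MSE metric above can be sketched as follows (an illustrative helper; the name is invented):

```python
import numpy as np

def mse(true_mean, estimates):
    """MSE = (1/T) * sum over T runs of (estimate_t - true_mean)^2,
    averaged over all reported values if the inputs are vectors."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((estimates - true_mean) ** 2))
```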
To make the results reliable, the experiments use 3 synthetic data sets and 1 real data set. Each synthetic data set contains 50,000 records, and each record consists of 6 attributes. They respectively satisfy:
1) a Uniform data set following a uniform distribution;
2) a Normal-1 data set following a normal distribution with mean 0 and standard deviation 1;
3) a Normal-2 data set following a normal distribution with mean 1 and standard deviation 2.
The real data set is the Adult data set from the UCI machine learning repository, from which 6 numerical attributes are selected and normalized.
Fig. 2 shows the effect of varying the privacy budget α or ε on the MSE of the four algorithms. As the privacy budget α or ε increases from 0.1 to 2, the degree of privacy protection decreases, the error in the data collected by the third-party server becomes smaller, and the MSE of the algorithms therefore decreases. Under the same privacy budget, the MSE of the perturbation mechanism M_CLDP-ME is lower than that of Harmony, PM, and three-output, because M_CLDP-ME introduces the CLDP model; compared with LDP, CLDP introduces a distance metric so that the output value approaches the true value with higher probability, which improves the utility of the data in mean estimation.
Fig. 3 shows the relationship between MSE and the number of users (data set size). To study the influence of the number of users on MSE, one data set is selected for mean estimation, the privacy budget α or ε is set to 1, the number of attributes is set to 6, and the number of records n of the data set takes the values n = {50000, 40000, 30000, 20000, 10000, 5000, 1000, 500}. Fig. 3 shows that MSE tends to decrease as the number of users in the data set increases: the more user samples the third party collects, the more accurate the unbiased estimate of the original data. The MSE of the perturbation mechanism M_CLDP-ME is smaller than that of Harmony, PM, and three-output on data sets of every size. The experimental results show that, compared with existing LDP mean estimation algorithms for multidimensional numerical data, the perturbation mechanism M_CLDP-ME achieves higher estimation accuracy on small data sets.
Fig. 4 shows the relationship between MSE and attribute dimension (data dimension). To study the effect of the number of attributes on MSE, the Uniform data set is selected for mean estimation, the privacy budget α or ε is set to 1, the number of users is set to 50,000, and the attribute dimension of the data set takes the values d = {1, 5, 10, 15, 20}. Fig. 4 shows that the MSE of the Harmony, PM, and three-output methods increases with the attribute dimension d, because the upper-bound error of those algorithms is positively correlated with the attribute dimension. In contrast, the error of the perturbation mechanism M_CLDP-ME is unrelated to the dimension and does not change as the attribute dimension grows, so the method is better suited to higher attribute dimensions.
The following are experimental results of the logistic regression method based on compressed localized differential privacy protection. The experimental environment is an Intel(R) Core(TM) i7-4770HQ at 2.20 GHz with 16 GB of memory, running the Windows 10 operating system. The method is implemented in Python. The Kaggle data sets 2019 Airline Delays and Google Merchandise Sale Prediction are used. The 2019 Airline Delays data come from the United States Bureau of Transportation Statistics and are used to predict whether an aircraft will delay take-off based on attribute information such as airport and weather conditions; Google Merchandise Sale Prediction comes from Google's BigQuery data warehouse and is used to predict whether each session will lead the visitor to add merchandise to the shopping cart.
The experiment adopts the accuracy F_acc of the classification prediction results to measure the utility of the method in classification tasks, i.e. F_acc = (number of correctly predicted records)/(total number of records in the test set). Five-fold cross-validation is used: the data set is divided into 5 mutually exclusive subsets, the training and testing of the logistic regression classifier are each carried out 5 times, and stratified sampling keeps the class distribution consistent across subsets. Four subsets are alternately selected as the training set, with the remaining subset as the test set. Considering the randomness of the method, each training/testing split is run 50 times and the average F_acc is taken as the final classification accuracy.
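The evaluation protocol above can be sketched as follows; this is an illustrative stand-in under stated assumptions (the helper names and the plain-NumPy fold construction are not from the patent):

```python
import numpy as np

def f_acc(y_true, y_pred):
    """Classification accuracy: correctly predicted records / total records."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def stratified_five_fold(y, seed=0):
    """Yield (train_idx, test_idx) pairs; stratified so each of the 5
    mutually exclusive subsets keeps the class proportions of y."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    folds = [[] for _ in range(5)]
    for cls in np.unique(y):                      # deal out each class...
        for k, i in enumerate(rng.permutation(np.where(y == cls)[0])):
            folds[k % 5].append(int(i))           # ...round-robin over folds
    for k in range(5):
        test = np.array(folds[k])
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train, test
```

As in the experiment, four folds train the classifier and the fifth tests it, rotating so every fold serves once as the test set.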
Fig. 5 shows the classification accuracy of each method on the 2 data sets under different privacy budgets, with the privacy budget α or ε taking the values 0.1, 0.2, 0.5, 1, 1.5 and 2. For the same data set, as the privacy budget increases, the degree of privacy protection decreases and the classification accuracy gradually increases. Under the same number of iterations and the same privacy budget, the present method achieves higher classification accuracy in the logistic regression task than Harmony, PM and three-output, because the mechanism M_CLDP-ME used to perturb the gradients submitted to the server has higher statistical utility than Harmony, PM and three-output, making model training more stable.
The foregoing describes merely exemplary embodiments of the present application and is not intended to limit its scope; any substitutions and modifications made without departing from the spirit of the application fall within its scope.

Claims (9)

1. A logistic regression method based on compressed localized differential privacy protection, characterized in that the method comprises the following steps: a user encodes a gradient vector into an input vector and perturbs it to achieve privacy protection; the server aggregates the perturbed output vectors, recovers the mean of all user input vectors, and substitutes the mean into the iterative update of the logistic regression model parameters; a user predicts the user's own label based on the model, using the decision boundary of the hypothesis function;
S1: in the training stage, the server initializes the logistic regression model parameters, sets the privacy budget value, and discloses the initial model parameters and the privacy budget value to the users;
the server is responsible for aggregating user gradients and calculating model parameters;
each User_i possesses the training data needed to participate in training the logistic regression model; there are n users in total;
S2: User_i calculates a (d+1)-dimensional numerical gradient vector ∇^(i) from the model parameters issued by the server;
S3: User_i encodes its gradient vector ∇^(i) into a (d+1)-dimensional input vector s^(i) on the user side;
S4: User_i perturbs the input vector s^(i) using the perturbation mechanism M_CLDP-ME, perturbing s^(i) into the output vector t^(i) so that it satisfies α-compressed local differential privacy;
S5: User_i sends the perturbed output vector t^(i) to the server;
S6: the server statistically analyses the output vectors t^(i) of all users to obtain the mean of the n input vectors;
S7: the server substitutes the obtained mean into the iterative formula of logistic regression and updates the model parameter θ; S2 to S7 are repeated until the model converges and the iteration ends, then step S8 is entered;
S8: the model parameter θ, i.e. the logistic regression classifier model, is obtained; for a user with an unknown class label, θ and the user's attributes are substituted into the hypothesis function of logistic regression, and the user predicts the label using the decision boundary of the hypothesis function.
2. The logistic regression method of claim 1, wherein step S1 comprises the steps of:
S11: the server initializes the logistic regression model parameter θ;
where θ is a (d+1)-dimensional vector in R^(d+1), the space of (d+1)-dimensional real vectors, with every element initialized to 0; each of the n users {User_1, ..., User_i, ..., User_n} holds a private record User^(i) = (x^(i), y^(i)), which contains the d-dimensional numerical attribute vector x^(i) = (x_1^(i), ..., x_d^(i)) and the class label y^(i) ∈ {0, 1};
S12: the server sends the initial logistic regression model parameters θ and the privacy budget α to the n users {User_1, ..., User_i, ..., User_n}.
3. The logistic regression method of claim 1, wherein in step S2, according to the gradient formula of logistic regression ∇_j^(i) = (h_θ(x^(i)′) − y^(i))·x_j^(i)′, where the hypothesis function is h_θ(x) = 1/(1 + e^(−θ^T x)), each User_i obtains on the user side a (d+1)-dimensional numerical gradient vector ∇^(i) = (∇_0^(i), ..., ∇_d^(i)), where i ∈ [1, n] and j ∈ [0, d].
4. The logistic regression method of claim 1, wherein step S3 comprises the steps of:
s31: for a pair ofNormalization allows->
S32: for a pair ofDiscretizing to obtain input vector->So that is arbitrary->The discretization is as follows:
wherein ,representing User i Gradient vector of>The value of the j-th bit.
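A minimal sketch of steps S31 and S32. The exact discretization formula is not legible in this text; the rule below (round each normalized component g_j ∈ [-1, 1] to +1 with probability (1 + g_j)/2, so that E[s_j] = g_j) is a standard unbiased choice consistent with the mean-estimation steps and should be read as an assumption, not the claimed formula:

```python
import numpy as np

def encode(grad, rng):
    """S31-S32 (assumed form): normalize a gradient vector into [-1, 1],
    then randomly round each component to {-1, +1} without bias."""
    g = np.asarray(grad, dtype=float)
    m = np.abs(g).max()
    if m > 1.0:                      # S31: scale into [-1, 1] if needed
        g = g / m
    p_plus = (1.0 + g) / 2.0         # S32: Pr[s_j = +1], so E[s_j] = g_j
    return np.where(rng.random(g.shape) < p_plus, 1, -1)
```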
5. The logistic regression method of claim 1, wherein step S4 comprises the steps of:
S41: for a vector of dimension d+1, divide the total sample space into d+2 groups of sample subspaces, where the subspace of similarity k has size C(d+1, k); calculate the normalization factor C = Σ_{k=0..d+1} C(d+1, k)·e^(αk/2);
S42: simplifying the normalization factor according to the binomial theorem gives C = (1 + e^(α/2))^(d+1);
S43: define the perturbation mechanism M_CLDP-ME: for any input vector s^(i), the probability of obtaining the output vector t^(i) through M_CLDP-ME is Pr[M_CLDP-ME(s^(i)) = t^(i)] = e^(α·u(s^(i), t^(i))/2) / C;
where Pr[·] denotes the probability, α is the privacy budget under the CLDP model, and u(s^(i), t^(i)) = d + 1 - d(s^(i), t^(i)) is a utility function defining the similarity between the input vector s^(i) and the output vector t^(i), with u(s^(i), t^(i)) ∈ [0, d+1]; the distance function d(s^(i), t^(i)) is the Hamming distance, i.e. d(s^(i), t^(i)) = Σ_{j=0..d} (s_j^(i) ⊕ t_j^(i)), where ⊕ denotes the exclusive-or (component-disagreement) operation; it reflects the degree of dissimilarity between the two vectors, provides the basis for measuring their similarity, and satisfies the non-negativity, identity, symmetry and triangle-inequality properties of a distance function; the higher the similarity between the two vectors, the greater the utility value, meaning the output vector t^(i) approaches the input vector s^(i) with higher probability;
S44: further simplifying the perturbation mechanism according to the binomial theorem shows that each component is perturbed independently: the input vector s^(i) at the user side passes through the perturbation mechanism M_CLDP-ME to obtain the output vector t^(i), each component being kept with probability e^(α/2)/(1 + e^(α/2)) and negated with probability 1/(1 + e^(α/2)).
6. The logistic regression method of claim 5, wherein the user side passes its input vector s^(i) through the perturbation mechanism M_CLDP-ME to obtain the output vector t^(i) as follows: initialize the output vector t^(i); for each component j, generate a uniform random variable r ∈ [0, 1); if r < e^(α/2)/(1 + e^(α/2)), let the j-th component of t^(i) be t_j^(i) = s_j^(i); if r ≥ e^(α/2)/(1 + e^(α/2)), let the j-th component of t^(i) be t_j^(i) = -s_j^(i); after every component has been computed by the above rule, t^(i) is the final output vector.
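Reading claim 6 as an independent per-component flip over a ±1-encoded vector, a minimal sketch follows. The keep probability e^(α/2)/(1 + e^(α/2)) follows from the normalization factor (1 + e^(α/2))^(d+1) of claim 5; where the original formula images are illegible, treat it as an assumption rather than a verbatim copy:

```python
import math
import numpy as np

def perturb(s, alpha, rng):
    """Perturbation M_CLDP-ME, simplified per component: keep s_j with
    probability e^(alpha/2)/(1+e^(alpha/2)), otherwise negate it."""
    s = np.asarray(s)
    p_keep = math.exp(alpha / 2) / (1 + math.exp(alpha / 2))
    r = rng.random(s.shape)              # uniform r in [0, 1) per component
    return np.where(r < p_keep, s, -s)
```

Multiplying the per-component probabilities reproduces the vector-level probability e^(α·u(s,t)/2)/(1 + e^(α/2))^(d+1), since the utility u(s, t) counts the agreeing components.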
7. The logistic regression method of claim 1, wherein step S6 comprises the steps of:
s61: the server collects the output vectors uploaded by the n users after disturbance
S62: initializing mean estimate vectors
S63:Component of->The calculation is as follows:
will beAnd each component in the model is calculated according to the formula to obtain a final average value estimation vector.
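A sketch of the server-side aggregation in steps S61 to S63, assuming ±1-encoded vectors and the per-component keep probability e^(α/2)/(1 + e^(α/2)); under that assumption E[t_j] = s_j·(e^(α/2) − 1)/(e^(α/2) + 1), so multiplying the empirical mean by the reciprocal factor gives an unbiased estimate. The debiasing factor is derived from this assumption, not copied from the patent:

```python
import math
import numpy as np

def aggregate_mean(T, alpha):
    """Unbiased mean estimate from the n perturbed vectors (rows of T)."""
    c = (math.exp(alpha / 2) + 1) / (math.exp(alpha / 2) - 1)  # debias factor
    return c * np.asarray(T, dtype=float).mean(axis=0)
```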
8. The logistic regression method of claim 1, wherein step S7 comprises the steps of:
s71: obtaining an average value unbiased estimation of the current round output vector according to the step S6The model update is carried in, and the specific update formula is as follows:
wherein η is the learning rate;
s72: the updated model parameters theta are sent to n users { User } 1 ,...,User i ,...,User n }。
9. The logistic regression method of claim 1, wherein the specific steps of step S8 are:
s81: the server distributes logistic regression classifier model parameters theta to users to be classified;
s82: for unknown class labelsUser of (C) k Privacy record User with attribute data only containing d dimension (k) =(x (k) ) There isDefinition x (k)′ =[1,x (k) ]Obtaining an attribute vector x of d+1 dimension (k)′ I.e.
S83: for attribute vector x (k)′ Carrying out normalization treatment;
s84: will be theta and x (k)′ Substituting the hypothesized function of logistic regression: wherein ,
s85: the user can use the hypothesized function h θ (x (k)′ ) Predicting the label of the label, when h θ (x (i)′ ) When the number is more than or equal to 0.5, classifying the result into 1 class; when h θ (x (i)′ ) When < 0.5, the results are classified into class 0.
CN202310576399.7A 2023-05-22 2023-05-22 Localized differential privacy protection logistic regression method based on compression Pending CN116611030A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310576399.7A CN116611030A (en) 2023-05-22 2023-05-22 Localized differential privacy protection logistic regression method based on compression


Publications (1)

Publication Number Publication Date
CN116611030A true CN116611030A (en) 2023-08-18

Family

ID=87681270




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination