CN116611030A - Localized differential privacy protection logistic regression method based on compression - Google Patents

Localized differential privacy protection logistic regression method based on compression

Publication number: CN116611030A
Application number: CN202310576399.7A
Authority: CN (China)
Legal status: Pending
Original language: Chinese (zh)
Prior art keywords: user, vector, logistic regression, model, users
Inventors: 王慧婷 (Wang Huiting), 陈燕俐 (Chen Yanli), 杨庚 (Yang Geng), 王周生 (Wang Zhousheng)
Assignee (original and current): Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications; priority to CN202310576399.7A
Classifications

    • G06F18/27 Regression, e.g. linear or logistic regression
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes


Abstract

The application discloses a logistic regression method with localized differential privacy protection based on compression. In the method, each user computes a gradient vector from the model parameters issued by the server and encodes it into an input vector; the input vector is then perturbed by a randomized response mechanism into an output vector, which realizes privacy protection. The server aggregates and corrects the output vectors to obtain an unbiased mean that is used to update the model parameters, and issues the updated parameters to the users participating in the next round of training. After multiple iterations a logistic regression model is trained. Finally, the logistic regression model is used to classify and predict user data with unknown class labels. The application introduces a compressed localized differential privacy model, which improves the utility and estimation precision of data statistics while protecting user privacy, balances privacy protection against data availability, and provides classification prediction for users; an attacker cannot infer individual records in the training data, and the method achieves high classification accuracy.

Description

Localized differential privacy protection logistic regression method based on compression
Technical Field
The application belongs to the field of information security, and particularly relates to a logistic regression method for localized differential privacy protection based on compression.
Background
The explosive growth of data has driven the development of data mining techniques. Massive amounts of data are generated every day by devices connected to the Internet, and analyzing and mining these data can effectively improve user experience and service quality. However, devices that hold private, sensitive information risk leaking user privacy while accessing the Internet or giving other devices access to their data, which can cause immeasurable loss to the user. Personal data therefore need privacy protection while they are mined and analyzed.
Differential privacy (Differential Privacy, DP) is a privacy-preserving model with a rigorous mathematical foundation whose guarantees can be formally proven. Differential privacy is now widely applied in machine learning, deep neural networks, federated learning, and related fields. The conventional differential privacy model, also called the centralized differential privacy model, relies on a third-party data collector that gathers user data and publishes the data set or related statistics after privacy processing with differential privacy techniques. However, centralized differential privacy depends on a trusted data collector; when the collector is untrusted, the privacy of the participants cannot be guaranteed. Localized differential privacy (Local Differential Privacy, LDP) inherits DP's ability to quantify the degree of privacy protection while moving the privacy processing from the third-party collector to the user's own device, eliminating the threat of privacy disclosure posed by an untrusted third-party server. Because perturbation mechanisms satisfying LDP typically assume large data sets with user populations on the order of millions, condensed (compressed) local differential privacy (Condensed Local Differential Privacy, CLDP) was proposed; it introduces a distance metric to establish indistinguishability between data, scaling the privacy guarantee by the distance (i.e., degree of similarity) between data items. Compared with the LDP model, the CLDP model can provide better statistical utility on smaller data sets or with more data dimensions.
At present, logistic regression algorithms based on localized differential privacy protection are usually realized by combining a mean estimation algorithm for multi-dimensional numerical data that satisfies LDP with gradient descent. However, existing LDP mean estimation algorithms for multi-dimensional numerical data generally either split the privacy budget across the dimensions or randomly sample a single dimension instead of reporting all dimensions, which leads to low statistical utility. In addition, such mean estimation algorithms typically assume large data sets with user populations on the order of millions, and applying them to small data sets yields low estimation accuracy.
Disclosure of Invention
Purpose of the application: the application aims to provide a logistic regression method with localized differential privacy protection based on compression. Addressing the defects of existing logistic regression methods based on localized differential privacy protection, the application introduces the condensed local differential privacy (CLDP) model, which offers higher statistical utility when facing more data dimensions and small data sets. At the same time, the application does not need to sample when processing the multidimensional gradient vector, which effectively avoids the estimation error caused by sampling.
The technical scheme is as follows: in the logistic regression method based on compressed localized differential privacy protection, the following steps are executed: each user perturbs its gradient vector to achieve privacy protection; the server aggregates the perturbed output vectors to recover the mean of all user input vectors; the mean is substituted into the iteration formula of logistic regression to update the model parameters; and a user predicts its label with the decision boundary of the hypothesis function based on the trained model.
S1: in the training stage, the server initializes the logistic regression model parameters, sets the privacy budget value α, and discloses the initial model parameters and the privacy budget to the users.
The server is an untrusted entity responsible for aggregating user gradients and computing model parameters.
The users, n in total, hold the training data required to train the logistic regression model.
S2: User_i computes a (d+1)-dimensional numerical gradient vector ∇^(i) from the model parameters issued by the server;
S3: User_i encodes its gradient vector ∇^(i) on the user side into a (d+1)-dimensional input vector s^(i);
S4: User_i perturbs the input vector s^(i) with the perturbation mechanism M_CLDP-ME into the output vector t^(i), so that the mechanism satisfies α-compressed local differential privacy;
S5: User_i sends the perturbed output vector t^(i) to the server;
S6: the server performs statistical analysis on the output vectors t^(i) of all users to obtain the mean of the n input vectors;
S7: the server substitutes the obtained mean into the iteration formula of logistic regression to update the model parameter θ; steps S2–S7 are repeated until the model converges and iteration ends;
S8: finally, the model parameter θ, i.e., the logistic regression classifier model, is obtained. For a user with an unknown class label, θ and the user's attributes are substituted into the hypothesis function of logistic regression, and the user can predict its label with the decision boundary of the hypothesis function.
Further, in step S1, the server initializes the logistic regression model parameters θ, where θ ∈ R^(d+1) is a (d+1)-dimensional real vector with every element set to 0. The n users are {User_1, ..., User_i, ..., User_n}; each User_i holds a private record User^(i) = (x^(i), y^(i)), which contains a d-dimensional numerical attribute vector x^(i) = (x_1^(i), ..., x_d^(i)) and a class label y^(i) ∈ {0, 1}. The server sends the initial logistic regression model parameters θ and the privacy budget α to the n users {User_1, ..., User_i, ..., User_n}.
Further, step S2 specifically follows the gradient formula of logistic regression ∇_j^(i) = (h_θ(x^(i)) − y^(i)) · x_j^(i), where the hypothesis function is h_θ(x^(i)) = 1/(1 + e^(−θ^T · x^(i))) and x_0^(i) = 1. Each User_i thus obtains on the user side a (d+1)-dimensional numerical gradient vector ∇^(i) = (∇_0^(i), ∇_1^(i), ..., ∇_d^(i)), where i ∈ [1, n], j ∈ [0, d].
Further, step S3 specifically includes: normalizing ∇^(i) so that each component lies in [−1, 1], and then discretizing the normalized vector ∇̄^(i) to obtain the input vector s^(i), so that every component s_j^(i) ∈ {−1, 1}. The discretization is:

s_j^(i) = 1 with probability (1 + ∇̄_j^(i))/2, and s_j^(i) = −1 with probability (1 − ∇̄_j^(i))/2,

where ∇̄_j^(i) denotes the value of the j-th component of User_i's normalized gradient vector ∇̄^(i).
Further, step S4 specifically includes the following steps:
S41: for a vector of dimension d+1, the total sample space is divided into d+2 groups of sample subspaces, where the sample subspace with similarity k has size C(d+1, k); therefore Z = Σ_{k=0..d+1} C(d+1, k) · e^(αk/2) can be computed as a normalization factor;
S42: simplifying the normalization factor by the binomial theorem gives: Z = (1 + e^(α/2))^(d+1);
S43: the perturbation mechanism M_CLDP-ME is defined so that, for any input vector s^(i), the probability of obtaining the output vector t^(i) ∈ {−1, 1}^(d+1) through M_CLDP-ME is:

Pr[M_CLDP-ME(s^(i)) = t^(i)] = e^(α · u(s^(i), t^(i)) / 2) / (1 + e^(α/2))^(d+1),

where Pr[·] denotes a probability, α is the privacy budget under the CLDP model, and u(s^(i), t^(i)) = d + 1 − d(s^(i), t^(i)) is the utility function, which quantifies the similarity between the input vector s^(i) and the output vector t^(i), with u(s^(i), t^(i)) ∈ [0, d+1]. The distance function d(s^(i), t^(i)) is the Hamming distance, i.e., d(s^(i), t^(i)) = Σ_{j=0..d} (s_j^(i) ⊕ t_j^(i)), where ⊕ is the exclusive-or operation; it reflects the degree of dissimilarity between the two vectors, provides a basis for their degree of similarity, and satisfies the non-negativity, identity, symmetry, and triangle-inequality properties of a distance function. The higher the similarity between the two vectors, the greater the utility value, meaning that the output vector t^(i) is close to the input vector s^(i) with higher probability.
S44: by the binomial theorem, the perturbation mechanism factorizes over the components: each component is kept, t_j^(i) = s_j^(i), with probability e^(α/2)/(1 + e^(α/2)), and flipped, t_j^(i) = −s_j^(i), with probability 1/(1 + e^(α/2)).
The user side passes the input vector s^(i) through the perturbation mechanism M_CLDP-ME to obtain the output vector t^(i).
Further, step S6 specifically includes the following steps:
S61: the server collects the perturbed output vectors {t^(1), ..., t^(n)} uploaded by the n users;
S62: the server initializes the mean estimate vector z̃;
S63: each component z̃_j of z̃ is computed as:

z̃_j = ((e^(α/2) + 1)/(e^(α/2) − 1)) · (1/n) · Σ_{i=1..n} t_j^(i).

After every component of z̃ has been computed by the above formula, z̃ is the final mean estimate vector.
Further, step S7 specifically includes the following steps:
S71: the unbiased mean estimate z̃ of the current round's output vectors obtained in step S6 is substituted into the model update; the specific update formula is:

θ_j ← θ_j − η · z̃_j, j ∈ [0, d],

where η is the learning rate.
S72: the updated model parameters θ are sent to the n users {User_1, ..., User_i, ..., User_n}.
Further, the specific steps of step S8 are as follows:
S81: the server distributes the logistic regression classifier model parameters θ to the users to be classified;
S82: a user User_k with an unknown class label holds a private record User^(k) = (x^(k)) containing only d-dimensional attribute data; defining x^(k)′ = [1, x^(k)] yields the (d+1)-dimensional attribute vector x^(k)′ = (1, x_1^(k), ..., x_d^(k));
S83: the attribute vector x^(k)′ is normalized;
S84: θ and x^(k)′ are substituted into the hypothesis function of logistic regression: h_θ(x^(k)′) = 1/(1 + e^(−θ^T · x^(k)′));
S85: the user can predict its label with the hypothesis function h_θ(x^(k)′): when h_θ(x^(k)′) ≥ 0.5, the result is classified as class 1; when h_θ(x^(k)′) < 0.5, the result is classified as class 0.
Beneficial effects: compared with the prior art, the application has the following notable advantages:
1. The application designs a logistic regression method with localized differential privacy protection based on compression. In this method, a perturbation mechanism M_CLDP-ME is defined and the compressed localized differential privacy model is introduced, so that the output value approaches the original value with higher probability; this protects user privacy while improving the statistical utility on multidimensional data sets. The perturbation mechanism M_CLDP-ME also exhibits higher estimation accuracy on small data sets.
2. Under the protection of compressed localized differential privacy, user privacy data are effectively protected even if an attacker has all background knowledge except the target private information.
3. Under the same degree of privacy protection, the logistic regression method based on compressed localized differential privacy protection achieves higher classification accuracy than existing methods and thus has better practical value.
Drawings
Fig. 1 is a schematic diagram of the logistic regression method based on compressed localized differential privacy protection of the present application.
FIG. 2 is a schematic comparison of the performance of the present application (MSE versus privacy budget).
FIG. 3 is a schematic comparison of the performance of the present application (MSE versus number of users).
FIG. 4 is a schematic comparison of the performance of the present application (MSE versus attribute dimension).
FIG. 5 is a schematic comparison of the performance of the present application (classification accuracy versus privacy budget).
Detailed Description
The technical scheme of the application is further described below with reference to the accompanying drawings.
As shown in fig. 1, the logistic regression method based on compressed localized differential privacy protection of this embodiment generally includes the following implementation steps:
S1: in the training stage, the server initializes the logistic regression model parameters, sets the privacy budget value α, and discloses the initial model parameters and the privacy budget to the users.
The server is an untrusted entity responsible for aggregating user gradients and computing model parameters.
The users, n in total, hold the training data required to train the logistic regression model.
S2: User_i computes a (d+1)-dimensional numerical gradient vector ∇^(i) from the model parameters issued by the server.
S3: User_i encodes its gradient vector ∇^(i) on the user side into a (d+1)-dimensional input vector s^(i).
S4: User_i perturbs the input vector s^(i) with the perturbation mechanism M_CLDP-ME into the output vector t^(i), so that the mechanism satisfies α-compressed local differential privacy.
S5: the user sends the perturbed output vector t^(i) to the server.
S6: the server performs statistical analysis on the perturbed data sent by all users and recovers the mean of the n input vectors as accurately as possible.
S7: the server substitutes the obtained mean into the iteration formula of logistic regression to update the model parameter θ.
Steps S2–S7 are repeated until the model converges and iteration ends.
S8: finally, the model parameter θ, i.e., the logistic regression classifier model, is obtained. For a user with an unknown class label, θ and the user's attributes are substituted into the hypothesis function of logistic regression, and the user can predict its label with the decision boundary of the hypothesis function.
In step (S1) of the method, the server's initialization of the logistic regression model parameters includes the following:
the server initializes the logistic regression model parameters θ, where θ ∈ R^(d+1) is a (d+1)-dimensional real vector with every element set to 0. The n users are {User_1, ..., User_i, ..., User_n}; each User_i holds a private record User^(i) = (x^(i), y^(i)), which contains a d-dimensional numerical attribute vector x^(i) = (x_1^(i), ..., x_d^(i)) and a class label y^(i) ∈ {0, 1}. The server sends the initial logistic regression model parameters θ and the privacy budget α to the n users {User_1, ..., User_i, ..., User_n}.
In step (S2) of the method, the user side calculates the gradient vector from the original data as follows:
according to the gradient formula of logistic regression, ∇_j^(i) = (h_θ(x^(i)) − y^(i)) · x_j^(i), where the hypothesis function is h_θ(x^(i)) = 1/(1 + e^(−θ^T · x^(i))) and x_0^(i) = 1. Each User_i can obtain on the user side a (d+1)-dimensional numerical gradient vector ∇^(i) = (∇_0^(i), ∇_1^(i), ..., ∇_d^(i)), where i ∈ [1, n], j ∈ [0, d].
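The per-user gradient computation of step S2 can be sketched as follows (a minimal illustration; the function name is invented, and the hypothesis function is the standard logistic sigmoid given above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def user_gradient(theta, x, y):
    """S2: gradient of the logistic loss for one user,
    grad_j = (h_theta(x') - y) * x'_j, where x' = [1, x] so x'_0 = 1."""
    x_ext = np.concatenate(([1.0], x))
    return (sigmoid(theta @ x_ext) - y) * x_ext

theta = np.zeros(3)                               # d = 2 attributes
g = user_gradient(theta, np.array([0.4, -0.2]), 1)
# with theta = 0, h_theta = 0.5, so g = -0.5 * [1, 0.4, -0.2]
```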
In step (S3) of the method, the encoding of the gradient vector at the user side includes the following process:
S31: ∇^(i) is normalized so that each component lies in [−1, 1]. For any dimension j, the normalization is:

∇̄_j^(i) = 2 · (∇_j^(i) − ∇_j^min) / (∇_j^max − ∇_j^min) − 1,

where ∇_j^max denotes the maximum value appearing in the j-th component of the original gradient vectors of all users participating in the training, and ∇_j^min denotes the minimum value appearing in the j-th component of the original gradient vectors of all users participating in the training.
S32: the normalized vector ∇̄^(i) is discretized to obtain the input vector s^(i), so that every component s_j^(i) ∈ {−1, 1}. The discretization is:

s_j^(i) = 1 with probability (1 + ∇̄_j^(i))/2, and s_j^(i) = −1 with probability (1 − ∇̄_j^(i))/2,

where ∇̄_j^(i) denotes the value of the j-th component of User_i's normalized gradient vector ∇̄^(i).
Since E[s_j^(i)] = ∇̄_j^(i), the discretization step preserves the unbiasedness of the data.
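Steps S31–S32 (normalize each component to [−1, 1], then randomly round to {−1, +1} so the expectation equals the normalized value) can be sketched as follows. The per-dimension bounds are assumed given; function and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def encode(grad, g_min, g_max):
    """S31 + S32: linearly map each component into [-1, 1], then round it
    randomly to {-1, +1} so that E[s_j] equals the normalised value:
    E[s_j] = 1*(1+v)/2 + (-1)*(1-v)/2 = v, i.e. the encoding is unbiased."""
    norm = 2 * (grad - g_min) / (g_max - g_min) - 1
    return np.where(rng.random(np.shape(grad)) < (1 + norm) / 2, 1, -1)
```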
In step (S4) of the method, the specific steps of the perturbation mechanism M_CLDP-ME are as follows:
S41: the total sample space can be divided into d+2 groups of sample subspaces, where the sample subspace with similarity k has size C(d+1, k); therefore Z = Σ_{k=0..d+1} C(d+1, k) · e^(αk/2) can be computed as a normalization factor.
S42: the normalization factor can be simplified by the binomial theorem, giving: Z = (1 + e^(α/2))^(d+1).
S43: the perturbation mechanism M_CLDP-ME is defined so that, for any input vector s^(i), the probability of obtaining the output vector t^(i) ∈ {−1, 1}^(d+1) through M_CLDP-ME is:

Pr[M_CLDP-ME(s^(i)) = t^(i)] = e^(α · u(s^(i), t^(i)) / 2) / (1 + e^(α/2))^(d+1),

where Pr[·] denotes a probability and α is the privacy budget under the CLDP model. u(s^(i), t^(i)) = d + 1 − d(s^(i), t^(i)) is the utility function, which quantifies the similarity between the input vector s^(i) and the output vector t^(i), with u(s^(i), t^(i)) ∈ [0, d+1]. The distance function d(s^(i), t^(i)) is the Hamming distance, i.e., d(s^(i), t^(i)) = Σ_{j=0..d} (s_j^(i) ⊕ t_j^(i)), where ⊕ is the exclusive-or operation; it reflects the degree of dissimilarity between the two vectors, provides a basis for their degree of similarity, and satisfies the non-negativity, identity, symmetry, and triangle-inequality properties of a distance function. The higher the similarity between the two vectors, the greater the utility value, meaning that the output vector t^(i) is close to the input vector s^(i) with higher probability.
S44: by the binomial theorem, the perturbation mechanism M_CLDP-ME can be further factorized over the components, so that each component is kept with probability e^(α/2)/(1 + e^(α/2)) and flipped with probability 1/(1 + e^(α/2)).
The user side passes the input vector s^(i) through the perturbation mechanism M_CLDP-ME to obtain the output vector t^(i); the specific steps are:
1) initialize the output vector t^(i);
2) for each component j, generate a uniform random variable r ∈ [0.0, 1.0);
3) if r < e^(α/2)/(1 + e^(α/2)), set the j-th component t_j^(i) = s_j^(i); otherwise set t_j^(i) = −s_j^(i).
After every component has been computed by the above rule, t^(i) is the final output vector.
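The component-wise sampling procedure above (keep each bit with probability e^(α/2)/(1+e^(α/2)), otherwise flip its sign) can be sketched as follows; names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

def perturb(s, alpha):
    """S4 / S44: apply M_CLDP-ME component-wise to an input vector in
    {-1, +1}^(d+1): keep s_j with probability e^{alpha/2} / (1 + e^{alpha/2}),
    otherwise flip its sign."""
    p_keep = np.exp(alpha / 2) / (1 + np.exp(alpha / 2))
    return np.where(rng.random(np.shape(s)) < p_keep, s, -s)
```

The product of the per-component probabilities equals the vector-level probability e^(α·u/2)/(1+e^(α/2))^(d+1), which is exactly the binomial factorization stated in S44.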
In step (S6) of the method, the specific steps of the server-side mean estimation are as follows:
S61: the server collects the perturbed output vectors {t^(1), ..., t^(n)} uploaded by the n users.
S62: the server initializes the mean estimate vector z̃.
S63: each component z̃_j of z̃ is calculated as:

z̃_j = ((e^(α/2) + 1)/(e^(α/2) − 1)) · (1/n) · Σ_{i=1..n} t_j^(i).

After every component of z̃ has been calculated by the above formula, z̃ is the final mean estimate vector.
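Steps S61–S63 can be sketched as follows (illustrative names; T is assumed to be the n perturbed vectors stacked as rows):

```python
import numpy as np

def estimate_mean(T, alpha):
    """S61-S63: average the n perturbed output vectors (rows of T) and
    multiply by the correction factor (e^{alpha/2}+1)/(e^{alpha/2}-1)
    so that the estimate is unbiased."""
    c = (np.exp(alpha / 2) + 1) / (np.exp(alpha / 2) - 1)
    return c * np.asarray(T, dtype=float).mean(axis=0)
```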
In step (S7) of the method, the specific steps of updating the model parameters at the server side are as follows:
S71: the unbiased mean estimate z̃ of the current round's output vectors obtained in step (S6) is substituted into the model update; the specific update formula is:

θ_j ← θ_j − η · z̃_j, j ∈ [0, d],

where η is the learning rate.
S72: the updated model parameters θ are sent to the n users {User_1, ..., User_i, ..., User_n}.
In step (S8) of the method, the specific steps by which the user side classifies and predicts user attribute data with unknown class labels are as follows:
S81: the server distributes the logistic regression classifier model parameters θ to the users to be classified.
S82: a user User_k with an unknown class label holds a private record User^(k) = (x^(k)) containing only d-dimensional attribute data; defining x^(k)′ = [1, x^(k)] yields the (d+1)-dimensional attribute vector x^(k)′ = (1, x_1^(k), ..., x_d^(k)).
S83: the attribute vector x^(k)′ is normalized by the same method as in step (S31).
S84: θ and x^(k)′ are substituted into the hypothesis function of logistic regression: h_θ(x^(k)′) = 1/(1 + e^(−θ^T · x^(k)′)).
S85: the user can predict its label with the hypothesis function h_θ(x^(k)′): when h_θ(x^(k)′) ≥ 0.5, the result is classified as class 1; when h_θ(x^(k)′) < 0.5, the result is classified as class 0.
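The prediction steps S81–S85 can be sketched as follows (names are illustrative; the normalization of S83 is assumed to have been applied to x already):

```python
import numpy as np

def predict(theta, x):
    """S82-S85: prepend x_0 = 1, evaluate the hypothesis function
    h_theta(x') = 1 / (1 + e^{-theta^T x'}), and threshold at 0.5."""
    x_ext = np.concatenate(([1.0], x))
    h = 1.0 / (1.0 + np.exp(-(theta @ x_ext)))
    return 1 if h >= 0.5 else 0
```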
The logistic regression method based on compressed localized differential privacy protection is now analysed. To show that the perturbation mechanism M_CLDP-ME defined in the application satisfies compressed local differential privacy, a theoretical proof is given below.
1. The perturbation mechanism M_CLDP-ME satisfies α-compressed local differential privacy.
Proof: for any two input vectors s^(i) and s′^(i) of length d+1, the ratio of the probabilities of outputting t^(i) through the perturbation mechanism M_CLDP-ME is:

Pr[M_CLDP-ME(s^(i)) = t^(i)] / Pr[M_CLDP-ME(s′^(i)) = t^(i)] = e^(α · (u(s^(i), t^(i)) − u(s′^(i), t^(i))) / 2) = e^(α · (d(s′^(i), t^(i)) − d(s^(i), t^(i))) / 2).

Because the distance function d(·, ·) satisfies the triangle inequality:

d(s′^(i), t^(i)) − d(s^(i), t^(i)) ≤ d(s′^(i), s^(i)).

Thus:

Pr[M_CLDP-ME(s^(i)) = t^(i)] ≤ e^(α · d(s′^(i), s^(i)) / 2) · Pr[M_CLDP-ME(s′^(i)) = t^(i)] ≤ e^(α · d(s′^(i), s^(i))) · Pr[M_CLDP-ME(s′^(i)) = t^(i)].

By the definition of α-compressed local differential privacy, the theorem holds. Q.E.D.
To show the validity of the mean estimation under the perturbation mechanism M_CLDP-ME, its unbiasedness and estimation error are demonstrated below.
2. Let the true mean vector and the estimated mean vector be z = [z_0, ..., z_j, ..., z_d] and z̃ = [z̃_0, ..., z̃_j, ..., z̃_d] respectively, where for any j ∈ [0, d], z_j is the true mean of the j-th component and z̃_j is the estimated mean of the j-th component. Then the unbiasedness E[z̃_j] = z_j holds.
Proof: according to the perturbation mechanism M_CLDP-ME, each component satisfies:

E[t_j^(i)] = s_j^(i) · (e^(α/2)/(1 + e^(α/2))) − s_j^(i) · (1/(1 + e^(α/2))) = s_j^(i) · (e^(α/2) − 1)/(e^(α/2) + 1).

The expectation of the estimated mean is therefore:

E[z̃_j] = ((e^(α/2) + 1)/(e^(α/2) − 1)) · (1/n) · Σ_{i=1..n} E[t_j^(i)] = (1/n) · Σ_{i=1..n} s_j^(i) = z_j.

Q.E.D.
3. For any j ∈ [0, d], let z̃_j be the unbiased mean estimate and z_j the true mean, i.e., E[z̃_j] = z_j. The estimation error of the algorithm satisfies, with probability at least 1 − β:

|z̃_j − z_j| ≤ ((e^(α/2) + 1)/(e^(α/2) − 1)) · sqrt(2·ln(2/β)/n),

where j ∈ [0, d], n is the number of users, and α is the privacy budget.
Proof: the Hoeffding inequality gives an upper bound on the probability that the mean of bounded independent random variables deviates from its expectation:

Pr(|S − E[S]| ≥ λ) ≤ 2·exp(−2·n²·λ² / Σ_{i=1..n} (b_i − a_i)²),

where S is the mean of n independent random variables x_i with x_i ∈ [a_i, b_i]. Since (1/n)·Σ_i t_j^(i) is the mean of n independent random variables taking the value 1 or −1, we have b_i = 1 and a_i = −1, so:

Pr(|(1/n)·Σ_i t_j^(i) − E[(1/n)·Σ_i t_j^(i)]| ≥ λ) ≤ 2·exp(−n·λ²/2).

Because z̃_j = ((e^(α/2) + 1)/(e^(α/2) − 1)) · (1/n)·Σ_i t_j^(i) and E[z̃_j] = z_j, it follows that:

Pr(|z̃_j − z_j| ≥ ((e^(α/2) + 1)/(e^(α/2) − 1)) · λ) ≤ 2·exp(−n·λ²/2).

Setting β = 2·exp(−n·λ²/2), i.e., λ = sqrt(2·ln(2/β)/n), and substituting into the above gives:

|z̃_j − z_j| ≤ ((e^(α/2) + 1)/(e^(α/2) − 1)) · sqrt(2·ln(2/β)/n)

with probability at least 1 − β. Q.E.D.
The following are experimental results for the perturbation mechanism M_CLDP-ME defined in the logistic regression method based on compressed localized differential privacy protection. The experimental environment is an Intel(R) Core(TM) i7-4770HQ at 2.20 GHz with 16 GB of memory, running the Windows 10 operating system. The programming language is Python.
To verify the utility of the perturbation mechanism, the mean square error (MSE, Mean Square Error) is used to measure the mean estimation accuracy of the mechanism against the currently most representative mean estimation algorithms Harmony, PM, and three-output. The mean square error is MSE = (1/T) · Σ_{t=1..T} (z̃_t − z)², where T is the number of runs, z denotes the true mean, and z̃_t denotes the estimated mean of run t. The larger the MSE, the more noise has been introduced and the lower the availability of the data. To eliminate the effect of randomness in the algorithms, each algorithm is run 50 times on each data set and the MSE values are averaged.
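The MSE metric above can be sketched as follows (an illustrative helper; the name is invented):

```python
import numpy as np

def mse(true_mean, estimates):
    """MSE = (1/T) * sum over T runs of (estimate_t - true_mean)^2,
    averaged over all reported values if the inputs are vectors."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((estimates - true_mean) ** 2))
```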
To make the results reliable, the experiments use 3 synthetic data sets and 1 real data set. Each synthetic data set contains 50,000 records, and each record consists of 6 attributes. They respectively satisfy:
1) a Uniform data set following a uniform distribution;
2) a Normal-1 data set following a normal distribution with mean 0 and standard deviation 1;
3) a Normal-2 data set following a normal distribution with mean 1 and standard deviation 2.
The real data set is the Adult data set from the UCI machine learning repository, from which 6 numerical attributes are selected and normalized.
Fig. 2 shows the effect of varying the privacy budget α or ε on the MSE of the four algorithms. As the privacy budget α or ε increases from 0.1 to 2, the degree of privacy protection decreases, the error in the data collected by the third-party server becomes smaller, and the MSE of the algorithms therefore decreases. Under the same privacy budget, the MSE of the perturbation mechanism M_CLDP-ME is lower than that of Harmony, PM, and three-output, because M_CLDP-ME introduces the CLDP model; compared with LDP, CLDP introduces a distance metric so that the output value approaches the true value with higher probability, which improves the utility of the data in mean estimation.
Fig. 3 shows the relationship between MSE and the number of users (data set size). To study the influence of the number of users on MSE, one data set is selected for mean estimation, the privacy budget α or ε is set to 1, the number of attributes is set to 6, and the number of records n of the data set takes the values n = {50000, 40000, 30000, 20000, 10000, 5000, 1000, 500}. Fig. 3 shows that MSE tends to decrease as the number of users in the data set increases: the more user samples the third party collects, the more accurate the unbiased estimate of the original data. The MSE of the perturbation mechanism M_CLDP-ME is smaller than that of Harmony, PM, and three-output on data sets of every size. The experimental results show that, compared with existing LDP mean estimation algorithms for multidimensional numerical data, the perturbation mechanism M_CLDP-ME achieves higher estimation accuracy on small data sets.
Fig. 4 shows the relationship between MSE and attribute dimension (data dimension). To study the effect of the number of attributes on MSE, the Uniform data set is selected for mean estimation, the privacy budget α or ε is set to 1, the number of users is set to 50,000, and the attribute dimension of the data set takes the values d = {1, 5, 10, 15, 20}. Fig. 4 shows that the MSE of the Harmony, PM, and three-output methods increases with the attribute dimension d, because the upper-bound error of those algorithms is positively correlated with the attribute dimension. In contrast, the error of the perturbation mechanism M_CLDP-ME is unrelated to the dimension and does not change as the attribute dimension grows, so the method is better suited to higher attribute dimensions.
The following are experimental results of the logistic regression method based on compressed localized differential privacy protection. The experimental environment is an Intel(R) Core(TM) i7-4770HQ at 2.20 GHz with 16 GB of memory, running the Windows 10 operating system. The method is implemented in Python. The Kaggle data sets 2019 Airline Delays and Google Merchandise Sale Prediction are used. The 2019 Airline Delays data come from the United States Bureau of Transportation Statistics and are used to predict whether an aircraft will delay take-off based on attribute information such as airport and weather conditions; Google Merchandise Sale Prediction comes from Google's BigQuery data warehouse and is used to predict whether each session will lead the visitor to add merchandise to the shopping cart.
The experiment adopts the accuracy F_acc of the classification prediction results to measure the utility of the method in classification tasks, i.e. F_acc = (number of correctly predicted records)/(total number of records in the test set). Five-fold cross-validation is used: the data set is divided into 5 mutually exclusive subsets, the training and testing of the logistic regression classifier are each carried out 5 times, and stratified sampling keeps the class distribution consistent across subsets. Four subsets are alternately selected as the training set, with the remaining subset as the test set. Considering the randomness of the method, each training/testing split is run 50 times and the average F_acc is taken as the final classification accuracy.
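The evaluation protocol above can be sketched as follows; this is an illustrative stand-in under stated assumptions (the helper names and the plain-NumPy fold construction are not from the patent):

```python
import numpy as np

def f_acc(y_true, y_pred):
    """Classification accuracy: correctly predicted records / total records."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def stratified_five_fold(y, seed=0):
    """Yield (train_idx, test_idx) pairs; stratified so each of the 5
    mutually exclusive subsets keeps the class proportions of y."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    folds = [[] for _ in range(5)]
    for cls in np.unique(y):                      # deal out each class...
        for k, i in enumerate(rng.permutation(np.where(y == cls)[0])):
            folds[k % 5].append(int(i))           # ...round-robin over folds
    for k in range(5):
        test = np.array(folds[k])
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train, test
```

As in the experiment, four folds train the classifier and the fifth tests it, rotating so every fold serves once as the test set.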
Fig. 5 shows the classification accuracy of each method on the 2 data sets under different privacy budgets, with the privacy budget α or ε taking the values 0.1, 0.2, 0.5, 1, 1.5 and 2. For the same data set, as the privacy budget increases, the degree of privacy protection decreases and the classification accuracy gradually increases. Under the same number of iterations and the same privacy budget, the present method achieves higher classification accuracy in the logistic regression task than Harmony, PM and three-output, because the mechanism M_CLDP-ME used to perturb the gradients submitted to the server has higher statistical utility than Harmony, PM and three-output, making model training more stable.
The foregoing describes merely exemplary embodiments of the present application and is not intended to limit its scope; any substitutions and modifications made without departing from the spirit of the application fall within its scope.

Claims (9)

1. A logistic regression method based on compressed localized differential privacy protection, characterized in that the method comprises the following steps: a user encodes a gradient vector into an input vector and perturbs it to achieve privacy protection; the server aggregates the perturbed output vectors, recovers the mean of all user input vectors, and substitutes the mean into the iterative update of the logistic regression model parameters; a user predicts the user's own label based on the model, using the decision boundary of the hypothesis function;
S1: in the training stage, the server initializes the logistic regression model parameters, sets the privacy budget value, and discloses the initial model parameters and the privacy budget value to the users;
the server is responsible for aggregating user gradients and calculating model parameters;
each User_i possesses the training data needed to participate in training the logistic regression model; there are n users in total;
S2: User_i calculates a (d+1)-dimensional numerical gradient vector ∇^(i) from the model parameters issued by the server;
S3: User_i encodes its gradient vector ∇^(i) into a (d+1)-dimensional input vector s^(i) on the user side;
S4: User_i perturbs the input vector s^(i) using the perturbation mechanism M_CLDP-ME, perturbing s^(i) into the output vector t^(i) so that it satisfies α-compressed local differential privacy;
S5: User_i sends the perturbed output vector t^(i) to the server;
S6: the server statistically analyses the output vectors t^(i) of all users to obtain the mean of the n input vectors;
S7: the server substitutes the obtained mean into the iterative formula of logistic regression and updates the model parameter θ; S2 to S7 are repeated until the model converges and the iteration ends, then step S8 is entered;
S8: the model parameter θ, i.e. the logistic regression classifier model, is obtained; for a user with an unknown class label, θ and the user's attributes are substituted into the hypothesis function of logistic regression, and the user predicts the label using the decision boundary of the hypothesis function.
2. The logistic regression method of claim 1, wherein step S1 comprises the steps of:
S11: the server initializes the logistic regression model parameter θ;
where θ is a (d+1)-dimensional vector in R^(d+1), the space of (d+1)-dimensional real vectors, with every element initialized to 0; each of the n users {User_1, ..., User_i, ..., User_n} holds a private record User^(i) = (x^(i), y^(i)), which contains the d-dimensional numerical attribute vector x^(i) = (x_1^(i), ..., x_d^(i)) and the class label y^(i) ∈ {0, 1};
S12: the server sends the initial logistic regression model parameters θ and the privacy budget α to the n users {User_1, ..., User_i, ..., User_n}.
3. The logistic regression method of claim 1, wherein in step S2, according to the gradient formula of logistic regression ∇_j^(i) = (h_θ(x^(i)′) − y^(i))·x_j^(i)′, where the hypothesis function is h_θ(x) = 1/(1 + e^(−θ^T x)), each User_i obtains on the user side a (d+1)-dimensional numerical gradient vector ∇^(i) = (∇_0^(i), ..., ∇_d^(i)), where i ∈ [1, n] and j ∈ [0, d].
4. The logistic regression method of claim 1, wherein step S3 comprises the steps of:
s31: for a pair ofNormalization allows->
S32: for a pair ofDiscretizing to obtain input vector->So that is arbitrary->The discretization is as follows:
wherein ,representing User i Gradient vector of>The value of the j-th bit.
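A minimal sketch of steps S31 and S32. The exact discretization formula is not legible in this text; the rule below (round each normalized component g_j ∈ [-1, 1] to +1 with probability (1 + g_j)/2, so that E[s_j] = g_j) is a standard unbiased choice consistent with the mean-estimation steps and should be read as an assumption, not the claimed formula:

```python
import numpy as np

def encode(grad, rng):
    """S31-S32 (assumed form): normalize a gradient vector into [-1, 1],
    then randomly round each component to {-1, +1} without bias."""
    g = np.asarray(grad, dtype=float)
    m = np.abs(g).max()
    if m > 1.0:                      # S31: scale into [-1, 1] if needed
        g = g / m
    p_plus = (1.0 + g) / 2.0         # S32: Pr[s_j = +1], so E[s_j] = g_j
    return np.where(rng.random(g.shape) < p_plus, 1, -1)
```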
5. The logistic regression method of claim 1, wherein step S4 comprises the steps of:
S41: for a vector of dimension d+1, divide the total sample space into d+2 groups of sample subspaces, where the subspace of similarity k has size C(d+1, k); calculate the normalization factor C = Σ_{k=0..d+1} C(d+1, k)·e^(αk/2);
S42: simplifying the normalization factor according to the binomial theorem gives C = (1 + e^(α/2))^(d+1);
S43: define the perturbation mechanism M_CLDP-ME: for any input vector s^(i), the probability of obtaining the output vector t^(i) through M_CLDP-ME is Pr[M_CLDP-ME(s^(i)) = t^(i)] = e^(α·u(s^(i), t^(i))/2) / C;
where Pr[·] denotes the probability, α is the privacy budget under the CLDP model, and u(s^(i), t^(i)) = d + 1 - d(s^(i), t^(i)) is a utility function defining the similarity between the input vector s^(i) and the output vector t^(i), with u(s^(i), t^(i)) ∈ [0, d+1]; the distance function d(s^(i), t^(i)) is the Hamming distance, i.e. d(s^(i), t^(i)) = Σ_{j=0..d} (s_j^(i) ⊕ t_j^(i)), where ⊕ denotes the exclusive-or (component-disagreement) operation; it reflects the degree of dissimilarity between the two vectors, provides the basis for measuring their similarity, and satisfies the non-negativity, identity, symmetry and triangle-inequality properties of a distance function; the higher the similarity between the two vectors, the greater the utility value, meaning the output vector t^(i) approaches the input vector s^(i) with higher probability;
S44: further simplifying the perturbation mechanism according to the binomial theorem shows that each component is perturbed independently: the input vector s^(i) at the user side passes through the perturbation mechanism M_CLDP-ME to obtain the output vector t^(i), each component being kept with probability e^(α/2)/(1 + e^(α/2)) and negated with probability 1/(1 + e^(α/2)).
6. The logistic regression method of claim 5, wherein the user side passes its input vector s^(i) through the perturbation mechanism M_CLDP-ME to obtain the output vector t^(i) as follows: initialize the output vector t^(i); for each component j, generate a uniform random variable r ∈ [0, 1); if r < e^(α/2)/(1 + e^(α/2)), let the j-th component of t^(i) be t_j^(i) = s_j^(i); if r ≥ e^(α/2)/(1 + e^(α/2)), let the j-th component of t^(i) be t_j^(i) = -s_j^(i); after every component has been computed by the above rule, t^(i) is the final output vector.
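Reading claim 6 as an independent per-component flip over a ±1-encoded vector, a minimal sketch follows. The keep probability e^(α/2)/(1 + e^(α/2)) follows from the normalization factor (1 + e^(α/2))^(d+1) of claim 5; where the original formula images are illegible, treat it as an assumption rather than a verbatim copy:

```python
import math
import numpy as np

def perturb(s, alpha, rng):
    """Perturbation M_CLDP-ME, simplified per component: keep s_j with
    probability e^(alpha/2)/(1+e^(alpha/2)), otherwise negate it."""
    s = np.asarray(s)
    p_keep = math.exp(alpha / 2) / (1 + math.exp(alpha / 2))
    r = rng.random(s.shape)              # uniform r in [0, 1) per component
    return np.where(r < p_keep, s, -s)
```

Multiplying the per-component probabilities reproduces the vector-level probability e^(α·u(s,t)/2)/(1 + e^(α/2))^(d+1), since the utility u(s, t) counts the agreeing components.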
7. The logistic regression method of claim 1, wherein step S6 comprises the steps of:
s61: the server collects the output vectors uploaded by the n users after disturbance
S62: initializing mean estimate vectors
S63:Component of->The calculation is as follows:
will beAnd each component in the model is calculated according to the formula to obtain a final average value estimation vector.
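A sketch of the server-side aggregation in steps S61 to S63, assuming ±1-encoded vectors and the per-component keep probability e^(α/2)/(1 + e^(α/2)); under that assumption E[t_j] = s_j·(e^(α/2) − 1)/(e^(α/2) + 1), so multiplying the empirical mean by the reciprocal factor gives an unbiased estimate. The debiasing factor is derived from this assumption, not copied from the patent:

```python
import math
import numpy as np

def aggregate_mean(T, alpha):
    """Unbiased mean estimate from the n perturbed vectors (rows of T)."""
    c = (math.exp(alpha / 2) + 1) / (math.exp(alpha / 2) - 1)  # debias factor
    return c * np.asarray(T, dtype=float).mean(axis=0)
```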
8. The logistic regression method of claim 1, wherein step S7 comprises the steps of:
s71: obtaining an average value unbiased estimation of the current round output vector according to the step S6The model update is carried in, and the specific update formula is as follows:
wherein η is the learning rate;
s72: the updated model parameters theta are sent to n users { User } 1 ,...,User i ,...,User n }。
9. The logistic regression method of claim 1, wherein the specific steps of step S8 are:
s81: the server distributes logistic regression classifier model parameters theta to users to be classified;
s82: for unknown class labelsUser of (C) k Privacy record User with attribute data only containing d dimension (k) =(x (k) ) There isDefinition x (k)′ =[1,x (k) ]Obtaining an attribute vector x of d+1 dimension (k)′ I.e.
S83: for attribute vector x (k)′ Carrying out normalization treatment;
s84: will be theta and x (k)′ Substituting the hypothesized function of logistic regression: wherein ,
s85: the user can use the hypothesized function h θ (x (k)′ ) Predicting the label of the label, when h θ (x (i)′ ) When the number is more than or equal to 0.5, classifying the result into 1 class; when h θ (x (i)′ ) When < 0.5, the results are classified into class 0.
CN202310576399.7A 2023-05-22 2023-05-22 Localized differential privacy protection logistic regression method based on compression Pending CN116611030A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310576399.7A CN116611030A (en) 2023-05-22 2023-05-22 Localized differential privacy protection logistic regression method based on compression


Publications (1)

Publication Number Publication Date
CN116611030A true CN116611030A (en) 2023-08-18

Family

ID=87681270




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination