CN111539769A - Training method and device of anomaly detection model based on differential privacy

Training method and device of anomaly detection model based on differential privacy

Info

Publication number
CN111539769A
Authority
CN
China
Prior art keywords
vector
gradient
sample
evaluation
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010343419.2A
Other languages
Chinese (zh)
Inventor
熊涛 (Xiong Tao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010343419.2A (CN111539769A)
Publication of CN111539769A
Priority to TW110110603A (TWI764640B)
Priority to PCT/CN2021/089398 (WO2021218828A1)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 - Commerce
    • G06Q 30/02 - Marketing; Price estimation or determination; Fundraising
    • G06Q 30/0207 - Discounts or incentives, e.g. coupons or rebates
    • G06Q 30/0225 - Avoiding frauds
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 - Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate


Abstract

Embodiments of this specification provide a method for training a differential-privacy-based anomaly detection model, comprising the following steps. A first vector of any sample in a training set is input into a self-encoding network; an encoder outputs a dimension-reduced second vector, and a decoder outputs a third vector that restores the first vector. An evaluation vector is then constructed based on the second vector and input into an evaluation network, which outputs the sub-distribution probabilities that the sample belongs to each of K sub-Gaussian distributions in a mixed Gaussian distribution. Next, a first probability of the sample under the mixed Gaussian distribution is obtained from the evaluation vectors and sub-distribution probabilities of the samples in the training set. A prediction loss is determined therefrom that is negatively correlated with the first probability of each sample and with the similarity between the first vector and the third vector. Finally, noise is added by means of differential privacy to the original gradient derived from the prediction loss, and the model parameters of the anomaly detection model are adjusted using the noise-containing gradient.

Description

Training method and device of anomaly detection model based on differential privacy
Technical Field
One or more embodiments of this specification relate to the field of computer technology, and in particular to a computer-executed method and apparatus for training an anomaly detection model based on differential privacy.
Background
With the development of computer technology, security becomes an increasing concern, such as security of computer data, security of transactions for electronic payments, security of network access, and the like. For this reason, in many scenarios, it is necessary to find abnormal samples that may affect security from a large number of samples and take measures against the abnormal samples.
For example, it is desirable to detect abnormal transaction operations from a large number of transaction-operation samples, so that fraudulent transactions can be guarded against in advance; to detect abnormal accesses from network-access samples, so as to discover insecure accesses such as hacker attacks; to identify abnormal accounts among user accounts performing various operations, so as to lock accounts suspected of high-risk operations (fraudulent transactions, fake transactions such as order brushing, network attacks); and to discover abnormal operations among a large number of benefit-collection operations (e.g., operations that collect marketing red packets, rewards, or coupons), so as to guard against "black industry" operations that maliciously collect benefits, and so on.
However, labeling abnormal samples is often very time- and labor-consuming, and the number of abnormal samples is generally small, which makes conventional supervised learning methods difficult to apply. Therefore, unsupervised approaches have been proposed that attempt to detect abnormal samples from a large number of samples. Unsupervised anomaly detection is typically based on estimating the distribution probability or density of the samples, and statistically finds the outlier samples that deviate from the majority of regular samples, treating them as abnormal.
However, existing unsupervised anomaly detection models often risk leaking the training samples, and suffer from insufficient robustness and generalization caused by overfitting. Improved approaches yielding safer and more effective anomaly detection models are therefore desired.
Disclosure of Invention
One or more embodiments of the present specification describe a method for training an anomaly detection model based on differential privacy, so as to obtain an anomaly detection model with privacy protection and robustness.
According to a first aspect, there is provided a training method of an anomaly detection model based on differential privacy, the anomaly detection model comprising a self-coding network and an evaluation network, the self-coding network comprising an encoder and a decoder; the method comprises the following steps:
inputting a first feature vector corresponding to any service sample in a training set into the self-coding network, outputting a second feature vector for reducing the dimension of the first feature vector through the encoder, and outputting a third feature vector for restoring the first feature vector based on the second feature vector through the decoder;
constructing an evaluation vector based on the second feature vector, and inputting the evaluation vector into the evaluation network;
acquiring sub-distribution probabilities, output by the evaluation network, that the arbitrary service sample belongs to each of K sub-Gaussian distributions in a mixed Gaussian distribution;
obtaining a first probability of the arbitrary service sample under the mixed Gaussian distribution according to the evaluation vector and the sub-distribution probability corresponding to each service sample in the training set;
determining a prediction loss corresponding to the training set, wherein the prediction loss is inversely related to the first probability corresponding to each service sample and inversely related to a similarity between the first feature vector and the third feature vector corresponding to each service sample;
and adding noise to the original gradient obtained based on the prediction loss by using a differential privacy mode, and adjusting the model parameters of the anomaly detection model by using the gradient containing the noise.
In one embodiment, the evaluation vector is the second feature vector.
In another embodiment, the evaluation vector is constructed by: obtaining a reconstruction error vector based on the first feature vector and the third feature vector; combining the second feature vector and the reconstructed error vector as the evaluation vector.
According to one embodiment, the first probability is determined by: determining the mean value and covariance of each sub-Gaussian distribution in the K sub-Gaussian distributions and the occurrence probability of the sub-Gaussian distribution in the K sub-Gaussian distributions according to the evaluation vector and the sub-distribution probability of each service sample; reconstructing the mixed Gaussian distribution according to the mean value, covariance and occurrence probability of each sub-Gaussian distribution; and substituting the evaluation vector of any service sample into the reconstructed mixed Gaussian distribution to obtain the first probability.
In one embodiment, the step of determining the prediction loss corresponding to the training set may include: determining a first loss term according to the first probability corresponding to each service sample, the first loss term being inversely related to the first probability of each service sample; determining a second loss term according to the similarity between the first feature vector and the third feature vector corresponding to each service sample, the second loss term being inversely related to the similarity; and performing weighted summation of the first loss term and the second loss term according to a preset weight factor to obtain the prediction loss.
According to an embodiment, adding noise to the original gradient obtained based on the prediction loss by using a differential privacy method may specifically include: determining an original gradient that reduces the prediction loss according to the prediction loss; based on a preset clipping threshold value, clipping the original gradient to obtain a clipping gradient; determining Gaussian noise for realizing differential privacy by utilizing a Gaussian distribution determined based on the clipping threshold, wherein the variance of the Gaussian distribution is positively correlated with the square of the clipping threshold; and superposing the Gaussian noise and the cutting gradient to obtain the gradient containing the noise.
In one embodiment, a first original gradient corresponding to the evaluation network and a second original gradient corresponding to the self-encoding network are determined by gradient back propagation, respectively; respectively adding noise in the first original gradient and the second original gradient by using a differential privacy mode to obtain a first noise gradient and a second noise gradient; adjusting a parameter of the evaluation network using the first noise gradient; and adjusting the parameters of the self-coding network by using the second noise gradient.
In another embodiment, on the basis of respectively determining a first original gradient and a second original gradient through gradient back propagation, noise is added to the second original gradient in a differential privacy mode to obtain a second noise gradient; adjusting parameters of the evaluation network by using the first original gradient; and adjusting the parameters of the self-coding network by using the second noise gradient.
In various embodiments, the arbitrary service sample may include one of: a sample user, a sample merchant, a sample event.
According to a second aspect, there is provided a method of predicting an abnormal sample, comprising:
acquiring an anomaly detection model based on differential privacy, which is obtained by training according to the method of the first aspect, wherein the anomaly detection model comprises a self-coding network and an evaluation network, and the self-coding network comprises an encoder and a decoder;
inputting a first target vector corresponding to a target service sample to be detected into the self-coding network, and outputting a second target vector for reducing the dimension of the first target vector through the encoder;
constructing a target evaluation vector based on the second target vector;
inputting the target evaluation vector into a Gaussian mixture distribution constructed by the evaluation network to obtain a target probability of the target service sample in the Gaussian mixture distribution;
and determining whether the target service sample is an abnormal sample or not according to the target probability.
According to a third aspect, there is provided a training apparatus for an anomaly detection model based on differential privacy, the anomaly detection model comprising a self-encoding network and an evaluation network, the self-encoding network comprising an encoder and a decoder; the device comprises:
a first input unit, configured to input a first feature vector corresponding to any service sample in a training set into the self-coding network, output a second feature vector for reducing the dimension of the first feature vector through the encoder, and output a third feature vector for restoring the first feature vector based on the second feature vector through the decoder;
a second input unit configured to construct an evaluation vector based on the second feature vector, and input the evaluation vector into the evaluation network;
a sub-distribution obtaining unit configured to obtain the sub-distribution probabilities, output by the evaluation network, that the arbitrary service sample belongs to each of K sub-Gaussian distributions in a mixed Gaussian distribution;
a probability determining unit configured to obtain a first probability of the arbitrary service sample in the gaussian mixture distribution according to the evaluation vector and the sub-distribution probability corresponding to each service sample in the training set;
a loss determining unit configured to determine a prediction loss corresponding to the training set, wherein the prediction loss is negatively correlated with the first probability corresponding to each business sample and is negatively correlated with a similarity between a first feature vector and a third feature vector corresponding to each business sample;
and a parameter adjusting unit configured to add noise to an original gradient obtained based on the prediction loss in a differential privacy manner, and adjust a model parameter of the abnormality detection model using a gradient including the noise.
According to a fourth aspect, there is provided an apparatus for predicting abnormal samples, comprising:
a model obtaining unit configured to obtain an anomaly detection model based on differential privacy, which is trained by the apparatus according to the third aspect, the anomaly detection model including a self-coding network and an evaluation network, the self-coding network including an encoder and a decoder;
the input unit is configured to input a first target vector corresponding to a target service sample to be detected into the self-coding network, and output a second target vector for reducing the dimension of the first target vector through the encoder;
a vector construction unit configured to construct a target evaluation vector based on the second target vector;
the probability determining unit is configured to input the target evaluation vector into a Gaussian mixture distribution constructed by the evaluation network to obtain a target probability of the target service sample in the Gaussian mixture distribution;
and the abnormity judging unit is configured to determine whether the target service sample is an abnormal sample according to the target probability.
According to a fifth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first or second aspect.
According to a sixth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has stored therein executable code, and wherein the processor, when executing the executable code, implements the method of the first or second aspect.
With the method and apparatus provided by the embodiments of this specification, differential privacy is introduced into the anomaly detection model through differentially private gradient descent. The anomaly detection model thus obtained has at least two advantages. First, because differential privacy is introduced, the information of the training samples is difficult to reverse-engineer or identify from the published model, which provides privacy protection. Second, the objective of training an unsupervised anomaly detection model is to fit the distribution of the training samples. Conventional training often overfits some samples; in particular, a training set sometimes contains noise samples, and a model that overfits them loses predictive performance. With differential privacy, noise is added to the gradient, so the model can resist the influence of noise samples and avoid overfitting, improving the robustness and predictive performance of the anomaly detection model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 illustrates an architectural diagram of an anomaly detection model according to the concepts of the present technology;
FIG. 2 illustrates a flow diagram of a method of training a differential privacy based anomaly detection model, according to one embodiment;
FIG. 3 illustrates a flow diagram of a method for anomaly detection of traffic samples in one embodiment;
FIG. 4 shows a schematic block diagram of a training apparatus of an anomaly detection model according to one embodiment;
FIG. 5 shows a schematic block diagram of an apparatus to predict an abnormal sample according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Fig. 1 illustrates an architecture diagram of an anomaly detection model according to the technical concept of this specification. As shown in Fig. 1, the anomaly detection model generally includes a self-encoding network 100 and an evaluation network 200, where the self-encoding network 100 includes an encoder 110 and a decoder 120. The encoder 110 encodes a high-dimensional feature vector x of an input business sample into a low-dimensional vector z_c, and the decoder 120 outputs, based on the low-dimensional vector z_c, a decoded vector x' that restores the high-dimensional feature vector x. In a trained self-encoding network, the low-dimensional vector z_c produced by the encoder characterizes the core features of the original high-dimensional feature vector x well, thus performing vector dimension reduction.
Distribution statistics over the sample set are then performed on the reduced-dimension vectors z_c. Specifically, the low-dimensional vector z_c that the encoder outputs for each sample is input into the evaluation network 200. According to an embodiment of this specification, the evaluation network 200 is based on a Gaussian Mixture Model (GMM), which assumes that the samples as a whole obey a mixed Gaussian distribution composed of K sub-Gaussian distributions. Accordingly, for each sample, the evaluation network 200 may output the sub-distribution probabilities that the sample belongs to each of the K sub-Gaussian distributions. The sub-distribution probabilities of all samples taken together can be used to reconstruct the mixed Gaussian distribution, thereby realizing unsupervised training of the GMM.
Further, to enhance the privacy security and robustness of the model, differential privacy may be introduced into the anomaly detection model, in particular into the encoder 110. Specifically, an encoder with differential privacy can be obtained by adopting differential-privacy-based gradient descent, i.e., by adding noise to the gradient during training. On the one hand, this protects the security of private data by preventing the training samples from being reverse-deduced from the trained anomaly detection model; on the other hand, the introduced differential privacy prevents the model from overfitting to certain samples (particularly noise-contaminated samples), thereby improving the robustness of the anomaly detection model.
The following describes a specific implementation of the above concept.
FIG. 2 illustrates a flow diagram of a method of training a differential privacy based anomaly detection model, according to one embodiment. It is to be appreciated that the method can be performed by any apparatus, device, platform, cluster of devices having computing and processing capabilities. The following describes a training process of the anomaly detection model based on differential privacy, with reference to the architecture of the anomaly detection model shown in fig. 1 and the method flow shown in fig. 2.
First, in step 21, a first feature vector x corresponding to an arbitrary first business sample in a training set is input into the self-encoding network; the encoder outputs a second feature vector z_c that reduces the dimension of the first feature vector x, and the decoder outputs a third feature vector x' that restores the first feature vector x based on the second feature vector z_c.
Specifically, the training set may be a sample set obtained by randomly sampling business samples, where none of the business samples carries a manually labeled abnormal/normal label. In different embodiments, a business sample may be a sample user, a sample merchant, a sample event, and so on, where a sample event may in turn include, for example, a transaction event, a login event, a purchase event, a social interaction event, and the like.
Assume the training set contains N business samples; the first business sample may be any one of them. The first feature vector x may contain different content depending on what the business sample is. For example, when the business sample is a user, the first feature vector x may contain attribute features of the user, such as basic attributes like age, gender, registration duration, and education level, and behavioral attributes such as recent browsing history and recent purchase history. When the business sample is a merchant, the first feature vector x may contain attribute features of the merchant, such as merchant category, registration duration, number of commodities, sales volume, and number of followers. Or, in one example, the business sample is a business event, such as a login event, and the corresponding first feature vector x may contain attribute features of the logging-in user, behavioral features of the login behavior, device features of the device used for login, and the like.
Generally, to better characterize a business sample, the first feature vector x may have a rather high dimension, e.g., several hundred dimensions or more. High-dimensional vectors pose certain difficulties for sample distribution statistics; therefore, in the embodiments of this specification, a self-encoding network is used for dimension reduction.
Specifically, the first feature vector x is input into the encoder 110 shown in Fig. 1. The encoder 110 may be implemented as a multi-layer perceptron in which the number of neurons per layer decreases progressively, producing at the output layer a second feature vector z_c, also called the code vector. The dimension d of the code vector z_c is much smaller than the dimension D of the input first feature vector x, thereby reducing the dimensionality of the input vector. For example, a feature vector x of several hundred dimensions may be compressed into a code vector z_c of several tens of dimensions, or even fewer.
The code vector z_c is further input into the decoder 120. The decoder 120 is structurally symmetric to the encoder 110, and its algorithm and model parameters are associated with (e.g., the inverse of) the corresponding ones in the encoder 110. The decoder 120 can thus restore the first feature vector x based on the code vector z_c and output a third feature vector x'. It will be appreciated that the code vector z_c is a dimension-reduced version of the first feature vector x; the smaller the information loss of the dimension-reduction operation, i.e., the higher the information content of the reduced code vector z_c, the easier it is to restore the input feature vector x, and hence the higher the similarity between the first feature vector x and the restored third feature vector x'. This property is later used to train the self-encoding network.
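For illustration, the following is a minimal sketch of such a self-encoding network in PyTorch. The framework, layer widths, and dimensions (D = 300, d = 10) are assumptions of this sketch; the specification does not prescribe a particular architecture.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Self-encoding network: the encoder compresses x (dimension D) into a
    code vector z_c (dimension d << D); the decoder restores x' from z_c."""
    def __init__(self, input_dim: int = 300, code_dim: int = 10):
        super().__init__()
        # Encoder: a multi-layer perceptron whose layer widths decrease (D -> ... -> d)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.Tanh(),
            nn.Linear(64, code_dim),
        )
        # Decoder: structurally symmetric to the encoder (d -> ... -> D)
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 64), nn.Tanh(),
            nn.Linear(64, input_dim),
        )

    def forward(self, x: torch.Tensor):
        z_c = self.encoder(x)       # second feature vector (code vector)
        x_rec = self.decoder(z_c)   # third feature vector x'
        return z_c, x_rec
```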
Next, in step 22, an evaluation vector z is constructed based on the dimension-reduced second feature vector z_c, and the evaluation vector z is input into the evaluation network.
In one embodiment, the second feature vector z_c may be used directly as the evaluation vector z and input into the evaluation network 200 of Fig. 1.
In another embodiment, a reconstruction error vector z_r may first be obtained based on the first feature vector x and the restored third feature vector x', and the second feature vector z_c and the reconstruction error vector z_r are then combined into the evaluation vector z. This process can be expressed as:

z_r = f(x, x')    (1)

z = [z_c, z_r]    (2)

where f in equation (1) denotes the function computing the reconstruction error vector z_r. In different examples, f may compute the absolute Euclidean distance, the relative Euclidean distance, the cosine similarity, etc., of the first feature vector x and the third feature vector x'. The combination of the second feature vector z_c and the reconstruction error vector z_r in equation (2) may include concatenation, summation, weighted summation, and the like.
In any of the above ways, an evaluation vector z is obtained whose dimension is much smaller than that of the original first feature vector x. The evaluation vector z is then input into the evaluation network 200.
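As an illustration of equations (1) and (2), one hedged way to construct the evaluation vector is sketched below, taking the relative Euclidean distance and cosine similarity as the reconstruction-error features; the specification leaves the choice of f open.

```python
import torch
import torch.nn.functional as F

def build_evaluation_vector(x: torch.Tensor, x_rec: torch.Tensor,
                            z_c: torch.Tensor) -> torch.Tensor:
    """Equations (1)-(2): z_r = f(x, x'), z = [z_c, z_r].
    Here f yields two error features (relative Euclidean distance and
    cosine similarity); other choices of f are equally valid."""
    rel_euclid = (x - x_rec).norm(dim=-1) / x.norm(dim=-1).clamp_min(1e-12)
    cos_sim = F.cosine_similarity(x, x_rec, dim=-1)
    z_r = torch.stack([rel_euclid, cos_sim], dim=-1)   # reconstruction error vector
    return torch.cat([z_c, z_r], dim=-1)               # combination by concatenation
```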
As previously mentioned, the evaluation network 200 is based on a Gaussian Mixture Model (GMM). Under the GMM, the sample distribution is assumed to follow a mixed Gaussian distribution, which can be decomposed into a combination of K sub-Gaussian distributions. When the evaluation vector z corresponding to the first business sample is input into the evaluation network 200, in step 23 the evaluation network 200 may output, based on the evaluation vector z, the sub-distribution probability vector γ̂ of the first business sample over the K sub-Gaussian distributions. Here γ̂ is a K-dimensional vector whose k-th element is the probability of the first business sample under the k-th sub-Gaussian distribution. In one example, the sub-distribution probability vector γ̂ is normalized using the softmax function, so that its K elements sum to 1.
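A minimal sketch of such an evaluation network follows. The single hidden layer and its width are assumptions of this sketch; the essential point is the softmax output yielding the K sub-distribution probabilities γ̂.

```python
import torch
import torch.nn as nn

class EvaluationNetwork(nn.Module):
    """Maps an evaluation vector z to the K-dimensional sub-distribution
    probability vector gamma_hat (softmax-normalized, elements sum to 1)."""
    def __init__(self, z_dim: int = 12, hidden: int = 10, num_components: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, num_components),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(z), dim=-1)  # gamma_hat, shape (N, K)
```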
It is understood that the above first business sample is any one of the N samples contained in the training set. For each sample i of the N samples, its evaluation vector z_i and sub-distribution probability vector γ̂_i can be obtained through steps 21-23 above.
Then, in step 24, the mixed Gaussian distribution may be reconstructed according to the evaluation vectors and sub-distribution probabilities corresponding to the N business samples of the training set, so as to obtain the first probability of the first business sample under the mixed Gaussian distribution.
In one embodiment, for any k-th of the K sub-Gaussian distributions, its occurrence probability among the K sub-Gaussian distributions, its mean, and its covariance may first be determined based on the evaluation vector z_i and the sub-distribution probability vector γ̂_i of each business sample i.
Specifically, in one example, the occurrence probability φ_k of the k-th sub-Gaussian distribution among the K sub-Gaussian distributions can be determined by the following equation (3):

φ_k = (1/N) Σ_{i=1}^{N} γ̂_{ik}    (3)

where γ̂_{ik} denotes the probability of sample i under the k-th sub-Gaussian distribution, i.e., the k-th element of the sub-distribution probability vector γ̂_i corresponding to sample i. The occurrence probability φ_k is thus obtained by summing the probabilities of the N samples under the k-th sub-Gaussian distribution and normalizing by N.

From the definitions of the mean and covariance of a Gaussian distribution, the mean μ_k of the k-th sub-Gaussian distribution can be determined by the following equation (4), and its covariance Σ_k by the following equation (5):

μ_k = Σ_{i=1}^{N} γ̂_{ik} z_i / Σ_{i=1}^{N} γ̂_{ik}    (4)

Σ_k = Σ_{i=1}^{N} γ̂_{ik} (z_i − μ_k)(z_i − μ_k)^T / Σ_{i=1}^{N} γ̂_{ik}    (5)

In equations (4) and (5), γ̂_{ik} denotes the probability of sample i of the N samples under the k-th sub-Gaussian distribution, and z_i is the evaluation vector of sample i.

Thus, based on the evaluation vectors and sub-distribution probabilities of the N samples in the training set, the occurrence probability, mean, and covariance of each sub-Gaussian distribution are obtained. Each sub-Gaussian distribution is reconstructed from its mean and covariance, and the mixed Gaussian distribution is reconstructed by combining the sub-Gaussian distributions with their occurrence probabilities as weights, giving the total distribution.

Based on the reconstructed mixed Gaussian distribution, the first probability P of the first business sample under the mixed Gaussian distribution can be obtained:

P(z) = Σ_{k=1}^{K} φ_k · N(z; μ_k, Σ_k)    (6)

where N(z; μ_k, Σ_k) denotes the Gaussian probability density with mean μ_k and covariance Σ_k evaluated at z. That is, the first probability P is obtained by substituting the evaluation vector z of the first business sample into the mixed Gaussian distribution.
Next, in step 25, the prediction loss L corresponding to the training set is determined based on how well, for each sample in the training set, the third feature vector output by the decoder restores the first feature vector, and on the first probability obtained above for each sample. The prediction loss L is negatively correlated with the first probability P corresponding to each business sample, and negatively correlated with the similarity between the first feature vector and the third feature vector corresponding to each business sample.
Specifically, in one embodiment, a first loss term L1 may be determined based on the first probabilities of the respective samples, the first loss term L1 being inversely related to the first probability of each sample. For example, the probability loss corresponding to the arbitrary first business sample may be set to E(z) (also called the sample energy), which is negatively related to the first probability P of the sample. In one example:

E(z) = −log P(z)    (7)

The first loss term L1 may then be the sum or mean of the probability losses of the N samples, for example:

L1 = (1/N) Σ_{i=1}^{N} E(z_i)    (8)
it should be understood that, the gaussian mixture is reconstructed based on the sub-distribution probability of each sample in each sub-gaussian distribution, and then the probability of each sample in the reconstructed gaussian mixture is obtained, so that the whole first probability of the N samples may reflect the fitting condition of the gaussian mixture to the N sample distributions, and the first loss term L1 actually corresponds to the fitting loss of the whole N samples fitting the gaussian mixture.
On the other hand, a second loss term L2 may be determined according to the similarity between the first feature vector and the third feature vector corresponding to each business sample, the second loss term L2 being negatively correlated with that similarity. For example, the vector reconstruction loss corresponding to the arbitrary first business sample may be set to L_r(x, x'), which is negatively related to the similarity between x and x': the more similar x and x', the smaller L_r. The similarity between two vectors can be computed and measured in a number of ways, such as cosine similarity or Euclidean distance. The second loss term L2 may then be the sum or mean of the vector reconstruction losses of the N samples, for example:

L2 = (1/N) Σ_{i=1}^{N} L_r(x_i, x'_i)    (9)
then, the first loss term L1 and the second loss term L2 are weighted and summed according to a preset weighting factor to obtain the total predicted loss L of the training set. In one example, the predicted loss L can be written as:
Figure BDA0002469298080000122
wherein λ is1As a weighting factor, a hyperparameter may be used.
In another embodiment, the prediction loss L may also be set to:

L = L2 + λ_1 · L1 + λ_2 · P(Σ̂)    (11)

where λ_1 and λ_2 are weight factors, and the last term P(Σ̂) is a penalty on the covariance matrices Σ̂_k of the sub-Gaussian distributions (for example, a sum of the reciprocals of their diagonal elements), used to prevent the matrices from becoming singular (irreversible).
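A hedged sketch of the total loss of equations (7)-(11), reusing the helpers above; the squared-error form of L_r and the reciprocal-diagonal covariance penalty are assumptions of this sketch.

```python
import torch

def prediction_loss(x, x_rec, z, gamma, lambda1: float = 0.1, lambda2: float = 0.005):
    """Equations (7)-(11), reusing gmm_parameters / first_probability above.
    Squared error stands in for L_r, and the covariance penalty P(Sigma) is a
    sum of reciprocal diagonal entries -- both assumptions of this sketch."""
    phi, mu, sigma = gmm_parameters(z, gamma)
    energy = -torch.log(first_probability(z, phi, mu, sigma) + 1e-12)  # (7) E(z)
    l1 = energy.mean()                              # (8) fitting loss term
    l2 = ((x - x_rec) ** 2).sum(dim=-1).mean()      # (9) reconstruction loss term
    p_sigma = (1.0 / torch.diagonal(sigma, dim1=-2, dim2=-1)).sum()  # P(Sigma)
    return l2 + lambda1 * l1 + lambda2 * p_sigma    # (10)/(11)
```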
Thus, in the above manner, a prediction loss for the training set is obtained. Next, based on the predicted loss, a gradient of model parameters that reduces the loss may be determined for updating and tuning the model parameters.
In the embodiments of this specification, in step 26, noise is added in a differential-privacy manner to the original gradient derived from the above prediction loss, and the model parameters of the anomaly detection model are adjusted using the noise-containing gradient.
Differential privacy is a technique from cryptography that aims to maximize the accuracy of queries on a statistical database while minimizing the chance of identifying individual records. A random algorithm M provides ε-differential privacy protection if, for any two neighboring data sets D and D' (differing in a single record) and any set S_M of possible outputs, Pr[M(D) ∈ S_M] ≤ e^ε × Pr[M(D') ∈ S_M], where the parameter ε is called the privacy protection budget, balancing the degree of privacy protection against accuracy, and is generally preset. The closer ε is to 0, the closer e^ε is to 1, the closer the processing results of the random algorithm on the two neighboring data sets D and D', and hence the stronger the privacy protection.
Implementations of differential privacy include noise mechanisms, exponential mechanisms, and the like. To introduce differential privacy into the model, embodiments of this specification use a noise mechanism, achieving differential privacy by adding noise to the parameter gradient. Depending on the noise scheme, the noise may be Laplacian noise, Gaussian noise, or the like. According to one embodiment, in this step 26, differential privacy is achieved by adding Gaussian noise to the gradient. The specific process may include the following steps.
First, an original gradient that reduces the prediction loss is determined from the prediction loss L; then, the original gradient is clipped based on a preset clipping threshold to obtain a clipped gradient; next, Gaussian noise for realizing differential privacy is determined using a Gaussian distribution based on the clipping threshold, where the variance of the Gaussian distribution is positively correlated with the square of the clipping threshold; finally, the Gaussian noise thus obtained is superimposed on the clipped gradient to obtain the noise-containing gradient.
More specifically, as an example, assume that for the above training set the resulting original gradient is:

g_t(X) = ∇_{θ_t} L(θ_t, X)    (12)

where t denotes the current, t-th, round of iterative training, X denotes the training set (batch) used in the current round, g_t(X) denotes the loss gradient obtained for that batch, θ_t denotes the model parameters at the start of the t-th round, and L(θ_t, X) denotes the aforementioned prediction loss.
As described above, noise for implementing differential privacy may be added to the original gradient by means such as Laplacian noise or Gaussian noise. In an embodiment taking Gaussian noise as an example, the original gradient may be clipped based on a preset clipping threshold to obtain a clipped gradient; Gaussian noise for implementing differential privacy is determined based on the clipping threshold and a predetermined noise scaling coefficient (a preset hyperparameter); and the clipped gradient and the Gaussian noise are then fused (e.g., summed) to obtain the noise-containing gradient. It can be understood that this clips the original gradient on the one hand and superimposes noise on the clipped gradient on the other, thereby applying to the gradient a differential-privacy treatment based on Gaussian noise.
For example, the original gradient is clipped as:

ḡ_t(X) = g_t(X) / max(1, ‖g_t(X)‖_2 / C)    (13)

where ḡ_t(X) denotes the clipped gradient, C denotes the clipping threshold, and ‖g_t(X)‖_2 denotes the second-order norm of g_t(X). That is, when the gradient norm is less than or equal to the clipping threshold C, the original gradient is retained; when it is greater than C, the original gradient is scaled down proportionally so that its norm equals C.
Gaussian noise is then added to the clipped gradient to obtain the noise-containing gradient, for example:

g̃_t(X) = ḡ_t(X) + 1_t · N(0, σ²C²I)    (14)

where g̃_t(X) denotes the noise-containing gradient; N(0, σ²C²I) denotes Gaussian noise whose probability density follows a Gaussian distribution with mean 0 and variance σ²C²; σ denotes the noise scaling coefficient, a preset hyperparameter that can be set as required; C is the clipping threshold; and 1_t is an indicator function that may take the value 0 or 1: for example, it may be set to 1 in even-numbered training rounds and 0 in odd-numbered rounds.
The noise-containing gradient can then be used to adjust the model parameters, with the goal of minimizing the aforementioned prediction loss L:

θ_{t+1} = θ_t − η_t · g̃_t(X)    (15)

where η_t is the learning step size (learning rate), a preset hyperparameter such as 0.5 or 0.3, and θ_{t+1} denotes the adjusted model parameters obtained in the t-th training round. Since the Gaussian noise added to the gradient satisfies differential privacy, the adjustment of the model parameters satisfies differential privacy.
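Equations (12)-(15) amount to a DP-SGD-style update. Below is a hedged per-batch sketch with the indicator function fixed to 1; practical DP-SGD implementations usually clip per-example gradients rather than the batch gradient, so this illustrates the formulas exactly as written.

```python
import torch

def dp_sgd_step(model, loss, clip_c: float = 1.0, sigma: float = 1.0, lr: float = 0.3):
    """One update following equations (12)-(15): compute the gradient, clip its
    norm to C, superimpose N(0, sigma^2 C^2) noise, then descend."""
    model.zero_grad()
    loss.backward()                                       # (12) original gradient g_t
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    scale = 1.0 / max(1.0, (total_norm / clip_c).item())  # (13) clipping factor
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            noisy = p.grad * scale + sigma * clip_c * torch.randn_like(p.grad)  # (14)
            p -= lr * noisy                               # (15) parameter update
```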
The above describes an implementation of adding noise to the gradient and updating the model parameters according to the gradient containing the noise.
On the other hand, as shown in Fig. 1, the anomaly detection model in the present solution includes a self-encoding network and an evaluation network; accordingly, the model parameters can be divided into self-encoding network parameters and evaluation network parameters, each updated with its corresponding gradient. Generally, in models implemented as multi-layer neural networks, gradients are determined layer by layer through back propagation. Therefore, in the anomaly detection model of Fig. 1, after the prediction loss is obtained from the model output, back propagation first determines the first original gradient corresponding to the evaluation network, and then continues to determine the second original gradient corresponding to the self-encoding network. When noise is added to the gradients based on differential privacy, it may be added to both original gradients, or only to the second original gradient.
Specifically, in one embodiment, after the first original gradient corresponding to the evaluation network and the second original gradient corresponding to the self-encoding network are determined, noise is added in a differential-privacy manner to each of them, yielding a first noise gradient and a second noise gradient. The parameters of the evaluation network are then adjusted using the first noise gradient, and the parameters of the self-encoding network using the second noise gradient. In this way, differential privacy is introduced throughout the anomaly detection model.
In another embodiment, after the first original gradient corresponding to the evaluation network and the second original gradient corresponding to the self-encoding network are determined, noise is added only to the second original gradient in a differential-privacy manner to obtain a second noise gradient. The parameters of the evaluation network are then adjusted using the first original gradient, and the parameters of the self-encoding network using the second noise gradient. The core of adjusting the parameters of the self-encoding network is adjusting the parameters of the encoder, since the parameters of the decoder are associated with those of the encoder. In this way, differential privacy is introduced in the encoder.
It is to be understood that, in the forward processing of business samples, the encoder is located furthest upstream in the entire network model. Introducing differential privacy into the encoder therefore gives all subsequent processing the differential-privacy property, which can make the entire anomaly detection model differentially private.
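One hedged way to realize this second variant (noise only on the self-encoding network's gradient) is to restrict the clip-and-noise step to that parameter group, as sketched below; the split into two parameter groups is an assumption of this sketch.

```python
import torch

def dp_step_encoder_only(autoencoder, estimator, loss,
                         clip_c: float = 1.0, sigma: float = 1.0, lr: float = 0.3):
    """Back-propagate once; update the evaluation network with its original
    (first) gradient, and noise only the self-encoding network's (second) gradient."""
    autoencoder.zero_grad()
    estimator.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in estimator.parameters():          # first original gradient, no noise
            if p.grad is not None:
                p -= lr * p.grad
        grads = [p.grad for p in autoencoder.parameters() if p.grad is not None]
        norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
        scale = 1.0 / max(1.0, (norm / clip_c).item())
        for p in autoencoder.parameters():        # second noise gradient
            if p.grad is not None:
                p -= lr * (p.grad * scale + sigma * clip_c * torch.randn_like(p.grad))
```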
Thus, differential privacy is introduced into the anomaly detection model through differentially private gradient descent. The anomaly detection model obtained in this way has at least two advantages. First, because differential privacy is introduced, the information of the training samples is difficult to reverse-engineer or identify from the published model, which provides privacy protection. Second, the objective of training an unsupervised anomaly detection model is to fit the distribution of the training samples. Conventional training often overfits some samples; in particular, a training set sometimes contains noise samples, and a model that overfits them loses predictive performance. With differential privacy, noise is added to the gradient, so the model can resist the influence of noise samples and avoid overfitting, improving the robustness and predictive performance of the anomaly detection model.
With the differential-privacy-based anomaly detection model obtained by the above training method, anomaly detection can be performed on a target sample to be tested. Fig. 3 illustrates a flow diagram of a method for anomaly detection of business samples in one embodiment. As before, the method may be performed by any apparatus, device, platform, or device cluster having computing and processing capabilities.
As shown in Fig. 3, in step 31, the differential-privacy-based anomaly detection model trained in the above manner is first obtained. As shown in Fig. 1, the anomaly detection model includes a self-encoding network and an evaluation network, where the self-encoding network includes an encoder and a decoder. Through the training process, the evaluation network has constructed a mixed Gaussian model that fits the distribution of the business samples well. The anomaly detection model is a model into which differential privacy has been introduced; more particularly, at least its encoder has the differential-privacy property.
In step 32, the first target vector x_t corresponding to the target business sample to be tested is input into the self-encoding network, and the encoder outputs a second target vector that reduces the dimension of the first target vector. This process is similar to step 21 of Fig. 2 and is not repeated.
Then, in step 33, a target evaluation vector z_t is constructed based on the second target vector, in a manner corresponding to the training phase. In one case, the second target vector is used directly as the target evaluation vector. In another case, a third target vector x'_t output by the decoder is obtained; a reconstruction error vector is computed based on the first target vector x_t and the third target vector x'_t; and the second target vector and the reconstruction error vector are then combined into the target evaluation vector z_t.
Next, in step 34, the target evaluation vector z_t is input into the mixed Gaussian distribution constructed by the evaluation network to obtain the target probability of the target business sample under the mixed Gaussian distribution. Specifically, the target evaluation vector z_t can be substituted directly into equation (6), where the parameters of the mixed Gaussian distribution are those determined by the evaluation network through the training process.
Then, in step 35, whether the target business sample is an abnormal sample is determined according to the target probability. Specifically, the target probability may be compared with a preset probability threshold; when the target probability is below the threshold, the current target business sample is regarded as abnormal.
In another example, the target probability may further be substituted into equation (7) (equivalently, the target evaluation vector is substituted directly into equation (7)) to obtain the probability loss E(z_t) of the business sample. When the probability loss exceeds a certain threshold, the current target business sample is considered an abnormal sample. In this way, anomaly detection of the business sample is realized.
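A hedged end-to-end inference sketch combining the pieces defined earlier; the energy threshold and the use of sample energy rather than the raw probability are illustrative choices.

```python
import torch

def detect_anomaly(autoencoder, x_t, phi, mu, sigma, energy_threshold: float = 10.0):
    """Steps 31-35: encode the target sample, build its evaluation vector,
    and score it under the trained mixture (eqs. (6)-(7))."""
    with torch.no_grad():
        z_c, x_rec = autoencoder(x_t)                    # steps 31-32
        z_t = build_evaluation_vector(x_t, x_rec, z_c)   # step 33
        p_t = first_probability(z_t, phi, mu, sigma)     # step 34, target probability
        energy = -torch.log(p_t + 1e-12)                 # probability loss E(z_t)
    return energy > energy_threshold                     # step 35: True = abnormal
```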
According to another aspect of the embodiments, there is also provided a training apparatus for an anomaly detection model based on differential privacy, which may be deployed in any apparatus, device, platform, or device cluster having computing and processing capabilities. FIG. 4 shows a schematic block diagram of a training apparatus of an anomaly detection model according to one embodiment. As shown in fig. 4, the training apparatus 400 includes:
a first input unit 41, configured to input a first feature vector corresponding to any service sample in a training set into the self-coding network, output a second feature vector for dimensionality reduction of the first feature vector through the encoder, and output a third feature vector for restoration of the first feature vector based on the second feature vector through the decoder;
a second input unit 42 configured to construct an evaluation vector based on the second feature vector, and input the evaluation vector into the evaluation network;
a sub-distribution obtaining unit 43, configured to obtain the sub-distribution probabilities, output by the evaluation network, that the arbitrary service sample belongs to each of K sub-Gaussian distributions in a mixed Gaussian distribution;
a probability determining unit 44, configured to obtain a first probability of the arbitrary service sample under the mixed Gaussian distribution according to the evaluation vector and the sub-distribution probability corresponding to each service sample in the training set;
a loss determining unit 45 configured to determine a prediction loss corresponding to the training set, where the prediction loss is negatively correlated with the first probability corresponding to each business sample, and is negatively correlated with a similarity between the first feature vector and the third feature vector corresponding to each business sample;
a parameter adjusting unit 46 configured to add noise to the original gradient obtained based on the prediction loss by using a differential privacy method, and adjust the model parameters of the abnormality detection model by using a gradient including the noise.
In one embodiment, the second input unit 42 is configured to: and taking the second feature vector as the evaluation vector.
In another embodiment, the second input unit 42 is configured to: obtaining a reconstruction error vector based on the first feature vector and the third feature vector; combining the second feature vector and the reconstructed error vector as the evaluation vector.
According to one embodiment, the probability determination unit 44 is configured to: determining the mean value and covariance of each sub-Gaussian distribution in the K sub-Gaussian distributions and the occurrence probability of the sub-Gaussian distribution in the K sub-Gaussian distributions according to the evaluation vector and the sub-distribution probability of each service sample; reconstructing the mixed Gaussian distribution according to the mean value, covariance and occurrence probability of each sub-Gaussian distribution; and substituting the evaluation vector of any service sample into the reconstructed mixed Gaussian distribution to obtain the first probability.
In one embodiment, the loss determining unit 45 is configured to: determine a first loss term according to the first probability corresponding to each service sample, the first loss term being inversely related to the first probability of each service sample; determine a second loss term according to the similarity between the first feature vector and the third feature vector corresponding to each service sample, the second loss term being inversely related to the similarity; and perform weighted summation of the first loss term and the second loss term according to a preset weight factor to obtain the prediction loss.
According to an embodiment, the parameter adjustment unit 46 is configured to: determining an original gradient that reduces the prediction loss according to the prediction loss; based on a preset clipping threshold value, clipping the original gradient to obtain a clipping gradient; determining Gaussian noise for realizing differential privacy by utilizing a Gaussian distribution determined based on the clipping threshold, wherein the variance of the Gaussian distribution is positively correlated with the square of the clipping threshold; and superposing the Gaussian noise and the cutting gradient to obtain the gradient containing the noise.
In one embodiment, the parameter adjustment unit 46 may be configured to:
determining a first original gradient corresponding to the evaluation network and a second original gradient corresponding to the self-encoding network respectively through gradient back propagation; respectively adding noise in the first original gradient and the second original gradient by using a differential privacy mode to obtain a first noise gradient and a second noise gradient;
adjusting a parameter of the evaluation network using the first noise gradient; and adjusting the parameters of the self-coding network by using the second noise gradient.
In another embodiment, the parameter adjustment unit 46 may be configured to:
determining a first original gradient corresponding to the evaluation network and a second original gradient corresponding to the self-encoding network respectively through gradient back propagation; adding noise in the second original gradient by using a differential privacy mode to obtain a second noise gradient;
adjusting parameters of the evaluation network by using the first original gradient; and adjusting the parameters of the self-coding network by using the second noise gradient.
In various embodiments, the service samples may include one of: a sample user, a sample merchant, a sample event.
It should be noted that the apparatus 400 shown in fig. 4 is an apparatus embodiment corresponding to the method embodiment shown in fig. 2, and the corresponding description in the method embodiment shown in fig. 2 is also applicable to the apparatus 400, and is not repeated herein.
According to another aspect, an apparatus for predicting an abnormal sample is also provided, which may be deployed in any apparatus, device, platform, or device cluster having computing and processing capabilities. FIG. 5 shows a schematic block diagram of an apparatus to predict an abnormal sample according to one embodiment. As shown in fig. 5, the prediction apparatus 500 includes:
a model obtaining unit 51 configured to obtain an anomaly detection model based on differential privacy trained according to the apparatus of fig. 4, where the anomaly detection model includes a self-coding network and an evaluation network, and the self-coding network includes an encoder and a decoder;
an input unit 52 configured to input a first target vector corresponding to a target service sample to be detected into the self-coding network, and output, through the encoder, a second target vector obtained by reducing the dimensionality of the first target vector;
a vector construction unit 53 configured to construct a target evaluation vector based on the second target vector;
a probability determining unit 54 configured to input the target evaluation vector into the Gaussian mixture distribution constructed by the evaluation network, so as to obtain a target probability of the target service sample under the Gaussian mixture distribution;
and an abnormality determination unit 55 configured to determine, according to the target probability, whether the target service sample is an abnormal sample.
In an embodiment, the vector construction unit 53 is specifically configured to: acquiring a third target vector output by the decoder; obtaining a reconstruction error vector based on the first target vector and the third target vector; and combining the second target vector and the reconstruction error vector to form the target evaluation vector.
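For illustration only, the full prediction path of units 51 to 55 may be sketched as follows. The relative-Euclidean-distance and cosine-similarity reconstruction features, the gmm_params tuple of mixture parameters fixed after training, and the probability threshold are assumptions of this sketch; the embodiment only requires some reconstruction error vector and the trained Gaussian mixture distribution.

    import numpy as np

    def is_abnormal(x, encoder, decoder, gmm_params, threshold, eps=1e-12):
        # Score one target service sample against the trained mixture.
        z_c = encoder(x)                                # second target vector
        x_rec = decoder(z_c)                            # third target vector
        # Reconstruction error vector (assumed features).
        rel_dist = np.linalg.norm(x - x_rec) / (np.linalg.norm(x) + eps)
        cos_sim = x @ x_rec / (np.linalg.norm(x) * np.linalg.norm(x_rec) + eps)
        z = np.concatenate([z_c, [rel_dist, cos_sim]])  # target evaluation vector
        phi, mu, cov = gmm_params                       # fixed after training
        log_terms = []
        for k in range(len(phi)):
            dk = z - mu[k]
            inv = np.linalg.inv(cov[k])
            _, logdet = np.linalg.slogdet(cov[k])
            log_terms.append(np.log(phi[k]) - 0.5 * (dk @ inv @ dk + logdet
                                                     + z.size * np.log(2 * np.pi)))
        target_prob = np.exp(np.logaddexp.reduce(np.array(log_terms)))
        return target_prob < threshold                  # low probability => abnormal sample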
According to an embodiment of a further aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of this specification may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above embodiments further describe the objectives, technical solutions and advantages of this specification in detail. It should be understood that they are merely specific embodiments of the technical idea of this specification and are not intended to limit its scope of protection; any modification, equivalent replacement or improvement made on the basis of the technical solutions of the embodiments of this specification shall fall within that scope of protection.

Claims (23)

1. A method for training an anomaly detection model based on differential privacy, wherein the anomaly detection model comprises a self-coding network and an evaluation network, and the self-coding network comprises an encoder and a decoder; the method comprises the following steps:
inputting a first feature vector corresponding to any service sample in a training set into the self-coding network, outputting, through the encoder, a second feature vector obtained by reducing the dimensionality of the first feature vector, and outputting, through the decoder, a third feature vector that restores the first feature vector based on the second feature vector;
constructing an evaluation vector based on the second feature vector, and inputting the evaluation vector into the evaluation network;
acquiring, from the evaluation network, sub-distribution probabilities that the service sample belongs to each of K sub-Gaussian distributions, wherein the K sub-Gaussian distributions constitute a Gaussian mixture distribution;
obtaining a first probability of the service sample under the Gaussian mixture distribution according to the evaluation vectors and sub-distribution probabilities corresponding to the respective service samples in the training set;
determining a prediction loss corresponding to the training set, wherein the prediction loss is negatively correlated with the first probability corresponding to each service sample and negatively correlated with the similarity between the first feature vector and the third feature vector corresponding to each service sample;
and adding noise, in a differential privacy manner, to an original gradient obtained based on the prediction loss, and adjusting model parameters of the anomaly detection model using the noise-containing gradient.
2. The method of claim 1, wherein constructing an evaluation vector based on the second feature vector comprises: taking the second feature vector as the evaluation vector.
3. The method of claim 1, wherein constructing an evaluation vector based on the second feature vector comprises:
obtaining a reconstruction error vector based on the first feature vector and the third feature vector;
combining the second feature vector and the reconstruction error vector to form the evaluation vector.
4. The method of claim 1, wherein obtaining a first probability of the service sample under the Gaussian mixture distribution according to the evaluation vector and the sub-distribution probabilities corresponding to each service sample in the training set comprises:
determining, according to the evaluation vector and sub-distribution probabilities of each service sample, the mean and covariance of each of the K sub-Gaussian distributions and its occurrence probability among the K sub-Gaussian distributions;
reconstructing the Gaussian mixture distribution according to the mean, covariance, and occurrence probability of each sub-Gaussian distribution;
and substituting the evaluation vector of the service sample into the reconstructed Gaussian mixture distribution to obtain the first probability.
5. The method of claim 1, wherein determining the prediction loss corresponding to the training set comprises:
determining a first loss term according to the first probability corresponding to each service sample, wherein the first loss term is negatively correlated with the first probability of each service sample;
determining a second loss term according to the similarity between the first feature vector and the third feature vector corresponding to each service sample, wherein the second loss term is negatively correlated with the similarity;
and performing a weighted summation of the first loss term and the second loss term according to a preset weight factor to obtain the prediction loss.
6. The method of claim 1, wherein adding noise, in a differential privacy manner, to the original gradient obtained based on the prediction loss comprises:
determining, according to the prediction loss, an original gradient that reduces the prediction loss;
clipping the original gradient based on a preset clipping threshold to obtain a clipped gradient;
determining Gaussian noise for realizing differential privacy using a Gaussian distribution determined based on the clipping threshold, wherein the variance of the Gaussian distribution is positively correlated with the square of the clipping threshold;
and superposing the Gaussian noise on the clipped gradient to obtain the noise-containing gradient.
7. The method of claim 1, wherein adding noise, in a differential privacy manner, to the original gradient obtained based on the prediction loss comprises: determining, through gradient back propagation, a first original gradient corresponding to the evaluation network and a second original gradient corresponding to the self-coding network, respectively; and adding noise to the first original gradient and the second original gradient, respectively, in a differential privacy manner to obtain a first noise gradient and a second noise gradient;
wherein adjusting the model parameters of the anomaly detection model using the noise-containing gradient comprises:
adjusting parameters of the evaluation network using the first noise gradient; and adjusting parameters of the self-coding network using the second noise gradient.
8. The method of claim 1, wherein adding noise, in a differential privacy manner, to the original gradient obtained based on the prediction loss comprises: determining, through gradient back propagation, a first original gradient corresponding to the evaluation network and a second original gradient corresponding to the self-coding network, respectively; and adding noise to the second original gradient in a differential privacy manner to obtain a second noise gradient;
wherein adjusting the model parameters of the anomaly detection model using the noise-containing gradient comprises:
adjusting parameters of the evaluation network using the first original gradient; and adjusting parameters of the self-coding network using the second noise gradient.
9. The method of claim 1, wherein the service sample comprises one of: a sample user, a sample merchant, and a sample event.
10. A method of predicting an abnormal sample, comprising:
acquiring a differential-privacy-based anomaly detection model trained according to the method of claim 1, wherein the anomaly detection model comprises a self-coding network and an evaluation network, and the self-coding network comprises an encoder and a decoder;
inputting a first target vector corresponding to a target service sample to be detected into the self-coding network, and outputting, through the encoder, a second target vector obtained by reducing the dimensionality of the first target vector;
constructing a target evaluation vector based on the second target vector;
inputting the target evaluation vector into a Gaussian mixture distribution constructed by the evaluation network to obtain a target probability of the target service sample in the Gaussian mixture distribution;
and determining whether the target service sample is an abnormal sample or not according to the target probability.
11. The method of claim 10, wherein constructing a target evaluation vector based on the second target vector comprises:
acquiring a third target vector output by the decoder;
obtaining a reconstruction error vector based on the first target vector and the third target vector;
combining the second target vector and the reconstruction error vector to form the target evaluation vector.
12. An apparatus for training an anomaly detection model based on differential privacy, wherein the anomaly detection model comprises a self-coding network and an evaluation network, and the self-coding network comprises an encoder and a decoder; the apparatus comprises:
a first input unit configured to input a first feature vector corresponding to any service sample in a training set into the self-coding network, output, through the encoder, a second feature vector obtained by reducing the dimensionality of the first feature vector, and output, through the decoder, a third feature vector that restores the first feature vector based on the second feature vector;
a second input unit configured to construct an evaluation vector based on the second feature vector, and input the evaluation vector into the evaluation network;
a sub-distribution obtaining unit configured to acquire, from the evaluation network, sub-distribution probabilities that the service sample belongs to each of K sub-Gaussian distributions of a Gaussian mixture distribution;
a probability determining unit configured to obtain a first probability of the service sample under the Gaussian mixture distribution according to the evaluation vectors and sub-distribution probabilities corresponding to the service samples in the training set;
a loss determining unit configured to determine a prediction loss corresponding to the training set, wherein the prediction loss is negatively correlated with the first probability corresponding to each service sample and negatively correlated with the similarity between the first feature vector and the third feature vector corresponding to each service sample;
and a parameter adjusting unit configured to add noise, in a differential privacy manner, to an original gradient obtained based on the prediction loss, and adjust model parameters of the anomaly detection model using the noise-containing gradient.
13. The apparatus of claim 12, wherein the second input unit is configured to take the second feature vector as the evaluation vector.
14. The apparatus of claim 12, wherein the second input unit is configured to:
obtaining a reconstruction error vector based on the first feature vector and the third feature vector;
combining the second feature vector and the reconstruction error vector to form the evaluation vector.
15. The apparatus of claim 12, wherein the probability determining unit is configured to:
determining, according to the evaluation vector and sub-distribution probabilities of each service sample, the mean and covariance of each of the K sub-Gaussian distributions and its occurrence probability among the K sub-Gaussian distributions;
reconstructing the Gaussian mixture distribution according to the mean, covariance, and occurrence probability of each sub-Gaussian distribution;
and substituting the evaluation vector of the service sample into the reconstructed Gaussian mixture distribution to obtain the first probability.
16. The apparatus of claim 12, wherein the loss determination unit is configured to:
determining a first loss term according to the first probability corresponding to each service sample, wherein the first loss term is negatively correlated with the first probability of each service sample;
determining a second loss term according to the similarity between the first feature vector and the third feature vector corresponding to each service sample, wherein the second loss term is negatively correlated with the similarity;
and performing a weighted summation of the first loss term and the second loss term according to a preset weight factor to obtain the prediction loss.
17. The apparatus of claim 12, wherein the parameter adjustment unit is configured to:
determining, according to the prediction loss, an original gradient that reduces the prediction loss;
clipping the original gradient based on a preset clipping threshold to obtain a clipped gradient;
determining Gaussian noise for realizing differential privacy using a Gaussian distribution determined based on the clipping threshold, wherein the variance of the Gaussian distribution is positively correlated with the square of the clipping threshold;
and superposing the Gaussian noise on the clipped gradient to obtain the noise-containing gradient.
18. The apparatus of claim 12, wherein the parameter adjustment unit is configured to:
determining, through gradient back propagation, a first original gradient corresponding to the evaluation network and a second original gradient corresponding to the self-coding network, respectively; adding noise to the first original gradient and the second original gradient, respectively, in a differential privacy manner to obtain a first noise gradient and a second noise gradient;
adjusting parameters of the evaluation network using the first noise gradient; and adjusting parameters of the self-coding network using the second noise gradient.
19. The apparatus of claim 12, wherein the parameter adjustment unit is configured to:
determining, through gradient back propagation, a first original gradient corresponding to the evaluation network and a second original gradient corresponding to the self-coding network, respectively; adding noise to the second original gradient in a differential privacy manner to obtain a second noise gradient;
adjusting parameters of the evaluation network using the first original gradient; and adjusting parameters of the self-coding network using the second noise gradient.
20. The apparatus of claim 12, wherein the service sample comprises one of: a sample user, a sample merchant, and a sample event.
21. An apparatus for predicting an abnormal sample, comprising:
a model obtaining unit configured to acquire a differential-privacy-based anomaly detection model trained by the apparatus of claim 12, the anomaly detection model including a self-coding network and an evaluation network, the self-coding network including an encoder and a decoder;
an input unit configured to input a first target vector corresponding to a target service sample to be detected into the self-coding network, and output, through the encoder, a second target vector obtained by reducing the dimensionality of the first target vector;
a vector construction unit configured to construct a target evaluation vector based on the second target vector;
a probability determining unit configured to input the target evaluation vector into the Gaussian mixture distribution constructed by the evaluation network to obtain a target probability of the target service sample under the Gaussian mixture distribution;
and an abnormality determination unit configured to determine, according to the target probability, whether the target service sample is an abnormal sample.
22. The apparatus of claim 21, wherein the vector construction unit is configured to:
acquiring a third target vector output by the decoder;
obtaining a reconstruction error vector based on the first target vector and the third target vector;
combining the second target vector and the reconstruction error vector to form the target evaluation vector.
23. A computing device comprising a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method of any one of claims 1-11.
CN202010343419.2A 2020-04-27 2020-04-27 Training method and device of anomaly detection model based on differential privacy Pending CN111539769A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202010343419.2A CN111539769A (en) 2020-04-27 2020-04-27 Training method and device of anomaly detection model based on differential privacy
TW110110603A TWI764640B (en) 2020-04-27 2021-03-24 Training method and device for anomaly detection model based on differential privacy
PCT/CN2021/089398 WO2021218828A1 (en) 2020-04-27 2021-04-23 Training for differential privacy-based anomaly detection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010343419.2A CN111539769A (en) 2020-04-27 2020-04-27 Training method and device of anomaly detection model based on differential privacy

Publications (1)

Publication Number Publication Date
CN111539769A 2020-08-14

Family

ID=71977322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010343419.2A Pending CN111539769A (en) 2020-04-27 2020-04-27 Training method and device of anomaly detection model based on differential privacy

Country Status (3)

Country Link
CN (1) CN111539769A (en)
TW (1) TWI764640B (en)
WO (1) WO2021218828A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114186583B (en) * 2021-12-02 2022-12-27 国家石油天然气管网集团有限公司 Method and system for recovering abnormal signal of corrosion detection of tank wall of oil storage tank
CN114283306A (en) * 2021-12-23 2022-04-05 福州大学 Industrial control network anomaly detection method and system
TWI781874B (en) * 2022-01-19 2022-10-21 中華電信股份有限公司 Electronic device and method for detecting anomaly of telecommunication network based on autoencoder neural network model
CN115184054B (en) * 2022-05-30 2022-12-27 深圳技术大学 Mechanical equipment semi-supervised fault detection and analysis method, device, terminal and medium
CN114974220A (en) * 2022-06-17 2022-08-30 中国电信股份有限公司 Network model training method, and voice object gender identification method and device
CN115238827B (en) * 2022-09-16 2022-11-25 支付宝(杭州)信息技术有限公司 Privacy-protecting sample detection system training method and device
CN115842812B (en) * 2022-11-21 2024-04-12 浪潮通信信息系统有限公司 User perception evaluation method and system based on PCA and integrated self-encoder
CN115564577B (en) * 2022-12-02 2023-04-07 成都新希望金融信息有限公司 Abnormal user identification method and device, electronic equipment and storage medium
CN117474464B (en) * 2023-09-28 2024-05-07 光谷技术有限公司 Multi-service processing model training method, multi-service processing method and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7779268B2 (en) * 2004-12-07 2010-08-17 Mitsubishi Electric Research Laboratories, Inc. Biometric based user authentication and data encryption
US20190244138A1 (en) * 2018-02-08 2019-08-08 Apple Inc. Privatized machine learning using generative adversarial networks
CN109033854B (en) * 2018-07-17 2020-06-09 阿里巴巴集团控股有限公司 Model-based prediction method and device
CN109886388B (en) * 2019-01-09 2024-03-22 平安科技(深圳)有限公司 Training sample data expansion method and device based on variation self-encoder
CN111046433B (en) * 2019-12-13 2021-03-05 支付宝(杭州)信息技术有限公司 Model training method based on federal learning
CN111539769A (en) * 2020-04-27 2020-08-14 支付宝(杭州)信息技术有限公司 Training method and device of anomaly detection model based on differential privacy

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101792520B1 (en) * 2016-12-30 2017-11-03 한라대학교 산학협력단 Differential privacy method using secret sharing scheme
CN107368752A (en) * 2017-07-25 2017-11-21 北京工商大学 A kind of depth difference method for secret protection based on production confrontation network
CN110334548A (en) * 2019-07-16 2019-10-15 桂林电子科技大学 A kind of data exception detection method based on difference privacy
CN110796497A (en) * 2019-10-31 2020-02-14 支付宝(杭州)信息技术有限公司 Method and device for detecting abnormal operation behaviors

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021218828A1 (en) * 2020-04-27 2021-11-04 支付宝(杭州)信息技术有限公司 Training for differential privacy-based anomaly detection model
CN112434213A (en) * 2020-10-15 2021-03-02 中国科学院深圳先进技术研究院 Network model training method, information pushing method and related device
CN112434213B (en) * 2020-10-15 2023-09-29 中国科学院深圳先进技术研究院 Training method of network model, information pushing method and related devices
CN112101946A (en) * 2020-11-20 2020-12-18 支付宝(杭州)信息技术有限公司 Method and device for jointly training business model
CN112446040A (en) * 2020-11-24 2021-03-05 平安科技(深圳)有限公司 Federal modeling method based on selective gradient update and related equipment
CN112541574A (en) * 2020-12-03 2021-03-23 支付宝(杭州)信息技术有限公司 Privacy-protecting business prediction method and device
CN112541574B (en) * 2020-12-03 2022-05-17 支付宝(杭州)信息技术有限公司 Privacy-protecting business prediction method and device
CN113055930A (en) * 2021-03-09 2021-06-29 Oppo广东移动通信有限公司 Data processing method, communication device, server, and storage medium
CN113762967A (en) * 2021-03-31 2021-12-07 北京沃东天骏信息技术有限公司 Risk information determination method, model training method, device, and program product
CN113052693A (en) * 2021-06-02 2021-06-29 北京轻松筹信息技术有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN113127931B (en) * 2021-06-18 2021-09-03 国网浙江省电力有限公司信息通信分公司 Federal learning differential privacy protection method for adding noise based on Rayleigh divergence
CN113127931A (en) * 2021-06-18 2021-07-16 国网浙江省电力有限公司信息通信分公司 Federal learning differential privacy protection method for adding noise based on Rayleigh divergence
CN113591479A (en) * 2021-07-23 2021-11-02 深圳供电局有限公司 Named entity identification method and device for power metering and computer equipment
CN113779045A (en) * 2021-11-12 2021-12-10 航天宏康智能科技(北京)有限公司 Training method and training device for industrial control protocol data anomaly detection model
CN114297036A (en) * 2022-01-05 2022-04-08 腾讯科技(深圳)有限公司 Data processing method and device, electronic equipment and readable storage medium
CN114297036B (en) * 2022-01-05 2023-06-09 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment and readable storage medium
CN116188834A (en) * 2022-12-08 2023-05-30 赛维森(广州)医疗科技服务有限公司 Full-slice image classification method and device based on self-adaptive training model
CN116188834B (en) * 2022-12-08 2023-10-20 赛维森(广州)医疗科技服务有限公司 Full-slice image classification method and device based on self-adaptive training model
CN116150622A (en) * 2023-02-17 2023-05-23 支付宝(杭州)信息技术有限公司 Model training method and device, storage medium and electronic equipment
CN116150622B (en) * 2023-02-17 2023-08-11 支付宝(杭州)信息技术有限公司 Model training method and device, storage medium and electronic equipment
CN116756656A (en) * 2023-08-11 2023-09-15 北京航空航天大学 Engineering structure anomaly identification method, system, electronic equipment and storage medium

Also Published As

Publication number Publication date
TW202143146A (en) 2021-11-16
WO2021218828A1 (en) 2021-11-04
TWI764640B (en) 2022-05-11

Similar Documents

Publication Publication Date Title
CN111539769A (en) Training method and device of anomaly detection model based on differential privacy
Razavi et al. A practical feature-engineering framework for electricity theft detection in smart grids
Benchaji et al. Enhanced credit card fraud detection based on attention mechanism and LSTM deep model
EP3723008A1 (en) Method for protecting a machine learning model against extraction
Maamar et al. A Hybrid Model for Anomalies Detection in AMI System Combining K-means Clustering and Deep Neural Network.
Afriyie et al. A supervised machine learning algorithm for detecting and predicting fraud in credit card transactions
CN111523668B (en) Training method and device of data generation system based on differential privacy
Roseline et al. Autonomous credit card fraud detection using machine learning approach
Ashofteh et al. A conservative approach for online credit scoring
Singh et al. Energy theft detection for AMI using principal component analysis based reconstructed data
US20060236395A1 (en) System and method for conducting surveillance on a distributed network
Óskarsdóttir et al. Social network analytics for supervised fraud detection in insurance
Giudici et al. Artificial Intelligence risk measurement
Bazán et al. Power and reversal power links for binary regressions: An application for motor insurance policyholders
Madhure et al. Cnn-lstm based electricity theft detector in advanced metering infrastructure
Zhang et al. The optimized anomaly detection models based on an approach of dealing with imbalanced dataset for credit card fraud detection
CN114782161A (en) Method, device, storage medium and electronic device for identifying risky users
CN114548241A (en) Stolen account detection method and device and electronic equipment
El-Toukhy et al. Electricity theft detection using deep reinforcement learning in smart power grids
Ashofteh et al. A non-parametric-based computationally efficient approach for credit scoring
CN116823428A (en) Anti-fraud detection method, device, equipment and storage medium
CN114003960A (en) Training method of neural network model
CN116911882B (en) Insurance fraud prevention prediction method and system based on machine learning
Xiang et al. A bonus-malus framework for cyber risk insurance and optimal cybersecurity provisioning
CN115345727B (en) Method and device for identifying fraudulent loan application

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40035499

Country of ref document: HK