CN114419379A

CN114419379A - System and method for improving fairness of deep learning model based on antagonistic disturbance

Info

Publication number: CN114419379A
Application number: CN202210320949.4A
Authority: CN
Inventors: 王志波; 董小威; 任奎
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2022-03-30
Filing date: 2022-03-30
Publication date: 2022-04-29

Abstract

The invention discloses a system and a method for improving fairness of a deep learning model based on antagonistic disturbance, wherein the system comprises a deployment model, a disturbance generator and a discriminator, the deployment model comprises a feature extractor and a label predictor, and the disturbance generator is connected with the feature extractor. The invention processes the input data of the deployment model without changing the deep learning model. The model fairness is improved based on the adversarial disturbance, a corresponding disturbance generator and a discriminator are designed, the discriminator is used for capturing sensitive attribute information related to the fairness, training optimization of the disturbance generator is guided, the sensitive attribute information of adversarial disturbance hidden data is generated, target task related information is reserved, the sensitive information of input data is prevented from being extracted by the model in the feature extraction process, and therefore prediction fairness is improved.

Description

System and method for improving fairness of deep learning model based on antagonistic disturbance

Technical Field

The invention relates to the field of trusted Artificial Intelligence (AI), in particular to a system and a method for improving fairness of a deep learning model based on adversarial disturbance.

Background

In recent years, deep neural networks have exhibited excellent performance in various fields such as image processing, natural language processing, voice recognition, and the like. Although the popularization of the application of the artificial intelligence technology promotes the change of various fields and brings convenience and improvement to human life, researches find that the existing partial artificial intelligence systems have ethical risks, and the systems contain bias and discrimination to specific groups and even place the vulnerable groups at a more unfavorable position. Therefore, the prejudice of the deep learning model is relieved, the fairness of model decision is improved, and the important premise for ensuring the reliable application of the artificial intelligence system is provided. The deep learning model usually learns from data, if the distribution of data of different groups is not balanced, a false statistical association exists between a target task label and a label of a sensitive attribute, which causes the model to learn the false association, and associates a predicted target task label with the label of the sensitive attribute, thereby generating a bias for a specific group. The existing technology for improving the fairness of the deep learning model essentially needs to modify the deployed model to prevent the model from learning false association so as to eliminate the bias of a specific group, thereby greatly limiting the practical application of the fairness mechanism of the deep learning model.

Disclosure of Invention

Aiming at the defect that the deployed deep learning model needs to be modified in the prior art, the invention provides a system and a method for improving the fairness of the deep learning model based on antagonistic disturbance, and the fairness is improved under the condition that the deep learning model is not changed.

In order to achieve the purpose, the invention is realized by the following technical scheme:

the invention discloses a depth learning model fairness promotion system based on antagonistic disturbance, which comprises a deployment model, a disturbance generator and a discriminator, wherein the deployment model comprises a feature extractor and a label predictor, the disturbance generator is connected with the feature extractor, the feature extractor is respectively connected with the label predictor and the discriminator, an image is input by the feature extractor, the image is subjected to a hidden space representation by the feature extractor, the hidden space representation is output as a prediction result of a target label after being input into the label predictor, and the hidden space representation is output as a prediction result of image sensitive attribute after being input into the discriminator.

As a further improvement, the input of the disturbance generator is an image, the output is antagonistic disturbance, and the disturbance value is added with the input image and then input into the feature extractor.

The invention also discloses a method for improving the fairness of the deep learning model based on the antagonistic disturbance, which comprises the following steps:

1) adding antagonistic disturbance to the image by using a disturbance generator, inputting the disturbed image into a feature extractor of a deployment model, outputting a hidden space representation of the image by the feature extractor, and obtaining a prediction result of a target label after the hidden space representation is input into a label predictor;

2) measuring sensitive attribute information contained in the disturbed image, inputting the hidden space representation into a discriminator to obtain a prediction result of the image sensitive attribute, training the discriminator to predict the sensitive attribute from the hidden space representation, and updating the discriminator;

3) updating the disturbance generator to better generate the antagonistic disturbance, deceiving the discriminator to ensure that the image added with the antagonistic disturbance does not contain information of sensitive attributes in a hidden space representation as much as possible, and simultaneously ensure that a prediction result of the target label predictor is as accurate as possible;

4) and (3) repeating the step 2) and the step 3) until the generator can well cheat the discriminator, the target label predictor has high accuracy, the disturbance generator at the moment is integrated into a deployment model data preprocessing link as a fairness promotion module, and antagonism disturbance is added to the input image to promote fairness.

As a further improvement, it is possible to,the deployment model of the invention is expressed as

Wherein

In order to provide a feature extractor for a computer,

for the target label predictor, the input image is

The sensitive property is

The object label is

。

As a further improvement, in step 1) of the invention, a disturbance generator is used

For images

Adding antagonistic disturbance, the disturbed image is

Disturbance satisfies

Norm limitation

The disturbed image

Input deployment model, feature extractor for deployment model

Implicit spatial representation of an output image

And obtaining the prediction result of the target label after the label predictor is input in the hidden space representation

。

As a further improvement, in the step 2) of the invention, the updating is carried out

So that the discriminator can accurately capture the sensitive attribute from the hidden space representation

Is determined by the information of (a) a,

the loss function of (d) is:

wherein

Representing cross entropy, the hidden space of the perturbed data is represented as

Arbiter for sensitive attribute

Is output as

，

Representing the true sensitive property.

As a further improvement, in the step 3) of the invention, the size is increased

Entropy of prediction of disturbed image

In disturbing the sample

Making a random guess above, the loss of entropy is expressed as:

wherein the content of the first and second substances,

representing entropy, to this point, the generator

The total loss for improving fairness is expressed as

，

Is a smaller value, controlling the weight of the entropy-constrained term.

As a further improvement, the step 3) of the invention is described, except that it is responsible for fairness perception

Besides, information of the target label needs to be kept in the hidden space representation, the performance of the model on target label prediction needs to be kept, and a loss term responsible for the accuracy of the model needs to be:

wherein

The cross-entropy is represented by the cross-entropy,

the output of the target label predictor representing the model,

during the course of updating, by adding

While at the same time reducing

Deceiving the discriminator and keeping the accuracy of target label prediction;

and

is balanced by a parameter

The control is carried out by controlling the temperature of the air conditioner,

the higher the primary task accuracy can be maintained,

the lower the fairness can be improved,

loss function of

Expressed as the total loss function design contains negativesLoss of fairness awareness

And loss of retention accuracy

And the disturbance generator learns to generate the antagonistic disturbance meeting the requirement, and the fairness of the model is improved while the target label prediction accuracy is kept:

as a further improvement, in step 4) of the invention, the disturbance generator

And discriminator

Conducting a mini-max game until the generator can fool the discriminator well and the target label predictor has a high accuracy, at which point the generator will be used

Deployed as a model

Adaptively generating perturbations for the input data.

As a further improvement, in the mini-max game process, the discriminator

Maximizing prediction of sensitive attributes from feature space

Ability of disturbance generator

Then an attempt is made to fool as much as possible

At the same time let

The target label of the sample after disturbance can be predicted, and the process target function can be formalized as follows:

wherein, the parameters to be updated in the objective function are

And

update

Updating by maximizing (max) the above-mentioned objective function

Minimizing (min) the above objective function, a constraint term representation generator of the objective function

For input image

Applying a perturbation to the image

Disturbance satisfies

Norm limitation

The implicit space obtained by the data after disturbance is expressed as

。

The invention has the following beneficial technical effects:

step 1) in the technical scheme of the invention, firstly, antagonistic disturbance is added to an image to improve the fairness of a model; secondly, a disturbance generator is introduced to generate antagonistic disturbance, so that after training of the disturbance generator is completed, the generator can generate the antagonistic disturbance for any image, fairness of a model is improved, and sensitive attributes and target labels of the image do not need to be known.

In step 2) in the technical scheme of the invention, in the process of deceiving the discriminator, in addition to the cross entropy, the entropy is also used, so that the generated antagonism disturbance can be increased

For the entropy of the prediction of the disturbance image, the model is prevented from extracting sensitive attribute information but not extracting information with opposite sensitive attributes, for example, the input image is male, and the model is expected not to extract sex information but rather extracting information with opposite sex after disturbance.

According to the method and the device, the sensitive characteristics of the data are prevented from being extracted by the deployment model by modifying the input image, so that the fairness can be improved under the condition of not changing the model. The invention processes the input data of the deployment model without changing the deep learning model. The method improves the model fairness based on the antagonistic disturbance, and designs a corresponding disturbance generator and a discriminator, wherein the disturbance generator is directly used for generating the antagonistic disturbance, and the discriminator assists the training of the disturbance generator. The method comprises the steps of capturing sensitive attribute information related to fairness by using a discriminator, guiding training optimization of a disturbance generator, generating antagonistic disturbance hiding data sensitive attribute information, and reserving target task related information, so that a model is prevented from extracting sensitive information of input data in a feature extraction process, and accordingly fairness prediction is improved.

Drawings

FIG. 1 is a block diagram of a deep learning model fairness boosting system based on antagonistic perturbations.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by the following embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

FIG. 1 is a frame diagram of a depth learning model fairness promotion system based on antagonistic disturbance, which includes a deployment model, a disturbance generator and a discriminator, wherein the deployment model includes a feature extractor and a label predictor, the disturbance generator is connected with the feature extractor, the feature extractor is respectively connected with the label predictor and the discriminator, an image is input by the feature extractor, the image is represented by a hidden space obtained by the feature extractor, the hidden space represents a prediction result of a target label output after the label predictor is input, and the hidden space represents a prediction result of an image sensitive attribute output after the discriminator is input.

Comprises the following steps:

1) adding antagonistic disturbance to an input image by using a disturbance generator, inputting the disturbed image into a deployment model, outputting a hidden space representation of the image by a feature extractor of the deployment model, and obtaining a prediction result of a target label after the hidden space representation is input into a label predictor;

2) training a sensitive attribute discriminator to predict a sensitive attribute from the hidden space representation, and updating the discriminator to guide the updating of the disturbance generator;

3) updating the disturbance generator to better generate the antagonistic disturbance, deceiving the discriminator to ensure that the image added with the antagonistic disturbance does not contain information of sensitive attributes in a hidden space representation as much as possible, and simultaneously ensure that a prediction result of the label predictor is as accurate as possible;

4) repeating the step 2) and the step 3) until the generator can well cheat the discriminator, the accuracy of the label predictor is high, the generator at the moment is used as a fairness promotion module to be integrated into a deployment model data preprocessing link, and antagonism disturbance is added to the input image to promote fairness;

the deployment model may be represented as

Wherein

A representative feature extractor for extracting a feature of the image,

representing a label predictor, input image is noted

The sensitivity attribute of the image is recorded as

Object tag is marked as

The implicit space output in the feature extraction process is represented as

The final output result of the model is the output of the label predictor

. Disturbance generator

Is inputted as

The output is the antagonistic disturbance, the disturbance value and the input image

The summed values are input to a feature extractor. Distinguishing device

Connected to the output of the feature extractor, from a hidden spatial representation

The output of the medium prediction sensitive attribute is the predicted value of the sensitive attribute

。

The method specifically comprises the following steps:

1) the implicit space output in the feature extraction process is represented as

The final output result of the model is the output of the label predictor

. Training a reactive perturbation generation module using training data, modifying input data using the module, and using the perturbation generator

For data

Adding antagonistic disturbance, the disturbed image is

Disturbance satisfies

Limiting the norm; the disturbed image is processed

Input deployment model, feature extractor for deployment model

Implicit spatial representation of an output image

。

2) Measuring sensitive attribute information contained in disturbed image, training discriminator

From implicit spatial representation

Predicting sensitivity attribute and comparing discriminator

Updating is carried out, and information of the sensitive attribute is better captured so as to guide the updating of the disturbance generator;

hidden spatial representation of post-perturbation data as

Device for discriminating

The output of the predicted post-disturbance image sensitivity attribute is

By updating

Is determined by the information of (a) a,

the loss function of (d) is:

wherein

The cross-entropy is represented by the cross-entropy,

representing the true sensitive property. By minimizing

Loss function of

Continuously updating sensitive attribute discriminator

。

3) To disturbance generator

And updating is carried out, antagonistic disturbance is generated better, a discriminator is deceived, the data added with the antagonistic disturbance does not contain information of sensitive attributes in a hidden space representation as much as possible, and meanwhile, the prediction result of the label predictor is accurate as much as possible.

Disturbance generator

To-be-deceived discriminator

And the method prevents the model from extracting sensitive attribute information, thereby eliminating the association between the sensitive attribute and the target label and improving the fairness under the condition of not changing the model. On the one hand, need to be maximized

However, this causes the image to be moved in feature space to the other side of the sensitive property hyperplane. Therefore, it is required to increase

For the disturbed image

Predicted entropy of let

In that

Making a random guess above, the loss of entropy for this term can be expressed as:

wherein the content of the first and second substances,

representing entropy. Thus, the disturbance generator

The total loss for improving fairness is expressed as

，

Is a smaller value, controlling the weight of the entropy-constrained term. In addition to being responsible for perceiving fairness

In addition, it is necessary to keep the information of the target label in the hidden space representation and keep the model's performance on target label prediction, so a loss term responsible for the model accuracy is needed:

wherein

The cross-entropy is represented by the cross-entropy,

the output of the label predictor representing the model.

During the course of updating, by adding

While at the same time reducing

And deceiving the discriminator and keeping the accuracy of target label prediction.

Loss function of

Expressed as:

wherein the parameters

Control of

And

the balance of (a) to (b) is,

the higher the primary task accuracy can be maintained,

the lower the fairness is improved.

4) And (3) repeating the step 2) and the step 3) to carry out iterative training until the generator can well cheat the discriminator, the accuracy of the label predictor is high, the generator at the moment is integrated into a deployment model data preprocessing link as a fairness promotion module, and antagonism disturbance is added to input data to promote fairness of the deployment model.

In the course of iterative training, the disturbance generator

And discriminator

Conducting mini-max games until the creator can fool the discriminator well and the label predictor is accurate, at which point the creator will be played

Deployed as a model

Adaptively generating perturbations for the input data. Discriminator in infinitesimal-infinitesimal game process

Maximizing slave spaceInter-prediction sensitivity attribute

Ability of disturbance generator

Then an attempt is made to fool as much as possible

At the same time let

The target label of the disturbed image can be predicted, and the target function can be formalized as follows:

wherein, the parameters to be updated in the objective function are

And

update

Updating by maximizing (max) the above-mentioned objective function

The above objective function is minimized (min). Constraint term representation generator of objective function

For input image

Applying a perturbation to the image

Disturbance satisfies

Norm limitation

The implicit space obtained by the data after disturbance is expressed as

. Distinguishing device

Maximizing prediction of sensitive attributes from feature space

Ability of disturbance generator

Then an attempt is made to fool as much as possible

At the same time let

The target label of the disturbed image can be predicted. When the generator can cheat the discriminator well and the accuracy of the label predictor is high, stopping the iterative training and leading the generator to be used

Deployed as a model

A data preprocessing module, generator

Antagonistic perturbations can be generated adaptively for the input image.

According to the method for improving the fairness of the deep learning model based on the antagonistic disturbance, the antagonistic disturbance is added to the given deployment model through the training disturbance generator, the relevant characteristics of the sensitive attributes of the model are prevented from being extracted, images with different sensitive attribute values are treated fairly, and therefore the fairness of the deployment model is improved. The invention is tested on CelebA image data set, and in the testing process, the target label

That is, the target task label value and the sensitive attribute which need to be predicted for the deployment model

For gender, values of the target label and the sensitive attribute are { -1,1 }. In the test, in order to verify the improvement performance of the method on different fairness level models, models obtained by 4 different training modes are adopted as deployment models:

1) normal training model: training the model to minimize target label prediction loss on the dataset;

2) the confrontation training model comprises the following steps: adding a discriminator at the output end of the model, learning and predicting the sensitive attribute value by the discriminator, minimizing the target label prediction loss of the model on the data set in the process of training the model, and maximizing the discriminator loss so as to reduce the model bias;

3) and (3) turning over the label: randomly turning over a target label of the training set data, expanding bias in the data set, and minimizing target label prediction loss on the data set, so that the model learns the bias existing in the data;

4) gradient inversion model: and (3) inverting the gradient of reverse transmission of the discriminator in the antithetical training model, minimizing the target label prediction loss of the model on the data set in the process of training the model, and minimizing the loss of the discriminator so as to enlarge the bias of the model.

TABLE 1-1 method run results on Normal training model

TABLE 1-2 method run results on the challenge training model

Tables 1-3 running the results on the Label flipping model

Tables 1-4 results of the method run on a gradient inversion model

。

In the fairness promotion performance, the worse the original fairness of the deployment model is, the more the promotable space is, and the fairness promotion effect is more obvious by using the method and the device. As shown in tables 1-1, 1-2, 1-3, and 1-4 above, each table represents results on deployment models of different fairness levels. In each table, the first column represents different target label prediction tasks, including Smiling, Attractive and Blond _ Hair; the second column represents the input as an original image or a perturbed image; the third column ACC index is used to measure the accuracy of the target task,

，

the representation of the function of the indicative function,

indicating the result of the prediction of the image object label,

true target label representing an image, ACC being the number of predicted correct divided by the total

Higher ACC indicates better target task prediction performance; the fourth and fifth columns measure fairness,

，

representing the sensitive attribute value of the image, DP calculates the difference of the probabilities that the population of different sensitive attribute values is predicted as positive by the model, and

the DEO measures the difference between the false positive rate and the false negative rate of different sensitive attribute value groups, and the closer the two rates are to 0, the better the fairness is. The values of each row in the table represent the test results of either the original image or the perturbed image using the present invention on the deployment model. The experiment shows that the deployment model has certain prejudice, and the method can improve the fairness to a certain extent while maintaining the accuracy of the main task; when a deployment model has a large bias, the method can obviously improve fairness while effectively maintaining the accuracy of the main task; when the deployment model is fairly fair, the method can further improve fairness in a small range while maintaining accuracy of the main task.

TABLE 2-1 test results on the Ali API by the method

TABLE 2-2 method test results on Baidu API

。

Under the condition that the deployment model cannot be accessed, the method can also achieve a certain fairness promotion effect. As shown in tables 2-1 and 2-2, the smile detection interface (API) provided by the ali and Baidu vision open platform is used for experimental testing, the CelebA data set is used for training the disturbance generator, and when the deployment model architecture and parameters are unknown, the method can improve fairness while better maintaining the target label prediction accuracy.

It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. The system is characterized by comprising a deployment model, a disturbance generator and a discriminator, wherein the deployment model comprises a feature extractor and a label predictor, the disturbance generator is connected with the feature extractor, the feature extractor is respectively connected with the label predictor and the discriminator, an image is input by the feature extractor, the image is represented by a hidden space through the feature extractor, the hidden space represents a prediction result of a target label output after being input into the label predictor, and the hidden space represents a prediction result of an image sensitive attribute output after being input into the discriminator.

2. The system for improving fairness of a deep learning model based on antagonistic disturbance according to claim 1, wherein an input of the disturbance generator is an image, an output of the disturbance generator is the antagonistic disturbance, and a disturbance value is added to the input image and then input to the feature extractor.

3. A method for improving fairness of a deep learning model based on antagonistic disturbance is characterized by comprising the following steps:

4. The method for improving fairness-based deep learning model fairness based on adversarial disturbance as claimed in claim 3, wherein the deployment model is expressed as

Wherein

In order to provide a feature extractor for a computer,

for the target label predictor, the input image is

The sensitive property is

The object label is

。

5. The method for improving fairness of deep learning model based on adversarial disturbance according to claim 4, wherein in the step 1), a disturbance generator is used

For images

Adding antagonistic disturbance, the disturbed image is

Disturbance satisfies

Norm limitation

The disturbed image

Input deployment model, feature extractor for deployment model

Implicit spatial representation of an output image

。

6. The method as claimed in claim 4, wherein the step 2) is performed by updating

Is determined by the information of (a) a,

the loss function of (d) is:

wherein

Arbiter for sensitive attribute

Is output as

，

Representing the true sensitive property.

7. The method for improving fairness of deep learning model based on adversarial disturbance as claimed in claim 4, 5 or 6, wherein in step 3), the increase is performed

Entropy of prediction of disturbed image

In disturbing the sample

Making a random guess above, the loss of entropy is expressed as:

wherein the content of the first and second substances,

representing entropy, to this point, the generator

The total loss for improving fairness is expressed as

，

Is a smaller value, controlling the weight of the entropy-constrained term.

8. The method of claim 7 for improving fairness based on adversarial perturbation in deep learning modelLifting method, characterized in that in said step 3), except for being responsible for fairness perception

wherein

The cross-entropy is represented by the cross-entropy,

the output of the target label predictor representing the model,

during the course of updating, by adding

While at the same time reducing

and

is balanced by a parameter

the higher the primary task accuracy can be maintained,

the lower the fairness can be improved,

loss function of

Expressed as:

。

9. the method for improving fairness based on the adversarial disturbance deep learning model in claim 4 or 8, wherein in step 4), the disturbance generator

And discriminator

Deployed as a model

Adaptively generating perturbations for the input data.

10. The adversarial-perturbation-based deep-learning model fairness of claim 9The character of the character promoting method is that in the minimum-maximum game process, the discriminator