CN111860441A - Video target identification method based on unbiased depth migration learning - Google Patents

Video target identification method based on unbiased depth migration learning

Info

Publication number
CN111860441A
CN111860441A (application CN202010761599.6A; granted as CN111860441B)
Authority
CN
China
Prior art keywords
bias
video
domain image
source domain
unbiased
Prior art date
Legal status
Granted
Application number
CN202010761599.6A
Other languages
Chinese (zh)
Other versions
CN111860441B (en)
Inventor
陈晋音 (Chen Jinyin)
徐思雨 (Xu Siyu)
陈治清 (Chen Zhiqing)
徐国宁 (Xu Guoning)
缪盛欢 (Miao Shenghuan)
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date: 2020-07-31
Filing date: 2020-07-31
Publication date: 2020-10-30
Application filed by Zhejiang University of Technology (ZJUT)
Priority to CN202010761599.6A
Publication of CN111860441A
Application granted
Publication of CN111860441B
Legal status: Active


Classifications

    • G06V 20/40: Scenes; scene-specific elements in video content
    • G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/08: Neural network learning methods
    • G06V 40/168: Human faces; feature extraction, face representation
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video target recognition method based on unbiased deep transfer learning, which comprises the following steps: (1) constructing a source domain image set and a target domain image set; (2) constructing a bias attribute prediction model; (3) constructing a transfer-learning-based training framework comprising two branches with the same structure as the bias attribute prediction model, with an adaptation layer added at the same position in each branch, where the input of the first branch is a sample composed of a source domain image and a task label and the input of the second branch is a video frame image; (4) constructing a loss function for the training framework comprising the bias attribute classification loss of the bias attribute prediction model, the task attribute classification loss of the first branch, and the MMD distance loss between the adaptation layers of the two branches; (5) training the framework and extracting the parameter-determined second branch as the unbiased video target recognition model; (6) inputting the video frame to be recognized into the unbiased video target recognition model and outputting the target recognition result.

Description

Video target identification method based on unbiased depth migration learning
Technical Field
The invention belongs to the field of target recognition and particularly relates to a video target recognition method based on unbiased deep transfer learning.
Background
One of the main assumptions of many machine learning and data mining algorithms is that the training data and the test data lie in the same feature space and follow the same distribution. In many practical applications, however, this assumption does not hold. In recent years, transfer learning has emerged as a new learning framework to address such situations.
Transfer learning transfers the knowledge of one domain (the source domain) to another domain (the target domain) so that the target domain achieves a better learning result. In general, the source domain has abundant data while the target domain has little, and this is exactly the scenario transfer learning suits. For example, suppose a classification task has insufficient data (the target domain) while a large amount of related training data exists (the source domain), but that training data differs in feature distribution from the test data of the classification task at hand; in video target recognition, for instance, picture datasets are abundant while video screenshots for the classification task are extremely scarce. In such cases, a suitable transfer learning method can greatly improve the classification and recognition results of the task with insufficient samples.
Although deep transfer learning has developed greatly in practical applications, some recent studies show that deep transfer learning models carry unfair bias: they are sensitive to certain irrelevant attributes, and their decisions often depend on such spurious attribute associations. This prejudice may manifest as follows: when such systems classify images containing people, they may over-associate protected attributes such as gender, race, or age with object or action labels, magnifying social stereotypes and leading to erroneous decisions. Moreover, a model trained on a biased dataset can expand the association between certain labels and protected attributes far beyond what people would accept. When such a biased deep transfer learning model is used to recognize or detect images, it may therefore cause many negative effects and social harms.
In general, work on mitigating and preventing bias in the decisions of deep transfer learning models falls into three categories: (1) eliminating bias in the sample dataset by preprocessing the dataset; (2) directly modifying the deep transfer learning model to eliminate the bias in the model; (3) evaluating the fairness of the deep transfer learning model. Although these methods are effective in preventing the deep transfer learning model from producing biased decisions, recent research shows that such models can also amplify the bias present in the sample dataset.
In view of the bias problem of deep transfer learning models and the limitations of current research on preventing bias, studying an unbiased video target recognition method has extremely important theoretical and practical significance.
Disclosure of Invention
The invention aims to provide a video target recognition method based on unbiased deep transfer learning, which recognizes targets in video frames by constructing an unbiased deep transfer learning model, thereby improving the accuracy of target recognition.
In order to achieve the purpose, the invention provides the following technical scheme:
a video target identification method based on non-bias depth transfer learning comprises the following steps:
(1) constructing a source domain image set containing task labels and bias labels, and constructing a target domain image set from video frame images extracted from videos;
(2) constructing a bias attribute prediction model, taking a sample consisting of a source domain image and a bias label as the input of the bias attribute prediction model, and outputting the bias attribute prediction model as a bias attribute prediction value;
(3) constructing a training framework based on transfer learning, wherein the training framework comprises two branches with the same structure as the bias attribute prediction model, and an adaptive layer is added at the same position of each branch, wherein the input of the first branch is a sample consisting of a source domain image and a task label, the output is a predicted value of the task attribute, the input of the second branch is a video frame image, and the output is a predicted value of the task attribute;
(3) constructing a loss function of a training frame, wherein the loss function comprises bias attribute classification loss of a bias attribute prediction model, task attribute classification loss of a first branch and MMD distance loss between adaptation layers of two branches;
(4) training a training frame by using a loss function according to the source domain image set and the target-oriented image set, and after training is finished, extracting a second branch with determined parameters as an unbiased video target recognition model;
(5) and inputting the video frame to be recognized into the unbiased video target recognition model, and outputting a target recognition result.
Compared with the prior art, the beneficial effects of the invention include at least the following:

In the video target recognition method based on unbiased deep transfer learning, model parameters are optimized by transfer learning over the source domain images and the video frame images, and the loss contributed by the bias attribute is removed during learning, so that the influence of the bias attribute on the recognition result of the target task is weakened and an unbiased video target recognition model is obtained. Using this model to recognize targets in video frame images avoids the influence of the bias attribute on the recognition result and improves the accuracy of target recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a video target identification method based on unbiased depth migration learning according to an embodiment;
fig. 2 is a training framework of a video target recognition model according to an embodiment.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples, while indicating preferred embodiments of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
This embodiment addresses the problem of inaccurate target recognition results caused by bias in the model. It provides a video target recognition method based on unbiased deep transfer learning: a bias loss score is obtained by constructing a deep learning network and superposed on the loss score of the original target task to form the total loss function of the unbiased deep transfer learning model, which reduces the sensitivity of the deep learning model to irrelevant features. The model is then trained with the source domain dataset and the target domain dataset to obtain an unbiased deep transfer learning model, further ensuring that the deep transfer learning model makes fair decisions.
As shown in fig. 1, the video target recognition method based on unbiased deep transfer learning provided by this embodiment includes the following steps:
step 1, defining bias of a video target recognition model based on transfer learning.
The invention defines the bias behavior of a model as the phenomenon in which a video target recognition model is influenced by irrelevant but sensitive features when making a decision, so that its decisions may depend on spurious attribute associations. Taking gender bias as an example: suppose the gender label is an irrelevant but sensitive attribute of the deep learning model. Although the prediction tasks for other labels do not include gender prediction, gender features may still affect those classification tasks, causing the deep learning model to exhibit gender discrimination; that is, the model is biased.
And 2, preparing and preprocessing a data set picture.
Step 2 constructs the sample dataset used to train the model. It includes a source domain image set containing task labels and bias labels; the source domain image set may be a subset extracted from the Pascal VOC 2007 dataset and, besides a large number of scene images, contains the task labels of the video target recognition task, for example identifying professions in hospital scenes: doctor, nurse, cleaning staff, security guard, patient, and so on. It also contains bias labels that influence the profession label, such as gender labels and race labels. The sample dataset further includes a target domain image set consisting of video frame images extracted from video; these video frame images carry no labels and are very few, perhaps only 10 or 20 frames, far too few to train video target recognition on their own.
After the source domain image set and the target domain image set are obtained, the source domain images and the video frame images are normalized before being input into the model for training.
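As a concrete illustration, a minimal preprocessing sketch in PyTorch/torchvision follows; the library choice, the 224x224 input size, and the ImageNet normalization statistics are assumptions, since the patent states only that the images are normalized.

```python
import torchvision.transforms as T

# Normalization pipeline applied to both source images and video frames.
preprocess = T.Compose([
    T.Resize((224, 224)),          # assumed input size; not fixed by the patent
    T.ToTensor(),                  # scales pixel values to [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics, an assumption
                std=[0.229, 0.224, 0.225]),
])
```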
And 3, constructing a bias attribute prediction model.
The bias attribute prediction model is built to learn to predict bias attribute information from a source domain image; its input is a sample consisting of a source domain image and a bias label, and its output is a bias attribute prediction value. The model can adopt a CNN composed of convolutional layers and fully-connected layers, specifically a CNN of 5 convolutional layers and 3 fully-connected layers connected in sequence, with ReLU as the activation function. After the input source domain image passes through convolutional feature extraction and the fully-connected operations, a softmax function yields the probability distribution of the bias attribute prediction value.
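The sketch below renders this architecture in PyTorch; the layer counts and ReLU/softmax choices follow the text, while the channel widths, strides, and 224x224 input size are assumptions. Later sketches in this description reuse this class.

```python
import torch
import torch.nn as nn

class BiasAttributePredictor(nn.Module):
    """5 convolutional layers + 3 fully-connected layers with ReLU activations."""
    def __init__(self, num_bias_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),    # conv 1
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # conv 2
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(), # conv 3
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(), # conv 4
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(), # conv 5
            nn.Flatten(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 7 * 7, 1024), nn.ReLU(),  # fc 6 (224x224 input assumed)
            nn.Linear(1024, 256), nn.ReLU(),          # fc 7
            nn.Linear(256, num_bias_classes),         # fc 8
        )

    def forward(self, x):
        logits = self.classifier(self.features(x))
        # torch.softmax(logits, dim=1) gives the probability distribution
        # over bias-attribute values described in the text.
        return logits
```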
During training, the bias attribute prediction model attempts to predict the bias attribute $g_i$ from the input source domain image $x_i$ while minimizing the loss over the information it can extract, so the bias attribute classification loss is:

$$L_{bias} = \sum_{i} L_c\big(c(x_i),\, g_i\big)$$

where $c(x_i)$ is the prediction of the bias attribute prediction model for the $i$-th source domain image $x_i$, $g_i$ is the bias label of the $i$-th source domain image $x_i$, and $L_c(\cdot)$ is the cross-entropy function.
Step 4, constructing the transfer-learning-based training framework and its loss function.
As shown in fig. 2, the transfer-learning-based training framework comprises two branches whose structure is identical to that of the bias attribute prediction model: each branch contains a convolutional neural network composed of convolutional layers and fully-connected layers, specifically convolutional layers 1 to 5 followed by fully-connected layers 6 to 8, connected in sequence. The convolutional neural networks of the two branches share weights. On this basis, an adaptation layer is added at the same position in each branch; the adaptation layer maps its input data, and the MMD distance computed from the adaptation-layer outputs evaluates the difference between the source domain and the target domain. The position and size of the adaptation layer are preferably chosen among the fully-connected layers with high recognition accuracy and minimum MMD distance as the target. Preferably, the adaptation layer is arranged between the second and the third fully-connected layers, maps the input data, and has size 256.
The input of the first branch is a sample consisting of a source domain image and a task label; after feature extraction by the 5 convolutional layers and the operations of fully-connected layer 6, fully-connected layer 7, the adaptation layer, and fully-connected layer 8, the branch outputs the probability distribution of the task attribute prediction value. The input of the second branch is a video frame image; after the same 5 convolutional layers and the operations of fully-connected layers 6 and 7, the adaptation layer, and fully-connected layer 8, it likewise outputs the probability distribution of the task attribute prediction value.
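A sketch of one branch is given below, reusing the convolutional stack from the bias-model sketch above and inserting a 256-unit adaptation layer between the second and third fully-connected layers. Because the two branches share weights, one instance can serve both: the source branch and the video-frame branch simply call the same module.

```python
import torch
import torch.nn as nn

class BranchWithAdaptation(nn.Module):
    def __init__(self, num_task_classes: int = 5):  # e.g. 5 hospital professions
        super().__init__()
        # Same five-convolutional-layer stack as the bias-attribute model
        # (see the BiasAttributePredictor sketch above).
        self.features = BiasAttributePredictor().features
        self.fc6 = nn.Sequential(nn.Linear(256 * 7 * 7, 1024), nn.ReLU())
        self.fc7 = nn.Sequential(nn.Linear(1024, 256), nn.ReLU())
        self.adapt = nn.Linear(256, 256)   # adaptation layer, size 256
        self.fc8 = nn.Linear(256, num_task_classes)

    def forward(self, x):
        h = self.fc7(self.fc6(self.features(x)))
        a = self.adapt(h)        # adaptation-layer output, used for the MMD loss
        return self.fc8(a), a    # task logits and adapted features
```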
The loss function is constructed to comprise the bias attribute classification loss of the bias attribute prediction model, the task attribute classification loss of the first branch, and the MMD distance loss between the adaptation layers of the two branches.
A domain loss function is calculated from the output of the adaptation layer: the MMD (Maximum Mean Discrepancy) distance between the source domain and the target domain serves as the domain loss, and minimizing the MMD distance reduces the difference between the two domains. The MMD is calculated as:

$$\mathrm{MMD}(X_S, X_T) = \left\| \frac{1}{|X_S|} \sum_{x_s \in X_S} \phi(x_s) \; - \; \frac{1}{|X_T|} \sum_{x_t \in X_T} \phi(x_t) \right\|$$

where $x_s$ is a source domain image from the set $X_S$, $x_t$ is a target domain image from the set $X_T$, $\phi(\cdot)$ is the output of the adaptation layer, and $\|\cdot\|$ is the L1 distance.
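A sketch of this term follows. Note that the description names the L1 distance for the norm, while the squared Euclidean norm is the form standard in MMD-based adaptation; the sketch uses the squared form, and `diff.abs().sum()` can be substituted for a literal L1 reading.

```python
import torch

def mmd_loss(phi_s: torch.Tensor, phi_t: torch.Tensor) -> torch.Tensor:
    """phi_s, phi_t: adaptation-layer outputs of shape (batch, 256)."""
    # Difference of the mean adaptation-layer embeddings of the two domains.
    diff = phi_s.mean(dim=0) - phi_t.mean(dim=0)
    return diff.pow(2).sum()  # squared MMD with a linear kernel
```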
The task attribute classification loss $L_C(X_S, Y)$ is the cross-entropy loss between the task attribute predictions of the first branch on the source domain image set $X_S$ and the task labels $Y$.

The target task loss combines the task attribute classification loss with the MMD distance loss:

$$L_D(X_S, Y, X_T) = L_C(X_S, Y) + \lambda\,\mathrm{MMD}^2(X_S, X_T)$$

where $\lambda$ is a hyper-parameter with value range 0 to 1, $X_S$ is the source domain image set, $X_T$ is the target domain image set, and $Y$ is the set of task labels of the source domain image set.

Finally, the loss function $Loss$ of the training framework is:

$$Loss = L_D(X_S, Y, X_T) + L_{bias}$$
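The composition of the three terms can be sketched as follows, reusing `mmd_loss` from above; reading the bias term as a plain additive term is an assumption based on the statement that the bias loss is superposed on the target-task loss.

```python
import torch.nn.functional as F

def total_loss(task_logits, y_task, phi_s, phi_t, bias_logits, y_bias, lam=0.5):
    l_task = F.cross_entropy(task_logits, y_task)   # L_C(X_S, Y), first branch
    l_mmd = mmd_loss(phi_s, phi_t)                  # MMD^2(X_S, X_T)
    l_bias = F.cross_entropy(bias_logits, y_bias)   # bias-attribute loss
    return l_task + lam * l_mmd + l_bias            # Loss = L_D + L_bias
```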
and 5, constructing an unbiased video target recognition model according to the training framework.
After the loss function is constructed, the training framework is trained with it to optimize the network parameters of the framework. The specific process is as follows:
The source domain image set $X_S$ and the target domain image set $X_T$ are input into the training framework, and the source domain image set $X_S$ is also input into the bias attribute prediction model. The loss function value is calculated according to the loss function $Loss$, the network parameters of the training framework are updated with an Adam optimizer, and training ends when the maximum number of iterations is reached; the parameter-determined second branch is then extracted as the unbiased video target recognition model.
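A minimal training-loop sketch under the foregoing assumptions follows; the dummy tensors, batch sizes, learning rate, and epoch count are illustrative stand-ins, since the patent specifies only the Adam optimizer and a maximum iteration count.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in datasets of random tensors; real training would use the normalized
# Pascal-VOC-derived source set and the extracted video frames.
source_set = TensorDataset(torch.randn(8, 3, 224, 224),
                           torch.randint(0, 5, (8,)),   # task labels
                           torch.randint(0, 2, (8,)))   # bias labels
target_set = TensorDataset(torch.randn(8, 3, 224, 224))
source_loader = DataLoader(source_set, batch_size=4)
target_loader = DataLoader(target_set, batch_size=4)

branch = BranchWithAdaptation()        # shared-weight branch (sketch above)
bias_model = BiasAttributePredictor()  # bias-attribute predictor (sketch above)
optimizer = torch.optim.Adam(list(branch.parameters()) +
                             list(bias_model.parameters()), lr=1e-4)

max_epochs = 10  # illustrative; the patent trains to a maximum iteration count
for epoch in range(max_epochs):
    for (x_s, y_task, y_bias), (x_t,) in zip(source_loader, target_loader):
        task_logits, phi_s = branch(x_s)   # first branch: source images
        _, phi_t = branch(x_t)             # second branch: video frames
        bias_logits = bias_model(x_s)      # bias-attribute prediction on X_S
        loss = total_loss(task_logits, y_task, phi_s, phi_t,
                          bias_logits, y_bias, lam=0.5)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# The trained second branch `branch` is kept as the unbiased recognition model.
```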
The de-biasing performance of the unbiased video target recognition model is then evaluated. Bias attributes such as gender are irrelevant but sensitive attributes of the video target recognition model and compromise its fair decisions. To reduce the model's sensitivity to such features, the training framework produces an unbiased video target recognition model that is insensitive to gender features, which means the gender of sample data cannot be recovered from it. An evaluation index for the unbiased video target recognition model is therefore defined: the unbiased degree $\beta$. A $\beta$ close to 1 indicates a high degree of unbiasedness of the model, and a $\beta$ close to 0 a low degree. $\beta$ is calculated as:

$$\beta = \frac{1}{n} \sum_{i=1}^{n} l\big[c(x_i) = g_i\big]$$

where $n$ is the total number of source domain images carrying bias labels in the source domain image set, $c(x_i)$ is the bias attribute predicted for the source domain image $x_i$, and the function $l[\cdot]$ takes the value 0 when the equation in brackets holds and 1 otherwise.
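A sketch of this metric, assuming `bias_model` is the bias-attribute predictor used for $c(\cdot)$:

```python
import torch

@torch.no_grad()
def unbiased_degree(bias_model, images, bias_labels):
    """beta = (1/n) * sum of l[c(x_i) = g_i], with l = 0 on equality, 1 otherwise."""
    preds = bias_model(images).argmax(dim=1)              # c(x_i)
    return (preds != bias_labels).float().mean().item()   # closer to 1 = less biased
```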
Step 6, applying the model to video target recognition.
Once the unbiased video target recognition model is obtained, it can be used for target recognition. In a specific application, video frames are captured from a video, normalized, and input into the unbiased video target recognition model, which computes and outputs the target recognition result.
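An inference sketch is given below; the OpenCV frame capture and the file name are illustrative choices, and `preprocess` and `branch` are the normalization pipeline and extracted second branch from the sketches above.

```python
import cv2
import torch
from PIL import Image

cap = cv2.VideoCapture("input.mp4")   # hypothetical video file
ok, frame = cap.read()                # capture one video frame
cap.release()
if ok:
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    x = preprocess(Image.fromarray(rgb)).unsqueeze(0)  # normalize as in step 2
    with torch.no_grad():
        logits, _ = branch(x)         # unbiased video target recognition model
    print("predicted target class:", logits.argmax(dim=1).item())
```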
In the video screenshot recognition method based on unbiased deep transfer learning, the bias loss of the sample data in the model is obtained through a deep learning network, and an unbiased video target recognition model is generated through model knowledge transfer and the transfer learning training method. This preserves the performance of the original video screenshot classification task while reducing the sensitivity of the video target recognition model to irrelevant features, thereby ensuring the fairness of the model's decisions and providing guidance for research on eliminating model bias.
The above embodiments are intended to illustrate the technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments and do not limit the invention; any modifications, additions, or equivalents made within the scope of the principles of the present invention shall be included in the protection scope of the present invention.

Claims (8)

1. A video target recognition method based on unbiased deep transfer learning, characterized by comprising the following steps:
(1) constructing a source domain image set containing task labels and bias labels, and constructing a target domain image set from video frame images extracted from videos;
(2) constructing a bias attribute prediction model that takes a sample consisting of a source domain image and a bias label as input and outputs a bias attribute prediction value;
(3) constructing a training framework based on transfer learning, the framework comprising two branches with the same structure as the bias attribute prediction model and an adaptation layer added at the same position in each branch, wherein the input of the first branch is a sample consisting of a source domain image and a task label and its output is a task attribute prediction value, and the input of the second branch is a video frame image and its output is a task attribute prediction value;
(4) constructing a loss function for the training framework comprising the bias attribute classification loss of the bias attribute prediction model, the task attribute classification loss of the first branch, and the MMD distance loss between the adaptation layers of the two branches;
(5) training the framework with the loss function on the source domain image set and the target domain image set, and after training, extracting the parameter-determined second branch as the unbiased video target recognition model;
(6) inputting the video frame to be recognized into the unbiased video target recognition model and outputting the target recognition result.
2. The video target recognition method based on unbiased deep transfer learning according to claim 1, characterized in that the constructed loss function $Loss$ is:

$$Loss = L_D(X_S, Y, X_T) + L_{bias}$$

where $L_{bias}$ is the bias attribute classification loss:

$$L_{bias} = \sum_{i} L_c\big(c(x_i),\, g_i\big)$$

in which $c(x_i)$ is the prediction of the bias attribute prediction model for the $i$-th source domain image $x_i$, $g_i$ is the bias label of the $i$-th source domain image $x_i$, and $L_c(\cdot)$ is the cross-entropy function;

$L_D(X_S, Y, X_T)$ is the target task loss composed of the task attribute classification loss and the MMD distance loss:

$$L_D(X_S, Y, X_T) = L_C(X_S, Y) + \lambda\,\mathrm{MMD}^2(X_S, X_T)$$

where $\lambda$ is a hyper-parameter with value range 0 to 1, $X_S$ is the source domain image set, $X_T$ is the target domain image set, $Y$ is the set of task labels of the source domain image set, and $\mathrm{MMD}^2(X_S, X_T)$ is the square of the MMD distance between $X_S$ and $X_T$, calculated as:

$$\mathrm{MMD}(X_S, X_T) = \left\| \frac{1}{|X_S|} \sum_{x_s \in X_S} \phi(x_s) \; - \; \frac{1}{|X_T|} \sum_{x_t \in X_T} \phi(x_t) \right\|$$

where $x_s$ is a source domain image from $X_S$, $x_t$ is a target domain image from $X_T$, $\phi(\cdot)$ is the output of the adaptation layer, and $\|\cdot\|$ is the L1 distance.
3. The video target recognition method based on unbiased deep transfer learning according to claim 1, characterized in that the bias attribute prediction model adopts a CNN composed of convolutional layers and fully-connected layers.
4. The video target recognition method based on unbiased deep transfer learning according to claim 1, characterized in that the bias attribute prediction model adopts a CNN composed of 5 convolutional layers and 3 fully-connected layers connected in sequence.
5. The video target recognition method based on unbiased deep transfer learning according to claim 1, characterized in that the position and size of the adaptation layer are selected among the fully-connected layers with high recognition accuracy and minimum MMD distance as the target.
6. The method according to claim 4, characterized in that an adaptation layer is arranged between the second and the third fully-connected layers, maps the input data, and has size 256.
7. The video target recognition method based on unbiased deep transfer learning according to claim 1, characterized in that after the source domain image set and the target domain image set are obtained, the source domain images and the video frame images are normalized.
8. The video target recognition method based on unbiased deep transfer learning according to claim 1, characterized in that in application, video frames are captured from a video, normalized, and input into the unbiased video target recognition model, which outputs the target recognition result through calculation.
CN202010761599.6A (priority 2020-07-31, filed 2020-07-31): Video target identification method based on unbiased depth migration learning; status: Active; granted as CN111860441B

Priority Applications (1)

CN202010761599.6A (priority 2020-07-31, filed 2020-07-31): Video target identification method based on unbiased depth migration learning; granted as CN111860441B

Applications Claiming Priority (1)

CN202010761599.6A (priority 2020-07-31, filed 2020-07-31): Video target identification method based on unbiased depth migration learning; granted as CN111860441B

Publications (2)

CN111860441A, published 2020-10-30
CN111860441B, published 2024-03-29

Family

Family ID: 72953888

Family Applications (1)

CN202010761599.6A (priority 2020-07-31, filed 2020-07-31): Video target identification method based on unbiased depth migration learning; status: Active; granted as CN111860441B

Country Status (1)

CN: CN111860441B

Cited By (1)

* Cited by examiner, † Cited by third party
CN113283345A * (priority 2021-05-27, published 2021-08-20), New Oriental Education & Technology Group: Blackboard writing behavior detection method, training method, device, medium and equipment

Citations (2)

* Cited by examiner, † Cited by third party
CN110837570A * (priority 2019-11-12, published 2020-02-25), Beijing Jiaotong University: Method for unbiased classification of image data
US20200167653A1 * (priority 2018-11-27, published 2020-05-28), Wipro Limited: Method and device for de-prejudicing artificial intelligence based anomaly detection

Patent Citations (2)

* Cited by examiner, † Cited by third party
US20200167653A1 * (priority 2018-11-27, published 2020-05-28), Wipro Limited: Method and device for de-prejudicing artificial intelligence based anomaly detection
CN110837570A * (priority 2019-11-12, published 2020-02-25), Beijing Jiaotong University: Method for unbiased classification of image data

Cited By (2)

* Cited by examiner, † Cited by third party
CN113283345A * (priority 2021-05-27, published 2021-08-20), New Oriental Education & Technology Group: Blackboard writing behavior detection method, training method, device, medium and equipment
CN113283345B * (priority 2021-05-27, published 2023-11-24), New Oriental Education & Technology Group: Blackboard writing behavior detection method, training device, medium and equipment

Also Published As

CN111860441B, published 2024-03-29

Similar Documents

Publication Title
CN112115963B (en) Method for generating unbiased deep learning model based on transfer learning
JP5418991B2 (en) Personal authentication system, personal authentication method
WO2022001123A1 (en) Key point detection method and apparatus, and electronic device and storage medium
CN111753918B (en) Gender bias-removed image recognition model based on countermeasure learning and application
CN110532398B (en) Automatic family map construction method based on multi-task joint neural network model
CN112598643A (en) Depth counterfeit image detection and model training method, device, equipment and medium
Muhathir et al. Analysis K-Nearest Neighbors (KNN) in Identifying Tuberculosis Disease (Tb) By Utilizing Hog Feature Extraction
JP2022141931A (en) Method and device for training living body detection model, method and apparatus for living body detection, electronic apparatus, storage medium, and computer program
CN111860193B (en) Text-based pedestrian retrieval self-supervision visual representation learning system and method
CN110458022B (en) Autonomous learning target detection method based on domain adaptation
CN112819024B (en) Model processing method, user data processing method and device and computer equipment
CN113593661A (en) Clinical term standardization method, device, electronic equipment and storage medium
CN111222330A (en) Chinese event detection method and system
CN115761408A (en) Knowledge distillation-based federal domain adaptation method and system
CN117133035A (en) Facial expression recognition method and system and electronic equipment
CN117593752B (en) PDF document input method, PDF document input system, storage medium and electronic equipment
CN109871866B (en) Model training method, device, equipment and medium for hospital infection prediction
CN114971294A (en) Data acquisition method, device, equipment and storage medium
CN110889505A (en) Cross-media comprehensive reasoning method and system for matching image-text sequences
CN111860441B (en) Video target identification method based on unbiased depth migration learning
CN112668633B (en) Adaptive graph migration learning method based on fine granularity field
CN116306909A (en) Method for realizing model training, computer storage medium and terminal
CN115909403A (en) Low-cost high-precision pig face identification method based on deep learning
CN113590774B (en) Event query method, device and storage medium
CN114168780A (en) Multimodal data processing method, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant