CN114528911A - Multi-label image classification method and model construction method and device for multi-branch structure - Google Patents


Info

Publication number
CN114528911A
CN202210021186.3A (application) · CN114528911A (publication)
Authority
CN
China
Prior art keywords
feature
branch
input
category
prediction
Prior art date
Legal status
Pending
Application number
CN202210021186.3A
Other languages
Chinese (zh)
Inventor
范建平
雷俊婷
赵万青
彭进业
张晓丹
杨文静
王珺
Current Assignee
Northwest University
Original Assignee
Northwest University
Priority date
Filing date
Publication date
Application filed by Northwest University
Priority to CN202210021186.3A
Publication of CN114528911A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-label image classification method with a multi-branch structure, together with a model construction method and device. Because a multi-label image contains several semantic objects of different sizes and characteristics, the method takes features from different parts of a feature extraction network as the input features of different branches. Each branch performs feature fusion with its own feature as the main component and the features of the other branches as auxiliary components; the fused feature is input into an attention network for further feature extraction, yielding the final feature used for prediction, and each branch then predicts independently. For every category, the maximum of the branch prediction scores is selected as the whole network's predicted value for that category, giving the network's final prediction for the input sample. The method overcomes the inability of the prior art to predict all semantic targets in a multi-label image comprehensively and accurately, and effectively improves the accuracy of multi-label image classification.

Description

Multi-label image classification method and model construction method and device for multi-branch structure
Technical Field
The invention belongs to the technical field of image classification, and in particular relates to a multi-label image classification method with a multi-branch structure, together with a model construction method and device.
Background
The image classification task underpins many vision tasks. In real life, an image is often composed of several objects — a single image may contain, for example, a person, a dog and a cat — so multi-label images are closer to practice, and such an image usually contains several semantic objects of different sizes. Approaches to the multi-label image classification problem fall into traditional machine learning algorithms and deep learning algorithms.
Traditional machine learning algorithms mainly follow two ideas: (1) problem transformation — the multi-label image classification problem is regarded as several single-label classification problems, and several classifiers are trained to perform single-label classification repeatedly; (2) algorithm adaptation — instead of converting the multi-label problem into known single-label problems, an algorithm suited to multi-label image classification is designed directly from the characteristics of the images.
With the continued development of deep learning, many deep learning algorithms have been applied to multi-label image classification: the strong nonlinear representation capability of neural networks lets them learn effective features from large-scale data and thereby improve classification accuracy. Building on the BING objectness measure proposed by Cheng Ming-Ming et al. in 2014, Wei Yunchao et al. proposed the Hypotheses-CNN-Pooling (HCP) framework: several candidate regions are extracted from each input picture, each candidate region is fed into a CNN for classification training and produces a c-dimensional prediction, and max pooling yields the final classification result. This method extracts many hypotheses, but generating many candidate regions per picture and training a CNN on each of them incurs a large amount of computation. A CNN has strong nonlinear representation capability, while an RNN can model the association between images and labels; Jiang Wang et al. proposed a joint CNN-RNN network structure in 2016, with the CNN extracting image features and the RNN modelling label dependencies. The method considers correlation between categories and works well for large targets and objects with dependencies, but poorly for small targets and objects without dependencies, and it cannot reliably recognize several targets of different sizes. In the same year, Zhang J et al. added a Regional LSTM module to the CNN-RNN structure; it guides the features obtained by the CNN, obtains the position information of the corresponding features, and further models the dependencies among features, positions and labels.
Graph convolutional networks have also been applied to multi-label image classification; Chen Z M et al. introduced graph convolution into multi-label classification in 2019. Such methods mainly exploit the dependency between features and labels; when the features have no such dependency and the targets differ in size and level of abstraction, suitable features must be selected in a targeted manner, according to the characteristics of each target, to perform category prediction.
For the multi-label image classification problem, new algorithms continue to emerge. Existing methods, however, do not fully exploit the classification advantages of different features for different semantic targets, so the accuracy of multi-label image classification still needs to be improved.
Disclosure of Invention
To address the defects and shortcomings of the prior art, the invention provides a multi-label image classification method with a multi-branch structure, together with a model construction method and device. It solves problems of the prior art such as the difficulty of predicting all semantic targets in an input image comprehensively — small targets in particular are often ignored — and makes full use of the characteristics of different features, thereby improving the accuracy of multi-label image classification.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-label image classification model construction method of a multi-branch structure comprises the following steps:
step 1: dividing an original data set according to a preset proportion to obtain a training set and a testing set, wherein the training set and the testing set comprise real labels corresponding to each image;
Step 2: inputting the training set into a feature extraction network, and obtaining features F1, F2 and F3 from different positions of the feature extraction network;
Step 3: taking the features F1, F2 and F3 obtained in step 2 as the input features of three branches L1, L2 and L3 respectively; each branch performs feature fusion with its own input feature as the main component and the input features of the other branches as auxiliary components, yielding fused features F11, F21 and F31;
Step 4: inputting the fused features F11, F21 and F31 obtained in step 3 into a coordinate-attention network to obtain the weighted features F12, F22 and F32;
Step 5: performing category prediction on the weighted features F12, F22 and F32 obtained in step 4 respectively, to obtain prediction scores for all categories;
Step 6: from the prediction scores obtained in step 5, selecting the maximum value category by category as the final prediction score of each category, to obtain the prediction result for the input image;
Step 7: comparing the prediction result obtained in step 6 with the real labels of the image to obtain a loss value, performing back propagation to update the network parameters, and completing training within a preset number of training batches to obtain the multi-branch multi-label image classification model; the test set is input into the trained network to obtain the corresponding classification accuracy. The multi-branch multi-label image classification model is used for multi-label image classification.
The invention also comprises the following technical characteristics:
optionally, the feature extraction network in step 2 is a ResNet101 network, and the ResNet101 network structure is determined according toThe sub-division is 6 parts: conv1, Conv2_ x, Conv3_ x, Conv4_ x, Conv5_ x, FC, respectively, feature F being introduced at Conv3_ x, Conv4_ x outputs1、F2For acting as branch L1、L2And adding the SPP block at the output of Conv5_ x, and then applying the SPP block to the input feature F3As branch L3The input feature of (1).
Optionally, step 3 specifically includes:
Step 3.1: for feature F1, upsampling is used to make the sizes of features F2 and F3 consistent with F1; the resized features F2 and F3 are then concatenated with feature F1 along the channel dimension, and the concatenated feature is F11 = N1 × (C1+C2+C3) × H1 × W1, i.e., feature F11 has num value N1, channel count C1+C2+C3 and spatial size H1 × W1.
Step 3.2: for feature F2, upsampling or downsampling is used to make the sizes of features F1 and F3 consistent with F2; the resized features F1 and F3 are then concatenated with feature F2 along the channel dimension, and the concatenated feature is F21 = N2 × (C1+C2+C3) × H2 × W2, i.e., feature F21 has num value N2, channel count C1+C2+C3 and spatial size H2 × W2.
Step 3.3: for feature F3, downsampling is used to make the sizes of features F1 and F2 consistent with F3; the resized features F1 and F2 are then concatenated with feature F3 along the channel dimension, and the concatenated feature is F31 = N3 × (C1+C2+C3) × H3 × W3, i.e., feature F31 has num value N3, channel count C1+C2+C3 and spatial size H3 × W3.
Optionally, step 4 specifically includes:
Step 4.1: inputting features F11, F21 and F31 into the coordinate-attention network, and obtaining for each of them the corresponding encoded outputs along the horizontal and vertical coordinate directions;
Step 4.2: for feature F11, applying a 1×1 convolution function and a nonlinear activation function to the encoded outputs obtained in step 4.1, generating an intermediate feature map that encodes the spatial information of feature F11 in the horizontal and vertical directions;
Step 4.3: splitting the intermediate feature map f along the horizontal and vertical spatial dimensions into two independent tensors f^h and f^w; using 1×1 convolution functions to restore the channel counts of f^h and f^w to that of the input feature; applying a sigmoid function to produce the attention weights; and multiplying the attention weights with the input feature to obtain the weighted feature F12;
Step 4.4: repeating steps 4.2 and 4.3 for features F21 and F31 to obtain the corresponding weighted features F22 and F32.
Optionally, in step 5, category prediction independently predicts, for each branch, which categories in the category space the corresponding feature belongs to. Each branch's independent prediction yields a matrix of size (batch_size, num_classes), where batch_size is the number of input images per batch and num_classes is the total number of label categories of the data set. With zero as the threshold, a prediction score greater than zero means the input image contains the category; otherwise it does not.
Optionally, in step 6, given the (batch_size, num_classes) matrix obtained by each branch's independent prediction, for each category the max function selects the maximum of the three branches' prediction scores for that category as the whole network's score for that category on the input image; repeating this for all categories in the category space yields the final prediction result for the input image.
A multi-label image classification model building device with a multi-branch structure comprises:
a determining module, for determining an original data set and dividing it according to a preset proportion to obtain a training set and a test set, each containing the real labels of the images;
a feature extraction module, for inputting the training set into the feature extraction network and obtaining features F1, F2 and F3 from different parts of the feature extraction network;
a feature fusion module, for taking the features F1, F2 and F3 as the input features of three branches L1, L2 and L3 respectively, each branch performing feature fusion with its own input feature as the main component and the input features of the other branches as auxiliary components, yielding fused features F11, F21 and F31;
a weighting module, for inputting the fused features F11, F21 and F31 into a coordinate-attention network to obtain the weighted features F12, F22 and F32;
a category prediction module, for performing category prediction on the weighted features F12, F22 and F32 respectively, to obtain prediction scores for all categories;
a category-by-category prediction-result selection module, for selecting the maximum prediction score category by category as the final prediction score of each category, to obtain the prediction result for the input image;
and a model training module, for comparing the prediction result with the real labels of the image to obtain a loss value, performing back propagation to update the network parameters, and completing training within a preset number of training batches to obtain the multi-branch multi-label image classification model, which is used for multi-label image classification.
A multi-label image classification method inputs an image to be classified into a multi-label image classification model with a multi-branch structure and outputs a multi-label classification result.
A computer device comprising a memory and a processor, wherein the memory stores a computer program, and wherein the processor executes the computer program to implement the steps of the multi-label image classification model construction method for a multi-branch structure or the steps of the multi-label image classification method.
A computer readable storage medium for storing program instructions executable by a processor to perform the steps of the multi-label image classification model construction method of the multi-branch structure or to perform the steps of the multi-label image classification method.
Compared with the prior art, the invention has the beneficial technical effects that:
the invention utilizes the characteristics that the characteristics of different parts of the characteristic extraction network have different characteristics, for example, the characteristics of a lower layer have more detail information, the characteristics of a higher layer have more semantic information, the characteristics of different characteristics can be used for processing corresponding characteristics in a targeted manner, so that a multi-branch structure is provided, information fusion that each branch takes the characteristic of the branch as the main part and other branch characteristics as auxiliary parts is further realized through characteristic fusion operation, the characteristic is further extracted in a weighted manner by the Attention network, a plurality of branches are independently predicted, and finally, the branch result with the best effect aiming at the type is selected as the prediction result of the whole network, thereby effectively improving the classification accuracy of the image on the whole.
Drawings
Fig. 1 is a flowchart illustrating a multi-label image classification method with a multi-branch structure according to the present invention.
Detailed Description
The invention provides a multi-label image classification method with a multi-branch structure, together with a model construction method and device, designed around the characteristics of multi-label images: an ordinary image contains several semantic objects, and these objects differ in size and other respects. The branches predict independently, and the best-performing result is finally selected, effectively improving the overall classification accuracy of the network.
Because a multi-label image contains several semantic objects of different sizes and characteristics, on top of a conventional CNN feature extraction network, features at different positions of the network are used as the input features of subsequent branches; each branch performs the feature fusion operation; the fused feature is input into the attention network to obtain the final feature used for prediction; each branch predicts independently; and for each category the best-performing branch result is selected as the prediction value of the whole network for that category, finally obtaining the prediction of the whole network for the input sample. Specifically:
the feature extraction network utilizes the traditional ResNet network to extract image features, and selects the features at different positions of the feature extraction network as subsequent network branches L1、L2And L3The input feature of (1).
Feature fusion first uses upsampling or downsampling to make the other branches' features the same size as the current branch's feature, then fuses the several same-size features with a concatenate operation. Each branch repeats this fusion, taking its own feature as the main body and the other branches' features as supplements, so that every branch keeps its own characteristics while carrying more comprehensive target information.
The attention network further weights and extracts the fused features: the coordinate-attention method is applied to the fused features, a weighting scheme that attends not only to the channel direction but also to the precise position information of the features.
Category-by-category selection of the best branch result: after the branches predict independently, the max function selects the maximum predicted value for each category among the multi-branch prediction results, improving the overall classification accuracy of the network.
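The flow just summarized — independent per-branch prediction followed by a category-wise max — can be sketched in NumPy. This is an illustrative stand-in, not the patent's implementation: the classifier weights are random, the branch feature shapes are typical ResNet101-like values, and global average pooling plus a linear layer stands in for each branch's prediction head.

```python
import numpy as np

def predict_branch(feat, w_cls):
    """Global-average-pool an (N, C, H, W) feature, then apply a linear classifier."""
    pooled = feat.mean(axis=(2, 3))          # (N, C)
    return pooled @ w_cls                    # (N, num_classes)

rng = np.random.default_rng(0)
num_classes = 5
# Three branch features with illustrative channel counts and spatial sizes.
feats = [rng.standard_normal((2, c, s, s)) for c, s in [(512, 28), (1024, 14), (2048, 7)]]
scores = [predict_branch(f, rng.standard_normal((f.shape[1], num_classes)) * 0.01)
          for f in feats]
final = np.maximum.reduce(scores)            # category-wise max over the branches
labels = final > 0                           # zero threshold: class present or not
print(final.shape, labels.shape)
```

Each branch produces its own (batch_size, num_classes) score matrix; taking the element-wise maximum lets whichever branch sees a target best decide that category's score.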
One embodiment of the present invention provides a method for constructing a multi-label image classification model with a multi-branch structure, as shown in fig. 1, including the following steps:
step 1, dividing an original data set according to a preset proportion to obtain a training set and a testing set, wherein the training set and the testing set comprise real labels corresponding to each image; the original dataset is a public dataset commonly used for multi-label image classification; in this example, the COCO, VOC2007, Flicker25k data sets were used, and these public data sets were already classified into training and testing sets, for example, if they were not directly classified into training and testing sets, the data sets were classified into training and testing sets according to a ratio of 8: 2.
Step 2: inputting the training set into the feature extraction network, and obtaining features F1, F2 and F3 from different positions of the feature extraction network.
The feature extraction network is implemented with a ResNet101 network, and output features are drawn from different positions of the ResNet101 network as the input features of branches L1, L2 and L3. A ResNet has two basic blocks: BasicBlock, composed of two 3×3 convolution blocks plus an identity mapping, and Bottleneck, composed of 1×1, 3×3 and 1×1 convolution blocks plus an identity mapping. The ResNet101 structure is divided sequentially into six parts: Conv1, Conv2_x, Conv3_x, Conv4_x, Conv5_x and FC. Features F1 and F2 are drawn from the outputs of Conv3_x and Conv4_x as the input features of branches L1 and L2; an SPP block is added at the output of Conv5_x, and its output feature F3 serves as the input feature of branch L3. ResNet101 uses the Bottleneck basic block: Conv1 is a convolutional layer with kernel size 7; Conv2_x applies max pooling with kernel size 3 to the output of Conv1, followed by 3 Bottleneck blocks; and Conv3_x, Conv4_x and Conv5_x consist of 4, 23 and 3 Bottleneck blocks respectively.
Step 3: taking the features F1, F2 and F3 obtained in step 2 as the input features of the three branches L1, L2 and L3 respectively; each branch performs feature fusion with its own input feature as the main component and the input features of the other branches as auxiliary components, yielding fused features F11, F21 and F31.
Feature fusion is an operation repeated on each branch, comprising two stages. In the first stage, taking the current branch's input feature size (Hi × Wi) as the standard, the other branches' input features are upsampled or downsampled to that same size (Hi × Wi). In the second stage, concatenate splicing is applied to the results of the first stage to obtain the fused features F11, F21 and F31. Concretely, the feature fusion module is applied to each branch in turn: with the branch's own feature as the basis, upsampling uses nearest-neighbor interpolation and downsampling uses a 3×3 convolution to bring the other branches' features to the same size as the branch's feature, after which they are spliced and fused.
The specific steps of step 3 include:
Step 3.1: for feature F1, upsampling is used to make the sizes of features F2 and F3 consistent with F1, and the resized F2 and F3 are concatenated with F1. Let F1 = (N1, C1, H1, W1), F2 = (N2, C2, H2, W2) and F3 = (N3, C3, H3, W3), where Ni, Ci, Hi and Wi are the num value, channel count, height and width of feature Fi. Concatenation is performed along the channel dimension, and the concatenated feature is F11 = N1 × (C1+C2+C3) × H1 × W1, i.e., feature F11 has num value N1, channel count C1+C2+C3 and spatial size H1 × W1.
Step 3.2: for feature F2, upsampling or downsampling is used to make the sizes of features F1 and F3 consistent with F2; the resized F1 and F3 are concatenated with F2 along the channel dimension, and the concatenated feature is F21 = N2 × (C1+C2+C3) × H2 × W2, i.e., feature F21 has num value N2, channel count C1+C2+C3 and spatial size H2 × W2.
Step 3.3: for feature F3, downsampling is used to make the sizes of features F1 and F2 consistent with F3; the resized F1 and F2 are concatenated with F3 along the channel dimension, and the concatenated feature is F31 = N3 × (C1+C2+C3) × H3 × W3, i.e., feature F31 has num value N3, channel count C1+C2+C3 and spatial size H3 × W3.
Step 4: inputting the fused features F11, F21 and F31 obtained in step 3 into the coordinate-attention network to obtain the features F12, F22 and F32 weighted by the coordinate-attention network.
The specific steps of step 4 include:
Step 4.1: inputting features F11, F21 and F31 into the coordinate-attention network, obtaining for each of them the corresponding encoded outputs along the horizontal and vertical coordinate directions. Specifically, each channel of feature F11, of size (H1, W1), is first encoded along the horizontal and vertical coordinate directions. At height h, the output of the c-th channel is:

z_c^h(h) = (1/W) · Σ_{0 ≤ i < W} x_c(h, i)

where z_c^h(h) is the encoded output of the c-th channel of feature F11 along the horizontal coordinate direction, h is the height, W is the width of feature F11, i is a variable with 0 ≤ i < W, and x_c(h, i) varies with i. At width w, the output of the c-th channel is:

z_c^w(w) = (1/H) · Σ_{0 ≤ j < H} x_c(j, w)

where z_c^w(w) is the encoded output of the c-th channel of feature F11 along the vertical coordinate direction, w is the width, H is the height of feature F11, j is a variable with 0 ≤ j < H, and x_c(j, w) varies with j. The encoded outputs of all channels along the horizontal coordinate direction are then concatenated to obtain z^h, and z^w is obtained in the same way.
Step 4.2: for feature F111 × 1 convolution functions and non-linesThe sexual activation function acting on the coded output z obtained in step 4.1h,zwTo generate a feature F11Spatial information of (2) intermediate feature mapping for encoding in horizontal and vertical directions
Figure BDA0003462661800000084
f∈R(C/r×(H+W))Wherein z ish,zwRespectively the coded outputs spliced along the horizontal direction and the vertical direction, F is a convolution function of 1 multiplied by 1, delta is a nonlinear activation function, r represents a down-sampling proportion, and C is a characteristic F11The number of channels, H, W being respectively characteristic F11Length and width.
Step 4.3: the intermediate feature mapping f is segmented into two independent tensors f along two spatial dimensions of the horizontal coordinate direction and the vertical coordinate directionh,fwTensor f using 1 × 1 convolution function respectivelyh,fwThe number of channels is changed into the number of channels which is the same as the number of channels of the input characteristic, the number of channels is acted by a sigmod function and then used as attention weight, and the attention weight is multiplied by the input characteristic to obtain weighted characteristic F12(ii) a Specifically, the attention weights of the two directions are ghAnd gwMultiplying the weight by the input feature to obtain a weighted feature F12Comprises the following steps: f12=F11×gh×gw
Step 4.4: for feature F21And F31Respectively repeating the step 4.2 and the step 4.3 to respectively obtain the characteristic F21And F31Corresponding weighted feature F22And F32
And 5: weighted feature F obtained from step 412、F22And F32Respectively carrying out category prediction to obtain prediction scores of all categories; the label of the image is one or more of the categories;
the category prediction is an independent prediction by each branch of which classes in the category space the features belong to, yielding a matrix of size (batch_size, num_classes), where batch_size is the number of pictures input into the network each time and num_classes is the total number of label categories corresponding to the data set; zero is used as the threshold: if a prediction score is greater than zero, the input image contains the category, otherwise it does not.
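The zero-threshold decision of step 5 amounts to the following rule (a trivial pure-Python sketch; the function and the class names are illustrative):

```python
def labels_from_scores(scores, class_names):
    """A label is predicted present exactly when its raw score exceeds zero,
    which is equivalent to sigmoid(score) > 0.5 on the probability scale."""
    return [name for name, s in zip(class_names, scores) if s > 0]
```

For example, scores [1.2, -0.5, 0.3] over ("cat", "dog", "car") would yield the multi-label prediction ["cat", "car"].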
Step 6: selecting the maximum value as the final prediction score of the category from the prediction scores obtained in the step 5 category by category to obtain the prediction result of the input image;
the category-by-category selection operates on the (batch_size, num_classes) matrices obtained by the independent prediction of each branch: for each category, the maximum of the prediction scores for that category across the three branch results is selected via the max function as the whole network's score for that category on the input image. Repeating this operation for all categories in the category space yields the final prediction result of the whole network for the input image.
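The per-category max fusion of step 6 can be sketched as follows (illustrative names; in a framework implementation this would be an element-wise max over the stacked branch outputs):

```python
def fuse_branch_scores(branch_scores):
    """branch_scores: one score list per branch, all of length num_classes.
    Returns, for each class, the maximum score across the branches."""
    return [max(per_class) for per_class in zip(*branch_scores)]
```

With three branches scoring two classes as [0.2, -1.0], [0.5, -3.0] and [0.1, 0.4], the fused network-level scores are [0.5, 0.4].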
And 7: and (4) comparing the prediction result obtained in the step (6) with the real label of the image to obtain a Loss value, performing back propagation to update network parameters, completing training in a preset training batch, and obtaining the multi-label image classification model with a multi-branch structure.
Specifically, the Loss value is calculated with the BCEWithLogitsLoss function, which combines a Sigmoid layer and a BCELoss layer; assuming the network has N batches, each predicting n labels, BCEWithLogitsLoss is computed as:
Loss={l1,…,lN}
ln=-[yn·log(δ(xn))+(1-yn)·log(1-δ(xn))]
where δ (x)n) For the Sigmoid function, the interval for mapping input X to (0, 1) is calculated as:
Figure BDA0003462661800000091
xnto predict the score, ynIs a real label.
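The per-label loss above can be reproduced directly from the formula (a pure-Python sketch of one term ln; PyTorch's `BCEWithLogitsLoss` computes the same quantity in a numerically stable form):

```python
import math

def bce_with_logits(x_n, y_n):
    """l_n = -[y·log(δ(x)) + (1-y)·log(1-δ(x))] with δ the sigmoid."""
    p = 1.0 / (1.0 + math.exp(-x_n))  # δ(x_n): maps the raw score into (0, 1)
    return -(y_n * math.log(p) + (1.0 - y_n) * math.log(1.0 - p))
```

At the decision threshold x = 0 the sigmoid gives 0.5, so the loss equals ln 2 ≈ 0.693 regardless of the label, consistent with the zero-threshold rule of step 5.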
In an embodiment of the present invention, a multi-label image classification model building apparatus with a multi-branch structure is provided, including:
the determining module is used for determining an original data set and dividing the original data set according to a preset proportion to obtain a training set and a testing set, wherein the training set and the testing set comprise real labels corresponding to each image;
a feature extraction module for inputting the training set into a feature extraction network to obtain features F1, F2 and F3 from different parts of the feature extraction network; the feature extraction network is implemented with a ResNet101 network, whose structure is divided sequentially into 6 parts: Conv1, Conv2_x, Conv3_x, Conv4_x, Conv5_x and FC; features F1 and F2 are taken from the outputs of Conv3_x and Conv4_x to serve as the input features of branches L1 and L2 respectively, an SPP block is added at the output of Conv5_x, and the feature F3 obtained after applying the SPP block serves as the input feature of branch L3.
a feature fusion module for taking the features F1, F2 and F3 respectively as the input features of the three branches L1, L2 and L3 and, with each branch's own input feature as the main input feature and the input features of the other branches as auxiliary input features, performing feature fusion respectively to obtain the fused features F11, F21 and F31; specifically, for feature F1, features F2 and F3 are upsampled to the same size as F1, and the resampled features F2 and F3 are concatenated with feature F1; for feature F2, features F1 and F3 are upsampled or downsampled to the same size as F2 and concatenated; for feature F3, features F1 and F2 are downsampled to the same size as F3 and concatenated.
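The fusion performed by this module (resize every auxiliary feature to the main branch's spatial size, then concatenate along the channel dimension) can be sketched as follows. Nearest-neighbour resampling stands in for the up/down-sampling, and the names are illustrative:

```python
import numpy as np

def resize_nn(x, H, W):
    """Nearest-neighbour spatial resize of a (C, h, w) feature map."""
    C, h, w = x.shape
    rows = np.arange(H) * h // H  # map each target row to a source row
    cols = np.arange(W) * w // W  # map each target column to a source column
    return x[:, rows][:, :, cols]

def fuse(main, others):
    """Step-3-style fusion: resize auxiliary features to the main feature's
    spatial size and concatenate all of them along the channel dimension."""
    resized = [resize_nn(o, main.shape[1], main.shape[2]) for o in others]
    return np.concatenate([main] + resized, axis=0)
```

The fused feature keeps the main branch's spatial size while its channel count becomes the sum C1+C2+C3, matching the concatenated shapes given in the claims.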
a weighting module for inputting the fused features F11, F21 and F31 into a coordinate attention network to obtain the weighted features F12, F22 and F32; specifically, the features F11, F21 and F31 are all input into the coordinate attention network to obtain three corresponding groups of encoded outputs along the horizontal and vertical coordinate directions; the encoded outputs of all channels of feature F11 along the horizontal coordinate direction are concatenated to obtain zh, and zw is obtained in the same way; a 1×1 convolution function and a nonlinear activation function are applied to the encoded outputs zh, zw to generate an intermediate feature mapping f that encodes the spatial information of feature F11 in the horizontal and vertical directions; the intermediate feature mapping f is then split along the two spatial dimensions of the horizontal and vertical coordinate directions into two independent tensors fh, fw; a 1×1 convolution function is applied to each of fh, fw to change their number of channels to that of the input feature, a sigmoid function is applied to yield attention weights, and the attention weights are multiplied with the input feature to obtain the weighted feature F12; the above operations are repeated to obtain the weighted features F22 and F32 corresponding to features F21 and F31;
a class prediction module for performing category prediction separately on the weighted features F12, F22 and F32 to obtain the prediction scores of all categories; specifically, the category prediction is an independent prediction by each branch of which classes in the category space the features belong to, yielding a matrix of size (batch_size, num_classes), where batch_size is the number of pictures input into the network each time and num_classes is the total number of label categories corresponding to the data set; zero is used as the threshold: if a prediction score is greater than zero, the input image contains the category, otherwise it does not;
a category-by-category selection prediction result module for selecting, for each category, the maximum value among the prediction scores as the final prediction score of that category to obtain the prediction result of the input image; specifically, the category-by-category selection operates on the (batch_size, num_classes) matrices obtained by the independent prediction of each branch: for each category, the maximum of the prediction scores for that category across the three branch results is selected via the max function as the whole network's score for that category on the input image; repeating this for all categories in the category space yields the final prediction result of the whole network for the input image;
a model training module for comparing the prediction result with the real labels of the image to obtain a Loss value, performing back propagation to update the network parameters, and completing training within a preset number of training batches to obtain the multi-label image classification model with a multi-branch structure, which is used for multi-label image classification; specifically, the Loss value is calculated with the BCEWithLogitsLoss function, which combines a Sigmoid layer and a BCELoss layer; assuming the network has N batches, each predicting n labels, BCEWithLogitsLoss is computed as:
Loss={l1,…,lN}
ln=-[yn·log(δ(xn))+(1-yn)·log(1-δ(xn))]
where δ (x)n) For the Sigmoid function, the interval for mapping input X to (0, 1) is calculated as:
Figure BDA0003462661800000101
xnto predict the score, ynIs a real label.
In one embodiment, a multi-label image classification method is provided, wherein an image to be classified is input into the constructed multi-label image classification model with the multi-branch structure, and a multi-label classification result is output.
In one embodiment, a computer device is provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the multi-label image classification model building method of the multi-branch structure or the steps of the multi-label image classification method of the above embodiments when executing the computer program.
In one embodiment, a computer readable storage medium is provided for storing program instructions executable by a processor to implement the steps of the multi-label image classification model construction method of the multi-branch structure or the steps of the multi-label image classification method of the above embodiments.
The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment the computer program product is embodied as a computer storage medium, in another alternative embodiment the computer program product is embodied as a software product or the like.
Each functional unit in each embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a volatile or non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the present solution or a part of the solution that substantially contributes to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiment provides an accuracy verification experiment for multi-label image classification:
TABLE 1

              CNN-RNN   RLSD   DELTA   ResNet101   Multi-branch (Ours)
COCO           61.2     65.9   71.3     81.76       82.26
VOC2007        84.0     87.5   90.3     91.26       91.31
Flicker25k      -        -      -       79.15       80.23
Table 1 compares the prediction accuracy of the multi-label image classification method with a multi-branch structure of the present invention against existing classification methods on the same data sets. The method provided herein is denoted Multi-branch (Ours); compared with the existing multi-label image classification methods, it achieves higher accuracy.

Claims (10)

1. A multi-label image classification model construction method of a multi-branch structure is characterized by comprising the following steps:
step 1: dividing an original data set according to a preset proportion to obtain a training set and a testing set, wherein the training set and the testing set comprise real labels corresponding to each image;
Step 2: inputting the training set into a feature extraction network, and obtaining features F1, F2 and F3 from different positions of the feature extraction network;
Step 3: taking the features F1, F2 and F3 obtained in step 2 respectively as the input features of three branches L1, L2 and L3 and, with each branch's own input feature as the main input feature and the input features of the other branches as auxiliary input features, performing feature fusion respectively to obtain fused features F11, F21 and F31;
Step 4: inputting the fused features F11, F21 and F31 obtained in step 3 into a coordinate attention network to obtain the weighted features F12, F22 and F32;
Step 5: performing category prediction separately on the weighted features F12, F22 and F32 obtained in step 4 to obtain the prediction scores of all categories;
Step 6: for each category, selecting the maximum value among the prediction scores obtained in step 5 as the final prediction score of that category, to obtain the prediction result of the input image;
Step 7: comparing the prediction result obtained in step 6 with the real labels of the image to obtain a Loss value, performing back propagation to update the network parameters, completing training within a preset number of training batches to obtain the multi-label image classification model with a multi-branch structure, and inputting the test set into the trained network to obtain the corresponding classification accuracy; the multi-label image classification model with the multi-branch structure is used for multi-label image classification.
2. The method for constructing a multi-label image classification model of a multi-branch structure according to claim 1, characterized in that the feature extraction network in step 2 is a ResNet101 network, whose structure is divided sequentially into 6 parts: Conv1, Conv2_x, Conv3_x, Conv4_x, Conv5_x and FC; features F1 and F2 are taken from the outputs of Conv3_x and Conv4_x to serve as the input features of branches L1 and L2 respectively, an SPP block is added at the output of Conv5_x, and the feature F3 obtained after applying the SPP block serves as the input feature of branch L3.
3. The method for constructing a multi-label image classification model of a multi-branch structure according to claim 1, wherein the step 3 specifically comprises:
Step 3.1: for feature F1, upsampling features F2 and F3 to the same size as F1; concatenating the resampled features F2 and F3 with feature F1 along the channel dimension, the concatenated feature being expressed as: F11 = N1*(C1+C2+C3)*H1*W1, where the num value of feature F11 is N1, its number of channels is C1+C2+C3, and its size is H1*W1;
Step 3.2: for feature F2, upsampling or downsampling features F1 and F3 to the same size as F2; concatenating the resampled features F1 and F3 with feature F2 along the channel dimension, the concatenated feature being expressed as: F21 = N2*(C1+C2+C3)*H2*W2, where the num value of feature F21 is N2, its number of channels is C1+C2+C3, and its size is H2*W2;
Step 3.3: for feature F3, downsampling features F1 and F2 to the same size as F3; concatenating the resampled features F1 and F2 with feature F3 along the channel dimension, the concatenated feature being expressed as: F31 = N3*(C1+C2+C3)*H3*W3, where the num value of feature F31 is N3, its number of channels is C1+C2+C3, and its size is H3*W3.
4. The method for constructing a multi-label image classification model of a multi-branch structure according to claim 1, wherein the step 4 specifically comprises:
Step 4.1: inputting the features F11, F21 and F31 into a coordinate attention network to obtain three groups of encoded outputs along the horizontal and vertical coordinate directions, corresponding respectively to features F11, F21 and F31;
Step 4.2: for feature F11, applying a 1×1 convolution function and a nonlinear activation function to the encoded outputs of feature F11 obtained in step 4.1 to generate an intermediate feature mapping f that encodes the spatial information of feature F11 in the horizontal and vertical directions;
Step 4.3: splitting the intermediate feature mapping f along the two spatial dimensions of the horizontal and vertical coordinate directions into two independent tensors fh, fw; applying a 1×1 convolution function to each of fh, fw to change their number of channels to that of the input feature; applying a sigmoid function to obtain attention weights; and multiplying the attention weights with the input feature to obtain the weighted feature F12;
Step 4.4: for features F21 and F31, repeating steps 4.2 and 4.3 to obtain the corresponding weighted features F22 and F32.
5. The method for constructing a multi-label image classification model of a multi-branch structure according to claim 1, wherein in step 5 the category prediction independently predicts, for each branch, which classes in the category space the corresponding features belong to; each branch independently predicts a matrix of size (batch_size, num_classes), where batch_size is the number of images input each time and num_classes is the total number of label categories of the data set; zero is used as the threshold, and when a prediction score is greater than zero the input image contains the category, otherwise it does not contain the category.
6. The method according to claim 1, wherein in step 6 the category-by-category selection operates on the (batch_size, num_classes) matrices obtained by the independent prediction of each branch: for each category, the maximum of the prediction scores for that category across the three branch results is selected via the max function as the whole network's score for that category on the input image; repeating this for all categories in the category space yields the final prediction result for the input image.
7. A multi-label image classification model construction apparatus with a multi-branch structure, characterized by comprising:
the system comprises a determining module, a judging module and a judging module, wherein the determining module is used for determining an original data set and dividing the original data set according to a preset proportion to obtain a training set and a testing set, and the training set and the testing set comprise real labels corresponding to images;
a feature extraction module for inputting the training set into a feature extraction network to obtain features F1, F2 and F3 from different parts of the feature extraction network;
a feature fusion module for taking the features F1, F2 and F3 respectively as the input features of three branches L1, L2 and L3 and, with each branch's own input feature as the main input feature and the input features of the other branches as auxiliary input features, performing feature fusion respectively to obtain fused features F11, F21 and F31;
a weighting module for inputting the fused features F11, F21 and F31 into a coordinate attention network to obtain the weighted features F12, F22 and F32;
a class prediction module for performing category prediction separately on the weighted features F12, F22 and F32 to obtain the prediction scores of all categories;
a category-by-category selection prediction result module for selecting the maximum value as the final prediction score of the category from the prediction scores category by category to obtain the prediction result of the input image;
and the model training module is used for comparing the prediction result with the real label of the image to obtain a Loss value, performing back propagation to update the network parameters, finishing training in a preset training batch to obtain a multi-label image classification model with a multi-branch structure, and the multi-label image classification model with the multi-branch structure is used for multi-label image classification.
8. A multi-label image classification method, characterized in that, the image to be classified is input into the multi-label image classification model with multi-branch structure as claimed in any one of claims 1 to 7, and the multi-label classification result is output.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of the method for constructing a multi-labeled image classification model of a multi-branch structure according to any one of claims 1 to 6 or implements the steps of the method for classifying a multi-labeled image according to claim 8.
10. A computer-readable storage medium for storing program instructions executable by a processor to perform the steps of the multi-label image classification model construction method of a multi-branch structure according to any one of claims 1 to 6 or to perform the steps of the multi-label image classification method according to claim 8.
CN202210021186.3A 2022-01-10 2022-01-10 Multi-label image classification method and model construction method and device for multi-branch structure Pending CN114528911A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210021186.3A CN114528911A (en) 2022-01-10 2022-01-10 Multi-label image classification method and model construction method and device for multi-branch structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210021186.3A CN114528911A (en) 2022-01-10 2022-01-10 Multi-label image classification method and model construction method and device for multi-branch structure

Publications (1)

Publication Number Publication Date
CN114528911A true CN114528911A (en) 2022-05-24

Family

ID=81620474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210021186.3A Pending CN114528911A (en) 2022-01-10 2022-01-10 Multi-label image classification method and model construction method and device for multi-branch structure

Country Status (1)

Country Link
CN (1) CN114528911A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679552A (en) * 2017-09-11 2018-02-09 北京飞搜科技有限公司 A kind of scene classification method and system based on multiple-limb training
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN111753966A (en) * 2020-07-02 2020-10-09 成都睿码科技有限责任公司 Implementation method for implementing multi-label model training framework by using missing multi-label data
CN113657425A (en) * 2021-06-28 2021-11-16 华南师范大学 Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN113807362A (en) * 2021-09-03 2021-12-17 西安电子科技大学 Image classification method based on interlayer semantic information fusion deep convolutional network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679552A (en) * 2017-09-11 2018-02-09 北京飞搜科技有限公司 A kind of scene classification method and system based on multiple-limb training
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN111753966A (en) * 2020-07-02 2020-10-09 成都睿码科技有限责任公司 Implementation method for implementing multi-label model training framework by using missing multi-label data
CN113657425A (en) * 2021-06-28 2021-11-16 华南师范大学 Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN113807362A (en) * 2021-09-03 2021-12-17 西安电子科技大学 Image classification method based on interlayer semantic information fusion deep convolutional network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhao Haiying; Zhou Wei; Hou Xiaogang; Qi Guanglei: "Semantic understanding of traditional ethnic costume pattern images via multi-label classification", Optics and Precision Engineering, no. 03 *

Similar Documents

Publication Publication Date Title
CN110837836B (en) Semi-supervised semantic segmentation method based on maximized confidence
Wang et al. Cliffnet for monocular depth estimation with hierarchical embedding loss
US9323886B2 (en) Performance predicting apparatus, performance predicting method, and program
CN112580782B (en) Channel-enhanced dual-attention generation countermeasure network and image generation method
CN110837846A (en) Image recognition model construction method, image recognition method and device
CN111723674A (en) Remote sensing image scene classification method based on Markov chain Monte Carlo and variation deduction and semi-Bayesian deep learning
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN107392919A (en) Gray threshold acquisition methods, image partition method based on self-adapted genetic algorithm
Jindal et al. Offline handwritten Gurumukhi character recognition system using deep learning
CN110599502A (en) Skin lesion segmentation method based on deep learning
CN115966010A (en) Expression recognition method based on attention and multi-scale feature fusion
CN113011243A (en) Facial expression analysis method based on capsule network
Sun et al. Adaptive activation thresholding: Dynamic routing type behavior for interpretability in convolutional neural networks
Huynh et al. Joint age estimation and gender classification of Asian faces using wide ResNet
CN114219049B (en) Fine-grained curbstone image classification method and device based on hierarchical constraint
CN115035341A (en) Image recognition knowledge distillation method capable of automatically selecting student model structure
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN111753995A (en) Local interpretable method based on gradient lifting tree
CN116912268A (en) Skin lesion image segmentation method, device, equipment and storage medium
CN116129189A (en) Plant disease identification method, plant disease identification equipment, storage medium and plant disease identification device
CN114528911A (en) Multi-label image classification method and model construction method and device for multi-branch structure
CN114581789A (en) Hyperspectral image classification method and system
CN115205877A (en) Irregular typesetting invoice document layout prediction method and device and storage medium
CN114997175A (en) Emotion analysis method based on field confrontation training
Zhou et al. Research on knowledge distillation algorithm based on Yolov5 attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned
AD01 Patent right deemed abandoned

Effective date of abandoning: 20240517