CN114528911A - Multi-label image classification method and model construction method and device for multi-branch structure - Google Patents


Info

Publication number
CN114528911A
CN202210021186.3A (application) · CN114528911A (publication)
Authority
CN
China
Prior art keywords
feature
branch
input
category
prediction
Prior art date
Legal status
Pending
Application number
CN202210021186.3A
Other languages
Chinese (zh)
Inventor
范建平
雷俊婷
赵万青
彭进业
张晓丹
杨文静
王珺
Current Assignee
Northwest University
Original Assignee
Northwest University
Priority date
Filing date
Publication date
Application filed by Northwest University
Priority to CN202210021186.3A
Publication of CN114528911A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-label image classification method with a multi-branch structure, together with a model construction method and device. Because a multi-label image contains several semantic objects of different sizes and characteristics, the method takes features from different parts of a feature extraction network as the input features of different branches. Each branch performs feature fusion with its own feature as the main component and the features of the other branches as auxiliary components; the fused feature is input into an attention network for further feature extraction, yielding the final feature used for prediction, and each branch then predicts independently. For every category, the maximum of the branch prediction scores is selected as the whole network's predicted value for that category, giving the network's final prediction for the input sample. The method overcomes the inability of the prior art to predict all semantic targets in a multi-label image comprehensively and accurately, and effectively improves the accuracy of multi-label image classification.

Description

Multi-label image classification method and model construction method and device for multi-branch structure
Technical Field
The invention belongs to the technical field of image classification, and in particular relates to a multi-label image classification method with a multi-branch structure, together with a model construction method and device.
Background
The image classification task underpins many vision tasks. In real life, an image is often composed of several objects — a single image may contain, for example, a person, a dog and a cat — so multi-label images are closer to practice, and such an image usually contains several semantic objects of different sizes. Approaches to the multi-label image classification problem fall into traditional machine learning algorithms and deep learning algorithms.
Traditional machine learning algorithms mainly follow two ideas: (1) problem transformation — the multi-label image classification problem is regarded as several single-label classification problems, and several classifiers are trained to perform single-label classification repeatedly; (2) algorithm adaptation — instead of converting the multi-label problem into known single-label problems, an algorithm suited to multi-label image classification is designed directly from the characteristics of the images.
With the continued development of deep learning, many deep learning algorithms have been applied to multi-label image classification: the strong nonlinear representation capability of neural networks lets them learn effective features from large-scale data and thereby improve classification accuracy. Building on the BING objectness measure proposed by Cheng Ming-Ming et al. in 2014, Wei Yunchao et al. proposed the Hypotheses-CNN-Pooling (HCP) framework: several candidate regions are extracted from each input picture, each candidate region is fed into a CNN for classification training and produces a c-dimensional prediction, and max pooling yields the final classification result. This method extracts many hypotheses, but generating many candidate regions per picture and training a CNN on each of them incurs a large amount of computation. A CNN has strong nonlinear representation capability, while an RNN can model the association between images and labels; Jiang Wang et al. proposed a joint CNN-RNN network structure in 2016, with the CNN extracting image features and the RNN modelling label dependencies. The method considers correlation between categories and works well for large targets and objects with dependencies, but poorly for small targets and objects without dependencies, and it cannot reliably recognize several targets of different sizes. In the same year, Zhang J et al. added a Regional LSTM module to the CNN-RNN structure; it guides the features obtained by the CNN, obtains the position information of the corresponding features, and further models the dependencies among features, positions and labels.
Graph convolutional networks have also been applied to multi-label image classification; Chen Z M et al. introduced graph convolution into multi-label classification in 2019. Such methods mainly exploit the dependency between features and labels; when the features have no such dependency and the targets differ in size and level of abstraction, suitable features must be selected in a targeted manner, according to the characteristics of each target, to perform category prediction.
For the multi-label image classification problem, new algorithms continue to emerge. Existing methods, however, do not fully exploit the classification advantages of different features for different semantic targets, so the accuracy of multi-label image classification still needs to be improved.
Disclosure of Invention
To address the defects and shortcomings of the prior art, the invention provides a multi-label image classification method with a multi-branch structure, together with a model construction method and device. It solves problems of the prior art such as the difficulty of predicting all semantic targets in an input image comprehensively — small targets in particular are often ignored — and makes full use of the characteristics of different features, thereby improving the accuracy of multi-label image classification.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-label image classification model construction method of a multi-branch structure comprises the following steps:
step 1: dividing an original data set according to a preset proportion to obtain a training set and a testing set, wherein the training set and the testing set comprise real labels corresponding to each image;
Step 2: inputting the training set into a feature extraction network, and obtaining features F1, F2 and F3 from different positions of the feature extraction network;
Step 3: taking the features F1, F2 and F3 obtained in step 2 as the input features of three branches L1, L2 and L3 respectively; each branch performs feature fusion with its own input feature as the main component and the input features of the other branches as auxiliary components, yielding fused features F11, F21 and F31;
Step 4: inputting the fused features F11, F21 and F31 obtained in step 3 into a coordinate-attention network to obtain the weighted features F12, F22 and F32;
Step 5: performing category prediction on the weighted features F12, F22 and F32 obtained in step 4 respectively, to obtain prediction scores for all categories;
Step 6: from the prediction scores obtained in step 5, selecting the maximum value category by category as the final prediction score of each category, to obtain the prediction result for the input image;
Step 7: comparing the prediction result obtained in step 6 with the real labels of the image to obtain a loss value, performing back propagation to update the network parameters, and completing training within a preset number of training batches to obtain the multi-branch multi-label image classification model; the test set is input into the trained network to obtain the corresponding classification accuracy. The multi-branch multi-label image classification model is used for multi-label image classification.
The invention also comprises the following technical characteristics:
optionally, the feature extraction network in step 2 is a ResNet101 network, and the ResNet101 network structure is determined according toThe sub-division is 6 parts: conv1, Conv2_ x, Conv3_ x, Conv4_ x, Conv5_ x, FC, respectively, feature F being introduced at Conv3_ x, Conv4_ x outputs1、F2For acting as branch L1、L2And adding the SPP block at the output of Conv5_ x, and then applying the SPP block to the input feature F3As branch L3The input feature of (1).
Optionally, step 3 specifically includes:
Step 3.1: for feature F1, upsampling is used to make the sizes of features F2 and F3 consistent with F1; the resized features F2 and F3 are then concatenated with feature F1 along the channel dimension, and the concatenated feature is F11 = N1 × (C1+C2+C3) × H1 × W1, i.e., feature F11 has num value N1, channel count C1+C2+C3 and spatial size H1 × W1.
Step 3.2: for feature F2, upsampling or downsampling is used to make the sizes of features F1 and F3 consistent with F2; the resized features F1 and F3 are then concatenated with feature F2 along the channel dimension, and the concatenated feature is F21 = N2 × (C1+C2+C3) × H2 × W2, i.e., feature F21 has num value N2, channel count C1+C2+C3 and spatial size H2 × W2.
Step 3.3: for feature F3, downsampling is used to make the sizes of features F1 and F2 consistent with F3; the resized features F1 and F2 are then concatenated with feature F3 along the channel dimension, and the concatenated feature is F31 = N3 × (C1+C2+C3) × H3 × W3, i.e., feature F31 has num value N3, channel count C1+C2+C3 and spatial size H3 × W3.
Optionally, step 4 specifically includes:
Step 4.1: inputting features F11, F21 and F31 into the coordinate-attention network, and obtaining for each of them the corresponding encoded outputs along the horizontal and vertical coordinate directions;
Step 4.2: for feature F11, applying a 1×1 convolution function and a nonlinear activation function to the encoded outputs obtained in step 4.1, generating an intermediate feature map that encodes the spatial information of feature F11 in the horizontal and vertical directions;
Step 4.3: splitting the intermediate feature map f along the horizontal and vertical spatial dimensions into two independent tensors f^h and f^w; using 1×1 convolution functions to restore the channel counts of f^h and f^w to that of the input feature; applying a sigmoid function to produce the attention weights; and multiplying the attention weights with the input feature to obtain the weighted feature F12;
Step 4.4: repeating steps 4.2 and 4.3 for features F21 and F31 to obtain the corresponding weighted features F22 and F32.
Optionally, in step 5, category prediction independently predicts, for each branch, which categories in the category space the corresponding feature belongs to. Each branch's independent prediction yields a matrix of size (batch_size, num_classes), where batch_size is the number of input images per batch and num_classes is the total number of label categories of the data set. With zero as the threshold, a prediction score greater than zero means the input image contains the category; otherwise it does not.
Optionally, in step 6, given the (batch_size, num_classes) matrix obtained by each branch's independent prediction, for each category the max function selects the maximum of the three branches' prediction scores for that category as the whole network's score for that category on the input image; repeating this for all categories in the category space yields the final prediction result for the input image.
A multi-label image classification model building device with a multi-branch structure comprises:
a determining module, for determining an original data set and dividing it according to a preset proportion to obtain a training set and a test set, each containing the real labels of the images;
a feature extraction module, for inputting the training set into the feature extraction network and obtaining features F1, F2 and F3 from different parts of the feature extraction network;
a feature fusion module, for taking the features F1, F2 and F3 as the input features of three branches L1, L2 and L3 respectively, each branch performing feature fusion with its own input feature as the main component and the input features of the other branches as auxiliary components, yielding fused features F11, F21 and F31;
a weighting module, for inputting the fused features F11, F21 and F31 into a coordinate-attention network to obtain the weighted features F12, F22 and F32;
a category prediction module, for performing category prediction on the weighted features F12, F22 and F32 respectively, to obtain prediction scores for all categories;
a category-by-category prediction-result selection module, for selecting the maximum prediction score category by category as the final prediction score of each category, to obtain the prediction result for the input image;
and a model training module, for comparing the prediction result with the real labels of the image to obtain a loss value, performing back propagation to update the network parameters, and completing training within a preset number of training batches to obtain the multi-branch multi-label image classification model, which is used for multi-label image classification.
A multi-label image classification method inputs an image to be classified into a multi-label image classification model with a multi-branch structure and outputs a multi-label classification result.
A computer device comprising a memory and a processor, wherein the memory stores a computer program, and wherein the processor executes the computer program to implement the steps of the multi-label image classification model construction method for a multi-branch structure or the steps of the multi-label image classification method.
A computer readable storage medium for storing program instructions executable by a processor to perform the steps of the multi-label image classification model construction method of the multi-branch structure or to perform the steps of the multi-label image classification method.
Compared with the prior art, the invention has the beneficial technical effects that:
the invention utilizes the characteristics that the characteristics of different parts of the characteristic extraction network have different characteristics, for example, the characteristics of a lower layer have more detail information, the characteristics of a higher layer have more semantic information, the characteristics of different characteristics can be used for processing corresponding characteristics in a targeted manner, so that a multi-branch structure is provided, information fusion that each branch takes the characteristic of the branch as the main part and other branch characteristics as auxiliary parts is further realized through characteristic fusion operation, the characteristic is further extracted in a weighted manner by the Attention network, a plurality of branches are independently predicted, and finally, the branch result with the best effect aiming at the type is selected as the prediction result of the whole network, thereby effectively improving the classification accuracy of the image on the whole.
Drawings
Fig. 1 is a flowchart illustrating a multi-label image classification method with a multi-branch structure according to the present invention.
Detailed Description
The invention provides a multi-label image classification method with a multi-branch structure, together with a model construction method and device, designed around the characteristics of multi-label images: an ordinary image contains several semantic objects, and these objects differ in size and other respects. The branches predict independently, and the best-performing result is finally selected, effectively improving the overall classification accuracy of the network.
Because a multi-label image contains several semantic objects of different sizes and characteristics, on top of a conventional CNN feature extraction network, features at different positions of the network are used as the input features of subsequent branches; each branch performs the feature fusion operation; the fused feature is input into the attention network to obtain the final feature used for prediction; each branch predicts independently; and for each category the best-performing branch result is selected as the prediction value of the whole network for that category, finally obtaining the prediction of the whole network for the input sample. Specifically:
the feature extraction network utilizes the traditional ResNet network to extract image features, and selects the features at different positions of the feature extraction network as subsequent network branches L1、L2And L3The input feature of (1).
Feature fusion first uses upsampling or downsampling to make the other branches' features the same size as the current branch's feature, then fuses the several same-size features with a concatenate operation. Each branch repeats this fusion, taking its own feature as the main body and the other branches' features as supplements, so that every branch keeps its own characteristics while carrying more comprehensive target information.
The attention network further weights and extracts the fused features: the coordinate-attention method is applied to the fused features, a weighting scheme that attends not only to the channel direction but also to the precise position information of the features.
Category-by-category selection of the best branch result: after the branches predict independently, the max function selects the maximum predicted value for each category among the multi-branch prediction results, improving the overall classification accuracy of the network.
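The flow just summarized — independent per-branch prediction followed by a category-wise max — can be sketched in NumPy. This is an illustrative stand-in, not the patent's implementation: the classifier weights are random, the branch feature shapes are typical ResNet101-like values, and global average pooling plus a linear layer stands in for each branch's prediction head.

```python
import numpy as np

def predict_branch(feat, w_cls):
    """Global-average-pool an (N, C, H, W) feature, then apply a linear classifier."""
    pooled = feat.mean(axis=(2, 3))          # (N, C)
    return pooled @ w_cls                    # (N, num_classes)

rng = np.random.default_rng(0)
num_classes = 5
# Three branch features with illustrative channel counts and spatial sizes.
feats = [rng.standard_normal((2, c, s, s)) for c, s in [(512, 28), (1024, 14), (2048, 7)]]
scores = [predict_branch(f, rng.standard_normal((f.shape[1], num_classes)) * 0.01)
          for f in feats]
final = np.maximum.reduce(scores)            # category-wise max over the branches
labels = final > 0                           # zero threshold: class present or not
print(final.shape, labels.shape)
```

Each branch produces its own (batch_size, num_classes) score matrix; taking the element-wise maximum lets whichever branch sees a target best decide that category's score.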
One embodiment of the present invention provides a method for constructing a multi-label image classification model with a multi-branch structure, as shown in fig. 1, including the following steps:
step 1, dividing an original data set according to a preset proportion to obtain a training set and a testing set, wherein the training set and the testing set comprise real labels corresponding to each image; the original dataset is a public dataset commonly used for multi-label image classification; in this example, the COCO, VOC2007, Flicker25k data sets were used, and these public data sets were already classified into training and testing sets, for example, if they were not directly classified into training and testing sets, the data sets were classified into training and testing sets according to a ratio of 8: 2.
Step 2: inputting the training set into the feature extraction network, and obtaining features F1, F2 and F3 from different positions of the feature extraction network.
The feature extraction network is implemented with a ResNet101 network, and output features are drawn from different positions of the ResNet101 network as the input features of branches L1, L2 and L3. A ResNet has two basic blocks: BasicBlock, composed of two 3×3 convolution blocks plus an identity mapping, and Bottleneck, composed of 1×1, 3×3 and 1×1 convolution blocks plus an identity mapping. The ResNet101 structure is divided sequentially into six parts: Conv1, Conv2_x, Conv3_x, Conv4_x, Conv5_x and FC. Features F1 and F2 are drawn from the outputs of Conv3_x and Conv4_x as the input features of branches L1 and L2; an SPP block is added at the output of Conv5_x, and its output feature F3 serves as the input feature of branch L3. ResNet101 uses the Bottleneck basic block: Conv1 is a convolutional layer with kernel size 7; Conv2_x applies max pooling with kernel size 3 to the output of Conv1, followed by 3 Bottleneck blocks; and Conv3_x, Conv4_x and Conv5_x consist of 4, 23 and 3 Bottleneck blocks respectively.
Step 3: taking the features F1, F2 and F3 obtained in step 2 as the input features of the three branches L1, L2 and L3 respectively; each branch performs feature fusion with its own input feature as the main component and the input features of the other branches as auxiliary components, yielding fused features F11, F21 and F31.
Feature fusion is an operation repeated on each branch, comprising two stages. In the first stage, taking the current branch's input feature size (Hi × Wi) as the standard, the other branches' input features are upsampled or downsampled to that same size (Hi × Wi). In the second stage, concatenate splicing is applied to the results of the first stage to obtain the fused features F11, F21 and F31. Concretely, the feature fusion module is applied to each branch in turn: with the branch's own feature as the basis, upsampling uses nearest-neighbor interpolation and downsampling uses a 3×3 convolution to bring the other branches' features to the same size as the branch's feature, after which they are spliced and fused.
The specific steps of step 3 include:
Step 3.1: for feature F1, upsampling is used to make the sizes of features F2 and F3 consistent with F1, and the resized F2 and F3 are concatenated with F1. Let F1 = (N1, C1, H1, W1), F2 = (N2, C2, H2, W2) and F3 = (N3, C3, H3, W3), where Ni, Ci, Hi and Wi are the num value, channel count, height and width of feature Fi. Concatenation is performed along the channel dimension, and the concatenated feature is F11 = N1 × (C1+C2+C3) × H1 × W1, i.e., feature F11 has num value N1, channel count C1+C2+C3 and spatial size H1 × W1.
Step 3.2: for feature F2, upsampling or downsampling is used to make the sizes of features F1 and F3 consistent with F2; the resized F1 and F3 are concatenated with F2 along the channel dimension, and the concatenated feature is F21 = N2 × (C1+C2+C3) × H2 × W2, i.e., feature F21 has num value N2, channel count C1+C2+C3 and spatial size H2 × W2.
Step 3.3: for feature F3, downsampling is used to make the sizes of features F1 and F2 consistent with F3; the resized F1 and F2 are concatenated with F3 along the channel dimension, and the concatenated feature is F31 = N3 × (C1+C2+C3) × H3 × W3, i.e., feature F31 has num value N3, channel count C1+C2+C3 and spatial size H3 × W3.
Step 4: inputting the fused features F11, F21 and F31 obtained in step 3 into the coordinate-attention network to obtain the features F12, F22 and F32 weighted by the coordinate-attention network.
The specific steps of step 4 include:
Step 4.1: inputting features F11, F21 and F31 into the coordinate-attention network, obtaining for each of them the corresponding encoded outputs along the horizontal and vertical coordinate directions. Specifically, each channel of feature F11, of size (H1, W1), is first encoded along the horizontal and vertical coordinate directions. At height h, the output of the c-th channel is:

z_c^h(h) = (1/W) · Σ_{0 ≤ i < W} x_c(h, i)

where z_c^h(h) is the encoded output of the c-th channel of feature F11 along the horizontal coordinate direction, h is the height, W is the width of feature F11, i is a variable with 0 ≤ i < W, and x_c(h, i) varies with i. At width w, the output of the c-th channel is:

z_c^w(w) = (1/H) · Σ_{0 ≤ j < H} x_c(j, w)

where z_c^w(w) is the encoded output of the c-th channel of feature F11 along the vertical coordinate direction, w is the width, H is the height of feature F11, j is a variable with 0 ≤ j < H, and x_c(j, w) varies with j. The encoded outputs of all channels along the horizontal coordinate direction are then concatenated to obtain z^h, and z^w is obtained in the same way.
Step 4.2: for feature F111 × 1 convolution functions and non-linesThe sexual activation function acting on the coded output z obtained in step 4.1h,zwTo generate a feature F11Spatial information of (2) intermediate feature mapping for encoding in horizontal and vertical directions
Figure BDA0003462661800000084
f∈R(C/r×(H+W))Wherein z ish,zwRespectively the coded outputs spliced along the horizontal direction and the vertical direction, F is a convolution function of 1 multiplied by 1, delta is a nonlinear activation function, r represents a down-sampling proportion, and C is a characteristic F11The number of channels, H, W being respectively characteristic F11Length and width.
Step 4.3: the intermediate feature mapping f is segmented into two independent tensors f along two spatial dimensions of the horizontal coordinate direction and the vertical coordinate directionh,fwTensor f using 1 × 1 convolution function respectivelyh,fwThe number of channels is changed into the number of channels which is the same as the number of channels of the input characteristic, the number of channels is acted by a sigmod function and then used as attention weight, and the attention weight is multiplied by the input characteristic to obtain weighted characteristic F12(ii) a Specifically, the attention weights of the two directions are ghAnd gwMultiplying the weight by the input feature to obtain a weighted feature F12Comprises the following steps: f12=F11×gh×gw
Step 4.4: for feature F21And F31Respectively repeating the step 4.2 and the step 4.3 to respectively obtain the characteristic F21And F31Corresponding weighted feature F22And F32
And 5: weighted feature F obtained from step 412、F22And F32Respectively carrying out category prediction to obtain prediction scores of all categories; the label of the image is one or more of the categories;
the category prediction is an independent prediction by each branch of which classes in the category space the features belong to, yielding a matrix of size (batch_size, num_classes), where batch_size is the number of pictures input into the network each time and num_classes is the total number of label categories corresponding to the data set; zero is used as the threshold: if a prediction score is greater than zero, the input image contains the category, otherwise it does not.
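The zero-threshold decision of step 5 amounts to the following rule (a trivial pure-Python sketch; the function and the class names are illustrative):

```python
def labels_from_scores(scores, class_names):
    """A label is predicted present exactly when its raw score exceeds zero,
    which is equivalent to sigmoid(score) > 0.5 on the probability scale."""
    return [name for name, s in zip(class_names, scores) if s > 0]
```

For example, scores [1.2, -0.5, 0.3] over ("cat", "dog", "car") would yield the multi-label prediction ["cat", "car"].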
Step 6: selecting the maximum value as the final prediction score of the category from the prediction scores obtained in the step 5 category by category to obtain the prediction result of the input image;
the category-by-category selection operates on the (batch_size, num_classes) matrices obtained by the independent prediction of each branch: for each category, the maximum of the prediction scores for that category across the three branch results is selected via the max function as the whole network's score for that category on the input image. Repeating this operation for all categories in the category space yields the final prediction result of the whole network for the input image.
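The per-category max fusion of step 6 can be sketched as follows (illustrative names; in a framework implementation this would be an element-wise max over the stacked branch outputs):

```python
def fuse_branch_scores(branch_scores):
    """branch_scores: one score list per branch, all of length num_classes.
    Returns, for each class, the maximum score across the branches."""
    return [max(per_class) for per_class in zip(*branch_scores)]
```

With three branches scoring two classes as [0.2, -1.0], [0.5, -3.0] and [0.1, 0.4], the fused network-level scores are [0.5, 0.4].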
And 7: and (4) comparing the prediction result obtained in the step (6) with the real label of the image to obtain a Loss value, performing back propagation to update network parameters, completing training in a preset training batch, and obtaining the multi-label image classification model with a multi-branch structure.
Specifically, the Loss value is calculated with the BCEWithLogitsLoss function, which combines a Sigmoid layer and a BCELoss layer; assuming the network has N batches, each predicting n labels, BCEWithLogitsLoss is computed as:
Loss={l1,…,lN}
ln=-[yn·log(δ(xn))+(1-yn)·log(1-δ(xn))]
where δ (x)n) For the Sigmoid function, the interval for mapping input X to (0, 1) is calculated as:
Figure BDA0003462661800000091
xnto predict the score, ynIs a real label.
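The per-label loss above can be reproduced directly from the formula (a pure-Python sketch of one term ln; PyTorch's `BCEWithLogitsLoss` computes the same quantity in a numerically stable form):

```python
import math

def bce_with_logits(x_n, y_n):
    """l_n = -[y·log(δ(x)) + (1-y)·log(1-δ(x))] with δ the sigmoid."""
    p = 1.0 / (1.0 + math.exp(-x_n))  # δ(x_n): maps the raw score into (0, 1)
    return -(y_n * math.log(p) + (1.0 - y_n) * math.log(1.0 - p))
```

At the decision threshold x = 0 the sigmoid gives 0.5, so the loss equals ln 2 ≈ 0.693 regardless of the label, consistent with the zero-threshold rule of step 5.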
In an embodiment of the present invention, a multi-label image classification model building apparatus with a multi-branch structure is provided, including:
the determining module is used for determining an original data set and dividing the original data set according to a preset proportion to obtain a training set and a testing set, wherein the training set and the testing set comprise real labels corresponding to each image;
a feature extraction module for inputting the training set into a feature extraction network to obtain features F1, F2 and F3 from different parts of the feature extraction network; the feature extraction network is implemented with a ResNet101 network, whose structure is divided sequentially into 6 parts: Conv1, Conv2_x, Conv3_x, Conv4_x, Conv5_x and FC; features F1 and F2 are taken from the outputs of Conv3_x and Conv4_x to serve as the input features of branches L1 and L2 respectively, an SPP block is added at the output of Conv5_x, and the feature F3 obtained after applying the SPP block serves as the input feature of branch L3.
a feature fusion module for taking the features F1, F2 and F3 respectively as the input features of the three branches L1, L2 and L3 and, with each branch's own input feature as the main input feature and the input features of the other branches as auxiliary input features, performing feature fusion respectively to obtain the fused features F11, F21 and F31; specifically, for feature F1, features F2 and F3 are upsampled to the same size as F1, and the resampled features F2 and F3 are concatenated with feature F1; for feature F2, features F1 and F3 are upsampled or downsampled to the same size as F2 and concatenated; for feature F3, features F1 and F2 are downsampled to the same size as F3 and concatenated.
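The fusion performed by this module (resize every auxiliary feature to the main branch's spatial size, then concatenate along the channel dimension) can be sketched as follows. Nearest-neighbour resampling stands in for the up/down-sampling, and the names are illustrative:

```python
import numpy as np

def resize_nn(x, H, W):
    """Nearest-neighbour spatial resize of a (C, h, w) feature map."""
    C, h, w = x.shape
    rows = np.arange(H) * h // H  # map each target row to a source row
    cols = np.arange(W) * w // W  # map each target column to a source column
    return x[:, rows][:, :, cols]

def fuse(main, others):
    """Step-3-style fusion: resize auxiliary features to the main feature's
    spatial size and concatenate all of them along the channel dimension."""
    resized = [resize_nn(o, main.shape[1], main.shape[2]) for o in others]
    return np.concatenate([main] + resized, axis=0)
```

The fused feature keeps the main branch's spatial size while its channel count becomes the sum C1+C2+C3, matching the concatenated shapes given in the claims.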
a weighting module for inputting the fused features F11, F21 and F31 into a coordinate attention network to obtain the weighted features F12, F22 and F32; specifically, the features F11, F21 and F31 are all input into the coordinate attention network to obtain three corresponding groups of encoded outputs along the horizontal and vertical coordinate directions; the encoded outputs of all channels of feature F11 along the horizontal coordinate direction are concatenated to obtain zh, and zw is obtained in the same way; a 1×1 convolution function and a nonlinear activation function are applied to the encoded outputs zh, zw to generate an intermediate feature mapping f that encodes the spatial information of feature F11 in the horizontal and vertical directions; the intermediate feature mapping f is then split along the two spatial dimensions of the horizontal and vertical coordinate directions into two independent tensors fh, fw; a 1×1 convolution function is applied to each of fh, fw to change their number of channels to that of the input feature, a sigmoid function is applied to yield attention weights, and the attention weights are multiplied with the input feature to obtain the weighted feature F12; the above operations are repeated to obtain the weighted features F22 and F32 corresponding to features F21 and F31;
a class prediction module for performing category prediction separately on the weighted features F12, F22 and F32 to obtain the prediction scores of all categories; specifically, the category prediction is an independent prediction by each branch of which classes in the category space the features belong to, yielding a matrix of size (batch_size, num_classes), where batch_size is the number of pictures input into the network each time and num_classes is the total number of label categories corresponding to the data set; zero is used as the threshold: if a prediction score is greater than zero, the input image contains the category, otherwise it does not;
a category-by-category selection prediction result module for selecting, for each category, the maximum value among the prediction scores as the final prediction score of that category to obtain the prediction result of the input image; specifically, the category-by-category selection operates on the (batch_size, num_classes) matrices obtained by the independent prediction of each branch: for each category, the maximum of the prediction scores for that category across the three branch results is selected via the max function as the whole network's score for that category on the input image; repeating this for all categories in the category space yields the final prediction result of the whole network for the input image;
a model training module for comparing the prediction result with the real labels of the image to obtain a Loss value, performing back propagation to update the network parameters, and completing training within a preset number of training batches to obtain the multi-label image classification model with a multi-branch structure, which is used for multi-label image classification; specifically, the Loss value is calculated with the BCEWithLogitsLoss function, which combines a Sigmoid layer and a BCELoss layer; assuming the network has N batches, each predicting n labels, BCEWithLogitsLoss is computed as:
Loss={l1,…,lN}
ln=-[yn·log(δ(xn))+(1-yn)·log(1-δ(xn))]
where δ (x)n) For the Sigmoid function, the interval for mapping input X to (0, 1) is calculated as:
Figure BDA0003462661800000101
xnto predict the score, ynIs a real label.
In one embodiment, a multi-label image classification method is provided, wherein an image to be classified is input into the constructed multi-label image classification model with the multi-branch structure, and a multi-label classification result is output.
In one embodiment, a computer device is provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the multi-label image classification model building method of the multi-branch structure or the steps of the multi-label image classification method of the above embodiments when executing the computer program.
In one embodiment, a computer readable storage medium is provided for storing program instructions executable by a processor to implement the steps of the multi-label image classification model construction method of the multi-branch structure or the steps of the multi-label image classification method of the above embodiments.
The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment the computer program product is embodied as a computer storage medium, in another alternative embodiment the computer program product is embodied as a software product or the like.
Each functional unit in each embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a volatile or non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the present solution or a part of the solution that substantially contributes to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiment provides an accuracy verification experiment for multi-label image classification:
TABLE 1

              CNN-RNN   RLSD   DELTA   ResNet101   Multi-branch (Ours)
COCO           61.2     65.9   71.3     81.76       82.26
VOC2007        84.0     87.5   90.3     91.26       91.31
Flicker25k      -        -      -       79.15       80.23
Table 1 compares the prediction accuracy of the multi-label image classification method with a multi-branch structure of the present invention against existing classification methods on the same data sets. The method provided herein is denoted Multi-branch (Ours); compared with the existing multi-label image classification methods, it achieves higher accuracy.

Claims (10)

1. A multi-label image classification model construction method of a multi-branch structure is characterized by comprising the following steps:
step 1: dividing an original data set according to a preset proportion to obtain a training set and a testing set, wherein the training set and the testing set comprise real labels corresponding to each image;
Step 2: inputting the training set into a feature extraction network, and obtaining features F1, F2 and F3 from different positions of the feature extraction network;
Step 3: taking the features F1, F2 and F3 obtained in step 2 respectively as the input features of three branches L1, L2 and L3 and, with each branch's own input feature as the main input feature and the input features of the other branches as auxiliary input features, performing feature fusion respectively to obtain fused features F11, F21 and F31;
Step 4: inputting the fused features F11, F21 and F31 obtained in step 3 into a coordinate attention network to obtain the weighted features F12, F22 and F32;
Step 5: performing category prediction separately on the weighted features F12, F22 and F32 obtained in step 4 to obtain the prediction scores of all categories;
Step 6: for each category, selecting the maximum value among the prediction scores obtained in step 5 as the final prediction score of that category, to obtain the prediction result of the input image;
Step 7: comparing the prediction result obtained in step 6 with the real labels of the image to obtain a Loss value, performing back propagation to update the network parameters, completing training within a preset number of training batches to obtain the multi-label image classification model with a multi-branch structure, and inputting the test set into the trained network to obtain the corresponding classification accuracy; the multi-label image classification model with the multi-branch structure is used for multi-label image classification.
2. The method for constructing a multi-label image classification model of a multi-branch structure according to claim 1, characterized in that the feature extraction network in step 2 is a ResNet101 network, whose structure is divided sequentially into 6 parts: Conv1, Conv2_x, Conv3_x, Conv4_x, Conv5_x and FC; features F1 and F2 are taken from the outputs of Conv3_x and Conv4_x to serve as the input features of branches L1 and L2 respectively, an SPP block is added at the output of Conv5_x, and the feature F3 obtained after applying the SPP block serves as the input feature of branch L3.
3. The method for constructing a multi-label image classification model of a multi-branch structure according to claim 1, wherein the step 3 specifically comprises:
Step 3.1: for feature F1, upsampling features F2 and F3 to the same size as F1; concatenating the resampled features F2 and F3 with feature F1 along the channel dimension, the concatenated feature being expressed as: F11 = N1*(C1+C2+C3)*H1*W1, where the num value of feature F11 is N1, its number of channels is C1+C2+C3, and its size is H1*W1;
Step 3.2: for feature F2, upsampling or downsampling features F1 and F3 to the same size as F2; concatenating the resampled features F1 and F3 with feature F2 along the channel dimension, the concatenated feature being expressed as: F21 = N2*(C1+C2+C3)*H2*W2, where the num value of feature F21 is N2, its number of channels is C1+C2+C3, and its size is H2*W2;
Step 3.3: for feature F3, downsampling features F1 and F2 to the same size as F3; concatenating the resampled features F1 and F2 with feature F3 along the channel dimension, the concatenated feature being expressed as: F31 = N3*(C1+C2+C3)*H3*W3, where the num value of feature F31 is N3, its number of channels is C1+C2+C3, and its size is H3*W3.
4. The method for constructing a multi-label image classification model of a multi-branch structure according to claim 1, wherein the step 4 specifically comprises:
Step 4.1: inputting the features F11, F21 and F31 into a coordinate attention network to obtain three groups of encoded outputs along the horizontal and vertical coordinate directions, corresponding respectively to features F11, F21 and F31;
Step 4.2: for feature F11, applying a 1×1 convolution function and a nonlinear activation function to the encoded outputs of feature F11 obtained in step 4.1 to generate an intermediate feature mapping f that encodes the spatial information of feature F11 in the horizontal and vertical directions;
Step 4.3: splitting the intermediate feature mapping f along the two spatial dimensions of the horizontal and vertical coordinate directions into two independent tensors fh, fw; applying a 1×1 convolution function to each of fh, fw to change their number of channels to that of the input feature; applying a sigmoid function to obtain attention weights; and multiplying the attention weights with the input feature to obtain the weighted feature F12;
Step 4.4: for features F21 and F31, repeating steps 4.2 and 4.3 to obtain the corresponding weighted features F22 and F32.
5. The method for constructing a multi-label image classification model of a multi-branch structure according to claim 1, wherein in step 5 the category prediction independently predicts, for each branch, which classes in the category space the corresponding features belong to; each branch independently predicts a matrix of size (batch_size, num_classes), where batch_size is the number of images input each time and num_classes is the total number of label categories of the data set; zero is used as the threshold, and when a prediction score is greater than zero the input image contains the category, otherwise it does not contain the category.
6. The method according to claim 1, wherein in step 6 the category-by-category selection operates on the (batch_size, num_classes) matrices obtained by the independent prediction of each branch: for each category, the maximum of the prediction scores for that category across the three branch results is selected via the max function as the whole network's score for that category on the input image; repeating this for all categories in the category space yields the final prediction result for the input image.
7. A multi-label image classification model construction apparatus with a multi-branch structure, characterized by comprising:
the system comprises a determining module, a judging module and a judging module, wherein the determining module is used for determining an original data set and dividing the original data set according to a preset proportion to obtain a training set and a testing set, and the training set and the testing set comprise real labels corresponding to images;
a feature extraction module for inputting the training set into a feature extraction network to obtain features F1, F2 and F3 from different parts of the feature extraction network;
a feature fusion module for taking the features F1, F2 and F3 respectively as the input features of three branches L1, L2 and L3 and, with each branch's own input feature as the main input feature and the input features of the other branches as auxiliary input features, performing feature fusion respectively to obtain fused features F11, F21 and F31;
a weighting module for inputting the fused features F11, F21 and F31 into a coordinate attention network to obtain the weighted features F12, F22 and F32;
a class prediction module for performing category prediction separately on the weighted features F12, F22 and F32 to obtain the prediction scores of all categories;
a category-by-category selection prediction result module for selecting the maximum value as the final prediction score of the category from the prediction scores category by category to obtain the prediction result of the input image;
and the model training module is used for comparing the prediction result with the real label of the image to obtain a Loss value, performing back propagation to update the network parameters, finishing training in a preset training batch to obtain a multi-label image classification model with a multi-branch structure, and the multi-label image classification model with the multi-branch structure is used for multi-label image classification.
8. A multi-label image classification method, characterized in that, the image to be classified is input into the multi-label image classification model with multi-branch structure as claimed in any one of claims 1 to 7, and the multi-label classification result is output.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of the method for constructing a multi-labeled image classification model of a multi-branch structure according to any one of claims 1 to 6 or implements the steps of the method for classifying a multi-labeled image according to claim 8.
10. A computer-readable storage medium for storing program instructions executable by a processor to perform the steps of the multi-label image classification model construction method of a multi-branch structure according to any one of claims 1 to 6 or to perform the steps of the multi-label image classification method according to claim 8.
CN202210021186.3A 2022-01-10 2022-01-10 Multi-label image classification method and model construction method and device for multi-branch structure Pending CN114528911A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210021186.3A CN114528911A (en) 2022-01-10 2022-01-10 Multi-label image classification method and model construction method and device for multi-branch structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210021186.3A CN114528911A (en) 2022-01-10 2022-01-10 Multi-label image classification method and model construction method and device for multi-branch structure

Publications (1)

Publication Number Publication Date
CN114528911A true CN114528911A (en) 2022-05-24

Family

ID=81620474

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210021186.3A Pending CN114528911A (en) 2022-01-10 2022-01-10 Multi-label image classification method and model construction method and device for multi-branch structure

Country Status (1)

Country Link
CN (1) CN114528911A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679552A (en) * 2017-09-11 2018-02-09 北京飞搜科技有限公司 A kind of scene classification method and system based on multiple-limb training
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN111753966A (en) * 2020-07-02 2020-10-09 成都睿码科技有限责任公司 Implementation method for implementing multi-label model training framework by using missing multi-label data
CN113657425A (en) * 2021-06-28 2021-11-16 华南师范大学 Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN113807362A (en) * 2021-09-03 2021-12-17 西安电子科技大学 Image classification method based on interlayer semantic information fusion deep convolutional network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679552A (en) * 2017-09-11 2018-02-09 北京飞搜科技有限公司 A kind of scene classification method and system based on multiple-limb training
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN111753966A (en) * 2020-07-02 2020-10-09 成都睿码科技有限责任公司 Implementation method for implementing multi-label model training framework by using missing multi-label data
CN113657425A (en) * 2021-06-28 2021-11-16 华南师范大学 Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN113807362A (en) * 2021-09-03 2021-12-17 西安电子科技大学 Image classification method based on interlayer semantic information fusion deep convolutional network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhao Haiying; Zhou Wei; Hou Xiaogang; Qi Guanglei: "Semantic understanding of traditional ethnic costume pattern images via multi-label classification", Optics and Precision Engineering, no. 03 *

Similar Documents

Publication Publication Date Title
CN110837836B (en) Semi-supervised semantic segmentation method based on maximized confidence
Wang et al. Cliffnet for monocular depth estimation with hierarchical embedding loss
US9323886B2 (en) Performance predicting apparatus, performance predicting method, and program
CN112580782B (en) Channel-enhanced dual-attention generation countermeasure network and image generation method
CN110837846A (en) Image recognition model construction method, image recognition method and device
CN111723674A (en) Remote sensing image scene classification method based on Markov chain Monte Carlo and variation deduction and semi-Bayesian deep learning
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN107392919A (en) Gray threshold acquisition methods, image partition method based on self-adapted genetic algorithm
Jindal et al. Offline handwritten Gurumukhi character recognition system using deep learning
CN110599502A (en) Skin lesion segmentation method based on deep learning
CN115966010A (en) Expression recognition method based on attention and multi-scale feature fusion
CN113011243A (en) Facial expression analysis method based on capsule network
Sun et al. Adaptive activation thresholding: Dynamic routing type behavior for interpretability in convolutional neural networks
Huynh et al. Joint age estimation and gender classification of Asian faces using wide ResNet
CN114219049B (en) Fine-grained curbstone image classification method and device based on hierarchical constraint
CN115035341A (en) Image recognition knowledge distillation method capable of automatically selecting student model structure
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN111753995A (en) Local interpretable method based on gradient lifting tree
CN116912268A (en) Skin lesion image segmentation method, device, equipment and storage medium
CN116129189A (en) Plant disease identification method, plant disease identification equipment, storage medium and plant disease identification device
CN114528911A (en) Multi-label image classification method and model construction method and device for multi-branch structure
CN114581789A (en) Hyperspectral image classification method and system
CN115205877A (en) Irregular typesetting invoice document layout prediction method and device and storage medium
CN114997175A (en) Emotion analysis method based on field confrontation training
Zhou et al. Research on knowledge distillation algorithm based on Yolov5 attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned
AD01 Patent right deemed abandoned

Effective date of abandoning: 20240517