CN117456268A

CN117456268A - Image multi-label classification method, device, equipment and storage medium

Info

Publication number: CN117456268A
Application number: CN202311544865.XA
Authority: CN
Inventors: 天尧; 张力文; 谢志强; 金子杰
Original assignee: Tianyi Digital Life Technology Co Ltd
Current assignee: Tianyi Shilian Technology Co ltd
Priority date: 2023-11-17
Filing date: 2023-11-17
Publication date: 2024-01-26

Abstract

The application discloses a method, a device, equipment and a storage medium for classifying multiple labels of images, wherein the method comprises the following steps: inputting the target image into a preset attention EfficientNet-B0 network; after a weight parameter matrix is obtained by forced orthogonalization before back propagation, carrying out feature extraction operation on an input target image according to the weight parameter matrix through a point-by-point convolution layer and a full connection layer to obtain a target feature vector; and carrying out predictive analysis on the target feature vector through an output layer of a preset attention Efficient Net-B0 network to obtain a multi-classification result comprising a plurality of predictive labels. The receptive field network layer and forced orthogonalization processing based on the non-local attention response mechanism can improve the accuracy and the prediction efficiency of the prediction result. The method and the device can solve the technical problems that the existing neural network for classifying the image multi-label has larger redundant information and large calculated amount, so that the model efficiency is poor and the accuracy is low.

Description

Image multi-label classification method, device, equipment and storage medium

Technical Field

The present disclosure relates to the field of computer vision, and in particular, to a method, an apparatus, a device, and a storage medium for classifying multiple labels of an image.

Background

The problem of image multi-label classification is one of the key problems in the field of computer vision. The method mainly aims at accurately distinguishing the images of different categories by utilizing semantic information of the images, so that the minimized classification error is realized. The task requirement is to assign one or more appropriate labels to a given target image; the input of the algorithm or model is an image, and the output is a plurality of category labels corresponding to the image.

The existing image multi-label classification has higher difficulty compared with single-label classification, so the accuracy is lower; network parameters and feature channels in the deep learning-based image multi-label classification technology are high in redundancy, so that an intermediate feature layer is easy to fail, and the effects of the feature layer and the network parameters are difficult to play; in addition, the neural network increases the complexity of the model by increasing the convolutional layer to increase the receptive field, resulting in a larger calculation amount and a longer calculation time, thereby reducing the prediction efficiency.

Disclosure of Invention

The application provides an image multi-label classification method, device, equipment and storage medium, which are used for solving the technical problems that the existing neural network for classifying the image multi-label has larger redundant information, large calculation amount and poor model efficiency and lower accuracy.

In view of this, a first aspect of the present application provides an image multi-label classification method, including:

inputting a target image into a preset attention EfficientNet-B0 network, wherein the preset attention EfficientNet-B0 network constructs a global receptive field network layer based on a non-local attention response mechanism;

after a weight parameter matrix is obtained by forced orthogonalization before back propagation, carrying out feature extraction operation on the input target image through a point-by-point convolution layer and a full connection layer according to the weight parameter matrix to obtain a target feature vector;

and carrying out predictive analysis on the target feature vector through an output layer of the preset attention Efficient Net-B0 network to obtain a multi-classification result, wherein the multi-classification result comprises a plurality of predictive labels.

Preferably, the inputting the target image into a preset attention effect net-B0 network, the preset attention effect net-B0 network constructs a global receptive field network layer based on a non-local attention response mechanism, and the method further comprises the following steps:

constructing a basic Efficient Net-B0 network based on an inverted bottleneck residual block of the MobileNet V2;

and adding a non-local attention response mechanism layer in a plurality of deep network layers of the basic Efficient Net-B0 network to generate a preset attention Efficient Net-B0 network.

and carrying out optimization pre-training on the preset attention EfficientNet-B0 network through a preset training data set to obtain optimization model parameters.

Preferably, after the step of performing forced orthogonalization to obtain a weight parameter matrix before back propagation, performing feature extraction operation on the input target image according to the weight parameter matrix by using a point-by-point convolution layer and a full connection layer to obtain a target feature vector, including:

before back propagation, performing singular value decomposition on initial parameter matrixes of a point-by-point convolution layer and a full connection layer through a singular value decomposition algorithm to obtain a decomposition matrix;

performing forced orthogonal calculation based on the decomposition matrix to obtain a weight parameter matrix;

and performing feature extraction operation on the input target image according to the weight parameter matrix through a point-by-point convolution layer and a full connection layer to obtain a target feature vector.

A second aspect of the present application provides an image multi-label classification device, comprising:

the image input unit is used for inputting the target image into a preset attention EfficientNet-B0 network, and the preset attention EfficientNet-B0 network constructs a global receptive field network layer based on a non-local attention response mechanism;

the convolution optimization unit is used for carrying out forced orthogonalization before back propagation to obtain a weight parameter matrix, and then carrying out feature extraction operation on the input target image through a point-by-point convolution layer and a full connection layer according to the weight parameter matrix to obtain a target feature vector;

the classification prediction unit is used for performing prediction analysis on the target feature vector through the output layer of the preset attention EfficientNet-B0 network to obtain a multi-classification result, wherein the multi-classification result comprises a plurality of prediction labels.

Preferably, the method further comprises:

the model building unit is used for building a basic EfficientNet-B0 network based on the inverted bottleneck residual block of the MobileNet V2;

and the global optimization unit is used for adding a non-local attention response mechanism layer into a plurality of deep network layers of the basic Efficient Net-B0 network to generate a preset attention Efficient Net-B0 network.

Preferably, the method further comprises:

the pre-training unit is used for carrying out optimization pre-training on the preset attention EfficientNet-B0 network through the preset training data set to obtain optimized model parameters.

Preferably, the convolution optimization unit is specifically configured to:

A third aspect of the present application provides an image multi-label classification device, the device comprising a processor and a memory;

the memory is used for storing program codes and transmitting the program codes to the processor;

the processor is configured to execute the image multi-label classification method according to the first aspect according to instructions in the program code.

A fourth aspect of the present application provides a computer readable storage medium for storing program code for performing the image multi-label classification method of the first aspect.

From the above technical solutions, the embodiments of the present application have the following advantages:

in the present application, a method for classifying multiple labels of an image is provided, including: inputting a target image into a preset attention EfficientNet-B0 network, and constructing a global receptive field network layer by the preset attention EfficientNet-B0 network based on a non-local attention response mechanism; after a weight parameter matrix is obtained by forced orthogonalization before back propagation, carrying out feature extraction operation on an input target image according to the weight parameter matrix through a point-by-point convolution layer and a full connection layer to obtain a target feature vector; and carrying out predictive analysis on the target feature vector through an output layer of a preset attention Efficient Net-B0 network to obtain a multi-classification result, wherein the multi-classification result comprises a plurality of predictive labels.

According to the image multi-label classification method, the global receptive field network layer is built in the Efficient Net-B0 network architecture through the non-local attention response mechanism, the purpose of expanding receptive fields can be achieved without adding an additional convolution layer, the complexity of a model can be reduced to a certain extent, and the calculated amount is reduced; in addition, the optimization processing is carried out on the model parameter matrix through forced orthogonalization, so that the wireless correlation between output features can be ensured, the redundancy of feature channels is reduced, the high efficiency of a feature layer is improved, and the accuracy of a prediction result can be improved; in addition, regularization operations subject the parameters to regularization constraints, which can also reduce over-fitting of model training. Therefore, the method and the device can solve the technical problems that the existing neural network for classifying the image multi-label has larger redundant information and large calculation amount, so that the model efficiency is poor and the accuracy is low.

Drawings

Fig. 1 is a flow chart of an image multi-label classification method according to an embodiment of the present application;

fig. 2 is a schematic structural diagram of an image multi-label classification device according to an embodiment of the present application;

fig. 3 is a schematic diagram of an afflicientnet-B0 network structure of an inverted bottleneck residual block based on mobilenet v2 according to an embodiment of the present application;

FIG. 4 is a schematic diagram of an Efficient Net-B0 network connection of an inverted bottleneck residual block based on MobileNet V2 according to an embodiment of the present application

FIG. 5 is a schematic diagram of a global receptive field network structure based on a non-local attention response mechanism provided in embodiments of the application;

fig. 6 is a schematic diagram of calculation and analysis of an input vector by a network layer based on a weight parameter matrix according to an embodiment of the present application;

FIG. 7 is a schematic diagram of an Efficient Net-B0 network structure with orthogonalization of fusion weight parameters according to an embodiment of the present application;

fig. 8 is a schematic diagram of performing graph multi-label classification on a preset attention effect net-B0 network according to an embodiment of the present application.

Detailed Description

In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

For ease of understanding, referring to fig. 1, an embodiment of an image multi-label classification method provided in the present application includes:

and step 101, inputting the target image into a preset attention EfficientNet-B0 network, and constructing a global receptive field network layer by the preset attention EfficientNet-B0 network based on a non-local attention response mechanism.

Further, step 101, before further includes:

and adding a non-local attention response mechanism layer in a plurality of deep network layers of the basic EfficientNet-B0 network to generate a preset attention EfficientNet-B0 network.

Further, step 101, before further includes:

It should be noted that referring to fig. 3 and 4, the afflicientnet infrastructure of the preset attention afflicientnet-B0 network of the present embodiment is a convolutional neural network architecture and scaling method, which uses a composite coefficient to scale all dimensions of depth/width/resolution uniformly. Furthermore, on the basis of the network of EfficientNet, a basic Eff ictientNet-B0 network is generated from the inverted bottleneck residual block, extrusion and excitation block of MobileNet V2, which can migrate well over numerous data sets and achieve the most advanced accuracy, with one order of magnitude less parameters.

In order to increase the receptive field of the model without increasing the network convolution layer, in this embodiment, a global receptive field network layer based on a non-local attention response mechanism is added to a plurality of deep network layers of the basic efficience entNet-B0 network, and the specific structure is shown in fig. 5, and the number and the insertion position of the global receptive field network layer can be changed according to the actual situation, which is not limited herein. For example, global receptive field network layers based on non-local attention response mechanisms may be added at layers 6, 12 and 18, respectively, of the underlying Efficient Net-B0 network. Capturing remote dependency in an image by introducing a global attention mechanism, and calculating the response of a certain position as a weighted sum of all position characteristics by non-local operation of a network layer; the receptive field of the whole graph can be obtained through only one network level, so that the characterization capability of the network is expanded.

It can be understood that the preset training data set may be a standard training data base obtained by preprocessing and then sorting some historical graphic data, and for the preset training data set used in the pre-training, the image single-label classification and the multi-label classification may be mixed, or all the preset training data set may be the image multi-label classification, which is not specifically limited, does not affect the pre-training, and specific processes are not described herein. The obtained optimized model parameters can be used for updating a preset attention Efficient Net-B0 network, so that the preset attention Effi entNet-B0 network can analyze the target image more efficiently.

Step 102, after a weight parameter matrix is obtained by forced orthogonalization before back propagation, feature extraction operation is carried out on an input target image according to the weight parameter matrix through a point-by-point convolution layer and a full connection layer, and a target feature vector is obtained.

Further, step 102 includes:

forced orthogonal calculation is carried out based on the decomposition matrix, so that a weight parameter matrix is obtained;

and performing feature extraction operation on the input target image according to the weight parameter matrix through the point-by-point convolution layer and the full connection layer to obtain a target feature vector.

It should be noted that, in this embodiment, before back propagation, singular value decomposition is performed on initial parameter matrices of a point-by-point convolution layer and a full connection layer, if the initial parameter matrix is W and elements therein all belong to a real number complex domain, then singular value decomposition may be performed:

W＝U×∑×V ^*

wherein U and V ^* Are unitary matrices, and sigma is a diagonal matrix; these decomposed matrices may be referred to as decomposed matrices; then, forced orthogonal calculation is carried out based on the decomposition matrix, so as to obtain a weight parameter matrix W':

W'＝UV ^*

the obtained weight parameter matrix W' can be applied to the convolution layer for feature extraction, or the parameter matrices in the point-by-point convolution layer and the full connection layer can be subjected to forced orthogonalization constraint processing by adopting the decomposition calculation method; after which back propagation is performed.

Further, the point-by-point convolution layer and the full connection layer of the present embodiment are linear layers connecting the input channel and the output channel. Therefore, when feature extraction analysis is performed based on the weight parameter matrix, it is equivalent to performing linear transformation on the input vector, i.e., y=w' ×x, where x and y are the input vector and the output vector, respectively, and x is the matrix multiplication, specifically refer to fig. 6. The training process of the preset attention afflicientnet-B0 network based on feature extraction is shown in fig. 7.

After the parameter matrixes in the point-by-point convolution layer and the full connection layer in the embodiment are forcedly converted into orthogonal matrixes, the three decomposition matrixes are orthogonal to each other, so that no linear correlation exists between channels of the corresponding network layer, the obtained feature vector y is not overlapped, the diversity of the output channel expression is increased, and the feature vector variety is enriched. And the L2 norm of the orthogonal matrix is 1, the L2 norm of the vector remains unchanged after the weight parameter matrix is calculated, i.e., |x| ² ＝|y| ² The method comprises the steps of carrying out a first treatment on the surface of the In the back propagation training stage, the norms of the vectors can be kept unchanged to ensure that the gradients are kept consistent in the back propagation process, so that gradient dissipation or gradient explosion is effectively avoided. In addition, the forced regularization operation may also act as a regularization constraint on parameters in the network layer, in turn reducing overfitting.

And 103, performing predictive analysis on the target feature vector through an output layer of a preset attention Efficient Net-B0 network to obtain a multi-classification result, wherein the multi-classification result comprises a plurality of predictive labels.

Referring to fig. 8, after feature extraction is performed on a target image in an input preset attention afflicientnet-B0 network through a deep convolutional neural network, a plurality of prediction labels can be given through output layer prediction, for example, after label prediction is performed on the input target image in fig. 8, a plurality of labels including dogs, grasslands and flowers can be obtained, so that multi-label classification of the image is realized.

According to the image multi-label classification method provided by the embodiment of the application, the global receptive field network layer is built in the Efficient Net-B0 network architecture through the non-local attention response mechanism, the purpose of expanding receptive fields can be achieved without adding an additional convolution layer, the complexity of a model can be reduced to a certain extent, and the calculated amount is reduced; in addition, the optimization processing is carried out on the model parameter matrix through forced orthogonalization, so that the wireless correlation between output features can be ensured, the redundancy of feature channels is reduced, the high efficiency of a feature layer is improved, and the accuracy of a prediction result can be improved; in addition, regularization operations subject the parameters to regularization constraints, which can also reduce over-fitting of model training. Therefore, the embodiment of the application can solve the technical problems that the existing neural network for classifying the image multi-label has larger redundant information and large calculation amount, so that the model efficiency is poor and the accuracy is low.

For ease of understanding, referring to fig. 2, the present application provides an embodiment of an image multi-label classification apparatus, including:

an image input unit 201, configured to input a target image into a preset attention afflicientnet-B0 network, where the preset attention afflicientnet-B0 network constructs a global receptive field network layer based on a non-local attention response mechanism;

the convolution optimization unit 202 is configured to perform a feature extraction operation on an input target image according to the weight parameter matrix through the point-by-point convolution layer and the full connection layer after performing forced orthogonalization to obtain the weight parameter matrix before back propagation, so as to obtain a target feature vector;

the classification prediction unit 203 is configured to perform prediction analysis on the target feature vector through an output layer of the preset attention effective net-B0 network, so as to obtain a multi-classification result, where the multi-classification result includes a plurality of prediction labels.

Further, the method further comprises the following steps:

a model building unit 204, configured to build a basic afflicientnet-B0 network based on the inverted bottleneck residual block of mobilenet v 2;

the global optimization unit 205 is configured to add a non-local attention response mechanism layer to a plurality of deep network layers of the basic afflicientnet-B0 network, and generate a preset attention afflicientnet-B0 network.

Further, the method further comprises the following steps:

the pre-training unit 206 is configured to perform optimization pre-training on the preset attention efficiency net-B0 network through the preset training data set, so as to obtain optimization model parameters.

Further, the convolution optimization unit 202 is specifically configured to:

The application also provides image multi-label classification equipment, which comprises a processor and a memory;

the memory is used for storing the program codes and transmitting the program codes to the processor;

the processor is configured to execute the image multi-label classification method in the method embodiment according to the instructions in the program code.

The application also provides a computer readable storage medium for storing program code for executing the image multi-label classification method in the above method embodiment.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to execute all or part of the steps of the methods described in the embodiments of the present application by a computer device (which may be a personal computer, a server, or a network device, etc.). And the aforementioned storage medium includes: u disk, mobile hard disk, read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic disk or optical disk, etc.

The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims

1. A method for classifying multiple labels of an image, comprising:

2. The method of claim 1, wherein the inputting the target image into a preset attention effect net-B0 network, the preset attention effect net-B0 network constructing a global receptive field network layer based on a non-local attention response mechanism, further comprising:

3. The method of claim 1, wherein the inputting the target image into a preset attention effect net-B0 network, the preset attention effect net-B0 network constructing a global receptive field network layer based on a non-local attention response mechanism, further comprising:

4. The method for classifying multiple labels of image according to claim 1, wherein after the step of performing forced orthogonalization to obtain a weight parameter matrix before back propagation, performing feature extraction operation on the input target image according to the weight parameter matrix by a point-by-point convolution layer and a full connection layer to obtain a target feature vector, comprising:

5. An image multi-label classification device, comprising:

6. The image multi-label classification apparatus according to claim 5, further comprising:

7. The image multi-label classification apparatus according to claim 5, further comprising:

8. The image multi-label classification apparatus according to claim 5, wherein the convolution optimization unit is specifically configured to:

9. An image multi-label classification device, characterized in that the device comprises a processor and a memory;

the processor is configured to perform the image multi-label classification method of any one of claims 1-4 according to instructions in the program code.

10. A computer readable storage medium for storing program code for performing the image multi-label classification method of any one of claims 1-4.