CN116935142A - Training method of multi-label classification model, multi-label classification method and product

Training method of multi-label classification model, multi-label classification method and product

Info

Publication number
CN116935142A
Authority
CN
China
Prior art keywords
feature map
class
training
spatial
representing
Prior art date
Legal status
Pending
Application number
CN202311015028.8A
Other languages
Chinese (zh)
Inventor
何兰青
刘家明
胡馨月
Current Assignee
Beijing Yingtong Intelligent Technology Co ltd
Beijing Airdoc Technology Co Ltd
Original Assignee
Beijing Yingtong Intelligent Technology Co ltd
Beijing Airdoc Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yingtong Intelligent Technology Co ltd, Beijing Airdoc Technology Co Ltd filed Critical Beijing Yingtong Intelligent Technology Co ltd
Priority to CN202311015028.8A
Publication of CN116935142A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a training method of a multi-label classification model, a multi-label classification method and a product. The training method comprises the following steps: inputting a training image into the backbone network for feature extraction to output an initial feature map, wherein the training image carries a class label; inputting the initial feature map into a first convolution branch and a second convolution branch of the classification network for processing, so as to output a first spatial feature map and a second spatial feature map respectively; weighting the first spatial feature map based on the second spatial feature map to obtain a class-specific spatial feature map; outputting a class score of the training image based on the class-specific spatial feature map; and training the multi-label classification model based on the class score and the class label. With the training method provided by the embodiments of the application, the trained multi-label classification model has better classification performance.

Description

Training method of multi-label classification model, multi-label classification method and product
Technical Field
The present application relates generally to the field of image processing technology. More particularly, the present application relates to a training method of a multi-label classification model, a method of multi-label classification based on an image, an apparatus for multi-label classification, and a computer-readable storage medium.
Background
Multi-label classification is the task of outputting, for an input image, the multiple class labels that the image contains. For medical images, multi-label classification can be used to estimate the probability of multiple disease labels contained in the image, thereby assisting the physician in diagnosis and treatment. Currently, the mainstream approach to multi-label classification of medical images is based on Convolutional Neural Networks (CNN): the input medical image is passed through multiple layers of convolution, pooling, fully connected and other operations to obtain a feature vector, and a multi-output classifier then outputs the probability of each disease label.
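A minimal sketch of this conventional pipeline, assuming a PyTorch implementation with a generic CNN backbone; the module and parameter names are illustrative, not taken from the application:

```python
import torch
import torch.nn as nn

class ConventionalMultiLabelCNN(nn.Module):
    """Conventional CNN multi-label classifier: conv -> GAP -> FC -> per-label sigmoid."""
    def __init__(self, backbone: nn.Module, feat_channels: int, num_labels: int):
        super().__init__()
        self.backbone = backbone                      # stacked convolution/pooling layers
        self.gap = nn.AdaptiveAvgPool2d(1)            # global average pooling discards the spatial layout
        self.fc = nn.Linear(feat_channels, num_labels)

    def forward(self, x):
        feats = self.backbone(x)                      # (B, feat_channels, K, K)
        vec = self.gap(feats).flatten(1)              # (B, feat_channels): spatial information lost here
        return torch.sigmoid(self.fc(vec))            # (B, num_labels) independent label probabilities
```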
However, the conventional CNN approach has drawbacks. Typically, the CNN performs feature dimensionality reduction with global average pooling before the fully connected layer, which discards the spatial information of the features and cannot effectively capture the different spatial regions occupied by objects of different classes. For disease labels characterized by well-defined local lesion features, the loss of this local information degrades classification performance.
In view of the foregoing, there is a need to provide an image classification scheme that can effectively capture spatial information of features to improve the classification performance of multi-label classification tasks.
Disclosure of Invention
To address at least one or more of the technical problems mentioned above, the present application proposes, in various aspects, a training method of a multi-label classification model, a method of multi-label classification based on images, an apparatus for multi-label classification, and a computer-readable storage medium.
In a first aspect, the present application provides a training method of a multi-label classification model comprising a backbone network and a classification network comprising a plurality of convolution branches, the training method comprising: inputting a training image into the backbone network for feature extraction to output an initial feature map, wherein the training image carries a class label; inputting the initial feature map into a first convolution branch and a second convolution branch of the classification network for processing, so as to output a first spatial feature map and a second spatial feature map respectively; weighting the first spatial feature map based on the second spatial feature map to obtain a class-specific spatial feature map; outputting a class score of the training image based on the class-specific spatial feature map; and training the multi-label classification model based on the class score and the class label.
In some embodiments, the first convolution branch is configured to output first spatial feature maps of a plurality of categories and the second convolution branch is configured to output second spatial feature maps of the plurality of categories; the weighting operation includes: normalizing the second spatial feature map of each category to obtain the spatial attention score of each category; and, taking the spatial attention score of each category as a weight, performing a weighting operation on the first spatial feature map of the corresponding category to obtain a class-specific spatial feature map of each category.
In other embodiments, the first spatial feature map and the second spatial feature map are each obtained by: z_{1c} = X^T w_{1c}; z_{2c} = X^T w_{2c}; where z_{1c} denotes the first spatial feature map of class c, z_{2c} denotes the second spatial feature map of class c, X denotes the initial feature map, w_{1c} denotes the first convolution weight of class c, and w_{2c} denotes the second convolution weight of class c.
In still other embodiments, the class-specific spatial feature map is obtained by the following weighting operation: v_c = z_{1c} * σ(z_{2c}); where v_c denotes the class-specific spatial feature map of class c, z_{1c} denotes the first spatial feature map of class c, z_{2c} denotes the second spatial feature map of class c, σ denotes the Sigmoid function, and * denotes element-wise multiplication of matrices.
In some embodiments, outputting the class score of the training image based on the class-specific spatial feature map comprises: performing global average pooling on the class-specific spatial feature map; and normalizing the result of the global average pooling to obtain the class score.
In other embodiments, the training method further comprises: acquiring lesion position information related to the class label in the training image; and using the lesion position information as a supervision signal for the second spatial feature map to train the multi-label classification model.
In still other embodiments, the training method further comprises: processing the initial feature map by using a self-attention mechanism to obtain a self-attention feature map; and outputting a category score for the training image based on the class-specific spatial feature map and the self-attention feature map.
In some embodiments, the self-attention feature map is obtained from: K_c = X^T W_{Kc}; Q_c = X^T W_{Qc}; where K_c denotes the K map of class c, Q_c denotes the Q map of class c, X denotes the initial feature map, W_{Kc} denotes the weight matrix used to compute K_c, W_{Qc} denotes the weight matrix used to compute Q_c, z_{3c} denotes the self-attention feature map derived from K_c and Q_c, and M denotes the number of channels of the initial feature map.
In other embodiments, outputting the class score of the training image based on the class-specific spatial feature map and the self-attention feature map comprises: performing global average pooling on the element-wise product of the class-specific spatial feature map and the self-attention feature map; and normalizing the result of the global average pooling to obtain the class score.
In still other embodiments, the class score of each class is obtained by: s_c = σ(GAP(z_{1c} * σ(z_{2c}) * z_{3c})); where s_c denotes the class score of class c, z_{1c} denotes the first spatial feature map of class c, z_{2c} denotes the second spatial feature map of class c, z_{3c} denotes the self-attention feature map, σ denotes the Sigmoid function, * denotes element-wise multiplication of matrices, and GAP denotes global average pooling.
In some embodiments, the training image comprises a fundus image.
In a second aspect, the present application provides a method of multi-label classification based on images, comprising: inputting an image to be classified into a multi-label classification model trained by the training method according to any one of the first aspect of the application; and classifying the images to be classified by using the multi-label classification model, and outputting classification results.
In a third aspect, the present application provides an apparatus for multi-tag classification, comprising: a processor for executing program instructions; and a memory storing the program instructions which, when loaded and executed by the processor, cause the processor to perform the training method according to the application as set forth in any one of the first aspects or to perform the method according to the application as set forth in the second aspect.
In a fourth aspect, the present application provides a computer readable storage medium, characterized in that it has stored thereon computer readable instructions which, when executed by one or more processors, implement the training method according to the application as described in any of the first aspects or the method according to the application as described in the second aspect.
Through the above training scheme of the multi-label classification model and the scheme of image-based multi-label classification, the embodiments of the application provide the classification network of the multi-label classification model with a plurality of convolution branches and perform a weighting operation on the spatial feature maps output by the respective convolution branches, thereby effectively capturing and retaining the spatial information of the features, so that the trained multi-label classification model has better classification performance. Further, in some embodiments, by incorporating a self-attention mechanism in the classification network, the multi-label classification model can learn interactions between features at different spatial positions, so that the trained model better captures long-range features (i.e., features that are far apart spatially) and feature correlations.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. In the drawings, embodiments of the application are illustrated by way of example and not by way of limitation, and like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 illustrates an exemplary flow chart of a training method for a multi-label classification model according to an embodiment of the application;
FIG. 2 shows a schematic diagram of training image processing via a backbone network, according to an embodiment of the application;
FIG. 3 shows a schematic flow diagram of a training method of a multi-label classification model according to another embodiment of the application;
FIG. 4 illustrates an exemplary flow chart of a training method utilizing a self-attention mechanism in accordance with an embodiment of the present application;
FIG. 5 shows a schematic flow diagram of a training method of a multi-label classification model including a self-attention mechanism according to another embodiment of the application;
FIG. 6 illustrates a flow chart of a method for multi-label classification based on images in accordance with an embodiment of the application; and
fig. 7 is a schematic block diagram illustrating a system for multi-label classification in accordance with an embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
It should be understood that the terms "comprises" and "comprising," when used in this specification and in the claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification and claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present specification and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to a determination", or "in response to detection". Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting the [described condition or event]", or "in response to detecting the [described condition or event]".
The present inventors have found that, to better adapt to multi-label classification tasks, researchers have developed a simple and efficient multi-label image recognition module called class-specific residual attention (CSRA), which can effectively capture the different spatial regions occupied by objects of different classes. CSRA has achieved state-of-the-art results in multi-label recognition while being much simpler than other traditional methods. However, CSRA has a limitation: it uses only one set of convolution operations, and it normalizes the result of that convolution and multiplies it with the result itself, so that locations with larger responses receive larger weights. This may emphasize some clear local lesion features, but it may also lead to more false positives.
That is, CSRA makes strong features (i.e., positive features) even stronger, while features with weak or negative contributions (reverse features) are almost suppressed; over-emphasizing local positive features is unsuitable for disease labels that require attention to global features. In some application scenarios, a disease label may be positively correlated with certain lesion features and negatively correlated with certain other lesion features. For example, when a certain lesion feature is present in a medical image, it may indicate that the patient suffers from a certain disease (such a lesion feature is referred to as a positive feature); and when some other lesion feature is present in the medical image, it may indicate that the patient does not suffer from that disease (such a lesion feature is referred to as a reverse feature). In this application scenario, CSRA suppresses the reverse feature, thereby affecting the classification result of the multi-label classification model.
On this basis, the inventors provide a new solution that captures spatial features at different levels on a per-class basis by arranging a plurality of convolution branches, so that the trained multi-label classification model can not only identify the class-related spatial features at different spatial positions but also avoid the limitation caused by over-emphasizing positive features. The trained multi-label classification model is therefore applicable to more application scenarios and outputs more accurate classification results. Specific embodiments of the present application are described in detail below with reference to the accompanying drawings.
FIG. 1 illustrates an exemplary flow chart of a training method for a multi-label classification model according to an embodiment of the application. A multi-label classification model according to an embodiment of the application may include a backbone network and a classification network, wherein the classification network may include a plurality of convolution branches. As shown in fig. 1, the training method 100 may include: in step 101, a training image may be input to a backbone network for feature extraction to output an initial feature map, where the training image has class labels.
In some embodiments, the backbone network may include multiple convolution layers, nonlinear activation layers, pooling layers, and the like, for extracting features from the image to obtain an initial feature map. In other embodiments, the backbone network may employ a network structure such as EfficientNet-B4 or another CNN backbone. In some embodiments, the training image may include a medical image or the like. In other embodiments, the training image may include a fundus image or the like.
Fig. 2 shows a schematic diagram of training image processing via a backbone network according to an embodiment of the application. As shown in fig. 2, after the training image 201 is input into the backbone network 202, an initial feature map may be output through the feature extraction operation of the backbone network 202. The initial feature map may be, for example, an M×P matrix, where M and P are positive integers, M represents the number of channels of the initial feature map, and P represents the number of spatial positions of the initial feature map; for example, the initial feature map shown in fig. 2 contains K×K spatial positions. It will be appreciated that the element at each spatial position of the initial feature map corresponds to a region in the training image 201, thereby preserving spatially specific information. It will also be appreciated that the initial feature map shown in fig. 2 is exemplary and not limiting; for example, the initial feature map need not be square as illustrated, but may be rectangular as desired (e.g., P = k×h, where k and h are positive integers and k ≠ h).
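As an illustration only, the following sketch shows how such an M×K×K initial feature map might be obtained; the choice of a torchvision EfficientNet-B4 backbone and the 512×512 input size are assumptions, since the application only names EfficientNet-B4 as one possible backbone:

```python
import torch
from torchvision.models import efficientnet_b4

# Hypothetical backbone choice: the convolutional feature extractor of EfficientNet-B4.
backbone = efficientnet_b4().features

image = torch.randn(1, 3, 512, 512)       # a dummy training image (batch of 1)
initial_feature_map = backbone(image)      # (1, M, K, K); here M = 1792 and K = 16 for a 512x512 input
B, M, K, _ = initial_feature_map.shape
X = initial_feature_map.flatten(2)         # (1, M, P) with P = K*K spatial positions
print(X.shape)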
The description continues with returning to fig. 1. As shown in fig. 1, in step 102, an initial signature may be input into a first convolution branch and a second convolution branch of a classification network for processing, respectively, to output a first spatial signature and a second spatial signature, respectively. The classification network may include at least one first convolution branch and at least one second convolution branch. The first convolution branch and the second convolution branch may be two parallel convolution branches, i.e. both may have the same input data.
In some embodiments, the first convolution branch and the second convolution branch may each perform a 1×1 convolution operation on the initial feature map. In other embodiments, the convolution weights of the first convolution branch and of the second convolution branch may each be associated with a class and used to map the feature dimension of the initial feature map to the class dimension, i.e., the number of output channels of the convolution may equal the number of classes, so that a class-based first spatial feature map and a class-based second spatial feature map can be output respectively. In some embodiments, the first convolution branch may be used to output first spatial feature maps of a plurality of categories and the second convolution branch may be used to output second spatial feature maps of the plurality of categories.
For example, assuming that the initial feature map includes 128-dimensional features and the multi-label classification task is a ten-class task (i.e., includes 10 classes), the initial feature map may be reduced in dimension by the operations of the first and second convolution branches to obtain a first spatial feature map of 10 class dimensions and a second spatial feature map of 10 class dimensions. As another example, assuming that the initial feature map includes 128-dimensional features and the multi-label classification task has 200 classes, the initial feature map may be increased in dimension by the operations of the first and second convolution branches to obtain a first spatial feature map of 200 class dimensions and a second spatial feature map of 200 class dimensions. The categories described herein may be understood as the label categories, or classification categories, of the multi-label classification task.
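A sketch of the two parallel 1×1 convolution branches under these assumptions; the 128-channel / 10-class sizes and the variable names are illustrative:

```python
import torch
import torch.nn as nn

M, N = 128, 10                                               # channels of the initial feature map, number of classes
first_branch = nn.Conv2d(M, N, kernel_size=1, bias=False)    # produces z1: one spatial map per class
second_branch = nn.Conv2d(M, N, kernel_size=1, bias=False)   # produces z2: one spatial map per class

X = torch.randn(4, M, 16, 16)                                # initial feature maps for a batch of 4 images
z1 = first_branch(X)                                         # (4, N, 16, 16) first spatial feature maps
z2 = second_branch(X)                                        # (4, N, 16, 16) second spatial feature maps
```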
Next, in step 103, the first spatial feature map may be weighted based on the second spatial feature map to obtain a class-specific spatial feature map. In some embodiments, the weights of the weighting operation may be determined based on the second spatial signature to weight the first spatial signature. Class-specific features may be generated for each class by weighting the first spatial feature map of the respective class based on the second spatial feature map of each class, thereby generating class-specific spatial feature maps based on the class.
Because the first spatial feature map and the second spatial feature map are obtained by two different convolution branches, training the parameters of the first and second convolution branches allows the two feature maps to retain or capture features at different levels (for example, the first spatial feature map may retain all of the feature information while the second spatial feature map captures local feature information). As a result, the weighting operation between them neither over-enhances local positive features nor completely ignores local negative features, so the resulting class-specific spatial feature map accounts for both local and global features, adapts better to more classification scenarios, and improves the accuracy of the classification results of the multi-label classification model.
The flow may then proceed to step 104, where a class score of the training image may be output based on the class-specific spatial feature map. In some embodiments, a class score for each class may be output based on the class-specific spatial feature map of that class. In other embodiments, step 104 may include: performing global average pooling on the class-specific spatial feature map; and normalizing the result of the global average pooling to obtain the class score. The class-specific spatial feature map of each class may be globally average pooled, and the pooled result of each class normalized, to obtain the class score of each class of the multi-label classification task.
As further shown in fig. 1, in step 105, the multi-label classification model may be trained based on the class score and the class label. In some embodiments, the class label may be annotation information on the disease categories associated with the training image. In other embodiments, the multi-label classification model may be trained based on a first loss between the class score and the class label. In still other embodiments, the first loss may be calculated by, for example, a cross entropy loss function or the like. In some embodiments, the training data may include a plurality of training images whose class labels cover the categories required by the classification task of the multi-label classification model.
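A hedged sketch of one training step under these assumptions, with a per-class binary cross-entropy as the "first loss"; `model` is any module producing per-class scores in [0, 1], and all names are illustrative:

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, images, labels):
    """One optimization step: labels are multi-hot vectors of shape (B, num_classes)."""
    criterion = nn.BCELoss()                    # per-class cross entropy on sigmoid scores
    scores = model(images)                      # (B, num_classes), already passed through sigmoid
    loss = criterion(scores, labels.float())    # "first loss" between class scores and class labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```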
The training method of the multi-label classification model according to the embodiment of the present application has been described above with reference to fig. 1 by way of example. Unlike a conventional CNN, which directly performs global average pooling and normalization on the initial feature map to obtain the class scores, and unlike CSRA, which uses a single set of convolution operations to enhance local positive features, the training method of the embodiment of the present application processes the initial feature map with a plurality of convolution branches. It can therefore capture class-specific spatial features while avoiding the problems caused by over-enhancing local positive features, so that the trained multi-label classification model has stronger multi-label classification capability. It is also to be understood that the above description is exemplary and not limiting; a training method according to another embodiment of the present application is described below in conjunction with fig. 3.
Fig. 3 shows a schematic flow diagram of a training method of a multi-label classification model according to another embodiment of the application. As will be appreciated from the following description, the training method 300 illustrated in FIG. 3 may be one embodied form of the training method 100 described hereinabove in connection with FIG. 1, and thus the description of the training method 100 described hereinabove in connection with FIG. 1 may also be applicable to the following description of the training method 300.
As shown in fig. 3, in the training method 300, a training image 201 may first be input into the backbone network 202 of the multi-label classification model for feature extraction to obtain an initial feature map X. The initial feature map X may then be input into the first convolution branch for processing to obtain a first spatial feature map z_1, and into the second convolution branch for processing to obtain a second spatial feature map z_2. In some embodiments, the first spatial feature map z_1 and the second spatial feature map z_2 may be calculated by the following Equations 1 and 2, respectively:
z_{1c} = X^T w_{1c}    (Equation 1);
z_{2c} = X^T w_{2c}    (Equation 2);
where z_{1c} denotes the first spatial feature map of class c, z_{2c} denotes the second spatial feature map of class c, X denotes the initial feature map, w_{1c} denotes the first convolution weight of class c, and w_{2c} denotes the second convolution weight of class c. The first and second convolution weights may be determined by training the model. The first and second convolution branches may each perform a 1×1 convolution operation, the number of output channels of the convolution may be determined by the number N of classes required by the classification task of the multi-label classification model, and c ranges over the positive integers from 1 to N.
After the first spatial feature map z_1 and the second spatial feature map z_2 are obtained, the first spatial feature map z_1 may be weighted based on the second spatial feature map z_2 to obtain a class-specific spatial feature map V. In some embodiments, the weighting operation may include: normalizing the second spatial feature map z_2 of each class to obtain the spatial attention score of each class; and, taking the spatial attention score of the corresponding class as the weight, performing a weighting operation on the first spatial feature map z_1 of each class to obtain the class-specific spatial feature map of each class. In other embodiments, the normalization operation may include a Sigmoid function. The corresponding class here means that the class of the spatial attention score used as the weight (i.e., the class of the second spatial feature map) is the same as the class of the first spatial feature map on which the weighting operation is performed.
For example, in one particular embodiment, the class-specific spatial feature map V may be obtained by the following weighting operation:
v_c = z_{1c} * σ(z_{2c})    (Equation 3);
where v_c denotes the class-specific spatial feature map of class c, z_{1c} denotes the first spatial feature map of class c, z_{2c} denotes the second spatial feature map of class c, σ denotes the Sigmoid function, and * denotes element-wise multiplication of matrices. Compared with the Softmax function, normalizing with the Sigmoid function makes the features at each spatial position of the feature map independent rather than competing with one another, so the spatial attention scores are more flexible and can adapt to more situations.
The feature at each spatial position of the class-specific spatial feature map may be expressed as:
v_{cj} = z_{1cj} σ(z_{2cj})    (Equation 4);
where j denotes the j-th spatial position in the feature map (there are P spatial positions in total), v_{cj} denotes the feature at the j-th spatial position of the class-specific spatial feature map of class c, z_{1cj} denotes the feature at the j-th spatial position of the first spatial feature map of class c, z_{2cj} denotes the feature at the j-th spatial position of the second spatial feature map of class c, and σ denotes the Sigmoid function.
Further, after the class-specific spatial feature map V is obtained, the class score S of the training image 201 may be output based on the class-specific spatial feature map V. In some embodiments, the class-specific spatial feature map V may be globally average pooled, and the result of the global average pooling normalized to obtain the class score. For example, this may be achieved by the following formula:
s_c = σ(GAP(v_c))    (Equation 5);
where s_c denotes the class score of class c, σ denotes the Sigmoid function, GAP denotes global average pooling, and v_c denotes the class-specific spatial feature map of class c.
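The following module is a minimal sketch of Equations 1 through 5, a plausible PyTorch rendering rather than the application's reference implementation; per-class 1×1 convolutions stand in for the weights w_{1c} and w_{2c}:

```python
import torch
import torch.nn as nn

class ClassSpecificSpatialHead(nn.Module):
    """Classification head sketch: z1, z2 -> v = z1 * sigmoid(z2) -> s = sigmoid(GAP(v))."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # Equations 1 and 2: per-class 1x1 convolutions over the initial feature map X.
        self.first_branch = nn.Conv2d(in_channels, num_classes, kernel_size=1, bias=False)
        self.second_branch = nn.Conv2d(in_channels, num_classes, kernel_size=1, bias=False)

    def forward(self, X):
        z1 = self.first_branch(X)        # (B, N, K, K) first spatial feature maps
        z2 = self.second_branch(X)       # (B, N, K, K) second spatial feature maps
        attn = torch.sigmoid(z2)         # spatial attention scores per class
        v = z1 * attn                    # Equation 3: class-specific spatial feature maps
        pooled = v.mean(dim=(2, 3))      # GAP over the K*K spatial positions
        s = torch.sigmoid(pooled)        # Equation 5: class scores in [0, 1]
        return s, v, z2                  # z2 is returned for optional lesion supervision
```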
Still further, the multi-label classification model may be trained based on the distance between the class score S and the class labels of the training image 201.
As further shown in fig. 3, in some embodiments, the training method 300 may further include: acquiring lesion position information 301 related to the class label in the training image 201; and using the lesion position information 301 as a supervision signal for the second spatial feature map Z_2 to train the multi-label classification model. In other embodiments, lesion detection and/or lesion segmentation operations may be performed on the training image 201 to obtain the lesion position information 301. Based on the relevance of lesions to diseases, the lesion position information 301 annotated in each training image 201 may be related to the disease categories contained in the class label of the current training image 201, so that an association between disease categories and lesion information can be established.
Using the lesion position information 301 as the supervision signal for the second spatial feature map Z_2 can also be understood as using the lesion position information 301 as a supervision signal for the spatial attention scores. In one embodiment, the multi-label classification model may be trained using the lesion position information 301 as annotation information of the training images. In another specific implementation, the lesion position information 301 may be converted into a heat map serving as the supervision signal for the second spatial feature map. In still other embodiments, the multi-label classification model may be trained based on a second loss between the lesion position information and the second spatial feature map. In some embodiments, the second loss may be calculated by, for example, a cross entropy loss function or the like.
For example, in some application scenarios, taking retinal images as an example, one important indicator for glaucoma diagnosis is the cup-to-disc ratio, so the position of the optic disc can be used as a supervision signal for the second spatial feature map to guide the learning of the spatial attention score of the glaucoma class. In other application scenarios, the heat map of hemorrhage-point positions in a fundus image may be mapped to the corresponding positions of the second spatial feature map, to guide the learning of the spatial attention score of the diabetic retinopathy class over the regions covered by fundus hemorrhage points.
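A sketch of the lesion-position supervision under stated assumptions: the lesion information is rasterized into per-class heat maps at the resolution of the second spatial feature map, and the "second loss" is a binary cross-entropy between the heat map and the sigmoid-normalized z2. The loss choice, masking, and weighting factor are assumptions, not specified by the application:

```python
import torch
import torch.nn.functional as F

def lesion_supervision_loss(z2, lesion_heatmap, valid_mask=None):
    """Second loss between sigmoid(z2) and lesion heat maps of the same (B, N, K, K) shape.

    lesion_heatmap: per-class heat maps in [0, 1]; valid_mask optionally restricts the loss
    to classes/images that actually carry lesion annotations.
    """
    attn = torch.sigmoid(z2)                                    # predicted spatial attention scores
    loss = F.binary_cross_entropy(attn, lesion_heatmap, reduction="none")
    if valid_mask is not None:
        loss = loss * valid_mask                                # ignore unannotated classes
    return loss.mean()

# Total loss sketch: classification loss plus weighted lesion supervision.
# total = first_loss + lambda_lesion * lesion_supervision_loss(z2, heatmap)   # lambda is a hyperparameter
```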
The training method according to the embodiment of the present application has been described above with reference to fig. 3 by way of example. It will be appreciated that the above description is exemplary rather than limiting; the classification network of the multi-label classification model according to the embodiment of the present application is not limited to containing only the first and second convolution branches, and more convolution branches may be provided as desired. It can further be understood that introducing lesion position information as a supervision signal enhances the ability of the second spatial feature map to extract local lesion information, so that the multi-label classification model more strongly captures the association between class labels and local information.
In addition, the inventors have found that the relationship between some disease labels and lesions is not a one-to-one, absolute relationship. For example, fundus hemorrhage points suggest not only possible diabetic retinopathy but also hypertension, so judging these disease labels often requires other evidence acting together rather than relying on a single lesion feature.
The inventors have also noted that some approaches introduce attention mechanisms into the field of vision algorithms, such as SENet, CBAM, and the like. The Transformer has likewise been introduced into the vision field, giving rise to ViT, Swin, and the like. Attention mechanisms can capture interactions between features at different spatial positions, and these methods introduce attention by adding various attention structures into the backbone network, improving the feature extraction capability of the conventional CNN approach. However, this also brings a huge number of trainable parameters, and the resulting demand for abundant training data is a great challenge. Meanwhile, these attention approaches are class-agnostic and therefore offer no particular advantage for the multi-label classification task.
Based on the above findings, the present inventors have further optimized the training method of the multi-label classification model provided by the present application. As will be described in detail below in connection with fig. 4 and 5.
FIG. 4 illustrates an exemplary flow chart of a training method utilizing a self-attention mechanism in accordance with an embodiment of the present application. As shown in fig. 4, the training method 400 may include: in step 401, a training image may be input into a backbone network of a multi-label classification model for feature extraction to output an initial feature map, where the training image has class labels. Next, in step 402, the initial feature map may be input into a first convolution branch and a second convolution branch, respectively, of a classification network of the multi-label classification model for processing to output a first spatial feature map and a second spatial feature map, respectively. The flow may then proceed to step 403 where the first spatial signature may be weighted based on the second spatial signature to obtain a class-specific spatial signature. Step 401, step 402 and step 403 have been described in detail in connection with any of the embodiments of fig. 1-3, and will not be described in detail here.
As further shown in fig. 4, in step 404, the initial feature map may be processed using a self-attention mechanism to obtain a self-attention feature map. In some embodiments, the features at each spatial position of the initial feature map may be passed through fully connected layer operations to obtain a K ("Key") vector and a Q ("Query") vector, respectively, and a self-attention operation may be performed on the two to obtain the self-attention feature map. In the classification network, by training class-based weight parameters in the fully connected layers, the self-attention feature map can be related to each class of the multi-label classification task; and because this self-attention mechanism is structurally simple and easy to train, it can be applied to the multi-label classification task.
Further, in step 405, a class score of the training image may be output based on the class-specific spatial feature map and the self-attention feature map. In some embodiments, a class score for each class may be output based on the class-specific spatial feature map and the self-attention feature map of that class. In other embodiments, step 405 may include: performing global average pooling on the element-wise product of the class-specific spatial feature map and the self-attention feature map; and normalizing the result of the global average pooling to obtain the class score.
The multi-label classification model may then be trained based on the class scores and class labels in step 406. Step 406 is the same as or similar to step 105 described above in connection with fig. 1 and will not be described again here.
The training method of the multi-label classification model according to this further embodiment of the present application has been described above with reference to fig. 4 by way of example. Compared with adding an attention mechanism to the backbone network, adding a self-attention mechanism to the classification network (i.e., performing self-attention processing on the output of the backbone network) according to an embodiment of the present application allows weight parameters associated with each class of the multi-label task to be trained, improving the model's ability to capture, per class, the correlations between local features. By combining the class-specific spatial feature map and the self-attention feature map, the multi-label classification model can capture local features, long-range features, and feature correlations at different spatial positions, and thus has stronger multi-label classification capability.
It is also to be understood that the above description is intended to be illustrative and not restrictive. For example, in some embodiments of the training method 400 shown in fig. 4, lesion position information may additionally be used as the supervision signal for the second spatial feature map. With this arrangement, under the supervision of the lesion position information, the relationship between the local lesion features in the second spatial feature map and the disease class labels can be strengthened while the interference of lesion information on disease-class judgment under a non-absolute relationship is removed, so that the multi-label classification model not only highlights local lesion features but also correlates the relationships between features, which helps further improve the classification performance of the multi-label classification model.
FIG. 5 shows a schematic flow diagram of a training method of a multi-label classification model including a self-attention mechanism according to another embodiment of the application. As will be appreciated from the following description, the training method 500 illustrated in FIG. 5 may be one embodied representation of the training method 400 described hereinabove in connection with FIG. 4, and thus the description of the training method 400 described hereinabove in connection with FIG. 4 may also be applicable to the following description of the training method 500.
As shown in fig. 5, in the training method 500, a training image 201 may first be input into the backbone network 202 of the multi-label classification model for feature extraction to obtain an initial feature map X. The initial feature map X may then be input into the first convolution branch for processing to obtain a first spatial feature map z_1, and into the second convolution branch for processing to obtain a second spatial feature map z_2. After the first spatial feature map z_1 and the second spatial feature map z_2 are obtained, the first spatial feature map z_1 may be weighted based on the second spatial feature map z_2 to obtain a class-specific spatial feature map V.
In some embodiments, the classification network may further include a third convolution branch and a fourth convolution branch for implementing the self-attention mechanism. The third and fourth convolution branches may be convolution branches parallel to the first and second convolution branches. Specifically, the third and fourth convolution branches may be configured to process the initial feature map X to generate a feature map K (or K map) and a feature map Q (or Q map), respectively. The self-attention feature map z_3 may be obtained by performing a self-attention operation on the feature map K and the feature map Q. In some embodiments, the third and fourth convolution branches may each perform a 1×1 convolution operation, and the dimension of the convolution kernel may be M×M.
In other embodiments, the self-attention feature map z_3 is a class-specific self-attention feature map associated with a class, and its K and Q maps can be derived by:
K_c = X^T W_{Kc}    (Equation 6);
Q_c = X^T W_{Qc}    (Equation 7);
where K_c denotes the K map of class c, Q_c denotes the Q map of class c, X denotes the initial feature map, W_{Kc} and W_{Qc} are weight parameters of the self-attention mechanism, W_{Kc} denotes the weight matrix used to compute K_c, W_{Qc} denotes the weight matrix used to compute Q_c, z_{3c} denotes the self-attention feature map of class c, M denotes the number of channels of the initial feature map, and K_c^T denotes the transpose of K_c. W_{Kc} and W_{Qc} may be determined by training.
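A sketch of Equations 6 and 7 under these assumptions: the per-class weight matrices W_{Kc} and W_{Qc} are stored as (N, M, M) parameter tensors and applied with einsum. The subsequent combination of K_c and Q_c into the self-attention feature map z_{3c} follows the application's own self-attention equation, which is not reproduced in this text, so only the K/Q projections are shown:

```python
import torch
import torch.nn as nn

class ClassSpecificKQ(nn.Module):
    """Per-class K/Q projections of the initial feature map X (Equations 6 and 7)."""
    def __init__(self, num_classes: int, channels: int):
        super().__init__()
        # One M x M weight matrix per class, for K and for Q.
        self.W_K = nn.Parameter(torch.randn(num_classes, channels, channels) * 0.02)
        self.W_Q = nn.Parameter(torch.randn(num_classes, channels, channels) * 0.02)

    def forward(self, X):
        # X: (B, M, P) with P spatial positions; X^T in the equations is (P, M).
        Xt = X.transpose(1, 2)                            # (B, P, M)
        K = torch.einsum("bpm,nmd->bnpd", Xt, self.W_K)   # (B, N, P, M) per-class K maps
        Q = torch.einsum("bpm,nmd->bnpd", Xt, self.W_Q)   # (B, N, P, M) per-class Q maps
        return K, Q
```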
As further shown in fig. 5, after the self-attention feature map z_3 and the class-specific spatial feature map V are obtained, a feature map Z (or fusion feature map) may be obtained by combining the two. In some embodiments, the feature map Z may be obtained by multiplying the self-attention feature map z_3 with the class-specific spatial feature map V. Based on the feature map Z, the class score S may be output.
In one particular embodiment, the class score S of each class may be obtained by:
s_c = σ(GAP(z_c)) = σ(GAP(z_{1c} * σ(z_{2c}) * z_{3c}))    (Equation 9);
where s_c denotes the class score of class c, z_{1c} denotes the first spatial feature map of class c, z_{2c} denotes the second spatial feature map of class c, z_{3c} denotes the self-attention feature map of class c, σ denotes the Sigmoid function, * denotes element-wise multiplication of matrices, GAP denotes global average pooling, and z_c denotes the fused feature of class c.
On this basis, assuming that the initial feature map X is an M×P feature map, the first spatial feature map z_1 and the second spatial feature map z_2 of each class obtained through the first and second convolution branches are P×1 feature maps, and the resulting class-specific spatial feature map V of each class is also a P×1 feature map; the K map and the Q map obtained by passing the initial feature map X through the third and fourth convolution branches are P×M feature maps, and the resulting self-attention feature map z_3 is a P×1 feature map; the feature map Z of each class, obtained by multiplying the self-attention feature map z_3 with the class-specific spatial feature map V, is likewise a P×1 feature map. M denotes the number of channels of the initial feature map X, and P denotes the number of spatial positions of the initial feature map X.
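A sketch of Equation 9 given the shapes above; z3 is taken here as an input (its computation from the K and Q maps follows the application's self-attention equation, not reproduced in this text), and the function name is illustrative:

```python
import torch

def class_scores_with_self_attention(z1, z2, z3):
    """Equation 9: s_c = sigmoid(GAP(z1_c * sigmoid(z2_c) * z3_c)).

    z1, z2, z3: (B, N, P) per-class spatial maps flattened over the P spatial positions.
    Returns class scores of shape (B, N).
    """
    fused = z1 * torch.sigmoid(z2) * z3   # (B, N, P) fused feature map Z
    pooled = fused.mean(dim=-1)           # GAP over the P spatial positions
    return torch.sigmoid(pooled)          # class scores in [0, 1]
```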
The training method of the embodiment of the present application is described above with reference to the several drawings, and in order to facilitate understanding of the technical effects of the training method of the embodiment of the present application, several experimental examples will be described below.
Multi-label classification models with different structures were constructed, and a training set comprising a plurality of fundus images was used to train the different models on a fundus multi-disease classification task. The multi-label classification task includes 80 label classes; because the positive and negative samples of most label classes are imbalanced, with relatively few positive samples, mAP (Mean Average Precision, the mean of the average precision over all classes) is adopted as the evaluation metric in the following experimental examples. In addition, the backbone network used in the following experimental examples is EfficientNet-B4 in all cases. The experimental results are shown in Table 1 below.
Table 1:
in table 1, "backbone network+gap" is a conventional CNN structure, the class-specific spatial attention refers to a process of obtaining a class-specific spatial feature map after the first convolution branch and the second convolution branch are processed, the class-specific self-attention refers to a process of obtaining a self-attention feature map in the classification network by using a self-attention mechanism such as described in fig. 4 or fig. 5, and the lesion location information supervision refers to a process of supervising the second spatial feature map by using the lesion location information.
As can be seen from the results in Table 1, compared with the conventional CNN structure for the multi-label classification task (i.e., Experimental Example 1), the multi-label classification models of Experimental Examples 3 to 6 trained by the training method of the embodiment of the present application have better classification performance. Compared with multi-label classification using the CSRA module (i.e., Experimental Example 2), the multi-label classification models of Experimental Examples 3, 5 and 6 trained by the training method of the embodiment of the present application also perform better. Further, combining class-specific spatial attention with class-specific self-attention significantly improves the performance of the trained multi-label classification model. On this basis, combining the lesion position information supervision scheme further improves the model performance of the multi-label classification model.
It will be appreciated that after training, the various parameters in the multi-label classification model can be determined, including, for example, the first convolution weights, the second convolution weights, the weight parameters of the self-attention mechanism, and the like. After the multi-label classification model is trained, inference or prediction can be performed with the trained model to achieve the desired functions, such as image recognition and image classification. In the inference phase, the parameters determined in the training phase can be used directly. Accordingly, in yet another aspect, the present application provides a method of image-based multi-label classification, i.e., an inference method or detection method of the multi-label classification model. This is described below with reference to fig. 6.
FIG. 6 shows a flowchart of a method for image-based multi-label classification in accordance with an embodiment of the application. As shown in fig. 6, method 600 may include: in step 601, the image to be classified may be input into a multi-label classification model trained by the training method described above in connection with any of the embodiments of fig. 1-5. Next, in step 602, the multi-label classification model may be used to perform a classification operation on the image to be classified and output a classification result.
In some embodiments, the image to be classified may include a medical image or the like. In other embodiments, the image to be classified may include a fundus image or the like. The image to be classified is input into a multi-label classification model trained by the training method according to the embodiment of the present application; the backbone network in the multi-label classification model extracts the features of the image to be classified to generate an initial feature map. On this basis, the initial feature map can be classified with the parameters determined in the classification network of the trained multi-label classification model, so as to output the class score (i.e., the classification result) of each of the plurality of classes included in the multi-label classification task, where the class score may be a probability value.
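An inference sketch under the usual assumptions; the 0.5 threshold is illustrative, since the application only states that the class score may be a probability value:

```python
import torch

@torch.no_grad()
def classify_image(model, image, threshold=0.5):
    """Run the trained multi-label classification model on one preprocessed image tensor."""
    model.eval()
    scores = model(image.unsqueeze(0)).squeeze(0)   # (N,) per-class probabilities
    predicted_labels = (scores > threshold).nonzero(as_tuple=True)[0].tolist()
    return scores, predicted_labels
```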
The above-described aspects of embodiments of the present application may be implemented by means of program instructions. Thus, the present application also provides an apparatus for multi-tag classification, comprising: a processor for executing program instructions; and a memory storing program instructions that, when loaded and executed by the processor, cause the processor to perform the training method described above in connection with any of the embodiments of fig. 1-5 or to perform the method of image-based multi-label classification described above in connection with fig. 6.
Fig. 7 is a schematic block diagram illustrating a system for multi-label classification in accordance with an embodiment of the present application. The system 700 may include the device 701 according to an embodiment of the present application, and peripheral devices and external networks thereof, where the device 701 is used for training a multi-label classification model or for classifying an image to be classified, so as to implement the technical solutions of the embodiments of the present application described in any of the foregoing in connection with fig. 1 to 6.
As shown in fig. 7, the device 701 may include a CPU 7011, which may be a general-purpose CPU, a special-purpose CPU, or another execution unit for information processing and program execution. Further, the device 701 may also include a mass memory 7012 and a read-only memory (ROM) 7013, wherein the mass memory 7012 may be configured to store various kinds of data, including training images, lesion position information, weight parameters, classification results, and the like, as well as the various programs required to run the neural network, and the ROM 7013 may be configured to store the data required for power-on self-test of the device 701, initialization of the various functional modules in the system, drivers for basic input/output of the system, and booting of the operating system.
Further, the device 701 may also include other hardware platforms or components, such as the illustrated TPU 7014, GPU 7015, FPGA 7016, and MLU 7017. It will be appreciated that, although various hardware platforms or components are shown in the device 701, this is merely exemplary and not limiting, and a person skilled in the art may add or remove hardware as actually needed. For example, the device 701 may include only a CPU, or a CPU together with one of the other hardware platforms, for implementing the solutions of the present application.
The device 701 of the present application further comprises a communication interface 7018, whereby it may be connected to a local area network/wireless local area network (LAN/WLAN) 705 via the communication interface 7018, and further to a local server 706 or to the Internet 707 via the LAN/WLAN. Alternatively or additionally, the device 701 may also be connected directly to the Internet or a cellular network via the communication interface 7018 based on wireless communication technology, such as third-generation ("3G"), fourth-generation ("4G"), or fifth-generation ("5G") wireless communication technology. In some application scenarios, the device 701 may also access a server 708 of an external network and, possibly, a database 709 as needed, in order to obtain various known training data and the like, and may store various parameters or intermediate data remotely.
Peripheral devices of the device 701 may include a display device 702, an input device 703, and a data transmission interface 704. In one embodiment, the display device 702 may include, for example, one or more speakers and/or one or more visual displays, configured to provide voice prompts and/or visual display of the operation or classification results of the device of the present application. The input device 703 may include, for example, a keyboard, a mouse, a microphone, a gesture-capturing camera, or other input buttons or controls, configured to receive training data or user instructions. The data transmission interface 704 may include, for example, a serial interface, a parallel interface, a universal serial bus ("USB") interface, a small computer system interface ("SCSI"), Serial ATA, FireWire, PCI Express, a high-definition multimedia interface ("HDMI"), and the like, configured for data transfer and interaction with other devices or systems. According to aspects of the present application, the data transmission interface 704 may receive data such as training images or lesion position information, and transmit various types of data and results to the device 701.
The above-described CPU 7011, mass memory 7012, ROM 7013, TPU 7014, GPU 7015, FPGA 7016, MLU 7017, and communication interface 7018 of the device 701 may be interconnected through a bus 7019, through which data interaction with the peripheral devices can also be achieved. In one embodiment, the CPU 7011 may control the other hardware components in the device 701 and its peripherals through the bus 7019.
In operation, the CPU 7011 of the device 701 of the present application may receive training images or images to be classified via the input device 703 or the data transmission interface 704, and retrieve computer program instructions or code (e.g., code related to the neural network) stored in the mass memory 7012, so as to train the model on the received training data or to classify the received images, thereby obtaining the weight parameters of the trained multi-label classification model or the classification results. After the CPU 7011 determines a classification result by executing the program instructions, the classification result may be displayed on the display device 702 or output by means of a voice prompt. In addition, the device 701 may also upload the classification result to a network, such as the remote database 709, via the communication interface 7018.
It should also be appreciated that any module, unit, component, server, computer, terminal, or device executing instructions of the examples of the application may include or otherwise have access to a computer-readable medium, such as a storage medium, a computer storage medium, or a data storage device (removable and/or non-removable), e.g., a magnetic disk, an optical disc, or a magnetic tape. Computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information, such as computer-readable instructions, data structures, program modules, or other data.
Based on the foregoing, the present application also provides a computer-readable storage medium having stored thereon computer-readable instructions that, when executed by one or more processors, implement a training method as described above in connection with any of the embodiments of fig. 1-5 or a method of image-based multi-label classification as described above in connection with fig. 6.
The computer-readable storage medium may be any suitable magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), or a hybrid memory cube (HMC), or any other medium that may be used to store the desired information and that may be accessed by an application, a module, or both. Any such computer storage medium may be part of the device, accessible by the device, or connectable to the device. Any of the applications or modules described herein may be implemented using computer-readable/executable instructions stored or otherwise maintained on such computer-readable media.
Through the above description of the training method for the multi-label classification model and its multiple embodiments, those skilled in the art will understand that the training method of the present application processes the initial feature map through a plurality of convolution branches provided in the classification network of the multi-label classification model, thereby avoiding the drawback of over-emphasizing local positive features in the weighting operation and giving the trained multi-label classification model better classification performance.
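As a hedged sketch of what such a multi-branch classification network might look like in code, the following PyTorch module implements one reading of the two-branch weighting operation (1x1 convolutions for both branches, sigmoid normalization of the second spatial feature map, element-wise weighting, global average pooling, and a sigmoid class score); the channel count, class count, and layer choices are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class TwoBranchHead(nn.Module):
    """Class-specific spatial attention via two parallel 1x1-convolution branches
    (a sketch of the weighting operation; layer choices are assumptions)."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # First branch: first spatial feature map z_1c per class.
        self.branch1 = nn.Conv2d(in_channels, num_classes, kernel_size=1, bias=False)
        # Second branch: second spatial feature map z_2c per class.
        self.branch2 = nn.Conv2d(in_channels, num_classes, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z1 = self.branch1(x)                  # (B, C, H, W)
        z2 = self.branch2(x)                  # (B, C, H, W)
        attention = torch.sigmoid(z2)         # normalized per-class spatial attention scores
        v = z1 * attention                    # class-specific spatial feature map v_c
        logits = v.mean(dim=(2, 3))           # global average pooling
        return torch.sigmoid(logits)          # class scores s_c

head = TwoBranchHead(in_channels=2048, num_classes=20)
scores = head(torch.rand(1, 2048, 14, 14))    # (1, 20)
```

Because the first branch is weighted by a separately learned attention map rather than by its own activations, strongly positive local responses in the first branch cannot dominate the pooled score on their own, which is the effect described above.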
Further, in the training method of the embodiments of the application, lesion position information is introduced as a supervision signal for the second spatial feature map, so that the second convolution branch can learn local lesion features and the output second spatial feature map can highlight the local features at different spatial positions that are related to the class labels, thereby enhancing the ability of the multi-label classification model to capture the relationships among local features, lesion features, and disease categories. Furthermore, by incorporating a self-attention mechanism in the classification network, the ability of the multi-label classification model to capture the correlation between features at different spatial positions can be significantly improved, and thus the model performance of the multi-label classification model can be significantly improved.
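The two enhancements described above can also be sketched in code. The snippet below shows (i) one plausible way of supervising the second spatial feature map with a binary lesion-position mask, and (ii) one plausible per-class self-attention producing a spatial map z_3c from K and Q projections; both the exact loss form and the exact attention formula are assumptions of this sketch and are not quoted from the application.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def lesion_supervision_loss(z2: torch.Tensor, lesion_mask: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy between the second spatial feature map (logits) and a
    binary lesion-position mask resized to the feature resolution (an assumption)."""
    mask = F.interpolate(lesion_mask, size=z2.shape[-2:], mode="nearest")
    return F.binary_cross_entropy_with_logits(z2, mask)

class ClassSelfAttention(nn.Module):
    """One plausible per-class self-attention: K_c and Q_c come from 1x1 projections,
    spatial affinities softmax(Q_c K_c^T / sqrt(d)) re-weight the Q map to give a
    spatial map z_3c per class. The exact formula in the application may differ."""

    def __init__(self, in_channels: int, num_classes: int, dim: int = 64):
        super().__init__()
        self.num_classes, self.dim = num_classes, dim
        self.w_k = nn.Conv2d(in_channels, num_classes * dim, kernel_size=1, bias=False)
        self.w_q = nn.Conv2d(in_channels, num_classes * dim, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        k = self.w_k(x).view(b, self.num_classes, self.dim, h * w)    # K_c per class
        q = self.w_q(x).view(b, self.num_classes, self.dim, h * w)    # Q_c per class
        affinity = torch.softmax(
            torch.einsum("bcdi,bcdj->bcij", q, k) / math.sqrt(self.dim), dim=-1
        )                                                              # (B, C, HW, HW)
        z3 = torch.einsum("bcij,bcdj->bcdi", affinity, q).mean(dim=2)  # (B, C, HW)
        return z3.view(b, self.num_classes, h, w)                      # spatial map z_3c

# Illustrative usage with random stand-ins for the initial feature map and lesion masks.
x = torch.rand(2, 2048, 14, 14)                              # initial feature map
lesion_mask = (torch.rand(2, 20, 448, 448) > 0.9).float()    # hypothetical per-class masks
z3 = ClassSelfAttention(2048, num_classes=20)(x)             # (2, 20, 14, 14)
# With z1, z2 from the two convolution branches, the combined class score would follow
# sigmoid(GAP(z1 * sigmoid(z2) * z3)), and lesion_supervision_loss(z2, lesion_mask)
# would be added to the classification loss as the supervision term (an assumption).
```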
While various embodiments of the present application have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will occur to those skilled in the art without departing from the spirit and scope of the application. It should be understood that various alternatives to the embodiments of the application described herein may be employed in practicing the application. The appended claims are intended to define the scope of the application and are therefore to cover all equivalents or alternatives falling within the scope of these claims.

Claims (14)

1. A method of training a multi-label classification model, the multi-label classification model comprising a backbone network and a classification network, the classification network comprising a plurality of convolution branches, the training method comprising:
inputting a training image into the backbone network for feature extraction to output an initial feature map, wherein the training image is provided with a class label;
inputting the initial feature map into a first convolution branch and a second convolution branch of the classification network for processing, respectively, so as to output a first spatial feature map and a second spatial feature map, respectively;
performing a weighting operation on the first spatial feature map based on the second spatial feature map to obtain a class-specific spatial feature map;
outputting a class score of the training image based on the class-specific spatial feature map; and
training the multi-label classification model based on the class score and the class label.
2. The training method of claim 1, wherein the first convolution branch is configured to output a first spatial feature map for each of a plurality of classes and the second convolution branch is configured to output a second spatial feature map for each of the plurality of classes;
the weighting operation includes:
normalizing the second spatial feature map of each class to obtain a spatial attention score of each class; and
taking the spatial attention score of each class as a weight, and performing the weighting operation on the first spatial feature map of the corresponding class to obtain the class-specific spatial feature map of each class.
3. The training method of claim 1 or 2, wherein the first spatial feature map and the second spatial feature map are respectively obtained by the following processing:
z_{1c} = X^T w_{1c}
z_{2c} = X^T w_{2c}
wherein z_{1c} represents the first spatial feature map of class c, z_{2c} represents the second spatial feature map of class c, X represents the initial feature map, w_{1c} represents the first convolution weight of class c, and w_{2c} represents the second convolution weight of class c.
4. The training method of claim 3, wherein the class-specific spatial feature map is obtained by the following weighting operation:
v_c = z_{1c} * σ(z_{2c})
wherein v_c represents the class-specific spatial feature map of class c, z_{1c} represents the first spatial feature map of class c, z_{2c} represents the second spatial feature map of class c, σ represents the Sigmoid function, and * represents element-wise multiplication of matrices.
5. The training method of claim 1, wherein outputting the class score of the training image based on the class-specific spatial feature map comprises:
performing global average pooling on the class-specific spatial feature map; and
normalizing the result of the global average pooling to obtain the class score.
6. The training method of claim 1, further comprising:
acquiring lesion position information related to the class label in the training image; and
taking the lesion position information as a supervision signal for the second spatial feature map to train the multi-label classification model.
7. The training method of claim 1 or 6, further comprising:
processing the initial feature map by using a self-attention mechanism to obtain a self-attention feature map; and
outputting the class score of the training image based on the class-specific spatial feature map and the self-attention feature map.
8. The training method of claim 7, wherein the self-attention feature map is obtained by the following processing:
K_c = X^T W_{Kc}
Q_c = X^T W_{Qc}
wherein K_c represents the K map of class c, Q_c represents the Q map of class c, X represents the initial feature map, W_{Kc} represents the weight matrix for computing K_c, W_{Qc} represents the weight matrix for computing Q_c, z_{3c} represents the self-attention feature map, and M represents the number of channels of the initial feature map.
9. The training method of claim 7, wherein outputting the class score of the training image based on the class-specific spatial feature map and the self-attention feature map comprises:
performing global average pooling on the result of multiplying the class-specific spatial feature map by the self-attention feature map; and
normalizing the result of the global average pooling to obtain the class score.
10. The training method of claim 9, wherein the class score of each class is obtained by the following operation:
s_c = σ(GAP(z_{1c} * σ(z_{2c}) * z_{3c}))
wherein s_c represents the class score of class c, z_{1c} represents the first spatial feature map of class c, z_{2c} represents the second spatial feature map of class c, z_{3c} represents the self-attention feature map, σ represents the Sigmoid function, * represents element-wise multiplication of matrices, and GAP represents global average pooling.
11. The training method of claim 1, wherein the training image comprises a fundus image.
12. A method for multi-label classification based on images, comprising:
inputting an image to be classified into a multi-label classification model trained by the training method according to any one of claims 1-11; and
performing a classification operation on the image to be classified by using the multi-label classification model, and outputting a classification result.
13. An apparatus for multi-label classification, comprising:
a processor for executing program instructions; and
a memory storing the program instructions that, when loaded and executed by the processor, cause the processor to perform the training method according to any one of claims 1-11 or the method according to claim 12.
14. A computer readable storage medium having stored thereon computer readable instructions which, when executed by one or more processors, implement the training method of any of claims 1-11 or the method of claim 12.
CN202311015028.8A 2023-08-11 2023-08-11 Training method of multi-label classification model, multi-label classification method and product Pending CN116935142A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311015028.8A CN116935142A (en) 2023-08-11 2023-08-11 Training method of multi-label classification model, multi-label classification method and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311015028.8A CN116935142A (en) 2023-08-11 2023-08-11 Training method of multi-label classification model, multi-label classification method and product

Publications (1)

Publication Number Publication Date
CN116935142A true CN116935142A (en) 2023-10-24

Family

ID=88389718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311015028.8A Pending CN116935142A (en) 2023-08-11 2023-08-11 Training method of multi-label classification model, multi-label classification method and product

Country Status (1)

Country Link
CN (1) CN116935142A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197591A (en) * 2023-11-06 2023-12-08 青岛创新奇智科技集团股份有限公司 Data classification method based on machine learning
CN117197591B (en) * 2023-11-06 2024-03-12 青岛创新奇智科技集团股份有限公司 Data classification method based on machine learning

Similar Documents

Publication Publication Date Title
US20220092351A1 (en) Image classification method, neural network training method, and apparatus
CN109063719B (en) Image classification method combining structure similarity and class information
CN112131943A (en) Video behavior identification method and system based on dual attention model
CN111027576B (en) Cooperative significance detection method based on cooperative significance generation type countermeasure network
CN111723822B (en) RGBD image significance detection method and system based on multi-level fusion
CN111428664B (en) Computer vision real-time multi-person gesture estimation method based on deep learning technology
CN112927209B (en) CNN-based significance detection system and method
US11537843B2 (en) Data sharing system and data sharing method therefor
CN111967464B (en) Weak supervision target positioning method based on deep learning
CN116935142A (en) Training method of multi-label classification model, multi-label classification method and product
US20230020965A1 (en) Method and apparatus for updating object recognition model
CN113378675A (en) Face recognition method for simultaneous detection and feature extraction
CN112949469A (en) Image recognition method, system and equipment for face tampered image characteristic distribution
CN112307984A (en) Safety helmet detection method and device based on neural network
CN113033587B (en) Image recognition result evaluation method and device, electronic equipment and storage medium
CN116543250A (en) Model compression method based on class attention transmission
CN116580442A (en) Micro-expression recognition method, device, equipment and medium based on separable convolution
CN115358280A (en) Bearing signal fault diagnosis method, device, equipment and storage medium
CN110826726B (en) Target processing method, target processing device, target processing apparatus, and medium
CN117036658A (en) Image processing method and related equipment
CN115147668B (en) Training method of disease classification model, disease classification method and related products
CN116894943B (en) Double-constraint camouflage target detection method and system
WO2024066927A1 (en) Training method and apparatus for image classification model, and device
CN113408348B (en) Video-based face recognition method and device and storage medium
CN115375913A (en) Dense small target detection method based on IDT-YOLOv5-CBAM hybrid algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination