CN113516133A - Multi-modal image classification method and system

Multi-modal image classification method and system

Info

Publication number
CN113516133A
Authority
CN
China
Prior art keywords
mode
network model
fusion
modal
order
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110355430.5A
Other languages
Chinese (zh)
Other versions
CN113516133B (en)
Inventor
王勇
袁狄
何小宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202110355430.5A
Publication of CN113516133A
Application granted
Publication of CN113516133B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal image classification method and system. A plurality of single-modal network feature extraction modules are established, each comprising multiple cascaded residual modules; the last-order residual module is connected in sequence to a pooling layer, a fully connected layer and a softmax layer to obtain a single-modal network channel, and all single-modal network channels together form the single-modal network model. The residual modules of each order in the single-modal network model are fused by a cooperative attention mechanism to obtain a fused-modal network model, and the single-modal network model and the fused-modal network model are then fused to obtain the classification model. The invention improves classification accuracy.

Description

Multi-modal image classification method and system
Technical Field
The invention relates to the field of image processing, in particular to a multi-modal image classification method and system.
Background
Image classification based on deep learning is now widely applied: a deep learning model processes a captured image to determine the category of the object it contains. Existing image classification methods operate on a single-modality image, but a single modality cannot fully cover the characteristics of the target object, which limits classification accuracy.
Disclosure of Invention
The invention addresses the deficiencies of the prior art by providing a multi-modal image classification method and system that improve classification accuracy.
To this end, the invention adopts the following technical scheme: a multi-modal image classification method comprising the steps of:
S1, establishing a plurality of single-modal network feature extraction modules, wherein each single-modal network feature extraction module comprises multiple cascaded residual modules, and the last-order residual module is sequentially connected with a pooling layer, a fully connected layer and a softmax layer to obtain a single-modal network channel; all the single-modal network channels together form the single-modal network model;
S2, fusing the residual modules of each order in the single-modal network model by means of a cooperative attention mechanism to obtain a fused-modal network model;
S3, fusing the single-modal network model and the fused-modal network model to obtain a classification model.
The invention provides a multi-level fusion learning network structure that processes each single-modal feature through deep residual modules and propagates semantic information layer by layer. As the network structure deepens, the scale of the feature map gradually decreases, global and local detail information are integrated, and richer semantic information is provided. Meanwhile, the single-modal features supervise and fuse one another, continuously enriching and refining the single-modal feature information and improving the spatial resolution of the feature map, thereby improving image classification accuracy.
In step S1, the fully connected layer and classification layer of the ResNet50 network are removed, and the remaining cascaded residual modules form the single-modal network channel. This structure improves the expressive power of the network through sufficient depth and layer-by-layer feature learning: the cascaded residual modules extract higher-level image features, and from stage 1 to stage n the learned feature representations grow more complex as the network deepens, so the output feature maps carry richer semantic information.
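As an illustration, a minimal sketch of one such single-modal channel follows, assuming PyTorch and torchvision (the patent names no framework): the ResNet50 backbone keeps its cascaded residual stages, its original classifier head is dropped, and the pooling, fully connected and softmax layers described above are re-attached.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class SingleModalChannel(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep the stem and the four cascaded residual stages; drop the
        # original average-pooling and fully connected classifier head.
        self.stages = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.stages(x)                 # [B, 2048, H/32, W/32]
        logits = self.fc(self.pool(feat).flatten(1))
        return torch.softmax(logits, dim=1)   # per-class probabilities

Two such channels run in parallel, one per modality, and together they form the single-modal network model.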
The specific implementation of step S2 is as follows. For the 1st-order residual modules of all single-modal channels in the single-modal network model, a convolution operation is applied in turn to each extracted single-modal image feature, and each convolution result is passed through a Softmax activation function to obtain a spatial attention vector. Each spatial attention vector is multiplied by the corresponding single-modal image feature to obtain the 1st-order fusion feature; a convolution operation is applied to this fusion feature and the result is passed through a Sigmoid activation function to obtain the 1st-order fused-modal network model. For the nth-order residual modules (n greater than 1), the same convolution and Softmax operations yield spatial attention vectors, which are multiplied by the corresponding nth-order single-modal image features to obtain the nth-order fusion feature. The nth-order fusion feature is concatenated with the (n-1)th-order fusion feature, a convolution operation is applied to the concatenated feature, and the result is passed through a Sigmoid activation function to obtain the nth-order fused-modal network model. The final-order fused-modal network model is the fused-modal network model. This model efficiently fuses the crossed and complementary information among the single modalities at each stage, reduces the computational load of the network through information compression, emphasizes key image information through the attention mechanism, and suppresses noise irrelevant to image classification.
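The following sketch shows one fusion order under this scheme, assuming PyTorch and two modalities. Normalizing the attention logits across the two modalities at each spatial position is one plausible reading of the Softmax step (consistent with Eq. (6) in the embodiment below); kernel sizes, channel counts, and the resizing of the previous order's fusion are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderFusion(nn.Module):
    def __init__(self, in_ch: int, prev_ch: int = 0):
        super().__init__()
        self.attn_a = nn.Conv2d(in_ch, 1, kernel_size=1)  # modality A logits
        self.attn_b = nn.Conv2d(in_ch, 1, kernel_size=1)  # modality B logits
        self.mix = nn.Conv2d(in_ch + prev_ch, in_ch, kernel_size=3, padding=1)

    def forward(self, feat_a, feat_b, prev_fused=None):
        # Softmax across the two modalities at every spatial position gives
        # the spatial attention weights multiplied back onto each modality.
        w = torch.softmax(torch.cat([self.attn_a(feat_a),
                                     self.attn_b(feat_b)], dim=1), dim=1)
        fused = w[:, 0:1] * feat_a + w[:, 1:2] * feat_b   # order-n fusion
        if prev_fused is not None:
            # Resize the previous order's fusion to the current scale before
            # concatenation (an assumption; feature maps shrink per stage).
            prev_fused = F.adaptive_avg_pool2d(prev_fused, fused.shape[-2:])
            fused = torch.cat([fused, prev_fused], dim=1)
        return torch.sigmoid(self.mix(fused))

The first order would be built as OrderFusion(256) and called without prev_fused; later orders (e.g. OrderFusion(512, prev_ch=256)) concatenate the previous order's fusion before the convolution and Sigmoid.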
The specific implementation of step S3 is as follows. Difference scores between the output of the single-modal network model, respectively the output of the fused-modal network model, and the actual sample are computed through cosine-distance similarity; the scores are converted into a consistency weight distribution by a Tanh activation function and then normalized by a SoftMax function, giving the consistency weight parameters of the different models in the fusion process. Under these weight parameters, a first difference between the single-modal image features output by the single-modal network model and the actual label category, and a second difference between the fused-modal feature output of the fused-modal network model and the actual label category, are computed; the sum of their squares is taken as the loss function of network training. The loss function is minimized by an adaptive gradient descent method to update the network model parameters, yielding the trained network model, i.e. the classification model. This step integrates the output information of the single-modal and fused-modal network models, and the consistency weights dynamically adjust the importance of each model so as to capture better global information and improve the performance and generalization of the model on image classification tasks.
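A minimal sketch of this consistency-weighted loss, assuming PyTorch and one-hot targets; Adam stands in for the unspecified adaptive gradient descent method.

import torch
import torch.nn.functional as F

def consistency_loss(v_out, m_out, target_onehot):
    # Difference scores via cosine distance (1 - cosine similarity).
    d_v = 1.0 - F.cosine_similarity(v_out, target_onehot, dim=1).mean()
    d_m = 1.0 - F.cosine_similarity(m_out, target_onehot, dim=1).mean()
    # Tanh, then SoftMax normalization -> consistency weight parameters.
    lam = torch.softmax(torch.tanh(torch.stack([d_v, d_m])), dim=0)
    # Weighted sum of squared differences against the actual label category.
    return (lam[0] * (v_out - target_onehot).pow(2).sum(dim=1).mean()
            + lam[1] * (m_out - target_onehot).pow(2).sum(dim=1).mean())

# e.g. optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)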
In the invention, to simplify the model and improve classification efficiency, the number of single-modal network channels is 2.
The invention also provides a multi-modal image classification system comprising a computer device; the computer device is configured or programmed to perform the steps of the above method.
Compared with the prior art, the invention has the beneficial effects that:
1. Addressing the limitations of shallow image features, the invention provides a multi-level fusion learning network structure based on convolutional neural networks. The framework processes each single-modal feature through deep residual modules and propagates semantic information layer by layer. As the network deepens, the scale of the feature map gradually decreases, global and local detail information are integrated, and richer semantic information is provided. Meanwhile, the single-modal features supervise and fuse one another, continuously enriching and refining the single-modal feature information so as to improve the spatial resolution of the feature map.
2. On top of the multi-modal fusion, a fusion learning framework based on a cooperative attention mechanism is provided. The framework is independent of any specific single-modality network and can readily be embedded into mainstream backbone networks, which helps preserve the unique, exclusive characteristics of each modality. In addition, the framework maintains the similarity structures between and within modalities while accounting for modality cooperation and feature fusion, maximizes the consistency of the representations of the different modalities, and allows data to be propagated and shared across modalities, showing excellent performance.
Drawings
FIG. 1 is a block diagram of the single-modal network model according to an embodiment of the present invention;
FIG. 2 is a block diagram of the fused-modal network model according to an embodiment of the present invention;
FIG. 3 is a block diagram of the classification model according to an embodiment of the present invention.
Detailed Description
A multi-modal image classification method based on multi-level fusion learning and a cooperative attention mechanism comprises the following steps.
Taking the skin disease classification task as an example, classification and diagnosis of skin diseases are performed using information from two modalities: clinical images and dermoscopic images.
Step 1: a multi-level feature extraction network is used to obtain hierarchical single-modal features of the clinical image and of the dermoscopic image.
First, the two input single-modal images are encoded by feature extraction modules based on deep convolutional neural networks to generate deep image features. Specifically, two single-modal network feature extraction modules are established to extract clinical image features and dermoscopic image features respectively; each consists of a ResNet50 network with its fully connected layer and classification layer removed, and the two modules run in parallel. Each input single-modal image undergoes convolution and batch normalization, nonlinear mapping with the ReLU function (max(0, x)), and further compression encoding through a pooling layer to obtain preliminary image features, after which multi-level single-modal image features are extracted by a series of deep residual modules. The last deep residual block is sequentially connected to a pooling layer, a fully connected layer and a softmax layer for feature dimensionality reduction and compression, yielding a single-modal network model with parallel inputs.
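For illustration, a hedged sketch of this stage-wise extraction, assuming torchvision's ResNet50 layout (conv1/bn1/relu/maxpool stem followed by stages layer1-layer4); the per-stage outputs are what the later fusion steps consume order by order.

import torch
from torchvision.models import resnet50

def stagewise_features(backbone, x: torch.Tensor):
    # Stem: convolution, batch normalization, ReLU and pooling give the
    # preliminary (compression-encoded) image features.
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    feats = []
    for stage in (backbone.layer1, backbone.layer2,
                  backbone.layer3, backbone.layer4):
        x = stage(x)        # one order of cascaded deep residual modules
        feats.append(x)     # keep the multi-level single-modal features
    return feats            # channel widths: 256, 512, 1024, 2048

backbone = resnet50(weights=None)
levels = stagewise_features(backbone, torch.randn(1, 3, 224, 224))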
Step 2: a cooperative attention mechanism is used to obtain supervised fusion features.
For the single-modal image features extracted by the deep residual modules in Step 1, multi-modal feature fusion is realized with a cooperative attention mechanism. First, the extracted single-modal image features are concatenated, a spatial attention vector is computed by a scale perception module consisting of a convolution operation and a Softmax activation function, and the spatial attention vector is multiplied, in the concatenated spatial order, by each input single-modal feature, fusing the multi-modal features of the current stage. Then the fusion features extracted at the current deep residual module are concatenated with the fusion features of the previous deep residual module; an adaptive calibration module computes the fused channel feature map, a channel attention vector is generated through convolution and a Sigmoid activation function, and this vector is multiplied channel-wise with the fusion feature matrix, fusing the multi-modal features of the current stage with those of the previous stage. Carried out across all stages, this realizes the fusion learning of the single-modal features and yields the fused-modal network model.
Step 3: multi-modal skin disease image classification is realized by combining the multi-level feature extractor and the cooperative attention mechanism.
Based on the parallel single-modal network model of Step 1 and the fused-modal network model of Step 2, the clinical image and the dermoscopic image are first input into the parallel single-modal network model; the output single-modal image features are extracted by the single-modal network model of Step 1, and the output fused-modal image features are extracted by the fused-modal network model of Step 2. Difference scores between the model outputs and the actual sample are then computed through cosine-distance similarity and processed with Tanh and SoftMax functions to obtain the consistency weight parameters of the different models. Under these weight parameters, the sums of the squared differences between the single-modal image features, respectively the fused-modal feature outputs, and the actual label category are computed as the training loss, which is minimized by an adaptive gradient descent method to update the network model parameters and obtain the trained network model. In actual use, one only needs to input the several single-modal data to obtain the classification result output by the model.
Step 1 specifically comprises the following process:
First, the single-modal models are trained with the input single-modal images: each image is compression-encoded by a feature extractor based on a convolutional neural network to generate deep features. In the training stage, the input feature map of each layer is a three-dimensional array [h, w, d], where h and w are the feature map dimensions and d is the number of channels. The feature maps of adjacent layers are connected by a receptive field of size (L, H). For the convolution operation, let x_{ij} be the pixel value at position (i, j) of the previous layer and y_{ij} the pixel value at the corresponding position of the next layer; then

y_{ij} = Σ_{u=1}^{L} Σ_{v=1}^{H} w_{uv} x_{i+u-1, j+v-1} + b   (1)

where w_{uv} are the convolution kernel weights, b is the shared bias, and L = H = 3. After the convolution, nonlinear mapping is performed by a batch normalization layer followed by a ReLU function.
For the pooling operation,

y_{ij} = (1 / (L·H)) Σ_{u=1}^{L} Σ_{v=1}^{H} x_{i+u-1, j+v-1}   (2)

To reduce the feature loss caused by the pooling layer, the compression-encoding stage adopts average pooling rather than maximum pooling to obtain the initial single-modal image features. Multi-level image features are then extracted through a series of deep residual modules, where a residual block is defined as:
y = F(x, {W_i}) + x   (3)

where x and y denote the input and output of the current residual block, respectively, and F(x, {W_i}) is the residual mapping to be learned. When the dimensions of x and F differ, a linear projection W_s is applied to match them, as follows:

y = F(x, {W_i}) + W_s x   (4)
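A short sketch of the residual block of Eqs. (3) and (4), assuming PyTorch: the identity shortcut is kept when the dimensions match, and replaced by a 1x1 projection W_s otherwise.

import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.residual = nn.Sequential(              # F(x, {W_i})
            nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.project = None                         # W_s, only if needed
        if stride != 1 or in_ch != out_ch:
            self.project = nn.Conv2d(in_ch, out_ch, 1, stride, bias=False)

    def forward(self, x):
        shortcut = x if self.project is None else self.project(x)
        return F.relu(self.residual(x) + shortcut)  # y = F(x) + (W_s)x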
As the network structure deepens, the scale of the feature map gradually shrinks; extracting multi-level single-modal image features through the combination of deep residual modules integrates global and local detail information and provides richer semantic information. The extracted multi-level single-modal image features then undergo global average pooling and a fully connected layer, after which a softmax mapping yields the output of the single-modal network model:

S_i = e^{z_i} / Σ_j e^{z_j}   (5)

where z_i denotes the ith element of the output matrix and S_i is the probability of mapping to class i.
The cooperative attention mechanism of Step 2 is implemented as follows:
First, the multi-level single-modal image features A and B extracted by the current deep residual module are concatenated along the spatial dimension to obtain a preliminary fusion feature. A spatial attention mechanism based on the scale perception module, consisting of a convolution operation and a Softmax activation function, then dynamically selects suitably proportioned features from the preliminary fusion result and fuses them through self-learning, as follows:

U_i = (e^{A_i} / (e^{A_i} + e^{B_i})) · A_i + (e^{B_i} / (e^{A_i} + e^{B_i})) · B_i   (6)

where A_i and B_i denote the ith feature values of the respective single-modal image features. The Softmax weights are multiplied element-wise (dot product) with the original single-modal features to obtain the scale-aware (effective) features, which are finally summed into the supervised fusion feature map, realizing the fusion of the single-modal features at the current stage.
Then the fusion features of the current stage and of the previous stage are concatenated along the channel dimension, and a channel attention mechanism based on the adaptive calibration module generates one-dimensional excitation weights from the concatenation result to activate each channel, strengthening attention in the channel domain. The mechanism has three parts: a squeeze function, an excitation function, and a scale function. The squeeze function is

z_c = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)   (7)

where u_c is the cth channel of the H × W concatenated feature map; this sums and averages the feature values in each channel, realizing global average pooling. The excitation function is:
s = σ(g(z, W)) = σ(W_2 δ(W_1 z))   (8)
where σ denotes the Sigmoid activation function, δ denotes the ReLU function, and W_1 and W_2 have dimensions (C/r) × C and C × (C/r) respectively, with C the number of channels and r a scaling parameter; the adaptive calibration module thereby computes a one-dimensional channel attention vector. The scale function is:
x̃_c = s_c · u_c   (9)
This is essentially a rescaling process: each channel u_c is multiplied by its channel attention weight s_c, strengthening attention on the key channel domains and realizing the fusion of the current-stage and previous-stage multi-modal features.
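A compact sketch of the adaptive calibration module of Eqs. (7)-(9), assuming PyTorch; this is the standard squeeze-excitation-scale pattern the text describes, with r the scaling parameter.

import torch
import torch.nn as nn

class ChannelCalibration(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.w1 = nn.Linear(channels, channels // r)  # W_1: (C/r) x C
        self.w2 = nn.Linear(channels // r, channels)  # W_2: C x (C/r)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        z = u.mean(dim=(2, 3))                              # squeeze, Eq. (7)
        s = torch.sigmoid(self.w2(torch.relu(self.w1(z))))  # excite, Eq. (8)
        return u * s[:, :, None, None]                      # scale, Eq. (9)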
Step 3 comprises the following process:
Based on the single-modal network model and the fused-modal network model obtained in Steps 1 and 2, the final output features of the two models are flattened and concatenated so that the characteristics of both models are taken into account, and consistency weight parameters are designed to maintain semantic consistency between the modalities. A difference score between each model output and the actual sample is first computed through cosine-distance similarity, converted into a consistency weight distribution with a Tanh activation function, and then normalized with a SoftMax function, giving the consistency weight parameters of the different models in the fusion process.
During the training process, the specific loss function is as follows:
L(v, m, S) = λ_v L(v, S) + λ_m L(m, S)   (10)
where L(v, S) and L(m, S) denote the consistency losses between the output features of the single-modal model v, respectively the fused-modal model m, and the actual sample information S, computed as follows:
L(v, S) = Σ_i (v_i - S_i)^2   (11)

L(m, S) = Σ_i (m_i - S_i)^2   (12)
λ_v and λ_m denote the consistency loss weight parameters of the corresponding modalities. The resulting loss function is therefore:
L(v, m, S) = λ_v Σ_i (v_i - S_i)^2 + λ_m Σ_i (m_i - S_i)^2   (13)
In the training phase, the network parameters are updated by minimizing this loss with an adaptive gradient descent method. In the prediction phase, multi-modal images of the same target to be classified are input into the trained network, the single-modal and fused-modal features extracted by the network are weighted, and classification prediction is performed, completing the assisted classification of the multi-modal images.
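Tying the pieces together, a hedged sketch of one training step under the embodiment's two-modality setup; clinical_net, dermoscopy_net, fusion_net and consistency_loss refer to the hypothetical sketches above, and averaging the two single-modal outputs into a single prediction v is an illustrative assumption.

import torch
import torch.nn.functional as F

def train_step(clinical_net, dermoscopy_net, fusion_net, optimizer, batch):
    clin, derm, labels = batch                 # two modalities plus labels
    target = F.one_hot(labels, num_classes=5).float()
    v_out = 0.5 * (clinical_net(clin) + dermoscopy_net(derm))  # single-modal
    m_out = fusion_net(clin, derm)             # fused-modal prediction
    loss = consistency_loss(v_out, m_out, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()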
The embodiment of the invention also provides a multi-modal image classification system comprising a computer device; the computer device is configured or programmed to perform the steps of the method of the above embodiment.
In the invention, the computer device may be a microprocessor, a host computer, or similar equipment.

Claims (6)

1. A method of multi-modal image classification, comprising the steps of:
S1, establishing a plurality of single-modal network feature extraction modules, wherein each single-modal network feature extraction module comprises multiple cascaded residual modules, and the last-order residual module is sequentially connected with a pooling layer, a fully connected layer and a softmax layer to obtain a single-modal network channel; all the single-modal network channels together form the single-modal network model;
S2, fusing the residual modules of each order in the single-modal network model by means of a cooperative attention mechanism to obtain a fused-modal network model;
S3, fusing the single-modal network model and the fused-modal network model to obtain a classification model.
2. The multi-modal image classification method according to claim 1, wherein in step S1, the fully connected layer and the classification layer of the ResNet50 network are removed, and the remaining cascaded residual modules constitute the single-modal network channel.
3. The multi-modal image classification method according to claim 1, wherein step S2 is implemented as follows:
applying a convolution operation in turn to the single-modal image features extracted by the 1st-order residual modules of all single-modal channels in the single-modal network model, and inputting each convolution result into a Softmax activation function to obtain a spatial attention vector; multiplying each spatial attention vector by the single-modal image features extracted by the corresponding 1st-order residual module to obtain the 1st-order fusion feature; and applying a convolution operation to the fusion feature and inputting the result into a Sigmoid activation function to obtain the 1st-order fused-modal network model;
for the nth-order residual modules, applying a convolution operation in turn to the single-modal image features extracted by each nth-order residual module of all single-modal channels in the single-modal network model, and inputting each convolution result into a Softmax activation function to obtain a spatial attention vector; multiplying each spatial attention vector by the single-modal image features extracted by the corresponding nth-order residual module to obtain the nth-order fusion feature; and concatenating the nth-order fusion feature with the (n-1)th-order fusion feature, applying a convolution operation to the concatenated fusion feature, and inputting the convolution result into a Sigmoid activation function to obtain the nth-order fused-modal network model, wherein n is greater than 1; the final-order fused-modal network model being the fused-modal network model.
4. The multi-modal image classification method according to any one of claims 1 to 3, wherein step S3 is implemented as follows: computing, through cosine-distance similarity, the difference scores between the output of the single-modal network model, respectively the output of the fused-modal network model, and the actual sample; converting the difference scores into a consistency weight distribution with a Tanh activation function and then normalizing the distribution with a SoftMax function, so as to obtain the consistency weight parameters of the different models in the fusion process; computing, according to the consistency weight parameters, a first difference between the single-modal image features output by the single-modal network model and the actual label category and a second difference between the fused-modal feature output of the fused-modal network model and the actual label category; taking the sum of the squares of the first difference and the second difference as the loss function of network training; and minimizing the loss function by an adaptive gradient descent method to update the network model parameters, thereby obtaining the trained network model as the classification model.
5. The multi-modal image classification method according to any one of claims 1 to 3, wherein the number of the single-modal network channels is 2.
6. A multi-modal image classification system comprising a computer device, wherein the computer device is configured or programmed to carry out the steps of the method according to any one of claims 1 to 5.
CN202110355430.5A 2021-04-01 2021-04-01 Multi-modal image classification method and system Active CN113516133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110355430.5A CN113516133B (en) 2021-04-01 2021-04-01 Multi-modal image classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110355430.5A CN113516133B (en) 2021-04-01 2021-04-01 Multi-modal image classification method and system

Publications (2)

Publication Number Publication Date
CN113516133A true CN113516133A (en) 2021-10-19
CN113516133B CN113516133B (en) 2022-06-17

Family

ID=78062230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110355430.5A Active CN113516133B (en) 2021-04-01 2021-04-01 Multi-modal image classification method and system

Country Status (1)

Country Link
CN (1) CN113516133B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332592A (en) * 2022-03-11 2022-04-12 中国海洋大学 Ocean environment data fusion method and system based on attention mechanism
CN114638994A (en) * 2022-05-18 2022-06-17 山东建筑大学 Multi-modal image classification system and method based on attention multi-interaction network
CN115546217A (en) * 2022-12-02 2022-12-30 中南大学 Multi-level fusion skin disease diagnosis system based on multi-mode image data
WO2023098636A1 (en) * 2021-11-30 2023-06-08 Huawei Technologies Co., Ltd. Method, device, and medium for adaptive inference in compressed video domain

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710830A (en) * 2018-04-20 2018-10-26 浙江工商大学 A kind of intensive human body 3D posture estimation methods for connecting attention pyramid residual error network and equidistantly limiting of combination
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
CN110674677A (en) * 2019-08-06 2020-01-10 厦门大学 Multi-mode multi-layer fusion deep neural network for anti-spoofing of human face
CN111192200A (en) * 2020-01-02 2020-05-22 南京邮电大学 Image super-resolution reconstruction method based on fusion attention mechanism residual error network
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
CN111709265A (en) * 2019-12-11 2020-09-25 深学科技(杭州)有限公司 Camera monitoring state classification method based on attention mechanism residual error network
US20200364870A1 (en) * 2019-05-14 2020-11-19 University-Industry Cooperation Group Of Kyung Hee University Image segmentation method and apparatus, and computer program thereof

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
US20200089755A1 (en) * 2017-05-19 2020-03-19 Google Llc Multi-task multi-modal machine learning system
CN108710830A (en) * 2018-04-20 2018-10-26 浙江工商大学 A kind of intensive human body 3D posture estimation methods for connecting attention pyramid residual error network and equidistantly limiting of combination
US20200364870A1 (en) * 2019-05-14 2020-11-19 University-Industry Cooperation Group Of Kyung Hee University Image segmentation method and apparatus, and computer program thereof
CN110674677A (en) * 2019-08-06 2020-01-10 厦门大学 Multi-mode multi-layer fusion deep neural network for anti-spoofing of human face
CN111709265A (en) * 2019-12-11 2020-09-25 深学科技(杭州)有限公司 Camera monitoring state classification method based on attention mechanism residual error network
CN111192200A (en) * 2020-01-02 2020-05-22 南京邮电大学 Image super-resolution reconstruction method based on fusion attention mechanism residual error network
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
X. He, Y. Deng, L. Fang and Q. Peng: "Multi-Modal Retinal Image Classification With Modality-Specific Attention Network", IEEE *
Chen Sijia: "Research on Multi-modal Data Modeling and Retrieval for Common Space Learning", China Master's Theses Full-text Database (Basic Sciences) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023098636A1 (en) * 2021-11-30 2023-06-08 Huawei Technologies Co., Ltd. Method, device, and medium for adaptive inference in compressed video domain
CN114332592A (en) * 2022-03-11 2022-04-12 中国海洋大学 Ocean environment data fusion method and system based on attention mechanism
CN114638994A (en) * 2022-05-18 2022-06-17 山东建筑大学 Multi-modal image classification system and method based on attention multi-interaction network
CN115546217A (en) * 2022-12-02 2022-12-30 中南大学 Multi-level fusion skin disease diagnosis system based on multi-mode image data

Also Published As

Publication number Publication date
CN113516133B (en) 2022-06-17

Similar Documents

Publication Publication Date Title
CN113516133B (en) Multi-modal image classification method and system
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN113221969A (en) Semantic segmentation system and method based on Internet of things perception and based on dual-feature fusion
CN111507521A (en) Method and device for predicting power load of transformer area
CN112489164B (en) Image coloring method based on improved depth separable convolutional neural network
CN113240683B (en) Attention mechanism-based lightweight semantic segmentation model construction method
CN113780292A (en) Semantic segmentation network model uncertainty quantification method based on evidence reasoning
CN113239897B (en) Human body action evaluation method based on space-time characteristic combination regression
CN113486190A (en) Multi-mode knowledge representation method integrating entity image information and entity category information
CN111832637B (en) Distributed deep learning classification method based on alternating direction multiplier method ADMM
CN114418030A (en) Image classification method, and training method and device of image classification model
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN114581502A (en) Monocular image-based three-dimensional human body model joint reconstruction method, electronic device and storage medium
CN114821050A (en) Named image segmentation method based on transformer
Qi et al. Learning low resource consumption cnn through pruning and quantization
CN116030537B (en) Three-dimensional human body posture estimation method based on multi-branch attention-seeking convolution
CN116797850A (en) Class increment image classification method based on knowledge distillation and consistency regularization
CN116977631A (en) Streetscape semantic segmentation method based on DeepLabV3+
CN116797821A (en) Generalized zero sample image classification method based on fusion visual information
CN112561050A (en) Neural network model training method and device
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
Ren The advance of generative model and variational autoencoder
CN112990041B (en) Remote sensing image building extraction method based on improved U-net
CN115063374A (en) Model training method, face image quality scoring method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant