CN113516133B - Multi-modal image classification method and system - Google Patents

Multi-modal image classification method and system

Info

Publication number
CN113516133B
CN113516133B (Application No. CN202110355430.5A)
Authority
CN
China
Prior art keywords
mode
network model
fusion
order
mode network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110355430.5A
Other languages
Chinese (zh)
Other versions
CN113516133A (en)
Inventor
王勇
袁狄
何小宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202110355430.5A priority Critical patent/CN113516133B/en
Publication of CN113516133A publication Critical patent/CN113516133A/en
Application granted granted Critical
Publication of CN113516133B publication Critical patent/CN113516133B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/047 - Probabilistic or stochastic networks
    • G06N3/048 - Activation functions
    • G06N3/08 - Learning methods

Abstract

The invention discloses a multi-modal image classification method and system. A plurality of single-mode network feature extraction modules are established, each comprising multiple cascaded residual modules; the last-order residual module is connected in sequence to a pooling layer, a fully connected layer and a softmax layer to obtain a single-mode network channel, and all single-mode network channels together form the single-mode network model. The residual modules of each order in the single-mode network model are fused by a cooperative attention mechanism to obtain a fusion-mode network model. Finally, the single-mode network model and the fusion-mode network model are fused to obtain the classification model. The invention can improve classification accuracy.

Description

Multi-modal image classification method and system
Technical Field
The invention relates to the field of image processing, in particular to a multi-modal image classification method and system.
Background
Currently, image classification based on deep learning is widely applied: a deep learning model processes a captured image to determine the category of the object it contains. Existing image classification methods process a single-mode image, but a single-mode image cannot fully cover the characteristics of the target object, which limits classification accuracy.
Disclosure of Invention
The invention aims to overcome the deficiencies of the prior art by providing a multi-modal image classification method and system that improve classification accuracy.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a method of multi-modal image classification comprising the steps of:
S1, establishing a plurality of single-mode network feature extraction modules, wherein each single-mode network feature extraction module comprises multi-order cascaded residual modules, and the last-order residual module is sequentially connected with a pooling layer, a fully connected layer and a softmax layer to obtain a single-mode network channel; all the single-mode network channels form a single-mode network model;
S2, fusing the residual modules of each order in the single-mode network model by utilizing a cooperative attention mechanism to obtain a fusion-mode network model;
S3, fusing the single-mode network model and the fusion-mode network model to obtain a classification model.
The invention provides a multi-level fusion learning network structure that processes each single-mode feature with deep residual modules and transmits semantic information layer by layer. As the network deepens, the scale of the feature map gradually decreases, global and local detail information are integrated, and richer semantic information is provided. Meanwhile, the single-mode features supervise and fuse each other, continuously enriching and refining the single-mode feature information and improving the spatial resolution of the feature map, thereby improving image classification accuracy.
In step S1, the fully connected layer and the classification layer of the ResNet50 network are removed, and the remaining cascaded residual modules form the single-mode network channel. This structure improves the expressive power of the network through sufficient depth and layer-by-layer feature learning; the cascaded residual modules extract higher-level image features, and from stage 1 to stage n the learned feature representation becomes more complex as the network deepens, so the semantic information contained in the output feature map becomes richer.
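By way of illustration, one single-mode network channel could be sketched in PyTorch roughly as follows. The class and variable names (e.g. SingleModalChannel) are placeholders rather than identifiers from the patent, a recent torchvision is assumed, and the stem follows the later note that average pooling replaces max pooling in the compression-encoding stage.

```python
# Illustrative sketch of one single-mode network channel built from a ResNet50 backbone.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SingleModalChannel(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet50(weights=None)  # fully connected / classification head is not reused
        # Stem: convolution, batch norm, ReLU; average pooling replaces max pooling
        # in the compression-encoding stage, as described in the specification.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  nn.AvgPool2d(kernel_size=3, stride=2, padding=1))
        # Four cascaded residual stages (the multi-order cascaded residual modules).
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        # Last-order residual module followed by pooling, a fully connected layer and softmax.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x):
        x = self.stem(x)
        stage_features = []           # per-stage features, later consumed by the fusion branch
        for stage in self.stages:
            x = stage(x)
            stage_features.append(x)
        logits = self.fc(self.pool(x).flatten(1))
        return torch.softmax(logits, dim=1), stage_features
```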
The specific implementation process of step S2 includes: for the 1st order, the single-mode image features extracted by the 1st-order residual module of each single-mode channel in the single-mode network model are each passed through a convolution operation, and each convolution result is fed into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each 1st-order residual module to obtain the 1st-order fusion features, a convolution operation is applied to the fusion features, and the convolution result is fed into a Sigmoid activation function to obtain the 1st-order fusion-mode network model. For the nth order, the single-mode image features extracted by the nth-order residual module of each single-mode channel are each passed through a convolution operation, and each convolution result is fed into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each nth-order residual module to obtain the nth-order fusion features; the nth-order fusion features are spliced with the (n-1)th-order fusion features, a convolution operation is applied to the spliced fusion features, and the convolution result is fed into a Sigmoid activation function to obtain the nth-order fusion-mode network model, where n is greater than 1. The final-order fusion-mode network model is the fusion-mode network model. This model efficiently fuses the cross and complementary information among the single modes at each stage, reduces the computational load of the network model through information compression, emphasizes key image information through the attention mechanism, and suppresses noise information irrelevant to image classification.
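A minimal sketch of one order of the fusion branch is given below. Since the original equation images are not reproduced, the choice of sigmoid for the spatial-attention activation, of 1x1 convolutions, and of channel concatenation for combining modalities are all assumptions; the class name FusionStage is a placeholder.

```python
# Illustrative sketch of one order of the fusion-mode branch (two single-mode channels).
import torch
import torch.nn as nn

class FusionStage(nn.Module):
    def __init__(self, in_channels: int, fused_channels: int, prev_channels: int = 0):
        super().__init__()
        # One 1x1 convolution per modality to produce a spatial attention map (assumed form).
        self.attn_conv_a = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.attn_conv_b = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.out_conv = nn.Conv2d(2 * in_channels + prev_channels, fused_channels, kernel_size=1)

    def forward(self, feat_a, feat_b, prev_fused=None):
        # Spatial attention vectors for each single-mode feature.
        attn_a = torch.sigmoid(self.attn_conv_a(feat_a))
        attn_b = torch.sigmoid(self.attn_conv_b(feat_b))
        fused = torch.cat([attn_a * feat_a, attn_b * feat_b], dim=1)
        if prev_fused is not None:
            # Splice the current-order fusion features with the (n-1)th-order fusion features.
            prev_fused = nn.functional.interpolate(prev_fused, size=fused.shape[-2:])
            fused = torch.cat([fused, prev_fused], dim=1)
        # Convolution followed by Sigmoid yields the fusion-mode output of this order.
        return torch.sigmoid(self.out_conv(fused))
```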
The specific implementation process of step S3 includes: the difference scores between the output of the single-mode network model, the output of the fusion-mode network model, and the actual sample are computed via cosine distance similarity; an activation function converts the difference scores into a consistency weight distribution, which is then normalized to obtain the consistency weight parameters of the different models in the fusion process. According to the consistency weight parameters, a first difference between the single-mode image features output by the single-mode network model and the actual label category, and a second difference between the fusion-mode feature output of the fusion-mode network model and the actual label category, are computed; the sum of squares of the first and second differences is used as the loss function for network training, the loss function is minimized by an adaptive gradient descent method, and the network model parameters are updated to obtain the trained network model, i.e., the classification model. This step integrates the output information of the single-mode network model and the fusion-mode network model and dynamically adjusts the importance of each model through the consistency weights, thereby obtaining better global information and improving the performance and generalization ability of the model on image classification tasks.
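The consistency-weighted loss could be sketched as below. The unspecified activation and normalization functions (their formula images are not reproduced) are assumed here to be tanh and softmax, and the function and variable names are illustrative only.

```python
# Sketch of the consistency-weighted training loss described in step S3 (assumptions noted above).
import torch
import torch.nn.functional as F

def consistency_weighted_loss(out_single, out_fused, target_onehot):
    # Difference scores between each model output and the actual sample, via cosine similarity.
    score_single = 1.0 - F.cosine_similarity(out_single, target_onehot, dim=1).mean()
    score_fused = 1.0 - F.cosine_similarity(out_fused, target_onehot, dim=1).mean()
    scores = torch.stack([score_single, score_fused])
    # Convert the scores into a consistency weight distribution, then normalize it.
    weights = torch.softmax(torch.tanh(scores), dim=0)
    # Weighted sum of squared differences between each output and the actual label category.
    loss_single = weights[0] * ((out_single - target_onehot) ** 2).sum(dim=1).mean()
    loss_fused = weights[1] * ((out_fused - target_onehot) ** 2).sum(dim=1).mean()
    return loss_single + loss_fused
```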
In the invention, in order to simplify the model and improve the classification efficiency, the number of the single-mode network channels is 2.
The invention also provides a multi-modal image classification system, which comprises computer equipment; the computer device is configured or programmed for performing the steps of the above-described method.
Compared with the prior art, the invention has the beneficial effects that:
1. Aiming at the limitation of shallow image features, the invention provides a multi-level fusion learning network structure based on a convolutional neural network. The framework processes each single-mode feature with deep residual modules and transmits semantic information layer by layer. As the network deepens, the scale of the feature map gradually decreases, global and local detail information are integrated, and richer semantic information is provided. Meanwhile, the single-mode features supervise and fuse each other, continuously enriching and refining the single-mode feature information and improving the spatial resolution of the feature map.
2. A fusion learning framework based on a cooperative attention mechanism is built on top of the multi-mode fusion. The framework is independent of the specific single-mode networks and can be readily embedded into mainstream backbone networks, which helps preserve the unique characteristics and exclusivity of each single mode. In addition, the framework maintains the similarity structure between and within modalities while accounting for modality cooperation and feature fusion, maximizes the consistency of the representations of the different modalities, and allows data to be transmitted and shared among modalities, showing excellent performance.
Drawings
FIG. 1 is a block diagram of a monomodal network model according to an embodiment of the present invention;
FIG. 2 is a block diagram of a converged modal network model architecture according to an embodiment of the present invention;
FIG. 3 is a block diagram of a classification model according to an embodiment of the present invention.
Detailed Description
A multi-modal image classification method based on multi-level fusion learning and cooperative attention mechanism comprises the following steps:
Taking the skin disease classification task as an example, classification diagnosis of skin diseases is performed using information from two modalities: a clinical image and a dermoscopy image.
Step one: a multi-level feature extraction network is used to obtain hierarchical clinical-image single-mode features and dermoscopy-image single-mode features.
First, the two input single-mode images are encoded by feature extraction modules based on deep convolutional neural networks to generate deep image features. Specifically, two single-mode network feature extraction modules are established, used respectively to extract clinical image features and dermoscopy image features. Each feature extraction module consists of a ResNet50 network with its fully connected layer and classification layer removed, and the two modules operate in parallel. Each input single-mode image first undergoes convolution and batch normalization, followed by a nonlinear mapping through an activation function (a ReLU function, per equation (1) below); the single-mode image information is then further compressed and encoded by a pooling layer to obtain the initial image features. Multi-level single-mode image features are extracted through a series of cascaded deep residual modules, and the last deep residual block is connected in sequence to a pooling layer, a fully connected layer and a softmax layer for feature dimensionality reduction and compression, yielding a single-mode network model with parallel inputs.
Step two: a cooperative attention mechanism is used to obtain supervised fusion features.
The single-mode image features extracted by the 1st-order residual module of each single-mode channel in the single-mode network model are each passed through a convolution operation, and each convolution result is fed into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each 1st-order residual module to obtain the 1st-order fusion features, a convolution operation is applied to the fusion features, and the convolution result is fed into a Sigmoid activation function to obtain the 1st-order fusion-mode network model.
For the nth order, the single-mode image features extracted by the nth-order residual module of each single-mode channel are each passed through a convolution operation, and each convolution result is fed into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each nth-order residual module to obtain the nth-order fusion features; the nth-order fusion features are spliced with the (n-1)th-order fusion features, a convolution operation is applied to the spliced fusion features, and the convolution result is fed into a Sigmoid activation function to obtain the nth-order fusion-mode network model, where n is greater than 1. The final-order fusion-mode network model is the fusion-mode network model.
Step three: multi-modal skin disease image classification is realized by combining the multi-level feature extractor with the cooperative attention mechanism.
Based on the parallel single-mode network model and the fusion-mode network model obtained in steps one and two, the clinical image and the dermoscopy image are first fed in sequence into the parallel single-mode network model; the single-mode image features are extracted by the single-mode network model obtained in step one, and the fusion-mode image features are extracted by the fusion-mode network model obtained in step two. The difference scores between the model outputs and the actual sample are then computed via cosine distance similarity, and the difference scores are processed by an activation function and a normalization function to obtain the consistency weight parameters of the different models. According to the consistency weight parameters, the sums of squared differences between the single-mode image features and the actual label category and between the fusion-mode feature output and the actual label category are computed as the training loss, which is minimized by an adaptive gradient descent method to update the network model parameters and obtain the trained network model. In actual use, only the several single-mode data need to be input to obtain the classification result output by the model.
The first step specifically comprises the following processes:
First, a single-mode model is trained with the multiple input single-mode images; the images are compression-encoded by a feature extractor based on a convolutional neural network to generate deep features. In the training stage, the input feature map of each layer is represented by a three-dimensional array of size h × w × d, where h and w are the spatial size of the feature map and d is the number of channels; adjacent layers are connected through convolution kernels of size L × H. For the convolution operation, let x_{i,j} be the pixel value at position (i, j) of the previous layer and y_{i,j} the pixel value at the corresponding position of the next layer; then

y_{i,j} = (w * x)_{i,j} + b = Σ_{l=1}^{L} Σ_{h=1}^{H} w_{l,h} · x_{i+l-1, j+h-1} + b    (1)

where * denotes convolution, b is the shared bias, and L = H = 3. The convolution is followed by a batch normalization layer and then a ReLU function for nonlinear mapping.
For the pooling operation,

y_{i,j} = (1 / (L·H)) Σ_{l=1}^{L} Σ_{h=1}^{H} x_{(i-1)·L+l, (j-1)·H+h}    (2)

where L = H = 2. To reduce the feature loss caused by the pooling layer, the compression-encoding stage uses average pooling instead of max pooling to obtain the preliminary features of the single-mode image. Multi-level image features are then extracted through a series of deep residual modules, where a residual block is defined as

y = F(x, {W_i}) + x    (3)

where x and y respectively denote the input and output of the current residual block and F(x, {W_i}) is the residual mapping to be learned. When the dimensions of x and F differ, a linear projection W_s is used to match the dimensions, as follows:

y = F(x, {W_i}) + W_s · x    (4)
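A minimal residual block matching equations (3) and (4) is sketched below; the layer arrangement and names are illustrative rather than the patent's own code.

```python
# Residual block: y = F(x) + x, with a 1x1 projection W_s when dimensions differ (eq. 4).
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # Linear projection W_s, used only when input and residual dimensions differ (eq. 4).
        self.projection = None
        if stride != 1 or in_channels != out_channels:
            self.projection = nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x if self.projection is None else self.projection(x)
        return self.relu(self.residual(x) + identity)
```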
As the network structure deepens, the scale of the feature map gradually decreases; extracting multi-level single-mode image features through the combination of deep residual modules integrates global and local details and provides richer semantic information. The extracted multi-level single-mode image features then undergo global average pooling and pass through a fully connected layer, after which the output of the single-mode network model is obtained by a softmax mapping:

p_i = e^{z_i} / Σ_j e^{z_j}    (5)

where z_i denotes the i-th element of the output matrix and p_i is the probability of mapping to class i.
The cooperative attention mechanism in step two is implemented as follows:
First, the multi-level single-mode image features extracted by the current deep residual modules are spliced along the spatial dimension to obtain preliminary fusion features. Based on the preliminary fusion result, a spatial attention mechanism built on a scale-aware module dynamically selects features in appropriate proportions and fuses them through self-learning:

F_fuse = Σ_k α_k ⊙ F_k    (6)

where α_k denotes the attention weight assigned to the k-th single-mode image feature F_k. Each attention weight is applied to the original single-mode feature by element-wise (dot) product to obtain the scale-aware (effective) feature, and the results are finally summed to obtain the supervised-fusion feature map, realizing the fusion of the multiple single-mode features at the current stage.
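The scale-aware fusion of equation (6) might be sketched as follows. The way the attention maps α_k are computed is an assumption (the equation image is lost), and concatenation is done along the channel dimension for simplicity rather than the spatial splicing described above; ScaleAwareFusion is a placeholder name.

```python
# Sketch of equation (6): per-modality attention maps applied by element-wise product, then summed.
import torch
import torch.nn as nn

class ScaleAwareFusion(nn.Module):
    def __init__(self, channels: int, num_modalities: int = 2):
        super().__init__()
        # One attention map per modality, computed from the concatenated preliminary fusion.
        self.attn = nn.Conv2d(num_modalities * channels, num_modalities, kernel_size=1)

    def forward(self, feats):                             # feats: list of (B, C, H, W) tensors
        concat = torch.cat(feats, dim=1)                  # preliminary fusion by concatenation
        alpha = torch.softmax(self.attn(concat), dim=1)   # weights alpha_k over the modalities
        fused = sum(alpha[:, k:k + 1] * feats[k] for k in range(len(feats)))
        return fused                                      # supervised-fusion feature map
```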
Then, the fusion features of the current stage and the previous stage are spliced along the channel dimension, and a channel attention mechanism based on an adaptive calibration module generates a one-dimensional excitation weight from the splicing result to activate each channel, strengthening attention on the channel domain. The mechanism consists of three parts: a squeeze function, an excitation function and a scale function. The squeeze function is

z_c = (1 / (H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)    (7)

which sums and averages the feature values within each channel of the spliced features, i.e., global average pooling. The excitation function is

s = σ( W_2 · δ( W_1 · z ) )    (8)

where σ denotes the sigmoid activation function, δ denotes the ReLU function, and W_1 and W_2 are weight matrices of dimensions (C/r) × C and C × (C/r) respectively, with C the number of channels and r a scaling parameter; the adaptive calibration module thus computes a one-dimensional channel attention vector. The scale function is

x̃_c = s_c · u_c    (9)

which is essentially a rescaling: each channel u_c is multiplied by its channel attention weight s_c, strengthening attention on the key channel domain and realizing the fusion of the multi-modal features of the current stage and the previous stage.
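A compact sketch of the squeeze / excitation / scale steps in equations (7)-(9) follows; the class name and reduction ratio default are illustrative choices, not values from the patent.

```python
# Channel attention matching equations (7)-(9): squeeze, excitation, scale.
import torch
import torch.nn as nn

class ChannelRecalibration(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)   # W1: (C/r) x C
        self.fc2 = nn.Linear(channels // r, channels)   # W2: C x (C/r)

    def forward(self, x):                    # x: (B, C, H, W) spliced fusion features
        z = x.mean(dim=(2, 3))               # squeeze: global average pooling, eq. (7)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # excitation, eq. (8)
        return x * s.unsqueeze(-1).unsqueeze(-1)               # scale each channel, eq. (9)
```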
The third step comprises the following processes:
Based on the single-mode network model and the fusion-mode network model obtained in steps one and two, the final output features of the two models are flattened and concatenated, integrating the features of both models, and consistency weight parameters are designed to maintain semantic consistency between the models. First, the difference scores between the model outputs and the actual sample are computed via cosine distance similarity; an activation function converts the difference scores into a consistency weight distribution, which is then normalized to obtain the consistency weight parameters of the different models in the fusion process.
During training, the loss function is as follows:

L = L_v + L_m    (10)

where the consistency losses of the single-mode branch v and the fusion-mode branch m are computed from their output features y_v and y_m and the actual sample information S:

L_v = λ_v · ‖y_v − S‖²,   L_m = λ_m · ‖y_m − S‖²

where λ_v and λ_m denote the consistency-loss weight parameters of the corresponding modalities. The resulting total loss is therefore

L = λ_v · ‖y_v − S‖² + λ_m · ‖y_m − S‖²
in the model training phase, the network parameters are updated using an adaptive gradient descent method by minimizing this loss. And in the model prediction stage, multi-modal images of the same target to be classified are input into a trained complete network, and classification prediction is carried out after single-modal features and fusion modal features extracted by a network model are weighted, so that auxiliary classification of the multi-modal images is completed.
The embodiment of the invention also provides a multi-modal image classification system, which comprises computer equipment; the computer device is configured or programmed to perform the steps of the method of the above-described embodiment.
In the invention, the computer device may be a microprocessor, a host computer, or other equipment.

Claims (5)

1. A method of multi-modal image classification, comprising the steps of:
S1, establishing a plurality of single-mode network feature extraction modules, wherein each single-mode network feature extraction module comprises multi-order cascaded residual modules, and the last-order residual module is sequentially connected with a pooling layer, a fully connected layer and a softmax layer to obtain a single-mode network channel; all the single-mode network channels form a single-mode network model;
S2, fusing the residual modules of each order in the single-mode network model by utilizing a cooperative attention mechanism to obtain a fusion-mode network model; the specific implementation process of step S2 includes:
the single-mode image features extracted by the 1st-order residual module of each single-mode channel in the single-mode network model are each passed through a convolution operation, and each convolution result is fed into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each 1st-order residual module to obtain the 1st-order fusion features, a convolution operation is applied to the fusion features, and the convolution result is fed into a Sigmoid activation function to obtain the 1st-order fusion-mode network model;
for the nth order, the single-mode image features extracted by the nth-order residual module of each single-mode channel are each passed through a convolution operation, and each convolution result is fed into an activation function to obtain a spatial attention vector; the spatial attention vector is multiplied by the single-mode image features extracted by each nth-order residual module to obtain the nth-order fusion features; the nth-order fusion features are spliced with the (n-1)th-order fusion features, a convolution operation is applied to the spliced fusion features, and the convolution result is fed into a Sigmoid activation function to obtain the nth-order fusion-mode network model, wherein n is greater than 1; the final-order fusion-mode network model is the fusion-mode network model;
and S3, fusing the single-mode network model and the fusion-mode network model to obtain a classification model.
2. The multi-modal image classification method according to claim 1, wherein in step S1, the fully connected layer and the classification layer of the ResNet50 network are removed, and the remaining cascaded residual modules constitute the single-mode network channel.
3. The multi-modal image classification method according to claim 1 or 2, wherein the specific implementation process of step S3 includes: the difference scores between the output of the single-mode network model, the output of the fusion-mode network model, and the actual sample are computed via cosine distance similarity; an activation function converts the difference scores into a consistency weight distribution, and the consistency weight distribution is normalized to obtain the consistency weight parameters of the different models in the fusion process; according to the consistency weight parameters, a first difference between the single-mode image features output by the single-mode network model and the actual label category and a second difference between the fusion-mode feature output of the fusion-mode network model and the actual label category are computed; the sum of squares of the first difference and the second difference is used as the loss function of network training, the loss function is minimized by an adaptive gradient descent method, the network model parameters are updated, and the trained network model, i.e., the classification model, is obtained.
4. The method of multi-modal image classification according to claim 1 or 2, characterized in that the number of single-modal network channels is 2.
5. A multi-modal image classification system comprising a computer device; the computer device is configured or programmed for carrying out the steps of the method according to one of claims 1 to 4.
CN202110355430.5A 2021-04-01 2021-04-01 Multi-modal image classification method and system Active CN113516133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110355430.5A CN113516133B (en) 2021-04-01 2021-04-01 Multi-modal image classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110355430.5A CN113516133B (en) 2021-04-01 2021-04-01 Multi-modal image classification method and system

Publications (2)

Publication Number Publication Date
CN113516133A CN113516133A (en) 2021-10-19
CN113516133B (en) 2022-06-17

Family

ID=78062230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110355430.5A Active CN113516133B (en) 2021-04-01 2021-04-01 Multi-modal image classification method and system

Country Status (1)

Country Link
CN (1) CN113516133B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230169794A1 (en) * 2021-11-30 2023-06-01 Irina Kezele Method, device, and medium for adaptive inference in compressed video domain
CN114332592B (en) * 2022-03-11 2022-06-21 中国海洋大学 Ocean environment data fusion method and system based on attention mechanism
CN114638994B (en) * 2022-05-18 2022-08-19 山东建筑大学 Multi-modal image classification system and method based on attention multi-interaction network
CN115546217B (en) * 2022-12-02 2023-04-07 中南大学 Multi-level fusion skin disease diagnosis system based on multi-mode image data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710830A (en) * 2018-04-20 2018-10-26 浙江工商大学 A kind of intensive human body 3D posture estimation methods for connecting attention pyramid residual error network and equidistantly limiting of combination
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
CN110674677A (en) * 2019-08-06 2020-01-10 厦门大学 Multi-mode multi-layer fusion deep neural network for anti-spoofing of human face
CN111192200A (en) * 2020-01-02 2020-05-22 南京邮电大学 Image super-resolution reconstruction method based on fusion attention mechanism residual error network
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
CN111709265A (en) * 2019-12-11 2020-09-25 深学科技(杭州)有限公司 Camera monitoring state classification method based on attention mechanism residual error network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102215757B1 (en) * 2019-05-14 2021-02-15 경희대학교 산학협력단 Method, apparatus and computer program for image segmentation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018213841A1 (en) * 2017-05-19 2018-11-22 Google Llc Multi-task multi-modal machine learning model
CN108710830A (en) * 2018-04-20 2018-10-26 浙江工商大学 A kind of intensive human body 3D posture estimation methods for connecting attention pyramid residual error network and equidistantly limiting of combination
CN110674677A (en) * 2019-08-06 2020-01-10 厦门大学 Multi-mode multi-layer fusion deep neural network for anti-spoofing of human face
CN111709265A (en) * 2019-12-11 2020-09-25 深学科技(杭州)有限公司 Camera monitoring state classification method based on attention mechanism residual error network
CN111192200A (en) * 2020-01-02 2020-05-22 南京邮电大学 Image super-resolution reconstruction method based on fusion attention mechanism residual error network
CN111325155A (en) * 2020-02-21 2020-06-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-Modal Retinal Image Classification With Modality-Specific Attention Network; X. He, Y. Deng, L. Fang and Q. Peng; IEEE; 2021-02-24; Vol. 40, No. 6; full text *
Research on Multi-modal Data Modeling and Retrieval for Common Space Learning; 陈思佳; China Master's Theses Full-text Database (Basic Sciences); 2019-02-28 (No. 02); full text *

Also Published As

Publication number Publication date
CN113516133A (en) 2021-10-19

Similar Documents

Publication Publication Date Title
CN113516133B (en) Multi-modal image classification method and system
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN107480206B (en) Multi-mode low-rank bilinear pooling-based image content question-answering method
CN112381097A (en) Scene semantic segmentation method based on deep learning
CN110929080B (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
CN114418030B (en) Image classification method, training method and device for image classification model
CN112489164B (en) Image coloring method based on improved depth separable convolutional neural network
CN113239897B (en) Human body action evaluation method based on space-time characteristic combination regression
CN112308081A (en) Attention mechanism-based image target prediction method
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN115017178A (en) Training method and device for data-to-text generation model
CN114821050A (en) Named image segmentation method based on transformer
Qi et al. Learning low resource consumption cnn through pruning and quantization
CN116030537B (en) Three-dimensional human body posture estimation method based on multi-branch attention-seeking convolution
CN112532251A (en) Data processing method and device
WO2023173552A1 (en) Establishment method for target detection model, application method for target detection model, and device, apparatus and medium
CN116189306A (en) Human behavior recognition method based on joint attention mechanism
CN113628107B (en) Face image super-resolution method and system
CN115063374A (en) Model training method, face image quality scoring method, electronic device and storage medium
CN114494284A (en) Scene analysis model and method based on explicit supervision area relation
CN114581789A (en) Hyperspectral image classification method and system
CN113989566A (en) Image classification method and device, computer equipment and storage medium
CN113538199B (en) Image steganography detection method based on multi-layer perception convolution and channel weighting
CN115936073B (en) Language-oriented convolutional neural network and visual question-answering method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant