CN116433898A - Method for Transformer multi-modal image segmentation based on semantic constraints - Google Patents

Method for Transformer multi-modal image segmentation based on semantic constraints Download PDF

Info

Publication number
CN116433898A
CN116433898A (application CN202310150411.8A)
Authority
CN
China
Prior art keywords
modal
mode
features
decoder
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310150411.8A
Other languages
Chinese (zh)
Inventor
马伟
陈颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202310150411.8A
Publication of CN116433898A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a Transformer multi-modal image segmentation method based on semantic constraints, which comprises the following steps: features of the M modalities are extracted by a backbone encoder to obtain a feature map for each modality; redundant features are removed by the multi-modal feature interaction module, and the current modal features are strengthened to different degrees according to a gating matrix G generated by the cross-modal interaction module (CIF); the modality-specific enhanced feature maps are then concatenated and input to a Transformer for inter-modality feature fusion to obtain the final coding features; finally, the features are input to a K-means Transformer decoder. Because the modal fusion network fuses the multi-modal features and assigns corresponding weights to the different modalities, the disclosed embodiment can effectively highlight important modalities that are favorable for multi-sequence image segmentation, suppress the interference of unimportant modalities on multi-modal segmentation, and effectively improve multi-modal image segmentation accuracy.

Description

Method for Transformer multi-modal image segmentation based on semantic constraints
Technical Field
The invention belongs to the field of computer vision and image processing, and relates to a Transformer multi-modal image segmentation method based on semantic constraints.
Background
Multi-modal data play a vital role in image segmentation: complementary information allows segmentation with higher accuracy. Magnetic resonance imaging (MRI) is a common imaging technique for quantitative evaluation, with a variety of imaging modes, namely T1-weighted (T1), T2-weighted (T2), contrast-enhanced T1-weighted (T1c) and fluid-attenuated inversion recovery (FLAIR) images. Since each imaging modality provides unique contrast and structure, multi-modal magnetic resonance imaging provides rich complementary information for analysis, and joint learning across modalities benefits multi-modal image segmentation. Furthermore, contrast-enhanced imaging is often used in practice: the contrast agent produces a distinct contrast between normal tissue and abnormalities. Three-phase contrast-enhanced imaging protocols include the arterial phase, the venous phase and the delayed phase after intravenous contrast injection. The three-phase images complement each other well and therefore help to segment the image better.
Multi-modal image segmentation data are of important research significance and value. However, existing segmentation algorithms perform poorly and do not fully exploit multi-modal information, so improvement is needed. Owing to their strong feature representation capability, convolutional neural networks (CNNs) have been widely used for image segmentation tasks and have achieved improved performance. Recently, the Vision Transformer (ViT) has brought the most powerful technique in natural language processing into the field of computer vision. Thanks to the self-attention mechanism, the Transformer can capture long-range features, which fits 3D volumetric data well; it has therefore quickly been adapted to the segmentation of 3D MRI sequences. Based on these two popular techniques, many outstanding approaches have been proposed for image segmentation to address challenges including positional and morphological uncertainty, low contrast and annotation bias. However, existing work ignores an important issue, namely how to fuse multi-modal images in a reasonable way: most methods merge the modalities at the input stage or the feature stage, and rarely consider how to fuse multi-modal images properly.
Accurate multi-modal image segmentation generally requires efficiently learning complementary information from multi-modal data and removing redundant information. Developing an efficient multi-sequence segmentation algorithm can therefore improve segmentation capability, so algorithms for multi-sequence segmentation have important research significance and wide application value.
Disclosure of Invention
Aiming at the shortcomings of existing multi-sequence image segmentation methods, the invention provides a multi-layer-fusion regional Transformer multi-modal image segmentation method. Single-modality features are first encoded by a single-modality hierarchical encoder. A gating mechanism then performs inter-modality interaction on the multi-modal features, strengthening the current sequence to different degrees according to its importance, so that the gating module enhances expressions beneficial to multi-sequence images. Non-local information between different modalities is fused through the Transformer self-attention mechanism to further enhance the feature expression of the multiple sequences; a region fusion module uses the ground truth to compute a ground-truth region probability map, focusing on the region of interest and suppressing features of non-focus regions. Finally, network convergence is accelerated through a K-means Transformer decoder. The whole network enhances the feature expression of the multiple sequences, and experimental results show that segmenting with the enhanced multiple sequences effectively improves network accuracy and achieves good performance.
To achieve this object, the technical scheme of the invention is as follows: step 1, features of the M modalities are extracted by a backbone encoder to obtain a feature map for each modality; step 2, the cross-modal interaction module generates a modality weight matrix G, which judges the importance of each of the M modalities to the segmentation of the current modality; G can be divided into M individual maps {g_1, ..., g_m, ..., g_M}, one per modality; next, the content code is re-weighted as F_m = z_m · g_m, i.e., the initial feature map of each modality is multiplied element-wise by its gating map, strengthening the current modal features to different degrees and yielding the modality-enhanced feature map F_m; step 3, the modality-enhanced feature maps are concatenated (feature F_r) and input to a Transformer for inter-modality feature fusion, obtaining the final coding feature F_global; step 4, the coding features are finally input to a K-means Transformer decoder to realize multi-sequence image segmentation. The invention thus provides a multi-layer-fusion regional Transformer multi-modal image segmentation method.
Advantageous effects
1) Multi-scale encoder: the interleaved sparse Transformer encoder with convolutional token hierarchical fusion outperforms a simple serial stacking approach. 2) Cross-modal interaction module (CIF) and multi-modal feature fusion module (MFF): the inherent information redundancy of the multiple modalities is eliminated while their inherent complementary relationships are exploited, so that multi-modal feature fusion is more complete. 3) K-means Transformer decoder: the affinity logits between pixel features and cluster centers correspond directly to the softmax logits of the segmentation mask, which speeds up convergence.
Drawings
FIG. 1 is a schematic diagram of a network framework of the method of the present invention;
FIG. 2 is a schematic diagram of cross-modal interactions in an example of the invention;
FIG. 3 is a schematic diagram of the multi-modal fusion Transformer of the present invention.
Detailed Description
The invention is implemented with the open-source deep learning framework PyTorch, and the network model is trained on an NVIDIA RTX 3090 GPU.
The module configurations of the method of the present invention will be further described with reference to the drawings and the detailed description. It should be understood that the specific examples described below are intended to illustrate the invention rather than limit its scope, and that equivalent modifications made by those skilled in the art after reading the invention fall within the scope of the appended claims.
The network framework and processing flow of the invention are shown in FIG. 1 and specifically comprise the following steps:
Step 1 includes: the multi-sequence images f = {f_1, ..., f_m, ..., f_M} are passed through the backbone encoder model. A convolutional encoder produces feature maps with local context within each modality; each block contains concatenated group normalization, ReLU and kernel-size-3 convolution layers, while the first convolution block of the first stage contains only a convolution layer. The input tokens are gradually downsampled from stage to stage by dividing the input volume into blocks and linearly embedding the patches. Multi-layer perceptron (MLP) blocks encode local features in the first two stages: the first stage contains one MLP block and the second stage two, and each MLP consists of two fully connected layers with a layer normalization and a GELU activation between them. In the third and fourth stages, three and four Transformer blocks are employed, respectively, to capture long-range dependencies through multi-head self-attention (MSA). f_m denotes the initial modal feature map extracted for the m-th of the M sequences,
f_m ∈ R^(C×H×W),
where f_m is the feature map of the m-th modality of the image, R denotes the feature space, m indexes the sequences, and C, H and W represent the number of channels, the height and the width of each sequence feature map, respectively.
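For illustration only, the following PyTorch sketch shows one way the hierarchical single-modality encoder described above could be organized: convolution blocks, then MLP blocks, then Transformer blocks. The 2D convolutions, channel counts (assumed divisible by 8 for GroupNorm), residual connections and class names are assumptions made for this sketch, not the patented implementation.

```python
# Hedged sketch of the hierarchical single-modality encoder blocks described above.
import torch.nn as nn

class ConvBlock(nn.Module):
    """GroupNorm -> ReLU -> kernel-size-3 convolution, as described for the convolutional blocks."""
    def __init__(self, in_ch, out_ch, first=False):
        super().__init__()
        layers = []
        if not first:  # the first convolution block of stage 1 contains only the convolution layer
            layers += [nn.GroupNorm(8, in_ch), nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

class MLPBlock(nn.Module):
    """LayerNorm, then two fully connected layers with a GELU in between (stages 1-2)."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1, self.act, self.fc2 = nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)

    def forward(self, tokens):  # tokens: (B, N, dim)
        return tokens + self.fc2(self.act(self.fc1(self.norm(tokens))))

class TransformerBlock(nn.Module):
    """Multi-head self-attention block used in stages 3-4 to capture long-range dependencies."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = MLPBlock(dim, hidden=dim * 4)

    def forward(self, tokens):
        h = self.norm(tokens)
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
        return self.mlp(tokens)
```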
Step 2 includes: the initial modal feature maps of the M modalities,
z_m ∈ R^(C×H×W),
are input to the cross-modal interaction module (CIF), which filters the modal information of the multi-sequence input. The modal features are concatenated and then input to a convolution layer with M output channels (kernel size 3×3, stride 1, boundary padding 0) followed by an activation, yielding the modality weight matrix G, which can be divided into M individual maps {g_1, ..., g_m, ..., g_M}, one per modality. Next, the content code is re-weighted as F_m = z_m · g_m: the initial feature map of each modality is multiplied element-wise by its own gating map, strengthening the current modal features to different degrees and giving the M modality-enhanced feature maps of the image, F = {F_1, ..., F_m, ..., F_M}, F_m ∈ R^(C×H×W). Interaction is performed in four stages in total; these outputs are concatenated, forwarded to a 1×1 convolution and then input to the LeakyReLU activation function, so that features rich in sequence information are highlighted. During training, some modality weights are randomly set to 0, which improves the robustness of the model to missing data.
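A minimal sketch of this cross-modal gating is given below, assuming 2D feature maps, a sigmoid gate activation, zero padding that preserves the spatial size, and the hypothetical class name CrossModalGate; it illustrates the re-weighting F_m = z_m · g_m, not the patented module itself.

```python
# Minimal sketch of the CIF gating: concatenate the M modal feature maps, predict an
# M-channel gating matrix G with a 3x3 convolution, and re-weight each modality element-wise.
import torch
import torch.nn as nn

class CrossModalGate(nn.Module):
    def __init__(self, num_modalities, channels):
        super().__init__()
        # one gating map per modality, predicted from the concatenated modal features
        self.gate_conv = nn.Conv2d(num_modalities * channels, num_modalities,
                                   kernel_size=3, stride=1, padding=1)
        self.act = nn.Sigmoid()  # assumed activation keeping gate values in [0, 1]

    def forward(self, feats, drop_prob=0.0):
        # feats: list of M tensors, each of shape (B, C, H, W)
        g = self.act(self.gate_conv(torch.cat(feats, dim=1)))        # G: (B, M, H, W)
        if self.training and drop_prob > 0:                           # randomly zero some modal gates
            keep = (torch.rand(g.shape[:2], device=g.device) > drop_prob).float()
            g = g * keep[:, :, None, None]
        return [f * g[:, m:m + 1] for m, f in enumerate(feats)]       # F_m = z_m * g_m
```

In this reading, passing drop_prob > 0 during training realizes the random zeroing of modality weights mentioned above.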
Step 3 comprises the following sub-steps: the modality-enhanced feature maps F = {F_1, ..., F_m, ..., F_M} are input separately to the multi-modal feature fusion module (MFF). The MFF first convolves each modality (kernel size 3×3, stride 1, boundary padding 0, input channel 1, output channel 3), computes a foreground probability map, and supervises it with the ground truth:
P_m^FG, P_m^BG = φ(Conv(F_m); θ),
where φ(·) is the foreground/background classifier with parameter set θ, Conv(·) is a 3×3 convolution operation, and FG and BG denote foreground and background, respectively. The modal foreground probability map is then dot-multiplied with the original modal features, highlighting discriminative regions and suppressing redundant information. The re-expressed modal features are then concatenated to obtain the feature F_r; after the multi-modal features are extracted, they are fused in the multi-modal feature fusion module. Depending on the inter- and intra-modal context, related and complementary features of the different modalities can be combined. First, the multi-modal features are converted into tokens, which are fed to a Transformer to enhance the discriminability of the fused features. The foreground is estimated as an ROI in the form of a per-modality probability map, and the probability map is embedded into the tokens. The feature-based foreground probability map prediction is sent to a Vision Transformer module to generate the new feature F_global:
T' = MSA(LN(T)) + T,   F_global = FFN(LN(T')) + T',
where T denotes the token sequence and LN(·), MSA(·) and FFN(·) denote layer normalization, multi-head self-attention and the feed-forward multi-layer perceptron, respectively. By embedding foreground cues, cross-modal fusion is performed with foreground awareness. The Transformer multi-head self-attention mechanism (MSA) breaks the locality of the features and realizes cross-modal non-local feature enhancement, so that the feature representation at any spatial position of any modality becomes richer and the expression of the modal features is effectively strengthened.
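The sketch below illustrates, under stated assumptions, this foreground-aware fusion: a small convolution head stands in for the FG/BG classifier φ, the foreground probability re-weights each modality, and a single pre-norm Transformer encoder layer (LN, MSA, FFN) fuses the concatenated tokens. The patch embedding, embed_dim, head count and class name are illustrative choices, not the patented network.

```python
# Hedged sketch of the MFF: foreground probability maps re-weight the modal features,
# the re-expressed features are concatenated (F_r) and fused by a Transformer into F_global.
import torch
import torch.nn as nn

class ForegroundAwareFusion(nn.Module):
    def __init__(self, num_modalities, channels, embed_dim=256, heads=8):
        super().__init__()
        self.fg_head = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                     nn.Conv2d(channels, 2, 1))           # FG/BG logits per modality
        self.to_tokens = nn.Conv2d(num_modalities * channels, embed_dim,
                                   kernel_size=4, stride=4)                # assumed patch embedding
        layer = nn.TransformerEncoderLayer(embed_dim, heads, dim_feedforward=embed_dim * 4,
                                           batch_first=True, norm_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, feats):
        # feats: list of M modality-enhanced maps, each (B, C, H, W)
        fg_logits, reweighted = [], []
        for f in feats:
            logits = self.fg_head(f)                     # (B, 2, H, W), supervised by the ground truth
            p_fg = logits.softmax(dim=1)[:, :1]          # foreground probability map
            reweighted.append(f * p_fg)                  # highlight discriminative regions
            fg_logits.append(logits)
        f_r = torch.cat(reweighted, dim=1)               # concatenated re-expressed features F_r
        tokens = self.to_tokens(f_r).flatten(2).transpose(1, 2)   # (B, N, embed_dim)
        f_global = self.fusion(tokens)                   # cross-modal non-local fusion -> F_global
        return f_global, fg_logits
```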
Step 4 comprises: the cross-modal feature maps F' = {F'_1, ..., F'_m, ..., F'_M} are concatenated to obtain F_r, and after dimension reduction the feature is spliced with F_global and input to the K-means decoder for segmentation. To enhance the discriminability of the fused features, a K-means Transformer is introduced as the decoder. The K-means Transformer applies intra-class consistency to the features to enhance their expressive power; the goal is to enhance the semantics of the decoded features by regularizing them with semantic centers. The K-means decoder consists of a pixel decoder and KMaX decoder layers. The pixel decoder is composed of a Transformer encoder and an upsampling layer. Each KMaX decoder layer updates the cluster centers of the target classes by taking a set of cluster centers and the corresponding features and outputting updated cluster centers. The cluster centers of the first KMaX decoder layer are randomly initialized; the others take the output of the previous KMaX decoder layer. The input to the KMaX decoder is first re-expressed by a K-means cross-attention module, the K-means cross-attention being
C' = argmax_N(Q^c (K^j)^T) V^j,
where C ∈ R^(N×D) is the input cluster-center matrix with N segmentation classes (ROI plus background) and D channels, C' is the updated center, the superscripts j and c denote features projected from the pixel features and the class queries, respectively, and Q^c ∈ R^(N×D), K^j ∈ R^(HW×D), V^j ∈ R^(HW×D) are the linear projection features of queries, keys and values. K-means cross-attention replaces the spatial softmax operation of the ordinary cross-attention mechanism with an argmax function; in this way, similar pixel features are clustered into the same cluster. The cluster centers output by the last KMaX decoder layer are used to regularize the pixel features to enhance the representation consistency of pixels within the same class. Specifically, the features from the pixel decoder are denoted F_de, F_de ∈ R^(HW×D), and the cluster-regularized pixel features are
F_de' = softmax_N(F_de C^T) C,
where the subscript N indicates the axis along which the softmax is applied; multi-modal image segmentation is then realized from these features.
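A compact sketch of one k-means cross-attention update and the pixel-feature regularization follows, assuming batched tensors and simple linear projections; the class and function names are hypothetical, and the shapes follow the notation above (N classes, D channels, HW pixels).

```python
# Hedged sketch: the pixel-wise softmax of ordinary cross-attention is replaced by a
# cluster-wise argmax (hard assignment), and the cluster centers later regularize the pixels.
import torch
import torch.nn as nn

class KMeansCrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries from the cluster centers C
        self.k = nn.Linear(dim, dim)   # keys from the pixel features
        self.v = nn.Linear(dim, dim)   # values from the pixel features

    def forward(self, centers, pixels):
        # centers: (B, N, D) cluster centers; pixels: (B, HW, D) pixel-decoder features
        qc, kj, vj = self.q(centers), self.k(pixels), self.v(pixels)
        affinity = qc @ kj.transpose(1, 2)                        # (B, N, HW)
        assign = torch.zeros_like(affinity).scatter_(             # argmax over N -> one-hot
            1, affinity.argmax(dim=1, keepdim=True), 1.0)
        centers_new = assign @ vj                                 # C' = argmax_N(Q^c K^jT) V^j
        return centers_new, affinity

def regularize_pixels(pixels, centers):
    # F_de' = softmax_N(F_de C^T) C : pull each pixel toward its (soft) semantic center
    attn = (pixels @ centers.transpose(1, 2)).softmax(dim=-1)     # (B, HW, N), softmax over N
    return attn @ centers                                         # (B, HW, D)
```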
In this embodiment, a comparison experiment is also performed to evaluate the segmentation performance of the proposed method, which combines the modality-importance gating network with the self-attention mechanism. The BraTS2020 dataset, containing 369 cases, was selected for the experiments and evaluation; 315 cases were used as the training set, 37 as the test set and 17 as the validation set. The Dice and HD95 metrics reported in other works are followed.
TABLE 1 Dice and HD95 comparison of different multi-sequence segmentation methods
[Table 1 is provided as an image in the original publication; the key Dice results are summarized in the text below.]
As shown in Table 1, under the four modalities the Dice values of the invention for ET, TC and WT are 0.821, 0.867 and 0.923, respectively, and the segmentation accuracy is higher than that of the other multi-modal segmentation methods. The experiments therefore show that the invention achieves advanced performance in multi-modal image segmentation and realizes multi-modal image segmentation better.
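For reference, the Dice coefficient reported in Table 1 can be computed for a binary mask as follows; this is a generic metric sketch, not code from the patent.

```python
# Generic Dice coefficient for a binary segmentation mask, as used for the ET/TC/WT scores.
import numpy as np

def dice(pred, target, eps=1e-7):
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```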
In summary, the method of the present invention is better than previous methods in terms of both quantitative and qualitative results. From the viewpoint of computational cost, the method obtains a higher evaluation index at a lower computational cost, so the network is efficient.
Experimental effects and effects
According to the multi-modal image segmentation method combining the K-means Transformer with a gating mechanism, the gating mechanism performs multi-modal feature interaction among the modalities and re-weights the expression of each modality, highlighting modalities beneficial to multi-modal segmentation while suppressing information from redundant modalities that contribute little to multi-sequence segmentation; the region fusion module allows discriminative information to be highlighted. The Transformer multi-head self-attention mechanism operates on the concatenated per-modality features to realize cross-modal non-local feature enhancement, effectively fusing local and non-local information to strengthen the feature expression of each modality. Finally, the segmentation of the multi-sequence images is realized through the KMaX decoder. In summary, the present embodiment can be applied to multi-sequence image segmentation.
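To summarize the data flow, a hedged end-to-end sketch is given below; it simply chains per-modality encoders, the gating module, the foreground-aware fusion and the decoder, and all module interfaces are assumptions carried over from the earlier sketches rather than the patented code.

```python
# End-to-end sketch tying the components above together (steps 1-4).
import torch.nn as nn

class MultiModalSegNet(nn.Module):
    def __init__(self, encoders, gate, fusion, decoder):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)   # one hierarchical backbone per modality
        self.gate = gate                          # CIF-style cross-modal gating
        self.fusion = fusion                      # foreground-aware Transformer fusion (MFF)
        self.decoder = decoder                    # K-means Transformer (kMaX-style) decoder

    def forward(self, images):
        # images: list of M single-modality inputs
        feats = [enc(x) for enc, x in zip(self.encoders, images)]   # step 1: per-modality features
        gated = self.gate(feats)                                    # step 2: modality re-weighting
        f_global, fg_logits = self.fusion(gated)                    # step 3: cross-modal fusion
        masks = self.decoder(f_global)                              # step 4: segmentation masks
        return masks, fg_logits
```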
The above embodiments are preferred examples of the present invention, and are not intended to limit the scope of the present invention.

Claims (5)

1. A method for Transformer multi-modal image segmentation based on semantic constraints, characterized by comprising the following steps:
step 1, extracting features of the M modalities through a backbone encoder to obtain a feature map for each modality;
step 2, generating a modality weight matrix G by the cross-modal interaction module, which judges the importance of each of the M modalities to the segmentation of the current modality; G can be divided into M individual maps {g_1, ..., g_m, ..., g_M}, one per modality; next, the content code is re-weighted as F_m = z_m · g_m, i.e., the initial feature map of each modality is multiplied element-wise by its gating map, strengthening the current modal features to different degrees and yielding the modality-enhanced feature map F_m;
step 3, concatenating the modality-enhanced feature maps (feature F_r) and inputting them to a Transformer for inter-modality feature fusion, obtaining the final coding feature F_global;
and step 4, finally inputting the coding features to a K-means Transformer decoder to realize multi-sequence image segmentation.
2. The method according to claim 1, characterized in that:
wherein step 1 includes: the multi-sequence images f = {f_1, ..., f_m, ..., f_M} are passed through the backbone encoder model; a convolutional encoder produces a feature map with local context within each modality, each block containing concatenated group normalization, ReLU and kernel-size-3 convolution layers, while the first convolution block of the first stage contains only a convolution layer; the input tokens are gradually downsampled from stage to stage by dividing the input volume into blocks and linearly embedding the patches; multi-layer perceptron (MLP) blocks encode local features in the first two stages; the first stage contains one MLP block and the second stage two, each MLP consisting of two fully connected layers with a layer normalization and a GELU activation between them; in the third and fourth stages, three and four Transformer blocks are employed, respectively, to capture long-range dependencies through multi-head self-attention (MSA); f_m denotes the initial modal feature map extracted for the m-th of the M sequences,
f_m ∈ R^(C×H×W),
wherein z_m is the m-th modality view of the image, R represents the feature space, m indexes the sequences, and C, H and W represent the number of channels, the height and the width of each sequence feature map, respectively.
3. The method of claim 1, wherein step 2 comprises: the initial modal feature maps of the M modalities,
z_m ∈ R^(C×H×W),
are input to the cross-modal interaction module (CIF), which filters the modal information of the multi-sequence input; the modal features are concatenated and then input to a convolution layer with M output channels (kernel size 3×3, stride 1, boundary padding 0) followed by an activation, obtaining the modality weight matrix G, which can be divided into M individual maps {g_1, ..., g_m, ..., g_M}, one per modality; next, the content code is re-weighted as F_m = z_m · g_m, i.e., the initial feature map of each modality is multiplied element-wise by its gating map to obtain the M modality-enhanced feature maps of the image, F = {F_1, ..., F_m, ..., F_M}, F_m ∈ R^(C×H×W); interaction is performed in four stages in total, and these outputs are concatenated, forwarded to a 1×1 convolution and then input to the LeakyReLU activation function.
4. The method according to claim 1, characterized in that step 3 comprises the following sub-steps: the modality-enhanced feature maps F = {F_1, ..., F_m, ..., F_M} are input separately to the multi-modal feature fusion module (MFF); the MFF first convolves each modality (kernel size 3×3, stride 1, boundary padding 0, input channel 1, output channel 3), computes a foreground probability map and supervises it with the ground truth,
P_m^FG, P_m^BG = φ(Conv(F_m); θ),
where φ(·) is the foreground/background classifier with parameter set θ, Conv(·) is a 3×3 convolution operation, and FG and BG denote foreground and background, respectively; the modal foreground probability map is then dot-multiplied with the original modal features, highlighting discriminative regions and suppressing redundant information; the re-expressed modal features are then concatenated to obtain the feature F_r, and after the multi-modal features are extracted they are fused in the multi-modal feature fusion module;
first, the multi-modal features are converted into tokens, which are then fed to a Transformer; the foreground is estimated as an ROI in the form of a per-modality probability map, and the probability map is embedded into the tokens; the feature-based foreground probability map prediction is sent to a Vision Transformer module to generate the new feature F_global,
T' = MSA(LN(T)) + T,   F_global = FFN(LN(T')) + T',
where T denotes the token sequence and LN(·), MSA(·) and FFN(·) denote layer normalization, multi-head self-attention and the feed-forward multi-layer perceptron, respectively.
5. The method according to claim 1, characterized in that step 4 comprises: the cross-modal feature maps F' = {F'_1, ..., F'_m, ..., F'_M} are concatenated to obtain F_r, and after dimension reduction the feature is spliced with F_global and input to the K-means decoder for segmentation; a K-means Transformer is introduced as the decoder; the K-means decoder comprises a pixel decoder and KMaX decoder layers; the pixel decoder consists of a Transformer encoder and an upsampling layer; each KMaX decoder layer updates the cluster centers of the target classes by taking a set of cluster centers and the corresponding features and outputting updated cluster centers; the cluster centers of the first KMaX decoder layer are randomly initialized, and the others take the output of the previous KMaX decoder layer; the input to the KMaX decoder is first re-expressed by a K-means cross-attention module, the K-means cross-attention being
C' = argmax_N(Q^c (K^j)^T) V^j,
wherein C ∈ R^(N×D) is the input cluster-center matrix with N segmentation classes (ROI plus background) and D channels, C' is the updated center, the superscripts j and c denote features projected from the pixel features and the class queries, respectively, and Q^c ∈ R^(N×D), K^j ∈ R^(HW×D), V^j ∈ R^(HW×D) are the linear projection features of queries, keys and values; K-means cross-attention replaces the spatial softmax operation of the ordinary cross-attention mechanism with an argmax function, so that similar pixel features are clustered into the same cluster; the cluster centers output by the last KMaX decoder layer are used to regularize the pixel features to enhance the representation consistency of pixels within the same class; the features from the pixel decoder are denoted F_de, F_de ∈ R^(HW×D), and the cluster-regularized pixel features are
F_de' = softmax_N(F_de C^T) C,
wherein the subscript N indicates the axis along which the softmax is applied to realize multi-modal image segmentation.
CN202310150411.8A 2023-02-22 2023-02-22 Method for segmenting transform multi-mode image based on semantic constraint Pending CN116433898A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310150411.8A CN116433898A (en) 2023-02-22 2023-02-22 Method for segmenting transform multi-mode image based on semantic constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310150411.8A CN116433898A (en) 2023-02-22 2023-02-22 Method for segmenting transform multi-mode image based on semantic constraint

Publications (1)

Publication Number Publication Date
CN116433898A true CN116433898A (en) 2023-07-14

Family

ID=87091470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310150411.8A Pending CN116433898A (en) 2023-02-22 2023-02-22 Method for segmenting transform multi-mode image based on semantic constraint

Country Status (1)

Country Link
CN (1) CN116433898A (en)


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912503A (en) * 2023-09-14 2023-10-20 湖南大学 Multi-mode MRI brain tumor semantic segmentation method based on hierarchical fusion strategy
CN116912503B (en) * 2023-09-14 2023-12-01 湖南大学 Multi-mode MRI brain tumor semantic segmentation method based on hierarchical fusion strategy
CN117152156A (en) * 2023-10-31 2023-12-01 通号通信信息集团有限公司 Railway anomaly detection method and system based on multi-mode data fusion
CN117152156B (en) * 2023-10-31 2024-02-13 通号通信信息集团有限公司 Railway anomaly detection method and system based on multi-mode data fusion
CN117726990A (en) * 2023-12-27 2024-03-19 浙江恒逸石化有限公司 Method and device for detecting spinning workshop, electronic equipment and storage medium
CN117726990B (en) * 2023-12-27 2024-05-03 浙江恒逸石化有限公司 Method and device for detecting spinning workshop, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Ren et al. A coarse-to-fine indoor layout estimation (cfile) method
CN116433898A (en) Method for segmenting transform multi-mode image based on semantic constraint
CN111461232A (en) Nuclear magnetic resonance image classification method based on multi-strategy batch type active learning
CN110675316B (en) Multi-domain image conversion method, system and medium for generating countermeasure network based on condition
CN111369565A (en) Digital pathological image segmentation and classification method based on graph convolution network
CN110689599A (en) 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement
CN114638994B (en) Multi-modal image classification system and method based on attention multi-interaction network
CN112949707A (en) Cross-mode face image generation method based on multi-scale semantic information supervision
Wang et al. Multiscale transunet++: dense hybrid u-net with transformer for medical image segmentation
Chen et al. Harnessing semantic segmentation masks for accurate facial attribute editing
WO2021139351A1 (en) Image segmentation method, apparatus, medium, and electronic device
CN115249382A (en) Method for detecting silence living body based on Transformer and CNN
CN114972016A (en) Image processing method, image processing apparatus, computer device, storage medium, and program product
Yang et al. RAU-Net: U-Net network based on residual multi-scale fusion and attention skip layer for overall spine segmentation
Li et al. Transformer and group parallel axial attention co-encoder for medical image segmentation
Gao A method for face image inpainting based on generative adversarial networks
Zhao et al. Generative landmarks guided eyeglasses removal 3D face reconstruction
CN111667488B (en) Medical image segmentation method based on multi-angle U-Net
WO2023160157A1 (en) Three-dimensional medical image recognition method and apparatus, and device, storage medium and product
Li et al. CorrDiff: Corrective Diffusion Model for Accurate MRI Brain Tumor Segmentation
WO2022226744A1 (en) Texture completion
CN112164078B (en) RGB-D multi-scale semantic segmentation method based on encoder-decoder
CN115345886B (en) Brain glioma segmentation method based on multi-modal fusion
Chen et al. FSC-UNet: a lightweight medical image segmentation algorithm fused with skip connections
CN117911705B (en) Brain MRI (magnetic resonance imaging) tumor segmentation method based on GAN-UNet variant network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination