CN117649422B - Training method of multi-modal image segmentation model and multi-modal image segmentation method

Training method of multi-modal image segmentation model and multi-modal image segmentation method

Info

Publication number
CN117649422B
Authority
CN
China
Prior art keywords
image
target domain
source domain
domain
diagram
Prior art date
Legal status
Active
Application number
CN202410121532.4A
Other languages
Chinese (zh)
Other versions
CN117649422A (en)
Inventor
杜秀全
章旭
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Priority to CN202410121532.4A
Publication of CN117649422A
Application granted
Publication of CN117649422B
Legal status: Active
Anticipated expiration

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to a training method of a multi-modal image segmentation model and a multi-modal image segmentation method. The training method comprises: acquiring a training sample set, wherein each training sample comprises a source domain real map, its segmentation label and a target domain real map; training an image conversion module with the source domain real map and the target domain real map, and obtaining, through the image conversion module, a source domain generation map and its feature map, a target domain generation map and its feature map, and the feature maps of the source domain real map and the target domain real map; and training an image segmentation module with the training sample set, the source domain generation map and its feature map, the target domain generation map and its feature map, and the feature maps of the source domain real map and the target domain real map. The method solves the problem that, in existing multi-modal image segmentation methods based on semi-supervised learning, a large data distribution difference between the generated map and the real map of the target domain impairs the effectiveness of semi-supervised learning and ultimately degrades the image segmentation result.

Description

Training method of multi-modal image segmentation model and multi-modal image segmentation method
Technical Field
The present disclosure relates to the field of medical image processing, and in particular, to a training method for a multi-modal image segmentation model and a multi-modal image segmentation method.
Background
Multi-modal image segmentation has important applications in the medical field. For example, early prevention and effective treatment of cardiovascular disease is a hot research topic worldwide, and accurate segmentation of cardiac structures plays a key role in its prevention and treatment. In clinical practice, cardiac images are often acquired in multiple modalities, and because different modalities reveal different pathological details, current cardiac diagnosis typically analyses multi-modal images jointly to improve the accuracy of cardiac image segmentation. For example, multi-sequence cardiac magnetic resonance images show detailed features of the myocardium more clearly, while cardiac computed tomography images are commonly used for clinical tasks that require higher resolution, such as coronary artery observation. However, owing to differences in imaging physics and acquisition geometry, cardiac images of different modalities differ markedly, mainly in appearance and in variable anatomical shapes, which makes it difficult to segment images of different modalities directly. In particular, because of these inter-modality differences, a segmentation model trained on one modality suffers severe performance degradation when applied to heterogeneous images, a phenomenon known as domain shift.
To overcome this difficulty, earlier solutions adopted supervised training strategies, which require extensive image data and corresponding labels to retrain an existing segmentation model. However, obtaining labels is a well-known bottleneck, which motivated another line of work, namely domain adaptation. Domain adaptation requires data of both modalities (images and labels) to participate in training and learns features shared by the two modalities, which helps segment cardiac structures more accurately. Many domain adaptation methods, however, still rely on rich label information in both domains, which is impractical in clinical settings. Considering the limited amount of annotated data actually available, unsupervised domain adaptation (UDA) is the most efficient way to segment data of a new modality with existing annotated data. The UDA setting assumes that image data and labels of one modality are available as the source domain, while only image data of another modality are available as the target domain. This greatly reduces the need for abundant label data in multi-modal medical image segmentation and makes the approach practical.
However, the good performance of unsupervised domain adaptation rests on the assumption that source domain label information is sufficient: only when the decision boundary of the source domain can be identified precisely can domain adaptation transfer knowledge to the target domain and complete the cross-domain task. In clinical practice, the source domain may not always have a sufficient quantity of labels either, because of the expert knowledge required for annotation or restrictions on dataset access. Unsupervised domain adaptation therefore suffers performance degradation in the more realistic and challenging scenario where source domain labels are scarce.
To cope with this limitation of unsupervised domain adaptation, semi-supervised learning, which handles label-scarce scenarios through data augmentation and synchronous training, has received much attention. Its drawback is equally apparent, however: when a large data distribution difference exists between the generated map and the real map of the target domain, the effectiveness of semi-supervised learning is impaired, and semi-supervised learning also lacks the ability to recognise the semantic features of multi-modal images.
For the problem that, in existing multi-modal image segmentation methods based on semi-supervised learning, a large data distribution difference between the generated map and the real map of the target domain impairs the effectiveness of semi-supervised learning and ultimately degrades the segmentation result, no effective solution has yet been proposed.
Disclosure of Invention
The invention provides a training method of a multi-modal image segmentation model and a multi-modal image segmentation method, in order to solve the problem that, in existing multi-modal image segmentation methods based on semi-supervised learning, a large data distribution difference between the generated map and the real map of the target domain impairs the effectiveness of semi-supervised learning and ultimately degrades the image segmentation result.
In a first aspect, the present invention provides a training method of a multi-modal image segmentation model, where the multi-modal image segmentation model includes an image conversion module and an image segmentation module based on semi-supervised learning, and the training method includes:
acquiring a training sample set, wherein each training sample comprises a source domain real map with its segmentation label and a target domain real map;
training the image conversion module with the source domain real map and the target domain real map, and obtaining, through the image conversion module, a source domain generation map and its feature map, a target domain generation map and its feature map, and the feature maps of the source domain real map and the target domain real map;
and training the image segmentation module with the training sample set, the source domain generation map and its feature map, the target domain generation map and its feature map, and the feature maps of the source domain real map and the target domain real map.
In some of these embodiments, the image conversion module includes a content encoder, a source domain decoder and a target domain decoder; the source domain generation map includes a first source domain generation map and a second source domain generation map, and the target domain generation map includes a first target domain generation map and a second target domain generation map.
Obtaining, through the image conversion module, the source domain generation map and its feature map, the target domain generation map and its feature map, and the feature maps of the source domain real map and the target domain real map comprises:
inputting the source domain real map and the target domain real map into the content encoder respectively, to obtain the feature map of the source domain real map and the feature map of the target domain real map;
inputting the feature map of the source domain real map into the target domain decoder to obtain the first target domain generation map, and inputting the feature map of the target domain real map into the source domain decoder to obtain the first source domain generation map;
inputting the first target domain generation map and the first source domain generation map into the content encoder respectively, to obtain the feature map of the first target domain generation map and the feature map of the first source domain generation map;
inputting the feature map of the first target domain generation map into the source domain decoder to obtain the second source domain generation map, and inputting the feature map of the first source domain generation map into the target domain decoder to obtain the second target domain generation map;
and inputting the second source domain generation map and the second target domain generation map into the content encoder respectively, to obtain the feature map of the second source domain generation map and the feature map of the second target domain generation map.
In some of these embodiments, the image conversion module further comprises a first source domain discriminator and a first target domain discriminator;
the training loss of the image conversion module comprises:
a first source domain adversarial learning loss, calculated by the first source domain discriminator based on the first source domain generation map and the source domain real map;
a first target domain adversarial learning loss, calculated by the first target domain discriminator based on the first target domain generation map and the target domain real map;
a source domain image reconstruction loss;
a target domain image reconstruction loss;
a source domain consistency loss based on the second source domain generation map and the source domain real map;
and a target domain consistency loss based on the second target domain generation map and the target domain real map.
In some embodiments, the multi-modal image segmentation model further comprises a similarity mining module comprising a first multi-layer perceptron and a second multi-layer perceptron connected in sequence to the output end of the content encoder;
the first multi-layer perceptron is used to obtain a projection feature map from the output of the content encoder, and the second multi-layer perceptron is used to obtain a prediction feature map from the projection feature map;
the training loss of the image conversion module further comprises:
a cosine similarity loss between the projection feature map and the prediction feature map.
In some embodiments, the image segmentation module includes 4 sub-segmentation models, where the 4 sub-segmentation models are a source domain student model and a source domain teacher model corresponding thereto, and a target domain student model and a target domain teacher model corresponding thereto, respectively;
training the image segmentation module with the training sample set, the source domain generation map and its feature map, the target domain generation map and its feature map, and the feature maps of the source domain real map and the target domain real map comprises:
inputting the source domain real map, its feature map and its segmentation label into the source domain student model, and training the source domain student model;
inputting the first source domain generation map and its feature map and the second source domain generation map and its feature map into the source domain teacher model, and assisting the learning of the source domain student model through the source domain teacher model;
inputting the first target domain generation map, its feature map and the segmentation label of the source domain real map into the target domain student model, and training the target domain student model;
and inputting the target domain real map and its feature map and the second target domain generation map and its feature map into the target domain teacher model, and assisting the learning of the target domain student model through the target domain teacher model.
In some of these embodiments, the training penalty of the image segmentation module includes:
source domain segmentation loss of the source domain student model;
target domain segmentation loss of the target domain student model;
the source domain segmentation loss and the target domain segmentation loss are obtained through calculation of a Soft Dice function and a weighted cross entropy CE function.
In some of these embodiments, the image segmentation module further comprises a second source domain discriminator and a second target domain discriminator;
the training loss of the image segmentation module further comprises:
a second source domain adversarial learning loss, calculated by the second source domain discriminator based on the segmentation result of the source domain student model and the segmentation result of the source domain teacher model;
and a second target domain adversarial learning loss, calculated by the second target domain discriminator based on the segmentation result of the target domain student model and the segmentation result of the target domain teacher model.
In some embodiments, each of the sub-segmentation models includes at least a first downsampling convolution layer, a second downsampling convolution layer, a third downsampling convolution layer, a fourth downsampling convolution layer, and a fifth downsampling convolution layer connected in sequence;
the content encoder comprises at least 5 network layers connected in sequence, the 5 network layers being an initial convolution layer, a first downsampling layer, a second downsampling layer, a first lower convolution layer and a second lower convolution layer respectively, and the feature map output by the content encoder comprises the sub-feature maps output by these 5 network layers;
the data processing flow of each sub-segmentation model comprises the following steps:
fusing the feature map output by the second downsampling convolution layer with the sub-feature map output by the first downsampling layer to obtain a first fusion feature, and inputting the first fusion feature into the third downsampling convolution layer;
fusing the feature map output by the third downsampling convolution layer with the sub-feature map output by the second downsampling layer to obtain a second fusion feature, and inputting the second fusion feature into the fourth downsampling convolution layer;
fusing the feature map output by the fourth downsampling convolution layer with the sub-feature map output by the first lower convolution layer to obtain a third fusion feature, and inputting the third fusion feature into the fifth downsampling convolution layer;
fusing the feature map output by the fifth downsampling convolution layer with the sub-feature map output by the second lower convolution layer to obtain a fourth fusion feature;
connecting the fourth fusion feature with the third fusion feature through upsampling to obtain a first connection feature;
connecting the first connection feature with the second fusion feature through upsampling to obtain a second connection feature;
connecting the second connection feature with the first fusion feature through upsampling to obtain a third connection feature map;
connecting the third connection feature with the feature map output by the first downsampling convolution layer through upsampling to obtain a fourth connection feature map;
and carrying out convolution operation on the fourth connection feature diagram to obtain a segmentation result of the sub-segmentation model.
In some of these embodiments, the training method further comprises:
and performing joint training on the image conversion module and the image segmentation module through the training sample set.
In a second aspect, the present invention provides a multi-modal image segmentation method, the segmentation method comprising:
acquiring a multi-modal image to be segmented, wherein the multi-modal image comprises a source domain real image and a target domain real image;
inputting the multi-modal image into a multi-modal image segmentation model to obtain a segmentation result of the multi-modal image; the multi-modal image segmentation model is trained by the training method of the multi-modal image segmentation model described in the first aspect.
In a third aspect, the present invention provides a training apparatus for a multi-modal image segmentation model, the multi-modal image segmentation model including an image conversion module and a semi-supervised learning-based image segmentation module, the training apparatus comprising:
the sample data acquisition module is used for acquiring a training sample set, wherein the training sample comprises a source domain real image, a segmentation label thereof and a target domain real image;
the image conversion training module is used for training the image conversion module with the source domain real map and the target domain real map, and for obtaining, through the image conversion module, a source domain generation map and its feature map, a target domain generation map and its feature map, and the feature maps of the source domain real map and the target domain real map;
and the image segmentation training module is used for training the image segmentation module with the training sample set, the source domain generation map and its feature map, the target domain generation map and its feature map, and the feature maps of the source domain real map and the target domain real map.
In a fourth aspect, the present invention provides a multi-modal image segmentation apparatus, the segmentation apparatus comprising:
the multi-mode image acquisition module is used for acquiring a multi-mode image to be segmented, wherein the multi-mode image comprises a source domain real image and a target domain real image;
the multi-modal image segmentation module is used for inputting the multi-modal image into a multi-modal image segmentation model to obtain a segmentation result of the multi-modal image; the multi-modal image segmentation model is trained by the training method of the multi-modal image segmentation model described in the first aspect.
In a fifth aspect, the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the training method of the multi-modal image segmentation model described in the first aspect or the multi-modal image segmentation method described in the second aspect when the processor executes the computer program.
In a sixth aspect, the present invention provides a storage medium having stored thereon a computer program which, when executed by a processor, implements the training method of the multimodal image segmentation model described in the first aspect or the multimodal image segmentation method described in the second aspect.
Compared with the related art, in the training method of the multi-modal image segmentation model and the multi-modal image segmentation method provided by the invention, the image conversion module produces the feature maps of the target domain generation map, the target domain real map, the source domain generation map and the source domain real map and supplies them to the image segmentation module, so that the image segmentation module can fuse the feature map of the corresponding image when extracting features. This strengthens feature recognition inside the image segmentation module, effectively alleviates the adverse effect that a poorly generated target domain map would otherwise have on the subsequent segmentation work, reinforces the model's continued attention to the features of the target region, and stabilises the global decision boundary. The method thereby solves the problem that, in existing multi-modal image segmentation methods based on semi-supervised learning, a large data distribution difference between the generated map and the real map of the target domain impairs the effectiveness of semi-supervised learning and ultimately degrades the image segmentation result.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below; other features, objects and advantages of the application will become more apparent from them.
Drawings
FIG. 1 is a flow chart of a training method of a multi-modal image segmentation model provided in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a model of an image conversion section provided in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a model of a similarity mining portion provided in an embodiment of the present invention;
FIG. 4 is a schematic representation of a model of an image segmentation provided in an embodiment of the present invention;
FIG. 5 is a schematic representation of a model of a feature enhancement portion provided in an embodiment of the present invention;
FIG. 6 is a feature extraction comparison of a content encoder and a generator in CycleGAN provided in an embodiment of the present invention;
FIG. 7 is a comparison of cardiac image segmentation in the MRI->CT adaptation direction between the multi-modal image segmentation method provided in an embodiment of the present invention and various unsupervised domain adaptation methods;
FIG. 8 is a box plot generated by an ablation experiment of the multi-modal image segmentation method provided in an embodiment of the present invention.
Detailed Description
For a clearer understanding of the objects, technical solutions and advantages of the present application, the present application is described and illustrated below with reference to the accompanying drawings and examples.
Unless defined otherwise, technical or scientific terms used herein shall have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," "these," and the like in this application are not intended to be limiting in number, but rather are singular or plural. The terms "comprising," "including," "having," and any variations thereof, as used in the present application, are intended to cover a non-exclusive inclusion; for example, a process, method, and system, article, or apparatus that comprises a list of steps or modules (units) is not limited to the list of steps or modules (units), but may include other steps or modules (units) not listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. Reference to "a plurality" in this application means two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. Typically, the character "/" indicates that the associated object is an "or" relationship. The terms "first," "second," "third," and the like, as referred to in this application, merely distinguish similar objects and do not represent a particular ordering of objects.
The embodiment of the invention provides a training method of a multi-modal image segmentation model, wherein the multi-modal image segmentation model comprises an image conversion module and an image segmentation module based on semi-supervised learning. FIG. 1 is a flowchart of the training method of the multi-modal image segmentation model provided in an embodiment of the present invention; as shown in FIG. 1, the flow comprises the following steps:
step S110, a training sample set is obtained, wherein the training sample comprises a source domain real image, a segmentation label thereof and a target domain real image.
In this step, the training sample set comprises multi-modal images as samples. In the medical field, the multi-modal images may depict different anatomical sites, for example multi-modal cardiac images. The multi-modal images include at least a magnetic resonance image and a computed tomography image; thus the source domain real map and the target domain real map may be a magnetic resonance image and a computed tomography image, respectively. It should be noted that the source domain and the target domain are relative concepts: the modality that carries segmentation labels is generally referred to as the source domain image, and the other modalities are referred to as target domain images. A real map is an image obtained by actual scanning and acquisition.
For example, the dataset for model training may be the 2017 Multi-Modality Whole Heart Segmentation (MMWHS) challenge dataset, which comprises unpaired MRI images of 20 patients and CT images of 20 patients together with the corresponding segmentation labels. The training sample set and the test sample set may be randomly divided at 80% and 20% of the number of patients, respectively.
Preferably, after the dataset is acquired and divided, the training sample set and the test sample set may be normalised in pixel space. Meanwhile, because different modalities present the target region (such as the cardiac region or another anatomical region) differently, including different angles and different fields of view, the images of the different modalities and their segmentation labels may be cropped with a fixed window so that they all present the target region at the centre. When the number of multi-modal images and labels is limited, the multi-modal images and their labels may also be randomly horizontally flipped to enlarge the training sample set.
Step S120, training the image conversion module with the source domain real map and the target domain real map, and obtaining, through the image conversion module, a source domain generation map and its feature map, a target domain generation map and its feature map, and the feature maps of the source domain real map and the target domain real map.
Step S130, training the image segmentation module with the training sample set, the source domain generation map and its feature map, the target domain generation map and its feature map, and the feature maps of the source domain real map and the target domain real map.
The multi-modal image segmentation model adopted by this scheme comprises an image conversion module and an image segmentation module. The image conversion module mainly realises the conversion between images of different modalities. For example, the source domain generation map is generally obtained by restoring the feature map of the target domain real map to the source domain, and the target domain generation map is generally obtained by restoring the feature map of the source domain real map to the target domain.
In the prior art, the image conversion module generally supplies only the target domain generation map to the image segmentation module, and the image segmentation module is trained with the target domain generation map and the segmentation label of the source domain real map, so that it can segment the target domain real map. However, when a large data distribution difference exists between the target domain generation map and the target domain real map, the training of the image segmentation module is affected. In the present invention, the model additionally obtains, through the image conversion module, the feature maps of the target domain generation map, the target domain real map, the source domain generation map and the source domain real map, and supplies these feature maps to the image segmentation module, so that the image segmentation module can fuse the feature map of the corresponding image when extracting features. This strengthens feature recognition inside the image segmentation module, effectively alleviates the adverse effect that a poorly generated target domain map would otherwise have on the subsequent segmentation work, reinforces the model's continued attention to the features of the target region, and stabilises the global decision boundary. The method thereby solves the problem that, in existing multi-modal image segmentation methods based on semi-supervised learning, a large data distribution difference between the generated map and the real map of the target domain impairs the effectiveness of semi-supervised learning and ultimately degrades the segmentation result.
In some of these embodiments, the image conversion module includes a content encoder, a source domain decoder, and a target domain decoder, the source domain generation map includes a first source domain generation map and a second source domain generation map, and the target domain generation map includes a first target domain generation map and a second target domain generation map.
In step S120, obtaining, through the image conversion module, the source domain generation map and its feature map, the target domain generation map and its feature map, and the feature maps of the source domain real map and the target domain real map comprises: step S121, inputting the source domain real map and the target domain real map into the content encoder respectively, to obtain the feature map of the source domain real map and the feature map of the target domain real map; step S122, inputting the feature map of the source domain real map into the target domain decoder to obtain the first target domain generation map, and inputting the feature map of the target domain real map into the source domain decoder to obtain the first source domain generation map; step S123, inputting the first target domain generation map and the first source domain generation map into the content encoder respectively, to obtain the feature map of the first target domain generation map and the feature map of the first source domain generation map; step S124, inputting the feature map of the first target domain generation map into the source domain decoder to obtain the second source domain generation map, and inputting the feature map of the first source domain generation map into the target domain decoder to obtain the second target domain generation map; and step S125, inputting the second source domain generation map and the second target domain generation map into the content encoder respectively, to obtain the feature map of the second source domain generation map and the feature map of the second target domain generation map.
In this embodiment, a specific image conversion module and its data processing flow are provided. The image conversion module mainly consists of a content encoder and two decoders (a source domain decoder and a target domain decoder). Through the coordination of the content encoder, the source domain decoder and the target domain decoder, and based on the above data processing flow, the first source domain generation map and its feature map, the second source domain generation map and its feature map, the first target domain generation map and its feature map, the second target domain generation map and its feature map, and the feature maps of the source domain real map and the target domain real map can all be obtained.
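The data flow of steps S121–S125 can be sketched in PyTorch-style pseudocode as follows; E_c, U_s and U_t are assumed to be nn.Module instances, and the single-tensor feature map is a simplification of the multi-scale features described later:

```python
import torch
import torch.nn as nn

def conversion_cycle(E_c: nn.Module, U_s: nn.Module, U_t: nn.Module,
                     x_s: torch.Tensor, x_t: torch.Tensor) -> dict:
    """Illustrative sketch of the cross-domain translation cycle described above."""
    # Step S121: encode both real maps
    f_s = E_c(x_s)                 # feature map of the source domain real map
    f_t = E_c(x_t)                 # feature map of the target domain real map
    # Step S122: cross-domain decoding gives the first generation maps
    x_s2t = U_t(f_s)               # first target domain generation map
    x_t2s = U_s(f_t)               # first source domain generation map
    # Step S123: re-encode the first generation maps
    f_s2t = E_c(x_s2t)
    f_t2s = E_c(x_t2s)
    # Step S124: decode back to the original domains (cycle reconstruction)
    x_s2t2s = U_s(f_s2t)           # second source domain generation map
    x_t2s2t = U_t(f_t2s)           # second target domain generation map
    # Step S125: feature maps of the second generation maps
    f_s2t2s = E_c(x_s2t2s)
    f_t2s2t = E_c(x_t2s2t)
    return dict(f_s=f_s, f_t=f_t, x_s2t=x_s2t, x_t2s=x_t2s,
                f_s2t=f_s2t, f_t2s=f_t2s, x_s2t2s=x_s2t2s, x_t2s2t=x_t2s2t,
                f_s2t2s=f_s2t2s, f_t2s2t=f_t2s2t)
```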
It should be noted that the above steps are performed at different stages. For example, in the training stage of the image conversion module, the above steps are repeated until the image conversion module converges, and the first target domain generation map, the first source domain generation map, the second source domain generation map and the second target domain generation map are obtained through steps S122 and S124; in the training stage of the image segmentation module, the feature maps of the real maps and of the generation maps are obtained synchronously through steps S121, S123 and S125 and fed to the image segmentation module.
To further enhance the image conversion effect of the image conversion module, in one embodiment the image conversion module further comprises a first source domain discriminator and a first target domain discriminator. The two discriminators respectively distinguish between the real map and the generated map of the corresponding modality (domain).
Referring to FIG. 2, the source domain real map may be denoted x_s, the target domain real map x_t, the first source domain discriminator D_s, the first target domain discriminator D_t, the content encoder E_c, the source domain decoder U_s and the target domain decoder U_t. The first target domain generation map is denoted x_{s→t}, the second target domain generation map x_{t→s→t}, the first source domain generation map x_{t→s} and the second source domain generation map x_{s→t→s}.
Based on the above image conversion module and its data processing flow, the image conversion module can be trained with the following training losses (typical forms are sketched after this list):
the first source domain adversarial learning loss, calculated by the first source domain discriminator based on the first source domain generation map and the source domain real map;
the first target domain adversarial learning loss, calculated by the first target domain discriminator based on the first target domain generation map and the target domain real map;
the source domain image reconstruction loss;
the target domain image reconstruction loss;
the source domain consistency loss based on the second source domain generation map and the source domain real map;
and the target domain consistency loss based on the second target domain generation map and the target domain real map.
by minimizing the training loss described above, the content encoder, decoder, and discriminator, among other modules in the image conversion module, can be trained until each module converges.
To further enhance the feature extraction effect of the content encoder, referring to FIG. 3, in some embodiments of the present invention the multi-modal image segmentation model further includes a similarity mining module, which comprises a first multi-layer perceptron MLP1 and a second multi-layer perceptron MLP2 connected in sequence to the output end of the content encoder. The input of MLP1 is the output feature map of the content encoder, from which a deep semantic map z is obtained through multi-layer convolution; MLP2 then performs a deeper convolution on the semantic map z to obtain a deeper semantic map p. The semantic maps z and p serve respectively as the projection feature map and the prediction feature map, and the domain-invariant feature information can be written as z = f(x) and p = h(f(x)), where s denotes the source domain, t denotes the target domain and i denotes the feature layer index. The training loss of the image conversion module therefore also includes a cosine similarity loss between the projection feature map and the prediction feature map, in which the feature vectors are ℓ2-normalised and stopgrad denotes stopping gradient propagation.
The similarity mining module may be trained with a fixed learning rate and optimised with the above loss.
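Since the formula itself is not reproduced above, the following is a hedged sketch of a SimSiam-style symmetric form of this loss, assuming the prediction of one domain is compared with the stop-gradient projection of the other domain; the patent's exact pairing and weighting may differ:

```latex
\begin{equation}
\mathcal{L}_{sim} = -\frac{1}{2}\left(
\frac{p_s}{\lVert p_s\rVert_2}\cdot
\operatorname{stopgrad}\!\left(\frac{z_t}{\lVert z_t\rVert_2}\right)
+\frac{p_t}{\lVert p_t\rVert_2}\cdot
\operatorname{stopgrad}\!\left(\frac{z_s}{\lVert z_s\rVert_2}\right)
\right)
\end{equation}
```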
In some embodiments, the image segmentation module includes 4 sub-segmentation models, where the 4 sub-segmentation models are a source domain student model and a source domain teacher model corresponding thereto, and a target domain student model and a target domain teacher model corresponding thereto, respectively;
step S130 specifically includes: step S131, inputting a source domain real image, a feature image and a segmentation label into a source domain student model, and training the source domain student model; and inputting the first source domain generation diagram and the characteristic diagram thereof, the second source domain generation diagram and the characteristic diagram thereof into a source domain teacher model, and assisting the source domain student model to learn through the source domain teacher model. Step S132, inputting a first target domain generation diagram, a characteristic diagram thereof and a segmentation label of a source domain real diagram into a target domain student model, and training the target domain student model; and inputting the target domain real image and the characteristic image thereof, the second target domain generated image and the characteristic image thereof into a target domain teacher model, and assisting the target domain student model to learn through the target domain teacher model.
In this embodiment, a specific image segmentation module and its training method are provided, in which two teacher-student model pairs complete the image segmentation work. In addition to the corresponding real map or generation map, the input of each sub-segmentation model includes the feature map of that real map or generation map, extracted by the content encoder. Through these feature maps, each sub-segmentation model can achieve feature enhancement while extracting its own features.
Referring to FIG. 4, the segmentation label of the source domain real map is denoted y_s; FIG. 4 also marks the source domain teacher model and the source domain student model, the target domain teacher model and the target domain student model, the enhanced features of the source domain real map and the enhanced features of the target domain generation map.
Based on the above image segmentation module and its training method, the training loss of the image segmentation module may include:
the source domain segmentation loss of the source domain student model;
and the target domain segmentation loss of the target domain student model;
both of which are calculated by combining a Soft Dice function and a weighted cross entropy (CE) function.
These two loss functions are used to compute the corresponding segmentation losses in order to alleviate the class imbalance between the relatively small targets and the large background in multi-modal images, and in cardiac images in particular.
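As a reference, the following is a sketch of the standard Soft Dice and weighted cross-entropy terms; the symbols (y, ŷ, w_c, C, ε) are introduced here for illustration and the patent's exact formulation may differ:

```latex
\begin{align}
\mathcal{L}_{Dice} &= 1 - \frac{1}{C}\sum_{c=1}^{C}
\frac{2\sum_{i} y_{i,c}\,\hat{y}_{i,c} + \epsilon}
     {\sum_{i} y_{i,c} + \sum_{i}\hat{y}_{i,c} + \epsilon} \\
\mathcal{L}_{wCE} &= -\frac{1}{N}\sum_{i}\sum_{c=1}^{C} w_c\, y_{i,c}\,\log \hat{y}_{i,c}
\end{align}
```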
It should be noted that the above scheme strengthens the learning of cross-domain structural knowledge by transferring the knowledge-rich student model parameters to the corresponding teacher model in a semi-supervised manner. Specifically, following a self-ensembling strategy, in training batch t the parameters of the student model G are passed to the teacher model G_intra by an exponential moving average (EMA), whose update formula is
θ'_t = α·θ'_{t-1} + (1 − α)·θ_t,
where θ'_t is the network weight parameter of the teacher model at batch t, θ_t is the network weight parameter of the student model at batch t, and α is a decay coefficient that adjusts the influence of the student model on the teacher model. The transfer of structural knowledge from the source domain to the target domain is regularised by a consistency loss in which N_s is the number of labelled source domain samples in a training batch and the output of the source domain student model is compared with the output of the source domain teacher model. Similarly, the transfer of structural knowledge from the target domain to the source domain is regularised by a consistency loss in which the number of unlabelled target domain samples in a training batch is used and the output of the target domain student model is compared with the output of the target domain teacher model.
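A minimal PyTorch-style sketch of the EMA update and a commonly used consistency term is given below; the decay value and the mean-squared-error form of the consistency loss are assumptions, since the patent's exact formulas are not reproduced above:

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, alpha: float = 0.99) -> None:
    """Mean-teacher EMA update: theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t.
    alpha = 0.99 is an assumed decay value, not taken from the patent."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.data.mul_(alpha).add_(s_p.data, alpha=1.0 - alpha)

def consistency_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """A common intra-domain consistency term (an assumption here): mean squared error
    between the softmax outputs of the student and teacher models."""
    return torch.mean((student_logits.softmax(dim=1) - teacher_logits.softmax(dim=1)) ** 2)
```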
To integrate adversarial learning into the semantic prediction space of the teacher-student networks, in some embodiments the image segmentation module further includes a second source domain discriminator and a second target domain discriminator. The teacher and student models play the role of generators that produce predictions for images of the different domains, while each domain discriminator tries to distinguish whether the semantic pixel features were extracted by the teacher model or by the student model, and its gradient is back-propagated into the feature space of the student model. Through the continual transfer of adversarial gradient information, the student model is encouraged to produce semantic features that are more similar to those of the teacher.
Thus, the training loss of the image segmentation module further includes:
the second source domain adversarial learning loss, calculated by the second source domain discriminator based on the segmentation result of the source domain student model and the segmentation result of the source domain teacher model;
and the second target domain adversarial learning loss, calculated by the second target domain discriminator based on the segmentation result of the target domain student model and the segmentation result of the target domain teacher model.
The training process mainly trains each part of the two modules independently with its respective training loss; after the independent training is completed, the image conversion module and the image segmentation module can be trained jointly, that is, overall collaborative training is performed. The overall collaborative training is divided into two parts: training first aims at minimising the total loss of the image conversion module, and once the image conversion module converges, training aims at minimising the total loss of the image segmentation module; the training ends when the image segmentation module converges.
The total loss of the image conversion module is composed of the generation losses of the source domain and the target domain, weighted by four hyper-parameters that adjust the contributions of the respective sub-networks; in the subsequent experimental examples, the corresponding values are 10.0, 2.0, 0.001 and 1.0.
The total loss of the image segmentation module is composed of the losses of the symmetrical teacher-student models and the domain discrimination losses, weighted by three hyper-parameters that adjust the contributions of the respective sub-networks; in the experimental examples, the corresponding values are 1.0, 0.1 and 0.5.
The training method of the multi-modal image segmentation model provided by the invention has now been described through several embodiments; some preferred configurations of the multi-modal image segmentation model and of its training method are described below.
In some of these embodiments, in the image segmentation module, each sub-segmentation model comprises at least a first downsampling convolution layer, a second downsampling convolution layer, a third downsampling convolution layer, a fourth downsampling convolution layer and a fifth downsampling convolution layer connected in sequence; in the image conversion module, the content encoder comprises at least 5 network layers connected in sequence, namely an initial convolution layer, a first downsampling layer, a second downsampling layer, a first lower convolution layer and a second lower convolution layer, and the feature map output by the content encoder comprises the sub-feature maps output by these 5 network layers.
The data processing flow of each sub-segmentation model comprises:
fusing the feature map output by the second downsampling convolution layer with the sub-feature map output by the first downsampling layer to obtain a first fusion feature, and inputting the first fusion feature into the third downsampling convolution layer; fusing the feature map output by the third downsampling convolution layer with the sub-feature map output by the second downsampling layer to obtain a second fusion feature, and inputting the second fusion feature into the fourth downsampling convolution layer; fusing the feature map output by the fourth downsampling convolution layer with the sub-feature map output by the first lower convolution layer to obtain a third fusion feature, and inputting the third fusion feature into the fifth downsampling convolution layer; fusing the feature map output by the fifth downsampling convolution layer with the sub-feature map output by the second lower convolution layer to obtain a fourth fusion feature; connecting the fourth fusion feature with the third fusion feature through upsampling to obtain a first connection feature; connecting the first connection feature with the second fusion feature through upsampling to obtain a second connection feature; connecting the second connection feature with the first fusion feature through upsampling to obtain a third connection feature; connecting the third connection feature with the feature map output by the first downsampling convolution layer through upsampling to obtain a fourth connection feature; and performing a convolution operation on the fourth connection feature to obtain the segmentation result of the sub-segmentation model.
In this embodiment, both the content encoder and the sub-segmentation models have a multi-layer structure, and the feature fusion of a sub-segmentation model refers to feature map fusion between corresponding network layers. Accordingly, because of this fusion enhancement of the multi-layer feature maps, the second source domain adversarial learning loss and the second target domain adversarial learning loss can each be extended to a multi-scale form, in which the adversarial loss is computed on the output of the i-th layer of the segmentation network and the multi-scale discriminators that judge the outputs of the different layers do not share parameter weights.
The content encoder is mainly used to encode the input image and thereby obtain the corresponding feature maps. Illustratively, the content encoder network can be based on part of a ResNet-50 model. It comprises at least 5 network layers connected in sequence: an initial convolution layer (a Conv2d convolution, InstanceNorm normalisation and a ReLU activation), whose output is the first sub-feature map; a first downsampling layer (max pooling plus a convolution layer), whose output is the second sub-feature map; a second downsampling layer, whose output is the third sub-feature map; a first lower convolution layer (4 residual blocks), whose output is the fourth sub-feature map; and a second lower convolution layer (6 residual blocks), whose output is the fifth sub-feature map. These five sub-feature maps form the feature map that the content encoder outputs externally. In addition, to obtain the deep semantic features required by the similarity mining module, a further lower convolution layer containing 3 residual blocks follows the second lower convolution layer, and the resulting feature map is passed through global average pooling and a fully connected layer to obtain deep features with the target number of channels (the output of the first multi-layer perceptron). All of the above are downsampling convolution layers: with each layer the size of the feature map decreases and the number of channels increases. To connect effectively to the segmentation model, the encoder can be modified so that the output of each of its layers has the same size as the feature map obtained by the corresponding downsampling step in the sub-segmentation model.
The decoder is mainly used for decoding the input feature map and restoring the feature map to the real size (the size of the real map), so as to obtain a corresponding generated map. Specifically, the decoder may include 3 residual blocks, a 2-layer upsampling network, and the final output is a true-size generated map generated by the image conversion model, and the input size of the decoder may be (batch-size, 256, 64, 64).
The first source domain discriminator and the first target domain discriminator may include 1 convolution layer and 3 downsampling convolution layers, and finally the feature map is processed into a fixed size through a Sigmoid function to obtain a probability map representing whether the feature map is judged to be a real image or a generated image, and the probability map is used for calculating the generated loss in the total loss of the image conversion model.
The similarity mining module is used to mine domain-invariant feature information in the image conversion stage and, based on the Siamese network structure proposed by Chen and He, assists the content encoder in processing image features. The network takes the real images of the two domains as input and performs semantic extraction through a weight-sharing neural network.
The teacher model and the student model have the same network structure, and each sub-segmentation model adopts a Unet network.
Referring to FIG. 5, which shows the model of a sub-segmentation model, the processing of an image x fed into a sub-segmentation model is as follows. The input first passes through a convolution layer of the Unet network; at the same time the content encoder also processes the input x, and these two outputs correspond to each other directly at this stage. A further downsampling convolution then follows, and its output is fused (added) with the synchronised sub-feature map from the content encoder to obtain the first fusion feature. Three more downsampling convolutions with feature-enhancement fusion are then carried out in turn to obtain the second, third and fourth fusion features. A skip-connection strategy is then used: the fourth fusion feature is upsampled by a factor of 2 with the number of channels halved to match the size of the third fusion feature, and the two are concatenated to obtain the first connection feature map; after two convolution layers (which do not change the number of channels), the new feature map is matched in size to the second fusion feature and concatenated with it to obtain the second connection feature map. Two further concatenations follow, producing the third and fourth connection feature maps, the last of which is a feature map of the same size as the input image x. A final convolution operation yields the segmentation map; this convolution does not change the spatial size, which remains the same as that of x (H = W = 256), while the number of channels becomes N+1, where N is the number of classes and the +1 represents the background. In the final segmentation output, channels 1 to N each represent the segmentation of one class, and channel 0 represents the background class.
The two discriminators in the image segmentation module are arranged essentially in line with the discriminator network layers in the image conversion module, comprising 1 convolution layer, 3 downsampling convolution layers and a final Sigmoid mapping function. Their initial convolution layer parameters differ because the sizes of their input images differ: for example, the input size of the discriminators in the image conversion module is (batch-size, 3, 256, 256), whereas the input size of the discriminators in the image segmentation module is (batch-size, N+1, 256, 256). The second source domain discriminator and the second target domain discriminator output a probability map representing whether the segmentation result is judged to originate from a real image or from a generated image.
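A minimal PyTorch sketch of such a discriminator, following the described layout (one convolution layer, three downsampling convolution layers and a Sigmoid), is given below; channel widths and kernel sizes are assumptions:

```python
import torch.nn as nn

def make_discriminator(in_channels: int, base: int = 64) -> nn.Sequential:
    """One convolution layer, three downsampling convolution layers, then Sigmoid."""
    layers = [nn.Conv2d(in_channels, base, kernel_size=4, stride=1, padding=1),
              nn.LeakyReLU(0.2, inplace=True)]
    ch = base
    for _ in range(3):                         # three downsampling convolution layers
        layers += [nn.Conv2d(ch, ch * 2, kernel_size=4, stride=2, padding=1),
                   nn.InstanceNorm2d(ch * 2),
                   nn.LeakyReLU(0.2, inplace=True)]
        ch *= 2
    layers += [nn.Conv2d(ch, 1, kernel_size=4, stride=1, padding=1), nn.Sigmoid()]
    return nn.Sequential(*layers)

# image conversion discriminator:   make_discriminator(in_channels=3)
# image segmentation discriminator: make_discriminator(in_channels=N + 1)  # N = number of classes
```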
As described above, the image segmentation model and its training method provided by the invention have now been fully described; after model training ends, model testing and model evaluation are required.
The process results of the specific model experiments are as follows:
In an experimental embodiment, the effectiveness of the image segmentation model and its training method provided by the invention is verified by application to MRI and CT cardiac substructure images. For this dataset, the MRI and CT data are unpaired and were collected from different patient populations. In the adaptation training, the label information of the target domain images is used only for evaluation and does not participate in training.
1. Data set
Cardiac substructure segmentation is performed using the 2017 Multi-Modality Whole Heart Segmentation (MMWHS) challenge dataset. The training data include unpaired MRI images of 20 patients and CT images of 20 patients, together with the corresponding label information. The invention aims to use the proposed image segmentation model to segment and analyze four cardiac structures: the Ascending Aorta (AA), the Left Atrial blood Chamber (LAC), the Left Ventricular blood Chamber (LVC), and the left ventricular Myocardium (MYO).
During training, both modality datasets were randomly divided into two groups, with label usage rates for training of 75% and 25% respectively, and the remaining patients used for testing. The MRI and CT scans in the MMWHS cardiac dataset have different fields of view: cardiac MRI scans capture the entire region from the neck to the abdomen, with different angles and fields of view, whereas cardiac CT scans always present the cardiac region. To give each modality dataset a similar, consistent view during training, the original scan images may be manually cropped to cover the target structures to be segmented; for the cardiac dataset, a three-dimensional bounding box with a fixed coronal size of 256 × 256, centered on the cardiac region, is used to crop the data. In terms of data processing, all image data are normalized to the pixel interval 0 to 1, while the label data are converted to 5 channels (including the background class). Each sample is resampled to 256 × 256, and rotation and scaling transformations are used for data augmentation to reduce over-fitting and improve label utilization.
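As a non-limiting illustration of the preprocessing described above, the following sketch uses a simple center crop as a stand-in for the bounding box centered on the cardiac region; the function and parameter names are assumptions:

import numpy as np

def preprocess_slice(image, label, crop_size=256, num_classes=4):
    # center-crop the slice, normalize pixels to [0, 1], and one-hot encode
    # the label into num_classes + 1 channels (including the background)
    h, w = image.shape
    top, left = (h - crop_size) // 2, (w - crop_size) // 2
    image = image[top:top + crop_size, left:left + crop_size]
    label = label[top:top + crop_size, left:left + crop_size]
    image = (image - image.min()) / (image.max() - image.min() + 1e-8)
    one_hot = np.stack([(label == c).astype(np.float32)
                        for c in range(num_classes + 1)], axis=0)
    return image.astype(np.float32), one_hot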
2. Experimental details
The model provided by the invention is implemented on the PyTorch platform with a V100 graphics card with 32 GB of memory. The model is trained for 150 iterations on the cardiac dataset and the generated images of the corresponding domains, with the training batch size set to 6; the whole training process takes about 50 hours. For the cardiac dataset segmentation task, the shared encoder, the domain decoders, and the segmentation models are all optimized with the Adam optimizer: the learning rate of the shared encoder optimizer is set to 2×10⁻⁴, and the learning rate of the source domain decoder and target domain decoder optimizers is set to 1×10⁻³. In the semi-supervised training process, only the source domain student model and the target domain student model are optimized, with the learning rate set to 3×10⁻², and the learning rate of each domain discriminator used for adversarial learning is set to 1×10⁻³.
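For illustration only, the optimizer configuration described above might look like the following sketch; all module variable names (shared_encoder, source_decoder, target_decoder, the student models, and the discriminators) are assumptions:

import itertools
import torch

opt_encoder = torch.optim.Adam(shared_encoder.parameters(), lr=2e-4)
opt_decoders = torch.optim.Adam(itertools.chain(source_decoder.parameters(),
                                                target_decoder.parameters()), lr=1e-3)
opt_students = torch.optim.Adam(itertools.chain(source_student.parameters(),
                                                target_student.parameters()), lr=3e-2)
opt_discriminators = torch.optim.Adam(itertools.chain(source_discriminator.parameters(),
                                                      target_discriminator.parameters()), lr=1e-3)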
3. Evaluation index
The Dice similarity coefficient, a conventional evaluation index for segmentation models, is adopted:

Dice(A, B) = 2|A ∩ B| / (|A| + |B|),
wherein A represents the prediction map generated by the model and B represents the ground-truth label map (Ground Truth). It should be noted that the segmentation result on each channel of the prediction map A must be compared with the corresponding channel of the label B, so for one input test image the Dice between A and B is computed 4 times and the average Dice value is then taken. The higher the Dice value, the better the segmentation performance of the model.
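As a non-limiting sketch of this evaluation, the per-class Dice computation and its average over the four substructures might be implemented as follows; the function name and the use of class-index maps are assumptions:

import numpy as np

def mean_dice(pred, label, num_classes=4, eps=1e-8):
    # pred, label: integer class-index maps; classes 1..N are the cardiac
    # substructures and 0 is the background, which is excluded from the average
    scores = []
    for c in range(1, num_classes + 1):
        a = (pred == c).astype(np.float32)
        b = (label == c).astype(np.float32)
        scores.append((2.0 * (a * b).sum() + eps) / (a.sum() + b.sum() + eps))
    return float(np.mean(scores))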
4. Method comparison
To demonstrate the effectiveness of the proposed image segmentation model (the FES-UDA framework) and its training method, its segmentation performance is compared with representative cross-domain adaptation methods, including CycleGAN, SIFA, and MT-UDA. The reasons for selecting these methods as comparison targets are as follows: CycleGAN is a classical and mature image adaptation work, and the comparison with it demonstrates the effectiveness of the image conversion module of the invention; SIFA is currently the most mature collaborative UDA framework, but it assumes fully labeled source domain data, and the comparison with it shows that the practical problem of scarce source domain labels affects most UDA models and demonstrates the effectiveness of the semi-supervised learning work; MT-UDA is a semi-supervised cross-domain segmentation method for medical images that addresses the scarcity of source domain labels, and the comparison with it demonstrates the effectiveness of the semi-supervised learning paradigm built on two pairs of teacher-student models in the invention.
The invention compares the performance of all these segmentation models on the self-processed public dataset. First, the effectiveness of the proposed image conversion model is discussed. The invention constructs the image conversion model around a content encoder that outputs feature maps at different scales; in addition, the similarity mining module can capture semantic information of the different domains at different network layers. To demonstrate the effectiveness of the similarity module, FIG. 6 shows the outputs, on different channels and at different scales of the target domain image, of the content encoder of the invention and of the generator in CycleGAN. As can be seen from FIG. 6, the image features extracted by the content encoder of the invention are clearer and richer in detail, indicating that adding the similarity module after the content encoder improves its encoding capability.
The experiments and studies address domain adaptation with MRI images as the source domain and CT images as the target domain. Table 1 shows the cross-modal segmentation performance of the various methods under the same proportion of source domain labels, and FIG. 7 shows a visual comparison of the models on the cardiac images (with CT as the target domain); it can be seen that the method of the invention outperforms the other comparison methods. Specifically, for a fair comparison, the invention designed two baselines. The first uses the same U-net segmentation network as MT-UDA in the training stage, trained with the labeled source domain images only, without any target domain images or label information. The second trains another U-net model using only the target domain images and their labels, without any source domain images or label information, and is named "Supervised".
Table 1 results of performance comparisons of the present invention with different unsupervised domain adaptation methods in cardiac segmentation
Table 1 shows the performance of the various methods in segmenting the cardiac dataset. Without domain adaptation (No adaptation), the segmentation model trained on MRI images obtains an average Dice value of only 17.22% when directly predicting CT images, 66.44 percentage points below supervised training with CT images (Supervised). This indicates a severe domain shift between MRI images and CT images. After domain adaptation is applied, CycleGAN and SIFA reach Dice values of 41.93% and 72.55% respectively, well above the No-adaptation scenario, which demonstrates the effectiveness of domain adaptation.
More importantly, the invention reduces the source domain label usage rate to simulate a label-scarce scenario, i.e., the 75% source domain label usage rate set in conventional UDA work is reduced to 25%. According to the experimental results, the segmentation performance of the different UDA methods drops when facing this challenge. For example, the Dice performance of the CycleGAN method drops from 41.93% to 29.03%, and that of the SIFA method drops from 72.55% to 57.11%, indicating that the reduction of source domain labels has a large impact on existing UDA methods.
Meanwhile, although the other semi-supervised learning methods show smaller drops in average Dice value than the UDA methods when dealing with the label scarcity challenge, their drops are all larger than that of the model of the invention. When coping with label scarcity, the model of the invention reaches an average Dice value of 71.62% over the four cardiac substructures, the closest to the average Dice value of direct supervised training on CT images (Supervised), indicating that the model of the invention improves cross-modal segmentation performance in the source-domain label-scarce scenario.
FIG. 7 shows a visual comparison of the segmentation of the cardiac dataset by the various methods in the source-label-scarce scenario. For the cardiac CT images, the columns from left to right are: the test CT image (first column), the segmentation result of direct testing without adaptation ("W/o adaptation", second column), the segmentation result on the target domain CT cardiac images of the classical domain conversion method CycleGAN (third column), the segmentation result of the SIFA domain adaptation method on the preprocessed images of the invention (fourth column), the segmentation result of the semi-supervised domain adaptation method MT-UDA on the preprocessed images of the invention (fifth column), the segmentation result of the FES-UDA network of the invention (sixth column), the segmentation result of the network supervised-trained with CT images (seventh column), and the ground-truth label map (last column). As shown in FIG. 7, the regions indicated by arrows are unrecognized regions or mis-predicted regions. Unrecognized regions appear in the predictions of some methods, mostly the UDA methods, which tend to predict the substructures as background: because the source domain labels are scarce, these models cannot recognize effective semantic features, which is why the invention adopts similarity-based feature enhancement to maintain the model's attention to the image features at every layer. Some methods also show prediction errors, because label scarcity in UDA work destabilizes the source-domain decision boundary and affects global semantic recognition. Therefore, the invention adds adversarial learning to the feature alignment work, strengthening the model's recognition of semantic features and maintaining the consistency of semantic predictions.
5. Ablation analysis of key components
The invention performs an ablation experiment on image adaptation with the cardiac dataset to demonstrate the effectiveness of the model in mining cross-modal common feature information. As shown in FIG. 8, the leftmost column (W/o adaptation) is the lower-bound network of the invention, whose performance demonstrates the severe domain shift between the different modalities. The second column is a box-plot of the performance of the MT-UDA network, used as the baseline, on the preprocessed cardiac dataset of the invention. The third column is an ablation experiment of the invention intended to demonstrate the effectiveness of the symmetric teacher-student model; its performance differs little from MT-UDA, and the invention conjectures that, owing to the scarcity of source domain labels, the effectiveness of aligning features from the target domain to the source domain cannot be guaranteed. The fourth column shows the performance of the LE-UDA network as reported in its original paper, demonstrating the effectiveness of the self-training adversarial discrimination training mode. The fifth and sixth columns are ablation experiments of the invention intended to demonstrate the effectiveness of feature enhancement: the fifth column adds the FA adversarial discrimination module and performs better than the third and fourth columns, illustrating the effectiveness of feature enhancement; the sixth column further adopts the symmetric teacher-student model to segment the cardiac substructures on the basis of the fifth column, and its performance is the highest, demonstrating the effectiveness of transferring feature knowledge from the target domain to the source domain.
In summary, the invention has the following advantages:
1. Reduced requirement for labeled data: the model adopts semi-supervised learning to improve label utilization, so that only a small amount of labeled source-domain image data is needed for training. This further reduces the huge amount of labeled data otherwise required for multi-modal image segmentation, alleviates the need for large numbers of medical staff with professional knowledge to complete accurate annotation of medical images, and greatly reduces cost.
2. Feature enhancement effectively connects the image alignment work and the image segmentation work: starting from the advantage of the image conversion model in unsupervised domain adaptation (the content encoder processes accurate real-image features), the model designs a feature enhancement link that effectively alleviates the adverse effect of poor generation quality on the subsequent segmentation work, strengthens the segmentation model's continuous attention to the target features of the multi-modal images, and thereby stabilizes the global decision boundary.
3. The segmentation model has stronger semantic recognition capability: based on the idea of generative adversarial networks, the model enhances semantic recognition capability within the semi-supervised cross-domain segmentation model. The domain discriminators serve as the discriminator side, and the segmentation network models, which can generate prediction maps for the images of the two modalities, serve as the generator side, so that the segmentation network forms a complete game process. Specifically, the generators produce prediction maps for the two modal images, and the discriminator analyzes the two prediction maps to judge from which modal domain they originate. When the domain adversarial learning loss converges, i.e., the discriminator can no longer judge from which modality a prediction map originates, the segmentation capability of the segmentation network on the target domain real image is considered to be already close to its capability on the target domain generated image, thereby improving the model's semantic recognition capability for multi-modal images.
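For illustration, a minimal sketch of such an adversarial game between a domain discriminator and the segmentation prediction maps is given below; the least-squares GAN formulation and all names are assumptions rather than the exact losses of the invention:

import torch
import torch.nn.functional as F

def adversarial_losses(discriminator, pred_a, pred_b):
    # the discriminator learns to tell prediction maps of one source (pred_a)
    # from those of the other (pred_b); the segmentation networks acting as
    # generators are pushed to make pred_b indistinguishable from pred_a
    d_a = discriminator(pred_a.detach())
    d_b = discriminator(pred_b.detach())
    loss_d = F.mse_loss(d_a, torch.ones_like(d_a)) + F.mse_loss(d_b, torch.zeros_like(d_b))
    g_b = discriminator(pred_b)                      # generator update path
    loss_g = F.mse_loss(g_b, torch.ones_like(g_b))
    return loss_d, loss_g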
The embodiment of the invention also provides a multi-mode image segmentation method, which comprises the following steps:
Step S210: acquiring a multi-modal image to be segmented, wherein the multi-modal image comprises a source domain real image and a target domain real image;
Step S220: inputting the multi-modal image into a multi-modal image segmentation model to obtain a segmentation result of the multi-modal image, wherein the multi-modal image segmentation model is trained with the training method of the multi-modal image segmentation model described above.
The multi-modal image segmentation model provided by the invention can segment multi-modal images effectively and accurately. It solves the problem that, in existing multi-modal image segmentation methods based on semi-supervised learning, a large data distribution difference between the generated image and the real image of the target domain affects the effectiveness of semi-supervised learning and ultimately the image segmentation effect.
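As a non-limiting illustration, inference with the trained target domain student model on a CT image might look like the following sketch; the variable names are assumptions:

import torch

target_student.eval()
with torch.no_grad():
    # enh_features: sub-feature maps of the content encoder for ct_image
    logits = target_student(ct_image, enh_features)   # shape (1, N+1, 256, 256)
    prediction = logits.argmax(dim=1)                  # per-pixel class indices, 0 = background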
The embodiment of the invention also provides a training device based on the multi-modal image segmentation model and a multi-modal image segmentation device, which are used to implement the above embodiments and preferred implementations; what has already been described is not repeated. The terms "module", "unit", "sub-unit", and the like used below may refer to a combination of software and/or hardware that performs a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
The multi-mode image segmentation model comprises an image conversion module and an image segmentation module based on semi-supervised learning, and the device comprises:
the sample data acquisition module is used for acquiring a training sample set, wherein the training sample comprises a source domain real image, a segmentation label thereof and a target domain real image;
the image conversion training module is used for training the image conversion module through the source domain real image and the target domain real image, and obtaining a source domain generated image and a characteristic image thereof, a target domain generated image and a characteristic image thereof, and a characteristic image of the source domain real image and a characteristic image of the target domain real image through the image conversion module;
The image segmentation training module is used for training the image segmentation module through a training sample set, a source domain generation diagram and a characteristic diagram thereof, a target domain generation diagram and a characteristic diagram thereof, a source domain real diagram and a characteristic diagram of a target domain real diagram.
With the model trained by this device, the feature maps of the target domain generation map, the target domain real map, the source domain generation map, and the source domain real map are obtained through the image conversion module and provided to the image segmentation module, so that the image segmentation module can fuse the feature maps of the corresponding images when extracting features. This enhances feature recognition in the image segmentation module, effectively alleviates the adverse effect of a poorly generated target domain image on the subsequent segmentation work, and strengthens the segmentation model's continuous attention to the target-region features, thereby stabilizing the global decision boundary. It therefore solves the problem that, in existing multi-modal image segmentation methods based on semi-supervised learning, a large data distribution difference between the generated image and the real image of the target domain affects the effectiveness of semi-supervised learning and ultimately the image segmentation effect.
The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the various modules described above may be located in the same processor; or the above modules may be located in different processors in any combination.
There is also provided in an embodiment of the invention an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic device may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and are not described in detail in this embodiment.
In addition, in combination with the training method of the multi-modal image segmentation model or the multi-modal image segmentation method provided in the above embodiment, a storage medium may also be provided for implementation in the present embodiment. The storage medium has a computer program stored thereon; the computer program, when executed by a processor, implements the training method or the multimodal image segmentation method of any of the multimodal image segmentation models of the embodiments described above.
It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to be limiting. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present application, are within the scope of the present application in light of the embodiments provided herein.
It is evident that the drawings are only examples or embodiments of the present application, from which a person skilled in the art can adapt the present application to other similar situations without inventive effort. In addition, it should be appreciated that although such development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and thus should not be construed as indicating that this disclosure is insufficient.

Claims (6)

1. A training method of a multi-modal image segmentation model, wherein the multi-modal image segmentation model includes an image conversion module and a semi-supervised learning-based image segmentation module, the training method comprising:
acquiring a training sample set, wherein the training sample comprises a source domain real image and a segmentation label and a target domain real image thereof;
training the image conversion module through the source domain real image and the target domain real image, and obtaining a source domain generation image and a characteristic image thereof, a target domain generation image and a characteristic image thereof, and a characteristic image of the source domain real image and the target domain real image through the image conversion module;
Training the image segmentation module through the training sample set, a source domain generation diagram, a characteristic diagram thereof, a target domain generation diagram, a characteristic diagram thereof, a real source domain diagram and a real target domain diagram;
the image conversion module comprises a content encoder, a source domain decoder and a target domain decoder, wherein the source domain generation diagram comprises a first source domain generation diagram and a second source domain generation diagram, and the target domain generation diagram comprises a first target domain generation diagram and a second target domain generation diagram;
obtaining a source domain generation diagram and a feature diagram thereof, a target domain generation diagram and a feature diagram thereof, and a source domain real diagram and a feature diagram of the target domain real diagram through the image conversion module, wherein the method comprises the following steps:
inputting the source domain real image and the target domain real image into the content encoder respectively to obtain a characteristic image of the source domain real image and a characteristic image of the target domain real image;
inputting the feature map of the real map of the source domain into the target domain decoder to obtain the first target domain generation map, and inputting the feature map of the real map of the target domain into the source domain decoder to obtain the first source domain generation map;
inputting the first target domain generation diagram and the first source domain generation diagram into the content encoder respectively to obtain a characteristic diagram of the first target domain generation diagram and a characteristic diagram of the first source domain generation diagram;
Inputting the feature map of the first target domain generation map into the source domain decoder to obtain the second source domain generation map, and inputting the feature map of the first source domain generation map into the target domain decoder to obtain the second target domain generation map;
inputting the second source domain generation diagram and the second target domain generation diagram into the content encoder respectively to obtain a characteristic diagram of the second source domain generation diagram and a characteristic diagram of the second target domain generation diagram;
the image segmentation module comprises 4 sub-segmentation models, wherein the 4 sub-segmentation models are respectively a source domain student model and a source domain teacher model corresponding to the source domain student model, a target domain student model and a target domain teacher model corresponding to the target domain student model;
training the image segmentation module through the training sample set, the source domain generation diagram and the characteristic diagram thereof, the target domain generation diagram and the characteristic diagram thereof, the source domain real diagram and the characteristic diagram of the target domain real diagram, wherein the training comprises the following steps:
inputting the source domain real image, the feature image and the segmentation labels into the source domain student model, and training the source domain student model;
inputting the first source domain generation diagram and the characteristic diagram thereof, the second source domain generation diagram and the characteristic diagram thereof into the source domain teacher model, and assisting the source domain student model to learn through the source domain teacher model;
Inputting the first target domain generation diagram, the characteristic diagram thereof and the segmentation label of the source domain real diagram into the target domain student model, and training the target domain student model;
inputting the target domain real image and the characteristic image thereof, the second target domain generated image and the characteristic image thereof into the target domain teacher model, and assisting the target domain student model to learn through the target domain teacher model;
the training loss of the image segmentation module comprises:
source domain segmentation loss of the source domain student model:
target domain segmentation loss of the target domain student model:
wherein the two segmentation losses are computed with a Soft Dice function and a weighted cross-entropy (CE) function, based on the segmentation label of the source domain real map, the source domain teacher model, the source domain student model, the target domain teacher model, the target domain student model, the enhanced features of the source domain real map, and the enhanced features of the target domain generation map;
the multi-mode image segmentation model further comprises a similarity mining module, wherein the similarity mining module comprises a first multi-layer perceptron and a second multi-layer perceptron which are sequentially connected to the tail end of the content encoder;
The first multi-layer perceptron is used for obtaining a projection characteristic diagram of the input of the content encoder, and the second multi-layer perceptron is used for obtaining a prediction characteristic diagram of the input of the content encoder;
the training loss of the image conversion module further comprises:
and a cosine similarity loss between the projection characteristic diagram and the prediction characteristic diagram.
2. The method of claim 1, wherein the image conversion module further comprises a first source domain discriminator and a first target domain discriminator;
the training loss of the image conversion module comprises:
a first source domain countermeasure learning penalty based on the first source domain generated graph and the source domain real graph calculated by the first source domain discriminator;
a first target domain countermeasure learning penalty calculated by the first target domain discriminator based on the first target domain generation map and the target domain real map;
source domain image reconstruction loss;
target domain image reconstruction loss;
generating a source domain consistency loss of the graph and the source domain real graph based on the second source domain;
and generating a target domain consistency loss of the graph and the target domain real graph based on the second target domain.
3. The method of claim 1, wherein the image segmentation module further comprises a second source domain discriminator and a second target domain discriminator;
the training loss of the image segmentation module further comprises:
a second source domain countermeasure learning loss calculated by the second source domain discriminator based on the segmentation result of the source domain student model and the segmentation result of the source domain teacher model;
a second target domain countermeasure learning loss calculated by the second target domain discriminator based on the segmentation result of the target domain student model and the segmentation result of the target domain teacher model.
4. The method of training a multi-modal image segmentation model according to claim 1, wherein each sub-segmentation model includes at least a first downsampling convolution layer, a second downsampling convolution layer, a third downsampling convolution layer, a fourth downsampling convolution layer, and a fifth downsampling convolution layer connected in sequence;
the content encoder at least comprises 5 network layers which are sequentially connected, wherein the 5 network layers are an initial convolution layer, a first downsampling layer, a second downsampling layer, a first downsampling layer and a second downsampling layer respectively, and the characteristic diagram output by the content encoder comprises sub-characteristic diagrams output by the 5 network layers;
The data processing flow of each sub-segmentation model comprises the following steps:
fusing the feature map output by the second downsampling convolution layer with the sub-feature map output by the first downsampling convolution layer to obtain a first fusion feature, and inputting the first fusion feature into the third downsampling convolution layer;
fusing the characteristic diagram output by the third downsampling convolution layer with the sub-characteristic diagram output by the second downsampling layer to obtain a second fusion characteristic, and inputting the second fusion characteristic to the fourth downsampling convolution layer;
fusing the feature map output by the fourth downsampling convolution layer with the sub-feature map output by the first downsampling convolution layer to obtain a third fused feature, and inputting the third fused feature to the fifth downsampling convolution layer;
fusing the feature map output by the fifth downsampling convolution layer with the sub-feature map output by the second downsampling convolution layer to obtain a fourth fusion feature;
connecting the fourth fusion feature with the third fusion feature through upsampling to obtain a first connection feature;
connecting the first connection feature with the second fusion feature through upsampling to obtain a second connection feature;
connecting the second connection feature with the first fusion feature through upsampling to obtain a third connection feature map;
Connecting the third connection feature with the feature map output by the first downsampling convolution layer through upsampling to obtain a fourth connection feature map;
and carrying out convolution operation on the fourth connection feature diagram to obtain a segmentation result of the sub-segmentation model.
5. The method of training a multi-modal image segmentation model according to claim 1, further comprising:
and performing joint training on the image conversion module and the image segmentation module through the sample training set.
6. A multi-modal image segmentation method, the segmentation method comprising:
acquiring a multi-modal image to be segmented, wherein the multi-modal image comprises a source domain real image and a target domain real image;
inputting the multi-modal image into a multi-modal image segmentation model to obtain a segmentation result of the multi-modal image; wherein the multi-modal image segmentation model is trained using the training method of the multi-modal image segmentation model of any one of claims 1-5.
CN202410121532.4A 2024-01-30 2024-01-30 Training method of multi-modal image segmentation model and multi-modal image segmentation method Active CN117649422B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410121532.4A CN117649422B (en) 2024-01-30 2024-01-30 Training method of multi-modal image segmentation model and multi-modal image segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410121532.4A CN117649422B (en) 2024-01-30 2024-01-30 Training method of multi-modal image segmentation model and multi-modal image segmentation method

Publications (2)

Publication Number Publication Date
CN117649422A CN117649422A (en) 2024-03-05
CN117649422B true CN117649422B (en) 2024-04-12

Family

ID=90046369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410121532.4A Active CN117649422B (en) 2024-01-30 2024-01-30 Training method of multi-modal image segmentation model and multi-modal image segmentation method

Country Status (1)

Country Link
CN (1) CN117649422B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062753A (en) * 2017-12-29 2018-05-22 重庆理工大学 The adaptive brain tumor semantic segmentation method in unsupervised domain based on depth confrontation study
CN111199550A (en) * 2020-04-09 2020-05-26 腾讯科技(深圳)有限公司 Training method, segmentation method, device and storage medium of image segmentation network
WO2022175717A1 (en) * 2021-02-18 2022-08-25 Intuitive Therapeutics Sa System and method for self-attentive image modality conversion and domain adaptation
CN116503296A (en) * 2023-04-04 2023-07-28 苏州大学 Surgical scene image conversion method
CN117437240A (en) * 2023-10-30 2024-01-23 重庆邮电大学 Oral squamous cell carcinoma medical image segmentation method based on improved U-Net network
CN117437420A (en) * 2023-11-06 2024-01-23 广东工业大学 Cross-modal medical image segmentation method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Multi-modal contrastive mutual learning and pseudo-label re-learning for semi-supervised medical image segmentation; Shuo Zhang; Medical Image Analysis; 2023-01-15; Vol. 83; 102656 *
Research on unified multi-modal segmentation of medical images based on generative adversarial networks; Yuan Wenguang; China Master's Theses Full-text Database (Basic Sciences); 2023-01-31; A006-867 *
A DNA-binding protein prediction method based on dense networks; Du Xiuquan; Journal of Chongqing University of Science and Technology; 2020-10-15; Vol. 22, No. 05; 81-85 *

Also Published As

Publication number Publication date
CN117649422A (en) 2024-03-05

Similar Documents

Publication Publication Date Title
Zhuang et al. Evaluation of algorithms for multi-modality whole heart segmentation: an open-access grand challenge
CN111199550B (en) Training method, segmentation method, device and storage medium of image segmentation network
Khened et al. Densely connected fully convolutional network for short-axis cardiac cine MR image segmentation and heart diagnosis using random forest
Mortazi et al. Multi-planar deep segmentation networks for cardiac substructures from MRI and CT
CN110475505A (en) Utilize the automatic segmentation of full convolutional network
CN111476805B (en) Cross-source unsupervised domain adaptive segmentation model based on multiple constraints
Biffi et al. Explainable anatomical shape analysis through deep hierarchical generative models
Li et al. A 3D deep supervised densely network for small organs of human temporal bone segmentation in CT images
Tong et al. 3D deeply-supervised U-net based whole heart segmentation
CN107403446A (en) Method and system for the image registration using intelligent human agents
WO2022121100A1 (en) Darts network-based multi-modal medical image fusion method
Rezaei et al. Whole heart and great vessel segmentation with context-aware of generative adversarial networks
Cui et al. Bidirectional cross-modality unsupervised domain adaptation using generative adversarial networks for cardiac image segmentation
Conze et al. Current and emerging trends in medical image segmentation with deep learning
Xu et al. BMAnet: Boundary mining with adversarial learning for semi-supervised 2D myocardial infarction segmentation
CN113298830B (en) Acute intracranial ICH region image segmentation method based on self-supervision
Boutillon et al. Combining shape priors with conditional adversarial networks for improved scapula segmentation in MR images
Yan et al. Cine MRI analysis by deep learning of optical flow: Adding the temporal dimension
Bateson et al. Constrained domain adaptation for image segmentation
Noothout et al. Knowledge distillation with ensembles of convolutional neural networks for medical image segmentation
Sokooti et al. Hierarchical prediction of registration misalignment using a convolutional LSTM: Application to chest CT scans
Chen et al. Deep semi-supervised ultrasound image segmentation by using a shadow aware network with boundary refinement
CN112164447B (en) Image processing method, device, equipment and storage medium
Lefebvre et al. Lassnet: A four steps deep neural network for left atrial segmentation and scar quantification
CN117649422B (en) Training method of multi-modal image segmentation model and multi-modal image segmentation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant