CN113610856B - Method and device for training image segmentation model and image segmentation - Google Patents

Method and device for training image segmentation model and image segmentation

Info

Publication number
CN113610856B
CN113610856B (application CN202110948547.4A)
Authority
CN
China
Prior art keywords
feature map
feature
image
label
segmentation
Prior art date
Legal status
Active
Application number
CN202110948547.4A
Other languages
Chinese (zh)
Other versions
CN113610856A
Inventor
陶大程
翟伟
Current Assignee
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202110948547.4A priority Critical patent/CN113610856B/en
Publication of CN113610856A publication Critical patent/CN113610856A/en
Application granted granted Critical
Publication of CN113610856B publication Critical patent/CN113610856B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/136Segmentation; Edge detection involving thresholding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

Embodiments of the present disclosure disclose methods and apparatus for training an image segmentation model and for image segmentation. A specific implementation of the method includes: acquiring a sample set and an image segmentation model; selecting a sample from the sample set; inputting the image of the selected sample into a feature extraction network to obtain a first feature map set; inputting the first feature map set into an auxiliary decoder to obtain a second feature map set and an auxiliary segmentation result; fusing the first feature map set and the second feature map set to obtain a fused feature map set; inputting the fused feature map set into a main decoder to obtain a main segmentation result; performing morphological operations on the original label of the selected sample to obtain an auxiliary label; calculating a total loss value based on the difference between the auxiliary label and the auxiliary segmentation result and the difference between the original label and the main segmentation result; and adjusting the image segmentation model according to the total loss value. This embodiment produces a model capable of image segmentation in complex scenes, thereby improving the accuracy of image segmentation.

Description

Method and device for training image segmentation model and image segmentation
Technical Field
Embodiments of the present disclosure relate to the technical field of image processing, and in particular to a method and apparatus for training an image segmentation model and for image segmentation.
Background
Image segmentation is one of the most challenging tasks in computer vision; its purpose is to predict pixel-level semantic labels for a given image. Inspired by recent progress in FCNs (Fully Convolutional Networks), many methods have been proposed that hierarchically model the local context of a target region and learn discriminative feature representations by stacking convolutional and pooling layers in sequence. FCN-based image segmentation methods are also used for special types of object segmentation, including camouflaged object detection, lung infection segmentation, polyp segmentation, cell segmentation, industrial surface inspection, and so on.
However, existing image segmentation methods show significantly degraded performance when deployed on robust image segmentation tasks such as camouflaged object detection, medical image segmentation, and industrial visual inspection, particularly in scenes where foreground objects and background regions are highly similar in appearance. In such complex scenarios, it is difficult to determine the association between a single element (a local neighborhood) and an object boundary. Ambiguity in boundary assignment hinders context modeling, resulting in inaccurate and incomplete segmentation results.
Disclosure of Invention
Embodiments of the present disclosure propose methods and apparatus for training an image segmentation model and image segmentation.
In a first aspect, embodiments of the present disclosure provide a method of training an image segmentation model, comprising: acquiring a sample set and an image segmentation model, wherein samples in the sample set comprise an image and an original label, and the image segmentation model comprises a feature extraction network, an auxiliary decoder and a main decoder; and performing the following training steps: selecting a sample from the sample set; inputting the image of the selected sample into the feature extraction network to obtain a first feature map set; inputting the first feature map set into the auxiliary decoder to obtain a second feature map set and an auxiliary segmentation result; fusing the first feature map set and the second feature map set to obtain a fused feature map set; inputting the fused feature map set into the main decoder to obtain a main segmentation result; performing morphological operations on the original label of the selected sample to obtain an auxiliary label; calculating a total loss value based on the difference between the auxiliary label and the auxiliary segmentation result and the difference between the original label and the main segmentation result; and if the total loss value is smaller than a predetermined threshold, determining that training of the image segmentation model is completed.
In some embodiments, the method further comprises: and if the total loss value is greater than or equal to a preset threshold value, adjusting related parameters of the image segmentation model, and continuing to execute the training step based on the adjusted image segmentation model.
In some embodiments, the auxiliary decoder comprises a lower-region decoder and/or a concave-convex decoder.
In some embodiments, for the lower-region decoder, the auxiliary label is a lower-region label, obtained as follows: subtracting the morphologically eroded original label from the original label to obtain a 1st sub-label; subtracting the original label from the morphologically dilated original label to obtain a 2nd sub-label; and concatenating the 1st sub-label and the 2nd sub-label along the channel dimension to obtain the lower-region label.
In some embodiments, for the concave-convex decoder, the auxiliary label is a concave-convex label, obtained as follows: subtracting the morphologically opened original label from the original label to obtain a 3rd sub-label; subtracting the original label from the morphologically closed original label to obtain a 4th sub-label; and concatenating the 3rd sub-label and the 4th sub-label along the channel dimension to obtain the concave-convex label.
In some embodiments, inputting the first feature map set into the auxiliary decoder to obtain a second feature map set and an auxiliary segmentation result includes: sorting the feature maps in the first feature map set in order of scale from small to large; and, starting from the first-ranked feature map in the first feature map set, performing the following decoding step: convolving the first-ranked feature map and then up-sampling it to obtain a new feature map, and adding the new feature map to the second feature map set; concatenating the new feature map with the second-ranked feature map along the channel dimension, then convolving and up-sampling to obtain a second new feature map, and adding it to the second feature map set; if the second-ranked feature map is the feature map with the largest scale, applying a convolutional mapping to the second new feature map to obtain the auxiliary segmentation result; if the second-ranked feature map is not the feature map with the largest scale, taking the second-ranked feature map as the first-ranked feature map and the feature map following it as the second-ranked feature map, and continuing the decoding step.
In some embodiments, the image segmentation model further comprises an interaction enhancement module; and fusing the first feature map set and the second feature map set to obtain a fused feature map set includes: for each feature map in the first feature map set, inputting the feature maps of the same scale from the first feature map set and the second feature map set into the interaction enhancement module, concatenating them along the channel dimension, and then performing a non-local relation operation to obtain a fused feature map.
In some embodiments, the auxiliary decoder comprises a lower-region decoder and a concave-convex decoder, and the second feature map set comprises a lower-region feature map set and a concave-convex feature map set; and fusing the first feature map set and the second feature map set to obtain a fused feature map set includes: inputting the lower-region feature map and the concave-convex feature map of the same scale into the interaction enhancement module, convolving each of them and then performing a local relation operation to obtain a fused sub-feature map; and inputting each feature map in the first feature map set into the interaction enhancement module, concatenating it with the fused sub-feature map of the same scale, and then performing a non-local relation operation to obtain a fused feature map.
In some embodiments, inputting the fused feature map set into the main decoder to obtain a main segmentation result includes: sorting the feature maps in the fused feature map set in order of scale from small to large; and, starting from the first-ranked feature map in the fused feature map set, performing the following decoding step: convolving the first-ranked feature map and then up-sampling it to obtain a new feature map; convolving the new feature map to obtain a segmentation map and adding the segmentation map to a segmentation map set; concatenating the new feature map with the second-ranked feature map along the channel dimension, then convolving and up-sampling to obtain a second new feature map; convolving the second new feature map to obtain a segmentation map and adding it to the segmentation map set; if the second-ranked feature map is the feature map with the largest scale, taking the largest-scale segmentation map in the segmentation map set as the main segmentation result; if the second-ranked feature map is not the feature map with the largest scale, taking the second-ranked feature map as the first-ranked feature map and the feature map following it as the second-ranked feature map, and continuing the decoding step.
In a second aspect, an embodiment of the present disclosure provides an image segmentation method, including: acquiring an image to be segmented; inputting the image into an image segmentation model trained according to the method of the first aspect to obtain a segmentation result.
In a third aspect, an embodiment of the present disclosure provides an apparatus for training an image segmentation model, including: an acquisition unit configured to acquire a sample set and an image segmentation model, wherein samples in the sample set include an image and an original label, and the image segmentation model includes a feature extraction network, an auxiliary decoder and a main decoder; and a training unit configured to perform the following training steps: selecting a sample from the sample set; inputting the image of the selected sample into the feature extraction network to obtain a first feature map set; inputting the first feature map set into the auxiliary decoder to obtain a second feature map set and an auxiliary segmentation result; fusing the first feature map set and the second feature map set to obtain a fused feature map set; inputting the fused feature map set into the main decoder to obtain a main segmentation result; performing morphological operations on the original label of the selected sample to obtain an auxiliary label; calculating a total loss value based on the difference between the auxiliary label and the auxiliary segmentation result and the difference between the original label and the main segmentation result; and if the total loss value is smaller than a predetermined threshold, determining that training of the image segmentation model is completed.
In a fourth aspect, an embodiment of the present disclosure provides an image segmentation apparatus, including: an acquisition unit configured to acquire an image to be segmented; a segmentation unit configured to input the image into an image segmentation model trained according to the method of the first aspect, resulting in a segmentation result.
In a fifth aspect, embodiments of the present disclosure provide an electronic device, including: one or more processors; storage means having stored thereon one or more computer programs which, when executed by the one or more processors, cause the one or more processors to implement the method as described in the first or second aspect.
In a sixth aspect, embodiments of the present disclosure provide a computer readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to the first or second aspect.
The embodiments of the present disclosure provide a method and apparatus for training an image segmentation model and for image segmentation, which exploit cues related to the figure-ground assignment mechanism to perform robust object segmentation. The figure-ground assignment cues are used as auxiliary supervision to facilitate perception of object boundary context information, yielding better object segmentation results.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a method of training an image segmentation model according to the present disclosure;
FIGS. 3a-3g are schematic diagrams of one application scenario of a method of training an image segmentation model according to the present disclosure;
FIG. 4 is a flow chart of one embodiment of a method of image segmentation according to the present disclosure;
FIG. 5 is a schematic structural view of one embodiment of an apparatus for training an image segmentation model according to the present disclosure;
FIG. 6 is a schematic structural view of one embodiment of an image segmentation apparatus according to the present disclosure;
fig. 7 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 of a method of training an image segmentation model, an apparatus of training an image segmentation model, an image segmentation method, or an image segmentation apparatus to which embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminals 101, 102, a network 103, a database server 104, and a server 105. The network 103 serves as a medium for providing a communication link between the terminals 101, 102, the database server 104 and the server 105. The network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user 110 may interact with the server 105 via the network 103 using the terminals 101, 102 to receive or send messages or the like. The terminals 101, 102 may have various client applications installed thereon, such as model training class applications, image segmentation class applications, shopping class applications, payment class applications, web browsers, instant messaging tools, and the like.
The terminals 101 and 102 may be hardware or software. When the terminals 101, 102 are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), laptop computers, desktop computers, and the like. When the terminals 101, 102 are software, they can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not specifically limited herein.
When the terminals 101, 102 are hardware, an image acquisition device may also be mounted thereon. The image capturing device may be various devices capable of implementing the function of capturing images, such as a camera, a sensor, and the like. The user 110 may acquire images using an image acquisition device on the terminal 101, 102.
Database server 104 may be a database server that provides various services. For example, a database server may have stored therein a sample set. The sample set contains a large number of samples. The sample may include a label of whether each pixel in the sample image belongs to the foreground. Thus, the user 110 may also select samples from the sample set stored by the database server 104 via the terminals 101, 102.
The server 105 may also be a server providing various services, such as a background server providing support for various applications displayed on the terminals 101, 102. The background server may train the initial model using the samples in the sample set sent by the terminals 101, 102, and may send the training results (e.g., the generated image segmentation model) to the terminals 101, 102. In this way, the user can apply the generated image segmentation model for image segmentation.
The database server 104 and the server 105 may be hardware or software. When they are hardware, they may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When they are software, they may be implemented as a plurality of software or software modules (e.g., to provide distributed services), or as a single software or software module. The present invention is not particularly limited herein. Database server 104 and server 105 may also be servers of a distributed system or servers that incorporate blockchains. Database server 104 and server 105 may also be cloud servers, or intelligent cloud computing servers or intelligent cloud hosts with artificial intelligence technology.
It should be noted that, the method for training the image segmentation model or the image segmentation method provided by the embodiments of the present disclosure is generally performed by the server 105. Accordingly, a means for training an image segmentation model or an image segmentation means is typically also provided in the server 105.
It should be noted that the database server 104 may not be provided in the system architecture 100 in cases where the server 105 may implement the relevant functions of the database server 104.
It should be understood that the number of terminals, networks, database servers, and servers in fig. 1 are merely illustrative. There may be any number of terminals, networks, database servers, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a method of training an image segmentation model according to the present disclosure is shown. The method of training an image segmentation model may comprise the steps of:
step 201, a sample set and an image segmentation model are acquired.
In the present embodiment, the execution subject of the method of training the image segmentation model (e.g., the server 105 shown in fig. 1) can acquire a sample set in various ways. For example, the executing entity may obtain the existing sample set stored therein from a database server (e.g., database server 104 shown in fig. 1) through a wired connection or a wireless connection. As another example, a user may collect a sample through a terminal (e.g., terminals 101, 102 shown in fig. 1). In this way, the executing body may receive samples collected by the terminal and store the samples locally, thereby generating a sample set.
Here, the samples in the sample set include an image and an original label. The image may be one in which foreground and background are difficult to distinguish, such as a camouflaged-object image or a cell image. The original label indicates whether each pixel in the image belongs to the foreground; for example, foreground pixels may be marked by manually outlining a region. The original label may be represented by a matrix in which each element corresponds to a pixel in the image, with a value of 1 if the pixel is foreground and 0 otherwise.
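For illustration only, the following snippet (assuming NumPy; the image size and the outlined region are arbitrary) shows such an original label represented as a binary matrix:

```python
# A toy original label: a binary matrix the size of the image,
# 1 for foreground pixels, 0 for background pixels.
import numpy as np

label = np.zeros((256, 256), dtype=np.uint8)  # all background
label[64:192, 80:176] = 1                     # a manually outlined foreground region
```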
The image segmentation model may include a feature extraction network, an auxiliary decoder and a main decoder. The feature extraction network may be a neural network such as ResNet-50, VGG or GoogLeNet. The feature extraction network serves as an encoder, and semantically rich feature maps are obtained from the encoder's multi-scale features. A decoder architecture is then used to aggregate the multi-layer feature maps from the encoder, gradually up-sampling them to generate high-resolution feature maps. Such encoder-decoder based methods can effectively obtain high-resolution feature representations.
The present application uses two kinds of decoders: an auxiliary decoder and a main decoder. The output of the main decoder is the final segmentation result. The auxiliary decoder outputs an intermediate auxiliary segmentation result that is used only for auxiliary supervision during training; when the model is subsequently used for image segmentation, this intermediate result is discarded.
Step 202, selecting a sample from a sample set.
In this embodiment, the execution subject may select a sample from the sample set acquired in step 201 and execute the training steps of steps 203 to 210. The manner of selection and the number of samples selected are not limited in this disclosure. For example, at least one sample may be selected at random, or samples whose images have better sharpness (i.e., higher resolution) may be selected.
Step 203, inputting the image of the selected sample into a feature extraction network to obtain a first feature map set.
In this embodiment, the feature extraction network may output multi-scale feature maps that form the first feature map set. As shown in Fig. 3a, taking ResNet-50 as an example, feature maps at four scales (1/32, 1/16, 1/8 and 1/4) can be extracted from the image, namely the 1st feature map, the 2nd feature map, the 3rd feature map and the 4th feature map, which form the first feature map set.
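The following is a minimal sketch, not the patented implementation, of how such a multi-scale first feature map set could be extracted with a ResNet-50 backbone; it assumes PyTorch and torchvision are available, and the channel counts in the comments are those of a standard ResNet-50:

```python
# Illustrative multi-scale encoder: returns feature maps at 1/32, 1/16, 1/8, 1/4 scale.
import torch
import torchvision

class ResNet50Encoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet50(weights=None)
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2   # 1/4, 1/8
        self.layer3, self.layer4 = net.layer3, net.layer4   # 1/16, 1/32

    def forward(self, x):
        x = self.stem(x)
        c4 = self.layer1(x)    # 4th feature map, 1/4 scale, 256 channels
        c3 = self.layer2(c4)   # 3rd feature map, 1/8 scale, 512 channels
        c2 = self.layer3(c3)   # 2nd feature map, 1/16 scale, 1024 channels
        c1 = self.layer4(c2)   # 1st feature map, 1/32 scale, 2048 channels
        return [c1, c2, c3, c4]  # first feature map set, smallest scale first

encoder = ResNet50Encoder()
feats = encoder(torch.randn(1, 3, 256, 256))
print([f.shape for f in feats])
```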
Step 204, inputting the first feature map set into an auxiliary decoder to obtain a second feature map set and an auxiliary segmentation result.
In the present embodiment, the auxiliary decoder may be one or both of a lower-region (Lower Region) decoder and a concave-convex (Consistency) decoder. If only one auxiliary decoder is used, only one auxiliary segmentation result is obtained; if two auxiliary decoders are used, two auxiliary segmentation results are obtained. Fig. 3a illustrates the simultaneous use of both auxiliary decoders; in practice a single auxiliary decoder may be used. The decoding process of the decoder is known in the prior art and is therefore not described in detail.
If the Lower Region decoder is used, lower-region feature maps are obtained; the decoding process is shown in Fig. 3b.
S1: inputting the 1st feature map, the 2nd feature map, the 3rd feature map and the 4th feature map into the Lower Region decoder to obtain the 5th feature map, the 6th feature map, the 7th feature map and the 8th feature map;
S1.1: the 1st feature map passes through convolution block 1 and double up-sampling to obtain the 5th feature map;
S1.2: the 2nd feature map and the output feature of S1.1 are concatenated along the channel dimension, and then pass through convolution block 2 and double up-sampling to obtain the 6th feature map;
S1.3: the 3rd feature map and the output feature of S1.2 are concatenated along the channel dimension, and then pass through convolution block 3 and double up-sampling to obtain the 7th feature map;
S1.4: the 4th feature map and the output feature of S1.3 are concatenated along the channel dimension, and then pass through convolution block 4 and double up-sampling to obtain the 8th feature map; the 8th feature map is mapped by a convolution layer to obtain an auxiliary segmentation result as the 1st auxiliary output. The second feature map set includes the 5th-8th feature maps.
If the Consistency decoder is used, concave-convex feature maps are obtained; the decoding process is shown in Fig. 3c.
S2: inputting the 1st feature map, the 2nd feature map, the 3rd feature map and the 4th feature map into the Consistency decoder to obtain the 9th feature map, the 10th feature map, the 11th feature map and the 12th feature map;
S2.1: the 1st feature map passes through convolution block 5 and double up-sampling to obtain the 9th feature map;
S2.2: the 2nd feature map and the output feature of S2.1 are concatenated along the channel dimension, and then pass through convolution block 6 and double up-sampling to obtain the 10th feature map;
S2.3: the 3rd feature map and the output feature of S2.2 are concatenated along the channel dimension, and then pass through convolution block 7 and double up-sampling to obtain the 11th feature map;
S2.4: the 4th feature map and the output feature of S2.3 are concatenated along the channel dimension, and then pass through convolution block 8 and double up-sampling to obtain the 12th feature map; the 12th feature map is mapped by a convolution layer to obtain an auxiliary segmentation result as the 2nd auxiliary output. The second feature map set includes the 9th-12th feature maps.
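Both auxiliary decoders share the stage structure described in S1 and S2 (concatenate, convolve, double up-sample, with a final convolutional mapping to the auxiliary output). The sketch below is one possible realization of that structure, assuming PyTorch, 64 intermediate channels and bilinear up-sampling; the convolution block design and the 2-channel auxiliary output (matching the two concatenated sub-labels) are illustrative assumptions rather than details fixed by the patent text:

```python
# Illustrative auxiliary decoder (Lower Region or Consistency branch).
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class AuxiliaryDecoder(nn.Module):
    def __init__(self, enc_channels=(2048, 1024, 512, 256), mid=64):
        super().__init__()
        self.block1 = conv_block(enc_channels[0], mid)        # convolution block 1 (or 5)
        self.block2 = conv_block(enc_channels[1] + mid, mid)  # convolution block 2 (or 6)
        self.block3 = conv_block(enc_channels[2] + mid, mid)  # convolution block 3 (or 7)
        self.block4 = conv_block(enc_channels[3] + mid, mid)  # convolution block 4 (or 8)
        self.head = nn.Conv2d(mid, 2, kernel_size=1)          # maps to the auxiliary output

    def forward(self, feats):
        c1, c2, c3, c4 = feats  # first feature map set, smallest scale first
        up = lambda t: F.interpolate(t, scale_factor=2, mode='bilinear', align_corners=False)
        f5 = up(self.block1(c1))                               # 5th (or 9th) feature map
        f6 = up(self.block2(torch.cat([c2, f5], dim=1)))       # 6th (or 10th) feature map
        f7 = up(self.block3(torch.cat([c3, f6], dim=1)))       # 7th (or 11th) feature map
        f8 = up(self.block4(torch.cat([c4, f7], dim=1)))       # 8th (or 12th) feature map
        aux_out = self.head(f8)                                # auxiliary segmentation result
        return [f5, f6, f7, f8], aux_out
```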
Step 205, fusing the first feature map set and the second feature map set to obtain a fused feature map set.
In this embodiment, feature maps of the same scale from the first feature map set and the second feature map set may be directly fused by concatenation along the channel dimension. For example, the 1/32-scale feature map in the first feature map set (the 1st feature map) and the 1/32-scale feature map in the second feature map set (the 5th feature map) are concatenated to obtain a 1/32-scale fused feature map.
Alternatively, the first set of feature maps and the second set of feature maps may be combined by an interaction enhancement module.
If only one auxiliary decoder is used, the interaction enhancement module may concatenate the auxiliary feature map (the lower-region or concave-convex feature map) with the feature map of the same scale in the first feature map set along the channel dimension, and then perform a non-local relation operation (note that there are various options for the non-local relation operation, such as Non-Local Convolution or Lambda Convolution) to obtain a fused feature map.
If two auxiliary decoders are used, the interaction enhancement module may convolve the lower-region feature map and the concave-convex feature map of the same scale separately and then perform a local relation operation (note that there are various options for the local relation operation, such as Local Relation Convolution or Lambda Convolution) to obtain a fused sub-feature map. Each feature map in the first feature map set is then input into the interaction enhancement module, concatenated with the fused sub-feature map of the same scale, and a non-local relation operation is performed to obtain a fused feature map.
The processing procedure of the interaction enhancement module is shown in fig. 3e, and a fusion feature map set is output.
S3: inputting the first feature map set (the 1st-4th feature maps), the lower-region feature map set in the second feature map set (the 5th-8th feature maps) and the concave-convex feature map set in the second feature map set (the 9th-12th feature maps) into the interaction enhancement module to obtain the 13th feature map, the 14th feature map, the 15th feature map and the 16th feature map;
S3.1: the 5th feature map output by S1.1 passes through convolution block 9 and the 9th feature map output by S2.1 passes through convolution block 10; a local relation operation between the two (the local relation operation has various options, such as Local Relation Convolution or Lambda Convolution) yields a feature map that is concatenated with the 1st feature map along the channel dimension; a non-local relation operation (which likewise has various options, such as Non-Local Convolution or Lambda Convolution) then gives the 13th feature map.
S3.2: the 6th feature map output by S1.2 passes through convolution block 11 and the 10th feature map output by S2.2 passes through convolution block 12; the feature map obtained by the local relation operation is concatenated with the 2nd feature map along the channel dimension, and the non-local relation operation then gives the 14th feature map.
S3.3: the 7th feature map output by S1.3 passes through convolution block 13 and the 11th feature map output by S2.3 passes through convolution block 14; the feature map obtained by the local relation operation is concatenated with the 3rd feature map along the channel dimension, and the non-local relation operation then gives the 15th feature map.
S3.4: the 8th feature map output by S1.4 passes through convolution block 15 and the 12th feature map output by S2.4 passes through convolution block 16; the feature map obtained by the local relation operation is concatenated with the 4th feature map along the channel dimension, and the non-local relation operation then gives the 16th feature map.
The 13th-16th feature maps form the fused feature map set.
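As a rough illustration of one stage of the interaction enhancement module in S3, the sketch below assumes PyTorch and substitutes simple stand-ins for the open choices named above: an element-wise interaction after 3×3 convolutions in place of the local relation operation, and a standard embedded-Gaussian non-local block in place of the non-local relation operation. Channel sizes and the assumption that all three inputs share the same spatial size are illustrative:

```python
# Illustrative interaction enhancement stage (one scale).
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local operation over spatial positions."""
    def __init__(self, ch, reduce=2):
        super().__init__()
        inner = ch // reduce
        self.theta = nn.Conv2d(ch, inner, 1)
        self.phi = nn.Conv2d(ch, inner, 1)
        self.g = nn.Conv2d(ch, inner, 1)
        self.out = nn.Conv2d(inner, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # B x HW x C'
        k = self.phi(x).flatten(2)                     # B x C' x HW
        v = self.g(x).flatten(2).transpose(1, 2)       # B x HW x C'
        attn = torch.softmax(q @ k, dim=-1)            # pairwise spatial relations
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual fusion

class InteractionEnhancement(nn.Module):
    def __init__(self, enc_ch, aux_ch=64, mid=64):
        super().__init__()
        self.conv_low = nn.Conv2d(aux_ch, mid, 3, padding=1)  # e.g. convolution block 9
        self.conv_cvx = nn.Conv2d(aux_ch, mid, 3, padding=1)  # e.g. convolution block 10
        self.non_local = NonLocalBlock(enc_ch + mid)

    def forward(self, enc_feat, low_feat, cvx_feat):
        # Stand-in local interrelation between the two auxiliary branches.
        fused_sub = self.conv_low(low_feat) * self.conv_cvx(cvx_feat)
        # Concatenate with the encoder feature map, then apply the non-local relation.
        return self.non_local(torch.cat([enc_feat, fused_sub], dim=1))
```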
Step 206, inputting the fused feature map set into the main decoder to obtain a main segmentation result.
In this embodiment, the decoding process of the main decoder may be a common method in the prior art, and thus will not be described in detail.
Fig. 3f is a schematic diagram of a main decoder according to the present application.
S4: inputting the 13th feature map, the 14th feature map, the 15th feature map and the 16th feature map output by S3 into the main decoder to obtain the 1st segmentation map, the 2nd segmentation map, the 3rd segmentation map and the 4th segmentation map;
S4.1: the 13th feature map passes through convolution block 17 and double up-sampling to obtain the 17th feature map, and the 17th feature map passes through convolution block 21 to obtain the 1st segmentation map;
S4.2: the 14th feature map and the output feature of S4.1 are concatenated along the channel dimension, then pass through convolution block 18 and double up-sampling to obtain the 18th feature map, which passes through convolution block 22 to obtain the 2nd segmentation map;
S4.3: the 15th feature map and the output feature of S4.2 are concatenated along the channel dimension, then pass through convolution block 19 and double up-sampling to obtain the 19th feature map, which passes through convolution block 23 to obtain the 3rd segmentation map;
S4.4: the 16th feature map and the output feature of S4.3 are concatenated along the channel dimension, then pass through convolution block 20 and double up-sampling to obtain the 20th feature map, which passes through convolution block 24 to obtain the 4th segmentation map.
The final segmentation result is the largest-scale segmentation map, in this case the 4th segmentation map.
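A corresponding sketch of the main decoder stages in S4 could look as follows; it again assumes PyTorch, uses illustrative channel counts for the fused feature maps, and a 1-channel segmentation map per side output (none of these values is fixed by the patent text):

```python
# Illustrative main decoder with one segmentation map per stage.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MainDecoder(nn.Module):
    def __init__(self, fused_ch=(128, 128, 128, 128), mid=64):
        super().__init__()
        self.block1 = nn.Conv2d(fused_ch[0], mid, 3, padding=1)        # convolution block 17
        self.block2 = nn.Conv2d(fused_ch[1] + mid, mid, 3, padding=1)  # convolution block 18
        self.block3 = nn.Conv2d(fused_ch[2] + mid, mid, 3, padding=1)  # convolution block 19
        self.block4 = nn.Conv2d(fused_ch[3] + mid, mid, 3, padding=1)  # convolution block 20
        self.heads = nn.ModuleList([nn.Conv2d(mid, 1, 1) for _ in range(4)])  # blocks 21-24

    def forward(self, fused):
        f13, f14, f15, f16 = fused  # fused feature map set, smallest scale first
        up = lambda t: F.interpolate(t, scale_factor=2, mode='bilinear', align_corners=False)
        f17 = up(self.block1(f13))
        f18 = up(self.block2(torch.cat([f14, f17], dim=1)))
        f19 = up(self.block3(torch.cat([f15, f18], dim=1)))
        f20 = up(self.block4(torch.cat([f16, f19], dim=1)))
        seg_maps = [h(f) for h, f in zip(self.heads, (f17, f18, f19, f20))]
        return seg_maps[-1], seg_maps   # main result is the largest-scale segmentation map
```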
Step 207, performing morphological operation on the original label of the selected sample to obtain an auxiliary label.
This embodiment uses several basic operations from image morphology: erosion, dilation, opening, and closing. These operations are common techniques in the art and are therefore not described in detail.
If the Lower Region decoder is used, a Lower Region label needs to be made as an auxiliary label. If the Consistency decoder is used, a Consistency label needs to be made as an auxiliary label. A pixel may carry both kinds of auxiliary labels at the same time, or only one of them. The interaction enhancement module is not used when only one type of auxiliary label is used. The auxiliary label likewise characterizes whether a pixel in the image is foreground.
Fig. 3g shows a schematic view of the auxiliary labels. #1 denotes the 1st sub-label, #2 denotes the 2nd sub-label, #3 denotes the 3rd sub-label, and #4 denotes the 4th sub-label. In the figure, F represents the foreground region and G represents the background region.
S5: obtaining a Lower Region label and a Consistency label through morphological transformations of the original label;
S5.1: the original label can be viewed as a black-and-white image, with black pixels representing the foreground. Subtracting the morphologically eroded original label from the original label gives the 1st sub-label; subtracting the original label from the morphologically dilated original label gives the 2nd sub-label; concatenating the 1st sub-label and the 2nd sub-label along the channel dimension gives the Lower Region label;
S5.2: subtracting the morphologically opened original label from the original label gives the 3rd sub-label; subtracting the original label from the morphologically closed original label gives the 4th sub-label; concatenating the 3rd sub-label and the 4th sub-label along the channel dimension gives the Consistency label.
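The auxiliary labels of S5.1 and S5.2 can be produced with standard morphological operations. The sketch below assumes OpenCV and NumPy, treats foreground pixels as value 1 as in the label matrix described above, and uses an illustrative 5×5 structuring element (the patent does not fix the element size here):

```python
# Illustrative construction of the Lower Region and Consistency auxiliary labels.
import cv2
import numpy as np

def make_auxiliary_labels(label, ksize=5):
    """label: HxW uint8 mask, 1 = foreground, 0 = background."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ksize, ksize))

    eroded  = cv2.erode(label, kernel)
    dilated = cv2.dilate(label, kernel)
    opened  = cv2.morphologyEx(label, cv2.MORPH_OPEN, kernel)
    closed  = cv2.morphologyEx(label, cv2.MORPH_CLOSE, kernel)

    sub1 = label - eroded    # 1st sub-label: original minus eroded original
    sub2 = dilated - label   # 2nd sub-label: dilated original minus original
    sub3 = label - opened    # 3rd sub-label: original minus opened original
    sub4 = closed - label    # 4th sub-label: closed original minus original

    lower_region_label = np.stack([sub1, sub2], axis=0)  # concatenated along channels
    consistency_label  = np.stack([sub3, sub4], axis=0)
    return lower_region_label, consistency_label
```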
Step 208, calculating the total loss value based on the difference between the auxiliary label and the auxiliary segmentation result and the difference between the original label and the main segmentation result.
In this embodiment, a first loss value may be calculated from the difference between the auxiliary label and the auxiliary segmentation result. If a pixel indicated as foreground by the auxiliary label is segmented as background, that pixel is segmented incorrectly; the discrepancies between the auxiliary label and the auxiliary segmentation result are accumulated in this way over all pixels of the image and fed into a loss function to obtain the first loss value. Similarly, a second loss value may be calculated from the difference between the original label and the main segmentation result. The weighted sum of the first loss value and the second loss value is taken as the total loss value. The weight of the second loss value may be set greater than that of the first loss value, for example 0.2 for the first loss value and 0.8 for the second loss value.
If two auxiliary decoders are adopted, the Lower Region label supervises the 1st auxiliary output, the Consistency label supervises the 2nd auxiliary output, and the original dataset label simultaneously supervises the main segmentation result; joint training is then performed.
If only one auxiliary decoder is employed, only the corresponding auxiliary output is used for joint training.
Binary cross-entropy may be chosen as the loss function and AdamW as the optimizer, with an initial learning rate of 1e-4; training stops after 100 epochs. Training termination may also be controlled by the total loss value.
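Putting the supervision together, the sketch below (assuming PyTorch) shows one way to combine the losses under the example weighting given in step 208 (0.2 for auxiliary supervision, 0.8 for the main output) together with the settings above; the tensor and model names refer to the hypothetical objects from the earlier sketches, and applying the auxiliary weight to each auxiliary output separately is an illustrative choice:

```python
# Illustrative joint loss and optimizer setup.
import torch
import torch.nn.functional as F

def total_loss(main_pred, orig_label, aux_pred_low, lower_region_label,
               aux_pred_cvx, consistency_label, w_aux=0.2, w_main=0.8):
    # Labels are expected as float tensors with the same shapes as the predictions.
    loss_main = F.binary_cross_entropy_with_logits(main_pred, orig_label)
    loss_low  = F.binary_cross_entropy_with_logits(aux_pred_low, lower_region_label)
    loss_cvx  = F.binary_cross_entropy_with_logits(aux_pred_cvx, consistency_label)
    return w_main * loss_main + w_aux * (loss_low + loss_cvx)

# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# for epoch in range(100):   # stop after 100 epochs, or earlier when the total
#     ...                    # loss value falls below the predetermined threshold
```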
In step 209, if the total loss value is smaller than the predetermined threshold, it is determined that the training of the image segmentation model is completed.
In this embodiment, if a plurality of samples were selected in step 202, the execution subject may determine that the initial model training is completed when the total loss value of every sample is smaller than the predetermined threshold. As another example, the execution subject may count the proportion of selected samples whose total loss value satisfies the predetermined threshold, and when this proportion reaches a predetermined sample ratio (e.g., 95%), it can be determined that the initial model training is complete.
Step 210, if the total loss value is greater than or equal to the predetermined threshold, the relevant parameters of the image segmentation model are adjusted, and training steps 202-210 continue to be performed based on the adjusted image segmentation model.
In this embodiment, if the execution subject determines that the model is not yet trained, the relevant parameters of the image segmentation model may be adjusted. For example, back-propagation may be used to modify the weights of the convolutional layers in the image segmentation model. The process may then return to step 202 to re-select samples from the sample set, so that the above training steps can be continued.
According to the method for training an image segmentation model provided by the above embodiment of the present disclosure, an auxiliary segmentation result is obtained through the auxiliary decoder, auxiliary labels are for the first time synthesized by a morphological method, and a new supervision target is introduced, so that the image segmentation model segments images more accurately and avoids erroneous segmentation.
Referring to fig. 4, a flow 400 of one embodiment of an image segmentation method provided by the present disclosure is shown. The image segmentation method may include the steps of:
in step 401, an image to be segmented is acquired.
In the present embodiment, the execution subject of the image segmentation method (e.g., the server 105 shown in fig. 1) can acquire an image of a detection object in various ways. For example, the execution subject may acquire the image stored therein from a database server (e.g., the database server 104 shown in fig. 1) through a wired connection or a wireless connection. For another example, the executing subject may also receive images acquired by a terminal (e.g., terminals 101, 102 shown in fig. 1) or other device.
In the present embodiment, the detection object may be any object that is highly similar to the background area, for example a camouflaged object, a polyp, a cell, or the like.
Step 402, inputting the image into an image segmentation model to obtain a segmentation result.
In this embodiment, the execution subject may input the image acquired in step 401 into the image segmentation model, thereby generating a segmentation result of the detection object. The segmentation result may be a contour describing the object in the image.
In this embodiment, the image segmentation model may be generated using the method described above in connection with the embodiment of FIG. 2. The specific generation process may be referred to in the description of the embodiment of fig. 2, and will not be described herein.
It should be noted that the image segmentation method of the present embodiment may be used to test the image segmentation model generated in the above embodiments, and the image segmentation model can then be further optimized according to the test results. The method may also be a practical application of the image segmentation model generated in the above embodiments. Using the image segmentation model generated by the above embodiments for image segmentation helps improve image segmentation performance, for example finding more objects and producing clearer contours for the objects found.
With continued reference to fig. 5, as an implementation of the method illustrated in the above figures, the present disclosure provides one embodiment of an apparatus for training an image segmentation model. The embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device can be applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for training an image segmentation model of the present embodiment may include: an acquisition unit 501 and a training unit 502. The acquisition unit 501 is configured to acquire a sample set and an image segmentation model, wherein samples in the sample set comprise an image and an original label, and the image segmentation model comprises a feature extraction network, an auxiliary decoder and a main decoder. The training unit 502 is configured to perform the following training steps: selecting a sample from the sample set; inputting the image of the selected sample into the feature extraction network to obtain a first feature map set; inputting the first feature map set into the auxiliary decoder to obtain a second feature map set and an auxiliary segmentation result; fusing the first feature map set and the second feature map set to obtain a fused feature map set; inputting the fused feature map set into the main decoder to obtain a main segmentation result; performing morphological operations on the original label of the selected sample to obtain an auxiliary label; calculating a total loss value based on the difference between the auxiliary label and the auxiliary segmentation result and the difference between the original label and the main segmentation result; and if the total loss value is smaller than a predetermined threshold, determining that training of the image segmentation model is completed.
In some optional implementations of the present embodiment, training unit 502 is further configured to: and if the total loss value is greater than or equal to a preset threshold value, adjusting related parameters of the image segmentation model, and continuing to execute the training step based on the adjusted image segmentation model.
In some alternative implementations of the present embodiment, the auxiliary decoder includes a lower-region decoder and/or a concave-convex decoder.
In some optional implementations of the present embodiment, the training unit 502 is further configured to: subtract the morphologically eroded original label from the original label to obtain a 1st sub-label; subtract the original label from the morphologically dilated original label to obtain a 2nd sub-label; and concatenate the 1st sub-label and the 2nd sub-label along the channel dimension to obtain a lower-region label as the auxiliary label.
In some optional implementations of the present embodiment, the training unit 502 is further configured to: subtract the morphologically opened original label from the original label to obtain a 3rd sub-label; subtract the original label from the morphologically closed original label to obtain a 4th sub-label; and concatenate the 3rd sub-label and the 4th sub-label along the channel dimension to obtain a concave-convex label as the auxiliary label.
In some optional implementations of the present embodiment, the training unit 502 is further configured to: sort the feature maps in the first feature map set in order of scale from small to large; and, starting from the first-ranked feature map in the first feature map set, perform the following decoding step: convolving the first-ranked feature map and then up-sampling it to obtain a new feature map, and adding the new feature map to the second feature map set; concatenating the new feature map with the second-ranked feature map along the channel dimension, then convolving and up-sampling to obtain a second new feature map, and adding it to the second feature map set; if the second-ranked feature map is the feature map with the largest scale, applying a convolutional mapping to the second new feature map to obtain the auxiliary segmentation result; if the second-ranked feature map is not the feature map with the largest scale, taking the second-ranked feature map as the first-ranked feature map and the feature map following it as the second-ranked feature map, and continuing the decoding step.
In some optional implementations of this embodiment, the image segmentation model further includes an interaction enhancement module, and the training unit 502 is further configured to: for each feature map in the first feature map set, input the feature maps of the same scale from the first feature map set and the second feature map set into the interaction enhancement module, concatenate them along the channel dimension, and then perform a non-local relation operation to obtain a fused feature map.
In some optional implementations of the present embodiment, the auxiliary decoder includes a lower-region decoder and a concave-convex decoder, and the second feature map set includes a lower-region feature map set and a concave-convex feature map set; the training unit 502 is further configured to: input the lower-region feature map and the concave-convex feature map of the same scale into the interaction enhancement module, convolve each of them and then perform a local relation operation to obtain a fused sub-feature map; and input each feature map in the first feature map set into the interaction enhancement module, concatenate it with the fused sub-feature map of the same scale, and then perform a non-local relation operation to obtain a fused feature map.
In some optional implementations of the present embodiment, the training unit 502 is further configured to: sort the feature maps in the fused feature map set in order of scale from small to large; and, starting from the first-ranked feature map in the fused feature map set, perform the following decoding step: convolving the first-ranked feature map and then up-sampling it to obtain a new feature map; convolving the new feature map to obtain a segmentation map and adding the segmentation map to a segmentation map set; concatenating the new feature map with the second-ranked feature map along the channel dimension, then convolving and up-sampling to obtain a second new feature map; convolving the second new feature map to obtain a segmentation map and adding it to the segmentation map set; if the second-ranked feature map is the feature map with the largest scale, taking the largest-scale segmentation map in the segmentation map set as the main segmentation result; if the second-ranked feature map is not the feature map with the largest scale, taking the second-ranked feature map as the first-ranked feature map and the feature map following it as the second-ranked feature map, and continuing the decoding step.
With continued reference to fig. 6, as an implementation of the method illustrated in fig. 4 described above, the present disclosure provides one embodiment of an image segmentation apparatus. The embodiment of the device corresponds to the embodiment of the method shown in fig. 4, and the device can be applied to various electronic devices.
As shown in fig. 6, the image segmentation apparatus 600 of the present embodiment may include: an acquisition unit 601 and a segmentation unit 602. The acquisition unit 601 is configured to acquire an image to be segmented; the segmentation unit 602 is configured to input the image into an image segmentation model trained according to the method of the first aspect, to obtain a segmentation result.
According to an embodiment of the disclosure, the disclosure further provides an electronic device, a readable storage medium.
An electronic device, comprising: one or more processors; storage means having stored thereon one or more computer programs which, when executed by the one or more processors, cause the one or more processors to implement the method described in flow 200 or 400.
A computer readable medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the method described in flow 200 or 400.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The calculation unit 701 performs the respective methods and processes described above, for example, a method of training an image segmentation model. For example, in some embodiments, the method of training the image segmentation model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into RAM 703 and executed by the computing unit 701, one or more steps of the method of training an image segmentation model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method of training the image segmentation model by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a server of a distributed system or a server that incorporates a blockchain. The server may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host employing artificial intelligence technology.
It should be appreciated that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (14)

1. A method of training an image segmentation model, comprising:
acquiring a sample set and an image segmentation model, wherein samples in the sample set comprise images and original labels, and the image segmentation model comprises a feature extraction network, an auxiliary decoder and a main decoder;
performing the following training step: selecting a sample from the sample set; inputting the image of the selected sample into the feature extraction network to obtain a first feature map set; inputting the first feature map set into the auxiliary decoder to obtain a second feature map set and an auxiliary segmentation result; fusing the first feature map set and the second feature map set to obtain a fused feature map set; inputting the fused feature map set into the main decoder to obtain a main segmentation result; performing a morphological operation on the original label of the selected sample to obtain an auxiliary label; calculating a total loss value based on a difference between the auxiliary label and the auxiliary segmentation result and a difference between the original label and the main segmentation result; and if the total loss value is smaller than a preset threshold value, determining that training of the image segmentation model is completed.
2. The method of claim 1, wherein the method further comprises:
and if the total loss value is greater than or equal to the preset threshold value, adjusting related parameters of the image segmentation model, and continuing to execute the training step based on the adjusted image segmentation model.
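By way of illustration only, a minimal PyTorch-style sketch of the training step in claims 1 and 2 is given below. The model interface (a module returning both an auxiliary and a main output), the cross-entropy losses, and the equal loss weighting are assumptions made for the sketch rather than requirements of the claims.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, image, original_label, aux_label,
                  threshold: float = 0.05):
    """One iteration of the training step in claims 1-2 (illustrative sketch)."""
    # Forward pass; the model is assumed to return both segmentation outputs.
    aux_logits, main_logits = model(image)

    # The claims only require "differences" between labels and results; the
    # specific loss functions below are assumed choices.
    aux_loss = F.binary_cross_entropy_with_logits(aux_logits, aux_label.float())
    main_loss = F.cross_entropy(main_logits, original_label)
    total_loss = aux_loss + main_loss

    if total_loss.item() < threshold:
        return total_loss.item(), True       # claim 1: training is completed

    optimizer.zero_grad()                    # claim 2: adjust related parameters
    total_loss.backward()
    optimizer.step()
    return total_loss.item(), False
```

In practice the returned flag would drive an outer loop that keeps drawing samples from the sample set until the flag indicates completion.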
3. The method of claim 1, wherein the auxiliary decoder comprises a low-area decoder and/or a concave-convex decoder.
4. The method according to claim 3, wherein performing the morphological operation on the original label of the selected sample to obtain the auxiliary label comprises:
subtracting the original label subjected to a morphological erosion operation from the original label to obtain a first sub-label;
subtracting the original label from the original label subjected to a morphological dilation operation to obtain a second sub-label;
and carrying out matrix connection of the first sub-label and the second sub-label along the channel dimension to obtain a low-area label as the auxiliary label.
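A possible NumPy/OpenCV realization of the low-area label of claim 4 is sketched below, assuming the original label is a binary 0/1 mask and using a 5x5 structuring element, which the claim does not specify.

```python
import cv2
import numpy as np

def low_area_label(original_label: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Build the low-area auxiliary label of claim 4 from a binary 0/1 mask."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    eroded = cv2.erode(original_label, kernel)
    dilated = cv2.dilate(original_label, kernel)

    first_sub_label = original_label - eroded    # inner band removed by erosion
    second_sub_label = dilated - original_label  # outer band added by dilation

    # Matrix connection along the channel dimension.
    return np.stack([first_sub_label, second_sub_label], axis=0)
```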
5. The method according to claim 3, wherein performing the morphological operation on the original label of the selected sample to obtain the auxiliary label comprises:
subtracting the original label subjected to a morphological opening operation from the original label to obtain a third sub-label;
subtracting the original label from the original label subjected to a morphological closing operation to obtain a fourth sub-label;
and carrying out matrix connection of the third sub-label and the fourth sub-label along the channel dimension to obtain a concave-convex label as the auxiliary label.
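Analogously, a sketch of the concave-convex label of claim 5, under the same assumptions about the mask format and the structuring element:

```python
import cv2
import numpy as np

def concave_convex_label(original_label: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Build the concave-convex auxiliary label of claim 5 from a binary 0/1 mask."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    opened = cv2.morphologyEx(original_label, cv2.MORPH_OPEN, kernel)
    closed = cv2.morphologyEx(original_label, cv2.MORPH_CLOSE, kernel)

    third_sub_label = original_label - opened   # thin convex details removed by opening
    fourth_sub_label = closed - original_label  # narrow concave gaps filled by closing
    return np.stack([third_sub_label, fourth_sub_label], axis=0)
```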
6. The method of claim 1, wherein inputting the first feature map set into the auxiliary decoder to obtain the second feature map set and the auxiliary segmentation result comprises:
sorting the feature maps in the first feature map set in order of scale from smallest to largest;
performing the following decoding step starting from the first-ranked feature map in the first feature map set: convolving and then upsampling the first-ranked feature map to obtain a first new feature map, and adding the first new feature map to the second feature map set; connecting the first new feature map and the second-ranked feature map in a matrix along the channel dimension, then convolving and upsampling to obtain a second new feature map, and adding the second new feature map to the second feature map set; if the second-ranked feature map is the feature map with the largest scale, performing a convolution mapping on the second new feature map to obtain the auxiliary segmentation result;
if the second-ranked feature map is not the feature map with the largest scale, taking the second-ranked feature map as the first-ranked feature map, taking the feature map following the second-ranked feature map as the second-ranked feature map, and continuing to execute the decoding step.
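The dataflow of the decoding step in claim 6 can be sketched as follows. The assumption that every feature map carries the same channel count, the 3x3 convolutions, and the bilinear upsampling are illustrative choices only; the claim fixes the order of operations, not these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryDecoder(nn.Module):
    """Sketch of the decoding loop in claim 6 (dataflow only)."""

    def __init__(self, channels: int, num_scales: int, out_channels: int = 2):
        super().__init__()
        self.first_conv = nn.Conv2d(channels, channels, 3, padding=1)
        # One fusion convolution per remaining scale; its input is a channel concat.
        self.fuse_convs = nn.ModuleList(
            [nn.Conv2d(2 * channels, channels, 3, padding=1)
             for _ in range(num_scales - 1)])
        self.head = nn.Conv2d(channels, out_channels, 1)  # convolution mapping

    def forward(self, first_set):
        # first_set: first feature map set, ordered from smallest to largest scale.
        second_set = []
        new = F.interpolate(self.first_conv(first_set[0]), scale_factor=2,
                            mode="bilinear", align_corners=False)
        second_set.append(new)
        for i, fuse in enumerate(self.fuse_convs, start=1):
            concat = torch.cat([new, first_set[i]], dim=1)  # matrix connection
            new = F.interpolate(fuse(concat), scale_factor=2,
                                mode="bilinear", align_corners=False)
            second_set.append(new)
        aux_result = self.head(new)  # auxiliary segmentation result
        return second_set, aux_result
```

With, for example, channels=64 and num_scales=4, first_set would hold four 64-channel maps whose spatial size doubles from one entry to the next.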
7. The method of claim 1, wherein the image segmentation model further comprises an interaction enhancement module; and
fusing the first feature map set and the second feature map set to obtain the fused feature map set comprises:
for each feature map in the first feature map set, inputting the feature map and the feature map with the same scale in the second feature map set into the interaction enhancement module, performing matrix connection, and then performing a non-local relation operation to obtain a fused feature map.
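One way to read the interaction enhancement module of claim 7 is as a standard non-local (self-attention) block applied after channel concatenation. The 1x1 projections and the embedded-Gaussian attention below follow the common non-local block design, which is an assumption; the claim only requires matrix connection followed by a non-local relation operation.

```python
import torch
import torch.nn as nn

class InteractionEnhancement(nn.Module):
    """Sketch of the interaction enhancement module of claim 7."""

    def __init__(self, channels: int):
        super().__init__()
        in_ch = 2 * channels                       # channels after matrix connection
        self.theta = nn.Conv2d(in_ch, channels, 1)
        self.phi = nn.Conv2d(in_ch, channels, 1)
        self.g = nn.Conv2d(in_ch, channels, 1)
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, encoder_map, aux_map):
        x = torch.cat([encoder_map, aux_map], dim=1)      # matrix connection
        b, _, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)      # (B, HW, C)
        k = self.phi(x).flatten(2)                        # (B, C, HW)
        v = self.g(x).flatten(2).transpose(1, 2)          # (B, HW, C)
        attn = torch.softmax(q @ k, dim=-1)               # pairwise (non-local) relations
        fused = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return self.out(fused) + encoder_map              # fused feature map
```

The residual addition of the encoder map keeps the original features available downstream, a common design choice that the claim neither requires nor excludes.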
8. The method of claim 7, wherein the auxiliary decoder comprises a low-area decoder and a concave-convex decoder, and the second feature map set comprises a low-area feature map set and a concave-convex feature map set; and
fusing the first feature map set and the second feature map set to obtain the fused feature map set comprises:
inputting the low-area feature map and the concave-convex feature map with the same scale into the interaction enhancement module, convolving each of them respectively, and then performing a local interrelation operation to obtain a fused sub-feature map;
and inputting each feature map in the first feature map set into the interaction enhancement module, performing matrix connection, and then performing a non-local relation operation to obtain a fused feature map.
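For claim 8 the additional step is the local interrelation of the two auxiliary branches before the non-local fusion. The element-wise interaction below is one loose interpretation of the "local interrelation operation", which the claim does not define precisely.

```python
import torch
import torch.nn as nn

class BranchInteraction(nn.Module):
    """Sketch of the first half of claim 8: fuse the same-scale low-area and
    concave-convex feature maps into a fused sub-feature map."""

    def __init__(self, channels: int):
        super().__init__()
        self.low_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.cc_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, low_map: torch.Tensor, cc_map: torch.Tensor) -> torch.Tensor:
        low = self.low_conv(low_map)    # convolve each branch respectively
        cc = self.cc_conv(cc_map)
        # Local interrelation: pixel-wise interaction of the two auxiliary branches.
        return low * cc + low + cc      # fused sub-feature map
```

The fused sub-feature map would then take the place of the single auxiliary map in the non-local sketch given after claim 7.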
9. The method of claim 1, wherein inputting the fused feature map set into the main decoder to obtain the main segmentation result comprises:
sorting the feature maps in the fused feature map set in order of scale from smallest to largest;
performing the following decoding step starting from the first-ranked feature map in the fused feature map set: convolving and then upsampling the first-ranked feature map to obtain a first new feature map; convolving the first new feature map to obtain a segmentation map and adding it to a segmentation map set; connecting the first new feature map and the second-ranked feature map in a matrix along the channel dimension, then convolving and upsampling to obtain a second new feature map; convolving the second new feature map to obtain a segmentation map and adding it to the segmentation map set; if the second-ranked feature map is the feature map with the largest scale, taking the segmentation map with the largest scale in the segmentation map set as the main segmentation result;
if the second-ranked feature map is not the feature map with the largest scale, taking the second-ranked feature map as the first-ranked feature map, taking the feature map following the second-ranked feature map as the second-ranked feature map, and continuing to execute the decoding step.
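The main decoding loop of claim 9 mirrors the auxiliary one but additionally emits a segmentation map at every stage. Channel counts, kernel sizes, and the bilinear upsampling are again assumptions made only for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MainDecoder(nn.Module):
    """Sketch of the main decoding loop in claim 9 (dataflow only)."""

    def __init__(self, channels: int, num_scales: int, num_classes: int = 2):
        super().__init__()
        self.first_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.fuse_convs = nn.ModuleList(
            [nn.Conv2d(2 * channels, channels, 3, padding=1)
             for _ in range(num_scales - 1)])
        # One segmentation head per stage, producing the segmentation map set.
        self.heads = nn.ModuleList(
            [nn.Conv2d(channels, num_classes, 1) for _ in range(num_scales)])

    def forward(self, fused_set):
        # fused_set: fused feature maps ordered from smallest to largest scale.
        seg_maps = []
        new = F.interpolate(self.first_conv(fused_set[0]), scale_factor=2,
                            mode="bilinear", align_corners=False)
        seg_maps.append(self.heads[0](new))
        for i, fuse in enumerate(self.fuse_convs, start=1):
            concat = torch.cat([new, fused_set[i]], dim=1)  # matrix connection
            new = F.interpolate(fuse(concat), scale_factor=2,
                                mode="bilinear", align_corners=False)
            seg_maps.append(self.heads[i](new))
        # The largest-scale segmentation map is the main segmentation result.
        return seg_maps[-1], seg_maps
```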
10. An image segmentation method, comprising:
acquiring an image to be segmented;
inputting the image into an image segmentation model trained according to the method of any one of claims 1-9 to obtain a segmentation result.
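Inference per claim 10 then reduces to a single forward pass through the trained model. The assumption that the model returns an auxiliary and a main output, as well as the argmax post-processing, follow the earlier sketches rather than the claim text.

```python
import torch

def segment_image(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Sketch of the inference path in claim 10."""
    model.eval()
    with torch.no_grad():
        _, main_logits = model(image.unsqueeze(0))  # add a batch dimension
    return main_logits.argmax(dim=1).squeeze(0)     # per-pixel class indices
```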
11. An apparatus for training an image segmentation model, comprising:
an acquisition unit configured to acquire a sample set and an image segmentation model, wherein samples in the sample set comprise an image and an original label, and the image segmentation model comprises a feature extraction network, an auxiliary decoder, and a main decoder;
a training unit configured to perform the following training step: selecting a sample from the sample set; inputting the image of the selected sample into the feature extraction network to obtain a first feature map set; inputting the first feature map set into the auxiliary decoder to obtain a second feature map set and an auxiliary segmentation result; fusing the first feature map set and the second feature map set to obtain a fused feature map set; inputting the fused feature map set into the main decoder to obtain a main segmentation result; performing a morphological operation on the original label of the selected sample to obtain an auxiliary label; calculating a total loss value based on a difference between the auxiliary label and the auxiliary segmentation result and a difference between the original label and the main segmentation result; and if the total loss value is smaller than a preset threshold value, determining that training of the image segmentation model is completed.
12. An image segmentation apparatus comprising:
an acquisition unit configured to acquire an image to be segmented;
a segmentation unit configured to input the image into an image segmentation model trained according to the method of any one of claims 1-9 to obtain a segmentation result.
13. An electronic device, comprising:
one or more processors;
a storage device having one or more computer programs stored thereon,
the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-10.
14. A computer readable medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-10.
CN202110948547.4A 2021-08-18 2021-08-18 Method and device for training image segmentation model and image segmentation Active CN113610856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110948547.4A CN113610856B (en) 2021-08-18 2021-08-18 Method and device for training image segmentation model and image segmentation

Publications (2)

Publication Number Publication Date
CN113610856A CN113610856A (en) 2021-11-05
CN113610856B true CN113610856B (en) 2023-11-07

Family

ID=78308902

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110948547.4A Active CN113610856B (en) 2021-08-18 2021-08-18 Method and device for training image segmentation model and image segmentation

Country Status (1)

Country Link
CN (1) CN113610856B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503694B (en) * 2023-06-28 2023-12-08 宁德时代新能源科技股份有限公司 Model training method, image segmentation device and computer equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018111940A1 (en) * 2016-12-12 2018-06-21 Danny Ziyi Chen Segmenting ultrasound images
US10740647B2 (en) * 2018-03-14 2020-08-11 Adobe Inc. Detecting objects using a weakly supervised model

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014113786A1 (en) * 2013-01-18 2014-07-24 H. Lee Moffitt Cancer Center And Research Institute, Inc. Quantitative predictors of tumor severity
WO2019238976A1 (en) * 2018-06-15 2019-12-19 Université de Liège Image classification using neural networks
CN111223102A (en) * 2018-11-23 2020-06-02 银河水滴科技(北京)有限公司 Image segmentation model training method, image segmentation method and device
CN111739027A (en) * 2020-07-24 2020-10-02 腾讯科技(深圳)有限公司 Image processing method, device and equipment and readable storage medium
CN112184714A (en) * 2020-11-10 2021-01-05 平安科技(深圳)有限公司 Image segmentation method, image segmentation device, electronic device, and medium
CN112861829A (en) * 2021-04-13 2021-05-28 山东大学 Water body extraction method and system based on deep convolutional neural network
CN113139974A (en) * 2021-04-13 2021-07-20 广东工业大学 Focus segmentation model training and application method based on semi-supervised learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
UNet optic disc segmentation fused with a residual attention mechanism; 侯向丹; 赵一浩; 刘洪普; 郭鸿湧; 于习欣; 丁梦园; Journal of Image and Graphics (09); full text *

Also Published As

Publication number Publication date
CN113610856A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
CN109508681B (en) Method and device for generating human body key point detection model
CN113657390B (en) Training method of text detection model and text detection method, device and equipment
CN108895981B (en) Three-dimensional measurement method, device, server and storage medium
CN112784778B (en) Method, apparatus, device and medium for generating model and identifying age and sex
CN111753961A (en) Model training method and device, and prediction method and device
CN113436100B (en) Method, apparatus, device, medium, and article for repairing video
CN113362314B (en) Medical image recognition method, recognition model training method and device
CN111967297A (en) Semantic segmentation method and device for image, electronic equipment and medium
CN112949767A (en) Sample image increment, image detection model training and image detection method
CN114792355B (en) Virtual image generation method and device, electronic equipment and storage medium
CN113538235A (en) Training method and device of image processing model, electronic equipment and storage medium
CN114648676A (en) Point cloud processing model training and point cloud instance segmentation method and device
CN114202648A (en) Text image correction method, training method, device, electronic device and medium
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN114266937A (en) Model training method, image processing method, device, equipment and storage medium
CN111815748B (en) Animation processing method and device, storage medium and electronic equipment
CN113516697A (en) Image registration method and device, electronic equipment and computer-readable storage medium
CN113592932A (en) Training method and device for deep completion network, electronic equipment and storage medium
CN110956131A (en) Single-target tracking method, device and system
CN110717405A (en) Face feature point positioning method, device, medium and electronic equipment
CN114494782B (en) Image processing method, model training method, related device and electronic equipment
CN113139463B (en) Method, apparatus, device, medium and program product for training a model
CN113343979B (en) Method, apparatus, device, medium and program product for training a model
CN117333487B (en) Acne classification method, device, equipment and storage medium
CN117372261B (en) Resolution reconstruction method, device, equipment and medium based on convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant