CN114066899A - Image segmentation model training method, image segmentation device, image segmentation equipment and image segmentation medium

Info

Publication number
CN114066899A
Authority
CN
China
Prior art keywords
branch
feature extraction
module
image
network
Prior art date
Legal status
Pending
Application number
CN202111333309.9A
Other languages
Chinese (zh)
Inventor
丁宁
李南
张晓光
夏轩
马琳
潘喜洲
何星
张爱东
Current Assignee
Shenzhen Institute of Artificial Intelligence and Robotics
Original Assignee
Shenzhen Institute of Artificial Intelligence and Robotics
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Artificial Intelligence and Robotics
Priority to CN202111333309.9A
Publication of CN114066899A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning

Abstract

The application discloses an image segmentation model training method, an image segmentation method and apparatus, an electronic device and a computer-readable storage medium. The method includes: acquiring a training image and its corresponding label; inputting the training image into an initial model to obtain a segmentation result; obtaining a loss value from the segmentation result and the label, and adjusting the parameters of the initial model with the loss value; and, when it is detected that a training completion condition is satisfied, determining the parameter-adjusted initial model as the image segmentation model. The method trains an initial model with a binary tree-shaped feature fusion structure to obtain the image segmentation model. In this model, branch networks at adjacent feature extraction depths are connected by the binary tree-shaped feature fusion structure: the outputs of feature extraction operations at the same stage on adjacent branch networks are fused, so that feature information on deeper branch networks is continuously propagated to shallower branch networks. This improves the fineness of the image segmentation model while keeping the algorithm complexity low.

Description

Image segmentation model training method, image segmentation device, image segmentation equipment and image segmentation medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image segmentation model training method, an image segmentation model training apparatus, an electronic device, and a computer-readable storage medium.
Background
Segmenting the target (or foreground) from the background in an image is a classic task in computer vision: it is the first step of most image analysis and understanding pipelines and remains one of the most difficult problems in image processing. Image target segmentation techniques are widely used in production and daily-life scenarios such as autonomous driving, industrial production, precision agriculture, mobile robots and image editing. Image target segmentation methods based on digital image processing and traditional machine learning require relatively little computation, but their segmentation quality, robustness and generalization are insufficient. The current mainstream approach is segmentation based on deep learning, whose segmentation quality and generalization are significantly better than those of traditional methods. However, current deep-learning-based segmentation methods have two drawbacks: (1) the segmentation of fine structures (such as object edges) is not precise enough; and (2) the computational complexity of the algorithms is high, which makes them unsuitable for deployment and real-time inference on mobile platforms with limited computing power, such as robots, intelligent vehicles, unmanned aerial vehicles and mobile phones.
Therefore, the problems of poor segmentation effect and high computational complexity in the related art are technical problems to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, an object of the present application is to provide an image segmentation model training method, an image segmentation model training apparatus, an electronic device and a computer-readable storage medium that can improve segmentation fineness and reduce algorithm complexity.
In order to solve the above technical problem, the present application provides an image segmentation model training method, including:
acquiring a training image and a corresponding label;
inputting the training image into an initial model to obtain a segmentation result;
obtaining a loss value by using the segmentation result and the label, and performing parameter adjustment on the initial model by using the loss value;
if it is detected that a training completion condition is satisfied, determining the parameter-adjusted initial model as an image segmentation model;
the initial model comprises a backbone network and a plurality of branch networks, and the backbone network comprises a plurality of trunk feature extraction modules connected in series; each trunk feature extraction module corresponds to one branch network, each branch network has a plurality of branch feature extraction modules connected in series, and each trunk feature extraction module and its corresponding branch network correspond to a different feature extraction depth; the number of branch feature extraction modules in a branch network with a larger feature extraction depth is not greater than the number of branch feature extraction modules in a branch network with a smaller feature extraction depth;
the input data of a second target module in a second branch network is composed of the second output data of the preceding module of the second target module and the first output data of the first target module, in the first branch network, corresponding to the position of the preceding module, and the feature extraction depth of the second branch network is the feature extraction depth of the first branch network minus one.
Optionally, the inputting the training image into an initial model to obtain a segmentation result includes:
inputting the training image into the backbone network, and performing feature extraction on the input trunk input data by using each trunk feature extraction module to obtain trunk output data;
respectively inputting each trunk output data into the corresponding branch network, and respectively utilizing a target branch module in each branch network to perform feature extraction on the input branch input data to obtain branch output data;
inputting the branch output data into a subsequent branch module corresponding to the target branch module and an adjacent branch module corresponding to the subsequent branch module in the adjacent branch network;
respectively generating first segmentation data by utilizing tail branch modules in the branch networks;
and performing result fusion processing on each first segmentation data to obtain the segmentation result.
Optionally, the performing result fusion processing on each piece of the first segmentation data to obtain the segmentation result includes:
performing up-sampling processing or deconvolution processing on first segmentation data whose image size differs from that of the training image to obtain second segmentation data;
and performing fusion processing on the second segmentation data to obtain the segmentation result.
Optionally, the obtaining a loss value by using the segmentation result and the label includes:
calculating to obtain a cross entropy loss value and a polarization distribution loss value by using the segmentation result and the label;
carrying out weighted summation processing on the cross entropy loss value and the polarization distribution loss value to obtain the loss value;
wherein the generation process of the polarization distribution loss value includes the following steps:
determining a background pixel proportion and a foreground pixel proportion by using the label;
calculating a pixel loss value corresponding to a target pixel with a predicted value in a preset middle interval in the segmentation result by using the background pixel proportion and the foreground pixel proportion;
and calculating the average value of all the pixel loss values to obtain the polarization distribution loss value.
Optionally, the determining the background pixel proportion and the foreground pixel proportion by using the label includes:
determining the background pixel proportion and the foreground pixel proportion according to
ω_b = n_o / (n_b + n_o),  ω_o = n_b / (n_b + n_o),
wherein n_b and n_o are the numbers of pixels in the label belonging to the background and the foreground, respectively, ω_b is the foreground pixel proportion, and ω_o is the background pixel proportion;
correspondingly, calculating a pixel loss value corresponding to a target pixel with a predicted value in a preset intermediate interval in the segmentation result by using the background pixel proportion and the foreground pixel proportion, including:
using the formula of Figure BDA0003349594240000032 to obtain the pixel loss value; wherein L_b^(i) is the pixel loss value of the i-th pixel in the input image, z_i is the label value corresponding to the i-th pixel in the input image, z_i = 0 denotes a background pixel and z_i = 1 denotes a foreground pixel; y_i is the predicted value, y_i ∈ (k1, k2), k1 and k2 are the lower limit and the upper limit of the preset intermediate interval, k1, k2 ∈ (0, 1), and k1 < k2;
correspondingly, calculating the average value of all the pixel loss values to obtain the polarization distribution loss value includes:
computing
L_b = (1/n) · Σ_{i=1..n} L_b^(i)
to obtain the polarization distribution loss value, where n denotes the number of target pixels and L_b denotes the polarization distribution loss value.
Optionally, the trunk feature extraction module and/or the branch feature extraction module has a lightweight convolution module and/or an attention module.
The application also provides an image segmentation method, which comprises the following steps:
acquiring an image to be segmented;
inputting the image to be segmented into an image segmentation model to obtain a segmentation result;
wherein the image segmentation model is generated according to the image segmentation model training method of any one of claims 1 to 7.
The application also provides an image segmentation model training device, including:
the training acquisition module is used for acquiring a training image and a corresponding label;
the input module is used for inputting the training image into an initial model to obtain a segmentation result;
the parameter adjusting module is used for obtaining a loss value by utilizing the segmentation result and the label and carrying out parameter adjustment on the initial model by utilizing the loss value;
the model determination module is used for determining the parameter-adjusted initial model as an image segmentation model if it is detected that the training completion condition is satisfied;
the initial model comprises a backbone network and a plurality of branch networks, and the backbone network comprises a plurality of trunk feature extraction modules connected in series; each trunk feature extraction module corresponds to one branch network, each branch network has a plurality of branch feature extraction modules connected in series, and each trunk feature extraction module and its corresponding branch network correspond to a different feature extraction depth; the number of branch feature extraction modules in a branch network with a larger feature extraction depth is not greater than the number of branch feature extraction modules in a branch network with a smaller feature extraction depth;
the input data of a second target module in a second branch network is composed of the second output data of the preceding module of the second target module and the first output data of the first target module, in the first branch network, corresponding to the position of the preceding module, and the feature extraction depth of the second branch network is the feature extraction depth of the first branch network minus one.
The present application further provides an electronic device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the above-mentioned image segmentation model training method and/or the above-mentioned image segmentation method.
The present application further provides a computer-readable storage medium for storing a computer program, wherein the computer program is executed by a processor to implement the above-mentioned image segmentation model training method, and/or the above-mentioned image segmentation method.
The image segmentation model training method provided by the present application acquires a training image and a corresponding label; inputs the training image into an initial model to obtain a segmentation result; obtains a loss value from the segmentation result and the label, and adjusts the parameters of the initial model with the loss value; and, if it is detected that the training completion condition is satisfied, determines the parameter-adjusted initial model as the image segmentation model. The initial model comprises a backbone network and a plurality of branch networks, and the backbone network comprises a plurality of trunk feature extraction modules connected in series; each trunk feature extraction module corresponds to one branch network, each branch network has a plurality of branch feature extraction modules connected in series, and each trunk feature extraction module and its corresponding branch network correspond to a different feature extraction depth; the number of branch feature extraction modules in a branch network with a larger feature extraction depth is not greater than that in a branch network with a smaller feature extraction depth. The input data of a second target module in a second branch network is composed of the second output data of the preceding module of the second target module and the first output data of the first target module, in the first branch network, corresponding to the position of the preceding module, and the feature extraction depth of the second branch network is the feature extraction depth of the first branch network minus one.
It can thus be seen that the method trains an initial model with a binary tree-shaped feature fusion structure to obtain the image segmentation model. In this model, branch networks at adjacent feature extraction depths are connected by the binary tree-shaped feature fusion structure, which is embodied as follows: the input data of a second target module in the second branch network is composed of the second output data of the preceding module of the second target module and the first output data of the first target module, in the first branch network, corresponding to the position of the preceding module, and the feature extraction depth of the second branch network is the feature extraction depth of the first branch network minus one. Through this structure, the outputs of feature extraction operations at the same stage on adjacent branch networks are fused, so that feature information on deeper branch networks is continuously propagated to shallower branch networks, features at different scales are fully fused, and the fineness of the image segmentation model is improved. In addition, the image segmentation model as a whole has a simple structure and low algorithmic complexity; it improves segmentation fineness while reducing algorithm complexity, and is therefore suitable for deployment and real-time inference on mobile platforms with limited computing power, such as robots, intelligent vehicles, unmanned aerial vehicles and mobile phones.
In addition, the application also provides an image segmentation model training device, electronic equipment and a computer readable storage medium, and the beneficial effects are also achieved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application, and that those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of an image segmentation model training method according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a specific image segmentation model provided in an embodiment of the present application;
fig. 3 is a specific data processing flow chart provided in the embodiment of the present application;
FIG. 4 is a flowchart of a model training and application provided by an embodiment of the present application;
fig. 5 is a schematic diagram of a test image, a label and a corresponding segmentation result provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of an image segmentation model training apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. It is obvious that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a flowchart of an image segmentation model training method according to an embodiment of the present disclosure. The method comprises the following steps:
s101: a training image and corresponding label are acquired.
The training image is an image used for training the initial model, and has a corresponding label after being labeled. The labeling process may be a manual labeling process, or may be a process of labeling the training image using a neural network or the like. The labeling process may be completed in advance, or may be performed after a training image for training the initial model is acquired, so as to obtain a label.
S102: and inputting the training image into the initial model to obtain a segmentation result.
In this embodiment, branch networks at adjacent feature extraction depths in the initial model are connected by a binary tree-shaped feature fusion structure. Specifically, the initial model includes a backbone network and a plurality of branch networks. The backbone network includes a plurality of trunk feature extraction modules connected in series, where "connected in series" means that the output data of the previous trunk feature extraction module is the input data of the next trunk feature extraction module. In addition, each trunk feature extraction module corresponds to one branch network; specifically, the output data of the trunk feature extraction module is the input data of that branch network. Each branch network has a plurality of branch feature extraction modules connected in series, in the same manner as the trunk feature extraction modules. The trunk feature extraction modules sequentially extract features of the input training image at increasing depths, and the data after each feature extraction is processed by the corresponding branch network, so each trunk feature extraction module and its corresponding branch network correspond to a particular feature extraction depth. That is, different trunk feature extraction modules correspond to different feature extraction depths, and the feature extraction depth of a branch network is the same as that of its corresponding trunk feature extraction module. Moreover, to ensure the feature extraction effect, the number of feature extraction stages that the same training image undergoes from the input layer of the backbone network to the output layer of each branch network should be substantially the same. For a branch network with a larger feature extraction depth, the corresponding trunk feature extraction module is preceded by more trunk feature extraction modules, so its input data has already undergone more feature extraction operations in the backbone network; the number of feature extraction operations performed by such a branch network should therefore not exceed that of a branch network with a smaller feature extraction depth. For this reason, in the present application, the number of branch feature extraction modules in a branch network with a larger feature extraction depth is not greater than the number of branch feature extraction modules in a branch network with a smaller feature extraction depth.
In addition, the present application designs a special binary tree-shaped feature fusion structure, which is expressed as follows: the input data of a second target module in a second branch network is composed of the second output data of the preceding module of the second target module and the first output data of the first target module, in the first branch network, whose position corresponds to that of the preceding module, and the feature extraction depth of the second branch network is the feature extraction depth of the first branch network minus one. The first branch network and the second branch network may be any two branch networks satisfying this feature extraction depth condition; the branch network with the minimum feature extraction depth cannot serve as the first branch network and, correspondingly, the branch network with the maximum feature extraction depth cannot serve as the second branch network. The second target module refers to a branch feature extraction module that obtains data from two other branch feature extraction modules as its own input data; referring to fig. 2, ConvS4_2, ConvS3_3 and ConvS2_2 are all second target modules. The preceding module is the branch feature extraction module that is adjacent to the second target module and outputs data to it. Position correspondence means that, counting from the input end to the output end of a branch network, the sequence number of the first target module in the first branch network is the same as the sequence number of the preceding module in the second branch network. That is, in the branch network with the smaller feature extraction depth, the input data of a second target module includes not only the output data of its preceding module in the same branch network (i.e., the second output data) but also information derived from the output data of the first target module at the corresponding position in the branch network with the larger feature extraction depth. In this way, the outputs of feature extraction operations at the same stage on adjacent branch networks are fused, so that feature information on deeper branch networks is continuously propagated to shallower branch networks, features at different scales are fully fused, and the fineness of the image segmentation model is improved.
Specifically, referring to fig. 2, fig. 2 is a schematic structural diagram of a specific image segmentation model according to an embodiment of the present application. ConvB1, ConvB2, ConvB3, ConvB4 and ConvB5 are the five trunk feature extraction modules constituting the backbone network, where ConvB1 has the smallest feature extraction depth and ConvB5 the largest. Each trunk feature extraction module corresponds to a different branch network. It can be seen that a branch network with a larger feature extraction depth has one branch feature extraction module fewer than the branch network with the next smaller feature extraction depth, so that the same image undergoes exactly the same number of feature extraction operations from its input at ConvB1 of the backbone network to its output from each branch network.
Specifically, the branch network corresponding to the trunk feature extraction module ConvB1 is composed of six branch feature extraction modules, namely ConvS1_1, ConvS1_2, ConvS1_3, ConvS1_4, ConvS1_5 and ConvS1_6, so the data output by this branch network has undergone one feature extraction by ConvB1 and six feature extractions by the six branch feature extraction modules, i.e., 7 feature extractions in total. Correspondingly, the branch network corresponding to ConvB2 is composed of five branch feature extraction modules, namely ConvS2_1, ConvS2_2, ConvS2_3, ConvS2_4 and ConvS2_5, so the data output by this branch network has undergone two feature extractions by ConvB1 and ConvB2 and five feature extractions by the five branch feature extraction modules, i.e., again 7 feature extractions in total. In other embodiments, the number of feature extraction operations that the same image undergoes from its input at ConvB1 to its output from the respective branch networks need not be exactly identical. For example, in order to further increase the data processing speed and reduce the model complexity, the branch feature extraction module ConvS1_6 may be removed; in that case, the branch network corresponding to ConvB1 has the same number of branch feature extraction modules as the branch network corresponding to ConvB2, and the numbers of feature extraction operations along the two paths remain substantially the same.
On this basis, with continued reference to fig. 2, it can be seen that for ConvS1_2, ConvS1_3, ConvS1_4 and ConvS1_5, and for the similar branch feature extraction modules in the other branch networks (i.e., those that are neither the first nor the last in their branch), the input data is derived from two portions of data. Taking ConvS1_2 as an example, its input data includes not only the output data of ConvS1_1 but also the output data of ConvS2_1 after an up-sampling operation. Therefore, when ConvS1_2 performs its convolution processing, the outputs of the feature extraction operations at the same stage on the adjacent branch networks are fused, so that feature information on the deeper branch network is continuously propagated to the shallower branch network, features at different scales are fully fused, and the fineness of the image segmentation model is improved.
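For illustration only, the binary tree-shaped fusion described above can be sketched in PyTorch as follows. The module classes, channel widths and the number of stages are assumptions made for readability rather than the exact implementation of this application: each trunk module feeds both the next trunk module and its own branch, branches are processed from deep to shallow, and every non-first module of a non-deepest branch fuses its predecessor's output with the up-sampled output of the deeper branch's module at the preceding position.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvUnit(nn.Module):
    """Stand-in for a trunk (ConvB*) or branch (ConvS*) feature extraction module."""
    def __init__(self, in_ch, out_ch, downsample=False):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2 if downsample else 1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)


class BinaryTreeSegNet(nn.Module):
    def __init__(self, ch=16, depths=5):
        super().__init__()
        # Backbone: serially connected trunk modules (ConvB1..ConvB5), each downsampling by 2.
        self.trunk = nn.ModuleList(
            [ConvUnit(3 if d == 0 else ch, ch, downsample=True) for d in range(depths)]
        )
        # Branch at depth d has (depths + 1 - d) modules, so deeper branches are shorter.
        self.branches = nn.ModuleList()
        for d in range(depths):
            mods = nn.ModuleList()
            for k in range(depths + 1 - d):
                fuses = k > 0 and d < depths - 1   # non-first modules of non-deepest branches fuse two inputs
                mods.append(ConvUnit(2 * ch if fuses else ch, ch))
            self.branches.append(mods)
        # 1x1 heads turn each tail module's output into first segmentation data.
        self.heads = nn.ModuleList([nn.Conv2d(ch, 1, 1) for _ in range(depths)])

    def forward(self, x):
        # 1. Trunk outputs at increasing feature extraction depth.
        trunk_out, t = [], x
        for block in self.trunk:
            t = block(t)
            trunk_out.append(t)

        # 2. Process branches from the deepest to the shallowest so that the deeper
        #    branch's outputs are available for fusion.
        branch_feats = [None] * len(self.branches)
        for d in reversed(range(len(self.branches))):
            deeper = branch_feats[d + 1] if d + 1 < len(self.branches) else None
            feats, h = [], trunk_out[d]
            for k, mod in enumerate(self.branches[d]):
                if k > 0 and deeper is not None:
                    # Fuse the preceding module's output with the up-sampled output of
                    # the deeper branch's module at position k-1.
                    up = F.interpolate(deeper[k - 1], size=h.shape[-2:],
                                       mode="bilinear", align_corners=False)
                    h = torch.cat([h, up], dim=1)
                h = mod(h)
                feats.append(h)
            branch_feats[d] = feats

        # 3. Each branch's tail module yields first segmentation data (one logit map per branch).
        return [head(feats[-1]) for head, feats in zip(self.heads, branch_feats)]
```

Under these assumptions, `BinaryTreeSegNet()(torch.randn(1, 3, 400, 300))` returns one single-channel map per branch, which plays the role of the first segmentation data described below.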
The initial model refers to a model whose parameter adjustment is not yet complete; after parameter adjustment is complete, the initial model can be determined as the image segmentation model. Therefore, after the training image is obtained, it is input into the initial model to obtain the corresponding segmentation result, and the parameters can then be adjusted according to the segmentation result. In one embodiment, the process of S102 may further include the following steps:
step 11: and inputting the training image into a backbone network, and respectively utilizing each backbone feature extraction network to perform feature extraction on the input backbone input data to obtain backbone output data.
Step 12: and respectively inputting each trunk output data into the corresponding branch network, and respectively utilizing the target branch module in each branch network to perform feature extraction on the input branch input data to obtain branch output data.
Step 13: and inputting the branch output data into a subsequent branch module corresponding to the target branch module and an adjacent branch module corresponding to the subsequent branch module in the adjacent branch network.
Step 14: and respectively generating first segmentation data by utilizing tail branch modules in all branch networks.
Step 15: and performing result fusion processing on each first segmentation data to obtain a segmentation result.
The target branch module and the tail branch module are both branch feature extraction modules; the difference between them is that a target branch module outputs its data to two other branch feature extraction modules, whereas a tail branch module outputs only one piece of data. It should be noted that a branch feature extraction module may be a target branch module, such as ConvS2_1 in fig. 2; it may be a second target module as described above, such as ConvS1_5 in fig. 2; it may be neither a target branch module nor a second target module, such as ConvS1_1 and the tail branch modules in fig. 2; or it may be both a target branch module and a second target module. Specifically, when a branch feature extraction module obtains data from two other branch feature extraction modules as its input and also outputs data to two other branch feature extraction modules, it is both a target branch module and a second target module, such as ConvS2_2, ConvS2_3 and ConvS2_4 in fig. 2.
After the training image is input into the backbone network, each trunk feature extraction module performs feature extraction on it at progressively greater depth. Each trunk feature extraction module obtains its own trunk output data after processing its input data, and this trunk output data serves as the input data of the next trunk feature extraction module in the serial order. Besides sending its trunk output data to the next trunk feature extraction module, each trunk feature extraction module also sends it to its corresponding branch network, which contains target branch modules and a tail branch module. A target branch module performs feature extraction on its input branch input data and obtains branch output data, which may be input into the tail branch module or into the next target branch module; that is, the subsequent branch module may be either a target branch module or the tail branch module. In addition, the branch output data is also output to the adjacent branch module corresponding to the position of the subsequent branch module, where the adjacent branch module refers to the branch feature extraction module at the corresponding position in the branch network whose feature extraction depth is smaller by one. For example, for ConvS2_1, the corresponding subsequent branch module is ConvS2_2 and the adjacent branch module is ConvS1_2.
Finally, the tail branch module of each branch network processes its input data to obtain first segmentation data for output, and the first segmentation data of all branch networks are fused to obtain the segmentation result. Specifically, the first segmentation data may be processed by an activation function layer to obtain the image segmentation result of each branch network; these results are weighted and averaged and then processed by an activation function layer again to obtain comprehensive data. The comprehensive data is represented as a heat map with one channel and pixel values distributed in [0, 1], and a binary segmentation result is obtained by setting a threshold and performing a binarization operation on the heat map.
This embodiment does not limit the specific type of the activation function; for example, a two-class sigmoid function or a multi-class softmax function may be selected.
Specifically, in one embodiment, each trunk feature extraction module includes a combination of several convolution layers, activation layers and batch normalization layers, and at the end of its feature extraction stage each trunk feature extraction module performs down-sampling on the feature map by a pooling operation or a convolution with a sliding stride of 2, so that the length and width of the feature map become half of the original while the number of channels increases, thereby realizing a progressively more abstract representation of the features. For example, the original training image may be a 512 × 384 RGB image, where w × h × c denotes length × width × number of channels. Alternatively, referring to fig. 2, the input of ConvB1 is a 400 × 300 × 3 training image.
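As a hedged illustration of such a trunk module, the block below stacks convolution, batch normalization and activation layers and ends with a stride-2 max pooling that halves the spatial size; the layer count and channel widths are assumptions, not values taken from this application.

```python
import torch.nn as nn

def trunk_block(in_ch, out_ch, n_convs=2):
    """Hypothetical ConvB-style trunk feature extraction module."""
    layers = []
    for i in range(n_convs):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halves the length and width of the feature map
    return nn.Sequential(*layers)

# e.g. a 400 x 300 x 3 input yields a 200 x 150 feature map with more channels
conv_b1 = trunk_block(3, 32)
```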
It can be understood that, as shown above, the sizes of the input data on the different branch networks differ. Therefore, when data is transferred between branch networks to form the binary tree-shaped feature fusion structure, the output data of the branch feature extraction module with the larger feature extraction depth must be up-sampled or deconvolved so that its size matches that of the output data of the branch feature extraction module with the smaller feature extraction depth (that is, the feature maps have the same resolution), and only then is the fusion performed. For example, when ConvS2_1 transfers data to the shallower ConvS1_2, its output is first up-sampled or deconvolved, and the processed data is fused with the output data of ConvS1_1 to obtain the input data of ConvS1_2. Likewise, when the segmentation result is output, the first segmentation data at the various resolutions need to be adjusted to a consistent resolution. Specifically, the process of performing result fusion processing on each piece of first segmentation data to obtain the segmentation result may include the following steps:
step 21: and performing up-sampling processing or deconvolution processing on first segmentation data with the image size different from that of the training image to obtain second segmentation data.
Step 22: and performing fusion processing on the second segmentation data to obtain a segmentation result.
The up-sampling processing and the deconvolution processing both increase the resolution of the first segmentation data. Referring to fig. 2, the first segmentation data of different sizes are up-sampled or deconvolved and then merged to obtain a segmentation result of size 400 × 300 × 1. This embodiment does not limit the specific fusion manner; for example, the fusion may be performed by feature map concatenation or element-wise summation.
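A minimal sketch of this result fusion, under the assumption of a sigmoid activation, equal fusion weights and a simple threshold (the second activation pass mentioned above is omitted here for brevity):

```python
import torch
import torch.nn.functional as F

def fuse_side_outputs(side_maps, out_size, weights=None, threshold=0.5):
    """Resize each piece of first segmentation data to the training-image resolution,
    convert it to a probability map, weighted-average the maps and binarize them."""
    if weights is None:
        weights = [1.0 / len(side_maps)] * len(side_maps)
    fused = 0.0
    for w, m in zip(weights, side_maps):                  # m: (B, 1, h, w) logit map
        if m.shape[-2:] != tuple(out_size):
            m = F.interpolate(m, size=tuple(out_size), mode="bilinear", align_corners=False)
        fused = fused + w * torch.sigmoid(m)              # per-branch segmentation map in [0, 1]
    return (fused > threshold).float()                    # binary segmentation result
```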
Further, in order to make the network structure more lightweight while reducing the computational and spatial complexity of the network, the trunk feature extraction module and/or the branch feature extraction module may include a lightweight convolution module and/or an attention module. The attention module may be a plug-and-play attention module (such as a Squeeze-and-Excitation module), which further improves the network's ability to represent image features.
Referring to fig. 3, fig. 3 is a specific data processing flow chart applied to a branch feature extraction module according to an embodiment of the present application. The feature map of the shallower branch network (i.e., the second output data generated by the preceding module of this branch feature extraction module) and the up-sampled feature map of the deeper branch network (i.e., the first output data of the first target module, in the first branch network, corresponding to the position of the preceding module) are concatenated and passed through the attention module to obtain an attention-weighted feature map, which is then subjected to a convolution operation to obtain the output feature map of this branch feature extraction module.
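The fig. 3 flow can be sketched as follows; the Squeeze-and-Excitation block and the single 3 × 3 convolution are assumptions used to make the sketch concrete, not the exact modules of this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEAttention(nn.Module):
    """Plug-and-play Squeeze-and-Excitation channel attention (assumed reduction=4)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))          # squeeze: global average pooling
        return x * w[:, :, None, None]           # excite: per-channel reweighting

class BranchFusionModule(nn.Module):
    """Hypothetical ConvS-style module following the fig. 3 flow: up-sample the deeper
    branch's feature map, concatenate it with the preceding module's output,
    apply attention, then convolve."""
    def __init__(self, channels):
        super().__init__()
        self.att = SEAttention(2 * channels)
        self.conv = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, shallow_feat, deep_feat):
        deep_up = F.interpolate(deep_feat, size=shallow_feat.shape[-2:],
                                mode="bilinear", align_corners=False)
        fused = torch.cat([shallow_feat, deep_up], dim=1)  # splice the two feature maps
        return self.conv(self.att(fused))                  # attention, then convolution
```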
S103: and obtaining a loss value by using the segmentation result and the label, and carrying out parameter adjustment on the initial model by using the loss value.
After the segmentation result is obtained, a loss value can be calculated from the segmentation result and the label corresponding to the training image, and the loss value is used for parameter adjustment. This embodiment does not limit the specific way in which the loss value is generated; for example, it may be a cross-entropy loss value, another type of loss value, or a weighted combination of several loss values. In one embodiment, in order to improve the network's segmentation of edge pixels and its segmentation accuracy on fine structures, a special polarization distribution loss function may be used in the loss value calculation. In this case, the loss value calculation process may include the following steps:
step 31: and calculating to obtain a cross entropy loss value and a polarization distribution loss value by using the segmentation result and the label.
Step 32: and carrying out weighted summation processing on the cross entropy loss value and the polarization distribution loss value to obtain a loss value.
In one embodiment, the cross-entropy loss value includes a sub-loss value for each branch network and a fusion loss value for the entire network. Specifically, taking fig. 2 as an example, the sub-loss value corresponding to each branch network can be obtained by
l_side^(m)(W, w^(m)) = -Σ_j [ z_j · log Pr(z_j = 1 | X; W, w^(m)) + (1 - z_j) · log Pr(z_j = 0 | X; W, w^(m)) ],
where l_side^(m)(W, w^(m)) denotes the sub-loss value calculated on the segmentation result output by the m-th branch network, W denotes the trainable parameter matrix of the backbone network, w^(m) denotes the trainable parameter matrix of the m-th branch network, X denotes the training image, Z = {z_j, j = 1, 2, …, |Z|} denotes the label corresponding to the training image with z_j denoting the j-th pixel of the training image, and Pr(z_j = 1 | X; W, w^(m)) denotes the probability that z_j is predicted as the target (i.e., foreground) class. The total loss over all branch networks is then
L_side(W, w) = Σ_{m=1}^{M} α_m · l_side^(m)(W, w^(m)),
where α_m is an adjustable weighting coefficient, which may be referred to as a first weighting coefficient. On this basis, the fusion loss value of the whole network is
L_fuse(W, w) = σ(Z, h(Σ_{m=1}^{M} λ_m · A_side^(m))),
where λ_m is a second weighting coefficient, A_side^(m) is the output feature map of the m-th branch network before the final activation function layer (in this embodiment the activation function is a sigmoid function, since a polarization distribution loss value is adopted), h denotes the activation function, and σ denotes a cross-entropy loss function of the same form as the sub-loss value. Accordingly, the cross-entropy loss in this embodiment is the sum of L_side and L_fuse.
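The side losses and the fusion loss above can be written compactly as in the following sketch, which assumes that every A_side^(m) is a single-channel logit map already resized to the label resolution and that the weights α_m and λ_m are supplied by the caller:

```python
import torch
import torch.nn.functional as F

def side_and_fuse_loss(side_logits, label, alphas, lambdas):
    """Sketch of L_side and L_fuse.  side_logits: list of (B, 1, H, W) maps A_side^(m),
    already resized to the label resolution; label: (B, 1, H, W) float tensor in {0, 1}."""
    # L_side: weighted sum of per-branch cross-entropy losses.
    l_side = sum(a * F.binary_cross_entropy_with_logits(s, label)
                 for a, s in zip(alphas, side_logits))
    # L_fuse: cross entropy on the weighted combination of the side feature maps;
    # the activation h (sigmoid) is applied inside the BCE-with-logits call.
    fused = sum(l * s for l, s in zip(lambdas, side_logits))
    l_fuse = F.binary_cross_entropy_with_logits(fused, label)
    return l_side + l_fuse
```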
Wherein, the generation process of the polarization distribution loss value comprises the following steps:
step 33: the labels are used to determine the background pixel proportion and the foreground pixel proportion.
Step 34: and calculating a pixel loss value corresponding to a target pixel with a predicted value in a preset intermediate interval in the segmentation result by using the background pixel proportion and the foreground pixel proportion.
Step 35: and calculating the average value of all pixel loss values to obtain the polarization distribution loss value.
The polarization distribution loss value highlights the background when the foreground is dominant and highlights the foreground when the background is dominant. A target pixel whose predicted value lies in the preset intermediate interval is regarded as a pixel that is difficult to classify as foreground or background, and an additional loss is computed for such pixels so as to improve the model's ability to classify them. Specifically, the process of determining the background pixel proportion and the foreground pixel proportion by using the label includes the following steps:
step 41: by using
Figure BDA0003349594240000131
Obtaining the background pixel proportion and the foreground pixel proportion, wherein nbAnd noThe number of pixels in the label, omega, belonging to the background and foreground, respectivelybIs the foreground pixel proportion, omegaoIs the background pixel proportion;
correspondingly, the process of calculating the pixel loss value corresponding to the target pixel with the predicted value in the preset intermediate interval in the segmentation result by using the background pixel proportion and the foreground pixel proportion comprises the following steps:
step 42: by using
Figure BDA0003349594240000132
Obtaining a pixel loss value; wherein L isb (i)Is the pixel loss value of the ith pixel in the input image, ziFor the label value, z, corresponding to the ith pixel point in the input imagei0 denotes a background pixel, z i1 denotes a foreground pixel; y isiTo predict value, yi∈(k1,k2),k1And k2For presetting lower and upper limits, k, of intermediate intervals1、k2E (0,1), and k1<k2. In the present embodiment, k is not limited1And k2The size of (b) may be set as required, and may be, for example, 0.3 and 0.7, or may be 0.2 and 0.8.
Correspondingly, the process of calculating the average value of all pixel loss values to obtain the polarization distribution loss value comprises the following steps:
step 43: by using
Figure BDA0003349594240000141
Obtaining a dipolar distribution loss value; where n denotes the number of target pixels, LbRepresenting the value of the loss of the polarization distribution.
In summary, the loss value can be expressed as:
L_total = L_side + L_fuse + β · L_b,
where β is a weighting coefficient whose specific value is not limited and may, for example, be 4, and L_total is the loss value.
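The exact per-pixel formula L_b^(i) appears only as an equation image in the source, so the sketch below assumes a weighted cross-entropy form restricted to pixels whose predicted probability falls in (k1, k2), with background pixels weighted by ω_b (the foreground proportion) and foreground pixels weighted by ω_o (the background proportion); only the combination L_total = L_side + L_fuse + β · L_b is taken directly from the text.

```python
import torch

def bipolar_loss(pred, label, k1=0.3, k2=0.7, eps=1e-6):
    """Assumed form of the polarization distribution loss L_b: only pixels whose
    predicted probability lies in the intermediate interval (k1, k2) contribute,
    background pixels are weighted by the foreground proportion and vice versa."""
    n_o = (label == 1).sum().float()                     # foreground pixel count
    n_b = (label == 0).sum().float()                     # background pixel count
    w_b = n_o / (n_b + n_o + eps)                        # weight applied to background pixels
    w_o = n_b / (n_b + n_o + eps)                        # weight applied to foreground pixels

    hard = (pred > k1) & (pred < k2)                     # target pixels in the intermediate interval
    if hard.sum() == 0:
        return pred.new_zeros(())
    p, z = pred[hard], label[hard].float()
    per_pixel = -(w_o * z * torch.log(p + eps)
                  + w_b * (1 - z) * torch.log(1 - p + eps))
    return per_pixel.mean()                              # average over the n target pixels

def total_loss(l_side, l_fuse, pred, label, beta=4.0):
    return l_side + l_fuse + beta * bipolar_loss(pred, label)
```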
S104: and if the condition that the training is finished is detected to be met, determining the initial model after the parameters are adjusted as an image segmentation model.
The training completion condition is a condition indicating that the parameter adjustment of the initial model is complete, and its specific content is not limited. For example, it may be a loss value threshold condition, i.e., a condition triggered when the loss value is smaller than a loss threshold; or it may be a training round condition, i.e., a condition triggered when the number of training rounds is greater than a round threshold. If the training completion condition is satisfied, the initial model has been sufficiently trained, and the initial model is therefore determined as the image segmentation model.
In summary, please refer to fig. 4, which is a flowchart of model training and application according to an embodiment of the present application. In the data set construction stage, unlabeled data may be collected and annotated at the pixel level to obtain the corresponding labels, or already labeled data may be collected. The data is then divided into a training set, a validation set and a test set, where the training set provides the training images. When the data is insufficient, it can be augmented by random flipping, random cropping, adding white noise, and the like. The network hyper-parameters are set and the network is initialized; the images in the training set are fed into the initial network in batches for training, the training loss is minimized through the back-propagation algorithm, and the model parameters are optimized. During training, the current network is evaluated on the validation set at regular intervals of iterations, which helps to adjust the hyper-parameters of the network model and ensures that the network does not overfit severely. After several rounds of training, once the loss curves on the training set and the validation set have decreased and stabilized, training is finished and the network model file is saved. The trained network model is then loaded and tested on the test set, and the network output is evaluated with evaluation indexes commonly used in image segmentation tasks, such as the Mean Absolute Error (MAE), the Intersection over Union (IoU) and the F-measure. If the results do not meet expectations, the network hyper-parameters are adjusted and the network is retrained until the expectations are met. Once they are met, the network is deployed on the target computing device and applied in the actual scenario.
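A condensed training loop matching the fig. 4 flow might look as follows; `compute_total_loss` is a hypothetical helper that combines the loss terms sketched above, and the data loaders, optimizer settings and validation interval are placeholder assumptions:

```python
import torch

def train(model, train_loader, val_loader, epochs=100, lr=1e-3, device="cuda"):
    """Skeleton of the training stage in fig. 4: minimize the total loss by
    back-propagation and periodically check the validation loss."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        for img, label in train_loader:
            img, label = img.to(device), label.to(device)
            side_maps = model(img)                       # first segmentation data per branch
            loss = compute_total_loss(side_maps, label)  # L_side + L_fuse + beta * L_b (see above)
            opt.zero_grad()
            loss.backward()
            opt.step()
        if (epoch + 1) % 5 == 0:                         # periodic validation to tune hyper-parameters
            model.eval()
            with torch.no_grad():
                val_loss = sum(compute_total_loss(model(i.to(device)), l.to(device)).item()
                               for i, l in val_loader) / len(val_loader)
            print(f"epoch {epoch + 1}: val loss {val_loss:.4f}")
    torch.save(model.state_dict(), "segmentation_model.pt")
```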
In particular, in an application, the image segmentation model is used for segmenting the foreground and the background. The application process comprises the following steps:
step 51: and acquiring an image to be segmented.
Step 52: and inputting the image to be segmented into the image segmentation model to obtain a segmentation result.
The image segmentation model is generated according to the image segmentation model training method. Referring to fig. 5, fig. 5 is a schematic diagram of a test image, a label and a corresponding segmentation result provided in the present application, wherein (a), (d) and (g) are the test image, (b), (e) and (h) are the label, and (c), (f) and (i) are the segmentation result.
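In the application stage, inference is a forward pass followed by the result fusion sketched earlier; the snippet below is a hypothetical usage example (the file name, input tensor and preprocessing are assumptions):

```python
import torch

model = BinaryTreeSegNet()                                   # architecture sketch from above
model.load_state_dict(torch.load("segmentation_model.pt", map_location="cpu"))
model.eval()

image = torch.rand(1, 3, 400, 300)                           # stand-in for a preprocessed image to be segmented
with torch.no_grad():
    side_maps = model(image)
    mask = fuse_side_outputs(side_maps, out_size=image.shape[-2:])  # binary foreground/background mask
```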
By applying the image segmentation model training method provided by the embodiments of the present application, the image segmentation model is obtained by training an initial model with a binary tree-shaped feature fusion structure. In this model, branch networks at adjacent feature extraction depths are connected by the binary tree-shaped feature fusion structure, which is embodied as follows: the input data of a second target module in the second branch network is composed of the second output data of the preceding module of the second target module and the first output data of the first target module, in the first branch network, corresponding to the position of the preceding module, and the feature extraction depth of the second branch network is the feature extraction depth of the first branch network minus one. Through this structure, the outputs of feature extraction operations at the same stage on adjacent branch networks are fused, so that feature information on deeper branch networks is continuously propagated to shallower branch networks, features at different scales are fully fused, and the fineness of the image segmentation model is improved. In addition, the image segmentation model as a whole has a simple structure and low algorithmic complexity; it improves segmentation fineness while reducing algorithm complexity, and is therefore suitable for deployment and real-time inference on mobile platforms with limited computing power, such as robots, intelligent vehicles, unmanned aerial vehicles and mobile phones.
In the following, the image segmentation model training device provided in the embodiment of the present application is introduced, and the image segmentation model training device described below and the image segmentation model training method described above may be referred to correspondingly.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an image segmentation model training apparatus according to an embodiment of the present application, including:
a training acquisition module 110, configured to acquire a training image and a corresponding label;
an input module 120, configured to input the training image into an initial model to obtain a segmentation result;
a parameter adjusting module 130, configured to obtain a loss value by using the segmentation result and the label, and perform parameter adjustment on the initial model by using the loss value;
a module determining module 140, configured to determine the initial model after parameter adjustment as an image segmentation model if it is detected that the training completion condition is met;
the initial model comprises a main network and a plurality of branch networks, wherein the main network comprises a plurality of main feature extraction modules which are connected in series; each trunk feature extraction module corresponds to one branch network, each branch network is provided with a plurality of branch feature extraction modules which are connected in series, and each trunk feature extraction module and the corresponding branch network correspond to different feature extraction depths respectively; the number of the branch feature extraction modules in the branch network with large feature extraction depth is not greater than the number of the branch feature extraction modules in the branch network with small feature extraction depth;
the input data of a second target module in a second branch network is composed of second output data of a preamble module of the second target module and first output data of a first target module in the first branch network corresponding to the position of the preamble module, and the feature extraction depth of the second branch network is the feature extraction depth of the first branch network minus one.
Optionally, the input module 120 includes:
a trunk input unit, configured to input the training image into the backbone network and perform feature extraction on the input trunk input data by using each trunk feature extraction module to obtain trunk output data;
the branch input unit is used for respectively inputting each main output data into the corresponding branch network, and respectively utilizing a target branch module in each branch network to perform feature extraction on the input branch input data to obtain branch output data;
the inter-branch transmission unit is used for inputting the branch output data into a subsequent branch module corresponding to the target branch module and an adjacent branch module corresponding to the subsequent branch module in the adjacent branch network;
the branch output unit is used for generating first segmentation data by utilizing tail branch modules in the branch networks respectively;
and the fusion unit is used for carrying out result fusion processing on each first segmentation data to obtain the segmentation result.
Optionally, the fusion unit includes:
a size transformation subunit, configured to perform up-sampling processing or deconvolution processing on first segmentation data whose image size differs from that of the training image to obtain second segmentation data;
and a fusion subunit, configured to perform fusion processing on the second segmentation data to obtain the segmentation result.
Optionally, the parameter adjusting module 130 includes:
the calculation unit is used for calculating a cross entropy loss value and a polarization distribution loss value by using the segmentation result and the label;
the weighted summation unit is used for carrying out weighted summation processing on the cross entropy loss value and the polarization distribution loss value to obtain the loss value;
wherein the calculation unit includes:
a proportion determining subunit, configured to determine a background pixel proportion and a foreground pixel proportion using the label;
the loss calculation subunit is used for calculating a pixel loss value corresponding to a target pixel with a predicted value in a preset middle interval in the segmentation result by using the background pixel proportion and the foreground pixel proportion;
and the average processing subunit is used for carrying out average calculation on all the pixel loss values to obtain the polarization distribution loss value.
Optionally, the ratio determining subunit includes:
a first calculating subunit for utilizing
Figure BDA0003349594240000171
Obtaining the background pixel proportion and the foreground pixel proportion, wherein n isbAnd noThe number of pixels, ω, in the label that belong to the background and foreground, respectivelybIs the foreground pixel proportion, ωoIs the background pixel proportion;
accordingly, a loss calculation subunit includes:
a second calculation subunit for utilizing
Figure BDA0003349594240000172
Obtaining the pixel loss value; wherein L isb (i)Is the pixel loss value of the ith pixel in the input image, ziFor the label value, z, corresponding to the ith pixel point in the input imagei0 denotes a background pixel, z i1 denotes a foreground pixel; y isiAs the predicted value, yi∈(k1,k2),k1And k2A lower limit value and an upper limit value of the preset intermediate intervalLimit value, k1、k2E (0,1), and k1<k2
Accordingly, the averaging processing subunit comprises:
a third computing subunit for utilizing
Figure BDA0003349594240000173
Obtaining the value of the dipolar distribution loss; where n denotes the number of target pixels, LbRepresenting the two-polarization distribution loss value.
Optionally, the trunk feature extraction module and/or the branch feature extraction module has a lightweight convolution module and/or an attention module.
In addition, an embodiment of the present application further provides an image segmentation apparatus, including:
the to-be-segmented acquisition module is used for acquiring an image to be segmented;
the segmentation module is used for inputting the image to be segmented into an image segmentation model to obtain a segmentation result;
the image segmentation model is generated according to the image segmentation model training method.
In the following, the electronic device provided by the embodiment of the present application is introduced, and the electronic device described below and the image segmentation model training method described above may be referred to correspondingly.
Referring to fig. 7, fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. Wherein the electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/information output (I/O) interface 104, and a communication component 105.
The processor 101 is configured to control the overall operation of the electronic device 100 to complete all or part of the steps in the image segmentation model training method; the memory 102 is used to store various types of data to support operation at the electronic device 100, and such data may include, for example, instructions for any application or method operating on the electronic device 100, as well as application-related data. The memory 102 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as one or more of a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disk.
The multimedia component 103 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 102 or transmitted through the communication component 105. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 104 provides an interface between the processor 101 and other interface modules, such as a keyboard, a mouse, or buttons. These buttons may be virtual buttons or physical buttons. The communication component 105 is used for wired or wireless communication between the electronic device 100 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 105 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
The electronic Device 100 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, and is configured to perform the image segmentation model training method according to the above embodiments.
A computer-readable storage medium provided by an embodiment of the present application is introduced below; the computer-readable storage medium described below and the image segmentation model training method described above may be referred to in correspondence with each other.
The present application further provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the image segmentation model training method described above.
The computer-readable storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments; for the same or similar parts among the embodiments, reference may be made to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and reference may be made to the method description for relevant details.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in a Random Access Memory (RAM), a Read-Only Memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any actual relationship or order between such entities or operations. Moreover, the terms "comprise", "include", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that includes a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The principle and implementation of the present application are explained herein by applying specific examples, and the above description of the embodiments is only intended to help in understanding the method and core idea of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. An image segmentation model training method is characterized by comprising the following steps:
acquiring a training image and a corresponding label;
inputting the training image into an initial model to obtain a segmentation result;
obtaining a loss value by using the segmentation result and the label, and performing parameter adjustment on the initial model by using the loss value;
if it is detected that a training completion condition is met, determining the parameter-adjusted initial model as an image segmentation model;
the initial model comprises a backbone network and a plurality of branch networks, wherein the backbone network comprises a plurality of trunk feature extraction modules which are connected in series; each trunk feature extraction module corresponds to one branch network, each branch network is provided with a plurality of branch feature extraction modules which are connected in series, and each trunk feature extraction module and the corresponding branch network respectively correspond to different feature extraction depths; the number of branch feature extraction modules in a branch network with a larger feature extraction depth is not greater than the number of branch feature extraction modules in a branch network with a smaller feature extraction depth;
the input data of a second target module in a second branch network is composed of second output data of a preceding module of the second target module and first output data of a first target module in a first branch network corresponding to the position of the preceding module, and the feature extraction depth of the second branch network is the feature extraction depth of the first branch network minus one.
2. The method for training the image segmentation model according to claim 1, wherein the inputting the training image into the initial model to obtain the segmentation result comprises:
inputting the training image into the backbone network, and respectively utilizing each trunk feature extraction module to perform feature extraction on the input trunk input data to obtain trunk output data;
respectively inputting each trunk output data into the corresponding branch network, and respectively utilizing a target branch module in each branch network to perform feature extraction on the input branch input data to obtain branch output data;
inputting the branch output data into a subsequent branch module corresponding to the target branch module and an adjacent branch module corresponding to the subsequent branch module in the adjacent branch network;
respectively generating first segmentation data by utilizing tail branch modules in the branch networks;
and performing result fusion processing on each first segmentation data to obtain the segmentation result.
3. The image segmentation model training method according to claim 2, wherein the performing result fusion processing on each of the first segmentation data to obtain the segmentation result includes:
performing up-sampling processing or deconvolution processing on first segmentation data whose image size differs from that of the training image, to obtain second segmentation data;
and performing fusion processing on the second segmentation data to obtain the segmentation result.
4. The method for training an image segmentation model according to claim 1, wherein the obtaining a loss value by using the segmentation result and the label comprises:
calculating to obtain a cross entropy loss value and a polarization distribution loss value by using the segmentation result and the label;
carrying out weighted summation processing on the cross entropy loss value and the polarization distribution loss value to obtain the loss value;
wherein the generation process of the polarization distribution loss value comprises the following steps:
determining a background pixel proportion and a foreground pixel proportion by using the label;
calculating a pixel loss value corresponding to a target pixel with a predicted value in a preset middle interval in the segmentation result by using the background pixel proportion and the foreground pixel proportion;
and calculating the average value of all the pixel loss values to obtain the polarization distribution loss value.
5. The method for training the image segmentation model according to claim 4, wherein the determining the background pixel proportion and the foreground pixel proportion by using the labels comprises:
obtaining the background pixel proportion and the foreground pixel proportion by using
ω_b = n_o/(n_b + n_o), ω_o = n_b/(n_b + n_o),
wherein n_b and n_o are the numbers of pixels in the label that belong to the background and to the foreground, respectively, ω_b is the foreground pixel proportion, and ω_o is the background pixel proportion;
correspondingly, calculating a pixel loss value corresponding to a target pixel with a predicted value in a preset middle interval in the segmentation result by using the background pixel proportion and the foreground pixel proportion, including:
obtaining the pixel loss value by using
Figure FDA0003349594230000022
wherein L_b^(i) is the pixel loss value of the i-th pixel in the input image, z_i is the label value corresponding to the i-th pixel in the input image, with z_i = 0 denoting a background pixel and z_i = 1 denoting a foreground pixel; y_i is the predicted value, y_i ∈ (k_1, k_2), k_1 and k_2 are the lower limit value and the upper limit value of the preset middle interval, k_1, k_2 ∈ (0, 1), and k_1 < k_2;
Correspondingly, calculating the average value of all the pixel loss values to obtain the polarization distribution loss value, including:
obtaining the polarization distribution loss value by using
L_b = (1/n) Σ_i L_b^(i),
wherein n denotes the number of target pixels, the sum is taken over the n target pixels, and L_b denotes the polarization distribution loss value.
6. The image segmentation model training method according to claim 1, wherein the trunk feature extraction module and/or the branch feature extraction module has a lightweight convolution module and/or an attention module.
7. An image segmentation method, comprising:
acquiring an image to be segmented;
inputting the image to be segmented into an image segmentation model to obtain a segmentation result;
wherein the image segmentation model is generated according to the image segmentation model training method of any one of claims 1 to 6.
8. An image segmentation model training device, comprising:
the training acquisition module is used for acquiring a training image and a corresponding label;
the input module is used for inputting the training image into an initial model to obtain a segmentation result;
the parameter adjusting module is used for obtaining a loss value by utilizing the segmentation result and the label and carrying out parameter adjustment on the initial model by utilizing the loss value;
the model determination module is used for determining the parameter-adjusted initial model as an image segmentation model if it is detected that a training completion condition is met;
the initial model comprises a backbone network and a plurality of branch networks, wherein the backbone network comprises a plurality of trunk feature extraction modules which are connected in series; each trunk feature extraction module corresponds to one branch network, each branch network is provided with a plurality of branch feature extraction modules which are connected in series, and each trunk feature extraction module and the corresponding branch network respectively correspond to different feature extraction depths; the number of branch feature extraction modules in a branch network with a larger feature extraction depth is not greater than the number of branch feature extraction modules in a branch network with a smaller feature extraction depth;
the input data of a second target module in a second branch network is composed of second output data of a preceding module of the second target module and first output data of a first target module in a first branch network corresponding to the position of the preceding module, and the feature extraction depth of the second branch network is the feature extraction depth of the first branch network minus one.
9. An electronic device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the image segmentation model training method according to any one of claims 1 to 6 and/or the image segmentation method according to claim 7.
10. A computer-readable storage medium for storing a computer program, wherein the computer program is adapted to be executed by a processor to implement the image segmentation model training method according to any one of claims 1 to 6 and/or the image segmentation method according to claim 7.
CN202111333309.9A 2021-11-11 2021-11-11 Image segmentation model training method, image segmentation device, image segmentation equipment and image segmentation medium Pending CN114066899A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111333309.9A CN114066899A (en) 2021-11-11 2021-11-11 Image segmentation model training method, image segmentation device, image segmentation equipment and image segmentation medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111333309.9A CN114066899A (en) 2021-11-11 2021-11-11 Image segmentation model training method, image segmentation device, image segmentation equipment and image segmentation medium

Publications (1)

Publication Number Publication Date
CN114066899A true CN114066899A (en) 2022-02-18

Family

ID=80274980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111333309.9A Pending CN114066899A (en) 2021-11-11 2021-11-11 Image segmentation model training method, image segmentation device, image segmentation equipment and image segmentation medium

Country Status (1)

Country Link
CN (1) CN114066899A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503618A (en) * 2023-04-25 2023-07-28 东北石油大学三亚海洋油气研究院 Method and device for detecting remarkable target based on multi-mode and multi-stage feature aggregation
CN116503618B (en) * 2023-04-25 2024-02-02 东北石油大学三亚海洋油气研究院 Method and device for detecting remarkable target based on multi-mode and multi-stage feature aggregation
CN116310187A (en) * 2023-05-17 2023-06-23 中国地质大学(武汉) Small-scale short-period beach fine modeling method
CN116310187B (en) * 2023-05-17 2023-08-04 中国地质大学(武汉) Small-scale short-period beach fine modeling method

Similar Documents

Publication Publication Date Title
CN107767384B (en) Image semantic segmentation method based on countermeasure training
CN109118467B (en) Infrared and visible light image fusion method based on generation countermeasure network
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN109949255B (en) Image reconstruction method and device
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN111950453B (en) Random shape text recognition method based on selective attention mechanism
CN107273936B (en) GAN image processing method and system
CN111160375B (en) Three-dimensional key point prediction and deep learning model training method, device and equipment
CN108846473B (en) Light field depth estimation method based on direction and scale self-adaptive convolutional neural network
WO2021129181A1 (en) Portrait segmentation method, model training method and electronic device
CN113850916A (en) Model training and point cloud missing completion method, device, equipment and medium
CN113628249B (en) RGBT target tracking method based on cross-modal attention mechanism and twin structure
CN111783779B (en) Image processing method, apparatus and computer readable storage medium
CN110458084B (en) Face age estimation method based on inverted residual error network
CN114066899A (en) Image segmentation model training method, image segmentation device, image segmentation equipment and image segmentation medium
CN112529015A (en) Three-dimensional point cloud processing method, device and equipment based on geometric unwrapping
CN113901900A (en) Unsupervised change detection method and system for homologous or heterologous remote sensing image
CN114037640A (en) Image generation method and device
Heinrich et al. Demystifying the black box: A classification scheme for interpretation and visualization of deep intelligent systems
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN115018039A (en) Neural network distillation method, target detection method and device
CN113989405A (en) Image generation method based on small sample continuous learning
Luo et al. Piecewise linear regression-based single image super-resolution via Hadamard transform
CN117636298A (en) Vehicle re-identification method, system and storage medium based on multi-scale feature learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination