CN117529753A - Training method of image segmentation model, image segmentation method and device - Google Patents


Info

Publication number
CN117529753A
CN117529753A (application CN202280004145.1A)
Authority
CN
China
Prior art keywords
image
mask
target
region
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280004145.1A
Other languages
Chinese (zh)
Inventor
时爱君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Publication of CN117529753A publication Critical patent/CN117529753A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A training method of an image segmentation model, an image segmentation method and a device. The training method comprises: acquiring a training sample, wherein the training sample comprises a training image, and the training image has an annotated mask image and an annotated ternary diagram (trimap); the training image is obtained by splicing an image to be processed and the ternary diagram corresponding to the image to be processed; inputting the training sample into the image segmentation model to obtain a prediction ternary diagram and a target prediction mask image corresponding to the image to be processed; determining a target loss of the image segmentation model according to the prediction ternary diagram and the target prediction mask image; and training the image segmentation model according to the target loss. Because the ternary diagram corresponding to the training sample does not need to be a manually generated high-precision ternary diagram, robustness to the ternary diagram required by the training sample is improved; and because the optimization target of the image segmentation model is determined from both the prediction ternary diagram and the target prediction mask image during training, the training effect and the robustness of the image segmentation model are improved.

Description

Training method of image segmentation model, image segmentation method and device
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a training method for an image segmentation model, an image segmentation method and an image segmentation device.
Background
In recent years, owing to the strong feature extraction capability of convolutional neural networks, deep-learning-based matting algorithms have become mainstream. However, in the related art, the training samples used for training an image segmentation model include a ternary diagram (trimap), which contains a foreground region, a background region, and an unknown region that cannot be identified as foreground or background. Because such training relies on manually generated, high-precision ternary diagrams, the robustness of the trained image segmentation model is low, and the training accuracy of the image segmentation model in the related art is also poor.
Disclosure of Invention
The present application provides a training method for an image segmentation model, an image segmentation method and an image segmentation device, so as to improve the accuracy and robustness of the image segmentation model.
In one aspect, an embodiment of the present application provides a training method for an image segmentation model, including:
obtaining a training sample; the training sample comprises a training image, and the training image comprises a marked mask image and a marked ternary image; the training images are obtained by splicing the images to be processed and the ternary images corresponding to the images to be processed; the marked mask image comprises a foreground area and a background area; the annotated ternary diagram comprises a foreground region, a background region and an unknown region between the foreground region and the background region;
Inputting the training sample into an image segmentation model to obtain a prediction ternary diagram and a target prediction mask diagram corresponding to the image to be processed;
determining a target loss of the image segmentation model according to the difference between the predicted ternary diagram and the marked ternary diagram and the difference between the target predicted mask diagram and the marked mask diagram;
and training the image segmentation model according to the target loss.
Another embodiment of the present application provides an image segmentation method, including:
acquiring a target to-be-processed image and a target ternary diagram corresponding to the target to-be-processed image;
inputting the target to-be-processed image and the target ternary diagram into a trained image segmentation model to obtain a target mask diagram; the image segmentation model is trained by adopting the training method of the image segmentation model in the aspect so as to obtain the trained image segmentation model;
and dividing the target to-be-processed image according to the target mask map to obtain an object in a foreground region in the target to-be-processed image.
Another embodiment of the present application proposes a training device for an image segmentation model, including:
The acquisition module is used for acquiring training samples; the training sample comprises a training image, and the training image comprises a marked mask image and a marked ternary image; the training images are obtained by splicing the images to be processed and the ternary images corresponding to the images to be processed; the marked mask image comprises a foreground area and a background area; the annotated ternary diagram comprises a foreground region, a background region and an unknown region between the foreground region and the background region;
the processing module is used for inputting the training sample into an image segmentation model to obtain a prediction ternary diagram and a target prediction mask diagram corresponding to the image to be processed;
the determining module is used for determining the target loss of the image segmentation model according to the difference between the prediction ternary diagram and the marked ternary diagram and the difference between the target prediction mask diagram and the marked mask diagram;
and the training module is used for training the image segmentation model according to the target loss.
Another embodiment of the present application provides an image matting apparatus, including:
the acquisition module is used for acquiring a target to-be-processed image and a target ternary diagram corresponding to the target to-be-processed image;
The processing module is used for inputting the target to-be-processed image and the target ternary diagram into a trained image segmentation model so as to obtain a target mask diagram; the image segmentation model is trained by adopting the training method of the image segmentation model in the aspect so as to obtain the trained image segmentation model;
and the segmentation module is used for segmenting the target to-be-processed image according to the target mask image to obtain an object in a foreground region in the target to-be-processed image.
In another aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the training method of the image segmentation model according to the previous aspect or the image segmentation method according to the previous aspect.
Another aspect of the present application proposes a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the training method of an image segmentation model as described in the previous aspect or the image segmentation method as described in the previous aspect.
Another embodiment of the present application proposes a computer program product having a computer program stored thereon which, when executed by a processor, implements the training method of an image segmentation model as described in the previous aspect or the image segmentation method as described in the previous aspect.
According to the training method of the image segmentation model, the image segmentation method and the device, a training sample is acquired, the training sample comprises a training image, and the training image has an annotated mask image and an annotated ternary diagram; the training image is obtained by splicing an image to be processed and the ternary diagram corresponding to the image to be processed; the annotated mask image comprises a foreground region and a background region, and the annotated ternary diagram comprises a foreground region, a background region and an unknown region between the foreground region and the background region. The training sample is input into the image segmentation model to obtain a prediction ternary diagram and a target prediction mask image corresponding to the image to be processed, the target loss of the image segmentation model is determined according to the prediction ternary diagram and the target prediction mask image, and the image segmentation model is trained according to the target loss. In the present application, the ternary diagram corresponding to the training sample does not need to be a manually generated high-precision ternary diagram, which improves robustness to the ternary diagram required by the training sample; during model training, the optimization target of the image segmentation model is determined according to the difference between the prediction ternary diagram and the annotated ternary diagram and the difference between the target prediction mask image and the annotated mask image, which improves the training effect and the robustness of the image segmentation model.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
fig. 1 is a flow chart of a training method of an image segmentation model according to an embodiment of the present application;
fig. 2 is a schematic diagram of generating an image to be processed in a training sample according to an embodiment of the present application;
FIG. 3A is a schematic diagram of a ternary diagram generation according to an embodiment of the present disclosure;
FIG. 3B is a second schematic diagram of a ternary diagram generation according to an embodiment of the present disclosure;
FIG. 3C is a third diagram illustrating a ternary diagram generation according to an embodiment of the present disclosure;
FIG. 4 is a flowchart of another training method of an image segmentation model according to an embodiment of the present disclosure;
FIG. 5 is a flowchart of another training method for an image segmentation model according to an embodiment of the present disclosure;
FIG. 6 is a flowchart of another training method for an image segmentation model according to an embodiment of the present disclosure;
FIG. 7 is a flowchart of another training method for an image segmentation model according to an embodiment of the present disclosure;
Fig. 8 is a schematic structural diagram of an image segmentation model according to an embodiment of the present application;
fig. 9 is a flowchart of an image segmentation method according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a training device for an image segmentation model according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an image segmentation apparatus according to an embodiment of the present application;
fig. 12 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application.
The training method, the image segmentation method and the device of the image segmentation model according to the embodiment of the application are described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a training method of an image segmentation model according to an embodiment of the present application.
The training method of the image segmentation model in the embodiments of the present application is executed by a training device of the image segmentation model. The device may be arranged in an electronic device; the electronic device may be a server, a terminal device or the like, and the terminal device may be a smart phone, a palmtop computer or the like, which is not limited in this embodiment.
As shown in fig. 1, the method may include the steps of:
step 101, obtaining a training sample.
The training sample comprises a training image, and the training image has an annotated mask image and an annotated ternary diagram. The training image is obtained by splicing the image to be processed and the ternary diagram corresponding to the image to be processed. The annotated mask image comprises a foreground region and a background region; the annotated ternary diagram comprises a foreground region, a background region, and an unknown region between the foreground region and the background region. In this embodiment, the mask image and the ternary diagram are both grayscale images, where the pixel value of each pixel in the foreground region is a first set value, the pixel value of each pixel in the background region is a second set value, and the pixel value of each pixel in the unknown region is a third set value between the first set value and the second set value. For example, the first set value is 255, the second set value is 0, and the third set value is a value between 0 and 255, for example 128.
In the embodiments of the present application, the training image is obtained by splicing an image to be processed and the ternary diagram corresponding to the image to be processed, where the image to be processed is the image on which foreground segmentation, that is, matting, is to be performed. As shown in fig. 2, in one implementation the image to be processed is generated as follows. A pair consisting of an original image and an annotated mask image is obtained from a given data set, and foreground information is diffused into the background region of the original image by a foreground prediction method, so that a foreground image with the background processed is generated and interference from background information is eliminated. Specifically, the annotated mask image is used to segment the original image and obtain the foreground part; the pixel information of each pixel of the foreground part is then used to update the pixel information of the adjacent background pixels, and by continuously diffusing outward, the pixel information of the foreground part is spread over the entire background part, so that the pixel information of the background part becomes similar to that of the foreground (for example, the kitten shown in the figure). Images are then randomly selected from the published COCO data set and BG-20k data set as background images, and the background-processed foreground image and a background image are weighted and fused according to the annotated mask image to obtain the original image used for training, that is, the image to be processed.
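As an illustrative, non-authoritative sketch of the weighted fusion described above, the composite image can be formed as I = alpha * F + (1 - alpha) * B, where alpha is the annotated mask normalized to [0, 1], F is the background-processed foreground image and B is the randomly selected background image. The NumPy code below is an assumed implementation; the function name `composite_training_image` is hypothetical.

```python
import numpy as np

def composite_training_image(foreground: np.ndarray,
                             background: np.ndarray,
                             mask: np.ndarray) -> np.ndarray:
    """Weighted fusion of a foreground and a background image using an annotated mask.

    foreground, background: H x W x 3 uint8 images of the same size.
    mask: H x W uint8 mask (0 = background, 255 = foreground).
    """
    alpha = (mask.astype(np.float32) / 255.0)[..., None]   # H x W x 1, values in [0, 1]
    fused = alpha * foreground.astype(np.float32) + (1.0 - alpha) * background.astype(np.float32)
    return np.clip(fused, 0, 255).astype(np.uint8)
```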
The ternary diagram corresponding to the image to be processed may be generated in at least one of the following ways. Compared with a manually generated high-precision ternary diagram, the generated ternary diagram has an enlarged unknown region: pixels that are actually foreground and pixels that are actually background are treated as unknown pixels and added to the unknown region, so that the area of the unknown region increases and interference information is introduced into it. In this way a variety of ternary diagrams are generated, which improves the generalization and robustness of the image segmentation model to various ternary diagrams.
Based on fig. 3A to 3C, an example of the generation manner of the ternary diagram is described:
as a first embodiment, an image to be processed is input into a segmentation model to obtain a mask image corresponding to the image to be processed, as shown in fig. 3A, the mask image includes noise signals, that is, gray parts on a cat body in the mask image, compared with the mask image marked in fig. 3B, the mask image is a binary image, wherein a plurality of pixel points included in the mask image correspond to a plurality of pixel points in the image to be processed, according to pixel values, that is, gray values, of the pixel points in each pixel point in the mask image, a plurality of pixel points with pixel values belonging to a (0, 255) interval are determined, and pixel values of the pixel points with pixel values belonging to the (0, 255) interval are set as set values, for example, 128 is set as a region of the pixel points with pixel values of 128 as an unknown region, that is, each pixel point included in the unknown region does not determine whether the pixel point belongs to a foreground region or a background region, and further, expansion check is performed on the unknown region with different sizes, so as to obtain a ternary image corresponding to the image to be processed.
In a second embodiment, a mask image of the image to be processed is obtained. As shown in fig. 3B, noise, for example Gaussian white noise, is added to the annotated mask image; the unknown region in the annotated mask image is then determined, and the unknown region is dilated with dilation kernels of different sizes to obtain ternary diagrams corresponding to the image to be processed. Fig. 3B shows, from left to right, the ternary diagrams generated with dilation kernels of sizes 9, 19 and 29. The way of determining the unknown region in the annotated mask image may refer to the description of the first embodiment; the principle is the same and is not repeated here.
In a third embodiment, a mask image of the image to be processed is obtained, and a foreground region and a background region are determined from the mask image, where the pixel value of the pixels of the foreground region is 255 and the pixel value of the pixels of the background region is 0. A set number of pixels are randomly selected from the foreground region as known foreground points, and a set number of pixels are randomly selected from the background region as known background points; fig. 3C shows, from left to right, the ternary diagrams generated by randomly selecting 4, 8 and 18 points. The remaining region is taken as the unknown region, and the pixel value of the pixels of the unknown region is a set value, for example 128, so that the ternary diagram corresponding to the image to be processed is obtained.
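A sketch of this third generation way under the same assumptions; the function name `trimap_from_known_points` and the default point count are hypothetical.

```python
import numpy as np

def trimap_from_known_points(mask: np.ndarray, num_points: int = 8, rng=None) -> np.ndarray:
    """Keep only a few known foreground/background points; all other pixels become unknown (128)."""
    rng = rng or np.random.default_rng()
    trimap = np.full_like(mask, 128)             # start with an all-unknown ternary diagram
    for value in (255, 0):                       # 255 = known foreground points, 0 = known background points
        ys, xs = np.nonzero(mask == value)
        if len(ys) > 0:
            idx = rng.choice(len(ys), size=min(num_points, len(ys)), replace=False)
            trimap[ys[idx], xs[idx]] = value
    return trimap
```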
It should be noted that the training sample may be obtained by channel-cascading the image to be processed with at least one ternary diagram generated as in the above embodiments, so that multiple different training samples are generated and the diversity of the training samples is improved. When the image segmentation model is trained, these different training samples can be used for model training to improve the training effect and, at the same time, the generalization of the image segmentation model to various ternary diagrams. For example, the image to be processed is a color image containing the pixel information of the three channels red, green and blue, while the ternary diagram is a grayscale image containing the pixel information of one channel; after the image to be processed and the ternary diagram are channel-cascaded, each pixel of the training sample corresponds to the pixel information of 4 channels, which increases the amount of information carried by the training sample.
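A minimal PyTorch-style sketch of the channel cascade, assuming tensors in N x C x H x W layout; the variable names are illustrative.

```python
import torch

# image: N x 3 x H x W (RGB, normalized); trimap: N x 1 x H x W (e.g. 0 / 0.5 / 1 after scaling)
image = torch.rand(1, 3, 512, 512)
trimap = torch.full((1, 1, 512, 512), 0.5)

# channel cascade: each pixel of the training sample now carries 4 channels of information
training_sample = torch.cat([image, trimap], dim=1)   # N x 4 x H x W
```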
And 102, inputting the training sample into an image segmentation model to obtain a prediction ternary diagram and a target prediction mask diagram corresponding to the image to be processed.
The image segmentation model may be a neural network model. By training the image segmentation model, it can learn the correspondence between an input training sample and the corresponding mask image, so that accurate foreground segmentation can be performed based on a high-precision mask image and a matting result containing only the foreground object is obtained; that is, the foreground object in the obtained matting result is opaque and the background part is transparent, which improves the matting effect. Furthermore, any background can be substituted on the basis of the obtained matting result, so the method is applicable to various background-replacement scenarios.
Regarding the prediction ternary diagram and the target prediction mask image, reference may be made to the explanation of the ternary diagram and the mask image in step 101; the principle is the same and is not repeated here.
And step 103, determining the target loss of the image segmentation model according to the difference between the predicted ternary diagram and the marked ternary diagram and the difference between the target predicted mask diagram and the marked mask diagram.
In the embodiments of the present application, the annotated ternary diagram is used as a supervision signal to determine the difference between the prediction ternary diagram and the annotated ternary diagram, and the annotated mask image is used as a supervision signal to determine the difference between the prediction mask image obtained by the image segmentation model and the annotated mask image. The difference of the mask images and the difference of the ternary diagrams together serve as the optimization target of the image segmentation model; that is, a multi-task loss consisting of the ternary diagram loss and the mask loss is used as the optimization target during the optimization of the image segmentation model. The ternary diagram loss is continuously corrected during optimization so that the prediction ternary diagram becomes more and more accurate, which lowers the requirement of the image segmentation model on the precision of the ternary diagram corresponding to the input training sample and improves the robustness and generalization of the image segmentation model with respect to the ternary diagram. The correction of the ternary diagram and that of the mask image promote each other, so the training effect of the image segmentation model is ultimately improved and the trained image segmentation model achieves higher accuracy in image segmentation.
And step 104, training the image segmentation model according to the target loss.
The target loss is used to perform gradient back-propagation training on the image segmentation model so as to adjust the parameters of the image segmentation model; training then continues with the training samples on the basis of the adjusted parameters until the value of the target loss function reaches its minimum, at which point the training of the image segmentation model is determined to be completed. The target loss of the image segmentation model is determined from the ternary diagram loss and the target mask loss and is used as the optimization target. The trained image segmentation model is then used to perform matting on images, which improves the accuracy of image processing.
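Purely as an illustrative sketch (not the patent's reference implementation), one training step of the procedure described above could look as follows in PyTorch; `model`, `trimap_loss_fn`, `mask_loss_fn` and the weight `w` are assumed placeholders.

```python
import torch

def train_step(model, optimizer, sample, gt_trimap, gt_mask,
               trimap_loss_fn, mask_loss_fn, w: float = 1.0):
    """One optimization step: forward pass, multi-task target loss, backward pass."""
    optimizer.zero_grad()
    pred_trimap, pred_mask = model(sample)                 # sample: N x 4 x H x W (image + trimap)
    loss_t = trimap_loss_fn(pred_trimap, gt_trimap)        # difference to the annotated ternary diagram
    loss_m = mask_loss_fn(pred_mask, gt_mask, gt_trimap)   # difference to the annotated mask image
    target_loss = loss_t + w * loss_m                      # weighted multi-task target loss
    target_loss.backward()                                 # gradient back-propagation
    optimizer.step()                                       # adjust the model parameters
    return target_loss.item()
```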
In the training method of the image segmentation model of this embodiment, a training sample is acquired; the training sample comprises a training image, the training image has an annotated mask image and an annotated ternary diagram, and the training image is obtained by splicing an image to be processed and the ternary diagram corresponding to the image to be processed; the annotated mask image comprises a foreground region and a background region, and the annotated ternary diagram comprises a foreground region, a background region and an unknown region between them. The training sample is input into the image segmentation model to obtain a prediction ternary diagram and a target prediction mask image corresponding to the image to be processed, the target loss of the image segmentation model is determined according to the prediction ternary diagram and the target prediction mask image, and the image segmentation model is trained according to the target loss. In the present application, the ternary diagram corresponding to the training sample does not need to be a manually generated high-precision ternary diagram, which improves the robustness and generalization with respect to the ternary diagram required by the training sample; during model training, the optimization target of the image segmentation model is determined according to the difference between the prediction ternary diagram and the annotated ternary diagram and the difference between the target prediction mask image and the annotated mask image, which improves the training effect and the robustness of the image segmentation model.
Based on the above example, fig. 4 is a schematic flow chart of another training method of an image segmentation model according to an embodiment of the present application, as shown in fig. 4, the method includes the following steps:
step 401, a training sample is obtained.
The principle of step 401 is the same with reference to the explanation in the foregoing embodiment, and will not be repeated here.
And step 402, inputting the training image into an encoder of the image segmentation model to encode, and obtaining image characteristics output by each convolution layer in the multi-layer convolution layers of the encoder.
The image segmentation model comprises an encoder and two decoders, wherein the two decoders are a ternary diagram decoder for decoding and generating a prediction ternary diagram and a mask decoder for generating a prediction mask diagram.
The encoder comprises a plurality of convolution layers, each convolution layer is used for extracting features of the training image to obtain image features corresponding to different scales, and the image features output by each convolution layer comprise structural texture information and semantic information.
And step 403, inputting the image characteristics output by the multi-layer convolution layer into a ternary diagram decoder of the image segmentation model to obtain a prediction ternary diagram.
The ternary diagram decoder is formed by stacking a plurality of deconvolution layers. The image features output by the multi-layer convolution layers have a correspondence with the plurality of deconvolution layers, and according to this correspondence the image features output by the multi-layer convolution layers are input into the corresponding deconvolution layers of the ternary diagram decoder, so that the prediction ternary diagram is obtained by prediction.
In one embodiment, the multi-layer convolution layers of the encoder comprise higher-layer convolution layers and lower-layer convolution layers, where the image features output by the lower-layer convolution layers contain abundant structural and/or texture information with less semantic information, and the image features output by the higher-layer convolution layers contain abundant semantic information with less structural and/or texture information. The plurality of deconvolution layers of the ternary diagram decoder comprise lower-layer deconvolution layers and higher-layer deconvolution layers; the higher-layer convolution layers of the encoder correspond to the lower-layer deconvolution layers of the ternary diagram decoder, and the lower-layer convolution layers of the encoder correspond to the higher-layer deconvolution layers of the ternary diagram decoder, that is, a staggered-layer or skip-layer connection is formed. In this way the ternary diagram decoder can predict and generate the prediction ternary diagram based on the features of multiple scales extracted by the multi-layer convolution layers of the encoder, which improves the accuracy of the generated prediction ternary diagram.
And step 404, inputting the image characteristics and the prediction ternary diagram output by the multi-layer convolution layer into a mask decoder of the image segmentation model to obtain a target prediction mask diagram.
In one implementation, the image features output by the higher-layer convolution layer and the prediction ternary diagram are input into a first decoding layer of the mask decoder to obtain a first prediction mask image; the first prediction mask image and the image features output by the lower-layer convolution layers are then input into a second decoding layer of the mask decoder to obtain the target prediction mask image. Through this skip-layer connection, the mask decoder can generate the prediction mask image based on the features of multiple scales extracted by the multi-layer convolution layers of the encoder, which improves the accuracy of the generated prediction mask image.
Step 405, determining a target loss of the image segmentation model according to the difference between the predicted ternary diagram and the labeled ternary diagram and the difference between the target predicted mask diagram and the labeled mask diagram.
Step 406, training the image segmentation model according to the target loss.
Step 405 and step 406 may refer to the explanation in the foregoing embodiments, and the principles are the same, which is not repeated here.
In the training method of the image segmentation model of this embodiment, the training image consists of the image to be processed and the corresponding ternary diagram, and the ternary diagram does not need to be a noise-free, high-precision ternary diagram, which lowers the requirement of the image segmentation model on the ternary diagram of the input sample and improves robustness to the ternary diagram. Further, the training image is input into the encoder of the image segmentation model for encoding, and the image features output by each of the multi-layer convolution layers of the encoder are obtained; the image features output by the multi-layer convolution layers are input into the ternary diagram decoder of the image segmentation model to obtain the prediction ternary diagram; and the image features output by the multi-layer convolution layers, together with the prediction ternary diagram, are input into the mask decoder of the image segmentation model to obtain the target prediction mask image. Decoding is thus performed on the obtained multi-scale image features with a dual-decoder structure, and the multi-task loss is determined based on the target prediction mask image and the prediction ternary diagram to optimize the model, which improves the training effect of the image segmentation model, makes the generated target prediction mask image more accurate, and enables high-precision matting based on the target prediction mask image in matting scenarios.
Based on the above example, fig. 5 is a flowchart of another training method of an image segmentation model according to an embodiment of the present application, as shown in fig. 5, step 405 includes the following steps:
step 501, determining a ternary diagram loss based on the difference between the predicted ternary diagram and the annotated ternary diagram.
In the embodiment of the application, the marked ternary diagram is used as a supervision signal, the difference between the predicted ternary diagram obtained by image segmentation model prediction and the marked ternary diagram is determined, the predicted ternary diagram comprises a foreground region, a background region and an unknown region between the foreground region and the background region, and the unknown region is accurately determined in the marked ternary diagram, so that the difference can be the difference between the unknown regions, particularly the difference between pixel values of pixel points included in the unknown region, and the ternary diagram loss determined based on the difference can be used as an optimization target of the predicted ternary diagram.
As one implementation, the ternary diagram decoder uses the ternary diagram loss $L_T$ as its optimization target and, through learning, determines the probability that each pixel in the unknown region of the prediction ternary diagram belongs to the foreground region, so as to obtain a more accurate prediction ternary diagram:

$L_T = L_{C\text{-}E}(T', T_{gt})$

where $L_{C\text{-}E}$ denotes the multi-class cross-entropy loss, $T'$ denotes the prediction ternary diagram, and $T_{gt}$ denotes the annotated ternary diagram of the training sample.
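A possible PyTorch realization of this multi-class cross-entropy ternary diagram loss, treating the ternary diagram as a 3-class (background / unknown / foreground) map; this is an assumed sketch, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def trimap_loss(pred_trimap_logits: torch.Tensor, gt_trimap: torch.Tensor) -> torch.Tensor:
    """Multi-class cross-entropy between the predicted and annotated ternary diagrams.

    pred_trimap_logits: N x 3 x H x W raw scores for background / unknown / foreground.
    gt_trimap: N x H x W uint8 with values 0, 128, 255.
    """
    # map the gray values of the annotated ternary diagram to class indices 0, 1, 2
    gt_classes = torch.zeros_like(gt_trimap, dtype=torch.long)
    gt_classes[gt_trimap == 128] = 1
    gt_classes[gt_trimap == 255] = 2
    return F.cross_entropy(pred_trimap_logits, gt_classes)
```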
Step 502, determining the target mask loss according to the difference between the target prediction mask map and the marked mask map.
In the embodiment of the present application, the noted mask image is used as a supervision signal, and the difference between the predicted mask image obtained by predicting the image segmentation model and the noted mask image is determined, where the predicted mask image includes a foreground region and a background region, so that the difference between the predicted mask image and the noted mask image may be the difference between the foreground region and the background region, and may specifically be the difference between pixel values of pixel points included in the foreground region and the difference between pixel values of pixel points included in the background region. And the target mask loss determined based on the difference can be used as a target for optimizing the predictive mask map.
The target mask loss will be explained in detail in the following embodiments, and will not be described here again.
In step 503, a target loss of the image segmentation model is determined according to the ternary diagram loss and the target mask loss.
In one embodiment of the present application, the target loss of the image segmentation model is obtained by performing weighted calculation on the ternary diagram loss and the target mask loss by using the set second weight value.
In the training method of the image segmentation model, the multi-task loss, namely the ternary diagram loss and the target mask loss, is determined based on the target prediction mask diagram and the prediction ternary diagram, and the loss function of the image segmentation model is determined according to the ternary diagram loss and the target mask loss so as to perform model optimization, so that the training effect of the image segmentation model is improved, the generated target prediction mask diagram is more accurate, and high-precision matting is realized based on the target prediction mask diagram in a matting scene.
Based on the above embodiments, fig. 6 is a flowchart of another training method of an image segmentation model according to the embodiments of the present application, as shown in fig. 6, step 502 includes the following steps:
step 601, determining an unknown region in the ternary diagram marked by the training sample according to the pixel value of each pixel point in the ternary diagram marked by the training sample.
Wherein the unknown region is an image region between the foreground region and the background region, i.e. the pixels in the unknown region are not determined as belonging to the foreground region or to the background region.
In this embodiment of the present application, the pixel value of each pixel is a gray value.
In this embodiment, the gray value of each pixel point in the unknown region in the labeled ternary diagram is a set value, and the unknown region can be determined by the gray value of each pixel point, that is, the positions of a plurality of pixel points included in the unknown region in the labeled ternary diagram are determined, so that a plurality of pixel points included in the unknown region can be identified.
Step 602, determining a sub-region in the unknown region according to the mask map marked by the training sample.
Wherein each pixel point included in the sub-region belongs to the foreground region.
In the embodiment of the application, the labeled mask map comprises an accurate foreground region and a background region, which pixel points belong to the foreground region and which pixel points belong to the background region can be determined, each pixel point included in the unknown region is respectively compared with the pixel points included in the foreground region and the pixel points included in the background region, each pixel point belonging to the foreground region can be determined from the unknown region, and the region corresponding to each pixel point belonging to the foreground region in the unknown region is a sub-region.
Step 603, determining a first area corresponding to the unknown area in the target prediction mask map and determining a second area corresponding to the unknown area in the mask map marked by the training sample according to the unknown area.
In the embodiment of the present application, each pixel point of the image to be processed corresponds to each pixel point in the labeled ternary diagram, and also corresponds to each pixel point in the target prediction mask diagram and the labeled mask diagram, so that according to the pixel points included in the unknown region, a first region corresponding to the unknown region can be determined in the target prediction mask diagram, and a second region corresponding to the unknown region can be determined in the training sample labeled mask diagram.
Step 604, determining a third area corresponding to the subarea in the target prediction mask map and determining a fourth area corresponding to the subarea in the marked mask map according to the subarea.
In the embodiment of the present application, each pixel point of the image to be processed corresponds to each pixel point in the labeled ternary diagram, and also corresponds to each pixel point in the target prediction mask diagram and the labeled mask diagram, so that according to the pixel points included in the sub-region in the labeled ternary diagram, a third region corresponding to the sub-region can be determined in the target prediction mask diagram, and a fourth region corresponding to the sub-region can be determined in the mask diagram labeled by the training sample.
Step 605 determines a first mask loss for the unknown region based on differences in pixel values between the plurality of pixels included in the first region and the plurality of pixels included in the second region.
As one embodiment, a first confidence that a pixel belongs to a foreground region is determined based on a pixel value of the pixel included in the first region, a second confidence that the pixel belongs to the foreground is determined based on a pixel value of the pixel included in the second region, and a first mask loss is determined based on the first confidence, the second confidence, and the number of pixels included in the unknown region.
Step 606, determining a second mask loss for the sub-region based on differences in pixel values between the plurality of pixels included in the third region and the plurality of pixels included in the fourth region.
As an embodiment, a third confidence that the pixel belongs to the foreground region is determined according to the pixel value of the pixel included in the third region, a fourth confidence that the pixel belongs to the foreground region is determined according to the pixel value of the pixel included in the fourth region, and a second mask loss is determined according to the third confidence, the fourth confidence, and the number of pixels included in the subregion.
In step 607, a target mask loss is determined based on the first mask loss and the second mask loss.
As one embodiment, the target mask loss is obtained by performing weighted calculation on the first mask loss and the second mask loss using the set first weight value. The first weight value is a set weight value, for example, 0.5.
The target mask loss may be determined from the first mask loss and the second mask loss, where $L_{Au}$ denotes the first mask loss, $L_{Auf}$ denotes the second mask loss, and $\beta$ is the weight of $L_{Auf}$, for example 0.5. $a'_i$ is the confidence of the i-th pixel in the first region, that is, the region of the target prediction mask image corresponding to the unknown region, and $\bar{a}_i$ is the confidence of the i-th pixel in the second region, that is, the region of the annotated mask image corresponding to the unknown region; $a'_j$ is the confidence of the j-th pixel in the third region, that is, the region of the target prediction mask image corresponding to the sub-region, and $\bar{a}_j$ is the confidence of the j-th pixel in the fourth region, that is, the region of the annotated mask image corresponding to the sub-region; $|U|$ is the number of pixels included in the unknown region (equivalently, in the first region or the second region), and $|U_f|$ is the number of pixels included in the sub-region (equivalently, in the third region or the fourth region).
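The explicit equation is not legible in this text. Purely as a reconstruction from the symbol definitions above, and under the assumption that each mask loss term averages the per-pixel absolute difference of foreground confidences over its region, a plausible form is:

$L_{Au} = \frac{1}{|U|}\sum_{i \in U}\left|a'_i - \bar{a}_i\right|,\qquad L_{Auf} = \frac{1}{|U_f|}\sum_{j \in U_f}\left|a'_j - \bar{a}_j\right|,\qquad L_{mask} = L_{Au} + \beta\, L_{Auf}$

This reconstruction follows the weighting described in the preceding steps rather than the patent's verbatim formula.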
In the training method of the image segmentation model, based on the prediction ternary diagram output by the ternary diagram decoder, the mask loss of the unknown region and the mask loss of the sub-region are determined to determine the loss function of the mask decoder, so that the optimization target of the mask decoder is improved, and the accuracy of determining the optimization target is improved.
Based on the above embodiments, another method for training an image segmentation model is provided in the embodiments of the present application, and fig. 7 is a schematic flow chart of another method for training an image segmentation model provided in the embodiments of the present application, as shown in fig. 7, and the method includes the following steps:
Step 701, obtaining a training sample.
Step 702, inputting the training image into an encoder of an image segmentation model to encode, and obtaining image features output by convolution of each layer in multiple convolution layers of the encoder.
In step 703, the image features output by the multi-layer convolution layer are input to a ternary diagram decoder of the image segmentation model to obtain a predicted ternary diagram.
The principles of steps 701 to 703 may be the same as those of the previous embodiments, and are not repeated here.
And step 704, inputting the image characteristics and the prediction ternary diagram output by the high-level convolution layer into a first decoding layer of a mask decoder to obtain a first prediction mask diagram.
Wherein the multi-layer convolution layers comprise a high-layer convolution layer and a low-layer convolution layer. The lower-layer convolution layers are multiple layers, and image features output by the multiple lower-layer convolution layers are divided into a first lower-layer structural feature, at least one second lower-layer structural feature and a third lower-layer structural feature from more to less according to included structural and/or texture information. It should be noted that, the image features output by the higher convolution layer include the least structural and/or texture information and the most semantic information. When the multi-layer convolution layer is divided, the multi-layer convolution layer may be divided into a plurality of high-layer convolution layers and a plurality of low-layer convolution layers based on the structure and/or texture information included in the output image feature, that is, the division of the high-layer convolution layer and the low-layer convolution layer is not limited in this embodiment.
The second decoding layer includes a first decoding sublayer, at least one second decoding sublayer, and a third decoding sublayer.
Step 705, inputting the first prediction mask map and the third low-layer structure feature into the first decoding sub-layer in the second decoding layer to obtain a prediction mask map output by the first decoding sub-layer.
Step 706, for a first second decoding sublayer of the second decoding sublayer, inputting the prediction mask image output by the first decoding sublayer and a second lower layer structure feature corresponding to the first second decoding sublayer into the first second decoding sublayer.
Step 707, for any second decoding sub-layer other than the first second decoding sub-layer, inputting the prediction mask image output by the second decoding sub-layer of the previous layer and the second low-layer structure feature corresponding to the second decoding sub-layer into the second decoding sub-layer.
And 708, inputting the predicted mask image and the first low-layer structural features output by the last second decoding sublayer into a third decoding sublayer in the second decoding layer to obtain a target predicted mask image.
The target prediction mask map is a mask map obtained by predicting a third decoding sublayer of the last layer of the mask decoder.
It should be noted that, after the prediction mask map output by the decoding sub-layer of the previous layer is input into the decoding sub-layer of the next layer, the decoding sub-layer of the next layer converts the input prediction mask map into a corresponding ternary map, so that each layer gradually predicts and obtains an accurate target prediction mask based on the output of the previous layer and the corresponding low-layer structural characteristics.
In this embodiment of the present application, the higher-layer convolution layer corresponds to the first decoding layer, and the plurality of lower-layer convolution layers correspond to the at least one second decoding sub-layer and the third decoding sub-layer, so that a skip-layer connection is formed. As shown in fig. 8, the encoder comprises 4 convolution layers c1, c2, c3 and c4, where c4 is the higher-layer convolution layer and c3, c2 and c1 are the lower-layer convolution layers. The image features output by the higher-layer convolution layer c4 are input into the first decoding layer d1 of the mask decoder; the image features output by the lower-layer convolution layer c3 are input into the second decoding sub-layer d2 of the second decoding layer of the mask decoder; the image features output by the lower-layer convolution layer c2 are input into the first decoding sub-layer d3 of the second decoding layer of the mask decoder; and the image features output by the lower-layer convolution layer c1 are input into the third decoding sub-layer d4 of the second decoding layer of the mask decoder. In this way, the image features of all layers output by the multi-layer convolution layers of the encoder, in particular the image features containing rich structural and/or texture information, together with the semantic features, can be input into the corresponding decoding layers of the mask decoder, so that layer-by-layer decoding based on features of multiple scales is realized, fine edge information such as whiskers and hairlines can be identified accurately, and the reliability of decoding is improved.
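Purely to make the skip-layer wiring concrete, the sketch below mirrors the c1-c4 / d1-d4 arrangement described above with a toy encoder and mask decoder; the channel counts, module names and the single-layer trimap head are invented for illustration and are not taken from the patent.

```python
import torch
import torch.nn as nn

class ToyMattingNet(nn.Module):
    def __init__(self):
        super().__init__()
        ch = [4, 32, 64, 128, 256]                 # input is image (3 channels) + trimap (1 channel)
        # encoder convolution layers c1..c4 (c4 is the higher-layer convolution layer)
        self.c = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch[i], ch[i + 1], 3, stride=2, padding=1), nn.ReLU())
            for i in range(4))
        # trimap decoder, simplified here to a single prediction head
        self.trimap_head = nn.Conv2d(ch[4], 3, 1)
        # mask decoder layers d1..d4; each consumes the previous output plus one skip feature
        self.d1 = nn.ConvTranspose2d(ch[4] + 3, ch[3], 4, stride=2, padding=1)       # c4 + trimap
        self.d2 = nn.ConvTranspose2d(ch[3] + ch[3], ch[2], 4, stride=2, padding=1)   # + c3
        self.d3 = nn.ConvTranspose2d(ch[2] + ch[2], ch[1], 4, stride=2, padding=1)   # + c2
        self.d4 = nn.ConvTranspose2d(ch[1] + ch[1], 1, 4, stride=2, padding=1)       # + c1

    def forward(self, x):
        f1 = self.c[0](x)
        f2 = self.c[1](f1)
        f3 = self.c[2](f2)
        f4 = self.c[3](f3)
        trimap_lr = self.trimap_head(f4)                        # low-resolution prediction ternary diagram
        m = self.d1(torch.cat([f4, trimap_lr], 1))              # first decoding layer d1
        m = self.d2(torch.cat([m, f3], 1))                      # skip connection from c3 into d2
        m = self.d3(torch.cat([m, f2], 1))                      # skip connection from c2 into d3
        mask = torch.sigmoid(self.d4(torch.cat([m, f1], 1)))    # skip connection from c1 into d4
        trimap = nn.functional.interpolate(trimap_lr, scale_factor=16,
                                           mode="bilinear", align_corners=False)
        return trimap, mask
```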
Step 709, determining the ternary diagram loss according to the difference between the predicted ternary diagram and the ternary diagram marked by the training sample.
Step 710, determining the target mask loss according to the difference between the target prediction mask map and the mask map marked by the training sample.
In step 711, a target loss of the image segmentation model is determined from the ternary diagram loss and the target mask loss.
Step 712, training the image segmentation model according to the target loss.
In step 709 to step 712, reference may be made to the explanation in the foregoing embodiments, and the principles are the same, and are not repeated here.
In the training method of the image segmentation model in the embodiment of the application, the encoder is divided into the multi-layer convolution layers to extract the image characteristics, the two decoders are divided into the plurality of decoding layers, the multi-layer convolution layers of the encoder and the plurality of decoding layers of the decoder are connected through the jump layers, so that the image characteristics of each layer output by the multi-layer convolution layers of the encoder can be input into the decoding layers corresponding to the corresponding decoders, decoding based on the characteristics of a plurality of scales is realized, and the reliability and the accuracy of decoding are improved.
Based on the foregoing embodiments, the embodiments of the present application provide an image segmentation method, and fig. 9 is a schematic flow chart of the image segmentation method provided in the embodiments of the present application, as shown in fig. 9, the method includes the following steps:
Step 901, obtaining a target to-be-processed image and a target ternary diagram corresponding to the target to-be-processed image.
And step 902, inputting the target to-be-processed image and the target ternary diagram into a trained image segmentation model to obtain a target mask diagram.
The image segmentation model may be trained by using the training method in any embodiment corresponding to the foregoing training method of the image segmentation model, so as to obtain a trained image segmentation model, and the training method is not described herein.
And step 903, dividing the target to-be-processed image according to the target mask map to obtain an object in a foreground region in the target to-be-processed image.
The main execution body of the image segmentation method in the embodiment of the present application is an image segmentation apparatus, and the image segmentation apparatus may be disposed in any electronic device, where the electronic device may be any stationary or mobile computing device capable of performing data processing, for example, a mobile computing device such as a notebook computer, a smart phone, a wearable device, or a stationary computing device such as a desktop computer, or a server, or other types of computing devices, which is not limited in this disclosure.
In the embodiments of the present application, accurate foreground segmentation can be performed on the target image to be processed based on the high-precision target mask image; in other words, the high-precision target mask image is used to weight the target image to be processed, so as to obtain a matting result containing only the object in the foreground region, which has high precision and enables fine matting. In the obtained matting result the object in the foreground region is opaque and the background part is transparent, so any background can be substituted on the basis of the matting result, which is applicable to various background-replacement scenarios. For example, in human-computer interaction, the image segmentation method may be provided in a mobile phone and run by an application program; at an image segmentation interface, a user taps an ID photo so that the background of the ID photo can be replaced, or taps the main subject, a cat or a dog in the image so that it can be pasted onto various scenery images and posters.
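A sketch of the weighting and background replacement described above, assuming the target mask image is an alpha map with values in [0, 255]; the helper name `replace_background` is illustrative.

```python
import numpy as np

def replace_background(image: np.ndarray, target_mask: np.ndarray,
                       new_background: np.ndarray) -> np.ndarray:
    """Cut out the foreground with the target mask image and paste it onto a new background."""
    alpha = (target_mask.astype(np.float32) / 255.0)[..., None]
    cutout = alpha * image.astype(np.float32)                   # opaque foreground, transparent background
    composed = cutout + (1.0 - alpha) * new_background.astype(np.float32)
    return np.clip(composed, 0, 255).astype(np.uint8)
```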
In the image segmentation method, the target image to be processed is identified based on the image segmentation model obtained through training to obtain the high-precision target mask image, accurate image matting is performed based on the target mask image, and the image matting accuracy is improved.
In order to achieve the above embodiments, the embodiments of the present application further provide a training device for an image segmentation model.
Fig. 10 is a schematic structural diagram of a training device for an image segmentation model according to an embodiment of the present application.
As shown in fig. 10, the apparatus may include:
an obtaining module 1001, configured to obtain a training sample; the training sample comprises a training image, wherein the training image comprises a marked mask image and a marked ternary image; the training images are obtained by splicing the images to be processed and the ternary images corresponding to the images to be processed; the marked mask image comprises a foreground area and a background area; the annotated ternary graph includes a foreground region, a background region, and an unknown region between the foreground region and the background region.
And a processing module 1002, configured to input the training sample into an image segmentation model, to obtain a prediction ternary diagram and a target prediction mask diagram corresponding to the image to be processed.
A determining module 1003, configured to determine a target loss of the image segmentation model according to a difference between the predicted ternary diagram and the labeled ternary diagram, and a difference between the target predicted mask diagram and the labeled mask diagram.
And a training module 1004, configured to train the image segmentation model according to the target loss.
Further, in one implementation manner of the embodiment of the present application, the determining module 1003 is specifically configured to:
determining a ternary diagram loss according to the difference between the predicted ternary diagram and the marked ternary diagram;
determining target mask loss according to the difference between the target prediction mask map and the marked mask map;
and determining the target loss of the image segmentation model according to the ternary diagram loss and the target mask loss.
Further, in one implementation of the embodiment of the present application, the processing module 1002 is specifically configured to:
inputting the training image into an encoder of an image segmentation model for encoding to obtain image characteristics output by each convolution layer in the multi-layer convolution layers of the encoder;
inputting the image characteristics output by the multi-layer convolution layer into a ternary diagram decoder of the image segmentation model to obtain the prediction ternary diagram;
And inputting the image characteristics and the prediction ternary diagram output by the multi-layer convolution layer into a mask decoder of the image segmentation model to obtain the target prediction mask diagram.
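By way of illustration only, the following is a minimal sketch of this encoder/dual-decoder arrangement, assuming a PyTorch-style implementation; the class name DualDecoderSegModel, the channel sizes and the layer counts are hypothetical and not specified by the present application, and the skip connections from the lower convolution layers to the mask decoder are omitted here (a sketch of that cascade is given after the description of the second decoding layer below).

```python
import torch
import torch.nn as nn

class DualDecoderSegModel(nn.Module):
    """Encoder producing multi-layer convolution features, a ternary-diagram
    (trimap) decoder, and a mask decoder that also consumes the predicted trimap."""

    def __init__(self):
        super().__init__()
        # Encoder: the output of each stage is one of the multi-layer convolution features.
        self.enc1 = nn.Sequential(nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU())   # 3 image channels + 1 trimap channel (assumption)
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.enc3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        # Ternary-diagram decoder: three classes (foreground / background / unknown).
        self.trimap_head = nn.Conv2d(128, 3, 1)
        # Mask decoder: consumes the high-level feature together with the predicted trimap.
        self.mask_head = nn.Sequential(nn.Conv2d(128 + 3, 64, 3, padding=1), nn.ReLU(),
                                       nn.Conv2d(64, 1, 1))

    def forward(self, training_image):
        f1 = self.enc1(training_image)     # image features output by each convolution layer
        f2 = self.enc2(f1)
        f3 = self.enc3(f2)
        pred_trimap = self.trimap_head(f3)                     # prediction ternary diagram (logits)
        mask_in = torch.cat([f3, pred_trimap], dim=1)
        pred_mask = torch.sigmoid(self.mask_head(mask_in))     # target prediction mask map
        return pred_trimap, pred_mask
```

In this sketch the predicted mask is produced at a reduced resolution; an actual implementation would fuse the lower-layer features and upsample back to the input size, as outlined for the second decoding layer below.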
In one implementation of the embodiment of the present application, the determining module 1003 is specifically configured to:
determining an unknown region in the marked ternary diagram according to the pixel value of each pixel point in the marked ternary diagram; the unknown region is an image region between a foreground region and a background region;
determining a subarea in the unknown area according to the marked mask map; each pixel point included in the sub-region belongs to a foreground region;
according to the unknown region, determining a first region corresponding to the unknown region in the target prediction mask map, and determining a second region corresponding to the unknown region in the marked mask map;
according to the subareas, determining a third area corresponding to the subareas in the target prediction mask map, and determining a fourth area corresponding to the subareas in the marked mask map;
determining a first mask loss of the unknown region according to differences in pixel values between a plurality of pixel points included in the first region and a plurality of pixel points included in the second region;
Determining a second mask loss for the sub-region according to differences in pixel values between a plurality of pixels included in the third region and a plurality of pixels included in the fourth region;
and determining the target mask loss according to the first mask loss and the second mask loss.
As an embodiment, the determining module 1003 is specifically further configured to:
determining a first confidence coefficient of the pixel points belonging to the foreground region according to the pixel values of the pixel points included in the first region, and determining a second confidence coefficient of the pixel points belonging to the foreground according to the pixel values of the pixel points included in the second region;
and determining the first mask loss according to the first confidence, the second confidence and the number of pixel points included in the unknown region.
As an embodiment, the determining module 1003 is specifically further configured to:
determining a third confidence coefficient of the pixel points belonging to the foreground region according to the pixel values of the pixel points included in the third region, and determining a fourth confidence coefficient of the pixel points belonging to the foreground region according to the pixel values of the pixel points included in the fourth region;
and determining the second mask loss according to the third confidence, the fourth confidence and the number of pixel points included in the subarea.
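By way of illustration, one plausible reading of these region-restricted losses is sketched below, assuming that the confidences are the mask values themselves in [0, 1], that the difference is an L1-style mean over the number of pixel points in each region, that the unknown region is marked with the pixel value 128 in the ternary diagram, and that a pixel of the marked mask map with a value of at least 0.5 counts as foreground; the function and parameter names are hypothetical.

```python
import numpy as np

def region_mask_losses(pred_mask, gt_mask, gt_trimap, unknown_value=128):
    """Region-restricted mask losses, as one possible interpretation.

    pred_mask, gt_mask: H x W arrays in [0, 1] (foreground confidences)
    gt_trimap:          H x W array with distinct pixel values for
                        foreground / background / unknown regions
    """
    # Unknown region: located from the labeled ternary diagram's pixel values.
    unknown = gt_trimap == unknown_value
    # Sub-region: pixels of the unknown region that the labeled mask marks as foreground.
    sub = unknown & (gt_mask >= 0.5)

    # First mask loss: mean absolute confidence difference over the unknown region.
    l_unknown = np.abs(pred_mask[unknown] - gt_mask[unknown]).mean() if unknown.any() else 0.0
    # Second mask loss: mean absolute confidence difference over the sub-region.
    l_sub = np.abs(pred_mask[sub] - gt_mask[sub]).mean() if sub.any() else 0.0
    return l_unknown, l_sub
```

One way to read the separate sub-region term is that it concentrates additional supervision on the foreground pixels that fall inside the unknown band, which is why it is weighted separately below.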
As an embodiment, the determining module 1003 is specifically further configured to:
and adopting a set first weight value to perform weighted calculation on the first mask loss and the second mask loss to obtain the target mask loss.
As an embodiment, the determining module 1003 is specifically further configured to:
and adopting a set second weight value to perform weighted calculation on the ternary diagram loss and the target mask loss to obtain the target loss of the image segmentation model.
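The two weighted calculations described above might then be combined as follows; the weight values of 0.5 and the single-weight convex-combination form are placeholders and assumptions, not values given in the present application.

```python
def target_loss(trimap_loss, l_unknown, l_sub, w1=0.5, w2=0.5):
    """Combine the losses with set weight values (w1, w2 are placeholders)."""
    # Target mask loss: weighted combination of the first and second mask losses.
    target_mask_loss = w1 * l_unknown + (1.0 - w1) * l_sub
    # Target loss of the image segmentation model: weighted combination of the
    # ternary-diagram loss and the target mask loss.
    return w2 * trimap_loss + (1.0 - w2) * target_mask_loss
```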
Further, as an implementation manner of the embodiment of the present application, the multi-layer convolution layers include a high-layer convolution layer and a low-layer convolution layer, and the processing module 1002 is specifically configured to:
inputting the image characteristics and the prediction ternary diagram output by the high-level convolution layer into a first decoding layer of the mask decoder to obtain a first prediction mask diagram;
and inputting the image features output by the first prediction mask map and the lower convolution layer into a second decoding layer of the mask decoder to obtain the target prediction mask map.
As one implementation, there are multiple lower-layer convolution layers, and the image features output by the multiple lower-layer convolution layers are divided, in descending order of the structure and/or texture information they include, into a first lower-layer structural feature, at least one second lower-layer structural feature and a third lower-layer structural feature; the second decoding layer comprises a first decoding sublayer, at least one second decoding sublayer and a third decoding sublayer;
As an embodiment, the processing module 1002 is specifically further configured to:
inputting the first prediction mask map and the third low-layer structural characteristics into a first decoding sublayer in the second decoding layer to obtain a prediction mask map output by the first decoding sublayer;
inputting, for the first second decoding sublayer among the at least one second decoding sublayer, the prediction mask map output by the first decoding sublayer and the second low-layer structural feature corresponding to this second decoding sublayer into this second decoding sublayer;
inputting a predictive mask image output by a second decoding sublayer of a previous layer and the second low-layer structural features corresponding to the second decoding sublayer into the second decoding sublayer for any second decoding sublayer except the first second decoding sublayer;
and inputting the prediction mask image output by the last second decoding sublayer and the first low-layer structural features into a third decoding sublayer in the second decoding layer to obtain the target prediction mask image.
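As a sketch only, the cascade of decoding sublayers described above could be organized as follows in a PyTorch-style implementation; the module names, channel counts, and the bilinear upsampling between sublayers are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadedMaskDecoder(nn.Module):
    """Sketch of the second decoding layer: a first decoding sublayer, one or more
    second decoding sublayers, and a third decoding sublayer, each refining the
    running prediction mask with a progressively richer low-layer structural feature."""

    def __init__(self, feat_channels):
        # feat_channels: channel counts of the low-layer structural features, ordered
        # from the third (least structure/texture) to the first (most) -- an assumption.
        super().__init__()
        self.sublayers = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c + 1, 32, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(32, 1, 1))
            for c in feat_channels
        ])

    def forward(self, first_pred_mask, low_feats):
        # low_feats: low-layer structural features in the same order as feat_channels.
        mask = first_pred_mask
        for sublayer, feat in zip(self.sublayers, low_feats):
            # Upsample the running prediction mask to this feature's resolution,
            # concatenate, and let the sublayer refine it.
            up = F.interpolate(mask, size=feat.shape[-2:], mode="bilinear", align_corners=False)
            mask = torch.sigmoid(sublayer(torch.cat([up, feat], dim=1)))
        return mask   # target prediction mask map from the last (third) decoding sublayer
```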
It should be noted that the foregoing explanation of the method embodiment is also applicable to the apparatus of this embodiment, and will not be repeated here.
In the training device of the image segmentation model, a training sample is obtained, the training sample comprises a training image, and the training image comprises a marked mask image and a marked ternary image; the training image is obtained by splicing an image to be processed and a ternary image corresponding to the image to be processed, wherein the marked mask image comprises a foreground area and a background area, and the marked ternary image comprises the foreground area, the background area and an unknown area between the foreground area and the background area; the training sample is input into an image segmentation model to obtain a prediction ternary image and a target prediction mask image corresponding to the image to be processed, the target loss of the image segmentation model is determined according to the prediction ternary image and the target prediction mask image, and the image segmentation model is trained according to the target loss. In this way, the ternary diagram corresponding to the training sample is not required to be a noise-free ternary diagram, and the robustness with respect to the ternary diagram required by the training sample is improved; in the model training process, the optimization target of the image segmentation model is determined according to the difference between the predicted ternary diagram and the marked ternary diagram and the difference between the target predicted mask diagram and the marked mask diagram, which improves the training effect and the robustness of the image segmentation model.
In order to achieve the above embodiments, the embodiments of the present application further provide an image segmentation apparatus.
Fig. 11 is a schematic structural diagram of an image segmentation apparatus according to an embodiment of the present application.
As shown in fig. 11, the apparatus may include:
the obtaining module 1101 is configured to obtain a target to-be-processed image and a target ternary diagram corresponding to the target to-be-processed image.
The processing module 1102 is configured to input the target to-be-processed image and the target ternary image into a trained image segmentation model, so as to obtain a target mask image; the trained image segmentation model is obtained by training with the training method of the image segmentation model according to any of the foregoing embodiments.
And the segmentation module 1103 is configured to segment the target to-be-processed image according to the target mask map, so as to obtain an object in a foreground region in the target to-be-processed image.
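By way of illustration, a hypothetical inference flow for this apparatus, reusing the DualDecoderSegModel sketch above, might look as follows; the input shapes, the dummy tensors standing in for a real image and ternary diagram, and the absence of preprocessing are assumptions.

```python
import torch

# Dummy inputs standing in for a real target to-be-processed image and its target ternary diagram.
target_image = torch.rand(1, 3, 256, 256)      # RGB image
target_trimap = torch.rand(1, 1, 256, 256)     # ternary-diagram channel

model = DualDecoderSegModel()                  # sketch defined earlier; trained weights would be loaded here
model.eval()

with torch.no_grad():
    x = torch.cat([target_image, target_trimap], dim=1)   # spliced 4-channel input
    _, target_mask = model(x)                              # target mask map (reduced resolution in this sketch)

# The target mask map is then used to weight the image and extract the object in the
# foreground region, e.g. with the compositing sketch given earlier.
```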
It should be noted that the foregoing explanation of the method embodiment is also applicable to the apparatus of this embodiment, and will not be repeated here.
In the image segmentation device, the target to-be-processed image is identified based on the image segmentation model obtained through training to obtain a high-precision target mask image, and accurate matting is performed based on the target mask image, thereby improving the matting accuracy.
In order to implement the above embodiments, the application further proposes an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the training method of the image segmentation model or the image segmentation method of the foregoing embodiments when executing the program.

In order to implement the above embodiments, the present application also proposes a non-transitory computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the training method of the image segmentation model or the image segmentation method of the foregoing embodiments.

In order to implement the above embodiments, the present application also proposes a computer program product having a computer program stored thereon, the computer program, when executed by a processor, implementing the training method of the image segmentation model or the image segmentation method of the foregoing embodiments.
Fig. 12 is a block diagram of an electronic device according to an embodiment of the present application. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 12, an electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power component 806 provides power to the various components of the electronic device 800. Power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operational mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an on/off state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; the sensor assembly 814 may also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 4G, or 5G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as the memory 804 including instructions executable by the processor 820 of the electronic device 800 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiments of the present application, in which functions may be executed out of the order shown or discussed, including in a substantially concurrent manner or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art to which the embodiments of the present application pertain.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example, an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber device, and a portable Compact Disc Read-Only Memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, the steps or methods may be implemented using any one of, or a combination of, the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, Programmable Gate Arrays (PGAs), Field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments. In addition, each functional unit in each embodiment of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like. The above-described embodiments are exemplary and not to be construed as limiting the application, and variations, modifications, alternatives, and alternatives to the above-described embodiments may be made by one of ordinary skill in the art within the scope of the application.

Claims (15)

  1. A method of training an image segmentation model, comprising:
    obtaining a training sample; the training sample comprises a training image, and the training image comprises a marked mask image and a marked ternary image; the training images are obtained by splicing the images to be processed and the ternary images corresponding to the images to be processed; the marked mask image comprises a foreground area and a background area; the annotated ternary diagram comprises a foreground region, a background region and an unknown region between the foreground region and the background region;
    inputting the training sample into an image segmentation model to obtain a prediction ternary diagram and a target prediction mask diagram corresponding to the image to be processed;
    determining a target loss of the image segmentation model according to the difference between the predicted ternary diagram and the marked ternary diagram and the difference between the target predicted mask diagram and the marked mask diagram;
    And training the image segmentation model according to the target loss.
  2. The method of claim 1, wherein said determining the target loss of the image segmentation model based on the difference between the predicted ternary diagram and the annotated ternary diagram, and the difference between the target predicted mask diagram and the annotated mask diagram, comprises:
    determining a ternary diagram loss according to the difference between the predicted ternary diagram and the marked ternary diagram;
    determining target mask loss according to the difference between the target prediction mask map and the marked mask map;
    and determining the target loss of the image segmentation model according to the ternary diagram loss and the target mask loss.
  3. The method of claim 1, wherein the inputting the training sample into the image segmentation model to obtain the predicted ternary diagram and the target predicted mask diagram corresponding to the image to be processed comprises:
    inputting the training image into an encoder of an image segmentation model for encoding to obtain image characteristics output by each convolution layer in the multi-layer convolution layers of the encoder;
    inputting the image characteristics output by the multi-layer convolution layer into a ternary diagram decoder of the image segmentation model to obtain the prediction ternary diagram;
    And inputting the image characteristics and the prediction ternary diagram output by the multi-layer convolution layer into a mask decoder of the image segmentation model to obtain the target prediction mask diagram.
  4. The method of claim 2, wherein determining a target mask loss based on a difference between the target predicted mask map and the annotated mask map comprises:
    determining an unknown region in the marked ternary diagram according to the pixel value of each pixel point in the marked ternary diagram; the unknown region is an image region between a foreground region and a background region;
    determining a subarea in the unknown area according to the marked mask map; each pixel point included in the sub-region belongs to a foreground region;
    according to the unknown region, determining a first region corresponding to the unknown region in the target prediction mask map, and determining a second region corresponding to the unknown region in the marked mask map;
    according to the subareas, determining a third area corresponding to the subareas in the target prediction mask map, and determining a fourth area corresponding to the subareas in the marked mask map;
    Determining a first mask loss of the unknown region according to differences in pixel values between a plurality of pixel points included in the first region and a plurality of pixel points included in the second region;
    determining a second mask loss for the sub-region according to differences in pixel values between a plurality of pixels included in the third region and a plurality of pixels included in the fourth region;
    and determining the target mask loss according to the first mask loss and the second mask loss.
  5. The method of claim 4, wherein determining the first mask loss for the unknown region based on differences in pixel values between a plurality of pixel points included in the first region and a plurality of pixel points included in the second region comprises:
    determining a first confidence coefficient of the pixel points belonging to the foreground region according to the pixel values of the pixel points included in the first region, and determining a second confidence coefficient of the pixel points belonging to the foreground according to the pixel values of the pixel points included in the second region;
    and determining the first mask loss according to the first confidence, the second confidence and the number of pixel points included in the unknown region.
  6. The method of claim 4, wherein the determining the second mask loss for the sub-region based on differences between pixel values of the plurality of pixel points included in the third region and pixel values of the plurality of pixel points included in the fourth region comprises:
    determining a third confidence coefficient of the pixel points belonging to the foreground region according to the pixel values of the pixel points included in the third region, and determining a fourth confidence coefficient of the pixel points belonging to the foreground region according to the pixel values of the pixel points included in the fourth region;
    and determining the second mask loss according to the third confidence, the fourth confidence and the number of pixel points included in the subarea.
  7. The method of claim 4, wherein determining the target mask loss from the first mask loss and the second mask loss comprises:
    and adopting a set first weight value to perform weighted calculation on the first mask loss and the second mask loss to obtain the target mask loss.
  8. The method of any of claims 1-7, wherein determining a target loss of the image segmentation model from the ternary diagram loss and the target mask loss comprises:
    And adopting a set second weight value to perform weighted calculation on the ternary diagram loss and the target mask loss to obtain the target loss of the image segmentation model.
  9. The method of claim 3, wherein the multi-layer convolution layers include a higher layer convolution layer and a lower layer convolution layer, and the inputting the image features output by the multi-layer convolution layer and the prediction ternary diagram into a mask decoder of the image segmentation model to obtain the target prediction mask diagram comprises:
    inputting the image characteristics and the prediction ternary diagram output by the high-level convolution layer into a first decoding layer of the mask decoder to obtain a first prediction mask diagram;
    and inputting the image features output by the first prediction mask map and the lower convolution layer into a second decoding layer of the mask decoder to obtain the target prediction mask map.
  10. The method of claim 9, wherein the lower-level convolution layers are multi-layered, and image features output by the plurality of lower-level convolution layers are divided into a first lower-level structural feature, at least one second lower-level structural feature, and a third lower-level structural feature according to included structure and/or texture information; the second decoding layer comprises a first decoding sublayer, at least one second decoding sublayer and a third decoding sublayer;
    The step of inputting the image features output by the first prediction mask map and the lower convolution layer into a second decoding layer of the mask decoder to obtain the target prediction mask map includes:
    inputting the first prediction mask map and the third low-layer structural characteristics into a first decoding sublayer in the second decoding layer to obtain a prediction mask map output by the first decoding sublayer;
    inputting, for the first second decoding sublayer among the at least one second decoding sublayer, the prediction mask map output by the first decoding sublayer and the second low-layer structural feature corresponding to this second decoding sublayer into this second decoding sublayer;
    inputting a predictive mask image output by a second decoding sublayer of a previous layer and the second low-layer structural features corresponding to the second decoding sublayer into the second decoding sublayer for any second decoding sublayer except the first second decoding sublayer;
    and inputting the prediction mask image output by the last second decoding sublayer and the first low-layer structural features into a third decoding sublayer in the second decoding layer to obtain the target prediction mask image.
  11. An image segmentation method, comprising:
    Acquiring a target to-be-processed image and a target ternary diagram corresponding to the target to-be-processed image;
    inputting the target to-be-processed image and the target ternary diagram into a trained image segmentation model to obtain a target mask diagram; the image segmentation model is trained by adopting the training method according to any one of claims 1-10 to obtain the trained image segmentation model;
    and dividing the target to-be-processed image according to the target mask map to obtain an object in a foreground region in the target to-be-processed image.
  12. An image segmentation model training apparatus, comprising:
    the acquisition module is used for acquiring training samples; the training sample comprises a training image, wherein the training image comprises a marked mask image and a marked ternary image; the training images are obtained by splicing the images to be processed and the ternary images corresponding to the images to be processed; the marked mask image comprises a foreground area and a background area; the annotated ternary diagram comprises a foreground region, a background region and an unknown region between the foreground region and the background region;
    the processing module is used for inputting the training sample into an image segmentation model to obtain a prediction ternary diagram and a target prediction mask diagram corresponding to the image to be processed;
    The determining module is used for determining the target loss of the image segmentation model according to the difference between the prediction ternary diagram and the marked ternary diagram and the difference between the target prediction mask diagram and the marked mask diagram;
    and the training module is used for training the image segmentation model according to the target loss.
  13. A matting apparatus for an image, comprising:
    the acquisition module is used for acquiring a target to-be-processed image and a target ternary diagram corresponding to the target to-be-processed image;
    the processing module is used for inputting the target to-be-processed image and the target ternary diagram into a trained image segmentation model so as to obtain a target mask diagram; the image segmentation model is trained by adopting the training method according to any one of claims 1-10 to obtain the trained image segmentation model;
    and the segmentation module is used for segmenting the target to-be-processed image according to the target mask image to obtain an object in a foreground region in the target to-be-processed image.
  14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any one of claims 1-10 or the method according to claim 11 when the program is executed.
  15. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1-10 or the method according to claim 11.
CN202280004145.1A 2022-05-31 2022-05-31 Training method of image segmentation model, image segmentation method and device Pending CN117529753A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/096495 WO2023230936A1 (en) 2022-05-31 2022-05-31 Image segmentation model training method and apparatus, and image segmentation method and apparatus

Publications (1)

Publication Number Publication Date
CN117529753A true CN117529753A (en) 2024-02-06

Family

ID=89026728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280004145.1A Pending CN117529753A (en) 2022-05-31 2022-05-31 Training method of image segmentation model, image segmentation method and device

Country Status (2)

Country Link
CN (1) CN117529753A (en)
WO (1) WO2023230936A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117765532B (en) * 2024-02-22 2024-05-31 中国科学院宁波材料技术与工程研究所 Cornea Langerhans cell segmentation method and device based on confocal microscopic image

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10304193B1 (en) * 2018-08-17 2019-05-28 12 Sigma Technologies Image segmentation and object detection using fully convolutional neural network
CN112541927A (en) * 2020-12-18 2021-03-23 Oppo广东移动通信有限公司 Method, device, equipment and storage medium for training and matting model
CN112990331A (en) * 2021-03-26 2021-06-18 共达地创新技术(深圳)有限公司 Image processing method, electronic device, and storage medium
CN112766244B (en) * 2021-04-07 2021-06-08 腾讯科技(深圳)有限公司 Target object detection method and device, computer equipment and storage medium
CN114332563A (en) * 2021-12-30 2022-04-12 上海商汤智能科技有限公司 Image processing model training method, related device, equipment and storage medium
CN114445376A (en) * 2022-01-27 2022-05-06 上海商汤智能科技有限公司 Image segmentation method, model training method thereof, related device, equipment and medium

Also Published As

Publication number Publication date
WO2023230936A1 (en) 2023-12-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination