WO2023230927A1

WO2023230927A1 - Image processing method and device, and readable storage medium

Info

Publication number: WO2023230927A1
Application number: PCT/CN2022/096483
Authority: WO
Inventors: 陈凌颖; 张亚森; 苏海军; 倪鹏程
Original assignee: 北京小米移动软件有限公司; 北京小米松果电子有限公司
Priority date: 2022-05-31
Filing date: 2022-05-31
Publication date: 2023-12-07
Also published as: CN117501309A

Abstract

An image processing method and device, and a readable storage medium. The method comprises: acquiring a target image; determining segmentation information for segmenting the target image by means of a target matting model, wherein the target matting model is obtained by alternately training a basic matting network on the basis of a first sample segmentation image and a sample matting image, the basic matting network is obtained by training an original matting network on the basis of a second sample segmentation image, the first sample segmentation image and the second sample segmentation image carry a first segmentation tag, the sample matting image carries a second segmentation tag, and the segmentation granularity of the first segmentation tag is greater than that of the second segmentation tag; and determining a matting target in the target image according to the segmentation information.

Description

Image processing method, device and readable storage medium

Technical field

The present disclosure relates to the field of image processing, and in particular, to an image processing method, device and readable storage medium.

Background technique

Related matting algorithms without additional inputs are less accurate in segmentation information.

In order to improve the accuracy of the segmentation information related to the cutout algorithm based on the third image or the background image, it is not only necessary to input the target image to be cut out, but also to input the third image corresponding to the target image to be cut out. Or background image, and preparing the three-dimensional image or background image corresponding to the target image to be cut out requires a lot of time and energy.

Contents of the invention

In order to overcome problems existing in related technologies, the present disclosure provides an image processing method, device and readable storage medium.

According to a first aspect of an embodiment of the present disclosure, a data processing method is provided, including:

Get the target image;

The segmentation information used to segment the target image is determined through a target matting model. The target matting model is obtained by alternately training a basic matting network based on the first sample segmentation image and the sample matting image. The basic matting network is based on The second sample segmentation image is obtained by training the original matting network, the first sample segmentation image and the second sample segmentation image carry a first segmentation label, the sample matting image carries a second segmentation label, and the first The segmentation granularity of the segmentation label is greater than the segmentation granularity of the second segmentation label;

According to the segmentation information, the cutout target in the target image is determined.

According to a second aspect of an embodiment of the present disclosure, an image processing device is provided, including:

The first acquisition module is configured to acquire the target image;

The segmentation module is configured to determine segmentation information for segmenting the target image through a target matting model. The target matting model is obtained by alternately training a basic matting network based on the first sample segmentation image and the sample matting image. The basic matting network is obtained by training the original matting network based on the second sample segmentation image, the first sample segmentation image and the second sample segmentation image carry a first segmentation label, and the sample matting image carries a second segmentation label, the segmentation granularity of the first segmentation label is greater than the segmentation granularity of the second segmentation label;

The cutout target determination module is configured to determine the cutout target in the target image according to the segmentation information.

According to a third aspect of an embodiment of the present disclosure, another image processing device is provided, including:

processor;

Memory used to store instructions executable by the processor;

Wherein, the processor is configured to execute the steps of the image processing method provided by the first aspect of the embodiment of the present disclosure.

According to a fourth aspect of an embodiment of the present disclosure, there is provided a computer-readable storage medium on which computer program instructions are stored. When the program instructions are executed by a processor, the steps of the image processing method provided by the first aspect of the present disclosure are implemented.

In the technical solution provided by the embodiments of the present disclosure, more accurate segmentation information can be output only by inputting the target image into the target matting model, so that the matting target in the target image can be more accurately determined based on the segmentation information, that is, The target matting model only takes the target image as input, without the need to input additional auxiliary images, such as three-dimensional images or background images, thus saving a lot of time and energy in preparing auxiliary images.

Moreover, in the present disclosure, the original matting network is first trained through the second sample segmentation image to obtain the basic matting network. Since the segmentation granularity of the first segmentation label is smaller than the segmentation granularity of the second segmentation label, the prediction of the basic matting network can be improved. The robustness of the algorithm is improved, and the accuracy of the basic matting network in locating the matting target is improved. Then, the basic matting network is alternately trained through the first sample segmentation image and the sample matting image to obtain the target matting network. In this process, since the segmentation granularity of the second segmentation label is greater than the segmentation granularity of the first segmentation label , so that the sample cutout image can improve the accuracy of its segmentation after training the basic cutout network, and use the supervision information when training the basic cutout network on the first sample segmentation image to assist the sample cutout image on the basic cutout network Training can improve the accuracy of positioning the cutout target while improving the accuracy of segmentation information for the cutout target, and can speed up the training of the model.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.

Description of the drawings

The above and/or additional aspects and advantages of the present disclosure will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:

FIG. 1 is a flowchart of an image processing method according to an exemplary embodiment.

FIG. 2 is a flow chart of an image background blurring method according to an exemplary embodiment.

FIG. 3 is a schematic diagram illustrating the effect of image background blur according to an exemplary embodiment.

Figure 4 is a flow chart of an image background replacement method according to an exemplary embodiment.

Figure 5 is a schematic diagram illustrating the effect of image background replacement according to an exemplary embodiment.

Figure 6 is a schematic structural diagram of an original matting network according to an exemplary embodiment.

FIG. 7 is a schematic flowchart of an image processing method according to an exemplary embodiment.

Figure 8 is a schematic structural diagram of an original matting network according to an exemplary embodiment.

Figure 9 is a schematic structural diagram of a basic matting network according to an exemplary embodiment.

Figure 10 is a flow chart of single-task training according to an exemplary embodiment.

Figure 11 is a flow chart of dual-task training according to an exemplary embodiment.

Figure 12 is a flowchart of a method for obtaining the total loss of fine segmentation according to an exemplary embodiment.

Figure 13 is a flowchart of a method for obtaining semantic segmentation loss according to an exemplary embodiment.

Figure 14 is a flowchart of a method for obtaining semantic segmentation loss according to an exemplary embodiment.

Figure 15 is a flowchart of a method for obtaining target fine segmentation loss according to an exemplary embodiment.

Figure 16 is a flowchart of a method for obtaining target fine segmentation loss according to an exemplary embodiment.

FIG. 17 is a schematic diagram illustrating color migration of a foreground image according to an exemplary embodiment.

Figure 18 is a schematic diagram of a sampling point pair according to an exemplary embodiment.

Figure 19 is a structural block diagram of an image processing device according to an exemplary embodiment.

Figure 20 is a structural block diagram of an image processing device according to an exemplary embodiment.

Detailed ways

Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with aspects of the disclosure as detailed in the appended claims.

It can be understood that "plurality" in this disclosure refers to two or more, and other quantifiers are similar. "And/or" describes the relationship between related objects, indicating that there can be three relationships. For example, A and/or B can mean: A exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the related objects are in an "or" relationship. The singular forms "a", "the" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It is further understood that the terms "first", "second", etc. are used to describe various information, but such information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other and do not imply a specific order or importance. In fact, expressions such as "first" and "second" can be used interchangeably. For example, without departing from the scope of the present disclosure, the first message frame may also be called a second message frame, and similarly, the second message frame may also be called a first message frame.

It will be further understood that although the operations are described in a specific order in the drawings in the embodiments of the present disclosure, this should not be understood as requiring that these operations be performed in the specific order shown or in a serial order, or that it is required that Perform all operations shown to obtain the desired results. In certain circumstances, multitasking and parallel processing may be advantageous.

In addition, all actions to obtain signals, information or data in this application are performed under the premise of complying with the corresponding data protection laws and policies of the country where the location is located, and with authorization from the owner of the corresponding device.

Currently, there are three main types of cutout algorithms based on deep learning: cutout algorithms based on three-dimensional images, cutout algorithms based on background images, and cutout algorithms without additional input. Among them, the cutout algorithm based on the three-point map requires an additional input of a three-point map as a guide for the mask cutout area. Therefore, manual work or additional models are needed to provide labeling or prediction results of the three-part graph. The background image-based cutout algorithm needs to provide a background image input without foreground, thus implicitly providing foreground selection information and improving the accuracy of cutout. For image-cutting algorithms without additional input, there are many misclassifications during image-cutting. For example, during image-cutting, the content belonging to the background is mistakenly classified as the content of the foreground, or the content belonging to the foreground is mistakenly classified as the content of the background. content, resulting in poor cutout accuracy.

In order to solve the above problems, embodiments of the present disclosure provide an image processing method, device and readable storage medium. The following first introduces the application environment of the embodiment of the present disclosure.

An image processing method provided by an embodiment of the present disclosure can be applied to terminal devices, such as mobile phones or cameras and other terminal devices with a shooting function, or terminal devices with image processing functions. For a terminal device with a shooting function, the user can obtain the target image and background image by shooting, or from other devices. For a terminal device with an image processing function, the user can obtain the target image and background image from other devices. Background image. The user selects the target image and clicks it, and the output interface of the terminal device can display functional controls such as background blur or background replacement. The user clicks the functional control corresponding to the background blur, which can trigger the terminal device to execute the image provided by the embodiment of the present disclosure. The processing method obtains the cutout target, and then the terminal device can continue to perform the background blur algorithm on the target image based on the obtained cutout target to obtain a background blurred image. Or, when the user clicks on the function control corresponding to the background replacement, the mobile phone is triggered to display the option of selecting a background image. After the user completes the operation of selecting the background image, the image processing method in the present disclosure is triggered to be executed, and the cutout target is obtained, and according to the obtained The cutout target is combined with the background image, and the background replacement algorithm is continued to process, thereby obtaining the image after background replacement.

Figure 1 is a flow chart of an image processing method according to an exemplary embodiment. As shown in Figure 1, the image processing method includes:

S101. Obtain the target image.

The target image is an image to be processed, that is, the image that needs to be cut out in this disclosure. The image can be an RGB image, that is, an optical three primary color image, where R represents Red (red) and G represents Green (green). B represents Blue, and the target image includes a cutout target. Among them, the cutout target can be a portrait, an animal image, or an image of any other object.

S102. Determine the segmentation information for segmenting the target image through a target matting model. The target matting model is obtained by alternately training a basic matting network based on the first sample segmentation image and the sample matting image. The basic matting model is The original cutout network is trained by the network based on the second sample segmentation image. The first sample segmentation image and the second sample segmentation image carry a first segmentation label, and the sample cutout image carries a second segmentation label. The segmentation granularity of the first segmentation label is greater than the segmentation granularity of the second segmentation label.

In this embodiment, the segmentation information can be obtained by directly inputting the target image into the target matting model. In order to improve the accuracy of the segmentation information, the target images can be two images with the same content but different resolutions. And the target matting model can be a matting model for a specific type of matting target to improve the accuracy of segmentation information. For example, the target matting model can be a matting model for portraits or a matting model for cats, etc. The target cutouts are diverse and are not specifically limited here.

Wherein, the first sample segmentation image and the second sample segmentation image may be the same or different, the first segmentation label and the second segmentation label are labels with different segmentation granularities, and the segmentation granularity of the first segmentation label is larger than the second segmentation label. The segmentation granularity of the segmentation label. Among them, the size of the segmentation granularity is inversely related to the size of the fineness of segmenting the image. For example, the first segmentation label may be a two-category label, and the value of the cutout target may be marked as 1, and the value of the area outside the cutout target in the target image may be marked as 0. The second segmentation label can be a multi-category label (more refined than the two-category label). For example, the value inside the cutout target can be marked as 1, a transition area can be set at the edge of the cutout target, and the value in the transition area can be marked from There is a gradual transition from 1 to 0, and the label values can be 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, etc. For example, the closer to the interior of the cutout target, the closer its label value is to 1, and the label value of the part of the target image except the interior area and transition area of the cutout target is 0.

The original matting network is first trained through the second sample segmentation image to obtain the basic matting network. Since the segmentation granularity of the first segmentation label is greater than the segmentation granularity of the second segmentation label, the robustness of the prediction of the basic matting network can be improved, and Improve the accuracy of the basic cutout network in locating cutout targets. Then, the basic matting network is alternately trained through the first sample segmentation image and the sample matting image to obtain the target matting network. In this process, since the segmentation granularity of the second segmentation label is smaller than the segmentation granularity of the first segmentation label , so that the sample cutout image can improve the accuracy of its segmentation after training the basic cutout network, and use the supervision information when training the basic cutout network on the first sample segmentation image to assist the sample cutout image on the basic cutout network Training can improve the accuracy of positioning the cutout target while improving the accuracy of segmentation information for the cutout target, and can speed up the training of the model.

S103. Determine the cutout target in the target image according to the segmentation information.

After obtaining the segmentation information, the cutout target in the target image can be determined based on the predicted value in the segmentation information. For example, in the segmentation information, for the cutout target, the predicted value is a non-0 value, that is, 1 or 0 to Values between 1, excluding 0, for areas outside the cutout target, the predicted value is 0, so that the area represented by a predicted value other than 0 can be determined as the cutout target.

Using the above method, the original matting network is first trained through the second sample segmentation image to obtain the basic matting network. Since the segmentation granularity of the first segmentation label is greater than the segmentation granularity of the second segmentation label, the prediction accuracy of the basic matting network can be improved. It improves the accuracy of the basic cutout network in positioning the cutout target, making the interior of the resulting cutout target more complete and free of holes. Then, the basic matting network is alternately trained through the first sample segmentation image and the sample matting image to obtain the target matting network. In this process, since the segmentation granularity of the second segmentation label is smaller than the segmentation granularity of the first segmentation label , so that the sample cutout image can improve the accuracy of its segmentation after training the basic cutout network, and use the supervision information when training the basic cutout network on the first sample segmentation image to assist the sample cutout image on the basic cutout network Training can improve the accuracy of positioning the cutout target while improving the accuracy of segmentation information for the cutout target, and can speed up the training of the model. As a result, more accurate segmentation information can be output without inputting additional auxiliary images without using only the target image as input, so as to obtain more accurate matting targets.

Figure 2 is a flow chart of an image background blurring method according to an exemplary embodiment. As shown in Figure 2, the method includes:

S201. Obtain the target image.

S202. Determine the segmentation information used to segment the target image through a target matting model. The target matting model is obtained by alternately training a basic matting network based on the first sample segmentation image and the sample matting image. The basic matting is The original cutout network is trained by the network based on the second sample segmentation image. The first sample segmentation image and the second sample segmentation image carry a first segmentation label, and the sample cutout image carries a second segmentation label. The segmentation granularity of the first segmentation label is greater than the segmentation granularity of the second segmentation label.

S203. Determine the cutout target in the target image according to the segmentation information.

For explanations of steps S201 to S203, reference may be made to the explanations of steps S101 to S103 above, which will not be described again here.

S204. According to the cutout target, perform blurring processing on the portion of the target image other than the cutout target to obtain a background blurred image.

After determining the cutout target in the target image, the area outside the cutout target in the target image can be blurred. The blur processing method can be box filtering, normalized box filtering, Gaussian filtering, etc.

The above method can be used to blur the background in camera shooting scenes, using portraits as the cutout target, so that the remaining areas in the resulting image except for the portrait are blurred, making the portrait clearer and more prominent, and avoiding interference from the background on the portrait. .

Figure 3 is a schematic diagram of an image background blurring effect according to an exemplary embodiment. As shown in Figure 3, the terminal device can be a mobile phone. For example, the user uses the mobile phone camera to take a photo to obtain the photo, that is, the target image, and then , the user can click on the photo, thereby triggering the mobile phone to display functional controls such as background blur. The user clicks on the functional control corresponding to the background blur, triggering the mobile phone to execute the image processing method in the present disclosure, obtain the cutout target, and based on the obtained cutout The target continues to perform blurring algorithm processing to obtain an image with a blurred background.

Figure 4 is a flow chart of an image background replacement method according to an exemplary embodiment. As shown in Figure 4, the method includes:

S401. Obtain the target image.

S402. Determine the segmentation information for segmenting the target image through a target matting model. The target matting model is obtained by alternately training a basic matting network based on the first sample segmentation image and the sample matting image. The basic matting is The original cutout network is trained by the network based on the second sample segmentation image. The first sample segmentation image and the second sample segmentation image carry a first segmentation label, and the sample cutout image carries a second segmentation label. The segmentation granularity of the first segmentation label is greater than the segmentation granularity of the second segmentation label.

S403. Determine the cutout target in the target image according to the segmentation information.

For explanations of steps S401 to S403, reference may be made to the explanations of steps S101 to S103 above, which will not be described again here.

S404. Obtain the target background image.

The target background image is an image that requires background replacement of the target image, and the target background image can be any image.

S405. According to the cutout target, perform cutout processing on the target image to obtain a cutout target image.

Among them, after the cutout target in the target image is determined, the cutout target in the target image can be cutout. For example, the cutout target in the target image is separated and saved as a new image, that is, the cutout target. image, or directly transparentize the area outside the cutout target in the target image to obtain the cutout target image.

S406: Perform synthesis processing on the cutout target image and the target background image to obtain a background replacement image.

After obtaining the cutout target image, the cutout target image can be synthesized onto the target background image, wherein the area on the target background image that coincides with the cutout target is covered by the cutout target, thereby obtaining a background replacement image.

The above method can be used for image background replacement by obtaining the segmentation information of the input target image, picking out the cutout target, using the new target background image, and obtaining the background replacement image according to the formula I=αF+(1-α)B, where α is the segmentation information output by the model, F is the target image, and B is the target background image. As a result, the background in the existing target image can be replaced with a more beautiful background, so that the obtained background replacement image is more beautiful when it contains the cutout target.

Figure 5 is a schematic diagram of the effect of image background replacement according to an exemplary embodiment. As shown in Figure 5, the terminal device can be a mobile phone. For example, the user uses the mobile phone camera to take a photo to obtain the photo, that is, the target image. Then, The user can click on the photo, thereby triggering the mobile phone to display functional controls such as background replacement. The user clicks on the functional control corresponding to the background replacement, triggering the mobile phone to display the selection background image option. After the user completes the operation of selecting the background image, the execution of the image in this disclosure is triggered. The processing method is to obtain the cutout target, and based on the obtained cutout target, combined with the background image, continue to perform the background replacement algorithm processing, thereby obtaining the image after background replacement.

Figure 6 is a schematic structural diagram of an original matting network according to an exemplary embodiment. As shown in Figure 6, an exemplary embodiment of the present disclosure also provides an original matting network, which is used for The target matting model in the above image processing method is trained. The original matting network includes a feature extraction module, a dilated convolution pooling module, an upsampling module and a multiple upsampling module.

Among them, the overall network structure of the original cutout network belongs to the Encoder-Decoder model, and the Encoder part is a feature extraction module. Among them, the feature extraction module is used to extract the features of the target image. In order to obtain more contextual information, an ASPP (Atrous Spatial Pyramid Pooling, the hole spatial convolution pooling pyramid) module, that is, the hole convolution pooling module, is then gradually upsampled by a multi-level Decoder part (upsampling module), and finally uses a multiple upsampling convolution module (Multiple upsampling module) obtains high-resolution features and finally outputs high-resolution prediction results, which is a more accurate segmentation information.

Compared with the related video-specific matting models, the above-mentioned original matting network uses a multiple upsampling module to replace the depth-oriented filtering module, which solves the problem in related technologies that the video-oriented matting models are difficult to apply to mobile terminals. It removes the additional modules suitable for video, and the feature extraction module uses a network structure that is more lightweight and suitable for quantitative models, thus solving the problems of long time consumption and high power consumption on the mobile terminal, and is more suitable for mobile terminal deployment.

Figure 7 is a schematic flowchart of an image processing method according to an exemplary embodiment. As shown in Figure 7, based on the target matting model trained by the original matting network, the image processing method includes:

S701. Obtain the target image.

S702: Extract features from the target image through the feature extraction module to obtain original feature vectors.

S703. Use the atrous convolution pooling module to perform context extraction on the original feature vector to obtain a context feature vector.

S704. Use an upsampling module to upsample the context feature vector and the original feature vector to obtain a fine segmentation result.

S705: Use a multiple upsampling module to upsample the fine segmentation result and the target image to obtain the segmentation information.

S706: Determine the cutout target in the target image according to the segmentation information.

First, the feature extraction module can use a more lightweight network structure suitable for quantitative models. The feature extraction module includes multiple convolutional layers, such as 5 convolutional layers, and the ratio of the last convolutional layer is modified to make the features The final output feature size of the extraction module is 1/16 of the input, which can maintain the size of the feature map while keeping the receptive field consistent with the original network. The feature extraction module sequentially extracts feature sizes of 1/2, 1/4, 1/8, 1/16 and 1/16 of the input. The original feature vector can be obtained by extracting features from the target image through the feature extraction module. In order to obtain more contextual information, an ASPP (Atrous Spatial Pyramid Pooling, Atrous Spatial Convolution Pooling Pyramid) module is added, that is, atrous convolution The pooling module performs context extraction on the original feature vector through the dilated convolution pooling module to obtain the context feature vector. To speed up, depthwise separable convolutions can be used instead of ordinary convolutions in the ASPP module. The Decoder module (i.e., the upsampling module) consists of sequentially increasing upsampling convolution modules. By upsampling the context feature vector and the original feature vector, the fine segmentation results can be obtained. Specifically, the upsampling convolution module The number of modules is the same as the number of convolutional layers in the feature extraction module, and the input of each upsampling convolution module is output by the previous module (i.e., the contextual feature vector output by the atrous convolution pooling module or the fine output of the previous module). Segmentation result) and the corresponding module output of the feature extraction module (i.e., the original feature vector output by the corresponding convolutional layer in the feature extraction module).

Among them, the target image includes two sub-images of different sizes, for example, one sub-image is an image of size 1024, and the other sub-image is an image of size 512. The smaller sub-image is input into the feature extraction module, and finally the fine segmentation result is output. , where the 1024 size means that the width and height of the image are both 1024 pixels, and the 512 size means that the width and height of the image are both 512 pixels.

In order to obtain higher-precision segmentation information, a larger sub-image in the target image and the preliminary prediction result (ie, the fine segmentation result) are input into a multiple upsampling module, specifically a 2x upsampling convolution module, and we get final segmentation information. The multiple upsampling module consists of convolutional layers and upsampling, making it more convenient for mobile terminal deployment.

Figure 8 is a schematic structural diagram of an original matting network according to an exemplary embodiment. As shown in Figure 8, for the training phase, the original matting network includes an upsampling module composed of multiple upsampling convolution modules. , Each upsampling convolution module is connected to a Segmentation prediction head (coarse segmentation prediction head), and each coarse segmentation prediction head is used to output the coarse segmentation result after upsampling by the corresponding upsampling convolution module.

By connecting a coarse segmentation prediction head to each upsampling convolution module, after each round of training, each upsampling module can output a corresponding coarse segmentation result, so that multiple coarse segmentation results can be obtained, so that A more accurate semantic segmentation loss can be calculated to better optimize model parameters, accelerate model convergence, and improve model training speed.

Figure 9 is a schematic structural diagram of a basic matting network according to an exemplary embodiment. As shown in Figure 9, for the training phase, the basic matting network includes an upsampling module composed of multiple upsampling convolution modules. , each upsampling convolution module is connected to a Segmentation prediction head (coarse segmentation prediction head) and an Alpha prediction head (fine segmentation prediction head). Each coarse segmentation prediction head is used to output the corresponding upsampling convolution module upsampling. After obtaining the coarse segmentation result, each fine segmentation prediction head is used to output the upsampled fine segmentation result of the corresponding upsampling convolution module.

By connecting a coarse segmentation prediction head and a fine segmentation prediction head to each upsampling convolution module, each upsampling module can output a corresponding coarse segmentation result and fine segmentation result after each round of training, thus enabling Multiple coarse segmentation results and multiple fine segmentation results are obtained so that more accurate semantic segmentation loss and total fine segmentation loss can be calculated, thereby better optimizing model parameters, accelerating model convergence, and improving model training speed.

An exemplary embodiment of the present disclosure also provides a training method for a target matting model. The trained target matting model is used to implement the image processing method in any of the above embodiments. The training method for the target matting model may include There are two parts: sample data acquisition and model training.

The sample data acquisition is used to obtain the first sample segmentation image, the second sample segmentation image and the sample matting image. For example, the sample data includes a semantic segmentation data set and a Matting (matting) data set. Among them, the first sample segmentation image and the second sample segmentation image can be images in the semantic segmentation data set, and the sample matting image is an image in the Matting data set. Both the semantic segmentation data set and the Matting data set include self-collected data and public data. set. Specifically, the self-collected semantic segmentation data is about 7W, plus the public data set Dark Complexion Portrait Segmentation Dataset (Dark Portrait Segmentation Dataset). The self-collected Matting data set contains about 3,700 high-precision annotations, plus public data sets.

In order to obtain more diverse sample data and train a target matting model with more accurate segmentation information, data preprocessing, that is, data amplification, can also be performed on the collected sample data. Specifically, the semantic segmentation data input size For 512, data preprocessing corresponding to semantic segmentation data includes random scaling, horizontal flipping, rotation, and color dithering. The Mating data input size is 1024, which is downsampled to 512 before inputting the original matting network. Matting data preprocessing includes affine transformation, rotation, flipping, color dithering and random noise or sharpening.

For Mating data (i.e., sample cutout images), because its annotation is more detailed and requires more effort, the amount of Mating data is smaller. In order to expand the amount of Mating data, the existing Mating data can also be background replaced. , get new Mating data. By obtaining a new background image, the original sample cutout image can be background replaced, and the foreground image in the mating data is combined with the new background image to obtain a new sample cutout image. In the process of background replacement, since the color space distribution difference of the foreground image in the background image and the original sample cutout image is usually large, the foreground image can be color migrated to make the fusion of the foreground image and the background image more natural and realistic.

The first sample segmented image and the second sample segmented image may be the same or different. The first sample segmentation image and the second sample segmentation image carry the first segmentation label, the sample cutout image carries the second segmentation label, the first segmentation label and the second segmentation label are labels with different segmentation granularities, and the first segmentation label The segmentation granularity is greater than the segmentation granularity of the second segmentation label. For example, the first segmentation label can be a two-category label, the value of the cutout target can be marked as 1, and the value of the area outside the cutout target in the target image can be marked as 0. The second segmentation label can be a multi-category label. The value inside the cutout target can be marked as 1, a transition area is set at the edge of the cutout target, and the marked value in the transition area gradually transitions from 1 to 0. The marked values can be 0.9, 0.8, 0.7, 0.6, 0.5, 0.4 , 0.3, 0.2 and 0.1, etc., for example, the closer to the interior of the cutout target, the closer its label value is to 1, and the label value of the part of the target image except the interior area and transition area of the cutout target is 0.

Based on the first sample segmentation image, the second sample segmentation image and the sample cutout image obtained above, model training can be performed.

For model training, it is divided into two stages: single-task training and dual-task training. Single-task training is semantic segmentation training, specifically training the original matting network to obtain a basic matting network, and dual-task training is semantic segmentation training. Alternate with matting training, specifically training the basic matting network to obtain the target matting model.

Figure 10 is a flow chart of single-task training according to an exemplary embodiment. As shown in Figure 10, it includes:

S1001. Perform multiple rounds of iterative segmentation training on the original matting network based on multiple second sample segmented images.

S1002. After each round of iterative training, obtain multiple rough segmentation results output by this round of segmentation training.

S1003. Obtain the semantic segmentation loss corresponding to this round of segmentation training based on the multiple rough segmentation results output by this round of segmentation training and the first segmentation label carried by the second sample segmentation image in this round of segmentation training.

S1004. Optimize the original matting network according to the semantic segmentation loss corresponding to the current round of segmentation training.

S1005. When the original matting network converges, stop training and obtain the basic matting network.

Among them, multiple second sample segmented images can be divided into training sets, test sets and verification sets, and multiple rounds of segmentation training are performed on the original matting network. The specific structure of the original matting network can be referred to the above explanation, and will not be repeated here. . For the training phase, in the upsampling module of the original matting network, each upsampling convolution module included in the upsampling module is connected to a coarse segmentation prediction head, and each coarse segmentation prediction head is used to output the corresponding upsampling convolution module. After upsampling, after each round of iterative training, multiple rough segmentation results output from this round of segmentation training can be obtained. Therefore, based on the multiple rough segmentation results obtained and the first segmentation label carried by the second sample segmentation image used in this round of segmentation training, the semantic segmentation loss corresponding to this round of segmentation training is calculated. Based on the obtained semantic segmentation loss, calculate Adjust the parameters to optimize the original matting network based on the adjusted parameters. Through multiple rounds of training and optimization, until the original matting network converges. That is, if the calculated semantic segmentation loss no longer changes or the change rate is less than the preset threshold, the training can be stopped. The preset threshold of the change rate here can be 0.1, or the number of iterations of single-task training can also be set. When the number of iterations is reached, the training can be stopped and the basic matting network is obtained. In single-task training, the initial learning rate can be 0.0001, and the optimizer can use the RMSprop optimizer to reduce the learning rate with a momentum of 0.9 every 8 iterations.

Through single-task training, the robustness of the image segmentation of the obtained basic matting network can be improved, the matting target can be accurately located, and holes inside the matting target can be avoided.

Figure 11 is a flow chart of dual-task training according to an exemplary embodiment. As shown in Figure 11, it includes:

S1101. Perform multiple rounds of alternate iterations of segmentation training and matting training on the basic matting network based on a plurality of the first sample segmentation images and a plurality of the sample matting images.

S1102. After each round of segmentation training, obtain the semantic segmentation loss corresponding to this round of segmentation training.

S1103. Optimize the basic matting network according to the semantic segmentation loss corresponding to the current round of segmentation training.

S1104. After each round of matting training, according to the multiple coarse segmentation results, multiple fine segmentation results, segmentation information output by this round of matting training, and the second key carried by the sample matting images in this round of matting training. Segment the labels to get the total fine segmentation loss corresponding to this round of matting training.

S1105. According to the total fine segmentation loss corresponding to the current round of matting training, optimize the optimized basic matting network again.

S1106. When the basic matting network converges, stop training and obtain the target matting model.

In dual-task training, the basic matting network obtained from single-task training continues to be trained. In one example, the basic matting network is subjected to multiple rounds of alternating iterative segmentation training and matting training, that is, in each round of training In , a segmentation training is performed first, and then a cutout training is performed. For segmentation training, please refer to the above content and will not be repeated here.

After optimizing the basic matting network based on segmentation training, matting training can continue on the optimized basic matting network. The basic matting network includes an upsampling module composed of multiple upsampling convolution modules. Each upsampling convolution module is connected to a coarse segmentation prediction head and a fine segmentation prediction head. Each coarse segmentation prediction head is used to output the corresponding The coarse segmentation result after upsampling by the upsampling convolution module, and each fine segmentation prediction head is used to output the fine segmentation result after upsampling by the corresponding upsampling convolution module. Therefore, during each cutout training, multiple coarse segmentation results, multiple fine segmentation results will be output through the coarse segmentation prediction head and the fine segmentation prediction head, and the final segmentation information will be output. Among them, a coarse segmentation prediction head connected to each upsampling convolution module outputs a coarse segmentation result, and a fine segmentation prediction head connected to each upsampling convolution module outputs a fine segmentation result.

Then, the current round of matting can be obtained based on the multiple coarse segmentation results, multiple fine segmentation results, segmentation information output by this round of matting training, and the second segmentation labels carried by the sample matting images in this round of matting training. The total loss of fine segmentation corresponding to training. Then, based on the total fine segmentation loss corresponding to this round of matting training, the optimized basic matting network is optimized again until the basic matting network converges. That is, if the calculated semantic segmentation loss and fine segmentation total loss no longer change or the change rate is less than the preset threshold, the training can be stopped. The preset threshold of the change rate here can be 0.1, or dual tasks can also be set. The number of alternating iterations of training. When the number of alternating iterations is reached, the training can be stopped and the target matting model can be obtained. In dual-task training, the initial learning rate can be 0.00001, and the optimizer can use the RMSprop optimizer to reduce the learning rate with a momentum of 0.9 every 4 iterations until the learning rate reaches 0.000001, and the learning rate no longer changes.

Using the supervision information from the first sample segmentation image to train the basic matting network to assist the sample matting image in training the basic matting network can improve the accuracy of positioning the matting target while improving the segmentation of the matting target. The accuracy of the information and the speed of model training can be accelerated. As a result, more accurate segmentation information can be output without inputting additional auxiliary images without using only the target image as input, so as to obtain more accurate matting targets.

Figure 12 is a flow chart of a method for obtaining the total loss of fine segmentation according to an exemplary embodiment. As shown in Figure 12, the method includes:

S1201. Binarize the second segmentation label carried by the sample matting image in the current round of matting training according to the preset segmentation value to obtain the binary value corresponding to the sample matting image in this round of matting training. Split tags.

In the cutout training, sample cutout images are used, and the sample cutout images carry the second segmentation label with a more accurate label, that is, the label value is not only 1 and 0, but also includes values between 1 and 0. , and the coarse segmentation result contains a value of 0 or 1. At this time, in order to calculate the semantic segmentation loss, the second segmentation label can be binarized according to the preset segmentation value. For example, the preset segmentation value can be 0.1, That is, all values greater than or equal to 0.1 in the second segmentation label are replaced with 1, and all values less than 0.1 are replaced with 0, thereby obtaining a binary segmentation label containing only 1 and 0.

S1202. Obtain the semantic segmentation loss corresponding to this round of matting training based on the multiple rough segmentation results output by this round of matting training and the binary segmentation labels.

After obtaining the binary segmentation labels, the semantic segmentation loss corresponding to this round of matting training can be calculated based on the multiple rough segmentation results and the binary segmentation labels output by this round of matting training.

S1203. Obtain the target fine segmentation loss corresponding to this round of matting training based on the multiple fine segmentation results, segmentation information and the second segmentation label output by this round of matting training.

The fine segmentation results and segmentation information contain values between 0 and 1, and include 0 and 1. Therefore, the second segmentation label can be directly used to calculate the target fine segmentation loss.

S1204. According to the semantic segmentation loss corresponding to the current round of matting training and the target fine segmentation loss corresponding to the current round of matting training, obtain the total fine segmentation loss corresponding to the current round of matting training.

The total fine segmentation loss corresponding to this round of matting training can be obtained by weighting the sum of the semantic segmentation loss corresponding to this round of matting training and the target fine segmentation loss corresponding to this round of matting training.

In this embodiment, the matting training outputs not only multiple coarse segmentation results, but also multiple fine segmentation results and one segmentation information, so that the calculated total fine segmentation loss includes semantic segmentation loss and target fine segmentation loss, so that Adjust the parameters of each upsampling convolution module more precisely to speed up model training.

Figure 13 is a flow chart of a method for obtaining semantic segmentation loss according to an exemplary embodiment. The method is used to calculate the segmentation loss corresponding to segmentation training. As shown in Figure 13, it includes:

S1301. Obtain the first semantic segmentation sub-loss based on the distance between multiple coarse segmentation results and the first segmentation label corresponding to this round of segmentation training.

For multiple coarse segmentation results, calculate the distance between each coarse segmentation result output in this round and the first segmentation label corresponding to this round of segmentation training to obtain multiple single first semantic segmentation sub-losses, where, This round of segmentation training may be the training of the original matting network using the second sample segmentation image. At this time, the first segmentation label corresponding to this round of segmentation training may be the first segmentation corresponding to the second sample segmentation image used in this round. label; among them, this round of segmentation training can also be the training of the basic matting network using the first sample segmentation image. At this time, the first segmentation label corresponding to this round of segmentation training can be the first sample used in this round. The first segmentation label corresponding to the segmented image, that is, a rough segmentation result and the first segmentation label corresponding to this round of segmentation training, can be calculated to obtain a single first semantic segmentation sub-loss, which can be calculated based on multiple single first semantic segmentations The sub-loss is calculated to obtain the final first semantic segmentation sub-loss. Specifically, multiple single first semantic segmentation sub-losses can be weighted and summed to obtain the final first semantic segmentation sub-loss.

Among them, the calculation formula of the single first semantic segmentation sub-loss can be:

Among them, m _g is the real label value corresponding to the first segmentation label, m _p is the value corresponding to the output rough segmentation result, β is a parameter used to control the weight of the sample's impact on the loss, and γ is used to adjust the sample that is difficult to classify. The parameter of attention, L _f is the single first semantic segmentation sub-loss.

S1302. Obtain multiple pairs of sampling points from the plurality of rough segmentation results, and obtain the second semantic segmentation sub-loss according to the distance between the multiple pairs of sampling points and the first segmentation labels corresponding to the current round of segmentation training. .

Among them, the sampling point pair is the predicted value corresponding to the two points obtained from the coarse segmentation result, and multiple pairs of sampling point pairs are obtained from multiple coarse segmentation results. Specifically, the pair can be: collecting two points from the output coarse segmentation result. The predicted value of the point at the edge of the cutout target is used as a sampling point pair, or the predicted value of two points inside the cutout target is collected as a sampling point pair in the output coarse segmentation result, or the predicted value is collected outside the cutout target. The predicted values of the two points are used as a pair of sampling points, so that the prediction accuracy of the edge of the matting target can be optimized in a targeted manner. Among them, a single second semantic segmentation sub-loss can be calculated for each pair of sampling points and the first segmentation label corresponding to this round of segmentation training. The final second semantic segmentation sub-loss can be calculated based on multiple single second semantic segmentation sub-losses. Second semantic segmentation sub-loss, specifically, multiple single second semantic segmentation sub-losses can be weighted and summed to obtain the final second semantic segmentation sub-loss.

Among them, the calculation formula of a single second semantic segmentation sub-loss can be:

in

and

represents the predicted value corresponding to the two sampling points, and s represents the preset distance threshold. In order to ensure calculation accuracy, when the distance between the value of the sampling point pair and the real label value corresponding to the first segmentation label is greater than the preset threshold, Then the sampling point pair is not added to the calculation of the single second semantic segmentation sub-loss. L _r is the single second semantic segmentation sub-loss, and || || ₁ is the absolute value.

S1303. Obtain the semantic segmentation loss corresponding to this round of segmentation training based on the first semantic segmentation sub-loss and the second semantic segmentation sub-loss.

Among them, after obtaining the first semantic segmentation sub-loss and the second semantic segmentation sub-loss, the first semantic segmentation sub-loss can be directly determined as the semantic segmentation loss corresponding to this round of segmentation training, or the second semantic segmentation sub-loss can be directly determined as To determine the semantic segmentation loss corresponding to this round of segmentation training, the first semantic segmentation sub-loss and the second semantic segmentation sub-loss can also be weighted and summed to obtain the semantic segmentation loss corresponding to this round of segmentation training.

Figure 14 is a flow chart of a method for obtaining semantic segmentation loss according to an exemplary embodiment. The method is used to calculate the semantic segmentation loss in matting training. As shown in Figure 14, it includes:

S1401. Obtain the third semantic segmentation sub-loss according to the distance between the multiple rough segmentation results output by this round of matting training and the binary segmentation labels.

S1402. Obtain multiple pairs of sampling points from multiple rough segmentation results output by this round of matting training, and obtain the fourth semantic segmentation based on the distances between the multiple pairs of sampling points and the binary segmentation labels. child loss.

S1403. Obtain the semantic segmentation loss corresponding to this round of matting training based on the third semantic segmentation sub-loss and the fourth semantic segmentation sub-loss.

For the semantic segmentation loss in cutout training, since the sample cutout image is used, and the sample cutout image carries the second segmentation label with a more accurate label, it is necessary to use the binary value after the second segmentation label is binarized. Use binary segmentation labels to calculate semantic segmentation loss. For specific calculation methods, please refer to the calculation method of semantic segmentation loss in segmentation training. Just replace the first segmentation label with a binary segmentation label, which will not be described again here.

Figure 15 is a flow chart of a method for obtaining target fine segmentation loss according to an exemplary embodiment. As shown in Figure 15, the method includes:

S1501. Calculate the average absolute error between the segmentation information output by this round of matting training and the multiple fine segmentation results and the second segmentation label.

S1502. Determine the average absolute error as the first fine segmentation sub-loss corresponding to this round of matting training.

In this method, the above-mentioned target fine segmentation loss may include a first fine segmentation sub-loss, in which one round of training only outputs one segmentation information, and may also output multiple fine segmentation results. This segmentation information and the second segmentation label can be calculated to obtain a Average absolute sub error. A fine segmentation result can be calculated with the second segmentation label to obtain an average absolute sub error. Multiple fine segmentation results can correspond to multiple average absolute sub errors. Then, the average absolute sub-error obtained from the segmentation information and the multiple average absolute sub-errors obtained corresponding to multiple fine segmentation results are weighted and summed to obtain the average absolute error, and then the average absolute error is determined as the current round of matting Train the corresponding first fine segmentation sub-loss.

Among them, the calculation formula of the average absolute sub-error can be:

Among them, α _p is the segmentation information or fine segmentation result, α _g is the corresponding second segmentation label,

is the average absolute sub-error, || || ₁ is the absolute value.

By weighting the sum of the semantic segmentation loss corresponding to this round of matting training and the target fine segmentation loss corresponding to this round of matting training, the total fine segmentation loss corresponding to this round of matting training can be obtained.

The above target fine segmentation loss also includes at least one of the following: a second fine segmentation sub-loss, a third fine segmentation sub-loss, and a fourth fine segmentation sub-loss. The second fine segmentation sub-loss can be obtained by calculating the multi-scale Laplacian loss between the segmentation information output by this round of matting training and multiple fine segmentation results respectively and the second segmentation label; the third The fine segmentation sub-loss can be obtained by calculating the segmentation information output by the current round of matting training and the gradient loss between multiple fine segmentation results and the second segmentation label respectively; the fourth fine segmentation sub-loss can be obtained by calculating multiple predictions. The multiple predicted synthetic images are obtained by the synthesis loss between the synthetic image and the label synthetic image. The multiple predicted synthetic images are synthesized with the background image based on the multiple fine segmentation results and segmentation information obtained from the current round of matting training. The image, the label composite image is a composite image of the cutout target obtained according to the second segmentation label and the background image. When the target fine segmentation loss includes multiple fine segmentation sub-losses, the multiple fine segmentation sub-losses can be weighted and summed to obtain the target fine segmentation loss corresponding to this round of matting training.

Figure 16 is a flow chart of a method for obtaining target fine segmentation loss according to an exemplary embodiment. As shown in Figure 16, the method includes:

S1601. Calculate the average absolute error between the segmentation information output by this round of matting training and the multiple fine segmentation results and the second segmentation label.

S1602. Determine the average absolute error as the first fine segmentation sub-loss corresponding to this round of matting training.

S1603. Calculate the multi-scale Laplacian loss between the segmentation information output by this round of matting training and the multiple fine segmentation results respectively and the second segmentation labels, and obtain the second fine segmentation corresponding to this round of matting training. child loss.

S1604. Calculate the gradient loss between the segmentation information output by this round of matting training and the multiple fine segmentation results and the second segmentation labels respectively, and obtain the third fine segmentation sub-loss corresponding to this round of matting training.

S1605. Calculate the synthesis loss between multiple predicted synthetic images and the label synthetic image respectively, and obtain the fourth fine segmentation sub-loss corresponding to this round of matting training. The multiple predicted synthetic images are output according to this round of matting training. An image composed of multiple cutout targets obtained from multiple fine segmentation results and segmentation information and the background image respectively. The label composite image is an image synthesized from the cutout targets obtained according to the second segmentation label and the background image.

S1606. Perform a weighted sum of the first fine segmentation sub-loss, the second fine segmentation sub-loss, the third fine segmentation sub-loss and the fourth fine segmentation sub-loss to obtain the target fine segmentation loss corresponding to this round of matting training.

Among them, one segmentation information and the second segmentation label can be calculated to obtain a multi-scale Laplacian loss, and a fine segmentation result can be calculated with the second segmentation label to obtain a multi-scale Laplacian loss. Multiple fine segmentation results That is, multiple multi-scale Laplacian losses can be obtained correspondingly. Then, the multi-scale Laplacian loss obtained from the segmentation information and the multi-scale Laplacian loss obtained corresponding to multiple fine segmentation results are weighted. and, the second fine segmentation loss can be obtained.

Among them, the calculation formula of multi-scale Laplacian loss can be:

Among them, α _p is the segmentation information or fine segmentation result, α _g is the corresponding second segmentation label, f _s (x) represents the Laplacian pyramid calculation, the segmentation information is sequentially downsampled, and similarity is calculated at different scales. property, x in f _s (x) can be α _p or α _g , and L _lap is the multi-scale Laplacian loss.

Among them, one segmentation information and the second segmentation label can be calculated to obtain a gradient loss, a fine segmentation result can be calculated with the second segmentation label to obtain a gradient loss, and multiple fine segmentation results can correspond to multiple gradient losses. Then, The third fine segmentation sub-loss can be obtained by weighting the sum of the gradient loss obtained from the segmentation information and the gradient losses corresponding to multiple fine segmentation results.

Among them, the calculation formula of gradient loss can be:

L _g =||G(α _p )-G(α _g )|| ₁

Among them, G(x) represents the sobel operator (Sobel operator), α _p is the segmentation information or fine segmentation result, α _g is the corresponding second segmentation label, and x in G(x) can be α _p or α _g , L _g is the gradient loss.

Among them, the cutout target and the background image obtained from a fine segmentation result can be synthesized to obtain a predicted composite image, multiple predicted composite images can be obtained from each fine segmentation result, and a segmentation information and the background image can be synthesized to obtain a predicted composite image. Then, a predicted synthetic image can be combined with a label synthetic image to obtain a synthetic loss. Multiple predicted synthetic images can correspond to multiple synthetic losses. The multiple synthetic losses can be weighted and summed to obtain the corresponding result of this round of matting training. Fourth fine segmentation loss.

Among them, the calculation formula of synthetic loss can be:

L _C =||cp-cg|| ₁

Among them, c _p is the prediction synthetic map, c _g is the label synthetic map, and L _C is the synthetic loss.

Among them, in order to highlight the differences at the edges of the cutout target, the background image is a new background image randomly selected, and the new background image is different from the original background image of the cutout target.

Through the above method, multiple loss function calculation methods are designed for segmentation training and matting training, so that a more accurate loss function can be calculated, so that the segmentation information output by the trained target matting model can be more accurate.

The following takes the cutout target as a portrait as an example to provide a training method for a target cutout model for portraits, as follows:

The training of the target matting model can include two parts: sample data acquisition and model training.

Among them, sample data acquisition includes the acquisition of data sets and data preprocessing, where,

The data set is obtained as follows: the data set used is divided into two parts, the semantic segmentation data set (including multiple first sample segmentation images and second sample segmentation images carrying the first segmentation label) and the Matting data set (including multiple sample cutout image carrying the second segmentation label). It contains self-collected data and public datasets. The self-collected semantic segmentation data is about 7W, plus the public data set Dark Complexion Portrait Segmentation Dataset (Dark Portrait Segmentation Dataset). The self-collected Matting data set contains about 3,700 high-precision annotations, plus public data sets.

Then, perform data preprocessing on the obtained data set, specifically:

Data preprocessing is divided into two parts. The input size of the portrait segmentation data is 512. Data preprocessing includes random scaling, horizontal flipping, rotation and color dithering. The input size of the portrait mating data is 1024, which will be downsampled to 512 before being input to the original matting network. Matting data preprocessing includes affine transformation, rotation, flipping, color dithering and random noise or sharpening. Figure 17 is a schematic diagram of color migration of a foreground image according to an exemplary embodiment. As shown in Figure 17, in the amplification of Matting data, due to the color of the background image and the foreground image (image in the Matting data) The spatial distribution difference is usually large. The foreground image is color migrated with a certain probability to make the foreground image and background image more natural and realistic. The Alpha annotation in the image is the label of the foreground image, and the preprocessed image is Matting data. The image in the Matting data is the image after background replacement and color migration. After preprocessing, Alpha is the label corresponding to the image in the Matting data after background replacement and color migration.

Model training:

The entire training process is divided into two stages, the semantic segmentation stage and the dual-task training stage. The semantic segmentation stage only uses portrait segmentation data, and the prediction heads at all levels only use segmentation heads. The initial learning rate is 0.0001, the optimizer uses RMSprop (Root Mean Square Prop, root mean square propagation), and every 8 epochs (data is input into the network, a forward calculation and backpropagation are completed) to reduce the learning rate with a momentum of 0.9 , that is, the learning rate is adjusted to 0.9 times the original value. The second stage is the simultaneous training stage of semantic segmentation and matting training, using portrait segmentation data and portrait mating data (portrait matting data), and semantic segmentation and matting training are performed alternately. This alternating training strategy plays a crucial role in maintaining the semantically robust performance of the model when the amount of matting data is small and the network does not have additional third-part image or background image input. Only the semantic segmentation loss is calculated when using portrait segmentation data. In order to strengthen the semantic supervision in the matting data training stage, the labels of the matting data are binarized according to the threshold of 0.1, and the semantic segmentation loss and matting loss (total fine segmentation loss) are calculated at the same time. The initial learning rate at this stage is 0.00001, and the learning rate is reduced with a momentum of 0.9 every 4 epochs until it remains unchanged at 0.000001.

Among them, the loss function consists of two parts, semantic segmentation loss and matting loss. The semantic segmentation loss includes a focal loss (focus loss) and a loss function for portrait segmentation based on ranking loss (ranking loss). Matting losses include L1 loss, multi-scale Laplacian loss, gradient loss and synthetic graph loss. The specific description is as follows:

Semantic segmentation loss: composed of focal loss and improved ranking loss. Assume that the mask value predicted by the network is m _p and the corresponding value of the real label is m _g , where the mask value is the mask value, and here is the mask value corresponding to each pixel in the mask map predicted by the network, The real label is the label carried by the sample image, for example, the label carried by the portrait segmentation data or the portrait mating data. focal loss is L _f , and the specific calculation formula is as follows:

Among them, β is a parameter used to control the weight of positive and negative samples on the loss, and γ is used to adjust the attention to difficult-to-classify samples.

Rankingloss is L _r , and the specific calculation formula is as follows:

in

and

It represents the mask prediction value corresponding to the two sampling points, and s represents a certain threshold. When the distance between the negative sample and the label is greater than a certain threshold, the loss calculation will not be added.

Figure 18 is a schematic diagram of a pair of sampling points according to an exemplary embodiment. As shown in Figure 18, in order to optimize the accuracy of portrait edge prediction, two sampling methods are designed, respectively at the edge of the portrait and between the portrait and the portrait. Sampling is performed inside the background to obtain a certain number of sampling points for loss calculation. The original image in the picture is the target image, the label is the label corresponding to the target image, and the edge sampling points are multiple points sampled at the edge of the portrait. Internal sampling points are multiple points sampled inside the portrait and the background.

Matting loss: This part of the loss consists of four components, three of which are loss calculations between the predicted Alpha (i.e., the output fine segmentation result or segmentation information) and the label, and the last one is generated based on the predicted alpha and label respectively. Loss calculation between synthetic images, where the predicted alpha is a mask image, and each pixel in the mask image corresponds to a mask value from 0 to 1. Assume that the predicted alpha value is α _p and the corresponding label value is α _g . The first loss function is L1 loss

Calculated as follows:

The second loss function is a multi-scale Laplacian loss, which downsamples the predicted alpha in sequence and calculates similarity at different scales. The specific calculation is as follows:

where f _s (x) represents the Laplacian pyramid calculation.

The third loss is the gradient loss L _g , using G(x) to represent the sobel operator, then the specific calculation is as follows:

L _g =||G(α _p )-G(α _g )|| ₁

The last loss is the composite image loss L _C. In order to highlight the differences at the edges of the portrait, a random new background image is used in the calculation for the generation of the composite image. Assume that c _p represents the image generated by predicted alpha, and c _g represents the image generated by label. The calculation formula is as follows:

L _C =||c _p -c _g || ₁

The total loss is the sum of the above losses according to a certain proportion.

For the above method embodiments, for the sake of simple description, they are all expressed as a series of action combinations. However, those skilled in the art should know that the present disclosure is not limited by the action sequence described above. Secondly, those skilled in the art should also know that the embodiments described above are preferred embodiments, and the steps involved are not necessarily necessary for the present disclosure.

Figure 19 is a structural block diagram of an image processing device according to an exemplary embodiment. The image processing device 1900 may be implemented by software, hardware, or a combination of software and hardware, and is used to execute the steps of the image processing method provided by the foregoing method embodiments. Referring to FIG. 19 , the image processing device 1900 includes a first acquisition module 1901 , a segmentation module 1902 and a cutout target determination module 1903 .

The first acquisition module 1901 is configured to acquire the target image;

The segmentation module 1902 is configured to determine segmentation information for segmenting the target image through a target matting model. The target matting model is obtained by alternately training a basic matting network based on the first sample segmentation image and the sample matting image. , the basic matting network is obtained by training the original matting network based on the second sample segmentation image, the first sample segmentation image and the second sample segmentation image carry a first segmentation label, and the sample matting image carries a first segmentation label. Two segmentation labels, the segmentation granularity of the first segmentation label is greater than the segmentation granularity of the second segmentation label;

The cutout target determination module 1903 is configured to determine the cutout target in the target image according to the segmentation information.

Optionally, the original matting network includes an upsampling module composed of multiple upsampling convolution modules, each upsampling convolution module is connected to a coarse segmentation prediction head, and each of the coarse segmentation prediction heads is used to output The corresponding coarse segmentation result after upsampling by the upsampling convolution module.

Optionally, the device also includes:

A first training module configured to segment images according to a plurality of the second samples, and perform multiple rounds of iterative segmentation training on the original matting network;

The second acquisition module is configured to acquire multiple rough segmentation results output by this round of segmentation training after each round of iterative training;

The first acquisition module is configured to obtain the semantic segmentation loss corresponding to this round of segmentation training based on the multiple rough segmentation results output by this round of segmentation training and the first segmentation label carried by the second sample segmentation image in this round of segmentation training;

The first optimization module is configured to optimize the original matting network according to the semantic segmentation loss corresponding to the current round of segmentation training;

The second acquisition module is configured to stop training when the original matting network converges to obtain the basic matting network.

Optionally, the basic matting network includes an upsampling module composed of multiple upsampling convolution modules, each upsampling convolution module is connected to a coarse segmentation prediction head and a fine segmentation prediction head, each of which The coarse segmentation prediction head is used to output the coarse segmentation result after upsampling by the corresponding upsampling convolution module, and each of the fine segmentation prediction heads is used to output the fine segmentation result after upsampling by the corresponding upsampling convolution module.

Optionally, the device also includes:

A second training module configured to perform multiple rounds of alternating iterations of segmentation training and matting training on the basic matting network based on a plurality of the first sample segmentation images and a plurality of the sample matting images;

The third acquisition module is configured to obtain the semantic segmentation loss corresponding to this round of segmentation training after each round of segmentation training;

The second optimization module is configured to optimize the basic matting network according to the semantic segmentation loss corresponding to the current round of segmentation training;

The third acquisition module is configured to, after each round of matting training, output multiple coarse segmentation results, multiple fine segmentation results, segmentation information according to the current round of matting training, and the sample matting in this round of matting training. The second segmentation label carried by the image is used to obtain the total fine segmentation loss corresponding to this round of matting training;

The third optimization module is configured to optimize the optimized basic matting network again based on the total fine segmentation loss corresponding to the current round of matting training;

The fourth acquisition module is configured to stop training when the basic matting network converges to obtain the target matting model.

Optionally, the third acquisition module includes:

The first acquisition sub-module is configured to perform binarization processing on the second segmentation label carried by the sample matting image in the current round of matting training according to the preset segmentation value to obtain the sample in the current round of matting training. The binary segmentation label corresponding to the cutout image;

The second acquisition sub-module is configured to obtain the semantic segmentation loss corresponding to this round of matting training based on the multiple rough segmentation results output by this round of matting training and the binary segmentation labels;

The third acquisition submodule is configured to obtain the target fine segmentation loss corresponding to this round of matting training based on the multiple fine segmentation results, segmentation information and the second segmentation label output by this round of matting training;

The fourth acquisition sub-module is configured to obtain the fine segmentation total corresponding to the current round of matting training based on the semantic segmentation loss corresponding to the current round of matting training and the target fine segmentation loss corresponding to the current round of matting training. loss.

Optionally, the device further includes a semantic segmentation loss sub-module, which includes:

The first acquisition sub-unit is configured to obtain the first semantic segmentation sub-loss based on the distance between the multiple coarse segmentation results and the first segmentation label corresponding to the current round of segmentation training;

The second acquisition subunit is configured to acquire multiple pairs of sampling points from the plurality of coarse segmentation results, based on the distances between the multiple pairs of sampling points and the first segmentation labels corresponding to the current round of segmentation training, Obtain the second semantic segmentation sub-loss;

The third obtaining sub-unit is configured to obtain the semantic segmentation loss corresponding to the current round of segmentation training based on the first semantic segmentation sub-loss and the second semantic segmentation sub-loss.

Optionally, the second acquisition sub-module includes:

The fourth acquisition sub-unit is configured to obtain the third semantic segmentation sub-loss based on the distance between the multiple rough segmentation results output by the current round of matting training and the binary segmentation labels;

The fifth acquisition subunit is configured to obtain multiple pairs of sampling points from multiple coarse segmentation results output by the current round of matting training, and according to the distance between the multiple pairs of sampling points and the binary segmentation labels, distance, the fourth semantic segmentation sub-loss is obtained;

The sixth obtaining sub-unit is configured to obtain the semantic segmentation loss corresponding to the current round of matting training based on the third semantic segmentation sub-loss and the fourth semantic segmentation sub-loss.

Optionally, the target fine segmentation loss includes a first fine segmentation sub-loss;

The third acquisition sub-module includes:

A calculation subunit configured to calculate the segmentation information output by the current round of matting training and the average absolute error between the multiple fine segmentation results and the second segmentation label;

The determination subunit is configured to determine the average absolute error as the first fine segmentation sub-loss corresponding to the current round of matting training.

Optionally, the target fine segmentation loss also includes at least one of the following: a second fine segmentation sub-loss, a third fine segmentation sub-loss, and a fourth fine segmentation sub-loss. Correspondingly, the third acquisition sub-module also include:

The seventh acquisition subunit is configured to calculate the segmentation information of the current round of matting training output and the multi-scale Laplacian loss between the multiple fine segmentation results and the second segmentation label respectively, to obtain the current round of matting. Train the corresponding second fine segmentation sub-loss; and/or,

The eighth acquisition subunit is configured to calculate the segmentation information output by the current round of matting training and the gradient loss between the multiple fine segmentation results and the second segmentation labels, and obtain the third fine corresponding to the current round of matting training. split sub-loss; and/or,

The ninth acquisition sub-unit is configured to calculate the synthesis loss between multiple predicted synthetic images and the label synthetic image respectively, and obtain the fourth fine segmentation sub-loss corresponding to the current round of matting training. The multiple predicted synthetic images are based on The multiple fine segmentation results output by this round of matting training and the multiple matting targets obtained from the segmentation information are respectively combined with the background image. The label composite image is the matting target obtained according to the second segmentation label and the image. The image is synthesized from the background image.

Optionally, the device also includes:

The background blur module is configured to blur the portion of the target image other than the cutout target according to the cutout target to obtain a background blurred image.

Optionally, the device also includes:

The fourth acquisition module is configured to acquire the target background image;

The fifth acquisition module is configured to perform cutout processing on the target image according to the cutout target to obtain the cutout target image;

The background replacement module is configured to synthesize the cutout target image and the target background image to obtain a background replacement image.

Optionally, the original matting network includes a feature extraction module, a dilated convolution pooling module, an upsampling module and a multiple upsampling module;

The segmentation module 1902 includes:

A feature extraction submodule configured to perform feature extraction on the target image through the feature extraction module to obtain an original feature vector;

The context extraction sub-module is configured to perform context extraction on the original feature vector through the atrous convolution pooling module to obtain a context feature vector;

The first upsampling sub-module is configured to upsample the context feature vector and the original feature vector through the upsampling module to obtain a fine segmentation result;

The second upsampling sub-module is configured to upsample the fine segmentation result and the target image through a multiple upsampling module to obtain the segmentation information.

Regarding the devices in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments related to the method, and will not be described in detail here.

Figure 20 is a structural block diagram of an image processing device according to an exemplary embodiment. For example, the device 2000 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.

Referring to Figure 20, device 2000 may include one or more of the following components: processing component 2002, memory 2004, power supply component 2006, multimedia component 2008, audio component 2010, input/output interface 2012, sensor component 2014, and communication component 2016.

Processing component 2002 generally controls the overall operations of device 2000, such as operations associated with display, phone calls, data communications, camera operations, and recording operations. The processing component 2002 may include one or more processors 2020 to execute instructions to complete all or part of the steps of the above method. Additionally, processing component 2002 may include one or more modules that facilitate interaction between processing component 2002 and other components. For example, processing component 2002 may include a multimedia module to facilitate interaction between multimedia component 2008 and processing component 2002.

Memory 2004 is configured to store various types of data to support operations at device 2000. Examples of such data include instructions for any application or method operating on device 2000, contact data, phonebook data, messages, pictures, videos, etc. Memory 2004 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EEPROM), Programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.

Power supply component 2006 provides power to various components of device 2000. Power supply components 2006 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to device 2000.

Multimedia component 2008 includes a screen that provides an output interface between the device 2000 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide action. In some embodiments, multimedia component 2008 includes a front-facing camera and/or a rear-facing camera. When the device 2000 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front-facing camera and rear-facing camera can be a fixed optical lens system or have a focal length and optical zoom capabilities.

Audio component 2010 is configured to output and/or input audio signals. For example, audio component 2010 includes a microphone (MIC) configured to receive external audio signals when device 2000 is in operating modes, such as call mode, recording mode, and speech recognition mode. The received audio signals may be further stored in memory 2004 or sent via communications component 2016 . In some embodiments, audio component 2010 also includes a speaker for outputting audio signals.

The input/output interface 2012 provides an interface between the processing component 2002 and a peripheral interface module. The peripheral interface module may be a keyboard, a click wheel, a button, etc. These buttons may include, but are not limited to: Home button, Volume buttons, Start button, and Lock button.

Sensor component 2014 includes one or more sensors that provide various aspects of status assessment for device 2000 . For example, the sensor component 2014 can detect the open/closed state of the device 2000, the relative positioning of components, such as the display and keypad of the device 2000, and the sensor component 2014 can also detect the position change of the device 2000 or a component of the device 2000. , the presence or absence of user contact with device 2000 , device 2000 orientation or acceleration/deceleration and temperature changes of device 2000 . Sensor assembly 2014 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor assembly 2014 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 2014 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

Communication component 2016 is configured to facilitate wired or wireless communication between apparatus 2000 and other devices. Device 2000 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 2016 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communications component 2016 also includes a near field communications (NFC) module to facilitate short-range communications. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.

In an exemplary embodiment, apparatus 2000 may be configured by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable Gate array (FPGA), controller, microcontroller, microprocessor or other electronic components are implemented for executing the above method.

In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions, such as a memory 2004 including instructions, which can be executed by the processor 2020 of the device 2000 to complete the above method is also provided. For example, the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.

In another exemplary embodiment, a computer program product is also provided, the computer program product comprising a computer program executable by a programmable device, the computer program having a function for performing the above when executed by the programmable device. The code part of the image processing method.

Other embodiments of the disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include common knowledge or customary technical means in the technical field that are not disclosed in the disclosure. . It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It is to be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and various modifications and changes may be made without departing from the scope thereof. The scope of the disclosure is limited only by the appended claims.

Claims

An image processing method, characterized by including:

Get the target image;

The segmentation information used to segment the target image is determined through a target matting model. The target matting model is obtained by alternately training a basic matting network based on the first sample segmentation image and the sample matting image. The basic matting network is based on The second sample segmentation image is obtained by training the original matting network, the first sample segmentation image and the second sample segmentation image carry a first segmentation label, the sample matting image carries a second segmentation label, and the first The segmentation granularity of the segmentation label is greater than the segmentation granularity of the second segmentation label;

According to the segmentation information, the cutout target in the target image is determined.
The image processing method according to claim 1, characterized in that the original matting network includes an upsampling module composed of multiple upsampling convolution modules, each upsampling convolution module is connected to a coarse segmentation prediction head, Each of the coarse segmentation prediction heads is used to output the coarse segmentation result upsampled by the corresponding upsampling convolution module.
The image processing method according to claim 1, characterized in that the basic matting network is obtained through the following steps:

Carry out multiple rounds of iterative segmentation training on the original matting network according to the plurality of second sample segmented images;

After each round of iterative training, multiple coarse segmentation results output from this round of segmentation training are obtained;

Based on the multiple rough segmentation results output by this round of segmentation training and the first segmentation label carried by the second sample segmentation image in this round of segmentation training, the semantic segmentation loss corresponding to this round of segmentation training is obtained;

Optimize the original matting network according to the semantic segmentation loss corresponding to the current round of segmentation training;

When the original matting network converges, training is stopped and the basic matting network is obtained.
The image processing method according to claim 1, characterized in that the basic matting network includes an upsampling module composed of a plurality of upsampling convolution modules, and each upsampling convolution module is connected to a coarse segmentation prediction head. and a fine segmentation prediction head. Each of the coarse segmentation prediction heads is used to output the upsampled coarse segmentation result of the corresponding upsampling convolution module. Each of the fine segmentation prediction heads is used to output the corresponding upsampling convolution. Fine segmentation results after module upsampling.
The image processing method according to claim 1, characterized in that the target matting model is obtained through the following steps:

According to a plurality of the first sample segmentation images and a plurality of the sample matting images, perform multiple rounds of alternating iterative segmentation training and matting training on the basic matting network;

After each round of segmentation training, the semantic segmentation loss corresponding to this round of segmentation training is obtained;

Optimize the basic matting network according to the semantic segmentation loss corresponding to the current round of segmentation training;

After each round of matting training, multiple coarse segmentation results, multiple fine segmentation results, segmentation information output according to this round of matting training, and the second segmentation label carried by the sample matting image in this round of matting training are , get the total fine segmentation loss corresponding to this round of matting training;

According to the total fine segmentation loss corresponding to the current round of matting training, the optimized basic matting network is optimized again;

When the basic matting network converges, training is stopped and the target matting model is obtained.
The image processing method according to claim 5, characterized in that:

According to the multiple coarse segmentation results, multiple fine segmentation results, segmentation information output by this round of cutout training, and the second segmentation labels carried by the sample cutout images in this round of cutout training, the corresponding round of cutout training is obtained. The total loss of fine segmentation includes:

According to the preset segmentation value, the second segmentation label carried by the sample matting image in the current round of matting training is binarized to obtain the binarized segmentation corresponding to the sample matting image in this round of matting training. Label;

According to the multiple rough segmentation results output by this round of matting training and the binary segmentation labels, the semantic segmentation loss corresponding to this round of matting training is obtained;

According to the multiple fine segmentation results, segmentation information and the second segmentation label output by this round of matting training, the target fine segmentation loss corresponding to this round of matting training is obtained;

According to the semantic segmentation loss corresponding to the current round of cutout training and the target fine segmentation loss corresponding to the current round of cutout training, the total fine segmentation loss corresponding to the current round of cutout training is obtained.
The image processing method according to claim 3 or 5, characterized in that obtaining the semantic segmentation loss corresponding to segmentation training includes:

According to the distance between multiple coarse segmentation results and the first segmentation label corresponding to this round of segmentation training, the first semantic segmentation sub-loss is obtained;

Obtain multiple pairs of sampling points from the plurality of coarse segmentation results, and obtain the second semantic segmentation sub-loss according to the distance between the multiple pairs of sampling points and the first segmentation labels corresponding to this round of segmentation training;

The semantic segmentation loss corresponding to this round of segmentation training is obtained according to the first semantic segmentation sub-loss and the second semantic segmentation sub-loss.
The image processing method according to claim 6, wherein the semantic segmentation loss corresponding to the current round of cutout training is obtained based on the multiple rough segmentation results output by the current round of cutout training and the binary segmentation labels. include:

According to the distance between the multiple rough segmentation results output by this round of matting training and the binary segmentation labels, the third semantic segmentation sub-loss is obtained;

Multiple pairs of sampling points are obtained from multiple rough segmentation results output from this round of matting training, and the fourth semantic segmentation sub-loss is obtained based on the distance between the multiple pairs of sampling points and the binary segmentation labels. ;

The semantic segmentation loss corresponding to this round of matting training is obtained according to the third semantic segmentation sub-loss and the fourth semantic segmentation sub-loss.
The image processing method according to claim 6, characterized in that:

The target fine segmentation loss includes a first fine segmentation sub-loss;

According to the multiple fine segmentation results, segmentation information and the second segmentation label output by this round of matting training, the target fine segmentation loss corresponding to this round of matting training is obtained, including:

Calculate the average absolute error between the segmentation information output by this round of matting training and the multiple fine segmentation results and the second segmentation label;

The average absolute error is determined as the first fine segmentation sub-loss corresponding to this round of matting training.
The image processing method according to claim 9, characterized in that:

The target fine segmentation loss also includes at least one of the following: a second fine segmentation sub-loss, a third fine segmentation sub-loss, and a fourth fine segmentation sub-loss. Correspondingly, the multiple fine segmentation sub-losses output according to this round of matting training The segmentation results, segmentation information and the second segmentation label are used to obtain the target fine segmentation loss corresponding to this round of matting training, which also includes:

Calculate the multi-scale Laplacian loss between the segmentation information output by this round of matting training and multiple fine segmentation results respectively and the second segmentation label, and obtain the second fine segmentation sub-loss corresponding to this round of matting training. ;and / or,

Calculate the gradient loss between the segmentation information output by this round of matting training and the multiple fine segmentation results and the second segmentation label respectively, to obtain the third fine segmentation sub-loss corresponding to this round of matting training; and/or,

Calculate the synthesis loss between multiple predicted synthetic images and the label synthetic image respectively to obtain the fourth fine segmentation sub-loss corresponding to this round of matting training. The multiple predicted synthetic images are multiple output according to this round of matting training. The multiple cutout targets obtained from the fine segmentation results and the segmentation information are respectively combined with the background image, and the label composite image is an image combined with the background image and the cutout target obtained according to the second segmentation label.
The image processing method according to claim 1, characterized in that the method further includes:

According to the cutout target, the portion of the target image other than the cutout target is blurred to obtain a background blurred image.
The image processing method according to claim 1, characterized in that the method further includes:

Get the target background image;

According to the cutout target, perform cutout processing on the target image to obtain the cutout target image;

The cutout target image and the target background image are synthesized to obtain a background replacement image.
The image processing method according to claim 1, characterized in that:

The original matting network includes a feature extraction module, a dilated convolution pooling module, an upsampling module and a multiple upsampling module;

Determining segmentation information for segmenting the target image through the target matting model includes:

Extract features from the target image through the feature extraction module to obtain original feature vectors;

Use the dilated convolution pooling module to perform context extraction on the original feature vector to obtain a context feature vector;

The context feature vector and the original feature vector are upsampled through an upsampling module to obtain a fine segmentation result;

Through a multiple upsampling module, the fine segmentation result and the target image are upsampled to obtain the segmentation information.
An image processing device, characterized in that it includes:

The first acquisition module is configured to acquire the target image;

The segmentation module is configured to determine segmentation information for segmenting the target image through a target matting model. The target matting model is obtained by alternately training a basic matting network based on the first sample segmentation image and the sample matting image. The basic matting network is trained based on the second sample segmentation image, the first sample segmentation image and the second sample segmentation image carry a first segmentation label, the sample matting image carries a second segmentation label, and the The segmentation granularity of the first segmentation label is greater than the segmentation granularity of the second segmentation label;

The cutout target determination module is configured to determine the cutout target in the target image according to the segmentation information.
An image processing device, characterized in that it includes:

processor;

Memory used to store instructions executable by the processor;

Wherein, the processor is configured to perform the steps of the method according to any one of claims 1-13.
A computer-readable storage medium on which computer program instructions are stored, characterized in that when the program instructions are executed by a processor, the steps of the method described in any one of claims 1-13 are implemented.