CN111369581B - Image processing method, device, equipment and storage medium - Google Patents

Image processing method, device, equipment and storage medium

Info

Publication number
CN111369581B
Authority
CN
China
Prior art keywords
image
sample
transparent channel
feature
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010099612.6A
Other languages
Chinese (zh)
Other versions
CN111369581A (en)
Inventor
刘钰安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010099612.6A priority Critical patent/CN111369581B/en
Publication of CN111369581A publication Critical patent/CN111369581A/en
Priority to PCT/CN2021/074722 priority patent/WO2021164534A1/en
Application granted granted Critical
Publication of CN111369581B publication Critical patent/CN111369581B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The embodiments of the present application disclose an image processing method, apparatus, device and storage medium, belonging to the field of image processing. The method comprises the following steps: acquiring an original image, wherein the original image comprises at least one target object; inputting the original image into a first prediction model to obtain a first transparent channel image output by the first prediction model, wherein the first transparent channel image comprises predicted transparency values corresponding to each pixel point in the original image; inputting the first transparent channel image and the original image into a second prediction model to obtain a second transparent channel image output by the second prediction model, wherein the fineness of the second transparent channel image is higher than that of the first transparent channel image; and segmenting the original image according to the second transparent channel image to obtain an image corresponding to the target object. Compared with image segmentation methods in the related art, no trimap image needs to be introduced and the transparent channel image can be generated directly from the original image, which further improves the accuracy of image segmentation.

Description

Image processing method, device, equipment and storage medium
Technical Field
Embodiments of the present disclosure relate to the field of image processing, and in particular, to an image processing method, apparatus, device, and storage medium.
Background
Image segmentation refers to the process of accurately separating a foreground object of interest from the background in a static image or a continuous video sequence, and is widely used for portrait blurring, background replacement and the like. The main goal of image segmentation is to obtain a transparent channel image, in which the transparency value corresponding to each pixel point is marked: a region with a transparency value of 1 is a foreground image region, a region with a transparency value of 0 is a background image region, and the foreground image in the original image can be separated by using the obtained transparent channel image.
In the related art, an image segmentation method is provided in which a trimap image needs to be generated from the original image. The trimap image divides the original image into three parts: a determined foreground image region, a determined background image region and an uncertain region. The uncertain region is first determined by using the trimap image, and then the trimap image and the original image are input into a trained neural network to determine the transparency values corresponding to each pixel point in the uncertain region, so that a transparent channel image for image segmentation is output.
Obviously, the accuracy of the transparent channel image obtained in the related art depends on the accuracy of the trimap image, and the trimap image needs to be generated by training a dedicated neural network or obtained by manual labeling, resulting in lower accuracy of the generated transparent channel image.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device, image processing equipment and a storage medium. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides an image processing method, including:
acquiring an original image, wherein the original image comprises at least one target object;
inputting the original image into a first prediction model to obtain a first transparent channel image output by the first prediction model, wherein the first transparent channel image comprises predicted transparency values corresponding to all pixel points in the original image;
inputting the first transparent channel image and the original image into a second prediction model to obtain a second transparent channel image output by the second prediction model, wherein the fineness of the second transparent channel image is higher than that of the first transparent channel image;
and dividing the original image according to the second transparent channel image to obtain an image corresponding to the target object.
In another aspect, an embodiment of the present application provides an image processing apparatus, including:
the first acquisition module is used for acquiring an original image, wherein the original image comprises at least one target object;
The first prediction module is used for inputting the original image into a first prediction model to obtain a first transparent channel image output by the first prediction model, wherein the first transparent channel image comprises a predicted transparency value corresponding to each pixel point in the original image;
the second prediction module is used for inputting the first transparent channel image and the original image into a second prediction model to obtain a second transparent channel image output by the second prediction model, and the fineness of the second transparent channel image is higher than that of the first transparent channel image;
and the segmentation processing module is used for carrying out segmentation processing on the original image according to the second transparent channel image to obtain an image corresponding to the target object.
In another aspect, embodiments of the present application provide a computer device, where the computer device includes a processor and a memory, where at least one instruction, at least one program, a code set, or an instruction set is stored, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the image processing method as described in the above aspect.
In another aspect, embodiments of the present application provide a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which are loaded and executed by a processor to implement the image processing method as described in the above aspect.
The beneficial effects of the technical solutions provided by the embodiments of the present application include at least the following:
The acquired original image is input into a first prediction model to obtain a first transparent channel image (comprising predicted transparency values corresponding to each pixel point in the original image) output by the first prediction model; the first transparent channel image and the original image are then input into a second prediction model to obtain a second transparent channel image output by the second prediction model, and the original image is segmented according to the second transparent channel image to obtain an image corresponding to the target object. Because the fineness of the second transparent channel image is higher than that of the first transparent channel image, the accuracy of image segmentation can be improved; compared with image segmentation methods in the related art, the transparent channel image used for image segmentation can be generated directly from the original image without introducing a trimap image, which further improves the accuracy of image segmentation.
Drawings
FIG. 1 illustrates a flow chart of an image processing method provided by an exemplary embodiment of the present application;
FIG. 2 illustrates a flow chart of an image processing method illustrated in an exemplary embodiment of the present application;
FIG. 3 illustrates a flowchart of a method of training a first predictive model, as illustrated in an exemplary embodiment of the present application;
FIG. 4 illustrates a flowchart of a method of training a first predictive model, as illustrated in another exemplary embodiment of the present application;
FIG. 5 illustrates a process diagram of a training method of a first predictive model, as illustrated in an exemplary embodiment of the present application;
FIG. 6 shows a schematic diagram of the structure of various convolution blocks used by the multi-scale decoding network;
FIG. 7 illustrates a flowchart of a method of training a second predictive model, as illustrated in an exemplary embodiment of the present application;
FIG. 8 illustrates a flowchart of a training method of a second predictive model, as illustrated in another exemplary embodiment of the present application;
FIG. 9 illustrates a process diagram of a training method of a second predictive model, as illustrated in an exemplary embodiment of the present application;
FIG. 10 illustrates a flowchart of an image processing method illustrated in another exemplary embodiment of the present application;
FIG. 11 illustrates a flowchart of an image processing method illustrated in another exemplary embodiment of the present application;
FIG. 12 illustrates a network deployment diagram of an image processing method illustrated in an exemplary embodiment of the present application;
fig. 13 is a block diagram showing the structure of an image processing apparatus provided in an exemplary embodiment of the present application;
fig. 14 shows a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
References herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
Image segmentation refers to the process of precisely separating a foreground object of interest from the background in a static image or a continuous video sequence. In an image segmentation task, a transparent channel image used for separating the foreground image needs to be generated, and the transparent channel image contains the transparency value corresponding to each pixel point; for example, a region with a transparency value of 1 represents a foreground image region, and a region with a transparency value of 0 represents a background image region, so that the foreground image in the original image can be separated by using the obtained transparent channel image.
In the related art, an image processing method is proposed, which is mainly divided into two stages, wherein the first stage is: generating a trimap image according to the original image, wherein the trimap image is used for dividing the original image into three parts, namely a determined foreground image area, a determined background image area and an uncertain area, and the uncertain area in the original image can be divided by the generated trimap image; the second stage is: and inputting the generated trimap image and original image into a trained neural network, determining the transparency value corresponding to each pixel point of the uncertain region, and outputting a transparent channel image for image segmentation.
With the method in the related art, if the uncertain region marked by the trimap image is not accurate enough, the accuracy of the generated transparent channel image is low, which affects the accuracy of image segmentation. In addition, the trimap image is generated by a dedicated neural network or by manual labeling, which increases the complexity of the training process and means that the corresponding transparent channel image cannot be generated directly from the original image.
In order to solve the above-described problems, an embodiment of the present application provides an image processing method. Referring to fig. 1, a flowchart of an image processing method according to an exemplary embodiment of the present application is shown. The method comprises the following steps:
Step 101, training a first prediction model according to the sample image and the sample segmentation image, wherein the first prediction model is used for generating a first transparent channel image, and the first transparent channel image comprises predicted transparency values corresponding to all pixel points in the original image.
Step 102, training a second prediction model according to the sample image, the sample annotation image and the first sample transparent channel image output by the trained first prediction model, wherein the second prediction model is used for generating a second transparent channel image whose fineness is higher than that of the first transparent channel image.
Step 103, after preprocessing the original image, inputting it into the trained first prediction model to obtain a first transparent channel image output by the first prediction model.
Step 104, inputting the first transparent channel image and the original image into the second prediction model to obtain a second transparent channel image output by the second prediction model.
Step 105, segmenting the original image according to the second transparent channel image to obtain a foreground image.
In the embodiments of the present application, a first prediction model and a second prediction model for generating transparent channel images can be obtained through training. The original image is preprocessed and then input into the first prediction model to obtain the first transparent channel image output by the first prediction model; the generated first transparent channel image and the original image are then input into the second prediction model to obtain the second transparent channel image output by the second prediction model, and the second transparent channel image is used for subsequent image processing.
It should be noted that, the image processing method provided in the embodiments of the present application may be used in a computer device having an image processing function, where the computer device may be a smart phone, a tablet computer, a personal portable computer, or the like. In a possible implementation manner, the image processing method provided by the embodiment of the application may be applied to an application program needing to perform tasks such as image segmentation, background replacement, target object blurring and the like. For example, an application program having a beautifying function; optionally, in the image processing method provided by the embodiments of the present application, a training process of the prediction model may be performed in a server, and after the training of the prediction model is completed, the trained prediction model is deployed in a computer device to perform subsequent image processing; alternatively, the image processing method provided in the embodiments of the present application may also be used for a server having an image processing function.
For convenience of description, in the following method embodiments, description will be made taking only an example in which an execution subject of the image processing method is a computer device.
Referring to fig. 2, a flowchart of an image processing method according to an exemplary embodiment of the present application is shown. This embodiment will be described by taking the method for a computer device as an example, and the method includes the following steps.
In step 201, an original image is acquired.
Since the purpose of image processing in the embodiments of the present application is to separate a foreground image from an original image, the foreground image is an image corresponding to a target object. Therefore, the original image should contain at least one target object.
The target object may be a person, a scene, an animal, etc., and the type of the target object is not limited in the embodiment of the present application.
In one possible implementation, the original image needs to be preprocessed in a manner that may include data enhancement processing such as random rotation, random left-right flipping, random cropping, gamma (Gamma) transformation, etc., for use in a subsequent feature extraction process.
Step 202, inputting an original image into a first prediction model to obtain a first transparent channel image output by the first prediction model, wherein the first transparent channel image comprises predicted transparency values corresponding to all pixel points in the original image.
The first transparent channel image is a probability map, that is, the predicted transparency value corresponding to each pixel point in the first transparent channel image is a value between 0 and 1; for example, the predicted transparency value of a certain pixel point may be 0.9.
The first prediction model is obtained by training with the sample image, the sample annotation image and the sample segmentation image, where the sample annotation image is annotated with the standard transparency value corresponding to each pixel point in the sample image. Therefore, in one possible implementation, after the original image is preprocessed, it is input into the trained first prediction model to obtain a predicted first transparent channel image, and the first transparent channel image comprises the predicted transparency values corresponding to each pixel point in the original image.
Step 203, inputting the first transparent channel image and the original image into a second prediction model to obtain a second transparent channel image output by the second prediction model, wherein the fineness of the second transparent channel image is higher than that of the first transparent channel image.
Since the accuracy of the segmented image depends on the accuracy of the obtained transparent channel image, in order to improve the accuracy of the first transparent channel image, a second prediction model is deployed that can generate a second transparent channel image of higher fineness from the input original image and the first transparent channel image.
The second prediction model is mainly used for correcting the transparency values of each pixel point in the first transparent channel image, so that the predicted transparency values of each pixel point in the second transparent channel image are closer to the standard transparency values.
In one possible implementation manner, after the first transparent channel image and the original image are concatenated (Concat), the result is input into the second prediction model, so that a second transparent channel image with higher fineness can be obtained for use in the subsequent image segmentation process.
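For illustration only, the following is a minimal PyTorch sketch of such a channel-wise concatenation; the tensor shapes are assumptions and this is not the patent's exact pipeline.

```python
import torch

# Hypothetical tensors: original RGB image and first transparent channel image,
# both with shape (batch, channels, height, width).
original = torch.rand(1, 3, 256, 256)     # RGB original image
first_alpha = torch.rand(1, 1, 256, 256)  # predicted transparency values in [0, 1]

# Concat along the channel dimension -> a 4-channel input for the second prediction model.
second_model_input = torch.cat([original, first_alpha], dim=1)
print(second_model_input.shape)           # torch.Size([1, 4, 256, 256])
```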
Step 204, segmenting the original image according to the second transparent channel image to obtain an image corresponding to the target object.
In a possible implementation manner, the second transparent channel image includes a predicted transparency value corresponding to each pixel point, and since the transparency value corresponding to the foreground image is 1 and the transparency value corresponding to the background image is 0, the foreground image and the background image in the original image can be separated according to the second transparent channel image, so that an image corresponding to the target object can be obtained.
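As a simple reference, a minimal alpha-compositing sketch (an assumed illustration, not taken from the patent text) that uses the predicted transparency values to keep the foreground corresponding to the target object:

```python
import torch

def extract_foreground(original: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Multiply each pixel by its predicted transparency value.

    original: (B, 3, H, W) image; alpha: (B, 1, H, W) values in [0, 1].
    Pixels with alpha close to 1 (foreground) are kept, pixels with alpha
    close to 0 (background) are suppressed.
    """
    return original * alpha  # broadcast over the channel dimension

foreground = extract_foreground(torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256))
```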
In summary, in the embodiments of the present application, the acquired original image is input into the first prediction model to obtain the first transparent channel image (comprising the predicted transparency values corresponding to each pixel point in the original image) output by the first prediction model; the first transparent channel image and the original image are then input into the second prediction model to obtain the second transparent channel image output by the second prediction model, and the original image is segmented according to the second transparent channel image to obtain the image corresponding to the target object. Because the fineness of the second transparent channel image is higher than that of the first transparent channel image, the accuracy of image segmentation can be improved; compared with the image segmentation methods in the related art, the transparent channel image used for segmentation can be generated directly from the original image without introducing a trimap image, which further improves the accuracy of image segmentation.
In one possible embodiment, since the process of generating the second transparent channel image is divided into two prediction model phases, namely a first prediction model and a second prediction model, the training phase of the first prediction model and the training phase of the second prediction model should also be included in the model training phase.
Referring to FIG. 3, a flow chart of a method of training a first predictive model is shown in accordance with an exemplary embodiment of the present application. The method comprises the following steps:
step 301, obtaining a sample image, a sample labeling image and a sample segmentation image, wherein the sample labeling image is labeled with transparency values corresponding to each pixel point in the sample image, and the sample segmentation image is a binarized image obtained by performing binarization processing on the sample labeling image.
Aiming at the training process of the first prediction model, the adopted data set comprises a preset number of data pairs, each data pair is a sample image and a sample labeling image corresponding to the sample image, wherein the sample labeling image is labeled with standard transparency values corresponding to all pixel points in the corresponding sample image.
Alternatively, the preset number may be set by the developer, and the greater the number of data pairs, the higher the prediction accuracy of the first prediction model. For example, a dataset may include 5000 data pairs.
Alternatively, the sample annotation image may be annotated by a developer.
Alternatively, the first predictive model may be trained based on a deep learning tensor library (PyTorch) framework, and a graphics processor (Graphics Processing Unit, GPU).
In order to allow the first prediction model to converge rapidly and thereby increase its training speed, the first prediction model is trained using the sample image and the sample segmentation image.
Regarding the way the sample segmentation image is obtained, in one possible implementation, it may be obtained by performing binarization processing on the sample annotation image: a transparency threshold is set, and if the transparency value corresponding to a pixel point is greater than the transparency threshold, the transparency value of that pixel point is represented by 1, while if the transparency value corresponding to the pixel point is less than the transparency threshold, it is represented by 0.
Alternatively, the transparency threshold may be set by the developer, for example, the transparency threshold may be 0.8, that is, the transparency value of the pixel point greater than 0.8 is represented by 1, and the transparency value of the pixel point less than 0.8 is represented by 0.
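A minimal sketch of this binarization, assuming the example threshold value of 0.8:

```python
import torch

def binarize_alpha(annotation: torch.Tensor, threshold: float = 0.8) -> torch.Tensor:
    """Turn a sample annotation image (transparency values in [0, 1]) into a
    binary sample segmentation image: 1 above the threshold, 0 otherwise."""
    return (annotation > threshold).float()

segmentation = binarize_alpha(torch.rand(1, 1, 256, 256))
```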
In one possible implementation, the acquired preset number of sample images and sample segmentation images are divided into a test set and a sample set according to a certain proportion, wherein the sample set is used for a subsequent training process of the first prediction model, and the test set is used for a verification process of the first prediction model.
Alternatively, the preset ratio may be set by a developer, for example, the preset ratio is 2:8, and the data set may be divided into the test set and the sample set according to the ratio of 2:8.
Step 302, training a first prediction model according to the sample image and the sample segmentation image.
In one possible implementation, the sample data size may be expanded by preprocessing the image, such as, for example, preprocessing the sample image in the sample set by random rotation, random left-right flipping, random cropping, gamma transformation, and the like. The preprocessing method of the sample image in the embodiment of the present application is not limited.
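For illustration, such a preprocessing pipeline could be sketched with torchvision transforms (an assumed toolchain; the parameters below are not specified by the patent):

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as TF

# Hypothetical augmentation pipeline: random rotation, random horizontal flip,
# random crop and a random Gamma transform. In practice the same geometric
# transforms must also be applied to the corresponding annotation image.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop(size=(256, 256)),
    transforms.Lambda(lambda img: TF.adjust_gamma(img, gamma=random.uniform(0.8, 1.2))),
    transforms.ToTensor(),
])
```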
In one possible implementation, the first predictive model may include a multi-scale encoding network, a feature pyramid network, a multi-scale decoding network, a depth supervision network, and the like.
Illustratively, step 302 may include steps 302A through 302F, as shown in fig. 4.
Step 302A, inputting a sample image into a multi-scale coding network to obtain m first sample feature images output by the multi-scale coding network, where m is an integer greater than or equal to 2.
Wherein the resolution and the channel number of the different first sample feature graphs are different, and m is an integer greater than or equal to 2.
The multi-scale encoding network may adopt a neural network model for feature extraction, for example, the lightweight MobileNetV2 model, which is not limited in the embodiments of the present application.
In one possible implementation manner, the sample image after being preprocessed is input into a multi-scale coding network, and multi-scale feature extraction is performed on the sample image through the multi-scale coding network, so that m first sample feature images can be obtained, and as feature extraction is performed on multiple scales, the resolution and the channel number of the obtained m first sample feature images are different.
Illustratively, if m is 4, the obtained m first sample feature maps may be: 320×1/32, 64×1/16, 32×1/8 and 24×1/4, where 320, 64, 32 and 24 represent the number of channels of each first sample feature map, and 1/32, 1/16, 1/8 and 1/4 represent the resolution of each first sample feature map relative to the sample image; for example, 1/4 indicates that the resolution of that first sample feature map is 1/4 of that of the sample image.
Illustratively, as shown in fig. 5, after the sample image is input into the multi-scale encoding network 501 and multi-scale feature extraction is performed, 4 first sample feature maps output by the multi-scale encoding network 501 are obtained, which are 320×1/32, 64×1/16, 32×1/8, and 24×1/4, respectively.
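A sketch of extracting such multi-scale features with a MobileNetV2 backbone; torchvision's implementation and the tap indices below are assumptions about how the encoder could be realised, not the patent's exact network:

```python
import torch
from torchvision.models import mobilenet_v2

class MultiScaleEncoder(torch.nn.Module):
    """Collects MobileNetV2 intermediate feature maps at 1/4, 1/8, 1/16 and
    1/32 of the input resolution (24, 32, 64 and 320 channels respectively)."""

    def __init__(self):
        super().__init__()
        # weights=None gives random weights; pretrained weights would normally be used.
        self.features = mobilenet_v2(weights=None).features
        # Index of the last block of each stage we want to tap.
        self.tap_points = {3: "1/4", 6: "1/8", 10: "1/16", 17: "1/32"}

    def forward(self, x):
        outputs = []
        for idx, layer in enumerate(self.features):
            x = layer(x)
            if idx in self.tap_points:
                outputs.append(x)
        return outputs  # [24x1/4, 32x1/8, 64x1/16, 320x1/32]

feats = MultiScaleEncoder()(torch.rand(1, 3, 224, 224))
print([f.shape for f in feats])
```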
And 302B, inputting the m first sample feature graphs into a feature pyramid network to obtain m second sample feature graphs output by the feature pyramid network.
Wherein the number of channels of the different second sample feature maps is the same and the resolution is different.
The feature pyramid network is used for mixing the extracted feature graphs and processing the corresponding channel number into the target channel number.
In one possible implementation manner, the m first sample feature maps output by the multi-scale encoding network are input into the feature pyramid network and are arranged according to their resolutions to form a first sample feature pyramid, where the level of a first sample feature map in the first sample feature pyramid is negatively correlated with its resolution.
Illustratively, as shown in fig. 5, 4 first sample feature graphs output by the multi-scale coding network are input into the feature pyramid network 502, first sample feature pyramids (as shown in the left pyramid of the feature pyramid network) are formed according to the arrangement of the resolution, that is, each level included in the first sample feature pyramids and the first sample feature graphs corresponding to the level are respectively: 24X 1/4 (first layer), 32X 1/8 (second layer), 64X 1/16 (third layer), 320X 1/32 (fourth layer).
In one possible implementation, after the first sample feature pyramid is formed, the first sample feature maps are mixed through upsampling and convolution processing, so that each obtained second sample feature map not only focuses on features at its own scale but also makes full use of all the first sample feature maps. If the number of channels of a first sample feature map is the maximum number of channels (i.e., it corresponds to the minimum resolution), a convolution is applied to that first sample feature map to obtain the second sample feature map corresponding to the minimum resolution; if the number of channels of a first sample feature map is not the maximum number of channels, a convolution is applied to that feature map to obtain a first intermediate sample feature map, convolution and upsampling operations are applied to the first sample feature map one level higher to obtain a second intermediate sample feature map, and the two intermediate sample feature maps are mixed and then convolved to obtain the second sample feature map corresponding to that resolution.
Schematically, as shown in fig. 5, for the first sample feature map 320×1/32 corresponding to the fourth layer, only a convolution operation is performed, so that the second sample feature map corresponding to a resolution of 1/32, namely 128×1/32, can be obtained. For the first sample feature map 64×1/16 corresponding to the third layer, a convolution is first applied to 64×1/16 to obtain a first intermediate sample feature map; convolution and bilinear-interpolation 2× upsampling (up2x) are applied to the first sample feature map 320×1/32 corresponding to the fourth layer (one level higher) to obtain a second intermediate sample feature map; the two intermediate sample feature maps are mixed and then convolved to obtain the second sample feature map with a resolution of 1/16, denoted 128×1/16. Similarly, the second sample feature map corresponding to a resolution of 1/8 (128×1/8) and the second sample feature map corresponding to a resolution of 1/4 (128×1/4) can be obtained. The right half of the feature pyramid network 502 in fig. 5 shows the second sample feature maps output by the feature pyramid network.
In one possible embodiment, the second sample feature maps are also arranged according to the size of the resolution to form a second sample feature pyramid, wherein the level of the second sample feature map on the second sample feature pyramid has a negative correlation with the resolution thereof.
For example, each level included in the second sample feature pyramid and the second sample feature map corresponding to each level are respectively: 128X 1/4 (first layer), 128X 1/8 (second layer), 128X 1/16 (third layer), 128X 1/32 (fourth layer).
Optionally, the number of target channels corresponding to the m second feature maps may be set by a developer, for example, the number of target channels is 128, which is not limited in the embodiment of the present application.
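The top-down mixing described above can be sketched as a generic FPN-style fusion under the assumption of 128 target channels; the layer configuration below is illustrative, not the patent's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFeaturePyramid(nn.Module):
    """Projects each encoder feature map to 128 channels and mixes it with the
    upsampled output of the level above (the lower-resolution level)."""

    def __init__(self, in_channels=(24, 32, 64, 320), out_channels=128):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):
        # feats are ordered from high resolution (1/4) to low resolution (1/32)
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        outputs = [None] * len(laterals)
        outputs[-1] = self.smooth[-1](laterals[-1])  # lowest resolution: convolution only
        for i in range(len(laterals) - 2, -1, -1):
            upsampled = F.interpolate(outputs[i + 1], scale_factor=2,
                                      mode="bilinear", align_corners=False)
            outputs[i] = self.smooth[i](laterals[i] + upsampled)
        return outputs  # all with 128 channels, resolutions 1/4 ... 1/32

fpn_out = SimpleFeaturePyramid()(
    [torch.rand(1, 24, 56, 56), torch.rand(1, 32, 28, 28),
     torch.rand(1, 64, 14, 14), torch.rand(1, 320, 7, 7)])
```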
And step 302C, inputting the m second sample feature maps into a multi-scale decoding network to obtain a first sample transparent channel image output by the multi-scale decoding network.
Wherein the resolution of the first sample transparent channel image is the same as the resolution of the sample image.
In one possible implementation manner, m second sample feature images output by the feature pyramid are input into a multi-scale decoding network, the multi-scale decoding network performs addition and resolution conversion operations on the second sample feature images to obtain a first sample transparent channel image corresponding to the sample image, and the first sample transparent channel image contains predicted transparency values corresponding to each pixel point in the sample image and is used for comparing with the sample segmentation image subsequently to calculate cross entropy loss.
Since the second sample feature maps correspond to different resolutions, the addition processing cannot be performed directly; as the highest resolution among them is 1/4, it is first necessary to unify the resolutions of the second sample feature maps to 1/4 of the original image. In one possible implementation, each second sample feature map is processed by convolution blocks, where different resolutions correspond to different types and numbers of convolution blocks.
The types of convolution blocks used by the multi-scale decoding network include cgr2x, sgr2x and sgr. cgr2x comprises a convolution layer, a group normalization (Group Normalization) layer, an activation function (ReLU) layer and a bilinear-interpolation 2× upsampling layer, where the convolution layer has the same number of input and output channels, for example 128 input channels and 128 output channels; sgr2x comprises a convolution layer, a Group Normalization layer, a ReLU layer and a bilinear-interpolation 2× upsampling layer, where the convolution layer has different numbers of input and output channels, for example 128 input channels and 64 output channels; sgr comprises a convolution layer, a Group Normalization layer and a ReLU layer, where the convolution layer has different numbers of input and output channels, for example 128 input channels and 64 output channels.
Schematically, as shown in fig. 6, a schematic diagram of the structure of each convolution block used by the multi-scale decoding network is shown. Fig. 6 (a) is a schematic diagram corresponding to cgr2x, fig. 6 (B) is a schematic diagram corresponding to sgr x, and fig. 6 (C) is a schematic diagram corresponding to sgr.
Schematically, as shown in fig. 5, the 4 second sample feature maps output by the feature pyramid network 502 are input into the multi-scale decoding network 503 and, through different convolution blocks, form 4 third sample feature maps (not shown in the figure) whose resolution is 1/4 of the original image. For example, for the second sample feature map 128×1/32, the corresponding third sample feature map can be obtained by passing through two cgr2x blocks and one sgr2x block in sequence; for the second sample feature map 128×1/16, the corresponding third sample feature map can be obtained by passing through one cgr2x block and one sgr2x block in sequence; similarly, the third sample feature maps corresponding to the remaining second sample feature maps can be obtained. The 4 obtained third sample feature maps are added, and convolution and 4× upsampling operations are then performed to obtain the first sample transparent channel image.
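For reference, the cgr2x, sgr2x and sgr blocks could be sketched as follows; the group count and kernel size are assumptions for illustration:

```python
import torch.nn as nn

def cgr2x(channels=128, groups=8):
    """Conv + GroupNorm + ReLU + bilinear 2x upsampling, same input/output channels."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1),
        nn.GroupNorm(groups, channels),
        nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    )

def sgr2x(in_channels=128, out_channels=64, groups=8):
    """Conv + GroupNorm + ReLU + bilinear 2x upsampling, with channel reduction."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 3, padding=1),
        nn.GroupNorm(groups, out_channels),
        nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    )

def sgr(in_channels=128, out_channels=64, groups=8):
    """Conv + GroupNorm + ReLU without upsampling, with channel reduction."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, 3, padding=1),
        nn.GroupNorm(groups, out_channels),
        nn.ReLU(inplace=True),
    )
```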
And step 302D, inputting the m second sample feature maps into a depth supervision network to obtain m third sample transparent channel images output by the depth supervision network.
The different second sample feature maps correspond to different upsampling multiples, and the resolution of the m third sample transparent channel images is the same as that of the sample images.
In one possible implementation manner, the m second sample feature maps output by the feature pyramid network are input into a depth supervision network, and the depth supervision network is used for carrying out up-sampling processing on the m second sample feature maps to obtain m third sample transparent channel images with the same resolution as the sample images, so as to provide cross entropy loss on different resolutions for the first prediction model.
The upsampling multiple corresponding to the different second sample feature map is related to the corresponding resolution, for example, the upsampling multiple corresponding to the second sample feature map with the resolution of 1/32 is 32 times, and the upsampling multiple corresponding to the second sample feature map with the resolution of 1/16 is 16 times.
Schematically, as shown in fig. 5, the second sample feature map 128×1/4 is up-sampled by 4 times to obtain a third sample transparent channel image 4, the second sample feature map 128×1/8 is up-sampled by 8 times to obtain a third sample transparent channel image 8, and similarly, the third sample transparent channel image 16 and the third sample transparent channel image 32 can be obtained.
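A sketch of such a deep-supervision branch; the 1×1 convolution and sigmoid used to reduce each side output to a single-channel map are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSupervisionHead(nn.Module):
    """Turns each 128-channel second sample feature map into a full-resolution
    single-channel transparent channel prediction via 1x1 conv + upsampling."""

    def __init__(self, channels=128, scales=(4, 8, 16, 32)):
        super().__init__()
        self.scales = scales
        self.heads = nn.ModuleList([nn.Conv2d(channels, 1, 1) for _ in scales])

    def forward(self, feats):
        # feats[i] has resolution 1/scales[i] of the sample image
        return [
            torch.sigmoid(F.interpolate(head(f), scale_factor=s,
                                        mode="bilinear", align_corners=False))
            for head, f, s in zip(self.heads, feats, self.scales)
        ]

side_outputs = DeepSupervisionHead()(
    [torch.rand(1, 128, 56, 56), torch.rand(1, 128, 28, 28),
     torch.rand(1, 128, 14, 14), torch.rand(1, 128, 7, 7)])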
In step 302E, binarization processing is performed on the first sample transparent channel image and the m third sample transparent channel images, so as to obtain a first sample segmentation image and m second sample segmentation images.
Because the first sample transparent channel image and the m third sample transparent channel images are probability maps, while the sample segmentation image used to accelerate the convergence of the first prediction model is binarized, binarization processing needs to be performed on the first sample transparent channel image and the m third sample transparent channel images to obtain the first sample segmentation image and the m second sample segmentation images, which can then be compared with the sample segmentation image to calculate the cross entropy loss of the first prediction model.
The binarization processing of the first sample transparent channel image and the m third sample transparent channel images may refer to the generation process of the sample segmentation image in the above embodiment, which is not described herein.
Illustratively, as shown in fig. 5, the binarization processing is performed on the 4 third sample transparent channel images, so as to obtain 4 second sample divided images, which are respectively represented as a second sample divided image 32, a second sample divided image 16, a second sample divided image 8, and a second sample divided image 4. And performing binarization processing on the first sample transparent channel image to obtain a first sample segmentation image.
Step 302F, training a first prediction model based on the first sample segmentation image, the m second sample segmentation images, and the sample segmentation image.
The loss of the first prediction model adopts the cross entropy loss, that is, the cross entropy loss between the first sample segmentation image and the sample segmentation image and the cross entropy losses between the m second sample segmentation images and the sample segmentation image are summed to obtain the cross entropy loss corresponding to the first prediction model.
Schematically, as shown in fig. 5, the cross entropy loss between the first sample segmentation image and the sample segmentation image and the cross entropy loss between each second sample segmentation image and the sample segmentation image are calculated respectively, and these cross entropy losses are summed to obtain the cross entropy loss corresponding to the first prediction model.
Wherein, the formula of the cross entropy loss can be expressed as:

$$\mathcal{L}_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i\log(p_i) + (1 - y_i)\log(1 - p_i)\Big] \qquad (1)$$

wherein, if $\mathcal{L}_{ce}$ is the cross entropy loss between the sample segmentation image and the first sample segmentation image, $y_i$ represents the value of the i-th pixel point in the sample segmentation image corresponding to the sample image, $p_i$ represents the value of the i-th pixel point in the first sample segmentation image, and N is the number of pixel points; the loss over all pixel points is the average of the per-pixel log loss, which is 0 in the ideal case.
Illustratively, the corresponding loss of the first prediction model may be expressed as:

$$\mathcal{L}_{total} = \mathcal{L}_{out} + \mathcal{L}_{32} + \mathcal{L}_{16} + \mathcal{L}_{8} + \mathcal{L}_{4} \qquad (2)$$

wherein $\mathcal{L}_{total}$ represents the comprehensive loss corresponding to the first prediction model, $\mathcal{L}_{out}$ represents the cross entropy loss of the first sample segmentation image and the sample segmentation image, $\mathcal{L}_{32}$ represents the cross entropy loss of the second sample segmentation image 32 and the sample segmentation image, $\mathcal{L}_{16}$ represents the cross entropy loss of the second sample segmentation image 16 and the sample segmentation image, $\mathcal{L}_{8}$ represents the cross entropy loss of the second sample segmentation image 8 and the sample segmentation image, and $\mathcal{L}_{4}$ represents the cross entropy loss of the second sample segmentation image 4 and the sample segmentation image.
In one possible implementation manner, the comprehensive loss corresponding to the first prediction model may be calculated according to the above formula (1) and formula (2), so that the back propagation algorithm is performed on the first prediction model by using the comprehensive loss, and each parameter in the first prediction model is updated.
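A minimal sketch of the summed cross entropy loss of formula (2). In this sketch the loss is computed on the probability maps rather than on hard-binarized predictions (binarization is not differentiable), which is a common practical substitute and an assumption here:

```python
import torch
import torch.nn.functional as F

def first_model_loss(main_pred, side_preds, target):
    """main_pred: first sample segmentation prediction (B, 1, H, W) in [0, 1];
    side_preds: list of the m second sample segmentation predictions;
    target: binary sample segmentation image (B, 1, H, W)."""
    loss = F.binary_cross_entropy(main_pred, target)
    for pred in side_preds:
        loss = loss + F.binary_cross_entropy(pred, target)
    return loss

main = torch.rand(1, 1, 64, 64, requires_grad=True)
sides = [torch.rand(1, 1, 64, 64) for _ in range(4)]
target = (torch.rand(1, 1, 64, 64) > 0.8).float()
loss = first_model_loss(main, sides, target)
loss.backward()  # back propagation to update the first prediction model
```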
Optionally, the training of the first prediction model is repeated according to the method shown in the foregoing embodiment over a plurality of training periods until the loss function corresponding to the first prediction model has fully converged, thereby completing the training of the first prediction model; the first prediction model is then saved, and its parameters are not frozen.
In this embodiment, the first prediction model is trained by the acquired sample image and the sample segmentation image, and cross entropy loss is introduced in the training process of the first prediction model, so that the first prediction model can be quickly converged, and the training efficiency of the first prediction model is improved.
In one possible implementation, after the training of the first prediction model is completed, the sample image may be input into the trained first prediction model to obtain a first sample transparent channel image for use in the training process of the second prediction model.
Referring to fig. 7, a flowchart of a training method of the second prediction model according to an exemplary embodiment of the present application is shown. The method comprises the following steps:
and 701, inputting the sample image into a first prediction model obtained by training to obtain a first sample transparent channel image output by the first prediction model.
Since the training of the second prediction model depends on the output result of the first prediction model, that is, the corresponding first sample transparent channel image needs to be output by the first prediction model, the second prediction model needs to be trained after the training of the first prediction model has been completed.
In one possible implementation manner, a sample image is input into a first prediction model obtained through training, and a first transparent channel image corresponding to the sample image output by the first prediction model is obtained and used for a subsequent training process of a second prediction model.
Alternatively, each sample image in the dataset may be input into the first prediction model to obtain a first sample transparent channel image corresponding to each sample image, so that the sample image and the corresponding first sample transparent channel image are used as the dataset for training the second prediction model.
Step 702, training a second prediction model according to the first sample transparent channel image, the sample annotation image and the sample image.
Since the purpose of the second prediction model is to improve the fineness of each pixel point in the first transparent channel image, and the sample annotation image is obviously finer than the binarized sample segmentation image, in one possible implementation, the second prediction model is trained with the first sample transparent channel image, the sample annotation image and the sample image.
In one possible implementation, to improve the accuracy of the second sample transparent channel image output by the second prediction model, losses other than the basic matting loss are introduced, such as the connectivity difference loss, the structural similarity loss and the edge gradient loss. These losses all pay more attention to the transparency channel values of the boundary region between the foreground image and the background image, so that the second sample transparent channel image output by the refinement network has higher accuracy than the first sample transparent channel image.
Illustratively, as shown in FIG. 8, step 702 may include steps 702A through 702F.
Step 702A, inputting the first sample transparent channel image and the sample image into a refinement network, and obtaining a second sample transparent channel image output by the refinement network.
In a possible implementation manner, the second prediction model mainly comprises a refinement network. The refinement network performs convolution processing on the first sample transparent channel image and the sample image, so that erroneous transparency channel values in the first sample transparent channel image can be corrected and the transparency channel values of the boundary region between the foreground image and the background image can be refined, thereby improving the fineness of the transparency channel value corresponding to each pixel point and outputting the second sample transparent channel image.
Optionally, after the first sample transparent channel image and the sample image are concatenated (Concat), the result may be input into the refinement network.
Optionally, the refinement network may comprise three convolution blocks and one convolution layer, which reduces the amount of computation.
Schematically, as shown in fig. 9, the sample image is input into the trained first prediction model 901 to obtain the first sample transparent channel image corresponding to the sample image; after the first sample transparent channel image and the sample image are concatenated, the result is input into the refinement network 902, which outputs the second sample transparent channel image. The refinement network 902 is composed of three convolution blocks and one convolution layer.
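A sketch of a refinement network built from three convolution blocks and a final convolution layer; the channel width, normalization and activation choices are assumptions for illustration:

```python
import torch
import torch.nn as nn

class RefinementNetwork(nn.Module):
    """Takes the 4-channel Concat of the sample image (3 channels) and the
    first sample transparent channel image (1 channel) and outputs a refined
    single-channel transparent channel image."""

    def __init__(self, width=64):
        super().__init__()

        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.BatchNorm2d(cout),
                                 nn.ReLU(inplace=True))

        self.body = nn.Sequential(block(4, width), block(width, width),
                                  block(width, width),
                                  nn.Conv2d(width, 1, 3, padding=1))

    def forward(self, image, first_alpha):
        x = torch.cat([image, first_alpha], dim=1)
        return torch.sigmoid(self.body(x))

refined = RefinementNetwork()(torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256))
```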
Step 702B, inputting the sample image, the second sample transparent channel image and the sample labeling image into an edge gradient network, so as to obtain an edge gradient loss corresponding to the second sample transparent channel image.
In one possible implementation manner, for the calculation mode of the edge gradient loss, a network for specially calculating the edge gradient loss, namely an edge gradient network is provided, and the edge gradient loss corresponding to the second sample transparent channel image can be obtained by inputting the sample image, the second sample transparent channel image and the sample labeling image into the edge gradient network, so that the edge gradient loss on the edge is provided for the subsequent training process.
In one possible embodiment, the process of obtaining the edge gradient loss may include the steps of:
1. Inputting the sample image into a preset operator to obtain a sample gradient image corresponding to the sample image, wherein the preset operator is used for performing a first-order derivative operation on the sample image.
Since the image edge is the interface area between the foreground image and the background image, to obtain the edge gradient loss, it is necessary to first obtain edge images respectively corresponding to the sample image and the second sample transparent channel image.
In one possible implementation, a preset operator is arranged in the edge gradient network, and the preset operator can perform a first derivative operation on the sample image to obtain gradients of the sample image in x and y directions, so as to output the sample gradient image.
Alternatively, the preset operator may use the Sobel operator, or may use other filtering operators that generate image gradients, such as the Scharr operator or the Laplace operator. The preset operator adopted in the embodiments of the present application is not limited.
Illustratively, taking the Sobel operator as an example, the process of generating a sample gradient image can be expressed as:

$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * A \qquad (3)$$

$$G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix} * A \qquad (4)$$

$$G = \sqrt{G_x^2 + G_y^2} \qquad (5)$$

wherein A represents the input sample image, $G_x$ represents the gradient map of the sample image in the x direction, $G_y$ represents the gradient map of the sample image in the y direction, and G represents the sample gradient image output after passing through the Sobel operator.
In one possible implementation manner, the gradient operation is performed on the sample image according to formula (3) and formula (4) to obtain the gradient maps of the sample image in the x and y directions, and the gradient maps are then brought into formula (5), so that the sample gradient image corresponding to the sample image can be obtained by calculation.
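The Sobel step can be sketched with fixed convolution kernels; this is a standard implementation shown for illustration and assumes the sample image has first been reduced to a single grayscale channel:

```python
import torch
import torch.nn.functional as F

def sobel_gradient(gray: torch.Tensor) -> torch.Tensor:
    """gray: (B, 1, H, W) grayscale image; returns the gradient magnitude G."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = torch.tensor([[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]]).view(1, 1, 3, 3)
    gx = F.conv2d(gray, kx, padding=1)  # gradient in the x direction
    gy = F.conv2d(gray, ky, padding=1)  # gradient in the y direction
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

grad = sobel_gradient(torch.rand(1, 1, 128, 128))
```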
2. And performing binarization and dilation and erosion operations on the sample annotation image to obtain a sample edge image, wherein the sample edge image is used for indicating the boundary area of the foreground image and the background image in the sample annotation image.
In one possible embodiment, the sample annotation image is subjected to binarization and dilation-erosion operations to obtain a sample edge image, which is used to delimit the edge regions in the second sample transparent channel image and the sample gradient image.
3. And generating a sample edge gradient image according to the sample edge image and the sample gradient image, wherein the sample edge gradient image is used for indicating the boundary area of the foreground image and the background image in the sample image.
In one possible implementation manner, the sample edge image and the sample gradient image are multiplied, and the boundary area of the foreground image and the background image in the sample image can be divided from the sample gradient image, so as to obtain the sample edge gradient image.
4. And generating an edge transparent channel image according to the second sample transparent channel image and the sample edge image, wherein the edge transparent channel image is used for indicating the boundary area of the foreground image and the background image in the second sample transparent channel image.
In one possible implementation manner, the second sample transparent channel image and the sample edge image are multiplied, and the boundary area of the foreground image and the background image in the second sample transparent channel image can be divided, so that the corresponding edge transparent channel image is obtained.
5. And calculating according to the edge transparent channel image and the sample edge gradient image to obtain edge gradient loss.
In one possible implementation manner, according to the obtained edge transparent channel image and the sample edge gradient image, the edge gradient loss corresponding to the second sample transparent channel image can be calculated.
Illustratively, the manner in which the edge gradient loss is calculated can be expressed as:

$$\mathcal{L}_{grad} = \big\| G_{input}\cdot E_{label} - G_{Refined,Mask}\cdot E_{label} \big\|_1 \qquad (6)$$

wherein $\mathcal{L}_{grad}$ represents the edge gradient loss corresponding to the second sample transparent channel image, $G_{input}$ represents the sample gradient image, $E_{label}$ represents the sample edge image, $G_{Refined,Mask}$ represents the second sample transparent channel image, and $\|\cdot\|_1$ indicates that the edge gradient loss is calculated using the $L_1$ norm.
Schematically, as shown in fig. 9, the sample image is input into the edge gradient network 903 and first passes through the Sobel operator to obtain the sample gradient image; the sample annotation image is input into the edge gradient network 903 and, after binarization and dilation-erosion operations, the sample edge image is obtained; the second sample transparent channel image is input into the edge gradient network 903 and multiplied by the sample edge image to obtain the edge transparent channel image; the sample gradient image and the sample edge image are multiplied to obtain the sample edge gradient image; and the edge gradient loss is calculated from the sample edge gradient image and the edge transparent channel image.
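Putting these pieces together, a hedged sketch of the edge gradient loss; the dilation-erosion kernel size, the max-pooling morphology trick and the mean-based L1 normalization are assumptions:

```python
import torch
import torch.nn.functional as F

def edge_mask(annotation: torch.Tensor, threshold: float = 0.5, kernel: int = 5):
    """Binarize the sample annotation image and take dilation minus erosion
    as the boundary region between foreground and background."""
    binary = (annotation > threshold).float()
    pad = kernel // 2
    dilated = F.max_pool2d(binary, kernel, stride=1, padding=pad)
    eroded = -F.max_pool2d(-binary, kernel, stride=1, padding=pad)
    return dilated - eroded

def edge_gradient_loss(sample_gradient, refined_alpha, annotation):
    """L1 difference between the sample edge gradient image and the edge
    transparent channel image, computed only on the edge region."""
    edge = edge_mask(annotation)
    return torch.mean(torch.abs(sample_gradient * edge - refined_alpha * edge))

loss = edge_gradient_loss(torch.rand(1, 1, 64, 64),
                          torch.rand(1, 1, 64, 64),
                          torch.rand(1, 1, 64, 64))
```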
Step 702C, calculating structural similarity loss and matting loss corresponding to the second sample transparent channel image according to the second sample transparent channel image and the sample labeling image.
The calculation formula of the matting loss can be expressed as:
$$\mathcal{L}_{matting} = \sum_{i} \sqrt{\left(\alpha_{p}^{i} - \alpha_{g}^{i}\right)^{2} + \epsilon^{2}}$$

wherein $\mathcal{L}_{matting}$ represents the matting loss corresponding to the second sample transparent channel image, $\alpha_{p}^{i}$ represents the transparency channel value corresponding to the i-th pixel point in the second sample transparent channel image, $\alpha_{g}^{i}$ represents the transparency channel value corresponding to the i-th pixel point in the sample labeling image, and $\epsilon$ is a constant.
In one possible implementation manner, the second sample transparent channel image and the sample labeling image are substituted into the above formula, so that the matting loss corresponding to the second sample transparent channel image can be obtained.
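A minimal sketch of this matting loss, assuming a smoothed-L1 (Charbonnier-style) form, an illustrative value for the constant $\epsilon$, and averaging over pixels rather than summing:

```python
import torch

def matting_loss(alpha_pred, alpha_label, eps=1e-6):
    # Smoothed L1 difference between the predicted transparency values
    # (second sample transparent channel image) and the annotated ones,
    # averaged over all pixel points; eps is the small constant in the formula.
    diff = alpha_pred - alpha_label
    return torch.sqrt(diff * diff + eps * eps).mean()
```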
The calculation formula of the structural similarity loss can be expressed as:
$$SSIM(x, y) = \frac{(2\mu_{x}\mu_{y} + C_{1})(2\sigma_{xy} + C_{2})}{(\mu_{x}^{2} + \mu_{y}^{2} + C_{1})(\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2})}, \qquad \mathcal{L}_{SSIM} = 1 - SSIM(x, y)$$

wherein $SSIM(x, y)$ represents the structural similarity index, $\mathcal{L}_{SSIM}$ is the structural similarity loss corresponding to the second sample transparent channel image, $\mu_{x}$ is the mean value of the sample labeling image, $\sigma_{x}^{2}$ is the variance of the sample labeling image, $\mu_{y}$ is the mean value of the second sample transparent channel image, $\sigma_{y}^{2}$ is the variance of the second sample transparent channel image, $\sigma_{xy}$ is the covariance of the two images, and $C_{1}$ and $C_{2}$ are constants.
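The structural similarity loss can likewise be sketched in a few lines, assuming the common global-statistics form $\mathcal{L}_{SSIM} = 1 - SSIM(x, y)$ and illustrative values for $C_1$ and $C_2$; a windowed SSIM could be substituted without changing the interface.

```python
import torch

def ssim_loss(alpha_pred, alpha_label, c1=0.01 ** 2, c2=0.03 ** 2):
    # x: sample labeling image, y: second sample transparent channel image.
    mu_x, mu_y = alpha_label.mean(), alpha_pred.mean()
    var_x, var_y = alpha_label.var(), alpha_pred.var()
    cov_xy = ((alpha_label - mu_x) * (alpha_pred - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim
```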
And 702D, inputting the second sample transparent channel image and the sample labeling image into a connectivity difference network to obtain connectivity difference loss corresponding to the second sample transparent channel image.
Connectivity refers to whether adjacent pixels (those above, below, to the left and to the right of a given pixel) in a grayscale image share the same value. If the second prediction model predicts well, the predicted second sample transparent channel image and the sample labeling image should have similar connectivity maps, that is, their connectivity should also be similar.
In one possible implementation manner, a connectivity difference network is preset by the developer, and the second sample transparent channel image and the sample labeling image can be input into the connectivity difference network to calculate the connectivity difference loss corresponding to the second sample transparent channel image.
Wherein, the calculation formula of connectivity difference loss can be expressed as:
$$\mathcal{L}_{conn} = \sum_{i} \left| \varphi\left(p_{i}, \Omega\right) - \varphi\left(p_{i}^{label}, \Omega\right) \right|$$

wherein $\mathcal{L}_{conn}$ represents the cumulative sum of the connectivity differences of all pixel points, and $\Omega$ represents the largest connected region in which both the second sample transparent channel image and the sample labeling image take the value 1. The function $\varphi\left(p_{i}, \Omega\right)$ calculates the degree of connectivity between the i-th pixel $p_{i}$ of the second sample transparent channel image and $\Omega$, a value of 1 indicating full connectivity and 0 indicating no connectivity, and $p_{i}^{label}$ represents the i-th pixel point on the sample labeling image.

Wherein, the function $\varphi$ may be expressed in the following form:

$$\varphi\left(p_{i}, \Omega\right) = 1 - \lambda_{i} \cdot \delta\left(d_{i} \ge \theta\right) \cdot d_{i}, \qquad d_{i} = p_{i} - l_{i}$$

where $\theta$ is a threshold parameter, $d_{i}$ represents the distance from the current pixel value $p_{i}$ to the critical threshold $l_{i}$, and values of $d_{i}$ smaller than $\theta$ are ignored ($\delta\left(d_{i} \ge \theta\right)$ takes the value 1 when $d_{i} \ge \theta$ and 0 otherwise). $K$ represents the set of discrete pixel values between $p_{i}$ and $l_{i}$, $dist_{k}(i)$ represents the normalized Euclidean distance from pixel $i$ to the nearest pixel connected to the source region at threshold value $k$, and the weight $\lambda_{i}$ is computed from $dist_{k}(i)$ over the thresholds in $K$.
Illustratively, as shown in fig. 9, the second sample transparent channel image and the sample label image are input into the connectivity difference network 904, so that connectivity difference loss output by the connectivity difference network 904 can be obtained.
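The symbols above ($\Omega$, $d_i$, $\theta$, $l_i$, $dist_k(i)$) match the connectivity measure commonly used to evaluate matting results, so the connectivity difference can be approximated as below. This is a sketch under that assumption: the threshold step and $\theta$ are illustrative values, the $dist_k(i)$ weighting is omitted for brevity, and a deployed connectivity difference network would normally use a differentiable approximation of this computation.

```python
import numpy as np
from scipy import ndimage

def connectivity_difference(alpha_pred, alpha_label, step=0.1, theta=0.15):
    # alpha_pred / alpha_label: float arrays in [0, 1], shape (H, W).
    # For each threshold, Omega is the largest region where both images exceed it;
    # l_map records, per pixel, the last threshold at which it was still connected.
    thresholds = np.arange(0.0, 1.0 + step, step)
    l_map = -np.ones_like(alpha_pred)
    for i in range(1, len(thresholds)):
        joint = (alpha_pred >= thresholds[i]) & (alpha_label >= thresholds[i])
        labels, num = ndimage.label(joint)
        if num == 0:
            omega = np.zeros_like(joint, dtype=bool)
        else:
            sizes = np.bincount(labels.ravel())
            sizes[0] = 0
            omega = labels == sizes.argmax()      # largest shared connected region
        newly_cut = (l_map == -1) & (~omega)
        l_map[newly_cut] = thresholds[i - 1]      # critical threshold l_i
    l_map[l_map == -1] = 1.0                      # pixels connected at every threshold

    d_pred = alpha_pred - l_map                   # d_i = p_i - l_i
    d_label = alpha_label - l_map
    phi_pred = 1.0 - d_pred * (d_pred >= theta)   # differences below theta are ignored
    phi_label = 1.0 - d_label * (d_label >= theta)
    return np.abs(phi_pred - phi_label).sum()
```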
Step 702E, training the refinement network according to the edge gradient loss, connectivity difference loss, matting loss and structural similarity loss.
In one possible implementation, by training the refinement network with a combination of the multiple losses obtained in the above examples, the fineness of the generated second sample transparent channel image can be significantly improved compared with using only the matting loss.
And step 702F, determining the refined network obtained through training as a second prediction model.
In one possible implementation manner, a back propagation algorithm is executed on the refinement network to update the parameters of each convolution layer of the refinement network, and the training process in the above embodiment is repeated in each training period until the loss function corresponding to the second prediction model converges, after which the trained refinement network is determined as the second prediction model.
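A hedged sketch of one training iteration that ties the four losses together; the loss weights, the optimizer, and the helper names edge_gradient_loss_t and connectivity_loss_t (hypothetical differentiable counterparts of the NumPy sketches above) are assumptions for illustration only.

```python
import torch

def train_refine_step(refine_net, optimizer, sample_img, alpha_coarse, alpha_label,
                      w_mat=1.0, w_ssim=1.0, w_edge=1.0, w_conn=1.0):
    # Predict the second sample transparent channel image from the sample image
    # and the first sample transparent channel image, then back-propagate the
    # weighted sum of the four losses to update the refinement network.
    alpha_refined = refine_net(torch.cat([sample_img, alpha_coarse], dim=1))
    loss = (w_mat * matting_loss(alpha_refined, alpha_label)
            + w_ssim * ssim_loss(alpha_refined, alpha_label)
            + w_edge * edge_gradient_loss_t(alpha_refined, sample_img, alpha_label)
            + w_conn * connectivity_loss_t(alpha_refined, alpha_label))
    optimizer.zero_grad()
    loss.backward()      # back propagation algorithm
    optimizer.step()     # update the parameters of each convolution layer
    return loss.item()
```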
In this embodiment, the refinement network is trained by introducing a plurality of loss functions, namely the connectivity difference loss, the edge gradient loss, the matting loss and the structural similarity loss, so that the second sample transparent channel image output by the refinement network pays more attention to the transparency channel values in the edge region, which helps to improve the accuracy of image segmentation.
In one possible implementation, after training of the first prediction model and the second prediction model is completed according to the methods shown in the foregoing embodiments, the trained prediction models may be deployed on a computer device, and the segmentation processing of the original image may be implemented using the first prediction model and the second prediction model.
Referring to fig. 10, a flowchart of an image processing method according to another exemplary embodiment of the present application is shown. This embodiment is described by taking application of the method to a computer device as an example, and the method includes the following steps.
In step 1001, an original image is acquired.
For the implementation of this step, reference may be made to step 201, and details are not repeated in this embodiment.
Step 1002, inputting an original image into a multi-scale coding network to obtain n first feature maps output by the multi-scale coding network, wherein the resolutions and the channel numbers of different first feature maps are different, and n is an integer greater than or equal to 2.
Step 1003, inputting the n first feature maps into a feature pyramid network to obtain n second feature maps output by the feature pyramid network, wherein the channel numbers of the different second feature maps are the same and the resolutions are different, and the channel numbers of the n second feature maps are the target channel numbers.
Illustratively, as shown in FIG. 11, step 1003 may include step 1003A, step 1003B, and step 1003C.
In step 1003A, the n first feature graphs are arranged according to the resolution to form a feature pyramid, where the resolution of the first feature graphs in the feature pyramid and the level where the first feature graphs are located are in a negative correlation relationship.
In step 1003B, in response to the number of channels corresponding to the nth first feature map being the maximum number of channels, convolution processing is performed on the nth first feature map to obtain an nth second feature map.
In step 1003C, in response to the number of channels corresponding to the nth first feature map not being the maximum number of channels, performing convolution processing on the nth first feature map to obtain a fourth feature map, performing convolution and upsampling processing on the (n+1) th first feature map to obtain a fifth feature map, mixing the fourth feature map and the fifth feature map, and performing convolution processing to obtain the nth second feature map.
It should be noted that step 1003B and step 1003C may be performed simultaneously; step 1003B may be performed before step 1003C; or step 1003C may be performed first and then step 1003B. The order of execution of step 1003B and step 1003C is not limited in this embodiment.
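As a sketch of steps 1003A-1003C (see also fig. 11), the feature pyramid can be realized in PyTorch roughly as follows: the deepest first feature map, which has the maximum channel number, passes only through a 1×1 convolution, while every shallower level mixes its own 1×1-convolved features with the convolved and upsampled output of the level above before a final 3×3 convolution. The target channel number, kernel sizes and interpolation mode are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    def __init__(self, in_channels, target_channels=256):
        # in_channels: channel count of each first feature map, shallow -> deep.
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, target_channels, kernel_size=1) for c in in_channels])
        self.smooth = nn.ModuleList(
            [nn.Conv2d(target_channels, target_channels, kernel_size=3, padding=1)
             for _ in in_channels])

    def forward(self, feats):
        # feats: first feature maps ordered shallow (high resolution) -> deep.
        outs = [None] * len(feats)
        prev = self.lateral[-1](feats[-1])            # deepest level: convolution only
        outs[-1] = self.smooth[-1](prev)
        for i in range(len(feats) - 2, -1, -1):
            lateral = self.lateral[i](feats[i])       # convolve the current first feature map
            top_down = F.interpolate(prev, size=lateral.shape[-2:], mode='nearest')
            prev = lateral + top_down                 # mix with the upsampled upper level
            outs[i] = self.smooth[i](prev)            # convolve to get the second feature map
        return outs      # same channel number (target_channels), different resolutions
```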
Step 1004, inputting the n second feature maps into a multi-scale decoding network to obtain a first transparent channel image output by the multi-scale decoding network.
Illustratively, as shown in FIG. 11, step 1004 includes step 1004A and step 1004B.
In step 1004A, n second feature maps are processed by convolution blocks respectively to obtain n third feature maps, where the resolutions corresponding to the n third feature maps are the same, different second feature maps correspond to different convolution blocks, and the numbers of convolution blocks corresponding to the different second feature maps are different.
Step 1004B, adding, convolving and upsampling the n third feature maps to obtain a first transparent channel image.
The above process of generating the first transparent channel image may refer to the training process of the first prediction model in the above embodiment, and details are not repeated here.
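Under the same assumptions, steps 1004A-1004B could look like the sketch below: each second feature map goes through its own stack of convolution blocks (deeper levels get more blocks so that all third feature maps reach the same resolution), the third feature maps are added element-wise, and a final convolution plus upsampling produces the single-channel first transparent channel image. The block design, the number of pyramid levels and the final ×4 upsampling factor are assumptions.

```python
import torch
import torch.nn as nn

def up_conv_block(channels):
    # One convolution block that also doubles the spatial resolution.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.ReLU(inplace=True))

class MultiScaleDecoder(nn.Module):
    def __init__(self, channels=256, num_levels=4):
        super().__init__()
        # Level i needs i upsampling blocks to match the shallowest level's resolution,
        # assuming each pyramid level halves the resolution of the one below it.
        self.branches = nn.ModuleList(
            [nn.Sequential(*[up_conv_block(channels) for _ in range(i)]) if i > 0
             else nn.Identity()
             for i in range(num_levels)])
        self.head = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
            nn.Sigmoid())  # assumes the shallowest level sits at 1/4 input resolution

    def forward(self, second_feats):
        # second_feats: second feature maps ordered shallow -> deep.
        third_feats = [branch(f) for branch, f in zip(self.branches, second_feats)]
        fused = torch.stack(third_feats, dim=0).sum(dim=0)   # element-wise addition
        return self.head(fused)          # first transparent channel image, one channel
```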
Step 1005, inputting the first transparent channel image and the original image into a second prediction model to obtain a second transparent channel image output by the second prediction model, wherein the fineness of the second transparent channel image is higher than that of the first transparent channel image.
In step 1006, the original image is segmented according to the second transparent channel image, so as to obtain an image corresponding to the target object.
For the implementation of step 1005 and step 1006, reference may be made to step 201 and step 202, and details are not repeated in this embodiment.
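As a small usage illustration of step 1006, the second transparent channel image can be applied as an alpha matte: multiplying it with the original image keeps the target object and suppresses the background, and it can optionally composite the object onto a new background. The sketch assumes a single-channel float matte in [0, 1].

```python
import numpy as np

def extract_target(original_bgr, alpha, background=None):
    # alpha: second transparent channel image, float32 in [0, 1], shape (H, W).
    alpha3 = alpha[..., None]                        # broadcast over color channels
    foreground = original_bgr.astype(np.float32) * alpha3
    if background is None:
        return foreground.astype(np.uint8)           # target object on a black background
    composite = foreground + background.astype(np.float32) * (1.0 - alpha3)
    return composite.astype(np.uint8)                # target object on the new background
```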
In the embodiment of the present application, the trained first prediction model and second prediction model are deployed; the preprocessed original image is input into the first prediction model to obtain the first transparent channel image output by the first prediction model, and the generated first transparent channel image and the original image are then input into the second prediction model to obtain the second transparent channel image output by the second prediction model, so that the second transparent channel image can be used for image processing.
Referring to fig. 12, a network deployment diagram of an image processing method according to an exemplary embodiment of the present application is shown. The network deployment map comprises: a multi-scale encoding network, a feature pyramid network, a multi-scale decoding network, and a refinement network.
In a possible implementation manner, after the original image is preprocessed, the original image is input into the multi-scale coding network 1201, so as to obtain n first feature maps output by the multi-scale coding network 1201; inputting the n first feature images into the feature pyramid network 1202 to obtain n second feature images output by the feature pyramid network 1202, wherein the number of channels of the n second feature images is the target number of channels; inputting the n second feature maps into a multi-scale decoding network 1203, and obtaining a first transparent channel image output by the multi-scale decoding network 1203 after adding, resolution conversion and other operations; the first transparent channel image and the original image are input into a refinement network 1204 to obtain a second transparent channel image output by the refinement network 1204, so that the original image is segmented by the second transparent channel image to obtain an image corresponding to the target object.
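Putting the deployed networks of fig. 12 together, inference can be organized as in the sketch below; the module names stand in for the multi-scale coding network 1201, feature pyramid network 1202, multi-scale decoding network 1203 and refinement network 1204, and preprocessing (resizing, normalization) is assumed to have already been applied to the input tensor.

```python
import torch

@torch.no_grad()
def predict_alpha(encoder, fpn, decoder, refine_net, image):
    # image: preprocessed original image tensor of shape (1, 3, H, W).
    first_feats = encoder(image)           # n first feature maps
    second_feats = fpn(first_feats)        # n second feature maps, same channel number
    alpha_coarse = decoder(second_feats)   # first transparent channel image
    alpha_refined = refine_net(torch.cat([image, alpha_coarse], dim=1))
    return alpha_refined                   # second transparent channel image
```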
Referring to fig. 13, a block diagram of an image processing apparatus according to an exemplary embodiment of the present application is shown. The apparatus may be implemented as all or part of a computer device by software, hardware, or a combination of both, the apparatus comprising:
a first obtaining module 1301, configured to obtain an original image, where the original image includes at least one target object;
a first prediction module 1302, configured to input the original image into a first prediction model, and obtain a first transparent channel image output by the first prediction model, where the first transparent channel image includes predicted transparency values corresponding to each pixel point in the original image;
the second prediction module 1303 is configured to input the first transparent channel image and the original image into a second prediction model, and obtain a second transparent channel image output by the second prediction model, where the fineness of the second transparent channel image is higher than that of the first transparent channel image;
and a segmentation processing module 1304, configured to perform segmentation processing on the original image according to the second transparent channel image, so as to obtain an image corresponding to the target object.
Optionally, the apparatus further includes:
The second acquisition module is used for acquiring a sample image, a sample labeling image and a sample segmentation image, wherein the sample labeling image is labeled with transparency values corresponding to all pixel points in the sample image, and the sample segmentation image is a binarized image obtained by performing binarization processing on the sample labeling image;
the first training module is used for training the first prediction model according to the sample image and the sample segmentation image;
the third prediction module is used for inputting the sample image into the first prediction model obtained through training to obtain a first sample transparent channel image output by the first prediction model;
and the second training module is used for training the second prediction model according to the first sample transparent channel image, the sample labeling image and the sample image.
Optionally, the second training module includes:
the refinement unit is used for inputting the first sample transparent channel image and the sample image into a refinement network to obtain a second sample transparent channel image output by the refinement network;
the edge gradient unit is used for inputting the sample image, the second sample transparent channel image and the sample labeling image into an edge gradient network to obtain edge gradient loss corresponding to the second sample transparent channel image;
The calculating unit is used for calculating the structural similarity loss and the matting loss corresponding to the second sample transparent channel image according to the second sample transparent channel image and the sample labeling image;
the connectivity difference unit is used for inputting the second sample transparent channel image and the sample labeling image into a connectivity difference network to obtain a connectivity difference loss corresponding to the second sample transparent channel image;
a first training unit, configured to train the refinement network according to the edge gradient loss, the connectivity difference loss, the matting loss, and the structural similarity loss;
and the determining unit is used for determining the refined network obtained through training as the second prediction model.
Optionally, the edge gradient unit is further configured to:
inputting the sample image into a preset operator to obtain a sample gradient image corresponding to the sample image, wherein the preset operator is used for performing a first-order derivative operation on the sample image;
performing binarization and dilation-erosion operation on the sample annotation image to obtain a sample edge image, wherein the sample edge image is used for indicating a boundary area of a foreground image and a background image in the sample annotation image;
Generating a sample edge gradient image according to the sample edge image and the sample gradient image, wherein the sample edge gradient image is used for indicating the boundary area of a foreground image and a background image in the sample image;
generating an edge transparent channel image according to the second sample transparent channel image and the sample edge image, wherein the edge transparent channel image is used for indicating the boundary area of a foreground image and a background image in the second sample transparent channel image;
and calculating the edge gradient loss according to the edge transparent channel image and the sample edge gradient image.
Optionally, the first prediction model includes a multi-scale encoding network, a feature pyramid network, a multi-scale decoding network, and a depth supervision network;
optionally, the first training module includes:
the first multi-scale coding unit is used for inputting the sample image into the multi-scale coding network to obtain m first sample feature images output by the multi-scale coding network, wherein the resolutions and the channel numbers of different first sample feature images are different, m is an integer greater than or equal to 2, and the multi-scale coding network is used for extracting the features of the sample image;
The first feature pyramid unit is used for inputting the m first sample feature graphs into the feature pyramid network to obtain m second sample feature graphs output by the feature pyramid network, wherein the channel numbers of different second sample feature graphs are the same and the resolutions are different, and the feature pyramid network is used for processing the channel numbers of the m first sample feature graphs into target channel numbers;
the first multi-scale decoding unit is used for inputting the m second sample feature images into the multi-scale decoding network to obtain the first sample transparent channel image output by the multi-scale decoding network, the multi-scale decoding network is used for carrying out addition and resolution conversion operation on the m second sample feature images, and the resolution of the first sample transparent channel image is the same as that of the sample image;
the depth supervision unit is used for inputting the m second sample feature images into the depth supervision network to obtain m third sample transparent channel images output by the depth supervision network, the depth supervision network is used for carrying out up-sampling processing on the m second sample feature images, different second sample feature images correspond to different up-sampling multiples, and the resolution of the m third sample transparent channel images is the same as that of the sample images;
The binarization processing unit is used for performing binarization processing on the first sample transparent channel image and the m third sample transparent channel images to obtain a first sample segmentation image and m second sample segmentation images;
and the second training unit is used for training the first prediction model according to the first sample segmentation image, the m second sample segmentation images and the sample segmentation image.
Optionally, the first prediction module 1302 includes:
the second multi-scale coding unit is used for inputting the original image into the multi-scale coding network to obtain n first feature images output by the multi-scale coding network, wherein the resolutions and the channel numbers of different first feature images are different, and n is an integer greater than or equal to 2;
the second feature pyramid unit is used for inputting n first feature graphs into the feature pyramid network to obtain n second feature graphs output by the feature pyramid network, wherein the channel numbers of different second feature graphs are the same and the resolution is different, and the channel numbers of the n second feature graphs are the target channel numbers;
and the second multi-scale decoding unit is used for inputting the n second feature maps into the multi-scale decoding network to obtain the first transparent channel image output by the multi-scale decoding network.
Optionally, the second feature pyramid unit is further configured to:
arranging n first feature graphs according to resolution to form a feature pyramid, wherein the resolution of the first feature graphs in the feature pyramid and the level of the first feature graphs are in negative correlation;
responding to the channel number corresponding to the nth first feature map as the maximum channel number, and carrying out convolution processing on the nth first feature map to obtain an nth second feature map;
and responding to the fact that the channel number corresponding to the nth first feature map is not the maximum channel number, carrying out convolution processing on the nth first feature map to obtain a fourth feature map, carrying out convolution and up-sampling processing on the (n+1) th first feature map to obtain a fifth feature map, mixing the fourth feature map and the fifth feature map, and carrying out convolution processing to obtain the nth second feature map.
Optionally, the second multi-scale decoding unit is further configured to:
processing the n second feature images through convolution blocks to obtain n third feature images, wherein the resolutions corresponding to the n third feature images are the same, different second feature images correspond to different convolution blocks, and the number of convolution blocks corresponding to different second feature images is different;
And adding, convolving and upsampling the n third feature maps to obtain the first transparent channel image.
In the embodiment of the present application, the obtained original image is input into the first prediction model to obtain a first transparent channel image (including a predicted transparency value corresponding to each pixel point in the original image) output by the first prediction model; the first transparent channel image and the original image are then input into the second prediction model to obtain a second transparent channel image output by the second prediction model, and the original image is segmented according to the second transparent channel image to obtain an image corresponding to the target object. Because the fineness of the second transparent channel image is higher than that of the first transparent channel image, the accuracy of image segmentation can be improved; compared with image segmentation methods in the related art, the transparent channel image used for segmentation can be generated directly from the original image without introducing a trimap image, which further improves the accuracy of image segmentation.
Referring to fig. 14, a schematic structural diagram of a computer device according to an exemplary embodiment of the present application is shown. The computer apparatus 1400 includes a central processing unit (Central Processing Unit, CPU) 1401, a system Memory 1404 including a random access Memory (Random Access Memory, RAM) 1402 and a Read-Only Memory (ROM) 1403, and a system bus 1405 connecting the system Memory 1404 and the central processing unit 1401. The computer device 1400 also includes a basic Input/Output system (I/O) 1406 that facilitates the transfer of information between various devices within the computer device, and a mass storage device 1407 for storing an operating system 1413, application programs 1414, and other program modules 1415.
The basic input/output system 1406 includes a display 1408 for displaying information and an input device 1409, such as a mouse, keyboard, etc., for a user to input information. Wherein the display 1408 and the input device 1409 are connected to the central processing unit 1401 via an input output controller 1410 connected to the system bus 1405. The basic input/output system 1406 may also include an input/output controller 1410 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 1410 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1407 is connected to the central processing unit 1401 through a mass storage controller (not shown) connected to the system bus 1405. The mass storage device 1407 and its associated computer-readable storage media provide non-volatile storage for the computer device 1400. That is, the mass storage device 1407 may include a computer readable storage medium (not shown) such as a hard disk or a compact disk-Only (CD-ROM) drive.
The computer-readable storage medium may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid state memory technology, CD-ROM, digital versatile disks (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the above. The system memory 1404 and mass storage device 1407 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1401, the one or more programs containing instructions for implementing the above-described method embodiments, the central processing unit 1401 executing the one or more programs to implement the methods provided by the respective method embodiments described above.
According to various embodiments of the present application, the computer device 1400 may also be connected to a remote server on a network, such as the Internet, for operation. That is, the computer device 1400 may be connected to the network 1412 through the network interface unit 1411 connected to the system bus 1405, or may be connected to other types of networks or remote server systems (not shown) using the network interface unit 1411.
The memory further stores one or more programs, and the one or more programs include instructions for performing the steps executed by the computer device in the methods provided by the embodiments of the present application.
Embodiments of the present application also provide a computer readable storage medium storing at least one instruction that is loaded and executed by the processor to implement the image processing method described in the above embodiments.
Embodiments of the present application also provide a computer program product storing at least one instruction that is loaded and executed by the processor to implement the image processing method described in the above embodiments.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable storage medium. Computer-readable storage media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The foregoing descriptions are merely preferred embodiments of the present application and are not intended to limit the present application. Any modifications, equivalent replacements, improvements and the like made within the spirit and principle of the present application shall fall within the protection scope of the present application.

Claims (9)

1. An image processing method, the method comprising:
obtaining a sample image, a sample labeling image and a sample segmentation image, wherein the sample labeling image is labeled with transparency values corresponding to all pixel points in the sample image, and the sample segmentation image is a binarized image obtained by performing binarization processing on the sample labeling image;
Inputting the sample image into a multi-scale coding network to obtain m first sample feature images output by the multi-scale coding network, wherein the resolution and the channel number of different first sample feature images are different, m is an integer greater than or equal to 2, and the multi-scale coding network is used for extracting features of the sample image;
inputting the m first sample feature images into a feature pyramid network to obtain m second sample feature images output by the feature pyramid network, wherein the channel numbers of different second sample feature images are the same and the resolutions are different, and the feature pyramid network is used for processing the channel numbers of the m first sample feature images into target channel numbers;
inputting the m second sample feature images into a multi-scale decoding network to obtain a first sample transparent channel image output by the multi-scale decoding network, wherein the multi-scale decoding network is used for adding and converting the m second sample feature images, and the resolution of the first sample transparent channel image is the same as that of the sample image;
inputting the m second sample feature images into a depth supervision network to obtain m third sample transparent channel images output by the depth supervision network, wherein the depth supervision network is used for carrying out up-sampling processing on the m second sample feature images, different second sample feature images correspond to different up-sampling multiples, and the resolution of the m third sample transparent channel images is the same as that of the sample images;
Binarizing the first sample transparent channel image and m third sample transparent channel images to obtain a first sample segmentation image and m second sample segmentation images;
training a first prediction model according to the first sample segmentation image, m second sample segmentation images and the sample segmentation image;
inputting the sample image into the first prediction model obtained by training to obtain the first sample transparent channel image output by the first prediction model;
training a second prediction model according to the first sample transparent channel image, the sample annotation image and the sample image;
acquiring an original image, wherein the original image comprises at least one target object;
inputting the original image into the first prediction model to obtain a first transparent channel image output by the first prediction model, wherein the first transparent channel image comprises predicted transparency values corresponding to all pixel points in the original image;
inputting the first transparent channel image and the original image into the second prediction model to obtain a second transparent channel image output by the second prediction model, wherein the fineness of the second transparent channel image is higher than that of the first transparent channel image;
And dividing the original image according to the second transparent channel image to obtain an image corresponding to the target object.
2. The method of claim 1, wherein the training a second predictive model from the first sample transparent channel image, the sample annotation image, and the sample image comprises:
inputting the first sample transparent channel image and the sample image into a refinement network to obtain a second sample transparent channel image output by the refinement network;
inputting the sample image, the second sample transparent channel image and the sample labeling image into an edge gradient network to obtain edge gradient loss corresponding to the second sample transparent channel image;
calculating structural similarity loss and matting loss corresponding to the second sample transparent channel image according to the second sample transparent channel image and the sample labeling image;
inputting the second sample transparent channel image and the sample labeling image into a connectivity difference network to obtain connectivity difference loss corresponding to the second sample transparent channel image;
training the refinement network according to the edge gradient loss, the connectivity difference loss, the matting loss and the structural similarity loss;
And determining the refined network obtained through training as the second prediction model.
3. The method of claim 2, wherein inputting the sample image, the second sample transparent channel image, and the sample annotation image into an edge gradient network, obtaining an edge gradient loss corresponding to the second sample transparent channel image, comprises:
inputting the sample image into a preset operator to obtain a sample gradient image corresponding to the sample image, wherein the preset operator is used for performing a first-order derivative operation on the sample image;
performing binarization and dilation-erosion operation on the sample annotation image to obtain a sample edge image, wherein the sample edge image is used for indicating a boundary area of a foreground image and a background image in the sample annotation image;
generating a sample edge gradient image according to the sample edge image and the sample gradient image, wherein the sample edge gradient image is used for indicating the boundary area of a foreground image and a background image in the sample image;
generating an edge transparent channel image according to the second sample transparent channel image and the sample edge image, wherein the edge transparent channel image is used for indicating the boundary area of a foreground image and a background image in the second sample transparent channel image;
And calculating the edge gradient loss according to the edge transparent channel image and the sample edge gradient image.
4. The method of claim 1, wherein said inputting the original image into the first predictive model results in a first transparent channel image output by the first predictive model, comprising:
inputting the original image into the multi-scale coding network to obtain n first feature images output by the multi-scale coding network, wherein the resolutions and the channel numbers of different first feature images are different, and n is an integer greater than or equal to 2;
inputting the n first feature images into the feature pyramid network to obtain n second feature images output by the feature pyramid network, wherein the channel numbers of different second feature images are the same and the resolutions are different, and the channel numbers of the n second feature images are the target channel numbers;
and inputting the n second feature maps into the multi-scale decoding network to obtain the first transparent channel image output by the multi-scale decoding network.
5. The method of claim 4, wherein inputting the n first feature maps into the feature pyramid network, obtaining n second feature maps output by the feature pyramid network, comprises:
Arranging n first feature graphs according to resolution to form a feature pyramid, wherein the resolution of the first feature graphs in the feature pyramid and the level of the first feature graphs are in negative correlation;
responding to the channel number corresponding to the nth first feature map as the maximum channel number, and carrying out convolution processing on the nth first feature map to obtain an nth second feature map;
and responding to the fact that the channel number corresponding to the nth first feature map is not the maximum channel number, carrying out convolution processing on the nth first feature map to obtain a fourth feature map, carrying out convolution and up-sampling processing on the (n+1) th first feature map to obtain a fifth feature map, mixing the fourth feature map and the fifth feature map, and carrying out convolution processing to obtain the nth second feature map.
6. The method of claim 4, wherein inputting the n second feature maps into the multi-scale decoding network to obtain the first transparent channel image output by the multi-scale decoding network comprises:
processing the n second feature images through convolution blocks to obtain n third feature images, wherein the resolutions corresponding to the n third feature images are the same, different second feature images correspond to different convolution blocks, and the number of convolution blocks corresponding to different second feature images is different;
And adding, convolving and upsampling the n third feature maps to obtain the first transparent channel image.
7. An image processing apparatus, characterized in that the apparatus comprises:
the second acquisition module is used for acquiring a sample image, a sample labeling image and a sample segmentation image, wherein the sample labeling image is labeled with transparency values corresponding to all pixel points in the sample image, and the sample segmentation image is a binarized image obtained by performing binarization processing on the sample labeling image;
the first training module is used for inputting the sample image into a multi-scale coding network to obtain m first sample feature images output by the multi-scale coding network, wherein the resolutions and the channel numbers of different first sample feature images are different, m is an integer greater than or equal to 2, and the multi-scale coding network is used for extracting the features of the sample image; inputting the m first sample feature images into a feature pyramid network to obtain m second sample feature images output by the feature pyramid network, wherein the channel numbers of different second sample feature images are the same and the resolutions are different, and the feature pyramid network is used for processing the channel numbers of the m first sample feature images into target channel numbers; inputting the m second sample feature images into a multi-scale decoding network to obtain a first sample transparent channel image output by the multi-scale decoding network, wherein the multi-scale decoding network is used for adding and converting the m second sample feature images, and the resolution of the first sample transparent channel image is the same as that of the sample image; inputting the m second sample feature images into a depth supervision network to obtain m third sample transparent channel images output by the depth supervision network, wherein the depth supervision network is used for carrying out up-sampling processing on the m second sample feature images, different second sample feature images correspond to different up-sampling multiples, and the resolution of the m third sample transparent channel images is the same as that of the sample images; binarizing the first sample transparent channel image and m third sample transparent channel images to obtain a first sample segmentation image and m second sample segmentation images; training a first prediction model according to the first sample segmentation image, m second sample segmentation images and the sample segmentation image;
The third prediction module is used for inputting the sample image into the first prediction model obtained through training to obtain the first sample transparent channel image output by the first prediction model;
the second training module is used for training a second prediction model according to the first sample transparent channel image, the sample labeling image and the sample image;
the first acquisition module is used for acquiring an original image, wherein the original image comprises at least one target object;
the first prediction module is used for inputting the original image into the first prediction model to obtain a first transparent channel image output by the first prediction model, wherein the first transparent channel image comprises a predicted transparency value corresponding to each pixel point in the original image;
the second prediction module is used for inputting the first transparent channel image and the original image into the second prediction model to obtain a second transparent channel image output by the second prediction model, and the fineness of the second transparent channel image is higher than that of the first transparent channel image;
and the segmentation processing module is used for carrying out segmentation processing on the original image according to the second transparent channel image to obtain an image corresponding to the target object.
8. A computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set or instruction set, the at least one instruction, at least one program, code set or instruction set being loaded and executed by the processor to implement the image processing method of any of claims 1 to 6.
9. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement the image processing method of any one of claims 1 to 6.
CN202010099612.6A 2020-02-18 2020-02-18 Image processing method, device, equipment and storage medium Active CN111369581B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010099612.6A CN111369581B (en) 2020-02-18 2020-02-18 Image processing method, device, equipment and storage medium
PCT/CN2021/074722 WO2021164534A1 (en) 2020-02-18 2021-02-01 Image processing method and apparatus, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010099612.6A CN111369581B (en) 2020-02-18 2020-02-18 Image processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111369581A CN111369581A (en) 2020-07-03
CN111369581B true CN111369581B (en) 2023-08-08

Family

ID=71210735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010099612.6A Active CN111369581B (en) 2020-02-18 2020-02-18 Image processing method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111369581B (en)
WO (1) WO2021164534A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111369581B (en) * 2020-02-18 2023-08-08 Oppo广东移动通信有限公司 Image processing method, device, equipment and storage medium
CN111311629B (en) * 2020-02-21 2023-12-01 京东方科技集团股份有限公司 Image processing method, image processing device and equipment
CN112001923B (en) * 2020-11-02 2021-01-05 中国人民解放军国防科技大学 Retina image segmentation method and device
CN116389753A (en) * 2021-12-30 2023-07-04 科大讯飞股份有限公司 Data encapsulation method, related device, equipment and medium
CN114418999B (en) * 2022-01-20 2022-09-23 哈尔滨工业大学 Retinopathy detection system based on lesion attention pyramid convolution neural network
CN114468977B (en) * 2022-01-21 2023-03-28 深圳市眼科医院 Ophthalmologic vision examination data collection and analysis method, system and computer storage medium
CN115052154B (en) * 2022-05-30 2023-04-14 北京百度网讯科技有限公司 Model training and video coding method, device, equipment and storage medium
CN115470830B (en) * 2022-10-28 2023-04-07 电子科技大学 Multi-source-domain-adaptation-based electroencephalogram signal cross-user alertness monitoring method
CN117252892B (en) * 2023-11-14 2024-03-08 江西师范大学 Automatic double-branch portrait matting device based on light visual self-attention network
CN117314741B (en) * 2023-12-01 2024-03-26 成都华栖云科技有限公司 Green screen background matting method, device and equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101980537A (en) * 2010-10-21 2011-02-23 北京航空航天大学 Object and fractal-based binocular three-dimensional video compression coding and decoding method
CN109325954A (en) * 2018-09-18 2019-02-12 北京旷视科技有限公司 Image partition method, device and electronic equipment
CN109447994A (en) * 2018-11-05 2019-03-08 陕西师范大学 In conjunction with the remote sensing image segmentation method of complete residual error and Fusion Features
CN110689083A (en) * 2019-09-30 2020-01-14 苏州大学 Context pyramid fusion network and image segmentation method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10255681B2 (en) * 2017-03-02 2019-04-09 Adobe Inc. Image matting using deep learning
US10452956B2 (en) * 2017-09-29 2019-10-22 Here Global B.V. Method, apparatus, and system for providing quality assurance for training a feature prediction model
CN110148102B (en) * 2018-02-12 2022-07-15 腾讯科技(深圳)有限公司 Image synthesis method, advertisement material synthesis method and device
CN110782466B (en) * 2018-07-31 2023-05-02 阿里巴巴集团控股有限公司 Picture segmentation method, device and system
CN111369581B (en) * 2020-02-18 2023-08-08 Oppo广东移动通信有限公司 Image processing method, device, equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ran Qing; Feng Jieqing. Automatic matting algorithm for human foreground. Journal of Computer-Aided Design & Computer Graphics, 2020, (02), pp. 276-282. *

Also Published As

Publication number Publication date
WO2021164534A1 (en) 2021-08-26
CN111369581A (en) 2020-07-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant