CN111369581A - Image processing method, device, equipment and storage medium - Google Patents

Image processing method, device, equipment and storage medium

Info

Publication number
CN111369581A
Authority
CN
China
Prior art keywords
image
sample
transparent channel
network
prediction model
Prior art date
Legal status
Granted
Application number
CN202010099612.6A
Other languages
Chinese (zh)
Other versions
CN111369581B (en)
Inventor
刘钰安
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010099612.6A priority Critical patent/CN111369581B/en
Publication of CN111369581A publication Critical patent/CN111369581A/en
Priority to PCT/CN2021/074722 priority patent/WO2021164534A1/en
Application granted granted Critical
Publication of CN111369581B publication Critical patent/CN111369581B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiment of the application discloses an image processing method, an image processing device, image processing equipment and a storage medium, and belongs to the field of image processing. The method comprises the following steps: acquiring an original image, wherein the original image comprises at least one target object; inputting an original image into a first prediction model to obtain a first transparent channel image output by the first prediction model, wherein the first transparent channel image comprises a prediction transparency value corresponding to each pixel point in the original image; inputting the first transparent channel image and the original image into a second prediction model to obtain a second transparent channel image output by the second prediction model, wherein the fineness of the second transparent channel image is higher than that of the first transparent channel image; and carrying out segmentation processing on the original image according to the second transparent channel image to obtain an image corresponding to the target object. Compared with the image segmentation method in the related art, the transparent channel image can be directly generated from the original image without introducing a trimap image, and the accuracy of image segmentation is further improved.

Description

Image processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing, and in particular, to an image processing method, an image processing apparatus, an image processing device, and a storage medium.
Background
Image segmentation refers to the process of accurately separating a foreground object of interest from the background in a static image or a continuous video sequence, and is widely applied to portrait blurring, background replacement and the like. The goal of an image segmentation task is to obtain a transparent channel image, in which the transparency value corresponding to each pixel point is marked: an area with a transparency value of 1 is a foreground image area, and an area with a transparency value of 0 is a background image area, so that the obtained transparent channel image can be used to separate the foreground image from the original image.
In the related art, a trimap image is generated according to an original image, and is used for dividing the original image into three parts, namely a determined foreground image area, a determined background image area and an uncertain area, the trimap image is used for determining the uncertain area firstly, then the trimap image and the original image are input into a trained neural network, transparency values corresponding to all pixel points in the uncertain area are determined, and a transparent channel image for image segmentation is output.
Obviously, the transparent channel image obtained in the related art depends on the accuracy of the trimap image, and the trimap image needs to be generated by training a specific neural network or obtained by artificial labeling, so that the accuracy of the generated transparent channel image is low.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device, image processing equipment and a storage medium. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides an image processing method, where the method includes:
acquiring an original image, wherein the original image comprises at least one target object;
inputting the original image into a first prediction model to obtain a first transparent channel image output by the first prediction model, wherein the first transparent channel image comprises a prediction transparency value corresponding to each pixel point in the original image;
inputting the first transparent channel image and the original image into a second prediction model to obtain a second transparent channel image output by the second prediction model, wherein the fineness of the second transparent channel image is higher than that of the first transparent channel image;
and segmenting the original image according to the second transparent channel image to obtain an image corresponding to the target object.
In another aspect, an embodiment of the present application provides an image processing apparatus, including:
the first acquisition module is used for acquiring an original image, wherein the original image comprises at least one target object;
the first prediction module is used for inputting the original image into a first prediction model to obtain a first transparent channel image output by the first prediction model, wherein the first transparent channel image comprises a prediction transparency value corresponding to each pixel point in the original image;
the second prediction module is used for inputting the first transparent channel image and the original image into a second prediction model to obtain a second transparent channel image output by the second prediction model, and the fineness of the second transparent channel image is higher than that of the first transparent channel image;
and the segmentation processing module is used for carrying out segmentation processing on the original image according to the second transparent channel image to obtain an image corresponding to the target object.
In another aspect, embodiments of the present application provide a computer device including a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the image processing method according to the above aspect.
In another aspect, embodiments of the present application provide a computer-readable storage medium having at least one instruction, at least one program, a set of codes, or a set of instructions stored therein, which is loaded and executed by a processor to implement the image processing method according to the above aspect.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
and inputting the acquired original image into a first prediction model to obtain a first transparent channel image (including a prediction transparency value corresponding to each pixel point in the original image) output by the first prediction model, so that the first transparent channel image and the original image are input into a second prediction model to obtain a second transparent channel image output by the second prediction model, and the second transparent channel image is used for carrying out segmentation processing on the original image according to the second transparent channel image to obtain an image corresponding to the target object. The fineness of the second transparent channel image is higher than that of the first transparent channel image, so that the accuracy of image segmentation can be improved; compared with the image segmentation method in the related art, the transparent channel image for image segmentation can be directly generated from the original image without introducing a trimap image, and the accuracy of image segmentation is further improved.
Drawings
FIG. 1 illustrates a flow chart of an image processing method provided by an exemplary embodiment of the present application;
FIG. 2 illustrates a flow chart of an image processing method shown in an exemplary embodiment of the present application;
FIG. 3 illustrates a flow chart of a method of training a first predictive model in accordance with an exemplary embodiment of the present application;
FIG. 4 illustrates a flow chart of a method of training a first predictive model, shown in another exemplary embodiment of the present application;
FIG. 5 illustrates a process diagram of a method of training a first predictive model, shown in an exemplary embodiment of the present application;
FIG. 6 shows a schematic structural diagram of individual convolutional blocks used by a multi-scale decoding network;
FIG. 7 illustrates a flow chart of a method of training a second predictive model in accordance with an exemplary embodiment of the present application;
FIG. 8 illustrates a flow chart of a method of training a second predictive model in accordance with another exemplary embodiment of the present application;
FIG. 9 illustrates a process diagram of a method of training a second predictive model in accordance with an exemplary embodiment of the present application;
FIG. 10 shows a flow chart of an image processing method shown in another exemplary embodiment of the present application;
FIG. 11 shows a flow chart of an image processing method shown in another exemplary embodiment of the present application;
FIG. 12 illustrates a network deployment diagram of an image processing method shown in an exemplary embodiment of the present application;
fig. 13 is a block diagram illustrating a configuration of an image processing apparatus according to an exemplary embodiment of the present application;
fig. 14 shows a schematic structural diagram of a computer device provided in an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference herein to "a plurality" means two or more. "And/or" describes the association relationship of the associated objects, meaning that there may be three relationships; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Image segmentation refers to a process of accurately separating a foreground object of interest from a background in a static image or a continuous video sequence, and in an image segmentation task, a transparent channel image for segmenting the foreground image needs to be generated, where the transparent channel image includes transparency values corresponding to each pixel point, for example, an area with a transparency value of 1 represents a foreground image area, and an area with a transparency of 0 represents a background image area, so that the obtained transparent channel image can be used to separate the foreground image in an original image.
The related art provides an image processing method which is mainly divided into two stages. The first stage is as follows: a trimap image is generated according to the original image, and the trimap image is used for dividing the original image into three parts, namely a determined foreground image area, a determined background image area and an uncertain area, so that the uncertain area in the original image can be divided through the generated trimap image. The second stage is: the generated trimap image and the original image are input into a trained neural network, and the transparency values corresponding to the pixel points in the uncertain area are determined, so that a transparent channel image for image segmentation is output.
By adopting the method in the related technology, if the uncertain region divided by the trimap image is not accurate enough, the accuracy of the generated transparent channel image is low, thereby affecting the accuracy of image segmentation. In addition, the trimap image needs to be generated by a specific neural network or needs to be labeled manually, so that the complexity of the training process is increased, and the corresponding transparent channel image cannot be generated directly by the original image.
In order to solve the above problem, an embodiment of the present application provides an image processing method. Referring to fig. 1, a flowchart of an image processing method according to an exemplary embodiment of the present application is shown. The method comprises the following steps:
step 101, training a first prediction model according to the sample image and the sample segmentation image, wherein the first prediction model is used for generating a first transparent channel image, and the first transparent channel image comprises a prediction transparency value corresponding to each pixel point in the original image.
And 102, training a second prediction model according to the sample image, the sample annotation image and the first sample transparent channel image output by the trained first prediction model, wherein the second prediction model is used for generating a second transparent channel image whose fineness is higher than that of the first transparent channel image.
And 103, preprocessing the original image, and inputting the trained first prediction model to obtain a first transparent channel image output by the first prediction model.
And 104, inputting the first transparent channel image and the original image into a second prediction model to obtain a second transparent channel image output by the second prediction model.
And 105, segmenting the original image according to the second transparent channel image to obtain a foreground image.
In the embodiment of the application, a first prediction model and a second prediction model for generating transparent channel images can be obtained through training. An original image is preprocessed and input into the first prediction model to obtain a first transparent channel image output by the first prediction model, and the generated first transparent channel image and the original image are then input into the second prediction model to obtain a second transparent channel image output by the second prediction model, so that the second transparent channel image is used for image processing.
It should be noted that the image processing method provided in the embodiments of the present application may be applied to a computer device with an image processing function, where the computer device may be a smart phone, a tablet computer, a personal portable computer, or the like. In a possible implementation manner, the image processing method provided by the embodiment of the present application may be applied to an application program that needs to perform tasks such as image segmentation, background replacement, blurring of a target object, and the like. For example, an application having a beauty function; optionally, the training process of the prediction model in the image processing method provided in each embodiment of the present application may be performed in a server, and after the training of the prediction model is completed, the trained prediction model is deployed in a computer device to perform subsequent image processing; optionally, the image processing method provided in the embodiments of the present application may also be used in a server having an image processing function.
For convenience of description, in the method embodiments described below, description is given only taking as an example that the execution subject of the image processing method is a computer apparatus.
Referring to fig. 2, a flowchart of an image processing method according to an exemplary embodiment of the present application is shown. The embodiment is described by taking the method as an example for a computer device, and the method comprises the following steps.
Step 201, acquiring an original image.
The image processing in the embodiment of the present application aims to separate a foreground image from an original image, where the foreground image is an image corresponding to a target object. Therefore, the original image should contain at least one target object.
The target object may be a person, a scene, an animal, and the like, and the type of the target object is not limited in the embodiments of the present application.
In one possible implementation, the original image needs to be preprocessed, and the preprocessing may include data enhancement processing such as random rotation, random left-right flipping, random cropping, Gamma (Gamma) transformation, and the like, so as to be used in the subsequent feature extraction process.
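For illustration only, the following is a minimal sketch of such a preprocessing pipeline using PyTorch/torchvision; the specific parameter values (rotation range, 512 × 512 crop size, Gamma range) are assumptions rather than values taken from the embodiment, and the input is assumed to be a PIL image at least 512 × 512 in size.

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def preprocess(image):
    """Illustrative data enhancement: random rotation, random left-right flip,
    random cropping and Gamma transformation (all parameter values assumed)."""
    image = TF.rotate(image, angle=random.uniform(-15, 15))          # random rotation
    if random.random() < 0.5:                                        # random left-right flipping
        image = TF.hflip(image)
    i, j, h, w = T.RandomCrop.get_params(image, output_size=(512, 512))
    image = TF.crop(image, i, j, h, w)                               # random cropping
    image = TF.adjust_gamma(image, gamma=random.uniform(0.8, 1.2))   # Gamma transformation
    return image
```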
Step 202, inputting the original image into a first prediction model to obtain a first transparent channel image output by the first prediction model, wherein the first transparent channel image comprises a prediction transparency value corresponding to each pixel point in the original image.
The first transparent channel image is a probability map, that is, the prediction transparency value corresponding to each pixel point in the first transparent channel image takes a value between 0 and 1; for example, the prediction transparency value of a certain pixel point is 0.9.
In a possible implementation manner, after the original image is preprocessed, the preprocessed original image is input into the trained first prediction model, so that a predicted first transparent channel image can be obtained, where the predicted first transparent channel image includes the predicted transparency value corresponding to each pixel point in the original image.
And 203, inputting the first transparent channel image and the original image into a second prediction model to obtain a second transparent channel image output by the second prediction model, wherein the fineness of the second transparent channel image is higher than that of the first transparent channel image.
Since the accuracy of the segmented image depends on the accuracy of the obtained transparent channel image, in order to improve the accuracy of the first transparent channel image, a second prediction model is deployed, and the second prediction model can generate a second transparent channel image with higher fineness according to the input original image and the first transparent channel image.
The second prediction model is mainly used for correcting the transparency value of each pixel point in the first transparent channel image, so that the prediction transparency value of each pixel point in the second transparent channel image is closer to the standard transparency value.
In a possible implementation manner, after the first transparent channel image and the original image are subjected to stitching (Concat) processing, and then input into the second prediction model, a second transparent channel image with higher fineness can be obtained so as to be used in a subsequent image segmentation processing process.
And 204, segmenting the original image according to the second transparent channel image to obtain an image corresponding to the target object.
In a possible implementation manner, the second transparent channel image includes a predicted transparency value corresponding to each pixel point, and since the transparency value corresponding to the foreground image is 1 and the transparency value corresponding to the background image is 0, the foreground image and the background image in the original image can be separated according to the second transparent channel image, so that the image corresponding to the target object can be obtained.
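As a rough, non-limiting sketch of steps 202 to 204, assuming the two trained models are available as PyTorch modules (the names first_model and second_model are placeholders, not names used in the embodiment), the inference flow could look as follows.

```python
import torch

def segment(original, first_model, second_model):
    """original: tensor of shape (1, 3, H, W) with values in [0, 1].
    Returns the image of the target object cut out by the predicted
    transparency values, together with the second transparent channel image."""
    with torch.no_grad():
        alpha_coarse = first_model(original)                      # first transparent channel image, (1, 1, H, W)
        refined_in = torch.cat([original, alpha_coarse], dim=1)   # Concat along the channel dimension
        alpha_fine = second_model(refined_in)                     # second transparent channel image
    foreground = original * alpha_fine   # transparency 1 keeps the foreground, 0 removes the background
    return foreground, alpha_fine
```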
To sum up, in the embodiment of the present application, an acquired original image is input into a first prediction model, so as to obtain a first transparent channel image (including a prediction transparency value corresponding to each pixel point in the original image) output by the first prediction model, and thus the first transparent channel image and the original image are input into a second prediction model, so as to obtain a second transparent channel image output by the second prediction model, where the second transparent channel image is used to perform segmentation processing on the original image according to the second transparent channel image, so as to obtain an image corresponding to a target object. The fineness of the second transparent channel image is higher than that of the first transparent channel image, so that the accuracy of image segmentation can be improved; compared with the image segmentation method in the related art, the transparent channel image for segmentation can be directly generated from the original image without introducing a trimap image, and the accuracy of image segmentation is further improved.
In a possible embodiment, since the process of generating the second transparent channel image is divided into two prediction model stages, namely, the first prediction model and the second prediction model, in the model training stage, the training stage of the first prediction model and the training stage of the second prediction model should also be included.
Referring to fig. 3, a flowchart illustrating a method for training a first predictive model according to an exemplary embodiment of the present application is shown. The method comprises the following steps:
step 301, a sample image, a sample labeling image and a sample segmentation image are obtained, wherein the sample labeling image is labeled with transparency values corresponding to each pixel point in the sample image, and the sample segmentation image is a binarized image obtained by binarizing the sample labeling image.
Aiming at the training process of the first prediction model, the adopted data set comprises a preset number of data pairs, each data pair is a sample image and a sample labeling image corresponding to the sample image, and the sample labeling image is labeled with a standard transparency value corresponding to each pixel point in the corresponding sample image.
Optionally, the preset number may be set by a developer, and the greater the number of data pairs, the higher the prediction accuracy of the first prediction model. For example, 5000 data pairs may be included in a data set.
Optionally, the sample annotation image may be obtained by annotation by a developer.
Optionally, the first prediction model may be trained based on a deep learning tensor library (PyTorch) framework and a Graphics Processing Unit (GPU).
In order to enable the first prediction model to converge quickly and improve the training speed of the first prediction model, the first prediction model is trained by using the sample segmentation images and the sample images.
In a possible implementation manner, the sample segmentation image may be obtained by performing binarization processing on the sample annotation image, that is, a transparency threshold is set: if the transparency value corresponding to a pixel point is greater than the transparency threshold, the transparency value corresponding to the pixel point is represented by 1, and if it is less than the transparency threshold, it is represented by 0.
Optionally, the transparency threshold may be set by a developer, for example, the transparency threshold may be 0.8, that is, the transparency value of a pixel greater than 0.8 is represented by 1, and the transparency value of a pixel less than 0.8 is represented by 0.
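A one-line sketch of this binarization, assuming the sample annotation image is stored as a floating-point NumPy array of transparency values:

```python
import numpy as np

def binarize(annotation, threshold=0.8):
    """Transparency values above the threshold become 1, all others become 0."""
    return (annotation > threshold).astype(np.float32)
```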
In a possible implementation manner, the acquired sample images and sample segmentation images of the preset number are divided into a test set and a sample set according to a certain proportion, wherein the sample set is used in a subsequent training process of the first prediction model, and the test set is used in a verification process of the first prediction model.
Optionally, the preset ratio may be set by a developer, for example, if the preset ratio is 2:8, the data set may be divided into the test set and the sample set according to the ratio of 2: 8.
Step 302, training a first prediction model according to the sample image and the sample segmentation image.
In one possible embodiment, the sample data size may be expanded by preprocessing the image, for example, preprocessing such as random rotation, random left-right flipping, random cropping, and Gamma transformation on the sample image in the sample set. The embodiment of the present application does not limit the preprocessing method for the sample image.
In one possible implementation, the first prediction model may include a multi-scale coding network, a feature pyramid network, a multi-scale decoding network, a deep supervision network, and the like.
Illustratively, as shown in FIG. 4, step 302 may include steps 302A through 302F.
Step 302A, inputting the sample image into a multi-scale coding network to obtain m first sample feature maps output by the multi-scale coding network, wherein m is an integer greater than or equal to 2.
Wherein, the resolution and the channel number of different first sample feature maps are different, and m is an integer larger than or equal to 2.
The multi-scale coding network may use a neural network model for feature extraction, for example, a lightweight mobile network (MobileNetV2) model, and the neural network model used by the multi-scale coding network is not limited in the embodiments of the present application.
In a possible implementation manner, the preprocessed sample image is input into a multi-scale coding network, and multi-scale feature extraction is performed on the sample image through the multi-scale coding network, so that m first sample feature maps can be obtained.
Illustratively, if m is 4, the obtained m first sample feature maps may be 320 × 1/32, 64 × 1/16, 32 × 1/8 and 24 × 1/4, where 320, 64, 32 and 24 represent the number of channels corresponding to each first sample feature map, and 1/32, 1/16, 1/8 and 1/4 represent the resolution of each first sample feature map relative to the sample image; for example, 1/4 indicates that the resolution corresponding to the first sample feature map is 1/4 of that of the sample image.
Schematically, as shown in fig. 5, after a sample image is input into a multi-scale coding network 501 and multi-scale feature extraction is performed, 4 first sample feature maps output by the multi-scale coding network 501 are obtained, which are 320 × 1/32, 64 × 1/16, 32 × 1/8 and 24 × 1/4 respectively.
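A possible sketch of such a multi-scale coding network built on the torchvision MobileNetV2 implementation is shown below; the layer indices at which the 24 × 1/4, 32 × 1/8, 64 × 1/16 and 320 × 1/32 feature maps are tapped are assumptions based on the standard MobileNetV2 layout, not values specified by the embodiment.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2   # torchvision >= 0.13 assumed

class MultiScaleEncoder(nn.Module):
    """Taps MobileNetV2 at the stages whose outputs have 24, 32, 64 and 320
    channels at 1/4, 1/8, 1/16 and 1/32 of the input resolution."""
    def __init__(self):
        super().__init__()
        self.backbone = mobilenet_v2(weights=None).features
        self.tap_indices = (3, 6, 10, 17)   # assumed indices of the four stages

    def forward(self, x):
        feats = []
        for i, layer in enumerate(self.backbone):
            x = layer(x)
            if i in self.tap_indices:
                feats.append(x)             # [24x1/4, 32x1/8, 64x1/16, 320x1/32]
        return feats

# usage sketch: feats = MultiScaleEncoder()(torch.randn(1, 3, 256, 256))
```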
Step 302B, inputting the m first sample feature maps into the feature pyramid network, and obtaining m second sample feature maps output by the feature pyramid network.
The numbers of channels of different second sample feature maps are the same, and their resolutions are different.
The feature pyramid network is used for mixing the extracted feature maps and processing the number of channels corresponding to the feature maps into a target number of channels.
In one possible implementation, the m first sample feature maps output by the multi-scale coding network are input into the feature pyramid network, and are arranged according to the size of the resolution of each first sample feature map to form a first sample feature pyramid, wherein the level of the first sample feature map on the first sample feature pyramid is in a negative correlation with the resolution.
Schematically, as shown in fig. 5, 4 first sample feature maps output by the multi-scale coding network are input into the feature pyramid network 502, and first sample feature pyramids (shown as the left pyramid of the feature pyramid network) are formed according to the high-low arrangement of the resolution, that is, each level included in the first sample feature pyramid and the corresponding first sample feature map are 24 × 1/4 (first layer), 32 × 1/8 (second layer), 64 × 1/16 (third layer), and 320 × 1/32 (fourth layer), respectively.
In one possible embodiment, after the first sample feature pyramid is formed, the first sample feature maps are mixed through upsampling and convolution processing, so that the obtained second sample feature maps not only focus on features of the same sampling size but also fully utilize each first sample feature map. If the number of channels corresponding to a first sample feature map is the maximum number of channels (i.e., the first sample feature map corresponds to the minimum resolution), convolution processing is performed on the first sample feature map to obtain the second sample feature map corresponding to the minimum resolution. If the number of channels corresponding to a first sample feature map is not the maximum number of channels, convolution processing is performed on the first sample feature map to obtain a first intermediate sample feature map, convolution and upsampling operations are performed on the first sample feature map at a higher level to obtain a second intermediate sample feature map, and the first intermediate sample feature map and the second intermediate sample feature map are mixed and then convolved to obtain the second sample feature map corresponding to the resolution.
Schematically, as shown in fig. 5, for the first sample feature map 320 × 1/32 corresponding to the fourth layer, only a convolution operation is performed, so that the second sample feature map corresponding to resolution 1/32, represented by 128 × 1/32, is obtained. For the first sample feature map 64 × 1/16 corresponding to the third layer, convolution processing is performed to obtain a first intermediate sample feature map, and convolution processing and 2-times bilinear interpolation upsampling (up2x) are performed on the first sample feature map 320 × 1/32 corresponding to the fourth layer (the higher level) to obtain a second intermediate sample feature map; the first intermediate sample feature map and the second intermediate sample feature map are mixed, and the mixed sample feature map is convolved, so that the second sample feature map corresponding to resolution 1/16, represented by 128 × 1/16, is obtained. Similarly, the second sample feature maps corresponding to resolutions 1/8 and 1/4, represented by 128 × 1/8 and 128 × 1/4, can be obtained. The 4 second sample feature maps output by the feature pyramid network 502 are shown as the right pyramid in fig. 5.
In one possible embodiment, the second sample feature maps are also arranged according to the size of the resolution to form a second sample feature pyramid, wherein the level of the second sample feature map on the second sample feature pyramid is in a negative correlation with the resolution.
For example, the second sample feature pyramid includes levels 128 × 1/4 (first level), 128 × 1/8 (second level), 128 × 1/16 (third level), and 128 × 1/32 (fourth level), respectively, and the corresponding second sample feature maps thereof.
Optionally, the number of target channels corresponding to the m second feature maps may be set by a developer, for example, the number of target channels is 128, and the number of target channels is not limited in this embodiment of the present application.
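One way to read the above description of the feature pyramid network is sketched below; the kernel sizes (1 × 1 lateral convolutions, 3 × 3 output convolutions) are assumptions, and only the mixing by addition and the 2-times bilinear upsampling follow directly from the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Maps the m first sample feature maps (ordered from high to low resolution)
    to m second sample feature maps that all have out_ch channels."""
    def __init__(self, in_channels=(24, 32, 64, 320), out_ch=128):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):                      # feats: [1/4, 1/8, 1/16, 1/32]
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        outs = [None] * len(laterals)
        outs[-1] = self.smooth[-1](laterals[-1])   # minimum resolution: convolution only
        for i in range(len(laterals) - 2, -1, -1):
            # convolve + up2x the higher (lower-resolution) level, mix by addition, convolve
            up = F.interpolate(laterals[i + 1], scale_factor=2,
                               mode='bilinear', align_corners=False)
            outs[i] = self.smooth[i](laterals[i] + up)
        return outs                                # all with 128 channels
```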
And step 302C, inputting the m second sample characteristic graphs into the multi-scale decoding network to obtain a first sample transparent channel image output by the multi-scale decoding network.
Wherein the resolution of the first sample transparent channel image is the same as the resolution of the sample image.
In a possible implementation manner, m second sample feature maps output by the feature pyramid are input into a multi-scale decoding network, the multi-scale decoding network performs addition and resolution conversion operations on the second sample feature maps to obtain a first sample transparent channel image corresponding to the sample image, and the first sample transparent channel image contains a prediction transparency value corresponding to each pixel point in the sample image and is used for comparing with a sample segmentation image subsequently to calculate cross entropy loss.
Since the second sample feature maps correspond to different resolutions, the addition processing cannot be performed directly, and since the highest resolution among them is 1/4, it is necessary to unify the resolutions of the second sample feature maps to 1/4 of the original image. In one possible implementation, each second sample feature map is processed by convolution blocks, where different resolutions correspond to different types and numbers of convolution blocks.
The types of convolution blocks used by the multi-scale decoding network include cgr2x, sgr2x, sgr, and the like. cgr2x includes a convolutional layer, a Group Normalization layer, an activation function (ReLU) layer, and a 2-times bilinear interpolation upsampling layer, where the number of input channels and the number of output channels corresponding to the convolutional layer are the same, for example, the number of input channels is 128 and the number of output channels is 128; sgr2x includes a convolutional layer, a Group Normalization layer, a ReLU layer, and a 2-times bilinear interpolation upsampling layer, where the number of input and output channels corresponding to the convolutional layer is different, for example, the number of input channels is 128 and the number of output channels is 64; sgr includes a convolutional layer, a Group Normalization layer, and a ReLU layer, where the number of input and output channels corresponding to the convolutional layer is different, for example, the number of input channels is 128 and the number of output channels is 64.
Schematically, as shown in fig. 6, it shows a schematic structural diagram of each volume block used by the multi-scale decoding network. Fig. 6 (a) is a schematic structural diagram corresponding to cgr2x, fig. 6 (B) is a schematic structural diagram corresponding to sgr2x, and fig. 6 (C) is a schematic structural diagram corresponding to sgr.
Schematically, as shown in fig. 5, the 4 second sample feature maps output by the feature pyramid network 502 are input into the multi-scale decoding network 503 and pass through different convolution blocks to form 4 third sample feature maps (not shown) with a resolution of 1/4 of the original image. For example, the second sample feature map 128 × 1/32 sequentially passes through two cgr2x blocks and one sgr2x block to obtain the corresponding third sample feature map, and the second sample feature map 128 × 1/16 sequentially passes through one cgr2x block and one sgr2x block to obtain the corresponding third sample feature map; similarly, the third sample feature map corresponding to each second sample feature map is obtained. After the obtained 4 third sample feature maps are added, the result is convolved and subjected to a 4-times upsampling operation, so as to obtain the first sample transparent channel image.
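The following sketch renders the cgr2x, sgr2x and sgr blocks and the branch structure described above in PyTorch; the 3 × 3 kernel size, the number of Group Normalization groups and the final sigmoid are assumptions, while the block order per branch follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_gn_relu(in_ch, out_ch, up2x):
    """Convolution -> Group Normalization -> ReLU, optionally followed by
    2-times bilinear interpolation upsampling."""
    layers = [nn.Conv2d(in_ch, out_ch, 3, padding=1),
              nn.GroupNorm(8, out_ch),              # number of groups assumed
              nn.ReLU(inplace=True)]
    if up2x:
        layers.append(nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False))
    return nn.Sequential(*layers)

def cgr2x(ch):               # same number of input and output channels, with up2x
    return conv_gn_relu(ch, ch, up2x=True)

def sgr2x(in_ch, out_ch):    # different input/output channels, with up2x
    return conv_gn_relu(in_ch, out_ch, up2x=True)

def sgr(in_ch, out_ch):      # different input/output channels, without upsampling
    return conv_gn_relu(in_ch, out_ch, up2x=False)

class MultiScaleDecoder(nn.Module):
    """Brings every 128-channel pyramid level to 1/4 resolution, adds the four
    third sample feature maps, then convolves and upsamples 4 times."""
    def __init__(self, ch=128, mid=64):
        super().__init__()
        self.branch32 = nn.Sequential(cgr2x(ch), cgr2x(ch), sgr2x(ch, mid))  # 1/32 -> 1/4
        self.branch16 = nn.Sequential(cgr2x(ch), sgr2x(ch, mid))             # 1/16 -> 1/4
        self.branch8 = sgr2x(ch, mid)                                        # 1/8  -> 1/4
        self.branch4 = sgr(ch, mid)                                          # 1/4  -> 1/4
        self.head = nn.Conv2d(mid, 1, 3, padding=1)

    def forward(self, p4, p8, p16, p32):
        fused = (self.branch4(p4) + self.branch8(p8)
                 + self.branch16(p16) + self.branch32(p32))
        alpha = torch.sigmoid(self.head(fused))     # 1-channel prediction at 1/4 resolution
        return F.interpolate(alpha, scale_factor=4, mode='bilinear', align_corners=False)
```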
Step 302D, inputting the m second sample feature maps into the deep supervision network to obtain m third sample transparent channel images output by the deep supervision network.
And the resolution of the m third sample transparent channel images is the same as that of the sample images.
In a possible implementation manner, the m second sample feature maps output by the feature pyramid network are input into a deep supervision network, and the deep supervision network is configured to perform upsampling processing on the m second sample feature maps to obtain m third sample transparent channel images with the same resolution as the sample images, and is configured to provide cross entropy loss on different resolutions for the first prediction model.
The upsampling multiple corresponding to different second sample feature maps is related to the resolution thereof, for example, the upsampling multiple corresponding to the second sample feature map with the resolution of 1/32 is 32 times, and the upsampling multiple corresponding to the second sample feature map with the resolution of 1/16 is 16 times.
Schematically, as shown in fig. 5, the second sample feature map 128 × 1/4 is up-sampled by 4 times to obtain a third sample transparent channel image 4, and the second sample feature map 128 × 1/8 is up-sampled by 8 times to obtain a third sample transparent channel image 8, and similarly, the third sample transparent channel image 16 and the third sample transparent channel image 32 can be obtained.
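A minimal sketch of this deep supervision branch follows; the 1 × 1 convolution that reduces each 128-channel map to a single channel before upsampling is an assumption, since the description only mentions the upsampling itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSupervision(nn.Module):
    """Produces one full-resolution third sample transparent channel image per level."""
    def __init__(self, ch=128, scales=(4, 8, 16, 32)):
        super().__init__()
        self.scales = scales
        self.proj = nn.ModuleList([nn.Conv2d(ch, 1, 1) for _ in scales])

    def forward(self, feats):   # feats ordered 1/4, 1/8, 1/16, 1/32
        return [F.interpolate(torch.sigmoid(p(f)), scale_factor=s,
                              mode='bilinear', align_corners=False)
                for p, f, s in zip(self.proj, feats, self.scales)]
```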
And step 302E, performing binarization processing on the first sample transparent channel image and the m third sample transparent channel images to obtain a first sample segmentation image and m second sample segmentation images.
The first sample transparent channel image and the m third sample transparent channel images are probability images, while the sample segmentation image, which is adopted to accelerate the convergence speed of the first prediction model, is a binarized image. Therefore, it is necessary to perform binarization processing on the first sample transparent channel image and the m third sample transparent channel images to obtain the first sample segmentation image and the m second sample segmentation images, which can then be compared with the sample segmentation image to calculate the cross entropy loss of the first prediction model.
The method for performing binarization processing on the first sample transparent channel image and the m third sample transparent channel images may refer to the generation process of the sample segmentation image in the above embodiment, which is not described herein again.
Schematically, as shown in fig. 5, binarization processing is performed on 4 third sample transparent channel images to obtain 4 second sample divided images, which are respectively represented as a second sample divided image 32, a second sample divided image 16, a second sample divided image 8, and a second sample divided image 4. And carrying out binarization processing on the first sample transparent channel image to obtain a first sample segmentation image.
Step 302F, training a first prediction model according to the first sample segmentation image, the m second sample segmentation images, and the sample segmentation image.
The loss of the first prediction model is a cross entropy loss, that is, the cross entropy loss between the first sample segmentation image and the sample segmentation image and the cross entropy losses between the m second sample segmentation images and the sample segmentation image are summed, so that the cross entropy loss corresponding to the first prediction model can be obtained.
Schematically, as shown in fig. 5, cross entropy losses between the first sample segmentation image and the sample segmentation image and cross entropy losses between each second sample segmentation image and the sample segmentation image are respectively calculated, and the sum of the cross entropy losses is the cross entropy loss corresponding to the first prediction model.
Wherein, the formula of the cross entropy loss can be expressed as:

$$\mathcal{L}_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log p_{i} + (1 - y_{i})\log(1 - p_{i})\right] \qquad (1)$$

wherein, taking the cross entropy loss between the sample segmentation image and the first sample segmentation image as an example, $y_{i}$ represents the value of the i-th pixel point in the sample segmentation image corresponding to the sample image, $p_{i}$ represents the value of the i-th pixel point in the first sample segmentation image, and $N$ is the number of pixel points; the loss is the average over all pixel points and is 0 in the ideal case.
Illustratively, the composite loss corresponding to the first prediction model may be expressed as:

$$\mathcal{L}_{total} = \mathcal{L}_{ce}^{pred} + \mathcal{L}_{ce}^{32} + \mathcal{L}_{ce}^{16} + \mathcal{L}_{ce}^{8} + \mathcal{L}_{ce}^{4} \qquad (2)$$

wherein $\mathcal{L}_{total}$ represents the composite loss corresponding to the first prediction model, $\mathcal{L}_{ce}^{pred}$ represents the cross entropy loss between the first sample segmentation image and the sample segmentation image, and $\mathcal{L}_{ce}^{32}$, $\mathcal{L}_{ce}^{16}$, $\mathcal{L}_{ce}^{8}$ and $\mathcal{L}_{ce}^{4}$ represent the cross entropy losses between the second sample segmentation images 32, 16, 8 and 4 and the sample segmentation image, respectively.
In a possible implementation manner, the composite loss corresponding to the first prediction model may be calculated according to the above formula (1) and formula (2), and the first prediction model is then subjected to a back propagation algorithm using the composite loss to update each parameter in the first prediction model.
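A sketch of this loss computation and parameter update is given below. Note that, while the description binarizes the predictions before the comparison, a differentiable training step would normally apply the cross entropy directly to the predicted probabilities, which is the assumption made here; the optimizer is likewise assumed.

```python
import torch
import torch.nn.functional as F

def composite_loss(pred_main, preds_deep, seg_label):
    """pred_main: first sample transparent channel image (probabilities);
    preds_deep: list of the m deep supervision outputs;
    seg_label: binarized sample segmentation image of the same size."""
    loss = F.binary_cross_entropy(pred_main, seg_label)        # formula (1)
    for p in preds_deep:
        loss = loss + F.binary_cross_entropy(p, seg_label)     # summed terms of formula (2)
    return loss

# one training step (optimizer assumed, e.g. torch.optim.Adam over the model parameters):
# loss = composite_loss(pred_main, preds_deep, seg_label)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```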
Optionally, over a plurality of training periods, the first prediction model is repeatedly trained according to the method shown in the above embodiment until the loss function corresponding to the first prediction model fully converges; the training of the first prediction model is then completed, and the first prediction model is stored with its parameters frozen.
In this embodiment, the first prediction model is trained through the acquired sample image and the sample segmentation image, and cross entropy loss is introduced in the training process of the first prediction model, so that the first prediction model can be rapidly converged, and the training efficiency of the first prediction model is improved.
In a possible implementation manner, after the training of the first prediction model is completed, the sample image may be input into the trained first prediction model to obtain a first sample transparent channel image for the training process of the second prediction model.
Referring to fig. 7, a flowchart illustrating a method for training a second predictive model according to an exemplary embodiment of the present application is shown. The method comprises the following steps:
and 701, inputting the sample image into the first prediction model obtained by training to obtain a first sample transparent channel image output by the first prediction model.
Since the training of the second prediction model depends on the output result of the first prediction model, that is, the corresponding first sample transparent channel image output by the first prediction model, the second prediction model needs to be trained after the training of the first prediction model is completed.
In a possible implementation manner, the sample image is input into a first prediction model obtained through training, and a first transparent channel image corresponding to the sample image output by the first prediction model is obtained and used in a subsequent training process of a second prediction model.
Optionally, each sample image in the data set may be input into the first prediction model to obtain a first sample transparent channel image corresponding to each sample image, so that the sample image and the corresponding first sample transparent channel image are used as the data set for training the second prediction model.
Step 702, training a second prediction model according to the first sample transparent channel image, the sample annotation image and the sample image.
Since the second prediction model aims to improve the fineness of each pixel point in the first transparent channel image, and the sample annotation image is obviously finer than the sample segmentation image, in a possible implementation, the second prediction model is obtained by training with the first sample transparent channel image, the sample annotation image and the sample image.
In a possible implementation manner, in order to improve the precision of the second sample transparent channel image output by the second prediction model, losses other than basic matting losses, such as connectivity difference loss, structural similarity loss, and edge gradient loss, are introduced, and these losses pay more attention to the transparency channel value of the boundary region between the foreground image and the background image, so that the precision of the second sample transparent channel image output by the refinement network is higher than that of the first sample transparent channel image.
Illustratively, as shown in FIG. 8, step 702 may include steps 702A through 702F.
Step 702A, inputting the first sample transparent channel image and the sample image into a refinement network to obtain a second sample transparent channel image output by the refinement network.
In a possible implementation manner, the second prediction model mainly includes a refinement network, and the refinement network is used for performing convolution processing on the first sample transparent channel image and the sample image, and can correct some wrong transparency channel values in the first sample transparent channel image and correct transparency channel values in a boundary region between the foreground image and the background image, so that the fineness of the transparency channel values corresponding to each pixel point is improved, and the second sample transparent channel image is output.
Optionally, the first sample transparent channel image and the sample image may be input to a refinement network after Concat processing.
Optionally, the refinement network may include three convolution blocks and one convolutional layer, so as to reduce the amount of computation.
Schematically, as shown in fig. 9, a sample image is input into the trained first prediction model 901, so that the first sample transparent channel image corresponding to the sample image can be obtained; the first sample transparent channel image and the sample image are then subjected to Concat processing and input into the refinement network 902, which outputs the second sample transparent channel image, where the refinement network 902 is composed of three convolution blocks and one convolutional layer.
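A sketch of such a refinement network is shown below; the channel width, kernel size and use of Group Normalization inside each convolution block are assumptions, while the overall structure (Concat input, three convolution blocks, one convolutional layer) follows the description.

```python
import torch
import torch.nn as nn

class RefinementNetwork(nn.Module):
    """Takes the Concat of the sample image and the first sample transparent
    channel image and outputs the second sample transparent channel image."""
    def __init__(self, ch=64):
        super().__init__()
        def block(in_ch, out_ch):
            return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                 nn.GroupNorm(8, out_ch),
                                 nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(block(4, ch), block(ch, ch), block(ch, ch))
        self.head = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, image, alpha_coarse):
        x = torch.cat([image, alpha_coarse], dim=1)      # Concat processing
        return torch.sigmoid(self.head(self.blocks(x)))
```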
Step 702B, inputting the sample image, the second sample transparent channel image and the sample labeling image into an edge gradient network to obtain an edge gradient loss corresponding to the second sample transparent channel image.
In a possible implementation manner, for the calculation manner of the edge gradient loss, a network specially for calculating the edge gradient loss is provided, that is, an edge gradient network, and the sample image, the second sample transparent channel image and the sample label image are input into the edge gradient network, so that the edge gradient loss corresponding to the second sample transparent channel image can be obtained, and the loss on the edge is provided for the subsequent training process.
In one possible embodiment, the process of obtaining the edge gradient loss may comprise the steps of:
the method comprises the steps of firstly, inputting a sample image into a preset operator to obtain a sample gradient image corresponding to the sample image, wherein the preset operator is used for carrying out first order reciprocal operation on the sample image.
Since the image edge is a boundary region between the foreground image and the background image, in order to obtain the edge gradient loss, it is necessary to first obtain edge images corresponding to the sample image and the second sample transparent channel image, respectively.
In a possible implementation manner, a preset operator is arranged in the edge gradient network, and the preset operator can perform a first derivative operation on the sample image to obtain gradients of the sample image in the x and y directions, so as to output the sample gradient image.
Optionally, the preset operator may adopt a Sobel (Sobel) operator, or may adopt other filtering operators that generate an image gradient, for example, a scherr (Scharr) operator, a Laplacian (Laplacian) operator, and the like. The embodiment of the application does not limit the adopted preset operator.
Illustratively, taking Sobel operator as an example, the process of generating the sample gradient image can be represented as follows:
$$G_{x} = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * A \qquad (3)$$

$$G_{y} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix} * A \qquad (4)$$

$$G = \sqrt{G_{x}^{2} + G_{y}^{2}} \qquad (5)$$

wherein, $A$ represents the input sample image, $G_{x}$ represents the gradient map of the sample image in the x direction, $G_{y}$ represents the gradient map of the sample image in the y direction, and $G$ represents the sample gradient image output after passing through the Sobel operator.
In a possible implementation manner, after performing gradient operation on the sample image according to the formula (3) and the formula (4), a gradient map of the sample image in the x and y directions is obtained, and the gradient map is substituted into the formula (5), so that a sample gradient image corresponding to the sample image can be obtained through calculation.
And secondly, carrying out binarization and dilation-erosion operations on the sample annotation image to obtain a sample edge image, wherein the sample edge image is used for indicating the boundary area between the foreground image and the background image in the sample annotation image.
In a possible implementation manner, binarization and dilation-erosion operations are performed on the labeled image, so that a sample edge image can be obtained, and the sample edge image is used for dividing edge regions in the second sample channel image and the sample gradient image.
And thirdly, generating a sample edge gradient image according to the sample edge image and the sample gradient image, wherein the sample edge gradient image is used for indicating the boundary area of the foreground image and the background image in the sample image.
In a possible implementation manner, the sample edge image and the sample gradient image are multiplied, and the boundary area of the foreground image and the background image in the sample image can be divided from the sample gradient image, so that the sample edge gradient image is obtained.
And fourthly, generating an edge transparent channel image according to the second sample transparent channel image and the sample edge image, wherein the edge transparent channel image is used for indicating the boundary area of the foreground image and the background image in the second sample transparent channel image.
In a possible implementation manner, the second sample transparent channel image is multiplied by the sample edge image, and the boundary area between the foreground image and the background image in the second sample transparent channel image can be divided, so as to obtain the corresponding edge transparent channel image.
And fifthly, calculating to obtain edge gradient loss according to the edge transparent channel image and the sample edge gradient image.
In a possible implementation manner, according to the obtained edge transparent channel image and the sample edge gradient image, the edge gradient loss corresponding to the second sample transparent channel image can be calculated.
Illustratively, the edge gradient loss can be calculated as:

$$\mathcal{L}_{edge} = \left\| G_{input} \cdot E_{label} - G_{Refined,Mask} \cdot E_{label} \right\|_{1}$$

wherein $\mathcal{L}_{edge}$ represents the edge gradient loss corresponding to the second sample transparent channel image, $G_{input}$ represents the sample gradient image, $E_{label}$ represents the sample edge image, $G_{Refined,Mask}$ represents the second sample transparent channel image, and $\|\cdot\|_{1}$ indicates that the edge gradient loss is calculated by the $L_{1}$ norm.
Schematically, as shown in fig. 9, the sample image is input into the edge gradient network 903 and passed through the Sobel operator to obtain the sample gradient image; the sample label image is input into the edge gradient network 903 and subjected to binarization and dilation-erosion operations to obtain the sample edge image; the second sample transparent channel image is input into the edge gradient network 903 and multiplied by the sample edge image to obtain the edge transparent channel image; the sample gradient image is multiplied by the sample edge image to obtain the sample edge gradient image; and the edge gradient loss is calculated according to the sample edge gradient image and the edge transparent channel image.
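The following sketch puts the first to fifth steps above together as a single loss computation in PyTorch; reading the loss as the L1 difference between the edge transparent channel image and the sample edge gradient image, normalized over the edge band, is an assumption based on the description above.

```python
import torch

def edge_gradient_loss(refined_alpha: torch.Tensor,
                       sample_gradient: torch.Tensor,
                       sample_edge: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch of the edge gradient loss described above.

    All tensors are assumed to share shape (N, 1, H, W) with values in [0, 1]:
    `refined_alpha` is the second sample transparent channel image,
    `sample_gradient` the Sobel gradient of the sample image, and
    `sample_edge` the binary boundary band derived from the labeled image.
    """
    edge_alpha = refined_alpha * sample_edge        # edge transparent channel image
    edge_gradient = sample_gradient * sample_edge   # sample edge gradient image
    # L1 norm restricted to the boundary band, averaged over its pixels.
    return torch.sum(torch.abs(edge_alpha - edge_gradient)) / (sample_edge.sum() + 1e-8)
```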
And step 702C, calculating the structural similarity loss and the matting loss corresponding to the second sample transparent channel image according to the second sample transparent channel image and the sample annotation image.
Wherein, the calculation formula of the matting loss can be expressed as:

$$\mathcal{L}_{alpha}^{i} = \sqrt{\left(\alpha_{Refined}^{i} - \alpha_{label}^{i}\right)^{2} + \epsilon^{2}}$$

wherein $\mathcal{L}_{alpha}^{i}$ represents the matting loss corresponding to the second sample transparent channel image, $\alpha_{Refined}^{i}$ represents the transparency channel value corresponding to the ith pixel point in the second sample transparent channel image, $\alpha_{label}^{i}$ represents the transparency channel value corresponding to the ith pixel point in the sample annotation image, and $\epsilon$ is a small constant.
In a possible implementation manner, the second sample transparent channel image and the sample annotation image are substituted into the above formula, so that the matting loss corresponding to the second sample transparent channel image can be obtained.
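A minimal PyTorch sketch of this matting loss, assuming the per-pixel values are averaged into a scalar (the reduction is not specified above):

```python
import torch

def matting_loss(refined_alpha: torch.Tensor,
                 label_alpha: torch.Tensor,
                 eps: float = 1e-6) -> torch.Tensor:
    """Illustrative sketch: per-pixel alpha prediction (matting) loss.

    The squared difference between predicted and labeled transparency values
    is smoothed by the small constant eps, matching the formula above; the
    mean over all pixels is an assumed reduction.
    """
    diff = refined_alpha - label_alpha
    return torch.sqrt(diff ** 2 + eps ** 2).mean()
```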
The calculation formula of the structural similarity loss can be expressed as follows:
$$SSIM(x, y) = \frac{\left(2\mu_x\mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}{\left(\mu_x^{2} + \mu_y^{2} + C_1\right)\left(\sigma_x^{2} + \sigma_y^{2} + C_2\right)}$$

$$\mathcal{L}_{SSIM} = 1 - SSIM(x, y)$$

wherein $SSIM(x, y)$ represents the structural similarity index, $\mathcal{L}_{SSIM}$ is the structural similarity loss corresponding to the second sample transparent channel image, $\mu_x$ is the mean of the sample annotation image, $\sigma_x^{2}$ is the variance of the sample annotation image, $\mu_y$ is the mean of the second sample transparent channel image, $\sigma_y^{2}$ is the variance of the second sample transparent channel image, $\sigma_{xy}$ is the covariance between the two images, and $C_1$ and $C_2$ are constants.
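A minimal PyTorch sketch of the structural similarity loss, assuming the statistics are computed globally over each image as in the formula above rather than in local windows; the values of C1 and C2 are assumptions.

```python
import torch

def ssim_loss(refined_alpha: torch.Tensor,
              label_alpha: torch.Tensor,
              c1: float = 0.01 ** 2,
              c2: float = 0.03 ** 2) -> torch.Tensor:
    """Illustrative sketch: global structural-similarity loss, 1 - SSIM."""
    mu_x, mu_y = refined_alpha.mean(), label_alpha.mean()
    var_x = refined_alpha.var(unbiased=False)
    var_y = label_alpha.var(unbiased=False)
    cov_xy = ((refined_alpha - mu_x) * (label_alpha - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim
```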
Step 702D, inputting the second sample transparent channel image and the sample annotation image into a connectivity difference network to obtain connectivity difference loss corresponding to the second sample transparent channel image.
Connectivity means that, for a single pixel in a grayscale image, pixels with the same value exist at the positions adjacent to it above, below, to the left and to the right. The better the prediction effect of the second prediction model, the more similar the connectivity maps of the predicted second sample transparent channel image and the sample annotation image should be.
In a possible implementation manner, a connectivity difference network is preset by a developer, and the second sample transparent channel image and the sample annotation image may be input into the connectivity difference network to calculate the connectivity difference loss corresponding to the second sample transparent channel image.
Wherein, the calculation formula of the connectivity difference loss can be expressed as:
$$\mathcal{L}_{conn} = \sum_{i}\left|\varphi(p_i,\Omega) - \varphi\left(p_i^{gt},\Omega\right)\right|$$

wherein $\mathcal{L}_{conn}$ represents the connectivity difference loss corresponding to the second sample transparent channel image, and $\Omega$ represents the connected region with the value 1 shared by the second sample transparent channel image and the sample annotation image. The function $\varphi(p_i,\Omega)$ calculates the degree of connectivity between the ith pixel $p_i$ of the second sample transparent channel image and $\Omega$, with 1 indicating full connectivity and 0 indicating no connectivity, and $p_i^{gt}$ indicates the ith pixel point on the sample annotation image.
Wherein, the function $\varphi$ may be represented in the following form:

$$\varphi(p_i,\Omega) = 1 - \lambda_i \cdot \delta(d_i \geq \theta)\cdot d_i, \qquad d_i = p_i - l_i$$

where $\theta$ is a threshold parameter and $d_i$ represents the distance from the current pixel value $p_i$ to the critical threshold value $l_i$; when $d_i$ is less than $\theta$ it is ignored. $\lambda_i$ is a weight derived from the set of discrete pixel values between $p_i$ and $l_i$, and $dist_k(i)$ denotes, when the threshold is set to $k$, the normalized Euclidean distance between pixel $i$ and the pixel nearest to $i$ in the region connected to the source domain.
Illustratively, as shown in fig. 9, inputting the second sample transparent channel image and the sample annotation image into the connectivity difference network 904 may obtain a connectivity difference loss output by the connectivity difference network 904.
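The connectivity measure is not spelled out in executable form here, so the following NumPy/SciPy sketch is only a simplified approximation: it scans a set of discrete thresholds, uses connected-component labeling to find the largest threshold at which each pixel stays connected to Ω, and omits the dist_k-based normalization. Function names, the number of levels and θ are assumptions.

```python
import numpy as np
from scipy import ndimage

def connectivity_map(alpha: np.ndarray, omega: np.ndarray,
                     levels: int = 10, theta: float = 0.15) -> np.ndarray:
    """Simplified sketch of the connectivity function phi described above."""
    h, w = alpha.shape
    l = np.zeros((h, w), dtype=np.float32)
    for k in np.linspace(0.0, 1.0, levels, endpoint=False):
        mask = alpha >= k
        labels, _ = ndimage.label(mask)            # connected components at level k
        omega_ids = np.unique(labels[omega & mask])
        omega_ids = omega_ids[omega_ids > 0]
        connected = np.isin(labels, omega_ids)
        l[connected] = k                           # largest level still connected to Omega
    d = np.clip(alpha - l, 0.0, 1.0)
    d[d < theta] = 0.0                             # distances below theta are ignored
    return 1.0 - d                                 # phi: 1 fully connected, 0 disconnected

def connectivity_difference_loss(pred_alpha: np.ndarray, label_alpha: np.ndarray) -> float:
    """Sum of absolute differences between the two connectivity maps."""
    omega = (pred_alpha >= 1.0 - 1e-3) & (label_alpha >= 1.0 - 1e-3)
    phi_pred = connectivity_map(pred_alpha, omega)
    phi_label = connectivity_map(label_alpha, omega)
    return float(np.abs(phi_pred - phi_label).sum())
```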
And step 702E, training a refinement network according to the edge gradient loss, the connectivity difference loss, the matting loss and the structural similarity loss.
In a possible implementation manner, the refinement network is trained by combining the various losses obtained in the above embodiments; compared with using only the matting loss, this can obviously improve the fineness of the generated second sample transparent channel image.
In step 702F, the refined network obtained by training is determined as the second prediction model.
In a possible implementation manner, a back propagation algorithm is executed on the refinement network and the parameters of each convolution layer of the refinement network are updated; the training process in the above embodiments is repeated in each training period until the loss function corresponding to the second prediction model fully converges, and the trained refinement network is determined as the second prediction model.
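A minimal PyTorch sketch of one training step of the refinement network under the combined losses described above; the equal loss weights, the concatenation of the sample image with the coarse alpha map, and leaving the connectivity difference term out of back-propagation (it would need a differentiable implementation) are all assumptions.

```python
import torch

def train_refinement_step(refine_net, optimizer, coarse_alpha, sample_image,
                          label_alpha, sample_gradient, sample_edge, eps=1e-6):
    """Illustrative sketch of one refinement-network training step."""
    optimizer.zero_grad()
    # The refinement network sees the sample image concatenated with the coarse alpha.
    refined = refine_net(torch.cat([sample_image, coarse_alpha], dim=1))

    # Matting loss: smoothed L1 between predicted and labeled alpha.
    l_alpha = torch.sqrt((refined - label_alpha) ** 2 + eps ** 2).mean()

    # Structural similarity loss over global statistics (1 - SSIM).
    mu_x, mu_y = refined.mean(), label_alpha.mean()
    var_x, var_y = refined.var(unbiased=False), label_alpha.var(unbiased=False)
    cov = ((refined - mu_x) * (label_alpha - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + 1e-4) * (2 * cov + 9e-4)) / \
           ((mu_x ** 2 + mu_y ** 2 + 1e-4) * (var_x + var_y + 9e-4))
    l_ssim = 1.0 - ssim

    # Edge gradient loss: L1 restricted to the boundary band.
    l_grad = torch.abs(refined * sample_edge - sample_gradient * sample_edge).sum() \
             / (sample_edge.sum() + 1e-8)

    loss = l_alpha + l_ssim + l_grad
    loss.backward()     # back-propagation through the refinement network
    optimizer.step()    # update the convolution-layer parameters
    return loss.item()
```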
In this embodiment, the refinement network is trained by introducing a plurality of loss functions, namely the connectivity difference loss, the edge gradient loss, the matting loss and the structural similarity loss, so that the second sample transparent channel image output by the refinement network focuses more on the transparency channel values in the edge region, thereby helping to improve the accuracy of image segmentation.
In a possible implementation manner, after the training of the first prediction model and the second prediction model is completed according to the methods shown in the above embodiments, the trained prediction models may be deployed on a computer device, and the segmentation process on the original image is implemented by using the first prediction model and the second prediction model.
Referring to fig. 10, a flowchart of an image processing method according to another exemplary embodiment of the present application is shown. This embodiment is described by taking an example in which the method is applied to a computer device, and the method includes the following steps.
Step 1001, an original image is acquired.
For the implementation manner of this step, reference may be made to step 201, and details are not described herein again in this embodiment.
Step 1002, inputting an original image into a multi-scale coding network to obtain n first feature maps output by the multi-scale coding network, wherein the resolution and the number of channels of different first feature maps are different, and n is an integer greater than or equal to 2.
Step 1003, inputting the n first feature maps into the feature pyramid network, and obtaining n second feature maps output by the feature pyramid network, wherein the number of channels of different second feature maps is the same and the resolution is different, and the number of channels of the n second feature maps is the target number of channels.
Illustratively, as shown in fig. 11, step 1003 may include step 1003A, step 1003B, and step 1003C.
Step 1003A, arranging the n first feature maps according to the resolution to form a feature pyramid, wherein the resolution of the first feature map in the feature pyramid is in a negative correlation with the level of the first feature map.
And 1003B, performing convolution processing on the nth first feature map to obtain an nth second feature map in response to the fact that the number of channels corresponding to the nth first feature map is the maximum number of channels.
And 1003C, in response to the fact that the number of channels corresponding to the nth first feature map is not the maximum number of channels, performing convolution processing on the nth first feature map to obtain a fourth feature map, performing convolution and upsampling processing on the (n + 1) th first feature map to obtain a fifth feature map, mixing the fourth feature map and the fifth feature map, and performing convolution processing to obtain an nth second feature map.
It should be noted that step 1003B and step 1003C may be executed simultaneously; step 1003B may be executed first and then step 1003C; or step 1003C may be executed first and then step 1003B. The execution order of step 1003B and step 1003C is not limited in this embodiment.
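A compact PyTorch sketch of steps 1003A to 1003C, read as a standard FPN-style top-down pass: each first feature map is reduced to the target channel number with a 1x1 convolution, mixed by element-wise addition with the upsampled higher-level map, and smoothed with a 3x3 convolution. The channel counts, the use of addition for "mixing", and taking the upsampled term from the already-fused higher-level map (rather than directly from the (n+1)th first feature map) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Illustrative sketch of the feature pyramid step described above."""

    def __init__(self, in_channels=(64, 128, 256, 512), target_channels=128):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, target_channels, kernel_size=1) for c in in_channels])
        self.smooth = nn.ModuleList(
            [nn.Conv2d(target_channels, target_channels, kernel_size=3, padding=1)
             for _ in in_channels[:-1]])

    def forward(self, first_maps):
        # first_maps[0] has the highest resolution, first_maps[-1] the deepest level.
        second_maps = [None] * len(first_maps)
        second_maps[-1] = self.lateral[-1](first_maps[-1])        # deepest level: 1x1 conv only
        for i in range(len(first_maps) - 2, -1, -1):
            lateral = self.lateral[i](first_maps[i])              # "fourth feature map"
            upsampled = F.interpolate(second_maps[i + 1],         # "fifth feature map"
                                      size=lateral.shape[-2:],
                                      mode="bilinear", align_corners=False)
            second_maps[i] = self.smooth[i](lateral + upsampled)  # mix, then 3x3 conv
        return second_maps
```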
And 1004, inputting the n second feature maps into the multi-scale decoding network to obtain a first transparent channel image output by the multi-scale decoding network.
Illustratively, as shown in FIG. 11, step 1004 includes step 1004A and step 1004B.
Step 1004A, the n second feature maps are respectively processed by convolution blocks to obtain n third feature maps, wherein the resolution corresponding to the n third feature maps is the same, different second feature maps correspond to different convolution blocks, and the number of the convolution blocks used by different second feature maps is different.
And 1004B, adding, convolving and upsampling the n third feature maps to obtain a first transparent channel image.
The process of generating the first transparent channel image may refer to the training process of the first prediction model in the above embodiments, which is not described herein again in this embodiment of the present application.
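A sketch of steps 1004A and 1004B under simple assumptions: each second feature map passes through as many convolution-plus-upsampling blocks as its pyramid level requires to reach the common resolution, the resulting third feature maps are added, and a final convolution and upsampling produce the one-channel transparent channel image. The block design, channel count and sigmoid output are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(channels):
    """One decoding block: 3x3 convolution + ReLU followed by 2x upsampling."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False))

class MultiScaleDecoder(nn.Module):
    """Illustrative sketch of the multi-scale decoding step described above."""

    def __init__(self, channels=128, num_levels=4):
        super().__init__()
        # Level i (0 = highest resolution) needs i upsampling blocks to reach level 0.
        self.branches = nn.ModuleList(
            [nn.Sequential(*[conv_block(channels) for _ in range(i)]) if i > 0
             else nn.Identity()
             for i in range(num_levels)])
        self.head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, second_maps, output_size):
        third_maps = [branch(fmap) for branch, fmap in zip(self.branches, second_maps)]
        fused = torch.stack(third_maps, dim=0).sum(dim=0)    # element-wise addition
        alpha = torch.sigmoid(self.head(fused))              # one-channel alpha map
        # Final upsampling back to the original image resolution.
        return F.interpolate(alpha, size=output_size, mode="bilinear", align_corners=False)
```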
Step 1005, inputting the first transparent channel image and the original image into a second prediction model to obtain a second transparent channel image output by the second prediction model, wherein the fineness of the second transparent channel image is higher than that of the first transparent channel image.
And step 1006, segmenting the original image according to the second transparent channel image to obtain an image corresponding to the target object.
For the implementation of step 1005 and step 1006, reference may be made to step 201 and step 202, and details are not described herein again in this embodiment.
In the embodiment of the application, the trained first prediction model and second prediction model are deployed; the original image is preprocessed and then input into the first prediction model to obtain the first transparent channel image output by the first prediction model, and the generated first transparent channel image together with the original image is then input into the second prediction model to obtain the second transparent channel image output by the second prediction model, which is used for the subsequent image processing.
Referring to fig. 12, a network deployment diagram of an image processing method according to an exemplary embodiment of the present application is shown. The network deployment map includes: the system comprises a multi-scale coding network, a characteristic pyramid network, a multi-scale decoding network and a fine network.
In a possible implementation manner, after an original image is preprocessed, the original image is input into a multi-scale coding network 1201, and n first feature maps output by the multi-scale coding network 1201 are obtained; inputting the n first feature maps into the feature pyramid network 1202 to obtain n second feature maps output by the feature pyramid network 1202, wherein the number of channels of the n second feature maps is a target number of channels; inputting the n second feature maps into the multi-scale decoding network 1203, and performing operations such as addition, resolution conversion and the like to obtain a first transparent channel image output by the multi-scale decoding network 1203; the first transparency channel image and the original image are input into a refinement network 1204 to obtain a second transparency channel image output by the refinement network 1204, and the original image is segmented by using the second transparency channel image to obtain an image corresponding to the target object.
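Putting the deployed networks of fig. 12 together, a minimal inference sketch might look as follows; the module interfaces match the sketches earlier in this document and are assumptions rather than the exact deployed implementation.

```python
import torch

@torch.no_grad()
def segment_image(original, encoder, fpn, decoder, refine_net):
    """Illustrative sketch of the deployed inference pipeline in fig. 12.

    `original` is assumed to be a preprocessed (N, 3, H, W) tensor in [0, 1];
    the four networks are the trained modules described above. Returns the
    extracted foreground and the refined alpha map.
    """
    first_maps = encoder(original)                            # n first feature maps
    second_maps = fpn(first_maps)                             # n second feature maps
    coarse_alpha = decoder(second_maps, original.shape[-2:])  # first transparent channel image
    refined_alpha = refine_net(torch.cat([original, coarse_alpha], dim=1))
    foreground = original * refined_alpha                     # segment the target object
    return foreground, refined_alpha
```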
Referring to fig. 13, a block diagram of an image processing apparatus according to an exemplary embodiment of the present application is shown. The apparatus may be implemented as all or part of a computer device in software, hardware or a combination of both, the apparatus comprising:
a first obtaining module 1301, configured to obtain an original image, where the original image includes at least one target object;
a first prediction module 1302, configured to input the original image into a first prediction model, so as to obtain a first transparent channel image output by the first prediction model, where the first transparent channel image includes a prediction transparency value corresponding to each pixel point in the original image;
a second prediction module 1303, configured to input the first transparent channel image and the original image into a second prediction model to obtain a second transparent channel image output by the second prediction model, where fineness of the second transparent channel image is higher than fineness of the first transparent channel image;
and a segmentation processing module 1304, configured to perform segmentation processing on the original image according to the second transparent channel image to obtain an image corresponding to the target object.
Optionally, the apparatus further comprises:
the second acquisition module is used for acquiring a sample image, a sample labeling image and a sample segmentation image, wherein the sample labeling image is labeled with transparency values corresponding to all pixel points in the sample image, and the sample segmentation image is a binarization image obtained by performing binarization processing on the sample labeling image;
a first training module, configured to train the first prediction model according to the sample image and the sample segmentation image;
the third prediction module is used for inputting the sample image into the first prediction model obtained by training to obtain a first sample transparent channel image output by the first prediction model;
and the second training module is used for training the second prediction model according to the first sample transparent channel image, the sample labeling image and the sample image.
Optionally, the second training module includes:
a refinement unit, configured to input the first sample transparent channel image and the sample image into a refinement network, and obtain a second sample transparent channel image output by the refinement network;
the edge gradient unit is used for inputting the sample image, the second sample transparent channel image and the sample labeling image into an edge gradient network to obtain an edge gradient loss corresponding to the second sample transparent channel image;
the calculating unit is used for calculating the structural similarity loss and the matting loss corresponding to the second sample transparent channel image according to the second sample transparent channel image and the sample annotation image;
a connectivity difference unit, configured to input the second sample transparent channel image and the sample labeling image into a connectivity difference network, so as to obtain a connectivity difference loss corresponding to the second sample transparent channel image;
a first training unit, configured to train the refinement network according to the edge gradient loss, the connectivity difference loss, the matting loss, and the structural similarity loss;
and the determining unit is used for determining the refined network obtained by training as the second prediction model.
Optionally, the edge gradient unit is further configured to:
inputting the sample image into a preset operator to obtain a sample gradient image corresponding to the sample image, wherein the preset operator is used for performing a first-order derivative operation on the sample image;
carrying out binarization and dilation-erosion operations on the sample labeling image to obtain a sample edge image, wherein the sample edge image is used for indicating a boundary area of a foreground image and a background image in the sample labeling image;
generating a sample edge gradient image according to the sample edge image and the sample gradient image, wherein the sample edge gradient image is used for indicating a boundary area of a foreground image and a background image in the sample image;
generating an edge transparent channel image according to the second sample transparent channel image and the sample edge image, wherein the edge transparent channel image is used for indicating a boundary area of a foreground image and a background image in the second sample transparent channel image;
and calculating to obtain the edge gradient loss according to the edge transparent channel image and the sample edge gradient image.
Optionally, the first prediction model includes a multi-scale coding network, a feature pyramid network, a multi-scale decoding network, and a deep supervision network;
optionally, the first training module includes:
a first multi-scale encoding unit, configured to input the sample image into the multi-scale encoding network, so as to obtain m first sample feature maps output by the multi-scale encoding network, where the resolution and the number of channels of different first sample feature maps are different, m is an integer greater than or equal to 2, and the multi-scale encoding network is configured to perform feature extraction on the sample image;
a first feature pyramid unit, configured to input the m first sample feature maps into the feature pyramid network, and obtain m second sample feature maps output by the feature pyramid network, where channel numbers of different second sample feature maps are the same and have different resolutions, and the feature pyramid network is configured to process the channel numbers of the m first sample feature maps into a target channel number;
a first multi-scale decoding unit, configured to input the m second sample feature maps into the multi-scale decoding network, so as to obtain the first sample transparent channel image output by the multi-scale decoding network, where the multi-scale decoding network is configured to perform addition and resolution conversion operations on the m second sample feature maps, and a resolution of the first sample transparent channel image is the same as a resolution of the sample image;
the deep supervision unit is used for inputting the m second sample feature maps into the deep supervision network to obtain m third sample transparent channel images output by the deep supervision network, the deep supervision network is used for performing upsampling processing on the m second sample feature maps, different second sample feature maps correspond to different upsampling multiples, and the resolution of the m third sample transparent channel images is the same as that of the sample images (an illustrative sketch of this branch is given after this module list);
a binarization processing unit, configured to perform binarization processing on the first sample transparent channel image and the m third sample transparent channel images to obtain a first sample segmentation image and m second sample segmentation images;
and the second training unit is used for training the first prediction model according to the first sample segmentation image, the m second sample segmentation images and the sample segmentation images.
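As referenced in the deep supervision unit above, the following sketch illustrates one plausible form of the deep supervision branch: each second sample feature map is projected to a single channel and upsampled by its own multiple back to the sample-image resolution, yielding the m third sample transparent channel images used as auxiliary outputs; the 1x1 projection and sigmoid output are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSupervisionHeads(nn.Module):
    """Illustrative sketch of the deep supervision branch described above."""

    def __init__(self, channels=128, num_levels=4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_levels)])

    def forward(self, second_maps, sample_size):
        outputs = []
        for head, fmap in zip(self.heads, second_maps):
            alpha = torch.sigmoid(head(fmap))
            # Each level needs a different upsampling multiple to reach sample_size.
            outputs.append(F.interpolate(alpha, size=sample_size,
                                         mode="bilinear", align_corners=False))
        return outputs
```

These auxiliary outputs can then be binarized (for example with a 0.5 threshold) and compared against the sample segmentation image during training of the first prediction model, as described above.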
Optionally, the first prediction module 1302 includes:
a second multi-scale encoding unit, configured to input the original image into the multi-scale encoding network, so as to obtain n first feature maps output by the multi-scale encoding network, where resolution and channel number of different first feature maps are different, and n is an integer greater than or equal to 2;
a second feature pyramid unit, configured to input the n first feature maps into the feature pyramid network, so as to obtain n second feature maps output by the feature pyramid network, where channel numbers of different second feature maps are the same and have different resolutions, and the channel number of the n second feature maps is the target channel number;
and the second multi-scale decoding unit is used for inputting the n second feature maps into the multi-scale decoding network to obtain the first transparent channel image output by the multi-scale decoding network.
Optionally, the second feature pyramid unit is further configured to:
arranging the n first feature maps according to the resolution to form a feature pyramid, wherein the resolution of the first feature maps in the feature pyramid is in a negative correlation relation with the level of the first feature maps;
performing convolution processing on the nth first feature diagram to obtain an nth second feature diagram in response to the fact that the number of channels corresponding to the nth first feature diagram is the maximum number of channels;
and in response to the fact that the number of channels corresponding to the nth first feature map is not the maximum number of channels, performing convolution processing on the nth first feature map to obtain a fourth feature map, performing convolution and up-sampling processing on the (n + 1) th first feature map to obtain a fifth feature map, mixing the fourth feature map and the fifth feature map, and performing convolution processing to obtain the nth second feature map.
Optionally, the second multi-scale decoding unit is further configured to:
respectively processing the n second feature maps through convolution blocks to obtain n third feature maps, wherein the resolution corresponding to the n third feature maps is the same, different second feature maps correspond to different convolution blocks, and the number of the convolution blocks used by different second feature maps is different;
and adding, convolving and upsampling the n third feature maps to obtain the first transparent channel image.
In the embodiment of the application, the acquired original image is input into the first prediction model to obtain the first transparent channel image output by the first prediction model (including the prediction transparency value corresponding to each pixel point in the original image); the first transparent channel image and the original image are then input into the second prediction model to obtain the second transparent channel image output by the second prediction model, and the original image is segmented according to the second transparent channel image to obtain the image corresponding to the target object. Because the fineness of the second transparent channel image is higher than that of the first transparent channel image, the accuracy of image segmentation can be improved; compared with the image segmentation method in the related art, the transparent channel image used for segmentation can be generated directly from the original image without introducing a trimap image, which further improves the accuracy of image segmentation.
Referring to fig. 14, a schematic structural diagram of a computer device according to an exemplary embodiment of the present application is shown. The computer apparatus 1400 includes a Central Processing Unit (CPU) 1401, a system Memory 1404 including a Random Access Memory (RAM) 1402 and a Read-Only Memory (ROM) 1403, and a system bus 1405 connecting the system Memory 1404 and the Central Processing Unit 1401. The computer device 1400 also includes a basic Input/Output system (I/O system) 1406 that facilitates transfer of information between devices within the computer device, and a mass storage device 1407 for storing an operating system 1413, application programs 1414, and other program modules 1415.
The basic input/output system 1406 includes a display 1408 for displaying information and an input device 1409, such as a mouse, keyboard, etc., for user input of information. Wherein the display 1408 and input device 1409 are both connected to the central processing unit 1401 via an input-output controller 1410 connected to the system bus 1405. The basic input/output system 1406 may also include an input/output controller 1410 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 1410 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1407 is connected to the central processing unit 1401 through a mass storage controller (not shown) connected to the system bus 1405. The mass storage device 1407 and its associated computer-readable storage media provide non-volatile storage for the computer device 1400. That is, the mass storage device 1407 may include a computer-readable storage medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
Without loss of generality, the computer-readable storage media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable storage instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash Memory or other solid state Memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 1404 and mass storage device 1407 described above may collectively be referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1401, the one or more programs containing instructions for implementing the method embodiments described above, and the central processing unit 1401 executes the one or more programs to implement the methods provided by the various method embodiments described above.
According to various embodiments of the present application, the computer device 1400 may also be operated by connecting, through a network such as the Internet, to a remote server on the network. That is, the computer device 1400 may be connected to the network 1412 through the network interface unit 1411 connected to the system bus 1405, or may be connected to another type of network or a remote server system (not shown) using the network interface unit 1411.
The memory also includes one or more programs, stored in the memory, that include instructions for performing the steps performed by the computer device in the methods provided by the embodiments of the present application.
The embodiment of the present application further provides a computer-readable storage medium, which stores at least one instruction, where the at least one instruction is loaded and executed by the processor to implement the image processing method according to the above embodiments.
The embodiment of the present application further provides a computer program product, where at least one instruction is stored, and the at least one instruction is loaded and executed by the processor to implement the image processing method according to the above embodiments.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable storage medium. Computer-readable storage media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

1. An image processing method, characterized in that the method comprises:
acquiring an original image, wherein the original image comprises at least one target object;
inputting the original image into a first prediction model to obtain a first transparent channel image output by the first prediction model, wherein the first transparent channel image comprises a prediction transparency value corresponding to each pixel point in the original image;
inputting the first transparent channel image and the original image into a second prediction model to obtain a second transparent channel image output by the second prediction model, wherein the fineness of the second transparent channel image is higher than that of the first transparent channel image;
and segmenting the original image according to the second transparent channel image to obtain an image corresponding to the target object.
2. The method of claim 1, wherein prior to said obtaining the original image, the method further comprises:
acquiring a sample image, a sample labeling image and a sample segmentation image, wherein the sample labeling image is labeled with transparency values corresponding to all pixel points in the sample image, and the sample segmentation image is a binarization image obtained by performing binarization processing on the sample labeling image;
training the first prediction model according to the sample image and the sample segmentation image;
inputting the sample image into the first prediction model obtained by training to obtain a first sample transparent channel image output by the first prediction model;
and training the second prediction model according to the first sample transparent channel image, the sample annotation image and the sample image.
3. The method of claim 2, wherein training the second prediction model from the first sample transparent channel image, the sample annotation image, and the sample image comprises:
inputting the first sample transparent channel image and the sample image into a refinement network to obtain a second sample transparent channel image output by the refinement network;
inputting the sample image, the second sample transparent channel image and the sample labeling image into an edge gradient network to obtain an edge gradient loss corresponding to the second sample transparent channel image;
calculating the structural similarity loss and the matting loss corresponding to the second sample transparent channel image according to the second sample transparent channel image and the sample annotation image;
inputting the second sample transparent channel image and the sample labeling image into a connectivity difference network to obtain connectivity difference loss corresponding to the second sample transparent channel image;
training the refinement network according to the edge gradient loss, the connectivity difference loss, the matting loss and the structural similarity loss;
and determining the refined network obtained by training as the second prediction model.
4. The method of claim 3, wherein the inputting the sample image, the second sample transparent channel image and the sample labeling image into an edge gradient network to obtain an edge gradient loss corresponding to the second sample transparent channel image comprises:
inputting the sample image into a preset operator to obtain a sample gradient image corresponding to the sample image, wherein the preset operator is used for performing a first-order derivative operation on the sample image;
carrying out binarization and dilation-erosion operations on the sample labeling image to obtain a sample edge image, wherein the sample edge image is used for indicating a boundary area of a foreground image and a background image in the sample labeling image;
generating a sample edge gradient image according to the sample edge image and the sample gradient image, wherein the sample edge gradient image is used for indicating a boundary area of a foreground image and a background image in the sample image;
generating an edge transparent channel image according to the second sample transparent channel image and the sample edge image, wherein the edge transparent channel image is used for indicating a boundary area of a foreground image and a background image in the second sample transparent channel image;
and calculating to obtain the edge gradient loss according to the edge transparent channel image and the sample edge gradient image.
5. The method of claim 2, wherein the first prediction model comprises a multi-scale coding network, a feature pyramid network, a multi-scale decoding network, and a deep supervision network;
the training the first prediction model based on the sample image and the sample segmentation image includes:
inputting the sample image into the multi-scale coding network to obtain m first sample feature maps output by the multi-scale coding network, wherein the resolution and the channel number of different first sample feature maps are different, m is an integer greater than or equal to 2, and the multi-scale coding network is used for extracting features of the sample image;
inputting the m first sample feature maps into the feature pyramid network to obtain m second sample feature maps output by the feature pyramid network, wherein the number of channels of different second sample feature maps is the same and the resolution is different, and the feature pyramid network is used for processing the number of channels of the m first sample feature maps into a target number of channels;
inputting the m second sample feature maps into the multi-scale decoding network to obtain the first sample transparent channel image output by the multi-scale decoding network, wherein the multi-scale decoding network is used for performing addition and resolution conversion operations on the m second sample feature maps, and the resolution of the first sample transparent channel image is the same as that of the sample image;
inputting the m second sample feature maps into the deep supervision network to obtain m third sample transparent channel images output by the deep supervision network, wherein the deep supervision network is used for performing upsampling processing on the m second sample feature maps, different second sample feature maps correspond to different upsampling multiples, and the resolutions of the m third sample transparent channel images are the same as the resolution of the sample images;
performing binarization processing on the first sample transparent channel image and the m third sample transparent channel images to obtain a first sample segmentation image and m second sample segmentation images;
training the first prediction model according to the first sample segmentation image, the m second sample segmentation images and the sample segmentation image.
6. The method of claim 5, wherein inputting the original image into a first prediction model to obtain a first transparent channel image output by the first prediction model comprises:
inputting the original image into the multi-scale coding network to obtain n first feature maps output by the multi-scale coding network, wherein the resolution and the channel number of different first feature maps are different, and n is an integer greater than or equal to 2;
inputting the n first feature maps into the feature pyramid network to obtain n second feature maps output by the feature pyramid network, wherein the number of channels of different second feature maps is the same and the resolution is different, and the number of channels of the n second feature maps is the target number of channels;
inputting the n second feature maps into the multi-scale decoding network to obtain the first transparent channel image output by the multi-scale decoding network.
7. The method of claim 6, wherein the inputting the n first feature maps into the feature pyramid network to obtain n second feature maps output by the feature pyramid network comprises:
arranging the n first feature maps according to the resolution to form a feature pyramid, wherein the resolution of the first feature maps in the feature pyramid is in a negative correlation relation with the level of the first feature maps;
performing convolution processing on the nth first feature diagram to obtain an nth second feature diagram in response to the fact that the number of channels corresponding to the nth first feature diagram is the maximum number of channels;
and in response to the fact that the number of channels corresponding to the nth first feature map is not the maximum number of channels, performing convolution processing on the nth first feature map to obtain a fourth feature map, performing convolution and up-sampling processing on the (n + 1) th first feature map to obtain a fifth feature map, mixing the fourth feature map and the fifth feature map, and performing convolution processing to obtain the nth second feature map.
8. The method according to claim 6, wherein the inputting n second feature maps into the multi-scale decoding network to obtain the first transparent channel image output by the multi-scale decoding network comprises:
respectively processing the n second feature maps through convolution blocks to obtain n third feature maps, wherein the resolution corresponding to the n third feature maps is the same, different second feature maps correspond to different convolution blocks, and the number of the convolution blocks used by different second feature maps is different;
and adding, convolving and upsampling the n third feature maps to obtain the first transparent channel image.
9. An image processing apparatus, characterized in that the apparatus comprises:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring an original image, and the original image comprises at least one target object;
the first prediction module is used for inputting the original image into a first prediction model to obtain a first transparent channel image output by the first prediction model, wherein the first transparent channel image comprises a prediction transparency value corresponding to each pixel point in the original image;
the second prediction module is used for inputting the first transparent channel image and the original image into a second prediction model to obtain a second transparent channel image output by the second prediction model, and the fineness of the second transparent channel image is higher than that of the first transparent channel image;
and the segmentation processing module is used for carrying out segmentation processing on the original image according to the second transparent channel image to obtain an image corresponding to the target object.
10. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the image processing method according to any one of claims 1 to 8.
11. A computer-readable storage medium, having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the image processing method according to any one of claims 1 to 8.
CN202010099612.6A 2020-02-18 2020-02-18 Image processing method, device, equipment and storage medium Active CN111369581B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010099612.6A CN111369581B (en) 2020-02-18 2020-02-18 Image processing method, device, equipment and storage medium
PCT/CN2021/074722 WO2021164534A1 (en) 2020-02-18 2021-02-01 Image processing method and apparatus, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010099612.6A CN111369581B (en) 2020-02-18 2020-02-18 Image processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111369581A true CN111369581A (en) 2020-07-03
CN111369581B CN111369581B (en) 2023-08-08

Family

ID=71210735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010099612.6A Active CN111369581B (en) 2020-02-18 2020-02-18 Image processing method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111369581B (en)
WO (1) WO2021164534A1 (en)


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114418999B (en) * 2022-01-20 2022-09-23 哈尔滨工业大学 Retinopathy detection system based on lesion attention pyramid convolution neural network
CN114468977B (en) * 2022-01-21 2023-03-28 深圳市眼科医院 Ophthalmologic vision examination data collection and analysis method, system and computer storage medium
CN115052154B (en) * 2022-05-30 2023-04-14 北京百度网讯科技有限公司 Model training and video coding method, device, equipment and storage medium
CN115470830B (en) * 2022-10-28 2023-04-07 电子科技大学 Multi-source-domain-adaptation-based electroencephalogram signal cross-user alertness monitoring method
CN117314741B (en) * 2023-12-01 2024-03-26 成都华栖云科技有限公司 Green screen background matting method, device and equipment and readable storage medium


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10452956B2 (en) * 2017-09-29 2019-10-22 Here Global B.V. Method, apparatus, and system for providing quality assurance for training a feature prediction model
CN110148102B (en) * 2018-02-12 2022-07-15 腾讯科技(深圳)有限公司 Image synthesis method, advertisement material synthesis method and device
CN110782466B (en) * 2018-07-31 2023-05-02 阿里巴巴集团控股有限公司 Picture segmentation method, device and system
CN111369581B (en) * 2020-02-18 2023-08-08 Oppo广东移动通信有限公司 Image processing method, device, equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101980537A (en) * 2010-10-21 2011-02-23 北京航空航天大学 Object and fractal-based binocular three-dimensional video compression coding and decoding method
US20180253865A1 (en) * 2017-03-02 2018-09-06 Adobe Systems Incorporated Image matting using deep learning
CN109325954A (en) * 2018-09-18 2019-02-12 北京旷视科技有限公司 Image partition method, device and electronic equipment
CN109447994A (en) * 2018-11-05 2019-03-08 陕西师范大学 In conjunction with the remote sensing image segmentation method of complete residual error and Fusion Features
CN110689083A (en) * 2019-09-30 2020-01-14 苏州大学 Context pyramid fusion network and image segmentation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RAN Qing; FENG Jieqing: "Automatic matting algorithm for human foreground" *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021164534A1 (en) * 2020-02-18 2021-08-26 Oppo广东移动通信有限公司 Image processing method and apparatus, device, and storage medium
US20220319155A1 (en) * 2020-02-21 2022-10-06 Boe Technology Group Co., Ltd. Image Processing Method, Image Processing Apparatus, and Device
CN112001923A (en) * 2020-11-02 2020-11-27 中国人民解放军国防科技大学 Retina image segmentation method and device
CN114390285A (en) * 2021-12-30 2022-04-22 科大讯飞股份有限公司 Compression coding method, data packaging method, related device, equipment and medium
CN117422757A (en) * 2023-10-31 2024-01-19 安徽唯嵩光电科技有限公司 Fruit and vegetable size sorting method and device, computer equipment and storage medium
CN117422757B (en) * 2023-10-31 2024-05-03 安徽唯嵩光电科技有限公司 Fruit and vegetable size sorting method and device, computer equipment and storage medium
CN117252892A (en) * 2023-11-14 2023-12-19 江西师范大学 Automatic double-branch portrait matting model based on light visual self-attention network
CN117252892B (en) * 2023-11-14 2024-03-08 江西师范大学 Automatic double-branch portrait matting device based on light visual self-attention network

Also Published As

Publication number Publication date
WO2021164534A1 (en) 2021-08-26
CN111369581B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN111369581B (en) Image processing method, device, equipment and storage medium
US11200424B2 (en) Space-time memory network for locating target object in video content
CN111080628B (en) Image tampering detection method, apparatus, computer device and storage medium
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
CN110163188B (en) Video processing and method, device and equipment for embedding target object in video
CN113298815A (en) Semi-supervised remote sensing image semantic segmentation method and device and computer equipment
CN113706562B (en) Image segmentation method, device and system and cell segmentation method
CN112084859A (en) Building segmentation method based on dense boundary block and attention mechanism
CN113065551A (en) Method for performing image segmentation using a deep neural network model
CN113111716A (en) Remote sensing image semi-automatic labeling method and device based on deep learning
CN113988147A (en) Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN116311214A (en) License plate recognition method and device
CN113158856B (en) Processing method and device for extracting target area in remote sensing image
CN114565803A (en) Method, device and mechanical equipment for extracting difficult sample
Qin et al. Face inpainting network for large missing regions based on weighted facial similarity
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN116309612B (en) Semiconductor silicon wafer detection method, device and medium based on frequency decoupling supervision
CN113570509A (en) Data processing method and computer device
CN111741329A (en) Video processing method, device, equipment and storage medium
CN116071557A (en) Long tail target detection method, computer readable storage medium and driving device
CN115577768A (en) Semi-supervised model training method and device
CN114783042A (en) Face recognition method, device, equipment and storage medium based on multiple moving targets
CN116415019A (en) Virtual reality VR image recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant