CN113744280A - Image processing method, apparatus, device and medium - Google Patents

Image processing method, apparatus, device and medium Download PDF

Info

Publication number
CN113744280A
Authority
CN
China
Prior art keywords
feature
features
image
example segmentation
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110819544.0A
Other languages
Chinese (zh)
Inventor
董斌 (Dong Bin)
汪天才 (Wang Tiancai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd and Beijing Megvii Technology Co Ltd
Priority: CN202110819544.0A
Publication: CN113744280A
Legal status: Pending

Classifications

    • G06T 7/11 - Region-based segmentation (image analysis; segmentation; edge detection)
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045 - Combinations of networks (neural network architectures)
    • G06N 3/08 - Learning methods (neural networks)
    • G06T 3/4007 - Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation
    • G06T 3/4038 - Image mosaicing, e.g. composing plane images from plane sub-images
    • G06T 7/194 - Segmentation involving foreground-background segmentation
    • G06T 2200/32 - Indexing scheme involving image mosaicing
    • G06T 2207/10004 - Still image; Photographic image
    • G06T 2207/10024 - Color image
    • G06T 2207/20036 - Morphological image processing
    • G06T 2207/20081 - Training; Learning
    • G06T 2207/20084 - Artificial neural networks [ANN]
    • G06T 2207/20221 - Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the invention provides an image processing method, apparatus, device and medium, belonging to the technical field of image processing and aiming to improve the accuracy of instance segmentation. The method comprises the following steps: performing feature extraction on an image to be processed to obtain deep features and shallow features of a target region in the image to be processed; determining a transparency feature of the target region based on the deep features and the shallow features, and determining a first preliminary instance segmentation result of the image to be processed based on the deep features and the shallow features; and performing instance segmentation on the image to be processed according to the first preliminary instance segmentation result and the transparency feature to obtain an instance segmentation result of the target region.

Description

Image processing method, apparatus, device and medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image processing method, an image processing apparatus, an image processing device, and an image processing medium.
Background
In an instance segmentation task performed with a deep neural network model, the model is required to accurately segment targets of different categories, as well as different individuals of the same category, in an image. In the related art, a deep neural network model is generally used as follows: the model frames the target region of an object in the image and classifies the object, and instance segmentation is then performed within the selected target region to separate the object from the image.
Feature extraction for the target region typically proceeds as follows: features of target regions of different sizes (resolutions) are interpolated to a fixed size (i.e., a fixed resolution). This interpolation loses edge information of the object in the target region, which limits edge segmentation performance in the subsequent instance segmentation and therefore results in low instance segmentation accuracy.
Disclosure of Invention
In view of the above problems, embodiments of the present invention propose an image processing method, apparatus, device and medium that overcome, or at least partially solve, the above problems.
In order to solve the above problems, a first aspect of the present invention discloses an image processing method, including:
performing feature extraction on an image to be processed to obtain deep features and shallow features of a target region in the image to be processed;
determining a transparency feature of the target region based on the deep features and the shallow features, and determining a first preliminary instance segmentation result of the image to be processed based on the deep features and the shallow features;
and performing instance segmentation on the image to be processed according to the first preliminary instance segmentation result and the transparency feature to obtain an instance segmentation result of the target region.
Optionally, the determining the transparency feature of the target region based on the deep features and the shallow features includes:
processing the deep features to obtain a ternary map (trimap) corresponding to the target region;
stitching the ternary map and the shallow features to obtain a first stitching feature;
and processing the first stitching feature to obtain the transparency feature.
Optionally, the determining a first preliminary instance segmentation result of the image to be processed based on the deep features and the shallow features includes:
processing the deep features to obtain a ternary map corresponding to the target region;
stitching the ternary map and the shallow features to obtain a first stitching feature;
fusing the deep features and the first stitching feature to obtain a fusion feature corresponding to the deep features and the first stitching feature;
and performing instance segmentation based on the fusion feature to obtain the first preliminary instance segmentation result.
Optionally, the processing the deep features to obtain a ternary map corresponding to the target region includes:
performing instance segmentation based on the deep features to obtain a second preliminary instance segmentation result of the target region;
performing morphological processing on the second preliminary instance segmentation result to obtain a processed second preliminary instance segmentation result;
and performing ternarization on the processed second preliminary instance segmentation result to obtain the ternary map corresponding to the target region.
Optionally, the fusing the deep features and the first stitching feature to obtain a fusion feature corresponding to the deep features and the first stitching feature includes:
performing convolution processing on the deep features to obtain processed deep features;
performing convolution processing on the first stitching feature to obtain a processed first stitching feature;
and adding the feature values at corresponding positions of the processed deep features and the processed first stitching feature to obtain the fusion feature.
Optionally, the performing instance segmentation on the image to be processed according to the first preliminary instance segmentation result and the transparency feature to obtain an instance segmentation result of the target region includes:
stitching the first preliminary instance segmentation result and the transparency feature to obtain a second stitching feature;
and performing instance segmentation based on the second stitching feature to obtain the instance segmentation result of the target region.
Optionally, the instance segmentation of the image to be processed is implemented based on a neural network model;
accordingly, the neural network model is trained by:
performing feature extraction on a sample image to obtain deep features and shallow features of a sample region in the sample image;
inputting the deep features and the shallow features of the sample region into the neural network model to generate an instance segmentation result of the sample region;
determining a loss value of the neural network model according to the labels corresponding to the pixel points in the sample region and the instance segmentation result of the sample region, wherein the label corresponding to a pixel point indicates whether that pixel point belongs to the foreground or the background;
and updating the parameters of the neural network model according to the loss value to obtain an instance segmentation module.
Optionally, the method further comprises:
processing the deep features of the sample region to obtain a covariance corresponding to the sample region;
the determining a loss value of the neural network model according to the labels corresponding to the pixel points in the sample region and the instance segmentation result of the sample region then includes:
determining the loss value of the neural network model according to the covariance, the labels corresponding to the pixel points in the sample region and the instance segmentation result of the sample region, using a loss formula (reproduced in the original only as an image, Figure BDA0003171369420000031);
where y_u is the label of the u-th pixel point in the sample region, p_u is the predicted value of the u-th pixel point in the instance segmentation result of the sample region (i.e., the prediction of whether that pixel point belongs to the foreground or the background), N is the total number of pixel points in the sample region, and σ_i denotes the covariance.
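The formula itself is not reproduced in the source. The following is only a sketch of one plausible form that is consistent with the symbol definitions above and with the later statement that a larger covariance should reduce a pixel's contribution to the loss (a covariance-down-weighted binary cross-entropy); it is an assumption, not the patent's exact expression:

```latex
L \;=\; -\frac{1}{N}\sum_{u=1}^{N}\left(1-\sigma_i^{2}\right)
\left[\, y_u \log p_u + \left(1-y_u\right)\log\!\left(1-p_u\right) \right]
```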
In a second aspect of the embodiments of the present application, there is provided an image processing apparatus, including:
a feature extraction module, configured to perform feature extraction on an image to be processed to obtain deep features and shallow features of a target region in the image to be processed;
a feature enhancement module, configured to determine a transparency feature of the target region based on the deep features and the shallow features;
a first segmentation module, configured to determine a first preliminary instance segmentation result of the image to be processed based on the deep features and the shallow features;
and a second segmentation module, configured to perform instance segmentation on the image to be processed according to the first preliminary instance segmentation result and the transparency feature to obtain an instance segmentation result of the target region.
In a third aspect of the embodiments of the present invention, an electronic device is further disclosed, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the image processing method described in the first aspect of the embodiments of the present invention.
In a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is further disclosed, which stores a computer program that causes a processor to execute the image processing method according to the first aspect of the embodiments of the present invention.
Embodiments of the invention have the following advantages:
In the embodiments of the invention, feature extraction can be performed on an image to be processed to obtain deep features and shallow features of a target region in the image to be processed; a transparency feature of the target region is determined based on the deep features and the shallow features, and a first preliminary instance segmentation result of the image to be processed is determined based on the deep features and the shallow features; and instance segmentation is performed on the image to be processed according to the first preliminary instance segmentation result and the transparency feature to obtain an instance segmentation result of the target region.
With this technical solution, on the one hand, because shallow features contain rich edge information, a fusion feature is obtained from the deep and shallow features of the target region; the shallow features supplement the edge information of the object in the target region, and a preliminary instance segmentation is then performed on the target region with the supplemented edge information to obtain the first preliminary instance segmentation result. On the other hand, the transparency feature of the target region, also obtained from the deep and shallow features, reflects the transparency of the image region framed by the target region in the Alpha channel. When instance segmentation is performed on the image to be processed based on the transparency feature and the first preliminary instance segmentation result, the transparency feature further reinforces the segmentation of edges in the first preliminary result, so that the complete edge information of the object in the target region is retained as far as possible. Performing instance segmentation while retaining this complete edge information improves edge segmentation performance and therefore the accuracy of instance segmentation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flowchart of the steps of an image processing method in an embodiment of the invention;
FIG. 2 is a flowchart of the training steps of an instance segmentation module in an embodiment of the invention;
FIG. 3 is a schematic diagram of a neural network model in an embodiment of the invention;
FIG. 4 is a block diagram of an instance segmentation module in an embodiment of the invention;
FIG. 5 is a block diagram of yet another instance segmentation module in an embodiment of the invention;
FIG. 6 is a block diagram of an image processing apparatus in an embodiment of the invention;
FIG. 7 is a block diagram of an electronic device in an embodiment of the invention.
Detailed Description
In order to make the above objects, features and advantages of the present invention comprehensible, embodiments accompanied by figures are described in detail below so as to describe the technical solutions in the embodiments of the present invention clearly and completely. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
In the related art, instance segmentation is performed on features of a fixed size (i.e., a fixed resolution). Because such fixed-size features do not retain the edge information of the object in the target region, edge segmentation performance is limited and the accuracy of instance segmentation is low. To solve this technical problem, the embodiments of the present invention provide the following technical concept: on the one hand, exploiting the rich edge information carried by shallow features, the deep features and shallow features of the target region are fused to obtain a fusion feature supplemented with edge information; on the other hand, the transparency feature of the image of the target region in the Alpha channel is used to reinforce the edge information contained in the fusion feature, so that the complete edge information of the object in the target region is retained as far as possible. Performing instance segmentation while retaining this complete edge information improves edge segmentation performance and therefore the accuracy of instance segmentation.
Referring to FIG. 1, a flowchart of the steps of an image processing method according to an embodiment of the present application is shown; the method may be applied to a terminal device or a server.
A neural network model (for example, a deep neural network model) may run on the terminal device. The neural network model is obtained through training, and instance segmentation of objects in an image is performed based on this model. In one embodiment, an instance segmentation module in the neural network model performs the instance segmentation of objects in the image; accordingly, the instance segmentation module mentioned in the embodiments of the present application may be embedded as an instance segmentation branch in the deep neural network model, providing the mask map of an object in the image so that the branch can segment the object from the image.
In a further embodiment, the neural network model in the terminal device may comprise a plurality of modules, for example at least a positioning module, a classification module and an instance segmentation module, with the output of the positioning module and the output of the classification module each connected to the instance segmentation module. The instance segmentation module is used to generate a mask map of the object according to the deep features and shallow features of the target region, and to perform instance segmentation according to that mask map.
Of course, in some embodiments, the positioning module, the classification module or the instance segmentation module may also be a separate neural network model; the embodiments of the present invention do not limit the specific structure of the neural network model used.
As shown in fig. 1, an image processing method according to an embodiment of the present application may specifically include the following steps:
step S101: and performing feature extraction on the image to be processed to obtain deep features and shallow features of the target region in the image to be processed.
The target area is an area where an object in the image to be processed is located.
In this embodiment, the image to be processed may include one or more objects, and when a plurality of objects exist in the image to be processed, the categories of the plurality of objects may be the same or different, for example, the plurality of objects may be a plurality of animals including cats and dogs, and cats may include cats of various categories. The example segmentation of the image to be processed refers to: the method comprises the steps of segmenting a plurality of objects in an image to be processed from the image to be processed, and determining the category of each segmented object.
In the specific implementation, the object in the image to be processed can be positioned, so that the target area where the object is located can be obtained, and the target area can represent the position of the object in the image to be processed. Then, feature extraction may be performed on the image region framed by the target region, and the extracted features are interpolated to a preset size to obtain deep features of the target region, where the deep features include global information of an object in the target region. When the features of the image of the target area are extracted, the image of the target area may be subjected to multiple scales of processing, for example, downsampling processing of multiple scales is performed, the feature maps obtained by each downsampling processing may be fused, and the fused features may be interpolated to a preset size, so as to obtain deep features of the target area.
When a target area where a target object is located is obtained, shallow features of an image area framed and selected by the target area can be obtained, wherein the shallow features retain richer detail information of the object in the target area. In a possible implementation manner, an RGB image corresponding to an image area framed and selected by the target area may be obtained, and then feature extraction is performed on the RGB image to obtain a shallow feature of the target area. Wherein, RGB respectively represents three basic color channels of red, green and blue. The RGB image may be a color map of an image region framed by the target region, and is an image displayed in an RGB color mode.
In the embodiments of the present application, the shallow feature may also be referred to as a shallow feature map, and the deep feature may also be referred to as a deep feature map, where the dimensions of the shallow feature map and the deep feature map may be the same.
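As a concrete illustration of step S101, the following is a minimal sketch assuming PyTorch and torchvision; the FPN stride (spatial_scale=0.25), the output size and the use of a resized RGB crop as the shallow feature are illustrative assumptions, not the patent's exact procedure.

```python
import torch
import torch.nn.functional as F
import torchvision.ops as ops

def extract_region_features(fpn_feature_map, image, box, out_size=14):
    """Return (deep, shallow) features for one target region.

    fpn_feature_map: fused multi-scale features, shape (1, C, H/4, W/4) (assumption)
    image:           the RGB image to be processed, shape (1, 3, H, W)
    box:             target region as (x1, y1, x2, y2) in image coordinates
    """
    boxes = [torch.tensor([box], dtype=torch.float32)]
    # Deep features: ROI pooling/align interpolates the region to a fixed (preset) size.
    deep = ops.roi_align(fpn_feature_map, boxes, output_size=out_size,
                         spatial_scale=0.25, sampling_ratio=2)
    # Shallow features: crop the RGB region and resize it to the same spatial size;
    # here the resized crop itself stands in for the extracted shallow feature map.
    x1, y1, x2, y2 = [int(v) for v in box]
    rgb_crop = image[:, :, y1:y2, x1:x2]
    shallow = F.interpolate(rgb_crop, size=(out_size, out_size),
                            mode='bilinear', align_corners=False)
    return deep, shallow
```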
Step S102: determining a transparency feature of the target region based on the deep feature and the shallow feature, and determining a first preliminary instance segmentation result of the image to be processed based on the deep feature and the shallow feature.
In this embodiment, the deep features include global information of the object in the target region, and the shallow features retain richer detailed information of the object in the target region, so that the shallow features can be used to supplement the detailed information lost by the deep features. Specifically, the shallow feature may be used to supplement edge information in the deep feature, and then initial instance segmentation is performed based on the deep feature supplemented with the edge information, so as to obtain a first preliminary instance segmentation result.
In this embodiment, the edge information included in the fusion feature may also be strengthened by using the transparency feature, so that the complete edge information of the object in the target region is retained as much as possible. The transparency features, which may also be referred to as Alpha feature maps, may reflect the transparency of the image of the target region in Alpha channels. If the image to be processed carries the Alpha characteristic diagram, the transparency characteristic of the target area can be directly obtained from the Alpha characteristic diagram of the image to be processed. If the image to be processed does not have the Alpha characteristic diagram, the transparency characteristic can be generated by using the transparency characteristic generation method provided by the embodiment of the application. Specifically, the transparency characteristic of the target region is obtained based on the deep layer characteristic and the shallow layer characteristic of the target region.
Step S103: and carrying out example segmentation on the image to be processed according to the first preliminary example segmentation result and the transparency characteristic to obtain an example segmentation result of the target area.
In this embodiment, after the first preliminary example segmentation result is obtained, the first preliminary example segmentation result may be further enhanced by using a transparency feature, specifically, the example segmentation may be performed on the image to be processed according to the transparency feature and the first preliminary example segmentation result, so as to obtain an example segmentation result, where the example segmentation result may be understood as a result that each pixel point included in the target region in the image to be processed belongs to a foreground or a background.
Specifically, the transparency feature and the first preliminary example segmentation result may be fused to obtain a mask map of the target region. The pixel value of a pixel point of the mask map can represent the probability that the pixel point belongs to the foreground and the background. Wherein, the foreground generally represents the object and the background generally represents the environment in which the object is located.
Therefore, the generated mask image can clearly distinguish the foreground and the background of the target area, and the object can be accurately separated from the target area.
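The following minimal sketch, assuming PyTorch, shows how steps S102 and S103 fit together; `trimap_head`, `alpha_head`, `fusion_unit` and the two point convolutions are hypothetical stand-ins for the units described in the later embodiments.

```python
import torch

def segment_region(deep, shallow, trimap_head, alpha_head, fusion_unit,
                   prelim_point_conv, final_point_conv):
    trimap = trimap_head(deep)                      # ternary map from the deep features
    first_stitch = torch.cat([trimap, shallow], 1)  # supplement edge detail from shallow features
    alpha = alpha_head(first_stitch)                # transparency (Alpha) feature
    fused = fusion_unit(deep, first_stitch)         # fusion feature
    prelim_mask = prelim_point_conv(fused)          # first preliminary segmentation result
    second_stitch = torch.cat([prelim_mask, alpha], 1)
    final_mask = final_point_conv(second_stitch)    # instance segmentation result
    return final_mask
```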
With this technical solution, on the one hand, because shallow features contain rich edge information, instance segmentation of the target region based on its deep and shallow features yields the first preliminary instance segmentation result; the shallow features thus supplement the edge information of the object in the target region and give a more accurate preliminary result. On the other hand, the transparency feature of the target region, also obtained from the deep and shallow features, reflects the transparency in the Alpha channel of the image region framed by the target region; when instance segmentation is performed based on the transparency feature and the first preliminary instance segmentation result, the transparency feature further reinforces the segmentation of the edges in the preliminary result, so that the complete edge information of the object is retained as far as possible, edge segmentation performance is improved and the accuracy of instance segmentation is improved. In addition, because both the first preliminary instance segmentation result and the transparency feature are obtained from the shallow and deep features, those features are used several times on the way to the final instance segmentation result, achieving a stepwise reinforcement of the instance segmentation and therefore an accurate segmentation.
In one embodiment of the present application, the transparency feature may be obtained as follows: the deep features are processed to obtain a ternary map corresponding to the target region; the ternary map and the shallow features are then stitched (i.e., a concat operation) to obtain a first stitching feature; finally, the first stitching feature is processed to obtain the transparency feature.
In this embodiment, processing the deep features may mean applying convolutions to them; for example, two 3x3 convolutions and one 1x1 convolution may be applied to the deep features to generate a coarse mask map (representing foreground/background), which is then ternarized to generate the ternary map.
The pixel values in the ternary map indicate whether a pixel point belongs to the foreground region, the background region or an unknown region; that is, three pixel values are used to represent the foreground, background and unknown regions of the target region.
In some embodiments, the ternary map may be obtained as follows: instance segmentation is performed based on the deep features to obtain a second preliminary instance segmentation result of the target region; morphological processing is applied to the second preliminary instance segmentation result to obtain a processed second preliminary instance segmentation result; and the processed result is ternarized to obtain the ternary map corresponding to the target region.
In this embodiment, performing instance segmentation based on the deep features may mean applying two 3x3 convolutions to the deep features to obtain the second preliminary instance segmentation result of the target region, which can be regarded as a coarse mask map CM_i. Morphological processing of this result may mean applying dilation and erosion operations to it (for example, with a dilation kernel of (3, 3) and 20 iterations) to obtain the processed second preliminary instance segmentation result, denoted Unknown_i. The processed result is then ternarized to obtain the ternary map of the target region, denoted Tr_i, with Tr_i = 255*CM_i + 128*(Unknown_i - CM_i). Ternarization here means deciding, for each pixel point of the second preliminary instance segmentation result, whether it belongs to the foreground, the background or the unknown region.
The second preliminary instance segmentation result can be understood as the result of instance segmentation based on the deep features alone, before the detail information contained in the shallow features has been supplemented.
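A minimal sketch of this trimap construction, assuming OpenCV and NumPy; the kernel (3, 3) and 20 iterations follow the example values above, while the use of the eroded mask as the confident foreground is an assumption.

```python
import cv2
import numpy as np

def make_trimap(coarse_mask, kernel=(3, 3), iterations=20):
    """coarse_mask (CM_i): uint8 array with 1 for foreground, 0 for background."""
    k = np.ones(kernel, np.uint8)
    # Morphological processing: dilation widens the band of pixels whose
    # foreground/background assignment is uncertain (Unknown_i).
    unknown = cv2.dilate(coarse_mask, k, iterations=iterations)
    cm = cv2.erode(coarse_mask, k, iterations=iterations)  # confident foreground (assumption)
    # Tr_i = 255 * CM_i + 128 * (Unknown_i - CM_i): 255 = foreground, 128 = unknown band, 0 = background
    trimap = 255 * cm.astype(np.int32) + 128 * (unknown.astype(np.int32) - cm.astype(np.int32))
    return trimap.astype(np.uint8)
```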
After the ternary map is obtained, it may be stitched with the shallow features to obtain the first stitching feature. Specifically, the ternary map and the shallow features may be concatenated and then passed through two 3x3 convolutions, so that the detail information contained in the shallow features is supplemented into the ternary map; the first stitching feature is then passed through a convolution, specifically a 1x1 convolution, to obtain the transparency feature.
In this way, processing the deep features yields the ternary map, and stitching the ternary map with the shallow features supplements the detail information contained in the shallow features into it, giving a transparency feature with reinforced detail information, i.e., a more accurate transparency feature. The shallow features thus perform a first reinforcement of the second preliminary instance segmentation result.
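A minimal sketch of the transparency-feature branch just described, assuming PyTorch; the channel counts, the ReLU activations and the final sigmoid are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AlphaHead(nn.Module):
    def __init__(self, in_ch, mid_ch=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)   # first 3x3 convolution
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, padding=1)  # second 3x3 convolution
        self.point = nn.Conv2d(mid_ch, 1, 1)                  # 1x1 convolution -> Alpha feature map
        self.act = nn.ReLU(inplace=True)

    def forward(self, first_stitch):             # first_stitch = cat(ternary map, shallow features)
        x = self.act(self.conv1(first_stitch))
        x = self.act(self.conv2(x))
        return torch.sigmoid(self.point(x))      # transparency values in [0, 1] (assumption)
```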
In yet another embodiment, the deep features may be processed to obtain the ternary map corresponding to the target region, and the ternary map and the shallow features stitched (concat operation) to obtain the first stitching feature; the deep features and the first stitching feature are then fused to obtain the fusion feature, and instance segmentation is performed based on the fusion feature to obtain the first preliminary instance segmentation result.
The ternary map and the first stitching feature may be obtained as described in the embodiments above. In this embodiment, after the first stitching feature is obtained, the deep features and the first stitching feature are fused to obtain the fusion feature. Because the deep features are used twice in this process (once when the first stitching feature is obtained, and again in the fusion itself), the global information they contain is fully reinforced, so the fusion feature contains both fully reinforced global information and the supplemented edge information.
In another embodiment, when the deep features and the first stitching feature are fused to obtain the fusion feature, a convolution may first be applied to the deep features to obtain processed deep features; a convolution is applied to the first stitching feature to obtain a processed first stitching feature; and the feature values at corresponding positions of the processed deep features and the processed first stitching feature are added to obtain the fusion feature.
In this embodiment, the convolution applied to the deep features may consist of two 3x3 convolutions, performing feature extraction at two scales; this reinforces the high-level semantic information contained in the deep features.
In practice, the convolution applied to the first stitching feature may consist of two successive 3x3 convolutions, which can also adjust the size of the first stitching feature to match that of the processed deep features.
The processed deep features and the processed first stitching feature are then added at the same positions (the same spatial positions), i.e., feature addition is performed between them, to obtain the fusion feature. In this way, after the edge information contained in the shallow features has been supplemented into the deep features to form the first stitching feature, the information of the first stitching feature is added to the high-level semantic information of the deep features, so that the edge information is incorporated into the deep features and reinforced.
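A minimal sketch of this fusion unit, assuming PyTorch; the two-branch 3x3 convolutions and the element-wise addition follow the description above, while the output channel count and the ReLU activations are assumptions.

```python
import torch.nn as nn

class FusionUnit(nn.Module):
    def __init__(self, deep_ch, stitch_ch, out_ch=128):
        super().__init__()
        self.deep_branch = nn.Sequential(        # two 3x3 convolutions over the deep features
            nn.Conv2d(deep_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.stitch_branch = nn.Sequential(      # two 3x3 convolutions over the first stitching feature
            nn.Conv2d(stitch_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, deep, first_stitch):
        # feature addition at corresponding spatial positions
        return self.deep_branch(deep) + self.stitch_branch(first_stitch)
```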
In this embodiment, when the instance segmentation result of the target region is to be obtained, instance segmentation may first be performed based on the fusion feature to obtain the first preliminary instance segmentation result corresponding to the target region.
Performing instance segmentation based on the fusion feature may mean applying a point convolution to the fusion feature to obtain the first preliminary instance segmentation result, which may be called the initial mask map.
The first preliminary instance segmentation result is therefore obtained by segmenting features that have been supplemented with edge information, and can be understood as an instance segmentation result obtained after the edge information has been supplemented. Because the first stitching feature is built on the second preliminary instance segmentation result (the ternary map is needed to obtain the first stitching feature, and the second preliminary instance segmentation result is needed to obtain the ternary map), obtaining the fusion feature from the deep features and the first stitching feature and then the first preliminary instance segmentation result can be understood as reinforcing the second preliminary instance segmentation result with the first stitching feature and the deep features, i.e., a second reinforcement of the instance segmentation.
In this embodiment, both the fusion feature and the transparency feature merge information from the shallow and deep features, so both the global information of the object contained in the deep features and the edge information contained in the shallow features are reinforced. When instance segmentation is performed on the image to be processed based on the fusion feature, and on the target region based on the first preliminary instance segmentation result and the transparency feature, the fully reinforced global and edge information of the object can therefore be exploited.
In yet another embodiment, when instance segmentation is performed on the image to be processed according to the first preliminary instance segmentation result and the transparency feature, the two are stitched (concat operation) to obtain a second stitching feature, and instance segmentation is performed based on the second stitching feature to obtain the instance segmentation result of the target region.
Stitching the first preliminary instance segmentation result with the transparency feature in this way allows the transparency feature to further reinforce the first preliminary instance segmentation result.
The second stitching feature may then be point-convolved to obtain the final mask map, i.e., the instance segmentation result of the target region.
The final instance segmentation result is thus obtained by further reinforcing the first preliminary instance segmentation result with the transparency feature, which can be regarded as a third reinforcement of the instance segmentation. The target region therefore undergoes several rounds of segmentation and reinforcement, continuously improving the fineness and hence the accuracy of the instance segmentation.
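A minimal sketch of the two point-convolution heads described above, assuming PyTorch; the sigmoid activations and channel counts are assumptions.

```python
import torch
import torch.nn as nn

class MaskHeads(nn.Module):
    def __init__(self, fused_ch):
        super().__init__()
        self.prelim_point_conv = nn.Conv2d(fused_ch, 1, 1)  # fusion feature -> initial mask map
        self.final_point_conv = nn.Conv2d(2, 1, 1)          # (initial mask, alpha) -> final mask map

    def forward(self, fused, alpha):
        prelim_mask = torch.sigmoid(self.prelim_point_conv(fused))
        second_stitch = torch.cat([prelim_mask, alpha], dim=1)   # second stitching feature
        final_mask = torch.sigmoid(self.final_point_conv(second_stitch))
        return prelim_mask, final_mask
```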
In some embodiments, the instance segmentation of the image to be processed described herein may be implemented by a neural network module.
Specifically, the instance segmentation module may be configured to perform the operations of steps S102 to S103: determining the transparency feature of the target region based on the deep and shallow features, determining the first preliminary instance segmentation result of the image to be processed based on the deep and shallow features, and performing instance segmentation on the image to be processed according to the fusion feature and the transparency feature to obtain the instance segmentation result of the target region.
Of course, in another embodiment, the instance segmentation module may be configured to perform the operations of steps S101 to S103.
Accordingly, referring to FIG. 2, a flowchart of the steps of training a neural network model to obtain an instance segmentation module is shown; specifically, the instance segmentation module can be obtained by training with the following steps:
step S201: and performing feature extraction on the sample image to obtain deep features and shallow features of the sample region in the sample image.
In this embodiment, the process of extracting the features of the sample image may be performed with reference to the step S101, which is not described herein again. Where the sample image may refer to an image that has been instance segmented.
The sample region may refer to an image region in the sample image where the sample object is located.
Step S202: and inputting the deep features and the shallow features of the sample region into a neural network model to generate an example segmentation result of the sample region.
In this embodiment, the deep-layer features and the shallow-layer features of the sample region may be used as training samples and input to the neural network model, so as to obtain an example segmentation result of the sample region output by the neural network model.
The process of generating the example segmentation result of the sample region by the neural network model may refer to the process described in the above embodiments, and is not described herein again.
Step S203: determine a loss value of the neural network model according to the labels corresponding to the pixel points in the sample region and the instance segmentation result of the sample region.
The label corresponding to a pixel point indicates whether that pixel point belongs to the foreground or the background.
In this embodiment, for example, a label of 1 indicates that the pixel point belongs to the foreground and a label of 0 that it belongs to the background.
The loss value of the neural network model can be calculated from the labels of the pixel points in the sample region and the pixel values of those points in the instance segmentation result of the sample region; the loss value reflects the accuracy of the model's instance segmentation.
Step S204: update the parameters of the neural network model according to the loss value to obtain the instance segmentation module.
In this embodiment, the parameters of the neural network model are updated according to the loss value, and training stops when a preset number of updates is reached or the model converges, yielding the instance segmentation module.
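A minimal sketch of the training procedure of steps S201 to S204, assuming PyTorch; the data loader, the feature extractor, the optimizer choice and the stopping criterion are illustrative placeholders.

```python
import torch

def train(model, feature_extractor, dataloader, loss_fn, max_updates=10000, lr=1e-3):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    step = 0
    for sample_image, boxes, pixel_labels in dataloader:
        deep, shallow = feature_extractor(sample_image, boxes)  # step S201
        pred = model(deep, shallow)                             # step S202
        loss = loss_fn(pred, pixel_labels)                      # step S203
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                        # step S204
        step += 1
        if step >= max_updates:   # stop after a preset number of updates
            break
    return model
```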
Accordingly, in some embodiments, the deep features of the sample region may also be processed to obtain the covariance of the sample region.
The deep features of the sample region may be processed during the training of the neural network model to obtain the covariance corresponding to the sample region, or they may be processed in advance; this is not limited here.
In this embodiment, the processing of the deep features of the sample region may be: two 3x3 convolutions and one 1x1 convolution are applied in turn to the deep features of the sample region, and the output is normalized with a sigmoid function to generate a variance prediction, i.e., the covariance; its value represents the reliability of the instance segmentation result output by the neural network model.
Accordingly, when determining the loss value of the neural network model, the loss value can be determined from the covariance, the labels of the pixel points in the sample region and the instance segmentation result of the sample region according to the following formula;
(The loss formula is reproduced in the original only as an image, Figure BDA0003171369420000141.)
where y_u is the label of the u-th pixel point in the sample region, p_u is the predicted value of the u-th pixel point in the instance segmentation result of the sample region (i.e., the prediction of whether that pixel point belongs to the foreground or the background), N is the total number of pixel points in the sample region, and σ_i denotes the covariance.
Here σ_i² ∈ [0, 1], and y_u is 0 or 1, where 1 represents the foreground and 0 the background.
In this embodiment, the larger the covariance, the less reliable the instance segmentation result output by the neural network model, and the smaller its contribution to the loss function should be; conversely, the smaller the covariance, the more reliable the result, and the larger its contribution should be.
With this technical solution, during training or updating, the neural network model can be guided to pay more attention to features that are easy to misjudge (i.e., features for which it is uncertain whether they are foreground or background, which can be understood as feature points that cause the loss to oscillate), which reinforces the significance of the edge information. Modelling the uncertainty of the edge information of the sample target region with the above formula guides the neural network model to pay more attention to such easily misjudged features.
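A minimal Python sketch of such a covariance-weighted pixel loss, assuming the same down-weighting idea (larger σ_i², smaller contribution); the exact weighting is an assumption, since the original formula is only available as an image in the source.

```python
import torch
import torch.nn.functional as F

def covariance_weighted_loss(pred, labels, sigma_sq):
    """pred, labels: (N,) foreground probabilities in [0, 1] and 0/1 labels for the region's pixels.
    sigma_sq: predicted covariance in [0, 1] for the region (scalar tensor)."""
    bce = F.binary_cross_entropy(pred, labels.float(), reduction='none')
    # larger covariance -> less reliable prediction -> smaller contribution to the loss
    return ((1.0 - sigma_sq) * bce).mean()
```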
The edge features of the target region referred to in the embodiments of the present application can be understood as the features on the contour of the object in the target region. The more complete the features on the contour of the target object, the more accurately the foreground and the background can be separated, making the boundary between them more distinct.
Next, an image processing method according to an embodiment of the present application will be described with reference to a specific example.
Referring to FIG. 3, the neural network model in this example is shown; it specifically includes a Region of Interest (ROI) pooling module and an instance segmentation module. The shallow and deep features are obtained by the ROI pooling module, and the instance segmentation module performs instance segmentation based on them. The ROI pooling module may be independent of the instance segmentation module or located at its input; when independent, it may be located in the positioning module or connected between the positioning module and the instance segmentation module.
As shown in FIG. 3, the ROI pooling module is independent of the instance segmentation module; its input is connected to the output of the positioning module, which comprises a Feature Pyramid Network (FPN) module and a backbone network module, and its output is connected to the input of the instance segmentation module.
Accordingly, in some embodiments, the backbone network and FPN modules perform multi-scale feature extraction and fusion on the image to be processed; the features output by the FPN module are input into the ROI pooling module to obtain the deep features, while the color information extracted from the image to be processed is input into the ROI pooling module and ROI-pooled to obtain the color features, i.e., the shallow features. The shallow features and deep features of the target region can therefore each be obtained by the ROI pooling module.
In this embodiment, the ROI pooling module may also be referred to as the ROI pooling layer. The image of the target region in which the object lies is subjected to feature extraction at multiple scales, for example by downsampling at several scales to obtain multi-scale feature maps; these feature maps of the target region are fused and input into the ROI pooling module, which interpolates the fused features of the target region to a preset size and accordingly outputs the deep features of the target region.
The image region framed by the target region can be cropped from the image to be processed and input into the ROI pooling module. Because the framed target region generally retains the color information of the original image, the input image region can be regarded as an RGB image; the ROI pooling module then performs feature extraction on it to obtain an RGB feature map, also called the shallow features.
For example, as shown in FIG. 3, the FPN module and the backbone network module perform feature extraction at various scales on the input image to be processed, f0. The backbone network module repeatedly convolves and downsamples the image at multiple scales, each of its layers being connected to the corresponding network layer of the FPN module; the network layers of the FPN module in turn downsample the target region in which the target object lies and output multi-scale feature maps of that region, such as feature maps f1, f2, f3 and f4. Feature maps f1 to f4 are then fused and the fused map is input to the ROI pooling module, which outputs the deep features f_i of the target region. Meanwhile, the image of the target region in the image to be processed f0 may be compressed to a preset size, the color features extracted from it and input to the ROI pooling module, which outputs the shallow features r_i of the target region.
With the ROI pooling module of the embodiments of the present application, the input image region and its multi-scale feature map can be processed by the same module, which broadens the module's scope of application, avoids having to add a separate module for feature extraction of the image region and reduces the complexity of the overall network structure. The instance segmentation module and the ROI pooling module also share weights, so no large amount of extra computation is introduced and the efficiency of instance segmentation is improved.
Referring to fig. 4, a schematic structural diagram of an example segmentation module according to an embodiment of the present application is shown, and in combination with fig. 4, how the example segmentation module of the present application generates a mask map of a target region according to a deep feature and a shallow feature, and performs example segmentation.
As shown in fig. 4, in the present embodiment, the instance splitting module at least includes: the device comprises an initial example segmentation module, an Alpha characteristic graph generation module and a mask graph generation module:
the initial example segmentation module is used for processing deep features of the target region and determining a second initial example segmentation result of the image to be processed;
the Alpha characteristic map generation module is used for processing deep characteristics of the target area and shallow characteristics of the target area to obtain transparency characteristics of the target area;
the mask map generating module is used for processing the first preliminary example segmentation result and the transparency feature to generate a mask map of the target area so as to complete example segmentation of the target area and obtain an example segmentation result.
The initial instance segmentation module can comprise a feature fusion unit and a plurality of convolution units, the feature fusion unit can perform convolution processing on deep features of the target area for multiple times and fuse the shallow features and the deep features after the convolution processing for multiple times so as to obtain fusion features, and the convolution units can obtain a first initial instance segmentation result of the image to be processed based on the fusion features.
The Alpha feature map generation module can splice the deep features and the shallow features, and its convolution units perform convolution on the spliced feature map multiple times to obtain the transparency feature, which fully represents the transparency of the image of the target region in the Alpha channel.
The plurality of convolution units may include a fifth convolution unit, a sixth convolution unit, and a second point convolution unit. Specifically, as shown in fig. 4, the mask map generating module may further include: a splicing unit and a third point convolution unit;
the fifth convolution unit and the sixth convolution unit may use 3 × 3 convolution kernels to perform convolution processing on the fusion feature, and the second point convolution unit may use a 1 × 1 convolution kernel to perform convolution processing on the feature map output by the sixth convolution unit, so as to obtain the first preliminary example segmentation result of the target region, which may then be input to the splicing unit.
The splicing unit is used for splicing the first preliminary example segmentation result and the transparency feature to obtain a second splicing feature.
The third point convolution unit is configured to perform point convolution processing on the second splicing feature to generate the mask map of the target region, that is, the example segmentation result of the target region. Here, point convolution processing means performing convolution processing on the second splicing feature with a 1 × 1 convolution kernel.
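A minimal sketch of these units, assuming 256-channel fusion features and single-channel preliminary and Alpha maps (the channel widths are assumptions of this sketch; the description fixes only the 3 × 3 and 1 × 1 kernel sizes and the roles of the units):

```python
import torch
import torch.nn as nn

conv5 = nn.Conv2d(256, 256, kernel_size=3, padding=1)   # fifth convolution unit (3x3)
conv6 = nn.Conv2d(256, 256, kernel_size=3, padding=1)   # sixth convolution unit (3x3)
point2 = nn.Conv2d(256, 1, kernel_size=1)               # second point convolution unit (1x1)
point3 = nn.Conv2d(2, 1, kernel_size=1)                 # third point convolution unit (1x1)

fusion = torch.randn(1, 256, 28, 28)                    # fusion feature from the feature fusion unit
alpha = torch.randn(1, 1, 28, 28)                       # transparency feature from the Alpha module

preliminary = point2(conv6(conv5(fusion)))              # first preliminary result, (1, 1, 28, 28)
second_splice = torch.cat([preliminary, alpha], dim=1)  # splicing unit: concat along channels
mask = point3(second_splice)                            # mask map of the target region, (1, 1, 28, 28)
```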
Referring to fig. 5, a schematic structural diagram of the example segmentation module in this example is shown. Fig. 5 shows the internal structure of each module in the example segmentation module in more detail. As indicated by the dashed-line box in fig. 5, the initial example segmentation module and the Alpha feature map generation module share some units, specifically: a first convolution unit, a second convolution unit, a ternary map generation unit, and a first splicing unit, which are sequentially connected.
In an embodiment, the ternary map generation unit is connected after the second convolution unit and is configured to process the feature map output by the second convolution unit to generate a ternary map of the target region, where the pixel value of each pixel point in the ternary map represents whether that pixel point in the target region belongs to the foreground, the background, or an unknown region.
The first splicing unit is used for splicing the ternary map and the shallow feature of the target area to obtain a first splicing feature.
In this way, the feature fusion unit in the initial instance segmentation module may be configured to perform fusion processing on the deep feature and the first stitching feature to obtain the fusion feature. Correspondingly, the Alpha feature map generation module further comprises a third convolution unit and a fourth convolution unit, and the third convolution unit and the fourth convolution unit are used for processing the first splicing feature to obtain the transparency feature.
The first convolution unit and the second convolution unit may both use 3 × 3 convolution kernels, and the ternary map generated by the ternary map generation unit can be understood as follows: the pixel value of each pixel point in the ternary map indicates whether the pixel point is foreground, background, or an unknown region, that is, three pixel values are used to represent the foreground, the background, and the unknown region in the target region.
The ternary map has one channel and the shallow feature has three channels, so splicing the ternary map and the shallow feature in the first splicing unit may refer to a concat operation, that is, concatenating the one-channel ternary map and the three-channel shallow feature along the channel dimension to obtain the splicing result.
The fusion unit is connected to the output end of the second convolution unit and the output end of the fourth convolution unit. The first convolution unit and the second convolution unit sequentially process the deep features to obtain the processed deep features, and the third convolution unit and the fourth convolution unit sequentially process the first splicing feature to obtain the processed first splicing feature. Then, the features at the same positions in the processed deep features and the processed first splicing feature are added, which is equivalent to adding the deep features, whose edge details were lost when interpolated to the preset size, to the features that have been supplemented with edge information, so as to obtain the fusion features.
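The fusion can be sketched as an element-wise addition of the two convolved branches; the channel widths below are assumptions of the sketch, since the description only requires that the two branches have matching shapes at the addition:

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(256, 64, 3, padding=1)   # first convolution unit (deep branch)
conv2 = nn.Conv2d(64, 64, 3, padding=1)    # second convolution unit (deep branch)
conv3 = nn.Conv2d(4, 64, 3, padding=1)     # third convolution unit (splice branch)
conv4 = nn.Conv2d(64, 64, 3, padding=1)    # fourth convolution unit (splice branch)

deep = torch.randn(1, 256, 28, 28)         # deep feature fi
first_splice = torch.randn(1, 4, 28, 28)   # trimap (1 ch) concatenated with shallow feature (3 ch)

processed_deep = conv2(conv1(deep))
processed_splice = conv4(conv3(first_splice))
fusion = processed_deep + processed_splice  # add features at the same spatial positions
```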
As shown in fig. 5, in an embodiment, a first point convolution unit may be connected after the second convolution unit, and the first point convolution unit may be connected to the input end of the ternary map generation unit. In this way, the deep features are processed by the first convolution unit and the second convolution unit and then pass through the first point convolution unit to obtain a background/foreground prediction of the target region, that is, a coarse mask map, which then passes through the ternary map generation unit to obtain a ternary map serving as a background/foreground/unknown-region prompt. The first point convolution unit uses a 1 × 1 convolution kernel.
The ternary map generation unit in the embodiment of the present application obtains the ternary map by performing dilation and erosion operations on the coarse mask map, but in other embodiments, the ternary map may be obtained in other manners.
To facilitate understanding of the above technical solutions of the present application, the following is illustrated by an example:
firstly, by fully utilizing the shallow features rich in edge information and the transparency property of the image Alpha channel, a learnable self-supervised M2M module and a feature refining and fusing module are designed; together they form the example segmentation module shown in this application, whose structure is shown in fig. 4 or fig. 5. In turn, the edge information lost due to the ROI Pooling mechanism can be supplemented by the example segmentation module.
The example segmentation module is obtained by training a neural network model (for example, a CNN). The shallow layers of a CNN carry rich high-frequency information such as edges. A color image has four channels, RGBA: RGB are the three basic color channels of red, green, and blue, and A is the Alpha channel, which describes the transparency of the image. In practice, however, the Alpha information is not recorded when the image is shot, so the example segmentation module is required to predict the Alpha information of the image.
Specifically, for example, for a picture a, the image size is (w, h, 3), and the coordinates of the i-th target region predicted by the positioning module are recorded as
Figure BDA0003171369420000191
The i-th target region is processed by ROI Pooling, i.e., the ROI pooling module of the present application, to obtain the corresponding RGB feature map ri (the shallow feature) with shape (3, 28, 28); the i-th target region is also processed by the ROI pooling module to obtain the corresponding deep feature fi with shape (256, 28, 28).
Then, the deep feature fi passes through two 3 × 3 convolutions and one 1 × 1 convolution to generate a coarse mask map CMi (representing foreground/background) with shape (1, 28, 28). A ternary map Tri is generated from the coarse mask map CMi as follows: classical dilation and erosion operations are first performed on CMi, with a dilation kernel of (3, 3) and 20 iterations, and the result is recorded as Unknowni; the ternary map is then computed as Tri = 255 * CMi + 128 * (Unknowni - CMi), with shape (1, 28, 28).
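Under the parameters stated above, the ternary map computation can be sketched as follows, using OpenCV for the morphological operation; treating Unknowni as the dilated coarse mask is one plausible reading of the dilation and erosion operations mentioned here:

```python
import cv2
import numpy as np

def make_trimap(cm: np.ndarray) -> np.ndarray:
    """cm: coarse mask of shape (28, 28), values in {0, 1}."""
    kernel = np.ones((3, 3), np.uint8)
    # Dilation enlarges the foreground; the band it adds around the mask is the unknown region.
    unknown = cv2.dilate(cm.astype(np.uint8), kernel, iterations=20)
    # Foreground -> 255, unknown band -> 128, background -> 0.
    trimap = 255 * cm + 128 * (unknown - cm)
    return trimap.astype(np.uint8)

coarse_mask = np.zeros((28, 28), np.uint8)
coarse_mask[10:18, 10:18] = 1
tri = make_trimap(coarse_mask)   # shape (28, 28); add a leading channel to obtain (1, 28, 28)
```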
Then, the ternary map Tri is spliced with the RGB feature map ri to obtain a first splicing result Cati = [Tri, ri] with shape (4, 28, 28), and an Alpha feature map (the transparency feature) is predicted from it after two 3 × 3 convolutions and one 1 × 1 convolution; the shape of the Alpha feature map is (1, 28, 28).
Through the above steps, the edge information missing from the deep feature fi due to the ROI Pooling mechanism is supplemented.
Then, the features from the step preceding the Alpha feature map, namely the features obtained by applying two 3 × 3 convolutions to the first splicing result Cati = [Tri, ri] of shape (4, 28, 28), are added to the features obtained by applying two 3 × 3 convolutions to the deep feature fi, for feature enhancement.
Finally, the predicted Alpha feature map is spliced with the feature-enhanced feature map from the previous step, and a mask map Mi is obtained through a 1 × 1 convolution. Because each pixel point in the Alpha feature map represents the transparency of that pixel point, that is, the confidence that the pixel belongs to the foreground, the transparency plays the role of a prior.
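Putting the worked example together, the tensor shapes can be traced with placeholder convolution layers as below; only the kernel sizes, the order of operations, and the (·, 28, 28) shapes come from the description, while the intermediate channel widths and the stand-in for the dilation/erosion step are assumptions of the sketch:

```python
import torch
import torch.nn as nn

f_i = torch.randn(1, 256, 28, 28)   # deep feature of the i-th target region
r_i = torch.randn(1, 3, 28, 28)     # RGB (shallow) feature of the i-th target region

conv3x3 = lambda c_in, c_out: nn.Conv2d(c_in, c_out, 3, padding=1)

# Coarse mask branch: 2 x (3x3 conv) + 1 x (1x1 conv) on the deep feature.
cm_branch = nn.Sequential(conv3x3(256, 64), conv3x3(64, 64), nn.Conv2d(64, 1, 1))
cm_i = torch.sigmoid(cm_branch(f_i))                 # coarse mask CMi, (1, 1, 28, 28)

tri_i = cm_i.detach()                                # stand-in for the dilation/erosion trimap, (1, 1, 28, 28)
cat_i = torch.cat([tri_i, r_i], dim=1)               # first splicing result Cati, (1, 4, 28, 28)

# Alpha branch: 2 x (3x3 conv) + 1 x (1x1 conv) on the splicing result.
alpha_convs = nn.Sequential(conv3x3(4, 64), conv3x3(64, 64))
alpha_i = nn.Conv2d(64, 1, 1)(alpha_convs(cat_i))    # Alpha feature map, (1, 1, 28, 28)

# Feature enhancement: add the two convolved branches element-wise.
deep_convs = nn.Sequential(conv3x3(256, 64), conv3x3(64, 64))
enhanced = alpha_convs(cat_i) + deep_convs(f_i)      # (1, 64, 28, 28)

# Final mask: splice the Alpha map with the enhanced features, then a 1x1 convolution.
m_i = nn.Conv2d(65, 1, 1)(torch.cat([alpha_i, enhanced], dim=1))   # mask map Mi
print(m_i.shape)   # torch.Size([1, 1, 28, 28])
```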
By adopting the technical solution of the embodiment of the application, the Alpha feature map generation module in the example segmentation module can autonomously learn the Alpha information of the image to be processed, so as to predict the Alpha feature map of the target region in the image to be processed. The Alpha feature map actually refers to the confidence that each pixel point belongs to the foreground and can play the role of a prior. Therefore, the Alpha information does not need to be recorded during image shooting, and the technical purpose of supplementing, through the color information of each pixel point of the image to be processed, the edge information of the target region lost by the deep features can still be achieved even when the Alpha information is not recorded.
The above embodiments have described in detail how the example segmentation module generates a mask map for example segmentation according to the shallow features and the deep features with reference to fig. 4 and 5.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 6, a block diagram of an image processing apparatus according to an embodiment of the present invention is shown, and as shown in fig. 6, the apparatus may specifically include the following modules:
the feature extraction module 601 is configured to perform feature extraction on an image to be processed to obtain deep features and shallow features of a target region in the image to be processed;
a feature enhancement module 602, configured to determine a transparency feature of the target region based on the deep features and the shallow features;
a first segmentation module 603 configured to determine a first preliminary instance segmentation result of the image to be processed based on the deep features and the shallow features;
a second segmentation module 604, configured to perform instance segmentation on the image to be processed according to the first preliminary instance segmentation result and the transparency feature, so as to obtain an instance segmentation result of the target region.
Optionally, the feature enhancing module 602 specifically includes the following units:
the first processing unit is used for processing the deep features to obtain a ternary diagram corresponding to the target area;
the splicing unit is used for splicing the ternary image and the shallow feature to obtain a first splicing feature;
and the second processing unit is used for processing the first splicing characteristic to obtain the transparency characteristic.
Optionally, the first segmentation module 603 specifically includes the following units:
the first processing unit is used for processing the deep features to obtain a ternary diagram corresponding to the target area;
the first splicing unit is used for splicing the ternary image and the shallow feature to obtain a first splicing feature;
the fusion unit is used for performing fusion processing on the deep features and the first splicing features to obtain fusion features;
and the initial segmentation unit is used for carrying out example segmentation on the basis of the fusion characteristics to obtain the first initial example segmentation result.
Optionally, the first processing unit specifically includes:
the first segmentation subunit is used for carrying out example segmentation on the basis of the deep features to obtain a second preliminary example segmentation result of the target region;
a morphology processing subunit, configured to perform morphology processing on the second preliminary example segmentation result to obtain a processed second preliminary example segmentation result;
and the ternary processing subunit is used for carrying out ternary processing on the processed second preliminary example segmentation result to obtain a ternary diagram corresponding to the target area.
Optionally, the fusion unit includes:
the first processing subunit is used for performing convolution processing on the deep features to obtain processed deep features;
the first splicing subunit is used for performing convolution processing on the first splicing characteristic to obtain a processed first splicing characteristic;
and the feature adding subunit is configured to add feature values corresponding to the same positions in the processed deep features and the processed first splicing features to obtain the fusion features.
Optionally, the second segmentation module 604 includes:
the second splicing unit is used for splicing the first preliminary example segmentation result and the transparency characteristic to obtain a second splicing characteristic;
and the segmentation unit is used for carrying out example segmentation on the basis of the second splicing characteristics to obtain an example segmentation result of the target area.
Optionally, the example segmentation of the image to be processed is realized based on a neural network model; correspondingly, the device also comprises the following modules:
the sample feature extraction module is used for performing feature extraction on the sample image to obtain deep features and shallow features of a sample region in the sample image;
the input module is used for inputting the deep features and the shallow features of the sample region into a neural network model and generating an example segmentation result of the sample region;
a loss determining module, configured to determine a loss value of the neural network model according to a label corresponding to each pixel point in the sample region and an example segmentation result of the sample region, where a label corresponding to one pixel point represents whether the pixel point belongs to a foreground or a background;
and the updating module is used for updating the parameters of the neural network model according to the loss value to obtain an instance segmentation module.
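For illustration, a minimal training-step sketch along these lines is given below; the model, optimizer, and tensor shapes are placeholders assumed here, and only the use of per-pixel foreground/background labels and the loss-driven parameter update comes from the description:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, deep_feats, shallow_feats, pixel_labels):
    """deep_feats: (B, 256, 28, 28), shallow_feats: (B, 3, 28, 28),
    pixel_labels: float tensor (B, 1, 28, 28) with 1 = foreground, 0 = background."""
    model.train()
    optimizer.zero_grad()
    mask_logits = model(deep_feats, shallow_feats)   # instance segmentation result of the sample region
    loss = F.binary_cross_entropy_with_logits(mask_logits, pixel_labels)
    loss.backward()
    optimizer.step()                                 # update the neural network parameters
    return loss.item()
```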
Optionally, the apparatus further comprises:
the covariance processing module is used for processing deep features of the sample region to obtain covariance corresponding to the sample region;
the loss determining module is specifically configured to determine the loss value of the neural network model according to the covariance, the label corresponding to each pixel point in the sample region, and the example segmentation result of the sample region, according to the following formula;
Figure BDA0003171369420000221
wherein yu is the label corresponding to the u-th pixel point in the sample region, pu is the predicted value of the u-th pixel point in the example segmentation result of the sample region and is used for predicting whether the u-th pixel point belongs to the foreground or the background, N is the total number of pixel points in the sample region, and σi represents the covariance.
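The loss formula itself appears only as an image in the published text and is not reproduced here. One commonly used form that matches the described variables, namely per-pixel labels yu, predictions pu, N pixels, and a region-level covariance σi that discounts uncertain regions, is an uncertainty-weighted cross-entropy; the sketch below is an assumption of such a form, not the patented formula:

```python
import torch

def uncertainty_weighted_bce(p, y, log_sigma_i):
    """p, y: tensors of shape (N,) with predictions in (0, 1) and labels in {0, 1};
    log_sigma_i: scalar tensor, the log of the covariance predicted for the region."""
    eps = 1e-6
    bce = -(y * torch.log(p + eps) + (1 - y) * torch.log(1 - p + eps)).mean()
    sigma_sq = torch.exp(2 * log_sigma_i)
    # A larger covariance discounts the pixel loss but is penalized by the log term.
    return bce / (2 * sigma_sq) + log_sigma_i
```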
It should be noted that the device embodiments are similar to the method embodiments, so that the description is simple, and reference may be made to the method embodiments for relevant points.
Referring to fig. 7, which is a block diagram illustrating a structure of an electronic device 700 according to an embodiment of the present disclosure, as shown in fig. 7, the electronic device 700 may be configured to execute an image processing method and may include a memory 701, a processor 702, and a computer program stored in the memory and executable on the processor, where the processor 702 is configured to execute the image processing method.
As shown in fig. 7, in an embodiment, the electronic device 700 may further include an input device 703, an output device 704, and an image capturing device 705. When the image processing method according to the embodiment of the present disclosure is executed, the image capturing device 705 may capture an image to be processed, the input device 703 may then obtain the image to be processed captured by the image capturing device 705, the processor 702 may process the image to perform example segmentation on it, and the output device 704 may output the segmentation result of the example segmentation of the image to be processed.
Of course, in one embodiment, the memory 701 may include both volatile memory and non-volatile memory, where volatile memory may be understood as random access memory used to temporarily store data, and non-volatile memory is computer memory in which the stored data does not disappear when the power is turned off. The computer program of the image processing method of the present application may be stored in either or both of the volatile memory and the non-volatile memory.
Embodiments of the present invention further provide a computer-readable storage medium storing a computer program for causing a processor to execute the image processing method according to the embodiments of the present invention.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The foregoing detailed description of an image processing method, an image processing apparatus, an image processing device, and a storage medium according to the present invention has been presented, and the principles and embodiments of the present invention are described herein by using specific examples, and the descriptions of the above examples are only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (11)

1. An image processing method, characterized in that the method comprises:
performing feature extraction on an image to be processed to obtain deep features and shallow features of a target region in the image to be processed;
determining a transparency feature of the target region based on the deep feature and the shallow feature, and determining a first preliminary example segmentation result of the image to be processed based on the deep feature and the shallow feature;
and carrying out example segmentation on the image to be processed according to the first preliminary example segmentation result and the transparency characteristic to obtain an example segmentation result of the target area.
2. The method of claim 1, wherein determining the transparency characteristic of the target region based on the deep features and the shallow features comprises:
processing the deep features to obtain a ternary diagram corresponding to the target area;
splicing the ternary graph and the shallow layer feature to obtain a first splicing feature;
and processing the first splicing characteristic to obtain the transparency characteristic.
3. The method of claim 1, wherein determining a first preliminary instance segmentation result for the image to be processed based on the deep features and the shallow features comprises:
processing the deep features to obtain a ternary diagram corresponding to the target area;
splicing the ternary graph and the shallow layer feature to obtain a first splicing feature;
performing fusion processing on the deep features and the first splicing features to obtain fusion features corresponding to the deep features and the first splicing features;
and carrying out example segmentation based on the fusion characteristics to obtain the first preliminary example segmentation result.
4. The method according to claim 2 or 3, wherein the processing the deep features to obtain a ternary map corresponding to the target region comprises:
carrying out example segmentation based on the deep features to obtain a second preliminary example segmentation result of the target region;
performing morphological processing on the second preliminary example segmentation result to obtain a processed second preliminary example segmentation result;
and carrying out ternary processing on the processed second preliminary example segmentation result to obtain a ternary diagram corresponding to the target area.
5. The method of claim 3, wherein the fusing the deep feature with the first stitched feature to obtain a fused feature corresponding to the deep feature and the first stitched feature comprises:
performing convolution processing on the deep features to obtain processed deep features;
performing convolution processing on the first splicing characteristic to obtain a processed first splicing characteristic;
and adding the feature values corresponding to the same positions in the processed deep features and the processed first splicing features to obtain the fusion features.
6. The method according to any one of claims 1 to 5, wherein the performing instance segmentation on the image to be processed according to the first preliminary instance segmentation result and the transparency feature to obtain an instance segmentation result of the target region comprises:
splicing the first preliminary example segmentation result and the transparency characteristic to obtain a second splicing characteristic;
and carrying out example segmentation based on the second splicing characteristics to obtain an example segmentation result of the target area.
7. The method according to any one of claims 1-6, wherein the instance segmentation of the image to be processed is performed based on a neural network model;
accordingly, the neural network model is trained by:
performing feature extraction on the sample image to obtain deep features and shallow features of a sample region in the sample image;
inputting the deep features and the shallow features of the sample region into a neural network model to generate an example segmentation result of the sample region;
determining a loss value of the neural network model according to labels corresponding to all pixel points in the sample region and an example segmentation result of the sample region, wherein the label corresponding to one pixel point represents whether the pixel point belongs to a foreground or a background;
and updating the parameters of the neural network model according to the loss value to obtain an instance segmentation module.
8. The method of claim 7, further comprising:
processing the deep features of the sample region to obtain a covariance corresponding to the sample region;
determining a loss value of the neural network model according to the labels corresponding to the pixel points in the sample region and the example segmentation result of the sample region, wherein the step comprises the following steps:
determining a loss value of the neural network model according to the covariance, labels corresponding to all pixel points in the sample region and an example segmentation result of the sample region and the following formula;
Figure FDA0003171369410000031
wherein yu is the label corresponding to the u-th pixel point in the sample region, pu is the predicted value of the u-th pixel point in the example segmentation result of the sample region and is used for predicting whether the u-th pixel point belongs to the foreground or the background, N is the total number of pixel points in the sample region, and σi represents the covariance.
9. An image processing apparatus, characterized in that the apparatus comprises:
the characteristic extraction module is used for extracting characteristics of an image to be processed to obtain deep characteristics and shallow characteristics of a target region in the image to be processed;
a feature enhancement module for determining a transparency feature of the target region based on the deep features and the shallow features;
a first segmentation module, configured to determine a first preliminary instance segmentation result of the image to be processed based on the deep features and the shallow features;
and the second segmentation module is used for carrying out example segmentation on the image to be processed according to the first preliminary example segmentation result and the transparency characteristic to obtain an example segmentation result of the target area.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the image processing method according to any one of claims 1-8.
11. A computer-readable storage medium storing a computer program for causing a processor to execute the image processing method according to any one of claims 1 to 8.
CN202110819544.0A 2021-07-20 2021-07-20 Image processing method, apparatus, device and medium Pending CN113744280A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110819544.0A CN113744280A (en) 2021-07-20 2021-07-20 Image processing method, apparatus, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110819544.0A CN113744280A (en) 2021-07-20 2021-07-20 Image processing method, apparatus, device and medium

Publications (1)

Publication Number Publication Date
CN113744280A true CN113744280A (en) 2021-12-03

Family

ID=78728826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110819544.0A Pending CN113744280A (en) 2021-07-20 2021-07-20 Image processing method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN113744280A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020182670A1 (en) * 2019-03-08 2020-09-17 Koninklijke Philips N.V. Methods and systems for acquiring composite 3d ultrasound images
WO2021031066A1 (en) * 2019-08-19 2021-02-25 中国科学院深圳先进技术研究院 Cartilage image segmentation method and apparatus, readable storage medium, and terminal device
CN112446380A (en) * 2019-09-02 2021-03-05 华为技术有限公司 Image processing method and device
CN110610509A (en) * 2019-09-18 2019-12-24 上海大学 Optimized matting method and system capable of assigning categories
CN111104962A (en) * 2019-11-05 2020-05-05 北京航空航天大学青岛研究院 Semantic segmentation method and device for image, electronic equipment and readable storage medium
CN111178211A (en) * 2019-12-20 2020-05-19 北京迈格威科技有限公司 Image segmentation method and device, electronic equipment and readable storage medium
CN111223041A (en) * 2020-01-12 2020-06-02 大连理工大学 Full-automatic natural image matting method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
TIAN WANXIN, ET AL: "Learning better features for face detection with feature fusion and segmentation supervision", ARXIV PREPRINT ARXIV:1811.08557, 25 April 2019 (2019-04-25), pages 1 - 10 *
WANG RUI, ET AL: "OCT image quality evaluation based on deep and shallow features fusion network", 2020 IEEE 17TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING (ISBI), 22 May 2020 (2020-05-22), pages 1561 - 1564 *
樊玮等: "卷积神经网络低层特征辅助的图像实例分割方法", 计算机科学, 17 November 2020 (2020-11-17), pages 186 - 191 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419322A (en) * 2022-03-30 2022-04-29 飞狐信息技术(天津)有限公司 Image instance segmentation method and device, electronic equipment and storage medium
WO2024007135A1 (en) * 2022-07-04 2024-01-11 北京小米移动软件有限公司 Image processing method and apparatus, terminal device, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination