CN112258436A - Training method and device of image processing model, image processing method and model - Google Patents

Training method and device of image processing model, image processing method and model

Info

Publication number
CN112258436A
Authority
CN
China
Prior art keywords
image
module
feature map
feature
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011132152.9A
Other languages
Chinese (zh)
Inventor
裴仁静
郝磊
许松岑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202011132152.9A
Publication of CN112258436A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/20Image enhancement or restoration using local operators
    • G06T5/30Erosion or dilatation, e.g. thinning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training method for an image processing model, where the image processing model includes a first module, a second module and/or a third module. The method includes: inputting a first image and a second image into the first module to generate a first feature map, the first image including a target object image; generating, by the second module, a second feature map according to the first feature map and generating a mask image according to the second feature map, the mask image corresponding to the target object image; generating, by the third module, a third image according to the first feature map and the second feature map, the third image being a predicted composite image including the target object image and the second image; and training the image processing model according to the loss function for generating the third image and the loss function for generating the mask image. Correspondingly, a training apparatus for the image processing model, an image processing method and model, a computing device, and a storage medium are also provided. The method and apparatus can improve edge segmentation precision in image segmentation.

Description

Training method and device of image processing model, image processing method and model
Technical Field
The present invention relates to the field of neural network technology, and in particular, to a training method and apparatus for an image processing model, an image processing method and model, a computing device, and a storage medium.
Background
Image segmentation is a fundamental and important topic in the field of computer vision, and is also the basis of video segmentation: by treating a video frame as a single image, the video problem can be translated into image segmentation. Image segmentation requires a model to divide an image into a plurality of mutually disjoint regions according to features such as gray scale, color, spatial texture and geometric shape, that is, to separate an object from the background. Image segmentation is widely used in image processing, video processing, computer vision and related fields. For example, medical imaging, face recognition, fingerprint recognition and automatic driving all need to segment a target from an image first and then classify or recognize the target. Likewise, background replacement of a target object (which may be a human body, a building, the sky, a river, or the like) in an image or video is also based on first segmenting the target object from the image and then fusing it with another background.
Compared with traditional image segmentation methods based on digital image processing, image segmentation methods based on deep neural networks realize segmentation based on semantics and greatly improve segmentation performance: for example, a feature map (i.e., semantic features) of an image is extracted through a convolutional neural network, and then the pixel-level segmentation and categories of the image (for example, a foreground/background binary classification) are used as supervision signals for end-to-end training of the model. Typical neural network models include FCN, U-Net, Mask-RCNN, etc. FCN introduces a fully convolutional network and a deconvolution network: it replaces the fully connected layers of a conventional convolutional network with convolutional layers, so that the network can accept images of arbitrary size, and uses deconvolution to restore from the feature map a segmentation map of the same size as the original image. U-Net is a typical encoding-module/decoding-module structure: semantic information of the whole image is extracted through the encoding module and then fused with high-resolution shallow-layer information for segmentation. Mask-RCNN solves the two tasks of object detection and image segmentation simultaneously.
The accuracy of image segmentation, particularly in the target edge region, for example the accuracy of edge-region segmentation of a mask image generated by semantic segmentation, is a key issue in image segmentation. Segmentation accuracy also affects subsequent tasks: if the segmented image is inaccurate and has excessive edge noise, subsequent tasks suffer from problems such as low realism when fusing with other background images, or increased computation for subsequent analysis of the image. Therefore, how to improve the accuracy of image segmentation is a technical problem under continuous improvement.
Disclosure of Invention
In view of the above problems of the prior art, the present application provides a training method and apparatus for an image processing model, an image processing method and model, a computing device, and a storage medium, which can improve the accuracy of image segmentation.
To achieve the above object, a first aspect of the present application provides a method for training an image processing model, where the image processing model includes a first module, a second module, and/or a third module, the method includes:
inputting the first image and the second image into a first module to generate a first feature map; the first image comprises a target object image;
generating a second feature map according to the first feature map and generating a mask image according to the second feature map by a second module; the mask image corresponds to the target object image;
generating a third image according to the first feature map and the second feature map by a third module; the third image is a predicted composite image including the target object image and the second image;
and training the image processing model according to the loss function for generating the third image and the loss function for generating the mask image.
The first module may be an encoding module for generating a feature map from an image, and the second and third modules may be different decoding modules for restoring an image from a feature map. The image processing model is trained according to the loss function for generating the third image and the loss function for generating the mask image, with training performed by back propagation. The second module can therefore transfer to the third module, for learning, the second-feature-map features related to the generated mask image, while the third module in turn corrects the second module in reverse during training. That is, the second module and the third module are trained cooperatively, which improves the image segmentation accuracy of the second module and, correspondingly, the accuracy of the third module in image composition.
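For illustration only, the following sketch (assuming a PyTorch-style implementation; module internals and names are hypothetical, not part of this application) shows how the three modules and the two loss terms fit together during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageProcessingModel(nn.Module):
    """Hypothetical wrapper: encoder = first module, mask_decoder = second module,
    compose_decoder = third module."""
    def __init__(self, encoder, mask_decoder, compose_decoder):
        super().__init__()
        self.encoder = encoder
        self.mask_decoder = mask_decoder
        self.compose_decoder = compose_decoder

    def forward(self, first_img, second_img):
        # First module: encode the concatenated image pair into the first feature map.
        x = torch.cat([first_img, second_img], dim=1)
        first_feat = self.encoder(x)
        # Second module: second feature map(s) plus the predicted mask.
        second_feats, mask = self.mask_decoder(first_feat)
        # Third module: predicted composite image from both kinds of feature maps.
        composite = self.compose_decoder(first_feat, second_feats)
        return mask, composite

def joint_loss(mask_logits, gt_mask, composite, gt_composite):
    # Both gradients reach the shared first/second modules through back propagation.
    seg_loss = F.binary_cross_entropy_with_logits(mask_logits, gt_mask)
    image_level_loss = F.l1_loss(composite, gt_composite)
    return seg_loss + image_level_loss
```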
As a possible implementation manner of the first aspect, the second module and the third module respectively include at least one reverse feature extraction layer; the reverse feature extraction layer is used for restoring a high-resolution feature map from the low-resolution feature map;
generating different second feature maps from different reverse feature extraction layers of the second module;
transmitting the at least one second feature map to at least one reverse feature extraction layer corresponding to the third module;
and generating, by the third module, the third image according to the first feature map and the second feature maps that are respectively received by the at least one reverse feature extraction layer and fused into that layer's original input feature map.
Therefore, through the transmission of the plurality of second feature maps, the feature maps of the second module at different levels can be transmitted to the third module for learning. Since more feature parameters, at different levels, are transmitted, the third module can correct the second module more accurately in reverse during training, further improving the image segmentation accuracy of the second module.
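A non-limiting sketch of this per-level hand-off (the layer lists and the fuse function are assumptions used only for illustration): the second module's intermediate outputs are collected and consumed by the corresponding layers of the third module.

```python
def decode_with_feature_transfer(first_feat, mask_layers, compose_layers, fuse):
    """mask_layers / compose_layers: matching lists of reverse feature extraction
    layers; fuse: one of the fusion options described later in this application."""
    second_feats = []
    x = first_feat
    for layer in mask_layers:              # second module: keep every level's output
        x = layer(x)
        second_feats.append(x)
    y = first_feat
    for layer, feat in zip(compose_layers, second_feats):
        y = layer(fuse(y, feat))           # third module: fuse, then upsample further
    return second_feats, y                 # per-level second feature maps, third image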
As a possible implementation manner of the first aspect, the generating, by the third module, a third image according to the first feature map and the second feature map includes:
generating a corresponding affinity diagram according to the second feature diagram; the affinity diagram is used for representing the relation between at least one pixel and at least one other pixel in the second feature diagram;
generating, by a third module, the third image from the first feature map and the affinity map.
More context information in the second feature graph is transmitted to the third module for learning through the affinity graph, and the transmitted parameters are fused with more context information, so that the third module can correct the second module more accurately in a reverse direction in training, and the image segmentation accuracy of the second module is further improved.
As a possible implementation manner of the first aspect, the second module and the third module respectively include at least one reverse feature extraction layer;
generating different second feature maps from different reverse feature extraction layers of the second module;
generating at least one corresponding affinity diagram according to the at least one second feature diagram, and transmitting the affinity diagram to at least one reverse feature extraction layer corresponding to the third module;
and generating, by the third module, the third image according to the first feature map and the affinity maps that are respectively received by the at least one reverse feature extraction layer and fused into that layer's original input feature map.
Therefore, more context information in the second feature diagram is further transmitted to the third module for learning through more affinity diagrams, and the transmitted parameters are fused with more context information, so that the third module can be more accurate when reversely correcting the second module in training, and the image segmentation accuracy of the second module is further improved.
As a possible implementation manner of the first aspect, the generating a corresponding affinity graph according to the second feature graph includes:
and multiplying the second feature map by the transpose of the second feature map, and obtaining the affinity map corresponding to the second feature map through an activation function operation.
This implementation is simple and convenient, and its implementation cost is low.
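One possible reading of this computation (assuming the second feature map is a B x C x H x W tensor, the affinity is taken over spatial positions, and softmax is the activation function):

```python
import torch

def affinity_map(second_feat):
    b, c, h, w = second_feat.shape
    flat = second_feat.view(b, c, h * w)            # (B, C, N) with N = H*W
    aff = torch.bmm(flat.transpose(1, 2), flat)     # (B, N, N): feature map times its transpose
    return torch.softmax(aff, dim=-1)               # activation: each pixel's response to all others
```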
As a possible implementation manner of the first aspect, the fusing into the layer's original input feature map includes one of the following:
multiplying by the layer's original input feature map, performing element-wise summation of the product with the layer's original input feature map, and using the result as the fused input feature map;
multiplying by the layer's original input feature map and using the result as the fused input feature map;
performing element-wise summation with the layer's original input feature map and using the result as the fused input feature map.
The fusion of the feature maps is flexibly realized by the above mode, and one of the feature maps can be selected according to requirements.
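A small helper covering the three options (a sketch only; it assumes the transmitted map has already been brought to the same shape as the layer's original input feature map):

```python
def fuse(original, transmitted, mode="mul_add"):
    if mode == "mul_add":   # multiply, then element-wise sum with the original input
        return original * transmitted + original
    if mode == "mul":       # multiply only
        return original * transmitted
    if mode == "add":       # element-wise sum only
        return original + transmitted
    raise ValueError(f"unknown fusion mode: {mode}")
```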
As a possible implementation manner of the first aspect, the training of the image processing model according to the loss function for generating the third image and the loss function for generating the mask image includes using the following loss function in the training:
loss = (loss function for generating the mask image) + (loss function for generating the third image)
= seg loss + (image level loss + feature level loss);
where seg loss refers to the image segmentation network loss function;
image level loss is a loss function between the predicted composite image and the GT composite image;
feature level loss is a loss function between the features of the predicted composite image and the features of the GT composite image;
the predicted composite image refers to the third image; the GT composite image refers to a composite image obtained by compositing the first image and the second image using a known GT mask; and the GT mask refers to a manually annotated mask image.
Therefore, the loss function at the image level and the loss function at the feature level are adopted, and the semantic features (whole) and the details at the image level are considered in the model training process, so that the image segmentation accuracy of the second module is improved.
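A hedged sketch of this combined loss (assumptions: the GT composite is formed by alpha-blending the two images with the GT mask, L1 is used for both the image-level and feature-level terms, and feat_fn is any feature extractor, for example the first module with frozen weights; none of these specifics are mandated by the text):

```python
import torch.nn.functional as F

def total_loss(mask_logits, gt_mask, pred_comp, first_img, second_img, feat_fn):
    gt_comp = gt_mask * first_img + (1 - gt_mask) * second_img   # GT composite from the GT mask
    seg_loss = F.binary_cross_entropy_with_logits(mask_logits, gt_mask)
    image_level_loss = F.l1_loss(pred_comp, gt_comp)
    feature_level_loss = F.l1_loss(feat_fn(pred_comp), feat_fn(gt_comp))
    return seg_loss + (image_level_loss + feature_level_loss)
```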
As a possible implementation manner of the first aspect, the training of the image processing model according to the loss function for generating the third image and the loss function for generating the mask image includes using the following loss function in the training:
loss = (loss function for generating the mask image) + (loss function for generating the third image)
= seg loss + (image level loss + feature level loss) * (1 + w);
where seg loss refers to the image segmentation network loss function;
image level loss is a loss function between the predicted composite image and the GT composite image;
feature level loss is a loss function between the features of the predicted composite image and the features of the GT composite image;
the predicted composite image refers to the third image; the GT composite image refers to a composite image obtained by compositing the first image and the second image using a known GT mask; the GT mask refers to a manually annotated mask image;
and w is a two-dimensional map of the same size as the GT mask, with pixel values in the range 0 to 1; it is an image of the edge region of the target object, obtained by dilating and eroding the GT mask.
Therefore, through the introduction of w, the punishment on the edge area is increased, so that more loss values in the error area can be transmitted in a reverse mode during model training, the error places are corrected, the accuracy of image segmentation is further improved, and the accuracy of edge area segmentation is particularly improved because w is related to the mask image.
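One way to build such an edge-weight map w (an illustrative sketch assuming OpenCV morphology; the kernel size is arbitrary):

```python
import cv2
import numpy as np

def edge_weight(gt_mask_u8, ksize=7):
    """gt_mask_u8: GT mask as a uint8 image with values 0 / 255."""
    kernel = np.ones((ksize, ksize), np.uint8)
    dilated = cv2.dilate(gt_mask_u8, kernel, iterations=1)
    eroded = cv2.erode(gt_mask_u8, kernel, iterations=1)
    band = cv2.subtract(dilated, eroded)        # ring covering the object's edge region
    return band.astype(np.float32) / 255.0      # w: same size as the GT mask, values in [0, 1]
```

The per-pixel (image level + feature level) terms would then be scaled by (1 + w) before averaging, so errors near the edge contribute more to the back-propagated loss.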
As a possible implementation manner of the first aspect, the first image and the second image come from one data pair of a training data set; the data set includes a plurality of data pairs composed of the same first image and a plurality of different second images.
Therefore, the same foreground is combined with different backgrounds for synthesis, and the penalty on erroneous regions is reinforced across the composite images with different backgrounds, so that more loss values in the erroneous regions can be propagated back during model training to correct the errors, improving the accuracy of image segmentation.
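A minimal sketch of such pairing (the random tensor stands in for any automatically generated virtual background; names and the number of backgrounds are illustrative):

```python
import torch

def make_pairs(first_img, num_backgrounds=4):
    """Pair one foreground-bearing first image with several virtual backgrounds."""
    h, w = first_img.shape[-2:]
    return [(first_img, torch.rand(3, h, w)) for _ in range(num_backgrounds)]
```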
A second aspect of the present application provides an apparatus for training an image processing model, the image processing model comprising a first module, a second module and/or a third module, the apparatus comprising:
the first characteristic map generating unit is used for inputting the first image and the second image into the first module to generate a first characteristic map; the first image comprises a target object image;
the mask image generating unit is used for generating a second feature map according to the first feature map by a second module and generating a mask image according to the second feature map; the mask image corresponds to the target object image;
a third image generation unit, configured to generate a third image according to the first feature map and the second feature map by a third module; the third image is a predicted composite image including the target object image and the second image;
and the training unit is used for training the image processing model according to the loss function for generating the third image and the loss function for generating the mask image.
As a possible implementation manner of the second aspect, the second module and the third module respectively include at least one reverse feature extraction layer; the reverse feature extraction layer is used for restoring a high-resolution feature map from a low-resolution feature map;
the mask image generating unit generates different second feature maps by different reverse feature extraction layers of a second module;
the feature map transmission module is used for transmitting at least one second feature map to at least one reverse feature extraction layer corresponding to the third module;
the third image generating unit generates a third image according to the first feature map and a second feature map which is received by the at least one reverse feature extraction layer and fused to the original input feature map of the layer by a third module.
As a possible implementation manner of the second aspect, the method further includes an affinity module, configured to generate a corresponding affinity diagram according to the second feature diagram; the affinity diagram is used for representing the relation between at least one pixel and at least one other pixel in the second feature diagram;
the third image generating unit generates the third image according to the first feature map and the affinity map by a third module.
As a possible implementation manner of the second aspect, the second module and the third module respectively include at least one reverse feature extraction layer;
generating different second feature maps from different reverse feature extraction layers of the second module;
the affinity module is at least one and is used for generating at least one corresponding affinity diagram according to at least one second feature diagram;
the affinity diagram transmission module is used for transmitting the at least one affinity diagram to at least one reverse feature extraction layer corresponding to the third module;
the third image generating unit generates a third image according to the first feature map and the affinity map received and fused by the at least one reverse feature extraction layer of the third module to the original input feature map of the layer.
As a possible implementation manner of the second aspect, the affinity module is specifically configured to multiply the second feature map and the transpose of the second feature map, and then obtain the affinity map through an activation function operation.
As a possible implementation manner of the second aspect, the fusing by the third module into the layer's original input feature map includes one of the following:
multiplying by the layer's original input feature map, performing element-wise summation of the product with the layer's original input feature map, and using the result as the fused input feature map;
multiplying by the layer's original input feature map and using the result as the fused input feature map;
performing element-wise summation with the layer's original input feature map and using the result as the fused input feature map.
As a possible implementation manner of the second aspect, the training unit specifically performs the training of the image processing model by using the following loss function:
loss = (loss function for generating the mask image) + (loss function for generating the third image)
= seg loss + (image level loss + feature level loss);
where seg loss refers to the image segmentation network loss function;
image level loss is a loss function between the predicted composite image and the GT composite image;
feature level loss is a loss function between the features of the predicted composite image and the features of the GT composite image;
the predicted composite image refers to the third image; the GT composite image refers to a composite image obtained by compositing the first image and the second image using a known GT mask; and the GT mask refers to a manually annotated mask image.
As a possible implementation manner of the second aspect, the training unit specifically performs the training of the image processing model by using the following loss function:
loss = (loss function for generating the mask image) + (loss function for generating the third image)
= seg loss + (image level loss + feature level loss) * (1 + w);
where seg loss refers to the image segmentation network loss function;
image level loss is a loss function between the predicted composite image and the GT composite image;
feature level loss is a loss function between the features of the predicted composite image and the features of the GT composite image;
the predicted composite image refers to the third image; the GT composite image refers to a composite image obtained by compositing the first image and the second image using a known GT mask; the GT mask refers to a manually annotated mask image;
and w is a two-dimensional map of the same size as the GT mask, with pixel values in the range 0 to 1; it is an image of the edge region of the target object, obtained by dilating and eroding the GT mask.
As a possible implementation manner of the second aspect, the system further includes a training dataset including a plurality of data pairs composed of a same first image and a plurality of different second images, and the first image and the second image are from one data pair of the dataset.
A third aspect of the present application provides an image processing method, including:
inputting a first image into a first module to generate a first feature map; the first image comprises a target object image;
generating a second feature map according to the first feature map and generating a mask image according to the second feature map by a second module; the mask image corresponds to the target object image;
the first module and the second module are trained by any possible implementation manner of the training method of the image processing model provided by the first aspect of the present application.
Therefore, the image processing method can realize accurate segmentation of the image, namely the accuracy of generation of the mask image, especially the accuracy of the edge.
A fourth aspect of the present application provides an image processing model, comprising:
the first module is used for receiving the first image and generating a first feature map; the first image comprises a target object image;
the second module is used for generating a second feature map according to the first feature map and generating a mask image according to the second feature map; the mask image corresponds to the target object image;
the first module and the second module are trained by any possible implementation manner of the training method of the image processing model provided by the first aspect of the present application.
A fifth aspect of the present application provides an image processing method, including:
inputting the first image and the second image into a first module to generate a first feature map; the first image comprises a target object image;
generating a second feature map by a second module according to the first feature map;
generating a third image according to the first feature map and the second feature map; the third image is a predicted composite image including the target object image and the second image;
the first module, the second module and the third module are trained by any possible implementation manner of the training method of the image processing model provided by the first aspect of the present application.
Therefore, by the image processing method, the target object image can be accurately extracted when the two images are synthesized, so that the synthesized image is more accurate.
As a possible implementation manner of the fifth aspect, a mask image is further generated by the second module according to the second feature map; the mask image corresponds to the target object image.
Therefore, the image processing method can also realize accurate segmentation of the image, namely the accuracy of the generation of the mask image, especially the accuracy of the edge.
As a possible implementation manner of the fifth aspect, the generating a third image according to the first feature map and the second feature map includes:
generating an affinity diagram according to the second feature diagram; the affinity diagram is used for representing the relation between at least one pixel and at least one other pixel in the feature diagram;
and generating a third image according to the first feature map and the affinity map by a third module.
Therefore, due to the adoption of the image processing model with the affinity diagram structure, the target object image can be accurately extracted when the two images are synthesized, so that the synthesized image is more accurate.
A sixth aspect of the present application provides an image processing model, comprising:
the first module is used for receiving the first image and the second image and generating a first feature map; the first image comprises a target object image;
a second module, configured to generate a second feature map according to the first feature map;
a third module, configured to generate a third image according to the first feature map and the second feature map; the third image is a predicted composite image including the target object image and the second image;
the first module, the second module and the third module are trained by any possible implementation manner of the training method of the image processing model provided by the first aspect of the present application.
As a possible implementation manner of the sixth aspect, the second module is further configured to generate a mask image according to the second feature map; the mask image corresponds to the target object image.
As a possible implementation manner of the sixth aspect, the method further includes:
an affinity module for generating a corresponding affinity graph from the second profile graph; the affinity diagram is used for representing the relation between at least one pixel and at least one other pixel in the feature diagram;
the third module generates the third image from the first feature map and the affinity map.
A seventh aspect of the present application provides a computing device comprising:
a bus; a communication interface connected to the bus; at least one processor coupled to the bus; and at least one memory coupled to the bus and storing program instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of any of the aspects described above.
An eighth aspect of the present application provides a computer-readable storage medium having stored thereon program instructions that, when executed by a computer, cause the computer to perform the method of any of the aspects described above.
In addition, compared with the first prior art described in the detailed embodiments, the present solution has a smaller computation amount and a higher speed, and can be deployed as a very small network module on the user terminal side. The second module and the third module are trained cooperatively during training, so the third module can obtain the feature maps related to the mask image provided by the second module, especially when more context information is provided. As can be seen from fig. 5A and 5B, the model trained by the present method has better segmentation detail while still attending to the whole, segments hollow regions better, and produces edges that are more accurate and more natural in image segmentation.
In addition, compared with the second prior art described in the detailed embodiments, the information sources of the present method are different: it does not suffer the low practicability caused by using the original-image background as in the second prior art, but instead uses multiple groups of virtual backgrounds that can be randomly generated, so the training process is not limited and a large number of training sample pairs can be generated. By establishing different context relations (especially at the edges), the feedback-optimized segmentation enhances the ability of the segmentation branch to learn context information; as can be seen from fig. 5A and 5B, the model trained by the present application improves the accuracy of edge-region identification.
In summary, the present method and apparatus improve the segmentation accuracy of the foreground region, particularly the edge region, and thereby improve the effect of subsequent tasks. In addition, tests show that, on the premise of improved segmentation accuracy, the method runs fast with high real-time performance, essentially does not increase the computation amount, and can run on the user terminal side.
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.
Drawings
The various features and the connections between the various features of the present invention are further described below with reference to the attached figures. The figures are exemplary, some features are not shown to scale, and some of the figures may omit features that are conventional in the art to which the application relates and are not essential to the application, or show additional features that are not essential to the application, and the combination of features shown in the figures is not intended to limit the application. In addition, the same reference numerals are used throughout the specification to designate the same components. The specific drawings are illustrated as follows:
FIG. 1A is a schematic diagram of a first embodiment of an image processing model according to the present application;
FIG. 1B is a schematic diagram illustrating a feature map and a feature map fusion in a first embodiment of an image processing model according to the present application;
FIG. 2A is a schematic structural diagram of a second embodiment of an image processing model of the present application;
FIG. 2B is a schematic diagram illustrating a second embodiment of an image processing model according to the present application;
FIG. 2C is a schematic diagram of the principle of generating an affinity diagram in the present application;
FIG. 2D is a schematic diagram of the fusion of a feature map with an affinity map in the present application;
FIG. 3 is a flowchart of a first embodiment of a training method of an image processing model according to the present application;
FIG. 4A is a first embodiment of an apparatus for training an image processing model according to the present application;
FIG. 4B is a diagram illustrating a second embodiment of an apparatus for training an image processing model according to the present application;
FIG. 4C is a third embodiment of the training apparatus for image processing model of the present application;
FIG. 4D is a fourth embodiment of the present application for training an image processing model;
FIG. 5A is a diagram showing a comparison of segmentation results of the respective methods;
FIG. 5B is a diagram illustrating a comparison of a segmentation result graph of a second embodiment of the present application with a segmentation result graph of the second prior art;
FIG. 6 is a schematic diagram of a computing device of the present application;
FIG. 7A is a schematic flow chart of a prior art one;
fig. 7B is a flowchart of a second prior art.
Detailed Description
The terms "first, second, third and the like" or "module a, module B, module C and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order, it being understood that specific orders or sequences may be interchanged where permissible to effect embodiments of the present application in other than those illustrated or described herein.
In the following description, reference numerals indicating steps, such as S110, S120, etc., do not necessarily indicate that the steps are performed in this order; the order of the steps may be interchanged, or steps may be performed simultaneously, where permissible.
The term "comprising" as used in the specification and claims should not be construed as being limited to the contents listed thereafter; it does not exclude other elements or steps. It should therefore be interpreted as specifying the presence of the stated features, integers, steps or components as referred to, but does not preclude the presence or addition of one or more other features, integers, steps or components, and groups thereof. Thus, the expression "an apparatus comprising the devices a and B" should not be limited to an apparatus consisting of only the components a and B.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment, but may be.
Semantic Segmentation: semantic segmentation is classification at the pixel level; pixels belonging to the same class are grouped into one class, so semantic segmentation understands an image from the pixel level. Neural networks implementing semantic segmentation typically include two parts, an encoding module (Encoder) and a decoding module (Decoder). The Encoder is used for extracting features of an input image; for example, feature extraction may be realized based on a convolutional neural network (CNN), or feature extraction networks such as a residual network (ResNet), a ResNeXt network, a dense convolutional network (DenseNet), or a feature pyramid network (FPN) may be used. The Decoder is used for fusing the features of different resolutions extracted by the Encoder and restoring them to the original input size, and can be implemented through a group of convolutions.
The inverse feature extraction, which is relative to the feature extraction, refers to restoring a high-resolution feature map from a low-resolution feature map, and is often applied to an upsampling step in a decoding module, and can be implemented by deconvolution (or referred to as transposed convolution).
Context Affinity Map (CAB), or simply Affinity Map (Affinity Map), is used to represent the relationship response between each pixel and the rest of the pixels in the feature Map, i.e. to represent the inter-pixel relationship (i.e. the establishment of Context information in the segmentation).
Reshaping (Reshape): the method is used for rearranging the multidimensional matrix with a certain dimension to construct a new matrix which keeps the same element quantity but has different dimension sizes, and the Reshape operation does not change the data of the original elements.
Convolutional layer (Convolution): the method is used for performing convolution operation on an input image matrix to extract image features and output a Feature Map (Feature Map).
Softmax: one of the functions is activated.
The Pooling layer (Pooling) is used to down-sample data, reducing data throughput while retaining useful information.
Skip Connection structure (Skip Connection): used in networks that include an encoding module and a decoding module; a typical network with this structure is the U-shaped network (U-Net). With skip connections, during each (or some) upsampling level of the decoding module, the feature map at the corresponding position of the encoding module can be fused on the channel dimension; that is, during decoding, the feature matrix at the corresponding position of the encoding module is concatenated with the feature matrix of the decoding module and used as the working feature matrix. Through this fusion of low-level and high-level features, the network can retain more high-resolution detail information, thereby improving image segmentation precision.
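For reference, the concatenation step of a typical skip connection might look like this (a generic sketch, not specific to this application):

```python
import torch

def skip_connect(decoder_feat, encoder_feat):
    # Channel-wise concatenation of same-resolution encoder and decoder feature maps.
    return torch.cat([decoder_feat, encoder_feat], dim=1)
```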
Mask: also called a mask image, it refers to the portion that covers part of a picture, usually represented by completely black pixels.
Backpropagation (Backward Pass): one of the general methods for training neural network models. Backward propagation is defined relative to forward propagation (Forward Pass), in which each layer of the network in turn computes the final output predicted value (predicted label) from an input sample. Since an error (usually expressed as the value of a loss function) necessarily exists between the obtained predicted value and the real value (label), this loss error information is fed back to each layer through backward propagation so that the layers modify their model parameters, such as the weights of the weight matrix W and the biases b, to reduce the loss error. The iterative training of the neural network model based on the backpropagation algorithm continuously updates the weights of the weight matrix W and the biases b of each layer through repeated forward and backward propagation, so that the loss function value keeps decreasing and converges (i.e., the error between the predicted value and the true value gradually decreases) until the training target is reached, at which point training ends and the final weight matrix W and biases b are obtained. When the backpropagation training mode is adopted, gradient descent is usually used to update the weights.
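A generic sketch of this iterative forward/backward/update cycle (the optimizer choice and the data loader are assumptions; loss_fn could be, for example, the joint loss sketched earlier):

```python
import torch

def train(model, loader, loss_fn, epochs=10, lr=0.01):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # gradient-descent update rule
    for _ in range(epochs):
        for first_img, second_img, gt_mask, gt_comp in loader:
            mask, comp = model(first_img, second_img)        # forward pass
            loss = loss_fn(mask, gt_mask, comp, gt_comp)     # error between prediction and label
            optimizer.zero_grad()
            loss.backward()                                  # backward pass: propagate the loss
            optimizer.step()                                 # update weights W and biases b
    return model
```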
Loss (Loss) function: in the process of training the deep neural network, because the output of the deep neural network is expected to be as close to the value really expected to be predicted as possible, the weight vector of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the really expected target value (of course, an initialization process is usually carried out before the first updating, namely parameters are preset for each layer in the deep neural network), for example, if the predicted value of the network is high, the weight vector is adjusted to be lower, and the adjustment is continuously carried out until the deep neural network can predict the really expected target value or the value which is very close to the really expected target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value", which are loss functions or objective functions, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, if the higher the output value (loss) of the loss function indicates the larger the difference, the training of the deep neural network becomes the process of reducing the loss as much as possible. The difference between the predicted value and the target value may be referred to as a loss value of the neural network.
The L1 norm loss function (L1_loss) is also known as least absolute deviation (LAD) or least absolute error (LAE). It minimizes the sum of the absolute differences between the target values and the estimated values. The L1_loss formula is as follows:
$$L1\_loss = \sum_{i=1}^{n} \left| y_i - f(x_i) \right|$$
The L2 norm loss function (L2_loss) is also known as least squares error (LSE). It minimizes the sum of the squares of the differences between the target values and the estimated values. The L2_loss formula is as follows:
$$L2\_loss = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2$$
and a cross entropy loss function used for measuring the difference information between the two probability distributions, such as the target value and the estimated value, wherein the smaller cross entropy indicates the closer the two probability distributions are. The cross entropy loss formula for the N samples is as follows:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right]$$
the mean square error loss function (MSE loss), which is the expected value of the square of the difference between the estimated value of the parameter and the value of the parameter. MSE is a convenient method for measuring average error, the MSE can evaluate the change degree of data, and the smaller the value of MSE is, the better the accuracy of the prediction model is.
Dilation and Erosion: dilation enlarges bright (white) regions in an image by adding pixels along the perceived boundaries of objects in the image. Erosion is the opposite: it removes pixels along object boundaries and reduces the size of the object. The two operations are usually performed in sequence to enhance important object features, such as features of image edges.
Training and reasoning stages of the model: respectively, a stage of training a model (or algorithm) through existing sample data, and a stage of executing the trained model (or algorithm) in the equipment to process data.
[ analysis of the Prior Art ]
The first prior art is as follows: as shown in fig. 7A, the technical scheme improves edge segmentation precision using a Matting method. The processing mainly includes: first obtaining a preliminary segmentation result; further obtaining an undetermined region in a direct or indirect way; and then inputting the RGB image (or low-level features), the preliminary segmentation result and the undetermined region into a network to obtain the final mask result.
The defects of the first prior art are as follows: 1) if end-to-end Matting is performed, the interior is easily hollowed out, because Matting emphasizes details and local parts; 2) because of the undetermined region, edge jitter is more likely on video, i.e., the stability of the edge-region accuracy is low; 3) the pipeline is complex and therefore slow to run.
The second prior art is as follows: as shown in fig. 7B, the technical scheme improves edge segmentation precision by inputting the real background, that is, an original background image without the foreground is input to assist segmentation. The processing mainly includes: inputting the RGB image and a prior (the original background image without the person) into separate encoders, and then fusing the semantic features obtained by the two encoders with a Context Switching module; inputting the fused semantic features into two decoders, one for foreground reconstruction and the other for generating a mask; then compositing the reconstructed foreground with a background image through alpha F + (1-alpha) B, where alpha represents opacity, F the foreground color, and B the background color; and finally using a Discriminator to judge whether the composite image is good or bad.
The defects of the second prior art are as follows: 1) a person-free background image must be acquired separately and must be aligned, which lowers practicability; especially for handheld mobile devices, hand motion or the movement of other people changes the background, so it is difficult to guarantee that the required person-free background image can be acquired, making the method unusable, or the acquired person-free background image must first be matched by scaling and rotation, which increases the complexity of the method; 2) the information extracted by the foreground branch and the segmentation branch is highly redundant; 3) training with a discriminator leads to low sensitivity to the accuracy of the edge region, since the discriminator itself is insensitive to small differences; 4) the Context Switching module provides weak learning guidance because it adopts a simple structure of feature superposition and convolution; 5) the whole network has extremely high algorithmic complexity and is impractical on the user terminal side.
In view of the defects of the prior art, the image segmentation of the present application can improve edge segmentation precision. The basic principle is: an original RGB image and a virtual background are input as a pair; during model training, a third module that generates a composite image is branched off from the second module that generates the mask image (also called the segmentation image) to obtain a predicted composite image, and the third module assists in improving the segmentation precision of the second module; during inference, only the second module is retained, and the trained model is used to obtain the segmentation result, i.e., the mask image. When generating the mask, the present application improves edge segmentation precision while essentially not increasing the computation amount. In addition, through the affinity module, more context relations can be transmitted back to the second module during training, further improving the accuracy of the segmentation edges.
Image processing model application scenario example
The image processing model can generate a mask image and can be used in the field of image segmentation based on the mask image, that is, separating the target from the background. It can be applied to image processing, video processing, computer vision and related fields. For example, in medical imaging, recognizing a lesion region in a magnetic-resonance image; in face recognition, recognizing a face in an image; in fingerprint recognition, recognizing fingerprint texture or the fingerprint region in an image; in automatic driving, recognizing pedestrians, buildings, obstacles, traffic facilities and the like from a captured front-view image or video. All of these classify and recognize the target on the basis of first segmenting it from the image. As another example, the model can also be applied to background replacement of a target object (which may be a human body, a building, the sky, a river, or the like) in an image or video; this is likewise based on segmenting the image containing the target object and then fusing the segmented target object image with another background image. When applied to such background replacement, the background image may also be a virtual background (a non-real image), such as a cartoon background, an automatically generated background, or a dynamic background.
On the other hand, the image processing model can also be directly applied to the fusion of the identified target object and the background image by taking the target object identified on one image as the foreground and the other image as the background image for synthesis aiming at the two images.
Some application scenarios of the image processing model of the present application are further illustrated below in some specific applications:
a user utilizes equipment such as a terminal and the like to carry out real-time video semantic segmentation on a portrait, and further functions of background blurring or background replacement, live broadcast production, movie or animation production and the like are achieved. For example, a user starts a video call function at a mobile phone end, during the video call, after the portrait in the picture is segmented in real time, the portrait is selected as a foreground, the user selects a virtual background, and the original background is replaced by the virtual background, for example, the virtual background can be a virtual historical scene or a future scene, so that the effects of space-time change and background crossing can be realized. If the sky area in the background is identified and segmented, and the virtual sky replaces the sky area in the original image, when the virtual sky is a designed dynamic magic sky, the effect of magic sky of the picture or video at the user terminal side can be realized.
[ first embodiment of image processing model ]
Before describing the training method of the image processing model, the image processing model to be trained is first described.
Fig. 1A shows the first embodiment of the image processing model; a specific implementation is described with reference to the second embodiment of the image processing model shown in fig. 2B. The first embodiment includes:
a first module 110, configured to receive the first image and the second image, and generate a first feature map; the first image includes an image of a target object.
In some embodiments, the first module 110 may be implemented by an encoding module. The encoding module includes at least one feature extraction layer; the feature extraction layers are coupled in sequence, and each feature extraction layer performs higher-level feature extraction on the feature map output by the previous layer, so that the layers successively extract from low-level feature maps to high-level feature maps, where a low-level feature map corresponds to a high-resolution feature map and a high-level feature map corresponds to a low-resolution feature map. This feature extraction process may also be referred to as the downsampling process of the encoding module. The first module 110 may be implemented based on a convolutional neural network (CNN), a residual network (ResNet), a ResNeXt network, a dense convolutional network (DenseNet), a feature pyramid network (FPN), or any combination thereof.
In this embodiment, the first module 110 may be implemented based on a CNN structure, where the first module 110 includes a plurality of convolutional layers coupled in sequence, and the downsampling process is completed through the plurality of convolutional layers. Each convolutional layer or part of the convolutional layer may be followed by a pooling layer (pooling) and downsampled to reduce the amount of data. In this embodiment, the first feature map is a feature map obtained by the last sampling in the down-sampling process.
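A minimal sketch of such an encoder (channel widths, kernel sizes and the number of layers are illustrative assumptions, not limitations of this application):

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, in_ch=6):   # e.g. first and second RGB images concatenated (3 + 3 channels)
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.layers(x)      # the first feature map (lowest resolution)
```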
The target object image included in the first image may be a human body, a building, a sky, a river, or a region distinguishable from surrounding pixels in the first image. The second image may be any image, such as a photograph, or may be an automatically randomly generated image, such as an automatically randomly generated virtual background image.
A second module 120, configured to generate a second feature map according to the first feature map; the second module 120 is further configured to generate a mask image according to the second feature map; the mask image corresponds to a target object image in the first image.
In some embodiments, the second module 120 may be implemented as a decoding module. The decoding module includes at least one inverse feature extraction layer (inverse feature extraction refers to recovering a high-resolution feature map from a low-resolution feature map); the inverse feature extraction layers are coupled in sequence, and each layer generates a higher-resolution feature map from the low-resolution feature map output by the previous layer, so that higher-resolution feature maps are recovered step by step from the low-resolution feature map. This recovery process may also be referred to as the upsampling process of the decoding module. The second module 120 may be implemented based on a Convolutional Neural Network (CNN), a residual network (ResNet), a ResNeXt network, a dense convolutional network (DenseNet), a Feature Pyramid Network (FPN), or any combination thereof.
In this embodiment, the second module 120 is implemented based on a CNN structure: it includes a plurality of convolutional layers, and the upsampling process is implemented by inverse convolution (also called transposed convolution) operations of these layers. Each convolutional layer, or some of the convolutional layers, may be followed by an un-pooling layer (up-pooling) that fills in data to increase the resolution of the feature map. In this embodiment, the second feature map is obtained at any one upsampling step performed by the second module 120. In other embodiments, the second feature maps are a plurality of different feature maps obtained by the second module 120 at different convolutional or un-pooling layers of the upsampling process.
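A corresponding sketch of the second module as a transposed-convolution decoder is given below; the depth, channel widths and the sigmoid mask head are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class MaskDecoder(nn.Module):
    """Illustrative second module: transposed convolutions (upsampling) ending in a mask."""
    def __init__(self, widths=(256, 128, 64, 32)):
        super().__init__()
        ups = []
        for cin, cout in zip(widths[:-1], widths[1:]):
            ups.append(nn.Sequential(
                nn.ConvTranspose2d(cin, cout, 2, stride=2),  # each step doubles the resolution
                nn.ReLU(inplace=True)))
        self.ups = nn.ModuleList(ups)
        self.head = nn.Conv2d(widths[-1], 1, 1)              # 1-channel mask logits

    def forward(self, first_feature_map):
        feats, x = [], first_feature_map
        for up in self.ups:
            x = up(x)
            feats.append(x)                                  # candidate "second feature maps"
        mask = torch.sigmoid(self.head(x))                   # mask image with values in [0, 1]
        return mask, feats
```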
A third module 130, configured to generate a third image according to the first feature map and the second feature map; the third image is a predicted composite image including the target object image and the second image.
In some embodiments, the third module 130 may be implemented as another decoding module. This decoding module likewise includes at least one inverse feature extraction layer; the layers are coupled in sequence, and each generates a higher-resolution feature map from the feature map output by the previous layer, so that higher-resolution feature maps are recovered step by step from the low-resolution feature map. This recovery process may also be referred to as the upsampling process of the decoding module. The third module 130 may be implemented based on a Convolutional Neural Network (CNN), a residual network (ResNet), a ResNeXt network, a dense convolutional network (DenseNet), a Feature Pyramid Network (FPN), or any combination thereof.
In this embodiment, the third module 130 is implemented based on a CNN structure: it includes a plurality of convolutional layers, and the upsampling process is implemented by inverse convolution (transposed convolution) operations of these layers. Each convolutional layer, or some of the convolutional layers, may be followed by an un-pooling layer that fills in data to increase the resolution of the feature map. In this embodiment, the feature map output at any upsampling step is fused with the second feature map and used as the input of the next step (equivalently, at any upsampling step, the feature map output by the previous step is fused with the second feature map and used as the input of the current step). In other embodiments, a plurality of different second feature maps are fused into the feature maps originally input to the corresponding convolutional or un-pooling layers during the upsampling process performed by the third module 130.
As shown in fig. 1B, the fusion described in this embodiment may be implemented as follows: the feature map output by the previous upsampling step is multiplied by the second feature map, the product is summed element-wise (Element-wise Sum) with that same feature map, and the result is used as the input of the current step. To make the multiplication and element-wise summation possible, the feature maps may be reshaped before and after these operations.
In other embodiments, the fusion may also be implemented as follows: the feature map output by the previous step is multiplied by the second feature map and the product is used as the input of the current step; or the feature map output by the previous step is summed element-wise with the second feature map and the result is used as the input of the current step. Again, the feature maps may be reshaped before and after these operations to make the multiplication and element-wise summation possible.
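The fusion variants described above can be sketched as follows, assuming both tensors have the shape (N, C, H, W) with matching channel counts; the spatial resize used when the sizes differ stands in for the reshaping mentioned in the text, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def fuse(prev_output, second_feature_map, mode="mul_add"):
    """Fuse the previous upsampling output with the second feature map."""
    if prev_output.shape[-2:] != second_feature_map.shape[-2:]:   # align spatial sizes if they differ
        second_feature_map = F.interpolate(second_feature_map, size=prev_output.shape[-2:])
    if mode == "mul_add":                                         # multiply, then element-wise sum
        return prev_output * second_feature_map + prev_output
    if mode == "mul":                                             # multiplication only
        return prev_output * second_feature_map
    if mode == "add":                                             # element-wise summation only
        return prev_output + second_feature_map
    return torch.cat([prev_output, second_feature_map], dim=1)    # concatenation variant
```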
In other embodiments, other fusion manners are possible, for example concatenating the feature map output by the previous step with the second feature map and using the result as the input of the current step.
In one implementation of this embodiment, the first module 110 and the third module 130 may form a U-shaped network (U-Net): each pair of corresponding downsampling and upsampling network layers, or some of them, is connected by a skip connection structure (skip connection), so that a network layer in the upsampling process receives a feature map containing the high-resolution detail information from the downsampling process and concatenates it with the feature map originally input to that layer. Concatenation usually requires the two feature maps to have the same resolution; if they differ, they can be adjusted by reshaping. The skip connection structure improves the precision of resolution recovery and thus the precision of image segmentation.
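A skip connection of the kind described can be sketched as a simple concatenation; the resize stands in for the reshaping step mentioned above, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def skip_connect(decoder_input, encoder_feature):
    """U-Net style skip connection: concatenate an encoder feature map
    (high-resolution detail) with the decoder feature map at the same scale."""
    if decoder_input.shape[-2:] != encoder_feature.shape[-2:]:
        encoder_feature = F.interpolate(encoder_feature, size=decoder_input.shape[-2:])
    return torch.cat([decoder_input, encoder_feature], dim=1)  # channel-wise concatenation
```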
When in the model training phase, the first image and the second image form an image pair, which is input to the first module 110 as a training sample. When in the model inference phase, and used to generate a mask image of the first image, the first image may be input only to the first module 110, in which case the image processing model may not include the third module 130. When in the model inference phase and used for generating a composite image of the target object in the first image and the second image, or also simultaneously generating a mask image of the first image, the first image and the second image constitute an image pair input to the first module 110.
[ second embodiment of image processing model ]
As shown in fig. 2A, which is a schematic structural diagram of the second embodiment of the image processing model, only the differences from the first embodiment are described below. The second embodiment further includes a Context Affinity Block (CAB), referred to as the affinity block 140, which generates an affinity map from the second feature map; the affinity map represents the relationship of at least one pixel to at least one other pixel in the feature map. In some embodiments, the affinity module 140 multiplies the second feature map by the transpose of the second feature map and applies an activation function to obtain the affinity map of the second feature map.
The second embodiment also differs from the first in how the second feature map is fused. In the third module 130 of this embodiment, what is fused is not the second feature map itself, as in the first embodiment, but the affinity map generated from it, so that the second feature map is fused indirectly. Because the affinity map represents the relationship (i.e. the context) between at least one pixel and at least one other pixel in the feature map, it captures the context of the second feature map used to generate the mask image; and because the second feature map is related to the mask image, it represents the context at the edges of the target object image particularly well. The context information learned by the second module 120 is thus continuously strengthened during training, which improves the accuracy of edge-region identification.
In this embodiment, the principle of the affinity module 140, i.e. how the affinity map is generated, is described with reference to fig. 2C, taking as an example the process of obtaining the affinity map S of a feature map A of the second module 120:
from the second module 120 feature map a, a matrix B, C is obtained, with dimensions C × H × W. Shaping B, C to dimension C × N, (N × H × W, representing the number of pixels); the transpose of C is then multiplied by B to a matrix of dimensions N x N, and the softmax activation function processing is performed for each point of the matrix to sji, and the affinity diagram S is described in formula as follows:
Figure BDA0002735505760000141
sji, the larger the similarity, the larger the influence of the ith position on the jth position, i.e. the relationship (i.e. context) between the ith position and the jth position of the pixel, or the degree of association/correlation.
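The computation of S can be sketched as follows; how the matrices B and C are derived from A is not fixed by the text, so the 1 × 1 convolution layers assumed here (e.g. `nn.Conv2d(c, c, 1)`) are illustrative.

```python
import torch

def affinity_map(A, conv_b, conv_c):
    """Compute the affinity map S from a feature map A of shape (N, C, H, W).

    conv_b / conv_c: layers producing B and C from A (assumed 1x1 convolutions here).
    """
    n, c, h, w = A.shape
    B = conv_b(A).reshape(n, c, h * w)            # C x N_pix
    C = conv_c(A).reshape(n, c, h * w)            # C x N_pix
    energy = torch.bmm(C.transpose(1, 2), B)      # N_pix x N_pix matrix, entry (j, i) = C_j . B_i
    S = torch.softmax(energy, dim=-1)             # s_ji: softmax over positions i
    return S
```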
In other embodiments, the affinity graph may be generated in other manners, such as generating a relationship matrix of each element through a neural network, or generating a matrix of relationship of each element based on a structure as described in CPNet, or generating the affinity graph based on pixel positions and further combining characteristics of different channels.
In the third module 130, the process of fusing the affinity map S into a feature map is described below, taking as an example a feature map A' of the third module 130 that is fused with the affinity map to obtain the fused feature map A″, as shown in fig. 2D:
A matrix D is obtained from the feature map A' of the third module 130 and is likewise reshaped to dimension C × N (N = H × W, the number of pixels). D is then multiplied by the affinity map S, the product is reshaped back to dimension C × H × W, and an element-wise sum (Element-wise Sum) with the original feature map A' of the third module 130 gives the fused feature map A″ ∈ R^{C×H×W}:
A″_j = α · Σ_{i=1}^{N} (s_{ji} · D_i) + A′_j
where α is a scale coefficient, initialized to 0, which is gradually learned and assigned a larger weight.
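The fusion of S into a third-module feature map can be sketched as below; taking D directly from A' is an assumption, since the text only states that D is obtained from A'.

```python
import torch

def fuse_affinity(A_prime, S, alpha):
    """Fuse the affinity map S into a third-module feature map A' of shape (N, C, H, W).

    Implements A''_j = alpha * sum_i(s_ji * D_i) + A'_j, with D taken directly from A'.
    alpha: learnable scalar, typically initialized to 0.
    """
    n, c, h, w = A_prime.shape
    D = A_prime.reshape(n, c, h * w)              # C x N_pix
    out = torch.bmm(D, S.transpose(1, 2))         # column j becomes sum_i s_ji * D_i
    out = out.reshape(n, c, h, w)
    return alpha * out + A_prime                  # element-wise sum with the original A'
```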
From the above it can be seen that the feature map A″ of the third module 130 fuses the affinity map S, which relates all positions and is obtained from the second module 120, with the original feature map A' of the third module 130. In the forward pass, every feature of the affinity map S of the second module 120 reinforces the feature map of the third module 130, and the features of the third module 130 are fused with the features of S to obtain the final composite-image prediction. Because building the composite image requires learning the pixel relationships between the foreground edge and the virtual background, the affinity map S is necessarily corrected during the backward pass of the loss function of the third module 130; that is, the context relationships of the second module 120, especially at the edges, are corrected, which optimizes the model of the second module 120.
It should be noted that the feature map A of the second module and the feature map A' of the third module shown in fig. 2D have the same size and number of channels. If they differ, they can be adjusted to match by reshaping.
In other embodiments, the fusion may also be implemented as follows: the third module 130 multiplies the feature map A' output by the previous upsampling step by the affinity map S, and the result is used as the input A″ of the current step; or the feature map A' output by the previous step is summed element-wise with the affinity map S and the result is used as the input A″ of the current step. To make the multiplication and element-wise summation possible, the feature maps may be reshaped before and after these operations.
In other embodiments, other fusion manners are possible, for example concatenating the feature map A' output by the previous step with the affinity map S and using the result as the input A″ of the current step.
As shown in fig. 2B, which is a schematic structural diagram of an embodiment in which the first module and the third module form a U-type network in the second embodiment of the image processing model, the first module 110 and the third module 130 form a U-type network (U-Net), and the respective network layers for down-sampling and up-sampling, or a part of the network layers, are further provided with a Skip connection structure (Skip connection).
[ first embodiment of training method for image processing model ]
Having described the image processing model provided by the present application, a first embodiment of a training method for an image processing model is described in detail below with reference to a flowchart of a first embodiment of the training method for an image processing model of the present application shown in fig. 3, and the training method includes the following steps:
s310, inputting the first image and the second image into a first module to generate a first feature map; the first image comprises a target object image.
The target object image included in the first image may be a human body, a building, a sky, a river, or a region distinguishable from surrounding pixels in the first image. The second image may be any image, such as a photograph, or may be an automatically randomly generated image, such as an automatically randomly generated virtual background image.
When virtual background images are used, they can be generated randomly and without limit, so that a sufficient number of pairs of virtual background image and first image can be produced to satisfy the training data set. When the same first image forms an image pair with each of several virtual backgrounds, i.e. the same foreground is matched with different virtual backgrounds, an erroneously segmented region is penalized on every one of the different background composites, so that more loss from the erroneous region is back-propagated during training and the error is more likely to be corrected.
In this embodiment, the virtual background image can be generated with the same format and resolution as the RGB first image, i.e. as an RGB image of identical resolution, which reduces preprocessing; for example, the scaling or cropping otherwise needed to align image sizes can be omitted.
In other embodiments, when the image format and resolution generated by the virtual background image are different from those of the RGB image, preprocessing such as conversion of the image format, scaling or clipping of the image is required.
In addition, even in the case of image size alignment, the virtual background image and/or the original RGB image may be preprocessed, and the preprocessing may include format conversion, scaling, and clipping as described above, and may further include preprocessing such as rotation, translation, denoising, brightness adjustment, gray scale adjustment, contrast adjustment, and color temperature adjustment.
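For illustration, the sketch below assembles training pairs from one first image and several virtual backgrounds generated with the same size and RGB format; the random-noise generator used here is only a placeholder for whatever virtual-background generator is actually employed.

```python
import numpy as np

def make_training_pairs(first_image, num_backgrounds=4, seed=None):
    """Pair one first image (containing the target object) with several randomly
    generated virtual backgrounds of the same size and RGB format, so that no
    resizing or format conversion is needed."""
    rng = np.random.default_rng(seed)
    h, w, _ = first_image.shape
    pairs = []
    for _ in range(num_backgrounds):
        virtual_bg = rng.integers(0, 256, size=(h, w, 3), dtype=np.uint8)  # placeholder generator
        pairs.append((first_image, virtual_bg))
    return pairs
```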
S320, generating a second feature map by a second module according to the first feature map, and generating a mask image according to the second feature map; the mask image corresponds to the target object image.
In this embodiment, the first feature map is used as the input of the second module; the second module performs an upsampling process, the second feature map is obtained at any one upsampling step of this process, and the mask image is the output of the second module.
In other embodiments, the second feature maps are a plurality of different feature maps obtained by different network layers of the second module at different levels of the upsampling process.
S330, generating a third image by a third module according to the first feature map and the second feature map; the third image is a predicted composite image including the target object image and the second image.
In some embodiments, the first feature map is used as the input of the third module, which performs an upsampling process; the feature map output at any upsampling step is fused with the second feature map and used as the input of the next step (equivalently, at any upsampling step, the feature map output by the previous step is fused with the second feature map and used as the input of the current step).
In other embodiments, a plurality of different second feature maps are respectively fused to the feature maps input by the third module when performing corresponding different levels of sampling in the up-sampling process.
In other embodiments, instead of fusing the second feature maps directly, the affinity maps generated from the second feature maps are fused, so that the second feature maps are fused indirectly.
S340, training the image processing model according to the loss function for generating the third image and the loss function for generating the mask image.
In some embodiments, the training may be performed by back-propagation, so that the second module passes the pixel feature maps it extracts, or the context-relationship features between pixels, to the third module for learning, and the third module in turn corrects the second module during training. When training by back-propagation, a gradient descent method may be used.
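A single back-propagation step might look like the following sketch; the model interface and the combined loss function (see the loss sketch further below) are assumptions made for the example.

```python
def train_step(model, optimizer, first_image, second_image, gt_mask, gt_composite, loss_fn):
    """One back-propagation step: both branch losses flow back through the shared
    encoder, and the gradient of the synthesis branch also reaches the second
    module via the fused (affinity) features."""
    optimizer.zero_grad()
    mask_pred, composite_pred = model(first_image, second_image)
    loss = loss_fn(mask_pred, composite_pred, gt_mask, gt_composite)
    loss.backward()      # back-propagate the combined loss
    optimizer.step()     # gradient-descent update of all modules
    return loss.item()
```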
In some embodiments, the final LOSS function (LOSS) includes the loss function of the second module and the loss function of the third module. The loss function of the third module includes image-level and feature-level losses, with the weight increased at the edges. Expressed as a formula, the loss function may be written as:
LOSS = loss_second module + loss_third module
= seg loss + (image level loss + feature level loss) * (1 + w)
The loss_second module may be a common image segmentation loss function (seg loss), such as a cross-entropy loss or an MSE loss.
The loss_third module includes an image-level synthesis loss function (image level loss) and a feature-level synthesis loss function (feature level loss).
Here, image level loss is the L1_loss or L2_loss between the predicted composite image (the third image described in this application) and the GT composite image.
feature level loss is the L1_loss or L2_loss between the features of the predicted composite image and those of the GT composite image; both sets of features can be extracted by a pre-trained feature extractor.
In the formula, (1 + w) indicates that the penalty on the edge region is increased in the loss_third module. The foreground edge region, denoted w, is obtained by dilating and eroding the GT mask; w is a two-dimensional image of the same size as the GT mask, with pixel values ranging from 0 to 1.
Here, the GT mask refers to a manually annotated foreground mask with values ranging from 0 to 1; the GT composite image refers to a composite image obtained by a fusion algorithm from the known GT mask, the first image containing the foreground (i.e. the target object image) and the second image serving as the background, and may require manual adjustment of the edge region.
In other embodiments, the final LOSS function (LOSS) formula does not include the (1+ w) portion, and can be used for model training as well.
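The loss described above can be sketched as follows. The edge band w is computed by dilating and eroding the GT mask; applying the (1 + w) weight to the image-level term only is one possible reading of the formula, since the feature-level term generally has a different spatial size. The kernel size, the binary cross-entropy choice for seg loss, and the feature-extractor interface are assumptions; `edge_weight` operates on a 2-D NumPy mask, and w is assumed to be converted to a broadcastable tensor before use in `total_loss`.

```python
import numpy as np
import cv2
import torch
import torch.nn.functional as F

def edge_weight(gt_mask_np, ksize=15):
    """Foreground edge band w in [0, 1], obtained by dilating and eroding the GT mask."""
    kernel = np.ones((ksize, ksize), np.uint8)
    dilated = cv2.dilate(gt_mask_np, kernel)
    eroded = cv2.erode(gt_mask_np, kernel)
    return (dilated - eroded).clip(0.0, 1.0)

def total_loss(mask_pred, comp_pred, gt_mask, gt_comp, w, feat_extractor):
    seg_loss = F.binary_cross_entropy(mask_pred, gt_mask)             # loss of the second module
    image_level = torch.abs(comp_pred - gt_comp)                      # per-pixel L1 synthesis loss
    weighted = (image_level * (1.0 + w)).mean()                       # heavier penalty on the edge band
    feat_level = torch.abs(feat_extractor(comp_pred) - feat_extractor(gt_comp)).mean()
    return seg_loss + weighted + feat_level
```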
[ first embodiment of training apparatus for image processing model ]
Fig. 4A discloses a first embodiment of a training apparatus for an image processing model provided in the present application, the training apparatus comprising:
a first feature map generating unit 410, configured to input the first image and the second image into the first module to generate a first feature map; the first image comprises a target object image;
a mask image generating unit 420, configured to generate a second feature map according to the first feature map by a second module, and generate a mask image according to the second feature map; the mask image corresponds to the target object image;
a third image generating unit 430, configured to generate a third image according to the first feature map and the second feature map by a third module; the third image is a predicted composite image including the target object image and the second image;
a training unit 440, configured to perform training of the image processing model according to a loss function for generating the third image and a loss function for generating the mask image.
[ second embodiment of training apparatus for image processing model ]
As shown in fig. 4B, a second embodiment of the training apparatus for an image processing model is a variation of the first embodiment described above. The second module and the third module each comprise at least one reverse feature extraction layer; the reverse feature extraction layer is used to recover a high-resolution feature map from a low-resolution feature map;
the mask image generating unit 420 generates different second feature maps from different reverse feature extraction layers of the second module;
a feature map transmission module 450 is further included, configured to transmit the at least one second feature map to the at least one corresponding reverse feature extraction layer of the third module;
the third image generating unit 430 is configured to generate, by the third module, a third image according to the first feature map and the second feature map that is received by the at least one reverse feature extraction layer and fused into the feature map originally input to that layer.
[ third embodiment of training apparatus for image processing model ]
Fig. 4C discloses a third embodiment of the training apparatus for an image processing model, which is a variation of the first embodiment described above and further includes an affinity module 460 for generating a corresponding affinity map from the second feature map; the affinity map represents the relationship between at least one pixel and at least one other pixel in the second feature map;
the third image generating unit 430 is configured to generate, by the third module, the third image according to the first feature map and the affinity map.
[ fourth embodiment of training apparatus for image processing model ]
As shown in fig. 4D, a fourth embodiment of the training apparatus for an image processing model is a variation of the third embodiment described above: the second module and the third module each comprise at least one reverse feature extraction layer, and different second feature maps are generated by different reverse feature extraction layers of the second module;
there is at least one affinity module 460, configured to generate at least one corresponding affinity map from at least one second feature map;
an affinity map transmission module 470 is further included, configured to transmit the at least one affinity map to the at least one corresponding reverse feature extraction layer of the third module;
the third image generating unit 430 is configured to generate, by the third module, a third image according to the first feature map and the affinity map that is received by the at least one reverse feature extraction layer and fused into the feature map originally input to that layer.
In some embodiments, the affinity module 460 is specifically configured to multiply the second feature map and the transpose of the second feature map and then obtain the affinity map through an activation function operation.
In some embodiments, the fusion performed by the third module into the feature map originally input to a layer includes one of the following: multiplying the received map by the layer's original input feature map and summing the product element-wise with the original input feature map, the result being the fused input feature map; or multiplying the received map by the layer's original input feature map, the result being the fused input feature map; or summing the received map element-wise with the layer's original input feature map, the result being the fused input feature map.
[ Test results of the embodiments ]
Tests were performed on a portrait segmentation data set, as shown in the table below, which lists the gains of the segmentation network alone (corresponding to the model formed by the first and second modules of this application), the segmentation network plus the synthesis branch (corresponding to the model formed by the first, second and third modules), and the additional CAB (the additional affinity module, corresponding to the model formed by the first, second and third modules plus the affinity module). With the inference computation essentially unchanged, the overall mIOU is improved by 1.1% and the edge detail by 0.9%. The segmentation results of the corresponding methods are shown in fig. 5A.
[Table: mIOU comparison of the segmentation network, the segmentation network with synthesis branch, and the additional CAB; the table appears only as an image in the original publication.]
Further visualization effects are shown in the portrait segmentation results of fig. 5B; the comparison of the various schemes shows that segmentation is greatly improved at several previously difficult detail points, such as fingers, shadowed arms wearing bracelets, strongly reflective hair, hand-held objects and ears.
A simple comparison with the second prior-art scheme, which requires as input an original background image without the foreground, is shown below. The image model trained by the present method has clear advantages in reproduction quality, for example at edges and small hollowed-out regions, requires less computation, and does not need a real background image without the subject. In the measured background-replacement results, the edge transitions are natural.
[Table: comparison with the prior-art scheme that requires a clean background image; the table appears only as an image in the original publication.]
Fig. 6 is a schematic structural diagram of a computing device 1500 provided in an embodiment of the present application. The computing device 1500 includes: processor 1510, memory 1520, communications interface 1530, and bus 1540.
It is to be appreciated that the communication interface 1530 in the computing device 1500 illustrated in FIG. 6 can be utilized to communicate with other devices.
The processor 1510 may be connected to a memory 1520, among other things. The memory 1520 may be used to store the program code and data. Accordingly, the memory 1520 may be a storage unit inside the processor 1510, an external storage unit independent of the processor 1510, or a component including a storage unit inside the processor 1510 and an external storage unit independent of the processor 1510.
Optionally, computing device 1500 may also include a bus 1540. The memory 1520 and the communication interface 1530 may be connected to the processor 1510 via a bus 1540. Bus 1540 can be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 1540 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one line is shown in FIG. 6, but it is not intended that there be only one bus or one type of bus.
It should be understood that, in the embodiment of the present application, the processor 1510 may adopt a Central Processing Unit (CPU). The processor may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. Or the processor 1510 uses one or more integrated circuits for executing related programs to implement the technical solutions provided in the embodiments of the present application.
The memory 1520, which may include both read-only memory and random access memory, provides instructions and data to the processor 1510. A portion of the processor 1510 may also include non-volatile random access memory. For example, the processor 1510 may also store information of the device type.
When the computing device 1500 is run, the processor 1510 executes the computer-executable instructions in the memory 1520 to perform the operational steps of the above-described method.
It should be understood that the computing device 1500 according to the embodiment of the present application may correspond to a corresponding main body for executing the method according to the embodiments of the present application, and the above and other operations and/or functions of each module in the computing device 1500 are respectively for implementing corresponding flows of each method of the embodiment, and are not described herein again for brevity.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The present embodiments also provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program is used to perform the methods described above, including at least one of the solutions described in the above embodiments.
The computer storage media of the embodiments of the present application may take any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It should be noted that the foregoing is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present application has been described in more detail with reference to the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention.

Claims (28)

1. A method for training an image processing model, the image processing model comprising a first module, a second module and/or a third module, the method comprising:
inputting the first image and the second image into a first module to generate a first feature map; the first image comprises a target object image;
generating a second feature map according to the first feature map and generating a mask image according to the second feature map by a second module; the mask image corresponds to the target object image;
generating a third image according to the first feature map and the second feature map by a third module; the third image is a predicted composite image including the target object image and the second image;
and training the image processing model according to the loss function for generating the third image and the loss function for generating the mask image.
2. The method of claim 1, wherein the second and third modules each comprise at least one inverse feature extraction layer; the reverse feature extraction layer is used for restoring a high-resolution feature map from a low-resolution feature map;
generating different second feature maps by different reverse feature extraction layers of the second module;
transmitting the at least one second feature map to at least one reverse feature extraction layer corresponding to the third module;
and the third module generates a third image according to the first feature map and the second feature map, and the third module generates the third image according to the first feature map and the second feature map which is respectively received by the at least one reverse feature extraction layer and fused to the original input feature map of the layer.
3. The method of claim 1, wherein generating, by a third module, a third image from the first and second feature maps comprises:
generating a corresponding affinity diagram according to the second feature diagram; the affinity diagram is used for representing the relation between at least one pixel and at least one other pixel in the second feature diagram;
generating, by a third module, the third image from the first feature map and the affinity map.
4. The method of claim 3, wherein the second and third modules each comprise at least one inverse feature extraction layer;
generating different second feature maps by different reverse feature extraction layers of the second module;
generating at least one corresponding affinity diagram according to the at least one second feature diagram, and transmitting the affinity diagram to at least one reverse feature extraction layer corresponding to the third module;
and the third module generates a third image according to the first feature map and the affinity map, and the third module generates the third image according to the first feature map and the affinity map which is respectively received by the at least one reverse feature extraction layer and fused to the original input feature map of the layer.
5. The method of claim 3 or 4, wherein generating the corresponding affinity graph from the second profile comprises:
and multiplying the second feature map and the transpose of the second feature map, and obtaining the affinity map corresponding to the second feature map through an activation function operation.
6. The method according to claim 2 or 4, wherein the fusing into the feature map originally input to the layer comprises one of:
multiplying the received feature map by the layer's original input feature map, summing the product element-wise with the layer's original input feature map, and taking the result as the fused input feature map;
multiplying the received feature map by the layer's original input feature map, and taking the result as the fused input feature map;
summing the received feature map element-wise with the layer's original input feature map, and taking the result as the fused input feature map.
7. The method according to any one of claims 1 to 6, wherein the following loss functions are used in the training of the image processing model based on the loss function for generating the third image and the loss function for generating the mask image:
LOSS = loss function for generating the mask image + loss function for generating the third image
= seg loss + (image level loss + feature level loss);
seg loss refers to the image segmentation network loss function;
image level loss is a loss function of the predicted composite image and the GT composite image;
feature level loss is a loss function of the predicted composite image features and the GT composite image features;
the predicted composite image refers to the third image; the GT composite image refers to a composite image obtained by compositing the first image and the second image using a known GT mask; the GT mask refers to a manually annotated mask image.
8. The method according to any of claims 1 to 6, wherein the training of the image processing model according to the loss function for generating the third image and the loss function for generating the mask image uses the following loss functions:
LOSS = loss function for generating the mask image + loss function for generating the third image
= seg loss + (image level loss + feature level loss) * (1 + w);
seg loss refers to the image segmentation network loss function;
image level loss is a loss function of the predicted composite image and the GT composite image;
feature level loss is a loss function of the predicted composite image features and the GT composite image features;
the predicted composite image refers to the third image; the GT composite image refers to a composite image obtained by compositing the first image and the second image using a known GT mask; the GT mask refers to a manually annotated mask image;
w is a two-dimensional image of the same size as the GT mask, with pixel values ranging from 0 to 1, representing the edge region of the target object obtained by dilating and eroding the GT mask.
9. The method of any of claims 1 to 8, wherein the first and second images are from a data pair of a training dataset; the data set includes a plurality of data pairs, which are a plurality of data pairs composed of the same first image and a plurality of different second images.
10. An apparatus for training an image processing model, the image processing model comprising a first module, a second module and/or a third module, the apparatus comprising:
the first characteristic map generating unit is used for inputting the first image and the second image into the first module to generate a first characteristic map; the first image comprises a target object image;
the mask image generating unit is used for generating a second feature map according to the first feature map by a second module and generating a mask image according to the second feature map; the mask image corresponds to the target object image;
a third image generation unit, configured to generate a third image according to the first feature map and the second feature map by a third module; the third image is a predicted composite image including the target object image and the second image;
and the training unit is used for training the image processing model according to the loss function for generating the third image and the loss function for generating the mask image.
11. The apparatus of claim 10, wherein the second and third modules each comprise at least one inverse feature extraction layer; the reverse feature extraction layer is used for restoring a high-resolution feature map from a low-resolution feature map;
the mask image generating unit generates different second feature maps by different reverse feature extraction layers of a second module;
the feature map transmission module is used for transmitting at least one second feature map to at least one reverse feature extraction layer corresponding to the third module;
the third image generating unit generates a third image according to the first feature map and a second feature map which is received by the at least one reverse feature extraction layer and fused to the original input feature map of the layer by a third module.
12. The apparatus of claim 10,
the affinity module is used for generating a corresponding affinity diagram according to the second feature diagram; the affinity diagram is used for representing the relation between at least one pixel and at least one other pixel in the second feature diagram;
the third image generating unit generates the third image according to the first feature map and the affinity map by a third module.
13. The apparatus of claim 12, wherein the second and third modules each comprise at least one inverse feature extraction layer;
generating different second feature maps by different reverse feature extraction layers of the second module;
the affinity module is at least one and is used for generating at least one corresponding affinity diagram according to at least one second feature diagram;
the affinity diagram transmission module is used for transmitting the at least one affinity diagram to at least one reverse feature extraction layer corresponding to the third module;
the third image generating unit generates a third image according to the first feature map and the affinity map received and fused by the at least one reverse feature extraction layer of the third module to the original input feature map of the layer.
14. The apparatus according to claim 12 or 13, wherein the affinity module is specifically configured to multiply the second feature map and the transpose of the second feature map, and obtain the affinity map through an activation function operation.
15. The apparatus according to claim 11 or 13, wherein the fusing into the feature map originally input to the layer comprises one of:
multiplying the received feature map by the layer's original input feature map, summing the product element-wise with the layer's original input feature map, and taking the result as the fused input feature map;
multiplying the received feature map by the layer's original input feature map, and taking the result as the fused input feature map;
summing the received feature map element-wise with the layer's original input feature map, and taking the result as the fused input feature map.
16. The apparatus according to any of the claims 10 to 15, wherein the training unit is adapted to perform the training of the image processing model using in particular the following loss function:
LOSS = loss function for generating the mask image + loss function for generating the third image
= seg loss + (image level loss + feature level loss);
seg loss refers to the image segmentation network loss function;
image level loss is a loss function of the predicted composite image and the GT composite image;
feature level loss is a loss function of the predicted composite image features and the GT composite image features;
the predicted composite image refers to the third image; the GT composite image refers to a composite image obtained by compositing the first image and the second image using a known GT mask; the GT mask refers to a manually annotated mask image.
17. The apparatus according to any of the claims 10 to 15, wherein the training unit is adapted to perform the training of the image processing model using in particular the following loss function:
LOSS = loss function for generating the mask image + loss function for generating the third image
= seg loss + (image level loss + feature level loss) * (1 + w);
seg loss refers to the image segmentation network loss function;
image level loss is a loss function of the predicted composite image and the GT composite image;
feature level loss is a loss function of the predicted composite image features and the GT composite image features;
the predicted composite image refers to the third image; the GT composite image refers to a composite image obtained by compositing the first image and the second image using a known GT mask; the GT mask refers to a manually annotated mask image;
w is a two-dimensional image of the same size as the GT mask, with pixel values ranging from 0 to 1, representing the edge region of the target object obtained by dilating and eroding the GT mask.
18. The apparatus of any one of claims 10 to 17, further comprising a training dataset comprising a plurality of data pairs comprising a same first image and a plurality of different second images, the first and second images being from a data pair of the dataset.
19. An image processing method is characterized in that,
inputting a first image into a first module to generate a first feature map; the first image comprises a target object image;
generating a second feature map according to the first feature map and generating a mask image according to the second feature map by a second module; the mask image corresponds to the target object image;
the first and second modules are trained by the method of any one of claims 1 to 9.
20. An image processing model, comprising:
the first module is used for receiving the first image and generating a first feature map; the first image comprises a target object image;
the second module is used for generating a second feature map according to the first feature map and generating a mask image according to the second feature map; the mask image corresponds to the target object image;
the first and second modules are trained by the method of any one of claims 1 to 9.
21. An image processing method is characterized in that,
inputting the first image and the second image into a first module to generate a first feature map; the first image comprises a target object image;
generating a second feature map by a second module according to the first feature map;
generating a third image according to the first feature map and the second feature map; the third image is a predicted composite image including the target object image and the second image;
the first, second and third modules are trained by the method of any one of claims 1 to 9.
22. The method of claim 21, further generating, by the second module, a mask image from the second feature map; the mask image corresponds to the target object image.
23. The method according to claim 21 or 22, wherein the generating a third image from the first and second feature maps comprises:
generating an affinity diagram according to the second feature diagram; the affinity diagram is used for representing the relation between at least one pixel and at least one other pixel in the feature diagram;
and generating a third image according to the first feature map and the affinity map by a third module.
24. An image processing model, comprising:
the first module is used for receiving the first image and the second image and generating a first feature map; the first image comprises a target object image;
a second module, configured to generate a second feature map according to the first feature map;
a third module, configured to generate a third image according to the first feature map and the second feature map; the third image is a predicted composite image including the target object image and the second image;
the first, second and third modules are trained by the method of any one of claims 1 to 9.
25. The model of claim 24, wherein the second module is further configured to generate a mask image from the second feature map; the mask image corresponds to the target object image.
26. A model according to claim 24 or 25, further comprising:
an affinity module for generating a corresponding affinity graph from the second profile graph; the affinity diagram is used for representing the relation between at least one pixel and at least one other pixel in the feature diagram;
the third module generates the third image from the first feature map and the affinity map.
27. A computing device, comprising:
a bus;
a communication interface connected to the bus;
at least one processor coupled to the bus; and
at least one memory coupled to the bus and storing program instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of any of claims 1-9, 19, 21-23.
28. A computer readable storage medium having stored thereon program instructions, which when executed by a computer, cause the computer to perform the method of any of claims 1 to 9, 19, 21 to 23.
CN202011132152.9A 2020-10-21 2020-10-21 Training method and device of image processing model, image processing method and model Pending CN112258436A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011132152.9A CN112258436A (en) 2020-10-21 2020-10-21 Training method and device of image processing model, image processing method and model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011132152.9A CN112258436A (en) 2020-10-21 2020-10-21 Training method and device of image processing model, image processing method and model

Publications (1)

Publication Number Publication Date
CN112258436A true CN112258436A (en) 2021-01-22

Family

ID=74264987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011132152.9A Pending CN112258436A (en) 2020-10-21 2020-10-21 Training method and device of image processing model, image processing method and model

Country Status (1)

Country Link
CN (1) CN112258436A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190371080A1 (en) * 2018-06-05 2019-12-05 Cristian SMINCHISESCU Image processing method, system and device
CN109377445A (en) * 2018-10-12 2019-02-22 北京旷视科技有限公司 Model training method, the method, apparatus and electronic system for replacing image background
CN110852942A (en) * 2019-11-19 2020-02-28 腾讯科技(深圳)有限公司 Model training method, and media information synthesis method and device
CN111292337A (en) * 2020-01-21 2020-06-16 广州虎牙科技有限公司 Image background replacing method, device, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIWOON AHN et al.: "Learning Pixel-level Semantic Affinity with Image-level Supervision for Weakly Supervised Semantic Segmentation", arXiv:1803.10464v2 *
RISHAB SHARMA et al.: "AlphaNet: An Attention Guided Deep Network for Automatic Image Matting", IEEE *
YUNKE ZHANG et al.: "A Late Fusion CNN for Digital Matting", IEEE *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884022A (en) * 2021-01-29 2021-06-01 浙江师范大学 Unsupervised depth characterization learning method and system based on image translation
CN113034648A (en) * 2021-04-30 2021-06-25 北京字节跳动网络技术有限公司 Image processing method, device, equipment and storage medium
CN113239808A (en) * 2021-05-14 2021-08-10 广州广电运通金融电子股份有限公司 Deep learning-based fingerprint texture extraction method, system, device and storage medium
CN113486962A (en) * 2021-07-12 2021-10-08 深圳市慧鲤科技有限公司 Image generation method and device, electronic equipment and storage medium
CN114092712A (en) * 2021-11-29 2022-02-25 北京字节跳动网络技术有限公司 Image generation method and device, readable medium and electronic equipment
CN115082724A (en) * 2022-03-30 2022-09-20 Oppo广东移动通信有限公司 Model processing method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
CN112258436A (en) Training method and device of image processing model, image processing method and model
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN111612807B (en) Small target image segmentation method based on scale and edge information
CN108399419B (en) Method for recognizing Chinese text in natural scene image based on two-dimensional recursive network
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN111861880B (en) Image super-fusion method based on regional information enhancement and block self-attention
CN114863573B (en) Category-level 6D attitude estimation method based on monocular RGB-D image
CN111368846B (en) Road ponding identification method based on boundary semantic segmentation
CN114723760B (en) Portrait segmentation model training method and device and portrait segmentation method and device
CN113920013A (en) Small image multi-target detection method based on super-resolution
CN111914654B (en) Text layout analysis method, device, equipment and medium
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
CN116645592B (en) Crack detection method based on image processing and storage medium
CN114549574A (en) Interactive video matting system based on mask propagation network
CN116205962B (en) Monocular depth estimation method and system based on complete context information
CN114037640A (en) Image generation method and device
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN114638768B (en) Image rain removing method, system and equipment based on dynamic association learning network
CN115359370A (en) Remote sensing image cloud detection method and device, computer device and storage medium
CN113808005A (en) Video-driving-based face pose migration method and device
CN117036281A (en) Intelligent generation method and system for defect image
CN116310916A (en) Semantic segmentation method and system for high-resolution remote sensing city image
CN117197438A (en) Target detection method based on visual saliency
CN116228576A (en) Image defogging method based on attention mechanism and feature enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination