CN111815649A - Image matting method and computer readable storage medium - Google Patents

Image matting method and computer readable storage medium

Info

Publication number
CN111815649A
CN111815649A (application CN202010621083.1A)
Authority
CN
China
Prior art keywords
portrait
transparency
matting
data
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010621083.1A
Other languages
Chinese (zh)
Other versions
CN111815649B (en)
Inventor
董宇涵
王克
张凯
李志德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202010621083.1A priority Critical patent/CN111815649B/en
Publication of CN111815649A publication Critical patent/CN111815649A/en
Application granted granted Critical
Publication of CN111815649B publication Critical patent/CN111815649B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a portrait matting method and a computer-readable storage medium. The method comprises: acquiring portrait data, the portrait data comprising a portrait picture or a portrait video; computing the transparency of the portrait in the portrait data as a first transparency with a trained deep learning network; obtaining a first portrait foreground from the first transparency and the portrait data; adjusting the transparency of the portrait to a second transparency; and obtaining a second portrait foreground from the second transparency to complete the portrait matting. Portrait matting is performed automatically by the deep learning network; once the automatic matting result is obtained, an interactive matting function allows the user to further refine it through interactive operations and reach a higher-quality result. The method runs fast and occupies little memory, and can be deployed on a variety of smart devices for real-time portrait matting.

Description

Image matting method and computer readable storage medium
Technical Field
The invention relates to the technical field of image matting, in particular to an image matting method and a computer-readable storage medium.
Background
Matting is a basic image-editing technique with wide application and significant economic value. It is a fundamental operation in visual effects, art design, film and television post-production, and similar work, and many commercial products dedicated to matting, or integrating matting functions, are available in daily life.
Current matting techniques fall mainly into two categories:
The first category is interactive matting, represented by the design software Photoshop. Methods of this type require the user to guide the matting algorithm through interactive operations. The matting quality is good, but the interaction is tedious, consumes a great deal of the user's time, and imposes a learning cost and skill requirements on the user. Matting is an ill-posed problem, so additional interaction information is introduced to constrain it; from this information the algorithm learns part of the foreground and part of the background. Sampling-based methods rest on mathematical statistics: the known foreground and the known background are sampled separately, a foreground distribution model and a background distribution model are built, and the distribution of the unknown region is taken as a mixture of the two. They include parametric methods such as the Ruzon-Tomasi matting algorithm and Bayesian matting, and non-parametric methods that represent the data distribution with frequency histograms. Affinity-based methods interpret the foreground transparency as the affinity of the pixels in the unknown region to the image background and the image foreground; examples are Poisson matting, random-walk matting, geodesic-distance matting and closed-form matting. However, when the interaction information is coarse, the sampled information is insufficient and the error becomes large; when the image texture is complex, the estimation error of the model also grows.
The second category is fully automatic portrait matting, represented by intelligent photo applications. It requires no user guidance and performs portrait matting automatically. The operation is simple and fast, but the matting quality is poorer. Fully automatic portrait matting is mainly built on deep learning, for example the DAPM, SHM, LDN+FB, MMNet and SDPN models, and no longer requires the user to provide interaction information; however, the matting quality still needs improvement.
The prior art therefore lacks a matting method that is both simple to use and produces good matting results.
The above background disclosure is only for the purpose of assisting understanding of the concept and technical solution of the present invention and does not necessarily belong to the prior art of the present patent application, and should not be used for evaluating the novelty and inventive step of the present application in the case that there is no clear evidence that the above content is disclosed at the filing date of the present patent application.
Disclosure of Invention
The invention provides a portrait matting method and a computer-readable storage medium for solving the existing problems.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
A portrait matting method comprises the following steps: S1: acquiring portrait data, the portrait data comprising a portrait picture or a portrait video; S2: computing the transparency of the portrait in the portrait data as a first transparency with a trained deep learning network; S3: obtaining a first portrait foreground using the first transparency and the portrait data; S4: adjusting the transparency of the portrait to a second transparency; S5: obtaining a second portrait foreground using the second transparency, thereby completing the portrait matting.
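For orientation only, the following is a minimal PyTorch sketch of steps S1-S5, assuming a trained network `matting_net` that outputs decoder logits; the names are illustrative and the adjustment in S4 is shown here with the P-controlled sigmoid described later (for pictures the transparency can instead be adjusted via an edited grayscale map), so this is not the patented implementation itself.

```python
# Minimal sketch of steps S1-S5 (illustrative names; not the patented implementation).
import cv2
import numpy as np
import torch

def portrait_matting(image_bgr: np.ndarray, matting_net: torch.nn.Module, p_value: float = 1.0):
    # S1: acquire portrait data (one video frame is handled the same way as a picture).
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    x = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)        # 1 x 3 x H x W

    # S2: first transparency from the trained deep learning network.
    with torch.no_grad():
        logits = matting_net(x)                                    # 1 x 1 x H x W
    alpha1 = torch.sigmoid(logits)

    # S3: first portrait foreground = transparency * input data (element-wise).
    fg1 = alpha1 * x

    # S4: adjust the transparency, here via the P-controlled sigmoid described later.
    alpha2 = torch.sigmoid(p_value * logits)

    # S5: second portrait foreground from the adjusted transparency.
    fg2 = alpha2 * x
    return alpha1, fg1, alpha2, fg2
```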
Preferably, the deep learning network structure comprises encoder units and decoder units; each encoder unit comprises two branches, a coding branch and a spatial attention branch; the coding branch encodes its input and passes the result to the next encoder unit; the spatial attention branch generates a spatial attention map, which is fused into the corresponding decoder unit by element-wise (corresponding-element) addition; each decoder unit has only one branch: the input features are first upsampled, the spatial attention map output by the corresponding encoder unit is then fused in by element-wise addition, decoding is performed, and the result is finally output to the next decoder unit; the output data of the decoder unit is normalized to 0-1 by a Sigmoid function to obtain the first transparency of the portrait.
Preferably, the coding branch of the encoder unit comprises, in order: a two-dimensional convolution layer, a batch normalization layer, a rectified linear activation (ReLU) layer and a max pooling layer; the spatial attention branch comprises, in order: a two-dimensional convolution layer, a batch normalization layer and a rectified linear activation layer; the decoder unit comprises: a 2x upsampling layer, a two-dimensional convolution layer, a batch normalization layer and a rectified linear activation layer.
Preferably, the number of the encoder units is 5, and the number of the decoder units is 5.
Preferably, the deep learning network is trained by using a data set, the data set includes portrait pictures or portrait videos and corresponding labels, and a loss function during training is as follows:
L(A, A^{gt}) = \gamma L_{mse}(A, A^{gt}) + t L_{rgb}(A, A^{gt}) + w L_{grad}(A, A^{gt})
where L_{mse}(A, A^{gt}) is the mean square error loss, L_{rgb}(A, A^{gt}) is the synthesis loss, L_{grad}(A, A^{gt}) is the gradient loss, and \gamma, t and w are the weight coefficients of the mean square error loss, the synthesis loss and the gradient loss, respectively;
the loss of mean square error is:
L_{mse}(A, A^{gt}) = \frac{1}{k} \sum_{z} \left( A_z - A_z^{gt} \right)^2
the synthesis loss is:
L_{rgb}(A, A^{gt}) = \frac{1}{k} \sum_{z} \left( A_z I_z - A_z^{gt} I_z \right)^2
the gradient loss is:
L_{grad}(A, A^{gt}) = \frac{1}{k} \sum_{z} \left| \nabla A_z - \nabla A_z^{gt} \right|
where z denotes a pixel of the portrait picture or portrait video frame, A denotes the portrait transparency output by the deep learning network, A_z denotes the value of that transparency at pixel z, A^{gt} denotes the label of the portrait picture or portrait video frame, A_z^{gt} denotes the value of the label at pixel z, I_z denotes the value of the input portrait picture or video frame at pixel z, \nabla is the gradient operator, and k denotes the number of pixels of the picture or video frame.
Preferably, adaptive moment estimation is employed as the optimizer.
Preferably, adjusting the transparency of the portrait to a second transparency comprises: displaying the portrait picture as a first gray-scale image according to the first transparency of the portrait in the portrait picture, adjusting a black area and a white area of the first gray-scale image to obtain an adjusted second gray-scale image, and normalizing the second gray-scale image to 0-1 to obtain the second transparency of the portrait.
Preferably, adjusting the transparency of the portrait to a second transparency comprises: obtaining the first transparency of the portrait in the portrait video frame with the Sigmoid function
\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
and adjusting the first transparency of the portrait to the second transparency according to
\mathrm{PSigmoid}(x) = \frac{1}{1 + e^{-Px}}
where x is the output data of decoder unit 5 and P is a user-definable coefficient of x.
Preferably, the method also comprises the step of evaluating the effect of the portrait matting by adopting a gradient error, a connectivity error, a mean absolute value error and a mean square error;
the average absolute value error is:
\mathrm{MAD} = \frac{1}{k} \sum_{z} \left| A_z - A_z^{gt} \right|
the mean square error is:
\mathrm{MSE} = \frac{1}{k} \sum_{z} \left( A_z - A_z^{gt} \right)^2
the gradient error is:
\mathrm{GE} = \frac{1}{k} \sum_{z} \left| \nabla A_z - \nabla A_z^{gt} \right|
the connectivity error is:
\mathrm{CE} = \frac{1}{k} \sum_{z} \left| \varphi(A_z, \Omega_z) - \varphi(A_z^{gt}, \Omega_z) \right|
where z denotes a pixel of the portrait picture or portrait video frame, A denotes the portrait transparency output by the deep learning network, A_z denotes the value of that transparency at pixel z, A^{gt} denotes the label of the portrait picture or portrait video frame, A_z^{gt} denotes the value of the label at pixel z, \nabla is the gradient operator, k denotes the number of pixels of the portrait picture or video frame, \varphi(A_z, \Omega_z) denotes the connectivity of the output transparency A_z at pixel z computed over the neighborhood \Omega_z of pixel z, and \varphi(A_z^{gt}, \Omega_z) denotes the connectivity of the label of the portrait picture or video frame at pixel z computed over the same neighborhood.
The invention also provides a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of the above.
The beneficial effects of the invention are as follows: a portrait matting method and a computer-readable storage medium are provided in which portrait matting is performed automatically by a deep learning network; once the automatic matting result is obtained, an interactive matting function allows the user to further refine it through interactive operations and reach a higher-quality result.
Furthermore, the method runs fast and occupies little memory, and can be deployed on various smart devices for real-time portrait matting.
Drawings
Fig. 1 is a schematic diagram of a method for image matting according to an embodiment of the present invention.
Fig. 2(a) and fig. 2(b) are schematic diagrams of a deep learning network structure including an encoder unit and a decoder unit according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a deep learning network structure according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of the symbolic representations of different deep learning layers in an embodiment of the present invention.
Fig. 5 is a schematic diagram of a network structure and specific parameters of a deep learning-based image matting algorithm in the embodiment of the present invention.
FIG. 6 is a diagram illustrating the PSigmoid function at different P values in an embodiment of the present invention.
Fig. 7(a) -7 (d) are schematic diagrams illustrating the effect of an image matting algorithm in the embodiment of the present invention.
Fig. 8(a) -8 (c) are schematic diagrams illustrating the adjustment effect of "P value" in the embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for either a fixing function or a circuit connection function.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation or be constructed and operated in a particular orientation, and therefore should not be construed as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
As shown in fig. 1, the present invention provides a method for image matting, comprising the following steps:
s1: acquiring portrait data, wherein the portrait data comprises a portrait picture or a portrait video;
s2: calculating to obtain the transparency of the portrait in the portrait data as a first transparency based on the trained deep learning network;
s3: obtaining a first portrait foreground by using the first transparency and the portrait data;
s4: adjusting the transparency of the portrait to a second transparency;
s5: and obtaining a second portrait foreground by utilizing the second transparency, and finishing portrait matting.
The method of the invention obtains the first portrait foreground through the deep learning network without user interaction, which is simple and convenient. For pictures with a simple background or simple texture, the first portrait foreground is already good, and the second portrait foreground can be obtained without any interactive operation; for pictures with a complex background or complex texture, the first portrait foreground may contain larger errors, and in that case the second portrait foreground is obtained by further repairing the first portrait foreground through interactive operations. This avoids interactive operation from scratch and saves interaction cost.
The invention performs portrait matting automatically through a deep learning network; once the automatic matting result is obtained, an interactive matting function allows the user to further refine it through interactive operations on the automatic result to achieve a higher-quality matting result.
Furthermore, the method has high running speed and small occupied memory, and can be deployed on various intelligent devices to carry out real-time portrait matting processing.
Different deep learning-based portrait matting algorithms have different network structures. Different network architectures exhibit different algorithmic performance. The deep learning-based portrait matting algorithm provided by the invention has a unique deep learning-based network structure.
As shown in fig. 2(a) and 2(b), the deep learning network structure includes an encoder unit and a decoder unit;
each of the encoder units comprises two branches: a coding branch and a spatial attention branch;
the coding branch is used for coding and inputting the result to the next encoder unit;
the spatial attention branch is used for generating a spatial attention distribution map, and the spatial attention distribution map is fused into a corresponding decoder unit in a point-to-point corresponding addition mode;
each decoder unit only has one branch, the input characteristics are sampled firstly, then the spatial attention distribution maps output by the corresponding encoder units are fused according to the operation of adding the corresponding elements, then decoding is carried out, and finally the spatial attention distribution maps are output to the next decoder unit;
and normalizing the output data of the decoder unit to 0-1 through a Sigmoid function, namely obtaining the first transparency of the portrait.
In one embodiment of the present invention, the coding branch of the encoder unit comprises, in order: a two-dimensional convolution layer, a batch normalization layer, a rectified linear activation (ReLU) layer and a max pooling layer;
the spatial attention branch comprises, in order: a two-dimensional convolution layer, a batch normalization layer and a rectified linear activation layer;
the decoder unit comprises: a 2x upsampling layer, a two-dimensional convolution layer, a batch normalization layer and a rectified linear activation layer.
Input data entering an encoder unit is processed by the coding branch with the operations "Conv", "BN", "ReLU", "Conv", "BN", "ReLU" and "Pooling", in order from top to bottom, and the result is output; the spatial attention branch processes the same input data with "Conv", "BN" and "ReLU", in order from top to bottom, and outputs its result. As shown in fig. 2(b), input data entering a decoder unit is processed in the order "Up-sampling", "Conv", "BN", "ReLU", "Conv", "BN", "ReLU", and the result is then output.
The abbreviations used above are the names of standard deep learning layer types:
Conv: two-dimensional convolution layer (2D Convolution Layer);
BN: batch normalization layer (Batch Normalization Layer);
ReLU: rectified linear activation unit (Rectified Linear Unit);
Pooling: max pooling layer (Max Pooling Layer);
Up-sampling: 2x upsampling layer.
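Under the stated layer orders, a minimal PyTorch sketch of one encoder unit and one decoder unit could look as follows. The channel counts are passed in as parameters, and the assumption that the attention branch consumes the unit's raw input (and that its map is added right after the decoder's upsampling) is taken from the description, not from released code.

```python
import torch
import torch.nn as nn

class EncoderUnit(nn.Module):
    """Coding branch: Conv-BN-ReLU-Conv-BN-ReLU-Pooling; attention branch: Conv-BN-ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.coding = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
        )
        self.attention = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # returns (encoded features for the next encoder, spatial attention map for a decoder)
        return self.coding(x), self.attention(x)

class DecoderUnit(nn.Module):
    """Up-sampling-Conv-BN-ReLU-Conv-BN-ReLU; the attention map is added right after upsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.decode = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x, attention=None):
        x = self.up(x)
        if attention is not None:
            x = x + attention            # element-wise fusion of the spatial attention map
        return self.decode(x)
```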
As shown in fig. 3, there are 5 encoder units and 5 decoder units. The input portrait data passes through the 5 encoder units and then enters the decoder units, and the transparency of the portrait is output after the 5 decoder units. The input portrait data goes directly into encoder unit 1 and, at the same time, into decoder unit 5. The data output by the coding branch of encoder unit 1 enters encoder unit 2; encoder unit 1 has no spatial attention branch. The data output by the coding branch of encoder unit 2 enters encoder unit 3, and the data output by its spatial attention branch is fused into the input data of decoder unit 4 by adding corresponding elements. The data output by the coding branch of encoder unit 3 enters encoder unit 4, and the data output by its spatial attention branch is fused into the input data of decoder unit 3 in the same way. The data output by the coding branch of encoder unit 4 enters encoder unit 5, and the data output by its spatial attention branch is fused into the input data of decoder unit 2. The data output by the coding branch of encoder unit 5 is fused with the data output by the spatial attention branch of encoder unit 5 by adding corresponding elements and then enters decoder unit 1. The output data of decoder unit 1 is fused with the data output by the spatial attention branch of encoder unit 4 and then enters decoder unit 2. The output data of decoder unit 2 is fused with the output data of the spatial attention branch of encoder unit 3 and then enters decoder unit 3. The output data of decoder unit 3 is fused with the data output by the spatial attention branch of encoder unit 2 and then enters decoder unit 4. The output data of decoder unit 4 and the input portrait data are concatenated along the channel dimension and then enter decoder unit 5. The output data of decoder unit 5 passes through the defined PSigmoid function and the transparency of the portrait is output. In the training and testing stages, the P value of the PSigmoid function is set to 1; in the interactive repair stage of deployment, the P value can be adjusted according to the output of the algorithm.
As shown in fig. 4 and fig. 5, the input portrait data is scaled to a specification of 256 × 256 × 3 before entering encoder unit 1. The detailed parameters in encoder unit 1 to encoder unit 5 are set as follows: "kernel_size" of all convolution layers "Conv" is set to 3 × 3, "stride" is set to 1, "padding" is set to 1, and no bias term is added; the size of all max pooling layers "Pooling" is set to 2. Encoder unit 1 inputs 3-channel data and outputs 8-channel data; encoder unit 2 inputs 8-channel data and outputs 16-channel data; encoder unit 3 inputs 16-channel data and outputs 32-channel data; encoder unit 4 inputs 32-channel data and outputs 48-channel data; encoder unit 5 inputs 48-channel data and outputs 64-channel data. After passing through encoder units 1 to 5, the input portrait data is changed from the 256 × 256 × 3 specification to a 16 × 16 × 64 specification and then enters decoder unit 1.
The detailed parameters in decoder unit 1 to decoder unit 5 are set as follows: the size of all upsampling layers "Up-sampling" is set to 2, using bilinear interpolation; "kernel_size" of all convolution layers "Conv" is set to 3 × 3, "stride" is set to 1, "padding" is set to 1, and no bias term is added. Decoder unit 1 inputs 64-channel data and outputs 48-channel data; decoder unit 2 inputs 48-channel data and outputs 32-channel data; decoder unit 3 inputs 32-channel data and outputs 16-channel data; decoder unit 4 inputs 16-channel data and outputs 8-channel data; decoder unit 5 inputs 11-channel data and outputs 1-channel data. The output data of decoder unit 5 is passed through the PSigmoid function to obtain the portrait transparency with a specification of 256 × 256 × 1. The portrait transparency is then scaled back to the size of the originally input portrait data and multiplied element-wise with the original input data to obtain the portrait foreground.
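A possible end-to-end wiring of these ten units, reusing the EncoderUnit and DecoderUnit sketch above and the channel counts just listed, is shown below. Each attention map is fused after the decoder's upsampling step, as described for the decoder units, and the input image is bilinearly resized before the channel concatenation in front of decoder unit 5; these spatial-alignment details, like the class and variable names, are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
# EncoderUnit and DecoderUnit are the classes sketched above.

class PortraitMattingNet(nn.Module):
    """Five encoder units and five decoder units with attention maps fused by addition (cf. fig. 3)."""
    def __init__(self):
        super().__init__()
        enc_ch = [(3, 8), (8, 16), (16, 32), (32, 48), (48, 64)]
        dec_ch = [(64, 48), (48, 32), (32, 16), (16, 8), (8 + 3, 1)]   # decoder 5 takes the concat
        self.encoders = nn.ModuleList(EncoderUnit(i, o) for i, o in enc_ch)
        self.decoders = nn.ModuleList(DecoderUnit(i, o) for i, o in dec_ch)

    def forward(self, img):
        x, attn = img, []
        for k, enc in enumerate(self.encoders):
            x, a = enc(x)
            if k > 0:                       # the attention map of encoder unit 1 is not used
                attn.append(a)              # attention maps of encoder units 2..5
        for dec, a in zip(self.decoders[:4], reversed(attn)):
            x = dec(x, a)                   # decoders 1-4 fuse the attention of encoders 5-2
        x = self.decoders[4].up(x)          # decoder 5: upsample, then concatenate the input image
        img_r = F.interpolate(img, size=x.shape[-2:], mode="bilinear", align_corners=False)
        x = self.decoders[4].decode(torch.cat([x, img_r], dim=1))
        return x                            # logits; Sigmoid / PSigmoid is applied outside

# smoke test: alpha = torch.sigmoid(PortraitMattingNet()(torch.rand(1, 3, 256, 256)))
```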
On the basis of the above embodiment, 6 encoder units and corresponding 6 decoder units are adopted, and the specific structure and parameter setting of each encoder unit and decoder unit are the same as those in embodiment 1. The other portions are the same as in the above embodiment.
On the basis of the above-described embodiment, 4 encoder units and corresponding 4 decoder units are employed, and the specific structure and parameter settings of each encoder unit and decoder unit are the same as those in embodiment 1. The other portions are the same as in example 1.
In one embodiment of the present invention, the deep learning network needs to be pre-trained. Training data is prepared, a loss function and an optimizer are set, and training of the deep learning network can be completed in an iterative updating mode.
The training data may come from a public data set, with data enhancement applied. First, the image size is randomly re-scaled, with the scaling ratio uniformly distributed in 0.8-1.2. Then the image is rotated with a probability of 50%, with the rotation angle uniformly distributed in -30 to 30 degrees. Finally, a brightness transformation with parameter 0.1, a contrast transformation with parameter 0.1, a saturation transformation with parameter 0.1 and a hue transformation with parameter 0.05 are applied. Twenty pictures are generated in this way for each picture in the training set. During training, the input image is additionally flipped horizontally with a probability of 50%. These data enhancement methods are merely exemplary, and other data enhancement methods may be employed.
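A sketch of this offline augmentation for one image/label pair using torchvision follows; applying the geometric transforms to the picture and its transparency label together, and the photometric jitter to the picture only, is an assumption, since the text does not spell out how the labels are transformed.

```python
# Offline augmentation sketch for one (image, alpha-label) pair, following the parameters above.
import random
from PIL import Image
import torchvision.transforms.functional as TF
from torchvision import transforms

color_jitter = transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.05)

def augment_once(image: Image.Image, alpha: Image.Image):
    # random re-scaling, ratio uniform in [0.8, 1.2]
    s = random.uniform(0.8, 1.2)
    w, h = image.size
    image = TF.resize(image, [int(h * s), int(w * s)])
    alpha = TF.resize(alpha, [int(h * s), int(w * s)])

    # rotation with 50% probability, angle uniform in [-30, 30] degrees
    if random.random() < 0.5:
        angle = random.uniform(-30.0, 30.0)
        image = TF.rotate(image, angle)
        alpha = TF.rotate(alpha, angle)

    # brightness / contrast / saturation / hue jitter on the colour picture only
    image = color_jitter(image)
    return image, alpha

# 20 augmented copies are generated per training picture; the horizontal flip (p = 0.5)
# is applied later, at training time.
```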
The deep learning network is trained with a data set comprising portrait pictures or portrait videos and their corresponding labels; the loss function used during training is:
L(A, A^{gt}) = \gamma L_{mse}(A, A^{gt}) + t L_{rgb}(A, A^{gt}) + w L_{grad}(A, A^{gt})
where L_{mse}(A, A^{gt}) is the mean square error loss, L_{rgb}(A, A^{gt}) is the synthesis loss, L_{grad}(A, A^{gt}) is the gradient loss, and \gamma, t and w are the weight coefficients of the mean square error loss, the synthesis loss and the gradient loss, respectively;
the mean square error loss utilizes the mean square error between the output result of the deep learning network and the label as a loss function:
L_{mse}(A, A^{gt}) = \frac{1}{k} \sum_{z} \left( A_z - A_z^{gt} \right)^2
the synthesis loss utilizes the mean square error of the color picture synthesized by transparency as a loss function:
L_{rgb}(A, A^{gt}) = \frac{1}{k} \sum_{z} \left( A_z I_z - A_z^{gt} I_z \right)^2
the gradient loss utilizes the gradient of the output result of the deep learning network and the gradient of the label, and utilizes the absolute value error between the two as a loss function:
L_{grad}(A, A^{gt}) = \frac{1}{k} \sum_{z} \left| \nabla A_z - \nabla A_z^{gt} \right|
where z denotes a pixel of the portrait picture or portrait video frame, A denotes the portrait transparency output by the deep learning network, A_z denotes the value of that transparency at pixel z, A^{gt} denotes the label of the portrait picture or portrait video frame, A_z^{gt} denotes the value of the label at pixel z, I_z denotes the value of the input portrait picture or video frame at pixel z, \nabla is the gradient operator, and k denotes the number of pixels of the picture or video frame.
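A minimal PyTorch sketch of this combined loss, with γ = t = 1 and w = 4 as used in the embodiment below, is given here; the gradient operator is not specified in the text, so simple forward differences are assumed.

```python
# Sketch of L = gamma*L_mse + t*L_rgb + w*L_grad as reconstructed above (not the authors' code).
import torch
import torch.nn.functional as F

def image_gradients(a: torch.Tensor):
    """Forward-difference gradients of an N x 1 x H x W map (assumed gradient operator)."""
    dx = a[..., :, 1:] - a[..., :, :-1]
    dy = a[..., 1:, :] - a[..., :-1, :]
    return dx, dy

def matting_loss(alpha, alpha_gt, image, gamma=1.0, t=1.0, w=4.0):
    l_mse = F.mse_loss(alpha, alpha_gt)                              # mean square error loss
    l_rgb = F.mse_loss(alpha * image, alpha_gt * image)              # synthesis (composition) loss
    dx, dy = image_gradients(alpha)
    gx, gy = image_gradients(alpha_gt)
    l_grad = (dx - gx).abs().mean() + (dy - gy).abs().mean()         # gradient loss (absolute error)
    return gamma * l_mse + t * l_rgb + w * l_grad
```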
In order to complete the training of the portrait matting algorithm based on deep learning, the invention adopts Adaptive Moment Estimation (Adam) as an optimizer. It is understood that other optimizers may be used to perform the training of the algorithm proposed by the present invention, and the present invention is not limited thereto.
In one embodiment of the present invention, adjusting the transparency of the portrait to a second transparency comprises:
displaying the portrait picture as a first gray-scale image according to the first transparency of the portrait in the portrait picture, adjusting a black area and a white area of the first gray-scale image to obtain an adjusted second gray-scale image, and normalizing the second gray-scale image to 0-1 to obtain the second transparency of the portrait.
Specifically, after the portrait transparency and the portrait foreground are automatically obtained by the portrait matting algorithm based on deep learning, the user can further repair the portrait foreground through interactive operation. There are two specific interactive operations:
the first interaction is for a portrait picture: and displaying the transparency of the portrait output by the deep learning network as a gray scale map, and modifying the gray scale map of the transparency of the portrait through 'smearing' and 'erasing' operations. "painting" black indicates that the area is modified into a foreground area of a portrait; "painted" white indicates that the area is modified to be the background area of the portrait. And normalizing the gray level diagram of the modified portrait transparency to 0-1 to obtain the corrected portrait transparency. And calculating to obtain the portrait foreground according to the corrected portrait transparency and the original input data.
The second interaction is for a portrait video; it adjusts the transparency of the portrait to a second transparency as follows:
output data of a decoder unit in the portrait matting algorithm based on deep learning is normalized to 0-1 through a Sigmoid function, and therefore the transparency of the portrait is obtained. The Sigmoid function is a nonlinear activation unit in deep learning theory, and is defined as follows:
\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
the human image transparency output by the human image matting algorithm based on deep learning is influenced by the data distribution of the data output by the decoder unit and the characteristics of the Sigmoid function. Then changing the characteristics of the Sigmoid function itself will also change the transparency of the portrait output by the algorithm.
Adjusting the first transparency of the portrait to the second transparency according to the following formula:
\mathrm{PSigmoid}(x) = \frac{1}{1 + e^{-Px}}
where x is the output data of decoder unit 5 and P is a coefficient of x that can be set by the user. When P = 1, PSigmoid equals Sigmoid. PSigmoid is used directly, with the P value set to 1 in the training stage, the testing stage and the fully automatic portrait matting stage; the Sigmoid function in the deep-learning-based portrait matting algorithm is replaced by the PSigmoid function, and the user can improve the output of video matting by adjusting the P value.
As shown in fig. 6, the PSigmoid function is plotted for P = 0.5, P = 1 and P = 5.0. When P = 1, PSigmoid equals Sigmoid; when P is greater than 1, the slope of the PSigmoid function increases and the semi-transparent region of the output portrait transparency shrinks, so the edges of the output portrait foreground become smoother; when P is smaller than 1, the slope decreases and the semi-transparent region grows, so the details of the output portrait foreground become richer. In particular, as P approaches infinity, portrait matting degenerates into portrait segmentation. The P value is set to 1 during training and testing; during interactive repair the user can influence the output of the algorithm by setting the P value.
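In code, the PSigmoid adjustment amounts to a one-line rescaling of the stored decoder output; the following sketch uses illustrative variable names.

```python
import torch

def psigmoid(x: torch.Tensor, p: float = 1.0) -> torch.Tensor:
    """PSigmoid(x) = 1 / (1 + exp(-P*x)); with P = 1 this is the ordinary Sigmoid."""
    return torch.sigmoid(p * x)

# e.g. recompute the video-frame transparency from cached decoder-5 output with a user-chosen P
# alpha2 = psigmoid(decoder5_logits, p=0.07)
```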
In a specific embodiment of the present invention, the image matting method proposed by the present invention is implemented using a PyTorch deep learning framework and Python programming language. The PyTorch deep learning framework and Python programming language used in this example are not limitations of the present invention. The method provided by the invention is realized by using any other deep learning framework and any programming language, and the method belongs to the coverage of the invention. Where the deep learning network architecture of the present invention is fixed, the specific parameter settings are allowed to vary. This embodiment gives only a special case of one of the algorithm parameter settings. Any specific parameter setting under the network architecture described in the present method is within the scope of the present invention.
To train the deep learning network of the invention, the public DAPM natural portrait data set is used. The data set is divided into a training set of 1700 portrait images and a test set of 300 portrait images. Data enhancement is applied to the training set. First, the image size is randomly re-scaled, with the scaling ratio uniformly distributed in 0.8-1.2. Then the image is rotated with a probability of 50%, with the rotation angle uniformly distributed in -30 to 30 degrees. Finally, a brightness transformation with parameter 0.1, a contrast transformation with parameter 0.1, a saturation transformation with parameter 0.1 and a hue transformation with parameter 0.05 are applied. Twenty pictures are generated in this way for each picture in the training set, so the 1700 pictures yield 34000 training samples. During training, the input image is additionally flipped horizontally with a probability of 50%.
The above-described loss function is implemented using Python programming language, where γ is 1, t is 1, and w is 4. The Python programming language used and the specific settings of the parameters γ, t and w are not limitations of the present invention. It is within the scope of the present invention to implement the loss function proposed by the present invention using any other programming language and parameters γ, t, and w.
Adaptive Moment Estimation (Adam) in the PyTorch deep learning framework is used as the optimizer, with the "weight_decay" parameter set to 0.0005. The algorithm is trained for 200 rounds with an initial learning rate of 0.01, and the learning rate is multiplied by 0.1 every 50 rounds.
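These settings correspond to the following training-schedule sketch, where `PortraitMattingNet` and `matting_loss` refer to the earlier sketches and `train_loader` stands for an assumed data loader over the augmented training set.

```python
import torch
from torch.utils.data import DataLoader

model = PortraitMattingNet()                      # the network sketched earlier
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
train_loader: DataLoader = ...                    # assumed loader yielding (image, alpha_gt) pairs

for epoch in range(200):                          # 200 training rounds
    for image, alpha_gt in train_loader:
        optimizer.zero_grad()
        alpha = torch.sigmoid(model(image))       # P = 1 during training
        loss = matting_loss(alpha, alpha_gt, image, gamma=1.0, t=1.0, w=4.0)
        loss.backward()
        optimizer.step()
    scheduler.step()                              # learning rate x0.1 every 50 rounds
```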
In this embodiment, the PyTorch deep learning framework, the Python programming language and the OpenCV function library are used to implement the portrait matting method. Portrait data is input. When the input portrait data is a picture, the picture is converted to RGB format and its longest edge is scaled to 256 pixels. It is then converted into a 3-channel tensor and fed into the deep-learning-based network structure, which operates on the input tensor, and the deep-learning-based portrait matting algorithm outputs a single-channel tensor. This tensor is normalized to 0-1 by the Sigmoid function to obtain the portrait transparency, which is scaled back to the size of the original input data. The single-channel tensor representing the portrait transparency and the 3-channel tensor of the original input data are combined into a 4-channel tensor, with the transparency in the 4th channel. The 4-channel tensor is saved as a PNG picture to obtain the portrait foreground picture. The single-channel tensor output by the portrait matting algorithm is also replicated into a 3-channel tensor and saved as a JPEG picture to obtain the grayscale map of the portrait transparency. When the input portrait data is a video, the video is processed frame by frame as pictures in the same way as above; the portrait foreground sequence obtained frame by frame is assembled in temporal order and saved as a video to obtain the video matting result.
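A condensed sketch of this picture pipeline with OpenCV and PyTorch is shown below; the file paths, function name and model variable are illustrative, and a video would be handled by applying the same function frame by frame.

```python
import cv2
import numpy as np
import torch

def matte_picture(path_in: str, path_out_png: str, path_out_gray: str, model: torch.nn.Module):
    bgr = cv2.imread(path_in)
    h, w = bgr.shape[:2]
    s = 256.0 / max(h, w)                                      # longest edge scaled to 256 pixels
    small = cv2.resize(bgr, (int(round(w * s)), int(round(h * s))))
    rgb = cv2.cvtColor(small, cv2.COLOR_BGR2RGB)
    x = torch.from_numpy(rgb).float().div(255).permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        alpha = torch.sigmoid(model(x))[0, 0].numpy()          # transparency in 0-1
    alpha = cv2.resize(alpha, (w, h))                          # scale back to the original size
    bgra = cv2.cvtColor(bgr, cv2.COLOR_BGR2BGRA)
    bgra[:, :, 3] = (alpha * 255).astype(np.uint8)             # transparency as the 4th channel
    cv2.imwrite(path_out_png, bgra)                            # PNG portrait foreground picture
    cv2.imwrite(path_out_gray, (alpha * 255).astype(np.uint8)) # grayscale map of the transparency
```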
When the input portrait data is a picture, the portrait foreground picture and the grayscale map of the portrait transparency are obtained by the portrait matting algorithm. The user can repair the grayscale map of the portrait transparency with "paint" and "erase" operations: an area painted black is modified to foreground, and an area painted white is modified to background. After the grayscale map of the portrait transparency has been modified, it is converted into a single-channel tensor. The PNG portrait foreground picture is converted into a 4-channel tensor, and its 4th channel is replaced with the modified single-channel transparency tensor. The updated 4-channel tensor is saved back as the PNG portrait foreground picture, and the modified grayscale map is saved as the new grayscale map of the portrait transparency. These operations may be repeated until the user completes the interactive repair.
When the input data is a portrait video, a video of the portrait foreground is obtained by the portrait matting algorithm. The Sigmoid function in the deep-learning-based portrait matting algorithm is replaced by the PSigmoid function defined by the invention. The P value in the PSigmoid function is adjusted, and the portrait transparency is recomputed by the deep-learning-based portrait matting algorithm; the repaired portrait foreground video is then obtained from the portrait transparency and the original input data. These operations may be repeated until the user completes the interactive repair.
This example completes the test on the DAPM natural portrait dataset. The data set is a commonly used public test set in the field of image matting and is used for evaluating and comparing the processing effects of different image matting algorithms.
The method adopts gradient error, connectivity error, mean absolute value error and mean square error to evaluate the effect of portrait matting;
the average absolute value error is:
\mathrm{MAD} = \frac{1}{k} \sum_{z} \left| A_z - A_z^{gt} \right|
the mean square error is:
\mathrm{MSE} = \frac{1}{k} \sum_{z} \left( A_z - A_z^{gt} \right)^2
the gradient error is:
\mathrm{GE} = \frac{1}{k} \sum_{z} \left| \nabla A_z - \nabla A_z^{gt} \right|
the connectivity error is:
\mathrm{CE} = \frac{1}{k} \sum_{z} \left| \varphi(A_z, \Omega_z) - \varphi(A_z^{gt}, \Omega_z) \right|
where z denotes a pixel of the portrait picture or portrait video frame, A denotes the portrait transparency output by the deep learning network, A_z denotes the value of that transparency at pixel z, A^{gt} denotes the label of the portrait picture or portrait video frame, A_z^{gt} denotes the value of the label at pixel z, \nabla is the gradient operator, k denotes the number of pixels of the portrait picture or video frame, \varphi(A_z, \Omega_z) denotes the connectivity of the output transparency A_z at pixel z computed over the neighborhood \Omega_z of pixel z, and \varphi(A_z^{gt}, \Omega_z) denotes the connectivity of the label of the portrait picture or video frame at pixel z computed over the same neighborhood.
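A NumPy sketch of the first three metrics on a pair of transparency maps in 0-1 follows; the connectivity error requires the neighborhood-based connectivity measure, which is not detailed in the text and is therefore omitted here, and the choice of np.gradient as the gradient operator is an assumption.

```python
import numpy as np

def mad(alpha: np.ndarray, alpha_gt: np.ndarray) -> float:
    # mean absolute error between predicted and ground-truth transparency
    return float(np.abs(alpha - alpha_gt).mean())

def mse(alpha: np.ndarray, alpha_gt: np.ndarray) -> float:
    # mean square error
    return float(((alpha - alpha_gt) ** 2).mean())

def gradient_error(alpha: np.ndarray, alpha_gt: np.ndarray) -> float:
    # mean absolute difference of the image gradients (assumed gradient operator)
    gx, gy = np.gradient(alpha)
    tx, ty = np.gradient(alpha_gt)
    return float((np.abs(gx - tx) + np.abs(gy - ty)).mean())
```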
The portrait matting method has the function of automatically extracting the portrait foreground. As shown in fig. 7(a)-7(d), from left to right are the original image from the data set, the first portrait foreground obtained by the method of the invention, and the image composited onto a new background. The figures show that the invention extracts the portrait foreground well and handles portrait boundaries, especially hair, well. The portrait matting method of the invention also provides the function of interactively repairing the matting result.
As shown in fig. 8(a)-8(c), from left to right are the original picture, the first portrait foreground and the second portrait foreground obtained by the method of the invention; from fig. 8(a) to fig. 8(c) the P value is adjusted to 0.1, 0.07 and 0.03 respectively. It can be seen that the output can be further improved by adjusting the P value: the portrait foreground automatically extracted by the deep-learning-based matting algorithm has errors in the hair region, and changing the P value weakens the influence of these errors on the matting result. For repairing video matting results, the P-value method is simple to operate and easy to implement.
As shown in Table 1, the method of the invention was tested on the DAPM natural portrait data set, giving a mean absolute error MAD of 22.071 × 10⁻³, a mean square error MSE of 11.806 × 10⁻³, a gradient error GE of 2.043 × 10⁻³ and a connectivity error CE of 18.937 × 10⁻³.
TABLE 1 evaluation results of the portrait matting algorithm
MAD: 22.071 × 10⁻³    MSE: 11.806 × 10⁻³    GE: 2.043 × 10⁻³    CE: 18.937 × 10⁻³
As shown in Table 2, the evaluation result of the invention improves on the best result of other prior-art portrait matting algorithms by about 18%, with the gradient error GE reduced from 2.48 × 10⁻³ to 2.043 × 10⁻³.
TABLE 2 comparison of portrait matting algorithms
On a "DELL Inspiron 15-7572" notebook computer, the method of the invention takes approximately 17 milliseconds to process one picture. The notebook is equipped with an "Intel UHD Graphics 620" integrated graphics card with 128 MB of display memory and 4044 MB of shared memory.
The model size of the PyTorch implementation of the method is 1.52 MB. The method was also tested on an "OPPO Find X" smartphone equipped with a "Snapdragon 845" CPU and an "Adreno 630" GPU. The test results show that the method takes about 60 milliseconds to process one picture on the phone, i.e. it achieves real-time processing speed on the mobile phone.
An embodiment of the present application further provides a control apparatus, including a processor and a storage medium for storing a computer program; wherein a processor is adapted to perform at least the method as described above when executing the computer program.
Embodiments of the present application also provide a storage medium for storing a computer program, which when executed performs at least the method described above.
Embodiments of the present application further provide a processor, where the processor executes a computer program to perform at least the method described above.
The storage medium may be implemented by any type of volatile or non-volatile storage device, or a combination thereof. The non-volatile memory may be a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM) and Direct Rambus Random Access Memory (DRRAM). The storage media described in connection with the embodiments of the invention are intended to comprise, without being limited to, these and any other suitable types of memory.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention is not to be considered limited to these descriptions. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the concept of the invention, and all of them are considered to fall within the scope of protection of the invention.

Claims (10)

1. A method for portrait matting is characterized by comprising the following steps:
s1: acquiring portrait data, wherein the portrait data comprises a portrait picture or a portrait video;
s2: calculating to obtain the transparency of the portrait in the portrait data as a first transparency based on the trained deep learning network;
s3: obtaining a first portrait foreground by using the first transparency and the portrait data;
s4: adjusting the transparency of the portrait to a second transparency;
s5: and obtaining a second portrait foreground by utilizing the second transparency, and finishing portrait matting.
2. The method of portrait matting according to claim 1, characterized in that the deep-learned network structure comprises an encoder unit and a decoder unit;
each of the encoder units comprises two branches: a coding branch and a spatial attention branch;
the coding branch is used for coding and inputting the result to the next encoder unit;
the spatial attention branch is used for generating a spatial attention distribution map, and the spatial attention distribution map is fused into a corresponding decoder unit in a point-to-point corresponding addition mode;
each decoder unit has only one branch: the input features are first upsampled, the spatial attention map output by the corresponding encoder unit is then fused in by adding corresponding elements, decoding is performed, and the result is finally output to the next decoder unit;
and normalizing the output data of the decoder unit to 0-1 through a Sigmoid function, namely obtaining the first transparency of the portrait.
3. The portrait matting method according to claim 2, characterized in that the coding branch of the encoder unit comprises, in order: a two-dimensional convolution layer, a batch normalization layer, a rectified linear activation layer and a max pooling layer;
the spatial attention branch comprises, in order: a two-dimensional convolution layer, a batch normalization layer and a rectified linear activation layer;
the decoder unit comprises: a 2x upsampling layer, a two-dimensional convolution layer, a batch normalization layer and a rectified linear activation layer.
4. The portrait matting method according to claim 3, wherein the number of said encoder units is 5 and the number of said decoder units is 5.
5. The portrait matting method according to claim 4, characterized in that the deep learning network is trained with a data set, the data set comprises portrait pictures or portrait videos and corresponding labels, and the loss function during training is:
L(A, A^{gt}) = \gamma L_{mse}(A, A^{gt}) + t L_{rgb}(A, A^{gt}) + w L_{grad}(A, A^{gt})
where L_{mse}(A, A^{gt}) is the mean square error loss, L_{rgb}(A, A^{gt}) is the synthesis loss, L_{grad}(A, A^{gt}) is the gradient loss, and \gamma, t and w are the weight coefficients of the mean square error loss, the synthesis loss and the gradient loss, respectively;
the loss of mean square error is:
L_{mse}(A, A^{gt}) = \frac{1}{k} \sum_{z} \left( A_z - A_z^{gt} \right)^2
the synthesis loss is:
L_{rgb}(A, A^{gt}) = \frac{1}{k} \sum_{z} \left( A_z I_z - A_z^{gt} I_z \right)^2
the gradient loss is:
L_{grad}(A, A^{gt}) = \frac{1}{k} \sum_{z} \left| \nabla A_z - \nabla A_z^{gt} \right|
where z denotes a pixel of the portrait picture or portrait video frame, A denotes the portrait transparency output by the deep learning network, A_z denotes the value of that transparency at pixel z, A^{gt} denotes the label of the portrait picture or portrait video frame, A_z^{gt} denotes the value of the label at pixel z, I_z denotes the value of the input portrait picture or video frame at pixel z, \nabla is the gradient operator, and k denotes the number of pixels of the picture or video frame.
6. The portrait matting method according to claim 5, characterized in that adaptive moment estimation (Adam) is used as the optimizer.
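As a hedged illustration of claims 5 and 6, the combined loss could be written in PyTorch as below; because the exact loss formulas are given only in image form in the published claims, the particular norms, the finite-difference gradient operator and the default weights are assumptions of this sketch.

import torch
import torch.nn.functional as F

def matting_loss(alpha, alpha_gt, image, gamma=1.0, t=1.0, w=1.0):
    """alpha, alpha_gt: Nx1xHxW predicted and ground-truth transparency in [0, 1]; image: Nx3xHxW input pictures."""
    l_mse = F.mse_loss(alpha, alpha_gt)                       # mean square error term
    l_rgb = F.l1_loss(alpha * image, alpha_gt * image)        # synthesis term on composited foregrounds
    # gradient term, with simple finite differences standing in for the gradient operator
    dxp = alpha[..., :, 1:] - alpha[..., :, :-1]
    dyp = alpha[..., 1:, :] - alpha[..., :-1, :]
    dxg = alpha_gt[..., :, 1:] - alpha_gt[..., :, :-1]
    dyg = alpha_gt[..., 1:, :] - alpha_gt[..., :-1, :]
    l_grad = F.l1_loss(dxp, dxg) + F.l1_loss(dyp, dyg)
    return gamma * l_mse + t * l_rgb + w * l_grad

# adaptive moment estimation (Adam) as the optimizer, per claim 6:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)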
7. The portrait matting method according to claim 6, wherein adjusting the transparency of the portrait to a second transparency comprises:
displaying the portrait picture as a first gray-scale image according to the first transparency of the portrait in the portrait picture, adjusting a black area and a white area of the first gray-scale image to obtain an adjusted second gray-scale image, and normalizing the second gray-scale image to 0-1 to obtain the second transparency of the portrait.
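One concrete and purely illustrative way to realize the black-region and white-region adjustment of claim 7 is a levels-style remapping of the grayscale transparency; the thresholds below are assumed values, not part of the claim.

import numpy as np

def adjust_levels(alpha, black=0.1, white=0.9):
    """Force values below `black` to 0 and values above `white` to 1, and rescale the rest linearly; the result is the second transparency, again normalized to 0-1."""
    return np.clip((alpha - black) / (white - black), 0.0, 1.0)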
8. The portrait matting method according to claim 7, wherein adjusting the transparency of the portrait to a second transparency comprises:
the first transparency of the portrait in the portrait video frame is obtained by a Sigmoid function as follows:
1/(1 + e^(−x))
adjusting the first transparency of the portrait to the second transparency according to the following formula:
1/(1 + e^(−P·x))
wherein x is the output data of the fifth decoder unit, and P is a user-defined coefficient of x.
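A short NumPy sketch of this Sigmoid-based adjustment follows; the reading that the user-defined coefficient P scales the decoder output x inside the Sigmoid, and the example value of P, are assumptions for illustration rather than the claimed formula itself.

import numpy as np

def first_transparency(x):
    """Plain Sigmoid of the decoder output x."""
    return 1.0 / (1.0 + np.exp(-x))

def second_transparency(x, p=2.0):
    """Assumed parametric variant: p > 1 sharpens and p < 1 softens the foreground/background transition."""
    return 1.0 / (1.0 + np.exp(-p * x))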
9. The portrait matting method according to any one of claims 1 to 8, characterized by further comprising evaluating the portrait matting effect using a gradient error, a connectivity error, a mean absolute error and a mean square error;
the mean absolute error is:
(1/k)·Σ_z |A_z − A_z^gt|
the mean square error is:
(1/k)·Σ_z (A_z − A_z^gt)²
the gradient error is:
(1/k)·Σ_z |∇A_z − ∇A_z^gt|
the connectivity error is:
(1/k)·Σ_z |φ(A_z, Ω_z) − φ(A_z^gt, Ω_z)|
wherein z denotes a pixel point of the portrait picture or portrait video frame, A denotes the portrait transparency output by the deep learning network, A_z denotes the value of the output transparency at pixel point z, A^gt denotes the label of the portrait picture or portrait video frame, A_z^gt denotes the value of the label at pixel point z, ∇ is the gradient operator, k denotes the number of pixels of the portrait picture or portrait video frame, φ(A_z, Ω_z) denotes the connectivity of the output transparency A_z at pixel point z computed over the neighborhood Ω_z of pixel point z, and φ(A_z^gt, Ω_z) denotes the connectivity of the label value A_z^gt at pixel point z computed over the same neighborhood Ω_z.
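For illustration, three of the four evaluation metrics of claim 9 can be computed with NumPy as below; np.gradient stands in for the gradient operator, and the connectivity error is omitted because the connectivity function over the neighborhood Ω_z admits several definitions, so this sketch should not be read as the claimed metrics themselves.

import numpy as np

def matting_errors(alpha, alpha_gt):
    """alpha, alpha_gt: HxW transparency maps in [0, 1]; returns mean absolute error, mean square error and a finite-difference gradient error."""
    k = alpha.size
    mae = np.abs(alpha - alpha_gt).sum() / k
    mse = ((alpha - alpha_gt) ** 2).sum() / k
    gy_p, gx_p = np.gradient(alpha)
    gy_g, gx_g = np.gradient(alpha_gt)
    grad = (np.abs(gx_p - gx_g) + np.abs(gy_p - gy_g)).sum() / k
    return mae, mse, grad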
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN202010621083.1A 2020-06-30 2020-06-30 Portrait matting method and computer readable storage medium Active CN111815649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010621083.1A CN111815649B (en) 2020-06-30 2020-06-30 Portrait matting method and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111815649A true CN111815649A (en) 2020-10-23
CN111815649B CN111815649B (en) 2023-12-01

Family

ID=72856626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010621083.1A Active CN111815649B (en) 2020-06-30 2020-06-30 Portrait matting method and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111815649B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108961279A (en) * 2018-06-28 2018-12-07 Oppo(重庆)智能科技有限公司 Image processing method, device and mobile terminal
CN110400323A (en) * 2019-07-30 2019-11-01 上海艾麒信息科技有限公司 Automatic image matting system, method and device
CN110570429A (en) * 2019-08-30 2019-12-13 华南理工大学 Lightweight real-time semantic segmentation method based on three-dimensional point cloud

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Rishab Sharma et al.: "AlphaNet: An Attention Guided Deep Network for Automatic Image Matting", arXiv:2003.03613v1 [cs.CV], pages 3-7 *
Shengchuan Zhang et al.: "Robust Face Sketch Synthesis via Generative Adversarial Fusion of Priors and Parametric Sigmoid", Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), pages 1163-1169 *
郝华颖 et al.: "A Corneal Nerve Segmentation Algorithm Based on Improved ResU-Net" (一种基于改进ResU-Net的角膜神经分割算法), Computer Engineering (计算机工程), pages 2-7 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112396598A (en) * 2020-12-03 2021-02-23 中山大学 Image matting method and system based on single-stage multi-task collaborative learning
CN112396598B (en) * 2020-12-03 2023-08-15 中山大学 Portrait matting method and system based on single-stage multitask collaborative learning
CN112929743A (en) * 2021-01-22 2021-06-08 广州光锥元信息科技有限公司 Method and device for adding video special effect to specified object in video and mobile terminal
CN113838084A (en) * 2021-09-26 2021-12-24 上海大学 Matting method based on codec network and guide map
WO2023098649A1 (en) * 2021-11-30 2023-06-08 北京字节跳动网络技术有限公司 Video generation method and apparatus, and device and storage medium
CN114187317A (en) * 2021-12-10 2022-03-15 北京百度网讯科技有限公司 Image matting method and device, electronic equipment and storage medium
CN114187317B (en) * 2021-12-10 2023-01-31 北京百度网讯科技有限公司 Image matting method and device, electronic equipment and storage medium
WO2023159746A1 (en) * 2022-02-23 2023-08-31 平安科技(深圳)有限公司 Image matting method and apparatus based on image segmentation, computer device, and medium
CN114786040A (en) * 2022-06-15 2022-07-22 阿里巴巴(中国)有限公司 Data communication method, system, electronic device and storage medium
CN114786040B (en) * 2022-06-15 2022-09-23 阿里巴巴(中国)有限公司 Data communication method, system, electronic device and storage medium
WO2023241459A1 (en) * 2022-06-15 2023-12-21 阿里巴巴(中国)有限公司 Data communication method and system, and electronic device and storage medium

Also Published As

Publication number Publication date
CN111815649B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
CN111815649A (en) Image matting method and computer readable storage medium
Bashir et al. A comprehensive review of deep learning-based single image super-resolution
Kim et al. Global and local enhancement networks for paired and unpaired image enhancement
CN111369582B (en) Image segmentation method, background replacement method, device, equipment and storage medium
Panetta et al. Tmo-net: A parameter-free tone mapping operator using generative adversarial network, and performance benchmarking on large scale hdr dataset
US11727628B2 (en) Neural opacity point cloud
CN111861886B (en) Image super-resolution reconstruction method based on multi-scale feedback network
CN112308866A (en) Image processing method, image processing device, electronic equipment and storage medium
WO2022156621A1 (en) Artificial intelligence-based image coloring method and apparatus, electronic device, computer readable storage medium, and computer program product
US20220392025A1 (en) Restoring degraded digital images through a deep learning framework
US20230368339A1 (en) Object class inpainting in digital images utilizing class-specific inpainting neural networks
CN116416342A (en) Image processing method, apparatus, computer device, and computer-readable storage medium
López-Tapia et al. A single video super-resolution GAN for multiple downsampling operators based on pseudo-inverse image formation models
CN114612289A (en) Stylized image generation method and device and image processing equipment
CN116645592A (en) Crack detection method based on image processing and storage medium
CN116071300A (en) Cell nucleus segmentation method based on context feature fusion and related equipment
CN112132232A (en) Medical image classification labeling method and system and server
CN116188649A (en) Three-dimensional face model driving method based on voice and related device
US11887277B2 (en) Removing compression artifacts from digital images and videos utilizing generative machine-learning models
CN114494786A (en) Fine-grained image classification method based on multilayer coordination convolutional neural network
CN112967373B (en) Facial image feature coding method based on nonlinear 3DMM
CN116912148B (en) Image enhancement method, device, computer equipment and computer readable storage medium
CN108401104A (en) Bifocal camera digital zooming method based on frequency band reparation and super-resolution
CN110348339B (en) Method for extracting handwritten document text lines based on case segmentation
CN115187768A (en) Fisheye image target detection method based on improved YOLOv5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant