CN113079391A - Portrait image mixing processing method, equipment and computer readable storage medium

Info

Publication number: CN113079391A
Application number: CN202011644727.5A
Authority: CN
Inventors: 马勇; 肖汉雄
Original assignee: Wuxi Le Chi Technology Co ltd
Current assignee: Wuxi Le Chi Technology Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-07-06

Abstract

The invention discloses a portrait image hybrid processing method, a device and a computer-readable storage medium, and belongs to the field of image processing. The method performs semantic parsing and portrait matting jointly by means of a neural network comprising two parallel decoders and one shared encoder; the model is trained with a multi-task learning mechanism, and different decoding modes are used for different tasks at test time. The invention aims to solve the problem of repeated encoding and decoding in prior-art image matting. Because features are shared, the parameters updated by back-propagation for the two tasks reinforce one another, which helps to improve matting performance, reduce model size, reduce memory usage and speed up inference at test time.

Description

Portrait image mixing processing method, equipment and computer readable storage medium

Technical Field

The present invention relates to the field of image processing, and in particular, to a method and apparatus for blending and processing a portrait image, and a computer-readable storage medium.

Background

Portrait matting and semantic parsing are two important tasks in the field of image processing. Conventionally they are treated as two independent tasks, and when a model is built with a neural network each task needs its own encoder and decoder. In practice, portrait matting first produces from the original image an image containing a foreground, a background and a transition region, and semantic parsing is then obtained by passing that result through a further encoder and decoder for semantic decoding. During training, these two consecutive encoding-decoding passes place a heavy demand on computing resources, and processing efficiency is relatively low.

Disclosure of Invention

In order to solve the problem of repeated encoding and decoding in conventional portrait matting, the invention provides a method that reduces the high resource consumption and low efficiency of the algorithm by designing a network structure comprising one encoder and two parallel decoders connected to it.

In view of the above, the present invention provides a portrait image mixing processing method that uses two parallel decoders and a shared encoder. The two decoders are a portrait decoder and a semantic decoder: the portrait decoder decodes portrait information and the semantic decoder decodes semantic information, while the encoder encodes the input image and extracts features for both decoders.

Preferably, the processing method specifically includes the steps of:

S1: constructing and preprocessing a portrait image dataset:

after the training portrait images are acquired, all of them are cropped to a uniform size;

S2: constructing a neural network training model:

the neural network model to be trained comprises six parts, namely an input layer, an encoding layer, a first decoding layer, a second decoding layer, a first output layer and a second output layer; the encoding layer compresses the feature data from the input layer through a fully convolutional architecture and then passes the features to the first decoding layer and the second decoding layer through two output heads, and each of the two decoding layers decodes the encoded image through multiple layers of transposed convolution;

S3: optimizing the model:

the losses of the two output layers are calculated by forward propagation from the images recovered after compression, the weight matrix of the encoder is then updated by back-propagating each loss, and the optimization is repeated;

specifically, the network training model in S2 is a parallel training model using one encoder and two parallel decoders.

preferably, the alpha matte of the first decoding layer is defined by treating the input-layer image as a weighted combination of its foreground and background, namely:

I_p = α_p F_p + (1 - α_p) B_p

preferably, the loss function in the neural network is defined as:

where σ₁ and σ₂ are the respective weights of the two losses;

L_s is the loss function of the semantic decoder,

where C is the number of semantic classes, y_nc is the ground-truth value for class c, and p_nc is the corresponding predicted value;

L_α is the loss function of the portrait decoder; after semantic decoding is completed, L_α is used to estimate the target value directly:

L_α = ||α_g - α_p||₁

where α_g and α_p denote the ground-truth and predicted alpha matte, respectively;

multi-task learning is a complete end-to-end process, so the loss function L is used to train the semantic decoder and the portrait decoder jointly, which allows the weights of the different tasks to be adjusted automatically.
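
The loss terms described above can be sketched as follows. Only the L1 form of L_α is given explicitly in this text; the formulas for L_s and for the combined loss L are not reproduced here. The sketch therefore assumes that L_s is a per-pixel cross-entropy over the C classes (consistent with the symbols y_nc and p_nc, but an assumption) and that L is a plain weighted sum σ₁·L_s + σ₂·L_α (also an assumption; the actual combination may differ). All function names are hypothetical.

```python
import math
from typing import Sequence


def alpha_l1_loss(alpha_true: Sequence[float],
                  alpha_pred: Sequence[float]) -> float:
    """L_alpha = ||alpha_g - alpha_p||_1, summed over pixels."""
    return sum(abs(g - p) for g, p in zip(alpha_true, alpha_pred))


def semantic_cross_entropy(y_true: Sequence[Sequence[float]],
                           p_pred: Sequence[Sequence[float]]) -> float:
    """ASSUMED form of L_s: mean over pixels n of -sum_c y_nc * log(p_nc)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for y_n, p_n in zip(y_true, p_pred):
        total -= sum(y * math.log(p + eps) for y, p in zip(y_n, p_n))
    return total / len(y_true)


def total_loss(l_s: float, l_alpha: float,
               sigma1: float, sigma2: float) -> float:
    """ASSUMED combination: a plain weighted sum of the two task losses."""
    return sigma1 * l_s + sigma2 * l_alpha


# Two pixels, C = 2 classes (one-hot ground truth).
y_true = [[1.0, 0.0], [0.0, 1.0]]
p_pred = [[0.5, 0.5], [0.25, 0.75]]
l_s = semantic_cross_entropy(y_true, p_pred)
l_alpha = alpha_l1_loss([1.0, 0.0], [0.75, 0.25])
print(l_s, l_alpha, total_loss(l_s, l_alpha, sigma1=1.0, sigma2=1.0))
```

With fixed σ₁ and σ₂ the balance between the tasks is set by hand; a scheme that adjusts the weights automatically, as the text describes, would treat them as trainable quantities instead.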

An electronic device comprising a processor and a memory, the memory having machine executable instructions capable of being executed by the processor, the processor being capable of executing the machine executable instructions to implement the above method.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the above-mentioned method.

Compared with the prior art, the invention has the following beneficial effects: a simple parallel network structure is provided in which portrait matting and semantic parsing are trained through one encoder connected to two parallel decoders. During training, features are shared and the parameters updated by back-propagation for the two tasks reinforce one another, which improves matting performance, reduces model size, reduces memory usage and speeds up inference at test time.

Drawings

FIG. 1 is a schematic diagram of a network structure according to the present invention

FIG. 2 is a diagram of a comparative network architecture

Detailed Description

In order to make the technical means, inventive features, objectives and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.

As shown in FIGS. 1 and 2, FIG. 1 is a schematic diagram of the network structure of the portrait image hybrid processing method provided by the present invention. The structure comprises two parallel decoders and a shared encoder. The two decoders are a portrait decoder and a semantic decoder: the portrait decoder decodes portrait information and the semantic decoder decodes semantic information, while the encoder encodes the input image and extracts features for both decoders.
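
The data flow of FIG. 1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `SharedEncoderNet`, `encode`, `decode_portrait` and `decode_semantic` are hypothetical, and the arithmetic inside each method is a trivial stand-in for real convolutional layers. The point is only that the encoder runs once per image and both decoders read the same features.

```python
from typing import Dict, List

Features = List[float]


class SharedEncoderNet:
    """Toy stand-in for one shared encoder feeding two parallel decoders."""

    def __init__(self) -> None:
        self.encoder_calls = 0  # counts how often the encoder runs

    def encode(self, image: List[float]) -> Features:
        # Placeholder "compression": average neighbouring pairs of values.
        self.encoder_calls += 1
        return [(image[i] + image[i + 1]) / 2
                for i in range(0, len(image) - 1, 2)]

    def decode_portrait(self, feats: Features) -> List[float]:
        # Placeholder portrait head: upsample and clamp to [0, 1].
        return [min(1.0, max(0.0, f)) for f in feats for _ in range(2)]

    def decode_semantic(self, feats: Features) -> List[int]:
        # Placeholder semantic head: upsample and threshold into two labels.
        return [1 if f >= 0.5 else 0 for f in feats for _ in range(2)]

    def forward(self, image: List[float]) -> Dict[str, list]:
        feats = self.encode(image)  # single encoding pass
        return {
            "alpha": self.decode_portrait(feats),
            "semantic": self.decode_semantic(feats),
        }


net = SharedEncoderNet()
out = net.forward([0.0, 0.2, 0.8, 1.0])
print(out["alpha"], out["semantic"], net.encoder_calls)
```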

First, after the training set (input images and output images) is acquired, the dataset is constructed and preprocessed, and all training portrait images are cropped to a uniform size.
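
A minimal sketch of the cropping step follows. The text does not specify the crop strategy or the target size, so a center crop to a caller-chosen square size is assumed here; `center_crop` is a hypothetical helper, and images are represented as nested lists with a single channel for brevity.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values; one channel for brevity


def center_crop(image: Image, size: int) -> Image:
    """Return the central size x size window of a 2-D image."""
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("crop size exceeds image dimensions")
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


# Two images of different shapes end up with the same 2 x 2 size.
wide = [[c + 10 * r for c in range(6)] for r in range(4)]   # 4 x 6
tall = [[c + 10 * r for c in range(3)] for r in range(5)]   # 5 x 3
dataset = [center_crop(img, 2) for img in (wide, tall)]
print(dataset)
```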

Second, a neural network training model is constructed, using one encoder and two parallel decoders for training.

The neural network model to be trained comprises six parts, namely an input layer, an encoding layer, a first decoding layer, a second decoding layer, a first output layer and a second output layer; the encoding layer compresses the feature data from the input layer through a fully convolutional architecture and then passes the features to the first decoding layer and the second decoding layer through two output heads. Portrait matting differs from conventional semantic segmentation, although the two tasks overlap: besides classifying each pixel as foreground or background, the opacity of the foreground must also be estimated, that is, the alpha of the portrait in the image must be predicted. The alpha of the first decoding layer is therefore defined by treating the input-layer image as a weighted combination of its foreground and background, namely:

I_p = α_p F_p + (1 - α_p) B_p

where p denotes the pixel position, F_p and B_p are the foreground and background values at p, and α_p ∈ [0, 1] is the foreground opacity at that pixel.
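
As a worked illustration of the compositing equation, the sketch below applies it pixel by pixel to three sample pixels. The function name `composite` and the sample intensities are illustrative choices, not values from the patent.

```python
from typing import List, Sequence


def composite(alpha: Sequence[float],
              foreground: Sequence[float],
              background: Sequence[float]) -> List[float]:
    """Apply I_p = alpha_p * F_p + (1 - alpha_p) * B_p at every pixel p."""
    if not (len(alpha) == len(foreground) == len(background)):
        raise ValueError("alpha, foreground and background must align")
    if any(a < 0.0 or a > 1.0 for a in alpha):
        raise ValueError("alpha values must lie in [0, 1]")
    return [a * f + (1.0 - a) * b
            for a, f, b in zip(alpha, foreground, background)]


# Three pixels: pure background, a half-transparent edge, pure foreground.
alpha = [0.0, 0.5, 1.0]
foreground = [200.0, 200.0, 200.0]
background = [40.0, 40.0, 40.0]
print(composite(alpha, foreground, background))  # [40.0, 120.0, 200.0]
```

Matting is the inverse problem: given only I_p, the portrait decoder has to predict α_p, which is why a hard foreground/background label per pixel is not enough.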

The two decoding layers each decode the encoded image through multiple layers of transposed convolution.
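
To show how stacked transposed convolutions restore resolution, here is a one-dimensional sketch with no padding. The kernel values, the stride of 2 and the function name `conv_transpose_1d` are illustrative assumptions; the patent does not state the layer hyperparameters, and a real decoder would use learned two-dimensional kernels.

```python
from typing import List, Sequence


def conv_transpose_1d(x: Sequence[float],
                      kernel: Sequence[float],
                      stride: int = 2) -> List[float]:
    """1-D transposed convolution without padding.

    Each input value scatters a scaled copy of the kernel into the output,
    so the output length is (len(x) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


features = [1.0, 2.0]                                # compressed features
layer1 = conv_transpose_1d(features, [1.0, 1.0])     # length 2 -> 4
layer2 = conv_transpose_1d(layer1, [1.0, 1.0])       # length 4 -> 8
print(layer1, layer2)
```

With an all-ones kernel of length 2 and stride 2, each layer simply repeats every value twice; learned kernels let the decoder interpolate instead of copying.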

Third, the model is optimized: the losses of the two output layers are calculated by forward propagation from the images recovered after compression, the weight matrix of the encoder is updated by back-propagating each loss, and the optimization is repeated,

wherein the loss function is defined as:

where σ₁ and σ₂ are the respective weights of the two losses;

L_s is the loss function of the semantic decoder;

L_α is the loss function of the portrait decoder; after semantic decoding is completed, L_α is used to estimate the target value directly:

L_α = ||α_g - α_p||₁
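
The way both output losses update the shared encoder can be illustrated with a toy scalar model. This is a sketch under simplifying assumptions: one shared weight `w` stands in for the encoder, two fixed head weights `u` and `v` stand in for the decoders, squared error replaces the patent's losses, and both losses are weighted equally. All names and numbers are hypothetical.

```python
def losses(w: float, u: float, v: float, x: float,
           target_a: float, target_b: float):
    """Shared weight w feeds two heads (u and v), each with its own loss."""
    z = w * x                      # shared "encoder" feature
    loss_a = (u * z - target_a) ** 2
    loss_b = (v * z - target_b) ** 2
    return loss_a, loss_b


def grad_shared(w, u, v, x, target_a, target_b):
    """Analytic d(loss_a)/dw and d(loss_b)/dw via the chain rule."""
    z = w * x
    grad_a = 2.0 * (u * z - target_a) * u * x
    grad_b = 2.0 * (v * z - target_b) * v * x
    return grad_a, grad_b


w, u, v, x = 0.5, 2.0, -1.0, 3.0
ta, tb = 1.0, 0.0
grad_a, grad_b = grad_shared(w, u, v, x, ta, tb)

# One gradient-descent step on the shared weight using both task gradients.
learning_rate = 0.01
w_new = w - learning_rate * (grad_a + grad_b)
before = sum(losses(w, u, v, x, ta, tb))
after = sum(losses(w_new, u, v, x, ta, tb))
print(grad_a, grad_b, before, after)
```

The shared weight receives one gradient term per head, so each task influences the features the other task sees.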

During training, the whole network is trained with a multi-task learning method so that it learns both tasks; during testing, one decoder is removed and only the remaining decoder is used. Compared with training on a single semantic parsing task, the multi-task learning mechanism greatly improves the final segmentation result.
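
Removing one decoder at test time can be sketched as below. The functions `encoder`, `build_training_heads`, `prune_for_inference` and `infer` are hypothetical, their bodies are placeholders for real layers, and keeping the portrait decoder is an arbitrary choice for the example; the text does not say which decoder is retained.

```python
from typing import Callable, Dict, List

Features = List[float]


def encoder(image: List[float]) -> Features:
    # Placeholder for the shared encoder.
    return [2.0 * v for v in image]


def build_training_heads() -> Dict[str, Callable[[Features], list]]:
    # Both decoders exist during multi-task training.
    return {
        "portrait": lambda f: [min(1.0, v) for v in f],
        "semantic": lambda f: [int(v >= 1.0) for v in f],
    }


def prune_for_inference(heads: Dict[str, Callable],
                        keep: str) -> Dict[str, Callable]:
    """Return a copy of the head table that retains only one decoder."""
    if keep not in heads:
        raise KeyError(f"unknown decoder: {keep}")
    return {keep: heads[keep]}


def infer(image: List[float], heads: Dict[str, Callable]) -> Dict[str, list]:
    feats = encoder(image)  # one encoding pass
    # One decoding pass per decoder that is still present.
    return {name: head(feats) for name, head in heads.items()}


train_heads = build_training_heads()
test_heads = prune_for_inference(train_heads, keep="portrait")
result = infer([0.1, 0.4, 0.9], test_heads)
print(result)
```

After pruning, inference costs one encoding pass and one decoding pass, and the discarded decoder's parameters no longer need to be stored or loaded.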

Compare this with the processing method of the comparative network structure in FIG. 2.

In FIG. 2, the original RGB image is input into a T network comprising an encoder and a decoder, which outputs an image T_g containing foreground, background and transition regions; T_g is then passed on for further processing. Compared with the method provided by the invention, this approach needs to execute two encoding and two decoding passes. Once the encoder and the two parallel decoders of the invention have been trained, the same effect as the conventional method can be achieved with only one encoding and one decoding pass, so the processing speed is doubled and the resources occupied are greatly reduced.

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes in the embodiments and/or modifications of the invention can be made, and equivalents and modifications of some features of the invention can be made without departing from the spirit and scope of the invention.

Claims

1. A portrait image mixing processing method, characterized by using two parallel decoders and a shared encoder, wherein the two decoders are a portrait decoder and a semantic decoder respectively, the portrait decoder is used for decoding portrait information, the semantic decoder is used for decoding semantic information, and the encoder is used for encoding an input image and extracting features for the two decoders.

2. The portrait image mixing processing method according to claim 1, wherein the processing method specifically comprises the following steps:

S1: constructing and preprocessing a portrait image dataset;

S3: optimizing the model:

the losses of the two output layers are calculated by forward propagation from the images recovered after compression, the weight matrix of the encoder is updated by back-propagating each loss, and the optimization is repeated.

3. The portrait image mixing processing method according to claim 2, wherein the alpha matte of the first decoding layer is defined by treating the input-layer image as a weighted combination of its foreground and background, namely:

I_p = α_p F_p + (1 - α_p) B_p

4. The portrait image mixing processing method according to claim 2, wherein the loss function of the neural network is defined as:

where σ₁ and σ₂ are the respective weights of the two losses.

5. An electronic device comprising a processor and a memory, the memory having machine executable instructions executable by the processor to perform the method of any one of claims 1 to 4.

6. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-4.