CN113643305B - Portrait detection and segmentation method based on deep network context promotion - Google Patents

Portrait detection and segmentation method based on deep network context promotion

Info

Publication number
CN113643305B
Authority
CN
China
Legal status
Active
Application number
CN202110913353.0A
Other languages
Chinese (zh)
Other versions
CN113643305A (en)
Inventor
许赢月
王俊宇
高自立
Current Assignee
Zhuhai Fudan Innovation Research Institute
Original Assignee
Zhuhai Fudan Innovation Research Institute
Priority date
Filing date
Publication date
Application filed by Zhuhai Fudan Innovation Research Institute
Priority to CN202110913353.0A
Publication of CN113643305A
Application granted
Publication of CN113643305B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • G06T 2207/20081 Training; Learning
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a portrait detection and segmentation method based on depth network context promotion. The method comprises: extracting L depth features of different scales from a portrait picture based on a depth network framework; performing feature fusion on the highest-scale depth feature over a plurality of pyramid scales through a pyramid pooling module to generate global prior information; promoting and fusing the context information of the depth features from high scale to low scale through fusion blocks to obtain the output features of each scale; and optimizing and training the output features of each scale respectively to complete portrait detection and segmentation. With this method, the context information of the depth network can be deeply mined from multiple scales, multiple spaces and multiple channels without additional knowledge, achieving refined portrait detection and segmentation of monocular images.

Description

Portrait detection and segmentation method based on deep network context promotion
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a portrait detection and segmentation method based on deep network context promotion.
Background
Portrait detection and segmentation is a special case of semantic segmentation and has a wide range of applications. For beautification applications, portrait detection is the basis of portrait stylization, depth-of-field blurring, matting and similar processing; for security and privacy applications, the detected background of a portrait picture can be blurred or replaced. Portrait detection on monocular images is particularly important in practice because, compared with the image pairs captured by dual cameras, monocular capture is far less restricted by lighting conditions and shooting distance.
The main challenges of deep-learning-based portrait detection are accurately locating the portrait and precisely segmenting the boundary between the portrait and the background; fine details of the portrait such as hair further increase the difficulty of edge segmentation. Existing deep-learning-based algorithms mainly rely on additional knowledge to achieve finer portrait detection and segmentation.
Some deep-learning-based algorithms locate the portrait more accurately by feeding additional knowledge to the deep network as extra input. For example, Automatic portrait segmentation for image stylization computes the portrait location and shape range as additional input channels of the depth network; other methods add a pose detector to generate human-body key-point maps as an additional input channel of the depth network. Such extra inputs help locate the portrait accurately, but they require additional computation and memory and do not help edge segmentation.
Other deep-learning-based algorithms use additional annotations (e.g., edge annotations) as extra knowledge for training the deep network. These edge annotations help the deep network refine edge details. In practice, however, edge annotation is expensive, and the edge annotations of most current datasets are derived from manual portrait labeling, so the fineness around edges is blurred. Edge annotations are therefore useful for summarizing the portrait shape but contribute little to accurate edge segmentation.
Therefore, how to perform refined portrait detection and segmentation without additional knowledge becomes a key problem of current research.
Disclosure of Invention
In view of the above problems, the present invention provides a method for detecting and segmenting a portrait based on context promotion of a depth network, which at least solves some of the above technical problems, and performs portrait detection and segmentation on a monocular image by deep mining of context information of the depth network from multiple scales, multiple spaces and multiple channels without additional knowledge.
The embodiment of the invention provides a portrait detection and segmentation method based on deep network context promotion, which comprises the following steps:
s1, extracting L depth features of different scales from a portrait picture based on a depth network framework;
s2, performing feature fusion on the highest-scale depth feature over a plurality of pyramid scales through a pyramid pooling module to generate global prior information;
s3, promoting and fusing the context information of the depth features from high scale to low scale through fusion blocks to obtain the output features of each scale;
and S4, optimizing and training the output features of each scale respectively to complete portrait detection and segmentation.
Further, the step S2 specifically includes:
s21, reducing the feature size of the depth feature through an average pooling layer to generate features with sizes of 1×1, 3×3 and 5×5 respectively;
s22, respectively carrying out dimension reduction on the features with the sizes of 1×1, 3×3 and 5×5 through a convolution layer with a convolution kernel of 1×1, so as to obtain three dimension-reduced features;
s23, upsampling the three dimension-reduced features through bilinear interpolation, and splicing the depth feature with the three upsampled features to obtain a first spliced feature;
s24, smoothing the first spliced feature through a convolution layer with a convolution kernel of 3×3 to obtain global prior information.
Further, the fusion block in the step S3 includes a channel lifting module, a space lifting module, and a scale lifting module.
Further, the step S3 specifically includes:
s31, lifting context information of depth features from a channel through a channel lifting module;
s32, lifting context information of the depth features from the space through a space lifting module;
s33, fusing context information of the depth features from multi-scale through a scale lifting module.
Further, the step S31 specifically includes:
s311, taking the depth features corresponding to the scales from l=1 to l=L-1 as initial features, and processing the initial features by adopting a convolution layer with a convolution kernel of 3×3 and a group number equal to the number of channels to obtain generated features;
s312, splicing the generated features and the initial features to obtain second spliced features;
s313, performing dimension reduction processing on the second spliced feature through a convolution layer with a convolution kernel of 1×1 and an output channel number equal to the number of input feature channels, so as to obtain a first output feature.
Further, the step S32 specifically includes:
s321, reducing the feature size of the first output feature through an average pooling layer with pooling kernel sizes of 2×2, 4×4 and 8×8 respectively, so as to generate features of 1/2, 1/4 and 1/8 of its size;
s322, smoothing the features with the sizes of 1/2, 1/4 and 1/8 respectively through a convolution layer with a convolution kernel of 3×3;
s323, upsampling the features subjected to the smoothing processing in the S322 through bilinear interpolation, and adding and fusing upsampling results;
s324, smoothing the features after the addition and fusion in S323 through a convolution layer with a convolution kernel of 3×3 to obtain a second output feature.
Further, the step S33 specifically includes:
s331, processing the second output feature through a convolution layer with a convolution kernel of 3×3, and upsampling the processed result through bilinear interpolation to obtain a third output feature;
s332, adding and fusing the second output feature and the third output feature;
s333, smoothing the features after the addition and fusion in S332 through a convolution layer with a convolution kernel of 3×3 to obtain a multi-scale feature fusion result.
Further, the step S4 specifically includes:
s41, respectively processing the output features of each scale through a convolution layer with a convolution kernel of 1×1 to generate portrait prediction maps;
s42, performing optimization training on each prediction map through a cross-entropy loss function;
s43, training a portrait detection and segmentation model on a large-scale portrait detection dataset;
s44, fine-tuning the model on a carefully selected small-scale portrait dataset with fine edge annotations to obtain a refined portrait detection model;
s45, detecting and segmenting the portrait.
Compared with the prior art, the portrait detection and segmentation method based on the deep network context promotion has the following beneficial effects:
According to the invention, without performing operations such as edge calibration or adding extra detection operators to the portrait pictures, the portrait can be accurately detected and segmented solely by deeply mining the context information of the depth network from multiple scales, multiple spaces and multiple channels, without depending on additional knowledge; this reduces the data annotation cost and better meets the requirements of industrial production and practical application.
Without using any additional knowledge, the method of the invention can outperform depth models that rely on such additional knowledge.
The invention can realize accurate detection and segmentation of portrait pictures, and the segmentation result can be used for subsequent applications such as matting, depth-of-field blurring, background replacement, sketching, stylization and cartoonization.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
fig. 1 is a frame diagram of a portrait detection and segmentation method based on context promotion of a depth network according to an embodiment of the present invention.
Fig. 2 is a pyramid pooling block diagram according to an embodiment of the present invention.
Fig. 3 is a diagram of a channel-lifting module according to an embodiment of the present invention.
Fig. 4 is a block diagram of a space lifting module according to an embodiment of the present invention.
Fig. 5 is a structural diagram of a scale lifting module according to an embodiment of the present invention.
Fig. 6 is a structural diagram of a portrait detection and segmentation method based on context promotion of a depth network according to an embodiment of the present invention.
Fig. 7 is a result diagram of labeling a portrait picture using an existing dataset.
Fig. 8 is an effect diagram of the image detection method according to the embodiment of the present invention in expanding applications.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Referring to fig. 1, an embodiment of the present invention provides a portrait detection and segmentation method based on context promotion of a depth network, which specifically includes the following steps:
s1, extracting L depth features of different scales from a portrait picture based on a depth network framework;
s2, performing feature fusion on the highest-scale depth feature over a plurality of pyramid scales through a pyramid pooling module to generate global prior information;
s3, promoting and fusing the context information of the depth features from high scale to low scale through fusion blocks to obtain the output features of each scale;
and S4, optimizing and training the output features of each scale respectively to complete portrait detection and segmentation.
The above steps are described in detail below.
In the step S1, given an input portrait picture I, L depth features of different scales are extracted under a depth network framework. The depth network framework selected in the embodiment of the invention can be any of several popular depth network structures, and the convolution form adopted by the framework is retained. The set of extracted depth features is denoted as {f_l | l = 0, 1, ..., L-1}, where f_l is the feature at the l-th scale, l = 0 represents the highest scale of the depth network, and l = L-1 represents the lowest scale of the depth network.
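For illustration only, the following is a minimal PyTorch-style sketch of step S1, assuming a VGG-16 backbone whose conv1, conv2, conv4 and conv5 outputs serve as the multi-scale depth features (as in the VGG-16 example given later in this description); the class name MultiScaleExtractor, the exact layer split points and the torchvision API usage are assumptions made for this sketch, not part of the claimed method.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16  # assumes torchvision >= 0.13 for the weights=None argument

class MultiScaleExtractor(nn.Module):
    """Step S1 sketch: collect depth features of different scales from a VGG-16 backbone.
    The conv1, conv2, conv4 and conv5 outputs are kept, matching the example given later."""
    def __init__(self):
        super().__init__()
        features = vgg16(weights=None).features
        self.stages = nn.ModuleList([
            features[0:5],    # conv1 block (largest resolution, 64 channels)
            features[5:10],   # conv2 block (128 channels)
            features[10:24],  # conv3 + conv4 blocks (only the conv4 output is kept, 512 channels)
            features[24:31],  # conv5 block: the "highest scale", most abstract feature (512 channels)
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        # Reorder from the highest scale (l = 0) to the lowest scale (l = L-1).
        return feats[::-1]

if __name__ == "__main__":
    img = torch.randn(1, 3, 300, 400)            # a single 300x400 portrait picture
    feats = MultiScaleExtractor()(img)
    print([tuple(t.shape) for t in feats])       # L = 4 depth features of different scales
```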
In the step S2, the pyramid pooling module spatially aggregates the features over a plurality of pyramid regions to summarize the feature information of the whole scene. Feature fusion is carried out on the highest-scale depth feature over multiple pyramid scales through the pyramid pooling module, and the depth features at the remaining L-1 scales are then fused by fusion blocks according to the following formula:

F_l = B_l(f_l, F_{l-1}; ξ), l = 1, 2, ..., L-1

where F_l represents the output of the fusion block B_l at the l-th scale, F_{l-1} is the fused output of the previous (higher) scale, and ξ represents the model weights.
The modules embedded in the embodiment of the invention adopt depthwise separable convolution to reduce the parameter quantity and the computational complexity. Referring specifically to fig. 2, three pyramid scales can be adopted: firstly, the feature size of the depth feature is reduced through an average pooling layer to generate features with sizes of 1×1, 3×3 and 5×5; secondly, dimension reduction is respectively carried out on the features with the sizes of 1×1, 3×3 and 5×5 through a convolution layer with a convolution kernel of 1×1 to obtain three dimension-reduced features; then the three dimension-reduced features are upsampled through bilinear interpolation, and the depth feature is spliced with the three upsampled features to obtain a first spliced feature; finally, the first spliced feature is smoothed through a convolution layer with a convolution kernel of 3×3 to obtain the global prior information. The global prior information is transmitted step by step from high scale to low scale through the fusion blocks, so as to guide the overall portrait localization and ensure precise positioning of the portrait.
The global prior information F_0 at the highest scale l = 0 is calculated as follows:

F_0 = P(f_0; W, W_P)

where P(·) represents the pyramid pooling module, W represents the depth network framework weights, and W_P represents the pyramid pooling module weights.
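The pyramid pooling module P(·) described above (steps S21 to S24) can be sketched as follows; the per-branch channel count and the use of plain rather than depthwise-separable convolutions are simplifying assumptions made for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pyramid pooling module P(.) following S21-S24: pool the highest-scale feature f_0 to
    1x1, 3x3 and 5x5, reduce dimensions with 1x1 convolutions, upsample, splice and smooth."""
    def __init__(self, in_ch, branch_ch=None):
        super().__init__()
        branch_ch = branch_ch or in_ch // 4          # per-branch channel count (illustrative)
        self.bins = (1, 3, 5)
        self.reduce = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, kernel_size=1) for _ in self.bins
        )
        self.smooth = nn.Conv2d(in_ch + len(self.bins) * branch_ch, in_ch,
                                kernel_size=3, padding=1)

    def forward(self, f0):
        h, w = f0.shape[-2:]
        branches = [f0]
        for bin_size, reduce in zip(self.bins, self.reduce):
            p = F.adaptive_avg_pool2d(f0, bin_size)                    # S21: 1x1 / 3x3 / 5x5 pooling
            p = reduce(p)                                              # S22: 1x1 dimension reduction
            p = F.interpolate(p, size=(h, w), mode="bilinear",
                              align_corners=False)                     # S23: bilinear upsampling
            branches.append(p)
        spliced = torch.cat(branches, dim=1)                           # S23: first spliced feature
        return self.smooth(spliced)                                    # S24: global prior F_0
```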
In the step S3, the depth features are promoted and fused from a high scale to a low scale by a fusion block to obtain output features of each scale, wherein the fusion block comprises a channel promotion module, a space promotion module and a scale promotion module;
since there are many similar pairs of depth features extracted directly from portrait pictures, it is considered that there is information redundancy; redundancy of depth features on the channel can be improved through the channel lifting module, so that expressive force of the features is more abundant; when the channel lifting module is used for lifting the context information of the depth features from the channel, referring to fig. 3, firstly, the depth features corresponding to the scale l=1 to the scale l=l-1 are used as initial features, and the convolution layers with the convolution kernel of 3×3 and the number of groups equal to the number of channels are adopted to process the initial features to obtain generated features; secondly, splicing the generated features and the initial features to obtain second spliced features; finally, performing dimension reduction processing on the second spliced feature through a convolution layer with a convolution kernel of 1 multiplied by 1 and an output channel equal to the number of input feature channels, and outputting the second spliced feature to obtain a first output feature; the first output feature has a rich expressive force.
When the space lifting module is used to promote the context information of the depth features spatially, the pyramid pooling idea is used. Referring specifically to fig. 4, firstly, the feature size of the first output feature is reduced through an average pooling layer with pooling kernel sizes of 2×2, 4×4 and 8×8 respectively, generating features of 1/2, 1/4 and 1/8 of the original size; the features with the sizes of 1/2, 1/4 and 1/8 are respectively smoothed through a convolution layer with a convolution kernel of 3×3; then the smoothed features are upsampled through bilinear interpolation and the upsampling results are added and fused; finally, the added and fused features are smoothed through a convolution layer with a convolution kernel of 3×3 to obtain the second output feature. The feature quality of the second output feature obtained by the space lifting module is greatly improved.
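A corresponding sketch of the space lifting module (S321 to S324), under the same illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class SpatialPromotion(nn.Module):
    """Space lifting sketch (S321-S324): pool to 1/2, 1/4 and 1/8 size, smooth each branch with a
    3x3 convolution, upsample back, add the branches and smooth again."""
    def __init__(self, channels):
        super().__init__()
        self.kernels = (2, 4, 8)                         # average-pooling kernel sizes
        self.branch_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1) for _ in self.kernels
        )
        self.smooth = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        fused = 0
        for k, conv in zip(self.kernels, self.branch_convs):
            p = F.avg_pool2d(x, kernel_size=k, stride=k)          # S321: 1/2, 1/4, 1/8 features
            p = conv(p)                                           # S322: 3x3 smoothing convolution
            p = F.interpolate(p, size=(h, w), mode="bilinear",
                              align_corners=False)                # S323: bilinear upsampling
            fused = fused + p                                     # S323: addition fusion
        return self.smooth(fused)                                 # S324: second output feature
```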
When the scale lifting module is used to fuse the context information of the depth features across scales, referring specifically to fig. 5 and fig. 6, firstly, the second output feature is processed through a convolution layer with a convolution kernel of 3×3, and the processed result is upsampled through bilinear interpolation to obtain a third output feature; secondly, the second output feature and the third output feature are added and fused; finally, the added and fused features are smoothed through a convolution layer with a convolution kernel of 3×3 to obtain the multi-scale feature fusion result.
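A sketch of the scale lifting module (S331 to S333) follows; it assumes, in line with the overall high-to-low fusion flow, that the feature being convolved and upsampled is the fused output of the previous (higher) scale and that its channel count has already been aligned with the current scale.

```python
import torch.nn as nn
import torch.nn.functional as F

class ScalePromotion(nn.Module):
    """Scale lifting sketch (S331-S333): the fused feature from the previous (higher) scale is
    processed by a 3x3 convolution, bilinearly upsampled (third output feature), added to the
    current scale's second output feature, and smoothed by another 3x3 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.pre = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.smooth = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, second_out, prev_fused):
        third_out = F.interpolate(self.pre(prev_fused), size=second_out.shape[-2:],
                                  mode="bilinear", align_corners=False)   # S331
        added = second_out + third_out                                    # S332: addition fusion
        return self.smooth(added)                                         # S333: multi-scale fusion result
```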
In this way, the context information of the depth features is promoted and fused step by step from high scale to low scale along the channel, spatial and scale dimensions, producing portrait detection and segmentation predictions that progress from global localization to local detail and from coarse to fine, with the high-accuracy portrait detection result finally obtained at the scale l=L-1.
In the step S4, the output features of each scale are optimized, specifically: the output features of each scale are processed through a convolution layer with a convolution kernel of 1×1 to generate portrait prediction maps, and each prediction map is optimized and trained through a cross-entropy loss function.

The output features of each scale are then trained on data. Most existing portrait detection datasets have relatively blurry edge labels; referring specifically to fig. 7, the magnified views in the last column show that the edge labeling error is large, so edge annotations derived from such labels are inaccurate and cannot guide the training of a refined model. Therefore, to train a refined portrait detection model, the invention trains in two stages: in the first stage, a highly robust and accurate portrait detection and segmentation model is trained on a large-scale portrait detection dataset, which provides a large number of portrait pictures and corresponding labels; in the second stage, the model is fine-tuned on a carefully selected small-scale portrait dataset with fine edge annotations, so that the judgement of the portrait edge pixels becomes more accurate.

The depth network framework in the embodiment of the invention can be any of several currently popular depth network structures. Taking VGG-16 as an example, the feature outputs of conv5, conv4, conv2 and conv1 can be used as f_l. The convolution form adopted by the depth network framework can be retained, and the modules embedded by the algorithm adopt depthwise separable convolution to reduce the parameter quantity and the computational complexity. During training, the parameters for VGG-16 are set as follows: the weight decay is 0.0005; the momentum is 0.9; the loss weight of each scale is 1; the batch size is 1; the optimizer is the Adam optimizer. In the first training stage, the initial learning rate is fixed at 1e-4; after 30 epochs of training, the learning rate is divided by 10 every 10 epochs, for 80 epochs in total. In the second training stage, the initial learning rate is fixed at 1e-5, and the learning rate is divided by 10 every 10 epochs, for 50 epochs in total.

Referring to fig. 8, the portrait detection and segmentation method based on depth network context promotion provided by the invention can accurately detect and segment portraits and realizes end-to-end portrait detection; when detecting an image with a resolution of 300×400, the detection speed can reach 57.21 FPS, and the segmentation result can be used for subsequent applications such as matting, depth-of-field blurring, background replacement, sketching, stylization and cartoonization.
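The multi-scale supervision and the two-stage optimizer settings described above can be sketched as follows; the dataset handling and the training loop itself are omitted, and the helper names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_heads(channels_per_scale):
    """S41: one 1x1 convolution per scale mapping the fused feature to a 2-class portrait map."""
    return nn.ModuleList(nn.Conv2d(c, 2, kernel_size=1) for c in channels_per_scale)

def multiscale_loss(fused_feats, mask, heads):
    """S42: supervise every scale with cross entropy; the per-scale loss weight is 1 as in the text."""
    criterion = nn.CrossEntropyLoss()
    total = 0.0
    for feat, head in zip(fused_feats, heads):
        pred = head(feat)                                   # portrait prediction map at this scale
        target = F.interpolate(mask.unsqueeze(1).float(), size=pred.shape[-2:],
                               mode="nearest").squeeze(1).long()
        total = total + criterion(pred, target)
    return total

def make_optimizer(model, stage):
    """Two-stage schedule from the description: stage 1 starts at 1e-4 (80 epochs, learning rate
    divided by 10 every 10 epochs after epoch 30); stage 2 fine-tuning starts at 1e-5 (50 epochs,
    divided by 10 every 10 epochs). Weight decay 0.0005; the stated momentum 0.9 corresponds to
    Adam's default beta1."""
    lr = 1e-4 if stage == 1 else 1e-5
    return torch.optim.Adam(model.parameters(), lr=lr, weight_decay=0.0005)
```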
The embodiment of the invention provides a portrait detection and segmentation method based on context promotion of a depth network. Referring to fig. 6, firstly, global prior information based on the highest-scale feature is computed through the pyramid pooling module on top of the depth network framework and used to guide the overall portrait localization; then, through the channel lifting module, the space lifting module and the scale lifting module, the global prior information is transmitted step by step from high scale to low scale to ensure accurate positioning of the portrait. The channel lifting module enriches the expressiveness of the features; the space lifting module improves the quality of the feature maps; the scale lifting module produces the multi-scale feature fusion result. Finally, the output features of each scale are respectively optimized and trained to realize a refined portrait detection model. Based on this method, accurate detection and segmentation of portrait pictures can be achieved.
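Putting the pieces together, the sketch below assembles the end-to-end flow F_0 = P(f_0) and F_l = B_l(f_l, F_{l-1}); it reuses the module sketches given earlier, and the 1×1 channel-alignment convolution between scales is an added assumption, since the description does not spell out how channel counts are matched across scales.

```python
import torch.nn as nn

class FusionBlock(nn.Module):
    """One fusion block B_l: channel lifting, space lifting, then scale lifting with the fused
    feature F_{l-1} from the previous (higher) scale; reuses the module sketches given earlier."""
    def __init__(self, channels, prev_channels):
        super().__init__()
        self.align = nn.Conv2d(prev_channels, channels, kernel_size=1)  # channel alignment (assumption)
        self.channel = ChannelPromotion(channels)
        self.spatial = SpatialPromotion(channels)
        self.scale = ScalePromotion(channels)

    def forward(self, f_l, prev_fused):
        x = self.channel(f_l)                                # first output feature
        x = self.spatial(x)                                  # second output feature
        return self.scale(x, self.align(prev_fused))         # F_l: multi-scale fusion result

class ContextPromotionNet(nn.Module):
    """End-to-end flow: backbone features f_l, global prior F_0 = P(f_0), then
    F_l = B_l(f_l, F_{l-1}) for l = 1 .. L-1, from high scale to low scale."""
    def __init__(self, channels_per_scale=(512, 512, 128, 64)):   # VGG-16 conv5/conv4/conv2/conv1
        super().__init__()
        self.backbone = MultiScaleExtractor()
        self.ppm = PyramidPooling(channels_per_scale[0])
        self.blocks = nn.ModuleList(
            FusionBlock(c, p) for c, p in zip(channels_per_scale[1:], channels_per_scale[:-1])
        )

    def forward(self, image):
        feats = self.backbone(image)            # f_0 (highest scale) ... f_{L-1} (lowest scale)
        fused = [self.ppm(feats[0])]            # F_0: global prior information
        for f_l, block in zip(feats[1:], self.blocks):
            fused.append(block(f_l, fused[-1]))
        return fused                            # per-scale outputs; fused[-1] drives the final prediction
```

Combined with make_heads, multiscale_loss and make_optimizer from the previous sketch, this yields the two-stage training procedure described above; all class and parameter names here are illustrative.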
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (3)

1. A portrait detection and segmentation method based on deep network context promotion is characterized by comprising the following steps:
s1, extracting L depth features of different scales from a portrait picture based on a depth network framework;
s2, performing feature fusion on the highest-scale depth feature over a plurality of pyramid scales through a pyramid pooling module to generate global prior information;
s3, promoting and fusing the context information of the depth features from high scale to low scale through fusion blocks to obtain the output features of each scale;
s4, optimizing and training the output features of each scale respectively to complete portrait detection and segmentation;
the fusion block in the step S3 includes a channel lifting module, a space lifting module and a scale lifting module;
the step S3 specifically comprises the following steps:
s31, lifting context information of the depth feature from a channel angle through a channel lifting module;
s32, lifting context information of the depth feature from a space angle through a space lifting module;
s33, fusing context information of the depth features from a multi-scale angle through a scale lifting module;
the step S31 specifically includes:
s311, taking the depth features corresponding to the scales from l=1 to l=L-1 as initial features, and processing the initial features by adopting a convolution layer with a convolution kernel of 3×3 and a group number equal to the number of channels to obtain generated features;
s312, splicing the generated features and the initial features to obtain second spliced features;
s313, performing dimension reduction processing on the second spliced feature through a convolution layer with a convolution kernel of 1×1 and an output channel number equal to the number of input feature channels, so as to obtain a first output feature;
the step S32 specifically includes:
s321, reducing the feature size of the first output feature through an average pooling layer, wherein the pooling kernel sizes are respectively 2×2, 4×4 and 8×8, generating features of 1/2, 1/4 and 1/8 of the first output feature respectively;
s322, smoothing the features with the sizes of 1/2, 1/4 and 1/8 respectively through a convolution layer with a convolution kernel of 3×3;
s323, upsampling the features subjected to the smoothing processing in the S322 through bilinear interpolation, and adding and fusing upsampling results;
s324, smoothing the features after the addition and fusion in S323 through a convolution layer with a convolution kernel of 3×3 to obtain a second output feature;
the step S33 specifically includes:
s331, processing the second output feature through a convolution layer with a convolution kernel of 3×3, and upsampling the processed result through bilinear interpolation to obtain a third output feature;
s332, adding and fusing the second output feature and the third output feature;
s333, smoothing the features after the addition and fusion in S332 through a convolution layer with a convolution kernel of 3×3 to obtain a multi-scale feature fusion result.
2. The method for detecting and segmenting portraits based on deep network context promotion as defined in claim 1, wherein S2 specifically comprises:
s21, reducing the feature size of the depth feature through an average pooling layer to generate features with sizes of 1×1, 3×3 and 5×5 respectively;
s22, respectively carrying out dimension reduction on the features with the sizes of 1×1, 3×3 and 5×5 through a convolution layer with a convolution kernel of 1×1, so as to obtain three dimension-reduced features;
s23, upsampling the three dimension-reduced features through bilinear interpolation, and splicing the depth feature with the three upsampled features to obtain a first spliced feature;
s24, smoothing the first spliced feature through a convolution layer with a convolution kernel of 3×3 to obtain global prior information.
3. The method for detecting and segmenting portraits based on deep network context promotion as claimed in claim 1, wherein said S4 specifically comprises:
s41, respectively processing the output features of each scale through a convolution layer with a convolution kernel of 1×1 to generate portrait prediction maps;
s42, performing optimization training on each prediction map through a cross-entropy loss function;
s43, training a portrait detection and segmentation model on a large-scale portrait detection dataset;
s44, fine-tuning the model on a carefully selected small-scale portrait dataset with fine edge annotations to obtain a refined portrait detection model;
s45, detecting and segmenting the portrait.
CN202110913353.0A 2021-08-10 2021-08-10 Portrait detection and segmentation method based on deep network context promotion Active CN113643305B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110913353.0A CN113643305B (en) 2021-08-10 2021-08-10 Portrait detection and segmentation method based on deep network context promotion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110913353.0A CN113643305B (en) 2021-08-10 2021-08-10 Portrait detection and segmentation method based on deep network context promotion

Publications (2)

Publication Number Publication Date
CN113643305A CN113643305A (en) 2021-11-12
CN113643305B (en) 2023-08-25

Family

ID=78420479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110913353.0A Active CN113643305B (en) 2021-08-10 2021-08-10 Portrait detection and segmentation method based on deep network context promotion

Country Status (1)

Country Link
CN (1) CN113643305B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103413161A (en) * 2013-07-30 2013-11-27 复旦大学 Electronic tag capable of being switched into safe mode and switching method thereof
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field
CN111402129A (en) * 2020-02-21 2020-07-10 西安交通大学 Binocular stereo matching method based on joint up-sampling convolutional neural network
CN111681273A (en) * 2020-06-10 2020-09-18 创新奇智(青岛)科技有限公司 Image segmentation method and device, electronic equipment and readable storage medium
CN111724300A (en) * 2020-06-30 2020-09-29 珠海复旦创新研究院 Single picture background blurring method, device and equipment
CN112508868A (en) * 2020-11-23 2021-03-16 西安科锐盛创新科技有限公司 Intracranial blood vessel comprehensive image generation method
WO2021056808A1 (en) * 2019-09-26 2021-04-01 上海商汤智能科技有限公司 Image processing method and apparatus, electronic device, and storage medium
CN112801183A (en) * 2021-01-28 2021-05-14 哈尔滨理工大学 Multi-scale target detection method based on YOLO v3
CN112927209A (en) * 2021-03-05 2021-06-08 重庆邮电大学 CNN-based significance detection system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Structured Modeling of Joint Deep Feature and Prediction Refinement for Salient Object Detection; Yingyue Xue et al.; International Conference on Computer Vision; pp. 1-10 *

Also Published As

Publication number Publication date
CN113643305A (en) 2021-11-12

Similar Documents

Publication Publication Date Title
EP3540637B1 (en) Neural network model training method, device and storage medium for image processing
CN110782490B (en) Video depth map estimation method and device with space-time consistency
CN110738697A (en) Monocular depth estimation method based on deep learning
CN111461110A (en) Small target detection method based on multi-scale image and weighted fusion loss
JP2019032773A (en) Image processing apparatus, and image processing method
CN102156969B (en) Processing method for correcting deviation of image
CN112967341B (en) Indoor visual positioning method, system, equipment and storage medium based on live-action image
CN111861880B (en) Image super-fusion method based on regional information enhancement and block self-attention
CN113449735B (en) Semantic segmentation method and device for super-pixel segmentation
CN109087261A (en) Face antidote based on untethered acquisition scene
US20230237683A1 (en) Model generation method and apparatus based on multi-view panoramic image
CN112464798A (en) Text recognition method and device, electronic equipment and storage medium
CN111914756A (en) Video data processing method and device
CN111753670A (en) Human face overdividing method based on iterative cooperation of attention restoration and key point detection
CN116363750A (en) Human body posture prediction method, device, equipment and readable storage medium
CN116645598A (en) Remote sensing image semantic segmentation method based on channel attention feature fusion
CN113643305B (en) Portrait detection and segmentation method based on deep network context promotion
CN112446353A (en) Video image trace line detection method based on deep convolutional neural network
CN117333682A (en) Multi-view three-dimensional reconstruction method based on self-attention mechanism
CN112541506B (en) Text image correction method, device, equipment and medium
CN115330655A (en) Image fusion method and system based on self-attention mechanism
CN115330935A (en) Three-dimensional reconstruction method and system based on deep learning
CN112634331A (en) Optical flow prediction method and device
CN112017120A (en) Image synthesis method and device
CN111899284A (en) Plane target tracking method based on parameterized ESM network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant