CN115272906A - Video background portrait segmentation model and algorithm based on point rendering - Google Patents

Video background portrait segmentation model and algorithm based on point rendering

Info

Publication number
CN115272906A
Authority
CN
China
Prior art keywords
resolution
low
module
feature map
video background
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210699582.1A
Other languages
Chinese (zh)
Inventor
张笑钦
徐航
林盛
黎敏
周杰
陈熙祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou University
Original Assignee
Wenzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou University filed Critical Wenzhou University
Priority to CN202210699582.1A
Publication of CN115272906A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a video background portrait segmentation model and algorithm based on point rendering, relating to the technical field of video image processing. The model comprises a Backbone network module, a multi-scale feature fusion module, a segmentation prediction module and a point-rendering-based neural network module. The Backbone network module performs initial feature extraction on the video background portrait to be processed, the multi-scale feature fusion module performs feature fusion to obtain the detection object, the segmentation prediction module generates a rough mask prediction for the detection object, and the point-rendering-based neural network module performs segmentation prediction to obtain a fine-grained mask.

Description

Video background portrait segmentation model and algorithm based on point rendering
Technical Field
The invention relates to the technical field of video image processing, in particular to a video background portrait segmentation model and an algorithm based on point rendering.
Background
In recent years, portrait segmentation with deep learning models has achieved remarkable success in both performance and accuracy, and has been widely applied in practical scenarios such as mobile phone photography, video surveillance and face recognition. Compared with other semantic segmentation tasks, portrait segmentation places higher demands on both precision and speed.
With the popularization of smartphones and social networking sites (such as Weibo and Facebook), personal photos can be captured and shared more conveniently, and much image editing software now offers a portrait segmentation function that separates the regions corresponding to the head and upper body of a portrait image from the background for subsequent operations such as background replacement, hairstyle change and portrait stylization.
Portrait segmentation is an important research topic in computer vision and can be widely applied in many fields: for example, replacing the background of a video, compositing people into different scenes to create novel applications, or quickly producing creative pictures through detailed image processing.
Therefore, providing a new technical solution that addresses the above problems is an urgent task for those skilled in the art.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a video background portrait segmentation model and algorithm based on point rendering that segment effectively and can accurately extract the portrait subject even in complex scenes.
A video background portrait segmentation model based on point rendering, comprising: a Backbone network module, a multi-scale feature fusion module, a segmentation prediction module and a point-rendering-based neural network module.
In the above scheme, the Backbone network module is configured to perform preliminary feature extraction on the video background portrait to be processed, obtaining low-resolution features at multiple scales and high-resolution features at multiple scales.
In the above scheme, the multi-scale feature fusion module is configured to perform feature aggregation on the high-resolution features output by the Backbone network module, perform upsampling feature filling on the low-resolution features, and fuse the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object.
In the foregoing solution, the segmentation prediction module is configured to generate a rough mask prediction for a detection object output by the multi-scale feature fusion module.
In the above scheme, the neural network module based on point rendering is configured to perform segmentation prediction by using an iterative subdivision algorithm to obtain a fine-grained mask.
In the above scheme, the Backbone network module body comprises a first-stage network, a second-stage network, a third-stage network and a fourth-stage network. The first-stage network comprises a first downsampling unit and a first residual unit; the first downsampling unit is configured to downsample the video background portrait to be processed by a factor of 4, obtaining a low-resolution feature map at 1/4 of the size of the video background portrait to be processed and extracting low-resolution features; the first residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the first downsampling unit, expanding its number of channels to 18, obtaining a first high-resolution feature map and extracting high-resolution features; the first residual unit employs 4 Bottleneck residual modules. The second-stage network comprises a second downsampling unit and a second residual unit; the second downsampling unit is configured to downsample the video background portrait to be processed by a factor of 8, obtaining a low-resolution feature map at 1/8 of the size of the video background portrait to be processed and extracting low-resolution features; the second residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the second downsampling unit, expanding its number of channels to 36, obtaining a second high-resolution feature map and extracting high-resolution features; the second residual unit employs 1 BasicBlock residual module. The third-stage network comprises a third downsampling unit and a third residual unit; the third downsampling unit is configured to downsample the video background portrait to be processed by a factor of 16, obtaining a low-resolution feature map at 1/16 of the size of the video background portrait to be processed and extracting low-resolution features; the third residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the third downsampling unit, expanding its number of channels to 72, obtaining a third high-resolution feature map and extracting high-resolution features; the third residual unit employs 4 BasicBlock residual modules. The fourth-stage network comprises a fourth downsampling unit and a fourth residual unit; the fourth downsampling unit is configured to downsample the video background portrait to be processed by a factor of 32, obtaining a low-resolution feature map at 1/32 of the size of the video background portrait to be processed and extracting low-resolution features; the fourth residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the fourth downsampling unit, expanding its number of channels to 144, obtaining a fourth high-resolution feature map and extracting high-resolution features; the fourth residual unit employs 3 BasicBlock residual modules.
In the above scheme, the width parameter of the Bottleneck residual module is 64; the Bottleneck residual module comprises a first 1 × 1 convolutional layer, a 3 × 3 convolutional layer and a second 1 × 1 convolutional layer, each of which includes a BN layer and a ReLU function.
In the above scheme, the BasicBlock residual module includes a first 3 × 3 convolutional layer and a second 3 × 3 convolutional layer, and the first 3 × 3 convolutional layer and the second 3 × 3 convolutional layer are followed by a BN layer and a ReLU function.
The invention also provides a video background portrait segmentation algorithm based on point rendering, which is applied to the above video background portrait segmentation model based on point rendering and comprises the following steps:
Step S1: performing preliminary feature extraction on the video background portrait to be processed through the Backbone network module to obtain low-resolution feature maps at multiple scales and high-resolution feature maps at multiple scales;
Step S2: performing feature aggregation on the high-resolution features output by the Backbone network module through the multi-scale feature fusion module, performing upsampling feature filling on the low-resolution features, and fusing the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object;
Step S3: generating a rough mask prediction for the detection object output by the multi-scale feature fusion module through the segmentation prediction module;
Step S4: performing segmentation prediction with an iterative subdivision algorithm through the point-rendering-based neural network module to obtain a fine-grained mask.
In the foregoing solution, performing feature aggregation on the high-resolution features output by the Backbone network module through the multi-scale feature fusion module, performing upsampling feature filling on the low-resolution features, and fusing the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object comprises:
Step S21: performing a downsampling operation on each high-resolution feature map output by the Backbone network module using one or more 3 × 3 convolutional layers with a stride of 2, so that each high-resolution feature map has the same size as its corresponding low-resolution feature map, and aggregating the features of the downsampled high-resolution feature maps;
Step S22: performing an upsampling operation on each low-resolution feature map output by the Backbone network module using one or more 3 × 3 convolutional layers, so that each low-resolution feature map has the same size as its corresponding high-resolution feature map, applying an identity mapping to the upsampled low-resolution feature map through a 1 × 1 convolutional layer, and fusing its features with the aggregated high-resolution features along the feature channels;
Step S23: outputting the fusion result.
In the foregoing solution, generating a rough mask prediction for the detection object output by the multi-scale feature fusion module through the segmentation prediction module comprises:
Step S31: selecting a set of points from the detection object output by the multi-scale feature fusion module through a PointRend network;
Step S32: making an independent prediction for each point through a small multilayer perceptron;
Step S33: iterating the independent prediction process through a subdivision mask rendering algorithm, and roughly predicting the mask of the detection object through a fine-grained image recognition algorithm.
In the foregoing solution, obtaining a fine-grained mask by performing segmentation prediction with an iterative subdivision algorithm through the point-rendering-based neural network module comprises:
Step S41: inputting the fusion result output by the multi-scale feature fusion module into the PointRend network;
Step S42: taking the mask output by the segmentation prediction module as the coarse prediction, and performing point-wise segmentation prediction at positions adaptively selected by the iterative subdivision algorithm through the PointRend network to obtain a fine-grained mask.
In conclusion, the beneficial effects of the invention are as follows: a point-rendering-based video background portrait segmentation model comprising a Backbone network module, a multi-scale feature fusion module, a segmentation prediction module and a point-rendering-based neural network module is built with the PyTorch deep learning framework. The model generates masks with fine boundary details for portrait videos and pictures, so that the portrait subject can be segmented rapidly and accurately, the video can be further processed, and an excellent visual effect is achieved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention.
Fig. 1 is a schematic diagram of a network structure of a video background portrait segmentation model based on point rendering in the present invention.
Fig. 2 is a schematic diagram of a network structure of the multi-scale feature fusion module according to the present invention.
FIG. 3 is a schematic diagram of the network structure of the BasicBlock residual module and the Bottleneck residual module in the present invention.
FIG. 4 is a step diagram of the video background portrait segmentation algorithm based on point rendering according to the present invention.
FIG. 5 is a diagram illustrating the steps of feature fusion performed by the multi-scale feature fusion module according to the present invention.
FIG. 6 is a diagram illustrating the steps of coarse mask prediction performed by the segmentation prediction module according to the present invention.
FIG. 7 is a diagram of the steps of segmentation prediction by a neural network module based on point rendering according to the present invention.
FIG. 8 is a diagram illustrating the effect of segmenting the human image of the video image in different scenes according to the present invention.
Detailed Description
In order to make the objects and advantages of the technical solutions of the present invention clearer, the technical solutions implemented by the present invention will be clearly and completely described below with reference to the accompanying drawings of specific embodiments of the present invention. Like reference symbols in the various drawings indicate like elements. It should be noted that the described embodiments are part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without inventive step, are within the scope of protection of the invention.
As shown in fig. 1 and fig. 2, a video background portrait segmentation model based on point rendering according to the present invention includes: a Backbone network module, a multi-scale feature fusion module, a segmentation prediction module and a point-rendering-based neural network module.
The connection relationship between the above modules of the present invention will be further described in detail with reference to the accompanying drawings.
The Backbone network module is used for performing preliminary feature extraction on the video background portrait to be processed to obtain low-resolution features at multiple scales and high-resolution features at multiple scales; the multi-scale feature fusion module is used for performing feature aggregation on the high-resolution features output by the Backbone network module, performing upsampling feature filling on the low-resolution features, and fusing the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object; the segmentation prediction module is used for generating a rough mask prediction for the detection object output by the multi-scale feature fusion module; the point-rendering-based neural network module is used for performing segmentation prediction with an iterative subdivision algorithm to obtain a fine-grained mask.
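Before detailing each module, the overall data flow can be made concrete with a minimal PyTorch sketch. This is a schematic of how the four modules connect, not the patent's actual implementation; the class name PointRendPortraitSegmenter and the sub-module interfaces are hypothetical, and the individual sub-modules are sketched later in this section.

import torch.nn as nn

class PointRendPortraitSegmenter(nn.Module):
    """Schematic wiring: Backbone -> fusion -> coarse mask -> point refinement."""
    def __init__(self, backbone, fusion, seg_head, point_refiner):
        super().__init__()
        self.backbone = backbone            # multi-scale feature extraction
        self.fusion = fusion                # multi-scale feature fusion
        self.seg_head = seg_head            # rough mask prediction
        self.point_refiner = point_refiner  # point-rendering refinement

    def forward(self, frame):
        feats = self.backbone(frame)            # low- and high-resolution features
        fused, fine_map = self.fusion(feats)    # detection object + fine features
        coarse_mask = self.seg_head(fused)      # rough mask prediction
        return self.point_refiner(coarse_mask, fine_map)  # fine-grained mask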
In this embodiment, the video background portrait segmentation model based on point rendering is built with the PyTorch deep learning framework.
In this embodiment, the Backbone network module obtains semantic information and detail information that enhance the portrait features, as well as the position information of the person target against the video background.
In this embodiment, the multi-scale feature fusion module combines the semantic information, detail information and position information of the person target, and can detect objects at different scales.
Further, the Backbone network module body comprises a first-stage network, a second-stage network, a third-stage network and a fourth-stage network. The first-stage network comprises a first downsampling unit and a first residual unit; the first downsampling unit is configured to downsample the video background portrait to be processed by a factor of 4, obtaining a low-resolution feature map at 1/4 of the size of the video background portrait to be processed and extracting low-resolution features; the first residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the first downsampling unit, expanding its number of channels to 18, obtaining a first high-resolution feature map and extracting high-resolution features; the first residual unit employs 4 Bottleneck residual modules. The second-stage network comprises a second downsampling unit and a second residual unit; the second downsampling unit is configured to downsample the video background portrait to be processed by a factor of 8, obtaining a low-resolution feature map at 1/8 of the size of the video background portrait to be processed and extracting low-resolution features; the second residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the second downsampling unit, expanding its number of channels to 36, obtaining a second high-resolution feature map and extracting high-resolution features; the second residual unit employs 1 BasicBlock residual module. The third-stage network comprises a third downsampling unit and a third residual unit; the third downsampling unit is configured to downsample the video background portrait to be processed by a factor of 16, obtaining a low-resolution feature map at 1/16 of the size of the video background portrait to be processed and extracting low-resolution features; the third residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the third downsampling unit, expanding its number of channels to 72, obtaining a third high-resolution feature map and extracting high-resolution features; the third residual unit employs 4 BasicBlock residual modules. The fourth-stage network comprises a fourth downsampling unit and a fourth residual unit; the fourth downsampling unit is configured to downsample the video background portrait to be processed by a factor of 32, obtaining a low-resolution feature map at 1/32 of the size of the video background portrait to be processed and extracting low-resolution features; the fourth residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the fourth downsampling unit, expanding its number of channels to 144, obtaining a fourth high-resolution feature map and extracting high-resolution features; the fourth residual unit employs 3 BasicBlock residual modules.
In this embodiment, the output channels of the first-stage, second-stage, third-stage and fourth-stage networks are C, 2C, 4C and 8C respectively (C = 18, matching the channel counts 18, 36, 72 and 144 above).
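To make the stage structure concrete, the sketch below treats each downsampling unit as a single strided 3 × 3 convolution applied to the previous stage's output. This is a simplifying assumption (the patent only states the overall downsampling factors of 4, 8, 16 and 32), and make_stage is a hypothetical helper that uses the residual modules sketched just below.

import torch.nn as nn

def make_stage(in_ch, out_ch, stride, num_blocks, block):
    """One backbone stage: a strided 3x3 downsampling unit followed by a
    stack of residual modules operating at the expanded channel count."""
    layers = [nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
              nn.BatchNorm2d(out_ch),
              nn.ReLU(inplace=True)]
    layers += [block(out_ch) for _ in range(num_blocks)]
    return nn.Sequential(*layers)

# With C = 18, the stage widths are 18, 36, 72 and 144 and the residual
# unit depths 4, 1, 4 and 3, e.g.:
# stage2 = make_stage(18, 36, stride=2, num_blocks=1, block=BasicBlock)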
As shown in fig. 3, the width parameter of the Bottleneck residual module is 64; the Bottleneck residual module includes a first 1 × 1 convolutional layer, a 3 × 3 convolutional layer and a second 1 × 1 convolutional layer, each of which includes a BN layer and a ReLU function.
Further, the BasicBlock residual module includes a first 3 × 3 convolutional layer and a second 3 × 3 convolutional layer, each followed by a BN layer and a ReLU function.
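As a concrete illustration, a minimal PyTorch sketch of the two residual modules under the stated configuration (Bottleneck width 64, BN and ReLU attached to each convolution) follows. The channel handling on the skip path is an assumption, since the patent does not publish source code.

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions, each followed by BN and ReLU, with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # residual (identity) connection

class Bottleneck(nn.Module):
    """1x1 -> 3x3 -> 1x1 convolutions with BN and ReLU, width parameter 64."""
    def __init__(self, in_channels, width=64, out_channels=None):
        super().__init__()
        out_channels = out_channels or in_channels
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, width, 1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True))
        self.conv = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True))
        self.expand = nn.Sequential(
            nn.Conv2d(width, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels))
        # 1x1 projection on the skip path when channel counts differ (assumption)
        self.proj = (nn.Identity() if in_channels == out_channels
                     else nn.Conv2d(in_channels, out_channels, 1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.expand(self.conv(self.reduce(x)))
        return self.relu(out + self.proj(x))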
As shown in fig. 4, the present invention further provides a point-rendering-based video background portrait segmentation algorithm, which is applied to the point-rendering-based video background portrait segmentation model described above and includes:
Step S1: performing preliminary feature extraction on the video background portrait to be processed through the Backbone network module to obtain low-resolution feature maps at multiple scales and high-resolution feature maps at multiple scales;
Step S2: performing feature aggregation on the high-resolution features output by the Backbone network module through the multi-scale feature fusion module, performing upsampling feature filling on the low-resolution features, and fusing the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object;
Step S3: generating a rough mask prediction for the detection object output by the multi-scale feature fusion module through the segmentation prediction module;
Step S4: performing segmentation prediction with an iterative subdivision algorithm through the point-rendering-based neural network module to obtain a fine-grained mask.
As shown in fig. 5, performing feature aggregation on the high-resolution features output by the Backbone network module through the multi-scale feature fusion module, performing upsampling feature filling on the low-resolution features, and fusing the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object comprises the following steps (a code sketch follows them):
Step S21: performing a downsampling operation on each high-resolution feature map output by the Backbone network module using one or more 3 × 3 convolutional layers with a stride of 2, so that each high-resolution feature map has the same size as its corresponding low-resolution feature map, and aggregating the features of the downsampled high-resolution feature maps;
Step S22: performing an upsampling operation on each low-resolution feature map output by the Backbone network module using one or more 3 × 3 convolutional layers, so that each low-resolution feature map has the same size as its corresponding high-resolution feature map, applying an identity mapping to the upsampled low-resolution feature map through a 1 × 1 convolutional layer, and fusing its features with the aggregated high-resolution features along the feature channels;
Step S23: outputting the fusion result.
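To illustrate steps S21 to S23 for one pair of high-resolution and low-resolution branches, the sketch below assumes bilinear upsampling ahead of the 1 × 1 identity-mapping convolution and channel-wise concatenation for the final fusion; both are plausible readings of the description rather than confirmed details, and the class name is hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseTwoBranches(nn.Module):
    """Fuse one high-resolution and one low-resolution feature map."""
    def __init__(self, hi_ch, lo_ch):
        super().__init__()
        # S21: stride-2 3x3 conv brings the high-res map down to the low-res size
        self.down = nn.Conv2d(hi_ch, lo_ch, 3, stride=2, padding=1, bias=False)
        # S22: 1x1 conv applies the identity mapping to the upsampled low-res map
        self.up_proj = nn.Conv2d(lo_ch, hi_ch, 1, bias=False)

    def forward(self, hi, lo):
        # S21: aggregate downsampled high-res features into the low-res map
        lo_out = lo + self.down(hi)
        # S22: upsample the low-res map to high-res size, project, and fuse
        up = F.interpolate(lo, size=hi.shape[-2:], mode="bilinear",
                           align_corners=False)
        hi_out = torch.cat([hi, self.up_proj(up)], dim=1)  # fuse along channels
        return hi_out, lo_out  # S23: output the fusion results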
As shown in fig. 6, generating a rough mask prediction for the detection object output by the multi-scale feature fusion module through the segmentation prediction module comprises:
Step S31: selecting a set of points from the detection object output by the multi-scale feature fusion module through a PointRend network;
Step S32: making an independent prediction for each point through a small multilayer perceptron;
Step S33: iterating the independent prediction process through a subdivision mask rendering algorithm, and roughly predicting the mask of the detection object through a fine-grained image recognition algorithm.
In this embodiment, the small multilayer perceptron (MLP) predicts the coarse mask from fine-grained feature maps, taking as input the interpolated features computed at the selected points; the coarse-mask features enable the small MLP to make different predictions at a single point that is contained in two or more boxes, and during this process a subdivision mask rendering algorithm is applied iteratively to refine the uncertain regions of the prediction mask.
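A point head in this spirit can be written as a small MLP shared across points; the hidden width, the layer count and the use of 1 × 1 Conv1d layers over a (batch, channels, points) tensor are illustrative assumptions.

import torch
import torch.nn as nn

class PointHead(nn.Module):
    """Small MLP that predicts a mask label per sampled point from the
    concatenation of fine-grained features and coarse mask logits."""
    def __init__(self, fine_ch, coarse_ch, hidden=256, num_layers=3):
        super().__init__()
        layers, ch = [], fine_ch + coarse_ch
        for _ in range(num_layers):
            # a 1x1 conv over a (B, C, N) point tensor acts as a shared per-point MLP
            layers += [nn.Conv1d(ch, hidden, 1), nn.ReLU(inplace=True)]
            ch = hidden
        layers.append(nn.Conv1d(ch, coarse_ch, 1))  # per-point mask logits
        self.mlp = nn.Sequential(*layers)

    def forward(self, fine_feats, coarse_feats):
        # fine_feats: (B, fine_ch, N); coarse_feats: (B, coarse_ch, N)
        return self.mlp(torch.cat([fine_feats, coarse_feats], dim=1))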
As shown in fig. 7, obtaining the fine-grained mask by performing segmentation prediction with an iterative subdivision algorithm through the point-rendering-based neural network module comprises:
Step S41: inputting the fusion result output by the multi-scale feature fusion module into the PointRend network;
Step S42: taking the mask output by the segmentation prediction module as the coarse prediction, and performing point-wise segmentation prediction at positions adaptively selected by the iterative subdivision algorithm through the PointRend network to obtain a fine-grained mask.
In this embodiment, the point-rendering-based neural network module selects a small number of real-valued points for prediction, avoiding excessive computation over all pixels of the high-resolution output grid, and extracts a point-wise feature representation for each selected point. The feature of a real-valued point is computed by bilinear interpolation over the feature map, using the 4 nearest neighbors of the point on the regular grid of the feature map; as a result, it can predict segmentations at a higher resolution than the feature map by exploiting the sub-pixel information encoded in the channel dimensions of the feature map.
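The bilinear interpolation over the 4 nearest grid neighbors maps directly onto torch.nn.functional.grid_sample; a small helper (hypothetical name point_sample) might look like this.

import torch.nn.functional as F

def point_sample(feature_map, points):
    """Sample per-point features by bilinear interpolation.
    feature_map: (B, C, H, W); points: (B, N, 2) with (x, y) coords in [0, 1].
    Returns (B, C, N) point-wise feature representations."""
    grid = 2.0 * points.unsqueeze(2) - 1.0  # map [0, 1] to grid_sample's [-1, 1]
    out = F.grid_sample(feature_map, grid, mode="bilinear", align_corners=False)
    return out.squeeze(3)                   # (B, C, N, 1) -> (B, C, N)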
In this embodiment, in each iteration of the iterative subdivision algorithm, the PointRend network upsamples the previously predicted segmentation result with bilinear interpolation, then selects the N most uncertain points on this denser grid (for binary prediction, for example, points whose predicted probability is close to 0.5), computes a point-wise feature representation for each of the N points, and predicts the label of each point. This process is repeated until the segmentation is upsampled to the required resolution.
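Putting the pieces together, one resolution-doubling refinement loop consistent with this description might be sketched as follows; point_sample and the point head come from the sketches above, and num_points and steps are illustrative values, not parameters stated in the patent.

import torch
import torch.nn.functional as F

@torch.no_grad()
def iterative_subdivision(coarse_logits, fine_map, point_head,
                          num_points=2048, steps=3):
    """Refine a coarse mask: each step doubles the resolution by bilinear
    upsampling, then re-predicts only the N most uncertain points."""
    logits = coarse_logits                                    # (B, 1, h, w)
    for _ in range(steps):
        logits = F.interpolate(logits, scale_factor=2, mode="bilinear",
                               align_corners=False)
        B, _, H, W = logits.shape
        # uncertainty for binary prediction: probability closest to 0.5
        prob = logits.sigmoid().view(B, -1)
        idx = (-(prob - 0.5).abs()).topk(min(num_points, H * W), dim=1).indices
        # flat indices -> normalized (x, y) coordinates of pixel centers
        xs = ((idx % W).float() + 0.5) / W
        ys = (torch.div(idx, W, rounding_mode="floor").float() + 0.5) / H
        points = torch.stack([xs, ys], dim=2)                 # (B, N, 2)
        # re-predict labels only at the selected points
        new_logits = point_head(point_sample(fine_map, points),
                                point_sample(logits, points)) # (B, 1, N)
        logits = logits.view(B, 1, -1).scatter(
            2, idx.unsqueeze(1), new_logits).view(B, 1, H, W)
    return logits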
As shown in fig. 8, which illustrates the effect of segmenting the portrait of a video image in different scenes, the scheme provided by the present invention accurately segments the portrait subject in a variety of complex scenes and refines its edges.
The above description is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art may make various modifications and changes to the embodiment. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (8)

1. A video background portrait segmentation model based on point rendering is characterized by comprising:
the system comprises a backhaul network module, a multi-scale feature fusion module, a segmentation prediction module and a point rendering-based neural network module;
the Backbone network module is used for performing primary feature extraction on the video background portrait to be processed to obtain low-resolution features of multiple scales and high-resolution features of multiple scales;
the multi-scale feature fusion module is used for performing feature aggregation on the high-resolution features output by the Backbone network module, performing up-sampling feature filling on the low-resolution features, and fusing the low-resolution features subjected to up-sampling processing and the high-resolution features subjected to feature aggregation to obtain a detection object;
the segmentation prediction module is used for generating a rough mask prediction for the detection object output by the multi-scale feature fusion module;
the neural network module based on the point rendering is used for carrying out segmentation prediction by adopting an iterative subdivision algorithm to obtain a mask with fine granularity.
2. The video background portrait segmentation model based on point rendering according to claim 1, wherein the Backbone network module body includes a first-stage network, a second-stage network, a third-stage network and a fourth-stage network; the first-stage network includes a first downsampling unit and a first residual unit, the first downsampling unit is configured to downsample the video background portrait to be processed by a factor of 4, obtaining a low-resolution feature map at 1/4 of the size of the video background portrait to be processed and extracting low-resolution features, the first residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the first downsampling unit, expanding its number of channels to 18, obtaining a first high-resolution feature map and extracting high-resolution features, and the first residual unit employs 4 Bottleneck residual modules; the second-stage network includes a second downsampling unit and a second residual unit, the second downsampling unit is configured to downsample the video background portrait to be processed by a factor of 8, obtaining a low-resolution feature map at 1/8 of the size of the video background portrait to be processed and extracting low-resolution features, the second residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the second downsampling unit, expanding its number of channels to 36, obtaining a second high-resolution feature map and extracting high-resolution features, and the second residual unit employs 1 BasicBlock residual module; the third-stage network includes a third downsampling unit and a third residual unit, the third downsampling unit is configured to downsample the video background portrait to be processed by a factor of 16, obtaining a low-resolution feature map at 1/16 of the size of the video background portrait to be processed and extracting low-resolution features, the third residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the third downsampling unit, expanding its number of channels to 72, obtaining a third high-resolution feature map and extracting high-resolution features, and the third residual unit employs 4 BasicBlock residual modules; the fourth-stage network includes a fourth downsampling unit and a fourth residual unit, the fourth downsampling unit is configured to downsample the video background portrait to be processed by a factor of 32, obtaining a low-resolution feature map at 1/32 of the size of the video background portrait to be processed and extracting low-resolution features, the fourth residual unit is configured to perform multiple convolution operations on the low-resolution feature map output by the fourth downsampling unit, expanding its number of channels to 144, obtaining a fourth high-resolution feature map and extracting high-resolution features, and the fourth residual unit employs 3 BasicBlock residual modules.
3. The point-rendering-based video background portrait segmentation model of claim 1, wherein the width parameter of the Bottleneck residual module is 64, the Bottleneck residual module includes a first 1 × 1 convolutional layer, a 3 × 3 convolutional layer and a second 1 × 1 convolutional layer, and each of these convolutional layers includes a BN layer and a ReLU function.
4. The point-rendering-based video background portrait segmentation model of claim 1, wherein the BasicBlock residual module includes a first 3 × 3 convolutional layer and a second 3 × 3 convolutional layer, each followed by a BN layer and a ReLU function.
5. A point-rendering-based video background portrait segmentation algorithm, applied to the point-rendering-based video background portrait segmentation model of any one of claims 1 to 4, comprising:
Step S1: performing preliminary feature extraction on the video background portrait to be processed through the Backbone network module to obtain low-resolution feature maps at multiple scales and high-resolution feature maps at multiple scales;
Step S2: performing feature aggregation on the high-resolution features output by the Backbone network module through the multi-scale feature fusion module, performing upsampling feature filling on the low-resolution features, and fusing the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object;
Step S3: generating a rough mask prediction for the detection object output by the multi-scale feature fusion module through the segmentation prediction module;
Step S4: performing segmentation prediction with an iterative subdivision algorithm through the point-rendering-based neural network module to obtain a fine-grained mask.
6. The video background portrait segmentation algorithm based on point rendering according to claim 5, wherein performing feature aggregation on the high-resolution features output by the Backbone network module through the multi-scale feature fusion module, performing upsampling feature filling on the low-resolution features, and fusing the upsampled low-resolution features with the aggregated high-resolution features to obtain the detection object comprises:
Step S21: performing a downsampling operation on each high-resolution feature map output by the Backbone network module using one or more 3 × 3 convolutional layers with a stride of 2, so that each high-resolution feature map has the same size as its corresponding low-resolution feature map, and aggregating the features of the downsampled high-resolution feature maps;
Step S22: performing an upsampling operation on each low-resolution feature map output by the Backbone network module using one or more 3 × 3 convolutional layers, so that each low-resolution feature map has the same size as its corresponding high-resolution feature map, applying an identity mapping to the upsampled low-resolution feature map through a 1 × 1 convolutional layer, and fusing its features with the aggregated high-resolution features along the feature channels;
Step S23: outputting the fusion result.
7. The point-rendering-based video background portrait segmentation algorithm of claim 5, wherein generating a rough mask prediction for the detection object output by the multi-scale feature fusion module through the segmentation prediction module comprises:
Step S31: selecting a set of points from the detection object output by the multi-scale feature fusion module through a PointRend network;
Step S32: making an independent prediction for each point through a small multilayer perceptron;
Step S33: iterating the independent prediction process through a subdivision mask rendering algorithm, and roughly predicting the mask of the detection object through a fine-grained image recognition algorithm.
8. The point-rendering-based video background portrait segmentation algorithm of claim 5, wherein obtaining the fine-grained mask by performing segmentation prediction with an iterative subdivision algorithm through the point-rendering-based neural network module comprises:
Step S41: inputting the fusion result output by the multi-scale feature fusion module into the PointRend network;
Step S42: taking the mask output by the segmentation prediction module as the coarse prediction, and performing point-wise segmentation prediction at positions adaptively selected by the iterative subdivision algorithm through the PointRend network to obtain a fine-grained mask.
CN202210699582.1A 2022-06-20 2022-06-20 Video background portrait segmentation model and algorithm based on point rendering Pending CN115272906A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210699582.1A CN115272906A (en) 2022-06-20 2022-06-20 Video background portrait segmentation model and algorithm based on point rendering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210699582.1A CN115272906A (en) 2022-06-20 2022-06-20 Video background portrait segmentation model and algorithm based on point rendering

Publications (1)

Publication Number Publication Date
CN115272906A true CN115272906A (en) 2022-11-01

Family

ID=83761306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210699582.1A Pending CN115272906A (en) 2022-06-20 2022-06-20 Video background portrait segmentation model and algorithm based on point rendering

Country Status (1)

Country Link
CN (1) CN115272906A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116109753A (en) * 2023-04-12 2023-05-12 深圳原世界科技有限公司 Three-dimensional cloud rendering engine platform and data processing method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination