CN115100409B - Video portrait segmentation algorithm based on twin network - Google Patents

Video portrait segmentation algorithm based on twin network

Info

Publication number
CN115100409B
CN115100409B (application CN202210759308.9A)
Authority
CN
China
Prior art keywords
module
video frame
network
encoder
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210759308.9A
Other languages
Chinese (zh)
Other versions
CN115100409A (en)
Inventor
张笑钦
廖唐飞
赵丽
冯士杰
徐曰旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou University
Original Assignee
Wenzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou University filed Critical Wenzhou University
Priority to CN202210759308.9A priority Critical patent/CN115100409B/en
Publication of CN115100409A publication Critical patent/CN115100409A/en
Application granted granted Critical
Publication of CN115100409B publication Critical patent/CN115100409B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/54 Extraction of image or video features relating to texture
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a video portrait segmentation algorithm based on a twin network, and relates to the technical field of image processing. The algorithm adopts a twin network structure whose basic structure comprises a video frame acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module. The modules are built with the PyTorch deep learning framework, and the learned model predicts an accurate alpha mask for each frame of a video so that the portrait can be extracted from a given image or video, thereby achieving high-resolution video portrait segmentation in complex scenes.

Description

Video portrait segmentation algorithm based on twin network
Technical Field
The invention relates to the technical field of image processing, and in particular to a video portrait segmentation algorithm based on a twin network.
Background
In computer vision, image semantic segmentation is an important research topic that is widely applied in many fields. For example, foreground segmentation of images can be used to replace the background of a video, so that foreground persons are composited into different scenes to produce creative applications.
The purpose of portrait segmentation is to predict an accurate alpha mask that can be used to extract a person from a given image or video. It has a wide range of applications, such as photo editing and movie production. A video portrait segmentation algorithm aims to predict the alpha mask of each video frame in a complex scene and thereby separate the foreground from the background. Existing real-time high-resolution video portrait segmentation algorithms that have been deployed in practice can obtain high-quality predictions with the help of a green screen, but algorithms that do not use a green screen still have problems; for example, the data set requires trimaps, and trimaps are relatively expensive to obtain.
Therefore, a twin-network-based video portrait segmentation algorithm that solves the above problems is urgently needed.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a video portrait segmentation algorithm based on a twin network, which ensures high-precision segmentation of video portraits in environments with complex backgrounds, complex subject shapes and the like.
In order to achieve the above purpose, the invention designs and realizes a video portrait segmentation algorithm based on a twin network, in which a high-resolution alpha mask is obtained through weight sharing in the twin network, the capture of temporal and spatial features by a recurrent neural network, and joint upsampling. The technical scheme is specifically as follows: a video portrait segmentation algorithm based on a twin network adopts a twin network structure whose basic structure comprises a video frame acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module, and comprises the following steps:
Step S1: acquiring a current video frame image from a video to be segmented through the video frame acquisition module and preprocessing the current video frame image to obtain a preprocessed current frequency frame image;
step S2: the RGB separation module is used for separating the obtained preprocessed video frame image into three channel RGB video frame images in an RGB color mode;
Step S3: inputting three-channel RGB video frame images through the Encoder network module, and extracting multi-scale coarse granularity characteristics of five three-channel RGB video frame images by adopting a Mobilenet V network;
Step S4: connecting the Encoder network module and the Decoder network module through the SE module, and recalibrating the characteristics by learning the importance degree of each channel;
Step S5: obtaining features of different scales from the Encoder network module, the downsampling of the current video frame image and the ConvGRU cyclic neural network through the Decoder network module, carrying out feature fusion, capturing edge features lost in downsampling, shallow-layer cultural features, and time and space features, and obtaining a high-resolution feature map;
Step S6: three different scale features are obtained from the current video frame, the downsampling of the current video frame and the Decoder network module through the JPU module, and a high resolution feature map is effectively generated under the condition of corresponding low resolution output and high resolution images.
Further, the acquiring, by the video frame acquisition module, of the current video frame image from the video to be segmented and its preprocessing to obtain the preprocessed current video frame image includes: Step S11: acquiring the current video frame image of the video to be segmented; Step S12: preprocessing the acquired current video frame image.
Furthermore, the video frame acquisition module acquires the current video frame image from the video to be segmented and preprocesses it to obtain the preprocessed current video frame image.
Further, the inputting of the three-channel RGB video frame image into the Encoder network module and the extraction of coarse-grained features of the three-channel RGB video frame at five scales with the MobileNetV3 network include: adopting the lightweight network MobileNetV3-Large as the backbone, constructing a four-level encoder based on the twin network, and obtaining coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution through a downsampling layer and the four-level encoder.
Further, the Encoder network module comprises a downsampling layer and a four-level encoder, wherein the downsampling layer uses bilinear interpolation to perform 4x downsampling to obtain a feature map at 1/4 of the original image resolution; the four-level encoder comprises a first encoder, a second encoder, a third encoder and a fourth encoder, each of which adopts multiple weight-sharing bottleneck structures; each encoder first uses a point-by-point convolution group, then a depthwise convolution group, then connects an SE module to learn channel weights, and finally transfers shallow features containing structural information to the deep features through short links.
Further, the connecting of the Encoder network module and the Decoder network module through the SE module inputs the coarse-grained features into the SE module and performs feature recalibration at the channel level by learning the importance of each channel, including: converting the obtained coarse-grained features into global features through the Squeeze operation, where global average pooling is used to obtain the global features; and performing the Excitation operation on the global features obtained by the Squeeze operation, learning the nonlinear relationships among the channels, obtaining the weights of the different channels, and recalibrating the features.
Further, the Decoder network module obtains features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network, performs feature fusion, captures the edge features lost in downsampling, shallow texture features, and temporal and spatial features, and gradually restores and enlarges the high-level semantic information through a four-level decoder corresponding to the Encoder module to obtain a high-resolution feature map.
Furthermore, the four-level decoder is used for multi-layer feature fusion, reducing the number of channels and obtaining a high-resolution feature map, yielding feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution respectively; the input of each decoder is combined with the corresponding output of the downsampling process, and after convolution and normalization the ConvGRU recurrent network computes the output from the information of the previous frame and the current frame.
Further, the obtaining, by the JPU module, of features of three different scales from the current video frame, the downsampled current video frame and the Decoder network module, and the efficient generation of a high-resolution feature map given the corresponding low-resolution output and the high-resolution image, includes the following steps: Step S41: performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and outputting a feature map; Step S42: using separable convolution groups with different dilation rates to enlarge the receptive field and capture context information, outputting four groups of feature maps with unchanged resolution, and merging the multi-scale context information; Step S43: generating an alpha mask with one channel by applying a 3×3 2D convolution to the fused multi-scale context information.
Furthermore, the feature fusion of the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and the output of the feature map, include the following steps: first, performing a 3×3 2D convolution to unify the number of channels of the three input features; second, performing an upsampling operation to restore them to the high-resolution feature scale; and finally outputting a feature map whose resolution is consistent with the current video frame.
From the above technical solution, the advantages of the present invention are:
Compared with the prior art, the invention can capture multi-level features such as edge features, shallow texture features, and temporal and spatial features, supplement the temporal, spatial and edge-structure information of the alpha mask of the current video frame, and accurately predict the alpha mask, thereby separating the portrait from the background. The method obtains accurate segmentation of portrait edges in various complex environments such as low foreground-background contrast, complex backgrounds and complex subject shapes, and has strong robustness.
The invention can perform high-precision video portrait segmentation on targets in complex scenes such as multiple targets, target occlusion, tiny targets and fast target motion, and each module is built with the PyTorch deep learning framework. The metric results and visual effects on the test data set show that the pre-trained twin-network-based video portrait segmentation model surpasses other current algorithms.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application.
Fig. 1 is a flowchart of a video portrait segmentation algorithm based on a twin network according to the present invention.
Fig. 2 is a schematic diagram of the overall network structure of the twin-network-based video portrait segmentation algorithm according to the present invention.
Fig. 3 is a step diagram of obtaining a video frame image according to the present invention.
Fig. 4 is a step diagram of the present invention for preprocessing a current video frame.
Fig. 5 is a schematic diagram of the Bottleneck structure of the Encoder network module according to the present invention.
Fig. 6 is a detailed network structure diagram of the Encoder network module according to the present invention.
Fig. 7 is a schematic diagram of a Decoder module according to the present invention.
FIG. 8 is a schematic diagram of a JPU module according to the present invention.
FIG. 9 is a step diagram of a JPU module of the present invention.
Fig. 10 shows portrait segmentation results of the twin-network-based video portrait segmentation algorithm in different scenes.
Detailed Description
The present invention will be described in further detail with reference to the following embodiments and the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. The exemplary embodiments of the present invention and the descriptions thereof are used herein to explain the present invention, but are not intended to limit the invention.
The invention designs and realizes a video portrait segmentation algorithm based on a twin network, performs multi-scale feature fusion based on the twin network, guides the network to capture edge features, shallow texture features, and temporal and spatial features, and can segment video portraits quickly and accurately.
Fig. 1 and 2 show a flow chart and an overall network structure diagram of a video portrait segmentation algorithm based on a twin network.
According to the video portrait segmentation algorithm based on the twin network shown in fig. 1 and fig. 2, the purpose of the algorithm is to obtain a more accurate alpha mask by combining the semantic information, the temporal and spatial information, and the structural detail information of the video portrait. The algorithm adopts a twin network structure whose basic structure comprises a video frame acquisition module, an RGB separation module, an Encoder network module, an SE (Squeeze and Excitation) module, a Decoder network module and a JPU (Joint Pyramid Upsampling) module, and specifically comprises the following steps (a minimal code sketch of the overall data flow is given after step S6):
Step S1: acquiring a current video frame image from a video to be segmented through the video frame acquisition module and preprocessing the current video frame image to obtain a preprocessed current frequency frame image;
step S2: the RGB separation module is used for separating the obtained preprocessed video frame image into three channel RGB video frame images in an RGB color mode;
Step S3: inputting three-channel RGB video frame images through the Encoder network module, and extracting multi-scale coarse granularity characteristics of five three-channel RGB video frame images by adopting a Mobilenet V network;
Step S4: connecting the Encoder network module and the Decoder network module through the SE module, and recalibrating the characteristics by learning the importance degree of each channel;
Step S5: obtaining different scale features from the Encoder network module, the downsampling of the current video frame image and the ConvGRU cyclic neural network through the Decoder network module, carrying out feature fusion, and capturing edge features lost in downsampling, shallow-layer cultural features, time sequence and space features;
Step S6: three different scale features are obtained from the current video frame, the downsampling of the current video frame and the Decoder network module through the JPU module, and a high resolution feature map is effectively generated under the condition of corresponding low resolution output and high resolution images.
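As an illustrative aid only, the following is a minimal PyTorch sketch of how the data flows through steps S1 to S6; the class TinyPortraitNet and its single-convolution stand-ins are hypothetical placeholders, not the Encoder, SE, Decoder and JPU designs detailed later in this description.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyPortraitNet(nn.Module):
        # Hypothetical stand-ins for the Encoder, SE, Decoder and JPU modules,
        # wired in the order of steps S3 to S6.
        def __init__(self):
            super().__init__()
            self.encoder = nn.Conv2d(3, 16, 3, stride=4, padding=1)        # S3: coarse features (stand-in)
            self.se = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                    nn.Conv2d(16, 16, 1), nn.Sigmoid())    # S4: channel weights (stand-in)
            self.decoder = nn.Conv2d(16, 8, 3, padding=1)                  # S5: decoding (stand-in)
            self.head = nn.Conv2d(8 + 3, 1, 3, padding=1)                  # S6: JPU-style fusion head (stand-in)

        def forward(self, frame_hr):
            feats = self.encoder(frame_hr)                                 # S3
            feats = feats * self.se(feats)                                 # S4: recalibrate channels
            dec = self.decoder(feats)                                      # S5
            dec = F.interpolate(dec, size=frame_hr.shape[-2:],
                                mode='bilinear', align_corners=False)
            alpha = torch.sigmoid(self.head(torch.cat([dec, frame_hr], dim=1)))  # S6: 1-channel alpha mask
            return alpha

    frame = torch.rand(1, 3, 288, 512)   # a preprocessed three-channel RGB frame (steps S1 and S2)
    alpha = TinyPortraitNet()(frame)     # alpha mask, shape (1, 1, 288, 512)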
Fig. 3 and 4 show step diagrams for obtaining video frame images.
According to the video frame image acquisition shown in fig. 3, acquiring and preprocessing the current video frame image of the video to be segmented includes:
Step S11: acquiring a current video frame image of a video to be segmented;
Step S12: and preprocessing the acquired current video frame image.
Preprocessing the obtained current video frame according to the preprocessing of the obtained current video frame shown in fig. 4, including:
Step S121: resizing the current video frame to a preset size, where the preset size is the input image size required by the twin network;
Step S122: normalizing the pixels of the resized image;
Step S123: adjusting the order of the color channels of the normalized image according to a preset order.
Preprocessing converts the current video frame into an image form suited to the twin network structure, which facilitates image input and enables accurate segmentation.
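A minimal sketch of steps S121 to S123 is given below, assuming a hypothetical preset size of 512×288 pixels, normalization of pixel values to [0, 1] and a BGR-to-RGB channel reordering; the patent does not fix these particular values, and OpenCV and NumPy are used only as convenient tools.

    import cv2
    import numpy as np

    def preprocess_frame(frame_bgr, preset_size=(512, 288)):
        resized = cv2.resize(frame_bgr, preset_size, interpolation=cv2.INTER_LINEAR)  # S121: resize to the preset size
        normalized = resized.astype(np.float32) / 255.0                               # S122: normalize pixel values
        rgb = normalized[:, :, ::-1]                                                  # S123: reorder channels (BGR -> RGB)
        return np.ascontiguousarray(rgb.transpose(2, 0, 1))                           # CHW layout for the network input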
Fig. 5 and fig. 6 show the Bottleneck structure diagram and the detailed network structure diagram of the Encoder network module.
According to the Encoder network module shown in fig. 5 and fig. 6, the Encoder network module is a twin network structure; the lightweight network MobileNetV3-Large designed for semantic segmentation is selected as the backbone, and a four-level encoder is built based on the twin network.
The three-channel RGB video frame image is input into the Encoder network module, and coarse-grained features of the three-channel RGB video frame are extracted at five scales with the MobileNetV3 network: the lightweight network MobileNetV3-Large is adopted as the backbone, a four-level encoder is constructed based on the twin network, and coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution are obtained through a downsampling layer and the four-level encoder.
The downsampling layer uses bilinear interpolation to perform 4x downsampling, giving a feature map at 1/4 of the original resolution of the current video frame; the four-level encoder comprises a first-level encoder, a second-level encoder, a third-level encoder and a fourth-level encoder, each of which adopts multiple weight-sharing bottleneck structures; each encoder first uses a point-by-point convolution group, then a depthwise convolution group, then connects an SE module to learn channel weights, and finally transfers shallow features containing structural information to the deep features through short links.
The four-level encoder includes a first-level encoder Encoder_blk1, a second-level encoder Encoder_blk2, a third-level encoder Encoder_blk3 and a fourth-level encoder Encoder_blk4. These encoders use multiple weight-sharing bottleneck structures; the bottleneck is an inverted residual structure consisting of, first, a point-by-point convolution group (1×1 2D convolution + batch normalization + activation layer), second, a depthwise convolution group (3×3 2D convolution + batch normalization + activation layer), then a connected SE module that learns channel weights, and finally a shortcut connection that transfers shallow features containing structural detail information to the deep features.
Specifically, the first-level encoder Encoder_blk1 includes two bottleneck blocks and obtains a feature map at 1/8 of the original resolution, the second-level encoder Encoder_blk2 includes two bottleneck blocks and obtains a feature map at 1/16 of the original resolution, the third-level encoder Encoder_blk3 includes three bottleneck blocks and obtains a feature map at 1/32 of the original resolution, and the fourth-level encoder Encoder_blk4 includes six bottleneck blocks and obtains a feature map at 1/64 of the original resolution, thereby yielding coarse-grained features of the three-channel RGB video frame at five scales.
Connecting the Encoder network module and the Decoder network module through the SE module inputs the coarse-grained features into the SE module and performs feature recalibration at the channel level by learning the importance of each channel, including: converting the obtained coarse-grained features into global features through the Squeeze operation, where global average pooling is used to obtain the global features; and performing the Excitation operation on the global features obtained by the Squeeze operation, learning the nonlinear relationships among the channels, obtaining the weights of the different channels, and recalibrating the features. The SE module obtains a weight coefficient for each channel, so that the model can better discriminate the features of each channel.
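The bottleneck structure and the SE recalibration described above can be summarized in the following minimal PyTorch sketch; the expansion width, the reduction factor of 4 and the hard-swish activation are assumptions in the spirit of MobileNetV3 rather than the exact patented configuration, and reusing the same Bottleneck instances across branches would correspond to the weight sharing of the twin structure.

    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        def __init__(self, channels, reduction=4):
            super().__init__()
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                                         # Squeeze: global average pooling
                nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())     # Excitation: per-channel weights

        def forward(self, x):
            return x * self.gate(x)                                              # recalibrate the features channel-wise

    class Bottleneck(nn.Module):
        def __init__(self, in_ch, expand_ch, out_ch, stride=1):
            super().__init__()
            self.pointwise = nn.Sequential(                                      # point-by-point convolution group
                nn.Conv2d(in_ch, expand_ch, 1, bias=False),
                nn.BatchNorm2d(expand_ch), nn.Hardswish())
            self.depthwise = nn.Sequential(                                      # depthwise convolution group
                nn.Conv2d(expand_ch, expand_ch, 3, stride, 1, groups=expand_ch, bias=False),
                nn.BatchNorm2d(expand_ch), nn.Hardswish())
            self.se = SEBlock(expand_ch)                                         # SE module learns channel weights
            self.project = nn.Sequential(                                        # project back to the output width
                nn.Conv2d(expand_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))
            self.use_shortcut = stride == 1 and in_ch == out_ch

        def forward(self, x):
            y = self.project(self.se(self.depthwise(self.pointwise(x))))
            return x + y if self.use_shortcut else y                             # shortcut carries shallow structural detail

    blk = Bottleneck(16, 64, 16)
    y = blk(torch.rand(1, 16, 64, 64))    # same spatial size and channel count, recalibrated features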
Fig. 7 shows a detailed network architecture diagram of a Decoder network module.
According to the Decoder network module shown in fig. 7, the Decoder network module is a twin network in which weights are shared among the decoder blocks, forming a pseudo twin network together with the Encoder network module. The Decoder network module obtains features of four different scales from the current video frame, the Encoder module, the downsampled current video frame (Image LR) and the ConvGRU recurrent neural network, and performs feature fusion to capture the edge features lost in downsampling, shallow texture features, and temporal and spatial features; it gradually restores and enlarges the high-level semantic information through a four-level decoder corresponding to the Encoder module to obtain a high-resolution feature map.
To reduce the number of parameters and the amount of computation, the four-level decoder corresponding to the Encoder module splits its input along the channel dimension; the ConvGRU recurrent network in each block computes on the split features, and the remaining channels are merged with the result through short links. The four-level decoder performs multi-layer feature fusion, reduces the number of channels and obtains a high-resolution feature map, yielding feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution respectively; the input of each decoder is combined with the corresponding output of the downsampling process, and after convolution and normalization the ConvGRU recurrent network computes the output from the information of the previous frame and the current frame.
Specifically, each level of the decoder includes a 3×3 2D convolution + batch normalization + ReLU activation combination, a ConvGRU recurrent network and 2x bilinear interpolation upsampling. The input of each decoder is combined with the corresponding output of the downsampling process, similar to upsampling in the conventional U-Net structure, and after the 3×3 2D convolution + batch normalization + ReLU activation combination, the ConvGRU recurrent network computes the output using information from the previous frame and the current frame.
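A minimal sketch of a ConvGRU cell and of one decoder block of the kind described above (skip combination, 3×3 convolution + batch normalization + ReLU, ConvGRU recurrence on part of the channels, 2x bilinear upsampling) follows; the half-and-half channel split and an even number of output channels are assumptions made for illustration. The returned hidden state is passed back in when the next video frame is processed.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvGRUCell(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)    # update and reset gates
            self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)         # candidate hidden state

        def forward(self, x, h):
            z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
            h_new = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
            return (1 - z) * h + z * h_new                                      # fuses previous-frame and current-frame information

    class DecoderBlock(nn.Module):
        def __init__(self, skip_ch, in_ch, out_ch):                             # out_ch assumed even
            super().__init__()
            self.conv = nn.Sequential(                                          # 3x3 2D convolution + batch normalization + ReLU
                nn.Conv2d(skip_ch + in_ch, out_ch, 3, padding=1, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            self.gru = ConvGRUCell(out_ch // 2)                                 # recurrence runs on half of the channels only

        def forward(self, skip, x, h=None):
            x = self.conv(torch.cat([skip, x], dim=1))                          # combine with the downsampling-path output (U-Net style)
            a, b = x.split(x.shape[1] // 2, dim=1)                              # split along the channel dimension
            b = self.gru(b, h if h is not None else torch.zeros_like(b))        # temporal features from the ConvGRU
            x = torch.cat([a, b], dim=1)                                        # short link re-merges the untouched half
            return F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False), b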
Fig. 8 and 9 show a schematic view of a JPU module structure and a step diagram.
According to the JPU module shown in fig. 8 and fig. 9, the JPU module formulates the extraction of the high-resolution feature map as joint upsampling: given the corresponding low-resolution output (Image LR and the Decoder network module output) and the guidance of the high-resolution image (Image HR), it efficiently generates a high-resolution feature map from the three features of different scales obtained from the current video frame (Image HR), the downsampled current video frame (Image LR) and the Decoder network module. The JPU module comprises the following steps:
Step S41: performing feature fusion on the three features of different scales obtained from the current video frame (Image HR), the downsampled current video frame (Image LR) and the Decoder network module, and outputting a feature map;
Step S42: using separable convolution groups with different dilation rates to enlarge the receptive field and capture context information, outputting four groups of feature maps with unchanged resolution, and fusing the multi-scale context information by concatenation (Concatenate);
Step S43: generating an alpha mask with one channel by applying a 3×3 2D convolution to the fused multi-scale context information.
To reduce the computational complexity and the number of parameters of the convolution operations, separable convolution groups composed of dilated convolutions with different dilation rates and point-by-point convolutions are used in place of ordinary standard convolutions. By decoupling the channel correlation from the spatial correlation, a standard convolution is replaced by a 3×3 depthwise convolution followed by a 1×1 pointwise convolution.
Specifically, the feature fusion of the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and the output of the feature map, include the following steps: first, performing a 3×3 2D convolution to unify the number of channels of the three input features; second, performing an upsampling operation to restore them to the high-resolution feature scale; and finally outputting a feature map whose resolution is consistent with the current video frame.
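The following is a minimal sketch of a JPU-style head implementing steps S41 to S43 with the depthwise-separable dilated convolutions described above; the channel widths and the dilation rates (1, 2, 4, 8) are assumptions and are not fixed by the patent.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def separable_conv(in_ch, out_ch, dilation):
        # 3x3 dilated depthwise convolution followed by a 1x1 pointwise convolution
        return nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation, groups=in_ch, bias=False),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    class JPUHead(nn.Module):
        def __init__(self, in_chs=(3, 3, 16), mid_ch=32):
            super().__init__()
            self.unify = nn.ModuleList([nn.Conv2d(c, mid_ch, 3, padding=1) for c in in_chs])          # S41: unify channel counts
            self.dilated = nn.ModuleList([separable_conv(3 * mid_ch, mid_ch, d) for d in (1, 2, 4, 8)])  # S42: four dilation branches
            self.head = nn.Conv2d(4 * mid_ch, 1, 3, padding=1)                                        # S43: 1-channel alpha mask

        def forward(self, image_hr, image_lr, dec_feat):
            size = image_hr.shape[-2:]
            feats = [F.interpolate(conv(f), size=size, mode='bilinear', align_corners=False)          # upsample to frame resolution
                     for conv, f in zip(self.unify, (image_hr, image_lr, dec_feat))]
            fused = torch.cat(feats, dim=1)                                                           # fused feature map (S41 output)
            context = torch.cat([branch(fused) for branch in self.dilated], dim=1)                    # resolution unchanged, then concatenated
            return torch.sigmoid(self.head(context))                                                  # alpha mask at frame resolution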
Fig. 10 shows a figure segmentation effect diagram of a video figure segmentation algorithm based on a twin network under different scenes.
According to the portrait segmentation results in different scenes shown in fig. 10, it can be seen that portrait edges are segmented accurately in various complex environments such as low foreground-background contrast, complex backgrounds and complex subject shapes, so the portrait is well separated and the robustness is high. The invention can capture multi-level features such as edge features, shallow texture features, and temporal and spatial features, supplement the temporal, spatial and edge-structure information of the alpha mask of the current video frame, realize accurate prediction of the alpha mask, and thereby separate the portrait from the background.
The invention can perform high-precision video portrait segmentation on targets in complex scenes such as multiple targets, target occlusion, tiny targets and fast target motion, and each module is built with the PyTorch deep learning framework. The metric results and visual effects on the test data set show that the pre-trained twin-network-based video portrait segmentation model surpasses other current algorithms.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations can be made to the embodiments of the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A video portrait segmentation algorithm based on a twin network, characterized by adopting a twin network structure whose basic structure comprises a video frame acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module, and comprising the following steps:
Step S1: acquiring a current video frame image from the video to be segmented through the video frame acquisition module and preprocessing it to obtain a preprocessed current video frame image;
Step S2: separating the preprocessed video frame image into a three-channel RGB video frame image in the RGB color mode through the RGB separation module;
Step S3: inputting the three-channel RGB video frame image into the Encoder network module and extracting coarse-grained features of the three-channel RGB video frame at five scales with a MobileNetV3 network;
Step S4: connecting the Encoder network module and the Decoder network module through the SE module, inputting the coarse-grained features into the SE module, and performing feature recalibration at the channel level by learning the importance of each channel;
Step S5: obtaining features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, performing feature fusion, and capturing the edge features lost in downsampling, shallow texture features, and temporal and spatial features;
Step S6: obtaining features of three different scales from the current video frame, the downsampled current video frame and the Decoder network module through the JPU module, and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image.
2. The video portrait segmentation algorithm based on the twin network according to claim 1, wherein the acquiring, by the video frame acquisition module, of the current video frame image from the video to be segmented and its preprocessing to obtain the preprocessed current video frame image includes: Step S11: acquiring the current video frame image of the video to be segmented; Step S12: preprocessing the acquired current video frame image.
3. The video portrait segmentation algorithm based on the twin network according to claim 2, wherein the preprocessing of the current video frame comprises: Step S121: resizing the current video frame to a preset size, where the preset size is the input image size required by the twin network; Step S122: normalizing the pixels of the resized image; Step S123: adjusting the order of the color channels of the normalized image according to a preset order.
4. The video portrait segmentation algorithm based on the twin network according to claim 1, wherein the inputting of the three-channel RGB video frame image into the Encoder network module and the extraction of coarse-grained features of the three-channel RGB video frame at five scales with the MobileNetV3 network include adopting the lightweight network MobileNetV3-Large as the backbone, constructing a four-level encoder based on the twin network, and obtaining coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution through a downsampling layer and the four-level encoder.
5. The video portrait segmentation algorithm based on the twin network according to claim 4, wherein the downsampling layer performs 4x downsampling by bilinear interpolation to obtain a feature map at 1/4 of the original image resolution; the four-level encoder comprises a first encoder, a second encoder, a third encoder and a fourth encoder, each of which adopts multiple weight-sharing bottleneck structures; each encoder first uses a point-by-point convolution group, then a depthwise convolution group, then connects an SE module to learn channel weights, and finally transfers shallow features containing structural information to the deep features through short links.
6. The video portrait segmentation algorithm based on the twin network according to claim 1, wherein the connecting of the Encoder network module and the Decoder network module through the SE module inputs the coarse-grained features into the SE module and performs feature recalibration at the channel level by learning the importance of each channel, including: converting the obtained coarse-grained features into global features through the Squeeze operation, where global average pooling is used to obtain the global features; and performing the Excitation operation on the global features obtained by the Squeeze operation, learning the nonlinear relationships among the channels, obtaining the weights of the different channels, and recalibrating the features.
7. The video portrait segmentation algorithm based on the twin network according to claim 1, wherein the Decoder network module obtains features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network, performs feature fusion, captures the edge features lost in downsampling, shallow texture features, and temporal and spatial features, and gradually restores and enlarges the high-level semantic information through a four-level decoder corresponding to the Encoder module to obtain a high-resolution feature map.
8. The video portrait segmentation algorithm based on the twin network according to claim 7, wherein the four-level decoder is configured to perform multi-layer feature fusion, reduce the number of channels and obtain a high-resolution feature map, yielding feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution respectively; the input of each decoder is combined with the corresponding output of the downsampling process, and after convolution and normalization the ConvGRU recurrent network computes the output from the information of the previous frame and the current frame.
9. The video portrait segmentation algorithm based on the twin network according to claim 1, wherein the obtaining, by the JPU module, of features of three different scales from the current video frame, the downsampled current video frame and the Decoder network module, and the efficient generation of a high-resolution feature map given the corresponding low-resolution output and the high-resolution image, comprise the following steps: Step S41: performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and outputting a feature map; Step S42: using separable convolution groups with different dilation rates to enlarge the receptive field and capture context information, outputting four groups of feature maps with unchanged resolution, and merging the multi-scale context information; Step S43: generating an alpha mask with one channel by applying a 3×3 2D convolution to the fused multi-scale context information.
10. The video portrait segmentation algorithm based on the twin network according to claim 9, wherein the feature fusion of the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and the output of the feature map, include the following steps: first, performing a 3×3 2D convolution to unify the number of channels of the three input features; second, performing an upsampling operation to restore them to the high-resolution feature scale; and finally outputting a feature map whose resolution is consistent with the current video frame.
CN202210759308.9A 2022-06-30 2022-06-30 Video portrait segmentation algorithm based on twin network Active CN115100409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210759308.9A CN115100409B (en) 2022-06-30 2022-06-30 Video portrait segmentation algorithm based on twin network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210759308.9A CN115100409B (en) 2022-06-30 2022-06-30 Video portrait segmentation algorithm based on twin network

Publications (2)

Publication Number Publication Date
CN115100409A (en) 2022-09-23
CN115100409B (en) 2024-04-26

Family

ID=83295324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210759308.9A Active CN115100409B (en) 2022-06-30 2022-06-30 Video portrait segmentation algorithm based on twin network

Country Status (1)

Country Link
CN (1) CN115100409B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116205928B (en) * 2023-05-06 2023-07-18 南方医科大学珠江医院 Image segmentation processing method, device and equipment for laparoscopic surgery video and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning
WO2021088300A1 (en) * 2019-11-09 2021-05-14 北京工业大学 Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network
CN114299944A (en) * 2021-12-08 2022-04-08 天翼爱音乐文化科技有限公司 Video processing method, system, device and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021088300A1 (en) * 2019-11-09 2021-05-14 北京工业大学 Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning
CN114299944A (en) * 2021-12-08 2022-04-08 天翼爱音乐文化科技有限公司 Video processing method, system, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Image segmentation technology for mechanical parts based on intelligent vision; Hong Qing; Song Qiao; Yang Chentao; Zhang Pei; Chang Lianli; Machine Building & Automation; 2020-10-20 (05); full text *

Also Published As

Publication number Publication date
CN115100409A (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN110287849B (en) Lightweight depth network image target detection method suitable for raspberry pi
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN112163449B (en) Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN110717851A (en) Image processing method and device, neural network training method and storage medium
CN111461217B (en) Aerial image small target detection method based on feature fusion and up-sampling
CN109902809B (en) Auxiliary semantic segmentation model by using generated confrontation network
CN108416292B (en) Unmanned aerial vehicle aerial image road extraction method based on deep learning
CN113837938B (en) Super-resolution method for reconstructing potential image based on dynamic vision sensor
CN111861880A (en) Image super-fusion method based on regional information enhancement and block self-attention
CN111429466A (en) Space-based crowd counting and density estimation method based on multi-scale information fusion network
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN116343043B (en) Remote sensing image change detection method with multi-scale feature fusion function
CN116797787B (en) Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network
CN115424017B (en) Building inner and outer contour segmentation method, device and storage medium
CN115100409B (en) Video portrait segmentation algorithm based on twin network
CN116645598A (en) Remote sensing image semantic segmentation method based on channel attention feature fusion
CN114359297A (en) Attention pyramid-based multi-resolution semantic segmentation method and device
CN116630704A (en) Ground object classification network model based on attention enhancement and intensive multiscale
Schirrmacher et al. SR2: Super-resolution with structure-aware reconstruction
CN115661451A (en) Deep learning single-frame infrared small target high-resolution segmentation method
CN114565764A (en) Port panorama sensing system based on ship instance segmentation
CN113111736A (en) Multi-stage characteristic pyramid target detection method based on depth separable convolution and fusion PAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant