CN115100409B - Video portrait segmentation algorithm based on twin network - Google Patents

Video portrait segmentation algorithm based on twin network

Info

Publication number
CN115100409B
CN115100409B (application CN202210759308.9A)
Authority
CN
China
Prior art keywords
module
video frame
network
encoder
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210759308.9A
Other languages
Chinese (zh)
Other versions
CN115100409A (en)
Inventor
张笑钦
廖唐飞
赵丽
冯士杰
徐曰旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou University
Original Assignee
Wenzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou University filed Critical Wenzhou University
Priority to CN202210759308.9A priority Critical patent/CN115100409B/en
Publication of CN115100409A publication Critical patent/CN115100409A/en
Application granted granted Critical
Publication of CN115100409B publication Critical patent/CN115100409B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/54 Extraction of image or video features relating to texture
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a video portrait segmentation algorithm based on a twin network, and relates to the technical field of image processing. The algorithm adopts a twin network structure whose basic structure comprises a video frame acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module. The modules are built with the PyTorch deep learning framework, and the learned model predicts an accurate alpha mask for each frame of a video so that the portrait can be extracted from a given image or video, thereby achieving high-resolution video portrait segmentation in complex scenes.

Description

Video portrait segmentation algorithm based on twin network
Technical Field
The invention relates to the technical field of image processing, and in particular to a video portrait segmentation algorithm based on a twin network.
Background
In computer vision, image semantic segmentation is an important research topic that is widely applied in many fields. For example, foreground segmentation of images can be used to replace the background of a video, so that foreground persons are composited into different scenes to produce creative applications.
The purpose of portrait segmentation is to predict an accurate alpha mask that can be used to extract a person from a given image or video. It has a wide range of applications, such as photo editing and movie production. A video portrait segmentation algorithm aims to predict the alpha mask of each video frame in a complex scene and thereby separate the foreground from the background. Existing real-time high-resolution video portrait segmentation algorithms that have been deployed in practice can obtain high-quality predictions with the help of a green screen, but algorithms that do not use a green screen still have problems; for example, the data set requires trimaps, and trimaps are relatively expensive to obtain.
Therefore, a twin-network-based video portrait segmentation algorithm that solves the above problems is urgently needed.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a video portrait segmentation algorithm based on a twin network, which ensures high-precision segmentation of video portraits in environments with complex backgrounds, complex subject shapes and the like.
In order to achieve the above purpose, the invention designs and realizes a video portrait segmentation algorithm based on a twin network, in which a high-resolution alpha mask is obtained through weight sharing in the twin network, the capture of temporal and spatial features by a recurrent neural network, and joint upsampling. The technical scheme is specifically as follows: a video portrait segmentation algorithm based on a twin network adopts a twin network structure whose basic structure comprises a video frame acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module, and comprises the following steps:
Step S1: acquiring a current video frame image from a video to be segmented through the video frame acquisition module and preprocessing the current video frame image to obtain a preprocessed current frequency frame image;
step S2: the RGB separation module is used for separating the obtained preprocessed video frame image into three channel RGB video frame images in an RGB color mode;
Step S3: inputting three-channel RGB video frame images through the Encoder network module, and extracting multi-scale coarse granularity characteristics of five three-channel RGB video frame images by adopting a Mobilenet V network;
Step S4: connecting the Encoder network module and the Decoder network module through the SE module, and recalibrating the characteristics by learning the importance degree of each channel;
Step S5: obtaining features of different scales from the Encoder network module, the downsampling of the current video frame image and the ConvGRU cyclic neural network through the Decoder network module, carrying out feature fusion, capturing edge features lost in downsampling, shallow-layer cultural features, and time and space features, and obtaining a high-resolution feature map;
Step S6: three different scale features are obtained from the current video frame, the downsampling of the current video frame and the Decoder network module through the JPU module, and a high resolution feature map is effectively generated under the condition of corresponding low resolution output and high resolution images.
Further, the acquiring, by the video frame acquisition module, of the current video frame image from the video to be segmented and its preprocessing to obtain the preprocessed current video frame image includes: Step S11: acquiring the current video frame image of the video to be segmented; Step S12: preprocessing the acquired current video frame image.
Furthermore, the video frame acquisition module acquires the current video frame image from the video to be segmented and preprocesses it to obtain the preprocessed current video frame image.
Further, the inputting of the three-channel RGB video frame image into the Encoder network module and the extraction of coarse-grained features of the three-channel RGB video frame at five scales with the MobileNetV3 network include: adopting the lightweight network MobileNetV3-Large as the backbone, constructing a four-level encoder based on the twin network, and obtaining coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution through a downsampling layer and the four-level encoder.
Further, the Encoder network module comprises a downsampling layer and a four-level encoder, wherein the downsampling layer uses bilinear interpolation to perform 4x downsampling to obtain a feature map at 1/4 of the original image resolution; the four-level encoder comprises a first encoder, a second encoder, a third encoder and a fourth encoder, each of which adopts multiple weight-sharing bottleneck structures; each encoder first uses a point-by-point convolution group, then a depthwise convolution group, then connects an SE module to learn channel weights, and finally transfers shallow features containing structural information to the deep features through short links.
Further, the connecting of the Encoder network module and the Decoder network module through the SE module inputs the coarse-grained features into the SE module and performs feature recalibration at the channel level by learning the importance of each channel, including: converting the obtained coarse-grained features into global features through the Squeeze operation, where global average pooling is used to obtain the global features; and performing the Excitation operation on the global features obtained by the Squeeze operation, learning the nonlinear relationships among the channels, obtaining the weights of the different channels, and recalibrating the features.
Further, the Decoder network module obtains features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network, performs feature fusion, captures the edge features lost in downsampling, shallow texture features, and temporal and spatial features, and gradually restores and enlarges the high-level semantic information through a four-level decoder corresponding to the Encoder module to obtain a high-resolution feature map.
Furthermore, the four-level decoder is used for multi-layer feature fusion, reducing the number of channels and obtaining a high-resolution feature map, yielding feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution respectively; the input of each decoder is combined with the corresponding output of the downsampling process, and after convolution and normalization the ConvGRU recurrent network computes the output from the information of the previous frame and the current frame.
Further, the obtaining, by the JPU module, of features of three different scales from the current video frame, the downsampled current video frame and the Decoder network module, and the efficient generation of a high-resolution feature map given the corresponding low-resolution output and the high-resolution image, includes the following steps: Step S41: performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and outputting a feature map; Step S42: using separable convolution groups with different dilation rates to enlarge the receptive field and capture context information, outputting four groups of feature maps with unchanged resolution, and merging the multi-scale context information; Step S43: generating an alpha mask with one channel by applying a 3×3 2D convolution to the fused multi-scale context information.
Furthermore, the feature fusion of the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and the output of the feature map, include the following steps: first, performing a 3×3 2D convolution to unify the number of channels of the three input features; second, performing an upsampling operation to restore them to the high-resolution feature scale; and finally outputting a feature map whose resolution is consistent with the current video frame.
From the above technical solution, the advantages of the present invention are:
Compared with the prior art, the invention can capture multi-level features such as edge features, shallow texture features, and temporal and spatial features, supplement the temporal, spatial and edge-structure information of the alpha mask of the current video frame, and accurately predict the alpha mask, thereby separating the portrait from the background. The method obtains accurate segmentation of portrait edges in various complex environments such as low foreground-background contrast, complex backgrounds and complex subject shapes, and has strong robustness.
The invention can perform high-precision video portrait segmentation on targets in complex scenes such as multiple targets, target occlusion, tiny targets and fast target motion, and each module is built with the PyTorch deep learning framework. The metric results and visual effects on the test data set show that the pre-trained twin-network-based video portrait segmentation model surpasses other current algorithms.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application.
Fig. 1 is a flowchart of a video portrait segmentation algorithm based on a twin network according to the present invention.
Fig. 2 is a schematic diagram of the overall network structure of the twin-network-based video portrait segmentation algorithm according to the present invention.
Fig. 3 is a step diagram of obtaining a video frame image according to the present invention.
Fig. 4 is a step diagram of the present invention for preprocessing a current video frame.
Fig. 5 is a schematic diagram of the Bottleneck structure of the Encoder network module according to the present invention.
Fig. 6 is a detailed network structure diagram of the Encoder network module according to the present invention.
Fig. 7 is a schematic diagram of a Decoder module according to the present invention.
FIG. 8 is a schematic diagram of a JPU module according to the present invention.
FIG. 9 is a step diagram of a JPU module of the present invention.
Fig. 10 shows portrait segmentation results of the twin-network-based video portrait segmentation algorithm in different scenes.
Detailed Description
The present invention will be described in further detail with reference to the following embodiments and the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. The exemplary embodiments of the present invention and the descriptions thereof are used herein to explain the present invention, but are not intended to limit the invention.
The invention designs and realizes a video portrait segmentation algorithm based on a twin network, performs multi-scale feature fusion based on the twin network, guides the network to capture edge features, shallow texture features, and temporal and spatial features, and can segment video portraits quickly and accurately.
Fig. 1 and 2 show a flow chart and an overall network structure diagram of a video portrait segmentation algorithm based on a twin network.
According to the video portrait segmentation algorithm based on the twin network shown in fig. 1 and fig. 2, the purpose of the algorithm is to obtain a more accurate alpha mask by combining the semantic information, the temporal and spatial information, and the structural detail information of the video portrait. The algorithm adopts a twin network structure whose basic structure comprises a video frame acquisition module, an RGB separation module, an Encoder network module, an SE (Squeeze and Excitation) module, a Decoder network module and a JPU (Joint Pyramid Upsampling) module, and specifically comprises the following steps (a minimal code sketch of the overall data flow is given after step S6):
Step S1: acquiring a current video frame image from a video to be segmented through the video frame acquisition module and preprocessing the current video frame image to obtain a preprocessed current frequency frame image;
step S2: the RGB separation module is used for separating the obtained preprocessed video frame image into three channel RGB video frame images in an RGB color mode;
Step S3: inputting three-channel RGB video frame images through the Encoder network module, and extracting multi-scale coarse granularity characteristics of five three-channel RGB video frame images by adopting a Mobilenet V network;
Step S4: connecting the Encoder network module and the Decoder network module through the SE module, and recalibrating the characteristics by learning the importance degree of each channel;
Step S5: obtaining different scale features from the Encoder network module, the downsampling of the current video frame image and the ConvGRU cyclic neural network through the Decoder network module, carrying out feature fusion, and capturing edge features lost in downsampling, shallow-layer cultural features, time sequence and space features;
Step S6: three different scale features are obtained from the current video frame, the downsampling of the current video frame and the Decoder network module through the JPU module, and a high resolution feature map is effectively generated under the condition of corresponding low resolution output and high resolution images.
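As an illustrative aid only, the following is a minimal PyTorch sketch of how the data flows through steps S1 to S6; the class TinyPortraitNet and its single-convolution stand-ins are hypothetical placeholders, not the Encoder, SE, Decoder and JPU designs detailed later in this description.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyPortraitNet(nn.Module):
        # Hypothetical stand-ins for the Encoder, SE, Decoder and JPU modules,
        # wired in the order of steps S3 to S6.
        def __init__(self):
            super().__init__()
            self.encoder = nn.Conv2d(3, 16, 3, stride=4, padding=1)        # S3: coarse features (stand-in)
            self.se = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                    nn.Conv2d(16, 16, 1), nn.Sigmoid())    # S4: channel weights (stand-in)
            self.decoder = nn.Conv2d(16, 8, 3, padding=1)                  # S5: decoding (stand-in)
            self.head = nn.Conv2d(8 + 3, 1, 3, padding=1)                  # S6: JPU-style fusion head (stand-in)

        def forward(self, frame_hr):
            feats = self.encoder(frame_hr)                                 # S3
            feats = feats * self.se(feats)                                 # S4: recalibrate channels
            dec = self.decoder(feats)                                      # S5
            dec = F.interpolate(dec, size=frame_hr.shape[-2:],
                                mode='bilinear', align_corners=False)
            alpha = torch.sigmoid(self.head(torch.cat([dec, frame_hr], dim=1)))  # S6: 1-channel alpha mask
            return alpha

    frame = torch.rand(1, 3, 288, 512)   # a preprocessed three-channel RGB frame (steps S1 and S2)
    alpha = TinyPortraitNet()(frame)     # alpha mask, shape (1, 1, 288, 512)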
Fig. 3 and 4 show step diagrams for obtaining video frame images.
According to the video frame image acquisition shown in fig. 3, acquiring and preprocessing the current video frame image of the video to be segmented includes:
Step S11: acquiring a current video frame image of a video to be segmented;
Step S12: and preprocessing the acquired current video frame image.
Preprocessing the obtained current video frame according to the preprocessing of the obtained current video frame shown in fig. 4, including:
Step S121: resizing the current video frame to a preset size, where the preset size is the input image size required by the twin network;
Step S122: normalizing the pixels of the resized image;
Step S123: adjusting the order of the color channels of the normalized image according to a preset order.
Preprocessing converts the current video frame into an image form suited to the twin network structure, which facilitates image input and enables accurate segmentation.
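A minimal sketch of steps S121 to S123 is given below, assuming a hypothetical preset size of 512×288 pixels, normalization of pixel values to [0, 1] and a BGR-to-RGB channel reordering; the patent does not fix these particular values, and OpenCV and NumPy are used only as convenient tools.

    import cv2
    import numpy as np

    def preprocess_frame(frame_bgr, preset_size=(512, 288)):
        resized = cv2.resize(frame_bgr, preset_size, interpolation=cv2.INTER_LINEAR)  # S121: resize to the preset size
        normalized = resized.astype(np.float32) / 255.0                               # S122: normalize pixel values
        rgb = normalized[:, :, ::-1]                                                  # S123: reorder channels (BGR -> RGB)
        return np.ascontiguousarray(rgb.transpose(2, 0, 1))                           # CHW layout for the network input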
Fig. 5 and fig. 6 show the Bottleneck structure diagram and the detailed network structure diagram of the Encoder network module.
According to the Encoder network module shown in fig. 5 and fig. 6, the Encoder network module is a twin network structure; the lightweight network MobileNetV3-Large designed for semantic segmentation is selected as the backbone, and a four-level encoder is built based on the twin network.
The three-channel RGB video frame image is input into the Encoder network module, and coarse-grained features of the three-channel RGB video frame are extracted at five scales with the MobileNetV3 network: the lightweight network MobileNetV3-Large is adopted as the backbone, a four-level encoder is constructed based on the twin network, and coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution are obtained through a downsampling layer and the four-level encoder.
The downsampling layer uses bilinear interpolation to perform 4x downsampling, giving a feature map at 1/4 of the original resolution of the current video frame; the four-level encoder comprises a first-level encoder, a second-level encoder, a third-level encoder and a fourth-level encoder, each of which adopts multiple weight-sharing bottleneck structures; each encoder first uses a point-by-point convolution group, then a depthwise convolution group, then connects an SE module to learn channel weights, and finally transfers shallow features containing structural information to the deep features through short links.
The four-level encoder includes a first-level encoder Encoder_blk1, a second-level encoder Encoder_blk2, a third-level encoder Encoder_blk3 and a fourth-level encoder Encoder_blk4. These encoders use multiple weight-sharing bottleneck structures; the bottleneck is an inverted residual structure consisting of, first, a point-by-point convolution group (1×1 2D convolution + batch normalization + activation layer), second, a depthwise convolution group (3×3 2D convolution + batch normalization + activation layer), then a connected SE module that learns channel weights, and finally a shortcut connection that transfers shallow features containing structural detail information to the deep features.
Specifically, the first-level encoder Encoder_blk1 includes two bottleneck blocks and obtains a feature map at 1/8 of the original resolution, the second-level encoder Encoder_blk2 includes two bottleneck blocks and obtains a feature map at 1/16 of the original resolution, the third-level encoder Encoder_blk3 includes three bottleneck blocks and obtains a feature map at 1/32 of the original resolution, and the fourth-level encoder Encoder_blk4 includes six bottleneck blocks and obtains a feature map at 1/64 of the original resolution, thereby yielding coarse-grained features of the three-channel RGB video frame at five scales.
Connecting the Encoder network module and the Decoder network module through the SE module inputs the coarse-grained features into the SE module and performs feature recalibration at the channel level by learning the importance of each channel, including: converting the obtained coarse-grained features into global features through the Squeeze operation, where global average pooling is used to obtain the global features; and performing the Excitation operation on the global features obtained by the Squeeze operation, learning the nonlinear relationships among the channels, obtaining the weights of the different channels, and recalibrating the features. The SE module obtains a weight coefficient for each channel, so that the model can better discriminate the features of each channel.
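The bottleneck structure and the SE recalibration described above can be summarized in the following minimal PyTorch sketch; the expansion width, the reduction factor of 4 and the hard-swish activation are assumptions in the spirit of MobileNetV3 rather than the exact patented configuration, and reusing the same Bottleneck instances across branches would correspond to the weight sharing of the twin structure.

    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        def __init__(self, channels, reduction=4):
            super().__init__()
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                                         # Squeeze: global average pooling
                nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())     # Excitation: per-channel weights

        def forward(self, x):
            return x * self.gate(x)                                              # recalibrate the features channel-wise

    class Bottleneck(nn.Module):
        def __init__(self, in_ch, expand_ch, out_ch, stride=1):
            super().__init__()
            self.pointwise = nn.Sequential(                                      # point-by-point convolution group
                nn.Conv2d(in_ch, expand_ch, 1, bias=False),
                nn.BatchNorm2d(expand_ch), nn.Hardswish())
            self.depthwise = nn.Sequential(                                      # depthwise convolution group
                nn.Conv2d(expand_ch, expand_ch, 3, stride, 1, groups=expand_ch, bias=False),
                nn.BatchNorm2d(expand_ch), nn.Hardswish())
            self.se = SEBlock(expand_ch)                                         # SE module learns channel weights
            self.project = nn.Sequential(                                        # project back to the output width
                nn.Conv2d(expand_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))
            self.use_shortcut = stride == 1 and in_ch == out_ch

        def forward(self, x):
            y = self.project(self.se(self.depthwise(self.pointwise(x))))
            return x + y if self.use_shortcut else y                             # shortcut carries shallow structural detail

    blk = Bottleneck(16, 64, 16)
    y = blk(torch.rand(1, 16, 64, 64))    # same spatial size and channel count, recalibrated features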
Fig. 7 shows a detailed network architecture diagram of a Decoder network module.
According to the Decoder network module shown in fig. 7, the Decoder network module is a twin network in which weights are shared among the decoder blocks, forming a pseudo twin network together with the Encoder network module. The Decoder network module obtains features of four different scales from the current video frame, the Encoder module, the downsampled current video frame (Image LR) and the ConvGRU recurrent neural network, and performs feature fusion to capture the edge features lost in downsampling, shallow texture features, and temporal and spatial features; it gradually restores and enlarges the high-level semantic information through a four-level decoder corresponding to the Encoder module to obtain a high-resolution feature map.
To reduce the number of parameters and the amount of computation, the four-level decoder corresponding to the Encoder module splits its input along the channel dimension; the ConvGRU recurrent network in each block computes on the split features, and the remaining channels are merged with the result through short links. The four-level decoder performs multi-layer feature fusion, reduces the number of channels and obtains a high-resolution feature map, yielding feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution respectively; the input of each decoder is combined with the corresponding output of the downsampling process, and after convolution and normalization the ConvGRU recurrent network computes the output from the information of the previous frame and the current frame.
Specifically, each level of the decoder includes a 3×3 2D convolution + batch normalization + ReLU activation combination, a ConvGRU recurrent network and 2x bilinear interpolation upsampling. The input of each decoder is combined with the corresponding output of the downsampling process, similar to upsampling in the conventional U-Net structure, and after the 3×3 2D convolution + batch normalization + ReLU activation combination, the ConvGRU recurrent network computes the output using information from the previous frame and the current frame.
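A minimal sketch of a ConvGRU cell and of one decoder block of the kind described above (skip combination, 3×3 convolution + batch normalization + ReLU, ConvGRU recurrence on part of the channels, 2x bilinear upsampling) follows; the half-and-half channel split and an even number of output channels are assumptions made for illustration. The returned hidden state is passed back in when the next video frame is processed.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvGRUCell(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)    # update and reset gates
            self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)         # candidate hidden state

        def forward(self, x, h):
            z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
            h_new = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
            return (1 - z) * h + z * h_new                                      # fuses previous-frame and current-frame information

    class DecoderBlock(nn.Module):
        def __init__(self, skip_ch, in_ch, out_ch):                             # out_ch assumed even
            super().__init__()
            self.conv = nn.Sequential(                                          # 3x3 2D convolution + batch normalization + ReLU
                nn.Conv2d(skip_ch + in_ch, out_ch, 3, padding=1, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            self.gru = ConvGRUCell(out_ch // 2)                                 # recurrence runs on half of the channels only

        def forward(self, skip, x, h=None):
            x = self.conv(torch.cat([skip, x], dim=1))                          # combine with the downsampling-path output (U-Net style)
            a, b = x.split(x.shape[1] // 2, dim=1)                              # split along the channel dimension
            b = self.gru(b, h if h is not None else torch.zeros_like(b))        # temporal features from the ConvGRU
            x = torch.cat([a, b], dim=1)                                        # short link re-merges the untouched half
            return F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False), b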
Fig. 8 and 9 show a schematic view of a JPU module structure and a step diagram.
According to the JPU module shown in fig. 8 and fig. 9, the JPU module formulates the extraction of the high-resolution feature map as joint upsampling: given the corresponding low-resolution output (Image LR and the Decoder network module output) and the guidance of the high-resolution image (Image HR), it efficiently generates a high-resolution feature map from the three features of different scales obtained from the current video frame (Image HR), the downsampled current video frame (Image LR) and the Decoder network module. The JPU module comprises the following steps:
Step S41: performing feature fusion on the three features of different scales obtained from the current video frame (Image HR), the downsampled current video frame (Image LR) and the Decoder network module, and outputting a feature map;
Step S42: using separable convolution groups with different dilation rates to enlarge the receptive field and capture context information, outputting four groups of feature maps with unchanged resolution, and fusing the multi-scale context information by concatenation (Concatenate);
Step S43: generating an alpha mask with one channel by applying a 3×3 2D convolution to the fused multi-scale context information.
To reduce the computational complexity and the number of parameters of the convolution operations, separable convolution groups composed of dilated convolutions with different dilation rates and point-by-point convolutions are used in place of ordinary standard convolutions. By decoupling the channel correlation from the spatial correlation, a standard convolution is replaced by a 3×3 depthwise convolution followed by a 1×1 pointwise convolution.
Specifically, the feature fusion of the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and the output of the feature map, include the following steps: first, performing a 3×3 2D convolution to unify the number of channels of the three input features; second, performing an upsampling operation to restore them to the high-resolution feature scale; and finally outputting a feature map whose resolution is consistent with the current video frame.
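The following is a minimal sketch of a JPU-style head implementing steps S41 to S43 with the depthwise-separable dilated convolutions described above; the channel widths and the dilation rates (1, 2, 4, 8) are assumptions and are not fixed by the patent.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def separable_conv(in_ch, out_ch, dilation):
        # 3x3 dilated depthwise convolution followed by a 1x1 pointwise convolution
        return nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation, groups=in_ch, bias=False),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    class JPUHead(nn.Module):
        def __init__(self, in_chs=(3, 3, 16), mid_ch=32):
            super().__init__()
            self.unify = nn.ModuleList([nn.Conv2d(c, mid_ch, 3, padding=1) for c in in_chs])          # S41: unify channel counts
            self.dilated = nn.ModuleList([separable_conv(3 * mid_ch, mid_ch, d) for d in (1, 2, 4, 8)])  # S42: four dilation branches
            self.head = nn.Conv2d(4 * mid_ch, 1, 3, padding=1)                                        # S43: 1-channel alpha mask

        def forward(self, image_hr, image_lr, dec_feat):
            size = image_hr.shape[-2:]
            feats = [F.interpolate(conv(f), size=size, mode='bilinear', align_corners=False)          # upsample to frame resolution
                     for conv, f in zip(self.unify, (image_hr, image_lr, dec_feat))]
            fused = torch.cat(feats, dim=1)                                                           # fused feature map (S41 output)
            context = torch.cat([branch(fused) for branch in self.dilated], dim=1)                    # resolution unchanged, then concatenated
            return torch.sigmoid(self.head(context))                                                  # alpha mask at frame resolution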
Fig. 10 shows a figure segmentation effect diagram of a video figure segmentation algorithm based on a twin network under different scenes.
According to the portrait segmentation results in different scenes shown in fig. 10, it can be seen that portrait edges are segmented accurately in various complex environments such as low foreground-background contrast, complex backgrounds and complex subject shapes, so the portrait is well separated and the robustness is high. The invention can capture multi-level features such as edge features, shallow texture features, and temporal and spatial features, supplement the temporal, spatial and edge-structure information of the alpha mask of the current video frame, realize accurate prediction of the alpha mask, and thereby separate the portrait from the background.
The invention can perform high-precision video portrait segmentation on targets in complex scenes such as multiple targets, target occlusion, tiny targets and fast target motion, and each module is built with the PyTorch deep learning framework. The metric results and visual effects on the test data set show that the pre-trained twin-network-based video portrait segmentation model surpasses other current algorithms.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations can be made to the embodiments of the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A video portrait segmentation algorithm based on a twin network, characterized by adopting a twin network structure whose basic structure comprises a video frame acquisition module, an RGB separation module, an Encoder network module, an SE module, a Decoder network module and a JPU module, and comprising the following steps:
Step S1: acquiring a current video frame image from the video to be segmented through the video frame acquisition module and preprocessing it to obtain a preprocessed current video frame image;
Step S2: separating the preprocessed video frame image into a three-channel RGB video frame image in the RGB color mode through the RGB separation module;
Step S3: inputting the three-channel RGB video frame image into the Encoder network module and extracting coarse-grained features of the three-channel RGB video frame at five scales with a MobileNetV3 network;
Step S4: connecting the Encoder network module and the Decoder network module through the SE module, inputting the coarse-grained features into the SE module, and performing feature recalibration at the channel level by learning the importance of each channel;
Step S5: obtaining features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network through the Decoder network module, performing feature fusion, and capturing the edge features lost in downsampling, shallow texture features, and temporal and spatial features;
Step S6: obtaining features of three different scales from the current video frame, the downsampled current video frame and the Decoder network module through the JPU module, and efficiently generating a high-resolution feature map given the corresponding low-resolution output and the high-resolution image.
2. The video portrait segmentation algorithm based on the twin network according to claim 1, wherein the acquiring, by the video frame acquisition module, of the current video frame image from the video to be segmented and its preprocessing to obtain the preprocessed current video frame image includes: Step S11: acquiring the current video frame image of the video to be segmented; Step S12: preprocessing the acquired current video frame image.
3. The video portrait segmentation algorithm based on the twin network according to claim 2, wherein the preprocessing of the current video frame comprises: Step S121: resizing the current video frame to a preset size, where the preset size is the input image size required by the twin network; Step S122: normalizing the pixels of the resized image; Step S123: adjusting the order of the color channels of the normalized image according to a preset order.
4. The video portrait segmentation algorithm based on the twin network according to claim 1, wherein the inputting of the three-channel RGB video frame image into the Encoder network module and the extraction of coarse-grained features of the three-channel RGB video frame at five scales with the MobileNetV3 network include adopting the lightweight network MobileNetV3-Large as the backbone, constructing a four-level encoder based on the twin network, and obtaining coarse-grained feature maps at 1/4, 1/8, 1/16, 1/32 and 1/64 of the three-channel RGB video frame resolution through a downsampling layer and the four-level encoder.
5. The video portrait segmentation algorithm based on the twin network according to claim 4, wherein the downsampling layer performs 4x downsampling by bilinear interpolation to obtain a feature map at 1/4 of the original image resolution; the four-level encoder comprises a first encoder, a second encoder, a third encoder and a fourth encoder, each of which adopts multiple weight-sharing bottleneck structures; each encoder first uses a point-by-point convolution group, then a depthwise convolution group, then connects an SE module to learn channel weights, and finally transfers shallow features containing structural information to the deep features through short links.
6. The video portrait segmentation algorithm based on the twin network according to claim 1, wherein the connecting of the Encoder network module and the Decoder network module through the SE module inputs the coarse-grained features into the SE module and performs feature recalibration at the channel level by learning the importance of each channel, including: converting the obtained coarse-grained features into global features through the Squeeze operation, where global average pooling is used to obtain the global features; and performing the Excitation operation on the global features obtained by the Squeeze operation, learning the nonlinear relationships among the channels, obtaining the weights of the different channels, and recalibrating the features.
7. The video portrait segmentation algorithm based on the twin network according to claim 1, wherein the Decoder network module obtains features of different scales from the Encoder network module, the downsampled current video frame image and the ConvGRU recurrent neural network, performs feature fusion, captures the edge features lost in downsampling, shallow texture features, and temporal and spatial features, and gradually restores and enlarges the high-level semantic information through a four-level decoder corresponding to the Encoder module to obtain a high-resolution feature map.
8. The video portrait segmentation algorithm based on the twin network according to claim 7, wherein the four-level decoder is configured to perform multi-layer feature fusion, reduce the number of channels and obtain a high-resolution feature map, yielding feature maps at 1/32, 1/16, 1/8 and 1/4 of the current video frame resolution respectively; the input of each decoder is combined with the corresponding output of the downsampling process, and after convolution and normalization the ConvGRU recurrent network computes the output from the information of the previous frame and the current frame.
9. The video portrait segmentation algorithm based on the twin network according to claim 1, wherein the obtaining, by the JPU module, of features of three different scales from the current video frame, the downsampled current video frame and the Decoder network module, and the efficient generation of a high-resolution feature map given the corresponding low-resolution output and the high-resolution image, comprise the following steps: Step S41: performing feature fusion on the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and outputting a feature map; Step S42: using separable convolution groups with different dilation rates to enlarge the receptive field and capture context information, outputting four groups of feature maps with unchanged resolution, and merging the multi-scale context information; Step S43: generating an alpha mask with one channel by applying a 3×3 2D convolution to the fused multi-scale context information.
10. The video portrait segmentation algorithm based on the twin network according to claim 9, wherein the feature fusion of the three features of different scales obtained from the current video frame, the downsampled current video frame and the Decoder network module, and the output of the feature map, include the following steps: first, performing a 3×3 2D convolution to unify the number of channels of the three input features; second, performing an upsampling operation to restore them to the high-resolution feature scale; and finally outputting a feature map whose resolution is consistent with the current video frame.
CN202210759308.9A 2022-06-30 2022-06-30 Video portrait segmentation algorithm based on twin network Active CN115100409B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210759308.9A CN115100409B (en) 2022-06-30 2022-06-30 Video portrait segmentation algorithm based on twin network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210759308.9A CN115100409B (en) 2022-06-30 2022-06-30 Video portrait segmentation algorithm based on twin network

Publications (2)

Publication Number Publication Date
CN115100409A (en) 2022-09-23
CN115100409B (en) 2024-04-26

Family

ID=83295324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210759308.9A Active CN115100409B (en) 2022-06-30 2022-06-30 Video portrait segmentation algorithm based on twin network

Country Status (1)

Country Link
CN (1) CN115100409B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116205928B (en) * 2023-05-06 2023-07-18 南方医科大学珠江医院 Image segmentation processing method, device and equipment for laparoscopic surgery video and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning
WO2021088300A1 (en) * 2019-11-09 2021-05-14 北京工业大学 Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network
CN114299944A (en) * 2021-12-08 2022-04-08 天翼爱音乐文化科技有限公司 Video processing method, system, device and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021088300A1 (en) * 2019-11-09 2021-05-14 北京工业大学 Rgb-d multi-mode fusion personnel detection method based on asymmetric double-stream network
CN112287940A (en) * 2020-10-30 2021-01-29 西安工程大学 Semantic segmentation method of attention mechanism based on deep learning
CN114299944A (en) * 2021-12-08 2022-04-08 天翼爱音乐文化科技有限公司 Video processing method, system, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Image segmentation technology for mechanical parts based on intelligent vision; Hong Qing; Song Qiao; Yang Chentao; Zhang Pei; Chang Lianli; Machine Building & Automation; 2020-10-20 (05); full text *

Also Published As

Publication number Publication date
CN115100409A (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN110287849B (en) Lightweight depth network image target detection method suitable for raspberry pi
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN112163449B (en) Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN113033570B (en) Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion
CN110717851A (en) Image processing method and device, neural network training method and storage medium
CN111461217B (en) Aerial image small target detection method based on feature fusion and up-sampling
CN109902809B (en) Auxiliary semantic segmentation model by using generated confrontation network
CN108416292B (en) Unmanned aerial vehicle aerial image road extraction method based on deep learning
CN113837938B (en) Super-resolution method for reconstructing potential image based on dynamic vision sensor
CN111861880A (en) Image super-fusion method based on regional information enhancement and block self-attention
CN111429466A (en) Space-based crowd counting and density estimation method based on multi-scale information fusion network
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN116343043B (en) Remote sensing image change detection method with multi-scale feature fusion function
CN116797787B (en) Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network
CN115424017B (en) Building inner and outer contour segmentation method, device and storage medium
CN115100409B (en) Video portrait segmentation algorithm based on twin network
CN116645598A (en) Remote sensing image semantic segmentation method based on channel attention feature fusion
CN114359297A (en) Attention pyramid-based multi-resolution semantic segmentation method and device
CN116630704A (en) Ground object classification network model based on attention enhancement and intensive multiscale
Schirrmacher et al. SR2: Super-resolution with structure-aware reconstruction
CN115661451A (en) Deep learning single-frame infrared small target high-resolution segmentation method
CN114565764A (en) Port panorama sensing system based on ship instance segmentation
CN113111736A (en) Multi-stage characteristic pyramid target detection method based on depth separable convolution and fusion PAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant